ModelHub XC c7a4c6bbbd project initialization; model provided by the ModelHub XC community
Model: Nalandadata/nalanda-qwen-7b-grpo
Source: Original Platform
2026-04-28 03:59:13 +08:00

license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags: qwen2, grpo, reinforcement-learning, jee, neet, stem, education, india, fine-tuned
datasets: custom
language: en
pipeline_tag: text-generation
model-index:
  name: nalanda-qwen2.5-7b-grpo
  results:
    task:
      type: text-generation
      name: JEE/NEET Exam MCQ
    metrics:
      Physics Accuracy (accuracy): 65.0
      Chemistry Accuracy (accuracy): 71.5
      Mathematics Accuracy (accuracy): 64.5
      Biology Accuracy (accuracy): 77.5
      Overall JEE/NEET Accuracy (accuracy): 69.6

Nalanda Qwen 2.5 7B GRPO

A fine-tuned version of Qwen/Qwen2.5-7B-Instruct specialized for Indian competitive exam questions (JEE Mains, JEE Advanced, NEET UG) across Physics, Chemistry, Mathematics, and Biology.

Training Methodology

This model was trained using a two-stage pipeline inspired by Yoshihara et al. (2025, ICML) and DeepSeekMath (Shao et al., 2024):

Stage 1: Light Supervised Fine-Tuning (SFT)

  • 200 training steps (~5% of dataset)
  • Data mixing: 70% JEE/NEET questions + 30% general instruction data (SlimOrca)
  • LoRA rank 8, attention layers only, learning rate 3e-5
  • NEFTune noise (alpha=5) for improved generalization
  • Purpose: Introduce domain vocabulary and question formats without overwriting general knowledge
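The Stage 1 recipe above can be collected into a single config sketch. This is purely illustrative: field names such as `data_mix` and the Qwen2 attention-module names are my assumptions, not a published training script; only the hyperparameter values come from the bullets.

```python
# Stage 1 (light SFT) hyperparameters from the card, gathered as a plain dict.
# A trainer such as TRL's SFTTrainer would consume equivalent fields.
sft_config = {
    "max_steps": 200,                                   # ~5% of the dataset
    "data_mix": {"jee_neet": 0.70, "slimorca": 0.30},   # domain + general data
    "lora": {
        "r": 8,                                         # LoRA rank 8
        # attention layers only; module names assumed for Qwen2-style models
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
    "learning_rate": 3e-5,
    "neftune_noise_alpha": 5,                           # NEFTune noise
}
```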

Stage 2: Group Relative Policy Optimization (GRPO)

  • 600 training steps with 8 model generations per prompt
  • 10,000 MCQs with verified correct answers (balanced: 2,500 per subject)
  • Three reward functions:
    • Correctness (max 2.0): High reward for correct answer
    • Format compliance (max 1.0): Reward for structured <answer> tags
    • Reasoning quality (max 1.0): Reward for showing work (equations, step indicators)
  • Learning rate 5e-6
  • Purpose: Teach the model to arrive at correct answers through its own reasoning
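The three reward functions might look roughly like this. This is a hypothetical sketch, since the actual reward code is not published; the regexes assume the `<answer>` tag format and A–D option letters described above.

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Max 2.0: full reward when the <answer> tag contains the correct option."""
    m = re.search(r"<answer>\s*\(?([A-D])\)?\s*</answer>", completion)
    return 2.0 if m and m.group(1) == gold else 0.0

def format_reward(completion: str) -> float:
    """Max 1.0: reward for exactly one well-formed <answer>...</answer> block."""
    return 1.0 if len(re.findall(r"<answer>.*?</answer>", completion, re.S)) == 1 else 0.0

def reasoning_reward(completion: str) -> float:
    """Max 1.0: partial credit for visible work (equations, step indicators)."""
    score = 0.0
    if re.search(r"[=+*/^]", completion):           # equation-like symbols
        score += 0.5
    if re.search(r"(?i)\bstep\s*\d", completion):   # "Step 1", "step 2", ...
        score += 0.5
    return score
```

In GRPO each of the 8 generations per prompt would be scored with the sum of these rewards, and the group-relative advantage drives the policy update.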

Why GRPO Instead of SFT?

Standard SFT on the same 126K questions caused catastrophic forgetting (-15pp accuracy drop). SFT forces the model to mimic specific solution patterns, destroying reasoning ability. GRPO rewards the model for arriving at correct answers through its own reasoning path, which preserves and enhances general capabilities.

Results

JEE/NEET Exam Accuracy (800 held-out MCQs)

Subject Qwen 2.5 7B Baseline This Model Improvement
Physics 51.0% 65.0% +14.0pp
Chemistry 61.5% 71.5% +10.0pp
Mathematics 56.0% 64.5% +8.5pp
Biology 73.5% 77.5% +4.0pp
Overall 60.5% 69.6% +9.1pp

Public Benchmark Preservation

Benchmark Baseline This Model Delta
GSM8K 94.7% 96.0% +1.3pp
ARC-Challenge 90.0% 90.0% 0.0pp
MMLU-Physics 81.1% 83.8% +2.7pp
MMLU-Chemistry 62.0% 68.0% +6.0pp

General reasoning is fully preserved: the fine-tuned model matches or exceeds the baseline on every benchmark.

Training Data

Trained on 116,831 expert-curated JEE/NEET exam questions from Nalanda Data. The dataset covers:

  • JEE Mains & JEE Advanced (Physics, Chemistry, Mathematics)
  • NEET UG (Physics, Chemistry, Biology)
  • Each question includes: question text, four options, verified correct answer, step-by-step solution
  • Questions contain LaTeX mathematical notation
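One record of this dataset can be pictured as follows. The field names are illustrative assumptions, not the dataset's real schema, and the sample question is my own.

```python
# Illustrative shape of one MCQ record (field names and question are assumed,
# not taken from the actual Nalanda Data schema).
sample_question = {
    "question": r"A body moves with velocity $v = 2t$ m/s. "
                r"Its displacement over the first 3 s is:",
    "options": {"A": "3 m", "B": "6 m", "C": "9 m", "D": "18 m"},
    "answer": "C",  # verified correct option: s = integral of 2t from 0 to 3 = 9 m
    "solution": r"$s = \int_0^3 2t\,dt = [t^2]_0^3 = 9\ \mathrm{m}$",
    "exam": "JEE Mains",
    "subject": "Physics",
}
```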

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")

messages = [
    {"role": "system", "content": "You are an expert at solving JEE and NEET exam questions. Think step by step, then state your final answer."},
    {"role": "user", "content": "A particle moves along the x-axis with velocity v = 3t^2 - 6t + 2 m/s. Find the displacement in the first 3 seconds.\n\n(A) 2 m\n(B) 3 m\n(C) 5 m\n(D) 8 m"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Or with vLLM:

from vllm import LLM, SamplingParams

llm = LLM(model="Nalandadata/nalanda-qwen-7b-grpo")
sampling = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=512)

# Build `prompt` with the same chat template as above
# (tokenizer.apply_chat_template with add_generation_prompt=True)
output = llm.generate([prompt], sampling)
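Since GRPO rewards the model for wrapping its final choice in `<answer>` tags, downstream code typically needs to parse them out of the generation. A minimal extractor, my own sketch assuming the tag format described in the reward section:

```python
import re

def extract_answer(generation: str):
    """Return the option letter inside the model's <answer> tag, or None."""
    m = re.search(r"<answer>\s*\(?([A-D])\)?\s*</answer>", generation)
    return m.group(1) if m else None
```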

Training Infrastructure

  • Platform: Modal serverless GPU cloud
  • Training GPU: NVIDIA A10G (24GB)
  • Evaluation GPU: NVIDIA A100-40GB
  • Total compute cost: ~$47 USD across all experiments
  • Quantization: 4-bit QLoRA during training, saved as merged 16-bit for inference
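A quick back-of-the-envelope check of why this hardware suffices: weight memory is roughly parameter count times bits per weight. This ignores activations, optimizer state, and the LoRA adapters; it only shows why 4-bit base weights fit comfortably on a 24 GB A10G while merged 16-bit weights still fit for inference.

```python
# Approximate weight memory for a 7B-parameter model: bytes = params * bits / 8.
PARAMS = 7e9

def weight_gib(params: float, bits: int) -> float:
    """Weight memory in GiB for the given per-weight precision."""
    return params * bits / 8 / 2**30

print(f"16-bit merged weights: ~{weight_gib(PARAMS, 16):.1f} GiB")  # ~13.0 GiB
print(f"4-bit QLoRA base:      ~{weight_gib(PARAMS, 4):.1f} GiB")   # ~3.3 GiB
```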

Limitations

  • Mathematics accuracy improvement (+8.5pp) is lower than Physics/Chemistry, likely because math reasoning requires deeper structural changes
  • Model was trained on Indian competitive exam format; performance on non-MCQ or non-Indian-curriculum questions may vary
  • The model uses Qwen 2.5's chat template — ensure you apply it correctly for best results

Citation

If you use this model, please cite:

@misc{nalanda-qwen-grpo-2026,
  title={Nalanda Qwen 2.5 7B GRPO: Domain Data Drives LLM Fine-Tuning Performance},
  author={Nalanda Data},
  year={2026},
  url={https://huggingface.co/Nalandadata/nalanda-qwen-7b-grpo}
}

License

This model is released under the Apache 2.0 license, consistent with the base Qwen 2.5 model license.
