---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- qwen2
- grpo
- reinforcement-learning
- jee
- neet
- stem
- education
- india
- fine-tuned
datasets:
- custom
language:
- en
pipeline_tag: text-generation
model-index:
- name: nalanda-qwen2.5-7b-grpo
  results:
  - task:
      type: text-generation
      name: JEE/NEET Exam MCQ
    metrics:
    - name: Physics Accuracy
      type: accuracy
      value: 65.0
    - name: Chemistry Accuracy
      type: accuracy
      value: 71.5
    - name: Mathematics Accuracy
      type: accuracy
      value: 64.5
    - name: Biology Accuracy
      type: accuracy
      value: 77.5
    - name: Overall JEE/NEET Accuracy
      type: accuracy
      value: 69.6
---

# Nalanda Qwen 2.5 7B GRPO

A fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) specialized for Indian competitive exam questions (JEE Mains, JEE Advanced, NEET UG) across Physics, Chemistry, Mathematics, and Biology.

## Training Methodology

This model was trained using a **two-stage pipeline** inspired by [Yoshihara et al. (2025, ICML)](https://arxiv.org/abs/2503.04465) and [DeepSeekMath (Shao et al., 2024)](https://arxiv.org/abs/2402.03300):

### Stage 1: Light Supervised Fine-Tuning (SFT)

- **200 training steps** (~5% of dataset)
- **Data mixing**: 70% JEE/NEET questions + 30% general instruction data ([SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup))
- LoRA rank 8, attention layers only, learning rate 3e-5
- NEFTune noise (alpha=5) for improved generalization
- Purpose: Introduce domain vocabulary and question formats without overwriting general knowledge
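
The 70/30 data-mixing step above can be sketched as follows. The function name and record handling are illustrative assumptions, not the actual training code:

```python
import random

def mix_training_data(domain_examples, general_examples,
                      domain_frac=0.7, total=10_000, seed=42):
    """Sample a mixture of domain (JEE/NEET) and general (SlimOrca) examples.

    domain_frac=0.7 gives the 70/30 split used in Stage 1.
    """
    rng = random.Random(seed)
    n_domain = int(total * domain_frac)
    n_general = total - n_domain
    mixed = rng.sample(domain_examples, n_domain) + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)  # interleave so every batch sees both distributions
    return mixed
```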

### Stage 2: Group Relative Policy Optimization (GRPO)

- **600 training steps** with 8 model generations per prompt
- **10,000 MCQs** with verified correct answers (balanced: 2,500 per subject)
- Three reward functions:
  - **Correctness** (max 2.0): High reward for correct answer
  - **Format compliance** (max 1.0): Reward for structured `<answer>` tags
  - **Reasoning quality** (max 1.0): Reward for showing work (equations, step indicators)
- Learning rate 5e-6
- Purpose: Teach the model to arrive at correct answers through its own reasoning
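
A minimal sketch of what the three reward functions could look like. The specific heuristics (regexes, tag grammar, step markers) are assumptions based on the descriptions above, not the released training code:

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """Max 2.0: reward the correct option letter inside <answer> tags."""
    m = re.search(r"<answer>\s*([A-D])\s*</answer>", completion)
    return 2.0 if m and m.group(1) == gold else 0.0

def format_reward(completion: str) -> float:
    """Max 1.0: reward well-formed <answer>...</answer> tags."""
    return 1.0 if re.search(r"<answer>\s*[A-D]\s*</answer>", completion) else 0.0

def reasoning_reward(completion: str) -> float:
    """Max 1.0: reward visible work (equations, step indicators)."""
    score = 0.0
    if re.search(r"[=+\-*/^]", completion):  # equation-like content
        score += 0.5
    if re.search(r"(?i)\bstep\s*\d|\btherefore\b|\bthus\b", completion):  # step markers
        score += 0.5
    return score

def total_reward(completion: str, gold: str) -> float:
    return (correctness_reward(completion, gold)
            + format_reward(completion)
            + reasoning_reward(completion))
```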

### Why GRPO Instead of SFT?

Standard SFT on the same 126K questions caused **catastrophic forgetting** (-15pp accuracy drop). SFT forces the model to mimic specific solution patterns, destroying reasoning ability. GRPO rewards the model for arriving at correct answers through its own reasoning path, which preserves and enhances general capabilities.
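
GRPO's core mechanic is group-relative advantage: the 8 generations per prompt are each scored, and every completion's advantage is its reward normalized against the group's mean and standard deviation. This sketch follows the standard formulation from Shao et al. (2024), not this repo's training code:

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one prompt's generation group.

    Completions that beat their siblings on the same prompt get positive
    advantage; the policy update then makes those completions more likely.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```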

## Results

### JEE/NEET Exam Accuracy (800 held-out MCQs)

| Subject | Qwen 2.5 7B Baseline | **This Model** | **Improvement** |
|---------|:--------------------:|:--------------:|:---------------:|
| Physics | 51.0% | **65.0%** | **+14.0pp** |
| Chemistry | 61.5% | **71.5%** | **+10.0pp** |
| Mathematics | 56.0% | **64.5%** | **+8.5pp** |
| Biology | 73.5% | **77.5%** | **+4.0pp** |
| **Overall** | **60.5%** | **69.6%** | **+9.1pp** |

### Public Benchmark Preservation

| Benchmark | Baseline | This Model | Delta |
|-----------|:--------:|:----------:|:-----:|
| GSM8K | 94.7% | 96.0% | +1.3pp |
| ARC-Challenge | 90.0% | 90.0% | 0.0pp |
| MMLU-Physics | 81.1% | 83.8% | +2.7pp |
| MMLU-Chemistry | 62.0% | 68.0% | +6.0pp |

General reasoning is fully preserved: the fine-tuned model matches or exceeds the baseline on every benchmark.

## Training Data

Trained on **116,831 expert-curated JEE/NEET exam questions** from [Nalanda Data](https://nalandadata.com). The dataset covers:

- JEE Mains & JEE Advanced (Physics, Chemistry, Mathematics)
- NEET UG (Physics, Chemistry, Biology)
- Each question includes: question text, four options, verified correct answer, step-by-step solution
- Questions contain LaTeX mathematical notation
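
For illustration, a record with the fields listed above might look like the following. The field names and this particular question are hypothetical, not taken from the dataset itself:

```python
import json

# Hypothetical record layout; field names are assumptions, not the dataset's schema.
record = json.loads(r"""
{
  "exam": "NEET UG",
  "subject": "Chemistry",
  "question": "The oxidation state of Mn in $KMnO_4$ is:",
  "options": {"A": "+5", "B": "+6", "C": "+7", "D": "+4"},
  "answer": "C",
  "solution": "K is +1 and each O is $-2$, so $1 + x + 4(-2) = 0 \\Rightarrow x = +7$."
}
""")
```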

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Nalandadata/nalanda-qwen-7b-grpo", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")

messages = [
    {"role": "system", "content": "You are an expert at solving JEE and NEET exam questions. Think step by step, then state your final answer."},
    {"role": "user", "content": "A particle moves along the x-axis with velocity v = 3t^2 - 6t + 2 m/s. Find the displacement in the first 3 seconds.\n\n(A) 2 m\n(B) 4 m\n(C) 6 m\n(D) 8 m"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With vLLM (recommended for production)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

llm = LLM(model="Nalandadata/nalanda-qwen-7b-grpo")
sampling = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=512)

# Build the prompt with the same chat template and messages as above
tokenizer = AutoTokenizer.from_pretrained("Nalandadata/nalanda-qwen-7b-grpo")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```

## Training Infrastructure

- **Platform**: Modal serverless GPU cloud
- **Training GPU**: NVIDIA A10G (24GB)
- **Evaluation GPU**: NVIDIA A100-40GB
- **Total compute cost**: ~$47 USD across all experiments
- **Quantization**: 4-bit QLoRA during training, saved as merged 16-bit for inference

## Limitations

- Mathematics accuracy improvement (+8.5pp) is lower than Physics/Chemistry, likely because math reasoning requires deeper structural changes
- Model was trained on Indian competitive exam format; performance on non-MCQ or non-Indian-curriculum questions may vary
- The model uses Qwen 2.5's chat template; ensure you apply it correctly for best results

## Citation

If you use this model, please cite:

```
@misc{nalanda-qwen-grpo-2026,
  title={Nalanda Qwen 2.5 7B GRPO: Domain Data Drives LLM Fine-Tuning Performance},
  author={Nalanda Data},
  year={2026},
  url={https://huggingface.co/Nalandadata/nalanda-qwen-7b-grpo}
}
```

## License

This model is released under the Apache 2.0 license, consistent with the base Qwen 2.5 model license.
|