Aryabhata-1.0/README.md

---
license: cc-by-nc-4.0
tags:
- small-language-model
- jee
- exam-centric
- indian-education
- reinforcement-learning
- supervised-finetuning
- model-merging
- rejection-sampling
- mathematics
- ai4education
- physicswallah
language:
- en
model_name: PhysicsWallah/Aryabhata-1.0
model_creator: Physics Wallah AI Research
model_type: Causal decoder-based model
base_model: Qwen/Qwen2.5-Math-7B
pipeline_tag: text-generation
library_name: transformers
---

# Aryabhatta 1.0 : An exam-focused language model for JEE Math

![](benchmark.png)

## Overview

**Aryabhata 1.0** is a 7B parameter small language model for mathematics developed by **Physics Wallah AI Research**, optimized for high-stakes Indian competitive exams like **JEE Mains**. Despite its compact size, Aryabhata 1.0 achieves **state-of-the-art performance** on exam-centric reasoning tasks with impressive **token efficiency** and low inference cost.


> 🚧 *Aryabhata 1.0 is an **experimental release**. We are actively seeking feedback — please contribute in the Discussion tab of this repo.*
---

## 🧠 Key Features

- **Architecture**: 7B parameter causal decoder-based model.
- **Exam-Centric Optimization**: Specifically tuned for JEE-level Mathematics reasoning.
- **High Accuracy**:
  - **86%** on **JEE Mains January 2025** session.
  - **90.2%** on **JEE Mains April 2025** session.
- **Token Efficiency**: Operates effectively around a **~2K token window**, compared to ~8K required by other reasoning models.
- **Compute Efficient**: Trained on a **1x2 NVIDIA H100 GPU** using optimized pipeline.

---

## 🛠️ Training Details

- **Training Data**: ~130K problem-solution pairs curated from proprietary Physics Wallah exam datasets.
- **Training Pipeline**:
  - **Model Merging**
  - **Rejection Sampling**
  - **Supervised Fine-Tuning (SFT)**
  - **Reinforcement Learning with Verifiable Rewards (RLVR)**

### 🔀 Model Merging
We began with model merging (Weighted average) to build a strong initialization (Aryabhata 0.5) by combining diverse model capabilities:
* Qwen 2.5 Math: A robust math-centric LLM with solid symbolic math foundations.
* Ace Math: An enhanced version of Qwen 2.5 Math, fine-tuned by NVIDIA for improved accuracy in mathematics benchmarks.
* DeepSeek R1 Distill Qwen: A long-form reasoning model, fine-tuned on reasoning traces distilled from DeepSeek R1.

### 📚 Data Curation + Rejection Sampling
We extracted ~250K raw questions from Physics Wallah's internal database and applied aggressive filtering and cleaning:
* Removed: diagram-based, non-English, and option-heavy questions.
* Kept: questions matching the distribution of JEE Main 2019–2024.
Final curated dataset: ~130K high-quality questions.

For each question:
* Generated 4 CoTs using Aryabhata 0.5.
* Retained only those leading to correct final answers.

Resulting Dataset:
* ~100K questions
* ~350K high-quality CoTs

We used this dataset for SFT.

### 🎯 Reinforcement Learning with Verifiable Rewards (RLVR)
We used a custom in-house variant of Group Relative Policy Optimization (GRPO), adapted for math-specific reward functions.
* Removed KL-divergence penalty
* Removed clipping

We used RLVR on the remaining ~30K questions.

This multi-phase training strategy allows Aryabhata 1.0 to capture **pedagogy-aligned reasoning patterns**, making it highly effective for solving real student queries in mathematics.

---

## 📊 Performance Highlights

### Evaluation Setup
All evaluations were performed with temperature = 0.0, and we report pass@1 accuracy.

#### Evaluation Datasets
We evaluated the model on two sets of official JEE Mains 2025 mathematics papers:
* January Session: 10 question papers containing 250 questions.
* April Session: 9 question papers containing 225 questions.

Each paper includes a mix of:
* Multiple Choice Questions (MCQs) with one correct option
* Numeric Answer Type (NAT) questions requiring precise numerical responses

#### Evaluation Metric
We used a composite evaluation metric to reflect real-world grading rigor and reduce false positives:

1. Float Match
  * Compares predicted and target answers within a tolerance (±1e-9)
  * Handles rounding artifacts and small numerical errors robustly
2. String Match
  * Used for symbolic answers (e.g., fractions, radicals)
  * Uses strict exact match — predictions must match ground truth character-for-character
3. LLM-as-Judge (GPT-4o-mini)
  * Used for Mathematical equivalence for ambiguous formats

### 🔹 Accuracy Comparison Across Models
![](accuracy.png)
> *Aryabhata has the best accuracy on JEE Main Maths, on par with frontier models*

### 🔹 Accuracy vs Token Usage
![](accuracy-vs-token.png)
> *Aryabhata is on par with frontier models in terms of accuracy vs token usage*

---

## 🔧 Intended Use

**Primary Use Cases**:
- Competitive exam preparation (JEE Main level mathematics problems)
- Question answering and doubt-solving systems
- Educational tutoring and concept explanation


## 💡 How to Use

### 🧪 Using with 🤗 Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_id = "PhysicsWallahAI/Aryabhata-1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


# Define stop strings
stop_strings = ["<|im_end|>", "<|end|>", "<im_start|>", "⁠```python\n", "⁠<|im_start|>", "]}}]}}]"]

def strip_bad_tokens(s, stop_strings):
    for suffix in stop_strings:
        if s.endswith(suffix):
            return s[:-len(suffix)]
    return s


# Create generation config (can also set temperature, top_p, etc.)
generation_config = GenerationConfig(
    max_new_tokens=4096,
    stop_strings = stop_strings
)

query = 'Find all the values of \\sqrt[3]{1}'
messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
            {'role': 'user', 'content': query}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config, tokenizer=tokenizer)

print(strip_bad_tokens(tokenizer.decode(outputs[0], skip_special_tokens=True), stop_strings))
````

---

### ⚡ Using with vLLM

To run the model efficiently using vLLM:

```python
from vllm import LLM, SamplingParams

# Initialize model (downloads from Hugging Face if not local)
llm = LLM(model="PhysicsWallahAI/Aryabhata-1.0")

# Define prompt and sampling configuration
query = 'Find all the values of \\sqrt[3]{1}'
messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
            {'role': 'user', 'content': query}]
sampling_params = SamplingParams(temperature=0.0, max_tokens=4*1024, stop=["<|im_end|>", "<|end|>", "<im_start|>", "⁠```python\n", "⁠<|im_start|>", "]}}]}}]"])

# Run inference
results = llm.chat(messages, sampling_params)

# Print result
print(results[0].outputs[0].text.strip())
```

---

Read more about Aryabhata 1.0 in our [Technical Report](https://arxiv.org/abs/2508.08665)

---

## 🚀 Roadmap

**Aryabhata 2.0** (Upcoming):
- Extending domain coverage to **Physics** and **Chemistry**
- Supporting **JEE Advanced**, **NEET**, and **Foundation syllabus**
- Further optimization for affordability and accuracy in real-time deployments

---

## 🤝 Citation

If you use this model, please cite:

```bibtex
@misc{Aryabhata2025,
  title = {Aryabhata 1.0: A compact, exam-focused language model tailored for mathematics in Indian competitive exams, especially JEE Main.},
  author = {Physics Wallah AI Research},
  year = {2025},
  note = {\url{https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0}},
}
-												初始化项目，由ModelHub XC社区提供模型

Model: PhysicsWallahAI/Aryabhata-1.0
Source: Original Platform

											
										
										
											2026-05-13 15:06:36 +08:00
+								---
 								license: cc-by-nc-4.0
 								tags:
 								- small-language-model
 								- jee
 								- exam-centric
 								- indian-education
 								- reinforcement-learning
 								- supervised-finetuning
 								- model-merging
 								- rejection-sampling
 								- mathematics
 								- ai4education
 								- physicswallah
 								language:
 								- en
 								model_name: PhysicsWallah/Aryabhata-1.0
 								model_creator: Physics Wallah AI Research
 								model_type: Causal decoder-based model
 								base_model: Qwen/Qwen2.5-Math-7B
 								pipeline_tag: text-generation
 								library_name: transformers
 								---
 								# Aryabhatta 1.0 : An exam-focused language model for JEE Math
 								![](benchmark.png)
 								## Overview
 								**Aryabhata 1.0** is a 7B parameter small language model for mathematics developed by **Physics Wallah AI Research**, optimized for high-stakes Indian competitive exams like **JEE Mains**. Despite its compact size, Aryabhata 1.0 achieves **state-of-the-art performance** on exam-centric reasoning tasks with impressive **token efficiency** and low inference cost.
 								> 🚧 *Aryabhata 1.0 is an **experimental release**. We are actively seeking feedback — please contribute in the Discussion tab of this repo.*
 								---
 								## 🧠 Key Features
 								- **Architecture**: 7B parameter causal decoder-based model.
 								- **Exam-Centric Optimization**: Specifically tuned for JEE-level Mathematics reasoning.
 								- **High Accuracy**:
 								  - **86%** on **JEE Mains January 2025** session.
 								  - **90.2%** on **JEE Mains April 2025** session.
 								- **Token Efficiency**: Operates effectively around a **~2K token window**, compared to ~8K required by other reasoning models.
 								- **Compute Efficient**: Trained on a **1x2 NVIDIA H100 GPU** using optimized pipeline.
 								---
 								## 🛠️ Training Details
 								- **Training Data**: ~130K problem-solution pairs curated from proprietary Physics Wallah exam datasets.
 								- **Training Pipeline**:
 								  - **Model Merging**
 								  - **Rejection Sampling**
 								  - **Supervised Fine-Tuning (SFT)**
 								  - **Reinforcement Learning with Verifiable Rewards (RLVR)**
 								### 🔀 Model Merging
 								We began with model merging (Weighted average) to build a strong initialization (Aryabhata 0.5) by combining diverse model capabilities:
 								* Qwen 2.5 Math: A robust math-centric LLM with solid symbolic math foundations.
 								* Ace Math: An enhanced version of Qwen 2.5 Math, fine-tuned by NVIDIA for improved accuracy in mathematics benchmarks.
 								* DeepSeek R1 Distill Qwen: A long-form reasoning model, fine-tuned on reasoning traces distilled from DeepSeek R1.
 								### 📚 Data Curation + Rejection Sampling
 								We extracted ~250K raw questions from Physics Wallah's internal database and applied aggressive filtering and cleaning:
 								* Removed: diagram-based, non-English, and option-heavy questions.
 								* Kept: questions matching the distribution of JEE Main 2019–2024.
 								Final curated dataset: ~130K high-quality questions.
 								For each question:
 								* Generated 4 CoTs using Aryabhata 0.5.
 								* Retained only those leading to correct final answers.
 								Resulting Dataset:
 								* ~100K questions
 								* ~350K high-quality CoTs
 								We used this dataset for SFT.
 								### 🎯 Reinforcement Learning with Verifiable Rewards (RLVR)
 								We used a custom in-house variant of Group Relative Policy Optimization (GRPO), adapted for math-specific reward functions.
 								* Removed KL-divergence penalty
 								* Removed clipping
 								We used RLVR on the remaining ~30K questions.
 								This multi-phase training strategy allows Aryabhata 1.0 to capture **pedagogy-aligned reasoning patterns**, making it highly effective for solving real student queries in mathematics.
 								---
 								## 📊 Performance Highlights
 								### Evaluation Setup
 								All evaluations were performed with temperature = 0.0, and we report pass@1 accuracy.
 								#### Evaluation Datasets
 								We evaluated the model on two sets of official JEE Mains 2025 mathematics papers:
 								* January Session: 10 question papers containing 250 questions.
 								* April Session: 9 question papers containing 225 questions.
 								Each paper includes a mix of:
 								* Multiple Choice Questions (MCQs) with one correct option
 								* Numeric Answer Type (NAT) questions requiring precise numerical responses
 								#### Evaluation Metric
 								We used a composite evaluation metric to reflect real-world grading rigor and reduce false positives:
 . Float Match
 								  * Compares predicted and target answers within a tolerance (±1e-9)
 								  * Handles rounding artifacts and small numerical errors robustly
 . String Match
 								  * Used for symbolic answers (e.g., fractions, radicals)
 								  * Uses strict exact match — predictions must match ground truth character-for-character
 . LLM-as-Judge (GPT-4o-mini)
 								  * Used for Mathematical equivalence for ambiguous formats
 								### 🔹 Accuracy Comparison Across Models
 								![](accuracy.png)
 								> *Aryabhata has the best accuracy on JEE Main Maths, on par with frontier models*
 								### 🔹 Accuracy vs Token Usage
 								![](accuracy-vs-token.png)
 								> *Aryabhata is on par with frontier models in terms of accuracy vs token usage*
 								---
 								## 🔧 Intended Use
 								**Primary Use Cases**:
 								- Competitive exam preparation (JEE Main level mathematics problems)
 								- Question answering and doubt-solving systems
 								- Educational tutoring and concept explanation
 								## 💡 How to Use
 								### 🧪 Using with 🤗 Transformers
 								```python
 								from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
 								model_id = "PhysicsWallahAI/Aryabhata-1.0"
 								tokenizer = AutoTokenizer.from_pretrained(model_id)
 								model = AutoModelForCausalLM.from_pretrained(model_id)
 								# Define stop strings
 								stop_strings = ["<|im_end|>", "<|end|>", "<im_start|>", "⁠```python\n", "⁠<|im_start|>", "]}}]}}]"]
 								def strip_bad_tokens(s, stop_strings):
 								    for suffix in stop_strings:
 								        if s.endswith(suffix):
 								            return s[:-len(suffix)]
 								    return s
 								# Create generation config (can also set temperature, top_p, etc.)
 								generation_config = GenerationConfig(
 								    max_new_tokens=4096,
 								    stop_strings = stop_strings
 								)
 								query = 'Find all the values of \\sqrt[3]{1}'
 								messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
 								            {'role': 'user', 'content': query}]
 								text = tokenizer.apply_chat_template(
 								    messages,
 								    tokenize=False,
 								    add_generation_prompt=True
 								)
 								inputs = tokenizer([text], return_tensors="pt")
 								outputs = model.generate(**inputs, generation_config=generation_config, tokenizer=tokenizer)
 								print(strip_bad_tokens(tokenizer.decode(outputs[0], skip_special_tokens=True), stop_strings))
 								````
 								---
 								### ⚡ Using with vLLM
 								To run the model efficiently using vLLM:
 								```python
 								from vllm import LLM, SamplingParams
 								# Initialize model (downloads from Hugging Face if not local)
 								llm = LLM(model="PhysicsWallahAI/Aryabhata-1.0")
 								# Define prompt and sampling configuration
 								query = 'Find all the values of \\sqrt[3]{1}'
 								messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
 								            {'role': 'user', 'content': query}]
 								sampling_params = SamplingParams(temperature=0.0, max_tokens=4*1024, stop=["<|im_end|>", "<|end|>", "<im_start|>", "⁠```python\n", "⁠<|im_start|>", "]}}]}}]"])
 								# Run inference
 								results = llm.chat(messages, sampling_params)
 								# Print result
 								print(results[0].outputs[0].text.strip())
 								```
 								---
 								Read more about Aryabhata 1.0 in our [Technical Report](https://arxiv.org/abs/2508.08665)
 								---
 								## 🚀 Roadmap
 								**Aryabhata 2.0** (Upcoming):
 								- Extending domain coverage to **Physics** and **Chemistry**
 								- Supporting **JEE Advanced**, **NEET**, and **Foundation syllabus**
 								- Further optimization for affordability and accuracy in real-time deployments
 								---
 								## 🤝 Citation
 								If you use this model, please cite:
 								```bibtex
 								@misc{Aryabhata2025,
 								  title = {Aryabhata 1.0: A compact, exam-focused language model tailored for mathematics in Indian competitive exams, especially JEE Main.},
 								  author = {Physics Wallah AI Research},
 								  year = {2025},
 								  note = {\url{https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0}},
 								}