SOD-GRPO_teacher-4B/README.md

---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B
tags:
- agent
- tool-use
- reinforcement-learning
- GRPO
- math
- code
- reasoning
pipeline_tag: text-generation
---

<div align="center">

<h1>SOD-GRPO_teacher-4B</h1>

<p>
  <a href="https://arxiv.org/abs/2605.07725">
    <img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv&logoColor=red" alt="Paper on arXiv"/>
  </a>
  <a href="https://github.com/YoungZ365/SOD">
    <img src="https://img.shields.io/badge/Code-GitHub-black?logo=github&logoColor=white" alt="Code on GitHub"/>
  </a>
  <a href="https://huggingface.co/collections/youngzhong/sod-6a03530369d76913c24a4ffb">
    <img src="https://img.shields.io/badge/Collection-SOD-yellow?logo=huggingface" alt="HuggingFace Collection"/>
  </a>
  <a href="https://huggingface.co/papers/2605.07725">
    <img src="https://img.shields.io/badge/Daily%20Paper-SOD-yellow?logo=huggingface&logoColor=yellow" alt="HuggingFace Daily Paper"/>
  </a>
  <a href="https://www.alphaxiv.org/abs/2605.07725">
    <img src="https://img.shields.io/badge/alphaXiv-2605.07725-purple?logo=arxiv&logoColor=white" alt="alphaXiv"/>
  </a>
</p>

</div>

## About

**SOD-GRPO_teacher-4B** is a 4B agentic reasoning model trained with **GRPO (Group Relative Policy Optimization)**, serving as the teacher model in the SOD distillation framework.

This model is used to distill smaller student models ([SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) and [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B)) via the SOD method, which introduces adaptive step-level weighting to handle cascading error propagation in tool-integrated reasoning.

## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Training Pipeline | Cold-Start SFT → GRPO |
| Parameters | 4B |

## Related Models

| Model | Description |
|-------|-------------|
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student |
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student |
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model (this model) |

## Performance

We report **average@32** over 5 runs on challenging math, science, and code benchmarks.

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| **GRPO (This Model)** | **67.60** | **60.42** | **55.19** | **63.13** | **61.59** |

### Distilled Students

| Model | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|-------|-----------|-----------|--------------|------------------|---------|
| SOD-0.6B | 20.84 | 26.13 | 22.19 | 27.72 | 24.22 |
| SOD-1.7B | 50.83 | 41.72 | 38.72 | 40.63 | 42.98 |

## Acknowledgement

We sincerely thank the authors of [DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B) and the paper *"Demystifying Reinforcement Learning in Agentic Reasoning"* ([arXiv:2510.11701](https://arxiv.org/abs/2510.11701)) for their contribution.

## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```
初始化项目，由ModelHub XC社区提供模型 Model: youngzhong/SOD-GRPO_teacher-4B Source: Original Platform 2026-05-25 07:01:17 +08:00			`---`
			`language:`
			`- en`
			`license: apache-2.0`
			`library_name: transformers`
			`base_model: Qwen/Qwen3-4B`
			`tags:`
			`- agent`
			`- tool-use`
			`- reinforcement-learning`
			`- GRPO`
			`- math`
			`- code`
			`- reasoning`
			`pipeline_tag: text-generation`
			`---`

			`<div align="center">`

			`<h1>SOD-GRPO_teacher-4B</h1>`

			`<p>`
			`<a href="https://arxiv.org/abs/2605.07725">`
			`<img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv&logoColor=red" alt="Paper on arXiv"/>`
			`</a>`
			`<a href="https://github.com/YoungZ365/SOD">`
			`<img src="https://img.shields.io/badge/Code-GitHub-black?logo=github&logoColor=white" alt="Code on GitHub"/>`
			`</a>`
			`<a href="https://huggingface.co/collections/youngzhong/sod-6a03530369d76913c24a4ffb">`
			`<img src="https://img.shields.io/badge/Collection-SOD-yellow?logo=huggingface" alt="HuggingFace Collection"/>`
			`</a>`
			`<a href="https://huggingface.co/papers/2605.07725">`
			`<img src="https://img.shields.io/badge/Daily%20Paper-SOD-yellow?logo=huggingface&logoColor=yellow" alt="HuggingFace Daily Paper"/>`
			`</a>`
			`<a href="https://www.alphaxiv.org/abs/2605.07725">`
			`<img src="https://img.shields.io/badge/alphaXiv-2605.07725-purple?logo=arxiv&logoColor=white" alt="alphaXiv"/>`
			`</a>`
			`</p>`

			`</div>`

			`## About`

			`SOD-GRPO_teacher-4B is a 4B agentic reasoning model trained with GRPO (Group Relative Policy Optimization), serving as the teacher model in the SOD distillation framework.`

			`This model is used to distill smaller student models ([SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) and [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B)) via the SOD method, which introduces adaptive step-level weighting to handle cascading error propagation in tool-integrated reasoning.`

			`## Model Information`

			`\| Attribute \| Value \|`
			`\|-----------\|-------\|`
			`\| Base Model \| [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) \|`
			`\| Training Pipeline \| Cold-Start SFT → GRPO \|`
			`\| Parameters \| 4B \|`

			`## Related Models`

			`\| Model \| Description \|`
			`\|-------\|-------------\|`
			`\| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) \| SOD-distilled 0.6B student \|`
			`\| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) \| SOD-distilled 1.7B student \|`
			`\| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) \| GRPO-trained 4B teacher model (this model) \|`

			`## Performance`

			`We report average@32 over 5 runs on challenging math, science, and code benchmarks.`

			`\| Method \| AIME 2024 \| AIME 2025 \| GPQA-Diamond \| LiveCodeBench-v6 \| Average \|`
			`\|--------\|-----------\|-----------\|--------------\|------------------\|---------\|`
			`\| GRPO (This Model) \| 67.60 \| 60.42 \| 55.19 \| 63.13 \| 61.59 \|`

			`### Distilled Students`

			`\| Model \| AIME 2024 \| AIME 2025 \| GPQA-Diamond \| LiveCodeBench-v6 \| Average \|`
			`\|-------\|-----------\|-----------\|--------------\|------------------\|---------\|`
			`\| SOD-0.6B \| 20.84 \| 26.13 \| 22.19 \| 27.72 \| 24.22 \|`
			`\| SOD-1.7B \| 50.83 \| 41.72 \| 38.72 \| 40.63 \| 42.98 \|`

			`## Acknowledgement`

			`We sincerely thank the authors of [DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B) and the paper "Demystifying Reinforcement Learning in Agentic Reasoning" ([arXiv:2510.11701](https://arxiv.org/abs/2510.11701)) for their contribution.`

			`## Citation`

			```bibtex
			`@article{zhong2026sod,`
			`title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},`
			`author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},`
			`journal={arXiv preprint arXiv:2605.07725},`
			`year={2026}`
			`}`
			```