---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B
tags:
- agent
- tool-use
- reinforcement-learning
- GRPO
- math
- code
- reasoning
pipeline_tag: text-generation
---

# SOD-GRPO_teacher-4B

## About

**SOD-GRPO_teacher-4B** is a 4B agentic reasoning model trained with **GRPO (Group Relative Policy Optimization)**, serving as the teacher model in the SOD distillation framework.

This model is used to distill smaller student models ([SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) and [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B)) via the SOD method, which introduces adaptive step-level weighting to handle cascading error propagation in tool-integrated reasoning.

## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Training Pipeline | Cold-Start SFT → GRPO |
| Parameters | 4B |

## Related Models

| Model | Description |
|-------|-------------|
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student |
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student |
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model (this model) |

## Performance

We report **average@32** over 5 runs on challenging math, science, and code benchmarks.

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| **GRPO (This Model)** | **67.60** | **60.42** | **55.19** | **63.13** | **61.59** |

### Distilled Students

| Model | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|-------|-----------|-----------|--------------|------------------|---------|
| SOD-0.6B | 20.84 | 26.13 | 22.19 | 27.72 | 24.22 |
| SOD-1.7B | 50.83 | 41.72 | 38.72 | 40.63 | 42.98 |

## Acknowledgement

We sincerely thank the authors of [DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B) and the paper *"Demystifying Reinforcement Learning in Agentic Reasoning"* ([arXiv:2510.11701](https://arxiv.org/abs/2510.11701)) for their contribution.

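
## Checking the Average Column

The Average column in the performance tables appears to be the unweighted mean of the four benchmark scores, rounded half-up to two decimals. This is an inference from the reported numbers, not something the card states explicitly. The sketch below recomputes it from the published per-benchmark scores; the names `SCORES`, `mean_score`, and `AVERAGES` are illustrative and not part of any released code.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the performance tables, in the order:
# AIME 2024, AIME 2025, GPQA-Diamond, LiveCodeBench-v6.
SCORES = {
    "SOD-GRPO_teacher-4B": ["67.60", "60.42", "55.19", "63.13"],
    "SOD-0.6B": ["20.84", "26.13", "22.19", "27.72"],
    "SOD-1.7B": ["50.83", "41.72", "38.72", "40.63"],
}


def mean_score(scores):
    """Unweighted mean of the scores, rounded half-up to two decimals.

    Decimal is used instead of float so that values such as 42.975
    round the way they would by hand (assumed rounding convention).
    """
    total = sum(Decimal(s) for s in scores)
    mean = total / Decimal(len(scores))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


AVERAGES = {name: mean_score(vals) for name, vals in SCORES.items()}

for name, avg in AVERAGES.items():
    print(f"{name}: {avg}")
```

Under these assumptions the recomputed values match the Average column for all three models (61.59, 24.22, and 42.98).
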
## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```