初始化项目,由ModelHub XC社区提供模型
Model: youngzhong/SOD-GRPO_teacher-4B Source: Original Platform
This commit is contained in:
92
README.md
Normal file
92
README.md
Normal file
@@ -0,0 +1,92 @@
|
||||
---
|
||||
language:
|
||||
- en
|
||||
license: apache-2.0
|
||||
library_name: transformers
|
||||
base_model: Qwen/Qwen3-4B
|
||||
tags:
|
||||
- agent
|
||||
- tool-use
|
||||
- reinforcement-learning
|
||||
- GRPO
|
||||
- math
|
||||
- code
|
||||
- reasoning
|
||||
pipeline_tag: text-generation
|
||||
---
|
||||
|
||||
<div align="center">
|
||||
|
||||
<h1>SOD-GRPO_teacher-4B</h1>
|
||||
|
||||
<p>
|
||||
<a href="https://arxiv.org/abs/2605.07725">
|
||||
<img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv&logoColor=red" alt="Paper on arXiv"/>
|
||||
</a>
|
||||
<a href="https://github.com/YoungZ365/SOD">
|
||||
<img src="https://img.shields.io/badge/Code-GitHub-black?logo=github&logoColor=white" alt="Code on GitHub"/>
|
||||
</a>
|
||||
<a href="https://huggingface.co/collections/youngzhong/sod-6a03530369d76913c24a4ffb">
|
||||
<img src="https://img.shields.io/badge/Collection-SOD-yellow?logo=huggingface" alt="HuggingFace Collection"/>
|
||||
</a>
|
||||
<a href="https://huggingface.co/papers/2605.07725">
|
||||
<img src="https://img.shields.io/badge/Daily%20Paper-SOD-yellow?logo=huggingface&logoColor=yellow" alt="HuggingFace Daily Paper"/>
|
||||
</a>
|
||||
<a href="https://www.alphaxiv.org/abs/2605.07725">
|
||||
<img src="https://img.shields.io/badge/alphaXiv-2605.07725-purple?logo=arxiv&logoColor=white" alt="alphaXiv"/>
|
||||
</a>
|
||||
</p>
|
||||
|
||||
</div>
|
||||
|
||||
## About
|
||||
|
||||
**SOD-GRPO_teacher-4B** is a 4B agentic reasoning model trained with **GRPO (Group Relative Policy Optimization)**, serving as the teacher model in the SOD distillation framework.
|
||||
|
||||
This model is used to distill smaller student models ([SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) and [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B)) via the SOD method, which introduces adaptive step-level weighting to handle cascading error propagation in tool-integrated reasoning.
|
||||
|
||||
## Model Information
|
||||
|
||||
| Attribute | Value |
|
||||
|-----------|-------|
|
||||
| Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
|
||||
| Training Pipeline | Cold-Start SFT → GRPO |
|
||||
| Parameters | 4B |
|
||||
|
||||
## Related Models
|
||||
|
||||
| Model | Description |
|
||||
|-------|-------------|
|
||||
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student |
|
||||
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student |
|
||||
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model (this model) |
|
||||
|
||||
## Performance
|
||||
|
||||
We report **average@32** over 5 runs on challenging math, science, and code benchmarks.
|
||||
|
||||
| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|
||||
|--------|-----------|-----------|--------------|------------------|---------|
|
||||
| **GRPO (This Model)** | **67.60** | **60.42** | **55.19** | **63.13** | **61.59** |
|
||||
|
||||
### Distilled Students
|
||||
|
||||
| Model | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|
||||
|-------|-----------|-----------|--------------|------------------|---------|
|
||||
| SOD-0.6B | 20.84 | 26.13 | 22.19 | 27.72 | 24.22 |
|
||||
| SOD-1.7B | 50.83 | 41.72 | 38.72 | 40.63 | 42.98 |
|
||||
|
||||
## Acknowledgement
|
||||
|
||||
We sincerely thank the authors of [DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B) and the paper *"Demystifying Reinforcement Learning in Agentic Reasoning"* ([arXiv:2510.11701](https://arxiv.org/abs/2510.11701)) for their contribution.
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
@article{zhong2026sod,
|
||||
title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
|
||||
author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},
|
||||
journal={arXiv preprint arXiv:2605.07725},
|
||||
year={2026}
|
||||
}
|
||||
```
|
||||
Reference in New Issue
Block a user