---
license: other
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- qwen3
- math
- grpo
- verl
- rl
- reinforcement-learning
- on-policy-distillation
- full-parameter-rl
- reasoning
- safetensors
- arxiv:2604.13016
base_model: Qwen/Qwen3-4B-Base
base_model_relation: finetune
---

<h1 align="center">Qwen3-4B-Base-GRPO</h1>

<div align="center" style="line-height: 1;">
  <a href="https://arxiv.org/abs/2604.13016" style="margin: 2px;">
    <img alt="Paper" src="https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/thunlp/OPD" style="margin: 2px;">
    <img alt="Github" src="https://img.shields.io/badge/OPD-000000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/papers/2604.13016" style="margin: 2px;">
    <img alt="HF Papers" src="https://img.shields.io/badge/HF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://x.com/HBX_hbx/status/2044464414829777354" style="margin: 2px;">
    <img alt="Twitter" src="https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<br>

Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
It starts from **Qwen3-4B-Base** and applies GRPO on the **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.

This model is associated with the paper:  
**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**  
Paper link: https://arxiv.org/abs/2604.13016

## Model Description

This model is produced by running GRPO reinforcement learning on `Qwen3-4B-Base` with verl. Training is intended to improve math-focused reasoning performance in the on-policy distillation setting.


### Key characteristics

- **Base model**: Qwen3-4B-Base
- **Training framework**: verl
- **Training stage**: Reinforcement Learning (GRPO)
- **Parameter update**: Full-parameter actor update
- **Primary domain**: Mathematical reasoning
- **Reward model**: Not used (`reward_model.enable: false`)
- **Rollout engine**: vLLM
- **Context length**: 32768 tokens
- **Responses per prompt**: 8
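
To illustrate what sampling 8 responses per prompt means for GRPO, the sketch below computes group-relative advantages for a single prompt. It is a minimal illustration under stated assumptions, not the verl implementation: the function name, the `eps` constant, the example rewards, and the use of the population standard deviation are all hypothetical choices, and implementations differ on the exact normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    with the population standard deviation. Real implementations may
    use a different std estimator or epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, 8 sampled responses, hypothetical binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Correct responses get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When all 8 responses to a prompt receive the same reward, every advantage in this sketch is zero, so that prompt contributes no learning signal.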

## Training Details

### Training configuration

- **Framework**: verl
- **Algorithm**: `grpo`
- **GRPO outcome weight**: `1.0`
- **Learned reward model**: disabled (`reward_model.enable: false`)
- **Reward source**: custom rule-based math reward function
- **Training dataset**: `DAPO-Math-17k-Processed`
- **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
- **Prompt length**: `1024`
- **Response length**: `7168`
- **Validation response length**: `31744`
- **Max model length**: `32768`
- **Rollout temperature**: `1.0`
- **Repetition penalty**: `1.0`
- **KL loss**: disabled
- **Format reward**: disabled
- **Loss aggregation**: `token-mean`
- **Learning rate**: `1e-6`
- **PPO mini-batch size**: `64`
- **PPO micro-batch size per GPU**: `1`
- **Tensor parallel size**: `1`
- **Number of GPUs**: `8`
- **Number of epochs**: `1`
- **Save frequency**: every `20` steps
- **Test frequency**: every `20` steps
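
The length settings and the loss aggregation mode above can be checked with a little arithmetic. The sketch below uses only the numbers listed in this section. The variable and function names are illustrative, and it assumes that `token-mean` averages the loss over all response tokens in the batch instead of averaging per sequence first. The loss values are made up for the comparison.

```python
from statistics import mean

# Length settings copied from the configuration list above.
PROMPT_LEN = 1024
TRAIN_RESPONSE_LEN = 7168
VAL_RESPONSE_LEN = 31744
MAX_MODEL_LEN = 32768

# Training rollouts use a quarter of the context window;
# validation rollouts fill it exactly.
train_total = PROMPT_LEN + TRAIN_RESPONSE_LEN
val_total = PROMPT_LEN + VAL_RESPONSE_LEN
assert train_total <= MAX_MODEL_LEN
assert val_total <= MAX_MODEL_LEN


def token_mean(per_token_losses):
    """Average over every token in the batch (assumed meaning of token-mean)."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


def sequence_mean(per_token_losses):
    """Average within each sequence first, then across sequences (for contrast)."""
    return mean(mean(seq) for seq in per_token_losses)


# Hypothetical per-token losses: one short response and one long response.
losses = [[1.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
print(train_total, val_total)
print(token_mean(losses), sequence_mean(losses))
```

Under this reading, a long response contributes more tokens to the average than a short one, whereas a per-sequence average would weight both responses equally.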

### Dataset

- **Training dataset**: `DAPO-Math-17k-Processed`
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`

## Usage

```python
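from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face Hub repository for this checkpoint.
model_id = "lllyx/Qwen3-4B-Base-GRPO"

# Load the tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # place weights on the available device(s); requires accelerate
)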
```

## Citation

If you use this model, please consider citing the related paper:

```bibtex
@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
```