ModelHub XC e31614c9c3 Initialize project; model provided by the ModelHub XC community
Model: OctoThinker/OctoThinker-1B-Short-Zero
Source: Original Platform
2026-04-20 03:29:02 +08:00

---
license: llama3.2
datasets:
- OctoThinker/MegaMath-Web-Pro-Max
- LLM360/MegaMath
language:
- en
base_model:
- meta-llama/Llama-3.2-1B
pipeline_tag: text-generation
---
# [OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling](https://arxiv.org/abs/2506.20512)
## OctoThinker-1B-Short-Zero
The OctoThinker family is built on carefully studied mid-training insights, starting from the Llama-3 family, to create a reinforcement learning-friendly base language model.
OctoThinker-1B-Short-Zero is trained using the R1-Zero-style reinforcement learning technique, starting from OctoThinker-1B-Short-Base without any supervised fine-tuning (SFT).
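As a minimal sketch of how the model might be loaded for text generation (the `pipeline_tag` in the metadata), the snippet below uses the Hugging Face `transformers` library with the repository id from this card. The `build_prompt` helper and its prompt wording are hypothetical and not taken from the paper or this card; the prompt template used during RL training may differ, so treat the output as illustrative only.

```python
MODEL_ID = "OctoThinker/OctoThinker-1B-Short-Zero"  # repository id from this model card


def build_prompt(question: str) -> str:
    """Wrap a question in a plain-text prompt.

    Hypothetical template for illustration; not the official training prompt.
    """
    return f"Question: {question.strip()}\nAnswer: Let's think step by step."


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode a completion for the question and return only the new text."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is 12 * 7 + 5?")` downloads the weights on first use and requires `transformers` and `torch` to be installed.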
### Training Recipe for OctoThinker-1B-Short-Base
<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/nQEwjHnW88XIThK8jHAA3.png" alt="Data Pipeline" style="width:90%;">
</div>
### Evaluation Results of OctoThinker-1B-Base Series
Note that we evaluate these base language models with few-shot prompting.
<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/3n1cnG81wLjjPwzyMyQ-o.png" alt="Evaluation Results" style="width:80%;">
</div>
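Few-shot prompting of a base model, as mentioned in the note above, generally means prepending a handful of solved examples to the question and letting the model continue the pattern. The following is a rough sketch of that idea only; the `Question:`/`Answer:` layout, the helper name `build_few_shot_prompt`, and the demonstration pairs are assumptions for illustration, not the evaluation setup used for the reported results.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved examples, then the open question for the model to complete."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block leaves the answer empty so the base model continues from here.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations; a real evaluation would use benchmark-specific ones.
demos = [
    ("What is 3 + 4?", "3 + 4 = 7. The answer is 7."),
    ("What is 10 / 2?", "10 / 2 = 5. The answer is 5."),
]
prompt = build_few_shot_prompt(demos, "What is 6 * 7?")
print(prompt)
```

The resulting string can be passed to any text-generation call; the model's continuation after the final `Answer:` is then compared against the reference answer.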
### RL Training Dynamics of OctoThinker-1B-Zero Series
<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/xoiYwO-fsOCdPTKsASJNK.png" alt="RL Training Dynamics" style="width:80%;">
</div>
### More about OctoThinker
<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/bn85CEB_DW6azJ7KJp11Q.png" alt="More about OctoThinker" style="width:100%;">
</div>
## Citation
Check out our [paper](https://arxiv.org/abs/2506.20512) for more details. If you use our models or datasets, or find our work useful, please cite:
```
@article{wang2025octothinker,
title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
year={2025},
journal={arXiv preprint arXiv:2506.20512},
note={Preprint}
}
```