
---
license: llama3.2
datasets:
- OctoThinker/MegaMath-Web-Pro-Max
- LLM360/MegaMath
language:
- en
base_model:
- meta-llama/Llama-3.2-1B
pipeline_tag: text-generation
---

# OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

## OctoThinker-1B-Short-Zero

The OctoThinker family is built on carefully studied mid-training insights, starting from the Llama-3 family, to create reinforcement-learning-friendly base language models.

OctoThinker-1B-Short-Zero is trained with R1-Zero-style reinforcement learning, starting from OctoThinker-1B-Short-Base without any supervised fine-tuning (SFT).
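R1-Zero-style training typically replaces a learned reward model with a simple rule-based reward: extract the model's final answer and compare it against the ground truth. A minimal sketch of that idea (the function names and the `\boxed{...}` answer convention are illustrative assumptions, not the exact implementation used here):

```python
import re
from typing import Optional

def extract_boxed_answer(completion: str) -> Optional[str]:
    """Pull the content of the last \\boxed{...} span from a model completion.
    Assumes the model is prompted to put its final answer in \\boxed{}."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """R1-Zero-style reward: 1.0 for an exact answer match, 0.0 otherwise.
    No learned reward model and no SFT data are involved."""
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold_answer.strip() else 0.0
```

Rewards of this form are then fed to a policy-gradient RL algorithm applied directly to the base model.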

## Training Recipe for OctoThinker-1B-Short-Base

*(figure: data pipeline)*

## Evaluation Results of OctoThinker-1B-Base Series

Note that we adopt few-shot prompting evaluation for these base language models.
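Few-shot prompting evaluation prepends a handful of worked (question, answer) demonstrations to each test question, so a base model can continue the pattern without instruction tuning. A minimal sketch of such a prompt builder (the exact template used for these evaluations is not specified here, so the format below is an assumption):

```python
def build_few_shot_prompt(demos, question):
    """Concatenate (question, answer) demonstrations before the test
    question; the model is expected to complete the final 'Answer:' line."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in demos]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Toy demonstrations; real evaluations draw these from the benchmark's
# training split.
demos = [("What is 2 + 3?", "5"), ("What is 10 - 4?", "6")]
prompt = build_few_shot_prompt(demos, "What is 7 * 6?")
```

The prompt ends with an open `Answer:` cue, and the model's completion up to the next newline is scored against the reference answer.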


## RL Training Dynamics of OctoThinker-1B-Zero Series


## More about OctoThinker


## Citation

Check out our paper for more details. If you use our models or datasets, or find our work useful, please cite:

```bibtex
@article{wang2025octothinker,
  title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
  year={2025},
  journal={arXiv preprint arXiv:2506.20512},
  note={Preprint}
}
```