The model was trained on the taki555/DeepScaleR-Easy dataset with reinforcement learning (RL) to incentivize accurate yet concise reasoning trajectories, reducing the computational overhead often associated with scaled chain-of-thought (CoT) reasoning.
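To make the idea of rewarding "accurate yet concise" outputs concrete, here is a minimal sketch of a length-aware reward. It is an illustration only, not the reward used in the paper: the function name `length_aware_reward`, the token budget `max_tokens`, and the weight `length_weight` are all hypothetical choices made for this example.

```python
def length_aware_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int = 4096,      # hypothetical token budget (assumption)
    length_weight: float = 0.5,  # hypothetical penalty strength (assumption)
) -> float:
    """Toy reward: full credit for a correct answer, reduced as the response grows.

    Incorrect answers get zero reward regardless of length, so the policy
    cannot gain reward by being short but wrong.
    """
    if not is_correct:
        return 0.0
    # Fraction of the token budget consumed, clipped to [0, 1].
    used = min(max(num_tokens, 0), max_tokens) / max_tokens
    # A correct answer earns between (1 - length_weight) and 1.0.
    return 1.0 - length_weight * used


if __name__ == "__main__":
    # A shorter correct response scores higher than a longer correct one.
    print(length_aware_reward(True, 1024))
    print(length_aware_reward(True, 2048))
    print(length_aware_reward(False, 256))
```

Under these assumptions, correctness dominates the signal, and length only breaks ties among correct responses. Other shapings are possible and may behave differently in training.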
@inproceedings{wu2026art,
  title={The Art of Efficient Reasoning: Data, Reward, and Optimization},
  author={Taiqiang Wu and Zenan Xu and Bo Zhou and Ngai Wong},
  year={2026},
  url={https://arxiv.org/pdf/2602.20945}
}