---
license: apache-2.0
language:
- en
- zh
- es
- ar
- vi
- ja
- ko
- fr
- pt
- th
tags:
- O1-like model
- Math
pipeline_tag: text-generation
---

This repository contains the resources for our paper **Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning** ([arXiv:2510.07300](https://arxiv.org/abs/2510.07300)).

Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) they often fail to maintain language consistency between input and output; (2) they tend to follow incorrect reasoning paths and achieve lower answer accuracy than in English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address them, we propose M-Thinker, which is trained with the GRPO algorithm using a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward imposes a strict constraint on language consistency between the input, the thought, and the answer. In addition, the CTA reward compares the model's non-English reasoning paths with its English reasoning path, transferring the model's own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization to out-of-domain languages.
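To make the two reward signals concrete, here is a minimal, self-contained sketch of their shape. Everything below is an illustrative assumption, not the paper's implementation: `detect_lang` is a toy Unicode-range language identifier (the actual pipeline would use a real language-ID tool), and `cta_reward` approximates cross-lingual alignment by comparing the language-invariant math tokens shared between a non-English thought and the model's English thought.

```python
import re

def detect_lang(text: str) -> str:
    """Toy language identifier via Unicode script ranges (illustrative only)."""
    if re.search(r"[\u4e00-\u9fff]", text):
        return "zh"
    if re.search(r"[\u3040-\u30ff]", text):
        return "ja"
    if re.search(r"[\uac00-\ud7af]", text):
        return "ko"
    return "en"

def lc_reward(query: str, thought: str, answer: str) -> float:
    """Language Consistency reward: 1 only when BOTH the thought and the
    answer are in the same language as the input query (strict constraint)."""
    lang = detect_lang(query)
    ok = detect_lang(thought) == lang and detect_lang(answer) == lang
    return 1.0 if ok else 0.0

def cta_reward(non_en_thought: str, en_thought: str) -> float:
    """Hypothetical stand-in for Cross-lingual Thinking Alignment: score how
    closely a non-English reasoning path tracks the model's own English path,
    here via Jaccard overlap of math-expression tokens (language-invariant).
    The paper's actual CTA scoring differs."""
    tokens = lambda t: set(re.findall(r"[\d+\-*/=.]+", t))
    a, b = tokens(non_en_thought), tokens(en_thought)
    return len(a & b) / len(a | b) if a | b else 0.0
```

In GRPO training these scalar rewards would be combined with an answer-correctness reward to score each sampled rollout; the strict all-or-nothing form of `lc_reward` is what pushes language consistency toward 100%.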

| Model Access | Backbone | Training data Access |
| --- | --- | --- |
| M-Thinker-7B-Iter2 (👍👍) | M-Thinker-7B-Iter1 | M-Thinker-7B-RL-Iter2-data |
| M-Thinker-7B-Iter1 (👍) | 7B-cold-start-SFT | M-Thinker-7B-RL-Iter1-data |
| 7B-cold-start-SFT | DeepSeek-R1-Distill-Qwen-7B | M-Thinker-SFT-data |
| M-Thinker-1.5B-Iter2 (👍👍) | M-Thinker-1.5B-Iter1 | M-Thinker-1.5B-RL-Iter2-data |
| M-Thinker-1.5B-Iter1 (👍) | 1.5B-cold-start-SFT | M-Thinker-1.5B-RL-Iter1-data |
| 1.5B-cold-start-SFT | DeepSeek-R1-Distill-Qwen-1.5B | M-Thinker-SFT-data |

If you find this work useful, please consider citing our paper:

```bibtex
@misc{zhang2025thinknativelyunlockingmultilingual,
      title={Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning},
      author={Xue Zhang and Yunlong Liang and Fandong Meng and Songming Zhang and Kaiyu Huang and Yufeng Chen and Jinan Xu and Jie Zhou},
      year={2025},
      eprint={2510.07300},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.07300},
}
```
Description
Model synced from source: XueZhang-bjtu/M-Thinker-7B-Iter2