---
license: apache-2.0
tags:
- text-generation
- causal-lm
- reasoning
library_name: transformers
---

# Introduction
[![Paper](https://img.shields.io/badge/Paper-arXiv-b5212f.svg?logo=arxiv)](https://arxiv.org/abs/2510.04140) [![GitHub](https://img.shields.io/badge/Code-GitHub-181717.svg?logo=github)](https://github.com/Jiangzs1028/MENTOR)

MENTOR is a framework that enables LLMs to achieve effective and diverse exploration in reinforcement learning by providing expert guidance only at critical decision points, rather than imitating entire expert trajectories.

## Key Highlights

- **Selective Expert Guidance:** Injects expert signals only at critical decision points, avoiding full-trajectory imitation.
- **Effective & Diverse Exploration:** Balances expert guidance with autonomous exploration, preventing entropy collapse.
- **Absorb Essence, Remove Redundancy:** Captures essential expert strategies while discarding unnecessary patterns.

# Chat Template

```python
def build_MENTOR_chat_template(question, tokenizer):
    """Return the prompt string for `question`, formatted with the tokenizer's chat template."""
    system_prompt = (
        "You are a helpful AI Assistant that provides well-reasoned and detailed responses. "
        "You FIRST think about the reasoning process as an internal monologue and "
        "then provide the final answer. The reasoning process MUST BE enclosed "
        "within tags. The final answer MUST BE put in \\boxed{}."
    )
    return tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ],
        tokenize=False,              # return text, not token ids
        add_generation_prompt=True   # end the prompt where the assistant turn begins
    )
```

# Citation

If you find our model useful, please cite our paper:

```
@article{jiang2025selective,
  title={Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs},
  author={Jiang, Zishang and Han, Jinyi and Li, Tingyun and Wang, Xinyi and Jiang, Sihang and Liang, Jiaqing and Dai, Zhaoqian and Ma, Shuguang and Yu, Fei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2510.04140},
  year={2025}
}
```
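
# Extracting the Final Answer

The system prompt in the chat template requires the final answer to appear in `\boxed{}`. The following is a minimal sketch of how a caller might recover that answer from a generated response. `extract_boxed_answer` is a hypothetical helper written for illustration and is not part of the MENTOR codebase. It assumes the last `\boxed{...}` in the text is the final answer, and the sample `response` string is invented.

```python
def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if absent.

    Hypothetical helper (an assumption, not part of MENTOR). Braces are
    matched by depth so nested LaTeX such as \\frac{1}{2} is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer in the response

    depth = 1  # the opening brace of \boxed{ is already consumed
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces, e.g. a truncated generation


# Invented sample output, for illustration only.
response = "6 times 7 is 42, so the final answer is \\boxed{42}."
print(extract_boxed_answer(response))  # prints 42
```

Returning `None` for a missing or unbalanced `\boxed{}` lets the caller tell a malformed response apart from a wrong answer, which can matter when scoring generations.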