---
license: apache-2.0
tags:
- text-generation
- causal-lm
- reasoning
library_name: transformers
---

# Introduction

<div align="center">

[Paper (arXiv:2510.04140)](https://arxiv.org/abs/2510.04140)
[Code (GitHub)](https://github.com/Jiangzs1028/MENTOR)

</div>

MENTOR is a framework that enables LLMs to achieve effective and diverse exploration in reinforcement learning by providing expert guidance only at critical decision points, rather than imitating entire expert trajectories.

## Key Highlights

- **Selective Expert Guidance:** Injects expert signals only at critical decision points, avoiding full-trajectory imitation.
- **Effective & Diverse Exploration:** Balances expert guidance with autonomous exploration, preventing entropy collapse.
- **Absorb Essence, Remove Redundancy:** Captures essential expert strategies while discarding unnecessary patterns.
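
This card does not define "entropy collapse". As a rough illustration, assuming the term refers to the policy's next-token distributions becoming nearly deterministic during training, the sketch below computes the Shannon entropy of a few toy distributions. The helper names and the probability values are made up for illustration and are not taken from the MENTOR code.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability tokens contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over the decoding steps of one response."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


# Hypothetical distributions over a 4-token vocabulary at two decoding steps.
exploring = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]

print(mean_entropy(exploring) > mean_entropy(collapsed))  # True
```

A policy whose mean entropy drifts toward zero samples almost the same tokens every time, which is one way to read the loss of exploration that the second highlight refers to.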

# Chat Template

```python
def build_MENTOR_chat_template(question, tokenizer):
    """Format `question` with the MENTOR system prompt and return the prompt as a string."""
    system_prompt = (
        "You are a helpful AI Assistant that provides well-reasoned and detailed responses. "
        "You FIRST think about the reasoning process as an internal monologue and "
        "then provide the final answer. The reasoning process MUST BE enclosed "
        "within <think> </think> tags. The final answer MUST BE put in \\boxed{}."
    )
    # tokenize=False returns text rather than token ids;
    # add_generation_prompt=True appends the start of the assistant turn.
    return tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
```
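
Because the system prompt asks for reasoning inside `<think> </think>` tags and the final answer inside `\boxed{}`, a response can be split into those two parts after generation. The following is a minimal sketch of one way to do that; `extract_boxed` and `split_response` are hypothetical helper names, not part of the MENTOR repository, and the sketch assumes the model follows the requested format.

```python
import re
from typing import Optional, Tuple


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward and track brace depth so nested LaTeX such as \frac{1}{2} survives.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def split_response(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a response into (reasoning, final_answer); either may be None."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else None
    # Search for the answer after the reasoning block when one is present.
    tail = response[match.end():] if match else response
    return reasoning, extract_boxed(tail)


sample = "<think>2 + 2 equals 4.</think> The answer is \\boxed{4}."
reasoning, answer = split_response(sample)
print(reasoning)  # 2 + 2 equals 4.
print(answer)     # 4
```

Taking the last `\boxed{}` after the reasoning block is a design choice made here so that intermediate boxed expressions inside the reasoning are not mistaken for the final answer.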

# Citation

If you find our model useful, please cite our paper:
```bibtex
@article{jiang2025selective,
  title={Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs},
  author={Jiang, Zishang and Han, Jinyi and Li, Tingyun and Wang, Xinyi and Jiang, Sihang and Liang, Jiaqing and Dai, Zhaoqian and Ma, Shuguang and Yu, Fei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2510.04140},
  year={2025}
}
```