---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
datasets:
- Kwai-Klear/RLEP_dataset
- BytedTsinghua-SIA/DAPO-Math-17k
base_model: Qwen/Qwen2.5-Math-7B
---
# RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
This repository contains the `qwen2.5-math-rlep` model, which is a key checkpoint from the RLEP training process based on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://huggingface.co/papers/2507.07451).
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. **RLEP** -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
[[Paper](https://huggingface.co/papers/2507.07451)] [[Code](https://github.com/Kwai-Klear/RLEP)] [[Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep)] [[Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)]
<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/rlep_method.png" width="85%" alt="RLEP Method Overview">
</p>
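The replay idea described above can be illustrated with a small sketch. It only shows the batch-mixing step: freshly generated rollouts are combined with a few trajectories drawn from a pool of previously verified successes. The function name `build_mixed_batch`, the `num_replay` knob, and the uniform sampling are assumptions made for illustration; the mixing ratio and selection scheme actually used by RLEP are defined in the paper and the linked code.

```python
import random


def build_mixed_batch(fresh_rollouts, replay_pool, num_replay, rng=None):
    """Blend fresh rollouts with replayed successful trajectories.

    fresh_rollouts: rollouts sampled from the current policy.
    replay_pool:    verified-correct trajectories collected in the first phase.
    num_replay:     number of replayed trajectories to add (hypothetical knob).
    """
    rng = rng or random.Random()
    k = min(num_replay, len(replay_pool))
    replayed = rng.sample(replay_pool, k)  # uniform sampling is an assumption
    return list(fresh_rollouts) + replayed


# Toy data standing in for real trajectories.
fresh = [{"text": f"fresh-{i}", "source": "policy"} for i in range(4)]
pool = [{"text": f"replay-{i}", "source": "replay"} for i in range(10)]

batch = build_mixed_batch(fresh, pool, num_replay=2, rng=random.Random(0))
print([item["text"] for item in batch])
```

In a real training loop, the mixed batch would then be scored and passed to the policy update; that part is omitted here.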
## ✨ Key Highlights
* **Rapid early gains**: On AIME-2024, RLEP reaches the baseline's peak accuracy by step 135 (the baseline needs 380 steps). On AIME-2025, it surpasses the baseline's best score after only 50 steps.
* **Higher final performance**: RLEP ultimately lifts peak accuracy from 38.2% → 39.9% on AIME-2024, 19.8% → 22.3% on AIME-2025, and 77.0% → 82.2% on AMC-2023.
<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/exp_acc.png" width="85%" alt="RLEP Experimental Accuracy">
</p>
## 🚀 Quick Start (Inference)
Here's a simple example of running inference with vLLM.
First, install vLLM (version 0.7.3):
```bash
pip3 install vllm
```
After installation, you can load and run the model in your Python code like this:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = 'Kwai-Klear/qwen2.5-math-rlep'

# Sampling settings: temperature 1.0, no nucleus truncation, up to 3072 new tokens.
sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024 * 3, n=1)
llm = LLM(
    model=model_path,
    enforce_eager=False,
    tensor_parallel_size=1,
    seed=0,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = '''Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$'''
# The model was trained with this instruction wrapped around each question.
prefix = "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\n"
post_fix = '\n\nRemember to put your answer on its own line after "Answer:".'
question_with_instruct = prefix + question + post_fix
messages = [{'content': question_with_instruct, 'role': 'user'}]
# Render the chat template to a prompt string; vLLM tokenizes it inside generate().
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate([text], sampling_params)[0]
answer = output.outputs[0].text
print(question)
print(answer)
```
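To sanity-check the model's output for the sample question, the answer can be verified by brute force. Since $17_b = b + 7$ and $97_b = 9b + 7 = 9(b + 7) - 56$, the divisor $b + 7$ must divide 56, so only small bases can qualify and a bounded search is enough. The helper names below are illustrative and are not part of the RLEP code.

```python
def value_in_base(digits, base):
    """Interpret a list of digits (most significant first) in the given base."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


def qualifying_bases(limit=1000):
    """Bases b > 9 (below `limit`) for which 17_b divides 97_b."""
    return [
        b
        for b in range(10, limit)
        if value_in_base([9, 7], b) % value_in_base([1, 7], b) == 0
    ]


bases = qualifying_bases()
print(bases, sum(bases))  # [21, 49] 70
```

A correct model response for this question should therefore end with `Answer: 70`.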
To evaluate the model on benchmarks such as AIME-2024, AIME-2025, and AMC-2023, please refer to [our repo](http://github.com/Kwai-Klear/RLEP?tab=readme-ov-file#evaluation).
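For a quick local check outside the official evaluation scripts, the final `Answer:` line requested by the prompt can be pulled out with a small helper. This is a minimal sketch under the assumption that the response follows the instructed format; `extract_final_answer` is a hypothetical name, and the repo's evaluation code may normalize and compare answers differently.

```python
import re

# Matches a line of the form "Answer: <something>", as requested by the prompt.
ANSWER_LINE = re.compile(r"^[ \t]*Answer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_final_answer(response):
    """Return the content of the last 'Answer:' line, or None if there is none."""
    matches = ANSWER_LINE.findall(response)
    if not matches:
        return None
    # Drop surrounding whitespace and optional LaTeX dollar signs.
    return matches[-1].strip().strip("$").strip()


sample = "Since b + 7 divides 56, b is 21 or 49.\nAnswer: 70"
print(extract_final_answer(sample))  # 70
```

Taking the last match guards against responses that mention "Answer:" in intermediate steps before giving the final line.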
## Evaluation Results
We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.
| | AIME-2024 | AIME-2025 | AMC-2023 |
|-------------------|-----------|-----------|----------|
| DAPO | 32.6 | 18.9 | 77.5 |
| DAPO-nodyn-bs64 | 37.4 | 19.4 | 77.3 |
| **RLEP** | **38.5** | **21.3** | **83.0** |
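The margins in the table can be read off with a few lines of arithmetic. The sketch below copies the scores from the table and computes RLEP's gain in points over each baseline; the dictionary layout and the `gains_over` helper are illustrative only.

```python
# Scores copied from the table above.
results = {
    "DAPO": {"AIME-2024": 32.6, "AIME-2025": 18.9, "AMC-2023": 77.5},
    "DAPO-nodyn-bs64": {"AIME-2024": 37.4, "AIME-2025": 19.4, "AMC-2023": 77.3},
    "RLEP": {"AIME-2024": 38.5, "AIME-2025": 21.3, "AMC-2023": 83.0},
}


def gains_over(baseline, model="RLEP"):
    """Point difference between `model` and `baseline` on each benchmark."""
    return {
        bench: round(results[model][bench] - results[baseline][bench], 1)
        for bench in results[model]
    }


for baseline in ("DAPO", "DAPO-nodyn-bs64"):
    print(baseline, gains_over(baseline))
```

Against DAPO-nodyn-bs64, the gains are 1.1 points on AIME-2024, 1.9 on AIME-2025, and 5.7 on AMC-2023.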
## Citation
If you find our paper or code helpful, we would appreciate it if you could cite our work:
```bibtex
@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451},
}
```
## Acknowledgement
We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
Many thanks to these open-source projects and the broader community for making these resources available!