Files
R-PRM-7B-DPO/README.md

113 lines
8.7 KiB
Markdown
Raw Permalink Normal View History

---
license: apache-2.0
language: zh
tags:
- reinforcement-learning
- reward-model
- dpo
model_name: R-PRM-7B-DPO
pipeline_tag: text-generation
---
# R-PRM: Reasoning-Driven Process Reward Modeling
<p align="center">
<a href="https://arxiv.org/abs/2503.21295"> 📃 Paper</a> |
<a href="https://shesj-note.notion.site/R-PRM-Reasoning-Driven-Process-Reward-Modeling-9543fb238b0d48338dd44c60999ffd9b"> 📝 Blog</a> |
<a href="https://github.com/NJUNLP/R-PRM"> ⚙️ Code</a> |
<a href="https://huggingface.co/kevinpro/R-PRM-7B-DPO"> 🤖 Model</a> |
<a href="https://huggingface.co/datasets/kevinpro/R-PRM"> 🤗 Dataset</a> |
<a href="https://ricardokevins.github.io/"> 📭 Contact</a>
</p>
## Overview
Welcome to the repository of **R-PRM**, our cutting-edge framework designed to revolutionize process-level evaluation in mathematical reasoning for large language models (LLMs).
* 🚀 We introduce **Reasoning-Driven Process Reward Modeling (R-PRM)**, a novel approach that enhances LLMs' ability to evaluate mathematical reasoning step-by-step. By leveraging stronger LLMs to generate seed data, optimizing preferences without additional annotations, and scaling inference-time computation, R-PRM delivers comprehensive, transparent, and robust assessments of reasoning processes.
* 📈 Our framework significantly boosts evaluation **accuracy** and **generalization**, outperforming strong baselines by wide margins on ProcessBench and PRMBench. When guiding policy models, R-PRM consistently improves reasoning performance across diverse datasets, achieving state-of-the-art (SOTA) results.
* 🌐 Overall, R-PRM offers a scalable and data-efficient solution to the challenge of scarce process-level annotations, enabling a more generalizable enhancement of reasoning evaluation capabilities without extensive human labeling.
![Figure 1: R-PRM Framework Illustration](./fig/R-PRM.jpg "R-PRM Framework Illustration")
## 🏆 Experiment Results
### 🧪 **Data Efficiency**
R-PRM demonstrates exceptional data efficiency under varying training scales:
- With just **12.8k** training samples, R-PRM reaches **F1 = 52.6**, already surpassing most open-source PRMs.
- R-PRM achieves **+3.6** F1 over Qwen2.5-Math-7B-PRM800K when trained on just **64k** samples (vs. Qwen's **265k**), and extends this lead to **+8.7** F1 when both are trained on comparable data volumes.
- Notably, despite using only **~15%** of the data, R-PRMs performance is already comparable to Qwen2.5-Math-PRM, which was trained on a much larger **1.8M** LLM-filtered dataset.
![Figure2: DataScaline](fig/DataScaling.png)
### 📊 **ProcessBench**
Our reasoning-driven framework improves over Qwen2.5-Math-7B-PRM800K by **+8.7 F1 (SFT)** and **+13.9 F1 (DPO)**, demonstrating its powerful evaluation capability.
| **Model** | **GSM8K** | **MATH** | **OLYMPIAD** | **OMNIMATH** | **Avg. F1** |
| ----------------------- | ---------------------- | ---------------------- | ---------------------- | ---------------------- | ---------------------- |
| Math-Shepherd-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| Qwen2.5-Math-7B-PRM800K | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| ⭐ **R-PRM-7B-SFT** | 77.2 (**+9.0**) | 71.6 (**+9.0**) | 59.6 (**+8.9**) | 52.3 (**+8.0**) | 65.2 (**+8.7**) |
| ⭐ **R-PRM-7B-DPO** | 80.7 (**+12.5**) | 76.9 (**+14.3**) | 63.8 (**+13.1**) | 60.1 (**+15.8**) | 70.4 (**+13.9**) |
| Qwen2.5-Math-PRM-7B | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
### 🧠 **PRMBench**
R-PRM achieves **+8.5 F1 (DPO)** over Qwen2.5-Math-7B-PRM800K
📌 Excels in **soundness**, **sensitivity**, and **multi-dimensional error analysis**.
![PRMBench Performance](./fig/PRMBench.png)
### 🧪 **Best-of-N Strategy**
When selecting the best among N reasoning paths, R-PRM improves accuracy by **+8.6 points** over the **Pass@1 baseline**, achieving the **best results** among all PRMs across six math datasets.
| **Setting / Model** | **AIME24** | **AMC23** | **MATH** | **Olympiad** | **College** | **Minerva** | **Avg.** |
| ------------------------------ | ---------------- | --------------- | -------------- | ------------------ | ----------------- | ----------------- | -------------- |
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 16.7 | 42.5 | 76.0 | 42.0 | 37.0 | 39.3 | 42.3 |
| Skywork-PRM-7B | 16.7 | 55.0 | 81.2 | 44.0 | 40.5 | **44.5** | 47.0 |
| Qwen2.5-Math-7B-PRM800K | 13.3 | 57.5 | 80.0 | 44.5 | **43.5** | 43.0 | 47.7 |
| Qwen2.5-Math-PRM-7B | 16.7 | 55.0 | 82.0 | 48.0 | **43.5** | 43.0 | **48.0** |
| ⭐ **R-PRM-7B-DPO** | **20.0** | **62.5** | **82.2** | **48.0** | 41.0 | 44.1 | **49.6** |
### 🔁 **Guide Search Strategy**
By guiding reasoning step-by-step, R-PRM surpasses Pass@1 by **+8.4 points**, outperforming both **majority voting** and previous PRM-guided methods.
| **Setting / Model** | **AIME24** | **AMC23** | **MATH** | **Olympiad** | **College** | **Minerva** | **Avg.** |
| ------------------------------ | ---------------- | --------------- | -------------- | ------------------ | ----------------- | ----------------- | -------------- |
| pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| major@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | 43.4 | 44.9 |
| Qwen2.5-Math-7B-PRM800K | **23.3** | 45.0 | 78.2 | 42.0 | 35.5 | 38.6 | 43.8 |
| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | **81.0** | 43.5 | 39.0 | 40.4 | 46.8 |
| ⭐ **R-PRM-7B-DPO** | 16.7 | **70.0** | 80.0 | **46.5** | 39.5 | **43.4** | **49.4** |
### 🚀 **Inference-Time Scaling**
Evaluation performance improves consistently as more reasoning trajectories are sampled at inference.
→ From **62.8 F1 (2 samples)** to **67.6 F1 (4 samples)** on ProcessBench.
This showcases R-PRMs ability to deliver **robust, ensemble-style judgment** through multi-path reasoning.
![Figure3: ProcessBench Scaling](./fig/processbench-scaling.png)
## Citation
If you find this repository helpful, feel free to cite our paper:
```
@misc{she2025rprmreasoningdrivenprocessreward,
title={R-PRM: Reasoning-Driven Process Reward Modeling},
author={Shuaijie She and Junxiao Liu and Yifeng Liu and Jiajun Chen and Xin Huang and Shujian Huang},
year={2025},
eprint={2503.21295},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.21295},
}
```