---
license: apache-2.0
language: en
tags:
- reinforcement-learning
- reward-model
- dpo
model_name: R-PRM-7B-DPO
pipeline_tag: text-generation
---
# R-PRM: Reasoning-Driven Process Reward Modeling
<p align="center">
<a href="https://arxiv.org/abs/2503.21295"> 📃 Paper</a> |
<a href="https://shesj-note.notion.site/R-PRM-Reasoning-Driven-Process-Reward-Modeling-9543fb238b0d48338dd44c60999ffd9b"> 📝 Blog</a> |
<a href="https://github.com/NJUNLP/R-PRM"> ⚙️ Code</a> |
<a href="https://huggingface.co/kevinpro/R-PRM-7B-DPO"> 🤖 Model</a> |
<a href="https://huggingface.co/datasets/kevinpro/R-PRM"> 🤗 Dataset</a> |
<a href="https://ricardokevins.github.io/"> 📭 Contact</a>
</p>
## Overview
Welcome to the repository of **R-PRM**, our cutting-edge framework designed to revolutionize process-level evaluation in mathematical reasoning for large language models (LLMs).
* 🚀 We introduce **Reasoning-Driven Process Reward Modeling (R-PRM)**, a novel approach that enhances LLMs' ability to evaluate mathematical reasoning step-by-step. By leveraging stronger LLMs to generate seed data, optimizing preferences without additional annotations, and scaling inference-time computation, R-PRM delivers comprehensive, transparent, and robust assessments of reasoning processes.
* 📈 Our framework significantly boosts evaluation **accuracy** and **generalization**, outperforming strong baselines by wide margins on ProcessBench and PRMBench. When guiding policy models, R-PRM consistently improves reasoning performance across diverse datasets, achieving state-of-the-art (SOTA) results.
* 🌐 Overall, R-PRM offers a scalable and data-efficient solution to the challenge of scarce process-level annotations, enabling a more generalizable enhancement of reasoning evaluation capabilities without extensive human labeling.
![Figure 1: R-PRM Framework Illustration](./fig/R-PRM.jpg "R-PRM Framework Illustration")
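For quick reference, below is a minimal sketch of querying R-PRM-7B-DPO through `transformers`. The prompt shown is illustrative only; the exact evaluation prompt template, and the logic for turning the generated critique into a step score, are defined in the [official repo](https://github.com/NJUNLP/R-PRM).
```python
# Minimal sketch: ask R-PRM to analyze one solution step and emit a judgment.
# NOTE: the prompt below is illustrative; use the official template from
# https://github.com/NJUNLP/R-PRM for faithful reproduction.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kevinpro/R-PRM-7B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "Janet has 3 apples and buys 2 more. How many apples does she have?"
prefix = ["Step 1: Janet starts with 3 apples."]
step = "Step 2: 3 + 2 = 5, so she has 5 apples."

prompt = (
    f"Question: {question}\n"
    "Solution so far:\n" + "\n".join(prefix) + "\n"
    f"Step to evaluate: {step}\n"
    "Analyze this step carefully, then conclude with 'Correct' or 'Incorrect'."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```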
## 🏆 Experiment Results
### 🧪 **Data Efficiency**
R-PRM demonstrates exceptional data efficiency under varying training scales:
- With just **12.8k** training samples, R-PRM reaches **F1 = 52.6**, already surpassing most open-source PRMs.
- R-PRM achieves **+3.6** F1 over Qwen2.5-Math-7B-PRM800K when trained on just **64k** samples (vs. Qwen's **265k**), and extends this lead to **+8.7** F1 when both are trained on comparable data volumes.
- Notably, despite using only **~15%** of the data, R-PRM's performance is already comparable to that of Qwen2.5-Math-PRM, which was trained on a much larger **1.8M** LLM-filtered dataset.
![Figure 2: Data Scaling](./fig/DataScaling.png)
### 📊 **ProcessBench**
Our reasoning-driven framework improves over Qwen2.5-Math-7B-PRM800K by **+8.7 F1 (SFT)** and **+13.9 F1 (DPO)**, demonstrating its powerful evaluation capability.
| **Model** | **GSM8K** | **MATH** | **OLYMPIAD** | **OMNIMATH** | **Avg. F1** |
| ----------------------- | ---------------------- | ---------------------- | ---------------------- | ---------------------- | ---------------------- |
| Math-Shepherd-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| Qwen2.5-Math-7B-PRM800K | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| ⭐ **R-PRM-7B-SFT** | 77.2 (**+9.0**) | 71.6 (**+9.0**) | 59.6 (**+8.9**) | 52.3 (**+8.0**) | 65.2 (**+8.7**) |
| ⭐ **R-PRM-7B-DPO** | 80.7 (**+12.5**) | 76.9 (**+14.3**) | 63.8 (**+13.1**) | 60.1 (**+15.8**) | 70.4 (**+13.9**) |
| Qwen2.5-Math-PRM-7B | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
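As a reminder of how these numbers are computed (to our understanding of the ProcessBench protocol): a model must locate the earliest erroneous step, or declare the solution error-free, and the reported F1 is the harmonic mean of its accuracy on the erroneous subset and its accuracy on the error-free subset. A small sketch with hypothetical prediction lists:
```python
# Hedged sketch of ProcessBench-style F1: the harmonic mean of the accuracy
# on solutions that contain an error (did we find the first wrong step?) and
# the accuracy on error-free solutions (did we predict "no error"?).
def processbench_f1(hits_on_erroneous, hits_on_correct):
    acc_err = sum(hits_on_erroneous) / len(hits_on_erroneous)
    acc_ok = sum(hits_on_correct) / len(hits_on_correct)
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)

# Hypothetical example: 62% accuracy on erroneous, 80% on error-free -> ~69.9 F1.
print(100 * processbench_f1([True] * 62 + [False] * 38, [True] * 80 + [False] * 20))
```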
### 🧠 **PRMBench**
R-PRM-7B-DPO achieves **+8.5 F1** over Qwen2.5-Math-7B-PRM800K on PRMBench.
📌 It excels in **soundness**, **sensitivity**, and **multi-dimensional error analysis**.
![PRMBench Performance](./fig/PRMBench.png)
### 🧪 **Best-of-N Strategy**
When selecting the best among N reasoning paths, R-PRM improves accuracy by **+8.6 points** over the **pass@1 baseline**, achieving the **best results** among all PRMs across six math datasets.
| **Setting / Model** | **AIME24** | **AMC23** | **MATH** | **Olympiad** | **College** | **Minerva** | **Avg.** |
| ------------------------------ | ---------------- | --------------- | -------------- | ------------------ | ----------------- | ----------------- | -------------- |
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 16.7 | 42.5 | 76.0 | 42.0 | 37.0 | 39.3 | 42.3 |
| Skywork-PRM-7B | 16.7 | 55.0 | 81.2 | 44.0 | 40.5 | **44.5** | 47.0 |
| Qwen2.5-Math-7B-PRM800K | 13.3 | 57.5 | 80.0 | 44.5 | **43.5** | 43.0 | 47.7 |
| Qwen2.5-Math-PRM-7B | 16.7 | 55.0 | 82.0 | 48.0 | **43.5** | 43.0 | **48.0** |
| ⭐ **R-PRM-7B-DPO** | **20.0** | **62.5** | **82.2** | **48.0** | 41.0 | 44.1 | **49.6** |
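The selection rule itself is simple; a sketch is below. How R-PRM's generated critique is reduced to a scalar step score is defined in the official repo, so `score_step` and the `min` aggregation here are assumptions (scoring a solution by its weakest step is a common PRM convention, not necessarily the paper's choice).
```python
# Hedged sketch of Best-of-N selection with a process reward model.
# `score_step(question, prefix_steps, step)` is a hypothetical callable
# returning a scalar PRM score for one step; `aggregate=min` is an assumption.
def best_of_n(question, candidates, score_step, aggregate=min):
    def solution_score(steps):
        return aggregate(
            score_step(question, steps[:i], step) for i, step in enumerate(steps)
        )
    return max(candidates, key=solution_score)

# `candidates` would be N solutions (each a list of step strings) sampled
# from the policy model; the highest-scoring one is returned.
```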
### 🔁 **Guide Search Strategy**
By guiding reasoning step-by-step, R-PRM surpasses pass@1 by **+8.4 points**, outperforming both **majority voting** and previous PRM-guided methods.
| **Setting / Model** | **AIME24** | **AMC23** | **MATH** | **Olympiad** | **College** | **Minerva** | **Avg.** |
| ------------------------------ | ---------------- | --------------- | -------------- | ------------------ | ----------------- | ----------------- | -------------- |
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | 43.4 | 44.9 |
| Qwen2.5-Math-7B-PRM800K | **23.3** | 45.0 | 78.2 | 42.0 | 35.5 | 38.6 | 43.8 |
| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | **81.0** | 43.5 | 39.0 | 40.4 | 46.8 |
| ⭐ **R-PRM-7B-DPO** | 16.7 | **70.0** | 80.0 | **46.5** | 39.5 | **43.4** | **49.4** |
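A greedy step-level variant of such guided search is sketched below; the paper's exact search procedure may differ. `propose_steps` (sample k candidate next steps from the policy model) and `score_step` (score one step with the PRM) are hypothetical callables, and the stop condition is illustrative.
```python
# Hedged sketch of PRM-guided step-level search (greedy variant).
def guided_search(question, propose_steps, score_step, k=8, max_steps=20):
    steps = []
    for _ in range(max_steps):
        candidates = propose_steps(question, steps, k)  # k sampled next steps
        best = max(candidates, key=lambda s: score_step(question, steps, s))
        steps.append(best)
        if "Final Answer" in best:  # illustrative stop marker
            break
    return steps
```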
### 🚀 **Inference-Time Scaling**
Evaluation performance improves consistently as more reasoning trajectories are sampled at inference.
→ From **62.8 F1 (2 samples)** to **67.6 F1 (4 samples)** on ProcessBench.
This showcases R-PRM's ability to deliver **robust, ensemble-style judgment** through multi-path reasoning.
![Figure3: ProcessBench Scaling](./fig/processbench-scaling.png)
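Mechanically, this kind of scaling just means sampling several independent reasoning-driven critiques for the same step and pooling their verdicts. A minimal sketch, where `sample_judgment` is a hypothetical callable returning 1.0 (correct) or 0.0 (incorrect) from one sampled critique:
```python
# Hedged sketch of inference-time scaling: ensemble several sampled critiques.
def ensemble_step_score(question, prefix, step, sample_judgment, n_samples=4):
    votes = [sample_judgment(question, prefix, step) for _ in range(n_samples)]
    return sum(votes) / n_samples  # fraction of critiques judging the step correct
```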
## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@misc{she2025rprmreasoningdrivenprocessreward,
      title={R-PRM: Reasoning-Driven Process Reward Modeling},
      author={Shuaijie She and Junxiao Liu and Yifeng Liu and Jiajun Chen and Xin Huang and Shujian Huang},
      year={2025},
      eprint={2503.21295},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.21295},
}
```