---
license: mit
library_name: transformers
---
# ReasonFlux-PRM
[Code](https://github.com/Gen-Verse/ReasonFlux) | [Paper](https://arxiv.org/abs/2506.18896)
We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate trajectory-response style reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. ReasonFlux-PRM supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.
<table>
<tr>
<th>Model</th>
<th>Type</th>
<th>Size</th>
<th>Capabilities</th>
<th>Use Cases</th>
<th>Download</th>
</tr>
<tr>
<td><strong>ReasonFlux-PRM</strong></td>
<td>PRM</td>
<td>7B</td>
<td>• Trajectory-aware scoring<br/>• Online/Offline supervision<br/>• Dense process rewards</td>
<td>Data selection, RL training, Test-time scaling</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-7B">🤗 7B</a></td>
</tr>
<tr>
<td><strong>ReasonFlux-PRM</strong></td>
<td>PRM</td>
<td>1.5B</td>
<td>• Lightweight scoring<br/>• Efficient inference<br/>• Edge deployment</td>
<td>Resource-constrained applications</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-1.5B">🤗 1.5B</a></td>
</tr>
<tr>
<td><strong>ReasonFlux-PRM-Qwen-2.5</strong></td>
<td>End-to-End Trained Policy Model</td>
<td>7B</td>
<td>• Long CoT reasoning <br/>• Solving complex tasks and problems</td>
<td>Math and Science Reasoning</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B">🤗 7B</a></td>
</tr>
</table>
>*Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: first applying SFT on 1k trajectory-response pairs selected by ReasonFlux-PRM-7B, followed by RL training with GRPO using ReasonFlux-PRM-7B as the reward model.*
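As a minimal sketch of the offline data-selection use case above: rank trajectory-response pairs by their PRM reward and keep the top-k for distillation. The `score_fn` below is a hypothetical stand-in for ReasonFlux-PRM inference (the real scoring interface is defined in the [GitHub repository](https://github.com/Gen-Verse/ReasonFlux)); the selection logic itself is independent of how the score is produced.

```python
# Sketch of PRM-based offline data selection for SFT distillation.
# score_fn is a placeholder for ReasonFlux-PRM scoring, NOT the model's
# actual API -- see the ReasonFlux repository for real inference code.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (trajectory, response)

def select_top_k(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],  # PRM reward for one pair
    k: int,
) -> List[Pair]:
    """Keep the k pairs the PRM scores highest."""
    ranked = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    return ranked[:k]

# Toy stand-in scorer (longer response -> higher reward), for illustration only.
demo_pairs = [("t1", "short"), ("t2", "a longer answer"), ("t3", "mid len")]
top = select_top_k(demo_pairs, lambda t, r: len(r), k=2)
# top == [("t2", "a longer answer"), ("t3", "mid len")]
```

In the actual pipeline the note above describes, the same ranking step (with ReasonFlux-PRM-7B as the scorer) selects the 1k pairs used for SFT.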
## Citation
```bibtex
@article{zou2025reasonfluxprm,
title={ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs},
author={Zou, Jiaru and Yang, Ling and Gu, Jingwen and Qiu, Jiahao and Shen, Ke and He, Jingrui and Wang, Mengdi},
journal={arXiv preprint arXiv:2506.18896},
year={2025}
}
```