---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
tags:
- process-reward-model
- prm
- reasoning
- reinforcement-learning
- grpo
- safetensors
- qwen
base_model: Qwen/Qwen3-4B-Instruct-2507
model_creator: zai-org
---
# GPRM-4B
👋 Join our community.
📖 Read the GPRM technical report.
📍 API access available upon request.
[GitHub] [Technical Report]
## Introduction
**GPRM (Global Perspective Process Reward Model)** is a next-generation process reward model designed to overcome the "local context" limitations of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a **Global Perspective**, significantly improving error localization and reasoning verification in long-chain tasks.
Previous PRMs often suffer from two major flaws: they ignore historical evaluations and lack visibility into how a step affects future reasoning.
**GPRM addresses these via:**
- **History-Aware Evaluation:** Explicitly conditions on previous steps and their associated judgments.
- **Future-Informed Reasoning:** Incorporates a look-ahead perspective to validate steps against subsequent derivations.
- **4-D Diagnostic Framework:** Structured evaluation across **Look-back** (consistency), **Look-ahead** (plausibility), **Self-check** (validity), and **Goal alignment**.
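As a rough illustration of the 4-D framework (the class, field names, and the min-based aggregation below are assumptions for the sketch, not the model's released interface), a step judgment might combine the four dimension scores into one step reward:

```python
from dataclasses import dataclass

@dataclass
class StepDiagnosis:
    """Hypothetical container for the four diagnostic dimensions,
    each scored in [0, 1]."""
    look_back: float       # consistency with prior steps and their judgments
    look_ahead: float      # plausibility given subsequent derivations
    self_check: float      # internal validity of the step itself
    goal_alignment: float  # progress toward the stated goal

def step_reward(d: StepDiagnosis) -> float:
    """Assumed aggregation: a step is only as strong as its weakest dimension."""
    return min(d.look_back, d.look_ahead, d.self_check, d.goal_alignment)

diag = StepDiagnosis(look_back=0.9, look_ahead=0.8, self_check=0.95, goal_alignment=0.7)
print(step_reward(diag))  # 0.7
```

The min aggregation is one plausible choice; a weighted sum or a learned combination would fit the same four-field structure.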
## Benchmark
### PRMBench (Overall Score)
| Model | Simplicity | Soundness | Sensitivity | Overall |
|-------|-----------|-----------|-------------|---------|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| **GPRM-4B-GRPO** | **65.8** | **76.2** | **79.3** | **73.9** |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |
### ProcessBench (Avg. F1 Score)
| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|-------|-------|------|---------------|----------|---------|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| **GPRM-4B-GRPO** | **73.1** | **77.5** | **71.5** | **75.1** | **74.3** |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |
### Agent Error Bench (Accuracy %)
| Model | ALFWorld (S/S+M) | WebShop (S/S+M) | GAIA (S/S+M) | Average (S/S+M) |
|-------|-----------------|-----------------|--------------|-----------------|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| **GPRM-4B** | **38.0 / 30.0** | **44.0 / 24.0** | **60.0 / 46.0** | **47.0 / 33.0** |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |
### Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)
#### Best-of-8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| **GPRM-4B** | **20.0** | **63.0** | **82.6** | **48.5** | **40.5** | **45.0** | **50.1** |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |
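The Best-of-8 protocol above samples several full solutions and keeps the one the PRM ranks highest. A minimal sketch, where `score_chain` stands in for GPRM scoring (a toy min-over-steps heuristic here, not the model's actual call):

```python
def score_chain(step_rewards):
    """Chain score as the minimum step reward: one bad step sinks the chain.
    (The aggregation GPRM actually uses may differ.)"""
    return min(step_rewards)

def best_of_n(candidates):
    """Pick the candidate chain whose step rewards score highest."""
    return max(candidates, key=lambda c: score_chain(c["step_rewards"]))

# Toy example: three sampled solutions with per-step PRM rewards.
candidates = [
    {"answer": "42", "step_rewards": [0.9, 0.4, 0.8]},
    {"answer": "41", "step_rewards": [0.7, 0.7, 0.7]},
    {"answer": "42", "step_rewards": [0.6, 0.5, 0.9]},
]
print(best_of_n(candidates)["answer"])  # "41"
```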
#### Greedy Guided Search@8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| **GPRM-4B** | **23.3** | **85.0** | **80.0** | **48.0** | **45.0** | **48.8** | **55.0** |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |
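Greedy guided search, unlike Best-of-N, lets the PRM intervene at every step: expand a few candidate next steps, keep the best-scoring one, repeat. A self-contained sketch with a toy expander and scorer (the real setup would sample steps from the base LLM and score them with GPRM):

```python
def greedy_guided_search(expand, score_step, init, depth):
    """Greedy step-level search: at each depth, generate candidate next
    steps via `expand` and keep only the one the scorer rates highest."""
    chain = [init]
    for _ in range(depth):
        candidates = expand(chain)
        chain.append(max(candidates, key=lambda s: score_step(chain, s)))
    return chain

# Toy problem: build the string "abc" one character at a time.
TARGET = "abc"

def expand(chain):
    """Candidate next steps (here: a fixed set of letters)."""
    return ["a", "b", "c"]

def score_step(chain, step):
    """Toy PRM: reward a step only if it extends the correct prefix."""
    prefix = "".join(chain) + step
    return 1.0 if TARGET.startswith(prefix) else 0.0

result = greedy_guided_search(expand, score_step, init="", depth=3)
print("".join(result))  # "abc"
```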
## Training Strategy
GPRM is trained with a two-stage progressive pipeline:
1. **Stage I (Structured SFT):** Learns 4-dimensional diagnostic reasoning on data built via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), with Qwen3-235B-Instruct serving as the annotation teacher.
2. **Stage II (GRPO Optimization):** Refines the evaluation policy under the complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
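The core of Stage II is GRPO's group-relative advantage: each rollout's reward is standardized against the other rollouts sampled for the same prompt, so no separate value model is needed. A minimal sketch of that computation (standard GRPO math, not GPRM-specific code):

```python
import math

def group_relative_advantages(rewards):
    """GRPO: the advantage of each rollout is its reward standardized
    within the group sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# One group of 4 rollouts for a single hard-mined sample.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])  # [1.41, -1.41, 0.0, 0.0]
```

Because advantages are centered within each group, they sum to zero: above-average rollouts are reinforced and below-average ones suppressed, relative only to their peers.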