---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
tags:
- process-reward-model
- prm
- reasoning
- reinforcement-learning
- grpo
- safetensors
- qwen
base_model: Qwen/Qwen3-4B-Instruct-2507
model_creator: zai-org
---
# GPRM-4B
👋 Join our community.
📖 Read the GPRM technical report.
📍 API access available upon request.
[GitHub] [Technical Report]
## Introduction
**GPRM (Global Perspective Process Reward Model)** is a next-generation process reward model designed to overcome the "local context" limitations of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a **Global Perspective**, significantly improving error localization and reasoning verification in long-chain tasks.
Previous PRMs often suffer from two major flaws: they ignore historical evaluations and lack visibility into how a step affects future reasoning.
**GPRM addresses these via:**
- **History-Aware Evaluation:** Explicitly conditions on previous steps and their associated judgments.
- **Future-Informed Reasoning:** Incorporates a look-ahead perspective to validate steps against subsequent derivations.
- **4-D Diagnostic Framework:** Structured evaluation across **Look-back** (consistency), **Look-ahead** (plausibility), **Self-check** (validity), and **Goal alignment**.
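As a rough illustration of the 4-D framework (the class, field names, and the min-based aggregation below are assumptions for the sketch, not the model's released interface), a step judgment might combine the four dimension scores into one step reward:

```python
from dataclasses import dataclass

@dataclass
class StepDiagnosis:
    """Hypothetical container for the four diagnostic dimensions,
    each scored in [0, 1]."""
    look_back: float       # consistency with prior steps and their judgments
    look_ahead: float      # plausibility given subsequent derivations
    self_check: float      # internal validity of the step itself
    goal_alignment: float  # progress toward the stated goal

def step_reward(d: StepDiagnosis) -> float:
    """Assumed aggregation: a step is only as strong as its weakest dimension."""
    return min(d.look_back, d.look_ahead, d.self_check, d.goal_alignment)

diag = StepDiagnosis(look_back=0.9, look_ahead=0.8, self_check=0.95, goal_alignment=0.7)
print(step_reward(diag))  # 0.7
```

The min aggregation is one plausible choice; a weighted sum or a learned combination would fit the same four-field structure.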
## Benchmark
### PRMBench (Overall Score)
| Model | Simplicity | Soundness | Sensitivity | Overall |
|-------|-----------|-----------|-------------|---------|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| **GPRM-4B-GRPO** | **65.8** | **76.2** | **79.3** | **73.9** |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |
### ProcessBench (Avg. F1 Score)
| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|-------|-------|------|---------------|----------|---------|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| **GPRM-4B-GRPO** | **73.1** | **77.5** | **71.5** | **75.1** | **74.3** |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |
### Agent Error Bench (Accuracy %)
| Model | ALFWorld (S/S+M) | WebShop (S/S+M) | GAIA (S/S+M) | Average (S/S+M) |
|-------|-----------------|-----------------|--------------|-----------------|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| **GPRM-4B** | **38.0 / 30.0** | **44.0 / 24.0** | **60.0 / 46.0** | **47.0 / 33.0** |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |
### Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)
#### Best-of-8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| **GPRM-4B** | **20.0** | **63.0** | **82.6** | **48.5** | **40.5** | **45.0** | **50.1** |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |
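The Best-of-8 protocol above samples several full solutions and keeps the one the PRM ranks highest. A minimal sketch, where `score_chain` stands in for GPRM scoring (a toy min-over-steps heuristic here, not the model's actual call):

```python
def score_chain(step_rewards):
    """Chain score as the minimum step reward: one bad step sinks the chain.
    (The aggregation GPRM actually uses may differ.)"""
    return min(step_rewards)

def best_of_n(candidates):
    """Pick the candidate chain whose step rewards score highest."""
    return max(candidates, key=lambda c: score_chain(c["step_rewards"]))

# Toy example: three sampled solutions with per-step PRM rewards.
candidates = [
    {"answer": "42", "step_rewards": [0.9, 0.4, 0.8]},
    {"answer": "41", "step_rewards": [0.7, 0.7, 0.7]},
    {"answer": "42", "step_rewards": [0.6, 0.5, 0.9]},
]
print(best_of_n(candidates)["answer"])  # "41"
```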
#### Greedy Guided Search@8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| **GPRM-4B** | **23.3** | **85.0** | **80.0** | **48.0** | **45.0** | **48.8** | **55.0** |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |
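Greedy guided search, unlike Best-of-N, lets the PRM intervene at every step: expand a few candidate next steps, keep the best-scoring one, repeat. A self-contained sketch with a toy expander and scorer (the real setup would sample steps from the base LLM and score them with GPRM):

```python
def greedy_guided_search(expand, score_step, init, depth):
    """Greedy step-level search: at each depth, generate candidate next
    steps via `expand` and keep only the one the scorer rates highest."""
    chain = [init]
    for _ in range(depth):
        candidates = expand(chain)
        chain.append(max(candidates, key=lambda s: score_step(chain, s)))
    return chain

# Toy problem: build the string "abc" one character at a time.
TARGET = "abc"

def expand(chain):
    """Candidate next steps (here: a fixed set of letters)."""
    return ["a", "b", "c"]

def score_step(chain, step):
    """Toy PRM: reward a step only if it extends the correct prefix."""
    prefix = "".join(chain) + step
    return 1.0 if TARGET.startswith(prefix) else 0.0

result = greedy_guided_search(expand, score_step, init="", depth=3)
print("".join(result))  # "abc"
```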
## Training Strategy
GPRM is trained with a two-stage progressive pipeline:
1. **Stage I (Structured SFT):** Learns 4-dimensional diagnostic reasoning on data built via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), with Qwen3-235B-Instruct serving as the annotation teacher.
2. **Stage II (GRPO Optimization):** Refines the evaluation policy under the complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
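The core of Stage II is GRPO's group-relative advantage: each rollout's reward is standardized against the other rollouts sampled for the same prompt, so no separate value model is needed. A minimal sketch of that computation (standard GRPO math, not GPRM-specific code):

```python
import math

def group_relative_advantages(rewards):
    """GRPO: the advantage of each rollout is its reward standardized
    within the group sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# One group of 4 rollouts for a single hard-mined sample.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])  # [1.41, -1.41, 0.0, 0.0]
```

Because advantages are centered within each group, they sum to zero: above-average rollouts are reinforced and below-average ones suppressed, relative only to their peers.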