---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
tags:
- process-reward-model
- prm
- reasoning
- reinforcement-learning
- grpo
- safetensors
- qwen
base_model: Qwen/Qwen3-4B-Instruct-2507
model_creator: zai-org
---

# GPRM-4B

👋 Join our community.
📖 Read the GPRM technical report.
📍 API access available upon request.

[GitHub] [Technical Report]

## Introduction

**GPRM (Global Perspective Process Reward Model)** is a next-generation process reward model designed to overcome the "local context" limitation of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a **Global Perspective** that significantly improves error localization and reasoning verification in long-chain tasks.

Previous PRMs often suffer from two major flaws: they ignore historical evaluations, and they lack visibility into how a step affects future reasoning.

**GPRM addresses these flaws through:**

- **History-Aware Evaluation:** Explicitly conditions on previous steps and their associated judgments.
- **Future-Informed Reasoning:** Incorporates a look-ahead perspective to validate steps against subsequent derivations.
- **4-D Diagnostic Framework:** Structured evaluation across **Look-back** (consistency), **Look-ahead** (plausibility), **Self-check** (validity), and **Goal alignment**.

## Benchmark

### PRMBench (Overall Score)

| Model | Simplicity | Soundness | Sensitivity | Overall |
|-------|-----------|-----------|-------------|---------|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| **GPRM-4B-GRPO** | **65.8** | **76.2** | **79.3** | **73.9** |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |

### ProcessBench (Avg. F1 Score)

| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|-------|-------|------|---------------|----------|---------|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| **GPRM-4B-GRPO** | **73.1** | **77.5** | **71.5** | **75.1** | **74.3** |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |

### Agent Error Bench (Accuracy %)

| Model | ALFWorld (S/S+M) | WebShop (S/S+M) | GAIA (S/S+M) | Average (S/S+M) |
|-------|-----------------|-----------------|--------------|-----------------|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| **GPRM-4B** | **38.0 / 30.0** | **44.0 / 24.0** | **60.0 / 46.0** | **47.0 / 33.0** |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |

### Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)

#### Best-of-8 (Accuracy %)

| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| **GPRM-4B** | **20.0** | **63.0** | **82.6** | **48.5** | **40.5** | **45.0** | **50.1** |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |

#### Greedy Guided Search@8 (Accuracy %)

| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|-----------|--------|-------|------|---------------|--------------|--------------|------|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| **GPRM-4B** | **23.3** | **85.0** | **80.0** | **48.0** | **45.0** | **48.8** | **55.0** |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |

## Training Strategy

GPRM uses a two-stage progressive training pipeline:

1. **Stage I (Structured SFT):** Learns the four-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), using Qwen3-235B-Instruct as the teacher model for annotation.
2. **Stage II (GRPO Optimization):** Refines the evaluation policy under the complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
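The global context described above (history with judgments, the current step, and look-ahead steps) has to be serialized into a single evaluation prompt. The technical report defines the exact format; the sketch below is only an illustration of the idea, with field names and instruction wording that are our own assumptions, not the model's actual prompt template:

```python
# Illustrative sketch of global-perspective prompt assembly.
# All wording and field names here are hypothetical; consult the
# GPRM technical report for the real evaluation format.

def build_gprm_prompt(question, history, judgments, current_step, future_steps):
    """Condition on prior steps with their judgments (look-back) and on
    subsequent derivations (look-ahead), mirroring the 4-D framework."""
    lines = [f"Problem: {question}", "", "Previous steps and judgments:"]
    for i, (step, verdict) in enumerate(zip(history, judgments), start=1):
        lines.append(f"  Step {i}: {step}  [judged: {verdict}]")
    lines += ["", f"Current step to evaluate: {current_step}", ""]
    lines.append("Future steps (look-ahead):")
    if future_steps:
        lines += [f"  {s}" for s in future_steps]
    else:
        lines.append("  (none)")
    lines += [
        "",
        "Evaluate the current step along four dimensions:",
        "1. Look-back: is it consistent with the history above?",
        "2. Look-ahead: is it plausible given the subsequent derivations?",
        "3. Self-check: is the step itself valid?",
        "4. Goal alignment: does it advance toward the final answer?",
        "Conclude with a verdict: correct / incorrect.",
    ]
    return "\n".join(lines)
```

The assembled string would then be passed to the model through the usual Qwen chat template; the key point is that, unlike a local PRM, every call sees the full trajectory on both sides of the step under evaluation.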