---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen2.5-3B-Instruct
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-3B
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 83.65
    - name: Avg BoN Accuracy
      type: accuracy
      value: 59.06
---
<div align="center">

# WebArbiter-3B

**A principle-guided reasoning Process Reward Model for web agents**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

</div>
## Introduction
**WebArbiter-3B** is a 3B-parameter reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation: it produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.
Despite its compact size, WebArbiter-3B achieves an **Avg. BoN Acc of 59.06%** on [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), outperforming the previous SOTA WebPRM at the same scale (WebShepherd-3B, 43.60%) by **15.5 points** and surpassing all open-source LLM-as-judge baselines up to 70B parameters. For a stronger variant, see [WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B).
## Highlights
- **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
- **Efficient and deployable**: Strong performance at 3B parameters, suitable for resource-constrained deployment scenarios.
## Results on WebPRMBench
Models marked with ⋆ are ours. **Bold** = best at comparable scale. Pair = pairwise accuracy (%); BoN = Best-of-N accuracy (%).

| Model | Mind2Web Pair | Mind2Web BoN | WebArena Pair | WebArena BoN | AssistantBench Pair | AssistantBench BoN | WorkArena Pair | WorkArena BoN | Avg. Pair | Avg. BoN |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| *Open-source LLM-as-judge* | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 76.46 | 36.93 | 60.32 | 15.42 | 75.83 | 33.33 | 64.45 | 19.34 | 69.27 | 26.76 |
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| *WebPRMs (3B)* | | | | | | | | | | |
| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
| ⋆ **WebArbiter-3B** | **93.32** | **78.42** | **81.97** | **56.22** | **78.33** | 46.67 | **81.01** | **54.81** | **83.65** | **59.06** |
| *WebPRMs (7B+)* | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ WebArbiter-7B | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |

WebArbiter-3B outperforms the larger WebShepherd-8B on Avg. BoN Acc (59.06 vs. 43.28), demonstrating the parameter efficiency of the principle-guided reasoning approach.
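
The gap between the Pair and BoN columns can be illustrated with a small sketch. It assumes (this card does not spell out the definition) that each benchmark step pairs one preferred action with several rejected ones, that Pair counts every individual comparison, and that BoN counts a step as correct only when the preferred action wins all of its comparisons. The function name and the input layout are hypothetical.

```python
def pairwise_and_bon_accuracy(groups):
    """Compute (pairwise %, best-of-N %) from per-step comparison outcomes.

    groups: one list per step; each boolean records whether the judge
    preferred the correct action over one rejected action (assumed layout).
    """
    pair_hits = sum(sum(step) for step in groups)
    pair_total = sum(len(step) for step in groups)
    # A step only counts for BoN if every comparison in it was won.
    bon_hits = sum(all(step) for step in groups)
    return 100 * pair_hits / pair_total, 100 * bon_hits / len(groups)


# Three steps with four rejected actions each (made-up outcomes).
groups = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, False],
]
pair_acc, bon_acc = pairwise_and_bon_accuracy(groups)
```

Under this assumed definition a single lost comparison fails the whole step, so BoN is never higher than Pair.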
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses
messages = [{"role": "user", "content": user_prompt}]

# Apply the chat template and move the token ids to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Greedy decoding keeps the verdict deterministic.
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```
**Example output:**
```xml
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
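
To use the verdict programmatically, the tagged sections need to be pulled out of the generated text. The following is a minimal sketch, assuming the output follows the tag layout shown in the example above; `parse_judgment` and `extract_verdict` are hypothetical helper names, not part of the released code.

```python
import re
from typing import Optional

# Section tags taken from the example output above.
SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict:
    """Split a response into its tagged sections.

    Missing sections map to None so callers can detect truncated generations.
    """
    sections = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def extract_verdict(text: str) -> Optional[int]:
    """Return 1 or 2 for 'Response 1' / 'Response 2', or None if unparseable."""
    answer = parse_judgment(text)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


example = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(extract_verdict(example))  # 1
```

Returning `None` instead of raising lets a caller decide how to treat generations that were cut off by `max_new_tokens` before the `<Answer>` tag was closed.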
## Training Details
| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | — |
| Base Model | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 9e-6) |
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |
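
The Stage 2 reward signal can be pictured with a short sketch. It assumes the binary verifiable reward is 1 when the generated `<Answer>` matches the ground-truth preference and 0 otherwise, and it uses the standard group-relative normalization from GRPO (reward minus group mean, divided by group standard deviation). The function names, the regular expression, and the `eps` value are illustrative assumptions, not the project's veRL configuration.

```python
import re
from statistics import mean, pstdev


def binary_reward(response: str, gold: int) -> float:
    """Return 1.0 if the <Answer> verdict names the gold response (1 or 2), else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one preference pair whose gold answer is Response 1.
samples = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "no verdict produced",
]
rewards = [binary_reward(s, gold=1) for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples with a correct verdict receive a positive advantage and the others a negative one; when every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.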
## Intended Uses
WebArbiter-3B is designed to:
- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.
- **Run in resource-constrained deployments**: Serve scenarios where 7B+ models are too large, while still significantly outperforming larger checklist-based WebPRMs.
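
Because the model compares two candidates at a time, Best-of-N selection needs a scheme for reducing N candidates to one. The sketch below uses a sequential knockout with N - 1 comparisons; this is one possible scheme, not necessarily the one used in the paper. The `judge` callable is a hypothetical wrapper around a model call that returns 1 or 2, and `toy_judge` is a stand-in so the example runs without the model.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: returns 1 if the first candidate is preferred, else 2.
Judge = Callable[[str, str], int]


def best_of_n(candidates: Sequence[str], judge: Judge) -> str:
    """Keep a running winner and compare it with each remaining candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    winner = candidates[0]
    for challenger in candidates[1:]:
        if judge(winner, challenger) == 2:
            winner = challenger
    return winner


def toy_judge(first: str, second: str) -> int:
    """Stand-in judge for demonstration: prefers actions that start with 'fill'."""
    if second.startswith("fill") and not first.startswith("fill"):
        return 2
    return 1


candidates = ["click('12')", "fill('1', 'weather')", "hover('3')"]
best = best_of_n(candidates, toy_judge)
print(best)  # fill('1', 'weather')
```

Pairwise judges can be sensitive to candidate order, so a more careful variant might query both orderings and only replace the winner when the two verdicts agree.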
## Limitations
- **Text-only observations**: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
- **Safe-action bias**: The model may sometimes overvalue cautious actions (e.g., preferring a hover to a click) because the accessibility tree does not encode interaction effects.
- **Element reference hallucination**: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level element-ID (`bid`) verification, potentially missing incorrect element references.
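
One way to guard against the element reference issue is a rule-based check that runs before or alongside the model. The sketch below assumes AXTree lines of the form `[id] role 'name'`, as in the example output above, and actions written as `name('id', ...)`; both formats, and all function names, are assumptions for illustration and may not match a given environment.

```python
import re
from typing import Optional


def axtree_bids(axtree: str) -> set:
    """Collect element ids written as [id] in an AXTree text dump."""
    return set(re.findall(r"\[(\w+)\]", axtree))


def referenced_bid(action: str) -> Optional[str]:
    """Return the first quoted argument of an action such as click('12'), or None."""
    match = re.match(r"\s*\w+\(\s*['\"](\w+)['\"]", action)
    return match.group(1) if match else None


def has_valid_reference(action: str, axtree: str) -> bool:
    """True only if the action names an element id that appears in the AXTree."""
    bid = referenced_bid(action)
    return bid is not None and bid in axtree_bids(axtree)


axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(has_valid_reference("fill('1', 'weather')", axtree))  # True
print(has_valid_reference("click('7')", axtree))            # False
```

Actions that legitimately take no element id, or whose first argument is free text, would need separate handling in this sketch.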
## License
This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
## Related Resources
| Resource | Link |
|----------|------|
| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
| WebArbiter-7B | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |
## Citation
```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
year={2026},
eprint={2601.21872},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.21872},
}
```