---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen3-8B
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-8B-Qwen3
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 91.10
    - name: Avg BoN Accuracy
      type: accuracy
      value: 76.66
---

<div align="center">

# WebArbiter-8B-Qwen3

**A principle-guided reasoning Process Reward Model for web agents**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

</div>

## Introduction

**WebArbiter-8B-Qwen3** is an 8B reasoning Process Reward Model (PRM) for web agents, built on [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). It demonstrates that the WebArbiter two-stage training pipeline generalizes across backbone families, achieving the **highest Avg. BoN Acc of 76.66%** among all WebArbiter variants.

Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation: it produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

## Highlights

- **Strongest variant**: Achieves the highest Avg. BoN Acc (76.66%) across all WebArbiter models, outperforming WebArbiter-7B (Qwen2.5) by 2.06 points.
- **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains (a parsing sketch follows this list).
- **Principle-inducing evaluation**: Dynamically derives evaluation principles from the user's intent and the current page state.
- **Two-stage training**: Reasoning distillation from o3 (SFT), followed by RL with Verifiable Rewards (GRPO).
- **Cross-backbone generalization**: Uses the same training pipeline as the Qwen2.5 variants; only backbone-specific hyperparameters differ.
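
Because the entire verdict is plain text, downstream code can recover each section with simple tag matching. A minimal sketch (this regex helper is illustrative, not part of the released codebase):

```python
import re

def parse_verdict(text: str) -> dict:
    """Extract the <State>, <Criteria>, <Analysis>, and <Answer> sections
    from a WebArbiter generation; missing tags map to empty strings."""
    sections = {}
    for tag in ("State", "Criteria", "Analysis", "Answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag.lower()] = match.group(1).strip() if match else ""
    return sections
```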

## Results on WebPRMBench

Models marked with ⋆ are ours. **Bold** = best overall. Pair = pairwise accuracy (%); BoN = Best-of-N accuracy (%).

| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| *Web PRMs (7–8B)* | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ WebArbiter-7B (Qwen2.5) | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |
| ⋆ **WebArbiter-8B (Qwen3)** | **98.33** | **94.09** | 86.92 | 67.16 | **92.50** | **80.00** | **86.66** | 65.38 | **91.10** | **76.66** |

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-8B-Qwen3"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Greedy decoding; the structured verdict typically fits well within 2048 new tokens.
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens (skip the prompt).
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```
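
The `user_prompt` above is deliberately elided: the authoritative template is defined by the [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) dataset. The sketch below only illustrates the kinds of fields involved; every field name and the layout are hypothetical, so consult the dataset for the real format.

```python
# Hypothetical prompt assembly. The real template lives in the WebPRMBench
# dataset; these field names and this layout are illustrative only.
def build_prompt(intent: str, axtree: str, trajectory: str,
                 response_1: str, response_2: str) -> str:
    return (
        f"User intent: {intent}\n\n"
        f"Current page (AXTree):\n{axtree}\n\n"
        f"Action history:\n{trajectory}\n\n"
        f"Response 1: {response_1}\n"
        f"Response 2: {response_2}"
    )
```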

**Example output:**

```xml
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
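
Downstream systems usually need this verdict as a number rather than text. Assuming the `<Answer>` tag always names one of the two candidates, a tiny adapter on top of `parse_verdict` from the earlier sketch (again illustrative, not from the released code):

```python
def verdict_to_index(response: str) -> int | None:
    """Map the model's <Answer> tag to a 0-based candidate index;
    returns None if the verdict is missing or malformed."""
    answer = parse_verdict(response)["answer"]  # helper defined above
    if answer.startswith("Response 1"):
        return 0
    if answer.startswith("Response 2"):
        return 1
    return None
```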

## Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | — |
| Base Model | [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | Stage 1 checkpoint |
| Fine-tuning | LoRA | FSDP + LoRA |
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web), used for both stages | |

All variants use the same training data, distillation strategy, and RL procedure; only backbone-specific hyperparameters differ. See the [paper](https://arxiv.org/abs/2601.21872) (Appendix C) for full details.
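
The Stage 2 reward is verifiable because every preference pair carries a gold label: a rollout earns reward 1 when its parsed verdict matches that label and 0 otherwise. A minimal sketch of such a binary reward (the exact function, including any format checks, is specified in the paper; this version is an assumption):

```python
def verifiable_reward(rollout: str, gold_index: int) -> float:
    """Binary verifiable reward for GRPO: 1.0 iff the rollout's final
    verdict names the gold-preferred candidate. Reuses verdict_to_index
    from the sketch above; the real reward may add format checks."""
    return 1.0 if verdict_to_index(rollout) == gold_index else 0.0
```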

## Intended Uses

WebArbiter-8B-Qwen3 is designed to:

- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution (see the aggregation sketch after this list).
- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred.
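
Because the model emits pairwise verdicts, Best-of-N selection needs an aggregation rule. One simple option is a sequential knockout that keeps a running champion; this tournament scheme is an assumption, not necessarily the procedure behind the BoN numbers reported above.

```python
def best_of_n(candidates: list[str], judge) -> str:
    """Select one action from N candidates using a pairwise judge.
    `judge(a, b)` returns 0 if a is preferred and 1 if b is preferred;
    sequential knockout is one of several possible aggregation schemes."""
    champion = candidates[0]
    for challenger in candidates[1:]:
        if judge(champion, challenger) == 1:
            champion = challenger
    return champion
```

Here `judge` would wrap prompt construction, generation, and `verdict_to_index` from the earlier sketches.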

## Limitations

- **Text-only observations**: Relies on accessibility-tree representations without visual observations.
- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
- **Safe-action bias**: May overvalue cautious actions, because the accessibility tree does not encode interaction effects.

## License

This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).

## Related Resources

| Resource | Link |
|----------|------|
| WebArbiter-7B (Qwen2.5) | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
| WebArbiter-3B (Qwen2.5) | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

## Citation

```bibtex
@misc{zhang2026webarbiter,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```