---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen3-4B
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-4B-Qwen3
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 87.73
    - name: Avg BoN Accuracy
      type: accuracy
      value: 72.55
---
# WebArbiter-4B-Qwen3

**A principle-guided reasoning Process Reward Model for web agents**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

## Introduction

**WebArbiter-4B-Qwen3** is a 4B-parameter reasoning Process Reward Model (PRM) for web agents, built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B). It demonstrates that stronger base models amplify the benefits of principle-guided reasoning distillation: it reaches an **Avg. BoN Acc of 72.55%** with roughly half the parameters of WebArbiter-7B (Qwen2.5), which scores 74.60%.

Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation. It produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

## Highlights

- **Parameter-efficient**: Approaches WebArbiter-7B (Qwen2.5) performance (72.55 vs 74.60 Avg. BoN Acc) with roughly half the parameters.
- **Reasoning as reward**: Generates structured outputs in four tagged sections, with auditable reasoning chains.
- **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state.
- **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO).
- **Cross-backbone generalization**: Same training pipeline as the Qwen2.5 variants; only backbone-specific hyperparameters differ.

## Results on WebPRMBench

Models marked with ⋆ are ours. **Bold** = best at comparable scale. Pair = pairwise accuracy (%), BoN = Best-of-N accuracy (%).

| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| *WebPRMs (3~4B)* | | | | | | | | | | |
| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
| ⋆ WebArbiter-3B (Qwen2.5) | 93.32 | 78.42 | 81.97 | 56.22 | 78.33 | 46.67 | 81.01 | 54.81 | 83.65 | 59.06 |
| ⋆ **WebArbiter-4B (Qwen3)** | **98.55** | **94.73** | **83.21** | **61.19** | **92.50** | **83.33** | 76.68 | 50.96 | **87.73** | **72.55** |

WebArbiter-4B (Qwen3) outperforms WebArbiter-3B (Qwen2.5) on Mind2Web, WebArena, and AssistantBench, but trails it on WorkArena (76.68 vs 81.01 Pair, 50.96 vs 54.81 BoN). On average it improves BoN Acc from 59.06% to 72.55%, approaching WebArbiter-7B (Qwen2.5) at 74.60% with roughly half the parameters.

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-4B-Qwen3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; no gradients are needed for inference.
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```

**Example output:**

```xml
The user is on the DuckDuckGo homepage with a search box visible. Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.
1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?
Response 1 directly fills the search query into the textbox, which is the most direct path to completing the search task. Response 2 clicks an irrelevant link that does not contribute to the search goal.
Response 1
```

## Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | — |
| Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Stage 1 checkpoint |
| Fine-tuning | LoRA | FSDP + LoRA |
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |

All variants use the same training data, distillation strategy, and RL procedure; only backbone-specific hyperparameters differ.
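
Stage 2 uses GRPO with binary verifiable rewards. The following is a minimal sketch of what such a reward could look like, not the training code itself. It assumes the verdict can be read as the last `Response 1` / `Response 2` mention in a generation (as in the example output) and that each preference pair is labeled 1 or 2. The names `parse_verdict` and `binary_reward` are hypothetical, and the actual reward may add format checks or read a dedicated verdict section.

```python
import re
from typing import Optional

# Assumption: the generation ends with "Response 1" or "Response 2",
# as in the example output. This is a heuristic, not the official parser.
_VERDICT = re.compile(r"Response\s*([12])\b")


def parse_verdict(generation: str) -> Optional[int]:
    """Return 1 or 2 from the last 'Response N' mention, or None if absent."""
    matches = _VERDICT.findall(generation)
    return int(matches[-1]) if matches else None


def binary_reward(generation: str, preferred: int) -> float:
    """Return 1.0 when the parsed verdict equals the labeled preference, else 0.0."""
    return 1.0 if parse_verdict(generation) == preferred else 0.0
```

Taking the last mention matters because the analysis text usually names both responses before the verdict; a generation with no recognizable verdict scores 0.0 under this sketch.
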
See the [paper](https://arxiv.org/abs/2601.21872) (Appendix C) for full details.

## Intended Uses

WebArbiter-4B-Qwen3 is designed to:

- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred.
- **Support resource-efficient deployment**: Deliver strong performance at 4B parameters, approaching 7B-level accuracy with roughly half the parameters.

## Limitations

- **Text-only observations**: Relies on accessibility tree representations without visual observations.
- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
- **Safe-action bias**: May sometimes overvalue cautious actions because the accessibility tree does not encode interaction effects.

## License

This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the license of the base model [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B).

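
## Best-of-N Selection Sketch

Intended Uses lists Best-of-N sampling as one application of the pairwise verdict. One simple way to turn pairwise preferences into a single pick is a sequential knockout, sketched below. The `judge` callable is hypothetical: it stands for one model call that returns 1 if the first action is preferred and 2 if the second is. Whether this matches the protocol behind the reported BoN accuracy is not stated here, so treat it as an illustration only.

```python
from typing import Callable, Sequence

# Hypothetical interface: judge(first, second) wraps one PRM call and
# returns 1 if the first action is preferred, 2 if the second is.
Judge = Callable[[str, str], int]


def best_of_n(candidates: Sequence[str], judge: Judge) -> str:
    """Pick one candidate using N-1 pairwise comparisons (sequential knockout)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep the current best unless the judge prefers the challenger.
        if judge(best, challenger) == 2:
            best = challenger
    return best
```

A knockout costs N-1 model calls for N candidates. In practice `judge` would build the evaluation prompt from the intent, AXTree, trajectory, and the two candidate actions, run generation as in Quick Start, and parse the verdict.
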
## Related Resources

| Resource | Link |
|----------|------|
| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
| WebArbiter-7B (Qwen2.5) | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
| WebArbiter-3B (Qwen2.5) | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

## Citation

```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```