Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Anonymous Authors
Note that this project page is fully anonymized. Some links might not be available due to anonymization.

Performance and cost-efficiency of Web-Shepherd (3B). Web-Shepherd achieves state-of-the-art performance while requiring significantly lower cost than existing baselines.

Introduction

Web navigation is a unique domain: it can automate many repetitive real-life tasks, yet it is challenging because it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks.

Yet, specialized reward models for web navigation that can be used during both training and test time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have used MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose Web-Shepherd, the first process reward model (PRM) that can assess web navigation trajectories at the step level. To achieve this, we first construct the WEBPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. We also introduce WEB-RewardBench, the first meta-evaluation benchmark for evaluating PRMs in web navigation. In our experiments, we observe that Web-Shepherd achieves about 30 points higher accuracy than GPT-4o on WEB-RewardBench.

Furthermore, when testing on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.3 points higher performance at 10 times lower cost compared to using GPT-4o-mini as the verifier.
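To make the verifier setup concrete, below is a minimal sketch of reward-guided best-of-n action selection. The helpers `policy_propose` and `prm_score` are hypothetical placeholders, not part of any released API: the policy proposes several candidate actions and the PRM keeps the highest-scoring one.

```python
# A minimal sketch of best-of-n action selection with a PRM as verifier.
# `policy_propose` and `prm_score` are hypothetical placeholders; replace them
# with real calls to the policy model and the reward model.

def policy_propose(task: str, observation: str, n: int) -> list[str]:
    """Sample n candidate next actions from the policy (e.g., GPT-4o-mini).
    Placeholder implementation for illustration only."""
    return [f"candidate_action_{i}" for i in range(n)]


def prm_score(task: str, checklist: list[str], history: list[str],
              observation: str, action: str) -> float:
    """Score one candidate action with the process reward model.
    Placeholder implementation for illustration only."""
    return 0.0


def select_action(task: str, checklist: list[str], history: list[str],
                  observation: str, n: int = 8) -> str:
    """Propose n candidate actions and keep the one the verifier scores highest."""
    candidates = policy_propose(task, observation, n)
    return max(candidates,
               key=lambda a: prm_score(task, checklist, history, observation, a))
```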

WEBPRM Collection

Overview


An overview of the dataset collection process of WEBPRM

Building process reward models (PRMs) for web agents presents a core challenge: the lack of a high-quality, task-aligned dataset. To address this, we introduce the WEBPRM Collection, a dataset explicitly designed for training PRMs for web agents.

We collect expert demonstrations from trained annotators across websites accessible via Playwright, based on the Mind2Web benchmark. All annotators undergo a three-hour training session to ensure high-quality and consistent behavior modeling. Each interaction is reviewed by a panel of human evaluators, and we filter out ambiguous or irreproducible samples.
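For illustration, one step-level preference example with its annotated checklist could be represented roughly as follows; the field names below are assumptions made for exposition, not the released data format.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical schema for one step-level preference example; the field
    names are illustrative assumptions, not the released data format."""
    instruction: str         # the user task, e.g. a booking or search request
    checklist: list[str]     # annotated subgoals the trajectory should satisfy
    history: list[str]       # actions taken so far in the demonstration
    observation: str         # current page observation (text and/or screenshot)
    chosen_action: str       # the expert (preferred) next action
    rejected_action: str     # a lower-quality alternative at the same step
```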

Web-Shepherd

Overview


An overview of the Web-Shepherd

We introduce Web-Shepherd, a process reward model designed to provide dense and reliable supervision to web agents and enable more informative credit assignment.

We train Web-Shepherd on the WEBPRM Collection to support two key functionalities: (1) generating task-specific checklists, and (2) assigning rewards based on checklist completion.
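A rough sketch of how these two stages could compose at inference time is shown below; `generate_checklist` and `judge_item` are hypothetical placeholders standing in for calls to the reward model, and averaging item scores is an illustrative aggregation choice, not necessarily the exact scoring rule used by Web-Shepherd.

```python
# A rough sketch of checklist-conditioned reward scoring. `generate_checklist`
# and `judge_item` are hypothetical placeholders for reward-model calls;
# averaging per-item scores is an illustrative aggregation rule.

def generate_checklist(instruction: str) -> list[str]:
    """Stage 1: decompose the instruction into subgoals (placeholder)."""
    return ["subgoal 1", "subgoal 2", "subgoal 3"]


def judge_item(item: str, history: list[str], observation: str,
               action: str) -> float:
    """Stage 2: estimate how likely this checklist item is satisfied after the
    candidate action (placeholder returning a value in [0, 1])."""
    return 0.0


def checklist_reward(instruction: str, history: list[str],
                     observation: str, action: str) -> float:
    """Aggregate per-item judgments into a scalar process reward."""
    checklist = generate_checklist(instruction)
    scores = [judge_item(item, history, observation, action) for item in checklist]
    return sum(scores) / len(scores)
```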

Main Results


Evaluation results on WEB-RewardBench. T: text observation, I: image observation

The table above reports the evaluation results on WEB-RewardBench. As shown, state-of-the-art MLLMs struggle to provide reliable rewards for web navigation tasks. This limitation is particularly evident in the trajectory accuracy metric, where models frequently fail to assign correct rewards consistently at each time step within a single task. In contrast, Web-Shepherd significantly outperforms all baselines, with a substantial performance gap across all benchmark settings.
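As a concrete reading of this metric, the sketch below computes trajectory-level accuracy from assumed per-step correctness judgments: a task counts as correct only when the reward model is correct at every one of its steps.

```python
# A concrete reading of the trajectory accuracy metric: a task counts as
# correct only if the reward model is correct at every step. The per-step
# correctness judgments are assumed to be given.

def trajectory_accuracy(per_step_correct: list[list[bool]]) -> float:
    """per_step_correct[t][s] is True iff the reward model was correct at
    step s of task t; a task contributes 1 only when all its steps are correct."""
    per_task = [all(steps) for steps in per_step_correct]
    return sum(per_task) / len(per_task)


# Example: three tasks; only the first is judged correctly at every step.
print(trajectory_accuracy([[True, True, True], [True, False], [False, True]]))  # ~0.33
```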

The table above also demonstrates that both the baselines and our models benefit significantly from the checklist when assigning rewards. Checklists lead to more accurate and consistent reward assignments, as evidenced by improvements in trajectory accuracy across all baselines. These results suggest that checklists serve as valuable guidance, helping models maintain coherence in predicting the process reward.