
Table: Evaluation results on WEB-RewardBench. T: text observation, I: image observation.
The table above reports the evaluation results on WEB-RewardBench. As shown, state-of-the-art MLLMs struggle to provide reliable rewards for web navigation tasks. This limitation is particularly evident in the trajectory accuracy metric, under which models frequently fail to assign correct rewards consistently at every time step within a single task. In contrast, Web-Shepherd significantly outperforms all baselines, demonstrating a substantial performance gap across all benchmark settings.
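Trajectory accuracy, as described, is stricter than per-step accuracy: a trajectory only counts as correct if the predicted reward matches the gold reward at every step. A minimal sketch of the metric (the data format, a list of (predicted, gold) step-reward pairs per trajectory, is a hypothetical illustration, not the benchmark's actual schema):

```python
def trajectory_accuracy(trajectories):
    """Fraction of trajectories whose predicted reward matches the
    gold reward at *every* time step.

    trajectories: list of trajectories, each a list of
    (predicted_reward, gold_reward) pairs, one per step.
    """
    correct = sum(
        all(pred == gold for pred, gold in traj)
        for traj in trajectories
    )
    return correct / len(trajectories)

# One wrong step fails the whole trajectory, so step-level accuracy
# can look high while trajectory accuracy stays low.
trajs = [
    [(1, 1), (0, 0), (1, 1)],  # all steps match  -> correct
    [(1, 1), (1, 0), (0, 0)],  # one mismatch     -> incorrect
]
print(trajectory_accuracy(trajs))  # -> 0.5
```

This strictness explains why a model with reasonable per-step reward accuracy can still score poorly on the trajectory metric.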
The table also demonstrates that both the baselines and our models benefit significantly from the checklist when assigning rewards. Checklists lead to more accurate and consistent reward assignments, as evidenced by improvements in trajectory accuracy across all baselines. These results suggest that checklists serve as valuable guidance, helping models maintain coherence in predicting the process reward.