2025-06-06 · HuggingFace

ScreenSuite - The most comprehensive evaluation suite for GUI Agents!



Source: HuggingFace
Date: 2025-06-06
URL: https://huggingface.co/blog/screensuite

Summary

Benchmark release: HF ScreenSuite unifies 13 GUI agent benchmarks across perception/grounding, single-step actions, and multi-step agents. Vision-only evaluation (no accessibility trees or DOM) for more realistic assessment. Includes Dockerized Ubuntu Desktop and Android environments plus E2B sandbox support. Top performers: Qwen2.5-VL-72B and GPT-4o on localization tasks. 30-second quickstart via uv.
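To make "localization tasks" concrete, here is a minimal sketch of one common way to score vision-only grounding: the model sees only a screenshot and an instruction, predicts a click point, and gets credit if the point lands inside the target element's bounding box. The `BBox` schema, the function name, and the sample data are hypothetical illustrations, not ScreenSuite's actual API or metric definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Target element bounds in screenshot pixels (hypothetical schema)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges count as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def localization_accuracy(predictions, targets):
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Made-up example data: three predicted clicks against three target boxes.
clicks = [(50, 20), (300, 400), (10, 10)]
boxes = [BBox(40, 10, 120, 40), BBox(0, 0, 100, 100), BBox(0, 0, 32, 32)]
score = localization_accuracy(clicks, boxes)
```

Because the input is pixels only, nothing in this scoring depends on a DOM or accessibility tree; the second click misses its box, so two of the three predictions count as hits.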

Implications

Thread: agentic patterns / open-weights ecosystem health. ScreenSuite’s vision-only design (no accessibility tree) is a deliberate realism constraint: real-world GUI agents often can’t rely on structured DOM access. Unifying 13 benchmarks under one harness leaves less room to cherry-pick favorable evals. The inclusion of multi-step agent benchmarks (AndroidWorld, OSWorld, BrowseComp) alongside perception tasks distinguishes it from purely static evals. The parity between Qwen2.5-VL-72B and GPT-4o on localization is notable: it suggests open-weights models are competitive at the top of the GUI grounding capability distribution.
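The cherry-picking point can be illustrated with a small sketch: a harness that always runs every registered benchmark and reports all scores together leaves no opening to quietly omit an unfavorable result. The benchmark names, the scoring functions, and `run_suite` are hypothetical placeholders, not ScreenSuite's actual interface.

```python
from statistics import mean


def run_suite(benchmarks, model):
    """Run every registered benchmark and report all scores, none omitted."""
    scores = {name: evaluate(model) for name, evaluate in benchmarks.items()}
    return {"per_benchmark": scores, "macro_average": mean(scores.values())}


# Toy stand-ins for the three task families; each returns a fixed score.
toy_benchmarks = {
    "grounding_toy": lambda model: 0.5,
    "single_step_toy": lambda model: 0.25,
    "multi_step_toy": lambda model: 0.75,
}

report = run_suite(toy_benchmarks, model=None)
```

The unweighted macro average is one simple aggregation choice assumed here; a real suite might weight benchmarks differently or report families separately.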
