# Web Evals Dashboard
A unified dashboard and API for exploring and comparing web-based AI agent evaluation benchmarks. Aggregates 4,725+ tasks from 10 major benchmarks into a single interface.
Learn more: Browser Automation & Web Evals Overview
## Benchmarks Included
- GAIA (466 tasks) - General AI Assistant benchmark with 3 difficulty levels
- Mind2Web (1,009 tasks) - Real-world website interaction tasks
- Mind2Web2 (130 tasks) - Updated version with domain categorization
- BrowseComp (1,266 tasks) - Web browsing comprehension tasks
- WebArena (812 tasks) - Realistic web navigation scenarios
- WebVoyager (643 tasks) - Long-horizon web navigation tasks
- REAL (113 tasks) - Real-world web agent challenges with difficulty ratings
- Bearcubs (111 tasks) - Web agent evaluation tasks
- Agent-Company (175 tasks) - Domain-specific company tasks
- OSWorld (400+ tasks) - Desktop application automation (Chrome, GIMP, LibreOffice, VS Code, etc.)
## Features
- Interactive Streamlit Dashboard - Filter, sort, and explore tasks
- REST API - Programmatic access with full filtering and pagination
- Unified Schema - Normalized data structure across all benchmarks
- Advanced Filtering - By benchmark, difficulty, domain, website/app, and more
- Task Search - Full-text search across task descriptions
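The REST API's filtering and pagination can also be driven from Python. Below is a minimal sketch using only the standard library; the base URL, endpoint paths, and parameter names follow the curl examples later in this README:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def build_url(path: str, **params) -> str:
    """Compose an API URL from a path and optional query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}{path}?{query}" if query else f"{BASE}{path}"

def get_json(path: str, **params):
    """Fetch and decode a JSON response (requires the API server to be running)."""
    with urlopen(build_url(path, **params)) as resp:
        return json.load(resp)

# Equivalent of curl'ing the paginated WebArena query:
url = build_url("/tasks", benchmark="webarena", limit=50, offset=0)
# -> http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0
```

`get_json` is only usable once `python api.py` is running; `build_url` works standalone.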
## Quick Start

```bash
# Clone and install dependencies
git clone https://github.com/josancamon19/web-evals.git
cd web-evals
pip install -r requirements.txt

# Launch the Streamlit dashboard
streamlit run main.py

# Or start the FastAPI server
python api.py
# API docs: http://localhost:8000/docs
```

## API Examples
```bash
# Get all GAIA Level 1 tasks
curl "http://localhost:8000/tasks?benchmark=gaia&Level=1.0"

# Search for tasks about email
curl "http://localhost:8000/search?q=email&limit=10"

# Get WebArena tasks with pagination
curl "http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0"

# Get a task by ID
curl "http://localhost:8000/tasks/4480"
```

## Project Structure
```
├── main.py              # Streamlit dashboard
├── api.py               # FastAPI REST API
├── shared.py            # Data loading & normalization
├── requirements.txt     # Python dependencies
├── GAIA/                # GAIA benchmark data
├── WebVoyager/          # WebVoyager benchmark data
├── webarena/            # WebArena benchmark data
├── real-evals-agi/      # REAL benchmark data
├── OsWorld/             # OSWorld benchmark data
├── mind2web2/           # Mind2Web2 benchmark data
├── agent-company/       # Agent-Company benchmark data
├── bearcubs/            # Bearcubs benchmark data
└── openai-simple-evals/ # BrowseComp benchmark data
```
## Data Schema

All tasks are normalized to a common format:

- `task_id` - Unique identifier
- `Question` - Task description/instruction
- `benchmark` - Source benchmark name
- `web_name` - Target website/application
- `domain` / `subdomain` - Task categorization (when available)
- `difficulty` / `Level` - Task difficulty (benchmark-specific)
- `web_url` - Starting URL (when available)
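For illustration, a normalized record could be modeled as a dataclass. The class below is a sketch based on the field list above, not code taken from the repository, and the sample values are made up:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    # Field names mirror the unified schema; optional fields default to None
    # because not every benchmark provides them.
    task_id: str
    Question: str
    benchmark: str
    web_name: Optional[str] = None
    domain: Optional[str] = None
    subdomain: Optional[str] = None
    difficulty: Optional[str] = None
    web_url: Optional[str] = None

# Hypothetical example record:
task = Task(
    task_id="example-1",
    Question="Find the cheapest nonstop flight for next Friday",
    benchmark="webvoyager",
    web_name="Google Flights",
)
```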
## Contributing

To add a new benchmark:

- Add data to an appropriate directory
- Update the `DATASETS` config in `shared.py`, implementing a loader function if needed
- Add normalization logic in `normalize_task_data()`
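As a sketch of the second step, a `DATASETS` entry might pair a data directory with a loader. The exact shape of the config and the loader signature in `shared.py` may differ; this is an illustrative assumption:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Generic loader: parse one JSON object per line, skipping blanks."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical config entry for a new benchmark.
DATASETS = {
    "mybenchmark": {
        "dir": "mybenchmark/",
        "file": "tasks.jsonl",
        "loader": load_jsonl,
    },
}
```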
## License
Data sources retain their original licenses. See individual benchmark repositories for details.
Created October 12, 2025
Updated October 12, 2025
