
josancamon19/web-evals

An aggregated collection of browser/computer-use agent evals

Web Evals Dashboard

A unified dashboard and API for exploring and comparing web-based AI agent evaluation benchmarks. Aggregates 4,725+ tasks from 10 major benchmarks into a single interface.

Dashboard Screenshot

📚 Learn more: Browser Automation & Web Evals Overview

Benchmarks Included

  • GAIA (466 tasks) - General AI Assistant benchmark with 3 difficulty levels
  • Mind2Web (1,009 tasks) - Real-world website interaction tasks
  • Mind2Web2 (130 tasks) - Updated version with domain categorization
  • BrowseComp (1,266 tasks) - Web browsing comprehension tasks
  • WebArena (812 tasks) - Realistic web navigation scenarios
  • WebVoyager (643 tasks) - Long-horizon web navigation tasks
  • REAL (113 tasks) - Real-world web agent challenges with difficulty ratings
  • Bearcubs (111 tasks) - Web agent evaluation tasks
  • Agent-Company (175 tasks) - Domain-specific company tasks
  • OSWorld (400+ tasks) - Desktop application automation (Chrome, GIMP, LibreOffice, VS Code, etc.)

Features

  • Interactive Streamlit Dashboard - Filter, sort, and explore tasks
  • REST API - Programmatic access with full filtering and pagination
  • Unified Schema - Normalized data structure across all benchmarks
  • Advanced Filtering - By benchmark, difficulty, domain, website/app, and more
  • Task Search - Full-text search across task descriptions

Quick Start

# Clone the repo and install dependencies
git clone https://github.com/josancamon19/web-evals.git
cd web-evals
pip install -r requirements.txt

# Launch Streamlit dashboard
streamlit run main.py

# Or start the FastAPI server
python api.py
# API docs: http://localhost:8000/docs

API Examples

# Get all GAIA Level 1 tasks
curl "http://localhost:8000/tasks?benchmark=gaia&Level=1.0"

# Search for tasks about email
curl "http://localhost:8000/search?q=email&limit=10"

# Get WebArena tasks with pagination
curl "http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0"

# Get task by ID
curl "http://localhost:8000/tasks/4480"
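The same queries can be issued programmatically. A minimal helper for building the query URLs used in the curl examples above (the `tasks_url` function and `BASE` constant are illustrative, not part of the API):

```python
# Build /tasks query URLs from keyword filters, mirroring the curl examples.
from urllib.parse import urlencode

BASE = "http://localhost:8000"  # default FastAPI address from the Quick Start

def tasks_url(**filters):
    """Return a /tasks URL with the given filters (benchmark, limit, offset, ...)."""
    return f"{BASE}/tasks?{urlencode(filters)}"

print(tasks_url(benchmark="webarena", limit=50, offset=0))
# http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0
```

Pass the resulting URL to any HTTP client (e.g. `requests.get`) to retrieve the filtered task list as JSON.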

Project Structure

├── main.py              # Streamlit dashboard
├── api.py               # FastAPI REST API
├── shared.py            # Data loading & normalization
├── requirements.txt     # Python dependencies
├── GAIA/                # GAIA benchmark data
├── WebVoyager/          # WebVoyager benchmark data
├── webarena/            # WebArena benchmark data
├── real-evals-agi/      # REAL benchmark data
├── OsWorld/             # OSWorld benchmark data
├── mind2web2/           # Mind2Web2 benchmark data
├── agent-company/       # Agent-Company benchmark data
├── bearcubs/            # Bearcubs benchmark data
└── openai-simple-evals/ # BrowseComp benchmark data

Data Schema

All tasks are normalized to a common format:

  • task_id - Unique identifier
  • Question - Task description/instruction
  • benchmark - Source benchmark name
  • web_name - Target website/application
  • domain / subdomain - Task categorization (when available)
  • difficulty / Level - Task difficulty (benchmark-specific)
  • web_url - Starting URL (when available)
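As a rough sketch, a normalized record can be checked against the field list above. The `is_normalized` helper and the sample values are made up for illustration; only the field names come from the schema:

```python
# Validate a task dict against the unified schema described above.
REQUIRED = {"task_id", "Question", "benchmark"}
OPTIONAL = {"web_name", "domain", "subdomain", "difficulty", "Level", "web_url"}

def is_normalized(task: dict) -> bool:
    """True if the record has all required fields and no unknown keys."""
    keys = set(task)
    return REQUIRED <= keys and keys <= REQUIRED | OPTIONAL

# Sample record (values are invented for this example)
sample = {
    "task_id": "gaia-0001",
    "Question": "Find the publication year of the cited paper.",
    "benchmark": "gaia",
    "Level": 1.0,
}
assert is_normalized(sample)
```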

Contributing

To add a new benchmark:

  1. Add data to appropriate directory
  2. Update DATASETS config in shared.py
  3. Implement loader function if needed
  4. Add normalization logic in normalize_task_data()
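Steps 3 and 4 might look like the sketch below. The benchmark name, raw field names, and function bodies are hypothetical; the real `DATASETS` config and `normalize_task_data()` in shared.py may be structured differently:

```python
# Hypothetical loader + normalizer for a new benchmark stored as JSON Lines.
import json

def load_mybenchmark(path):
    """Step 3: read raw task records from a JSONL file (one task per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def normalize_mybenchmark(raw):
    """Step 4: map raw fields onto the unified schema used by the dashboard."""
    return {
        "task_id": str(raw["id"]),          # raw field names are assumptions
        "Question": raw["instruction"],
        "benchmark": "mybenchmark",
        "web_name": raw.get("site"),
        "difficulty": raw.get("level"),
        "web_url": raw.get("start_url"),
    }
```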

License

Data sources retain their original licenses. See individual benchmark repositories for details.
