
josancamon19/web-evals

An aggregated collection of browser/computer-use agent evals

Web Evals Dashboard

A unified dashboard and API for exploring and comparing web-based AI agent evaluation benchmarks. Aggregates 4,725+ tasks from 10 major benchmarks into a single interface.

Dashboard Screenshot

📚 Learn more: Browser Automation & Web Evals Overview

Benchmarks Included

  • GAIA (466 tasks) - General AI Assistant benchmark with 3 difficulty levels
  • Mind2Web (1,009 tasks) - Real-world website interaction tasks
  • Mind2Web2 (130 tasks) - Updated version with domain categorization
  • BrowseComp (1,266 tasks) - Web browsing comprehension tasks
  • WebArena (812 tasks) - Realistic web navigation scenarios
  • WebVoyager (643 tasks) - Long-horizon web navigation tasks
  • REAL (113 tasks) - Real-world web agent challenges with difficulty ratings
  • Bearcubs (111 tasks) - Web agent evaluation tasks
  • Agent-Company (175 tasks) - Domain-specific company tasks
  • OSWorld (400+ tasks) - Desktop application automation (Chrome, GIMP, LibreOffice, VS Code, etc.)

Features

  • Interactive Streamlit Dashboard - Filter, sort, and explore tasks
  • REST API - Programmatic access with full filtering and pagination
  • Unified Schema - Normalized data structure across all benchmarks
  • Advanced Filtering - By benchmark, difficulty, domain, website/app, and more
  • Task Search - Full-text search across task descriptions

Quick Start

# Clone the repo and install dependencies
git clone https://github.com/josancamon19/web-evals.git
cd web-evals
pip install -r requirements.txt

# Launch Streamlit dashboard
streamlit run main.py

# Or start the FastAPI server
python api.py
# API docs: http://localhost:8000/docs

API Examples

# Get all GAIA Level 1 tasks
curl "http://localhost:8000/tasks?benchmark=gaia&Level=1.0"

# Search for tasks about email
curl "http://localhost:8000/search?q=email&limit=10"

# Get WebArena tasks with pagination
curl "http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0"

# Get task by ID
curl "http://localhost:8000/tasks/4480"
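The same queries can be issued programmatically. A minimal helper for building the query URLs used in the curl examples above (the `tasks_url` function and `BASE` constant are illustrative, not part of the API):

```python
# Build /tasks query URLs from keyword filters, mirroring the curl examples.
from urllib.parse import urlencode

BASE = "http://localhost:8000"  # default FastAPI address from the Quick Start

def tasks_url(**filters):
    """Return a /tasks URL with the given filters (benchmark, limit, offset, ...)."""
    return f"{BASE}/tasks?{urlencode(filters)}"

print(tasks_url(benchmark="webarena", limit=50, offset=0))
# http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0
```

Pass the resulting URL to any HTTP client (e.g. `requests.get`) to retrieve the filtered task list as JSON.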

Project Structure

├── main.py              # Streamlit dashboard
├── api.py               # FastAPI REST API
├── shared.py            # Data loading & normalization
├── requirements.txt     # Python dependencies
├── GAIA/                # GAIA benchmark data
├── WebVoyager/          # WebVoyager benchmark data
├── webarena/            # WebArena benchmark data
├── real-evals-agi/      # REAL benchmark data
├── OsWorld/             # OSWorld benchmark data
├── mind2web2/           # Mind2Web2 benchmark data
├── agent-company/       # Agent-Company benchmark data
├── bearcubs/            # Bearcubs benchmark data
└── openai-simple-evals/ # BrowseComp benchmark data

Data Schema

All tasks are normalized to a common format:

  • task_id - Unique identifier
  • Question - Task description/instruction
  • benchmark - Source benchmark name
  • web_name - Target website/application
  • domain / subdomain - Task categorization (when available)
  • difficulty / Level - Task difficulty (benchmark-specific)
  • web_url - Starting URL (when available)
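As a rough sketch, a normalized record can be checked against the field list above. The `is_normalized` helper and the sample values are made up for illustration; only the field names come from the schema:

```python
# Validate a task dict against the unified schema described above.
REQUIRED = {"task_id", "Question", "benchmark"}
OPTIONAL = {"web_name", "domain", "subdomain", "difficulty", "Level", "web_url"}

def is_normalized(task: dict) -> bool:
    """True if the record has all required fields and no unknown keys."""
    keys = set(task)
    return REQUIRED <= keys and keys <= REQUIRED | OPTIONAL

# Sample record (values are invented for this example)
sample = {
    "task_id": "gaia-0001",
    "Question": "Find the publication year of the cited paper.",
    "benchmark": "gaia",
    "Level": 1.0,
}
assert is_normalized(sample)
```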

Contributing

To add a new benchmark:

  1. Add data to appropriate directory
  2. Update DATASETS config in shared.py
  3. Implement loader function if needed
  4. Add normalization logic in normalize_task_data()
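Steps 3 and 4 might look like the sketch below. The benchmark name, raw field names, and function bodies are hypothetical; the real `DATASETS` config and `normalize_task_data()` in shared.py may be structured differently:

```python
# Hypothetical loader + normalizer for a new benchmark stored as JSON Lines.
import json

def load_mybenchmark(path):
    """Step 3: read raw task records from a JSONL file (one task per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def normalize_mybenchmark(raw):
    """Step 4: map raw fields onto the unified schema used by the dashboard."""
    return {
        "task_id": str(raw["id"]),          # raw field names are assumptions
        "Question": raw["instruction"],
        "benchmark": "mybenchmark",
        "web_name": raw.get("site"),
        "difficulty": raw.get("level"),
        "web_url": raw.get("start_url"),
    }
```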

License

Data sources retain their original licenses. See individual benchmark repositories for details.
