# EAI6080_Fall2025 - LLM Benchmarking Project

Course project at Northeastern University in EAI 6080, Fall 2025.
## Overview
This project automates benchmarking of various LLMs and Agentic AI systems across multiple evaluation benchmarks. The system follows a modular architecture where each benchmark and model is implemented in its own file, making it easy to add new benchmarks or models.
## Installation
- Clone the repository:

```shell
git clone https://github.com/sinitskiy/EAI6080_Fall2025.git
cd EAI6080_Fall2025
```

- Create and activate a virtual environment:
Windows PowerShell:

```shell
# Create virtual environment
python -m venv .venv

# Activate virtual environment
.\.venv\Scripts\Activate.ps1

# If you get an execution policy error, run:
# Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Linux/Mac:
```shell
# Create virtual environment
python -m venv .venv

# Activate virtual environment
source .venv/bin/activate
```

- Install dependencies:
```shell
pip install -r requirements.txt
```

- Set up API keys as environment variables:
```shell
# Windows PowerShell
$env:OPENAI_API_KEY="your-openai-api-key"
$env:GOOGLE_API_KEY="your-google-api-key"

# Linux/Mac
export OPENAI_API_KEY="your-openai-api-key"
export GOOGLE_API_KEY="your-google-api-key"
```

Note: Always activate your virtual environment before running scripts:
- Windows: `.\.venv\Scripts\Activate.ps1`
- Linux/Mac: `source .venv/bin/activate`

Your prompt should show `(.venv)` when the virtual environment is active.
## Usage
### Run Everything (Full Pipeline)

```shell
python main.py --all
```

This runs all four steps:
- Download benchmarks
- Run model predictions
- Evaluate correctness
- Generate summary table
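As a sketch of how these flags and filters might be dispatched inside main.py (a hypothetical structure for illustration, not the actual implementation):

```python
# Hypothetical sketch of main.py's flag handling; the real script may differ.
import argparse

STEPS = ("download", "predict", "evaluate", "summary")

def build_parser():
    p = argparse.ArgumentParser(description="LLM benchmarking pipeline")
    p.add_argument("--all", action="store_true", help="run all four steps")
    for step in STEPS:
        p.add_argument(f"--{step}", action="store_true")
    p.add_argument("--benchmarks", nargs="+", help="restrict to these benchmarks")
    p.add_argument("--models", nargs="+", help="restrict to these models")
    return p

args = build_parser().parse_args(["--all", "--benchmarks", "HLE"])
# --all enables every step; otherwise run only the explicitly requested ones
steps = [s for s in STEPS if args.all or getattr(args, s)]
```

Because the benchmark and model filters are independent of the step flags, any combination shown above is valid.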
### Run Individual Steps
```shell
# Step 1: Download benchmarks only
python main.py --download

# Step 2: Run predictions only
python main.py --predict

# Step 3: Evaluate predictions only
python main.py --evaluate

# Step 4: Generate summary only
python main.py --summary
```

### Run Specific Benchmarks or Models
```shell
# Run specific benchmarks
python main.py --download --benchmarks HLE BixBench

# Run specific models
python main.py --predict --models GPT_5_mini

# Combine filters
python main.py --all --benchmarks HLE --models GPT_5_mini
```

### View Help

```shell
python main.py --help
```

## Adding New Benchmarks
- Create a new file in `benchmarks/` (e.g., `benchmarks/MyBenchmark.py`)
- Implement the `download_benchmark()` function that returns a pandas DataFrame
- The DataFrame must have these columns:
  - `question_id`: Unique identifier
  - `question_text`: The question
  - `ground_truth`: Correct answer
  - `subset`: Category (optional)
  - `metadata`: Additional info (optional)
  - `image_path`: For multimodal benchmarks (optional)
See `benchmarks/README.md` for a template.
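As an illustration, a minimal benchmark module satisfying this column contract might look like the following (the hard-coded row is a placeholder; a real module would download its data from a dataset source):

```python
# benchmarks/MyBenchmark.py -- hypothetical example; a real module would
# fetch its data (e.g., from a dataset host) rather than hard-coding rows.
import pandas as pd

def download_benchmark():
    """Return the benchmark as a DataFrame with the required columns."""
    rows = [
        {
            "question_id": "mybench_0001",
            "question_text": "Which organelle produces most of a cell's ATP?",
            "ground_truth": "mitochondria",
            "subset": "biology",   # optional
            "metadata": "{}",      # optional
            "image_path": None,    # optional, multimodal benchmarks only
        },
    ]
    return pd.DataFrame(rows)
```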
## Adding New Models

- Create a new file in `models/` (e.g., `models/MyModel.py`)
- Implement two required functions:
  - `initialize_model()`: Set up and return the model
  - `run_predictions(model, csv_path, model_name)`: Run predictions on a CSV
See `models/README.md` for a template.
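A model module could follow this shape. The echo "model" below is a stand-in for an API client or local weights, and the `prediction_{model_name}` column convention is an assumption, not the project's confirmed schema:

```python
# models/MyModel.py -- hypothetical skeleton; the real modules call an API
# or load local weights, and may name their prediction columns differently.
import pandas as pd

def initialize_model():
    """Set up and return the model object (here, a placeholder callable)."""
    return lambda question: "placeholder answer"

def run_predictions(model, csv_path, model_name):
    """Read the benchmark CSV, fill in missing predictions, and save."""
    df = pd.read_csv(csv_path)
    col = f"prediction_{model_name}"   # assumed column-naming convention
    if col not in df.columns:
        df[col] = None
    for i, row in df.iterrows():
        if pd.isna(df.at[i, col]):     # skip rows answered in a previous run
            df.at[i, col] = model(row["question_text"])
    df.to_csv(csv_path, index=False)
```

Writing the CSV back after filling only the missing rows is what makes interrupted runs resumable.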
## Adding Custom Evaluators
By default, all benchmarks use `evaluators/default_evaluator.py`, which implements:
- Exact text matching
- Multiple choice answer extraction (handles "B", "B.", "B) because...")
- Semantic similarity matching
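The multiple-choice extraction step can be sketched with a small regex (a simplified illustration in the spirit of the default evaluator, not its exact logic):

```python
# Simplified sketch of choice-letter extraction; handles answers like
# "B", "B.", and "B) because ..." while rejecting ordinary words.
import re

def extract_choice(answer):
    """Return the leading choice letter (A-E) from an answer, or None."""
    m = re.match(r"\s*([A-Ea-e])\s*[).:]?(\s|$)", answer)
    return m.group(1).upper() if m else None
```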
To create a custom evaluator for a specific benchmark:
- Create `evaluators/{benchmark_name}_evaluator.py`
- Implement the `evaluate(csv_path)` function
See `evaluators/README.md` for details.
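A custom evaluator can be quite small. In this sketch the `prediction_` column prefix and the `correct_` result columns are assumptions for illustration:

```python
# evaluators/MyBenchmark_evaluator.py -- hypothetical skeleton; column
# names other than ground_truth are assumed conventions.
import pandas as pd

def evaluate(csv_path):
    """Score every prediction column against ground_truth, case-insensitively."""
    df = pd.read_csv(csv_path)
    pred_cols = [c for c in df.columns if c.startswith("prediction_")]
    for col in pred_cols:
        df[f"correct_{col}"] = (
            df[col].astype(str).str.strip().str.lower()
            == df["ground_truth"].astype(str).str.strip().str.lower()
        )
    df.to_csv(csv_path, index=False)
    return df
```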
## Benchmarks Included
- HLE - Healthcare and Life Sciences Evaluation (all subsets)
  - hle-gold-bio-chem - Biology and Chemistry subset
- BixBench
- SuperGPQA Medicine Hard
- HealthBench Hard
- MedXpertQA (Text and Multimodal)
- MATH-Vision
- CVQA
- LitQA2
- RAG-QA Arena Science
## Models Included

### API-Based
- GPT-5-mini (OpenAI)
- gemini-2.5-pro (Google)
### Local Models
- Qwen2.5-VL-7B-Instruct (Vision-Language)
- Qwen/Qwen2.5-14B-Instruct
- DeepSeek-R1-Distill-Qwen-7B
## Output
Results are saved in `data/results/`:

- `summary_table.csv` - Accuracy scores for all models and benchmarks
- `detailed_results.csv` - Per-question breakdown
- `summary_report.md` - Human-readable markdown report
Benchmark CSVs with predictions are saved in `data/benchmarks/{benchmark_name}/questions.csv`.
## Development Workflow
- Download benchmarks once (they're cached locally)
- Run predictions incrementally (existing predictions are skipped)
- Evaluate after all predictions are complete
- Generate summary to see final results
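The "download once, then reuse the cache" step can be sketched as follows (the directory layout matches the output paths above; the helper name is hypothetical):

```python
# Sketch of local benchmark caching; get_benchmark is an illustrative
# helper, not a function from the project's codebase.
import os
import pandas as pd

def get_benchmark(name, download_fn, data_dir="data/benchmarks"):
    """Load a cached benchmark CSV, downloading only on the first run."""
    path = os.path.join(data_dir, name, "questions.csv")
    if os.path.exists(path):               # cached: skip the download
        return pd.read_csv(path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df = download_fn()                     # e.g., a module's download_benchmark
    df.to_csv(path, index=False)
    return df
```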
Progress is logged to both the console and `benchmarking.log`.
## Platform Support
This project supports multiple deployment environments:
- Windows laptops (local development with optional GPU)
- Mac laptops (Apple Silicon M1/M2/M3/M4 with MPS)
- Northeastern University HPC cluster (SLURM with NVIDIA A100 GPUs)
See `manual_4_LLM_installation.md` for platform-specific setup instructions.
## Notes
- Data files are not uploaded to GitHub (excluded in `.gitignore`)
- Predictions are saved incrementally, so interrupted runs can be resumed
- API rate limits are handled with retries and exponential backoff
- Large models may require quantization for local deployment
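The retry-with-exponential-backoff behavior mentioned above can be sketched as a small decorator (the attempt count and delays here are illustrative, not the project's actual settings):

```python
# Illustrative retry decorator with exponential backoff; the pipeline's
# real retry parameters may differ.
import functools
import time

def with_retries(max_attempts=5, base_delay=1.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise                              # out of retries
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator
```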