# Reasoning LLM Evaluation Framework

`rachittshah/reasoning_evals`: evals for reasoning models such as o3-mini-high, o1, and DeepSeek-R1.
A comprehensive framework for evaluating reasoning capabilities of Large Language Models (LLMs) across multiple tasks and scenarios.
## Features
- Multi-model evaluation support (OpenAI, DeepSeek)
- Synthetic data generation for diverse reasoning tasks
- Automated evaluation pipeline
- Performance metrics and visualization
- Comprehensive logging and error handling
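To illustrate what synthetic data generation for a reasoning task can look like, here is a minimal sketch of a generator for two-step arithmetic problems. The function names and record schema (`question`/`answer` keys) are illustrative assumptions, not this repository's actual API.

```python
import random


def generate_arithmetic_problem(rng: random.Random) -> dict:
    """Generate one synthetic two-step arithmetic reasoning problem.

    The schema is illustrative only; the real generator in this
    repository may use a different format.
    """
    a, b, c = rng.randint(2, 50), rng.randint(2, 50), rng.randint(2, 9)
    question = (
        f"Start with {a}, add {b}, then multiply the result by {c}. "
        "What do you get?"
    )
    return {"question": question, "answer": (a + b) * c}


def generate_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate a reproducible list of n synthetic problems."""
    rng = random.Random(seed)
    return [generate_arithmetic_problem(rng) for _ in range(n)]
```

Seeding the generator keeps datasets reproducible across runs, which matters when comparing models against each other.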
## Project Structure

```
reasoning_evals/
├── api/            # API integrations for different LLM providers
├── tasks/          # Task definitions and implementations
├── evaluation/     # Evaluation metrics and scoring
├── utils/          # Utility functions and helpers
├── config/         # Configuration files
├── data/           # Data storage
│   ├── raw/        # Raw input data
│   ├── processed/  # Processed data
│   └── synthetic/  # Generated synthetic data
├── results/        # Evaluation results and visualizations
└── tests/          # Test suite
```
## Setup

- Clone the repository
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env  # add your API keys to .env
  ```
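The exact variable names are defined in `.env.example`; for OpenAI and DeepSeek access they would typically look something like the following (the names below are assumptions, so check `.env.example` for the real ones):

```bash
# .env -- variable names are illustrative; copy from .env.example
OPENAI_API_KEY=sk-...
DEEPSEEK_API_KEY=...
```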
## Usage

- Configure evaluation settings in `config/config.yaml`
- Run evaluations:

  ```python
  from reasoning_evals.evaluation import run_evaluation

  results = run_evaluation(task_name="math_reasoning")
  ```
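`run_evaluation` is this repository's entry point; its internals are not shown here, but the core scoring loop of such a pipeline can be sketched as follows. The `exact_match` scorer and the record format are illustrative assumptions, not the repo's actual implementation.

```python
from typing import Callable


def exact_match(prediction: str, reference: str) -> bool:
    """Score one prediction by case-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(records: list[dict], model: Callable[[str], str]) -> float:
    """Run `model` on each record and return exact-match accuracy."""
    if not records:
        return 0.0
    correct = sum(
        exact_match(model(r["question"]), r["answer"]) for r in records
    )
    return correct / len(records)
```

Treating the model as a plain callable from prompt to answer string lets API-backed clients and local stubs be swapped interchangeably in tests.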
## Task Types
- STEM Problem Solving
- Logical Reasoning & Puzzle Solving
- Code Generation & Debugging
- Decision-Making & Planning
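Each task type above pairs prompts with reference answers and a scoring rule. As a rough sketch of how a single evaluation item might be represented (the actual definitions live in `tasks/`; this dataclass and its fields are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class TaskExample:
    """One evaluation item: a prompt plus its reference answer."""

    task_name: str   # e.g. "math_reasoning" or "code_debugging"
    prompt: str
    reference: str
    metadata: dict = field(default_factory=dict)


# A STEM-style example in this hypothetical schema:
stem_example = TaskExample(
    task_name="math_reasoning",
    prompt="A train travels 120 km in 1.5 hours. What is its average speed in km/h?",
    reference="80",
)
```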
## Contributing
- Fork the repository
- Create a feature branch
- Submit a pull request
## License

MIT License