TimeLovercc/code-evaluator
A codebase for code evaluation.
🚀 Code Evaluator
A comprehensive, production-ready framework for evaluating code generation models on programming benchmarks. Designed for researchers and practitioners working with LLMs for code generation tasks.
Inspired by CURE
📋 Table of Contents
- Features
- Architecture
- Installation
- Quick Start
- Datasets
- Configuration
- Usage Examples
- Output Format
- Performance
- Troubleshooting
- Contributing
- Citation
✨ Features
Core Capabilities
- 🔧 Multiple Inference Backends: Seamlessly switch between vLLM (local) and API-based models (OpenAI, Anthropic, etc.)
- ⚡ High-Performance Execution: Distributed GPU inference with optimized batching and memory management
- 🎯 Comprehensive Metrics: Pass@k, execution success rates, Best-of-N sampling, and custom metrics
- 🔒 Safe Code Execution: Sandboxed execution with timeout protection and resource limits
- 📊 Rich Analytics: Detailed performance analysis with multiple evaluation modes
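Pass@k, listed among the metrics above, is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass all tests. The sketch below shows that standard formula. The function name `pass_at_k` is illustrative, and whether this repository computes Pass@k exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement contain no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-task estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With `k_code` samples per task, `n` would be `k_code` and `c` the number of samples that pass every test for that task.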
Technical Features
- Multi-GPU Support: Efficient parallel inference across multiple GPUs with configurable worker groups
- Adaptive Batching: Dynamic batch sizing for optimal throughput (up to 256 concurrent sequences)
- Memory Optimization: KV-cache management, prefix caching, and chunked prefill for large models
- Flexible Prompting: Customizable prompt templates with Jinja2 templating
- Robust Error Handling: Graceful failure recovery with detailed error logging
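To illustrate the Multi-GPU Support bullet, here is a minimal sketch of how prompts could be split across worker groups. It assumes each entry of `gpu_groups` is one worker and that work is dealt out round-robin; the function `shard_prompts` is hypothetical and the framework's actual scheduling may differ.

```python
from typing import List, Sequence, Tuple


def shard_prompts(
    prompts: Sequence[str], gpu_groups: Sequence[Sequence[int]]
) -> List[Tuple[str, List[str]]]:
    """Deal prompts round-robin to worker groups.

    Returns one (device_string, prompts) pair per group, where
    device_string is suitable for CUDA_VISIBLE_DEVICES, e.g. "0,1".
    """
    if not gpu_groups:
        raise ValueError("gpu_groups must contain at least one group")
    shards: List[List[str]] = [[] for _ in gpu_groups]
    for index, prompt in enumerate(prompts):
        shards[index % len(gpu_groups)].append(prompt)
    return [
        (",".join(str(gpu) for gpu in group), shard)
        for group, shard in zip(gpu_groups, shards)
    ]
```

A group with several GPU ids, such as `[0, 1]`, would correspond to one worker that spans both devices.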
Supported Datasets
- MBPP (Mostly Basic Python Problems)
- LiveCodeBench
- CodeContests
- CodeForces
- LiveBench
- Custom datasets (with proper formatting)
🏗 Architecture
┌─────────────────────────────────────────────────────────┐
│                   Configuration Layer                   │
│                   (Hydra + OmegaConf)                   │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│                   Evaluation Pipeline                   │
│  ┌─────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │   Dataset   │→  │  Generation  │→  │  Execution   │  │
│  │   Loader    │   │    Engine    │   │   Sandbox    │  │
│  └─────────────┘   └──────────────┘   └──────────────┘  │
│                            ↓                            │
│                    ┌──────────────┐                     │
│                    │   Metrics    │                     │
│                    │  Calculator  │                     │
│                    └──────────────┘                     │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│                   Inference Backends                    │
│  ┌─────────────────────┐   ┌─────────────────────────┐  │
│  │        vLLM         │   │       API Clients       │  │
│  │   (Local Models)    │   │   (OpenAI, Anthropic)   │  │
│  └─────────────────────┘   └─────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
📦 Installation
Prerequisites
- Python 3.8 or higher
- CUDA 11.8+ (for GPU acceleration)
- At least 16GB RAM (32GB+ recommended for large models)
- NVIDIA GPU with 24GB+ VRAM (for local model inference)
Step 1: Clone the Repository
git clone https://github.com/TimeLovercc/code-evaluator.git
cd code-evaluator
Step 2: Create Virtual Environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
Step 3: Install Dependencies
uv pip install -r requirements.txt
Step 4: Download Datasets
cd data
# Download evaluation datasets
python download_data.py --dataset MBPP
python download_data.py --dataset LiveCodeBench
python download_data.py --dataset CodeContests
python download_data.py --dataset CodeForces
python download_data.py --dataset LiveBench
# Optional: Download training data
python download_data.py --dataset CodeContests_train
cd ..
🚀 Quick Start
Basic Evaluation
# Run evaluation with default settings (MBPP dataset)
bash scripts/eval.sh
Evaluate All Datasets
# Run comprehensive evaluation across all datasets
bash scripts/all_eval.sh
Custom Model Evaluation
# Using vLLM with a specific model
python src/evaluate/evaluation_exp.py \
  inference.vllm.pretrained_model="codellama/CodeLlama-7b-Python-hf" \
  dataset.name="MBPP"
# Using API-based model
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.model_name="gpt-4" \
  inference.api.key="YOUR_API_KEY" \
  dataset.name="LiveCodeBench"
📊 Datasets
Supported Formats
The framework uses a standardized JSON format with Stdio input/output:
{
  "task_id": 0,
  "question": "Problem description here",
  "test_input": ["5\n1 2 3 4 5\n"],
  "test_output": ["15\n"],
  "example_input": ["3\n1 2 3\n"],
  "example_output": ["6\n"],
  "test_time_limit": 1
}
Dataset Statistics
| Dataset | # Problems | Difficulty | Format |
|---|---|---|---|
| MBPP | 974 | Basic | Stdio |
| LiveCodeBench | 400+ | Mixed | Stdio/Functional* |
| CodeContests | 13,000+ | Hard | Stdio |
| CodeForces | 10,000+ | Mixed | Stdio |
| LiveBench | 200+ | Mixed | Stdio/Functional* |
*Automatically converted to Stdio format using data/transformation.ipynb
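To make the Stdio format concrete, the sketch below runs a candidate solution once per `test_input` entry, feeds the entry on standard input, and compares standard output with the matching `test_output` entry. It is a minimal illustration only: the names `run_stdio_case` and `check_task` are hypothetical, the whitespace-insensitive comparison is an assumption, and a plain subprocess with a timeout is not a real sandbox.

```python
import subprocess
import sys
from typing import List, Optional


def run_stdio_case(code: str, stdin_text: str, time_limit: float) -> Optional[str]:
    """Run `code` in a fresh interpreter; return stdout, or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit,  # seconds; the child is killed when exceeded
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None  # runtime error or non-zero exit
    return proc.stdout


def check_task(code: str, task: dict) -> List[bool]:
    """Return one pass/fail flag per (test_input, test_output) pair."""
    flags = []
    for stdin_text, expected in zip(task["test_input"], task["test_output"]):
        actual = run_stdio_case(code, stdin_text, task["test_time_limit"])
        # Assumption: outputs match if equal after trimming outer whitespace.
        flags.append(actual is not None and actual.strip() == expected.strip())
    return flags
```

For the example record above, a solution that reads the count, then the numbers, and prints their sum would yield a single passing flag.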
Custom Dataset Integration
To add your own dataset:
- Format your data according to the schema above
- Place the JSON file in data/eval_data/
- Update config.yaml with your dataset name
- Run evaluation as usual
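Before running an evaluation on a custom dataset, it can help to check each record against the schema. The sketch below is one way to do that; `validate_task` is a hypothetical helper, and treating every field from the example record as required is an assumption rather than a documented rule.

```python
from typing import List

# Field names and types taken from the example record in this README.
REQUIRED_FIELDS = {
    "task_id": int,
    "question": str,
    "test_input": list,
    "test_output": list,
    "example_input": list,
    "example_output": list,
    "test_time_limit": (int, float),
}


def validate_task(task: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in task:
            problems.append(f"missing field: {name}")
        elif not isinstance(task[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Inputs and outputs are paired by position, so their lengths must agree.
    for prefix in ("test", "example"):
        inputs = task.get(f"{prefix}_input")
        outputs = task.get(f"{prefix}_output")
        if isinstance(inputs, list) and isinstance(outputs, list):
            if len(inputs) != len(outputs):
                problems.append(f"{prefix}_input and {prefix}_output differ in length")
    return problems
```

Running it over every record after `json.load` and printing any non-empty result gives a quick sanity check before a long evaluation run.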
Format Conversion
For datasets with functional format (e.g., assert-based tests), use the provided conversion tool:
# Open data/transformation.ipynb
# Follow the notebook to convert functional → Stdio format
⚙️ Configuration
The framework uses Hydra for configuration management. Main configuration file: src/evaluate/config.yaml
Key Configuration Options
Model Settings
inference:
  use_api: false  # true for API models, false for vLLM
  # vLLM settings
  vllm:
    pretrained_model: "Qwen/Qwen3-4B"
    max_model_len: 16384
    max_generation_token: 4096
    temp: 0.8
    gpu_groups: [[0], [1], [2], [3]]  # GPU allocation
    max_batch_size: 256
  # API settings
  api:
    model_name: "gpt-4o-mini"
    key: "YOUR_API_KEY"
    temperature: 0.8
    max_workers: 20
    rpm_limit: 100
Generation Parameters
generation:
  k_code: 16  # Number of code samples per task
  k_case: 16  # Number of test case samples per task
  no_example: true  # Whether to include examples in prompts
Evaluation Modes
evaluation:
  single_eval: true  # One-shot coding accuracy only
  scale_tuple_list: [[4, 4], [16, 16]]  # Best-of-N configurations
Environment Variables
export PYTHONPATH=.
export NCCL_P2P_DISABLE=1 # For multi-GPU without NVLink
export VLLM_USE_V1=0
export OMP_NUM_THREADS=8
💻 Usage Examples
Example 1: Benchmarking Multiple Models
# benchmark_models.py
import json
import subprocess

models = [
    "codellama/CodeLlama-7b-Python-hf",
    "Qwen/Qwen2.5-Coder-7B",
    "deepseek-ai/deepseek-coder-6.7b-base",
]

results = {}
for model in models:
    # Pass the override as its own argument so no shell quoting is needed
    cmd = [
        "python",
        "src/evaluate/evaluation_exp.py",
        f"inference.vllm.pretrained_model={model}",
    ]
    # check=True stops the loop if an evaluation run fails
    subprocess.run(cmd, check=True)

    # Parse results from output files
    result_path = f"outputs/eval/results-eval-{model.replace('/', '.')}-final_eval.txt"
    with open(result_path) as f:
        results[model] = f.read()

print(json.dumps(results, indent=2))
Example 2: Custom Prompt Templates
# custom_prompts.yaml
prompts:
  system_prompts: |
    You are an expert Python programmer.
    Task: {{problem}}
    Requirements: {{special_requirements}}
    Generate clean, efficient Python code.
Example 3: API Rate Limiting
# For high-volume API usage
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.rpm_limit=500 \
  inference.api.max_workers=50 \
  execution.num_chunks=1024
Example 4: Debug Mode
# Enable debug mode for quick testing
python src/evaluate/evaluation_exp.py debug=true
# This sets: k_code=2, k_case=2, num_chunks=4
📁 Output Format
Directory Structure
outputs/eval/
├── MBPP/
│ ├── generations-eval-model-MBPP.json # Raw generations
│ ├── outputs-eval-model-MBPP.json # Full results
│ └── results-eval-model-final_eval.txt # Summary metrics
├── LiveCodeBench/
│ └── ...
└── ...
Metrics Explained
- Code Accuracy: Proportion of tasks where generated code passes all tests
- Code Accumulate Accuracy: Proportion of individual test cases passed
- Case Accuracy: Quality of generated test cases (if applicable)
- P_01/P_00: Probability metrics for test case discrimination
- Best-of-N: Performance when selecting best solution from N attempts
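The first two metrics can be derived from the `test_bool_table` field of the per-task output. The sketch below assumes each row of `test_bool_table` is one generated code sample and each column one ground-truth test; that orientation, the averaging over samples, and the name `code_metrics` are assumptions made for illustration.

```python
from typing import Dict, List


def code_metrics(records: List[dict]) -> Dict[str, float]:
    """Compute code accuracy and accumulate accuracy over all samples."""
    solved = []     # per sample: True if every test passed
    fractions = []  # per sample: share of tests passed
    for record in records:
        for row in record["test_bool_table"]:
            if not row:
                continue  # skip samples with no executed tests
            solved.append(all(row))
            fractions.append(sum(row) / len(row))
    if not solved:
        return {"code_acc": 0.0, "code_accumulate_acc": 0.0}
    return {
        "code_acc": sum(solved) / len(solved),
        "code_accumulate_acc": sum(fractions) / len(fractions),
    }
```

Accumulate accuracy is never lower than code accuracy, because a sample that passes every test also contributes a full 1.0 to the per-test average.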
Sample Output
{
  "task_id": 0,
  "question": "Write a function to sum numbers",
  "generated_code": ["def solution(nums):..."],
  "test_bool_table": [[true, true, false], ...],
  "case_bool_table": [[true, false], ...],
  "test_exe_results": [["15", "10", "error"], ...],
  "case_exe_results": [["5", "error"], ...]
}
Summary Statistics
code acc (average proportion of tasks the generated code can pass): 0.425
code accumulate acc (average proportion of unit tests the generated code can pass): 0.612
estimated unit test acc: 0.387
estimated p_01: 0.823
estimated p_00: 0.156
BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
code average response length: 287.3
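The `BoN setting [4, 4]` line reports Best-of-N accuracy. One plausible reading, sketched below, is that the first number is how many code samples are considered and the second how many generated test cases are used to rank them: the sample that passes the most generated cases is selected and then scored on the ground-truth tests. This interpretation, the row-per-sample layout of both tables, and the function `best_of_n` are assumptions, not the confirmed implementation.

```python
from typing import List


def best_of_n(
    test_bool_table: List[List[bool]],
    case_bool_table: List[List[bool]],
    n_code: int,
    n_case: int,
) -> bool:
    """Pick the sample passing the most generated cases; report if it passes all real tests."""
    # Votes: how many of the first n_case generated cases each candidate passes.
    votes = [sum(row[:n_case]) for row in case_bool_table[:n_code]]
    if not votes:
        raise ValueError("no code samples to choose from")
    # max() keeps the earliest candidate on ties.
    best = max(range(len(votes)), key=votes.__getitem__)
    return all(test_bool_table[best])
```

Averaging the returned flag over all tasks would give a Best-of-N accuracy comparable to the `acc` value on that line.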
📈 Performance
Optimization Tips
1. GPU Memory Management
vllm:
  gpu_memory_utilization: 0.90  # Maximize VRAM usage
  enable_prefix_caching: true   # Cache common prefixes
  enable_chunked_prefill: true  # Better memory handling
2. Batch Size Tuning
- A100 80GB: max_batch_size: 256
- A100 40GB: max_batch_size: 128
- RTX 4090: max_batch_size: 64
- RTX 3090: max_batch_size: 32
3. Multi-GPU Scaling
# Single GPU per worker (recommended)
gpu_groups: [[0], [1], [2], [3]]
# Tensor parallelism (for very large models)
gpu_groups: [[0, 1], [2, 3]]
4. Execution Optimization
execution:
  num_chunks: 512    # Increase for better parallelization
  exe_verbose: true  # Monitor execution progress
Benchmark Results
| Model | MBPP Pass@1 | LiveCodeBench Pass@1 | Throughput (samples/min) |
|---|---|---|---|
| CodeLlama-7B | 42.3% | 38.7% | 120 |
| Qwen2.5-Coder-7B | 51.2% | 45.3% | 115 |
| DeepSeek-Coder-6.7B | 48.5% | 43.2% | 125 |
| GPT-4 (API) | 67.8% | 62.1% | 60 |
🔧 Troubleshooting
Common Issues
CUDA Out of Memory
# Reduce batch size
python src/evaluate/evaluation_exp.py \
  inference.vllm.max_batch_size=32 \
  inference.vllm.gpu_memory_utilization=0.8
Timeout Errors
# Increase execution timeout
dataset:
  max_test: 16  # Increase test limit
execution:
  num_chunks: 1024  # More parallel chunks
API Rate Limits
inference:
  api:
    rpm_limit: 50    # Reduce requests per minute
    max_workers: 10  # Fewer concurrent workers
Model Loading Issues
# For models requiring trust_remote_code
python src/evaluate/evaluation_exp.py \
  inference.vllm.trust_remote_code=true
Process Cleanup Issues
# If processes don't terminate cleanly
pkill -f "evaluation_exp.py"
nvidia-smi  # Check GPU usage
🏃‍♂️ Advanced Usage
Running with SLURM
#!/bin/bash
#SBATCH --job-name=code-eval
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00
module load cuda/11.8
source .venv/bin/activate
python src/evaluate/evaluation_exp.py dataset.name="CodeContests"
Custom Evaluation Pipeline
from src.evaluate.evaluator import CodeEvaluator
from src.evaluate.inference_engines import VLLMInferenceEngine
# Create custom evaluator
engine = VLLMInferenceEngine(cfg)
evaluator = CodeEvaluator(cfg, dataset, engine, ...)
evaluator.evaluate()
🤝 Contributing
We welcome contributions! Areas of interest:
- Support for more programming languages
- Additional evaluation metrics
- New dataset integrations
- Performance optimizations
- Documentation improvements
Development Setup
# Install development dependencies
pip install -e .
pip install pytest black isort flake8
# Run tests
pytest tests/
# Format code
black src/
isort src/
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Inspired by the CURE framework
- Built on top of vLLM for efficient inference
- Uses Hydra for configuration management
- Dataset sources from HuggingFace and various coding competition platforms
📮 Contact
For questions and support, please open an issue on GitHub or contact gzjz07@outlook.com.
Star ⭐ this repository if you find it helpful!