
🚀 Code Evaluator

Python 3.8+ | License: MIT

A comprehensive, production-ready framework for evaluating code generation models on programming benchmarks. Designed for researchers and practitioners working with LLMs for code generation tasks.

Inspired by CURE

📋 Table of Contents

  • ✨ Features
  • 🏗 Architecture
  • 📦 Installation
  • 🚀 Quick Start
  • 📊 Datasets
  • ⚙️ Configuration
  • 💻 Usage Examples
  • 📁 Output Format
  • 📈 Performance
  • 🔧 Troubleshooting
  • 🏃‍♂️ Advanced Usage
  • 🤝 Contributing
  • 📄 License

✨ Features

Core Capabilities

  • 🔧 Multiple Inference Backends: Seamlessly switch between vLLM (local) and API-based models (OpenAI, Anthropic, etc.)
  • ⚡ High-Performance Execution: Distributed GPU inference with optimized batching and memory management
  • 🎯 Comprehensive Metrics: Pass@k, execution success rates, Best-of-N sampling, and custom metrics (see the Pass@k sketch after this list)
  • 🔒 Safe Code Execution: Sandboxed execution with timeout protection and resource limits
  • 📊 Rich Analytics: Detailed performance analysis with multiple evaluation modes
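
For reference, Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): with n samples per task of which c pass, Pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of the formula (illustrative; not the repository's own implementation):

# Unbiased Pass@k estimator: n samples per task, c of them passing.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25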

Technical Features

  • Multi-GPU Support: Efficient parallel inference across multiple GPUs with configurable worker groups
  • Adaptive Batching: Dynamic batch sizing for optimal throughput (up to 256 concurrent sequences)
  • Memory Optimization: KV-cache management, prefix caching, and chunked prefill for large models
  • Flexible Prompting: Customizable prompt templates rendered with Jinja2
  • Robust Error Handling: Graceful failure recovery with detailed error logging

Supported Datasets

  • MBPP (Mostly Basic Python Problems)
  • LiveCodeBench
  • CodeContests
  • CodeForces
  • LiveBench
  • Custom datasets (with proper formatting)

🏗 Architecture

┌─────────────────────────────────────────────────────────┐
│                    Configuration Layer                   │
│                   (Hydra + OmegaConf)                   │
└─────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────┐
│                    Evaluation Pipeline                   │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Dataset   │→ │  Generation  │→ │   Execution  │  │
│  │   Loader    │  │    Engine    │  │   Sandbox    │  │
│  └─────────────┘  └──────────────┘  └──────────────┘  │
│                            ↓                            │
│                    ┌──────────────┐                     │
│                    │   Metrics    │                     │
│                    │  Calculator  │                     │
│                    └──────────────┘                     │
└─────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────┐
│                     Inference Backends                   │
│  ┌─────────────────────┐  ┌─────────────────────────┐  │
│  │       vLLM          │  │      API Clients        │  │
│  │  (Local Models)     │  │  (OpenAI, Anthropic)    │  │
│  └─────────────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA 11.8+ (for GPU acceleration)
  • At least 16GB RAM (32GB+ recommended for large models)
  • NVIDIA GPU with 24GB+ VRAM (for local model inference)

Step 1: Clone the Repository

git clone https://github.com/TimeLovercc/code-evaluator.git
cd code-evaluator

Step 2: Create Virtual Environment

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Step 3: Install Dependencies

uv pip install -r requirements.txt

Step 4: Download Datasets

cd data
# Download evaluation datasets
python download_data.py --dataset MBPP
python download_data.py --dataset LiveCodeBench
python download_data.py --dataset CodeContests
python download_data.py --dataset CodeForces
python download_data.py --dataset LiveBench

# Optional: Download training data
python download_data.py --dataset CodeContests_train
cd ..

🚀 Quick Start

Basic Evaluation

# Run evaluation with default settings (MBPP dataset)
bash scripts/eval.sh

Evaluate All Datasets

# Run comprehensive evaluation across all datasets
bash scripts/all_eval.sh

Custom Model Evaluation

# Using vLLM with a specific model
python src/evaluate/evaluation_exp.py \
  inference.vllm.pretrained_model="codellama/CodeLlama-7b-Python-hf" \
  dataset.name="MBPP"

# Using API-based model
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.model_name="gpt-4" \
  inference.api.key="YOUR_API_KEY" \
  dataset.name="LiveCodeBench"

📊 Datasets

Supported Formats

The framework uses a standardized JSON format with Stdio input/output:

{
  "task_id": 0,
  "question": "Problem description here",
  "test_input": ["5\n1 2 3 4 5\n"],
  "test_output": ["15\n"],
  "example_input": ["3\n1 2 3\n"],
  "example_output": ["6\n"],
  "test_time_limit": 1
}
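
Under this schema, a candidate program reads each test_input string on stdin and must print the matching test_output within test_time_limit seconds. A minimal sketch of such a check (illustrative only; the framework's sandbox adds resource limits and parallel chunking on top of this):

import subprocess
import sys

def run_stdio_test(code: str, test_input: str, test_output: str,
                   time_limit: float = 1.0) -> bool:
    """Run candidate code in a subprocess, feed stdin, compare stdout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input, capture_output=True,
            text=True, timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == test_output.strip()

# Example against the schema above: summing "1 2 3 4 5" should print "15"
code = "input(); print(sum(map(int, input().split())))"
print(run_stdio_test(code, "5\n1 2 3 4 5\n", "15\n"))  # True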

Dataset Statistics

Dataset         # Problems   Difficulty   Format
MBPP            974          Basic        Stdio
LiveCodeBench   400+         Mixed        Stdio/Functional*
CodeContests    13,000+      Hard         Stdio
CodeForces      10,000+      Mixed        Stdio
LiveBench       200+         Mixed        Stdio/Functional*

*Automatically converted to Stdio format using data/transformation.ipynb

Custom Dataset Integration

To add your own dataset:

  1. Format your data according to the schema above
  2. Place the JSON file in data/eval_data/
  3. Update config.yaml with your dataset name
  4. Run evaluation as usual

Format Conversion

For datasets with functional format (e.g., assert-based tests), use the provided conversion tool:

# Open data/transformation.ipynb
# Follow the notebook to convert functional → Stdio format
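
Conceptually, an assert-based test such as assert add(2, 3) == 5 becomes a Stdio pair by emitting the arguments as test_input, the expected value as test_output, and appending a small driver that reads arguments from stdin and prints the function's result. A hand-written sketch of the idea (the notebook's actual logic may differ):

# Hypothetical conversion of: assert add(2, 3) == 5
entry = {
    "test_input": ["2 3\n"],   # arguments serialized for stdin
    "test_output": ["5\n"],    # expected return value on stdout
}

# Driver appended to each candidate solution that defines add():
driver = """
import sys
args = [int(x) for x in sys.stdin.read().split()]
print(add(*args))
"""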

⚙️ Configuration

The framework uses Hydra for configuration management. Main configuration file: src/evaluate/config.yaml

Key Configuration Options

Model Settings

inference:
  use_api: false  # true for API models, false for vLLM
  
  # vLLM settings
  vllm:
    pretrained_model: "Qwen/Qwen3-4B"
    max_model_len: 16384
    max_generation_token: 4096
    temp: 0.8
    gpu_groups: [[0], [1], [2], [3]]  # GPU allocation
    max_batch_size: 256
  
  # API settings
  api:
    model_name: "gpt-4o-mini"
    key: "YOUR_API_KEY"
    temperature: 0.8
    max_workers: 20
    rpm_limit: 100
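
rpm_limit and max_workers imply client-side throttling of API calls. One common pattern is to space request starts by 60/rpm_limit seconds across a thread pool; a minimal sketch of that pattern (illustrative, not the repository's implementation):

import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RpmLimiter:
    """Space request starts so at most rpm_limit begin per minute."""
    def __init__(self, rpm_limit: int):
        self.interval = 60.0 / rpm_limit
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            start = max(self.next_slot, now)  # next free start time
            self.next_slot = start + self.interval
        time.sleep(max(0.0, start - now))

limiter = RpmLimiter(rpm_limit=100)

def call_api(prompt):
    limiter.wait()                      # throttle before each request
    return f"response to {prompt!r}"    # placeholder for the real client

with ThreadPoolExecutor(max_workers=20) as pool:
    print(list(pool.map(call_api, ["task 1", "task 2"])))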

Generation Parameters

generation:
  k_code: 16    # Number of code samples per task
  k_case: 16    # Number of test case samples per task
  no_example: true  # true = omit example I/O from prompts

Evaluation Modes

evaluation:
  single_eval: true  # One-shot coding accuracy only
  scale_tuple_list: [[4, 4], [16, 16]]  # Best-of-N configurations
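
Each [N_code, N_case] pair in scale_tuple_list evaluates Best-of-N selection. Assuming the CURE-style rule of picking the candidate that passes the most generated test cases and then scoring that pick against the ground-truth tests, a minimal sketch:

def best_of_n(case_bool_table, test_bool_table):
    """case_bool_table[i][j]: candidate i passes generated case j.
    test_bool_table[i][k]: candidate i passes ground-truth test k.
    Select by generated-case pass count, score on ground truth."""
    best = max(range(len(case_bool_table)),
               key=lambda i: sum(case_bool_table[i]))
    truth = test_bool_table[best]
    return all(truth), sum(truth) / len(truth)

# Candidate 1 passes more generated cases, so it is selected and scored.
solved, accumulate = best_of_n(
    case_bool_table=[[True, False], [True, True]],
    test_bool_table=[[True, True, False], [True, True, True]],
)
print(solved, accumulate)  # True 1.0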

Environment Variables

export PYTHONPATH=.
export NCCL_P2P_DISABLE=1  # For multi-GPU without NVLink
export VLLM_USE_V1=0
export OMP_NUM_THREADS=8

💻 Usage Examples

Example 1: Benchmarking Multiple Models

# benchmark_models.py
import subprocess
import json

models = [
    "codellama/CodeLlama-7b-Python-hf",
    "Qwen/Qwen2.5-Coder-7B",
    "deepseek-ai/deepseek-coder-6.7b-base"
]

results = {}
for model in models:
    cmd = f"python src/evaluate/evaluation_exp.py inference.vllm.pretrained_model='{model}'"
    subprocess.run(cmd, shell=True, check=True)  # fail fast on errors
    # Read summary metrics from the per-model results file
    with open(f"outputs/eval/results-eval-{model.replace('/', '.')}-final_eval.txt") as f:
        results[model] = f.read()

print(json.dumps(results, indent=2))

Example 2: Custom Prompt Templates

# custom_prompts.yaml
prompts:
  system_prompts: |
    You are an expert Python programmer. 
    Task: {{problem}}
    Requirements: {{special_requirements}}
    Generate clean, efficient Python code.
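
Templates use Jinja2 placeholders such as {{problem}}, so rendering a prompt is a one-liner. A minimal sketch with the variables from the template above:

from jinja2 import Template

template = Template(
    "You are an expert Python programmer.\n"
    "Task: {{problem}}\n"
    "Requirements: {{special_requirements}}\n"
    "Generate clean, efficient Python code."
)
prompt = template.render(
    problem="Write a function to sum numbers",
    special_requirements="Read input from stdin, print to stdout",
)
print(prompt)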

Example 3: API Rate Limiting

# For high-volume API usage
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.rpm_limit=500 \
  inference.api.max_workers=50 \
  execution.num_chunks=1024

Example 4: Debug Mode

# Enable debug mode for quick testing
python src/evaluate/evaluation_exp.py debug=true
# This sets: k_code=2, k_case=2, num_chunks=4

📁 Output Format

Directory Structure

outputs/eval/
├── MBPP/
│   ├── generations-eval-model-MBPP.json    # Raw generations
│   ├── outputs-eval-model-MBPP.json        # Full results
│   └── results-eval-model-final_eval.txt   # Summary metrics
├── LiveCodeBench/
│   └── ...
└── ...

Metrics Explained

  • Code Accuracy: Proportion of tasks where generated code passes all tests
  • Code Accumulate Accuracy: Proportion of individual test cases passed
  • Case Accuracy: Quality of generated test cases (if applicable)
  • P_01/P_00: Probability metrics for test case discrimination
  • Best-of-N: Performance when selecting best solution from N attempts

Sample Output

{
  "task_id": 0,
  "question": "Write a function to sum numbers",
  "generated_code": ["def solution(nums):..."],
  "test_bool_table": [[true, true, false], ...],
  "case_bool_table": [[true, false], ...],
  "test_exe_results": [["15", "10", "error"], ...],
  "case_exe_results": [["5", "error"], ...]
}
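
The summary statistics below follow from these tables. A minimal sketch computing code acc and code accumulate acc, averaged over code samples here for simplicity (field name as in the sample above):

def summarize(tasks):
    """tasks: list of output records with a 'test_bool_table' field,
    one row per generated code sample, one column per unit test."""
    rows = [row for t in tasks for row in t["test_bool_table"]]
    code_acc = sum(all(r) for r in rows) / len(rows)             # full passes
    accumulate = sum(sum(r) / len(r) for r in rows) / len(rows)  # per-test
    return code_acc, accumulate

tasks = [{"test_bool_table": [[True, True, False], [True, True, True]]}]
print(summarize(tasks))  # (0.5, 0.8333...)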

Summary Statistics

code acc (average proportion of tasks the generated code can pass): 0.425
code accumulate acc (average proportion of unit tests the generated code can pass): 0.612
estimated unit test acc: 0.387
estimated p_01: 0.823
estimated p_00: 0.156
BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
code average response length: 287.3

📈 Performance

Optimization Tips

1. GPU Memory Management

vllm:
  gpu_memory_utilization: 0.90  # Maximize VRAM usage
  enable_prefix_caching: true    # Cache common prefixes
  enable_chunked_prefill: true   # Better memory handling

2. Batch Size Tuning

  • A100 80GB: max_batch_size: 256
  • A100 40GB: max_batch_size: 128
  • RTX 4090: max_batch_size: 64
  • RTX 3090: max_batch_size: 32

3. Multi-GPU Scaling

# Single GPU per worker (recommended)
gpu_groups: [[0], [1], [2], [3]]

# Tensor parallelism (for very large models)
gpu_groups: [[0,1], [2,3]]

4. Execution Optimization

execution:
  num_chunks: 512  # Increase for better parallelization
  exe_verbose: true  # Monitor execution progress

Benchmark Results

Model                 MBPP Pass@1   LiveCodeBench Pass@1   Throughput (samples/min)
CodeLlama-7B          42.3%         38.7%                  120
Qwen2.5-Coder-7B      51.2%         45.3%                  115
DeepSeek-Coder-6.7B   48.5%         43.2%                  125
GPT-4 (API)           67.8%         62.1%                  60

🔧 Troubleshooting

Common Issues

CUDA Out of Memory

# Reduce batch size
python src/evaluate/evaluation_exp.py \
  inference.vllm.max_batch_size=32 \
  inference.vllm.gpu_memory_utilization=0.8

Timeout Errors

# Mitigate timeouts by raising the test limit and execution parallelism
dataset:
  max_test: 16  # Increase test limit
execution:
  num_chunks: 1024  # More parallel chunks

API Rate Limits

inference:
  api:
    rpm_limit: 50  # Reduce requests per minute
    max_workers: 10  # Fewer concurrent workers

Model Loading Issues

# For models requiring trust_remote_code
python src/evaluate/evaluation_exp.py \
  inference.vllm.trust_remote_code=true

Process Cleanup Issues

# If processes don't terminate cleanly
pkill -f "evaluation_exp.py"
nvidia-smi  # Check GPU usage

🏃‍♂️ Advanced Usage

Running with SLURM

#!/bin/bash
#SBATCH --job-name=code-eval
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00

module load cuda/11.8
source .venv/bin/activate
python src/evaluate/evaluation_exp.py dataset.name="CodeContests"

Custom Evaluation Pipeline

from src.evaluate.evaluator import CodeEvaluator
from src.evaluate.inference_engines import VLLMInferenceEngine

# cfg is a Hydra/OmegaConf config (see src/evaluate/config.yaml);
# dataset is an evaluation set loaded in the JSON format above
engine = VLLMInferenceEngine(cfg)
evaluator = CodeEvaluator(cfg, dataset, engine, ...)
evaluator.evaluate()

🤝 Contributing

We welcome contributions! Areas of interest:

  • Support for more programming languages
  • Additional evaluation metrics
  • New dataset integrations
  • Performance optimizations
  • Documentation improvements

Development Setup

# Install development dependencies
pip install -e .
pip install pytest black isort flake8

# Run tests
pytest tests/

# Format code
black src/
isort src/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by the CURE framework
  • Built on top of vLLM for efficient inference
  • Uses Hydra for configuration management
  • Dataset sources from HuggingFace and various coding competition platforms

📮 Contact

For questions and support, please open an issue on GitHub or contact gzjz07@outlook.com.


Star ⭐ this repository if you find it helpful!
