Natural Language to SQL Translation System
A sophisticated evaluation framework for testing LLM-generated SQL queries against "The History of Baseball" database. This project implements advanced semantic comparison methods using AST-based distance metrics and token-based cosine similarity.
Quick Start
Prerequisites
- Python 3.13+
- UV package manager
- OpenAI-compatible LLM API access
Installation
```
uv sync
```
Environment Setup
Create a .env file with your LLM API credentials:
```
OPENAI_URL=your_api_endpoint
OPENAI_KEY=your_api_key
```
Running the Pipeline
```
# Run complete pipeline (generate + evaluate)
uv run dvc repro

# Run individual stages
uv run dvc repro generate   # Generate SQL queries
uv run dvc repro evaluate   # Evaluate against reference queries
```
Project Structure
```
├── src/
│   ├── generate.py                     # LLM-based SQL query generation
│   └── eval.py                         # Evaluation framework with semantic metrics
├── data/
│   └── examples_queries_test.json      # Test cases with reference queries
├── prompts/
│   ├── prompt_CoT.txt                  # Chain-of-Thought prompt template
│   └── context/
│       └── db_schema_descriptive.txt   # Database schema documentation
├── output/
│   ├── generated_queries.json          # LLM-generated queries
│   ├── eval.json                       # Detailed evaluation results
│   └── metrics.json                    # Aggregate performance metrics
├── scripts/
│   └── schema_parser.py                # Database schema processing utility
├── dvc.yaml                            # ML pipeline definition
├── params.yaml                         # Configuration parameters
└── pyproject.toml                      # Python dependencies and project config
```
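The `dvc.yaml` listed above might wire the two stages together roughly as follows. This is a sketch, not the file from the repo: the dependency and parameter lists are assumptions based on the project layout.

```yaml
stages:
  generate:
    cmd: uv run python src/generate.py --examples_path data/examples_queries_test.json --output_dir output
    deps:
      - src/generate.py
      - data/examples_queries_test.json
      - prompts/prompt_CoT.txt
    outs:
      - output/generated_queries.json
  evaluate:
    cmd: uv run python src/eval.py --generated_queries_path output/generated_queries.json --reference_examples_path data/examples_queries_test.json
    deps:
      - src/eval.py
      - output/generated_queries.json
    metrics:
      - output/metrics.json:
          cache: false
```

With a layout like this, `dvc repro` reruns a stage only when its declared deps or params change.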
Key Features
Advanced Evaluation Metrics
- AST Distance: Semantic comparison using Change Distiller algorithm via SQLGlot
- Token Cosine Similarity: Vector-based similarity of normalized SQL tokens
- SQL Normalization: Handles aliases and formatting variations for fair comparison
Configurable LLM Integration
- OpenAI-compatible API support
- Configurable temperature, max tokens, and retry logic
- Chain-of-Thought prompting with structured reasoning output
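The retry logic above can be sketched as a small wrapper with exponential backoff. The function name and backoff values are illustrative, not the implementation in `src/generate.py`:

```python
import time

def with_retries(call, max_retries=3, backoff=0.01):
    """Invoke call(); on failure, sleep with exponential backoff and retry."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last error
            time.sleep(backoff * 2 ** attempt)

# Example: a flaky "API call" that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "SELECT 1"

print(with_retries(flaky))  # "SELECT 1" after two retried failures
```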
Reproducible ML Pipeline
- DVC-managed pipeline for reproducible experiments
- Parameter tracking and metric versioning
- Modular prompt engineering with schema injection
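Schema injection can be as simple as formatting the schema text into the prompt template. The placeholder names and template wording below are illustrative; the real template lives in `prompts/prompt_CoT.txt`:

```python
# Build a prompt by injecting the database schema into a template.
# Template text and placeholder names here are assumptions for illustration.
template = (
    "You are a SQL expert for a baseball database.\n"
    "Schema:\n{schema}\n\n"
    "Question: {question}\n"
    "Think step by step, then output the final SQL query."
)

schema = "player(player_id, name_first, name_last)"  # stand-in for db_schema_descriptive.txt
prompt = template.format(schema=schema, question="Who hit the most home runs in 1998?")
print(prompt)
```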
Usage Examples
Generate SQL queries from natural language:
```
uv run python src/generate.py --examples_path data/examples_queries_test.json --output_dir output
```
Evaluate generated queries:
```
uv run python src/eval.py --generated_queries_path output/generated_queries.json --reference_examples_path data/examples_queries_test.json
```
Customize LLM parameters:
Edit `params.yaml` to modify the model, temperature, or prompt configuration, then run `uv run dvc repro`.
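A `params.yaml` along these lines is assumed; the key names are illustrative, so check the file in the repo for the actual structure:

```yaml
llm:
  model: gpt-4o-mini
  temperature: 0.0
  max_tokens: 1024
  max_retries: 3
prompt:
  template: prompts/prompt_CoT.txt
  schema: prompts/context/db_schema_descriptive.txt
```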
Evaluation Metrics
The system provides two complementary evaluation approaches:
- AST Distance: Measures structural differences between SQL ASTs using edit distance
- Token Cosine Similarity: Computes semantic similarity based on SQL token vectors
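A simplified, dependency-free sketch of the token-cosine metric. The project tokenizes via SQLGlot and normalizes aliases first; here a regex tokenizer stands in, so treat this as an approximation of the idea rather than the actual metric:

```python
import math
import re
from collections import Counter

def tokenize(sql: str) -> list[str]:
    # Lowercase, then split into identifiers, numbers, and single operators.
    return re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\s\w]", sql.lower())

def token_cosine(sql_a: str, sql_b: str) -> float:
    """Cosine similarity between the token-count vectors of two queries."""
    a, b = Counter(tokenize(sql_a)), Counter(tokenize(sql_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(token_cosine("SELECT name FROM player", "select name from player"))  # 1.0
```

Lowercasing makes the metric case-insensitive, so formatting variations score as identical while genuinely different queries fall below 1.0.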
Results are saved to output/eval.json with aggregate metrics in output/metrics.json.
Development Notes
- Uses UV for dependency management and Python version control
- Fire library provides CLI interfaces for both generation and evaluation scripts
- SQLGlot handles SQL parsing, normalization, and AST manipulation
- DVC enables reproducible ML experiments with parameter and metric tracking