GitHunt
AL

Alicimo/baseball_nl2sql

An evaluation framework for testing LLM-generated SQL queries against "The History of Baseball" database. This project implements semantic comparison methods using AST-based distance metrics and token-based cosine similarity.

Natural Language to SQL Translation System

A sophisticated evaluation framework for testing LLM-generated SQL queries against "The History of Baseball" database. This project implements advanced semantic comparison methods using AST-based distance metrics and token-based cosine similarity.

Quick Start

Prerequisites

  • Python 3.13+
  • UV package manager
  • OpenAI-compatible LLM API access

Installation

uv sync

Environment Setup

Create a .env file with your LLM API credentials:

OPENAI_URL=your_api_endpoint
OPENAI_KEY=your_api_key

Running the Pipeline

# Run complete pipeline (generate + evaluate)
uv run dvc repro

# Run individual stages
uv run dvc repro generate    # Generate SQL queries
uv run dvc repro evaluate    # Evaluate against reference queries

Project Structure

├── src/
│   ├── generate.py     # LLM-based SQL query generation
│   └── eval.py         # Evaluation framework with semantic metrics
├── data/
│   └── examples_queries_test.json  # Test cases with reference queries
├── prompts/
│   ├── prompt_CoT.txt              # Chain-of-Thought prompt template
│   └── context/
│       └── db_schema_descriptive.txt  # Database schema documentation
├── output/
│   ├── generated_queries.json      # LLM-generated queries
│   ├── eval.json                   # Detailed evaluation results
│   └── metrics.json                # Aggregate performance metrics
├── scripts/
│   └── schema_parser.py            # Database schema processing utility
├── dvc.yaml           # ML pipeline definition
├── params.yaml        # Configuration parameters
└── pyproject.toml     # Python dependencies and project config

Key Features

Advanced Evaluation Metrics

  • AST Distance: Semantic comparison using Change Distiller algorithm via SQLGlot
  • Token Cosine Similarity: Vector-based similarity of normalized SQL tokens
  • SQL Normalization: Handles aliases and formatting variations for fair comparison

Configurable LLM Integration

  • OpenAI-compatible API support
  • Configurable temperature, max tokens, and retry logic
  • Chain-of-Thought prompting with structured reasoning output

Reproducible ML Pipeline

  • DVC-managed pipeline for reproducible experiments
  • Parameter tracking and metric versioning
  • Modular prompt engineering with schema injection

Usage Examples

Generate SQL queries from natural language:

uv run python src/generate.py --examples_path data/examples_queries_test.json --output_dir output

Evaluate generated queries:

uv run python src/eval.py --generated_queries_path output/generated_queries.json --reference_examples_path data/examples_queries_test.json

Customize LLM parameters:

Edit params.yaml to modify model, temperature, or prompt configuration, then run uv run dvc repro.

Evaluation Metrics

The system provides two complementary evaluation approaches:

  1. AST Distance: Measures structural differences between SQL ASTs using edit distance
  2. Token Cosine Similarity: Computes semantic similarity based on SQL token vectors

Results are saved to output/eval.json with aggregate metrics in output/metrics.json.

Development Notes

  • Uses UV for dependency management and Python version control
  • Fire library provides CLI interfaces for both generation and evaluation scripts
  • SQLGlot handles SQL parsing, normalization, and AST manipulation
  • DVC enables reproducible ML experiments with parameter and metric tracking
Alicimo/baseball_nl2sql | GitHunt