skerk001/genomicsgpt

ML + LLM pipeline for genetic variant pathogenicity prediction (AUC 0.9949, 1.69M ClinVar variants) with SHAP explainability and clinical report generation via Llama 3 / Claude

🧬 GenomicsGPT

AI-Powered Genetic Variant Interpretation Platform

GenomicsGPT is an end-to-end variant interpretation pipeline that combines clinical database lookups, machine learning pathogenicity prediction, and LLM-powered clinical narrative generation.

Key Results

Metric       Score
AUC-ROC      0.9949 (0.985 leakage-corrected)
Accuracy     0.965
F1 (Macro)   0.948
Sensitivity  0.966
Dataset      1,691,921 ClinVar variants
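
For readers less familiar with these metrics, here is a minimal sketch of how accuracy, sensitivity, and macro F1 follow from a binary confusion matrix. The counts are invented for illustration and are not the project's results; the `binary_metrics` helper and the choice of pathogenic as the positive class are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    sensitivity: float
    f1_macro: float


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Metrics:
    """Derive summary metrics from binary confusion-matrix counts.

    Pathogenic is treated as the positive class (an assumption).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive class

    # Per-class F1, then the unweighted (macro) mean of the two.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    f1_macro = (f1_pos + f1_neg) / 2
    return Metrics(accuracy, sensitivity, f1_macro)


# Illustrative counts only (not taken from the project).
m = binary_metrics(tp=90, fp=20, tn=880, fn=10)
print(f"accuracy={m.accuracy:.3f} sensitivity={m.sensitivity:.3f} f1_macro={m.f1_macro:.3f}")
```

Macro F1 weights both classes equally, which is why it can sit below accuracy when one class is rarer or harder to predict.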

Architecture

User Input (HGVS, VCF, rsID, etc.)
         │
         ▼
┌─────────────────┐
│  Variant Parser │  ← 7 input formats supported
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Aggregator │  ← ClinVar API + Ensembl VEP
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ML Engine    │  ← XGBoost + LightGBM ensemble
│  (AUC = 0.9949) │     40 features, SHAP explainability
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM Narrative  │  ← Llama 3 (local) or Claude API
│     Engine      │     Structured clinical reports
└─────────────────┘
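
The four stages can be read as a chain of functions that each enrich a shared record. The sketch below shows only that shape. Every name in it (`VariantRecord`, `run_pipeline`, the stage functions) is hypothetical, and the stage bodies are placeholders rather than the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VariantRecord:
    """Hypothetical container handed from stage to stage."""
    raw_input: str
    parsed: dict = field(default_factory=dict)
    evidence: dict = field(default_factory=dict)
    score: Optional[float] = None
    report: Optional[str] = None


def parse_variant(rec: VariantRecord) -> VariantRecord:
    # Placeholder: split "GENE CHANGE"; the real parser handles 7 formats.
    gene, _, change = rec.raw_input.partition(" ")
    rec.parsed = {"gene": gene, "change": change}
    return rec


def aggregate_data(rec: VariantRecord) -> VariantRecord:
    # Placeholder: the real stage queries ClinVar and Ensembl VEP.
    rec.evidence = {"clinvar": None, "vep": None}
    return rec


def predict_pathogenicity(rec: VariantRecord) -> VariantRecord:
    # Placeholder score; the real stage runs the XGBoost + LightGBM ensemble.
    rec.score = 0.5
    return rec


def write_narrative(rec: VariantRecord) -> VariantRecord:
    # Placeholder: the real stage prompts Llama 3 or Claude.
    rec.report = f"{rec.parsed['gene']} {rec.parsed['change']}: score {rec.score:.3f}"
    return rec


PIPELINE: List[Callable[[VariantRecord], VariantRecord]] = [
    parse_variant,
    aggregate_data,
    predict_pathogenicity,
    write_narrative,
]


def run_pipeline(raw_input: str) -> VariantRecord:
    rec = VariantRecord(raw_input=raw_input)
    for stage in PIPELINE:
        rec = stage(rec)
    return rec


result = run_pipeline("BRAF V600E")
print(result.report)
```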

Model Performance

XGBoost, LightGBM, and Ensemble ROC/PR curves, confusion matrices, and pathogenicity score distributions. Full notebook: 03_model_training.ipynb

SHAP Explainability

Top features by mean |SHAP| value with beeswarm plot showing feature-level impact on pathogenicity prediction.

Top predictive features:

  1. cons_synonymous: Silent mutations strongly predict benign
  2. is_lof: Loss-of-function variants strongly predict pathogenic
  3. gene_path_ratio: Gene constraint score
  4. cons_intronic: Deep intronic variants lean benign
  5. num_submitters: Expert review signal
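
The ranking above is by mean |SHAP| value, which can be sketched as follows. In practice the matrix would typically come from a tree explainer in the `shap` library; here it is a small hand-written list with invented values so the example runs without dependencies. The helper name `rank_by_mean_abs_shap` is hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_by_mean_abs_shap(
    shap_rows: Sequence[Sequence[float]],
    feature_names: Sequence[str],
) -> List[Tuple[str, float]]:
    """Return (feature, mean |SHAP|) pairs, most important first."""
    n_rows = len(shap_rows)
    means = [
        sum(abs(row[j]) for row in shap_rows) / n_rows
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Toy SHAP matrix: rows are variants, columns are features. Values are invented.
features = ["cons_synonymous", "is_lof", "num_submitters"]
shap_rows = [
    [-2.0, 0.1, 0.3],
    [0.2, 1.5, -0.1],
    [-1.8, 0.0, 0.2],
]

ranking = rank_by_mean_abs_shap(shap_rows, features)
for name, value in ranking:
    print(f"{name:16s} {value:.3f}")
```

Taking the absolute value first matters: a feature that pushes some variants strongly toward benign and others toward pathogenic would otherwise average out to near zero.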

LLM Clinical Report Generation

The LLM engine takes ML predictions + database evidence and generates structured clinical interpretation reports following ACMG/AMP guidelines.

Example output (BRCA1 c.5266dupC → Llama 3, 24 seconds):

Classification: Pathogenic (confidence: 0.998). ClinVar consensus supports this prediction with 3 submissions reviewed by expert panel.

ACMG Criteria: PVS1 (frameshift → premature stop codon), PM2 (extremely rare in gnomAD, AF=0.000004, below BA1 threshold).

Clinical Implications: Increased risk for hereditary breast and ovarian cancer syndrome. Recommend genetic counseling and cascade testing for at-risk family members.

Report sections: Variant Summary, Classification, Evidence Summary, Molecular Mechanism, Population Data, Clinical Implications, ACMG Criteria, Limitations.
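
One way to steer an LLM toward these sections is to list them explicitly in the prompt next to the ML score and database evidence. The sketch below illustrates that idea only. `build_prompt` and the prompt wording are assumptions, not the contents of `report_generator.py`; the variant, score, and evidence values are taken from the example output above.

```python
from typing import Dict

REPORT_SECTIONS = [
    "Variant Summary",
    "Classification",
    "Evidence Summary",
    "Molecular Mechanism",
    "Population Data",
    "Clinical Implications",
    "ACMG Criteria",
    "Limitations",
]


def build_prompt(variant: str, score: float, evidence: Dict[str, str]) -> str:
    """Assemble a plain-text prompt (hypothetical wording)."""
    evidence_lines = "\n".join(f"- {key}: {value}" for key, value in evidence.items())
    sections = "\n".join(f"{i}. {name}" for i, name in enumerate(REPORT_SECTIONS, 1))
    return (
        f"Variant: {variant}\n"
        f"ML pathogenicity score: {score:.3f}\n"
        f"Database evidence:\n{evidence_lines}\n\n"
        "Write a clinical interpretation following ACMG/AMP guidelines "
        "with exactly these sections:\n"
        f"{sections}\n"
    )


prompt = build_prompt(
    "BRCA1 c.5266dupC",
    0.998,
    {"ClinVar": "Pathogenic, 3 submissions", "gnomAD AF": "0.000004"},
)
print(prompt)
```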

Two backends supported:

  • Ollama (free, local): runs Llama 3 on your GPU
  • Claude API (paid): higher quality output via Anthropic

# Free: local Llama 3
ollama serve  # in another terminal
python demo_report.py

# Or with Claude API
export ANTHROPIC_API_KEY="sk-ant-..."
python demo_report.py "BRAF V600E"
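
The commands above suggest the backend depends on whether `ANTHROPIC_API_KEY` is set. The sketch below shows one plausible way to make that choice and to prepare a request for Ollama's local `/api/generate` endpoint. Whether `demo_report.py` selects the backend this way is an assumption, and `choose_backend` and `build_ollama_request` are hypothetical helpers. The request is built but never sent.

```python
import json
import os
import urllib.request
from typing import Mapping, Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def choose_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick 'claude' when an API key is present, else fall back to 'ollama'."""
    env = os.environ if env is None else env
    return "claude" if env.get("ANTHROPIC_API_KEY") else "ollama"


def build_ollama_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


backend = choose_backend({"ANTHROPIC_API_KEY": ""})
request = build_ollama_request("Summarize BRAF V600E.")
print(backend, request.full_url)
```

Passing the environment as an argument keeps the selection logic testable without touching real credentials.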

Feature Ablation Study

Feature Set               AUC-ROC
All 40 features           0.9946
Without gene features     0.9894
Consequence + LoF only    0.9722
Gene features only        0.7820

Molecular consequence features independently achieve 0.97 AUC, confirming the model learns biological patterns (LoF → pathogenic, synonymous → benign) rather than memorizing gene-specific statistics.
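
To make the comparison concrete, the ablation numbers can be restated as the AUC drop relative to the full feature set. The values below are copied from the table; nothing is retrained here.

```python
# AUC-ROC per feature set, as reported in the ablation table.
ABLATION_AUC = {
    "All 40 features": 0.9946,
    "Without gene features": 0.9894,
    "Consequence + LoF only": 0.9722,
    "Gene features only": 0.7820,
}

baseline = ABLATION_AUC["All 40 features"]

# Drop in AUC relative to the full model, rounded to the table's precision.
drops = {name: round(baseline - auc, 4) for name, auc in ABLATION_AUC.items()}

for name, drop in drops.items():
    print(f"{name:24s} AUC drop vs. full model: {drop:.4f}")
```

Removing the gene features costs about 0.005 AUC, while keeping only the gene features costs about 0.21, which is the asymmetry the paragraph above relies on.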


ML Pipeline

The classifier is trained on 1.69 million labeled ClinVar variants (GRCh38) with 40 engineered features across 9 categories: variant type, molecular consequence, loss-of-function flags, allele length, position, chromosome, review quality, gene constraint, and HGVS complexity.
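
As a rough illustration of what a few of these engineered features might look like, the sketch below derives flags from a Sequence Ontology consequence term and the alleles. The feature names `cons_synonymous`, `cons_intronic`, `is_lof`, and `num_submitters` appear in this README; `allele_len_diff`, the `engineer_features` helper, and the exact set of consequences counted as loss-of-function are assumptions.

```python
from typing import Dict

# Assumed set of loss-of-function consequence terms (Sequence Ontology names).
LOF_CONSEQUENCES = {
    "frameshift_variant",
    "stop_gained",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def engineer_features(consequence: str, ref: str, alt: str, num_submitters: int) -> Dict[str, int]:
    """Turn raw variant annotations into a small numeric feature dict."""
    return {
        "cons_synonymous": int(consequence == "synonymous_variant"),
        "cons_intronic": int(consequence == "intron_variant"),
        "is_lof": int(consequence in LOF_CONSEQUENCES),
        "allele_len_diff": len(alt) - len(ref),  # hypothetical allele-length feature
        "num_submitters": num_submitters,
    }


# A one-base duplication annotated as a frameshift (illustrative input).
feats = engineer_features("frameshift_variant", ref="A", alt="AC", num_submitters=3)
print(feats)
```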


Project Structure

genomicsgpt/
├── demo_report.py                     # LLM demo: generates clinical reports
├── notebooks/
│   └── 03_model_training.ipynb        # Full ML pipeline with visualizations
├── src/genomicsgpt/
│   ├── variant_parser/                # 7-format variant input parser
│   ├── data_aggregator/               # ClinVar + Ensembl API clients
│   ├── ml_engine/                     # Pathogenicity classifier
│   ├── llm_engine/                    # Clinical narrative generation
│   │   └── report_generator.py        # Ollama + Claude backends
│   ├── rag_engine/                    # Literature search (planned)
│   └── models.py                      # Core data models
├── data/models/
│   ├── xgb_model.pkl                  # Trained XGBoost model
│   ├── lgb_model.pkl                  # Trained LightGBM model
│   ├── metrics.json                   # Evaluation metrics
│   └── plots/                         # 7 evaluation visualizations
├── tests/                             # Unit and integration tests
└── train_pipeline.py                  # Self-contained training script

Quick Start

# Parse a variant
python -m genomicsgpt parse "BRAF V600E"

# Generate a clinical report (requires Ollama running)
ollama serve  # in another terminal
python demo_report.py "BRCA1 c.5266dupC"

# Reproduce the ML training
pip install xgboost lightgbm shap seaborn scikit-learn pandas
python train_pipeline.py

# Run the interactive notebook
jupyter notebook notebooks/03_model_training.ipynb
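
The two variant strings used in the commands above follow different conventions: `BRAF V600E` is protein shorthand, while `BRCA1 c.5266dupC` is an HGVS-style coding change. The sketch below shows how a parser might tell them apart with regular expressions. It covers only these two of the seven supported formats, and the patterns, the `parse_variant` function, and the output keys are all hypothetical rather than the project's implementation.

```python
import re
from typing import Dict, Union

# Gene symbol, whitespace, then e.g. V600E (ref amino acid, position, alt).
PROTEIN_RE = re.compile(r"^(?P<gene>[A-Z][A-Z0-9]*)\s+(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z*])$")
# Gene symbol, whitespace, then an HGVS coding change such as c.5266dupC.
CDNA_RE = re.compile(r"^(?P<gene>[A-Z][A-Z0-9]*)\s+(?P<change>c\.\S+)$")


def parse_variant(text: str) -> Dict[str, Union[str, int]]:
    """Classify and split a variant string; raise ValueError if unrecognized."""
    text = text.strip()

    match = PROTEIN_RE.match(text)
    if match:
        return {
            "format": "protein",
            "gene": match["gene"],
            "ref": match["ref"],
            "position": int(match["pos"]),
            "alt": match["alt"],
        }

    match = CDNA_RE.match(text)
    if match:
        return {"format": "cdna", "gene": match["gene"], "change": match["change"]}

    raise ValueError(f"Unrecognized variant format: {text!r}")


braf = parse_variant("BRAF V600E")
brca1 = parse_variant("BRCA1 c.5266dupC")
print(braf)
print(brca1)
```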

Tech Stack

  • ML: XGBoost, LightGBM, scikit-learn, SHAP
  • LLM: Llama 3 (Ollama), Claude API (Anthropic)
  • Data: ClinVar (NCBI), Ensembl VEP
  • Visualization: matplotlib, seaborn
  • APIs: ClinVar E-utilities, Ensembl REST, Ollama REST
  • Testing: pytest (38 tests)

Author

Samir Kerkar, github.com/skerk001

Data Scientist | B.S. Mathematics UC Irvine