skerk001/genomicsgpt
ML + LLM pipeline for genetic variant pathogenicity prediction (AUC 0.9949, 1.69M ClinVar variants) with SHAP explainability and clinical report generation via Llama 3 / Claude
🧬 GenomicsGPT
AI-Powered Genetic Variant Interpretation Platform
GenomicsGPT is an end-to-end variant interpretation pipeline that combines clinical database lookups, machine learning pathogenicity prediction, and LLM-powered clinical narrative generation.
Key Results
| Metric | Score |
|---|---|
| AUC-ROC | 0.9949 (0.985 leakage-corrected) |
| Accuracy | 0.965 |
| F1 (Macro) | 0.948 |
| Sensitivity | 0.966 |
| Dataset | 1,691,921 ClinVar variants |
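For readers less familiar with these metrics, the following sketch computes AUC-ROC, accuracy, macro F1, and sensitivity in plain Python. The labels and scores are toy values invented for illustration, not ClinVar data, and the 0.5 threshold is an assumption. In practice the scikit-learn implementations listed in the tech stack would normally be used.

```python
def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) with label 1 = pathogenic, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, scores, threshold=0.5):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion(y_true, y_pred)
    f1_pathogenic = f1(tp, fp, fn)
    f1_benign = f1(tn, fn, fp)  # same formula with benign as the positive class
    return {
        "auc_roc": auc_roc(y_true, scores),
        "accuracy": (tp + tn) / len(y_true),
        "f1_macro": (f1_pathogenic + f1_benign) / 2,
        "sensitivity": tp / (tp + fn),
    }


# Toy example (invented values)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
metrics = summarize(y_true, scores)
```

Macro F1 averages the per-class F1 scores with equal weight, so it penalizes poor performance on the minority class more than accuracy does.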
Architecture
User Input (HGVS, VCF, rsID, etc.)
         │
         ▼
┌─────────────────┐
│ Variant Parser  │ ← 7 input formats supported
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Aggregator │ ← ClinVar API + Ensembl VEP
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ ML Engine       │ ← XGBoost + LightGBM ensemble
│ (AUC = 0.9949)  │   40 features, SHAP explainability
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Narrative   │ ← Llama 3 (local) or Claude API
│ Engine          │   Structured clinical reports
└─────────────────┘
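As an illustration of the first stage, here is a minimal sketch of format detection for three input styles: gene plus protein change (`BRAF V600E`), gene plus HGVS coding change (`BRCA1 c.5266dupC`), and rsIDs. The regular expressions, the `parse_variant` name, and the returned dictionary layout are assumptions made for this example. The project's own parser supports seven formats and may work quite differently.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual rules.
PATTERNS = [
    ("rsid", re.compile(r"^(?P<rsid>rs\d+)$", re.IGNORECASE)),
    ("hgvs_coding", re.compile(r"^(?P<gene>[A-Z0-9]+)\s+(?P<change>c\.\S+)$")),
    (
        "protein_short",
        re.compile(r"^(?P<gene>[A-Z0-9]+)\s+(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z*])$"),
    ),
]


def parse_variant(text):
    """Return the detected format and its named fields, or raise ValueError."""
    cleaned = text.strip()
    for fmt, pattern in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return {"format": fmt, **match.groupdict()}
    raise ValueError(f"Unrecognized variant format: {text!r}")


braf = parse_variant("BRAF V600E")
brca1 = parse_variant("BRCA1 c.5266dupC")
placeholder_rsid = parse_variant("rs12345")  # placeholder ID, not a real lookup
```

Trying the patterns in a fixed order keeps the detection deterministic when an input could match more than one notation.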
Model Performance
SHAP Explainability
Top predictive features:
- cons_synonymous: Silent mutations strongly predict benign
- is_lof: Loss-of-function variants strongly predict pathogenic
- gene_path_ratio: Gene constraint score
- cons_intronic: Deep intronic variants lean benign
- num_submitters: Expert review signal
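SHAP attributes each prediction to individual features using Shapley values. For intuition, the sketch below computes exact Shapley values for a toy three-feature scoring function by averaging each feature's marginal contribution over all feature orderings. The scoring function, its weights, and the baseline are invented for illustration and are not the trained model. The SHAP library computes these attributions far more efficiently for tree ensembles.

```python
from itertools import permutations
from math import factorial


def toy_score(features):
    """Invented score with one interaction term; NOT the project's model."""
    is_lof = features.get("is_lof", 0)
    submitters = features.get("num_submitters", 0)
    score = 0.1
    score += 0.6 * is_lof
    score -= 0.3 * features.get("cons_synonymous", 0)
    score += 0.05 * submitters
    score += 0.2 * is_lof * (submitters > 2)  # interaction between two features
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature from baseline to actual
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


instance = {"is_lof": 1, "cons_synonymous": 0, "num_submitters": 3}
baseline = {"is_lof": 0, "cons_synonymous": 0, "num_submitters": 1}
phi = shapley_values(toy_score, instance, baseline)
```

The attributions always sum to the difference between the score for the instance and the score for the baseline, which is what makes per-feature contributions comparable across predictions.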
LLM Clinical Report Generation
The LLM engine takes ML predictions + database evidence and generates structured clinical interpretation reports following ACMG/AMP guidelines.
Example output (BRCA1 c.5266dupC, Llama 3, 24 seconds):
Classification: Pathogenic (confidence: 0.998). ClinVar consensus supports this prediction with 3 submissions reviewed by expert panel.
ACMG Criteria: PVS1 (frameshift → premature stop codon), PM2 (extremely rare in gnomAD, AF=0.000004, below BA1 threshold).
Clinical Implications: Increased risk for hereditary breast and ovarian cancer syndrome. Recommend genetic counseling and cascade testing for at-risk family members.
Report sections: Variant Summary, Classification, Evidence Summary, Molecular Mechanism, Population Data, Clinical Implications, ACMG Criteria, Limitations.
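One way to hand the ML prediction and database evidence to an LLM is to assemble a structured prompt that names the required report sections. The sketch below assumes a hypothetical `build_report_prompt` function and evidence field names. The actual prompt used in `report_generator.py` is not shown in this README, so treat this as an illustration of the idea only.

```python
REPORT_SECTIONS = [
    "Variant Summary",
    "Classification",
    "Evidence Summary",
    "Molecular Mechanism",
    "Population Data",
    "Clinical Implications",
    "ACMG Criteria",
    "Limitations",
]


def build_report_prompt(variant, prediction, confidence, evidence):
    """Assemble a plain-text prompt (hypothetical layout) from prediction and evidence."""
    evidence_lines = "\n".join(f"- {key}: {value}" for key, value in evidence.items())
    section_lines = "\n".join(
        f"{index}. {name}" for index, name in enumerate(REPORT_SECTIONS, start=1)
    )
    return (
        "You are assisting with clinical variant interpretation under ACMG/AMP guidelines.\n"
        f"Variant: {variant}\n"
        f"ML prediction: {prediction} (confidence: {confidence:.3f})\n"
        "Database evidence:\n"
        f"{evidence_lines}\n"
        "Write a report with exactly these sections, in order:\n"
        f"{section_lines}\n"
        "State uncertainty explicitly and do not invent evidence."
    )


prompt = build_report_prompt(
    "BRCA1 c.5266dupC",
    "Pathogenic",
    0.998,
    {"ClinVar submissions": 3, "Review status": "reviewed by expert panel"},
)
```

Listing the sections explicitly in the prompt makes it easier to check afterwards that the generated report contains every required part.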
Two backends supported:
- Ollama (free, local): runs Llama 3 on your GPU
- Claude API (paid): higher quality output via Anthropic
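A script could choose between these backends and prepare a call to Ollama's REST `/api/generate` endpoint roughly as sketched here. The selection rule (use Claude whenever `ANTHROPIC_API_KEY` is set), the function names, the `llama3` model tag, and the default local port are assumptions for illustration and may not match `demo_report.py`. The request is only constructed, not sent.

```python
import json
import os
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def choose_backend(env=None):
    """Assumed rule: prefer Claude when an API key is present, else local Ollama."""
    env = os.environ if env is None else env
    return "claude" if env.get("ANTHROPIC_API_KEY") else "ollama"


def build_ollama_request(prompt, model="llama3"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ollama_request("Interpret BRCA1 c.5266dupC")
# To send it against a running server:
#   with urllib.request.urlopen(request) as resp:
#       text = json.loads(resp.read())["response"]
```

The commands for running the demo with either backend are: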
# Free: local Llama 3
ollama serve # in another terminal
python demo_report.py
# Or with Claude API
export ANTHROPIC_API_KEY="sk-ant-..."
python demo_report.py "BRAF V600E"
Feature Ablation Study
| Feature Set | AUC-ROC |
|---|---|
| All 40 features | 0.9946 |
| Without gene features | 0.9894 |
| Consequence + LoF only | 0.9722 |
| Gene features only | 0.7820 |
Molecular consequence features independently achieve 0.97 AUC, confirming the model learns biological patterns (LoF → pathogenic, synonymous → benign) rather than memorizing gene-specific statistics.
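An ablation like the one in the table amounts to retraining and scoring the model on different column subsets. The sketch below covers only the column-selection bookkeeping. The group names, the assignment of features to groups, and the `evaluate` callback are hypothetical. The five feature names come from the SHAP list in this README, but the full set of 40 features is not enumerated here.

```python
# Hypothetical grouping; the README does not state which features belong to which group.
FEATURE_GROUPS = {
    "consequence": ["cons_synonymous", "cons_intronic"],
    "lof": ["is_lof"],
    "gene": ["gene_path_ratio"],
    "review": ["num_submitters"],
}


def select_features(include=None, exclude=None):
    """Flatten the chosen groups into a column list, preserving group order."""
    groups = list(FEATURE_GROUPS) if include is None else list(include)
    excluded = set(exclude or [])
    return [
        feature
        for group in groups
        if group not in excluded
        for feature in FEATURE_GROUPS[group]
    ]


ABLATIONS = {
    "All features": select_features(),
    "Without gene features": select_features(exclude=["gene"]),
    "Consequence + LoF only": select_features(include=["consequence", "lof"]),
    "Gene features only": select_features(include=["gene"]),
}


def run_ablation(evaluate):
    """evaluate(columns) -> score; the caller supplies training and scoring."""
    return {name: evaluate(columns) for name, columns in ABLATIONS.items()}
```

Passing the training-and-scoring step in as a callback keeps the experiment definition separate from the choice of model, so the same table can be regenerated for XGBoost, LightGBM, or the ensemble.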
ML Pipeline
The classifier is trained on 1.69 million labeled ClinVar variants (GRCh38) with 40 engineered features across 9 categories: variant type, molecular consequence, loss-of-function flags, allele length, position, chromosome, review quality, gene constraint, and HGVS complexity.
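To make the feature categories concrete, this sketch derives a handful of features from a single variant record. The input field names (`ref`, `alt`, `consequence`, `num_submitters`), the set of consequence terms treated as loss-of-function, and the `allele_len_diff` feature are assumptions for this example. Only `is_lof`, `cons_synonymous`, `cons_intronic`, and `num_submitters` are feature names mentioned in this README.

```python
# Assumed loss-of-function consequence terms; the project's list may differ.
LOF_CONSEQUENCES = {
    "frameshift_variant",
    "stop_gained",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def engineer_features(record):
    """Map one variant record (hypothetical schema) to numeric model features."""
    consequence = record["consequence"]
    return {
        "is_lof": int(consequence in LOF_CONSEQUENCES),
        "cons_synonymous": int(consequence == "synonymous_variant"),
        "cons_intronic": int(consequence == "intron_variant"),
        "allele_len_diff": len(record["alt"]) - len(record["ref"]),
        "num_submitters": int(record.get("num_submitters", 0)),
    }


features = engineer_features(
    {
        "ref": "C",
        "alt": "CC",
        "consequence": "frameshift_variant",
        "num_submitters": 3,
    }
)
```

Binary indicator columns like these are a natural fit for gradient-boosted trees, which split on thresholds and need no feature scaling.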
Project Structure
genomicsgpt/
├── demo_report.py                # LLM demo: generates clinical reports
├── notebooks/
│   └── 03_model_training.ipynb   # Full ML pipeline with visualizations
├── src/genomicsgpt/
│   ├── variant_parser/           # 7-format variant input parser
│   ├── data_aggregator/          # ClinVar + Ensembl API clients
│   ├── ml_engine/                # Pathogenicity classifier
│   ├── llm_engine/               # Clinical narrative generation
│   │   └── report_generator.py   # Ollama + Claude backends
│   ├── rag_engine/               # Literature search (planned)
│   └── models.py                 # Core data models
├── data/models/
│   ├── xgb_model.pkl             # Trained XGBoost model
│   ├── lgb_model.pkl             # Trained LightGBM model
│   ├── metrics.json              # Evaluation metrics
│   └── plots/                    # 7 evaluation visualizations
├── tests/                        # Unit and integration tests
└── train_pipeline.py             # Self-contained training script
Quick Start
# Parse a variant
python -m genomicsgpt parse "BRAF V600E"
# Generate a clinical report (requires Ollama running)
ollama serve # in another terminal
python demo_report.py "BRCA1 c.5266dupC"
# Reproduce the ML training
pip install xgboost lightgbm shap seaborn scikit-learn pandas
python train_pipeline.py
# Run the interactive notebook
jupyter notebook notebooks/03_model_training.ipynb
Tech Stack
- ML: XGBoost, LightGBM, scikit-learn, SHAP
- LLM: Llama 3 (Ollama), Claude API (Anthropic)
- Data: ClinVar (NCBI), Ensembl VEP
- Visualization: matplotlib, seaborn
- APIs: ClinVar E-utilities, Ensembl REST, Ollama REST
- Testing: pytest (38 tests)
Author
Samir Kerkar (github.com/skerk001)
Data Scientist | B.S. Mathematics, UC Irvine





