AI-Powered Variant Prioritisation for Rare Diseases
Extending a WES variant calling pipeline with machine learning to intelligently prioritise Variants of Uncertain Significance (VUS) for rare disease diagnostics.
Project Overview
This project extends the Variant Calling WES Rare Disease pipeline by integrating a Random Forest classifier to score and rank Variants of Uncertain Significance (VUS) by their likelihood of pathogenicity.
Clinical genomicists face a critical challenge: millions of variants are classified as VUS, yet only a fraction are clinically actionable.
This AI layer bridges that gap by learning from 4.1 million labelled ClinVar variants and applying that knowledge to prioritise previously unclassified variants, reducing the diagnostic haystack to a shortlist of high-risk candidates.
Biological Context
Whole Exome Sequencing (WES) generates thousands of variants per patient. Standard variant calling and annotation tools (e.g., GATK, VEP) can call these variants and attach existing clinical classifications (Pathogenic, Benign, or VUS), but the VUS category remains clinically unactionable.
This project uses machine learning to assign a pathogenicity probability score to every VUS, enabling clinicians to prioritise which variants to investigate further.
Tools & Technologies
| Category | Tools |
|---|---|
| Language | Python 3.14 |
| ML Framework | Scikit-learn (Random Forest) |
| Data Processing | Pandas, NumPy |
| Visualisation | Matplotlib |
| Data Source | ClinVar VCF GRCh38 (NCBI) |
| Model Persistence | Joblib |
Data Source
- Database: ClinVar
- Genome Build: GRCh38
- File: clinvar.vcf.gz
- Accessed: February 2026
ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.
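In the ClinVar VCF, the clinical significance of each variant sits in the INFO column under keys such as `CLNSIG` and `CLNDN`. The sketch below shows one way to read the compressed file and pull those fields out using only the standard library. The function names (`parse_info`, `parse_vcf_line`, `read_clinvar`) and the example record are illustrative assumptions, not code taken from the notebook.

```python
import gzip
from typing import Dict, Iterator, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string such as 'CLNSIG=Benign;CLNVC=Indel' into a dict."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style entries carry no '=', so they are stored with an empty value.
        fields[key] = value if sep else ""
    return fields


def parse_vcf_line(line: str) -> Optional[dict]:
    """Turn one VCF data line into a flat record; return None for header lines."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return None
    chrom, pos, _id, ref, alt, _qual, _filter, info = cols[:8]
    fields = parse_info(info)
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "REF": ref,
        "ALT": alt,
        "CLNSIG": fields.get("CLNSIG"),  # clinical significance
        "CLNDN": fields.get("CLNDN"),    # disease name
    }


def read_clinvar(path: str) -> Iterator[dict]:
    """Stream records from a gzip-compressed VCF, e.g. data/clinvar.vcf.gz."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            record = parse_vcf_line(line)
            if record is not None:
                yield record


# Made-up record in ClinVar's layout (the values are not a real ClinVar entry).
example = "\t".join([
    "1", "12345", ".", "G", "A", ".", ".",
    "CLNDN=not_provided;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant",
])
record = parse_vcf_line(example)
print(record)
```

Streaming line by line keeps memory use flat, which matters for a file with millions of records; loading the result into a Pandas DataFrame is a separate step.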
How to Run
- Clone the repository
    git clone https://github.com/Naila-Srivastava/AI-Powered-Variant-Prioritisation-RareDiseases.git
    cd AI-Powered-Variant-Prioritisation-RareDiseases
- Install dependencies
    pip install pandas numpy scikit-learn matplotlib joblib
- Download ClinVar data
    # Download GRCh38 VCF from NCBI ClinVar
    wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz -P data/
- Open the notebook in VS Code or JupyterLab and run all cells.
Methodology
ClinVar VCF (4.3M variants)
│
▼
Data Parsing & INFO field extraction
│
▼
Label Simplification (Pathogenic / Benign / VUS)
│
▼
Feature Engineering
(REF/ALT length, variant type, chromosome, position)
│
▼
Class Balancing (50K Pathogenic + 50K Benign)
│
▼
Random Forest Classifier (100 estimators)
│
▼
VUS Scoring & Ranking (2.45M variants scored)
│
▼
High-Risk VUS List (439,139 variants, score ≥ 0.7)
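The label simplification and feature engineering steps in the flow above could look roughly like the sketch below. The mapping rules (folding "Likely" classes into their parent class, dropping conflicting records) and the integer encodings are assumptions made for illustration; the notebook's exact rules and its full feature list are not spelled out here.

```python
from typing import List, Optional

# Assumed encoding: autosomes 1-22, then X, Y, MT; anything else maps to 0.
CHROM_CODES = {str(i): i for i in range(1, 23)}
CHROM_CODES.update({"X": 23, "Y": 24, "MT": 25})


def simplify_label(clnsig: Optional[str]) -> Optional[str]:
    """Collapse a ClinVar CLNSIG string into Pathogenic / Benign / VUS (or None)."""
    if not clnsig:
        return None
    text = clnsig.lower()
    if "conflicting" in text:
        return None  # assumption: conflicting classifications are excluded
    if "uncertain_significance" in text:
        return "VUS"
    if "pathogenic" in text:
        return "Pathogenic"  # also matches Likely_pathogenic
    if "benign" in text:
        return "Benign"      # also matches Likely_benign
    return None


def variant_type(ref: str, alt: str) -> int:
    """Encode the variant type from allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return 0  # single-nucleotide variant
    if len(alt) > len(ref):
        return 1  # insertion
    if len(alt) < len(ref):
        return 2  # deletion
    return 3      # same-length multi-base substitution


def make_features(record: dict) -> List[int]:
    """Build a numeric feature row: REF length, ALT length, type, chromosome, position."""
    ref, alt = record["REF"], record["ALT"]
    return [
        len(ref),
        len(alt),
        variant_type(ref, alt),
        CHROM_CODES.get(record["CHROM"], 0),
        record["POS"],
    ]


# Made-up deletion on chromosome 17.
row = make_features({"CHROM": "17", "POS": 43000000, "REF": "AT", "ALT": "A"})
print(simplify_label("Pathogenic/Likely_pathogenic"), row)
```

Checking for "conflicting" before "pathogenic" matters, because ClinVar's conflicting-classification label itself contains the word "pathogenicity".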
Model Performance
The model was trained on a balanced dataset of 100,000 variants (50K Pathogenic, 50K Benign) and evaluated on a held-out 20% test set.
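A minimal sketch of this setup (downsampling each class to the same size, holding out 20% for testing, and fitting a 100-tree Random Forest) is shown below on synthetic data. The synthetic features, the `balance` helper, the stratified split, and the `random_state` values are assumptions for illustration; the real pipeline uses the ClinVar-derived features and 50K variants per class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for labelled variants: 5 numeric features,
# imbalanced labels (0 = Benign, 1 = Pathogenic).
n_benign, n_pathogenic = 3000, 600
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_benign, 5)),
    rng.normal(0.7, 1.0, size=(n_pathogenic, 5)),
])
y = np.concatenate([
    np.zeros(n_benign, dtype=int),
    np.ones(n_pathogenic, dtype=int),
])


def balance(X, y, per_class, rng):
    """Randomly downsample every class to `per_class` rows."""
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        keep.append(rng.choice(idx, size=per_class, replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]


X_bal, y_bal = balance(X, y, per_class=500, rng=rng)

# Hold out 20% of the balanced data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (Pathogenic).
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"Test ROC-AUC on synthetic data: {auc:.3f}")
```

Downsampling discards data from the majority class; passing `class_weight="balanced"` to `RandomForestClassifier` is an alternative worth comparing, though it is not what this project reports using.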
Balanced Classification Report:
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Benign | 0.72 | 0.69 | 0.70 |
| Pathogenic | 0.70 | 0.73 | 0.71 |
ROC-AUC: 0.7921
Top Predictive Feature: Genomic position (POS). This may reflect the tendency of pathogenic variants to cluster in biologically constrained regions, consistent with known mutation hotspot patterns.
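For readers who want to check the reported metrics by hand, the sketch below recomputes the F1 scores from the precision and recall values in the classification report, and shows a from-scratch ROC-AUC on a toy example. The `roc_auc` helper and the toy labels and scores are illustrative assumptions; the pairwise loop is quadratic and only meant for small inputs.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is equivalent to the
    area under the ROC curve.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Values from the classification report.
benign_f1 = f1_score(0.72, 0.69)
pathogenic_f1 = f1_score(0.70, 0.73)
print(f"Benign F1: {benign_f1:.2f}, Pathogenic F1: {pathogenic_f1:.2f}")

# Toy example: 3 pathogenic (1) and 3 benign (0) variants with made-up scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"Toy ROC-AUC: {toy_auc:.3f}")
```

In the toy example, 8 of the 9 pathogenic/benign pairs are ordered correctly, so the AUC is 8/9. Read the same way, a ROC-AUC of 0.79 means the model ranks a randomly chosen pathogenic variant above a randomly chosen benign one about 79% of the time.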
Results
| Metric | Value |
|---|---|
| Labelled Pathogenic/Benign variants available for training | 1,678,471 |
| VUS scored | 2,455,002 |
| High-risk VUS identified (score ≥ 0.7) | 439,139 |
| ROC-AUC Score | 0.7921 |
| Pathogenic F1-Score | 0.71 |
| Overall Accuracy | 71% |
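The high-risk list is produced by thresholding the model's pathogenicity score at 0.7 and ranking what remains. A small sketch of that step, with made-up variant identifiers and scores, follows. The `prioritise` function is a hypothetical name, not one from the notebook.

```python
from typing import List, Tuple

HIGH_RISK_THRESHOLD = 0.7  # cut-off reported in the Results table


def prioritise(
    scored: List[Tuple[str, float]],
    threshold: float = HIGH_RISK_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Keep variants scoring at or above the threshold, highest score first."""
    high_risk = [item for item in scored if item[1] >= threshold]
    return sorted(high_risk, key=lambda item: item[1], reverse=True)


# Made-up (variant id, pathogenicity score) pairs.
scored = [
    ("var_a", 0.91),
    ("var_b", 0.42),
    ("var_c", 0.70),
    ("var_d", 0.69),
    ("var_e", 0.83),
]
shortlist = prioritise(scored)
print(shortlist)

# Share of scored VUS that clear the threshold, using the reported counts.
flagged_fraction = 439_139 / 2_455_002
print(f"Flagged as high-risk: {flagged_fraction:.1%}")
```

With the reported counts, roughly 18% of scored VUS pass the 0.7 cut-off, so a stricter threshold or a top-N cut on the ranked list may be needed for manual review.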
Top High-Risk Disease Categories Identified
- Inborn Genetic Diseases
- Hereditary Cancer-Predisposing Syndrome
- Cardiovascular Phenotype
- Hereditary Breast/Ovarian Cancer Syndrome
- PTEN Hamartoma Tumor Syndrome
- Neurofibromatosis Type 1
- Ataxia-Telangiectasia Syndrome
Key Takeaways
- Class imbalance is critical: The initial model performed poorly on the Pathogenic class (F1: 0.50). Balancing the training set to equal Pathogenic and Benign samples raised the Pathogenic F1 to 0.71, showing that real-world genomic datasets need careful handling before ML is applied.
- Genomic position is the strongest predictor: The feature importance analysis showed that chromosomal position (POS) outranks all other features. This is biologically plausible, since pathogenic variants tend to cluster in functionally constrained, evolutionarily conserved regions of the genome.
- VUS prioritisation has real clinical value: Out of 2.45 million VUS, the model flagged 439,139 as high-risk (score ≥ 0.7), with top disease categories including Inborn Genetic Diseases and Hereditary Cancer-Predisposing Syndrome. This kind of shortlisting can help rare disease diagnostic labs reduce the burden of manual review.
- Simple features can go a long way: This model used only 6 features (derived from variant length, type, chromosome, and position) yet achieved a ROC-AUC of 0.79. Adding functional annotation features (e.g., CADD scores, conservation scores, splicing impact) would likely improve performance in future iterations.
- Reproducibility matters: The entire pipeline, from raw VCF parsing to model training and VUS scoring, is contained in a single notebook, making it easy to rerun and extend.
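As a pointer for the feature importance takeaway, the sketch below shows how an importance ranking can be read from a fitted scikit-learn Random Forest. The data are synthetic, and the labelling rule is deliberately built around a made-up positional "hotspot" so that POS comes out on top. It illustrates the mechanics only and does not reproduce the project's result; the feature names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
feature_names = ["REF_LEN", "ALT_LEN", "VAR_TYPE", "CHROM", "POS"]

# Synthetic feature matrix with the same kinds of columns as the pipeline.
n = 2000
X = np.column_stack([
    rng.integers(1, 5, n),          # REF allele length
    rng.integers(1, 5, n),          # ALT allele length
    rng.integers(0, 4, n),          # encoded variant type
    rng.integers(1, 26, n),         # encoded chromosome
    rng.integers(1, 1_000_000, n),  # genomic position
])

# Toy rule: variants inside a made-up hotspot window are labelled pathogenic.
y = ((X[:, 4] > 400_000) & (X[:, 4] < 600_000)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>8}: {importance:.3f}")
```

Impurity-based importances tend to favour features with many distinct values, such as a raw position, so a cross-check with `sklearn.inspection.permutation_importance` on held-out data is a reasonable follow-up.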
References
- Python 3.14: Python Software Foundation. (2025). Python 3.14.3. https://www.python.org
- Visual Studio Code: Microsoft. (2024). Visual Studio Code (VS Code). https://code.visualstudio.com
- ClinVar: Landrum MJ, et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research, 46(D1), D1062–D1067. https://www.ncbi.nlm.nih.gov/clinvar/
- Scikit-learn: Pedregosa F, et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://scikit-learn.org
- Pandas: McKinney W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference. https://pandas.pydata.org
- Matplotlib: Hunter JD. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90–95. https://matplotlib.org
Project Structure
variant_project/
├── data/                              # ClinVar GRCh38 variant database (too big to upload)
├── 01_data_exploration.ipynb          # Full pipeline notebook
├── models/
│   └── rf_variant_classifier.pkl      # Trained Random Forest model (too big to upload)
└── results/
    ├── model_performance.png          # ROC curve, confusion matrix, feature importance
    ├── vus_distribution.png           # VUS score distribution & top diseases
    └── vus_prioritised.csv            # Full ranked VUS output (too big to upload)
License
This project is open source and available under the Apache License.