
🧬 ClinVar Protein Variant Pathogenicity Prediction

Python
License
XGBoost
scikit-learn

English | 中文


English Version

Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database

This project develops machine learning models to predict the pathogenicity of missense mutations in protein-coding regions, using approximately 6.2 million variant records from the ClinVar database.


📖 Project Overview

Background

Predicting the pathogenicity of missense mutations is a critical challenge in precision medicine and genetic disease diagnosis. The ClinVar database contains extensive clinical interpretation of human variants, yet approximately 50% remain classified as "variants of uncertain significance (VUS)," necessitating computational prediction methods.

Objectives

  1. Construct a high-quality modeling dataset from ClinVar with 20 biologically meaningful protein mutation features
  2. Systematically evaluate 7 imbalanced data handling strategies across 4 machine learning models
  3. Develop an interpretable, high-performance pathogenicity prediction model

🎯 Key Contributions

| Contribution | Description |
|---|---|
| Large-scale Empirical Study | Systematic comparison of 28 model-strategy combinations on ~6.2 million samples |
| Novel Finding on Sampling | Demonstrated that class_weight outperforms SMOTE and other oversampling methods |
| Optimal Model Pipeline | XGBoost + threshold optimization achieves F1=0.6717, AUC=0.9092 |
| Interpretability Validation | Evolutionary conservation features (BLOSUM62, Grantham distance) identified as key predictors |

📊 Dataset

| Category | Samples | Proportion |
|---|---|---|
| Pathogenic variants | 311,103 | 4.97% |
| Non-pathogenic variants | 5,950,874 | 95.03% |
| Total | 6,261,977 | 100% |

Data Source: ClinVar Database (December 2025 release)

The dataset exhibits severe class imbalance (1:19 ratio), reflecting the rarity of pathogenic variants in the human genome.

Sample Distribution

Train/Test Split: 80%/20% with stratified sampling (random_state=42)

  • Training set: 5,009,581 samples (248,882 pathogenic)
  • Test set: 1,252,396 samples (62,221 pathogenic)
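
The split above can be sketched with scikit-learn's stratified `train_test_split` (assuming the project's sklearn-based pipeline); the toy arrays below stand in for the real ~6.2M-row feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature matrix and labels
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 5 + [0] * 95)  # ~5% positives, mirroring the class imbalance

# 80%/20% stratified split with the fixed seed used in the project
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Stratification preserves the positive-class proportion in both halves
print(y_tr.mean(), y_te.mean())  # 0.05 0.05
```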

🔬 Feature Engineering

We extracted 20 protein mutation features across 5 dimensions:

| Dimension | Features | Description |
|---|---|---|
| Evolutionary | blosum62_score | Amino acid substitution conservation score |
| Physicochemical | grantham_distance, delta_hydrophobicity, delta_charge, delta_volume, delta_polarity, delta_mw, is_nonsense, aromatic_change, aliphatic_change | Changes in amino acid properties |
| Positional | relative_position, distance_to_terminus, is_terminal | Location within protein sequence |
| Structural | delta_alpha_propensity, delta_beta_propensity | Secondary structure propensity changes |
| Contextual | context_hydrophobicity_mean, context_charge_sum, context_aromatic_count, context_proline_count, context_glycine_count | Surrounding sequence environment |
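
As an illustration of how the physicochemical "delta" features can be derived, here is a small sketch using the standard Kyte-Doolittle hydropathy scale and average residue masses for a handful of amino acids; the `delta_features` helper and its subset of residues are illustrative, not the project's code:

```python
# Standard property scales (subset; the real pipeline covers all 20 residues)
HYDROPATHY = {"R": -4.5, "L": 3.8, "I": 4.5, "W": -0.9}  # Kyte-Doolittle
MASS_DA = {"R": 156.19, "L": 113.16, "I": 113.16, "W": 186.21}  # avg residue mass
AROMATIC = {"W", "F", "Y"}

def delta_features(ref: str, alt: str) -> dict:
    """Property changes for a ref->alt substitution (standard residues only;
    nonsense changes, alt == '*', are flagged separately in the pipeline)."""
    return {
        "delta_hydrophobicity": HYDROPATHY[alt] - HYDROPATHY[ref],
        "delta_mw": MASS_DA[alt] - MASS_DA[ref],
        "aromatic_change": int((ref in AROMATIC) != (alt in AROMATIC)),
    }

print(delta_features("R", "W"))
```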

Mutation Pattern Analysis

Mutation Heatmap

Key observations:

  • Pathogenic variants: Arginine (R) is the most common source amino acid; nonsense mutations (→*) are highly enriched
  • Non-pathogenic variants: More dispersed pattern; conservative substitutions (e.g., I↔L, V↔I) are common

Feature Distribution Comparison

Feature Distributions

The kernel density plots show clear separation between pathogenic (red) and non-pathogenic (green) variants for evolutionary and physicochemical features.

Feature Distribution Comparison (Boxplots)

Top Feature Boxplots

The boxplots show the distribution comparison of the Top 6 most discriminative features between pathogenic (red) and non-pathogenic (green) variants.

Statistical Effect Size Analysis

We used Mann-Whitney U test (non-parametric, suitable for non-normal distributions) to assess group differences. Given the massive sample size (N > 6 million), p-values are easily driven to near-zero and have limited practical value. Therefore, we report the rank-biserial correlation coefficient (r) as the effect size measure.
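
The rank-biserial coefficient follows directly from the Mann-Whitney U statistic; a minimal self-contained sketch (the helper and toy values are illustrative, not project data):

```python
def rank_biserial(group1, group2):
    """Rank-biserial effect size r = 2*U1/(n1*n2) - 1, where U1 counts pairs
    in which a group1 value exceeds a group2 value (ties count 0.5).
    r > 0 means group1 values tend to be larger; |r| is the effect size."""
    u1 = sum((a > b) + 0.5 * (a == b) for a in group1 for b in group2)
    return 2.0 * u1 / (len(group1) * len(group2)) - 1.0

# Toy Grantham-distance samples: pathogenic variants tend to score higher
pathogenic = [140, 180, 160]
benign = [40, 60, 150]
print(rank_biserial(pathogenic, benign))  # 0.7777...
```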

Features by Effect Size Category:

| Effect Size | Criterion | Features |
|---|---|---|
| Large | \|r\| ≥ 0.5 | blosum62_score (r=0.73), grantham_distance (r=0.72), delta_beta_propensity (r=0.53), is_nonsense (r=0.52), delta_alpha_propensity (r=0.52), delta_mw (r=0.51) |
| Medium | 0.3 ≤ \|r\| < 0.5 | delta_volume (r=0.50), delta_polarity (r=0.48) |
| Small | 0.1 ≤ \|r\| < 0.3 | delta_hydrophobicity (r=0.23), aromatic_change (r=0.13) |
| Negligible | \|r\| < 0.1 | delta_charge, aliphatic_change, relative_position, distance_to_terminus, is_terminal, context_* features |

Note: Some features (e.g., delta_charge) have extremely small p-values but negligible effect sizes because the two groups have highly overlapping distributions—for example, 58.8% of pathogenic and 70.0% of non-pathogenic samples have delta_charge = 0.

Feature Correlation Matrix

Correlation Matrix

Key correlations:

  • BLOSUM62 score vs Grantham distance: r ≈ -0.88 (strong negative)
  • delta_volume vs delta_mw: r ≈ 0.91 (strong positive)

BLOSUM62 vs Grantham Distance

BLOSUM62 Grantham Scatter

The strong negative correlation confirms that evolutionarily conservative substitutions typically involve physicochemically similar amino acids.


🧪 Methods

Models Evaluated

| Model | Description | Characteristics |
|---|---|---|
| Logistic Regression (LR) | Linear baseline | Fast, interpretable |
| Decision Tree (DT) | Single tree classifier | Simple, prone to overfitting |
| Random Forest (RF) | Bagging ensemble | Robust, distributed importance |
| XGBoost | Gradient boosting | GPU acceleration, high performance |

Imbalanced Data Strategies

| Strategy | Description | Mechanism |
|---|---|---|
| none | No handling (baseline) | - |
| class_weight | Weighted loss function | Implicit sample weighting |
| SMOTE | Synthetic minority oversampling | Linear interpolation |
| ADASYN | Adaptive synthetic sampling | Difficulty-based weighting |
| Borderline-SMOTE | Boundary-focused oversampling | Focus on decision boundary |
| Random Undersample | Majority class downsampling | Data reduction |
| SMOTE-Tomek | Combined over/undersampling | Noise removal |

Ensemble Methods

  • Soft Voting: Probability averaging across models
  • Stacking: Two-layer meta-learning with LR meta-classifier
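
Both ensembles can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`; this is a minimal stand-in on synthetic imbalanced data with small base models (XGBoost omitted here for brevity), not the project's tuned pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the variant feature matrix (~10% positives)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9],
                           random_state=42)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Soft voting: average the predicted probabilities of the base models
voting = VotingClassifier(estimators=base, voting="soft").fit(X, y)

# Stacking: base-model probabilities feed a logistic-regression meta-learner
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression()).fit(X, y)

print(voting.predict_proba(X[:2]), stack.predict_proba(X[:2]))
```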

📈 Results

28 Model-Strategy Combinations

Heatmap F1

Key Finding: class_weight and none strategies consistently outperform oversampling methods on this large-scale dataset.

F1 Score Comparison Table:

| Strategy | LR | DT | RF | XGBoost |
|---|---|---|---|---|
| none | 0.6579 | 0.5184 | 0.6598 | 0.6597 |
| class_weight | 0.6579 | 0.5165 | 0.6640 | 0.1952* |
| smote | 0.3457 | 0.5119 | 0.4598 | 0.6469 |
| adasyn | 0.2174 | 0.5047 | 0.3033 | 0.5926 |
| borderline_smote | 0.2246 | 0.4957 | 0.3234 | 0.4803 |
| undersample | 0.3475 | 0.2355 | 0.3874 | 0.3747 |
| smote_tomek | 0.3423 | 0.5107 | 0.4584 | 0.6462 |

*Note: XGBoost's poor performance with class_weight is due to excessive scale_pos_weight causing over-prediction of positive class.
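
The arithmetic behind this note can be checked directly. A common balancing heuristic (assumed here) sets `scale_pos_weight` to the negative/positive class ratio, which is very large on this dataset:

```python
# Class counts from the dataset table above
n_pos, n_neg = 311_103, 5_950_874

# Balancing heuristic: scale_pos_weight = negatives / positives
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 1))  # 19.1
```

Each positive sample then carries roughly 19x the loss weight of a negative one, which pushes XGBoost toward over-predicting the positive class.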

Class Weight Sensitivity Analysis

Class Weight Tuning

Sensitivity Analysis Results:

  • LR & DT: Highly robust to weight parameter changes
  • RF: Stable at α ∈ [0.1, 2.5], slight decline when α > 3
  • XGBoost: Highly sensitive - performance drops sharply when α > 1

Optimal setting: α = 1.0 (no additional weighting)

Final Model Performance (After Hyperparameter Tuning & Threshold Optimization)

| Model | Threshold | Precision | Recall | F1 | AUC-ROC | AP |
|---|---|---|---|---|---|---|
| XGBoost (Tuned) | 0.30 | 0.8315 | 0.5634 | 0.6717 | 0.9092 | 0.6542 |
| Voting Ensemble | 0.23 | 0.8445 | 0.5555 | 0.6702 | 0.9089 | 0.6512 |
| Stacking | 0.36 | 0.8434 | 0.5552 | 0.6696 | 0.9108 | 0.6572 |
| Random Forest | 0.51 | 0.8500 | 0.5494 | 0.6675 | 0.9060 | 0.6491 |
| Decision Tree | 0.19 | 0.8763 | 0.5267 | 0.6579 | 0.8664 | 0.5666 |
| Logistic Regression | 0.26 | 0.8763 | 0.5267 | 0.6579 | 0.8790 | 0.5873 |

Key Insights:

  • XGBoost achieves best F1 score (0.6717) after threshold optimization from 0.50 → 0.30
  • Stacking has highest AUC-ROC (0.9108) but marginal F1 improvement over single models
  • All models achieve >97% accuracy due to class imbalance, but F1 is the primary metric
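
Threshold optimization of this kind amounts to a grid scan over candidate cutoffs, keeping the one that maximizes F1; a minimal pure-Python sketch (the `best_f1_threshold` helper and toy scores are illustrative):

```python
def best_f1_threshold(probs, labels, grid=None):
    """Scan candidate thresholds and return (threshold, F1) maximizing F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best = (0.5, 0.0)
    for t in grid:
        tp = sum(p >= t and l == 1 for p, l in zip(probs, labels))
        fp = sum(p >= t and l == 0 for p, l in zip(probs, labels))
        fn = sum(p < t and l == 1 for p, l in zip(probs, labels))
        if tp:
            f1 = 2 * tp / (2 * tp + fp + fn)
            if f1 > best[1]:
                best = (t, f1)
    return best

# Toy scores: lowering the cutoff below 0.5 recovers the weaker positives
probs = [0.9, 0.6, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0]
print(best_f1_threshold(probs, labels))
```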

ROC and PR Curves

Feature Importance Analysis: RF vs XGBoost vs DT

Top 5 Feature Importance Comparison:

| Rank | Random Forest | Importance | XGBoost | Importance | Decision Tree | Importance |
|---|---|---|---|---|---|---|
| 1 | is_nonsense | 16.3% | blosum62_score | 94.7% | is_nonsense | 97.7% |
| 2 | delta_mw | 15.7% | grantham_distance | 0.9% | blosum62_score | 1.4% |
| 3 | blosum62_score | 14.3% | delta_hydrophobicity | 0.4% | grantham_distance | 0.5% |
| 4 | grantham_distance | 10.9% | delta_beta_propensity | 0.4% | delta_mw | 0.2% |
| 5 | delta_volume | 7.6% | context_proline_count | 0.3% | delta_volume | 0.1% |

Algorithm-specific Explanations:

  • RF's Distributed Importance: Based on average Gini impurity reduction across 400 trees with random feature subsets
  • XGBoost's Concentrated Importance: "Gain"-based importance; blosum62_score dominates split decisions (94.67%)
  • DT's Extreme Concentration: Limited tree depth (5 layers) concentrates importance on root node feature

Biological Validation:

  • Evolutionary conservation features (blosum62_score + grantham_distance) account for >41% of effect size among top 6 features
  • Pathogenic variants: BLOSUM62 score mean = -5.73 vs Non-pathogenic = +0.97
  • Pathogenic variants: Grantham distance mean = 157 vs Non-pathogenic = 55

Confusion Matrix Analysis

Confusion Matrix

XGBoost (Threshold=0.30) Clinical Interpretation:

| Metric | Value | Interpretation |
|---|---|---|
| True Positives | 35,049 | Correctly identified pathogenic variants |
| False Negatives | 27,172 | Missed pathogenic variants (43.66% miss rate) |
| False Positives | 6,985 | Incorrectly flagged as pathogenic |
| True Negatives | 1,183,190 | Correctly identified non-pathogenic |
| Sensitivity | 56.34% | Proportion of pathogenic detected |
| Specificity | 99.41% | Proportion of non-pathogenic correctly excluded |

The high specificity (99.41%) makes the model suitable for initial screening, with low false positive burden.
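
The headline rates follow directly from the confusion-matrix counts:

```python
# Confusion-matrix counts for XGBoost at threshold 0.30 (from the table above)
tp, fn, fp, tn = 35_049, 27_172, 6_985, 1_183_190

sensitivity = tp / (tp + fn)  # share of pathogenic variants detected
specificity = tn / (tn + fp)  # share of non-pathogenic correctly excluded

print(f"sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
# sensitivity=0.5633, specificity=0.9941
```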


🚀 Quick Start

Environment Setup

```bash
# Clone the repository
git clone https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction.git
cd ClinVar-Pathogenicity-Prediction

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

Running the Notebooks

  1. Data Processing: notebooks/clinvar_data_processing_v3.ipynb

    • Downloads and processes ClinVar data
    • Extracts 20 protein mutation features
    • Generates train/test splits
  2. Model Building: notebooks/model_building_v3.ipynb

    • Trains 4 models with 7 strategies (28 combinations)
    • Hyperparameter tuning with GridSearchCV
    • Ensemble methods (Voting, Stacking)
    • Threshold optimization
    • Feature importance analysis

📁 Project Structure

```text
ClinVar-Pathogenicity-Prediction/
├── README.md                           # Project documentation (bilingual)
├── LICENSE                             # MIT License
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore rules
│
├── data/                               # Data directory
│   └── README.md                       # Data acquisition guide
│
├── notebooks/                          # Jupyter Notebooks
│   ├── clinvar_data_processing_v3.ipynb   # Data processing pipeline
│   └── model_building_v3.ipynb            # Model training & evaluation
│
└── results/                            # Output results
    └── figures/                        # Visualization figures
```

Note: Source code utilities (clinvar_utils.py, model_utils_v3.py) are not included in this repository. Please contact the author if needed.


📝 Citation

If this project is helpful for your research, please cite:

```bibtex
@misc{li2026clinvar,
  author = {Jinpeng Li},
  title = {Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction}
}
```

📜 License

This project is licensed under the MIT License.


🙏 Acknowledgments

  • ClinVar Database - Variant annotation data source
  • UniProt - Protein sequence database
  • Course: Python Programming for Health Data Science

📧 Contact


中文版本

基于机器学习的ClinVar蛋白质变异致病性预测研究

本项目开发了机器学习模型,用于预测蛋白质编码区域错义突变的致病性,使用了来自ClinVar数据库的约620万条变异记录。


📖 项目概述

研究背景

预测错义突变的致病性是精准医学和遗传病诊断的关键挑战。ClinVar数据库包含大量人类变异的临床解读信息,但约50%的变异仍被标记为"意义不明(VUS)",亟需计算方法辅助预测。

研究目标

  1. 从ClinVar构建包含20个具有生物学意义的蛋白质突变特征的高质量建模数据集
  2. 系统评估7种不平衡数据处理策略在4种机器学习模型上的效果
  3. 开发可解释的高性能致病性预测模型

🎯 主要贡献

| 贡献 | 描述 |
|---|---|
| 大规模实证研究 | 在约620万样本上系统对比28种模型-策略组合 |
| 采样策略新发现 | 证明class_weight优于SMOTE等过采样方法 |
| 最优模型方案 | XGBoost + 阈值优化达到 F1=0.6717, AUC=0.9092 |
| 可解释性验证 | 进化保守性特征(BLOSUM62、Grantham距离)被识别为关键预测因子 |

📊 数据集

| 类别 | 样本数 | 占比 |
|---|---|---|
| 致病性变异 | 311,103 | 4.97% |
| 非致病性变异 | 5,950,874 | 95.03% |
| 总计 | 6,261,977 | 100% |

数据来源: ClinVar数据库(2025年12月版本)

数据集呈现严重的类别不平衡(1:19比例),反映了人类基因组中致病变异的罕见性。

样本分布

训练/测试集划分: 80%/20%分层抽样(random_state=42)

  • 训练集:5,009,581个样本(致病性248,882个)
  • 测试集:1,252,396个样本(致病性62,221个)

🔬 特征工程

我们从5个维度提取了20个蛋白质突变特征

| 维度 | 特征 | 描述 |
|---|---|---|
| 进化特征 | blosum62_score | 氨基酸替换保守性得分 |
| 物化特征 | grantham_distance, delta_hydrophobicity, delta_charge, delta_volume, delta_polarity, delta_mw, is_nonsense, aromatic_change, aliphatic_change | 氨基酸理化性质变化 |
| 位置特征 | relative_position, distance_to_terminus, is_terminal | 蛋白质序列中的位置 |
| 结构特征 | delta_alpha_propensity, delta_beta_propensity | 二级结构倾向性变化 |
| 上下文特征 | context_hydrophobicity_mean, context_charge_sum, context_aromatic_count, context_proline_count, context_glycine_count | 周围序列环境 |

突变模式分析

突变热力图

关键观察:

  • 致病变异: 精氨酸(R)是最常见的突变起始氨基酸;无义突变(→*)高度富集
  • 非致病变异: 模式更分散;保守替换(如I↔L、V↔I)更常见

特征分布对比

特征分布

核密度图显示致病变异(红色)和非致病变异(绿色)在进化和物化特征上有明显分离。

特征分布对比(箱线图)

Top特征箱线图

箱线图展示了Top 6 最具区分能力特征在致病变异(红色)和非致病变异(绿色)之间的分布对比。

统计效应量分析

我们使用 Mann-Whitney U 检验(非参数检验,适用于非正态分布数据)评估组间差异。由于样本量极大(N > 600万),p值易被驱动至接近零而缺乏实际参考价值。因此,我们报告 rank-biserial 相关系数 (r) 作为效应量指标。

按效应量等级划分的特征:

| 效应量等级 | 判定标准 | 特征 |
|---|---|---|
| 大效应 | \|r\| ≥ 0.5 | blosum62_score (r=0.73)、grantham_distance (r=0.72)、delta_beta_propensity (r=0.53)、is_nonsense (r=0.52)、delta_alpha_propensity (r=0.52)、delta_mw (r=0.51) |
| 中效应 | 0.3 ≤ \|r\| < 0.5 | delta_volume (r=0.50)、delta_polarity (r=0.48) |
| 小效应 | 0.1 ≤ \|r\| < 0.3 | delta_hydrophobicity (r=0.23)、aromatic_change (r=0.13) |
| 可忽略 | \|r\| < 0.1 | delta_charge、aliphatic_change、relative_position、distance_to_terminus、is_terminal、context_* 特征 |

注意: 部分特征(如 delta_charge)虽然p值极小但效应量可忽略,这是因为两组样本分布高度重叠——例如,58.8%的致病和70.0%的非致病样本的 delta_charge = 0。

特征相关性矩阵

相关性矩阵

主要相关性:

  • BLOSUM62得分 vs Grantham距离:r ≈ -0.88(强负相关)
  • delta_volume vs delta_mw:r ≈ 0.91(强正相关)

BLOSUM62 vs Grantham距离

BLOSUM62 Grantham散点图

强负相关证实了进化保守的替换通常涉及理化性质相近的氨基酸。


🧪 方法

评估的模型

| 模型 | 描述 | 特点 |
|---|---|---|
| 逻辑回归 (LR) | 线性基准 | 快速、可解释 |
| 决策树 (DT) | 单棵决策树 | 简单、易过拟合 |
| 随机森林 (RF) | Bagging集成 | 鲁棒、重要性分布均匀 |
| XGBoost | 梯度提升 | GPU加速、高性能 |

不平衡数据处理策略

| 策略 | 描述 | 机制 |
|---|---|---|
| none | 无处理(基准) | - |
| class_weight | 加权损失函数 | 隐式样本加权 |
| SMOTE | 合成少数类过采样 | 线性插值 |
| ADASYN | 自适应合成采样 | 基于难度加权 |
| Borderline-SMOTE | 边界聚焦过采样 | 关注决策边界 |
| Random Undersample | 随机欠采样 | 数据缩减 |
| SMOTE-Tomek | 组合过/欠采样 | 噪声去除 |

集成方法

  • Soft Voting: 跨模型概率平均
  • Stacking: 两层元学习,使用LR作为元分类器

📈 结果

28种模型-策略组合

热力图F1

关键发现: class_weight 和 none 策略在这个大规模数据集上始终优于过采样方法。

F1分数对比表:

| 策略 | LR | DT | RF | XGBoost |
|---|---|---|---|---|
| none | 0.6579 | 0.5184 | 0.6598 | 0.6597 |
| class_weight | 0.6579 | 0.5165 | 0.6640 | 0.1952* |
| smote | 0.3457 | 0.5119 | 0.4598 | 0.6469 |
| adasyn | 0.2174 | 0.5047 | 0.3033 | 0.5926 |
| borderline_smote | 0.2246 | 0.4957 | 0.3234 | 0.4803 |
| undersample | 0.3475 | 0.2355 | 0.3874 | 0.3747 |
| smote_tomek | 0.3423 | 0.5107 | 0.4584 | 0.6462 |

*注:XGBoost在class_weight下表现差是因为scale_pos_weight过高导致正类过度预测。

类别权重敏感性分析

类别权重调优

敏感性分析结果:

  • LR & DT: 对权重参数变化高度鲁棒
  • RF: 在α ∈ [0.1, 2.5]时稳定,α > 3时略有下降
  • XGBoost: 高度敏感 - α > 1时性能急剧下降

最优设置: α = 1.0(无额外加权)

最终模型性能(超参数调优和阈值优化后)

| 模型 | 阈值 | 精确率 | 召回率 | F1 | AUC-ROC | AP |
|---|---|---|---|---|---|---|
| XGBoost (调优) | 0.30 | 0.8315 | 0.5634 | 0.6717 | 0.9092 | 0.6542 |
| Voting集成 | 0.23 | 0.8445 | 0.5555 | 0.6702 | 0.9089 | 0.6512 |
| Stacking | 0.36 | 0.8434 | 0.5552 | 0.6696 | 0.9108 | 0.6572 |
| 随机森林 | 0.51 | 0.8500 | 0.5494 | 0.6675 | 0.9060 | 0.6491 |
| 决策树 | 0.19 | 0.8763 | 0.5267 | 0.6579 | 0.8664 | 0.5666 |
| 逻辑回归 | 0.26 | 0.8763 | 0.5267 | 0.6579 | 0.8790 | 0.5873 |

关键洞察:

  • XGBoost在阈值从0.50优化到0.30后达到最佳F1分数(0.6717)
  • Stacking具有最高AUC-ROC(0.9108),但F1相对单模型提升有限
  • 由于类别不平衡,所有模型准确率>97%,但F1是主要评估指标

ROC和PR曲线

特征重要性分析:RF vs XGBoost vs DT

Top 5特征重要性对比:

| 排名 | 随机森林 | 重要性 | XGBoost | 重要性 | 决策树 | 重要性 |
|---|---|---|---|---|---|---|
| 1 | is_nonsense | 16.3% | blosum62_score | 94.7% | is_nonsense | 97.7% |
| 2 | delta_mw | 15.7% | grantham_distance | 0.9% | blosum62_score | 1.4% |
| 3 | blosum62_score | 14.3% | delta_hydrophobicity | 0.4% | grantham_distance | 0.5% |
| 4 | grantham_distance | 10.9% | delta_beta_propensity | 0.4% | delta_mw | 0.2% |
| 5 | delta_volume | 7.6% | context_proline_count | 0.3% | delta_volume | 0.1% |

算法特性解释:

  • RF的分布式重要性: 基于400棵树中Gini不纯度减少的平均值
  • XGBoost的集中重要性: 基于"增益"的重要性;blosum62_score主导分裂决策(94.67%)
  • DT的极端集中: 有限的树深度(5层)将重要性集中在根节点特征上

生物学验证:

  • 进化保守性特征(blosum62_score + grantham_distance)在Top 6特征中效应量占比 >41%
  • 致病变异:BLOSUM62得分均值 = -5.73 vs 非致病 = +0.97
  • 致病变异:Grantham距离均值 = 157 vs 非致病 = 55

混淆矩阵分析

混淆矩阵

XGBoost(阈值=0.30)临床解读:

| 指标 | 数值 | 解释 |
|---|---|---|
| 真阳性 | 35,049 | 正确识别的致病变异 |
| 假阴性 | 27,172 | 漏诊的致病变异(漏诊率43.66%) |
| 假阳性 | 6,985 | 错误标记为致病 |
| 真阴性 | 1,183,190 | 正确识别的非致病变异 |
| 敏感性 | 56.34% | 检测到的致病变异比例 |
| 特异性 | 99.41% | 正确排除的非致病变异比例 |

高特异性(99.41%)使该模型适合初步筛查,假阳性负担低。


🚀 快速开始

环境配置

```bash
# 克隆仓库
git clone https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction.git
cd ClinVar-Pathogenicity-Prediction

# 创建虚拟环境(推荐)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# 安装依赖
pip install -r requirements.txt
```

运行Notebook

  1. 数据处理: notebooks/clinvar_data_processing_v3.ipynb

    • 下载和处理ClinVar数据
    • 提取20个蛋白质突变特征
    • 生成训练/测试集划分
  2. 模型构建: notebooks/model_building_v3.ipynb

    • 用7种策略训练4个模型(28种组合)
    • 使用GridSearchCV进行超参数调优
    • 集成方法(Voting、Stacking)
    • 阈值优化
    • 特征重要性分析

📁 项目结构

```text
ClinVar-Pathogenicity-Prediction/
├── README.md                           # 项目文档(双语)
├── LICENSE                             # MIT许可证
├── requirements.txt                    # Python依赖
├── .gitignore                          # Git忽略规则
│
├── data/                               # 数据目录
│   └── README.md                       # 数据获取指南
│
├── notebooks/                          # Jupyter Notebooks
│   ├── clinvar_data_processing_v3.ipynb   # 数据处理流程
│   └── model_building_v3.ipynb            # 模型训练与评估
│
└── results/                            # 输出结果
    └── figures/                        # 可视化图表
```

注意: 源代码工具函数(clinvar_utils.py 和 model_utils_v3.py)未包含在此仓库中。如需获取,请联系作者。


📝 引用

如果本项目对您的研究有帮助,请引用:

```bibtex
@misc{li2026clinvar,
  author = {Jinpeng Li},
  title = {Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction}
}
```

📜 许可证

本项目采用 MIT许可证 开源。


🙏 致谢

  • ClinVar数据库 - 变异注释数据来源
  • UniProt - 蛋白质序列数据库
  • 课程:健康数据科学的Python语言编程基础

📧 联系方式


⭐ 如果您觉得这个项目有用,请给它一个星标!