
🧬 ClinVar Protein Variant Pathogenicity Prediction

Python
License
XGBoost
scikit-learn

English | 中文


English Version

Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database

This project develops machine learning models to predict the pathogenicity of missense mutations in protein-coding regions, using approximately 6.2 million variant records from the ClinVar database.


📖 Project Overview

Background

Predicting the pathogenicity of missense mutations is a critical challenge in precision medicine and genetic disease diagnosis. The ClinVar database contains extensive clinical interpretation of human variants, yet approximately 50% remain classified as "variants of uncertain significance (VUS)," necessitating computational prediction methods.

Objectives

  1. Construct a high-quality modeling dataset from ClinVar with 20 biologically meaningful protein mutation features
  2. Systematically evaluate 7 imbalanced data handling strategies across 4 machine learning models
  3. Develop an interpretable, high-performance pathogenicity prediction model

🎯 Key Contributions

| Contribution | Description |
|---|---|
| Large-scale Empirical Study | Systematic comparison of 28 model-strategy combinations on ~6.2 million samples |
| Novel Finding on Sampling | Demonstrated that class_weight outperforms SMOTE and other oversampling methods |
| Optimal Model Pipeline | XGBoost + threshold optimization achieves F1=0.6717, AUC=0.9092 |
| Interpretability Validation | Evolutionary conservation features (BLOSUM62, Grantham distance) identified as key predictors |

📊 Dataset

| Category | Samples | Proportion |
|---|---|---|
| Pathogenic variants | 311,103 | 4.97% |
| Non-pathogenic variants | 5,950,874 | 95.03% |
| Total | 6,261,977 | 100% |

Data Source: ClinVar Database (December 2025 release)

The dataset exhibits severe class imbalance (1:19 ratio), reflecting the rarity of pathogenic variants in the human genome.

Sample Distribution

Train/Test Split: 80%/20% with stratified sampling (random_state=42)

  • Training set: 5,009,581 samples (248,882 pathogenic)
  • Test set: 1,252,396 samples (62,221 pathogenic)
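
The split above can be sketched with scikit-learn's stratified `train_test_split` (assuming the project's sklearn-based pipeline); the toy arrays below stand in for the real ~6.2M-row feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature matrix and labels
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 5 + [0] * 95)  # ~5% positives, mirroring the class imbalance

# 80%/20% stratified split with the fixed seed used in the project
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Stratification preserves the positive-class proportion in both halves
print(y_tr.mean(), y_te.mean())  # 0.05 0.05
```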

🔬 Feature Engineering

We extracted 20 protein mutation features across 5 dimensions:

| Dimension | Features | Description |
|---|---|---|
| Evolutionary | blosum62_score | Amino acid substitution conservation score |
| Physicochemical | grantham_distance, delta_hydrophobicity, delta_charge, delta_volume, delta_polarity, delta_mw, is_nonsense, aromatic_change, aliphatic_change | Changes in amino acid properties |
| Positional | relative_position, distance_to_terminus, is_terminal | Location within protein sequence |
| Structural | delta_alpha_propensity, delta_beta_propensity | Secondary structure propensity changes |
| Contextual | context_hydrophobicity_mean, context_charge_sum, context_aromatic_count, context_proline_count, context_glycine_count | Surrounding sequence environment |
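
As an illustration of how the physicochemical "delta" features can be derived, here is a small sketch using the standard Kyte-Doolittle hydropathy scale and average residue masses for a handful of amino acids; the `delta_features` helper and its subset of residues are illustrative, not the project's code:

```python
# Standard property scales (subset; the real pipeline covers all 20 residues)
HYDROPATHY = {"R": -4.5, "L": 3.8, "I": 4.5, "W": -0.9}  # Kyte-Doolittle
MASS_DA = {"R": 156.19, "L": 113.16, "I": 113.16, "W": 186.21}  # avg residue mass
AROMATIC = {"W", "F", "Y"}

def delta_features(ref: str, alt: str) -> dict:
    """Property changes for a ref->alt substitution (standard residues only;
    nonsense changes, alt == '*', are flagged separately in the pipeline)."""
    return {
        "delta_hydrophobicity": HYDROPATHY[alt] - HYDROPATHY[ref],
        "delta_mw": MASS_DA[alt] - MASS_DA[ref],
        "aromatic_change": int((ref in AROMATIC) != (alt in AROMATIC)),
    }

print(delta_features("R", "W"))
```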

Mutation Pattern Analysis

Mutation Heatmap

Key observations:

  • Pathogenic variants: Arginine (R) is the most common source amino acid; nonsense mutations (→*) are highly enriched
  • Non-pathogenic variants: More dispersed pattern; conservative substitutions (e.g., I↔L, V↔I) are common

Feature Distribution Comparison

Feature Distributions

The kernel density plots show clear separation between pathogenic (red) and non-pathogenic (green) variants for evolutionary and physicochemical features.

Feature Distribution Comparison (Boxplots)

Top Feature Boxplots

The boxplots show the distribution comparison of the Top 6 most discriminative features between pathogenic (red) and non-pathogenic (green) variants.

Statistical Effect Size Analysis

We used Mann-Whitney U test (non-parametric, suitable for non-normal distributions) to assess group differences. Given the massive sample size (N > 6 million), p-values are easily driven to near-zero and have limited practical value. Therefore, we report the rank-biserial correlation coefficient (r) as the effect size measure.
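
The rank-biserial coefficient follows directly from the Mann-Whitney U statistic; a minimal self-contained sketch (the helper and toy values are illustrative, not project data):

```python
def rank_biserial(group1, group2):
    """Rank-biserial effect size r = 2*U1/(n1*n2) - 1, where U1 counts pairs
    in which a group1 value exceeds a group2 value (ties count 0.5).
    r > 0 means group1 values tend to be larger; |r| is the effect size."""
    u1 = sum((a > b) + 0.5 * (a == b) for a in group1 for b in group2)
    return 2.0 * u1 / (len(group1) * len(group2)) - 1.0

# Toy Grantham-distance samples: pathogenic variants tend to score higher
pathogenic = [140, 180, 160]
benign = [40, 60, 150]
print(rank_biserial(pathogenic, benign))  # 0.7777...
```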

Features by Effect Size Category:

| Effect Size | Criterion | Features |
|---|---|---|
| Large | \|r\| ≥ 0.5 | blosum62_score (r=0.73), grantham_distance (r=0.72), delta_beta_propensity (r=0.53), is_nonsense (r=0.52), delta_alpha_propensity (r=0.52), delta_mw (r=0.51) |
| Medium | 0.3 ≤ \|r\| < 0.5 | delta_volume (r=0.50), delta_polarity (r=0.48) |
| Small | 0.1 ≤ \|r\| < 0.3 | delta_hydrophobicity (r=0.23), aromatic_change (r=0.13) |
| Negligible | \|r\| < 0.1 | delta_charge, aliphatic_change, relative_position, distance_to_terminus, is_terminal, context_* features |

Note: Some features (e.g., delta_charge) have extremely small p-values but negligible effect sizes because the two groups have highly overlapping distributions—for example, 58.8% of pathogenic and 70.0% of non-pathogenic samples have delta_charge = 0.

Feature Correlation Matrix

Correlation Matrix

Key correlations:

  • BLOSUM62 score vs Grantham distance: r ≈ -0.88 (strong negative)
  • delta_volume vs delta_mw: r ≈ 0.91 (strong positive)

BLOSUM62 vs Grantham Distance

BLOSUM62 Grantham Scatter

The strong negative correlation confirms that evolutionarily conservative substitutions typically involve physicochemically similar amino acids.


🧪 Methods

Models Evaluated

| Model | Description | Characteristics |
|---|---|---|
| Logistic Regression (LR) | Linear baseline | Fast, interpretable |
| Decision Tree (DT) | Single tree classifier | Simple, prone to overfitting |
| Random Forest (RF) | Bagging ensemble | Robust, distributed importance |
| XGBoost | Gradient boosting | GPU acceleration, high performance |

Imbalanced Data Strategies

| Strategy | Description | Mechanism |
|---|---|---|
| none | No handling (baseline) | - |
| class_weight | Weighted loss function | Implicit sample weighting |
| SMOTE | Synthetic minority oversampling | Linear interpolation |
| ADASYN | Adaptive synthetic sampling | Difficulty-based weighting |
| Borderline-SMOTE | Boundary-focused oversampling | Focus on decision boundary |
| Random Undersample | Majority class downsampling | Data reduction |
| SMOTE-Tomek | Combined over/undersampling | Noise removal |

Ensemble Methods

  • Soft Voting: Probability averaging across models
  • Stacking: Two-layer meta-learning with LR meta-classifier
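
Both ensembles can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`; this is a minimal stand-in on synthetic imbalanced data with small base models (XGBoost omitted here for brevity), not the project's tuned pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the variant feature matrix (~10% positives)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9],
                           random_state=42)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Soft voting: average the predicted probabilities of the base models
voting = VotingClassifier(estimators=base, voting="soft").fit(X, y)

# Stacking: base-model probabilities feed a logistic-regression meta-learner
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression()).fit(X, y)

print(voting.predict_proba(X[:2]), stack.predict_proba(X[:2]))
```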

📈 Results

28 Model-Strategy Combinations

Heatmap F1

Key Finding: class_weight and none strategies consistently outperform oversampling methods on this large-scale dataset.

F1 Score Comparison Table:

| Strategy | LR | DT | RF | XGBoost |
|---|---|---|---|---|
| none | 0.6579 | 0.5184 | 0.6598 | 0.6597 |
| class_weight | 0.6579 | 0.5165 | 0.6640 | 0.1952* |
| smote | 0.3457 | 0.5119 | 0.4598 | 0.6469 |
| adasyn | 0.2174 | 0.5047 | 0.3033 | 0.5926 |
| borderline_smote | 0.2246 | 0.4957 | 0.3234 | 0.4803 |
| undersample | 0.3475 | 0.2355 | 0.3874 | 0.3747 |
| smote_tomek | 0.3423 | 0.5107 | 0.4584 | 0.6462 |

*Note: XGBoost's poor performance with class_weight is due to excessive scale_pos_weight causing over-prediction of positive class.
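
The arithmetic behind this note can be checked directly. A common balancing heuristic (assumed here) sets `scale_pos_weight` to the negative/positive class ratio, which is very large on this dataset:

```python
# Class counts from the dataset table above
n_pos, n_neg = 311_103, 5_950_874

# Balancing heuristic: scale_pos_weight = negatives / positives
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 1))  # 19.1
```

Each positive sample then carries roughly 19x the loss weight of a negative one, which pushes XGBoost toward over-predicting the positive class.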

Class Weight Sensitivity Analysis

Class Weight Tuning

Sensitivity Analysis Results:

  • LR & DT: Highly robust to weight parameter changes
  • RF: Stable at α ∈ [0.1, 2.5], slight decline when α > 3
  • XGBoost: Highly sensitive - performance drops sharply when α > 1

Optimal setting: α = 1.0 (no additional weighting)

Final Model Performance (After Hyperparameter Tuning & Threshold Optimization)

| Model | Threshold | Precision | Recall | F1 | AUC-ROC | AP |
|---|---|---|---|---|---|---|
| XGBoost (Tuned) | 0.30 | 0.8315 | 0.5634 | 0.6717 | 0.9092 | 0.6542 |
| Voting Ensemble | 0.23 | 0.8445 | 0.5555 | 0.6702 | 0.9089 | 0.6512 |
| Stacking | 0.36 | 0.8434 | 0.5552 | 0.6696 | 0.9108 | 0.6572 |
| Random Forest | 0.51 | 0.8500 | 0.5494 | 0.6675 | 0.9060 | 0.6491 |
| Decision Tree | 0.19 | 0.8763 | 0.5267 | 0.6579 | 0.8664 | 0.5666 |
| Logistic Regression | 0.26 | 0.8763 | 0.5267 | 0.6579 | 0.8790 | 0.5873 |

Key Insights:

  • XGBoost achieves best F1 score (0.6717) after threshold optimization from 0.50 → 0.30
  • Stacking has highest AUC-ROC (0.9108) but marginal F1 improvement over single models
  • All models achieve >97% accuracy due to class imbalance, but F1 is the primary metric
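
Threshold optimization of this kind amounts to a grid scan over candidate cutoffs, keeping the one that maximizes F1; a minimal pure-Python sketch (the `best_f1_threshold` helper and toy scores are illustrative):

```python
def best_f1_threshold(probs, labels, grid=None):
    """Scan candidate thresholds and return (threshold, F1) maximizing F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best = (0.5, 0.0)
    for t in grid:
        tp = sum(p >= t and l == 1 for p, l in zip(probs, labels))
        fp = sum(p >= t and l == 0 for p, l in zip(probs, labels))
        fn = sum(p < t and l == 1 for p, l in zip(probs, labels))
        if tp:
            f1 = 2 * tp / (2 * tp + fp + fn)
            if f1 > best[1]:
                best = (t, f1)
    return best

# Toy scores: lowering the cutoff below 0.5 recovers the weaker positives
probs = [0.9, 0.6, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0]
print(best_f1_threshold(probs, labels))
```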

ROC and PR Curves

Feature Importance Analysis: RF vs XGBoost vs DT

Top 5 Feature Importance Comparison:

| Rank | Random Forest | Importance | XGBoost | Importance | Decision Tree | Importance |
|---|---|---|---|---|---|---|
| 1 | is_nonsense | 16.3% | blosum62_score | 94.7% | is_nonsense | 97.7% |
| 2 | delta_mw | 15.7% | grantham_distance | 0.9% | blosum62_score | 1.4% |
| 3 | blosum62_score | 14.3% | delta_hydrophobicity | 0.4% | grantham_distance | 0.5% |
| 4 | grantham_distance | 10.9% | delta_beta_propensity | 0.4% | delta_mw | 0.2% |
| 5 | delta_volume | 7.6% | context_proline_count | 0.3% | delta_volume | 0.1% |

Algorithm-specific Explanations:

  • RF's Distributed Importance: Based on average Gini impurity reduction across 400 trees with random feature subsets
  • XGBoost's Concentrated Importance: "Gain"-based importance; blosum62_score dominates split decisions (94.67%)
  • DT's Extreme Concentration: Limited tree depth (5 layers) concentrates importance on root node feature

Biological Validation:

  • Evolutionary conservation features (blosum62_score + grantham_distance) account for >41% of effect size among top 6 features
  • Pathogenic variants: BLOSUM62 score mean = -5.73 vs Non-pathogenic = +0.97
  • Pathogenic variants: Grantham distance mean = 157 vs Non-pathogenic = 55

Confusion Matrix Analysis

Confusion Matrix

XGBoost (Threshold=0.30) Clinical Interpretation:

| Metric | Value | Interpretation |
|---|---|---|
| True Positives | 35,049 | Correctly identified pathogenic variants |
| False Negatives | 27,172 | Missed pathogenic variants (43.66% miss rate) |
| False Positives | 6,985 | Incorrectly flagged as pathogenic |
| True Negatives | 1,183,190 | Correctly identified non-pathogenic |
| Sensitivity | 56.34% | Proportion of pathogenic detected |
| Specificity | 99.41% | Proportion of non-pathogenic correctly excluded |

The high specificity (99.41%) makes the model suitable for initial screening, with low false positive burden.
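
The headline rates follow directly from the confusion-matrix counts:

```python
# Confusion-matrix counts for XGBoost at threshold 0.30 (from the table above)
tp, fn, fp, tn = 35_049, 27_172, 6_985, 1_183_190

sensitivity = tp / (tp + fn)  # share of pathogenic variants detected
specificity = tn / (tn + fp)  # share of non-pathogenic correctly excluded

print(f"sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
# sensitivity=0.5633, specificity=0.9941
```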


🚀 Quick Start

Environment Setup

```bash
# Clone the repository
git clone https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction.git
cd ClinVar-Pathogenicity-Prediction

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

Running the Notebooks

  1. Data Processing: notebooks/clinvar_data_processing_v3.ipynb

    • Downloads and processes ClinVar data
    • Extracts 20 protein mutation features
    • Generates train/test splits
  2. Model Building: notebooks/model_building_v3.ipynb

    • Trains 4 models with 7 strategies (28 combinations)
    • Hyperparameter tuning with GridSearchCV
    • Ensemble methods (Voting, Stacking)
    • Threshold optimization
    • Feature importance analysis

📁 Project Structure

```text
ClinVar-Pathogenicity-Prediction/
├── README.md                           # Project documentation (bilingual)
├── LICENSE                             # MIT License
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore rules
│
├── data/                               # Data directory
│   └── README.md                       # Data acquisition guide
│
├── notebooks/                          # Jupyter Notebooks
│   ├── clinvar_data_processing_v3.ipynb   # Data processing pipeline
│   └── model_building_v3.ipynb            # Model training & evaluation
│
└── results/                            # Output results
    └── figures/                        # Visualization figures
```

Note: Source code utilities (clinvar_utils.py, model_utils_v3.py) are not included in this repository. Please contact the author if needed.


📝 Citation

If this project is helpful for your research, please cite:

```bibtex
@misc{li2026clinvar,
  author = {Jinpeng Li},
  title = {Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction}
}
```

📜 License

This project is licensed under the MIT License.


🙏 Acknowledgments

  • ClinVar Database - Variant annotation data source
  • UniProt - Protein sequence database
  • Course: Python Programming for Health Data Science

📧 Contact


中文版本

基于机器学习的ClinVar蛋白质变异致病性预测研究

本项目开发了机器学习模型,用于预测蛋白质编码区域错义突变的致病性,使用了来自ClinVar数据库的约620万条变异记录。


📖 项目概述

研究背景

预测错义突变的致病性是精准医学和遗传病诊断的关键挑战。ClinVar数据库包含大量人类变异的临床解读信息,但约50%的变异仍被标记为"意义不明(VUS)",亟需计算方法辅助预测。

研究目标

  1. 从ClinVar构建包含20个具有生物学意义的蛋白质突变特征的高质量建模数据集
  2. 系统评估7种不平衡数据处理策略在4种机器学习模型上的效果
  3. 开发可解释的高性能致病性预测模型

🎯 主要贡献

| 贡献 | 描述 |
|---|---|
| 大规模实证研究 | 在约620万样本上系统对比28种模型-策略组合 |
| 采样策略新发现 | 证明class_weight优于SMOTE等过采样方法 |
| 最优模型方案 | XGBoost + 阈值优化达到 F1=0.6717, AUC=0.9092 |
| 可解释性验证 | 进化保守性特征(BLOSUM62、Grantham距离)被识别为关键预测因子 |

📊 数据集

| 类别 | 样本数 | 占比 |
|---|---|---|
| 致病性变异 | 311,103 | 4.97% |
| 非致病性变异 | 5,950,874 | 95.03% |
| 总计 | 6,261,977 | 100% |

数据来源: ClinVar数据库(2025年12月版本)

数据集呈现严重的类别不平衡(1:19比例),反映了人类基因组中致病变异的罕见性。

样本分布

训练/测试集划分: 80%/20%分层抽样(random_state=42)

  • 训练集:5,009,581个样本(致病性248,882个)
  • 测试集:1,252,396个样本(致病性62,221个)

🔬 特征工程

我们从5个维度提取了20个蛋白质突变特征

| 维度 | 特征 | 描述 |
|---|---|---|
| 进化特征 | blosum62_score | 氨基酸替换保守性得分 |
| 物化特征 | grantham_distance, delta_hydrophobicity, delta_charge, delta_volume, delta_polarity, delta_mw, is_nonsense, aromatic_change, aliphatic_change | 氨基酸理化性质变化 |
| 位置特征 | relative_position, distance_to_terminus, is_terminal | 蛋白质序列中的位置 |
| 结构特征 | delta_alpha_propensity, delta_beta_propensity | 二级结构倾向性变化 |
| 上下文特征 | context_hydrophobicity_mean, context_charge_sum, context_aromatic_count, context_proline_count, context_glycine_count | 周围序列环境 |

突变模式分析

突变热力图

关键观察:

  • 致病变异: 精氨酸(R)是最常见的突变起始氨基酸;无义突变(→*)高度富集
  • 非致病变异: 模式更分散;保守替换(如I↔L、V↔I)更常见

特征分布对比

特征分布

核密度图显示致病变异(红色)和非致病变异(绿色)在进化和物化特征上有明显分离。

特征分布对比(箱线图)

Top特征箱线图

箱线图展示了Top 6 最具区分能力特征在致病变异(红色)和非致病变异(绿色)之间的分布对比。

统计效应量分析

我们使用 Mann-Whitney U 检验(非参数检验,适用于非正态分布数据)评估组间差异。由于样本量极大(N > 600万),p值易被驱动至接近零而缺乏实际参考价值。因此,我们报告 rank-biserial 相关系数 (r) 作为效应量指标。

按效应量等级划分的特征:

| 效应量等级 | 判定标准 | 特征 |
|---|---|---|
| 大效应 | \|r\| ≥ 0.5 | blosum62_score (r=0.73)、grantham_distance (r=0.72)、delta_beta_propensity (r=0.53)、is_nonsense (r=0.52)、delta_alpha_propensity (r=0.52)、delta_mw (r=0.51) |
| 中效应 | 0.3 ≤ \|r\| < 0.5 | delta_volume (r=0.50)、delta_polarity (r=0.48) |
| 小效应 | 0.1 ≤ \|r\| < 0.3 | delta_hydrophobicity (r=0.23)、aromatic_change (r=0.13) |
| 可忽略 | \|r\| < 0.1 | delta_charge、aliphatic_change、relative_position、distance_to_terminus、is_terminal、context_* 特征 |

注意: 部分特征(如 delta_charge)虽然p值极小但效应量可忽略,这是因为两组样本分布高度重叠——例如,58.8%的致病和70.0%的非致病样本的 delta_charge = 0。

特征相关性矩阵

相关性矩阵

主要相关性:

  • BLOSUM62得分 vs Grantham距离:r ≈ -0.88(强负相关)
  • delta_volume vs delta_mw:r ≈ 0.91(强正相关)

BLOSUM62 vs Grantham距离

BLOSUM62 Grantham散点图

强负相关证实了进化保守的替换通常涉及理化性质相近的氨基酸。


🧪 方法

评估的模型

| 模型 | 描述 | 特点 |
|---|---|---|
| 逻辑回归 (LR) | 线性基准 | 快速、可解释 |
| 决策树 (DT) | 单棵决策树 | 简单、易过拟合 |
| 随机森林 (RF) | Bagging集成 | 鲁棒、重要性分布均匀 |
| XGBoost | 梯度提升 | GPU加速、高性能 |

不平衡数据处理策略

| 策略 | 描述 | 机制 |
|---|---|---|
| none | 无处理(基准) | - |
| class_weight | 加权损失函数 | 隐式样本加权 |
| SMOTE | 合成少数类过采样 | 线性插值 |
| ADASYN | 自适应合成采样 | 基于难度加权 |
| Borderline-SMOTE | 边界聚焦过采样 | 关注决策边界 |
| Random Undersample | 随机欠采样 | 数据缩减 |
| SMOTE-Tomek | 组合过/欠采样 | 噪声去除 |

集成方法

  • Soft Voting: 跨模型概率平均
  • Stacking: 两层元学习,使用LR作为元分类器

📈 结果

28种模型-策略组合

热力图F1

关键发现: class_weight 和 none 策略在这个大规模数据集上始终优于过采样方法。

F1分数对比表:

| 策略 | LR | DT | RF | XGBoost |
|---|---|---|---|---|
| none | 0.6579 | 0.5184 | 0.6598 | 0.6597 |
| class_weight | 0.6579 | 0.5165 | 0.6640 | 0.1952* |
| smote | 0.3457 | 0.5119 | 0.4598 | 0.6469 |
| adasyn | 0.2174 | 0.5047 | 0.3033 | 0.5926 |
| borderline_smote | 0.2246 | 0.4957 | 0.3234 | 0.4803 |
| undersample | 0.3475 | 0.2355 | 0.3874 | 0.3747 |
| smote_tomek | 0.3423 | 0.5107 | 0.4584 | 0.6462 |

*注:XGBoost在class_weight下表现差是因为scale_pos_weight过高导致正类过度预测。

类别权重敏感性分析

类别权重调优

敏感性分析结果:

  • LR & DT: 对权重参数变化高度鲁棒
  • RF: 在α ∈ [0.1, 2.5]时稳定,α > 3时略有下降
  • XGBoost: 高度敏感 - α > 1时性能急剧下降

最优设置: α = 1.0(无额外加权)

最终模型性能(超参数调优和阈值优化后)

| 模型 | 阈值 | 精确率 | 召回率 | F1 | AUC-ROC | AP |
|---|---|---|---|---|---|---|
| XGBoost (调优) | 0.30 | 0.8315 | 0.5634 | 0.6717 | 0.9092 | 0.6542 |
| Voting集成 | 0.23 | 0.8445 | 0.5555 | 0.6702 | 0.9089 | 0.6512 |
| Stacking | 0.36 | 0.8434 | 0.5552 | 0.6696 | 0.9108 | 0.6572 |
| 随机森林 | 0.51 | 0.8500 | 0.5494 | 0.6675 | 0.9060 | 0.6491 |
| 决策树 | 0.19 | 0.8763 | 0.5267 | 0.6579 | 0.8664 | 0.5666 |
| 逻辑回归 | 0.26 | 0.8763 | 0.5267 | 0.6579 | 0.8790 | 0.5873 |

关键洞察:

  • XGBoost在阈值从0.50优化到0.30后达到最佳F1分数(0.6717)
  • Stacking具有最高AUC-ROC(0.9108),但F1相对单模型提升有限
  • 由于类别不平衡,所有模型准确率>97%,但F1是主要评估指标

ROC和PR曲线

特征重要性分析:RF vs XGBoost vs DT

Top 5特征重要性对比:

| 排名 | 随机森林 | 重要性 | XGBoost | 重要性 | 决策树 | 重要性 |
|---|---|---|---|---|---|---|
| 1 | is_nonsense | 16.3% | blosum62_score | 94.7% | is_nonsense | 97.7% |
| 2 | delta_mw | 15.7% | grantham_distance | 0.9% | blosum62_score | 1.4% |
| 3 | blosum62_score | 14.3% | delta_hydrophobicity | 0.4% | grantham_distance | 0.5% |
| 4 | grantham_distance | 10.9% | delta_beta_propensity | 0.4% | delta_mw | 0.2% |
| 5 | delta_volume | 7.6% | context_proline_count | 0.3% | delta_volume | 0.1% |

算法特性解释:

  • RF的分布式重要性: 基于400棵树中Gini不纯度减少的平均值
  • XGBoost的集中重要性: 基于"增益"的重要性;blosum62_score主导分裂决策(94.67%)
  • DT的极端集中: 有限的树深度(5层)将重要性集中在根节点特征上

生物学验证:

  • 进化保守性特征(blosum62_score + grantham_distance)在Top 6特征中效应量占比 >41%
  • 致病变异:BLOSUM62得分均值 = -5.73 vs 非致病 = +0.97
  • 致病变异:Grantham距离均值 = 157 vs 非致病 = 55

混淆矩阵分析

混淆矩阵

XGBoost(阈值=0.30)临床解读:

| 指标 | 数值 | 解释 |
|---|---|---|
| 真阳性 | 35,049 | 正确识别的致病变异 |
| 假阴性 | 27,172 | 漏诊的致病变异(漏诊率43.66%) |
| 假阳性 | 6,985 | 错误标记为致病 |
| 真阴性 | 1,183,190 | 正确识别的非致病变异 |
| 敏感性 | 56.34% | 检测到的致病变异比例 |
| 特异性 | 99.41% | 正确排除的非致病变异比例 |

高特异性(99.41%)使该模型适合初步筛查,假阳性负担低。


🚀 快速开始

环境配置

```bash
# 克隆仓库
git clone https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction.git
cd ClinVar-Pathogenicity-Prediction

# 创建虚拟环境(推荐)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# 安装依赖
pip install -r requirements.txt
```

运行Notebook

  1. 数据处理: notebooks/clinvar_data_processing_v3.ipynb

    • 下载和处理ClinVar数据
    • 提取20个蛋白质突变特征
    • 生成训练/测试集划分
  2. 模型构建: notebooks/model_building_v3.ipynb

    • 用7种策略训练4个模型(28种组合)
    • 使用GridSearchCV进行超参数调优
    • 集成方法(Voting、Stacking)
    • 阈值优化
    • 特征重要性分析

📁 项目结构

```text
ClinVar-Pathogenicity-Prediction/
├── README.md                           # 项目文档(双语)
├── LICENSE                             # MIT许可证
├── requirements.txt                    # Python依赖
├── .gitignore                          # Git忽略规则
│
├── data/                               # 数据目录
│   └── README.md                       # 数据获取指南
│
├── notebooks/                          # Jupyter Notebooks
│   ├── clinvar_data_processing_v3.ipynb   # 数据处理流程
│   └── model_building_v3.ipynb            # 模型训练与评估
│
└── results/                            # 输出结果
    └── figures/                        # 可视化图表
```

注意: 源代码工具函数(clinvar_utils.py 和 model_utils_v3.py)未包含在此仓库中。如需获取,请联系作者。


📝 引用

如果本项目对您的研究有帮助,请引用:

```bibtex
@misc{li2026clinvar,
  author = {Jinpeng Li},
  title = {Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction}
}
```

📜 许可证

本项目采用 MIT许可证 开源。


🙏 致谢

  • ClinVar数据库 - 变异注释数据来源
  • UniProt - 蛋白质序列数据库
  • 课程:健康数据科学的Python语言编程基础

📧 联系方式


⭐ 如果您觉得这个项目有用,请给它一个星标!