# 🧬 ClinVar Protein Variant Pathogenicity Prediction
Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database
This project develops machine learning models to predict the pathogenicity of missense mutations in protein-coding regions, using approximately 6.2 million variant records from the ClinVar database.
## 📖 Project Overview

### Background
Predicting the pathogenicity of missense mutations is a critical challenge in precision medicine and genetic disease diagnosis. The ClinVar database contains extensive clinical interpretation of human variants, yet approximately 50% remain classified as "variants of uncertain significance (VUS)," necessitating computational prediction methods.
### Objectives
- Construct a high-quality modeling dataset from ClinVar with 20 biologically meaningful protein mutation features
- Systematically evaluate 7 imbalanced data handling strategies across 4 machine learning models
- Develop an interpretable, high-performance pathogenicity prediction model
## 🎯 Key Contributions
| Contribution | Description |
|---|---|
| Large-scale Empirical Study | Systematic comparison of 28 model-strategy combinations on ~6.2 million samples |
| Novel Finding on Sampling | Demonstrated that class_weight outperforms SMOTE and other oversampling methods |
| Optimal Model Pipeline | XGBoost + threshold optimization achieves F1=0.6717, AUC=0.9092 |
| Interpretability Validation | Evolutionary conservation features (BLOSUM62, Grantham distance) identified as key predictors |
## 📊 Dataset
| Category | Samples | Proportion |
|---|---|---|
| Pathogenic variants | 311,103 | 4.97% |
| Non-pathogenic variants | 5,950,874 | 95.03% |
| Total | 6,261,977 | 100% |
Data Source: ClinVar Database (December 2025 release)
The dataset exhibits severe class imbalance (1:19 ratio), reflecting the rarity of pathogenic variants in the human genome.
Train/Test Split: 80%/20% with stratified sampling (random_state=42)
- Training set: 5,009,581 samples (248,882 pathogenic)
- Test set: 1,252,396 samples (62,221 pathogenic)
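The stratified split keeps the ~4.97% pathogenic rate identical in both partitions. The notebooks presumably use sklearn's `train_test_split(..., stratify=y, random_state=42)`; the `stratified_split` helper below is an illustrative pure-Python stand-in for the same idea, not the project's code:

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices class by class so train and test keep the class ratio
    (a sketch of what stratified sampling does; names are illustrative)."""
    rng = random.Random(seed)
    train, test = [], []
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Toy labels with the same ~5% positive rate as the dataset
labels = [1] * 50 + [0] * 950
train, test = stratified_split(labels)
pos_rate_test = sum(labels[i] for i in test) / len(test)
print(len(train), len(test), pos_rate_test)  # 800 200 0.05
```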
## 🔬 Feature Engineering
We extracted 20 protein mutation features across 5 dimensions:
| Dimension | Features | Description |
|---|---|---|
| Evolutionary | `blosum62_score` | Amino acid substitution conservation score |
| Physicochemical | `grantham_distance`, `delta_hydrophobicity`, `delta_charge`, `delta_volume`, `delta_polarity`, `delta_mw`, `is_nonsense`, `aromatic_change`, `aliphatic_change` | Changes in amino acid properties |
| Positional | `relative_position`, `distance_to_terminus`, `is_terminal` | Location within the protein sequence |
| Structural | `delta_alpha_propensity`, `delta_beta_propensity` | Secondary structure propensity changes |
| Contextual | `context_hydrophobicity_mean`, `context_charge_sum`, `context_aromatic_count`, `context_proline_count`, `context_glycine_count` | Surrounding sequence environment |
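As an illustration, several of the delta features can be computed directly from standard amino-acid property tables. The Kyte-Doolittle hydropathy scale and unit side-chain charges below are standard values, but the choice of tables and the `variant_features` helper are assumptions for this sketch; the actual extraction code lives in the notebooks.

```python
# Kyte-Doolittle hydropathy (standard scale) and side-chain charge at
# physiological pH (assumed convention; His treated as +1 here).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}
CHARGE = {'R': 1, 'K': 1, 'H': 1, 'D': -1, 'E': -1}

def variant_features(ref, alt, pos, seq_len):
    """Hypothetical helper: a handful of the 20 features for a ref->alt
    substitution at 1-based position pos in a protein of length seq_len."""
    is_nonsense = int(alt == '*')  # stop-gain: alt residue is a stop codon
    return {
        'is_nonsense': is_nonsense,
        'delta_hydrophobicity': 0.0 if is_nonsense else KD[alt] - KD[ref],
        'delta_charge': 0 if is_nonsense else CHARGE.get(alt, 0) - CHARGE.get(ref, 0),
        'relative_position': pos / seq_len,
        'distance_to_terminus': min(pos, seq_len - pos),
    }

# Arg -> Trp, a classic damaging substitution: large hydropathy jump, charge loss
print(variant_features('R', 'W', 120, 400))
```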
### Mutation Pattern Analysis
Key observations:
- Pathogenic variants: Arginine (R) is the most common source amino acid; nonsense mutations (→*) are highly enriched
- Non-pathogenic variants: More dispersed pattern; conservative substitutions (e.g., I↔L, V↔I) are common
### Feature Distribution Comparison
The kernel density plots show clear separation between pathogenic (red) and non-pathogenic (green) variants for evolutionary and physicochemical features.
### Feature Distribution Comparison (Boxplots)

The boxplots compare the distributions of the six most discriminative features between pathogenic (red) and non-pathogenic (green) variants.
### Statistical Effect Size Analysis
We used Mann-Whitney U test (non-parametric, suitable for non-normal distributions) to assess group differences. Given the massive sample size (N > 6 million), p-values are easily driven to near-zero and have limited practical value. Therefore, we report the rank-biserial correlation coefficient (r) as the effect size measure.
Features by Effect Size Category:
| Effect Size | Criterion | Features |
|---|---|---|
| Large | \|r\| ≥ 0.5 | blosum62_score (r=0.73), grantham_distance (r=0.72), delta_beta_propensity (r=0.53), is_nonsense (r=0.52), delta_alpha_propensity (r=0.52), delta_mw (r=0.51) |
| Medium | 0.3 ≤ \|r\| < 0.5 | delta_volume (r=0.50), delta_polarity (r=0.48) |
| Small | 0.1 ≤ \|r\| < 0.3 | delta_hydrophobicity (r=0.23), aromatic_change (r=0.13) |
| Negligible | \|r\| < 0.1 | delta_charge, aliphatic_change, relative_position, distance_to_terminus, is_terminal, context_* features |
Note: Some features (e.g., `delta_charge`) have extremely small p-values yet negligible effect sizes because the two groups' distributions overlap heavily: for example, `delta_charge = 0` for 58.8% of pathogenic and 70.0% of non-pathogenic samples.
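The rank-biserial effect size can be read directly off the Mann-Whitney U statistic: with U1 counting pairwise "wins" of one group over the other (ties count as half), r = 2*U1/(n1*n2) - 1. A brute-force sketch, quadratic and for illustration only (scipy computes U efficiently via ranks):

```python
def rank_biserial(x, y):
    """Rank-biserial correlation between two samples: +1 if every value in x
    beats every value in y, -1 for the reverse, 0 for complete overlap."""
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return 2 * wins / (len(x) * len(y)) - 1

# Toy groups: x beats y in 6 of 9 pairwise comparisons -> r = 2*(6/9) - 1 = 1/3
print(round(rank_biserial([5, 7, 9], [4, 6, 8]), 3))
```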
### Feature Correlation Matrix
Key correlations:
- BLOSUM62 score vs Grantham distance: r ≈ -0.88 (strong negative)
- delta_volume vs delta_mw: r ≈ 0.91 (strong positive)
### BLOSUM62 vs Grantham Distance
The strong negative correlation confirms that evolutionarily conservative substitutions typically involve physicochemically similar amino acids.
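The correlation reported here is plain Pearson r (the matrix was presumably computed with something like pandas' `DataFrame.corr()`). A self-contained sketch with toy (blosum62_score, grantham_distance) pairs; the numbers are illustrative, not the dataset's:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Conservative substitutions score high on BLOSUM62 and low on Grantham
# distance, so the two move in opposite directions (toy values).
blosum = [4, 2, 0, -2, -4]
grantham = [5, 29, 81, 125, 180]
print(round(pearson(blosum, grantham), 2))
```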
## 🧪 Methods

### Models Evaluated
| Model | Description | Characteristics |
|---|---|---|
| Logistic Regression (LR) | Linear baseline | Fast, interpretable |
| Decision Tree (DT) | Single tree classifier | Simple, prone to overfitting |
| Random Forest (RF) | Bagging ensemble | Robust, distributed importance |
| XGBoost | Gradient boosting | GPU acceleration, high performance |
### Imbalanced Data Strategies
| Strategy | Description | Mechanism |
|---|---|---|
| `none` | No handling (baseline) | - |
| `class_weight` | Weighted loss function | Implicit sample weighting |
| `SMOTE` | Synthetic minority oversampling | Linear interpolation |
| `ADASYN` | Adaptive synthetic sampling | Difficulty-based weighting |
| `Borderline-SMOTE` | Boundary-focused oversampling | Focus on decision boundary |
| `Random Undersample` | Majority-class downsampling | Data reduction |
| `SMOTE-Tomek` | Combined over/undersampling | Noise removal |
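To make `class_weight` concrete: sklearn's "balanced" mode weights each class by `n_samples / (n_classes * n_in_class)`, and XGBoost's rough analogue is a single `scale_pos_weight` ratio. A sketch of the arithmetic using this dataset's counts (not the notebook's code):

```python
# "Balanced" class weights as sklearn computes them: n / (n_classes * n_c).
# The counts are the dataset totals reported above.
n_pos, n_neg = 311_103, 5_950_874
n = n_pos + n_neg

w_pos = n / (2 * n_pos)   # each pathogenic sample counts ~10x in the loss
w_neg = n / (2 * n_neg)   # each non-pathogenic sample counts ~0.5x

# XGBoost's single-ratio analogue; at ~19 it heavily over-weights positives,
# consistent with the F1 collapse reported for class_weight + XGBoost above.
scale_pos_weight = n_neg / n_pos

print(round(w_pos, 2), round(w_neg, 2), round(scale_pos_weight, 1))
```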
### Ensemble Methods
- Soft Voting: Probability averaging across models
- Stacking: Two-layer meta-learning with LR meta-classifier
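Soft voting reduces to a (weighted) mean of the models' predicted positive-class probabilities; a minimal sketch (function name and toy probabilities are illustrative):

```python
def soft_vote(prob_lists, weights=None):
    """Average per-model positive-class probabilities, optionally weighted.
    prob_lists: one list of P(pathogenic) per model, aligned by sample."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]

# Three models' predicted P(pathogenic) for two variants
probs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
print(soft_vote(probs))
```

Stacking differs in that the per-model probabilities become input features for a second-stage learner (here a logistic regression meta-classifier) instead of being averaged with fixed weights.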
## 📈 Results

### 28 Model-Strategy Combinations
**Key Finding:** `class_weight` and `none` strategies consistently outperform oversampling methods on this large-scale dataset.
F1 Score Comparison Table:
| Strategy | LR | DT | RF | XGBoost |
|---|---|---|---|---|
| none | 0.6579 | 0.5184 | 0.6598 | 0.6597 |
| class_weight | 0.6579 | 0.5165 | 0.6640 | 0.1952* |
| smote | 0.3457 | 0.5119 | 0.4598 | 0.6469 |
| adasyn | 0.2174 | 0.5047 | 0.3033 | 0.5926 |
| borderline_smote | 0.2246 | 0.4957 | 0.3234 | 0.4803 |
| undersample | 0.3475 | 0.2355 | 0.3874 | 0.3747 |
| smote_tomek | 0.3423 | 0.5107 | 0.4584 | 0.6462 |
*Note: XGBoost's poor performance with `class_weight` stems from an excessive `scale_pos_weight`, which causes over-prediction of the positive class.
### Class Weight Sensitivity Analysis
Sensitivity Analysis Results:
- LR & DT: Highly robust to weight parameter changes
- RF: Stable at α ∈ [0.1, 2.5], slight decline when α > 3
- XGBoost: Highly sensitive - performance drops sharply when α > 1
Optimal setting: α = 1.0 (no additional weighting)
### Final Model Performance (After Hyperparameter Tuning & Threshold Optimization)
| Model | Threshold | Precision | Recall | F1 | AUC-ROC | AP |
|---|---|---|---|---|---|---|
| XGBoost (Tuned) | 0.30 | 0.8315 | 0.5634 | 0.6717 | 0.9092 | 0.6542 |
| Voting Ensemble | 0.23 | 0.8445 | 0.5555 | 0.6702 | 0.9089 | 0.6512 |
| Stacking | 0.36 | 0.8434 | 0.5552 | 0.6696 | 0.9108 | 0.6572 |
| Random Forest | 0.51 | 0.8500 | 0.5494 | 0.6675 | 0.9060 | 0.6491 |
| Decision Tree | 0.19 | 0.8763 | 0.5267 | 0.6579 | 0.8664 | 0.5666 |
| Logistic Regression | 0.26 | 0.8763 | 0.5267 | 0.6579 | 0.8790 | 0.5873 |
Key Insights:
- XGBoost achieves best F1 score (0.6717) after threshold optimization from 0.50 → 0.30
- Stacking has highest AUC-ROC (0.9108) but marginal F1 improvement over single models
- All models exceed 97% accuracy because of the class imbalance, so accuracy is uninformative here; F1 is the primary metric
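Threshold optimization here means scanning candidate cutoffs on held-out predicted probabilities and keeping the F1-maximizing one. A self-contained sketch of that step (not the notebook's implementation):

```python
def best_f1_threshold(y_true, y_prob, grid=None):
    """Scan decision thresholds and return the (threshold, F1) pair that
    maximizes F1 on the given labels and predicted probabilities."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(p >= t and y for y, p in zip(y_true, y_prob))
        fp = sum(p >= t and not y for y, p in zip(y_true, y_prob))
        fn = sum(p < t and y for y, p in zip(y_true, y_prob))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy example: lowering the cutoff below 0.5 recovers two true positives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05]
print(best_f1_threshold(y_true, y_prob))
```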
### ROC and PR Curves

### Feature Importance Analysis: RF vs XGBoost vs DT
Top 5 Feature Importance Comparison:
| Rank | Random Forest | Importance | XGBoost | Importance | Decision Tree | Importance |
|---|---|---|---|---|---|---|
| 1 | is_nonsense | 16.3% | blosum62_score | 94.7% | is_nonsense | 97.7% |
| 2 | delta_mw | 15.7% | grantham_distance | 0.9% | blosum62_score | 1.4% |
| 3 | blosum62_score | 14.3% | delta_hydrophobicity | 0.4% | grantham_distance | 0.5% |
| 4 | grantham_distance | 10.9% | delta_beta_propensity | 0.4% | delta_mw | 0.2% |
| 5 | delta_volume | 7.6% | context_proline_count | 0.3% | delta_volume | 0.1% |
Algorithm-specific Explanations:
- RF's Distributed Importance: Based on average Gini impurity reduction across 400 trees with random feature subsets
- XGBoost's Concentrated Importance: "Gain"-based importance; blosum62_score dominates split decisions (94.67%)
- DT's Extreme Concentration: Limited tree depth (5 layers) concentrates importance on root node feature
Biological Validation:
- Evolutionary conservation features (blosum62_score + grantham_distance) account for >41% of effect size among top 6 features
- Pathogenic variants: BLOSUM62 score mean = -5.73 vs Non-pathogenic = +0.97
- Pathogenic variants: Grantham distance mean = 157 vs Non-pathogenic = 55
### Confusion Matrix Analysis
XGBoost (Threshold=0.30) Clinical Interpretation:
| Metric | Value | Interpretation |
|---|---|---|
| True Positives | 35,049 | Correctly identified pathogenic variants |
| False Negatives | 27,172 | Missed pathogenic variants (43.66% miss rate) |
| False Positives | 6,985 | Incorrectly flagged as pathogenic |
| True Negatives | 1,183,190 | Correctly identified non-pathogenic |
| Sensitivity | 56.34% | Proportion of pathogenic detected |
| Specificity | 99.41% | Proportion of non-pathogenic correctly excluded |
The high specificity (99.41%) makes the model suitable for initial screening, with low false positive burden.
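The rates in the table follow directly from the four counts; recomputing them (small rounding differences from the tables above are expected):

```python
# Counts from the XGBoost (threshold=0.30) confusion matrix above
tp, fn, fp, tn = 35_049, 27_172, 6_985, 1_183_190

sensitivity = tp / (tp + fn)            # recall on pathogenic variants
specificity = tn / (tn + fp)            # non-pathogenic correctly excluded
accuracy = (tp + tn) / (tp + fn + fp + tn)  # inflated by the 95% majority class

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f} "
      f"accuracy={accuracy:.4f}")
```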
## 🚀 Quick Start

### Environment Setup
```bash
# Clone the repository
git clone https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction.git
cd ClinVar-Pathogenicity-Prediction

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/Mac
# venv\Scripts\activate    # Windows

# Install dependencies
pip install -r requirements.txt
```

### Running the Notebooks
1. **Data Processing**: `notebooks/clinvar_data_processing_v3.ipynb`
   - Downloads and processes ClinVar data
   - Extracts 20 protein mutation features
   - Generates train/test splits

2. **Model Building**: `notebooks/model_building_v3.ipynb`
   - Trains 4 models with 7 strategies (28 combinations)
   - Hyperparameter tuning with GridSearchCV
   - Ensemble methods (Voting, Stacking)
   - Threshold optimization
   - Feature importance analysis
## 📁 Project Structure

```text
ClinVar-Pathogenicity-Prediction/
├── README.md                              # Project documentation
├── LICENSE                                # MIT License
├── requirements.txt                       # Python dependencies
├── .gitignore                             # Git ignore rules
│
├── data/                                  # Data directory
│   └── README.md                          # Data acquisition guide
│
├── notebooks/                             # Jupyter Notebooks
│   ├── clinvar_data_processing_v3.ipynb   # Data processing pipeline
│   └── model_building_v3.ipynb            # Model training & evaluation
│
└── results/                               # Output results
    └── figures/                           # Visualization figures
```

Note: Source code utilities (`clinvar_utils.py`, `model_utils_v3.py`) are not included in this repository. Please contact the author if needed.
## 📝 Citation

If this project is helpful for your research, please cite:

```bibtex
@misc{li2026clinvar,
  author    = {Jinpeng Li},
  title     = {Machine Learning-based Prediction of Pathogenicity for Protein Variants in ClinVar Database},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/Rain021217/ClinVar-Pathogenicity-Prediction}
}
```

## 📜 License

This project is licensed under the MIT License.
## 🙏 Acknowledgments
- ClinVar Database - Variant annotation data source
- UniProt - Protein sequence database
- Course: Python Programming for Health Data Science
## 📧 Contact
- Author: Jinpeng Li (李锦鹏)
- Email: jinpengli@stu.pku.edu.cn
- Affiliation: Medical Technology, Peking University
⭐ If you find this project useful, please give it a star!