# 🧬 Genetic Variant Classification

Machine-learning-based genetic variant classification (repo: S18-Niloy/ML-based-Genome-Analysis).
## 1. Dataset Preparation

The dataset contained three columns: REF, CLASS, and IMPACT.

- **Class Balancing:** The data was balanced using oversampling so that both `CLASS 0` and `CLASS 1` had 862 samples each.
- **Train-Test Split:** The dataset was split into 80% training and 20% testing, maintaining class balance using `train_test_split(..., stratify=y)`.
- **Feature Encoding:** Since `REF` was categorical, one-hot encoding was applied to convert it into numeric features suitable for machine learning and neural network models.
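The preparation steps above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame (the real pipeline loads the project's dataset instead); the manual resampling loop is an assumption about how the oversampling was done.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the real REF/CLASS data
df = pd.DataFrame({
    "REF":   ["A", "C", "G", "T", "A", "C", "G", "T", "A", "C"],
    "CLASS": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Oversample the minority class until both classes are equal in size
counts = df["CLASS"].value_counts()
minority = counts.idxmin()
extra = df[df["CLASS"] == minority].sample(
    counts.max() - counts.min(), replace=True, random_state=42
)
df = pd.concat([df, extra], ignore_index=True)

# One-hot encode the categorical REF column into numeric features
X = pd.get_dummies(df[["REF"]], columns=["REF"])
y = df["CLASS"]

# Stratified 80/20 split preserves the class balance in both folds
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```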
## 2. Models Trained
| Model | Description |
|---|---|
| Random Forest (RF) | Ensemble tree-based model that captures non-linear relationships effectively. |
| Logistic Regression (LR) | A simple linear model used as a baseline for comparison. |
| Support Vector Machine (SVM) | Effective in high-dimensional spaces; used a linear kernel. |
| Naive Bayes (NB) | A probabilistic classifier that assumes feature independence. |
| Artificial Neural Network (ANN) | Multi-layer neural network capable of capturing complex non-linear interactions. |
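The five models in the table can all be trained through scikit-learn's common `fit`/`score` interface, sketched below on toy data. `MLPClassifier` stands in for the ANN here (the project may use a deep-learning framework instead), and the hyperparameters shown are illustrative assumptions except for the SVM's linear kernel, which the table states.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for the encoded variant features
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "RF":  RandomForestClassifier(random_state=42),
    "LR":  LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),  # linear kernel, per the table
    "NB":  GaussianNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                         random_state=42),
}

# Fit each model and record its test-set accuracy
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
print(scores)
```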
## 3. Evaluation Metrics
Models were evaluated on the test set using the following metrics:
- **Accuracy:** Proportion of correct predictions.
- **Precision:** Proportion of predicted positives that are actually positive.
- **Recall:** Proportion of actual positives correctly predicted.
- **F1-Score:** Harmonic mean of precision and recall.
- **Confusion Matrix:** Displays true positives, true negatives, false positives, and false negatives.
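All four metrics and the confusion matrix are available in `sklearn.metrics`; the small worked example below uses hand-made labels, where 3 true positives, 3 true negatives, 1 false positive, and 1 false negative make every metric come out to 0.75.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
print("F1:       ", f1_score(y_true, y_pred))         # 0.75

# Rows are true classes, columns are predicted classes:
# [[TN, FP], [FN, TP]] = [[3, 1], [1, 3]]
print(confusion_matrix(y_true, y_pred))
```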
## 📊 Numerical Column Summary

Summary statistics for the binary `CLASS` column (after balancing):

| Statistic | Value |
|---|---|
| Count | 1724 |
| Mean | 0.5 |
| Std | 0.500145 |
| Min | 0 |
| 25% | 0 |
| 50% | 0.5 |
| 75% | 1 |
| Max | 1 |
## 🔢 Columns and Unique Values
| Column | Unique Values |
|---|---|
| REF | 851 distinct values |
| CLASS | 2 (0: 862, 1: 862) |
| IMPACT | 4 (MODERATE: 635, HIGH: 538, MODIFIER: 319, LOW: 232) |
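Summaries like the two tables above can be produced directly with pandas; the sketch below uses a toy frame with the same column names.

```python
import pandas as pd

# Toy stand-in with the same column names as the project dataset
df = pd.DataFrame({
    "CLASS":  [0, 1, 0, 1],
    "IMPACT": ["HIGH", "LOW", "HIGH", "MODERATE"],
})

print(df["CLASS"].describe())       # count, mean, std, min/quartiles/max
print(df.nunique())                 # number of unique values per column
print(df["IMPACT"].value_counts())  # per-category counts
```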
## ⚙️ Model Comparison Results
| Model | Accuracy | Precision (0/1) | Recall (0/1) | F1-Score (0/1) |
|---|---|---|---|---|
| Random Forest | 0.80 | 0.80 / 0.81 | 0.82 / 0.79 | 0.81 / 0.80 |
| Logistic Regression | 0.73 | 0.74 / 0.72 | 0.71 / 0.76 | 0.72 / 0.74 |
| SVM | 0.81 | 0.82 / 0.80 | 0.80 / 0.83 | 0.81 / 0.81 |
| Naive Bayes | 0.64 | 0.72 / 0.60 | 0.46 / 0.82 | 0.56 / 0.69 |
| ANN | 0.77 | 0.77 / 0.77 | 0.77 / 0.77 | 0.77 / 0.77 |
## 📈 Model Insights
- Random Forest achieved the highest AUC (0.86), making it the best-performing model.
- SVM and ANN followed closely, showing strong generalization.
- Naive Bayes performed the weakest on this dataset, likely because its feature-independence assumption does not hold well for the one-hot-encoded features.
In summary, the ROC curve visually and quantitatively confirmed that Random Forest provides the best trade-off between true positive rate and false positive rate across all thresholds.
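The ROC/AUC comparison described above can be computed with `roc_curve` and `roc_auc_score`; the sketch below does so for a Random Forest on toy data (the 0.86 figure comes from the project's own run, not from this example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded variant features
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]        # probability of class 1

# TPR vs FPR at every threshold, and the area under that curve
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print("AUC:", auc)
```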