GitHunt

S18-Niloy/ML-based-Genome-Analysis

Genetic Variant Classification using machine learning

🧬 Genetic Variant Classification

1. Dataset Preparation

The dataset contained three columns: REF, CLASS, and IMPACT.

  • Class Balancing:
    The data was balanced by oversampling the minority class so that CLASS 0 and CLASS 1 each contained 862 samples.

  • Train-Test Split:
    The dataset was split into 80% training and 20% testing, maintaining class balance using
    train_test_split(..., stratify=y).

  • Feature Encoding:
    Since REF was categorical, one-hot encoding was applied to convert it into numeric features suitable for machine learning and neural network models.
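The three preparation steps above can be sketched with pandas and scikit-learn. The miniature DataFrame below is a hypothetical stand-in for the real REF/CLASS data; the repository does not publish this exact snippet:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny imbalanced stand-in for the REF/CLASS columns (hypothetical values).
df = pd.DataFrame({
    "REF":   ["A", "C", "G", "T", "A", "C", "G", "T", "AT", "GC"],
    "CLASS": [0,   0,   0,   0,   0,   0,   0,   0,   1,    1],
})

# 1. Class balancing: oversample the minority class to match the majority.
majority = df[df["CLASS"] == 0]
minority = df[df["CLASS"] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up]).reset_index(drop=True)

# 3. Feature encoding: one-hot encode the categorical REF column.
X = pd.get_dummies(balanced["REF"], prefix="REF")
y = balanced["CLASS"]

# 2. Train-test split: 80/20, stratified to preserve the 50/50 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

With stratification, both folds keep the even class ratio that the oversampling established.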


2. Models Trained

Model                            Description
Random Forest (RF)               Ensemble tree-based model that captures non-linear relationships effectively.
Logistic Regression (LR)         A simple linear model used as a baseline for comparison.
Support Vector Machine (SVM)     Effective in high-dimensional spaces; used a linear kernel.
Naive Bayes (NB)                 A probabilistic classifier that assumes feature independence.
Artificial Neural Network (ANN)  Multi-layer neural network capable of capturing complex non-linear interactions.
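Assuming standard scikit-learn estimators (the repository's exact hyperparameters are not stated here), the five models can be trained in one loop. `make_classification` stands in for the one-hot-encoded REF features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the encoded genome features (hypothetical shapes).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One estimator per model family listed in the table above.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear", random_state=42),
    "Naive Bayes": GaussianNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                         random_state=42),
}

# Fit each model and record its test-set accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
```

Keeping the estimators in a dict makes it easy to evaluate all five under identical train/test conditions.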

3. Evaluation Metrics

Models were evaluated on the test set using the following metrics:

  • Accuracy: Proportion of correct predictions.
  • Precision: Proportion of predicted positives that are actually positive.
  • Recall: Proportion of actual positives correctly predicted.
  • F1-Score: Harmonic mean of precision and recall.
  • Confusion Matrix: Displays true positives, true negatives, false positives, and false negatives.
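The five metrics above map directly onto `sklearn.metrics` functions. A minimal sketch on hypothetical predictions (not the repository's actual outputs):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions for a small test fold.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

acc  = accuracy_score(y_true, y_pred)   # correct predictions / total
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
cm   = confusion_matrix(y_true, y_pred) # rows = true class, cols = predicted
```

Here one positive is missed (a false negative) and one negative is mislabeled (a false positive), so precision, recall, and accuracy all come out below 1.0.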

📊 Numerical Column Summary

These statistics describe the binary CLASS column after balancing (1724 values, an even 0/1 split):

Statistic   Value
Count       1724
Mean        0.5
Std         0.500145
Min         0
25%         0
50%         0.5
75%         1
Max         1

🔢 Columns and Unique Values

Column   Unique Values
REF      851 distinct values
CLASS    0 : 862 | 1 : 862
IMPACT   MODERATE : 635 | HIGH : 538 | MODIFIER : 319 | LOW : 232

⚙️ Model Ablation Results

Model                Accuracy   Precision (0/1)   Recall (0/1)   F1-Score (0/1)
Random Forest        0.80       0.80 / 0.81       0.82 / 0.79    0.81 / 0.80
Logistic Regression  0.73       0.74 / 0.72       0.71 / 0.76    0.72 / 0.74
SVM                  0.81       0.82 / 0.80       0.80 / 0.83    0.81 / 0.81
Naive Bayes          0.64       0.72 / 0.60       0.46 / 0.82    0.56 / 0.69
ANN                  0.77       0.77 / 0.77       0.77 / 0.77    0.77 / 0.77

๐Ÿ† Model Insights

  • Random Forest achieved the highest AUC (0.86), making it the best-performing model.
  • SVM and ANN followed closely, showing strong generalization.
  • Naive Bayes performed the weakest on this dataset due to its independence assumption.

In summary, the ROC curve visually and quantitatively confirmed that Random Forest provides the best trade-off between true positive rate and false positive rate across all thresholds.
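The ROC/AUC computation described above can be sketched as follows; synthetic data stands in for the repository's encoded features, so the resulting AUC is illustrative, not the reported 0.86:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot-encoded REF features.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# ROC curve: true positive rate vs. false positive rate at every threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)  # area under that curve, 0.5 = chance
```

Sweeping over all thresholds is what lets a single AUC number summarize the trade-off that the insight above describes.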
