# 🧬 Genetic Variant Classification

Machine-learning-based genetic variant classification (repo: S18-Niloy/ML-based-Genome-Analysis).
## 1. Dataset Preparation

The dataset contained three columns: REF, CLASS, and IMPACT.

- **Class Balancing:** The data was balanced using oversampling so that both `CLASS 0` and `CLASS 1` had 862 samples each.
- **Train-Test Split:** The dataset was split into 80% training and 20% testing, maintaining class balance using `train_test_split(..., stratify=y)`.
- **Feature Encoding:** Since `REF` was categorical, one-hot encoding was applied to convert it into numeric features suitable for machine learning and neural network models.
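The preparation steps above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame (the real pipeline loads the project's dataset instead); the manual resampling loop is an assumption about how the oversampling was done.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the real REF/CLASS data
df = pd.DataFrame({
    "REF":   ["A", "C", "G", "T", "A", "C", "G", "T", "A", "C"],
    "CLASS": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Oversample the minority class until both classes are equal in size
counts = df["CLASS"].value_counts()
minority = counts.idxmin()
extra = df[df["CLASS"] == minority].sample(
    counts.max() - counts.min(), replace=True, random_state=42
)
df = pd.concat([df, extra], ignore_index=True)

# One-hot encode the categorical REF column into numeric features
X = pd.get_dummies(df[["REF"]], columns=["REF"])
y = df["CLASS"]

# Stratified 80/20 split preserves the class balance in both folds
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```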
## 2. Models Trained
| Model | Description |
|---|---|
| Random Forest (RF) | Ensemble tree-based model that captures non-linear relationships effectively. |
| Logistic Regression (LR) | A simple linear model used as a baseline for comparison. |
| Support Vector Machine (SVM) | Effective in high-dimensional spaces; used a linear kernel. |
| Naive Bayes (NB) | A probabilistic classifier that assumes feature independence. |
| Artificial Neural Network (ANN) | Multi-layer neural network capable of capturing complex non-linear interactions. |
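The five models in the table can all be trained through scikit-learn's common `fit`/`score` interface, sketched below on toy data. `MLPClassifier` stands in for the ANN here (the project may use a deep-learning framework instead), and the hyperparameters shown are illustrative assumptions except for the SVM's linear kernel, which the table states.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for the encoded variant features
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "RF":  RandomForestClassifier(random_state=42),
    "LR":  LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),  # linear kernel, per the table
    "NB":  GaussianNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                         random_state=42),
}

# Fit each model and record its test-set accuracy
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
print(scores)
```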
## 3. Evaluation Metrics
Models were evaluated on the test set using the following metrics:
- **Accuracy:** Proportion of correct predictions.
- **Precision:** Proportion of predicted positives that are actually positive.
- **Recall:** Proportion of actual positives correctly predicted.
- **F1-Score:** Harmonic mean of precision and recall.
- **Confusion Matrix:** Displays true positives, true negatives, false positives, and false negatives.
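All four metrics and the confusion matrix are available in `sklearn.metrics`; the small worked example below uses hand-made labels, where 3 true positives, 3 true negatives, 1 false positive, and 1 false negative make every metric come out to 0.75.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
print("F1:       ", f1_score(y_true, y_pred))         # 0.75

# Rows are true classes, columns are predicted classes:
# [[TN, FP], [FN, TP]] = [[3, 1], [1, 3]]
print(confusion_matrix(y_true, y_pred))
```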
## 📊 Numerical Column Summary

Summary statistics for the binary `CLASS` column (after balancing):

| Statistic | Value |
|---|---|
| Count | 1724 |
| Mean | 0.5 |
| Std | 0.500145 |
| Min | 0 |
| 25% | 0 |
| 50% | 0.5 |
| 75% | 1 |
| Max | 1 |
## 🔢 Columns and Unique Values
| Column | Unique Values |
|---|---|
| REF | 851 distinct values |
| CLASS | 2 (0: 862, 1: 862) |
| IMPACT | 4 (MODERATE: 635, HIGH: 538, MODIFIER: 319, LOW: 232) |
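Summaries like the two tables above can be produced directly with pandas; the sketch below uses a toy frame with the same column names.

```python
import pandas as pd

# Toy stand-in with the same column names as the project dataset
df = pd.DataFrame({
    "CLASS":  [0, 1, 0, 1],
    "IMPACT": ["HIGH", "LOW", "HIGH", "MODERATE"],
})

print(df["CLASS"].describe())       # count, mean, std, min/quartiles/max
print(df.nunique())                 # number of unique values per column
print(df["IMPACT"].value_counts())  # per-category counts
```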
## ⚙️ Model Comparison Results
| Model | Accuracy | Precision (0/1) | Recall (0/1) | F1-Score (0/1) |
|---|---|---|---|---|
| Random Forest | 0.80 | 0.80 / 0.81 | 0.82 / 0.79 | 0.81 / 0.80 |
| Logistic Regression | 0.73 | 0.74 / 0.72 | 0.71 / 0.76 | 0.72 / 0.74 |
| SVM | 0.81 | 0.82 / 0.80 | 0.80 / 0.83 | 0.81 / 0.81 |
| Naive Bayes | 0.64 | 0.72 / 0.60 | 0.46 / 0.82 | 0.56 / 0.69 |
| ANN | 0.77 | 0.77 / 0.77 | 0.77 / 0.77 | 0.77 / 0.77 |
## 📈 Model Insights
- Random Forest achieved the highest AUC (0.86), making it the best-performing model.
- SVM and ANN followed closely, showing strong generalization.
- Naive Bayes performed the weakest on this dataset, likely because its feature-independence assumption does not hold well for the one-hot-encoded features.
In summary, the ROC curve visually and quantitatively confirmed that Random Forest provides the best trade-off between true positive rate and false positive rate across all thresholds.
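The ROC/AUC comparison described above can be computed with `roc_curve` and `roc_auc_score`; the sketch below does so for a Random Forest on toy data (the 0.86 figure comes from the project's own run, not from this example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded variant features
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]        # probability of class 1

# TPR vs FPR at every threshold, and the area under that curve
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print("AUC:", auc)
```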