Framingham_CVD_Risk

Healthcare Classification Problem

Index:

  1. Background
  2. Problem Statement
  3. Data Preprocessing
  4. Preliminary Analysis
  5. Data Distribution and Outliers
     1. Categorical Variables
     2. Numerical Variables
  6. Missing Values and Imputation
  7. Correlation Analysis
  8. Normality Check
  9. Undersampling Data
  10. Transformation Pipeline
  11. Modeling
     • KNN Classifier
     • Logistic Regression
     • Decision Tree Classifier
     • Random Forest Classifier
     • Bernoulli Naive Bayes Classifier
     • Bagging Classifiers
  12. Performance Evaluation
  13. Results
  14. Challenges and Limitations
  15. Future Scope

About Project:

  • Identifying people at risk of heart disease and making sure they receive proper treatment can prevent deaths from cardiovascular disease (CVD).
  • Risk stratification with machine learning methods can identify people at risk of CVD and can prove to be a better preventive, prognostic, and management tool for the population.

Framingham Heart Study (FHS)

  • The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts, in the US. The data collected can be studied to identify risk factors and their joint effects.
  • The dataset used here is a subset of the longitudinal data collected as part of FHS. It includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants, whose 10-year coronary heart disease risk was recorded over years of surveillance.
  • Original data source
    Available on request at https://biolincc.nhlbi.nih.gov/teaching/

Objective of the study:

The goal of the analysis is to predict whether a participant is at risk of developing coronary heart disease (CHD) within 10 years, based on the participant's current risk-factor data.

Questions to ask:

  1. Which risk factors does the dataset contain?
  2. How strongly do the risk factors correlate with the target variable?
  3. How is the data distributed across demographic variables (sex, age, education level)?
  4. How is the behavioural data represented in the dataset?
  5. Is the target variable balanced in the dataset?
  6. How applicable is the data in view of population demographics?
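
Questions 2 and 5 can be illustrated with a small sketch. It is not the notebook's code: the records are made-up toy values rather than Framingham data, and the column names (`age`, `smoker`, `chd_10yr`) and helper functions are hypothetical. It checks how balanced the target is, measures the correlation of one risk factor with the target, and shows random undersampling as one possible way to even out the classes.

```python
import math
import random
from collections import Counter

# Toy records for illustration only; these are NOT Framingham data.
# The column names ("age", "smoker", "chd_10yr") are hypothetical.
records = [
    {"age": 39, "smoker": 0, "chd_10yr": 0},
    {"age": 46, "smoker": 0, "chd_10yr": 0},
    {"age": 48, "smoker": 1, "chd_10yr": 0},
    {"age": 61, "smoker": 1, "chd_10yr": 1},
    {"age": 43, "smoker": 0, "chd_10yr": 0},
    {"age": 63, "smoker": 0, "chd_10yr": 1},
    {"age": 45, "smoker": 1, "chd_10yr": 0},
    {"age": 52, "smoker": 0, "chd_10yr": 0},
]


def class_balance(rows, target):
    """Return the share of rows that fall in each target class."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def undersample(rows, target, seed=0):
    """Randomly drop rows so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


balance = class_balance(records, "chd_10yr")
age_corr = pearson(
    [r["age"] for r in records],
    [r["chd_10yr"] for r in records],
)
balanced = undersample(records, "chd_10yr")

print("Class shares:", balance)
print("Correlation of age with target:", round(age_corr, 3))
print("Rows after undersampling:", len(balanced))
```

Undersampling discards majority-class rows, so it trades data volume for balance. On a real dataset it is usually applied to the training split only.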

Acknowledgement: Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC), The National Heart, Lung, and Blood Institute (NHLBI), NIH, for providing the data on request.

Languages

Jupyter Notebook 100.0%

Created January 11, 2023
Updated March 9, 2026