GitHunt
ZY

zyna-b/Insurance-Cost-Analysis-and-Prediction

Medical insurance EDA and prediction: feature engineering, correlation analysis & Chi-square tests

๐Ÿฅ๐Ÿ’ฐ Insurance Cost Analysis & Prediction - Complete Data Science Project

Python
Pandas
Scikit-Learn
Jupyter
License: MIT

A comprehensive exploratory data analysis (EDA) and machine learning project for predicting insurance costs using demographic and health factors.

๐ŸŽฏ Project Overview

This project demonstrates a complete data science workflow for analyzing and predicting insurance costs. Using a dataset of 1,338 insurance records, we explore relationships between demographic factors, health indicators, and insurance charges through advanced statistical analysis and machine learning.

๐Ÿ” Key Features

  • Complete EDA Pipeline: From data exploration to feature engineering
  • Statistical Analysis: Correlation analysis, Chi-square testing, and hypothesis testing
  • Machine Learning Model: Linear regression with 80.4% R-squared accuracy
  • Feature Engineering: BMI categorization and advanced feature selection
  • Data Visualization: Professional plots using Matplotlib and Seaborn

๐Ÿ“Š Dataset Information

Feature Description Type
Age Age of primary beneficiary Numerical (18-64)
Sex Insurance contractor gender Categorical (male/female)
BMI Body mass index Numerical (15.96-53.13)
Children Number of dependents Numerical (0-5)
Smoker Smoking status Categorical (yes/no)
Region Beneficiary's residential area Categorical (4 regions)
Charges Medical costs billed by insurance Target Variable

Dataset Stats: 1,338 records โ€ข 7 features โ€ข No missing values โ€ข 1 duplicate removed

๐Ÿš€ Quick Start

Prerequisites

Python 3.9+
Jupyter Notebook or JupyterLab

Installation & Setup

  1. Clone this repository
git clone https://github.com/zyna-b/Insurance-Cost-Analysis-EDA.git
cd Insurance-Cost-Analysis-EDA
  1. Create virtual environment
python -m venv venv_py39
# Windows
venv_py39\Scripts\activate
# macOS/Linux
source venv_py39/bin/activate
  1. Install dependencies
pip install -r requirements.txt
  1. Launch analysis
jupyter notebook insurance.ipynb

๐Ÿ“ˆ Analysis Workflow

1. ๐Ÿ” Exploratory Data Analysis

  • Data Inspection: Shape, types, missing values, duplicates
  • Descriptive Statistics: Central tendencies and distributions
  • Univariate Analysis: Individual feature distributions
  • Bivariate Analysis: Feature relationships with target variable

2. ๐Ÿ“Š Data Visualization

  • Distribution Plots: Age, BMI, children, charges histograms with KDE
  • Count Plots: Categorical variable frequencies
  • Box Plots: Outlier detection and quartile analysis
  • Correlation Heatmap: Feature relationship visualization

3. ๐Ÿงน Data Preprocessing

  • Data Cleaning: Duplicate removal and type conversion
  • Feature Encoding:
    • Binary encoding for gender and smoker status
    • One-hot encoding for region variables
  • Feature Engineering: BMI categorization (Underweight, Normal, Overweight, Obesity)
  • Standardization: StandardScaler for numerical features

4. ๐Ÿ“Š Statistical Analysis

Correlation Analysis

# Key findings from Pearson correlation analysis
correlations = {
    'is_smoker': 0.787,      # Strongest predictor
    'age': 0.299,            # Moderate positive correlation
    'bmi': 0.198,            # Weak positive correlation
    'children': 0.068,       # Very weak correlation
    # ... additional features
}

Chi-Square Testing

  • Purpose: Test independence between categorical variables and charges
  • Significance Level: ฮฑ = 0.05
  • Results: Identified significant features for model inclusion

5. ๐Ÿค– Machine Learning Model

Linear Regression Implementation

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Model training and evaluation
model = LinearRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)

Model Performance

  • R-squared Score: 0.804 (80.4% variance explained)
  • Adjusted R-squared: 0.799
  • Train-Test Split: 80% training, 20% testing
  • Random State: 42 (reproducible results)

๐Ÿ” Key Insights & Findings

๐Ÿ’ก Business Intelligence

  1. Smoking Impact: Smoking status is the strongest predictor of insurance costs
  2. Age Factor: Older individuals tend to have higher insurance charges
  3. BMI Influence: Higher BMI correlates with increased medical costs
  4. Regional Variations: Geographic location affects insurance pricing
  5. Family Size: Number of children has minimal impact on costs

๐Ÿ“Š Statistical Discoveries

  • Correlation Strength: Smoking status shows 0.787 correlation with charges
  • Feature Importance: Age, BMI, and smoking status are primary cost drivers
  • Data Distribution: Charges show right-skewed distribution (typical for insurance data)
  • Gender Impact: Minimal difference in average costs between males and females

๐Ÿ› ๏ธ Technologies & Libraries

Core Stack

import pandas as pd              # Data manipulation and analysis
import numpy as np               # Numerical computing
import matplotlib.pyplot as plt  # Data visualization
import seaborn as sns           # Statistical data visualization

Machine Learning & Statistics

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from scipy.stats import pearsonr, chi2_contingency

๐Ÿ“ Project Structure

Insurance-Cost-Analysis-EDA/
โ”œโ”€โ”€ ๐Ÿ““ insurance.ipynb          # Main analysis notebook
โ”œโ”€โ”€ ๐Ÿ“Š insurance.csv           # Dataset (1,338 records)
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt        # Python dependencies
โ”œโ”€โ”€ ๐Ÿ“„ README.md              # Project documentation
โ””โ”€โ”€ ๐Ÿ“ venv_py39/             # Virtual environment
    โ”œโ”€โ”€ Scripts/              # Environment executables
    โ”œโ”€โ”€ Lib/                  # Installed packages
    โ””โ”€โ”€ Include/              # Header files

๐Ÿ“ˆ Visualizations Included

  • Distribution Analysis: Histograms with KDE for numerical variables
  • Categorical Analysis: Count plots for sex, smoker, region, children
  • Correlation Matrix: Heatmap showing feature relationships
  • Box Plots: Outlier detection for numerical features
  • Feature Engineering: BMI categorization visualization

๐Ÿ”ฌ Statistical Methods Explained

Correlation Analysis

  • Pearson Correlation: Measures linear relationship strength (-1 to +1)
  • Interpretation: Values closer to ยฑ1 indicate stronger linear relationships
  • Application: Identifying features most correlated with insurance charges

Chi-Square Testing

  • Purpose: Tests independence between categorical variables and target
  • Null Hypothesis: Variables are independent
  • Decision Rule: Reject H0 if p-value < 0.05
  • Business Value: Validates which categorical features significantly impact costs

Feature Engineering

  • BMI Categories: Medical standard classifications
  • Dummy Variables: Binary encoding for categorical features
  • Standardization: Zero mean, unit variance for numerical features

๐ŸŽฏ Model Evaluation Metrics

Metric Value Interpretation
R-squared 0.804 Model explains 80.4% of variance
Adjusted R-squared 0.799 Accounts for number of predictors
Features Used 7 Optimal feature subset selected
Sample Size 1,337 After duplicate removal

๐Ÿ”ฎ Future Enhancements

  • Advanced Models: Random Forest, Gradient Boosting, Neural Networks
  • Cross-Validation: K-fold validation for robust performance metrics
  • Feature Engineering: Polynomial features, interaction terms
  • Hyperparameter Tuning: Grid search for optimal parameters
  • Interactive Dashboard: Streamlit or Dash implementation
  • Model Deployment: Flask API for real-time predictions

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Dataset Source: Kaggle Insurance Dataset
  • Statistical Methods: Scipy documentation and best practices
  • Visualization Inspiration: Seaborn gallery and matplotlib examples
  • Machine Learning Techniques: Scikit-learn documentation

๐Ÿ‘จโ€๐Ÿ’ป Author

Zainab Hamid

๐Ÿ“Š Keywords

insurance-analysis data-science machine-learning exploratory-data-analysis python pandas scikit-learn statistical-analysis data-visualization linear-regression feature-engineering correlation-analysis chi-square-testing jupyter-notebook healthcare-analytics


โญ Found this project helpful? Please consider starring the repository!

๐Ÿ” Looking for specific analysis techniques? Check out the detailed Jupyter notebook for complete implementation.

๐Ÿ“ˆ Interested in similar projects? Follow for more data science content!