GitHunt
LA

lakshmi14k/Vehicle-Collision-Analysis

BI analytics project analyzing traffic collision data across Austin, Chicago, and NYC to identify high-risk patterns and inform public safety interventions using ETL pipelines and interactive dashboards.

Vehicle Collision Analysis

Problem Statement: Urban transportation departments struggle to reduce traffic fatalities and improve road safety due to:

  • High volume of collision incidents requiring investigation
  • Inconsistent accident reporting across jurisdictions
  • Limited visibility into contributing factors and patterns
  • Reactive rather than proactive safety interventions
  • Fragmented data collection standards

Business Impact: Inefficient allocation of traffic safety resources leads to preventable accidents, increased fatalities, and millions in economic losses from property damage and medical costs.

Proposed Solution: A dimensional data warehouse integrating collision data from three major US cities, enabling:

  • Standardized accident tracking and analysis
  • Pattern identification for preventive interventions
  • Comparative analysis across jurisdictions
  • Data-driven traffic safety policy decisions
  • Real-time monitoring of high-risk locations

Key Findings:

Austin (147,750 collisions analyzed - 2014-2024):

  • Fatal crashes represent 0.6% of total incidents
  • Contributing factors missing in 80.6% of records (data quality issue)
  • Pedestrian-involved crashes: 3,505 incidents
  • Average crash severity: moderate injuries

Chicago (817,723 collisions analyzed - 2013-2024):

  • 68.7% of records missing hit-and-run data
  • Top contributing cause: Driver skills/knowledge/experience
  • Work zone crashes: 1,052 incidents (requires targeted safety measures)
  • Injury-causing crashes: 599,383 (73% of total)

NYC (2,075,427 collisions analyzed - 2012-2024):

  • Largest dataset with most comprehensive tracking
  • Missing data primarily in secondary vehicle fields (expected pattern)
  • Borough-level analysis shows concentration in Manhattan and Brooklyn
  • Contributing factors well-documented for primary vehicles

Cross-City Insights:

  • Data quality varies significantly (Chicago most complete, Austin least)
  • NYC has highest volume but lower fatality rate
  • Contributing factor taxonomies need standardization across jurisdictions
  • Temporal patterns show peak accident times during rush hours

Dataset Overview:

Austin Traffic Crashes:

  • Source: Austin Transportation Department
  • Records: 147,750 collisions
  • Time Period: March 2014 - March 2024
  • Variables: 54 columns (crash details, location, injuries, contributing factors)
  • Missing Data: 21.6% overall

Chicago Traffic Crashes:

  • Source: Chicago Police Department
  • Records: 817,723 collisions
  • Time Period: March 2013 - March 2024
  • Variables: 48 columns (crash type, weather, road conditions, injuries)
  • Missing Data: 21.1% overall

NYC Motor Vehicle Collisions:

  • Source: NYC Police Department
  • Records: 2,075,427 collisions
  • Time Period: July 2012 - March 2024
  • Variables: 28 columns (location, casualties, contributing factors, vehicle types)
  • Missing Data: 29.5% overall

Data Quality Challenges

  • Missing values (21-30% across datasets)
  • Inconsistent contributing factor codes (each city uses different taxonomy)
  • Coordinate accuracy (11-13% missing lat/long in some datasets)
  • Multi-vehicle tracking (NYC tracks up to 5 vehicles, others less granular)
  • Weather data (only Chicago tracks weather conditions)

Tech Stack:

  • ETL: Python (pandas), Talend Open Studio
  • Database: Microsoft SQL Server
  • Data Modeling: Star schema dimensional design
  • Visualization: Tableau Public, Power BI
  • Data Profiling: ydata-profiling, Alteryx Designer

Project Structure

Vehicle-Collision-Analysis/
├── README.md
├── Data/
│   ├── Raw/
│   │   ├── Austin.csv                     Austin collisions (148K records)
│   │   ├── Chicago.csv                    Chicago crashes (818K records)
│   │   └── NYC.csv                        NYC collisions (2.08M records)
│   ├── Cleaned/
│   │   ├── Austin_Cleaned.csv             Standardized Austin data
│   │   ├── Chicago_Cleaned.csv            Standardized Chicago data
│   │   └── NYC_Cleaned.csv                Standardized NYC data
├── Python/
│   ├── Cleaned Data.py                    Data cleaning script
│   └── Tableau Data.py                    Master file creation
├── Profiling/
│   ├── YData/
│   │   ├── ydata_Austin                    Austin Y-Data
│   │   ├── ydata_Chicago                   Chicago Y-Data
│   │   └── ydata_NYC                       NYC Y-Data
│   ├── Alteryx/
│   │   ├── Austin_Profiling.yxmd           Austin Alteryx
│   │   ├── Chicago_Profiling.yxmd          Chicago Alteryx
│   │   └── Austin_Profiling.yxmd           NYC Alteryx
├── Docs/
│    ├── Alteryx and ETL workflows          Talend and ETL documentation
└── Viz/
    ├── Vehicle Collision Analysis.twbx     Tableau Dashboard
    |── Master Data.csv                     Master Data for Tableau
    └── Vehicle Collision Analysis.pdf      Tableau Dashboard


Database Schema: Star Schema Design

Fact Tables (2):

  • Fct_Accident - Core collision events with casualties and outcomes
  • BridgeContributingFactor - Many-to-many relationship for contributing factors

Dimension Tables (6):

  • Dim_Location - Street addresses, coordinates, boroughs/beats
  • Dim_Date - Date hierarchy (year, quarter, month, season, weekday)
  • Dim_Time - Time of day hierarchy (hour, period, rush hour flag)
  • Dim_VehicleType - Vehicle classifications (passenger, truck, motorcycle, bicycle)
  • Dim_ContributingFactors - Accident causes (driver error, weather, road conditions)
  • Dim_WeatherConditions - Weather at time of crash (Chicago only)

Key Relationships:

Dim_Location ──┐
Dim_Date ──────┤
Dim_Time ──────┼──> Fct_Accident ──> BridgeContributingFactor ──> Dim_ContributingFactors
Dim_VehicleType┤                              
Dim_Weather ───┘                              

Total Records in Warehouse:

  • 3,040,900 collision records
  • 491,174 records (2022-2024 filtered for Tableau)
  • 4.2M+ contributing factor relationships
  • 10,000+ unique street locations

Technical Implementation

ETL Pipeline
Phase 1: Data Extraction

  • Downloaded CSV/TSV files from city open data portals
  • Initial profiling with ydata-profiling and Alteryx
  • Identified 3M+ total records spanning 10+ years

Phase 2: Data Quality Assessment

Identified issues:

  • Missing contributing factors (80% Austin, 15% Chicago, 92% NYC for secondary factors)
  • Inconsistent date/time formats across datasets
  • Coordinate accuracy issues (11-13% missing)
  • Free-text violation descriptions requiring normalization
  • Multiple vehicle tracking complexity

Phase 3: Data Transformation

Austin Processing:

  • Separated crash_date into date and time components
  • Normalized flag columns (Y/N → Boolean)
  • Handled 93% missing latitude/longitude
  • Standardized injury count fields
  • Mapped contributing factor codes

Chicago Processing:

  • Extracted time from datetime field
  • Converted uppercase column names to lowercase
  • Handled 75% missing lane count data
  • Standardized weather and lighting conditions
  • Consolidated injury severity categories

NYC Processing:

  • Renamed columns for consistency (spaces → underscores)
  • Normalized multi-vehicle data (5 vehicle slots)
  • Handled 99% missing data in vehicles 4-5 (expected)
  • Consolidated contributing factors across vehicles
  • Borough-level aggregation for geographic analysis

Phase 4: Data Integration

  • Created common schema across 3 cities
  • Filtered for 2022-2024 data (Tableau Public optimization)
  • Added calculated fields (season, time_period, severity flags)
  • Generated master visualization dataset (491K records)

Phase 5: Data Loading

  • Loaded staged data into SQL Server
  • Populated dimensional model using Talend
  • Implemented SCD Type 2 for Dim_Location (tracking street name changes)
  • Validated referential integrity and data quality

Data Profiling Results

Austin:

  • 54 variables, 147,750 observations
  • 1.7M missing cells (21.6%)
  • 0 duplicates
  • Key issues: Contributing factors missing in 80%+ of records

Chicago:

  • 48 variables, 817,723 observations
  • 8.3M missing cells (21.1%)
  • 0 duplicates
  • Key issues: Work zone data 99% missing, lane count 75% missing

NYC:

  • 29 variables, 2,075,427 observations
  • 17.8M missing cells (29.5%)
  • 0 duplicates
  • Key issues: Secondary/tertiary vehicle data sparse (expected pattern)

SQL Capabilities Demonstrated

  • Dimensional Modeling: Star schema with 6 dimensions, 2 facts, 1 bridge table
  • DDL Design: Foreign keys, constraints, covering indexes
  • Data Validation: Row count reconciliation, referential integrity checks
  • ETL Logic: Multi-source staging → transformation → dimensional model
  • Data Quality: Profiling, null handling, duplicate detection
  • Performance Optimization: Indexed lookups, efficient joins
  • Bridge Tables: Many-to-many relationship handling (accidents ↔ contributing factors)

Visualizations

Tableau Dashboards:

Dashboard 1: Overview & Hotspots

  • KPIs: Total accidents, fatalities, injuries, fatality rate
  • Interactive city-level map showing accident concentrations
  • Time trend analysis (monthly patterns 2022-2024)
  • Top contributing factors analysis

Dashboard 2: Deep Dive Analysis

  • Weather impact on accident frequency
  • Severity analysis (injuries vs fatalities by city)
  • Time period breakdown (morning, afternoon, evening, night)
  • Deadliest streets across all three cities

For interactive dashboards, click here

Installation & Reproducibility:

Prerequisites:

  • Python 3.8+ (pandas, numpy, datetime)
  • Microsoft SQL Server (optional - for dimensional model)
  • Tableau Public or Desktop

** Setup Instructions**

1. Clone repository:

git clone https://github.com/lakshmi14k/Vehicle-Collision-Analysis.git
cd Vehicle-Collision-Analysis

2. Run Python cleaning scripts:

 Clean raw data for each city
python Python/01_clean_raw_data.py

Combine and prepare for visualization
python Python/02_combine_for_tableau.py

Outputs:

  • Data/Cleaned/Austin_Cleaned.csv
  • Data/Cleaned/Chicago_Cleaned.csv
  • Data/Cleaned/NYC_Cleaned.csv
  • Data/Master/Vehicle_Collisions_Master.csv (ready for Tableau)

3. Load data using ETL tool:

  • Talend: Use workflows in ETL/ folder
  • Alternative: Load CSVs directly using Python pandas or SSIS

4. Open Tableau dashboards:

  • Connect to Vehicle_Collisions_Master.csv in Tableau Public
  • Or connect to SQL Server database for full dimensional model

Results:

Data Pipeline:

  • 3,040,900 total collision records processed
  • 491,174 records (2022-2024) prepared for visualization
  • 3 city datasets standardized and integrated
  • 10 dimension/fact tables populated

Insights Delivered:

  • Identified peak accident times and seasonal patterns
  • Documented top contributing factors by city
  • Mapped high-risk streets requiring safety interventions
  • Enabled year-over-year trend analysis
  • Quantified pedestrian/cyclist/motorist casualty rates

Business Value:

  • Enabled data-driven traffic safety policy decisions
  • Identified high-risk locations for targeted interventions
  • Provided comparative benchmarking across major cities
  • Reduced data silos through standardized integration
  • Created foundation for predictive accident modeling

Key Technical Achievements:

  • Multi-Source Integration: Combined 3 disparate datasets (3M+ records, different schemas)
  • Data Quality Management: Handled 20-30% missing data without losing analytical value
  • Schema Standardization: Unified 3 different column structures into common format
  • Dimensional Modeling: Star schema with proper grain, SCD Type 2, and bridge tables
  • ETL Optimization: Python + Talend hybrid approach for scalability
  • Visualization Strategy: Filtered 3M → 491K records to fit Tableau Public constraints
  • Performance: Processed 3M+ records with complex transformations in under 5 minutes

Built with SQL Server, Python, Alteryx, Talend, Tableau, and dimensional data modeling best practices

Languages

HTML100.0%Python0.0%

Contributors

Created June 17, 2024
Updated February 14, 2026
lakshmi14k/Vehicle-Collision-Analysis | GitHunt