rbalbinotti/grocery_synthetic_data
Synthetic Grocery Supply Chain Data Generator
High-Fidelity Synthetic Engine for Inventory Optimization & Demand Forecasting
Context & Strategic Objectives
In the retail (grocery) sector, the scarcity of clean historical data or the confidentiality of real data hinders the agile development of AI models. This project fills that gap by providing a Digital Twin of the supply chain, simulating complex operations and enabling Machine Learning model testing in demand forecasting and inventory optimization scenarios.
Main objectives:
- Massive data generation: Foundation for the AI project Smart Supply Chain AI.
- Technical portfolio: Showcase proficiency in data engineering, time-series modeling, and Python-based pipeline development.
Methodology & Statistical Rigor
The simulation follows Time Series Decomposition principles, modeling demand as a combination of the following components:
- $T(t)$: Deterministic growth trend.
- $S(t)$: Weekly and annual seasonality.
- $X_i(t)$: Exogenous variables (price, real INMET weather, holidays).
- $\epsilon$: Gaussian noise simulating market uncertainties.
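To make the decomposition concrete, here is a minimal sketch that combines the four components additively, i.e. $y(t) = T(t) + S(t) + \beta X(t) + \epsilon$. The additive form, the function name `simulate_demand`, and every coefficient are assumptions chosen for illustration; the project's own generator may combine the terms differently.

```python
import numpy as np
import pandas as pd


def simulate_demand(dates, price, base=100.0, trend_per_day=0.02,
                    weekly_amp=10.0, annual_amp=15.0,
                    price_coef=-2.0, noise_sd=5.0, seed=0):
    """Additive sketch: y(t) = T(t) + S(t) + beta * X(t) + eps (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    t = np.arange(len(dates))

    # T(t): deterministic linear growth
    trend = base + trend_per_day * t
    # S(t): weekly plus annual sinusoidal cycles
    weekly = weekly_amp * np.sin(2 * np.pi * dates.dayofweek.to_numpy() / 7)
    annual = annual_amp * np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
    seasonal = weekly + annual
    # beta * X(t): a single exogenous variable (price) with a negative coefficient
    exog = price_coef * (np.asarray(price) - np.mean(price))
    # eps: Gaussian noise
    noise = rng.normal(0.0, noise_sd, size=len(dates))

    # Demand cannot be negative, so clip at zero
    demand = np.clip(trend + seasonal + exog + noise, 0.0, None)
    return pd.DataFrame({"ds": dates, "price": price, "demand": demand})


dates = pd.date_range("2023-01-01", periods=730, freq="D")
price = np.full(len(dates), 5.0)
df = simulate_demand(dates, price)
```

Setting `weekly_amp`, `annual_amp`, and `noise_sd` to zero leaves only the trend, which is a quick way to check each term in isolation.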
Technical Differentiator: Real Weather Data
Unlike common synthetic generators, this project incorporates real meteorological data (INMET/BDMEP), enriched with feature engineering to map climate severity and capture real correlations between temperature and perishable demand.
Pipeline Components
Time Series (create_data_functions.py)
- Base series: DataFrame with dates (`ds`), IDs, and target values (demand/sales).
- Trend & seasonality: Growth and weekly/annual cycles.
- Lag features: `LagFeatureCreator` adds temporal dependencies (e.g., previous week's sales).
- Events & holidays: Impacts from promotions and special dates.
- Price: Inverse relationship between price and demand.
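The lag-feature step can be sketched with plain pandas. This is not the repository's `LagFeatureCreator`, whose interface is not shown here; `add_lag_features` and its parameters are hypothetical, and the example assumes one row per product per day so that a lag of 7 means "previous week".

```python
import pandas as pd


def add_lag_features(df, target="sales_volume", group="product",
                     date_col="ds", lags=(1, 7)):
    """Add one column per lag, shifted within each group so values never leak across products."""
    out = df.sort_values([group, date_col]).copy()
    for lag in lags:
        out[f"{target}_lag_{lag}"] = out.groupby(group)[target].shift(lag)
    return out


# Toy frame: two products, ten consecutive days each
frame = pd.DataFrame({
    "ds": list(pd.date_range("2024-01-01", periods=10, freq="D")) * 2,
    "product": ["Sugar"] * 10 + ["Egg (Chicken)"] * 10,
    "sales_volume": list(range(10)) + list(range(100, 110)),
})
lagged = add_lag_features(frame)
```

The first `lag` rows of each product have no history and come out as `NaN`; whether to drop or impute them depends on the downstream model.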
Exogenous Weather Variables (weather_conditions.py)
- Temperature: Classified into ranges (Very Cold, Temperate, Hot).
- Precipitation: Intensity (from no rain to violent rainfall).
- Wind: Classified by speed.
- Seasonal simulation: Adjustments based on months and seasons.
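Classification into ranges like these is commonly done with `pandas.cut`. The sketch below bins temperature only; the thresholds and the helper name `classify_temperature` are illustrative assumptions, since the cut points used in `weather_conditions.py` are not stated here. The labels are the ones that appear in this README.

```python
import numpy as np
import pandas as pd

# Illustrative thresholds in degrees Celsius (assumed, not the project's actual cut points)
TEMP_BINS = [-np.inf, 5.0, 20.0, 28.0, np.inf]
TEMP_LABELS = ["Very Cold", "Temperate", "Warm", "Hot"]


def classify_temperature(temp_c):
    """Map temperatures to ordered categories; each bin includes its upper edge."""
    return pd.cut(pd.Series(temp_c, dtype="float64"), bins=TEMP_BINS, labels=TEMP_LABELS)


temp_class = classify_temperature([-2.0, 12.5, 24.0, 33.1])
print(temp_class.tolist())
```

The same pattern would extend to precipitation and wind by swapping in the appropriate bins and labels.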
Structure & Final Result
The final dataset is saved in Parquet format for high performance, containing 100,192 rows and 29 columns.
Sample:
| received_date | product | category | sub_category | shelf_life_days | supplier | distance_km | temp_class | precip_class | wind_class | is_holiday | sales_demand | sales_volume | stock_qty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-02-04 | Egg (Chicken) | Dairy | Eggs | 28 | FreshEggs Co. | 65 | Warm | No precip. | Gentle Breeze | False | High | 318 | 1096 |
| 2023-01-03 | Sugar | Pantry | Baking | 730 | Wholesale | 25 | Warm | No precip. | Gentle Breeze | False | High | 10 | 33 |
Note: The complete `grocery_data.parquet` file with all 100,192 rows is available for download on Kaggle.
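Once downloaded, the file can be loaded with pandas and checked against the columns shown in the sample. A minimal sketch, assuming the file sits in the working directory and `fastparquet` is installed; `load_grocery_data` and `missing_columns` are hypothetical helpers, not part of the repository.

```python
import pandas as pd

# Columns taken from the sample table above; the full file has 29 columns
SAMPLE_COLUMNS = [
    "received_date", "product", "category", "sub_category", "shelf_life_days",
    "supplier", "distance_km", "temp_class", "precip_class", "wind_class",
    "is_holiday", "sales_demand", "sales_volume", "stock_qty",
]


def load_grocery_data(path="grocery_data.parquet"):
    """Read the Parquet file; the default path is an assumption about the download location."""
    return pd.read_parquet(path, engine="fastparquet")


def missing_columns(df, expected=SAMPLE_COLUMNS):
    """Return the expected columns that are absent from df, in their listed order."""
    return [col for col in expected if col not in df.columns]
```

After `df = load_grocery_data()`, `df.shape` should match the stated 100,192 rows and 29 columns, and `missing_columns(df)` should return an empty list.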
Data Engineering & MLOps
- Modularization: Logic separated into `create_data_functions.py` and `weather_conditions.py`.
- Optimized format: `.parquet` for Big Data pipelines.
- Deployment-ready: Dockerfile for environment isolation.
- Dependency management: `pyproject.toml` with PDM.
Directory Structure
.
├── create_data_functions.py
├── data
│   ├── external
│   ├── processed
│   └── raw
├── Dockerfile
├── LICENSE
├── pdm.lock
├── pyproject.toml
├── README.md
├── README_PT.md
├── synthetic_grocery.ipynb
└── weather_conditions.py
Stack & References
- Core: `Pandas`, `NumPy`, `Scikit-Learn`, `fastparquet`.
- Statistics: `holidays`, `workalendar`.
- Weather source: Real data from INMET/BDMEP.
- Associated project: Smart Supply Chain AI.
Developed by Roberto Rosário Balbinotti, ML Architect & Data Specialist.
E-mail: rbalbinotti@gmail.com