josephmachado/de_project
Step by step instructions to create a production-ready data pipeline
Build a data engineering project, with step-by-step instructions
- Code for the blog: Build data engineering projects with step-by-step instruction
- Live workshop link
Data used
Let's assume we are working with a car part seller database (tpch). The data is available in a DuckDB database. See the data model below:
We can create fake input data using the create_input_data.py script.
Architecture
Most data teams have their own version of the 3-hop architecture. For example, dbt has its own version (stage, intermediate, mart), and Databricks (Spark) has the medallion (bronze, silver, gold) architecture.
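The three hops can be sketched in plain Python as below. This is only an illustration of the layering idea, not the code in this repo: the function names, the output column names, and the sample rows are hypothetical, and the input columns are assumed to follow the TPC-H customer and orders tables.

```python
from collections import defaultdict

# Hypothetical raw rows standing in for tables read from the input database.
raw_customers = [
    {"c_custkey": 1, "c_name": " Customer#1 ", "c_nationkey": 10},
    {"c_custkey": 2, "c_name": "Customer#2", "c_nationkey": 10},
    {"c_custkey": 2, "c_name": "Customer#2", "c_nationkey": 10},  # duplicate row
]
raw_orders = [
    {"o_orderkey": 100, "o_custkey": 1, "o_totalprice": 50.0},
    {"o_orderkey": 101, "o_custkey": 2, "o_totalprice": 20.0},
    {"o_orderkey": 102, "o_custkey": 2, "o_totalprice": 30.0},
]


def to_bronze(rows):
    """Hop 1 (bronze / stage): land the raw data as-is, copying each row."""
    return [dict(row) for row in rows]


def to_silver_customers(bronze_customers):
    """Hop 2 (silver / intermediate): clean names, rename columns, deduplicate."""
    by_key = {}
    for row in bronze_customers:
        # Keep the first row seen for each customer key.
        by_key.setdefault(
            row["c_custkey"],
            {
                "customer_key": row["c_custkey"],
                "customer_name": row["c_name"].strip(),
                "nation_key": row["c_nationkey"],
            },
        )
    return list(by_key.values())


def to_gold_customer_sales(silver_customers, bronze_orders):
    """Hop 3 (gold / mart): aggregate into a table ready for consumers."""
    totals = defaultdict(float)
    for order in bronze_orders:
        totals[order["o_custkey"]] += order["o_totalprice"]
    return [
        {
            "customer_key": c["customer_key"],
            "customer_name": c["customer_name"],
            "total_sales": totals[c["customer_key"]],
        }
        for c in silver_customers
    ]


if __name__ == "__main__":
    silver = to_silver_customers(to_bronze(raw_customers))
    for row in to_gold_customer_sales(silver, to_bronze(raw_orders)):
        print(row)
```

Each hop only reads from the one before it, which is what makes the layers easy to test and rerun independently.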
Tools used:
Setup
You have two options to run the exercises in this repo:
Option 1: GitHub Codespaces (Recommended)
Steps:
- Create a GitHub codespace with this link.
- Wait for GitHub to install the packages in requirements.txt. This step can take about 5 minutes.

- Now open setup-data-project.ipynb and it will open in a Jupyter notebook interface. You will be asked for your kernel choice; choose Python Environments and then python3.12.00 Global.

- The setup-data-project notebook goes over how to create a data pipeline.
- In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
Option 2: Run locally
Steps:
- Clone this repo and cd into the cloned repo.
- Start a virtual env and install requirements.
- Start Jupyter Lab and run the setup-data-project.ipynb notebook, which goes over how to create a data pipeline.
git clone https://github.com/josephmachado/de_project.git
cd de_project
rm -rf env
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt
jupyter lab
- In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
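For readers who have not written a pipeline unit test before, the sketch below shows the general shape of what a file such as test_dim_customer.py could check. It is not the repo's actual test: build_dim_customer, its output columns, and the sample rows are hypothetical, and the input columns are assumed to follow the TPC-H customer and nation tables.

```python
def build_dim_customer(customers, nations):
    """Hypothetical transformation: attach a nation name to each customer."""
    nation_names = {n["n_nationkey"]: n["n_name"] for n in nations}
    return [
        {
            "customer_key": c["c_custkey"],
            "customer_name": c["c_name"],
            # .get returns None when the nation key has no match.
            "nation_name": nation_names.get(c["c_nationkey"]),
        }
        for c in customers
    ]


def test_dim_customer_joins_nation():
    # Small hand-written input so the expected output is obvious.
    customers = [{"c_custkey": 1, "c_name": "Customer#1", "c_nationkey": 7}]
    nations = [{"n_nationkey": 7, "n_name": "GERMANY"}]
    assert build_dim_customer(customers, nations) == [
        {"customer_key": 1, "customer_name": "Customer#1", "nation_name": "GERMANY"}
    ]


def test_dim_customer_unknown_nation_is_none():
    # A customer whose nation is missing should be kept, not dropped.
    customers = [{"c_custkey": 2, "c_name": "Customer#2", "c_nationkey": 99}]
    result = build_dim_customer(customers, nations=[])
    assert len(result) == 1
    assert result[0]["nation_name"] is None
```

pytest collects functions whose names start with test_ and reports any failing assert, so a file shaped like this can be run with the same python -m pytest command shown above.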
Created September 17, 2024
Updated December 15, 2025






