Retail Revenue Analytics Mart (dbt + DuckDB)

📌 Overview

This project builds a production-style analytics mart from raw transactional retail data using dbt.

Using the UCI Online Retail dataset (2010–2011 UK e-commerce transactions), I transform raw invoice line items into a clean, tested star schema and generate revenue reporting models with built-in data quality enforcement.

This project demonstrates how modern Analytics Engineers:

  • Separate raw ingestion from transformation logic
  • Define clear model grain and data contracts
  • Implement automated data quality checks
  • Build documented, reproducible analytics pipelines
  • Design warehouse-portable dbt projects (DuckDB → Snowflake)

🎯 Business Context

Retail transactional data often contains:

  • Returns and cancellations
  • Negative quantities
  • Inconsistent identifiers
  • Revenue miscalculations due to unhandled edge cases

Without proper modeling and validation, downstream dashboards and financial reporting can be incorrect.

This project solves that by:

  • Standardizing raw data in a staging layer
  • Modeling clear fact and dimension tables
  • Explicitly handling cancellation logic
  • Enforcing integrity via dbt tests
  • Producing trustworthy revenue reporting outputs

๐Ÿ— Data Model

Raw Layer (Seed)

  • raw.online_retail
  • Grain: one row per invoice line item

Staging Layer

  • stg_online_retail
  • Typed and standardized fields
  • Derived fields:
    • invoice_date
    • line_revenue
    • is_cancellation (business rule logic)
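
A minimal sketch of what the staging model could look like. Source column names (InvoiceNo, StockCode, Quantity, UnitPrice, etc.) follow the UCI dataset; the seed ref name and exact casts are assumptions, and the actual model in this repo may differ:

```sql
-- models/staging/stg_online_retail.sql (illustrative sketch)
with source as (
    select * from {{ ref('online_retail') }}  -- seed name assumed
)

select
    cast(InvoiceNo  as varchar)        as invoice_no,
    cast(StockCode  as varchar)        as stock_code,
    cast(CustomerID as varchar)        as customer_id,
    cast(InvoiceDate as timestamp)     as invoice_ts,
    cast(InvoiceDate as date)          as invoice_date,
    cast(Quantity   as integer)        as quantity,
    cast(UnitPrice  as decimal(10, 2)) as unit_price,
    -- derived revenue at the line-item grain
    cast(Quantity as integer) * cast(UnitPrice as decimal(10, 2)) as line_revenue,
    -- business rule: cancellation invoices start with 'C'
    (InvoiceNo like 'C%')              as is_cancellation
from source
```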

Mart Layer (Star Schema)

Dimensions

  • dim_customers — one row per customer
  • dim_products — one row per product (StockCode)

Fact

  • fct_order_items โ€” one row per invoice line item
    • Revenue cleaned and business-validated
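
As a sketch (assuming the staging column names above; the repo's filtering rules may differ), the fact model mostly passes the cleaned staging columns through at the same grain:

```sql
-- models/marts/fct_order_items.sql (illustrative sketch)
select
    s.invoice_no,
    s.stock_code,
    s.customer_id,
    s.invoice_date,
    s.quantity,
    s.unit_price,
    s.line_revenue,
    s.is_cancellation
from {{ ref('stg_online_retail') }} as s
-- rows without a customer key cannot satisfy the
-- fact-to-dimension relationships test, so exclude them
where s.customer_id is not null
```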

Reporting

  • rpt_daily_revenue
    Aggregated daily revenue excluding cancellation rows
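
The reporting model can be sketched as a simple aggregate over the fact, with cancellations filtered out (column names assumed from the staging sketch):

```sql
-- models/marts/rpt_daily_revenue.sql (illustrative sketch)
select
    invoice_date,
    count(distinct invoice_no) as order_count,
    sum(line_revenue)          as daily_revenue
from {{ ref('fct_order_items') }}
where not is_cancellation  -- returns excluded from reported revenue
group by invoice_date
order by invoice_date
```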

🛠 Tech Stack

  • dbt Core
  • DuckDB (local analytics warehouse)
  • Snowflake-ready configuration
  • SQL
  • YAML (schema + test configuration)
  • Git / GitHub

Dataset Source:
UCI ML Repository – Online Retail

🚀 How to Run Locally

1๏ธโƒฃ Clone repository

git clone <your-repo-url>
cd retail_analytics_dbt

2๏ธโƒฃ Create virtual environment

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install dbt-core dbt-duckdb

##3๏ธโƒฃ Configure dbt profile
Create ~/.dbt/profiles.yml:

retail_analytics_dbt:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "retail.duckdb"

4️⃣ Verify connection

dbt debug

5๏ธโƒฃ Build the full pipeline

dbt seed
dbt build

6๏ธโƒฃ View documentation & lineage

dbt docs generate
dbt docs serve

✅ Data Quality & Testing Strategy

This project treats data tests as contracts, not afterthoughts.

Generic Tests

  • unique and not_null on dimension keys
  • relationships tests between fact and dimensions
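
In dbt these generic tests live in the model's schema YAML; a sketch of the pattern (file path and exact model/column names assumed):

```yaml
# models/marts/schema.yml (illustrative sketch)
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: fct_order_items
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```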

Business Logic Test

Custom SQL test:

  • Ensures non-cancellation revenue is never negative
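
A dbt singular test fails when its query returns rows, so the business rule can be expressed as "select the violations"; a sketch assuming the column names used above:

```sql
-- tests/assert_no_negative_non_cancellation_revenue.sql (illustrative)
-- Fails the build if any non-cancellation line has negative revenue.
select *
from {{ ref('fct_order_items') }}
where not is_cancellation
  and line_revenue < 0
```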

Cancellation Handling Logic

The dataset contains returns that appear as:

  • Invoice numbers beginning with "C"
  • Negative quantities
  • Negative revenue patterns

The staging layer introduces a derived is_cancellation flag to ensure:

  • Revenue reporting excludes returns
  • Downstream aggregates are financially accurate

📊 Results & Impact

✔ Built a fully reproducible analytics mart from raw retail data
✔ Enforced primary and foreign key integrity
✔ Identified and corrected dataset-specific revenue edge cases
✔ Designed project for seamless warehouse portability
✔ Full dataset (~540k rows) builds locally in under 1 second

This pipeline ensures revenue metrics are reliable before reaching BI dashboards or executive reporting.

🧠 Key Learnings

  • Clearly defining model grain prevents metric inflation
  • Data tests should reflect business logic, not just schema constraints
  • Cancellation logic must be explicitly modeled in retail datasets
  • Warehouse abstraction in dbt enables seamless platform migration
  • Lineage documentation improves transparency and trust

📌 Why This Project Matters

This is not a dashboard exercise.

It demonstrates:

  • Production-style analytics engineering practices
  • Version-controlled transformation logic
  • Data contract enforcement
  • Business-aware modeling
  • Reproducible local development
  • Clean migration path to enterprise warehouse

👤 Author

Anju Mercian
Platform Engineer | Data Platform | Modern Data Stack