116 results for “topic:data-lakehouse”
Open-source Snowflake & Fivetran alternative, with Postgres compatibility.
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
A curated list of open-source tools used in analytics platforms and the data engineering ecosystem.
End-to-end Data Lakehouse project built on Databricks, following the Medallion Architecture (Bronze, Silver, Gold). Covers real-world data engineering and analytics workflows using Spark, PySpark, SQL, Delta Lake, and Unity Catalog. Designed for learning, portfolio building, and job interviews.
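The Bronze/Silver/Gold layering mentioned above can be sketched in plain Python. This is a minimal stdlib-only illustration of the Medallion pattern, not the project's actual Spark/Delta Lake code, and the "sales" record schema here is a hypothetical example:

```python
# Minimal sketch of Medallion (Bronze/Silver/Gold) layering.
# Real lakehouses materialize each layer as Spark + Delta Lake tables;
# plain lists of dicts stand in here, and the schema is hypothetical.
from collections import defaultdict

def bronze_ingest(raw_rows):
    """Bronze: land raw data as-is, with no validation."""
    return list(raw_rows)

def silver_clean(bronze_rows):
    """Silver: validate, type-cast, and drop malformed records."""
    cleaned = []
    for row in bronze_rows:
        try:
            cleaned.append({
                "region": row["region"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except (KeyError, ValueError, AttributeError):
            continue  # drop/quarantine bad rows
    return cleaned

def gold_aggregate(silver_rows):
    """Gold: business-level aggregates ready for reporting."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

raw = [
    {"region": " EU ", "amount": "10.5"},
    {"region": "US", "amount": "7"},
    {"region": "EU", "amount": "not-a-number"},  # dropped in Silver
]
gold = gold_aggregate(silver_clean(bronze_ingest(raw)))
print(gold)  # {'eu': 10.5, 'us': 7.0}
```

Each layer only reads from the one before it, which is the core of the pattern: raw data stays replayable in Bronze while downstream layers can be rebuilt at any time.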
Open-source data framework for biology. Context and memory for datasets and models at scale. Query, trace & validate with a lineage-native lakehouse that supports bio-formats, registries & ontologies. 🍊YC S22
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
ETL / ELT Framework powered by DuckDB, designed to seamlessly integrate and process data from diverse sources. It leverages Markdown as a configuration medium, where YAML blocks define metadata for each data source, and embedded SQL blocks specify the extraction, transformation, and loading logic.
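The Markdown-as-configuration idea described above can be illustrated with a small parser. This is a stdlib-only sketch under assumed conventions (a `yaml`-tagged fence for source metadata and an `sql`-tagged fence for the load logic); the actual framework's block names and schema may differ:

```python
import re

FENCE = "```"

# Hypothetical Markdown config in the style described: a YAML block
# carries source metadata, an SQL block carries the ELT logic.
DOC = f"""
# Orders source

{FENCE}yaml
name: orders
format: csv
path: data/orders.csv
{FENCE}

{FENCE}sql
SELECT order_id, SUM(amount) AS total
FROM orders
GROUP BY order_id;
{FENCE}
"""

def extract_blocks(markdown, lang):
    """Return the contents of all fenced code blocks tagged with `lang`."""
    pattern = rf"{FENCE}{lang}\n(.*?){FENCE}"
    return [m.strip() for m in re.findall(pattern, markdown, re.DOTALL)]

def parse_simple_yaml(text):
    """Parse flat `key: value` lines (no nesting, stdlib only)."""
    meta = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

meta = parse_simple_yaml(extract_blocks(DOC, "yaml")[0])
sql = extract_blocks(DOC, "sql")[0]
print(meta["name"], "->", sql.splitlines()[0])
# orders -> SELECT order_id, SUM(amount) AS total
```

A real implementation would hand the YAML metadata to a registration step and execute the SQL against DuckDB; the point of the sketch is only how one Markdown file can carry both.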
Data Engine for Manual/Algo Trading: Download/Stream -> Clean -> Store. Supports Data Lakehouse Architecture. Clean Once and Forget.
SwiftLake: Java SQL engine built on Apache Iceberg and DuckDB for efficient lakehouse reads and writes
Floe: Policy-based table maintenance for Apache Iceberg
DatAasee - A Metadata-Lake for Libraries
This repository hosts materials for the Data Warehousing course at the Information Systems & Analytics department, Santa Clara University.
This repo provides a step-by-step approach to building a modern data warehouse using PostgreSQL. It covers the ETL (Extract, Transform, Load) process, data modeling, exploratory data analysis (EDA), and advanced data analysis techniques.
My M.Sc. dissertation: a modern data platform using DataOps, Kubernetes, and the cloud-native ecosystem to build a resilient big-data platform based on the Data Lakehouse architecture, serving as the foundation for Machine Learning (MLOps) and Artificial Intelligence (AIOps).
The project aims to process Formula 1 racing data, create an automated data pipeline, and make the data available for presentation and analysis purposes.
Complete open-source data platform with Airbyte, Dremio, dbt, and Apache Superset - Documented in 18 languages
A project that builds a local data lakehouse using open-source tools, with Apache Iceberg as the open table format.
🌊 Git-like Version Control for Data with Nessie, Iceberg, and Spark
This project implements an end-to-end tech stack for a data platform, for local development.
🚀 Scalable near-real-time data pipeline using Apache Iceberg, Spark, Kafka, and Trino. ACID-compliant JSON ingestion, processing, and analytics. Dockerized for easy deployment. #DataEngineering #DataLake
Data lakehouse at home with docker compose
Building a modern data warehouse with SQL Server, including ETL processes, data modeling and analytics.
A comprehensive data engineering project that builds a reliable foundation for AI and business intelligence.
This project implements a complete Modern Data Warehouse using SQL-based ETL pipelines and Medallion Architecture (Bronze/Silver/Gold). It includes raw data ingestion, transformation layers, dimensional modeling, data marts, and analytical reporting structures suitable for business intelligence and data engineering workflows.
Production-ready Apache Superset with DuckLake integration. Stateless analytics architecture using DuckDB for compute, PostgreSQL for metadata, and S3/GCS/MinIO for data lake storage. Includes Docker Compose, Kubernetes Helm charts, BigQuery Integration, and CI/CD workflows. Supports MotherDuck cloud integration.
My Bachelor's degree graduation project at HUST: a mini data lakehouse. It earned an A.
This repo is built to showcase my skills in Snowflake and Tableau.
Building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
This repository contains my first end-to-end Data Engineering project, built using Microsoft Azure Cloud and Azure Databricks with PySpark.