# Analytics Engineering Framework - Data Transformation
Note: For a comprehensive installation guide of all the AEF repositories together, please look here.
Analytics engineers lay the foundation for others to organize, transform, and document data using software engineering principles. They provide easy-to-use data platforms that empower data practitioners to independently build data pipelines in a standardized and scalable way, and to answer their own data-driven questions.
This opinionated data transformation management repository can be used independently to define, store, and deploy data transformation definitions. However, it is designed as a component within a comprehensive Analytics Engineering Framework comprising:
- Orchestration Framework: Maintained by Analytics Engineers to provide seamless, extensible orchestration and execution infrastructure.
- Data Model: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
- Data Orchestration: Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
- (This repository) Data Transformation: Directly used by end data practitioners to define, store, and deploy data transformations.

## Repository
This repository is a central location for storing and deploying the artifacts necessary for your data transformations, such as JDBC drivers and compiled JAR dependencies. However, its core function is to maintain the configuration files that define your transformations. These JSON, YAML, or similar parameter files are referenced as reusable steps in your data pipelines and are interpreted by the execution infrastructure within the Orchestration Framework.
```
├── artifacts
│   ├── dataproc
│   │   └── custom_dependency.jar
│   ├── jdbcjars
│   │   └── postgresql.jar
│   └── ...
└── jobs
    ├── dev
    │   ├── dataflow-flextemplate-job-executor
    │   │   ├── sample_jdbc_dataflow_ingestion.json
    │   │   └── ...
    │   ├── dataform-tag-executor
    │   │   ├── run_dataform_tag.json
    │   │   └── ...
    │   ├── dataproc-serverless-job-executor
    │   └── ...
    ├── prod
    └── ...
```
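As a sketch of what one of these parameter files might contain, here is a hypothetical job definition for the `dataform-tag-executor` folder. All field names and values below are illustrative assumptions; the actual schema is defined by the executors in your Orchestration Framework.

```json
{
  "job_name": "run_dataform_tag",
  "repository": "projects/<PROJECT>/locations/<REGION>/repositories/<DATAFORM_REPO>",
  "tags": ["daily_load"],
  "timeout_minutes": 60
}
```

A pipeline step in the Data Orchestration repository would reference this file by name, and the executor would interpret it at run time.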
## Terraform
- Define your Terraform variables:
| name | description | type | required | default |
|---|---|---|---|---|
| domain | Your organization or domain name: use the organization name for centralized data management, or the domain name if there is one repository per data domain in a data mesh environment. | string | true | - |
| project | The project where the GCS buckets for storing your artifacts and job definitions will be created. | string | true | - |
| region | The region where the GCS buckets for storing your artifacts and job definitions will be created. | string | true | - |
| environment | The environment folder to deploy (../jobs/$ENVIRONMENT/..). If not set, the contents of the dev environment folder are deployed. | string | false | dev |
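Instead of passing each variable on the command line, the values above could be supplied in a `terraform.tfvars` file, which Terraform picks up automatically. The values below are placeholders:

```hcl
# terraform.tfvars -- placeholder values, adjust to your setup
domain      = "sales"          # domain name in a data mesh, or your organization name
project     = "my-gcp-project" # project for the GCS buckets
region      = "us-central1"    # region for the GCS buckets
environment = "dev"            # must match a folder under jobs/
```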
- Run the Terraform plan / apply using the variables you defined:

```shell
terraform plan -var 'project=<PROJECT>' -var 'region=<REGION>' -var 'domain=<DOMAIN_NAME>' -var 'environment=dev'
```

## Usage
While this repository can be used simply to keep track of your dependencies and data transformation definitions, the provided Terraform code can also control deployment; alternatively, you can run that deployment as another step in your CI/CD pipeline.
- Place and commit your artifacts.
- Place and commit your job definition parameter files.
- Define your terraform variables and deploy (plan/apply).
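If you prefer CI/CD over running Terraform locally, the deploy step could be sketched, for example, as a Cloud Build configuration. The builder image, substitution names, and values below are assumptions for illustration, not part of this repository:

```yaml
# cloudbuild.yaml -- hypothetical CI/CD deployment of this repository's Terraform
steps:
  - name: hashicorp/terraform
    args: ["init"]
  - name: hashicorp/terraform
    args: ["apply", "-auto-approve",
           "-var", "project=${_PROJECT}",
           "-var", "region=${_REGION}",
           "-var", "domain=${_DOMAIN}",
           "-var", "environment=dev"]
substitutions:
  _PROJECT: my-gcp-project
  _REGION: us-central1
  _DOMAIN: sales
```

With such a trigger in place, committing a new artifact or job definition parameter file would redeploy the buckets' contents automatically.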
