JonnyDaenen/dissi-bq

Distributed schema inference and data loader for BigQuery written in Apache Beam

apache-beam bigquery google-cloud-dataflow

DISSI-BQ

DISSY-BQ is a DIStributed Schema Inference tool for BigQuery JSON data loading,
written in Apache Beam. It is trivial to execute on Cloud Dataflow, in combination
with data that resides in Cloud Storage.

The pipeline loads JSON data from a storage location (e.g. GCS)
into BigQuery according to a common schema. The schema is auto-inferred
based on the data, and is used to keep the target table up to date.

See doc folder for a diagram and an Dataflow component overview.

Prerequisites

Bash terminal (for example scripts)
Python 3.8
gcloud command installed locally
Google Cloud project of which you have sufficient rights + Dataflow API enabled

Quickstart

Create your own venv:

bash scripts/001_create_venv.sh

Run the tests:

bash scripts/002_run_tests.sh

Create resources on your current active project:

bash scripts/003_provision_resources.sh

Note that currently, this scripts generates a service account
with owner permissions. Please adjust these to suit your needs.
We recommend employing a least privilege setup.

Test the pipeline locally

bash scripts/004_run_pipeline_locally.sh

In order to test schema updates, you can uncomment the other
pipelines in script 004 to test the base and modified sample datasets.

Run the pipeline on Dataflow

bash scripts/005_run_pipeline_gcp.sh

Finally, scripts 008, 009 and 010 allow you to test with some larger datasets,
which can make the Dataflow pipeline scale up.

Assumptions

Pipeline can run in batch mode
Dataflow api is enabled
Dataset exists (not created by pipeline so we have freedom to e.g. terraform and assign IAM policies)
Nested fields don't need to be nullable as BQ can handle that when parent is nullable
Null fields can be ignored

Core Features

Input pattern support
Basic Primitive JSON types
TIMESTAMP data detector
INTEGER to FLOAT coercion
REQUIRED to NULLABLE conversion
Table schema updates
Optional data loading in the same job
Ignore null fields (until another doc has a value)

Known limitations

JSON null fields and empty lists are only supported if they appear as non-null in at least one doc
nested lists are not supported

Future work

Some interesting ideas for the future:

extended primitive types (e.g., geolocation)
schema depth/width validation
validation of values according to schema in step 2
lineage tracking
handle invalid inputs
list with null
create SA with least privilege
override/seed schema
forbidden characters in field names
refactor into class structure
make data detectors modular
auto-casting during data load
add counters in dataflow job (e.g. for validation)
catch BQ ingestion errors
error tolerance and dead-lettering
BigQuery integration tests

The source code contains some pointers to future work by using the FUTURE tag in the comments.

Contributors

JO

Created May 16, 2021

Updated May 16, 2021

JonnyDaenen/dissi-bq | GitHunt