Srilekha-1106/databricksProject
Implemented Azure Databricks for real-time data processing and governance using Unity Catalog, Spark Structured Streaming, Delta Lake features, Medallion Architecture, and end-to-end CI/CD pipelines. Focused on incremental loading, compute cluster management, maintaining data quality, and creating workflows.
Real-Time Data Processing with Unity Catalog and CI/CD in Azure Databricks
Azure Databricks Project Setup and Automation
Project Overview
This project involves setting up an Azure Databricks environment, integrating it with Azure storage accounts, automating data processing workflows, and implementing CI/CD pipelines to ensure seamless integration and deployment of data and notebooks.
Steps and Implementation
1. Azure Resource Group Creation
An Azure Resource Group was created to organize and manage all related resources.
2. Storage Accounts Setup
Two storage accounts were created to store and manage the project data.
3. Container Configuration
Within the projectstgaccount storage account, three containers were created; the landing container was designated for storing raw data.
4. Medallion Folder Structure
Three folders were created following the medallion architecture (typically bronze, silver, and gold layers) to organize data by refinement stage.
5. Azure Databricks Workspace Setup
An Azure Databricks workspace was established to facilitate data processing and analysis.
6. Databricks Access Connector
A Databricks access connector was created and assigned the Storage Blob Data Contributor role on the two storage accounts, ensuring secure, identity-based data access.
7. Databricks Metastore and Catalog
Within the Azure Databricks workspace, a Unity Catalog metastore was created and attached to the workspace. A development (dev) catalog was then set up.
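The catalog setup in step 7 can be sketched in Databricks SQL. The dev catalog name comes from this write-up; the schema names are an assumption based on the medallion layers described above:

```sql
-- Create the development catalog and one schema per medallion layer
-- (schema names are assumed, not confirmed by the project).
CREATE CATALOG IF NOT EXISTS dev;
USE CATALOG dev;
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;
```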
8. Storage Credentials and External Locations
Storage credentials and external locations were configured to manage data access and storage.
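A hedged sketch of the external-location side of step 8. The storage credential wrapping the access connector's managed identity is typically created through Catalog Explorer or the APIs; the SQL below then binds an external location to it. All names (project_stg_cred, landing_loc) are hypothetical:

```sql
-- Bind the landing container to Unity Catalog via an assumed credential name.
CREATE EXTERNAL LOCATION IF NOT EXISTS landing_loc
  URL 'abfss://landing@projectstgaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL project_stg_cred);

-- Grant access to an assumed group of engineers working in the dev catalog.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION landing_loc TO `data_engineers`;
```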
9. Cluster Creation
A Databricks cluster was created to execute data processing tasks.
10. File Verification
All provided files were run manually to verify that paths and variable names were correctly defined, and all schemas were created in the dev catalog.
11. Autoscaling and Workflow Creation
Autoscaling was enabled, and workflows were created to automate the execution of data processing tasks.
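Autoscaling for a workflow's job cluster is configured in the cluster definition. A sketch of how this might appear in a Databricks Jobs API payload; the Spark version, node type, and worker counts are illustrative, not taken from the project:

```json
{
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 1,
      "max_workers": 4
    }
  }
}
```

With autoscale set, Databricks adds or removes workers between the two bounds based on load, instead of using a fixed num_workers.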
12. Dbutils Widgets
Dbutils widgets were created with keys and default parameters so that notebooks can accept dynamic, per-run configuration.
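A minimal sketch of the widget pattern. The widget names and values here are assumptions for illustration; `dbutils` is provided automatically inside a Databricks notebook, so outside Databricks we stub it just to make the logic below exercisable:

```python
# In Databricks, `dbutils` is injected into the notebook; the stub below
# only exists so the sketch runs outside a workspace.
try:
    dbutils  # noqa: B018 - provided by the Databricks runtime
except NameError:
    class _Widgets:
        def __init__(self):
            self._vals = {}
        def text(self, name, default, label=None):
            # Register a text widget with a default value.
            self._vals.setdefault(name, default)
        def get(self, name):
            return self._vals[name]
    class _DbUtilsStub:
        widgets = _Widgets()
    dbutils = _DbUtilsStub()

# Hypothetical widget keys for environment-aware paths.
dbutils.widgets.text("environment", "dev", "Environment")
dbutils.widgets.text("storage_account", "projectstgaccount", "Storage account")

env = dbutils.widgets.get("environment")
account = dbutils.widgets.get("storage_account")
base_path = f"abfss://landing@{account}.dfs.core.windows.net/{env}"
```

Passing different widget values from a workflow task lets the same notebook run against dev, UAT, or production paths without code changes.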
13. Trigger Creation and Incremental Data Processing
Triggers were created to automate task execution and cloned to manage the different data streams, such as raw roads and raw traffic.
New files arriving in Azure Data Lake Storage (ADLS) fire the triggers, so each run processes only the newly added data and the jobs complete incrementally.
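One common way to implement this incremental pattern inside the triggered notebooks is Databricks Auto Loader. The sketch below assumes CSV input and hypothetical paths for the raw traffic stream; `spark` only exists inside a Databricks session, so the actual read/write is shown in comments and the options dict is the reusable part:

```python
# Hypothetical landing path for the raw traffic stream.
landing_path = "abfss://landing@projectstgaccount.dfs.core.windows.net/raw_traffic/"

# Auto Loader options: format and schema location are assumptions.
autoloader_options = {
    "cloudFiles.format": "csv",                       # raw files assumed to be CSV
    "cloudFiles.schemaLocation": landing_path + "_schema/",
    "cloudFiles.includeExistingFiles": "false",       # only newly arrived files
}

# Inside a Databricks notebook, the incremental read/write would look like:
# df = (spark.readStream.format("cloudFiles")
#       .options(**autoloader_options)
#       .load(landing_path))
# (df.writeStream
#    .option("checkpointLocation", landing_path + "_checkpoint/")
#    .trigger(availableNow=True)
#    .toTable("dev.bronze.raw_traffic"))
```

The checkpoint location is what makes the processing incremental: each triggered run resumes from the last recorded file offset rather than rescanning the container.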
14. Data Reporting
Processed data was integrated with Power BI for comprehensive reporting and analysis.
15. CI/CD Pipeline Setup
A CI/CD pipeline was established to automate deployment. On every push to the main branch, all folders are copied to the live folder; interacting with the live folder requires admin access. This keeps the live folder up to date and ensures that all notebooks are deployed consistently to the different environments.
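The push-to-main deployment could be expressed as an Azure DevOps pipeline along these lines. The workspace paths, variable names, and service credentials are assumptions, and the legacy Databricks CLI is used purely as one plausible deployment mechanism:

```yaml
# Sketch: mirror the repo's notebooks into the live folder on pushes to main.
trigger:
  branches:
    include:
      - main

pool:
  vmImage: ubuntu-latest

steps:
  - checkout: self
  - script: |
      pip install databricks-cli
      databricks workspace import_dir --overwrite Notebooks /live/Notebooks
    displayName: Deploy notebooks to the live folder
    env:
      DATABRICKS_HOST: $(databricksHost)   # assumed pipeline variables
      DATABRICKS_TOKEN: $(databricksToken)
```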
Conclusion
This project demonstrates the efficient setup and automation of an Azure Databricks environment. It includes secure data integration, automated workflows, and comprehensive reporting, enhanced by a robust CI/CD pipeline to ensure consistent and up-to-date data deployment across different environments. This approach facilitates seamless integration, deployment, and data accessibility while maintaining data integrity and security.
At the end of the project, the workspace appears as shown in the image: the main branch has all changes pulled, and all files are organized in the Notebooks folder, keeping every project component accessible and well-structured. A pipeline was also created to promote data from the development catalog to the UAT catalog, gated by admin approval. The Azure DevOps interface illustrates the deployment stages, ensuring a controlled and authorized transition of data between these environments.