Azure Data Factory, Azure Databricks, and Azure SQL Server: A Focus on Change Data Capture and Medallion Architecture

Modern data architecture is continuously evolving to meet the need for more effective data management strategies. One such strategy is the Medallion Architecture, a robust framework designed to efficiently handle the complexities of today’s data environment.

Key to this strategy are tools such as Azure Data Factory, Azure Databricks, and Azure SQL Server, which offer significant benefits when aligned with the Medallion Architecture.

Let’s have a look at using Azure SQL as a data source and discussing the use of Change Data Capture (CDC) methodology.

Understanding Change Data Capture (CDC)

CDC is a design pattern that efficiently identifies and captures changes in data, allowing downstream systems to process only the updated data. This method is particularly useful when dealing with large and active databases where data is continuously being updated.

The key advantages of CDC are:

Efficiency: By capturing only the data changes, CDC significantly reduces the volume of data that needs to be processed.
Timeliness: CDC enables near real-time data updates, ensuring that your data analysis is based on the most recent data.
Auditing: By tracking the history of data changes, CDC provides an effective data auditing mechanism.

Now, let’s imagine you are operating a large online store with data stored in an Azure SQL database. The database is continuously updated with new orders, changes to existing orders, and cancellations. By implementing CDC, you can efficiently track these changes and provide real-time analytics for your business.

Raw Data Layer

The raw data layer (Bronze) is the initial landing zone for data, where it remains in its original, unaltered state. It serves as a reliable audit trail and a source for data recovery.

Here, Azure Data Factory (ADF), with its built-in CDC capability, comes into play.

ADF can efficiently detect and copy only the changed data from the source Azure SQL database to the raw data layer, preserving the rawness of the data.

Unified Data Layer

The unified data layer (Silver) is where the integrated, cleansed, and conformed version of the raw data resides, making it suitable for analysis and reporting.

Azure Databricks is employed in this layer to process the CDC data. It reads the captured changes from the raw data layer, transforming and conforming the data.

Despite dealing with complex CDC data, Databricks, with its ability to handle large-scale processing tasks using Spark, ensures optimal performance.

Consumption Layer

The consumption layer (Gold) is where data is made available for users and applications. It includes data marts, cubes, and APIs, optimized for reporting and analytics.

Once the CDC data is cleansed and conformed in the unified data layer, it is loaded back into Azure SQL Server. Here, the updated data is made accessible to end users and applications for analytics and reporting.

Approaching Your First Project

As you embark on your first project using Azure tools and CDC within a Medallion Architecture, there are a few key considerations:

Understanding CDC: Familiarize yourself with how CDC works in Azure SQL and the structure of CDC data.
Planning Transformations: Decide how to handle the different types of changes (inserts, updates, and deletes) in your Databricks transformations.
Managing Latency: Keep in mind the latency requirements of your downstream applications when planning your data pipeline.

Learn More

Implementing CDC with Azure services in a Medallion Architecture offers an efficient way to handle large and active Azure SQL databases. However, challenges may arise. For instance, handling schema changes with CDC can be tricky. In such cases, using ADF’s schema drift capability can be a solution.