Dive deep into CDC with Azure Data Factory

Change Data Capture (CDC) in SQL Server tracks and records data changes within a database. It provides a reliable and efficient way to identify changes to tables, so you can analyze how data evolves over time. Paired with Azure Data Factory, CDC supports a systematic, automated approach to capturing and moving those changes, which helps with data management, auditing, and analysis.

Most common use cases: CDC with Azure Data Factory

Common scenarios where CDC with Azure Data Factory has proven useful include:

  • Audit trail and analytics: Track data changes for audit trails and perform analytical assessments of change data.
  • Downstream propagation: Propagate changes efficiently to downstream subscribers so their data stays synchronized.
  • ETL operations: Run extract, transform, load (ETL) operations that move data changes from an online transaction processing (OLTP) system to a data lake or data warehouse. Tools like Azure Data Factory can be used for this purpose.
  • Event-driven programming: Trigger immediate responses to data changes, improving real-time system interactions.

Usage: some queries

Here are the SQL queries and commands to manage Change Data Capture (CDC) in SQL Server:

  • Verify that CDC is enabled for the database:

SELECT name, is_cdc_enabled FROM sys.databases;

  • Check which tables have CDC enabled:

SELECT name, is_tracked_by_cdc FROM sys.tables;

  • To enable CDC, first enable it on the database:

EXEC sys.sp_cdc_enable_db;

  • Then enable CDC on each table you want to track:

EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'PslMaterials',
        @role_name     = NULL;

  • To disable CDC on a database:

EXEC sys.sp_cdc_disable_db;

  • To disable CDC on a table:

EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'MyTable',
    @capture_instance = N'dbo_MyTable';
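
After enabling a table, it can help to confirm what was actually configured. The following is a minimal sketch that uses the built-in sys.sp_cdc_help_change_data_capture procedure; the schema and table names simply reuse the enable example above.

```sql
-- List every capture instance configured in the current database.
EXEC sys.sp_cdc_help_change_data_capture;

-- Or restrict the output to a single source table
-- (dbo.PslMaterials is the table from the enable example).
EXEC sys.sp_cdc_help_change_data_capture
    @source_schema = N'dbo',
    @source_name   = N'PslMaterials';
```

The result includes the capture instance name, which is the value expected by @capture_instance when disabling a table.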

When CDC is enabled for a database, a dedicated schema named cdc is created, along with several tables that manage and store change data. Note that disabling CDC for a table drops its change table, and disabling CDC for the database drops all of the CDC objects, so the change history is lost. To preserve that history, copy the changes to another table or file first.
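
One way to keep the history is to copy the change table into a regular table before disabling CDC. This is only a sketch: dbo.MyTable_ChangeArchive is a hypothetical name chosen for illustration, and the source names reuse the disable example above.

```sql
-- Copy all captured change rows into a regular table.
-- dbo.MyTable_ChangeArchive is a hypothetical name; SELECT ... INTO
-- creates it, so it must not already exist.
SELECT *
INTO dbo.MyTable_ChangeArchive
FROM cdc.dbo_MyTable_CT;

-- Only after the copy has succeeded, disable CDC on the table.
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'MyTable',
    @capture_instance = N'dbo_MyTable';
```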

CDC schema

Key tables within the CDC schema include:

  • cdc.change_tables: Lists the CDC-enabled tables, one row per change table.
  • cdc.captured_columns: Lists the captured columns for each table.
  • cdc.ddl_history: Records Data Definition Language (DDL) statements that modify source tables. Such changes are not applied to existing change tables automatically; to capture a new column, for example, a new capture instance must be created.
  • cdc.index_columns: Lists the index columns that uniquely identify rows in each source table (by default, the primary key).
  • cdc.lsn_time_mapping: Maps log sequence numbers (LSNs) to transaction commit times.
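
These metadata tables can be queried directly. The sketch below lists the tracked tables and shows how the LSN-to-time mapping is typically used through the built-in helper functions; the date literal is an arbitrary example value.

```sql
-- Which source tables are tracked, and under which capture instance?
SELECT capture_instance,
       OBJECT_SCHEMA_NAME(source_object_id) AS source_schema,
       OBJECT_NAME(source_object_id)        AS source_table,
       create_date
FROM cdc.change_tables;

-- Translate a point in time into the first LSN at or after it.
-- The timestamp is an arbitrary example value.
DECLARE @since DATETIME = '2024-01-01T00:00:00';
SELECT sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @since) AS from_lsn;

-- Translate the most recent captured LSN back into a commit time.
SELECT sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn()) AS last_commit_time;
```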

Additionally, when a table is CDC-enabled, two more tables come into play:

  • cdc.cdc_jobs: Stores the configuration of the CDC capture and cleanup jobs. (On a SQL Server instance, this table is msdb.dbo.cdc_jobs.)
  • cdc.SchemaName_TableName_CT: The change table for a specific schema and table, for example, cdc.dbo_PslVendors_CT.

Each change table mirrors the captured columns of the source table and adds the metadata columns needed for CDC:

  • __$start_lsn: The commit log sequence number (LSN) of the transaction that made the change. It preserves the order in which changes were committed.
  • __$seqval: A binary value that orders the row changes within a single transaction.
  • __$operation: A number indicating the type of change: 1 = delete, 2 = insert, 3 = update (column values before the update), 4 = update (column values after the update).
  • __$update_mask: A bit mask with one bit per captured column, showing which columns changed in an update.
  • <captured source table columns>: The remaining columns are the source columns selected when the capture instance was created. If no columns are specified, all columns from the source table are included.
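
To make these columns concrete, here is a sketch that reads a change table, decodes the operation code, and tests one bit of the update mask. It assumes the dbo_PslVendors capture instance from the example above; the Price column is hypothetical and stands in for any captured column.

```sql
SELECT
    __$start_lsn,
    __$seqval,
    CASE __$operation
        WHEN 1 THEN 'delete'
        WHEN 2 THEN 'insert'
        WHEN 3 THEN 'update (before)'
        WHEN 4 THEN 'update (after)'
    END AS operation_name,
    -- 1 when the bit for the hypothetical column Price is set in the mask.
    sys.fn_cdc_is_bit_set(
        sys.fn_cdc_get_column_ordinal(N'dbo_PslVendors', N'Price'),
        __$update_mask
    ) AS price_changed
FROM cdc.dbo_PslVendors_CT
ORDER BY __$start_lsn, __$seqval;
```

The price_changed flag is only meaningful on update rows, because insert and delete rows have every bit of the mask set. For production reads, the generated query functions are generally preferred over selecting from the change table directly.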

CDC implementation details

  • Each CDC-enabled source table has its own dedicated change table.
  • Ensure sufficient database space to accommodate the additional tables and avoid space shortages.
  • The SQL Server Agent capture job reads changes from the transaction log and writes them to the appropriate change tables.
  • Cleanup jobs manage change tables, applying retention rules to remove stale data.
  • Query functions provide a means to access and use change data from CDC change tables.
  • In Azure SQL Database, where SQL Server Agent is not available, the CDC scheduler takes over the capture and cleanup work.
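
The query functions and the cleanup retention mentioned above can be sketched as follows. The example assumes the default capture instance name dbo_PslMaterials, which is what the earlier enable command produces when @capture_instance is not specified; the 7-day retention is an arbitrary illustration.

```sql
-- Read every change captured so far for one capture instance.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_PslMaterials');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_PslMaterials(@from_lsn, @to_lsn, N'all');

-- Show the current capture and cleanup job settings.
EXEC sys.sp_cdc_help_jobs;

-- Keep change rows for 7 days (10080 minutes) instead of the
-- default of 4320 minutes (3 days). The value is only an example.
EXEC sys.sp_cdc_change_job
    @job_type  = N'cleanup',
    @retention = 10080;
```

In a real incremental load, @from_lsn would normally be the LSN where the previous load stopped rather than the minimum LSN of the capture instance.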

Performance Considerations: Factors Affecting Performance

  • Number of tables with CDC enabled: The more tables enabled for CDC, the higher the processing cost. Weigh necessity against performance.
  • Frequency of changes in monitored tables: Tables that change frequently produce more change data, which increases the capture workload and can affect performance.
  • Space availability in the source database: CDC stores the changes it records in the source database. Provide enough space to accommodate the change tables.
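
Two of these factors can be watched with built-in objects. The sketch below checks the space used by one change table and looks at recent capture scans; the change table name assumes the dbo_PslMaterials capture instance from the earlier example.

```sql
-- Space used by a single change table.
EXEC sys.sp_spaceused N'cdc.dbo_PslMaterials_CT';

-- Recent log scan sessions of the capture process, including how many
-- transactions and commands were processed and the latency in seconds.
SELECT TOP (10)
       session_id,
       start_time,
       end_time,
       tran_count,
       command_count,
       latency
FROM sys.dm_cdc_log_scan_sessions
ORDER BY start_time DESC;
```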

CDC with Azure Data Factory

In the Azure cloud, Data Factory is a versatile tool, and it now includes a Change Data Capture (CDC) resource that simplifies the process. The steps below show how to use it:

Steps to create a CDC resource in Data Factory

1. Create a CDC resource

CDC runs as a stand-alone resource. Unlike a data flow, it does not need a pipeline to execute it.

2. Assign a name to the resource (must be alphanumeric)

Select the source type, which ranges from databases to files. For Azure SQL Database, select Tables. CDC-enabled tables are detected automatically; for other tables, specify the column that identifies modified rows (usually a last-modified date column).
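
For tables without native CDC, the modified-date approach amounts to a watermark query. The sketch below only illustrates the idea and is not the statement Data Factory itself runs; dbo.MyTable, ModifiedDate, and the watermark value are all hypothetical.

```sql
-- Hypothetical watermark: the highest ModifiedDate seen by the previous run.
DECLARE @last_watermark DATETIME2 = '2024-01-01T00:00:00';

-- Rows changed since the previous run (dbo.MyTable and ModifiedDate
-- are illustrative names, not taken from a real schema).
SELECT *
FROM dbo.MyTable
WHERE ModifiedDate > @last_watermark
ORDER BY ModifiedDate;
```

Unlike native CDC, this approach cannot see deleted rows and depends on every writer keeping the date column up to date.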

3. Select a destination

The destination types are the same as the source types: databases and storage locations for change files.

4. Define the destination

With the Auto map option selected, the destination table is created automatically. Select the key columns for the destination table.

5. Choose the latency from the available options

The options are real time, 15 minutes, 30 minutes, 1 hour, and 2 hours. Start the process and the agent will read data at the chosen interval.

6. Monitor

In the monitoring view, green dots mark each CDC run (every 15 minutes in this example), and blue dots show the changes captured during each run.

Conclusion

CDC is a powerful tool for tracking and managing changes in databases. With the CDC resource in Azure Data Factory, that capability is available with little setup. Together they offer an effective and affordable way to implement Change Data Capture.
