Change Data Capture ie CDC captures the data that is being inserted, updated, deleted from a table in database.
This is done by enabling CDC on database and then on the concerned table.
Here we are going to understand once CDC is enabled how can we maintain the pipeline so that it does not miss out on any record. For this purpose we will create a Maintenance Pipeline.
Once the CDC p is done the important thing that needs to be taken care of is of its Maintenance, by default the retention period of data to be inserted, deleted , modified to the target is of 3 days ie 4320 minutes. So, if data is not updated, inserted or deleted in the target within these 3 days then those records will go missing.
In order to avoid such cases of missing records, we create a pipeline for it. Wherein we capture such missing records and insert, update, or delete the records based on the operation.
Steps for creating a Maintenance Pipeline
Create a new pipeline in the Azure Data factory
Give a name to the pipeline as Maintenance Pipeline
Then add a Look Up activity in the canvas of the pipeline
Lookup Activity
GetMaxLsn : So this Look up Activity will fetch the Maximum LSN that is the record that was inserted, updated or deleted recently for that purpose we need to create a schema util
and a table name util.load_table
syntax to create util.load_table
Output of util.load_table
Settings in LookUp Activity
These are the configurations to be used in the lookup activity to get the MaxLSN from the util table.
Data set for look up activity
Then the next we have to create a copy activity on Azure Portal
Below is the setting for copy activity
In this copy activity we will fetch the records that we want to store in the Target table for that we need to write a query as given below
Source:
This is the query we need to write in order to fetch the latest record which has been inserted, deleted or updated
This query will only fetch data where the operation is 1 for deleted, 2 for inserted and 4 for update after since operation = 3 is record before modification so we do not want to create redundancy so we will not have records where operation =3
Sink
Here we will specify the target table where we want to store the data.
Next ,we need to create a Type Table in our Target Dataset
This is like a temporary table which will help us in storing the data in the target table.
Then we need to create a stored procedure in Target Database that we will perform Insert, Update, Delete based on the operations present in the __$operation column present in cdc.dbo_sales_CT
If __$operation are following values
__$operation = 1 then delete
__$operation = 2 then insert
__$operation = 3 then this signifies update before
__$operation = 4 this is update after
In case of update there are two records one is the original record another is the updated record so we need to use the record with __$operation = 4 and not use __$operation =3 record
Then in order to test if our maintenance pipeline is working or not we change the retention period of cleanup of cdc in on premises database
Here we have set it to 5 mins so all the updates , insert, deletes which are performed gets cleaned up so we run the maintenance pipeline to capture the records.
EXECUTE sys.sp_cdc_change_job
@job_type = N’cleanup’, @retention = 5;