Dimensional Modelling - SCD
Im sharing my learning from the Data Engineering Bootcamp, where we are learning about Data Engeering. Today we are learning about Dimensional modelling - Idempotency and SCDs.
I would like to extend my gratitude to Zach Wilson, the founder of DataExpert.io, for his invaluable guidance and the comprehensive Data Engineering Bootcamp. Connect with Zach Wilson on LinkedIn.
Thank you, Zach, for this amazing intense bootcamp on Data engineering!
Day-2: Dimensional Modelling - Slowly Changing Dimensions (Theory)
What is an Idempotent Pipeline?
Idempotency: An idempotent pipeline ensures that processing the same data multiple times (whether it's live production data or backfill data) yields the same result without any unintended side effects. This consistency is crucial for reliability and simplicity in data workflows.
Pipelines should produce the same results:
- The day they are run
- How many times they are run
- What hour they are run
Why is hard to troubleshoot non-idempotent pipelines?
- Silent Failure: The non-idempotent pipelines can produce different results every time they are run, it doesnt fail. This results in non reproduceable results.
- This causes inconsistency in the data to the downstream systems too.
What can make a pipline non-idempotent?
INSERT INTO
withoutTRUNCATE
:- If you dont
TRUNCATE
data before INSERT, it will keep on adding the data to the table, creating duplicates. - Thus your pipeline will not be idempotent, as it will keep on adding the data to the table.
- Highly recommended to use MERGE or INSERT OVERWRITE always.
- If you dont
- Using
start_date >
without a correspondingend_date <
:- If you dont have
end_date
in your SCD, you will not be able to track the changes in the data. - This can create a out of memory exceptions, when you backfill the data.
- Highly recommended to use end_date in your SCD always.
- If you dont have
- Not using a full set of partition sensors:
- Your pipeline may run with partial set of inputs.
- pipeline might run when there is no/partial data
- Not using
depends_on_past
for cumulative pipelines:- When you process the data in parallel ( lets say processing parallel days ), pipeline may end up processing yesterday's data, which hasnt been created yet.
The beauty of an idempotent pipeline lies in its ability to produce consistent results for both production and backfill data.
Example: Imagine a pipeline that:
- Reads data (e.g., transactions).
- Cleans and deduplicates it.
- Stores the cleaned data in a database.
- Production: As new transactions come in, they’re cleaned, deduplicated, and stored.
- Backfill: When historical transactions are added, they go through the same cleaning and deduplication process, ensuring no duplicate records and consistent results.
Because the pipeline is idempotent:
- Processing the same data again won’t alter the results or create issues.
- Whether it's production or backfill data, the behavior remains identical.
Few more problems that can make a pipeline non-idempotent
- Relying on the latest partition of a not properly modeled SCD table.
- Cumulative table design AMPLIFIES this bug.
- One exception relying on latest partition is when you have properly modeled SCD table and you are backfilling not in production.
The pains of not having an idempotent pipeline
- Backfilling causes inconsistencies between the old and restated data.
- Hard to troubleshoot bugs and fix the issues.
- Unit testing cannot replicate the production behavior.
- The pipeline doesnt fail, but the data is inconsistent, which is silent failures.
Slowly Changing Dimensions (SCD)
The Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, which would have time frame with them. For example,
- Customer Address
- Product Description
- Employee Details
- etc.
Not modelling the SCDs properly can impact the idempotency.
The slower the changes in a Slowly Changing Dimension (SCD), the more effectively it can be modeled.
How can you model dimensions that change over time?
- Singular snapshot: The dimension table only contains the most recent data.
- Be Careful: This pipelines are not idempotent.
- Daily partitioned snapshots: The dimension table contains a snapshot of the data for each day.
- Very simple way of implementing SCD.
- SCD Types 1,2,3
Types of Slowly Changing Dimensions
Here’s a Markdown table summarizing the content provided:
Type | Description | Key Features | Limitations |
---|---|---|---|
Type 0 | Static dimensions, not actually slowly changing. | - No temporal columns. - Table contains only identifier and value. | - No changes tracked or maintained. |
Type 1 | Overwrites old data with new data, only latest data is retained. | - Simple and efficient. - No history maintained. | - Cannot retrieve historical data. |
Type 2 | Gold standard of Slowly Changing Dimensions, maintains history using temporal columns. | - Uses start_date and end_date . - Current rows have end_date = NULL or 9999-12-31 . - Optionally uses is_current column. | - More than one row per dimension can complicate filtering and querying. |
Type 3 | Maintains only two versions of the data: original and current . | - One row per dimension. | - Loses history of changes between original and current . - Partially idempotent. |
Which types are idempotent?
Type | Idempotent | Reason |
---|---|---|
Type 0 | Yes | The values are unchanging. |
Type 2 | Yes | It is idempotent, but you need to be careful with how you use the start_date and end_date . |
Type 1 | No | Backfilling with this dataset gives the dimension as it is now, not as it was before. |
Type 3 | No | Backfilling makes it impossible to determine whether to pick "original" or "current." |
Loading SCD2
- Load the entire history in one query
- Inefficient but nimble
- Incremnetally load the data after the previoud SCD is generated
- Has the same
depends_on_past
. its efficient but cumbersome.
- Has the same
I hope you enjoyed reading this blog on Slowly Changing Dimensions. If you have any questions or feedback, feel free to reach out to me on