DBT’s hidden gem- dbt snapshots
- by Ankit Sarraf
Intended Audience
- You are an aspiring Data Engineering Professional who is interested in learning more about
dbt snasphot
- You are a seasoned Data Engineer who’d like to curate the article and provide feedback
- You are simply curious
Introduction
With a significant paradigm shift in the Data Engineering field from ETL to ELT*, DBT has gained quick adoption over time as an open-source tool for effective Data Warehousing and Data Analysis by providing the T
of the ELT equation.
* E Represents Extract, L Represents Load, and T Represents Transform (Read more here)
Because of cheaper storage, DBT has been instrumental in providing technical prowess to anyone who knows basic SQL and coding to effectively leverage the data. That is one big problem that DBT is solving.
A big real-time problem that organization face is bridging gaps in the efficient analysis of historical data to gain insights into trends. DBT helps in that by DBT Snapshots.
The below statement clearly sums it up:
Analysts often need to “look back in time” at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable table over time. [1]
DBT Snapshots cater by implementing type-2 Slowly Changing Dimensions.
Implementing with Examples
Enough of theory! Let’s uncover the power of dbt snapshot
with an example.
Pre-requisite: You already have DBT configured with BigQuery (or any other Data Warehousing tool) on your system (cloud or local).
Scenario Description:
Our organization wants to track the speed with which Customer Tickets are being served. We have the below table (call it Tickets
):
Query to create this table:
select 1 AS ticket_id, 'new' AS ticket_status, '2022-11-01' AS updated_date
UNION ALL
select 2 AS ticket_id, 'new' AS ticket_status, '2022-11-02' AS updated_date
UNION ALL
select 3 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-03' AS updated_date
UNION ALL
select 4 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-04' AS updated_date
Assumptions:
ticket_id
is a true Primary Key for theTickets
tablemy_snapshots
is the schema where we want our tables and snapshots to resideupdated_date
can be used as a reliable column to identify when the changes to the row took place- We will only deploy the
check
strategy for snapshots
Snapshot with check
strategy (snapshot_Tickets_check.sql)
{% snapshot snapshot_Tickets_check %}
{{
config(
target_schema='my_snapshots',
strategy='check',
unique_key='ticket_id',
check_cols=['ticket_status'],
invalidate_hard_deletes=True
)
}}
SELECT * FROM {{ ref("Tickets") }}
{% endsnapshot %}
To create a snapshot:
dbt snapshot --models snapshot_Tickets_check
Making changes to the original Tickets
table
- Deleted the Ticket with ticket_id = 4
- Updated the status with ticket_id = 1 from
new
toin-progress
- Updated the status with ticket_id = 2 from
new
todone
- Added new ticket with ticket_id = 5 with status
testing
Resulting table:
Query to create this table using DBT:
select 1 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-01' AS updated_date
UNION ALL
select 2 AS ticket_id, 'done' AS ticket_status, '2022-11-04' AS updated_date
UNION ALL
select 3 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-03' AS updated_date
UNION ALL
select 5 AS ticket_id, 'testing' AS ticket_status, '2022-11-05' AS updated_date
We run the snapshot again to observe the difference:
snapshot_Tickets_check
table:
Now the final step:
We need to determine the records that are
- deleted
- updated
- inserted
- unchanged
- stale
Pseudo Code to determine the different types of records in the Snapshot table:
IF the dbt_valid_to
isn’t NULL
and the record creation (dbt_valid_from
) is before the table creation(ref_time
) and the record invalidation (dbt_valid_to
) happened after the table creation (ref_time
) and it is the latest record for the given Primary key (rank_1
) THEN the record was deleted
ELSE if dbt_valid_to
IS set to NULL
and The record was created (dbt_valid_from
) after table creation (ref_time
):
- IF this is the only record (based on
dbt_valid_from
) for the given primary key THEN it is a newly inserted record - ELSE, it is an
updated
record
ELSE If the dbt_valid_to
isn’t NULL and the record invalidation (dbt_valid_to
) happened before the table was created (ref_time
) THEN it is a stale
record
ELSE mark the record as unchanged
Implementation Code
WITHsnaps AS (
SELECT s_rec.*, info.creation_time AS ref_time,
ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY dbt_valid_from DESC) rank_1
FROM `my_snapshots`.`snapshot_Tickets_check` s_rec
CROSS JOIN `my_snapshots`.INFORMATION_SCHEMA.TABLES info
WHERE table_name = 'Tickets'
),coded AS (
SELECT ticket_id, ticket_status, updated_date,
CASE
WHEN dbt_valid_to IS NOT NULL AND dbt_valid_from < ref_time AND dbt_valid_to > ref_time AND rank_1 = 1 THEN 'deleted'
WHEN dbt_valid_to IS NULL AND dbt_valid_from > ref_time AND rank_1 = 1 THEN
IF(LEAD(CONCAT(CAST(ticket_id AS STRING))) OVER (PARTITION BY ticket_id ORDER BY dbt_valid_from DESC) IS NULL, 'inserted', 'updated')
WHEN dbt_valid_to IS NULL AND dbt_valid_from < ref_time THEN 'unchanged'
ELSE 'stale'
END AS flag
FROM snaps
)SELECT * FROM coded
ORDER BY ticket_id, updated_date DESC
this would result in:
Conclusion
- In this documentation, we understood what
dbt snapshot
- How can we implement it
- And, how we can leverage it using simple SQL Queries to classify modifications
References
Copyright © 2022 Ankit Sarraf — All rights reserved
Opinions expressed are personal and do not necessarily represent the views or opinions of any organization.