DBT’s hidden gem- dbt snapshots

Anks Sarraf
4 min readNov 10, 2022

--

  • by Ankit Sarraf
DBT Logo

Intended Audience

  1. You are an aspiring Data Engineering Professional who is interested in learning more about dbt snasphot
  2. You are a seasoned Data Engineer who’d like to curate the article and provide feedback
  3. You are simply curious

Introduction

With a significant paradigm shift in the Data Engineering field from ETL to ELT*, DBT has gained quick adoption over time as an open-source tool for effective Data Warehousing and Data Analysis by providing the T of the ELT equation.

* E Represents Extract, L Represents Load, and T Represents Transform (Read more here)

Because of cheaper storage, DBT has been instrumental in providing technical prowess to anyone who knows basic SQL and coding to effectively leverage the data. That is one big problem that DBT is solving.

A big real-time problem that organization face is bridging gaps in the efficient analysis of historical data to gain insights into trends. DBT helps in that by DBT Snapshots.

The below statement clearly sums it up:

Analysts often need to “look back in time” at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable table over time. [1]

DBT Snapshots cater by implementing type-2 Slowly Changing Dimensions.

Implementing with Examples

Enough of theory! Let’s uncover the power of dbt snapshotwith an example.

Pre-requisite: You already have DBT configured with BigQuery (or any other Data Warehousing tool) on your system (cloud or local).

Scenario Description:

Our organization wants to track the speed with which Customer Tickets are being served. We have the below table (call it Tickets):

Tickets table (Version 1.0)

Query to create this table:

select 1 AS ticket_id, 'new' AS ticket_status, '2022-11-01' AS updated_date
UNION ALL
select 2 AS ticket_id, 'new' AS ticket_status, '2022-11-02' AS updated_date
UNION ALL
select 3 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-03' AS updated_date
UNION ALL
select 4 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-04' AS updated_date

Assumptions:

  • ticket_idis a true Primary Key for the Ticketstable
  • my_snapshots is the schema where we want our tables and snapshots to reside
  • updated_date can be used as a reliable column to identify when the changes to the row took place
  • We will only deploy the check strategy for snapshots

Snapshot with checkstrategy (snapshot_Tickets_check.sql)

{% snapshot snapshot_Tickets_check %}

{{
config(
target_schema='my_snapshots',
strategy='check',
unique_key='ticket_id',
check_cols=['ticket_status'],
invalidate_hard_deletes=True
)
}}

SELECT * FROM {{ ref("Tickets") }}

{% endsnapshot %}

To create a snapshot:

dbt snapshot --models snapshot_Tickets_check
Snapshot Created with check strategy

Making changes to the original Tickets table

  • Deleted the Ticket with ticket_id = 4
  • Updated the status with ticket_id = 1 from new to in-progress
  • Updated the status with ticket_id = 2 from new to done
  • Added new ticket with ticket_id = 5 with status testing

Resulting table:

Tickets Table (Version 2.0)

Query to create this table using DBT:

select 1 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-01' AS updated_date
UNION ALL
select 2 AS ticket_id, 'done' AS ticket_status, '2022-11-04' AS updated_date
UNION ALL
select 3 AS ticket_id, 'in-progress' AS ticket_status, '2022-11-03' AS updated_date
UNION ALL
select 5 AS ticket_id, 'testing' AS ticket_status, '2022-11-05' AS updated_date

We run the snapshot again to observe the difference:

snapshot_Tickets_check table:

Updated Snapshot using check strategy

Now the final step:

We need to determine the records that are

  • deleted
  • updated
  • inserted
  • unchanged
  • stale

Pseudo Code to determine the different types of records in the Snapshot table:

IF the dbt_valid_to isn’t NULL and the record creation (dbt_valid_from ) is before the table creation(ref_time) and the record invalidation (dbt_valid_to) happened after the table creation (ref_time) and it is the latest record for the given Primary key (rank_1) THEN the record was deleted

ELSE if dbt_valid_to IS set to NULL and The record was created (dbt_valid_from) after table creation (ref_time):

  • IF this is the only record (based on dbt_valid_from) for the given primary key THEN it is a newly inserted record
  • ELSE, it is an updated record

ELSE If the dbt_valid_to isn’t NULL and the record invalidation (dbt_valid_to) happened before the table was created (ref_time) THEN it is a stale record

ELSE mark the record as unchanged

Implementation Code

WITHsnaps AS (
SELECT s_rec.*, info.creation_time AS ref_time,
ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY dbt_valid_from DESC) rank_1
FROM `my_snapshots`.`snapshot_Tickets_check` s_rec
CROSS JOIN `my_snapshots`.INFORMATION_SCHEMA.TABLES info
WHERE table_name = 'Tickets'
),
coded AS (
SELECT ticket_id, ticket_status, updated_date,
CASE
WHEN dbt_valid_to IS NOT NULL AND dbt_valid_from < ref_time AND dbt_valid_to > ref_time AND rank_1 = 1 THEN 'deleted'
WHEN dbt_valid_to IS NULL AND dbt_valid_from > ref_time AND rank_1 = 1 THEN
IF(LEAD(CONCAT(CAST(ticket_id AS STRING))) OVER (PARTITION BY ticket_id ORDER BY dbt_valid_from DESC) IS NULL, 'inserted', 'updated')
WHEN dbt_valid_to IS NULL AND dbt_valid_from < ref_time THEN 'unchanged'
ELSE 'stale'
END AS flag
FROM snaps
)
SELECT * FROM coded
ORDER BY ticket_id, updated_date DESC

this would result in:

The outcome of the above query gives us the Flags

Conclusion

  • In this documentation, we understood what dbt snapshot
  • How can we implement it
  • And, how we can leverage it using simple SQL Queries to classify modifications

References

Copyright © 2022 Ankit Sarraf — All rights reserved

Opinions expressed are personal and do not necessarily represent the views or opinions of any organization.

--

--

Anks Sarraf
Anks Sarraf

No responses yet