1 of 33

Observability within dbt

Kevin Chan, Data Engineer

Jonathan Talmi, Senior Data Platform Manager

December 7, 2021

2 of 33

To provide access for everyone to experience more of what life has to offer, regardless of income or circumstance.

3 of 33


We are building a new way to shop that maximizes savings, benefits and rewards on mobile

We have started with hotel bookings and consumer goods and have driven nearly $1B in sales.

4 of 33


Who are we?

Kevin Chan, Data Engineer

  • University of Waterloo, Computer Science & Statistics
  • Ex: Hootsuite, Splunk

Jonathan Talmi, Data Platform Lead

  • Researcher → Analyst → Data Eng → Analytics Eng
  • Ex: Shopify, Instacart, Bank of Canada

5 of 33

Agenda

  • Intros
  • Tech and tooling
  • Core benefits of the system
  • Core benefits of metadata tracking

6 of 33

Data Observability

  • Freshness
  • Metrics
  • Schema
  • Lineage
  • Data quality
  • Metadata
  • Profiling
  • Logs

7 of 33

Why Observability Matters

We had limited observability into our dbt deployment!

  • Why isn't my model up to date?
  • Why is my model taking so long to run?
  • Is my data accurate?
  • How do I speed up my dbt pipelines?
  • How should I materialize and provision my model?

8 of 33

Observability in dbt

  • Metrics: data quality metrics can be written as dbt tests and alerted on
  • Metadata: dbt artifacts contain metadata about executions and source freshness
  • Lineage: built-in dbt resource lineage and external dependencies using sources and exposures
  • Logs: dbt logs are surfaced in execution pipelines, but rich query logs live in the data warehouse

9 of 33


Jobs to be Done

Our goal was to build a system that could perform several jobs:

  • Send alerts to dbt model owners and stakeholders based on custom criteria
  • Identify pipeline bottlenecks and model-level optimizations
  • Reliably collect metadata in close to real time, regardless of pipeline success

The system had to be:

  • Lightweight: deploy easily using the existing stack
  • Flexible: enable SQL-based exploration of artifacts and metadata
  • Exhaustive: support all dbt resources, artifacts, and relevant job types

10 of 33

Data Sources

dbt artifacts store valuable information about data quality, performance, executions, and lineage. Combining dbt artifacts and the query history provides deeper insights about model-level performance.

  • Manifest: full configuration of a dbt project
  • Run Results: detailed node- and pipeline-level execution data
  • Query history: rich query performance metrics at the model level
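
Once the raw artifacts land in the warehouse, they can be explored directly with SQL. A minimal sketch, assuming a hypothetical landing table analytics.dbt_artifacts.run_results with a single VARIANT column named data holding one run_results.json per row:

```sql
-- Flatten the results array of each uploaded run_results.json into one row per node
select
    data:metadata:invocation_id::string       as invocation_id,
    data:metadata:generated_at::timestamp_tz  as generated_at,
    result.value:unique_id::string            as node_id,
    result.value:status::string               as status,
    result.value:execution_time::float        as execution_time_s
from analytics.dbt_artifacts.run_results,
     lateral flatten(input => data:results) as result;
```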

11 of 33

Solution Overview

  1. Orchestrate: Run dbt using Airflow and KubernetesPodOperator tasks
  2. Store metadata: Load artifacts into Snowflake at the end of every dbt pipeline
  3. Model: Join artifacts together with query history using dbt
  4. Report: Display model and pipeline performance in dashboards on Looker
  5. Alert: Fire alerts directly to model owners using Slack user groups.

12 of 33

Orchestration

  • All dbt pipelines are centralized in Airflow, except for ad hoc jobs run using dbt Cloud

13 of 33

Orchestration

Schedule using deployment tags

  • Every model is allowed one deployment tag: hourly, nightly, or external

  • External pipelines use intersection selectors to avoid race conditions
    • e.g. `dbt run -m source:myreport+,tag:external`

14 of 33


Orchestration

Managed using deployment tags

  • Every model requires a deployment tag and a domain tag
  • The deployment tag is used for scheduling and is one of: hourly, nightly, external (see the config sketch below)
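
As an illustration, both tags can be declared in a model's config block. The tag values, materialization, and model below are hypothetical, not our actual project:

```sql
-- Hypothetical model: declares its deployment tag and domain tag inline
{{ config(
    materialized = 'table',
    tags = ['nightly', 'growth']
) }}

select * from {{ ref('stg_orders') }}
```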

15 of 33

Store metadata

  • Combination of PUT, COPY, and REMOVE queries
  • Uploads artifacts to their respective artifact tables; easy to add new artifacts
  • Called using `dbt run-operation` at the end of a dbt Airflow task
  • Executes even after dbt job failure
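
A hedged sketch of what such an operation can look like; the macro name, stage, and table are illustrative, not our exact implementation:

```sql
-- Invoked at the end of an Airflow dbt task: dbt run-operation upload_run_results
{% macro upload_run_results() %}
    {% set stage = '@analytics.dbt_artifacts.artifacts_stage' %}

    {# Stage the local artifact produced by the preceding dbt command #}
    {% do run_query("put file://target/run_results.json " ~ stage ~ " auto_compress=true overwrite=true") %}

    {# Load it into a single-VARIANT-column table, then clean up the stage #}
    {% do run_query("copy into analytics.dbt_artifacts.run_results from " ~ stage ~ "/run_results.json.gz file_format=(type='json')") %}
    {% do run_query("remove " ~ stage ~ " pattern='.*run_results.json.gz'") %}
{% endmacro %}
```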

16 of 33

Example Pipeline

  • Use intersection selector to select external models
  • Upload metadata
  • Run using K8s operator

17 of 33


Deployment

  • Changed models are built in production immediately after merging
    • dbt docs is also updated
  • How to deploy your model: deployment tags, one of: hourly, nightly, weekly, external
  • If your model is hourly, nightly, or weekly, there is no need to do anything in Airflow

18 of 33

Modelling

*Heavily inspired by Gitlab Data Team
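
The modelling layer joins each node's run results to the Snowflake query history. A rough sketch, assuming each model run sets query_tag to its node id and that a staging model flattens the uploaded run results (table and column names are hypothetical, loosely following a dbt_artifacts-style staging layer):

```sql
with run_results as (
    select invocation_id, node_id, status, execution_time_s, generated_at
    from {{ ref('stg_dbt__run_results') }}
),

query_history as (
    select
        query_tag,
        warehouse_size,
        total_elapsed_time / 1000 as elapsed_s,
        bytes_scanned,
        start_time
    from snowflake.account_usage.query_history
    where query_type in ('CREATE_TABLE_AS_SELECT', 'INSERT', 'MERGE')
)

select
    r.node_id,
    r.status,
    r.execution_time_s,
    q.warehouse_size,
    q.elapsed_s,
    q.bytes_scanned
from run_results r
left join query_history q
    on q.query_tag = r.node_id
   and q.start_time between dateadd('day', -1, r.generated_at) and r.generated_at
```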

19 of 33

Reporting on Model Runs

20 of 33

Reporting on Test Failures

21 of 33

Performance management

  • Materialization
  • Clustering
  • Warehouse
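
The query history can surface candidates for each lever. A sketch with illustrative thresholds (and the same per-model query_tag assumption as above):

```sql
with model_queries as (
    select
        query_tag                                               as node_id,
        avg(total_elapsed_time) / 1000                          as avg_elapsed_s,
        avg(partitions_scanned / nullif(partitions_total, 0))   as avg_scan_ratio,
        sum(bytes_spilled_to_remote_storage)                    as bytes_spilled_remote
    from snowflake.account_usage.query_history
    where start_time >= dateadd('day', -7, current_timestamp())
      and query_tag is not null
    group by 1
)

select *
from model_queries
where avg_elapsed_s > 300          -- long-running: revisit materialization or schedule
   or avg_scan_ratio > 0.9         -- poor partition pruning: consider clustering keys
   or bytes_spilled_remote > 0     -- remote spilling: consider a larger warehouse
order by avg_elapsed_s desc;
```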

22 of 33

Alerting

All models are tagged with a single domain tag (e.g. growth, product, finance)

Alerts are sent every 15 minutes, tagging the model owner using a Slack user group, e.g. @growth-domain
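
A hedged sketch of the alert query that feeds those Slack messages; table and column names are hypothetical:

```sql
-- Models and tests that errored or failed in the last 15 minutes, with the owning domain
select
    r.node_id,
    r.status,
    r.message,
    '@' || m.domain_tag || '-domain' as slack_group
from {{ ref('fct_dbt__run_results') }} r
join {{ ref('dim_dbt__models') }} m
    on m.node_id = r.node_id
where r.status in ('error', 'fail')
  and r.generated_at >= dateadd('minute', -15, current_timestamp())
```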

23 of 33

Track performance degradation
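
One way to flag runtime creep (illustrative names and thresholds): compare each model's recent average runtime to its trailing baseline.

```sql
with daily as (
    select node_id, generated_at::date as run_date, avg(execution_time_s) as avg_s
    from {{ ref('fct_dbt__run_results') }}
    where status = 'success'
    group by 1, 2
),

recent as (
    select node_id, avg(avg_s) as recent_avg_s
    from daily
    where run_date >= dateadd('day', -3, current_date())
    group by 1
),

baseline as (
    select node_id, avg(avg_s) as baseline_avg_s
    from daily
    where run_date <  dateadd('day', -3, current_date())
      and run_date >= dateadd('day', -31, current_date())
    group by 1
)

select
    r.node_id,
    r.recent_avg_s,
    b.baseline_avg_s,
    r.recent_avg_s / nullif(b.baseline_avg_s, 0) as slowdown_ratio
from recent r
join baseline b on b.node_id = r.node_id
where r.recent_avg_s > 1.5 * b.baseline_avg_s   -- flag >50% runtime creep
order by slowdown_ratio desc;
```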

24 of 33

Model executions over time

25 of 33

Pipeline bottlenecks

26 of 33

Pipeline bottlenecks (cont’d)

27 of 33

Implementation

  • Deploy easily on the modern data stack, using any BI tool
  • Alert through BI or directly in dbt pipelines using SnowflakeToSlack
  • Model dbt artifacts yourself, or use the dbt_artifacts dbt package: https://github.com/tailsdotcom/dbt_artifacts
  • Use dbt Cloud! They are building a robust Metadata API.

28 of 33

We’re hiring!

Reach out to Jonathan or Kevin on the dbt slack community, find us on LinkedIn, or email us at jonathan@snapcommerce.com and kevin@snapcommerce.com

29 of 33

Thank you!

30 of 33

Model performance tuning

*Heavily inspired by Gitlab Data Team

31 of 33

Runtime Checks (TBD)

32 of 33

Results

  • Alerts on model and test failures as well as runtime creep are sent within 15 minutes of execution.
  • Easily determine model state (data quality, staleness, etc.)
  • Find models and tests that should be refactored
  • Identify bottlenecks in dbt pipelines

33 of 33

Automated cost and performance management

  • Metadata can be combined with the query history to get cost per dbt resource
  • Automate warehouse sizing and materialization recommendations using Snowflake query metadata
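
A rough sketch of per-model cost attribution, assuming a per-model query_tag and approximating each query's cost as its share of warehouse execution time multiplied by the warehouse's metered credits for that hour:

```sql
with query_cost as (
    select
        query_tag                       as node_id,
        warehouse_name,
        date_trunc('hour', start_time)  as usage_hour,
        execution_time                  as execution_ms,
        sum(execution_time) over (
            partition by warehouse_name, date_trunc('hour', start_time)
        )                               as warehouse_hour_ms
    from snowflake.account_usage.query_history
    where query_tag is not null
),

warehouse_cost as (
    select
        warehouse_name,
        date_trunc('hour', start_time)  as usage_hour,
        sum(credits_used)               as credits
    from snowflake.account_usage.warehouse_metering_history
    group by 1, 2
)

select
    qc.node_id,
    sum(wc.credits * qc.execution_ms / nullif(qc.warehouse_hour_ms, 0)) as approx_credits
from query_cost qc
join warehouse_cost wc
  on wc.warehouse_name = qc.warehouse_name
 and wc.usage_hour     = qc.usage_hour
group by 1
order by approx_credits desc;
```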