1 of 33

Observability within dbt

Kevin Chan, Data Engineer

Jonathan Talmi, Senior Data Platform Manager

December 7, 2021

2 of 33

To provide access for everyone to experience more of what life has to offer, regardless of income or circumstance.

3 of 33


We are building a new way to shop that maximizes savings, benefits and rewards on mobile

We have started with hotel bookings and consumer goods and have driven nearly $1B in sales.

4 of 33


Who are we?

Kevin Chan, Data Engineer

  • University of Waterloo, Computer Science & Statistics
  • Ex: Hootsuite, Splunk

Jonathan Talmi, Data Platform Lead

  • Researcher → Analyst → Data Eng → Analytics Eng
  • Ex: Shopify, Instacart, Bank of Canada

5 of 33

Agenda

  • Intros
  • Tech and tooling
  • Core benefits of the system
  • Core benefits of metadata tracking

6 of 33

Data Observability

  • Freshness
  • Metrics
  • Schema
  • Lineage
  • Data quality
  • Metadata
  • Profiling
  • Logs

7 of 33

Why Observability Matters

We had limited observability into our dbt deployment!

  • Why isn't my model up to date?
  • Why is my model taking so long to run?
  • Is my data accurate?
  • How do I speed up my dbt pipelines?
  • How should I materialize and provision my model?

8 of 33

Observability in dbt

  • Metrics: data quality metrics can be written as dbt tests and alerted on
  • Metadata: dbt artifacts contain metadata about executions and source freshness
  • Lineage: built-in dbt resource lineage and external dependencies using sources and exposures
  • Logs: dbt logs are surfaced in execution pipelines, but rich query logs live in the data warehouse

9 of 33


Jobs to be Done

Our goal was to build a system that could perform several jobs:

  • Send alerts to dbt model owners and stakeholders based on custom criteria
  • Identify pipeline bottlenecks and model-level optimizations
  • Reliably collect metadata in close to real time, regardless of pipeline success

The system had to be:

  • Lightweight: deploy easily using the existing stack
  • Flexible: enable SQL-based exploration of artifacts and metadata
  • Exhaustive: support all dbt resources, artifacts, and relevant job types

10 of 33

Data Sources

dbt artifacts store valuable information about data quality, performance, executions, and lineage. Combining dbt artifacts and the query history provides deeper insights about model-level performance.

  • Manifest: full configuration of a dbt project
  • Run Results: detailed node- and pipeline-level execution data
  • Query history: rich query performance metrics at the model level
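
Once the raw artifacts land in the warehouse, they can be explored directly with SQL. A minimal sketch, assuming a hypothetical landing table analytics.dbt_artifacts.run_results with a single VARIANT column named data holding one run_results.json per row:

```sql
-- Flatten the results array of each uploaded run_results.json into one row per node
select
    data:metadata:invocation_id::string       as invocation_id,
    data:metadata:generated_at::timestamp_tz  as generated_at,
    result.value:unique_id::string            as node_id,
    result.value:status::string               as status,
    result.value:execution_time::float        as execution_time_s
from analytics.dbt_artifacts.run_results,
     lateral flatten(input => data:results) as result;
```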

11 of 33

Solution Overview

  1. Orchestrate: Run dbt using Airflow and KubernetesPodOperator tasks
  2. Store metadata: Load artifacts into Snowflake at the end of every dbt pipeline
  3. Model: Join artifacts together with query history using dbt
  4. Report: Display model and pipeline performance in dashboards on Looker
  5. Alert: Fire alerts directly to model owners using Slack user groups.

12 of 33

Orchestration

  • All dbt pipelines are centralized in Airflow, except for ad hoc jobs run using dbt Cloud

13 of 33

Orchestration

Schedule using deployment tags

  • Every model is allowed one deployment tag: hourly, nightly, or external

  • External pipelines use intersection selectors to avoid race conditions
    • e.g. `dbt run -m source:myreport+,tag:external`

14 of 33


Orchestration

Managed using deployment tags

  • Every model requires a deployment tag and a domain tag
  • The deployment tag is used for scheduling and is one of: hourly, nightly, external (see the config sketch below)
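
As an illustration, both tags can be declared in a model's config block. The tag values, materialization, and model below are hypothetical, not our actual project:

```sql
-- Hypothetical model: declares its deployment tag and domain tag inline
{{ config(
    materialized = 'table',
    tags = ['nightly', 'growth']
) }}

select * from {{ ref('stg_orders') }}
```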

15 of 33

Store metadata

  • Combination of PUT, COPY, and REMOVE queries
  • Uploads artifacts to their respective artifact tables; easy to add new artifacts
  • Called using `dbt run-operation` at the end of a dbt Airflow task
  • Executes even after dbt job failure
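
A hedged sketch of what such an operation can look like; the macro name, stage, and table are illustrative, not our exact implementation:

```sql
-- Invoked at the end of an Airflow dbt task: dbt run-operation upload_run_results
{% macro upload_run_results() %}
    {% set stage = '@analytics.dbt_artifacts.artifacts_stage' %}

    {# Stage the local artifact produced by the preceding dbt command #}
    {% do run_query("put file://target/run_results.json " ~ stage ~ " auto_compress=true overwrite=true") %}

    {# Load it into a single-VARIANT-column table, then clean up the stage #}
    {% do run_query("copy into analytics.dbt_artifacts.run_results from " ~ stage ~ "/run_results.json.gz file_format=(type='json')") %}
    {% do run_query("remove " ~ stage ~ " pattern='.*run_results.json.gz'") %}
{% endmacro %}
```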

16 of 33

Example Pipeline

  • Use intersection selector to select external models
  • Upload metadata
  • Run using K8s operator

17 of 33


Deployment

  • Changed models are built in production immediately after merging
    • dbt docs is also updated
  • How to deploy your model: deployment tags, one of: hourly, nightly, weekly, external
  • If your model is hourly, nightly, or weekly, there is no need to do anything in Airflow

18 of 33

Modelling

*Heavily inspired by Gitlab Data Team
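
The modelling layer joins each node's run results to the Snowflake query history. A rough sketch, assuming each model run sets query_tag to its node id and that a staging model flattens the uploaded run results (table and column names are hypothetical, loosely following a dbt_artifacts-style staging layer):

```sql
with run_results as (
    select invocation_id, node_id, status, execution_time_s, generated_at
    from {{ ref('stg_dbt__run_results') }}
),

query_history as (
    select
        query_tag,
        warehouse_size,
        total_elapsed_time / 1000 as elapsed_s,
        bytes_scanned,
        start_time
    from snowflake.account_usage.query_history
    where query_type in ('CREATE_TABLE_AS_SELECT', 'INSERT', 'MERGE')
)

select
    r.node_id,
    r.status,
    r.execution_time_s,
    q.warehouse_size,
    q.elapsed_s,
    q.bytes_scanned
from run_results r
left join query_history q
    on q.query_tag = r.node_id
   and q.start_time between dateadd('day', -1, r.generated_at) and r.generated_at
```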

19 of 33

Reporting on Model Runs

20 of 33

Reporting on Test Failures

21 of 33

Performance management

  • Materialization
  • Clustering
  • Warehouse
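
The query history can surface candidates for each lever. A sketch with illustrative thresholds (and the same per-model query_tag assumption as above):

```sql
with model_queries as (
    select
        query_tag                                               as node_id,
        avg(total_elapsed_time) / 1000                          as avg_elapsed_s,
        avg(partitions_scanned / nullif(partitions_total, 0))   as avg_scan_ratio,
        sum(bytes_spilled_to_remote_storage)                    as bytes_spilled_remote
    from snowflake.account_usage.query_history
    where start_time >= dateadd('day', -7, current_timestamp())
      and query_tag is not null
    group by 1
)

select *
from model_queries
where avg_elapsed_s > 300          -- long-running: revisit materialization or schedule
   or avg_scan_ratio > 0.9         -- poor partition pruning: consider clustering keys
   or bytes_spilled_remote > 0     -- remote spilling: consider a larger warehouse
order by avg_elapsed_s desc;
```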

22 of 33

Alerting

All models are tagged with a single domain tag (e.g. growth, product, finance)

Alerts are sent every 15 minutes, tagging the model owner using a Slack user group, e.g. @growth-domain
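
A hedged sketch of the alert query that feeds those Slack messages; table and column names are hypothetical:

```sql
-- Models and tests that errored or failed in the last 15 minutes, with the owning domain
select
    r.node_id,
    r.status,
    r.message,
    '@' || m.domain_tag || '-domain' as slack_group
from {{ ref('fct_dbt__run_results') }} r
join {{ ref('dim_dbt__models') }} m
    on m.node_id = r.node_id
where r.status in ('error', 'fail')
  and r.generated_at >= dateadd('minute', -15, current_timestamp())
```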

23 of 33

Track performance degradation
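
One way to flag runtime creep (illustrative names and thresholds): compare each model's recent average runtime to its trailing baseline.

```sql
with daily as (
    select node_id, generated_at::date as run_date, avg(execution_time_s) as avg_s
    from {{ ref('fct_dbt__run_results') }}
    where status = 'success'
    group by 1, 2
),

recent as (
    select node_id, avg(avg_s) as recent_avg_s
    from daily
    where run_date >= dateadd('day', -3, current_date())
    group by 1
),

baseline as (
    select node_id, avg(avg_s) as baseline_avg_s
    from daily
    where run_date <  dateadd('day', -3, current_date())
      and run_date >= dateadd('day', -31, current_date())
    group by 1
)

select
    r.node_id,
    r.recent_avg_s,
    b.baseline_avg_s,
    r.recent_avg_s / nullif(b.baseline_avg_s, 0) as slowdown_ratio
from recent r
join baseline b on b.node_id = r.node_id
where r.recent_avg_s > 1.5 * b.baseline_avg_s   -- flag >50% runtime creep
order by slowdown_ratio desc;
```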

24 of 33

Model executions over time

25 of 33

Pipeline bottlenecks

26 of 33

Pipeline bottlenecks (cont’d)

27 of 33

Implementation

  • Deploy easily on the modern data stack, using any BI tool
  • Alert through BI or directly in dbt pipelines using SnowflakeToSlack
  • Model dbt artifacts yourself, or use the dbt_artifacts dbt package: https://github.com/tailsdotcom/dbt_artifacts
  • Use dbt Cloud! They are building a robust Metadata API.

28 of 33

We’re hiring!

Reach out to Jonathan or Kevin on the dbt slack community, find us on LinkedIn, or email us at jonathan@snapcommerce.com and kevin@snapcommerce.com

29 of 33

Thank you!

30 of 33

Model performance tuning

*Heavily inspired by Gitlab Data Team

31 of 33

Runtime Checks (TBD)

32 of 33

Results

  • Alerts on model and test failures as well as runtime creep are sent within 15 minutes of execution.
  • Easily determine model state (data quality, staleness, etc.)
  • Find models and tests that should be refactored
  • Identify bottlenecks in dbt pipelines

33 of 33

Automated cost and performance management

  • Metadata can be combined with the query history to get cost per dbt resource
  • Automate warehouse sizing and materialization recommendations using Snowflake query metadata
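
A rough sketch of per-model cost attribution, assuming a per-model query_tag and approximating each query's cost as its share of warehouse execution time multiplied by the warehouse's metered credits for that hour:

```sql
with query_cost as (
    select
        query_tag                       as node_id,
        warehouse_name,
        date_trunc('hour', start_time)  as usage_hour,
        execution_time                  as execution_ms,
        sum(execution_time) over (
            partition by warehouse_name, date_trunc('hour', start_time)
        )                               as warehouse_hour_ms
    from snowflake.account_usage.query_history
    where query_tag is not null
),

warehouse_cost as (
    select
        warehouse_name,
        date_trunc('hour', start_time)  as usage_hour,
        sum(credits_used)               as credits
    from snowflake.account_usage.warehouse_metering_history
    group by 1, 2
)

select
    qc.node_id,
    sum(wc.credits * qc.execution_ms / nullif(qc.warehouse_hour_ms, 0)) as approx_credits
from query_cost qc
join warehouse_cost wc
  on wc.warehouse_name = qc.warehouse_name
 and wc.usage_hour     = qc.usage_hour
group by 1
order by approx_credits desc;
```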