Observability within dbt
Kevin Chan, Data Engineer
Jonathan Talmi, Senior Data Platform Manager
December 7, 2021
To provide access for everyone to experience more of what life has to offer, regardless of income or circumstance.
We are building a new way to shop that maximizes savings, benefits, and rewards on mobile.
We have started with hotel bookings and consumer goods and have driven nearly $1B in sales.
Who are we?
Kevin Chan, Data Engineer
Jonathan Talmi, Data Platform Lead
Agenda
Intros
Tech and tooling
Core benefits of system
Core benefits of metadata tracking
Data Observability
Freshness
Metrics
Schema
Lineage
Data quality
Metadata
Profiling
Logs
Why Observability Matters
We had limited observability into our dbt deployment!
Why isn't my model up to date?
Why is my model taking so long to run?
Is my data accurate?
How do I speed up my dbt pipelines?
How should I materialize and provision my model?
Observability in dbt
Metrics: Data quality metrics can be written as dbt tests and alerted on
Lineage: Built-in dbt resource lineage and external dependencies using sources and exposures
Metadata: dbt artifacts contain metadata about executions and source freshness
Logs: dbt logs are surfaced in execution pipelines, but rich query logs live in the data warehouse
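As a sketch of how that artifact metadata can be consumed: the `results` entries in run_results.json carry a `unique_id`, `status`, and `execution_time`, so per-model timings fall out of a small parser (the file path and field handling here are illustrative):

```python
import json

def model_timings(run_results_path):
    """Parse dbt's run_results.json into (unique_id, status, execution_time) tuples."""
    with open(run_results_path) as f:
        artifact = json.load(f)
    return [
        (r["unique_id"], r["status"], r.get("execution_time", 0.0))
        for r in artifact["results"]
    ]
```

Sorting these tuples by execution time gives an immediate view of the slowest models in a run.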
Jobs to be Done
Our goal was to build a system that could perform several jobs:
Lightweight: Deploy the system easily using the existing stack
Flexible: Enable SQL-based exploration of artifacts and metadata
Exhaustive: Support all dbt resources, artifacts, and relevant job types
Data Sources
dbt artifacts store valuable information about data quality, performance, executions, and lineage. Combining dbt artifacts and the query history provides deeper insights about model-level performance.
Manifest: Full configuration of a dbt project
Run Results: Detailed node- and pipeline-level execution data
Query history: Rich query performance metrics at the model level
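A minimal sketch of that combination, assuming the three sources have already been loaded into dictionaries keyed by model `unique_id` (the field names are illustrative, not the actual warehouse schema):

```python
def enrich_run_results(run_results, manifest, query_history):
    """Join dbt run results with manifest config and warehouse query metrics.

    run_results:   {unique_id: execution_time_seconds}
    manifest:      {unique_id: {"materialized": ..., "tags": [...]}}
    query_history: {unique_id: {"bytes_scanned": ...}}  # illustrative metric
    """
    rows = []
    for uid, secs in run_results.items():
        cfg = manifest.get(uid, {})
        rows.append({
            "unique_id": uid,
            "execution_time": secs,
            "materialized": cfg.get("materialized"),
            "tags": cfg.get("tags", []),
            **query_history.get(uid, {}),
        })
    return rows
```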
Solution Overview
Orchestration
Runs are scheduled and managed using deployment tags
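dbt's node selection syntax makes tag-based scheduling straightforward: `--select tag:hourly` picks every model tagged `hourly`, and comma-separated criteria select their intersection. A small helper along these lines (the tag names and `prod` target are illustrative):

```python
def dbt_selector(tags):
    """Comma-join tag criteria: in dbt, commas inside --select mean intersection."""
    return ",".join(f"tag:{t}" for t in tags)

def dbt_command(tags, target="prod"):
    """Assemble the dbt CLI invocation for one scheduled deployment."""
    return ["dbt", "run", "--select", dbt_selector(tags), "--target", target]
```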
Example Pipeline
Use the intersection selector to select external models
Run using the K8s operator
Upload metadata
Store metadata
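The upload step can be sketched as flattening an artifact into rows for a warehouse table. dbt artifacts carry an `invocation_id` in their top-level `metadata` block; the two-column row shape here is an assumption about the target table:

```python
import json

def artifact_rows(path):
    """Flatten one dbt artifact into (invocation_id, result_json) rows for loading."""
    with open(path) as f:
        artifact = json.load(f)
    invocation_id = artifact["metadata"]["invocation_id"]
    return [(invocation_id, json.dumps(r)) for r in artifact["results"]]
```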
Deployment
If your model is hourly, nightly, or weekly, no need to do anything in Airflow
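One way to realize that convention is a static mapping from deployment tags to cron schedules that the Airflow DAG factory consults; the tag names and cron strings below are illustrative assumptions, not the deck's exact configuration:

```python
# Illustrative deployment tag -> cron mapping; exact schedules are assumptions.
SCHEDULES = {
    "hourly": "0 * * * *",
    "nightly": "0 5 * * *",
    "weekly": "0 5 * * 1",
}

def schedule_for(tags):
    """Return the cron schedule for the first recognized deployment tag, else None."""
    for tag in tags:
        if tag in SCHEDULES:
            return SCHEDULES[tag]
    return None
```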
Modelling
*Heavily inspired by Gitlab Data Team
Reporting on Model Runs
Reporting on Test Failures
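Test-failure reporting can start from the same run_results entries; this sketch collects the tests that failed or errored (the `test.` prefix matches dbt's unique_id convention for tests):

```python
def failing_tests(results):
    """Return the unique_ids of dbt tests that failed or errored.

    results: entries shaped like the 'results' list in run_results.json.
    """
    return [
        r["unique_id"]
        for r in results
        if r["unique_id"].startswith("test.") and r["status"] in ("fail", "error")
    ]
```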
Performance management
Materialization
Clustering
Warehouse
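A sketch of how those three levers might be turned into automated suggestions; every threshold and field name here is an illustrative assumption, not the deck's actual policy:

```python
def recommend(model):
    """Naive cost/performance heuristics over one model's metrics.

    model: {"runtime_s": ..., "materialized": ..., "bytes_scanned": ...}
    """
    tips = []
    if model["materialized"] == "view" and model["runtime_s"] > 300:
        tips.append("consider materializing as a table or incremental model")
    if model["bytes_scanned"] > 100 * 2**30:  # ~100 GiB scanned
        tips.append("consider adding a clustering key to reduce bytes scanned")
    if model["runtime_s"] > 1800:
        tips.append("consider running on a larger warehouse")
    return tips
```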
Alerting
All models are tagged with a single domain tag (e.g. growth, product, finance, etc.)
Alerts are sent every 15 minutes, tagging the model owner via a Slack group, e.g. @growth-domain
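The alert routing can be sketched as a lookup from domain tag to Slack group; the group handles and message format are illustrative:

```python
# Hypothetical domain -> Slack group mapping; handles are illustrative.
DOMAIN_GROUPS = {
    "growth": "@growth-domain",
    "product": "@product-domain",
    "finance": "@finance-domain",
}

def alert_message(model_name, domain, status):
    """Format one alert line, falling back to a platform-wide group for unknown domains."""
    group = DOMAIN_GROUPS.get(domain, "@data-platform")
    return f"{group} dbt model `{model_name}` finished with status `{status}`"
```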
Track performance degradation
Model executions over time
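Degradation tracking can be sketched as comparing each model's latest runtime to its trailing average; the window and threshold values are illustrative:

```python
from statistics import mean

def flag_regressions(history, window=7, threshold=1.5):
    """Flag models whose latest runtime exceeds threshold x their trailing average.

    history: {model: [runtimes ordered oldest -> newest]}
    Returns {model: slowdown_ratio} for flagged models.
    """
    flagged = {}
    for model, runs in history.items():
        if len(runs) <= window:
            continue  # not enough history to establish a baseline
        baseline = mean(runs[-window - 1:-1])
        if baseline and runs[-1] > threshold * baseline:
            flagged[model] = runs[-1] / baseline
    return flagged
```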
Pipeline bottlenecks
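Finding a pipeline bottleneck amounts to computing the most expensive dependency chain through the DAG, weighting each model by its runtime. A sketch, assuming runtimes and parent lists have already been extracted from the artifacts:

```python
from functools import lru_cache

def critical_path(runtimes, parents):
    """Find the slowest dependency chain (bottleneck) in a dbt DAG.

    runtimes: {model: seconds}; parents: {model: [upstream models]}.
    Returns (total_seconds, (models along the critical path, ...)).
    """
    @lru_cache(maxsize=None)
    def longest(node):
        # Best (cost, path) among this node's upstream chains.
        best = (0.0, ())
        for p in parents.get(node, []):
            cand = longest(p)
            if cand[0] > best[0]:
                best = cand
        return (best[0] + runtimes[node], best[1] + (node,))

    return max((longest(n) for n in runtimes), key=lambda t: t[0])
```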
Implementation
We’re hiring!
Reach out to Jonathan or Kevin on the dbt Slack community, find us on LinkedIn, or email us at jonathan@snapcommerce.com and kevin@snapcommerce.com.
Thank you!
Model performance tuning
Runtime Checks (TBD)
Results
Automated cost and performance management