1 of 20

Automating Ambiguity:

Managing dynamic source data using dbt macros

Eric Nelson, Analytics & Engineering Lead

2 of 20

Agenda

  1. Background
  2. Problem
  3. Tech Stack
  4. Solution
  5. Macro Overview
  6. Walkthrough & Live Coding
  7. Limitations & Future Additions

3 of 20

Background

Job Function

  • Mattermost Business Systems, Website & Product Usage ELT
    • Website & Product data captured via Rudderstack SDK (More on Rudderstack Later)

  • End-to-End Analytics Infrastructure Design & Maintenance
    • Data Pipelines → Stitch + Rudderstack → Snowflake → dbt → Looker

  • A/B Testing: Experimental Design & Monitoring
    • Clean, reliable transformations of data are essential for testing & validating results

Eric Nelson,

Analytics Engineer @ Mattermost

4 of 20

Problem

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
  5. Maintaining & transforming raw, dynamic data becomes too time consuming

5 of 20

Problem

(@ Mattermost)

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
    • How do we efficiently track new data sources & properties generated as product development cycles increase in velocity?
    • How do we analyze increasing types & volumes of customer engagement data?
    • How do we handle variations in schemas, naming conventions & data types?
    • How do we process larger amounts of data using minimal compute & resources ($)?
  5. Maintaining & transforming raw, dynamic data becomes too time consuming

(or does it..?)

6 of 20

Our Tech Stack

  • Customer Data Management
    • Rudderstack
    • Stitch
  • Storage
    • Snowflake
  • ELT
    • dbt
  • BI & Data Visualization
    • Looker

7 of 20

Rudderstack CDP

Rudderstack is a customer data platform (similar to Segment or Bloomreach)

  • Its SDKs provide multi-platform tracking of user engagement across:
    • Web
    • Desktop
    • Mobile
    • Other digital mediums
  • Generates and stores data in default schemas, relations & properties
  • Customizable to capture custom engagement properties
    • Provides developers flexibility
    • Makes changes difficult to monitor as an Analytics Engineer

8 of 20

Solution:

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
  5. Maintaining & transforming raw, dynamic data becomes too time consuming (or does it..?)

dbt Macro

One dbt Macro to Rule Them All

    • Tracks changes to schemas and tables by cross-referencing information_schema data
    • Automatically identifies changes to source tables and updates the master table
    • Creates nested visual relations by adding a property that tracks each row’s source table
    • Provides logic for Incremental Builds vs. Full Refresh
    • Creates column supersets to identify shared and missing properties
    • Identifies “dummy column” requirements and casts null values to the proper data type

9 of 20

Union Macro Logic

  1. Check if this is an incremental build.
  2. If incremental, check whether the source tables contain columns that are not in the master table.
     If not incremental, the full refresh automatically identifies and incorporates new columns.
  3. If incremental and columns are missing from the master table, execute:

     ALTER TABLE *master_table*
     ADD COLUMN
     *missing_column_1* *data_type*,
     *missing_column_2* *data_type*,
     ...
     *missing_column_n* *data_type*
     ;

  4. If not incremental, or if the columns are already in the master table, skip this step.

Execute Unioning Script (See Next Slide)...
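In dbt terms, this branching is typically driven by the is_incremental() flag. A minimal Jinja sketch of the top of the union macro (illustrative only, not the exact Mattermost implementation; the macro and argument names match those used later in this deck):

{% if is_incremental() %}
    {# Steps 2-3: sync any new source columns onto the master table via ALTER TABLE #}
    {{ add_new_columns(relations, tgt_relation) }}
{% endif %}

{# Then generate and execute the unioning script (next slide) #}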


10 of 20

Union Macro Diagram

11 of 20

What’s the Big Deal?

  1. Eliminates the need to write lengthy SQL scripts
    1. This is already one of dbt’s primary purposes, but
    2. Models leveraging the macro use 4 lines of code to generate tens, hundreds, even thousands of lines of SQL
  2. No more DDL!
    • Model maps dependencies for you
    • Ensures the model loads downstream of any source tables
  3. Automatically detects new columns, updates target relation, and loads data!
    • No more “ALTER TABLE *table_name* ADD COLUMN *column_name* *column_type*”
    • Tracking changes, adding them manually & backfilling is the WORST...

12 of 20

Model File

get_source_target_relation_lists() Arguments:

  1. schema = list of all schemas to include in union macro
  2. database = name of database to be targeted (defaults to profiles.yml db)
  3. table_exclusions = subset of tables in the schemas to omit from the union script
    1. Excluding (rather than including) tables means any new tables that appear in the specified schemas are picked up automatically
    2. Use table_inclusions instead if you know only a small, fixed set of tables will ever be included in the union script

union_relations() Arguments:

  1. relations = list of source relation objects to include in union script
  2. tgt_relation = target relation object (table to be created or incrementally loaded by model file)
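The original slide showed the model file itself as a screenshot. As a rough sketch of what such a model could look like (the schema and table names below are made up, and the macro’s return shape, a dict of source relations plus the target relation, is an assumption):

{{ config(materialized='incremental') }}

{# Hypothetical schemas and exclusions; database defaults to the profiles.yml target #}
{% set rels = get_source_target_relation_lists(schema=['rudder_webapp', 'rudder_mobile'],
                                                database=target.database,
                                                table_exclusions=['tracks_to_exclude']) %}

{{ union_relations(relations=rels['sources'], tgt_relation=rels['target']) }}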

13 of 20

Retrieve Source & Target Relations

Retrieve source & target relation lists.
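The macro itself appeared as a screenshot on this slide. One way to sketch the same idea, using dbt_utils.get_relations_by_pattern for the information_schema lookup and treating the model’s own relation (this) as the target (the deck doesn’t show whether the real macro uses dbt_utils or raw information_schema queries):

{% macro get_source_target_relation_lists(schema, database=target.database, table_exclusions=[]) %}

    {% set sources = [] %}
    {% for schema_name in schema %}
        {# Cross-reference the information schema for every table in this schema #}
        {% for relation in dbt_utils.get_relations_by_pattern(schema_name, '%', database=database) %}
            {% if relation.identifier | lower not in table_exclusions | map('lower') | list %}
                {% do sources.append(relation) %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {# 'this' is the relation the current model builds, i.e. the master/target table #}
    {{ return({'sources': sources, 'target': this}) }}

{% endmacro %}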

14 of 20

Check For Missing Columns

  1. Instantiates dictionary and list variables
  2. Iterates through the target relation list, checks for valid relations & retrieves their columns
  3. Loops through source and target columns to identify missing values (if any)
  4. Creates a dictionary containing missing column names and data types
  5. If missing columns exist, executes the “ALTER TABLE” script
  6. If not, executes an empty script and returns control to the union macro

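Again, the actual macro was shown as a screenshot. A minimal sketch of the same steps, assuming the macro and argument names used elsewhere in this deck (adapter.get_columns_in_relation and run_query are standard dbt; the body is illustrative, not the Mattermost code):

{% macro add_new_columns(relations, tgt_relation) %}

    {# Columns already present on the master (target) table #}
    {% set target_columns = adapter.get_columns_in_relation(tgt_relation)
                            | map(attribute='name') | map('lower') | list %}

    {# Collect any source columns missing from the target, with their data types #}
    {% set missing = {} %}
    {% for relation in relations %}
        {% for col in adapter.get_columns_in_relation(relation) %}
            {% if col.name | lower not in target_columns %}
                {% do missing.update({col.name: col.data_type}) %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {# If anything is missing, generate and run the ALTER TABLE; otherwise do nothing #}
    {% if missing %}
        {% set ddl %}
            alter table {{ tgt_relation }} add column
            {% for name, data_type in missing.items() %}
                {{ name }} {{ data_type }}{{ "," if not loop.last }}
            {% endfor %}
        {% endset %}
        {% do run_query(ddl) %}
    {% endif %}

{% endmacro %}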
15 of 20

Generate & Execute Union Script

  • If the model is being run incrementally, executes the add_new_columns() macro to determine whether new properties from the source tables need to be added to the target.
  • Executes the union script (next slide).
  • Instantiates dictionary and list variables.
  • Iterates through the target relation list, checks for valid relations & retrieves their columns.
  • Loops through the source columns.
  • Creates a dictionary containing unique column names and data types.

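As with the previous slides, the original showed the macro source. A stripped-down sketch of the unioning logic, covering the column superset, typed null “dummy columns”, and the property that tracks each row’s source table (illustrative only, not the exact Mattermost macro):

{% macro union_relations(relations, tgt_relation) %}

    {# Incremental column sync (add_new_columns, see slides 9 and 14) would run here first #}

    {# Superset of all column names and data types across the source relations #}
    {% set superset = {} %}
    {% for relation in relations %}
        {% for col in adapter.get_columns_in_relation(relation) %}
            {% do superset.update({col.name | lower: col.data_type}) %}
        {% endfor %}
    {% endfor %}

    {# One SELECT per source relation; properties missing from a source become typed nulls #}
    {% for relation in relations %}
        {% set rel_columns = adapter.get_columns_in_relation(relation)
                             | map(attribute='name') | map('lower') | list %}
        select
            '{{ relation }}' as _source_relation,
            {% for name, data_type in superset.items() %}
                {% if name in rel_columns %} {{ name }}
                {% else %} cast(null as {{ data_type }}) as {{ name }}
                {% endif %}{{ "," if not loop.last }}
            {% endfor %}
        from {{ relation }}
        {% if not loop.last %} union all {% endif %}
    {% endfor %}

{% endmacro %}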
16 of 20

Generate & Execute Union Script

17 of 20

Current Limitations & Future Additions

Only supports Snowflake, Postgres & Redshift databases

  • Additional logic required to support other databases

Method to incorporate additional transformation logic for union script

  • Certain models may require certain levels of aggregation
  • Including an argument that accounts for this (example: script_type = “daily_sum” to produce a daily snapshot of numeric properties summed up for each table)

Method to unify columns and column naming conventions

  • Steps to unify columns from source tables with differing naming conventions
    • Parse special characters in column names to identify matches
    • Examine data types
    • Validate property assumptions
    • Coalesce columns and provide single alias
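For example, unifying differently named variants of the same property could reduce to something like this (hypothetical column names):

-- userId, user_id and "user-id" all describe the same property
coalesce(user_id, "userId", "user-id") as user_id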

18 of 20

Q & A

19 of 20

Thank You!

20 of 20

Appendix