1 of 20

Automating Ambiguity:

Managing dynamic source data using dbt macros

Eric Nelson, Analytics & Engineering Lead

2 of 20

Agenda

  1. Background
  2. Problem
  3. Tech Stack
  4. Solution
  5. Macro Overview
  6. Walkthrough & Live Coding
  7. Limitations & Future Additions

3 of 20

Background

Job Function

  • Mattermost Business Systems, Website & Product Usage ELT
    • Website & Product data captured via Rudderstack SDK (More on Rudderstack Later)

  • End-to-End Analytics Infrastructure Design & Maintenance
    • Data Pipelines → Stitch + Rudderstack → Snowflake → dbt → Looker

  • A/B Testing: Experimental Design & Monitoring
    • Clean, reliable transformations of data are essential for testing & validating results

Eric Nelson,

Analytics Engineer @ Mattermost

4 of 20

Problem

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
  5. Maintaining & transforming raw, dynamic data becomes too time consuming

5 of 20

Problem

(@ Mattermost)

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
    • How do we efficiently track new data sources & properties generated as product development cycles increase in velocity?
    • How do we analyze increasing types & volumes of customer engagement data?
    • How do we handle variations in schemas, naming conventions & data types?
    • How do we process larger amounts of data using minimal compute & resources ($)?
  5. Maintaining & transforming raw, dynamic data becomes too time consuming

(or does it..?)

6 of 20

Our Tech Stack

  • Customer Data Management
    • Rudderstack
    • Stitch
  • Storage
    • Snowflake
  • ELT
    • dbt
  • BI & Data Visualization
    • Looker

7 of 20

Rudderstack CDP

Rudderstack is a customer data platform (similar to Segment or Bloomreach)

  • Its SDKs provide multi-platform tracking of user engagement across:
    • Web
    • Desktop
    • Mobile
    • Other digital mediums
  • Generates and stores data in default schemas, relations & properties
  • Customizable to capture custom engagement properties
    • Provides developers flexibility
    • Makes changes difficult to monitor as an Analytics Engineer

8 of 20

Solution:

  1. Data is messy
  2. Behavioral data is never static
  3. Data governance is just plain hard
  4. Data volume and velocity increase as businesses scale
  5. Maintaining & transforming raw, dynamic data becomes too time consuming (or does it..?)

dbt Macro

One dbt Macro to Rule Them All

    • Tracks changes to schemas and tables by cross-referencing information_schema data
    • Automatically identifies changes to source tables and updates the master table
    • Creates nested visual relations by adding a property that tracks each row’s source table
    • Provides logic for Incremental Builds vs. Full Refresh
    • Creates column supersets to identify shared and missing properties
    • Identifies “dummy column” requirements and casts null values to the proper data type

9 of 20

Union Macro Logic

  1. Check if this is an incremental build.
  2. If incremental, check whether the source tables contain columns that are not in the master table.
     If not incremental, the full refresh automatically identifies and incorporates new columns.
  3. If incremental and columns are missing from the master table, execute:

     ALTER TABLE *master_table*
     ADD COLUMN
     *missing_column_1* *data_type*,
     *missing_column_2* *data_type*,
     ...
     *missing_column_n* *data_type*
     ;

  4. If not incremental, or if the columns are already in the master table, skip this step.

Execute Unioning Script (See Next Slide)...
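In dbt terms, this branching is typically driven by the is_incremental() flag. A minimal Jinja sketch of the top of the union macro (illustrative only, not the exact Mattermost implementation; the macro and argument names match those used later in this deck):

{% if is_incremental() %}
    {# Steps 2-3: sync any new source columns onto the master table via ALTER TABLE #}
    {{ add_new_columns(relations, tgt_relation) }}
{% endif %}

{# Then generate and execute the unioning script (next slide) #}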


10 of 20

Union Macro Diagram

11 of 20

What’s the Big Deal?

  1. Eliminates the need to write lengthy SQL scripts
    1. This is already one of dbt’s primary purposes, but
    2. Models leveraging the macro use 4 lines of code to generate tens, hundreds, even thousands of lines of SQL
  2. No more DDL!
    • Model maps dependencies for you
    • Ensures the model loads downstream of any source tables
  3. Automatically detects new columns, updates target relation, and loads data!
    • No more “ALTER TABLE *table_name* ADD COLUMN *column_name* *column_type*”
    • Tracking changes, adding them manually & backfilling is the WORST...

12 of 20

Model File

get_source_target_relation_lists() Arguments:

  1. schema = list of all schemas to include in union macro
  2. database = name of database to be targeted (defaults to profiles.yml db)
  3. table_exclusions = subset of tables in the schemas to omit from the union script
    1. Excluding (rather than including) tables means any new tables that appear in the specified schemas are picked up automatically
    2. Use table_inclusions instead if you know only a small, fixed set of tables will ever be included in the union script

union_relations() Arguments:

  1. relations = list of source relation objects to include in union script
  2. tgt_relation = target relation object (table to be created or incrementally loaded by model file)
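The original slide showed the model file itself as a screenshot. As a rough sketch of what such a model could look like (the schema and table names below are made up, and the macro’s return shape, a dict of source relations plus the target relation, is an assumption):

{{ config(materialized='incremental') }}

{# Hypothetical schemas and exclusions; database defaults to the profiles.yml target #}
{% set rels = get_source_target_relation_lists(schema=['rudder_webapp', 'rudder_mobile'],
                                                database=target.database,
                                                table_exclusions=['tracks_to_exclude']) %}

{{ union_relations(relations=rels['sources'], tgt_relation=rels['target']) }}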

13 of 20

Retrieve Source & Target Relations

Retrieve source & target relation lists.
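The macro itself appeared as a screenshot on this slide. One way to sketch the same idea, using dbt_utils.get_relations_by_pattern for the information_schema lookup and treating the model’s own relation (this) as the target (the deck doesn’t show whether the real macro uses dbt_utils or raw information_schema queries):

{% macro get_source_target_relation_lists(schema, database=target.database, table_exclusions=[]) %}

    {% set sources = [] %}
    {% for schema_name in schema %}
        {# Cross-reference the information schema for every table in this schema #}
        {% for relation in dbt_utils.get_relations_by_pattern(schema_name, '%', database=database) %}
            {% if relation.identifier | lower not in table_exclusions | map('lower') | list %}
                {% do sources.append(relation) %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {# 'this' is the relation the current model builds, i.e. the master/target table #}
    {{ return({'sources': sources, 'target': this}) }}

{% endmacro %}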

14 of 20

Check For Missing Columns

  1. Instantiates dictionary and list variables
  2. Iterates through the target relation list, checks for valid relations & retrieves their columns
  3. Loops through source and target columns to identify missing values (if any)
  4. Creates a dictionary containing missing column names and data types
  5. If missing columns exist, executes the “ALTER TABLE” script
  6. If not, executes an empty script and returns control to the union macro

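Again, the actual macro was shown as a screenshot. A minimal sketch of the same steps, assuming the macro and argument names used elsewhere in this deck (adapter.get_columns_in_relation and run_query are standard dbt; the body is illustrative, not the Mattermost code):

{% macro add_new_columns(relations, tgt_relation) %}

    {# Columns already present on the master (target) table #}
    {% set target_columns = adapter.get_columns_in_relation(tgt_relation)
                            | map(attribute='name') | map('lower') | list %}

    {# Collect any source columns missing from the target, with their data types #}
    {% set missing = {} %}
    {% for relation in relations %}
        {% for col in adapter.get_columns_in_relation(relation) %}
            {% if col.name | lower not in target_columns %}
                {% do missing.update({col.name: col.data_type}) %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {# If anything is missing, generate and run the ALTER TABLE; otherwise do nothing #}
    {% if missing %}
        {% set ddl %}
            alter table {{ tgt_relation }} add column
            {% for name, data_type in missing.items() %}
                {{ name }} {{ data_type }}{{ "," if not loop.last }}
            {% endfor %}
        {% endset %}
        {% do run_query(ddl) %}
    {% endif %}

{% endmacro %}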
15 of 20

Generate & Execute Union Script

  • If the model is being run incrementally, executes the add_new_columns() macro to determine whether new properties from the source tables need to be added to the target.
  • Executes the union script (next slide).
  • Instantiates dictionary and list variables.
  • Iterates through the target relation list, checks for valid relations & retrieves their columns.
  • Loops through the source columns.
  • Creates a dictionary containing unique column names and data types.

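As with the previous slides, the original showed the macro source. A stripped-down sketch of the unioning logic, covering the column superset, typed null “dummy columns”, and the property that tracks each row’s source table (illustrative only, not the exact Mattermost macro):

{% macro union_relations(relations, tgt_relation) %}

    {# Incremental column sync (add_new_columns, see slides 9 and 14) would run here first #}

    {# Superset of all column names and data types across the source relations #}
    {% set superset = {} %}
    {% for relation in relations %}
        {% for col in adapter.get_columns_in_relation(relation) %}
            {% do superset.update({col.name | lower: col.data_type}) %}
        {% endfor %}
    {% endfor %}

    {# One SELECT per source relation; properties missing from a source become typed nulls #}
    {% for relation in relations %}
        {% set rel_columns = adapter.get_columns_in_relation(relation)
                             | map(attribute='name') | map('lower') | list %}
        select
            '{{ relation }}' as _source_relation,
            {% for name, data_type in superset.items() %}
                {% if name in rel_columns %} {{ name }}
                {% else %} cast(null as {{ data_type }}) as {{ name }}
                {% endif %}{{ "," if not loop.last }}
            {% endfor %}
        from {{ relation }}
        {% if not loop.last %} union all {% endif %}
    {% endfor %}

{% endmacro %}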
16 of 20

Generate & Execute Union Script

17 of 20

Current Limitations & Future Additions

Only supports Snowflake, Postgres & Redshift databases

  • Additional logic required to support other databases

Method to incorporate additional transformation logic for union script

  • Certain models may require certain levels of aggregation
  • Including an argument that accounts for this (example: script_type = “daily_sum” to produce a daily snapshot of numeric properties summed up for each table)

Method to unify columns and column naming conventions

  • Steps to unify columns from source tables with differing naming conventions
    • Parse special characters in column names to identify matches
    • Examine data types
    • Validate property assumptions
    • Coalesce columns and provide single alias
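For example, unifying differently named variants of the same property could reduce to something like this (hypothetical column names):

-- userId, user_id and "user-id" all describe the same property
coalesce(user_id, "userId", "user-id") as user_id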

18 of 20

Q & A

19 of 20

Thank You!

20 of 20

Appendix