1 of 60

Building a metadata ecosystem with DBT

Darren Haken

Head of Engineering

@darrenhaken

2 of 60

Agenda

Data journey to decentralisation

Metadata

Where do we, as a community, go from here?

3 of 60

Data Journey

4 of 60

Auto Trader

Largest automotive marketplace in the UK

Established in 1977

58 million cross-platform visits per month

5 of 60

Auto Trader’s Data Journey

6 of 60

The Data Warehouse Chapter

Centralised data team ~ 60 FTE

Hyper-specialised skill set

Classic data warehouse stack

7 of 60

Our centralised data team could not keep up with the demand of the organisation

8 of 60

Data Breadlines

9 of 60

Data Mesh

10 of 60

Data Mesh has given us a vocabulary to share and talk about the next chapter in our data journey

11 of 60

What is the Data Mesh?

Shift from the centralised data lake/data warehouse paradigm to one that draws from modern distributed architecture

Aims to address common failures of the traditional centralised data platform architecture

Unlock analytical data at scale

12 of 60

Data Mesh Principles

Domain Ownership

Data as a Product

Self-Serve Data Platform

Federated Computational Governance

13 of 60

Self-Serve Data Platform

Provide capabilities to serve the organisation

Allow data product development to be accessible by generalists

Eliminate complexity of working with tools - data lakes, warehouses, Apache Spark

Serves many roles - engineers, analysts, marketeers, executives, partners

14 of 60

Domain Ownership and Data Products

Decentralise data to the people closer to the data, normally product teams

Apply product thinking to data assets - Data Products

Data Products have customers - data scientists, analysts, data engineers, marketeers etc

15 of 60

What is a Data Product?

16 of 60

Decentralisation

17 of 60

Auto Trader’s journey to the Data Mesh

18 of 60

Self-Serve Data Platform

Practitioners

19 of 60

What is DBT?

20 of 60

Architecture Using DBT

21 of 60

Explosion in data practitioners

22 of 60

An explosion of users

60 active practitioners across our engineering org of ~ 200

23 of 60

An explosion of users

24 of 60

An explosion of users

25 of 60

DBT + Data Mesh

26 of 60

Distributing our data across the org has created a new set of challenges

27 of 60

Challenges

28 of 60

Ownership/Maintainers

20% of all DBT models are ownerless

Similar issues exist in other tools (such as Airflow and Spark), but we couldn't get accurate stats to show

Looker has 74 references to BQ tables/views that are ownerless in DBT

29 of 60

Ownership Problems

Who are the maintainers for model changes?

Who are the stakeholders that depend on this data?

Who resolves data incidents?

A model is broken - who fixes it?

30 of 60

Sharing Data, Responsibly

Teams depending on other teams' data without permission can lead to...

No contracts between teams - unexpected coupling

Multiple sources of truth

Trust/confidence problems

Restricting access is expensive and manual

31 of 60

When sharing data goes wrong

32 of 60

Discoverability

33 of 60

We need a scalable way to overcome these challenges exposed by the Data Mesh

34 of 60

Automation over human processes

35 of 60

Metadata

36 of 60

DBT supports metadata

Metadata as code

Version controlled

Encapsulate model development with metadata
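
As an illustration of metadata living alongside the model code, here is a minimal Python sketch that reads ownership out of a structure shaped like dbt's manifest.json artifact. dbt does provide a free-form `meta` config on models; the `owner` key inside it, the model names, and the helper functions are assumptions made up for this example, not Auto Trader's actual setup.

```python
from typing import Any

# Hypothetical fragment shaped like dbt's manifest.json ("nodes" keyed by unique id).
# The "owner" key inside "meta" is an assumed team convention, not a dbt built-in.
MANIFEST: dict = {
    "nodes": {
        "model.stock.vehicle_listings": {
            "resource_type": "model",
            "name": "vehicle_listings",
            "meta": {"owner": "stock-team"},
        },
        "model.stock.stg_vehicle_listings": {
            "resource_type": "model",
            "name": "stg_vehicle_listings",
            "meta": {},
        },
        "test.stock.not_null_vehicle_listings_id": {
            "resource_type": "test",
            "name": "not_null_vehicle_listings_id",
            "meta": {},
        },
    }
}


def model_nodes(manifest: dict) -> list:
    """Return only the nodes that are models (tests, seeds etc. are skipped)."""
    return [
        node
        for node in manifest["nodes"].values()
        if node.get("resource_type") == "model"
    ]


def ownerless_models(manifest: dict) -> list:
    """Return the sorted names of models with no non-empty meta.owner."""
    return sorted(
        node["name"]
        for node in model_nodes(manifest)
        if not node.get("meta", {}).get("owner")
    )


def ownerless_share(manifest: dict) -> float:
    """Fraction of models that have no owner (0.0 when there are no models)."""
    models = model_nodes(manifest)
    return len(ownerless_models(manifest)) / len(models) if models else 0.0


print(ownerless_models(MANIFEST))
print(f"{ownerless_share(MANIFEST):.0%} of models are ownerless")
```

Because the metadata is version controlled with the model, a figure like "20% of all DBT models are ownerless" becomes something a script can recompute on every commit rather than a one-off audit.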

37 of 60

Use metadata in CI/CD

38 of 60

Automated Policy

39 of 60

Platform Governance

Policies

All models have an owner

Other teams can only use production models

Only the security team can access PII
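
The three policies above could be checked automatically in CI along these lines. This is a sketch under stated assumptions: the `Model` fields (`team`, `maturity`, `contains_pii`), the team name `security`, and the example models are all hypothetical, and "access PII" is approximated here as "depends on a model flagged as PII".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

SECURITY_TEAM = "security"  # assumed name of the team allowed to use PII


@dataclass
class Model:
    name: str
    team: Optional[str]              # owning team; None means ownerless
    maturity: str = "development"    # assumed values: "development" or "production"
    contains_pii: bool = False
    depends_on: list[str] = field(default_factory=list)


def check_policies(models: dict[str, Model]) -> list[str]:
    """Return one human-readable message per policy violation."""
    violations: list[str] = []
    for model in models.values():
        # Policy 1: all models have an owner.
        if not model.team:
            violations.append(f"{model.name}: has no owner")
        for upstream_name in model.depends_on:
            upstream = models[upstream_name]
            # Policy 2: other teams can only use production models.
            if upstream.team != model.team and upstream.maturity != "production":
                violations.append(
                    f"{model.name}: depends on non-production model "
                    f"{upstream_name} from another team"
                )
            # Policy 3: only the security team can access PII.
            if upstream.contains_pii and model.team != SECURITY_TEAM:
                violations.append(
                    f"{model.name}: depends on PII model {upstream_name} "
                    f"but is not owned by {SECURITY_TEAM}"
                )
    return violations


MODELS = {
    m.name: m
    for m in [
        Model("stock_listings", team="stock", maturity="production"),
        Model("stg_stock_raw", team="stock"),
        Model("customer_emails", team="security", maturity="production", contains_pii=True),
        Model(
            "dealer_report",
            team="dealer",
            maturity="production",
            depends_on=["stock_listings", "stg_stock_raw", "customer_emails"],
        ),
        Model("orphan_model", team=None),
    ]
}

violations = check_policies(MODELS)
for message in violations:
    print(message)
```

In a pipeline, a non-empty list of violations would fail the build, so the policy is enforced by automation rather than by review.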

40 of 60

Automated Policy

41 of 60

Ecosystem

42 of 60

Building the ecosystem

43 of 60

Observability

44 of 60

Observability

45 of 60

Discovery

Synchronise into a central metadata store

Aggregate metadata across multiple tools/platforms

Searchable

Reduce tribal knowledge - Slack queries
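
A toy version of the discovery idea, assuming each tool can export a list of records keyed by dataset name. The `MetadataStore` class, the record fields, and the example data are hypothetical; a real deployment would use a dedicated metadata platform rather than an in-memory dictionary.

```python
class MetadataStore:
    """In-memory stand-in for a central metadata store."""

    def __init__(self) -> None:
        # dataset name -> merged metadata from every tool that knows about it
        self._entries: dict = {}

    def sync(self, source: str, records: list) -> None:
        """Merge records from one tool into the store, tracking where they came from."""
        for record in records:
            name = record["name"]
            entry = self._entries.setdefault(name, {"name": name, "sources": []})
            entry["sources"].append(source)
            entry.update({k: v for k, v in record.items() if k != "name"})

    def search(self, term: str) -> list:
        """Return sorted dataset names where any metadata value contains the term."""
        term = term.lower()
        return sorted(
            name
            for name, entry in self._entries.items()
            if any(term in str(value).lower() for value in entry.values())
        )


store = MetadataStore()
store.sync(
    "dbt",
    [{"name": "vehicle_listings", "owner": "stock-team",
      "description": "One row per advertised vehicle"}],
)
store.sync(
    "looker",
    [{"name": "vehicle_listings", "dashboard": "Stock overview"},
     {"name": "dealer_leads", "dashboard": "Lead funnel"}],
)

print(store.search("stock"))
print(store.search("looker"))
```

The point of the sketch is the shape of the flow: each tool synchronises what it knows, and a single search answers questions that would otherwise be asked in Slack.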

46 of 60

Automated Mapping: dbt -> DataHub

47 of 60

Where do we go from here?

48 of 60

We need centralised metadata

49 of 60

Centralised Metadata Stores

50 of 60

Standardised Metadata

Specifications for concepts like Ownership, Observability etc

Standards allow the ecosystem to thrive

51 of 60

Emerging Standards

Open-Metadata.org

Open-Lineage.io

52 of 60

Call to Arms

53 of 60

Takeaways

Distributed data teams introduce a new set of challenges

Metadata is a powerful tool for automating solutions to these challenges

As practitioners we can drive change and build this missing component within the modern data stack

54 of 60

Want to know more?

engineering.autotrader.co.uk

We’re hiring

careers.autotrader.co.uk

55 of 60

Black box technology

Highly complex UI-based ETL

Only for the data shamans

Not accessible to generalists - engineers, analysts, scientists

56 of 60

DBT Monolith

57 of 60

Data Shamans

Masters of Kimball

Guardians of the dashboard

Warehouse tinkerers

58 of 60

Data Breadlines

59 of 60

Domain Knowledge

60 of 60

Guide the dbt architecture

Combine technical metadata with business metadata

Prevent internal models from being shared

Ensure datasets meet naming standards
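
A naming-standards check can be as small as a regular expression run over model names. The convention used here (lowercase snake_case with a `stg_`, `int_`, `fct_` or `dim_` prefix) is an assumption chosen for illustration, as are the example names; the slides do not state which standard is enforced.

```python
import re

# Assumed convention: <layer prefix>_<snake_case name>, lowercase only.
NAME_PATTERN = re.compile(r"(stg|int|fct|dim)_[a-z][a-z0-9]*(_[a-z0-9]+)*")


def invalid_names(names: list) -> list:
    """Return the names that do not match the assumed naming convention, in input order."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


candidates = ["fct_vehicle_sales", "VehicleSales", "dim_dealer", "stg__raw", "tmp_export"]
print(invalid_names(candidates))
```

Run against the list of model names in a project, an empty result means every dataset meets the standard; anything else can be reported back to the owning team.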