Building a metadata ecosystem with DBT
Darren Haken
Head Of Engineering
@darrenhaken
Agenda
Data journey to decentralisation
Metadata
Where do we, as a community, go from here?
Data Journey
Auto Trader
Largest automotive marketplace in the UK
Established in 1977
58 million cross-platform visits per month
Auto Trader’s Data Journey
The Data Warehouse Chapter
Centralised data team of ~60 FTE
Hyper-specialised skill set
Classic data warehouse stack
Our centralised data team could not keep up with the demand of the organisation
Data Breadlines
Data Mesh
Data Mesh has given us a vocabulary to share and talk about the next chapter in our data journey
What is the Data Mesh?
Shift from the centralised data lake/data warehouse paradigm to one that draws from modern distributed architecture
Aims to address common failures of the traditional centralised data platform architecture
Unlock analytical data at scale
Data Mesh Principles
Domain Ownership
Data as a Product
Self-Serve Data Platform
Federated Computational Governance
Self-Serve Data Platform
Provide capabilities to serve the organisation
Allow data product development to be accessible by generalists
Eliminate the complexity of working with tools such as data lakes, warehouses and Apache Spark
Serves many roles - engineers, analysts, marketeers, executives, partners
Domain Ownership and Data Products
Decentralise data ownership to the people closest to the data, normally product teams
Apply product thinking to data assets - Data Products
Data Products have customers - data scientists, analysts, data engineers, marketeers etc
What is a Data Product?
Decentralisation
Auto Trader’s journey to the data mesh
Self-Serve Data Platform
Practitioners
What is DBT?
Architecture Using DBT
Explosion in data practitioners
An explosion of users
60 active practitioners across our engineering org of ~ 200
DBT + Data Mesh
Distributing our data across the org has created a new set of challenges
Challenges
Ownership/Maintainers
20% of all DBT models are ownerless
Similar issues exist in other tools (such as Airflow and Spark), but we couldn't get accurate stats to show them
Looker has 74 references to BQ tables/views that are ownerless in DBT
Ownership Problems
Who are the maintainers for model changes?
Who are the stakeholders that depend on this data?
Who resolves data incidents?
A model is broken - who fixes it?
Sharing Data, Responsibly
Teams depending on other teams' data without permission can lead to...
No contracts between teams - unexpected coupling
Multiple sources of truth
Trust/confidence problems
Restricting access is expensive and manual
When sharing data goes wrong
Discoverability
We need a scalable way to overcome these challenges exposed by the Data Mesh
Automation over human processes
Metadata
DBT supports metadata
Metadata as code
Version controlled
Encapsulate model development with metadata
Use metadata in CI/CD
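A minimal sketch of what using metadata in CI/CD could look like. dbt lets a model carry a free-form `meta` dictionary, which is written to the `manifest.json` artifact. The sketch assumes an `owner` key inside `meta` (a hypothetical convention of ours, not something dbt defines) and uses an inline sample in place of a real manifest file. The model names are made up.

```python
import json

# Inline stand-in for a tiny slice of dbt's manifest.json artifact.
# Model names and the `owner` key inside `meta` are hypothetical.
SAMPLE_MANIFEST = json.loads("""
{
  "nodes": {
    "model.shop.stock_daily": {
      "resource_type": "model",
      "name": "stock_daily",
      "meta": {"owner": "stock-team"}
    },
    "model.shop.leads_raw": {
      "resource_type": "model",
      "name": "leads_raw",
      "meta": {}
    },
    "test.shop.not_null_stock_daily_id": {
      "resource_type": "test",
      "name": "not_null_stock_daily_id",
      "meta": {}
    }
  }
}
""")


def owners_by_model(manifest):
    """Map each model name to its declared owner (None when missing)."""
    result = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots and so on
        result[node["name"]] = node.get("meta", {}).get("owner")
    return result


def ownerless_share(manifest):
    """Fraction of models with no owner declared in their metadata."""
    owners = owners_by_model(manifest)
    if not owners:
        return 0.0
    return sum(1 for owner in owners.values() if not owner) / len(owners)
```

Because the metadata lives next to the model code and is version controlled, a CI step could load the real artifact with `json.load` and fail the build when the ownerless share rises.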
Automated Policy
Platform Governance
Policies
All models have an owner
Other teams can only use production models
Only the security team can access PII
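Two of the policies above can be sketched as an automated check. This is an illustration under assumptions, not our actual implementation: the `Model` class, its fields and the `security` team name are hypothetical stand-ins for whatever metadata a platform records.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    # Hypothetical, simplified view of one model's metadata.
    name: str
    owner: str = ""
    contains_pii: bool = False
    readers: list = field(default_factory=list)  # teams granted read access


def policy_violations(models, pii_team="security"):
    """Return readable violations; an empty list means the check passes."""
    violations = []
    for m in models:
        # Policy: all models have an owner.
        if not m.owner:
            violations.append(f"{m.name}: no owner declared")
        # Policy: only the security team can access PII.
        if m.contains_pii:
            extra = sorted(set(m.readers) - {pii_team})
            if extra:
                violations.append(f"{m.name}: PII readable by {', '.join(extra)}")
    return violations


models = [
    Model("stock_daily", owner="stock-team", readers=["marketing"]),
    Model("leads_raw", contains_pii=True, readers=["security"]),
    Model("customers", owner="crm-team", contains_pii=True,
          readers=["security", "marketing"]),
]

for line in policy_violations(models):
    print(line)
```

Running the check on every pull request turns the policy into a failing build rather than a manual review step.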
Automated Policy
Ecosystem
Building the ecosystem
Observability
Discovery
Synchronise into a central metadata store
Aggregate metadata across multiple tools/platforms
Searchable
Reduce tribal knowledge - Slack queries
Automated Mapping: dbt -> DataHub
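A rough sketch of the mapping step, under stated assumptions. It turns one dbt manifest node into a plain record keyed by DataHub-style URN strings. The `owner` key in `meta`, the group-based owner URN and the sample node are hypothetical, and actually sending the record to DataHub is deliberately left out.

```python
def to_catalog_record(node, env="PROD"):
    """Translate one dbt model node into a catalogue-friendly dict.

    `database`, `schema`, `name`, `description` and `meta` are fields of a
    dbt manifest node; the `owner` key inside `meta` is our own assumption.
    """
    # On BigQuery, dbt's `database` is the project and `schema` the dataset.
    table = ".".join([node["database"], node["schema"], node["name"]])
    owner = node.get("meta", {}).get("owner")
    return {
        "dataset_urn": f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{table},{env})",
        "owner_urn": f"urn:li:corpGroup:{owner}" if owner else None,
        "description": node.get("description", ""),
    }


# Hypothetical sample node, not taken from a real project.
sample_node = {
    "database": "analytics-project",
    "schema": "stock",
    "name": "stock_daily",
    "description": "Daily stock levels per retailer",
    "meta": {"owner": "stock-team"},
}

record = to_catalog_record(sample_node)
print(record["dataset_urn"])
```

Keeping the translation a pure function makes it easy to test and to reuse for other sources such as Airflow or Looker.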
Where do we go from here?
We need centralised metadata
Centralised Metadata Stores
Standardised Metadata
Specifications for concepts like Ownership, Observability etc
Standards allow the ecosystem to thrive
Emerging Standards
Open-Metadata.org
Open-Lineage.io
Call to Arms
Takeaways
Distributed data teams introduce a new set of challenges
Metadata is a powerful tool for automating solutions to these problems
As practitioners we can drive change and build this missing component within the modern data stack
Want to know more?
engineering.autotrader.co.uk
We’re hiring
careers.autotrader.co.uk
Black box technology
Highly complex UI based ETL
Only for the data shamans
Not accessible to generalists - engineers, analysts, scientists
DBT Monolith
Data Shamans
Masters of Kimball
Guardians of the dashboard
Warehouse tinkerers
Data Breadlines
Domain Knowledge
Guide the dbt architecture
Combine technical metadata with business metadata
Prevent internal models from being shared
Ensure datasets meet naming standards
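The last two points can be sketched as checks over model metadata. Everything named here is an assumption for illustration: the `team` and `visibility` fields, the `stg_`/`int_`/`fct_`/`dim_` prefixes and the sample models.

```python
import re

# Hypothetical naming standard: a layer prefix followed by snake_case.
NAME_PATTERN = re.compile(r"^(stg|int|fct|dim)_[a-z][a-z0-9_]*$")

# Hypothetical metadata: owning team, visibility and upstream models.
MODELS = {
    "int_stock_cleaned": {
        "team": "stock", "visibility": "internal", "depends_on": [],
    },
    "fct_stock_daily": {
        "team": "stock", "visibility": "production",
        "depends_on": ["int_stock_cleaned"],
    },
    "fct_campaign_reach": {
        "team": "marketing", "visibility": "production",
        "depends_on": ["int_stock_cleaned", "fct_stock_daily"],
    },
    "CampaignTmp": {
        "team": "marketing", "visibility": "internal", "depends_on": [],
    },
}


def badly_named(models):
    """Names that do not match the naming standard."""
    return sorted(name for name in models if not NAME_PATTERN.match(name))


def cross_team_internal_use(models):
    """(consumer, upstream) pairs where a team reads another team's
    non-production model."""
    found = []
    for name, model in models.items():
        for upstream_name in model["depends_on"]:
            upstream = models[upstream_name]
            other_team = upstream["team"] != model["team"]
            if other_team and upstream["visibility"] != "production":
                found.append((name, upstream_name))
    return found


print(badly_named(MODELS))
print(cross_team_internal_use(MODELS))
```

A team may still depend on its own internal models. Only dependencies that cross a team boundary are flagged.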