Gerard Toonstra| Data Discovery with Amundsen | 27-11-2019
Amundsen who?
Decentralization and democratization of data are not productive unless you use tools that allow you to make sense of the plethora of data products you never knew you were going to have.
Data discovery issues
Analysis of price elasticity for MacBooks
Data discovery issues
→ Question #1: Where is the data for sales and product prices by date?
→ Question #2: Is the data current and actively monitored?
→ Question #3: Does the “price” column include or exclude VAT?
→ Question #4: Is it the actual “sales price” or is it list price?
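The four questions above map onto a small amount of metadata per table: where it lives, when it was last updated and by whom, and what each column means. A minimal sketch of such a record follows. The class names, the table name `default.sales_by_date`, the owner and the column description are all hypothetical and only serve as illustration; this is not Amundsen's data model.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ColumnInfo:
    name: str
    description: str  # answers questions 3 and 4 (VAT, sales vs. list price)


@dataclass
class TableInfo:
    name: str           # question 1: where is the data?
    last_updated: date  # question 2: is the data current?
    owner: str          # question 2: who monitors it?
    columns: list = field(default_factory=list)

    def is_current(self, today: date, max_age_days: int = 1) -> bool:
        """True if the table was updated within the allowed age."""
        return today - self.last_updated <= timedelta(days=max_age_days)

    def describe(self, column_name: str) -> str:
        """Return the documented meaning of a column."""
        for col in self.columns:
            if col.name == column_name:
                return col.description
        raise KeyError(column_name)


# Hypothetical example record, invented for this sketch.
sales = TableInfo(
    name="default.sales_by_date",
    last_updated=date(2019, 11, 26),
    owner="pricing-team",
    columns=[ColumnInfo("price", "Actual sales price per unit, excluding VAT")],
)
```

With a record like this, an analyst can check `sales.is_current(date(2019, 11, 27))` and `sales.describe("price")` instead of asking a colleague.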
Data discovery issues
Redesigning, relocating and deprecating data products
Data discovery issues
Finding other users of the same data (inspiration, innovation, sharing)
Side effects
Lots of queries like:
SELECT
  *
FROM
  default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
uncertainty
Increased load and query cost
interruptions
Metadata is key to the next big data wave
It’s all about relationships
ABC of metadata
Application Context
Metadata needed by humans or applications to operate
Behavior
How is data created and used over time?
Change
Change in data over time
Terminology: Ground paper
from Berkeley
What’s out there?
What do we need?
Search based | Lineage based | Network based |
Where is the table/dashboard for X? What does it contain? | I am changing a data model, who are the owners and most common users? | I want to follow a power user in my team. |
Does this analysis already exist? | This table’s delivery was delayed today, I want to notify everyone downstream. | I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. |
Other requirements
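The lineage-based need ("this table's delivery was delayed, I want to notify everyone downstream") comes down to a graph traversal. A minimal sketch, assuming lineage is available as a mapping from each table to the tables built directly from it. The table names, owners and function names below are hypothetical and not part of Amundsen.

```python
from collections import deque

# Hypothetical lineage: table -> tables derived directly from it.
LINEAGE = {
    "raw.orders": ["core.sales"],
    "core.sales": ["mart.price_elasticity", "mart.revenue_dashboard"],
    "mart.price_elasticity": [],
    "mart.revenue_dashboard": [],
}

# Hypothetical owners per table.
OWNERS = {
    "core.sales": "data-platform",
    "mart.price_elasticity": "analytics",
    "mart.revenue_dashboard": "finance",
}


def downstream_tables(lineage, start):
    """Breadth-first walk returning every table reachable from `start`."""
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found


def owners_to_notify(lineage, owners, delayed_table):
    """Owners of all tables downstream of the delayed table, deduplicated."""
    affected = downstream_tables(lineage, delayed_table)
    return sorted({owners[t] for t in affected if t in owners})
```

Calling `owners_to_notify(LINEAGE, OWNERS, "core.sales")` yields the teams to contact when `core.sales` is late.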
What’s out there?
Criteria / Products | Alation | WhereHows | Airbnb Data Portal | Cloudera Navigator | Apache Atlas |
Search based | | | | | |
Lineage based | | | | | |
Network based | | | | | |
Hive/Presto support | | | | | |
Redshift support | | | | | |
Open source (pref.) | | | | | |
Important things for the journey
Amundsen - landing page
Amundsen - search
Amundsen - table detail page
Amundsen - column details
Amundsen - preview
Amundsen - people search
Amundsen - people page
Amundsen services working together
Metadata Sources: Postgres, Hive, Redshift, Presto, Github Source File, ...
Databuilder Crawler
Neo4j
Elastic Search
Metadata Service
Search Service
Frontend Service
Other Microservices: ML Feature Service, Security Service
Relations rather than entities
SQL vs Neo4j queries
SQL | Cypher (Neo4j) |
SELECT firstname FROM person WHERE person.nickname = 'The Dude' | MATCH (p:person) WHERE p.nickname = 'The Dude' RETURN p.firstname |
SELECT firstname, team.name FROM person JOIN team ON person.teamid = team.id WHERE person.nickname = 'The Dude' AND team.sport = 'Bowling' | MATCH (p:person)-[:in]-(t:team) WHERE p.nickname = 'The Dude' AND t.sport = 'Bowling' RETURN p.firstname, t.name |
| MATCH (p:person)-[*]-(p2:person) WHERE p.nickname = 'The Dude' RETURN p.firstname, p2.firstname |
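The SQL column of the comparison can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the schema and the sample rows are assumptions made up for the example, and the Cypher column is not executed here because it needs a running Neo4j instance.

```python
import sqlite3

# In-memory database with an assumed schema matching the queries on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT, sport TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        firstname TEXT,
        nickname TEXT,
        teamid INTEGER REFERENCES team(id)
    );
    -- Sample rows invented for this sketch.
    INSERT INTO team VALUES (1, 'Team A', 'Bowling');
    INSERT INTO person VALUES (1, 'Jeffrey', 'The Dude', 1);
    INSERT INTO person VALUES (2, 'Walter', NULL, 1);
    """
)

# Row 1 of the comparison: filter on a single entity.
single = conn.execute(
    "SELECT firstname FROM person WHERE person.nickname = ?",
    ("The Dude",),
).fetchall()

# Row 2 of the comparison: the relationship is expressed as a JOIN.
joined = conn.execute(
    """
    SELECT firstname, team.name
    FROM person JOIN team ON person.teamid = team.id
    WHERE person.nickname = ? AND team.sport = ?
    """,
    ("The Dude", "Bowling"),
).fetchall()
```

The third Cypher query has no SQL counterpart on the slide: following relationships of arbitrary length in SQL takes a recursive query or one join per hop, whereas Cypher expresses it with `[*]`.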
Elasticsearch for search and relevance
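To make "relevance" concrete, here is a toy ranking in plain Python: count matching query terms and boost the result by how often a table is used. This is an illustrative stand-in, not Elasticsearch's scoring and not Amundsen's actual ranking; the table entries, the `usage` field and the formula are assumptions.

```python
import math

# Hypothetical search documents, one per table.
TABLES = [
    {"name": "core.sales", "description": "daily sales and product prices", "usage": 120},
    {"name": "tmp.sales_copy", "description": "sales backup", "usage": 2},
    {"name": "core.customers", "description": "customer master data", "usage": 80},
]


def score(query, table):
    """Number of query terms found in name or description, boosted by usage."""
    terms = query.lower().split()
    text = (table["name"] + " " + table["description"]).lower()
    matches = sum(1 for term in terms if term in text)
    if matches == 0:
        return 0.0
    # log1p dampens the boost so heavy usage cannot drown out term matches.
    return matches * (1.0 + math.log1p(table["usage"]))


def search(query, tables):
    """Names of matching tables, most relevant first."""
    scored = [(score(query, t), t["name"]) for t in tables]
    hits = [(s, name) for s, name in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in hits]
```

A query such as `search("sales prices", TABLES)` ranks the widely used `core.sales` above the rarely used copy and drops tables that match no term.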
Metadata extractors
Scheduling extractor jobs ⇒ Airflow
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
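What an extractor job does can be sketched without any Amundsen dependency: read table and column metadata from a source and load it into a store. The example below uses SQLite as a stand-in source and a plain dict as a stand-in for the metadata store. The function names are hypothetical and this is not the Databuilder API; in a real setup a job like this would be one task that Airflow runs on a schedule.

```python
import sqlite3


def extract_table_metadata(conn):
    """Yield one record per table with its column names and declared types."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        yield {
            "table": table_name,
            "columns": [{"name": c[1], "type": c[2]} for c in columns],
        }


def load(records, store):
    """Write extracted records into the (stand-in) metadata store."""
    for record in records:
        store[record["table"]] = record["columns"]


# Stand-in source database with one assumed table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (ds TEXT, product_id INTEGER, price REAL)")

catalog = {}
load(extract_table_metadata(source), catalog)
```

After the run, `catalog["sales"]` lists the three columns with their types, which is the kind of record a search or detail page can be built on.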
What to expect...
Roadmap
From Discovery towards Governance
Metadata
Compliance (GDPR/CCPA)
Data Discovery
Downstream impact analysis
. . . . .
Data Quality
Other users
Prominent users
Active community
Where to go for more
https://www.github.com/lyft/amundsen
amundsenworkspace.slack.com #amundsen @ slack
(link on github page)
Gerard Toonstra| Data Discovery with Amundsen | 27-11-2019 | https://www.careersatcoolblue.com
Join at slido.com with #bigdata2019