2 of 59

Amundsen who?

Roald Amundsen (1872-1928)

Norwegian polar explorer

First to reach the South Pole

First to traverse the Northwest Passage by sea

4 of 59

Decentralization and democratization of data is not productive unless you use tools that allow you to make sense of the plethora of data products you never knew you were going to have.

5 of 59

Data discovery issues

Analysis of price elasticity for MacBooks

9 of 59

Data discovery issues

→ Question #1: Where is the data for sales and product prices by date?

→ Question #2: Is the data current and actively monitored?

→ Question #3: Does the “price” column include or exclude VAT?

→ Question #4: Is it the actual “sales price” or is it list price?

10 of 59

Data discovery issues

Redesigning, relocating and deprecating data products

11 of 59

Data discovery issues

Finding other users of the same data (inspiration, innovation, sharing)

12 of 59

Side effects

No way to know & understand trusted data
Created channels & oncalls for data questions

Lots of queries like:

SELECT *
FROM default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;

Does data exist?
Prior work?
Source of truth?
Who owns it?
Who uses it?

Uncertainty
Increased load and query cost
Interruptions

13 of 59

Metadata is key to the next big data wave

Discover new data sources
Identify end users to notify them of changes
Understand the popularity and trustworthiness of data
Network with other people on their data analysis
Investigate/monitor the magnitude of protected data exposure
Know what your boss or colleagues are using
Talk to upstream producers
+30% productivity for data scientists

14 of 59

It’s all about relationships

17 of 59

ABC of metadata

Application Context

Metadata needed by humans or applications to operate

Where is the data?
What are the semantics of the data?

Behavior

How is data created and used over time?

Who’s using the data?
Who created the data?

Change

Change in data over time

How is the data evolving over time?
Evolution of code that generates the data

Terminology: Ground paper from Berkeley
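
One way to picture the three categories is as three groups of fields on a single record per table. The sketch below is illustrative only: the class and field names (`TableMetadata`, `ApplicationContext`, `Behavior`, `Change`) and the sample values are hypothetical and are not taken from Ground or from Amundsen.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApplicationContext:
    """A: where the data lives and what it means."""
    location: str                 # hypothetical URI, e.g. "hive://default.my_table"
    description: str = ""


@dataclass
class Behavior:
    """B: who created the data and who uses it."""
    created_by: str = ""
    used_by: List[str] = field(default_factory=list)


@dataclass
class Change:
    """C: how the data and the code that generates it evolve."""
    schema_versions: List[str] = field(default_factory=list)
    code_revisions: List[str] = field(default_factory=list)


@dataclass
class TableMetadata:
    context: ApplicationContext
    behavior: Behavior = field(default_factory=Behavior)
    change: Change = field(default_factory=Change)


# Hypothetical record for the table queried earlier in the deck.
meta = TableMetadata(ApplicationContext("hive://default.my_table", "Daily sales"))
meta.behavior.used_by.append("analyst_a")
meta.change.schema_versions.append("v1")
```

Keeping the three groups separate makes it easy to fill them from different sources: context from the warehouse catalog, behavior from query logs, change from version control.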

18 of 59

What’s out there?

Vendors - Alation, Collibra
Existing open source projects (e.g. Apache Atlas, Marquez)
LinkedIn’s data portal - WhereHows & DataHub (blog, code)
Twitter’s data discovery (blog)
Netflix’s Metacat (code, blog)
Airbnb’s data portal (blog, video)
BigQuery SQL Web UI & data catalog (blog)
Goods: Organizing Google’s Datasets (paper)
Data Warehousing and Analytics Infrastructure at Facebook (paper)
Ground (RISE Lab): https://rise.cs.berkeley.edu/projects/ground/

22 of 59

What do we need?

Search based
Where is the table/dashboard for X? What does it contain?
Does this analysis already exist?

Lineage based
I am changing a data model; who are the owners and most common users?
This table’s delivery was delayed today; I want to notify everyone downstream.

Network based
I want to follow a power user in my team.
I want to bookmark tables of interest and get a feed of data delays, schema changes and incidents.

Other requirements

Leverage as much data automatically as possible
Preferably, open source and healthy community
API availability
Easy to set up
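
The lineage-based questions above boil down to walking a dependency graph. The following is a minimal sketch, assuming lineage is available as a mapping from each table to the tables built directly from it. The table names, the `LINEAGE` and `PEOPLE` mappings and both function names are hypothetical; this is not how Amundsen stores or queries lineage.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: table -> tables built directly from it.
LINEAGE: Dict[str, List[str]] = {
    "raw.sales": ["core.sales_daily"],
    "core.sales_daily": ["mart.price_elasticity", "mart.revenue"],
    "mart.revenue": ["dash.exec_summary"],
}

# Hypothetical owners and frequent users per table.
PEOPLE: Dict[str, List[str]] = {
    "core.sales_daily": ["alice"],
    "mart.price_elasticity": ["bob"],
    "mart.revenue": ["alice", "carol"],
    "dash.exec_summary": ["dave"],
}


def downstream(table: str) -> List[str]:
    """Return every table reachable from `table`, in breadth-first order."""
    seen: Set[str] = {table}
    order: List[str] = []
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in LINEAGE.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def people_to_notify(table: str) -> List[str]:
    """Collect the distinct people attached to all downstream tables."""
    names = {p for t in downstream(table) for p in PEOPLE.get(t, [])}
    return sorted(names)
```

With this data, a delay on `raw.sales` reaches four downstream tables, and `people_to_notify("raw.sales")` lists everyone attached to them exactly once.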

23 of 59

What’s out there?

Criteria / Products	Alation	WhereHows	Airbnb Data Portal	Cloudera Navigator	Apache Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)

27 of 59

Important things for the journey

Trust embodied in the solution itself

Sort results in order of relevance and popularity

Very low amount of manual curation

Automated curation as much as possible

Slight preference for open source

Driving the OS community forward with this new tool

Targeted user experience not available in other tools
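
To make "relevance and popularity" concrete, here is one possible way to combine the two signals. It is a sketch under assumptions, not Amundsen's ranking formula: the `Hit` fields, the log-damped boost in `score` and the sample numbers are all hypothetical.

```python
import math
from typing import List, NamedTuple


class Hit(NamedTuple):
    table: str
    relevance: float   # text-match score from the search engine (assumed 0..1)
    reads_30d: int     # how often the table was queried recently (assumed signal)


def score(hit: Hit) -> float:
    """Scale relevance by a log-damped popularity boost."""
    return hit.relevance * (1.0 + math.log1p(hit.reads_30d))


def rank(hits: List[Hit]) -> List[str]:
    """Return table names, best match first."""
    return [h.table for h in sorted(hits, key=score, reverse=True)]


hits = [
    Hit("tmp.sales_copy", relevance=1.0, reads_30d=0),
    Hit("core.sales", relevance=0.9, reads_30d=500),
    Hit("core.users", relevance=0.1, reads_30d=5000),
]
```

The logarithm keeps very popular but barely relevant tables from dominating, while a heavily used table still outranks an unused copy with a slightly better text match.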

29 of 59

Amundsen - landing page

30 of 59

Amundsen - search

31 of 59

Amundsen - table detail page

32 of 59

Amundsen - column details

33 of 59

Amundsen - preview

34 of 59

Amundsen - people search

35 of 59

Amundsen - people page

37 of 59

Amundsen services working together

Postgres

Hive

Redshift

...

Presto

Github Source File

Databuilder Crawler

Neo4j

Elastic

Metadata Service

Search Service

Frontend Service

ML Feature Service

Security Service

Other Microservices

Metadata Sources

42 of 59

Relations rather than entities

45 of 59

SQL vs Neo4j queries

SQL	Cypher (Neo4j)
SELECT firstname FROM person WHERE person.nickname = 'The Dude'	MATCH (p:person) WHERE p.nickname = 'The Dude' RETURN p.firstname
SELECT firstname, team.name FROM person JOIN team ON person.teamid = team.id WHERE person.nickname = 'The Dude' AND team.sport = 'Bowling'	MATCH (p:person)-[:in]-(t:team) WHERE p.nickname = 'The Dude' AND t.sport = 'Bowling' RETURN p.firstname, t.name
	MATCH (p:person)-[*]-(p2:person) WHERE p.nickname = 'The Dude' RETURN p.firstname, p2.firstname
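
The last Cypher query has no SQL counterpart on the slide because `-[*]-` follows relationships of any length. The sketch below shows roughly what that traversal does, using a small in-memory graph in Python. The node ids, names and edges are hypothetical, and unlike Cypher, which returns one row per matching path, this version returns each connected person once.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical nodes: id -> (label, name).
NODES: Dict[str, Tuple[str, str]] = {
    "p1": ("person", "Ann"),
    "p2": ("person", "Ben"),
    "p3": ("person", "Cho"),
    "p4": ("person", "Dee"),
    "t1": ("team", "Bowlers"),
}

# Hypothetical relationships; direction is ignored, as in -[*]-.
EDGES: List[Tuple[str, str]] = [("p1", "t1"), ("p2", "t1"), ("p2", "p3")]


def connected_people(start: str) -> List[str]:
    """Names of every other person reachable from `start` by any path."""
    neighbours: Dict[str, Set[str]] = {n: set() for n in NODES}
    for a, b in EDGES:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen: Set[str] = {start}
    queue = deque([start])
    found: List[str] = []
    while queue:
        for nxt in sorted(neighbours[queue.popleft()]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                if NODES[nxt][0] == "person":
                    found.append(NODES[nxt][1])
    return found
```

Starting from `p1`, the walk passes through the team node to reach Ben and then Cho; Dee has no relationships and is never found.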

46 of 59

Elasticsearch for search and relevance

Normal search: match records based on relevancy

Category search: match records first based on data type, then relevancy, e.g. column: warehouse_cost

Wildcard search: e.g. event_*
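
The three query styles can be told apart by their syntax alone. The sketch below is a stand-in written with the Python standard library, not Elasticsearch: it does no relevance scoring, and the `INDEX` contents and the `search` function are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical index: table name -> column names.
INDEX: Dict[str, List[str]] = {
    "event_clicks": ["user_id", "ts"],
    "event_views": ["user_id", "ts"],
    "warehouse_usage": ["warehouse_cost", "ds"],
}


def search(query: str) -> List[str]:
    """Return matching table names for the three query styles on the slide."""
    if ":" in query:
        # Category search, e.g. "column: warehouse_cost".
        category, term = (part.strip() for part in query.split(":", 1))
        if category == "column":
            return sorted(t for t, cols in INDEX.items() if term in cols)
        return []
    if "*" in query:
        # Wildcard search, e.g. "event_*".
        return sorted(t for t in INDEX if fnmatchcase(t, query))
    # Normal search: plain substring match stands in for relevance ranking.
    return sorted(t for t in INDEX if query in t)
```

For example, `search("event_*")` returns both event tables, while `search("column: warehouse_cost")` returns only the table that has that column.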

47 of 59

Metadata extractors

48 of 59

Scheduling extractor jobs ⇒ Airflow

Amundsen uses Apache Airflow to orchestrate Databuilder jobs
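
What an orchestrator adds is dependency ordering: extraction has to finish before anything is loaded or indexed. The sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+) instead of Airflow itself, so it runs without an Airflow installation. The task names and the dependency layout are hypothetical and are not Amundsen's actual jobs.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

log: List[str] = []

# Hypothetical tasks; each one just records that it ran.
TASKS: Dict[str, Callable[[], None]] = {
    "extract_hive": lambda: log.append("extract_hive"),
    "extract_postgres": lambda: log.append("extract_postgres"),
    "load_graph": lambda: log.append("load_graph"),
    "update_search_index": lambda: log.append("update_search_index"),
}

# Task -> set of tasks that must finish first.
DEPENDS_ON: Dict[str, Set[str]] = {
    "load_graph": {"extract_hive", "extract_postgres"},
    "update_search_index": {"load_graph"},
}


def run_all() -> List[str]:
    """Run every task once, never before its dependencies."""
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        TASKS[name]()
    return log
```

The two extract tasks have no dependency on each other, so their relative order is not fixed; a real scheduler could run them in parallel.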

53 of 59

What to expect...

Customizing UI (logo, colors, etc.)
Security / OpenID Connect for authentication
Look at available extractors:

Hive, BigQuery, Athena, Snowflake, Postgres

Build your DAGs

55 of 59

Roadmap

UI/UX redesign
Email notifications system
Indexing dashboards
Lineage integration
Granular access level control
More metadata sources
Indexing services and teams

56 of 59

From Discovery towards Governance

Metadata

Compliance (GDPR/CCPA)

Data Discovery

Downstream impact analysis

. . . . .

Data Quality

57 of 59

Other users

Prominent users

Active community

58 of 59

Where to go for more

https://www.github.com/lyft/amundsen

amundsenworkspace.slack.com #amundsen @ slack

(link on github page)
