1 of 59

Gerard Toonstra | Data Discovery with Amundsen | 27-11-2019

2 of 59

Amundsen who?

  • Roald Amundsen (1872-1928)

  • Norwegian polar explorer

  • First to reach the South Pole

  • First to traverse the Northwest Passage by sea

3 of 59

4 of 59

Decentralization and democratization of data is not productive unless you use tools that allow you to make sense of the plethora of data products you never knew you were going to have.

5 of 59

Data discovery issues

Analysis of price elasticity for MacBooks

9 of 59

Data discovery issues

→ Question #1: Where is the data for sales and product prices by date?

→ Question #2: Is the data current and actively monitored?

→ Question #3: Does the “price” column include or exclude VAT?

→ Question #4: Is it the actual “sales price” or is it list price?

10 of 59

Data discovery issues

Redesigning, relocating and deprecating data products

11 of 59

Data discovery issues

Finding other users of the same data (inspiration, innovation, sharing)

12 of 59

Side effects

  • No way to know & understand trusted data
  • Created channels & oncalls for data questions

Lots of queries like:

SELECT *
FROM default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;

  • Does data exist?
  • Prior work?
  • Source of truth?
  • Who owns it?
  • Who uses it?

uncertainty

Increased load and query cost

interruptions

13 of 59

Metadata is key to the next big data wave

  • Discover new data sources
  • Identify end users to notify them of changes
  • Understand the popularity and trustworthiness of data
  • Network with other people on their data analysis
  • Investigate/monitor the magnitude of protected data exposure
  • Know what your boss or colleagues are using
  • Talk to upstream producers
  • +30% productivity for data scientists

14 of 59

It’s all about relationships

17 of 59

ABC of metadata

Application Context

Metadata needed by humans or applications to operate

  • Where is the data?
  • What are the semantics of the data?

Behavior

How is data created and used over time?

  • Who’s using the data?
  • Who created the data?

Change

Change in data over time

  • How is the data evolving over time?
  • Evolution of code that generates the data

Terminology: Ground paper from Berkeley

18 of 59

What’s out there?

  • Vendors - Alation, Collibra
  • Existing open source projects (e.g. Apache Atlas, Marquez)
  • LinkedIn’s data portal - WhereHows & DataHub (blog, code)
  • Twitter’s data discovery (blog)
  • Netflix’s Metacat (code, blog)
  • Airbnb’s data portal (blog, video)
  • BigQuery SQL Web UI & data catalog (blog)
  • Goods: Organizing Google’s Datasets (paper)
  • Data Warehousing and Analytics Infrastructure at Facebook (paper)
  • Ground (RISE Lab): https://rise.cs.berkeley.edu/projects/ground/

19 of 59

22 of 59

What do we need?

Search based

  • Where is the table/dashboard for X?
  • What does it contain?
  • Does this analysis already exist?

Lineage based

  • I am changing a data model; who are the owners and most common users?
  • This table’s delivery was delayed today; I want to notify everyone downstream.

Network based

  • I want to follow a power user in my team.
  • I want to bookmark tables of interest and get a feed of data delays, schema changes and incidents.

Other requirements

  • Leverage as much data automatically as possible
  • Preferably, open source and healthy community
  • API availability
  • Easy to set up
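
The lineage-based requirement ("notify everyone downstream") amounts to a graph traversal. A minimal sketch, assuming lineage is available as a simple table-to-table mapping; the table names, owners and both helper functions are hypothetical and not part of Amundsen:

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables built directly from it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.sales_by_date", "mart.revenue"],
    "mart.sales_by_date": ["dashboard.price_elasticity"],
    "mart.revenue": [],
    "dashboard.price_elasticity": [],
}

# Hypothetical ownership metadata.
OWNERS = {
    "staging.orders": "data-eng",
    "mart.sales_by_date": "analytics",
    "mart.revenue": "finance",
    "dashboard.price_elasticity": "pricing",
}


def downstream_of(table):
    """Return every table reachable from `table`, in breadth-first order."""
    seen = []
    queue = deque(DOWNSTREAM.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(DOWNSTREAM.get(current, []))
    return seen


def owners_to_notify(table):
    """Owners of all downstream tables, deduplicated and sorted."""
    return sorted({OWNERS[t] for t in downstream_of(table) if t in OWNERS})


delayed = "raw.orders"
print(downstream_of(delayed))
print(owners_to_notify(delayed))
```

In a real deployment the edges would come from the metadata store rather than a hard-coded dictionary.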

23 of 59

What’s out there?

Criteria / Products

Products compared: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas

Criteria: search based, lineage based, network based, Hive/Presto support, Redshift support, open source (pref.)

27 of 59

Important things for the journey

  • Trust embodied in the solution itself
    • Sort results in order of relevance and popularity
  • Very low amount of manual curation
    • Automated curation as much as possible
  • Slight preference for open source
    • Driving the OS community forward with this new tool

  • Targeted user experience not available in other tools
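
One way to read "sort results in order of relevance and popularity" is as a blended score. The sketch below is an assumption for illustration only: the `SearchHit` fields, the log scaling and the weight are invented here and do not describe Amundsen's actual ranking.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchHit:
    table: str
    relevance: float      # text-match score from the search engine (assumed 0..1)
    monthly_queries: int  # usage count, used here as a popularity signal


def score(hit, popularity_weight=0.5):
    """Blend relevance with log-scaled popularity; the weight is an arbitrary choice."""
    popularity = math.log10(1 + hit.monthly_queries)
    return hit.relevance + popularity_weight * popularity


def rank(hits):
    """Order hits from highest to lowest blended score."""
    return sorted(hits, key=score, reverse=True)


hits = [
    SearchHit("sandbox.sales_copy", relevance=0.9, monthly_queries=0),
    SearchHit("mart.sales_by_date", relevance=0.8, monthly_queries=999),
    SearchHit("staging.sales_raw", relevance=0.7, monthly_queries=9),
]
for hit in rank(hits):
    print(hit.table, round(score(hit), 2))
```

With these made-up numbers the heavily used table outranks the unused copy even though the copy matches the text slightly better, which is the kind of built-in trust signal the slide asks for.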

28 of 59

29 of 59

Amundsen - landing page

30 of 59

Amundsen - search

31 of 59

Amundsen - table detail page

32 of 59

Amundsen - column details

33 of 59

Amundsen - preview

34 of 59

Amundsen - people search

35 of 59

Amundsen - people page

36 of 59

37 of 59

Amundsen services working together

Metadata Sources: Postgres, Hive, Redshift, Presto, Github Source File, ...

Databuilder Crawler

Neo4j, Elasticsearch

Metadata Service, Search Service

Frontend Service

Other Microservices: ML Feature Service, Security Service

42 of 59

Relations rather than entities

45 of 59

SQL vs Neo4j queries

SQL:

SELECT firstname
FROM person
WHERE person.nickname = 'The Dude'

Cypher (Neo4j):

MATCH (p:person)
WHERE p.nickname = 'The Dude'
RETURN p.firstname

SQL:

SELECT firstname, team.name
FROM person JOIN team ON person.teamid = team.id
WHERE person.nickname = 'The Dude'
AND team.sport = 'Bowling'

Cypher (Neo4j):

MATCH (p:person)-[:in]-(t:team)
WHERE p.nickname = 'The Dude'
AND t.sport = 'Bowling'
RETURN p.firstname, t.name

Cypher (Neo4j), variable-length path:

MATCH (p:person)-[*]-(p2:person)
WHERE p.nickname = 'The Dude'
RETURN p.firstname, p2.firstname
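
To make the relationship pattern concrete, here is a rough Python sketch of what the person-team Cypher query above asks for, run against a tiny in-memory graph. The node data and the `match_person_team` helper are made up for illustration; this is not how Neo4j executes queries.

```python
# Hypothetical nodes, keyed by id, each with a label and properties.
NODES = {
    1: {"label": "person", "firstname": "Jeffrey", "nickname": "The Dude"},
    2: {"label": "person", "firstname": "Walter", "nickname": "Walter"},
    3: {"label": "team", "name": "The Rollers", "sport": "Bowling"},
    4: {"label": "team", "name": "The Kickers", "sport": "Soccer"},
}

# Hypothetical relationships as (start node, type, end node).
RELATIONSHIPS = [
    (1, "in", 3),
    (2, "in", 3),
    (1, "in", 4),
]


def match_person_team(nickname, sport):
    """Mimic MATCH (p:person)-[:in]-(t:team) with the two WHERE filters."""
    rows = []
    for start, rel_type, end in RELATIONSHIPS:
        if rel_type != "in":
            continue
        # The pattern is undirected, so try the relationship in both directions.
        for a, b in ((start, end), (end, start)):
            p, t = NODES[a], NODES[b]
            if p["label"] == "person" and t["label"] == "team":
                if p["nickname"] == nickname and t["sport"] == sport:
                    rows.append((p["firstname"], t["name"]))
    return rows


print(match_person_team("The Dude", "Bowling"))
```

The relationship is stored as data in its own right, so no join key such as `teamid` is needed on the person record.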

46 of 59

Elasticsearch for search and relevance

  • Normal search: match records based on relevancy

  • Category search: match records first based on data type, then relevancy
    • column: warehouse_cost

  • Wildcard search:
    • event_*
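
The three search modes can be illustrated without a search cluster. This is a simplified sketch under stated assumptions: the `INDEX` contents and the `search` function are hypothetical, matching is a plain substring or wildcard test, and relevance scoring (which Elasticsearch would provide) is not modelled.

```python
from fnmatch import fnmatchcase

# Hypothetical index entries: (resource type, name).
INDEX = [
    ("table", "event_clicks"),
    ("table", "event_orders"),
    ("table", "warehouse"),
    ("column", "warehouse_cost"),
    ("column", "event_id"),
]


def search(query):
    """Return matching (type, name) pairs for a normal, category, or wildcard query."""
    category = None
    term = query.strip()
    if ":" in term:
        # Category search, e.g. "column: warehouse_cost".
        category, term = (part.strip() for part in term.split(":", 1))
    results = []
    for kind, name in INDEX:
        if category is not None and kind != category:
            continue
        if "*" in term:
            matched = fnmatchcase(name, term)  # wildcard search
        else:
            matched = term in name             # naive substring match
        if matched:
            results.append((kind, name))
    return results


print(search("warehouse"))
print(search("column: warehouse_cost"))
print(search("event_*"))
```

Here the category prefix acts as a hard filter on resource type, which is a simplification of "match first on data type, then relevancy".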

47 of 59

Metadata extractors
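
As a rough idea of what a metadata extractor does, the sketch below reads table and column metadata from a database's own catalog. It uses an in-memory SQLite database as a stand-in for a real source such as Postgres or Hive; `extract_column_metadata` and the record layout are assumptions for illustration, not the Databuilder API.

```python
import sqlite3


def extract_column_metadata(conn):
    """Yield one record per column, read from SQLite's own catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        for cid, col_name, col_type, *_ in conn.execute(
            f"PRAGMA table_info('{table_name}')"
        ):
            yield {
                "table": table_name,
                "column": col_name,
                "type": col_type,
                "position": cid,
            }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, product_id INTEGER, price REAL)")
records = list(extract_column_metadata(conn))
for record in records:
    print(record)
```

Records like these would then be loaded into the graph store and the search index by later stages of the pipeline.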

48 of 59

Scheduling extractor jobs ⇒ Airflow

Amundsen uses Apache Airflow to orchestrate Databuilder jobs

49 of 59

53 of 59

What to expect...

  • Customizing UI (logo, colors, etc.)
  • Security / OpenID Connect for authentication
  • Look at available extractors:
    • Hive, BigQuery, Athena, Snowflake, Postgres
  • Build your DAGs

54 of 59

55 of 59

Roadmap

  • UI/UX redesign
  • Email notifications system
  • Indexing dashboards
  • Lineage integration
  • Granular access level control
  • More metadata sources
  • Indexing services and teams

56 of 59

From Discovery towards Governance

Metadata

Compliance (GDPR/CCPA)

Data Discovery

Downstream impact analysis

. . . . .

Data Quality

57 of 59

Other users

Prominent users

Active community

58 of 59

Where to go for more

https://www.github.com/lyft/amundsen

amundsenworkspace.slack.com #amundsen @ slack

(link on github page)

59 of 59

Gerard Toonstra | Data Discovery with Amundsen | 27-11-2019 | https://www.careersatcoolblue.com

Join at slido.com with #bigdata2019