1 of 36

Towards Enterprise Grade Data Discovery and Data Lineage

with Apache Atlas and Amundsen

2 of 36

Source: Kan Nishida

3 of 36

Need for Data Discovery and Lineage

4 of 36

Data Discovery

5 of 36

We build algorithmic 10x Data-Driven products that benefit our clients and generate revenue for the bank

Wholesale Banking Advanced Analytics

6 of 36

Data-Driven Decision Making Process

Step 1: Search and find the data

Step 2: Understand the data

Step 3: Perform analysis and visualization

Step 4: Make a decision and/or share insights

Data Discovery

7 of 36

Challenge: Search and Find the Data

Ask a friend, your coworker or manager
Ask in a wider Slack Channel
Or simply search in the Git repositories

8 of 36

Challenge: Understand the Data

Multiple results: which one is correct or up to date?
What do the different columns mean?
Dig deeper: explore using SELECT * SQL queries.

9 of 36

Data scientists spend up to one-third of their time on data discovery.

10 of 36

Horizons of Data Discovery

	Search Based	Where is the data? What does it contain?
	Lineage Based	Which datasets are linked? Upstream/Downstream
	Network Based	Who are the frequent users? Which tables does my team use?
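
The lineage-based horizon (upstream/downstream) can be illustrated with a small graph walk. This is a minimal sketch, not how Amundsen or Atlas implement lineage: the dataset names and the `DOWNSTREAM` edge map are hypothetical, and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets derived from it.
DOWNSTREAM = {
    "raw.payments": ["staging.payments"],
    "staging.payments": ["mart.daily_revenue", "mart.fraud_features"],
    "mart.daily_revenue": [],
    "mart.fraud_features": [],
}


def invert(edges):
    """Flip the edge direction so the same walk can answer upstream questions."""
    upstream = {node: [] for node in edges}
    for source, targets in edges.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def reachable(start, edges):
    """Return every dataset reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


downstream_of_raw = reachable("raw.payments", DOWNSTREAM)
upstream_of_revenue = reachable("mart.daily_revenue", invert(DOWNSTREAM))
```

Walking the original map answers "what breaks if this table changes?", while walking the inverted map answers "where does this table come from?".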

11 of 36

First person to reach both the North and South Poles.

Norwegian explorer, Roald Amundsen

12 of 36

Landing Page

Optimized for search

Popularity score = number of distinct readers * log(total number of reads)
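
The formula above can be sketched in a few lines. The slide does not state the logarithm base, so the natural log is assumed here; the function name, the zero guard, and the sample tables are hypothetical.

```python
import math


def popularity_score(distinct_readers: int, total_reads: int) -> float:
    """Popularity score = distinct readers * log(total reads).

    Assumption: natural logarithm; the base is not given in the slide.
    """
    if distinct_readers <= 0 or total_reads <= 0:
        return 0.0  # guard: log is undefined for zero or negative reads
    return distinct_readers * math.log(total_reads)


# Hypothetical tables: (distinct readers, total reads)
tables = {
    "payments.transactions": (40, 1000),
    "tmp.scratch": (1, 5000),
}

# Most popular first
ranked = sorted(tables, key=lambda name: popularity_score(*tables[name]), reverse=True)
```

Multiplying by distinct readers while taking the log of total reads means a table read heavily by a single user or job ranks below one that many people read moderately.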

13 of 36

Search Results

Ranked on Relevance and Popularity

14 of 36

Relevance - search for “apple” on Google

Low relevance

High relevance

15 of 36

Popularity - search for “apple” on Google

Low popularity

High popularity

16 of 36

Search Results - Striking the balance

Relevance	Popularity
Names, Description, Tags, [Owners, Frequent users]	Querying activity
Different weights for different metadata, e.g., resource name	Lower weight for automated querying
	Higher weight for ad-hoc querying
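
One way to picture the balance between relevance and popularity is a combined score. This is a rough sketch under stated assumptions, not Amundsen's actual ranking: the field weights, the ad-hoc and automated weights, the substring match, and the multiplicative combination are all hypothetical choices.

```python
import math

# Hypothetical weights: a match on the resource name counts more than one in the description.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}
ADHOC_WEIGHT = 1.0       # ad-hoc queries signal human interest
AUTOMATED_WEIGHT = 0.1   # scheduled jobs are down-weighted


def relevance(query, resource):
    """Sum the weights of every metadata field that contains the query string."""
    needle = query.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if needle in resource.get(field, "").lower()
    )


def popularity(resource):
    """Log-damped querying activity, with automated reads weighted lower."""
    weighted_reads = (
        ADHOC_WEIGHT * resource.get("adhoc_reads", 0)
        + AUTOMATED_WEIGHT * resource.get("automated_reads", 0)
    )
    return math.log1p(weighted_reads)


def rank(query, resources):
    """Return names of matching resources, best first."""
    scored = []
    for resource in resources:
        rel = relevance(query, resource)
        if rel > 0:  # drop resources that do not match at all
            scored.append((rel * (1.0 + popularity(resource)), resource["name"]))
    return [name for _, name in sorted(scored, reverse=True)]
```

With these weights, a table named `customer_orders` with modest ad-hoc usage outranks a staging table that only mentions "orders" in its description, even if the staging table is hit thousands of times by scheduled jobs.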

17 of 36

View Resource Metadata

Table Detail Page

18 of 36

Column Details

Computed Metadata Statistics

19 of 36

Amundsen Architecture

Metadata Sources (Hive, Spark, S3, etc.)

Search Service

Metadata Service

Frontend Service

Other Microservices

Apache Atlas

Databuilder Ingestion Framework

Elasticsearch

Neo4j

20 of 36

Data Lineage and Metadata

21 of 36

Why Apache Atlas?

Open source
Metadata types & instances

Hooks and Bridges

Out of the box Lineage Support
Classification / Tags
Powerful Graph Engine
REST APIs
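
As an illustration of the REST APIs and lineage support listed above, the sketch below builds a request URL for the Atlas v2 lineage endpoint (`/api/atlas/v2/lineage/{guid}` with `direction` and `depth` parameters). The host, port, and GUID are placeholders, the helper name is hypothetical, and no request is sent; authentication and the response format are out of scope here.

```python
from urllib.parse import quote, urlencode

VALID_DIRECTIONS = {"INPUT", "OUTPUT", "BOTH"}


def atlas_lineage_url(base_url, guid, direction="BOTH", depth=3):
    """Build the URL for an entity's lineage in Apache Atlas (v2 REST API).

    `base_url` and `guid` are supplied by the caller; the values used
    in the example below are placeholders, not a real deployment.
    """
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(VALID_DIRECTIONS)}")
    query = urlencode({"direction": direction, "depth": depth})
    return f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{quote(guid, safe='')}?{query}"


# Placeholder host and GUID, for illustration only
url = atlas_lineage_url("http://atlas.example.com:21000", "abc-123", direction="INPUT", depth=2)
```

`INPUT` asks for upstream lineage, `OUTPUT` for downstream, and `BOTH` for the full neighbourhood up to the given depth.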

22 of 36

Data Governance Overview

23 of 36

Data Lineage and Metadata

24 of 36

Schema Management - AVRO Based

25 of 36

Data Lineage and Metadata

26 of 36

Table Popularity Score

27 of 36

28 of 36

Data Ecosystem at ING

29 of 36

What’s Next?

30 of 36

Depth of Metadata

31 of 36

Roadmap

Table Versions and Partitions
Search Improvements (Relevancy)
Lineage Integration within Amundsen
Data Security & Compliance

Tag-based policies (Apache Ranger with Apache Atlas)
Role-Based Access Control

32 of 36

Community

We are open source advocates!

33 of 36

Prominent Users

Active Community

34 of 36

Working with Lyft since the inception of the open source journey of Amundsen

First company outside Lyft Engineering to deploy Amundsen in production.

35 of 36

Our Team

36 of 36

Marek Wiewiórka
Big Data Architect (GetInData)
linkedin.com/in/marekwiewiorka
github.com/mwiewior

Verdan Mahmood
Software Engineer (ING)
linkedin.com/in/verdan
github.com/verdan

Thanks!