Towards Enterprise Grade Data Discovery and Data Lineage
with Apache Atlas and Amundsen
Source: Kan Nishida
Need for Data Discovery and Lineage
Data Discovery
We build algorithmic 10x Data-Driven products that benefit our clients and generate revenue for the bank
Wholesale Banking Advanced Analytics
Data-Driven Decision Making Process
Step 1: Search and find the data
Step 2: Understand the data
Step 3: Perform analysis and visualization
Step 4: Make a decision and/or share insights
Data Discovery
Challenge: Search and Find the Data
Challenge: Understand the Data
Data scientists spend up to one third of their time on data discovery.
Horizons of Data Discovery
Search Based
Lineage Based
Network Based
First person to explore both North and South poles.
Norwegian explorer, Roald Amundsen
Landing Page
Optimized for search
Popularity score = number of distinct readers * log(total number of reads)
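The popularity formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the Amundsen implementation: the function name `popularity_score` is hypothetical, the slide does not state the logarithm base (natural log is assumed here), and returning zero for tables with no recorded reads is an added assumption.

```python
import math


def popularity_score(distinct_readers: int, total_reads: int) -> float:
    """Popularity = distinct readers * log(total reads).

    Assumptions (not stated on the slide): natural logarithm, and a
    score of 0.0 when there are no readers or no reads.
    """
    if distinct_readers <= 0 or total_reads <= 0:
        return 0.0
    return distinct_readers * math.log(total_reads)


if __name__ == "__main__":
    # A table read 100 times by 10 people outranks one read 1000 times by 1 person.
    print(popularity_score(10, 100))
    print(popularity_score(1, 1000))
```

Multiplying by the number of distinct readers rewards tables used by many people, while the logarithm dampens the effect of a single heavy reader, such as a scheduled job, issuing a very large number of reads.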
Search Results
Ranked on Relevance and Popularity
Relevance - search for “apple” on Google
Low relevance
High relevance
Popularity - search for “apple” on Google
Low popularity
High popularity
Search Results - Striking the balance
Relevance | Popularity
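One way to picture the balance between relevance and popularity is a combined score that boosts the text-match relevance by a dampened popularity term. The sketch below is an assumption for illustration only: the `SearchHit` type, the `combined_score` formula, and the `popularity_weight` parameter are hypothetical and are not taken from Amundsen's search service.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SearchHit:
    name: str
    relevance: float   # text-match score from the search engine (assumed >= 0)
    popularity: float  # popularity score of the table (assumed >= 0)


def combined_score(hit: SearchHit, popularity_weight: float = 1.0) -> float:
    """Hypothetical blend: relevance boosted by log-dampened popularity."""
    return hit.relevance * (1.0 + popularity_weight * math.log1p(hit.popularity))


def rank(hits: List[SearchHit], popularity_weight: float = 1.0) -> List[SearchHit]:
    """Return hits ordered from highest to lowest combined score."""
    return sorted(hits, key=lambda h: combined_score(h, popularity_weight), reverse=True)


if __name__ == "__main__":
    hits = [
        SearchHit("exact_match_unused", relevance=2.0, popularity=0.0),
        SearchHit("good_match_popular", relevance=1.5, popularity=50.0),
        SearchHit("weak_match_very_popular", relevance=0.1, popularity=1000.0),
    ]
    for hit in rank(hits):
        print(hit.name, round(combined_score(hit), 2))
```

With `popularity_weight=0` the ranking falls back to pure relevance. Larger weights favor widely used tables, but because popularity multiplies relevance, a very popular table with a weak text match still does not dominate the results.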
View Resource Metadata
Table Detail Page
Column Details
Computed Metadata Statistics
Amundsen Architecture
Metadata Sources (Hive, Spark, S3, etc.)
Search Service
Metadata Service
Frontend Service
Other Microservices
Apache Atlas
Databuilder Ingestion Framework
Elasticsearch
Neo4j
Data Lineage and Metadata
Why Apache Atlas?
Data Governance Overview
Data Lineage and Metadata
Schema Management - AVRO Based
Data Lineage and Metadata
Table Popularity Score
Data Ecosystem at ING
What’s Next?
Depth of Metadata
Roadmap
Community
We are open source advocates!
Prominent Users
Active Community
Working with Lyft since the inception of the open source journey of Amundsen
First company outside Lyft Engineering to deploy Amundsen in production.
Our Team
Thanks!