1 of 36

Towards Enterprise Grade Data Discovery and Data Lineage

with Apache Atlas and Amundsen

2 of 36

Source: Kan Nishida

3 of 36

Need for Data Discovery and Lineage

4 of 36

Data Discovery

5 of 36

We build algorithmic 10x Data-Driven products that benefit our clients and generate revenue for the bank

Wholesale Banking Advanced Analytics

6 of 36

Data-Driven Decision Making Process

Step 1: Search and find the data

Step 2: Understand the data

Step 3: Perform analysis and visualization

Step 4: Make a decision and/or share insights

Data Discovery

7 of 36

Challenge: Search and Find the Data

  • Ask a friend, your coworker or manager
  • Ask in a wider Slack Channel
  • Or simply search in the Git repositories

8 of 36

Challenge: Understand the Data

  • Multiple results: which one is correct or up to date?
  • What do the different columns mean?
  • Dig deeper: explore using SELECT * SQL queries.

9 of 36

Data scientists spend up to one third of their time on data discovery.

10 of 36

Horizons of Data Discovery

Search Based

  • Where is the data?
  • What does it contain?

Lineage Based

  • What datasets are linked?
  • Upstream/Downstream

Network Based

  • Who are the frequent users?
  • What tables does my team use?
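
The lineage-based questions ("What datasets are linked?", "Upstream/Downstream") amount to walking a directed graph of datasets. The sketch below is only an illustration of that idea: the dataset names and the `EDGES` mapping are hypothetical, and it does not show how Apache Atlas or Amundsen store or query lineage.

```python
from collections import deque

# Hypothetical lineage edges: source dataset -> datasets derived from it.
EDGES = {
    "raw.transactions": ["staging.transactions"],
    "staging.transactions": ["mart.daily_revenue", "mart.client_activity"],
    "raw.clients": ["mart.client_activity"],
}


def downstream(dataset, edges):
    """Return every dataset reachable from `dataset` (breadth-first)."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(dataset, edges):
    """Return every dataset that `dataset` depends on, directly or not."""
    # Reverse the edges, then reuse the same traversal.
    reverse = {}
    for source, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return downstream(dataset, reverse)
```

Calling `downstream("raw.transactions", EDGES)` answers "what breaks if this table changes?", while `upstream("mart.client_activity", EDGES)` answers "where does this table come from?".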

11 of 36

First person to explore both North and South poles.

Norwegian explorer, Roald Amundsen

12 of 36

Landing Page

Optimized for search

Popularity score = number of distinct readers * log(total number of reads)
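
As a minimal sketch of the formula above: the slide does not state the base of the logarithm, so the natural log is assumed here, and the function name and zero-handling are illustrative choices rather than Amundsen's actual code.

```python
import math


def popularity_score(distinct_readers: int, total_reads: int) -> float:
    """Popularity = number of distinct readers * log(total number of reads).

    Assumption: natural logarithm; the slide does not specify the base.
    """
    # A table nobody has read gets a score of zero instead of a math error.
    if distinct_readers <= 0 or total_reads <= 0:
        return 0.0
    return distinct_readers * math.log(total_reads)
```

The log term dampens the effect of raw read volume, so a table read many times by one scheduled job scores lower than a table read moderately often by many different people.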

13 of 36

Search Results

Ranked on Relevance and Popularity

14 of 36

Relevance - search for “apple” on Google

Low relevance

High relevance

15 of 36

Popularity - search for “apple” on Google

Low popularity

High popularity

16 of 36

Search Results - Striking the balance

Relevance

  • Names, Description, Tags, [Owners, Frequent users]
  • Different weights for different metadata, e.g., resource name

Popularity

  • Querying activity
    • Lower weight for automated querying
    • Higher weight for ad-hoc querying
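
One way to picture the balance between relevance and popularity on this slide is a toy ranking function. Everything below is an assumption made for illustration: the field weights, the ad-hoc versus automated weights, the `Table` fields, and the way the two scores are multiplied together. It is not the ranking logic used by Amundsen or Elasticsearch.

```python
import math
from dataclasses import dataclass, field
from typing import List

# Hypothetical field weights: a match on the resource name counts most.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}

# Hypothetical query weights: ad-hoc reads count more than automated ones.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


@dataclass
class Table:
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)
    adhoc_reads: int = 0
    automated_reads: int = 0


def relevance(table: Table, term: str) -> float:
    """Sum the weights of every metadata field that contains the term."""
    term = term.lower()
    score = 0.0
    if term in table.name.lower():
        score += FIELD_WEIGHTS["name"]
    if any(term in tag.lower() for tag in table.tags):
        score += FIELD_WEIGHTS["tags"]
    if term in table.description.lower():
        score += FIELD_WEIGHTS["description"]
    return score


def popularity(table: Table) -> float:
    """Weighted querying activity, dampened with a logarithm."""
    activity = (ADHOC_WEIGHT * table.adhoc_reads
                + AUTOMATED_WEIGHT * table.automated_reads)
    return math.log1p(activity)


def rank(tables: List[Table], term: str) -> List[Table]:
    """Keep tables that match the term, best combined score first."""
    matches = [t for t in tables if relevance(t, term) > 0]
    return sorted(matches,
                  key=lambda t: relevance(t, term) * (1 + popularity(t)),
                  reverse=True)
```

With these made-up weights, a table that matches on both name and description and is queried ad hoc by analysts outranks a temporary table that matches only on name and is read solely by scheduled jobs.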

17 of 36

View Resource Metadata

Table Detail Page

18 of 36

Column Details

Computed Metadata Statistics

19 of 36

Amundsen Architecture

Metadata Sources (Hive, Spark, S3, etc.)

Search Service

Metadata Service

Frontend Service

Other Microservices

Apache Atlas

Databuilder Ingestion Framework

Elasticsearch

Neo4j

20 of 36

Data Lineage and Metadata

21 of 36

Why Apache Atlas?

  • Open source
  • Metadata types & instances
    • Hooks and Bridges
  • Out of the box Lineage Support
  • Classification / Tags
  • Powerful Graph Engine
  • REST APIs

22 of 36

Data Governance Overview

23 of 36

Data Lineage and Metadata

24 of 36

Schema Management - AVRO Based

25 of 36

Data Lineage and Metadata

26 of 36

Table Popularity Score

27 of 36

28 of 36

Data Ecosystem at ING

29 of 36

What’s Next?

30 of 36

Depth of Metadata

31 of 36

Roadmap

  • Table Versions and Partitions
  • Search Improvements (Relevancy)
  • Lineage Integration within Amundsen
  • Data Security & Compliance
    • Tag-based policies (Apache Ranger - Apache Atlas)
    • Role-Based Access Control

32 of 36

Community

We are open source advocates!

33 of 36

Prominent Users

Active Community

34 of 36

Working with Lyft since the inception of the open source journey of Amundsen

First company outside Lyft Engineering to deploy Amundsen in production.

35 of 36

Our Team

36 of 36

Marek Wiewiórka

Big Data Architect (GetInData)

linkedin.com/in/marekwiewiorka

github.com/mwiewior

Verdan Mahmood

Software Engineer (ING)

linkedin.com/in/verdan

github.com/verdan

Thanks!