1 of 46

Sherlock intro & demo

Big data query engine & shared storage for the Korcsmaros Group

Mate, David�23 November, 2018.

2 of 46

Introduction

3 of 46

Sherlock @ KorcsmarosGroup

Data platform developed for EI researchers, collaborators and services developed in the KorcsmarosLab

Features:

  • store all datasets in a redundant, organized storage
  • convert all datasets to common, query compatible file format
  • execute analytical queries on top of data files
  • share datasets among different teams / projects
  • generate operational datasets for certain services or collaborators
  • scalability - for extremely large and complex datasets

4 of 46

What is big data?

  • In business: any data heavy project �working on a “large dataset
  • In IT: a set of “non traditional” �technologies and frameworks, �handling “tricky datasets” �in a scalable way
  • In research: not used widely yet, �but could - because we have a lot’s of “tricky” data :)

5 of 46

Typical big data analytical tools in IT

DATA LAKE

BATCH ANALYTICS

STREAM COMPUTATION

REPORT GENERATION

INTEGRATION

WORKFLOW MANAGEMENT

6 of 46

Data science project vs. Full big data solution

DATA LAKE

Application / Service

Application / Service

Application / Service

Application / Service

  • CI
  • CD
  • TEST AUTO.
  • INFRASTRUCTURE AUTO.
  • DEPLOYMENT AUTO.
  • SECURITY MGMNT.

7 of 46

Data Lake

Where and how do we store the data in a big data application?

8 of 46

Traditional way: The power of SQL

SQL = Structured Query Language, to manipulate and query relational databases

Structured data:

  • Table
  • Column (typed)
  • Row
  • Relation (between �tables / columns)

relational database

9 of 46

Given the following three tables,

how do you find those genes from the String database, which are expressed the brain?

string_proteins

ens_id

tax_id

pubmed_ids

ensg11867234247

9606

15328335

ensg98236453842

9606

ensg36438523943

9606

14847410, 9368760

mapping

ens_id

uniprot_id

ensg11867234247

ph39f7

ensg98236453842

ph39f7

ensg00034783463

ph863s

tissue

uniprot_id

bto_id

bto_name

ph39f7

142

brain

ph863s

476

foot

10 of 46

string_proteins

ens_id

tax_id

pubmed_ids

ensg11867234247

9606

15328335

ensg98236453842

9606

ensg36438523943

9606

14847410, 9368760

mapping

ens_id

uniprot_id

ensg11867234247

ph39f7

ensg98236453842

ph39f7

ensg00034783463

ph863s

tissue

uniprot_id

bto_id

bto_name

ph39f7

142

brain

ph863s

476

foot

SELECT string_proteins.ens_id

FROM string_proteins

LEFT JOIN mapping ON string_proteins.ens_id = mapping.ens_id

LEFT JOIN tissues ON mapping.uniprot_id = tissues.uniprot_id

WHERE tissues.bto_name = ‘brain’;

11 of 46

Data Lake

  • shared storage
  • for large amount of files
  • rules and conventions to organize the files

  • standardized big data file formats
  • standardized access (S3)
  • lot of tools can use it

12 of 46

Relational database

Shared folder

More structured, data types enforced

Data lake

Easier to setup, and to add / delete data

  • Up to few TB
  • Only for clean and structured data
  • Up to a few projects and a few people
  • Scalable
  • Both for structured and unstructured data

13 of 46

S3 demo - cyberduck

14 of 46

Queries on top of the files in data lake

Presto worker node

Presto worker node

Presto coordinator + 2 worker nodes

Data lake

  1. Submit SQL query

2. Fetch data files

3. Execute the in-memory� distributed query

4. Return the results

15 of 46

Sherlock demo

Use Case 1: Tissue specificity

16 of 46

Data in Presto: Tissue (gene expression) DB

molecule_id: "ensg11867234247"

molecule_id_type: "ensembl"

tissue_uberon_id: 2107

tissue_uberon_name: "liver"

source_db: "Bgee"

score: 33.2626

tax_id: 9606

Rules:

Using standard UBERON terms, MI terms, NCBI tax IDs, UniprotKB dbRef names for all values.

Score: optional (can be NULL)

UBERON: Uber-anatomy ontology,

https://www.ebi.ac.uk/ols/ontologies/uberon

17 of 46

Use case 1: List the 100 most highly expressed genes in the human liver

SELECT molecule_id,

molecule_id_type,

tissue_uberon_id,

tissue_uberon_name,

score

FROM master.bgee_2020_04_27

WHERE tax_id = 9606 � AND tissue_uberon_id = 2107�ORDER BY score DESC�LIMIT 100;

18 of 46

CREATE TABLE new_table AS

SELECT ...;

Presto worker node

Presto worker node

Presto worker nodes

Data lake

  • Submit �SQL query

2. Fetch data files

3. Execute the in-memory� distributed query

4. Write the results to a new table / folder

19 of 46

Saving our results to the data lake

CREATE TABLE project.mate_my_table WITH (

format = 'JSON'

) AS

SELECT molecule_id,

molecule_id_type,

tissue_uberon_id,

tissue_uberon_name,

score

FROM master.bgee_14_0

WHERE tax_id = 9606 � AND tissue_uberon_id = 2106�ORDER BY score DESC;

20 of 46

Benevolent

data files

iSNP / NOX

NOX standard data formats

SignaLink

MongoDB

TF DB

Graph DB

Project specific�Databases or files

Data Lake (S3)

Master Zone

Convert to optimized file formats, allowing fast queries

Project Zone

User / project specific files and databases in the Data Lake

Presto �Query Engine

Running SQL queries on top of the data lake

Importing data sets

KEGG

GO

HG 38

Ad-hoc analytical queries

Raw Zone

Raw files of different external databases in original format

Landing Zone

Convert to presto compatible text format + basic cleanup

21 of 46

22 of 46

23 of 46

Sherlock demo

Use Case 2 : ID mapping

24 of 46

Data in Presto: ID mapping table

from_id: "ensg11867234247"

from_id_type: "ensembl"

to_id: "ph39f7"

to_id_type: "uniprotac"

tax_id: 9606

Rules:

Node IDs in small case string

ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)

25 of 46

Use case 1: List genes expressed in the hypothalamus using uniprot IDs

bgee

score

tissue_uberon_name

molecule_id

8723.823

blood

ensg99375824543

729.372

hypothalamus

ensg11867234247

mapping

from_id

to_id

ensg11867234247

ph39f7

ensg98236453842

ph39f7

ensg00034783463

ph863s

26 of 46

Use case 1: List genes expressed in the liver using uniprot IDs

SELECT bgee.molecule_id,

uniprot.to_id,

bgee.tissue_uberon_id,

bgee.tissue_uberon_name,

bgee.score

FROM master.bgee_2020_04_27 bgee

LEFT JOIN master.uniprot_id_mapping_2020_05 uniprot ON

bgee.tax_id = uniprot.tax_id AND

bgee.molecule_id = uniprot.from_id AND

uniprot.from_id_type = 'bgee' AND

uniprot.to_id_type = 'uniprotac'

WHERE bgee.tax_id = 9606 � AND bgee.tissue_uberon_id = 2107

AND uniprot.to_id IS NOT NULL�ORDER BY bgee.score DESC;

27 of 46

Sherlock demo

Use Case 3: Network enrichment

28 of 46

Data in Presto: Protein interactions

interactor_a_id: "ensg11867234247"

interactor_a_id_type: "ensembl"

interactor_a_tax_id: 9606

interactor_a_molecula_type_mi_id: 250

interactor_a_molecula_type_name: gene

interactor_b_id: "ensg98236453842"

interactor_b_id_type: "ensembl"

interactor_b_tax_id: 9606

interactor_b_molecula_type_mi_id: 250

interactor_b_molecula_type_name: gene

interaction_detection_methods_mi_id: [401, 58]

interaction_types_mi_id: [208]

source_databases_mi_id: [469]

pmids: [14847410, 9368760]

Rules:

Node IDs in small case string

ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)

Standard MI terms for

  • molecula type
  • Interaction type
  • source database

(https://www.ebi.ac.uk/ols/ontologies/mi)

29 of 46

Network enrichment

Enrich with interconnections

Enrich with first neighbours

30 of 46

Network enrichment

Enrich with interconnections

Enrich with first neighbours

31 of 46

Enrich with interconnections

SELECT interactor_a_id,

interactor_b_id

FROM master.omnipath_2020_10_04

WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')

AND interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');

32 of 46

Enrich with interconnections + first neighbours

SELECT interactor_a_id,

interactor_b_id

FROM master.omnipath_0_7_111

WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')

OR interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');

33 of 46

Enrich hypothalamus genes by omnipath (+ ID mapping)

CREATE VIEW project.mate_hypo_genes AS

SELECT uniprot.to_id AS uniprot_id

FROM master.bgee_14_0 bgee

LEFT JOIN master.uniprot_id_mapping_2018_11 uniprot ON

bgee.tax_id = uniprot.tax_id AND bgee.molecule_id = uniprot.from_id AND

uniprot.from_id_type = 'bgee' AND uniprot.to_id_type = 'uniprotac'

WHERE bgee.tax_id = 9606 AND bgee.tissue_uberon_id = 1898

AND uniprot.to_id IS NOT NULL

ORDER BY bgee.score DESC;

SELECT omnipath.interactor_a_id, omnipath.interactor_b_id

FROM omnipath_0_7_111 omnipath

WHERE omnipath.interactor_a_id IN (SELECT * FROM project.mate_hypo_genes )

OR omnipath.interactor_b_id IN (SELECT * FROM project.mate_hypo_genes );

34 of 46

Monitoring the running queries

http://<presto cluster>:<presto port>/

35 of 46

Sherlock demo

Use Case 4: Genome regions

36 of 46

Data in Presto: Genome

chr: "chr11"

length: 1000

start: 27000

end: 28000

sequence: "actg.....ccta"

Rules:

Each genome goes to a separate table.

Split the genome to smaller sequences for optimization.

37 of 46

Use Case 4: select region around SNP location

956000

957000

958000

959000

960000

SNP�958800

+ / - 700

38 of 46

957001 ...

958001 ...

959001 ...

960001

SNP

958100 … 959500

Step 1: select + / - 1 full split, 3000 long sequence

Step 2: substring from the 3000 long sequence: [ SNP - 699 - before_split.start : SNP + 701 - before_split.start ]

This case: [1100 : 2500]

target_split

before_split

after_split

958800

39 of 46

Same algorithm in SQL: step 1

SELECT

MIN(start) AS long_sequence_start,

ARRAY_JOIN ( ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence

FROM master.hg_38

WHERE chr = 'chr11'

AND start BETWEEN 958800 - 2000 AND 958800 + 1000;

40 of 46

Same algorithm in SQL: step 1 + step 2

SELECT

SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result

FROM (

-- here comes the query of the step 1

);

41 of 46

Same algorithm in SQL: step 1 + step 2

SELECT

SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result

FROM (

SELECT

ARRAY_JOIN (ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence,

MIN(start) AS long_sequence_start

FROM master.human_genome_grch38_p13_2020_10_21

WHERE chr = 'chr11'

AND start BETWEEN 958800 - 2000 AND 958800 + 1000

);

42 of 46

Same algorithm in SQL: step 1 + step 2 (smarter)

SELECT

SUBSTR(long_sequence,

MAX(1, SNP_POS-699-long_sequence_start),

MIN(LENGTH(long_sequence) - (MAX(1, SNP_POS-699-long_sequence_start)), 1401)

) AS result

FROM (

-- here comes the query of the step 1

);

Here we also handle the case, when we don’t have 700 sequence before and/or after the SNP:

43 of 46

Sherlock: project status

44 of 46

What we have

  • Working Presto cluster in NBI (3 VMs, easy to scale up)
  • Data Lake (currently in Digital Oceans, later using NBI object store)
  • Source code repository (in EI Stash)
    • Documentation, Example queries
    • Data loader scripts
    • Table definitions
  • Databases in the Data Lake:
    • IntAct v2018_10_04
    • Tissue data from Bgee 14_0
    • Human Genome v38
    • OmniPath 0.7.111
    • Uniprot ID mapping 2018_11

45 of 46

Next steps

  • Quality improvements (automated backups, automated tests for the data loaders)
  • Bit more user documentation
  • Start to use it in the lab / EI
    • For iSNP we already use it
    • Some SQL training? ;)
  • Publish / open source some parts
  • Load more data:
    • More molecular interaction dataset
    • Genome annotations
    • Signaling databases
    • Ontologies (GO, MI, BTO)

46 of 46

Thank You! :)