1 of 46

Sherlock intro & demo

Big data query engine & shared storage for the Korcsmaros Group

Mate, David�23 November, 2018.

2 of 46

Introduction

3 of 46

Sherlock @ KorcsmarosGroup

Data platform developed for EI researchers, collaborators and services developed in the KorcsmarosLab

Features:

store all datasets in a redundant, organized storage
convert all datasets to common, query compatible file format
execute analytical queries on top of data files
share datasets among different teams / projects
generate operational datasets for certain services or collaborators
scalability - for extremely large and complex datasets

4 of 46

What is big data?

In business: any data heavy project �working on a “large dataset”
In IT: a set of “non traditional” �technologies and frameworks, �handling “tricky datasets” �in a scalable way
In research: not used widely yet, �but could - because we have a lot’s of “tricky” data :)

5 of 46

Typical big data analytical tools in IT

DATA LAKE

BATCH ANALYTICS

STREAM COMPUTATION

REPORT GENERATION

INTEGRATION

WORKFLOW MANAGEMENT

6 of 46

Data science project vs. Full big data solution

DATA LAKE

Application / Service

CI
CD
TEST AUTO.
INFRASTRUCTURE AUTO.
DEPLOYMENT AUTO.
SECURITY MGMNT.

7 of 46

Data Lake

Where and how do we store the data in a big data application?

8 of 46

Traditional way: The power of SQL

SQL = Structured Query Language, to manipulate and query relational databases

Structured data:

Table
Column (typed)
Row
Relation (between �tables / columns)

relational database

9 of 46

Given the following three tables,

how do you find those genes from the String database, which are expressed the brain?

string_proteins
ens_id	tax_id	pubmed_ids
ensg11867234247	9606	15328335
ensg98236453842	9606
ensg36438523943	9606	14847410, 9368760

mapping
ens_id	uniprot_id
ensg11867234247	ph39f7
ensg98236453842	ph39f7
ensg00034783463	ph863s

tissue
uniprot_id	bto_id	bto_name
ph39f7	142	brain
ph863s	476	foot

10 of 46

string_proteins
ens_id	tax_id	pubmed_ids
ensg11867234247	9606	15328335
ensg98236453842	9606
ensg36438523943	9606	14847410, 9368760

mapping
ens_id	uniprot_id
ensg11867234247	ph39f7
ensg98236453842	ph39f7
ensg00034783463	ph863s

tissue
uniprot_id	bto_id	bto_name
ph39f7	142	brain
ph863s	476	foot

SELECT string_proteins.ens_id

FROM string_proteins

LEFT JOIN mapping ON string_proteins.ens_id = mapping.ens_id

LEFT JOIN tissues ON mapping.uniprot_id = tissues.uniprot_id

WHERE tissues.bto_name = ‘brain’;

11 of 46

Data Lake

shared storage
for large amount of files
rules and conventions to organize the files

standardized big data file formats
standardized access (S3)
lot of tools can use it

12 of 46

Relational database

Shared folder

More structured, data types enforced

Data lake

Easier to setup, and to add / delete data

Up to few TB
Only for clean and structured data

Up to a few projects and a few people

Scalable
Both for structured and unstructured data

13 of 46

S3 demo - cyberduck

14 of 46

Queries on top of the files in data lake

Presto worker node

Presto coordinator + 2 worker nodes

Data lake

Submit SQL query

2. Fetch data files

3. Execute the in-memory� distributed query

4. Return the results

https://prestodb.io/docs/current/index.html

15 of 46

Sherlock demo

Use Case 1: Tissue specificity

16 of 46

Data in Presto: Tissue (gene expression) DB

molecule_id: "ensg11867234247"

molecule_id_type: "ensembl"

tissue_uberon_id: 2107

tissue_uberon_name: "liver"

source_db: "Bgee"

score: 33.2626

tax_id: 9606

Rules:

Using standard UBERON terms, MI terms, NCBI tax IDs, UniprotKB dbRef names for all values.

Score: optional (can be NULL)

UBERON: Uber-anatomy ontology,

https://www.ebi.ac.uk/ols/ontologies/uberon

17 of 46

Use case 1: List the 100 most highly expressed genes in the human liver

SELECT molecule_id,

molecule_id_type,

tissue_uberon_id,

tissue_uberon_name,

score

FROM master.bgee_2020_04_27

WHERE tax_id = 9606 � AND tissue_uberon_id = 2107�ORDER BY score DESC�LIMIT 100;

… �https://www.ebi.ac.uk/ols/ontologies/uberon/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002106

18 of 46

CREATE TABLE new_table AS

SELECT ...;

Presto worker node

Presto worker nodes

Data lake

Submit �SQL query

2. Fetch data files

3. Execute the in-memory� distributed query

4. Write the results to a new table / folder

19 of 46

Saving our results to the data lake

CREATE TABLE project.mate_my_table WITH (

format = 'JSON'

) AS

SELECT molecule_id,

molecule_id_type,

tissue_uberon_id,

tissue_uberon_name,

score

FROM master.bgee_14_0

WHERE tax_id = 9606 � AND tissue_uberon_id = 2106�ORDER BY score DESC;

20 of 46

Benevolent

data files

iSNP / NOX

NOX standard data formats

SignaLink

MongoDB

TF DB

Graph DB

Project specific�Databases or files

Data Lake (S3)

Master Zone

Convert to optimized file formats, allowing fast queries

Project Zone

User / project specific files and databases in the Data Lake

Presto �Query Engine

Running SQL queries on top of the data lake

Importing data sets

KEGG

GO

HG 38

Ad-hoc analytical queries

Raw Zone

Raw files of different external databases in original format

Landing Zone

Convert to presto compatible text format + basic cleanup

21 of 46

22 of 46

23 of 46

Sherlock demo

Use Case 2 : ID mapping

24 of 46

Data in Presto: ID mapping table

from_id: "ensg11867234247"

from_id_type: "ensembl"

to_id: "ph39f7"

to_id_type: "uniprotac"

tax_id: 9606

Rules:

Node IDs in small case string

ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)

25 of 46

Use case 1: List genes expressed in the hypothalamus using uniprot IDs

bgee
score	tissue_uberon_name	molecule_id
8723.823	blood	ensg99375824543
729.372	hypothalamus	ensg11867234247

mapping
from_id	to_id
ensg11867234247	ph39f7
ensg98236453842	ph39f7
ensg00034783463	ph863s

26 of 46

Use case 1: List genes expressed in the liver using uniprot IDs

SELECT bgee.molecule_id,

uniprot.to_id,

bgee.tissue_uberon_id,

bgee.tissue_uberon_name,

bgee.score

FROM master.bgee_2020_04_27 bgee

LEFT JOIN master.uniprot_id_mapping_2020_05 uniprot ON

bgee.tax_id = uniprot.tax_id AND

bgee.molecule_id = uniprot.from_id AND

uniprot.from_id_type = 'bgee' AND

uniprot.to_id_type = 'uniprotac'

WHERE bgee.tax_id = 9606 � AND bgee.tissue_uberon_id = 2107

AND uniprot.to_id IS NOT NULL�ORDER BY bgee.score DESC;

27 of 46

Sherlock demo

Use Case 3: Network enrichment

28 of 46

Data in Presto: Protein interactions

interactor_a_id: "ensg11867234247"

interactor_a_id_type: "ensembl"

interactor_a_tax_id: 9606

interactor_a_molecula_type_mi_id: 250

interactor_a_molecula_type_name: gene

interactor_b_id: "ensg98236453842"

interactor_b_id_type: "ensembl"

interactor_b_tax_id: 9606

interactor_b_molecula_type_mi_id: 250

interactor_b_molecula_type_name: gene

interaction_detection_methods_mi_id: [401, 58]

interaction_types_mi_id: [208]

source_databases_mi_id: [469]

pmids: [14847410, 9368760]

Rules:

Node IDs in small case string

ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)

Standard MI terms for

molecula type
Interaction type
source database

(https://www.ebi.ac.uk/ols/ontologies/mi)

29 of 46

Network enrichment

Enrich with interconnections

Enrich with first neighbours

30 of 46

Network enrichment

Enrich with interconnections

Enrich with first neighbours

31 of 46

Enrich with interconnections

SELECT interactor_a_id,

interactor_b_id

FROM master.omnipath_2020_10_04

WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')

AND interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');

32 of 46

Enrich with interconnections + first neighbours

SELECT interactor_a_id,

interactor_b_id

FROM master.omnipath_0_7_111

WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')

OR interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');

33 of 46

Enrich hypothalamus genes by omnipath (+ ID mapping)

CREATE VIEW project.mate_hypo_genes AS

SELECT uniprot.to_id AS uniprot_id

FROM master.bgee_14_0 bgee

LEFT JOIN master.uniprot_id_mapping_2018_11 uniprot ON

bgee.tax_id = uniprot.tax_id AND bgee.molecule_id = uniprot.from_id AND

uniprot.from_id_type = 'bgee' AND uniprot.to_id_type = 'uniprotac'

WHERE bgee.tax_id = 9606 AND bgee.tissue_uberon_id = 1898

AND uniprot.to_id IS NOT NULL

ORDER BY bgee.score DESC;

SELECT omnipath.interactor_a_id, omnipath.interactor_b_id

FROM omnipath_0_7_111 omnipath

WHERE omnipath.interactor_a_id IN (SELECT * FROM project.mate_hypo_genes )

OR omnipath.interactor_b_id IN (SELECT * FROM project.mate_hypo_genes );

34 of 46

Monitoring the running queries

http://<presto cluster>:<presto port>/

35 of 46

Sherlock demo

Use Case 4: Genome regions

36 of 46

Data in Presto: Genome

chr: "chr11"

length: 1000

start: 27000

end: 28000

sequence: "actg.....ccta"

Rules:

Each genome goes to a separate table.

Split the genome to smaller sequences for optimization.

37 of 46

Use Case 4: select region around SNP location

956000

957000

958000

959000

960000

SNP�958800

+ / - 700

38 of 46

957001 ...

958001 ...

959001 ...

960001

SNP

958100 … 959500

Step 1: select + / - 1 full split, 3000 long sequence�

Step 2: substring from the 3000 long sequence: [ SNP - 699 - before_split.start : SNP + 701 - before_split.start ]

This case: [1100 : 2500]

target_split

before_split

after_split

958800

39 of 46

Same algorithm in SQL: step 1

SELECT

MIN(start) AS long_sequence_start,

ARRAY_JOIN ( ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence

FROM master.hg_38

WHERE chr = 'chr11'

AND start BETWEEN 958800 - 2000 AND 958800 + 1000;

40 of 46

Same algorithm in SQL: step 1 + step 2

SELECT

SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result

FROM (

-- here comes the query of the step 1

);

41 of 46

Same algorithm in SQL: step 1 + step 2

SELECT

SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result

FROM (

SELECT

ARRAY_JOIN (ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence,

MIN(start) AS long_sequence_start

FROM master.human_genome_grch38_p13_2020_10_21

WHERE chr = 'chr11'

AND start BETWEEN 958800 - 2000 AND 958800 + 1000

);

HG38, chr11:958100 https://www.ncbi.nlm.nih.gov/genome/gdv/browser/?cfg=NCID_1_27358523_130.14.18.128_9146_1542928973_2555627680

42 of 46

Same algorithm in SQL: step 1 + step 2 (smarter)

SELECT

SUBSTR(long_sequence,

MAX(1, SNP_POS-699-long_sequence_start),

MIN(LENGTH(long_sequence) - (MAX(1, SNP_POS-699-long_sequence_start)), 1401)

) AS result

FROM (

-- here comes the query of the step 1

);

Here we also handle the case, when we don’t have 700 sequence before and/or after the SNP:

43 of 46

Sherlock: project status

44 of 46

What we have

Working Presto cluster in NBI (3 VMs, easy to scale up)
Data Lake (currently in Digital Oceans, later using NBI object store)
Source code repository (in EI Stash)

Documentation, Example queries
Data loader scripts
Table definitions

Databases in the Data Lake:

IntAct v2018_10_04
Tissue data from Bgee 14_0
Human Genome v38
OmniPath 0.7.111
Uniprot ID mapping 2018_11

45 of 46

Next steps

Quality improvements (automated backups, automated tests for the data loaders)
Bit more user documentation
Start to use it in the lab / EI

For iSNP we already use it
Some SQL training? ;)

Publish / open source some parts
Load more data:

More molecular interaction dataset
Genome annotations
Signaling databases
Ontologies (GO, MI, BTO)

46 of 46

Thank You! :)