Sherlock intro & demo
Big data query engine & shared storage for the Korcsmaros Group
Mate, David�23 November, 2018.
Introduction
Sherlock @ KorcsmarosGroup
Data platform developed for EI researchers, collaborators and services developed in the KorcsmarosLab
Features:
What is big data?
Typical big data analytical tools in IT
DATA LAKE
BATCH ANALYTICS
STREAM COMPUTATION
REPORT GENERATION
INTEGRATION
WORKFLOW MANAGEMENT
Data science project vs. Full big data solution
DATA LAKE
Application / Service
Application / Service
Application / Service
Application / Service
Data Lake
Where and how do we store the data in a big data application?
Traditional way: The power of SQL
SQL = Structured Query Language, to manipulate and query relational databases
Structured data:
relational database
Given the following three tables,
how do you find those genes from the String database, which are expressed the brain?
string_proteins | ||
ens_id | tax_id | pubmed_ids |
ensg11867234247 | 9606 | 15328335 |
ensg98236453842 | 9606 | |
ensg36438523943 | 9606 | 14847410, 9368760 |
mapping | |
ens_id | uniprot_id |
ensg11867234247 | ph39f7 |
ensg98236453842 | ph39f7 |
ensg00034783463 | ph863s |
tissue | ||
uniprot_id | bto_id | bto_name |
ph39f7 | 142 | brain |
ph863s | 476 | foot |
string_proteins | ||
ens_id | tax_id | pubmed_ids |
ensg11867234247 | 9606 | 15328335 |
ensg98236453842 | 9606 | |
ensg36438523943 | 9606 | 14847410, 9368760 |
mapping | |
ens_id | uniprot_id |
ensg11867234247 | ph39f7 |
ensg98236453842 | ph39f7 |
ensg00034783463 | ph863s |
tissue | ||
uniprot_id | bto_id | bto_name |
ph39f7 | 142 | brain |
ph863s | 476 | foot |
SELECT string_proteins.ens_id
FROM string_proteins
LEFT JOIN mapping ON string_proteins.ens_id = mapping.ens_id
LEFT JOIN tissues ON mapping.uniprot_id = tissues.uniprot_id
WHERE tissues.bto_name = ‘brain’;
Data Lake
Relational database
Shared folder
More structured, data types enforced
Data lake
Easier to setup, and to add / delete data
S3 demo - cyberduck
Queries on top of the files in data lake
Presto worker node
Presto worker node
Presto coordinator + 2 worker nodes
Data lake
2. Fetch data files
3. Execute the in-memory� distributed query
4. Return the results
Sherlock demo
Use Case 1: Tissue specificity
Data in Presto: Tissue (gene expression) DB
molecule_id: "ensg11867234247"
molecule_id_type: "ensembl"
tissue_uberon_id: 2107
tissue_uberon_name: "liver"
source_db: "Bgee"
score: 33.2626
tax_id: 9606
Rules:
Using standard UBERON terms, MI terms, NCBI tax IDs, UniprotKB dbRef names for all values.
Score: optional (can be NULL)
UBERON: Uber-anatomy ontology,
Use case 1: List the 100 most highly expressed genes in the human liver
SELECT molecule_id,
molecule_id_type,
tissue_uberon_id,
tissue_uberon_name,
score
FROM master.bgee_2020_04_27
WHERE tax_id = 9606 � AND tissue_uberon_id = 2107�ORDER BY score DESC�LIMIT 100;
CREATE TABLE new_table AS
SELECT ...;
Presto worker node
Presto worker node
Presto worker nodes
Data lake
2. Fetch data files
3. Execute the in-memory� distributed query
4. Write the results to a new table / folder
Saving our results to the data lake
CREATE TABLE project.mate_my_table WITH (
format = 'JSON'
) AS
SELECT molecule_id,
molecule_id_type,
tissue_uberon_id,
tissue_uberon_name,
score
FROM master.bgee_14_0
WHERE tax_id = 9606 � AND tissue_uberon_id = 2106�ORDER BY score DESC;
Benevolent
data files
iSNP / NOX
NOX standard data formats
SignaLink
MongoDB
TF DB
Graph DB
Project specific�Databases or files
Data Lake (S3)
Master Zone
Convert to optimized file formats, allowing fast queries
Project Zone
User / project specific files and databases in the Data Lake
Presto �Query Engine
Running SQL queries on top of the data lake
Importing data sets
KEGG
GO
HG 38
Ad-hoc analytical queries
Raw Zone
Raw files of different external databases in original format
Landing Zone
Convert to presto compatible text format + basic cleanup
Sherlock demo
Use Case 2 : ID mapping
Data in Presto: ID mapping table
from_id: "ensg11867234247"
from_id_type: "ensembl"
to_id: "ph39f7"
to_id_type: "uniprotac"
tax_id: 9606
Rules:
Node IDs in small case string
ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)
Use case 1: List genes expressed in the hypothalamus using uniprot IDs
bgee | ||
score | tissue_uberon_name | molecule_id |
8723.823 | blood | ensg99375824543 |
729.372 | hypothalamus | ensg11867234247 |
mapping | |
from_id | to_id |
ensg11867234247 | ph39f7 |
ensg98236453842 | ph39f7 |
ensg00034783463 | ph863s |
Use case 1: List genes expressed in the liver using uniprot IDs
SELECT bgee.molecule_id,
uniprot.to_id,
bgee.tissue_uberon_id,
bgee.tissue_uberon_name,
bgee.score
FROM master.bgee_2020_04_27 bgee
LEFT JOIN master.uniprot_id_mapping_2020_05 uniprot ON
bgee.tax_id = uniprot.tax_id AND
bgee.molecule_id = uniprot.from_id AND
uniprot.from_id_type = 'bgee' AND
uniprot.to_id_type = 'uniprotac'
WHERE bgee.tax_id = 9606 � AND bgee.tissue_uberon_id = 2107
AND uniprot.to_id IS NOT NULL�ORDER BY bgee.score DESC;
Sherlock demo
Use Case 3: Network enrichment
Data in Presto: Protein interactions
interactor_a_id: "ensg11867234247"
interactor_a_id_type: "ensembl"
interactor_a_tax_id: 9606
interactor_a_molecula_type_mi_id: 250
interactor_a_molecula_type_name: gene
interactor_b_id: "ensg98236453842"
interactor_b_id_type: "ensembl"
interactor_b_tax_id: 9606
interactor_b_molecula_type_mi_id: 250
interactor_b_molecula_type_name: gene
interaction_detection_methods_mi_id: [401, 58]
interaction_types_mi_id: [208]
source_databases_mi_id: [469]
pmids: [14847410, 9368760]
Rules:
Node IDs in small case string
ID type: standard UniprotKB DBref (https://www.uniprot.org/database/)
Standard MI terms for
(https://www.ebi.ac.uk/ols/ontologies/mi)
Network enrichment
Enrich with interconnections
Enrich with first neighbours
Network enrichment
Enrich with interconnections
Enrich with first neighbours
Enrich with interconnections
SELECT interactor_a_id,
interactor_b_id
FROM master.omnipath_2020_10_04
WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')
AND interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');
Enrich with interconnections + first neighbours
SELECT interactor_a_id,
interactor_b_id
FROM master.omnipath_0_7_111
WHERE interactor_a_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113')
OR interactor_b_id IN ('o95786', 'q96eq8', 'q6zsz5', 'q01113');
Enrich hypothalamus genes by omnipath (+ ID mapping)
CREATE VIEW project.mate_hypo_genes AS
SELECT uniprot.to_id AS uniprot_id
FROM master.bgee_14_0 bgee
LEFT JOIN master.uniprot_id_mapping_2018_11 uniprot ON
bgee.tax_id = uniprot.tax_id AND bgee.molecule_id = uniprot.from_id AND
uniprot.from_id_type = 'bgee' AND uniprot.to_id_type = 'uniprotac'
WHERE bgee.tax_id = 9606 AND bgee.tissue_uberon_id = 1898
AND uniprot.to_id IS NOT NULL
ORDER BY bgee.score DESC;
SELECT omnipath.interactor_a_id, omnipath.interactor_b_id
FROM omnipath_0_7_111 omnipath
WHERE omnipath.interactor_a_id IN (SELECT * FROM project.mate_hypo_genes )
OR omnipath.interactor_b_id IN (SELECT * FROM project.mate_hypo_genes );
Monitoring the running queries
http://<presto cluster>:<presto port>/
Sherlock demo
Use Case 4: Genome regions
Data in Presto: Genome
chr: "chr11"
length: 1000
start: 27000
end: 28000
sequence: "actg.....ccta"
Rules:
Each genome goes to a separate table.
Split the genome to smaller sequences for optimization.
Use Case 4: select region around SNP location
956000
957000
958000
959000
960000
SNP�958800
+ / - 700
957001 ...
958001 ...
959001 ...
960001
SNP
958100 … 959500
Step 1: select + / - 1 full split, 3000 long sequence�
Step 2: substring from the 3000 long sequence: [ SNP - 699 - before_split.start : SNP + 701 - before_split.start ]
This case: [1100 : 2500]
target_split
before_split
after_split
958800
Same algorithm in SQL: step 1
SELECT
MIN(start) AS long_sequence_start,
ARRAY_JOIN ( ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence
FROM master.hg_38
WHERE chr = 'chr11'
AND start BETWEEN 958800 - 2000 AND 958800 + 1000;
Same algorithm in SQL: step 1 + step 2
SELECT
SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result
FROM (
-- here comes the query of the step 1
);
Same algorithm in SQL: step 1 + step 2
SELECT
SUBSTR(long_sequence, 958800 - 699 - long_sequence_start, 1401) AS result
FROM (
SELECT
ARRAY_JOIN (ARRAY_AGG(sequence ORDER BY start), '') AS long_sequence,
MIN(start) AS long_sequence_start
FROM master.human_genome_grch38_p13_2020_10_21
WHERE chr = 'chr11'
AND start BETWEEN 958800 - 2000 AND 958800 + 1000
);
Same algorithm in SQL: step 1 + step 2 (smarter)
SELECT
SUBSTR(long_sequence,
MAX(1, SNP_POS-699-long_sequence_start),
MIN(LENGTH(long_sequence) - (MAX(1, SNP_POS-699-long_sequence_start)), 1401)
) AS result
FROM (
-- here comes the query of the step 1
);
Here we also handle the case, when we don’t have 700 sequence before and/or after the SNP:
Sherlock: project status
What we have
Next steps
Thank You! :)