1 of 15

ComPPI: a cellular compartment-specific database for protein–protein interaction network analysis

Veres et al., NAR Database Issue, 2015

2 of 15

Assumptions

Protein function depends on subcellular localization
Interactions are localization specific
False positives in interaction databases arise when the localization of each pair is not considered
Localization data are sparse and inconsistent

Manually curated data integration, plus confidence scores for localizations and interactions

3 of 15

Data Sources

Four species: Yeast, C. elegans, Drosophila, Human
9 interaction databases
8 localization databases
Map all proteins to UniProt and all localizations to 1600 GO cellular component terms

4 of 15

“Manual” curation

Selecting interaction and localization databases
Manually mapping all localizations to 1600 GO cellular component terms
Manually mapping protein IDs to UniProt for 30% of proteins
Manually checking and revising the database

5 of 15

Localization Tree (manual)

1600 GO terms, under six major compartments
Single path to any term

Cytosol

Nucleus

Mitochondrion

Secretory-Pathway

Membrane

Extracellular
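
The "single path to any term" property can be sketched as a parent-pointer tree. This is a minimal illustration, not ComPPI's actual tree: the child terms and their parent links below are assumptions chosen for the example, and `major_compartment` is a hypothetical helper name. Only the six major compartments come from the slide.

```python
# Toy localization tree. Each term has exactly one parent, so there is a
# single path from any term up to its major compartment.
# The parent links are illustrative assumptions, not ComPPI's curated tree.
MAJOR_COMPARTMENTS = {
    "cytosol", "nucleus", "mitochondrion",
    "secretory-pathway", "membrane", "extracellular",
}

PARENT = {
    "nucleolus": "nucleus",
    "nuclear envelope": "nucleus",
    "mitochondrial matrix": "mitochondrion",
    "Golgi apparatus": "secretory-pathway",
    "plasma membrane": "membrane",
}

def major_compartment(term):
    """Walk parent links upward; return (major compartment, path taken)."""
    path = [term]
    while term not in MAJOR_COMPARTMENTS:
        if term not in PARENT:
            raise KeyError(f"term not in tree: {term!r}")
        term = PARENT[term]
        path.append(term)
    return term, path
```

Because every term has one parent, a term that could plausibly sit under two compartments has to be assigned to exactly one of them, which is where the manual curation comes in.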

6 of 15

Localization Score

$\varphi_{LocX} = V_{res}\, p_{LocX}$

Weights:

experimental p_LocX = 0.8

predicted p_LocX = 0.7

unknown p_LocX = 0.6

$\varphi_{LocX}$: probability of being in location X

$V_{res}$: number of localization entries

$p_{LocX}$: evidence weight

$\varphi_{LocX} = 1 - \left((1 - p_{LocX})^{V_{res}} \cdot \ldots\right)$
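
A small sketch of the expanded formula, using the weights from the slide. It assumes the "…" stands for the same kind of factor for each remaining evidence type, so the score is one minus the product of (1 - weight) raised to the entry count of that type. The function name and the input layout are hypothetical.

```python
# Evidence weights as given on the slide.
WEIGHTS = {"experimental": 0.8, "predicted": 0.7, "unknown": 0.6}

def localization_score(entry_counts, weights=WEIGHTS):
    """Score for one protein in one compartment.

    entry_counts maps evidence type -> number of localization entries
    of that type. Assumption: each entry independently fails to support
    the localization with probability (1 - weight).
    """
    prob_no_support = 1.0
    for evidence_type, n_entries in entry_counts.items():
        prob_no_support *= (1.0 - weights[evidence_type]) ** n_entries
    return 1.0 - prob_no_support
```

For example, one experimental entry alone gives 0.8, and adding one predicted entry raises the score to 1 - (0.2 × 0.3) = 0.94. More entries can only increase the score under this reading.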

[Figure labels: Nucleus, Cytoplasm, Membrane, Extracellular]

7 of 15

Interaction Score

Compartment specific score
Combined compartment score
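
The slide does not give the formulas, so the following is only one plausible reading: the compartment-specific score is taken as the product of the two proteins' localization scores in that compartment, and the combined score as the probability that the pair co-occurs in at least one compartment. Both choices, and all names, are assumptions for illustration.

```python
def compartment_scores(loc_a, loc_b):
    """Per-compartment interaction score for proteins A and B.

    loc_a / loc_b map compartment -> localization score.
    Assumption: score = product of the two localization scores.
    """
    shared = loc_a.keys() & loc_b.keys()
    return {c: loc_a[c] * loc_b[c] for c in shared}

def combined_score(loc_a, loc_b):
    """Assumed combination: probability of co-occurring in at least one compartment."""
    prob_nowhere = 1.0
    for score in compartment_scores(loc_a, loc_b).values():
        prob_nowhere *= 1.0 - score
    return 1.0 - prob_nowhere
```

Under this reading, a pair with no shared compartment gets a combined score of 0.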

[Figure labels: Nucleus, Cytoplasm, Membrane, Extracellular]

8 of 15

So what does this provide?

“(i) the filtration of localization-based biologically unlikely interactions—where the two interacting proteins have no common localization

(ii) the prediction of possible new localizations and localization-based biological functions”
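
Point (i) amounts to a set-intersection filter. A minimal sketch, with hypothetical function names and protein identifiers, might look like this. Treating a protein with no localization data as having no localization is an assumption made here, not something stated in the paper.

```python
def shares_localization(locs_a, locs_b):
    """True if the two proteins have at least one compartment in common."""
    return bool(set(locs_a) & set(locs_b))

def filter_interactions(interactions, localizations):
    """Split interactions into (kept, dropped) by shared localization.

    interactions: iterable of (protein_a, protein_b) pairs.
    localizations: dict protein -> iterable of compartments.
    Assumption: proteins missing from `localizations` share nothing.
    """
    kept, dropped = [], []
    for a, b in interactions:
        locs_a = localizations.get(a, ())
        locs_b = localizations.get(b, ())
        if shares_localization(locs_a, locs_b):
            kept.append((a, b))
        else:
            dropped.append((a, b))
    return kept, dropped
```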

9 of 15

Evidence Weights??

Human Protein Atlas

experimentally verified localizations
Positive controls: interacting proteins share a localization

Optimization constraints:

experimental > predicted AND experimental > unknown

“maximizes the number of high confidence interactions in the positive control data set (HPA) and simultaneously maximizes the number of low confidence interactions in the ComPPI data set not containing HPA data”

Weights:

experimental p_LocX = 0.8

predicted p_LocX = 0.7

unknown p_LocX = 0.6
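
The constrained search can be sketched as a grid search. The grid step, the function names, and the `objective` callback are assumptions. In the paper's setup, the objective would count high-confidence interactions in the HPA positive controls plus low-confidence interactions in the non-HPA data, which is not reproduced here.

```python
from itertools import product

def candidate_weights(step=0.1):
    """Yield weight triples on a grid in (0, 1) that satisfy the constraints
    experimental > predicted AND experimental > unknown."""
    n_steps = int(round(1 / step))
    grid = [round(i * step, 10) for i in range(1, n_steps)]
    for exp, pred, unk in product(grid, repeat=3):
        if exp > pred and exp > unk:
            yield {"experimental": exp, "predicted": pred, "unknown": unk}

def best_weights(objective, step=0.1):
    """Return the candidate that maximizes a caller-supplied objective."""
    return max(candidate_weights(step), key=objective)
```

Note that the constraints say nothing about predicted versus unknown, so the search is free to rank them in either order.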

10 of 15

Evidence Weights??

Human Protein Atlas (what about other organisms?)


11 of 15

Does it work?

Crotonase

71 interactions from database
Localization: Mitochondria
5 / 71 have interaction score > 0.8

12 of 15

Crotonase

All 5 of the high-scoring mitochondrial interactors also have strong evidence of cytosolic localization

“crotonase was shown to be overexpressed and localized in the cytosol in hepatocarcinoma cells, where it contributes to lymphatic metastasis”

13 of 15

Concerns

200 randomly chosen entries evaluated by experts

Database revised based on the results

False positive / negative rate
Causes of errors
Building in automatic corrections?

14 of 15

Other Concerns

Manual nature of mapping to GO terms in their tree
Manual ID mapping for 30% of proteins

What happens when GO adds new cellular component terms? New proteins in UniProt?

What if we wanted to add mouse, rat, something else? How hard would it be?

15 of 15

Other sources

Gene Ontology extensions

These have started providing context for functional annotations, including location