1 of 15

ComPPI: a cellular compartment-specific database for protein–protein interaction network analysis

Veres et al., NAR Database Issue, 2015

2 of 15

Assumptions

  • Protein function depends on subcellular localization
  • Interactions are localization specific
  • Interaction databases contain false positives because they do not check whether both partners share a localization
  • Localization data are sparse and inconsistent

Approach: manually curated data integration, plus confidence scores for localizations and interactions

3 of 15

Data Sources

  • Four species: Yeast, C. elegans, Drosophila, Human
  • 9 interaction databases
  • 8 localization databases
  • Map everything to UniProt and 1600 GO cellular component terms

4 of 15

“Manual” curation

  1. Select interaction and localization databases
  2. Manually map all localizations to 1600 GO cellular component terms
  3. Manually map protein IDs to UniProt for 30% of proteins
  4. Manually check and revise the database

5 of 15

Localization Tree

  • 1600 GO terms, under six major compartments
  • Single path to any term

Major compartments: Cytosol, Nucleus, Mitochondrion, Secretory-Pathway, Membrane, Extracellular
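
Because there is a single path to any term, each term has exactly one parent, so its major compartment can be found by walking upward. A minimal sketch, assuming a parent-pointer dictionary; only the six major compartment names come from the slide, and the child terms are illustrative placeholders rather than entries from the actual ComPPI tree.

```python
# Hypothetical fragment of a localization tree: child term -> parent term.
# The six major compartments are from the slide; the child terms below are
# illustrative placeholders only.
MAJOR_COMPARTMENTS = {
    "Cytosol", "Nucleus", "Mitochondrion",
    "Secretory-Pathway", "Membrane", "Extracellular",
}

PARENT = {
    "nucleolus": "Nucleus",
    "nuclear envelope": "Nucleus",
    "nuclear pore": "nuclear envelope",
    "mitochondrial matrix": "Mitochondrion",
}


def major_compartment(term: str) -> str:
    """Walk the single parent chain until a major compartment is reached."""
    seen = set()
    while term not in MAJOR_COMPARTMENTS:
        # Guard against unmapped terms and accidental cycles in the mapping.
        if term in seen or term not in PARENT:
            raise KeyError(f"no path to a major compartment for {term!r}")
        seen.add(term)
        term = PARENT[term]
    return term
```

With this layout, adding a new term only requires choosing its one parent.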

6 of 15

Localization Score

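$\varphi_{LocX} = \bigvee_{res} p_{LocX}$

Weights:

experimental pLocX = 0.8

predicted pLocX = 0.7

unknown pLocX = 0.6

$\varphi_{LocX}$: probability in location

$\bigvee_{res}$: disjunction over the localization entries ($res$: number of localization entries)

$p_{LocX}$: evidence weight

$\varphi_{LocX} = 1 - \left((1 - p_{LocX})^{res} \cdot \ldots\right)$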
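
A minimal sketch of the localization score, assuming the elided part of the formula holds one factor of the same form per evidence type, each raised to the number of entries of that type. That reading is an assumption, not something the slide states. The function name and input format are hypothetical.

```python
from collections import Counter
from math import prod

# Evidence-type weights from the slide.
WEIGHTS = {"experimental": 0.8, "predicted": 0.7, "unknown": 0.6}


def localization_score(evidence_types):
    """Score one protein in one compartment.

    evidence_types: one evidence-type label per localization entry.
    Assumption: the score is 1 - prod((1 - weight) ** count), with one
    factor per evidence type.
    """
    counts = Counter(evidence_types)
    miss = prod((1.0 - WEIGHTS[kind]) ** n for kind, n in counts.items())
    return 1.0 - miss
```

Under these assumptions a single experimental entry gives 0.8, and one experimental plus one predicted entry gives 1 - 0.2 × 0.3 = 0.94, so every additional entry raises the score.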

[Figure: example compartments: Nucleus, Cytoplasm, Membrane, Extracellular]

7 of 15

Interaction Score

  1. Compartment specific score
  2. Combined compartment score

[Figure: example compartments: Nucleus, Cytoplasm, Membrane, Extracellular]
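
The slide names the two steps without giving formulas. A minimal sketch, assuming the compartment-specific score is the product of the two partners' localization scores in that compartment and the combined score is the probabilistic disjunction of those products. Both formulas are assumptions here, and the function names are hypothetical.

```python
from math import prod


def compartment_scores(loc_a, loc_b):
    """Step 1: one score per compartment that both proteins are found in.

    loc_a, loc_b: dicts mapping compartment name -> localization score.
    Assumption: the compartment-specific score is the product of the two.
    """
    shared = loc_a.keys() & loc_b.keys()
    return {c: loc_a[c] * loc_b[c] for c in shared}


def interaction_score(loc_a, loc_b):
    """Step 2: combine the compartment-specific scores.

    Assumption: combined score = 1 - prod(1 - s) over the compartments.
    """
    per_compartment = compartment_scores(loc_a, loc_b)
    return 1.0 - prod(1.0 - s for s in per_compartment.values())
```

Two proteins with no compartment in common get a combined score of 0 under this sketch, since the empty product is 1.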

8 of 15

So what does this provide?

“(i) the filtration of localization-based biologically unlikely interactions—where the two interacting proteins have no common localization

(ii) the prediction of possible new localizations and localization-based biological functions”
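
Point (i) amounts to a simple filter. A minimal sketch, assuming interactions are given as protein pairs and localizations as a dictionary of compartment sets; the protein names in the usage note are placeholders, and treating proteins without localization data as having no common localization is an assumption.

```python
def shares_localization(locs_a, locs_b):
    """True if the two proteins have at least one compartment in common."""
    return bool(set(locs_a) & set(locs_b))


def filter_interactions(interactions, localizations):
    """Keep only the pairs whose partners share a compartment.

    interactions: iterable of (protein_a, protein_b) pairs.
    localizations: dict mapping protein -> collection of compartments.
    Assumption: a protein with no localization data shares nothing.
    """
    kept = []
    for a, b in interactions:
        if shares_localization(localizations.get(a, ()), localizations.get(b, ())):
            kept.append((a, b))
    return kept
```

For example, with `P1` in the nucleus, `P2` in the nucleus and cytosol, and `P3` extracellular, the pair `(P1, P2)` is kept and `(P1, P3)` is dropped.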

9 of 15

Evidence Weights??

  • Human Protein Atlas
    • experimentally verified localizations
    • Positive controls: interacting proteins share a localization

Optimization constraints:

experimental > predicted AND experimental > unknown

“maximizes the number of high confidence interactions in the positive control data set (HPA) and simultaneously maximizes the number of low confidence interactions in the ComPPI data set not containing HPA data”

Weights:

experimental pLocX = 0.8

predicted pLocX = 0.7

unknown pLocX = 0.6
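
The constraints alone narrow the search space. A minimal sketch that enumerates candidate weight triples satisfying them on a hypothetical grid; the objective quoted above is not implemented here because it needs the HPA and ComPPI data sets, and the grid values are an assumption.

```python
from itertools import product


def candidate_weights(grid):
    """Yield (experimental, predicted, unknown) triples from the grid that
    satisfy: experimental > predicted AND experimental > unknown."""
    for exp, pred, unk in product(grid, repeat=3):
        if exp > pred and exp > unk:
            yield exp, pred, unk


# Hypothetical grid; the slide does not say which values were searched.
candidates = list(candidate_weights([0.6, 0.7, 0.8]))
```

The reported weights (0.8, 0.7, 0.6) are one of the triples this grid admits. Note that the constraints do not order predicted against unknown.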

10 of 15

Evidence Weights??

  • Human Protein Atlas (what about other organisms?)
    • experimentally verified localizations
    • Positive controls: interacting proteins share a localization


11 of 15

Does it work?

  • Crotonase
    • 71 interactions from database
    • Localization: Mitochondria
    • 5 / 71 have interaction score > 0.8

12 of 15

Crotonase

  • All 5 high-scoring mitochondrial interactors also have strong evidence of cytosolic localization

“crotonase was shown to be overexpressed and localized in the cytosol in hepatocarcinoma cells, where it contributes to lymphatic metastasis”

13 of 15

Concerns

  • 200 randomly chosen entries evaluated by experts
    • Database revised based on the results
  • False positive / negative rate
  • Causes of errors
  • Building in automatic corrections?

14 of 15

Other Concerns

  • Manual nature of mapping to GO terms in their tree
  • Manual protein ID mapping for 30% of proteins

What happens when GO adds new cellular component terms? New proteins in UniProt?

What if we wanted to add mouse, rat, something else? How hard would it be?

15 of 15

Other sources

  • Gene Ontology extensions
    • These have started adding context to functional annotations, including location