1 of 26

Leveraging massive protein structure datasets for function prediction on a metagenomic scale

Paweł Szczerbiak

Jagiellonian University, Poland

ISMB 2023

3 of 26

  • Recent breakthroughs in protein structure prediction have made high-quality structure models widely available
  • Tools such as DeepFRI (v1.0) showed that using structure instead of sequence can lead to much better function prediction
  • Metagenomic-DeepFRI extends DeepFRI further by efficiently mapping sequences to putative structures
  • DeepFRI v1.0 produces high-coverage annotations, albeit more general than those of comparable homology-based methods
  • Here we show how DeepFRI (v1.1), retrained on the AlphaFold-UniProt dataset, can alleviate this limitation

Motivation

4 of 26

DeepFRI v1.0

Nat Commun 12, 3168 (2021)

5 of 26

Input sequence: MSDVKILLNEADIPAITHWYNVVADMPK, encoded as an L x 26 one-hot feature matrix

Sequence-based prediction

  • CNN block: 16 x 1D filters
  • dense layer, sigmoid activation
  • output: [0.000012, 0.12, 0.00313, 0.01 … ]

Structure-based prediction

  • LSTM block (precomputed on Pfam): 2 x 512 layers, TimeDistributed(Dense(26)), softmax activation
  • dense + addition, ReLU activation, giving H(0)
  • GCN block on H(0) and A: (256, 256, 512) layers
  • GlobalSumPooling, dense, sigmoid activation
  • output: [0.000002, 0.19, 0.00813, 0.24 … ]

Four NNs: MF, BP, CC, EC

DeepFRI v1.0 – architecture
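To make the "L x 26 one-hot feature matrix" input concrete, here is a minimal Python sketch. The 26-symbol alphabet and its ordering are assumptions chosen for illustration (the slide only states the matrix width), so the actual DeepFRI vocabulary may differ.

```python
# Illustrative 26-symbol alphabet: gap, 20 standard residues, 5 extra codes.
# ASSUMPTION: the symbol set and ordering are hypothetical, not DeepFRI's.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWYXBZUO"


def one_hot(sequence, alphabet=ALPHABET):
    """Return an L x len(alphabet) matrix with a single 1 per row."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = []
    for residue in sequence:
        row = [0] * len(alphabet)
        row[index[residue]] = 1  # KeyError for symbols outside the alphabet
        matrix.append(row)
    return matrix


sequence = "MSDVKILLNEADIPAITHWYNVVADMPK"  # example sequence from the slide
features = one_hot(sequence)
print(len(features), len(features[0]))  # L and the alphabet size
```

Each row corresponds to one residue, so the matrix height follows the sequence length L while the width stays fixed at 26.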

6 of 26

inputs: GO annotations, v4 structures

GO annotations

  • quality control: GO-terms with at least one experimental annotation
  • GO hierarchy propagation
  • example label vectors: [GO:0006975, 0, GO:0050789, 0 … ] and [GO:0006975, 0, 0, 0, 0, 0, 0 … ]

v4 structures

  • quality control: ≥ 80% of residues with pLDDT ≥ 70
  • structure size between 60 and 1000 aa

training set

  • 59,731,892 annotations, 8,877 GO-terms
  • 4,905,062 annotations, 6,530 GO-terms
  • 2,822,622 structures

DeepFRI v1.1 – training procedure
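The two preparation steps named on this slide, structure quality control and GO hierarchy propagation, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the size bounds are assumed to be inclusive, and the toy hierarchy uses placeholder term names rather than real GO identifiers.

```python
def passes_structure_qc(plddt, min_len=60, max_len=1000,
                        plddt_cutoff=70.0, min_fraction=0.8):
    """Slide criteria: 60-1000 residues and >= 80% of residues with pLDDT >= 70.

    ASSUMPTION: both size bounds are treated as inclusive.
    """
    n = len(plddt)
    if not (min_len <= n <= max_len):
        return False
    confident = sum(1 for value in plddt if value >= plddt_cutoff)
    return confident / n >= min_fraction


def propagate(terms, parents):
    """Return the annotated terms plus all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


# Toy hierarchy with placeholder names (child -> list of parents).
toy_parents = {"specific": ["general"], "general": ["root"]}

print(passes_structure_qc([90.0] * 80 + [50.0] * 20))  # 100 residues, 80% confident
print(sorted(propagate({"specific"}, toy_parents)))
```

Propagation means that a protein annotated with a specific term also receives every more general ancestor term, which increases the number of annotations per protein.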

10 of 26

model: DeepFRI v1.1 (DeepFRI v1.0 architecture)

predictions: GO-terms: 772 CC, 1,987 MF, 6,118 BP

DeepFRI v1.1 – training procedure

11 of 26

inference

  • structure → DeepFRI v1.1
  • sequence → Metagenomic-DeepFRI → closest structure → DeepFRI v1.1

DeepFRI v1.1 – training procedure

reference structure databases: PDB, AlphaFold-DB, ESMAtlas, MIP, custom DBs

12 of 26

GO-terms coverage

IC(GOi) = -log2 Prob(GOi)

Information content
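As a worked example of the formula IC(GOi) = -log2 Prob(GOi), the Python sketch below estimates Prob(GOi) as the fraction of proteins annotated with a term. That estimator is an assumption (the slide does not say how the probability is obtained), and the term names are placeholders rather than real GO identifiers.

```python
import math
from collections import Counter


def information_content(annotations):
    """annotations: one set of terms per protein.

    ASSUMPTION: Prob(term) = (proteins annotated with term) / (all proteins).
    log2(n / count) equals -log2(count / n).
    """
    n = len(annotations)
    counts = Counter(term for terms in annotations for term in terms)
    return {term: math.log2(n / count) for term, count in counts.items()}


# Toy data: 8 proteins, placeholder term names.
toy_annotations = (
    [{"root", "general", "specific"}]
    + [{"root", "general"}]
    + [{"root"}] * 6
)
ic = information_content(toy_annotations)
for term in ("root", "general", "specific"):
    print(term, ic[term])
```

A term attached to every protein carries no information (IC = 0), while rarer and therefore more specific terms get a higher IC.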

13 of 26

When working with structures from PDB:

  • use score ≥ 0.35 for v1.0
  • use score ≥ 0.5 for v1.1

Note: DeepFRI v1.0 was trained and tested on PDB (same distribution), whereas v1.1 was trained on UniProt and tested on PDB (different distributions). This may explain the worse performance of v1.1 at a given threshold.

DeepFRI v1.0 DeepFRI v1.1

Calibration curves

High quality predictions
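The version-specific cut-offs recommended above for PDB structures can be applied as a simple filter. The sketch below is illustrative only: the function name, the prediction format (term, score) and the toy values are assumptions, not part of DeepFRI's interface.

```python
# Cut-offs taken from the slide; everything else is hypothetical.
SCORE_THRESHOLDS = {"v1.0": 0.35, "v1.1": 0.5}


def high_quality(predictions, version):
    """Keep (term, score) pairs at or above the cut-off for the given version."""
    cutoff = SCORE_THRESHOLDS[version]
    return [(term, score) for term, score in predictions if score >= cutoff]


# Toy predictions with placeholder term names.
toy_predictions = [("term_a", 0.92), ("term_b", 0.41), ("term_c", 0.12)]

print(high_quality(toy_predictions, "v1.0"))
print(high_quality(toy_predictions, "v1.1"))
```

With these toy scores, the prediction at 0.41 passes the v1.0 cut-off but not the stricter v1.1 one.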

14 of 26

DeepFRI predictions for 17,397 non-singleton cluster representatives from the Microbiome Immunity Project (MIP) database (Nat Commun 14, 2351 (2023))

More informative predictions from v1.1

Version comparison on MIP dataset

15 of 26

MIP1_00248745

  • Max TM-score to PDB: 0.48
  • DeepFRI score: 0.99
  • Information content: 12.1
  • GO-term: GO:0046910
  • GO-name: pectinesterase inhibitor activity

MIP1_00055961

  • Max TM-score to PDB: 0.49
  • DeepFRI score: 0.56
  • Information content: 13.1
  • GO-term: GO:0140303
  • GO-name: intramembrane lipid transporter activity

MIP1_00230606

  • Max TM-score to PDB: 0.47
  • DeepFRI score: 0.80
  • Information content: 13.8
  • GO-term: GO:0030269
  • GO-name: tetrahydromethanopterin S-methyltransferase activity

Version comparison on MIP dataset

MIP novel folds

16 of 26

Higher coverage of COG categories for v1.1

Version comparison on MIP dataset

COG categories

17 of 26

Version comparison on MIP dataset

COG categories

Biases

  • GO to COG mapping not perfect

  • only one hit per structure is picked

  • relative GO-term distributions in the training sets lead to lower scores for underrepresented GO-terms

No hits

  • no mapping of DeepFRI GO-terms to COG and vice versa

  • function unknown (‘S’ class) not shown

  • no annotations for hard targets that are underrepresented in the training sets

To be addressed (hopefully) in future releases

18 of 26

MGnify predictions

DeepFRI v1.1 annotations for 1,000,000 randomly picked MGnify Foldseek cluster representatives predicted by ESMFold (Science 379, 1123-1130 (2023))

High coverage across wide IC range

21 of 26

DeepFRI

  • High quality annotations: score ≥ 0.5
  • Excluding general GO-terms: IC ≥ 3

ProtENN

  • Pfam annotations without DUF / putative

MGnify predictions

Comparison to ProtENN

DeepFRI v1.1 annotations for 1,000,000 randomly picked MGnify Foldseek cluster representatives predicted by ESMFold (Science 379, 1123-1130 (2023))

More annotated proteins as compared to ProtENN
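The comparison criteria on this slide (DeepFRI: score ≥ 0.5 and IC ≥ 3; ProtENN: Pfam annotations without DUF / putative) can be expressed as two predicates and a count. The Python sketch below is a hedged illustration: the record layout, the function names and the name-based test for "DUF / putative" are all assumptions, and the toy records are invented for the example, not results from the talk.

```python
def deepfri_annotated(predictions, ic, min_score=0.5, min_ic=3.0):
    """True if any predicted term is both confident and informative."""
    return any(score >= min_score and ic.get(term, 0.0) >= min_ic
               for term, score in predictions)


def pfam_annotated(family_names):
    """True if any Pfam hit is neither a DUF nor a 'putative' family.

    ASSUMPTION: filtering on the family name is a stand-in for however
    DUF / putative entries were actually excluded.
    """
    return any(not name.upper().startswith("DUF")
               and "putative" not in name.lower()
               for name in family_names)


# Toy inputs with placeholder names; not real data.
toy_ic = {"general_term": 1.2, "specific_term": 9.5}
toy_proteins = {
    "protein_1": {"deepfri": [("specific_term", 0.91)], "pfam": ["DUF1234"]},
    "protein_2": {"deepfri": [("general_term", 0.97)], "pfam": ["ABC transporter"]},
    "protein_3": {"deepfri": [("specific_term", 0.32)], "pfam": []},
}

deepfri_count = sum(deepfri_annotated(p["deepfri"], toy_ic)
                    for p in toy_proteins.values())
pfam_count = sum(pfam_annotated(p["pfam"]) for p in toy_proteins.values())
print(deepfri_count, pfam_count)
```

Counting proteins that pass each predicate gives the per-method number of annotated proteins that the slide compares.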

23 of 26

UniProt predictions

DeepFRI v1.1 annotations for 711,700 dark UniProt Foldseek cluster representatives predicted by AlphaFold (Nucleic Acids Research 50, D439–D444 (2022))

Barrio-Hernandez et al., Clustering predicted structures at the scale of the known protein universe (bioRxiv, 2023)

Number of predictions comparable to that for ESMAtlas

24 of 26

Key message

  • DeepFRI v1.1 provides higher coverage of informative GO-terms for structure datasets as compared to v1.0

  • Together with Metagenomic-DeepFRI, it can be used for function prediction on practically any sequence dataset

  • We will keep upgrading the tools to bridge the sequence-function gap

25 of 26

C-279

Poster session

26 of 26

Project details

This research was supported by grant NAWA PPN/PPO/2018/1/00014

tomaszlab.org

tomasz.kosciolek@uj.edu.pl

pawel.szczerbiak@uj.edu.pl

Function annotation: Mary Maranga, Łukasz Szydłowski

DeepFRI v1.1: Paweł Szczerbiak, Witold Wydmański

Metagenomic-DeepFRI: Valentyn Bezshapkin, Piotr Kucharski

Group Leader: Tomasz Kosciolek