Leveraging massive protein structure datasets for function prediction on a metagenomic scale
Paweł Szczerbiak
Jagiellonian University, Poland
ISMB 2023
Motivation
DeepFRI v1.0
Nat Commun 12, 3168 (2021)
MSDVKILLNEADIPAITHWYNVVADMPK
Sequence-based prediction:
L x 26 one-hot feature matrix
CNN block (16 x 1D filters)
dense layer
sigmoid activation
output scores: [0.000012, 0.12, 0.00313, 0.01 … ]
Structure-based prediction:
LSTM block (2 x 512 layers)
TimeDistributed(Dense(26))
softmax activation
dense + addition
ReLU activation
precomputed on Pfam
GCN block ((256, 256, 512) layers), inputs H(0) and A
GlobalSumPooling
dense
sigmoid activation
output scores: [0.000002, 0.19, 0.00813, 0.24 … ]
Four NNs: MF, BP, CC, EC
DeepFRI v1.0 – architecture
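The structure-based branch can be illustrated with a minimal NumPy sketch: graph-convolution layers over the contact map A and residue features H(0), followed by sum pooling, a dense layer and a sigmoid. This is not the authors' implementation. The symmetric normalisation, the plain sequential stacking of layers, the random weights, the toy contact map and the names `gcn_layer` and `predict_scores` are all assumptions made for illustration; only the layer widths (256, 256, 512) come from the slide.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).
    The normalisation is an assumption, not taken from the slides."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # >= 1 thanks to self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_scores(A, H0, layer_weights, W_out):
    """Stack GCN layers, sum-pool over residues, apply dense + sigmoid."""
    H = H0
    for W in layer_weights:
        H = gcn_layer(A, H, W)
    pooled = H.sum(axis=0)                    # GlobalSumPooling over residues
    return sigmoid(pooled @ W_out)            # one score per GO-term

# Toy inputs (hypothetical sizes): 30 residues, 26 features, 4 GO-terms.
rng = np.random.default_rng(0)
L, n_feat, n_terms = 30, 26, 4
A = (rng.random((L, L)) < 0.1).astype(float)
A = np.maximum(A, A.T)                        # symmetric contact map
np.fill_diagonal(A, 0.0)
H0 = rng.random((L, n_feat))
dims = [n_feat, 256, 256, 512]                # widths from the slide
layer_weights = [rng.normal(scale=0.05, size=(dims[i], dims[i + 1]))
                 for i in range(3)]
W_out = rng.normal(scale=0.05, size=(512, n_terms))
scores = predict_scores(A, H0, layer_weights, W_out)
```

Each entry of `scores` lies between 0 and 1 and plays the role of a per-term DeepFRI score.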
GO annotations
v4 structures
: [GO:0006975, 0, GO:0050789, 0 … ]
quality control: GO-terms with at least one experimental annotation
GO hierarchy propagation
: [GO:0006975, 0, 0, 0, 0, 0, 0 … ]
inputs
training set
quality control: ≥ 80% of residues with pLDDT ≥ 70
structure size between 60 and 1000 aa
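The two structure-level criteria above can be written as a single predicate. A minimal sketch, assuming per-residue pLDDT values are available as a list and that the size bounds are inclusive (the slide does not say); the function name `passes_quality_control` is hypothetical.

```python
def passes_quality_control(plddt, min_frac=0.8, plddt_cutoff=70.0,
                           min_len=60, max_len=1000):
    """Return True if a predicted structure meets both criteria:
    length within [min_len, max_len] (inclusive bounds are an assumption)
    and at least min_frac of residues with pLDDT >= plddt_cutoff."""
    n = len(plddt)
    if not (min_len <= n <= max_len):
        return False
    confident = sum(1 for p in plddt if p >= plddt_cutoff)
    return confident / n >= min_frac
```

For example, a 100-residue model with 80 residues at pLDDT 90 and 20 at pLDDT 50 passes, while the same model with only 79 confident residues does not.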
59,731,892 annotations
8,877 GO-terms
4,905,062 annotations
6,530 GO-terms
2,822,622 structures
DeepFRI v1.1 – training procedure
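GO hierarchy propagation means that a protein annotated with a term is also treated as annotated with every ancestor of that term. A minimal sketch on a toy hierarchy follows; the term labels and the `parents` mapping are hypothetical placeholders, not real GO relationships, and the function name `propagate` is an assumption.

```python
def propagate(terms, parents):
    """Return the input terms together with all of their ancestors.
    `parents` maps a term to its direct parent terms."""
    result = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in result:
            continue                      # already visited
        result.add(term)
        stack.extend(parents.get(term, ()))
    return result

# Toy hierarchy (hypothetical labels): leaf -> mid -> root, other -> root
parents = {
    "T:leaf": ["T:mid"],
    "T:mid": ["T:root"],
    "T:other": ["T:root"],
}
propagated = propagate({"T:leaf"}, parents)
```

Propagation can only add terms to a protein's annotation set, never remove them.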
DeepFRI v1.1
DeepFRI v1.0 architecture
model
GO-terms: 772 CC, 1,987 MF, 6,118 BP
predictions
structure
DeepFRI v1.1
closest structure
sequence
Metagenomic-DeepFRI
inference
PDB
AlphaFold-DB
ESMAtlas
MIP
custom DBs
reference structure databases
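The inference flow above (a sequence is matched to its closest reference structure, which is then passed to the structure-based model) can be sketched as a small routing function. This is an illustration under stated assumptions, not the Metagenomic-DeepFRI code: the hit format, the similarity threshold `min_similarity`, the fall-back to sequence-based prediction when no hit qualifies, and the name `choose_mode` are all hypothetical.

```python
def choose_mode(hits, min_similarity=0.5):
    """Pick a prediction mode for one query sequence.
    `hits` is a list of (reference_id, similarity) pairs from a search
    against the reference structure databases (format is an assumption).
    Returns ("structure", reference_id) if the best hit is good enough,
    otherwise ("sequence", None)."""
    best = max(hits, key=lambda h: h[1], default=None)
    if best is not None and best[1] >= min_similarity:
        return ("structure", best[0])
    return ("sequence", None)
```

With hits `[("ref_a", 0.3), ("ref_b", 0.9)]` the function returns `("structure", "ref_b")`; with no hits it returns `("sequence", None)`.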
GO-terms coverage
Information content: IC(GO_i) = -log2 Prob(GO_i)
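A short sketch of the information content formula, assuming Prob(GO_i) is estimated as the fraction of proteins in a reference set annotated with GO_i (the slides do not state which reference set is used). Term labels and the function name are hypothetical.

```python
import math
from collections import Counter

def information_content(annotations):
    """Compute IC = -log2(Prob) per term.
    `annotations` is a list with one collection of terms per protein;
    Prob(term) is the fraction of proteins carrying that term."""
    n = len(annotations)
    counts = Counter(t for terms in annotations for t in set(terms))
    return {t: -math.log2(c / n) for t, c in counts.items()}

# Toy reference set of four proteins (hypothetical term labels)
ic = information_content([
    {"A", "B"},
    {"A", "C"},
    {"A", "C"},
    {"A"},
])
```

A term found on every protein has IC 0, a term found on a quarter of the proteins has IC 2, so rare and specific terms score high and general terms score low.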
When working with structures from PDB:
Note: DeepFRI v1.0 was trained and tested on PDB (same distribution), whereas v1.1 was trained on UniProt and tested on PDB (different distributions). This may explain the worse performance of v1.1 at a given threshold.
DeepFRI v1.0 DeepFRI v1.1
Calibration curves
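A calibration curve compares predicted scores with the observed fraction of correct predictions in each score bin. The sketch below is a generic NumPy version with equal-width bins; the binning scheme and the function name are assumptions, and it does not reproduce how the curves on the slide were computed.

```python
import numpy as np

def calibration_curve(y_true, y_score, n_bins=10):
    """Return (mean predicted score, fraction of positives) per non-empty bin,
    using n_bins equal-width bins on [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1
    idx = np.digitize(y_score, edges[1:-1])
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(y_score[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

mean_pred, frac_pos = calibration_curve(
    y_true=[0, 0, 1, 1], y_score=[0.1, 0.2, 0.8, 0.9], n_bins=2)
```

For a well calibrated model the two returned arrays are close to each other; points below the diagonal indicate over-confident scores.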
High quality predictions
DeepFRI predictions for 17,397 non-singleton cluster representatives from Microbiome Immunity Project (MIP) database (Nat Commun 14, 2351 (2023))
More informative predictions from v1.1
Version comparison on MIP dataset
MIP1_00248745 (max TM-score to PDB: 0.48)
DeepFRI score: 0.99, information content: 12.1
GO:0046910, pectinesterase inhibitor activity
MIP1_00055961 (max TM-score to PDB: 0.49)
DeepFRI score: 0.56, information content: 13.1
GO:0140303, intramembrane lipid transporter activity
MIP1_00230606 (max TM-score to PDB: 0.47)
DeepFRI score: 0.80, information content: 13.8
GO:0030269, tetrahydromethanopterin S-methyltransferase activity
MIP novel folds
Higher coverage of COG categories for v1.1
COG categories
Biases
No hits
To be addressed (hopefully) in future releases
MGnify predictions
DeepFRI 1.1 annotations for 1,000,000 randomly picked MGnify foldseek cluster representatives predicted by ESMFold (Science 379, 1123-1130 (2023))
High coverage across wide IC range
DeepFRI: high quality annotations, excluding general GO-terms (score ≥ 0.5, IC ≥ 3)
ProtENN: Pfam annotations, without DUF / putative
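The DeepFRI-side filter (score ≥ 0.5, IC ≥ 3) can be expressed as a simple selection over prediction records. A minimal sketch; the record layout, the protein and term labels, and the function name `high_quality` are hypothetical.

```python
def high_quality(predictions, min_score=0.5, min_ic=3.0):
    """Keep predictions that are both confident (score) and specific (IC)."""
    return [p for p in predictions
            if p["score"] >= min_score and p["ic"] >= min_ic]

# Hypothetical prediction records
predictions = [
    {"protein": "protein_1", "term": "term_x", "score": 0.91, "ic": 7.2},
    {"protein": "protein_1", "term": "term_y", "score": 0.95, "ic": 0.4},  # too general
    {"protein": "protein_2", "term": "term_z", "score": 0.32, "ic": 9.8},  # low score
    {"protein": "protein_3", "term": "term_x", "score": 0.50, "ic": 3.0},  # on the boundary
]
kept = high_quality(predictions)
annotated_proteins = {p["protein"] for p in kept}
```

Counting the distinct proteins in `kept` gives the number of proteins with at least one high quality annotation, which is the kind of quantity a comparison between annotation methods would rest on.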
Comparison to ProtENN
More proteins annotated than with ProtENN
UniProt predictions
DeepFRI 1.1 annotations for 711,700 dark UniProt Foldseek cluster representatives predicted by AlphaFold (Nucleic Acids Research 50, D439–D444 (2022))
Barrio-Hernandez et al. (bioRxiv, 2023)
Clustering predicted structures at the scale of the known protein universe
Number of predictions comparable to that for ESMAtlas
Key message
C-279
Poster session
Project details
This research was supported by grant NAWA PPN/PPO/2018/1/00014
tomaszlab.org
tomasz.kosciolek@uj.edu.pl
pawel.szczerbiak@uj.edu.pl
Function annotation
Mary Maranga
Łukasz Szydłowski
DeepFRI v1.1
Paweł Szczerbiak
Witold Wydmański
Metagenomic-DeepFRI
Valentyn Bezshapkin
Piotr Kucharski
Group Leader
Tomasz Kosciolek