Transfer Learning for Predicting Virus-Host Protein Interactions for Novel Virus Sequences
Jack Lanchantin¹, Tom Weingarten², Arshdeep Sekhon¹, Clint Miller¹, Yanjun Qi¹
¹University of Virginia, ²Google
Proteins: the building blocks of life
oxygen transportation
antibodies
digestive enzymes
Lanchantin, Weingarten, Sekhon, Miller, Qi - ACM-BCB 2021
What is a protein?
covalent bonds
amino acids
sequences with alphabet of 20 characters
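Because a protein is a string over a 20-letter alphabet, it can be turned into integer tokens the same way text is. The sketch below is only an illustration: the index layout (0 reserved for unknown residues) and the names `TOKEN_INDEX` and `encode` are assumptions, not the vocabulary used by the authors.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed layout: index 0 is reserved for anything outside the alphabet.
UNKNOWN = 0
TOKEN_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to a list of integer tokens."""
    return [TOKEN_INDEX.get(residue, UNKNOWN) for residue in sequence.upper()]


# Example peptide taken from these slides.
tokens = encode("MQGHFTETKHE")
```

Each residue becomes one token, so a sequence model sees a protein exactly as a language model sees a sentence of single-character words.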
MQGHFTETKHE
Primary Sequence: MHFTEDKATILWGKVNVEGETLGRVYPWQ
Secondary Structure
Tertiary Structure
Structure Determines Function
Protein-Protein Interactions (PPIs)
Virus-Host Protein-Protein Interactions
SARS-CoV-2
Spike Protein
Human
ACE2 Protein
Virus-Host Protein-Protein Interactions
[Figure: network of known interactions between virus proteins (Zika, HIV) and human proteins. A novel virus protein (SARS-CoV-2, MFVFLVLL) is then added, with potential interactions to the human proteins.]
Sequences shown: MSTCLAMVK, CGPKKSTNL, SPKRARSV, MFVFLVLL, ADYSVLYNS, LDKYFKN, HKMFYN, VLNDILS
Legend: Virus protein, Novel virus protein, Human protein, Known interaction, Potential interaction
Protein Data
Sequences: fast & cheap
MGHFTWGWGTSLWGKVNVEDAYPWTQ
TWGWGLFDKATETWGKVNVKDKLWG
KVWGKVNVEDYPWT
MGHFTEKEKLFVEDAYPWTQ
TEEDKADKATEDAYPWT
MGHFWITSLNKDLWGKSVEDAMGHF
VNVEDAYDKATSSLNFTDLWGK
Structure/Interactions: slow & expensive
Protein-Protein Interaction Prediction from Sequences
Gomez et al. 2003, Ben-Hur and Noble 2005, Qi et al. 2010, Eid et al. 2015, Yang et al. 2020
PSRDCNHIA, TMSQPHRSAH → Model → interaction: yes/no
Novel Virus-Human Protein Interaction Prediction from Sequence
novel virus protein PSRDCNHIA, human protein TMSQPHRSAH → Model → interaction: yes/no
Self-Supervised Pretraining and Transfer Learning for Limited Labeled Data
1. language model training: the cat ___ on the mat → Model → the cat sat on the mat
transfer
2. domain-specific training: this phone is great! → Model
Structure Prediction from Sequence
Lin, Lanchantin, Qi 2016
Jumper et al. 2020
Transfer Learning for Sequence-Based Interaction Prediction
1. Masked Language Model (MLM): TMQ_HRSPS_DCNH_A → DeepVHPPI → MLM → masked residues (A, M, Q)
transfer
2. Structure Prediction (SP): LSWGKVNVEDA → DeepVHPPI → SS, CT, RH (e.g. beta barrel)
transfer
3. PPI Prediction: PSRDCNHIA (virus), MSVKHSKH (host) → DeepVHPPI → PPI → interaction: yes/no
Masked Language Model (MLM) Pretraining
TMQ_HRSPS_DCNH_A → DeepVHPPI → MLM Classifier → TMQAHRSPSMDCNHQA
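The masking step of MLM pretraining can be sketched in a few lines. This is an illustration only: the 15% mask rate is the value used by BERT and is an assumption here, since the slides do not state the rate used for DeepVHPPI. The function names are hypothetical.

```python
import random

MASK = "_"  # the slides draw masked positions as underscores


def mask_positions(sequence, positions):
    """Hide the given positions; return (masked_sequence, targets).

    targets maps each masked index to the residue the model must recover.
    """
    targets = {pos: sequence[pos] for pos in positions}
    masked = "".join(MASK if i in targets else r for i, r in enumerate(sequence))
    return masked, targets


def random_mask(sequence, mask_rate=0.15, seed=0):
    """Mask a random subset of residues (mask_rate is an assumed value)."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(sequence)))
    return mask_positions(sequence, rng.sample(range(len(sequence)), n_masked))


# Reproduce the example from the slide: indices 3, 9 and 14 are hidden.
masked, targets = mask_positions("TMQAHRSPSMDCNHQA", [3, 9, 14])
```

The model is trained to predict `targets` from `masked`, which needs no labels beyond the sequences themselves.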
Structure Prediction (SP) Pretraining
VMQKHRSPHSDCNKVA → DeepVHPPI → Secondary Structure, Contact, Remote Homology (e.g. beta barrel)
Interaction Prediction
virus protein M P S R D C → DeepVHPPI
human protein K T M S Q → DeepVHPPI
both outputs → Interaction Classifier → P(interaction)
Transformers
the bat was swung through the air → Transformer
the bat was flying through the air → Transformer
(the same word, "bat", takes its meaning from the surrounding words)
Vaswani et al. 2017, Devlin et al. 2018
Transformers for Protein Sequences
M K Q E K T H → Transformer
protein motif (i.e. "word")
Rives et al. 2019
DeepVHPPI Transformer
Use Cases of Sequence-Based Interaction Predictors
SARS-CoV-2 Spike
MFVFLVLL
LDKYFKN
Human ACE2
1. predict novel virus interactions
2. analyze how mutations affect interactions
Experimental Setup
Protein-Protein Interaction Prediction Results
| Method | H1N1¹ AUROC | H1N1¹ F1 | Ebola¹ AUROC | Ebola¹ F1 | SARS-CoV-2² AUROC | SARS-CoV-2² F1 |
|---|---|---|---|---|---|---|
| SVM (Zhou et al.) | 0.886 | 0.762 | 0.867 | 0.760 | - | - |
| Embedding + RF (Yang et al.) | - | - | - | - | 0.748 | 0.115 |
| DeepVHPPI | 0.894 | 0.819 | 0.927 | 0.836 | 0.726 | 0.089 |
| DeepVHPPI + MLM | 0.910 | 0.837 | 0.943 | 0.867 | 0.735 | 0.095 |
| DeepVHPPI + MLM + SP | 0.926 | 0.848 | 0.979 | 0.895 | 0.767 | 0.105 |
¹ Even positive/negative testing split. ² Imbalanced positive/negative testing split.
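For readers less familiar with the two metrics in the results table, here is a minimal pure-Python sketch of how AUROC and F1 can be computed from predicted scores. The labels and scores are made up for illustration, and the 0.5 decision threshold for F1 is an assumption; the slides do not state the threshold used.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """Harmonic mean of precision and recall at an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy data: two interacting pairs among six candidates.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_auroc = auroc(labels, scores)
toy_f1 = f1_score(labels, scores)
```

AUROC depends only on how positives rank against negatives, while F1 depends on a threshold and on the number of false positives, so it tends to drop when negatives heavily outnumber positives.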
Perturbation Analysis: Investigating Mutations
PSRQCNHIA, TMSQPHRSA → Transformer → 0.35 (interaction / no interaction)
Perturbation Analysis: Investigating Mutations
SARS-CoV-2 Spike paired with Human ACE2 (LDKYFKN):
MFVFLVLL (original): y = 1
MFVFPVLL (one substitution): y = 1
MFKFLPLL (two substitutions): y = 0
Perturbation Analysis
virus protein P S R D C, human protein T M S Q → DeepVHPPI → Classifier → 0.95 (interaction / no interaction)
virus protein P Q R D C, human protein T M S Q → DeepVHPPI → Classifier → 0.35 (interaction / no interaction)
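The perturbation idea on the preceding slides can be sketched independently of the model: mutate one residue at a time, rescore the pair, and record how the score changes. In the sketch below, `toy_score` is a made-up placeholder so the example runs; it is not DeepVHPPI, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield (name, mutant) for every single-residue substitution.

    Names follow the usual convention: original residue, 1-based position,
    replacement residue (for example S2Q).
    """
    for pos, original in enumerate(sequence):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                mutant = sequence[:pos] + replacement + sequence[pos + 1:]
                yield f"{original}{pos + 1}{replacement}", mutant


def perturbation_scan(virus_seq, human_seq, score_fn):
    """Score every single mutant of the virus sequence against the host.

    Returns (name, score change) pairs, most damaging mutation first.
    """
    baseline = score_fn(virus_seq, human_seq)
    deltas = [
        (name, score_fn(mutant, human_seq) - baseline)
        for name, mutant in single_substitutions(virus_seq)
    ]
    return sorted(deltas, key=lambda item: item[1])


def toy_score(virus_seq, human_seq):
    """Placeholder scorer (NOT DeepVHPPI): share of virus residues that
    also occur in the human sequence."""
    return sum(r in human_seq for r in virus_seq) / len(virus_seq)


# Sequences from the slide; the scores themselves are meaningless here.
scan = perturbation_scan("PSRDC", "TMSQ", toy_score)
```

With a trained predictor passed as `score_fn`, the head of the list would point at the mutations predicted to weaken the interaction most.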
Experimental Setup
Perturbation Analysis: Mutated Spike and ACE2 Interactions
[Plot: Binding Prediction vs. Dissociation constant (log10ka)]
Summary
1. Transfer learning for protein-protein interaction prediction
2. Predict novel virus interactions
3. Analyze how mutations affect interactions
Thank You
Arshdeep Sekhon
Yanjun Qi
Tom Weingarten
Clint Miller
Perturbation Analysis: Investigating Mutations
Train: 108
Valid: 10,552
Test: 94,872
Spearman rank correlation between Binding Prediction and Dissociation constant (log10ka): 0.459 (pvalue=0.0)
Lanchantin, Weingarten, Sekhon, Miller, Qi - MLCB 2020
Training: 100 mutated seqs
Testing: 95,000 mutated seqs
Perturbation Analysis: Mutated Spike and ACE2 Interactions
Spearman rank correlation between Binding Prediction and Dissociation constant (log10ka): 0.72 (pvalue=0.0)
Training: 1,000 mutated seqs
Testing: 94,000 mutated seqs
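The Spearman rank correlation reported on these slides compares the ordering of predictions with the ordering of measurements, not their raw values. A minimal pure-Python sketch follows; the toy numbers are made up and are not from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy values: predicted binding vs. a measured quantity.
predicted = [0.1, 0.4, 0.35, 0.8, 0.9]
measured = [-2.0, -1.0, -1.5, 0.5, 0.2]
rho = spearman(predicted, measured)
```

Because only ranks matter, the predictor is rewarded for ordering mutants correctly even if its scores are on a different scale from the measured constants.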
Short Linear Motifs
Mészáros et al. 2020