1 of 63

Transfer Learning for Predicting Virus-Host Protein Interactions for Novel Virus Sequences

Jack Lanchantin¹, Tom Weingarten², Arshdeep Sekhon¹, Clint Miller¹, Yanjun Qi¹

¹University of Virginia, ²Google

2 of 63

Proteins: the building blocks of life

oxygen transportation

antibodies

digestive enzymes

Lanchantin, Weingarten, Sekhon, Miller, Qi - ACM-BCB 2021

3 of 63

What is a protein?

covalent bonds

amino acids

sequences with alphabet of 20 characters
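Since a protein can be treated as a string over a 20-letter alphabet, a model first needs each residue mapped to an integer token. The following is a minimal sketch of that step; the names `AMINO_ACIDS`, `encode`, and `decode` and the alphabetical ordering are illustrative assumptions, not taken from the talk.

```python
# The 20 standard amino acids, one letter each.
# The alphabetical ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein string to a list of integer token ids."""
    unknown = [aa for aa in sequence if aa not in TOKEN_ID]
    if unknown:
        raise ValueError(f"non-standard residues: {unknown}")
    return [TOKEN_ID[aa] for aa in sequence]


def decode(token_ids):
    """Inverse of encode: token ids back to a protein string."""
    return "".join(AMINO_ACIDS[i] for i in token_ids)


example = "MQGHFTETKHE"  # example sequence from the deck
ids = encode(example)
print(ids)
```

Real protein datasets also contain ambiguous or rare residue codes (for example `X`), which a production tokenizer would need to handle rather than reject.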

4 of 63

What is a protein?

MQGHFTETKHE

5 of 63

What is a protein?

MHFTEDKATILWGKVNVEGETLGRVYPWQ

Tertiary Structure

Secondary Structure

Primary Sequence

6 of 63

Structure Determines Function

7 of 63

Protein-Protein Interactions (PPIs)

8 of 63

Virus-Host Protein-Protein Interactions

SARS-CoV-2

Spike Protein

Human

ACE2 Protein

9 of 63

Virus-Host Protein-Protein Interactions

Zika

HIV

HIV

Zika

MSTCLAMVK

CGPKKSTNL

SPKRARSV

ADYSVLYNS

LDKYFKN

HKMFYN

VLNDILS

Human protein

Known interaction

Virus protein

10 of 63

Virus-Host Protein-Protein Interactions

Zika

SARS-CoV-2

HIV

HIV

Zika

MSTCLAMVK

CGPKKSTNL

SPKRARSV

MFVFLVLL

ADYSVLYNS

LDKYFKN

HKMFYN

VLNDILS

Novel virus protein

Human protein

Known interaction

Virus protein

Potential interaction

11 of 63

MGHFTWGWGTSLWGKVNVEDAYPWTQ

TWGWGLFDKATETWGKVNVKDKLWG

KVWGKVNVEDYPWT

MGHFTEKEKLFVEDAYPWTQ

TEEDKADKATEDAYPWT

MGHFWITSLNKDLWGKSVEDAMGHF

VNVEDAYDKATSSLNFTDLWGK

Sequences: fast & cheap

Structure/Interactions: slow & expensive

Protein Data

13 of 63

Protein-Protein Interaction Prediction from Sequences

Gomez et al. 2003, Ben-Hur and Noble 2005, Qi et al. 2010, Eid et al. 2015, Yang et al. 2020

interaction: yes/no

Model

PSRDCNHIA

TMSQPHRSAH

14 of 63

Novel Virus-Human Protein Interaction Prediction from Sequence

interaction: yes/no

PSRDCNHIA

TMSQPHRSAH

Model

novel virus protein

human protein

15 of 63

Novel Virus-Human Protein Interaction Prediction from Sequence

  • Limited interaction data available

  • Interactions are largely determined by structure

interaction: yes/no

PSRDCNHIA

TMSQPHRSAH

Model

novel virus protein

human protein

17 of 63

Self-Supervised Pretraining and Transfer Learning for Limited Labeled Data

Model

the cat ___ on the mat

Model

1. language model training

2. domain-specific training

the cat sat on the mat

this phone is great!

transfer

18 of 63

Novel Virus-Human Protein Interaction Prediction from Sequence

interaction: yes/no

PSRDCNHIA

TMSQPHRSAH

Model

novel virus protein

human protein

  • Limited interaction data available

  • Interactions are largely determined by structure

20 of 63

Structure Prediction from Sequence

Lin, Lanchantin, Qi 2016

Jumper et al. 2020

21 of 63

Transfer Learning for Sequence-Based Interaction Prediction

1. Masked Language Model (MLM)

TMQ_HRSPS_DCNH_A → DeepVHPPI → MLM → A M QA

transfer

2. Structure Prediction (SP)

LSWGKVNVEDA → DeepVHPPI → SS / CT / RH (e.g. beta barrel)

transfer

3. PPI Prediction

PSRDCNHIA (virus), MSVKHSKH (host) → DeepVHPPI, DeepVHPPI → PPI → interaction: yes/no

23 of 63

TMQ_HRSPS_DCNH_A

DeepVHPPI

TMQAHRSPSMDCNHQA

Masked Language Model (MLM) Pretraining

MLM Classifier
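The masking step in this slide can be sketched as follows: some residues are hidden behind a mask symbol, and the model is trained to recover them. This is an illustrative sketch only; the function names, the `_` mask symbol as a Python constant, and the 15% masking rate (a common default in masked language modeling) are assumptions, not details confirmed by the talk.

```python
import random

MASK = "_"  # mask symbol as drawn on the slide


def mask_sequence(sequence, mask_prob=0.15, seed=0):
    """Randomly hide residues; return the masked string and the hidden targets.

    mask_prob=0.15 is an assumed value, not taken from the talk.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, aa in enumerate(sequence):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[pos] = aa  # what the MLM classifier should predict here
        else:
            masked.append(aa)
    return "".join(masked), targets


def fill_masks(masked, predictions):
    """Write predicted residues back into the masked positions."""
    chars = list(masked)
    for pos, aa in predictions.items():
        chars[pos] = aa
    return "".join(chars)


# The slide's example: three masked positions and their recovered residues.
masked_input = "TMQ_HRSPS_DCNH_A"
slide_targets = {3: "A", 9: "M", 14: "Q"}
print(fill_masks(masked_input, slide_targets))
```

Because the targets come from the sequence itself, this objective needs no interaction or structure labels, which is what makes it usable on large unlabeled protein collections.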

25 of 63

Structure Prediction (SP) Pretraining

DeepVHPPI

Secondary Structure

Contact

Remote Homology

beta barrel

VMQKHRSPHSDCNKVA

27 of 63

Interaction Prediction

DeepVHPPI

M P S R D C

K T M S Q

virus protein

human protein

DeepVHPPI

P(interaction)

Interaction Classifier
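The data flow drawn here (two proteins, an encoder applied to each, one classifier producing P(interaction)) can be sketched with a toy stand-in. Everything in this block is an assumption for illustration: the encoder is a simple amino-acid composition vector rather than the DeepVHPPI Transformer, and the classifier is a plain logistic unit with untrained weights.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(sequence):
    """Stand-in encoder: amino-acid composition (fractions that sum to 1)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def interaction_probability(virus_seq, human_seq, weights, bias=0.0):
    """Encode both proteins with the same encoder, then apply a logistic classifier."""
    features = encode(virus_seq) + encode(human_seq)  # concatenate the two encodings
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


weights = [0.0] * 40  # untrained placeholder weights (20 per protein)
p = interaction_probability("MPSRDC", "KTMSQ", weights)
print(p)
```

Sharing one encoder across both inputs means anything learned during pretraining applies to virus and human sequences alike; only the small classifier on top is specific to the interaction task.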

31 of 63

Transformer

Transformers

the bat was swung through the air

Vaswani et al. 2017, Devlin et al. 2018

32 of 63

the bat was flying through the air

Transformers

Transformer

Vaswani et al. 2017, Devlin et al. 2018

34 of 63

M K Q E K T H

Transformers for Protein Sequences

Transformer

Rives et al. 2019
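The core Transformer operation, self-attention, lets every residue's representation depend on every other residue in the sequence. Below is a deliberately simplified single-head sketch using one-hot residue vectors; it omits the learned query/key/value projections, multiple heads, and positional information that a real Transformer uses, and all names are illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """Toy residue embedding: a 20-dimensional one-hot vector."""
    return [1.0 if aa == other else 0.0 for other in AMINO_ACIDS]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = the inputs."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        )
    return outputs


sequence = "MKQEKTH"  # the residues shown on the slide
contextual = self_attention([one_hot(aa) for aa in sequence])
print(len(contextual), len(contextual[0]))
```

Without positional information, the two K residues receive identical outputs here, which is one reason real models add position encodings.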

35 of 63

protein motif (i.e. “word”)

M K Q E K T H

Transformers for Protein Sequences

Transformer

36 of 63

DeepVHPPI Transformer

40 of 63

Use Cases of Sequence-Based Interaction Predictors

SARS-CoV-2 Spike

MFVFLVLL

LDKYFKN

Human ACE2

1. predict novel virus interactions

2. analyze how mutations affect interactions

42 of 63

Experimental Setup

  • Training Data: HPIDB 3.0 Dataset
    • 22,000 positive interactions, 226,000 negative interactions
    • 1,100 virus proteins, 20,000 host (human) proteins
  • Testing Data:
    • H1N1, Ebola interactions from Zhou et al.
    • Our own curated SARS-CoV-2 interactions collected from BioGRID

43 of 63

Protein-Protein Interaction Prediction Results

Method                         H1N1¹            Ebola¹           SARS-CoV-2²
                               AUROC   F1       AUROC   F1       AUROC   F1
SVM (Zhou et al.)              0.886   0.762    0.867   0.760    -       -
Embedding + RF (Yang et al.)   -       -        -       -        0.748   0.115
DeepVHPPI                      0.894   0.819    0.927   0.836    0.726   0.089
DeepVHPPI + MLM                0.910   0.837    0.943   0.867    0.735   0.095
DeepVHPPI + MLM + SP           0.926   0.848    0.979   0.895    0.767   0.105

¹Even positive/negative testing split
²Imbalanced positive/negative testing split
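The two reported metrics behave differently on imbalanced test sets: AUROC measures how well positives are ranked above negatives, while F1 depends on a decision threshold and is penalized by false positives among the many negatives. The sketch below computes both from scratch on made-up toy scores; the data, the 0.5 threshold, and the function names are assumptions for illustration and are unrelated to the reported numbers.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(labels, scores, threshold=0.5):
    """Harmonic mean of precision and recall at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced set: 2 positives, 8 negatives (invented for illustration).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]
print(auroc(labels, scores), f1(labels, scores))
```

On this toy set the ranking is nearly perfect while F1 is much lower, the same qualitative gap seen in the SARS-CoV-2 columns of the table.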

46 of 63

PSRQCNHIA

TMSQPHRSA

Transformer

0.35

interaction

no interaction

Perturbation Analysis: Investigating Mutations

47 of 63

Perturbation Analysis: Investigating Mutations

SARS-CoV-2 Spike

MFVFLVLL

LDKYFKN

Human ACE2

y = 1

49 of 63

Perturbation Analysis: Investigating Mutations

SARS-CoV-2 Spike

MFVFPVLL

LDKYFKN

Human ACE2

y = 1

50 of 63

Perturbation Analysis: Investigating Mutations

SARS-CoV-2 Spike

MFVFLVLL

LDKYFKN

Human ACE2

y = 1

52 of 63

Perturbation Analysis: Investigating Mutations

SARS-CoV-2 Spike

MFKFLPLL

LDKYFKN

Human ACE2

y = 0

53 of 63

Perturbation Analysis

DeepVHPPI

P S R D C

T M S Q

C

DeepVHPPI

C

Classifier

0.95

interaction

no interaction

virus protein

human protein

55 of 63

P Q R D C

T M S Q

C

virus protein

human protein

C

0.35

DeepVHPPI

DeepVHPPI

Classifier

interaction

no interaction

Perturbation Analysis
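Perturbation analysis as drawn here amounts to mutating the virus sequence, re-running the predictor, and comparing the new score to the original. The sketch below enumerates every single-residue substitution and records the score change. The scoring function `toy_score` is a hypothetical placeholder standing in for a trained predictor, and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield (position, old, new, mutant) for every single-residue substitution."""
    for pos, old in enumerate(sequence):
        for new in AMINO_ACIDS:
            if new != old:
                yield pos, old, new, sequence[:pos] + new + sequence[pos + 1:]


def perturbation_scan(virus_seq, human_seq, score_fn):
    """Score change of each virus mutant relative to the unmutated pair."""
    baseline = score_fn(virus_seq, human_seq)
    return [
        (pos, old, new, score_fn(mutant, human_seq) - baseline)
        for pos, old, new, mutant in single_substitutions(virus_seq)
    ]


def toy_score(virus_seq, human_seq):
    """Hypothetical placeholder predictor: overlap of residue types (Jaccard index)."""
    v, h = set(virus_seq), set(human_seq)
    return len(v & h) / len(v | h)


scan = perturbation_scan("PSRDC", "TMSQ", toy_score)
most_harmful = min(scan, key=lambda row: row[3])
print(len(scan), most_harmful)
```

With a real predictor in place of `toy_score`, the positions with the largest score drops are candidates for residues that matter to the interaction.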

56 of 63

Experimental Setup

  • 105,528 mutated Spike sequences and their corresponding ACE2 binding affinities from Starr et al. 2020
  • Training / Test splits
    • 100 training, 105,428 testing
    • 1,000 training, 104,528 testing
    • 10,000 training, 95,528 testing
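The three splits above all partition the same 105,528 sequences, so each test size is simply the total minus the training size. The sketch below reproduces those sizes with a seeded random split; whether the original splits were drawn uniformly at random is an assumption, as are the function name and seed.

```python
import random

TOTAL = 105_528  # mutated Spike sequences, as stated on the slide


def split_indices(n_total, n_train, seed=0):
    """Shuffle indices once and cut into disjoint train and test sets.

    A uniform random split is assumed; the talk does not say how the split was drawn.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


sizes = {}
for n_train in (100, 1_000, 10_000):
    train, test = split_indices(TOTAL, n_train)
    sizes[n_train] = len(test)
    print(n_train, len(test))
```

The smallest setting trains on under 0.1% of the data, which is the regime where pretrained representations are expected to matter most.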

57 of 63

Perturbation Analysis: Mutated Spike and ACE2 Interactions

Binding Prediction

Dissociation constant (log10ka)

58 of 63

Perturbation Analysis: Mutated Spike and ACE2 Interactions

59 of 63

Summary

1. Transfer learning for protein-protein interaction prediction

2. Predict novel virus interactions

3. Analyze how mutations affect interactions

60 of 63

Arshdeep Sekhon

Yanjun Qi

Tom Weingarten

Thank You

Clint Miller

61 of 63

Perturbation Analysis: Investigating Mutations

Train: 108

Valid: 10,552

Test: 94,872

Spearman rank correlation: 0.459 (pvalue=0.0)

Binding Prediction

Dissociation constant (log10ka)

Lanchantin, Weingarten, Sekhon, Miller, Qi - MLCB 2020

Training: 100 mutated seqs

Testing: 95,000 mutated seqs
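Spearman rank correlation, the metric quoted on this slide, compares the ordering of predicted scores with the ordering of measured affinities rather than their raw values. A minimal from-scratch sketch follows (Pearson correlation computed on average ranks, so ties are handled); the function names and the toy numbers are assumptions and are unrelated to the reported 0.459.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


predicted = [1, 2, 3, 4, 5]  # toy binding predictions (invented)
measured = [2, 1, 4, 3, 5]   # toy measured affinities (invented)
print(spearman(predicted, measured))
```

Because only ranks matter, the metric rewards a model that orders mutants correctly even if its scores are on a different scale from the measured constants.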

62 of 63

Perturbation Analysis: Mutated Spike and ACE2 Interactions

Spearman rank correlation: 0.72 (pvalue=0.0)

Binding Prediction

Dissociation constant (log10ka)

Training: 1,000 mutated seqs

Testing: 94,000 mutated seqs

63 of 63

Short Linear Motifs

Mészáros et al. 2020