1 of 94

ColabDesign

Making protein design accessible to all via Google Colab

github.com/sokrypton/ColabDesign

2 of 94

Outline

  • Recap: TrRosetta
  • TrDesign - Inverting TrRosetta for Protein Design
  • TrMRF - Modeling P(seq|structure)
  • Experimental Results
  • Modern methods: RoseTTAFold & AlphaFold
  • Jue Wang - Partial Hallucination
  • AfDesign - Inverting AlphaFold for Protein Design

3 of 94

TrRosetta - Takes sequences and returns structure

predict

MRF

((X-X̄)ᵀ(X-X̄)/(N-1) + αI)⁻¹

α = 4.5/√N

PSSM

X̄ = ΣX/N

Sequences (X)

Sequence

X[0]

Rosetta

Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences. 2020 Jan 2.

MSA
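The MRF and PSSM features on this slide can be computed directly from a one-hot MSA. Below is a minimal NumPy sketch of the two formulas, assuming the MSA is already one-hot encoded; the function name `msa_features`, the 21-state alphabet, and the random toy MSA are illustrative assumptions, and the real TrRosetta pipeline may include further steps that the slide does not show.

```python
import numpy as np

def msa_features(msa_onehot):
    """msa_onehot: float array of shape (N, L, A), one-hot over A states."""
    N, L, A = msa_onehot.shape
    X = msa_onehot.reshape(N, L * A)

    # PSSM / conservation: X̄ = ΣX / N
    x_mean = X.sum(axis=0) / N

    # Regularized covariance, then its inverse: ((X-X̄)ᵀ(X-X̄)/(N-1) + αI)⁻¹
    alpha = 4.5 / np.sqrt(N)
    centered = X - x_mean
    cov = centered.T @ centered / (N - 1)
    inv_cov = np.linalg.inv(cov + alpha * np.eye(L * A))

    pssm = x_mean.reshape(L, A)
    mrf = inv_cov.reshape(L, A, L, A)   # one A x A coupling block per position pair
    query = X[0].reshape(L, A)          # X[0] is the first (query) sequence
    return pssm, mrf, query

# Toy usage on a random MSA (hypothetical data, for shape checking only).
rng = np.random.default_rng(0)
N, L, A = 50, 6, 21
msa = np.eye(A)[rng.integers(0, A, size=(N, L))]
pssm, mrf, query = msa_features(msa)
```

The αI term keeps the covariance invertible even when N is much smaller than L*A, which is the usual situation for real MSAs.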

4 of 94

predict

Rosetta

Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences. 2020 Jan 2.

Features

Sequence

Using a single sequence, the method can accurately predict de novo designed proteins

5 of 94

Top7

β-barrel

IL2-mimetic

jelly-roll

TM-score=0.82

0.70

0.78

0.76

Peak7

0.87

Ferredog-Diesel

0.76

BeNTF2

0.79

Slide modified from Ivan Anishchenko

Using a single sequence, the method can accurately predict de novo designed proteins

6 of 94

Using a single sequence, the method can accurately predict de novo designed proteins, but not natural proteins

de novo proteins

natural proteins

Accuracy (TM-score)

7 of 94

Can we invert TrRosetta for Protein Design?

EEAREWAEKWGAN

Target (P)

backprop

predict

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

Q = TrRosetta(X)

Random Sequence (X)
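The loop on this slide (predict, compute L = -ΣP*log(Q), backprop to the sequence) can be sketched with a toy differentiable predictor. Everything below is a stand-in: the linear-softmax `model`, the sizes, the step count, and the learning rate are assumptions chosen so the gradient can be written by hand in NumPy. It is not TrRosetta and only illustrates the optimization pattern.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, A, K = 13, 20, 10             # length, amino acids, output bins (toy sizes)
W = rng.normal(size=(A, K))      # toy stand-in for the network weights

def model(X):
    """Toy predictor: a distribution over K bins for each position."""
    return softmax(X @ W)

# Target P comes from a hidden sequence, so a good solution is known to exist.
true_seq = rng.integers(0, A, size=L)
P = model(np.eye(A)[true_seq])

logits = 0.01 * rng.normal(size=(L, A))   # random starting sequence
losses = []
for step in range(500):
    X = softmax(logits)                    # relaxed (continuous) sequence
    Q = model(X)
    losses.append(-(P * np.log(Q)).sum())  # L = -ΣP*log(Q)

    # Backprop by hand through the toy model.
    dZ = Q - P                             # dL/d(X @ W); valid because rows of P sum to 1
    dX = dZ @ W.T                          # dL/dX
    dlogits = X * (dX - (dX * X).sum(axis=-1, keepdims=True))  # through softmax
    logits -= 0.1 * dlogits                # gradient descent step

designed = logits.argmax(axis=-1)          # read out a discrete sequence
```

With a real network the hand-written gradient would be replaced by automatic differentiation; the structure of the loop stays the same.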

8 of 94

But how is this different from other models that return sequence given structure?

They do not "see" the alternative conformation the sequence can adopt!

9 of 94

Explicit conformational landscape optimization

Target

Off-Target

Target

Chris Norn

Basile Wicky

P(sequence | structure)

P(structure | sequence)

Protein sequence design by conformational landscape optimization. Norn, C., Wicky, B.I., Juergens, D., Liu, S., Kim, D., Tischer, D., Koepnick, B., Anishchenko, I., Baker, D. and Ovchinnikov, S., 2021. Proceedings of the National Academy of Sciences, 118(11).

10 of 94

Experiment - for a given backbone, design a new sequence

  • For ~200 proteins from the PDB not in the TrRosetta training set, we design a new sequence using Rosetta (as is) and TrRosetta.

Remove sequence

Design new sequence

Compare sequences to compute "Sequence recovery" (seqid)
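Sequence recovery as used here is the fraction of positions where the designed sequence matches the native one. A minimal sketch, assuming both sequences come from the same backbone and therefore have equal length; the function name and the example "designed" sequence are hypothetical.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions with the same amino acid (seqid)."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length (same backbone)")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

native = "EEAREWAEKWGAN"
designed = "EEAKEWAERWGAN"   # hypothetical redesign with two substitutions
recovery = sequence_recovery(native, designed)
```

No alignment step is needed because the backbone is fixed, so position i in the design corresponds to position i in the native sequence.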

11 of 94

TrRosetta Sequence recovery is too low compared to Rosetta

Rosetta

TrRosetta

12 of 94

Why is the sequence recovery so low?

  • Hypothesis 1 - adversarial sequences
  • Hypothesis 2 - the model is low resolution
  • Hypothesis 3 - the model only sees a subset of key amino acids
  • Hypothesis 4 - we are modeling P(structure | sequence) w/o P(sequence)
  • Hypothesis 5 - natural backbones require MSA to design

13 of 94

H1 - adversarial sequences

Understanding Deep Image Representations by Inverting Them. Aravindh Mahendran, Andrea Vedaldi. https://arxiv.org/abs/1412.0035

14 of 94

Comparing pairwise potential

log(P(A,B|Contact) / (P(A)*P(B)))

Native

Design
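The pairwise potential compared on this slide is a log-odds score: how often two amino acid types are seen in contact, relative to what their background frequencies predict. A rough sketch of how such a matrix could be tabulated, assuming contacts are given as index pairs per protein; the pseudocount and the input layout are assumptions, not the procedure used for the plots.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def pairwise_potential(proteins, pseudocount=1.0):
    """proteins: iterable of (sequence, contacts), contacts = [(i, j), ...].
    Returns a 20x20 matrix of log(P(A,B|Contact) / (P(A)*P(B)))."""
    pair = np.full((20, 20), pseudocount)
    single = np.full(20, pseudocount)
    for seq, contacts in proteins:
        for a in seq:
            single[IDX[a]] += 1
        for i, j in contacts:
            a, b = IDX[seq[i]], IDX[seq[j]]
            pair[a, b] += 1
            pair[b, a] += 1      # contacts are unordered, so count both orders
    p_pair = pair / pair.sum()
    p_single = single / single.sum()
    return np.log(p_pair / np.outer(p_single, p_single))

# Toy input: one sequence whose leucines contact each other (hypothetical).
toy = [("LLKKLL", [(0, 1), (4, 5), (0, 5)])]
potential = pairwise_potential(toy)
```

Running the same function on native and designed sequences gives two matrices that can be compared entry by entry.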

15 of 94

Comparing secondary structure propensity

(E) Sheets

(H) Helix

log(P(AA|SS) / P(AA))

Native

Design

16 of 94

Comparing average depth per amino acid type

Native (depth)

Distribution of hydrophobic residues

Depth = Distance from surface

aka. how buried is each amino acid

Design (depth)

Amino acid order on the plot axes: I F L C V W M Y A H R T Q S N D E K G P

17 of 94

H2 - low resolution model

18 of 94

Use backrub to generate "low resolution" backbones and design new sequences with Rosetta

Ollikainen, Noah, and Tanja Kortemme. "Computational protein design quantifies structural constraints on amino acid covariation." PLoS computational biology 9, no. 11 (2013): e1003313.

temperature

19 of 94

Perturb backbone, redesign with Rosetta

Given that our structure accuracy for de novo designed proteins is ~0.7 TM-score, ~0.15 sequence recovery is expected

20 of 94

H3 - only a subset of sequence or contacts being optimized

21 of 94

Run TrRosetta many times independently and generate many sequences

EEAREWAEKWGAN

Target (P)

backprop

predict

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

Q = TrRosetta(X)

Random Sequence (X)

22 of 94

Run TrRosetta many times independently and generate many sequences

23 of 94

Pass sequences to GREMLIN (coevolution method) to find covariance between positions

Generate 10K sequences and run coevolution analysis on these.

24 of 94

Only a subset of contacts coevolving in designed sequences!

Residue pairs coevolving

Residues in contact

25 of 94

TrRosetta - Takes sequences and returns structure

NN

Coevolution

((X-X̄)ᵀ(X-X̄)/(N-1) + αI)⁻¹

α = 4.5/√N

Conservation

X̄ = ΣX/N

Sequence

X[0]

pyRosetta

6D Features

Structure

Sequences

26 of 94

TrRosetta's NN

526 features

...

dilated residual blocks

ELU

Conv2D, d

InstanceNorm

Dropout

ELU

Conv2D, d

InstanceNorm

d = 1,2,4,8,16
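Cycling the dilation through d = 1,2,4,8,16 lets each residue pair see a wide neighborhood of the 2D feature map after only a few blocks. The arithmetic can be checked with a short sketch; the 3x3 kernel size is an assumption (the slide shows only "Conv2D, d"), and the two convolutions per block follow the block diagram above.

```python
def receptive_field(dilations, kernel_size=3, convs_per_block=2):
    """Receptive field (in residues, along one axis) after stacking blocks."""
    rf = 1
    for d in dilations:
        # Each dilated convolution widens the field by (kernel_size - 1) * d.
        rf += convs_per_block * (kernel_size - 1) * d
    return rf

one_cycle = [1, 2, 4, 8, 16]
print(receptive_field(one_cycle))        # 125
print(receptive_field(one_cycle * 2))    # 249
```

Under these assumptions a single cycle of five blocks already spans 125 residues, which is longer than many small designed proteins.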

pyRosetta

Slide modified from�Ivan Anishchenko

27 of 94

What is each layer doing?

Coevolution

Conservation

Sequences (X)

Sequence

28 of 94

What is each layer doing?

Coevolution

Conservation

Sequences (X)

Sequence

probability

distance, Å

To make life easier let's focus on distance matrix

29 of 94

What is each layer doing?

Coevolution

Conservation

Sequences (X)

Sequence

Or contact map

30 of 94

What is each layer doing?

Coevolution

Conservation

Sequences (X)

Sequence

Or contact map

31 of 94

Investigating each block in resnet

32 of 94

Covarying residues match those in block 8

Are these layers just "connecting the dots"?

33 of 94

It appears that only a subset of the sequence or contacts is being optimized.

Can we push the model to optimize all contacts?

34 of 94

TrRosetta - Takes sequences and returns structure

TrRosetta

Coevolution

((X-X̄)ᵀ(X-X̄)/(N-1) + αI)⁻¹

α = 4.5/√N

Conservation

X̄ = ΣX/N

Sequence

X[0]

pyRosetta

6D Features

Structure

Sequences

pred_feats = TrRosetta(sequences)
loss = CCE(true_feats, pred_feats)

35 of 94

TrMRF - Takes structure and returns sequences

Coevolution

Conservation

Predicted Sequences

W,b = TrMRF(structure)
pred_sequences = softmax(sequences@W + b)
loss = CCE(sequences, pred_sequences)
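The TrMRF pseudocode above is a pseudo-likelihood objective: each position is predicted from all the others through W and b. A minimal NumPy sketch of the loss term, assuming one-hot sequences of shape (N, L, A) and a coupling matrix of shape (L*A, L*A); the shapes, the helper names, and zeroing of the self-coupling blocks are assumptions, and the network that produces W and b from structure is not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mrf_loss(sequences, W, b):
    """sequences: one-hot (N, L, A); W: (L*A, L*A) couplings; b: (L, A) biases.
    Returns the mean per-sequence categorical cross-entropy (CCE)."""
    N, L, A = sequences.shape
    flat = sequences.reshape(N, L * A)
    logits = (flat @ W).reshape(N, L, A) + b
    pred_sequences = softmax(logits)            # softmax over amino acids
    cce = -(sequences * np.log(pred_sequences)).sum(axis=(1, 2))
    return cce.mean()

def zero_self_coupling(W, L, A):
    """Assumption: a position must not predict itself, so zero its own block."""
    W = W.reshape(L, A, L, A).copy()
    W[np.arange(L), :, np.arange(L), :] = 0.0
    return W.reshape(L * A, L * A)
```

With W and b set to zero the prediction is uniform, so the loss equals L*log(A); any useful W, b should score below that baseline.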

Sequences

6D Features

Structure

TrMRF

36 of 94

Target (P)

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

EEAREWAEKWGAN

Sequence (X)

TrRosetta

37 of 94

Target (P)

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

EEAREWAEKWGAN

Sequence (X)

TrRosetta

TrMRF

38 of 94

Target (P)

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

EEAREWAEKWGAN

Sequence (X)

TrRosetta

TrMRF

39 of 94

TrRosetta Sequence recovery is too low compared to Rosetta

Rosetta

TrRosetta

40 of 94

Adding TrMRF loss significantly improved sequence recovery!

41 of 94

Experiment

  • TrROS - Hallucinate new proteins using TrRosetta.
  • TrMRF - Redesign Proteins using TrMRF objective.
  • TrROS+TrMRF - Redesign Proteins using TrRosetta+TrMRF objective.
  • Experimentally test designs for protease resistance (proxy for stability).
  • Compare initial hallucinated proteins to redesigned proteins.

Gabriel J Rocklin

Tsuboyama Kotaro

Justas Dauparas

42 of 94

Hallucinate thousands of new proteins

Loss (L): L = KL w/background

Prediction (Q)

EEAREWAEKWGAN

Sequence (X)

TrRosetta

pyRosetta

Structure

43 of 94

Redesign proteins

TrROS = Hallucination

TrMRF = Fixed backbone design

TrROS+TrMRF = Cont. Hallucination

44 of 94

Results - Protease Resistance (Stability) Assay

Results look TOO good

Tsuboyama Kotaro

stable

unstable

stable

unstable

45 of 94

We noticed some of the most stable designs had hydrophobic patches, which AlphaFold predicts as homo-oligomers

Justas Dauparas

46 of 94

Inter PAE

Inter PAE

resistance to trypsin

resistance to chymotrypsin

Let's compute inter-PAE values for our designs

TrMRF + TrROS

TrMRF + TrROS

Though most designs are predicted as oligomers, the monomers are still "stable"

monomer

oligomer

monomer

oligomer

stable

unstable

stable

unstable

47 of 94

AlphaFold filtering

interPAE > 15

pLDDT > 70
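The two thresholds on this slide can be applied as a simple filter. A hedged sketch, assuming each design was predicted as two copies of the same chain, that inter-PAE is the mean of the off-diagonal (chain-to-chain) blocks of the PAE matrix, and that pLDDT is on a 0-100 scale; the function names and the averaging are assumptions.

```python
import numpy as np

def inter_pae(pae, chain_len):
    """Mean predicted aligned error between two chains of equal length.
    pae: (2*chain_len, 2*chain_len) matrix from a two-copy prediction."""
    a_to_b = pae[:chain_len, chain_len:]
    b_to_a = pae[chain_len:, :chain_len]
    return 0.5 * (a_to_b.mean() + b_to_a.mean())

def passes_filter(pae, plddt, chain_len, min_inter_pae=15.0, min_plddt=70.0):
    """Keep confident designs that AlphaFold does not predict as oligomers."""
    return inter_pae(pae, chain_len) > min_inter_pae and plddt.mean() > min_plddt
```

A high inter-PAE means AlphaFold is unsure how the two copies sit relative to each other, which is read here as "probably a monomer".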

48 of 94

Results - Protease Resistance (Stability) Assay

(removing proteins predicted as oligomers by AlphaFold)

Results appear to hold even after filtering for oligomers

interPAE > 15

pLDDT > 70

stable

unstable

stable

unstable

49 of 94

Combined loss significantly better!

TrROS

TrROS

TrROS+TrMRF

50 of 94

Conclusions

  • Combining an inverted NN that sees the entire conformational landscape with an NN that optimizes energy given structure appears to result in more "stable" designs.

P(structure | sequence) ~ AbRelax (FF) ~ TrRosetta ~ AlphaFold ~ RoseTTAFold
P(sequence) = 1.0

P(sequence | structure) ~ Rosetta ~ TrMRF
P(structure) = 1.0

P(structure, sequence) ~ P(structure | sequence) * P(sequence | structure)

51 of 94

Modern methods: RoseTTAFold / AlphaFold

Fig. 1: RoseTTAFold accurately predicts structures of de-novo-designed proteins from their amino acid sequences.

Minkyung Baek

52 of 94

Jue Wang

Partial Hallucination

53 of 94

Can we use DL models to design functional proteins?

Scaffolding functional motifs with hallucination

Goal: small, stable protein with desired motif

Doug Tischer

Sidney Lisanza

David Juergens

54 of 94

Current state-of-the-art

Graft motif into existing protein

Build up scaffold from fragments

Sesterhenn et al. 2020

Limited by databases

Limited to simple motifs; Need to choose fold beforehand

55 of 94

“Hallucination” can create new structures without referencing existing backbone

Anishchenko, Pellock et al. 2021

56 of 94

Partially constrained hallucination: embedding pre-specified motif

Doug

Sidney

Ivan

Sergey

  • Can sample all of protein structure space…?
  • No assumptions about topology or secondary structure of scaffold

57 of 94

Scaffolding p53 helix binding to mdm2

AF2 prediction

RF hallucination

Native

Native motif (p53 helix)

Design motif

Binding partner (mdm2)

58 of 94

Designed p53 mimetics bind to mdm2 in yeast display

p53 helix control

1 µM target

p53 helix control

no target

Pool of hallucinated designs

1 µM target

Expression / surface display of protein

Binding signal

59 of 94

Binder hallucination pipeline

Monomer hallucination

Filtering, experiments

Stage

Loss

  • Entropy (scaffold hallucination)
  • Disto/anglegram CCE (motif)
  • Radius of gyration
  • Repulsion
  • Entropy (scaffold hallucination)
  • Disto/anglegram CCE (motif)
  • Radius of gyration
  • Surface nonpolar
  • Net charge

2-chain hallucination

Iterate & diversify best outputs from previous round

Pairwise dist/angles

Monomer

Monomer + interface

Optional: sequence design

60 of 94

Problem-specific loss functions

Native motif

Design motif

Binding partner

Problem: Hallucinations clash with binding partner (RSVF site V)

Solution: Include repulsive loss

Result: (Almost) no clashes
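A repulsive loss of the kind described here penalizes any design atom that comes too close to the binding partner. One possible form is sketched below, assuming both inputs are arrays of atom coordinates in Å; the squared-overlap functional form and the 5 Å cutoff are assumptions, not the values used for the RSVF site V designs.

```python
import numpy as np

def repulsion_loss(design_xyz, partner_xyz, min_dist=5.0):
    """Penalize design atoms closer than min_dist (Å) to the binding partner.
    design_xyz: (n, 3), partner_xyz: (m, 3). Zero when there are no clashes."""
    diff = design_xyz[:, None, :] - partner_xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-8)   # all design-partner distances
    overlap = np.maximum(min_dist - dist, 0.0)        # how far inside the cutoff
    return (overlap ** 2).sum()
```

Because the penalty is zero beyond the cutoff, it does not push the scaffold away from the partner once the clash is resolved.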

61 of 94

Back to Sergey: AfDesign

62 of 94

initial XYZ

Sequence

MSA

PSSM

seq DB

XYZ

Recycle

Structure Module

MSA

Module

AlphaFold2

Templates

pdb DB

not differentiable

63 of 94

AlphaFold: Number of recycles needed to predict de novo designed proteins

64 of 94

James Roney

Why is AlphaFold sooooo bad at num_recycle=0?

65 of 94

Why is num_recycle=0 so bad?

If num_recycle=0, this disables a series of layers

66 of 94

bugfix

Alternative (without making source code changes):

model_config.model.num_recycle = 1

model_config.data.common.num_recycle = 1

processed_feature_dict["num_iter_recycling"] = np.array([0])

James Roney

67 of 94

Bugfix significantly improves the accuracy on de novo designed proteins at num_recycle=0

68 of 94

Number of recycles needed to predict de novo designed proteins

69 of 94

By changing a few configs we can reduce runtime! (using Google Colab Free = Tesla K80)

length=100  length=50  settings
26.7s       13.9s      (msa=512, subbatch=4, recycles=3, backprop=False)
 2.5s        0.8s      (msa=1, subbatch=4, recycles=3, backprop=False)
 1.9s        0.5s      (msa=1, subbatch=None, recycles=3, backprop=False)
 0.5s        0.1s      (msa=1, subbatch=None, recycles=0, backprop=False)
 1.8s        0.4s      (msa=1, subbatch=None, recycles=0, backprop=True)
 3.2s        1.1s      (msa=1, subbatch=4, recycles=0, backprop=True)

70 of 94

initial XYZ

Sequence

MSA

PSSM

seq DB

XYZ

Recycle

Structure Module

MSA

Module

AlphaFold2

Templates

pdb DB

71 of 94

XYZ

MSA

Module

AlphaFold2: inputs/outputs

Distogram

Confidence

Sequence

Templates

Sequence

72 of 94

Target (P)

Prediction (Q)

EEAREWAEKWGAN

Q = Model(X)

Random Sequence (X)

73 of 94

Target (P)

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

EEAREWAEKWGAN

Q = Model(X)

Random Sequence (X)

74 of 94

How to backprop through discrete input?

Gradient (∇): ∇ = ∂L/∂X

Target (P)

Loss (L): L = -ΣP*log(Q)

Prediction (Q)

EEAREWAEKWGAN

Q = Model(X)

Random Sequence (X)

?

75 of 94

Test Case - 1TEN (Fibronectin type-III domain 3)

Starting from a random sequence, the MCMC protocol failed [1] for this example (RMSD = 4.36)

Let's try backprop!

  1. Using AlphaFold for Rapid and Accurate Fixed Backbone Protein Design. Lewis Moffat, Joe G. Greener, David T. Jones. https://doi.org/10.1101/2021.08.24.457549

76 of 94

1TEN - Inputs

stop_gradient(argmax - softmax(logits)) + softmax(logits)

softmax(logits + gumbel_noise)

softmax(logits)

logits
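The four input parameterizations listed on this slide differ in how "discrete" the sequence fed to the model is. The sketch below computes their forward-pass values in NumPy. Since NumPy has no autodiff, `stop_gradient` is a no-op stand-in (in JAX it would be `jax.lax.stop_gradient`), so only the forward values are demonstrated; the shapes and helper names are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stop_gradient(x):
    # Stand-in only: identity in the forward pass, which is all NumPy can show.
    return x

def gumbel_noise(shape, rng):
    u = rng.uniform(low=1e-9, high=1.0, size=shape)
    return -np.log(-np.log(u))

rng = np.random.default_rng(0)
logits = rng.normal(size=(13, 20))            # (length, amino acids)

soft = softmax(logits)                        # softmax(logits)
gumbel = softmax(logits + gumbel_noise(logits.shape, rng))
argmax = np.eye(20)[soft.argmax(axis=-1)]     # one-hot of the top residue
# Straight-through: the forward value is the one-hot sequence, while under
# autodiff the gradient would flow only through the trailing `soft` term.
hard = stop_gradient(argmax - soft) + soft
```

Passing raw `logits` is the fourth option: the values are unconstrained, so the model sees inputs that are not valid amino acid distributions.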

77 of 94

1TEN - Trajectories - RMSD

stop_gradient(argmax - softmax(logits)) + softmax(logits)

softmax(logits + gumbel_noise)

softmax(logits)

logits

78 of 94

1TEN - Trajectories - Sequence Recovery

stop_gradient(argmax - softmax(logits)) + softmax(logits)

softmax(logits + gumbel_noise)

softmax(logits)

logits

79 of 94

Observations

  1. Discrete optimization for this complex topology is stuck at ~2-4 RMSD, and the sequence recovery is correspondingly bad (< 15%)
  2. If we "cheat" and use "logits" (continuous values) as inputs, we can find a solution with RMSD < 0.5!
  3. Even though the logits make "no sense" and are technically "adversarial", surprisingly, if you take the max category at each position, the sequence recovery is >30%!

Which one should we use?

80 of 94

81 of 94

Wait until convergence, then switch to hard!

switch

soft = logits

hard
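The soft-then-hard schedule needs some rule for deciding when the soft stage has converged. The slide does not say which rule was used, so the plateau test below is purely a hypothetical example: it compares the mean loss over the last `window` steps with the window before it.

```python
def has_converged(losses, window=10, tol=1e-3):
    """True when the mean loss over the last `window` steps improved by less
    than `tol` relative to the window before it (hypothetical criterion)."""
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return previous - recent < tol

def input_mode(losses, already_hard=False):
    """Stay hard once switched; otherwise stay soft until the loss plateaus."""
    return "hard" if already_hard or has_converged(losses) else "soft"
```

In a design loop, `input_mode` would be called once per step with the loss history so far, and the result would select between the soft (logits) input and the hard (one-hot) input.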

82 of 94

After switch, sequence recovery >25%

switch

hard

soft = logits

83 of 94

84 of 94

Are the sequences better than TrRosetta?

85 of 94

Run AlphaFold many times independently and generate many sequences

86 of 94

Coevolution test recovers more contacts!

Residues in contact

TrDesign

AfDesign*

*1.5K sequences

87 of 94

DEMO

88 of 94

Related work

Inverting prediction model for design

  • Fast differentiable DNA and protein sequence optimization for molecular design. Johannes Linder, Georg Seelig. https://arxiv.org/abs/2005.11275
  • Using AlphaFold for Rapid and Accurate Fixed Backbone Protein Design. Lewis Moffat, Joe G. Greener, David T. Jones. https://doi.org/10.1101/2021.08.24.457549
  • AlphaDesign: A de novo protein design framework based on AlphaFold. Michael Jendrusch, Jan O. Korbel, S. Kashif Sadiq. https://www.biorxiv.org/content/10.1101/2021.10.11.463937v1

Design sequence given structure

Design sequence given general fold description

89 of 94

Thank you!

Collaborate? Join? solab.org

CODE: github.com/sokrypton

PS, new episode today :)

90 of 94

EXTRA

91 of 94

How to backprop?

92 of 94

1TEN - Inputs

stop_gradient(argmax - softmax(logits)) + softmax(logits)

softmax(logits + gumbel_noise)

softmax(logits)

logits

93 of 94

94 of 94