ColabDesign
Making protein design accessible to all via Google Colab
github.com/sokrypton/ColabDesign
Outline
TrRosetta - Takes sequences and returns structure
predict
MRF
((X − X̄)ᵀ(X − X̄)/(N − 1) + αI)⁻¹
α = 4.5/√N
PSSM
X̄ = ΣX/N
Sequences (X)
Sequence
X[0]
Rosetta
Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences. 2020 Jan 2.
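The conservation (PSSM) and coevolution (regularized inverse covariance) statistics above can be sketched in a few lines of NumPy. This is an illustrative toy with a random one-hot "MSA" and my own variable names, not the actual TrRosetta input pipeline:

```python
import numpy as np

# Toy MSA: N sequences of length L over A amino acids, one-hot
# encoded and flattened to shape (N, L*A).
N, L, A = 100, 10, 20
rng = np.random.default_rng(0)
X = np.eye(A)[rng.integers(0, A, size=(N, L))].reshape(N, L * A)

# Conservation (PSSM): X̄ = ΣX/N
pssm = X.mean(axis=0)

# Coevolution (MRF-style couplings): regularized inverse covariance,
# ((X - X̄)ᵀ(X - X̄)/(N - 1) + αI)⁻¹ with α = 4.5/√N
alpha = 4.5 / np.sqrt(N)
Xc = X - pssm
inv_cov = np.linalg.inv(Xc.T @ Xc / (N - 1) + alpha * np.eye(L * A))
```

The αI term keeps the (rank-deficient) one-hot covariance invertible.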
MSA
predict
Rosetta
Features
Sequence
Using a single sequence, the method can accurately predict de novo designed proteins
Top7 (TM-score = 0.82)
β-barrel (0.70)
IL2-mimetic (0.78)
jelly-roll (0.76)
Peak7 (0.87)
Ferredog-Diesel (0.76)
BeNTF2 (0.79)
Slide modified from Ivan Anishchenko
Using a single sequence, the method can accurately predict de novo designed proteins but not natural proteins
de novo proteins
natural proteins
Accuracy (TM-score)
Can we invert TrRosetta for Protein Design?
EEAREWAEKWGAN
Target (P)
backprop
predict
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
Q = TrRosetta(X)
Random Sequence (X)
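The loss driving this design loop can be sketched directly from the slide's formula, L = -ΣP·log(Q). The distributions here are toy stand-ins, not real TrRosetta outputs:

```python
import numpy as np

def cce_loss(P, Q, eps=1e-8):
    """Categorical cross-entropy: L = -sum(P * log(Q))."""
    return -np.sum(P * np.log(Q + eps))

P = np.array([0.0, 1.0, 0.0])   # one-hot target bin
Q = np.array([0.1, 0.8, 0.1])   # predicted distribution
loss = cce_loss(P, Q)           # -log(0.8) ≈ 0.223
```

Backprop then computes ∂L/∂X through the model to update the sequence.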
But how is this different from other models that return sequence given structure?
They do not "see" the alternative conformation the sequence can adopt!
Explicit conformational landscape optimization
Target
Off-Target
Target
Chris Norn
Basile Wicky
P(sequence | structure)
P(structure | sequence)
Protein sequence design by conformational landscape optimization. Norn, C., Wicky, B.I., Juergens, D., Liu, S., Kim, D., Tischer, D., Koepnick, B., Anishchenko, I., Baker, D. and Ovchinnikov, S., 2021. Proceedings of the National Academy of Sciences, 118(11).
Experiment - for given backbone design new sequence
Remove sequence
Design new sequence
Compare sequences to compute "Sequence recovery" (seqid)
TrRosetta Sequence recovery is too low compared to Rosetta
Rosetta
TrRosetta
Why is the sequence recovery so low?
H1 - adversarial sequences
Understanding Deep Image Representations by Inverting Them. Aravindh Mahendran, Andrea Vedaldi. https://arxiv.org/abs/1412.0035
Comparing pairwise potential
log(P(A,B|Contact) / (P(A)*P(B)))
Native
Design
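The pairwise potential being compared can be sketched with toy contact counts (the data here is made up purely for illustration; the slide's analysis uses real native vs. designed structures):

```python
import numpy as np

A = 20                                          # amino acid alphabet
rng = np.random.default_rng(0)

# Toy counts of amino acid pairs (a, b) observed in contact.
pair_counts = rng.integers(1, 50, size=(A, A))
pair_counts = pair_counts + pair_counts.T       # contacts are symmetric

P_AB = pair_counts / pair_counts.sum()          # P(A,B | Contact)
P_A = P_AB.sum(axis=1)                          # marginal P(A)

# log-odds potential: log(P(A,B|Contact) / (P(A)*P(B)))
potential = np.log(P_AB / np.outer(P_A, P_A))
```

Comparing this matrix between native and designed sequences reveals where the designs deviate from natural pairing statistics.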
Comparing secondary structure propensity
(E) Sheets
(H) Helix
log(P(AA|SS) / P(AA))
Native
Design
Comparing average depth per amino acid type
Native (depth)
Distribution of hydrophobic residues
Depth = Distance from surface
aka. how buried is each amino acid
Design (depth)
Amino acids: I, F, L, C, V, W, M, Y, A, H, R, T, Q, S, N, D, E, K, G, P
H2 - low-resolution model
Use backrub to generate "low resolution" backbones and design new sequences with Rosetta
Ollikainen, Noah, and Tanja Kortemme. "Computational protein design quantifies structural constraints on amino acid covariation." PLoS computational biology 9, no. 11 (2013): e1003313.
temperature
Perturb backbone, redesign with Rosetta
Given that our structure prediction accuracy for de novo designed proteins is ~0.7 TM-score, ~0.15 sequence recovery is expected
H3 - only a subset of sequence or contacts being optimized
Run TrRosetta many times independently and generate many sequences
EEAREWAEKWGAN
Target (P)
backprop
predict
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
Q = TrRosetta(X)
Random Sequence (X)
Run TrRosetta many times independently and generate many sequences
Pass sequences to GREMLIN (coevolution method) to find covariance between positions
Generate 10K sequences and run coevolution analysis on these.
Only a subset of contacts coevolve in the designed sequences!
Residue pairs coevolving
Residues in contact
TrRosetta - Takes sequences and returns structure
NN
Coevolution
((X − X̄)ᵀ(X − X̄)/(N − 1) + αI)⁻¹
α = 4.5/√N
Conservation
X̄ = ΣX/N
Sequence
X[0]
pyRosetta
6D Features
Structure
Sequences
TrRosetta's NN
526 features
...
dilated residual blocks
ELU
Conv2D, d
InstanceNorm
Dropout
ELU
Conv2D, d
InstanceNorm
d = 1,2,4,8,16
pyRosetta
Slide modified from Ivan Anishchenko
What is each layer doing?
Coevolution
Conservation
Sequences (X)
Sequence
What is each layer doing?
Coevolution
Conservation
Sequences (X)
Sequence
probability
distance, Å
To make life easier, let's focus on the distance matrix
What is each layer doing?
Coevolution
Conservation
Sequences (X)
Sequence
Or contact map
What is each layer doing?
Coevolution
Conservation
Sequences (X)
Sequence
Or contact map
Investigating each block in resnet
Covarying residues match those in block 8
Are these layers just "connecting the dots"?
It appears only a subset of sequence positions or contacts is being optimized. Can we push the model to optimize all contacts?
TrRosetta - Takes sequences and returns structure
TrRosetta
Coevolution
((X − X̄)ᵀ(X − X̄)/(N − 1) + αI)⁻¹
α = 4.5/√N
Conservation
X̄ = ΣX/N
Sequence
X[0]
pyRosetta
6D Features
Structure
Sequences
pred_feats = TrRosetta(sequences)
loss = CCE(true_feats, pred_feats)
TrMRF - Takes structure and returns sequences
Coevolution
Conservation
Predicted�Sequences
W, b = TrMRF(structure)
pred_sequences = softmax(sequences @ W + b)
loss = CCE(sequences, pred_sequences)
Sequences
6D Features
Structure
TrMRF
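The TrMRF scoring step from the pseudocode above can be made runnable with random stand-ins for the predicted couplings W and bias b (a real TrMRF network would predict these from the 6D structure features):

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 10, 20
seq = rng.integers(0, A, size=L)
X = np.eye(A)[seq].reshape(1, L * A)           # one-hot sequence

# Stand-ins for TrMRF's predicted MRF parameters.
W = rng.normal(0, 0.01, (L * A, L * A))
W = (W + W.T) / 2                              # couplings are symmetric
b = rng.normal(0, 0.01, L * A)

# pred_sequences = softmax(sequences @ W + b)
logits = (X @ W + b).reshape(L, A)
e = np.exp(logits - logits.max(-1, keepdims=True))
pred = e / e.sum(-1, keepdims=True)

# loss = CCE(sequences, pred_sequences)
loss = -np.sum(np.eye(A)[seq] * np.log(pred))
```

Because the couplings act position-pairwise, this scoring is the pseudo-likelihood of the sequence under an MRF.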
Target (P)
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
EEAREWAEKWGAN
Sequence (X)
TrRosetta
Target (P)
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
EEAREWAEKWGAN
Sequence (X)
TrRosetta
TrMRF
Target (P)
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
EEAREWAEKWGAN
Sequence (X)
TrRosetta
TrMRF
TrRosetta Sequence recovery is too low compared to Rosetta
Rosetta
TrRosetta
Adding TrMRF loss significantly improved sequence recovery!
Experiment
Gabriel J Rocklin
Tsuboyama Kotaro
Justas Dauparas
Hallucinate thousands of new proteins
Loss (L): L = KL w/ background
Prediction (Q)
EEAREWAEKWGAN
Sequence (X)
TrRosetta
pyRosetta
Structure
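The hallucination loss above (KL divergence against a background distribution) can be sketched with toy arrays; the uniform background here is my own simplifying assumption, standing in for the model's averaged background distogram:

```python
import numpy as np

def kl_div(Q, B, eps=1e-8):
    """KL divergence KL(Q || B) between prediction Q and background B."""
    return np.sum(Q * np.log((Q + eps) / (B + eps)))

Q = np.array([0.7, 0.2, 0.1])    # confident, structured prediction
B = np.full(3, 1 / 3)            # toy uniform background
loss = -kl_div(Q, B)             # minimized: pushes Q away from B
```

Minimizing -KL rewards predictions that look as unlike the featureless background as possible, i.e. confidently structured.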
Redesign proteins
TrROS = Hallucination
TrMRF = Fixed backbone design
TrROS+TrMRF = Cont. Hallucination
Results - Protease Resistance (Stability) Assay
Results look TOO good
Tsuboyama Kotaro
stable
unstable
stable
unstable
We noticed some of the most stable designs had hydrophobic patches, which AlphaFold predicts as homo-oligomers
Justas Dauparas
Inter PAE
Inter PAE
resistance to trypsin
resistance to chymotrypsin
Let's compute inter-PAE values for our designs
TrMRF + TrROS
TrMRF + TrROS
Though most designs are predicted as oligomers, the monomers are still "stable"
monomer
oligomer
monomer
oligomer
stable
unstable
stable
unstable
AlphaFold filtering
interPAE > 15
pLDDT > 70
Results - Protease Resistance (Stability) Assay
(removing proteins predicted as oligomers by AlphaFold)
Results appear to hold even after filtering for oligomers
interPAE > 15
pLDDT > 70
stable
unstable
stable
unstable
Combined loss significantly better!
TrROS
TrROS
TrROS+TrMRF
Conclusions
P(structure | sequence) ~ AbRelax (FF) ~ TrRosetta ~ AlphaFold ~ RoseTTAFold
P(sequence) = 1.0

P(sequence | structure) ~ Rosetta ~ TrMRF
P(structure) = 1.0

P(structure, sequence) ~ P(structure | sequence) * P(sequence | structure)
Modern methods: RoseTTAFold / AlphaFold
Fig. 1: RoseTTAFold accurately predicts structures of de-novo-designed proteins from their amino acid sequences.
Minkyung Baek
Jue Wang
Partial Hallucination
Can we use DL models to design functional proteins?
Scaffolding functional motifs with hallucination
Goal: small, stable protein with desired motif
Doug Tischer
Sidney Lisanza
David Juergens
Current state-of-the-art
Graft motif into existing protein
Build up scaffold from fragments
Sesterhenn et al. 2020
Limited by databases
Limited to simple motifs; Need to choose fold beforehand
“Hallucination” can create new structures without referencing existing backbones
Anishchenko, Pellock et al. 2021
Partially constrained hallucination: embedding pre-specified motif
Doug
Sidney
Ivan
Sergey
Scaffolding p53 helix binding to mdm2
AF2 prediction
RF hallucination
Native
Native motif (p53 helix)
Design motif
Binding partner (mdm2)
Designed p53 mimetics bind to mdm2 in yeast display
p53 helix control
1uM target
p53 helix control
no target
Pool of hallucinated designs
1uM target
Expression / surface display of protein
Binding signal
Binder hallucination pipeline
Monomer hallucination
Filtering, experiments
Stage
Loss
2-chain hallucination
Iterate & diversify best outputs from previous round
Pairwise dist/angles
Monomer
Monomer + interface
Optional: sequence design
Problem-specific loss functions
Native motif
Design motif
Binding partner
Problem: Hallucinations clash with binding partner (RSVF site V)
Solution: Include repulsive loss
Result: (Almost) no clashes
Back to Sergey: AfDesign
initial XYZ
Sequence
MSA
PSSM
seq DB
XYZ
Recycle
Structure Module
MSA Module
Alphafold2
Templates
pdb DB
not differentiable
AlphaFold: Number of recycles needed to predict de novo designed proteins
James Roney
Why is AlphaFold sooooo bad at num_recycle=0?
Why is num_recycle=0 so bad?
If num_recycle=0, this disables a series of layers
bugfix
Alternative (without making source code changes):
model_config.model.num_recycle = 1
model_config.data.common.num_recycle = 1
processed_feature_dict["num_iter_recycling"] = np.array([0])
James Roney
Bugfix significantly improves the accuracy of de novo designed proteins at num_recycle=0
Number of recycles needed to predict de novo designed proteins
By changing a few configs we can reduce runtime! (using Google Colab Free = Tesla K80)
length=100  length=50
  26.7s       13.9s   (msa=512, subbatch=4,    recycles=3, backprop=False)
   2.5s        0.8s   (msa=1,   subbatch=4,    recycles=3, backprop=False)
   1.9s        0.5s   (msa=1,   subbatch=None, recycles=3, backprop=False)
   0.5s        0.1s   (msa=1,   subbatch=None, recycles=0, backprop=False)
   1.8s        0.4s   (msa=1,   subbatch=None, recycles=0, backprop=True)
   3.2s        1.1s   (msa=1,   subbatch=4,    recycles=0, backprop=True)
initial XYZ
Sequence
MSA
PSSM
seq DB
XYZ
Recycle
Structure Module
MSA Module
Alphafold2
Templates
pdb DB
XYZ
MSA Module
AlphaFold2: inputs/outputs
Distogram
Confidence
Sequence
Templates
Sequence
Target (P)
Prediction (Q)
EEAREWAEKWGAN
Q = Model(X)
Random Sequence (X)
Target (P)
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
EEAREWAEKWGAN
Q = Model(X)
Random Sequence (X)
How to backprop through discrete input?
Gradient (∇)�∇ = ∂L/∂X
Target (P)
Loss (L): L = -ΣP*log(Q)
Prediction (Q)
EEAREWAEKWGAN
Q = Model(X)
Random Sequence (X)
?
Test Case - 1TEN (Fibronectin type-III domain 3)
Starting from a random sequence, the MCMC protocol failed¹ for this example, RMSD = 4.36
Let's try backprop!
1TEN - Inputs
stop_gradient(argmax - softmax(logits)) + softmax(logits)
softmax(logits + gumbel_noise)
softmax(logits)
logits
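The four input parameterizations listed above can be sketched in NumPy. This is a toy stand-in: in AfDesign this runs under an autodiff framework, where stop_gradient makes the straight-through trick pass a hard one-hot forward while routing gradients through the soft term. NumPy has no autodiff, so the last case is shown by its forward-pass identity:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 20))      # length 10, 20 amino acids

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

x_logits = logits                       # 1) raw, unconstrained logits
x_soft = softmax(logits)                # 2) softmax(logits)

# 3) softmax(logits + gumbel_noise): Gumbel noise via inverse transform
gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
x_gumbel = softmax(logits + gumbel)

# 4) straight-through: stop_gradient(argmax - softmax) + softmax.
#    Forward pass is exactly the hard one-hot (hard - soft + soft);
#    under autodiff, gradients would flow through the soft term only.
hard = np.eye(20)[logits.argmax(-1)]
x_hard = (hard - x_soft) + x_soft
```

Each option trades off how discrete the input looks to the model against how well gradients flow back to the logits.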
1TEN - Trajectories - RMSD
stop_gradient(argmax - softmax(logits)) + softmax(logits)
softmax(logits + gumbel_noise)
softmax(logits)
logits
1TEN - Trajectories - Sequence Recovery
stop_gradient(argmax - softmax(logits)) + softmax(logits)
softmax(logits + gumbel_noise)
softmax(logits)
logits
Observations
Which one should we use?
Wait till convergence, then switch to hard!
switch
soft = logits
hard
After switch, sequence recovery >25%
switch
hard
soft = logits
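The soft-to-hard schedule described on these slides can be sketched as a simple step function; `switch_step` is an illustrative value, not the protocol's actual setting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def get_input(logits, step, switch_step=200):
    """Soft parameterization early; straight-through hard after switch."""
    soft = softmax(logits)
    if step < switch_step:
        return soft                                  # soft phase
    hard = np.eye(logits.shape[-1])[logits.argmax(-1)]
    return (hard - soft) + soft                      # hard forward pass
```

Optimizing soft inputs first lets the trajectory converge smoothly; switching to hard then forces the design to be a valid discrete sequence.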
Are the sequences better than TrRosetta?
Run AlphaFold many times independently and generate many sequences
Coevolution test recovers more contacts!
Residues in contact
TrDesign
AfDesign*
*1.5K sequences
DEMO
Related work
Inverting prediction model for design
�Design sequence given structure
�Design sequence given general fold description
Thank you!
Collaborate? Join? solab.org
CODE: github.com/sokrypton
PS, new episode today :)
EXTRA
How to backprop?
1TEN - Inputs
stop_gradient(argmax - softmax(logits)) + softmax(logits)
softmax(logits + gumbel_noise)
softmax(logits)
logits