AI For Biology
Alex Graves
MenaML Winter School
February 2025
Recap: Joint Probability Distributions
[Figure: joint density p(x, y) over x and y, with the marginals p(x) and p(y) and the conditional p(x | y=b)]
Image from “Structural Reliability Analysis and Prediction”, Robert E. Melchers 1999
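The same three quantities can be read off a small discrete joint table. This is a minimal sketch: the 2x3 table values are made up for illustration and are not taken from the figure.

```python
import numpy as np

# Hypothetical joint table p(x, y): rows index x (2 values), columns index y (3 values).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])

p_x = joint.sum(axis=1)  # marginal p(x): sum out y
p_y = joint.sum(axis=0)  # marginal p(y): sum out x

b = 1                                  # condition on y = b
p_x_given_y = joint[:, b] / p_y[b]     # conditional p(x | y=b): slice, then renormalise

print(p_x, p_y, p_x_given_y)
```

The conditional is just a slice of the joint divided by the matching marginal, which is why a model of the joint can answer every conditional query.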
© 2024 InstaDeep Ltd. All Rights Reserved.
Distributions over many Variables
Generating Diverse High-Fidelity Images with VQ-VAE-2 (2019, Razavi et al.)
Marginal sample
Joint sample
Conditional Distributions over many Variables
p(image | caption)
p(caption | image)
DALL·E (OpenAI 2021)
CLIP (OpenAI 2021)
Inpainting ⇔ Fine-Grained Conditioning
RePaint: Inpainting using Denoising Diffusion Probabilistic Models, Lugmayr et al. 2022
p(masked pixels | revealed pixels)
LLMs: Most Flexible Conditional Models?
p(future text | past text)
ChatGPT (OpenAI, 2022—)
Generative AI as a Data Explorer
“Given this still and the word ‘cat’ somewhere in the title, what might the rest of the video look like?”
“Given this image of a person’s left eye, how likely is their right eye to be the same colour?”
…
Application to Biology
Model Everything!
Map of the human X chromosome (from ncbi.nlm.nih.gov)
Example: Proteomics
p(Binders, Binding assays, Structure, Taxonomy, GO Terms, EC number, Sequence, Domains, …)
Goal: learn a joint distribution from e.g. the UniProt database
Then solve tasks by conditional sampling
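To make "solve tasks by conditional sampling" concrete, here is a minimal sketch on a toy discrete joint. The table, the variable names and the use of rejection sampling are all illustrative assumptions; rejection sampling is only the simplest way to condition a joint sampler and is not the method used later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy joint over two discrete variables:
# a in {0, 1} and b in {0, 1, 2} (stand-ins for any two data modes).
joint = np.array([[0.30, 0.15, 0.05],
                  [0.05, 0.15, 0.30]])

def sample_joint(n):
    """Draw n samples (a, b) from the joint table."""
    flat = rng.choice(joint.size, size=n, p=joint.ravel())
    return np.unravel_index(flat, joint.shape)

def sample_a_given_b(b_value, n):
    """Sample a ~ p(a | b=b_value) by keeping only joint samples that match the condition."""
    a, b = sample_joint(n)
    return a[b == b_value]

samples = sample_a_given_b(b_value=2, n=50_000)
empirical = np.bincount(samples, minlength=2) / len(samples)
exact = joint[:, 2] / joint[:, 2].sum()
print(empirical, exact)
```

Rejection wastes every sample that misses the condition, which is why rare conditions call for guided samplers such as particle filtering.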
All icons from flaticon.com
Protein Folding
One conditional distribution, 3 Nobel laureates!
(David Baker, Demis Hassabis, John Jumper)
p(Structure | Sequence)
“Life could not exist without proteins. That we can now predict protein structures and design our own proteins confers the greatest benefit to humankind.”
Press release, Nobel prize in Chemistry 2024
Image from https://alphafold.ebi.ac.uk/
Other Proteomics Tasks
Inverse protein folding: p(Sequence | Structure)
Protein function prediction: p(GO Terms | Sequence, Structure)
De novo antibody design: p(Sequence, Structure | Binders)
…
Challenges
A mix of sequences, tables, graphs, text, images…
Underlying variables can be continuous, discrete or discretised
Which Loss Function?
Autoregression
Sorry boss the dog ate my ______ → slides
Pros: Great for discrete sequences (especially text), efficient training
Cons: Struggles with continuous data and data without a natural order (e.g. graphs, grids…); slow sampling, inflexible conditioning
Masked prediction (BERT)
Sorry ____ the ___ ate my ______ → boss, cat, report
Pros: Good for representation learning on discrete data. Flexible conditioning and arbitrary ordering built in.
Cons: Struggles with generative modelling and continuous data; slow sampling
Diffusion
Forward process: add noise
Backward process: remove noise
Pros: Good for continuous data (especially images), can use guidance for flexible conditioning
Cons: Struggles with discrete data
Can we use one loss for everything?
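The three objectives differ mainly in how a training example is turned into an (input, target) pair. The sketch below builds those pairs for the slide's example sentences. The helper names and the variance-preserving noise form are illustrative assumptions, not any specific model's recipe.

```python
import numpy as np

# Autoregression: predict each token from the tokens before it.
tokens = ["Sorry", "boss", "the", "dog", "ate", "my", "slides"]
autoregressive_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked prediction: hide a subset of positions and predict them from the rest.
def mask_tokens(sentence, positions, mask="____"):
    masked = [mask if i in positions else tok for i, tok in enumerate(sentence)]
    targets = {i: sentence[i] for i in sorted(positions)}
    return masked, targets

sentence = ["Sorry", "boss", "the", "cat", "ate", "my", "report"]
masked, targets = mask_tokens(sentence, positions={1, 3, 6})

# Diffusion (forward process): add Gaussian noise to continuous data.
# Assumed form: x_s = sqrt(1 - s) * x_0 + sqrt(s) * noise, with noise level s in [0, 1].
def add_noise(x0, s, rng):
    return np.sqrt(1.0 - s) * x0 + np.sqrt(s) * rng.normal(size=x0.shape)

rng = np.random.default_rng(0)
x0 = np.array([0.5, -1.0, 2.0])
noisy = add_noise(x0, s=0.3, rng=rng)

print(autoregressive_pairs[-1])
print(masked, targets)
print(noisy)
```

Only the noise-based construction applies directly to real-valued inputs, while the first two assume a discrete vocabulary. That is the gap a single loss would need to close.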
Internal Data
Bayesian Flow Networks
Graves et al. 2023
Overview
BFNs are similar to diffusion models except the denoising process operates on distribution parameters rather than directly on data
[Figure: sequence of panels plotting probability (Prob., 0 to 1) over Class (0 and 1), ordered by time from 0 to 1]
This means the generative process is continuous even if the data is discrete
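A minimal sketch of why the process is continuous for discrete data: the quantity that evolves is a probability vector θ, not a class label. The code samples θ at several times using the discrete-data Bayesian flow distribution as given in Graves et al. 2023, with y ~ N(β(t)(K e_x − 1), β(t) K I), θ = softmax(y) and β(t) = t² β(1). The value β(1) = 3 and the two-class setup are arbitrary choices for illustration.

```python
import numpy as np

def softmax(y):
    z = y - y.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_flow_params(x, n_classes, t, beta_1, rng):
    """Sample theta ~ p_F(theta | x; t) for one discrete variable whose true class is x."""
    beta_t = beta_1 * t ** 2                        # accuracy schedule beta(t) = t^2 * beta(1)
    one_hot = np.eye(n_classes)[x]
    mean = beta_t * (n_classes * one_hot - 1.0)     # pushes logits towards the true class
    y = mean + np.sqrt(beta_t * n_classes) * rng.normal(size=n_classes)
    return softmax(y)

rng = np.random.default_rng(0)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    theta = sample_flow_params(x=1, n_classes=2, t=t, beta_1=3.0, rng=rng)
    print(f"t={t:.2f}  theta={theta}")
```

At t = 0 the parameters are exactly uniform. As t grows they drift, noisily but continuously, towards the one-hot vector of the true class.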
[Figure panels: BFN, Discrete Diffusion, Masked Diffusion]
Conditional Sampling
Gradient-Free
Masked prediction (cf. classifier-free guidance [1])
Particle filtering [2] (Sequential Monte Carlo)
Gradient-Based
Score-based sampling [3] (cf. classifier guidance [4])
Twisted particle filtering [5]
Unified BFN loss makes multimodal sampling easy (no external models needed).
Continuous generation allows gradient-based conditional sampling for discrete data.
[1] Classifier-Free Diffusion Guidance, Ho and Salimans 2022
[2] Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem, Trippe et al. 2022
[3] Score-Based Generative Modeling through Stochastic Differential Equations, Song et al. 2020
[4] Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol 2021
[5] Practical and Asymptotically Exact Conditional Sampling in Diffusion Models, Wu et al. 2023
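As an illustration of the gradient-free route, here is a minimal particle-filtering (Sequential Monte Carlo) sketch on a toy problem. The generative process is a Gaussian random walk and the condition is a noisy observation y of its end point; both are assumptions chosen so the exact answer is known. It is not a BFN sampler, and the lookahead likelihood used for weighting is exact only because the toy model is Gaussian.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of N(x | mean, var), elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def particle_filter(y, n_particles=5000, n_steps=10, step_std=1.0, obs_std=1.0, seed=0):
    """Sample the end point x_T of a random walk, conditioned on y ~ N(x_T, obs_std^2)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)                       # all particles start at x_0 = 0

    def log_lookahead(x_t, t):
        # p(y | x_t): observation noise plus the variance of the steps still to come.
        remaining_var = (n_steps - t) * step_std ** 2
        return log_normal_pdf(y, x_t, obs_std ** 2 + remaining_var)

    prev = log_lookahead(x, 0)
    for t in range(1, n_steps + 1):
        x = x + step_std * rng.normal(size=n_particles)   # propagate with the unconditional model
        cur = log_lookahead(x, t)
        log_w = cur - prev                                # incremental importance weight
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        x, prev = x[idx], cur[idx]
    return x

samples = particle_filter(y=3.0)
print(samples.mean(), samples.var())
```

For this toy model the exact posterior over the end point is Gaussian with mean 30/11 and variance 10/11, so the particle estimate can be checked directly. No gradients of the model are needed, only the ability to simulate it and to score partial trajectories against the condition.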
Toy Example: MNIST
Joint Samples (Image and Class)
Class Conditional Samples
SMC Particle Filtering (512 particles)
Inpainting Samples (No Class)
Original
Masked
Class Conditional Inpainting Samples
2
Class Conditional Inpainting Samples
3
Image Conditional Samples (Aka Classification)
~99% accuracy with 64 particles
Protein Sequence Modelling
ProtBFN
Outperforms or matches SOTA task-specific autoregressive, diffusion and BERT models.
Improved naturalness, diversity and novelty.
Uses zero-shot conditioning of the model.
Patent submitted.
Accepted by Nature Communications, preprint available at https://www.biorxiv.org/content/10.1101/2024.09.24.614734v1
Antibody Modelling
Bora Gologlu
Amino acid sequence
VH: EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRGGNWAILDYWGQGTLVTVSS
VL: DIQMTQSPSSVSASVGDRVTITCRASQGISSWLAWYQQKPGKAPKLLIYGASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQANSFPPTFGQGTRLEIK
Sequence regions: FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, CDR-H3, FWR-H4, FWR-L1, CDR-L1, FWR-L2, CDR-L2, FWR-L3, CDR-L3, FWR-L4
Length Attributes: CDR-H1 length, CDR-H2 length, CDR-H3 length, CDR-L1 length, CDR-L2 length, CDR-L3 length, VH length, VL length
Genetic Attributes: HV gene, HD gene, HJ gene, HV seq. identity (%), HD seq. identity (%), HJ seq. identity (%), LV gene, LJ gene, LV seq. identity (%), LJ seq. identity (%), LC locus, Species
Biophysical Attributes: Negative Patches, Positive Patches, Charge Imbalance, Hydrophobicity
[Figure: antibody schematic labelling the VH, VL, CH1, CL, CH2 and CH3 domains, the Fv and Fab fragments, the CDR loops H1, H2, H3, L1, L2 and L3, and the V, D and J gene segments]
AbBFN
p(VH Seq., VL Seq., CDR lengths, Germline, % ID, LC Locus, Taxonomy, Developability, …)
AbBFN was trained on 45 different data modes using the Observed Antibody Space (OAS) database
Can then query the model using conditional sampling
“Given this CDR-H3 sequence and this light V gene, generate me a set of human antibodies.”
“Given this antibody sequence, label it with all relevant metadata.”
…
Antibody explorer?
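Both example queries are the same operation: fix some data modes and sample the rest. The sketch below shows only that bookkeeping. The mode names, the placeholder values and the `split_query` helper are hypothetical and do not reflect the actual AbBFN interface.

```python
# Hypothetical, heavily abbreviated set of data modes (the slide states the real model uses 45).
DATA_MODES = ("cdr_h3", "rest_of_vh", "vl", "hv_gene", "lv_gene", "species", "lc_locus")

def split_query(observed):
    """Partition the data modes into conditioning values and variables left to sample."""
    unknown = set(observed) - set(DATA_MODES)
    if unknown:
        raise KeyError(f"unknown data modes: {sorted(unknown)}")
    to_sample = [mode for mode in DATA_MODES if mode not in observed]
    return dict(observed), to_sample

# "Given this CDR-H3 sequence and this light V gene, generate me a set of human antibodies."
design_query = {"cdr_h3": "<CDR-H3 sequence>", "lv_gene": "<light V gene>", "species": "human"}
_, design_targets = split_query(design_query)

# "Given this antibody sequence, label it with all relevant metadata."
annotation_query = {"cdr_h3": "<CDR-H3>", "rest_of_vh": "<rest of VH>", "vl": "<VL>"}
_, annotation_targets = split_query(annotation_query)

print(design_targets)
print(annotation_targets)
```

Design and annotation differ only in which side of the conditioning bar each mode sits on, so no task-specific model is needed for either.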
Verifying The Joint Distribution
CDR lengths
Conditional Sampling (Twisted SMC)
AbBFN-X
“Ancestor” antibody
AbBFN-X
[Figure labels: CDR-H3 length; sequence regions FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, FWR-H4, FWR-L1, CDR-L1, FWR-L2, CDR-L2, FWR-L3, CDR-L3, FWR-L4]
[Figure labels: Hydrophobicity; LV gene; sequence regions FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, CDR-H3, FWR-H4]
[Figure: conditional distribution between VL Seq. and VH Seq.]
Gene Identification: p( gene | seq )
F1 score with imbalance correction
| Label          | AbBFN  | ANARCI |
| HV gene family | 1.0000 | 1.0000 |
| HD gene family | 0.6802 |        |
| LV gene family | 1.0000 | 0.9894 |
| HV gene        | 0.9796 | 0.9684 |
| HD gene        | 0.5766 |        |
| HJ gene        | 0.9792 | 0.8792 |
| LV gene        | 0.9913 | 0.9894 |
| LJ gene        | 0.9359 | 0.9071 |
| LC locus       | 1.0000 | 1.0000 |
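The slide does not define "imbalance correction". One common way to stop frequent labels dominating an F1 score is macro-averaging, where F1 is computed per label and then averaged with equal weight. The sketch below shows that reading as an assumption. It is not necessarily the metric behind the table, and the labels "A" and "B" are placeholders.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores, over the labels present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        scores.append(2 * tp / (2 * tp + fp + fn))   # denominator > 0: label occurs in y_true
    return float(np.mean(scores))

# A classifier that always predicts the majority label "A".
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(accuracy, macro_f1(y_true, y_pred))
```

Accuracy here is 0.8, yet the macro F1 is 4/9 because the rare label is never recovered. That is the kind of failure an imbalance-aware score is meant to expose across many gene labels of very different frequency.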
Guided Antibody Design
Conditioning information:
In OAS, there are only 21 such sequences (out of 2M)
With AbBFN, we can easily generate thousands.
AbBFN has a 65,000x higher hit rate.
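A quick check of what these figures imply, assuming the 65,000x factor is measured against the OAS base rate of 21 in 2M. The slide does not state the baseline explicitly, so treat the result as a back-of-envelope reading.

```python
oas_hits = 21
oas_total = 2_000_000      # "2M" on the slide, taken here as exactly two million
enrichment = 65_000        # "65,000x higher hit rate"

oas_hit_rate = oas_hits / oas_total
implied_model_hit_rate = enrichment * oas_hit_rate

print(f"OAS hit rate: {oas_hit_rate:.2e}")                      # 1.05e-05
print(f"Implied AbBFN hit rate: {implied_model_hit_rate:.2f}")  # 0.68
```

Under that assumption, roughly two in three generated antibodies would satisfy the conditioning information, compared with about one in 100,000 natural sequences.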
[Figure labels: Antigen, BFN-VL, Natural-VL, BFN-VH, Natural-VH]
Summary
Extra Slides
Natural, Diverse & Novel
ProtBFN learns statistical and biochemical properties of natural proteins with high fidelity.
More natural… [1]
…more diverse… [1]
…and highly novel. [2]
95%: Sequence identity < 95%
89%: Sequence identity < 80%
44%: Sequence identity < 50%
[1] 10,000 generated sequences from each model are matched to clusterings from UniRef50. A hit is a match with >50% sequence identity. The coverage score is the ratio of the number of unique clusters hit to the number expected if sequences were drawn i.i.d. from the model's training distribution.
[2] Identity of ProtBFN-generated sequences to the best-matching protein sequence found in the model's training data. Any identity < 100% is a novel sequence that the model has not seen before.
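The coverage score needs an "expected number of unique clusters" under i.i.d. sampling. One way to compute it, assuming each cluster's probability is its frequency in the training data, uses the fact that a cluster with probability p is hit at least once in n draws with probability 1 − (1 − p)^n. The function names and the toy numbers are illustrative, and the identity-based matching of sequences to clusters is assumed to be done already.

```python
import numpy as np

def expected_unique_clusters(cluster_probs, n_samples):
    """Expected number of distinct clusters hit by n i.i.d. draws from cluster_probs."""
    p = np.asarray(cluster_probs, dtype=float)
    return float(np.sum(1.0 - (1.0 - p) ** n_samples))

def coverage_score(hit_cluster_ids, cluster_probs, n_samples):
    """Unique clusters actually hit, relative to the i.i.d. expectation."""
    unique_hits = len(set(hit_cluster_ids))
    return unique_hits / expected_unique_clusters(cluster_probs, n_samples)

# Toy check: 4 equally likely clusters and 2 generated sequences.
probs = [0.25, 0.25, 0.25, 0.25]
print(expected_unique_clusters(probs, n_samples=2))         # 1.75
print(coverage_score([0, 3], probs, n_samples=2))           # 2 / 1.75
```

A score near 1 means the generated set is about as diverse as a same-sized draw from the training distribution. A score well below 1 indicates collapse onto few clusters.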
Globular Structural Motifs With Novel Sequences
Single and multi-domain proteins.
Globally coherent generations with inter-domain interactions.
Predicted structures of generated sequences show natural, globally coherent and functionally diverse folds.
Spans diversity of known structures and tree-of-life.
Alpha Helical, Beta Sheet, Alpha-Beta and Irregular domains.
Small and large domains.
Transmembrane Proteins (porins and transporters) and Enzymes.
Domains specific to Archaea, Bacteria, Eukarya (Plants, Humans).
Structure largely determines function in nature.
Sequence → Structure → Function
System Components
Bayesian: Use Bayesian inference to update distributions over individual variables (pixels in images, letters in text, coordinates in molecules…) given a sequence of noisy samples from the data
Flow: Generalise to continuous time to create a flow of information from variables to distributions
Networks: Feed the distribution parameters to a neural network and use it to model the data
Bayesian Flow Networks, Graves et al., 2023
Bayesian Updates for Continuous Data
Reference: Conjugate Bayesian analysis of the Gaussian distribution, K. Murphy 2007
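For reference, the conjugate update for a Gaussian with known noise precision is short enough to write out. With prior N(μ, 1/ρ) and a noisy sample y ~ N(x, 1/α), the posterior has precision ρ + α and mean (μρ + yα)/(ρ + α). The sketch below applies it repeatedly. The true value x, the prior and α are arbitrary illustrative choices.

```python
import numpy as np

def bayesian_update(mu, rho, y, alpha):
    """Posterior mean and precision after observing y ~ N(x, 1/alpha) under prior N(mu, 1/rho)."""
    rho_new = rho + alpha                       # precisions add
    mu_new = (mu * rho + y * alpha) / rho_new   # precision-weighted average
    return mu_new, rho_new

rng = np.random.default_rng(0)
x = 0.7                    # the data value being transmitted
mu, rho = 0.0, 1.0         # standard normal prior
alpha = 0.5                # accuracy (precision) of each noisy sample

for _ in range(200):
    y = x + rng.normal() / np.sqrt(alpha)       # noisy sample of x
    mu, rho = bayesian_update(mu, rho, y, alpha)

print(mu, rho)
```

Each noisy sample moves the mean towards x and raises the precision by α, so the belief sharpens steadily as samples accumulate.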
Update Distribution
Bayesian Flow Distribution
Comparison with Diffusion
Reverse Process
Input Variance
Discrete Data
Discrete Flow
Flow Distribution for Discrete Data
Flow Distribution for Binary Data
Output Distributions
Types of Output Distribution
*Can either predict the data directly, or predict the input noise and reparametrise
Continuous Data
Continuous data
Input Mean
Output Mean
Discretised Data
Discrete data
Binary data
Loss Function
The n-step loss is the expected sum of n sender-receiver transmission costs*:
Can rewrite the sum as an expectation
Then take n → ∞ to get the continuous-time loss
*Can also think of this as the latent loss of a VAE, with the sequence of sender samples as the latent variable
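The equations for these three steps did not survive on the slide. The block below is a reconstruction in the notation of Graves et al. 2023, where p_S is the sender distribution, p_R the receiver distribution, p_F the Bayesian flow distribution, and α_i, t_i the accuracy and time at step i. Treat it as a sketch to be checked against the paper.

```latex
% n-step loss: expected sum of n sender-receiver transmission costs
L^{n}(x) = \mathbb{E}_{p(\theta_1, \dots, \theta_{n-1})}
  \sum_{i=1}^{n} D_{\mathrm{KL}}\!\left(
    p_S(\cdot \mid x; \alpha_i) \,\middle\|\, p_R(\cdot \mid \theta_{i-1}; t_{i-1}, \alpha_i)
  \right)

% the same sum rewritten as an expectation over a uniformly sampled step i
L^{n}(x) = n \, \mathbb{E}_{i \sim U\{1, n\},\; p_F(\theta \mid x; t_{i-1})}
  D_{\mathrm{KL}}\!\left(
    p_S(\cdot \mid x; \alpha_i) \,\middle\|\, p_R(\cdot \mid \theta; t_{i-1}, \alpha_i)
  \right)

% continuous-time limit
L^{\infty}(x) = \lim_{n \to \infty} L^{n}(x)
```

The second form is what makes training practical: a single random step can be sampled per example without simulating the whole chain θ_1, …, θ_{n−1}.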
Continuous Time Loss
It turns out the loss simplifies in continuous time (Proposition 3.1):
Continuous
Discrete
Discretised
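As an illustration of the discrete case, the continuous-time loss in Graves et al. 2023 takes the form L∞(x) = K β(1) E[t ‖e_x − ê(θ, t)‖²], with t uniform on (0, 1) and θ drawn from the Bayesian flow distribution. The sketch below estimates it by Monte Carlo for a single variable. The `predict` functions are hypothetical stand-ins for the network's output distribution, and β(1) = 3 is arbitrary.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def discrete_loss_estimate(x, n_classes, beta_1, predict, n_samples, rng):
    """Monte Carlo estimate of K * beta(1) * E[ t * ||e_x - e_hat(theta, t)||^2 ]."""
    one_hot = np.eye(n_classes)[x]
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform()
        beta_t = beta_1 * t ** 2
        # Sample theta from the discrete Bayesian flow distribution at time t.
        y = beta_t * (n_classes * one_hot - 1.0) + np.sqrt(beta_t * n_classes) * rng.normal(size=n_classes)
        theta = softmax(y)
        e_hat = predict(theta, t)   # predicted class probabilities
        total += n_classes * beta_1 * t * np.sum((one_hot - e_hat) ** 2)
    return total / n_samples

rng = np.random.default_rng(0)
K, x = 3, 2
identity_loss = discrete_loss_estimate(x, K, 3.0, lambda theta, t: theta, 2000, rng)
oracle_loss = discrete_loss_estimate(x, K, 3.0, lambda theta, t: np.eye(K)[x], 2000, rng)
print(identity_loss, oracle_loss)
```

The loss is a weighted squared error between the one-hot data and the predicted class probabilities, so a predictor that always outputs the true class scores exactly zero. No KL divergence has to be evaluated at training time.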
Sample Generation
Network Training