1 of 65

AI For Biology

Alex Graves

MenaML Winter School

February 2025

2 of 65

Recap: Joint Probability Distributions

p(x, y)

x

y

joint

p(x, y)

marginal

p(y)

marginal

p(x)

conditional

p(x | y=b)

Image from “Structural Reliability Analysis and Prediction”, Robert E. Melchers 1999
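
The relationship between the joint, the marginals and a conditional can be sketched on a small discrete table. The numbers below are made up for illustration and the function names are hypothetical; only the sum and renormalisation rules come from the definitions on this slide.

```python
# Hypothetical joint table p(x, y): joint[i][j] = p(x=i, y=j).
joint = [
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
]

def marginal_x(p_xy):
    """p(x) = sum_y p(x, y): sum each row."""
    return [sum(row) for row in p_xy]

def marginal_y(p_xy):
    """p(y) = sum_x p(x, y): sum each column."""
    return [sum(col) for col in zip(*p_xy)]

def conditional_x_given_y(p_xy, b):
    """p(x | y=b) = p(x, y=b) / p(y=b): renormalise column b."""
    column = [row[b] for row in p_xy]
    total = sum(column)
    return [v / total for v in column]

p_x = marginal_x(joint)
p_y = marginal_y(joint)
p_x_given_b = conditional_x_given_y(joint, b=1)
```

Everything here is derived from the joint alone, which is the sense in which learning the joint gives access to every marginal and conditional.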

© 2024 InstaDeep Ltd. All Rights Reserved.

3 of 65

Distributions over many Variables

Generating Diverse High-Fidelity Images with VQ-VAE-2 (2019, Razavi et al.)

Marginal sample

Joint sample


4 of 65

Conditional Distributions over many Variables

p(image | caption)

p(caption | image)

DALL·E (OpenAI 2021)

CLIP (OpenAI 2021)


5 of 65

Inpainting ⇔ Fine-Grained Conditioning

RePaint: Inpainting using Denoising Diffusion Probabilistic Models, Lugmayr et al. 2022

p(masked pixels | revealed pixels)


6 of 65

LLMs: Most Flexible Conditional Models?

p(future text | past text)

ChatGPT (OpenAI, 2022—)


7 of 65

Generative AI as a Data Explorer

  • In theory, learning the joint distribution means learning everything about the data

  • Sampling the marginal and conditional distributions (including “fine-grained” inpainting conditions) allows you to query the model’s knowledge:

“Given this still and the word ‘cat’ somewhere in the title, what might the rest of the video look like?”

“Given this image of a person’s left eye, how likely is their right eye to be the same colour?”

  • Can think of the model as a way to explore the data it was trained on


8 of 65

Application to Biology

9 of 65

Model Everything!

  • Large, complex datasets exist in many biological fields: genomics, proteomics, evolutionary biology…

  • Each dataset is typically used for multiple tasks: drug discovery, protein structure prediction, gene expression prediction

  • By using generative AI to jointly model all the data in a given dataset, then conditionally sampling, we can train a single model for many tasks

Map of the human X chromosome (from ncbi.nlm.nih.gov)


10 of 65

Example: Proteomics

p(Binders, Binding assays, Structure, Taxonomy, GO Terms, EC number, Sequence (AGL…), Domains, …)

Goal: learn a joint distribution from e.g. the UniProt database

Then solve tasks by conditional sampling

All icons from flaticon.com


11 of 65

Protein Folding

One conditional distribution, 3 Nobel laureates!

(David Baker, Demis Hassabis, John Jumper)

p(Structure | Sequence (AGL…))

“Life could not exist without proteins. That we can now predict protein structures and design our own proteins confers the greatest benefit to humankind.”

Press release, Nobel prize in Chemistry 2024

Image from https://alphafold.ebi.ac.uk/


12 of 65

Other Proteomics Tasks

Inverse protein folding: p(Sequence | Structure)

Protein function prediction: p(GO Terms | Sequence, Structure)

De novo antibody design: p(Sequence, Structure | Binders)


13 of 65

Challenges

  • Noisy / incomplete data

  • Difficult to assess performance (e.g. can’t use human feedback)

  • Heterogeneous data
    • A mix of sequences, tables, graphs, text, images…
    • Underlying variables can be continuous, discrete or discretised


14 of 65

Which Loss Function?

Autoregression

Sorry boss the dog ate my ______ (target: slides)

Pros: Great for discrete sequences (especially text), efficient training

Cons: Struggles with continuous data and data without a natural order (e.g. graphs, grids…); slow sampling, inflexible conditioning

Masked prediction (BERT)

Sorry ____ the ___ ate my ______ (targets: boss, cat, report)

Pros: Good for representation learning on discrete data; flexible conditioning and arbitrary ordering built in

Cons: Struggles with generative modelling and continuous data; slow sampling

Diffusion

Forward process: add noise

Backward process: remove noise

Pros: Good for continuous data (especially images), can use guidance for flexible conditioning

Cons: Struggles with discrete data
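
As a rough illustration of how the first two objectives differ, the sketch below builds training examples from the slide's sentence. The function names and the masking probability are assumptions made for this example, not part of any particular model's pipeline.

```python
import random

MASK = "____"

def autoregressive_examples(tokens):
    """Each prefix predicts the next token: p(token_i | tokens before i)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_prob, rng):
    """Hide a random subset of tokens; the hidden ones become the targets
    and the visible ones the conditioning context (BERT-style)."""
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets

sentence = "Sorry boss the dog ate my slides".split()
ar = autoregressive_examples(sentence)
masked_inputs, masked_targets = masked_example(sentence, 0.4, random.Random(0))
```

The autoregressive examples always condition on a left-to-right prefix, whereas the masked example can condition on any subset of positions, which is where the flexible conditioning and arbitrary ordering come from.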

Can we use one loss for everything?

Internal Data


15 of 65

Bayesian Flow Networks

Graves et al. 2023

16 of 65

Overview

BFNs are similar to diffusion models except the denoising process operates on distribution parameters rather than directly on data

[Figure: a sequence of panels, each plotting probability (Prob., 0 to 1) against Class (0, 1), ordered along a time axis from 0 to 1]

This means the generative process is continuous even if the data is discrete


17 of 65


18 of 65

BFN

Discrete Diffusion

Masked Diffusion


19 of 65

Conditional Sampling

Gradient-Free

Masked prediction (cf. Classifier-free guidance [1])

Particle Filtering [2] (Sequential Monte Carlo)

Gradient-Based

Score-based sampling [3] (cf. Classifier guidance [4])

Twisted Particle Filtering [5]

Unified BFN loss makes multimodal sampling easy (no external models needed)

Continuous generation allows gradient-based conditional sampling for discrete data.

[1] Classifier-Free Diffusion Guidance, Ho and Salimans 2022

[2] Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem, Trippe et al. 2022

[3] Score-Based Generative Modeling through Stochastic Differential Equations, Song et al. 2020

[4] Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol 2021

[5] Practical and Asymptotically Exact Conditional Sampling in Diffusion Models, Wu et al. 2023
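
The core move in gradient-free particle methods is to weight samples by how well they agree with the condition and then resample. The sketch below shows a single weight-and-resample step on a toy Gaussian joint where the exact answer is known; a particle filter would repeat such a step along the generative process. The toy model, function names and particle count are assumptions for illustration and this is not the BFN sampler itself.

```python
import math
import random

def resample(particles, log_weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]  # subtract max for stability
    return rng.choices(particles, weights=weights, k=len(particles))

def condition_on_observation(y_obs, noise_std, n_particles, rng):
    """Approximate samples from p(x | y=y_obs) for the toy joint
    x ~ N(0, 1), y = x + N(0, noise_std^2)."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    log_w = [-0.5 * ((y_obs - x) / noise_std) ** 2 for x in particles]
    return resample(particles, log_w, rng)

rng = random.Random(0)
posterior = condition_on_observation(y_obs=1.0, noise_std=0.5, n_particles=20000, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
exact_mean = 1.0 / (1.0 + 0.5 ** 2)  # analytic posterior mean for this toy model
```

No gradients of the model are needed: only samples from the prior and the ability to score them against the condition.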


20 of 65

Toy Example: MNIST

21 of 65

Joint Samples (Image and Class)


22 of 65

Class Conditional Samples

SMC Particle Filtering (512 particles)


23 of 65

Inpainting Samples (No Class)

Original

Masked


24 of 65

Class Conditional Inpainting Samples

2


25 of 65

Class Conditional Inpainting Samples

3


26 of 65

Image Conditional Samples (Aka Classification)

~99% accuracy with 64 particles


27 of 65

Protein Sequence Modelling

28 of 65

ProtBFN

Outperforms or matches SOTA task-specific autoregressive, diffusion and BERT models.

Improved naturalness, diversity and novelty.

Uses zero-shot conditioning of model.

Patent submitted.

Accepted by Nature Communications, preprint available at https://www.biorxiv.org/content/10.1101/2024.09.24.614734v1


29 of 65

Antibody Modelling

Bora Gologlu

30 of 65

[Figure: antibody diagram showing the VH, VL, CH1, CL, CH2 and CH3 domains, the Fv and Fab fragments, and the CDR loops H1, H2, H3, L1, L2, L3]

Amino acid sequence

VH (V gene, D gene, J gene): EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRGGNWAILDYWGQGTLVTVSS

VL (V gene, J gene): DIQMTQSPSSVSASVGDRVTITCRASQGISSWLAWYQQKPGKAPKLLIYGASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQANSFPPTFGQGTRLEIK

Sequence regions (each an amino acid string, AGL…): FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, CDR-H3, FWR-H4, FWR-L1, CDR-L1, FWR-L2, CDR-L2, FWR-L3, CDR-L3, FWR-L4

Length Attributes: CDR-H1 length, CDR-H2 length, CDR-H3 length, CDR-L1 length, CDR-L2 length, CDR-L3 length, VH length, VL length

Genetic Attributes: HV gene, HD gene, HJ gene, HV seq. identity (%), HD seq. identity (%), HJ seq. identity (%), LV gene, LJ gene, LV seq. identity (%), LJ seq. identity (%), LC locus, Species

Biophysical Attributes: Negative Patches, Positive Patches, Charge Imbalance, Hydrophobicity

31 of 65

AbBFN

p(Developability, Taxonomy, VH Seq. (AGL…), VL Seq. (AGL…), Germline, % ID, LC Locus, CDR lengths, …)

AbBFN was trained on 45 different data modes using the Observed Antibody Space (OAS) database

Can then query the model using conditional sampling

“Given this CDR-H3 sequence and this light V gene, generate me a set of human antibodies.”

“Given this antibody sequence, label it with all relevant metadata.”

Antibody explorer?


32 of 65

Verifying The Joint Distribution

CDR lengths

33 of 65

Conditional Sampling (Twisted SMC)

AbBFN-X

“Ancestor” antibody


34 of 65

AbBFN-X

[Figure: CDR-H3 length shown with the sequence regions FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, FWR-H4, FWR-L1, CDR-L1, FWR-L2, CDR-L2, FWR-L3, CDR-L3, FWR-L4 (each AGL…)]


35 of 65


36 of 65

[Figure: Hydrophobicity and LV gene shown with the sequence regions FWR-H1, CDR-H1, FWR-H2, CDR-H2, FWR-H3, CDR-H3, FWR-H4 (each AGL…)]


37 of 65


38 of 65


39 of 65

p(VL Seq. (AGL…) | VH Seq. (AGL…)) =


40 of 65

Gene Identification: p( gene | seq )

F1 score with imbalance correction

Label            AbBFN    ANARCI
HV gene family   1.0000   1.0000
HD gene family   0.6802
LV gene family   1.0000   0.9894
HV gene          0.9796   0.9684
HD gene          0.5766
HJ gene          0.9792   0.8792
LV gene          0.9913   0.9894
LJ gene          0.9359   0.9071
LC locus         1.0000   1.0000


41 of 65

Guided Antibody Design

Conditioning information:

  • HV gene: IGHV1-2
  • CDR-L3 length: 5 residues
  • Developable
  • Species: human
  • Light Chain type: Kappa
  • Position-specific mutations (IMGT):
    • VH Framework: 55W, 66N, 80R
    • VH CDR-H3: 113W

In OAS, there are only 21 such sequences (out of 2M)

With AbBFN, we can easily generate thousands.

AbBFN has a 65,000x higher hit rate.
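
A quick back-of-the-envelope check of what these numbers imply, assuming "2M" means exactly two million sequences and that "hit rate" is the fraction of sequences satisfying every condition (both are readings of the slide, not stated definitions):

```python
oas_matches = 21        # sequences in OAS meeting every condition (from the slide)
oas_size = 2_000_000    # assumption: "2M" read as exactly two million
improvement = 65_000    # reported ratio of hit rates (from the slide)

# Fraction of database sequences that satisfy all the conditions.
baseline_hit_rate = oas_matches / oas_size

# Hit rate the model would need for the reported ratio to hold.
implied_model_hit_rate = baseline_hit_rate * improvement
```

Under these assumptions the baseline is about one in 95,000 sequences, and the implied model hit rate is roughly two in three generated sequences.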

[Figure legend: Antigen, BFN-VL, Natural-VL, BFN-VH, Natural-VH]


42 of 65

Summary

  • Generative AI models can now learn incredibly complex joint distributions

  • Conditional sampling from these distributions allows us to query the model, using it as a data explorer

  • This can provide a powerful exploratory tool for Biology and other data-rich sciences


43 of 65

Extra Slides

44 of 65

Natural, Diverse & Novel

ProtBFN learns statistical and biochemical properties of natural proteins with high fidelity.

More natural… [1]

…more diverse… [1]

…and highly novel. [2]

95%: sequence identity < 95%

89%: sequence identity < 80%

44%: sequence identity < 50%

[1] 10,000 generated sequences from each model are matched to clusterings from UniRef50. A hit is a match with >50% sequence identity. The coverage score is the ratio of the number of unique clusters hit to the number expected if sequences were drawn i.i.d. from the model's training distribution.

[2] Identity of ProtBFN-generated sequences to the best-matching protein sequence in the model's training data. Any sequence with identity < 100% is novel: the model has not seen it before.


45 of 65

Globular Structural Motifs With Novel Sequences

Single and multi-domain proteins.

Globally coherent generations with inter-domain interactions.

Predicted structures of generated sequences show natural, globally coherent and functionally diverse folds.

Spans diversity of known structures and tree-of-life.

Alpha Helical, Beta Sheet, Alpha-Beta and Irregular domains.

Small and large domains.

Transmembrane Proteins (porins and transporters) and Enzymes.

Domains specific to Archaea, Bacteria, Eukarya (Plants, Humans).

Structure largely determines function in nature.

Sequence (AGL…) → Structure → Function


46 of 65

System Components

Bayesian: Use Bayesian inference to update distributions over individual variables (pixels in images, letters in text, coordinates in molecules…) given a sequence of noisy samples from the data

Flow: Generalise to continuous time to create a flow of information from variables to distributions

Networks: Feed the distribution parameters to a neural network and use it to model the data

Bayesian Flow Networks, Graves et al., 2023


47 of 65

Bayesian Updates for Continuous Data

Reference: Conjugate Bayesian analysis of the Gaussian distribution, K. Murphy 2007

Bayesian Flow Networks, Graves et al., 2023
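
The standard conjugate update for a Gaussian belief with known-precision observations can be sketched as follows. The variable names (mu, rho for the belief's mean and precision, alpha for the accuracy of each noisy sample) and the numerical values are assumptions chosen for this example.

```python
import random

def bayesian_update(mu, rho, y, alpha):
    """Posterior mean and precision after observing y ~ N(x, 1/alpha),
    starting from the belief N(mu, 1/rho) about x."""
    rho_new = rho + alpha                      # precisions add
    mu_new = (rho * mu + alpha * y) / rho_new  # precision-weighted average
    return mu_new, rho_new

rng = random.Random(0)
x_true = 0.7          # hypothetical data value
mu, rho = 0.0, 1.0    # standard normal prior
alpha = 0.5           # accuracy (precision) of each noisy sample
for _ in range(2000):
    y = rng.gauss(x_true, alpha ** -0.5)
    mu, rho = bayesian_update(mu, rho, y, alpha)
```

Each noisy sample moves the mean towards the data and increases the precision, so the belief sharpens around x_true as samples accumulate.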


48 of 65

Update Distribution

Bayesian Flow Networks, Graves et al., 2023


49 of 65

Bayesian Flow Distribution

Bayesian Flow Networks, Graves et al., 2023
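
For continuous data with a standard normal prior, the Bayesian flow distribution can be written in closed form. The block below is a sketch in the notation of the cited paper (x is the data, μ the input mean, σ_1 the noise level at t = 1) and should be checked against the paper before being relied on.

```latex
\beta(t) = \sigma_1^{-2t} - 1, \qquad
\gamma(t) = \frac{\beta(t)}{1 + \beta(t)} = 1 - \sigma_1^{2t}

p_F(\mu \mid x; t) = \mathcal{N}\!\left(\mu \,\middle|\, \gamma(t)\, x,\; \gamma(t)\bigl(1 - \gamma(t)\bigr) I\right)
```

At t = 0 this collapses onto the prior mean (γ = 0), and as t approaches 1 the mean approaches x with variance shrinking towards zero.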


50 of 65

Comparison with Diffusion

Reverse Process

Input Variance

Bayesian Flow Networks, Graves et al., 2023


51 of 65

Discrete Data

  • For discrete x with K classes the input distribution is categorical with a uniform prior
  • But what’s the sender distribution?
  • interpolates between no information about x and complete information
  • Could we set ?
  • Problem: discrete, non-differentiable update process


52 of 65

Discrete Flow

Bayesian Flow Networks, Graves et al., 2023
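
A minimal sketch of the discrete case, following the sender distribution and update rule given in the cited paper: noisy samples y are Gaussian vectors centred on a scaled one-hot encoding of the class, and the categorical parameters are updated multiplicatively. The class count, accuracy and step count are assumptions chosen for illustration.

```python
import math
import random

def sender_sample(x, num_classes, alpha, rng):
    """y ~ N(alpha * (K * e_x - 1), alpha * K * I), one entry per class."""
    std = math.sqrt(alpha * num_classes)
    return [
        rng.gauss(alpha * (num_classes * (1.0 if k == x else 0.0) - 1.0), std)
        for k in range(num_classes)
    ]

def bayesian_update(theta, y):
    """theta'_k is proportional to exp(y_k) * theta_k (computed in log space)."""
    logits = [math.log(t) + yk for t, yk in zip(theta, y)]
    top = max(logits)
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

rng = random.Random(0)
K, x_true, alpha = 3, 2, 0.05
theta = [1.0 / K] * K       # uniform prior over classes
trajectory = [theta]
for _ in range(400):
    theta = bayesian_update(theta, sender_sample(x_true, K, alpha, rng))
    trajectory.append(theta)
```

The data is a discrete class, yet the parameters in trajectory move through the probability simplex in small continuous steps, which is what makes the generative process differentiable.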


53 of 65

Flow Distribution for Discrete Data

Bayesian Flow Networks, Graves et al., 2023


54 of 65

Flow Distribution for Binary Data

Bayesian Flow Networks, Graves et al., 2023


55 of 65

Output Distributions

  • Bayesian inference is great for independent variables
  • But generative modelling is all about interdependent variables: pixels in an image, words in a text…
  • Can model dependencies by using a neural network to map from input distributions to output distributions:


56 of 65

Types of Output Distribution

  • For continuous data the output distribution is a delta distribution*
  • For discrete / binary data we use a Categorical / Bernoulli
  • For discretised data we use a discretised Gaussian*

*Can either predict the data directly, or predict the input noise and reparametrise

Bayesian Flow Networks, Graves et al., 2023


57 of 65

Continuous Data

Bayesian Flow Networks, Graves et al., 2023


58 of 65

Continuous data

Input Mean

Output Mean

Bayesian Flow Networks, Graves et al., 2023


59 of 65

Discretised Data

Bayesian Flow Networks, Graves et al., 2023


60 of 65

Discrete data

Bayesian Flow Networks, Graves et al., 2023


61 of 65

Binary data

Bayesian Flow Networks, Graves et al., 2023


62 of 65

Loss Function

The n-step loss is the expected sum of n sender-receiver transmission costs*:

Can rewrite the sum as an expectation

Then take the limit n → ∞ to get the continuous-time loss

*Can also think of this as the latent loss of a VAE, with the sequence of sender samples as the latent variable
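
The three steps above can be written out as follows. This is a sketch in the notation of Graves et al., 2023, where p_S is the sender distribution, p_R the receiver distribution, p_F the Bayesian flow distribution, α_i the accuracy at step i and t_i = i/n; the exact form should be checked against the paper.

```latex
L^{n}(x) = \mathbb{E}_{p(\theta_1, \dots, \theta_{n-1})}
  \sum_{i=1}^{n} D_{KL}\!\left( p_S(\cdot \mid x; \alpha_i) \,\middle\|\, p_R(\cdot \mid \theta_{i-1}; t_{i-1}, \alpha_i) \right)

L^{n}(x) = n \, \mathbb{E}_{i \sim U\{1, n\},\; p_F(\theta \mid x; t_{i-1})}
  D_{KL}\!\left( p_S(\cdot \mid x; \alpha_i) \,\middle\|\, p_R(\cdot \mid \theta; t_{i-1}, \alpha_i) \right)

L^{\infty}(x) = \lim_{n \to \infty} L^{n}(x)
```

Rewriting the sum as an expectation over a uniformly sampled step index is what allows the loss to be estimated from a single randomly chosen step per training example.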


63 of 65

Continuous Time Loss

It turns out the loss simplifies in continuous time (Proposition 3.1):

Continuous

Discrete

Discretised
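
For reference, the simplified continuous-time losses take roughly the following forms in Graves et al., 2023. Here x̂ is the network's estimate of the data, e_x the one-hot encoding of class x with ê its predicted probabilities, and k̂ the expected bin centre under the discretised output distribution; treat this as a sketch to be checked against the paper.

```latex
\text{Continuous:} \quad
L^{\infty}(x) = -\ln \sigma_1 \; \mathbb{E}_{t \sim U(0,1),\; p_F(\theta \mid x; t)}
  \frac{\lVert x - \hat{x}(\theta, t) \rVert^{2}}{\sigma_1^{2t}}

\text{Discrete:} \quad
L^{\infty}(x) = K \beta(1) \; \mathbb{E}_{t \sim U(0,1),\; p_F(\theta \mid x; t)}
  \; t \, \lVert e_x - \hat{e}(\theta, t) \rVert^{2}

\text{Discretised:} \quad
L^{\infty}(x) = -\ln \sigma_1 \; \mathbb{E}_{t \sim U(0,1),\; p_F(\theta \mid x; t)}
  \frac{\lVert x - \hat{k}(\theta, t) \rVert^{2}}{\sigma_1^{2t}}
```

In all three cases the loss is a time-weighted squared error between the data and the network's prediction, which is why one objective can cover every data type.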


64 of 65

Sample Generation
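
A simplified sketch of iterative sample generation for a single continuous variable, loosely following the structure of the sampling algorithm in Graves et al., 2023. The `predict` callable stands in for the trained network and is an assumption of this example, as are the step count and the value of sigma_1; the real procedure operates on whole data vectors with a learned network.

```python
import random

def generate(predict, n_steps, sigma_1, rng):
    """Iterative generation for one continuous variable.

    predict(mu, t) is a stand-in for the trained network: it returns the
    current estimate of the data given the input mean mu at time t.
    """
    mu, rho = 0.0, 1.0                        # standard normal prior
    for i in range(1, n_steps + 1):
        t = (i - 1) / n_steps
        x_hat = predict(mu, t)
        # Accuracy added at step i; over all steps these sum to sigma_1**-2 - 1.
        alpha = sigma_1 ** (-2 * i / n_steps) * (1 - sigma_1 ** (2 / n_steps))
        y = rng.gauss(x_hat, alpha ** -0.5)   # noisy sample around the estimate
        mu = (rho * mu + alpha * y) / (rho + alpha)
        rho += alpha
    return predict(mu, 1.0), mu, rho

def oracle(mu, t):
    """Hypothetical 'network' that always predicts the same value."""
    return 0.5

sample, final_mu, final_rho = generate(
    oracle, n_steps=100, sigma_1=0.02, rng=random.Random(0)
)
```

The loop alternates between asking the network for an estimate and folding a noisy sample of that estimate back into the input distribution with the same Bayesian update used during training.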


65 of 65

Network Training
