1 of 27

Towards De novo generation of DNA-binding Protein Sequences using Discrete Diffusion

Dhruva Rajwade | Riddhiman Dhar

2 of 27

Motivation

  • TF-DNA binding

  • CRISPR

  • Aptamers

Aptamers are nucleic acid (NA) sequences

3 of 27

Intro: Diffusion Models

  • Map noise to a distribution

  • Learn to invert noise (slowly)

  • AlphaFold3: Implements Structural Diffusion

How AlphaFold3 works

From noise…to meow

Continuous Diffusion Overview
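
To make "map noise to a distribution" concrete, here is a minimal sketch of the forward (noising) half of continuous diffusion. It assumes a variance-preserving Gaussian process with a linear beta schedule; the schedule values and the names `alpha_bar` and `forward_noise` are illustrative assumptions, not taken from the slides.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance left after t steps of an assumed linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def forward_noise(x0, t, rng):
    """Sample x_t ~ N(sqrt(a_bar) * x0, (1 - a_bar) * I) in closed form."""
    a_bar = alpha_bar(t)
    return [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]

rng = random.Random(0)
x0 = [1.0, -0.5, 2.0]                  # toy "clean" data point
x_early = forward_noise(x0, 10, rng)   # still close to x0
x_late = forward_noise(x0, 1000, rng)  # essentially pure Gaussian noise
```

A trained model would learn the reverse direction, removing a little noise at each step; that part is omitted here.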

4 of 27

Intro: Discrete Diffusion

  • Can’t differentiate a Discrete Function

  • Convert Discrete distributions into probabilities

  • Learn to hide amino acids, and uncover them

Each column: Different noise scale

Forward (Flip a coin and mask)

Reverse (Unmask and repeat)
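
The forward and reverse processes above can be sketched as follows. This is a toy illustration under stated assumptions: the mask symbol, the reveal schedule, and `toy_denoiser` (a uniformly random stand-in for a trained network) are all hypothetical and not the model used in this work.

```python
import random

MASK = "#"  # placeholder mask token (assumed symbol)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def forward_mask(seq, t, rng):
    """Forward process: mask each residue independently with probability t."""
    return "".join(MASK if rng.random() < t else aa for aa in seq)

def toy_denoiser(seq, rng):
    """Stand-in for a trained network: proposes a random residue per position."""
    return [rng.choice(AMINO_ACIDS) for _ in seq]

def reverse_unmask(length, num_steps, rng):
    """Reverse process: start fully masked and reveal some positions per step."""
    seq = [MASK] * length
    for step in range(num_steps, 0, -1):
        masked = [i for i, aa in enumerate(seq) if aa == MASK]
        if not masked:
            break
        proposals = toy_denoiser(seq, rng)
        # Reveal about 1/step of the remaining masked positions,
        # so everything is revealed by the final step (step == 1).
        k = max(1, len(masked) // step)
        for i in rng.sample(masked, k):
            seq[i] = proposals[i]
    return "".join(seq)

rng = random.Random(0)
noisy = forward_mask("MKTAYIAKQR", 0.5, rng)
sample = reverse_unmask(30, 5, rng)
```

Each noise scale `t` corresponds to one column of the figure: `t = 0` leaves the sequence intact and `t = 1` masks every position.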

5 of 27

Objectives

  • A non-redundant dataset of DNA-binding proteins[1]

  • Using Discrete Diffusion to approximate this distribution

  • Generating diverse sequences | Conditional Generation

[1] Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein (Zeng et al., Nature Communications (2024))

The dataset construction strategy from [1]

6 of 27

Results: Sanity Checks (12k Generated Sequences)

Train Set

Generated Set

7 of 27

Results: Diversity and Homology Scanning

8 of 27

Results: Exploring The Novel Cluster (102 points)

96 AA | 42.5% Similarity

194 AA | 34.5% Similarity

116 AA | 37% Similarity

135 AA | 43% Similarity

  • Unseen FCD Domain Designed
  • Individual domains: Possible!

9 of 27

Results: Multi-domain Proteins + Novel Domains

215 AA | 52% Similarity

196 AA | 32% Similarity

10 of 27

Results: Structural Analyses with ESMFold

Predicted TM-Scores

Mean pLDDT Scores
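
As a rough illustration of how a mean pLDDT score can be read off a predicted structure, the sketch below averages the B-factor column over C-alpha atoms of a PDB-format string, which is where ESMFold stores per-residue confidence. The helper names are hypothetical, and depending on the tool version the values may be on a 0-1 or 0-100 scale, so check your own output before comparing numbers.

```python
def pdb_atom_line(serial, name, res_name, res_seq, x, y, z, bfactor):
    """Build a fixed-width PDB ATOM record (chain A, occupancy 1.00) for the demo."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res_name, res_seq, x, y, z, 1.0, bfactor
    )

def mean_plddt(pdb_text):
    """Average the B-factor field (columns 61-66) over C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return sum(scores) / len(scores)

# Toy two-residue structure with made-up coordinates and confidence values.
toy_pdb = "\n".join([
    pdb_atom_line(1, "N", "MET", 1, 0.0, 0.0, 0.0, 50.0),
    pdb_atom_line(2, "CA", "MET", 1, 1.5, 0.0, 0.0, 80.0),
    pdb_atom_line(3, "CA", "LYS", 2, 5.3, 0.0, 0.0, 90.0),
])
toy_mean = mean_plddt(toy_pdb)
```

Only C-alpha atoms are counted so that each residue contributes once, regardless of how many atoms it has.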

11 of 27

Results: Visualizing Predictions (Not Cherry-Picked)

12 of 27

Application: Inpainting a DNA-binding domain (113 AA)

→ Mask → MDLM → Inpaint →

Original Sequence (folded)

Generated Sequence (folded)

Original →

Generated→
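
The Mask → MDLM → Inpaint pipeline can be sketched as below: a region is masked, the flanking residues are kept as fixed context, and only the masked positions are filled in. `random_proposal` is a hypothetical stand-in for the MDLM denoiser, and the sequence and region boundaries are toy values chosen for illustration.

```python
import random

MASK = "#"  # placeholder mask token (assumed symbol)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_region(seq, start, end):
    """Replace residues in [start, end) with the mask token; flanks stay as context."""
    return seq[:start] + MASK * (end - start) + seq[end:]

def random_proposal(position, masked_seq, rng):
    """Hypothetical stand-in for the denoiser: a real model would use the context."""
    return rng.choice(AMINO_ACIDS)

def inpaint(masked_seq, propose, rng):
    """Fill only the masked positions; unmasked residues are copied through unchanged."""
    return "".join(
        propose(i, masked_seq, rng) if aa == MASK else aa
        for i, aa in enumerate(masked_seq)
    )

rng = random.Random(0)
original = "MKTAYIAKQRQISFVKSH"
masked = mask_region(original, 5, 12)
generated = inpaint(masked, random_proposal, rng)
```

Because the flanks are never modified, the generated sequence always agrees with the original outside the masked region.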

13 of 27

Seq2Contact[1]: Framework

[1] Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models (Rajwade et al., NeurIPS Workshops MLSB, AIDrugX, FM4Science (2024))

14 of 27

Seq2Contact: Contact Map Predictions

Top

Bottom

15 of 27

Inverting Seq2Contact: DNA sequence generation

  • Generate potential DNA-binders using Seq2Contact as a reward network

  • Evaluation: Scanning homology with JASPAR[1]

  • TODO: Docking? MD Simulations?

Repeat till convergence
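
One simple way to use a scoring network as a reward is a greedy mutate-and-score loop, sketched below. This is an assumption about the general shape of such a procedure, not the method used in this work: `toy_reward` is a hypothetical stand-in for a Seq2Contact-derived score, and the toy motif, iteration budget, and patience value are arbitrary.

```python
import random

BASES = "ACGT"

def toy_reward(dna, target="ACGTACGTA"):
    """Stand-in reward: counts positions matching an arbitrary toy motif."""
    return sum(a == b for a, b in zip(dna, target))

def optimize_dna(reward, length, rng, max_iters=2000, patience=200):
    """Keep a random point mutation only if it improves the reward."""
    seq = [rng.choice(BASES) for _ in range(length)]
    best = reward("".join(seq))
    stale = 0
    for _ in range(max_iters):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice([b for b in BASES if b != old])
        score = reward("".join(seq))
        if score > best:
            best, stale = score, 0
        else:
            seq[pos] = old  # reject the mutation
            stale += 1
        if stale >= patience:
            break  # treat a long run without improvement as convergence
    return "".join(seq), best

best_seq, best_score = optimize_dna(toy_reward, 9, random.Random(0))
```

With a learned reward, the same loop would call the network on the (protein, candidate DNA) pair instead of `toy_reward`; gradient-based or sampling-based alternatives are equally plausible.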

[1] A new generation of JASPAR, the open-access repository for transcription factor binding site profiles (Vlieghe et al., Nucleic Acids Research (2005))

16 of 27

DNA sequence generation (Results)

Optimization Metrics

G C G T C T C T G

JASPAR Hit vs Generated

Distribution of Hamming Distances

  • Protein: ZnF domain

  • Generated DNA: ZnF binding

  • Histogram: Novel DNA motifs (potential)

Chai1: Predicted Complex
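
For reference, a distribution of Hamming distances like the one in the histogram can be computed as sketched below. The function names and the example sequences are illustrative assumptions; how the sequences on this slide were paired with their JASPAR hits is not specified here.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def hamming_histogram(generated, reference):
    """Map each distance to the number of generated sequences at that distance."""
    return Counter(hamming(seq, reference) for seq in generated)

# Toy example with made-up 9-mers compared against a made-up reference site.
reference = "GCGTCTCTG"
generated = ["GCGTCTCTG", "GCGTCTCTA", "ACGTCTCTA", "GCGACTCTG"]
histogram = hamming_histogram(generated, reference)
```

Sequences at larger distances from every known site are the candidates for novel motifs; sequences at distance zero reproduce a known site exactly.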

17 of 27

Conclusions and Next Steps

  • Diverse (<25%) sequences produced

  • DNA-binding motifs generated successfully (Pfam)

  • Conditional generation: Inpainting (Motif scaffolding)

  • Next Steps: Guidance with Seq2Contact, to generate sequences with binding affinity to target DNA sequence

18 of 27

Guiding Towards specific DNA-binding

19 of 27

Extras: Quantitative Metrics (Seq2Contact)

Seq2Contact: Performance comparison with AF3[1], RF2NA[2] and FAFormer[3]

[1] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)

[2] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)

[3] Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer (Huang et al., NeurIPS 2024)

  • Seq2Contact: Predicts binding purely from sequences

20 of 27

Extras: Seq2Contact: Data Generation

3D structures of complexes

Length of Protein

Length of DNA/RNA

Contact Maps

From structure to 3D coordinates
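
The step from 3D coordinates to a contact map can be sketched as a distance threshold. This is a minimal illustration assuming one representative coordinate per protein residue and per nucleotide, and an 8 Å cutoff; both choices are assumptions rather than the settings used for Seq2Contact.

```python
import math

def contact_map(protein_xyz, na_xyz, cutoff=8.0):
    """Binary map of shape (protein length, DNA/RNA length).

    Entry [i][j] is 1 when protein residue i lies within `cutoff`
    angstroms of nucleotide j, and 0 otherwise.
    """
    return [
        [1 if math.dist(p, n) <= cutoff else 0 for n in na_xyz]
        for p in protein_xyz
    ]

def contact_fraction(cmap):
    """Fraction of entries that are contacts."""
    total = sum(len(row) for row in cmap)
    return sum(sum(row) for row in cmap) / total

# Toy coordinates (angstroms): two residues and three nucleotides.
protein_xyz = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
na_xyz = [(3.0, 0.0, 0.0), (30.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
cmap = contact_map(protein_xyz, na_xyz)
```

The rows and columns correspond to the "Length of Protein" and "Length of DNA/RNA" axes of the map.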

21 of 27

Extras: Seq2Contact: Contact Map Predictions (3D)

8E4H

4TUR

22 of 27

Extras: Challenges (Seq2Contact)

  • Sparsity of contact maps (Most values are zero!)

  • Lack of enough data: For the model to learn binding

  • Lack of enough data: For the model to be exposed to enough representative types of complexes

Histogram of Binding/Non-Binding (%) values

23 of 27

Extras: Challenges (Can we trust our splits?)

24 of 27

Extras: DNA-binding Protein Motifs

25 of 27

Extras: ESM2 Applications

A sequence level task (Mutation effect prediction)

A structure level task: Tertiary structure prediction

26 of 27

Extras: Related Work On Sequences

  • Inferring biological meaning from sequences[1] [2]

  • Sequences contain mutations, and hence contain evolutionary information

  • Representation Learning on sequences

The Masked Modelling Strategy

[1] Language models of protein sequences at the scale of evolution enable accurate structure prediction (Lin et al., Science 2023)

[2] DNA language models are powerful predictors of genome-wide variant effects (Benegas et al., PNAS 2023)

27 of 27

Extras: Related Work on Structures

  • From sequence to backbone structures[1] [2]

  • From backbone structure to a viable sequence[3]

  • Designing proteins that bind to other proteins[4]

[1] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)

[2] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)

[3] Robust deep learning–based protein sequence design using ProteinMPNN (Dauparas et al., Science 2022)

[4] De novo design of high-affinity protein binders with AlphaProteo (Zambaldi et al., arXiv 2024)

Figure panels: (A) AF3, (B) AlphaProteo