1 of 27

Towards De novo generation of DNA-binding Protein Sequences using Discrete Diffusion

Dhruva Rajwade | Riddhiman Dhar

2 of 27

Motivation

TF-DNA binding

CRISPR

Aptamers

Aptamers are nucleic acid (NA) sequences

3 of 27

Intro: Diffusion Models

Map noise to a distribution

Learn to invert noise (slowly)

AlphaFold3: Implements Structural Diffusion

How AlphaFold3 works

From noise…to meow

Continuous Diffusion Overview

4 of 27

Intro: Discrete Diffusion

Can’t differentiate a Discrete Function

Convert Discrete distributions into probabilities

Learn to hide amino acids, and uncover them

Each column: Different noise scale

Forward (Flip a coin and mask)

Reverse (Unmask and repeat)
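
The forward and reverse steps above can be sketched in a few lines. This is a minimal illustration, not the model from the talk: the mask symbol `#`, the function names, and `toy_denoiser` (which guesses residues uniformly at random in place of a trained network) are all assumptions made for the example.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen outside the amino-acid alphabet
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def forward_mask(seq, t, rng):
    """Forward step: flip a biased coin per residue and mask with probability t."""
    return "".join(MASK if rng.random() < t else aa for aa in seq)


def toy_denoiser(noisy, rng):
    """Stand-in for a trained network: proposes a random residue per position."""
    return [rng.choice(AMINO_ACIDS) for _ in noisy]


def reverse_unmask(noisy, steps, rng, denoiser=toy_denoiser):
    """Reverse process: reveal part of the masked positions at each step."""
    seq = list(noisy)
    for step in range(steps, 0, -1):
        masked = [i for i, aa in enumerate(seq) if aa == MASK]
        if not masked:
            break
        guesses = denoiser(seq, rng)
        # Reveal about 1/step of what is still hidden; the final step clears the rest.
        n_reveal = max(1, len(masked) // step)
        for i in rng.sample(masked, n_reveal):
            seq[i] = guesses[i]
    return "".join(seq)


rng = random.Random(0)
noisy = forward_mask("MKTAYIAKQRQISFVKSHFSRQ", t=0.5, rng=rng)
restored = reverse_unmask(noisy, steps=4, rng=rng)
```

Larger values of `t` correspond to the noisier columns in the figure; a real denoiser would condition its guesses on the residues that remain visible.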

5 of 27

Objectives

A non-redundant dataset of DNA-binding proteins^[1]

Using Discrete Diffusion to approximate this distribution

Generating diverse sequences | Conditional Generation

[1] Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein (Zeng et al., Nature Communications (2024))

The dataset construction strategy from [1]

6 of 27

Results: Sanity Checks (12k Generated Sequences)

Train Set

Generated Set

7 of 27

Results: Diversity and Homology Scanning

8 of 27

Results: Exploring The Novel Cluster (102 points)

96 AA | 42.5% Similarity

194 AA | 34.5% Similarity

116 AA | 37% Similarity

135 AA | 43% Similarity

Unseen FCD Domain Designed
Individual domains: Possible!

9 of 27

Results: Multi-domain Proteins+Novel Domains

215 AA | 52% Similarity

196 AA | 32% Similarity

10 of 27

Results: Structural Analyses with ESMFold

Predicted TM-Scores

Mean pLDDT Scores
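
One way a mean pLDDT could be read off a predicted structure is sketched below. It assumes the predictor writes per-residue pLDDT into the B-factor column of its PDB output (as ESMFold does) and averages over C-alpha atoms; `mean_plddt` and the `make_ca_line` helper that builds toy input are hypothetical names, and the score scale (0 to 1 or 0 to 100) depends on the tool version.

```python
def make_ca_line(serial, plddt):
    """Hypothetical helper: build one fixed-column PDB ATOM record for a CA atom."""
    return (
        f"ATOM  {serial:5d}  CA  MET A{serial:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def mean_plddt(pdb_text):
    """Average the B-factor field (columns 61-66) over all CA atoms."""
    scores = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores) / len(scores)


toy_pdb = "\n".join([make_ca_line(1, 80.0), make_ca_line(2, 90.0)])
toy_mean = mean_plddt(toy_pdb)
```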

11 of 27

Results: Visualizing Predictions (Not Cherry-Picked)

12 of 27

Application: Inpainting a DNA-binding domain (113 AA)

→ Mask → MDLM → Inpaint →

Original Sequence (folded)

Generated Sequence (folded)

Original →

Generated →
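
The mask-then-inpaint pipeline can be illustrated with a small sketch. The window coordinates, the `#` mask symbol, and `random_fill` (a placeholder for the trained MDLM denoiser) are assumptions for this example only; the point it shows is that residues outside the masked window are never modified.

```python
import random

MASK = "#"  # hypothetical mask symbol
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mask_region(seq, start, end):
    """Hide residues in the half-open window [start, end)."""
    return seq[:start] + MASK * (end - start) + seq[end:]


def inpaint(masked_seq, fill_fn):
    """Fill masked positions only; the surrounding scaffold is left unchanged."""
    out = list(masked_seq)
    for i, aa in enumerate(out):
        if aa == MASK:
            out[i] = fill_fn(out, i)
    return "".join(out)


rng = random.Random(0)


def random_fill(context, position):
    """Placeholder for a trained denoiser: ignores context and samples uniformly."""
    return rng.choice(AMINO_ACIDS)


original = "MKTAYIAKQRQISFVKSHFSRQ"
masked = mask_region(original, 5, 12)
inpainted = inpaint(masked, random_fill)
```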

13 of 27

Seq2Contact^[1]: Framework

[1] Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models (Rajwade et al., NeurIPS Workshops MLSB, AIDrugX, FM4Science (2024))

14 of 27

Seq2Contact: Contact Map Predictions

Top

Bottom

15 of 27

Inverting Seq2Contact: DNA sequence generation

Generate potential DNA-binders using Seq2Contact as a reward network

Evaluation: Scanning homology with JASPAR^[1]

TODO: Docking? MD Simulations?

Repeat till convergence

[1] A new generation of JASPAR, the open-access repository for transcription factor binding site profiles (Vlieghe et al., Nucleic Acids Research (2005))
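
The "score, update, repeat till convergence" loop can be sketched as follows. This uses a greedy mutate-and-accept search, which may differ from the optimiser actually used with Seq2Contact, and `toy_reward` (fraction of positions matching an arbitrary made-up motif) is a hypothetical stand-in for the reward network.

```python
import random

BASES = "ACGT"


def toy_reward(dna, motif="ACGTACGTA"):
    """Hypothetical reward: fraction of positions that match an arbitrary motif."""
    return sum(a == b for a, b in zip(dna, motif)) / len(motif)


def optimise(reward_fn, length, rng, max_iters=2000, patience=200):
    """Propose single-base mutations and keep only those that raise the reward."""
    seq = [rng.choice(BASES) for _ in range(length)]
    best = reward_fn("".join(seq))
    stale = 0  # consecutive proposals without improvement
    for _ in range(max_iters):
        if stale >= patience:  # treat a long plateau as convergence
            break
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice([b for b in BASES if b != old])
        score = reward_fn("".join(seq))
        if score > best:
            best, stale = score, 0
        else:
            seq[pos] = old  # reject the mutation
            stale += 1
    return "".join(seq), best


best_seq, best_score = optimise(toy_reward, length=9, rng=random.Random(0))
```

The `patience` counter is one simple convergence criterion; a learned reward would replace `toy_reward` without changing the loop.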

16 of 27

DNA sequence generation (Results)

Optimization Metrics

G C G T C T C T G

JASPAR Hit vs Generated

Distribution of Hamming Distances

Protein: ZnF domain

Generated DNA: ZnF binding

Histogram: Novel DNA motifs (potential)

Chai-1: Predicted Complex

17 of 27

Conclusions and Next Steps

Diverse (<25%) sequences produced

DNA-binding motifs generated successfully (Pfam)

Conditional generation: Inpainting (Motif scaffolding)

Next Steps: Guidance with Seq2Contact to generate sequences with binding affinity to a target DNA sequence

18 of 27

Guiding Towards specific DNA-binding

19 of 27

Extras: Quantitative Metrics (Seq2Contact)

Seq2Contact: Performance comparison with AF3^[1], RF2NA^[2] and FAFormer^[3]

[1] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)

[2] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)

[3] Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer (Huang et al., NeurIPS 2024)

Seq2Contact: Predicts binding purely from sequences

20 of 27

Extras: Seq2Contact: Data Generation

3D structures of complexes

Length of Protein

Length of DNA/RNA

Contact Maps

From structure to 3D coordinates
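
A minimal sketch of how a binary contact map could be derived from 3D coordinates: one representative point per protein residue and per nucleotide, with a pair marked as a contact when the two points lie within a distance cutoff. The 6.0 Å cutoff, the single-point representation, and the function names are assumptions for illustration, not the settings used to build the dataset.

```python
import math


def contact_map(protein_xyz, dna_xyz, cutoff=6.0):
    """Rows are protein residues, columns are nucleotides; 1 marks a contact."""
    return [
        [1 if math.dist(p, d) <= cutoff else 0 for d in dna_xyz]
        for p in protein_xyz
    ]


def contact_fraction(cmap):
    """Share of residue-nucleotide pairs that are in contact."""
    total = sum(len(row) for row in cmap)
    return sum(map(sum, cmap)) / total if total else 0.0


# Toy coordinates in angstroms (made up for the example).
protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
dna = [(0.0, 0.0, 3.0), (10.0, 0.0, 30.0)]
cmap = contact_map(protein, dna)
fraction = contact_fraction(cmap)
```

Even in this toy case only one of six pairs is a contact, which is typical of how sparse such maps are.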

21 of 27

Extras: Seq2Contact: Contact Map Predictions (3D)

8E4H

4TUR

22 of 27

Extras: Challenges (Seq2Contact)

Sparsity of contact maps (Most values are zero!)

Insufficient data for the model to learn binding

Insufficient data to expose the model to enough representative types of complexes

Histogram of Binding/Non-Binding (%) values

23 of 27

Extras: Challenges (Can we trust our splits?)

24 of 27

Extras: DNA-binding Protein Motifs

25 of 27

Extras: ESM2 Applications

A sequence level task (Mutation effect prediction)

A structure level task: Tertiary structure prediction

26 of 27

Extras: Related Work On Sequences

Inferring biological meaning from sequences^{[1] [2]}

Sequences contain mutations, and hence contain evolutionary information

Representation Learning on sequences

The Masked Modelling Strategy

[1] Language models of protein sequences at the scale of evolution enable accurate structure prediction (Lin et al., Science 2023)

[2] DNA language models are powerful predictors of genome-wide variant effects (Benegas et al., PNAS 2023)

27 of 27

Extras: Related Work on Structures

From sequence to backbone structures^{[1] [2]}

From backbone structure to a viable sequence^[3]

Designing proteins that bind to other proteins^[4]

[1] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)

[2] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)

[3] Robust deep learning–based protein sequence design using ProteinMPNN (Dauparas et al., Science 2022)

[4] De novo design of high-affinity protein binders with AlphaProteo (Zambaldi et al., arXiv 2024)

(A): AF3, (B): AlphaProteo