Towards De novo generation of DNA-binding Protein Sequences using Discrete Diffusion
Dhruva Rajwade | Riddhiman Dhar
Motivation
Aptamers are nucleic acid (NA) sequences
Intro: Diffusion Models
How AlphaFold3 works
From noise…to meow
Continuous Diffusion Overview
Intro: Discrete Diffusion
Each column: Different noise scale
Forward (Flip a coin and mask)
Reverse (Unmask and repeat)
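The forward and reverse steps above can be illustrated with a minimal sketch. It assumes an absorbing-state (masking) formulation in which each residue is masked independently with probability t, and it replaces the trained denoising network with a hypothetical `toy_denoiser` that samples residues uniformly. The mask symbol, function names, and unmasking schedule are illustrative choices, not the implementation used in this work.

```python
import random

MASK = "#"  # placeholder mask symbol (assumption, not the model's real token)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def forward_mask(seq, t, rng):
    """Forward step: flip a biased coin per residue and mask with probability t."""
    return "".join(MASK if rng.random() < t else aa for aa in seq)


def toy_denoiser(masked_seq, position, rng):
    """Hypothetical stand-in for a trained network: picks a uniformly random residue."""
    return rng.choice(AMINO_ACIDS)


def reverse_unmask(masked_seq, steps, rng, denoiser=toy_denoiser):
    """Reverse process: unmask a share of the masked positions at each step."""
    seq = list(masked_seq)
    for remaining in range(steps, 0, -1):
        masked = [i for i, c in enumerate(seq) if c == MASK]
        if not masked:
            break
        # Ceiling division so that every mask is gone after the last step.
        k = -(-len(masked) // remaining)
        for i in rng.sample(masked, k):
            seq[i] = denoiser("".join(seq), i, rng)
    return "".join(seq)


rng = random.Random(0)
original = "MKRLAEELGVSQ"  # made-up example sequence
noisy = forward_mask(original, 0.5, rng)
restored = reverse_unmask(noisy, steps=4, rng=rng)
print(noisy)
print(restored)
```

Under the same assumptions, inpainting amounts to masking a chosen span instead of random positions and running only the reverse loop, so unmasked residues are never altered.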
Objectives
[1] Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein (Zeng et al., Nature Communications (2024))
The dataset construction strategy from [1]
Results: Sanity Checks (12k Generated Sequences)
Train Set
Generated Set
Results: Diversity and Homology Scanning
Results: Exploring The Novel Cluster (102 points)
96 AA | 42.5% Similarity
194 AA | 34.5% Similarity
116 AA | 37% Similarity
135 AA | 43% Similarity
Results: Multi-domain Proteins + Novel Domains
215 AA | 52% Similarity
196 AA | 32% Similarity
Results: Structural Analyses with ESMFold
Predicted TM-Scores
Mean pLDDT Scores
Results: Visualizing Predictions (Not Cherry-Picked)
Application: Inpainting a DNA-binding domain (113 AA)
→ Mask → MDLM → Inpaint →
Original Sequence (folded)
Generated Sequence (folded)
Original →
Generated →
Seq2Contact[1]: Framework
[1] Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models (Rajwade et al., NeurIPS Workshops MLSB, AIDrugX, FM4Science (2024))
Seq2Contact: Contact Map Predictions
Top
Bottom
Inverting Seq2Contact: DNA sequence generation
Repeat till convergence
[1] A new generation of JASPAR, the open-access repository for transcription factor binding site profiles (Vlieghe et al., Nucleic Acids Research (2006))
DNA sequence generation (Results)
Optimization Metrics
G C G T C T C T G
JASPAR Hit vs Generated
Distribution of Hamming Distances
Chai1: Predicted Complex
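For the Hamming-distance panel above, a short sketch of how such a distribution might be computed, assuming each generated DNA sequence is compared position by position with an equal-length reference hit. The sequences and function names below are made up for illustration and are not results from this work.

```python
from collections import Counter


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def hamming_distribution(reference, generated):
    """Count how many generated sequences fall at each distance from the reference."""
    return Counter(hamming_distance(reference, g) for g in generated)


reference = "GCGTCTCTG"  # hypothetical reference motif hit
generated = ["GCGTCTCTG", "GCGTCACTG", "GAGTCTCTA"]  # hypothetical generated sequences
print(sorted(hamming_distribution(reference, generated).items()))
```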
Conclusions and Next Steps
Guiding Towards specific DNA-binding
Extras: Quantitative Metrics (Seq2Contact)
Seq2Contact: Performance comparison with AF3[1], RF2NA[2] and FAFormer[3]
[1] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)
[2] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)
[3] Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer (Huang et al., NeurIPS 2024)
Extras: Seq2Contact: Data Generation
3D structures of complexes
Length of Protein
Length of DNA/RNA
Contact Maps
From structure to 3D coordinates
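The data-generation step above (3D coordinates to a protein-by-nucleotide contact map) can be sketched as follows. The sketch assumes one representative 3D coordinate per residue and per nucleotide and a hypothetical 8 Å distance cutoff; the actual atom selection and threshold used for Seq2Contact are not stated here.

```python
import math


def contact_map(protein_coords, na_coords, cutoff=8.0):
    """Binary matrix of shape (protein length, DNA/RNA length).

    Entry [i][j] is 1 when residue i and nucleotide j lie within `cutoff`
    (same units as the coordinates; 8.0 is an illustrative assumption).
    """
    return [
        [1 if math.dist(p, n) <= cutoff else 0 for n in na_coords]
        for p in protein_coords
    ]


# Made-up coordinates: two residues and three nucleotides.
protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
nucleic = [(0.0, 0.0, 3.0), (10.0, 0.0, 20.0), (9.0, 0.0, 0.0)]
for row in contact_map(protein, nucleic):
    print(row)
```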
Extras: Seq2Contact: Contact Map Predictions (3D)
8E4H
4TUR
Extras: Challenges (Seq2Contact)
Histogram of Binding/Non-Binding (%) values
Extras: Challenges (Can we trust our splits?)
Extras: DNA-binding Protein Motifs
Extras: ESM2 Applications
A sequence-level task: Mutation effect prediction
A structure-level task: Tertiary structure prediction
Extras: Related Work On Sequences
The Masked Modelling Strategy
[1] Language models of protein sequences at the scale of evolution enable accurate structure prediction (Lin et al., Science 2023)
[2] DNA language models are powerful predictors of genome-wide variant effects (Benegas et al., PNAS 2023)
Extras: Related Work on Structures
[1] Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA (Baek et al., Nature Methods 2023)
[2] Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al., Nature 2024)
[3] Robust deep learning–based protein sequence design using ProteinMPNN (Dauparas et al., Science 2022)
[4] De novo design of high-affinity protein binders with AlphaProteo (Zambaldi et al., arXiv 2024)
(A): AF3, (B): AlphaProteo