1 of 35

Machine Learning for Biochemical Applications

Lecture 7: Protein Structure Prediction

October 25th, 2023

Lecturers: Daryl Barth & Phillip Woolley

2 of 35

Why does protein structure matter?

https://byjus.com/biology/proteins-structure-and-functions/

3 of 35

Is amino acid sequence all you need?

  • Anfinsen’s dogma: at least for a small globular protein in its standard physiological environment, the native structure is determined only by the protein's amino acid sequence
  • The Protein Folding Problem
    • What is the folding code?
    • What is the folding mechanism?
    • Can we predict a protein's native structure from its primary (amino acid) sequence?

https://en.wikipedia.org/wiki/Anfinsen's_dogma

4 of 35

Protein Structure Prediction Problem

  • Proteins take on many conformations; we want to find the “average” one in which a protein spends most of its time…

5 of 35

CASP Competition drove protein structure prediction

https://predictioncenter.org/

6 of 35

CASP14 (2020) Results: Entrance of AlphaFold2

https://predictioncenter.org/

7 of 35

AlphaFold2: the dawn of a new age

https://www.nature.com/articles/s41586-021-03819-2

8 of 35

What is an MSA?

https://www.nature.com/articles/s41586-021-03819-2

9 of 35

MSA: Multiple Sequence Alignment

10 of 35

MSA captures evolutionary information for folding

https://www.nature.com/articles/s41586-021-03819-2
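
To make "evolutionary information" concrete, here is a minimal sketch of a classic covariation statistic: mutual information between two alignment columns. This is not what AlphaFold2 computes (it learns from the MSA directly); it only illustrates the kind of signal an MSA carries. The toy alignment and the function names are made up for illustration.

```python
from collections import Counter
from math import log2

def column(msa, j):
    """Return column j of an alignment given as equal-length strings."""
    return [seq[j] for seq in msa]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Toy alignment (hypothetical): column 1 (K/R) and column 3 (E/D) always
# change together, as two residues in contact might; columns 0 and 2 are
# fully conserved.
toy_msa = ["AKLE", "ARLD", "AKLE", "ARLD", "AKLE", "ARLD"]

mi_covarying = mutual_information(toy_msa, 1, 3)  # columns that co-vary
mi_conserved = mutual_information(toy_msa, 0, 1)  # a conserved column carries no covariation signal
```

Columns that mutate together score high, which hints at spatial contact; a fully conserved column scores zero against everything, even though conservation is informative in its own right.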

11 of 35

Template matching instantiates pair representation

https://www.nature.com/articles/s41586-021-03819-2

12 of 35

Pair representations are learned contact features

https://www.nature.com/articles/s41586-021-03819-2

13 of 35

Transformer-like module updates representations

https://www.nature.com/articles/s41586-021-03819-2

14 of 35

Structure module converts representation to structure

https://www.nature.com/articles/s41586-021-03819-2

15 of 35

Recycling iteratively refines structure

https://www.nature.com/articles/s41586-021-03819-2
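
The control flow of recycling can be sketched as a loop that feeds each pass's output back in as an extra input. Everything below is a toy under stated assumptions: `toy_step` is a hypothetical stand-in for a full network pass (a real network does not see the answer), and the only detail taken from AlphaFold2 is the default of three recycling iterations after the first pass.

```python
def predict_with_recycling(step, features, n_recycle=3):
    """Run `step` once, then n_recycle more times, feeding each output back in."""
    previous = None  # nothing to recycle on the first pass
    for _ in range(n_recycle + 1):
        previous = step(features, previous)
    return previous

def toy_step(features, previous):
    """Hypothetical stand-in for one network pass.

    It moves the current estimate halfway toward a fixed point so that the
    refinement is visible; it is not a model of the real network.
    """
    target = features["target"]
    if previous is None:
        previous = [0.0] * len(target)
    return [p + 0.5 * (t - p) for p, t in zip(previous, target)]

features = {"target": [8.0, -4.0]}
one_pass = predict_with_recycling(toy_step, features, n_recycle=0)
recycled = predict_with_recycling(toy_step, features, n_recycle=3)
```

Each extra pass starts from the previous estimate instead of from scratch, so `recycled` ends up closer to the fixed point than `one_pass`.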

16 of 35

How do we know it works?

https://phys.org/news/2020-11-ai-solution-year-old-protein.html

17 of 35

The 3Ps: pLDDT, PAE, and pTM

  • LDDT, AE, TM require ground truth. What can we do?
  • pLDDT: predicted Local Distance Difference Test
    • Values from 0 to 100
    • pLDDT > 90, high confidence
    • pLDDT < 50, low confidence

  • PAE: Predicted Aligned Error
    • Expected position error (in Å) for every pair of residues: the error at residue j when the prediction is aligned to the truth on residue i

  • pTM: predicted Template Modeling score
    • Global measure of similarity between two structures
    • Values from 0 to 1 (1 = identical)

https://www.nature.com/articles/s41586-021-03819-2
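
To see why TM (unlike pTM) needs ground truth, here is a minimal sketch of the TM-score sum for one fixed superposition. It assumes the per-residue distances between predicted and true positions are already known; the real TM-score also maximizes over superpositions, which this sketch skips. The function names are illustrative.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å) used by the TM-score."""
    if target_length <= 15:
        raise ValueError("this sketch assumes a target longer than 15 residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score_fixed_superposition(distances, target_length):
    """TM-score sum for one given superposition (no search over alignments).

    distances: per-residue distances in Å between aligned predicted and true
    positions; residues without a counterpart are simply left out, which
    lowers the score because the sum is still divided by target_length.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

perfect = tm_score_fixed_superposition([0.0] * 100, 100)         # every residue on target
half_covered = tm_score_fixed_superposition([0.0] * 50, 100)     # only half the target is modelled
at_scale = tm_score_fixed_superposition([tm_d0(100)] * 100, 100)  # every residue off by exactly d0
```

The distances come from comparing against the experimental structure, which is exactly what is missing at prediction time; pTM is the model's own estimate of this number.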

18 of 35

pLDDT and PAE visually

pLDDT: predicted Local Distance Difference Test | PAE: Predicted Aligned Error

https://www.rbvi.ucsf.edu/chimerax/data/pae-apr2022/pae.html
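
A PAE plot is just a residue-by-residue matrix, so one way to read it is to average blocks of it. The sketch below uses a made-up 4-residue matrix with two hypothetical "domains"; real PAE matrices are L x L and need not be symmetric.

```python
def mean_pae(pae, rows, cols):
    """Mean predicted aligned error (Å) over the block rows x cols."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

# Toy PAE matrix (hypothetical values, in Å): residues 0-1 form domain A,
# residues 2-3 form domain B.
toy_pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 21.0, 19.0],
    [20.0, 21.0, 0.5, 1.0],
    [22.0, 19.0, 1.0, 0.5],
]
domain_a, domain_b = [0, 1], [2, 3]

within_a = mean_pae(toy_pae, domain_a, domain_a)  # low: domain A is internally well defined
between = mean_pae(toy_pae, domain_a, domain_b)   # high: relative placement of A and B is uncertain
```

Low error inside each block with high error between blocks is the typical picture for confidently folded domains whose relative orientation the model is unsure about.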

19 of 35

Disadvantages to AlphaFold

  • Slow (MSA search)
  • Large storage for sequence database (>2TB)
  • Not open source (at the time of publication)

20 of 35

RoseTTAFold

David Baker, still in the game.

AlphaFold2 was (and is) the SOTA, but it was not open source at the time (the model weights were not publicly available).

Mimicked AlphaFold2, featuring…

  • Co-evolutionary data from sequence alignments
  • Attention based neural network architecture

Accuracy approached AlphaFold2 while allowing for modularity

It’s not over till it’s over, AlphaFold

21 of 35

RoseTTAFold: Under the Hood

22 of 35

Updated Disadvantages to AlphaFold

  • Slow (MSA search)
  • Large storage for sequence database (>2TB)
  • Not open source (at the time of publication)
    • RoseTTAFold

23 of 35

Updated Disadvantages to AlphaFold

  • Slow (MSA search)
  • Large storage for sequence database (>2TB)
  • Not open source (at the time of publication)
    • RoseTTAFold
    • AlphaFold2, now open source

24 of 35

Evolutionary Scale Modeling (ESM)

Meta project launched in 2022, cancelled in 2023… (revived in 2023 as EvolutionaryScale)

Aimed to approximate AlphaFold’s prediction accuracy while increasing speed.

ESM and ESM2 are extremely lightweight:

  • 36GB max ESM2 size vs 2TB AF2 sequence database

Super easy to set up and use by comparison.

ELI5: Replaced the AlphaFold2 multiple sequence alignment with a protein language model (ESM1, later ESM2).

“leaner, simpler, cheaper”

25 of 35

Evolutionary Scale Modeling (ESM)

26 of 35

Evolutionary Scale Modeling (ESM)

27 of 35

Evolutionary Scale Modeling (ESM)

28 of 35

ESMAtlas

  • Embedding visualization of over 772 million proteins!
  • Metagenomic Data from MGnify

29 of 35

Updated Disadvantages to AlphaFold

  • Slow (MSA search)
    • ESMFold, but lower prediction accuracy
  • Large storage for sequence database (>2TB)
    • ESMFold, but lower prediction accuracy
  • Not open source (at the time of publication)
    • RoseTTAFold
    • AlphaFold2, now open source
    • ESMFold

30 of 35

ColabFold

Martin Steinegger enters the chat.

Why compromise with lower accuracy but faster prediction when you can have both?

Replaces the slow MSA search in AF2 with the much faster MMseqs2

  • 10,000 times faster than BLAST, 16 times faster than AF2 MSA
  • Slight cost in sensitivity

Includes all AlphaFold2 models, including multimer

ELI5: AlphaFold2 but faster

Martin Steinegger! 😍

31 of 35

Updated Disadvantages to AlphaFold

  • Slow (MSA search)
    • ESMFold, but lower prediction accuracy
    • ColabFold, comparable accuracy
  • Large storage for sequence database (>2TB)
    • ESMFold, but lower prediction accuracy
    • ColabFold, comparable accuracy
  • Not open source (at the time of publication)
    • RoseTTAFold
    • AlphaFold2, now open source
    • ESMFold
    • ColabFold

32 of 35

Has protein folding been solved?

  • Inferentially, yes? Otherwise, no.
  • The Protein Folding Problem
    • What is the folding code?
    • What is the folding mechanism?
    • Can we predict a protein's native structure from its primary (amino acid) sequence?
      • No for a sequence in isolation…
      • Yes when informed by like sequences and their structures

33 of 35

Let’s give it a try!

  • Google Colab Notebooks:
    • AlphaFold
    • AlphaFold Multimer
    • ColabFold (AF2 w/MMseqs2)
  • ESMFold/ESMAtlas
  • RoseTTAFold Server
    • https://robetta.bakerlab.org/ (need to make an account)
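
Once one of these tools returns a PDB file, a quick first check is the per-residue confidence. AlphaFold-style PDB output stores pLDDT in the B-factor column, so a minimal sketch can read it with plain string slicing. The three-residue PDB text below is made up, the function names are illustrative, and the "intermediate" label is an assumption covering the range between the two thresholds quoted earlier in the lecture.

```python
def plddt_per_residue(pdb_text):
    """Return [(residue_number, pLDDT)] from the C-alpha atoms of a PDB string.

    Uses the fixed-column PDB layout: atom name in columns 13-16, residue
    number in columns 23-26, B-factor (pLDDT here) in columns 61-66.
    """
    result = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            result.append((int(line[22:26]), float(line[60:66])))
    return result

def confidence_band(plddt):
    """Bin a pLDDT value using the lecture's thresholds (>90 high, <50 low)."""
    if plddt > 90:
        return "high"
    if plddt < 50:
        return "low"
    return "intermediate"  # assumed label for everything in between

# Toy PDB fragment (hypothetical coordinates and scores).
toy_pdb = "\n".join([
    "ATOM      1  N   MET A   1      11.000  10.000  10.000  1.00 92.50           N",
    "ATOM      2  CA  MET A   1      10.000  10.000  10.000  1.00 92.50           C",
    "ATOM      3  CA  GLY A   2      13.800  10.000  10.000  1.00 45.10           C",
    "ATOM      4  CA  ALA A   3      17.600  10.000  10.000  1.00 71.00           C",
])

per_residue = plddt_per_residue(toy_pdb)
bands = [confidence_band(score) for _, score in per_residue]
```

For real work a structure parser such as Biopython is the safer choice; the slicing above only holds for well-formed fixed-column PDB files.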

34 of 35

AlphaFold2 EvoFormer

35 of 35

AlphaFold2 Structure Model