1 of 59

STRUCTURAL BIOINFORMATICS II

2 of 59

STRUCTURE ALIGNMENT

  • Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation.
  • It is usually applied to protein tertiary structures but can also be used for large RNA molecules.
  • Structural alignment is a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships cannot easily be detected by sequence alignment.

3 of 59

MINIMAL ROOT MEAN SQUARE DEVIATION (RMSD)

  • The RMSD of two aligned structures indicates their divergence from one another.
  • Normally a rigid superposition which minimizes the RMSD is performed, and this minimum is returned.

δi is the distance between atom i and either a reference structure or the mean position of the N equivalent atoms.

This is often calculated for the backbone heavy atoms C, N, O, and Cα or sometimes just the Cα atoms.
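The quantity described above is RMSD = sqrt((1/N) Σ δi²). As a minimal sketch, assuming the two structures are already superposed and supplied as equal-length lists of matched (x, y, z) positions (for example Cα atoms), it could be computed as follows. The function name `rmsd` and the sample coordinates are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of matched (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # delta_i squared is the squared distance between the i-th matched atoms.
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical C-alpha positions (in angstroms) for three matched residues.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(rmsd(reference, model))
```

This sketch does not perform the rigid superposition itself; it only evaluates the deviation for coordinates that are already in a common frame.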

4 of 59

CRITERIA OF COMPARISON

  • When aligning structures with very different sequences, the side chain atoms generally are not taken into account
  • For simplicity and efficiency, often only the alpha carbon positions are considered
  • Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions
  • Additional comparison criteria that reduce noise and bolster positive matches:
    • Secondary structure
    • Native contact maps or residue interaction patterns
    • Side chain packing
    • Hydrogen bond retention

5 of 59

STRUCTURAL SUPERPOSITION

  • The optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition
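For the pairwise case, the minimized quantity can be written out explicitly. In this sketch, x_i and y_i denote the N matched atom positions of the two structures, R is a rotation matrix, and t is a translation vector; these symbols are assumptions introduced here, not notation from the slides.

```latex
E(R, \mathbf{t}) = \sum_{i=1}^{N} \left\lVert R\,\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \right\rVert^2,
\qquad
\mathrm{RMSD}_{\min} = \sqrt{\frac{1}{N}\,\min_{R,\,\mathbf{t}} E(R, \mathbf{t})}
```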

6 of 59

COMBINATORIAL EXTENSION (CE)

  • Breaks each structure into a series of fragments, which it then attempts to reassemble into a complete alignment
  • Pairwise combinations of fragments, called aligned fragment pairs (AFPs), define a similarity matrix
  • An optimal path through this matrix identifies the final alignment
  • The initial AFP can occur at any point in the matrix
  • Extension then proceeds with the next AFP that meets given distance criteria, which restrict the alignment to small gap sizes

7 of 59

COMBINATORIAL EXTENSION (CE)

8 of 59

MAMMOTH

  • Seeks the subset of the structural alignment least likely to occur by chance
  • Generates a similarity matrix from
    • Local structure overlap
    • Regular secondary structure
    • 3D-superposition
    • Same ordering in primary sequence

9 of 59

MAMMOTH

URMS^R is the expected URMS (unit-vector root mean square distance) of a random set of vectors

10 of 59

SSAP (SEQUENTIAL STRUCTURE ALIGNMENT PROGRAM)

  • Uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space
  • Constructs the vectors from the beta carbons rather than the alpha carbons
  • This allows the rotameric state of each residue, as well as its location along the backbone, to be considered
  • Builds inter-residue vectors between each residue and its neighbors
  • A first round of dynamic programming finds optimal local alignments; a second round combines them to determine the overall structural alignment

11 of 59

TM-ALIGN

  • TM-score weights smaller distance errors more strongly than larger distance errors, which makes the score more sensitive to global fold similarity than to local structural variations
  • TM-score introduces a length-dependent scale to normalize the distance errors and makes the magnitude of TM-score length-independent for random structure pairs
  • TM-score has the value in (0,1], where 1 indicates a perfect match between two structures
  • Following strict statistics of structures in the PDB, scores below 0.17 correspond to randomly chosen unrelated proteins whereas structures with a score higher than 0.5 assume generally the same fold
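The length-dependent scale mentioned above is usually written d0 = 1.24 (L - 15)^(1/3) - 1.8, and the score as (1/L) Σ 1 / (1 + (d_i / d0)²) over aligned residue pairs. The sketch below evaluates that sum for one fixed superposition; the full TM-score takes the maximum over superpositions, which is not attempted here. The function name, argument names, and the 0.5 Å lower bound on d0 for very short chains are assumptions made for this illustration.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the target (normalizing) protein.
    """
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # assumed lower bound for very short chains
    # Unaligned residues contribute nothing, so the sum runs over pairs only.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect match of all 100 residues scores 1; matching only half scores 0.5.
print(tm_score([0.0] * 100, 100))
print(tm_score([0.0] * 50, 100))
```

Because each term is bounded by 1 and decays with the squared scaled distance, a few badly placed residues lower the score only slightly, unlike RMSD.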

12 of 59

FROM SEQUENCE TO STRUCTURE

  • The sequence reveals structural information only roughly
  • Tertiary (3D) structure is the difficult part to predict

13 of 59

SECONDARY STRUCTURE ASSIGNMENT - DSSP

  • DSSP (hydrogen bond estimation algorithm)
    • Reads the positions of the atoms in a protein, then calculates the H-bond energy between backbone C=O and N-H groups.
    • Calculates the optimal hydrogen positions by placing each hydrogen 1.000 Å from the backbone N, in the direction opposite to the backbone C=O bond.
    • The best two H-bonds for each atom are then used to determine the most likely class of secondary structure for each residue in the protein.
  • STRIDE (Structural identification)
    • Also consider dihedral angle potentials
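The DSSP hydrogen-bond estimate is commonly given as an electrostatic energy E = 0.084 × 332 × (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN) kcal/mol, with a bond assigned when E is below -0.5 kcal/mol. The sketch below follows that formula and the hydrogen placement rule from the slide; the function names and the example coordinates are hypothetical, and real DSSP adds further checks that are omitted here.

```python
import math

COUPLING = 0.084 * 332.0  # q1 * q2 * f, in kcal/mol times angstrom
HBOND_CUTOFF = -0.5       # kcal/mol

def place_hydrogen(n, prev_c, prev_o):
    """Put the amide H 1.0 angstrom from N, opposite to the preceding C=O bond."""
    length = math.dist(prev_c, prev_o)
    return tuple(n_k + (c_k - o_k) / length
                 for n_k, c_k, o_k in zip(n, prev_c, prev_o))

def hbond_energy(c, o, n, h):
    """Electrostatic energy between an acceptor C=O and a donor N-H group."""
    return COUPLING * (1.0 / math.dist(o, n) + 1.0 / math.dist(c, h)
                       - 1.0 / math.dist(o, h) - 1.0 / math.dist(c, n))

# Made-up linear geometry: C=O ... H-N with an O...H distance of 1.9 angstroms.
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
energy = hbond_energy(c, o, n, h)
print(energy, energy < HBOND_CUTOFF)
```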

14 of 59

SECONDARY STRUCTURE ASSIGNMENT - STRIDE

  • Builds on the hydrogen bond analysis used by DSSP.
  • Also includes dihedral angle potentials.
  • The common secondary structure assignment methods are believed to underpredict pi helices.

15 of 59

TERTIARY STRUCTURE PREDICTION

  • The two main problems are the calculation of protein free energy and finding the global minimum of this energy.
  • Homology modeling and fold recognition methods
    • The target structure should be close to the experimentally determined structure of another homologous protein.

16 of 59

ROSETTA

  • Rosetta is a widely used software suite for protein structure prediction
    • Domain parsing, or domain boundary prediction
    • As with the rest of tertiary structure prediction, this can be done comparatively from known structures
    • Domain assembly

17 of 59

https://robetta.bakerlab.org/

18 of 59

19 of 59

MORE ROSETTA RELATED TOOLS

https://rosie.rosettacommons.org/

20 of 59

AB INITIO/DE NOVO PROTEIN STRUCTURE PREDICTION

  • Build three-dimensional protein models "from scratch"
  • Energy- and fragment-based methods
    • Require vast computational resources and have so far been carried out only for small proteins
  • Evolutionary covariation to predict 3D contacts
    • When single residue mutations are slightly deleterious, compensatory mutations may occur to restabilize residue-residue interactions
    • Calculate correlated mutations from protein sequences
    • Can be run on a standard personal computer even for proteins with hundreds of residues
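One simple way to score correlated mutations is the mutual information between two columns of a multiple sequence alignment. The sketch below is only a stand-in for the covariation idea: practical contact predictors use more elaborate statistics to separate direct from indirect couplings. The function name and the four-sequence toy alignment are invented for the example.

```python
import math
from collections import Counter

def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(column_a)
    freq_a = Counter(column_a)
    freq_b = Counter(column_b)
    freq_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions.
        mi += p_ab * math.log2(p_ab * n * n / (freq_a[a] * freq_b[b]))
    return mi

# Toy alignment: positions 0 and 1 change together, position 2 is conserved.
alignment = ["AKL", "AKL", "DEL", "DEL"]
columns = list(zip(*alignment))
print(mutual_information(columns[0], columns[1]))
print(mutual_information(columns[0], columns[2]))
```

Column pairs with high scores are candidates for residues in contact; a fully conserved column carries no covariation signal and scores zero.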

21 of 59

COMPARATIVE PROTEIN MODELING

  • Using previously solved structures as templates
  • Evolutionary covariation can be included
  • Homology modeling
    • Homologous proteins will share very similar structures
    • The relationship between target and template can be discerned through sequence alignment
  • Protein threading
    • Scanning the amino acid sequence of an unknown structure against a database of solved structures

22 of 59

SIDE CHAIN CONFORMATION

  • Low-energy side chain conformations are usually determined on a rigid polypeptide backbone, using a set of discrete side chain conformations known as "rotamers."
  • The methods attempt to identify the set of rotamers that minimize the model's overall energy.

23 of 59

CASP - CRITICAL ASSESSMENT OF TECHNIQUES FOR PROTEIN STRUCTURE PREDICTION

https://predictioncenter.org/

24 of 59

http://ps2v3.life.nctu.edu.tw/

25 of 59

PROTEIN FOLDING PROBLEM

  • Levinthal's paradox is a thought experiment in the theory of protein folding.
  • For example, a polypeptide of 100 residues has 99 peptide bonds and therefore 198 different phi and psi bond angles. If each of these bond angles can be in one of three stable conformations, the protein may misfold into a maximum of 3^198 different conformations (including any possible folding redundancy).
  • Therefore, if a protein were to attain its correctly folded configuration by sequentially sampling all the possible conformations, it would require a time longer than the age of the universe to arrive at its correct native conformation.
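The arithmetic behind the paradox can be checked directly: 3^198 is roughly 3 × 10^94. The sampling rate of 10^13 conformations per second and the 13.8 billion year age of the universe used below are assumptions chosen for illustration, not figures from the slides.

```python
conformations = 3 ** 198            # three states for each of 198 angles
rate_per_second = 1e13              # assumed sampling rate
age_of_universe_s = 13.8e9 * 365.25 * 24 * 3600  # assumed age, in seconds

seconds_needed = conformations / rate_per_second
print(f"conformations: {conformations:.3e}")
print(f"search time / age of universe: {seconds_needed / age_of_universe_s:.3e}")
```

Even at this generous rate, an exhaustive search would exceed the age of the universe by many orders of magnitude, which is why folding cannot proceed by random sequential sampling.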

26 of 59

ALPHAFOLD 1

  • AlphaFold was developed by DeepMind; AlphaFold 1 was launched in 2018.
  • The figures often quoted for training were reported for the 2020 version (AlphaFold 2): over 170,000 proteins from a public repository of protein sequences and structures.
  • That training was conducted on processing power between 100 and 200 GPUs.
  • Training the system on this hardware took "a few weeks", after which the program would take "a matter of days" to converge for each structure.

27 of 59

ALPHAFOLD 1

  • Central to AlphaFold is a distance map predictor implemented as a very deep residual neural network with 220 residual blocks processing a representation of dimensionality 64×64×128, corresponding to input features calculated from two 64 amino acid fragments.
  • Each residual block has three layers including a 3×3 dilated convolutional layer – the blocks cycle through dilation of values 1, 2, 4, and 8. In total the model has 21 million parameters.
  • The network uses a combination of 1D and 2D inputs, including evolutionary profiles from different sources and co-evolution features. Alongside a distance map in the form of a very finely-grained histogram of distances, AlphaFold predicts Φ and Ψ angles for each residue which are used to create the initial predicted 3D structure.
  • The AlphaFold authors concluded that the depth of the model, its large crop size, the large training set of roughly 29,000 proteins, modern Deep Learning techniques, and the richness of information from the predicted histogram of distances helped AlphaFold achieve a high contact map prediction precision.

28 of 59

ALPHAFOLD 2

  • AlphaFold2 was launched in 2020.
  • Training Data: AlphaFold 1 was trained on roughly 29,000 proteins, while AlphaFold 2 was trained on a much larger set of over 170,000 protein structures from the Protein Data Bank.
  • Model Architecture: AlphaFold 2 uses a more advanced neural network architecture than AlphaFold 1. Specifically, AlphaFold 2 uses a transformer neural network, which is a type of neural network that has been shown to be highly effective for natural language processing tasks. This architecture is more powerful and flexible than the convolutional neural network (CNN) architecture used in AlphaFold 1.

29 of 59

ALPHAFOLD 2

  • Integration of Biological Knowledge: AlphaFold 2 incorporates biological knowledge about protein structure into its predictions. For example, the model takes into account the fact that certain amino acid sequences are more likely to form certain types of structures than others. This biological knowledge was not explicitly incorporated into AlphaFold 1.
  • Accuracy: AlphaFold 2 is significantly more accurate than AlphaFold 1. In the 2020 CASP14 protein structure prediction competition, AlphaFold 2 outperformed all other methods by a large margin, achieving a median backbone accuracy of 0.96 angstroms RMSD (at 95% residue coverage) for the predicted protein structures.

30 of 59

31 of 59

PROTEIN DYNAMICS

  • The study of transitions between different protein states and of protein motions
    • Kinetics
    • Thermodynamics

https://youtu.be/3TxvBt4VFnE

32 of 59

LOCAL DYNAMICS

  • Atom and residue level
  • NMR can be used to observe such motions
  • The motions involve chemical bonds: vibration, fluctuation, and rotation

33 of 59

REGIONAL DYNAMICS

  • Intra-domain and multiple-residue coupling level
  • The final folded protein structure is made by numerous contacts between residues
  • The energy is contributed by intra-domain links that form secondary structure, such as hydrogen bonds, ionic bonds, and van der Waals interactions
  • Ligand-protein interaction is also part of these dynamics

34 of 59

35 of 59

GLOBAL DYNAMICS

  • Inter-domain or multi-domain level
  • These dynamics are related to protein function
  • Hinge and door motions

36 of 59

MOLECULAR DYNAMICS (MD)

  • Originally developed in the early 1950s
  • MD is a computer simulation method for analyzing the physical movements of atoms and molecules
  • Calculate the trajectories of atoms and molecules based on the force fields and potential energy
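As a minimal sketch of how a trajectory is calculated from forces, the example below integrates Newton's second law, F = -∂U/∂q = ma, with the velocity Verlet scheme for a single particle in a one-dimensional harmonic potential U = ½kq². The potential, the parameter values, and the function names are assumptions chosen to keep the example small; production MD codes apply the same idea to many atoms with an empirical force field.

```python
import math

def velocity_verlet(q, v, force, mass, dt, n_steps):
    """Integrate one coordinate q with velocity v; returns (positions, final v)."""
    positions = [q]
    a = force(q) / mass
    for _ in range(n_steps):
        q = q + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(q) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with mean acceleration
        a = a_new
        positions.append(q)
    return positions, v

k, mass = 1.0, 1.0                 # hypothetical spring constant and mass
harmonic_force = lambda q: -k * q  # F = -dU/dq for U = 0.5 * k * q**2

positions, v_final = velocity_verlet(1.0, 0.0, harmonic_force, mass, 0.01, 1000)
energy = 0.5 * mass * v_final ** 2 + 0.5 * k * positions[-1] ** 2
print(positions[-1], math.cos(10.0), energy)
```

The final position can be compared with the analytic solution cos(t), and the total energy should stay close to its initial value of 0.5, which is a common sanity check for an integrator and time step.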

37 of 59

38 of 59

WIDELY USED TOOL

  • GROMACS
  • Download - https://www.gromacs.org/
  • Tutorial - http://www.mdtutorials.com/gmx/

39 of 59

B FACTOR

  • Debye–Waller factor or temperature factor

Figure: small fluctuation (rigid) versus large fluctuation (flexible)

Δri is the fluctuation of the atom i around its equilibrium position
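The standard relation between the isotropic B factor and this fluctuation is sketched below, where B_i is the B factor of atom i and the angle brackets denote an average over time or over the crystal; this notation is an assumption, not taken from the slide.

```latex
B_i = \frac{8\pi^2}{3}\,\left\langle \Delta r_i^{2} \right\rangle
```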

40 of 59

Solving the equations embodied in Newton’s second law

(-∂U/∂q = F = ma)

Empirical Force Field

Levitt Nature (2001)

Figure: atom trajectories, from rigid to flexible

41 of 59

  • 30 proteins
  • Four different force fields (OPLS, CHARMM, AMBER, and GROMOS)
  • 10-ns simulations
  • ~50 years of CPU time
    • MareNostrum, a supercomputer with 4,524 64-bit Myrinet-connected processors

1CZT

Rueda et al. PNAS (2007)

42 of 59

Sequence

Dynamics

Structure

Function

43 of 59

PROTEIN FIXED POINT MODEL (PFP)

residue i

ri

44 of 59

1PD3:A

1U0S:A

1VJH:A

1MIJ:A

1F35:A

1WUB:A

The PFP model

The X-ray B factors

45 of 59

Figure: the X-ray B factors compared with the PFP model

Phosphorylated phytase

PDB ID: 1QWO

46 of 59

WEIGHTED CONTACT NUMBER (WCN) MODEL

rij

residue i

Packing density
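The WCN model is usually defined as the sum of inverse squared distances from residue i to every other residue, WCN_i = Σ_{j≠i} 1 / r_ij², so that densely packed residues score high. The sketch below assumes one representative position per residue (for example the Cα atom); the function name and the toy coordinates are hypothetical.

```python
import math

def weighted_contact_numbers(coords):
    """Return WCN_i = sum over j != i of 1 / r_ij**2 for each position."""
    wcn = []
    for i, r_i in enumerate(coords):
        total = 0.0
        for j, r_j in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(r_i, r_j) ** 2
        wcn.append(total)
    return wcn

# Three collinear toy positions 1 angstrom apart: the middle one is most packed.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(weighted_contact_numbers(toy_coords))
```

A higher WCN indicates a more packed environment, so the reciprocal of WCN is the quantity typically compared with X-ray B factors.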

47 of 59

Figure: the X-ray B factors compared with the WCN model

DFPase

PDB ID: 1E1A

Rigid regions are more packed; flexible regions are less packed

48 of 59

CA ATOM

All atoms

Cα atoms

Backbone atoms

49 of 59

SEQUENCE CONSERVATION

50 of 59

CONSERVATION – CATALYTIC RESIDUE

Conservation

Proportion

570 enzymes selected from Catalytic Site Atlas 2.2.11

Sequence identities < 25%

1,634 catalytic residues

Catalytic residues

All residues

51 of 59

The WCN model

Conservations

Catalytic residues

Phosphofructokinase

PDB ID: 1KZH

Ornithine decarboxylase

PDB ID: 1ORD

52 of 59

PNAS 104, 796 (2007)

53 of 59

RCSB PDB DATABASE

54 of 59

PDB FORMAT

  • HEADER, TITLE and AUTHOR
  • REMARK
  • SEQRES
    • Sequence of the peptide chain
  • ATOM
    • Coordinates of the atoms
  • HETATM
    • Coordinates of hetero-atoms, that is, atoms that are not part of the protein molecule.

55 of 59

ATOM COLUMNS

Atom index

Atom name

Residue name

Chain

Residue index

X, Y, Z coordinates

Occupancy

Temperature factor

Element name
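Because PDB ATOM records are fixed-width, the fields listed above can be read by slicing each line at known column positions. The sketch below uses the column ranges of the standard PDB format; the function name, the dictionary keys, and the example record are made up for illustration.

```python
def parse_atom_record(line):
    """Split one fixed-width ATOM/HETATM line into the fields listed above."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "serial": int(line[6:11]),          # columns 7-11: atom index
        "name": line[12:16].strip(),        # columns 13-16: atom name
        "res_name": line[17:20].strip(),    # columns 18-20: residue name
        "chain": line[21],                  # column 22: chain
        "res_seq": int(line[22:26]),        # columns 23-26: residue index
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66: temperature factor
        "element": line[76:78].strip(),     # columns 77-78: element name
    }

# Build a made-up record field by field so the column widths are explicit.
example = (
    "ATOM".ljust(6) + f"{1:>5d}" + " " + f"{' CA':<4s}" + " " + "MET" + " " + "A"
    + f"{1:>4d}" + " " * 4 + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    + f"{1.00:6.2f}{9.67:6.2f}" + " " * 10 + f"{'C':>2s}"
)
record = parse_atom_record(example)
print(record)
```

Slicing by column is preferable to splitting on whitespace, because adjacent fields in real files can run together without a separating space.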

56 of 59

OCCUPANCY

57 of 59

X-RAY

  • Advantages
    • 2D view that gives an indication of the three-dimensional structure of a material
    • Relatively cheap and simple
    • Useful for large structures: Not limited by size or atomic weight
    • High atomic resolution
  • Disadvantages
    • The sample must be crystallizable
    • Sample types are limited. In particular, membrane proteins and large molecules are difficult to crystallize, due to their large molecular weight and relatively poor solubility
    • An organized single crystal must be obtained to produce the desired diffraction
    • Non-dynamic method due to preparation of samples and crystallization
    • Limited applications

58 of 59

NMR

  • Advantages
    • Dynamic technique
    • Non-destructive and non-invasive
    • Three-dimensional structures in their natural state can be measured directly in solution
    • Can provide unique insights into dynamics and intramolecular interactions
    • Macromolecular three-dimensional structures can be resolved down to the sub-nanometer level
  • Disadvantages
    • The application of NMR to large biomolecules is limited by the complexity and difficulty of interpreting data for molecules with large molecular weight
    • Large amounts of pure sample are needed to achieve an acceptable signal-to-noise level
    • Highly sensitive to motion, which can lead to signal distortions and artifacts
    • The high-magnetic field can cause problems with other equipment in a laboratory. Therefore, extra precautions may need to be taken, especially if working space is limited

59 of 59

NMR CONTAINS MULTIPLE MODELS

No temperature factor

Model number -> movement