1 of 47

Protein Structure Analysis

Atul Nag

Kalinga Institute of Social Sciences

2 of 47

Outline

  • Basic concepts.

  • How are protein structures determined?
    • X-ray crystallography.
    • NMR spectroscopy.

  • Protein structure databases (PDB, MMDB).

  • Protein structure visualization (RasMol, Cn3D, etc).

  • Protein structure classification (SCOP and CATH).

3 of 47

Structural Bioinformatics

  • A subdiscipline of bioinformatics that focuses on the representation, storage, visualization, prediction and evaluation of structural information.

  • References:
    • Baxevanis and Ouellette. 2005. Bioinformatics - A practical guide to the analysis of genes and proteins. 3rd edition. Chapter 9 and part of chapter 8.

    • Pevsner. 2003. Bioinformatics and functional genomics. Chapter 9.

    • Bourne and Weissig. 2003. Structural bioinformatics.

4 of 47

Protein Primary Structures

  • Amino acid sequence of a polypeptide chain.

  • 20 amino acids, each with a different side chain (R).

  • Peptide units are building blocks of protein structures.

  • The angle of rotation around the N−Cα bond is called phi (φ), and the angle around the Cα−C′ bond from the same Cα atom is called psi (ψ).
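
A minimal sketch of how φ and ψ can be computed from backbone atom coordinates, using the standard signed torsion angle of four points. The function names (`dihedral`, `phi`, `psi`) and the idea of passing coordinates as plain tuples are assumptions made for illustration; they are not taken from any particular package.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi(c_prev, n, ca, c):
    """phi: rotation about N-CA, defined by C'(i-1), N(i), CA(i), C'(i)."""
    return dihedral(c_prev, n, ca, c)

def psi(n, ca, c, n_next):
    """psi: rotation about CA-C', defined by N(i), CA(i), C'(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)

# Four made-up points with a quarter-turn twist about the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing (φ, ψ) for every residue of a structure gives the points of a Ramachandran plot.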

5 of 47

Protein Secondary Structures

  • Local substructures as a result of hydrogen bond formation between neighboring amino acids (backbone interactions).

  • The amino acid side chains affect secondary structure formation.

  • Types of secondary structures:
    • α helix,
    • β sheet,
    • Loop or random coil.

6 of 47

α Helix

  • Most abundant secondary structure.

  • 3.6 amino acids per turn; a hydrogen bond forms between the backbone C=O of residue i and the backbone N−H of residue i+4.

  • Often found on the surface of proteins.
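
The helix numbers above can be turned into a short worked calculation. This is only a sketch: the rise of 1.5 Å per residue is the usual textbook value and is an assumption here (it is not on the slide), and `helix_hbond_pairs` is a hypothetical helper name.

```python
RESIDUES_PER_TURN = 3.6      # from the slide
RISE_PER_RESIDUE = 1.5       # angstroms; textbook value, assumed here

# Rotation about the helix axis contributed by each residue (degrees).
rotation_per_residue = 360.0 / RESIDUES_PER_TURN

# Pitch: distance travelled along the axis per full turn (angstroms).
pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE

def helix_hbond_pairs(n_residues):
    """Residue pairs (i, i+4): C=O of residue i bonds to N-H of residue i+4."""
    return [(i, i + 4) for i in range(1, n_residues - 3)]

print(round(rotation_per_residue, 1), round(pitch, 1))
print(helix_hbond_pairs(8))
```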

7 of 47

β Sheet

  • Hydrogen bonds formed between adjacent polypeptide chains.
  • The chain directions can be same (parallel sheet), opposite (anti-parallel), or mixed.

8 of 47

Loop or Coil

  • Regions between α helices and β sheets.

  • Various lengths and 3-D configurations.

  • Often functionally significant (e.g., part of an active site).

9 of 47

Protein Tertiary Structure

  • The 3-D structure of a protein is assembled from different secondary structure components.
  • Tertiary structure is determined primarily by hydrophobic interactions between side chains.
  • Different classes of protein structures:

10 of 47

Protein Tertiary Structure

  • Fold: a certain type of 3-D arrangement of secondary structures.
  • Protein structures evolve more slowly than primary amino acid sequences.

11 of 47

Protein Quaternary Structure

  • Two or more independent tertiary structures are assembled into a larger protein complex.
  • Important for understanding protein-protein interactions.

12 of 47

Biological Knowledge from Structures

13 of 47

X-Ray Crystallography

  • Basic steps:
    • Expression and purification of proteins from gene targets
    • Protein Crystallization
    • X-ray diffraction
    • Structure Solution
  • Advantages:
    • High-resolution structures.
    • Large protein complexes or membrane proteins.

  • Disadvantages:
    • Molecules in a solid-state (crystal) environment.
    • Requirement for crystals.

14 of 47

Nuclear Magnetic Resonance (NMR)

  • NMR reveals the neighborhood information of atoms in a molecule, and the information can be used to construct a 3-D model of the molecule.

  • Advantages:
    • No requirement for crystals.
    • Proteins in a liquid state (near physiological state).

  • Disadvantages:
    • Limited by molecule size (up to about 30 kDa).
    • Membrane proteins may not be studied.
    • Inherently less precise than X-ray crystallography.

15 of 47

Protein Data Bank (PDB)

  • The primary repository for protein structures.
  • Established in 1971 (the first bioinformatics database, set up with 7 protein structures).
  • Contained 196,779 experimental structures and 1,000,361 Computed Structure Models (CSMs) as of Oct 20, 2022.
  • Supports services for structure submission, search, retrieval, and visualization.
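
Structures retrieved from PDB are commonly distributed as fixed-column text records. The sketch below reads Cα coordinates from `ATOM` lines using the column positions of the legacy PDB format. The helper names (`format_atom`, `parse_atom`, `ca_coordinates`) are hypothetical, and the two sample atoms use made-up coordinates; the sample lines are generated by `format_atom` only so that the columns are guaranteed to line up.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal ATOM line (columns 1-54 of the legacy PDB format)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {res_name:>3s} {chain}"
            f"{res_seq:>4d}    {x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atom(line):
    """Slice one ATOM line by column (0-based slices of 1-based columns)."""
    return {
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "xyz": (float(line[30:38]),       # columns 31-38: x
                float(line[38:46]),       # columns 39-46: y
                float(line[46:54])),      # columns 47-54: z
    }

def ca_coordinates(lines):
    """Return the coordinates of all C-alpha atoms in the given lines."""
    atoms = [parse_atom(l) for l in lines if l.startswith("ATOM")]
    return [a["xyz"] for a in atoms if a["name"] == "CA"]

# Illustrative records only; not taken from a real PDB entry.
sample = [
    format_atom(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
]
print(ca_coordinates(sample))
```

For real work, an established parser is a safer choice than hand-written slicing, since actual entries contain alternate locations, insertion codes and HETATM records that this sketch ignores.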

16 of 47

Growth of PDB

17 of 47

18 of 47

Access to Structures through NCBI

MMDB (Molecular Modeling Database):

    • Structures obtained from PDB.
    • Data in NCBI’s ASN.1 format.
    • Integrated into NCBI’s Search.

Cn3D (“see in 3D”): NCBI’s 3-D protein structure viewer.

VAST (Vector Alignment Search Tool): for direct comparison of 3-D protein structures.

19 of 47

Ramachandran Plot

  • Used to assess the quality of structures.
  • Good structures show tight clustering of (φ, ψ) angles in the allowed regions of the plot.

20 of 47

3D Visualization Tool - PyMol

https://pymol.org/2/

21 of 47

Cn3D: NCBI’s Structure Viewer

  • Cn3D (“see in 3D”): allows interactive exploration of 3-D structures, sequences and alignments.
  • Can be used to produce high-quality molecular images.
  • Cn3D is available at https://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml.

22 of 47

Other 3-D Visualization Tools

  • Chime: a Netscape plug-in for 3-D structure visualization; based on RasMol source code.

  • Protein Explorer (http://www.proteinexplorer.org/):
    • A Chime-based software package.
    • Particularly user friendly and feature-rich.

  • Swiss-Pdb Viewer (Deep View, available at http://us.expasy.org/spdbv/):
    • Probably the most powerful, freely available molecular modeling and visualization package.
    • Supports homology modeling, site-directed mutagenesis, structure superposition, etc.

23 of 47

Protein Structure Comparison

  • Why is structure comparison important?
    • To understand structure-function relationship.
    • To study the evolution of many key proteins (structure is more conserved than sequence).

  • Comparing 3-D structures is much more difficult than sequence comparison.

  • Protein structure classification:
    • SCOP: Structure Classification Of Proteins.
    • CATH: Class, Architecture, Topology and Homology.

  • Protein structure alignment: DALI and VAST.

24 of 47

SCOP

  • SCOP is based on the expert definition of protein structural similarities and is manually curated.
  • Classification hierarchy
    • Class → Fold → Superfamily → Family
  • SCOP has 7 major classes: all α, all β, α/β, α+β, multi-domain proteins (α and β), membrane and cell surface proteins, and small proteins.
  • Domain is the base unit of the SCOP hierarchy, and proteins with multiple domains may appear at different places in the hierarchy.
  • SCOP at https://scop.mrc-lmb.cam.ac.uk/.

25 of 47

An Example of the SCOP Hierarchy

  • SCOP fold definition:
    • Same major secondary structures.
    • Same arrangement.
    • Same topology.

26 of 47

CATH

  • Classification hierarchy:
    • Class (C) → Architecture (A) → Topology (T) → Homologous superfamily (H)
  • Based on secondary structure content (for C), literature (for A), structure connectivity and general shape (for T, using the SSAP algorithm), and sequence similarity (for H).
  • Multi-domain proteins are partitioned into their constituent domains before classification.
  • CATH at http://cathdb.info/.

27 of 47

An Example of the CATH Hierarchy

  • CATH classes:
    • mainly α.
    • mainly β.
    • mixed α and β.
    • Few secondary structures.

29 of 47

Protein Structure Alignment

  • Positions of atoms in two or more 3-D protein structures are compared.

  • Must first determine which atoms to align. At least two sets of three common reference points should be identified.

  • Atoms in structures are matched to minimize the average deviation.

  • Finding the best 3-D structural alignment is computationally hard (an NP-hard problem), so practical tools rely on heuristics.
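
The "average deviation" that alignment tries to minimize is usually reported as the root mean square deviation (RMSD) over matched atoms. A minimal sketch follows, assuming the two coordinate lists are already matched one-to-one and already superposed; finding the optimal superposition is a separate step that is not shown. The function name `rmsd` and the toy coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of matched (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: every atom of the second set is displaced by 1 angstrom along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))  # 1.0
```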

30 of 47

How to Compare Structures?

31 of 47

DALI

  • DALI is for Distance matrix ALIgnment.
  • Each structure is represented as a two-dimensional array (matrix) of distances between all pairs of Cα atoms.
    • Remember what a Cα atom is?
  • Assume that similar 3-D structures have similar inter-residue distances.
  • DALI uses distance matrices to align protein structures.
  • DALI is available at http://ekhidna.biocenter.helsinki.fi/dali/.
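
A minimal sketch of the representation described above: a matrix of distances between all pairs of Cα atoms. It also shows why the representation is convenient, since the matrix does not change when the whole structure is translated, so no prior superposition is needed. This illustrates only the data structure, not DALI's alignment algorithm; the coordinates are made up.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha coordinates."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

# Three made-up C-alpha positions, then the same positions shifted in space.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in coords]

dm = distance_matrix(coords)
dm_shifted = distance_matrix(shifted)

# The two matrices agree up to floating-point rounding.
same = all(abs(p - q) < 1e-9
           for row_p, row_q in zip(dm, dm_shifted)
           for p, q in zip(row_p, row_q))
print(same)  # True
```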

32 of 47

VAST

  • VAST is for Vector Alignment Search Tool.

  • Each structure is represented as a set of secondary structure elements (SSEs).
    • SSEs: α helices or β strands.

  • VAST scores pairs of SSEs based on their type, orientation and connectivity.

  • The SSE matches of statistical significance are then extended (similar to BLAST).

  • Structures in MMDB have been pre-computed, and organized as structure neighbors in Entrez.
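
To make the idea of comparing SSEs by orientation concrete, the sketch below treats each SSE as a vector from its start point to its end point and measures the angle between two such vectors. This is not VAST's scoring function; the names `sse_vector` and `angle_between` and the coordinates are assumptions for illustration.

```python
import math

def sse_vector(start, end):
    """Vector along a secondary structure element, from start to end."""
    return tuple(e - s for s, e in zip(start, end))

def angle_between(u, v):
    """Angle in degrees between two non-zero 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cos_angle))

# A helix running along z and a strand running along x (made-up endpoints).
helix = sse_vector((0.0, 0.0, 0.0), (0.0, 0.0, 10.0))
strand = sse_vector((2.0, 0.0, 0.0), (12.0, 0.0, 0.0))
print(angle_between(helix, strand))
```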

33 of 47

Secondary Structure Prediction

  • Given the sequence of a polypeptide, secondary structures are predicted.

  • Assume that secondary structures are fully determined by local interactions among neighboring residues.

  • Early analyses were based on the frequencies of amino acids found in different types of secondary structures.
    • For example, proline is common at turns but rare within α helices.
  • Modern approaches use machine learning techniques and multiple sequence alignments.
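
The frequency-based idea can be sketched as a propensity calculation: how often an amino acid is found in a helix, relative to how often any residue is found in a helix. Values above 1 suggest a helix former, values below 1 a helix breaker. The sequences and labels below are a toy data set invented for illustration ('H' = helix, 'C' = coil), and `helix_propensity` is a hypothetical name.

```python
from collections import Counter

def helix_propensity(sequences, labels):
    """Per-residue propensity: P(helix | amino acid) / P(helix)."""
    total, in_helix = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        for aa, state in zip(seq, lab):
            total[aa] += 1
            if state == "H":
                in_helix[aa] += 1
    frac_helix = sum(in_helix.values()) / sum(total.values())
    if frac_helix == 0:
        raise ValueError("no helical residues in the data set")
    return {aa: (in_helix[aa] / total[aa]) / frac_helix for aa in total}

# Toy training data; not derived from real structures.
toy_sequences = ["AEALPG", "AELGPA"]
toy_labels = ["HHHHCC", "HHHCCC"]

propensity = helix_propensity(toy_sequences, toy_labels)
for aa in sorted(propensity, key=propensity.get, reverse=True):
    print(aa, round(propensity[aa], 2))
```

In this toy set proline never appears in a helix, so its propensity is 0, which mirrors the proline example above.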

34 of 47

Machine Learning Approach

35 of 47

PHDsec

  • For a given protein sequence:
    • Search for homologous sequences.
    • Produce a multiple sequence alignment.
    • Generate a profile (evolutionary information).

  • PHDsec uses a feed-forward artificial neural network to predict the secondary structures.
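
A minimal sketch of the "generate a profile" step: per-column amino acid frequencies computed from a multiple sequence alignment. The alignment is a toy example, gaps ('-') are simply ignored, and `column_profile` is a hypothetical name; real profiles typically also apply sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter

def column_profile(alignment):
    """Return one {amino acid: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        n = sum(counts.values())
        profile.append({aa: c / n for aa, c in counts.items()} if n else {})
    return profile

# Toy alignment of four homologous sequences.
toy_msa = ["MKVL", "MRVL", "MKIL", "M-VL"]
toy_profile = column_profile(toy_msa)
for position, column in enumerate(toy_profile, start=1):
    print(position, column)
```

Conserved columns (here positions 1 and 4) have a single dominant residue, while variable columns spread their frequency over several residues; this is the evolutionary information passed to the network.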

36 of 47

PSIPRED

  • For a given protein sequence:
    • Perform a PSI-BLAST search.
    • Create a profile that conveys the evolutionary information at each position.
    • Feed the profile into a system of neural networks (or support vector machines).
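
Profile-based predictors usually classify each residue from a window of profile columns centred on it. The sketch below shows only that windowing step. The default width of 15 is the value commonly quoted for PSIPRED's first network and should be treated as an assumption here; `sliding_windows` is a hypothetical name, and the classifier itself is not shown.

```python
def sliding_windows(profile, width=15, pad=None):
    """Return one window of `width` columns centred on each position.

    Positions beyond either end of the sequence are filled with `pad`.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a centre")
    half = width // 2
    padded = [pad] * half + list(profile) + [pad] * half
    return [padded[i:i + width] for i in range(len(profile))]

# Integers stand in for profile columns to keep the example readable.
windows = sliding_windows([0, 1, 2, 3, 4], width=3)
print(windows)
```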

37 of 47

Prediction of 3-D Protein Structures

  • There are about 200,000 structures in PDB, but more than 1.8 million non-redundant protein sequences in UniProt (Swiss-Prot + TrEMBL).

  • Computational structure prediction may provide valuable information for most of the protein sequences derived from genome sequencing projects.

  • Three predictive methods:
    • Homology (or comparative) modeling.
    • Threading (or fold recognition).
    • Ab initio structure prediction.

38 of 47

Sequence - Structure Relationship

  • In cells, protein folding is determined by the amino acid sequence. However, protein structures can also be affected by post-translational modifications and the cellular environment.

  • Proteins with ≥ 30% sequence identity tend to have similar structures. However, exceptions do exist …

(Figure: an 80-residue stretch, shown in yellow, with 40% sequence identity.)

39 of 47

Homology Modeling

Probably the most accurate method for protein structure prediction.

Five different steps:

    • Find a known structure related to the query sequence by sequence comparison.
    • Align the query sequence with the known structure (template).
    • Build a model by modifying the backbone and side chains of the template.
    • Refine the model using energy minimization.
    • Validate the model using visual inspection or software tools.

40 of 47

Homology Modeling

Accuracy of structure prediction depends on the percent amino acid sequence identity shared between the query and template.

For >50% sequence identity, the RMSD (Root Mean Square Deviation) is only about 1 Å for main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure.

Homology modeling is generally not reliable for predicting protein structures if the sequence identity is less than 30%.
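
Because the accuracy thresholds above are expressed in percent sequence identity, a small sketch of that calculation may help. Identity is computed here over aligned, non-gap positions; other definitions (for example, dividing by the length of the shorter sequence) give different numbers. The function names and the description of the 30% to 50% range are assumptions for illustration, and the aligned pair is a toy example.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over positions where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue
        aligned += 1
        if q == t:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0

def modeling_outlook(identity):
    """Map identity to the rough expectations stated on the slide."""
    if identity > 50:
        return "main-chain RMSD of about 1 angstrom expected"
    if identity >= 30:
        return "homology modeling feasible, with lower accuracy"
    return "below 30%: consider threading or ab initio prediction"

identity = percent_identity("MKV-LA", "MRVQLA")
print(identity, "->", modeling_outlook(identity))
```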

41 of 47

Threading

42 of 47

Threading

Threading takes a query sequence and passes (threads) it through the 3-D structure of each protein in a fold database (known structures).

As a sequence is threaded, the fit of the sequence in the fold is evaluated using some functions of energy or packing efficiency.

Threading may find a common fold for proteins with essentially no detectable sequence similarity.

Structures predicted from threading techniques often are not of high quality (RMSD > 3 Å).
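
A toy sketch of the threading idea: slide a query sequence along each template in a small fold library and score how well its residues fit the template's environments. Here each template is reduced to a string of 'B' (buried) and 'E' (exposed) positions, and the score simply rewards hydrophobic residues in buried positions and polar residues in exposed ones. The fold library, the hydrophobic set and the scoring rule are all invented for illustration; real threading programs use energy or packing functions and allow gaps.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simplified, assumed grouping

def fit_score(sequence, environment):
    """+1 when residue type matches the environment, -1 otherwise."""
    score = 0
    for aa, env in zip(sequence, environment):
        is_hydrophobic = aa in HYDROPHOBIC
        is_buried = env == "B"
        score += 1 if is_hydrophobic == is_buried else -1
    return score

def best_fold(sequence, fold_library):
    """Try every ungapped placement; return (fold name, offset, score)."""
    best = None
    for name, env in fold_library.items():
        for offset in range(len(env) - len(sequence) + 1):
            score = fit_score(sequence, env[offset:offset + len(sequence)])
            if best is None or score > best[2]:
                best = (name, offset, score)
    return best

# Hypothetical fold library: burial pattern along each template chain.
toy_library = {"fold_1": "EBBEEBBE", "fold_2": "BEBEBEBE"}
print(best_fold("KLVDS", toy_library))
```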

43 of 47

Ab Initio Structure Prediction

Ab initio prediction can be used when a protein sequence has no detectable homologues in PDB.

Protein folding is modeled based on global free-energy minimization.

Since the protein folding problem has not yet been solved, the ab initio prediction methods are still experimental and can be quite unreliable.

One of the top ab initio prediction methods is Rosetta, which predicted 61% of structures (80 of 131) to within 6.0 Å RMSD.

44 of 47

Comparing Structure Prediction Methods

  • A – C: homology modeling with 60% (A), 40% (B) and 30% (C) sequence identity.
  • D and E: ab initio protein structure prediction.
  • Predicted structures are in red, and actual structures are in blue.

45 of 47

Prediction of Solvent Accessibility

Solvent accessibility: the relative area of a residue’s surface that is exposed to the surrounding solvent.

The solvent-accessible residues may be part of an active site or a binding site, while the buried residues may play an important role in stabilizing the protein structure.

46 of 47

Predicting Transmembrane Segments

Transmembrane segments share common biophysical features (e.g., hydrophobicity).

    • Part of the PredictProtein services.
    • Transmembrane helices are predicted using a neural network system.
    • A set of known transmembrane segments is represented as HMMs.
    • A query sequence is matched to a known transmembrane pattern.
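
The hydrophobicity feature mentioned above can be illustrated with a much simpler technique than the neural network and HMM methods listed: a sliding-window average of the Kyte-Doolittle hydropathy scale. The window of 19 residues and the cutoff of 1.6 are the values commonly quoted for this scale and are assumptions here, as are the function names and the toy sequence.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    return [sum(KD[aa] for aa in sequence[i:i + window]) / window
            for i in range(len(sequence) - window + 1)]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose average exceeds threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value > threshold]

# Toy sequence: a 19-residue hydrophobic stretch flanked by charged residues.
toy_sequence = "K" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(toy_sequence))
```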

47 of 47

Signal Peptide Prediction

Extracellular proteins or proteins targeted to subcellular compartments contain short signal peptides (often at the N-terminus).

PSORT (http://psort.ims.u-tokyo.ac.jp/): A rule-based expert system for predicting subcellular localization of proteins from their amino acid sequences. The k-nearest neighbors algorithm is used for reasoning.

SignalP (http://www.cbs.dtu.dk/services/SignalP/): predicts the presence and location of signal peptide cleavage sites using a combination of neural networks and HMMs.