Structural Alignment with PyMOL and ProMOL
Techniques to Master:
Template based alignments of protein structures in PyMOL/ProMOL
Analysis of alignment results from PyMOL/ProMOL
The Protein Data Bank (PDB) 1 is a database containing all of the experimentally determined and publicly available protein structures, over 100,000 in all! The Protein Structure Initiative (PSI) 2 was a 15-year, nearly one billion dollar project funded by the National Institutes of Health (NIH) with the goal of determining the structure of every known gene product. The PSI generated many protein structures, and a recent search of the PDB identified more than 4,000 structures of “unknown function”.
A. Bioinformatics Tools for Biochemistry Research
Elucidating the functions of proteins is a core component of biochemistry, structural biology, and bioinformatics. Scientists in these disciplines seek to understand the relationship among protein sequence, structure, and function. Software tools have been developed to relate sequence, structure and function and several programs exist that can propose functional annotations for a specific target of interest. Sequence databases can be searched with tools such as BLAST 3 and HMMER 4 to identify sequence homologs. Databases, repositories, and servers such as UniProt 5, Pfam 6, the Structural Biology Knowledgebase 7, Dali 8, and MarkUs 9 collect and display information from various sources that may be used to identify structural homologs and/or give functional insight.
B. ProMOL and PyMOL
We have developed the ProMOL plugin 10 for PyMOL 11, a tool used to explore the catalytic site structural homologies between proteins of known function and those for which functions are not yet known. ProMOL uses template-based alignment of these structures with a motif library that is rapidly expanding to include motifs from the Catalytic Site Atlas 12, Pfam, and other sources. Although catalytic site structural homology alone is not sufficient to define the function of a protein, it provides one mechanism which, when combined with other structural and sequence motifs, can suggest candidates for experimental verification.
Video Link: Introduction to PyMOL
Video Link: Introduction to ProMOL
These tools can be used only to develop a hypothesis for the potential function of a protein, and final confirmation of the function often relies on biochemical techniques, assays, and characterization. It is not sufficient to simply identify structural homologs (proteins with similar structures and identified with tools like Dali and MarkUs) or sequence homologs (proteins with similar sequences that are identified with tools such as BLAST and HMMER).
The goal of this lab is to choose a previously uncharacterized protein structure from the PDB, use currently available bioinformatics tools (ProMOL, BLAST, Pfam, Dali, etc.) to develop a hypothesis about the putative function, to produce your protein, and to develop an enzymatic assay to test your hypothesis (Figure 1).
Figure 1: Outline of proposed workflow to characterize proteins of “unknown function”.
The first step in our function prediction process is to compare a protein of unknown function against a library of enzyme active sites from the Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/) that constitute the motif template library of ProMOL. Each catalytic site motif template typically consists of 2 - 5 amino acid residues that have a fixed spatial and distance relationship. The example shown in Figure 2 is an alignment for a serine protease.
Figure 2: Top: Alignment of PDB entry 1afq (bovine gamma chymotrypsin; the query in red) with a motif template based on 1a0j (in white), a trypsin structure from Atlantic salmon. Bottom: The RMSD values for the alignment and the residues in the alignment can be determined and displayed in the PyMOL GUI. The first RMSD value is for all of the atoms in the three residues in the alignment; the second RMSD value is for the alpha carbons and the third is for the alpha and beta carbons. Three residues from 1afq (His 57, Asp 102 and Ser 195) aligned with the same three residues from 1a0j (His 57, Asp 102 and Ser 195).
Video Link: Structure Query with ProMOL
Video Link: Motif Finder in ProMOL
C. Enzyme Classification
When residues align as nicely as those in Figure 2, it indicates a strong relationship between the structures. In this case, the template molecule is trypsin from Enzyme Commission (EC) class 3.4.21.4. The EC system was established by the International Union of Biochemistry (now the IUBMB), which works with IUPAC (the same folks who systematized the naming of organic compounds) on biochemical nomenclature. There is a nice description of the classification system on the IUBMB page (http://www.chem.qmul.ac.uk/iubmb/). There are six EC classes:
1: Oxidoreductases
2: Transferases
3: Hydrolases
4: Lyases
5: Isomerases
6: Ligases
Each class is divided into three lower levels to group enzymes based on more specific criteria. For example, trypsin is EC class 3.4.21.4 13:
3: Hydrolases
3.4: Acting on peptide bonds
3.4.21: Serine endopeptidases
3.4.21.4: Trypsin
When you analyze structures with ProMOL, you will receive an output of the best “hits” for your structure. You should then look at the EC class for the best hits to find out what ProMOL predicts about the function of your query structure.
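If you want to compare the EC classes of several hits, it can help to break each EC number into its levels. The following is a minimal sketch, not part of ProMOL: the function names `ec_levels` and `shared_levels` are made up for illustration, and the sketch assumes EC numbers written as four dot-separated fields.

```python
def ec_levels(ec_number: str) -> list[str]:
    """Return the prefixes of an EC number, from the top-level class to the full number."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def shared_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading levels two EC numbers have in common."""
    count = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b:
            break
        count += 1
    return count


print(ec_levels("3.4.21.4"))                  # ['3', '3.4', '3.4.21', '3.4.21.4']
print(shared_levels("3.4.21.4", "3.4.21.5"))  # 3: both are serine endopeptidases
```

Two hits that share three levels point to the same kind of chemistry, while hits that differ at the first level belong to entirely different enzyme classes.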
D. ProMOL Deliverables
Following an alignment, the results window in ProMOL displays results for alignments in the following format:
Where 0 is the Levenshtein distance, A is the motif template set (in this case, an automatically generated motif), 1a0j is the PDB ID of the motif template that was used for the alignment and 3_4_21_4 represents the EC class, 3.4.21.4. ProMOL can also provide RMSD values (for goodness of alignment) and molecular representations of alignments as described below. Here is an explanation of some terms that may be new to you.
Levenshtein distance is a measure for comparing strings of text: it is the smallest number of single-character changes (substitutions, insertions, or deletions) needed to turn one string into the other. For example, the Levenshtein distance between the word “form” and the word “foam” is 1: you can change one character (an r to an a) to go from one word to the other. When we compare lists of protein residues, we can use this same approach. These two amino acid lists, HDGS (Histidine-Aspartate-Glycine-Serine) and HDGT (Histidine-Aspartate-Glycine-Threonine), have a Levenshtein distance of 1. The Levenshtein distance can reflect either a substituted amino acid residue or a missing amino acid residue. These two amino acid lists, HDGS (Histidine-Aspartate-Glycine-Serine) and HDS (Histidine-Aspartate-Serine), also have a Levenshtein distance of 1.
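To see how such a distance can be computed, here is a minimal sketch of the standard dynamic-programming method applied to the examples above. It is an illustration only; ProMOL's own implementation may differ, and the function name `levenshtein` is our choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of substitutions, insertions, or deletions turning a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("form", "foam"))  # 1
print(levenshtein("HDGS", "HDGT"))  # 1 (serine replaced by threonine)
print(levenshtein("HDGS", "HDS"))   # 1 (glycine missing)
```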
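RMSD stands for Root-Mean-Square Deviation. It is a measure for comparing the positions of atoms in three-dimensional space. The lower the RMSD, the better the alignment. ProMOL reports three RMSD values: one for all of the atoms in the aligned residues, one for the alpha carbons only, and one for the alpha and beta carbons.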
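The calculation behind an RMSD value can be sketched in a few lines. This example assumes the two structures are already superimposed and that the atoms are listed in matching order. The coordinates are made-up numbers chosen so the answer is easy to check by hand, and the code is not taken from PyMOL or ProMOL.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must contribute the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates (in angstroms): every atom pair is exactly 1 apart.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(query, template))  # 1.0
```

Restricting the input lists to alpha carbons only, or to alpha and beta carbons, gives the same kind of value for a smaller set of atoms.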
The purpose of this lab is for you to become familiar with PyMOL and ProMOL so that you can complete a structural alignment of your protein of unknown function. This will provide an early clue to the function of the protein that you can later refine using other bioinformatics tools (see Figure 1).
Experimental Design Considerations:
As you perform your structural analysis in PyMOL/ProMOL, keep in mind the data you are collecting. Write any interesting findings in your notebook and make sure to collect images (by screen capture) for any interesting results you may want to consider later. The literature citations and web sites that are linked to this document may contain very useful information as you seek to identify the function of your protein. BE SURE TO RECORD YOUR FINDINGS IN YOUR NOTEBOOK. Include all of the details for each alignment: Levenshtein distance, motif template, EC class, RMSD values and a screen capture of the alignment between the query and the motif. You may wish to expand or limit your search using the “Choose a Set” dialog box described below.
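If you also want an electronic copy of your notebook entries, one option is a small table with one row per alignment. The sketch below is only a suggestion: the class `AlignmentRecord`, its field names, and the screenshot file name are hypothetical, and the RMSD fields are left empty to be filled in with the values you read from PyMOL.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class AlignmentRecord:
    """One notebook entry per alignment (field names are illustrative)."""
    query: str
    levenshtein: int
    motif_set: str
    template: str
    ec_class: str
    rmsd_all: Optional[float] = None    # fill in from the PyMOL GUI
    rmsd_ca: Optional[float] = None     # alpha carbons only
    rmsd_ca_cb: Optional[float] = None  # alpha and beta carbons
    screenshot: str = ""                # file name of your screen capture


def to_csv(records) -> str:
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AlignmentRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


entry = AlignmentRecord(query="2hnt", levenshtein=0, motif_set="A",
                        template="1a0j", ec_class="3.4.21.4",
                        screenshot="2hnt_1a0j.png")  # hypothetical file name
print(to_csv([entry]))
```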
This is an in silico exercise, so you will only need access to a computer with PyMOL and ProMOL installed properly.
For most of this lab module, you will be focused closely on a computer screen. You are encouraged to take periodic breaks (at least every 15 minutes) to reduce eye fatigue.
Opening ProMOL
Once you launch PyMOL, you can access ProMOL by clicking on the Plugin menu of the PyMOL GUI (Figure 3). The ProMOL window then appears (Figure 4). It contains a number of tabs.
Figure 3: When you launch PyMOL, you will see two windows - the graphical user interface is shown here. It contains the drop down menus (File, Edit, Build, etc.) and the molecular visualization window. The ProMOL plugin for PyMOL (right side) can be activated from the Plugin dropdown menu.
The ProMOL plugin (Figure 4, left) contains a number of tools that can be used to explore macromolecular structures. In this exercise, we will focus on the Motif Finder (Figure 4, right), which enables users to query a protein structure against a library of enzyme active sites.
Figure 4: Left: This is the opening screen for ProMOL and includes some general information about it. The ProMOL interface contains five tabs at the top and four buttons at the bottom to enhance usability. Right: The Motif Finder Interface includes progress bars for the individual query structure and for the overall alignment process if you are performing queries with more than one structure, the query box on the left, the results box on the right hand side, and some tools for alignment of results (Precision Factor, Show alignment, Calculate RMSD).
Finding a Motif in a Query Protein
The Motif Finder tab gives the user access to the motif search tools in ProMOL. This is one of the major features of ProMOL, which enables you to look for evolutionary relationships between proteins based on the three dimensional relationships between small templates (typically 3-5 amino acids) that form enzyme active sites or binding motifs.
In the Motif Finder interface you can enter multiple PDB IDs in the query box, separating them by commas, if you wish to perform multiple queries at the same time. It typically works best to enter fewer than 25 PDB IDs in the query box. Once you enter a PDB ID in the query box, the Start button becomes active.
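If you are preparing a long list of structures, a short script can check the list before you paste it into the query box. This is a sketch under stated assumptions: it treats a PDB ID as four characters starting with a digit, lower-cases the IDs to match the style used in this handout, and joins them with plain commas. The function name `build_query` is ours and is not part of ProMOL.

```python
import re

# Assumption: classic four-character PDB IDs (a digit followed by three letters or digits).
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
MAX_IDS = 24  # the handout suggests entering fewer than 25 IDs at once


def build_query(ids):
    """Validate a list of PDB IDs and return them as one comma-separated string."""
    cleaned = [pdb_id.strip().lower() for pdb_id in ids if pdb_id.strip()]
    invalid = [pdb_id for pdb_id in cleaned if not PDB_ID.match(pdb_id)]
    if invalid:
        raise ValueError(f"not valid PDB IDs: {invalid}")
    if len(cleaned) > MAX_IDS:
        raise ValueError(f"{len(cleaned)} IDs given; enter at most {MAX_IDS} at a time")
    return ",".join(cleaned)


print(build_query(["2hnt", " 1AFQ "]))  # 2hnt,1afq
```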
Please start the exercise by entering the PDB ID “2hnt” in the ProMOL query box. Once you have gone through this process with this structure of known function (human gamma thrombin), you will conduct this exercise with your protein of unknown function.
Selecting a Template Library
ProMOL contains different sets of templates that are distinguished either by the way they were generated or the types of structures that were considered. When you enter a PDB ID in the query box (see Figure 4) and hit the Start button, the “Choose a Set” dialog box appears (inset, right). Here is a brief description of each of the motif sets listed in the “Choose a Set” dialog box (Figure 5):
Figure 6: Progress Bar and the Results Pane. The progress bars (left image near the top) show how the query is progressing. The top bar is the % completion of the individual query and the bottom bar is the % completion of the query set. In this case the query was only one structure, so the two bars are identical. The results pane (right) lists the alignments that ProMOL found for the query structure.
Figure 7: Active site alignment of 2hnt (thrombin; red) with 1o2u (trypsin; white). The residues in the alignment are His 57 and Asp 102 from 2hnt and His 57 and Asp 102 from 1o2u.
Look at the ProMOL results pane for 2hnt (shown in Figure 6, but you should focus on your own screen). Here is an explanation of the alignment results for the first alignment:
There is a set of tools toward the bottom of the Motif Finder tab (see Figure 6) that will enable you to show the alignment between the query (in red) and the motif template (in white). You can also calculate the RMSD for the alignment. If you check both boxes and then double click on an individual result, the alignment will appear in the PyMOL visualization window and the RMSD values will appear in the PyMOL GUI (Figure 8, top). In general with ProMOL, an alignment with three or more residues and RMSD values below 3 Å is considered a high-quality alignment.
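The rule of thumb in the last sentence can be written as a quick check to apply to each entry in your notes. This is only a sketch of that rule: the function name `is_high_quality` is ours, and the numbers in the example calls are made-up values, not real ProMOL output.

```python
def is_high_quality(n_residues: int, rmsd_values, cutoff: float = 3.0) -> bool:
    """Apply the rule of thumb: three or more residues and every RMSD below the cutoff (in angstroms)."""
    return n_residues >= 3 and all(value < cutoff for value in rmsd_values)


# Made-up values: (all atoms, alpha carbons, alpha and beta carbons)
print(is_high_quality(3, (0.8, 0.4, 0.5)))  # True
print(is_high_quality(2, (0.8, 0.4, 0.5)))  # False: only two residues aligned
print(is_high_quality(3, (3.6, 0.4, 0.5)))  # False: one RMSD is not below 3
```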
Figure 8: Alignment Results with ProMOL/PyMOL. The RMSD values for all atoms, Calpha only and Calpha/Cbeta are shown in the PyMOL GUI at the top. The alignment of 2hnt (query in red) with 1a0j (motif template in white) is shown in the lower image of the molecular visualization window.
If you click on the residues in the alignment, their information appears in the PyMOL GUI window (Figure 9). You can use this approach to identify the residues in your alignments.
Figure 9: PyMOL GUI window identifying residues within your specific alignment. I generated the information shown here by clicking on the aligned residues that appear in the bottom part of Figure 8. The motif template is trypsin, PDB ID 1a0j. The query is thrombin, PDB ID 2hnt.
Review the image in Figure 9 to identify the residues in the alignment shown in Figure 8 (and on your computer). What information is contained in this line?
You clicked /2HNT//C/HIS`57/CD2
Now that you have gone through this exercise with thrombin (PDB ID 2hnt), repeat the exercise with your protein structure with unknown function. When you are finished with this exercise, you will have generated results about
Review these results carefully. Based on the evidence, can you make a prediction about the function of your protein? Are the results consistent or are there conflicts, that is, do you have good alignments with motif templates from more than one EC class?
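One way to check for conflicts is to group your good hits by EC class and see how many groups you get. The sketch below assumes you have typed your hits as (template, EC number) pairs. The third hit is a hypothetical entry with a placeholder template name, included only to show what a conflict looks like, and the function name `group_by_ec` is ours.

```python
from collections import defaultdict


def group_by_ec(hits, level: int = 3):
    """Group template IDs by the first `level` fields of their EC numbers."""
    groups = defaultdict(list)
    for template, ec_number in hits:
        key = ".".join(ec_number.split(".")[:level])
        groups[key].append(template)
    return dict(groups)


hits = [
    ("1a0j", "3.4.21.4"),
    ("1o2u", "3.4.21.4"),
    ("hypothetical_template", "1.1.1.1"),  # placeholder hit, not a real result
]
groups = group_by_ec(hits)
print(groups)
if len(groups) > 1:
    print("Hits fall into more than one EC group; compare their RMSD values closely.")
```

Grouping at level 3 treats all serine endopeptidases as one group; use `level=1` to compare only the top-level EC classes.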
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Appendix: Installation Instructions if you are loading ProMOL on a Macintosh computer. These instructions assume that you already have PyMOL installed.
Before doing anything, install XQuartz (X11 for the Mac): http://xquartz.macosforge.org/landing/
It’s a regular .dmg installer.
In Finder, in Applications Folder, Find MacPyMOL
Rename MacPyMOL to PyMOLX11Hybrid
Right click, show Package Contents
Click Contents → pymol → modules → pmg_tk → startup
In another Finder window, find where you have unpacked the ProMOL tarball
Copy the contents of the Promol-5.4-r419 folder to the startup folder
Within all that stuff you just copied, find a file (not folder) called remote_pdb_load.py
Add an x to the beginning of the filename
Then go to the folder called remote_pdb_load_plugin
Within that folder there should be a file called remote_pdb_load.py
Copy that file up a directory to the startup folder
Now open PyMOL. X11 should open at the same time. If you don’t see the menu bar, click on the X11 icon in the dock and it will appear.
From that menu bar, you can now click Plugin and choose ProMOL