Title:

Structural Alignment with PyMOL and ProMOL

Techniques to Master:

Template-based alignments of protein structures in PyMOL/ProMOL

Analysis of alignment results from PyMOL/ProMOL

Learning Objectives:

  1. Students will describe typical characteristics of an enzyme active site.
  2. Students will understand and be able to explain the term motif and apply it to enzyme active sites.
  3. Students will be able to query structures of unknown function against a library of enzyme active site motif templates.
  4. Students will distinguish between good alignments and poor alignments based on Levenshtein distance, RMSD values, and visual alignments.
  5. Students will propose a function for proteins of unknown function and develop plans for testing their hypothesis in silico and in vitro.

Background: 

The Protein Data Bank (PDB) [1] is a database containing all of the experimentally determined and publicly available protein structures, more than 100,000 of them. The Protein Structure Initiative (PSI) [2] was a 15-year, nearly one-billion-dollar project funded by the National Institutes of Health (NIH) with the goal of determining the structure of every known gene product. The PSI generated a large number of protein structures, and a recent search of the PDB identified more than 4,000 structures of “unknown function”.

A. Bioinformatics Tools for Biochemistry Research

Elucidating the functions of proteins is a core component of biochemistry, structural biology, and bioinformatics. Scientists in these disciplines seek to understand the relationship among protein sequence, structure, and function. Software tools have been developed to relate sequence, structure, and function, and several programs exist that can propose functional annotations for a specific target of interest. Sequence databases can be searched with tools such as BLAST [3] and HMMER [4] to identify sequence homologs. Databases, repositories, and servers such as UniProt [5], Pfam [6], the Structural Biology Knowledgebase [7], Dali [8], and MarkUs [9] collect and display information from various sources that may be used to identify structural homologs and/or give functional insight.

B. ProMOL and PyMOL

We have developed the ProMOL plugin [10] for PyMOL [11], a tool used to explore the catalytic site structural homologies between proteins of known function and those for which functions are not yet known. ProMOL uses template-based alignment of these structures with a motif library that is rapidly expanding to include motifs from the Catalytic Site Atlas [12], Pfam, and other sources. Although catalytic site structural homology alone is not sufficient to define the function of a protein, it provides one mechanism which, when combined with other structural and sequence motifs, can suggest candidates for experimental verification.

Video Link: Introduction to PyMOL

Video Link: Introduction to ProMOL

These tools can be used only to develop a hypothesis for the potential function of a protein, and final confirmation of the function often relies on biochemical techniques, assays, and characterization. It is not sufficient to simply identify structural homologs (proteins with similar structures and identified with tools like Dali and MarkUs) or sequence homologs (proteins with similar sequences that are identified with tools such as BLAST and HMMER).

The goal of this lab is to choose a previously uncharacterized protein structure from the PDB, use currently available bioinformatics tools (ProMOL, BLAST, Pfam, Dali, etc.) to develop a hypothesis about the putative function, to produce your protein, and to develop an enzymatic assay to test your hypothesis (Figure 1).

Figure 1: Outline of proposed workflow to characterize proteins of “unknown function”.

The first step in our function prediction process is to compare a protein of unknown function against the library of enzyme active sites from the Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/) that makes up the motif template library of ProMOL. Each catalytic site motif template typically consists of 2-5 amino acid residues that have a fixed spatial and distance relationship. The example shown in Figure 2 is an alignment for a serine protease.

Figure 2: Top: Alignment of PDB entry 1afq (bovine gamma chymotrypsin; the query in red) with a motif template based on 1a0j (in white), a trypsin structure from Atlantic salmon. Bottom: The RMSD values for the alignment and the residues in the alignment can be determined and displayed in the PyMOL GUI. The first RMSD value is for all of the atoms in the three residues in the alignment; the second RMSD value is for the alpha carbons and the third is for the alpha and beta carbons. Three residues from 1afq (His 57, Asp 102 and Ser 195) aligned with the same three residues from 1a0j (His 57, Asp 102 and Ser 195).

Video Link: Structure Query with ProMOL

Video Link: Motif Finder in ProMOL

C. Enzyme Classification

When residues align as nicely as those in Figure 2, that indicates a strong relationship between the structures. In this case, the template molecule is trypsin, Enzyme Commission (EC) number 3.4.21.4. The EC system is maintained by the Nomenclature Committee of the IUBMB (International Union of Biochemistry and Molecular Biology), which works jointly with IUPAC (the same folks who systematized the naming of organic compounds). There is a nice description of the classification system on the IUBMB page (http://www.chem.qmul.ac.uk/iubmb/). There are seven top-level EC classes (the seventh, translocases, was added in 2018):

  1. Oxidoreductases
  2. Transferases
  3. Hydrolases
  4. Lyases
  5. Isomerases
  6. Ligases
  7. Translocases

Each class is divided into three lower levels to group enzymes based on more specific criteria. For example, trypsin is EC 3.4.21.4 [13]:

        3:          Hydrolase
        3.4:        Acting on peptide bonds
        3.4.21:     Serine endopeptidases
        3.4.21.4:   Trypsin
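
If you want to see how an EC number breaks down into its four levels, here is a minimal Python sketch. The function name ec_levels is made up for this handout; it is not part of ProMOL or PyMOL. It simply splits the number on the dots and rebuilds each successively more specific prefix.

```python
def ec_levels(ec_number):
    """Return the successively more specific prefixes of an EC number.

    Hypothetical helper for illustration only (not part of ProMOL).
    """
    parts = ec_number.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"An EC number has one to four levels: {ec_number!r}")
    # Join the first 1, 2, 3, ... parts back together with dots.
    return [".".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


print(ec_levels("3.4.21.4"))  # prints ['3', '3.4', '3.4.21', '3.4.21.4']
```

The four strings correspond to the four rows shown above for trypsin.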

When you analyze structures with ProMOL, you will receive an output of the best “hits” for your structure. You should then look at the EC class for the best hits to find out what ProMOL predicts about the function of your query structure.

D. ProMOL Deliverables

Following an alignment, the results window in ProMOL displays each alignment in the following format:

0:A_1a0j_3_4_21_4

Where 0 is the Levenshtein distance, A is the motif template set (in this case, an automatically generated motif), 1a0j is the PDB ID of the motif template that was used for the alignment and 3_4_21_4 represents the EC class, 3.4.21.4. ProMOL can also provide RMSD values (for goodness of alignment) and molecular representations of alignments as described below. Here is an explanation of some terms that may be new to you.
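
If you copy many result lines into your notebook, it can help to split them into their parts automatically. The sketch below is one possible way to do that in Python. It assumes every result line follows the same layout as the single example above (distance, colon, set letter, PDB ID, then the EC levels separated by underscores); results from other motif sets may be laid out differently. The names MotifHit and parse_hit are hypothetical and are not part of ProMOL.

```python
from typing import NamedTuple


class MotifHit(NamedTuple):
    """One line from the results pane (hypothetical container, not a ProMOL type)."""
    levenshtein: int
    motif_set: str
    pdb_id: str
    ec_number: str


def parse_hit(line):
    """Split a result such as '0:A_1a0j_3_4_21_4' into its components.

    Assumes the layout distance:SET_pdbid_ec1_ec2_ec3_ec4 shown in the handout.
    """
    distance, label = line.strip().split(":", 1)
    motif_set, pdb_id, *ec_parts = label.split("_")
    return MotifHit(int(distance), motif_set, pdb_id, ".".join(ec_parts))


hit = parse_hit("0:A_1a0j_3_4_21_4")
print(hit.levenshtein, hit.motif_set, hit.pdb_id, hit.ec_number)  # prints 0 A 1a0j 3.4.21.4
```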

Levenshtein distance is a measure used to compare two strings of text: it is the minimum number of single-character substitutions, insertions, or deletions needed to change one string into the other. For example, the Levenshtein distance between the word “form” and the word “foam” is 1, because you can change one character (an r to an a) to go from one word to the other. When we compare lists of protein residues, we can use this same approach. The two amino acid lists HDGS (Histidine-Aspartate-Glycine-Serine) and HDGT (Histidine-Aspartate-Glycine-Threonine) have a Levenshtein distance of 1. The Levenshtein distance can reflect either a substituted amino acid residue or a missing amino acid residue, so the two amino acid lists HDGS (Histidine-Aspartate-Glycine-Serine) and HDS (Histidine-Aspartate-Serine) also have a Levenshtein distance of 1.
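
To make the idea concrete, here is a short Python sketch of the standard dynamic-programming way to compute a Levenshtein distance. It is meant only to illustrate the definition using the examples above; it is not taken from ProMOL, and ProMOL's own implementation may differ.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


print(levenshtein("form", "foam"))  # prints 1
print(levenshtein("HDGS", "HDGT"))  # prints 1 (one substituted residue)
print(levenshtein("HDGS", "HDS"))   # prints 1 (one missing residue)
```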

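RMSD stands for root-mean-square deviation. It is a measure for comparing the positions of atoms in three-dimensional space: the lower the RMSD, the better the alignment. ProMOL reports three RMSD values: one for all of the atoms in the aligned residues, one for the alpha carbons only, and one for the alpha and beta carbons.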
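
The calculation behind an RMSD value is simple: take the distance between each pair of matched atoms, square it, average the squares, and take the square root. The Python sketch below shows that arithmetic for two atom lists that have already been superimposed. The coordinates are invented for illustration, and the function is not how ProMOL or PyMOL computes its values internally (those tools also handle the superposition step).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the two structures are already superimposed and that the atoms
    are listed in matching order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("Need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates (in Å): every template atom sits exactly 1 Å above its partner.
query_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
template_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(query_atoms, template_atoms))  # prints 1.0
```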

Purpose:

The purpose of this lab is for you to become familiar with PyMOL and ProMOL so that  you can complete a structural alignment of your protein of unknown function. This will provide an early clue to the function of the protein that you can later refine using other bioinformatics tools (see Figure 1).

Experimental Design Considerations:

As you perform your structural analysis in PyMOL/ProMOL, keep in mind the data you are collecting. Write any interesting findings in your notebook and make sure to collect images (by screen capture) for any interesting results you may want to consider later. The literature citations and web sites that are linked to this document may contain very useful information as you seek to identify the function of your protein. BE SURE TO RECORD YOUR FINDINGS IN YOUR NOTEBOOK. Include all of the details for each alignment: Levenshtein distance, motif template, EC class, RMSD values, and a screen capture of the alignment between the query and the motif. You may wish to expand or limit your search using the “Choose a Set” dialog box described below.

Supplies:

This is an in silico exercise, so you will only need access to a computer with PyMOL and ProMOL installed properly.

Safety Concerns:

For most of this lab module, you will be focused closely on a computer screen. You are encouraged to take periodic breaks (at least every 15 minutes) to reduce eye fatigue.

Procedure:

Opening ProMOL

Once you launch PyMOL, you can access ProMOL by clicking on the Plugin menu of the PyMOL GUI (Figure 3). The ProMOL window then appears (Figure 4). It contains a number of tabs.

Figure 3: When you launch PyMOL, you will see two windows - the graphical user interface is shown here. It contains the drop down menus (File, Edit, Build, etc.) and the molecular visualization window. The ProMOL plugin for PyMOL (right side) can be activated from the Plugin dropdown menu.

The ProMOL plugin (Figure 4, left) contains a number of tools that can be used to explore macromolecular structures. In this exercise, we will focus on the Motif Finder (Figure 4, right), which enables users to query a protein structure against a library of enzyme active sites.

Figure 4: Left: This is the opening screen for ProMOL and includes some general information about it. The ProMOL interface contains five tabs at the top and four buttons at the bottom to enhance usability. Right: The Motif Finder Interface includes progress bars for the individual query structure and for the overall alignment process if you are performing queries with more than one structure, the query box on the left, the results box on the right hand side, and some tools for alignment of results (Precision Factor, Show alignment, Calculate RMSD).

Finding a Motif in a Query Protein

The Motif Finder tab gives the user access to the motif search tools in ProMOL. This is one of the major features of ProMOL: it enables you to look for evolutionary relationships between proteins based on the three-dimensional arrangement of small templates (typically 2-5 amino acids) that form enzyme active sites or binding motifs.

In the Motif Finder interface you can enter multiple PDB IDs in the query box, separating them by commas, if you wish to perform multiple queries at the same time. It typically works best to enter fewer than 25 PDB IDs in the query box. Once you enter a PDB ID in the query box, the Start button becomes active.
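
If you plan to query several structures at once, you may find it handy to build the comma-separated list ahead of time. The Python sketch below is one way to do so. It assumes that classic four-character PDB IDs are used (one digit followed by three letters or digits), that ProMOL accepts lowercase IDs with no spaces after the commas, and it treats the “fewer than 25” advice as a hard limit. The function name build_query is hypothetical and is not part of ProMOL.

```python
import re

# Classic PDB IDs: one digit followed by three letters or digits (e.g. 2hnt).
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")
MAX_IDS = 24  # the handout recommends fewer than 25 IDs per query


def build_query(pdb_ids):
    """Return a comma-separated query string for the Motif Finder query box."""
    cleaned = [pdb_id.strip().lower() for pdb_id in pdb_ids]
    invalid = [pdb_id for pdb_id in cleaned if not PDB_ID_PATTERN.fullmatch(pdb_id)]
    if invalid:
        raise ValueError(f"These do not look like PDB IDs: {invalid}")
    if len(cleaned) > MAX_IDS:
        raise ValueError(f"Got {len(cleaned)} IDs; use {MAX_IDS} or fewer per query")
    return ",".join(cleaned)


print(build_query(["2hnt", "1AFQ", " 1a0j "]))  # prints 2hnt,1afq,1a0j
```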

Please start the exercise by entering the PDB ID “2hnt” in the ProMOL query box. Once you have gone through this process with this structure of known function (human gamma thrombin), you will conduct this exercise with your protein of unknown function.

Selecting a Template Library

ProMOL contains different sets of templates that are distinguished either by the way they were generated or by the types of structures that were considered. When you enter a PDB ID in the query box (see Figure 4) and click the Start button, the “Choose a Set” dialog box appears. Here is a brief description of each of the motif sets listed in the “Choose a Set” dialog box (Figure 5):

Figure 5: The “Choose a Set” dialog box.

  1. Using Motif Finder, submit 2hnt as a query and select the P set and the A set. Enter the number 3 in the first EC number box.
  2. As the search progresses, the progress bars will indicate how things are moving along. As a general rule, it takes 5-10 minutes to align a single structure against “All Motifs”. If you choose one or more subsets of motifs, the search will speed up accordingly. Please note that the PyMOL visualization window disappears during the search process to conserve computer resources.

Figure 6: Progress Bar and the Results Pane. The progress bars (left image, near the top) show how the query is progressing. The top bar is the % completion of the individual query and the bottom bar is the % completion of the query set. In this case the query was only one structure, so the two bars are identical. The results pane (right) lists the alignments that ProMOL found for the query structure.

  3. The Results Screen. Once a search is complete, the results will appear in the Results window on the right-hand side of the Motif Finder tab. The results shown in Figure 6 were obtained by searching 2hnt against the P set and the A set, restricted to EC class 3.
  4. Important! Record this! Record all of the alignment results that appear in the Motif Finder results pane.
  5. Dig deeper with all alignments that report a Levenshtein distance of 0. Capture screen images of high quality alignments. Figure 7 contains the alignment for 2hnt with 1o2u. The aspartate and histidine residues align very nicely. However, there are only two residues, so this is not considered an interesting alignment (why not?). Thrombin is a serine protease of EC class 3.4.21.5. Review the other structures with a Levenshtein distance of 0 and determine whether your results are consistent with the established function of your protein. Which motif templates align best with 2hnt?

Figure 7: Active site alignment of 2hnt (thrombin; red) with 1o2u (trypsin; white). The residues in the alignment are His 57 and Asp 102 from 2hnt and His 57 and Asp 102 from 1o2u.

  6. Now repeat this exercise using your structure of unknown function. What do your results tell you about the function of your structure?

Interpreting Results:   

Look at the ProMOL results pane for 2hnt (shown in Figure 6, but you should focus on your own screen). Here is an explanation of the alignment results for the first alignment:

There is a set of tools toward the bottom of the Motif Finder tab (see Figure 6) that will enable you to show the alignment between the query (in red) and the motif template (in white). You can also calculate the RMSD for the alignment. If you check both boxes and then double-click on an individual result, the alignment will appear in the PyMOL visualization window and the RMSD values will appear in the PyMOL GUI (Figure 8, top). In general with ProMOL, alignments with three or more residues and RMSD values below 3 Å are considered high quality.
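
When you have many alignments to sort through, you can apply this rule of thumb consistently with a few lines of Python. The sketch below assumes that all three reported RMSD values must fall below the 3 Å cutoff, which is one reading of the guideline above; the function name and the example numbers are made up for illustration and do not come from ProMOL.

```python
def is_high_quality(n_residues, rmsd_values, min_residues=3, max_rmsd=3.0):
    """Apply the handout's rule of thumb to one alignment.

    n_residues:  number of residues in the aligned motif
    rmsd_values: the RMSD values (in Å) reported for the alignment
    Assumes every reported RMSD must be below max_rmsd.
    """
    if not rmsd_values:
        raise ValueError("At least one RMSD value is required")
    enough_residues = n_residues >= min_residues
    all_close = all(value < max_rmsd for value in rmsd_values)
    return enough_residues and all_close


# Invented example values, not real ProMOL output:
print(is_high_quality(3, [0.45, 0.21, 0.30]))  # prints True
print(is_high_quality(2, [0.10, 0.05, 0.08]))  # prints False (only two residues)
print(is_high_quality(4, [3.60, 1.20, 1.50]))  # prints False (one RMSD is 3 Å or more)
```

A check like this only narrows down the list. You should still inspect each alignment visually before drawing conclusions.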

Figure 8: Alignment Results with ProMOL/PyMOL. The RMSD values for all atoms, Calpha only, and Calpha/Cbeta are shown in the PyMOL GUI at the top. The alignment of 2hnt (query in red) with 1a0j (motif template in white) is shown in the lower image of the molecular visualization window.

If you click on the residues in the alignment, their information appears in the PyMOL GUI window (Figure 9). You can use this approach to identify the residues in your alignments.

Figure 9: PyMOL GUI window identifying residues within your specific alignment. I generated the information shown here by clicking on the aligned residues that appear in the bottom part of Figure 8. The motif template is trypsin, PDB ID 1a0j. The query is thrombin, PDB ID 2hnt.

Review the image in Figure 9 to identify the residues in the alignment shown in Figure 8 (and on your computer). What information is contained in this line?

        
You clicked /2HNT//C/HIS`57/CD2
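
Once you have worked out the answer yourself, you can check it with the short Python sketch below. It assumes the line follows PyMOL's usual slash-separated selection layout, /object/segment/chain/residue`number/atom, with a backtick between the residue name and number. The function name parse_clicked is hypothetical and is not a PyMOL command.

```python
def parse_clicked(macro):
    """Split a PyMOL-style atom identifier into named fields.

    Assumes the layout /object/segment/chain/RESN`RESI/ATOM.
    """
    text = macro.strip()
    if text.startswith("/"):
        text = text[1:]
    fields = text.split("/")
    if len(fields) != 5:
        raise ValueError(f"Expected five slash-separated fields in {macro!r}")
    obj, segment, chain, residue, atom = fields
    residue_name, residue_number = residue.split("`", 1)
    return {
        "object": obj,                     # the loaded structure, here the PDB ID
        "segment": segment,                # often empty
        "chain": chain,
        "residue_name": residue_name,
        "residue_number": residue_number,  # kept as text; it can carry an insertion code
        "atom": atom,
    }


print(parse_clicked("/2HNT//C/HIS`57/CD2"))
```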

Now that you have gone through this exercise with thrombin (PDB ID 2hnt), repeat the exercise with your protein structure of unknown function. When you are finished, you will have generated a set of alignment results for your protein.

Review these results carefully. Based on the evidence, can you make a prediction about the function of your protein? Are the results consistent or are there conflicts, that is, do you have good alignments with motif templates from more than one EC class?
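
One way to look for conflicts is to tally the EC numbers of your good alignments at different levels of the hierarchy. The Python sketch below does this for a list of EC numbers that you would type in from your own results. The function names are hypothetical, and the EC numbers in the example are illustrative only (they are not ProMOL output for any particular structure).

```python
from collections import Counter


def ec_prefix(ec_number, depth):
    """Return the first `depth` levels of an EC number, e.g. ('3.4.21.4', 2) -> '3.4'."""
    return ".".join(ec_number.split(".")[:depth])


def tally_ec(ec_numbers, depth):
    """Count how many alignments fall into each EC group at the given depth."""
    return Counter(ec_prefix(ec, depth) for ec in ec_numbers).most_common()


def has_conflict(ec_numbers, depth):
    """True if the alignments point to more than one EC group at this depth."""
    return len({ec_prefix(ec, depth) for ec in ec_numbers}) > 1


# Illustrative list: trypsin, thrombin, and acetylcholinesterase EC numbers.
good_hits = ["3.4.21.4", "3.4.21.5", "3.1.1.7"]
print(tally_ec(good_hits, depth=3))      # prints [('3.4.21', 2), ('3.1.1', 1)]
print(has_conflict(good_hits, depth=1))  # prints False (all are hydrolases)
print(has_conflict(good_hits, depth=2))  # prints True (3.4 versus 3.1)
```

Agreement at the first level with disagreement at deeper levels suggests that you know the general reaction type but need other tools (see Figure 1) to narrow down the specific function.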

Troubleshooting

  1. If you are using the BASIL Virtual Machine, then you are using CyVerse (http://cyverse.org) and Atmosphere within CyVerse. Make sure that you have registered for an account through your instructor.
  2. Power on the Virtual Machine. We have found that selecting an instance size with at least 2 CPUs and 16 GB of memory (small2 or greater) works well.
  3. Moving files back and forth with the Virtual Machine.  You must save any data you generate on your virtual machine to your home computer. When you close the virtual machine, all data will be lost.
  4. At the end of each class period, you will need to delete the virtual machine to ensure that you don’t use up all of your computer time on CyVerse/Atmosphere.

References:

  1. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000) The Protein Data Bank. Nucl. Acids Res. 28, 235–242
  2. Berman, H. M., Westbrook, J. D., Gabanyi, M. J., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L., Kopp, J., Podvinec, M., Adams, P. D., Carter, L. G., Minor, W., Nair, R., and Baer, J. L. (2009) The protein structure initiative structural genomics knowledgebase. Nucl. Acids Res. 37, D365–D368
  3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410
  4. Finn, R. D., Clements, J., and Eddy, S. R. (2011) HMMER web server: interactive sequence similarity searching. Nucl. Acids Res. 39, W29–W37
  5. The UniProt Consortium (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucl. Acids Res. 40, D71–D75
  6. Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., and Punta, M. (2014) Pfam: the protein families database. Nucl. Acids Res. 42, D222–D230
  7. Gabanyi, M. J., Adams, P. D., Arnold, K., Bordoli, L., Carter, L. G., Flippen-Andersen, J., Gifford, L., Haas, J., Kouranov, A., McLaughlin, W. A., Micallef, D. I., Minor, W., Shah, R., Schwede, T., Tao, Y.-P., Westbrook, J. D., Zimmerman, M., and Berman, H. M. (2011) The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. J Struct Funct Genomics. 12, 45–54
  8. Holm, L., and Rosenström, P. (2010) Dali server: conservation mapping in 3D. Nucl. Acids Res. 38, W545–W549
  9. Fischer, M., Zhang, Q. C., Dey, F., Chen, B. Y., Honig, B., and Petrey, D. (2011) MarkUs: a server to navigate sequence-structure-function space. Nucl. Acids Res. 39, W357–W361
  10. Hanson, B., Westin, C., Rosa, M., Grier, A., Osipovitch, M., MacDonald, M. L., Dodge, G., Boli, P. M., Corwin, C. W., Kessler, H., McKay, T., Bernstein, H. J., and Craig, P. A. (2014) Estimation of protein function using template-based alignment of enzyme active sites. BMC Bioinformatics. 15, 87
  11. DeLano, W. L. (2002) The PyMOL Molecular Graphics System. http://www.pymol.org
  12. Porter, C. T., Bartlett, G. J., and Thornton, J. M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl. Acids Res. 32, D129–D133
  13. Moss, G.P. IUBMB Enzyme Nomenclature, EC 3.4.21.4, http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/21/4.html (accessed 18 July 2016)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Appendix: Installation Instructions if you are loading ProMOL on a Macintosh computer. These instructions assume that you already have PyMOL installed.

  1. Before doing anything else, install XQuartz (X11), available at http://xquartz.macosforge.org/landing/ as a regular .dmg installer.
  2. In Finder, open the Applications folder and find MacPyMOL.
  3. Rename MacPyMOL to PyMOLX11Hybrid.
  4. Right-click on it and choose Show Package Contents.
  5. Click Contents → pymol → modules → pmg_tk → startup.
  6. In another Finder window, find where you have unpacked the ProMOL tarball.
  7. Copy the contents of the Promol-5.4-r419 folder to the startup folder.
  8. Among the items you just copied, find a file (not a folder) called remote_pdb_load.py and add an x to the beginning of its filename.
  9. Then go to the folder called remote_pdb_load_plug in. Within that folder there should be a file called remote_pdb_load.py. Copy that file up one directory, into the startup folder.
  10. Now open PyMOL. X11 should open at the same time. If you don’t see the menu bar, click on the X11 icon in the Dock and it will appear.
  11. From that menu bar, you can now click Plugin and choose ProMOL.