Title: How to use Pfam

Techniques to Master:

Searching for protein families using primary sequences of your unknown protein

Learning Objectives:

  1. Interpret the results of a Pfam search
  2. Correlate protein sequence and protein function
  3. Integrate Pfam results with results from other bioinformatics tools
  4. In combination with all of the bioinformatics tools, hypothesize a function for your protein structure

Background: Before we begin, there is a relationship that you absolutely must understand: primary sequence determines structure and structure determines function. Thanks to the transitive property, you could also say that primary sequence is responsible for function. There are caveats in these statements: (1) what if two sequences are not identical, (2) how different can two sequences be before the structure changes, (3) how different can two structures be before the function changes, and (4) what about changes in global structure versus changes in local structure?

In this project, you know the first two (primary sequence and structure) and are looking for clues to help you identify the function. It’s important to make the distinction between sequence, structure, and function here because some of the databases and servers that you will be using are sequence-based (e.g., UniProt, Pfam, and BLAST) while others (Structural Biology Knowledgebase, ProMOL, and DALI) are structure-based. To complicate matters even more, you must also understand the difference between global alignments (trying to align the entire protein; DALI) and local alignments (trying to align only a small part of the protein; ProMOL) and be aware of which one is being used.
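To make the global versus local distinction concrete, here is a minimal Python sketch that scores the same pair of toy sequences both ways. Keep the assumptions in mind: DALI and ProMOL compare structures, not sequences, so this is only a sequence-level analogy. The function name alignment_score, the two toy sequences, and the scoring values (+1 match, -1 mismatch, -1 gap) are made up for illustration and are not what any of these servers actually use.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Score an alignment of a and b with simple dynamic programming.

    local=False: global alignment (every residue of both sequences is used).
    local=True:  local alignment (only the best-matching stretch is scored).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # A global alignment pays a gap penalty for unaligned ends;
    # a local alignment leaves the first row and column at zero.
    if not local:
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning everything. Local: best stretch found anywhere.
    return best if local else score[-1][-1]


# Hypothetical toy sequences: the short one is an exact fragment of the long one.
full_length = "MKTAYIAKQR"
fragment = "AYIA"

print("global:", alignment_score(full_length, fragment, local=False))
print("local: ", alignment_score(full_length, fragment, local=True))
```

The fragment matches part of the longer sequence perfectly, so the local score is high, while the global score is dragged down by all of the residues that have no partner. The same pair of proteins can therefore look like a good or a poor match depending on which type of alignment a tool reports.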

As you use all of these tools, servers, databases, and websites, it is important to constantly remind yourself that you are the expert. I know that sounds scary, but forgetting that small fact can have disastrous consequences. If you assign the wrong function to your protein, who will suffer the consequences? The DALI server? Or you? Your goal is to take the information from various sources and combine it into a hypothesis that you will test in the wet lab. As you develop your hypothesis based on the results that you obtain from these bioinformatics tools, remember that these tools give you “clues” or “indications” about the possible function of your protein. They do not give you definitive answers on the absolute function of your protein. Also try to remember that you can use these tools to disprove a hypothesis, and that information can be just as valuable.

A. The Pfam Database

Imagine that you are sitting at home, bored, and you decide to start reading all of the protein sequences. There are only 64 million of them, so you should be done before The Unbreakable Kimmy Schmidt comes on. While reading, you notice that many of the sequences are similar. You also notice that many larger proteins contain repeated sequences (Figure 1). As you read, you decide to sort the proteins into groups with similar sequences. Congratulations, you have just built the Pfam database [Finn et al. 2016]. The actual Pfam database [http://pfam.xfam.org/] is not quite that simple, but this example illustrates the idea behind the database.

Figure 1: The “domain organisation” of human γ-thrombin (2HNT), according to Pfam. The orange box represents a signal peptide; the green box is a Vitamin K-dependent carboxylation domain that is found in an additional 865 proteins as a Ca2+ binding site; the two red boxes represent two independent Kringle domains that are found in over 3,000 proteins that are related to blood coagulation; the purple box corresponds to the thrombin light chain; and the yellow box represents the serine protease, trypsin.

Pfam is a manually curated database in which protein sequences are grouped into families based on sequence similarity. Pfam families are built from a group of seed sequences. The seed sequences are used to build a statistical profile (Figure 2) and this profile, called a Hidden Markov Model (HMM), is then used to search for more related sequences. It is important to note that the seed defines the Pfam family and that every member of the family is related to the seed, but the pairwise sequence identity of other members in the family may be near-zero.
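The last point can be illustrated with a small Python sketch that computes pairwise percent identity for sequences that are already aligned. Everything here is an assumption made for illustration: the function name percent_identity and the three eight-residue sequences are invented, and real family members are far longer and can diverge much further than this toy example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where both sequences have a residue.

    Both inputs must already be aligned to the same length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical sequences: one "seed" and two members that each differ
# from the seed at different positions.
seed = "GDSGGPLV"
member_a = "GDAGGPIV"
member_b = "GESGSPLA"

print("seed vs A:", percent_identity(seed, member_a))
print("seed vs B:", percent_identity(seed, member_b))
print("A vs B:   ", percent_identity(member_a, member_b))
```

Because the two members differ from the seed at different positions, they are less similar to each other than either one is to the seed. Over a large family, the same effect can leave two members with almost no identical residues even though both still match the family profile.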

Figure 2: A section of the HMM for the Trypsin family (PF00089). In this example, position 182 is always a glycine, positions 179 and 183 are usually glycine but can also be serine, and position 186 prefers aliphatic residues (valine, isoleucine, leucine, etc.). The statistical probabilities of finding specific residues at specific positions are the defining feature that is used to build the rest of the Pfam family.
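The idea of position-specific probabilities can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not how Pfam builds its models: a real profile HMM also models insertions, deletions, and transitions between states, and uses more careful statistics. The seed alignment, the function names build_profile and log_odds_score, the pseudocount of 1, and the uniform background frequency of 1/20 are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one {residue: probability} dict per column of an ungapped alignment."""
    length = len(seed_alignment[0])
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(sequence[column] for sequence in seed_alignment)
        # The pseudocount keeps residues that were never observed from
        # having a probability of exactly zero.
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def log_odds_score(profile, query, background=1 / 20):
    """Sum of log2(profile probability / background probability) over all positions."""
    return sum(math.log2(column[aa] / background) for column, aa in zip(profile, query))


# Hypothetical ungapped seed alignment (not taken from PF00089).
seed_alignment = ["GDSGGPLV", "GDSGGPIV", "SDSGGPLI", "GDSGSPVV"]
profile = build_profile(seed_alignment)

print("family-like query:", round(log_odds_score(profile, "GDSGGPLV"), 2))
print("unrelated query:  ", round(log_odds_score(profile, "WWWWWWWW"), 2))
```

A query that follows the preferred residue at each position earns a positive score, while a query built from residues the seed never uses earns a negative one. A score of this kind, rather than identity to any single sequence, is what decides whether a sequence matches the profile.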

B. Some Proteins Contain Multiple Pfam Families

It is important to understand that Pfam is not a protein database, but rather a domain database. The difference may be subtle, but it is important. Perhaps this concept is easier to understand with a specific example. Zinc fingers are not individual proteins, but are small, conserved, structural motifs that are found in many types of proteins and have many functions. A keyword search for “zinc finger” in the Pfam database produces roughly 150 different families with functions that include development of blood cells (PDB ID 1Y0J), interaction with viral RNA (PDB ID 1A1T), and lipid binding (PDB ID 1JOC). Each of these zinc finger domains is found as a smaller part of a larger protein (http://pdb101.rcsb.org/motm/87). Recognizing zinc fingers as small, reusable, functional domains that are found in many proteins for many different purposes epitomizes the functionality of Pfam.

C. Pfam Deliverables

After landing on the appropriate Pfam page (Figure 3), you should focus on the upper-right navigation pane to determine: how many sequences are in my family, how many species are represented in this family, and how many structures are known? You should also focus on the Summary, which is a community-written Wikipedia article.

Figure 3: The main page for the Pfam family Trypsin (PF00089). The important links are highlighted with red boxes. Note that some of the links (i.e., “Structures” and “Species”) are duplicated along the top right and down the left side.

Purpose: By now, you have chosen a protein of unknown function and the important research question that you are trying to answer is: “What does my protein do?” A more scientific way to ask that question is: “What reaction does this enzyme catalyze, what substrate is needed, what cofactors are required, and what is the product?” To answer these questions, you need to incorporate many different types of information from multiple sources. These bioinformatics tools will give you an idea of potential enzyme functions, which you can use to build a hypothesis that you then test.

There are tools that search for primary sequences that are similar to your query (e.g., BLAST and Pfam) and there are tools that search for structurally similar proteins (e.g., DALI). As a general rule of thumb, proteins that have similar sequences also have similar structures and similar functions. Pfam is a curated database of protein sequences that are organized into Protein FAMilies based on their sequences. If you can find the protein family that your protein belongs to, Pfam may have a description of what the family does, how many related proteins are known, how many structures have been determined for this family, and which organisms contain these proteins. If you can identify a related protein with a known function, then you have very strong evidence pointing you towards a potential function.

Experimental Design Considerations: Pfam is a simple, user-friendly bioinformatics tool that requires very little in terms of user input. There are no significant considerations to keep in mind when submitting your queries to the server. After running Pfam, you could summarize your results in a paragraph discussing: the quality of the match between your query sequence and your hit, what the family does, how many sequences are known, how many organisms (and what type of organisms) use proteins from this family, and how many structures exist for this family.
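If you like to keep your notes in a structured form, the summary items listed above could be recorded with something like the following Python sketch. The class name PfamSummary, its field names, and every value in the example are placeholders invented for illustration; replace them with what you actually read off your own Pfam family page.

```python
from dataclasses import dataclass


@dataclass
class PfamSummary:
    """One record per Pfam family hit, mirroring the summary items listed above."""
    accession: str
    family_name: str
    match_quality: str
    family_function: str
    n_sequences: int
    n_species: int
    organism_types: str
    n_structures: int

    def as_paragraph(self) -> str:
        """Turn the record into a short results paragraph."""
        return (
            f"The query matched Pfam family {self.family_name} ({self.accession}); "
            f"match quality: {self.match_quality}. "
            f"Family function: {self.family_function}. "
            f"The family contains {self.n_sequences} sequences from "
            f"{self.n_species} species ({self.organism_types}) and has "
            f"{self.n_structures} known structures."
        )


# Placeholder values only; none of these numbers come from Pfam.
example = PfamSummary(
    accession="PF00000",
    family_name="Example family",
    match_quality="covers most of the query",
    family_function="hypothetical hydrolase activity",
    n_sequences=1200,
    n_species=300,
    organism_types="mostly bacteria",
    n_structures=15,
)
print(example.as_paragraph())
```

Filling in one record per family makes it easier to compare Pfam hits side by side when a protein has more than one domain.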

Supplies:

You will need an internet-connected computer and the PDB ID and primary sequence of your protein.

Procedure: A video tutorial that closely follows the procedure described below is at https://youtu.be/QMvWmpj8FOI.

There are several ways to access and search Pfam but I find the structure search to be the easiest and most useful for this type of work. You can usually link to the appropriate Pfam family from the PDB or UniProt, but this tutorial will describe how to do a structure search. [Incorporate 2HNT as an example]

  1. Open the Pfam website (http://Pfam.xfam.org/).
  2. Under the “Quick links” section, click on the “View a structure” link and a text box will appear beside the link. Enter your four-character PDB identifier in this box and press “Go”.
  3. The result page contains a lot of useful information and links to other databases and you should explore these at some other time. For now, we are interested in a link on the left navigation panel: “Sequence mapping”. If the “Sequence mapping” link is gray, then your protein does not have a Pfam family. In this case, you could try a sequence search although you may find no hits or distant, poor-quality hits.
  4. The results on the sequence mapping page may vary depending on how complex your protein is. If your protein contains multiple domains, you will likely have multiple Pfam families to examine. For an example of a large protein with multiple domains, please see 5AIA. For an example of a small protein containing only one domain, please see 1UBQ. At this point, you should have already run ProMOL and have an idea of which residues define your potential active site; you will need that information if your protein has multiple Pfam families. On the sequence mapping page, columns 2 and 3 show the residue numbers that define each domain. If your protein does contain multiple Pfam families, you must look at each domain that contains a residue that ProMOL identified. Click on the link to the Pfam family (far right column) that corresponds to the domain containing your amino acids of interest.
  5. Once you reach the page describing your protein family, pay attention to: the summary page; the number of known sequences (top navigation pane); the species tab (left navigation pane for the distribution of species, top navigation pane for the number of species); and the structures tab (left navigation pane; the number of structures is displayed on the top navigation pane).
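The bookkeeping in step 4 (deciding which Pfam domain contains each residue that ProMOL flagged) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names DomainHit and domains_containing, the family labels, the residue ranges, and the residue numbers are all hypothetical, and you would type in the start and end values shown in columns 2 and 3 of your own sequence mapping page.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One row of the sequence mapping table: a Pfam family and its residue range."""
    family: str
    start: int  # first residue of the domain (inclusive)
    end: int    # last residue of the domain (inclusive)


def domains_containing(residues, hits):
    """Map each residue number to the Pfam families whose range includes it."""
    return {
        residue: [hit.family for hit in hits if hit.start <= residue <= hit.end]
        for residue in residues
    }


# Hypothetical sequence mapping for a two-domain protein.
hits = [
    DomainHit("Family_A", 20, 110),
    DomainHit("Family_B", 150, 380),
]

# Hypothetical active-site residues suggested by ProMOL.
active_site = [57, 195, 400]

for residue, families in domains_containing(active_site, hits).items():
    print(residue, families if families else "not in any Pfam domain")
```

A residue that falls outside every domain range, like the last one in this example, is worth noting too: it means that Pfam cannot tell you anything about that part of your protein.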

Clean-up: This is an in silico exercise so there should be nothing to clean up.

Interpreting Results: Remember that the main function of Pfam is to help you determine the potential function of your protein. To this end, the most useful “results” from Pfam will be the (1) summary page and (2) any information that you can learn from other structures in the family.

[Compare results with BLAST, DALI, and ProMOL. Compile/summarize useful information.]

References:

  1. Finn, R. D., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Mistry, J., Mitchell, A. L., Potter, S. C., Punta, M., Qureshi, M., Sangrador-Vegas, A., Salazar, G. A., Tate, J., and Bateman, A. (2016) The Pfam protein families database: towards a more sustainable future. Nucl. Acids Res. 44, D279–D285

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.