Title: How to use Pfam
Techniques to Master:
Searching for protein families using primary sequences of your unknown protein
Learning Objectives:
Background: Before we begin, there is a relationship that you absolutely must understand: primary sequence determines structure and structure determines function. Thanks to the transitive property, you could also say that primary sequence is responsible for function. There are caveats in these statements: (1) what if two sequences are not identical, (2) how different can two sequences be before the structure changes, (3) how different can two structures be before the function changes, and (4) what about changes in global structure versus changes in local structure?
In this project, you know the first two (primary sequence and structure) and are looking for clues to help you identify the function. It’s important to make the distinction between sequence, structure, and function here because some of the databases and servers that you will be using are sequence-based (e.g., UniProt, Pfam, and BLAST) while others (e.g., Structural Biology Knowledgebase, ProMOL, and DALI) are structure-based. To complicate matters even more, you must also understand the difference between global alignments (trying to align the entire protein; DALI) and local alignments (trying to align only a small part of the protein; ProMOL) and be aware of which one is being used.
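The global versus local distinction can be illustrated with a minimal sketch. Note the assumptions: DALI and ProMOL compare structures, whereas this toy example aligns two short sequences, simply because the same idea is easier to see there. The function name `alignment_score`, the scoring values (match +1, mismatch -1, gap -1), and the two test sequences are all invented for illustration and are not taken from any of the tools above.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Score the best alignment of strings a and b by dynamic programming.

    local=False: global alignment (every residue of both sequences is used).
    local=True:  local alignment (only the best-matching stretch is scored).
    """
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0] * (len(b) + 1) if local else [j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Two invented sequences that share only the short stretch "ACDE".
seq_a = "WWWWACDEWWWW"
seq_b = "KKACDEKK"

global_score = alignment_score(seq_a, seq_b, local=False)
local_score = alignment_score(seq_a, seq_b, local=True)
print("global:", global_score, "local:", local_score)
```

The local score rewards the shared stretch and ignores everything around it, while the global score is dragged down by the unrelated flanking residues. This is why it matters which kind of alignment a tool is reporting.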
As you use all of these tools, servers, databases, and websites, it is important to constantly remind yourself that you are the expert. I know that sounds scary, but forgetting that small fact can have disastrous consequences. If you assign the wrong function to your protein, who will suffer the consequences? The DALI server? Or you? Your goal is to take the information from various sources and combine it into a hypothesis that you will test in the wet lab. As you develop your hypothesis based on the results that you obtain from these bioinformatics tools, remember that these tools give you “clues” or “indications” about the possible function of your protein. They do not give you definitive answers on the absolute function of your protein. Also try to remember that you can use these tools to disprove a hypothesis, and that information can be just as valuable.
A. The Pfam Database
Imagine that you are sitting at home and are bored so you decide to start reading all of the protein sequences - there are only 64 million of them, so you should be done before The Unbreakable Kimmy Schmidt comes on. While reading all of these protein sequences, you notice that many of them are similar. Additionally, you also notice that many bigger proteins contain repeated sequences (Figure 1). As you read through these protein sequences, you decide to categorize them into groups with similar sequences. Congratulations, you have just built the Pfam database [Finn et al. 2016]. The actual Pfam database [http://pfam.xfam.org/] is not quite that simple, but this example illustrates the idea behind the database.
Figure 1: The “domain organisation” of human γ-thrombin (2HNT), according to Pfam. The orange box represents a signal peptide; the green box is a Vitamin K-dependent carboxylation domain that is found in an additional 865 proteins as a Ca2+ binding site; the two red boxes represent two independent Kringle domains that are found in over 3,000 proteins that are related to blood coagulation; the purple box corresponds to the thrombin light chain; and the yellow box represents the serine protease, trypsin.
Pfam is a manually curated database in which protein sequences are grouped into families based on sequence similarity. Pfam families are built from a group of seed sequences. The seed sequences are used to build a statistical profile (Figure 2) and this profile, called a Hidden Markov Model (HMM), is then used to search for more related sequences. It is important to note that the seed defines the Pfam family and that every member of the family is related to the seed, but the pairwise sequence identity of other members in the family may be near-zero.
Figure 2: A section of the HMM for the Trypsin family (PF00089). In this example, position 182 is always a glycine, positions 179/183 are usually glycine but can also be serine, and position 186 prefers aliphatic residues (valine, isoleucine, leucine, etc.). The statistical probabilities of finding specific residues at specific positions are the defining feature that is used to build the rest of the Pfam family.
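The idea of a position-specific statistical profile can be sketched in a few lines. This is a deliberately simplified stand-in, not how Pfam builds its models: a real profile HMM also models insertions and deletions, whereas this sketch only counts residues in each column of an ungapped alignment. The seed sequences, the function names `build_profile` and `log_odds_score`, the pseudocount, and the uniform background frequency are all assumptions made for illustration; the toy alignment is not taken from PF00089.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Return, for each column, the probability of seeing each amino acid."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("seed sequences must be aligned to the same length")
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        # The pseudocount keeps unseen residues from getting probability zero.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, segment, background=1 / 20):
    """Sum of log2(profile probability / background) over each position."""
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(segment)
    )


# Toy seed alignment, invented for illustration only.
seeds = ["GDSGGPL", "GDSGGPV", "SDSGSPL", "GDSGGPI"]
profile = build_profile(seeds)

match = log_odds_score(profile, "GDSGGPL")     # resembles the seeds
mismatch = log_odds_score(profile, "AKWYTRE")  # unrelated residues
print(f"seed-like segment: {match:.2f}, unrelated segment: {mismatch:.2f}")
```

A segment that resembles the seeds scores well above zero, while an unrelated segment scores below zero. Columns that tolerate more than one residue (like the first column here, which is mostly G but sometimes S) still give partial credit, which is how a profile can recognize family members that share little pairwise identity with each other.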
B. Some Proteins Contain Multiple Pfam Families
It is important to understand that Pfam is not a protein database, but rather that it is a domain database. The difference may be subtle, but is important. Perhaps this concept is easier to understand with a specific example. Zinc fingers are not individual proteins, but are small, conserved, structural motifs that are found in many types of proteins and have many functions. A keyword search for “zinc finger” in the Pfam database produces roughly 150 different families with functions that include development of blood cells (PDB ID 1Y0J), interaction with viral RNA (PDB ID 1A1T), and lipid binding (PDB ID 1JOC). Each of these zinc finger domains is found as a smaller part of a larger protein (http://pdb101.rcsb.org/motm/87). Recognizing zinc fingers as small, reusable, functional domains that are found in many proteins for many different purposes epitomizes the functionality of Pfam.
C. Pfam Deliverables
After landing on the appropriate Pfam page (Figure 3), you should focus on the upper-right navigation pane to determine: how many sequences are in my family, how many species are represented in this family, and how many structures are known? You should also focus on the Summary, which is a community-written Wikipedia article.
Figure 3: The main page for the Pfam family Trypsin (PF00089). The important links are highlighted with red boxes. Note that some of the links (e.g., “Structures” and “Species”) are duplicated along the top right and down the left side.
Purpose: By now, you have chosen a protein of unknown function and the important research question that you are trying to answer is: “What does my protein do?” A more scientific way to ask that question is: “What reaction does this enzyme catalyze, what substrate is needed, what cofactors are required, and what is the product?” In order to answer these questions, you need to incorporate many different types of information from multiple sources. These bioinformatics tools will give you an idea of potential enzyme functions, which you can use to build a hypothesis that you then test.
There are tools that search for primary sequences that are similar to your query (e.g., BLAST and Pfam) and there are tools that search for structurally similar proteins (e.g., DALI). As a general rule of thumb, proteins that have similar sequences have similar structures and also have similar functions. Pfam is a curated database of protein sequences that are organized into Protein FAMilies based on their sequences. If you can find the protein family that your protein belongs to, Pfam may have a description of what the family does, how many related proteins are known, how many structures have been determined for this family, and which organisms contain these proteins. If you can identify a related protein with a known function, then you have very strong evidence pointing you towards a potential function.
Experimental Design Considerations: Pfam is a simple, user-friendly bioinformatics tool that requires very little in terms of user input. There are no significant considerations to keep in mind when submitting your queries to the server. After running Pfam, you could summarize your results in a paragraph discussing: the quality of the match between your query sequence and your hit, what the family does, how many sequences are known, how many organisms (and what type of organisms) use proteins from this family, and how many structures exist for this family.
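One way to keep that summary consistent from protein to protein is to record the same fields every time and generate the paragraph from them. The sketch below is only a suggestion. The class name `PfamSummary`, its field names, and the choice of an E-value as the measure of match quality are assumptions, and every value in the example (including the accession `PF00000`) is a placeholder rather than a real Pfam result.

```python
from dataclasses import dataclass


@dataclass
class PfamSummary:
    """The pieces of information to copy down from a Pfam family page."""

    family: str          # family name shown on the Pfam page
    accession: str       # Pfam accession of the form PFxxxxx
    e_value: float       # match quality between query and family (assumed metric)
    description: str     # one sentence on what the family does
    n_sequences: int     # sequences in the family
    n_species: int       # species represented in the family
    organism_types: str  # e.g. which kingdoms or groups are represented
    n_structures: int    # known structures for the family

    def to_paragraph(self) -> str:
        """Render the recorded fields as a short summary paragraph."""
        return (
            f"The query matched Pfam family {self.family} ({self.accession}) "
            f"with an E-value of {self.e_value:.1e}. {self.description} "
            f"The family contains {self.n_sequences:,} sequences from "
            f"{self.n_species:,} species ({self.organism_types}), and "
            f"{self.n_structures:,} structures are known."
        )


# Placeholder values only; replace them with what you read from your own results.
example = PfamSummary(
    family="ExampleFam",
    accession="PF00000",
    e_value=1e-30,
    description="Members of this family are described as hydrolases.",
    n_sequences=1000,
    n_species=200,
    organism_types="mostly bacteria",
    n_structures=5,
)
print(example.to_paragraph())
```

Filling in the same fields for each protein also makes it easier to compare the Pfam result with the results from the other tools later on.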
Supplies:
You will need an internet-connected computer and the PDB ID and primary sequence of your protein.
Procedure: A video tutorial that closely follows the procedure described below is at https://youtu.be/QMvWmpj8FOI.
There are several ways to access and search Pfam but I find the structure search to be the easiest and most useful for this type of work. You can usually link to the appropriate Pfam family from the PDB or UniProt, but this tutorial will describe how to do a structure search. [Incorporate 2HNT as an example]
Clean-up: This is an in silico exercise so there should be nothing to clean up.
Interpreting Results: Remember that the main function of Pfam is to help you determine the potential function of your protein. To this end, the most useful “results” from Pfam will be the (1) summary page and (2) any information that you can learn from other structures in the family.
[Compare results with BLAST, DALI, and ProMOL. Compile/summarize useful information.]
References:
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.