Title: Protein BLAST Search

 

Techniques to Master:

 

  1. Use protein sequences to search protein databases with BLAST for proteins of similar sequences
  2. Interpret BLAST results to identify “homologous” proteins

 

Learning Objectives:

 

  1. To determine which family or superfamily of proteins / enzymes your protein belongs to.
  2. To determine which amino acids are conserved amongst (super)family members.
  3. Using BLAST searches along with structures, to determine which amino acids are not only in the active site, but which are conserved in the active site, and thus may be involved in catalysis.
  4. To see what kind of results a “word size” of 2 vs. 6 gives you, and to determine when a word size of 2 is better and when a word size of 6 is better.

 

Background: 

 

BLAST stands for Basic Local Alignment Search Tool. It was developed in 1990 by Altschul et al.

 

BLAST searches can be done with nucleotide (DNA) or amino acid (protein) sequences. Here we will be using BLAST to analyze proteins. BLAST can be used to determine which other proteins (sequences deposited in various databases) have similar amino acid sequences and thus are related by sequence. Proteins related by sequence are presumed to be evolutionarily related (homologs of one another), to belong to the same protein superfamily or even family / subfamily, and to have similar structures (at least at the active site) and similar activity (if they are enzymes, they are not necessarily active on the same substrate, but they have similar mechanisms of catalysis).

 

Carrying out a few BLAST searches is the best way to start to understand them, so we will do that under “Procedure”.

Purpose: The purpose of a BLAST search is to determine which other proteins have amino acid sequences similar to that of your protein of unknown function. Proteins with similar amino acid sequences are presumed to be related evolutionarily, structurally, and functionally. Thus if you uncover proteins with known functions, you may have a clue as to the function of your protein. The relationship may be a very close one, where the two proteins do exactly the same thing both in vitro and in vivo, or it may not be as close: the proteins may be in the same superfamily, having similar catalytic mechanisms on structurally similar substrates, but doing very different functions within the cell. Thus, the purpose of a BLAST search is not only to determine what other proteins have similar amino acid sequences, but also to find proteins of known activity to clue you in to what your protein might do chemically.

 

Experimental Design Considerations: Doing a BLAST search is seemingly simple on the surface, but how you design your search makes the difference between obtaining a ton of rather useless data or obtaining a manageable amount of really useful data. You’ll want to consider which database(s) you search, for which organism(s), and how you set “word” size and other search parameters. The practice examples below in the procedure will help you visualize this concept. The more you narrow your search, the more useful the information that you get back.  If you don’t narrow your search at all, you can easily get back a lot of hits that are just your protein (see first practice example). But if you do narrow your search, you will find true homologs of your protein, and if you find homologs whose functions have been determined, you will have an idea of what your protein might do. It all depends on how you set up your search parameters.

 

Supplies: A computer that is connected to the internet and the sequence of your protein of interest

 

Procedure:

 

For Practice Examples:

 

• Go to: http://www.ncbi.nlm.nih.gov

• In the drop-down menu of the search box, choose Protein

• In the text search box, type 1tum

• Press the Return key or click on Search

• Click on the 1tum link

• Highlight and copy the protein sequence at the bottom of the page that comes up

  (don’t worry about the numbers and spaces, BLAST knows how to ignore them)
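If you ever want to use the copied sequence outside of the BLAST web form, you may need to strip the numbers and spaces yourself. The short Python sketch below shows one way to do that; the function name clean_sequence is made up for this handout and is not part of any NCBI tool.

```python
def clean_sequence(raw: str) -> str:
    """Keep only the letters of a GenBank-style sequence block, in uppercase.

    Position numbers, spaces, and line breaks are all discarded.
    (clean_sequence is a hypothetical helper name used for illustration.)
    """
    return "".join(ch for ch in raw if ch.isalpha()).upper()


# The 1tum (E. coli MutT) sequence exactly as it is copied from the NCBI page.
genbank_block = """
        1 mkklqiavgi irnenneifi trraadahma nklefpggki emgetpeqav vrelqeevgi
       61 tpqhfslfek leyefpdrhi tlwfwlverw egepwgkegq pgewmslvgl naddfppane
      121 pviaklkrl
"""

mutt = clean_sequence(genbank_block)
print(len(mutt))   # 129 residues
print(mutt[:10])   # MKKLQIAVGI
```

The result is a single uninterrupted string of one-letter amino acid codes, which is the form most sequence tools expect.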

 

Practice Example 1:

 

• Go back to the main NCBI webpage

• Click on: BLAST (under the popular resources on the right side of the page)

• Click on Protein BLAST (protein – protein)

• Under Enter Query Sequence: Paste the sequence that you copied above into the search box:

 

        1 mkklqiavgi irnenneifi trraadahma nklefpggki emgetpeqav vrelqeevgi
       61 tpqhfslfek leyefpdrhi tlwfwlverw egepwgkegq pgewmslvgl naddfppane
      121 pviaklkrl

 

• Under Choose Search Set, Database: Choose Non-redundant protein sequences (nr)

• Under Program Selection, Algorithm: Choose blastp (protein-protein BLAST)

• Click the box next to “Show Results in New Window”

• Click the button labeled BLAST

 

You will see that the long list of the first “hits” is the protein you searched for: MutT from E. coli. That of course isn’t very useful other than telling you that it is in the Nudix hydrolase superfamily. But there are ways to significantly narrow your search and make it much more meaningful. The next example starts to do that.

 

OK, let’s try this again, but this time we will narrow our search.

 

Practice Example 2:

 

• Go back to the main NCBI webpage

• Click on: BLAST

• Click on Protein BLAST (protein – protein)

• Under Enter Query Sequence: Paste the sequence that you copied above into the search box:

 

        1 mkklqiavgi irnenneifi trraadahma nklefpggki emgetpeqav vrelqeevgi
       61 tpqhfslfek leyefpdrhi tlwfwlverw egepwgkegq pgewmslvgl naddfppane
      121 pviaklkrl

 

• Under Choose Search Set, Database, Choose: Non-redundant protein sequences (nr)

• Under Organism, Type: Mycobacterium tuberculosis H37Rv

• Under Program Selection, Algorithm: Choose blastp (protein-protein BLAST)

• Click on Algorithm Parameters to expand it.

• Under Word Size, Change it from 6 to 2 (6 does not give any hits, but 2 will).

• Click the box next to “Show Results in New Window”

• Click the button labeled BLAST
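To get a feel for why word size matters so much for distantly related proteins, it can help to count how many short “words” two sequences have in common. The Python sketch below is only an illustration: the two 10-residue sequences are invented, the function names are hypothetical, and it counts exact matches only, whereas the real blastp program also accepts similar words scored with a substitution matrix.

```python
def words(sequence: str, k: int) -> set:
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_words(a: str, b: str, k: int) -> set:
    """Return the words of length k that occur in both sequences."""
    return words(a, k) & words(b, k)


# Two invented sequences that differ at 2 of 10 positions (not real proteins).
a = "ACDEFGHIKL"
b = "ACDQFGHIRL"

for k in (2, 3, 6):
    print(k, len(shared_words(a, b, k)))
# 2 5
# 3 3
# 6 0
```

Even with only two substitutions, no exact 6-letter word survives, while several 2-letter words do. Short words therefore give a divergent homolog more chances to be picked up, at the cost of more chance matches to sort through.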

 

This time you are comparing an E. coli protein to proteins within M. tuberculosis, so the search is more meaningful. In this way, you are searching for and finding Nudix hydrolases within M. tuberculosis. If you know the Nudix signature sequence GX5EX7REUXEEXGU (where X is any amino acid and U is I, L, or V), you can assess which of these hits is a Nudix hydrolase. Typically the first hits are going to be the enzymes that fit into the superfamily. This search is also an excellent example of “overinterpretation” of data. Many of the hits are annotated as “DNA mismatch repair protein MutT” or “8-oxo-dGTP diphosphatase”. However, it has been shown that most Nudix hydrolases do not have this function, and it would be highly improbable for an organism to contain more than one or two enzymes doing exactly the same thing. This example really illustrates why characterizing enzymatic activity in the lab is so essential.
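Checking each hit for the signature sequence by eye gets tedious, so here is a minimal Python sketch that translates GX5EX7REUXEEXGU into a regular expression and looks for it in a sequence. The function name find_nudix_box is hypothetical, and the pattern is a strict reading of the signature; some genuine Nudix hydrolases deviate from it slightly, so treat a miss as a prompt to look closer rather than as proof.

```python
import re

# G X5 E X7 R E U X E E X G U, where X is any residue and U is I, L, or V.
NUDIX_BOX = re.compile(r"G.{5}E.{7}RE[ILV].EE.G[ILV]")


def find_nudix_box(sequence: str):
    """Return (1-based start position, matched text) for the first match, or None."""
    match = NUDIX_BOX.search(sequence.upper())
    if match is None:
        return None
    return match.start() + 1, match.group()


# E. coli MutT (1tum), the query sequence used in the practice examples.
mutt = (
    "MKKLQIAVGIIRNENNEIFITRRAADAHMANKLEFPGGKIEMGETPEQAVVRELQEEVGI"
    "TPQHFSLFEKLEYEFPDRHITLWFWLVERWEGEPWGKEGQPGEWMSLVGLNADDFPPANE"
    "PVIAKLKRL"
)

print(find_nudix_box(mutt))  # (38, 'GKIEMGETPEQAVVRELQEEVGI')
```

For MutT the box starts at Gly38 and ends at Ile60. You can paste in the sequence of any hit in place of mutt to test it the same way.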

 

Practice Example 3:

 

• Go back to the main NCBI webpage

• Click on: BLAST

• Click on Protein BLAST (protein – protein)

• Under Enter Query Sequence: Paste the sequence that you copied above into the search box:

 

        1 mkklqiavgi irnenneifi trraadahma nklefpggki emgetpeqav vrelqeevgi
       61 tpqhfslfek leyefpdrhi tlwfwlverw egepwgkegq pgewmslvgl naddfppane
      121 pviaklkrl

 

• Under Choose Search Set, Database, Choose: Protein Databank proteins (pdb)

• Under Program Selection, Algorithm: Choose blastp (protein-protein BLAST)

• Click on Algorithm Parameters to expand it.

• Under Word Size, Change it back to 6.

• Click the box next to “Show Results in New Window”

• Click the button labeled BLAST

 

This time you are comparing MutT from E. coli to other proteins that have had their structures determined. If these proteins have also been characterized for enzymatic activity, that makes the hit more useful. Which of these are Nudix hydrolases (containing the signature sequence GX5EX7REUXEEXGU)? How many of them are not part of the Nudix hydrolase superfamily? Are the first hits more or less likely to be Nudix hydrolases? Are the last hits more or less likely to be Nudix hydrolases? What happens if you repeat the search with a word size of 2? What happens if you change the Expect threshold to a number >10? <10? Which of the hits have been characterized enzymatically? How can you tell? Which are given tentative functions based on similarity to other proteins? (Hint: Can you identify a published journal article on the characterization of the protein in question?)

 

Important note: If the search fails, as it often seems to do, don’t try to analyze why; just try it again until it works. I’ve found that it often fails for no apparent reason and then finally works. Patience.

Executing a BLAST Search on Your Protein of Interest / Your Protein of Unknown Function:

 

• Go to: http://www.ncbi.nlm.nih.gov

• In the drop-down menu of the search box, choose Protein

• In the text search box, type in your PDB code

• Press the Return key or click on Search

• Click on the link

• Highlight and copy the protein sequence at the bottom of the page that comes up

  (don’t worry about the numbers and spaces, BLAST knows how to ignore them)

 

• Go back to the main NCBI webpage

• Click on: BLAST

• Click on Protein BLAST (protein – protein) (You can also try other searches such as PSI-BLAST)

• Under Enter Query Sequence: Paste the sequence that you copied above into the search box:

• Under Choose Search Set, Database: Choose a database (nr, pdb, etc)

• Optional: Under Organism, Choose a specific organism if you wish

• Under Program Selection, Algorithm: Choose blastp (protein-protein BLAST)

• Click on Algorithm Parameters to expand it.

• Optional: Play with Word Size and Expect Threshold. Suggestion: use default parameters to start.

• Click the box next to “Show Results in New Window”

• Click the button labeled BLAST
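The choices you make in the web form (database, organism, word size, Expect threshold) can also be written down as a set of request parameters, which is a handy way to record exactly how a search was set up. The Python sketch below only builds and prints such a parameter string; it does not contact NCBI. The parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, WORD_SIZE, EXPECT) follow my reading of NCBI’s BLAST URL API and should be checked against the current documentation before use. The function name build_blastp_request is hypothetical.

```python
from urllib.parse import urlencode

# Address of the NCBI BLAST URL API (assumed here; confirm in the NCBI docs).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, database="nr", organism=None,
                         word_size=None, expect=None):
    """Encode a blastp search set-up as a URL parameter string (nothing is sent)."""
    params = {
        "CMD": "Put",            # submit a new search
        "PROGRAM": "blastp",     # protein-protein BLAST
        "DATABASE": database,    # e.g. nr or pdb
        "QUERY": query,
    }
    if organism is not None:
        # Restrict the search set to one organism using an Entrez query.
        params["ENTREZ_QUERY"] = f"{organism}[Organism]"
    if word_size is not None:
        params["WORD_SIZE"] = str(word_size)
    if expect is not None:
        params["EXPECT"] = str(expect)
    return urlencode(params)


# Set-up equivalent to Practice Example 2 (query shortened here for display).
body = build_blastp_request(
    "MKKLQIAVGIIRNENNEIFI",
    organism="Mycobacterium tuberculosis H37Rv",
    word_size=2,
)
print(BLAST_URL)
print(body)
```

Leaving word_size and expect unset corresponds to the suggestion above to start with the default parameters.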

 

Interpreting Your BLAST Search Results

 

• What does the color of the bar indicate?

• What does the length of the bar indicate?

• What does the E value (Expect value) tell you? Do larger or smaller numbers indicate better matches?

• What protein (super)family does your protein fall into? Can you tell from BLAST alone, or do you need other tools such as Pfam, DALI, or ProMOL as well to determine this?

• Which of the hits have had their function experimentally determined? And which are simply predicted functions (such as what you are trying to do here)?

• Which search databases and parameters did you find most useful and why?
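When thinking about the E value question above, it may help to see how the number is related to the alignment score. Under the standard Karlin-Altschul statistics that BLAST is based on, the expected number of chance alignments with a bit score of at least S' is roughly E = m × n × 2^(−S'), where m is the query length and n is the total length of the database. The Python sketch below applies that formula; the function name is hypothetical and the database size is an invented round number. Real BLAST output uses corrected “effective” lengths, so do not expect these numbers to match a results page exactly.

```python
def expect_value(bit_score: float, query_length: float, database_length: float) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# A 129-residue query (like MutT) against an invented database of 1e9 residues.
for score in (30, 40, 60):
    print(score, expect_value(score, 129, 1e9))
```

Each additional bit of score halves the E value, and searching a larger database raises the E value for the same alignment.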

 

Clean-up: This is an in silico exercise so there should be nothing to clean up.

 

Interpreting Results: Remember that the main goal of BLAST is to help you determine the potential function of your protein. To this end, the most useful “results” from BLAST will be:

  1. any proteins for which function has been experimentally determined
  2. the function of the “hits”
  3. “hits” that have similar active site amino acids (see ProMOL for this)
  4. “hits” that fall in the (super)families that were determined with Pfam

[Compare results with Pfam, DALI, and ProMOL. Compile/summarize useful information.]

 

References:

 

  1. Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. Journal of Molecular Biology 215, 403–410.

 

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.