Title: How to use DALI

Techniques to Master:

  1. Submitting a protein structure to search for structural homologs
  2. Interpreting the results of a DALI search

Learning Objectives:

Differentiate between alignments by sequence, small motifs and full backbones

Background: Before we begin, there is a relationship that you absolutely must understand: primary sequence determines structure and structure determines function. Thanks to the transitive property, you could also say that primary sequence is responsible for function. There are caveats in these statements: (1) what if two sequences are not identical, (2) how different can two sequences be before the structure changes, (3) how different can two structures be before the function changes, and (4) what about changes in global structure versus changes in local structure?

In this project, you know the first two (primary sequence and structure) and are looking for clues to help you identify the function. It’s important to make the distinction between sequence, structure, and function here because some of the databases and servers that you will be using are sequence based (e.g., UniProt, Pfam, and BLAST) while others (Structural Biology Knowledgebase, ProMOL, and DALI) are structure based. To complicate matters even more, you must also understand the difference between global alignments (trying to align the entire protein; DALI) and local alignments (trying to align only a small part of the protein; ProMOL) and be aware of which one is being used.

As you run all of these tools, servers, databases, and websites, it is important to constantly remind yourself that you are the expert. I know that sounds scary, but forgetting that small fact can have disastrous consequences. If you assign the wrong function to your protein, who will suffer the consequences? The DALI server? Or you? Your goal is to take the information from various sources and combine it into a hypothesis that you will test in the wetlab. As you develop your hypothesis based on the results that you obtain from these bioinformatics tools, remember that these tools give you “clues” or “indications” about the possible function of your protein. They do not give you definitive answers on the absolute function of your protein. Also try to remember that you can use these tools to disprove a hypothesis - and that information can be just as valuable.

A. Structural alignment vs. sequence alignment

Before we begin, it is important to understand how programs like BLAST and DALI differ. Both of them align proteins, but the method of the alignment is very different. BLAST compares only sequences (also called amino acid sequences, primary structures, or protein sequences), while DALI compares the three-dimensional structures of proteins (Figure 1). DALI is arguably more useful because it aligns structures rather than sequences, but it is also more limited: far more protein sequences are known (about 64 million according to UniProt) than protein structures (about 120,000 according to the PDB).

Figure 1: An example of a sequence alignment from BLAST (top) compared with a structural alignment from DALI (bottom).

B. The Dali Server
By now, you should have an understanding that DALI searches the PDB for proteins that are structurally similar to your query protein. An important consequence is that your results are limited to a subset of protein structures that have been deposited in the PDB. If DALI does not produce any useful information, it does not mean that your query protein has no structural homologs. Instead, it signifies that there are no publicly available structures that are related to your query protein.

Without diving too deeply into the algorithms or specifics of the server, there are some aspects of a DALI search that are important to understand. Please be aware that DALI is limited to global alignments of protein backbones.

First, DALI simplifies your query structure down to only the Cα atoms. You may have submitted a file containing the 3D coordinates of every atom, but DALI keeps only one point per residue. I suspect that this is done simply to save computational resources; dealing with fewer atoms makes the calculations faster and saves space. This matters because you are trying to hypothesize on the potential enzymatic function of the protein that you selected, and this enzymatic function depends on the positions of the side chain atoms in the active site identified by ProMOL. These side chain atoms that make your enzyme functional are exactly the ones that DALI discarded. If you need to compare the conformation, identity, or conservation of the side chain atoms in your query/hit structures, you should use your favorite molecular visualization software (e.g., PyMOL or Chimera) to superimpose the same regions that DALI identified.
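
To get a feel for how much information is left after this simplification, here is a minimal Python sketch that keeps only the Cα records of one chain from PDB-format text. It relies on the fixed-column layout of ATOM records in the PDB format. The function name and the sample coordinates are made up for illustration, and this is not how DALI itself is implemented.

```python
def extract_ca_atoms(pdb_text, chain_id):
    """Return (residue name, residue number, (x, y, z)) for each C-alpha atom of one chain."""
    ca_atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skips HETATM records, so a calcium ion named "CA" is not picked up
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        if line[16] not in (" ", "A"):
            continue  # keep only the first alternate location of a residue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        ca_atoms.append((line[17:20], int(line[22:26]), xyz))
    return ca_atoms


# Hypothetical coordinates, typed in only to demonstrate the column layout.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 20.00           N
ATOM      2  CA  MET A   1      11.639   6.071  -5.147  1.00 20.00           C
ATOM      3  CB  MET A   1      12.500   7.300  -4.900  1.00 20.00           C
ATOM      4  CA  GLY A   2      13.200   3.100  -3.400  1.00 20.00           C
ATOM      5  CA  GLY B   2      40.000  40.000  40.000  1.00 20.00           C
"""

for res_name, res_number, xyz in extract_ca_atoms(SAMPLE_PDB, "A"):
    print(res_name, res_number, xyz)
```

Notice that the N and CB atoms of the methionine are gone. Everything you know about its side chain went with them.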

Second, many of the proteins that we are dealing with have artificial amino acid sequences that are used for purification. These purification tags are unstructured and not part of the native protein. However, DALI does not know that and will try to incorporate these segments into the fit. Consequently, the quality of your fit may be lower than expected. Manually editing the PDB file to remove the coordinates of the purification tag is the only solution.
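
If you would rather not delete the lines by hand, the edit can be scripted. The sketch below is one possible approach, assuming you already know which residue numbers belong to the tag (the range 1-6 on chain A is purely hypothetical here). It drops the ATOM, HETATM, and ANISOU records for those residues and leaves every other line untouched. It does not renumber atoms or update SEQRES records, so treat the output as a file for uploading to DALI, not as a cleaned-up PDB entry.

```python
def strip_residue_range(pdb_text, chain_id, first_res, last_res):
    """Drop coordinate records for residues first_res..last_res (inclusive) of one chain."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM", "ANISOU")):
            same_chain = line[21] == chain_id
            res_number = int(line[22:26])
            if same_chain and first_res <= res_number <= last_res:
                continue  # this record belongs to the tag, so skip it
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical structure: a His tag (residues 1-2 shown) ahead of the native Met 7.
TAGGED_PDB = """\
ATOM      1  CA  HIS A   1      10.000  10.000  10.000  1.00 20.00           C
ATOM      2  CA  HIS A   2      13.800  10.000  10.000  1.00 20.00           C
ATOM      3  CA  MET A   7      17.600  10.000  10.000  1.00 20.00           C
ATOM      4  CA  HIS B   1      50.000  50.000  50.000  1.00 20.00           C
END
"""

cleaned = strip_residue_range(TAGGED_PDB, "A", 1, 6)
print(cleaned)
```

Only chain A is edited. The histidine on chain B survives because the chain identifier is checked along with the residue number.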

Purpose: By now, you have chosen a protein of unknown function and the important research question that you are trying to answer is: “What does my protein do?” A more scientific way to ask that question is: “What reaction does this enzyme catalyze, what substrate is needed, what cofactors are required, and what is the product?” In order to answer these questions, you need to incorporate many different types of information from multiple sources. These bioinformatics tools will give you an idea of potential enzyme functions, which you can use to build a hypothesis that you then test.

As a general rule of thumb, proteins that have similar structures also have similar functions - and this is where DALI excels. DALI will search the Protein Data Bank (www.pdb.org) for proteins that have global structures that are similar to your query protein. Again, if you can identify a structurally related protein with a known function, then you have very strong evidence pointing you towards a potential function.

 

Experimental Design Considerations:

DALI is a simple, user-friendly bioinformatics tool that requires very little in terms of user input. Be aware that DALI is heavily used and is a computationally intensive search. As a consequence, it may take more than 24 hours for your query to finish. Otherwise, there are no significant considerations to keep in mind when submitting your query structure to the server. However, interpretation of the DALI output can be overwhelming for inexperienced users. After running DALI, you could consider formatting the output of (key) entries into a table that you include in your report.

If you need to do large numbers of alignments, have access to a Linux computer, and feel comfortable installing and executing Linux software from the command line, you can download the DALI package and run it locally.

Supplies:

You will need an internet-connected computer and the four-character PDB code of your protein (e.g., “4EZI”).

 

Procedure:

I have created a video tutorial that closely follows the procedure described below. You can find this video here (https://youtu.be/hwKN-ZGTFIg) and follow along.

There are three versions of the DALI website: (1) DALI, (2) DALI pairwise, and (3) DALI database. All three versions can be accessed from the website listed below. Of the three DALI algorithms mentioned above, the first one is the default and, likely, is the only one that you will need.

  1. Open the DALI website (http://ekhidna2.biocenter.helsinki.fi/dali/). Choose the “PDB search” tab. The “Pairwise” and “All against all” tabs are also useful if you have a limited set of structures that you want to compare.
  2. You have a choice of (1) uploading a file containing the atomic coordinates of your protein, or (2) entering the four-character PDB identifier of your protein. I recommend using the four-character code.
  3. When submitting your structure, you must also enter the “chain identifier”. The chain identifier is a single alphabetic character in the PDB file that is used to mark individual protein sequences. Some PDB structures contain more than one molecule and each of these molecules can be distinguished using the chain identifier. To illustrate this point, look at the PDB entry 1N0S. Go ahead, I’ll wait for you to come back. Notice anything? There are two FASTA sequences (“1N0S:A” and “1N0S:B”) and they are the same. In this case, you could enter either “1N0SA” or “1N0SB” into DALI and get the same answer. Consider a more complicated example. Look at the PDB entry 5J0N. This one is very complicated: there are 15 chains … and some of them contain DNA! Which chain identifier would you use? Well, it depends … First, never use a chain with DNA because DALI works only with proteins. Second, pick the chain representing the sequence that you are interested in. If there are multiple copies of the same protein (e.g. chains E-H in 5J0N), then it likely doesn’t matter which one you use for DALI.
     If you are not sure which chain identifier to use:
     a. Go to the PDB and open the page specific to your protein.
     b. Click on the “Display Files” button in the upper right corner of the page and select “FASTA sequence”. You will be presented with all of the sequences that are present in your structure and you are interested in lines that begin with >XXXX:A, where XXXX is your four-character PDB code and A is your chain identifier.
     c. There is a high probability that either (1) you have only one protein sequence, which means that your chain identifier is “A”, or (2) your complex contains more than one copy of the same protein, which means that you can use any of the chain identifiers.
  4. Enter a meaningful (to you) description in the job name field. Something like a combination of the PDB code and the current date might be useful.
  5. Enter your email address. This is highly recommended because DALI can take more than a day to run and you run the risk of losing the output if you do not enter an email address.
  6. Submit your job and wait for a link to the results to be emailed to you. NOTE: you are using the newest version of DALI in this tutorial and there is some question about whether it will actually send you a link with your results. To be safe, bookmark the page that you see after you submit your job.
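
For structures with many chains, sorting out the chain identifiers by eye gets tedious. The sketch below groups the chains of a FASTA file by identical sequence and flags chains that look like DNA. It assumes the “>XXXX:A” header style described above (your download may use a different header layout, in which case the parsing needs adjusting). The sequences are invented, XXXX is a placeholder code, and the nucleic acid check is only a rough heuristic based on the letters present.

```python
from collections import defaultdict


def chains_by_sequence(fasta_text):
    """Map each distinct sequence to the list of 'XXXX:A'-style headers that carry it."""
    groups = defaultdict(list)
    header, seq_parts = None, []
    # The trailing ">" is a sentinel that flushes the final record.
    for line in fasta_text.splitlines() + [">"]:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                groups["".join(seq_parts)].append(header)
            header, seq_parts = line[1:].strip(), []
        elif line:
            seq_parts.append(line)
    return dict(groups)


def looks_like_nucleic_acid(sequence):
    """Rough heuristic: the sequence uses nothing but nucleotide letters."""
    return set(sequence.upper()) <= set("ACGTUN")


def dali_identifier(header):
    """Turn 'XXXX:A' into the 'XXXXA' form that is typed into DALI."""
    return header.replace(":", "")


# Invented sequences under a placeholder PDB code.
EXAMPLE_FASTA = """\
>XXXX:A
MKTAYIAKQR
QISFVKSHFS
>XXXX:B
MKTAYIAKQRQISFVKSHFS
>XXXX:C
ATGCGTACGT
"""

for sequence, headers in chains_by_sequence(EXAMPLE_FASTA).items():
    if looks_like_nucleic_acid(sequence):
        print("skip (looks like DNA):", headers)
    else:
        print("same protein:", [dali_identifier(h) for h in headers])
```

Chains that land in the same group are copies of the same protein, so any one of them should do for the DALI search.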

 

Clean-up: This is an in silico exercise so there should be nothing to clean up.

Interpreting Results:

Please keep in mind that DALI results are stored TEMPORARILY (7 days) on the server and they will almost certainly disappear before the end of the semester when you are scrambling to write final reports. Because a DALI run can take hours or days, I suggest that you click on the “Parsable data” link at the top of the page and save all of the text that is displayed.

First, some basic explanation of how the results are formatted. After clicking on the link that DALI sends you, you will see a table that summarizes the results of your search (Figure 2).

Figure 2. Example output from DALI. The columns are (from left to right): (1) the number of the hit, which corresponds to the alignment that is shown at the bottom of the results; clicking on the link will show the alignment, (2) the PDB and chain identifiers of the hit matching your query; clicking this link will start a new DALI search for that particular PDB identifier, (3) the Z-score statistic, which is a summary metric describing the quality of the alignment (higher is better), (4) the root mean square deviation (rmsd, in Å) between your query structure and the hit, (5) the length of the alignment (lali), which reports how many amino acids were aligned between your query and the hit; ideally, this number should be close to “nres” in the following column, (6) the number of residues (nres), which is simply the number of amino acids in the hit structure, (7) %id, which reports the sequence identity between the two structures, (8) a PDB link; clicking on it will display a modified version of the original PDB file that contains the rotation/translation matrix used to superimpose the two structures, and (9) a text description of the molecule.
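
If the rmsd column feels abstract, it helps to see the arithmetic. The sketch below computes the root mean square deviation between two lists of paired coordinates, which is the standard definition: the square root of the average squared distance between matched points. It assumes the two sets are already superimposed and already paired up residue by residue. Finding that pairing and superposition is the hard part that DALI does for you, and this snippet makes no attempt at it. The coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


# Two invented C-alpha pairs, each 5 Å apart (a 3-4-5 triangle in the xy plane).
query = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
hit = [(3.0, 4.0, 0.0), (4.0, 5.0, 1.0)]
print(rmsd(query, hit))
```

Because the deviations are squared before averaging, a few badly fitting residues can inflate the rmsd, which is one reason to read it together with lali and the Z-score.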

When I analyze my DALI results, I look at the Z-score first. You can safely assume that Z-scores ≤ 2.0 are structurally insignificant while Z-scores ≥ 4.0 should be examined more carefully. Higher Z-scores are indicative of more significant structural matches. I look at the range of Z-scores to decide which hits to ignore and which ones to pay attention to. After picking a subset of “interesting” hits, I quickly look at the length of the alignment (lali) to confirm that the alignments cover a significant portion of the structure. This is important because it is possible that two large, unrelated proteins could have helix-turn-helix motifs that are structurally very similar (i.e., a false positive) while the rest of the protein is structurally different. After confirming that DALI has returned quality alignments, I look at the last column to assess what types of proteins my query structure matches. In Figure 2, I used the structure of 5A1A (a known β-galactosidase) as an example. DALI found nearly 1,000 structural homologs and the first ~250 were highly conserved β-galactosidases. In this example, the evidence is overwhelming and the conclusion is clear: 5A1A is a β-galactosidase. Your results are unlikely to be as clear or definitive.
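
The triage described above is easy to script once you have typed or pasted the summary table into records. The sketch below is one way to do it under some assumptions: the field names mirror the columns in Figure 2, the Z-score cutoff of 4.0 comes from the rule of thumb above, and the 50% coverage cutoff (lali divided by nres) is an arbitrary choice of mine that you should adjust. All of the hits are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    chain: str         # PDB code and chain of the hit
    z: float           # Z-score
    rmsd: float        # in Å
    lali: int          # number of aligned residues
    nres: int          # number of residues in the hit
    pct_id: int        # sequence identity (%)
    description: str


def triage(hits, z_min=4.0, min_coverage=0.5):
    """Keep hits with a high Z-score whose alignment covers enough of the hit."""
    keep = [h for h in hits if h.z >= z_min and h.lali / h.nres >= min_coverage]
    return sorted(keep, key=lambda h: h.z, reverse=True)


# Hypothetical hits, not taken from a real DALI run.
hits = [
    Hit("hitB-A", 6.3, 3.0, 60, 450, 12, "large unrelated protein"),
    Hit("hitD-A", 12.0, 2.4, 210, 280, 22, "putative hydrolase"),
    Hit("hitC-B", 1.8, 4.2, 150, 160, 9, "structurally insignificant"),
    Hit("hitA-A", 35.2, 1.1, 290, 300, 45, "hydrolase"),
]

for h in triage(hits):
    print(h.chain, h.z, round(h.lali / h.nres, 2), h.description)
```

The first hypothetical hit is the false-positive scenario from the paragraph above: a respectable Z-score, but only 60 of 450 residues aligned, so it is filtered out.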

Your best strategy to analyze your DALI results is to first confirm that you have quality alignments (Z-score and length of alignment). Then start at the top of the list and go through the following steps:

  1. Confirm that your active site residues (as determined with ProMOL) are conserved in your result by clicking on the number in column 1 to display the aligned sequences. If your active site residues are not conserved, a functional relationship between your query and the hit is unlikely and you can move to the next hit.
  2. If your active site is conserved, you should research the current hit: what does Pfam say about it; is there a publication listed in the PDB entry; can you find an EC number in the PDB or Catalytic Site Atlas (CSA); does the PDB entry contain a ligand; does UniProt contain information about cofactors, ligands, mutations, active sites, diseases, enzymatic assay results (Really! See 4DUV as an example: http://www.uniprot.org/uniprot/P00722), binding sites, related proteins, etc.?
  3. If many DALI hits that you look at say the same thing, then you have fairly convincing evidence for a specific function, reaction, substrate, etc.
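
For the last step, a quick tally can show whether your hits agree with each other. This sketch counts how often each description appears among the hits you kept. The descriptions are hypothetical, and the clean-up (trimming spaces, dropping a trailing semicolon, lowercasing) is an assumption about the kind of inconsistency you might see, so expect to tweak it for your own results.

```python
from collections import Counter


def tally_functions(descriptions):
    """Count how often each (lightly normalized) description appears, most common first."""
    normalized = (d.strip().rstrip(";").lower() for d in descriptions)
    return Counter(normalized).most_common()


# Hypothetical descriptions copied from the last column of a results table.
descriptions = [
    "beta-galactosidase",
    "Beta-Galactosidase;",
    "glycoside hydrolase",
    " BETA-GALACTOSIDASE ",
]

for name, count in tally_functions(descriptions):
    print(count, name)
```

A lopsided tally is encouraging, but it is still only a clue: differently worded descriptions of the same enzyme family will be counted separately, so read the list before trusting the numbers.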

References:

  1. Holm, L, and Rosenström, P (2010) Dali server: conservation mapping in 3D. Nucleic Acids Research. 38, W545-549. (doi: 10.1093/nar/gkq366)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.