Title: How to use DALI
Techniques to Master:
Differentiate between alignments by sequence, small motifs and full backbones
Background: Before we begin, there is a relationship that you absolutely must understand: primary sequence determines structure and structure determines function. Thanks to the transitive property, you could also say that primary sequence is responsible for function. There are caveats in these statements: (1) what if two sequences are not identical, (2) how different can two sequences be before the structure changes, (3) how different can two structures be before the function changes, and (4) what about changes in global structure versus changes in local structure?
In this project, you know the first two (primary sequence and structure) and are looking for clues to help you identify the function. It’s important to make the distinction between sequence, structure, and function here because some of the databases and servers that you will be using are sequence based (e.g., UniProt, Pfam, and BLAST) while others (Structural Biology Knowledgebase, ProMOL, and DALI) are structure based. To complicate matters even more, you must also understand the difference between global alignments (trying to align the entire protein; DALI) and local alignments (trying to align only a small part of the protein; ProMOL) and be aware of which one is being used.
As you run all of these tools, servers, databases, and websites, it is important to constantly remind yourself that you are the expert. I know that sounds scary, but forgetting that small fact can have disastrous consequences. If you assign the wrong function to your protein, who will suffer the consequences? The DALI server? Or you? Your goal is to take the information from various sources and combine it into a hypothesis that you will test in the wetlab. As you develop your hypothesis based on the results that you obtain from these bioinformatics tools, remember that these tools give you “clues” or “indications” about the possible function of your protein. They do not give you definitive answers on the absolute function of your protein. Also try to remember that you can use these tools to disprove a hypothesis - and that information can be just as valuable.
A. Structural alignment vs. sequence alignment
It is important to understand how programs like BLAST and DALI differ. Both of them align proteins, but the method of alignment is very different. BLAST compares only sequences (also called amino acid sequences, primary structures, or protein sequences), whereas DALI is designed to compare the three-dimensional structures of proteins (Figure 1). DALI is arguably more useful because it aligns structures rather than sequences, but it is also more limited: far more protein sequences are known (about 64 million according to UniProt) than protein structures (about 120,000 according to the PDB).
Figure 1: An example of a sequence alignment from BLAST (top) compared with a structural alignment from DALI (bottom).
B. The DALI Server
By now, you should have an understanding that DALI searches the PDB for proteins that are structurally similar to your query protein. An important consequence is that your results are limited to a subset of protein structures that have been deposited in the PDB. If DALI does not produce any useful information, it does not mean that your query protein has no structural homologs. Instead, it signifies that there are no publicly available structures that are related to your query protein.
Without diving too deeply into the algorithms or specifics of the server, there are some aspects of a DALI search that are important to understand. Please be aware that this process is limited to global alignments of multiple protein backbones.
First, DALI reduces your query structure to only the Cα atoms. You may have submitted a file containing the 3D coordinates of every atom, but the comparison uses only the Cα positions. I suspect that this is done simply to save computational resources; dealing with fewer atoms makes the calculations faster and saves space. This matters because you are trying to hypothesize the potential enzymatic function of the protein that you selected, and this enzymatic function is determined by the positions of the side chain atoms in the active site identified by ProMOL. These side chain atoms that make your enzyme functional are exactly the ones that DALI discarded. If you need to compare the conformation, identity, or conservation of the side chain atoms in your query/hit structures, you should use your favorite molecular visualization software (e.g., PyMOL or Chimera) to superimpose the same regions that DALI identified.
Second, many of the proteins that we are dealing with have artificial amino acid sequences that are used for purification. These purification tags are unstructured and not part of the native protein. However, DALI does not know that and will try to incorporate these segments into the fit. Consequently, the quality of your fit may be lower than expected. Manually editing the PDB file to remove the coordinates of the purification tag is the only solution.
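Both points above can be sketched in a few lines of Python. This is only an illustration and not DALI's own code. It assumes standard fixed-column PDB ATOM records; the function names (strip_residues, extract_ca_atoms), the helper make_atom_line, and the sample coordinates and tag numbering (residues -1 to 0 of chain A) are all invented for the example. The first function removes the tag residues, and the second shows how little of the structure is left once only the Cα atoms are kept.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column PDB ATOM record (used here only to create sample data)."""
    return ("ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}"
            .format(serial, name, res_name, chain, res_seq, x, y, z))


def strip_residues(pdb_lines, chain_id, first_res, last_res):
    """Drop coordinate records for residues first_res..last_res (inclusive) of one chain."""
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM", "ANISOU")):
            # PDB columns 22 (chain) and 23-26 (residue number), zero-based slices.
            in_chain = line[21] == chain_id
            res_seq = int(line[22:26])
            if in_chain and first_res <= res_seq <= last_res:
                continue  # this atom belongs to the tag; skip it
        kept.append(line)
    return kept


def extract_ca_atoms(pdb_lines):
    """Return (chain, residue number, residue name, x, y, z) for each C-alpha atom."""
    ca_atoms = []
    for line in pdb_lines:
        # HETATM records are skipped so that calcium ions (also named CA) are ignored.
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        alt_loc = line[16]
        # Keep only the first alternate location, if any are present.
        if atom_name != "CA" or alt_loc not in (" ", "A"):
            continue
        ca_atoms.append((line[21], int(line[22:26]), line[17:20].strip(),
                         float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return ca_atoms


# Invented sample: a two-residue tag (residues -1 and 0) followed by two native residues.
sample = [
    make_atom_line(1, " N  ", "HIS", "A", -1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA ", "HIS", "A", -1, 1.5, 0.0, 0.0),
    make_atom_line(3, " CA ", "HIS", "A", 0, 5.3, 0.0, 0.0),
    make_atom_line(4, " N  ", "MET", "A", 1, 8.0, 1.0, 0.0),
    make_atom_line(5, " CA ", "MET", "A", 1, 9.1, 1.0, 0.0),
    make_atom_line(6, " CB ", "MET", "A", 1, 9.6, 2.4, 0.0),
    make_atom_line(7, " CA ", "GLY", "A", 2, 12.9, 1.0, 0.0),
]

trimmed = strip_residues(sample, "A", -1, 0)
ca_only = extract_ca_atoms(trimmed)
for chain, res_seq, res_name, x, y, z in ca_only:
    print(chain, res_seq, res_name, x, y, z)
```

To use this on a real file, read it with open(...).read().splitlines(), check the tag's residue numbers in a molecular viewer first, and write the trimmed lines to a new file so that the original deposit stays untouched. Atom serial numbers are not renumbered here.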
Purpose: By now, you have chosen a protein of unknown function and the important research question that you are trying to answer is: “What does my protein do?” A more scientific way to ask that question is: “What reaction does this enzyme catalyze, what substrate is needed, what cofactors are required, and what is the product?” In order to solve these questions, you need to incorporate many different types of information from multiple sources. These bioinformatics tools will give you an idea of potential enzyme functions which you can use to build a hypothesis which you then test.
As a general rule of thumb, proteins that have similar structures also have similar functions - and this is where DALI excels. DALI will search the Protein Data Bank (www.pdb.org) for proteins that have global structures that are similar to your query protein. Again, if you can identify a structurally related protein with a known function, then you have very strong evidence pointing you towards a potential function.
Experimental Design Considerations:
If you need to run a large number of alignments, have access to a Linux computer, and feel comfortable installing and running Linux software from the command line, you can download the DALI package and run it locally.
You will need an internet-connected computer and the four-character PDB code of your protein (e.g., “4EZI”).
I have created a video tutorial that closely follows the procedure described below. You can find this video here (https://youtu.be/hwKN-ZGTFIg) and follow along.
There are three versions of the DALI website: (1) DALI, (2) DALI pairwise, and (3) DALI database. All three versions can be accessed from the website listed below. Of these three, the first is the default and is likely the only one that you will need.
Clean-up: This is an in silico exercise so there should be nothing to clean up.
Please keep in mind that DALI results are stored TEMPORARILY (7 days) on the server and they will almost certainly disappear before the end of the semester when you are scrambling to write final reports. Because a DALI run can take hours or days, I suggest that you click on the “Parsable data” link at the top of the page and save all of the text that is displayed.
First, here is a brief explanation of how the results are formatted. After clicking on the link that DALI sends you, you will see a table that summarizes the results of your search (Figure 2).
Figure 2. Example output from DALI. The columns are (from left to right): (1) the number of the hit, which corresponds to the alignment that is shown at the bottom of the results; clicking on the link will show the alignment, (2) the PDB and chain identifiers of the hit matching your query; clicking this link will start a new DALI search for that particular PDB identifier, (3) the Z-score statistic, which is a summary metric describing the quality of the alignment (higher is better), (4) the root mean square deviation (rmsd, in Å) between your query structure and the hit, (5) the length of the alignment (lali), which reports how many amino acids were considered in the alignment between your query and the hit; ideally, this number should be close to “nres” in the following column, (6) the number of residues (nres), which is simply the number of amino acids in the hit structure, (7) %id, which reports the sequence identity between the two structures, (8) a PDB link; clicking on it will display a modified version of the original PDB file that contains the rotation/translation matrix used to superimpose the two structures, and (9) a text description of the molecule.
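If you saved the plain-text results, the columns described above can be read into named fields with a short script. The sketch below is hedged: it assumes that each hit row looks like "No: Chain Z rmsd lali nres %id Description" separated by whitespace, which you should check against your own saved file. The function name parse_summary_row and the two sample rows (with placeholder identifiers 1abc-A and 2xyz-B) are invented for illustration and are not real DALI output.

```python
import re

# Assumed row layout: "No:  Chain  Z  rmsd  lali  nres  %id  Description".
ROW_PATTERN = re.compile(
    r"^\s*(\d+):\s+(\S+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)$"
)


def parse_summary_row(line):
    """Return a dict for one hit row, or None if the line is not a hit row."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    number, chain, z, rmsd, lali, nres, pct_id, description = match.groups()
    return {
        "no": int(number),
        "chain": chain,          # PDB code plus chain identifier
        "z": float(z),           # Z-score (higher is better)
        "rmsd": float(rmsd),     # in angstroms
        "lali": int(lali),       # number of aligned residues
        "nres": int(nres),       # number of residues in the hit
        "pct_id": int(pct_id),   # percent sequence identity
        "description": description.strip(),
    }


# Invented sample text; header lines starting with "#" are skipped automatically.
saved_text = """\
# No:  Chain   Z    rmsd lali nres  %id Description
   1:  1abc-A 25.3  1.8  240   260   35 MOLECULE: EXAMPLE HYDROLASE;
   2:  2xyz-B  3.1  4.2   60   410    9 MOLECULE: EXAMPLE KINASE;
"""

hits = [row for row in map(parse_summary_row, saved_text.splitlines()) if row]
for hit in hits:
    print(hit["no"], hit["chain"], hit["z"], hit["lali"], hit["nres"])
```

Lines that do not match the assumed layout are silently skipped, so compare the number of parsed hits with the number shown on the results page before relying on the output.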
When I analyze my DALI results, I look at the Z-score first. You can safely assume that Z-scores ≤ 2.0 are structurally insignificant while Z-scores ≥ 4.0 should be examined more carefully. Higher Z-scores are indicative of more significant structural matches. I look at the range of Z-scores to decide which hits to ignore and which ones to pay attention to. After picking a subset of “interesting” hits, I quickly look at the length of the alignment (lali) to confirm that the alignments cover a significant portion of the structure. This is important because it is possible that two large, unrelated proteins could have helix-turn-helix motifs that are structurally very similar (i.e., a false positive) while the rest of the protein is structurally different. After confirming that DALI has returned quality alignments, I look at the last column to assess what types of proteins my query structure matches. In Figure 2, I used the structure of 5A1A (a known β-galactosidase) as an example. DALI found nearly 1,000 structural homologs and the first ~250 were highly conserved β-galactosidases. In this example, the evidence is overwhelming and the conclusion is clear: 5A1A is a β-galactosidase. Your results are unlikely to be as clear or definitive.
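The first two checks described above (Z-score, then alignment coverage) can be sketched as a small filter. The Z-score cutoff of 4.0 comes from the text; the 0.5 coverage cutoff (lali divided by nres) is purely an assumption chosen for illustration, and the function name triage_hits and the hit records are invented. Treat the output as a shortlist to inspect by eye, not as a verdict.

```python
def triage_hits(hits, z_cutoff=4.0, min_coverage=0.5):
    """Keep hits with a high Z-score whose alignment covers enough of the hit structure."""
    kept = []
    for hit in hits:
        # Fraction of the hit structure that was actually aligned.
        coverage = hit["lali"] / hit["nres"]
        if hit["z"] >= z_cutoff and coverage >= min_coverage:
            kept.append({**hit, "coverage": round(coverage, 2)})
    # Most significant structural matches first.
    return sorted(kept, key=lambda h: h["z"], reverse=True)


# Invented hit records using the column names from Figure 2.
example_hits = [
    {"chain": "3def-A", "z": 8.4, "lali": 180, "nres": 300},
    {"chain": "1abc-A", "z": 25.3, "lali": 240, "nres": 260},
    {"chain": "4ghi-C", "z": 6.0, "lali": 40, "nres": 400},   # small shared motif only
    {"chain": "2xyz-B", "z": 1.5, "lali": 200, "nres": 210},  # Z-score too low
]

shortlist = triage_hits(example_hits)
for hit in shortlist:
    print(hit["chain"], hit["z"], hit["coverage"])
```

The third record shows the false-positive case described above: its Z-score passes the cutoff, but only 40 of 400 residues are aligned, so it is dropped. Loosen or tighten min_coverage to suit your protein, for example when you expect a multi-domain hit to match only one domain of your query.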
Your best strategy to analyze your DALI results is to first confirm that you have quality alignments (Z-score and length of alignment). Then start at the top of the list and go through the following steps:
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.