This document precisely describes the OrthoMCL algorithm as implemented by OrthoMCL-DB (and available for download on that site). The paper describing OrthoMCL is: Li Li, Christian J. Stoeckert, Jr., and David S. Roos OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes Genome Res. 2003 13: 2178-2189.
OrthoMCL groups proteins into "ortholog groups." That name is a little misleading because the groups contain proteins related by:
The computation is done on protein sequence because it is more sensitive than genomic sequence. But by definition the relationships it is attempting to discover exist in genome space. A complication of using proteins is that proteomes may contain alternative proteins (from alternative transcription). As discussed in Phase 1 below, this can be addressed by filtering, or it can be lived-with.
The algorithm here assumes that the protein sequences for the genomes have been acquired.
There are five phases:
Obtain a set of quality protein sequences by
(This phase is handled by the orthomclFilterFasta program, though it does not yet support filtering away alternative proteins.)
Higher eukaryotes commonly have genes with alternative splice forms, leading to multiple proteins per gene. These "alternative proteins" can be either included or excluded in the input data (step 2 above).
Pros/cons to filtering away alternative proteins:
All-versus-all BLAST
Do all-versus-all BLAST of quality sequences using NCBI BLAST with the following arguments:
Incrementally add a genome
Once a large all-v-all BLAST has been completed, we may need to "incrementally" add a new genome G. To do so:
Find minimum E-value
The normalization step in phase 4 below requires non-zero E-value exponents. However, BLAST uses a zero exponent for "perfect" E-value scores of zero. We must replace zero E-value exponents with an "underflow" or "minimum" value. This is computed by searching the entire blast result, looking for the smallest E-value exponent. We use that value minus one as the minimum. For example, if the smallest E-value exponent is -180, we replace 0 exponents with -181.
(The orthomclPairs program automatically finds the minimum E-value.)
When using NCBI BLAST, we must compute a percent match length.
Percent identity is taken from the best HSP per hit.
(This phase is handled by the orthomclBlastParser program.)
Phase 4 finds potential inparalog, ortholog and co-ortholog reciprocal relationships between proteins.
(This phase is handled by the orthomclPairs program.)
(In phase 5 these pairs are passed to the MCL program to be clustered into ortholog groups.)
Another name for these pairs, in the case of potential orthologs, is reciprocal best hits. In the case of potential inparalogs, the alternative name is reciprocal better hits (better than any hits to any other species).
In the sections below, each type of pair is described formally and informally.
The following terms are used in the formal descriptions
Term | Definition |
Ax | a protein x in taxon A |
H(Ax,Ay) | a BLAST hit between proteins Ax and Ay |
H(Ax,Ay) > H(Bx,By) | the E-value of H(Ax,Ay) is better than the E-value of H(Bx,By) OR they are both zero |
Cutoff | BLAST cutoff: E-value 1e-5 and percent match length 50% |
Underflow | |
QiH() | qualifying in-species hit (single way better hit) |
QoH() | qualifying out-species hit (single way best hit) |
IP() | potential inparalog pair (reciprocal better hit) |
O() | potential ortholog pair (reciprocal best hit) |
CO() | potential co-ortholog (reciprocal best hit) |
&& | and |
~ | not |
!= | not equal |
iff | if and only if |
Find potential inparalog pairs
Informally: Find all pairs of proteins within a species that have mutual hits that are better or equal to all of those proteins' hits to proteins in other species.
Formally:
Each IP(Ax,Ay) is given a pair weight:
Find potential ortholog pairs
Informally: Find all pairs of proteins across two species that have hits as good as or better than any other hits between these proteins and other proteins in those species.
Formally:
Each O(Ax,By) is given a pair weight:
Find potential co-ortholog pairs
Informally: Find all pairs of proteins across two species that are connected through orthology and inparology.
Formally (note the ~ symbol means "not"):
Each CO(Ax,By) is given a pair weight:
Normalize the weights of all inparalog pairs
Because the weights of inparalog pairs were computed exclusively using BLAST scores within one genome, they must be normalized against all the other genomes so that the weights passed to the MCL algorithm are comparable to each other.
Informally: Average all qualifying inparalog pairs in a genome and divide each pair by the average. Within a genome, inparalog pairs qualify if either of the proteins in the pair has an ortholog in any genome. If no inparalogs within a genome have any orthologs, all inparalogs in that genome qualify.
Formally:
Normalize the weights of all ortholog and co-ortholog pairs
for any two species, average the weights across them, and normalize using that average
Input to MCL is the Orthomcl Graph (each inparalog, ortholog, co-ortholog pair with its weight).
MCL has only one parameter, inflation value. We use 1.5
(should explore using latest version of MCL)