OrthoMCL Algorithm

This document precisely describes the OrthoMCL algorithm as implemented by OrthoMCL-DB (and available for download on that site). The paper describing OrthoMCL is: Li Li, Christian J. Stoeckert, Jr., and David S. Roos OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes Genome Res. 2003 13: 2178-2189.

Overview

OrthoMCL groups proteins into "ortholog groups." That name is a little misleading because the groups contain proteins related by:

orthology (recent descent)

inparalology (recent duplication)

co-orthology (recent descent and duplication)

The computation is done on protein sequence because it is more sensitive than genomic sequence. But by definition the relationships it is attempting to discover exist in genome space. A complication of using proteins is that proteomes may contain alternative proteins (from alternative transcription). As discussed in Phase 1 below, this can be addressed by filtering, or it can be lived-with.

The algorithm here assumes that the protein sequences for the genomes have been acquired.

There are five phases:

Sequence filtering
All-versus-all BLAST
Computing percent match length for each blast hit
Finding potential inparalog, ortholog and co-ortholog pairs
Using the MCL program to cluster the pairs into groups

Phase 1 - Sequence filtering

Obtain a set of quality protein sequences by

rejecting low quality sequences

shorter than x amino acids (default=10)
more than x% stop-codons (default=20)
more than x% non-standard amino acids (default=20)

optionally filtering away alternative proteins, keeping only the longest protein per gene

only for higher eukaryotes
requires a protein to gene mapping

rejecting an entire genome

a genome that has too many rejected sequences should be considered for rejection

(This phase is handled by the orthomclFilterFasta program, though it does not yet support filtering away alternative proteins.)

Alternative proteins

Higher eukaryotes commonly have genes with alternative splice forms, leading to multiple proteins per gene. These "alternative proteins" can be either included or excluded in the input data (step 2 above).

Pros/cons to filtering away alternative proteins:

theoretical

pros

avoids "pseudo-inparalogs" which are alternative proteins within one group that therefore look like inparalogs. (if needed these could be distinguished using the gene-->protein mapping files)
group statistics (counts, coherence) are in terms of genes, and are not skewed by the presence of highly similar pseudo-inparalogs

a user writing in said it this way

in Arabidopsis, OrthoMCL includes proteins for genes that differ only in the 5' UTR, ie, that have identical sequence.
including redundant proteins "makes the numbers of ortholog given
for Arabidopsis mean nothing in terms of gene duplication events or
paralogs."

may avoid merging functionally distinct groups that could be joined by overlapping alternative genes with different functional domains (it is not clear if this is a pro or a con)

cons

may exclude important functional components of genes that are omitted in the longest variant
may exclude distinct proteins that are omitted in the longest variant
may merge functionally distinct groups that could be joined by overlapping alternative genes with different functional domains (it is not clear if this is a pro or a con)
relies on the accuracy of the annotation that produced the gene-->protein mapping

practical

pros

smaller input dataset, so requires less resources

cons

sometimes it is difficult to find a trustworthy source of gene-->protein mapping

for some genomes, these are available from:

GenBank/EntrezGene
Ensemble

these may not be the preferred source for protein sequences, so it may be a tradeoff between getting the preferred protein sequences and getting a gene-protein mapping

Phase 2 - BLAST

All-versus-all BLAST

Do all-versus-all BLAST of quality sequences using NCBI BLAST with the following arguments:

-F 'm S' -v 100000 -b 100000 -z protein_database_size -e 1e-5
Where:

-F 'm S': mask with Seg
-v 100000: a "don't care" value
-b 100000: a "don't care" value
-z protein_database_size: the number of proteins in the set. See "Incrementally add a genome" below.
-e 1e-5: recommended e-value cutoff

Incrementally add a genome

Once a large all-v-all BLAST has been completed, we may need to "incrementally" add a new genome G. To do so:

BLAST all-v-G
use the -z argument of BLAST to simulate the size of the all database, so that the statistics and scoring is compatible with the original all-v-all BLAST. To do so, use the same -z value as was used in the original BLAST.

Find minimum E-value

The normalization step in phase 4 below requires non-zero E-value exponents. However, BLAST uses a zero exponent for "perfect" E-value scores of zero. We must replace zero E-value exponents with an "underflow" or "minimum" value. This is computed by searching the entire blast result, looking for the smallest E-value exponent. We use that value minus one as the minimum. For example, if the smallest E-value exponent is -180, we replace 0 exponents with -181.

(The orthomclPairs program automatically finds the minimum E-value.)

Phase 3 - Compute percent match length

When using NCBI BLAST, we must compute a percent match length.

Select whichever is shorter, the query or subject sequence. Call that sequence S.
Count all amino acids in S that participate in any HSP.
Divide that count by the length of S and multiply by 100

Percent identity is taken from the best HSP per hit.

(This phase is handled by the orthomclBlastParser program.)

Phase 4 - Find pairwise relationships

Phase 4 finds potential inparalog, ortholog and co-ortholog reciprocal relationships between proteins.

(This phase is handled by the orthomclPairs program.)

(In phase 5 these pairs are passed to the MCL program to be clustered into ortholog groups.)

Another name for these pairs, in the case of potential orthologs, is reciprocal best hits. In the case of potential inparalogs, the alternative name is reciprocal better hits (better than any hits to any other species).

In the sections below, each type of pair is described formally and informally.

Terms

The following terms are used in the formal descriptions

Term	Definition
Ax	a protein x in taxon A
H(Ax,Ay)	a BLAST hit between proteins Ax and Ay
H(Ax,Ay) > H(Bx,By)	the E-value of H(Ax,Ay) is better than the E-value of H(Bx,By) OR they are both zero
Cutoff	BLAST cutoff: E-value 1e-5 and percent match length 50%
Underflow
QiH()	qualifying in-species hit (single way better hit)
QoH()	qualifying out-species hit (single way best hit)
IP()	potential inparalog pair (reciprocal better hit)
O()	potential ortholog pair (reciprocal best hit)
CO()	potential co-ortholog (reciprocal best hit)
&&	and
~	not
!=	not equal
iff	if and only if

Pairs

Find potential inparalog pairs

Informally: Find all pairs of proteins within a species that have mutual hits that are better or equal to all of those proteins' hits to proteins in other species.

Formally:

Input: BLAST results
QiH(Ax,Ay) iff:

Ax != Ay
H(Ax,Ay) > Cutoff
For all taxa T != A and protein n in T:

H(Ax,Ay) >= H(Ax,Tn)

IP(Ax,Ay) iff:

QiH(Ax,Ay) && QiH(Ay,Ax)

Each IP(Ax,Ay) is given a pair weight:

IP(Ax,Ay) = (-log10(evalue(Ax,Ay)) + -log10(evalue(Ay,Ax))) / 2
Underflow... (note, want to find out what praveen used somehow)

look for MAX_WEIGHT config item from config files

Find potential ortholog pairs

Informally: Find all pairs of proteins across two species that have hits as good as or better than any other hits between these proteins and other proteins in those species.

Formally:

Input: BLAST results
QoH(Ax,By) iff:

H(Ax,By) > Cutoff
For all n != y, H(Ax,By) >= H(Ax,Bn)

O(Ax,By) iff:

QoH(Ax,By) && QoH(By,Ax)

Each O(Ax,By) is given a pair weight:

O(Ax,By) = (-log10(evalue(Ax,By)) + -log10(evalue(By,Ax))) / 2

Find potential co-ortholog pairs

Informally: Find all pairs of proteins across two species that are connected through orthology and inparology.

Formally (note the ~ symbol means "not"):

Input: BLAST results, potential inparalog pairs and potential ortholog pairs
foreach Ax,Bx such that O(Ax,Bx)

CO(Ay,By) iff:

IP(Ax,Ay) && IP(Bx,By) && ~O(Ay,By)

H(Ay,By) >= Cutoff
H(By,Ay) >= Cutoff

CO(Ax,By) iff:

IP(Bx,By) && ~O(Ax,By)

H(Ax, By) >= Cutoff

H(By, Ax) >= Cutoff

CO(Bx,Ay) iff:

IP(Ax,Ay) && ~O(Ax,By)

H(Ay, Bx) >= Cutoff

H(Bx, Ay) >= Cutoff

Each CO(Ax,By) is given a pair weight:

O(Ax,By) = (-log10(evalue(Ax,By)) + -log10(evalue(By,Ax))) / 2

Normalize the weights of all inparalog pairs

Because the weights of inparalog pairs were computed exclusively using BLAST scores within one genome, they must be normalized against all the other genomes so that the weights passed to the MCL algorithm are comparable to each other.

Informally: Average all qualifying inparalog pairs in a genome and divide each pair by the average. Within a genome, inparalog pairs qualify if either of the proteins in the pair has an ortholog in any genome. If no inparalogs within a genome have any orthologs, all inparalogs in that genome qualify.

Formally:

For each genome A, find the set IPqA of all qualified inparalog pairs in A:

IPqA = all inparalog pairs IP(Ax,Ay) in taxon A where, for any T != A, O(Ax,Tn) or O(Ay,Tn)
If IPqA is empty, IPqA = all inparalog pairs IP(Ax,Ay) in A

Find the average weight IPqA_Avg of all pairs in IPqA
Normalize each pair weight IP(Ax,Ay):

IP(Ax,Ay) = IP(Ax,Ay) / IPqA_Avg

Normalize the weights of all ortholog and co-ortholog pairs

for any two species, average the weights across them, and normalize using that average

Phase 5 - Cluster using MCL

Input to MCL is the Orthomcl Graph (each inparalog, ortholog, co-ortholog pair with its weight).

MCL has only one parameter, inflation value. We use 1.5

(should explore using latest version of MCL)