OrthoMCL Algorithm

This document precisely describes the OrthoMCL algorithm as implemented by OrthoMCL-DB (and available for download on that site).  The paper describing OrthoMCL is: Li Li, Christian J. Stoeckert, Jr., and David S. Roos OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes Genome Res. 2003 13: 2178-2189.

Overview

OrthoMCL groups proteins into "ortholog groups."  That name is a little misleading because the groups contain proteins related by:

The computation is done on protein sequence because it is more sensitive than genomic sequence.  But by definition the relationships it is attempting to discover exist in genome space.  A complication of using proteins is that proteomes may contain alternative proteins (from alternative transcription).  As discussed in Phase 1 below, this can be addressed by filtering, or it can be lived-with.

The algorithm here assumes that the protein sequences for the genomes have been acquired.

There are five phases:

  1. Sequence filtering
  2. All-versus-all BLAST
  3. Computing percent match length for each blast hit
  4. Finding potential inparalog, ortholog and co-ortholog pairs
  5. Using the MCL program to cluster the pairs into groups

Phase 1 - Sequence filtering

Obtain a set of quality protein sequences by

  1. rejecting low quality sequences
  1. shorter than x amino acids (default=10)
  2. more than x% stop-codons (default=20)
  3. more than x% non-standard amino acids (default=20)
  1. optionally filtering away alternative proteins, keeping only the longest protein per gene
  1. only for higher eukaryotes
  2. requires a protein to gene mapping
  1. rejecting an entire genome
  1. a genome that has too many rejected sequences should be considered for rejection

(This phase is handled by the orthomclFilterFasta program, though it does not yet support filtering away alternative proteins.)

Alternative proteins

Higher eukaryotes commonly have genes with alternative splice forms, leading to multiple proteins per gene.  These "alternative proteins" can be either included or excluded in the input data (step 2 above).

Pros/cons to filtering away alternative proteins:

Phase 2 - BLAST

All-versus-all BLAST

Do all-versus-all BLAST of quality sequences using NCBI BLAST with the following arguments:

Incrementally add a genome

Once a large all-v-all BLAST has been completed, we may need to "incrementally" add a new genome G.  To do so:

Find minimum E-value

The normalization step in phase 4 below requires non-zero E-value exponents.   However, BLAST uses a zero exponent for "perfect" E-value scores of zero.  We must replace zero E-value exponents with an "underflow" or "minimum" value.  This is computed by searching the entire blast result, looking for the smallest E-value exponent.  We use that value minus one as the minimum.  For example, if the smallest E-value exponent is -180, we replace 0 exponents with -181. 

(The orthomclPairs program automatically finds the minimum E-value.)

Phase 3 - Compute percent match length

When using NCBI BLAST, we must compute a percent match length.

Percent identity is taken from the best HSP per hit.

(This phase is handled by the orthomclBlastParser program.)

Phase 4 - Find pairwise relationships

Phase 4 finds potential inparalog, ortholog and co-ortholog reciprocal relationships between proteins.

(This phase is handled by the orthomclPairs program.)

(In phase 5 these pairs are passed to the MCL program to be clustered into ortholog groups.)

Another name for these pairs, in the case of potential orthologs, is reciprocal best hits.  In the case of potential inparalogs, the alternative name is reciprocal better hits (better than any hits to any other species).

In the sections below, each type of pair is described formally and informally.

Terms

The following terms are used in the formal descriptions

Term

Definition

Ax

a protein x in taxon A

H(Ax,Ay)

a BLAST hit between proteins Ax and Ay

H(Ax,Ay) > H(Bx,By)

the E-value of H(Ax,Ay) is better than the E-value of H(Bx,By) OR they are both zero

Cutoff

BLAST cutoff: E-value 1e-5 and percent match length 50%

Underflow

QiH()

qualifying in-species hit (single way better hit)

QoH()

qualifying out-species hit (single way best hit)

IP()

potential inparalog pair (reciprocal better hit)

O()

potential ortholog pair (reciprocal best hit)

CO()

potential co-ortholog (reciprocal best hit)

&&

and

~

not

!=

not equal

iff

if and only if

Pairs

Find potential inparalog pairs

Informally: Find all pairs of proteins within a species that have mutual hits that are better or equal to all of those proteins' hits to proteins in other species.

Formally:

Each IP(Ax,Ay) is given a pair weight:

Find potential ortholog pairs

Informally: Find all pairs of proteins across two species that have hits as good as or better than any other hits between these proteins and other proteins in those species.

Formally:

Each O(Ax,By) is given a pair weight:

Find potential co-ortholog pairs

Informally: Find all pairs of proteins across two species that are connected through orthology and inparology.

Formally (note the ~ symbol means "not"):

Each CO(Ax,By) is given a pair weight:

Normalize the weights of all inparalog pairs

Because the weights of inparalog pairs were computed exclusively using BLAST scores within one genome, they must be normalized against all the other genomes so that the weights passed to the MCL algorithm are comparable to each other. 

Informally: Average all qualifying inparalog pairs in a genome and divide each pair by the average.  Within a genome, inparalog pairs qualify if either of the proteins in the pair has an ortholog in any genome.  If no inparalogs within a genome have any orthologs, all inparalogs in that genome qualify.

Formally:

Normalize the weights of all ortholog and co-ortholog pairs

for any two species, average the weights across them, and normalize using that average

Phase 5 - Cluster using MCL

Input to MCL is the Orthomcl Graph (each inparalog, ortholog, co-ortholog pair with its weight).

MCL has only one parameter, inflation value.  We use 1.5 

(should explore using latest version of MCL)