www.AssignmentPoint.com

Computational Phylogenetics

Computational phylogenetics is the application of computational algorithms,

methods, and programs to phylogenetic analyses. The goal is to assemble a

phylogenetic tree representing a hypothesis about the evolutionary ancestry of a

set of genes, species, or other taxa. For example, these techniques have been

used to explore the family tree of hominid species and the relationships between

specific genes shared by many types of organisms. Traditional phylogenetics

relies on morphological data obtained by measuring and quantifying the

phenotypic properties of representative organisms, while the more recent field

of molecular phylogenetics uses the nucleotide sequences of genes or the amino

acid sequences of proteins as the basis for classification. Many forms of

molecular phylogenetics are closely related to and make extensive use of

sequence alignment in constructing and refining phylogenetic trees, which are

used to classify the evolutionary relationships between homologous genes

represented in the genomes of divergent species. The phylogenetic trees

constructed by computational methods are unlikely to perfectly reproduce the

evolutionary tree that represents the historical relationships between the species

being analyzed. The historical species tree may also differ from the historical

tree of an individual homologous gene shared by those species.

Producing a phylogenetic tree requires a measure of homology among the

characteristics shared by the taxa being compared. In morphological studies,

this requires explicit decisions about which physical characteristics to measure

and how to use them to encode distinct states corresponding to the input taxa. In

molecular studies, a primary problem is producing a multiple sequence

alignment (MSA) of the gene or amino acid sequences of interest.

Progressive sequence alignment methods produce a phylogenetic tree by

necessity because they incorporate new sequences into the calculated alignment

in order of genetic distance.

Types of phylogenetic trees and networks

Phylogenetic trees generated by computational phylogenetics can be either

rooted or unrooted depending on the input data and the algorithm used. A rooted

tree is a directed graph that explicitly identifies a most recent common ancestor

(MRCA), usually an imputed sequence that is not represented in the input.

Genetic distance measures can be used to plot a tree with the input sequences as

leaf nodes and their distances from the root proportional to their genetic

distance from the hypothesized MRCA. Identification of a root usually requires

the inclusion in the input data of at least one "outgroup" known to be only

distantly related to the sequences of interest.

By contrast, unrooted trees plot the distances and relationships between input

sequences without making assumptions regarding their descent. An unrooted

tree can always be produced from a rooted tree, but a root cannot usually be

placed on an unrooted tree without additional data on divergence rates, such as

the assumption of the molecular clock hypothesis.

The set of all possible phylogenetic trees for a given group of input sequences

can be conceptualized as a discretely defined multidimensional "tree space"

through which search paths can be traced by optimization algorithms. Although

counting the total number of trees for a nontrivial number of input sequences

can be complicated by variations in the definition of a tree topology, it is always

true that there are more rooted than unrooted trees for a given number of inputs

and choice of parameters.
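
Under one common definition of topology (strictly bifurcating trees with labeled leaves), the counts have well-known closed forms: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n inputs. The following Python sketch tabulates both under that assumption; the function names are illustrative and do not come from any particular library.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (taken as 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n_leaves):
    """Unrooted bifurcating trees on n labeled leaves: (2n - 5)!!, for n >= 3."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


def count_rooted(n_leaves):
    """Rooted bifurcating trees on n labeled leaves: (2n - 3)!!, for n >= 2."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n_leaves - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
```

Each unrooted tree on n leaves has 2n-3 branches on which a root can be placed, which is why the rooted count exceeds the unrooted count by exactly that factor in this setting.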

Both rooted and unrooted phylogenetic trees can be further generalized to

rooted or unrooted phylogenetic networks, which allow for the modeling of

evolutionary phenomena such as hybridization or horizontal gene transfer.

Coding characters and defining homology

Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix

representing a mapping from each of the taxa being compared to representative

measurements for each of the phenotypic characteristics being used as a

classifier. The types of phenotypic data used to construct this matrix depend on

the taxa being compared; for individual species, they may involve

measurements of average body size, lengths or sizes of particular bones or other

physical features, or even behavioral manifestations. Of course, since not every

possible phenotypic characteristic could be measured and encoded for analysis,

the selection of which features to measure is a major inherent obstacle to the

method. The decision of which traits to use as a basis for the matrix necessarily

represents a hypothesis about which traits of a species or higher taxon are

evolutionarily relevant. Morphological studies can be confounded by examples

of convergent evolution of phenotypes. A major challenge in constructing

useful classes is the high likelihood of inter-taxon overlap in the distribution of

the phenotype's variation. The inclusion of extinct taxa in morphological

analysis is often difficult because fossil records are absent or incomplete, but it has

been shown to have a significant effect on the trees produced; in one study only

the inclusion of extinct species of apes produced a morphologically derived tree

that was consistent with that produced from molecular data.

Some phenotypic classifications, particularly those used when analyzing very

diverse groups of taxa, are discrete and unambiguous; classifying organisms as

possessing or lacking a tail, for example, is straightforward in the majority of

cases, as is counting features such as eyes or vertebrae. However, the most

appropriate representation of continuously varying phenotypic measurements is

a controversial problem without a general solution. A common method is

simply to sort the measurements of interest into two or more classes, rendering

continuous observed variation as discretely classifiable (e.g., all examples with

humerus bones longer than a given cutoff are scored as members of one state,

and all members whose humerus bones are shorter than the cutoff are scored as

members of a second state). This results in an easily manipulated data set but

has been criticized for poor reporting of the basis for the class definitions and

for sacrificing information compared to methods that use a continuous weighted

distribution of measurements.
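
The cutoff-based coding described above can be sketched in a few lines of Python. The taxon names, measurements, and cutoffs below are invented purely for illustration, and `discretize` is a hypothetical helper rather than part of any phylogenetics package.

```python
from bisect import bisect_right


def discretize(measurements, cutoffs):
    """Map each taxon's continuous measurement to a discrete state index.

    `cutoffs` must be sorted in ascending order. A value receives state k,
    where k is the number of cutoffs less than or equal to that value.
    """
    return {taxon: bisect_right(cutoffs, value)
            for taxon, value in measurements.items()}


# Hypothetical humerus lengths in millimetres (invented values).
humerus_mm = {"taxon_A": 212.0, "taxon_B": 305.5, "taxon_C": 298.0, "taxon_D": 150.2}

two_state = discretize(humerus_mm, cutoffs=[250.0])
three_state = discretize(humerus_mm, cutoffs=[200.0, 300.0])
print(two_state)
print(three_state)
```

In this example, taxon_B and taxon_C share a state under the single cutoff but are separated by the two-cutoff scheme even though they differ by only 7.5 mm, which illustrates how strongly the resulting matrix depends on where the class boundaries are drawn.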

Because morphological data is extremely labor-intensive to collect, whether

from literature sources or from field observations, reuse of previously compiled

data matrices is not uncommon, although this may propagate flaws in the

original matrix into multiple derivative analyses.

Molecular analysis

The problem of character coding is very different in molecular analyses, as the

characters in biological sequence data are immediate and discretely defined:

distinct nucleotides in DNA or RNA sequences and distinct amino acids in

protein sequences. However, defining homology can be challenging due to the

inherent difficulties of multiple sequence alignment. For a given gapped MSA,

several rooted phylogenetic trees can be constructed that vary in their

interpretations of which changes are "mutations" versus ancestral characters,

and which events are insertion mutations or deletion mutations. For example,

given only a pairwise alignment with a gap region, it is impossible to determine

whether one sequence bears an insertion mutation or the other carries a deletion.

The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In

practice, sizable regions of a calculated alignment may be discounted in

phylogenetic tree construction to avoid integrating noisy data into the tree

calculation.
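
One simple way to discount unreliable regions is to drop alignment columns that are mostly gaps. The Python sketch below assumes an alignment stored as a dictionary of equal-length strings and a hypothetical `max_gap_fraction` threshold; real pipelines typically use more elaborate criteria, so this is only an illustration of the idea.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned string (equal lengths).
    Returns a new dict with the retained columns only.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    keep = [
        col for col in range(length)
        if sum(s[col] == gap for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(s[c] for c in keep) for name, s in alignment.items()}


# Toy alignment invented for illustration.
msa = {
    "seq1": "ATG--CGTA",
    "seq2": "ATGA-CGTA",
    "seq3": "ATG--CGAA",
    "seq4": "ATGAACGTA",
}
trimmed = mask_gappy_columns(msa, max_gap_fraction=0.25)
for name, seq in trimmed.items():
    print(name, seq)
```

With the threshold set to 0.25, the two columns in which half or more of the sequences carry a gap are removed, and the remaining seven columns are passed on to tree construction.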

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure

of "genetic distance" between the sequences being classified, and therefore they

require an MSA as an input. Distance is often defined as the fraction of

mismatches at aligned positions, with gaps either ignored or counted as

mismatches. Distance methods attempt to construct an all-to-all matrix from the

sequence query set describing the distance between each sequence pair. From

this matrix is constructed a phylogenetic tree that places closely related sequences

under the same interior node and whose branch lengths closely reproduce the

observed distances between sequences. Distance-matrix methods may produce

either rooted or unrooted trees, depending on the algorithm used to calculate

them. They are frequently used as the basis for progressive and iterative types of

multiple sequence alignments. The main disadvantage of distance-matrix

methods is their inability to efficiently use information about local high-

variation regions that appear across multiple subtrees.
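
The mismatch-fraction distance and the all-to-all matrix can be sketched as follows. This is a minimal Python illustration, assuming aligned sequences of equal length; the `gaps` option and the function names are hypothetical and simply mirror the two gap treatments mentioned above.

```python
from itertools import combinations


def p_distance(a, b, gaps="ignore", gap="-"):
    """Fraction of mismatches at aligned positions of two sequences.

    gaps="ignore":   skip any column where either sequence has a gap.
    gaps="mismatch": count a gap opposite a residue as a mismatch.
    Columns where both sequences have a gap are always skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if gaps not in ("ignore", "mismatch"):
        raise ValueError("gaps must be 'ignore' or 'mismatch'")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        if x == gap or y == gap:
            if gaps == "mismatch":
                compared += 1
                mismatches += 1
            continue
        compared += 1
        mismatches += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def distance_matrix(alignment, **kwargs):
    """All-to-all distances, keyed by the unordered pair of sequence names."""
    return {frozenset((m, n)): p_distance(alignment[m], alignment[n], **kwargs)
            for m, n in combinations(alignment, 2)}


# Toy alignment invented for illustration.
msa = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TTCGA"}
for pair, d in distance_matrix(msa, gaps="mismatch").items():
    print(sorted(pair), round(d, 3))
```

Keying the matrix by unordered pairs makes the symmetry of the distance explicit; a nested list or array would serve equally well.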

Neighbor-joining

Neighbor-joining methods apply general data clustering techniques to sequence

analysis using genetic distance as a clustering metric. The simple

neighbor-joining method produces unrooted trees and does not assume a constant

rate of evolution (i.e., a molecular clock) across lineages. Its relative, UPGMA

(Unweighted Pair Group Method with Arithmetic mean), produces rooted trees

and requires a constant-rate assumption; that is, it assumes an ultrametric tree

in which the distances from the root to every branch tip are equal.
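
A compact Python sketch of the standard neighbor-joining procedure is given below as an illustration. It assumes distances stored in a dictionary keyed by unordered name pairs, labels internal nodes with made-up names such as `internal0` (so leaf names must not collide with them), and breaks ties by taking the first minimal pair; none of these choices is prescribed by the method itself.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node, node, branch_length) of an unrooted tree.

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) -> distance between x and y.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        # Branch lengths from the new internal node to i and to j.
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"internal{next_id}"
        next_id += 1
        edges += [(u, i, li), (u, j, lj)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Small additive distance matrix used purely as an example.
names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in raw.items()}
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

No root is placed and the branch lengths leading to different leaves are free to differ, which reflects the absence of a molecular clock assumption. UPGMA, by contrast, would merge the closest pair at each step and place every tip at the same distance from the root.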

Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for

clustering based on genetic distance. Closely related sequences are given more

weight in the tree construction process to correct for the increased inaccuracy in

measuring distances between distantly related sequences. The distances used as

input to the algorithm must be normalized to prevent large artifacts in

computing relationships between closely related and distantly related groups.

The distances calculated by this method must be linear; the linearity criterion

for distances requires that the sum of the expected branch lengths for two

individual branches equal the expected value of the sum of the two branch

distances, a property that applies to biological sequences only when they have

been corrected for the possibility of back mutations at individual sites. This

correction is done through the use of a substitution matrix such as that derived

from the Jukes-Cantor model of DNA evolution. The distance correction is only

necessary in practice when the evolution rates differ among branches. Another

modification of the algorithm can be helpful, especially in the case of concentrated

distances (see the concentration of measure phenomenon and the curse of

dimensionality); that modification has been shown to improve the

efficiency of the algorithm and its robustness.
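
As an illustration of the correction for multiple and back mutations, the Jukes-Cantor model converts an observed mismatch fraction p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The Python sketch below applies that formula; the function name is illustrative, and the formula assumes equal base frequencies and equal substitution rates, as the Jukes-Cantor model does.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site from an observed mismatch fraction p.

    The correction is only defined for 0 <= p < 0.75; at p = 0.75 the two
    sequences are as different as unrelated random sequences are expected to be.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.20, 0.50):
    print(f"observed {p:.2f} -> corrected {jukes_cantor_distance(p):.4f}")
```

The corrected value is close to p for small distances and grows much faster than p for large ones, which is how the correction restores approximate additivity between closely and distantly related sequences.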

The least-squares criterion applied to these distances is more accurate but less

efficient than the neighbor-joining methods. An additional improvement that

corrects for correlations between distances that arise from many closely related

sequences in the data set can also be applied at increased computational cost.

Finding the optimal least-squares tree with any correction factor is NP-

complete, so heuristic search methods like those used in maximum-parsimony

analysis are applied to the search through tree space.
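
The quantity being minimised can be sketched as a weighted sum of squared differences between the observed distances and the distances implied by a candidate tree. The Python example below assumes the weighting commonly associated with the Fitch-Margoliash method, in which each squared error is divided by the square of the observed distance; the function name, the `power` parameter, and the numbers are illustrative only.

```python
def weighted_least_squares(observed, fitted, power=2):
    """Sum over pairs of (D - d)**2 / D**power.

    observed: dict pair -> observed distance D (must be nonzero if power > 0).
    fitted:   dict pair -> path-length distance d implied by a candidate tree.
    power=2 gives the weighting assumed here for Fitch-Margoliash;
    power=0 gives ordinary unweighted least squares.
    """
    total = 0.0
    for pair, d_obs in observed.items():
        total += (d_obs - fitted[pair]) ** 2 / d_obs ** power
    return total


# Invented distances: the same absolute error (0.02) on a short and a long pair.
observed = {("A", "B"): 0.10, ("A", "C"): 0.40, ("B", "C"): 0.50}
fitted = {("A", "B"): 0.12, ("A", "C"): 0.42, ("B", "C"): 0.50}

print(weighted_least_squares(observed, fitted, power=2))
print(weighted_least_squares(observed, fitted, power=0))
```

With the weighting switched on, the 0.02 error on the short A-B distance contributes far more to the score than the same error on the longer A-C distance, so closely related sequences dominate the fit. A tree search would evaluate this score for each candidate topology after fitting its branch lengths.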

Using outgroups

Independent information about the relationship between sequences or groups

can be used to help reduce the tree search space and root unrooted trees.

Standard usage of distance-matrix methods involves the inclusion of at least one

outgroup sequence known to be only distantly related to the sequences of

interest in the query set. This usage can be seen as a type of experimental

control. If the outgroup has been appropriately chosen, it will have a much

greater genetic distance and thus a longer branch length than any other

sequence, and it will appear near the root of a rooted tree. Choosing an

appropriate outgroup requires the selection of a sequence that is moderately

related to the sequences of interest; too close a relationship defeats the purpose

of the outgroup and too distant adds noise to the analysis. Care should also be

taken to avoid situations in which the species from which the sequences were

taken are distantly related, but the gene encoded by the sequences is highly

conserved across lineages. Horizontal gene transfer, especially between

otherwise divergent bacteria, can also confound outgroup usage.
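
Rooting an unrooted tree on an outgroup can be sketched as follows. The Python example assumes the tree is given as a list of unweighted edges between node labels and that the outgroup is a single leaf; the function name, the node labels, and the nested-tuple output format are all hypothetical choices made for illustration.

```python
def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to a single outgroup leaf.

    edges: iterable of (node, node) pairs describing the unrooted tree.
    Returns a nested tuple (outgroup, ingroup_subtree), where leaves are
    labels and internal nodes are tuples of their children.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    if len(adj.get(outgroup, [])) != 1:
        raise ValueError("outgroup must be a leaf of the tree")
    neighbor = adj[outgroup][0]

    def subtree(node, parent):
        # Children are all neighbours except the one we arrived from.
        children = [c for c in adj[node] if c != parent]
        if not children:
            return node
        return tuple(subtree(c, node) for c in children)

    return (outgroup, subtree(neighbor, outgroup))


# Unrooted tree with internal nodes x and y; O is the outgroup leaf.
edges = [("x", "A"), ("x", "B"), ("x", "y"), ("y", "C"), ("y", "O")]
rooted = root_with_outgroup(edges, "O")
print(rooted)
```

The root is placed on the branch that separates the outgroup from all other sequences, so the ingroup forms a single subtree whose internal branching order is read outward from that point.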