Page 2 of 8
www.AssignmentPoint.com
Computational phylogenetics is the application of computational algorithms,
methods, and programs to phylogenetic analyses. The goal is to assemble a
phylogenetic tree representing a hypothesis about the evolutionary ancestry of a
set of genes, species, or other taxa. For example, these techniques have been
used to explore the family tree of hominid species and the relationships between
specific genes shared by many types of organisms. Traditional phylogenetics
relies on morphological data obtained by measuring and quantifying the
phenotypic properties of representative organisms, while the more recent field
of molecular phylogenetics uses nucleotide sequences encoding genes or amino
acid sequences encoding proteins as the basis for classification. Many forms of
molecular phylogenetics are closely related to and make extensive use of
sequence alignment in constructing and refining phylogenetic trees, which are
used to classify the evolutionary relationships between homologous genes
represented in the genomes of divergent species. The phylogenetic trees
constructed by computational methods are unlikely to perfectly reproduce the
evolutionary tree that represents the historical relationships between the species
being analyzed. The historical species tree may also differ from the historical
tree of an individual homologous gene shared by those species.
Producing a phylogenetic tree requires a measure of homology among the
characteristics shared by the taxa being compared. In morphological studies,
this requires explicit decisions about which physical characteristics to measure
and how to use them to encode distinct states corresponding to the input taxa. In
molecular studies, a primary problem is in producing a multiple sequence
alignment (MSA) between the genes or amino acid sequences of interest.
Progressive sequence alignment methods produce a phylogenetic tree by
necessity because they incorporate new sequences into the calculated alignment
in order of genetic distance.
Page 3 of 8
www.AssignmentPoint.com
Types of phylogenetic trees and networks
Phylogenetic trees generated by computational phylogenetics can be either
rooted or unrooted depending on the input data and the algorithm used. A rooted
tree is a directed graph that explicitly identifies a most recent common ancestor
(MRCA), usually an imputed sequence that is not represented in the input.
Genetic distance measures can be used to plot a tree with the input sequences as
leaf nodes and their distances from the root proportional to their genetic
distance from the hypothesized MRCA. Identification of a root usually requires
the inclusion in the input data of at least one "outgroup" known to be only
distantly related to the sequences of interest.
By contrast, unrooted trees plot the distances and relationships between input
sequences without making assumptions regarding their descent. An unrooted
tree can always be produced from a rooted tree, but a root cannot usually be
placed on an unrooted tree without additional data on divergence rates, such as
the assumption of the molecular clock hypothesis.
The set of all possible phylogenetic trees for a given group of input sequences
can be conceptualized as a discretely defined multidimensional "tree space"
through which search paths can be traced by optimization algorithms. Although
counting the total number of trees for a nontrivial number of input sequences
can be complicated by variations in the definition of a tree topology, it is always
true that there are more rooted than unrooted trees for a given number of inputs
and choice of parameters.
Page 4 of 8
www.AssignmentPoint.com
Both rooted and unrooted phylogenetic trees can be further generalized to
rooted or unrooted phylogenetic networks, which allow for the modeling of
evolutionary phenomena such as hybridization or horizontal gene transfer.
Coding characters and defining homology
Morphological analysis
The basic problem in morphological phylogenetics is the assembly of a matrix
representing a mapping from each of the taxa being compared to representative
measurements for each of the phenotypic characteristics being used as a
classifier. The types of phenotypic data used to construct this matrix depend on
the taxa being compared; for individual species, they may involve
measurements of average body size, lengths or sizes of particular bones or other
physical features, or even behavioral manifestations. Of course, since not every
possible phenotypic characteristic could be measured and encoded for analysis,
the selection of which features to measure is a major inherent obstacle to the
method. The decision of which traits to use as a basis for the matrix necessarily
represents a hypothesis about which traits of a species or higher taxon are
evolutionarily relevant. Morphological studies can be confounded by examples
of convergent evolution of phenotypes. A major challenge in constructing
useful classes is the high likelihood of inter-taxon overlap in the distribution of
the phenotype's variation. The inclusion of extinct taxa in morphological
analysis is often difficult due to absence of or incomplete fossil records, but has
been shown to have a significant effect on the trees produced; in one study only
the inclusion of extinct species of apes produced a morphologically derived tree
that was consistent with that produced from molecular data.
Page 5 of 8
www.AssignmentPoint.com
Some phenotypic classifications, particularly those used when analyzing very
diverse groups of taxa, are discrete and unambiguous; classifying organisms as
possessing or lacking a tail, for example, is straightforward in the majority of
cases, as is counting features such as eyes or vertebrae. However, the most
appropriate representation of continuously varying phenotypic measurements is
a controversial problem without a general solution. A common method is
simply to sort the measurements of interest into two or more classes, rendering
continuous observed variation as discretely classifiable (e.g., all examples with
humerus bones longer than a given cutoff are scored as members of one state,
and all members whose humerus bones are shorter than the cutoff are scored as
members of a second state). This results in an easily manipulated data set but
has been criticized for poor reporting of the basis for the class definitions and
for sacrificing information compared to methods that use a continuous weighted
distribution of measurements.
Because morphological data is extremely labor-intensive to collect, whether
from literature sources or from field observations, reuse of previously compiled
data matrices is not uncommon, although this may propagate flaws in the
original matrix into multiple derivative analyses.
Molecular analysis
The problem of character coding is very different in molecular analyses, as the
characters in biological sequence data are immediate and discretely defined -
distinct nucleotides in DNA or RNA sequences and distinct amino acids in
protein sequences. However, defining homology can be challenging due to the
inherent difficulties of multiple sequence alignment. For a given gapped MSA,
several rooted phylogenetic trees can be constructed that vary in their
Page 6 of 8
www.AssignmentPoint.com
interpretations of which changes are "mutations" versus ancestral characters,
and which events are insertion mutations or deletion mutations. For example,
given only a pairwise alignment with a gap region, it is impossible to determine
whether one sequence bears an insertion mutation or the other carries a deletion.
The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In
practice, sizable regions of a calculated alignment may be discounted in
phylogenetic tree construction to avoid integrating noisy data into the tree
calculation.
Distance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure
of "genetic distance" between the sequences being classified, and therefore they
require an MSA as an input. Distance is often defined as the fraction of
mismatches at aligned positions, with gaps either ignored or counted as
mismatches. Distance methods attempt to construct an all-to-all matrix from the
sequence query set describing the distance between each sequence pair. From
this is constructed a phylogenetic tree that places closely related sequences
under the same interior node and whose branch lengths closely reproduce the
observed distances between sequences. Distance-matrix methods may produce
either rooted or unrooted trees, depending on the algorithm used to calculate
them. They are frequently used as the basis for progressive and iterative types of
multiple sequence alignments. The main disadvantage of distance-matrix
methods is their inability to efficiently use information about local high-
variation regions that appear across multiple subtrees.
Neighbor-joining
Page 7 of 8
www.AssignmentPoint.com
Neighbor-joining methods apply general data clustering techniques to sequence
analysis using genetic distance as a clustering metric. The simple neighbor-
joining method produces unrooted trees, but it does not assume a constant rate
of evolution (i.e., a molecular clock) across lineages. Its relative, UPGMA
(Unweighted Pair Group Method with Arithmetic mean) produces rooted trees
and requires a constant-rate assumption - that is, it assumes an ultrametric tree
in which the distances from the root to every branch tip are equal.
Fitch-Margoliash method
The Fitch-Margoliash method uses a weighted least squares method for
clustering based on genetic distance. Closely related sequences are given more
weight in the tree construction process to correct for the increased inaccuracy in
measuring distances between distantly related sequences. The distances used as
input to the algorithm must be normalized to prevent large artifacts in
computing relationships between closely related and distantly related groups.
The distances calculated by this method must be linear; the linearity criterion
for distances requires that the expected values of the branch lengths for two
individual branches must equal the expected value of the sum of the two branch
distances - a property that applies to biological sequences only when they have
been corrected for the possibility of back mutations at individual sites. This
correction is done through the use of a substitution matrix such as that derived
from the Jukes-Cantor model of DNA evolution. The distance correction is only
necessary in practice when the evolution rates differ among branches. Another
modification of the algorithm can be helpful, especially in case of concentrated
distances (please report to concentration of measure phenomenon and curse of
dimensionality): that modification, described in, has been shown to improve the
efficiency of the algorithm and its robustness.
Page 8 of 8
www.AssignmentPoint.com
The least-squares criterion applied to these distances is more accurate but less
efficient than the neighbor-joining methods. An additional improvement that
corrects for correlations between distances that arise from many closely related
sequences in the data set can also be applied at increased computational cost.
Finding the optimal least-squares tree with any correction factor is NP-
complete, so heuristic search methods like those used in maximum-parsimony
analysis are applied to the search through tree space.
Using outgroups
Independent information about the relationship between sequences or groups
can be used to help reduce the tree search space and root unrooted trees.
Standard usage of distance-matrix methods involves the inclusion of at least one
outgroup sequence known to be only distantly related to the sequences of
interest in the query set. This usage can be seen as a type of experimental
control. If the outgroup has been appropriately chosen, it will have a much
greater genetic distance and thus a longer branch length than any other
sequence, and it will appear near the root of a rooted tree. Choosing an
appropriate outgroup requires the selection of a sequence that is moderately
related to the sequences of interest; too close a relationship defeats the purpose
of the outgroup and too distant adds noise to the analysis. Care should also be
taken to avoid situations in which the species from which the sequences were
taken are distantly related, but the gene encoded by the sequences is highly
conserved across lineages. Horizontal gene transfer, especially between
otherwise divergent bacteria, can also confound outgroup usage.