Published using Google Docs
Comprehensive Guide to Phylogeny Building by Lanna Jin
Updated automatically every 5 minutes

Comprehensive Guide to Phylogeny Building

Written by Lanna Jin


Table of Contents

Table of Contents

1 Prior to sequence collecting

1.1 Setting up folders/files

1.2 Querying GenBank for species/sequences found in GenBank

2 Compiling gene sequences for species

2.1 Collecting sequences in GenBank

2.2 Dealing with Missing Species

2.3 Teasing apart ITS1 and 5.8s from combined sequences

2.4 Merging sequence lines

3 ClustalX2

3.1 Aligning sequences in ClustalX2

3.2 Duplicates

4 Concatenating Sequences into a Super Matrix with FASconCAT

5 Model testing in jModelTest2

6 Super Tree (Angiosperms)

6.1. Nix

6.2 Phylomatic

7 PhyML 3.0: Maximum Likelihood Phylogeny

8 PhyloBayes-MPI: Bayesian MCMC Sampler

9 Archaeopteryx Tree Visualizer

10 Fixing your phylogeny

10.1 ClustalX

10.2 BLAST

11 Finalizing your phylogeny

11.2 Ultrametricize your tree

Figures

Figure 1.2.2

Figure 2.1.2

Figure 2.1.3

Figure 2.1.4

Figure 2.1.5

Figure 2.1.6

Figure 2.1.7

Figure 2.1.8

Figure 2.1.9

Figure 2.2.3

Figure 2.4.2

Figure 3.3b

Figure 9.3

Figure 9.4

Figure 11.1.a.i

Figure 11.1.a.ii

Scripts

Script 2.1.1

Script 6.1.2

Script 6.1.3

Script 6.2.2a

1 Prior to sequence collecting

1.1 Setting up folders/files

  1. In “Phylogeny” folder, create folder “sequences”. This is where you will store all the sequences you gather from GenBank.
  2. Since you will probably have more than one “run” to correct for bad sequences, create a subdirectory, “Phylogeny/sequences/run1 for your first run.
  3. Place a list of species named as “species_list.csv” format in the “Phylogeny/sequences” directory.
  1. Make sure to also include outgroup species, Amborella trichopoda and Magnolia grandiflora
  2. Make sure species are in the format, “Abies lasiocarpa” (as opposed to, “Abies_lasiocarpa”)

1.2 Querying GenBank for species/sequences found in GenBank

  1. Download and install R: http://www.r-project.org/
  1. Install package, “XML”
  1. Open and run “Phylogeny/sequences/genbankSearch_script.r”. This script queries GenBank for each species for each of the four sequences (matK, rbcl, 5.8s and ITS1). It generates the following files:
  1. “sequences/geneSeq.csv”: this is the list you will use to store information about the sequences you collect. Create a duplicate of this workbook in either Excel or Google Docs, and name it to reflect the run number, e.g., “sequences/geneSeq_run1.csv”. This is the file where you will keep track of sequences’ ascension numbers, but also changes you make throughout the phylogeny building process. Add the column, “ITS1 & 5.8s combined”, “NA” and “Comments”. (Figure 1.2.2)
  2. The files for collecting sequences:
  1. “sequences/run1/rbcl.txt”
  2. “sequences/run1/matK.txt”
  3. “sequences/run1/ITS1.txt”
  4. “sequences/run1/5.8s.txt”
  5. “sequences/run1/ITS15.8s.txt”
  1. “sequences/alternativeSeqs.csv”: the list of species that are not found within GenBank at all. You will need to find congenerics for these species.

2 Compiling gene sequences for species

2.1 Collecting sequences in GenBank

  1. Locate and open folder, “Phylogeny/sequences/geneSeq.csv”. (Script 2.1.1)
  2. To locate a gene sequence, go to www.ncbi.nlm.nih.gov/ and use the search bar locate on the top right of the page. Make sure you change “All Databases” to “Nucleotide”. Enter the species name and gene (paste from the excel sheet). For example if you are looking for the rbcl gene for Poa annua, you would type “Poa annua rbcl”. (Figure 2.1.2)
  3. Typically a list of numerous entries appears (if nothing, make sure there weren’t any spelling errors). Click on a link which contains both the species name and the gene name, sometimes other species pop up. (Figure 2.1.3)
  4. Scroll down you will see short links for genes. In our example, this one is only rbcl, so if you clicked on the “gene” link, you’ll just get the full sequence. If there were multiple genes, there would be multiple “gene” links, and in the column across from the “gene” link, you would look for this: /gene=”rbcl”. (Figure 2.1.4)
  5. Click on “gene”, the rbcl will then be highlighted. Click on “Display: FASTA” at the bottom right-hand corner (Figure 2.1.5)
  6. Copy all the text at the beginning of the nucleotide to the last nucleotide. (Figure 2.1.6)
  7. Paste into the document for that gene (in this example rbcl.txt). Save the file! (Figure 2.1.7)
  8. Also, copy the Genbank accession number (always starts with a letter, EU676939.1 in the above example) and paste into the Excel sheet (sequences/genSeq_run1.csv) in the cell for that species and gene. Save the file! (Figure 2.1.8)
  9.  For a given species’s gene (i.e., for “Aconitum columbianum”, if no rbcl sequences are available, simply put a NA in the .csv file (sequences/genSeq_run1.csv) file. (Figure 2.1.9).
  10. If none of the four sequences (rbcl, matK, ITS1 or 5.8s) are available for a given species, make sure NA’s are filled out across the species’ row, but also place an “X” under the “NA” column.

2.2 Dealing with Missing Species

After you’ve collected all the sequences, there may be some species that are not found within GenBank for any of the four sequences (rbcl, matK, 5.8s and ITS1).

  1. Find synonyms: First, check plants.usda.gov or http://www.theplantlist.org/ to see if there are any synonyms for the missing species. If there is a synonym for a missing species:
  1. Check GenBank to see if there are sequences available for the synonym (Section 2.1)
  2. If there are sequences in GenBank, make sure the species’ synonym isn’t already one of the species in your species list.
  3. Under the “Comments” column of the sequences/genSeq_run1.csv file, make a note for the missing species what the synonym whose sequences you are collecting is. Then collect the sequences for the synonym and add them to the sequence text files under the missing species’ name.
  1. If you are adding missing species later at the genus level, you will need to first check if there are at least two congenerics that you’ve already collected sequences for. If not, then find one (at least one congeneric with sequences already collected) or two (no congenerics within collected sequences) congeneric(s) in GenBank. Ideally you want a congeneric with all sequences of interest possible. Make a note somewhere of which species had no sequences found, and which congenerics you’ll be using to form the genus node.
  1. Create the following text files and add corresponding sequences for congeneric species:
  1. “sequences/run1/rbcl_congenerics.txt”
  2. “sequences/run1/matK_congenerics.txt”
  3. “sequences/run1/ITS1_congenerics.txt”
  4. “sequences/run1/5.8s_congenerics.txt”
  5. “sequences/run1/ITS15.8s_congenerics.txt”
  1. In the sequences/genSeq_run1.csv workbook file, create a new sheet called, “Species changes”. Here create a “Before Changes” column, which contains the columns “rbcl”,”matK”,”ITS1”,”5.8s” and “ITS1 & 5.8s”; a “Changes” column, which contains the columns  “rbcl”,”matK”,”ITS1”,”5.8s” and “ITS1 & 5.8s”; and a “Comments column” (Figure 2.2.3). On this worksheet, you should document and track any changes that you make.

2.3 Teasing apart ITS1 and 5.8s from combined sequences

Some species will only have sequences for ITS1 and 5.8s combined with each other, and sometimes other sequences like 18S and ITS2, e.g., Asclepias incarnata, with the ascension code DQ005967.1. These sequences should be collected within the text file: “sequences/run1/ITS15.8s.txt”. Use a congeneric species (e.g., Asclepias syriaca, AM396892.1) with separated 18S, ITS1, 5.8s and ITS1 values in order to extract a species’ ITS1 and 5.8s values. Once the ITS1 and 5.8s sequences have been teased apart for each species, add them, respectively, to the “sequences/run1/ITS1.txt” and “sequences/run1/5.8s.txt” text files.

2.4 Merging sequence lines

Copying sequences directly from GenBank will cause your sequence files to have line breaks. For each of your sequence text files, you will need to eliminate the line breaks such that the sequences are all in one line.

If you are using Notepad++:

  1. Type “Control+F” to bring up the “Find” dialogue box. Click on the tab “Replace”
  2. Under Replace, (Figure 2.4.2)
  1. “Find what: (^[^>].*?)[\s](?!>)”
  2. “Replace with: $1”
  3. Check:
  1. Match case
  2. Wrap around
  3. Search Mode: Regular expression
  1. Recursively click “Replace All” until the dialogue box returns, “0 occurrences were replaced”
  2. Delete any species names with no corresponding sequences

3 ClustalX2

3.1 Aligning sequences in ClustalX2

  1. Create a new subdirectory, “Phylogeny/ClustalX_output”. Within the “Phylogeny/ClustalX_output” directory, create another subdirectory, “Phylogeny/ClustalX_output/run1
  2. Download ClustalX2: http://www.clustal.org/clustal2/#Download
  3. Once downloaded, open ClustalX2.
  1. Load Sequences for Alignment
  1. From the Menu Bar, “File” > “Load Sequences”
  2. Navigate to the directory, “Phylogeny/sequences/run1” and open one of the four sequences text files, e.g., “ITS1.txt”
  3. From Menu Bar, “File” > “Append Sequences”
  4. Navigate to the directory, “Phylogeny/sequences/run1”  and open one of the four sequences text files for the congenerics, e.g., “ITS1_congenerics.txt”
  1. Change Output format (Figure 3.3b)
  1. From the Menu Bar, “Alignment” > “Output Format Options”
  2. Uncheck “CLUSTAL format”
  3. Check “FASTA format”, and click “OK”
  1. Change output directory and do Complete Alignment
  1. From the Menu Bar, “Alignment” > “Do Complete Alignment”
  2. Make sure the Output Guide Tree and Output Alignment Files are pointed toward “Phylogeny/ClustalX_output/run1
  3. Click “OK”.
  1. When ClustalX is finished, it should have put within “Phylogeny/ClustalX_output/run1 two files:
  1. ITS1.dnd
  2. ITS1.fasta

3.2 Duplicates

If duplicates detected use the following lines of code in R to determine duplicated species:

library(seqinr)   #If this is not installed, use command: "install.packages("seqinr")"

 

a<-read.alignment("matK.txt",format="fasta")

b<-read.alignment("rbcl.txt",format="fasta")

c<-read.alignment("ITS1.txt",format="fasta")

d<-read.alignment("5.8s.txt",format="fasta")

 

a$nam[which(duplicated(a$nam))] #lists duplicated species in matK.txt

b$nam[which(duplicated(b$nam))] #lists duplicated species in rbcl.txt

c$nam[which(duplicated(c$nam))] #lists duplicated species in ITS1.txt

d$nam[which(duplicated(d$nam))] #lists duplicated species in 5.8s.txt

Manually remove duplicates from the sequence text files.

4 Concatenating Sequences into a Super Matrix with FASconCAT

  1.  First, make sure all fasta file extensions are converted from .FASTA to .FAS, e.g., ITS1.fasta → ITS1.fas
  2. Create a new directory called “Phylogeny/FASconCAT” with a subdirectory “Phylogeny/FASconCAT/run1“.
  1. Copy/paste .fas files to "Phylogeny/FASconCAT/run1"
  1. Download and install Perl: http://www.perl.org/get.html
  2. Download FASconCAT: http://zfmk.de/web/Forschung/Abteilungen/AG_Wgele/Software/FASconCAT/Download/index.en.html
  1. Unzip/paste FASconCAT_v1.0.pl into “Phylogeny/FASconCAT/run1
  1. Either:
  1. Open cmd.exe and execute the following:

cd "Phylogeny\FASconCAT\run1"

FASconCAT_v1.0.pl

p

p

s

  1. Or, copy/paste FasConCat_v1.0.pl into "Phylogeny/FASconCAT/run1"
  1. Double click on FasConCat_v1.0.pl, which should open up the program
  2. then type in the console:

p

p

s

  1. FASconCAT should output the following files when done:
  1. “FcC_info.xls”
  2. “FcC_smatrix.fas”: a FASTA file
  3. “FcC_smatrix.phy”: a PHYLIP file

5 Model testing in jModelTest2

jModelTest will help you figure out what the best model to build your phylogeny with is. Since it requires a lot of computing and takes a while, you only really need to do it a couple times--mainly for your first and last run.

  1. Download and install jModelTest2: http://code.google.com/p/jmodeltest2/downloads/list
  2. Within “jModelTest2” directory, open the Executable Jar file “jModelTest.jar”
  3. Within the jModelTest 2.0.1 GUI, select from Menu, “File” > “Load DNA Alignment”. Navigate to “Phylogeny/FASconCAT/run1/FcC_smatrix.fas” and open the file.
  4. From Menu, “Analysis” > “Compute likelihood scores”. A “Likelihood settings” window should appear. Click, “Compute Likelihoods”
  5. When the Progress window is gone, you should have a bunch of stuff listed on the console.
  6. In the GUI menu, navigate to "Analysis > Do AIC calculations".
  1. If a window pops up, click on “Do AIC calculations”, assume that the default settings are in place and go ahead and run the analysis.
  1. Repeat step 7. for "Analysis > Do BIC calculations" and "Analysis > Do DT calculations"
  2. Once you have completed the analyses for AIC, BIC and DT calculations, navigate to: "Results > Show results table". A table will pop open you will see different tabs for "AIC", "BIC" and "DT"
  3. In the Results table,
  1. click on the "AIC" tab.
  2. sort the column "AIC" by click on the column header.
  3. You will see that the AIC with the lowest score is on top.
  4. Note what is listed under the column "Name" for the first three models that come up with the lowest score.
  1. Repeat step 10. for columns "BIC" and "DT".
  2. Make a note of the file inputted (whether it’s FcC_smatrix.fas, ITS1.fas, 5.8s, matK.fas, or rbcl.fas) and the information from steps 10 and 11.
  1. For example, the jModeltest results for "ITS1.fas” might be:
  1. AIC: CAT + G, F81 + G, HKY
  2. BIC: CAT + G, JC, F81
  3. DT: CAT + G"
  1. This will help guide you on what the best model for your phylogeny is in Section 7.
  1. In the GUI menu, navigate to "Edit > save console". Save it under “Phylogeny/jModelTest2/mydata/run1/”
  2. Repeat steps 2-12 for:
  1. “Phylogeny/FASconCAT/run1/ITS1.fas”
  2. “Phylogeny/FASconCAT/run1/5.8s.fas”
  3. “Phylogeny/FASconCAT/run1/matK.fas”
  4. “Phylogeny/FASconCAT/run1/rbcl.fas”

6 Super Tree (Angiosperms)

Here, you will graft a phylogeny of your species from a “Super Tree”. The resulting tree (which we’ll call your “super tree”) will be useful as a means of comparison in later steps to see whether the sequences you’ve collected are incorrect. It can also be used as a backbone (starting tree) for building your molecular phylogeny.

6.1. Nix

  1. Create a new directory “Phylogeny/phylocom”; and within it the subdirectory, “Phylogeny/phylocom/run1
  2. We need to create a species list to input into Phylocom; however, in order to do so, we need to make sure the species can be built from the supertree. Mainly, the problem is that we need to make sure the species’ families are found within the master tree, from which the supertree is grafted from. To do this:
  1. First create a species list from the concatenated super matrix created in Section 4 in R. (Script 6.1.2 - replace file path in read.alignment(...) with location of your file)
  1. “Phylogeny/phylocom/run1/submit2nix.csv”.
  1. Copy/paste the species in “submit2nix.csv” into Nix http://phylodiversity.net/nix/. Nix will try to match species to their appropriate genus and families.
  1. Name matching: Exact
  2. Output format: tab-delimited spreadsheet format
  3. Save the output as a text file, “Phylogeny/phylocom/run1/nix.txt”
  1. Check for species that may have not been found within Nix.
  1. In R, run Script 6.1.3a
  1. Open “Phylogeny/phylocom/run1/nix.txt” and identify which species are not found within Nix
  2. Output the file, “Phylogeny/phylocom/run1/FindMatches.csv” 
  1. If there are species not found within Nix (if “FindMatches.csv” is not empty), search http://www.theplantlist.org and check what the accepted name of each species should be
  1. If the species name is accepted, in  “Phylogeny/phylocom/run1/FindMatches.csv” :
  1. If the name is unresolved, but has an accepted synonym, in  “Phylogeny/phylocom/run1/FindMatches.csv” :
  1. Save “Phylogeny/phylocom/run1/FindMatches.csv” as “Phylogeny/phylocom/run1/MatchesFound.csv” (to avoid overwriting files in R)
  1. In R, run Script 6.1.3c
  1. Opens up “Phylogeny/phylocom/run1/MatchesFound.csv”
  2. Replaces the species unmatched from Nix with the information from “Phylogeny/phylocom/run1/MatchesFound.csv”
  3. Outputs “Phylogeny/phylocom/run1/submit2phylomatic.txt”

6.2 Phylomatic

  1. Go to the Phylomatic website: http://phylodiversity.net/phylomatic/
  1. tree=: 
  1. paste whatever super tree is appropriate for your species; OR
  1. storedtree=:
  1. add whatever stored tree is appropriate (e.g., “R20120829”)
  1. method = Phylomatic
  2. taxa =:
  1. Copy and paste the content of your “Phylogeny/phylocom/run1/submit2phylomatic.txt” file.
  1. clean = TRUE
  1. if you don’t check it, your tree will have singleton nodes that R cannot read
  1. Send:
  1. Should output a supertree. Save this super tree as “Phylogeny/phylocom/run1/my_phylomatic.new”
  1. Look at the end of the “Phylogeny/phylocom/run1/my_phylomatic.new” file to see if there are any unmatched species. If there are unmatched species, it’s likely because the species have been reclassed into new families.
  1. In R, run Script 6.2.2a
  1. Looks through “Phylogeny/phylocom/run1/my_phylomatic.new”  to see which species were excluded from the Supertree
  2. Identifies the unique families of these species, and outputs it into the text file, “Phylogeny/phylocom/run1/FindFamilies.csv” 
  1. Refer to this document to find out what the family names for your species should be under: http://www.florachilena.cl/docs/APGII.pdf
  1. In “Phylogeny/phylocom/run1/FindFamilies.csv” , fill out the “NewFamily” column
  2. Save (to prevent overwriting in R) as “Phylogeny/phylocom/run1/FamiliesFound.csv” 
  1. In R, run Script 6.2.2c
  1. Reads “Phylogeny/phylocom/run1/FamiliesFound.csv” and replaces old family names with new family names
  2. Outputs “Phylogeny/phylocom/run1/submit2phylomatic.txt”
  1. Repeat step 6.2 until there are no more excluded species. Save the final super tree as “Phylogeny/phylocom/run1/prior.new”

7 PhyML 3.0: Maximum Likelihood Phylogeny

You can build your tree either using Maximum Likelihood or Bayesian MCMC sampler method. If you are interested in building a Maximum likelihood phylogeny...

  1. Navigate to http://www.atgc-montpellier.fr/phyml/
  1. “Input Data”
  1. Sequence (PHYLIP format): Upload “Phylogeny/FASconCAT/run1/FcC_smatrix.phy”
  2. Data type: DNA
  3. Sequence file: Sequential
  1. “Substitution Model”
  1. Substitution model: Pick the best model from Section 5
  1. “Tree Searching”
  1. Starting tree(s): BIONJ
  2. Type of tree improvement: NNI
  3. Number of random starting trees: NO
  1. “Branch Support”
  1. Fast likelihood-based method: aLRT SH-like
  2. Perform bootstrap: NO
  3. Name your analysis: Give it some kind of name
  4. Your email/Please confirm your email: fill out your email address, where they will send you your results
  1. “Execute & email results”
  1. When PhyML is finished, it will email you a zip file containing the following files:
  1. fcc_smatrix_phy_phyml_ik.txt
  2. fcc_smatrix_phy_phyml_stats.txt
  3. fcc_smatrix_phy_phyml_tree.txt
  4. fcc_smatrix_phy_phyml_stdout.txt
  1. Save these four files under “Phylogeny/PhyML/run1

8 PhyloBayes-MPI: Bayesian MCMC Sampler

You can build your tree either using Maximum Likelihood or Bayesian MCMC sampler method. If you are interested in building a Bayesian Phylogeny...

  1. Download PhyloBayes MPI: http://megasun.bch.umontreal.ca/People/lartillot/www/downloadmpi.html and unzip files into folder “Phylogeny/pb_mpi1.5a”
  2. After downloading PhyloBayes MPI, you may need to recompile the code to suit your OS:
  1. “...PhyloBayes MPI is provided both as executable files (for linux x86-64) and as a C++ source code. Depending on the operating system running on your cluster, you may need to recompile the code. To this end, a simple Makefile is provided in the sources directory, and compiling with the make command should then work in most simple situations, assuming that a version of MPI compatible with C++ compilation (with a mpicxx/mpic++ executable) is already installed on your cluster. Note that, in some machines, OpenMPI needs to be specifically uploaded by the user before compiling and/or running a MPI program. You may need to check all these details with your system administrator. The present code can run in principle on MacOSX or Windows operating systems, however, it is primarily intended for (and has been exclusively tested on) high performance computing facilities operating under linux or Unix.”
  2. Make sure you have the libraries “make” and “openMPI” installed
  1. For Windows:
  1. Use Cygwin to download the “Devel” package.
  2. Use Cygwin to download the “openMPI” package.
  1. For Mac:
  1. Make sure Xcode and it’s developer tools are downloaded (in order to have the “make” command)
  2. Make sure openMPI is installed in order to run the “make” command properly: https://wiki.helsinki.fi/display/HUGG/Installing+Open+MPI+on+Mac+OS+X
  1. Compile phylobayes MPI for your operating system by typing in either Cygwin or in console:

cd "Phylogeny/pb_mpi1.5a/sources"

make -f Makefile

  1. Copy “Phylogeny/phylocom/run1/prior.new”, “Phylogeny/FASconCAT/run1/FcC_smatrix.phylip” and “Phylogeny/FASconCAT/run1/FcC_smatrix.fas”into: “Phylogeny/pb_mpi1.5a/data”
  2. We need to make sure all the species are the same in both the prior tree and concatenated tree. In R,

setwd("Phylogeny/pb_mpi1.5a/data")

library(ape)

library(seqinr)

#Load trees

ini<-read.tree("prior.new")

con<-read.alignment("FcC_smatrix.fas", "fasta")

#Make sure same number of species in prior tree and in concatenated sequence file

length(ini$tip.label)==length(tolower(con$nam))

#Are there any duplicated names in any of the files?

ini$tip.label[which(duplicated(ini$tip.label))]

con$nam[which(duplicated(con$nam))]

#Are all the names in both the files the same?

all(sort(ini$tip.label) == sort((con$nam)))

#Are all the sequences the same length?

length(unique(unlist(lapply(1:con$nb, function(x) nchar(con$seq[[x]][1])))))==1

  1. Prior tree has to be unrooted and bifurcating (no polytomies). To bifurcate your prior tree, randomly resolve polytomies with function multi2di in library ape in R:

######Formatting the Prior Tree

#Remove node labels

ini$node.label<-NULL

#Unroot prior tree

if(is.rooted(ini)){ini<-unroot(ini)}

#Convert tree to ultrametric tree

if(is.ultrametric(ini)){

x.ult<-chronopl(ini,lambda=0.1)

}

#Resolve polytomies randomly

x.ult<-multi2di(x.ult,random=TRUE)

write.tree(x.ult, file="prior.new")

  1. INPUT
  1. -np <cores>: Number of cores to run the process on
  2. -d <datafile>: specifies data file containing aligned sequences, FcC_smatrix.fas.phylip
  3. -gtr: general time reversible matrix: exchange rates are free parameters, with prior distribution a product of independent exponential distributions of mean 1. gtr is the default (not necessary to specify), however if jModelTest had a different model, use that one instead (see manual)
  4. -t <treefile>: forces the chain to start from the specified tree. Here, prior.new
  5. -x <every> [<until>]: specifies the saving frequency and (optional) the number of points after which the chain should stop. If this number is not specified, the chain runs “forever”. By definition, -x 1 corresponds to the default saving frequency. In some cases, samples may be strongly correlated, in which case, if disk space or access is limiting, it would make sense to save points less frequently, say 10 times less often: to do this, you can use the -x 10 option.
  6. -x <burn-in> [<every> <until>]: Defines the burn-in, the sub-sampling frequency, and the size of the samples of trees to be taken from the chains under comparison. By default, <burn-in> = 0, <every> = 1 and <until> is equal to the size of the chain. Thus, for instance: “-x 1000” defines a burn-in of 1000; “-x 1000 10” a burn-in of 1000, taking one every 10 trees, up to the end of each chain, and “-x 1000 10 11000” a burn-in of 1000, taking one every 10 trees, up to the 11 000th point of the chains (or less,
  7. &: Makes two independent chains in parallel, ./output.chain1 and ./output/chain2
  8. Run 2 independent MCMC chains, each starting from a random point, run up to 100,000 cycles. Save MCMC sample every 10 cycles, first 500 samples discard as burnin. In console or Cygwin:

cd "Phylogeny/pb_mpi1.5a/data"

nohup mpirun -n 2 pb_mpi -d FcC_smatrix.phylip -t prior.new -x 10 100000 -S chain1 &

nohup mpirun -n 2 pb_mpi -d FcC_smatrix.phylip -t prior.new -x 10 100000 -S chain2 &

  1. OUTPUT: A series of files will be produced, whose names all start with the name of the chain, combined with a variety of extensions. The most important are:
  1. <name>.treelist: list of sampled trees.
  2. <name>.trace: the trace file, containing a few relevant statistics (e.g. number of generations, loglikelihood, total length of the tree, number of components in the mixture), which can help you checking how fast the chain equilibrates, and how long you will need to run it. The best is to plot the evolution of a few of these statistics, as a function of time, to check for the absence of trend in the long run.
  1. After the two chains have finished executing, run bpcomp
  1. INPUT:
  1. -o: output file name, bpcomp_chain. Gives a detailed comparison of the frequency of all the bipartitions observed in the chains being compared, and will output a consensus tree made by pooling all the trees of all the chains.

bpcomp -o bpcomp_chain chain1 chain2

  1. OUTPUT
  1. maxdiff: largest discrepancy observed across all bipartitions
  1. maxidff<0.1: good run
  2. maxdiff<0.3: acceptable: gives a good qualitative picture of the posterior consensus
  3. maxdiff = 1, bad: indicates at least one of the runs is stuck in a local maximum
  1. meandiff: mean discrepancy observed across all bipartitions
  2. consensus tree: this is your molecular phylogeny

9 Archaeopteryx Tree Visualizer

This will help you see if there are any rogue species appearing in places where they shouldn’t be. You should check each species in your phylogeny against the supertree, which will give you a sense of where the species ought to be. Make a note of any species that appear with the wrong species, or have unusually long branches.

  1. Download Archaeopteryx (“forester_1035.jar”): https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/
  2. Open two sessions of the Java applet for Archaeopteryx
  1. Make a duplicate of the forester_1035.jar app and open both to create two windows that can be viewed side-by-side.
  2. In one of the sessions, you will load your Super Tree; and the in the other, your tree from PhyML
  1. Navigate in the menu to: “File > Read Tree from File…”
  1. Your molecular phylogeny:  “Phylogeny/PhyML/run1/fcc_smatrix_phy_phyml_tree.txt
  2. Your grafted supertree: “Phylogeny/phylocom/run1/my_phylomatic.new” 
  1. Root each tree (see Figure 9.3)
  1. Search for “Magnolia_grandiflora”.
  2. Right click “Magnolia_grandiflora”
  3. Click “Root/Reroot”
  1. Make sure the “Click on Node to” in the lefthand menu is in “Display Node Data”
  2. Make a note of species that have unusually long branch lengths (see Figure 9.4)
  3. Make a note of species that are not found within the correct genuses

10 Fixing your phylogeny

There are a number of reasons why species are out of place or have long branches:

  1. Collected gene sequence could be wrong.
  1. e.g., matK is actually a matK-like sequence
  2. e.g., what was accidentally sequenced was a parasite or fungus on the plant
  1. Collected gene sequence could have been entered in backwards
  2. Not enough gene sequence information to properly “glue” the species into right clade/node.
  1. e.g., half the species in the genus “carex” have only matK and rbcl sequences, and the other half have only ITS1 and 5.8s. These species may not form a genus node and appear in completely different clades. To “glue” these clade together, you will need to find one or two carex species that have either/both the matK and rbcl sequences AND either/both the ITS1 and 5.8s sequences.

10.1 ClustalX

To address the aforementioned issues, the easiest way is to look at each sequence’s tree as drawn from clustalX to see which sequence has bad.

  1. For each of your sequences (e.g., matK, rbcl, 5.8s and ITS1), open an instance of the ClustalX program
  2. In ClustalX:
  1. “File > Load Sequence”
  1. Navigate to “Phylogeny/ClustalX_output/run1/”
  2. Select the corresponding .fasta file, e.g., “Phylogeny/ClustalX_output/run1/matK.fasta”
  3. Your aligned sequences should appear. Next, navigate to:
  1. “Trees > Draw Tree”
  1. A dialogue box to indicate where you want to save your drawn tree pops up.
  2. Click OK.
  1. Your “Phylogeny/ClustalX_output/run1/” folder should now have four new files corresponding to your sequence names, followed by the extension, .ph.
  2. In Archaeopteryx, load up each of the Phylip trees in your “Phylogeny/ClustalX_output/run1/” folder:
  1. Navigate in the menu to: “File > Read Tree from File…”
  1. Select each sequence’s Phylip tree, e.g., “Phylogeny/ClustalX_output/run1/5.8s.ph”
  1. You can now see which sequences for certain species might be causing the odd behaviors reflected in your molecular phylogeny. However, to figure out why your sequences might be causing you problems, use BLAST in the next section.

10.2 BLAST

BLAST can also help you identify sequences that you suspect are problematic

  1. Navigate to BLAST’s “Align Sequences Nucleotide BLAST” page: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
  2. Basic BLAST
  1. “Choose a BLAST program to run.”
  2. nucleotide blast

11 Finalizing your phylogeny

Once everything looks good in Archaeopteryx and you are SURE this is your final phylogeny run, you need to re-add the species which lacked sequence information in GenBank. You can do this either manually in Archaeopteryx, or in R. For some reason, manually adding/removing species in Archaeopteryx can result in weird trees, so be wary that you might end up having issues taking that route.

11.1 Add missing species and remove congenerics

In Archaeopteryx:

  1. Add species at the genus node level:
  1. In Archaeopteryx, search the genus name of the missing species. Your congenerics should show up in green.
  1. At the node which all the congenerics converge, right click and select “Add New Node” > “As Descendent”. A new tip should appear.  (See Figure 11.1.a.i)
  2. Right click on the new (empty) tip, and select “Edit Node Data”. A dialogue box should appear. Click, “Basic”, and under “Name”, enter the missing species’ name by double clicking the empty slot under “Name”. (See Figure 11.1.a.ii)
  1. Once you have added all the missing species to the genus node in Archaeopteryx, save the tree: “File > Save Tree As…”
  1. Navigate to the folder where you want to save the tree
  2. Make sure you specify the “Files of Type” as “New Hampshire - Newick files (*.h, *newick, *.phy, *.tree, *.dnd, *.tr, *.ph, *.phb, *.nwk)”
  1. Remove the congenerics that you had added to form the nodes used for  missing species.

In R:

library(ape); library(phytools)

#Load molecular tree file into R. This should be your tree from your final run

tr<-read.tree("Phylogeny/PhyML/run8/fcc_smatrix_phy_phyml_tree.txt")

#Function to add any species missing from your species pool into the same node as congenerics

spp2genus<-function (tree, species, genus = NULL, where = c("root", "random"))

{

    where <- where[1]

    if (is.null(genus)) {

        x <- strsplit(species, "")[[1]]

        i <- 1

        while (x[i] != "_" && x[i] != " ") i <- i + 1

        genus <- paste(x[2:i - 1], collapse = "")

    }

    ii <- grep(paste(genus, "_", sep = ""), tree$tip.label)

    if (length(ii) > 1) {

        if (!is.monophyletic(tree, tree$tip.label[ii]))

            warning(paste(genus, "may not be monophyletic\n  attaching to the most inclusive group containing members of this genus"))

        nn <- findMRCA(tree, tree$tip.label[ii])

        if (where == "root")

            tree <- bind.tip(tree, gsub(" ", "_", species), where = nn, edge.length=0.00000001)

        else if (where == "random") {

            tt <- splitTree(tree, list(node = nn, bp = tree$edge.length[which(tree$edge[,

                2] == nn)]))

            tt[[2]] <- add.random(tt[[2]], tips = gsub(" ", "_",

                species))

            tree <- paste.tree(tt[[1]], tt[[2]])

        }

        else stop("option 'where' not recognized")

    }

    else if (length(ii) == 1) {

        nn <- ii

        if (where == "root")

            tree <- bind.tip(tree, gsub(" ", "_", species), where = nn,

                position = 0.5 * tree$edge.length[which(tree$edge[,

                  2] == nn)], edge.length=0.00000001)

        else if (where == "random")

            tree <- bind.tip(tree, gsub(" ", "_", species), where = nn,

                position = runif(n = 1) * tree$edge.length[which(tree$edge[,

                  2] == nn)], edge.length=0.00000001)

        else stop("option 'where' not recognized")

    }

    else warning("could not match your species to a genus\n  check spelling, including case")

    tree

}

#Read in list of species that are supposed to be in this phylogeny (regional species pool).

spp<-read.csv("spp_list.csv",header=TRUE)

#Species list's names are supposed to be in the same format at the phylogeny, e.g., there should be a "_" between genus and species name.

spp<-gsub(" ","_",spp[,1])

#Here, we identify species that are missing from the current phylogeny that should be added in at the genus node level

missing<-sort(spp[-which(spp %in% tr$tip.label)])

#Add each missing species in at the node the formed by congenerics

for(i in missing){

        tr<-spp2genus(tr,i, where="root")

}

#Determine which species in the phylogeny are congenerics that were added into the molecular phylogeny as "glue" or to form genus nodes for missing species

rmsp<-sort(tr$tip.label[-which(tr$tip.label %in% spp)])

#Remove congenerics from the tree

tr<-drop.tip(tr,rmsp)

#root the tree

tr<-root(tr,"Equisetum_arvense",r=TRUE) #change the outgroup to your own

#Write the tree

write.tree(tr,"final_tree.new")

11.2 Ultrametricize your tree

Before converting your tree to a ultrametric tree, you need to perform a cross-validation in order to determine the smallest lambda value that minimizing the cross-validation (cv).

  1. To do this, in R you will need to test the different values of lambda with the cv values “D2” with the function “chronopl”. In the following R-script, I test lambda values from 10-15 to 1015:

library(ape); library(picante); library(parallel)

#Read tree file into R

tree<-read.tree("final_tree.new")

#Cross-validation for different values of lambda

chr<-mclapply(10^(-15:15),function(x){

        chronopl(tree,lambda=x,CV=TRUE)}, mc.cores=24)

cv<-unlist(lapply(chr,function(x)sum(attr(x,"D2"))))

lambda<-10^(-15:15)

#Create and write data frame

cvs<-data.frame(cbind(lambda,cv))

write.csv(cvs,file="cv_output.csv",row.names=FALSE)

  1. Once the cv value for each lambda is calculated, plot it out and visually determine the lambda value at which cv is minimized:

#Read in the file with Lambda and CV values into R

x<-read.csv("cv_output.csv",header=TRUE)

#Plot CV vs. Lambda

plot(seq(1,length(lambda),by=1),cv,xlab="lambda number")

x

  1. Next, convert your phylogeny to an ultrametric tree with the specified lambda value which minimizes cv (lambdamincv):

lambdamincv = 0 #Change this value to whatever lambda value your CV is minimized at

#Read tree file into R

tree<-read.tree("final_tree.new")

#Ultrametricize tree

phy<-chronopl(tree,lambda=lambdamincv)

#Write your ultrametric tree

write.tree(phy, file="molecular_tree.new")        

Figures

Figure 1.2.2


Figure 2.1.2

Figure 2.1.3


Figure 2.1.4

Figure 2.1.5

Figure 2.1.6

Figure 2.1.7

Figure 2.1.8

Figure 2.1.9

Figure 2.2.3


Figure 2.4.2

Figure 3.3b

Figure 9.3

Figure 9.4


Figure 11.1.a.i

Figure 11.1.a.ii

Scripts

Script 2.1.1

#This script queries GenBank for the sequences of interest for a list of species

setwd(choose.dir())        #Set working directory to where directory "Phylogeny/sequences" is

(sppList<-as.character(read.csv("species_list.csv",header=TRUE)[,1]))        #Open file with list of species, which should be in directory "Phylogeny"

library(XML)

esearch = function(...,

        database, history=NULL, webenv=NULL, key=NULL, tool='reutils',

        email=NULL, field=NULL, reldate=NULL, mindate=NULL, maxdate=NULL,

        datetype=NULL, retstart=NULL, retmax=NULL, retmode=NULL, rettype=NULL,

        sort=NULL, baseurl='http://eutils.ncbi.nlm.nih.gov/entrez/eutils/') {

    query = do.call(join, c(..., list(operator=' AND ', format='%s[%s]', order=2:1)))

    url = sprintf('%s/esearch.fcgi?%s',

        baseurl,

        do.call(join, as.list(

            c(db=database, term=query, usehistory=history, WebEnv=webenv,

                query_key=key, tool=tool, email=email, field=field, reldate=reldate,

                mindate=mindate, maxdate=maxdate, datetype=datetype,

                retstart=retstart, retmax=retmax, retmode=retmode, rettype=rettype,

                sort=sort))))

    response = xmlInternalTreeParse(url)

    c(

        list(identifiers=xpathSApply(response, '//IdList/Id', xmlValue)),

        lapply(

            list(count='Count', retmax='RetMax', retstart='RetStart', key='QueryKey', webenv='WebEnv'),

            function(name)

                xpathSApply(response, sprintf('/*/%s', name), xmlValue))) }

                                

join = function(..., operator='&', format='%s=%s', order=1:2) {

    values = c(...)

    if (!length(values))

        return('')

    parameters = names(values)

    paste(collapse=operator,

        mapply(

            function(parameter, value)

                if (parameter=='') value

                else do.call(sprintf, c(format, list(parameter, value)[order])),

            if (is.null(parameters)) '' else parameters,

            values)) }

genes<-c("rbcl","matK","5.8s","ITS1","internal transcribed spacer 1")

genB<-data.frame(cbind(Scientific.Name = sppList,rbcl=vector(length=length(sppList)[1],mode="character"),matK=vector(length=length(sppList),mode="character"),c5.8s=vector(length=length(sppList),mode="character"),ITS1=vector(length=length(sppList),mode="character"),ITS1.t=vector(length=length(sppList),mode="character") ))

for(i in 1:length(sppList)){

        for(n in 2:6){

        genB[i,n]<-ifelse(esearch(database="nucleotide",paste(as.character(sppList[i]),"[Organism] AND ",genes[n-1],sep=""))$count=="0","NA","")

        }

}

#Combine the two ITS1 searches

genB[which(genB$ITS1.t==""),"ITS1"]<-""

genB<-genB[,-6]

write.csv(genB,"geneSeq.csv",row.names=FALSE)

sppList2<-paste(">",gsub(" ","_",sppList),sep="",delim="/n")

rbcl<-paste(">", gsub(" ","_",sppList[which(genB$rbcl=="")]),sep="", collapse=" \n\n")

matK<-paste(">", gsub(" ","_",sppList[which(genB$matK=="")]),sep="", collapse=" \n\n")

s5.8s<-paste(">", gsub(" ","_",sppList[which(genB$c5.8s=="")]),sep="", collapse=" \n\n")

ITS1<-paste(">", gsub(" ","_",sppList[which(genB$ITS1=="")]),sep="", collapse=" \n\n")

ITS15.8s<-paste(">", gsub(" ","_",sppList[which(genB$ITS1=="" && genB$c5.8s=="")]),sep="", collapse=" \n\n")

write(rbcl,"run1/rbcl.txt")

write(matK,"run1/matK.txt")

write(s5.8s,"run1/5.8s.txt")

write(ITS1,"run1/ITS1.txt")

write(ITS15.8s,"run1/ITS15.8s.txt")

#Which ones have "NA"s across the board?

nas<-droplevels(genB[apply(genB,1,function(x) all(is.na(x[2:5]))),1] )

gen2<-list()

for(i in 1:length(nas)){

        gen2[i]<-esearch(database="nucleotide",paste(as.character(nas[i]),"[Organism]",sep=""))

        names(gen2)[i]<-as.character(nas[i])

}

tes<-names(gen2[which(sapply(names(gen2),function(x) length(gen2[[x]]))>0)])

write.csv(tes,"alternativeSeqs.csv",row.names=FALSE)

Script 6.1.2

#6.1.2

library(seqinr)

x<-read.alignment("../../3 FASconCAT/run2/FcC_smatrix.fas",format="fasta")

write(gsub("_"," ",x$nam),"submit2nix.csv")

#Take this list to http://phylodiversity.net/nix and submit with exact match and "tab-delimited spreadsheet format"

#Save file as "phylocom-4.2/run1/nix.txt"

Script 6.1.3

#6.1.3a

#Which species do not have class/order/family/genus information?

nix<-read.delim("nix.txt",header=TRUE,sep="\t",stringsAsFactors=FALSE)

nix[which(is.na(nix$MatchRank)),] #Pick out species with no matches

x<-sapply(1:dim(nix)[1], function(x)paste(nix[x, c(7,6,5)], collapse="/"))

write(x,"submit2phylomatic.txt")

write.csv(nix,"nix.txt",row.names=FALSE)

#Output a .csv file with corrected Genus and Family names for unmatched species

write.csv(data.frame(cbind(OriginalName = nix[which(is.na(nix$MatchRank)),"OriginalName"], AcceptedName =  nix[which(is.na(nix$MatchRank)),"OriginalName"], Genus = substr(nix[which(is.na(nix$MatchRank)),"OriginalName"],1,regexpr(" ",nix[which(is.na(nix$MatchRank)),"OriginalName"])-1), Family= NA)),"FindMatches.csv",row.names=FALSE)

#Search http://www.theplantlist.org and check what the accepted name of each species should be. If the name is unresolved, but has an accepted synonym: 1. Change the "AcceptedName"; 2. Change the genus; 3. add a family. If your species name is accepted, fill out the Family name.

#6.1.3c

matches<-read.csv("MatchesFound.csv",header=TRUE,stringsAsFactors=FALSE)        #read in file

nix[which(is.na(nix$MatchRank)),c("OriginalName","AcceptedName","Genus","Family")]<-matches        #replace unmatched species with new genus and families

x<-paste(nix[,7],nix[,6],gsub(" ","_",nix[,1]),sep="/")

write(x,"submit2phylomatic.txt")

write.csv(nix,"nix.txt",row.names=FALSE)

Script 6.2.2a

#6.2.2a

#Species not included in the supertree are listed at the bottom of the Phylomatic output

tree<-scan("my_phylomatic.new",what="character")

fams<-tree[6:(length(tree)-1)]        #Identify the excluded species

#Write a data frame text file of the excluded species' families

write.csv(data.frame(cbind(OldFamily = unique(substr(fams,1,regexpr("/",fams)-1)), NewFamily=NA)), "FindFamilies.csv", row.names=FALSE)

#Go to

#6.2.2c

newfams<-read.csv("FamiliesFound.csv",header=TRUE,stringsAsFactors=FALSE)

for(i in 1:dim(newfams)[1]){ nix[which(nix$Family==newfams$OldFamily[i]),]<-newfams$NewFamily[i] }

x<-paste(nix[,7],nix[,6],gsub(" ","_",nix[,1]),sep="/")

write(x,"submit2phylomatic.txt")

write.csv(nix,"nix.txt",row.names=FALSE)