Comprehensive Guide to Phylogeny Building
Written by Lanna Jin
1 Prior to sequence collecting
1.2 Querying GenBank for species/sequences found in GenBank
2 Compiling gene sequences for species
2.1 Collecting sequences in GenBank
2.2 Dealing with Missing Species
2.3 Teasing apart ITS1 and 5.8s from combined sequences
3.1 Aligning sequences in ClustalX2
4 Concatenating Sequences into a Super Matrix with FASconCAT
5 Model testing in jModelTest2
7 PhyML 3.0: Maximum Likelihood Phylogeny
8 PhyloBayes-MPI: Bayesian MCMC Sampler
9 Archaeopteryx Tree Visualizer
After you’ve collected all the sequences, there may be some species that are not found within GenBank for any of the four sequences (rbcl, matK, 5.8s and ITS1).
Some species will only have sequences for ITS1 and 5.8s combined with each other, and sometimes other sequences like 18S and ITS2, e.g., Asclepias incarnata, with the ascension code DQ005967.1. These sequences should be collected within the text file: “sequences/run1/ITS15.8s.txt”. Use a congeneric species (e.g., Asclepias syriaca, AM396892.1) with separated 18S, ITS1, 5.8s and ITS1 values in order to extract a species’ ITS1 and 5.8s values. Once the ITS1 and 5.8s sequences have been teased apart for each species, add them, respectively, to the “sequences/run1/ITS1.txt” and “sequences/run1/5.8s.txt” text files.
Copying sequences directly from GenBank will cause your sequence files to have line breaks. For each of your sequence text files, you will need to eliminate the line breaks such that the sequences are all in one line.
If you are using Notepad++:
If duplicates detected use the following lines of code in R to determine duplicated species:
library(seqinr) #If this is not installed, use command: "install.packages("seqinr")"
a<-read.alignment("matK.txt",format="fasta") b<-read.alignment("rbcl.txt",format="fasta") c<-read.alignment("ITS1.txt",format="fasta") d<-read.alignment("5.8s.txt",format="fasta")
a$nam[which(duplicated(a$nam))] #lists duplicated species in matK.txt b$nam[which(duplicated(b$nam))] #lists duplicated species in rbcl.txt c$nam[which(duplicated(c$nam))] #lists duplicated species in ITS1.txt d$nam[which(duplicated(d$nam))] #lists duplicated species in 5.8s.txt |
Manually remove duplicates from the sequence text files.
cd "Phylogeny\FASconCAT\run1" FASconCAT_v1.0.pl p p s |
p p s |
jModelTest will help you figure out what the best model to build your phylogeny with is. Since it requires a lot of computing and takes a while, you only really need to do it a couple times--mainly for your first and last run.
Here, you will graft a phylogeny of your species from a “Super Tree”. The resulting tree (which we’ll call your “super tree”) will be useful as a means of comparison in later steps to see whether the sequences you’ve collected are incorrect. It can also be used as a backbone (starting tree) for building your molecular phylogeny.
You can build your tree either using Maximum Likelihood or Bayesian MCMC sampler method. If you are interested in building a Maximum likelihood phylogeny...
You can build your tree either using Maximum Likelihood or Bayesian MCMC sampler method. If you are interested in building a Bayesian Phylogeny...
cd "Phylogeny/pb_mpi1.5a/sources" make -f Makefile |
setwd("Phylogeny/pb_mpi1.5a/data") library(ape) library(seqinr) #Load trees ini<-read.tree("prior.new") con<-read.alignment("FcC_smatrix.fas", "fasta") #Make sure same number of species in prior tree and in concatenated sequence file length(ini$tip.label)==length(tolower(con$nam)) #Are there any duplicated names in any of the files? ini$tip.label[which(duplicated(ini$tip.label))] con$nam[which(duplicated(con$nam))] #Are all the names in both the files the same? all(sort(ini$tip.label) == sort((con$nam))) #Are all the sequences the same length? length(unique(unlist(lapply(1:con$nb, function(x) nchar(con$seq[[x]][1])))))==1 |
######Formatting the Prior Tree #Remove node labels ini$node.label<-NULL #Unroot prior tree if(is.rooted(ini)){ini<-unroot(ini)} #Convert tree to ultrametric tree if(is.ultrametric(ini)){ x.ult<-chronopl(ini,lambda=0.1) } #Resolve polytomies randomly x.ult<-multi2di(x.ult,random=TRUE) write.tree(x.ult, file="prior.new") |
cd "Phylogeny/pb_mpi1.5a/data" nohup mpirun -n 2 pb_mpi -d FcC_smatrix.phylip -t prior.new -x 10 100000 -S chain1 & nohup mpirun -n 2 pb_mpi -d FcC_smatrix.phylip -t prior.new -x 10 100000 -S chain2 & |
bpcomp -o bpcomp_chain chain1 chain2 |
This will help you see if there are any rogue species appearing in places where they shouldn’t be. You should check each species in your phylogeny against the supertree, which will give you a sense of where the species ought to be. Make a note of any species that appear with the wrong species, or have unusually long branches.
There are a number of reasons why species are out of place or have long branches:
To address the aforementioned issues, the easiest way is to look at each sequence’s tree as drawn from clustalX to see which sequence has bad.
BLAST can also help you identify sequences that you suspect are problematic
Once everything looks good in Archaeopteryx and you are SURE this is your final phylogeny run, you need to re-add the species which lacked sequence information in GenBank. You can do this either manually in Archaeopteryx, or in R. For some reason, manually adding/removing species in Archaeopteryx can result in weird trees, so be wary that you might end up having issues taking that route.
11.1 Add missing species and remove congenerics
In Archaeopteryx:
In R:
library(ape); library(phytools) #Load molecular tree file into R. This should be your tree from your final run tr<-read.tree("Phylogeny/PhyML/run8/fcc_smatrix_phy_phyml_tree.txt") #Function to add any species missing from your species pool into the same node as congenerics spp2genus<-function (tree, species, genus = NULL, where = c("root", "random")) { where <- where[1] if (is.null(genus)) { x <- strsplit(species, "")[[1]] i <- 1 while (x[i] != "_" && x[i] != " ") i <- i + 1 genus <- paste(x[2:i - 1], collapse = "") } ii <- grep(paste(genus, "_", sep = ""), tree$tip.label) if (length(ii) > 1) { if (!is.monophyletic(tree, tree$tip.label[ii])) warning(paste(genus, "may not be monophyletic\n attaching to the most inclusive group containing members of this genus")) nn <- findMRCA(tree, tree$tip.label[ii]) if (where == "root") tree <- bind.tip(tree, gsub(" ", "_", species), where = nn, edge.length=0.00000001) else if (where == "random") { tt <- splitTree(tree, list(node = nn, bp = tree$edge.length[which(tree$edge[, 2] == nn)])) tt[[2]] <- add.random(tt[[2]], tips = gsub(" ", "_", species)) tree <- paste.tree(tt[[1]], tt[[2]]) } else stop("option 'where' not recognized") } else if (length(ii) == 1) { nn <- ii if (where == "root") tree <- bind.tip(tree, gsub(" ", "_", species), where = nn, position = 0.5 * tree$edge.length[which(tree$edge[, 2] == nn)], edge.length=0.00000001) else if (where == "random") tree <- bind.tip(tree, gsub(" ", "_", species), where = nn, position = runif(n = 1) * tree$edge.length[which(tree$edge[, 2] == nn)], edge.length=0.00000001) else stop("option 'where' not recognized") } else warning("could not match your species to a genus\n check spelling, including case") tree } #Read in list of species that are supposed to be in this phylogeny (regional species pool). spp<-read.csv("spp_list.csv",header=TRUE) #Species list's names are supposed to be in the same format at the phylogeny, e.g., there should be a "_" between genus and species name. spp<-gsub(" ","_",spp[,1]) #Here, we identify species that are missing from the current phylogeny that should be added in at the genus node level missing<-sort(spp[-which(spp %in% tr$tip.label)]) #Add each missing species in at the node the formed by congenerics for(i in missing){ tr<-spp2genus(tr,i, where="root") } #Determine which species in the phylogeny are congenerics that were added into the molecular phylogeny as "glue" or to form genus nodes for missing species rmsp<-sort(tr$tip.label[-which(tr$tip.label %in% spp)]) #Remove congenerics from the tree tr<-drop.tip(tr,rmsp) #root the tree tr<-root(tr,"Equisetum_arvense",r=TRUE) #change the outgroup to your own #Write the tree write.tree(tr,"final_tree.new") |
Before converting your tree to a ultrametric tree, you need to perform a cross-validation in order to determine the smallest lambda value that minimizing the cross-validation (cv).
library(ape); library(picante); library(parallel) #Read tree file into R tree<-read.tree("final_tree.new") #Cross-validation for different values of lambda chr<-mclapply(10^(-15:15),function(x){ chronopl(tree,lambda=x,CV=TRUE)}, mc.cores=24) cv<-unlist(lapply(chr,function(x)sum(attr(x,"D2")))) lambda<-10^(-15:15) #Create and write data frame cvs<-data.frame(cbind(lambda,cv)) write.csv(cvs,file="cv_output.csv",row.names=FALSE) |
#Read in the file with Lambda and CV values into R x<-read.csv("cv_output.csv",header=TRUE) #Plot CV vs. Lambda plot(seq(1,length(lambda),by=1),cv,xlab="lambda number") x |
lambdamincv = 0 #Change this value to whatever lambda value your CV is minimized at #Read tree file into R tree<-read.tree("final_tree.new") #Ultrametricize tree phy<-chronopl(tree,lambda=lambdamincv) #Write your ultrametric tree write.tree(phy, file="molecular_tree.new") |
#This script queries GenBank for the sequences of interest for a list of species setwd(choose.dir()) #Set working directory to where directory "Phylogeny/sequences" is (sppList<-as.character(read.csv("species_list.csv",header=TRUE)[,1])) #Open file with list of species, which should be in directory "Phylogeny" library(XML) esearch = function(..., database, history=NULL, webenv=NULL, key=NULL, tool='reutils', email=NULL, field=NULL, reldate=NULL, mindate=NULL, maxdate=NULL, datetype=NULL, retstart=NULL, retmax=NULL, retmode=NULL, rettype=NULL, sort=NULL, baseurl='http://eutils.ncbi.nlm.nih.gov/entrez/eutils/') { query = do.call(join, c(..., list(operator=' AND ', format='%s[%s]', order=2:1))) url = sprintf('%s/esearch.fcgi?%s', baseurl, do.call(join, as.list( c(db=database, term=query, usehistory=history, WebEnv=webenv, query_key=key, tool=tool, email=email, field=field, reldate=reldate, mindate=mindate, maxdate=maxdate, datetype=datetype, retstart=retstart, retmax=retmax, retmode=retmode, rettype=rettype, sort=sort)))) response = xmlInternalTreeParse(url) c( list(identifiers=xpathSApply(response, '//IdList/Id', xmlValue)), lapply( list(count='Count', retmax='RetMax', retstart='RetStart', key='QueryKey', webenv='WebEnv'), function(name) xpathSApply(response, sprintf('/*/%s', name), xmlValue))) }
join = function(..., operator='&', format='%s=%s', order=1:2) { values = c(...) if (!length(values)) return('') parameters = names(values) paste(collapse=operator, mapply( function(parameter, value) if (parameter=='') value else do.call(sprintf, c(format, list(parameter, value)[order])), if (is.null(parameters)) '' else parameters, values)) } genes<-c("rbcl","matK","5.8s","ITS1","internal transcribed spacer 1") genB<-data.frame(cbind(Scientific.Name = sppList,rbcl=vector(length=length(sppList)[1],mode="character"),matK=vector(length=length(sppList),mode="character"),c5.8s=vector(length=length(sppList),mode="character"),ITS1=vector(length=length(sppList),mode="character"),ITS1.t=vector(length=length(sppList),mode="character") )) for(i in 1:length(sppList)){ for(n in 2:6){ genB[i,n]<-ifelse(esearch(database="nucleotide",paste(as.character(sppList[i]),"[Organism] AND ",genes[n-1],sep=""))$count=="0","NA","") } } #Combine the two ITS1 searches genB[which(genB$ITS1.t==""),"ITS1"]<-"" genB<-genB[,-6] write.csv(genB,"geneSeq.csv",row.names=FALSE) sppList2<-paste(">",gsub(" ","_",sppList),sep="",delim="/n") rbcl<-paste(">", gsub(" ","_",sppList[which(genB$rbcl=="")]),sep="", collapse=" \n\n") matK<-paste(">", gsub(" ","_",sppList[which(genB$matK=="")]),sep="", collapse=" \n\n") s5.8s<-paste(">", gsub(" ","_",sppList[which(genB$c5.8s=="")]),sep="", collapse=" \n\n") ITS1<-paste(">", gsub(" ","_",sppList[which(genB$ITS1=="")]),sep="", collapse=" \n\n") ITS15.8s<-paste(">", gsub(" ","_",sppList[which(genB$ITS1=="" && genB$c5.8s=="")]),sep="", collapse=" \n\n") write(rbcl,"run1/rbcl.txt") write(matK,"run1/matK.txt") write(s5.8s,"run1/5.8s.txt") write(ITS1,"run1/ITS1.txt") write(ITS15.8s,"run1/ITS15.8s.txt") #Which ones have "NA"s across the board? nas<-droplevels(genB[apply(genB,1,function(x) all(is.na(x[2:5]))),1] ) gen2<-list() for(i in 1:length(nas)){ gen2[i]<-esearch(database="nucleotide",paste(as.character(nas[i]),"[Organism]",sep="")) names(gen2)[i]<-as.character(nas[i]) } tes<-names(gen2[which(sapply(names(gen2),function(x) length(gen2[[x]]))>0)]) write.csv(tes,"alternativeSeqs.csv",row.names=FALSE) |
#6.1.2 library(seqinr) x<-read.alignment("../../3 FASconCAT/run2/FcC_smatrix.fas",format="fasta") write(gsub("_"," ",x$nam),"submit2nix.csv") #Take this list to http://phylodiversity.net/nix and submit with exact match and "tab-delimited spreadsheet format" #Save file as "phylocom-4.2/run1/nix.txt" |
#6.1.3a #Which species do not have class/order/family/genus information? nix<-read.delim("nix.txt",header=TRUE,sep="\t",stringsAsFactors=FALSE) nix[which(is.na(nix$MatchRank)),] #Pick out species with no matches x<-sapply(1:dim(nix)[1], function(x)paste(nix[x, c(7,6,5)], collapse="/")) write(x,"submit2phylomatic.txt") write.csv(nix,"nix.txt",row.names=FALSE) #Output a .csv file with corrected Genus and Family names for unmatched species write.csv(data.frame(cbind(OriginalName = nix[which(is.na(nix$MatchRank)),"OriginalName"], AcceptedName = nix[which(is.na(nix$MatchRank)),"OriginalName"], Genus = substr(nix[which(is.na(nix$MatchRank)),"OriginalName"],1,regexpr(" ",nix[which(is.na(nix$MatchRank)),"OriginalName"])-1), Family= NA)),"FindMatches.csv",row.names=FALSE) #Search http://www.theplantlist.org and check what the accepted name of each species should be. If the name is unresolved, but has an accepted synonym: 1. Change the "AcceptedName"; 2. Change the genus; 3. add a family. If your species name is accepted, fill out the Family name. #6.1.3c matches<-read.csv("MatchesFound.csv",header=TRUE,stringsAsFactors=FALSE) #read in file nix[which(is.na(nix$MatchRank)),c("OriginalName","AcceptedName","Genus","Family")]<-matches #replace unmatched species with new genus and families x<-paste(nix[,7],nix[,6],gsub(" ","_",nix[,1]),sep="/") write(x,"submit2phylomatic.txt") write.csv(nix,"nix.txt",row.names=FALSE) |
#6.2.2a #Species not included in the supertree are listed at the bottom of the Phylomatic output tree<-scan("my_phylomatic.new",what="character") fams<-tree[6:(length(tree)-1)] #Identify the excluded species #Write a data frame text file of the excluded species' families write.csv(data.frame(cbind(OldFamily = unique(substr(fams,1,regexpr("/",fams)-1)), NewFamily=NA)), "FindFamilies.csv", row.names=FALSE) #Go to #6.2.2c newfams<-read.csv("FamiliesFound.csv",header=TRUE,stringsAsFactors=FALSE) for(i in 1:dim(newfams)[1]){ nix[which(nix$Family==newfams$OldFamily[i]),]<-newfams$NewFamily[i] } x<-paste(nix[,7],nix[,6],gsub(" ","_",nix[,1]),sep="/") write(x,"submit2phylomatic.txt") write.csv(nix,"nix.txt",row.names=FALSE) |