MIAPA checklist input form

	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S	T	U	V	W	X	Y	Z	AA	AB	AC	AD	AE	AF	AG	AH	AI
1	Timestamp	annotator	Topology: Is this a gene tree or species tree?	Topology: Is this topology rooted or not?	Alignment Method: name of software used, version of program	Tree Inference Method: parameters used, including model of evolution, and optimality criterion	Tree Inference Method: character weights if (normally then morphological) characters were weighted.	Tree Inference Method: name of software used, version of program	Alignment Method: parameters used (or default if default values were used).	Alignment Method: whether alignment was manually corrected or edited	Character matrix: Data type must be provided, for example DNA, RNA, protein, morphology, etc.	Character matrix: For molecular matrices, the accession numbers (and respective database(s) if different from Genbank) of the sequences used for each row must be provided	Character matrix: a mapping that relates each row identifier to a tip of the topology	Character matrix: a mapping that relates each accession number or specimen identifier to a row label	OTUs: A meaningful external identifier (a combination of database or resource and identifier/accession within that database).	OTUs: For specimens, museum, collection (if applicable), and specimen identifier.	OTUs: Precise (GPS) georeferences for specimens are highly desirable (but not always available).	Branch lengths: Some measure of branch length required unless it is not applicable to the analysis method.	Branch support: Some value of branch support should be provided, for example posterior probability, or bootstrap value, unless it is not applicable to the analysis method.	
Topology: database record for tree if available	a local name or identifier for this tree	Publication: author list	Publication: year	Publication: title	Publication: citation (free text)	Publication: DOI	Other information for the purposes of this study	Topology URI	Character matrix URI	Other information on the location of data resources	Topology: nature of topology as representation of inference method	Other annotations of the tree as a whole	Other annotations of the character matrix as a whole	Other annotations of the OTUs	Other annotations of the branches

2	1/29/2013 11:00:00	Andrea	species tree	rooted	n/a	parameters are not available in the publication	not applicable	PAUP v4.0b10	n/a	n/a	This is a divide-and-conquer supertree based on earlier studies (largely Beck et al., 2006) that used DNA and morphological characters. From the supplement: "All supertrees were constructed using standard, unweighted matrix representation with parsimony (MRP 30,31), where the topologies of the source trees were converted into a partial binary matrix: species descended from a given node were coded as 1; those that were not, but were present on the tree were coded as 0; and all other taxa were coded as ? for that source tree. Except for the Beck et al. 49, Muridae, Perissodactyla, and Primates supertrees, a hypothetical all-zero outgroup was added to each matrix consisting of the concatenated matrix representations of the source trees; the former four analyses instead used a semirooted form of MRP 59, where only robustly rooted source trees were rooted with an all-zero outgroup; otherwise, the outgroup received ‘?’. All matrices were analyzed using a parsimony criterion with the search strategy being tailored to the size of the matrix. For the larger groups, we used the parsimony ratchet 60 to facilitate a more efficient search of tree space. Where appropriate, we also used safe taxonomic reduction 61 as implemented in the Perl script PerlEQ v1.0.x (Jeffries and Wilkinson, unpubl.) to identify poorly known species that would contribute to substantial loss of local resolution. We used the results of this analysis as a guide for pruning potentially problematic species from the source trees before subsequent recoding and re-analysis of the MRP matrices (following ref. 62) to improve resolution. In all cases, the final tree for each group was a strict consensus of all equally most parsimonious trees. Additional detail on individual search strategies can be found in the respective publications."	
n/a	n/a	n/a		Collections information is not provided	Georeference information is not provided	Branch lengths are applicable and provided.	Branch support values are not applicable.		Mammals supertree	Bininda-Emonds; Cardillo; Jones; MacPhee; Beck; Grenyer; Price; Vos; Gittleman; Purvis	2007	The delayed rise of present-day mammals.	Bininda-Emonds, O. R. P., Cardillo, M., Jones, K. E., MacPhee, R. D. E., Beck, R. M. D., Grenyer, R., Price, S. A., et al. (2007). The delayed rise of present-day mammals. Nature, 446(7135), 507–12. doi:10.1038/nature05634	10.1038/nature05634				No character matrix because it's a supertree; supplementary data is available at http://www.nature.com/nature/journal/v446/n7135/suppinfo/nature05634.html There's a substantial supplementary file on methods located here: http://www.nature.com/nature/journal/v446/n7135/extref/nature05634-s1.pdf	consensus tree	From the paper: "The supertree was constructed in a hierarchical framework, combining pre-existing supertrees for Carnivora, Chiroptera, ‘Insectivora’ (split into Afrosoricida and Eulipotyphla) and Lagomorpha with new ones for the remaining groups, including the base supertree of all extant families (see Supplementary Table 1). All new supertrees were built using an explicit source tree collection protocol29 to minimize both data duplication (for example, where the same data set underlies more than one source tree) and the inclusion of source trees of lesser quality (for example, taxonomies or those based on appeals to authority). Species names in the source trees were standardized to those found in ref. 23, and extinct taxa (following the 2004 IUCN Red List; http://www.redlist.org) were pruned from the final supertree. All supertrees were obtained using Matrix Representation with Parsimony (MRP30,31), with the parsimony analyses for the new supertrees being performed in PAUP* v4.0b10 (ref. 32)." 
The Beck et al. tree that this tree uses as a foundation used both DNA and morphological characters	A list of source trees is available in Supplementary file 1, Table 1.	The authors use Wilson, D. E. & Reeder, D. M. (eds) Mammal Species of the World: a Taxonomic and Geographic Reference (Smithsonian Institution Press, Washington, 1993) as their namespace.	This is an ultrametric tree; branch lengths represent time; see methods on pages 510-511 for more explanation
3	1/29/2013 11:29:09	Enrico	species tree	rooted	N/A	Not described, most likely manually curated		Unknown			Not Applicable				Meaningful external identifiers are not provided	Not Applicable	Georeference information is not provided	Branch lengths are not applicable.	Branch support values are not applicable.		APGIII	Birgitta Bremer, Kåre Bremer, Mark W. Chase, Michael F. Fay, James L. Reveal, Douglas E. Soltis, Pamela S. Soltis, Peter F. Stevens	2009	An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III	Botanical Journal of the Linnean Society, 2009, 161, 105–121		A revised and updated classification for the families of flowering plants is provided	http://phylotastic.org/hack2/arbitrary_hash/APGIII#apgiii			Revised and Updated classification of flowers and plants				This is a phylogenetically informed taxonomy framework.
4	1/29/2013 13:32:44	Arlin	species tree	rooted	MAFFT v6.712b	from the publication section "Tree reconstruction." "Phylogenetic inference of subset 1 and of subset 2 was done under the maximum likelihood (ML) optimality criterion in partitioned analyses with RAxML 7.2.8 [42,43] under the GTRCAT model. Analyses were computed on HPC Linux clusters, 8 nodes with 12 cores each, at the Regionales Rechenzentrum Köln (RRZK) using Cologne High Efficient Operating Platform for Science (CHEOPS); input was done in phylip format; and conversion of Fasta to phylip was done using Readseq [44] [XVII]. Nuclear coding genes were treated as one partition (PROTCAT model, substitution matrix LG + F, taken from ProtTest [45]). All other groups of orthologs were treated as separate partitions (32 partitions in total). (See Additional file 4 for the character partitions of subset 1 and 2.) We applied the rapid bootstrap algorithm [46] with a subsequent tree search. The numbers of bootstrap replicates were estimated on the fly by the "bootstopping" criteria implemented in RAxML 7.2.8 (default settings) [47]. The analyses yielded two trees. These trees are referred to as "tree 1" (corresponding to subset 1) and "tree 2" (corresponding to subset 2). Trees were edited in Dendroscope [48] [XVIII]."	Apparently no character weights were applied. Note, however, that the alignment was masked as described under Alignment Method.	RAxML 7.2.8	from the publication section "Multiple sequence alignment and alignment masking": "Orthologous sequences were aligned with MAFFT v6.712b using the auto option [IV]. Depending on the size of an alignment, MAFFT automatically chooses a suitable alignment option, such as L-INS-i for < 200 sequences and FFT-NS-2 for > 2,000 sequences [33,34]. All alignments were subsequently refined with the refinement option in MUSCLE version 3.7 [35] [V]. These are powerful alignment tools that allow processing very large data sets in reasonable time. 
Steps II through VI of our pipeline are automatically consecutively executed when using the script batch2_IItoVI.sh. (See the manual of batch scripts for details.) Aligned and refined mitochondrial amino acid sequences were then translated back into nucleotide sequences with the aid of the script aa2dna, which uses the corresponding reading frame information from the GenBank file [VI]. From this point on, we proceeded with nucleotide sequences for all mitochondrial sequences and nuclear noncoding sequences, as well as with amino acid sequences for the nuclear coding sequences (available since step [a.III]). Ambiguously aligned or highly diverged regions of the alignment were masked with three different algorithms [VII]. We applied ALISCORE [36,37] and ALICUT [38] for noncoding nucleotide sequences and for nuclear amino acid sequences (default settings). Since the multiple sequence alignment of 28S rRNA was too big to be processed with ALISCORE, we used Gblocks 0.91b [39,40] for 28S instead (block parameter settings: (1) number of included seq/2 = 1020, (2) 1020, (3) 5, (4) 10, and (5) all). Finally, we used the script gapkiller to identify and delete sites with more than 70% gaps in coding mitochondrial sequences. Then we masked all third codon positions of mitochondrial coding sequences [VIII] and concatenated all tRNA alignments to one single alignment."	not manually corrected		GenBank accession numbers are provided	the mapping is implicit	the mapping is not provided		Not relevant	Not relevant	Branch lengths are applicable and provided.	Branch support values are applicable and provided.		
Peters_2011_hymenoptera	Ralph S Peters, Benjamin Meyer, Lars Krogmann, Janus Borner, Karen Meusemann, Kai Schütte, Oliver Niehuis and Bernhard Misof	2011	The taming of an impossible child: a standardized all-in approach to the phylogeny of Hymenoptera using public database sequences	BMC Biology 2011, 9:55	10.1186/1741-7007-9-55	Tree was obtained by personal communication from the first author to Arlin Stoltzfus, 29 Jan, 2013.	http://www.evoio.org/wiki/File:Tree_1_Peters_et_al.tre	not available		most likely tree by incomplete search	We infer that the tree is rooted by an outgroup method. Outgroups are specified "Sequence data retrieval and data processing" but they do not appear in the tree. The authors say that the tree was "edited" in Dendroscope, but do not explain what was the nature of the editing. We infer that the authors removed the outgroups from the final tree.	We assume in this case that the row labels match the tree labels. GenBank accession numbers are provided, but they are not mapped to species names.	Binomials are provided in some cases. In other cases, we have <genus>_sp. Based on reading the section "Species and sequence subset selection", we are unable to determine what these entities represent. However, based on NCBI taxonomy searches, we find that these names correspond to entities in GenBank with incomplete names, e.g., Sania_sp_5 is presumably NCBI's "Sania sp. 5 JCB-2006".	See methods for the meaning of support values.
5	1/29/2013 14:06:54	Andrea	species tree	rooted	method follows that described by Smith et al., 2009	Phylogenetic trees were inferred using the Pthreads-based and SSE3-vectorized RAxML (Stamatakis, 2006b) version 7.2.6. The post-analysis steps (consensus tree building, evaluating the final trees under the GTR+GAMMA model etc.) were carried out with RAxML v7.2.7. We used the standard RAxML search algorithm with the asymptotic stopping rule and the low memory consumption flag (-F and -D options) to infer 223 ML trees on the original alignment under the GTR+CAT approximation of rate heterogeneity (Stamatakis, 2006a) and a partitioned model (we estimated the GTR and alpha parameters separately for each gene) with a joint branch length estimate. The usage of the GAMMA model of rate heterogeneity was not possible on all multi-core systems we used for the analysis because of memory limitations (a run under GTR+CAT required approximately 30GB of main memory, a run under GAMMA requires approximately four times more memory). Branch lengths and likelihood scores under GTR+GAMMA for all 223 ML trees were computed using the -f n option. We also inferred 244 bootstrap trees using the RAxML rapid bootstrap algorithm (Stamatakis et al., 2008). We then plotted BS support values onto the best-scoring ML tree and also computed strict, majority-rule, and extended majority rule consensus trees for the bootstrap replicates and the ML trees on the original alignment. We also applied the bootstopping (bootstrap convergence) tests (Pattengale et al., 2010) a posteriori to the bootstrap trees. The test indicated that an insufficient number of BS replicates has been computed to guarantee stable support values. Finally we computed pair-wise Robinson Foulds (RF) distances between all ML trees (average relative RF: 21.79%) and all bootstrap trees (average relative RF: 53.32%).	
n/a	RAxML v7.2.7	from the paper, pg 408: "Once sequences are identified to belong to the gene regions of interest, saturation analyses are conducted comparing uncorrected genetic distances to corrected distances. If alignments appear to be saturated, the alignments are broken up using prior phylogenetic knowledge (classification systems) as guides, and separate alignments are carried out for the individual groups delimited in this way. These individual alignments are then aligned together using profile-to-profile alignment techniques (Edgar, 2004). Our final concatenated data set included 55 473 species and 9853 aligned sites (Appendix S1; see Supplemental Data with the online version of this article)."	not manually corrected	DNA	accession numbers are not provided	n/a, no character matrix	n/a, no character matrix	binomials, some with citations (those with sp)	Collections information is not provided	Georeference information is not provided	Branch lengths are applicable but not provided.	Branch support values are applicable but not provided.		Smith et al 2011 Angiosperms	Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ	2011	Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 98(3): 404-414. doi:10.3732/ajb.1000481	Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ (2011) Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 98(3): 404-414. doi:10.3732/ajb.1000481	doi:10.3732/ajb.1000481	Builds on APG III tree	http://dx.doi.org/10.5061/dryad.8790	n/a	Supplemental files published with paper located at: http://www.amjbot.org/content/98/3/404/suppl/DC1 Supplemental files on Dryad: http://datadryad.org/handle/10255/dryad.8790	consensus tree	from the paper: "The data set was assembled using the methods described in Smith et al. (2009), as implemented in the PHLAWD program."
6	1/29/2013 14:08:54	Arlin	species tree	rooted	(not applicable)	The tree topology is the product of a community of hundreds of contributors (curators) that manage particular branches of the tree. There is no fixed method for inferring trees.	Not applicable.	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	Meaningful external identifiers are not provided	(not applicable)	(not applicable)	Branch lengths are applicable but not provided.	Branch support values are applicable but not provided.		ToLWeb2006	David R. Maddison, Katja-Sabine Schulz, Wayne P. Maddison	2007	The Tree of Life Web Project	Zootaxa 1668: 19–40		The tree was downloaded from ToLWeb in 2012, but apparently this version is from 2006, which is why we have named it ToLWeb2006.	http://www.evoio.org/wiki/File:TOL.xml.zip		This information is available in interactive form at http://www.tolweb.org .	See Methods below			Binomials are provided, but not references to an external namebank.	This is promoted as a phylogeny, therefore branch lengths and support values are appropriate. However, the tree lacks these features.
7	1/29/2013 14:13:29	Enrico	species tree	rooted	MAFFT and PRANK	Pseudo-posterior samples of complete avian trees were assembled as follows. (1) Every bird species was assigned to one of 158 clades identified using a backbone phylogeny27. (2) Relaxed-clock trees were generated for each clade from sequence data. (3) Relaxed-clock trees for entire clades were generated combining species with and without genetic data: species without genetic information (3,330) were placed within their clade using constraint structures consistent with consensus trees from step (2) plus taxonomic information and branching times sampled from a pure birth model of diversification. (4) Final trees were assembled from the clade distributions plus samples of dated backbone trees from (one of two) distributions constructed using relaxed molecular clock techniques, 15 genes, ten fossil constraints and extensive topology constraints derived from published sources.		multiple methods		manually corrected	DNA	GenBank accession numbers are provided	the mapping is provided	the mapping is provided	Meaningful external identifiers are not provided	Collections information is not provided	Georeference information is not provided	Branch lengths are applicable and provided.	Branch support values are applicable and provided.			Jetz, W., Thomas, G. H., Joy, J. B., Hartmann, K., Mooers, A. O.	2012	The global diversity of birds in space and time	Nature, Vol. 491:444, 2012	doi:10.1038					Start from backbone phylogeny				height_median, height, height_95%_HPD, length, height_range, posterior
8	1/29/2013 14:34:13	Ramona	species tree	rooted	AMPHORA	A maximum likelihood tree was then constructed from the concatenated-alignment using PHYML10. The model selected based on the likelihood ratio test was the WAG model of amino acid substitution with γ-distributed rate variation (five categories) and a proportion of invariable sites. The shape of the γ-distribution and the proportion of the invariable sites were estimated by the program.	no weights given	PHYML	No parameters listed in publication.	not manually corrected	protein	accession numbers are not provided	the mapping is provided	no accession numbers	Meaningful external identifiers are not provided	Collections information is not provided	Georeference information is not provided	Branch lengths are applicable and provided.	Branch support values are applicable but not provided.		Genomic Encyclopedia of Bacteria and Archaea tree	Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., Kunin V., Goodwin L., Wu M., Tindall B.J., Hooper S.D., Pati A., Lykidis A., Spring S., Anderson I.J., D’haeseleer P., Zemla A., Singer M., Lapidus A., Nolan M., Copeland A., Chen F., Cheng J., Lucas S., Kerfeld C., Lang E., Gronow S., Chain P., Bruce D., Rubin E.M., Kyrpides N.C., Klenk H., Eisen J.A.	2009	A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea	Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., Kunin V., Goodwin L., Wu M., Tindall B.J., Hooper S.D., Pati A., Lykidis A., Spring S., Anderson I.J., D’haeseleer P., Zemla A., Singer M., Lapidus A., Nolan M., Copeland A., Chen F., Cheng J., Lucas S., Kerfeld C., Lang E., Gronow S., Chain P., Bruce D., Rubin E.M., Kyrpides N.C., Klenk H., & Eisen J.A. 2009. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462(7276): 1056-1060.	DOI: 10.1038/nature08656				Will add files to project github site. 
TreeBASE: http://purl.org/phylo/treebase/phylows/study/TB2:S10965	most likely tree by incomplete search	A maximum likelihood phylogenetic tree for bacterial genomes was built upon a concatenated alignment of 31 phylogenetic marker genes. We included 53 GEBA bacteria and 667 bacterial complete genomes from Genbank for the tree building. Protein sequences for 31 phylogenetic marker genes (dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, and tsf) were retrieved, aligned, trimmed, and concatenated using the software AMPHORA9. A maximum likelihood tree was then constructed from the concatenated-alignment using PHYML10. The model selected based on the likelihood ratio test was the WAG model of amino acid substitution with γ-distributed rate variation (five categories) and a proportion of invariable sites. The shape of the γ-distribution and the proportion of the invariable sites were estimated by the program.	Many taxon names have strain IDs as well as binomials.	Species names as binomials are provided but not mapped to a database. No information is provided for collection numbers or georeferences, but there are strain numbers for some of the taxa.
9	1/29/2013 15:33:22	Ramona	species tree	not rooted	SINA, as implemented by the SILVA		No weighting, but: "To exclude positions where positional orthology could not be guaranteed in the alignment, three filter sets were applied to remove positions where the highest occurring base was conserved at less than 30%, 40% and 50% (Table 2)."	RAxML version 7.x	no parameters given in publication	manually corrected	RNA	non-GenBank accession numbers are provided	mapping is implicit because both binomial and accession number are included in the topology	implicit because row names are the same as tip names	binomials are included in the row names and topology	There is an internal accession number as part of each species name	Georeference information is not provided	Branch lengths are applicable and provided.	Branch support values are applicable but not provided.		all species 16S	Pablo Yarza, Michael Richter, Jörg Peplies, Jean Euzéby, Rudolf Amann, Karl-Heinz Schleifer, Wolfgang Ludwig, Frank Oliver Glöckner, Ramon Rosselló-Móra	2008	The All-Species Living Tree project: A 16S rRNA-based phylogenetic tree of all sequenced type strains	P. Yarza, et al., The All-Species Living Tree project: A 16S rRNA-based phylogenetic tree of all sequenced type strains, Syst. Appl. Microbiol. (2008), doi:10.1016/j.syapm.2008.07.001	doi:10.1016/j.syapm.2008.07.001				Will upload tree and matrix to github. Tree is regularly updated. Publication is from 2008, but most recent tree was released in July 2012. See http://www.arb-silva.de/projects/living-tree/ for link to latest tree and methods.	most likely tree by incomplete search	The visual images of the tree do not show a root, and the publication/website do not specify any rooting method. Sequences had been automatically aligned by SINA, as implemented by the SILVA database project [24]. Briefly, the system searches for the closest relatives in a set of 51,601 manually curated SSU sequences (Seed). 
Up to 40 related sequences are then used as references for the alignment of the sequence under investigation. Although the process is highly accurate, some of the bases usually escape optimal placement according to biological criteria. The complete dataset of 9975 sequences (type strains and non-type strains) was manually checked in order to improve inaccurately placed bases. For this, the secondary structure of the SSU was taken into account. The final alignment can be retrieved as an ARB database, as well as supplementary material in an aligned multi-FASTA file, and from www.arb-silva.de/living-tree. The original publication describes the tree as inferred using RAxML version 7.0, but presumably the latest tree was constructed using a more recent version.		Example of a taxon name from the newick file: "Pantoea_calida__GQ367478__Enterobacteriaceae"
10	1/29/2013 15:37:17	Andrea	species tree	not rooted	Muscle, Mafft	"The combined data set with morphology plus molecules and the molecule-only data set were analysed separately. Tree searches, identical for molecular and combined data sets, ran in parallel on three computers (totalling 16 processors and 96 GB RAM), examining for each data set ∼ 7.5 × 10^14 rearrangements in ∼ 2.5 months’ processor-time. To estimate ambiguity, for each data set we used eight independent replicates with tree bisection reconnection (TBR) followed by sectorial searches (see details below). The best tree for each of the two data sets was found by combining the eight independent trees with tree-fusing (Goloboff, 1999; Goloboff and Pol, 2007) and then subjecting the fused tree to sectorial search, as detailed below. For each starting point, TBR-swapping for the molecule-only trees saved 50 000–56 000 steps from the Wagner trees, and for the combined data set, about 35 000 steps. After concluding TBR, each of the resulting trees was subjected to a sectorial search routine, analysing in parallel 16 sectors (with a size of ∼ 4500 each) at the same time. Each sector was analysed for up to 4 h, with the following commands (see documentation of TNT for details): bbreak: cluster 20; timeout 4:00:00; sectsch: xss 15-8+6-2 gocomb 50 combst 5 fuse 4 slack 20 drift 7; xmult = repl 8 rss xss drift 4 hit 10 dumpfuse keep; tfuse; best; The tree-fusing at the end (tfuse command) guarantees that the final solution for the sector is no worse than the initial one. The results for the sectors were merged, and the resulting tree was subject to TBR in parallel (using three slave processes per machine, the maximum allowed by the RAM available in each machine, for a total of nine slave processes in the virtual machine). 
This alternation between sectorial search and TBR was repeated in 5–7 cycles, slightly changing the sectors selected, and the random seeds used for searching new solutions for each of the sectors. In the final cycles (as the trees approached optimality), the virtual machine examined about 740 × 10^6 rearrangements/s, requiring between 0.5 and 2 h to complete TBR. The trees resulting from each of the eight independent starting points were then subjected to several rounds of tree-fusing, and the resulting tree was subjected to three cycles of alternating sectorial search and TBR in parallel, but in this case breaking the tree into only seven pieces (sectors of about 10 000 taxa), and running each sector for up to 16 h. Each reduced data set was analysed by means of the following commands: bbreak: cluster 20; sectsch: xss 32-25 + 5-1 gocomb 50 combst 5 fuse 4 slack 20 drift 5; timeout 8:00:00; xmult = repl 8 rss xss drift 4 hit 10 dumpfuse keep prvmix; tfuse; tchoose/; sectsch = xss5-3 + 1-1 [ sectsch: xss10 + 3-1; xmult = xss rss hit 1 rep 8 nofu keep; tfuse; best; ]; The search commands indicated within square brackets are those to be used for analysis of the (five to three) sectors in which the sectorial search command (sectsch) will further partition each reduced tree of ∼ 10 000 taxa."		TNT - http://www.zmuc.dk/public/phylogeny/TNT/Scripts	"All sequences other than LSU and SSU were aligned with Muscle (Edgar, 2004). Nuclear LSU and SSU were aligned with Mafft (Katoh et al., 2005; Katoh, 2008). 
The alignment of LSU and SSU involved the following steps: (i) separate the complete data set in subsets of approximately 2000 sequences; (ii) align each data set separately using the Mafft option of considering a previously aligned sequence as a “template” for the multiple alignment [17 LSU and 70 SSU sequences, downloaded from the European ribosomal database (http://bioinformatics.psb.ugent.be/webtools/rRNA/ssu and http://bioinformatics.psb.ugent.be/webtools/rRNA/lsu/index.html), which take into account structural considerations]; (iii) find conserved regions common to all the aligned data sets; (iv) subdivide “vertically” each subset of 2000 sequences at the conserved regions identified in step 3, producing data sets of 2000 species per ∼ 50–200 bp each; (v) combine each corresponding partial data set obtained in step 4 and generate a data set of 20 000 species per 50–200 bp; (vi) erase the gaps; (vii) perform a multiple alignment with the data set of step 6; (viii) manually adjust the alignments."	manually corrected	protein and DNA	accession numbers are not provided	the mapping is not provided	the mapping is not provided	Meaningful external identifiers are provided	Collections information is not provided	Georeference information is not provided	Branch lengths are applicable but not provided.	Branch support values are applicable but not provided.		GenBank Eukaryotes	Goloboff, P. A., Catalano, S. A., Mirande, J. M., Szumik, C., Arias, J. S., Kallersjo, M., & Farris, J. S.	2009	Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups	Goloboff, P. A., Catalano, S. A., Mirande, J. M., Szumik, C., Arias, J. S., Kallersjo, M., & Farris, J. S. (2009). Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics, 25, 211–230. 
doi:10.1111/j.1096-0031.2009.00255.x	10.1111/j.1096-0031.2009.00255.x		http://www.zmuc.dk/public/phylogeny/TNT/More/Supp_Data_Set.tgz	n/a		Most parsimonious tree by incomplete (nonexhaustive) search	"Where possible and appropriate, we used amino acid sequences (COX I–III, CytB, RNAPII). The resulting alignments were inspected visually and, when possible, improved manually; regions that were too gappy were excluded from the final data sets. Given the obviously problematic nature of the alignments and incomplete sequences (many gaps are lack of data, not real deletions), we considered gaps as missing. Multiple sequences for the same species were excluded to maximize taxonomic diversity instead of simply using large numbers of identical sequences."		Normally we weren't saying that binomials count as IDs, but in this case they are, because they are explicitly, algorithmically getting their taxonomy from NCBI.
11	1/29/2013 16:02:08	Arlin	species tree	rooted	(not applicable)	From Federhen, 2012 "It was obviously important to provide a single taxonomic classification to index the entire set of entries in Entrez. The first step was to shuffle together the taxonomies from each of the contributing databases, each of which covered a somewhat different set of species with often very different internal classifications. The end result of this process was a hideous abomination, but it did provide a single classification that spanned all of the entries in Entrez, which we set out to improve. At this point we hosted series of taxonomy workshops to provide advice and direction for the project. David Hillis, John Taylor and Gary Olsen, in particular, put in a significant amount of time and effort in the initial cleanup of our merged classification. The next step forward was the 1997 agreement by the INSDC members to resolve taxonomic issues of nomenclature and classification prior to the release of new sequence data . . ." "We try to maintain a phylogenetic taxonomy—one in which the structure of the classification corresponds with the evolutionary history of the tree of life. A phylogenetic classification aims to include only monophyletic groups—groups in which all of the members are more closely related to each other than any of them are to anything outside of the group. . . " "There are several large taxonomy database projects that seek to aggregate names from other sources into more or less comprehensive collections—the Catalog of Life, the Encyclopedia of Life, NameBank and WikiSpecies, for example. These are useful resources for the taxonomy group when we research the names that we add to our database, and we maintain reciprocal links with many of them. 
Even more useful are the curated specialty databases that are devoted to a particular group—IPNI for the plants, Index Fungorum and MycoBank for the fungi, Algaebase for the algae, AmphibiaWeb and Amphibian Species of the World for the amphibians, the Catalog of Fishes and FishBase for the fish, Bergey's Manual for the prokaryotes and so on. More than 150 outside groups are registered to maintain LinkOut (http://www.ncbi.nlm.nih.gov/projects/linkout/) links in the NCBI Taxonomy database. But in every case, the ultimate authoritative source for the nomenclature and classification is the primary taxonomic literature itself."	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	(not applicable)	Meaningful external identifiers are provided	(not applicable)	(not applicable)	(not applicable)	(not applicable)		NCBI_29Jan2013							ftp://ftp.ncbi.nih.gov/pub/taxonomy/	(not applicable)	We obtained the Newick conversion of the tree, with species names, and collapsed unbranched nodes, from the IToL web site (http://itol.embl.de/other_trees.shtml). The precise URL is http://itol.embl.de/ncbi_tree/ncbi_complete_collapsed_with_names.newick.gz and this was downloaded Jan 29, 2013.	(not applicable)			The resource is linked to the terms of the NCBI taxonomy database.