Part 2 – A Description of the Proposed Research

Dark taxa: Preparing for a post-taxonomic future

"The insects parted with their names in vast clouds and swarms of ephemeral syllables buzzing and stinging and humming and flitting and crawling and tunnelling away."

- Ursula Le Guin, "She Unnames Them", The New Yorker, 21 January 1985

Perhaps nothing highlights the disconnect between the disciplines of taxonomy and genomics more than the increasing number of taxa in the GenBank sequence database that have not been given a formal scientific name. With each passing year these "dark taxa" comprise a growing percentage of the organisms represented in the database, such that in 2010 over 80% of new invertebrate taxa added to GenBank lacked a proper scientific name (Fig. 1).

Fig. 1 (a) Growth in the numbers of new species-level taxa added to GenBank each year. Taxa are partitioned into those with scientific names ("named") and those that have informal names ("unnamed"). (b) Percentage of taxa in the named and unnamed categories (data from http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html).

The pattern shown in Fig. 1 likely reflects a combination of processes. If most of the taxa being added represent species that have already been described, then the rate at which taxa can be identified (either by taxonomists or by using their outputs such as keys) is being outstripped by the pace of sequencing. Alternatively, dark taxa may represent unknown species, but we lack taxonomists capable of recognising the taxa as new (and formally describing them). If taxonomic capacity is a limiting factor then we would expect a gradual decline in percentage of named taxa, which is the pattern from 1992 to 2009. The growth of dark taxa might also reflect changing practices of molecular workers, for example in DNA barcoding where large numbers of specimens are sequenced and deposited into GenBank labelled with specimen codes rather than taxonomic names. Indeed, the dramatic increase in the numbers of dark taxa in 2010 is mostly due to sequences from the Barcode of Life Data Systems (BOLD) project.

It would be tempting to use diagrams such as Fig. 1 to bemoan the continued decline of taxonomy, despite repeated attempts to invest in the discipline (House of Lords Science and Technology Committee, 2009). However, the rate at which new species are being described has been roughly constant over the lifetime of GenBank (Fig. 2), which suggests that taxonomy is not in terminal decline. It may well have reached capacity, though, as evidenced by the backlog of museum and herbarium material that continues to yield new species (Bebber et al., 2010; Meegaskumbura et al., 2007). Hence, the problem is likely to be that taxonomy simply has not been (or cannot be) scaled up to handle the flood of new data being generated in molecular labs.

Fig. 2. Number of new species and subspecies described each year and recorded in Zoological Record (data from ION, http://www.organismnames.com).

There are several ways we could respond to this situation. One is to accept that existing taxonomic capacity is overwhelmed, and unlikely to increase, but argue that taxonomic names are vital and therefore we should develop tools to assign names retrospectively. For example, if an unnamed sequence cites a museum voucher specimen, and subsequent taxonomic work assigns that specimen to a named taxon, then we could retrospectively provide a name for that taxon in GenBank. 

Box 1 As an example of a "dark" taxon that has subsequently been named, consider the frog "Gephyromantis aff. blanci MV-2005" (NCBI tax_id 321743), which has a single sequence, AY848308, associated with it. This sequence was published as part of a DNA barcoding study (Vences et al., 2005). If we enter "AY848308" into Google we find two documents: one is the supplementary table for Vences et al. (2005), the other is the paper by Vences and Riva (2007) that describes the frog with this sequence as a new species, Gephyromantis runewsweeki. Hence "Gephyromantis aff. blanci MV-2005" is Gephyromantis runewsweeki.

Endeavouring to "clean up" GenBank after the fact is a worthy goal, especially if doing so leads to the discovery of new information. For example, there may be a lack of information online about a taxon’s geographic distribution, but unidentified sequences of that taxon may be georeferenced in GenBank. Furthermore, use of arbitrary codes for unnamed taxa in GenBank makes it harder to discover whether a particular taxon has been sequenced. For example, the unnamed ant taxa Proceratium sp. CS-2003-1, Proceratium sp. 1 CSM-2006, and Proceratium sp. Ma02 are all based on sequences of the same ant specimen (CASENT0500379) (Page, 2008). While having independent laboratories replicate the same sequences is in one sense comforting, this represents wasteful duplication of effort that could have been avoided if the GenBank sequences were assigned to a named taxon, or indeed the informal name used the specimen code (Schindel and Miller, 2010).

But we should also ask what the implications are if retrospective data cleaning of GenBank itself does not scale. For many biologists taxonomic names occupy a central place in biology (Patterson et al., 2010), but the widening gap between named and dark taxa suggests that we should also explore the extent to which we can do meaningful biology without taxonomic names. Macroscopic biologists may find this situation unpalatable, yet it is the norm for much of microbiology. This is not to suggest that microbiologists would not benefit from having formal taxonomic names for the taxa they study, but a great deal of microbiology is done in their absence.

GenBank is perhaps the single fastest growing biodiversity database, and sequences are becoming increasingly richly annotated with information about geography and ecology. Even if these sequences are not formally named, these annotations may provide a rich source of information about organismal biology (e.g., Sarkar, 2010). For example, I recently proposed a simple visualisation of host-parasite relationships called the "symbiome" (http://iphylo.blogspot.com/2011/03/visualising-symbiome-hosts-parasites.html). This visualisation is created by depicting the NCBI taxonomy tree as a circle with species arranged on the rim, and representing ecological associations between two organisms, such as a host and its parasite, by drawing a line connecting the positions of those two taxa on the circle. For example, in Fig. 3a we see all the taxa which are listed as "hosts" of insects in GenBank. The diagram is dominated by the association between insects and plants, as well as insect-vertebrate parasitism. If we ask the reverse question, "what organisms have insects as their hosts?" we uncover a broad spectrum of associated organisms, such as fungi, nematodes, and bacteria (Fig. 3b).
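The geometry of the symbiome diagram can be sketched in a few lines. This is a minimal illustration rather than the code behind Fig. 3: the function names and the toy taxa are hypothetical, and the leaf order is assumed to come from a traversal of the NCBI taxonomy tree.

```python
import math

def rim_positions(leaves):
    """Place leaves, in tree order, evenly around the rim of a unit circle."""
    n = len(leaves)
    return {
        name: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, name in enumerate(leaves)
    }

def chords(positions, associations):
    """Return one line segment per (associate, host) pair found on the rim."""
    return [
        (positions[associate], positions[host])
        for associate, host in associations
        if associate in positions and host in positions
    ]

# Hypothetical toy data; note that one partner in each pair lacks a formal name.
leaves = ["Homo sapiens", "Quercus robur", "unidentified wasp", "Plasmodium sp."]
pairs = [("unidentified wasp", "Quercus robur"), ("Plasmodium sp.", "Homo sapiens")]
positions = rim_positions(leaves)
segments = chords(positions, pairs)
```

Each segment is a pair of (x, y) points that any plotting library could draw as a line across the circle.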

Figure 3. “Symbiome” visualisations for insects, based on the NCBI taxonomy. (a) Associations between insects and their “hosts”. (b) Associations where insects are themselves the “host”.

Constructing diagrams such as this is obviously facilitated by having correct taxonomic names for the taxa sequenced, but names are not entirely essential. Often only one member of the association will have a proper name; for example, an unidentified wasp may have its host plant recorded with its scientific name. In other cases the host may have an informal name (such as "human"). Yet we are able to construct diagrams that convey fundamental information about the biology of these taxa.

Objectives and outputs

This project proposes to explore the implications of the rise of dark taxa for the study of biodiversity by doing the following:

1. Determine the patterns of dark versus named taxa across major taxonomic groups.

2. Estimate the relative proportion of cases where the dark taxa are already known taxa that have not been identified (and which have scientific names) versus genuinely new taxa (which will lack names).

3. Explore the extent to which basic biological facts about an organism can be extracted from GenBank alone.

In addition to publications documenting the results, the project will produce online databases that will be a significant resource to both the genomics and biodiversity communities.

Methodology and approach

Task 1: Determine the patterns of dark taxa across GenBank and over time. 

The prevalence of "dark taxa" can be assessed by doing a time series analysis of taxonomic names in GenBank. The NCBI taxonomic database will be downloaded from the NCBI FTP server ftp://ftp.ncbi.nih.gov/pub/taxonomy/ and stored in a local MySQL database. This download contains information on the taxa in the NCBI taxonomy database, where each taxon is uniquely identified by its tax_id number. Because the database download lacks dates for when the taxa were added to GenBank, these will be retrieved indirectly using the NCBI Entrez E-Utilities, which enable us to retrieve a list of tax_ids that were published within a given date range. The names of taxa at species level and below will be classified as scientific or not depending on whether they conform to the corresponding nomenclatural code.
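As a rough illustration of the classification step, the sketch below parses one row of the names.dmp file from the taxonomy download and applies a simple pattern test. The pattern is an assumption made for illustration: it accepts only plain binomials and trinomials and does not check a name against any nomenclatural code, so names with a subgenus or hybrid marker would be misclassified. The sample row is written by hand to show the file layout.

```python
import re

def parse_names_row(line):
    """Split one names.dmp row into (tax_id, name, unique_name, name_class).

    Fields are separated by tab-pipe-tab and each row ends with tab-pipe.
    """
    body = line.rstrip("\n")
    if body.endswith("\t|"):
        body = body[:-2]
    tax_id, name, unique_name, name_class = body.split("\t|\t")
    return int(tax_id), name, unique_name, name_class

# Heuristic (an assumption, not a nomenclatural check): a capitalised genus
# followed by one or two lower-case epithets, and nothing else.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+( [a-z][a-z-]+)?$")

def is_dark(name):
    """True if the name does not look like a formal binomial or trinomial."""
    return BINOMIAL.match(name) is None

# Hand-written example row. The name class "scientific name" is NCBI's label
# for its preferred name and says nothing about nomenclatural status.
row = "321743\t|\tGephyromantis aff. blanci MV-2005\t|\t\t|\tscientific name\t|\n"
tax_id, name, unique_name, name_class = parse_names_row(row)
```

Applied to every species-level row, `is_dark` would give the named and unnamed counts that can then be grouped by the date on which each tax_id was added.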

Preliminary analyses reported at http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html show that these analyses are relatively straightforward. This project will extend those analyses by downloading the sequences themselves and adding basic metadata about those sequences to the database. For example, DNA barcoding sequences are typically flagged by the keyword "BARCODE", which will enable us to determine whether dark taxa are predominantly barcode sequences.
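Checking for the barcode flag could look something like the following. This is a minimal sketch that reads the KEYWORDS field of a GenBank flat-file record as plain text; the record shown is a made-up, abbreviated example with a placeholder accession, and the function names are hypothetical.

```python
def keywords(flatfile_record):
    """Collect the KEYWORDS field of a GenBank flat-file record as a list."""
    collected = []
    in_keywords = False
    for line in flatfile_record.splitlines():
        if line.startswith("KEYWORDS"):
            in_keywords = True
            collected.append(line[len("KEYWORDS"):])
        elif in_keywords and line.startswith(" "):
            collected.append(line)  # indented continuation line
        elif in_keywords:
            break  # next field reached
    text = " ".join(part.strip() for part in collected).rstrip(".")
    return [k.strip() for k in text.split(";") if k.strip()]

def is_barcode(flatfile_record):
    """True if the record carries the BARCODE keyword."""
    return "BARCODE" in keywords(flatfile_record)

# Made-up, abbreviated record; XX000000 is a placeholder, not a real accession.
sample = """LOCUS       XX000000     658 bp    DNA     linear   INV 01-JAN-2010
DEFINITION  Lepidoptera sp. cytochrome oxidase subunit 1 (COI) gene.
KEYWORDS    BARCODE.
SOURCE      Lepidoptera sp.
"""
```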

Comparison with growth of new taxonomic names

To put the rate of growth of dark taxa into context we need to compare it with the rate at which new species are described. Summary data on this are available at fairly coarse taxonomic levels from the Index of Organism Names (ION) (Fig. 2) but data that can be queried at any taxonomic level are lacking. Perhaps the closest approach is uBio Taxatoy (Sarkar et al., 2008), which uses dates included in the authorship string in taxonomic names as a proxy for when the name was published. Although many of these dates will be correct, some will be confounded by chresonyms (the practice of publishing a taxonomic name together with a date to indicate the use of a name, rather than its original publication; see Smith and Smith, 1972). To improve on uBio's Taxatoy, names and dates will be extracted from ION (http://www.organismnames.com), Index Fungorum (http://www.indexfungorum.org/), and Index of Plant Names (IPNI) (http://www.ipni.org) to create a local database from which the rate of description of new species can be estimated for any taxon of interest.
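The date-as-proxy idea can be illustrated with a short sketch that pulls a year out of each name string and tallies names per year. Taking the first four-digit year in the string is an assumption made here for simplicity; as noted above, a date in a name string does not always mark the original publication, so such counts would need checking against the source databases.

```python
import re
from collections import Counter

# Years from 1700 to 2099; the range is an arbitrary choice for this sketch.
YEAR = re.compile(r"\b(1[7-9]\d\d|20\d\d)\b")

def authorship_year(name_string):
    """Return the first four-digit year in a name string, or None."""
    match = YEAR.search(name_string)
    return int(match.group(1)) if match else None

def descriptions_per_year(name_strings):
    """Count names by the year found in their authorship string."""
    years = (authorship_year(s) for s in name_strings)
    return Counter(y for y in years if y is not None)

names = [
    "Gephyromantis decaryi Angel, 1930",
    "Gephyromantis runewsweeki Vences and De la Riva, 2007",
    "Proceratium sp. Ma02",  # informal name, no year to extract
]
counts = descriptions_per_year(names)
```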

Task 2: How many dark taxa are new taxa?

The goal of this part of the project is to estimate what fraction of dark taxa represent "unknown knowns", that is, taxa that are known to science but have not been labelled with their scientific name in GenBank, and "known unknowns", taxa that are genuinely new taxa. Addressing this question will generate a large number of annotations of existing sequences, providing a resource with utility beyond the specific question being addressed here.  

Has a dark taxon subsequently been named?

Sequences that have been deposited in GenBank without being identified may have subsequently been assigned to a newly described species (Box 1). To locate these subsequent descriptions the following searches will be undertaken:

a) Taxonomic databases such as IPNI, Index Fungorum and ION will be searched for new species names. Where possible the publication that includes the species description will be downloaded, and the GenBank accession numbers and specimen codes extracted from the publication text using tools developed by Page (2010). If accession numbers or specimen codes match those of sequences associated with dark taxa, then those publications will be flagged as possibly containing descriptions of dark taxa.

b) If sequences for the unnamed taxa have been used in publications, papers citing those publications, as well as subsequent papers by the same authors, will be retrieved and processed as for (a) above. Papers not containing accession numbers or specimen codes, but containing phrases such as "n. sp.", "sp. nov.", etc., which indicate new taxa, will be flagged.

c) Specimen codes obtained from GenBank sequences and from publications will be used to search GBIF (http://www.gbif.org) and online museum databases. Any scientific names retrieved that are attached to those specimens will be candidate names for the corresponding taxa in GenBank.
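The matching in searches (a) and (b) can be sketched as follows. This is not the tool of Page (2010); it is a minimal illustration using two regular expressions that are assumptions of this sketch: one for the common nucleotide accession shapes (one letter plus five digits, or two letters plus six digits) and one for a few new-taxon phrases. The sample sentence is invented for the example and is not a quotation from any paper.

```python
import re

ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")
NEW_TAXON = re.compile(r"\b(?:n\.\s*sp\.|sp\.\s*n(?:ov)?\.|nov\.\s*sp\.)", re.IGNORECASE)

def flag_publication(text, dark_accessions):
    """Report which dark-taxon accessions a text cites and whether it names new taxa.

    dark_accessions maps accession number -> NCBI tax_id of the dark taxon.
    """
    cited = set(ACCESSION.findall(text))
    matched = {acc: dark_accessions[acc] for acc in cited if acc in dark_accessions}
    return {
        "matched": matched,
        "mentions_new_taxon": NEW_TAXON.search(text) is not None,
    }

# Invented sentence for illustration; the accession and tax_id are from Box 1.
text = "Gephyromantis runewsweeki sp. nov. is described; see GenBank AY848308."
result = flag_publication(text, {"AY848308": 321743})
```

A publication with a non-empty `matched` set would be flagged as possibly describing a dark taxon, while one with only `mentions_new_taxon` set would be queued for manual inspection.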

Caveats

Data mining scientific papers requires access to the full text. For Open Access journals this is straightforward, but for other journals this will require subscription access. The project’s host institution has access to some journals, but not to some key journals such as Zootaxa, which is the single largest outlet for new animal species descriptions. The success of this part of the project will depend on being able to access these journals.

The success of the searches in (b) depends on sequences being linked to publications. Ideally every sequence in GenBank would contain a link to a digital record of the publication that made the sequence available, but this is not the case. Many sequences are deposited directly in GenBank, independent of any publication. Miller et al. (2009) found that 42% of GenBank sequences were linked to a publication, and of those 29% either lacked a PubMed identifier (PMID) or cited a publication not indexed by PubMed. Journals not indexed by PubMed include many of the outlets for phylogenetics and taxonomy (such as Zootaxa). In cases where the GenBank record contains the full bibliographic record for the paper publishing the sequence, bibliographic identifiers will be searched for using bioGUID (Page, 2009).

How many dark taxa are new?

Determining whether dark taxa are new is problematic, but we can put some bounds on the scale of the problem. In the first instance we can ask how many species have been described for a given genus, using taxonomic databases such as the Integrated Taxonomic Information System (ITIS), which include a measure of how complete their coverage is. If the number of named + dark taxa in GenBank is less than the total number of described species in the genus, then it is possible that the dark taxa are simply known taxa that haven't been identified. Conversely, if the number of named + dark taxa exceeds the number of described taxa then some of the dark taxa are likely to be new.
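The comparison described above amounts to simple arithmetic, sketched below. The function name, the status strings, and the example counts are hypothetical; the lower bound it returns assumes that every GenBank taxon is distinct and that every named GenBank taxon is among the described species, neither of which is guaranteed. The case where the two totals are equal is not covered in the text and is treated here as "possibly all known".

```python
def bound_new_taxa(named_in_genbank, dark_in_genbank, described_species):
    """Return (status, minimum number of dark taxa that must be new)."""
    total = named_in_genbank + dark_in_genbank
    if total <= described_species:
        # All dark taxa could be described species that were not identified.
        return "possibly all known", 0
    excess = total - described_species
    # The excess cannot be more than the number of dark taxa themselves.
    return "some likely new", min(excess, dark_in_genbank)

# Hypothetical counts for two genera.
small_genus = bound_new_taxa(named_in_genbank=10, dark_in_genbank=5, described_species=40)
large_genus = bound_new_taxa(named_in_genbank=30, dark_in_genbank=20, described_species=40)
```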

These bounds are crude, and subject to several sources of error. For example, the taxonomic database may be less complete than its curators believe. If the dark taxa are specimen codes, then this may inflate the number of "species" that are dark. The bounds could be improved by clustering the sequences, say at generic level, then sorting the clusters into molecular operational taxonomic units (MOTUs) (Floyd et al., 2002) based on a specified level of shared similarity of DNA sequences. If a named taxon appears within a MOTU that also includes one or more dark taxa, then we can infer that the dark taxa are also likely members of the named taxon (Box 2).
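To make the MOTU idea concrete, the sketch below groups aligned sequences by single-linkage clustering on a pairwise distance threshold and then proposes a name for a dark taxon only when its cluster contains exactly one named taxon. This illustrates the general approach rather than the procedure of Floyd et al. (2002) or of this project; the sequences, identifiers, and threshold are toy values chosen so the example is easy to follow.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def motus(seqs, threshold):
    """Single-linkage clusters: join any two sequences within the threshold."""
    parent = {k: k for k in seqs}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for k in seqs:
        clusters.setdefault(find(k), set()).add(k)
    return list(clusters.values())

def propose_names(clusters, names):
    """Propose a name for dark members of clusters holding exactly one named taxon."""
    proposals = {}
    for cluster in clusters:
        named = {names[k] for k in cluster if names.get(k)}
        if len(named) == 1:
            name = next(iter(named))
            for k in cluster:
                if not names.get(k):
                    proposals[k] = name
    return proposals

# Toy data: s1 is named, s2 and s3 are dark. The identifiers are placeholders.
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "TTTTACGGGG"}
names = {"s1": "Gephyromantis decaryi", "s2": None, "s3": None}
clusters = motus(seqs, threshold=0.15)
proposals = propose_names(clusters, names)
```

Requiring exactly one named taxon per cluster is a deliberately cautious choice: a cluster containing two different names signals either a threshold that is too loose or a possible misidentification, and is better left for inspection.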

Doing this analysis from scratch would require considerable resources to first cluster sequences from GenBank into gene families, construct phylogenies for each sequence cluster, then extract clusters of taxa that correspond to MOTUs. However, the bulk of the first two steps has already been done by the PhyLoTA project (Sanderson et al., 2008). The gene clusters and phylogenies assembled by the PhyLoTA project will be downloaded and stored in a local database. MOTUs will be extracted from these phylogenies, focussing on phylogenies based on barcoding genes such as COI for animals and rbcL and matK for plants (Hollingsworth et al., 2009).

Box 2 Assigning a name to a dark taxon using MOTUs

The sequence FJ559186 from taxon Gephyromantis cf. decaryi 9 MV-2009 was published by Vieites et al. (2009). In the PhyLoTA neighbour-joining tree for the corresponding cluster of genes, this sequence appears within a cluster of sequences identified as Gephyromantis decaryi Angel, 1930; hence, we can assign this dark taxon to a named species.

Caveats

The number of described species in a genus is only an approximation of the number of actual species, not only because some species may not yet have been described, but also because a significant fraction of those that have been described may in fact be synonyms.

Identifying dark taxa using named taxa assumes that GenBank sequences associated with formal scientific names have been correctly identified. Nilsson et al. (2006) found that some 20% of fungal sequences in GenBank may be incorrectly identified, so this method should be used cautiously. MOTUs from PhyLoTA will assist in identifying potentially misidentified taxa.

Outputs

This part of the project will generate significant quantities of annotations of GenBank sequences and taxa, which raises the issue of how to store these. GenBank has resisted calls for "wikification" (Pennisi, 2008), so one approach is to create a separate wiki to hold these annotations. I have explored this approach already at http://iphylo.org/linkout, which is a wiki that maintains links between taxa in the NCBI database and articles on the corresponding taxa in Wikipedia (Page, 2011). This wiki uses the Semantic MediaWiki extension to MediaWiki (the software which powers Wikipedia), which adds several key features to MediaWiki, including support for semi-structured data, web forms, a simple query language, and exporting individual pages in RDF format. The existing iPhylo Linkout wiki could be extended to hold annotations for dark taxa, such as formal scientific names, if they exist, and links to the publication describing the taxon.

A separate wiki will be created for annotations for individual sequences, such as georeferencing, host taxa, and data derived from museum voucher specimens and associated publications. Note that this wiki is not intended to replicate GenBank -- the sequences themselves, for example, won't be included. Instead the wiki will host basic metadata about the sequences. This wiki will be a significant community resource. In addition to the annotations discussed here, it would also enable researchers to flag sequences (or annotations of sequences) that they think are in error. 

Task 3: GenBank as an organismal database

Having determined the scale of the problem (Task 1) and tried to ameliorate it by finding names for as many dark taxa as possible (Task 2), the final part of this project explores the utility of GenBank as a database for organismal biology and biodiversity. Two annotations most relevant to organismal biologists are where an organism is found, and what other organisms it is associated with. To validate annotations extracted from GenBank, a subset of records will be compared with existing databases on geography and host-parasite associations.

Geographic information in GenBank ranges from latitude and longitude coordinates, through descriptions of geographic locality, to references to a voucher specimen which may itself have geographic information. For a subset of taxa the geographic range for associated sequences will be compared with the distributions of the taxa reported in GBIF. The GBIF database is not without errors (Yesson et al., 2007), but is the largest single database of organismal distributions. The relative extent of geographic distributions obtained from GenBank and GBIF will be compared, as well as the number of instances where a GenBank sequence occurs outside the range recorded in GBIF.
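One way the range comparison might be set up is sketched below. It parses a coordinate string in the "degrees N|S degrees E|W" form used for GenBank latitude and longitude annotations, and tests the point against a bounding box built from occurrence points. Using a bounding box as a stand-in for a species' range is a crude assumption of this sketch (it ignores range shape and the antimeridian), and the occurrence points are invented rather than taken from GBIF.

```python
import re

LAT_LON = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+([NS])\s+(\d+(?:\.\d+)?)\s+([EW])\s*$")

def parse_lat_lon(value):
    """Convert a string such as '21.25 S 47.42 E' to signed decimal degrees."""
    m = LAT_LON.match(value)
    if m is None:
        return None
    lat = float(m.group(1)) * (1 if m.group(2) == "N" else -1)
    lon = float(m.group(3)) * (1 if m.group(4) == "E" else -1)
    return lat, lon

def bounding_box(points):
    """Return (min_lat, min_lon, max_lat, max_lon) for a list of (lat, lon) points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return min(lats), min(lons), max(lats), max(lons)

def outside(point, box):
    """True if the point falls outside the bounding box."""
    lat, lon = point
    return not (box[0] <= lat <= box[2] and box[1] <= lon <= box[3])

# Invented occurrence points standing in for a GBIF download.
occurrences = [(-21.0, 47.0), (-22.5, 48.0), (-20.5, 47.5)]
box = bounding_box(occurrences)
sequence_point = parse_lat_lon("21.25 S 47.42 E")
```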

The “host” field of GenBank sequences will be used to link taxa with their ecological associates (“hosts”) (see examples of visualising these links in Fig. 3). Given that the host field is free-format and can contain anything from a scientific name to a patient code, tools such as LINNAEUS (Gerner et al., 2010) will be used to extract host names. To evaluate the reliability and coverage of these host-associations, they will be compared with published host-association databases (e.g., Nunn and Altizer, 2005).

Management of project and resources

The PDRA employed on this project will present the research at international conferences, and will also attend the annual Biodiversity Information Standards (TDWG) meetings, which typically include hands-on sessions run by leaders in biodiversity informatics. This will help the PDRA network with the broader community, as well as gain additional skills in the field. There will also be an opportunity for the PDRA to contribute to a taught postgraduate course on biodiversity informatics at the University of Glasgow.

Work on tasks 1-3 will overlap, but the first six months of the project will be spent constructing local databases, developing software to query taxonomic databases, and addressing task 1. The bulk of the project (months 6-24) will be spent addressing task 2. Task 3 will make use of the annotated version of GenBank constructed in task 2, and will commence in month 24.

All software developed in this project will be deposited in a code repository such as Google Code or GitHub and made available under an open access license. The databases will be available online, both as websites and data dumps. Progress on the project will be actively documented on my blog http://iphylo.blogspot.com.

References

Alroy, J., 2002. How many named species are valid? Proceedings of the National Academy of Sciences of the United States of America, 99(6), pp.3706-11. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=122588&tool=pmcentrez&rendertype=abstract [Accessed August 7, 2010].

Angel, F., 1930. Sur la validité du genre Gephyromantis (Batraciens) et diagnoses de deux espèces et d'une variété nouvelles de ce genre. Bulletin De La Société Zoologique De France, 55, pp.548-553. Available at: http://gallica.bnf.fr/ark:/12148/bpt6k54438868/f24.

Bebber, D. P., Carine, M. A., Wood, J. R. I., Wortley, A. H., Harris, D. J., Prance, G. T., et al. (2010). Herbaria are a major frontier for species discovery. Proceedings of the National Academy of Sciences, 107(51), 22169-71. doi: 10.1073/pnas.1011841108.

Bidartondo, M. I. (2008). Preserving accuracy in GenBank. Science (New York, N.Y.), 319(5870), 1616. doi: 10.1126/science.319.5870.1616a.

Gerner, M., Nenadic, G. & Bergman, C.M., 2010. LINNAEUS: a species name identification system for biomedical literature. BMC bioinformatics, 11(1), p.85. Available at: http://www.biomedcentral.com/1471-2105/11/85 [Accessed April 13, 2011].

Hollingsworth, P.M. et al., 2009. A DNA barcode for land plants. Proceedings of the National Academy of Sciences of the United States of America, 106(31), pp.12794-7. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2722355&tool=pmcentrez&rendertype=abstract.

Hopkins, G. W., & Freckleton, R. P. (2002). Declines in the numbers of amateur and professional taxonomists: implications for conservation. Animal Conservation, 5(3), 245-249. doi: 10.1017/S1367943002002299.

House of Lords Science and Technology Committee. (2009). Systematics and taxonomy: follow up 5th report of session 2007-08 report with evidence (p. 330). The Stationery Office. Retrieved from http://www.tsoshop.co.uk/bookstore.asp?Action=Book&ProductId=9780104013496.

Meegaskumbura, M., Manamendra-Arachchi, K., Schneider, C. H. J., & Pethiyagoda, R. (2007). New species amongst Sri Lanka's extinct shrub frogs (Amphibia: Rhacophoridae: Philautus). Zootaxa, 1397, 1-15. Retrieved from http://www.mapress.com/zootaxa/2007f/z01397p015f.pdf.

Miller, H., Norton, C. N., & Sarkar, I. N. (2009). GenBank and PubMed: How connected are they? BMC research notes, 2(1), 101. doi: 10.1186/1756-0500-2-101.

Moreau CS, Bell CD, Vila R et al. Phylogeny of the ants: diversification in the age of angiosperms, Science 2006;312:101-104.

Nilsson, R. H., Ryberg, M., Kristiansson, E., Abarenkov, K., Larsson, K.-H., & Kõljalg, U. (2006). Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective. PLoS ONE, 1(1), e59. doi: 10.1371/journal.pone.0000059.

Nunn, C.L. & Altizer, S.M., 2005. The global mammal parasite database: An online resource for infectious disease records in wild primates. Evolutionary Anthropology: Issues, News, and Reviews, 14(1), pp.1-2. Available at: http://doi.wiley.com/10.1002/evan.20041 [Accessed January 14, 2011].

Ouellette GD, Fisher BL, Girman DJ. Molecular systematics of basal subfamilies of ants using 28S rRNA (Hymenoptera: Formicidae), Molecular Phylogenetics and Evolution 2006;40:359-369.

Page RDM. Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Brief Bioinform. 2008 Sep;9(5):345-54. Epub 2008 Apr 29. Review. PubMed PMID: 18445641. doi:10.1093/bib/bbn022

Page, Roderic D. M. Linking NCBI to Wikipedia: a wiki-based approach [Internet]. Version 38. PLoS Currents: Tree of Life. 2011 Mar 8 [revised 2011 Mar 31]. Available from: http://knol.google.com/k/roderic-d-m-page/linking-ncbi-to-wikipedia-a-wiki-based/16h5bb3g3ntlu/2.

Pennisi, E. Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science. 2008;319(5870):1598-1599. doi:10.1126/science.319.5870.1598

Sarkar, I.N., 2010. Leveraging biomedical ontologies and annotation services to organize microbiome data from Mammalian hosts. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, 2010, pp.717-21. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3041364&tool=pmcentrez&rendertype=abstract [Accessed April 12, 2011].

Sarkar, I. N., Schenk, R., & Norton, C. N. (2008). Exploring historical trends using taxonomic name metadata. BMC evolutionary biology, 8(1), 144. doi: 10.1186/1471-2148-8-144.

Saux C, Fisher BL, Spicer GS. Dracula ant phylogeny as inferred by nuclear 28S rDNA sequences and implications for ant systematics (Hymenoptera: Formicidae: Amblyoponinae), Molecular Phylogenetics and Evolution 2004;33:457-468.

Schindel, D.E. & Miller, S.E., 2010. Provisional nomenclature: the on-ramp to taxonomic names. In A. Polaszek, ed. Systema Naturae 250 - The Linnaean Ark. CRC Press, pp. 109-115.

Smith, H.M. & Smith, R.B., 1972. Chresonymy ex Synonymy. Systematic Zoology, 21(4), p.445. Available at: http://www.jstor.org/stable/2412440?origin=crossref [Accessed May 9, 2011].

Solow, A.R., Mound, L.A. & Gaston, K.J., 1995. Estimating the Rate of Synonymy. Systematic Biology, 44(1), pp.93-96. Available at: http://sysbio.oxfordjournals.org/cgi/doi/10.1093/sysbio/44.1.93 [Accessed May 9, 2011].

Stein LD. Integrating biological databases. Nat Rev Genet. 2003 May;4(5):337-45. Review. PubMed PMID: 12728276. doi:10.1038/nrg1065

Vences, M., Thomas, M., Meijden, A. van der, Chiari, Y., & Vieites, D. R. (2005). Comparative performance of the 16S rRNA gene in DNA barcoding of amphibians. Frontiers in zoology, 2(1), 5. doi: 10.1186/1742-9994-2-5.

Vences, M., & Riva, I. D. L. (2007). A new species of Gephyromantis from Ranomafana National Park, south-eastern Madagascar (Amphibia, Anura, Mantellidae). Spixiana, 30(1), 135-143. Retrieved from http://www.pfeil-verlag.de/04biol/pdf/spix30_1_16.pdf.

Vieites, D.R. et al., 2009. Vast underestimation of Madagascarʼs biodiversity evidenced by an integrative amphibian inventory. Proceedings of the National Academy of Sciences of the United States of America, 106(20), pp.8267-72. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2688882&tool=pmcentrez&rendertype=abstract [Accessed May 9, 2011].