16S rRNA Metagenomics Analysis

Introductory tutorial

http://genome.edu.au

http://qfab.org

Contents

Tutorial Overview

Background

The workflow in Galaxy:

Preparation [10 min]

Section 1. Dereplicate reads [5 min]

Section 2. Operational Taxonomic Units (OTU) Clustering [10 min]

Section 3. Chimera detection [15 min]

Section 4. Generate the OTU Table [20 min]

Section 5. Generate Phylogenetic Tree [20 min]

Section 6. Rarefaction analysis - Single Sample [8 min]

Section 7. Collector’s Curve - Richness and Diversity Estimators - Single Sample [10 min]

References

Tutorial Overview

In this tutorial we cover the concepts of metagenomics analysis in 16S rRNA.

What’s not covered

This introductory tutorial covers the steps required for the first part of a 16S rRNA metagenomics study applied to samples from a single site. As such we only show analyses that can be applied to a single environment, that is, identifying what organisms are in your sample (diversity), and how many organisms there are in the sample (richness). Similar same steps would be applied if you had samples from multiple sites, with additional analyses that can compare the diversity and richness between samples. The latter part is not covered by this tutorial.

Background

16S rRNA sequencing

The application of culture-independent rRNA-based phylogenetics to the study of bacterial diversity was first explored by researchers in the 1980s e.g. [2] and was used by Carl Woese to delineate the main branches of life [1]. These rRNA based analyses remain central methods in microbiology both for the exploration of species diversity but also or the application in bacterial identification.

The development of pyrosequencing technologies and the emergence of next generation DNA sequencing have changed the scales at which scientists can systematically survey the 16S rRNA content of biological samples and has been a key driver behind initiatives such as the human microbiome project [5].

In the broadest sense, 16S rRNA sequencing should not be confused with metagenomics which has been defined as "the application of shotgun sequencing to DNA obtained directly from an environmental sample or series of related samples, producing at least 50Mbp of randomly sampled sequence data" [3] This is also in contrast to functional metagenomics whereby environmental DNA is first cloned and subsequently screened for specific functional activities e.g. [4]. Kunin et al elegantly argue that 16S rRNA profiling should in fact exist as a mandatory "Premetagenome community composition profiling" step in more comprehensive surveys [3]. There is discussion and debate as to how deep this sequencing could be performed but at JGI at least, a single 384 well plate is routinely surveyed.

Irrespective of the nuances of terminology, metagenomics is a baseline technology for understanding the ecology and evolution of microbial ecosystems upon which hypotheses and experimental strategies are built.

Experimental design is key

Most 16S strategies are biased towards the measurement of bacterial and archaeal constituents of the environment and eukaryotes (e.g. fungi and protists) are excluded. This compromises the ability to measure the whole ecological spectrum of a microbial community; the larger and non-coding and thus uninterpretable parts of the eukaryotic genome are problematic. This is largely driving strategies that encompass additional metatranscriptomics surveys. The studies are thus oriented towards the sequence-tractable bacterial, archaeal and viral components of the research community.

Limitations of the 16S rRNA profiling strategy

The 16S rRNA gene copy number of variable between bacterial species and this is further confounded by PCR amplification biases which skew estimates of the community composition.

Goals

The goal of the 16S study is to understand the complexity of the community being sampled. This is assessed as a function of the number of species in the community (richness) and their relative abundances (evenness).

The workflow in Galaxy:

The process presented in this workshop is aimed to be an introduction for molecular biologists to 16S rRNA metagenomics analysis. The workflow is also implemented such that it can be run using a standard desktop or laptop environment without the requirements of a large cluster or cloud infrastructure.

The steps are linear and include (and is shown in Figure 1):

  1. Dereplicate reads
  2. Cluster reads into Operational Taxonomic Units (OTU)
  3. Chimera detection and removal
  4. OTU Table
  5. Phylogenetic Tree alignment
  6. Rarefaction analysis
  7. Collector Curve analysis

Where is the data in this tutorial from?

We use the same data from the Mothur Marine community analysis, that is the the study from the Global Ocean Sampling (GOS) Expedition. However we only concentrate on one site as opposed to the 14 sites they sampled.

Preparation [10 min]

  1. Register as a new user in Galaxy if you don’t already have an account (what is Galaxy?)
  1. Open a browser and go to the the Galaxy server: http://galaxy-qld.genome.edu.au
  1. NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well
  1. Register as a new user: User>Register or login if you already have an account
  1. Import the “Metagenomics 16S rRNA Data” for the workshop. You can do this in a few ways, of which by far the easiest is to access a shared data library on your Galaxy instance.
  1. Go to Shared Data -> Data Libraries and click on Metagenomics 16S rRNA Data
  2. Select the three datasets:
  1. seqs.fasta
  2. gold.fasta
  3. core_set_aligned.fasta
  1. Go to 'For selected datasets:' just below the data set list, select ‘Import to histories’ and click ‘Go’.

A new dialog will be displayed.

  1. In the ‘Destination Histories’ panel go to ‘New history named:’ and enter your new history name, e.g. Metagenomics Analysis
  2. Finally click on ‘Import library datasets’

          The datasets will be imported to your account to the new history specified in step 2d. Start using this history.

Please note the following tutorial refers to outputs using the format:  “history number: tool name - file name”

If there are already items in your history panel when you begin the tutorial, the numbers may not be in sync but if the “tool name- file name” is the same then it will be the right file.

This also applies, when you are running the corresponding workflow ‘
Tutorial: Metagenomics 16S rRNA Analysis’ to this tutorial.

  1. Alternatively:
  1. Download the data directly to your computer using these URLs, then upload to Galaxy:
  1. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Metagenomics_BASIC/seqs.fasta
  2. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Metagenomics_BASIC/gold.fasta
  3. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Metagenomics_BASIC/core_set_aligned.fasta
  1. Once files are downloaded, in Galaxy click on Get data>Upload File
  1. Click 'Choose File', locate the local copy of each of the three fasta files and upload
  2. You can leave File Format as Auto-detect, or if you prefer, set it to fasta
  1. Click Execute and wait for the files to upload to Galaxy. It can take a couple of minutes.
  1. Or:
  1. Upload the data directly to Galaxy using the file URLs
  2. Follow the steps above but instead of downloading the files locally and then uploading, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import them directly.
  1. You should now have the following files in your Galaxy history:
  1. seqs.fasta - Raw 16S sequences from the Global Ocean Sampling Expedition (on which the 16S rRNA analysis will be performed)
  2. gold.fasta - Reference database consisting of chimera free sequences (used by the chimera detection) - described by Haas et al.
  3. core_set_aligned.fasta - GreenGenes ‘Core Set’ (a template alignment), comprising approximately 5000 aligned non-chimeric sequences representative of the currently recognized diversity among bacteria and archaea (used by the alignment generation)

Click on the eye icon on the top right of each fasta file to view the first part of the file.

Section 1. Dereplicate reads [5 min]

The aim here is to:

  • reduce the number of reads in the input file into clusters of identical sequences.
  • there are two possible modes in the dereplication tool:
    (1) full-length where the full length of the two sequences must be identical before being grouped together or
    (2) prefix where the first section of the two sequences are identical.
  1. In the tool panel (left hand side) search for the “Dereplicate” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Input sequence file”, select the “1: seqs.fasta” dataset
  1. For the second field “Removing mode” select “Full length
  1. Click on Execute
  1. Click on the name of the job “4: Dereplicate on data 1” in the History panel (right hand side) and you will see the number of sequences have been reduced from 1,472 in the input data (“1: seqs.fasta”) to 1,459 sequences.
  1. Examine the output by clicking on the eye icon  in the top right hand corner of this step. NOTE: The size=XXXX annotation in the id line of the fasta file indicates the number of reads that were exactly the same as this representative sequence.
  1. (OPTIONAL) You can rename the output by clicking on the pencil icon  on the top right hand corner of the output. Click on the “Attributes tab” and then  type in a new name in the “Name” field and click on Save.

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-4)

Section 2. Operational Taxonomic Units (OTU) Clustering [10 min]

The aim here is to:

  • Cluster the output from Step 1 into groups of sequences that are 97% similar to each other.
  • Similar sequences are assumed to belong to the same taxonomic groups. These groups are known as Operational Taxonomic Units (OTUs) and the number of clusters that result are indicative of the number of species that are present in this sample.
  1. In the tool panel on the left hand side, search for the “Cluster OTU” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Input sequence file”, select the output from Step 1: “4: Dereplicate on data 1
  1. For the second field  “Minimum cluster radius”, enter 0.97. This value corresponds to 97% sequence identity.
  1. Click on Execute
  1. Click on the name of the job “5: Cluster OTU on data 4” in the History panel (right hand side) and you will find that the number of sequences has decreased to 559 sequences.


Examine the output by clicking on the eye icon
 in the top right hand corner of this step.

The output file is in Fasta format where each sequence is the representative of an OTU. Figure 2 below, provides a summary of the clustering approach. Reads are represented by solid dots. Those that are 97% similar to each other are grouped together into an OTU as represented by the blue dashed circles. In this example, there are 3 OTUs. The reads that represents the centroid of each cluster are shown as red dots and these red dots form the representative of the corresponding OTU. These are the sequences that are reported in the output file. In this tutorial, the output file has 559 sequences, meaning that 559 OTUs were identified.
NOTE: The number of sequences that you have may not be exactly the same as mentioned here. This is so as the clustering process starts by randomly selecting a read as the center of a cluster (see figure below). Since this is a random selection, each time you perform a clustering process, a different read can be selected and thus affect the end result.

For more detailed description of the OTU clustering algorithm, click here.

http://drive5.com/usearch/manual/uparseotu_algo.jpg
Figure 2: Schematic of the OTU Clustering algorithm’s output, see text for description.



Please Note: The “size=XX” annotation in the identifier line is NOT the number of sequences that is in the corresponding cluster (OTU). This value is the same as the input file.

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-5)

Section 3. Chimera detection [15 min]

The aim here is to:

  • Identify and filter chimeric sequences utilizing a reference dataset containing nucleotide sequences believed to be free of chimeras.
  • There are two possible modes in the Uchime tool for chimera detection:
    (1) ref where a reference dataset of parent sequences believed to be chimera-free is provided by the user or
    (2) de-novo where a database is constructed from the query sequences by using the abundance data.
  1. In the tool panel on the left hand side, search for the “Uchime” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Mode to detect chimeras”, select “ref” (reference mode).
  1. For the second field “Input reference file”, select the “5: Cluster OTU on data 4
  1. For the third  field “Reference Database”, select the “2: gold.fasta”. The gold.fasta file (downloaded from http://drive5.com/uchime/gold.fa) contains nucleotide sequences believed to be free of chimeras.
  1. Click on Execute
  1. Four outputs are generated from this tool, two of which are hidden. The first output lists the predicted chimeras and has the “:chimeras” suffix in the output name. The second output lists the chimera free sequences and has the “:non chimeras” suffix.
    Click on the job name(s) to examine the output and click on the eye icon
     in the top right hand corner of the corresponding steps to view the output files:
  1. “6: Uchime on data 5 and data 2:chimeras
  1. These sequences have been classified to be chimeras
  1. 7: Uchime on data 5 and data 2:non chimeras
  1. These sequences have been classified to be non-chimeras

Only the “non-chimeras” output be used for the subsequent steps. All other sequences, chimeras  and nonclassified will be disregarded. There are 2 hidden log files also generated from this tool, they contain detailed job information about  the classification into non-chimeras and chimeras.

(OPTIONAL) To access the hidden outputs, click on the cog wheel  in the upper right corner of the History panel and select “Include Hidden Datasets”.

  1. 8: Uchime on data 5 and data 2:Human-readable output
  1. The chimeric alignments (with respect to the putative parents) are listed following the human-readable format of the USEARCH-Tool-Suite.

  1. 9: Uchime on data 5 and data 2:Tabbed output
  1. For each sequence there are 18 values logged in a tabbed format, as following:

Field

Name

Description

1

Score

Value >= 0.0, higher score means more strongly chimeric alignment.

2

Q

Query label.

3

A

Parent A label.

4

B

Parent B label.

5

T

Top parent (T) label. This is the closest reference sequence; usually either A or B.

6

IdQM

Percent identity of query and the model (M) constructed as a segment of A and a segment of B.

7

IdQA

Percent identity of Q and A.

8

IdQB

Percent identity of Q and B.

9

IdAB

Percent identity of A and B.

10

IdQT

Percent identity of Q and T.

11

LY

Yes votes in left segment.

12

LN

No votes in left segment.

13

LA

Abstain votes in left segment.

14

RY

Yes votes in right segment.

15

RN

No votes in right segment.

16

RA

Abstain votes in right segment.

17

Div

Divergence, defined as (IdQM - IdQT).

18

YN

Y or N, indicating whether the query was classified as chimeric. This requires that Score >= threshold specified by -minh, Div > minimum divergence specified by ‑mindiv and the number of diffs ( (Y+N+A) in each segment (L and R) is greater than the minimum specified by -mindiffs.

http://drive5.com/usearch/manual/uchimeout.html 

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-9)

NOTE: The number of sequences that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.

Section 4. Generate the OTU Table [20 min]

The aim here is to:

  • Generate a full OTU Table which consists of the number of reads found for each group, the representative sequence and the predicted taxonomy
  • This is accomplished by running two tools and merging the resulting outputs together
  • The first tool “Map Reads to OTU”, as the name indicates will map read sequences to the OTU cluster representatives from Section 2.
  • In section 2, we clustered the reads into clusters that are 97% similar to all other members of that cluster. Then for each cluster, a representative sequence was selected. However, we do not know the number of reads that are in each cluster. Furthermore, in Section 3 we removed sequences that were potential chimeric sequences; this alters the number of remaining clusters. Therefore, in this step we first find out which of the remaining clusters each read belongs with 97% similarity and this will give us a count of the number of reads per cluster.
  • The second tool “RDP Multi Classifier” will assign each the OTU representative sequence to a taxonomy level
  • The last step then merges the information to generate the OTU Table  
  1. In the tool panel on the left hand side, search for the “Map Reads to OTU” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “OTU fasta file”, select dataset “7: Uchime on data 5 and data 2: non_chimeras
  1. For the second field “Input reads file”, select dataset “4: Dereplicate on data 1
  1. For the third field “Minimum identity”, enter 0.97. Like before, this value corresponds to 97% sequence similarity.
  1. Click on Execute
  1. Four outputs are generated from this tool, one of which is a hidden file. Click on the following job names:
  1. “10: Map Reads to OTU on data 7 and data 4:rabund
  1. The output is tab delimited and follows the 'rabund' format from mothur.
  • The first column is a label (this is usually the identity)
  • The second column is the number of OTUs found and
  • each subsequent column is the number of reads found for the corresponding OTUs
  1. N.B. this is just a one-line file (as can be seen by clicking on the history entry to expand it to full details) which wraps around in the Galaxy display window to look like a multi-line file. In other words, the first and second columns correspond to these descriptions for the first line only.
  1. “11: Map Reads to OTU on data 7 and data 4:Pre-OTU Table”
  1. This is the pre OTU table with the following columns:
  • OTU label
  • Count
  • Sequence
  1. “12: Map Reads to OTU on data7 and data 4:relabelled OTU”
  1. The input reads from “7: C) Uchime on data 5 and data 2: non_chimeras” have been re-labelled with convenient labels for parsing, e.g. OTU_1, OTU_2,...OTU_N, where N is the number of OTUs.

There is one hidden log file also generated from this tool, which reports whether a query sequence matches or not matches the database.

(OPTIONAL) To access the hidden output, click on the cog wheel  in the upper right corner of the History panel and select “Include Hidden Datasets”.

  1. “13: Map Reads to OTU on data 7 and data 4:hit list”
  1. This file reports query sequences that match or did not match the database “5: B) Cluster OTU on data 4following the UC-format of the USEARCH-Tool-Suite. Note that a given read may match two or more OTUs given the identity threshold. In such cases the tool will tend to assign the read to the OTU with highest identity and will break ties arbitrarily. Some reads may not match any OTU for these reasons: (1) the read is chimeric, (2) the read has more than 3% errors, (3) the read has a singleton sequence so was discarded.

A complete OTU table contains the taxonomy assignment of each sequence. To complete the Pre-OTU Table, we will use the “E) RDP Multi Classifier” tool.

  1. In the tool panel on the left hand side, search for the “RDP MultiClassifier” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Select Gene Trainings Model”, select “16S rRNA” from the dropdown list DEFAULT
  1. Tick the checkbox for the second field, “Select to generate an OTU Table”

    Once the checkbox is enabled, the original input field “Input reads file in Fasta format” will be replaced by two other fields:
    i)  “Relabelled OTU input reads of the ‘Map Reads to OTU’ tool in FASTA format”
    and
    ii)  “PRE OTU table of the ‘Map Reads to OTU’ tool”.

                   

  1. For the third field “Relabelled OTU input reads file of the ‘Map Reads to OTU’ tool in FASTA format” select “12: Map Reads to OTU on data 7 and data 4:relabelled OTU”  
  1. For the fourth field “Intermediate OTU table of the ‘Map Reads to OTU’ tool”, select “11: Map Reads to OTU on data 7 and data 4:Pre-OTU Table
  1. For the fifth field “Assignment confidence cutoff”, enter 0.8. DEFAULT
    This value is
    used to specify the assignment confidence cutoff used to determine the assignment count in the hierarchical format. The range is between 0 and 1 with 0.8 being the default value. For sequences that are shorter than 250 base pairs, the tool developers recommend using 0.5 as the confidence threshold. (http://rdp.cme.msu.edu/tutorials/classifier/RDPtutorial_MULTICLASSIFIER.html)
  1. For the sixth field “Tab delimited output format”, select “allrank“ from the drop-down list DEFAULT
  1. Click on Execute
  1. Three outputs are generated by this tool, click on the job name to examine the output:
  1. 14: RDP MultiClassifier on data 12 and data 11:classification_assignment_hierarchical.tab
  1. Sequence count for each taxon in the hierarchy
  1. 15: RDP MultiClassifier on data 12 and data 11:classification_assignment_details.tab
  1. Sequence-by-sequence classification results including confidence scores at each level of the hierarchy
  1. 16: RDP MultiClassifier on data 12 and data 11:OTU_Table.tab
  1. The OTU table is composed of the following columns:
  • OTU label
  • Count
  • Sequence
  • Phylogenetic lineage

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-16)

NOTE: The number of sequences that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.

Section 5. Generate Phylogenetic Tree [20 min]

The aim here is to:

  • Build and visualise the phylogenetic tree, this is accomplished using two tools:
  • Aligning the sequences to a template sequence using PyNAST. PyNast is a reimplementation of the NAST algorithm to align each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence).
  • Using FastTree to infer approximately-maximum-likelihood phylogenetic trees from the created alignments of nucleotide or protein sequences
  1. In the tool panel on the left hand side, search for the “PyNAST” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Candidate file (Fasta format)”, select “7: Uchime on data 5 and data 2:non_chimeras
  1. For the second field “ Template alignment file (Fasta format)”, select “3: core_set_aligned.fasta
  1. For the third field “Minimum sequence length”, enter 1000. The minimum sequence length is the length to include in the NAST alignment and it has to be set according to your set of your sequences.DEFAULT
  1. Click Execute
  1. Three outputs are generated by this tool, two of which are hidden, click on the job name to examine each one:
  1. 17: PyNAST on data 7 and data 3: pynast aligned
  1. This file contains the alignment. You will find that the number of sequences might have decreased. This means  that the alignment for some sequences failed. More information is given in the log file and the failure file, which are hidden outputs.

(OPTIONAL) To access the hidden outputs, click on the cog wheel  in the upper right corner of the History  panel and select “Include Hidden Datasets”.

  1. 18: PyNAST on data 7 and data 3: pynast log file
  1. This file is composed of the following columns:
  • candidate sequence ID
  • candidate nucleotide count
  • errors
  • template ID
  • BLAST percent identify to template
  • candidate nucleotide count post-NAST
  1. You will observe that some sequences have an entry in the error column: “No search results”. These are the sequences that failed to align.
  1. 19: PyNAST on data 7 and data 3: pynast failure
  1. This file contains the sequences that failed to align. As already found out through the log file which sequences failed to align. Here those sequences are given in FASTA format.

Next we use “Fast Tree” to generate the phylogenetic tree using the aligned sequenced file.

  1. In the tool panel on the left hand side, search for the “FastTree” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Aligned sequences file (FASTA format)”, select “17: PyNAST on data 7 and data 3:pynast aligned” from step 6.
  1. For the second field “Nucleotide or protein Alignment”, select “Nucleotide”, which should be selected by default. DEFAULT
  1. Click Execute
  1. To outputs are generated from this tool, one is hidden. Click on the job name to examine the outputs:
  1. 20: FastTree on data 18:tree.nhx
  1. To visualize the phylogenetic tree, click on the “Visualize” icon  to launch Phyloviz. Phyloviz is the interactive Phylogenetic Tree Visualizer included from Galaxy.  Currently Phyloviz has only one entry point, which is via the “Visualize - View in Phyloviz” icon that will appear for all supported phylogenetic data sets.  
  2. Phyloviz will lay out the tree in a linear format. All phylogenetic distances are normalized to a value between 0 to 1.0 inclusive, and a default value of 250px would represent 1.0 units from a parent node to child node. The laying out of phyloviz is “in-place” meaning that the tree is presented in the same order as it is represented in the data and that no nodes are shuffled.
  3. Phyloviz supports a number of user interaction features to help you present, analyze and share the visualization. Please see the Galaxy wiki for more information on Phyloviz.

(OPTIONAL) To access the hidden log file, click on the cog wheel  in the upper right corner of the History panel and select “Include Hidden Datasets”.

  1. 21: FastTree on data 17:log_FastTree_run.txt”
  1. this is the log file from Fast tree and shows job related information. It is not required for the analysis workflow.

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-21)

NOTE: The number of sequences that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.

Section 6. Rarefaction analysis - Single Sample [8 min]

The aim here is to:

  • Generate an intra-sample rarefaction curve to determine if the sampling effort was sufficient.
  • Rarefaction curves provide a way of comparing the richness observed in different samples although it is a better measure of sample diversity. Briefly, it plots the average number of OTUs that you would expect to observe against the number of samples. If the curve plateaus off, then it is a good indication that the sample is saturated and you have captured the majority (if not all) of species/strains present in the sample. However, if the curve does not plateau off, this suggests there may still be more species/strains that have not been captured in your data and you need increase the size of your sample.
  1. In the tool panel on the left hand side, search for the “Rarefaction” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “Input file format”, select rabund from the dropdown list, which should be selected by default. DEFAULT
  1. For the second field “OTU list (rabund or sabund format)”, select “10: Map Reads to OTU on data 7 and data 4:rabund
  1. For the third field “Minimum identity used in OTU clustering”, enter the minimum identity used in Section 4. Generate the OTU Table [20 min] - Step 4.
    In this tutorial we used
    0.97, which corresponds to 97% sequence similarity that was used previous and is selected by default. DEFAULT

    In case you can’t remember the minimum identity used, go back to the output file “5: B) Cluster OTU on data 1” in your History. Click on “5: B) Cluster OTU on data 1” to extend the dataset information. Click on the “View details” icon
    . In the Input Parameter section you will find the “Minimum cluster radius” you used in Step “B) OTU Clustering”.

  1. Click Execute
  1. Examine the output “22: Rarefaction on data 12”, by clicking on the “View data” icon . The output is a tabbed delimited file with the following columns:

First column (numsampled)

is the number sampled and indicates the level of sampling intensity (by default this information is provided every 100 individuals)

The second column (0.03)

is the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1000 by default. This would represents the data for the y-axis for the corresponding minimum identity (1-0.97=0.03).

The third column (lci)

fourth column (hci)

represent the confidence intervals

  1. To plot the rarefaction curve:
  1. Click on “22: Rarefaction on data 12” to extend the dataset information.
  2. Click on the “Visualize” icon
  3. Select “Scatterplot”
  4. Select the “Data Controls” tab
  5. For the first field “Data column for X”, select column 1 (the number sampled)
  6. For the second field “Data column for Y”, select column 2 (the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1,000 by default)
  7. Click Draw

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (data sets 1-22)

NOTE: The numbers that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.

Section 7. Collector’s Curve - Richness and Diversity Estimators - Single Sample [10 min]

The aim here is to:

  • To estimate the diversity and richness of the sample
  • Collector's curves describe how the richness or diversity change as you add more samples into the dataset. If a collector's curve plateaus, it is a good indication that enough samples were collected and you can have confidence in the last value of the curve.
  1. In the tool panel on the left hand side, search for the “Collector Curve” tool, either using the search box or under the “Metagenomics 16S Analysis” category
  1. For the first field “OTU list (rabund, sabund, list or shared format)”, select “10: Map Reads to OTU on data 7 and data 4:rabund
  1. For the second field “Labels - OTU labels”, select 0.03 (corresponding to 97%). Note that this value is the inverse of other similarity thresholds we used thus far (1-0.97=0.03) and is the default value. DEFAULT
  1. For the third field “Calculators”, there is no need to change anything from the default calculators
  1. chao - Community richness the Chao estimator
  2. invsimpson - Community diversity the Simpson index
  3. npshannon - Community diversity the non-parametric Shannon index
  1. In this tutorial we do not enable advanced options.
  1. Click Execute
  1. Four outputs are generated from this tool, click on the job name to examine them:
  1. 23: Collector Curve on data 10:tab
  1. This file provides a summary containing the following fields, number of sequences, the sample coverage, the number of observed OTUs, and then a summary of each of the selected calculators that were applied to the sample.           
  1. 24: Collector Curve on data 10:tab (invsimpson)
  1. The Simpson’s Inverse Index also referred to as Simpson’s Reciprocal Index is a measure of diversity. Species diversity is a way to compare samples that incorporates both a measure of evenness (representation by each species) and richness (number of each species in the sample).
    This index starts with the value of 1 as the lowest possible figure. This figure would represent a community containing only one species. The higher the value, the greater the diversity. The maximum value is the number of species in the sample.
  1. 25: Collector Curve on data 10:tab (npshannon)
  1. The non-parametric Shannon Index, like the Simpson’s Inverse Index, is a diversity estimator and calculates a non-parametric estimate of the classical Shannon diversity Index for an OTU definition. The interpretation of the curve has to be performed in a comparison to the diversity curve of one or more communities.
  1. 26: Collector Curve on data 10:tab (chao)
  1. The Chao1 estimator is a species richness estimator. The species richness estimators estimate the total number of species present in a community.  The Chao1 index estimator the species richness for one community based on the observed species frequencies or presence/absence data.
  2. If you compare the rarefaction curve (observed richness) and Chao1 estimator curve (estimated richness), you will observe a gap. The greater estimated richness in comparison to the observed richness usually indicates that with further sampling, the richness would likely continue to increase. Please note that if a sample contains many singletons, it is likely that more undetected OTUs exist, and the Chao1 index will estimate greater species richness than it would for a sample without rare OTUs.

These three outputs are tabbed delimited with the following columns:

First column (numsampled)

is the number sampled and indicates the level of sampling intensity (by default this information is provided every 100 individuals)

The second column (0.03)

is the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1000 by default. This would represents the data for the y-axis for the corresponding minimum identity (1-0.97=0.03).

The third column (lci)

fourth column (hci)

represent the confidence intervals

These files can be plotted similar to the rarefaction curve (see Section 6).

  1. Plot the collector’s curves:
  1. Click on “24: Collector Curve on data 10:tab (invsimpson)” to extend the dataset information.
  2. Click on the now accessible “Visualize” icon
  3. Select “Scatterplot”
  4. Select the “Data Controls” tab
  5. For the first field “Data column for X”, select column 1 (the number sampled)
  6. For the second field “Data column for Y”, select column 2 (the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1,000 by default)
  7. Click Draw
  1. Repeat the previous step 8 to plot the results of the other calculators selected. Here:
  1. chao - “25: Collector Curve on data 10:tab (npshannon)
  2. invsimpson - “26: Collector Curve on data 10:tab (chao)

Completed Galaxy history for this section (in SharedData>Published Histories): Metagenomics_16S_rRNA_Analysis (datasets 1-26)

NOTE: The numbers that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.

References

1. Woese CR, Kandler O, Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87: 4576-4579.

2. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, et al. (1985) Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82: 6955-6959.

3. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev 72: 557-578.

4. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5: R245-249.

5. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, et al. (2007) The human microbiome project. Nature 449: 804-810.

6. Holmes S, Alekseyenko A, Timme A, Nelson T, Pasricha PJ, et al. (2011) Visualization and statistical comparisons of microbial communities using R packages on phylochip data. Pac Symp Biocomput: 142-153.

7. Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4: 17-27.

8. Brodie EL, Desantis TZ, Joyner DC, Baek SM, Larsen JT, et al. (2006) Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl Environ Microbiol 72: 6288-6298.