1 of 36

Alpha and Beta Microbial Diversity Metrics

Ranacapa Analysis

Tutorial by Chris Dao and Amanda Freise

UCLA MIMG Dept.

PUMA

2 of 36

Biodiversity can be measured at multiple scales

http://www.webpages.uidaho.edu/veg_measure/Modules/Lessons/Module%209(Composition&Diversity)/9_2_Biodiversity.htm

α-diversity: within an individual sample

Site A = 7 species

Site B = 5 species

β-diversity: diversity between multiple samples

Site A and C have highest β-diversity:

10 species that differ between them,

only 2 species in common
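The counts above can be reproduced with plain sets. This is a minimal sketch: the species names are hypothetical placeholders, chosen only so that the set sizes match the numbers on this slide.

```python
# Hypothetical species lists; only the counts are taken from the slide.
site_a = {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6", "sp7"}
site_b = {"sp1", "sp2", "sp3", "sp4", "sp8"}
site_c = {"sp6", "sp7", "sp9", "sp10", "sp11", "sp12", "sp13"}

def richness(site):
    """Alpha diversity as simple richness: number of species in one sample."""
    return len(site)

def shared_and_differing(x, y):
    """Return (species in common, species found in only one of the two sites)."""
    return len(x & y), len(x ^ y)

alpha = {"A": richness(site_a), "B": richness(site_b), "C": richness(site_c)}
shared_ac, differing_ac = shared_and_differing(site_a, site_c)
print(alpha)                    # richness per site
print(shared_ac, differing_ac)  # shared vs. differing species for A and C
```

Alpha diversity needs one sample; the beta-diversity comparison needs two.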

3 of 36

Defining alpha diversity

Richness: number of different species present

AKA number of different operational taxonomic units (OTUs) or amplicon sequence variants (ASVs)
Doesn’t tell us anything about the abundance of these OTUs/ASVs

Evenness: describes distribution of OTUs/ASVs

Various diversity indices exist which weight richness and evenness in different ways
No single index is the best; instead, depends on the question you are asking
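To see why richness alone is not enough, here is a small sketch comparing two hypothetical samples with the same richness but very different evenness. Pielou's evenness (Shannon index divided by the log of richness) is used here as one common way to put a number on evenness; it is an illustrative choice, not an index named in this tutorial.

```python
import math

def richness(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over ASVs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J' = H' / ln(richness); 1.0 means perfectly even."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

# Hypothetical read counts per ASV.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]

for name, counts in [("even", even_sample), ("uneven", uneven_sample)]:
    print(name, richness(counts), round(pielou_evenness(counts), 3))
```

Both samples have four ASVs, but the second is dominated by a single one, so its evenness is much lower.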

4 of 36

Defining alpha diversity

Richness: number of different species present

Observed OTUs
Chao1
ACE

Evenness: describes distribution of OTUs/ASVs

Shannon index
Simpson’s diversity index
Phylogenetic distance
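A rough sketch of three of the indices listed above, computed from a single sample's ASV counts. The counts are hypothetical. Chao1 is written in its bias-corrected form, and Simpson is written as 1 minus the sum of squared proportions; other software may use slightly different variants, so treat these as assumptions rather than the exact formulas Ranacapa applies.

```python
def observed_richness(counts):
    """Observed OTUs/ASVs: how many have at least one read."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1)),
    where F1 = singletons and F2 = doubletons."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

def simpson(counts):
    """Simpson's diversity index as 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical read counts for ten ASVs (two were not detected).
counts = [50, 30, 10, 2, 2, 1, 1, 1, 0, 0]
print(observed_richness(counts), chao1(counts), round(simpson(counts), 3))
```

Chao1 comes out above the observed richness because singletons suggest that some rare ASVs were missed.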

5 of 36

Accurately estimating diversity

How are we measuring diversity for most of the possible indices?

By the ASVs

What happens if we miscalculate the number of ASVs present?

May over- or under-estimate diversity

Why might ASVs get miscalculated?

Several reasons…

6 of 36

Samples may have inconsistent sequencing results

Number of sequences overall per sample may be different (e.g. way more DNA may get extracted from one sample)

How might that sample’s ASVs & diversity change?

More rare sequences may be missed during sequencing in favor of highly common sequences

How might that sample’s ASVs & diversity change?

How to make sure we correctly and reproducibly estimate the # of ASVs across all our samples?

Rarefaction!

7 of 36

Rarefaction: a normalization method

An equal number of sequences from each sample are actually used for downstream analysis
Number of sequences can be chosen by the researcher (e.g. 5000 seqs, 15000, etc)

If this number is less than the total number of sequences obtained for a sample, we may be missing some ASVs…what effect would that have?

How deep should we rarefy?
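Rarefying a single sample can be sketched as drawing reads at random, without replacement, until the chosen depth is reached. The function name, the sample counts, and the choice to return `None` for samples that are too shallow are all assumptions made for illustration.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Randomly keep `depth` reads from one sample (no replacement).

    `counts` maps ASV id -> read count. Returns None when the sample has
    fewer than `depth` reads; such samples are usually dropped.
    """
    reads = [asv for asv, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return dict(Counter(rng.sample(reads, depth)))

# Hypothetical sample with 10,000 reads, including one rare ASV.
sample = {"ASV1": 6000, "ASV2": 3000, "ASV3": 990, "ASV4": 10}
rarefied = rarefy(sample, depth=5000)
print(rarefied)
print(rarefy(sample, depth=20000))  # too shallow for this depth
```

A rare ASV such as `ASV4` can drop to a handful of reads, or vanish entirely, at low depths.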

8 of 36

Rarefaction curves indicate species coverage

Adapted from: https://doi.org/10.1371/journal.pcbi.1000667.g004

Number of sequences

most or all species have been sampled

this site has not been exhaustively sampled

only a small fraction of species has been sampled

9 of 36

Rarefaction curves

Different samples have a different total number of reads
One solution: randomly subsample every sample down to the same # of reads

Compare subsample of 50k reads in each sample
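A rarefaction curve can be sketched by counting how many distinct ASVs show up as more and more reads are drawn. The sample below is hypothetical (3 common ASVs, 20 rare ones), and taking prefixes of one shuffled read list is a shortcut for illustration; real tools typically average over repeated subsamples.

```python
import random

def rarefaction_curve(counts, depths, seed=0):
    """Return (depth, ASVs observed) pairs for one sample.

    Reads are shuffled once; the first `d` reads of the shuffled list act
    as a random subsample of size `d` drawn without replacement.
    """
    reads = [asv for asv, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(reads)
    return [(d, len(set(reads[:d]))) for d in depths if d <= len(reads)]

# Hypothetical sample: 3 abundant ASVs and 20 rare ones (1540 reads total).
sample = {"common_%d" % i: 500 for i in range(3)}
sample.update({"rare_%d" % i: 2 for i in range(20)})

depths = [10, 100, 500, 1000, 1540]
curve = rarefaction_curve(sample, depths)
for depth, observed in curve:
    print(depth, observed)
```

When the observed count stops rising as depth increases, most ASVs in the sample have been found.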

10 of 36

Ranacapa provides several built-in analyses

Rarefaction
Alpha-diversity
Beta-diversity
And more…

Once we calculate alpha diversity, how can we ask scientifically meaningful questions?

Are some types of samples more or less diverse than other samples?

How do we categorize samples? METADATA!

11 of 36

Are these community differences significant?

12 of 36

Objective: explore the relationship between environmental parameters (metadata) and diversity

METADATA - Data about the data

How we create “groups” of data for comparisons

Compare categories:

-Two samples directly (Sample A vs. Sample B)

-Two groups of samples (Burned samples vs. Unburned samples)

-Multiple groups of samples (Low vs. Medium vs. High soil phosphate levels)

For our analyses, metadata needs to be categorical (low, med., high) rather than continuous (e.g. 3.2, 5.3, 8.5)
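Turning a continuous measurement into categories can be sketched as simple binning. The cutoffs (4.0 and 7.0) and sample names are hypothetical; in practice the cutoffs should come from the biology or from the distribution of your data.

```python
def to_category(value, low_cutoff, high_cutoff):
    """Bin a continuous measurement into Low / Medium / High."""
    if value < low_cutoff:
        return "Low"
    if value < high_cutoff:
        return "Medium"
    return "High"

# The three example values from the slide, with hypothetical sample names.
phosphate = {"sampleA": 3.2, "sampleB": 5.3, "sampleC": 8.5}

categories = {s: to_category(v, low_cutoff=4.0, high_cutoff=7.0)
              for s, v in phosphate.items()}
print(categories)
```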

13 of 36

Metadata Table Example

Sample ID	Team Name	Sample Location	Burn status	Phosphate content
S18.K0010A2	LIT	Skirball	Burned	Low
S18.K0011C1	FIF	Skirball	Unburned	Low
S18.K0011C2	FIF	Skirball	Burned	Medium
S18.K0033C1	SIT	Skirball	Unburned	Medium
S18.K0033C2	GoB	Botanical Garden	Unburned	Medium
S18.K0145B1	TDD	Botanical Garden	Unburned	Low
S18.K0147C2	BBB	Skirball	Unburned	Medium
S18.K0148A1	AL-Gs	Skirball	Burned	Low
S18.K0154C1	SIT	Skirball	Burned	Low
S18.K0154C2	FC	Skirball	Unburned	Medium
S18.K0191C1	LIT	Skirball	Burned	Medium
S18.K0192B2	AL-Gs	Botanical Garden	Unburned	Low
S18.K0192C1	TDD/FC	Skirball	Burned	High
S18.K0195C1	BBB	Skirball	Burned	High

Metadata Categories
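Each metadata column defines a way to split the samples into groups. The sketch below uses the first four rows of the table above (sample ID, burn status, phosphate content); the helper name `group_by` is made up for this example.

```python
from collections import defaultdict

# First four rows of the metadata table: (sample ID, burn status, phosphate).
metadata = [
    ("S18.K0010A2", "Burned", "Low"),
    ("S18.K0011C1", "Unburned", "Low"),
    ("S18.K0011C2", "Burned", "Medium"),
    ("S18.K0033C1", "Unburned", "Medium"),
]

def group_by(rows, column):
    """Map each category in `column` to the list of sample IDs carrying it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)

by_burn = group_by(metadata, column=1)
by_phosphate = group_by(metadata, column=2)
print(by_burn)
print(by_phosphate)
```

The same samples fall into different groups depending on which metadata category is chosen.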

15 of 36

Grouping samples together using metadata allows us to make inferences about ecological parameters, e.g., burned vs. unburned

16 of 36

T-test

Tests whether the means of two categorical groups are statistically significantly different from each other

Example of hypotheses testable by a t-test:

Is the mean alpha diversity greater in burned vs. unburned samples?
Are the abundances of Actinomyces higher in Section 1A samples vs. Section 1B samples?
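As a sketch of what a t-test computes, the function below returns the t statistic and degrees of freedom for Welch's version of the test, which does not assume equal variances. Choosing Welch's variant is an assumption for illustration, and the alpha diversity values are hypothetical. Turning t into a p-value requires the t distribution, which statistical packages provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(a) / len(a)   # squared standard error of group a's mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical Shannon diversity values per sample.
burned = [3.1, 3.4, 2.9, 3.0, 3.3]
unburned = [4.0, 4.2, 3.8, 4.1, 3.9]

t_stat, dof = welch_t(burned, unburned)
print(round(t_stat, 2), round(dof, 2))
```

A t statistic far from zero means the difference between the group means is large relative to the spread within the groups.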

17 of 36

Variance

Average of the squared distance of each sample value from the mean
Measure of how “spread apart” the data is

Which sample grouping (1A or 1B) has more variance?
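The definition above translates directly into code. The two groups are hypothetical. Note that dividing by n, as in the definition, gives the population variance; statistical software usually reports the sample variance, which divides by n - 1.

```python
from statistics import mean, pvariance, variance

def mean_squared_deviation(values):
    """Average of the squared distance of each value from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical groups with the same mean (5.0) but different spread.
tight = [4.9, 5.0, 5.1, 5.0]
spread = [2.0, 8.0, 3.5, 6.5]

for name, values in [("tight", tight), ("spread", spread)]:
    print(name,
          round(mean_squared_deviation(values), 4),  # divides by n
          round(variance(values), 4))                # divides by n - 1
```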

Image from: http://www.statisticshowto.com/sample-variance/

18 of 36

ANOVA, or Analysis of Variance

Generalizes t-test for 3+ categories: are the means of the category groups different?

ANOVA quantifies how well different metadata category groupings explain the variance between the samples

ANOVA partitions variance between categories and within categories:

If there is more variance between categories and less within categories, then the categories explain the variance in the data well
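The partition of variance can be sketched for a one-way ANOVA as follows. The three groups hold hypothetical alpha diversity values; the p-value, which requires the F distribution, is left to statistical software.

```python
from statistics import mean

def one_way_anova(groups):
    """Split total variation into between-group and within-group parts."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f_stat}

# Hypothetical alpha diversity values for three phosphate categories.
low = [2.0, 2.2, 1.8]
medium = [3.0, 3.1, 2.9]
high = [4.0, 4.3, 3.7]

result = one_way_anova([low, medium, high])
print(result)
```

Here almost all of the variation sits between the categories, so the F statistic is large.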

19 of 36

ANOVA example

20 of 36

ANOVA post-hoc tests

If there is a significant overall ANOVA result, post-hoc tests between pairs of categories determine which categories have different means from each other
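The idea of pairwise follow-up tests can be sketched as below. To stay self-contained, this uses an exact permutation test on the difference in means with a Bonferroni correction. That is a swapped-in technique for illustration, not necessarily the post-hoc test Ranacapa runs, and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p(a, b):
    """Two-sided p-value: share of all relabelings whose mean difference
    is at least as large as the observed one."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            hits += 1
    return hits / total

def pairwise_posthoc(groups):
    """Bonferroni-adjusted p-value for every pair of categories."""
    pairs = list(combinations(groups, 2))
    return {(x, y): min(1.0, exact_permutation_p(groups[x], groups[y]) * len(pairs))
            for x, y in pairs}

# Hypothetical alpha diversity values per phosphate category.
groups = {
    "Low": [2.0, 2.2, 1.8, 2.1, 1.9],
    "Medium": [2.1, 2.3, 1.9, 2.0, 2.2],
    "High": [4.0, 4.3, 3.7, 4.1, 3.9],
}
adjusted = pairwise_posthoc(groups)
for pair, p in adjusted.items():
    print(pair, round(p, 4))
```

In this made-up data set, High differs from both other categories, while Low and Medium do not differ from each other.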

21 of 36

ANOVA example

Which ANOVA will have a lower p-value?

Recall that ANOVA partitions variance between categories and within categories

If there is more variance between categories and less within categories, then the categories explain the variance in the data well (lower p-value)

Comparison 1

Comparison 2

22 of 36

ANOVA example

Recall that ANOVA partitions variance between categories and within categories

p=0.82

If there is more variance between categories and less within categories, then the categories explain the variance in the data well (lower p-value)

p=0.02

Comparison 1

Comparison 2

23 of 36

ANOVA table in Ranacapa

Alpha diversity plots

Comparison 1

Alpha diversity stats

Degrees of freedom: sample size

Sum/mean of squares: variance

F-statistic: higher value means more variance is explained

p-value: probability of seeing an F-statistic at least this large by chance alone

24 of 36

ANOVA post-hoc tests

If there is a significant overall ANOVA result, post-hoc tests between pairs of categories determine which specific pairs of categories have different means from each other

Moving on to beta diversity…

25 of 36

Representation of beta-diversity

Recall: beta diversity is the diversity between samples

Single samples do not have a beta diversity value, pairs of samples do.

The following sentence does not make sense:

“Sample A has more beta diversity than Sample B.”

26 of 36

Representation of beta-diversity

Different distance metrics quantify how different the microbial communities of two samples are from each other:

Bray-Curtis – takes abundance of ASVs into account
Jaccard – only takes presence/absence of ASVs into account
UniFrac – phylogenetic metric; how far apart different ASVs are on a phylogenetic tree matters

The overall community structure of our 14 samples can be represented by a distance matrix
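The difference between the first two metrics can be sketched with two hypothetical samples that contain the same ASVs in opposite proportions. UniFrac is left out because it needs a phylogenetic tree.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    asvs = set(u) | set(v)
    diff = sum(abs(u.get(a, 0) - v.get(a, 0)) for a in asvs)
    total = sum(u.get(a, 0) + v.get(a, 0) for a in asvs)
    return diff / total

def jaccard(u, v):
    """Jaccard distance: 1 - (shared ASVs / ASVs present in either sample)."""
    present_u = {a for a, c in u.items() if c > 0}
    present_v = {a for a, c in v.items() if c > 0}
    return 1 - len(present_u & present_v) / len(present_u | present_v)

# Hypothetical samples: same two ASVs, opposite abundances.
sample_1 = {"ASV1": 90, "ASV2": 10}
sample_2 = {"ASV1": 10, "ASV2": 90}

print(bray_curtis(sample_1, sample_2))  # abundance-aware
print(jaccard(sample_1, sample_2))      # presence/absence only
```

Bray-Curtis sees these samples as very different, while Jaccard sees them as identical because both contain the same ASVs.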

28 of 36

Beta-diversity distance matrix

Can generate a distance matrix for each beta-diversity metric we use

Bray-Curtis distance matrix of 109BL-S18 samples made using QIIME2
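Building a distance matrix amounts to applying one metric to every pair of samples. The sketch below uses a tiny hypothetical ASV table (three samples, three ASVs) and Bray-Curtis; in this course the real matrix comes from QIIME2.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))

def distance_matrix(table, metric):
    """Return sample IDs and the square matrix of pairwise distances."""
    ids = list(table)
    return ids, [[metric(table[i], table[j]) for j in ids] for i in ids]

# Hypothetical ASV table: sample ID -> read counts for three ASVs.
table = {
    "S1": [10, 0, 5],
    "S2": [8, 2, 5],
    "S3": [0, 12, 3],
}

ids, matrix = distance_matrix(table, bray_curtis)
for sample_id, row in zip(ids, matrix):
    print(sample_id, [round(d, 2) for d in row])
```

The matrix is symmetric with zeros on the diagonal, since each sample is at distance zero from itself.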

29 of 36

Highly multidimensional datasets

We have:

ASV table (~2800 dimensions)
Distance matrix (14 dimensions)

We are interested in whether the microbial community composition can be explained by certain metadata groupings
Could try plotting them to see which samples are close to each other… but there are too many dimensions to plot

30 of 36

Overview of ordination methods

PCA = Principal Component Analysis (STAMP)
PCoA = Principal Coordinate Analysis (QIIME)
NMDS = Non-metric Multidimensional Scaling (Ranacapa)

All of these techniques reduce the dimensionality of data while preserving the distances between samples as much as possible. This allows the data to be viewed in 2D or 3D plots.

PCA uses the original ASV table, while PCoA and NMDS use the distance matrices

31 of 36

Beta Diversity (Ordination)

Microbial community profile visualization

Qualitative question: do data points with the same metadata label cluster with each other?

Each dot represents a sample, colored based on its associated metadata category grouping

32 of 36

Beta Diversity in Ranacapa

Each dot represents a sample, colored based on its associated metadata category grouping

33 of 36

PCA is an axis transformation

Visualization is taken from http://setosa.io/ev/principal-component-analysis/

Here’s a reduction from 2D to 1D

Our actual PCA is a reduction from 2800D to 2D or 3D!
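The 2D to 1D case is small enough to write out by hand. This sketch centers the points, finds the direction of greatest variance from the 2x2 covariance matrix, and projects each point onto that direction. The points are hypothetical, and the closed-form angle formula only works in two dimensions; real PCA software uses general eigen-decomposition.

```python
import math
from statistics import mean

def pca_2d_to_1d(points):
    """Return the first principal axis (unit vector) and each point's
    coordinate along it."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    centered = [(x - mx, y - my) for x, y in points]
    n = len(points)
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, scores

# Hypothetical points scattered around the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8)]
axis, scores = pca_2d_to_1d(points)
print([round(a, 3) for a in axis])
print([round(s, 3) for s in scores])
```

Each sample is now described by a single number, its position along the new axis.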

34 of 36

PCoA and NMDS similar to PCA

PCoA and NMDS operate on distance matrices rather than the original sample/ASV data

PCA and PCoA are identical if the distance metric used for PCoA is the Euclidean distance. However, the Bray-Curtis and Jaccard distance metrics are more often used in ecology

PCoA and NMDS are similar procedures with different algorithms
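For the curious, PCoA itself can be sketched in pure Python: square the distances, double-center them, and take the top eigenvectors. Power iteration is used here only to keep the example free of dependencies; it is an assumption for illustration and can mislead when a non-Euclidean metric produces large negative eigenvalues. The distance matrix is hypothetical. NMDS uses an iterative rank-based procedure and is not shown.

```python
import math

def top_eigenpair(m, iters=500):
    """Power iteration on a symmetric matrix: (eigenvalue, unit eigenvector)."""
    n = len(m)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to explain on this axis
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv)), v

def pcoa(dist, n_axes=2):
    """Return one coordinate list per sample from a distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double-centering of the squared distances.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    axes = []
    for _ in range(n_axes):
        eigval, v = top_eigenpair(b)
        axes.append([math.sqrt(max(eigval, 0.0)) * x for x in v])
        # Deflate: remove the axis just found before looking for the next.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return [list(point) for point in zip(*axes)]

# Hypothetical distance matrix for four samples (two similar pairs).
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
coords = pcoa(dist)
for point in coords:
    print([round(c, 3) for c in point])
```

Samples that are close in the distance matrix end up close together in the 2D coordinates.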

35 of 36

Beta Diversity Cluster Analysis

Groups sites together according to their taxonomic composition

Sites with more similar taxonomic composition will cluster together
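One way to group sites by similarity is agglomerative clustering: repeatedly merge the two closest clusters. The sketch below uses average linkage on a hypothetical distance matrix. The linkage rule is an assumption chosen for simplicity and may differ from what Ranacapa uses.

```python
def average_linkage(ids, dist):
    """Return the merge history as (left members, right members, distance)."""
    clusters = [[i] for i in range(len(ids))]
    merges = []

    def cluster_distance(c1, c2):
        # Mean distance over all cross-cluster pairs of samples.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        height = cluster_distance(clusters[a], clusters[b])
        merges.append(([ids[i] for i in clusters[a]],
                       [ids[i] for i in clusters[b]], height))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical distance matrix for four sites.
ids = ["S1", "S2", "S3", "S4"]
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
merges = average_linkage(ids, dist)
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 2))
```

The merge order is what a dendrogram draws: the most similar sites join first, at the lowest height.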

Image from ranacapa demo data

36 of 36

Beta-diversity group significance is calculated by permutational multivariate ANOVA

PERMANOVA is the same principle as an ANOVA, except it:

1. uses user-defined distance metrics instead of variance

2. permutes the dataset to assess statistical significance

R²: effect size (% variance explained)

P-value: statistical significance
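Both steps can be sketched in a few lines: compute a pseudo-F statistic from the distance matrix, then shuffle the group labels many times to see how often chance alone gives a value that large. The distance matrix, labels, number of permutations, and seed are all hypothetical choices for illustration.

```python
import random
from itertools import combinations

def pseudo_f(dist, labels):
    """Pseudo-F and R^2 computed from squared pairwise distances."""
    n = len(labels)
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = sum(
        sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
        for members in groups.values())
    ss_among = ss_total - ss_within
    a = len(groups)
    f = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f, ss_among / ss_total

def permanova(dist, labels, n_perm=999, seed=0):
    """Return (pseudo-F, R^2, permutation p-value)."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between label and sample
        if pseudo_f(dist, shuffled)[0] >= f_obs - 1e-12:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)

# Hypothetical data: 10 samples placed on a line, 5 burned and 5 unburned.
positions = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = ["Burned"] * 5 + ["Unburned"] * 5

f_obs, r2, p_value = permanova(dist, labels)
print(round(f_obs, 1), round(r2, 3), p_value)
```

Here the two groups are well separated, so R² is high and few shuffled labelings match the observed pseudo-F.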
