1 of 36

Alpha and Beta Microbial Diversity Metrics

Ranacapa Analysis

Tutorial by Chris Dao and Amanda Freise

UCLA MIMG Dept.

PUMA

2 of 36

Biodiversity can be measured at multiple scales

http://www.webpages.uidaho.edu/veg_measure/Modules/Lessons/Module%209(Composition&Diversity)/9_2_Biodiversity.htm

α-diversity: within an individual sample

Site A = 7 species

Site B = 5 species

β-diversity: diversity between multiple samples

Sites A and C have the highest β-diversity: 10 species differ between them, and only 2 species are in common
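The counts on this slide can be reproduced with set operations. This is a minimal sketch: the species labels are hypothetical placeholders, and the total species count for Site C is an assumption chosen so that the shared and differing counts match the slide.

```python
# Hypothetical species labels; only the counts mirror the slide.
site_a = {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6", "sp7"}
site_c = {"sp6", "sp7", "sp8", "sp9", "sp10", "sp11", "sp12"}  # assumed composition

alpha_a = len(site_a)          # alpha diversity (richness) within Site A
shared = site_a & site_c       # species found at both sites
differing = site_a ^ site_c    # species found at only one of the two sites

print(alpha_a, len(shared), len(differing))
```

The fewer species two sites share relative to those that differ, the higher their β-diversity.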

3 of 36

Defining alpha diversity

  • Richness: number of different species present
    • AKA number of different operational taxonomic units (OTUs) or amplicon sequence variants (ASVs)
    • Doesn’t tell us anything about the abundance of these OTUs/ASVs
  • Evenness: describes distribution of OTUs/ASVs

  • Various diversity indices exist which weight richness and evenness in different ways
  • No single index is best; the right choice depends on the question you are asking

4 of 36

Defining alpha diversity

  • Richness: number of different species present
    • Observed OTUs
    • Chao1
    • ACE

  • Evenness: describes distribution of OTUs/ASVs
    • Shannon index
    • Simpson’s diversity index
    • Phylogenetic distance
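As a rough illustration of how some of these indices behave, the sketch below computes observed richness, Chao1, Shannon, and Simpson from one sample's ASV counts. The function names and example counts are assumptions for illustration and are not Ranacapa's implementation; Chao1 is shown in its bias-corrected form and Simpson as 1 − Σp². ACE and phylogenetic distance are left out because they need more machinery (abundance classes and a tree, respectively).

```python
import math

def observed_richness(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: observed richness plus an estimate of unseen
    ASVs based on singletons (f1) and doubletons (f2)."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

def shannon(counts):
    """Shannon index H' = -sum(p * ln p) over the ASVs that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson's diversity index, written here as 1 - sum(p^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Two hypothetical samples with the same richness but different evenness.
even = [25, 25, 25, 25]
uneven = [97, 1, 1, 1]

for name, counts in (("even", even), ("uneven", uneven)):
    print(name, observed_richness(counts), chao1(counts),
          round(shannon(counts), 3), round(simpson(counts), 3))
```

Both samples contain four ASVs, yet the uneven sample scores much lower on Shannon and Simpson, which is why richness alone does not describe a community.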

5 of 36

Accurately estimating diversity

  • How are we measuring diversity for most of the possible indices?
    • By the ASVs
  • What happens if we miscalculate the number of ASVs present?
    • May over- or under-estimate diversity
  • Why might ASVs get miscalculated?
    • Several reasons…

6 of 36

Samples may have inconsistent sequencing results

  • The total number of sequences per sample may differ (e.g. far more DNA may be extracted from one sample)
    • How might that sample’s ASVs & diversity change?
  • Rare sequences may be missed during sequencing in favor of highly common sequences
    • How might that sample’s ASVs & diversity change?
  • How can we make sure we correctly and reproducibly estimate the number of ASVs across all our samples?
    • Rarefaction!

7 of 36

Rarefaction: a normalization method

  • An equal number of sequences from each sample is used for downstream analysis
  • The number of sequences (the rarefaction depth) is chosen by the researcher (e.g. 5,000 or 15,000 sequences)
    • If the depth is less than the total number of sequences obtained for a sample, some ASVs may be missed… what effect would that have?
  • How deep should we rarefy?
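Rarefying one sample amounts to drawing reads at random without replacement until the chosen depth is reached. The sketch below assumes a sample is stored as a dictionary of ASV name to read count; the `rarefy` helper, the ASV names, and the counts are all hypothetical, and returning `None` for samples that are too shallow is just one possible policy.

```python
import random
from collections import Counter

def rarefy(asv_counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample.
    Returns None if the sample has fewer than `depth` reads."""
    reads = [asv for asv, n in asv_counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return Counter(rng.sample(reads, depth))

# Hypothetical sample dominated by one ASV, with a few rare ones.
sample = {"ASV1": 9000, "ASV2": 900, "ASV3": 95, "ASV4": 5}

shallow = rarefy(sample, 100)
deep = rarefy(sample, 5000)
print(len(shallow), "ASVs observed at depth 100")
print(len(deep), "ASVs observed at depth 5000")
```

Shallow depths tend to drop rare ASVs such as `ASV4`, while a depth above the sample's total read count excludes the sample entirely, which is the trade-off behind the question of how deep to rarefy.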

8 of 36

Rarefaction curves indicate species coverage

Figure: rarefaction curves (x-axis: number of sequences), with three annotated curves:

  • Most or all species have been sampled
  • This site has not been exhaustively sampled
  • Only a small fraction of species has been sampled
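A rarefaction curve can be approximated by subsampling one sample at increasing depths and counting the ASVs observed each time. The following is a sketch under assumptions: the `rarefaction_curve` helper and the sample composition are hypothetical, and real tools typically average several random draws per depth rather than using a single one.

```python
import random

def rarefaction_curve(asv_counts, depths, seed=0):
    """Observed ASV richness at each subsampling depth (one random draw each).
    Depths larger than the sample's read count are skipped."""
    reads = [asv for asv, n in asv_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(reads):
            break
        curve.append((depth, len(set(rng.sample(reads, depth)))))
    return curve

# Hypothetical sample: 5 common ASVs and 50 rare ones (5100 reads in total).
sample = {f"common{i}": 1000 for i in range(5)}
sample.update({f"rare{i}": 2 for i in range(50)})

curve = rarefaction_curve(sample, [10, 100, 1000, 5000, 10000])
for depth, richness in curve:
    print(depth, richness)
```

A curve that is still climbing at the largest depth suggests the sample has not been exhaustively sequenced; a curve that has flattened suggests most ASVs were captured.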

9 of 36

Rarefaction curves

  • Different samples have different total numbers of reads
  • One solution: randomly subsample every sample to the same number of reads

Example: compare a subsample of 50k reads from each sample

10 of 36

Ranacapa provides several built-in analyses

  • Rarefaction
  • Alpha-diversity
  • Beta-diversity
  • And more…

Once we calculate alpha diversity, how can we ask scientifically meaningful questions?

Are some types of samples more or less diverse than other samples?

How do we categorize samples? METADATA!

11 of 36

Are these community differences significant?

12 of 36

Objective: explore the relationship between environmental parameters (metadata) and diversity

METADATA - Data about the data

  • How we create “groups” of data for comparisons

Compare categories:

  • Two samples directly (Sample A vs. Sample B)
  • Two groups of samples (Burned samples vs. Unburned samples)
  • Multiple groups of samples (Low vs. Medium vs. High soil phosphate levels)

For our analyses, metadata needs to be categorical (low, med., high) rather than continuous (e.g. 3.2, 5.3, 8.5)
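Continuous measurements can be turned into categories by binning. The sketch below uses the three example values from this slide; the `to_category` helper and the cutoffs of 4.0 and 7.0 are hypothetical, and sensible cutoffs depend on the measurement and the study.

```python
def to_category(value, low_cutoff, high_cutoff):
    """Bin a continuous measurement into Low / Medium / High.
    The cutoffs are hypothetical; choose them from your own data."""
    if value < low_cutoff:
        return "Low"
    if value < high_cutoff:
        return "Medium"
    return "High"

phosphate = [3.2, 5.3, 8.5]  # example values from the slide
categories = [to_category(v, low_cutoff=4.0, high_cutoff=7.0) for v in phosphate]
print(categories)
```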

13 of 36

Metadata Table Example

Sample ID     Team Name   Sample Location    Burn status   Phosphate content
S18.K0010A2   LIT         Skirball           Burned        Low
S18.K0011C1   FIF         Skirball           Unburned      Low
S18.K0011C2   FIF         Skirball           Burned        Medium
S18.K0033C1   SIT         Skirball           Unburned      Medium
S18.K0033C2   GoB         Botanical Garden   Unburned      Medium
S18.K0145B1   TDD         Botanical Garden   Unburned      Low
S18.K0147C2   BBB         Skirball           Unburned      Medium
S18.K0148A1   AL-Gs       Skirball           Burned        Low
S18.K0154C1   SIT         Skirball           Burned        Low
S18.K0154C2   FC          Skirball           Unburned      Medium
S18.K0191C1   LIT         Skirball           Burned        Medium
S18.K0192B2   AL-Gs       Botanical Garden   Unburned      Low
S18.K0192C1   TDD/FC      Skirball           Burned        High
S18.K0195C1   BBB         Skirball           Burned        High

Metadata Categories
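Grouping samples by a metadata category is a simple lookup once the table is in memory. The sketch below copies the sample IDs, burn status, and phosphate content from the metadata table example; the `group_by` helper and the tuple layout are illustrative assumptions, not how Ranacapa stores metadata.

```python
from collections import defaultdict

# (sample ID, burn status, phosphate content) from the metadata table example.
metadata = [
    ("S18.K0010A2", "Burned", "Low"),
    ("S18.K0011C1", "Unburned", "Low"),
    ("S18.K0011C2", "Burned", "Medium"),
    ("S18.K0033C1", "Unburned", "Medium"),
    ("S18.K0033C2", "Unburned", "Medium"),
    ("S18.K0145B1", "Unburned", "Low"),
    ("S18.K0147C2", "Unburned", "Medium"),
    ("S18.K0148A1", "Burned", "Low"),
    ("S18.K0154C1", "Burned", "Low"),
    ("S18.K0154C2", "Unburned", "Medium"),
    ("S18.K0191C1", "Burned", "Medium"),
    ("S18.K0192B2", "Unburned", "Low"),
    ("S18.K0192C1", "Burned", "High"),
    ("S18.K0195C1", "Burned", "High"),
]

def group_by(rows, column):
    """Map each category in the given column index to its list of sample IDs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)

by_burn = group_by(metadata, 1)
by_phosphate = group_by(metadata, 2)
for category, samples in by_phosphate.items():
    print(category, len(samples))
```

Note that the High phosphate group holds only two samples, which limits how much any statistical comparison involving that group can show.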

15 of 36

Grouping samples together using metadata allows us to make inferences about ecological parameters, e.g. burned vs. unburned

16 of 36

T-test

  • Tests whether the means of two groups differ by a statistically significant amount

  • Examples of hypotheses testable by a t-test:

    • Is the mean alpha diversity greater in burned than in unburned samples?
    • Is the mean abundance of Actinomyces higher in Section 1A samples than in Section 1B samples?
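To make the comparison concrete, here is a sketch that computes a t statistic for two groups. It uses Welch's form of the test, which does not assume equal variances; that choice, the `welch_t` helper, and the Shannon values are assumptions for illustration. Turning the statistic into a p-value requires the t distribution, which a library routine such as SciPy's `scipy.stats.ttest_ind` provides.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(group_a) / len(group_a)   # squared standard error of mean A
    vb = variance(group_b) / len(group_b)   # squared standard error of mean B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df

# Hypothetical Shannon diversity values for two groups of samples.
burned = [2.1, 2.4, 1.9, 2.2]
unburned = [3.0, 3.3, 2.8, 3.1]

t, df = welch_t(burned, unburned)
print(round(t, 2), round(df, 1))
```

A t statistic far from zero means the difference between the group means is large relative to the spread within the groups.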

17 of 36

Variance

  • Average of the squared distance of each sample value from the mean
  • Measure of how “spread apart” the data is

    • Which sample grouping (1A or 1B) has more variance?

Image from: http://www.statisticshowto.com/sample-variance/
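The definition above translates directly into code. In this sketch the two groups are hypothetical numbers (not read from the image) chosen to share a mean but differ in spread; the sample variance divides by n − 1 instead of n, which is the version most statistics software reports.

```python
def population_variance(values):
    """Average squared distance of each value from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def sample_variance(values):
    """Same idea, but divides by n - 1 to correct for estimating the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Hypothetical groups with the same mean (5) but different spread.
section_1a = [4, 5, 5, 6]
section_1b = [1, 3, 7, 9]

print(population_variance(section_1a), population_variance(section_1b))
```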

18 of 36

ANOVA, or Analysis of Variance

  • Generalizes t-test for 3+ categories: are the means of the category groups different?

  • ANOVA quantifies how well different metadata category groupings explain the variance between the samples

  • ANOVA partitions variance between categories and within categories:
    • If there is more variance between categories and less within categories, then the categories explain the variance in the data well
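The partitioning described above can be sketched as a one-way ANOVA F-statistic: the variance between category means divided by the variance within categories. The `one_way_anova` helper and the diversity values are hypothetical, and the p-value lookup (which needs the F distribution) is omitted.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical alpha diversity values grouped by phosphate level.
low, medium, high = [2.0, 2.2, 2.1], [2.9, 3.1, 3.0], [4.0, 4.1, 3.9]
f_separated, df_b, df_w = one_way_anova([low, medium, high])

# The same nine values regrouped so that the categories explain almost nothing.
f_mixed, _, _ = one_way_anova([[2.0, 3.1, 3.9], [2.2, 2.9, 4.1], [2.1, 3.0, 4.0]])
print(round(f_separated, 1), round(f_mixed, 4))
```

The same nine values give a very large F when the grouping matches the structure in the data and an F near zero when it does not, which is what drives the p-value down or up.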

19 of 36

ANOVA example

20 of 36

ANOVA post-hoc tests

  • If there is a significant overall ANOVA result, post-hoc tests between pairs of categories determine which categories have different means from each other
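A post-hoc step compares every pair of categories and corrects for the number of comparisons made. The sketch below is not the method a specific tool uses (Tukey's HSD is a common choice); it swaps in an exact permutation test on the difference in means with a Bonferroni correction because both fit in a few lines. The helper names and group values are hypothetical.

```python
from itertools import combinations

def exact_permutation_p(a, b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = a + b
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    total = sum(pooled)
    hits = n_splits = 0
    # Enumerate every way of assigning len(a) of the pooled values to group A.
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        hits += diff >= observed - 1e-12   # small tolerance for float rounding
        n_splits += 1
    return hits / n_splits

def pairwise_posthoc(groups):
    """Bonferroni-adjusted p-value for every pair of named groups."""
    pairs = list(combinations(groups, 2))
    return {
        (g1, g2): min(1.0, exact_permutation_p(groups[g1], groups[g2]) * len(pairs))
        for g1, g2 in pairs
    }

# Hypothetical alpha diversity values per category.
groups = {
    "Low": [2.0, 2.2, 2.1, 1.9, 2.3],
    "Medium": [2.1, 2.0, 2.4, 2.2, 1.8],
    "High": [4.0, 4.1, 3.9, 4.2, 3.8],
}
adjusted = pairwise_posthoc(groups)
for pair, p in adjusted.items():
    print(pair, round(p, 4))
```

In this made-up example only the pairs involving High differ, so the overall ANOVA result would be driven by that one category.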

21 of 36

ANOVA example

Which ANOVA will have a lower p-value?

Recall that ANOVA partitions variance between categories and within categories

If there is more variance between categories and less within categories, then the categories explain the variance in the data well (lower p-value)

Comparison 1

Comparison 2

22 of 36

ANOVA example

Recall that ANOVA partitions variance between categories and within categories

p=0.82

If there is more variance between categories and less within categories, then the categories explain the variance in the data well (lower p-value)

p=0.02

Comparison 1

Comparison 2

23 of 36

ANOVA table in Ranacapa

Alpha diversity plots

Comparison 1

Alpha diversity stats

  • Degrees of freedom: reflects sample size
  • Sum/mean of squares: variance
  • F-statistic: a higher value means more variance is explained
  • p-value: probability of an F-statistic at least this large arising by chance if the group means were truly equal

24 of 36

ANOVA post-hoc tests

  • If there is a significant overall ANOVA result, post-hoc tests between pairs of categories determine which specific pairs of categories have different means from each other

Moving on to beta diversity…

25 of 36

Representation of beta-diversity

  • Recall: beta diversity is the diversity between samples

  • Single samples do not have a beta diversity value, pairs of samples do.

  • The following sentence does not make sense:
      • “Sample A has more beta diversity than Sample B.”

26 of 36

Representation of beta-diversity

  • Different distance metrics quantify how different the microbial communities of two samples are from each other:

    • Bray-Curtis – takes abundance of ASVs into account
    • Jaccard – only takes presence/absence of ASVs into account
    • UniFrac – phylogenetic metric; how far apart different ASVs are on a phylogenetic tree matters

  • The overall community structure of our 14 samples can be represented by a distance matrix
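The two non-phylogenetic metrics are short enough to write out. This sketch assumes each sample is a dictionary of ASV name to read count; the sample data and helper names are hypothetical, and UniFrac is omitted because it needs a phylogenetic tree.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 1 - 2 * (shared abundance) / (total abundance)."""
    asvs = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in asvs)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))

def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |shared ASVs| / |all ASVs|."""
    present_a = {k for k, n in a.items() if n > 0}
    present_b = {k for k, n in b.items() if n > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)

def distance_matrix(samples, metric):
    """Pairwise distances between all samples as a nested dictionary."""
    return {s1: {s2: metric(samples[s1], samples[s2]) for s2 in samples}
            for s1 in samples}

# Hypothetical samples: A and B share the same ASVs at different abundances,
# C shares no ASVs with either.
samples = {
    "A": {"ASV1": 10, "ASV2": 5, "ASV3": 1},
    "B": {"ASV1": 1, "ASV2": 5, "ASV3": 10},
    "C": {"ASV4": 8, "ASV5": 8},
}
bc = distance_matrix(samples, bray_curtis)
jc = distance_matrix(samples, jaccard)
print(bc["A"]["B"], jc["A"]["B"])
```

Samples A and B are identical under Jaccard but clearly different under Bray-Curtis, which shows why the choice of metric changes the picture of community structure.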

28 of 36

Beta-diversity distance matrix

  • We can generate a distance matrix for each beta-diversity metric we use

Bray-Curtis distance matrix of 109BL-S18 samples made using QIIME2

29 of 36

Highly multidimensional datasets

  • We have:
    • ASV table (~2800 dimensions)
    • Distance matrix (14 dimensions)

  • We are interested in whether the microbial community composition can be explained by certain metadata groupings
  • Could try plotting them to see which samples are close to each other… but there are too many dimensions to plot

30 of 36

Overview of ordination methods

  • PCA = Principal Component Analysis (STAMP)
  • PCoA = Principal Coordinate Analysis (QIIME)
  • NMDS = Non-metric Multidimensional Scaling (Ranacapa)

  • All of these techniques reduce the dimensionality of the data while preserving the distances between samples as much as possible. This allows the data to be viewed in 2D or 3D plots.

  • PCA uses the original ASV table, while PCoA and NMDS use the distance matrices

31 of 36

Beta Diversity (Ordination)

Microbial community profile visualization

Qualitative question: do data points with the same metadata label cluster with each other?

Each dot represents a sample, colored based on its associated metadata category grouping

32 of 36

Beta Diversity in Ranacapa

Each dot represents a sample, colored based on its associated metadata category grouping

33 of 36

PCA is an axis transformation

Here’s a reduction from 2D to 1D

Our actual PCA is a reduction from 2800D to 2D or 3D!
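The 2D-to-1D case is small enough to compute by hand. This sketch finds the direction of greatest variance for a set of 2D points using the closed-form angle for a 2x2 covariance matrix, then projects each point onto that axis. The points and the `pca_2d_to_1d` helper are hypothetical; real datasets with thousands of dimensions need a linear algebra library instead.

```python
import math

def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    # Each point's coordinate along the new axis.
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, scores

# Hypothetical points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
axis, scores = pca_2d_to_1d(points)
print([round(a, 3) for a in axis])
print([round(s, 3) for s in scores])
```

The new axis points roughly along the diagonal, so a single coordinate per point keeps almost all of the spread that originally took two coordinates to describe.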

34 of 36

PCoA and NMDS are similar to PCA

  • PCoA and NMDS operate on distance matrices rather than the original sample/ASV data

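  • PCA and PCoA give equivalent results if the distance metric used for PCoA is the Euclidean distance. However, the Bray-Curtis and Jaccard distance metrics are more often used in ecology.

  • PCoA and NMDS serve the same purpose with different algorithms: PCoA uses an eigendecomposition of the distance matrix, while NMDS iteratively arranges the points to preserve the rank order of the distances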

35 of 36

Beta Diversity Cluster Analysis

  • Groups sites together according to their taxonomic composition

  • Sites with more similar taxonomic composition will cluster together

Image from ranacapa demo data
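One common way to build such a clustering is agglomerative clustering with average linkage: repeatedly merge the two closest clusters until one remains. Whether Ranacapa uses this particular linkage is an assumption; the `average_linkage` helper and the distances below are hypothetical.

```python
from itertools import combinations

def average_linkage(dist, labels):
    """Average-linkage clustering. `dist[a][b]` is the distance between
    samples a and b. Returns nested tuples describing the merge order."""
    clusters = [(label, [label]) for label in labels]  # (tree, member samples)

    def avg(pair):
        (_, m1), (_, m2) = pair
        return sum(dist[a][b] for a in m1 for b in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=avg)  # closest pair
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical beta-diversity distances between four sites.
pairs = {("A", "B"): 0.1, ("C", "D"): 0.2, ("A", "C"): 0.9,
         ("A", "D"): 0.9, ("B", "C"): 0.9, ("B", "D"): 0.9}
labels = ["A", "B", "C", "D"]
dist = {a: {} for a in labels}
for (a, b), d in pairs.items():
    dist[a][b] = dist[b][a] = d

tree = average_linkage(dist, labels)
print(tree)
```

Sites with similar composition (small distances) are merged first, so they end up on neighboring branches of the resulting tree.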

36 of 36

Beta-diversity group significance is calculated by permutational multivariate ANOVA

PERMANOVA follows the same principle as ANOVA, except that it:

1. uses a user-chosen distance metric instead of the variance of a single variable

2. permutes the group labels to assess statistical significance

R²: effect size (% of variance explained)

P-value: statistical significance
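The two ingredients above can be sketched in a few lines: a pseudo-F statistic computed from a distance matrix, and a p-value obtained by shuffling the group labels. This is a simplified illustration under assumptions, not the routine Ranacapa calls; the helper names, the one-dimensional toy distances, and the default of 999 permutations are all hypothetical choices.

```python
import random

def pseudo_f(dist, labels):
    """Pseudo-F and R^2 from a square distance matrix (list of lists)
    and one group label per sample."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_between = ss_total - ss_within
    f = (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f, ss_between / ss_total

def permanova(dist, labels, n_perm=999, seed=0):
    """Return (pseudo-F, R^2, permutation p-value)."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between label and sample
        hits += pseudo_f(dist, shuffled)[0] >= f_obs
    return f_obs, r2, (hits + 1) / (n_perm + 1)

# Toy data: eight samples placed on a line, two well-separated groups.
positions = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["Burned"] * 4 + ["Unburned"] * 4

f_obs, r2, p = permanova(dist, labels)
print(round(f_obs, 1), round(r2, 3), p)
```

With so few samples there are only a limited number of distinct label arrangements, so the smallest achievable p-value is bounded by the sample size no matter how well separated the groups are.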
