HUB
Metagenomics 1
Amine Ghozlane
Institut Pasteur
Bacterial identification
1900s
Microbes cause diseases
Louis Pasteur, Robert Koch
1700s
Microscope
Anton van Leeuwenhoek
Microbes categorization
GOLDEN ERA
OF BACTERIOLOGY
1800s
Smallpox “vaccination”
(spread to EU)
1880: Optical microscopy
and Gram stain classification
1931: Electronical microscopy
Small viral particle detection
Prokaryota and Eukaryota classification
Microbes were identified by culturing and phenotyping techniques
1859: Spontaneous generation
Bacterial identification
1900s
Microbes cause diseases
Louis Pasteur, Robert Koch
1700s
Microscope
Anton van Leeuwenhoek
Microbes categorization
GOLDEN ERA
OF BACTERIOLOGY
1800s
Smallpox “vaccination”
(spread to EU)
1880: Optical microscopy
and Gram stain classification
1931: Electronical microscopy
Small viral particle detection
Prokaryota and Eukaryota classification
2001: ABI PRISM 3700, capable of sequencing 1M nucleotide per day.
1990s ---
DNA sequencing technology
Bacterial identification / characterization
Microscope
Culturing/ Phenotyping
Sequencing
Mass spectrometry
Metagenomics
To study the diversity of taxa
To study the diversity of taxa and of their associated genes/functions
To study the active fraction of the diversity
Who is there ?
What are they able to do ?
What are they doing ?
What is the difference between these environments ?
Metagenomics
Metagenomics
Sample collection
DNA extraction
Functional Metagenomics
Screens capacity of an ecosystem
Targeted Metagenomics
Signature of an ecosystem
Quantitative Metagenomics
Genetic potential
Comparative Metagenomics
Comparison of ecosystems
In silico analysis
MICROBIAL ECOSYSTEM
Unknown composition
T
Targeted metagenomics
ITS : located between 18S and 5.8S rRNA genes
Image : Alberts Molecular Biology of the Cell 5th
50S
30S
Targeted metagenomics
Focus on one gene of the ribosomal RNA
Archaea
Bacteria
Fungi
Protozoa
[Woese et al, Nature, 1975]
1975: 16S rRNA oligomer alignment to identify 9 variable regions
1977: Using 16S/18S rRNA oligomer alignment to define the eubacteria (Bacteria), archaebacteria (Archaea) and eukaryotes (Eukaryota)
[Woese and Fox, PNAS, 1977]
A taxonomical classification based on ONE molecule !
2D electrophoretic separation to produce an oligonucleotide fingerprint
Occurence of oligomers detected at a given position in 28 organisms !
Isolate
Hierarchical classification of nature initiated by Carl Linnaeus
*Daubin et al. 2003 (Science) **Weisburg et al. 1991 (J Bacteriol.) *** Illumina protocol
Yarza et al. 2014 (Nature reviews Microbiology)
Variable regions
16S rRNA
*Daubin et al. 2003 (Science) **Weisburg et al. 1991 (J Bacteriol.) *** Illumina protocol
16S rRNA
*Mysara 2017
16S rRNA
16S rRNA
Streptococcus thermophilus
Streptococcus salivarius
High similarity at the genus level
16S rRNA
RAST alignment of the two genomes
S. thermophilus
ATCC 19258
~2000 proteins aligned to
S. salivarius NCTC 8618
Taxonomical annotation
SILVA [Pruesse, et al. 2007]:
Greengenes [DeSantis et al. 2006]:
Ribosomal Database Project [Maidak et al. 1994]:
Kunin et al. 2010
Q20
Q20
“Group of DNA sequences that share a defined level of similarity”*
*Vetrovsky and Baldrian 2013 (Plos One)
Q30
Q30
Operational Taxonomic Unit (OTU)
*Westcott, Schloss, 2016 PeerJ; Rideout 2014; Schmidt et al. 2015
Targeted metagenomics strategies
Hierarchical clustering
Algorithm:
Distances:
Let’s play with
Clustering methods
Clustering criteria: 800 km
Single linkage
Hierarchical-clustering
Distance
Distance between US cities | Atlanta | Chicago | Denver | Houston | Los Angeles | Miami | New York | San Francisco | Seattle |
Atlanta | | | | | | | | | |
Chicago | 587 | | | | | | | | |
Denver | 1212 | 920 | | | | | | | |
Houston | 701 | 940 | 879 | | | | | | |
Los Angeles | 1936 | 1745 | 831 | 1374 | | | | | |
Miami | 604 | 1188 | 1726 | 968 | 2339 | | | | |
New York | 748 | 713 | 1631 | 1420 | 2451 | 1092 | | | |
San Francisco | 2139 | 1858 | 949 | 1645 | 347 | 2594 | 2571 | | |
Seattle | 2182 | 1737 | 1021 | 1891 | 959 | 2734 | 2408 | 678 | |
Washington DC | 543 | 597 | 1494 | 1220 | 2300 | 923 | 205 | 2442 | 2329 |
Clustering criteria: 800 km
Distance
Distance between US cities | Atlanta | Chicago | Denver | Houston | Los Angeles | Miami | New York | San Francisco | Seattle |
Atlanta | | | | | | | | | |
Chicago | 587 | | | | | | | | |
Denver | 1212 | 920 | | | | | | | |
Houston | 701 | 940 | 879 | | | | | | |
Los Angeles | 1936 | 1745 | 831 | 1374 | | | | | |
Miami | 604 | 1188 | 1726 | 968 | 2339 | | | | |
New York | 748 | 713 | 1631 | 1420 | 2451 | 1092 | | | |
San Francisco | 2139 | 1858 | 949 | 1645 | 347 | 2594 | 2571 | | |
Seattle | 2182 | 1737 | 1021 | 1891 | 959 | 2734 | 2408 | 678 | |
Washington DC | 543 | 597 | 1494 | 1220 | 2300 | 923 | 205 | 2442 | 2329 |
Clustering criteria: 800 km
Single linkage
Hierarchical-clustering
Distance
Distance between US cities | Atlanta | Chicago | Denver | Houston | Los Angeles | Miami | New York | San Francisco | Seattle |
Atlanta | | | | | | | | | |
Chicago | 587 | | | | | | | | |
Denver | 1212 | 920 | | | | | | | |
Houston | 701 | 940 | 879 | | | | | | |
Los Angeles | 1936 | 1745 | 831 | 1374 | | | | | |
Miami | 604 | 1188 | 1726 | 968 | 2339 | | | | |
New York | 748 | 713 | 1631 | 1420 | 2451 | 1092 | | | |
San Francisco | 2139 | 1858 | 949 | 1645 | 347 | 2594 | 2571 | | |
Seattle | 2182 | 1737 | 1021 | 1891 | 959 | 2734 | 2408 | 678 | |
Washington DC | 543 | 597 | 1494 | 1220 | 2300 | 923 | 205 | 2442 | 2329 |
Clustering criteria: 800 km
Single linkage
Hierarchical-clustering
Distance
Distance between US cities | Atlanta | Chicago | Denver | Houston | Los Angeles | Miami | New York | San Francisco | Seattle |
Atlanta | | | | | | | | | |
Chicago | 587 | | | | | | | | |
Denver | 1212 | 920 | | | | | | | |
Houston | 701 | 940 | 879 | | | | | | |
Los Angeles | 1936 | 1745 | 831 | 1374 | | | | | |
Miami | 604 | 1188 | 1726 | 968 | 2339 | | | | |
New York | 748 | 713 | 1631 | 1420 | 2451 | 1092 | | | |
San Francisco | 2139 | 1858 | 949 | 1645 | 347 | 2594 | 2571 | | |
Seattle | 2182 | 1737 | 1021 | 1891 | 959 | 2734 | 2408 | 678 | |
Washington DC | 543 | 597 | 1494 | 1220 | 2300 | 923 | 205 | 2442 | 2329 |
Clustering criteria: 800 km
Single linkage
Hierarchical-clustering
Single linkage
Hierarchical-clustering
Single linkage
Hierarchical-clustering
Compute every distance
We need a better algorithm to do it.
SLOW !
Hypothesis
If most bases are good, most unique sequences are bad,
because good reads are all alike, but every bad read is bad in its own way. (Lost origin)
Greedy clustering
Algorithm:
Abundance-based Greedy Clustering methods
Main ideas:
Abundance-based Greedy Clustering methods
City | Population |
New York | 8550405 |
Los Angeles | 3958125 |
Chicago | 2722389 |
Houston | 2099451 |
San Francisco | 852469 |
Washington DC | 646449 |
Seattle | 634535 |
Denver | 634265 |
Atlanta | 443775 |
Miami | 430332 |
New York - Los Angeles = 2451 km > 800 km
Clustering criteria: 800 km
Abundance-based Greedy Clustering methods
Abundance-based Greedy Clustering methods
City | Population |
New York | 8550405 |
Los Angeles | 3958125 |
Chicago | 2722389 |
Houston | 2099451 |
San Francisco | 852469 |
Washington DC | 646449 |
Seattle | 634535 |
Denver | 634265 |
Atlanta | 443775 |
Miami | 430332 |
New york - Los Angeles = 2451 km > 800 km
New york - Chicago = 713 km
Clustering criteria: 800 km
Abundance-based Greedy Clustering methods
Abundance-based Greedy Clustering methods
Abundance-based Greedy Clustering methods
Outcome:
Example :
n=200 sequences
Cost = 40000 operations <<< 8e+06 operations in hierarchical clustering
Targeted metagenomics pipeline
FASTQ reads
OTU
Read Quality analysis
FastqQC
Alientrimmer,, Fastx-Toolkit, Cudadapt, …
MegaBLAST, BLAT, Vsearch, …
Flash, Pear, Pandaseq
Merging
Cleaning/Trimming
Chimera filtering
Annotation
Clustering
Mapping
Count matrix
Paired-end
Single-end
Annotation table
Dereplication
Biom
vsearch, usearch, Chimera slayer, gold database...
vsearch, usearch, cd-hit, uclust...
vsearch, usearch, cd-hit, uclust...
Singleton removal
vsearch, cd-hit, uclust...
1
2
3
4
5
6
7
Targeted metagenomics pipeline
GATTACA … AGGCTTTA
30, 31 … 20, 16, 15, 14
READ
Quality
Sequence
GATTACA … AGGCT
30, 31 … 20
READ
trimmed
Quality
Sequence
1) Read Trimming
Targeted metagenomics pipeline
GATTACA … AGGCTTTA
30, 31 … 20, 16, 15, 14
READ
Quality
Sequence
GATTACA … AGGCT
30, 31 … 20
READ
trimmed
Quality
Sequence
1) Read Trimming
2) Read Merging
GATTACA … AGGCT
Sequence
AGGCT …
Targeted metagenomics pipeline
GATTACA … GCT
GATTACA … GCT
Full length
GATTACA … GCT
GATTACA …
Prefix
GATTACA … GCT
...
Substring
3) Dereplication and singleton removal
Biological sequence X
Biological sequence Y
Chimera formed from X and Y
4) Chimera filtering
Targeted metagenomics pipeline
4) Chimera filtering
Biological sequence X
Haas et al. 2011 :
Aborted extension
Biological sequence Y
Mis-priming
Biological sequence Y
Extension
Chimera X-Y
Amplification
Targeted metagenomics pipeline
4) Chimera filtering
Biological sequence X
Haas et al. 2011 :
Aborted extension
Biological sequence Y
Mis-priming
Biological sequence Y
Extension
Chimera X-Y
Amplification
How to detect and kill the chimera ?
Targeted metagenomics pipeline
Chimera Detection
UCHIME [Edgar et al. 2011]
Previously seen sequence
Sequence S
CHUNK
Split S into chunks
CHUNK
CHUNK
CHUNK
Align and find closest pairs ( A, B)
Sequence S
B
A
Align
Compute sequence identity: A-S-B
Targeted metagenomics pipeline
OTUs
Model
Sequence
Sequence S
A. Model - Sequence <3%, Assign to OTU
B. Model - Sequence ≥3%, new OTU
OTUs
No match
Sequence
Sequence S
Add to database
5) Clustering
Sequence S
Targeted metagenomics pipeline
With the courtesy of Metagenopolis
6) Mapping
OTU
1000
The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”
Taxonomy classification of organisms
MOST
SPECIFIC
MOST
INCLUSIVE
>94.5%
Strain
86.5%
82%
78.5%
75%
Yarza et al. 2014
Phylogeny: a complete history of life ?
Cozannet 2021
Comparing microbial communities ?
X
Y
Statistical analysis
Test if is the abundance of is different between X and Y
Metrics
Comparing microbial communities ?
Diversity
N = 33
X
N = 16
Y
Diversity
S = 4
N = 16
S = 3
N = 16
X
Y
RAREFACTION = “DOWNSIZING”
Alpha diversity
A
B
C
S = 3
N= 10
S = 4
N = 10
S = 2
N = 10
CONDITION 1
CONDITION 2
D
S = 8
N = 10
Gamma diversity
A
B
C
S = 3
N= 10
S = 4
N = 10
S = 2
N = 10
CONDITION 1
CONDITION 2
D
S = 8
N = 10
Beta diversity
A
B
C
S = 3
N= 10
S = 4
N = 10
S = 2
N = 10
CONDITION 1
CONDITION 2
D
S = 8
N = 10
ɣ1
α1
Shannon
-(4/16*log(4/16)+5/16*log(5/16)+2/16*log(2/16)+5/16*log(5/16))
S = 4
N = 16
S = 3
N = 16
X
Y
Shannon = 1.33
Shannon = 1.08
Bray-Curtis dissimilarity
A
B
SA = 3
N= 10
SB = 4
N = 10
| | | | | |
A | 3 | 4 | 3 | 0 | 0 |
B | 4 | 0 | 1 | 3 | 2 |
Bray-Curtis dissimilarity
A
B
SA = 3
N= 10
SB = 4
N = 10
| | | | | |
A | 3 | 4 | 3 | 0 | 0 |
B | 4 | 0 | 1 | 3 | 2 |
Otherwise a larger size (N) in one site would be misleading because the difference is not their composition but their size.
Unifrac distance
OTU
Multiple sequence
Alignment
Unifrac
Muscle, Mafft
Tree construction
Rooting tree
Fasttree, raxml…
Principal Coordinates Analysis (PCoA)
Rarefaction curves
Krona plot
16S – Challenges
*Vetrovsky and Baldrian 2013 (Plos One), Klappenbach et al. 2001 (Nuc Acid Res)
**Acinas et al. 2004 (J Bacteriol)
***Stoddard et al. 2015 (Nucl Acid Res)
16S – Challenges
*Vetrovsky and Baldrian 2013 (Plos One), Klappenbach et al. 2001 (Nuc Acid Res)
**Acinas et al. 2004 (J Bacteriol)
***Stoddard et al. 2015 (Nucl Acid Res)
NO !
16S – Challenges
*Vetrovsky and Baldrian 2013 (Plos One), Klappenbach et al. 2001 (Nuc Acid Res)
**Acinas et al. 2004 (J Bacteriol)
***Stoddard et al. 2015 (Nucl Acid Res)
OK ! but in absolute count
NO !
Solution?
BIOM format
Motivation:
Annotation
Metadata
Count
shaman.pasteur.fr
Better statistics
Simplicity
Reproducible research
Raw data submission
71 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Unique key associated to your email
Parameters are saved with your key
We do not store your reads after calculations unless:
-> crashed dataset are deleted regularly
Except ITS databases: Unite, Findley, Underhill
Raw data submission
72 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Global report
Detailed process
Download available
A mail is also automatically sent
Processed data submission
73 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Tables
BIOM / Epi2me (nanopore)
Key
Exemple datasets
Processed data submission
74 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
% of OTU annotated
Download available
Unifrac enabled
Processed data submission
75 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
DESeq2 Metagenomics
| Metagenomics |
Distribution | Overdispersed counts → Negative binomial |
| |
Constraints | Highly abundant species |
| |
Goal | Find differentially abundant features (species, family, …): Feature distributions and abundances vary between conditions |
16S : McMurdie, Holmes, Plos Comput Biol,2014
WGS : Jonsson, BMC Genomics, 2016
76 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Check our publication in BMC Bioinfo !
Number of observations
Sparsity ratio
Modified normalizations
78 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Data filtering
79 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Statistical modelling
Automatically determined from the statistical model
31 interactives visualizations
81 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Diagnostic plots
How good is my normalisation ? modelisation ? effect size ?
82 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Significant features
83 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Global visualisations
84 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Global visualisations
abundance tree
85 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
Global visualisations
86 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
An. stephensi Prevalence GS
Cedecea R2 =15%
Serratia R2 =26%
Contrast comparison
87 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 25/06/2021
shaman.pasteur.fr
« There is no disputing the importance of statistical analysis in biological research, but too often it is considered only after an experiment is completed, when it may be too late. »
Raw data submission
(targeted metagenomics)
Targeted metagenomics at Pasteur
shaman.pasteur.fr
Annotation
Count matrix
shaman.pasteur.fr
target
M2IPFB
FASTQ reads
OTU
Q20
Id 0.75 on both strand,
take best hit only,
Database is in database folder
Merging
Cleaning/Trimming
Chimera filtering
Annotation
Clustering
Mapping
Count matrix
Paired-end
Annotation table
Dereplication
De novo chimera filtering
Clustering based on abundance at Id 0.97 both strand
Dereplication fulllength on both strand
Singleton removal
Min abundance > 10
min overlap 40,
max diffs 15
Group all sequences in one file
Each amplicon origin’s is specified !
Global alignment at Id 0.97 both strand
SHAMAN
M2BI
FASTQ reads
OTU
Q20
Id 0.75 on both strand,
take best hit only,
Database is in database folder
Merging
Cleaning/Trimming
Chimera filtering
Annotation
Clustering
Mapping
Count matrix
Paired-end
Annotation table
Dereplication
De novo chimera filtering
Clustering based on abundance at Id 0.97 both strand
Dereplication fulllength on both strand
Singleton removal
Min abundance > 10
min overlap 40,
max diffs 15
Group all sequences in one file
Each amplicon origin’s is specified !
Global alignment at Id 0.97 both strand
For the practical sessions
you will need to use a VPN !
Option 1: Install a remote access software
How (if not installed) ?
Go to the following address : http://connect.pasteur.fr/
and then get logged with your usual Pasteur ID and password
Scroll down, until you find :
Choose the suited remote access client (depending on your OS)
and install it !
Once installed, you should be able to open the software and log-in using your usual Pasteur IDs
For the practical sessions
you will need to use a VPN !
Option 2: directly use the connect.pasteur.fr interface
Go to the following address : http://connect.pasteur.fr/
and then get logged with your usual Pasteur ID and password
Enter the VM address :
desktop.pasteur.fr
Practical session - Command line on virtual machine
Ubuntu virtual machine accessible with web browser or using VMware
https://desktop.pasteur.fr
Practical session - Command line on virtual machine
Ubuntu virtual machine accessible with web browser or using VMware
https://desktop.pasteur.fr
Practical session
M2BI
WARNINGS
@M01626:316:000000000-AY73Y:1:1101:15636:1045 1:N:0:38
>M01626:316:000000000-AY73Y:1:1101:11704:11521:N:0:38;sample=1ng-25cycles-1;
Amplicon Fasta:
A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota
101 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 28/06/2016
A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota
102 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 28/06/2016
A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota
WT
Delta
Delta-compl
-48h
-24h
6h
24h
T0
T2
T3
16S : V1-V3
PNAS 2016
103 ∙ Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis ∙ 28/06/2016
Data quality check
Evaluation of the read quality by plotting the Phred scores
and the distribution of mean read quality
Sequencing quality tends to decrease at the end of the reads
- > Trimming step
Data quality check
Evaluation of the read quality by plotting the Phred scores
and the distribution of mean read quality
Sequencing quality tends to decrease at the end of the reads
- > Trimming step
Data quality check
These patterns are expected due to the ‘not so random’ primers that will fix on regions of the genome with biaised GC content.
Data quality check
This profil means that some reads have an intermediate length (excess of intermediate-length reads)
-> The aligner can easily deal with it
This profil means that some reads are duplicated
Data quality check
What about metagenomics real life data now ?
Data quality check
Data quality check
Data quality check
Data quality check
Do you think it is worth trimming
the bad quality reads ?
You can run your own fastqc analysis in the terminal
Targeted metagenomics pipeline
Where to find the data ?
ctp_1.tar.gz
c3bi.pasteur.fr/m2p7
and
Click on cours1
Metagenomics
application to
Image : http://phdcomics.com/comics/archive.php?comicid=1874
Mapping
*Edgar et al. 2013 (Nature methods)
Isolated singleton
“Bad” singleton
Recaptured singletons
Kunin et al. 2010
“Group of DNA sequences that share a defined level of similarity”*
*Vetrovsky and Baldrian 2013 (Plos One)
Operational Taxonomic Unit (OTU)
Loosely based on Stackebrandt and Ebers 2006
“Group of DNA sequences that share a defined level of similarity”*
*Vetrovsky and Baldrian 2013 (Plos One)
Operational Taxonomic Unit (OTU)
Johnson 2019