1 of 117

HUB

Metagenomics 1

Amine Ghozlane

Institut Pasteur

2 of 117

Bacterial identification

1900s

Microbes cause diseases

Louis Pasteur, Robert Koch

1700s

Microscope

Anton van Leeuwenhoek

Microbes categorization

  • Physical form
  • In/out products

GOLDEN ERA OF BACTERIOLOGY

1800s

Smallpox “vaccination”

(spread to EU)

1880: Optical microscopy

and Gram stain classification

1931: Electron microscopy

Small viral particle detection

Prokaryota and Eukaryota classification

Microbes were identified by culturing and phenotyping techniques

1859: Spontaneous generation

3 of 117

Bacterial identification

2001: ABI PRISM 3700, capable of sequencing 1M nucleotides per day.

1990s ---

DNA sequencing technology

4 of 117

Bacterial identification / characterization

Microscope

Culturing/ Phenotyping

Sequencing

Mass spectrometry

5 of 117

Metagenomics

To study the diversity of taxa

To study the diversity of taxa and of their associated genes/functions

To study the active fraction of the diversity

6 of 117

Metagenomics

Who is there?

  • Taxonomic annotation
  • Co-Abundance Gene groups (CAG)
  • Metagenome-Assembled Genomes (MAGs)

What are they able to do?

  • Gene/protein prediction
  • Functional annotation
  • Metabolic network reconstruction

What are they doing?

  • RNA/Protein quantification

What is the difference between these environments?

  • Comparative metagenomics
  • Quantitative metagenomics

7 of 117

Metagenomics

Sample collection

DNA extraction

Functional Metagenomics

Screens capacity of an ecosystem

Targeted Metagenomics

Signature of an ecosystem

Quantitative Metagenomics

Genetic potential

Comparative Metagenomics

Comparison of ecosystems

In silico analysis

MICROBIAL ECOSYSTEM

Unknown composition

8 of 117

Targeted metagenomics

9 of 117

ITS: located between the 18S and 5.8S rRNA genes (ITS1) and between the 5.8S and 28S rRNA genes (ITS2)

Image : Alberts Molecular Biology of the Cell 5th

50S

30S

Targeted metagenomics

Focus on a single ribosomal RNA gene

Archaea

Bacteria

Fungi

Protozoa

10 of 117

[Woese et al, Nature, 1975]

1975: 16S rRNA oligomer alignment to identify 9 variable regions

1977: Using 16S/18S rRNA oligomer alignment to define the eubacteria (Bacteria), archaebacteria (Archaea) and eukaryotes (Eukaryota)

[Woese and Fox, PNAS, 1977]

A taxonomical classification based on ONE molecule !

2D electrophoretic separation to produce an oligonucleotide fingerprint

Occurrence of oligomers detected at a given position in 28 organisms!

11 of 117

Isolate

  • After Woese, classification mostly based on molecular comparison, starting with 16S

Hierarchical classification of nature initiated by Carl Linnaeus

12 of 117

  • Weakly affected by horizontal gene transfer*

  • 9 variable regions surrounded by conserved regions

  • Universal primers**, 25 PCR cycles***

  • Best-represented gene in GenBank

  • Sequencing kits: V1-V3, V3-V4, V3-V5, V5-V6...

*Daubin et al. 2003 (Science) **Weisburg et al. 1991 (J Bacteriol.) *** Illumina protocol

Yarza et al. 2014 (Nature reviews Microbiology)

Variable regions

16S rRNA

14 of 117

*Mysara 2017

16S rRNA

15 of 117

16S rRNA

Streptococcus thermophilus

Streptococcus salivarius

High similarity at the genus level

16 of 117

16S rRNA

RAST alignment of the two genomes

S. thermophilus

ATCC 19258

~2000 proteins aligned to

S. salivarius NCTC 8618

17 of 117

Taxonomical annotation

SILVA [Pruesse, et al. 2007]:

  • 597,607 sequences (04/2016)
  • Small (16S/18S, SSU) and large subunit (23S/28S, LSU)
  • Bacteria, Archaea and Eukarya
  • Non-redundant (Uclust 99% id)
  • Based on EMBL-Bank
  • Updated roughly once a year

Greengenes [DeSantis et al. 2006]:

  • 1,262,986 sequences (last update 05/2013)
  • Small subunit (16S, SSU)
  • Bacteria and Archaea
  • Non-redundant (Uclust 99% id)
  • Based on Genbank

Ribosomal Database Project [Maidak et al. 1994]:

  • 3,224,600 + 108,901 sequences (last update 05/2015)
  • Small subunit (16S/18S, SSU) and Fungal 28S
  • Bacteria, Archaea and Fungi

18 of 117

Kunin et al. 2010

  • 454 sequencing of V1-V2 & V8 E. coli MG1655
  • Theoretical number
    • 5 phylotypes for V1-V2 at 100% id
    • 1 phylotype for V8
  • Results
    • 0.1-0.2% error probability
    • 97% similarity threshold

Operational Taxonomic Unit (OTU)

“Group of DNA sequences that share a defined level of similarity”*

*Vetrovsky and Baldrian 2013 (Plos One)

[Figure panels: Q20, Q30]

19 of 117

  • CLOSED-REFERENCE CLUSTERING
    • Sequences that are similar to a reference sequence are clustered into the same OTU
    • Classification

  • DE NOVO CLUSTERING
    • The pairwise distances between sequences are used to cluster them into OTUs

  • OPEN-REFERENCE CLUSTERING
    • Closed-reference clustering, followed by de novo clustering of the sequences that are not similar to any reference

*Westcott and Schloss 2016 (PeerJ); Rideout et al. 2014; Schmidt et al. 2015

Targeted metagenomics strategies

20 of 117

Hierarchical clustering

Algorithm:

  • Start with n groups (one per object)
  • At each step:
    • Merge the two closest groups according to the linkage criterion

Linkage criteria (distance between two clusters):

  • Single linkage
    • the distance between two clusters is the shortest distance between a point of one cluster and a point of the other
  • Complete linkage
    • the distance between two clusters is the longest distance between a point of one cluster and a point of the other
  • Average linkage
    • the distance between two clusters is the average distance over all pairs of points, one taken from each cluster
  • ...
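The three linkage criteria and the merge loop can be sketched in a few lines of Python. This is a minimal illustration, not an optimised implementation: the function names, the toy points and the stopping threshold of 4.5 are assumptions made for the example.

```python
from itertools import product

def single_linkage(a, b, dist):
    """Shortest distance between a point of cluster a and a point of cluster b."""
    return min(dist(x, y) for x, y in product(a, b))

def complete_linkage(a, b, dist):
    """Longest distance between a point of cluster a and a point of cluster b."""
    return max(dist(x, y) for x, y in product(a, b))

def average_linkage(a, b, dist):
    """Average distance over all pairs, one point taken from each cluster."""
    pairs = [dist(x, y) for x, y in product(a, b)]
    return sum(pairs) / len(pairs)

def hierarchical_clustering(points, dist, linkage, threshold):
    """Merge the two closest clusters until no pair is within the threshold."""
    clusters = [[p] for p in points]  # initial n groups, one per point
    while len(clusters) > 1:
        # Closest pair of clusters under the chosen linkage criterion.
        d, i, j = min(
            (linkage(clusters[i], clusters[j], dist), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data (hypothetical): points on a line, absolute difference as distance.
points = [0, 1, 2, 6, 7, 20]

def absdiff(x, y):
    return abs(x - y)

single = hierarchical_clustering(points, absdiff, single_linkage, 4.5)
complete = hierarchical_clustering(points, absdiff, complete_linkage, 4.5)
print(single)    # single linkage chains 0-1-2 and 6-7 together
print(complete)  # complete linkage keeps them apart
```

On the same data and threshold, single linkage tends to produce long chained clusters, while complete linkage keeps every pair of members within the threshold.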

21 of 117

Let’s play with

Clustering methods

Clustering criteria: 800 km

22 of 117

Single linkage

Hierarchical-clustering

  1. Compute the distance between every pair of cities

23 of 117

Distance between US cities

               Atlanta  Chicago  Denver  Houston  Los Angeles  Miami  New York  San Francisco  Seattle
Atlanta
Chicago            587
Denver            1212      920
Houston            701      940     879
Los Angeles       1936     1745     831     1374
Miami              604     1188    1726      968         2339
New York           748      713    1631     1420         2451   1092
San Francisco     2139     1858     949     1645          347   2594      2571
Seattle           2182     1737    1021     1891          959   2734      2408            678
Washington DC      543      597    1494     1220         2300    923       205           2442     2329

Clustering criteria: 800 km
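As a sketch of what single linkage does with this table: at a fixed cutoff, two cities end up in the same cluster as soon as a chain of links, each no longer than the cutoff, connects them. The code below copies the distances from the table; the function names are illustrative, and treating the 800 criterion as "less than or equal to" is an assumption.

```python
CITIES = ["Atlanta", "Chicago", "Denver", "Houston", "Los Angeles",
          "Miami", "New York", "San Francisco", "Seattle", "Washington DC"]

# Lower triangle of the distance table: row i holds the distances
# from CITIES[i] to CITIES[0], ..., CITIES[i - 1].
LOWER = [
    [],
    [587],
    [1212, 920],
    [701, 940, 879],
    [1936, 1745, 831, 1374],
    [604, 1188, 1726, 968, 2339],
    [748, 713, 1631, 1420, 2451, 1092],
    [2139, 1858, 949, 1645, 347, 2594, 2571],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329],
]

def distance(i, j):
    """Distance between cities i and j (indices into CITIES)."""
    if i == j:
        return 0
    return LOWER[max(i, j)][min(i, j)]

def single_linkage_clusters(n, threshold):
    """Single-linkage clusters at a fixed cutoff, as sorted lists of names."""
    cluster_of = list(range(n))  # every city starts in its own cluster
    for i in range(n):
        for j in range(i):
            if distance(i, j) <= threshold and cluster_of[i] != cluster_of[j]:
                # Merge the cluster of city i into the cluster of city j.
                old, new = cluster_of[i], cluster_of[j]
                cluster_of = [new if c == old else c for c in cluster_of]
    groups = {}
    for index, label in enumerate(cluster_of):
        groups.setdefault(label, []).append(CITIES[index])
    return sorted(groups.values())

clusters = single_linkage_clusters(len(CITIES), 800)
for members in clusters:
    print(members)
```

Note that every one of the 45 pairwise distances has to be available before clustering starts, which is what makes the approach costly on millions of sequences.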

25 of 117

Single linkage

Hierarchical-clustering

27 of 117

Single linkage

Hierarchical-clustering

29 of 117

Single linkage

Hierarchical-clustering

30 of 117

Single linkage

Hierarchical-clustering

31 of 117

Single linkage

Hierarchical-clustering

Compute every distance

  • is really slow if you have millions of cities !!!
  • requires a lot of memory

We need a better algorithm to do it.

SLOW !

32 of 117

Hypothesis

If most bases are good, most unique sequences are bad,

because good reads are all alike, but every bad read is bad in its own way. (Lost origin)

33 of 117

Greedy clustering

Algorithm:

  • Start from n groups sorted in a chosen order
  • At each step:
    • Pick the next group and compare it to the existing references
    • If it is close to a reference:
      • Add it to the cluster of that reference
    • Otherwise:
      • Make it a new reference

  • Ordering:
    • Length-based Greedy Clustering (CD-HIT, Uclust)
    • Abundance-based Greedy Clustering (AGC): “Most-abundant centroid”
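The greedy loop can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are already aligned and of equal length, identity is a simple per-position match fraction, a sequence joins the first reference that is close enough, and the toy sequences, abundances and 0.85 threshold are invented for the example. Tools such as CD-HIT or vsearch use real alignments and their own heuristics.

```python
def identity(a, b):
    """Fraction of matching positions (assumes equal-length, aligned sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_clustering(abundance, threshold, order="abundance"):
    """Cluster sequences greedily.

    abundance: dict mapping each unique sequence to its number of copies.
    order: "abundance" (AGC) or "length" (length-based greedy clustering).
    Returns a dict mapping each reference (centroid) to its members.
    """
    if order == "abundance":
        ranked = sorted(abundance, key=abundance.get, reverse=True)
    else:
        ranked = sorted(abundance, key=len, reverse=True)
    clusters = {}
    for seq in ranked:
        for reference in clusters:
            if identity(seq, reference) >= threshold:
                clusters[reference].append(seq)  # close to a reference
                break
        else:
            clusters[seq] = [seq]  # no reference is close: new reference
    return clusters

# Hypothetical toy data: two abundant sequences and one rare variant of each.
toy = {
    "ACGTACGTAC": 50,
    "ACGTACGTAA": 3,
    "TTGGCCAATT": 20,
    "TTGGCCAATA": 1,
}
otus = greedy_clustering(toy, threshold=0.85)
for reference, members in otus.items():
    print(reference, members)
```

Each sequence is compared only to the current references, never to every other sequence, which is where the saving over hierarchical clustering comes from.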

34 of 117

Abundance-based Greedy Clustering methods

Main ideas:

  • Abundance: Number of identical sequences in the data

  • Highly abundant sequences are the ones most likely to be free of errors

35 of 117

Abundance-based Greedy Clustering methods

City             Population
New York            8550405
Los Angeles         3958125
Chicago             2722389
Houston             2099451
San Francisco        852469
Washington DC        646449
Seattle              634535
Denver               634265
Atlanta              443775
Miami                430332

New York - Los Angeles = 2451 km > 800 km

Clustering criteria: 800 km

36 of 117

Abundance-based Greedy Clustering methods

37 of 117

Abundance-based Greedy Clustering methods

New York - Los Angeles = 2451 km > 800 km

New York - Chicago = 713 km

Clustering criteria: 800 km
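The city analogy can be run end to end as a sketch. Cities play the role of sequences, population plays the role of abundance, and the distances and populations are the ones given in the slides. Two choices are assumptions of this example: a city joins the first centroid within 800 (centroids are visited in order of decreasing population), and the criterion is read as "less than or equal to".

```python
CITIES = ["Atlanta", "Chicago", "Denver", "Houston", "Los Angeles",
          "Miami", "New York", "San Francisco", "Seattle", "Washington DC"]

# Lower triangle of the distance table (row i: CITIES[i] to CITIES[0..i-1]).
LOWER = [
    [],
    [587],
    [1212, 920],
    [701, 940, 879],
    [1936, 1745, 831, 1374],
    [604, 1188, 1726, 968, 2339],
    [748, 713, 1631, 1420, 2451, 1092],
    [2139, 1858, 949, 1645, 347, 2594, 2571],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329],
]

POPULATION = {
    "New York": 8550405, "Los Angeles": 3958125, "Chicago": 2722389,
    "Houston": 2099451, "San Francisco": 852469, "Washington DC": 646449,
    "Seattle": 634535, "Denver": 634265, "Atlanta": 443775, "Miami": 430332,
}

def distance(a, b):
    """Distance between two cities given by name."""
    i, j = CITIES.index(a), CITIES.index(b)
    if i == j:
        return 0
    return LOWER[max(i, j)][min(i, j)]

def abundance_greedy_clustering(threshold):
    """Visit cities by decreasing population; join the first close centroid."""
    clusters = {}  # centroid -> members, in order of creation
    for city in sorted(POPULATION, key=POPULATION.get, reverse=True):
        for centroid in clusters:
            if distance(city, centroid) <= threshold:
                clusters[centroid].append(city)
                break
        else:
            clusters[city] = [city]  # too far from every centroid
    return clusters

clusters = abundance_greedy_clustering(800)
for centroid, members in clusters.items():
    print(centroid, "->", members)
```

Only the distance to the centroid is controlled: Miami and Atlanta are 604 apart, yet they land in different clusters because neither of them is a centroid for the other.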

38 of 117

Abundance-based Greedy Clustering methods

39 of 117

Abundance-based Greedy Clustering methods

40 of 117

Abundance-based Greedy Clustering methods

Outcome:

  • The result of greedy clustering depends on the sorting strategy (length, abundance...)

  • The distance to the reference is guaranteed…

  • ...but not the distance between sequences within the OTU

  • The worst-case cost of AGC is O(n²)…

Example:

n = 200 sequences

Cost = n² = 40,000 operations <<< n³ = 8e+06 operations in hierarchical clustering

41 of 117

Targeted metagenomics pipeline

FASTQ reads

OTU

Read Quality analysis

FastQC

AlienTrimmer, FASTX-Toolkit, Cutadapt, …

MegaBLAST, BLAT, Vsearch, …

Flash, Pear, Pandaseq

Merging

Cleaning/Trimming

Chimera filtering

Annotation

Clustering

Mapping

Count matrix

Paired-end

Single-end

Annotation table

Dereplication

Biom

vsearch, usearch, Chimera slayer, gold database...

vsearch, usearch, cd-hit, uclust...

vsearch, usearch, cd-hit, uclust...

Singleton removal

vsearch, cd-hit, uclust...

42 of 117

Targeted metagenomics pipeline

READ
  Sequence: GATTACA … AGGCTTTA
  Quality:  30, 31 … 20, 16, 15, 14

READ trimmed
  Sequence: GATTACA … AGGCT
  Quality:  30, 31 … 20

1) Read Trimming
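A minimal sketch of 3' quality trimming, assuming a simple rule: bases are removed from the end of the read while their Phred score is below a cutoff (Q20 here). The function name and the toy read are hypothetical; real trimmers such as AlienTrimmer or Cutadapt use more elaborate criteria.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]

# Hypothetical read: quality drops below Q20 on the last three bases.
read = "GATTACAAGGCTTTA"
quals = [30, 31, 32, 33, 30, 31, 30, 29, 28, 25, 22, 20, 16, 15, 14]

trimmed_read, trimmed_quals = trim_3prime(read, quals)
print(trimmed_read)       # the last three bases are gone
print(trimmed_quals[-1])  # the last remaining score is at least 20
```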

43 of 117

Targeted metagenomics pipeline

2) Read Merging

GATTACA … AGGCT

Sequence

AGGCT …
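Read merging can be sketched as an overlap search between the end of the forward read and the start of the reverse-complemented reverse read. This is an illustration only: the function names, the toy reads and the small default overlap are assumptions, and mergers such as Flash or Pear also use the base qualities to resolve mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(forward, reverse, min_overlap=5, max_diffs=0):
    """Merge a read pair on the longest acceptable overlap, or return None."""
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for overlap in range(longest, min_overlap - 1, -1):
        # Compare the end of the forward read with the start of rc.
        diffs = sum(a != b for a, b in zip(forward[-overlap:], rc[:overlap]))
        if diffs <= max_diffs:
            return forward + rc[overlap:]
    return None  # no overlap long enough: the pair cannot be merged

# Hypothetical pair overlapping on "AGGCT".
merged = merge_pair("GATTACAAGGCT", "TACGGAGCCT")
print(merged)
```

The `min_overlap` and `max_diffs` parameters mirror the kind of settings a merger exposes (a minimum overlap length and a maximum number of mismatches in the overlap).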

44 of 117

Targeted metagenomics pipeline

GATTACA … GCT

GATTACA … GCT

Full length

GATTACA … GCT

GATTACA …

Prefix

GATTACA … GCT

...

Substring

3) Dereplication and singleton removal
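Full-length dereplication followed by singleton removal can be sketched with a counter. The function name and the toy reads are hypothetical; prefix and substring dereplication, also mentioned above, are not covered by this sketch.

```python
from collections import Counter

def dereplicate(sequences, min_abundance=2):
    """Collapse identical sequences and drop those seen fewer than min_abundance times.

    Returns (sequence, abundance) pairs sorted by decreasing abundance.
    """
    counts = Counter(sequences)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_abundance]

# Hypothetical reads: one sequence seen three times, one twice, one singleton.
reads = ["GATTACAGCT"] * 3 + ["GATTACAGCA"] + ["TTGACCAGCT"] * 2
unique = dereplicate(reads)
print(unique)
```

With `min_abundance=2` only singletons are removed; a stricter cutoff simply raises that parameter.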

Biological sequence X

Biological sequence Y

Chimera formed from X and Y

4) Chimera filtering

46 of 117

Targeted metagenomics pipeline

4) Chimera filtering

Biological sequence X

Haas et al. 2011 :

Aborted extension

Biological sequence Y

Mis-priming

Biological sequence Y

Extension

Chimera X-Y

Amplification

How to detect and kill the chimera ?

47 of 117

Targeted metagenomics pipeline

Chimera Detection

UCHIME [Edgar et al. 2011]

Previously seen sequence

Sequence S

CHUNK

Split S into chunks

CHUNK

CHUNK

CHUNK

Align and find closest pairs ( A, B)

Sequence S

B

A

Align

Compute sequence identity: A-S-B
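The chunk-based idea can be illustrated with a heavily simplified sketch. It is not the UCHIME scoring scheme: it assumes that all sequences have the same length and are already aligned, it uses a plain match fraction as identity, and the decision rule (a two-parent model must beat the best single parent by `min_gain`) is an assumption made for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closest_reference(chunk, start, references):
    """Reference whose slice at the same coordinates best matches the chunk."""
    return max(references,
               key=lambda ref: identity(chunk, ref[start:start + len(chunk)]))

def looks_chimeric(query, references, n_chunks=4, min_gain=0.05):
    """Flag the query if a two-parent model explains it better than one parent."""
    size = len(query) // n_chunks
    # Candidate parents: the best reference for each chunk of the query.
    parents = {closest_reference(query[s:s + size], s, references)
               for s in range(0, size * n_chunks, size)}
    best_single = max(identity(query, p) for p in parents)
    best_model = best_single
    for a in parents:
        for b in parents:
            if a == b:
                continue
            for cut in range(1, len(query)):
                model = a[:cut] + b[cut:]  # left part from A, right part from B
                best_model = max(best_model, identity(query, model))
    return best_model - best_single >= min_gain

# Hypothetical parents X and Y, and a chimera made of half of each.
X = "ACGTACGTACGTACGT"
Y = "TTGGCCAATTGGCCAA"
chimera = X[:8] + Y[8:]
print(looks_chimeric(chimera, [X, Y]))
print(looks_chimeric(X, [X, Y]))
```

In a de novo setting the references would be the previously seen, more abundant sequences of the same dataset.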

48 of 117

Targeted metagenomics pipeline

OTUs

Model

Sequence

Sequence S

A. Model - Sequence <3%, Assign to OTU

B. Model - Sequence ≥3%, new OTU

OTUs

No match

Sequence

Sequence S

Add to database

5) Clustering

Sequence S

49 of 117

Targeted metagenomics pipeline

With the courtesy of Metagenopolis

6) Mapping

OTU

1000

50 of 117

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”

Taxonomy classification of organisms

MOST SPECIFIC

Strain

16S rRNA sequence identity thresholds:

  • Genus: >94.5%
  • Family: 86.5%
  • Order: 82%
  • Class: 78.5%
  • Phylum: 75%

MOST INCLUSIVE

Yarza et al. 2014

51 of 117

Phylogeny: a complete history of life ?

Cozannet 2021

52 of 117

Comparing microbial communities ?

X

Y

Statistical analysis

Test whether the abundance of a given taxon differs between X and Y

Metrics

53 of 117

Comparing microbial communities ?

  • Evenness: how equal the abundances of the different elements are
  • Richness: number of distinct elements

54 of 117

Diversity

  • N: total count of individuals
  • Rarefaction: a type of normalisation in which every sample is rarefied to the same number of individuals

N = 33

X

N = 16

Y

55 of 117

Diversity

  • N: total count of individuals
  • Rarefaction: a type of normalisation in which every sample is rarefied to the same number of individuals
  • S = number of species = richness = number of objects with a count > 0

X: S = 4, N = 16

Y: S = 3, N = 16

RAREFACTION = “DOWNSIZING”
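Rarefaction can be sketched as a random draw, without replacement, of the same number of individuals from every sample. The species names and counts below are hypothetical (they only reproduce a total of N = 33), and the fixed seed is there to make the draw repeatable.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Randomly keep `depth` individuals from a sample.

    counts: dict mapping each species to its number of individuals.
    """
    individuals = [species for species, n in counts.items() for _ in range(n)]
    if depth > len(individuals):
        raise ValueError("depth exceeds the number of individuals in the sample")
    rng = random.Random(seed)
    return dict(Counter(rng.sample(individuals, depth)))

# Hypothetical sample with N = 33 individuals, rarefied to N = 16.
sample_x = {"sp1": 12, "sp2": 9, "sp3": 7, "sp4": 5}
rarefied = rarefy(sample_x, 16)
print(rarefied, "N =", sum(rarefied.values()))
```

Because the draw is random, a rare species can disappear from the rarefied sample, which lowers the observed richness S.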

56 of 117

Alpha diversity

  • S = number of species = richness = number of objects with a count > 0
  • Alpha diversity:
    • Condition 1: α1 = mean(SA, SB) = 3.5
    • Condition 2: α2 = mean(SC, SD) = 5

CONDITION 1

A: S = 3, N = 10

B: S = 4, N = 10

CONDITION 2

C: S = 2, N = 10

D: S = 8, N = 10

57 of 117

Gamma diversity

  • S = number of species = richness = number of objects with a count > 0
  • Gamma diversity:
    • Condition 1: ɣ1 = S1 = 5
    • Condition 2: ɣ2 = S2 = 8

CONDITION 1

A: S = 3, N = 10

B: S = 4, N = 10

CONDITION 2

C: S = 2, N = 10

D: S = 8, N = 10

58 of 117

Beta diversity

  • S = number of species = richness = number of objects with a count > 0
  • Beta diversity:
    • Condition 1: β1 = ɣ1 / α1 - 1 = 5 / 3.5 - 1 = 0.43
    • Condition 2: β2 = ɣ2 / α2 - 1 = 8 / 5 - 1 = 0.6

CONDITION 1

A: S = 3, N = 10

B: S = 4, N = 10

CONDITION 2

C: S = 2, N = 10

D: S = 8, N = 10
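The three quantities can be computed together in a short sketch. The species sets are hypothetical: the slides only give the richness of each sample, so the sets below were chosen to reproduce S = 3, 4, 2 and 8 and the gamma values 5 and 8.

```python
def diversity(samples):
    """Return (alpha, gamma, beta) for one condition.

    samples: list of sets, one set of observed species per sample.
    alpha: mean richness per sample; gamma: richness of the pooled samples;
    beta: gamma / alpha - 1.
    """
    alpha = sum(len(s) for s in samples) / len(samples)
    gamma = len(set().union(*samples))
    beta = gamma / alpha - 1
    return alpha, gamma, beta

# Hypothetical species sets consistent with the richness values above.
condition_1 = [{"s1", "s2", "s3"}, {"s1", "s2", "s4", "s5"}]          # A, B
condition_2 = [{"s1", "s2"},
               {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"}]      # C, D

alpha1, gamma1, beta1 = diversity(condition_1)
alpha2, gamma2, beta2 = diversity(condition_2)
print(alpha1, gamma1, round(beta1, 2))
print(alpha2, gamma2, round(beta2, 2))
```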

59 of 117

Shannon

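  • Shannon index: H = -Σ p_i * log(p_i), where p_i is the proportion of individuals that belong to species i (natural logarithm)

For X: -(4/16*log(4/16)+5/16*log(5/16)+2/16*log(2/16)+5/16*log(5/16)) = 1.33

X: S = 4, N = 16, Shannon = 1.33

Y: S = 3, N = 16, Shannon = 1.08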
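The computation for X can be checked with a few lines of Python. The counts 4, 5, 2 and 5 are the ones used in the formula above; the function name is illustrative. The counts of Y are not given in the slides, so only X is reproduced.

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over species with a count > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

h_x = shannon([4, 5, 2, 5])
print(round(h_x, 2))
```

For a fixed richness S, the index is highest when all species are equally abundant, in which case it equals ln(S).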

60 of 117

Bray-Curtis dissimilarity

A: SA = 3, NA = 10

B: SB = 4, NB = 10

  • CAB: the sum, over all species, of the lesser of the two counts observed in sites A and B

Counts per species:

A   3   4   3   0   0
B   4   0   1   3   2

  • CAB = 3 + 0 + 1 + 0 + 0 = 4

  • BCAB = 1 - (2 * CAB) / (NA + NB) = 1 - (2 * 4) / 20 = 0.6
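The same calculation as a short Python sketch, using the counts of sites A and B given above (the function name is illustrative):

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of the same species."""
    shared = sum(min(x, y) for x, y in zip(a, b))  # C_AB
    return 1 - 2 * shared / (sum(a) + sum(b))

site_a = [3, 4, 3, 0, 0]
site_b = [4, 0, 1, 3, 2]
bc = bray_curtis(site_a, site_b)
print(round(bc, 2))
```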

61 of 117

Bray-Curtis dissimilarity

  • 0 = no dissimilarity = the two sites hold the same number of each species

  • 1 = complete dissimilarity = the two sites share no species

  • The Bray-Curtis dissimilarity assumes that the sites are of equal size. Otherwise, a larger size (N) at one site would be misleading, because the difference would reflect the sizes of the sites rather than their composition.

62 of 117

Unifrac distance

Workflow:

  1. OTU
  2. Multiple sequence alignment (Muscle, Mafft)
  3. Tree construction (Fasttree, raxml…)
  4. Tree rooting
  5. Unifrac

  • Lozupone and Knight 2005
      • Introduces sequence similarity into the comparison of communities through a phylogenetic distance

63 of 117

Principal Coordinates Analysis (PCoA)

  • PCoA attempts to represent inter-object (dis)similarity in a low-dimensional space.

  • Rather than using raw data, PCoA takes a (dis)similarity matrix as input.

  • It is conceptually similar to principal components analysis (PCA) and correspondence analysis (CA), which preserve Euclidean and χ2 (chi-squared) distances between objects, respectively; however, PCoA can preserve distances generated from any (dis)similarity measure, allowing more flexible handling of complex ecological data.

64 of 117

Rarefaction curves

65 of 117

Krona plot

68 of 117

16S – Challenges

*Vetrovsky and Baldrian 2013 (Plos One), Klappenbach et al. 2001 (Nuc Acid Res)

**Acinas et al. 2004 (J Bacteriol)

***Stoddard et al. 2015 (Nucl Acid Res)

  • Copy number varies between genomes (from 1 to 15)*
  • 16S sequence variants exist within the same species and even within the same genome** -> impact on diversity estimates

OK ! but in absolute count

NO !

Solution?

  • rrnDB: ribosomal RNA operon copy number database***
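A copy-number correction can be sketched as a simple division of the read counts by the number of 16S copies per genome. The taxon names, counts and copy numbers below are hypothetical; in practice the copy numbers would be looked up in a resource such as rrnDB, and taxa without an entry need a default value.

```python
def correct_for_copy_number(read_counts, copy_numbers, default=1):
    """Divide each read count by the 16S copy number of the taxon.

    Taxa missing from copy_numbers fall back to `default` copies.
    """
    return {taxon: count / copy_numbers.get(taxon, default)
            for taxon, count in read_counts.items()}

# Hypothetical example: taxon_A carries 7 copies of the 16S gene, taxon_B only 1.
read_counts = {"taxon_A": 700, "taxon_B": 100}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}

corrected = correct_for_copy_number(read_counts, copy_numbers)
print(corrected)
```

In this toy case a 7:1 ratio in reads turns into a 1:1 ratio in estimated genomes.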

69 of 117

BIOM format

Motivation:

  • Encapsulation of the whole project (count table, annotation, metadata...)
  • Efficient storage (version 2.0: HDF5 binary format)
  • Compatibility between software

Annotation

Metadata

Count

70 of 117

shaman.pasteur.fr

Better statistics

Simplicity

Reproducible research

71 of 117

Raw data submission

71 Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis 25/06/2021

Unique key associated to your email

Parameters are saved with your key

We do not store your reads after calculations unless:

  • there is a crash, in which case you can come and discuss it with us: shaman@pasteur.fr

-> crashed datasets are deleted regularly

Except ITS databases: Unite, Findley, Underhill

72 of 117

Raw data submission

Global report

Detailed process

Download available

A mail is also automatically sent

73 of 117

Processed data submission

Tables

BIOM / Epi2me (nanopore)

Key

Example datasets

74 of 117

Processed data submission

% of OTU annotated

Download available

Unifrac enabled

75 of 117

Processed data submission

76 of 117

DESeq2 Metagenomics

Metagenomics

Distribution

Overdispersed counts → Negative binomial

Constraints

Highly abundant species

Goal

Find differentially abundant features (species, family, …):

Feature distributions and abundances vary between conditions

16S : McMurdie, Holmes, Plos Comput Biol,2014

WGS : Jonsson, BMC Genomics, 2016

77 of 117

Check our publication in BMC Bioinfo !

78 of 117

Number of observations

Sparsity ratio

Modified normalizations

79 of 117

Data filtering

80 of 117

Statistical modelling

Automatically determined from the statistical model

81 of 117

31 interactive visualizations

82 of 117

Diagnostic plots

How good is my normalisation? My modelling? My effect size?

83 of 117

Significant features

84 of 117

Global visualisations

85 of 117

Global visualisations

abundance tree

86 of 117

Global visualisations

An. stephensi Prevalence GS

Cedecea R2 =15%

Serratia R2 =26%

87 of 117

Contrast comparison

88 of 117

shaman.pasteur.fr

« There is no disputing the importance of statistical analysis in biological research, but too often it is considered only after an experiment is completed, when it may be too late. »

89 of 117

Raw data submission

(targeted metagenomics)

Targeted metagenomics at Pasteur

90 of 117

shaman.pasteur.fr

Annotation

Count matrix

91 of 117

shaman.pasteur.fr

target

92 of 117

M2IPFB

FASTQ reads

OTU

Q20

Id 0.75 on both strand,

take best hit only,

Database is in database folder

Merging

Cleaning/Trimming

Chimera filtering

Annotation

Clustering

Mapping

Count matrix

Paired-end

Annotation table

Dereplication

De novo chimera filtering

Clustering based on abundance at Id 0.97 both strand

Full-length dereplication on both strands

Singleton removal

Min abundance > 10

min overlap 40,

max diffs 15

Group all sequences in one file

The origin of each amplicon is specified!

Global alignment at Id 0.97 both strand

SHAMAN

93 of 117

M2BI

FASTQ reads

OTU

Q20

Id 0.75 on both strand,

take best hit only,

Database is in database folder

Merging

Cleaning/Trimming

Chimera filtering

Annotation

Clustering

Mapping

Count matrix

Paired-end

Annotation table

Dereplication

De novo chimera filtering

Clustering based on abundance at Id 0.97 both strand

Full-length dereplication on both strands

Singleton removal

Min abundance > 10

min overlap 40,

max diffs 15

Group all sequences in one file

The origin of each amplicon is specified!

Global alignment at Id 0.97 both strand

94 of 117

For the practical sessions

you will need to use a VPN !

Option 1: Install a remote access software

How (if not installed) ?

Go to the following address : http://connect.pasteur.fr/

and then log in with your usual Pasteur ID and password

Scroll down, until you find :

Choose the suited remote access client (depending on your OS)

and install it !

Once installed, you should be able to open the software and log-in using your usual Pasteur IDs

95 of 117

For the practical sessions

you will need to use a VPN !

Option 2: directly use the connect.pasteur.fr interface

Go to the following address : http://connect.pasteur.fr/

and then log in with your usual Pasteur ID and password

Enter the VM address :

desktop.pasteur.fr

96 of 117

Practical session - Command line on virtual machine

Ubuntu virtual machine accessible with web browser or using VMware

https://desktop.pasteur.fr

98 of 117

Practical session

  • Right-click on the desktop -> click on “open in terminal”

99 of 117

M2BI

  • You must develop a program that clusters 16S sequences into OTUs

100 of 117

WARNINGS

  • Clustering is performed on all the sequences in one step
    • You must identify which sample each amplicon comes from

FASTQ header:

@M01626:316:000000000-AY73Y:1:1101:15636:1045 1:N:0:38

Amplicon FASTA header:

>M01626:316:000000000-AY73Y:1:1101:11704:11521:N:0:38;sample=1ng-25cycles-1;

  • Count table
  • Annotation table
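Recovering the sample of each amplicon from its FASTA header, and turning OTU assignments into a count table, can be sketched as follows. The `;sample=...;` tag is the one shown in the header above; the function names, the two shortened headers and the OTU identifiers are hypothetical.

```python
import re
from collections import defaultdict

SAMPLE_TAG = re.compile(r";sample=([^;]+);")

def sample_of(header):
    """Extract the sample name from a header carrying a ';sample=...;' tag."""
    match = SAMPLE_TAG.search(header)
    return match.group(1) if match else None

def count_table(assignments):
    """Build {otu: {sample: count}} from (otu_id, fasta_header) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for otu, header in assignments:
        table[otu][sample_of(header)] += 1
    return {otu: dict(row) for otu, row in table.items()}

header = ">M01626:316:000000000-AY73Y:1:1101:11704:11521:N:0:38;sample=1ng-25cycles-1;"
print(sample_of(header))

# Hypothetical assignments of three amplicons to two OTUs.
assignments = [
    ("OTU_1", header),
    ("OTU_1", ">read2;sample=1ng-25cycles-1;"),
    ("OTU_2", ">read3;sample=hypothetical-sample-2;"),
]
table = count_table(assignments)
print(table)
```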

101 of 117

A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota

  • Javier Pizarro-Cerda – Juan Jose Quereda Torres

101 Amine Ghozlane • SHAMAN : Shiny Application for Metagenomic ANalysis 28/06/2016

103 of 117

A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota

WT

Delta

Delta-compl

-48h

-24h

6h

24h

T0

T2

T3

16S : V1-V3

  • Impact on major communities (Which ? When ? How much ?)
  • Quereda, Dussurget, Nahori, Ghozlane, Volant, Dillies, et al.,

PNAS 2016

104 of 117

Data quality check

Evaluation of the read quality by plotting the Phred scores

and the distribution of mean read quality

Sequencing quality tends to decrease at the end of the reads

-> Trimming step
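As a reminder of what a Phred score means, the sketch below decodes a FASTQ quality string and converts scores to error probabilities with P = 10^(-Q/10). It assumes the Phred+33 encoding used by current Illumina FASTQ files; the function names and the toy quality string are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]

def error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)

def mean_quality(quality_string):
    """Mean Phred score of a read."""
    scores = decode_qualities(quality_string)
    return sum(scores) / len(scores)

scores = decode_qualities("I?5")  # toy quality string
print(scores)
print([error_probability(q) for q in (20, 30)])  # about 1 in 100 and 1 in 1000
```

Q20 therefore corresponds to roughly one error per 100 bases and Q30 to roughly one error per 1000 bases.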

106 of 117

Data quality check

These patterns are expected because of the ‘not so random’ primers, which bind to regions of the genome with a biased GC content.

107 of 117

Data quality check

This profile means that there is an excess of intermediate-length reads

-> The aligner can easily deal with it

This profile means that some reads are duplicated

108 of 117

Data quality check

What about metagenomics real life data now ?

109 of 117

Data quality check

110 of 117

Data quality check

111 of 117

Data quality check

112 of 117

Data quality check

Do you think it is worth trimming

the bad quality reads ?

You can run your own fastqc analysis in the terminal

113 of 117

Targeted metagenomics pipeline

Where to find the data ?

ctp_1.tar.gz

c3bi.pasteur.fr/m2p7

and

Click on cours1

114 of 117

Metagenomics

application to

Image : http://phdcomics.com/comics/archive.php?comicid=1874

115 of 117

Mapping

*Edgar et al. 2013 (Nature methods)

Isolated singleton

“Bad” singleton

Recaptured singletons

116 of 117

Kunin et al. 2010

  • 454 sequencing of V1-V2 & V8 E. coli MG1655
  • Theoretical number
    • 5 phylotypes for V1-V2 at 100% id
    • 1 phylotype for V8
  • Results
    • 0.1-0.2% error probability
    • 97% similarity threshold

“Group of DNA sequences that share a defined level of similarity”*

*Vetrovsky and Baldrian 2013 (Plos One)

Operational Taxonomic Unit (OTU)

Loosely based on Stackebrandt and Ebers 2006

117 of 117

“Group of DNA sequences that share a defined level of similarity”*

*Vetrovsky and Baldrian 2013 (Plos One)

Operational Taxonomic Unit (OTU)

Johnson 2019