1 of 54

Supplement: prokaryote genome assembly

Valentin Loux

Guillaume Gautreau

Co-authors: Olivier Rué, Cédric Midoux (https://documents.migale.inrae.fr/posts/training-materials/2024-03-18-module24/slides/#/title-slide)

March 18, 2024

2 of 54

Isolate genome assembly

  • Sequencing strategy ?
    • long read only
    • Mix ?
  • Assembly strategy ?
    • short + long
    • long + short

3 of 54

Short read or low depth hybrid

Short read only OR hybrid with low depth (< 100x ) long reads

  • Illumina only & hybrid :
    • short-read first assembly with SPAdes
    • use long read to scaffold
    • filter low depth reads
    • handle plasmids
    • circularize & choose “start”
  • Long read only
    • Long read only with Miniasm (assembly)
    • Polishing with Racon

4 of 54

Trycycler : Long read only or hybrid

High depth hybrid (>100x) with or without short reads

SPAdes

Unicycler

Flye

Raven

Miniasm

5 of 54

Trycycler : detailed view (1)

6 of 54

Trycycler : detailed view (2)

7 of 54

Hybracter

  • Automated snakemake workflow
  • “As easy as Unicycler”
  • Assembly + Plasmid

George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) Microbial Genomics doi: https://doi.org/10.1099/mgen.0.001244.

8 of 54

Introduction to shotgun metagenomics

Guillaume Gautreau, Valentin Loux

Co-authors: Olivier Rué, Cédric Midoux (https://documents.migale.inrae.fr/posts/training-materials/2024-03-18-module24/slides/#/title-slide)

March 18, 2024

9 of 54

History

10 of 54

Isolate genomics versus metagenomics

11 of 54

Introduction

12 of 54

Introduction

13 of 54

A review of methods and databases for metagenomic classification and assembly (2019)

14 of 54

Challenges

  • Complexity of the ecosystem
  • Completeness of databases
  • Sequencing depth
  • Computational and storage resources required

15 of 54

Coverage requirement (1/5)

  • To detect a species based on marker genes ?
  • To cover most of the genome to determine what part of the pangenome is covered by a sample?
  • To perform an assembly from a metagenome ?

[Figure: 22 short sequencing reads of 75 b aligned on a 500 b reference (which could be a genome, a gene or a contig), illustrating depth coverage and breadth coverage.]

16 of 54

Coverage requirement (2/5)

  • To detect a species based on marker genes ?
  • To cover most of the genome to determine what part of the pangenome is covered by a sample ?
  • To perform an assembly from a metagenome ?

Depth coverage = (75 b × 22 reads) / 500 b = 3.3X (can be calculated before alignment)

[Figure: 22 short sequencing reads of 75 b aligned on a 500 b reference (which could be a genome, a gene or a contig), illustrating depth coverage and breadth coverage.]

17 of 54

Coverage requirement (3/5)

  • To detect a species based on marker genes ?
  • To cover most of the genome to determine what part of the pangenome is covered by a sample ?
  • To perform an assembly from a metagenome ?

Depth coverage ≈ 3.3X

Breadth coverage ≈ ?

[Figure: 22 short sequencing reads of 75 b aligned on a 500 b reference (which could be a genome, a gene or a contig).]

18 of 54

Coverage requirement (4/5)

  • To detect a species based on marker genes ?
  • To cover most of the genome to determine what part of the pangenome is covered by a sample ?
  • To perform an assembly from a metagenome ?

Lander and Waterman (1988)

Depth coverage ≈ 3.3X

Breadth coverage ≈ 95%

[Figure: 22 short sequencing reads of 75 b aligned on a 500 b reference (which could be a genome, a gene or a contig), next to a plot relating depth coverage to breadth coverage, with 0.95 marked at 3.3X.]
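The two numbers on this slide can be reproduced with a short sketch. It assumes the idealized Lander and Waterman model, in which reads fall uniformly at random on the reference, so that the expected fraction of bases covered at depth c is 1 - exp(-c). The function names are illustrative only.

```python
import math

def depth_coverage(read_length, n_reads, reference_length):
    """Average number of reads covering each base of the reference."""
    return read_length * n_reads / reference_length

def expected_breadth(depth):
    """Expected fraction of the reference covered at least once,
    assuming reads are placed uniformly at random (Lander-Waterman)."""
    return 1.0 - math.exp(-depth)

# Values from the slide: 22 reads of 75 b on a 500 b reference.
depth = depth_coverage(75, 22, 500)
breadth = expected_breadth(depth)
print(f"depth = {depth:.1f}X, expected breadth = {breadth:.1%}")
```

Under this model 3.3X gives an expected breadth of about 96%, in line with the ≈95% read from the curve. Real data are less uniform, so observed breadth is usually lower.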

19 of 54

Coverage requirement (5/5)

  • To detect a species based on marker genes ? <0.1-3X
  • To cover most of the genome to determine what part of the pangenome is covered by a sample ? 3X
  • To perform an assembly from a metagenome ? 5-10X

Lander and Waterman (1988)

Depth coverage ≈ 3.3X

Breadth coverage ≈ 95%

[Figure: 22 short sequencing reads of 75 b aligned on a 500 b reference (which could be a genome, a gene or a contig), next to a plot relating depth coverage to breadth coverage, with 0.95 marked at 3.3X.]

20 of 54

Challenges

  • Complexity of the ecosystem
  • Completeness of databases
  • Sequencing depth
  • Computational resources required

21 of 54

Challenges

  • Complexity of the ecosystem
  • Completeness of databases
  • Sequencing depth
  • Computational resources required

22 of 54

Taxonomic classification and quantification

Taxonomic classification caveats:

  • Databanks
  • K-mer choice (sensitivity / specificity)
  • Allow a “fast” overview of your data
    • Contaminants?
    • Host reads?
    • unknown rate

2 kinds of approaches:

  • kmer-based
  • gene markers based

| Approach | Tool | Galaxy | Comments |
|---|---|---|---|
| kmer-based | Kraken2 | | The reference, fast and efficient |
| | Bracken | | Bayesian Reestimation of Abundance with KrakEN |
| | Centrifuge | | Indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem |
| | Kaiju | | Protein level |
| | Sylph | X | K-mer sketching. Works locally. Both fast and accurate |
| gene markers based | MetaPhlAn4 | X (version 2: ✔) | MetaPhlAn relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic) |
| | Meteor2 | X | Based on environment-specific gene catalogs (especially human gut) |

23 of 54

Kraken2

  • A very popular taxonomic affiliation tool.
  • Very fast

Method:

  1. Chop genomes into k-mers and link to a taxonomic id.
  2. Chop reads into k-mers and search for exact hits in database
  3. Search for highest-weighted root-to-leaf paths and assign the taxonomic id of the lowest node to read
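The three steps above can be illustrated with a toy classifier. This is only a sketch of the idea: the taxonomy, genomes and k value are made up, and the real Kraken2 uses minimizers and a compact hash table rather than a plain dictionary of k-mers.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Bacillus": "Bacteria",
}

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lineage(taxon):
    """Taxa from the given node up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)

def build_database(genomes, k):
    """Step 1: link each k-mer to the LCA of all genomes containing it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db

def classify(read, db, k):
    """Steps 2 and 3: exact k-mer hits, then best root-to-leaf path."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None  # unclassified read
    # Weight of a root-to-node path = k-mer hits summed along the lineage.
    scores = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, s in scores.items() if s == best]
    result = tied[0]
    for t in tied[1:]:  # several best paths: fall back to their LCA
        result = lca(result, t)
    return result

genomes = {"E. coli": "ACGTACGGA", "E. albertii": "ACGTTTGCA", "Bacillus": "GGGCCCAAT"}
db = build_database(genomes, k=4)
print(classify("ACGTAC", db, k=4))  # contains k-mers specific to E. coli
print(classify("ACGT", db, k=4))    # k-mer shared by both Escherichia genomes
```

A read made only of shared k-mers is assigned to the common ancestor, which is why part of the reads end up above species level.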

24 of 54

Bracken

  • Kraken classifies reads using the LCA approach
    • Some reads are shared between taxa and are assigned above species level
  • Bracken redistributes abundances from Kraken results using a Bayesian statistical method
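The redistribution idea can be sketched as follows. This is a deliberate simplification, not Bracken's actual model: Bracken derives its probabilities from the k-mer content of the database, whereas this sketch simply splits the genus-level reads in proportion to the reads already assigned to each species. Names and counts are hypothetical.

```python
def redistribute(species_reads, genus_reads):
    """Push reads stuck at genus level down to the species,
    in proportion to the reads already assigned to each species."""
    total = sum(species_reads.values())
    if total == 0:
        # No species-level evidence: leave the counts unchanged.
        return dict(species_reads)
    return {
        species: count + genus_reads * count / total
        for species, count in species_reads.items()
    }

# Hypothetical Kraken-like counts: 50 reads could only be assigned to the genus.
species_reads = {"E. coli": 80, "E. albertii": 20}
estimates = redistribute(species_reads, genus_reads=50)
print(estimates)
```

After redistribution every read counts toward a species, which makes abundances comparable between samples.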

25 of 54

Centrifuge

  • Similar to Kraken, with a few differences:
    • Memory efficient (within species compression)
    • Allow multiple assignments per read
    • K-mer extension : a bit more accurate
    • Not as fast as Kraken

26 of 54

Kaiju

  • An equivalent of Kraken, with some particularities:
    • Database of protein sequences
    • Supposed to be more sensitive
    • Translates reads in all six reading frames and splits them at stop codons
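The translation step can be sketched as below. Only the six-frame translation and the split at stop codons are shown, using the standard genetic code. The search of the resulting fragments against the protein database is not covered, and the `min_len` parameter is a hypothetical filter added for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_fragments(read, min_len=1):
    """Protein fragments from the six reading frames, split at stop codons."""
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return fragments

print(six_frame_fragments("ATGGCCTAAGGG"))
```

Searching at the protein level tolerates synonymous nucleotide changes, which is the reason given for the higher sensitivity.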

Kaiju databanks

27 of 54

Sylph

28 of 54

MetaPhlAn4

  • Relies on :
    • 5.1M unique clade-specific marker genes identified
    • from ~1M microbial genomes
      • ~236,600 references
      • 771,500 metagenome-assembled genomes
    • spanning 26,970 species-level genome bins
    • 4,992 of them taxonomically unidentified at the species level
  • associated to HUMAnN 3.0 for functional profiling (high coverage)
  • StrainPhlAn for strain-level analyses (high coverage)
  • PanPhlAn for pangenome-level analyses (high coverage)

29 of 54

Meteor2

  • Developed by MetaGenoPolis (INRAE)
  • https://github.com/metagenopolis/meteor
  • Relies on available gene catalogs :
      • human gut: 10.4M genes clustered into 1,990 species pangenomes
      • human oral: 8.4M genes clustered into 853 species pangenomes
      • cat gut: 1.3M genes clustered into 344 species pangenomes
      • human skin: 2.9M genes clustered into 392 species pangenomes
      • brown rat gut: 5.9M genes clustered into 1,627 species pangenomes
      • chicken gut: 13.6M genes clustered into 2,420 species pangenomes
      • pig gut: 9.3M genes clustered into 1,523 species pangenomes
  • Highly accurate quantification (unpublished)
  • Able to remove host contaminations

30 of 54

Statistical analyses

31 of 54

⚠️ Contamination issues (1/2)

Host contaminations

  • dilution effect (costly)
  • ethical consideration (human)

External contaminants QC:

  • negative controls
  • mapping on suspected contaminant
  • taxonomic affiliation

| Tool | Galaxy | Comments |
|---|---|---|
| KneadData | X | Removes rRNA and host (human and mouse) reads |
| SortMeRNA | ✔️ | Removes rRNA reads, slow… |

32 of 54

⚠️ Contamination issues (2/2)

  • well-to-well contaminations :
    • Overestimation of diversity
    • Can mute the main signal

    • CroCoDeEL
      • Find contamination pattern in gut microbiome study
      • Goulet et al. (JOBIM 2024, in preparation)
      • https://github.com/metagenopolis/CroCoDeEL
    • SCRuB :
      • Works across multiple ecosystems
      • Can decontaminate samples
      • Need blank controls

33 of 54

Metagenomics assembly

34 of 54

Metagenomics assembly

Objectives

  • Reconstruct genes and organisms from complex mixtures

  • Dealing with the ecosystem’s heterogeneity, multiple genomes at varying levels of abundance

  • Limiting the reconstruction of chimeras

35 of 54

General assembly strategies

36 of 54

Metagenome assembly specificity

  • Coverage :
    • Widely different abundance levels of the various species in a microbial sample result in highly non-uniform read coverage across genomes
    • Coverage of most species in a typical metagenomic data set is much lower.
  • Interspecies repeats: various species within a microbial community often share highly conserved genomic regions in
  • Mixture : many bacterial species in a microbial sample are represented by strain mixtures, that is, multiple related strains with varying abundances

37 of 54

Individual assembly or co-assembly ?

Useful to reduce differences in coverage between samples

| Pros of co-assembly | Cons of co-assembly |
|---|---|
| More data | Higher computational overhead |
| Better/longer assemblies | Risk of shattering the assembly |
| Access to lower abundant organisms | Risk of increased contamination |

38 of 54

Co-assembly

Co-assembly is reasonable if:

  • Same samples
  • Same sampling event
  • Longitudinal sampling of the same site
  • Related samples

If this is not the case, individual assembly should be preferred, followed by an extra de-replication step.

39 of 54

Software

Metagenomic assembly software :

  • Generic tool with a meta option :
    • SPAdes and metaSPAdes [Bankevich et al. 2012]
  • Tools requiring less memory :
    • MEGAHIT [Li et al. 2015]
  • Long read / Hybrid assemblies use different algorithms and strategies and are still a research question
    • metaFlye, SPAdes …

40 of 54

Benchmark

[Zhang et al. 2023]

41 of 54

Some results

Our results showed that the short-read assemblers generated the lowest contig contiguity and [Near Complete MAGs]. MEGAHIT outperformed IDBA-UD and metaSPAdes on the deeply sequenced datasets (>100X), and metaSPAdes obtained better results than MEGAHIT and IDBA-UD on low-complexity datasets (depth < 100X).

Hybrid assemblies demonstrated higher (or at least similar) [Genome fraction] and [total assembly length] than short- and long-read assemblies, and generated higher [High Quality] and [Near Complete] than long-read assemblies

Short-read assemblers were unable to assemble any genomes of low-abundance microbes

42 of 54

Assessment of assembly quality

MetaQUAST [Mikheenko et al. 2015] to evaluate and compare metagenome assemblies

MetaQUAST :

  • De novo metagenomic assembly evaluation
  • [Optionally] identify reference genomes from the content of the assembly
  • Reference-based evaluation
  • Filtering so-called misassemblies based on read mapping
  • Report and visualization

43 of 54

De novo metrics

Evaluation of the assembly based on:

  • Number of contigs greater than a given threshold (0, 1kb, …)
  • Total / thresholded assembly size
  • Largest contig size
  • N50: the length of the shortest contig among the longest contigs that together cover 50% of the total assembly length, i.e. a length-weighted median of contig lengths. (N75: same, for 75%)
  • L50: the smallest number of contigs whose lengths sum to 50% of the total assembly length. (L75: same, for 75%)
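N50 and L50 can be computed from the list of contig lengths alone. A minimal sketch, with made-up contig lengths and an illustrative function name:

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx): walk the contigs from longest to shortest until
    the running total reaches the given fraction of the assembly length."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("empty list of contig lengths")

contigs = [100, 80, 60, 40, 20]       # total assembly length: 300
n50, l50 = nx_lx(contigs)             # 100 + 80 = 180 >= 150
n75, l75 = nx_lx(contigs, 0.75)       # 100 + 80 + 60 = 240 >= 225
print(n50, l50, n75, l75)
```

Here N50 is 80 with L50 = 2, and N75 is 60 with L75 = 3. These metrics describe contiguity only: a misassembled chimeric contig raises N50 just as much as a correct one.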

44 of 54

Reference-based metrics

  • Metrics based on the comparison with reference genomes.
  • Reference genomes are either provided by the user or selected automatically by MetaQUAST, by comparing the rRNA gene content of the assembly against a reference database (SILVA).
  • Complete genomes are then automatically downloaded.

45 of 54

Binning

Binning :

  • grouping similar contigs together into metagenome-assembled genomes (MAGs)
  • In other words :
    • A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics

Binning is a good compromise when the assembly of whole genomes is not feasible.

Concoct, SemiBin

46 of 54

Approach

MetaBAT [Kang et al. 2019] is a tool for reconstructing genomes from complex microbial communities.

47 of 54

Bins evaluation

For the evaluation of bins, we will use completeness and contamination estimated by CheckM [Parks et al. 2015]

  • Use of collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.
    • completeness: estimated completeness of genome as determined from the presence/absence of marker genes and the expected colocalization of these genes
    • contamination: estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected colocalization of these genes
    • strain heterogeneity: estimated strain heterogeneity as determined from the number of multi-copy marker pairs which exceed a specified amino acid identity threshold (default = 90%). High strain heterogeneity suggests the majority of reported contamination is from one or more closely related organisms (i.e. potentially the same species), while low strain heterogeneity suggests the majority of contamination is from more phylogenetically diverse sources

Threshold depends on the type of assembly.

For metagenomes, usual thresholds are: completeness > 90%, contamination < 5%, strain heterogeneity ≤ 0.5

Pasolli et al. 2019, Bowers et al. 2017
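How marker genes translate into completeness and contamination can be sketched as follows. This is a simplified illustration, not CheckM itself: it averages over collocated marker sets, counts a marker as present if it is found at least once, and counts every extra copy as contamination. Marker names, copy counts and function names are hypothetical. The default thresholds are the ones given above.

```python
def completeness_contamination(marker_sets, copies):
    """Percent completeness and contamination, averaged over marker sets.

    marker_sets: list of lists of marker gene names (collocated sets)
    copies: dict mapping marker gene name -> copies found in the bin
    """
    completeness = contamination = 0.0
    for markers in marker_sets:
        present = sum(1 for g in markers if copies.get(g, 0) >= 1)
        extra = sum(max(copies.get(g, 0) - 1, 0) for g in markers)
        completeness += present / len(markers)
        contamination += extra / len(markers)
    n_sets = len(marker_sets)
    return 100 * completeness / n_sets, 100 * contamination / n_sets

def passes(completeness, contamination, min_completeness=90.0, max_contamination=5.0):
    """Apply the usual metagenome thresholds to a bin."""
    return completeness > min_completeness and contamination < max_contamination

# Hypothetical bin: marker "f" is missing and marker "d" is found twice.
marker_sets = [["a", "b"], ["c", "d", "e", "f"]]
copies = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1}
comp, cont = completeness_contamination(marker_sets, copies)
print(comp, cont, passes(comp, cont))
```

In this example the bin is 87.5% complete with 12.5% contamination, so it fails both thresholds.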

48 of 54

Anvi’o

49 of 54

What’s next ?

Galaxy training on

  • Assembly of metagenomics data
    • assembly, QC, QC with reference
  • Binning of metagenomics data
    • Binning with MetaBAT 2

50 of 54

References

  • Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.
  • Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–6.
  • Zhenmiao Zhang, Chao Yang, Werner Pieter Veldsman, Xiaodong Fang, Lu Zhang, Benchmarking genome assembly methods on metagenomic sequencing data, Briefings in Bioinformatics, Volume 24, Issue 2, March 2023, bbad087, https://doi.org/10.1093/bib/bbad087
  • Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2015;32:1088–90. doi:10.1093/bioinformatics/btv697.
  • Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.
  • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.

51 of 54

Advances and challenges in metatranscriptomic analysis, 2019 [40]

Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics, 2023 [41]

52 of 54

Take home message

  • Shotgun metagenomics is still an active field of bioinformatics research
  • Numerous tools dedicated to assembly, binning and functional annotation are under active development
  • Depending on the ecosystem, different approaches are possible:
    • mapping on a reference database
    • assembly and mapping
  • The biological question must determine the analysis

53 of 54

Need help?

54 of 54

References

1. Escobar-Zepeda A, Vera-Ponce de León A, Sanchez-Flores A. The road to metagenomics: From microbiology to DNA sequencing technologies and bioinformatics. Frontiers in genetics. 2015;6:348.

2. Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Briefings in bioinformatics. 2019;20:1125–36.

3. Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, et al. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Computational and Structural Biotechnology Journal. 2021;19:6301–14. doi:https://doi.org/10.1016/j.csbj.2021.11.028.

4. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics. 2021;3. doi:10.1093/nargab/lqab019.

5. Wood DE, Salzberg SL. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome biology. 2014;15:1–12.

6. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with kaiju. Nature communications. 2016;7:11257.

7. Jurasz H, Pawlowski T, Perlejewski K. Contamination issue in viral metagenomics: Problems, solutions, and clinical perspectives. Frontiers in Microbiology. 2021;12:745076.

8. Minich JJ, Sanders JG, Amir A, Humphrey G, Gilbert JA, Knight R. Quantifying and understanding well-to-well contamination in microbiome research. mSystems. 2019;4:10.1128/msystems.00186–19. doi:10.1128/msystems.00186-19.

9. Lou YC, Hoff J, Olm MR, West-Roberts J, Diamond S, Firek BA, et al. Using strain-resolved analysis to identify contamination in metagenomics data. Microbiome. 2023;11:36.

10. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.

11. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate illumina paired-end reAd mergeR. Bioinformatics. 2013;30:614–20.

12. Joshi N, Fass J. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. 2011.

13. Lab H. KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. 2022. https://github.com/biobakery/kneaddata.

14. Kopylova E, Noé L, Touzet H. SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28:3211–7.

15. Rumbavicius I, Rounge TB, Rognes T. HoCoRT: Host contamination removal tool. BMC bioinformatics. 2023;24:371.

16. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.

17. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–6.

18. Vollmers J, Wiegand S, Kaster A-K. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective-not only size matters! PloS one. 2017;12:e0169662.

19. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2015;32:1088–90. doi:10.1093/bioinformatics/btv697.

20. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. 2013.

21. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.

22. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

23. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3:e1165.

24. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.

25. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9. doi:10.1093/bioinformatics/btu153.

26. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics. 2010;11:1–11.

27. Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Mering C von, et al. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Molecular Biology and Evolution. 2017;34:2115–22. doi:10.1093/molbev/msx148.

28. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2014;12:59–60. doi:10.1038/nmeth.3176.

29. Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. Journal of molecular biology. 2016;428:726–31.

30. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2. doi:10.1093/bioinformatics/bts565.

31. Liu Y-X, Qin Y, Chen T, Lu M, Qian X, Guo X, et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein & Cell. 2020;12:315–30. doi:10.1007/s13238-020-00724-8.

32. Benoit G, Mariadassou M, Robin S, Schbath S, Peterlongo P, Lemaitre C. SimkaMin: Fast and resource frugal de novo comparative metagenomics. Bioinformatics. 2019. doi:10.1093/bioinformatics/btz685.

33. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nature Communications. 2018;9. doi:10.1038/s41467-018-04964-5.

34. Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods. 2019;16:603–6. doi:10.1038/s41592-019-0437-4.

35. Boyd JA, Woodcroft BJ, Tyson GW. GraftM: a tool for scalable, phylogenetically informed classification of genes within metagenomes. Nucleic Acids Research. 2018;46:e59–9. doi:10.1093/nar/gky174.

36. Eren AM, Esen OC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: An advanced analysis and visualization platform for ‘omics data. PeerJ. 2015;3:e1319. doi:10.7717/peerj.1319.

37. Kieser S, Brown J, Zdobnov EM, Trajkovski M, McCue LA. ATLAS: A snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics. 2020;21. doi:10.1186/s12859-020-03585-4.

38. Kim C, Pongpanich M, Porntaveetus T. Unraveling metagenomics through long-read sequencing: A comprehensive review. Journal of Translational Medicine. 2024;22:111.

39. Kolmogorov M, Rayko M, Yuan J, Polevikov E, Pevzner P. metaFlye: Scalable long-read metagenome assembly using repeat graphs. 2019. doi:10.1101/637637.

40. Shakya M, Lo C-C, Chain PS. Advances and challenges in metatranscriptomic analysis. Frontiers in genetics. 2019;10:904.

41. Ojala T, Häkkinen A-E, Kankuri E, Kankainen M. Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics. Trends in Genetics. 2023;39:686–702.