Supplement: prokaryote assembly
Valentin Loux
Guillaume Gautreau
Co-authors: Olivier Rué, Cédric Midoux: https://documents.migale.inrae.fr/posts/training-materials/2024-03-18-module24/slides/#/title-slide
March 18, 2024
Isolate genome assembly
Short read or low depth hybrid
Short reads only, or hybrid with low depth (< 100x) long reads
Trycycler: long read only or hybrid
High depth hybrid (> 100x), with or without short reads
SPAdes
Unicycler
Flye
Raven
Miniasm
…
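The two branches above (short read or low depth hybrid below 100x, Trycycler above 100x) can be read as a simple decision rule. The sketch below is only an illustration of that rule: the function name, its arguments, and the handling of cases the slides do not mention (for example low depth long reads without short reads, or a depth of exactly 100x) are assumptions, not part of the course material.

```python
def suggest_assembly_strategy(long_read_depth: float, has_short_reads: bool) -> str:
    """Return a coarse strategy label following the two branches on the slides.

    long_read_depth is the long-read depth of coverage (0 if there are no long reads).
    The name and signature are hypothetical, chosen for this illustration.
    """
    high_depth = 100.0  # the 100x threshold quoted on the slides

    if long_read_depth > high_depth:
        # High depth long reads, with or without short reads.
        return "Trycycler (long read only or hybrid)"
    if has_short_reads:
        # Short reads only, or short reads plus low depth long reads.
        return "short read or low depth hybrid"
    # Assumption: the slides do not cover this situation.
    return "not covered by the slides"


print(suggest_assembly_strategy(long_read_depth=150, has_short_reads=False))
print(suggest_assembly_strategy(long_read_depth=30, has_short_reads=True))
```

The threshold is a rule of thumb from the slides; in practice the choice also depends on read quality and genome complexity.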
Trycycler : detailed view (1)
Trycycler : detailed view (2)
Hybracter
George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) Microbial Genomics doi: https://doi.org/10.1099/mgen.0.001244.
Introduction to shotgun metagenomics
Guillaume Gautreau, Valentin Loux
Co-authors: Olivier Rué, Cédric Midoux: https://documents.migale.inrae.fr/posts/training-materials/2024-03-18-module24/slides/#/title-slide
March 18, 2024
History
Isolate genomics versus metagenomics
Introduction
A review of methods and databases for metagenomic classification and assembly (2019)
Challenges
Coverage requirement (1/5)
Depth of coverage
Breadth of coverage
[Figure: short sequencing reads of 75 b, 22 of which are aligned on a 500 b reference; the reference could be a genome, a gene, or a contig]
Coverage requirement (2/5)
Depth of coverage = (75 × 22) / 500 = 3.3X (can be calculated before alignment)
Breadth of coverage
[Figure: short sequencing reads of 75 b, 22 of which are aligned on a 500 b reference; the reference could be a genome, a gene, or a contig]
Coverage requirement (3/5)
Depth of coverage ≈ 3.3X
Breadth of coverage ≈ ?
[Figure: short sequencing reads of 75 b, 22 of which are aligned on a 500 b reference; the reference could be a genome, a gene, or a contig]
Coverage requirement (4/5)
Lander and Waterman (1988)
Depth of coverage ≈ 3.3X
Breadth of coverage ≈ 95%
[Figure: short sequencing reads of 75 b, 22 of which are aligned on a 500 b reference, with a plot on which the values 3.3 and 0.95 are marked]
Coverage requirement (5/5)
Lander and Waterman (1988)
Depth of coverage ≈ 3.3X
Breadth of coverage ≈ 95%
[Figure: short sequencing reads of 75 b, 22 of which are aligned on a 500 b reference, with a plot on which the values 3.3 and 0.95 are marked]
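The numbers on these slides can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual simplified reading of the Lander and Waterman model in which reads land uniformly at random and the expected fraction of the reference covered at depth c is 1 - exp(-c); edge effects and minimum overlap are ignored. The function names are illustrative.

```python
import math


def depth_of_coverage(read_length: int, n_reads: int, reference_length: int) -> float:
    """Mean number of times each reference base is sequenced."""
    return read_length * n_reads / reference_length


def expected_breadth(depth: float) -> float:
    """Expected fraction of the reference covered by at least one read.

    Assumption: uniform random read placement (Poisson approximation).
    """
    return 1.0 - math.exp(-depth)


depth = depth_of_coverage(read_length=75, n_reads=22, reference_length=500)
breadth = expected_breadth(depth)
print(f"depth = {depth:.1f}X, expected breadth = {breadth:.1%}")
```

With the slide values (22 reads of 75 b on a 500 b reference) the depth is 3.3X and the expected breadth is about 0.96, in line with the roughly 95% read from the curve.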
Challenges
Taxonomic classification and quantification
Taxonomic classification caveats:
2 kinds of approaches:
Approach | Tool | Galaxy | Comments |
k-mer based | Kraken2 | ✔ | The reference tool, fast and efficient |
k-mer based | Bracken | ✔ | Bayesian Reestimation of Abundance from Kraken |
k-mer based | Centrifuge | ✔ | Indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem |
k-mer based | Kaiju | ✔ | Works at the protein level |
k-mer based | Sylph | X | K-mer sketching. Works locally. Both fast and accurate |
Marker gene based | MetaPhlAn4 | X (version 2: ✔) | MetaPhlAn relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic) |
Marker gene based | Meteor2 | X | Based on environment-specific gene catalogs (especially human gut) |
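To make the k-mer based approach in the table above more concrete, here is a toy classifier. It is a deliberately simplified sketch and not the algorithm of Kraken2 or any other listed tool: Kraken assigns each k-mer to the lowest common ancestor of the genomes that contain it, whereas this example simply ignores k-mers shared between taxa and takes a majority vote. The reference sequences, taxon names, and k = 5 are made up for illustration.

```python
from collections import Counter


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: dict, k: int) -> dict:
    """Map every reference k-mer to the set of taxa in which it occurs."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read: str, index: dict, k: int) -> str:
    """Assign a read to the taxon that wins the vote of its unambiguous k-mers."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # k-mers shared between taxa do not vote in this toy version
            votes[next(iter(taxa))] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


# Hypothetical reference sequences, far shorter than real genomes.
references = {
    "taxon_A": "ATGGCGTACGTTAGCCTAGGA",
    "taxon_B": "TTGACCGGTAACGTCAGGTCA",
}
index = build_index(references, k=5)
print(classify("GCGTACGTTAG", index, k=5))  # read taken from taxon_A
print(classify("CCCCCCCCCCC", index, k=5))  # no k-mer in the index
```

Real tools use much longer k-mers and compact index structures to stay fast on full reference databases.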
Kraken2
Method:
Bracken
Centrifuge
Kaiju
Kaiju databanks
Sylph
MetaPhlAn4
Meteor2
Statistical analyses
⚠️ Contamination issues (1/2)
Host contaminations
External contaminants QC:
Tool | Galaxy | Comments |
KneadData | X | Removes rRNA and host (human and mouse) reads |
SortMeRNA | ✔️ | Removes rRNA reads; slow |
⚠️ Contamination issues (2/2)
Metagenomics assembly
Objectives
General assembly strategies
Metagenome assembly specificity
Individual assembly or co-assembly?
Useful for reducing differences in coverage between samples
Pros of co-assembly | Cons of co-assembly |
More data | Higher computational overhead |
Better/longer assemblies | Risk of shattering the assembly |
Access to lower-abundance organisms | Risk of increased contamination |
Co-assembly
Co-assembly is reasonable if:
If this is not the case, individual assembly should be preferred. An extra de-replication step should then be added.
Software
Metagenomic assembly software:
Benchmark
[Zhang et al. 2023]
Some results
Our results showed that the short-read assemblers generated the lowest contig contiguity and [Near Complete MAGs]. MEGAHIT outperformed IDBA-UD and metaSPAdes on the deeply sequenced datasets (>100X), and metaSPAdes obtained better results than MEGAHIT and IDBA-UD on low-complexity datasets (depth < 100X).
Hybrid assemblies demonstrated higher (or at least similar) [Genome fraction] and [total assembly length] than short- and long-read assemblies, and generated higher [High Quality] and [Near Complete] than long-read assemblies
Short-read assemblers were unable to assemble any genomes of low-abundance microbes
Assessment of assembly quality
MetaQUAST [Mikheenko et al. 2015] to evaluate and compare metagenome assemblies
MetaQUAST:
De novo metrics
Evaluation of the assembly based on:
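One widely used de novo metric is N50: the length of the contig at which, going from the longest contig downward, the cumulative length reaches half of the total assembly length. The function below is a minimal sketch for illustration, with made-up contig lengths; it is not MetaQUAST code.

```python
def n50(contig_lengths: list) -> int:
    """Return the N50 of a list of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= half_total:
            return length
    return 0


# Hypothetical contig lengths in base pairs.
contigs = [100, 80, 50, 30, 20, 10]
print(n50(contigs))
```

Here the total length is 290, so half is 145; the two longest contigs (100 + 80) already exceed it, which gives an N50 of 80.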
Reference-based metrics
Binning
Binning:
Binning is a good compromise when the assembly of whole genomes is not feasible.
CONCOCT, SemiBin
Approach
MetaBAT [Kang et al. 2019] is a tool for reconstructing genomes from complex microbial communities.
Bins evaluation
For the evaluation of bins, we will use completeness and contamination estimated by CheckM [Parks et al. 2015]
Threshold depends on the type of assembly.
For metagenomics, usually: completeness > 90%, contamination < 5%, heterogeneity ≤ 0.5
Pasolli et al. 2019; Bowers et al. 2017
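The thresholds above translate directly into a filter over per-bin quality estimates. The sketch below assumes the values have already been extracted from a CheckM report into plain numbers; the `BinQuality` class, the bin names, and the example values are hypothetical, and the heterogeneity value is used unitless as on the slide.

```python
from dataclasses import dataclass


@dataclass
class BinQuality:
    name: str
    completeness: float   # percent
    contamination: float  # percent
    heterogeneity: float  # same scale as the slide threshold (assumption)


def passes_thresholds(b: BinQuality,
                      min_completeness: float = 90.0,
                      max_contamination: float = 5.0,
                      max_heterogeneity: float = 0.5) -> bool:
    """Apply the usual metagenomics thresholds quoted on the slide."""
    return (b.completeness > min_completeness
            and b.contamination < max_contamination
            and b.heterogeneity <= max_heterogeneity)


# Hypothetical bins, for illustration only.
bins = [
    BinQuality("bin.1", completeness=97.2, contamination=1.1, heterogeneity=0.0),
    BinQuality("bin.2", completeness=85.0, contamination=2.0, heterogeneity=0.0),
    BinQuality("bin.3", completeness=95.0, contamination=7.5, heterogeneity=0.2),
]
kept = [b.name for b in bins if passes_thresholds(b)]
print(kept)
```

Keeping the thresholds as keyword arguments makes it easy to relax them, since the slide notes that they depend on the type of assembly.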
What's next?
Galaxy training on
References
Advances and challenges in metatranscriptomic analysis, 2019 [40]
Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics, 2023 [41]
Take home message
Need help?
References
1. Escobar-Zepeda A, Vera-Ponce de León A, Sanchez-Flores A. The road to metagenomics: From microbiology to DNA sequencing technologies and bioinformatics. Frontiers in genetics. 2015;6:348.
2. Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Briefings in bioinformatics. 2019;20:1125–36.
3. Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, et al. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Computational and Structural Biotechnology Journal. 2021;19:6301–14. doi:10.1016/j.csbj.2021.11.028.
4. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics. 2021;3. doi:10.1093/nargab/lqab019.
5. Wood DE, Salzberg SL. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome biology. 2014;15:1–12.
6. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with kaiju. Nature communications. 2016;7:11257.
7. Jurasz H, Pawlowski T, Perlejewski K. Contamination issue in viral metagenomics: Problems, solutions, and clinical perspectives. Frontiers in Microbiology. 2021;12:745076.
8. Minich JJ, Sanders JG, Amir A, Humphrey G, Gilbert JA, Knight R. Quantifying and understanding well-to-well contamination in microbiome research. mSystems. 2019;4:e00186-19. doi:10.1128/msystems.00186-19.
9. Lou YC, Hoff J, Olm MR, West-Roberts J, Diamond S, Firek BA, et al. Using strain-resolved analysis to identify contamination in metagenomics data. Microbiome. 2023;11:36.
10. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.
11. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate illumina paired-end reAd mergeR. Bioinformatics. 2013;30:614–20.
12. Joshi N, Fass J. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. 2011.
13. Lab H. KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. 2022. https://github.com/biobakery/kneaddata.
14. Kopylova E, Noé L, Touzet H. SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28:3211–7.
15. Rumbavicius I, Rounge TB, Rognes T. HoCoRT: Host contamination removal tool. BMC bioinformatics. 2023;24:371.
16. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.
17. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–6.
18. Vollmers J, Wiegand S, Kaster A-K. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective-not only size matters! PloS one. 2017;12:e0169662.
19. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2015;32:1088–90. doi:10.1093/bioinformatics/btv697.
20. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. 2013.
21. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
22. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
23. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3:e1165.
24. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.
25. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9. doi:10.1093/bioinformatics/btu153.
26. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics. 2010;11:1–11.
27. Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Mering C von, et al. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Molecular Biology and Evolution. 2017;34:2115–22. doi:10.1093/molbev/msx148.
28. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2014;12:59–60. doi:10.1038/nmeth.3176.
29. Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. Journal of molecular biology. 2016;428:726–31.
30. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2. doi:10.1093/bioinformatics/bts565.
31. Liu Y-X, Qin Y, Chen T, Lu M, Qian X, Guo X, et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein & Cell. 2020;12:315–30. doi:10.1007/s13238-020-00724-8.
32. Benoit G, Mariadassou M, Robin S, Schbath S, Peterlongo P, Lemaitre C. SimkaMin: Fast and resource frugal de novo comparative metagenomics. Bioinformatics. 2019. doi:10.1093/bioinformatics/btz685.
33. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nature Communications. 2018;9. doi:10.1038/s41467-018-04964-5.
34. Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods. 2019;16:603–6. doi:10.1038/s41592-019-0437-4.
35. Boyd JA, Woodcroft BJ, Tyson GW. GraftM: a tool for scalable, phylogenetically informed classification of genes within metagenomes. Nucleic Acids Research. 2018;46:e59–9. doi:10.1093/nar/gky174.
36. Eren AM, Esen OC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: An advanced analysis and visualization platform for ‘omics data. PeerJ. 2015;3:e1319. doi:10.7717/peerj.1319.
37. Kieser S, Brown J, Zdobnov EM, Trajkovski M, McCue LA. ATLAS: A snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics. 2020;21. doi:10.1186/s12859-020-03585-4.
38. Kim C, Pongpanich M, Porntaveetus T. Unraveling metagenomics through long-read sequencing: A comprehensive review. Journal of Translational Medicine. 2024;22:111.
39. Kolmogorov M, Rayko M, Yuan J, Polevikov E, Pevzner P. metaFlye: Scalable long-read metagenome assembly using repeat graphs. 2019. doi:10.1101/637637.
40. Shakya M, Lo C-C, Chain PS. Advances and challenges in metatranscriptomic analysis. Frontiers in genetics. 2019;10:904.
41. Ojala T, Häkkinen A-E, Kankuri E, Kankainen M. Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics. Trends in Genetics. 2023;39:686–702.