Analysis and assembly methods for
microbiome sequencing data
Marcus Fedarko
Microbiomes
Steve Gschmeissner, Science Photo Library
“Scanning electron micrograph (SEM) of bacteria cultured from a sample of human faeces.”
Steve Gschmeissner, Science Photo Library
“Coloured scanning electron micrograph (SEM) of bacteria cultured from a mobile phone.”
Steve Gschmeissner, Science Photo Library
“Bacterial contamination, coloured scanning electron micrograph (SEM). Escherichia coli bacteria in a cell culture. This contamination has come from an unclean water source.”
This talk
Introduction: Why bother studying microbiomes?
“During the Eastern Jin dynasty (AD 300–400 years), ‘Zhou Hou Bei Ji Fang’, a well-known monograph of traditional Chinese medicine (TCM) written by Hong Ge, recorded a case of treating patients with food poisoning or severe diarrhea by ingesting human fecal suspension (known as yellow soup or Huang-Long decoction).”
H. Du, T.-t. Kuang, S. Qiu, T. Xu, C.-L. G. Huan, G. Fan, and Y. Zhang. Fecal medicines used in traditional medical system of China: a systematic review of their names, original species, traditional uses, and modern investigations. Chinese Medicine, 14(1):1–16, 2019.
American Gastroenterological Association. Fecal Microbiota Transplantation (FMT): Overview. https://gastro.org/practice-guidance/gi-patient-center/topic/fecal-microbiota-transplantation-fmt/.
Fecal Microbiota Transplantation (FMT)
“H. pylori-negative individuals harbor a microbiota that is more complex and highly diverse compared to H. pylori-positive individuals. [...] Following infection with H. pylori, Proteobacteria and specifically H. pylori dominate the gastric microbiota. This leads to the development of chronic gastritis.”
Introduction: Why is it always C. diff and H. pylori?
“There are two well-documented diseases in the microbiome field that link a microbial biomarker with causation in disease: Helicobacter pylori-associated peptic ulceration and gastric cancer (Parsonnet et al., 1991) and Clostridium (or Clostridioides) difficile infection-associated diarrhea (Gupta et al., 2016).”
“However, causal inferences between complex microbiomes and other inflammatory, metabolic, neoplastic, and neuro-behavioral disorders have been neither compelling nor conclusive [...]”
J. Walter, A. M. Armet, B. B. Finlay, and F. Shanahan. Establishing or Exaggerating Causality for the Gut Microbiome: Lessons from Human Microbiota-Associated Rodents. Cell, 180(2):221–232, 2020.
Introduction: Obesity and the gut microbiome
“Adult germ-free [wild-type] C57BL/6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”
(Context: ob/ob mice have a specific mutation “[...] that produces a stereotyped, fully penetrant obesity phenotype”; +/+ mice lack this mutation.)
Ley, R. E., Bäckhed, F., Turnbaugh, P., Lozupone, C. A., Knight, R. D., & Gordon, J. I. (2005). Obesity alters gut microbial ecology. Proceedings of the national academy of sciences, 102(31), 11070-11075.
Results: “Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota [...]”
Turnbaugh, P. J. (2017). Microbes and diet-induced obesity: fast, cheap, and out of control. Cell Host & Microbe, 21(3), 278-281.
Introduction: Obesity and the gut microbiome?
Introduction: Why is this so difficult?
Introduction: Our goal for this talk
Improving the resolution with which we can study microbes.
As far as some technologies can go
“Strain” level: completely unique genomes
For more examples, see: Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.
Small strain-level differences can make a big difference!
Our goal, then, is reconstructing the full genomes of all microbes in a sample.
… This isn’t really possible right now, but we’ll see what we can do!
One final note for the introduction, though:
Introduction: Why is this so difficult?
This talk
Culture-Independent Methods
“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”
Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of bacteriology, 180(18), 4765-4774.
Although not all microbes are easily culturable, all have genomes.
Idea: look at the DNA in a sample and use that to study the microbes there!
For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
“Reads”
Reads are strings over the alphabet Σ = {A, C, G, T}.
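To make the "reads are strings" view concrete, here is a minimal sketch that checks whether a read uses only the four-letter alphabet. The example reads and the function name `is_valid_read` are invented for illustration; real sequencer output often also contains ambiguity codes such as N, which this strict check rejects.

```python
# The DNA alphabet from the slide: Σ = {A, C, G, T}.
SIGMA = frozenset("ACGT")


def is_valid_read(read: str) -> bool:
    """Return True if the read is a non-empty string over SIGMA."""
    return len(read) > 0 and set(read) <= SIGMA


# Hypothetical reads, invented for illustration.
reads = ["ACGTTGCA", "GGGTACCA", "ACGTNNAC"]
valid_reads = [r for r in reads if is_valid_read(r)]
print(valid_reads)
```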
C. I. Methods: Marker gene sequencing
C. I. Methods: Carl Woese and rRNA genes
ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent:
(i) the eubacteria, comprising all typical bacteria;
(ii) the archaebacteria, containing methanogenic bacteria; and
(iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.
For a more thorough history of Carl Woese’s life and work, see: D. Quammen. The Scientist Who Scrambled Darwin’s Tree of Life. The New York Times, 2018.
C. I. Methods: What’s in a marker gene?
“To determine relationships covering the entire spectrum of extant living systems, one optimally needs a molecule of appropriately broad distribution. None of the readily characterized proteins fits this requirement. However, ribosomal RNA does. It is [...] permitting the detection of relatedness among very distant species.”
Entropy in the 16S rRNA gene
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
More mutations
Hypervariable regions
Conserved regions
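The entropy plot referenced above can be approximated with a per-position Shannon entropy over aligned sequences. The following is a minimal sketch, assuming the sequences are already aligned and of equal length; the toy alignment and the function names are invented for illustration and are not taken from the cited paper.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the bases observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(aligned_seqs: list[str]) -> list[float]:
    """Entropy at every position of equal-length, pre-aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*aligned_seqs)]


# Toy alignment, invented for illustration: positions 0-3 are identical
# across sequences ("conserved"), positions 4-5 vary ("hypervariable").
toy_alignment = ["ACGTAA", "ACGTCC", "ACGTGG", "ACGTTT"]
profile = entropy_profile(toy_alignment)
print(profile)
```

Positions with entropy near zero behave like conserved regions; positions with high entropy behave like hypervariable regions.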
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
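The idea of primers that bind conserved regions and bracket a variable region can be sketched as an "in silico PCR". This is a toy model under stated assumptions: the primer and template sequences are invented (they are not the 515f–806r primers), and matching is exact, whereas real primers often contain degenerate bases and tolerate mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template: str, fwd_primer: str, rev_primer: str):
    """Return the amplicon (primers included), or None if a primer site is missing.

    The forward primer is searched on the given strand. The reverse primer
    anneals to the opposite strand, so we search for its reverse complement.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Hypothetical primers and templates: both templates share the same
# conserved flanks but differ in the variable region between them.
fwd_primer = "GTGCCAGC"
rev_primer = "TATCTAAT"  # reverse complement of the conserved site ATTAGATA
template_a = "TT" + "GTGCCAGC" + "AAAACCCC" + "ATTAGATA" + "GG"
template_b = "CC" + "GTGCCAGC" + "GGGGTTTT" + "ATTAGATA" + "AA"

amplicon_a = in_silico_pcr(template_a, fwd_primer, rev_primer)
amplicon_b = in_silico_pcr(template_b, fwd_primer, rev_primer)
print(amplicon_a)
print(amplicon_b)
```

The same primer pair recovers an amplicon from both templates, and the two amplicons differ only in the region between the primer sites.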
Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
Example taxonomic annotations from the QIIME 2 “Moving Pictures” tutorial: https://docs.qiime2.org
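Annotations in the rank-prefixed format shown above are easy to work with programmatically. The sketch below splits one annotation into its ranks and reports the deepest rank that was actually assigned; the function names and the rank-name mapping are assumptions made for illustration, not part of any particular tool's API.

```python
# Mapping from rank prefix to rank name (an assumption for this sketch).
RANK_NAMES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(annotation: str) -> dict[str, str]:
    """Split 'k__...; p__...' into {rank: name}, skipping empty ranks like 's__'."""
    lineage = {}
    for field in annotation.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_NAMES and name:
            lineage[RANK_NAMES[prefix]] = name
    return lineage


def deepest_rank(annotation: str):
    """Name of the most specific rank assigned, or None if nothing was assigned."""
    lineage = parse_taxonomy(annotation)
    return list(lineage)[-1] if lineage else None


# Three of the annotations listed above.
examples = [
    "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae",
]
for annotation in examples:
    print(deepest_rank(annotation), parse_taxonomy(annotation))
```

Note how the resolution varies: some sequences are only classified to class level, while others reach genus or species.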
C. I. Methods: Marker gene sequencing
Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A., Abnet, C. C., Al-Ghalith, G. A., ... & Caporaso, J. G. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852-857.
Broad taxonomic comparisons
Willis, A., Bunge, J., & Whitman, T. (2017). Improved detection of changes in species richness in high diversity microbial communities. Journal of the Royal Statistical Society: Series C (Applied Statistics), 66(5), 963-977. (Figure from arXiv manuscript version.)
Estimating and comparing diversity within samples
Also referred to as 𝛼-diversity (“alpha diversity”)
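As a rough illustration of within-sample diversity, the sketch below computes two common summaries from a vector of per-taxon counts: observed richness and the Shannon index. The count vectors are invented, and these naive plug-in calculations ignore taxa that were present but not observed, so they should be read as a sketch rather than a recommended estimator.

```python
import math


def observed_richness(counts: list[int]) -> int:
    """Number of taxa with at least one observation in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts: list[int]) -> float:
    """Shannon index (natural log) of the relative abundances in one sample."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical samples: same richness, very different evenness.
sample_even = [25, 25, 25, 25]
sample_skewed = [97, 1, 1, 1]

for name, counts in [("even", sample_even), ("skewed", sample_skewed)]:
    print(name, observed_richness(counts), round(shannon_index(counts), 3))
```

Both samples contain four taxa, but the evenly distributed sample has the higher Shannon index.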
McDonald, D., Hyde, E., Debelius, J. W., Morton, J. T., Gonzalez, A., Ackermann, G., ... & Knight, R. (2018). American Gut: an open platform for citizen science microbiome research. mSystems, 3(3), e00031-18.
Unsupervised dimensionality reduction (e.g. PCA / PCoA)
Also referred to as 𝛽-diversity (“beta diversity”)
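Methods like PCoA take a matrix of pairwise distances between samples as input. The sketch below computes one commonly used choice, the Bray-Curtis dissimilarity; picking Bray-Curtis here is an assumption, since the slide does not name a specific distance, and the count vectors are invented.

```python
def bray_curtis(u: list[int], v: list[int]) -> float:
    """Bray-Curtis dissimilarity between two count vectors over the same taxa."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(u) + sum(v)
    return numerator / denominator


# Hypothetical samples (rows) with counts for three taxa (columns).
samples = {
    "s1": [10, 0, 5],
    "s2": [10, 0, 5],
    "s3": [0, 15, 0],
    "s4": [5, 5, 5],
}

# Pairwise distance matrix, stored as a dict keyed by sample-name pairs.
names = list(samples)
distances = {
    (a, b): bray_curtis(samples[a], samples[b]) for a in names for b in names
}
print(distances[("s1", "s2")], distances[("s1", "s3")], distances[("s1", "s4")])
```

Identical samples are at distance 0, samples sharing no taxa are at distance 1, and everything else falls in between; an ordination method then tries to place samples so that these distances are approximately preserved.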
Fedarko, M. W., Martino, C., Morton, J. T., González, A., Rahman, G., Marotz, C. A., ... & Knight, R. (2020). Visualizing ‘omic feature rankings and log-ratios using Qurro. NAR Genomics and Bioinformatics, 2(2), lqaa023.
Identifying differentially abundant features (or ratios of features) in groups of samples
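A log-ratio of features can be sketched as follows: for each sample, sum the counts of a chosen numerator set and a chosen denominator set, then take the log of their ratio. The feature names, counts, and the function `log_ratio` are hypothetical, and this is not the implementation used by any particular tool. The ratio is undefined when either sum is zero, so this sketch returns None for such samples.

```python
import math


def log_ratio(sample_counts: dict[str, int], numerator: list[str], denominator: list[str]):
    """Natural log of (summed numerator counts / summed denominator counts)."""
    num = sum(sample_counts.get(f, 0) for f in numerator)
    den = sum(sample_counts.get(f, 0) for f in denominator)
    if num == 0 or den == 0:
        return None  # log-ratio is undefined for this sample
    return math.log(num / den)


# Hypothetical feature table: sample -> {feature: count}.
table = {
    "sample1": {"featA": 20, "featB": 5, "featC": 5},
    "sample2": {"featA": 5, "featB": 10, "featC": 10},
    "sample3": {"featA": 0, "featB": 7, "featC": 3},
}

ratios = {
    name: log_ratio(counts, ["featA"], ["featB", "featC"])
    for name, counts in table.items()
}
print(ratios)
```

Comparing these per-sample values between groups of samples is one way to look for differentially abundant features without relying on absolute abundances.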
Callahan B.J. et al. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal, 11(12):2639–2643, 2017.
Schloss P.D. Amplicon sequence variants artificially split bacterial genomes into separate clusters. bioRxiv, 2021.
Knight R. et al. Best practices for analysing microbiomes. Nature Reviews Microbiology, 16(7):410–422, 2018.
C. I. Methods: back to Carl Woese and rRNA genes
The original paper on PCR was published in 1986.
The first automatic sequencer (the “AB370”) was developed in 1987.
Woese and Fox’s three-domain paper was published in 1977!
Photo credit: N. Pace. Carl Woese and the Beginnings of Metagenomics. Looking in the Right Direction: Carl Woese and the New Biology, 2015. https://www.youtube.com/watch?v=h3K50DD9kIM (timestamp: 2:56)
“While the grad students and technicians produced fingerprints, Woese spent his time staring at the spots. Was this effort tedious in practice as well as profound in its potential results? Yes. ‘There were days,’ he wrote later, ‘when I’d walk home from work saying to myself, ‘Woese, you have destroyed your mind again today.’ ’ ”
C. I. Methods: Metagenome sequencing
Many of the “standard analyses” for marker gene sequencing data are also applicable to metagenome sequencing data.
Taxonomy
𝛼-diversity
𝛽-diversity
Differential abundance
Metagenome sequencing enables two main additional types of analyses, compared to marker gene sequencing.
Functional annotation
Sequence assembly
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
Human Microbiome Project Consortium. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207.
C. I. Methods: Functional annotation
(not just marker!) genes
operons
terminators
promoters
For more information on metagenomic functional annotation, see: Salamov, V. S. A., & Solovyevand, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Li, RW, Ed, 61-78.
...
C. I. Methods: Functional annotation, in practice
92
Usually, functional annotation involves aligning (“mapping”) sequences to a reference database with information about “function” in well-studied organisms.
(But there are fancier approaches, e.g. metabolic modelling methods.)
C. I. Methods: Functional annotation, in practice?
93
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
C. I. Methods: Functional annotation, in practice??
94
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
“A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]”
C. I. Methods: Functional annotation, in practice???
95
“A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]”
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
C. I. Methods: Functional annotation, in practice
97
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2), e1000667.
C. I. Methods: Metagenome sequence assembly
102
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
“Reads”: strings from Σ = {A,C,G,T}
In practice there’ll be many input molecules here...
...and, ideally, one output sequence per input molecule
C. I. Methods: Metagenome sequence assembly
103
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
This talk
106
Assembly (in the context of a metagenome)
107
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
“Reads”: strings from Σ = {A,C,G,T}
In practice there’ll be many input molecules here...
...and, ideally, one output sequence per input molecule
Assembly: inputs
Reads: strings from the alphabet Σ = {A, C, G, T}
 Occasionally this alphabet is extended if the base at a position is ambiguous:
 e.g. W = (A or T), S = (C or G), N = (A or C or G or T), …
Things that can vary based on the sequencing technology being used:
 Read length
 Read error rate
 Number of reads
 Read structure (e.g. single vs. paired-end reads)
Three (modern) sequencing technologies we’ll focus on:
 (1) short-read, (2) long, error-prone read, (3) HiFi
111
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
For more detail on this notation, see: https://en.wikipedia.org/wiki/Nucleic_acid_notation.
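A minimal Python sketch of the extended alphabet described above. The table follows the standard IUPAC nucleotide codes; the function names expand_read and is_unambiguous are hypothetical, chosen only for this illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes: each symbol maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def is_unambiguous(read):
    """True if the read only uses the basic alphabet {A, C, G, T}."""
    return set(read.upper()) <= set("ACGT")


def expand_read(read):
    """List every string over {A, C, G, T} that an ambiguous read could represent."""
    choices = [IUPAC[base] for base in read.upper()]
    return ["".join(combo) for combo in product(*choices)]


# "W" is (A or T) and "S" is (C or G), so "AWS" stands for four possible strings.
expanded = expand_read("AWS")
print(expanded)
```

The number of expansions grows multiplicatively with each ambiguous position, which is one reason a read containing many N characters carries little information for assembly.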
Assembly: sequencing technologies
114
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
This simple categorization ignores a lot of finicky details!
But this should be enough to understand how these technologies can complicate or simplify assembly.
Assembly: impacts of technologies
115
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634.
Assembly: impacts of technologies
116
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634.
Koren, S., Treangen, T. J., & Pop, M. (2011). Bambus 2: scaffolding metagenomes. Bioinformatics, 27(21), 2964-2971.
Assembly: impacts of technologies
117
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634.
Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf.
Technically this figure changes k-mer lengths, not read lengths (we’ll define k-mers soon); however, the idea is the same.
Assembly: outputs
123
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
“Reads”: strings from Σ = {A,C,G,T}
“Contigs”: (hopefully longer) strings from Σ = {A,C,G,T}
In practice there’ll be many input molecules here...
...and, ideally, one output sequence per input molecule
Assembly: outputs (contigs)
Ideally: one contig per input molecule of DNA (e.g. each chromosome, plasmid, …)
In practice: usually more contigs than that
Some projects attempt to group contigs together into bins that likely originate from the same “genome”.
Bins of contigs, or especially high-quality individual contigs, are referred to as metagenome-assembled genomes (MAGs).
127
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.
Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., ... & Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8), 725-731.
“We present a metagenomic HiFi assembly of a complex microbial community from sheep fecal material that resulted in 428 high-quality MAGs from a single sample, the highest resolution achieved with metagenomic deconvolution to date.”
Bickhart, D. M., Kolmogorov, M., Tseng, E., Portik, D., Korobeynikov, A., Tolstoganov, I., ... & Smith, T. P. (2021). Generation of lineage-resolved complete metagenome-assembled genomes by precision phasing. bioRxiv.
Assembly: outputs (assembly graph)
Most assembly algorithms model the problem as some sort of graph traversal.
Assemblers usually output assembly graphs, which (generally) show overlaps between contigs.
Ideally: one connected component per input molecule of DNA
In practice: the graph is usually tangled, fragmented, …
These can be useful when “finishing” assemblies, or looking at subtle variations.
134
Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352.
Ghurye, J., Treangen, T., Fedarko, M., Hervey, W. J., & Pop, M. (2019). MetaCarvel: linking assembly graph motifs to biological variants. Genome Biology, 20(1), 1-14.
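To make the "one connected component per input molecule" idea concrete, here is a small Python sketch that groups contigs by connected component. The contig names and overlap list are invented toy data, and the function connected_components is a hypothetical helper; real assemblers write their graphs in their own file formats, which this sketch does not parse.

```python
from collections import defaultdict, deque


def connected_components(contigs, overlaps):
    """Group contigs into connected components, ignoring edge direction."""
    neighbors = defaultdict(set)
    for a, b in overlaps:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        # Breadth-first search from each contig not yet assigned to a component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


# Toy graph (assumed data): c1-c2-c3 form a cycle, c4-c5 form a short chain.
contigs = ["c1", "c2", "c3", "c4", "c5"]
overlaps = [("c1", "c2"), ("c2", "c3"), ("c3", "c1"), ("c4", "c5")]
components = connected_components(contigs, overlaps)
print(components)
```

In the ideal case each component would correspond to one chromosome or plasmid. In a tangled metagenome graph a single component can mix several genomes, so component membership is only a starting point.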
This talk
135
Assembly: methods
136
“A good genome assembler is like a good sausage: you would rather not know what is inside.”
Apocryphal quote attributed to Sante Gnerre: http://rayan.chikhi.name/pdf/2019-july-19-cgsi.pdf
Assembly: methods (de novo vs. reference-based)
de novo assembly: Use only the read data available
Reference-based assembly: Also use available reference sequence(s)
137
“[...] reconstruction in its pure form, without consultation to previously resolved sequence including from genomes, transcripts, and proteins.”
“For some applications, sufficient information can be extracted from the mapping of reads to a reference sequence, such as a finished genome from a related individual.”
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
Assembly: methods (de novo vs. reference-based)
de novo assembly: Use only the read data available
 Far more commonly used when working with metagenome sequencing data.
Reference-based assembly: Also use available reference sequence(s)
 Some reference-based assemblers have been developed for metagenome sequencing data, but they have not (yet) seen widespread use in the field.
138
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph) (OLC)
 Nodes: input reads
 Edges: an edge is created from n1 → n2 if read n1 overlaps with read n2
 Goal: Find Hamiltonian paths (or cycles) in this graph (NP-complete!)
de Bruijn graph (also a directed graph; takes a parameter k) (DBG)
 Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads
 Edges: an edge is created from n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2
 Goal: Find Eulerian paths (or cycles) in this graph (polynomial time!)
142
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations, error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
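The de Bruijn graph definition above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not how any particular assembler works: the reads are error-free, each distinct k-mer contributes exactly one edge, reverse complements are ignored, and an Eulerian path is assumed to exist. The function names and the example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start where out-degree exceeds in-degree by one; fall back to any node (cycle case).
    start = next(
        (n for n in sorted(graph) if len(graph[n]) - in_deg.get(n, 0) == 1),
        min(graph),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Spell the sequence visited by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Three overlapping toy reads (assumed data) sampled from ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
contig = spell(eulerian_path(de_bruijn_graph(reads, 4)))
print(contig)
```

With k = 4 every 3-mer node in this example is distinct, so the graph is a single unbranched path. With k = 3 the node TG would have two outgoing edges (from TGG and TGC), so the traversal is no longer unique, which is the same effect the earlier note about k-mer lengths refers to.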
Assembly: methods (overlap vs. de Bruijn graphs)
143
OLC
DBG
Neither of these lists is comprehensive! Many assemblers (of both the OLC and DBG varieties) are still actively being developed and used today.
Also, most assemblers implement their own “twist” on how they use these graph structures, so cleanly categorizing assemblers in this way ignores many details.
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
Dida, F., & Yi, G. (2021). Empirical evaluation of methods for de novo genome assembly. PeerJ Computer Science, 7, e636.
Wang, A., Wang, Z., Li, Z., & Li, L. M. (2018). BAUM: improving genome assembly by adaptive unique mapping and local overlap-layout-consensus approach. Bioinformatics, 34(12), 2019-2028.
Assembly: methods (single-genome vs. metagenome)
147
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
Assembly: methods (single-genome vs. metagenome)
151
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs.�Nature Methods, 17(11), 1103-1110.
“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
Assembly: methods (single-genome vs. metagenome)
152
Different assemblers will do different things to deal with these sorts of subtle variations: even “smooth” contigs can conceal a lot of variation. Sometimes this is desirable—sometimes not!
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
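To make the "bubbles" in the figure caption above concrete, here is a minimal sketch of how one might flag simple bubbles in a directed assembly graph. The definition used (a source node whose successors each have exactly one incoming and one outgoing edge and all rejoin at the same sink) is a simplifying assumption for illustration. It is not necessarily the definition used by metaFlye or any other assembler, and `find_simple_bubbles` is a hypothetical helper name.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, sink, branch_nodes) triples for simple bubbles.

    Assumed definition: a source with at least two successors, where every
    successor has exactly one incoming edge and exactly one outgoing edge,
    and all successors lead to the same sink node.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    # sorted() copies the keys first, so defaultdict lookups below are safe.
    for source in sorted(succ):
        branches = sorted(succ[source])
        if len(branches) < 2:
            continue
        # Each branch node must be a plain pass-through: one way in, one way out.
        if any(len(pred[b]) != 1 or len(succ[b]) != 1 for b in branches):
            continue
        sinks = {next(iter(succ[b])) for b in branches}
        if len(sinks) == 1:
            bubbles.append((source, sinks.pop(), branches))
    return bubbles


# Toy graph: A splits into B and C (e.g. two strain variants), which rejoin at D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(find_simple_bubbles(toy_edges))
```

Whether an assembler collapses such a bubble into one "smooth" contig or keeps both branches is exactly the kind of design decision that differs between tools.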
This talk
153
Future work
Strain separation problem (Vicedomini et al., 2021)
“The reconstruction of partial or complete DNA sequences corresponding to strains, at the base level.”
(Local) Metagenome individual haplotyping problem (Nicholls et al., 2021)
“The ideal output [...] is the collection of whole-genome sequences representing all the individual organisms in a microbial community.”
Haplotype assembly problem (Lancia et al., 2001)
“Given a set of fragments obtained by DNA sequencing from the two copies of a chromosome, reconstruct two haplotypes that would be compatible with all the fragments observed.”
156
Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.
Nicholls, S. M., Aubrey, W., De Grave, K., Schietgat, L., Creevey, C. J., & Clare, A. (2021). On the complexity of haplotyping a microbial community. Bioinformatics, 37(10), 1360-1366.
(Vicedomini et al.’s definition of “strain” matches the one I defined ~45 minutes ago, i.e. a completely unique genome.)
Lancia, G., Bafna, V., Istrail, S., Lippert, R., & Schwartz, R. (2001, August). SNPs problems, complexity, and algorithms. In European Symposium on Algorithms (pp. 182-193). Springer, Berlin, Heidelberg.
Future work: “haplotype”?
Humans (or other diploid organisms) usually have two copies of each chromosome.
Ordinary assemblers often “smooth out” differences between the chromosomes, creating contigs that are chimeras of both chromosomes’ sequences.
158
Lancia, G., Bafna, V., Istrail, S., Lippert, R., & Schwartz, R. (2001, August). SNPs problems, complexity, and algorithms. In European Symposium on Algorithms (pp. 182-193). Springer, Berlin, Heidelberg.
https://www.genome.gov/genetics-glossary/Karyotype
In haplotype phasing, we attempt to fix this. Usually, this involves looking for variants that occur together on the same reads.
Even in the case of a single human genome, this problem is NP-hard. It gets worse for metagenomes!
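As a rough illustration of the phasing idea above, the sketch below splits reads into two haplotypes by brute force, minimizing the number of allele observations that disagree with their group's consensus. This "minimum error correction" objective is one common way to formalize the problem and is an assumption here, not necessarily the formulation used in the works cited on these slides. The function name `phase_reads` and the toy reads are hypothetical. The search tries all 2^(n-1) assignments, which is only feasible for tiny inputs and hints at why the general problem is hard.

```python
from itertools import product


def phase_reads(reads):
    """Brute-force two-haplotype phasing by minimum error correction.

    reads: list of dicts mapping variant position -> observed allele (0 or 1).
    Returns (cost, assignment); assignment[i] is the haplotype (0 or 1)
    that read i is placed on.
    """
    if not reads:
        return (0, ())

    def group_cost(group):
        # Consensus at each position is the majority allele, so the cost is
        # the number of minority observations at that position.
        counts = {}
        for read in group:
            for pos, allele in read.items():
                counts.setdefault(pos, [0, 0])[allele] += 1
        return sum(min(c) for c in counts.values())

    best = None
    # Pin read 0 to haplotype 0 so mirror-image assignments are not tried twice.
    for rest in product((0, 1), repeat=len(reads) - 1):
        assignment = (0,) + rest
        cost = sum(
            group_cost([r for r, h in zip(reads, assignment) if h == hap])
            for hap in (0, 1)
        )
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Toy reads covering three variant positions; alleles are 0 (ref) or 1 (alt).
toy_reads = [
    {1: 0, 2: 0},
    {2: 0, 3: 1},
    {1: 1, 2: 1},
    {2: 1, 3: 0},
    {1: 0, 3: 1},
]
print(phase_reads(toy_reads))
```

On the toy input, reads 0, 1, and 4 are mutually consistent (haplotype 0-0-1 across the three positions) and reads 2 and 3 form the other haplotype (1-1-0), so a zero-cost split exists.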
Future work: “metagenomic haplotyping”
159
A solution to [metagenomic haplotyping] is confounded by five problems:
(i) DNA from every genome needs to be extracted and sequenced to a depth sufficient for recovery,
(ii) genomes share homologous regions that require disambiguation,
(iii) reads may be of an insufficient length to disambiguate repeats or resolve bridges between variants,
(iv) sequencing error can be indistinguishable from rare haplotypes and
(v) the presence of an unknown number of haplotypes complicates the already computationally difficult (NP-hard) (Cilibrasi et al., 2005) problem of haplotyping.
Nicholls, S. M., Aubrey, W., De Grave, K., Schietgat, L., Creevey, C. J., & Clare, A. (2021). On the complexity of haplotyping a microbial community. Bioinformatics, 37(10), 1360-1366.
(i) So increase sequencing depth!
(ii) So increase read lengths!
(iii) So increase read lengths (again)!
(iv) So decrease error rates!
(v) So use better algorithms?
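Problem (iv) and the "decrease error rates" response can be made a little more tangible with a back-of-the-envelope calculation. The sketch below asks: at one position with a given read depth, how likely is it that sequencing error alone produces at least k reads showing the same alternate base? It assumes errors are independent between reads and that `error_rate` is the per-read chance of producing that specific wrong base. Both are simplifications, and the function name is hypothetical.

```python
from math import comb


def prob_at_least_k_errors(depth, error_rate, k):
    """P(X >= k) for X ~ Binomial(depth, error_rate)."""
    # Subtract the probability of seeing fewer than k erroneous reads.
    return 1.0 - sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )


# A rare haplotype at ~3% abundance would contribute about 3 of 100 reads.
# How often does error alone look like that?
for rate in (0.01, 0.001):
    p = prob_at_least_k_errors(depth=100, error_rate=rate, k=3)
    print(f"error rate {rate}: P(>= 3 matching errors at depth 100) = {p:.4f}")
```

Under these assumptions, a 1% error rate yields three or more matching errors at roughly 8% of positions at depth 100, while a 0.1% error rate makes this far less likely. That is the sense in which lower error rates help separate rare haplotypes from noise.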
Future work: next steps forward
162
Some things I’ve been up to at UCSD
163
Thank you!
164
Research Exam Committee and Logistics
Gary Cottrell, Vineet Bafna, Melissa Gymrek, Julie Conner
Advice/support over the years
Pavel Pevzner, Mikhail Kolmogorov, Andrey Bzikadze, Vikram Sirupurapu, Rob Knight, Yoshiki Vázquez-Baeza, Lisa Marotz, Cameron Martino, Jamie Morton, Antonio González, Gibraan Rahman, Jake Minich, Eric Allen, Dan Hakim,
Kalen Cantrell, Daniel McDonald, Yimeng Yang, Thant Zaw, Stefan Janssen, Mehrbod Estaki, Niina Haiminen, Kristen Beck, Qiyun Zhu, Erfan Sayyari, George Armstrong, Priya Tripathi, Julia Gauglitz, Nate Matteson,
Jon Sanders, Anna Paola Carrieri, Se Jin Song, Austin Swafford, Pieter Dorrestein, Kristian Andersen, Laxmi Parida, Ho-Cheol Kim, Larry Smarr, Gail Ackermann, Jeff DeReus, Michiko Souza, Justin Shaffer, Pedro Belda-Ferre,
Greg Humphrey, Celeste Allaband, Rodolfo Salido, Greg Poore, Victor Cantu, Jeffrey Chiu, Franck Lejzerowicz, Shi Huang, Sarah Adams, Tomasz Kosciolek, Zech Xu, Charles Cowart, Farhana Ali, Robert Mills,
Alison Vrbanc, Bryn Taylor, Jerry Kennedy, Yna Villanueva, Justine Debelius, Evan Bolyen, Matthew Dillon, Jay Ghurye, Jacquelyn Michaelis, Harihara Muralidharan, Nidhi Shah, Brian Brubach, Todd Treangen, Mihai Pop
Misc. Acknowledgements
Emojis: Google emojis from emojipedia.org: https://emojipedia.org/pile-of-poo/ (removed the emoji eyes manually in GIMP), https://emojipedia.org/non-potable-water/, https://emojipedia.org/potable-water/, https://emojipedia.org/mobile-phone/
Citation of Blaser 1992 in the context of H. pylori based on the Strainberry paper’s introduction: Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.
Taxonomic ranks figure modified from https://en.wikipedia.org/wiki/Taxonomic_rank#/media/File:Taxonomic_Rank_Graph.svg, ℅ Annina Breen.
E. coli phylogeny from Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017). Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).
Stock photos of someone in a suit kicking a can down the road all from Shutterstock.com ℅ Jim Barber (all images marked as royalty-free). Why were there five separate images of this? That’s a great question. I wish my job was literally just putting on a suit and kicking cans down a road. That would be so sick. I bet it pays better than grad school. Wait, why are you reading this? Seriously, there’s nothing important here. It’s just links. And this text.
https://www.shutterstock.com/search/kick+the+can,
https://www.shutterstock.com/image-photo/politicians-shoe-stops-dented-can-rolling-85554979,
https://www.shutterstock.com/image-photo/close-politicians-shoe-kicking-dented-shiny-85554970,
https://www.shutterstock.com/image-photo/close-shiny-dentedl-can-sitting-on-85554964,
https://www.shutterstock.com/image-photo/close-politicians-shoe-kicking-dented-shiny-85554973,
https://www.shutterstock.com/image-photo/politician-kicks-shiny-dented-can-down-85554967
Harold’s face: https://www.independent.co.uk/arts-entertainment/interviews/hide-pain-harold-meme-gif-interview-model-real-name-arato-andras-thumbs-stock-photo-a7835076.html
Quote about functional annotation and E. coli distance is from C. Frioux, D. Singh, T. Korcsmaros, and F. Hildebrand. From bag-of-genes to bag-of-genomes: metabolic modelling of communities in the era of metagenome-assembled genomes. Computational and Structural Biotechnology Journal, 18:1722–1734, 2020.
Jigsaw puzzle photo: From VisitIndiana.com. Also, I acknowledge that I used this figure (and the Commins figure for assembly) in a talk I gave last December.
Original PCR paper: Mullis et al. 1986. Specific Enzymatic Amplification of DNA In Vitro: The Polymerase Chain Reaction.
Source about the AB370 being the first sequencer: https://www.hindawi.com/journals/bmri/2012/251364/
Use of the Bambus 2 “variant” figure in the context of variant calling based on Serge’s PhD defense from 9 years ago: Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf. I already cited this when using the “read lengths help” figure a few slides beforehand, but I figure I might as well make that clear here. Serge’s a cool dude.
166
Funding
Fall 2018–Winter 2019
Standard first-year CSE department fellowship
Spring 2019–Summer 2019
Joint University Microelectronics Program (JUMP)’s Center for Research on Intelligent Storage and Processing-in-memory (CRISP)
Fall 2019–Fall 2020
IBM Research AI, via the AI Horizons Network and the UCSD Center for Microbiome Innovation (CMI)
Winter 2021
Teaching assistantship (CSE 282)
Spring 2021–
Pevzner Lab grants
167
Bonus: So how many microbes are there?
Turnbaugh 2007: “The vast majority of the 10–100 trillion microbes in the human gastrointestinal tract live in the colon.”
Locey and Lennon 2016: “[...] we predict that Earth is home to upward of 1 trillion microbial species.”
Willis 2016: The method used by L&L 2016 isn’t statistically admissible!
∴ Maybe the only answer right now that won’t anger any statistician or biologist: “a lot, I guess”
168
Introduction: Obesity and the gut microbiome
169
“We performed microbiota transplantation experiments to test directly the notion that the ob/ob microbiota has an increased capacity to harvest energy from the diet and to determine whether increased adiposity is a transmissible trait. Adult germ-free C57BL/6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). 16S-rRNA-gene-sequence-based surveys confirmed that the ob/ob donor microbiota had a greater relative abundance of Firmicutes compared with the lean donor microbiota (Supplementary Fig. 4 and Supplementary Table 7). Furthermore, the ob/ob recipient microbiota had a significantly higher relative abundance of Firmicutes compared with the lean recipient microbiota (P < 0.05, two-tailed Student’s t-test). UniFrac analysis of 16S rRNA gene sequences obtained from the recipients’ caecal microbiotas revealed that they cluster according to the input donor community (Supplementary Fig. 4): that is, the initial colonizing community structure did not exhibit marked changes by the end of the two-week experiment. There was no statistically significant difference in (1) chow consumption over the 14-day period (55.4 ± 2.5 g (ob/ob) versus 54.0 ± 1.2 g (+/+); caloric density of chow, 3.7 kcal g⁻¹), (2) initial body fat (2.7 ± 0.2 g for both groups as measured by dual-energy X-ray absorptiometry), or (3) initial weight between the recipients of lean and obese microbiotas. Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota (Fig. 3c; 47 ± 8.3 versus 27 ± 3.6 percentage increase or 1.3 ± 0.2 versus 0.86 ± 0.1 g fat (dual-energy X-ray absorptiometry): at 9.3 kcal g⁻¹ fat, this corresponds to a difference of 4 kcal or 2% of total calories consumed).”
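The closing arithmetic in the quote ("a difference of 4 kcal or 2% of total calories consumed") can be reproduced from the numbers it reports. The sketch below is one plausible reading: the quote does not say which chow intake was used as the denominator, so averaging the two groups' intakes is an assumption, and the function and variable names are hypothetical.

```python
def extra_energy_stored(fat_gain_a_g, fat_gain_b_g, kcal_per_g_fat):
    """Energy difference (kcal) implied by a difference in fat gain."""
    return (fat_gain_a_g - fat_gain_b_g) * kcal_per_g_fat


def fraction_of_intake(extra_kcal, chow_g, kcal_per_g_chow):
    """Extra stored energy as a fraction of total calories consumed."""
    return extra_kcal / (chow_g * kcal_per_g_chow)


# Values as reported in the quote (means only; the spreads are ignored here).
extra_kcal = extra_energy_stored(1.3, 0.86, 9.3)
mean_chow_g = (55.4 + 54.0) / 2  # assumption: average of the two groups
share = fraction_of_intake(extra_kcal, mean_chow_g, 3.7)

print(f"extra energy stored: {extra_kcal:.1f} kcal")
print(f"share of calories consumed: {share:.1%}")
```

This gives about 4.1 kcal and about 2.0%, matching the rounded figures in the quote.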
Bonus: Culture-Independent methods and dark matter
“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”
Some folks have used the term “dark matter” to refer to these so-far-uncultured microbes, but it’s not a great analogy...
170
Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of bacteriology, 180(18), 4765-4774.
“Tempting as it may be, perhaps we should calm down on the use of the term dark matter in biology. Biology is confusing, complicated, and mysterious enough without it.”
Bonus: About functional annotation...
“It says something of our ability to annotate genomes, that the proportion of a genome functionally annotated is often correlated to the genetic distance to the very well researched Escherichia coli (anecdotal observation).”
C. Frioux, D. Singh, T. Korcsmaros, and F. Hildebrand. From bag-of-genes to bag-of-genomes: metabolic modelling of communities in the era of metagenome-assembled genomes. Computational and Structural Biotechnology Journal, 18:1722–1734, 2020.
171
Bonus: Assembly (single genome)
172
Bonus: Assembly (metagenome)
173
...
...
Microbiomes
175
This talk
176
Specificity
This talk
177
How many people care about this
A brief history of microbiome research
178
1 of 4: Marcus Terentius Varro, 1st century B.C.E.
“Precautions must also be taken in the neighbourhood of swamps, both for the reasons given, and because there are bred certain minute creatures which cannot be seen by the eyes, which float in the air and enter the body through the mouth and nose and there cause serious diseases.” “What can I do,” asked Fundanius, “to prevent disease if I should inherit a farm of that kind?” “Even I can answer that question,” replied Agrius; “sell it for the highest cash price; or if you can’t sell it, abandon it.”
179
M. T. Varro and M. P. Cato. On Agriculture, page 209. Harvard University Press, Cambridge, MA, 1934. Translated by W. D. Hooper and H. B. Ash.
2 of 4: Hong Ge, 4th century C.E.
“During the Eastern Jin dynasty [...] Zhou Hou Bei Ji Fang, a well-known monograph of traditional Chinese medicine (TCM) written by Hong Ge, recorded a case of treating patients with food poisoning or severe diarrhea by ingesting human fecal suspension (known as yellow soup or Huang-Long decoction).”
180
H. Du, T.-t. Kuang, S. Qiu, T. Xu, C.-L. G. Huan, G. Fan, and Y. Zhang. Fecal medicines used in traditional medical system of China: a systematic review of their names, original species, traditional uses, and modern investigations. Chinese medicine, 14(1):1–16, 2019.
3 of 4: Antonie van Leeuwenhoek, 17th century C.E.
“[...] among these streaks there were besides very many little animalcules ... And the motion of most of these animalcules in the water was so swift, and so various upwards, downwards and round about that ‘twas wonderful to see: and I judged that some of these little creatures were above a thousand times smaller than the smallest ones I have ever yet seen upon the rind of cheese [...]”
181
Lane, N. (2015). The unseen world: reflections on Leeuwenhoek (1677) “Concerning little animals”. Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1666), 20140344.
Schloss, P. D. (2018). Identifying and overcoming threats to reproducibility, replicability, robustness, and generalizability in microbiome research. mBio, 9(3), e00525-18.
4 of 4: Ernest Rutherford (?), 20th century C.E.
“All science is either physics or stamp collecting.”
182
This quote is generally attributed to Ernest Rutherford, but its first written occurrence was in a book published two years after he died. So your guess is as good as mine; see https://quoteinvestigator.com/2015/05/08/stamp.
Thank you?
184