Metagenome assembly�
Marcus Fedarko
These slides are online at:
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Culture-Independent Methods
4
Steve Gschmeissner, Science Photo Library
“Scanning electron micrograph (SEM) of bacteria cultured from a sample of human faeces.”
Steve Gschmeissner, Science Photo Library
“Bacterial contamination, coloured scanning electron micrograph (SEM). Escherichia coli bacteria in a cell culture. This contamination has come from an unclean water source.”
Steve Gschmeissner, Science Photo Library
“Coloured scanning electron micrograph (SEM) of bacteria cultured from a mobile phone.”
Culture-Independent Methods
“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”
5
Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging�phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18), 4765-4774.
Culture-Independent Methods
“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”
Although not all microbes are easily culturable, all have genomes.
So: look at the genetic material (e.g. DNA) in a sample and study that!
6
Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging�phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18), 4765-4774.
Culture-Independent Methods
7
For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the�gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.
Culture-Independent Methods
8
For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the�gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.
Culture-Independent Methods
9
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Culture-Independent Methods
10
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
“Reads”
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Culture-Independent Methods
11
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.
Text made up of�{A, C, G, T}�(and N, etc.)
Culture-Independent Methods
12
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.
The Metagenome Assembly Problem
How do we stitch these reads back together into their original genomes?
Text made up of�{A, C, G, T}�(and N, etc.)
Metagenome Assembly
13
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.
The Metagenome Assembly Problem
How do we stitch these reads back together into their original genomes?
Text made up of�{A, C, G, T}�(and N, etc.)
How many genomes were there?
What if a part of a genome isn’t covered by any of the reads?
What if a read looks like it could have come from multiple genomes?
How do we know that an assembly is correct?
Why are we even doing this?
What do you mean by “original genomes”?
What’s for lunch today?
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Single-Genome Assembly
16
)
Reads
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their�application to the comparative genomics of endocellular symbiotic bacteria of insects.�Biological Procedures Online, 11(1), 52-78.
Single-Genome Assembly
17
)
Reads
Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their�application to the comparative genomics of endocellular symbiotic bacteria of insects.�Biological Procedures Online, 11(1), 52-78.
The Genome Assembly Problem
How do we stitch these reads back together into the original genome?
Single-Genome Assembly is still difficult
18
Venter et al., Science
February 16, 2001
Lander et al., Nature
February 15, 2001
Based on lecture slides (MATH 283, Fall 2021) from Glenn Tesler.
Single-Genome Assembly is still difficult!
19
Nurk, Koren, Rhie,�Rautiainen et al., Science
March 31, 2022
Single-Genome Assembly is still difficult!!
20
Rhie, Nurk, Cechova, Hoyt, Taylor et al., Nature
August 23, 2023
Nurk, Koren, Rhie,�Rautiainen et al., Science
March 31, 2022
It’s important to formulate the problem well
The Genome Assembly Problem.
21
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
The Genome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
22
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
The Genome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Input: A set of reads.
23
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
The Genome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Input: A set of reads.
Output: A string assembled from these reads.
24
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
The Genome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Input: A set of reads.
Output: A string assembled from these reads.
25
What’s wrong with this definition?
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
My Cool and New Assembler
26
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
My Cool and New Assembler
27
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
My Cool and New Assembler
28
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
My Cool and New Assembler
29
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
+
It’s important to formulate the problem well
My Cool and New Assembler
30
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
+
+
It’s important to formulate the problem well
My Cool and New Assembler
31
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
+
+
.
.
.
+
It’s important to formulate the problem well
My Cool and New Assembler
32
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
+
+
.
.
.
+
It’s important to formulate the problem well
My Cool and New Assembler
33
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
It’s important to formulate the problem well
My Cool and New Assembler
��
My assembler is good because:�+ it runs really quick�+ it makes really long sequences
34
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
It’s important to formulate the problem well
My Cool and New Assembler
��
My assembler is good because:�+ it runs really quick�+ it makes really long sequences
35
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
�so long story short please hire me
It’s important to formulate the problem well
“All too often, scientists focus only on contiguity and ignore the correctness of the reconstructed sequences [...] Even if correctly used, however, N50 values rarely correlate with the actual quality of assembly, as demonstrated by recent assembly competitions.”
36
Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.
It’s important to formulate the problem well
The Genome Assembly Problem (less broken).
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Superstring(s1, s2, …, sN): A string containing all N strings as substrings.
Input: A set of reads.
Output: The shortest possible superstring of these reads.
37
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
It’s important to formulate the problem well
My Cool and New Assembler
��
My assembler is good because:�+ it runs really quick�+ it makes really long sequences
38
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
�so long story short please hire me
It’s important to formulate the problem well
My Cool and New Assembler
��
My assembler is good because:�+ it runs really quick�+ it makes really long sequences
39
Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner
Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
�so long story short please hire me
Retracted
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Why do we care about metagenome assembly?
42
Why do we care about metagenome assembly?
43
Reason 1: Better taxonomic resolution. Strain-level differences matter!
Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017).�Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).
Detour: Obesity and the gut microbiome
44
Detour: Obesity and the gut microbiome
45
“Adult germ-free [wild-type] C57BL/ 6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”
(Context: ob/ob mice have a specific mutation “[...] that produces a stereotyped, fully penetrant obesity phenotype”; +/+ mice lack this mutation.)
Ley, R. E., Bäckhed, F., Turnbaugh, P., Lozupone, C. A., Knight, R. D., & Gordon, J. I. (2005). Obesity alters gut microbial ecology. Proceedings of the national academy of sciences, 102(31), 11070-11075.
Detour: Obesity and the gut microbiome
46
“Adult germ-free [wild-type] C57BL/ 6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”
Results: “Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota [...]”
Detour: Obesity and the gut microbiome
47
Detour: Obesity and the gut microbiome
48
Turnbaugh, P. J. (2017). Microbes and diet-induced obesity: fast, cheap, and out of control. Cell Host & Microbe, 21(3), 278-281.
Detour: Obesity and the gut microbiome?
49
Detour: Obesity and the gut microbiome?
50
“To test models on other data sets, we trained models using genus-level phylotype data for each data set.”
Why do we care about metagenome assembly?
51
Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017).�Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).
Reason 1: Better taxonomic resolution. Strain-level differences matter!
The eventual dream here is being able to run studies that�give us “perfect” information about every unique genome�observed in a sample.
Why do we really care about metagenome assembly?
52
Why do we really care about metagenome assembly?
53
For a recent paper that I’m more confident in [but haven’t had time to fully read yet], see Morton et al., 2023.�Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles. Nature Neuroscience.
Why do we care about metagenome assembly?
54
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS Computational Biology, 6(2), e1000667.
Why do we care about metagenome assembly?
55
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
For example, the creation of natural products (aka secondary metabolites).
Nice high-level overview: https://sitn.hms.harvard.edu/flash/2020/bacteria-the-drug-factory-youd-never-expect�My diagram is a gross oversimplification of microbial ecology, but you get the point
Why do we care about metagenome assembly?
56
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
Leonard Gertz, Getty Images. From https://www.aarp.org/health/drugs-supplements/info-2018/penicillin-allergy-real.html.
57
Why do we care about metagenome assembly?
58
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
“[...] there are many more natural products to discover. Efforts to mine new ecological niches and microbial taxa have uncovered a wealth of novel molecules. Bacterial genome sequences have shown that a single strain has the capacity to produce 25–30 different molecules, which has spawned new efforts to discover the ∼90% of natural products that remain ‘cryptic’.”
Why do we care about metagenome assembly?
59
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
(For reference, BGC = “Biosynthetic Gene Cluster.”)�See Meleshko et al., 2019 (biosyntheticSPAdes) for more information about BGCs and their assembly.
Why do we care about metagenome assembly?
60
Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
?
Sequencing technologies
66
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
Long reads simplify the assembly of repetitive regions.
Koren, S., Treangen, T. J., & Pop, M. (2011). Bambus 2: �scaffolding metagenomes. Bioinformatics, 27(21), 2964-2971.
Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf.
Technically this figure changes k-mer lengths, not
read lengths; however, the idea is the same.
Low error rates help us distinguish errors from real variations.
Sequencing technologies
67
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
This simple categorization ignores a lot of finicky details!
But this should be enough to understand how these technologies can complicate or simplify assembly.
Sequencing technologies
68
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
2012
2022
2019
Sequencing technologies: other things you might see
69
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
Top diagram from https://thesequencingcenter.com/knowledge-base/what-are-paired-end-reads/;�Bottom diagram from Burton et al., 2013 (Nature Biotechnology)
Sequencing technologies: other things you might see
70
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
short reads
+ long, error-prone reads
short reads
+ long, accurate reads
+ Hi-C reads
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
?
Analyzing just metagenomic reads (no assembly)
75
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
These sorts of analyses can be very useful; however, they’re not the main focus of today’s talk.
Analyzing just metagenomic reads (no assembly)
76
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS Computational Biology, 6(2), e1000667.
These sorts of analyses can be very useful; however, they’re not the main focus of today’s talk.
Assembling longer sequences is helpful beyond just taxonomic profiling!
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
✔
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
✔
?
How does assembly work?
How does assembly work?
The Genome Metagenome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Input: A set of reads.
Output: A string set of strings assembled from these reads.
How does assembly work?
The Genome Metagenome Assembly Problem.
Definitions:
Read: string of characters from the alphabet {A, C, G, T}.
Input: A set of reads.
Output: A string set of strings assembled from these reads.
This definition is still too naïve, but let’s roll with it for now
Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.
Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.
How does assembly work?
84
“A good genome assembler is like a good sausage: you would rather not know what is inside.”
Apocryphal quote attributed to Sante Gnerre:�http://rayan.chikhi.name/pdf/2019-july-19-cgsi.pdf
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Assembly: methods (de novo vs. reference-based)
de novo assembly: Use only the read data available�
Reference-based assembly: Also use available reference sequence(s)
87
“[...] reconstruction in its pure form, without consultation to previously resolved sequence including from genomes, transcripts, and proteins.”
“For some applications, sufficient information can be extracted from the mapping of reads to a reference sequence, such as a finished genome from a related individual.”
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Assembly: methods (de novo vs. reference-based)
de novo assembly: Use only the read data available� Far more commonly used when working with metagenome sequencing data.
Reference-based assembly: Also use available reference sequence(s)� Some reference-based assemblers have been developed for metagenome� sequencing data, but they have not (yet) seen widespread use in the field.
88
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.
Assembly: methods (de novo vs. reference-based)
de novo assembly: Use only the read data available� Far more commonly used when working with metagenome sequencing data.
Reference-based assembly: Also use available reference sequence(s)� Some reference-based assemblers have been developed for metagenome� sequencing data, but they have not (yet) seen widespread use in the field.
89
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
92
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Graphs
Undirected
Directed
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
94
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
95
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Feng, X., Cheng, H., Portik, D., & Li, H. (2022). Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods, 19(6), 671-674.
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
96
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
97
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
98
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Let’s just consider single-genome assembly�for now.��Our “reads” are all substrings of length 3 of�TAATGCCATGGGATGTT (17 nt).
99
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
OLC
DBG
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2
100
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2
101
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.
102
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.
103
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.
104
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.
105
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Finding Hamiltonian paths (or cycles) is NP-complete—we don’t know how to find a solution efficiently.
Assembly: methods (overlap vs. de Bruijn graphs)
106
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Finding Hamiltonian paths (or cycles) is NP-complete—we don’t know how to find a solution efficiently.
“Following the decision of the Scientific Advisory Board, the Board of Directors of CMI designated a $7 million prize fund for the solutions to these problems, with $1 million allocated to the solution of each problem.”
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph) ✔� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph
de Bruijn graph (also a directed graph; takes a parameter k)
107
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
NP-Complete!
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2
108
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2
109
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2�
110
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.
111
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.
112
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
ATG is a repeat: notice how it complicates the assembly.
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.
113
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
ATG is a repeat: notice how it complicates the assembly.
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.
114
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
ATG is a repeat: notice how it complicates the assembly.
Assembly: methods (overlap vs. de Bruijn graphs)
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.
115
ATG is a repeat: notice how it complicates the assembly.
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
OLC
DBG
k - 1 = 3
k - 1 = 2
Finding Eulerian paths (or cycles) is doable in polynomial time, actually! We can solve this problem efficiently.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2�
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose� suffix is n2�
116
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose� suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph
117
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
Assembly: methods (overlap vs. de Bruijn graphs)
Overlap graph (directed graph)� Nodes: input reads� Edges: n1 → n2 if read n1 overlaps with read n2� Goal?: Find Hamiltonian Paths (or cycles) in this graph
de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n1 → n2 if there is a k-mer whose prefix is n1 and whose� suffix is n2� Goal?: Find Eulerian Paths (or cycles) in this graph
118
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
OLC
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
DBG
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
NP-Complete :(
Linear time :)
Assembly: methods (overlap vs. de Bruijn graphs)
119
OLC
DBG
Neither of these lists are comprehensive! Many assemblers (of both the OLC or DBG varieties) are still actively being developed and in use today.�Also, most assemblers implement their own “twist” on how they use these graph structures, so cleanly categorizing assemblers in this way ignores many details.
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
Dida, F., & Yi, G. (2021). Empirical evaluation of methods for de novo genome assembly. PeerJ Computer Science, 7, e636.
Wang, A., Wang, Z., Li, Z., & Li, L. M. (2018). BAUM: improving genome assembly by adaptive unique mapping and local overlap-layout-consensus approach. Bioinformatics, 34(12), 2019-2028.
“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]
the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”
“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]
the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”
“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]
the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”
“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]
the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
124
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
125
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
(Overlap or de Bruijn)
Even graph construction can be difficult in practice
126
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.
Even graph construction can be difficult in practice
127
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.
Even graph construction can be difficult in practice
128
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.
LJA can completely assemble entire human chromosomes!
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
131
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
(Overlap or de Bruijn)
Assembly: methods (single-genome vs. metagenome)
132
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
Assembly: methods (single-genome vs. metagenome)
133
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
Assembly: methods (single-genome vs. metagenome)
134
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
Assembly: methods (single-genome vs. metagenome)
135
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs.�Nature Methods, 17(11), 1103-1110.
“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
“Metagenome-
assembled genome”�(MAG)
Assembly: methods (single-genome vs. metagenome)
136
Different assemblers will do different things to deal with these sorts of subtle variations: even “smooth” contigs can conceal a lot of variation. Sometimes this is desirable—sometimes not!
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs.�Nature Methods, 17(11), 1103-1110.
“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”
Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.
Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.
Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
“Metagenome-
assembled genome”�(MAG)
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
137
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
(Overlap or de Bruijn)
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
138
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
(Overlap or de Bruijn)
(it’s complicated)
Assembly: methods (graphs)
Most assemblers use the following rough paradigm:
139
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.
These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.
(Overlap or de Bruijn)
(it’s complicated)
Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.
Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.
✔
?
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Scaffolding
Attempt to “order and orient” contigs relative to each other, using various types of linking information.
Generates scaffolds / scaffold graphs—analogous to the contigs / assembly graphs output by an assembler.
Sometimes scaffolding is just done by the assembler, but there are stand-alone options such as the Bambus series of tools.
Pop, M., Kosack, D. S., & Salzberg, S. L. (2004).�Hierarchical scaffolding with Bambus. Genome Research, 14(1), 149-159.
Binning
Attempt to group contigs into bins representing a single genome.
One distinction: de novo vs. reference-based approaches, again.
Binning: de novo
Attempt to group contigs into bins representing a single genome.
Wu, Y. W., Tang, Y. H., Tringe, S. G., Simmons, B. A., & Singer, S. W. (2014). MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2, 1-18.
Binning: de novo
Attempt to group contigs into bins representing a single genome.
Wu, Y. W., Tang, Y. H., Tringe, S. G., Simmons, B. A., & Singer, S. W. (2014). MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2, 1-18.
“Tetranucleotides”: AAAA, AAAC, AAAG, AAAT,
AACA, AACC, AACG, AACT,� …?
Binning: reference-based
Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), 1-12.
Wickramarachchi, A., & Lin, Y. (2022). Binning long reads in metagenomics datasets using composition and coverage information. Algorithms for Molecular Biology, 17(1), 14.
Binning: reference-based
Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), 1-12.
Wickramarachchi, A., & Lin, Y. (2022). Binning long reads in metagenomics datasets using composition and coverage information. Algorithms for Molecular Biology, 17(1), 14.
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Validation
Regardless of if our sequences were output
we can estimate their quality (completeness and contamination) by looking for single-copy genes.
Validation
Regardless of if our sequences were output
we can estimate their quality (completeness and contamination) by looking for single-copy genes.
My Cool and New Assembler
��
def cool_and_new_assembler(fastq_filepath):
with open(fastq_filepath, "r") as f:
genome = ""
for li, line in enumerate(f):
if li % 4 == 1:
genome += line.strip() # copyright marcus 2024 do not steal
return genome
Extra Retracted
Validation
Regardless of if our sequences were output
we can estimate their quality (completeness and contamination) by looking for single-copy genes.
Keep in mind that these are only estimates!
Validation
Regardless of if our sequences were output
we can estimate their quality (completeness and contamination) by looking for single-copy genes.
Keep in mind that these are only estimates!
Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., ... & Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8), 725-731.
Further inspection: variant calling
Assembled sequence
Further inspection: variant calling
Assembled sequence
Figuring out where to place these reads is called alignment
(or mapping).
Further inspection: variant calling
Is this a “real” variant, or an error?
Figuring out where to place these reads is called alignment
(or mapping).
Assembled sequence
Further inspection: variant calling
Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.
Assembled sequence
Is this a “real” variant, or an error?
Further inspection: variant calling
Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.
Assembled sequence
Is this a “real” variant, or an error?
Further inspection: variant calling
Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.
Assembled sequence
Is this a “real” variant, or an error?
Further inspection: variant calling
Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.
Tens of thousands of rare mutations can lurk within a single MAG!
Further inspection: variant calling
Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.
Tens of thousands of rare mutations can lurk within a single MAG!
Further inspection?
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)
Further inspection?
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)
Dot plots
Red dots → matching k-mer starting at this position�
Blue dots → reverse-complement matching k-mer starting at this position
Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach.
(in this example, k = 3)
Further inspection?
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)
Further inspection??
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)
See https://github.com/fedarko/wotplot for the tool used to draw these dot plots.
Further inspection: an extremely high-coverage repeat
We identified a strange circular (or tandemly repeated) 6,719 nt sequence at over 100,000X coverage: an order of magnitude higher coverage than even the most abundant genomes in this dataset. (Classified as Akkermansia sp.; has 9 coding genes.)
81,917 / 22,118,393 (0.37%) reads map to it.�
Of these, 75,034 / 81,917 (91.60%) reads are “rotation reads” consisting almost entirely of the repeated copies of D20Seq.
This repeat unit doesn’t have much self-similarity.
Other assemblers (including short-read assemblers) would likely miss such repeats due to assembly fragmentation in the case of high coverage.
Were such surprising repeats missed in previous metagenomic studies?
Slide edited by Pavel Pevzner
Further inspection: an extremely high-coverage repeat
Xie, F., Jin, W., Si, H., Yuan, Y., Tao, Y., Liu, J., ... & Mao, S. (2021). An integrated gene catalog and over 10,000 metagenome-assembled genomes from the gastrointestinal microbiome of ruminants. Microbiome, 9(1), 137.
An Akkermansia sp. MAG from another sheep gut microbiome contains just one copy of D20Seq!
Occurrences of D20Seq in bin 174 from Bickhart & Kolmogorov et al., 2022
An earlier assembly of this dataset only included D20Seq in one bin (in two contigs).
???
We downloaded 159 complete Akkermansia genomes from NCBI; none of them contained* D20Seq!
All of D20Seq’s genes’ best BLASTP hits, actually, come from the Xie et al., 2021 study.
The best non-Xie-2021 hits (all relatively weak, and not necessarily indications of “containment”*) were to protein sequences predicted from the gut microbiota of Camelus bactrianus; a San Clemente Island Goat; a horse; and a Plateau pika.
* Defining “containment” informally, based on minimap2 finding a match between D20Seq and the genome.
Further inspection: why does this repeat exist?
Mamun, M. A. A., Sandeman, M., Rayment, P., Brook-Carter, P., Scholes, E., Kasinadhuni, N., ... & Greenhill, A. R. (2020). Variation in gut bacterial composition is associated with Haemonchus contortus parasite infection of sheep. Animal Microbiome, 2, 1–14.
Hu, X., Zhao, Y., Yang, Y., Gong, W., Sun, X., Yang, L., ... & Jin, M. (2021). Akkermansia muciniphila improves host defense against influenza virus infection. Frontiers in Microbiology, 11, 586476.
Further inspection: why does this repeat exist?
Mamun, M. A. A., Sandeman, M., Rayment, P., Brook-Carter, P., Scholes, E., Kasinadhuni, N., ... & Greenhill, A. R. (2020). Variation in gut bacterial composition is associated with Haemonchus contortus parasite infection of sheep. Animal Microbiome, 2, 1–14.
Hu, X., Zhao, Y., Yang, Y., Gong, W., Sun, X., Yang, L., ... & Jin, M. (2021). Akkermansia muciniphila improves host defense against influenza virus infection. Frontiers in Microbiology, 11, 586476.
A lot of interesting details lurk within metagenomes and their assemblies. Within reason :), try to notice these details!
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
✔
✔
✔
Today’s outline
These slides are online at:
Today’s outline
These slides are online at:
Open problems
Today’s outline
These slides are online at:
Thanks :)
Pop Lab (UMD), Knight Lab (UCSD), Pevzner Lab (UCSD)
Bonus slides
C. I. Methods: What’s in a marker gene?
181
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
C. I. Methods: What’s in a marker gene?
182
Entropy in the 16S rRNA gene
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
C. I. Methods: What’s in a marker gene?
183
Entropy in the 16S rRNA gene
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
More
mutations
C. I. Methods: What’s in a marker gene?
184
Entropy in the 16S rRNA gene
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
More
mutations
Hypervariable regions
C. I. Methods: What’s in a marker gene?
185
Entropy in the 16S rRNA gene
More
mutations
Hypervariable regions
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
Conserved regions
C. I. Methods: What’s in a marker gene?
186
Entropy in the 16S rRNA gene
More
mutations
Hypervariable regions
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
Conserved regions
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
C. I. Methods: What’s in a marker gene?
187
Entropy in the 16S rRNA gene
More
mutations
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
188
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
189
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
190
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
191
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
192
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
...
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
C. I. Methods: What’s in a marker gene?
193
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
...
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.
C. I. Methods: What’s in a marker gene?
194
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
...
Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.
C. I. Methods: What’s in a marker gene?
195
S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae
k__Bacteria; p__Firmicutes; c__Bacilli
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
With polymerase chain reaction (PCR), we can amplify specific regions of the genome using
primers that target conserved regions.
Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
...
Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.
Example taxonomic annotations from the QIIME 2�“Moving Pictures” tutorial: https://docs.qiime2.org
C. I. Methods: Marker gene sequencing
196
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
C. I. Methods: Functional annotation
197
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
C. I. Methods: Functional annotation
198
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
(not just marker!) genes
operons
terminators
promoters
For more information on metagenomic functional annotation, see: Salamov, V. S. A., & Solovyevand, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Li, RW, Ed, 61-78.
...
C. I. Methods: Functional annotation, in practice
199
Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.
(not just marker!) genes
operons
terminators
promoters
For more information on metagenomic functional annotation, see: Salamov, V. S. A., & Solovyevand, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Li, RW, Ed, 61-78.
...
Usually, F.A. involves aligning (“mapping”) sequences to a reference database with information about “function” in well- studied organisms.
(But there are fancier approaches, e.g. metabolic modelling methods.)
C. I. Methods: Functional annotation, in practice?
200
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
C. I. Methods: Functional annotation, in practice??
201
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]
C. I. Methods: Functional annotation, in practice???
202
A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
C. I. Methods: Functional annotation, in practice???
203
“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”
A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]
C. I. Methods: Functional annotation, in practice
204
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS computational biology, 6(2), e1000667.
Marker gene assembly?
205
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
Q: Would it make sense for us to assemble marker gene sequencing reads?
A more detailed view of short-read sequencing tech
Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.
Marker gene assembly?
207
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
Q: Would it make sense for us to assemble marker gene sequencing reads?
A: We only have a small region from each genome—what would we even assemble? :(
Marker gene assembly?
208
Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.
(Marker gene sequencing)
(Metagenome sequencing)
(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)
Q: Would it make sense for us to assemble marker gene sequencing reads?
A: We only have a small region from each genome—what would we even assemble? :(
A #2: Depending on our data, we could use assembly to construct longer marker gene sequences—these can be helpful! See Kozich et al., 2013 :)
Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K., & Schloss, P. D. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), 5112-5120.