1 of 208

Metagenome assembly�

Marcus Fedarko

https://bit.ly/41GBxEt

These slides are online at:

2 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems

https://bit.ly/41GBxEt

These slides are online at:

3 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems

https://bit.ly/41GBxEt

These slides are online at:

4 of 208

Culture-Independent Methods

4

Steve Gschmeissner, Science Photo Library

“Scanning electron micrograph (SEM) of bacteria cultured from a sample of human faeces.”

Steve Gschmeissner, Science Photo Library

“Bacterial contamination, coloured scanning electron micrograph (SEM). Escherichia coli bacteria in a cell culture. This contamination has come from an unclean water source.”

Steve Gschmeissner, Science Photo Library

“Coloured scanning electron micrograph (SEM) of bacteria cultured from a mobile phone.”

5 of 208

Culture-Independent Methods

“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”

5

Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging�phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18), 4765-4774.

6 of 208

Culture-Independent Methods

“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”

Although not all microbes are easily culturable, all have genomes.

So: look at the genetic material (e.g. DNA) in a sample and study that!

6

Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging�phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18), 4765-4774.

7 of 208

Culture-Independent Methods

Terminal restriction fragment length polymorphism� a.k.a. T-RFLP�
Denaturing / temperature gradient gel electrophoresis� a.k.a. DGGE / TGGE�
Fluorescence in situ hybridization� a.k.a. FISH�
Marker gene sequencing� a.k.a. amplicon sequencing, metabarcoding, metataxonomics, ...; e.g. 16S rRNA sequencing, ...�
Metagenome sequencing� a.k.a. metagenomics, shotgun metagenome sequencing, whole metagenome sequencing, ...

7

For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the�gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.

8 of 208

Culture-Independent Methods

Terminal restriction fragment length polymorphism� a.k.a. T-RFLP�
Denaturing / temperature gradient gel electrophoresis� a.k.a. DGGE / TGGE�
Fluorescence in situ hybridization� a.k.a. FISH�
Marker gene sequencing� a.k.a. amplicon sequencing, metabarcoding, metataxonomics, ...; e.g. 16S rRNA sequencing, ...�
Metagenome sequencing� a.k.a. metagenomics, shotgun metagenome sequencing, whole metagenome sequencing, ...

8

For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the�gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.

9 of 208

Culture-Independent Methods

9

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

10 of 208

Culture-Independent Methods

10

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

“Reads”

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

11 of 208

Culture-Independent Methods

11

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.

Text made up of�{A, C, G, T}�(and N, etc.)

12 of 208

Culture-Independent Methods

12

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.

The Metagenome Assembly Problem

How do we stitch these reads back together into their original genomes?

Text made up of�{A, C, G, T}�(and N, etc.)

13 of 208

Metagenome Assembly

13

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Knight, R. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), 1-8.

The Metagenome Assembly Problem

How do we stitch these reads back together into their original genomes?

Text made up of�{A, C, G, T}�(and N, etc.)

How many genomes were there?

What if a part of a genome isn’t covered by any of the reads?

What if a read looks like it could have come from multiple genomes?

How do we know that an assembly is correct?

Why are we even doing this?

What do you mean by “original genomes”?

What’s for lunch today?

14 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems

https://bit.ly/41GBxEt

These slides are online at:

15 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems

https://bit.ly/41GBxEt

These slides are online at:

16 of 208

Single-Genome Assembly

16

)

Reads

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their�application to the comparative genomics of endocellular symbiotic bacteria of insects.�Biological Procedures Online, 11(1), 52-78.

17 of 208

Single-Genome Assembly

17

)

Reads

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their�application to the comparative genomics of endocellular symbiotic bacteria of insects.�Biological Procedures Online, 11(1), 52-78.

The Genome Assembly Problem

How do we stitch these reads back together into the original genome?

18 of 208

Single-Genome Assembly is still difficult

18

Venter et al., Science

February 16, 2001

Lander et al., Nature

February 15, 2001

Based on lecture slides (MATH 283, Fall 2021) from Glenn Tesler.

19 of 208

Single-Genome Assembly is still difficult!

19

Nurk, Koren, Rhie,�Rautiainen et al., Science

March 31, 2022

20 of 208

Single-Genome Assembly is still difficult!!

20

Rhie, Nurk, Cechova, Hoyt, Taylor et al., Nature

August 23, 2023

Nurk, Koren, Rhie,�Rautiainen et al., Science

March 31, 2022

21 of 208

It’s important to formulate the problem well

The Genome Assembly Problem.

21

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

22 of 208

It’s important to formulate the problem well

The Genome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

22

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

23 of 208

It’s important to formulate the problem well

The Genome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Input: A set of reads.

23

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

24 of 208

It’s important to formulate the problem well

The Genome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Input: A set of reads.

Output: A string assembled from these reads.

24

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

25 of 208

It’s important to formulate the problem well

The Genome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Input: A set of reads.

Output: A string assembled from these reads.

25

What’s wrong with this definition?

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

26 of 208

It’s important to formulate the problem well

My Cool and New Assembler

26

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

27 of 208

It’s important to formulate the problem well

My Cool and New Assembler

27

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

28 of 208

It’s important to formulate the problem well

My Cool and New Assembler

28

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

29 of 208

It’s important to formulate the problem well

My Cool and New Assembler

29

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

+

30 of 208

It’s important to formulate the problem well

My Cool and New Assembler

30

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

+

31 of 208

It’s important to formulate the problem well

My Cool and New Assembler

31

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

+

.

+

32 of 208

It’s important to formulate the problem well

My Cool and New Assembler

32

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

+

.

+

33 of 208

It’s important to formulate the problem well

My Cool and New Assembler

33

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

genome += line.strip() # copyright marcus 2024 do not steal

return genome

34 of 208

It’s important to formulate the problem well

My Cool and New Assembler

��

My assembler is good because:�+ it runs really quick�+ it makes really long sequences

34

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

genome += line.strip() # copyright marcus 2024 do not steal

return genome

35 of 208

It’s important to formulate the problem well

My Cool and New Assembler

��

My assembler is good because:�+ it runs really quick�+ it makes really long sequences

35

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

genome += line.strip() # copyright marcus 2024 do not steal

return genome

�so long story short please hire me

36 of 208

It’s important to formulate the problem well

“All too often, scientists focus only on contiguity and ignore the correctness of the reconstructed sequences [...] Even if correctly used, however, N50 values rarely correlate with the actual quality of assembly, as demonstrated by recent assembly competitions.”

36

Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.

37 of 208

It’s important to formulate the problem well

The Genome Assembly Problem (less broken).

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Superstring(s₁, s₂, …, s_N): A string containing all N strings as substrings.

Input: A set of reads.

Output: The shortest possible superstring of these reads.

37

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

38 of 208

It’s important to formulate the problem well

My Cool and New Assembler

��

My assembler is good because:�+ it runs really quick�+ it makes really long sequences

38

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

genome += line.strip() # copyright marcus 2024 do not steal

return genome

�so long story short please hire me

39 of 208

It’s important to formulate the problem well

My Cool and New Assembler

��

My assembler is good because:�+ it runs really quick�+ it makes really long sequences

39

Based in part on Niema Moshiri and Behar Behsaz’s video for the Compeau & Pevzner

Bioinformatics Algorithms textbook: https://www.youtube.com/watch?v=eJxP06h-QxE

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

genome += line.strip() # copyright marcus 2024 do not steal

return genome

�so long story short please hire me

Retracted

40 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

41 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

42 of 208

Why do we care about metagenome assembly?

42

43 of 208

Why do we care about metagenome assembly?

43

Reason 1: Better taxonomic resolution. Strain-level differences matter!

Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017).�Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).

44 of 208

Detour: Obesity and the gut microbiome

44

45 of 208

Detour: Obesity and the gut microbiome

45

“Adult germ-free [wild-type] C57BL/ 6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”

(Context: ob/ob mice have a specific mutation “[...] that produces a stereotyped, fully penetrant obesity phenotype”; +/+ mice lack this mutation.)

Ley, R. E., Bäckhed, F., Turnbaugh, P., Lozupone, C. A., Knight, R. D., & Gordon, J. I. (2005). Obesity alters gut microbial ecology. Proceedings of the national academy of sciences, 102(31), 11070-11075.

46 of 208

Detour: Obesity and the gut microbiome

46

“Adult germ-free [wild-type] C57BL/ 6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”

Results: “Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota [...]”

47 of 208

Detour: Obesity and the gut microbiome

47

48 of 208

Detour: Obesity and the gut microbiome

48

Turnbaugh, P. J. (2017). Microbes and diet-induced obesity: fast, cheap, and out of control. Cell Host & Microbe, 21(3), 278-281.

49 of 208

Detour: Obesity and the gut microbiome?

49

2016, 10 studies.

“The reanalysis study of Walters et al. suggested that it was possible

to classify individuals as being nonobese or obese on the basis of

the composition of their microbiota. We repeated this analysis

with additional data sets using OTU and genus-level phylotype

data. For each study, we developed a random forest machine learning model to classify individuals. Using 10-fold cross validation, the cross-validated area under the curve (AUC) values for the OTU-based models varied between 0.52 and 0.69, indicating a relatively poor ability to classify individuals (Fig. 4A). To test models on other data sets, we trained models using genus-level phylotype data for each data set. The cross-validated AUC values for the models applied to the training data sets varied between 0.51 and 0.65, again indicating a relatively poor ability to classify individuals from the original data set (Fig. 4B). For each model, we identified the probability where the sum of the sensitivity and specificity was the highest. We then used this probability to define a threshold for calculating the accuracy of the models when applied to the other nine data sets (Fig. 5). Although there was considerable variation in the accuracy values for each model, the median accuracy of each model varied between 0.33 (Turnbaugh et al.) and 0.65 (Human Microbiome Project) (median, 0.57).”

50 of 208

Detour: Obesity and the gut microbiome?

50

“To test models on other data sets, we trained models using genus-level phylotype data for each data set.”

2016, 10 studies.

“The reanalysis study of Walters et al. suggested that it was possible

to classify individuals as being nonobese or obese on the basis of

the composition of their microbiota. We repeated this analysis

with additional data sets using OTU and genus-level phylotype

data. For each study, we developed a random forest machine learning model to classify individuals. Using 10-fold cross validation, the cross-validated area under the curve (AUC) values for the OTU-based models varied between 0.52 and 0.69, indicating a relatively poor ability to classify individuals (Fig. 4A). To test models on other data sets, we trained models using genus-level phylotype data for each data set. The cross-validated AUC values for the models applied to the training data sets varied between 0.51 and 0.65, again indicating a relatively poor ability to classify individuals from the original data set (Fig. 4B). For each model, we identified the probability where the sum of the sensitivity and specificity was the highest. We then used this probability to define a threshold for calculating the accuracy of the models when applied to the other nine data sets (Fig. 5). Although there was considerable variation in the accuracy values for each model, the median accuracy of each model varied between 0.33 (Turnbaugh et al.) and 0.65 (Human Microbiome Project) (median, 0.57).”

51 of 208

Why do we care about metagenome assembly?

51

Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017).�Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).

Reason 1: Better taxonomic resolution. Strain-level differences matter!

The eventual dream here is being able to run studies that�give us “perfect” information about every unique genome�observed in a sample.

52 of 208

Why do we really care about metagenome assembly?

52

53 of 208

Why do we really care about metagenome assembly?

53

For a recent paper that I’m more confident in [but haven’t had time to fully read yet], see Morton et al., 2023.�Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles. Nature Neuroscience.

54 of 208

Why do we care about metagenome assembly?

54

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS Computational Biology, 6(2), e1000667.

55 of 208

Why do we care about metagenome assembly?

55

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

For example, the creation of natural products (aka secondary metabolites).

Nice high-level overview: https://sitn.hms.harvard.edu/flash/2020/bacteria-the-drug-factory-youd-never-expect�My diagram is a gross oversimplification of microbial ecology, but you get the point

56 of 208

Why do we care about metagenome assembly?

56

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

Leonard Gertz, Getty Images. From https://www.aarp.org/health/drugs-supplements/info-2018/penicillin-allergy-real.html.

57 of 208

57

Diagram from https://sitn.hms.harvard.edu/flash/2020/bacteria-the-drug-factory-youd-never-expect

58 of 208

Why do we care about metagenome assembly?

58

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

“[...] there are many more natural products to discover. Efforts to mine new ecological niches and microbial taxa have uncovered a wealth of novel molecules. Bacterial genome sequences have shown that a single strain has the capacity to produce 25–30 different molecules, which has spawned new efforts to discover the ∼90% of natural products that remain ‘cryptic’.”

59 of 208

Why do we care about metagenome assembly?

59

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

(For reference, BGC = “Biosynthetic Gene Cluster.”)�See Meleshko et al., 2019 (biosyntheticSPAdes) for more information about BGCs and their assembly.

60 of 208

Why do we care about metagenome assembly?

60

Reason 1: Better taxonomic resolution. Strain-level differences matter!�Reason 2: Longer sequences → more information about putative functions.

61 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

62 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

63 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

64 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

65 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

?

66 of 208

Sequencing technologies

66

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

Long reads simplify the assembly of repetitive regions.

Koren, S., Treangen, T. J., & Pop, M. (2011). Bambus 2: �scaffolding metagenomes. Bioinformatics, 27(21), 2964-2971.

Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf.

Technically this figure changes k-mer lengths, not

read lengths; however, the idea is the same.

Low error rates help us distinguish errors from real variations.

67 of 208

Sequencing technologies

67

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

Short-read technologies

e.g. Illumina
Read lengths: up to 150 nt
Error rates: < 1%

Long, error-prone read technologies

e.g. Pacific Biosciences, Oxford Nanopore
Read length: > 10,000 nt
Error rates: 10–25%

Long, accurate read technologies

e.g. Pacific Biosciences CCS (“HiFi”), Oxford Nanopore R10.4
Read length: over 10,000 nt, but generally shorter than long, error-prone reads
Error rates (HiFi):

< 1% for “point” mutations
somewhat higher (~5%) for insertions/deletions

This simple categorization ignores a lot of finicky details!

But this should be enough to understand how these technologies can complicate or simplify assembly.

68 of 208

Sequencing technologies

68

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

Short-read technologies

e.g. Illumina
Read lengths: up to 150 nt
Error rates: < 1%

Long, error-prone read technologies

e.g. Pacific Biosciences, Oxford Nanopore
Read length: > 10,000 nt
Error rates: 10–25%

Long, accurate read technologies

e.g. Pacific Biosciences CCS (“HiFi”), Oxford Nanopore R10.4
Read length: over 10,000 nt, but generally shorter than long, error-prone reads
Error rates (HiFi):

< 1% for “point” mutations
somewhat higher (~5%) for insertions/deletions

2012

2022

2019

69 of 208

Sequencing technologies: other things you might see

69

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

Top diagram from https://thesequencingcenter.com/knowledge-base/what-are-paired-end-reads/;�Bottom diagram from Burton et al., 2013 (Nature Biotechnology)

^✝Discontinued in 2020

Technologies that try to lengthen reads:

Paired-end reads (e.g. Illumina)��
Hi-C reads (e.g. Phase Genomics)�[Arguably a subset of “paired-end reads”]��
Linked reads^✝ (e.g. 10X Genomics)�

70 of 208

Sequencing technologies: other things you might see

70

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

Combining multiple types of reads

short reads

+ long, error-prone reads

short reads

+ long, accurate reads

+ Hi-C reads

71 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

72 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

73 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

74 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

?

75 of 208

Analyzing just metagenomic reads (no assembly)

75

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

These sorts of analyses can be very useful; however, they’re not the main focus of today’s talk.

76 of 208

Analyzing just metagenomic reads (no assembly)

76

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS Computational Biology, 6(2), e1000667.

These sorts of analyses can be very useful; however, they’re not the main focus of today’s talk.

Assembling longer sequences is helpful beyond just taxonomic profiling!

77 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

78 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

?

79 of 208

How does assembly work?

80 of 208

How does assembly work?

The Genome Metagenome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Input: A set of reads.

Output: A string set of strings assembled from these reads.

81 of 208

How does assembly work?

The Genome Metagenome Assembly Problem.

Definitions:

Read: string of characters from the alphabet {A, C, G, T}.

Input: A set of reads.

Output: A string set of strings assembled from these reads.

This definition is still too naïve, but let’s roll with it for now

82 of 208

Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.

83 of 208

Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.

84 of 208

How does assembly work?

84

“A good genome assembler is like a good sausage: you would rather not know what is inside.”

Apocryphal quote attributed to Sante Gnerre:�http://rayan.chikhi.name/pdf/2019-july-19-cgsi.pdf

85 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

86 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

87 of 208

Assembly: methods (de novo vs. reference-based)

de novo assembly: Use only the read data available�

Reference-based assembly: Also use available reference sequence(s)

87

“[...] reconstruction in its pure form, without consultation to previously resolved sequence including from genomes, transcripts, and proteins.”

“For some applications, sufficient information can be extracted from the mapping of reads to a reference sequence, such as a finished genome from a related individual.”

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

88 of 208

Assembly: methods (de novo vs. reference-based)

de novo assembly: Use only the read data available� Far more commonly used when working with metagenome sequencing data.

Reference-based assembly: Also use available reference sequence(s)� Some reference-based assemblers have been developed for metagenome� sequencing data, but they have not (yet) seen widespread use in the field.

88

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.

89 of 208

Assembly: methods (de novo vs. reference-based)

de novo assembly: Use only the read data available� Far more commonly used when working with metagenome sequencing data.

Reference-based assembly: Also use available reference sequence(s)� Some reference-based assemblers have been developed for metagenome� sequencing data, but they have not (yet) seen widespread use in the field.

89

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.

90 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

91 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

92 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads
Simplify the graph
Output the graph and its sequences

92

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

93 of 208

Graphs

https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)

Undirected

Directed

94 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads
Simplify the graph
Output the graph and its sequences

94

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

95 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads
Simplify the graph
Output the graph and its sequences

95

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

Feng, X., Cheng, H., Portik, D., & Li, H. (2022). Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods, 19(6), 671-674.

96 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads
Simplify the graph
Output the graph and its sequences

96

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

97 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph?
Simplify the graph
Output the graph and its sequences

97

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

98 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph?
Simplify the graph
Output the graph and its sequences

98

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

99 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Let’s just consider single-genome assembly�for now.��Our “reads” are all substrings of length 3 of�TAATGCCATGGGATGTT (17 nt).

99

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

OLC

DBG

100 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂

100

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

101 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂

101

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

102 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.

102

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

103 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.

103

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

104 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.

104

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

105 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph—visiting each� node once.

105

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

Finding Hamiltonian paths (or cycles) is NP-complete—we don’t know how to find a solution efficiently.

106 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

106

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

Finding Hamiltonian paths (or cycles) is NP-complete—we don’t know how to find a solution efficiently.

“Following the decision of the Scientific Advisory Board, the Board of Directors of CMI designated a $7 million prize fund for the solutions to these problems, with $1 million allocated to the solution of each problem.”

107 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph) ✔� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph

de Bruijn graph (also a directed graph; takes a parameter k)

107

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

NP-Complete!

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

108 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n₂

108

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

109 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n₂

109

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

110 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2�

110

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

111 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.

111

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

112 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.

112

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

ATG is a repeat: notice how it complicates the assembly.

113 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.

113

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

ATG is a repeat: notice how it complicates the assembly.

114 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.

114

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

ATG is a repeat: notice how it complicates the assembly.

115 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose suffix is n_2� Goal?: Find Eulerian Paths (or cycles) in this graph—visiting each� edge once.

115

ATG is a repeat: notice how it complicates the assembly.

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

OLC

DBG

k - 1 = 3

k - 1 = 2

Finding Eulerian paths (or cycles) is doable in polynomial time, actually! We can solve this problem efficiently.

116 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂�

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose� suffix is n₂�

116

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

117 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose� suffix is n₂� Goal?: Find Eulerian Paths (or cycles) in this graph

117

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

118 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph)� Nodes: input reads� Edges: n₁→n₂ if read n₁ overlaps with read n₂� Goal?: Find Hamiltonian Paths (or cycles) in this graph

de Bruijn graph (also a directed graph; takes a parameter k)� Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads� Edges: n₁→n₂ if there is a k-mer whose prefix is n₁ and whose� suffix is n₂� Goal?: Find Eulerian Paths (or cycles) in this graph

118

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

OLC

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

DBG

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.

NP-Complete :(

Linear time :)

119 of 208

Assembly: methods (overlap vs. de Bruijn graphs)

119

OLC

DBG

Newbler
Celera
ARACHNE
Edena
SHORTY
HINGE
BAUM
Canu
hifiasm
...

EULER
AllPaths
Velvet
ABySS
E + V-SC
SPAdes
ABruijn
Flye
LJA
...

Neither of these lists are comprehensive! Many assemblers (of both the OLC or DBG varieties) are still actively being developed and in use today.�Also, most assemblers implement their own “twist” on how they use these graph structures, so cleanly categorizing assemblers in this way ignores many details.

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

Dida, F., & Yi, G. (2021). Empirical evaluation of methods for de novo genome assembly. PeerJ Computer Science, 7, e636.

Wang, A., Wang, Z., Li, Z., & Li, L. M. (2018). BAUM: improving genome assembly by adaptive unique mapping and local overlap-layout-consensus approach. Bioinformatics, 34(12), 2019-2028.

120 of 208

“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]

the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”

121 of 208

“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]

the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”

122 of 208

“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]

the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”

123 of 208

“[...] an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle [...]

the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.”

124 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph?
Simplify the graph
Output the graph and its sequences

124

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

125 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph? ✔
Simplify the graph
Output the graph and its sequences

125

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

(Overlap or de Bruijn)

126 of 208

Even graph construction can be difficult in practice

126

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.

127 of 208

Even graph construction can be difficult in practice

127

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.

128 of 208

Even graph construction can be difficult in practice

128

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.

LJA can completely assemble entire human chromosomes!

129 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

130 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

131 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph? ✔
Simplify the graph Why? How?
Output the graph and its sequences

131

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

(Overlap or de Bruijn)

132 of 208

Assembly: methods (single-genome vs. metagenome)

Problem: Uneven coverage

Different microbes are present at different abundances in a microbiome.
The coverage, or number of reads “supporting” a position in a genome, varies a lot!
Worst case scenario: some genome(s) are only partially covered in the reads.�

132

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.

133 of 208

Assembly: methods (single-genome vs. metagenome)

Problem: Uneven coverage

Different microbes are present at different abundances in a microbiome.
The coverage, or number of reads “supporting” a position in a genome, varies a lot!
Worst case scenario: some genome(s) are only partially covered in the reads.�

Problem: Intergenomic repeats

Stretches of DNA shared by many organisms are repeats!
For example, marker genes (e.g. rRNA genes).
Some approaches attempt to get around this by�exploiting uneven coverages across genomes.

133

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.

134 of 208

Assembly: methods (single-genome vs. metagenome)

Problem: Strain mixtures

Similar to intergenomic repeats: a microbiome can contain many strains of a microbe.
How can we distinguish between “real” variations and sequencing errors?
Separating these strains’ genomes is very challenging, especially with short reads.

134

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.

135 of 208

Assembly: methods (single-genome vs. metagenome)

Problem: Strain mixtures

Similar to intergenomic repeats: a microbiome can contain many strains of a microbe.
How can we distinguish between “real” variations and sequencing errors?
Separating these strains’ genomes is very challenging, especially with short reads.

135

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs.�Nature Methods, 17(11), 1103-1110.

“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.

“Metagenome-

assembled genome”�(MAG)

136 of 208

Assembly: methods (single-genome vs. metagenome)

Problem: Strain mixtures

Similar to intergenomic repeats: a microbiome can contain many strains of a microbe.
How can we distinguish between “real” variations and sequencing errors?
Separating these strains’ genomes is very challenging, especially with short reads.

136

Different assemblers will do different things to deal with these sorts of subtle variations: even “smooth” contigs can conceal a lot of variation. Sometimes this is desirable—sometimes not!

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs.�Nature Methods, 17(11), 1103-1110.

“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.

“Metagenome-

assembled genome”�(MAG)

137 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph? ✔
Simplify the graph Why? How?
Output the graph and its sequences

137

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

(Overlap or de Bruijn)

138 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph? ✔
Simplify the graph Why? How? ✔
Output the graph and its sequences

138

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

(Overlap or de Bruijn)

(it’s complicated)

139 of 208

Assembly: methods (graphs)

Most assemblers use the following rough paradigm:

Construct a graph from the reads What type of graph? ✔
Simplify the graph Why? How? ✔
Output the graph and its sequences

139

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms�for next-generation sequencing data. Genomics, 95(6), 315-327.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations,�error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

(Overlap or de Bruijn)

(it’s complicated)

140 of 208

Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.

141 of 208

Lee, M. D. (2019). Journal of Open�Source Education, 4(41), 53.

✔

?

142 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

143 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

144 of 208

Scaffolding

Attempt to “order and orient” contigs relative to each other, using various types of linking information.

Generates scaffolds / scaffold graphs—analogous to the contigs / assembly graphs output by an assembler.

Sometimes scaffolding is just done by the assembler, but there are stand-alone options such as the Bambus series of tools.

Pop, M., Kosack, D. S., & Salzberg, S. L. (2004).�Hierarchical scaffolding with Bambus. Genome Research, 14(1), 149-159.

145 of 208

Binning

Attempt to group contigs into bins representing a single genome.

One distinction: de novo vs. reference-based approaches, again.

146 of 208

Binning: de novo

Attempt to group contigs into bins representing a single genome.

Wu, Y. W., Tang, Y. H., Tringe, S. G., Simmons, B. A., & Singer, S. W. (2014). MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2, 1-18.

147 of 208

Binning: de novo

Attempt to group contigs into bins representing a single genome.

Wu, Y. W., Tang, Y. H., Tringe, S. G., Simmons, B. A., & Singer, S. W. (2014). MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2, 1-18.

“Tetranucleotides”: AAAA, AAAC, AAAG, AAAT,

AACA, AACC, AACG, AACT,� …?

148 of 208

Binning: reference-based

Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), 1-12.

Wickramarachchi, A., & Lin, Y. (2022). Binning long reads in metagenomics datasets using composition and coverage information. Algorithms for Molecular Biology, 17(1), 14.

149 of 208

Binning: reference-based

Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), 1-12.

Wickramarachchi, A., & Lin, Y. (2022). Binning long reads in metagenomics datasets using composition and coverage information. Algorithms for Molecular Biology, 17(1), 14.

150 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

151 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

152 of 208

Validation

Regardless of if our sequences were output

directly from assembly, or
after scaffolding / binning,

we can estimate their quality (completeness and contamination) by looking for single-copy genes.

153 of 208

Validation

Regardless of if our sequences were output

directly from assembly, or
after scaffolding / binning,

we can estimate their quality (completeness and contamination) by looking for single-copy genes.

My Cool and New Assembler

��

def cool_and_new_assembler(fastq_filepath):

with open(fastq_filepath, "r") as f:

genome = ""

for li, line in enumerate(f):

if li % 4 == 1:

return genome

Extra Retracted

154 of 208

Validation

Regardless of if our sequences were output

directly from assembly, or
after scaffolding / binning,

we can estimate their quality (completeness and contamination) by looking for single-copy genes.

Keep in mind that these are only estimates!

155 of 208

Validation

Regardless of if our sequences were output

directly from assembly, or
after scaffolding / binning,

we can estimate their quality (completeness and contamination) by looking for single-copy genes.

Keep in mind that these are only estimates!

Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., ... & Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8), 725-731.

156 of 208

Further inspection: variant calling

Assembled sequence

157 of 208

Further inspection: variant calling

Assembled sequence

Figuring out where to place these reads is called alignment

(or mapping).

158 of 208

Further inspection: variant calling

Is this a “real” variant, or an error?

Figuring out where to place these reads is called alignment

(or mapping).

Assembled sequence

159 of 208

Further inspection: variant calling

Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.

Assembled sequence

Is this a “real” variant, or an error?

160 of 208

Further inspection: variant calling

Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.

Assembled sequence

Is this a “real” variant, or an error?

161 of 208

Further inspection: variant calling

Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.

Assembled sequence

Is this a “real” variant, or an error?

162 of 208

Further inspection: variant calling

Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.

Tens of thousands of rare mutations can lurk within a single MAG!

163 of 208

Further inspection: variant calling

Fedarko, M. W. (hello), Kolmogorov, M., & Pevzner, P. A. (2022). Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Research, 32(11-12), 2119-2133.

Tens of thousands of rare mutations can lurk within a single MAG!

164 of 208

Further inspection?

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)

165 of 208

Further inspection?

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)

166 of 208

Dot plots

Red dots → matching k-mer starting at this position�

Blue dots → reverse-complement matching k-mer starting at this position

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach.

(in this example, k = 3)

167 of 208

Further inspection?

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)

168 of 208

Further inspection??

Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D., & Pevzner, P. A. (2022). Multiplex de Bruijn�graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40(7), 1075-1081.�(This is work-in-progress research on using LJA to assemble metagenomes.)

See https://github.com/fedarko/wotplot for the tool used to draw these dot plots.

169 of 208

Further inspection: an extremely high-coverage repeat

We identified a strange circular (or tandemly repeated) 6,719 nt sequence at over 100,000X coverage: an order of magnitude higher coverage than even the most abundant genomes in this dataset. (Classified as Akkermansia sp.; has 9 coding genes.)

81,917 / 22,118,393 (0.37%) reads map to it.�

Of these, 75,034 / 81,917 (91.60%) reads are “rotation reads” consisting almost entirely of the repeated copies of D20Seq.

This repeat unit doesn’t have much self-similarity.

Other assemblers (including short-read assemblers) would likely miss such repeats due to assembly fragmentation in the case of high coverage.

Were such surprising repeats missed in previous metagenomic studies?

Slide edited by Pavel Pevzner

170 of 208

171 of 208

172 of 208

Further inspection: an extremely high-coverage repeat

Xie, F., Jin, W., Si, H., Yuan, Y., Tao, Y., Liu, J., ... & Mao, S. (2021). An integrated gene catalog and over 10,000 metagenome-assembled genomes from the gastrointestinal microbiome of ruminants. Microbiome, 9(1), 137.

An Akkermansia sp. MAG from another sheep gut microbiome contains just one copy of D20Seq!

Occurrences of D20Seq in bin 174 from Bickhart & Kolmogorov et al., 2022

An earlier assembly of this dataset only included D20Seq in one bin (in two contigs).

???

We downloaded 159 complete Akkermansia genomes from NCBI; none of them contained* D20Seq!

All of D20Seq’s genes’ best BLASTP hits, actually, come from the Xie et al., 2021 study.

The best non-Xie-2021 hits (all relatively weak, and not necessarily indications of “containment”*) were to protein sequences predicted from the gut microbiota of Camelus bactrianus; a San Clemente Island Goat; a horse; and a Plateau pika.

* Defining “containment” informally, based on minimap2 finding a match between D20Seq and the genome.

173 of 208

Further inspection: why does this repeat exist?

When did D20Seq emerge in Akkermansia?

And when did D20Seq actually … become repeated?�

Could D20Seq be an extrachromosomal mobile element?

Or a transposon?�

Does D20Seq have anything to do with this sheep’s infections?

The type species of Akkermansia (A. muciniphila) has been shown to increase in abundance in the gut microbiome in response to host infection, and apparently helps with immune response (e.g. with influenza, in Hu et al., 2020).�
Mamun et al., 2020 found that the relative abundance of Akkermansia in the sheep gut microbiome was negatively (?) associated with the presence of another eukaryotic parasite, Haemonchus contortus.

Mamun, M. A. A., Sandeman, M., Rayment, P., Brook-Carter, P., Scholes, E., Kasinadhuni, N., ... & Greenhill, A. R. (2020). Variation in gut bacterial composition is associated with Haemonchus contortus parasite infection of sheep. Animal Microbiome, 2, 1–14.

Hu, X., Zhao, Y., Yang, Y., Gong, W., Sun, X., Yang, L., ... & Jin, M. (2021). Akkermansia muciniphila improves host defense against influenza virus infection. Frontiers in Microbiology, 11, 586476.

174 of 208

Further inspection: why does this repeat exist?

When did D20Seq emerge in Akkermansia?

And when did D20Seq actually … become repeated?�

Could D20Seq be an extrachromosomal mobile element?

Or a transposon?�

Does D20Seq have anything to do with this sheep’s infections?

The type species of Akkermansia (A. muciniphila) has been shown to increase in abundance in the gut microbiome in response to host infection, and apparently helps with immune response (e.g. with influenza, in Hu et al., 2020).�
Mamun et al., 2020 found that the relative abundance of Akkermansia in the sheep gut microbiome was negatively (?) associated with the presence of another eukaryotic parasite, Haemonchus contortus.

Mamun, M. A. A., Sandeman, M., Rayment, P., Brook-Carter, P., Scholes, E., Kasinadhuni, N., ... & Greenhill, A. R. (2020). Variation in gut bacterial composition is associated with Haemonchus contortus parasite infection of sheep. Animal Microbiome, 2, 1–14.

Hu, X., Zhao, Y., Yang, Y., Gong, W., Sun, X., Yang, L., ... & Jin, M. (2021). Akkermansia muciniphila improves host defense against influenza virus infection. Frontiers in Microbiology, 11, 586476.

A lot of interesting details lurk within metagenomes and their assemblies. Within reason :), try to notice these details!

175 of 208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

✔

176 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

177 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

178 of 208

Open problems

How can we generate complete metagenome assemblies?

How do we get longer and more accurate sequencing reads?
How can we better assemble these reads?�

How can we better visualize large assembly graphs?

…and their underlying sequences?�

How should we compare metagenome assemblies?

…of different samples, individuals, populations, etc.?�

How can we incentivize the development of good bioinformatics software? 💀�
How do we teach these subjects?

179 of 208

Today’s outline

Culture-independent methods: marker gene vs. metagenome sequencing
Assembly: metagenome, single-genome, and problem formulations
Metagenome assembly, in detail

Motivations
Sequencing technologies and their impacts on assembly
Assembly

de novo vs. reference-based
Graph construction
Graph simplification

Scaffolding
Binning
Validation and further inspection

Open problems
Questions

https://bit.ly/41GBxEt

These slides are online at:

Thanks :)

Pop Lab (UMD), Knight Lab (UCSD), Pevzner Lab (UCSD)

180 of 208

Bonus slides

181 of 208

C. I. Methods: What’s in a marker gene?

181

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

182 of 208

C. I. Methods: What’s in a marker gene?

182

Entropy in the 16S rRNA gene

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

183 of 208

C. I. Methods: What’s in a marker gene?

183

Entropy in the 16S rRNA gene

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

More

mutations

184 of 208

C. I. Methods: What’s in a marker gene?

184

Entropy in the 16S rRNA gene

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

More

mutations

Hypervariable regions

185 of 208

C. I. Methods: What’s in a marker gene?

185

Entropy in the 16S rRNA gene

More

mutations

Hypervariable regions

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

Conserved regions

186 of 208

C. I. Methods: What’s in a marker gene?

186

Entropy in the 16S rRNA gene

More

mutations

Hypervariable regions

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

Conserved regions

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

187 of 208

C. I. Methods: What’s in a marker gene?

187

Entropy in the 16S rRNA gene

More

mutations

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

188 of 208

C. I. Methods: What’s in a marker gene?

188

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

189 of 208

C. I. Methods: What’s in a marker gene?

189

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

190 of 208

C. I. Methods: What’s in a marker gene?

190

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

191 of 208

C. I. Methods: What’s in a marker gene?

191

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

192 of 208

C. I. Methods: What’s in a marker gene?

192

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

...

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

193 of 208

C. I. Methods: What’s in a marker gene?

193

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

...

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.

194 of 208

C. I. Methods: What’s in a marker gene?

194

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

...

Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.

195 of 208

C. I. Methods: What’s in a marker gene?

195

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using�single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using

primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

...

Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.

Example taxonomic annotations from the QIIME 2�“Moving Pictures” tutorial: https://docs.qiime2.org

196 of 208

C. I. Methods: Marker gene sequencing

196

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

197 of 208

C. I. Methods: Functional annotation

197

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

198 of 208

C. I. Methods: Functional annotation

198

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

(not just marker!) genes

operons

terminators

promoters

For more information on metagenomic functional annotation, see: Salamov, V. S. A., & Solovyevand, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Li, RW, Ed, 61-78.

...

199 of 208

C. I. Methods: Functional annotation, in practice

199

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping�biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

(not just marker!) genes

operons

terminators

promoters

For more information on metagenomic functional annotation, see: Salamov, V. S. A., & Solovyevand, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Li, RW, Ed, 61-78.

...

Usually, F.A. involves aligning (“mapping”) sequences to a reference database with information about “function” in well- studied organisms.

(But there are fancier approaches, e.g. metabolic modelling methods.)

200 of 208

C. I. Methods: Functional annotation, in practice?

200

“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”

201 of 208

C. I. Methods: Functional annotation, in practice??

201

“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”

A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]

202 of 208

C. I. Methods: Functional annotation, in practice???

202

A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]

“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”

203 of 208

C. I. Methods: Functional annotation, in practice???

203

“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”

A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]

204 of 208

C. I. Methods: Functional annotation, in practice

204

Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics.�PLOS computational biology, 6(2), e1000667.

205 of 208

Marker gene assembly?

205

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Q: Would it make sense for us to assemble marker gene sequencing reads?

206 of 208

A more detailed view of short-read sequencing tech

Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.

207 of 208

Marker gene assembly?

207

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Q: Would it make sense for us to assemble marker gene sequencing reads?

A: We only have a small region from each genome—what would we even assemble? :(

208 of 208

Marker gene assembly?

208

Lee, M. D. (2019). Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

(Let’s assume that this is for an exploratory study, since we haven’t discussed a hypothesis.)

Q: Would it make sense for us to assemble marker gene sequencing reads?

A: We only have a small region from each genome—what would we even assemble? :(

A #2: Depending on our data, we could use assembly to construct longer marker gene sequences—these can be helpful! See Kozich et al., 2013 :)

Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K., & Schloss, P. D. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), 5112-5120.