Program Chairs: Mark Chaisson (USC), Rayan Chikhi (CNRS)

Program

Thursday April 19

9:45 AM - 10:30 AM

Registration

10:30 AM - 11:20 AM

Opening remarks, Session 1: regular talks

Jasmijn Baaijens, Bastiaan van der Roest, Johannes Koster, Leen Stougie and Alexander Schoenhuth

Full-length de novo viral quasispecies assembly through variation graph construction

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V. Volik, Stephane Le Bihan, Collin C. Collins, Cedric Chauve and Faraz Hach

Alignment-free Clustering of Barcode (UMI) Tagged DNA Molecules

11:20 AM - 11:50 AM

Highlight session

Yuanhua Huang and Guido Sanguinetti

BRIE: transcriptome-wide splicing quantification in single cells

Shaun Jackman, Ben Vandervalk, Hamid Mohamadi, Sarah Yeo, Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, Rene Warren and Inanc Birol

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

11:50 AM - 2:00 PM

Lunch

2:00 PM - 3:00 PM

Keynote: Alexis Battle

Johns Hopkins, USA

3:00 PM - 3:10 PM

Short break

3:10 PM - 4:00 PM

Session 2: regular talks

Ergude Bao, Fei Xie, Changjin Song and Dandan Song

HALS: Fast and High Throughput Algorithm for PacBio Long Read Self-Correction

Antoine Limasset, Jean-François Flot and Pierre Peterlongo

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

4:00 PM - 4:30 PM

Coffee Break

4:30 PM - 5:30 PM

Session 3: short talks

Aldo Guzman-Saenz, Niina Haiminen, Saugata Basu and Laxmi Parida

Signal Enrichment of Metagenome Sequencing Reads using Topological Data Analysis

Lynn Yi, Vasilis Ntranos, Pall Melsted and Lior Pachter

Identification of transcriptional signatures for cell types from single-cell RNA-Seq

David Jenkins, Tyler Faits, Emma Briars, Sebastian Carrasco Pro, Steve Cunningham, Masanao Yajima and W. Evan Johnson

Interactive single cell RNA-Seq analysis with the Single Cell Toolkit (SCTK)

Liat Shenhav, Mike Thompson, Tyler Joseph, Ori Furman, David Bogumil, Itzik Mizrahi and Eran Halperin

Fast expectation maximization source tracking

Harry Taegyun Yang

GRASS-C - Graph-based RNA-Seq Analysis in Single cell level Subgraph Clustering

Friday April 20

9:00 AM - 9:30 AM

Registration

9:30 AM - 10:20 AM

Session 1: regular talks

Shibing Deng, Maruja Lira, Donghui Huang, Kai Wang, Crystal Valdez, Jennifer Kinong, Paul Rejto, Jadwiga Bienkowska, James Hardwick and Tao Xie

TNER: A Novel Background Error Suppression Method for Mutation Detection in Circulating Tumor DNA

Kiavash Kianfar, Christopher Pockrandt, Bahman Torkamandi, Haochen Luo and Knut Reinert

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

10:20 AM - 10:50 AM

Coffee break

10:50 AM - 11:30 AM

Session 2: short talks

Igor Mandric and Alex Zelikovsky

Solving scaffolding problem with repeats

Shaun D Jackman, Lauren Coombe, Justin Chu, Rene Warren, Ben Vandervalk, Sarah Yeo, Hamid Mohamadi, Joerg Bohlmann, Steven Jones and Inanc Birol

Tigmint: Correct Assembly Errors Using Linked Reads From Large Molecules

Sergey Knyazev, Viachaslau Tsyvina, Andrii Melnyk, Alexander Artyomenko, Tatiana Malygina, Yuri Porozov, Ellsworth Campbell, William Switzer, Pavel Skums and Alex Zelikovsky

CliqueSNV: Scalable Reconstruction of Intra-Host Viral Populations from NGS Reads

Jonas Fischer and Marcel Schulz

Fast and accurate bisulfite alignment and methylation calling for mammalian genomes

11:30 AM - 11:45 AM

Highlight session

Serghei Mangul

Comprehensive analysis of RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

11:45 AM - 1:45 PM

Lunch

1:45 PM - 2:45 PM

Keynote: Olivier Jaillon

CEA-Genoscope

2:45 PM - 3:00 PM

Short break

3:00 PM - 3:50 PM

Session 3: regular talks

Bahar Alipanahi, Martin Muggli, Musa Jundi, Noelle Noyes and Christina Boucher

Resistome SNP Calling via Read Colored de Bruijn Graphs

Guo Liang Gan, Elijah Willie, Cedric Chauve and Leonid Chindelevitch

Deconvoluting the diversity of within-host pathogen strains in a Multi-Locus Sequence Typing framework

3:50 PM - 4:20 PM

Coffee Break

4:20 PM - 4:40 PM

Poster lightning session

4:40 PM - onwards

Break and poster session

Posters

SEQ-1

mirLibSpark: a scalable NGS microRNA prediction pipeline with data aggregation

SEQ-2

K-merator, an efficient design of highly specific k-mers for quantification of transcriptional signatures in large scale RNAseq cohorts

SEQ-3

Kevlar: Mapping-free approach for accurate discovery of de novo variants

SEQ-4

Promoter and enhancer chromatin dynamics during pancreatic differentiation

SEQ-5

Ultrafast space-efficient k-mer indexing

SEQ-6

ARKS: chromosome-scale human genome scaffolding with linked read kmers

SEQ-7

Tigmint: Correct Assembly Errors Using Linked Reads From Large Molecules

SEQ-8

Multi-Index Bloom Filters: A probabilistic data structure for sensitive multi-reference sequence classification with multiple spaced seeds

SEQ-9

De novo Clustering of Gene Expressed Variants in Transcriptomic Long Reads Data Sets

SEQ-10

ONTig: Contiguating Genome Assembly using Oxford Nanopore Long Reads

SEQ-11

Rapid and precise analysis of human gut metagenomes using Oxford Nanopore sequencing technology

SEQ-12

S3A: A Scalable and Accurate Annotated Assembly Tool for Targeted Gene Assembly

SEQ-13

Pan-genome structural analysis and visualisation

SEQ-14

Reference-guided genome assembly in metagenomic samples

SEQ-15

Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming

SEQ-16

Isoform assembly with quasi-lossless compression of quality scores in RNA-seq data

SEQ-17

Map2Peak: From Unmapped Reads to ChIP-Seq Peaks in Half the Time

SEQ-18

DiscoSnp-RAD: de novo detection of small variants for population genomics

SEQ-19

SVIM: Structural Variant Identification Method using Long Reads

SEQ-20

TBA

SEQ-21

TBA

SEQ-22

TBA

The list of posters is subject to change. For the latest version, please go to the RECOMB-Seq website.


Regular Talks Abstracts

Full-length de novo viral quasispecies assembly through variation graph construction

Jasmijn Baaijens, Bastiaan van der Roest, Johannes Koster, Leen Stougie and Alexander Schoenhuth

Motivation: Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly refers to reconstructing the strain-specific haplotypes from read data and predicting their relative abundances within the mix of strains, an important step for various treatment-related reasons. Reference-genome-independent (“de novo”) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While very accurate, extant de novo methods only yield rather short contigs. It remains to reconstruct full-length haplotypes, together with their abundances, from such contigs.

Method: We first construct a variation graph, a recently popular structure well suited to arranging and integrating several related genomes, from the short input contigs, without making use of a reference genome. To obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that selects a set of maximal-length paths optimally compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances.

Results: Benchmarking experiments on challenging simulated data sets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates. As a consequence, our method outperforms all state-of-the-art viral quasispecies assemblers that aim at the construction of full-length haplotypes, in terms of various relevant assembly measures. Our tool, Virus-VG, is publicly available at https://bitbucket.org/jbaaijens/virus-vg.
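The path-selection idea can be illustrated with a toy sketch. The paths, node coverages, and the coarse grid search below are hypothetical stand-ins for the exact minimization Virus-VG solves over maximal-length paths:

```python
from itertools import product

# Hypothetical toy instance: each candidate haplotype path is the set of
# variation-graph nodes it traverses; node_cov is the observed read
# coverage per node.
paths = {"hap1": {0, 1, 3}, "hap2": {0, 2, 3}}
node_cov = {0: 30, 1: 10, 2: 20, 3: 30}

def fit_error(abund):
    # Squared difference between observed node coverage and the coverage
    # implied by the candidate abundances of the paths through that node.
    err = 0.0
    for node, cov in node_cov.items():
        implied = sum(a for p, a in zip(paths.values(), abund) if node in p)
        err += (cov - implied) ** 2
    return err

# Coarse brute-force over an abundance grid; Virus-VG solves the
# corresponding optimization exactly.
best = min(product(range(0, 41, 5), repeat=len(paths)), key=fit_error)
print(dict(zip(paths, best)))  # -> {'hap1': 10, 'hap2': 20}
```

Here the abundances 10 and 20 reproduce every node coverage exactly (node 0 and node 3 are shared by both haplotypes, so their coverage is the sum 30).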

Alignment-free Clustering of Barcode (UMI) Tagged DNA Molecules

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V. Volik, Stephane Le Bihan, Collin C. Collins, Cedric Chauve and Faraz Hach

Motivation: Next generation sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics when it comes to detecting low-allele-frequency variations in circulating tumour DNA. Barcode tagging of DNA molecules attempts to mitigate this issue by using PCR to duplicate a DNA molecule tagged with a unique identifying barcode, also commonly called a Unique Molecular Identifier (UMI), and sequencing those PCR products independently. However, both the PCR and sequencing steps can generate errors in the barcodes and DNA molecule sequences of the sequenced reads. Analyzing barcoded sequencing data requires an initial, accurate clustering step, with the aim of grouping reads sequenced from the same molecule into a single cluster. Moreover, the size of current datasets requires that this clustering process be resource-efficient.

Results: We introduce Calib, a computational tool that clusters paired-end reads from barcoded sequencing experiments, addressing the issues raised above. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity, both computed efficiently using locality-sensitive hashing and MinHash techniques. To ease application to various contexts, Calib comes with a data simulation module that enables an informed choice of parameters as well as reproducible benchmarking. Calib is efficient, as it avoids pairwise comparisons and exact edit distance calculations. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining a reasonable speed and memory footprint.
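The clustering idea, in which reads become graph nodes and edges connect reads with similar MinHash sketches, can be sketched as follows (hypothetical reads, MD5-based hash functions, and an arbitrary similarity threshold; Calib's actual LSH scheme and parameters differ):

```python
import hashlib
from itertools import combinations

def minhash(seq, k=4, n_hashes=16):
    # MinHash sketch of the read's k-mer set: for each salted hash
    # function, keep the minimum k-mer hash value.
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return tuple(
        min(int(hashlib.md5(bytes([h]) + km.encode()).hexdigest(), 16)
            for km in kmers)
        for h in range(n_hashes)
    )

def sketch_sim(a, b):
    # Fraction of agreeing sketch slots: estimates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Union-find: clusters are connected components of the similarity graph.
parent = {}
def find(x):
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Hypothetical reads: r1 and r2 differ by one base, r3 is unrelated.
reads = {"r1": "ACGTACGTACGTACGT",
         "r2": "ACGTACGTACGTACGA",
         "r3": "TTTTGGGGCCCCAAAA"}
sketches = {name: minhash(seq) for name, seq in reads.items()}
for a, b in combinations(reads, 2):
    if sketch_sim(sketches[a], sketches[b]) >= 0.3:  # illustrative threshold
        parent[find(a)] = find(b)

clusters = {}
for r in reads:
    clusters.setdefault(find(r), []).append(r)
print(sorted(sorted(c) for c in clusters.values()))
```

Because similar reads share most k-mers, their sketches agree in most slots, so r1 and r2 end up in one connected component while r3 forms its own; no pairwise edit distances are computed.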

HALS: Fast and High Throughput Algorithm for PacBio Long Read Self-Correction

Ergude Bao, Fei Xie, Changjin Song and Dandan Song

Motivation: Third-generation PacBio long reads have greatly facilitated sequencing projects thanks to their very large read lengths, but they contain about 15% sequencing errors and require error correction. For projects with long reads only, it is challenging to perform correction quickly while also correcting a sufficient amount of read bases, i.e. achieving high-throughput self-correction. MECAT is currently the fastest self-correction algorithm, but its throughput is relatively small (Xiao et al., 2017).

Results: Here we introduce HALS, a wrapper algorithm of MECAT, to achieve high throughput long read self-correction while keeping MECAT's fast speed. HALS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, HALS also uses the corrected long read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on E. coli, S. cerevisiae and A. thaliana long reads, HALS can achieve 28.1-230.2% larger throughput than MECAT. Compared to the other existing self-correction algorithms, HALS is 8-119x faster, and its throughput is also 17.4-157.8% larger or comparable. The HALS corrected long reads can be assembled into contigs of 18.0-60.4% larger N50 sizes than MECAT.

Availability: The HALS software can be downloaded for free from this site: https://github.com/xief001/hals.

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

Antoine Limasset, Jean-François Flot and Pierre Peterlongo

Motivation: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well to large datasets or treat reads as mere collections of k-mers, without taking their full-length read information into account.

Results: We propose a new method to correct short reads using de Bruijn graphs, and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby cleaning it from most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.

Availability and Implementation: The implementation is open source and available at github.com/Malfoy/BCOOL under the Affero GPL license.
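A minimal sketch of the k-mer-abundance cleaning idea (toy data, an illustrative solidity threshold, and a one-substitution search standing in for mapping reads on the cleaned de Bruijn graph):

```python
from collections import Counter

K = 5

def kmers(s):
    return [s[i:i + K] for i in range(len(s) - K + 1)]

# Toy dataset: many copies of the true sequence plus one read carrying a
# single sequencing error at position 6.
true = "ACGTACGGTTCAGG"
reads = [true] * 20 + [true[:6] + "A" + true[7:]]

counts = Counter(km for r in reads for km in kmers(r))
SOLID = 3  # abundance threshold separating error k-mers from real ones

def correct(read):
    # If any k-mer is weak, try single-base substitutions and keep the
    # variant whose k-mers are all solid (a crude stand-in for aligning
    # the read to the cleaned graph and reading off the corrected path).
    if all(counts[km] >= SOLID for km in kmers(read)):
        return read
    for i in range(len(read)):
        for b in "ACGT":
            cand = read[:i] + b + read[i + 1:]
            if all(counts[km] >= SOLID for km in kmers(cand)):
                return cand
    return read

print(correct(reads[-1]) == true)  # -> True
```

The erroneous read's k-mers spanning the error each occur once, well below the threshold, while the corrected variant's k-mers are all abundant, so the substitution back to the true base is the unique fix.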

TNER: A Novel Background Error Suppression Method for Mutation Detection in Circulating Tumor DNA

Shibing Deng, Maruja Lira, Donghui Huang, Kai Wang, Crystal Valdez, Jennifer Kinong, Paul Rejto, Jadwiga Bienkowska, James Hardwick and Tao Xie

The use of ultra-deep, next generation sequencing of circulating tumor DNA (ctDNA) holds great promise for early detection of cancer, as well as for monitoring disease progression and therapeutic responses. However, the low abundance of ctDNA in the bloodstream, coupled with technical errors introduced during library construction and sequencing, complicates mutation detection. To achieve high accuracy of variant calling by better distinguishing low-frequency ctDNA mutations from background errors, we introduce TNER (Tri-Nucleotide Error Reducer), a novel background error suppression method that provides a robust estimation of background noise to reduce sequencing errors. It significantly enhances the specificity of downstream ctDNA mutation detection without sacrificing sensitivity. Results on both simulated and real healthy subjects' data demonstrate that the proposed algorithm consistently outperforms a current state-of-the-art position-specific error-polishing model, particularly when the sample size of healthy subjects is small. TNER is publicly available at https://github.com/ctDNA/TNER.
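The idea of a context-specific background error model can be sketched as follows; the panel counts, threshold, and plain binomial test below are hypothetical illustrations, not TNER's actual statistics:

```python
from math import comb

def binom_sf(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p), via the complement of the CDF
    # (fine for the small k used in this toy example).
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# Hypothetical healthy-subject panel: trinucleotide context -> (alt reads,
# total depth) pooled across subjects. Estimating error rates per 3-nt
# context, rather than per position, borrows strength across positions.
panel = {"ACA": (12, 120000), "TCG": (90, 120000)}
bg_rate = {ctx: alt / depth for ctx, (alt, depth) in panel.items()}

def call_mutation(ctx, alt, depth, alpha=1e-6):
    # Call a mutation when observing >= alt variant reads is implausible
    # under the context-specific background error rate.
    return binom_sf(alt, depth, bg_rate[ctx]) < alpha

print(call_mutation("ACA", 25, 5000))  # 0.5% AF vs 0.01% background
print(call_mutation("TCG", 8, 5000))   # consistent with 0.075% background
```

The first call is rejected as background noise with overwhelming confidence; the second observation is well within the noisier TCG context's expected error count, so no mutation is called.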

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

Kiavash Kianfar, Christopher Pockrandt, Bahman Torkamandi, Haochen Luo and Knut Reinert

Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem.

Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner employing an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as future work.
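The intuition behind piece-wise approximate search can be illustrated with the plain pigeonhole scheme below (a deliberate simplification: real search schemes search the pieces in an FM-index, in optimized orders and with per-piece error bounds):

```python
def approx_find(text, pattern, k=2):
    # Pigeonhole search: with at most k mismatches, at least one of k+1
    # pieces of the pattern must match exactly. Exact-match each piece,
    # then verify the full candidate window by Hamming distance.
    m = len(pattern)
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    hits = set()
    for j in range(k + 1):
        piece = pattern[bounds[j]:bounds[j + 1]]
        start = text.find(piece)
        while start != -1:
            w = start - bounds[j]  # implied window start in the text
            if 0 <= w <= len(text) - m:
                d = sum(a != b for a, b in zip(text[w:w + m], pattern))
                if d <= k:
                    hits.add(w)
            start = text.find(piece, start + 1)
    return sorted(hits)

text = "GATTACAGATTACAGATTAGA"
print(approx_find(text, "GATTACA", k=1))  # -> [0, 7, 14]
```

The window at position 14 ("GATTAGA") is found because its first piece "GAT" matches exactly, even though the full window has one mismatch; optimal search schemes minimize how much such piece-wise enumeration costs inside the index.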

Resistome SNP Calling via Read Colored de Bruijn Graphs

Bahar Alipanahi, Martin Muggli, Musa Jundi, Noelle Noyes and Christina Boucher

Motivation: The resistome, which refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria, is frequently studied using shotgun metagenomic data. Unfortunately, few existing methods are able to identify single nucleotide polymorphisms (SNPs) within metagenomic data, and to the best of our knowledge, no methods exist to detect SNPs within AMR genes within the resistome. The ability to identify SNPs in AMR genes across the resistome would represent a significant advance in understanding the dissemination and evolution of AMR, as SNP identification would enable “fingerprinting” of the resistome, which could then be used to track AMR dynamics across various settings and/or time periods.

Results: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. We demonstrate that LueVari was the only method with reliable sensitivity (between 73% and 98%), as the performance of competing methods varied widely. Furthermore, we show that LueVari constructs sequences containing the variation that span 93% of the gene in datasets with lower coverage (15X), and 100% of the gene in datasets with higher coverage (30X).

Availability: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari.

Deconvoluting the diversity of within-host pathogen strains in a Multi-Locus Sequence Typing framework

Guo Liang Gan, Elijah Willie, Cedric Chauve and Leonid Chindelevitch

Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging due to the possible presence of multiple strains within each host.

In this work we introduce a framework for extracting multi-locus sequence types (MLST) from whole-genome sequencing data. Our approach consists of two stages. First, we assign to each sample a set of alleles and a proportion for each allele, for each locus in the MLST scheme, while minimizing the mismatches between the reads and the alleles, weighted by their base quality scores. Next, we associate to each sample a set of strain types using the alleles and the proportions obtained in the first step, using as small a number as possible of previously unobserved strains across all samples, while respecting the allele proportions as closely as possible. Our approach harnesses the power of mixed integer linear programming (MILP) to solve both of these optimization problems, providing optimality guarantees with respect to our objective-function-based formulation.

Our problem is a novel instance of an expanding class of diversity problems in genomics, alongside isoform abundance estimation in transcriptomics and haplotyping in human genomics, among others. Our solution can apply to any bacterial pathogen for which an MLST scheme exists, even though we developed it primarily with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing framework in pathogen genomics.

Highlights Abstracts

BRIE: transcriptome-wide splicing quantification in single cells

Yuanhua Huang and Guido Sanguinetti

Single-cell RNA-seq (scRNA-seq), a recent technology that combines efficient RNA amplification with high throughput sequencing, has revolutionised our understanding of transcriptome variability among cell populations. It has profound implications, both fundamental and translational, for example in dissecting tumour heterogeneity. However, intrinsic limitations of scRNA-seq, stemming from the minute quantity of initial RNA retrieved from single cells, have prevented its application to dissecting variability in RNA splicing, as methods from bulk RNA-seq cannot handle the low coverage and high drop-out rates of scRNA-seq. Here we present BRIE (Bayesian Regression for Isoform Estimation), a Bayesian hierarchical model which pools genetic and expression information to perform robust splicing quantification from scRNA-seq data. The model has been implemented as a standard Python package, freely available at http://github.com/huangyh09/brie. The full manuscript has been published in Genome Biology.

BRIE consists of two modules. The first is a likelihood module (bottom part of Fig 1), which uses the scRNA-seq data (aligned reads) within a mixture-model approach for isoform estimation (as used in standard methods such as MISO). The likelihood module is coupled with an informative prior distribution in the form of a Bayesian regression model, where the prior probability of exon inclusion ratios is regressed against sequence-derived features (upper part of Fig 1). This exploits the fact that splicing events are highly predictable from sequence to help quantification when data is lacking. This architecture effectively enables BRIE to trade off two tasks: in the absence of data (drop-out genes), the informative prior provides a way of imputing missing data, while for highly covered genes the likelihood term dominates, returning a mixture-model quantification. For intermediate levels of coverage, BRIE uses Bayes's theorem to trade off imputation and quantification.

We validate BRIE on both simulated and real scRNA-seq data sets, showing that BRIE yields reproducible estimates of exon inclusion ratios in single cells. With the simulated RNA-seq data, we assessed the performance of BRIE at different coverage levels, and found that the use of an informative prior in BRIE can bring very substantial performance improvements at low coverage, with a gain of almost 20% in correlation between estimates and ground truth at RPK=25. With 96 real scRNA-seq libraries from individual HCT116 human cells, Figure 2 shows that BRIE clearly outperforms all other methods by a large margin, both in terms of correlation between estimates from different single cells (Fig 2f) and in terms of correlation between estimates from individual single cells and bulk (Fig 2c). Example scatter plots for both comparisons are given in Fig 2e and 2b, clearly showing very consistent predictions. Notably, the performance of other methods was strongly degraded by their inability to handle the large drop-out rates (see Fig 2a and 2d for DICE-seq, where many estimates of splicing are centred around the uninformative prior value of 0.5). The high correlation between bulk and scRNA-seq predictions is particularly remarkable, as the analysis of the two data sets is not done with a shared prior.

BRIE can also be used for differential splicing detection across different data sets. To do so, we compute the evidence ratio (Bayes factor, BF) between a model where the two data sets are treated as replicates (null hypothesis) and an alternative model where the two data sets are treated as separate. To estimate a background level of differential splicing between identical cells, we considered again the 96 single cell HCT116 libraries, and compared all possible pairs of cells. In this control experiment, we found only around 1% of genes differentially spliced at the threshold of BF=10. This level of background calling could be partly attributed to intrinsic stochasticity or to residual physiological variability that was not controlled for in the experiment, such as cell cycle phase. As an additional comparison, we considered two bulk RNA-seq methods for differential splicing, MISO and the recently proposed rMATS. Both methods could only call a negligible number of events, far fewer than the expected number of false positives, confirming that bulk methods are not suitable for scRNA-seq splicing analysis.

Overall, our results demonstrate that BRIE yields reproducible estimates of exon inclusion ratios in single cells and provides an effective tool for differential isoform quantification between scRNA-seq data sets. BRIE therefore expands the scope of scRNA-seq experiments to probe the stochasticity of RNA splicing. As splicing is implicated in a number of disease and developmental processes, BRIE can considerably enhance the usefulness of scRNA-seq technologies in both fundamental and translational biology.
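The prior/likelihood trade-off described above can be illustrated with a toy Beta-Binomial analogue (hypothetical numbers; BRIE's actual prior comes from a regression on sequence-derived features and its inference is more involved):

```python
# Toy Beta-Binomial version of BRIE's idea: an informative prior on the
# exon inclusion ratio (a Beta distribution standing in for the
# regression prior) is combined with read evidence. With few reads the
# prior dominates (imputation); with many reads the data dominates
# (quantification).
def posterior_inclusion(incl_reads, excl_reads, prior_mean=0.8, prior_strength=10):
    a0 = prior_mean * prior_strength          # Beta pseudo-counts
    b0 = (1 - prior_mean) * prior_strength
    return (a0 + incl_reads) / (a0 + b0 + incl_reads + excl_reads)

print(round(posterior_inclusion(0, 0), 2))      # no reads: prior mean 0.8
print(round(posterior_inclusion(2, 2), 2))      # few reads: shrunk toward prior
print(round(posterior_inclusion(200, 200), 2))  # deep coverage: ~0.5 from data
```

With zero reads the estimate falls back to the prior (0.8); with 400 reads split evenly it moves to 0.51, essentially the data's 0.5.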

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

Shaun Jackman, Ben Vandervalk, Hamid Mohamadi, Sarah Yeo, Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, Rene Warren and Inanc Birol

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analyses of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized-medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality draft reference genomes is timely.

With ABySS 1.0, we originally showed that assembling the human genome from short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed across several computers using the standardized Message Passing Interface (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements.

We benchmarked the ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigated the use of BioNano Genomics and 10x Genomics Chromium data to further improve the scaffold NG50 (NGA50) of this assembly, to 42 (15) Mbp.
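A minimal Bloom filter for k-mer membership, the kind of probabilistic structure described above, can be sketched as follows (sizes, hash choices, and data are illustrative, not ABySS's implementation):

```python
import hashlib

class BloomFilter:
    # Minimal Bloom filter for k-mer membership: a fixed bit array plus
    # several hash functions, far more compact than an explicit k-mer
    # hash table, at the cost of a controlled false-positive rate.
    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May answer True for an item never added (false positive), but
        # never False for an item that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

K = 21
genome = "ACGT" * 30                 # toy "genome"
bf = BloomFilter()
for i in range(len(genome) - K + 1):
    bf.add(genome[i:i + K])          # insert every k-mer

print(genome[:K] in bf)  # -> True
print("T" * K in bf)     # almost surely False (false positives are possible)
```

A de Bruijn graph can then be traversed implicitly: from a k-mer known to be present, query the filter for each of the four possible one-base extensions instead of storing edges explicitly.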

Comprehensive analysis of RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

Serghei Mangul

High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present the Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read lengths. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at http://smangul1.github.io/rop/

The preprint is available at https://www.biorxiv.org/content/early/2017/06/12/053041

Short Talks Abstracts

Signal Enrichment of Metagenome Sequencing Reads using Topological Data Analysis

Aldo Guzman-Saenz, Niina Haiminen, Saugata Basu and Laxmi Parida

A metagenome is a collection of genomes, usually in a micro-environment, and sequencing the sample en masse is a powerful means for investigating the community of the constituent micro-organisms. One of the associated challenges is teasing apart reads originating from very similar organisms. This usually results in an increase in false positives and also a sacrifice of many good quality reads due to ambiguous assignments. The latter may even skew the results when the task is to detect not simply the presence or absence but the abundance of genes or organisms.

In our solution, we map the problem to a topological data analysis (TDA) framework. TDA extracts information from the geometric structure of data; in this particular application the structure is defined by multi-way relationships between the sequencing reads, with respect to a reference database.

Based primarily on the patterns of co-mapping of the reads to organisms, also known as operational taxonomic units (OTUs), we use a modification of the Cech complex to model the multi-way maps of reads to OTUs. The modification (viz. barycentric subdivision) allows a natural mapping of our problem requirements to the homology computation and interpretation.

The results from applying the approach to simulated genome mixtures show not just enrichment of signal but also the potential for identifying microbes, possibly at a finer strain or sub-strain level.

Identification of transcriptional signatures for cell types from single-cell RNA-Seq

Lynn Yi, Vasilis Ntranos, Pall Melsted and Lior Pachter

Single-cell RNA-Seq (scRNA-Seq) makes it possible to characterize the transcriptomes of cell types and identify their transcriptional signatures via differential analysis. We present a fast and accurate method for discriminating cell types that takes advantage of the large numbers of cells that are assayed. Logistic regression is a predictive model that requires many samples to be fit accurately, but is also fast and scalable with increasing numbers of samples. We perform logistic regression for each gene to predict cell labels from the constituent transcript quantifications. The resulting linear combination of transcripts can be interpreted as the best separator for the two clusters. Unlike traditional methods that test either for changes in overall gene abundance or for differential transcript usage, our method provides a unified framework for gene differential expression that eliminates the need for such a dichotomy. We applied our method to a dataset of differentiating myoblasts (Trapnell et al., 2014) and discovered differential genes with a diversity of transcript dynamics that are likely to be missed by traditional approaches. On simulated data, we show that our method is more accurate than other existing scRNA-Seq differential expression methods. Furthermore, our method can be applied on non-uniform scRNA-Seq when we use transcript compatibility counts in lieu of transcript quantifications. We showcase an example where we identified previously undetectable marker genes of memory and naïve T-cells from 10x sequencing (Zheng et al., 2017). In sum, our method is scalable, accurate and interpretable, and can be applied to non-uniform scRNA-Seq.
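The per-gene logistic regression can be sketched on hypothetical data (two isoforms of one gene, six cells; plain gradient descent stands in for whatever solver the authors use):

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in exp() for large |z|.
    return 1 / (1 + math.exp(-max(min(z, 35.0), -35.0)))

# Hypothetical per-gene data: transcript-level quantifications for two
# isoforms of one gene, paired with each cell's cluster label (0 or 1).
cells = [([5.0, 1.0], 0), ([6.0, 0.5], 0), ([4.5, 1.5], 0),
         ([1.0, 5.0], 1), ([0.5, 6.0], 1), ([1.5, 4.0], 1)]

w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(200):                       # plain stochastic gradient descent
    for x, y in cells:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for i in range(2):
            w[i] -= lr * (p - y) * x[i]    # logistic-loss gradient step
        b -= lr * (p - y)

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5

print(all(predict(x) == bool(y) for x, y in cells))  # -> True
```

The fitted weights form the linear combination of transcript quantifications that best separates the two cell clusters, which is exactly the interpretation the abstract gives; a gene whose regression separates the labels well is a candidate signature gene.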

Interactive single cell RNA-Seq analysis with the Single Cell Toolkit (SCTK)

David Jenkins, Tyler Faits, Emma Briars, Sebasitan Carrasco Pro, Steve Cunningham, Masanao Yajima and W. Evan Johnson

Single-cell RNA-sequencing (scRNA-Seq) allows researchers to profile transcriptional activity in individual cells, in contrast to bulk RNA-sequencing, which profiles a conglomerate of an entire cell population. Because each sample comes from an individual cell, the small amount of input RNA results in sparse data, introducing new analytical challenges not present in bulk RNA-Seq analysis. Additionally, scRNA-Seq datasets can vary dramatically in the number of cells sequenced, the sequencing depth, and the number of batches in which the samples were sequenced, so many dataset-dependent analytical decisions must be made. Here, we present the Single Cell Toolkit (SCTK), an interactive scRNA-Seq analysis package that allows a user to upload raw scRNA-Seq count matrices and perform downstream analysis interactively through a web interface. The package is written in R with a graphical user interface (GUI) written in Shiny. Users can perform analysis with modules for filtering raw results, clustering, batch correction, differential expression, pathway enrichment, and scRNA-Seq study design, all in a simple-to-use point-and-click interface. The toolkit also supports command-line data processing, and results can be loaded into the GUI for additional exploration and downstream analysis. We demonstrate the effectiveness of the SCTK on multiple scRNA-Seq examples, including data from mucosal-associated invariant T cells, induced pluripotent stem cells, and breast cancer tumor cells. While other scRNA-Seq analysis tools exist, the SCTK is the first fully interactive analysis toolkit for scRNA-Seq data available within the R language.

Fast expectation maximization source tracking

Liat Shenhav, Mike Thompson, Tyler Joseph, Ori Furman, David Bogumil, Itzik Mizrahi and Eran Halperin

Advances in sequencing technology are driving an exponential increase in the acquisition and sharing of microbial community surveys such as the "Earth Microbiome Project". These advances allow access to microbial data at an unprecedented scale while opening a new and larger window into the distribution of microbial diversity on Earth. Given the complex nature of these metagenomic datasets, and the sensitivity of PCR and whole-genome amplification methods, the ability to identify the origins of each sample, and specifically to detect and identify contamination, is crucial. Yet progress toward an inclusive and scalable solution has been limited. Specifically, the available state-of-the-art method for contamination identification is based on Markov chain Monte Carlo and does not scale to modern large-scale datasets. To that end, we present FEAST, a flexible and computationally efficient method to estimate the proportions of contaminants in a given community that come from possible source environments. We evaluated the performance of FEAST using simulations and real data and found that FEAST is significantly more accurate than the state-of-the-art method for microbial source tracking, and that it is 50-1000-fold faster. Notably, in some cases, it reduces the run time from days or weeks to hours. Unlike current solutions, FEAST can simultaneously estimate hundreds to thousands of potential source environments in a timely manner, and thus helps decipher the origins of complex microbial samples.
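
The underlying estimation problem can be sketched with a minimal expectation-maximization loop (a simplified illustration of source-proportion estimation, not the FEAST algorithm itself; the source distributions and sink counts below are invented).

```python
import numpy as np

def em_source_proportions(sink_counts, sources, n_iter=500):
    """Estimate mixing proportions of known sources in a sink sample.

    sink_counts: (n_taxa,) observed taxon counts in the sink.
    sources: (n_sources, n_taxa) taxon probability distribution per source.
    """
    k = sources.shape[0]
    alpha = np.full(k, 1.0 / k)                  # initial proportions
    for _ in range(n_iter):
        # E-step: responsibility of each source for each taxon's reads
        weighted = alpha[:, None] * sources      # (k, n_taxa)
        resp = weighted / weighted.sum(axis=0, keepdims=True)
        # M-step: re-estimate proportions from expected per-source counts
        alpha = (resp * sink_counts).sum(axis=1)
        alpha /= alpha.sum()
    return alpha

# Two hypothetical sources over three taxa, mixed 30/70 into a sink sample.
sources = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.1, 0.8]])
true_mix = np.array([0.3, 0.7])
sink = np.round(10000 * true_mix @ sources)
print(em_source_proportions(sink, sources))  # close to [0.3, 0.7]
```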

GRASS-C - Graph-based RNA-Seq Analysis in Single cell level Subgraph Clustering

Harry Taegyun Yang


Solving scaffolding problem with repeats

Igor Mandric and Alex Zelikovsky

One of the most important steps in genome assembly is scaffolding. Increasing read lengths have made it possible to assemble short genomes, but the assembly of long, repeat-rich genomes remains one of the most interesting and challenging problems in bioinformatics. There is high demand for computational approaches to repeat-aware scaffolding. In this paper, we propose BATISCAF, a novel repeat-aware scaffolder based on an optimization formulation for filtering out repeated and short contigs. Our experiments with five benchmarking datasets show that BATISCAF outperforms state-of-the-art tools. BATISCAF is freely available on GitHub: https://github.com/mandricigor/batiscaf.

Tigmint: Correct Assembly Errors Using Linked Reads From Large Molecules

Shaun D Jackman, Lauren Coombe, Justin Chu, Rene Warren, Ben Vandervalk, Sarah Yeo, Hamid Mohamadi, Joerg Bohlmann, Steven Jones and Inanc Birol

Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity, and assembly errors are common. These misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembly. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long-distance information of the large molecules underlying linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint for this purpose. To demonstrate the effectiveness of Tigmint, we corrected assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216. While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate its usefulness in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. The source code of Tigmint is available for download from https://github.com/bcgsc/tigmint, and is distributed under the GNU GPL v3.0 license.
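
The core spanning-molecule logic can be illustrated with a toy sketch (not Tigmint itself, which infers molecule extents from barcoded linked-read alignments; the molecule intervals and threshold below are hypothetical): positions of a contig spanned by few large molecules are candidate misassembly breakpoints.

```python
def low_spanning_positions(molecules, length, min_depth=2):
    """Flag contig positions whose spanning-molecule depth is below min_depth.

    molecules: list of (start, end) molecule extents on the contig.
    length: contig length in bases.
    """
    depth = [0] * (length + 1)
    for start, end in molecules:      # difference array of molecule coverage
        depth[start] += 1
        depth[end] -= 1
    flagged, cov = [], 0
    for pos in range(length):
        cov += depth[pos]
        if cov < min_depth:           # too few molecules span this position
            flagged.append(pos)
    return flagged

# Three hypothetical molecules on a 100 bp contig; the ends are thinly spanned.
molecules = [(0, 60), (10, 70), (40, 100)]
print(low_spanning_positions(molecules, 100))
```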

CliqueSNV: Scalable Reconstruction of Intra-Host Viral Populations from NGS Reads

Sergey Knyazev, Viachaslau Tsyvina, Andrii Melnyk, Alexander Artyomenko, Tatiana Malygina, Yuri Porozov, Ellsworth Campbell, William Switzer, Pavel Skums and Alex Zelikovsky

Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants with few mutations relative to the major strains may result in immune escape, emergence of drug resistance, and increased virulence and infectivity. Next-generation sequencing technologies permit detection of a sample's intra-host viral population at great depth, providing an opportunity to access low-frequency variants. The long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants.

In this article, we present CliqueSNV, a novel reference-based method for reconstructing viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single-nucleotide variations and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The new method outperforms existing methods in both accuracy and running time on experimental and simulated NGS data for titrated levels of known viral variants. For PacBio reads, it accurately reconstructs variants with frequencies as low as 0.1%. For Illumina reads, it fully reconstructs the main variants. The open-source implementation of CliqueSNV is freely available for download at https://github.com/vyacheslav-tsivina/CliqueSNV.
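
The allele-graph idea can be illustrated with a toy example (not CliqueSNV itself, which additionally merges cliques via combinatorial optimization; the reads and variants below are invented): nodes are single-nucleotide variants, edges link variants that co-occur on a read, and maximal cliques suggest sets of variants belonging to one haplotype.

```python
from collections import defaultdict
from itertools import combinations

# Each read represented as the set of (position, base) SNVs it carries.
reads = [
    {(10, "A"), (50, "G")},
    {(10, "A"), (50, "G"), (90, "T")},
    {(50, "G"), (90, "T")},
    {(10, "C"), (90, "A")},
]

# Build the allele graph: an edge means the two SNVs were linked on a read.
adj = defaultdict(set)
for snvs in reads:
    for u, v in combinations(snvs, 2):
        adj[u].add(v)
        adj[v].add(u)

def maximal_cliques(R, P, X, out):
    """Bron-Kerbosch enumeration of maximal cliques of the allele graph."""
    if not P and not X:
        out.append(sorted(R))
    for v in list(P):
        maximal_cliques(R | {v}, P & adj[v], X & adj[v], out)
        P.remove(v)
        X.add(v)

cliques = []
maximal_cliques(set(), set(adj), set(), cliques)
print(sorted(cliques))  # two candidate haplotype variant sets
```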

Fast and accurate bisulfite alignment and methylation calling for mammalian genomes

Jonas Fischer and Marcel Schulz

Whole-genome bisulfite sequencing (WGBS) is considered the gold standard for genome-wide, high-resolution DNA methylation measurements. With the ongoing advances in next-generation sequencing techniques, for example single-cell methylomes, an ever-growing amount of WGBS data is produced for different organisms, tissues, and cell types. However, while sequencing throughput has increased, alignment and methylation calling algorithms have not been adapted to the increasing demands, causing a serious bottleneck in current applications.

Here, we present a novel approach called FAME, which combines bisulfite read alignment and methylation calling in one task. We designed an index structure that can store large genomes efficiently while allowing fast lookups of candidate matching positions for reads.

We further designed fast filters that drastically reduce the search space for read alignment. Alignment is done with a modified Shift-And automaton, which enables asymmetric C/T mapping to resolve the ambiguity introduced by bisulfite conversion. Once an alignment is found, methylation levels are estimated directly, avoiding excessive I/O for writing large BAM files and additional postprocessing time for methylation calling. On a benchmark based on reads sampled from the human genome, we show that we are as accurate as or more accurate than state-of-the-art methods for single-end reads.

We also outperform the competing methods in terms of mapping efficiency and predicted methylation rates for paired-end reads. In addition, we are an order of magnitude faster than our competitors on simulated and real data sets. Thus, FAME paves the way for large-scale analysis of bisulfite datasets.
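
Asymmetric C/T mapping can be sketched with a minimal bit-parallel Shift-And matcher (illustrative only; FAME's actual automaton, index structure, and filters are far more involved). Bisulfite conversion turns unmethylated genomic C into read T, so a read T is allowed to match a genomic C, while a read C matches only a genomic C.

```python
def bisulfite_shift_and(read, genome):
    """Return start positions where read matches genome under C/T asymmetry."""
    m = len(read)
    masks = {c: 0 for c in "ACGT"}
    for i, c in enumerate(read):
        masks[c] |= 1 << i
        if c == "T":                  # a read T also matches a genomic C
            masks["C"] |= 1 << i
    hits, state, accept = [], 0, 1 << (m - 1)
    for j, c in enumerate(genome):
        # Shift-And step: extend all partial matches by one character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)    # start position of a full match
    return hits

# The converted read "TTGA" (from genomic "TCGA") still matches at position 2.
print(bisulfite_shift_and("TTGA", "AATCGAC"))  # → [2]
```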