1 of 24

Towards the Latent Transcriptome

https://arxiv.org/abs/1810.03442

Assya Trofimov, Francis Dutil, Claude Perreault, Sebastien Lemieux, Yoshua Bengio, Joseph Paul Cohen

Joint work IRIC + Mila

2 of 24

Motivation

  • Usual pipeline:

Index       kmer        count   Patient id
1           ACGT...GT   100     1
2           AAAA...AA   24k     1
3           ACGT...CC   1       1
...         ...         ...     ...
7 billion   TCAA...GT   58      1

Alignment / Pseudo-Alignment

Index   gene     count   Patient id
1       name_1   5k      1
2       name_2   100k    1
3       name_3   0       1
...     ...      ...     ...
20k     name_x   58      1

Hand-crafted features!
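The kmer count table at the top of this slide can be produced by sliding a window of length k over every read. A minimal sketch, assuming reads are plain strings; the function name `count_kmers` and the toy reads are made up for illustration and are not taken from the original work:

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring (kmer) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy reads, invented for this example.
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, k=4)

# One row per kmer: (kmer, count, patient id), as in the table above.
rows = [(kmer, n, 1) for kmer, n in sorted(kmer_counts.items())]
```

With real data the same loop yields billions of rows per patient, which is why the usual pipeline collapses them into roughly 20k gene counts.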

3 of 24

Deep learning traditionally replaces hand-crafted features

Yann LeCun

4 of 24

Template Matching

Image credit: scikit-image

"Gene"

Reads

Gene Expression
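The analogy on this slide can be sketched in a few lines: the gene is the template, each read is matched against it, and gene expression is the number of reads that match. This is only an illustration of the idea, not a real aligner; the function names, the mismatch threshold and the toy sequences are assumptions:

```python
def match_positions(template, read, max_mismatches=0):
    """Slide `read` along `template`; return offsets with few enough mismatches."""
    hits = []
    for offset in range(len(template) - len(read) + 1):
        window = template[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append(offset)
    return hits

def expression(template, reads, max_mismatches=0):
    """Number of reads that match the template at least once."""
    return sum(1 for r in reads if match_positions(template, r, max_mismatches))

# Toy gene and reads, invented for this example.
gene = "ACGTTGCAAC"
reads = ["CGTT", "GCAA", "TTTT", "CGTA"]

exact = expression(gene, reads)                    # exact matches only
fuzzy = expression(gene, reads, max_mismatches=1)  # tolerate one mismatch
```

Reads that match no template ("TTTT" here) simply do not contribute to any gene count.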

5 of 24

Discriminatively Trained Part-Based Models

Felzenszwalb, P. F., et al. “Object Detection with Discriminatively Trained Part-Based Models.” TPAMI. 2009

State-of-the-art computer vision in 2009!

Human

Bicycle

Aligned over the image

(Like string alignment)

Regions of the bike are connected like springs

Aka BLAST

6 of 24

Felzenszwalb, P. F., et al. “Object Detection with Discriminatively Trained Part-Based Models.” TPAMI. 2009

Activation map of template

7 of 24

Problems with template matching

  • We throw away a huge amount of data
  • We rely on hand-crafted features (genes) that might be incorrect
  • But what if we could discover the “real” underlying structure?

It can become very difficult to maintain!

8 of 24

Reads aligned w/ maximum likelihood

Reference Genome

Aggregation using hand-crafted gene annotations

Reads cut into kmers, binned into histograms

Kmers group into clusters representing genes

Standard pipeline

Latent Transcriptome

Creation of reads

Updated often

(heuristic)

Colors represent known genes

Grey represents what the standard pipeline leaves unexplained

Learned from data

Ignores mutations

read


9 of 24

Towards the Latent Transcriptome

Computing embeddings for kmers from raw RNA-seq data of the transcriptome.

No need for alignment to a reference genome!

An RNN maps each kmer to a point in a 2D space; the kmer's count is then predicted from that point, conditioned on the sample

RNN processes each kmer
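The data flow on this slide can be sketched as follows: an RNN reads a kmer base by base, its final state is projected to 2D, and a count is predicted from that 2D point together with a per-sample vector. This is a forward pass only, with random untrained weights. The tanh cell, the hidden size, the per-sample vectors and the linear count head are all assumptions made for illustration; the slides do not specify these details.

```python
import math
import random

random.seed(0)
BASES = "ACGT"
HIDDEN = 8  # hidden size, an arbitrary choice for this sketch

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_in = rand_matrix(HIDDEN, len(BASES))  # input-to-hidden weights
W_h = rand_matrix(HIDDEN, HIDDEN)       # hidden-to-hidden weights
W_emb = rand_matrix(2, HIDDEN)          # final hidden state -> 2D embedding
W_out = rand_matrix(1, 4)               # [2D embedding, sample vector] -> count

# Hypothetical per-sample vectors; in a trained model these would be learned.
sample_vecs = {1: [0.1, -0.2], 2: [0.3, 0.4]}

def embed_kmer(kmer):
    """Run a plain tanh RNN over the bases and project the final state to 2D."""
    h = [0.0] * HIDDEN
    for base in kmer:
        x = [1.0 if base == b else 0.0 for b in BASES]  # one-hot encode the base
        a = matvec(W_in, x)
        r = matvec(W_h, h)
        h = [math.tanh(ai + ri) for ai, ri in zip(a, r)]
    return matvec(W_emb, h)

def predict_count(kmer, sample_id):
    """Predict a scalar count estimate from the kmer embedding and the sample vector."""
    features = embed_kmer(kmer) + sample_vecs[sample_id]
    return matvec(W_out, features)[0]

kmer = "ACGTACGTACGTACGTACGTACGT"  # toy kmer of length 24
point = embed_kmer(kmer)
pred = predict_count(kmer, sample_id=1)
```

Because the embedding depends only on the kmer and the count head also sees the sample, kmers whose counts vary together across samples are pushed towards nearby points during training.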

10 of 24

Experiments on synthetic data

Gene 1

Gene 2

Gene 3

Results with a low number of patients:

48002 examples total

20 genes

2 patients

24k kmers

Gene 1

Gene 2

Gene 3

When more patients are used:

120005 examples total

20 genes

5 patients

24k kmers
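The totals on this slide are consistent with one training example per (kmer, patient) pair: 48002 = 2 × 24001 and 120005 = 5 × 24001. The sketch below builds examples of that shape. How the synthetic data was actually generated is not stated on the slides, so the generator here (random gene sequences, one random expression level per gene and patient, every kmer inheriting its gene's level) is purely an assumption:

```python
import random

def make_examples(genes, expression, k):
    """Build one (kmer, patient, count) example per kmer and patient.

    genes: dict of gene name -> sequence
    expression: dict of patient id -> dict of gene name -> expression level
    Every kmer of a gene receives that gene's expression level as its count.
    """
    examples = []
    for patient, levels in expression.items():
        for name, seq in genes.items():
            for i in range(len(seq) - k + 1):
                examples.append((seq[i:i + k], patient, levels[name]))
    return examples

rng = random.Random(0)
# Three toy genes of length 30 and two toy patients, invented for this example.
genes = {f"gene_{g}": "".join(rng.choice("ACGT") for _ in range(30)) for g in range(3)}
expression = {p: {name: rng.randint(0, 100) for name in genes} for p in (1, 2)}

examples = make_examples(genes, expression, k=24)
```

Each toy gene yields 30 - 24 + 1 = 7 kmers, so 3 genes and 2 patients give 42 examples.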

11 of 24

Data

Canonical sequence embedding task:

  • 24 RNA-Seq samples of heart and uterus from Genotype-Tissue Expression (GTEx) (Lonsdale et al., 2013).

Reference-free sequence embedding task:

  • 149 RNA-Seq samples from the Acute Myeloid Leukemia cohort of The Cancer Genome Atlas (TCGA) (Ley et al., 2013).

12 of 24

Real data (just 1 sample)

Only 1 patient

~ 25 thousand genes

230k kmers (length 24)

Real data (1 patient)

Synthetic data (1 patient)

5 genes

1 patient

24k kmers

13 of 24

Experiment with 2 real genes

We pick 2 genes: ZFY, MYH6

We pick 2 tissues: heart and uterus

What we expect in terms of gene expression:

Genes   Heart             Uterus
        Male    Female    Male    Female
ZFY     +       -         (+)     -
MYH6    +       +         (-)     -

Hack! We filter the FASTQ files based on kmers from the reference sequences for both genes
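This filtering step can be sketched as: collect all kmers from the reference sequences of the genes of interest, then keep only the reads that share at least one kmer with that set. The sketch assumes 4-line FASTQ records already read into a list of lines and ignores reverse complements; the function names and toy records are made up for illustration:

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_fastq(lines, reference_seqs, k):
    """Yield FASTQ records (4 lines each) whose read shares a kmer with a reference."""
    wanted = set()
    for ref in reference_seqs:
        wanted |= kmers(ref, k)
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if kmers(seq, k) & wanted:
            yield header, seq, plus, qual

# Toy FASTQ content and reference, invented for this example.
fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "TTTTTTTT", "+", "IIIIIIII",
]
kept = list(filter_fastq(fastq, reference_seqs=["GGACGTAC"], k=5))
```

Only `read1` survives here, because it shares the kmers `ACGTA` and `CGTAC` with the reference.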

14 of 24

Embedding of kmers with their count, for each sample

15 of 24

16 of 24

Hack #2! We use aligned BAM files to filter kmers by gene region.
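A sketch of the region filter, written over already-parsed alignment records rather than a real BAM reader so that it stays self-contained. It assumes ungapped alignments (the read covers exactly its own length on the reference), 0-based half-open coordinates, and hypothetical gene names and positions:

```python
def overlaps(start, end, region_start, region_end):
    """True if half-open intervals [start, end) and [region_start, region_end) overlap."""
    return start < region_end and region_start < end

def kmers_in_regions(alignments, regions, k):
    """Collect kmers from aligned reads that overlap a gene region.

    alignments: iterable of (chrom, start, sequence), 0-based start
    regions: dict of gene name -> (chrom, start, end), half-open
    """
    by_gene = {name: [] for name in regions}
    for chrom, start, seq in alignments:
        end = start + len(seq)
        for name, (r_chrom, r_start, r_end) in regions.items():
            if chrom == r_chrom and overlaps(start, end, r_start, r_end):
                by_gene[name].extend(seq[i:i + k] for i in range(len(seq) - k + 1))
    return by_gene

# Toy alignments and a made-up region; not real gene coordinates.
alignments = [("chr1", 100, "ACGTAC"), ("chr1", 500, "TTTTTT"), ("chr2", 100, "GGGGGG")]
regions = {"gene_a": ("chr1", 90, 120)}

result = kmers_in_regions(alignments, regions, k=4)
```

In practice the records would come from a BAM library, and spliced or gapped alignments would need their reference span computed from the CIGAR string instead of the read length.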

17 of 24

Does it align to known genes?

18 of 24

Three example patients

The exons of two known genes colored

Variation not observable via standard pipeline

19 of 24

How does the method react to homologous genes?

ZFY appears as an optional gene!

20 of 24

How can we use these embeddings to find translocations?

21 of 24

How can we use these embeddings to find translocations?

Idea: Some unique kmers (to the population) should span two genes

22 of 24

CTCTTGCATCACCCAGGGGAAAGCCATTGAGACCCAGA

We expect to see these kmers

We discover the actual translocation sequence by looking at the unique kmers!
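The idea can be sketched as a set difference followed by a split test: take the kmers seen in one sample but nowhere else in the population, then keep those whose prefix comes from one gene and whose suffix comes from the other. The function names, the `min_anchor` parameter and the toy sequences are assumptions made for illustration:

```python
def kmer_set(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def junction_kmers(sample_kmers, population_kmers, gene_a, gene_b, min_anchor=3):
    """Return (kmer, split) pairs for sample-unique kmers that span both genes.

    A kmer qualifies if it can be cut so that one side occurs in gene_a and the
    other in gene_b, with at least `min_anchor` bases on each side.
    """
    hits = []
    for kmer in sorted(sample_kmers - population_kmers):
        for split in range(min_anchor, len(kmer) - min_anchor + 1):
            left, right = kmer[:split], kmer[split:]
            if (left in gene_a and right in gene_b) or (left in gene_b and right in gene_a):
                hits.append((kmer, split))
                break
    return hits

# Toy genes and a toy fusion of their halves, invented for this example.
gene_a = "AAACCCAAACCC"
gene_b = "GGGTTTGGGTTT"
fusion = gene_a[:6] + gene_b[6:]

k = 6
population = kmer_set(gene_a, k) | kmer_set(gene_b, k)  # kmers seen in other samples
hits = junction_kmers(kmer_set(fusion, k), population, gene_a, gene_b)
```

Chaining overlapping junction kmers would then recover the sequence around the breakpoint.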

23 of 24

Scaling up!

We heavily restricted the number of kmers by filtering to known gene regions.

Without pre-filtering the BAM files, each sample would generate approximately 10-30 billion kmers (compared to 125,000 per sample in the current dataset).

While we used the pre-aligned BAM file to filter our regions of interest, our goal is to optimize this model to move away from reference genomes entirely.

24 of 24

Pr. Yoshua Bengio, PhD

Francis Dutil

Martin Weiss

Tristan Sylvain

Margaux Luck, PhD

Assya Trofimov

Vincent Frappier, PhD

Joseph Paul Cohen, PhD

Shawn Tan

Sina Honari

Geneviève Boucher

Mandana Samiei

Georgy Derevyanko, PhD