Towards the Latent Transcriptome
https://arxiv.org/abs/1810.03442
Assya Trofimov, Francis Dutil, Claude Perreault, Sebastien Lemieux, Yoshua Bengio, Joseph Paul Cohen
Joint work IRIC + Mila
Motivation
Index | kmer | count | Patient id |
1 | ACGT...GT | 100 | 1 |
2 | AAAA...AA | 24k | 1 |
3 | ACGT...CC | 1 | 1 |
... | ... | ... | ... |
7 billion | TCAA...GT | 58 | 1 |
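As a rough illustration of how a kmer table like the one above could be produced, the sketch below cuts reads into overlapping kmers and counts them for one patient. The function name `count_kmers`, the toy reads and the choice of k = 4 are assumptions made for this example; real pipelines use much longer kmers and dedicated counting tools.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every overlapping substring of length k across a list of reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read, one base at a time.
        # Reads shorter than k contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy reads for one hypothetical patient (not real data).
reads = ["ACGTACGT", "CGTACG"]
table = count_kmers(reads, k=4)

# Print rows in the same layout as the slide: index | kmer | count | patient id
for index, (kmer, count) in enumerate(sorted(table.items()), start=1):
    print(index, kmer, count, 1)
```

With 7 billion distinct kmers per patient, an in-memory `Counter` would not be practical; the point here is only the windowing and counting logic.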
Alignment / Pseudo-Alignment
Index | gene | count | Patient id |
1 | name_1 | 5k | 1 |
2 | name_2 | 100k | 1 |
3 | name_3 | 0 | 1 |
... | ... | ... | ... |
20k | name_x | 58 | 1 |
Hand-crafted features!
Deep learning traditionally replaces hand-crafted features
Yann LeCun
Template Matching
Image credit: scikit-image
"Gene"
Reads
Gene Expression
Discriminatively Trained Part-Based Models
Felzenszwalb, P. F., et al. “Object Detection with Discriminatively Trained Part-Based Models.” TPAMI. 2009
State-of-the-art computer vision in 2009!
Human
Bicycle
Aligned over the image
(Like string alignment)
Regions of the bike are connected like springs
Aka BLAST
Activation map of template
Problems with template matching
It can become very difficult to maintain!
Reads aligned with maximum likelihood
Reference Genome
Aggregation using hand-crafted gene annotations
Reads cut into kmers, binned into histograms
Kmers group into clusters representing genes
Standard pipeline
Latent Transcriptome
Creation of reads
Updated often
(heuristic)
Colors represent known genes
Grey represents what the standard pipeline leaves unexplained
Learned from data
Ignores mutations
read
Towards the Latent Transcriptome
Computing embeddings for kmers from raw RNA-seq data of the transcriptome.
No need for alignment to a reference genome!
An RNN maps each kmer to a point in a 2D space; this embedding is then used to predict the kmer's count, conditioned on the sample
RNN processes each kmer
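To make the data flow concrete, here is a minimal, untrained sketch of the idea: a recurrent network reads a kmer base by base, its final state is projected to a 2D embedding, and that embedding is combined with a per-sample vector to predict a count. Everything specific here is an assumption for illustration: the class name `TinyKmerModel`, the plain tanh recurrence, the layer sizes and the random weights. It is not the authors' architecture, and in practice the weights would be learned by regressing on observed counts.

```python
import math
import random

BASES = "ACGT"

def one_hot(base):
    """Encode a nucleotide as a 4-dimensional one-hot vector."""
    return [1.0 if base == b else 0.0 for b in BASES]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyKmerModel:
    """Untrained toy model: RNN over bases -> 2D kmer embedding -> count."""

    def __init__(self, n_samples, hidden=8, sample_dim=2, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.w_in = random_matrix(hidden, 4, rng)         # base -> hidden
        self.w_rec = random_matrix(hidden, hidden, rng)   # hidden -> hidden
        self.w_emb = random_matrix(2, hidden, rng)        # hidden -> 2D embedding
        self.sample_emb = random_matrix(n_samples, sample_dim, rng)  # one vector per sample
        self.w_out = random_matrix(1, 2 + sample_dim, rng)  # [kmer, sample] -> count

    def embed(self, kmer):
        """Run the recurrence over the kmer, then project the last state to 2D."""
        h = [0.0] * self.hidden
        for base in kmer:
            pre = [a + b for a, b in zip(matvec(self.w_in, one_hot(base)),
                                         matvec(self.w_rec, h))]
            h = [math.tanh(v) for v in pre]
        return matvec(self.w_emb, h)

    def predict_count(self, kmer, sample_id):
        """Predict a count from the kmer embedding concatenated with the sample vector."""
        features = self.embed(kmer) + self.sample_emb[sample_id]
        return matvec(self.w_out, features)[0]

model = TinyKmerModel(n_samples=2)
kmer = "ACGTACGTACGTACGTACGTACGT"  # length 24, as in the real-data slides
z = model.embed(kmer)
pred = model.predict_count(kmer, sample_id=1)
print(z, pred)
```

Because the kmer embedding does not depend on the sample, the same 2D point is shared across patients, while the predicted count changes with `sample_id`.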
Experiments on synthetic data
Gene 1
Gene 2
Gene 3
Results with a low number of patients:
48002 examples total
20 genes
2 patients
24k kmers
Gene 1
Gene 2
Gene 3
When more patients are used:
120005 examples total
20 genes
5 patients
24k kmers
Data
Canonical sequence embedding task:
Reference-free sequence embedding task:
Real data (just 1 sample)
Only 1 patient
~25 thousand genes
230k kmers (length 24)
Real data (1 patient)
Synthetic data (1 patient)
5 genes
1 patient
24k kmers
Experiment with 2 real genes
We pick 2 genes: ZFY, MYH6
We pick 2 tissues: heart and uterus
What we expect in terms of gene expression:
Genes | Heart, male | Heart, female | Uterus, male | Uterus, female |
ZFY | + | - | (+) | - |
MYH6 | + | + | (-) | - |
Hack! We filter the FASTQ files based on kmers from the reference sequences for both genes
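A minimal sketch of this kind of filter, assuming plain 4-line FASTQ records already loaded as a list of strings: a read is kept if it shares at least one kmer with any reference sequence. The helper names, the toy reference sequences (not real ZFY or MYH6 sequence) and k = 6 are all hypothetical, and reverse complements are ignored for brevity.

```python
def kmers_of(sequence, k):
    """Return the set of all overlapping kmers of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def filter_fastq_records(lines, reference_kmers, k):
    """Yield 4-line FASTQ records whose read shares a kmer with the references."""
    for start in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if kmers_of(sequence, k) & reference_kmers:
            yield header, sequence, plus, quality

# Hypothetical stand-ins for the two gene reference sequences.
references = {"geneA": "ACGTTGCAAGGC", "geneB": "TTTTCCCCGGGG"}
k = 6
reference_kmers = set().union(*(kmers_of(seq, k) for seq in references.values()))

fastq_lines = [
    "@read1", "GGACGTTGCATT", "+", "IIIIIIIIIIII",  # shares kmers with geneA
    "@read2", "ATATATATATAT", "+", "IIIIIIIIIIII",  # matches neither reference
]
kept = list(filter_fastq_records(fastq_lines, reference_kmers, k))
print([record[0] for record in kept])
```

Exact kmer matching means a read is dropped only if every one of its kmers differs from the references, so reads carrying a few mutations are still likely to pass the filter.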
Embedding of kmers with their count, for each sample
Hack #2! We use aligned BAM files to filter kmers by gene region.
Does it align to known genes?
Three example patients
The exons of two known genes are colored
Variation not observable via standard pipeline
How does the method react to homologous genes?
ZFY appears as an optional gene!
How can we use these embeddings to find translocations?
Idea: some kmers that are unique within the population should span two genes
CTCTTGCATCACCCAGGGGAAAGCCATTGAGACCCAGA
We expect to see these kmers
We discover the actual translocation sequence by looking at the unique kmers!
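The idea can be sketched on toy sequences as follows: take the kmers seen in one patient but nowhere else in the population, then keep those whose left part matches one gene and whose right part matches the other. This is only an illustration under stated assumptions: the function names, the toy genes, the simulated fusion and the `min_overlap` threshold are hypothetical, and the check uses known gene sequences directly rather than the learned embeddings.

```python
def kmers_of(sequence, k):
    """Return the set of all overlapping kmers of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def unique_to_sample(sample_kmers, population_kmers):
    """Kmers seen in this sample but in no other sample of the population."""
    return sample_kmers - population_kmers

def spans_two_genes(kmer, gene_a, gene_b, min_overlap):
    """True if the kmer splits into a piece found in one gene and a piece found in the other."""
    for cut in range(min_overlap, len(kmer) - min_overlap + 1):
        left, right = kmer[:cut], kmer[cut:]
        if (left in gene_a and right in gene_b) or (left in gene_b and right in gene_a):
            return True
    return False

# Toy genes and a simulated fusion transcript (hypothetical, not real sequence).
gene_a = "AAACCCAAACCC"
gene_b = "GGGTTTGGGTTT"
fusion = gene_a[:8] + gene_b[4:]  # "AAACCCAATTGGGTTT"
k = 8

population_kmers = kmers_of(gene_a, k) | kmers_of(gene_b, k)
patient_kmers = population_kmers | kmers_of(fusion, k)

candidates = {
    kmer
    for kmer in unique_to_sample(patient_kmers, population_kmers)
    if spans_two_genes(kmer, gene_a, gene_b, min_overlap=3)
}
print(sorted(candidates))
```

The surviving kmers all contain the junction between the two genes, so overlapping them would reconstruct the sequence around the breakpoint. On real data, sequencing errors also produce kmers unique to a sample, so a count threshold would likely be needed before this step.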
Scaling up!
We heavily restricted the number of kmers by filtering to known gene regions.
Without pre-filtering the BAM files, each sample would generate approximately 10-30 billion kmers (compared to 125,000 per sample in the current dataset).
While we used the pre-aligned BAM files to restrict kmers to our regions of interest, our goal is to optimize this model so that it no longer depends on a reference genome.
Pr. Yoshua Bengio, PhD
Francis Dutil
Martin Weiss
Tristan Sylvain
Margaux Luck, PhD
Assya Trofimov
Vincent Frappier, PhD
Joseph Paul Cohen, PhD
Shawn Tan
Sina Honari
Geneviève Boucher
Mandana Samiei
Georgy Derevyanko, PhD