RNA-seq and Transcriptomics:
How to experimentally determine gene expression?
Based on slides of Ivan Antonov
Why do we sequence RNA?
Functional studies
Predicting transcript sequence from genome sequence is impossible at the current stage
Some molecular features exist only for RNA
RNA-seq
Challenges
rRNA depletion methods
Typical quality metrics: RIN (RQI), DV200
Common aims of RNA-Seq analysis (what can you ask of the data?)
What are the requirements for the experiment?
How many reads do we need?
Expression profiling / differential expression: 5-10 million reads
Alternative splicing, quantifying cSNPs: 50-100 million reads
De novo transcriptome assembly: 100-1000 million reads
What errors can occur, and how do we get rid of them?
Replicates
Multiple isolations of cells showing the same phenotype, stage or other experimental condition (Environmental Factors, Growth Conditions, Time)
Correlation coefficient between replicates: about 0.9 or higher
Types of replicates
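A minimal sketch of how the agreement between two replicates could be checked against the 0.9 guideline. It assumes per-gene read counts are already available as two equal-length lists; the function names and the log2 transform with a pseudocount of 1 are illustrative choices, not part of the original slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def replicate_correlation(counts_a, counts_b, pseudocount=1.0):
    """Correlate log2-transformed counts so that a few highly
    expressed genes do not dominate the coefficient."""
    log_a = [math.log2(c + pseudocount) for c in counts_a]
    log_b = [math.log2(c + pseudocount) for c in counts_b]
    return pearson(log_a, log_b)

# Hypothetical counts for four genes in two replicates
r = replicate_correlation([10, 200, 3000, 5], [12, 180, 3100, 4])
print(r > 0.9)
```

Whether Pearson or Spearman correlation is used, and on which scale, varies between guidelines; the threshold should be applied to whichever definition the lab has agreed on.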
RNA-seq data analysis
Quality control
Good vs bad quality
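As a rough illustration of what separates good from bad reads, the sketch below computes the mean Phred quality of a FASTQ quality string. It assumes the Phred+33 encoding (Sanger / Illumina 1.8+); the function names are hypothetical, and real QC tools report much more than a per-read mean.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def error_probability(phred_score):
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)

print(mean_phred("IIII"))  # 'I' encodes Q40, so the mean is 40.0
print(mean_phred("####"))  # '#' encodes Q2, so the mean is 2.0
```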
Duplicated reads
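A simple way to estimate the duplication level is to count reads with identical sequence. The sketch below does this for a list of read sequences already held in memory; `duplication_rate` is a hypothetical helper, and it treats only exact sequence matches as duplicates.

```python
from collections import Counter

def duplication_rate(sequences):
    """Fraction of reads that repeat an already seen sequence."""
    reads = list(sequences)
    if not reads:
        raise ValueError("no reads given")
    unique = len(Counter(reads))
    return (len(reads) - unique) / len(reads)

# Four reads, two distinct sequences: two of the four are duplicates
print(duplication_rate(["ACGT", "ACGT", "TTGA", "ACGT"]))  # 0.5
```

In RNA-seq, highly expressed genes naturally produce identical reads, so a high value here is not by itself proof of a PCR artifact.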
Bowtie/TopHat/Cufflinks/DESeq2 RNA-seq Pipeline
Inputs:
Raw sequence data (.fastq files)
Reference genome (.fa file)
Gene annotation (.gtf file)
Pipeline steps:
Sequencing: RNA-seq reads
Read alignment: Bowtie/TopHat alignment (genome)
Transcript compilation: Cufflinks
Gene identification: Cufflinks (cuffmerge)
Differential expression: CuffDiff/DESeq2 (A:B comparison)
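The pipeline steps can be sketched as a list of command lines. The Python helper below only builds the commands and does not run them. All file and directory names are hypothetical placeholders. The flags are the commonly documented ones for TopHat2 and Cufflinks and should be checked against the manuals of the installed versions.

```python
def build_pipeline_commands(genome_fa, annotation_gtf, fastq_by_sample,
                            index_prefix="genome_index"):
    """Return (commands, assembly_gtfs) for a TopHat/Cufflinks run.

    Each command is an argument list suitable for subprocess.run.
    Nothing is executed here.
    """
    # Build the Bowtie2 index that TopHat aligns against
    commands = [["bowtie2-build", genome_fa, index_prefix]]
    bam_files, assembly_gtfs = [], []
    for sample, fastq in fastq_by_sample.items():
        tophat_dir = f"{sample}_tophat"
        cufflinks_dir = f"{sample}_cufflinks"
        # Spliced alignment to the genome, guided by the annotation
        commands.append(["tophat", "-G", annotation_gtf,
                         "-o", tophat_dir, index_prefix, fastq])
        bam = f"{tophat_dir}/accepted_hits.bam"
        # Transcript assembly per sample
        commands.append(["cufflinks", "-o", cufflinks_dir, bam])
        bam_files.append(bam)
        assembly_gtfs.append(f"{cufflinks_dir}/transcripts.gtf")
    # cuffmerge expects a text file listing one assembly GTF per line;
    # the caller has to write assembly_gtfs into assemblies.txt first
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fa,
                     "-o", "merged_asm", "assemblies.txt"])
    # A:B comparison, one BAM file per condition in this sketch
    commands.append(["cuffdiff", "-o", "diff_out",
                     "merged_asm/merged.gtf", *bam_files])
    return commands, assembly_gtfs

cmds, gtfs = build_pipeline_commands(
    "genome.fa", "genes.gtf", {"A": "A.fastq", "B": "B.fastq"})
for cmd in cmds:
    print(" ".join(cmd))
```

Keeping the commands as data makes it easy to inspect or log the plan before anything is run. With DESeq2 in place of CuffDiff, the last step would be replaced by read counting followed by an R analysis.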
Bioinformatics challenges
Alignment (TopHat)
TopHat results
How do we quantify expression from RNA-seq?
RPKM: Reads Per Kilobase of transcript per Million mapped reads (Mortazavi et al., Nature Methods 2008)
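The RPKM definition can be written as a short function. This is a minimal sketch; the argument names are illustrative, and the gene length is assumed to be given in base pairs.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# 500 reads on a 2 kb gene in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Dividing by gene length makes genes of different length comparable within a sample, and dividing by library size makes samples of different depth comparable.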
Differential Gene Expression Analysis
CuffDiff:
- t-test (one can set a threshold)
- replicates encouraged but not needed
- can provide differential splicing and promoter usage
DESeq:
- models read counts as following a negative binomial distribution
- fits a generalized linear model (GLM): more than two groups can be tested; Wald test
- models variability between replicates
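To illustrate why a negative binomial model is used for replicate variability, the sketch below shows its mean-variance relationship, Var = mu + alpha * mu^2, where alpha is the dispersion. The method-of-moments estimate is a deliberately naive illustration and is not the estimator DESeq uses, which shares information across genes. Both function names are hypothetical.

```python
from statistics import mean, variance

def nb_variance(mu, dispersion):
    """Variance of a negative binomial with mean mu and dispersion alpha.

    With alpha = 0 this reduces to the Poisson case (variance = mean).
    """
    return mu + dispersion * mu ** 2

def moment_dispersion(counts):
    """Naive per-gene dispersion estimate from replicate counts:
    alpha = (sample variance - mean) / mean^2, floored at 0."""
    m = mean(counts)
    if m == 0:
        return 0.0
    return max(0.0, (variance(counts) - m) / m ** 2)

print(nb_variance(100, 0.25))             # 2600.0, versus 100 under Poisson
print(moment_dispersion([80, 100, 120]))  # about 0.03
```

The extra alpha * mu^2 term captures biological variation between replicates on top of the counting noise.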