RNA-seq and Transcriptomics:
How to experimentally determine gene expression?
Why do we sequence RNA?
Functional studies
Transcript sequences cannot yet be predicted reliably from genome sequence alone
Some molecular features exist only for RNA
RNA-seq
Challenges
rRNA depletion methods
Typical quality metrics: RIN (RQI), DV200
Common aims of RNA-Seq analysis (what can you ask of the data?)
What are the requirements for the experiment?
How many reads do we need?
Expression profiling / differential expression: 5-10 million reads
Alternative splicing, quantifying cSNPs: 50-100 million reads
De novo transcriptome assembly: 100-1000 million reads
What errors can occur and how do we get rid of them?
Replicates
Multiple isolations of cells showing the same phenotype, stage or other experimental condition (Environmental Factors, Growth Conditions, Time)
Correlation coefficient between replicates: 0.9
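Replicate agreement can be checked by correlating per-gene expression between two replicates. The sketch below is only an illustration: the slide does not say which coefficient is meant, so Pearson correlation on log2(count + 1) values is an assumption, the counts are made up, and the 0.9 default simply reuses the number on the slide.

```python
import math

def log2p1(counts):
    """Log-transform raw counts so a few highly expressed genes do not dominate."""
    return [math.log2(c + 1) for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def replicates_agree(rep_a, rep_b, threshold=0.9):
    """True if the log-scale correlation reaches the chosen threshold."""
    return pearson(log2p1(rep_a), log2p1(rep_b)) >= threshold

# Hypothetical per-gene read counts (same gene order in every list).
rep1 = [10, 200, 0, 55, 1200, 8]
rep2 = [12, 180, 1, 60, 1100, 5]      # similar profile to rep1
shuffled = [1200, 8, 55, 0, 10, 200]  # unrelated profile

print(replicates_agree(rep1, rep2))
print(replicates_agree(rep1, shuffled))
```

A rank-based coefficient (Spearman) would be an equally reasonable choice; which one is used should be stated alongside the reported value.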
Types of replicates
What do we do with the reads next?
Three RNA-seq mapping strategies
Diagrams from Cloonan & Grimmond, Nature Methods 2010
De novo assembly
Align to transcriptome
Align to reference genome
Alignment
Kim et al. 2015. Nat Methods 12:357–360
rnabio.org
Assembly
Haas et al. (2013), https://www.nature.com/articles/nprot.2013.084
Splicing
Pseudoalignment
Bray et al. 2016, doi:10.1038/nbt.3519; https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html
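The core idea of pseudoalignment can be illustrated without any base-level alignment: look up which transcripts contain each k-mer of a read and intersect those sets. This is a deliberately simplified sketch, not kallisto's implementation (which works on a transcriptome de Bruijn graph). The transcript sequences, the names `tx1`/`tx2`, the value of k, and the rule of skipping unknown k-mers are all assumptions made for illustration.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            # k-mer absent from the index (e.g. a sequencing error):
            # skipped in this sketch rather than rejecting the read.
            continue
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Two hypothetical isoforms sharing their first 10 bases.
transcripts = {
    "tx1": "ATGGCGTACCGATTACAGGT",
    "tx2": "ATGGCGTACCCCTGAAGTCA",
}
K = 5
index = build_kmer_index(transcripts, K)

print(pseudoalign("ATGGCGTA", index, K))    # read from the shared region
print(pseudoalign("TACCGATTAC", index, K))  # read reaching into tx1-specific sequence
print(pseudoalign("GGGGGGGG", index, K))    # read matching nothing
```

The result for each read is a set of compatible transcripts rather than a coordinate, which is all that is needed for quantification.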
Which alignment strategy is best?
Should I use a splice-aware or unspliced mapper?
HISAT2
HISAT2 algorithm
Read types
Output of HISAT2
https://pmc.ncbi.nlm.nih.gov/articles/PMC5600148/
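HISAT2 writes its alignments in SAM format, where a read spanning an intron carries an `N` operation in its CIGAR string. As a rough sketch of how spliced reads can be recognised in that output, the function below extracts intron coordinates from the POS and CIGAR fields. The SAM line is a made-up example, and the helper names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive.

    pos is the 1-based leftmost reference position from the SAM POS field.
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference = intron
            junctions.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return junctions

def junctions_from_sam_line(line):
    """Parse one SAM alignment line (tab-separated) and return its junctions."""
    fields = line.rstrip("\n").split("\t")
    pos, cigar = int(fields[3]), fields[5]
    return splice_junctions(pos, cigar)

# Hypothetical alignment: 50 bases in one exon, a 200-base intron, 50 bases in the next exon.
sam_line = "read1\t0\tchr1\t100\t60\t50M200N50M\t*\t0\t0\t*\t*"
print(junctions_from_sam_line(sam_line))
```

An unspliced mapper cannot produce the `N` operation, so reads crossing exon-exon boundaries would be left unaligned or soft-clipped when mapped to a genome.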
RNA-seq data analysis
Quality control
Good vs bad quality
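Read quality in FASTQ files is stored as one character per base. A minimal sketch of decoding it, assuming the common Phred+33 encoding (older Illumina files may use a different offset); the example quality strings are invented.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def mean_quality(quality_string, offset=33):
    """Average Phred score of one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

good_read = "IIIIIIIIII"  # every base has Phred 40
poor_read = "IIIII#####"  # quality collapses to Phred 2 in the second half

print(mean_quality(good_read), mean_quality(poor_read))
print(error_probability(20))  # Phred 20 corresponds to a 1% error rate
```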
Duplicated reads
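A simple way to get a feel for duplication is to count reads with identical sequence. This sketch is a simplification: it compares sequences only, whereas duplicate marking on aligned data is usually based on mapping positions. The read list is hypothetical.

```python
from collections import Counter

def duplication_rate(sequences):
    """Fraction of reads that repeat a sequence already seen."""
    if not sequences:
        return 0.0
    unique = len(Counter(sequences))
    return (len(sequences) - unique) / len(sequences)

reads = ["ACGT", "ACGT", "TTGA", "CCAT"]  # hypothetical reads
print(duplication_rate(reads))
```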
Bowtie/TopHat/Cufflinks/DESeq2 RNA-seq Pipeline
Inputs:
- Raw sequence data (.fastq files)
- Reference genome (.fa file)
- Gene annotation (.gtf file)
Steps:
- Sequencing: RNA-seq reads
- Bowtie/TopHat alignment (genome): read alignment
- Cufflinks: transcript compilation
- Cufflinks (cuffmerge): gene identification
- CuffDiff/DESeq2 (A:B comparison): differential expression
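The order of the steps can be made concrete by assembling the command lines without running them. This is a sketch under assumptions: all file and directory names are hypothetical, one FASTQ file per condition is assumed, and the options shown (`-G` and `-o` for tophat, `-o` for cufflinks, `-g`, `-s`, `-o` for cuffmerge, `-o` for cuffdiff) should be checked against the installed tool versions before use.

```python
def build_pipeline(bowtie_index, genome_fa, annotation_gtf, fastq_by_sample):
    """Return (commands, assembly_list) for a TopHat/Cufflinks style run.

    commands is a list of argv lists in execution order; assembly_list holds
    the per-sample GTF paths that must be written, one per line, into the
    text file given to cuffmerge.
    """
    commands = []
    bams = []
    assembly_list = []
    for sample, fastq in fastq_by_sample.items():
        align_dir = f"{sample}_tophat"
        # Read alignment to the genome, guided by the known annotation.
        commands.append(["tophat", "-G", annotation_gtf, "-o", align_dir,
                         bowtie_index, fastq])
        bam = f"{align_dir}/accepted_hits.bam"
        bams.append(bam)
        # Transcript compilation for this sample.
        assembly_dir = f"{sample}_cufflinks"
        commands.append(["cufflinks", "-o", assembly_dir, bam])
        assembly_list.append(f"{assembly_dir}/transcripts.gtf")
    # Merge the per-sample assemblies into one transcript set.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fa,
                     "-o", "merged", "assemblies.txt"])
    # A:B comparison on the merged transcript set.
    commands.append(["cuffdiff", "-o", "diff_out", "merged/merged.gtf", *bams])
    return commands, assembly_list

commands, assembly_list = build_pipeline(
    "genome_index", "genome.fa", "genes.gtf",
    {"A": "A.fastq", "B": "B.fastq"},
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.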
RNA-seq data analysis
Bioinformatics challenges
Alignment (TopHat)
TopHat results
How do we quantify expression from RNA-seq?
RPKM: Reads Per Kilobase per Million mapped reads (Mortazavi et al. Nature Methods 2008)
What is FPKM (RPKM)?
Reads Per Kilobase of transcript per Million mapped reads
What is FPKM?
Fragments Per Kilobase of transcript per Million mapped reads.
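Written out as a formula, the definition reads as follows. The symbols are chosen here for illustration and are not from the slides: X_i is the number of fragments assigned to transcript i, L_i is its length in bases, and N is the total number of mapped fragments in the sample.

```latex
\mathrm{FPKM}_i
  = \frac{X_i}{\left(\dfrac{L_i}{10^{3}}\right)\left(\dfrac{N}{10^{6}}\right)}
  = \frac{10^{9}\, X_i}{L_i \, N}
```

RPKM has the same form with single-end reads counted instead of fragments.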
How do FPKM and TPM differ?
FPKM
TPM
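The difference is the order of normalisation: FPKM divides by sequencing depth and then by length, while TPM divides by length first and then rescales so the values sum to one million. A small sketch with invented counts and lengths, assuming the total number of mapped fragments equals the sum of the counts shown:

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]

# Hypothetical sample: three transcripts with the same coverage per base.
counts = [100, 200, 700]
lengths = [1000, 2000, 7000]  # in bases

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

print(fpkm_values, sum(fpkm_values))
print(tpm_values, sum(tpm_values))
```

TPM values always sum to one million within a sample, so a transcript's TPM is its share of the transcript pool; the FPKM sum varies between samples, which makes FPKM values harder to compare across samples.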
Differential Gene Expression Analysis
CuffDiff:
- t-test (the significance threshold can be set)
- replicates encouraged but not required
- can report differential splicing and promoter usage
DESeq:
- models read counts with a negative binomial distribution
- fits a generalized linear model (GLM): more than two groups can be tested; Wald test
- models variability between replicates
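Why a negative binomial rather than a Poisson model: replicate counts usually vary more than their mean, and the negative binomial captures this with a dispersion parameter, variance = mean + dispersion × mean². The sketch below estimates that dispersion by the method of moments. It is not the estimator DESeq uses, it assumes the counts are already normalised for library size, and the replicate counts are invented.

```python
from statistics import mean, variance

def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2

def moment_dispersion(counts):
    """Method-of-moments dispersion estimate for one gene across replicates.

    Returns 0.0 when the counts are no more variable than a Poisson model
    would predict (sample variance <= mean).
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    v = variance(counts)  # sample variance, needs at least 2 replicates
    return max(0.0, (v - m) / m ** 2)

# Hypothetical normalised counts for one gene in five replicates.
gene_counts = [90, 100, 110, 140, 60]

alpha = moment_dispersion(gene_counts)
print(mean(gene_counts), variance(gene_counts), alpha)
print(nb_variance(mean(gene_counts), alpha))
```

With only a few replicates such per-gene estimates are noisy, which is one reason replicates matter for this kind of test.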