1 of 22

RNA-seq and Transcriptomics:

How to experimentally determine gene expression?

Based on slides of Ivan Antonov

2 of 22

Why do we sequence RNA?

Functional studies

    • The genome is the same throughout the organism, but different cells express different genes;
    • Experimental conditions may have a pronounced effect on gene expression (e.g. drug-treated vs. untreated cell line; wild type vs. knockout)

Predicting transcript sequence from genome sequence is impossible at the current stage

    • Alternative splicing, RNA editing, etc.

Some molecular features exist only for RNA

    • Alternative isoforms, fusion transcripts, RNA editing, etc.

3 of 22

RNA-seq

  • RNA (quite often poly(A)+) is converted to a library of cDNA fragments with adaptors
  • Molecules are sequenced from one end (single-end) or both ends (paired-end)
  • Reads are 30-400 bp long, depending on the sequencing technology

4 of 22

Challenges

  • Samples
    • Purity, quantity, quality
  • RNA is fragile compared to DNA (easily degraded)
  • Most eukaryotic RNAs are spliced (they consist of exons that are separated by introns in the DNA)
    • Mapping reads that span splice junctions back to the genome is a problem
  • The relative abundance of particular RNAs varies widely
    • 1 – 10^7 molecules per cell (lncRNA vs ribosomal RNA)
    • As sequencing is based on sampling, highly expressed genes (e.g. ribosomal genes) take up the majority of reads
    • A global decrease or increase in transcription cannot be estimated
  • RNAs come in a wide range of sizes
    • Small RNAs must be captured separately
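
A rough sketch of the sampling point above, assuming reads are drawn strictly in proportion to molecule counts (transcript length and library-prep biases are ignored). The copy numbers and RNA names are made up for illustration and only span the 1 to 10^7 range quoted on the slide.

```python
def expected_reads(copies_per_cell, depth):
    """Expected reads per RNA if reads are sampled in proportion to abundance."""
    total = sum(copies_per_cell.values())
    return {name: depth * n / total for name, n in copies_per_cell.items()}


# Hypothetical copy numbers per cell (illustrative values, not measurements).
copies = {"rRNA": 10_000_000, "mRNA_A": 1_000, "lncRNA_B": 1}

# With 10 million reads, almost every read comes from rRNA,
# and the rare lncRNA is expected to receive less than one read.
reads = expected_reads(copies, depth=10_000_000)
for name, n in reads.items():
    print(f"{name}: {n:.2f} expected reads")
```

Under these assumptions nearly all reads go to rRNA, which is why rRNA is usually removed or poly(A)+ RNA is selected before sequencing.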

5 of 22

rRNA depletion methods

6 of 22

Typical metric: RIN (RQI), DV200

7 of 22

Common aims of RNA-Seq analysis (what can you ask of the data?)

  • Gene expression and differential expression
  • Isoform expression, alternative splicing
  • Novel transcript discovery and annotation
  • Allele-specific expression
  • Link to known SNPs or mutations
  • SNP/mutation discovery
  • Fusion detection
  • RNA editing

8 of 22

What are the requirements for the experiment?

9 of 22

How many reads do we need?

  • Expression profiling / differential expression: 5-10 million reads
  • Alternative splicing, quantifying cSNPs: 50-100 million reads
  • De novo transcriptome assembly: 100-1000 million reads

10 of 22

What errors can occur, and how do we get rid of them?

11 of 22

Replicates

Multiple isolations of cells showing the same phenotype, stage or other experimental condition (Environmental Factors, Growth Conditions, Time)

Correlation coefficient between replicates: 0.9
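
One way to check replicate agreement is to correlate per-gene counts between two replicates. The sketch below is an assumption about how such a check could look, not the method used in the slides: it uses Pearson correlation on log2(count + 1) values, and the count vectors are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def replicate_correlation(counts_a, counts_b):
    """Correlate log2(count + 1) so a few huge counts do not dominate."""
    log_a = [math.log2(c + 1) for c in counts_a]
    log_b = [math.log2(c + 1) for c in counts_b]
    return pearson(log_a, log_b)


# Invented per-gene read counts for two replicates of the same condition.
rep1 = [0, 12, 150, 980, 15000]
rep2 = [1, 9, 171, 1010, 14200]
r = replicate_correlation(rep1, rep2)
print(f"replicate correlation: {r:.3f}")
```

The log transform is a design choice: on raw counts a handful of very highly expressed genes can produce a high correlation even when most genes disagree.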

12 of 22

Types of replicates

13 of 22

RNA-seq data analysis

14 of 22

Quality control

15 of 22

Good vs bad quality
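
Read quality in a FASTQ file is stored as one character per base. As a minimal sketch, assuming the common Phred+33 encoding (older Illumina files may use a different offset), the characters can be decoded into Phred scores and error probabilities as follows. The two quality strings are invented examples.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_error_probability(quality_line):
    """Average per-base error probability, using P = 10^(-Q/10)."""
    scores = phred_scores(quality_line)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


good = "IIIIIIIIII"  # 'I' decodes to Phred 40 (error probability 0.0001)
bad = "IIIII#####"   # '#' decodes to Phred 2 (error probability about 0.63)

print(phred_scores(bad))
print(mean_error_probability(good), mean_error_probability(bad))
```

A read like the second one, where quality collapses toward the 3' end, is a typical candidate for trimming.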

16 of 22

Duplicated reads
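
A simple sequence-level estimate of duplication can be sketched as below. This is an illustrative simplification: it counts exact sequence matches, whereas dedicated tools typically identify duplicates by mapping position. The reads are invented.

```python
from collections import Counter


def duplication_rate(sequences):
    """Fraction of reads that are extra copies of an already seen sequence."""
    counts = Counter(sequences)
    extra_copies = sum(n - 1 for n in counts.values())
    return extra_copies / len(sequences)


# Invented reads: "ACGT" occurs three times, so two of five reads are extra copies.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC"]
rate = duplication_rate(reads)
print(f"duplication rate: {rate:.2f}")
```

In RNA-seq a high duplication rate is not necessarily a PCR artifact, because highly expressed transcripts naturally produce many identical fragments.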

17 of 22

Bowtie/TopHat/Cufflinks/DESeq2 RNA-seq Pipeline

Inputs

  • Raw sequence data (.fastq files)
  • Reference genome (.fa file)
  • Gene annotation (.gtf file)

Pipeline steps

  • Sequencing: RNA-seq reads
  • Read alignment: Bowtie/TopHat alignment (genome)
  • Transcript compilation: Cufflinks
  • Gene identification: Cufflinks (cuffmerge)
  • Differential expression: CuffDiff/DESeq2 (A:B comparison)
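
The pipeline on this slide can be sketched as a list of command lines. The helper below is hypothetical and only builds the commands without running them. All file and directory names are placeholders, and the options shown should be checked against the manuals of the installed TopHat and Cufflinks versions.

```python
def tuxedo_commands(index, genome_fa, annotation_gtf, samples):
    """Build TopHat/Cufflinks/cuffmerge/cuffdiff command lines.

    `samples` maps a condition label to a list of FASTQ files, one per replicate.
    """
    commands, bams = [], {}
    for label, fastqs in samples.items():
        bams[label] = []
        for i, fastq in enumerate(fastqs, start=1):
            run = f"{label}_rep{i}"
            # Read alignment: spliced alignment to the genome, guided by the annotation.
            commands.append(["tophat", "-G", annotation_gtf,
                             "-o", f"{run}_tophat", index, fastq])
            bam = f"{run}_tophat/accepted_hits.bam"
            # Transcript compilation: assemble transcripts for this sample.
            commands.append(["cufflinks", "-o", f"{run}_cufflinks", bam])
            bams[label].append(bam)
    # Gene identification: merge assemblies. assemblies.txt is assumed to list
    # the transcripts.gtf file produced by each cufflinks run.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fa,
                     "-o", "merged", "assemblies.txt"])
    # Differential expression: A:B comparison, replicates joined by commas.
    commands.append(["cuffdiff", "-o", "diff_out", "-L", ",".join(bams),
                     "merged/merged.gtf"]
                    + [",".join(paths) for paths in bams.values()])
    return commands


# Placeholder inputs: a Bowtie index prefix, a .fa genome, a .gtf annotation.
cmds = tuxedo_commands(
    "genome_index", "genome.fa", "genes.gtf",
    {"A": ["A_1.fastq", "A_2.fastq"], "B": ["B_1.fastq", "B_2.fastq"]},
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once the paths are real. When DESeq2 is used for the last step instead of CuffDiff, it takes a table of read counts per gene rather than BAM files.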

18 of 22

Bioinformatics challenges

  • Splice junctions
  • Several overlapping transcripts from one gene
  • Non-uniquely mapped reads
  • Transcripts of different length

19 of 22

Alignment (TopHat)

20 of 22

TopHat results

21 of 22

How do we quantify expression from RNA-seq?

RPKM: Reads Per Kilobase of transcript per Million mapped reads (Mortazavi et al., Nature Methods 2008)

  • Longer and more highly expressed transcripts are more likely to be represented among RNA-seq reads
  • RPKM normalizes by transcript length and the total number of reads captured and mapped in the experiment
  • Sequencing depth can alter RPKM values
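
Written out, RPKM = reads mapped to the transcript / (transcript length in kb × total mapped reads in millions). A minimal sketch of the calculation, with invented numbers:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / mapped_millions


# 500 reads on a 2 kb transcript in a library of 10 million mapped reads.
short_transcript = rpkm(500, 2_000, 10_000_000)

# Twice the reads on a transcript twice as long gives the same RPKM,
# which is the point of the length normalization.
long_transcript = rpkm(1_000, 4_000, 10_000_000)

print(short_transcript, long_transcript)
```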

22 of 22

Differential Gene Expression Analysis

CuffDiff:

  • t-test (one can set a threshold)
  • Replicates are encouraged but not required
  • Can also report differential splicing and promoter usage

DESeq2:

  • Models read counts as following a negative binomial distribution
  • Fits a generalized linear model (GLM), so more than two groups can be tested; uses the Wald test
  • Models variability between replicates
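
To see why a negative binomial model is used rather than a Poisson model, compare their variances. The sketch below uses the mean-variance relationship Var = μ + α·μ², where α is the dispersion. The mean and dispersion values are illustrative assumptions, not estimates from data.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance = mean + dispersion * mean^2 (extra between-replicate variability)."""
    return mean + dispersion * mean ** 2


mean_count = 1_000   # illustrative mean read count for one gene
dispersion = 0.1     # illustrative dispersion value

var_poisson = poisson_variance(mean_count)
var_nb = negative_binomial_variance(mean_count, dispersion)
print(var_poisson, var_nb)
```

With these numbers the negative binomial variance is about a hundred times the Poisson variance, which is the extra replicate-to-replicate variability the model has to absorb.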