1 of 45

RNA-seq and Transcriptomics:

How to experimentally determine gene expression?

2 of 45

Why do we sequence RNA?

Functional studies

    • Genome is the same for the organism but different cells have different genes expressed;
    • Experiment conditions may have a pronounced effect on gene expression (e.g. Drug treated vs. untreated cell line; wild type versus knock out)

Predicting transcript sequence from genome sequence is impossible at the current stage

    • Alternative splicing, RNA editing, etc.

Some molecular features exist only for RNA

    • Alternative isoforms, fusion transcripts, RNA editing, etc.

3 of 45

RNA-seq

  • RNA (quite often poly A+) converted to a library of cDNA fragments with adaptors
  • Molecules sequenced from one end (Single End) or both ends (Pair End)
  • Reads of 30-400bp depending on sequence technology

4 of 45

Challenges

  • Samples
    • Purity, quantity, quality
  • RNA is fragile compared to DNA (easily degraded)
  • Most eukaryptic RNAs are spliced (consist of exons separated by introns in DNA)
    • Mapping reads to genome is a problem
  • The relative fraction of particular RNAs vary wildly
    • 1 – 10^7 molecules per cell (lncRNA vs ribosomal RNA)
    • As sequencing is based on sampling, highly expressed genes covered by majority of reads (ribosomal genes)
    • Cannot estimate the overall decrease/increase of transcription
  • RNAs come in a wide range of sizes
    • Small RNAs must be captured separately

5 of 45

rRNA depletion methods

6 of 45

Типичный параметр: RIN (RQI), DV200

7 of 45

Common aims of RNA-Seq analysis (what can you ask of the data?)

  • Gene expression and differential expression
  • Isoforms expression, alternative splicing
  • Novel transcripts discovery and annotation
  • Allele specific expression
  • Link to known SNPs or mutations
  • SNP/mutation discovery
  • Fusion detection
  • RNA editing

8 of 45

Какие требования к эксперименту?

9 of 45

Сколько нам нужно чтений?

Expression Profiling / Differential expression 5-10 Million

Alternative splicing, quantifying cSNPs 50-100 Million

De Novo Transcriptome Assembly 100-1000 Million

10 of 45

Какие могут быть ошибки и как от них избавиться?

11 of 45

Replicates

Multiple isolations of cells showing the same phenotype, stage or other experimental condition (Environmental Factors, Growth Conditions, Time)

Correlation Coefficient 0.9

12 of 45

Типы реплик

13 of 45

Что дальше делаем с чтениями?

14 of 45

Three RNA-seq mapping strategies

Diagrams from Cloonan & Grimmond, Nature Methods 2010

https://www.nature.com/articles/nmeth.2714

De novo assembly

Align to transcriptome

Align to reference genome

15 of 45

Alignment

1

15

Module 2

  • Uses a reference genome/transcriptome to map reads
  • Capable of some novel transcript inference
  • Relatively fast runtime
  • Tools: HISAT2, STAR, GSNAP, .... and many others

Kim et al. 2015. Nat Methods 12:357–360

rnabio.org

16 of 45

Assembly

  • Infer transcript structure directly from the data
  • Useful when you do not have a reference sequence
  • Other uses – highly rearranged genomes (some cancers)
  • Computationally expensive
  • Tools: Trinity, Velvet, SPAdes

1

16

Module 2

17 of 45

Splicing

Module 5

17

rnabio.org

1

18 of 45

Splicing

Module 5

18

rnabio.org

1

19 of 45

Pseudoalignment

1

19

Module 2

  • Does not determine where in the genome a read lies, only which transcripts it is compatible with
  • Does not produce a bam by default (though pseudo-bams can be created), not useful for variant detection.
  • Tools:
    • Kallisto, Sailfish, Salmon (one of modes)
      • The comparison of the sequencing reads to the transcripts is done using a transcriptome de Bruijn graph
      • the graph is constructed from the k-mers present in an input transcriptome as opposed to reads which is done normally for genome/transcriptome assembly.

Bray, 2016 doi:10.1038/nbt.3519 https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html

rnabio.org

20 of 45

Which alignment strategy is best?

  • De novo assembly
    • If a reference genome does not exist for the species being studied
    • If complex polymorphisms/mutations/haplotypes might be missed by comparing to the reference genome

  • Align to transcriptome
    • If you have short reads (< 50bp)
    • Relies on known transcripts

  • Align to reference genome
    • All other cases
    • Does not rely on known transcripts – allows for discovery

  • Each strategy involves different alignment/assembly tools

21 of 45

Should I use a splice-aware or unspliced mapper?

  • The fragments being sequenced in RNA-seq represent mRNA - introns are removed

  • But we are usually aligning these reads back to the reference genome

  • Unless your reads are short (<50bp) you should use a splice-aware aligner
    • HISAT2, STAR, MapSplice, etc.

22 of 45

HISAT2

  • HISAT is a ‘splice-aware’ RNA-seq read aligner
    • HISAT = Hierarchical Indexing for Spliced Alignments of Transcripts
  • Requires a reference genome
  • Very fast

  • Uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index
  • Multiple types of indexes for alignment
    • a whole-genome FM index to anchor each alignment
    • numerous local FM indexes for very rapid extensions of these alignments.
    • Whole-genome indices with SNPs and known transcript structures accounted for

23 of 45

HISAT2 algorithm

  • Uses a hierarchical indexing algorithm + several adaptive strategies
    • based on the position of a read with respect to splice sites
  • Find candidate locations across the whole genome first
    • mapping part of each read using the global FM index
    • Generally identifies one or a small number of candidates.
  • Do local alignment
    • selects one of ~48,000 local indexes for each candidate
    • uses it to align the remainder of the read.
  • For paired reads, each mate is separately aligned
    • If a read fails to align, then the alignments of its mate are used as anchors to map the unaligned mate

24 of 45

Read types

25 of 45

Output of HISAT2

  • A SAM/BAM file
    • SAM stands for Sequence Alignment/Map format
    • BAM is the binary version of a SAM file

  • Remember, compressed files require special handling compared to plain text files

  • How can I convert BAM to SAM?
    • Samtools

26 of 45

https://pmc.ncbi.nlm.nih.gov/articles/PMC5600148/

27 of 45

RNA-seq data analysis

28 of 45

Quality control

29 of 45

Good vs bad quality

30 of 45

Duplicated reads

31 of 45

Bowtie/Tophat/Cufflinks/DESeq2 �RNA-seq Pipeline

RNA-seq reads

Sequencing

Bowtie/TopHat alignment (genome)

Read alignment

Cufflinks

Transcript compilation

Cufflinks (cuffmerge)

Gene identification

CuffDiff/DESeq2

(A:B comparison)

Differential expression

Gene annotation

(.gtf file)

Reference genome

(.fa file)

Raw sequence data

(.fastq files)

Inputs

32 of 45

Анализ данных RNA-seq

33 of 45

  1. Merge re-sequenced FastQ files (cat)
  2. Auto-infer strandedness by subsampling and pseudoalignment (fq, Salmon)
  3. Read QC (FastQC)
  4. UMI extraction (UMI-tools)
  5. Adapter and quality trimming (Trim Galore!)
  6. Removal of genome contaminants (BBSplit)
  7. Removal of ribosomal RNA (SortMeRNA)
  8. Choice of multiple alignment and quantification routes (For STAR the sentieon implementation can be chosen):
    1. STAR -> Salmon
    2. STAR -> RSEM
    3. HiSAT2 -> NO QUANTIFICATION
  9. Sort and index alignments (SAMtools)
  10. UMI-based deduplication (UMI-tools)
  11. Duplicate read marking (picard MarkDuplicates)
  12. Transcript assembly and quantification (StringTie)
  13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
  14. Extensive quality control:
  15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
  16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)

34 of 45

Bioinformatics challenges

  • Splice junctions
  • Several overlapping transcripts from one gene
  • Non-uniquely mapped reads
  • Transcripts of different length

35 of 45

Alignment (Tophat)

36 of 45

Tophat results

37 of 45

38 of 45

How do we quantify expression from RNA-seq?

RPKM: Reads per Kb million (Mortazavi et al. Nature Methods 2008)

  • Longer and more highly expressed transcripts are more likely be represented among RNA-seq reads
  • RPKM normalizes by transcript length and the total number of reads captured and mapped in the experiment
  • Sequencing depth can alter RPKM values

39 of 45

What is FPKM (RPKM)?

Module 3

39

rnabio.org

1

  • RPKM:reads.

Reads Per Kilobase of transcript per Million mapped

  • FPKM: Fragments Per Kilobase of transcript per Million mapped reads.

  • No essential difference - Just a terminology change to better describe paired-end reads!

40 of 45

What is FPKM?

Module 3

40

rnabio.org

  • Why not just count reads in my RNAseq data?
  • The relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However:
  • # fragments is biased towards larger genes
  • # fragments is related to total library depth

Fragments

Per Kilobase of transcript

per Million mapped reads.

1

41 of 45

What is FPKM?

Module 3

41

rnabio.org

1

  • FPKM attempts to normalize for gene size and library depth
    • remember – RPKM is essentially the same!

  • C = number of mappable fragments for a gene (transcript)
  • N = total number of mappable fragments in the library
  • L = number of base pairs in the gene (transcript)
    • FPKM = (C / (N x L) ) x 1,000 x 1,000,000
    • FPKM = (1,000,000,000 x C) / (N x L)
    • FPKM = (C / (N / 1,000,000)) / (L/1000)

42 of 45

How do FPKM and TPM differ?

Module 3

42

rnabio.org

1

  • TPM: Transcript per Kilobase Million
  • The difference is in the order of operations:

FPKM

  1. Determine total fragment count, divide by 1,000,000 (per Million)
  2. Divide each gene/transcript fragment count by #1 (Fragments Per Million)
  3. Divide each FPM by length of each gene/transcript in kilobases (FPKM)

TPM

  1. Divide each gene/transcript fragment count by length of the transcript in kilobases (Fragments Per Kilobase)

  • Sum all FPK values for the sample and divide by 1,000,000 (per Million)
  • Divide #1 by #2 (TPM)

43 of 45

44 of 45

Differential Gene Expression Analysis

CuffDiff:

-t-test (one can set a threshold)

-replicates encouraged but not needed

-can provide differential splicing and promoter usage

DESeq:

- counts reads as following a negative binomial distribution

- fit a generalized linear model (GLM): more then two groups can be tested; Wald test

- models variability between replicates

45 of 45