1 of 45

RNA-seq and Transcriptomics:

How to experimentally determine gene expression?

2 of 45

Why do we sequence RNA?

Functional studies

Genome is the same for the organism but different cells have different genes expressed;
Experiment conditions may have a pronounced effect on gene expression (e.g. Drug treated vs. untreated cell line; wild type versus knock out)

Predicting transcript sequence from genome sequence is impossible at the current stage

Alternative splicing, RNA editing, etc.

Some molecular features exist only for RNA

Alternative isoforms, fusion transcripts, RNA editing, etc.

3 of 45

RNA-seq

RNA (quite often poly A+) converted to a library of cDNA fragments with adaptors
Molecules sequenced from one end (Single End) or both ends (Pair End)
Reads of 30-400bp depending on sequence technology

4 of 45

Challenges

Samples

Purity, quantity, quality

RNA is fragile compared to DNA (easily degraded)
Most eukaryptic RNAs are spliced (consist of exons separated by introns in DNA)

Mapping reads to genome is a problem

The relative fraction of particular RNAs vary wildly

1 – 10^7 molecules per cell (lncRNA vs ribosomal RNA)
As sequencing is based on sampling, highly expressed genes covered by majority of reads (ribosomal genes)
Cannot estimate the overall decrease/increase of transcription

RNAs come in a wide range of sizes

Small RNAs must be captured separately

5 of 45

rRNA depletion methods

6 of 45

Типичный параметр: RIN (RQI), DV200

7 of 45

Common aims of RNA-Seq analysis (what can you ask of the data?)

Gene expression and differential expression
Isoforms expression, alternative splicing
Novel transcripts discovery and annotation
Allele specific expression
Link to known SNPs or mutations
SNP/mutation discovery
Fusion detection
RNA editing

8 of 45

Какие требования к эксперименту?

9 of 45

Сколько нам нужно чтений?

Expression Profiling / Differential expression 5-10 Million

Alternative splicing, quantifying cSNPs 50-100 Million

De Novo Transcriptome Assembly 100-1000 Million

10 of 45

Какие могут быть ошибки и как от них избавиться?

11 of 45

Replicates

Multiple isolations of cells showing the same phenotype, stage or other experimental condition (Environmental Factors, Growth Conditions, Time)

Correlation Coefficient 0.9

12 of 45

Типы реплик

DOI:10.11613/BM.2021.010502

13 of 45

Что дальше делаем с чтениями?

14 of 45

Three RNA-seq mapping strategies

Diagrams from Cloonan & Grimmond, Nature Methods 2010

https://www.nature.com/articles/nmeth.2714

De novo assembly

Align to transcriptome

Align to reference genome

15 of 45

Alignment

Module 2

Uses a reference genome/transcriptome to map reads
Capable of some novel transcript inference
Relatively fast runtime
Tools: HISAT2, STAR, GSNAP, .... and many others

Kim et al. 2015. Nat Methods 12:357–360

rnabio.org

16 of 45

Assembly

Infer transcript structure directly from the data
Useful when you do not have a reference sequence
Other uses – highly rearranged genomes (some cancers)
Computationally expensive
Tools: Trinity, Velvet, SPAdes

Module 2

Haas, et al (2013) doi: https://www.nature.com/articles/nprot.2013.084

17 of 45

Splicing

Module 5

rnabio.org

18 of 45

Splicing

Module 5

rnabio.org

19 of 45

Pseudoalignment

Module 2

Does not determine where in the genome a read lies, only which transcripts it is compatible with
Does not produce a bam by default (though pseudo-bams can be created), not useful for variant detection.
Tools:

Kallisto, Sailfish, Salmon (one of modes)

The comparison of the sequencing reads to the transcripts is done using a transcriptome de Bruijn graph
the graph is constructed from the k-mers present in an input transcriptome as opposed to reads which is done normally for genome/transcriptome assembly.

Bray, 2016 doi:10.1038/nbt.3519 https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html

rnabio.org

20 of 45

Which alignment strategy is best?

De novo assembly

If a reference genome does not exist for the species being studied
If complex polymorphisms/mutations/haplotypes might be missed by comparing to the reference genome

Align to transcriptome

If you have short reads (< 50bp)
Relies on known transcripts

Align to reference genome

All other cases
Does not rely on known transcripts – allows for discovery

Each strategy involves different alignment/assembly tools

21 of 45

Should I use a splice-aware or unspliced mapper?

The fragments being sequenced in RNA-seq represent mRNA - introns are removed

But we are usually aligning these reads back to the reference genome

Unless your reads are short (<50bp) you should use a splice-aware aligner

HISAT2, STAR, MapSplice, etc.

22 of 45

HISAT2

HISAT is a ‘splice-aware’ RNA-seq read aligner

HISAT = Hierarchical Indexing for Spliced Alignments of Transcripts

Requires a reference genome
Very fast

Uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index
Multiple types of indexes for alignment

a whole-genome FM index to anchor each alignment
numerous local FM indexes for very rapid extensions of these alignments.
Whole-genome indices with SNPs and known transcript structures accounted for

https://www.nature.com/articles/s41587-019-0201-4 https://pmc.ncbi.nlm.nih.gov/articles/PMC4655817/

23 of 45

HISAT2 algorithm

Uses a hierarchical indexing algorithm + several adaptive strategies

based on the position of a read with respect to splice sites

Find candidate locations across the whole genome first

mapping part of each read using the global FM index
Generally identifies one or a small number of candidates.

Do local alignment

selects one of ~48,000 local indexes for each candidate
uses it to align the remainder of the read.

For paired reads, each mate is separately aligned

If a read fails to align, then the alignments of its mate are used as anchors to map the unaligned mate

25 of 45

Output of HISAT2

A SAM/BAM file

SAM stands for Sequence Alignment/Map format
BAM is the binary version of a SAM file

Remember, compressed files require special handling compared to plain text files

How can I convert BAM to SAM?

Samtools

26 of 45

https://pmc.ncbi.nlm.nih.gov/articles/PMC5600148/

27 of 45

RNA-seq data analysis

28 of 45

Quality control

29 of 45

Good vs bad quality

30 of 45

Duplicated reads

31 of 45

Bowtie/Tophat/Cufflinks/DESeq2 �RNA-seq Pipeline

RNA-seq reads

Sequencing

Bowtie/TopHat alignment (genome)

Read alignment

Cufflinks

Transcript compilation

Cufflinks (cuffmerge)

Gene identification

CuffDiff/DESeq2

(A:B comparison)

Differential expression

Gene annotation

(.gtf file)

Reference genome

(.fa file)

Raw sequence data

(.fastq files)

Inputs

32 of 45

Анализ данных RNA-seq

https://nf-co.re/rnaseq/3.21.0

33 of 45

Merge re-sequenced FastQ files (cat)
Auto-infer strandedness by subsampling and pseudoalignment (fq, Salmon)
Read QC (FastQC)
UMI extraction (UMI-tools)
Adapter and quality trimming (Trim Galore!)
Removal of genome contaminants (BBSplit)
Removal of ribosomal RNA (SortMeRNA)
Choice of multiple alignment and quantification routes (For STAR the sentieon implementation can be chosen):

STAR -> Salmon
STAR -> RSEM
HiSAT2 -> NO QUANTIFICATION

Sort and index alignments (SAMtools)
UMI-based deduplication (UMI-tools)
Duplicate read marking (picard MarkDuplicates)
Transcript assembly and quantification (StringTie)
Create bigWig coverage files (BEDTools, bedGraphToBigWig)
Extensive quality control:

RSeQC
Qualimap
dupRadar
Preseq
DESeq2
Kraken2 -> Bracken on unaligned sequences; optional

Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)

34 of 45

Bioinformatics challenges

Splice junctions
Several overlapping transcripts from one gene
Non-uniquely mapped reads
Transcripts of different length

35 of 45

Alignment (Tophat)

36 of 45

Tophat results

38 of 45

How do we quantify expression from RNA-seq?

RPKM: Reads per Kb million (Mortazavi et al. Nature Methods 2008)

Longer and more highly expressed transcripts are more likely be represented among RNA-seq reads
RPKM normalizes by transcript length and the total number of reads captured and mapped in the experiment
Sequencing depth can alter RPKM values

39 of 45

What is FPKM (RPKM)?

Module 3

rnabio.org

RPKM:reads.

Reads Per Kilobase of transcript per Million mapped

FPKM: Fragments Per Kilobase of transcript per Million mapped reads.

No essential difference - Just a terminology change to better describe paired-end reads!

40 of 45

What is FPKM?

Module 3

rnabio.org

Why not just count reads in my RNAseq data?

The relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However:

# fragments is biased towards larger genes

# fragments is related to total library depth

Fragments

Per Kilobase of transcript

per Million mapped reads.

41 of 45

What is FPKM?

Module 3

rnabio.org

FPKM attempts to normalize for gene size and library depth

remember – RPKM is essentially the same!

C = number of mappable fragments for a gene (transcript)
N = total number of mappable fragments in the library
L = number of base pairs in the gene (transcript)

FPKM = (C / (N x L) ) x 1,000 x 1,000,000
FPKM = (1,000,000,000 x C) / (N x L)
FPKM = (C / (N / 1,000,000)) / (L/1000)

42 of 45

How do FPKM and TPM differ?

Module 3

rnabio.org

TPM: Transcript per Kilobase Million
The difference is in the order of operations:

The sum of all TPMs in each sample is the same. Easier to compare across samples!
http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
https://www.ncbi.nlm.nih.gov/pubmed/22872506

FPKM

Determine total fragment count, divide by 1,000,000 (per Million)
Divide each gene/transcript fragment count by #1 (Fragments Per Million)
Divide each FPM by length of each gene/transcript in kilobases (FPKM)

TPM

Divide each gene/transcript fragment count by length of the transcript in kilobases (Fragments Per Kilobase)

Sum all FPK values for the sample and divide by 1,000,000 (per Million)
Divide #1 by #2 (TPM)

43 of 45

https://www.youtube.com/watch?v=TTUrtCY2k-w

44 of 45

Differential Gene Expression Analysis

CuffDiff:

-t-test (one can set a threshold)

-replicates encouraged but not needed

-can provide differential splicing and promoter usage

DESeq:

- counts reads as following a negative binomial distribution

- fit a generalized linear model (GLM): more then two groups can be tested; Wald test

- models variability between replicates

1 of 45

2 of 45

3 of 45

4 of 45

5 of 45

6 of 45

7 of 45

8 of 45

9 of 45

10 of 45

11 of 45

12 of 45

13 of 45

14 of 45

15 of 45

16 of 45

17 of 45

18 of 45

19 of 45

20 of 45

21 of 45

22 of 45

23 of 45

24 of 45

25 of 45

26 of 45

27 of 45

28 of 45

29 of 45

30 of 45

31 of 45

32 of 45

33 of 45

34 of 45

35 of 45

36 of 45

37 of 45

38 of 45

39 of 45

40 of 45

41 of 45

42 of 45

43 of 45

44 of 45

45 of 45