1 of 49

June 6, 2025

Bulk RNA-Seq Analysis

MSM-NYGC Sequence Informatics Workshop

Heather Geiger, Rui Fu

@nygenome

@notsojunkdna

2 of 49

Outline

Protocols

Different protocols for different experiments
“mRNA”/”totalRNA”

Bioinformatics

Alignment
Gene Quantification
Alignment-free transcript quantification

Applications

Differential gene expression
Fusion gene discovery
Immune cell decomposition
Immuno-oncology, neoantigen, etc
Long reads technologies

4 of 49

Typical RNA-Seq process

Griffith et al, 2015

5 of 49

Fragmentation

Sonication
Heat fragmentation
Transposase

Size selection -> loss of small RNA

Despite certain claims, we can’t measure microRNA accurately from regular RNA-Seq experiments.
It doesn’t mean that no reads map to microRNA loci…

6 of 49

RNA-seq library fragmentation and size selection

Griffith et al, 2015

7 of 49

Isolation of RNA

Most standard: polyA
Alternative 1: ribo-depletion (multiple kits)

Get the non-polyA transcripts
Samples too degraded for polyA

Alternative 2: capture

10 of 49

RNASeq alignment

“gene”

“a read”

5’UTR

exon 2

3’UTR

intron

exon 1

exon 3

11 of 49

RNASeq alignment

“many reads”

12 of 49

RNASeq alignment

many reads corresponding to the different insert size

13 of 49

RNASeq workflow

“a read”

“Align”

14 of 49

RNASeq workflow

“a read”

15 of 49

RNASeq workflow

“a read”

16 of 49

RNASeq workflow

“many reads”

17 of 49

Split alignment

Align reads to the genome!
We use the algorithm STAR (relies on “Maximal Mappable Prefix”)
Index computed from the reference sequence, and optionally from an annotation file (typically GTF format)
STAR extracts the known junctions from the GTF and can “guide” the alignment to these known sites.
It can also align to an unknown junction if the alignment quality is better
Most RNA-Seq reads map to genes

mRNA=CDS+UTR

18 of 49

however…

Genes don’t exist!

19 of 49

“gene”

“transcripts” (or isoforms)

21 of 49

These 5 reads support isoform 1

22 of 49

support isoform 3

support isoform 2

doesn’t only support isoform 2

support isoform 1

23 of 49

Quantification

At the level of genes:

Count how many reads are mapped to genomic region annotated as “gene”
Only consider unique reads (reads that map uniquely to the genome)

24 of 49

featureCounts

We use a software called featureCounts

Written in C (faster than HTSeq-count, written in python)
Can use multiple threads

Input:

BAM file
Annotation file (compatible with the alignment, ie same reference)

Parameters

Mode of quantification
Stranded or non-stranded
…

Output

Table with the number of counts per gene
Also include the size of the gene

25 of 49

What do we do with counts?

Counts are not a great measure of the expression of a gene:

Longer genes get more reads than shorter genes
The number of reads depends on the total number of reads in the library

Initial solution: RPKM

Reads Per Kilobase and per Million mapped reads
A=sum (counts)/1e6;
RPKM=counts/(Length*A)
Normalize for the size of the gene and the library depth
For Paired-end reads: FPKM (Fragments per Kb per Million reads)

26 of 49

Counts/RPKM/TPM

Best measure: TPM

Transcript per million

RPK=counts/Length
M=sum(RPK)/1e6
TPM=RPK/M

The sum of TPM is the same in each sample (=1e6)

Every measurements are relative (depends on the other genes measured in the experiment)

We will continue with counts in DESeq2

Under the assumption that the size of the gene doesn’t change between condition
No alternative splicing / alternative polyadenylation

27 of 49

Unique reads

We only consider the reads that map uniquely to the genome.
Problem of pseudogenes or genes with high similarity

28 of 49

Duplicated reads

In WGS/WES, we mark the duplicated reads (assuming identical reads come from PCR artifacts and not from two different molecules). Coverage is <100x, reads are 2X150bp
In RNA-Seq, many genes are highly covered, so it’s possible to obtain duplicated reads from independent molecules.
Removing duplicates would under-estimate the expression of these genes and reduce the dynamic range of the expression profile
Standard is to keep the duplicates, but we run picard MarkDuplicate to measure it, as a QC metric. Frequently, dup rate correlate with low library complexity

29 of 49

Alternative pipeline

Can we quantify transcripts without alignment?
K-mer profiles:

Make an index from the set of transcript sequences
in each read, find the transcripts compatible with the sequence of the read
Estimate the expression level of each transcript
Accurate and extremely fast

Kallisto and Salmon

30 of 49

Kallisto

https://pachterlab.github.io/kallisto
Input: fastq files
Output:

Transcripts, length, effective length, estimated counts, TPM

Dependent on the annotations
Comes with a statistical tool called Sleuth
The Bioconductor package txImport allows to use Kallisto or Salmon estimates in DESeq2

31 of 49

Differential expression

Typical use case
What are the genes that vary between condition A and condition B?

Is it “statistically significant”?

WE NEED REPLICATES!

32 of 49

Why do we need replicates?

Say you have two conditions and one expression profile for each.
You could look at the ratio of two measures (A/B)
Decide on a threshold for what you want to study:

Pick gene with a two fold difference.

A/B > 2 or A/B < 1/2
Log(A/B) > 1 or log(A/B) < -1

Two assumptions:

All genes have the same variability (uncertainty in the measure)
Ratio for “interest” is uniform for the entire dynamic range

33 of 49

Abs(log(A/B))>1

34 of 49

With replicates

Of course, we could average over all the replicates and operate the same operation
Better to use all of the measurements, normalize the differences between the samples and apply a statistical test

Take into account differences between samples/libraries and between genes

The more replicates we have, the better we can estimate the variability of the counts
Because we apply a statistical test on all genes (with a “pvalue”), we need to apply a procedure for multiple testing correction (“adjusted pvalue”)
We then apply a threshold (any threshold, but common threshold are 0.01 or 0.05, corresponding to 1% FDR or 5%FDR) to get a list of “genes statistically different between condition A and condition B”
We usually work with DESeq2, but edgeR or voom/limma are also good packages for differential expression analysis

35 of 49

Complex design

If you have “co-variates”, it’s (often) possible to include them in the model, and regress their effect (focus on the variable of interest).
Examples:

Sex of the samples
Cell lines
Batch

Same framework can be used for time course experiments

36 of 49

Batch effect

If samples were prepared in batch, it induce some technical variability.
ALWAYS!
If the technical variability is higher than the biological variability (=“variable of interest”), it’s problematic.
One method to check this is to associate the first principal components to the batch variable, and measure their association (and the amount of variation explained).
If some of the variability is associated with a batch, we can use a tool called combat to remove the technical variability and correct the measurements.

37 of 49

Alternative splicing

Analyzing at the level of the transcripts can help you identify differential expression of annotated isoforms.
For any of the event listed here, if the isoform is not annotated, these methods won’t work
Methods to detect new isoforms (on annotated genes):

Cufflinks
MISO
SUPPA

Requires long paired-end reads
Requires deeper sequencing

38 of 49

How many reads? Read length

Our recommendation is ~40 million reads
For differential expression:

Deeper sequencing allows to detect DE genes in low expression range
More replicates allows to detect DE genes with lower fold change
Power calculation with pilot data or reasonable expectations

For alternative splicing:

Longer, deeper sequencing

Read length:

For expression only, 2x50bp can be fine
For splicing or fusion genes discovery, 2x125bp is recommended

Depends also on the organism you’re studying

39 of 49

Fusion gene discovery

Somatic rearrangements in cancer can lead to the expression of “chimeric transcripts”

40 of 49

Fusion genes

“Fusion genes” are transcripts coming from two different genes
Can be caused by structural events (deletion, translocation) or by a “readthrough” events between two adjacent genes
Our recommendation is FusionCatcher (stringent filtering, good annotations), but we also use STAR-Fusion (which takes advantage of the STAR alignments)
Importance of the functional interpretation of the candidates:

In-frame fusion?
Which protein domains are part of the chimeric proteins?
Does the fusion drive overexpression of a gene?

41 of 49

Immune cell decomposition

“tumor RNASeq” is the sequencing of RNA molecules present in a sample composed of tumor cells, stromal cells and normal cells.

The aggregated expression profile can be decomposed to estimate the proportion of each cell type in the sample.

42 of 49

Immuno-oncology, neoantigen

Identify somatic variants coding for peptides, score immunogenicity of the peptides
Require HLA-typing (various algorithms for WGS/WES or RNASeq)
NetMHC family of algorithms to score immunogenicity

43 of 49

Variant calling?

It is possible to call variants from the RNA-Seq (GATK pipeline or Strelka2)

Coverage is not uniform
Allele frequencies are more variable
Sequencing is “noisier” (because of reverse transcription step)
High False Positive Rate

When possible, it’s better to call variants from WGS/WES and confirm their presence in the RNA-Seq:

Variant is present

estimate allele frequency

Variant is absent

not enough reads
allele is not expressed
Variant was a false positive in the WGS/WES

44 of 49

RNA editing

There are pipelines to study RNA editing and identify sequence difference between DNA and RNA
Requires very careful filtering
RNA editing is probably not “widespread”
Useful resources:

SNPiR: Piskol et al. AJHG 2013

45 of 49

Novel transcripts

In any experiment, some reads don’t map to annotated genes
Even in human, your experiment might capture novel genes
However, not all reads are coming from novel genes

Degradation products, low-quality reads, mapping artifacts

There are programs and set of criteria for the detection of novel coding or non-coding transcripts.
It should be reproducible, ideally some level of conservation (at least partial), have exon-intron structures (not always the case), …
Cufflinks, Trinity

46 of 49

Annotation

For human/mouse experiments:

RefSeq is more stringent
Gencode/Ensembl is more comprehensive / exhaustive

It all depends on your experiment!

47 of 49

Long read technologies

48 of 49

Do and don’t

Look at your data

Fastq files, alignment in IGV, expression profile in R

Know which protocol you used and what QC to expect

Don’t use Excel

Don’t use Tophat/Cufflinks

1 of 49

2 of 49

3 of 49

4 of 49

5 of 49

6 of 49

7 of 49

8 of 49

9 of 49

10 of 49

11 of 49

12 of 49

13 of 49

14 of 49

15 of 49

16 of 49

17 of 49

18 of 49

19 of 49

20 of 49

21 of 49

22 of 49

23 of 49

24 of 49

25 of 49

26 of 49

27 of 49

28 of 49

29 of 49

30 of 49

31 of 49

32 of 49

33 of 49

34 of 49

35 of 49

36 of 49

37 of 49

38 of 49

39 of 49

40 of 49

41 of 49

42 of 49

43 of 49

44 of 49

45 of 49

46 of 49

47 of 49

48 of 49

49 of 49