1 of 49

June 6, 2025

Bulk RNA-Seq Analysis

MSM-NYGC Sequence Informatics Workshop

Heather Geiger, Rui Fu

@nygenome

@notsojunkdna

2 of 49

Outline

  • Protocols
    • Different protocols for different experiments
    • “mRNA”/”totalRNA”
  • Bioinformatics
    • Alignment
    • Gene Quantification
    • Alignment-free transcript quantification
  • Applications
    • Differential gene expression
    • Fusion gene discovery
    • Immune cell decomposition
    • Immuno-oncology, neoantigen, etc
    • Long reads technologies

3 of 49

RNA-Seq

4 of 49

Typical RNA-Seq process

Griffith et al, 2015

5 of 49

Fragmentation

  • Sonication
  • Heat fragmentation
  • Transposase

  • Size selection -> loss of small RNA
    • Despite certain claims, we can’t measure microRNA accurately from regular RNA-Seq experiments.
    • It doesn’t mean that no reads map to microRNA loci…

6 of 49

RNA-seq library fragmentation and size selection

Griffith et al, 2015

7 of 49

Isolation of RNA

  • Most standard: polyA
  • Alternative 1: ribo-depletion (multiple kits)
    • Get the non-polyA transcripts
    • Samples too degraded for polyA
  • Alternative 2: capture

8 of 49

9 of 49

10 of 49

RNASeq alignment

“gene”

“a read”

5’UTR

exon 2

3’UTR

intron

intron

exon 1

exon 3

R1

R2

11 of 49

RNASeq alignment

“many reads”

12 of 49

RNASeq alignment

many reads corresponding to the different insert size

13 of 49

RNASeq workflow

“a read”

“Align”

14 of 49

RNASeq workflow

“a read”

15 of 49

RNASeq workflow

“a read”

16 of 49

RNASeq workflow

“many reads”

17 of 49

Split alignment

  • Align reads to the genome!
  • We use the algorithm STAR (relies on “Maximal Mappable Prefix”)
  • Index computed from the reference sequence, and optionally from an annotation file (typically GTF format)
  • STAR extracts the known junctions from the GTF and can “guide” the alignment to these known sites.
  • It can also align to an unknown junction if the alignment quality is better
  • Most RNA-Seq reads map to genes
    • mRNA=CDS+UTR

18 of 49

however…

Genes don’t exist!

19 of 49

“gene”

“transcripts” (or isoforms)

20 of 49

21 of 49

1

2

3

These 5 reads support isoform 1

22 of 49

1

2

3

support isoform 3

support isoform 2

doesn’t only support isoform 2

support isoform 1

23 of 49

Quantification

  • At the level of genes:
    • Count how many reads are mapped to genomic region annotated as “gene”
    • Only consider unique reads (reads that map uniquely to the genome)

24 of 49

featureCounts

  • We use a software called featureCounts
    • Written in C (faster than HTSeq-count, written in python)
    • Can use multiple threads
  • Input:
    • BAM file
    • Annotation file (compatible with the alignment, ie same reference)
  • Parameters
    • Mode of quantification
    • Stranded or non-stranded
  • Output
    • Table with the number of counts per gene
    • Also include the size of the gene

25 of 49

What do we do with counts?

  • Counts are not a great measure of the expression of a gene:
    • Longer genes get more reads than shorter genes
    • The number of reads depends on the total number of reads in the library

    • Initial solution: RPKM
      • Reads Per Kilobase and per Million mapped reads
      • A=sum (counts)/1e6;
      • RPKM=counts/(Length*A)
      • Normalize for the size of the gene and the library depth
      • For Paired-end reads: FPKM (Fragments per Kb per Million reads)

26 of 49

Counts/RPKM/TPM

  • Best measure: TPM
    • Transcript per million
      • RPK=counts/Length
      • M=sum(RPK)/1e6
      • TPM=RPK/M
  • The sum of TPM is the same in each sample (=1e6)

  • Every measurements are relative (depends on the other genes measured in the experiment)

  • We will continue with counts in DESeq2
    • Under the assumption that the size of the gene doesn’t change between condition
    • No alternative splicing / alternative polyadenylation

27 of 49

Unique reads

  • We only consider the reads that map uniquely to the genome.
  • Problem of pseudogenes or genes with high similarity

28 of 49

Duplicated reads

  • In WGS/WES, we mark the duplicated reads (assuming identical reads come from PCR artifacts and not from two different molecules). Coverage is <100x, reads are 2X150bp
  • In RNA-Seq, many genes are highly covered, so it’s possible to obtain duplicated reads from independent molecules.
  • Removing duplicates would under-estimate the expression of these genes and reduce the dynamic range of the expression profile
  • Standard is to keep the duplicates, but we run picard MarkDuplicate to measure it, as a QC metric. Frequently, dup rate correlate with low library complexity

29 of 49

Alternative pipeline

  • Can we quantify transcripts without alignment?
  • K-mer profiles:
    • Make an index from the set of transcript sequences
    • in each read, find the transcripts compatible with the sequence of the read
    • Estimate the expression level of each transcript
    • Accurate and extremely fast

    • Kallisto and Salmon

30 of 49

Kallisto

  • https://pachterlab.github.io/kallisto
  • Input: fastq files
  • Output:
    • Transcripts, length, effective length, estimated counts, TPM

  • Dependent on the annotations
  • Comes with a statistical tool called Sleuth
  • The Bioconductor package txImport allows to use Kallisto or Salmon estimates in DESeq2

31 of 49

Differential expression

  • Typical use case
  • What are the genes that vary between condition A and condition B?
    • Is it “statistically significant”?
      • WE NEED REPLICATES!

32 of 49

Why do we need replicates?

  • Say you have two conditions and one expression profile for each.
  • You could look at the ratio of two measures (A/B)
  • Decide on a threshold for what you want to study:
    • Pick gene with a two fold difference.
      • A/B > 2 or A/B < 1/2
      • Log(A/B) > 1 or log(A/B) < -1
    • Two assumptions:
      • All genes have the same variability (uncertainty in the measure)
      • Ratio for “interest” is uniform for the entire dynamic range

33 of 49

  • Abs(log(A/B))>1

34 of 49

With replicates

  • Of course, we could average over all the replicates and operate the same operation
  • Better to use all of the measurements, normalize the differences between the samples and apply a statistical test
    • Take into account differences between samples/libraries and between genes
  • The more replicates we have, the better we can estimate the variability of the counts
  • Because we apply a statistical test on all genes (with a “pvalue”), we need to apply a procedure for multiple testing correction (“adjusted pvalue”)
  • We then apply a threshold (any threshold, but common threshold are 0.01 or 0.05, corresponding to 1% FDR or 5%FDR) to get a list of “genes statistically different between condition A and condition B”
  • We usually work with DESeq2, but edgeR or voom/limma are also good packages for differential expression analysis

35 of 49

Complex design

  • If you have “co-variates”, it’s (often) possible to include them in the model, and regress their effect (focus on the variable of interest).
  • Examples:
    • Sex of the samples
    • Cell lines
    • Batch

  • Same framework can be used for time course experiments

36 of 49

Batch effect

  • If samples were prepared in batch, it induce some technical variability.
  • ALWAYS!
  • If the technical variability is higher than the biological variability (=“variable of interest”), it’s problematic.
  • One method to check this is to associate the first principal components to the batch variable, and measure their association (and the amount of variation explained).
  • If some of the variability is associated with a batch, we can use a tool called combat to remove the technical variability and correct the measurements.

37 of 49

Alternative splicing

  • Analyzing at the level of the transcripts can help you identify differential expression of annotated isoforms.
  • For any of the event listed here, if the isoform is not annotated, these methods won’t work
  • Methods to detect new isoforms (on annotated genes):
    • Cufflinks
    • MISO
    • SUPPA
  • Requires long paired-end reads
  • Requires deeper sequencing

38 of 49

How many reads? Read length

  • Our recommendation is ~40 million reads
  • For differential expression:
    • Deeper sequencing allows to detect DE genes in low expression range
    • More replicates allows to detect DE genes with lower fold change
    • Power calculation with pilot data or reasonable expectations
  • For alternative splicing:
    • Longer, deeper sequencing
  • Read length:
    • For expression only, 2x50bp can be fine
    • For splicing or fusion genes discovery, 2x125bp is recommended

  • Depends also on the organism you’re studying

39 of 49

Fusion gene discovery

  • Somatic rearrangements in cancer can lead to the expression of “chimeric transcripts”

40 of 49

Fusion genes

  • “Fusion genes” are transcripts coming from two different genes
  • Can be caused by structural events (deletion, translocation) or by a “readthrough” events between two adjacent genes
  • Our recommendation is FusionCatcher (stringent filtering, good annotations), but we also use STAR-Fusion (which takes advantage of the STAR alignments)
  • Importance of the functional interpretation of the candidates:
    • In-frame fusion?
    • Which protein domains are part of the chimeric proteins?
    • Does the fusion drive overexpression of a gene?

41 of 49

Immune cell decomposition

“tumor RNASeq” is the sequencing of RNA molecules present in a sample composed of tumor cells, stromal cells and normal cells.

The aggregated expression profile can be decomposed to estimate the proportion of each cell type in the sample.

42 of 49

Immuno-oncology, neoantigen

  • Identify somatic variants coding for peptides, score immunogenicity of the peptides
  • Require HLA-typing (various algorithms for WGS/WES or RNASeq)
  • NetMHC family of algorithms to score immunogenicity

43 of 49

Variant calling?

  • It is possible to call variants from the RNA-Seq (GATK pipeline or Strelka2)
    • Coverage is not uniform
    • Allele frequencies are more variable
    • Sequencing is “noisier” (because of reverse transcription step)
    • High False Positive Rate
  • When possible, it’s better to call variants from WGS/WES and confirm their presence in the RNA-Seq:
    • Variant is present
      • estimate allele frequency
    • Variant is absent
      • not enough reads
      • allele is not expressed
      • Variant was a false positive in the WGS/WES

44 of 49

RNA editing

  • There are pipelines to study RNA editing and identify sequence difference between DNA and RNA
  • Requires very careful filtering
  • RNA editing is probably not “widespread
  • Useful resources:

SNPiR: Piskol et al. AJHG 2013

45 of 49

Novel transcripts

  • In any experiment, some reads don’t map to annotated genes
  • Even in human, your experiment might capture novel genes
  • However, not all reads are coming from novel genes
    • Degradation products, low-quality reads, mapping artifacts
  • There are programs and set of criteria for the detection of novel coding or non-coding transcripts.
  • It should be reproducible, ideally some level of conservation (at least partial), have exon-intron structures (not always the case), …
  • Cufflinks, Trinity

46 of 49

Annotation

  • For human/mouse experiments:
    • RefSeq is more stringent
    • Gencode/Ensembl is more comprehensive / exhaustive
  • It all depends on your experiment!

47 of 49

Long read technologies

48 of 49

Do and don’t

  • Look at your data
    • Fastq files, alignment in IGV, expression profile in R
  • Know which protocol you used and what QC to expect

  • Don’t use Excel

  • Don’t use Tophat/Cufflinks

49 of 49

Want more?