1 of 23

Dr. Harpreet Singh

Head, PG Department of Bioinformatics,

Hans Raj Mahila Maha Vidyalaya,

Jalandhar, Punjab, India

e-module

RNASeq Analysis using R

https://scienceparkstudygroup.github.io/rna-seq-lesson/

2 of 23

Introduction

RNA-seq is essentially the sequence of RNA molecules from either a specific cell, tissue, or species. A highly sensitive and accurate tool for measuring expression across the transcriptome, it is providing scientists with visibility into previously undetected changes occurring in disease states, in response to therapeutics, under different environmental conditions, and across a wide range of other study designs.

RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.

3 of 23

Introduction

There are a wide range of applications for RNA-seq as given below: -

Expression quantification

Discovery of novel genes and gene isoforms

Differential expression

Many other types of functional analysis.

Hence, researchers can use RNA-seq to answer many of their most compelling questions. It can find new cancer subtypes, discover the transcriptome of never-before-sequenced species, elucidate tissue-specific gene expression, and many other applications.

4 of 23

Benefits of RNA Sequencing

NA-Seq with next-generation sequencing (NGS) is increasingly the method of choice for scientists studying the transcriptome.

Covers an extremely broad dynamic range
Provides sensitive, accurate measurement of gene expression
Captures both known and novel features; does not require predesigned probes
Generates both qualitative and quantitative data
Reveals the full transcriptome, not just a few selected transcripts
Can be applied to any species, even if a reference sequence is not available

5 of 23

Types of RNAseq

RNAseq can be roughly divided into two “types”:

Reference genome-based - an assembled genome exists for a species for which an RNAseq experiment is performed. It allows reads to be aligned against the reference genome and significantly improves our ability to reconstruct transcripts. This category would obviously include humans and most model organisms but excludes the majority of truly biologically intereting species (e.g., Hyacinth macaw).

Reference genome-free - no genome assembly for the species of interest is available. In this case one would need to assemble the reads into transcripts using de novo approaches. This type of RNAseq is as much of an art as well as science because assembly is heavily parameter-dependent and difficult to do well.

6 of 23

An Overview of RNASeq

7 of 23

Analysis of RNAseq Data

Major steps in analyzing RNA-seq data include the following: -

Quality control and trimming
Read alignment
Quantify expression
Differential expression

8 of 23

Quality control and trimming

As the first step, low quality reads and adaptor sequences need to be removed from the data (a step called trimming). Adaptors are short sequences used to prepare your sample for sequencing and need to be removed before analysis. The impact of trimming on read quality is commonly shown in figures like Figure 1. But how is quality measured? Sequencing machines will output a quality score for each base pair sequenced (called a Q value). Higher Q values mean better data. Trimming is important because it removes the poor quality parts of the read, or even the entire read if necessary. This increases the amount of useful data for downstream analyses.

Figure 1: Representative plot from Basepair showing the mean quality scores for each position in the read after trimming. “Q30” means the percentage of bases in all reads with quality score of 30 or greater.

9 of 23

Read Alignment

After trimming, reads are typically aligned to either a reference genome or transcriptome. Alignment is the algorithm to figure out which gene a read came from. The popular and commonly used tools for read alignment are STAR and TopHat. Another option if your species has no reference is to assemble your own transcriptome using Trinity.

There are several metrics to pay attention to after alignment is finished. Ideally, reads align uniquely to one place in the genome, and you generally want to see >70% uniquely aligned reads. However, some reads may align to multiple locations, so it is unclear which gene the read came from, and are removed.

Alignment outputs data in the SAM format (Sequence Alignment Map), which is then converted to the compressed format (BAM) and used for further downstream analyses. An overview of read processing steps from the total starting reads, how many were trimmed, to how many aligned has been depicted in Figure 2 given below.

10 of 23

Read Alignment

Figure 2: Representative Sankey plot from Basepair showing the amount of reads you have starting from raw data to trimming and alignment.

11 of 23

Expression Quantification

The next step is to quantify gene expression. The number of reads that align to each gene is quantified using programs such as HTSeq-count or featureCounts. However, raw read counts are not appropriate to compare expression between samples and genes. Raw read counts do not account for biases such as gene length and total number of reads in each sample.

The measure FPKM (Fragments Per Kilobase of exon model per Million mapped reads) is a common within-sample normalization method. The RNA-seq analysis pipelines provide expression in raw counts, FPKM, and another common normalization method, TPM as shown in Figure 3.

Figure 3 A: A histogram of log transformed FPKM values. At the top there is a button that lets you toggle between gene or transcript expression.

12 of 23

Expression Quantification

Figure 3 B: Interactive table of per-gene expression values, which you can sort and search.

13 of 23

Differential Expression

Another popular analysis for RNA-seq data is differential expression, which statistically tests for gene expression changes between two groups of samples. For example, this can help you find genes that change between treatment and control conditions.

The most highly cited tool for this purpose is DESeq2 for performing differential expression, which takes as input the raw read counts per gene and outputs log2 fold changes, p-values, and adjusted p-values (among other statistics).

The log2 fold change will tell you the magnitude of gene expression change, while the p-value and adjusted p-value assess the statistical significance.

Compared to the p-value, the adjusted p-values is the better metric to look at since it is less likely to produce false positives.

Differential expression results are often depicted in form of visualizations such as volcano plot, heatmap, interactive tables etc. as shown in Figure 4.

14 of 23

Differential Expression

Figure 4: Representative results provided by Basepair. (A) Volcano plot that plots for each gene the -log10(p-value) against the log2 fold change. Significant genes are shown in blue. You can set the thresholds that determine significance using the controls to the left. (B) Heatmap that shows the expression across the samples for the significant genes. (C) Interactive table showing the per-gene values of log2 fold change, p-value, and adjusted p-value. You can sort and filter the columns as well as search for your favorite gene.

15 of 23

R packages required

library(edgeR)

library(DESeq2)

library(limma)

library(Glimma)

library(gplots) library(org.Mm.eg.db) library(RColorBrewer) library(GEOquery) library(tibble) library(tidyverse)

Mouse mammary gland dataset

The data for this tutorial comes from a paper, Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation . Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE116030.

This study examines the expression profiles of wild type neonatal cardiomyocytes and SRF-overexpressed neonatal cardiomyocytes in mouse. Two groups are present and each group contains three biological replicates.

16 of 23

Useful commands in R

gset <- getGEO("GSE116030", GSEMatrix =TRUE, getGPL=FALSE)

seqdata = read.table('~/Desktop/lab/Shulin/data/published/GSE116030_SrfOE.txt', header = T, sep = '\t') head(seqdata)

dim(seqdata)

Sample Information

Reading counts data

Format Data

Filtering to remove lowly expressed genes

# Remove first two columns from seqdata

countdata <- seqdata[,-1]

# Store EntrezGeneID as rownames

rownames(countdata) <- seqdata[,1]

head(countdata)

# Obtain CPMs

myCPM <- cpm(countdata)

# Have a look at the output

head(myCPM)

# Which values in myCPM are greater than 0.5?

thresh <- myCPM > 0.5

# This produces a logical matrix with TRUEs and FALSEs

head(thresh)

# we would like to keep genes that have at least 2 TRUES in each row of thresh

keep <- rowSums(thresh) >= 2

# Subset the rows of countdata to keep the more highly expressed genes

counts.keep <- countdata[keep,]

summary(keep)

17 of 23

Useful commands in R

Convert counts to DGEList object

Filtering to remove lowly expressed genes (continued)

plot(myCPM[,1],countdata[,1])

# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts

plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))

# Add a vertical line at 0.5 CPM

abline(v=0.5)

dgeObj <- DGEList(counts.keep)

Quality Control

dgeObj$samples$lib.size

# The names argument tells the barplot to use the sample names on the x-axis

# The las argument rotates the axis names

barplot(dgeObj$samples$lib.size, names=colnames(dgeObj), las=2)

# Add a title to the plot title("Barplot of library sizes")

# Get log2 counts per million logcounts <- cpm(dgeObj,log=TRUE)

# Check distributions of samples using boxplots

boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)

# Let's add a blue horizontal line that corresponds to the median logCPM

abline(h=median(logcounts),col="blue")

title("Boxplots of logCPMs (unnormalised)")

From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further. And in this case, we doubt that sample OE1 may indicate outlier sample.

18 of 23

Useful commands in R

Multidimensional scaling plots

An MDSplot is a visualisation of a principle components analysis. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers.

We can use the plotMDS function to create the MDS plot.

# Redo the MDSplot with corrected information

col.cell <- c("purple","orange")[sampleinfo$status]

plotMDS(dgeObj,col=col.cell)

legend("topright",fill=c("purple","orange"),legend=levels(sampleinfo$status), pos)

title("Cell type")

19 of 23

Useful commands in R

Hierarchical clustering with heatmaps

An alternative approach to plotMDS for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the heatmap.2 function from the gplots package. In this example heatmap.2 calculates a matrix of euclidean distances from the logCPM (logcounts object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using plotMDS.)

# We estimate the variance for each row in the logcounts matrix

var_genes <- apply(logcounts, 1, var)

head(var_genes)

# Get the gene names for the top 500 most variable genes

select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]

head(select_var)

# Subset logcounts matrix

highly_variable_lcpm <- logcounts[select_var,]

dim(highly_variable_lcpm)

## Get some nicer colours

mypalette <- brewer.pal(11,"RdYlBu")

morecols <- colorRampPalette(mypalette)

# Set up colour vector for celltype variable

col.cell <- c("purple","orange")[sampleinfo$status]

# Plot the heatmap

heatmap.2(highly_variable_lcpm, col=rev(morecols(50)),

trace="column", main="Top 500 most variable genes across samples",

ColSideColors=col.cell,scale="row")

# Save the heatmap png(file="High_var_genes.heatmap.png") heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes\nacross samples",ColSideColors=col.cell,scale="row") dev.off()

21 of 23

Useful commands in R

Differentially Expressed Genes

samples = sampleinfo[, c(1,3)]

row.names(samples) = sampleinfo$title

expr_srf = seqdata[keep,2:7]

rownames(expr_srf) = seqdata[keep,1]

design <- as.formula(~status)

design <- model.matrix(design, data = samples)

ddsObj <- DESeqDataSetFromMatrix(countData = expr_srf, colData = samples, design = design)

vstcounts <- vst(ddsObj, blind=TRUE)

#plotPCA(vstcounts, intgroup="status")

ddsObj <- estimateSizeFactors(ddsObj)

ddsObj <- estimateDispersions(ddsObj)

ddsObj <- nbinomWaldTest(object = ddsObj,modelMatrix = design)

res <- results(ddsObj, alpha=0.05) head(res)

topGenesPvV <- as.data.frame(res) %>%

rownames_to_column("GeneID") %>%

arrange(padj) %>%

head(1000)

topGenesPvV[which(topGenesPvV$padj<=0.05),1]

Normalisation for composition bias

Usually we will perform the trimmed mean of M-values normalization method (TMM) to eliminate composition biases between libraries. Due to the time limit, please refer to other resources.

save(group,dgeObj,sampleinfo,file="~/Desktop/R_projects/data/preprocessing.Rdata")

22 of 23

References and Links

1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Publishing Group. 2009 Jan;10(1):57–63.

2. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15–21.

3. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. Oxford University Press; 2009 May 1;25(9):1105–11.

4. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the 5. Trinity platform for reference generation and analysis. Nat Protoc. Nature Publishing Group; 2013 Aug;8(8):1494–512.

5. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. BioMed Central Ltd; 2014;15(12):550.

https://www.basepairtech.com/knowledge-center/a-gentle-introduction-to-rna-seq-analysis/

https://www.ebi.ac.uk/training/events/introduction-rna-seq-and-functional-interpretation-1/

https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rb-rnaseq/tutorial.html

https://sbc.shef.ac.uk/workshops/2020-02-13-rnaseq-r/rna-seq-preprocessing.nb.html

https://learn.gencore.bio.nyu.edu/rna-seq-analysis/

https://rstudio-pubs-static.s3.amazonaws.com/462299_a9bc385f89b94b0aa95de0f3b7040b04.html#:~:text=Analysing%20an%20RNAseq%20experiment%20begins,statistical%20analysis%20on%20in%20R.

https://bioinformatics-core-shared-training.github.io/RNAseq-R/

https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_dge_in_r/rna_seq_r/

https://hackmd.io/@raynamharris/r4rnaseq-workshop

https://geneviatechnologies.com/blog/rna-seq-expression-analysis-5-dirty-secrets/

1 of 23

2 of 23

3 of 23

4 of 23

5 of 23

6 of 23

7 of 23

8 of 23

9 of 23

10 of 23

11 of 23

12 of 23

13 of 23

14 of 23

15 of 23

16 of 23

17 of 23

18 of 23

19 of 23

20 of 23

21 of 23

22 of 23

23 of 23