Dr. Harpreet Singh
Head, PG Department of Bioinformatics,
Hans Raj Mahila Maha Vidyalaya,
Jalandhar, Punjab, India
e-module
RNASeq Analysis using R
Introduction
RNA-seq is essentially the sequence of RNA molecules from either a specific cell, tissue, or species. A highly sensitive and accurate tool for measuring expression across the transcriptome, it is providing scientists with visibility into previously undetected changes occurring in disease states, in response to therapeutics, under different environmental conditions, and across a wide range of other study designs.
RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.
Introduction
There are a wide range of applications for RNA-seq as given below: -
Expression quantification
Discovery of novel genes and gene isoforms
Differential expression
Many other types of functional analysis.
Hence, researchers can use RNA-seq to answer many of their most compelling questions. It can find new cancer subtypes, discover the transcriptome of never-before-sequenced species, elucidate tissue-specific gene expression, and many other applications.
Benefits of RNA Sequencing
NA-Seq with next-generation sequencing (NGS) is increasingly the method of choice for scientists studying the transcriptome.
Types of RNAseq
RNAseq can be roughly divided into two “types”:
Reference genome-based - an assembled genome exists for a species for which an RNAseq experiment is performed. It allows reads to be aligned against the reference genome and significantly improves our ability to reconstruct transcripts. This category would obviously include humans and most model organisms but excludes the majority of truly biologically intereting species (e.g., Hyacinth macaw).
Reference genome-free - no genome assembly for the species of interest is available. In this case one would need to assemble the reads into transcripts using de novo approaches. This type of RNAseq is as much of an art as well as science because assembly is heavily parameter-dependent and difficult to do well.
An Overview of RNASeq
Analysis of RNAseq Data
Major steps in analyzing RNA-seq data include the following: -
Quality control and trimming
As the first step, low quality reads and adaptor sequences need to be removed from the data (a step called trimming). Adaptors are short sequences used to prepare your sample for sequencing and need to be removed before analysis. The impact of trimming on read quality is commonly shown in figures like Figure 1. But how is quality measured? Sequencing machines will output a quality score for each base pair sequenced (called a Q value). Higher Q values mean better data. Trimming is important because it removes the poor quality parts of the read, or even the entire read if necessary. This increases the amount of useful data for downstream analyses.
Figure 1: Representative plot from Basepair showing the mean quality scores for each position in the read after trimming. “Q30” means the percentage of bases in all reads with quality score of 30 or greater.
Read Alignment
After trimming, reads are typically aligned to either a reference genome or transcriptome. Alignment is the algorithm to figure out which gene a read came from. The popular and commonly used tools for read alignment are STAR and TopHat. Another option if your species has no reference is to assemble your own transcriptome using Trinity.
There are several metrics to pay attention to after alignment is finished. Ideally, reads align uniquely to one place in the genome, and you generally want to see >70% uniquely aligned reads. However, some reads may align to multiple locations, so it is unclear which gene the read came from, and are removed.
Alignment outputs data in the SAM format (Sequence Alignment Map), which is then converted to the compressed format (BAM) and used for further downstream analyses. An overview of read processing steps from the total starting reads, how many were trimmed, to how many aligned has been depicted in Figure 2 given below.
Read Alignment
Figure 2: Representative Sankey plot from Basepair showing the amount of reads you have starting from raw data to trimming and alignment.
Expression Quantification
The next step is to quantify gene expression. The number of reads that align to each gene is quantified using programs such as HTSeq-count or featureCounts. However, raw read counts are not appropriate to compare expression between samples and genes. Raw read counts do not account for biases such as gene length and total number of reads in each sample.
The measure FPKM (Fragments Per Kilobase of exon model per Million mapped reads) is a common within-sample normalization method. The RNA-seq analysis pipelines provide expression in raw counts, FPKM, and another common normalization method, TPM as shown in Figure 3.
Figure 3 A: A histogram of log transformed FPKM values. At the top there is a button that lets you toggle between gene or transcript expression.
Expression Quantification
Figure 3 B: Interactive table of per-gene expression values, which you can sort and search.
Differential Expression
Another popular analysis for RNA-seq data is differential expression, which statistically tests for gene expression changes between two groups of samples. For example, this can help you find genes that change between treatment and control conditions.
The most highly cited tool for this purpose is DESeq2 for performing differential expression, which takes as input the raw read counts per gene and outputs log2 fold changes, p-values, and adjusted p-values (among other statistics).
The log2 fold change will tell you the magnitude of gene expression change, while the p-value and adjusted p-value assess the statistical significance.
Compared to the p-value, the adjusted p-values is the better metric to look at since it is less likely to produce false positives.
Differential expression results are often depicted in form of visualizations such as volcano plot, heatmap, interactive tables etc. as shown in Figure 4.
Differential Expression
Figure 4: Representative results provided by Basepair. (A) Volcano plot that plots for each gene the -log10(p-value) against the log2 fold change. Significant genes are shown in blue. You can set the thresholds that determine significance using the controls to the left. (B) Heatmap that shows the expression across the samples for the significant genes. (C) Interactive table showing the per-gene values of log2 fold change, p-value, and adjusted p-value. You can sort and filter the columns as well as search for your favorite gene.
R packages required
library(edgeR)
library(DESeq2)
library(limma)
library(Glimma)
library(gplots) library(org.Mm.eg.db) library(RColorBrewer) library(GEOquery) library(tibble) library(tidyverse)
Mouse mammary gland dataset
The data for this tutorial comes from a paper, Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation . Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE116030.
This study examines the expression profiles of wild type neonatal cardiomyocytes and SRF-overexpressed neonatal cardiomyocytes in mouse. Two groups are present and each group contains three biological replicates.
Useful commands in R
gset <- getGEO("GSE116030", GSEMatrix =TRUE, getGPL=FALSE)
seqdata = read.table('~/Desktop/lab/Shulin/data/published/GSE116030_SrfOE.txt', header = T, sep = '\t') head(seqdata)
dim(seqdata)
Sample Information
Reading counts data
Format Data
Filtering to remove lowly expressed genes
# Remove first two columns from seqdata
countdata <- seqdata[,-1]
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
head(countdata)
# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)
# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
summary(keep)
Useful commands in R
Convert counts to DGEList object
Filtering to remove lowly expressed genes (continued)
plot(myCPM[,1],countdata[,1])
# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)
dgeObj <- DGEList(counts.keep)
Quality Control
dgeObj$samples$lib.size
# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(dgeObj$samples$lib.size, names=colnames(dgeObj), las=2)
# Add a title to the plot title("Barplot of library sizes")
# Get log2 counts per million logcounts <- cpm(dgeObj,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
title("Boxplots of logCPMs (unnormalised)")
From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further. And in this case, we doubt that sample OE1 may indicate outlier sample.
Useful commands in R
Multidimensional scaling plots
An MDSplot is a visualisation of a principle components analysis. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers.
We can use the plotMDS function to create the MDS plot.
# Redo the MDSplot with corrected information
col.cell <- c("purple","orange")[sampleinfo$status]
plotMDS(dgeObj,col=col.cell)
legend("topright",fill=c("purple","orange"),legend=levels(sampleinfo$status), pos)
title("Cell type")
Useful commands in R
Hierarchical clustering with heatmaps
An alternative approach to plotMDS for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the heatmap.2 function from the gplots package. In this example heatmap.2 calculates a matrix of euclidean distances from the logCPM (logcounts object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using plotMDS.)
# We estimate the variance for each row in the logcounts matrix
var_genes <- apply(logcounts, 1, var)
head(var_genes)
# Get the gene names for the top 500 most variable genes
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
head(select_var)
# Subset logcounts matrix
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[sampleinfo$status]
# Plot the heatmap
heatmap.2(highly_variable_lcpm, col=rev(morecols(50)),
trace="column", main="Top 500 most variable genes across samples",
ColSideColors=col.cell,scale="row")
# Save the heatmap png(file="High_var_genes.heatmap.png") heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes\nacross samples",ColSideColors=col.cell,scale="row") dev.off()
Useful commands in R
Differentially Expressed Genes
samples = sampleinfo[, c(1,3)]
row.names(samples) = sampleinfo$title
expr_srf = seqdata[keep,2:7]
rownames(expr_srf) = seqdata[keep,1]
design <- as.formula(~status)
design <- model.matrix(design, data = samples)
ddsObj <- DESeqDataSetFromMatrix(countData = expr_srf, colData = samples, design = design)
vstcounts <- vst(ddsObj, blind=TRUE)
#plotPCA(vstcounts, intgroup="status")
ddsObj <- estimateSizeFactors(ddsObj)
ddsObj <- estimateDispersions(ddsObj)
ddsObj <- nbinomWaldTest(object = ddsObj,modelMatrix = design)
res <- results(ddsObj, alpha=0.05) head(res)
topGenesPvV <- as.data.frame(res) %>%
rownames_to_column("GeneID") %>%
arrange(padj) %>%
head(1000)
topGenesPvV[which(topGenesPvV$padj<=0.05),1]
Normalisation for composition bias
Usually we will perform the trimmed mean of M-values normalization method (TMM) to eliminate composition biases between libraries. Due to the time limit, please refer to other resources.
save(group,dgeObj,sampleinfo,file="~/Desktop/R_projects/data/preprocessing.Rdata")
References and Links
1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Publishing Group. 2009 Jan;10(1):57–63.
2. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15–21.
3. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. Oxford University Press; 2009 May 1;25(9):1105–11.
4. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the 5. Trinity platform for reference generation and analysis. Nat Protoc. Nature Publishing Group; 2013 Aug;8(8):1494–512.
5. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. BioMed Central Ltd; 2014;15(12):550.
https://www.basepairtech.com/knowledge-center/a-gentle-introduction-to-rna-seq-analysis/
https://www.ebi.ac.uk/training/events/introduction-rna-seq-and-functional-interpretation-1/
https://sbc.shef.ac.uk/workshops/2020-02-13-rnaseq-r/rna-seq-preprocessing.nb.html
https://learn.gencore.bio.nyu.edu/rna-seq-analysis/
https://bioinformatics-core-shared-training.github.io/RNAseq-R/
https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_dge_in_r/rna_seq_r/
https://hackmd.io/@raynamharris/r4rnaseq-workshop
https://geneviatechnologies.com/blog/rna-seq-expression-analysis-5-dirty-secrets/
Thanks