IGB Bulk RNA-Seq Workflow Training
Setup
QC and
Salmon Quantitation
DESeq2
Differential Expression Testing
GSEA and g:Profiler Functional Interpretation
Requirements:
luria/smb
nf-core/rnaseq
smb/Rstudio
luria/smb/Rstudio
Day 1
Day 2
Day 3
Linux and R
Linux/Singularity Introduction
ssh/smb Connections
Aims:
Setup
Using LLM for Computing, Data Science and Bioinformatics
Chat-GPT/Bard etc. will change everything
Interest in Coding +
= Data Scientist
Huge Barrier!!
Effort
Frustration
Failure
Interest in Coding +
= Data Scientist
More Productive Barrier!!
Effort
LLM
Frustration
Failure
Before LLM
After LLM
Cluster Node
luria.mit.edu
Cluster Node
Head Node
Cluster Node
Cluster Node
Cluster Node
Cluster Node
Storage
Linux Cluster
bmc-pub17.mit.edu
bmc-pub14.mit.edu
bmc-lab2.mit.edu
bmc-lab6.mit.edu
Setup
The KI Computational Environment
Mac
PC
Users
nfs
Kerberos IDs
ssh
smb
ssh
slurm
love
sharp
class
kellis
chen
bhatia
lees
yilmaz
Shares
Essentials
MIT IP Address - VPN or MIT Secure
Valid Certificates - CertAid
User-Specific Information
Setup
Connecting to luria with ssh
Mac
PC
Use the Terminal App
Use SecureCRT - Available HERE
charliew = your kerberos ID
user-specific variables
Setup
Connecting to storage with smb
Mac
PC
Use the Connect to Server function
Use the Map network drive function
pub10 = your storage host
bcc = your share name
charliew = your kerberos ID
user-specific variables
Setup
Essential Linux concepts and commands
srun --pty bash
LLM Prompts
What are the most essential commands required to use a Linux computer?
How do I use grep?
How do I edit files with emacs?
Containerized R
Setup
Head Node
Cluster Node
bmc-XXX.mit.edu
Setup
Tunneled connections to cluster node for singularity Rstudio (part1)
Mac
PC
nfs
ssh
slurm
share
ssh
dirsrun --pty bash
(cd $(pwd -P))
cd /net/StorageHost/Share/kerberosID/bulkRNAseq
./singularity_Rstudio.sh
Head Node
Cluster Node
Setup
Tunneled connections to cluster node for singularity Rstudio (part3)
Mac
PC
ssh
slurm
ssh
ssh
ssh -t kerberosID@luria.mit.edu -L PORT:localhost:PORT ssh clusterNode -L PORT:localhost:SOCKET
http://localhost:PORT
Mac
Create 2nd connection tunneled to node on Mac
PC
Modify the Session Options to create tunnel
Example Data for Class
Setup
GEO - repository of gene expression data
Setup
Setup
srun --pty bash
module add perl/5.24.1
module add sra/2.10.7
fastq-dump --split-files SRR20709984 SRR20709985 SRR2070990 SRR2070991
On luria (no need to run this):
sra toolkit - fastq-dump to obtain files
Execution script:
Output:
Salmon Quantitation
nf-core/rnaseq - Required inputs
Salmon Quantitation
Examine our Inputs
Sample description csv file:
Slurm script:
nf-core/rnaseq processing
Prepare for the nf-core/rnaseq run. From inside your bulkRNAseq directory, create a copy of the nf-core/rnaseq submission script and samples csv file:
cp /net/ostrom/data/bcc/charliew/RNAseq-Class_23_stock/nf-core_rnaseq.sh .
cp /net/ostrom/data/bcc/charliew/RNAseq-Class_23_stock/GSE210061_subset.csv .
Edit nf-core_rnaseq.sh using emacs to correct email address
emacs nf-core_rnaseq.sh
Salmon Quantitation
Monitor pipeline run
squeue -u yourKerberosName
tail slurm*
Wait for email:
Launch the nf-core/rnaseq pipeline run
sbatch -p bcc nf-core_rnaseq.sh
Wait for email:
Troubleshooting Example
Mm10 + gencode annotation - only 59 transcripts quantified
Genome Sequence Names
Annotation Sequence Names
nf-core/rnaseq - mm10 gencode troubleshooting
Through two Essential Open Source Software for Science (EOSS) grants
Developed by:
Funded by:
Community
Open-source
Salmon Quantitation
Salmon Quantitation
Nextflow
Works with most compute
Efficient
Containerized software
Scale
Salmon Quantitation
Salmon Quantitation
Pipelines built from modules using nextflow with guidelines
From website: “nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow.”
81 pipelines currently available:
Engaged community:
nf-core
Salmon Quantitation
Salmon Quantitation
Caching, -resume and other considerations
During pipeline execution, if something is incorrect or if a change is required, pipeline can be stopped, repaired and re-executed.
This will use successfully completed and unchanged chucks from “cache” and pick up where needed. Each nexflow run creates a “work” directory that holds the cached data. The output of tasks are stored in “work” subdirectories and they are associated with downstream task directories using symbolic links.
All commands executed during pipelines have the following files: .command.sh, .command.log, .command.begin, .command.out, .command.err, .command.trace files in corresponding task directories.
From pipeline_info folder, execution_report files:
After completion of initial pipeline run, one row (sample) was added to the samplesheet csv file and the the argument “-resume” was added to the slurm execution script and the script was re-submitted:
Most tasks were in cache, additional required tasks were run and results folded into the original.
Bbsplit – maps reads to multiple references simultaneously, allows filtering of mouse derived reads in PDX samples.
Pipeline produces the input needed for differential expression analysis but does not include that work.
Salmon Quantitation
Salmon Quantitation
Examine nf-core/rnaseq output
Output documentation here:
https://nf-co.re/rnaseq/3.11.2/output
Full output here:
https://bmc-data.mit.edu/BCC/caw.ost/RNAseq-Class_23_full/GSE210061_subset/
Downsampled output here:
https://bmc-data.mit.edu/BCC/caw.ost/bulkRNAseq/GSE210061_subset/
Salmon Quantitation
Using salmon to quantify RNASeq reads
K-mer based quasi aligner to quantify transcripts
Software: https://combine-lab.github.io/salmon/
Documentation: https://salmon.readthedocs.io/en/latest/
Versions - Rapidly evolving software - see change log: https://github.com/COMBINE-lab/salmon/releases
Genomic Alignment (star)
Quasi Alignment (salmon)
DESeq2
Differential Expression Testing
R and Rstudio
R is statistical programming software
Rstudio is the interface
Rmd Interface
Command Console
Environment
Active Directory
R commands similar to Linux (but different)
Loading Packages
DESeq2
Differential Expression Testing
R Packages - CRAN and Bioconductor
DESeq2
Differential Expression Testing
Testing for differentially expressed genes
Tximport and DESeq2
“DESeqDataSetFromTximport” imports count data from a tximport object
The function requires description of the experimental conditions being compared (design formula)
Excellent introductory videos:
Statquest:
Library Normalization - https://youtu.be/UFB993xufUU
Independent Filtering - https://youtu.be/Gi0JdrxRq5s
Bioinformagician:
DESeq2 Introduction- https://youtu.be/0b24mpzM_5M
DESeq2 Workflow - https://youtu.be/OzNzO8qwwp0
LiquidBrain Bioinformatics:
Negative Binomial and Gene Expression Data - https://youtu.be/lJw6Ku_jQkM
DESeq2
Differential Expression Testing
tximport and DESeq2 Vignettes
Set up R/tximport project
Create new project inside tximport folder
Open the file: GSE210061_subset_Processing.Rmd
This script:
From inside bulkRNAseq folder (copy completed run if you dont have it):
cp -r /net/ostrom/data/bcc/charliew/RNAseq_class23_test/GSE210061_subset .
cd GSE210061_subset
cp -r /net/ostrom/data/bcc/charliew/RNAseq_class23_test/GSE210061_subset/tximport .
From Rstudio:
DESeq2
Differential Expression Testing
Max per gene (MPG) column
In some cases, protein coding genes with two ENS geneIDs per gene symbol. This results in 2 data points per gene, this can complicate downstream analyses. MPG flags the ENS geneID with the highest average expression as the representative for the gene symbol.
DESeq2
Differential Expression Testing
DESeq2
Differential Expression Testing
DESeq2 tests for differentially expressed genes
Steps of DESeq2 analysis:
Simple design: ~ Condition
Sample | Folder | Condition |
120hr_Mock_No_MAC_REP1 | ../salmon/120hr_Mock_No_MAC_REP1 | 120hr_Mock_No_MAC |
120hr_Mock_No_MAC_REP2 | ../salmon/120hr_Mock_No_MAC_REP2 | 120hr_Mock_No_MAC |
120hr_CoV2_w_MAC_REP1 | ../salmon/120hr_CoV2_w_MAC_REP1 | 120hr_CoV2_w_MAC |
120hr_CoV2_w_MAC_REP2 | ../salmon/120hr_CoV2_w_MAC_REP2 | 120hr_CoV2_w_MAC |
DESeq2
Differential Expression Testing
DESeq2: Principal Components Analysis
PCA is a useful way to summarize data�
Genes with high co-variance are grouped into “principal components” to reduce dimensionality of data
What is PCA
Use PCA to show expected clustering
Identify outliers
Looking beyond first 2 principal components can be useful
DESeq2
Differential Expression Testing
DESeq2: Principal Components Analysis
Useful for checking clustering and identifying outlier samples
Outliers may be mislabeled or have poor sequencing quality
What is PCA
Use PCA to show expected clustering
Identify outliers
Looking beyond first 2 principal components can be useful
DESeq2
Differential Expression Testing
DESeq2 design matrix
Formalize metadata into factors for analysis
DESeq2 will control for columns/factors
Experiment
Design Matrix
MICE
A
B
Sample ID | Treatment | Timepoint |
1 | A | 1 |
2 | A | 2 |
3 | A | 3 |
4 | B | 1 |
5 | B | 2 |
6 | B | 3 |
[replicates] | … | … |
Sample Collection Timepoints
1 2 3
Treatment
Design formula : “design <- ~ treatment + timepoint”
DESeq2
Differential Expression Testing
DESeq2 command
Deseq command:
Input: raw read counts� sample design matrix
design formula
Deseq2 process:�
Output:
DEseq object, to be used as input to differential expression testing
resultsNames:
lists comparisons available for differential expression testing
DESeq2
Differential Expression Testing
DESeq2 output
Lfcshrink calculates differential expression between two sets of samples
Input: deseq object, index of comparison
DESeq2
Differential Expression Testing
Visualizing differentially expressed genes - Heatmaps and Volcano Plots
DESeq2
Differential Expression Testing
Row-centered (z-score) data and heatmaps
In row-centered data, each row is scaled so that mean is 0 and variance is 1
GSEA and g:Profiler Functional Interpretation
Functional Analysis is used to associate differential expression data and biological annotations
GO terms
Pathways
MsigDB
GSEA and g:Profiler Functional Interpretation
FCS (Functional Class Sorting, GSEA) vs ORA (Over-Representation Analysis, g:Profiler)
Links:
GSEA - https://www.gsea-msigdb.org/gsea/index.jsp
g:Profiler - https://biit.cs.ut.ee/gprofiler/gost
FCS
ORA
enrichment
no enrichment
GSEA and g:Profiler Functional Interpretation
FCS (Functional Class Sorting, GSEA) vs ORA (Over-Representation Analysis, g:Profiler)
•DEGs are defined using (arbitrary) fold change and p-value thresholds.
•In the above example, 1% of tested genes are differentially expressed but 10% of the genes in a pathway are in the list of DEGs leading to highly significant ORA.
•This approach works well but has 2 major limitations:
1. What if none or a small number of genes meet significance thresholds?
2.The biology of genes that almost meet the thresholds are not considered.
•GSEA is not impacted by those limitations and compares all the genes in a set with the full expression dataset.
Set size 100
GSEA and g:Profiler Functional Interpretation
Running g:Profiler
Cov2vMock.gostres <- gost(query = Cov2_vs_Mock.degs$GeneSym,
organism = "hsapiens",
sources = sourceSet,significant = sigFlag,
correction_method ="g_SCS",
custom_bg = background$GeneSym,
domain_scope="custom")
Web client
R client
GSEA and g:Profiler Functional Interpretation
Running GSEA
GUI - can be downloaded and run locally
for i in *.rnk
do
sh /net/ostrom/data/bcc/charliew/gsea_4/GSEA_Linux_4.1.0/gsea-cli.sh GSEAPreranked -rnk $i \
-gmx h.all.v7.2.symbols.gmt \
-set_min 5 -set_max 1500 \
-rpt_label $i.h -plot_top_x 40 -nperm 1000
done
Command Line
RNA-Seq data should analyzed with the pre-ranked GSEA mode. Input data are:
The results are usually based on pairwise comparisons.
GSEA and g:Profiler Functional Interpretation
The Molecular Signatures Database - MSigDB
Tools
Collections
GSEA and g:Profiler Functional Interpretation
GSEA web output
GSEA and g:Profiler Functional Interpretation
Single sample GSEA (ssGSEA)
ssGSEA – from Barbie et al., 2009
The ‘single sample’ extension of GSEA7 allows one to define an enrichment score that represents the degree of absolute enrichment of a gene set in each sample within a given data set. The gene expression values for a given sample were rank-normalized, and an enrichment score was produced using the Empirical Cumulative Distribution Functions (ECDF) of the genes in the signature and the remaining genes. This procedure is similar to GSEA but the list is ranked by absolute expression (in one sample). The enrichment score is obtained by an integration of the difference between the ECDF.
As you progress along the rank ordered list of genes, the algorithm looks for a difference in encountering the genes in the gene set compared to the non-gene set genes. If the gene set genes are encountered relatively early in the list the ES is negative, late in the list the ES is positive and encountered at roughly the same rate as the non-gene set genes the ES is near 0.
Y
X
Y
X
Up In X genes
Up In Y genes
GSEA and g:Profiler Functional Interpretation
*
*
*
*
X2
Y1
Up In Y
X3
Y1
Up In X
GSEA and g:Profiler Functional Interpretation
Level 6
Level 12
rand 4
Level 2
GSEA and g:Profiler Functional Interpretation
Understanding R syntax
Function
name
Input
Define your own functions to avoid rewriting code
Functions take input and return a reliable output
Variables are containers for information
“<-” sets the value of a variable
Types of variables in R
Data Types
Data Structures
https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/
https://www.tutorialspoint.com/r/r_data_types.htm
R Libraries
Plotting Metadata
Salmon
Salmon Command
salmon --no-version-check quant --validateMappings --incompatPrior 0.0
-i /net/ostrom/data/bcc/charliew/Genomes/salmon_indicies/hg38_ens101_1.3.0/hg38_ens101/
-l A
-r /net/ostrom/data/bcc/dpradhan/talks/RNASeq_Pipeline_Class/200212Yil/data/200212Yil_D20-133001_dedup.fastq
-p 8
-o Hcol_A1.quant
Salmon Output
Tximport aggregates salmon output into abundance estimates for gene-level analysis
+1 offset for l2tpm/cpm
Alignment QC