1 of 63

IGB Bulk RNA-Seq Workflow Training

Setup

QC and

Salmon Quantitation

DESeq2

Differential Expression Testing

GSEA and g:Profiler Functional Interpretation

Requirements:

  • Account on luria.mit.edu
  • smb connection to data storage
  • Rstudio (Docker Container)

luria/smb

nf-core/rnaseq

smb/Rstudio

luria/smb/Rstudio

Day 1

Day 2

Day 3

Linux and R

Linux/Singularity Introduction

ssh/smb Connections

Aims:

  • Reproducible
  • Best Practices
  • Thorough Documentation

2 of 63

Setup

Using LLM for Computing, Data Science and Bioinformatics

Chat-GPT/Bard etc. will change everything

  • OpenAI account for GPT-4
  • Bard available in google

Interest in Coding +

= Data Scientist

Huge Barrier!!

Effort

Frustration

Failure

Interest in Coding +

= Data Scientist

More Productive Barrier!!

Effort

LLM

Frustration

Failure

Before LLM

After LLM

3 of 63

Cluster Node

luria.mit.edu

Cluster Node

Head Node

Cluster Node

Cluster Node

Cluster Node

Cluster Node

Storage

Linux Cluster

bmc-pub17.mit.edu

bmc-pub14.mit.edu

bmc-lab2.mit.edu

bmc-lab6.mit.edu

Setup

The KI Computational Environment

Mac

PC

Users

nfs

Kerberos IDs

ssh

smb

ssh

slurm

love

sharp

class

kellis

chen

bhatia

lees

yilmaz

Shares

Essentials

MIT IP Address - VPN or MIT Secure

Valid Certificates - CertAid

User-Specific Information

  • Kerberos ID/Password
  • Laptop OS
  • Storage Host
  • Share Name
  • Assigned Port

See ClassList

4 of 63

Setup

Connecting to luria with ssh

Mac

PC

Use the Terminal App

Use SecureCRT - Available HERE

charliew = your kerberos ID

user-specific variables

5 of 63

Setup

Connecting to storage with smb

Mac

PC

Use the Connect to Server function

Use the Map network drive function

pub10 = your storage host

bcc = your share name

charliew = your kerberos ID

user-specific variables

6 of 63

Setup

Essential Linux concepts and commands

  • Explore some basic commands
    • cd,ls,mkdir,cp,mv,head,tail,emacs,rm,rmdir,man,gzip,history,more,cat,grep
  • Job management with slurm (squeue, sbatch, sinfo)
  • Software management with module (module avail, module list, module add)
  • User-level installation with conda
  • Do not run computationally intensive jobs on the head node!
    • If command will run for more than a few seconds, use a cluster node.
    • Launch a cluster node interactive shell with:

srun --pty bash

LLM Prompts

What are the most essential commands required to use a Linux computer?

How do I use grep?

How do I edit files with emacs?

7 of 63

Containerized R

  • Containerized Environment
    • Reproducible, shareable, and portable execution of workflows
    • Prebuilt with common packages for RNA-Seq analysis
  • Local and HPC compatible
    • Docker for local systems
    • Singularity for HPC systems
    • slurm script for set up and ssh tunnelling
  • Markdown workflows of analyses
    • Enables replication of experiment in exact environment for unparalleled reproducibility.

Setup

8 of 63

Head Node

Cluster Node

bmc-XXX.mit.edu

Setup

Tunneled connections to cluster node for singularity Rstudio (part1)

Mac

PC

nfs

ssh

slurm

share

ssh

  • Login to luria (Mac/PC-specific instructions)
  • Launch interactive shell on cluster node and note the node name:

dirsrun --pty bash

(cd $(pwd -P))

  • Change into class directory using physical path:

cd /net/StorageHost/Share/kerberosID/bulkRNAseq

  • Run singularity script and note SOCKET number and Rstudio password from the output

./singularity_Rstudio.sh

9 of 63

Head Node

Cluster Node

Setup

Tunneled connections to cluster node for singularity Rstudio (part3)

Mac

PC

ssh

slurm

ssh

ssh

ssh -t kerberosID@luria.mit.edu -L PORT:localhost:PORT ssh clusterNode -L PORT:localhost:SOCKET

http://localhost:PORT

Mac

Create 2nd connection tunneled to node on Mac

PC

Modify the Session Options to create tunnel

10 of 63

Example Data for Class

Setup

11 of 63

GEO - repository of gene expression data

Setup

12 of 63

Setup

srun --pty bash

module add perl/5.24.1

module add sra/2.10.7

fastq-dump --split-files SRR20709984 SRR20709985 SRR2070990 SRR2070991

On luria (no need to run this):

sra toolkit - fastq-dump to obtain files

13 of 63

  • Samplesheet input - csv file that associates sample name with, fastq files (paired or single-end) and strandedness information.
  • Genome
  • Annotation

Execution script:

Output:

Salmon Quantitation

nf-core/rnaseq - Required inputs

14 of 63

Salmon Quantitation

Examine our Inputs

Sample description csv file:

Slurm script:

15 of 63

nf-core/rnaseq processing

Prepare for the nf-core/rnaseq run. From inside your bulkRNAseq directory, create a copy of the nf-core/rnaseq submission script and samples csv file:

cp /net/ostrom/data/bcc/charliew/RNAseq-Class_23_stock/nf-core_rnaseq.sh .

cp /net/ostrom/data/bcc/charliew/RNAseq-Class_23_stock/GSE210061_subset.csv .

Edit nf-core_rnaseq.sh using emacs to correct email address

emacs nf-core_rnaseq.sh

Salmon Quantitation

Monitor pipeline run

squeue -u yourKerberosName

tail slurm*

Wait for email:

Launch the nf-core/rnaseq pipeline run

sbatch -p bcc nf-core_rnaseq.sh

Wait for email:

16 of 63

Troubleshooting Example

Mm10 + gencode annotation - only 59 transcripts quantified

17 of 63

Genome Sequence Names

Annotation Sequence Names

nf-core/rnaseq - mm10 gencode troubleshooting

18 of 63

19 of 63

20 of 63

Through two Essential Open Source Software for Science (EOSS) grants

Developed by:

Funded by:

Community

Open-source

Salmon Quantitation

Salmon Quantitation

21 of 63

Nextflow

Works with most compute

Efficient

Containerized software

Scale

Salmon Quantitation

Salmon Quantitation

22 of 63

https://nf-co.re/

Pipelines built from modules using nextflow with guidelines

From website: “nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow.”

81 pipelines currently available:

  • The aim is one pipeline for each datatype.
  • Goal develop each pipeline so that it can handle all the different scenarios associated with that datatype.

Engaged community:

  • Summits (600 participants in 2022), Hackathons, Training
  • Active slack channel
    • Example from hic workflow - I encountered a bug where any fastq file name containing the strings “_1” or “_2” would break the pipeline. I searched the slack channel and found another report of the issue, verified the behavior I observed and posted a description. Within hours repairs were made to the dev branch.
  • Byte-Sized presentations (youtube)
  • Annual Nextflow/nf-core survey documents information about the community

nf-core

Salmon Quantitation

23 of 63

Salmon Quantitation

24 of 63

Caching, -resume and other considerations

During pipeline execution, if something is incorrect or if a change is required, pipeline can be stopped, repaired and re-executed.

This will use successfully completed and unchanged chucks from “cache” and pick up where needed. Each nexflow run creates a “work” directory that holds the cached data. The output of tasks are stored in “work” subdirectories and they are associated with downstream task directories using symbolic links.

All commands executed during pipelines have the following files: .command.sh, .command.log, .command.begin, .command.out, .command.err, .command.trace files in corresponding task directories.

From pipeline_info folder, execution_report files:

After completion of initial pipeline run, one row (sample) was added to the samplesheet csv file and the the argument “-resume” was added to the slurm execution script and the script was re-submitted:

Most tasks were in cache, additional required tasks were run and results folded into the original.

Bbsplit – maps reads to multiple references simultaneously, allows filtering of mouse derived reads in PDX samples. 

Pipeline produces the input needed for differential expression analysis but does not include that work.

Salmon Quantitation

25 of 63

Salmon Quantitation

Examine nf-core/rnaseq output

26 of 63

Salmon Quantitation

Using salmon to quantify RNASeq reads

K-mer based quasi aligner to quantify transcripts

Genomic Alignment (star)

Quasi Alignment (salmon)

27 of 63

DESeq2

Differential Expression Testing

R and Rstudio

R is statistical programming software

Rstudio is the interface

Rmd Interface

Command Console

Environment

Active Directory

R commands similar to Linux (but different)

Loading Packages

28 of 63

DESeq2

Differential Expression Testing

R Packages - CRAN and Bioconductor

29 of 63

DESeq2

Differential Expression Testing

Testing for differentially expressed genes

Tximport and DESeq2

“DESeqDataSetFromTximport” imports count data from a tximport object

The function requires description of the experimental conditions being compared (design formula)

Excellent introductory videos:

Statquest:

Library Normalization - https://youtu.be/UFB993xufUU

Independent Filtering - https://youtu.be/Gi0JdrxRq5s

Bioinformagician:

DESeq2 Introduction- https://youtu.be/0b24mpzM_5M

DESeq2 Workflow - https://youtu.be/OzNzO8qwwp0

LiquidBrain Bioinformatics:

Negative Binomial and Gene Expression Data - https://youtu.be/lJw6Ku_jQkM

30 of 63

DESeq2

Differential Expression Testing

tximport and DESeq2 Vignettes

31 of 63

Set up R/tximport project

Create new project inside tximport folder

Open the file: GSE210061_subset_Processing.Rmd

This script:

  • Imports data from quant.sf salmon output files
  • Generates gene-level summary of transcript data
  • Creates integer count table (intCt) for differential expression testing
  • Creates log2 TPM table (l2tpm) for data visualization
  • Merges intCt and l2tpm data for output file
  • Adds gene type and gene symbol annotations to assembled data
  • Calculates gene averages and variances for filtering and annotation
  • Adds flag for low expression and low variance genes (LowExpLowVar)
  • Adds flag for max average per gene (MPG) - see next slide
  • Calculates fold change/p-value differential results

From inside bulkRNAseq folder (copy completed run if you dont have it):

cp -r /net/ostrom/data/bcc/charliew/RNAseq_class23_test/GSE210061_subset .

cd GSE210061_subset

cp -r /net/ostrom/data/bcc/charliew/RNAseq_class23_test/GSE210061_subset/tximport .

From Rstudio:

DESeq2

Differential Expression Testing

32 of 63

Max per gene (MPG) column

In some cases, protein coding genes with two ENS geneIDs per gene symbol. This results in 2 data points per gene, this can complicate downstream analyses. MPG flags the ENS geneID with the highest average expression as the representative for the gene symbol.

DESeq2

Differential Expression Testing

33 of 63

DESeq2

Differential Expression Testing

DESeq2 tests for differentially expressed genes

Steps of DESeq2 analysis:

  • Import expression data using DESeqDataSetFromTximport
  • Generate exploratory principal component plot
  • Perform differential testing with selected reference level using DESeq2 wrapper function
  • Prepare differential test output for each contrast using lfcshrink

Simple design: ~ Condition

Sample

Folder

Condition

120hr_Mock_No_MAC_REP1

../salmon/120hr_Mock_No_MAC_REP1

120hr_Mock_No_MAC

120hr_Mock_No_MAC_REP2

../salmon/120hr_Mock_No_MAC_REP2

120hr_Mock_No_MAC

120hr_CoV2_w_MAC_REP1

../salmon/120hr_CoV2_w_MAC_REP1

120hr_CoV2_w_MAC

120hr_CoV2_w_MAC_REP2

../salmon/120hr_CoV2_w_MAC_REP2

120hr_CoV2_w_MAC

34 of 63

DESeq2

Differential Expression Testing

DESeq2: Principal Components Analysis

PCA is a useful way to summarize data�

Genes with high co-variance are grouped into “principal components” to reduce dimensionality of data

What is PCA

Use PCA to show expected clustering

Identify outliers

Looking beyond first 2 principal components can be useful

35 of 63

DESeq2

Differential Expression Testing

DESeq2: Principal Components Analysis

Useful for checking clustering and identifying outlier samples

Outliers may be mislabeled or have poor sequencing quality

What is PCA

Use PCA to show expected clustering

Identify outliers

Looking beyond first 2 principal components can be useful

36 of 63

DESeq2

Differential Expression Testing

DESeq2 design matrix

Formalize metadata into factors for analysis

DESeq2 will control for columns/factors

Experiment

Design Matrix

MICE

A

B

Sample ID

Treatment

Timepoint

1

A

1

2

A

2

3

A

3

4

B

1

5

B

2

6

B

3

[replicates]

Sample Collection Timepoints

1 2 3

Treatment

Design formula : “design <- ~ treatment + timepoint”

37 of 63

DESeq2

Differential Expression Testing

DESeq2 command

Deseq command:

Input: raw read counts� sample design matrix

design formula

Deseq2 process:�

  • Normalization of raw read counts
  • Estimation of dispersion
  • Fit data to generalized linear model defined by design formula

Output:

DEseq object, to be used as input to differential expression testing

resultsNames:

lists comparisons available for differential expression testing

38 of 63

DESeq2

Differential Expression Testing

DESeq2 output

Lfcshrink calculates differential expression between two sets of samples

Input: deseq object, index of comparison

  • geneid
  • baseMean: mean read counts across all samples
  • log2FoldChange: the relative log2 fold change of the first group in the comparison relative to the second group
  • lfcSE: standard-error of the fold change
  • Pvalue: the p-value of the logFC fold change
  • Padj: p-value adjusted for multiple-comparison testing
  • Stat: test statistic, derived from Wald statistic

39 of 63

DESeq2

Differential Expression Testing

Visualizing differentially expressed genes - Heatmaps and Volcano Plots

40 of 63

DESeq2

Differential Expression Testing

Row-centered (z-score) data and heatmaps

In row-centered data, each row is scaled so that mean is 0 and variance is 1

41 of 63

GSEA and g:Profiler Functional Interpretation

Functional Analysis is used to associate differential expression data and biological annotations

GO terms

Pathways

MsigDB

42 of 63

GSEA and g:Profiler Functional Interpretation

FCS (Functional Class Sorting, GSEA) vs ORA (Over-Representation Analysis, g:Profiler)

FCS

ORA

  • Order genes from high in one condition to high in the other.
  • Analyze the position of genes in a set along that continuum.
  • Thresholds used to define differentially expressed genes (DEGS)
  • Analyze fractions of DEGS in set and not in set with Fisher’s exact test

enrichment

no enrichment

43 of 63

GSEA and g:Profiler Functional Interpretation

FCS (Functional Class Sorting, GSEA) vs ORA (Over-Representation Analysis, g:Profiler)

DEGs are defined using (arbitrary) fold change and p-value thresholds.

In the above example, 1% of tested genes are differentially expressed but 10% of the genes in a pathway are in the list of DEGs leading to highly significant ORA.

This approach works well but has 2 major limitations:

1. What if none or a small number of genes meet significance thresholds?

2.The biology of genes that almost meet the thresholds are not considered.

GSEA is not impacted by those limitations and compares all the genes in a set with the full expression dataset.

44 of 63

Set size 100

45 of 63

GSEA and g:Profiler Functional Interpretation

Running g:Profiler

Cov2vMock.gostres <- gost(query = Cov2_vs_Mock.degs$GeneSym,

organism = "hsapiens",

sources = sourceSet,significant = sigFlag,

correction_method ="g_SCS",

custom_bg = background$GeneSym,

domain_scope="custom")

Web client

R client

46 of 63

GSEA and g:Profiler Functional Interpretation

Running GSEA

GUI - can be downloaded and run locally

for i in *.rnk

do

sh /net/ostrom/data/bcc/charliew/gsea_4/GSEA_Linux_4.1.0/gsea-cli.sh GSEAPreranked -rnk $i \

-gmx h.all.v7.2.symbols.gmt \

-set_min 5 -set_max 1500 \

-rpt_label $i.h -plot_top_x 40 -nperm 1000

done

Command Line

RNA-Seq data should analyzed with the pre-ranked GSEA mode. Input data are:

  • Rank file (stat column from DESeq2)
  • Gene set (gmx files from MsigDB)

The results are usually based on pairwise comparisons.

47 of 63

GSEA and g:Profiler Functional Interpretation

The Molecular Signatures Database - MSigDB

http://www.gsea-msigdb.org/gsea/msigdb/index.jsp

Tools

Collections

48 of 63

GSEA and g:Profiler Functional Interpretation

GSEA web output

49 of 63

GSEA and g:Profiler Functional Interpretation

Single sample GSEA (ssGSEA)

ssGSEA – from Barbie et al., 2009

The ‘single sample’ extension of GSEA7 allows one to define an enrichment score that represents the degree of absolute enrichment of a gene set in each sample within a given data set. The gene expression values for a given sample were rank-normalized, and an enrichment score was produced using the Empirical Cumulative Distribution Functions (ECDF) of the genes in the signature and the remaining genes. This procedure is similar to GSEA but the list is ranked by absolute expression (in one sample). The enrichment score is obtained by an integration of the difference between the ECDF.

As you progress along the rank ordered list of genes, the algorithm looks for a difference in encountering the genes in the gene set compared to the non-gene set genes. If the gene set genes are encountered relatively early in the list the ES is negative, late in the list the ES is positive and encountered at roughly the same rate as the non-gene set genes the ES is near 0.

50 of 63

Y

X

Y

X

Up In X genes

Up In Y genes

GSEA and g:Profiler Functional Interpretation

51 of 63

*

*

*

*

X2

Y1

Up In Y

X3

Y1

Up In X

GSEA and g:Profiler Functional Interpretation

52 of 63

  • 1200 randomly selected genes
  • 5 random gene sets
  • 6 gene sets randomly selected from 6 different levels of expression.
  • All gene sets consist of about 50 genes

Level 6

Level 12

rand 4

Level 2

GSEA and g:Profiler Functional Interpretation

53 of 63

Understanding R syntax

Function

name

Input

Define your own functions to avoid rewriting code

Functions take input and return a reliable output

Variables are containers for information

“<-” sets the value of a variable

54 of 63

Types of variables in R

Data Types

Data Structures

  • Matrix
  • Data Frame

https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/

https://www.tutorialspoint.com/r/r_data_types.htm

55 of 63

R Libraries

  • Libraries are sets of predefined functions for specific fields
  • Important libraries to have for RNASeq analysis
    • dplyr - general functions to manipulate data and dataframes
    • readxl - reading in excel files
    • ggplot2 - robust tools for plotting
    • tximport - processing Salmon quantitation output into a dataframe
    • DESeq2 - differential expression analysis
    • apeglm - generalized linear modeling tools used by DESeq
    • gprofiler2 - functional analysis of gene sets

56 of 63

57 of 63

Plotting Metadata

58 of 63

Salmon

  • Parameters and what they mean
  • Library types
  • Single-end versus paired end
  • Alignment target
  • Salmon samples sheet

59 of 63

Salmon Command

salmon --no-version-check quant --validateMappings --incompatPrior 0.0

-i /net/ostrom/data/bcc/charliew/Genomes/salmon_indicies/hg38_ens101_1.3.0/hg38_ens101/

-l A

-r /net/ostrom/data/bcc/dpradhan/talks/RNASeq_Pipeline_Class/200212Yil/data/200212Yil_D20-133001_dedup.fastq

-p 8

-o Hcol_A1.quant

60 of 63

Salmon Output

61 of 63

Tximport aggregates salmon output into abundance estimates for gene-level analysis

62 of 63

+1 offset for l2tpm/cpm

  • Avoids negative values from tpm/cpm values less than 1

63 of 63

Alignment QC

  • Heatmap
    • Dendrograms
    • Splitting sections
    • Color scale
  • PCA
    • Input into pca (top genes with top variance)
    • Variance of different principal components (elbow plot)
    • ggplot: differing color, shapes, size, etc