1 of 121

Understanding Gene Expression Data

Slides: https://bit.ly/itn_nyu_2024_expression

Candace Savonen, Carrie Wright, and Kate Isaac

2 of 121

Except where otherwise indicated, the contents of this slide presentation are available for use under the Creative Commons Attribution 4.0 license.

You are free to adapt and share the work, but you must give appropriate credit, provide a link to the license, and indicate if changes were made.

Sample attribution: [Title of work] by the Johns Hopkins Data Science Lab. CC-BY 4.0

Terms of Use

3 of 121

Welcome to the ITN workshop!

While you wait:

Please sign in: https://bit.ly/hutch_learner

And create a GenePattern account: https://cloud.genepattern.org/gp/

Slides are here: https://bit.ly/itn_nyu_2024_expression

4 of 121

Schedule for today:

Introduction ~15 min
Gene Expression Overview ~25 min

Bulk RNA-seq data
Single cell RNA-seq data
Considerations in choosing tools

Why is metadata important?
RNA-seq GenePattern Activity ~45 min
Feedback surveys ~5 min

6 of 121

Have your phone

(or a separate tab) handy for interactive polls!

Join at slido.com #2871 844

7 of 121

What's your favorite candy?

8 of 121

How confident do you feel about your understanding of gene expression data?

9 of 121

What would you like to learn from this workshop?

10 of 121

Informatics Technology for Cancer Research (ITCR)

itcr.cancer.gov

11 of 121

Informatics Technology for Cancer Research (ITCR)

… and more!

ITCR tools: https://bit.ly/ITCR_Tool_List

12 of 121

What is the ITN?

ITCR Training Network

Catalyzing informatics research through training opportunities

itcrtraining.org

13 of 121

We are all busy - especially researchers!

https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif

14 of 121

Technology is changing quickly & it’s hard to keep up!
ITCR developers keep making more awesome software!

https://media.giphy.com/media/lRnUWhmllPI9a/giphy.gif

15 of 121

User preparedness

Gap

Tool usability

itcrtraining.org/courses

Informatics research is hindered by a gap between different types of experts

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

16 of 121

User preparedness

Gap

Tool usability

itcrtraining.org/courses

Catalyzing Informatics for Research

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

17 of 121

Elements of ITN:

Make courses about informatics

Make tools for researchers to do outreach

Provide live education opportunities

Enhance community engagement in cancer research

18 of 121

ITN courses

itcrtraining.org/courses

19 of 121

20 of 121

Choosing Genomics Tools

https://hutchdatascience.org/Choosing_Genomics_Tools/

21 of 121

Image by Candace Savonen with Avataars and Openmoji.org

Your data are ready.

22 of 121

Image by Candace Savonen with Avataars, pixabay and openmoji.org

Genomic data

What is this and what do I do with it?

23 of 121

CC-BY

Concepts discussed in Choosing -omics Tools course:

What does your genomic data type represent?

What are the most common data processing steps for your data type?

Find resources, tools and tutorials to help you process and interpret your data

24 of 121

General Chapters

Data Specific Chapters

https://bit.ly/genomics-tools

25 of 121

A Wikipedia for -omic analysis

Datatypes included so far:

RNA-seq
scRNA-seq
WGS/WXS
ATAC-seq
ChIP-seq
Microarrays
Methylation data

And we hope to add more! Let us know if you’d like to contribute! (Stipends available for grad students)

26 of 121

What kind of genomic data are you working with most frequently?

27 of 121

RNA-seq data analysis workflows

28 of 121

Genomics workflows in a very general sense

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

29 of 121

RNA-seq

data generation

From the Childhood Cancer Data Lab: https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html

30 of 121

Single-end vs paired-end

Image from https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-6/

31 of 121

Single-end vs paired-end

Image from https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html#quality-control

------>

----------------------------- [fragment]

------> <------

32 of 121

From Mike Love https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

exon 1

exon 2

exon 3

33 of 121

Poly A selection vs Ribo minus

Poly A selection Ribo minus

Tables from SITOOLS BIOTECH blog: https://blog.sitoolsbiotech.com/2019/08/ribo-depletion-rna-seq-ribosomal-rna-depletion-method-works-best/

Poly A selection advantages: Lower sequencing depth is needed. It gives greater exonic coverage.

Poly A selection disadvantages: It does not detect non-polyA transcripts, including miRNAs, snoRNAs, and some lncRNAs. It obtains less information on immature transcripts. It performs poorly for degraded RNA or Formalin-Fixed Paraffin-Embedded (FFPE) samples. It is biased towards the 3’ end of transcripts. It cannot be used for prokaryotes.

Ribo minus advantages: It is able to detect small and non-polyadenylated RNAs. It detects long and short transcripts (no 3’ bias). It has better performance on degraded RNA or FFPE samples. It is applicable for prokaryotes. It can be applied toward other abundant RNA.

Ribo minus disadvantages: It will collect more intronic reads and immature RNAs (if you are not interested in those). Because the returned RNA pool is larger, it requires greater sequencing depth.

34 of 121

Sequence related biases

3’ bias - 3’ ends of sequences are more likely to be sequenced
GC bias - guanine-cytosine pairs melt at a higher temperature, which affects sequences that have a lot of G’s and C’s

Sequence complexity - certain sequences more likely to have primers bound to them (and more likely to be sequenced)

Length bias - longer targets are more likely to be amplified or sequenced

Many of these biases are worsened by PCR amplification!
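
To make the GC bias bullet above concrete, the GC fraction of a sequence can be computed directly. This is only a minimal sketch: the function name `gc_fraction` and the two example fragments are made up for illustration and do not come from any tool used in this workshop.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Hypothetical example fragments (not real data)
at_rich = "ATATTTAAAT"
gc_rich = "GCGCCGGCAT"

print(gc_fraction(at_rich))
print(gc_fraction(gc_rich))
```

Fragments at the extremes of GC content are the ones most likely to be affected, which is why some tools offer a GC bias correction option.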

39 of 121

Some tools and some options account for these biases in some way!

Look at tool documentation!

40 of 121

Very general bulk RNA-seq workflow steps

Image by Candace Savonen

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential expression

41 of 121

bulk RNA-seq workflow steps

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential expression

Input(s)	fastq	fastq (+ GTF/GFF or index)	bam or sam (tab for counts)
Output(s)	HTML report	bam or sam		png, txt or csv
	fastq	tab

In a very general sense, RNA-seq workflows involve first quantification/alignment. You will also need to conduct quality control steps that check the quality of the sequencing done. You may also want to trim and filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. After data has been normalized, you are ready to conduct your downstream analyses. These will be highly dependent on the original goals and questions of your experiment. They may include dimension reduction, differential expression, or any number of other analyses.

Inputs to QC are reads, either in fastq files (older version) or fastqsanger files (newer technologies with a different encoding for quality scores). Outputs from QC may be HTML reports, or fastq files if using a tool for trimming reads.

Inputs to alignment tools are the reads plus a reference: a reference genome with a GTF/GFF annotation file for reference-based alignment/assembly, or a prebuilt index for faster pseudoalignment (many tools ship with built-in indices or can build one from reference sequences). Outputs may be sam or bam files. Sam files contain alignment information (including more quality scores), and bam is a binary, compressed form of sam. Additional outputs may be tab files if the alignment software has also been asked to do some basic quantification of counts per gene (a STAR option).

Normalization and downstream analysis tools use the outputs from alignment/quantification for the next steps. Normalization may use quality scores, in which case it would need a sam file. Outputs from normalization and other downstream analyses differ based on the task. These may be an image, a txt or csv file with various numbers, a gtf file if creating a transcript assembly, or a vcf file if calling variants. Which inputs are necessary and which outputs are produced depends on the specific downstream analysis.
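
As a small illustration of why normalization comes before downstream analysis, here is a sketch of one simple scaling, counts per million (CPM). CPM is an assumption chosen for brevity, not necessarily what any tool in this workshop does, and the gene names and counts are invented.

```python
def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale one sample's raw gene counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical raw counts: sample_b was sequenced twice as deeply as sample_a
sample_a = {"geneX": 50, "geneY": 150, "geneZ": 800}
sample_b = {"geneX": 100, "geneY": 300, "geneZ": 1600}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

# The raw counts differ two-fold, but the scaled values agree
print(cpm_a["geneX"], cpm_b["geneX"])
```

Dedicated methods (such as the TMM normalization named later in these slides) account for more than sequencing depth, so treat this only as the basic idea.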

42 of 121

What type of data does the tool expect?

Abundance?
Transformation?
File format?

43 of 121

What is a FASTQ file even?
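
In short, each read in a FASTQ file takes four lines: an identifier line starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below parses one made-up read and decodes its quality characters, assuming the common Phred+33 ("Sanger") encoding. The function name `read_fastq` is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, so score = character code - 33
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# A single made-up read (not real data)
example = io.StringIO("@read1\nGATTACA\n+\nIIIII##\n")
records = list(read_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Here `I` decodes to a score of 40 (high confidence) and `#` to 2 (very low confidence), which is the kind of signal QC and trimming tools act on.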

47 of 121

How is single-cell RNA-seq the same or different?

48 of 121

Image from Nature, Tanaka et al, 2018

49 of 121

Image from 10X Genomics: https://www.10xgenomics.com/blog/single-cell-rna-seq-an-introductory-overview-and-tools-for-getting-started

50 of 121

Slide from the Childhood Cancer Data Lab

Laser

51 of 121

Slide from the Childhood Cancer Data Lab

Cells

Example: 10X Genomics Chromium

53 of 121

Full-length scRNA-seq

Pros:

Can use paired-end sequencing (less 3' bias)
More complete coverage of transcripts (which may be better for transcript discovery purposes)

Cons:

Is not very efficient (generally 96 cells per plate)
Takes much longer to run (days/weeks depending on sample size)
Expensive

Pre-processing: Very similar to bulk RNA-seq

55 of 121

Tag-Based scRNA-seq

Pros:

Can profile up to millions of cells
Takes less computing power
File storage requirements are smaller
Much less expensive

Cons:

More intense 3' bias (can’t do paired end sequencing)
Coverage is generally not as deep

Pre-processing: use Alevin (a Salmon tool), which uses the cell barcodes to separate cells’ data

56 of 121

Single-cell RNA-seq quirks

Less starting material means:

1) More zeroes

2) More PCR amplification (and its associated biases)

Remember: some sequences are more likely to be amplified

57 of 121

Unique Molecular Identifiers (UMIs):

a ‘snapshot’ of the original molecules in the pre-amplified cell

From the Childhood Cancer Data Lab

Original image from: Islam et al. Nature 2014 (https://doi.org/10.1038/nmeth.2772)
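
The counting idea behind UMIs can be sketched in a few lines: reads that share the same cell barcode, gene, and UMI are treated as PCR copies of one original molecule and counted once. This is a simplified sketch with invented barcodes and UMIs. It ignores sequencing errors in the UMIs, which real tools have to handle, and `count_molecules` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, gene, umi). Not real data.
reads = [
    ("CELL_A", "geneX", "AACG"),
    ("CELL_A", "geneX", "AACG"),  # PCR duplicate of the read above
    ("CELL_A", "geneX", "AACG"),  # another PCR duplicate
    ("CELL_A", "geneX", "TTGC"),  # a second original molecule
    ("CELL_A", "geneY", "AACG"),  # same UMI, different gene: a different molecule
    ("CELL_B", "geneX", "AACG"),  # same UMI, different cell: a different molecule
]


def count_molecules(reads):
    """Count unique UMIs per (cell, gene), collapsing PCR duplicates."""
    umis_seen = defaultdict(set)
    for cell, gene, umi in reads:
        umis_seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


molecule_counts = count_molecules(reads)
print(molecule_counts)
```

Four reads of geneX in CELL_A collapse to two molecules, so the count reflects the pre-amplification "snapshot" rather than how well each molecule amplified.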

61 of 121

Very general single cell RNA-seq workflow steps

Image by Candace Savonen

Quantification/Alignment

Quality control

UMIs

Doublet detection

Filtering

Normalization

Downstream analyses

Dimension Reduction

Cell classification

Differential expression

Trajectory

62 of 121

What files and formats are we using for (single-cell) RNA-seq data?

63 of 121

Single cell RNA-seq workflow steps

Quantification/Alignment

Quality control

UMIs

Doublet detection

Filtering

Normalization

Downstream analyses

Dimension Reduction

Cell classification

Differential expression

Trajectory

Input(s)	bcl (--> fastq)	bam + counts	Updated counts
Output(s)	bam or sam	Updated counts		png, txt or csv
	mtx or counts

In a very general sense, single cell RNA-seq workflows involve first quantification/alignment. You will also need to conduct quality control steps that may involve using UMIs to check what’s detected, detecting doublets, and using this information to filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. Single cell data is highly skewed: a lot of genes are barely detected or not detected at all, and a few genes are detected a lot. After data has been normalized, you are ready to conduct your downstream analyses. These will be highly dependent on the original goals and questions of your experiment. They may include dimension reduction, cell classification, differential expression, detecting cell trajectories, or any number of other analyses.

The bcl file format is the rawest output (base calls), which is converted to fastq files. Tools like Cell Ranger save the counts information into three files (“The count matrix was saved as three files, where barcodes.tsv saves barcode information, genes.tsv saves gene information, and matrix.mtx saves the count data in MatrixMarket format.”).

QC is performed with tools like Cell Ranger, which take the bam and counts files as input and output updated counts. Some tools, like Cell Ranger, may store outputs in hierarchical data files (HDF5, a binary format often saved with the .h5 extension). These updated counts are inputs to normalization and downstream analysis, with outputs being images or csv files with numbers/classifications, as in the bulk RNA-seq process.

Source for file types info quote: https://www.fredhutch.org/content/dam/stripe/sun/software/scRNAseq/scRNAseq.html and https://satijalab.org/seurat/reference/read10x
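
To show what the MatrixMarket format mentioned above looks like, here is a minimal parser sketch for its coordinate (sparse) layout: a header line, optional `%` comment lines, a size line (rows, columns, number of stored entries), then one `row column value` line per non-zero count, using 1-based indices. The tiny matrix is invented and `read_mtx` is a hypothetical helper; in practice you would more likely use an existing reader, such as the Seurat function linked above.

```python
import io


def read_mtx(handle):
    """Parse a MatrixMarket coordinate file into (n_rows, n_cols, entries).

    `entries` maps (row, col) to the stored value, keeping the file's
    1-based indices. Any (row, col) pair that is absent is a zero.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a MatrixMarket coordinate file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(field) for field in line.split())
    entries = {}
    for _ in range(n_entries):
        row, col, value = handle.readline().split()
        entries[(int(row), int(col))] = int(value)
    return n_rows, n_cols, entries


# Made-up matrix: 3 genes (rows) x 4 cells (columns), 5 non-zero counts
example = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "3 4 5\n"
    "1 1 7\n"
    "1 3 2\n"
    "2 2 1\n"
    "3 1 4\n"
    "3 4 9\n"
)
n_genes, n_cells, counts = read_mtx(example)
zero_fraction = 1 - len(counts) / (n_genes * n_cells)
print(n_genes, n_cells, counts[(1, 3)], round(zero_fraction, 2))
```

Storing only the non-zero entries is what makes this format a good fit for single-cell data, where most gene-by-cell counts are zero.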

64 of 121

RNA-seq data analysis tools and platforms

65 of 121

Considerations for choosing tools:

Is it appropriate for your data type?

Is it an interface or programming language you feel comfortable with?

How much computing power do you have?

Are there benchmarking papers that compare the tool options?

Is the tool well documented and usable?

Is the tool well-maintained?

Is the tool generally accepted by the field?

66 of 121

Which tool for which workflow step?

Input: FASTQ Files

Output: Differentially expressed genes or transcripts

Step 1: QC

Step 2a: Mapping

Step 2b: Assemble

Step 3: Count/Quantify

Step 4: Normalize (e.g. TMM)

Step 5: Model DE

FastQC
MultiQC

Trimmomatic
Cutadapt

edgeR
DESeq2
limma + voom

Ballgown

Cufflinks

CuffDiff2

Salmon
Kallisto

TopHat
STAR
HISAT2

HTSeq
featureCounts
RSEM

tximport

Stringtie

67 of 121

https://jhudatascience.org/ITCR_Tables

68 of 121

https://jhudatascience.org/ITCR_Tables/omicsTable.html

69 of 121

Why is metadata important?

70 of 121

Let’s say you did your RNA-seq data analysis and you saw…

71 of 121

Metadata: Anything and everything that should be known about your samples!

A B C D

E F G H

sample_id	mouse_id	processing_date	treatment	…
A	1	3-10-21	None	…
B	1	4-12-21	None	…
C	2	3-10-21	None	…
D	2	4-12-21	None	…
E	3	3-10-21	Morphine	…
F	3	4-12-21	Morphine	…
G	4	3-10-21	Morphine	…
H	4	4-12-21	Morphine	…

I know everything I need to know about these samples from their metadata!
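
One practical use of a metadata table like this one is checking whether treatment is tangled up with processing batch. The sketch below re-types the table from this slide as tab-separated text and cross-tabulates treatment against processing date. The variable names are illustrative only.

```python
import csv
import io
from collections import Counter

# The metadata table from this slide, as tab-separated text
metadata_tsv = (
    "sample_id\tmouse_id\tprocessing_date\ttreatment\n"
    "A\t1\t3-10-21\tNone\n"
    "B\t1\t4-12-21\tNone\n"
    "C\t2\t3-10-21\tNone\n"
    "D\t2\t4-12-21\tNone\n"
    "E\t3\t3-10-21\tMorphine\n"
    "F\t3\t4-12-21\tMorphine\n"
    "G\t4\t3-10-21\tMorphine\n"
    "H\t4\t4-12-21\tMorphine\n"
)

rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))

# Count samples for every (treatment, processing_date) combination
crosstab = Counter((row["treatment"], row["processing_date"]) for row in rows)
for (treatment, date), n_samples in sorted(crosstab.items()):
    print(treatment, date, n_samples)
```

Every combination has two samples, so treatment and processing date are balanced here. If all morphine samples had been processed on one date, the two effects could not be separated.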

72 of 121

Examples of metadata categories:

Patient/organism of origin
Patient/organism information

Demographics
Disease state
Treatment state
Time point (if applicable)

Processing information

Batch information
Processing details (E.g. Isolation methods: Poly-A vs Ribo-minus)

Anything that should be known about the samples and their handling!

73 of 121

If you have human data, the metadata is probably loaded with PII and/or PHI

74 of 121

Data Organization in Spreadsheets

75 of 121

Why GenePattern?

76 of 121

77 of 121

Tools for Bioinformatics

78 of 121

Reproducible analysis tools with a GUI

https://usegalaxy.org/

https://www.genepattern.org/

79 of 121

Why GenePattern?

GenePattern can be a nice way to get access to these tools without:

Having to know command line
Having to struggle to install them
Having to figure out how to convert from one tool to another

80 of 121

GenePattern supports reproducibility

Record and replay of all analyses
Retain all versions of code – so results can be reproduced even if code changes
Chain analyses into “pipelines”, or workflows that can be shared and published

81 of 121

>250 GenePattern Modules, 8/2024

82 of 121

GenePattern tutorial

83 of 121

Let’s try it!

Let’s do bulk RNA differential expression analysis

85 of 121

TCGA - BRCA

20 Breast cancer primary tumor

20 Normal matched samples

What genes differentiate breast cancer from not cancer?
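
As a rough intuition for what "differentiate" means here, one basic quantity is the log2 fold change between the mean expression in the two groups. The sketch below uses invented numbers and a hypothetical `log2_fold_change` helper. It is not what the GenePattern module or tools like DESeq2, edgeR, or limma compute in full, since those fit statistical models and test for significance.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """Log2 ratio of mean expression in group_a over group_b.

    The pseudocount (an illustrative choice) avoids dividing by zero
    when a gene is not detected in one group.
    """
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


# Hypothetical normalized expression of one gene (not TCGA data)
tumor = [30, 34, 29]
normal = [7, 8, 6]

print(log2_fold_change(tumor, normal))
print(log2_fold_change(normal, tumor))
```

A value of 2 means roughly four-fold higher expression in tumor than in normal for this made-up gene, and swapping the groups flips the sign.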

86 of 121

bulk RNA-seq workflow steps

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential Expression

Input(s)	fastq or fastqsanger	fastq or fastqsanger (+ GTF/GFF or index)	bam or sam (tab for counts)
Output(s)	HTML report	bam or sam		png, txt or csv
	fastq	tab

87 of 121

Download the data from here:

https://datasets.genepattern.org/?prefix=data/workshops/240411-PSTP/

88 of 121

Go to cloud.genepattern.org/gp/

89 of 121

This module will identify differentially expressed genes between our two groups

90 of 121

GCT file goes here

CLS file goes here

91 of 121

What are these files?

CLS: class file – describes which samples are what
GCT: Gene Cluster Text file - gene expression data with some extra stuff
See here about how to create these files from your data
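
For a sense of what these two files contain, the sketch below builds a tiny GCT and a matching CLS as text. It follows the layout as I understand it: a GCT starts with `#1.2`, then a line with the number of genes and samples, then a tab-separated table with `Name` and `Description` columns, while a CLS has a counts line, a `#` line naming the classes, and one label per sample. Treat the GenePattern file formats guide as the authority. The sample names, values, and helper names are invented.

```python
def format_gct(sample_names, expression):
    """Return GCT (version 1.2) text for a dict of gene -> list of values."""
    lines = ["#1.2", f"{len(expression)}\t{len(sample_names)}"]
    lines.append("\t".join(["Name", "Description", *sample_names]))
    for gene, values in expression.items():
        # "na" is a placeholder for the Description column
        lines.append("\t".join([gene, "na", *(str(value) for value in values)]))
    return "\n".join(lines) + "\n"


def format_cls(class_names, sample_classes):
    """Return CLS text; sample_classes lists each sample's class in GCT column order."""
    labels = [str(class_names.index(name)) for name in sample_classes]
    lines = [
        f"{len(sample_classes)} {len(class_names)} 1",
        "# " + " ".join(class_names),
        " ".join(labels),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical data: two genes measured in two tumor and two normal samples
samples = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
expression = {"geneX": [10.0, 12.5, 3.0, 2.5], "geneY": [1.0, 0.5, 8.0, 9.5]}

gct_text = format_gct(samples, expression)
cls_text = format_cls(["tumor", "normal"], ["tumor", "tumor", "normal", "normal"])
print(gct_text)
print(cls_text)
```

The order of labels in the CLS must match the order of sample columns in the GCT, since that is the only link between the two files.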

92 of 121

https://www.genepattern.org/file-formats-guide#gsc.tab=0

93 of 121

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Expression_Data_Formats

94 of 121

95 of 121

ODF is a GenePattern specific file: It is similar to the RES or GCT file formats for datasets. The main difference is in the header.

96 of 121

1.

2.

This module will make a heatmap of our data

97 of 121

GCT file goes here

This ODF file will automatically be loaded

98 of 121

99 of 121

100 of 121

101 of 121

Activity 2: single cell RNA-seq pipeline

103 of 121

3000 peripheral blood mononuclear cells from a healthy donor

What kinds of cells are in this population?

104 of 121

Go to cloud.genepattern.org/gp/

105 of 121

106 of 121

107 of 121

You might need to log in again

108 of 121

Download practice single-cell data from 10X:

https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

No need to unzip the file, but its contents are described here

109 of 121

110 of 121

111 of 121

112 of 121

113 of 121

114 of 121

Sharing results!

115 of 121

116 of 121

How confident do you feel about your understanding of gene expression data *now*?

117 of 121

How likely is this workshop to have a positive impact on the ease and efficiency of your work?

118 of 121

How likely are you to recommend this workshop?

119 of 121

What did you like best about this workshop?

120 of 121

Please share any recommendations you have for improvements

121 of 121

https://bit.ly/itn_demo

Demographics Survey