1 of 121

Understanding Gene Expression Data

Slides: https://bit.ly/itn_nyu_2024_expression

Candace Savonen, Carrie Wright, and Kate Isaac

2 of 121

Except where otherwise indicated, the contents of this slide presentation are available for use under the Creative Commons Attribution 4.0 license.

You are free to adapt and share the work, but you must give appropriate credit, provide a link to the license, and indicate if changes were made.

Sample attribution: [Title of work] by the Johns Hopkins Data Science Lab. CC-BY 4.0

Terms of Use

3 of 121

Welcome to the ITN workshop!

While you wait:

Please sign in: https://bit.ly/hutch_learner

And create a GenePattern account: https://cloud.genepattern.org/gp/

Slides are here: https://bit.ly/itn_nyu_2024_expression

4 of 121

Schedule for today:

  • Introduction ~15 min
  • Gene Expression Overview ~25 min
    • Bulk RNA-seq data
    • Single cell RNA-seq data
    • Considerations in choosing tools
  • Why is metadata important?
  • RNA-seq GenePattern Activity ~45 min
  • Feedback surveys ~5 min

5 of 121

6 of 121

Have your phone (or a separate tab) handy for interactive polls!

Join at slido.com #2871 844

7 of 121

What's your favorite candy?

8 of 121

How confident do you feel about your understanding of gene expression data?

9 of 121

What would you like to learn from this workshop?

10 of 121

Informatics Technology for Cancer Research (ITCR)

11 of 121

Informatics Technology for Cancer Research (ITCR)

… and more!

12 of 121

What is the ITN?

ITCR Training Network

Catalyzing informatics research through training opportunities

13 of 121

We are all busy - especially researchers!

https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif

14 of 121

Technology is changing quickly & it’s hard to keep up!

ITCR developers keep making more awesome software!

https://media.giphy.com/media/lRnUWhmllPI9a/giphy.gif

15 of 121

User preparedness

Gap

Tool usability

Informatics research is hindered by a gap between different types of experts

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

16 of 121

User preparedness

Gap

Tool usability

Catalyzing Informatics for Research

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

17 of 121

Elements of ITN:

  1. Make courses about informatics
  2. Make tools for researchers to do outreach
  3. Provide live education opportunities
  4. Enhance community engagement in cancer research

18 of 121

ITN courses

19 of 121

20 of 121

21 of 121

Image by Candace Savonen with Avataars and Openmoji.org

Your data are ready.

22 of 121

Image by Candace Savonen with Avataars, pixabay and openmoji.org

Genomic data

What is this and what do I do with it?

23 of 121

CC-BY

Concepts discussed in Choosing -omics Tools course:

What does your genomic data type represent?

What are the most common data processing steps for your data type?

Find resources, tools and tutorials to help you process and interpret your data

24 of 121

General Chapters

Data Specific Chapters

25 of 121

A Wikipedia for -omics analysis

Datatypes included so far:

  • RNA-seq
  • scRNA-seq
  • WGS/WXS
  • ATAC-seq
  • ChIP-seq
  • Microarrays
  • Methylation data

We hope to add more! Let us know if you’d like to contribute! (Stipends available for grad students)

26 of 121

What kind of genomic data are you working with most frequently?

27 of 121

RNA-seq data analysis workflows

28 of 121

Genomics workflows in a very general sense

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

29 of 121

RNA-seq

data generation

From the Childhood Cancer Data Lab: https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html

30 of 121

Single-end vs paired-end

Image from https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-6/

31 of 121

Single-end vs paired-end

Image from https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html#quality-control

Single-end:

------>

----------------------------- [fragment]

Paired-end:

------> <------

32 of 121

From Mike Love https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

exon 1

exon 2

exon 3

33 of 121

Poly A selection vs Ribo minus

Poly A selection Ribo minus

Tables from SITOOLS BIOTECH blog: https://blog.sitoolsbiotech.com/2019/08/ribo-depletion-rna-seq-ribosomal-rna-depletion-method-works-best/

34 of 121

Sequence related biases

  • 3’ bias - 3’ ends of sequences are more likely to be sequenced
  • GC bias - guanine-cytosine bonds melt at a higher temperature, so sequences with a lot of G’s and C’s can be amplified and sequenced less evenly

  • Sequence complexity - certain sequences more likely to have primers bound to them (and more likely to be sequenced)

  • Length bias - longer targets are more likely to be amplified or sequenced

Many of these biases are worsened by PCR amplification!
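
As a rough illustration of the GC bias bullet above, the GC fraction of a sequence can be computed directly. This is a minimal Python sketch: the function name `gc_fraction` and the example sequences are made up for illustration and do not come from any tool covered in this workshop.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq)


# Hypothetical example sequences (not real data).
at_rich = "ATATTAAATT"
gc_rich = "GGCGCCATGC"

print(gc_fraction(at_rich))  # 0.0
print(gc_fraction(gc_rich))  # 0.8
```

Comparing GC fraction against read coverage is one simple way to see whether a dataset might be affected by this bias.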

35 of 121

36 of 121

37 of 121

38 of 121

39 of 121

Some tools, and some tool options, account for these biases!

Look at tool documentation!

40 of 121

Very general bulk RNA-seq workflow steps

Image by Candace Savonen

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential expression

41 of 121

bulk RNA-seq workflow steps

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential expression

Input(s)

fastq

fastq (+ GTF/GFF or index)

bam or sam

(tab for counts)

Output(s)

HTML report

bam or sam

png, txt or csv

fastq

tab

42 of 121

What type of data does the tool expect?

  • Abundance?
  • Transformation?
  • File format?

43 of 121

What is a FASTQ file even?
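
A FASTQ file stores each read as four lines: an identifier line starting with `@`, the base sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below parses one record and decodes the quality characters, assuming the common Phred+33 encoding. The read shown is made up, and the names `FastqRecord` and `parse_fastq_record` are illustrative only.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(lines: List[str]) -> FastqRecord:
    """Parse the four lines of a single FASTQ record."""
    if len(lines) != 4:
        raise ValueError("A FASTQ record has exactly four lines")
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("Malformed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("Sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(char) - 33 for char in quality]
    return FastqRecord(header[1:], sequence, scores)


# Hypothetical read (not real data).
example = ["@read_1\n", "GATTACA\n", "+\n", "IIIIF#!\n"]
record = parse_fastq_record(example)

print(record.read_id)    # read_1
print(record.sequence)   # GATTACA
print(record.qualities)  # [40, 40, 40, 40, 37, 2, 0]
```

Higher scores mean more confident base calls; the last two bases in this made-up read would usually be candidates for trimming.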

44 of 121

45 of 121

46 of 121

47 of 121

How is single-cell RNA-seq the same or different?

48 of 121

Image from Nature, Tanaka et al, 2018

49 of 121

Image from 10X Genomics: https://www.10xgenomics.com/blog/single-cell-rna-seq-an-introductory-overview-and-tools-for-getting-started

50 of 121

Slide from the Childhood Cancer Data Lab

Laser

51 of 121

Slide from the Childhood Cancer Data Lab

Cells

Example: 10X Genomics Chromium

52 of 121

53 of 121

Full-length scRNA-seq

Pros:

  • Can use paired-end sequencing (less 3' bias)
  • More complete coverage of transcripts (which may be better for transcript discovery purposes)

Cons:

  • Is not very efficient (generally 96 cells per plate)
  • Takes much longer to run (days/weeks depending on sample size)
  • Expensive

Pre-processing: Very similar to bulk RNA-seq

54 of 121

55 of 121

Tag-Based scRNA-seq

Pros:

  • Can profile up to millions of cells
  • Takes less computing power
  • File storage requirements are smaller
  • Much less expensive

Cons:

  • More intense 3' bias (can’t do paired-end sequencing)
  • Coverage is generally not as deep

Pre-processing: use Alevin (a Salmon tool), which uses the cell barcodes to separate cells’ data

56 of 121

Single-cell RNA-seq quirks

Less starting material means:

1) More zeroes

2) More PCR amplification (and its associated biases)

Remember: some sequences are more likely to be amplified

57 of 121

Unique Molecular Identifiers (UMIs):

a ‘snapshot’ of the original molecules in the pre-amplified cell

From the Childhood Cancer Data Lab

Original image from: Islam et al. Nature Methods 2014 (https://doi.org/10.1038/nmeth.2772)
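
The idea can be sketched in a few lines: reads that share a cell barcode, UMI, and gene are treated as PCR copies of one original molecule and are counted once. This is a simplified Python illustration with made-up reads and a hypothetical helper `count_molecules`. Real tools also deal with sequencing errors in barcodes and UMIs, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical reads as (cell_barcode, umi, gene); not real data.
reads = [
    ("CELL_A", "AAGT", "GeneX"),
    ("CELL_A", "AAGT", "GeneX"),  # PCR duplicate of the read above
    ("CELL_A", "CCTA", "GeneX"),
    ("CELL_A", "GGAC", "GeneY"),
    ("CELL_B", "AAGT", "GeneX"),
    ("CELL_B", "AAGT", "GeneX"),  # PCR duplicate
    ("CELL_B", "AAGT", "GeneX"),  # PCR duplicate
]


def count_molecules(reads):
    """Count unique (cell, UMI, gene) combinations per cell and gene."""
    unique_molecules = set(reads)  # collapses PCR duplicates
    counts = defaultdict(int)
    for cell, _umi, gene in unique_molecules:
        counts[(cell, gene)] += 1
    return dict(counts)


umi_counts = count_molecules(reads)

print(umi_counts[("CELL_A", "GeneX")])  # 2
print(umi_counts[("CELL_A", "GeneY")])  # 1
print(umi_counts[("CELL_B", "GeneX")])  # 1
```

Note that CELL_B has three reads for GeneX but only one molecule: counting reads instead of UMIs would overstate its expression.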

58 of 121

59 of 121

60 of 121

61 of 121

Very general single cell RNA-seq workflow steps

Image by Candace Savonen

Quantification/Alignment

Quality control

UMIs

Doublet detection

Filtering

Normalization

Downstream analyses

Dimension Reduction

Cell classification

Differential expression

Trajectory

62 of 121

What files and formats are we using for (single-cell) RNA-seq data?

63 of 121

Single cell RNA-seq workflow steps

Quantification/Alignment

Quality control

UMIs

Doublet detection

Filtering

Normalization

Downstream analyses

Dimension Reduction

Cell classification

Differential expression

Trajectory

Input(s)

bcl (--> fastq)

bam + counts

Updated counts

Output(s)

bam or sam

Updated counts

png, txt or csv

mtx or counts

64 of 121

RNA-seq data analysis tools and platforms

65 of 121

Considerations for choosing tools:

Is it appropriate for your data type?

Is it an interface or programming language you feel comfortable with?

How much computing power do you have?

Are there benchmarking papers that compare the tool options?

Is the tool well documented and usable?

Is the tool well-maintained?

Is the tool generally accepted by the field?

66 of 121

Which tool for which workflow step?

Input: FASTQ Files

Output: Differentially expressed genes or transcripts

Step 1: QC

Step 2a: Mapping

Step 2b: Assemble

Step 3: Count/Quantify

Step 4: Normalize (e.g., TMM)

Step 5: Model DE

FastQC, MultiQC

  • Trimmomatic
  • Cutadapt
  • edgeR
  • DESeq2
  • limma + voom
  • Ballgown

Cufflinks

  • CuffDiff2
  • Salmon
  • Kallisto
  • TopHat
  • STAR
  • HISAT2
  • HTSeq
  • featureCounts
  • RSEM

tximport

StringTie
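
To give a feel for what the normalization step does, here is the simplest library-size normalization, counts per million (CPM), as a Python sketch. Note that this is not TMM, which the slide gives as its example; CPM is swapped in here only because it fits in a few lines. The gene names, sample names, and counts are made up.

```python
# Hypothetical raw counts per sample (not real data).
raw_counts = {
    "sample_1": {"GeneA": 500, "GeneB": 1500, "GeneC": 8000},
    "sample_2": {"GeneA": 1000, "GeneB": 3000, "GeneC": 16000},
}


def counts_per_million(sample_counts):
    """Scale each gene's count by the sample's total count, per million reads."""
    total = sum(sample_counts.values())
    return {gene: count * 1_000_000 / total for gene, count in sample_counts.items()}


cpm = {sample: counts_per_million(counts) for sample, counts in raw_counts.items()}

print(cpm["sample_1"]["GeneA"])  # 50000.0
print(cpm["sample_2"]["GeneA"])  # 50000.0
```

sample_2 was sequenced twice as deeply, so its raw counts are doubled, but after scaling the two samples agree. Methods such as TMM go further and also correct for differences in RNA composition between samples.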

67 of 121

68 of 121

69 of 121

Why is metadata important?

70 of 121

Let’s say you did your RNA-seq data analysis and you saw…

71 of 121

Metadata: Anything and everything that should be known about your samples!

A B C D

E F G H

sample_id  mouse_id  processing_date  treatment
A          1         3-10-21          None
B          1         4-12-21          None
C          2         3-10-21          None
D          2         4-12-21          None
E          3         3-10-21          Morphine
F          3         4-12-21          Morphine
G          4         3-10-21          Morphine
H          4         4-12-21          Morphine

I know everything I need to know about these samples from their metadata!
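
One practical use of a metadata table like this is checking whether a technical variable, such as processing date, is tangled up with the variable of interest. The sketch below re-enters the table above as Python data and cross-tabulates treatment against processing date. The variable names `metadata` and `design` are illustrative.

```python
from collections import Counter

# The sample metadata from the table above, one dict per sample.
metadata = [
    {"sample_id": "A", "mouse_id": 1, "processing_date": "3-10-21", "treatment": "None"},
    {"sample_id": "B", "mouse_id": 1, "processing_date": "4-12-21", "treatment": "None"},
    {"sample_id": "C", "mouse_id": 2, "processing_date": "3-10-21", "treatment": "None"},
    {"sample_id": "D", "mouse_id": 2, "processing_date": "4-12-21", "treatment": "None"},
    {"sample_id": "E", "mouse_id": 3, "processing_date": "3-10-21", "treatment": "Morphine"},
    {"sample_id": "F", "mouse_id": 3, "processing_date": "4-12-21", "treatment": "Morphine"},
    {"sample_id": "G", "mouse_id": 4, "processing_date": "3-10-21", "treatment": "Morphine"},
    {"sample_id": "H", "mouse_id": 4, "processing_date": "4-12-21", "treatment": "Morphine"},
]

# Count samples for every (treatment, processing_date) combination.
design = Counter((row["treatment"], row["processing_date"]) for row in metadata)

for (treatment, date), n_samples in sorted(design.items()):
    print(f"{treatment:<9} {date:<8} {n_samples}")
```

Every treatment group has two samples on each processing date, so in this design the processing date is not confounded with treatment. If all Morphine samples had been processed on one date and all untreated samples on the other, a batch effect could not be separated from a treatment effect.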

72 of 121

Examples of metadata categories:

  • Patient/organism of origin
  • Patient/organism information
    • Demographics
    • Disease state
    • Treatment state
    • Time point (if applicable)
  • Processing information
    • Batch information
    • Processing details (E.g. Isolation methods: Poly-A vs Ribo-minus)
  • Anything that should be known about the samples and their handling!

73 of 121

If you have human data, the metadata probably contains PII (personally identifiable information) and/or PHI (protected health information)

74 of 121

75 of 121

Why GenePattern?

76 of 121

77 of 121

Tools for Bioinformatics

78 of 121

Reproducible analysis tools for GUI

79 of 121

Why GenePattern?

  • GenePattern can be a nice way to get access to these tools without:
    • Having to know the command line
    • Having to struggle to install them
    • Having to figure out how to convert from one tool to another

80 of 121

GenePattern supports reproducibility

  • Record and replay of all analyses
  • Retain all versions of code – so results can be reproduced even if code changes
  • Chain analyses into “pipelines”, or workflows that can be shared and published

81 of 121

>250 GenePattern Modules, 8/2024

82 of 121

GenePattern tutorial

83 of 121

Let’s try it!

Let’s do bulk RNA differential expression analysis

84 of 121

85 of 121

TCGA - BRCA

20 breast cancer primary tumor samples

20 matched normal samples

Which genes differentiate breast cancer from non-cancerous tissue?

86 of 121

bulk RNA-seq workflow steps

Quantification/Alignment

Quality control

Sequence quality

trimming

Normalization

Downstream analyses

Dimension Reduction

Differential Expression

Input(s)

fastq or fastqsanger

fastq or fastqsanger (+ GTF/GFF or index)

bam or sam

(tab for counts)

Output(s)

HTML report

bam or sam

png, txt or csv

fastq

tab

87 of 121

88 of 121

89 of 121

This module will identify differentially expressed genes between our two groups

90 of 121

GCT file goes here

CLS file goes here

91 of 121

What are these files?

  • CLS: class file - describes which class (group) each sample belongs to
  • GCT: Gene Cluster Text file - gene expression data plus a few header lines and a description column
  • See here for how to create these files from your data
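
To make the two formats concrete, here is a minimal Python sketch that builds a tiny GCT and CLS pair. It assumes the GCT version 1.2 layout and the categorical CLS layout from the GenePattern file formats guide. The gene names, sample names, and expression values are invented, and the helper names `to_gct` and `to_cls` are hypothetical.

```python
def to_gct(genes, samples, values):
    """Build GCT text: version line, dimensions line, header row, then one row per gene."""
    lines = [
        "#1.2",
        f"{len(genes)}\t{len(samples)}",
        "\t".join(["Name", "Description"] + samples),
    ]
    for (name, description), row in zip(genes, values):
        lines.append("\t".join([name, description] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


def to_cls(class_names, sample_classes):
    """Build CLS text: counts line, class names line, then one label per sample."""
    labels = [str(class_names.index(c)) for c in sample_classes]
    lines = [
        f"{len(sample_classes)} {len(class_names)} 1",
        "# " + " ".join(class_names),
        " ".join(labels),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical data: 2 genes measured in 4 samples.
genes = [("GeneA", "made-up gene A"), ("GeneB", "made-up gene B")]
samples = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
values = [[10.5, 12.0, 3.2, 2.9], [1.1, 0.9, 7.4, 8.0]]

gct_text = to_gct(genes, samples, values)
cls_text = to_cls(["Tumor", "Normal"], ["Tumor", "Tumor", "Normal", "Normal"])

print(gct_text)
print(cls_text)
```

The sample order in the CLS label line has to match the column order in the GCT file, since that is the only link between the two.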

92 of 121

https://www.genepattern.org/file-formats-guide#gsc.tab=0

93 of 121

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Expression_Data_Formats

94 of 121

95 of 121

ODF is a GenePattern-specific file format. It is similar to the RES or GCT file formats for datasets; the main difference is in the header.

96 of 121

This module will make a heatmap of our data

97 of 121

GCT file goes here

This ODF file will automatically be loaded

98 of 121

99 of 121

100 of 121

101 of 121

Activity 2: single cell RNA-seq pipeline

102 of 121

103 of 121

What kinds of cells are in this population?

104 of 121

105 of 121

106 of 121

107 of 121

You might need to log in again

108 of 121

No need to unzip the file, but its contents are described here

109 of 121

110 of 121

111 of 121

112 of 121

113 of 121

114 of 121

Sharing results!

115 of 121

More tutorials online!

116 of 121

How confident do you feel about your understanding of gene expression data *now*?

117 of 121

How likely is this workshop to have a positive impact on the ease and efficiency of your work?

118 of 121

How likely are you to recommend this workshop?

119 of 121

What did you like best about this workshop?

120 of 121

Please share any recommendations you have for improvements

121 of 121

Demographics Survey