Understanding Gene Expression Data
Slides: https://bit.ly/itn_nyu_2024_expression
Candace Savonen, Carrie Wright, and Kate Isaac
Except where otherwise indicated, the contents of this slide presentation are available for use under the Creative Commons Attribution 4.0 license.
You are free to adapt and share the work, but you must give appropriate credit, provide a link to the license, and indicate if changes were made.
Sample attribution: [Title of work] by the Johns Hopkins Data Science Lab. CC-BY 4.0
Terms of Use
Welcome to the ITN workshop!
While you wait:
Please sign in: https://bit.ly/hutch_learner
And create a GenePattern account: https://cloud.genepattern.org/gp/
Slides are here: https://bit.ly/itn_nyu_2024_expression
Schedule for today:
Have your phone (or a separate tab) handy for interactive polls!
Join at slido.com #2871 844
What's your favorite candy?
How confident do you feel about your understanding of gene expression data?
What would you like to learn from this workshop?
Informatics Technology for Cancer Research (ITCR)
ITCR tools: https://bit.ly/ITCR_Tool_List
What is the ITN?
ITCR Training Network
Catalyzing informatics research through training opportunities
We are all busy - especially researchers!
https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif
Technology is changing quickly & it’s hard to keep up!
ITCR developers keep making more awesome software!
https://media.giphy.com/media/lRnUWhmllPI9a/giphy.gif
User preparedness
Gap
Tool usability
Informatics research is hindered by a gap between different types of experts
CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/
User preparedness
Gap
Tool usability
Catalyzing Informatics for Research
CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/
Elements of ITN:
ITN courses
Image by Candace Savonen with Avataars and Openmoji.org
Your data are ready.
Image by Candace Savonen with Avataars, pixabay and openmoji.org
Genomic data
What is this and what do I do with it?
CC-BY
Concepts discussed in Choosing -omics Tools course:
What does your genomic data type represent?
What are the most common data processing steps for your data type?
Find resources, tools and tutorials to help you process and interpret your data
General Chapters
Data Specific Chapters
A Wikipedia for -omic analysis
Data types included so far:
And we hope to add more! Let us know if you’d like to contribute! (Stipends available for grad students)
What kind of genomic data are you working with most frequently?
RNA-seq data analysis workflows
Genomics workflows in a very general sense
Image by Candace Savonen using IconFinder
Raw Data
Normalized Data
Summarized Data
Plots and Results!
RNA-seq data generation
From the Childhood Cancer Data Lab: https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html
Single-end vs paired-end
Image from https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-6/
Single-end vs paired-end
Image from https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html#quality-control
Single-end:  ------>
Fragment:    -----------------------------
Paired-end:  ------>               <------
From Mike Love https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/
exon 1
exon 2
exon 3
Poly A selection vs Ribo minus
Poly A selection Ribo minus
Tables from SITOOLS BIOTECH blog: https://blog.sitoolsbiotech.com/2019/08/ribo-depletion-rna-seq-ribosomal-rna-depletion-method-works-best/
Sequence related biases
Many of these biases are worsened by PCR amplification!
Some tools and some options account for these biases in some way!
Look at tool documentation!
Very general bulk RNA-seq workflow steps
Image by Candace Savonen
Quantification/Alignment
Quality control
Sequence quality
trimming
Normalization
Downstream analyses
Dimension Reduction
Differential expression
bulk RNA-seq workflow steps
Quantification/Alignment
Quality control
Sequence quality
trimming
Normalization
Downstream analyses
Dimension Reduction
Differential expression
Input(s) | fastq | fastq (+ GTF/GFF or index) | bam or sam (tab for counts) | |
Output(s) | HTML report, fastq | bam or sam, tab | | png, txt or csv |
What type of data does the tool expect?
What is a FASTQ file even?
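A FASTQ file stores each read as four lines: an `@`-prefixed identifier, the base calls, a `+` separator, and a quality string with one character per base. The sketch below parses a small in-memory example and decodes the qualities, assuming the common Phred+33 encoding. The read names and sequences are made up for illustration, and `parse_fastq` is a hypothetical helper, not part of any tool named in these slides.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical two-read example
example = StringIO(
    "@read1\nGATTACA\n+\nIIIIIII\n"
    "@read2\nACGT\n+\n!!5I\n"
)
records = list(parse_fastq(example))
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

Low scores at the ends of reads are what trimming tools typically remove before alignment or quantification.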
How is single-cell RNA-seq the same or different?
Image from Nature, Tanaka et al, 2018
Image from 10X Genomics: https://www.10xgenomics.com/blog/single-cell-rna-seq-an-introductory-overview-and-tools-for-getting-started
Slide from the Childhood Cancer Data Lab
Laser
Slide from the Childhood Cancer Data Lab
Cells
Example: 10X Genomics Chromium
Full-length scRNA-seq
Pros:
Cons:
Pre-processing: Very similar to bulk RNA-seq
Tag-Based scRNA-seq
Pros:
Cons:
Pre-processing: use Alevin (a Salmon tool), which uses the cell barcodes to separate each cell’s data
Single-cell RNA-seq quirks
Less starting material means:
1) More zeroes
2) More PCR amplification (and its associated biases)
Remember: some sequences are more likely to be amplified
Unique Molecular Identifiers (UMIs):
a ‘snapshot’ of the original molecules in the pre-amplified cell
From the Childhood Cancer Data Lab
Original image from: Islam et al. Nature Methods 2014 (https://doi.org/10.1038/nmeth.2772)
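The idea behind UMI counting can be sketched as collapsing reads that share a cell barcode, gene, and UMI into a single original molecule. This is a minimal illustration with made-up barcodes, UMIs, and gene names. Real tools (such as Alevin, named earlier in these slides) also correct sequencing errors in barcodes and UMIs, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical reads after alignment: (cell_barcode, umi, gene)
reads = [
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "TTGCA", "GeneA"),  # PCR duplicate of the read above
    ("AAAC", "TTGCA", "GeneA"),  # another PCR duplicate
    ("AAAC", "CCGTA", "GeneA"),  # same gene, different original molecule
    ("AAAC", "GGATC", "GeneB"),
    ("CCCT", "TTGCA", "GeneA"),  # same UMI, but a different cell
]


def count_molecules(reads):
    """Count unique UMIs per (cell, gene), so PCR duplicates count once."""
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in umis.items()}


umi_counts = count_molecules(reads)
for (cell, gene), n in sorted(umi_counts.items()):
    print(cell, gene, n)
```

Four reads of GeneA in cell `AAAC` collapse to two molecules, which is the "snapshot" of the pre-amplified cell.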
Very general single-cell RNA-seq workflow steps
Image by Candace Savonen
Quantification/Alignment
Quality control
UMIs
Doublet detection
Filtering
Normalization
Downstream analyses
Dimension Reduction
Cell classification
Differential expression
Trajectory
What files and formats are we using for (single-cell) RNA-seq data?
Single-cell RNA-seq workflow steps
Quantification/Alignment
Quality control
UMIs
Doublet detection
Filtering
Normalization
Downstream analyses
Dimension Reduction
Cell classification
Differential expression
Trajectory
Input(s) | bcl (--> fastq) | bam + counts | Updated counts | |
Output(s) | bam or sam, mtx or counts | Updated counts | | png, txt or csv |
RNA-seq data analysis tools and platforms
Considerations for choosing tools:
Is it appropriate for your data type?
Is it an interface or programming language you feel comfortable with?
How much computing power do you have?
Are there benchmarking papers that compare the tool options?
Is the tool well documented and usable?
Is the tool well-maintained?
Is the tool generally accepted by the field?
Which tool for which workflow step?
Input: FASTQ Files
Output: Differentially expressed genes or transcripts
Step 1: QC
Step 2a: Mapping
Step 2b: Assemble
Step 3: Count/Quantify
Step 4: Normalize (e.g., TMM)
Step 5: Model DE
FastQC, MultiQC
Cufflinks
tximport
StringTie
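To make the normalization step concrete, here is a sketch of counts-per-million (CPM) scaling, which divides each sample's counts by its library size. Note that this is not TMM: TMM additionally estimates a trimmed-mean scaling factor between samples, and CPM is shown only because it is the simplest depth correction. The gene names and counts are invented, and `counts_per_million` is a hypothetical helper.

```python
def counts_per_million(counts):
    """Scale one sample's raw gene counts so they sum to one million."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("Sample has no counts")
    return {gene: n * 1_000_000 / library_size for gene, n in counts.items()}


# Hypothetical raw counts for two samples sequenced at different depths
sample_a = {"GeneA": 10, "GeneB": 30, "GeneC": 60}
sample_b = {"GeneA": 100, "GeneB": 300, "GeneC": 600}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
print(cpm_a)
print(cpm_b)
```

The two samples differ tenfold in raw counts but have identical CPM values, which is the point of correcting for sequencing depth before comparing samples.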
Why is metadata important?
Let’s say you did your RNA-seq data analysis and you saw…
Metadata: Anything and everything that should be known about your samples!
A B C D
E F G H
sample_id | mouse_id | processing_date | treatment | … |
A | 1 | 3-10-21 | None | … |
B | 1 | 4-12-21 | None | … |
C | 2 | 3-10-21 | None | … |
D | 2 | 4-12-21 | None | … |
E | 3 | 3-10-21 | Morphine | … |
F | 3 | 4-12-21 | Morphine | … |
G | 4 | 3-10-21 | Morphine | … |
H | 4 | 4-12-21 | Morphine | … |
I know everything I need to know about these samples from their metadata!
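One practical use of a metadata table like the one above is checking whether a technical variable (here `processing_date`) is balanced across the variable of interest (`treatment`). The sketch below cross-tabulates those two columns using only the standard library. The rows are copied from the slide; `cross_tabulate` is a hypothetical helper.

```python
from collections import Counter

# Rows from the metadata table: (sample_id, mouse_id, processing_date, treatment)
metadata = [
    ("A", 1, "3-10-21", "None"),
    ("B", 1, "4-12-21", "None"),
    ("C", 2, "3-10-21", "None"),
    ("D", 2, "4-12-21", "None"),
    ("E", 3, "3-10-21", "Morphine"),
    ("F", 3, "4-12-21", "Morphine"),
    ("G", 4, "3-10-21", "Morphine"),
    ("H", 4, "4-12-21", "Morphine"),
]


def cross_tabulate(rows):
    """Count samples for each (treatment, processing_date) combination."""
    return Counter((treatment, date) for _, _, date, treatment in rows)


table = cross_tabulate(metadata)
for (treatment, date), n in sorted(table.items()):
    print(f"{treatment:<10}{date:<10}{n}")

# Balanced means every treatment was processed on every date,
# with the same number of samples in each combination.
treatments = {t for t, _ in table}
dates = {d for _, d in table}
balanced = (
    len(table) == len(treatments) * len(dates)
    and len(set(table.values())) == 1
)
print("Balanced design:", balanced)
```

If all morphine samples had been processed on one date and all controls on the other, a treatment effect could not be separated from a batch effect.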
Examples of metadata categories:
If you have human data, the metadata probably contains PII (personally identifiable information) and/or PHI (protected health information)
Why GenePattern?
Tools for Bioinformatics
Reproducible analysis tools for GUI
Why GenePattern?
GenePattern supports reproducibility
>250 GenePattern Modules, 8/2024
GenePattern tutorial
Let’s try it!
Let’s do a bulk RNA-seq differential expression analysis
20 breast cancer primary tumor samples
20 matched normal samples
Which genes differentiate breast cancer from non-cancer tissue?
bulk RNA-seq workflow steps
Quantification/Alignment
Quality control
Sequence quality
trimming
Normalization
Downstream analyses
Dimension Reduction
Differential Expression
Input(s) | fastq or fastqsanger | fastq or fastqsanger (+ GTF/GFF or index) | bam or sam (tab for counts) | |
Output(s) | HTML report, fastq | bam or sam, tab | | png, txt or csv |
Go to cloud.genepattern.org/gp/
This module will identify differentially expressed genes between our two groups
GCT file goes here
CLS file goes here
What are these files?
https://www.genepattern.org/file-formats-guide#gsc.tab=0
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Expression_Data_Formats
ODF is a GenePattern-specific file format. It is similar to the RES or GCT file formats for datasets; the main difference is in the header.
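As a rough sketch of what the GCT and CLS inputs look like, the code below builds a tiny expression file and a matching class file as text. It follows the layout in the file formats guide linked above: a `#1.2` version line, a dimensions line, and a `Name`/`Description` header for GCT; a counts line, a class-names line, and a label line for CLS. The gene names, sample names, and values are invented, and `to_gct` and `to_cls` are hypothetical helpers, not GenePattern functions.

```python
def to_gct(sample_names, rows):
    """Build GCT text. rows: list of (gene_name, description, [values])."""
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["Name", "Description"] + list(sample_names)))
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


def to_cls(labels):
    """Build CLS text from one class label per sample, in GCT column order."""
    class_names = list(dict.fromkeys(labels))  # unique labels, first-seen order
    header = f"{len(labels)} {len(class_names)} 1"
    names_line = "# " + " ".join(class_names)
    label_line = " ".join(str(class_names.index(lab)) for lab in labels)
    return "\n".join([header, names_line, label_line]) + "\n"


samples = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
gct_text = to_gct(samples, [
    ("GeneA", "na", [120.5, 98.0, 15.2, 20.1]),
    ("GeneB", "na", [5.0, 7.5, 6.1, 4.9]),
])
cls_text = to_cls(["tumor", "tumor", "normal", "normal"])
print(gct_text)
print(cls_text)
```

The CLS labels must follow the same sample order as the GCT columns, since the class file carries no sample names of its own.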
This module will make a heatmap of our data
GCT file goes here
This ODF file will be loaded automatically
Activity 2: single cell RNA-seq pipeline
What kinds of cells are in this population?
Go to cloud.genepattern.org/gp/
You might need to log in again
Download practice single-cell data from 10X:
No need to unzip the file, but its contents are described here
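For orientation, 10x count matrices are commonly distributed as a Matrix Market file (`matrix.mtx`) alongside feature and barcode lists, with genes as rows and cell barcodes as columns. Assuming the practice data follows that layout, the sketch below parses a tiny made-up matrix with the standard library and totals the counts per cell. In practice a library reader would be used instead; `read_mtx` is a hypothetical helper and the gene and barcode names are invented.

```python
from io import StringIO


def read_mtx(handle):
    """Parse a Matrix Market coordinate file into {(row, col): value}, 0-based."""
    stripped = (ln.strip() for ln in handle)
    lines = [ln for ln in stripped if ln and not ln.startswith("%")]  # drop header/comments
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)  # file indices are 1-based
    if len(entries) != n_entries:
        raise ValueError("Entry count does not match the size line")
    return (n_rows, n_cols), entries


# Hypothetical 3-gene x 2-cell matrix; zero counts are simply not stored
mtx_text = StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "3 2 4\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
    "3 2 1\n"
)
features = ["GeneA", "GeneB", "GeneC"]  # rows
barcodes = ["AAAC-1", "CCCT-1"]         # columns

shape, entries = read_mtx(mtx_text)
totals = {bc: 0 for bc in barcodes}
for (row, col), value in entries.items():
    totals[barcodes[col]] += value
print(shape, totals)
```

Storing only non-zero entries is what keeps single-cell matrices manageable, given how many zeroes they contain.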
Sharing results!
More tutorials online!
How confident do you feel about your understanding of gene expression data *now*?
How likely is this workshop to have a positive impact on the ease and efficiency of your work?
How likely are you to recommend this workshop?
What did you like best about this workshop?
Please share any recommendations you have for improvements
Demographics Survey