Designing your Analysis
By Tallulah Andrews
What is your question?
What data do you need to answer it?
Example Questions:
What progenitor states exist between hematopoietic stem cells and each differentiated blood cell type?
How do different Huntington gene variants affect neuronal identity/function?
How many cell-types exist in different regions of the liver?
How does the communication between glia and neurons change in Parkinson’s disease?
What transcription factor(s) control the differentiation of pancreatic cell-types?
Algorithms/Tools will always give you an answer!
That doesn’t mean that answer is “true”.
Batch Effects / Sample Integration
Part 1: When and what to normalize/correct?
Unconfounded Experiments vs Replicates
Biological “Noise” aka Confounders
Examples:
Solutions:
The Cell Cycle
Regressing out the cell-cycle will also remove all variability that is confounded with it.
Development/Differentiation
Mature tissue
Cancer
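The warning about regressing out the cell cycle can be illustrated with a small simulation. This is a minimal sketch with invented numbers, not taken from the slides: a hypothetical gene responds only to a differentiation score, but that score is correlated with a cell-cycle score. The names `diff`, `cycle`, `gene` and `regress_out` are illustrative assumptions.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def regress_out(y, covariate):
    """Return the residuals of y after a simple linear fit on covariate."""
    mx, my = mean(covariate), mean(y)
    slope = (sum((x - mx) * (v - my) for x, v in zip(covariate, y))
             / sum((x - mx) ** 2 for x in covariate))
    return [v - my - slope * (x - mx) for x, v in zip(covariate, y)]

random.seed(0)
n_cells = 500
# Assumed scenario: progenitors cycle more, so the cell-cycle score falls
# as the differentiation score rises (the two are confounded).
diff = [random.random() for _ in range(n_cells)]
cycle = [1.0 - d + random.gauss(0, 0.1) for d in diff]
# The gene responds to differentiation only, not to the cell cycle.
gene = [2.0 * d + random.gauss(0, 0.2) for d in diff]

before = correlation(gene, diff)
after = correlation(regress_out(gene, cycle), diff)
print(f"correlation with differentiation before regression: {before:.2f}")
print(f"correlation with differentiation after regression:  {after:.2f}")
```

In this toy setting most of the differentiation signal disappears along with the cell-cycle signal, even though the gene never depended on the cell cycle directly.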
Individual Variation
Human / Patient samples
Often unavoidably confounded with disease or treatment conditions
Treated similarly to batch effects.
Stress Response
[Figure: clusters Prog, Chol, CSC and Stress; gene groups: Cancer Stem, Metabolic, Redox, Stress]
Common when dissociating tissues
Highly cell-type dependent
May depend on condition / replicate
Often non-linear -> Difficult to regress
Typically exclude affected cells
When to impute/smooth data? - Visualization only
Part 2: Clustering vs Pseudotime
You shouldn’t be doing both.
More precisely: you shouldn’t be publishing both (usually).
Case Study: Malaria Cell Atlas
Problem: What is a “good” cluster?
How many clusters are there in this data?
Pause the video and decide on your answer
The Truth
Clustering: Common Assumptions
What happens if you use a clustering method on a gradient?
Pause the video and decide on your answer
[Figure: Gradient 1 and Gradient 2; labels: stem cells, differentiated, apical, basal, variability]
Trajectories are split up along their length so as to make the clusters as uniform as possible.
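The splitting behaviour can be reproduced with a toy example. This is a minimal sketch, assuming a plain one-dimensional k-means (Lloyd's algorithm) as a stand-in for whichever clustering method is actually used. The data are simulated cells spread evenly along a single gradient with no real groups, and `kmeans_1d` is a hypothetical helper written for this illustration.

```python
import random

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's algorithm for one-dimensional data."""
    ordered = sorted(values)
    # Initialise the centres at evenly spaced quantiles of the data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Keep the old centre if a group happens to be empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
    return labels, centres

random.seed(1)
# A smooth gradient: 1000 cells spread evenly along one axis.
position = [random.random() for _ in range(1000)]
labels, centres = kmeans_1d(position, k=4)

for j in range(4):
    members = [p for p, lab in zip(position, labels) if lab == j]
    print(f"cluster {j}: {len(members)} cells, "
          f"range {min(members):.2f} to {max(members):.2f}")
```

The algorithm returns four clusters of similar size that tile the gradient end to end, even though the data contain no discrete groups.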
Clustering: Common Assumptions
Clustering the same data with different parameters:
Seurat Applied to MCA data:
Morphologically there are 3 distinct parasite stages; using marker genes, we could link the clusters to them.
Are the subdivisions actually meaningful?
Or is this a smooth continuum that has been arbitrarily divided up by a clustering algorithm?
Clustering: How do you know a cluster is “real”?
e.g. Silhouette Index
e.g. scmap
Seeing that they “look good” in a visualization is not a good way to validate clustering
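As an illustration of a quantitative check, the silhouette index can be computed directly from pairwise distances. This is a minimal sketch on simulated two-dimensional points, not the Malaria Cell Atlas data. The helpers `silhouette` and `blob` are hypothetical names written for this example, and Euclidean distance is an assumption.

```python
import math
import random

def silhouette(points, labels):
    """Mean silhouette width using Euclidean distance."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points)
                      if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

random.seed(4)

def blob(cx, cy, n, sd=0.5):
    """Simulate n points scattered around the centre (cx, cy)."""
    return [(random.gauss(cx, sd), random.gauss(cy, sd)) for _ in range(n)]

# Case 1: two well-separated groups.
distinct_points = blob(0, 0, 100) + blob(5, 5, 100)
distinct_labels = [0] * 100 + [1] * 100

# Case 2: a continuum cut in half at an arbitrary position.
gradient_points = [(random.random(), random.gauss(0, 0.02)) for _ in range(200)]
gradient_labels = [0 if x < 0.5 else 1 for x, _ in gradient_points]

score_distinct = silhouette(distinct_points, distinct_labels)
score_gradient = silhouette(gradient_points, gradient_labels)
print(f"distinct groups: {score_distinct:.2f}")
print(f"split gradient:  {score_gradient:.2f}")
```

The arbitrarily split gradient can still receive a positive score, so the index is most informative when compared across alternative clusterings rather than read as an absolute pass or fail.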
Malaria Cell Atlas Data:
Pseudotime: Common Assumptions
Monocle Applied to MCA data: What happened?
PCA
Monocle
Pause the video and decide on your answer
Monocle Applied to MCA data:
Errors due to model assumptions:
2. False branches at the most heavily sampled portion of the cycle
Danger!
We could have interpreted one or both of the branches as cells moving in/out of the intraerythrocytic developmental cycle (IDC)
Malaria Cell Atlas - what did we do? (both)
Why did we end up doing both?
Part 3: Visualization
“All models are wrong, some are useful” - George Box
Visualizations
Visualization: Box Analogy
[Figure: a box viewed along different axes: PC 1 vs PC 2, PC 1 vs PC 3, a 3D view of PC1, PC2 and PC3, and a UMAP 1 vs UMAP 2 embedding]
Visualization: What do you choose to show?
Accurate distances between points - PCA
Overall structure - UMAP
Distinct clusters - t-SNE
Trajectories/gradients - Diffusion Map
How many clusters can you see?
tSNE
UMAP
Part 4: Differential Expression
General Linear Models: Common Assumptions
[Figure; axis label: Pseudotime]
Non-parametric Tests
What is the “best”?
Soneson and Robinson. 2018. Nature Methods 15: 255-261. doi: 10.1038/nmeth.4612
Conclusions:
DE across multiple batches and multiple conditions
Analysis of complex experiments
How would you test the following questions? (Hint: there is more than one good answer)
Pause the video now to write down your answers
(15 minutes allotted)
Effect of stimulation on gene expression in three batches of T-cells; each batch contains 50% stimulated and 50% unstimulated cells.
        | Stimulated | Unstimulated |
Batch 1 | 150        | 150          |
Batch 2 | 150        | 150          |
Batch 3 | 150        | 150          |
Example Analysis:
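One possible analysis for this balanced design is sketched below. It is an assumption made for illustration, not necessarily the approach shown in the original slides. The stimulation effect is estimated within each batch and averaged, and significance is assessed by shuffling condition labels within batches only, so that batch offsets never contribute to the comparison. The gene, the offsets and the effect size are all invented.

```python
import random

random.seed(2)
CELLS = 150  # cells per batch and condition, as in the table above

# Hypothetical gene: true stimulation effect of +1.0, plus a different
# technical offset in each batch (all values invented for illustration).
batch_offset = {"Batch 1": 0.0, "Batch 2": 1.5, "Batch 3": -1.0}
data = {
    batch: {
        "stim": [offset + 1.0 + random.gauss(0, 1) for _ in range(CELLS)],
        "unstim": [offset + random.gauss(0, 1) for _ in range(CELLS)],
    }
    for batch, offset in batch_offset.items()
}

def mean(xs):
    return sum(xs) / len(xs)

def stratified_effect(data):
    """Average of the within-batch differences (stimulated - unstimulated)."""
    return mean([mean(d["stim"]) - mean(d["unstim"]) for d in data.values()])

def permutation_p(data, n_perm=500):
    """Shuffle condition labels within each batch, never across batches."""
    observed = abs(stratified_effect(data))
    hits = 0
    for _ in range(n_perm):
        shuffled = {}
        for batch, d in data.items():
            pooled = d["stim"] + d["unstim"]
            random.shuffle(pooled)
            shuffled[batch] = {"stim": pooled[:CELLS], "unstim": pooled[CELLS:]}
        if abs(stratified_effect(shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

effect = stratified_effect(data)
p_value = permutation_p(data)
print(f"estimated stimulation effect: {effect:.2f}")
print(f"permutation p-value: {p_value:.4f}")
```

Because every batch contains both conditions in equal numbers, the averaged within-batch difference matches the condition coefficient of a linear model with batch as a covariate, which would be another reasonable way to run the same test.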
Analysis Option 2:
Cell-type-specific effects of diabetes in the pancreas; three replicates each were performed for diabetic and non-diabetic pancreas samples.
        | Diabetic | Healthy |
Donor 1 | 1500     | 0       |
Donor 2 | 1500     | 0       |
Donor 3 | 1500     | 0       |
Donor 4 | 0        | 1500    |
Donor 5 | 0        | 1500    |
Donor 6 | 0        | 1500    |
Example Analysis:
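In this design each donor belongs to only one condition, so cells from the same donor are not independent replicates. One possible approach, assumed here for illustration and not necessarily the one shown in the original slides, is a pseudobulk-style comparison with one summary value per donor. The sketch simulates a hypothetical gene with donor-to-donor variation but no true disease effect, and compares Welch's t statistic computed per cell with the same statistic computed per donor.

```python
import random
import statistics

random.seed(3)
CELLS_PER_DONOR = 1500  # as in the table above

# Invented donor-level offsets for one hypothetical gene in one cell type.
# There is NO true disease effect here, only donor-to-donor variation.
donor_offset = {
    "diabetic": [0.8, -0.3, 0.4],   # Donors 1-3
    "healthy": [-0.5, 0.6, 0.1],    # Donors 4-6
}
cells = {
    group: [[off + random.gauss(0, 1) for _ in range(CELLS_PER_DONOR)]
            for off in offsets]
    for group, offsets in donor_offset.items()
}

def welch_t(x, y):
    """Welch's t statistic for two samples."""
    se = (statistics.variance(x) / len(x)
          + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se

# Option A (problematic): treat every cell as an independent replicate.
pooled = {g: [v for donor in donors for v in donor]
          for g, donors in cells.items()}
t_cells = welch_t(pooled["diabetic"], pooled["healthy"])

# Option B (pseudobulk-style): one summary value per donor, n = 3 vs 3.
donor_means = {g: [statistics.mean(d) for d in donors]
               for g, donors in cells.items()}
t_donors = welch_t(donor_means["diabetic"], donor_means["healthy"])

print(f"t statistic with cells as replicates:  {t_cells:.1f}")
print(f"t statistic with donors as replicates: {t_donors:.1f}")
```

Treating cells as replicates produces a large test statistic from donor variation alone, whereas the donor-level comparison does not. For a cell-type-specific question, the same comparison would be repeated separately within each cell type.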
Analysis Option 2:
END
Types of questions
Descriptive: What is this?
Relational: How are things related to each other?
Comparative: How do things change when X happens?
[Figure; axis labels: Tumour Type, Marker Genes]
Most current tools exist in this space.
Differential Expression
Interactive Exercise: Research Questions
For each situation, decide what type of question it is and what data you would need to answer it.
Pause the video to write down your answers before continuing.
Exercise Situations:
(1) What progenitor states exist between hematopoietic stem cells and each differentiated blood cell type?
(2) How do different Huntington gene variants affect neuronal identity/function?
(3) How many cell-types exist in different regions of the liver?
(4) How does the communication between glia and neurons change in Parkinson’s disease?
(5) What transcription factor(s) control the differentiation of pancreatic cell-types?
Pause the video now to write down your answers
(5 minutes allotted)
Exercise Answers: (1)
What progenitor states exist between hematopoietic stem cells and each differentiated blood cell type?
Relational & Descriptive
Single-cell expression data from one sample of differentiated blood cells, plus samples from bone marrow and all other sites of blood cell maturation.
Exercise Answers: (2)
How do different Huntington gene variants affect neuronal identity/function?
Comparative
Single cell RNAseq from brain samples or brain organoids from multiple samples carrying different Huntington gene variants.
(Bulk RNAseq of sorted neurons is also valid)
Exercise Answers: (3)
How many cell-types exist in different regions of the liver?
Descriptive (perhaps Comparative)
Single-cell RNAseq from one or more different regions of a liver.
Exercise Answers: (4)
How does the communication between glia and neurons change in Parkinson’s disease?
Comparative & Relational
Single-cell RNAseq from both glia and neurons in normal samples and samples affected by Parkinson’s disease.
Exercise Answers: (5)
What transcription factor(s) control the differentiation of pancreatic cell-types?
Relational & Descriptive
Single-cell RNAseq at multiple time points during pancreatic development or in vitro differentiation