Marker Gene Detection
Content adapted from https://osca.bioconductor.org/marker-detection.html
Motivation
Motivation
We've now got clusters, but what are they (e.g. "what cell type is cluster 1")?
What genes are driving the clustering (e.g., "what genes are differentially expressed between clusters 1 and 2")?
Idea: Look at differences in gene expression profiles of cells from different clusters
Illustrative dataset:
Unfiltered 10X PBMC4k
Illustrative dataset: Unfiltered 10X PBMC4k
library(BiocFileCache)
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
library(Matrix)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics
https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# gene-annotation
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
# cell-detection
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# quality-control
stats <- perCellQCMetrics(sce.pbmc,
subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent,
type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]
# normalization
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# variance modelling
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# dimensionality-reduction
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc,
technical=dec.pbmc)
set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# clustering
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred="PCA")
clust <- igraph::cluster_walktrap(g)$membership
sce.pbmc$cluster <- factor(clust)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Motivation continued
Motivation
# Is gene 1 associated with the clustering?
plotExpression(sce.pbmc, features=rownames(sce.pbmc)[1],
x="cluster", colour_by="cluster")
Motivation
Motivation
# Is gene 2 associated with the clustering?
plotExpression(sce.pbmc, features=rownames(sce.pbmc)[2],
x="cluster", colour_by="cluster")
Motivation
Motivation
# Is gene 2512 associated with the clustering?
plotExpression(sce.pbmc, features=rownames(sce.pbmc)[2512],
x="cluster", colour_by="cluster")
Motivation
Motivation
Motivation
Motivation
Looking at this plot as a way to find marker genes obviously doesn't scale
Need a statistical method to identify these marker genes
👉 The Welch t-test is an obvious choice to test for differences in expression between clusters
Pairwise Welch t-test
➕ Fast and good statistical properties for a large number of cells (Soneson and Robinson, 2018)
➕ Pairwise comparisons provide log-fold change to indicate which clusters are distinguished by each gene
🤔 Why not compare each cluster to the average of all other cells?
➖ Sensitive to population composition, a single dominant
subpopulation will drive the selection of top markers for every other
cluster
Illustrative example:
CD3E as a marker gene in the 10X PBMC4k dataset
Pairwise tests
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
... | ... | ... |
1 vs 18 | 1.39 | 2.2x10-9 |
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
... | ... | ... |
1 vs 18 | 1.39 | 2.2x10-9 |
2 vs 1 | -1.50 | 1.7x10-198 |
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
... | ... | ... |
1 vs 18 | 1.39 | 2.2x10-9 |
2 vs 1 | -1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
Pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
... | ... | ... |
1 vs 18 | 1.39 | 2.2x10-9 |
2 vs 1 | -1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
K = 18 clusters
🤓 K! / (K - 2)! = 306 comparisons
Redundancy in pairwise tests
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
1 vs 3 | -0.08 | 0.11 |
... | ... | ... |
1 vs 18 | 1.39 | 2.2x10-9 |
2 vs 1 | -1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
🤓 But half of these comparisons are redundant
Combining comparisons of CD3E for cluster 1
Gene | Comparison | logFC | P-value |
CD3E | 1 vs 2 | ... | ... |
CD3E | ... | ... | ... |
CD3E | 1 vs 18 | .. | ... |
CD3E | 2 vs 1 | ... | ... |
CD3E | ... | ... | ... |
CD3E | 18 vs 1 | ... | ... |
Gene | Comparison | logFC | P-value |
CD3E | 1 vs 2 | ... | ... |
CD3E | ... | ... | ... |
CD3E | 1 vs 18 | .. | ... |
Combining comparisons of CD3E for cluster 1
Gene | P-value.1vs2 | ... | P-value.1vs18 | logFC.1vs2 | ... | logFC.1vs18 |
CD3E | ... | ... | ... | ... | ... | ... |
Gene | Comparison | logFC | P-value |
CD3E | 1 vs 2 | ... | ... |
CD3E | ... | ... | ... |
CD3E | 1 vs. 18 | .. | ... |
Combining comparisons of CD3E for cluster 1
🤔 "I'm interested in CD3 if it is differentially expressed between cluster 1 and ..."
Gene | P-value.1vs2 | ... | P-value.1vs18 | logFC.1vs2 | ... | logFC.1vs18 |
CD3E | ... | ... | ... | ... | ... | ... |
Gene | Combined P-value | logFC.1vs2 | ... | logFC.1vs18 |
CD3E | ... | ... | ... | ... |
Combining comparisons of CD3E for cluster 1
any other cluster: P = 1.3 x 10-205
🤓 Simes-adjusted P-value
all other clusters: P = 0.11
🤓 Berger's intersection-union test
some other cluster: P = 2.0 x 10-44
🤓 median (or other quantile) of Holm-adjusted P-values
"I'm interested in CD3 if it is differentially expressed between cluster 1 and ..."
Extending to all genes
Extending to all genes
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
CD3E
Extending to all genes
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
Comparison | logFC | P-value |
1 vs 2 | 0.00 | 0.53 |
... | ... | ... |
18 vs 17 | 0.27 | 0.33 |
CD3E
PF4
Extending to all genes
Comparison | logFC | P-value |
1 vs 2 | 1.50 | 1.7x10-198 |
... | ... | ... |
18 vs 17 | 0.11 | 0.46 |
Comparison | logFC | P-value |
1 vs 2 | 0.00 | 0.53 |
... | ... | ... |
18 vs 17 | 0.27 | 0.33 |
CD3E
PF4
...
Extending to all genes
Comparison | logFC | P-value |
1 vs 2 | ... | ... |
... | ... | ... |
18 vs 17 | ... | ... |
scran::pairwiseTTests()
Comparison | logFC | P-value |
1 vs 2 | ... | ... |
... | ... | ... |
18 vs 17 | ... | ... |
Gene 1
Gene M
...
Extending to all genes
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene M | ... | ... |
scran::pairwiseTTests()
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene M | ... | ... |
1 vs 2
18 vs 17
...
Extending to all genes
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene M | ... | ... |
scran::pairwiseTTests()
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene 2 | ... | ... |
1 vs 2
18 vs 17
...
M = 33,694 genes
🤓 K x M = 10,310,364 tests
Extending to all clusters
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene M | ... | ... |
scran::combineMarkers()
Gene | logFC | P-value |
Gene 1 | ... | ... |
... | ... | ... |
Gene 2 | ... | ... |
1 vs 2
18 vs 17
...
Gene | Combined P-value | logFC.2 | ... | logFC.18 |
Gene 1 | ... | ... | | |
... | ... | ... | | |
Gene M | ... | ... | | |
Comparisons involving cluster 1
Gene | Combined P-value | logFC.2 | ... | logFC.18 |
Gene 1 | ... | ... | | |
... | ... | ... | | |
Gene M | ... | ... | | |
Comparisons involving cluster 18
...
Standard application
Standard application
library(scran)
markers.pbmc <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,
test.type="t", pval.type="any")
For each cluster, use Welch t-tests to identify genes that are differentially expressed between it and any other cluster
scran::findMarkers()
Standard application
chosen <- "9"
interesting <- markers.pbmc[[chosen]]
Standard application
plotExpression(sce.pbmc, rownames(interesting)[1:4],
x="cluster", colour_by="cluster")
Standard application
best.set <- interesting[interesting$Top <= 6,]
logFCs <- as.matrix(best.set[,-(1:3)])
colnames(logFCs) <- sub("logFC.", "", colnames(logFCs))
library(pheatmap)
pheatmap(logFCs, breaks=seq(-5, 5, length.out=101))
👉 We use the 'Top' field to identify a set of genes that distinguishes cluster 9 from any other cluster
Using the log-fold change
Using the log-fold change
markers.pbmc.up <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,
test.type="t", direction="up", pval.type="any")
interesting.up <- markers.pbmc.up[[chosen]]
For each cluster, use Welch t-tests to identify genes that are up-regulated between it and any other cluster
scran::findMarkers()
Using the log-fold change
markers.pbmc.up2 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,
test.type="t", direction="up", lfc=1, pval.type="any")
interesting.up2 <- markers.pbmc.up2[[chosen]]
For each cluster, use Welch t-tests to identify genes that are up-regulated with a log-fold change (lfc) of at least 1 between it and any other cluster
👉 t-test also allows us to specify a non-zero log-fold change as the null hypothesis
🤓 More rigorous than simply filtering on the log-fold changes (see TREAT)
scran::findMarkers()
Using the log-fold change
best.set <- interesting.up2[interesting.up2$Top <= 5,]
logFCs <- as.matrix(best.set[,-(1:3)])
colnames(logFCs) <- sub("logFC.", "", colnames(logFCs))
pheatmap(logFCs, breaks=seq(-5, 5, length.out=101))
👉 Yields a more focused set of candidate marker genes that are upregulated in cluster 9
⚠️ Increased stringency is not without cost
⚠️ If 'lfc' is too large might discard otherwise useful genes
E.g., a gene upregulated in a small proportion of cells in a cluster can still be
an effective marker if the focus is on specificity rather than sensitivity
Finding cluster-specific markers
Finding cluster-specific marker genes
👉 By default, scran::findMarkers() will give a high ranking to genes that are DE in any pairwise comparison
🤔 I want genes that are specific to each cluster
👉 You want genes that are DE in all pairwise comparisons
Finding cluster-specific marker genes
markers.pbmc.up3 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,
direction="up", pval.type="all")
interesting.up3 <- markers.pbmc.up3[[chosen]]
For each cluster, use Welch t-tests to identify genes that are up-regulated between it and all other clusters
🤓 Uses an intersection-union test to combine the P-values which is the maximum P-value from all pairwise comparisons
Pros/cons of cluster-specific marker genes
| | Gene expressed | |
| | CD4 | CD8 |
Population | DN (CD4-/CD8-) | ❌ | ❌ |
CD4+ | ✔️ | ❌ | |
CD8+ | ❌ | ✔️ | |
DP (CD4+/CD8+) | ✔️ | ✔️ |
⚠️ Neither CD4 nor CD8 is a cluster specific marker gene in this scenario
Finding cluster-specific-ish marker genes
markers.pbmc.up4 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,
direction="up", pval.type="some")
interesting.up4 <- markers.pbmc.up4[[chosen]]
👉 For when pval.type="all" is too stringent yet pval.type="any" is too generous
🤓 Apply Holm-Bonferroni correction across P-values and take the middle-most value as the combined P-value
⚠️ You lose some theoretical guarantees offered by the other methods
For each cluster, use Welch t-tests to identify genes that are up-regulated between it and some other clusters
Alternative testing regimes
Motivation
The t-test isn't the only way to compare two groups of measurements
🤔 I want a test can be used to perfectly distinguish two clusters from one another
👉 Wilcoxon rank sum test
🤔 I want to identify genes that are expressed more often in one cluster than another
👉 Binomial test
Wilcoxon rank sum test
Directly assesses separation between the expression distribution of different clusters
🤓 Proportional to the area-under-the-curve (AUC), which is the probability a random cell from one cluster having higher expression than a random cell from another cluster
👉 AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions
🤓 Also known as the Wilcoxon-Mann-Whitney (WMW) test
Wilcoxon rank sum test
markers.pbmc.wmw <- findMarkers(sce.pbmc,
groups=sce.pbmc$cluster, test.type="wilcox",
direction="up", pval.type="any")
interesting.wmw <- markers.pbmc.wmw[[chosen]]
For each cluster, use Wilcoxon rank sum tests to identify genes that are up-regulated between it and any other cluster
scran::findMarkers()
Wilcoxon rank sum test
best.set <- interesting.wmw[interesting.wmw$Top <= 5,]
AUCs <- as.matrix(best.set[,-(1:3)])
colnames(AUCs) <- sub("AUC.", "", colnames(AUCs))
pheatmap(AUCs, breaks=seq(0, 1, length.out=21),
color=viridis::viridis(21))
Summary of Wilcoxon rank sum test
➕ Directly addresses the desirable property of a marker gene (i.e. that the gene perfectly distinguishes two clusters)
➕ Is symmetric with respect to differences in the size of the groups being compared
➖ Much slower to compared compared to t-statistics (although generally not an issue in practice)
Binomial test
Test identifies genes that differ in the proportion of expressing cells between clusters
A much more stringent definition of marker genes
🤓 Turns expression into a binary measure of presence/absence, so all quantitative information is ignored
Binomial test
markers.pbmc.binom <- findMarkers(sce.pbmc,
groups=sce.pbmc$cluster, test.type="binom",
direction="up", pval.type="any")
interesting.binom <- markers.pbmc.binom[[chosen]]
For each cluster, use Binomial tests to identify genes that are more frequently expressed (up-regulated) in it compared to any other cluster
🤓 Effect size is reported as the log-fold change in the proportion of expressing cells between clusters
👉 Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other
scran::findMarkers()
Binomial test
top.genes <- head(rownames(interesting.binom))
plotExpression(sce.pbmc, x="cluster", features=top.genes)
scran::findMarkers()
Summary of Binomial test
The Binomial test does not care about scaling normalization
➕ Produces marker genes that may be easier to validate
➖ Increased stringency can lead to loss of good candidate markers
Custom differential expression methods
🤔 Why not just use edgeR/DESeq2/limma-voom or custom methods (e.g., MAST)?
👉 You can!
https://osca.bioconductor.org/marker-detection.html#using-custom-de-methods
👉 But these are probably overkill for identifying marker genes
🤓 Cells are our 'replicates' for the purposes of marker gene detection
🤓 edgeR/DESeq2/limma-voom make stronger assumptions about the data that are more likely to be violated for individual cells in scRNA-seq
Illustrative dataset: 416B
Illustrative dataset: 416B
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)
Immortalized mouse myeloid progenitor cell line processed using SmartSeq2
A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation.
https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# gene-annotation
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97,
keys=rownames(sce.416b),
keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97,
keys=rownames(sce.416b),
keytype="GENEID", column="SEQNAME")
library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
rowData(sce.416b)$SYMBOL)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# quality-control
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats,
percent_subsets=c("subsets_Mt_percent", "altexps_ERCC_percent"),
batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]
# normalization
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# variance-modelling
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC",
block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)
# batch-correction (we'll learn about this later)
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(
logcounts(sce.416b),
design=model.matrix(~sce.416b$phenotype),
batch=sce.416b$block)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# dimensionality-reduction
sce.416b <- runPCA(sce.416b, ncomponents=10,
subset_row=chosen.hvgs,
exprs_values="corrected",
BSPARAM=BiocSingular::ExactParam())
set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# clustering (hierarchical clustering)
my.dist <- dist(reducedDim(sce.416b, "PCA"))
my.tree <- hclust(my.dist, method="ward.D2")
library(dynamicTreeCut)
my.clusters <- unname(
cutreeDynamic(my.tree, distM=as.matrix(my.dist),
minClusterSize=10, verbose=0))
sce.416b$cluster <- factor(my.clusters)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Handling blocking factors
(aka considering experimental factors)
Handling blocking factors
Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences) for the purposes of marker gene detection
👉These are known as blocking factors
👉If these aren't modelled, they can interfere with marker gene detection
Handling blocking factors
m.out <- findMarkers(sce.416b, groups=sce.416b$cluster,
block=sce.416b$block, test.type="t", direction="up")
For each cluster, account for the experimental blocking and use Welch t-tests to identify genes that are up-regulated between it and any other cluster
👉 Each pairwise comparison between clusters is performed separately in each level of the blocking factor (here, plate)
👉 Favours genes that exhibit consistent DE in the same direction in each level of the blocking factor (here, plate)
🤓 P-values combined using Stouffer's Z method to obtain a single P-value for each comparison
Ignoring blocking factors can lead to 'false' marker genes
Statistical issues
Invalidity of P-values
👉 All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent
🤓 'Data dredging': the DE analysis is performed using the same data used to obtain the clusters
👉 Testing for DE genes between clusters will inevitably yield some significant results as that's how the clusters were defined!
Invalidity of P-values
🤓 A simulation with i.i.d. normal values with k-means (k=2) clustering followed by testing for DE
Invalidity of P-values
👉 Even though P-values are flawed, the effect is largely harmless for marker gene detection as the P-values are used only for ranking
🤓 Can't use P-values to define 'significant differences' between clusters with respect to an error rate threshold
Nature of replication
👉 Ideally, validate some of the markers with an independent replicate population of cells (and ideally a different technique, e.g., fluorescent in situ hybridization or qPCR)
Further comments
Further comments
👉 The DE analysis strategy is that markers are defined relative to subpopulations in the same dataset
👉 If a gene is expressed uniformly throughout the population, it will not come up as a marker
👉 There are fancy machine learning methods for doing marker gene detection, but the humble t-test is still great
Summary and recommendations
Summary and recommendations
👉 Create multiple lists of marker genes at different 'stringencies'
👉 The simplest to interpret marker genes are the 'uniquely up-regulated' or 'cluster-specific' genes, especially if we can impose a minimum log-fold change
👉 You may need to do more focused marker gene detection, e.g., subset the data to just 2 clusters of interest and run find scran::findMarkers()
Where we're at