1 of 90

Marker Gene Detection

Content adapted from https://osca.bioconductor.org/marker-detection.html

2 of 90

Motivation

3 of 90

Motivation

We've now got clusters, but what are they (e.g. "what cell type is cluster 1")?

What genes are driving the clustering (e.g., "what genes are differentially expressed between clusters 1 and 2")?

Idea: Look at differences in gene expression profiles of cells from different clusters

4 of 90

Illustrative dataset:

Unfiltered 10X PBMC4k

5 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

library(BiocFileCache)

bfc <- BiocFileCache()

raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",

"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))

untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)

library(Matrix)

fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")

sce.pbmc <- read10xCounts(fname, col.names=TRUE)

Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics

https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

6 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

# gene-annotation

library(scater)

rownames(sce.pbmc) <- uniquifyFeatureNames(

rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)

location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,

column="SEQNAME", keytype="GENEID")

# cell-detection

set.seed(100)

e.out <- emptyDrops(counts(sce.pbmc))

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

7 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

# quality-control

stats <- perCellQCMetrics(sce.pbmc,

subsets=list(Mito=which(location=="MT")))

high.mito <- isOutlier(stats$subsets_Mito_percent,

type="higher")

sce.pbmc <- sce.pbmc[,!high.mito]

# normalization

library(scran)

set.seed(1000)

clusters <- quickCluster(sce.pbmc)

sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)

sce.pbmc <- logNormCounts(sce.pbmc)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

8 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

# variance modelling

set.seed(1001)

dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)

top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

9 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

# dimensionality-reduction

set.seed(10000)

sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc,

technical=dec.pbmc)

set.seed(100000)

sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)

sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

10 of 90

Illustrative dataset: Unfiltered 10X PBMC4k

# clustering

g <- buildSNNGraph(sce.pbmc, k=10, use.dimred="PCA")

clust <- igraph::cluster_walktrap(g)$membership

sce.pbmc$cluster <- factor(clust)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

11 of 90

Motivation continued

12 of 90

Motivation

# Is gene 1 associated with the clustering?

plotExpression(sce.pbmc, features=rownames(sce.pbmc)[1],

x="cluster", colour_by="cluster")

13 of 90

Motivation

14 of 90

Motivation

# Is gene 2 associated with the clustering?

plotExpression(sce.pbmc, features=rownames(sce.pbmc)[2],

x="cluster", colour_by="cluster")

15 of 90

Motivation

16 of 90

17 of 90

Motivation

# Is gene 2512 associated with the clustering?

plotExpression(sce.pbmc, features=rownames(sce.pbmc)[2512],

x="cluster", colour_by="cluster")

18 of 90

19 of 90

Motivation

20 of 90

Motivation

21 of 90

Motivation

22 of 90

Motivation

Looking at this plot as a way to find marker genes obviously doesn't scale

Need a statistical method to identify these marker genes

👉 The Welch t-test is an obvious choice to test for differences in expression between clusters

23 of 90

Pairwise Welch t-test

➕ Fast and good statistical properties for a large number of cells (Soneson and Robinson, 2018)

➕ Pairwise comparisons provide log-fold change to indicate which clusters are distinguished by each gene

🤔 Why not compare each cluster to the average of all other cells?

➖ Sensitive to population composition, a single dominant

subpopulation will drive the selection of top markers for every other

cluster

24 of 90

Illustrative example:

CD3E as a marker gene in the 10X PBMC4k dataset

25 of 90

Pairwise tests

26 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198

27 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11

28 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11
...	...	...
1 vs 18	1.39	2.2x10^-9

29 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11
...	...	...
1 vs 18	1.39	2.2x10^-9
2 vs 1	-1.50	1.7x10^-198

30 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11
...	...	...
1 vs 18	1.39	2.2x10^-9
2 vs 1	-1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

31 of 90

Pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11
...	...	...
1 vs 18	1.39	2.2x10^-9
2 vs 1	-1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

K = 18 clusters

🤓 K! / (K - 2)! = 306 comparisons

32 of 90

Redundancy in pairwise tests

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
1 vs 3	-0.08	0.11
...	...	...
1 vs 18	1.39	2.2x10^-9
2 vs 1	-1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

🤓 But half of these comparisons are redundant

33 of 90

Combining comparisons of CD3E for cluster 1

Gene	Comparison	logFC	P-value
CD3E	1 vs 2	...	...
CD3E	...	...	...
CD3E	1 vs 18	..	...
CD3E	2 vs 1	...	...
CD3E	...	...	...
CD3E	18 vs 1	...	...

Gene	Comparison	logFC	P-value
CD3E	1 vs 2	...	...
CD3E	...	...	...
CD3E	1 vs 18	..	...

34 of 90

Combining comparisons of CD3E for cluster 1

Gene	P-value.1vs2	...	P-value.1vs18	logFC.1vs2	...	logFC.1vs18
CD3E	...	...	...	...	...	...

Gene	Comparison	logFC	P-value
CD3E	1 vs 2	...	...
CD3E	...	...	...
CD3E	1 vs. 18	..	...

35 of 90

Combining comparisons of CD3E for cluster 1

🤔 "I'm interested in CD3 if it is differentially expressed between cluster 1 and ..."

any other cluster
all other clusters
some other cluster(s

Gene	P-value.1vs2	...	P-value.1vs18	logFC.1vs2	...	logFC.1vs18
CD3E	...	...	...	...	...	...

Gene	Combined P-value	logFC.1vs2	...	logFC.1vs18
CD3E	...	...	...	...

36 of 90

Combining comparisons of CD3E for cluster 1

any other cluster: P = 1.3 x 10^-205

🤓 Simes-adjusted P-value

all other clusters: P = 0.11

🤓 Berger's intersection-union test

some other cluster: P = 2.0 x 10^-44

🤓 median (or other quantile) of Holm-adjusted P-values

"I'm interested in CD3 if it is differentially expressed between cluster 1 and ..."

37 of 90

Extending to all genes

38 of 90

Extending to all genes

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

CD3E

39 of 90

Extending to all genes

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

Comparison	logFC	P-value
1 vs 2	0.00	0.53
...	...	...
18 vs 17	0.27	0.33

CD3E

PF4

40 of 90

Extending to all genes

Comparison	logFC	P-value
1 vs 2	1.50	1.7x10^-198
...	...	...
18 vs 17	0.11	0.46

Comparison	logFC	P-value
1 vs 2	0.00	0.53
...	...	...
18 vs 17	0.27	0.33

CD3E

PF4

...

41 of 90

Extending to all genes

Comparison	logFC	P-value
1 vs 2	...	...
...	...	...
18 vs 17	...	...

scran::pairwiseTTests()

Comparison	logFC	P-value
1 vs 2	...	...
...	...	...
18 vs 17	...	...

Gene 1

Gene M

...

42 of 90

Extending to all genes

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene M	...	...

scran::pairwiseTTests()

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene M	...	...

1 vs 2

18 vs 17

...

43 of 90

Extending to all genes

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene M	...	...

scran::pairwiseTTests()

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene 2	...	...

1 vs 2

18 vs 17

...

M = 33,694 genes

🤓 K x M = 10,310,364 tests

44 of 90

Extending to all clusters

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene M	...	...

scran::combineMarkers()

Gene	logFC	P-value
Gene 1	...	...
...	...	...
Gene 2	...	...

1 vs 2

18 vs 17

...

Gene	Combined P-value	logFC.2	...	logFC.18
Gene 1	...	...
...	...	...
Gene M	...	...

Comparisons involving cluster 1

Gene	Combined P-value	logFC.2	...	logFC.18
Gene 1	...	...
...	...	...
Gene M	...	...

Comparisons involving cluster 18

...

45 of 90

Standard application

46 of 90

Standard application

library(scran)

markers.pbmc <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,

test.type="t", pval.type="any")

For each cluster, use Welch t-tests to identify genes that are differentially expressed between it and any other cluster

scran::findMarkers()

47 of 90

Standard application

chosen <- "9"

interesting <- markers.pbmc[[chosen]]

48 of 90

Standard application

plotExpression(sce.pbmc, rownames(interesting)[1:4],

x="cluster", colour_by="cluster")

49 of 90

Standard application

best.set <- interesting[interesting$Top <= 6,]

logFCs <- as.matrix(best.set[,-(1:3)])

colnames(logFCs) <- sub("logFC.", "", colnames(logFCs))

library(pheatmap)

pheatmap(logFCs, breaks=seq(-5, 5, length.out=101))

👉 We use the 'Top' field to identify a set of genes that distinguishes cluster 9 from any other cluster

50 of 90

Using the log-fold change

51 of 90

Using the log-fold change

markers.pbmc.up <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,

test.type="t", direction="up", pval.type="any")

interesting.up <- markers.pbmc.up[[chosen]]

For each cluster, use Welch t-tests to identify genes that are up-regulated between it and any other cluster

scran::findMarkers()

52 of 90

Using the log-fold change

markers.pbmc.up2 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,

test.type="t", direction="up", lfc=1, pval.type="any")

interesting.up2 <- markers.pbmc.up2[[chosen]]

For each cluster, use Welch t-tests to identify genes that are up-regulated with a log-fold change (lfc) of at least 1 between it and any other cluster

👉 t-test also allows us to specify a non-zero log-fold change as the null hypothesis

🤓 More rigorous than simply filtering on the log-fold changes (see TREAT)

scran::findMarkers()

53 of 90

Using the log-fold change

best.set <- interesting.up2[interesting.up2$Top <= 5,]

logFCs <- as.matrix(best.set[,-(1:3)])

colnames(logFCs) <- sub("logFC.", "", colnames(logFCs))

pheatmap(logFCs, breaks=seq(-5, 5, length.out=101))

👉 Yields a more focused set of candidate marker genes that are upregulated in cluster 9

⚠️ Increased stringency is not without cost

⚠️ If 'lfc' is too large might discard otherwise useful genes

E.g., a gene upregulated in a small proportion of cells in a cluster can still be

an effective marker if the focus is on specificity rather than sensitivity

54 of 90

Finding cluster-specific markers

55 of 90

Finding cluster-specific marker genes

👉 By default, scran::findMarkers() will give a high ranking to genes that are DE in any pairwise comparison

🤔 I want genes that are specific to each cluster

👉 You want genes that are DE in all pairwise comparisons

56 of 90

Finding cluster-specific marker genes

markers.pbmc.up3 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,

direction="up", pval.type="all")

interesting.up3 <- markers.pbmc.up3[[chosen]]

For each cluster, use Welch t-tests to identify genes that are up-regulated between it and all other clusters

🤓 Uses an intersection-union test to combine the P-values which is the maximum P-value from all pairwise comparisons

57 of 90

Pros/cons of cluster-specific marker genes

		Gene expressed
		CD4	CD8
Population	DN (CD4^-/CD8^-)	❌	❌
	CD4⁺	✔️	❌
	CD8⁺	❌	✔️
	DP (CD4⁺/CD8⁺)	✔️	✔️

⚠️ Neither CD4 nor CD8 is a cluster specific marker gene in this scenario

58 of 90

Finding cluster-specific-ish marker genes

markers.pbmc.up4 <- findMarkers(sce.pbmc, groups=sce.pbmc$cluster,

direction="up", pval.type="some")

interesting.up4 <- markers.pbmc.up4[[chosen]]

👉 For when pval.type="all" is too stringent yet pval.type="any" is too generous

🤓 Apply Holm-Bonferroni correction across P-values and take the middle-most value as the combined P-value

⚠️ You lose some theoretical guarantees offered by the other methods

For each cluster, use Welch t-tests to identify genes that are up-regulated between it and some other clusters

59 of 90

Alternative testing regimes

60 of 90

Motivation

The t-test isn't the only way to compare two groups of measurements

🤔 I want a test can be used to perfectly distinguish two clusters from one another

👉 Wilcoxon rank sum test

🤔 I want to identify genes that are expressed more often in one cluster than another

👉 Binomial test

61 of 90

Wilcoxon rank sum test

Directly assesses separation between the expression distribution of different clusters

🤓 Proportional to the area-under-the-curve (AUC), which is the probability a random cell from one cluster having higher expression than a random cell from another cluster

👉 AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions

🤓 Also known as the Wilcoxon-Mann-Whitney (WMW) test

62 of 90

Wilcoxon rank sum test

markers.pbmc.wmw <- findMarkers(sce.pbmc,

groups=sce.pbmc$cluster, test.type="wilcox",

direction="up", pval.type="any")

interesting.wmw <- markers.pbmc.wmw[[chosen]]

For each cluster, use Wilcoxon rank sum tests to identify genes that are up-regulated between it and any other cluster

scran::findMarkers()

63 of 90

Wilcoxon rank sum test

best.set <- interesting.wmw[interesting.wmw$Top <= 5,]

AUCs <- as.matrix(best.set[,-(1:3)])

colnames(AUCs) <- sub("AUC.", "", colnames(AUCs))

pheatmap(AUCs, breaks=seq(0, 1, length.out=21),

color=viridis::viridis(21))

64 of 90

Summary of Wilcoxon rank sum test

➕ Directly addresses the desirable property of a marker gene (i.e. that the gene perfectly distinguishes two clusters)

➕ Is symmetric with respect to differences in the size of the groups being compared

➖ Much slower to compared compared to t-statistics (although generally not an issue in practice)

65 of 90

Binomial test

Test identifies genes that differ in the proportion of expressing cells between clusters

A much more stringent definition of marker genes

🤓 Turns expression into a binary measure of presence/absence, so all quantitative information is ignored

From a practical perspective, this may be easier to validate

66 of 90

Binomial test

markers.pbmc.binom <- findMarkers(sce.pbmc,

groups=sce.pbmc$cluster, test.type="binom",

direction="up", pval.type="any")

interesting.binom <- markers.pbmc.binom[[chosen]]

For each cluster, use Binomial tests to identify genes that are more frequently expressed (up-regulated) in it compared to any other cluster

🤓 Effect size is reported as the log-fold change in the proportion of expressing cells between clusters

👉 Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other

scran::findMarkers()

67 of 90

Binomial test

top.genes <- head(rownames(interesting.binom))

plotExpression(sce.pbmc, x="cluster", features=top.genes)

scran::findMarkers()

68 of 90

Summary of Binomial test

The Binomial test does not care about scaling normalization

➕ Produces marker genes that may be easier to validate

➖ Increased stringency can lead to loss of good candidate markers

The disadvantage of the binomial test is that its increased stringency can lead to the loss of good candidate markers; see https://osca.bioconductor.org/marker-detection.html#using-a-binomial-test

For example, GCG is a known marker for pancreatic alpha cells but is expressed in almost every other cell of the Lawlor et al. (2017) pancreas data (Figure 11.6) and would not be highly ranked by the binomial test.

Another property of the binomial test is that it will not respond to scaling normalization. Systematic differences in library size between clusters will not be considered when computing p-values or effect sizes. This is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalization in Section 7.4.

69 of 90

Custom differential expression methods

🤔 Why not just use edgeR/DESeq2/limma-voom or custom methods (e.g., MAST)?

👉 You can!

https://osca.bioconductor.org/marker-detection.html#using-custom-de-methods

👉 But these are probably overkill for identifying marker genes

🤓 Cells are our 'replicates' for the purposes of marker gene detection

🤓 edgeR/DESeq2/limma-voom make stronger assumptions about the data that are more likely to be violated for individual cells in scRNA-seq

70 of 90

Illustrative dataset: 416B

71 of 90

Illustrative dataset: 416B

library(scRNAseq)

sce.416b <- LunSpikeInData(which="416b")

sce.416b$block <- factor(sce.416b$block)

Immortalized mouse myeloid progenitor cell line processed using SmartSeq2

A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation.

https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

72 of 90

Illustrative dataset: 416B

# gene-annotation

library(AnnotationHub)

ens.mm.v97 <- AnnotationHub()[["AH73905"]]

rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)

rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97,

keys=rownames(sce.416b),

keytype="GENEID", column="SYMBOL")

rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97,

keys=rownames(sce.416b),

keytype="GENEID", column="SEQNAME")

library(scater)

rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,

rowData(sce.416b)$SYMBOL)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

73 of 90

Illustrative dataset: 416B

# quality-control

mito <- which(rowData(sce.416b)$SEQNAME=="MT")

stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))

qc <- quickPerCellQC(stats,

percent_subsets=c("subsets_Mt_percent", "altexps_ERCC_percent"),

batch=sce.416b$block)

sce.416b <- sce.416b[,!qc$discard]

# normalization

library(scran)

sce.416b <- computeSumFactors(sce.416b)

sce.416b <- logNormCounts(sce.416b)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

74 of 90

Illustrative dataset: 416B

# variance-modelling

dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC",

block=sce.416b$block)

chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)

# batch-correction (we'll learn about this later)

library(limma)

assay(sce.416b, "corrected") <- removeBatchEffect(

logcounts(sce.416b),

design=model.matrix(~sce.416b$phenotype),

batch=sce.416b$block)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

75 of 90

Illustrative dataset: 416B

# dimensionality-reduction

sce.416b <- runPCA(sce.416b, ncomponents=10,

subset_row=chosen.hvgs,

exprs_values="corrected",

BSPARAM=BiocSingular::ExactParam())

set.seed(1010)

sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

76 of 90

Illustrative dataset: 416B

# clustering (hierarchical clustering)

my.dist <- dist(reducedDim(sce.416b, "PCA"))

my.tree <- hclust(my.dist, method="ward.D2")

library(dynamicTreeCut)

my.clusters <- unname(

cutreeDynamic(my.tree, distM=as.matrix(my.dist),

minClusterSize=10, verbose=0))

sce.416b$cluster <- factor(my.clusters)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

77 of 90

Handling blocking factors

(aka considering experimental factors)

78 of 90

Handling blocking factors

Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences) for the purposes of marker gene detection

👉These are known as blocking factors

👉If these aren't modelled, they can interfere with marker gene detection

79 of 90

Handling blocking factors

m.out <- findMarkers(sce.416b, groups=sce.416b$cluster,

block=sce.416b$block, test.type="t", direction="up")

For each cluster, account for the experimental blocking and use Welch t-tests to identify genes that are up-regulated between it and any other cluster

👉 Each pairwise comparison between clusters is performed separately in each level of the blocking factor (here, plate)

👉 Favours genes that exhibit consistent DE in the same direction in each level of the blocking factor (here, plate)

🤓 P-values combined using Stouffer's Z method to obtain a single P-value for each comparison

80 of 90

Ignoring blocking factors can lead to 'false' marker genes

81 of 90

Statistical issues

82 of 90

Invalidity of P-values

👉 All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent

🤓 'Data dredging': the DE analysis is performed using the same data used to obtain the clusters

👉 Testing for DE genes between clusters will inevitably yield some significant results as that's how the clusters were defined!

83 of 90

Invalidity of P-values

🤓 A simulation with i.i.d. normal values with k-means (k=2) clustering followed by testing for DE

84 of 90

Invalidity of P-values

👉 Even though P-values are flawed, the effect is largely harmless for marker gene detection as the P-values are used only for ranking

🤓 Can't use P-values to define 'significant differences' between clusters with respect to an error rate threshold

85 of 90

Nature of replication

👉 Ideally, validate some of the markers with an independent replicate population of cells (and ideally a different technique, e.g., fluorescent in situ hybridization or qPCR)

86 of 90

Further comments

87 of 90

Further comments

👉 The DE analysis strategy is that markers are defined relative to subpopulations in the same dataset

👉 If a gene is expressed uniformly throughout the population, it will not come up as a marker

E.g., T cell markers will not be detected if only T cells are present in the data
Usually not a problem, as we usually have some idea about what cells are captured

👉 There are fancy machine learning methods for doing marker gene detection, but the humble t-test is still great

One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset. Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells. Nonetheless, if “absolute” identification of cell types is necessary, we discuss some strategies for doing so in Chapter 12.

Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage in that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses.

88 of 90

Summary and recommendations

89 of 90

Summary and recommendations

👉 Create multiple lists of marker genes at different 'stringencies'

👉 The simplest to interpret marker genes are the 'uniquely up-regulated' or 'cluster-specific' genes, especially if we can impose a minimum log-fold change

👉 You may need to do more focused marker gene detection, e.g., subset the data to just 2 clusters of interest and run find scran::findMarkers()

90 of 90

Where we're at