1 of 57

Feature Selection

2 of 57

Motivation

3 of 57

Motivation

We often use scRNA-seq data to characterize heterogeneity across cells

To do this, we use methods like clustering and dimensionality reduction

These involve summarising per-gene differences into a single (dis)similarity metric between a pair of cells

Which features genes should we use to compute this (dis)similarity metric?

4 of 57

Feature (gene) selection

The choice of features has a major impact on how similar we judge cells to be

➕ Features that contain useful information about the biology

➖ Features that contain random noise

👉 Side effect of reducing dimensionality of data

Roughly, we want to select highly variable genes (HVGs)

Genes with an increased variation compared to other genes

that are only affected by technical noise or other 'uninteresting'

biological variation

5 of 57

Illustrative dataset:

Unfiltered 10X PBMC4k

6 of 57

Illustrative dataset: Unfiltered 10X PBMC4k

library(BiocFileCache)

bfc <- BiocFileCache()

raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",

"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))

untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)

library(Matrix)

fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")

sce.pbmc <- read10xCounts(fname, col.names=TRUE)

Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics

https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

7 of 57

Illustrative dataset: Unfiltered 10X PBMC4k

# gene-annotation

library(scater)

rownames(sce.pbmc) <- uniquifyFeatureNames(

rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)

location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,

column="SEQNAME", keytype="GENEID")

# cell-detection

set.seed(100)

e.out <- emptyDrops(counts(sce.pbmc))

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

8 of 57

Illustrative dataset: Unfiltered 10X PBMC4k

# quality-control

stats <- perCellQCMetrics(sce.pbmc,

subsets=list(Mito=which(location=="MT")))

high.mito <- isOutlier(stats$subsets_Mito_percent,

type="higher")

sce.pbmc <- sce.pbmc[,!high.mito]

# normalization

library(scran)

set.seed(1000)

clusters <- quickCluster(sce.pbmc)

sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)

sce.pbmc <- logNormCounts(sce.pbmc)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

9 of 57

Illustrative dataset: 416B

10 of 57

Illustrative dataset: 416B

library(scRNAseq)

sce.416b <- LunSpikeInData(which="416b")

sce.416b$block <- factor(sce.416b$block)

Immortalized mouse myeloid progenitor cell line processed using SmartSeq2

A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation.

https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

11 of 57

Illustrative dataset: 416B

# gene-annotation

library(AnnotationHub)

ens.mm.v97 <- AnnotationHub()[["AH73905"]]

rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)

rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97,

keys=rownames(sce.416b),

keytype="GENEID", column="SYMBOL")

rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97,

keys=rownames(sce.416b),

keytype="GENEID", column="SEQNAME")

library(scater)

rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,

rowData(sce.416b)$SYMBOL)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

12 of 57

Illustrative dataset: 416B

# quality-control

mito <- which(rowData(sce.416b)$SEQNAME=="MT")

stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))

qc <- quickPerCellQC(stats,

percent_subsets=c("subsets_Mt_percent", "altexps_ERCC_percent"),

batch=sce.416b$block)

sce.416b <- sce.416b[,!qc$discard]

# normalization

library(scran)

sce.416b <- computeSumFactors(sce.416b)

sce.416b <- logNormCounts(sce.416b)

Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)

13 of 57

Quantifying per-gene variance

14 of 57

Variance of the log-counts

Simplest approach to quantifying per-feature variation is to simply compute the variance of the log-counts

➕ Feature selection based on log-counts (which are

used in downstream analyses)

⚠️ Log-transformation does not achieve perfect variance

stabilization, so need to model mean-variance

relationship of features

15 of 57

Variance of the log-counts

Simple (naive) approach:

  1. Compute variance of log-counts for each gene (ignoring experimental groups)
  2. Order genes from most-to-least variable

More sophisticated (better) approach:

  1. Compuate variance of log-counts for each gene (ignoring experimental groups)
  2. Model the mean-variance relationship of log-counts to estimate the 'technical' variation
  3. Estimate the 'biological' variance by subtracting the 'technical' variance from the total variance
  4. Order genes from most-to-least biologically variable

16 of 57

Variance of the log-counts

library(scran)

dec.pbmc <- modelGeneVar(sce.pbmc)

🤓 Assumption is that at any given abundance the expression profiles of most genes are dominated by random 'technical' noise

🤓 Therefore, trend represents an estimate of the technical noise as a function of abundance

🤓 Can then break down the total variance of each gene into a 'technical' and a 'biological' component

🤓 Genes with a large 'biological' variance are considered interesting

scran::modelGeneVar()

17 of 57

Variance of the log-counts

# Visualizing the fit

fit.pbmc <- metadata(dec.pbmc)

plot(fit.pbmc$mean, fit.pbmc$var,

xlab="Mean of log-expression",

ylab="Variance of log-expression")

curve(fit.pbmc$trend(x), col="dodgerblue",

add=TRUE, lwd=2)

scran::modelGeneVar()

18 of 57

Variance of the log-counts

# Ordering by largest 'biological' variation to # identify most interesting genes

dec.pbmc[order(dec.pbmc$bio, decreasing=TRUE),]

🤓 You might notice genes with negative 'biological' components. These can be thought of as genes without variation greater than the estimated technical variation and so can be ignored in most applications

19 of 57

Coefficient of variation of the counts

The squared coefficient of variation (CV2) of the counts is an alternative to variance of log-counts

👉 Computed using the counts rather than log-counts

🤓 CV is the ratio of the standard deviation to the mean and is closely related to the 'dispersion' parameter of the negative binomial distribution used in edgeR and DESeq2

20 of 57

Coefficient of variation

dec.cv2.pbmc <- modelGeneCV2(sce.pbmc)

🤓 Models the mean-variance relationship when considering the relevance of each gene

🤓 Again, assumes that most genes contain random noise and that the trend captures mostly technical variation

🤓 Genes with large CV2 that deviate strongly from the trend are likely to represent genes affected by biological structure

🤓 Use the ratio (rather than difference) of CV2 to the trend

scran::modelGeneCV2()

21 of 57

Coefficient of variation

# Visualizing the fit

fit.cv2.pbmc <- metadata(dec.cv2.pbmc)

plot(fit.cv2.pbmc$mean, fit.cv2.pbmc$cv2,

log="xy")

curve(fit.cv2.pbmc$trend(x), col="dodgerblue",

add=TRUE, lwd=2)

scran::modelGeneCV2()

22 of 57

Coefficient of variation

# Ordering by largest CV2 to identify most interesting genes

dec.cv2.pbmc[order(dec.cv2.pbmc$ratio,

decreasing=TRUE),]

scran::modelGeneCV2()

23 of 57

Variance of log-counts vs. coefficient of variation

Generally prefer the use of the variance of log-counts

Both are effective metrics for quantifying variation in gene expression

CV2 tends to give higher rank to low-abundance HVGs

These are driven by upregulation in rare subpopulations

Can assign a high rank to uninteresting genes with low absolute

variance

The variation described by CV2 of the counts is less directly relevant to downstream procedures (which operate on the log-counts)

24 of 57

Quantifying technical noise

25 of 57

Quantifying technical noise

Previously, we fit a trend to all endogenous genes and assumed that most genes are dominated by random technical noise

In practice, all expressed genes have some non-zero level of

biological variability (e.g., due to transcriptional bursting)

This suggests our estimates of the technical components are likely to be inflated

26 of 57

Quantifying technical noise

👉 Better to think of these estimates as 'technical' variation plus 'uninteresting' biological variation

🤔 But what if a group of genes at a particular abundance affected by a biological process?

E.g., strong upregulation of cell type-specific genes could lead to an

enrichment of HVGs at high abundances. This would inflate the trend

and compromise detection of the relevant genes

How can we avoid this problem?

27 of 57

How can we avoid this problem and better estimate technical variation?

We'll go through two approaches:

  1. When we have spike-ins
  2. When we don't have spike-ins

28 of 57

In the presence of spike-ins

dec.spike.416b <- modelGeneVarWithSpikes(sce.416b,

"ERCC")

# Ordering by most interesting genes for

# inspection.

dec.spike.416b[order(dec.spike.416b$bio,

decreasing=TRUE),]

👉 Fit the trend to just the spike-ins (which should only be affected by 'technical' variation)

scran::modelGeneVarWithSpikes()

29 of 57

In the presence of spike-ins

plot(dec.spike.416b$mean, dec.spike.416b$total,

xlab="Mean of log-expression",

ylab="Variance of log-expression")

fit.spike.416b <- metadata(dec.spike.416b)

points(fit.spike.416b$mean, fit.spike.416b$var,

col="red", pch=16)

curve(fit.spike.416b$trend(x), col="dodgerblue",

add=TRUE, lwd=2)

scran::modelGeneVarWithSpikes()

30 of 57

In the absence of spike-ins

set.seed(0010101)

dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc)

# Ordering by most interesting genes for inspection.

dec.pois.pbmc[order(dec.pois.pbmc$bio,decreasing=TRUE),]

scran::modelGeneVarByPoisson()

👉 Make some statistical assumptions about the noise

🤓 UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing

⚠️ modelGeneVarByPoisson() performs simulations, so need to 'set the seed' to obtain reproducible results

🤓 modelGeneVarByPoisson() can also simulate negative binomial variation (overdispersed Poisson variation)

31 of 57

In the absence of spike-ins

plot(dec.pois.pbmc$mean, dec.pois.pbmc$total,

pch=16, xlab="Mean of log-expression",

ylab="Variance of log-expression")

curve(metadata(dec.pois.pbmc)$trend(x),

col="dodgerblue", add=TRUE)

scran::modelGeneVarByPoisson()

🤓 Trends based purely on technical noise tend to yield large 'biological' components for highly-expressed genes, which often includes 'house-keeping' genes.

🤔 Need to consider if such genes are 'interesting' or 'uninteresting' for your dataset

32 of 57

Considering experimental factors

33 of 57

Considering experimental factors

Data containing multiple batches will often exhibit batch effects that create 'artificial' HVGs

We want to a way to identify HVGs in each batch and then combine these into a single list of HVGs

34 of 57

Considering experimental factors

dec.block.416b <- modelGeneVarWithSpikes(sce.416b,

"ERCC", block=sce.416b$block)

dec.block.416b[order(

dec.block.416b$bio, decreasing=TRUE),]

👉 Batch-specific trends accommodate differences in mean-variance trends between batches

🤓 Get batch-specific estimates of biological and technical components for each gene, which are then averaged across batches to create a single list of HVGs

35 of 57

Considering experimental factors

36 of 57

Selecting highly-variable genes (HVGs)

37 of 57

Selecting highly-variable genes (HVGs)

We've now ranked genes from most-to-least 'interestingly' variable

🤔How far down the list should we go to construct our HVGs?

Choosing a larger subset

➕ Reduces risk of discarding biological signal

➖ Increases noise from inclusion of irrelevant genes

⚠️ Difficult to determine the optimal trade-off because noise in one context may be useful signal in another

👉 We'll discuss a few common strategies for selecting HVGs

38 of 57

Selecting HVGs on the largest metrics

Simplest strategy is take the top-X genes with the largest values for the relevant variance metric

E.g., largest 'biological' variance from scran::modelGeneVar()

➕ User can directly control the number of HVGs

➖ What value of X to choose?

scran::getTopHVGs()

39 of 57

Selecting HVGs on the largest metrics

# Works with modelGeneVar() output

hvg.pbmc.var <- getTopHVGs(dec.pbmc, n=1000)

str(hvg.pbmc.var)

# Works with modelGeneVarWithSpikes() output

hvg.416b.var <- getTopHVGs(dec.spike.416b, n=1000)

str(hvg.416b.var)

# Also works with modelGeneCV2() but note `var.field`

hvg.pbmc.cv2 <- getTopHVGs(dec.cv2.pbmc,

var.field="ratio", n=1000)

str(hvg.pbmc.cv2)

scran::getTopHVGs()

40 of 57

Strategies for choosing X

Assume that, say, no more than 5% of genes are differentially expressed between cell types in our population

👉 Set X to 5% of the number of genes

🤔 We don't normally know the % of DE genes beforehand, so we've just changed one arbitrary number for another

➖ Turns HVG selection into a competition between genes. E.g., the marker genes for a rare subpopulation will be dominated by marker genes from the other populations

scran::getTopHVGs()

41 of 57

Strategies for choosing X

👉 If using top-X HVGs, choose a value of X and proceed with the rest of the analysis, with the intention of testing other choices later, rather than agonizing over the 'optimal' value

scran::getTopHVGs()

42 of 57

Selecting HVGs on statistical significance

Set a fix threshold of one of the metrics

E.g., some of the method we used report a P-value for each gene, so

select all genes with an (adjusted) P-value less than 0.05

🤓 Statistical test may not 'hold its size'

➕ Simple to implement

➖ Less predictable than top-X strategy

➖ May prioritise genes with strong statistical significance over strong biological significance

scran::getTopHVGs()

43 of 57

Selecting HVGs on statistical significance

# Works with modelGeneVar() output

hvg.pbmc.var.2 <- getTopHVGs(dec.pbmc, fdr.threshold=0.05)

str(hvg.pbmc.var.2)

# Works with modelGeneVarWithSpikes() output

hvg.416b.var.2 <- getTopHVGs(dec.spike.416b,

fdr.threshold=0.05)

str(hvg.416b.var.2)

# Also works with modelGeneCV2() but note `var.field`

hvg.pbmc.cv2.2 <- getTopHVGs(dec.cv2.pbmc,

var.field="ratio", fdr.threshold=0.05)

str(hvg.pbmc.cv2.2)

scran::getTopHVGs()

44 of 57

Selecting genes above the trend as HVGs

Keep all genes with a positive 'biological' variance

🤓 One extreme of the bias-variance trade-off that minimizes bias at the cost of maximizing noise

➕ Gives secondary population structure a chance ot manifest

➖ More noise is also captured, which can reduced resolution of otherwise well-separated populations and mask the very secondary signal we were trying to preserve

👉 Best suited for highly heterogenous datasets containing many different cell types

scran::getTopHVGs()

45 of 57

Selecting genes above the trend as HVGs

# Works with modelGeneVar() output

hvg.pbmc.var.3 <- getTopHVGs(dec.pbmc, var.threshold=0)

str(hvg.pbmc.var.3)

# Works with modelGeneVarWithSpikes() output

hvg.416b.var.3 <- getTopHVGs(dec.spike.416b,

var.threshold=0)

str(hvg.416b.var.3)

# Also works with modelGeneCV2() but note `var.field` and

# value of `var.threshold`

hvg.pbmc.cv2.3 <- getTopHVGs(dec.cv2.pbmc,

var.field="ratio", var.threshold=1)

str(hvg.pbmc.cv2.2)

scran::getTopHVGs()

46 of 57

Selecting a priori genes of interest

47 of 57

Selecting a priori genes of interest

A blunt yet effective strategy is to use pre-defined sets of interesting genes

E.g., Immunological signatures for MSigDB

There's no shame in leveraging prior biological knowledge!

See the common desire for an 'unbiased' analysis of genomics data

➖ Limits our capacity to discover novel or unexpected aspects of variation

👉 Consider this complementary to the other HVG selection strategies

48 of 57

Selecting a priori genes of interest

👉 Might flip this around and instead remove pre-defined genes sets

E.g., ribosomal protein genes and mitochondrial genes are

notorious for being amongst the most highly variable genes

and 'interfering' in downstream analyses

👉 Err on the side of caution; wait until such genes are demonstrably problemetic before removing them

49 of 57

Putting it all together

50 of 57

Putting it all together

dec.pbmc <- modelGeneVar(sce.pbmc)

chosen <- getTopHVGs(dec.pbmc, prop=0.1)

str(chosen)

We then have several options to enforce our HVG selection on the rest of the analysis:

  1. Subsetting SCE to just the HVGs
  2. Specifying HVGs in downstream functions
  3. Witchcraft (altExps)

51 of 57

Subsetting to just the HVGs

sce.pbmc.hvg <- sce.pbmc[chosen,]

sce.pbmc.hvg

Ensures that downstream methods will only use these genes for their calculations

Non-HVGs are discarded from the new SingleCellExperiment, making it less convenient to interrogate the full dataset for interesting genes that are not HVGs

52 of 57

Specifying HVGs in downstream functions

# Example of specifying HVGs in a downstream function

# Performing PCA only on the chosen HVGs.

library(scater)

sce.pbmc <- runPCA(sce.pbmc, subset_row=chosen)

sce.pbmc

Keep the original SingleCellExperiment object and specify the genes to use for downstream functions via an extra argument like subset_row

Useful if the analysis uses multiple sets of HVGs at different steps

May be inconvenient to repeatedly specify the same HVG set across steps

👉Most functions in our workflow support this option (but others may not)

subset_row

53 of 57

Witchcraft

# Add the full SCE to the subsetted data SCE

altExp(sce.pbmc.hvg, "original") <- sce.pbmc

sce.pbmc.hvg

altExp(sce.pbmc.hvg, "original")

Have our cake and eat it too by (ab)using the "alternative experiment" system in the SingleCellExperiment class

Avoids book-keeping problems in long analyses when the original dataset is not synchronized with the HVG subsetted data

Turtles all the way down

SingelCellExperiment::altExp()

54 of 57

Summary and recommendations

55 of 57

Summary and recommendations

Easy to get stuck on feature selection due to the many options available

Choose a set of HVGs and move on with the analysis, but come back to test the robustness of your results using a different approach to selecting HVGs

If you have spike-ins, try to make use of them. But be aware that spike-ins cannot capture all sources of 'technical' variation

56 of 57

What I usually start with

For CEL-Seq2:

  • scran::modelGeneVarWithSpikes()

For 10X:

  • scran::modelGeneVarByPoisson()

�I tend to err on the side of keeping more genes:

  • scran::getTopHVGs(dec, var.threshold=0)

Perhaps do a quick comparison to taking, say, the top-1000 HVGs

Return to HVG selection to remove 'problematic' genes as necessary

57 of 57

Where we're at