Feature Selection
Content adapted from https://osca.bioconductor.org/feature-selection.html
Motivation
Motivation
We often use scRNA-seq data to characterize heterogeneity across cells
To do this, we use methods like clustering and dimensionality reduction
These involve summarising per-gene differences into a single (dis)similarity metric between a pair of cells
Which features genes should we use to compute this (dis)similarity metric?
Feature (gene) selection
The choice of features has a major impact on how similar we judge cells to be
➕ Features that contain useful information about the biology
➖ Features that contain random noise
👉 Side effect of reducing dimensionality of data
Roughly, we want to select highly variable genes (HVGs)
Genes with an increased variation compared to other genes
that are only affected by technical noise or other 'uninteresting'
biological variation
Illustrative dataset:
Unfiltered 10X PBMC4k
Illustrative dataset: Unfiltered 10X PBMC4k
library(BiocFileCache)
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
library(Matrix)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics
https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# gene-annotation
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
# cell-detection
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# quality-control
stats <- perCellQCMetrics(sce.pbmc,
subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent,
type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]
# normalization
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: 416B
Illustrative dataset: 416B
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)
Immortalized mouse myeloid progenitor cell line processed using SmartSeq2
A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation.
https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# gene-annotation
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97,
keys=rownames(sce.416b),
keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97,
keys=rownames(sce.416b),
keytype="GENEID", column="SEQNAME")
library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
rowData(sce.416b)$SYMBOL)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Illustrative dataset: 416B
# quality-control
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats,
percent_subsets=c("subsets_Mt_percent", "altexps_ERCC_percent"),
batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]
# normalization
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017)
Quantifying per-gene variance
Variance of the log-counts
Simplest approach to quantifying per-feature variation is to simply compute the variance of the log-counts
➕ Feature selection based on log-counts (which are
used in downstream analyses)
⚠️ Log-transformation does not achieve perfect variance
stabilization, so need to model mean-variance
relationship of features
Variance of the log-counts
Simple (naive) approach:
More sophisticated (better) approach:
Variance of the log-counts
library(scran)
dec.pbmc <- modelGeneVar(sce.pbmc)
🤓 Assumption is that at any given abundance the expression profiles of most genes are dominated by random 'technical' noise
🤓 Therefore, trend represents an estimate of the technical noise as a function of abundance
🤓 Can then break down the total variance of each gene into a 'technical' and a 'biological' component
🤓 Genes with a large 'biological' variance are considered interesting
scran::modelGeneVar()
Variance of the log-counts
# Visualizing the fit
fit.pbmc <- metadata(dec.pbmc)
plot(fit.pbmc$mean, fit.pbmc$var,
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit.pbmc$trend(x), col="dodgerblue",
add=TRUE, lwd=2)
scran::modelGeneVar()
Variance of the log-counts
# Ordering by largest 'biological' variation to # identify most interesting genes
dec.pbmc[order(dec.pbmc$bio, decreasing=TRUE),]
🤓 You might notice genes with negative 'biological' components. These can be thought of as genes without variation greater than the estimated technical variation and so can be ignored in most applications
Coefficient of variation of the counts
The squared coefficient of variation (CV2) of the counts is an alternative to variance of log-counts
👉 Computed using the counts rather than log-counts
🤓 CV is the ratio of the standard deviation to the mean and is closely related to the 'dispersion' parameter of the negative binomial distribution used in edgeR and DESeq2
Coefficient of variation
dec.cv2.pbmc <- modelGeneCV2(sce.pbmc)
🤓 Models the mean-variance relationship when considering the relevance of each gene
🤓 Again, assumes that most genes contain random noise and that the trend captures mostly technical variation
🤓 Genes with large CV2 that deviate strongly from the trend are likely to represent genes affected by biological structure
🤓 Use the ratio (rather than difference) of CV2 to the trend
scran::modelGeneCV2()
Coefficient of variation
# Visualizing the fit
fit.cv2.pbmc <- metadata(dec.cv2.pbmc)
plot(fit.cv2.pbmc$mean, fit.cv2.pbmc$cv2,
log="xy")
curve(fit.cv2.pbmc$trend(x), col="dodgerblue",
add=TRUE, lwd=2)
scran::modelGeneCV2()
Coefficient of variation
# Ordering by largest CV2 to identify most interesting genes
dec.cv2.pbmc[order(dec.cv2.pbmc$ratio,
decreasing=TRUE),]
scran::modelGeneCV2()
Variance of log-counts vs. coefficient of variation
Generally prefer the use of the variance of log-counts
Both are effective metrics for quantifying variation in gene expression
CV2 tends to give higher rank to low-abundance HVGs
These are driven by upregulation in rare subpopulations
Can assign a high rank to uninteresting genes with low absolute
variance
The variation described by CV2 of the counts is less directly relevant to downstream procedures (which operate on the log-counts)
Quantifying technical noise
Quantifying technical noise
Previously, we fit a trend to all endogenous genes and assumed that most genes are dominated by random technical noise
In practice, all expressed genes have some non-zero level of
biological variability (e.g., due to transcriptional bursting)
This suggests our estimates of the technical components are likely to be inflated
Quantifying technical noise
👉 Better to think of these estimates as 'technical' variation plus 'uninteresting' biological variation
🤔 But what if a group of genes at a particular abundance affected by a biological process?
E.g., strong upregulation of cell type-specific genes could lead to an
enrichment of HVGs at high abundances. This would inflate the trend
and compromise detection of the relevant genes
How can we avoid this problem?
How can we avoid this problem and better estimate technical variation?
We'll go through two approaches:
In the presence of spike-ins
dec.spike.416b <- modelGeneVarWithSpikes(sce.416b,
"ERCC")
# Ordering by most interesting genes for
# inspection.
dec.spike.416b[order(dec.spike.416b$bio,
decreasing=TRUE),]
👉 Fit the trend to just the spike-ins (which should only be affected by 'technical' variation)
scran::modelGeneVarWithSpikes()
In the presence of spike-ins
plot(dec.spike.416b$mean, dec.spike.416b$total,
xlab="Mean of log-expression",
ylab="Variance of log-expression")
fit.spike.416b <- metadata(dec.spike.416b)
points(fit.spike.416b$mean, fit.spike.416b$var,
col="red", pch=16)
curve(fit.spike.416b$trend(x), col="dodgerblue",
add=TRUE, lwd=2)
scran::modelGeneVarWithSpikes()
In the absence of spike-ins
set.seed(0010101)
dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc)
# Ordering by most interesting genes for inspection.
dec.pois.pbmc[order(dec.pois.pbmc$bio,decreasing=TRUE),]
scran::modelGeneVarByPoisson()
👉 Make some statistical assumptions about the noise
🤓 UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing
⚠️ modelGeneVarByPoisson() performs simulations, so need to 'set the seed' to obtain reproducible results
🤓 modelGeneVarByPoisson() can also simulate negative binomial variation (overdispersed Poisson variation)
In the absence of spike-ins
plot(dec.pois.pbmc$mean, dec.pois.pbmc$total,
pch=16, xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(metadata(dec.pois.pbmc)$trend(x),
col="dodgerblue", add=TRUE)
scran::modelGeneVarByPoisson()
🤓 Trends based purely on technical noise tend to yield large 'biological' components for highly-expressed genes, which often includes 'house-keeping' genes.
🤔 Need to consider if such genes are 'interesting' or 'uninteresting' for your dataset
Considering experimental factors
Considering experimental factors
Data containing multiple batches will often exhibit batch effects that create 'artificial' HVGs
We want to a way to identify HVGs in each batch and then combine these into a single list of HVGs
Considering experimental factors
dec.block.416b <- modelGeneVarWithSpikes(sce.416b,
"ERCC", block=sce.416b$block)
dec.block.416b[order(
dec.block.416b$bio, decreasing=TRUE),]
👉 Batch-specific trends accommodate differences in mean-variance trends between batches
🤓 Get batch-specific estimates of biological and technical components for each gene, which are then averaged across batches to create a single list of HVGs
Considering experimental factors
Selecting highly-variable genes (HVGs)
Selecting highly-variable genes (HVGs)
We've now ranked genes from most-to-least 'interestingly' variable
🤔How far down the list should we go to construct our HVGs?
Choosing a larger subset
➕ Reduces risk of discarding biological signal
➖ Increases noise from inclusion of irrelevant genes
⚠️ Difficult to determine the optimal trade-off because noise in one context may be useful signal in another
👉 We'll discuss a few common strategies for selecting HVGs
Selecting HVGs on the largest metrics
Simplest strategy is take the top-X genes with the largest values for the relevant variance metric
E.g., largest 'biological' variance from scran::modelGeneVar()
➕ User can directly control the number of HVGs
➖ What value of X to choose?
scran::getTopHVGs()
Selecting HVGs on the largest metrics
# Works with modelGeneVar() output
hvg.pbmc.var <- getTopHVGs(dec.pbmc, n=1000)
str(hvg.pbmc.var)
# Works with modelGeneVarWithSpikes() output
hvg.416b.var <- getTopHVGs(dec.spike.416b, n=1000)
str(hvg.416b.var)
# Also works with modelGeneCV2() but note `var.field`
hvg.pbmc.cv2 <- getTopHVGs(dec.cv2.pbmc,
var.field="ratio", n=1000)
str(hvg.pbmc.cv2)
scran::getTopHVGs()
Strategies for choosing X
Assume that, say, no more than 5% of genes are differentially expressed between cell types in our population
👉 Set X to 5% of the number of genes
🤔 We don't normally know the % of DE genes beforehand, so we've just changed one arbitrary number for another
➖ Turns HVG selection into a competition between genes. E.g., the marker genes for a rare subpopulation will be dominated by marker genes from the other populations
scran::getTopHVGs()
Strategies for choosing X
👉 If using top-X HVGs, choose a value of X and proceed with the rest of the analysis, with the intention of testing other choices later, rather than agonizing over the 'optimal' value
scran::getTopHVGs()
Selecting HVGs on statistical significance
Set a fix threshold of one of the metrics
E.g., some of the method we used report a P-value for each gene, so
select all genes with an (adjusted) P-value less than 0.05
🤓 Statistical test may not 'hold its size'
➕ Simple to implement
➖ Less predictable than top-X strategy
➖ May prioritise genes with strong statistical significance over strong biological significance
scran::getTopHVGs()
Selecting HVGs on statistical significance
# Works with modelGeneVar() output
hvg.pbmc.var.2 <- getTopHVGs(dec.pbmc, fdr.threshold=0.05)
str(hvg.pbmc.var.2)
# Works with modelGeneVarWithSpikes() output
hvg.416b.var.2 <- getTopHVGs(dec.spike.416b,
fdr.threshold=0.05)
str(hvg.416b.var.2)
# Also works with modelGeneCV2() but note `var.field`
hvg.pbmc.cv2.2 <- getTopHVGs(dec.cv2.pbmc,
var.field="ratio", fdr.threshold=0.05)
str(hvg.pbmc.cv2.2)
scran::getTopHVGs()
Selecting genes above the trend as HVGs
Keep all genes with a positive 'biological' variance
🤓 One extreme of the bias-variance trade-off that minimizes bias at the cost of maximizing noise
➕ Gives secondary population structure a chance ot manifest
➖ More noise is also captured, which can reduced resolution of otherwise well-separated populations and mask the very secondary signal we were trying to preserve
👉 Best suited for highly heterogenous datasets containing many different cell types
scran::getTopHVGs()
Selecting genes above the trend as HVGs
# Works with modelGeneVar() output
hvg.pbmc.var.3 <- getTopHVGs(dec.pbmc, var.threshold=0)
str(hvg.pbmc.var.3)
# Works with modelGeneVarWithSpikes() output
hvg.416b.var.3 <- getTopHVGs(dec.spike.416b,
var.threshold=0)
str(hvg.416b.var.3)
# Also works with modelGeneCV2() but note `var.field` and
# value of `var.threshold`
hvg.pbmc.cv2.3 <- getTopHVGs(dec.cv2.pbmc,
var.field="ratio", var.threshold=1)
str(hvg.pbmc.cv2.2)
scran::getTopHVGs()
Selecting a priori genes of interest
Selecting a priori genes of interest
A blunt yet effective strategy is to use pre-defined sets of interesting genes
E.g., Immunological signatures for MSigDB
There's no shame in leveraging prior biological knowledge!
See the common desire for an 'unbiased' analysis of genomics data
➖ Limits our capacity to discover novel or unexpected aspects of variation
👉 Consider this complementary to the other HVG selection strategies
Selecting a priori genes of interest
👉 Might flip this around and instead remove pre-defined genes sets
E.g., ribosomal protein genes and mitochondrial genes are
notorious for being amongst the most highly variable genes
and 'interfering' in downstream analyses
👉 Err on the side of caution; wait until such genes are demonstrably problemetic before removing them
Putting it all together
Putting it all together
dec.pbmc <- modelGeneVar(sce.pbmc)
chosen <- getTopHVGs(dec.pbmc, prop=0.1)
str(chosen)
We then have several options to enforce our HVG selection on the rest of the analysis:
Subsetting to just the HVGs
sce.pbmc.hvg <- sce.pbmc[chosen,]
sce.pbmc.hvg
➕ Ensures that downstream methods will only use these genes for their calculations
➖ Non-HVGs are discarded from the new SingleCellExperiment, making it less convenient to interrogate the full dataset for interesting genes that are not HVGs
Specifying HVGs in downstream functions
# Example of specifying HVGs in a downstream function
# Performing PCA only on the chosen HVGs.
library(scater)
sce.pbmc <- runPCA(sce.pbmc, subset_row=chosen)
sce.pbmc
Keep the original SingleCellExperiment object and specify the genes to use for downstream functions via an extra argument like subset_row
➕ Useful if the analysis uses multiple sets of HVGs at different steps
➖ May be inconvenient to repeatedly specify the same HVG set across steps
👉Most functions in our workflow support this option (but others may not)
subset_row
Witchcraft
# Add the full SCE to the subsetted data SCE
altExp(sce.pbmc.hvg, "original") <- sce.pbmc
sce.pbmc.hvg
altExp(sce.pbmc.hvg, "original")
➕ Have our cake and eat it too by (ab)using the "alternative experiment" system in the SingleCellExperiment class
➕ Avoids book-keeping problems in long analyses when the original dataset is not synchronized with the HVG subsetted data
➖ Turtles all the way down
SingelCellExperiment::altExp()
Summary and recommendations
Summary and recommendations
Easy to get stuck on feature selection due to the many options available
Choose a set of HVGs and move on with the analysis, but come back to test the robustness of your results using a different approach to selecting HVGs
If you have spike-ins, try to make use of them. But be aware that spike-ins cannot capture all sources of 'technical' variation
What I usually start with
For CEL-Seq2:
For 10X:
�I tend to err on the side of keeping more genes:
Perhaps do a quick comparison to taking, say, the top-1000 HVGs
Return to HVG selection to remove 'problematic' genes as necessary
Where we're at