Dimensionality Reduction
Content adapted from https://osca.bioconductor.org/dimensionality-reduction.html
Motivation
Motivation
The next step of a scRNA-seq analysis is typically to identify groups of 'similar' cells
E.g., clustering aims to identify cells with similar transcriptomic
profiles by computing 'distances' between cells
If we had a dataset with two genes, we could make a two-dimensional plot to identify clusters of cells
E.g., cluster 1 (gene A+, gene B+), cluster 2 (gene A-, gene B+), etc.
Could extend this idea to multiple genes, much like how FACS gating
is done
đ¤ But we have tens of thousands of genes, we need a better way
đ Dimensionality reduction
Dimensionality reduction
đ¤ Dimensionality reduction is possible because different genes are correlated if they are affected by the same biological process
đ Don't need to store separate information for individuals genes, but can instead compress multiple features into a single dimension
â Reduces computational work in downstream analyses
â Reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data
â Enable effective plotting of the data in 2 dimensions
Illustrative dataset: Zeisel
Illustrative dataset: Zeisel
library(scRNAseq)
sce.zeisel <- ZeiselBrainData(ensembl=TRUE)
Study of cell types in the mouse brain (oligodendrocytes, microglia, neurons, etc.) processed with the STRT-seq system (similar to CEL-Seq)
https://osca.bioconductor.org/zeisel-mouse-brain-strt-seq.html
Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138â1142 (2015)
Illustrative dataset: Zeisel
# Quality control
library(scater)
is.mito <- which(rowData(sce.zeisel)$featureType=="mito")
stats <- perCellQCMetrics(sce.zeisel,
subsets=list(Mt=is.mito))
qc <- quickPerCellQC(stats,
percent_subsets=c("altexps_ERCC_percent",
"subsets_Mt_percent"))
sce.zeisel <- sce.zeisel[,!qc$discard]
Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138â1142 (2015)
Illustrative dataset: Zeisel
# normalization
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel,
cluster=clusters)
sce.zeisel <- logNormCounts(sce.zeisel)
# variance-modelling
dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel,
"ERCC")
Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138â1142 (2015)
Illustrative dataset:
Unfiltered 10X PBMC4k
Illustrative dataset: Unfiltered 10X PBMC4k
library(BiocFileCache)
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
library(Matrix)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics
https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# gene-annotation
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
# cell-detection
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# quality-control
stats <- perCellQCMetrics(sce.pbmc,
subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent,
type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]
# normalization
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Illustrative dataset: Unfiltered 10X PBMC4k
# variance modelling
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)
Principal component analysis
Principal component analysis (PCA)
â PCA is a workhorse of dimensionality reduction
đ¤ PCA discovers the (linear) combinations of features that captures the largest amount of variation
đ¤ In PCA, the first linear combination ('principal component') is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is 'orthogonal' to the first and captures the greatest remaining amount of variation, and so on.
PCA applied to scRNA-seq data
đ We can perform dimensionality reduction by performing PCA on the log-counts matrix and restricting downstream analyses to the top PCs
đ This can reduce our dataset from 20,000 dimensions down to, say, 10, without losing much information
đ¤ PCA has lots of great, well-studied theoretical properties
đ¤ There are a wide range of fast ways to perform PCA on large datasets
Assumptions of PCA applied to scRNA-seq data
â ī¸ Early PCs will capture batch effects that affect multiple genes in a coordinated manner
Principal component analysis
library(scran)
top.zeisel <- getTopHVGs(dec.zeisel, n=2000)
library(scater)
set.seed(100)
sce.zeisel <- runPCA(sce.zeisel,
subset_row=top.zeisel)
scater::runPCA()
â ī¸ By default, runPCA() uses a fast approximate method that performs simulations, so need to 'set the seed' to obtain reproducible results
Choosing the number of PCs
Choosing the number of PCs
This choice is analogous to the choice of the number of HVGs
More PC will avoid discarding biological signal at the cost of retaining
more noise
đ Common to select a 'reasonable' but arbitrary number of PCs (10 - 50), continue with the analysis, but come back to check the robustness of the conclusions to a range of values
đ We'll now look at some more data-driven strategies
Using the elbow point
percent.var <- attr(reducedDim(sce.zeisel),
"percentVar")
chosen.elbow <- PCAtools::findElbowPoint(percent.var)
plot(percent.var, xlab="PC",
ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")
PCAtools::findElbowPoint()
đ A simple heuristic for choosing the number of PCs based on the 'percentage of variance explained' by successive PCs
Based on population structure
choices <- getClusteredPCs(reducedDim(sce.zeisel))
chosen.clusters <- metadata(choices)$chosen
scran::getClusteredPCs()
đ A (more sophisticated) heuristic approach that uses the number of clusters as a proxy for the number of subpopulations
đ¤ Suppose we expect d subpopulations of cells, then we need (d-1) dimensions to guarantee separation of all subpopulations
đ¤ But we don't know how many subpopulations there are
Based on population structure
scran::getClusteredPCs()
đ Try a range of d and only consider values that yield no more than d+1 clusters
đ If we select more clusters with fewer dimensions, we consider this 'overclustering'
đ Choose d that maximizes the number of clusters without 'overclustering'
â Pragmatic solution that addreses bias-variance tradeoff in downstream analysis (specifically clustering)
â Strong assumptions about nature of biological differences between clusters (and indeed the very existence of clusters, which may not exist in student of processes like differentiation)
Putting it together
set.seed(100)
# Compute and store the 'full' set of PCs
sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel)
# Can also select d and store the reduced set of PCs
# e.g., using the elbow pointīŋŊreducedDim(sce.zeisel, "PCA_elbow") <- reducedDim(
sce.zeisel, "PCA")[,1:chosen.elbow]
# e.g., based on population structure
reducedDim(sce.zeisel, "PCA_clusters") <- reducedDim(
sce.zeisel, "PCA")[,1:chosen.clusters]
Using the technical noise
library(scran)
set.seed(111001001)
denoised.pbmc <- denoisePCA(sce.pbmc,
technical=dec.pbmc, subset.row=top.pbmc)
scran::denoisenPCA()
đ¤ Retain all PCs until the %variation explained reaches some threshold (e.g., based on an estimate of technical variation)
đdenoisePCA() automatically selects the number of PCs
â ī¸ By default, denoisePCA() performs simulations, so need to 'set the seed' to obtain reproducible results
Using the technical noise
dim(reducedDim(denoised.pbmc, "PCA"))
scran::denoisenPCA()
đ¤ Dimensionality of the output is a lower bound on the number of PCs required to explain all biological variation. That is, any fewer PCs will definitely discard some aspect of the 'biological' signal.
â ī¸ This does not guarantee that the retained PCs capture all of the 'biological' signal
đ Usually retains more PCs than the elbow point method
Using the technical noise
set.seed(001001001)
denoised.zeisel <- denoisePCA(sce.zeisel,
technical=dec.zeisel, subset.row=top.zeisel)
dim(reducedDim(denoised.zeisel))
scran::denoisenPCA()
đ¤ scran::denoisenPCA() internally caps the number of PCs, by default 5 - 50, to avoid selection of too few PCs (when 'technical' noise is high relative to 'biological' noise) or too many PCs (when technical noise is very low)
đ Zeisel brain data hits the upper limit
Using the technical noise
dec.pbmc2 <- modelGeneVar(sce.pbmc)
denoised.pbmc2 <- denoisePCA(sce.pbmc,
technical=dec.pbmc2, subset.row=top.pbmc)
dim(reducedDim(denoised.pbmc2))
scran::denoisenPCA()
đ¤ scran::denoisenPCA() tends to perform best when the mean-variance relationship reflects the actual technical noise, i.e. estimated by scran::modelGeneVarByPoisson() or scran::modelGeneVarWithSpikes() instead of scran::modelGeneVar()
đ PBMC dataset hits the lower limit
Dimensionality reduction for visualization
Motivation
Clustering and other algorithms will happily operate on 10-50+ PCs, but that's still too many for visualization
đ Need further dimensionality reduction strategies for visualization
Visualizing with PCA
plotReducedDim(sce.zeisel, dimred="PCA")
scater::plotReducedDim()
Visualizing with PCA
plotReducedDim(sce.zeisel, dimred="PCA",
colour_by="level1class")
â ī¸ PCA is a linear technique and cannot efficiently pack differences in > 2 dimensions into the first 2 PCs
scater::plotReducedDim()
Challenges and sumamry of visualizing with PCA
plotReducedDim(sce.zeisel, dimred="PCA",
ncomponents=4, colour_by="level1class")
đ Workaround is to plot first few PCs against each other
đĩ Difficult to interpret
â PCA is predictable and will not introduce artificial structure
â Deterministic and robust to small changes in input values
â Usually not satisfactory for visualizing the complex nature of scRNA-seq data
Visualizing with t-SNE
set.seed(00101001101)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="TSNE",
colour_by="level1class")
đ t-stochastic neighbour embedding (t-SNE) is the de facto standard for visualization of scRNA-seq data
đ¤ Attempts to find a low-dimensional (non-linear) representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space
scater::runTSNE()
scater::plotReducedDim()
Challenges of visualizing with t-SNE
set.seed(100)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA",
perplexity=30)
plotReducedDim(sce.zeisel, dimred="TSNE",
colour_by="level1class")
â What happens if you re-run runTSNE() (without setting the seed)
â What happens if you set the seed but change the value of perplexity and re-run runTSNE()
Challenges of visualizing with t-SNE
đ¤ Low perplexity favour resolution of finer structure, possibly to the point that visualization looks like random noise
âšī¸ http://distill.pub/2016/misread-tsne/ discuss parameter choices for t-SNE in some depth
Challenges of visualizing with t-SNE
â ī¸ Don't overinterpret the t-SNE results as a 'map' of single-cell identities
â ī¸ Random component and parameter choices will change visualization
â ī¸ Interpretation is easily misled by the size and position of visual clusters
â ī¸ t-SNE inflates dense clusters and compresses sparse ones
â ī¸ t-SNE is not obliged to preserve the relative locations of non-neighboring clusters (so can't interpret 'non-local' distances)
đ t-SNE is a proven tool for general-purpose visualization of scRNA-seq data and remains very popular
Visualizing with UMAP
đ Uniform manifold approximation and project (UMAP) is an alternative to t-SNE
đ¤ Like t-SNE, UMAP attempts to find a low-dimensional (non-linear) representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space
đ¤ t-SNE and UMAP based on different mathematical theory
Visualizing with UMAP
set.seed(1100101001)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="UMAP",
colour_by="level1class")
đ Compared to t-SNE, UMAP tends to have more compact visual clusters
âšī¸ Attempts to preserve more of the global structure than t-SNE
âąī¸ Tends to be faster than t-SNE, which may be important for larger datasets, but difference disappears when applied to PCs
scater::runUMAP()
scater::plotReducedDim()
Challenges of visualizing with UMAP
set.seed(100)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA",
n_neighbors=15)
plotReducedDim(sce.zeisel, dimred="UMAP",
colour_by="level1class")
â What happens if you re-run runUMAP() (without setting the seed)?
â What happens if you set the seed but change the value of n_neighbors and re-run runUMAP()?
Challenges of visualizing with UMAP
â ī¸ Like t-SNE, need to 'set the seed' and different parameter values will change visualization
â ī¸ If n_neighbors or min_dist is too low then random noise will be treated as high-resolution structure, if too high will discard fine structure
đ Try a range of parameter values to ensure they don't compromise any conclusions drawn from a UMAP plot
Interpreting the plots
â ī¸ Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells
đ Don't overinterpret the pretty pictures
Summary and recommendations
Summary and recommendations
đ t-SNE and UMAP plots are useful diagnostics, e.g., checking whether two clusters are actually neighboring subclusters or whether a cluster can be split into further clusters
It is arguable whether t-SNE or UMAP visualizations are more useful or aesthetically pleasing
đ It's okay to pick the one that works better for your analysis (provided you treat the plot as a visualization/diagnostic and don't make any strong interpretations solely on the basis of the plot itself)
Where we're at