1 of 45

Dimensionality Reduction

2 of 45

Motivation

3 of 45

Motivation

The next step of a scRNA-seq analysis is typically to identify groups of 'similar' cells

E.g., clustering aims to identify cells with similar transcriptomic

profiles by computing 'distances' between cells

If we had a dataset with two genes, we could make a two-dimensional plot to identify clusters of cells

E.g., cluster 1 (gene A+, gene B+), cluster 2 (gene A-, gene B+), etc.

Could extend this idea to multiple genes, much like how FACS gating

is done

🤔 But we have tens of thousands of genes, we need a better way

👉 Dimensionality reduction

4 of 45

Dimensionality reduction

🤓 Dimensionality reduction is possible because different genes are correlated if they are affected by the same biological process

👉 Don't need to store separate information for individuals genes, but can instead compress multiple features into a single dimension

➕ Reduces computational work in downstream analyses

➕ Reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data

➕ Enable effective plotting of the data in 2 dimensions

5 of 45

Illustrative dataset: Zeisel

6 of 45

Illustrative dataset: Zeisel

library(scRNAseq)

sce.zeisel <- ZeiselBrainData(ensembl=TRUE)

Study of cell types in the mouse brain (oligodendrocytes, microglia, neurons, etc.) processed with the STRT-seq system (similar to CEL-Seq)

https://osca.bioconductor.org/zeisel-mouse-brain-strt-seq.html

Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015)

7 of 45

Illustrative dataset: Zeisel

# Quality control

library(scater)

is.mito <- which(rowData(sce.zeisel)$featureType=="mito")

stats <- perCellQCMetrics(sce.zeisel,

subsets=list(Mt=is.mito))

qc <- quickPerCellQC(stats,

percent_subsets=c("altexps_ERCC_percent",

"subsets_Mt_percent"))

sce.zeisel <- sce.zeisel[,!qc$discard]

Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015)

8 of 45

Illustrative dataset: Zeisel

# normalization

library(scran)

set.seed(1000)

clusters <- quickCluster(sce.zeisel)

sce.zeisel <- computeSumFactors(sce.zeisel,

cluster=clusters)

sce.zeisel <- logNormCounts(sce.zeisel)

# variance-modelling

dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel,

"ERCC")

Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015)

9 of 45

Illustrative dataset:

Unfiltered 10X PBMC4k

10 of 45

Illustrative dataset: Unfiltered 10X PBMC4k

library(BiocFileCache)

bfc <- BiocFileCache()

raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",

"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))

untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)

library(Matrix)

fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")

sce.pbmc <- read10xCounts(fname, col.names=TRUE)

Human peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics

https://osca.bioconductor.org/unfiltered-human-pbmcs-10x-genomics.html

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

11 of 45

Illustrative dataset: Unfiltered 10X PBMC4k

# gene-annotation

library(scater)

rownames(sce.pbmc) <- uniquifyFeatureNames(

rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)

location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,

column="SEQNAME", keytype="GENEID")

# cell-detection

set.seed(100)

e.out <- emptyDrops(counts(sce.pbmc))

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

12 of 45

Illustrative dataset: Unfiltered 10X PBMC4k

# quality-control

stats <- perCellQCMetrics(sce.pbmc,

subsets=list(Mito=which(location=="MT")))

high.mito <- isOutlier(stats$subsets_Mito_percent,

type="higher")

sce.pbmc <- sce.pbmc[,!high.mito]

# normalization

library(scran)

set.seed(1000)

clusters <- quickCluster(sce.pbmc)

sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)

sce.pbmc <- logNormCounts(sce.pbmc)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

13 of 45

Illustrative dataset: Unfiltered 10X PBMC4k

# variance modelling

set.seed(1001)

dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)

top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)

14 of 45

Principal component analysis

15 of 45

Principal component analysis (PCA)

⭐ PCA is a workhorse of dimensionality reduction

🤓 PCA discovers the (linear) combinations of features that captures the largest amount of variation

🤓 In PCA, the first linear combination ('principal component') is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is 'orthogonal' to the first and captures the greatest remaining amount of variation, and so on.

16 of 45

PCA applied to scRNA-seq data

👉 We can perform dimensionality reduction by performing PCA on the log-counts matrix and restricting downstream analyses to the top PCs

👉 This can reduce our dataset from 20,000 dimensions down to, say, 10, without losing much information

🤓 PCA has lots of great, well-studied theoretical properties

🤓 There are a wide range of fast ways to perform PCA on large datasets

17 of 45

Assumptions of PCA applied to scRNA-seq data

  1. Biological processes affect multiple genes in a coordinated manner
  2. Earlier PCs likely represent biological structure as more variation can be captured by considering the correlated behaviour of many genes
  3. By comparison, random 'technical' noise is expected to affect each gene independently

âš ī¸ Early PCs will capture batch effects that affect multiple genes in a coordinated manner

18 of 45

Principal component analysis

library(scran)

top.zeisel <- getTopHVGs(dec.zeisel, n=2000)

library(scater)

set.seed(100)

sce.zeisel <- runPCA(sce.zeisel,

subset_row=top.zeisel)

scater::runPCA()

âš ī¸ By default, runPCA() uses a fast approximate method that performs simulations, so need to 'set the seed' to obtain reproducible results

19 of 45

Choosing the number of PCs

20 of 45

Choosing the number of PCs

This choice is analogous to the choice of the number of HVGs

More PC will avoid discarding biological signal at the cost of retaining

more noise

👉 Common to select a 'reasonable' but arbitrary number of PCs (10 - 50), continue with the analysis, but come back to check the robustness of the conclusions to a range of values

👉 We'll now look at some more data-driven strategies

21 of 45

Using the elbow point

percent.var <- attr(reducedDim(sce.zeisel),

"percentVar")

chosen.elbow <- PCAtools::findElbowPoint(percent.var)

plot(percent.var, xlab="PC",

ylab="Variance explained (%)")

abline(v=chosen.elbow, col="red")

PCAtools::findElbowPoint()

👉 A simple heuristic for choosing the number of PCs based on the 'percentage of variance explained' by successive PCs

22 of 45

Based on population structure

choices <- getClusteredPCs(reducedDim(sce.zeisel))

chosen.clusters <- metadata(choices)$chosen

scran::getClusteredPCs()

👉 A (more sophisticated) heuristic approach that uses the number of clusters as a proxy for the number of subpopulations

🤓 Suppose we expect d subpopulations of cells, then we need (d-1) dimensions to guarantee separation of all subpopulations

🤔 But we don't know how many subpopulations there are

23 of 45

Based on population structure

scran::getClusteredPCs()

👉 Try a range of d and only consider values that yield no more than d+1 clusters

👉 If we select more clusters with fewer dimensions, we consider this 'overclustering'

👉 Choose d that maximizes the number of clusters without 'overclustering'

➕ Pragmatic solution that addreses bias-variance tradeoff in downstream analysis (specifically clustering)

➖ Strong assumptions about nature of biological differences between clusters (and indeed the very existence of clusters, which may not exist in student of processes like differentiation)

24 of 45

Putting it together

set.seed(100)

# Compute and store the 'full' set of PCs

sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel)

# Can also select d and store the reduced set of PCs

# e.g., using the elbow pointīŋŊreducedDim(sce.zeisel, "PCA_elbow") <- reducedDim(

sce.zeisel, "PCA")[,1:chosen.elbow]

# e.g., based on population structure

reducedDim(sce.zeisel, "PCA_clusters") <- reducedDim(

sce.zeisel, "PCA")[,1:chosen.clusters]

25 of 45

Using the technical noise

library(scran)

set.seed(111001001)

denoised.pbmc <- denoisePCA(sce.pbmc,

technical=dec.pbmc, subset.row=top.pbmc)

scran::denoisenPCA()

🤓 Retain all PCs until the %variation explained reaches some threshold (e.g., based on an estimate of technical variation)

👉denoisePCA() automatically selects the number of PCs

âš ī¸ By default, denoisePCA() performs simulations, so need to 'set the seed' to obtain reproducible results

26 of 45

Using the technical noise

dim(reducedDim(denoised.pbmc, "PCA"))

scran::denoisenPCA()

🤓 Dimensionality of the output is a lower bound on the number of PCs required to explain all biological variation. That is, any fewer PCs will definitely discard some aspect of the 'biological' signal.

âš ī¸ This does not guarantee that the retained PCs capture all of the 'biological' signal

👉 Usually retains more PCs than the elbow point method

27 of 45

Using the technical noise

set.seed(001001001)

denoised.zeisel <- denoisePCA(sce.zeisel,

technical=dec.zeisel, subset.row=top.zeisel)

dim(reducedDim(denoised.zeisel))

scran::denoisenPCA()

🤓 scran::denoisenPCA() internally caps the number of PCs, by default 5 - 50, to avoid selection of too few PCs (when 'technical' noise is high relative to 'biological' noise) or too many PCs (when technical noise is very low)

👉 Zeisel brain data hits the upper limit

28 of 45

Using the technical noise

dec.pbmc2 <- modelGeneVar(sce.pbmc)

denoised.pbmc2 <- denoisePCA(sce.pbmc,

technical=dec.pbmc2, subset.row=top.pbmc)

dim(reducedDim(denoised.pbmc2))

scran::denoisenPCA()

🤓 scran::denoisenPCA() tends to perform best when the mean-variance relationship reflects the actual technical noise, i.e. estimated by scran::modelGeneVarByPoisson() or scran::modelGeneVarWithSpikes() instead of scran::modelGeneVar()

👉 PBMC dataset hits the lower limit

29 of 45

Dimensionality reduction for visualization

30 of 45

Motivation

Clustering and other algorithms will happily operate on 10-50+ PCs, but that's still too many for visualization

👉 Need further dimensionality reduction strategies for visualization

31 of 45

Visualizing with PCA

plotReducedDim(sce.zeisel, dimred="PCA")

scater::plotReducedDim()

32 of 45

Visualizing with PCA

plotReducedDim(sce.zeisel, dimred="PCA",

colour_by="level1class")

âš ī¸ PCA is a linear technique and cannot efficiently pack differences in > 2 dimensions into the first 2 PCs

scater::plotReducedDim()

33 of 45

Challenges and sumamry of visualizing with PCA

plotReducedDim(sce.zeisel, dimred="PCA",

ncomponents=4, colour_by="level1class")

👉 Workaround is to plot first few PCs against each other

đŸ˜ĩ Difficult to interpret

➕ PCA is predictable and will not introduce artificial structure

➕ Deterministic and robust to small changes in input values

➖ Usually not satisfactory for visualizing the complex nature of scRNA-seq data

34 of 45

Visualizing with t-SNE

set.seed(00101001101)

sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA")

plotReducedDim(sce.zeisel, dimred="TSNE",

colour_by="level1class")

👉 t-stochastic neighbour embedding (t-SNE) is the de facto standard for visualization of scRNA-seq data

🤓 Attempts to find a low-dimensional (non-linear) representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space

scater::runTSNE()

scater::plotReducedDim()

35 of 45

Challenges of visualizing with t-SNE

set.seed(100)

sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA",

perplexity=30)

plotReducedDim(sce.zeisel, dimred="TSNE",

colour_by="level1class")

❓ What happens if you re-run runTSNE() (without setting the seed)

❓ What happens if you set the seed but change the value of perplexity and re-run runTSNE()

36 of 45

Challenges of visualizing with t-SNE

🤓 Low perplexity favour resolution of finer structure, possibly to the point that visualization looks like random noise

â„šī¸ http://distill.pub/2016/misread-tsne/ discuss parameter choices for t-SNE in some depth

37 of 45

Challenges of visualizing with t-SNE

âš ī¸ Don't overinterpret the t-SNE results as a 'map' of single-cell identities

âš ī¸ Random component and parameter choices will change visualization

âš ī¸ Interpretation is easily misled by the size and position of visual clusters

âš ī¸ t-SNE inflates dense clusters and compresses sparse ones

âš ī¸ t-SNE is not obliged to preserve the relative locations of non-neighboring clusters (so can't interpret 'non-local' distances)

🙃 t-SNE is a proven tool for general-purpose visualization of scRNA-seq data and remains very popular

38 of 45

Visualizing with UMAP

👉 Uniform manifold approximation and project (UMAP) is an alternative to t-SNE

🤓 Like t-SNE, UMAP attempts to find a low-dimensional (non-linear) representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space

🤓 t-SNE and UMAP based on different mathematical theory

39 of 45

Visualizing with UMAP

set.seed(1100101001)

sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")

plotReducedDim(sce.zeisel, dimred="UMAP",

colour_by="level1class")

👀 Compared to t-SNE, UMAP tends to have more compact visual clusters

â„šī¸ Attempts to preserve more of the global structure than t-SNE

âąī¸ Tends to be faster than t-SNE, which may be important for larger datasets, but difference disappears when applied to PCs

scater::runUMAP()

scater::plotReducedDim()

40 of 45

Challenges of visualizing with UMAP

set.seed(100)

sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA",

n_neighbors=15)

plotReducedDim(sce.zeisel, dimred="UMAP",

colour_by="level1class")

❓ What happens if you re-run runUMAP() (without setting the seed)?

❓ What happens if you set the seed but change the value of n_neighbors and re-run runUMAP()?

41 of 45

Challenges of visualizing with UMAP

âš ī¸ Like t-SNE, need to 'set the seed' and different parameter values will change visualization

âš ī¸ If n_neighbors or min_dist is too low then random noise will be treated as high-resolution structure, if too high will discard fine structure

👉 Try a range of parameter values to ensure they don't compromise any conclusions drawn from a UMAP plot

42 of 45

Interpreting the plots

âš ī¸ Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells

👉 Don't overinterpret the pretty pictures

43 of 45

Summary and recommendations

44 of 45

Summary and recommendations

👉 t-SNE and UMAP plots are useful diagnostics, e.g., checking whether two clusters are actually neighboring subclusters or whether a cluster can be split into further clusters

It is arguable whether t-SNE or UMAP visualizations are more useful or aesthetically pleasing

👉 It's okay to pick the one that works better for your analysis (provided you treat the plot as a visualization/diagnostic and don't make any strong interpretations solely on the basis of the plot itself)

45 of 45

Where we're at