1 of 3

Dimensionality Reduction

If we had sampled only 2 genes across hundreds of cells, it would be easy to visualise the dataset as a 2D plot, with each axis representing one gene.

However, a scRNAseq experiment routinely profiles 1,000 to 10,000+ genes.

If we view each gene as a dimension, cells inhabit a gene-space with 1000s of dimensions.

If genes behave in a correlated manner, we don’t need to store information about each individual gene and can instead create meta-features that summarise it.
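As a rough illustration of the meta-feature idea, the sketch below simulates two genes that follow one shared hidden program and summarises them with a single value per cell. The data, the variable names, and the choice of a simple mean as the meta-feature are all assumptions made for this example; real data would call for a method such as PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 cells, 2 genes driven by one shared hidden program.
n_cells = 500
program = rng.normal(size=n_cells)                  # hidden activity of a gene program
gene_a = program + 0.1 * rng.normal(size=n_cells)   # gene A = program + small noise
gene_b = program + 0.1 * rng.normal(size=n_cells)   # gene B = program + small noise

# The two genes are strongly correlated across cells...
corr = np.corrcoef(gene_a, gene_b)[0, 1]

# ...so one meta-feature (here simply their mean) retains most of the information.
meta_feature = (gene_a + gene_b) / 2
corr_with_program = np.corrcoef(meta_feature, program)[0, 1]

print(f"gene A vs gene B correlation: {corr:.3f}")
print(f"meta-feature vs hidden program correlation: {corr_with_program:.3f}")
```

Two numbers per cell have been replaced by one, with little lost, because the genes were largely redundant.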


https://www.monash.edu/researchinfrastructure/mgbp

2 of 3

Dimensionality Reduction: PCA

Principal Components Analysis (PCA) finds axes in multidimensional space that capture the most variation in a dataset.

The first axes (or Principal Components) are typically assumed to capture the dominant factors of heterogeneity in the data.

Using PCA, most of the variation in a dataset can usually be summarized into 10s of components.
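A minimal sketch of this, assuming a simulated cells-by-genes matrix in which a handful of hidden factors drive 1000 genes. The matrix sizes, noise level, and variable names are hypothetical, and PCA is computed here directly from a singular value decomposition rather than through any particular scRNAseq toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 cells x 1000 genes, driven by 5 hidden factors plus noise.
n_cells, n_genes, n_factors = 300, 1000, 5
factors = rng.normal(size=(n_cells, n_factors))
loadings = rng.normal(size=(n_factors, n_genes))
expression = factors @ loadings + 0.5 * rng.normal(size=(n_cells, n_genes))

# PCA: centre each gene, then take the singular value decomposition.
centred = expression - expression.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained = s**2 / np.sum(s**2)

# Coordinates of every cell on the first 10 principal components.
n_pcs = 10
pcs = u[:, :n_pcs] * s[:n_pcs]

print("cells x PCs:", pcs.shape)
print(f"variance captured by first {n_pcs} PCs: {explained[:n_pcs].sum():.2f}")
```

In this toy setting each cell is described by 10 numbers instead of 1000, while the leading components retain most of the variance.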

This is convenient for many algorithms, but still difficult to visualize...


3 of 3

Dimensionality Reduction: UMAP

UMAP provides a further dimensionality reduction step from 10s of PCs to 2 dimensions that can be easily visualized.

UMAP is a non-linear dimensionality reduction method.

  • Very good at showing the structure of the data.
  • May arbitrarily warp and tear the data to present it in 2D.

The UMAP layout is a useful map on which other data can be shown, such as clusters or the expression of particular genes.

[Figure: Comparison of UMAP and PCA]
