Dimensionality Reduction
If we had sampled only 2 genes across hundreds of cells, it would be easy to visualise the dataset as a 2D plot with each axis representing a gene.
However, a scRNAseq experiment routinely profiles 1,000 to 10,000+ genes.
If we view each gene as a dimension, cells inhabit a gene-space with 1000s of dimensions.
If genes behave in a correlated manner, we don’t need to store information about each gene individually and can instead create meta-features that summarise it.
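As a rough illustration of the meta-feature idea, the sketch below simulates two genes driven by one shared program and summarises them with a single value per cell. The data are synthetic, and the names (`program`, `gene_a`, `gene_b`, `meta_feature`) and the choice of averaging z-scores are assumptions made for this example, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 cells, two genes driven by one shared program.
n_cells = 200
program = rng.normal(size=n_cells)                       # hidden shared activity
gene_a = program + 0.1 * rng.normal(size=n_cells)        # gene A follows the program
gene_b = 2.0 * program + 0.1 * rng.normal(size=n_cells)  # gene B follows it too, scaled

# The two genes are strongly correlated across cells.
r = np.corrcoef(gene_a, gene_b)[0, 1]


def zscore(x):
    """Centre a vector and scale it to unit standard deviation."""
    return (x - x.mean()) / x.std()


# One simple meta-feature (an assumption for this sketch):
# the mean of the two z-scored genes, giving one number per cell.
meta_feature = (zscore(gene_a) + zscore(gene_b)) / 2

# Check how well the single meta-feature tracks each original gene.
r_meta_a = np.corrcoef(meta_feature, gene_a)[0, 1]
r_meta_b = np.corrcoef(meta_feature, gene_b)[0, 1]
print(f"gene-gene correlation: {r:.3f}")
print(f"meta-feature vs gene A: {r_meta_a:.3f}, vs gene B: {r_meta_b:.3f}")
```

Because both genes carry nearly the same signal, one value per cell retains most of the information held in two.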
https://www.monash.edu/researchinfrastructure/mgbp
Dimensionality Reduction: PCA
Principal Components Analysis (PCA) finds axes in multidimensional space that capture the most variation in a dataset.
The first axes (or Principal Components) are typically assumed to capture the dominant factors of heterogeneity in the data.
Using PCA, most of the variation in a dataset can usually be summarized into ~10s of components.
This is convenient for many algorithms, but still difficult to visualize...
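A minimal sketch of PCA on a toy matrix, assuming synthetic data built from 3 hidden factors (the matrix `X`, its dimensions and the noise level are invented for illustration). It uses a plain singular value decomposition from NumPy rather than any particular scRNAseq toolkit, so it shows the idea and is not a recommended pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy matrix: 300 cells x 50 genes, built from 3 hidden factors plus noise.
n_cells, n_genes, n_factors = 300, 50, 3
factors = rng.normal(size=(n_cells, n_factors))    # per-cell factor activity
loadings = rng.normal(size=(n_factors, n_genes))   # how each factor affects each gene
X = factors @ loadings + 0.1 * rng.normal(size=(n_cells, n_genes))

# Centre each gene (column), then decompose with SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by each principal component,
# ordered from largest to smallest.
explained = S**2 / np.sum(S**2)

# Coordinates of each cell on the first few components (the PC scores).
n_pcs = 3
pcs = U[:, :n_pcs] * S[:n_pcs]   # equivalent to Xc @ Vt[:n_pcs].T

print("variance explained by first 3 PCs:", round(float(explained[:n_pcs].sum()), 3))
print("reduced shape:", pcs.shape)
```

Here 50 gene dimensions collapse to 3 components with little loss because the toy data were generated from 3 factors; real datasets typically need more components.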
Dimensionality Reduction: UMAP
UMAP provides a further dimensionality reduction step from 10s of PCs to 2 dimensions that can be easily visualized.
UMAP is a non-linear dimensionality reduction method.
The UMAP layout is a useful map on which other data can be shown, such as clusters or the expression of particular genes.
Figure: Comparison of UMAP and PCA