scRNA-seq : visualization

École de bioinformatique AVIESAN-IFB-INSERM 2022

scRNA-Seq pipeline overview

[Figure: biological sample → sequencer output → unfiltered count matrix (genes × cells) → filtered count matrix (genes × cells) → normalized matrix (genes × cells) → HVG selection + scaling → reduced space (dimensions × cells) → cells visualization (Dim. 1 × Dim. 2)]

We want a visual summary of the gene expression of thousands of cells.

[Figure: normalized matrix (genes × cells) → reduced space (dimensions × cells) → cells visualization (Dim. 1 × Dim. 2). Labels on the arrows: “How ?” and “Why not ?”]

Why is an intermediate step needed ?

We summarize gene expression in a few dimensions before building the 2D projection.

http://cmdlinetips.com/wp-content/uploads/2018/03/Sparse_Matrix.png

scRNA-Seq data are sparse

More than 70 % of the expression matrix is 0, as measured by prop(expr_mat == 0) : these zeros are not very informative.

Data are noisy

Some genes are more informative than others.

There is biological and technical noise in gene expression.

Computational time

Working with fewer dimensions makes the downstream computations faster.
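
The proportion of zeros, written prop(expr_mat == 0) on this slide, can be checked directly. Here is a minimal sketch in Python with NumPy. The toy matrix expr_mat (genes × cells) and the helper name zero_fraction are made up for illustration and are not part of the course material.

```python
import numpy as np

# Hypothetical toy count matrix: 6 genes (rows) x 5 cells (columns).
expr_mat = np.array([
    [0, 0, 3, 0, 0],
    [1, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 0, 4, 0],
    [2, 0, 0, 0, 0],
])

def zero_fraction(mat):
    """Return the fraction of entries equal to zero."""
    return float(np.mean(mat == 0))

sparsity = zero_fraction(expr_mat)
print(f"{sparsity:.0%} of the matrix is zero")  # 80% of the matrix is zero
```

On a real dataset the same call would be applied to the full count matrix. If that matrix is stored in a sparse format, the count of stored non-zero values would be used instead of a dense comparison.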

Challenges

Reduced space

  • How to reduce the number of dimensions ?
  • How many dimensions ?

Cells visualization

  • How to build the 2D representation ?
  • How to identify cell populations ?

Dimensionality reduction

Overview

Commonly used dimensionality reduction methods

  • PCA : Principal Component Analysis
  • BFA : Binary Factor Analysis
  • ICA : Independent Component Analysis
  • LSI : Latent Semantic Indexing
  • LDA : Linear Discriminant Analysis

Important parameters

  • information : number of highly variable genes (HVG)
  • number of dimensions to generate (signal vs. noise trade-off)
  • randomness : random seed
  • convergence criteria

[Figure: HVG selection. The normalized matrix (≈15,000 genes × cells) is split into HVG and “constant” genes → HVG matrix (≈3,000 HVG × cells) → scaling → scaled matrix (≈3,000 HVG × cells) → reduced space (≈50 dimensions × cells)]
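
The HVG selection and scaling steps can be sketched as follows, in Python with NumPy. This is a simplified illustration, not the method used in the practical session. Genes are ranked by raw variance, whereas real pipelines usually correct the variance for its dependence on mean expression. The matrix norm_mat, its size and the function names select_hvg and scale_genes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Hypothetical normalized matrix: 200 genes x 50 cells.
n_genes, n_cells, n_hvg = 200, 50, 20
norm_mat = rng.normal(loc=1.0, scale=0.1, size=(n_genes, n_cells))
# Make the first 20 genes clearly more variable than the others.
norm_mat[:n_hvg] += rng.normal(scale=2.0, size=(n_hvg, n_cells))

def select_hvg(mat, n_top):
    """Return the (sorted) indices of the n_top genes with the highest variance."""
    gene_var = mat.var(axis=1, ddof=1)
    return np.sort(np.argsort(gene_var)[::-1][:n_top])

def scale_genes(mat):
    """Center each gene to mean 0 and scale it to unit variance."""
    mean = mat.mean(axis=1, keepdims=True)
    sd = mat.std(axis=1, ddof=1, keepdims=True)
    return (mat - mean) / sd

hvg_idx = select_hvg(norm_mat, n_hvg)
scaled_mat = scale_genes(norm_mat[hvg_idx])
print(scaled_mat.shape)  # (20, 50)
```

Scaling gives every selected gene the same weight in the next step, so that highly expressed genes do not dominate the reduced space.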

Dimensionality reduction

Principal Component Analysis - principle

  • Input : X (≈ 2,000 - 5,000) HVG with scaled expression levels
  • Goal : group genes into dimensions when they have similar expression across cells

[Figure: genes 1 to 6 grouped into dimensions 1 to 3]

  • Output : Z (≈ 50 - 100) dimensions, the “Principal Components” (PC)
  • Each PC summarizes a certain amount of the input data variability
  • The first PC captures the largest share of the information
  • The last PCs can be considered as noise
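
The principle above can be illustrated with a small PCA computed through a singular value decomposition. This is a sketch in Python with NumPy under stated assumptions. The toy matrix x (HVG × cells) is simulated from two hidden signals plus noise, and the function name pca and its return values are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scaled matrix: 30 HVG x 100 cells, built from 2 hidden signals.
n_genes, n_cells, n_pcs = 30, 100, 5
signals = rng.normal(size=(2, n_cells))
weights = rng.normal(size=(n_genes, 2))
x = weights @ signals + 0.1 * rng.normal(size=(n_genes, n_cells))
x -= x.mean(axis=1, keepdims=True)  # center each gene

def pca(mat, n_components):
    """PCA of a centered genes x cells matrix via SVD.

    Returns cell coordinates (cells x PCs), gene loadings (genes x PCs)
    and the fraction of total variance explained by each kept PC.
    """
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    cell_coords = (vt[:n_components] * s[:n_components, None]).T
    return cell_coords, u[:, :n_components], explained[:n_components]

coords, loadings, explained = pca(x, n_pcs)
print(coords.shape)            # (100, 5)
print(np.round(explained, 3))  # decreasing: first PCs carry most of the variability
```

In this simulation the first two PCs carry almost all of the variability and the remaining ones mostly reflect the added noise, which mirrors the statement that the last PCs can be considered as noise.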

Dimensionality reduction

Principal Component Analysis - visualization

  • Input : X most variable genes
  • Goal : group genes into dimensions when they have similar expression across cells
  • Output : Z dimensions, the “Principal Components” (PC)
  • Each PC summarizes a certain amount of the input data variability

Now, we will use the reduced space to make a 2D representation.

2D space for cells visualization

Commonly used 2D spaces

  • UMAP
  • tSNE
  • Diffusion Map

The same cells can be represented using different 2D spaces.

Do not over-interpret the 2D space : it is an oversimplified representation of the cells.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417818/

Important parameters

  • input information : number of dimensions
  • cells neighborhood : number of neighbors, perplexity, distance method, …

Clustering

Clustering is performed on the expression matrix or on the reduced space, not on the 2D projection.

The 2D projection is not a clustering. A clustering is an annotation.

Commonly used methods

  • Louvain clustering
  • Leiden clustering
  • k-means

Important parameters

  • input information : number of dimensions
  • cells neighborhood parameters : number of neighbors, distance measurement method, resolution

[Figure: k-nearest neighbors (kNN) graphs for k = 3 and k = 6 → shared nearest neighbors (SNN) graph → clustering (from SNN graph)]
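
The kNN and SNN graphs can be sketched as follows, in Python with NumPy. This is an illustration under assumptions, not the implementation of a specific tool. The toy reduced space contains two simulated groups of cells, the SNN weight is taken to be the Jaccard overlap between neighbor sets (each set including the cell itself), and the function names knn_indices and snn_weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reduced space: 2 well-separated groups of 10 cells, 5 dimensions.
group_a = rng.normal(loc=0.0, scale=0.5, size=(10, 5))
group_b = rng.normal(loc=10.0, scale=0.5, size=(10, 5))
reduced = np.vstack([group_a, group_b])

def knn_indices(coords, k):
    """Return, for each cell, the indices of its k nearest cells (Euclidean)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def snn_weights(neighbors):
    """Jaccard overlap between the neighbor sets of every pair of cells."""
    n = neighbors.shape[0]
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], neighbors] = True
    member[np.arange(n), np.arange(n)] = True  # include the cell itself
    counts = member.astype(int)
    shared = counts @ counts.T
    sizes = counts.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - shared
    return shared / union

knn = knn_indices(reduced, k=3)
snn = snn_weights(knn)
print(snn[:10, 10:].max())  # 0.0: the two groups share no neighbors
```

A graph clustering method such as Louvain or Leiden would then partition this weighted graph. That step is not implemented here. Increasing k connects more cells to each other, which is one reason the number of neighbors matters for the final clusters.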

Summary

[Figure: summary of the workflow. Normalized matrix (≈15,000 genes × cells) → HVG selection (≈3,000 HVG × cells) → scaled matrix (≈3,000 HVG × cells) → reduced space (≈50 dimensions × cells) → 2D projection of the cells (Axis 1 × Axis 2 : UMAP, tSNE, others…), with a cluster assigned to each cell]

Take Home Messages

  • The number of variable genes impacts the PCA, and thus the 2D space. The right number depends on the expected number of cell populations in the dataset.
  • Number of dimensions = amount of information (too few : not enough information ; too many : noisy data)
  • UMAP is suited to visualize several cell types and their global transcriptomic profile
  • tSNE is suited to visualize cell subtypes and their local transcriptomic specificities
  • Diffusion Map is suited to visualize cell differentiation data
  • The resolution impacts the number of clusters (too low : not enough clusters ; too high : clusters that are not biologically interpretable)

Advice :

  1. Run the analysis with all default settings :
    • 2000 HVG
    • 15 PC to generate a UMAP (or tSNE)
    • Resolution 1 for the clustering
  2. Identify your cell populations
  3. Adjust the settings so that the representation shows what you identified

The goal is to generate a quick representation of your cells. Run your favorite analyses and display their results on this representation. Do not over-interpret the 2D representation itself.

Let’s practice

[Figure: normalized matrix (genes × cells) → reduced space, PCA (dimensions × cells) → cells visualization, UMAP (Dim. 1 × Dim. 2)]

  • Nb of variable features : 500, 2000, 5000
  • Nb of dimensions : 50
  • Nb of dimensions : 5, 15, 50, 100
  • Nb of dims : same as UMAP
  • Resolution : 0.1, 0.5, 1, 5

[Figure: grid of UMAP projections. Number of variable features : 500, 2000, 5000. Number of PC (/50) to make the UMAP : 5, 15, 50]

[Figure: clustering results for Resolution : 0.1, 0.5, 1, 5]