1 of 39

scRNA-seq Analysis. Part 2.

Imputation, dimensionality reduction, clustering

2 of 39

Previous lectures…

3 of 39

Gene Counts


http://data-science-sequencing.github.io/Win2018/lectures/lecture16/
https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/SingleCell/slides/2018-07-25_CRUK_CI_summer_school-scRNAseq.pdf

cDNA alignment to the genome, grouping results by cell

  • In each gene, within each cell, the total number of unique UMI is counted and reported as the number of transcripts of that gene for a given cell.

Hundreds of millions of reads; thousands of cells

Count unique UMIs for each gene in each cell

Create digital expression matrix

Gene counts matrix
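The UMI-counting step above can be sketched in plain Python (toy reads; the barcodes, genes, and UMI strings are invented for illustration):

```python
from collections import defaultdict

# Toy aligned reads as (cell_barcode, gene, UMI) triples. Duplicate UMIs
# within the same cell/gene come from PCR amplification and are counted once.
reads = [
    ("ACGT", "GeneA", "UMI1"),
    ("ACGT", "GeneA", "UMI1"),  # PCR duplicate -> collapsed
    ("ACGT", "GeneA", "UMI2"),
    ("ACGT", "GeneB", "UMI1"),
    ("TTAG", "GeneA", "UMI3"),
]

# Collect the set of unique UMIs per (cell, gene) pair.
umis = defaultdict(set)
for cell, gene, umi in reads:
    umis[(cell, gene)].add(umi)

# Digital expression matrix: number of unique UMIs per (cell, gene).
counts = {key: len(s) for key, s in umis.items()}
print(counts)  # {('ACGT', 'GeneA'): 2, ('ACGT', 'GeneB'): 1, ('TTAG', 'GeneA'): 1}
```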

4 of 39

         Cell1   Cell2   …   Cell N
Gene 1     5       6     …     5
Gene 2    13       0     …    13
Gene 3    14      13     …    14
Gene 4    18      19     …    18
  …        …       …     …     …
Gene M    10      10     …    10

Interpretation of zeroes

5 of 39

Drop-outs in single cell


Kharchenko, P., Silberstein, L. & Scadden, D. Bayesian approach to single-cell differential expression analysis. Nat Methods 11, 740–742 (2014).

  • A gene is observed at a moderate or high expression level in one cell but is not detected in another cell

Why do dropouts occur in single cell?

  • technical artifacts
  • cell type differences
  • statistical sampling
  • biological factors

6 of 39

Drop-outs in single cell

What should we do about dropouts?

  • Ignore zero inflation
  • Preprocess/reduce dimensions
  • Impute scRNA-seq gene count matrix before analysis

Why do we need imputation methods?

  • Downstream analyses rely on the accuracy of gene expression measurements

Imputation Methods

  • MAGIC
  • Droplet
  • DrImpute
  • scDoc

7 of 39

MAGIC


http://jsb.ucla.edu/sites/default/files/scImpute.pdf

van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, Bierie B, Mazutis L, Wolf G, Krishnaswamy S, Pe'er D. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell. 2018 Jul 26;174(3):716-729.e27. doi: 10.1016/j.cell.2018.05.061. Epub 2018 Jun 28. PMID: 29961576; PMCID: PMC6771278.


Markov affinity-based graph imputation of cells

  • Compute the similarity between each pair of cells
  • Transform the similarity matrix 𝑨 into a Markov transition matrix 𝑴
  • Raise the Markov matrix to the power 𝑡: 𝑴^𝑡, which determines the weights of cells
  • Denoise high-dimensional scRNA-seq data
  • Impute missing expression values by sharing information across similar cells

8 of 39

(i) The input data consist of a matrix of cells by genes (middle), with a graphical depiction of the data (right).

(ii) Compute a cell-by-cell distance matrix.

(iii) The distance matrix is converted to an affinity matrix (middle) using a Gaussian kernel. A graphical depiction of the kernel function is shown (right).

(iv) The affinities are normalized, resulting in a Markov matrix (middle). The normalized affinities are shown for a single point as transition probabilities (right).

(v) To perform diffusion, exponentiate the Markov matrix to a chosen power t.

(vi) Multiply the exponentiated Markov matrix (left) by the original data matrix (middle) to obtain a denoised and imputed data matrix (right).
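Steps (ii)–(vi) can be sketched in numpy (a minimal illustration of the diffusion idea, not the MAGIC implementation; the kernel bandwidth and the power t are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 10)).astype(float)  # toy cells x genes counts

# (ii) cell-by-cell Euclidean distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# (iii) Gaussian kernel -> affinity matrix A (bandwidth: median distance)
sigma = np.median(D)
A = np.exp(-(D ** 2) / (2 * sigma ** 2))

# (iv) row-normalize affinities -> Markov transition matrix M
M = A / A.sum(axis=1, keepdims=True)

# (v) diffuse by raising M to a chosen power t
t = 3
Mt = np.linalg.matrix_power(M, t)

# (vi) denoised/imputed matrix: weighted average over similar cells
X_magic = Mt @ X
```

Each imputed profile is a weighted average of the profiles of diffusion-similar cells, which is how zeros get filled in.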

9 of 39

scRNA-seq process


Lafzi et al. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies, Nature Protocols 2018 (https://doi.org/10.1038/s41596-018-0073-y)

Data analysis

    • Dimensionality reduction
    • Clustering and marker identification
    • Trajectory analysis

10 of 39

scRNA-seq data analysis


  • Dimensionality reduction
    • Linear: PCA
    • Non-linear: t-SNE, UMAP
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering

11 of 39

scRNA-seq analysis


Cell-level analysis

  • Marker gene identification
  • Cluster analysis
  • Trajectory analysis

Gene-level analysis

  • Single-cell differential expression analysis
  • Gene set analysis
  • Gene regulatory networks

Python for scRNA-seq data analysis

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

12 of 39

Single-cell RNA sequencing analysis

  • Dimensionality reduction
    • Linear: PCA
    • Non-linear: t-SNE, UMAP
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering


13 of 39

Dimensionality Reduction


https://doi.org/10.3389/fgene.2021.646936 https://www.biorxiv.org/content/10.1101/241646v1.full

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

  • Dimensionality reduction can reveal whether cells cluster together into groups.

14 of 39

Dimensionality Reduction

  • Reduce noise, sparsity
  • Easier for visualization and processing
  • Linear methods:
    • PCA (principal component analysis)
  • Non-linear methods:
    • t-SNE
    • UMAP


https://doi.org/10.3389/fgene.2021.646936 https://www.biorxiv.org/content/10.1101/241646v1.full

15 of 39

PCA

Principal component analysis

  • A linear algebraic method of dimensionality reduction
  • Any matrix can be decomposed as a product of other matrices
  • PC1 captures the most variance in the data, then PC2, PC3, …


https://github.com/NBISweden/excelerate-scRNAseq

(Figure: cells plotted along the 1st PC and 2nd PC.)

Reduce D genes to d PCs of cells, where d << D
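PCA can be computed via the singular value decomposition; a minimal numpy sketch on toy random data (real scRNA-seq pipelines typically run PCA on log-normalized counts restricted to highly variable genes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 500))   # 100 cells x 500 genes (D = 500)

Xc = X - X.mean(axis=0)           # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 2                             # keep d << D principal components
pcs = Xc @ Vt[:d].T               # cells projected onto PC1, PC2

# Fraction of variance per PC; PC1 explains the most, then PC2, ...
explained = (S ** 2) / (S ** 2).sum()
```

The rows of `Vt` are the principal axes (linear combinations of genes), and `pcs` is the low-dimensional representation used for clustering and visualization.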

16 of 39

PCA of Peripheral Blood Mononuclear Cells (PBMC)


17 of 39

Limitation of PCA


  • PCs are linear combinations of individual features (e.g., genes)
  • PCA fails to find non-linear structure in the data

(Figure: data in high dimensions vs. its low-dimensional projection.)

18 of 39

t-SNE


t-Stochastic Neighbor Embedding

  • t-SNE is a manifold embedding algorithm for nonlinear dimensionality reduction ("preserve the manifold structure in the low-dimensional space")

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008 https://nbviewer.org/github/YeoLab/single-cell-bioinformatics/tree/master/ https://mit6874.github.io/assets/sp2020/slides/L11_PCA_tSNE_Autoencoders.pdf

19 of 39

t-SNE


  • Given a collection of points X = {x1, …, xn} ⊂ R^d, find a collection of points Y = {y1, …, yn} ⊂ R^d′, where d ≫ d′.
  • p_ij and q_ij measure the probability that point i would pick point j as its nearest neighbor, in high-dimensional (p) and low-dimensional (q) space, respectively.
  • We want to learn the map {y_i} so that the low-dimensional similarities q_ij match the high-dimensional similarities p_ij.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

20 of 39

Similarity matrix at high dimension

High dimension:

p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution centered around x_i; the symmetrized similarity is p_ij = (p_{j|i} + p_{i|j}) / 2n.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

21 of 39

Similarity matrix at low dimension

Low dimension (Student t-distribution with one degree of freedom):

q_ij = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

22 of 39

Cost Function

Kullback–Leibler divergence

  • Find the optimal {y_i}: minimize a single Kullback–Leibler divergence between the joint probability P in the high-dimensional space and the joint probability Q in the low-dimensional space:

C = KL(P‖Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)

  • The gradient is given by

∂C/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ‖y_i − y_j‖²)⁻¹

  • The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

23 of 39

Summary of t-SNE


L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

  • t-SNE minimizes the divergence between two similarity distributions:
    • pairwise similarities of the input data points on the high dimensional space {pij}
    • pairwise similarities of the corresponding low- dimensional points {qij}

Main steps for t-SNE

  1. Construct a similarity matrix over pairs of high-dimensional points
    • Similar data points are assigned a higher probability, while dissimilar points are assigned a lower probability.
  2. Define a similarity matrix in the low-dimensional embedding space
  3. Minimize the KL divergence between the two similarity distributions using gradient descent to find optimal low-dimensional coordinates of the data points
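These steps can be sketched in numpy (a simplified illustration: a single shared Gaussian bandwidth replaces the per-point σ_i perplexity calibration of real t-SNE, and plain gradient descent replaces momentum and early exaggeration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))          # high-dimensional points x_i
Y = rng.normal(size=(30, 2)) * 1e-2   # low-dimensional map y_i (to be learned)
n = X.shape[0]

def sq_dists(Z):
    s = (Z ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

def low_dim_q(Y):
    inv = 1.0 / (1.0 + sq_dists(Y))   # Student-t kernel (1 + ||y_i-y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv       # q_ij

# High dimension: Gaussian similarities, then symmetrize to p_ij.
sigma2 = np.median(sq_dists(X))       # one shared bandwidth for simplicity
P = np.exp(-sq_dists(X) / (2.0 * sigma2))
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)     # p_{j|i}
P = (P + P.T) / (2.0 * n)             # p_ij

for _ in range(200):                  # gradient descent on KL(P||Q)
    Q, inv = low_dim_q(Y)
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i-y_j||^2)^-1
    W = (P - Q) * inv
    Y -= 1.0 * 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

Q, _ = low_dim_q(Y)
mask = P > 0
kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```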

24 of 39

Limitation of t-SNE


https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

  • t-SNE mainly preserves the local similarity structure of the data, not the global structure.
  • Only within-cluster distances are meaningful; between-cluster distances are not guaranteed to be.

25 of 39

UMAP

Uniform Manifold Approximation and Projection

  • Instead of randomly assigning points as in t-SNE, UMAP constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar as possible
  • UMAP can preserve both global and local data structure

(Figures: UMAP vs. t-SNE embeddings of the MNIST dataset.)


  • https://www.youtube.com/watch?v=nq6iPZVUxZU
  • https://nbisweden.github.io/excelerate-scRNAseq/session-dim-reduction/lecture_dimensionality_reduction.pdf
  • https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
  • https://pair-code.github.io/understanding-umap/

26 of 39

Main steps for UMAP


McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
https://pair-code.github.io/understanding-umap/supplement.html

  1. Construct a weighted k-nearest-neighbors graph defining a similarity matrix in high-dimensional space
    • The edge weights represent the likelihood that two data points are connected (i.e., their "similarity")
  2. Define a similarity matrix in low-dimensional space
  3. Minimize the cross-entropy cost function at each point using gradient descent to find the most similar graph in lower dimensions
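Step 1 — the weighted kNN graph with per-point ρ_i and σ_i and fuzzy-union symmetrization — can be sketched in numpy (a simplified illustration of the construction in arXiv:1802.03426, not the umap-learn implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 8))
k = 5

# Pairwise distances; exclude self-distances.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# rho_i: distance from x_i to its nearest neighbor.
rho = d.min(axis=1)

# sigma_i: calibrated per point (binary search) so that
# sum_j exp(-(d_ij - rho_i)/sigma_i) over the k nearest neighbors = log2(k).
target = np.log2(k)
knn = np.sort(d, axis=1)[:, :k]
sigma = np.ones(len(X))
for i in range(len(X)):
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = (lo + hi) / 2
        val = np.exp(-(knn[i] - rho[i]) / mid).sum()
        lo, hi = (mid, hi) if val < target else (lo, mid)
    sigma[i] = (lo + hi) / 2

# Directed edge weights v_{j|i} = exp(-(d_ij - rho_i)/sigma_i), kNN only.
V = np.zeros_like(d)
idx = np.argsort(d, axis=1)[:, :k]
for i in range(len(X)):
    V[i, idx[i]] = np.exp(-(d[i, idx[i]] - rho[i]) / sigma[i])

# Fuzzy-union symmetrization: v_ij = v_j|i + v_i|j - v_j|i * v_i|j
W = V + V.T - V * V.T
```

`W` is the high-dimensional similarity graph; the low-dimensional layout is then optimized against it with the cross-entropy cost.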

27 of 39

Similarity matrix at high dimension

  • ρ_i is the distance from x_i to its nearest neighbor in high dimension (computed on the kNN graph)
    • UMAP uses an exponential probability distribution in high dimension: v_{j|i} = exp(−(d(x_i, x_j) − ρ_i) / σ_i)
  • The high-dimension similarity scores are symmetrized (v_ij = v_{j|i} + v_{i|j} − v_{j|i} v_{i|j}), and σ_i is set by the chosen number of nearest neighbors k
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

arXiv:1802.03426

28 of 39

Similarity matrix at high dimension


  • For each data point, it has a locally adaptive exponential kernel, so the distance metric varies from point to point.

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

arXiv:1802.03426

https://www.youtube.com/watch?v=jth4kEvJ3P8


29 of 39

Cost Function

binary cross-entropy (CE)

  • Similarity matrix at low dimension: modelled with the family of curves q_ij = 1 / (1 + a‖y_i − y_j‖^(2b)), where a and b are hyperparameters
  • The gradient of the CE is used to find the optimal {y_i}
  • The cross-entropy cost function lets UMAP capture both the global and the local structure of the data

arXiv:1802.03426

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

30 of 39

UMAP examples

https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html

31 of 39

Single-cell RNA sequencing analysis

  • Introduction to single-cell analysis
  • Dimensionality reduction
    • Non-linear: t-SNE, UMAP
    • Linear: PCA
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering

32 of 39

Cell clustering


  • Feature selection and dimensionality reduction extract the most informative genes and representative features from the background noise, respectively
  • Cell–cell distances are then calculated in the low-dimensional space to group cells into clusters

Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273–282 (2019). https://doi.org/10.1038/s41576-018-0088-9

33 of 39

Clustering methods for scRNA-seq


Clustering methods:

  • K-means
  • hierarchical clustering
  • graph-based clustering
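K-means (Lloyd's algorithm) can be sketched in numpy; this toy version uses a simple deterministic farthest-point initialization instead of random restarts or k-means++:

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Farthest-point initialization: start from X[0], then repeatedly pick
    # the point farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each cell to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned cells.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated toy "cell populations" in a 2D reduced space.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, _ = kmeans(X, 2)
```

In practice K-means is run on the reduced representation (e.g., the top PCs), not on the raw count matrix.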

Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273–282 (2019). https://doi.org/10.1038/s41576-018-0088-9

Tools for graph-based clustering:

  • Seurat: Louvain, Leiden, SLM
  • igraph: fast greedy, Louvain, optimal, walktrap, spinglass, infomap

34 of 39

Graph-based clustering

  • One of the most popular clustering approaches in scRNA-seq data analysis
    • Node → cell
    • Edge → similarity between two cells


  • Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell Rna-Sequencing Data.” F1000Research 7. Faculty of 1000 Ltd.
  • Wasserman, S. & Faust, K. (1994) Social Network Analysis (Cambridge Univ. Press, Cambridge, U.K.).
  • https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq- workshop/public/clustering-and-cell-annotation.html#ref-freytag2018comparison
  • Louvain community detection: Blondel et al. 2008
  • Leiden community detection: Traag, Waltman, and Eck 2019

35 of 39

Modularity

Modularity Q: a measure of the strength of a network's division into modules

Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j)

where A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and δ(c_i, c_j) = 1 when nodes i and j are in the same module.

(Figure: networks with low vs. high modularity.)

Brede, Europhysics Letters, 2010.

Newman, PNAS, 2006.

Clustering goal: assign each node to a module so as to maximize modularity as an objective function (a module is a group of densely connected nodes)
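Modularity can be computed directly from its definition; a numpy sketch on a toy graph of two triangles joined by a single edge:

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)            # node degrees
    two_m = A.sum()              # 2m: each undirected edge counted twice
    c = np.asarray(communities)
    delta = c[:, None] == c[None, :]
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Two triangles (nodes 0-2 and 3-5) joined by the bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

good = modularity(A, [0, 0, 0, 1, 1, 1])  # split at the bridge
bad = modularity(A, [0, 1, 0, 1, 0, 1])   # arbitrary split
print(round(good, 3))  # 0.357
```

The natural split at the bridge scores higher, which is exactly what modularity-maximizing algorithms such as Louvain search for.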

36 of 39

Louvain community detection

(Louvain clustering)


https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/clustering-and-cell-annotation.html#ref-freytag2018comparison https://www.youtube.com/watch?v=QNv7rKWCgM8

  • Start with every node in its own community (i.e., module/cluster)
  • Step 1: Modularity optimization
    • Order the nodes and, for each node i, move i to the community of the neighbor j that yields the maximum ∆Q
    • If all ∆Q < 0, then i remains in its current community
    • Repeatedly cycle through all nodes until ∆Q = 0
  • Step 2: Community aggregation
    • Create a weighted network of the communities from Step 1
    • Nodes: communities from Step 1
    • Edge weights: sums of the weights of edges between communities
    • Edges within a community become a self-loop on its community node
  • Repeat: apply Step 1 & Step 2 to the resulting network, and so on, until ∆Q = 0
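Step 1 (local moving) can be illustrated by brute force — recomputing the full Q for every candidate move instead of the incremental ∆Q formula the real algorithm uses for speed:

```python
import numpy as np

def modularity(A, c):
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = np.asarray(c)[:, None] == np.asarray(c)[None, :]
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

def local_move(A):
    """Step 1 of Louvain: every node starts in its own community; repeatedly
    move each node to the neighbor community that most increases Q, until
    no move improves Q. (Brute-force Q recomputation, for clarity only.)"""
    n = len(A)
    c = list(range(n))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            best_q, best_c = modularity(A, c), c[i]
            for j in np.nonzero(A[i])[0]:        # neighbor communities only
                trial = c.copy()
                trial[i] = c[j]
                q = modularity(A, trial)
                if q > best_q:
                    best_q, best_c = q, c[j]
            if best_c != c[i]:
                c[i] = best_c
                improved = True
    return c

# Same toy graph: two triangles joined by one bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
c = local_move(A)
```

On this graph the local-moving pass already recovers the two triangles as communities; Step 2 would then collapse each into a single weighted node.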

37 of 39

Examples of Louvain clustering


Belgian mobile phone network

  • 2M nodes
  • Red nodes: French speakers
  • Green nodes: Dutch speakers

Blondel et al. 2008

38 of 39

Examples of Louvain clustering


The Louvain algorithm clusters millions of cells with reasonable computational complexity.

Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 15:e8746

Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Zhou Y (2020) Construction of a human cell landscape at single-cell level. Nature 581:303–309. https://doi.org/10.1038/s41586-020-2157-4

Seow, J.J.W., Wong, R.M.M., Pai, R. et al. Single‐Cell RNA Sequencing for Precision Oncology: Current State-of-Art. J Indian Inst Sci 100, 579–588 (2020). https://doi.org/10.1007/s41745-020-00178-1

39 of 39

Adapted from slides by Mark Craven, Colin Dewey, Anthony Gitter, and Daifeng Wang