1 of 39

scRNA-seq Analysis. Part 2.

Imputation, dimensionality reduction, clustering

2 of 39

Previous lectures…

3 of 39

Gene Counts


http://data-science-sequencing.github.io/Win2018/lectures/lecture16/
https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/SingleCell/slides/2018-07-25_CRUK_CI_summer_school-scRNAseq.pdf

cDNA alignment to the genome, grouping results by cell

  • In each gene, within each cell, the total number of unique UMI is counted and reported as the number of transcripts of that gene for a given cell.

Hundreds of millions of reads; thousands of cells

Count unique UMIs for each gene in each cell

Create digital expression matrix

Gene counts matrix
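The UMI-counting step above can be sketched in plain Python (toy reads; the barcodes, genes, and UMI strings are invented for illustration):

```python
from collections import defaultdict

# Toy aligned reads as (cell_barcode, gene, UMI) triples. Duplicate UMIs
# within the same cell/gene come from PCR amplification and are counted once.
reads = [
    ("ACGT", "GeneA", "UMI1"),
    ("ACGT", "GeneA", "UMI1"),  # PCR duplicate -> collapsed
    ("ACGT", "GeneA", "UMI2"),
    ("ACGT", "GeneB", "UMI1"),
    ("TTAG", "GeneA", "UMI3"),
]

# Collect the set of unique UMIs per (cell, gene) pair.
umis = defaultdict(set)
for cell, gene, umi in reads:
    umis[(cell, gene)].add(umi)

# Digital expression matrix: number of unique UMIs per (cell, gene).
counts = {key: len(s) for key, s in umis.items()}
print(counts)  # {('ACGT', 'GeneA'): 2, ('ACGT', 'GeneB'): 1, ('TTAG', 'GeneA'): 1}
```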

4 of 39

         Cell1   Cell2   …   Cell N
Gene 1     5       6     …     5
Gene 2    13       0     …    13
Gene 3    14      13     …    14
Gene 4    18      19     …    18
  …        …       …     …     …
Gene M    10      10     …    10

Interpretation of zeroes

5 of 39

Drop-outs in single cell


Kharchenko, P., Silberstein, L. & Scadden, D. Bayesian approach to single-cell differential expression analysis. Nat Methods 11, 740–742 (2014).

  • A gene is observed at a moderate or high expression level in one cell but is not detected in another cell

Why do dropouts occur in single cell?

  • technical artifacts
  • cell type differences
  • statistical sampling
  • biological factors

6 of 39

Drop-outs in single cell

What should we do about dropouts?

  • Ignore zero inflation
  • Preprocess/reduce dimensions
  • Impute scRNA-seq gene count matrix before analysis

Why do we need imputation methods?

  • Downstream analyses rely on the accuracy of gene expression measurements

Imputation Methods

  • MAGIC
  • Droplet
  • DrImpute
  • scDoc

7 of 39

MAGIC


http://jsb.ucla.edu/sites/default/files/scImpute.pdf

van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, Bierie B, Mazutis L, Wolf G, Krishnaswamy S, Pe'er D. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell. 2018 Jul 26;174(3):716-729.e27. doi: 10.1016/j.cell.2018.05.061. Epub 2018 Jun 28. PMID: 29961576; PMCID: PMC6771278.


Markov affinity-based graph imputation of cells

  • Compute the similarity between each pair of cells
  • Transform the similarity matrix 𝑨 into a Markov transition matrix 𝑴
  • Raise the Markov matrix to the power 𝑡: 𝑴^𝑡, which determines the weights of cells
  • Denoise high-dimensional scRNA-seq data
  • Impute missing expression values by sharing information across similar cells

8 of 39

(i) The input data consist of a matrix of cells by genes (middle), with a graphical depiction of the data (right).

(ii) Compute a cell-by-cell distance matrix.

(iii) The distance matrix is converted to an affinity matrix (middle) using a Gaussian kernel. A graphical depiction of the kernel function is shown (right).

(iv) The affinities are normalized, resulting in a Markov matrix (middle). The normalized affinities are shown for a single point as transition probabilities (right).

(v) To perform diffusion, exponentiate the Markov matrix to a chosen power t.

(vi) Multiply the exponentiated Markov matrix (left) by the original data matrix (middle) to obtain a denoised and imputed data matrix (right).
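Steps (ii)–(vi) can be sketched in numpy (a minimal illustration of the diffusion idea, not the MAGIC implementation; the kernel bandwidth and the power t are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 10)).astype(float)  # toy cells x genes counts

# (ii) cell-by-cell Euclidean distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# (iii) Gaussian kernel -> affinity matrix A (bandwidth: median distance)
sigma = np.median(D)
A = np.exp(-(D ** 2) / (2 * sigma ** 2))

# (iv) row-normalize affinities -> Markov transition matrix M
M = A / A.sum(axis=1, keepdims=True)

# (v) diffuse by raising M to a chosen power t
t = 3
Mt = np.linalg.matrix_power(M, t)

# (vi) denoised/imputed matrix: weighted average over similar cells
X_magic = Mt @ X
```

Each imputed profile is a weighted average of the profiles of diffusion-similar cells, which is how zeros get filled in.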

9 of 39

scRNA-seq process


Lafzi et al. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies, Nature Protocols 2018 (https://doi.org/10.1038/s41596-018-0073-y)

Data analysis

    • Dimensionality reduction
    • Clustering and marker identification
    • Trajectory analysis

10 of 39

scRNA-seq data analysis


  • Dimensionality reduction
    • Linear: PCA
    • Non-linear: t-SNE, UMAP
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering

11 of 39

scRNA-seq analysis


Cell-level analysis

  • Marker gene identification
  • Cluster analysis
  • Trajectory analysis

Gene-level analysis

  • Single-cell differential expression analysis
  • Gene set analysis
  • Gene regulatory networks

Python for scRNA-seq data analysis

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

12 of 39

Single-cell RNA sequencing analysis

  • Dimensionality reduction
    • Linear: PCA
    • Non-linear: t-SNE, UMAP
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering


13 of 39

Dimensionality Reduction


https://doi.org/10.3389/fgene.2021.646936 https://www.biorxiv.org/content/10.1101/241646v1.full

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

  • Dimensionality reduction can reveal whether cells cluster together into groups.

14 of 39

Dimensionality Reduction

  • Reduce noise, sparsity
  • Easier for visualization and processing
  • Linear methods:
    • PCA (principal component analysis)
  • Non-linear methods:
    • t-SNE
    • UMAP


https://doi.org/10.3389/fgene.2021.646936 https://www.biorxiv.org/content/10.1101/241646v1.full

15 of 39

PCA

Principal component analysis

  • A linear algebraic method of dimensionality reduction
  • Any matrix can be decomposed as a product of other matrices
  • PC1 captures the most variance in the data, then PC2, PC3, …


https://github.com/NBISweden/excelerate-scRNAseq

(Figure: cells plotted along the 1st PC and 2nd PC.)

Reduce D genes to d PCs of cells, where d << D
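PCA can be computed via the singular value decomposition; a minimal numpy sketch on toy random data (real scRNA-seq pipelines typically run PCA on log-normalized counts restricted to highly variable genes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 500))   # 100 cells x 500 genes (D = 500)

Xc = X - X.mean(axis=0)           # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 2                             # keep d << D principal components
pcs = Xc @ Vt[:d].T               # cells projected onto PC1, PC2

# Fraction of variance per PC; PC1 explains the most, then PC2, ...
explained = (S ** 2) / (S ** 2).sum()
```

The rows of `Vt` are the principal axes (linear combinations of genes), and `pcs` is the low-dimensional representation used for clustering and visualization.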

16 of 39

PCA of Peripheral Blood Mononuclear Cells (PBMC)


17 of 39

Limitation of PCA


  • PCs are linear combinations of individual features (e.g., genes)
  • PCA fails to find non-linear structure in the data

(Figure: data in high dimensions vs. its low-dimensional projection.)

18 of 39

t-SNE


t-Stochastic Neighbor Embedding

  • t-SNE is a manifold embedding algorithm for nonlinear dimensionality reduction ("preserve the manifold structure in the low-dimensional space")

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008 https://nbviewer.org/github/YeoLab/single-cell-bioinformatics/tree/master/ https://mit6874.github.io/assets/sp2020/slides/L11_PCA_tSNE_Autoencoders.pdf

19 of 39

t-SNE


  • Given a collection of points X = {x1, …, xn} ⊂ R^d, find a collection of points Y = {y1, …, yn} ⊂ R^d′, where d ≫ d′.
  • p_ij and q_ij measure the probability that point i would pick point j as its nearest neighbor, in high-dimensional (p) and low-dimensional (q) space, respectively.
  • We want to learn the map {y_i} so that the low-dimensional similarities q_ij match the high-dimensional similarities p_ij.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

20 of 39

Similarity matrix at high dimension

High dimension:

p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution centered around x_i; the symmetrized similarity is p_ij = (p_{j|i} + p_{i|j}) / 2n.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

21 of 39

Similarity matrix at low dimension

Low dimension (Student t-distribution with one degree of freedom):

q_ij = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

22 of 39

Cost Function

Kullback–Leibler divergence

  • Find the optimal {y_i}: minimize a single Kullback–Leibler divergence between the joint probability P in the high-dimensional space and the joint probability Q in the low-dimensional space:

C = KL(P‖Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)

  • The gradient is given by

∂C/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ‖y_i − y_j‖²)⁻¹

  • The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

23 of 39

Summary of t-SNE


L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR2008

  • t-SNE minimizes the divergence between two similarity distributions:
    • pairwise similarities of the input data points on the high dimensional space {pij}
    • pairwise similarities of the corresponding low- dimensional points {qij}

Main steps for t-SNE

  1. Construct a similarity matrix over pairs of high-dimensional points
    • Similar data points are assigned a higher probability, while dissimilar points are assigned a lower probability.
  2. Define a similarity matrix in the low-dimensional embedding space
  3. Minimize the KL divergence between the two similarity distributions using gradient descent to find optimal low-dimensional coordinates of the data points
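These steps can be sketched in numpy (a simplified illustration: a single shared Gaussian bandwidth replaces the per-point σ_i perplexity calibration of real t-SNE, and plain gradient descent replaces momentum and early exaggeration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))          # high-dimensional points x_i
Y = rng.normal(size=(30, 2)) * 1e-2   # low-dimensional map y_i (to be learned)
n = X.shape[0]

def sq_dists(Z):
    s = (Z ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

def low_dim_q(Y):
    inv = 1.0 / (1.0 + sq_dists(Y))   # Student-t kernel (1 + ||y_i-y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv       # q_ij

# High dimension: Gaussian similarities, then symmetrize to p_ij.
sigma2 = np.median(sq_dists(X))       # one shared bandwidth for simplicity
P = np.exp(-sq_dists(X) / (2.0 * sigma2))
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)     # p_{j|i}
P = (P + P.T) / (2.0 * n)             # p_ij

for _ in range(200):                  # gradient descent on KL(P||Q)
    Q, inv = low_dim_q(Y)
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i-y_j||^2)^-1
    W = (P - Q) * inv
    Y -= 1.0 * 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

Q, _ = low_dim_q(Y)
mask = P > 0
kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```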

24 of 39

Limitation of t-SNE


https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

  • t-SNE mainly preserves the local similarity structure of the data, not the global structure.
  • Only within-cluster distances are meaningful; between-cluster distances are not guaranteed to be.

25 of 39

UMAP

Uniform Manifold Approximation and Projection

  • Instead of randomly assigning points as in t-SNE, UMAP constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar as possible
  • UMAP can preserve both global and local data structure

(Figures: UMAP vs. t-SNE embeddings of the MNIST dataset.)


  • https://www.youtube.com/watch?v=nq6iPZVUxZU
  • https://nbisweden.github.io/excelerate-scRNAseq/session-dim-reduction/lecture_dimensionality_reduction.pdf
  • https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
  • https://pair-code.github.io/understanding-umap/

26 of 39

Main steps for UMAP


McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
https://pair-code.github.io/understanding-umap/supplement.html

  1. Construct a weighted k-nearest-neighbors graph defining a similarity matrix in high-dimensional space
    • The edge weights represent the likelihood that two data points are connected (i.e., their "similarity")
  2. Define a similarity matrix in low-dimensional space
  3. Minimize the cross-entropy cost function at each point using gradient descent to find the most similar graph in lower dimensions
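Step 1 — the weighted kNN graph with per-point ρ_i and σ_i and fuzzy-union symmetrization — can be sketched in numpy (a simplified illustration of the construction in arXiv:1802.03426, not the umap-learn implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 8))
k = 5

# Pairwise distances; exclude self-distances.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# rho_i: distance from x_i to its nearest neighbor.
rho = d.min(axis=1)

# sigma_i: calibrated per point (binary search) so that
# sum_j exp(-(d_ij - rho_i)/sigma_i) over the k nearest neighbors = log2(k).
target = np.log2(k)
knn = np.sort(d, axis=1)[:, :k]
sigma = np.ones(len(X))
for i in range(len(X)):
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = (lo + hi) / 2
        val = np.exp(-(knn[i] - rho[i]) / mid).sum()
        lo, hi = (mid, hi) if val < target else (lo, mid)
    sigma[i] = (lo + hi) / 2

# Directed edge weights v_{j|i} = exp(-(d_ij - rho_i)/sigma_i), kNN only.
V = np.zeros_like(d)
idx = np.argsort(d, axis=1)[:, :k]
for i in range(len(X)):
    V[i, idx[i]] = np.exp(-(d[i, idx[i]] - rho[i]) / sigma[i])

# Fuzzy-union symmetrization: v_ij = v_j|i + v_i|j - v_j|i * v_i|j
W = V + V.T - V * V.T
```

`W` is the high-dimensional similarity graph; the low-dimensional layout is then optimized against it with the cross-entropy cost.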

27 of 39

Similarity matrix at high dimension

  • ρ_i is the distance from x_i to its nearest neighbor in high dimension (computed on the kNN graph)
    • UMAP uses an exponential probability distribution in high dimension: v_{j|i} = exp(−(d(x_i, x_j) − ρ_i) / σ_i)
  • The high-dimension similarity scores are symmetrized (v_ij = v_{j|i} + v_{i|j} − v_{j|i} v_{i|j}), and σ_i is set by the chosen number of nearest neighbors k
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

arXiv:1802.03426

28 of 39

Similarity matrix at high dimension


  • For each data point, it has a locally adaptive exponential kernel, so the distance metric varies from point to point.

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

arXiv:1802.03426

https://www.youtube.com/watch?v=jth4kEvJ3P8


29 of 39

Cost Function

binary cross-entropy (CE)

  • Similarity matrix at low dimension: modelled with the family of curves q_ij = 1 / (1 + a‖y_i − y_j‖^(2b)), where a and b are hyperparameters
  • The gradient of the CE is used to find the optimal {y_i}
  • The cross-entropy cost function lets UMAP capture both the global and the local structure of the data

arXiv:1802.03426

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

30 of 39

UMAP examples

https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html

31 of 39

Single-cell RNA sequencing analysis

  • Introduction to single-cell analysis
  • Dimensionality reduction
    • Non-linear: t-SNE, UMAP
    • Linear: PCA
  • Cell clustering
    • K-means
    • hierarchical clustering
    • graph-based clustering

32 of 39

Cell clustering


  • Feature selection and dimensionality reduction extract the most informative genes and representative features from the background noise, respectively
  • Cell–cell distances are then calculated in the low-dimensional space to group cells into clusters

Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273–282 (2019). https://doi.org/10.1038/s41576-018-0088-9

33 of 39

Clustering methods for scRNA-seq


Clustering methods:

  • K-means
  • hierarchical clustering
  • graph-based clustering
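K-means (Lloyd's algorithm) can be sketched in numpy; this toy version uses a simple deterministic farthest-point initialization instead of random restarts or k-means++:

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Farthest-point initialization: start from X[0], then repeatedly pick
    # the point farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each cell to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned cells.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated toy "cell populations" in a 2D reduced space.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, _ = kmeans(X, 2)
```

In practice K-means is run on the reduced representation (e.g., the top PCs), not on the raw count matrix.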

Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273–282 (2019). https://doi.org/10.1038/s41576-018-0088-9

Tools for graph-based clustering:

  • Seurat: Louvain, Leiden, SLM
  • igraph: fast greedy, Louvain, optimal, walktrap, spinglass, infomap

34 of 39

Graph-based clustering

  • One of the most popular clustering approaches in scRNA-seq data analysis
    • Node → cell
    • Edge → similarity between two cells


  • Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell Rna-Sequencing Data.” F1000Research 7. Faculty of 1000 Ltd.
  • Wasserman, S. & Faust, K. (1994) Social Network Analysis (Cambridge Univ. Press, Cambridge, U.K.).
  • https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq- workshop/public/clustering-and-cell-annotation.html#ref-freytag2018comparison
  • Louvain community detection: Blondel et al. 2008
  • Leiden community detection: Traag, Waltman, and Eck 2019

35 of 39

Modularity

Modularity Q: a measure of the strength of a network's division into modules

Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j)

where A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and δ(c_i, c_j) = 1 when nodes i and j are in the same module.

(Figure: networks with low vs. high modularity.)

Brede, Europhysics Letters, 2010.

Newman, PNAS, 2006.

Clustering goal: assign each node to a module so as to maximize modularity as an objective function (a module is a group of densely connected nodes)
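Modularity can be computed directly from its definition; a numpy sketch on a toy graph of two triangles joined by a single edge:

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)            # node degrees
    two_m = A.sum()              # 2m: each undirected edge counted twice
    c = np.asarray(communities)
    delta = c[:, None] == c[None, :]
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Two triangles (nodes 0-2 and 3-5) joined by the bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

good = modularity(A, [0, 0, 0, 1, 1, 1])  # split at the bridge
bad = modularity(A, [0, 1, 0, 1, 0, 1])   # arbitrary split
print(round(good, 3))  # 0.357
```

The natural split at the bridge scores higher, which is exactly what modularity-maximizing algorithms such as Louvain search for.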

36 of 39

Louvain community detection

(Louvain clustering)


https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/clustering-and-cell-annotation.html#ref-freytag2018comparison https://www.youtube.com/watch?v=QNv7rKWCgM8

  • Start with every node in its own community (i.e., module/cluster)
  • Step 1: Modularity optimization
    • Order the nodes and, for each node i, move i to the community of the neighbor j that yields the maximum ∆Q
    • If all ∆Q < 0, then i remains in its current community
    • Repeatedly cycle through all nodes until ∆Q = 0
  • Step 2: Community aggregation
    • Create a weighted network of the communities from Step 1
    • Nodes: communities from Step 1
    • Edge weights: sums of the weights of edges between communities
    • Edges within a community become a self-loop on its community node
  • Repeat: apply Step 1 & Step 2 to the resulting network, and so on, until ∆Q = 0
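Step 1 (local moving) can be illustrated by brute force — recomputing the full Q for every candidate move instead of the incremental ∆Q formula the real algorithm uses for speed:

```python
import numpy as np

def modularity(A, c):
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = np.asarray(c)[:, None] == np.asarray(c)[None, :]
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

def local_move(A):
    """Step 1 of Louvain: every node starts in its own community; repeatedly
    move each node to the neighbor community that most increases Q, until
    no move improves Q. (Brute-force Q recomputation, for clarity only.)"""
    n = len(A)
    c = list(range(n))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            best_q, best_c = modularity(A, c), c[i]
            for j in np.nonzero(A[i])[0]:        # neighbor communities only
                trial = c.copy()
                trial[i] = c[j]
                q = modularity(A, trial)
                if q > best_q:
                    best_q, best_c = q, c[j]
            if best_c != c[i]:
                c[i] = best_c
                improved = True
    return c

# Same toy graph: two triangles joined by one bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
c = local_move(A)
```

On this graph the local-moving pass already recovers the two triangles as communities; Step 2 would then collapse each into a single weighted node.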

37 of 39

Examples of Louvain clustering


Belgian mobile phone network

  • 2M nodes
  • Red nodes: French speakers
  • Green nodes: Dutch speakers

Blondel et al. 2008

38 of 39

Examples of Louvain clustering


The Louvain algorithm clusters millions of cells with reasonable computational complexity.

Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 15:e8746

Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Zhou Y (2020) Construction of a human cell landscape at single-cell level. Nature 581:303–309. https://doi.org/10.1038/s41586-020-2157-4

Seow, J.J.W., Wong, R.M.M., Pai, R. et al. Single‐Cell RNA Sequencing for Precision Oncology: Current State-of-Art. J Indian Inst Sci 100, 579–588 (2020). https://doi.org/10.1007/s41745-020-00178-1

39 of 39

Adapted from slides by Mark Craven, Colin Dewey, Anthony Gitter, and Daifeng Wang