scRNA-seq analysis. Part 2.
Imputation, dimensionality reduction, clustering
Previous lectures…
Gene Counts
http://data-science-sequencing.github.io/Win2018/lectures/lecture16/
https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/SingleCell/slides/2018-07-25_CRUK_CI_summer_school-scRNAseq.pdf
Align cDNA reads to the genome and group the results by cell
Hundreds of millions of reads, thousands of cells
Count unique UMIs for each gene in each cell
Create digital expression matrix
Gene counts matrix
Gene | Cell 1 | Cell 2 | … | Cell N |
Gene 1 | 5 | 6 | … | 5 |
Gene 2 | 13 | 0 | … | 13 |
Gene 3 | 14 | 13 | … | 14 |
Gene 4 | 18 | 19 | … | 18 |
… | | | | |
Gene M | 10 | 10 | … | 10 |
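The counting step above (unique UMIs per gene per cell, collected into a digital expression matrix) can be sketched as follows. This is a minimal illustration, not a production pipeline: the helper `build_count_matrix`, the record layout `(cell barcode, gene, UMI)`, and all toy values are assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def build_count_matrix(records, genes, cells):
    """Count unique UMIs per (gene, cell) and return a genes x cells matrix."""
    umis = defaultdict(set)
    for cell, gene, umi in records:
        # A set collapses repeated reads of the same UMI (PCR duplicates).
        umis[(gene, cell)].add(umi)
    counts = np.zeros((len(genes), len(cells)), dtype=int)
    for (gene, cell), seen in umis.items():
        counts[genes.index(gene), cells.index(cell)] = len(seen)
    return counts


# Toy aligned reads: (cell barcode, gene, UMI). All values are made up.
records = [
    ("cellA", "Gene1", "AAC"),
    ("cellA", "Gene1", "AAC"),  # duplicate read of the same molecule
    ("cellA", "Gene1", "GTT"),
    ("cellB", "Gene1", "AAC"),
    ("cellB", "Gene2", "CCG"),
]
genes = ["Gene1", "Gene2"]
cells = ["cellA", "cellB"]
counts = build_count_matrix(records, genes, cells)
print(counts)
```

Gene2 in cellA gets a zero here simply because no read was observed, which is the kind of entry whose interpretation the next slides discuss.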
Interpretation of zeroes
Drop-outs in single cell
Kharchenko, P., Silberstein, L. & Scadden, D. Bayesian approach to single-cell differential expression analysis. Nat Methods 11, 740–742 (2014).
Why do dropouts occur in single cell?
Drop-outs in single cell
What should we do about dropouts?
Why do we need imputation methods?
Imputation Methods
MAGIC
http://jsb.ucla.edu/sites/default/files/scImpute.pdf
van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, Bierie B, Mazutis L, Wolf G, Krishnaswamy S, Pe'er D. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell. 2018 Jul 26;174(3):716-729.e27. doi: 10.1016/j.cell.2018.05.061. Epub 2018 Jun 28. PMID: 29961576; PMCID: PMC6771278.
Markov affinity-based graph imputation of cells
similar cells
(i) The input data consist of a cell-by-gene matrix.
(ii) Compute a cell-by-cell distance matrix.
(iii) Convert the distance matrix to an affinity matrix using a Gaussian kernel.
(iv) Normalize the affinities to obtain a Markov matrix; each row gives the transition probabilities from one cell to all other cells.
(v) To perform diffusion, raise the Markov matrix to a chosen power t.
(vi) Multiply the powered Markov matrix by the original data matrix to obtain a denoised and imputed data matrix.
scRNA-seq process
Lafzi et al. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies, Nature Protocols 2018 (https://doi.org/10.1038/s41596-018-0073-y)
Data analysis
scRNA-seq data analysis
scRNA-seq analysis
Cell-level analysis
Gene-level analysis
Python for scRNA-seq data analysis
Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)
Single-cell RNA sequencing analysis
Dimensionality Reduction
https://doi.org/10.3389/fgene.2021.646936 https://www.biorxiv.org/content/10.1101/241646v1.full
PCA
Principal component analysis
https://github.com/NBISweden/excelerate-scRNAseq
Figure axes: 1st PC, 2nd PC
Reduce D genes to d principal components (PCs) per cell, where d ≪ D
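A minimal sketch of this reduction using the singular value decomposition. The function name `pca`, the simulated Poisson counts, and the `log1p` transform are assumptions made for illustration; real pipelines usually normalize and select highly variable genes first.

```python
import numpy as np


def pca(data, d):
    """Project a cells x genes matrix onto its top-d principal components."""
    centered = data - data.mean(axis=0)  # center every gene
    # Rows of vt are the principal axes, ordered by decreasing variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:d].T  # cells x d coordinates
    explained = (s ** 2) / (s ** 2).sum()  # fraction of variance per PC
    return scores, explained[:d]


rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(50, 200)).astype(float)  # 50 cells, D = 200 genes
scores, explained = pca(np.log1p(counts), d=2)
print(scores.shape)
```

Each cell is now described by d = 2 numbers instead of D = 200, and the two score columns are uncorrelated by construction.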
PCA of Peripheral Blood Mononuclear Cells (PBMC)
Limitation of PCA
High dimensions Low dimensions
t-SNE
t-distributed Stochastic Neighbor Embedding
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR 2008
https://nbviewer.org/github/YeoLab/single-cell-bioinformatics/tree/master/
https://mit6874.github.io/assets/sp2020/slides/L11_PCA_tSNE_Autoencoders.pdf
t-SNE
We want to learn a low-dimensional embedding $Y = \{y_1, \dots, y_n\} \subset \mathbb{R}^{d'}$ of the high-dimensional points $x_1, \dots, x_n \in \mathbb{R}^{d}$, where $d \gg d'$.
$p_{ij}$ and $q_{ij}$ measure the conditional probability that a point $j$ would pick point $i$ as its nearest neighbor, in high-dimensional ($p$) and low-dimensional ($q$) space, respectively.
High dimension: similarity between $x_i$ and $x_j$ = $p_{ij}$
Low dimension: similarity between $y_i$ and $y_j$ = $q_{ij}$
Similarity matrix at high dimension
The high-dimensional similarity $p_{ij}$ between $x_i$ and $x_j$ is computed with a Gaussian kernel, where $\tau^2$ is the variance of the Gaussian distribution centered on $x_i$.
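A sketch of the high-dimensional similarities, assuming the standard definition from van der Maaten and Hinton (2008): $p_{j|i} \propto \exp(-\lVert x_i - x_j \rVert^2 / 2\tau^2)$, normalized over $j \ne i$ and then symmetrized as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$. The single shared `tau` is a simplifying assumption; t-SNE itself tunes one variance per point from the perplexity. The function name and toy points are hypothetical.

```python
import numpy as np


def high_dim_similarities(x, tau=1.0):
    """Symmetric t-SNE-style similarities p_ij from a Gaussian kernel."""
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist / (2.0 * tau ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point never picks itself
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p_cond = np.exp(logits)
    p_cond /= p_cond.sum(axis=1, keepdims=True)  # row i: neighbor-picking probabilities of point i
    n = x.shape[0]
    return (p_cond + p_cond.T) / (2.0 * n)  # symmetrize so that all p_ij sum to 1


x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
p = high_dim_similarities(x, tau=1.0)
print(np.round(p, 4))
```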
Similarity matrix at low dimension
Low dimension: similarity between $y_i$ and $y_j$ = $q_{ij}$
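A sketch of the low-dimensional similarities, assuming the Student-t kernel with one degree of freedom from the cited t-SNE paper: $q_{ij} \propto (1 + \lVert y_i - y_j \rVert^2)^{-1}$, normalized over all pairs $i \ne j$. The function name and the toy embedding are hypothetical.

```python
import numpy as np


def low_dim_similarities(y):
    """t-SNE-style similarities q_ij from a heavy-tailed Student-t kernel."""
    sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    kernel = 1.0 / (1.0 + sq_dist)  # Student-t with one degree of freedom
    np.fill_diagonal(kernel, 0.0)  # exclude self-similarity
    return kernel / kernel.sum()  # normalize over all pairs


y = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
q = low_dim_similarities(y)
print(np.round(q, 4))
```

The heavy tail lets moderately distant points sit far apart in the embedding without a large penalty, which helps against crowding.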
Cost Function
Kullback-Leibler divergence between the similarity distribution $P$ in the high-dimensional space and $Q$ in the low-dimensional space.
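A sketch of this cost, assuming the usual form $KL(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log (p_{ij} / q_{ij})$. The helper name `kl_divergence`, the clipping constant `eps`, and the toy matrices are assumptions made for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over all pairs; entries with p_ij = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


# Toy similarity matrices (each sums to 1 with a zero diagonal).
p = np.array([
    [0.0, 0.3, 0.2],
    [0.3, 0.0, 0.0],
    [0.2, 0.0, 0.0],
])
q_uniform = (np.ones((3, 3)) - np.eye(3)) / 6.0  # embedding that ignores the structure

print(kl_divergence(p, p))          # identical distributions
print(kl_divergence(p, q_uniform))  # mismatch is penalized
```

t-SNE moves the points $y_i$ by gradient descent to make this value as small as possible.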
Summary of t-SNE
Main steps for t-SNE
Limitation of t-SNE
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
UMAP
Uniform Manifold Approximation and Projection
Main steps for UMAP
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR 2008
https://pair-code.github.io/understanding-umap/supplement.html
are connected (i.e., “similarity”)
Similarity matrix at high dimension
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
arXiv:1802.03426
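A sketch of the high-dimensional graph weights, assuming the definitions in the UMAP paper (arXiv:1802.03426): the directed weight from $i$ to $j$ is $\exp(-\max(0, d_{ij} - \rho_i) / \sigma_i)$, where $\rho_i$ is the distance from $i$ to its nearest neighbor, and the weights are symmetrized as $a + b - ab$. Two simplifications are assumptions of this example: a single fixed `sigma` replaces the per-point search, and all points are used instead of only the k nearest neighbors. The function name is hypothetical.

```python
import numpy as np


def umap_high_dim_graph(x, sigma=1.0):
    """Symmetric UMAP-style edge weights between all pairs of points."""
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # no self-edges
    rho = dist.min(axis=1, keepdims=True)  # distance to the nearest neighbor
    # Directed weights: the nearest neighbor of every point gets weight 1.
    directed = np.exp(-np.maximum(dist - rho, 0.0) / sigma)
    # Probabilistic union of the two directed edges between i and j.
    return directed + directed.T - directed * directed.T


x = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [10.0, 10.0]])
graph = umap_high_dim_graph(x, sigma=1.0)
print(np.round(graph, 3))
```

Subtracting $\rho_i$ guarantees that every point is connected to at least one neighbor with full weight, even an isolated one.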
Graph
Cost Function
Binary cross-entropy (CE)
UMAP uses the family of curves $1/(1 + a\,y^{2b})$ for modelling distance probabilities in low dimensions; $a$ and $b$ are hyperparameters.
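A sketch of the low-dimensional similarity and the cross-entropy cost. The values `a=1.0` and `b=1.0` are placeholder assumptions (UMAP derives them from its `min_dist` setting), and the function names and toy graph are hypothetical. The cost is written as the plain binary cross-entropy, which differs from the form in the UMAP paper only by a term that does not depend on the embedding.

```python
import numpy as np


def umap_low_dim_similarities(y, a=1.0, b=1.0):
    """q_ij = 1 / (1 + a * ||y_i - y_j||^(2b))."""
    sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + a * sq_dist ** b)


def binary_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between edge probabilities p (graph) and q (embedding)."""
    off_diag = ~np.eye(p.shape[0], dtype=bool)  # ignore self-pairs
    p_flat = p[off_diag]
    q_flat = np.clip(q[off_diag], eps, 1.0 - eps)
    return float(-(p_flat * np.log(q_flat) + (1.0 - p_flat) * np.log(1.0 - q_flat)).sum())


# Toy graph: points 0 and 1 are connected, point 2 is unconnected.
p = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
y_good = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # keeps 0 and 1 together
y_bad = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0]])   # separates 0 and 1

ce_good = binary_cross_entropy(p, umap_low_dim_similarities(y_good))
ce_bad = binary_cross_entropy(p, umap_low_dim_similarities(y_bad))
print(ce_good < ce_bad)
```

Unlike the KL divergence, the second term also penalizes placing unconnected points close together.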
arXiv:1802.03426
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
UMAP examples
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html
Single-cell RNA sequencing analysis
Cell clustering
Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273–282 (2019). https://doi.org/10.1038/s41576-018-0088-9
Clustering methods for scRNA-seq
Clustering methods:
Tools for graph-based clustering:
Graph-based clustering
Community detection (Blondel et al. 2008)
Modularity
Modularity Q: a measure of the strength of a network's division into modules
Figure: example networks with low and high modularity
Brede, Europhysics Letters, 2010.
Newman, PNAS, 2006.
Clustering goal: assign each node to a module so as to maximize modularity as the objective function (a module is a group of highly connected nodes).
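A sketch of the modularity computation, assuming Newman's definition $Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$ for an undirected graph with adjacency matrix $A$, degrees $k_i$, and $m$ edges. The function names and the toy graph (two triangles joined by one edge) are hypothetical.

```python
import numpy as np


def modularity(adjacency, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    two_m = adjacency.sum()  # every edge is counted twice
    degrees = adjacency.sum(axis=1)
    same_module = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m  # edges expected by chance
    return float(((adjacency - expected) * same_module).sum() / two_m)


def two_triangles():
    """Toy graph: nodes 0-2 and 3-5 form triangles, joined by the edge 2-3."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    adjacency = np.zeros((6, 6))
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = 1.0
    return adjacency


adjacency = two_triangles()
q_natural = modularity(adjacency, np.array([0, 0, 0, 1, 1, 1]))  # one module per triangle
q_single = modularity(adjacency, np.array([0, 0, 0, 0, 0, 0]))   # everything in one module
q_mixed = modularity(adjacency, np.array([0, 1, 0, 1, 0, 1]))    # modules cut across triangles
print(q_natural, q_single, q_mixed)
```

The partition that follows the triangles scores highest, a single module scores zero, and a partition that cuts across the triangles scores below zero.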
Louvain community detection
(Louvain clustering)
https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/clustering-and-cell-annotation.html#ref-freytag2018comparison
https://www.youtube.com/watch?v=QNv7rKWCgM8
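A sketch of the local-moving phase of the Louvain method, under stated assumptions: every node starts in its own module and is repeatedly moved to the neighboring module that gives the largest gain in modularity. For clarity the example recomputes modularity from scratch for every candidate move, which is far slower than the incremental update used by real implementations, and it omits the second phase (collapsing modules into super-nodes and repeating). All function names and the toy graph are hypothetical.

```python
import numpy as np


def modularity(adjacency, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    two_m = adjacency.sum()
    degrees = adjacency.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return float(((adjacency - np.outer(degrees, degrees) / two_m) * same).sum() / two_m)


def louvain_local_moving(adjacency, tol=1e-12):
    """Greedy local moving: phase 1 of Louvain, brute-force version."""
    n = adjacency.shape[0]
    labels = np.arange(n)  # every node starts in its own module
    improved = True
    while improved:
        improved = False
        for node in range(n):
            best_label = labels[node]
            best_q = modularity(adjacency, labels)
            for neighbor in np.flatnonzero(adjacency[node]):
                trial = labels.copy()
                trial[node] = labels[neighbor]  # try joining the neighbor's module
                q = modularity(adjacency, trial)
                if q > best_q + tol:  # accept only strict improvements
                    best_label, best_q = labels[neighbor], q
            if best_label != labels[node]:
                labels[node] = best_label
                improved = True
    return labels


# Toy graph: two triangles (0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

labels = louvain_local_moving(adjacency)
print(labels, modularity(adjacency, labels))
```

In scRNA-seq the same idea is applied to a cell-cell neighbor graph instead of this toy graph.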
Examples of Louvain clustering
Belgian mobile phone network
French speakers
Dutch speakers
Blondel et al. 2008
Examples of Louvain clustering
The Louvain algorithm can cluster millions of cells at a reasonable computational cost.
Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 15:e8746
Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Zhou Y (2020) Construction of a human cell landscape at single-cell level. Nature 581:303–309. https://doi.org/10.1038/s41586-020-2157-4
Seow, J.J.W., Wong, R.M.M., Pai, R. et al. Single‐Cell RNA Sequencing for Precision Oncology: Current State-of-Art. J Indian Inst Sci 100, 579–588 (2020). https://doi.org/10.1007/s41745-020-00178-1
Slides adapted from Mark Craven, Colin Dewey, Anthony Gitter, and Daifeng Wang