1 of 19

VIRTUAL COURSE
Single cell RNA-seq analysis using Python

Anna Vathrakokoili Pournara

February 2025

Feature Selection, Dimensionality Reduction, Clustering and Annotation

2 of 19

A bit about myself

I did my undergrad in Biology (Patras, Greece)

Master’s in Molecular Biomedicine (NKUA, Athens)

PhD at Papatheodorou Group (EMBL-EBI)

Working on cell-type deconvolution of bulk cancer samples

Bulk RNA-seq and single-cell analysis (R and python)

Postdoc at Sanger (Haniffa lab) – studying skin diseases using single-cell genomics

gene expression

3 of 19

Previously…

From raw sequencing files... to count matrix

From QC of count matrix… to Normalization

Low-quality cells (QC)

Ambient RNA (SoupX)

Doublet detection

Normalization

4 of 19

Coming up…

Today’s Lecture Outline
Feature Selection
- Highly variable genes: dispersion-based approaches
- Highly variable genes: Seurat v3 (variance-stabilisation)
Dimensionality Reduction
- PCA
- t-SNE
- UMAP
Clustering
- Graph-based clustering (Louvain, Leiden)
- Hierarchical clustering
Cell-type annotation
- From differentially expressed genes to cluster annotation
- From markers to cluster annotation
- New generation tools: automated annotation

5 of 19

Feature selection in single-cell analysis

~30,000 genes --> ~500-2,000 selected genes --> PCA --> PCs --> clustering

6 of 19

Feature selection methods implemented in scanpy

1. Dispersion-based: reproduces the R implementations of Seurat [Satija et al.] and Cell Ranger [Zheng et al.].

flavor="seurat" or flavor="cell_ranger"

2. Variance-based: Seurat v3 [Stuart et al.].

flavor="seurat_v3"

scanpy.pp.highly_variable_genes()

Dispersion-based approach (genes are grouped into bins by mean expression; bins shown in the figure: mean expression = 0.5, 10, 100, 200):

a) Calculate the dispersion of each gene in each bin

b) Calculate the mean and the standard deviation of the dispersions in each bin

c) Normalise the dispersion of each gene using the mean and the standard deviation from b

d) Genes within each bin are ranked based on their normalised dispersion values --> highly variable genes

Variance-based approach (Seurat v3):

a) Expects raw counts (not normalised or log-transformed)

b) A variance-stabilising transformation is applied to the raw data

c) Highly variable genes are selected based on the variance of the standardised values (the mean-variance relationship is taken into account)

d) 2,000 highly variable genes are selected

When we talk about features we mean genes

First, genes are grouped into bins according to their mean expression. Then, for each gene, a "dispersion" is calculated (the variance of the gene's expression across cells divided by its mean), which tells us how variable the gene is relative to its expression level.
To make dispersions comparable across all the bins, two things are computed within each bin:

The average (mean) of the dispersions of all the genes in the bin.
The spread (standard deviation) of those dispersions.

Finally, each gene's dispersion is normalised using the mean and standard deviation of its bin, and the genes with the highest normalised dispersion are selected as highly variable. In other words, the selected genes are those that vary much more across cells than other genes with a similar level of expression.
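The binning and normalisation described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name `normalized_dispersion`, the toy expression values and the use of equally sized bins are made up for this example, and scanpy's own implementation differs in details (for instance in how the bins are defined).

```python
import math
import statistics

def normalized_dispersion(expr, n_bins=2):
    """expr: dict gene -> expression values across cells (hypothetical toy input).
    Returns dict gene -> dispersion z-scored within its mean-expression bin."""
    means = {g: statistics.fmean(v) for g, v in expr.items()}
    # Dispersion of a gene = variance across cells / mean across cells.
    disp = {g: statistics.variance(v) / means[g]
            for g, v in expr.items() if means[g] > 0}
    # Sort genes by mean expression and cut them into equally sized bins
    # (a simplification chosen for this sketch).
    ordered = sorted(disp, key=lambda g: means[g])
    size = math.ceil(len(ordered) / n_bins)
    result = {}
    for start in range(0, len(ordered), size):
        genes = ordered[start:start + size]
        values = [disp[g] for g in genes]
        mu = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        for g in genes:
            result[g] = (disp[g] - mu) / sd if sd > 0 else 0.0
    return result

# Invented toy data: three lowly and three highly expressed genes, six cells.
toy_expr = {
    "low_flat": [0, 1, 0, 1, 0, 1],
    "low_b": [1, 0, 1, 0, 1, 0],
    "low_var": [0, 0, 0, 3, 0, 0],
    "high_flat": [10, 10, 11, 9, 10, 10],
    "high_b": [9, 11, 10, 10, 9, 11],
    "high_var": [2, 18, 10, 1, 19, 10],
}
z = normalized_dispersion(toy_expr, n_bins=2)
top_genes = sorted(z, key=z.get, reverse=True)[:2]
```

In this toy example one gene per bin stands out, so both a lowly and a highly expressed gene can be picked. That is the point of normalising within bins rather than ranking raw dispersions.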

Initially, people tried to select variable genes just by looking at how much their activity levels varied between cells after doing a mathematical transformation called log normalization. However, this approach didn't consider the relationship between the average (mean) level of a gene's activity and its variability, which is crucial in single-cell RNA sequencing data.

This whole process helps to identify genes that are variable across cells while accounting for the relationship between the gene's average activity and its variability, making the selection more accurate and informative for downstream analysis.

The goal is to find genes that vary significantly across cells in a dataset. Simply ranking genes by their log-normalized variance is insufficient because it does not consider the relationship between a gene's average activity and its variability. To address this, Seurat v3 applies a variance-stabilizing transformation to the raw data. The mean and variance of each gene are calculated on the untransformed counts, and a curve is fitted to describe how the variance relates to the mean. This curve gives the variance expected for a gene at a given mean, which is used to standardize each gene's counts so that genes become comparable. Highly variable genes are then selected based on the variance of the standardized values, resulting in a subset of 2,000 genes for further analysis. This method better captures meaningful variability while accounting for the mean-variance relationship.
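A rough sketch of this idea in plain Python is given below. It is an illustration, not the Seurat or scanpy code: the function name `standardized_variance` and the toy counts are invented, and a straight line fitted in log10 space stands in for the smooth local regression curve used by the real method. Standardized values are clipped at the square root of the number of cells, as in the Seurat v3 description.

```python
import math
import statistics

def standardized_variance(counts):
    """counts: dict gene -> raw counts across cells (hypothetical toy input).
    Returns dict gene -> variance of the standardized counts."""
    n_cells = len(next(iter(counts.values())))
    mean = {g: statistics.fmean(v) for g, v in counts.items()}
    var = {g: statistics.variance(v) for g, v in counts.items()}
    genes = [g for g in counts if var[g] > 0]  # constant genes carry no signal
    x = [math.log10(mean[g]) for g in genes]
    y = [math.log10(var[g]) for g in genes]
    # Straight-line least-squares fit of log10(variance) on log10(mean).
    # Assumption of this sketch: a line replaces the smooth fitted curve.
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    clip = math.sqrt(n_cells)
    scores = {}
    for g in genes:
        expected_sd = math.sqrt(10 ** (intercept + slope * math.log10(mean[g])))
        # Standardize with the expected (not the observed) standard deviation.
        z = [min((c - mean[g]) / expected_sd, clip) for c in counts[g]]
        scores[g] = sum(v * v for v in z) / (n_cells - 1)
    return scores

# Invented toy counts: four genes follow the trend, "hv" is overdispersed.
toy_counts = {
    "g1": [0, 1, 0, 1, 0, 1],
    "g2": [1, 3, 1, 3, 1, 3],
    "g3": [6, 10, 6, 10, 6, 10],
    "g4": [28, 36, 28, 36, 28, 36],
    "hv": [0, 0, 0, 12, 0, 0],
}
hvg_scores = standardized_variance(toy_counts)
most_variable = max(hvg_scores, key=hvg_scores.get)
```

Genes whose variance is about what the trend predicts get a score near or below 1, while a gene that is much more variable than expected for its mean gets a clearly higher score.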

7 of 19

Dimensionality reduction

Curse of Dimensionality:

refers to the challenge of dealing with high-dimensional data, such as scRNA-seq data with many cells and genes.
While high-dimensional data theoretically contains more information, in practice, it often includes more noise and redundancy, making additional dimensions less beneficial for downstream analysis.

8 of 19

PCA

Each cell in a single-cell dataset is represented as a point in a high-dimensional space with many features (genes).

PCs are combinations of the original features and are chosen to capture the most variation in the data.
These PCs are ordered by how much variation they capture, with the first PC having the most variance.

PCA helps reduce the dimensionality of the data by keeping the top PCs that capture the most important information.
This reduction is useful because it makes the data easier to work with and visualize.
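To make the idea of a principal component concrete, here is a minimal sketch that finds the first PC of a tiny made-up dataset with power iteration on the covariance matrix. The function name `first_pc` and the data are hypothetical. In a real analysis you would call a library routine (for example `scanpy.pp.pca`) instead of writing this by hand.

```python
import math

def first_pc(data, n_iter=200):
    """data: list of cells, each a list of feature (gene) values.
    Returns (unit vector of the first PC, centered data)."""
    n, d = len(data), len(data[0])
    col_mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - col_mean[j] for j in range(d)] for row in data]
    # Covariance matrix between features (genes x genes).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting guess
    for _ in range(n_iter):
        # Repeatedly multiplying by the covariance matrix pulls v towards
        # the direction of largest variance.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered

# Invented toy data: gene 2 is always twice gene 1, gene 3 is constant.
cells = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]
pc1, centered = first_pc(cells)
# Coordinate of each cell along PC1 (its "score").
pc1_scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
```

Here all the variation lies along one direction, so a single PC describes the three genes without losing information, and the constant gene contributes nothing to it.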

[Figure: cells plotted along the first two principal components, PC1 and PC2]

9 of 19

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a non-linear technique used to reduce the dimensionality of high-dimensional data.
It maps data points to a lower-dimensional space by creating probability distributions based on distances and optimizing embeddings.
t-SNE can reveal clusters in the data and preserves local relationships, making it valuable for visualizing complex datasets and for pattern discovery.
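The "probability distributions based on distances" can be illustrated with a small sketch: Gaussian similarities in the original space, heavy-tailed Student-t similarities in the low-dimensional map, and the Kullback-Leibler divergence that t-SNE tries to minimize. All function names and data points are invented for this example, and one fixed bandwidth `sigma` is assumed for every point, whereas real t-SNE tunes it per point through the perplexity setting. The optimization itself is not shown.

```python
import math

def conditional_p(points, sigma=1.0):
    """p[i][j]: probability that point i picks j as neighbour (Gaussian kernel)."""
    n = len(points)
    p = []
    for i in range(n):
        w = [0.0 if i == j
             else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        p.append([x / total for x in w])
    return p

def joint_p(points, sigma=1.0):
    """Symmetrized similarities in the original high-dimensional space."""
    cond = conditional_p(points, sigma)
    n = len(points)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def joint_q(embedding):
    """Student-t similarities in the low-dimensional map."""
    n = len(embedding)
    w = [[0.0 if i == j
          else 1.0 / (1.0 + math.dist(embedding[i], embedding[j]) ** 2)
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[x / total for x in row] for row in w]

def kl_divergence(p, q):
    """Cost of a map: small when the map's similarities q match p."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

# Invented toy data: two pairs of nearby points in 3D.
points = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]
good_map = [(0, 0), (1, 0), (10, 0), (11, 0)]  # keeps neighbours together
bad_map = [(0, 0), (10, 0), (1, 0), (11, 0)]   # splits neighbours apart
p = joint_p(points)
cost_good = kl_divergence(p, joint_q(good_map))
cost_bad = kl_divergence(p, joint_q(bad_map))
```

The map that keeps neighbouring points together has the lower cost, which is why t-SNE embeddings preserve local relationships.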

10 of 19

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a graph-based, non-linear dimensionality reduction technique.
It constructs a high-dimensional graph representation of the dataset and optimizes the low-dimensional graph representation to be structurally as similar as possible to the original graph.
Before running UMAP, first compute a PCA and then a neighbourhood graph on the data.
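In scanpy this order of steps might look like the sketch below. The wrapper function `embed_umap` and its default values are assumptions made for illustration, and `adata` is assumed to be an AnnData object that has already been normalised and restricted to highly variable genes. The number of PCs must be smaller than the number of cells and genes.

```python
def embed_umap(adata, n_pcs=30, n_neighbors=15):
    """Hypothetical helper: PCA, then neighbourhood graph, then UMAP."""
    # Imported inside the function so this sketch can be read and loaded
    # without scanpy installed.
    import scanpy as sc

    sc.pp.pca(adata, n_comps=n_pcs)                                # 1. PCA
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)   # 2. KNN graph on the PCs
    sc.tl.umap(adata)                                              # 3. UMAP from the graph
    return adata.obsm["X_umap"]  # 2D coordinates, one row per cell
```

The same neighbourhood graph is reused later for graph-based clustering, so it only needs to be computed once.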

https://blog.bioturing.com/2022/01/14/umap-vs-t-sne-single-cell-rna-seq-data-visualization/#:~:text=Thanks%20to%20the%20solution%20in,it%20took%20t%2DSNE%2045!

11 of 19

Clustering

Leiden Algorithm: We utilize the Leiden algorithm on a k-nearest-neighbors (KNN) graph constructed from the reduced expression space, often obtained through principal component analysis (PCA). Leiden identifies clusters by considering the density of connections between cells and comparing it to the expected density.

Resolution Parameter: The Leiden algorithm offers a resolution parameter, allowing users to control the granularity of clustering. Higher values result in more clusters, while lower values yield coarser groupings.

The goal in single-cell RNA sequencing (scRNA-seq) analysis is to uncover cellular structures and identify cell identities within the dataset.

The Leiden algorithm is a community detection algorithm used in network analysis, particularly for identifying clusters or communities in complex networks. It was developed as an improvement over the Louvain method, another widely-used community detection algorithm. The primary goal of the Leiden algorithm is to partition the nodes of a network into communities such that nodes within the same community are more densely connected to each other than to nodes in other communities.

Initialization: Initially, each node in the network is treated as its own community.
Modularity Optimization: The algorithm iteratively improves a quality function of the current community structure. Modularity is one such measure: it quantifies how densely connected nodes are within their communities compared to what would be expected by random chance.
Local Moving of Nodes: In each iteration, the algorithm evaluates the potential gain in modularity from moving a node out of its current community and into a neighbouring community, and makes the move if modularity increases.
Local Optimization: The Leiden algorithm uses local optimization heuristics to efficiently identify moves that improve modularity. It avoids exhaustive searches, which would be computationally expensive for large networks.
Refinement and Aggregation: The communities found by local moving are refined (a community may be split into well-connected sub-communities), and the refined communities are then aggregated into super-nodes. The procedure is repeated on this smaller, aggregated network, which speeds up the optimization.
Stopping Criterion: The Leiden algorithm continues iterating until no further improvements in modularity can be achieved or until a predefined stopping criterion is met.
Final Community Structure: Once the algorithm converges, it outputs the final community structure, where nodes are grouped into communities with a high modularity score.
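To see what is being optimized, the sketch below computes modularity with a resolution parameter for a given partition of a small graph. It only scores partitions and does not implement the Leiden algorithm itself. The function name `modularity`, the toy graph and the community labels are invented for this example.

```python
def modularity(edges, community, resolution=1.0):
    """edges: list of undirected (u, v) pairs; community: dict node -> label.
    Q = sum over communities of [internal_edges / m - resolution * (degree_sum / 2m)^2]."""
    m = len(edges)
    internal = {}    # number of edges with both ends inside a community
    degree_sum = {}  # total degree of the nodes in a community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Toy graph: two triangles (a, b, c) and (d, e, f) joined by one edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)    # two communities
q_merged = modularity(edges, merged)  # everything in one community
# With a low resolution the single large community scores higher.
q_split_low = modularity(edges, split, resolution=0.1)
q_merged_low = modularity(edges, merged, resolution=0.1)
```

At the default resolution the two-triangle split scores higher than one big community, while at a low resolution the merged partition wins. This is the sense in which lower resolution values give coarser clusters.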

The Leiden algorithm is known for its ability to efficiently detect communities in large networks and for its robustness in handling various network types, including social networks, biological networks, and more. It has become a valuable tool in various fields, including biology, sociology, and data science, where understanding the structure of complex networks is crucial for gaining insights into underlying relationships and patterns.

Does Leiden use a KNN graph? How does that work?

Yes. In single-cell analysis, the Leiden algorithm is run on a k-nearest neighbors (KNN) graph of the cells. Leiden itself is a community detection algorithm that works on any graph; the KNN graph is how the expression data is turned into a graph, with one node per cell and edges that capture local similarity. Here's how it works:

Constructing the KNN Graph:

For each cell, find its K nearest neighbors based on some distance metric (e.g., Euclidean distance in the PCA space). These neighbors are the cells that are most similar to the given cell.
Connect each cell to its K nearest neighbors. This results in a graph where cells are connected to their most similar neighbors, capturing the local structure of the data.
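The construction can be sketched with a brute-force search in plain Python. The function name `knn_graph` and the 2D coordinates are made up for illustration. Real tools use the top principal components as coordinates and faster approximate neighbour searches.

```python
import math

def knn_graph(points, k):
    """Return the set of undirected edges linking each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        # Distance from point i to every other point, closest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return edges

# Invented coordinates: two well-separated groups of three "cells".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
```

With k=2 every cell is linked only to cells of its own group, so the graph already separates into two densely connected parts that a community detection algorithm can pick up.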

Leiden Algorithm on the KNN Graph:

The Leiden algorithm is then applied to the KNN graph. Because each cell is connected only to its nearest neighbors, the graph is sparse and focuses on local neighborhood relationships, which keeps community detection tractable.

Community Detection:

The Leiden algorithm optimizes the modularity of the community structure on the KNN graph, as described above. It groups cells into communities based on their connectivity patterns in the KNN graph.

Refinement:

Within each iteration, the Leiden algorithm refines the communities it has found before aggregating them. This refinement step ensures that the communities are well connected internally.

The use of a KNN graph helps in identifying local structure and communities among the cells. It is especially useful when the dataset is large, as the sparse graph reduces the computational cost of community detection while capturing important local relationships. It can also help uncover smaller-scale structures that might be missed when comparing all cells to each other at once.

13 of 19

Cell type annotation

Definition of cell-type:

- Cell types are robust cellular phenotypes identifiable based on the expression of specific markers (e.g., proteins or gene transcripts).

- They are often linked to specific functions and remain consistent across datasets.

Challenges:

- Cell categorisation is subjective and may change over time due to technological advancements or discoveries of sub-phenotypes.

- Cell types can be further classified into subtypes or cell states, and the term "cell identity" is sometimes used to avoid arbitrary distinctions.

Continuum and Differentiation:

- Cell types may exist along a continuum, where cells transition or differentiate into one another.

- Differentiation coordinates can provide a more accurate description of cell states, especially in processes like haematopoiesis.

14 of 19

Cell-type annotation methods

Rely on transcriptomic similarity between cells.

Types of cell-type annotation:

Manual annotation:

From known markers to cluster annotation
From differentially expressed genes to cluster annotation

Automated annotation

Marker gene-based classifiers
Classifiers based on a wider set of genes.
Annotation by mapping to a reference.

15 of 19

Manual annotation

From known markers to cluster annotation

Literature based annotation
Transcriptome-based might work better than protein expression based
Good quality markers : validated in multiple datasets
Good knowledge of the biology of a tissue/cell-type and the functions involved

Literature --> annotate clusters
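A minimal sketch of marker-based annotation: score each cluster by the average expression of each cell type's marker genes and assign the best-scoring label. The function name `annotate_clusters` and the expression values are invented for illustration. The genes shown are commonly used PBMC markers, but any real marker list should come from the literature for your tissue.

```python
def annotate_clusters(cluster_means, markers):
    """cluster_means: dict cluster -> {gene: mean expression in that cluster}.
    markers: dict cell type -> list of marker genes.
    Returns dict cluster -> best-matching cell type."""
    labels = {}
    for cluster, expr in cluster_means.items():
        # Average expression of each cell type's markers in this cluster.
        scores = {cell_type: sum(expr.get(g, 0.0) for g in genes) / len(genes)
                  for cell_type, genes in markers.items()}
        labels[cluster] = max(scores, key=scores.get)
    return labels

markers = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["CD14", "LYZ"],
}
# Invented mean expression values per cluster (toy numbers).
cluster_means = {
    "0": {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.0, "CD79A": 0.1, "CD14": 0.0, "LYZ": 0.3},
    "1": {"CD3D": 0.1, "CD3E": 0.0, "MS4A1": 1.9, "CD79A": 2.4, "CD14": 0.0, "LYZ": 0.2},
    "2": {"CD3D": 0.0, "CD3E": 0.1, "MS4A1": 0.0, "CD79A": 0.0, "CD14": 1.7, "LYZ": 3.5},
}
labels = annotate_clusters(cluster_means, markers)
```

In practice the scores are inspected by eye (for example with dot plots) rather than trusted blindly, because a cluster may express no marker set convincingly.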

16 of 19

Manual annotation

From differentially expressed (DE) genes to cluster annotation

Most popular DE tests implemented in scanpy: t-test, Wilcoxon

Wilcoxon rank-sum test: calculate the U statistic, which measures how well two groups of cells separate based on the expression of a given gene, then use hypothesis testing to decide whether the gene is considered a marker (DE).

Differential expression (DE) analysis --> find marker genes per cluster --> annotate clusters (using literature, available datasets and studies)

The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a non-parametric statistical test used to compare two independent samples and determine whether one sample tends to have larger values than the other. It is an alternative to the t-test: the t-test assumes that the data follow a normal distribution, while the Wilcoxon rank-sum test does not.

Pool the expression values of the gene from both groups of cells and rank them together (tied values receive the average rank). Then sum the ranks within each group.
Calculate the Wilcoxon rank-sum statistic (U): U is computed from the rank sum of one group and measures how well the two groups separate based on the expression of the gene. If U is significantly different from what would be expected by chance, it suggests that the gene is differentially expressed between the two groups.
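The ranking and the U statistic can be written out by hand as below. The function name `rank_sum_u` and the expression values are invented for illustration, and the sketch stops at U without computing a p-value. In scanpy the test is available through `scanpy.tl.rank_genes_groups` with `method="wilcoxon"`.

```python
def rank_sum_u(group_a, group_b):
    """U statistic for group_a: the number of (a, b) pairs with a > b,
    counting ties as one half."""
    pooled = sorted(group_a + group_b)
    # Assign ranks 1..n to the pooled values; tied values share the average rank.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    rank_sum_a = sum(rank_of[x] for x in group_a)
    n_a = len(group_a)
    return rank_sum_a - n_a * (n_a + 1) / 2

# Invented expression of one gene in a cluster versus the remaining cells.
in_cluster = [5, 6, 7]
other_cells = [1, 2, 3]
u_high = rank_sum_u(in_cluster, other_cells)  # every cluster cell outranks every other cell
u_low = rank_sum_u(other_cells, in_cluster)   # no cell in this group outranks the other group
u_ties = rank_sum_u([1, 2], [2, 3])           # one tied pair counts as 0.5
```

U ranges from 0 to n_a * n_b. Values near either end mean the two groups barely overlap, which is what a good marker gene looks like.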

17 of 19

Automated annotation

Marker-gene Database-based:

scAssign
scCATCH

Correlation-based (query-reference):

SingleR
scMatch (Python)

Supervised classification-based:

Moana
Garnett

Others: scANVI
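The query-reference idea behind correlation-based tools can be sketched as follows: each query cluster receives the label of the reference profile it correlates with best. This is a toy illustration and not the algorithm of SingleR or scMatch. The function names, the use of Pearson correlation, the gene order and all expression values are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def annotate_by_correlation(query, reference):
    """query: dict cluster -> profile; reference: dict cell type -> profile.
    Profiles must list the same genes in the same order."""
    return {cluster: max(reference, key=lambda ct: pearson(profile, reference[ct]))
            for cluster, profile in query.items()}

# Gene order assumed for every profile: CD3D, CD3E, MS4A1, CD79A, CD14, LYZ
reference = {
    "T cell": [5, 4, 0, 0, 0, 1],
    "B cell": [0, 0, 6, 5, 0, 1],
    "Monocyte": [0, 0, 0, 0, 4, 8],
}
query = {
    "0": [4, 5, 0, 1, 0, 0],
    "1": [0, 1, 0, 0, 5, 7],
    "2": [0, 0, 3, 4, 0, 0],
}
predicted = annotate_by_correlation(query, reference)
```

The quality of the result depends on how well the reference matches the query tissue, so a high-quality, annotated reference dataset is the main requirement for this family of methods.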

18 of 19

Take home message

Feature Selection

Choose informative genes or features for analysis.
Dispersion-based vs variance-based feature selection methods (scanpy)

Dimensionality Reduction

Reduce data complexity with techniques like PCA
Visualize and explore high-dimensional data effectively.

Clustering

Group cells into clusters based on similar expression profiles (use the Leiden algorithm, an improved variant of Louvain).
Sub-clustering can be very useful in single-cell analysis (resolution parameter in Leiden).

Cell Annotation

Assign biological meaning (cell identity) to cell clusters.
Manual (literature <-> DE analysis marker genes) vs automated (databases, correlation-based approaches, machine learning approaches).

19 of 19

Useful links

https://www.youtube.com/watch?v=FgakZw6K1QQ&ab_channel=StatQuestwithJoshStarmer (UMAP video)

https://www.youtube.com/watch?v=eN0wFzBA4Sc&ab_channel=StatQuestwithJoshStarmer (PCA video)

https://www.youtube.com/watch?v=NEaUSP4YerM&ab_channel=StatQuestwithJoshStarmer (t-SNE video)

https://www.sc-best-practices.org/preamble.html (single-cell practices tutorial)

https://scanpy.readthedocs.io/en/stable/tutorials.html (scanpy docs)

https://anndata.readthedocs.io/en/latest/ (anndata docs)

https://scverse.org/packages/#core-packages (scverse community)

https://www.nature.com/articles/s41576-018-0088-9 (Challenges in unsupervised clustering of single-cell RNA-seq data)

https://www.nature.com/articles/s41576-023-00586-w

https://www.sciencedirect.com/science/article/pii/S2001037021000192#:~:text=Automatic%20cell%20type%20annotation%20methods,found%20across%20scRNA%2Dseq%20datasets. (Automated methods in cell-type annotation)

https://github.com/seandavi/awesome-single-cell?tab=readme-ov-file