1 of 19

VIRTUAL COURSE: Single-cell RNA-seq analysis using Python

Anna Vathrakokoili Pournara

February 2025

Feature Selection, Dimensionality Reduction, Clustering and Annotation

2 of 19

A bit about myself

  • I did my undergrad in Biology (Patras, Greece)

  • Master’s in Molecular Biomedicine (NKUA, Athens)

  • PhD at Papatheodorou Group (EMBL-EBI)

  • Working on cell-type deconvolution of bulk cancer samples

  • Bulk RNA-seq and single-cell analysis (R and Python)

  • Postdoc at Sanger (Haniffa lab) – studying skin diseases using single-cell genomics


3 of 19

Previously…

From raw sequencing files... to count matrix

From QC of count matrix… to Normalization

Low-quality cells (QC)

Ambient RNA (SoupX)

Doublet detection

Normalization

4 of 19

Coming up…

Today’s Lecture Outline

Feature Selection

  • Highly variable genes - Dispersion-based approaches
  • Highly variable genes - Seurat v3 (variance-stabilisation)

Dimensionality Reduction

  • PCA
  • t-SNE
  • UMAP

Clustering

  • Graph-based clustering (Louvain, Leiden)
  • Hierarchical clustering

Cell-type annotation

  • From differentially expressed genes to cluster annotation
  • From markers to cluster annotation
  • New-generation tools: automated annotation

5 of 19

Feature selection in single-cell analysis

[Figure: ~30,000 genes --> ~500-2,000 selected genes --> PCA --> PCs --> clustering]

6 of 19

Feature selection methods implemented in scanpy

  1. Dispersion-based: reproduces the R implementations of Seurat [Satija et al.] and Cell Ranger [Zheng et al.]; flavor="seurat" or flavor="cell_ranger"

  2. Variance-based: Seurat v3 [Stuart et al.]; flavor="seurat_v3"

scanpy.pp.highly_variable_genes()

[Figure: genes grouped into bins by mean expression, e.g. mean expression = 0.5, 10, 100, 200]

Dispersion-based selection (genes are first grouped into bins by mean expression):

a) Calculate the dispersion of each gene in each bin

b) Calculate the mean and the standard deviation of the dispersions in each bin

c) Normalise the dispersion of each gene using the mean and the standard deviation from (b)

d) Rank genes by their normalised dispersion values; the top-ranked genes are the highly variable genes
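
The dispersion-based steps can be sketched in plain Python. This is a minimal illustration, not scanpy's implementation: the function name, the equal-sized binning, the use of variance / mean as the dispersion, and the toy values are all assumptions made for the example.

```python
from statistics import mean, pstdev, pvariance


def normalised_dispersions(expr, n_bins=2):
    """Return gene -> dispersion z-scored within its mean-expression bin.

    expr maps a gene name to its expression values across cells.
    """
    # Per-gene mean and dispersion (variance / mean); skip unexpressed genes.
    stats = {}
    for gene, values in expr.items():
        mu = mean(values)
        if mu > 0:
            stats[gene] = (mu, pvariance(values) / mu)
    if not stats:
        return {}

    # Assumption: equal-sized bins over genes sorted by mean expression.
    ordered = sorted(stats, key=lambda g: stats[g][0])
    bin_size = -(-len(ordered) // n_bins)  # ceiling division

    z_scores = {}
    for start in range(0, len(ordered), bin_size):
        genes = ordered[start:start + bin_size]
        disps = [stats[g][1] for g in genes]
        bin_mean, bin_sd = mean(disps), pstdev(disps)
        for g in genes:
            # Normalise each dispersion by the mean and SD of its own bin.
            z_scores[g] = (stats[g][1] - bin_mean) / bin_sd if bin_sd > 0 else 0.0
    return z_scores


# Invented toy data: three low-mean and three high-mean genes across six cells.
toy_expr = {
    "geneA": [0, 1, 0, 1, 0, 1],
    "geneB": [0, 0, 0, 3, 0, 0],
    "geneE": [0, 0, 1, 0, 0, 2],
    "geneC": [10, 10, 10, 10, 10, 10],
    "geneD": [2, 18, 10, 4, 16, 10],
    "geneF": [8, 12, 10, 9, 11, 10],
}
z = normalised_dispersions(toy_expr, n_bins=2)
top_genes = sorted(z, key=z.get, reverse=True)[:2]
print(top_genes)
```

Normalising within bins is what lets a lowly expressed gene such as geneB compete with a highly expressed gene such as geneD, even though their raw dispersions are on different scales.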

Seurat v3 (variance-based) selection:

a) Expects raw counts (not normalised or log-transformed)

b) A variance-stabilising transformation is applied to the raw data

c) Highly variable genes are selected based on the variance of the standardised values (the mean-variance relationship is taken into account)

d) The top 2,000 genes are selected as highly variable (adjustable via n_top_genes)
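
A rough sketch of the variance-based idea follows. It is a simplification under stated assumptions: a straight-line fit of log10(variance) against log10(mean) stands in for the local regression used by Seurat v3, and the function name and toy counts are invented for illustration.

```python
import math
from statistics import mean, variance


def standardised_variances(counts):
    """Return gene -> variance of standardised raw counts.

    counts maps a gene name to its raw counts across cells.
    Assumes at least two genes with different, non-zero means.
    """
    n_cells = len(next(iter(counts.values())))
    genes = [g for g, v in counts.items() if variance(v) > 0]
    mu = {g: mean(counts[g]) for g in genes}

    # Model the mean-variance trend on a log10 scale.
    # Assumption: ordinary least squares line instead of a loess fit.
    xs = [math.log10(mu[g]) for g in genes]
    ys = [math.log10(variance(counts[g])) for g in genes]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar

    clip = math.sqrt(n_cells)  # upper bound on standardised values
    result = {}
    for g in genes:
        expected_sd = math.sqrt(10 ** (intercept + slope * math.log10(mu[g])))
        z = [min((c - mu[g]) / expected_sd, clip) for c in counts[g]]
        result[g] = sum(v * v for v in z) / (n_cells - 1)
    return result


# Invented raw counts: g4 is far more variable than its mean would predict.
toy_counts = {
    "g1": [1, 0, 2, 1, 0, 2],
    "g2": [10, 8, 12, 10, 9, 11],
    "g3": [100, 90, 110, 100, 95, 105],
    "g4": [10, 0, 30, 5, 0, 15],
}
std_var = standardised_variances(toy_counts)
most_variable = max(std_var, key=std_var.get)
print(most_variable)
```

Because each gene is compared with the variance expected at its own mean, g3 does not win simply by having the largest raw variance.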

7 of 19

Dimensionality reduction

Curse of dimensionality:

  • Refers to the challenge of working with high-dimensional data, such as scRNA-seq data with many cells and genes.
  • High-dimensional data contains more information in theory, but in practice it also contains more noise and redundancy, so the additional dimensions add little to downstream analysis.

8 of 19

PCA

Each cell in a single-cell dataset is represented as a point in a high-dimensional space with many features (genes).

  • PCs are linear combinations of the original features (genes), chosen to capture the most variation in the data.
  • PCs are ordered by how much variation they capture, with the first PC capturing the most variance.
  • PCA helps reduce the dimensionality of the data by keeping the top PCs that capture the most important information.
  • This reduction is useful because it makes the data easier to work with and visualize.
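
To make the idea concrete, here is a small sketch that finds the first PC of a toy matrix. It uses power iteration on the gene covariance matrix purely to stay dependency-free; this is an assumption of the example, as libraries normally rely on an SVD. The function name and data are invented.

```python
import math


def first_pc(cells, n_iter=200):
    """Return (loadings, variance captured, per-cell scores) for PC1.

    cells is a list of equal-length expression vectors, one per cell.
    """
    n, d = len(cells), len(cells[0])
    means = [sum(c[j] for c in cells) / n for j in range(d)]
    centred = [[c[j] - means[j] for j in range(d)] for c in cells]

    # Gene-by-gene covariance matrix.
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeated multiplication converges to the direction
    # of largest variance (assumes the start vector is not orthogonal to it).
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    captured = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    total = sum(cov[i][i] for i in range(d))
    return v, captured / total, scores


# Invented toy matrix: 5 cells x 3 genes; gene 2 is twice gene 1, gene 3 barely varies.
toy_cells = [[1, 2, 5], [2, 4, 6], [3, 6, 5], [4, 8, 6], [5, 10, 5]]
loadings, explained, scores = first_pc(toy_cells)
print(loadings, explained)
```

In this toy case a single PC summarises almost all of the variation, which is the sense in which keeping only the top PCs reduces dimensionality with little loss.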

PC1

PC2

9 of 19

t-SNE (t-Distributed Stochastic Neighbor Embedding)

  • t-SNE is a non-linear technique used to reduce the dimensionality of high-dimensional data.
  • It maps data points to a lower-dimensional space by creating probability distributions based on distances and optimizing embeddings.
  • t-SNE can reveal clusters and preserves local relationships, making it valuable for visualizing complex datasets and discovering patterns.
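
For reference, the "probability distributions based on distances" can be written out as in the standard t-SNE formulation. The notation is not from the slides and is assumed here: x_i are the high-dimensional points, y_i their low-dimensional embeddings, N the number of cells, and sigma_i a per-point bandwidth set by the perplexity.

```latex
\begin{align*}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
\mathrm{KL}(P \,\Vert\, Q) &= \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align*}
```

The embedding is optimized by minimising the KL divergence, so pairs that are close in the original space (large p_ij) are pulled close in the embedding.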

10 of 19

UMAP (Uniform Manifold Approximation and Projection)

  • UMAP is a graph-based, non-linear dimensionality reduction technique.
  • It constructs a high-dimensional graph representation of the dataset and optimizes a low-dimensional graph to be as structurally similar to it as possible.
  • To run UMAP: first calculate PCA, then compute a neighbourhood graph on the PCA representation.
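
The neighbourhood graph that UMAP starts from can be sketched as a brute-force k-nearest-neighbour search in PC space. This is only an illustration with invented coordinates and a hypothetical function name; in scanpy the corresponding steps are sc.pp.neighbors followed by sc.tl.umap, and libraries typically use approximate search on large datasets.

```python
import math


def knn_graph(points, k):
    """Return point index -> indices of its k nearest neighbours (Euclidean)."""
    neighbours = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, closest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours[i] = [j for _, j in dists[:k]]
    return neighbours


# Invented PC1/PC2 coordinates for six cells forming two tight groups.
toy_pcs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(toy_pcs, k=2)
print(graph)
```

Each cell is connected only to cells in its own group, and this local structure is what the low-dimensional layout then tries to preserve.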

https://blog.bioturing.com/2022/01/14/umap-vs-t-sne-single-cell-rna-seq-data-visualization/

11 of 19

Clustering

  • Leiden Algorithm: We utilize the Leiden algorithm on a k-nearest-neighbors (KNN) graph constructed from the reduced expression space, often obtained through principal component analysis (PCA). Leiden identifies clusters by considering the density of connections between cells and comparing it to the expected density.

  • Resolution Parameter: The Leiden algorithm offers a resolution parameter, allowing users to control the granularity of clustering. Higher values result in more clusters, while lower values yield coarser groupings.

The goal in single-cell RNA sequencing (scRNA-seq) analysis is to uncover cellular structures and identify cell identities within the dataset.
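
The comparison of observed versus expected connection density, and the effect of the resolution parameter, can be illustrated with a modularity-style score. This sketch only scores a given partition; it does not implement the Leiden optimisation itself, and the function name and toy graph are assumptions.

```python
def modularity(edges, labels, resolution=1.0):
    """Score a partition of an unweighted, undirected graph.

    For each cluster: (fraction of edges inside it)
    minus resolution * (expected fraction under random wiring).
    """
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1

    # Total degree per cluster drives the expected edge density.
    cluster_degree = {}
    for node, d in degree.items():
        cluster_degree[labels[node]] = cluster_degree.get(labels[node], 0) + d

    return sum(
        internal.get(c, 0) / m - resolution * (cluster_degree[c] / (2 * m)) ** 2
        for c in cluster_degree
    )


# Invented KNN-like graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_cluster = {n: "a" for n in range(6)}
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

for res in (0.1, 1.0):
    print(res, modularity(toy_edges, one_cluster, res),
          modularity(toy_edges, two_clusters, res))
```

At resolution 1.0 the two-cluster partition scores higher, while at 0.1 the single coarse cluster wins, mirroring how lower resolution values yield coarser groupings.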

13 of 19

Cell type annotation

  • Definition of cell type:

- Cell types are robust cellular phenotypes identifiable based on the expression of specific markers (e.g., proteins or gene transcripts).

- They are often linked to specific functions and remain consistent across datasets.

  • Challenges:

- Cell categorisation is subjective and may change over time due to technological advancements or discoveries of sub-phenotypes.

- Cell types can be further classified into subtypes or cell states, and the term "cell identity" is sometimes used to avoid arbitrary distinctions.

  • Continuum and Differentiation:

- Cell types may exist along a continuum, where cells transition or differentiate into one another.

- Differentiation coordinates can provide a more accurate description of cell states, especially in processes like haematopoiesis.

14 of 19

Cell-type annotation methods

Rely on transcriptomic similarity between cells.

Types of cell-type annotation:

    • Manual annotation:
      • From known markers to cluster annotation
      • From differentially expressed genes to cluster annotation
    • Automated annotation
      • Marker gene-based classifiers
      • Classifiers based on a wider set of genes.
      • Annotation by mapping to a reference.

15 of 19

Manual annotation

From known markers to cluster annotation

  • Literature based annotation
  • Transcriptome-based markers might work better than protein-expression-based markers
  • Good-quality markers: validated in multiple datasets
  • Good knowledge of the biology of a tissue/cell-type and the functions involved

[Figure: literature --> annotate clusters]
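
One simple way to go from known markers to cluster labels is to score each cluster by the average expression of each cell type's markers and take the best match. The sketch below assumes this scoring rule; the function name and expression values are invented, and the marker lists are a small illustrative selection rather than a validated panel.

```python
def annotate_clusters(cluster_means, markers):
    """Assign each cluster the cell type whose markers are most expressed.

    cluster_means: cluster -> {gene: mean expression in that cluster}
    markers:       cell type -> list of marker genes
    """
    annotation = {}
    for cluster, means in cluster_means.items():
        # Average marker expression per candidate cell type.
        scores = {
            cell_type: sum(means.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in markers.items()
        }
        annotation[cluster] = max(scores, key=scores.get)
    return annotation


# Illustrative marker lists (assumed for the example).
toy_markers = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["CD14", "LYZ"],
}
# Invented per-cluster mean expression values.
toy_cluster_means = {
    "0": {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "LYZ": 0.3},
    "1": {"MS4A1": 1.9, "CD79A": 2.2, "CD3D": 0.1, "LYZ": 0.2},
    "2": {"CD14": 1.5, "LYZ": 3.0, "CD3E": 0.1},
}
labels = annotate_clusters(toy_cluster_means, toy_markers)
print(labels)
```

A score like this is only a starting point: the labels still need checking against the biology of the tissue, as the points above stress.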

16 of 19

Manual annotation

From differentially expressed (DE) genes to cluster annotation

  • Most popular DE tests implemented in scanpy: t-test, Wilcoxon

  • Wilcoxon rank-sum test: calculates a U statistic for each gene, measuring how well the groups separate based on that gene's expression. Hypothesis testing on this statistic then decides whether the gene is considered a marker (DE).
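
The U statistic itself is easy to compute by hand. The sketch below counts, for one gene, how often a cell inside the cluster has higher expression than a cell outside it; the p-value step is omitted, and the function name and expression values are invented. In scanpy the full test is available through sc.tl.rank_genes_groups with method="wilcoxon".

```python
def u_statistic(group, rest):
    """Count pairs where the group value exceeds the rest value (ties count 0.5)."""
    u = 0.0
    for x in group:
        for y in rest:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Invented expression of two genes: cells in one cluster vs all other cells.
marker_in, marker_out = [5, 7, 6, 8], [1, 0, 2, 1, 3]
flat_in, flat_out = [1, 2, 0, 3], [1, 0, 2, 1, 3]

u_marker = u_statistic(marker_in, marker_out)
u_flat = u_statistic(flat_in, flat_out)
n_pairs = len(marker_in) * len(marker_out)

# U / n_pairs is 1.0 for perfect separation and about 0.5 for no separation.
print(u_marker / n_pairs, u_flat / n_pairs)
```

The first gene separates the cluster from every other cell, so it is a marker candidate; the second gives a value near 0.5 and would not be called DE.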

[Figure: differential expression (DE) analysis --> find marker genes per cluster --> annotate clusters, using literature and available datasets + studies]

17 of 19

Automated annotation

Marker-gene Database-based:

  • scAssign
  • scCATCH

Correlation-based (query-reference):

  • SingleR
  • scMatch (Python)

Supervised classification-based:

  • Moana
  • Garnett

Others: scANVI
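
The correlation-based (query-reference) idea can be sketched as follows: correlate a query cluster's expression profile with each reference cell-type profile and pick the best match. This is not how any of the listed tools is implemented; Pearson correlation, the function names, and all values are assumptions for illustration, and real tools differ in the correlation measure and add refinement steps.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def annotate_by_correlation(query, reference):
    """Return the reference label whose profile correlates best with the query."""
    return max(reference, key=lambda label: pearson(query, reference[label]))


# Invented profiles over the same four genes, in the same order.
toy_reference = {
    "T cell": [10, 9, 0, 1],
    "B cell": [0, 1, 10, 9],
}
toy_query = [9, 8, 1, 0]
best_label = annotate_by_correlation(toy_query, toy_reference)
print(best_label)
```

The approach depends on the query and reference sharing the same genes and comparable normalisation, which is why the choice of reference dataset matters.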

18 of 19

Take home message

Feature Selection

    • Choose informative genes or features for analysis.
    • Dispersion-based vs variance-based feature selection methods (scanpy)

Dimensionality Reduction

    • Reduce data complexity with techniques like PCA
    • Visualize and explore high-dimensional data effectively.

Clustering

    • Group cells into clusters based on similar expression profiles (use the Leiden algorithm, an improved variant of Louvain).
    • Sub-clustering can be very useful in single-cell analysis (resolution parameter in Leiden).

Cell Annotation

    • Assign biological meaning (cell identity) to cell clusters.
    • Manual (literature <-> DE analysis marker genes) vs automated (databases, correlation-based approaches, machine learning approaches).

19 of 19

Useful links