1 of 21

scRNA-seq Dataset Integration

Vinicius Maracaja-Coutinho

Associate Professor

Universidad de Chile

vinicius.maracaja@uchile.cl

Modified version of Erick Armingol's slides

2 of 21

Increasing complexity in single-cell RNA-seq data

(PMID: 37198436)

3 of 21

Building reference atlases

Hrovatin et al., Unpublished.

4 of 21

Usage of integrated databases

Hrovatin et al., Unpublished.

5 of 21

Using the available transcriptomes

Label-centric comparison

Comparing the labels or annotations of cell types or clusters across different datasets or conditions.

  • Assesses how well the cell types or clusters identified in one dataset match those in another dataset after integration.

    • Can be used to compare the annotations of two different samples from the same experiment.

    • Can project cells from a new experiment onto an annotated reference.
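
The projection idea above can be sketched as a k-nearest-neighbor label transfer. This is a minimal illustration, not any specific tool's method: it assumes the reference and query cells already share an embedding, and the function name `transfer_labels` and the toy coordinates are hypothetical.

```python
import numpy as np

def transfer_labels(ref_X, ref_labels, query_X, k=3):
    """Give each query cell the majority label of its k nearest reference cells."""
    ref_X = np.asarray(ref_X, dtype=float)
    query_X = np.asarray(query_X, dtype=float)
    ref_labels = np.asarray(ref_labels)
    # Pairwise squared Euclidean distances, shape (n_query, n_ref)
    d2 = ((query_X[:, None, :] - ref_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    predicted = []
    for idx in nearest:
        values, counts = np.unique(ref_labels[idx], return_counts=True)
        predicted.append(values[np.argmax(counts)])
    return np.array(predicted)

# Toy data (assumed): two well-separated reference cell types in a 2-D embedding
ref_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
ref_labels = np.array(["T cell"] * 3 + ["B cell"] * 3)
query_X = np.array([[0.05, 0.1], [5.05, 5.0]])
predicted = transfer_labels(ref_X, ref_labels, query_X, k=3)
```

In practice the shared embedding would come from an integration step; if batch effects remain, the nearest reference cells may reflect the dataset rather than the cell type.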

6 of 21

Using the available transcriptomes

Cross-dataset analysis

Attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.

7 of 21

Integration and analysis of multiple datasets

(PMID: 37198436)

8 of 21

Main challenges: Batch effects among datasets

Single cells colored by dataset → separation by dataset

A batch effect can be defined as variability in the data that is not due to a variable of interest

  • Technical variability:
    • Sample handling
    • Experimental protocols
    • Sequencing platform

  • Biological variability:
    • Donor differences
    • External factors (e.g. environmental factors)
    • Evolution

(PMID: 31217225)
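
One rough way to quantify "separation by dataset" is the fraction of each cell's nearest neighbors that come from its own batch. The sketch below is an illustrative score on simulated data, not a published metric; the function name and the size of the simulated shift are assumptions.

```python
import numpy as np

def same_batch_neighbor_fraction(X, batch, k=5):
    """Average fraction of each cell's k nearest neighbors that share its batch."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    return (batch[nn] == batch[:, None]).mean()

rng = np.random.default_rng(0)
cells = rng.normal(size=(100, 10))        # simulated cells in a 10-D embedding
batch = np.repeat(["A", "B"], 50)
# Simulated batch effect: batch B is displaced by 6 units on every axis
shift = np.where(batch[:, None] == "B", 6.0, 0.0)

separated = same_batch_neighbor_fraction(cells + shift, batch)
mixed = same_batch_neighbor_fraction(cells, batch)
```

A value close to 1 means cells group by batch; with two equally sized, well-mixed batches the value should sit near 0.5.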

9 of 21

Main challenges: Batch effects among datasets

(PMID: 36859475)

10 of 21

Identifying the covariates driving batch effects helps to correct them

(PMID: 31217225)
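
A simple way to look for the covariates behind a batch effect is to ask how much of a leading principal component each covariate explains. The sketch below uses simulated data and a one-way R²; the covariate names and values (`protocol`, `donor`) are hypothetical.

```python
import numpy as np

def variance_explained(scores, groups):
    """R^2 of a one-way model: share of variance in `scores` explained by `groups`."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    total = ((scores - scores.mean()) ** 2).sum()
    between = sum(
        (groups == g).sum() * (scores[groups == g].mean() - scores.mean()) ** 2
        for g in np.unique(groups)
    )
    return between / total

rng = np.random.default_rng(1)
n_cells, n_genes = 200, 30
protocol = np.repeat(["10x", "Smart-seq2"], n_cells // 2)  # hypothetical covariate
donor = np.tile(["d1", "d2", "d3", "d4"], n_cells // 4)     # hypothetical covariate
X = rng.normal(size=(n_cells, n_genes))
X[protocol == "Smart-seq2"] += 3.0  # simulate a strong protocol effect

# Principal component scores from the SVD of the centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * S[0]

r2 = {"protocol": variance_explained(pc1, protocol),
      "donor": variance_explained(pc1, donor)}
```

Here the simulated protocol effect should dominate PC1 while donor explains almost nothing, which would point to protocol as the covariate to correct for.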

11 of 21

Main types of methods for data integration

(PMID: 36859475)

  1. Linear decomposition methods

  2. Similarity-based batch correction methods
    • Dimensionality reduction
    • Identification of similar cells between batches
      - Cell-level similarity search
      - Cluster-level similarity search

  3. Generative models with variational autoencoders (artificial intelligence)

12 of 21

Main types of methods for data integration

(PMID: 36859475)

13 of 21

Linear Decomposition

(PMID: 37002403)
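
The core idea of linear decomposition can be sketched as a per-gene linear model with a batch term that is then subtracted. This is a deliberately simplified illustration on simulated data: the function `regress_out_batch` is hypothetical, it expects at least two batches, and it assumes the batches have the same cell-type composition. Tools such as ComBat add further steps (for example empirical Bayes shrinkage) that are not shown.

```python
import numpy as np

def regress_out_batch(X, batch):
    """Fit X = intercept + batch term + residual per gene, then remove the batch term."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    levels = np.unique(batch)
    # Design matrix: intercept plus one indicator per batch except the first (reference)
    indicators = np.stack([(batch == b).astype(float) for b in levels[1:]], axis=1)
    design = np.column_stack([np.ones(len(batch)), indicators])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    batch_term = indicators @ coef[1:]  # estimated offset of each cell's batch
    return X - batch_term

rng = np.random.default_rng(2)
batch = np.repeat(["b1", "b2"], 50)
X = rng.normal(size=(100, 5))   # simulated cells x genes matrix
X[batch == "b2"] += 4.0         # simulated additive batch effect

corrected = regress_out_batch(X, batch)
gap_before = np.abs(X[batch == "b1"].mean(0) - X[batch == "b2"].mean(0)).max()
gap_after = np.abs(corrected[batch == "b1"].mean(0) - corrected[batch == "b2"].mean(0)).max()
```

Because the model only captures mean shifts, it can also remove real biological differences when a cell type is present in one batch but not the other.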

14 of 21

Similarity: Mutual Nearest Neighbors (MNN)
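
The mutual nearest neighbor search itself can be sketched in a few lines. This toy example only finds the pairs and applies a single average correction vector; published MNN implementations add steps such as cosine normalization and locally smoothed correction vectors, which are omitted here. The function name and coordinates are assumptions.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=1):
    """Return (i, j) pairs where B[j] is among the k nearest B cells of A[i] and vice versa."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # shape (n_A, n_B)
    a_to_b = np.argsort(d2, axis=1)[:, :k]     # for each A cell, its k nearest B cells
    b_to_a = np.argsort(d2, axis=0)[:k, :].T   # for each B cell, its k nearest A cells
    pairs = []
    for i in range(A.shape[0]):
        for j in a_to_b[i]:
            if i in b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

# Batch A has three cells; the third is a cell type absent from batch B.
A = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 10.0]])
# Batch B holds the first two cell states, displaced by a batch effect of (1, 1).
B = np.array([[1.0, 1.0], [4.0, 1.0]])

pairs = mutual_nearest_neighbors(A, B, k=1)
# Average difference across MNN pairs, used here as one global correction vector
correction = np.mean([A[i] - B[j] for i, j in pairs], axis=0)
B_corrected = B + correction
```

The cell type found only in batch A gets no mutual partner, so it does not distort the estimated correction.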

15 of 21

Similarity: Mutual Nearest Neighbors (MNN)

16 of 21

Similarity: Mutual Nearest Neighbors (MNN)

17 of 21

Seurat-v3 (Canonical Correlation Analysis + Anchors)

(PMID: 31178118)
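
The CCA step can be sketched as an SVD of the cross-product between two standardized gene-by-cell matrices, followed by L2 normalization of each cell's embedding. This is a simplified numpy illustration on simulated data, not Seurat's implementation; the function `cca_embeddings` and all parameters are assumptions. Anchors would then be chosen as mutual nearest neighbors in this shared space.

```python
import numpy as np

def cca_embeddings(X, Y, n_components=2):
    """X, Y: genes x cells matrices with the same genes in the same row order.
    Returns L2-normalized cell embeddings for both datasets from the SVD of X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Standardize each gene within its own dataset
    Xs = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    Ys = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Xs.T @ Ys, full_matrices=False)
    emb_x = U[:, :n_components]
    emb_y = Vt[:n_components].T
    emb_x = emb_x / np.linalg.norm(emb_x, axis=1, keepdims=True)
    emb_y = emb_y / np.linalg.norm(emb_y, axis=1, keepdims=True)
    return emb_x, emb_y

rng = np.random.default_rng(3)
n_genes, cells_per_type = 60, 10
types = np.repeat([0, 1, 2], cells_per_type)   # three simulated cell types
signal = np.zeros((n_genes, types.size))
for t in range(3):
    signal[20 * t:20 * (t + 1), types == t] = 5.0  # 20 marker genes per type

X = signal + rng.normal(scale=0.5, size=signal.shape)
# Second dataset: same biology with a simulated scale and offset batch effect
Y = 2.0 * signal + 3.0 + rng.normal(scale=0.5, size=signal.shape)

emb_x, emb_y = cca_embeddings(X, Y)
nearest_x = np.argmax(emb_y @ emb_x.T, axis=1)  # most similar X cell for each Y cell
matches = types[nearest_x] == types
```

In this toy setting each cell in the second dataset should find a nearest neighbor of the same type in the first dataset, despite the simulated batch effect.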

18 of 21

Harmony

(PMID: 31740819)

19 of 21

Deep-Learning models

(PMID: 36859475)

20 of 21

Benchmarking

21 of 21

vinicius.maracaja@uchile.cl

@vin_maracaja