1 of 38

Graph alignment: Applications to scRNA-seq data integration

Nov 5th 2024

BMI/CS 775 Computational Network Biology, Fall 2024

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 38

Plan for this section

  • Global alignment of protein-interaction networks (Oct 31st, Nov 5th)
    • Matrix factorization: FUSE

  • Graph-based alignment for single cell omic datasets (Nov 5th)

3 of 38

Applications of network alignment

Alignment of molecular networks

  • PathBLAST
  • IsoRank
  • FUSE

Alignment of scRNA-seq datasets

  • SCANORAMA
  • CONOS
  • LIGER
  • scPopcorn

4 of 38

Goals for today

  • Overview of single cell omics

  • Approaches to align datasets
    • Mutual nearest neighbor alignment
    • LIGER

5 of 38

Single cell omics

Slide credit: 10x Genomics

6 of 38

A single cell RNA-seq dataset

scRNA-seq dataset: a matrix of genes (6k-20k) × cells (5k-1 million)

7 of 38

Computational problems with scRNA-seq data

  1. Pre-processing and normalization
  2. Visualization
  3. Cell type identification
  4. Trajectory inference:
    1. Single cell ordering
      1. pseudo time
      2. velocity
    2. Cell population structure relationships
  5. Network inference
  6. Data integration

8 of 38

Computational tools for single cell omic datasets

Zappia, L. & Theis, F. J. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 22, 301 (2021).

9 of 38

Flavors of data integration

  • Integrate multiple scRNA-seq datasets, each representing a time point or treatment condition
  • Integrate multi-modal single cell datasets
    • scRNA-seq
    • scATAC-seq
    • spatial RNA-seq
    • CITE-seq (proteins)
  • Query new dataset with an existing tissue atlas

10 of 38

Overall Problem Definition

  • Given: N single cell RNA-seq datasets, E1, ..., EN

  • Do:
    • Find a correspondence between the cells and cell types in one dataset and the cells and cell types in another dataset

11 of 38

What makes integration of scRNA-seq datasets difficult?

  • Presence of batch effects

  • Unknown number of cell types

  • Varying number of cell types across datasets

  • Sparsity

  • ..

12 of 38

Aligning high-dimensional datasets

  • Often high-dimensional datasets have a low-dimensional structure
  • Such low-dimensional structure can be approximated by a graph
  • Dataset alignment can be considered as an instance of graph alignment
  • Broadly speaking, this aims to construct low dimensional mappings between two or more datasets by aligning their low-dimensional spaces

Adapted from “Manifold alignment”, Wang et al 2010

13 of 38

Common approach to aligning scRNA-seq datasets

  1. Define k-nearest neighbor graphs
    1. Often needs dimensionality reduction
    2. Needs appropriate distance metrics

  2. Correct/align cells

  3. (Optional) cluster
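
The first step above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular tool's implementation: the function name `knn_graph`, the use of PCA via SVD for the optional dimensionality reduction, and the Euclidean distance metric are all assumptions made for the example.

```python
import numpy as np

def knn_graph(X, k, n_components=None):
    """Return an (n_cells, k) array with the indices of each cell's k nearest cells.

    X is a cells x features matrix. If n_components is given, the cells are
    first projected onto the top principal components (PCA via SVD).
    """
    X = np.asarray(X, dtype=float)
    if n_components is not None:
        Xc = X - X.mean(axis=0)                      # center each feature
        U, S, _ = np.linalg.svd(Xc, full_matrices=False)
        X = U[:, :n_components] * S[:n_components]   # PCA scores
    # Squared Euclidean distance between every pair of cells
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)                     # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

# Toy example: two well-separated pairs of cells
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
neighbors = knn_graph(X, k=1)
```

Computing all pairwise distances is quadratic in the number of cells, so for datasets at the scale mentioned earlier an approximate nearest neighbor search would likely be needed in practice.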

14 of 38

Goals for today

  • Overview of single cell omics

  • Approaches to align datasets
    • Mutual nearest neighbor alignment
    • LIGER

15 of 38

Batch effect correction of scRNA-seq data using mutual nearest neighbors

  • L. Haghverdi, A. T. L. Lun, M. D. Morgan, and J. C. Marioni, “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors,” Nat. Biotechnol., vol. 36, no. 5, pp. 421–427, 2018, doi: 10.1038/nbt.4091.
  • Previous approaches relied on methods developed for bulk RNA-seq data
  • This approach finds “mutual nearest neighbors” (MNN) between batches and uses them to align the datasets
  • Assumes there is at least one cell population that is present in both batches

16 of 38

MNN for batch correction

  1. Consider two batches of data at a time

  2. Find mutual nearest neighbors

  3. Find correction vectors

  4. Shift one dataset to another

  5. Repeat for new batches
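
The steps above can be sketched for a single pair of batches as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, distances are Euclidean, and one global correction vector (the mean over all MNN pairs) is applied to every cell. The published method instead computes locally smoothed, cell-specific correction vectors.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k):
    """Return pairs (i, j) where B[j] is among the k nearest B-cells of A[i]
    and A[i] is among the k nearest A-cells of B[j]."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # (n_A, n_B)
    nn_ab = np.argsort(d2, axis=1)[:, :k]     # for each A-cell: nearest B-cells
    nn_ba = np.argsort(d2, axis=0)[:k, :].T   # for each B-cell: nearest A-cells
    return [(i, int(j)) for i in range(len(A)) for j in nn_ab[i] if i in nn_ba[j]]

def mnn_correct(A, B, k=1):
    """Shift batch B onto reference batch A (simplified: one global vector)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    pairs = mutual_nearest_neighbors(A, B, k)
    if not pairs:
        raise ValueError("no mutual nearest neighbors found")
    # One correction vector per MNN pair, pointing from B to A
    vectors = np.array([A[i] - B[j] for i, j in pairs])
    return B + vectors.mean(axis=0)

# Toy example: batch B is batch A shifted by a constant batch effect
A = np.array([[0.0, 0.0], [5.0, 5.0]])
B = A + np.array([1.0, 1.0])
pairs = mutual_nearest_neighbors(A, B, k=1)
B_corrected = mnn_correct(A, B, k=1)
```

To handle more than two batches, the corrected batch would be merged with the reference and the procedure repeated with the next batch.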

17 of 38

Comparing MNN to other methods: simulated dataset

18 of 38

Comparing MNN to other methods: hematopoiesis differentiation dataset

19 of 38

Goals for today

  • Overview of single cell omics

  • Approaches to align datasets
    • Mutual nearest neighbor alignment
    • LIGER

20 of 38

Non-negative matrix factorization

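Minimize $\lVert E - HW \rVert_F^2$ s.t. $H \ge 0$, $W \ge 0$

  • E: cells × genes expression matrix
  • H: cells × factors matrix
  • W: factors × genes matrix

Lee and Seung Adv. Neur. In. 2001

Slide credit: Erika Da-Inn Lee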
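
One standard way to fit this factorization is with the multiplicative update rules of Lee and Seung. The sketch below assumes the squared Frobenius-norm objective and the orientation used elsewhere in these slides (E is cells × genes, H is cells × factors, W is factors × genes). The function name `nmf`, the random initialization, and the small constant `eps` are choices made for the example.

```python
import numpy as np

def nmf(E, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix E (n x m) as H @ W.

    H is n x k and W is k x m, both non-negative. Uses multiplicative
    updates for the squared Frobenius-norm objective.
    """
    E = np.asarray(E, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = E.shape
    H = rng.random((n, k)) + eps
    W = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so H and W stay >= 0
        H *= (E @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ E) / (H.T @ H @ W + eps)
    return H, W

# Toy example: a 20 x 15 non-negative matrix of rank 2
rng = np.random.default_rng(1)
E = rng.random((20, 2)) @ rng.random((2, 15))
H, W = nmf(E, k=2, n_iter=300)
reconstruction_error = np.linalg.norm(E - H @ W)
```

The updates never increase the objective, but they converge only to a local optimum, so results can depend on the initialization.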

21 of 38

Using NMF factors for clustering

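  • Cell i is in cluster k if $k = \arg\max_{l} H_{il}$ (the largest entry in row i of H, the cells × factors matrix)

  • Gene j is in cluster k if $k = \arg\max_{l} W_{lj}$ (the largest entry in column j of W, the factors × genes matrix)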
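
A small sketch of this assignment, assuming the usual rule that each cell (or gene) goes to the factor on which it has its largest loading. The matrices below are made-up toy values, with H as cells × factors and W as factors × genes.

```python
import numpy as np

# Toy factor matrices (hypothetical values for illustration)
H = np.array([[0.9, 0.1],
              [0.2, 0.7],
              [0.8, 0.3]])            # 3 cells x 2 factors
W = np.array([[0.5, 0.0, 0.1, 0.4],
              [0.1, 0.6, 0.3, 0.2]])  # 2 factors x 4 genes

cell_clusters = H.argmax(axis=1)  # largest loading in each row of H
gene_clusters = W.argmax(axis=0)  # largest loading in each column of W
```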

22 of 38

Applying NMF to a single cell RNA-seq dataset

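  • $X \in \mathbb{R}^{n \times m}$: original value matrix (n cells × m genes)
  • $H \in \mathbb{R}^{n \times k}$: factorized cell-side matrix (n cells × k factors), used to define cell clusters
  • $W \in \mathbb{R}^{k \times m}$: gene-side matrix (k factors × m genes)
  • $X \approx HW$: predicted matrix (n cells × m genes)
  • $k \ll n, m$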

23 of 38

Extensions to NMF

  • Joint NMF

  • Integrative NMF

24 of 38

Joint NMF

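$E_1 \approx H_1 W$, $E_2 \approx H_2 W$

  • E1, E2: cells × genes matrices measured over the same genes
  • H1, H2: dataset-specific cell factors (cells × factors)
  • W: gene factors shared by both datasets (factors × genes)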

25 of 38

Integrative NMF

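$E_i \approx H_i (W + V_i)$, for i = 1, 2, 3

  • Ei: cells × genes matrix for dataset i (same genes in every dataset)
  • Hi: cell factors for dataset i (cells × factors)
  • W: gene factors shared across datasets (factors × genes)
  • Vi: gene factors specific to dataset i (factors × genes)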
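
For reference, the iNMF objective as formulated in the LIGER paper (Welch et al. 2019) can be written as below. The tuning parameter $\lambda$ does not appear on the slide; it controls how strongly the dataset-specific terms $H_i V_i$ are penalized, so a larger $\lambda$ pushes more of the signal into the shared factors $W$.

```latex
\min_{H_i \ge 0,\; W \ge 0,\; V_i \ge 0}
  \sum_{i=1}^{N} \lVert E_i - H_i (W + V_i) \rVert_F^2
  + \lambda \sum_{i=1}^{N} \lVert H_i V_i \rVert_F^2
```

Setting every $V_i$ to zero recovers the joint NMF model, in which all datasets share the same gene factors.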

26 of 38

LIGER

  • Assumes datasets have both shared and dataset-specific lower dimensional representations

  • Uses integrative NMF (iNMF) to find the low dimensional space

27 of 38

LIGER key steps

  • iNMF to find low-dimensional cell and gene space
  • Cluster cells based on iNMF factors
  • Refine cell clusters further to handle divergent datasets

28 of 38

LIGER: Defining/refining cell clusters

  • Use Hi (cell representations) to define the k-nearest neighborhood of each cell within each dataset
  • Assign each cell i to a cluster based on its maximum factor loading
  • Compute the histogram of cluster assignments among the neighbors of i
  • Compute the Manhattan distance between the cluster histograms of pairs of cells
  • Connect two cells if their distance is less than a threshold t
  • Apply Louvain clustering to the resulting graph
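
The graph construction described above can be sketched as follows. This is a simplified illustration rather than the LIGER implementation: the function name is hypothetical, neighbors are found by Euclidean distance on the factor loadings within each cell's own dataset, histograms are normalized by k, and the final Louvain step is left to a graph clustering library.

```python
import numpy as np

def factor_neighborhood_graph(H, dataset, k, t):
    """Connect cells whose neighbors have similar cluster histograms.

    H: cells x factors loadings for all cells; dataset: dataset label per cell.
    Each dataset must contain more than k cells.
    Returns (cluster per cell, boolean adjacency matrix).
    """
    H = np.asarray(H, dtype=float)
    dataset = np.asarray(dataset)
    n, n_factors = H.shape
    cluster = H.argmax(axis=1)                    # max factor loading per cell
    hist = np.zeros((n, n_factors))
    for d in np.unique(dataset):
        idx = np.flatnonzero(dataset == d)        # cells of this dataset
        Hd = H[idx]
        d2 = ((Hd[:, None, :] - Hd[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(d2, np.inf)              # exclude the cell itself
        nbrs = np.argsort(d2, axis=1)[:, :k]      # positions within idx
        for row, cell in enumerate(idx):
            counts = np.bincount(cluster[idx[nbrs[row]]], minlength=n_factors)
            hist[cell] = counts / k               # histogram of neighbor clusters
    # Manhattan (L1) distance between every pair of histograms
    dist = np.abs(hist[:, None, :] - hist[None, :, :]).sum(axis=2)
    adj = dist < t
    np.fill_diagonal(adj, False)                  # no self-edges
    return cluster, adj

# Toy example: two datasets, each with two cells per factor
H = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
              [0.8, 0.05], [0.85, 0.1], [0.05, 0.8], [0.1, 0.95]])
dataset = np.array([0, 0, 0, 0, 1, 1, 1, 1])
cluster, adj = factor_neighborhood_graph(H, dataset, k=1, t=0.5)
```

Because the histograms are built from cluster labels rather than raw loadings, cells from different datasets can be connected even when their loadings differ in scale.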

29 of 38

Benchmarking LIGER

30 of 38

Applying LIGER to integrate multiple datasets

Cell clusters

Gene markers

Donor-specific and shared genes

31 of 38

Using LIGER to integrate scRNA-seq and spatial transcriptomics data

Distribution of gene expression per cell between two platforms

scRNA-seq: 71,000 cells

spatial: 2500 cells

32 of 38

Using LIGER to integrate scRNA-seq and spatial transcriptomics data

Spatial location of cell clusters

scRNA-seq

Spatial

33 of 38

Summary of algorithms

Algorithm | Dimensionality reduction technique | Graph creation                          | Cell-clustering
MNN       | PCA (optional)                     | Mutual nearest neighbor                 | -
SCANORAMA | SVD                                | Mutual nearest neighbor on factor space | k-means
LIGER     | iNMF                               | NMF + shared neighborhood               | Louvain/Leiden
SEURAT    | CCA                                | k nearest neighbor                      | Louvain

34 of 38

Take away points

  • We talked about two types of alignment problems
  • Aligning across species
    • Nodes are mismatched, but we have some sequence-based mapping that we wish to exploit
    • Algorithms:
      • FUSE, IsoRank
    • Algorithms differ based on: whether they are pairwise, whether they are global or local, and how they define the similarity matrix
  • Aligning across datasets
    • Nodes might have a substantial mismatch and datasets can only partially overlap
    • Algorithms differ based on
      • how they project into the shared space
      • whether they cluster cells or not

35 of 38

References

  • Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M. Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888-1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.
  • Barkas, Nikolas, Viktor Petukhov, Daria Nikolaeva, Yaroslav Lozinsky, Samuel Demharter, Konstantin Khodosevich, and Peter V. Kharchenko. 2019. “Joint Analysis of Heterogeneous Single-Cell RNA-Seq Dataset Collections.” Nature Methods 16 (8): 695–98. https://doi.org/10.1038/s41592-019-0466-z.
  • Hie, Brian, Bryan Bryson, and Bonnie Berger. 2019. “Efficient Integration of Heterogeneous Single-Cell Transcriptomes Using Scanorama.” Nature Biotechnology 37 (6): 685–91. https://doi.org/10.1038/s41587-019-0113-3.
  • Welch, Joshua D., Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, and Evan Z. Macosko. 2019. “Single-Cell Multi-Omic Integration Compares and Contrasts Features of Brain Cell Identity.” Cell 177 (7): 1873-1887.e17. https://doi.org/10.1016/j.cell.2019.05.006.

36 of 38

Singular Value Decomposition

By Cmglee - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=67853297

37 of 38

Canonical Correlation Analysis

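  • Given two data matrices X and Y, find vectors u and v that maximize the correlation between the projections Xu and Yv: $\max_{u, v} \operatorname{corr}(Xu, Yv)$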

38 of 38

Canonical correlation analysis

  • u and v are called the first correlation vectors
  • We can keep finding subsequent vectors that are orthogonal to the ones before
  • We can get the canonical correlations by performing SVD on the cross-correlation matrix
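
A minimal NumPy sketch of CCA via SVD is shown below. It assumes X and Y have the same number of rows (paired observations) and full column rank, and it whitens each matrix with its own thin SVD before taking the SVD of the cross-correlation matrix. The function name `cca_svd` is hypothetical.

```python
import numpy as np

def cca_svd(X, Y):
    """Canonical correlation analysis via SVD.

    X: (n, p), Y: (n, q), both assumed to have full column rank.
    Returns the canonical correlations rho (descending) and weight matrices
    U (p, r), V (q, r) whose columns are the successive vectors u and v.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Thin SVDs give orthonormal (whitened) bases Ux, Uy of the centered data
    Ux, Sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, Sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
    # Singular values of the whitened cross-correlation matrix are the
    # canonical correlations
    A, rho, Bt = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    # Map the whitened directions back to weights on the original variables
    U = Vxt.T @ (A / Sx[:, None])
    V = Vyt.T @ (Bt.T / Sy[:, None])
    return rho, U, V

# Toy example: Y is a noisy linear transformation of X
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = X @ rng.normal(size=(3, 3)) + 0.5 * rng.normal(size=(60, 3))
rho, U, V = cca_svd(X, Y)
first_corr = np.corrcoef(X @ U[:, 0], Y @ V[:, 0])[0, 1]
```

Here `first_corr`, the correlation between the first pair of projections, should match `rho[0]` up to numerical precision, and later columns of U and V give the subsequent pairs of vectors.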