1 of 38

Graph alignment: Applications to scRNA-seq data integration

Nov 5th, 2024

BMI/CS 775 Computational Network Biology, Fall 2024

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 38

Plan for this section

Global alignment of protein-interaction networks (Oct 31st, Nov 5th)

Matrix factorization: FUSE

Graph-based alignment for single cell omic datasets (Nov 5th)

3 of 38

Applications of network alignment

Alignment of scRNA-seq datasets

Alignment of molecular networks

PathBLAST
IsoRank
FUSE

SCANORAMA
CONOS
LIGER
scPopcorn

4 of 38

Goals for today

Overview of single cell omics

Approaches to align datasets

Mutual nearest neighbor alignment
LIGER

5 of 38

Single cell omics

Slide credit: 10x genomics

6 of 38

A single cell RNA-seq dataset

scRNA-seq dataset

genes (6k-20k)

cells (5k-1million)

7 of 38

Computational problems with scRNA-seq data

Pre-processing and normalization
Visualization
Cell type identification
Trajectory inference:

Single cell ordering

pseudo time
velocity

Cell population structure relationships

Network inference
Data integration

8 of 38

Computational tools for single cell omic datasets

Zappia, L. & Theis, F. J. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 22, 301 (2021).

9 of 38

Flavors of data integration

Integrate multiple scRNA-seq datasets, each representing a time point or treatment condition
Integrate multi-modal single cell datasets

scRNA-seq
scATAC-seq
spatial RNA-seq
CITE-seq (proteins)

Query new dataset with an existing tissue atlas

10 of 38

Overall Problem Definition

Given N single cell RNA-seq datasets, E₁, …, E_N

Do

Find a correspondence of the cells and cell types in one dataset to cells and cell types in another dataset

11 of 38

What makes integration of scRNA-seq datasets difficult?

Presence of batch effects

Unknown number of cell types

Varying number of cell types across datasets

Sparsity

..

12 of 38

Aligning high-dimensional datasets

Often high-dimensional datasets have a low-dimensional structure
Such low-dimensional structure can be approximated by a graph
Dataset alignment can be considered as an instance of graph alignment
Broadly speaking, this aims to construct low dimensional mappings between two or more datasets by aligning their low-dimensional spaces

Adapted from “Manifold alignment”, Wang et al 2010

13 of 38

Common approach to aligning scRNA-seq datasets

Define k-nearest neighbor graphs

Often needs dimensionality reduction
Needs appropriate distance metrics

Correct/align cells

(Optional) cluster
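The first step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure of any particular tool: the toy data, the choice of 5 principal components, Euclidean distance, and the helper name `knn_graph` are all assumptions made for the example.

```python
import numpy as np

def knn_graph(X, k):
    """Indices of the k nearest neighbors (Euclidean) of every row of X."""
    sq = np.sum(X ** 2, axis=1)
    # Squared distances: ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

# Toy data (hypothetical): 2 groups of 20 cells, 50 genes each.
rng = np.random.default_rng(0)
cells = np.vstack([rng.normal(0.0, 1.0, (20, 50)),
                   rng.normal(8.0, 1.0, (20, 50))])

# Dimensionality reduction: project onto the top 5 principal components.
centered = cells - cells.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ Vt[:5].T

neighbors = knn_graph(pcs, k=5)  # one row of neighbor indices per cell
```

Building the graph in the reduced space rather than on raw counts is one common way to make distances less sensitive to noise and sparsity; other distance metrics (for example cosine) can be swapped in at the `d2` step.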

14 of 38

Goals for today

Overview of single cell omics

Approaches to align datasets

Mutual nearest neighbor alignment
LIGER

15 of 38

Batch effect correction of scRNA-seq data using mutual nearest neighbors

L. Haghverdi, A. T. L. Lun, M. D. Morgan, and J. C. Marioni, “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors,” Nat. Biotechnol., vol. 36, no. 5, pp. 421–427, 2018, doi: 10.1038/nbt.4091.
Previous approaches relied on batch-correction methods developed for bulk data
This approach finds “mutual nearest neighbors” (MNN) between batches and uses them to align the datasets
Assumes there is at least one cell population that is present in both batches

16 of 38

MNN for batch correction

Consider two batches of data at a time

Find mutual nearest neighbors

Find correction vectors

Shift one dataset to another

Repeat for new batches
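The steps above can be sketched for one pair of batches as follows. This is a simplified illustration under stated assumptions: it applies a single averaged correction vector to the whole batch, whereas the published method computes smoothed, cell-specific vectors; the toy batches and the function names `mutual_nearest_neighbors` and `mnn_correct` are hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k):
    """Pairs (i, j): B[j] is among the k nearest B-cells of A[i], and vice versa."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * (A @ B.T))
    nn_in_b = np.argsort(d2, axis=1)[:, :k]    # per A-cell: nearest B-cells
    nn_in_a = np.argsort(d2.T, axis=1)[:, :k]  # per B-cell: nearest A-cells
    return [(i, j) for i in range(len(A)) for j in nn_in_b[i]
            if i in nn_in_a[j]]

def mnn_correct(ref, new, k=3):
    """Shift `new` onto `ref` using the mean MNN correction vector."""
    pairs = mutual_nearest_neighbors(ref, new, k)
    vectors = np.array([ref[i] - new[j] for i, j in pairs])
    shift = vectors.mean(axis=0)  # assumption: one global vector per batch
    return new + shift, pairs

def sample_batch(rng):
    """Toy batch: two populations in 2 'biological' dims, third dim is zero."""
    bio = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                     rng.normal(6.0, 0.5, (20, 2))])
    return np.hstack([bio, np.zeros((40, 1))])

rng = np.random.default_rng(1)
ref = sample_batch(rng)
# The batch effect is a shift along the third axis, orthogonal to the biology.
new = sample_batch(rng) + np.array([0.0, 0.0, 5.0])
corrected, pairs = mnn_correct(ref, new, k=3)
```

To add a third batch, the corrected batch would be merged with the reference and the same two functions applied again.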

17 of 38

Comparing MNN to other methods: simulated dataset

18 of 38

Comparing MNN to other methods: hematopoiesis differentiation dataset

19 of 38

Goals for today

Overview of single cell omics

Approaches to align datasets

Mutual nearest neighbor alignment
LIGER

20 of 38

Non-negative matrix factorization

Minimize ‖E − HW‖²_F s.t. H ≥ 0, W ≥ 0

E: cells × genes; H: cells × factors; W: factors × genes

Lee and Seung, Adv. Neur. In. 2001

Slide credit: Erika Da-Inn Lee
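A minimal sketch of NMF, assuming the squared-error objective and the multiplicative update rules of Lee and Seung, with E ≈ HW as on the slide. The function name `nmf`, the iteration count, and the small constant `eps` (added to avoid division by zero) are choices made for this example.

```python
import numpy as np

def nmf(E, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix E (n x m) as H (n x k) @ W (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = E.shape
    H = rng.random((n, k)) + eps
    W = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep H and W non-negative by construction.
        H *= (E @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ E) / (H.T @ H @ W + eps)
        errors.append(np.linalg.norm(E - H @ W) ** 2)
    return H, W, errors

# Toy input (hypothetical): a non-negative 30 x 12 matrix of rank 3.
rng = np.random.default_rng(0)
E = rng.random((30, 3)) @ rng.random((3, 12))
H, W, errors = nmf(E, k=3)
```

The list `errors` records the squared reconstruction error after each iteration, which is a simple way to check convergence.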

21 of 38

Using NMF factors for clustering

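Cell i is in cluster k if H_ik is the largest entry in row i of H (cells × factors)

Gene j is in cluster k if W_kj is the largest entry in column j of W (factors × genes)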
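Assuming the usual rule that each cell or gene goes to the factor on which it loads most heavily, cluster assignment reduces to an argmax. The small matrices below are made-up values for illustration, with H as cells × factors and W as factors × genes.

```python
import numpy as np

# Hypothetical factor matrices: 3 cells, 3 genes, 2 factors.
H = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])
W = np.array([[0.5, 0.1, 0.9],
              [0.4, 0.6, 0.2]])

cell_clusters = np.argmax(H, axis=1)  # largest loading in each row of H
gene_clusters = np.argmax(W, axis=0)  # largest loading in each column of W
```

Here cells 0 and 2 fall in cluster 0 and cell 1 in cluster 1; genes 0 and 2 fall in cluster 0 and gene 1 in cluster 1.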

22 of 38

Applying NMF to a single cell RNA-seq dataset

X ∈ ℝ^(n×m): original value matrix (n cells × m genes)

H ∈ ℝ^(n×k): factorized cell-side matrix (n cells × k factors), used to define cell clusters

W ∈ ℝ^(k×m): gene-side matrix (k factors × m genes)

HW: predicted matrix, X ≈ HW

k ≪ n, m

23 of 38

Extensions to NMF

Joint NMF

Integrative NMF

24 of 38

Joint NMF

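[Figure: the datasets E₁ and E₂ (cells × genes) are factorized as E₁ ≈ H₁W and E₂ ≈ H₂W, with dataset-specific cell matrices H₁, H₂ and one gene matrix W shared by both datasets]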

25 of 38

Integrative NMF

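E₁ ≈ H₁ × (W + V₁)

E₂ ≈ H₂ × (W + V₂)

E₃ ≈ H₃ × (W + V₃)

E_i: cells × genes; H_i: dataset-specific cell matrices; W: gene matrix shared across datasets; V_i: dataset-specific gene matrices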
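For reference, the factorization E_i ≈ H_i (W + V_i) can be written as an optimization problem. The form below is the penalized objective as used in the LIGER paper (Welch et al. 2019, listed in the References); the tuning parameter λ, which limits the size of the dataset-specific terms, does not appear on the slide and is included here as an assumption about the intended formulation.

```latex
\min_{W,\; V_i,\; H_i \,\ge\, 0} \;
\sum_{i=1}^{N} \left\lVert E_i - H_i \,(W + V_i) \right\rVert_F^2
\;+\; \lambda \sum_{i=1}^{N} \left\lVert H_i V_i \right\rVert_F^2
```

Setting every V_i to zero recovers joint NMF, where all datasets share a single W.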

26 of 38

LIGER

Assumes datasets have a shared and specific lower dimensional representation

Uses integrative NMF (iNMF) to find low dimensional space

27 of 38

LIGER key steps

iNMF to find low-dimensional cell and gene space
Cluster cells based on iNMF factors
Refine cell clusters further to handle divergent datasets

28 of 38

LIGER: Defining/refining cell clusters

Use H_i (cell representations) to define the k-nearest neighborhood of a cell per dataset
Assign each cell i the cluster based on max factor loading
Get the histogram of cluster assignments of neighbors of i
Compute Manhattan Distance between cluster histograms.
Connect two cells if their distance is less than t
Louvain graph clustering
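The first five steps above can be sketched as follows. This is an illustrative reading of the slide, not LIGER's actual implementation: the toy loadings, the threshold value, and the function names are assumptions, and the final Louvain step is left out (it would be run on the resulting adjacency matrix with a graph library).

```python
import numpy as np

def knn_indices(H, k):
    """k nearest neighbors (Euclidean) of each cell within one dataset."""
    sq = np.sum(H ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (H @ H.T)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_histograms(H, k):
    """Per cell: fraction of its k neighbors assigned to each factor."""
    labels = np.argmax(H, axis=1)  # cluster = factor with max loading
    n_factors = H.shape[1]
    hist = np.zeros((H.shape[0], n_factors))
    for i, nbrs in enumerate(knn_indices(H, k)):
        hist[i] = np.bincount(labels[nbrs], minlength=n_factors) / k
    return hist

def shared_neighborhood_graph(H_list, k, t):
    """Connect cells (across all datasets) whose histograms are within t."""
    hist = np.vstack([neighborhood_histograms(H, k) for H in H_list])
    manhattan = np.abs(hist[:, None, :] - hist[None, :, :]).sum(axis=2)
    adj = manhattan < t
    np.fill_diagonal(adj, False)
    return adj

rng = np.random.default_rng(0)

def toy_loadings(scale):
    """10 cells loading on factor 0, then 10 cells loading on factor 1."""
    a = np.column_stack([rng.uniform(0.8, 1.0, 10), rng.uniform(0.0, 0.1, 10)])
    b = np.column_stack([rng.uniform(0.0, 0.1, 10), rng.uniform(0.8, 1.0, 10)])
    return scale * np.vstack([a, b])

# Two datasets on different scales; neighborhoods are found per dataset.
H_list = [toy_loadings(1.0), toy_loadings(3.0)]
adj = shared_neighborhood_graph(H_list, k=3, t=0.5)
```

Because neighbors are found within each dataset and only the histograms are compared across datasets, cells from datasets on different scales can still be connected.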

29 of 38

Benchmarking LIGER

30 of 38

Applying LIGER to integrate multiple datasets

Cell clusters

Gene markers

Donor-specific and shared genes

31 of 38

Using LIGER to integrate scRNA-seq and spatial transcriptomics data

Distribution of gene expression per cell between two platforms

scRNA-seq: 71,000 cells

spatial: 2,500 cells

32 of 38

Using LIGER to integrate scRNA-seq and spatial transcriptomics data

Spatial location of cell clusters

scRNA-seq

Spatial

33 of 38

Summary of algorithms

Algorithm   Dimensionality reduction technique   Graph creation                            Cell clustering
MNN         PCA (optional)                       Mutual nearest neighbor                   -
SCANORAMA   SVD                                  Mutual nearest neighbor on factor space   k-means
LIGER       iNMF                                 NMF + shared neighborhood                 Louvain/Leiden
SEURAT      CCA                                  k-nearest neighbor                        Louvain

34 of 38

Take away points

We talked about two types of alignment problems
Aligning across species

Nodes are mismatched, but we have some sequence based mapping that we wish to exploit
Algorithms:

FUSE, IsoRank

Differ in whether the alignment is pairwise, global, or local, and in how the similarity matrix is defined

Aligning across datasets

Nodes might have a substantial mismatch and datasets can only partially overlap
Algorithms differ based on

how they project into the shared space
whether they cluster cells or not

35 of 38

References

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M. Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888-1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.
Barkas, Nikolas, Viktor Petukhov, Daria Nikolaeva, Yaroslav Lozinsky, Samuel Demharter, Konstantin Khodosevich, and Peter V. Kharchenko. 2019. “Joint Analysis of Heterogeneous Single-Cell RNA-Seq Dataset Collections.” Nature Methods 16 (8): 695–98. https://doi.org/10.1038/s41592-019-0466-z.
Hie, Brian, Bryan Bryson, and Bonnie Berger. 2019. “Efficient Integration of Heterogeneous Single-Cell Transcriptomes Using Scanorama.” Nature Biotechnology 37 (6): 685–91. https://doi.org/10.1038/s41587-019-0113-3.
Welch, Joshua D., Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, and Evan Z. Macosko. 2019. “Single-Cell Multi-Omic Integration Compares and Contrasts Features of Brain Cell Identity.” Cell 177 (7): 1873-1887.e17. https://doi.org/10.1016/j.cell.2019.05.006.

36 of 38

Singular Value Decomposition

By Cmglee - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=67853297

37 of 38

Canonical Correlation Analysis

38 of 38

Canonical correlation analysis

u and v are called the first pair of canonical correlation vectors
We can keep finding subsequent pairs of vectors whose projections are uncorrelated with the ones before
We can get the canonical correlations by performing SVD on the cross-correlation matrix (the cross-covariance matrix after whitening each dataset)
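The SVD route can be sketched in NumPy. This is a minimal illustration assuming full-rank covariance matrices and no regularization; the function names `inv_sqrt` and `cca` and the toy data are made up for the example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(X, Y):
    """Canonical vectors (columns of A, B) and canonical correlations s."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance matrix.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ U, Ky @ Vt.T, s

# Toy data (hypothetical): X and Y share one latent signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     rng.normal(size=500),
                     rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500),
                     z + 0.1 * rng.normal(size=500)])
A, B, s = cca(X, Y)
```

The first columns of `A` and `B` play the role of u and v, and `s[0]` is the correlation between the centered data projected onto them.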