1 of 6

State of PCA

Kasper D. Hansen�Johns Hopkins University

2 of 6

Goal

Fast, scalable PCA of (large) single-cell datasets stored on-disk.�Easy use of multiple cores/nodes.

TLDR: we’re not there yet. BiocSingular is a substantial step forward.

3 of 6

Algorithms

A complexity is on-disk storage. This implies that performance is a combination of computational load (flops) and data access (pattern).

In all algorithms, the computational bottleneck reduces to matrix-matrix or matrix-vector multiplication. Sparsity can be used (incl. centering/scaling with potential loss of precision).

We have focused on algorithms which computes the “correct” first k eigenvectors (there are also algorithms which computes more loose approximations).

4 of 6

Algorithms, cont’d

Some notes

  • irlba: least computation, touches the data more (Ay, yA)
  • rsvd: more computation, touches the data less
  • Computing the (self) cross-product of the data (very useful for tall/wide matrices): most computation, touches the data one. Embarrassingly parallel.

On in-memory matrices, irlba is best (according to our benchmarks). But this becomes much harder to reason about when coupled with data access (patterns).

Example: on the PBMC 8k dataset (small) irlba requires 34 matrix-vector products. This implies: (1) all data is accessed 34 times. (2) ~240x improvements in computations.

5 of 6

Data access considerations

[ The stack: BiocSingular -> DelayedArray -> HDF5 ]

HDF5 can be optimized for different data access patterns (chunking).

HDF5 does not have native sparsity (TENx data format is a hack).

The DelayedArray framework adds additional overhead.

At the moment, together these issues impose a substantial penalty.

6 of 6

Benchmarking and notes

Started a benchmarking github repos to exhaustively measure performance across implementations and systems.

Goal is to include “fastest” custom implementations for lower bound, to identify gaps in performance between our stack and lower bound.

Github: kasperdanielhansen/bench_pca (not ready yet)