State of PCA
Kasper D. Hansen, Johns Hopkins University
Goal
Fast, scalable PCA of (large) single-cell datasets stored on disk. Easy use of multiple cores/nodes.
TLDR: we’re not there yet. BiocSingular is a substantial step forward.
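For orientation, a minimal sketch of the intended usage, assuming an existing HDF5 file with a genes x cells count matrix (the file name "counts.h5" and dataset name "counts" are hypothetical): PCA via BiocSingular with the data staying on disk.

```r
## Minimal sketch: PCA of an HDF5-backed count matrix via BiocSingular.
## File and dataset names are hypothetical.
library(HDF5Array)
library(BiocSingular)

counts <- HDF5Array("counts.h5", "counts")   # genes x cells, on disk
logcounts <- log1p(t(counts))                # delayed ops; cells x genes

## First 50 PCs using the IRLBA backend; blocks are read from disk as needed.
pca <- runPCA(logcounts, rank = 50, BSPARAM = IrlbaParam())
dim(pca$x)                                   # cells x 50 matrix of PC scores
```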
Algorithms
A complication is on-disk storage. This means that performance is a combination of computational load (flops) and data access (patterns).
In all algorithms, the computational bottleneck reduces to matrix-matrix or matrix-vector multiplication. Sparsity can be exploited, even with centering/scaling (at a potential loss of precision); see the sketch below.
We have focused on algorithms that compute the “correct” first k eigenvectors (there are also algorithms that compute looser approximations).
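To illustrate the centering point: explicitly centering a sparse matrix densifies it, so the centering is instead folded into each matrix-vector product. A toy sketch (not the actual implementation in any of the packages above):

```r
## Sketch of implicit centering: for column means c and a vector v,
##   (X - 1 c') v  =  X v - (c' v) 1,
## so the dense centered matrix X - 1 c' never has to be formed.
library(Matrix)

set.seed(1)
X <- rsparsematrix(1000, 200, density = 0.05)   # toy sparse matrix
cmeans <- colMeans(X)

centered_matvec <- function(v) {
  as.numeric(X %*% v) - sum(cmeans * v)         # subtract the scalar c'v from every entry
}

## This is the multiplication kernel an iterative solver needs; X stays sparse.
```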
Algorithms, cont’d
Some notes
For in-memory matrices, irlba is the best performer (according to our benchmarks). But this becomes much harder to reason about once coupled with data access (patterns).
Example: on the (small) PBMC 8k dataset, irlba requires 34 matrix-vector products. This implies (1) all data is accessed 34 times, and (2) roughly a 240x reduction in computation (flops).
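For reference, irlba reports the number of matrix(-transpose)-vector products it performed, which for an on-disk backend translates directly into passes over the data. A toy sketch (random matrix standing in for real data):

```r
## Sketch: irlba reports how many matrix-vector products it used; for an
## out-of-memory backend this is the number of passes over the data.
library(irlba)

set.seed(1)
A <- matrix(rnorm(2000 * 500), nrow = 2000)   # random stand-in for a real dataset
fit <- irlba(A, nv = 10)                      # first 10 singular vectors
fit$mprod                                     # total matrix-vector products used
```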
Data access considerations
[ The stack: BiocSingular -> DelayedArray -> HDF5 ]
HDF5 can be optimized for different data access patterns (chunking).
HDF5 does not have native sparsity (the TENx data format is a hack).
The DelayedArray framework adds additional overhead.
At the moment, these issues together impose a substantial performance penalty.
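To illustrate the chunking point above: the chunk geometry is fixed when the HDF5 dataset is written, so it can be matched to the expected access pattern. A sketch using HDF5Array (file names are hypothetical):

```r
## Sketch: chunk layout is chosen at write time and can be matched to the
## expected access pattern. File names are hypothetical.
library(HDF5Array)

x <- matrix(rpois(5000 * 100, lambda = 1), nrow = 5000)   # toy counts

## Column chunks: each chunk holds one full column, favouring column-wise access.
colwise <- writeHDF5Array(x, filepath = "counts_col.h5", name = "counts",
                          chunkdim = c(5000, 1))

## Square-ish chunks: a compromise when both row and column access occur.
blocky <- writeHDF5Array(x, filepath = "counts_blk.h5", name = "counts",
                         chunkdim = c(500, 50))

chunkdim(colwise)   # inspect the resulting chunk geometry
```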
Benchmarking and notes
We have started a benchmarking GitHub repo to exhaustively measure performance across implementations and systems.
The goal is to include the “fastest” custom implementations as a lower bound, to identify performance gaps between our stack and that lower bound.
GitHub: kasperdanielhansen/bench_pca (not ready yet)