1 of 15

Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Single-cell Omics Data

Big Data 2024 HPC-BOD Workshop

Seth Wolfgang, Skyler Ruiter, Marc Tunnel, Timothy Triche Jr., Erin Carrier, Zachary DeBruine (debruinz@gvsu.edu)

2 of 15

Sparse matrix compression formats do not leverage redundancy of non-zeros to increase compression ratios

  • Many types of sparse matrices contain highly redundant values due to discrete distributions that are enriched for zeros.

  • More effective compression of data (e.g., single-cell transcriptomics) keeps workloads in-core and so helps avoid distributed computing, reduces file size, and reduces communication latency.

3 of 15

Conventional coordinate-based storage of sparse data
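
As a point of reference for the formats that follow, here is a minimal sketch of conventional CSC storage. The function name `dense_to_csc` and the toy matrix are illustrative assumptions, not taken from the slides.

```python
def dense_to_csc(dense):
    """Convert a row-major list of lists into CSC arrays.

    Returns (values, row_indices, col_pointers), where column j owns the
    slice col_pointers[j]:col_pointers[j + 1] of the other two arrays.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    values, row_indices, col_pointers = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            v = dense[i][j]
            if v != 0:
                values.append(v)       # every non-zero is stored, even repeats
                row_indices.append(i)  # one row index per non-zero
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


dense = [
    [1, 0, 2],
    [1, 0, 0],
    [0, 0, 2],
    [1, 3, 2],
]
values, row_indices, col_pointers = dense_to_csc(dense)
```

In this toy matrix the value 1 is stored three times in column 0 and the value 2 three times in column 2, which is the redundancy that CSC leaves uncompressed.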

4 of 15

Adding value-based compression to CSC
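
A hedged sketch of the value-compression idea for a single column: store each unique non-zero value once, with a count and the row indices where it occurs. The names `column_to_vcsc` and `vcsc_to_column`, and the choice to sort unique values, are assumptions made for illustration and may differ from the authors' layout.

```python
def column_to_vcsc(column):
    """Group one dense column by value.

    Returns (values, counts, indices): each unique non-zero value appears once
    in `values`, `counts[k]` says how many rows hold `values[k]`, and `indices`
    lists those rows, grouped value by value.
    """
    groups = {}
    for row, v in enumerate(column):
        if v != 0:
            groups.setdefault(v, []).append(row)
    values = sorted(groups)  # assumption: any fixed order would work
    counts = [len(groups[v]) for v in values]
    indices = [row for v in values for row in groups[v]]
    return values, counts, indices


def vcsc_to_column(values, counts, indices, n_rows):
    """Rebuild the dense column to check that nothing was lost."""
    column = [0] * n_rows
    pos = 0
    for v, c in zip(values, counts):
        for row in indices[pos:pos + c]:
            column[row] = v
        pos += c
    return column


column = [1, 1, 0, 1, 2, 0, 2]
values, counts, indices = column_to_vcsc(column)
```

Here five non-zeros collapse to two stored values; the number of stored indices is unchanged.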

5 of 15

Additional compression of indices with bytepacking
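
One plausible way to bytepack the indices of a single value's run is to delta-encode them and then store every delta at the smallest byte width that fits the largest delta. The sketch below assumes that scheme; the helper names, the little-endian layout, and the candidate widths are illustrative choices, not details confirmed by the slides.

```python
def delta_encode(indices):
    """Replace each sorted index with its distance from the previous one."""
    deltas, prev = [], 0
    for idx in indices:
        deltas.append(idx - prev)
        prev = idx
    return deltas


def byte_width(max_value):
    """Smallest of 1, 2, 4, or 8 bytes that can hold max_value."""
    for width in (1, 2, 4, 8):
        if max_value < 1 << (8 * width):
            return width
    raise ValueError("value does not fit in 8 bytes")


def pack_run(indices):
    """Delta-encode a run of sorted indices and pack it at one fixed width."""
    deltas = delta_encode(indices)
    if not deltas:
        return 1, b""
    width = byte_width(max(deltas))
    payload = b"".join(d.to_bytes(width, "little") for d in deltas)
    return width, payload


def unpack_run(width, payload):
    """Invert pack_run by accumulating the stored deltas."""
    indices, running = [], 0
    for off in range(0, len(payload), width):
        running += int.from_bytes(payload[off:off + width], "little")
        indices.append(running)
    return indices


indices = [5, 200, 410, 600]
width, payload = pack_run(indices)
```

In this example the raw indices would need 2 bytes each because 600 exceeds 255, while every delta fits in a single byte.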

6 of 15

A measure of redundancy to quantify compression capability

We define the redundancy of the i-th column as:

This metric captures the magnitude of the difference between the number of non-zero elements and the number of unique elements.
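
The equation itself did not survive in this copy of the slides. As a sketch only, the code below assumes one formulation consistent with the description: redundancy of a column is 1 minus the ratio of unique non-zero values to non-zero elements, and is undefined for an empty column. The function name `column_redundancy` is hypothetical.

```python
def column_redundancy(column):
    """Assumed definition: 1 - (unique non-zeros / non-zeros).

    Returns None for a column with no non-zeros, where the ratio is undefined.
    A value near 1 means many repeats; 0 means every non-zero is distinct.
    """
    nonzeros = [v for v in column if v != 0]
    if not nonzeros:
        return None
    return 1 - len(set(nonzeros)) / len(nonzeros)


repetitive = column_redundancy([1, 1, 0, 1, 2, 0, 2])  # 2 unique among 5 non-zeros
distinct = column_redundancy([4, 0, 7, 9])              # 3 unique among 3 non-zeros
empty = column_redundancy([0, 0, 0])
```

Under this assumed definition, the first column scores about 0.6 and the second scores 0.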

7 of 15

Unlike CSC, VCSC and IVCSC compress as a function of both sparsity and value-wise redundancy
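
A rough back-of-the-envelope model can illustrate this dependence. It assumes 4-byte values, indices, and counts, and ignores column pointers and other per-column metadata, so the numbers are indicative only; both function names are hypothetical.

```python
def csc_column_bytes(nnz, value_bytes=4, index_bytes=4):
    """CSC stores one value and one row index per non-zero."""
    return nnz * (value_bytes + index_bytes)


def vcsc_column_bytes(nnz, n_unique, value_bytes=4, index_bytes=4, count_bytes=4):
    """Value-compressed column: one value and one count per unique value,
    plus one row index per non-zero."""
    return n_unique * (value_bytes + count_bytes) + nnz * index_bytes


nnz = 1000
csc = csc_column_bytes(nnz)                               # depends on nnz only
vcsc_redundant = vcsc_column_bytes(nnz, n_unique=10)      # highly redundant column
vcsc_distinct = vcsc_column_bytes(nnz, n_unique=nnz)      # no repeated values
```

Under these assumptions the CSC size is fixed by the number of non-zeros, while the value-compressed size shrinks as values repeat and exceeds CSC when every value is distinct.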

8 of 15

Performance of CSC, VCSC, and IVCSC compression on randomly generated redundant sparse matrices

9 of 15

Performance of CSC, VCSC, and IVCSC compression on real sparse matrices

10 of 15

Performance of CSC, VCSC, and IVCSC compression on single-cell transcriptomics datasets

11 of 15

Performance of CSC, VCSC, and IVCSC constructor and iterator

12 of 15

Performance of BLAS routines with CSC, VCSC, and IVCSC

13 of 15

VCSC and IVCSC enable in-core modeling of large single-cell datasets

14 of 15

Conclusion

  • CSC does not leverage redundancy to compress sparse matrices

  • VCSC is a new sparse matrix format that leverages redundancy to improve compression while retaining high performance

  • IVCSC is an extension of VCSC that balances high performance with deeper compression using index bytepacking
