1 of 17

Benchmarking on-disk file formats for single-cell data

Mike Jiang, Raphael Gottardo

2 of 17

Summary

  • Benchmark HDF5 against different formats
    • HDF5 vs sqlite
    • HDF5 vs Kita
    • HDF5 vs tiledb
  • Ways to improve the HDF5 format
    • Hybrid chunking shapes
    • Use lz4 for data compression
  • mbenchmark R package
    • Agnostic about underlying matrix format
    • User-friendly API for benchmarking common matrix operations
  • Next steps

3 of 17

Benchmark HDF5 against different formats -- H5 vs SQLite

  • Simulated data
    • Beta distribution
    • Dims
      • nGenes <- 3e4
      • nCells <- c(1e3, 4e3, 7e3, 2e4, 5e4, 1e5)
    • Sparsity
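
The simulation setup above can be sketched roughly as follows. This is a minimal illustration, not the authors' generator: the Beta shape parameters, the sparsity value, and the function name `simulate_sparse_matrix` are all assumptions, and the dimensions are scaled down so it runs quickly.

```python
import random

def simulate_sparse_matrix(n_genes, n_cells, sparsity, alpha=2.0, beta=5.0, seed=0):
    """Return the non-zero entries as a list of (gene, cell, value) triples.

    Each entry is zero with probability `sparsity`; otherwise its value is
    drawn from Beta(alpha, beta). All parameter values are illustrative.
    """
    rng = random.Random(seed)
    entries = []
    for gene in range(n_genes):
        for cell in range(n_cells):
            if rng.random() >= sparsity:
                entries.append((gene, cell, rng.betavariate(alpha, beta)))
    return entries

# Scaled-down dimensions (the slide uses 3e4 genes and up to 1e5 cells)
N_GENES, N_CELLS = 300, 100
coo = simulate_sparse_matrix(N_GENES, N_CELLS, sparsity=0.9)
density = len(coo) / (N_GENES * N_CELLS)
print(f"{len(coo)} non-zero entries, density {density:.3f}")
```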

100s to 1000s of biological samples per study (data set).

Each biological sample (often, not always) analyzed independently.

Primary processing transforms raw data to HDF5 cube of parameters x samples x cells.

Analysis requires frequent access to subsets of this data.

Optimizations: parallelization - one sample per file.

Avoid overhead of transferring entire data sets - HDF5 object store?

4 of 17

Benchmark HDF5 against different formats -- H5 vs SQLite

  • Formats
    • In-memory
      • sparse
      • CSR (compressed sparse row)
    • On-disk
      • H5
        • 2d dense compressed
        • Chunked by row (gene)
        • IO strategy
          • H5 hyperslab
          • Chunked read
      • Sqlite
        • COO (Coordinate list)
  • IO pattern
    • 1e3 random rows
    • 1e3 random cols
    • 1e3 x 1e3 sub-mat
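
As a rough sketch of the SQLite side of this comparison, a COO layout and the three IO patterns might look like the following. The table name, column names, and indexes are assumptions made for illustration, and the matrix is far smaller than the benchmarked one.

```python
import random
import sqlite3

rng = random.Random(0)
n_genes, n_cells = 200, 100  # scaled down for illustration

conn = sqlite3.connect(":memory:")
# COO layout: one table row per non-zero matrix entry (assumed schema)
conn.execute(
    "CREATE TABLE mat (gene INTEGER, cell INTEGER, value REAL, "
    "PRIMARY KEY (gene, cell))"
)
conn.execute("CREATE INDEX idx_cell ON mat (cell)")

entries = [
    (g, c, rng.random())
    for g in range(n_genes)
    for c in range(n_cells)
    if rng.random() < 0.1
]
conn.executemany("INSERT INTO mat VALUES (?, ?, ?)", entries)

def select(conn, genes=None, cells=None):
    """Fetch non-zero entries restricted to the given gene and/or cell indices."""
    clauses, params = [], []
    for column, idx in (("gene", genes), ("cell", cells)):
        if idx is not None:
            placeholders = ",".join("?" * len(idx))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(idx)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    return conn.execute("SELECT gene, cell, value FROM mat" + where, params).fetchall()

rows = rng.sample(range(n_genes), 10)   # random rows
cols = rng.sample(range(n_cells), 10)   # random cols
by_row = select(conn, genes=rows)
by_col = select(conn, cells=cols)
sub = select(conn, genes=rows, cells=cols)  # sub-matrix
```

Only non-zero entries are returned, so a caller would still need to scatter them into a dense or sparse result.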

5 of 17

Benchmark HDF5 against different formats -- H5 vs SQLite

Conclusions:

  • H5 hyperslab selection is slow
  • SQLite performs well on small sub-matrix selections but poorly on whole-row/column selections

6 of 17

Benchmark HDF5 against different formats -- H5-chunked vs SQLite-blobs

  • SQLite
    • Store each row (gene) as a single binary blob (similar to an H5 chunk)
    • Improved row/col-wise selection
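
A minimal sketch of the blob layout described above, assuming one table row per gene holding the packed values for all cells. The schema, the use of `zlib` compression, and the helper names are assumptions for illustration, not the benchmarked implementation.

```python
import sqlite3
import zlib
from array import array

n_cells = 50
# Toy data: mostly zeros, one dense list of floats per gene
rows = {
    g: [float(g + c) if (g + c) % 5 == 0 else 0.0 for c in range(n_cells)]
    for g in range(20)
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_rows (gene INTEGER PRIMARY KEY, data BLOB)")
for gene, values in rows.items():
    # Pack the row as raw doubles, then compress (compression is an assumption)
    blob = zlib.compress(array("d", values).tobytes())
    conn.execute("INSERT INTO gene_rows VALUES (?, ?)", (gene, blob))

def read_row(conn, gene):
    """Fetch and unpack one whole gene row."""
    (blob,) = conn.execute(
        "SELECT data FROM gene_rows WHERE gene = ?", (gene,)
    ).fetchone()
    out = array("d")
    out.frombytes(zlib.decompress(blob))
    return out.tolist()

def read_submatrix(conn, genes, cells):
    """Read each requested row once, then pick the requested cells in memory."""
    result = []
    for gene in genes:
        row = read_row(conn, gene)
        result.append([row[c] for c in cells])
    return result

sub = read_submatrix(conn, [1, 2], [0, 4])
```

As with an H5 chunk, the whole row is read and decoded even when only a few cells are needed.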

7 of 17

Benchmark HDF5 against different formats -- H5 vs HSDS (Kita)

  • Data (482911 cells x 22 channels)
  • Chunking dims 120727 x 5
  • Access pattern
    • 2 channels (random)
    • cells = list(range(10, 10000,1000)) (random)
  • H5 access mode
    • Default (point-selection, i.e. through h5 hyperslab)
    • Non-point-selection (i.e. chunked read)
  • Conclusion
    • Kita is more scalable than native H5 for random slicing, regardless of network latency
  • TODO
    • Not yet tested against a larger data set
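
To make the access pattern above concrete, the sketch below counts how many chunks the selection intersects, using the data and chunk dimensions from the slide. The helper name `chunks_touched` and the random seed are assumptions, and it models chunk geometry only, not HDF5 or HSDS behaviour.

```python
import math
import random

SHAPE = (482911, 22)   # cells x channels, from the slide
CHUNK = (120727, 5)    # chunk dims, from the slide

def chunks_touched(row_idx, col_idx, chunk=CHUNK):
    """Return the set of (chunk_row, chunk_col) ids that a selection intersects."""
    return {(r // chunk[0], c // chunk[1]) for r in row_idx for c in col_idx}

cells = list(range(10, 10000, 1000))                     # access pattern from the slide
channels = random.Random(0).sample(range(SHAPE[1]), 2)   # 2 random channels

touched = chunks_touched(cells, channels)
total_chunks = math.ceil(SHAPE[0] / CHUNK[0]) * math.ceil(SHAPE[1] / CHUNK[1])
requested = len(cells) * len(channels)
print(f"{requested} values requested, {len(touched)} of {total_chunks} chunks touched")
```

Under this geometry, the 20 requested values all fall in the first row of chunks, so a chunked read would fetch one or two full chunks to serve them.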

8 of 17

Benchmark HDF5 against different formats -- H5 vs tiledb

  • Data (10000 cells x 27998 genes)
  • Chunking dims 10000 x 1
  • tiledb
    • Implemented tiledb backend for DelayedArray
    • Storage mode
      • Dense
      • Sparse
  • Access pattern
    • Contiguous block selection
      • 1000 x 1000
    • TileDB performs better than H5

9 of 17

Benchmark HDF5 against different formats -- H5 vs tiledb

  • Access pattern
    • Random slicing (1000 x 1000)
    • No native support in the TileDB C library yet, so it is achieved through chunked reads
    • TileDB is comparable to H5
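
One way to emulate random slicing on top of chunked reads is sketched below: group the requested indices by chunk, read each intersecting chunk once, and pick the values in memory. `read_chunk` is a hypothetical callback standing in for whatever block read the storage layer provides. This is not the TileDB API or the backend used in the benchmark.

```python
from collections import defaultdict

def random_slice(read_chunk, chunk_shape, row_idx, col_idx):
    """Assemble mat[row_idx, col_idx] by reading each intersecting chunk once."""
    ch, cw = chunk_shape
    # Map each chunk id to the (output position, position inside chunk) pairs it serves
    rows_by_chunk = defaultdict(list)
    for out_r, r in enumerate(row_idx):
        rows_by_chunk[r // ch].append((out_r, r % ch))
    cols_by_chunk = defaultdict(list)
    for out_c, c in enumerate(col_idx):
        cols_by_chunk[c // cw].append((out_c, c % cw))

    result = [[None] * len(col_idx) for _ in row_idx]
    for i, row_hits in rows_by_chunk.items():
        for j, col_hits in cols_by_chunk.items():
            block = read_chunk(i, j)  # one read per intersecting chunk
            for out_r, local_r in row_hits:
                for out_c, local_c in col_hits:
                    result[out_r][out_c] = block[local_r][local_c]
    return result

# Toy in-memory "storage": a 10 x 12 matrix served in 4 x 5 chunks
matrix = [[r * 100 + c for c in range(12)] for r in range(10)]
reads = []

def read_chunk(i, j, ch=4, cw=5):
    reads.append((i, j))  # record every chunk read
    return [row[j * cw:(j + 1) * cw] for row in matrix[i * ch:(i + 1) * ch]]

rows, cols = [9, 0, 5], [11, 2, 3]
sub = random_slice(read_chunk, (4, 5), rows, cols)
```

The cost is driven by the number of distinct chunks the indices fall into, not by the number of values requested.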

10 of 17

Ways to improve the HDF5 format -- Hybrid chunking shapes

  • TenX 1M data (1306127 cells x 27998 genes)
  • Chunking dims
    • By cell
    • By gene
    • Hybrid (copies of both)
      • Dispatches each query to the appropriate copy, based on the shape of the requested sub-matrix, to minimize disk I/O
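
The dispatch idea can be sketched as follows. It assumes each copy uses one chunk per cell or one chunk per gene, and it scores a query by the number of elements that would be read. The slide gives neither the actual chunk dimensions nor the dispatch rule, so both are assumptions here, as are the names `elements_read` and `pick_copy`.

```python
import math

N_CELLS, N_GENES = 1306127, 27998   # TenX 1M dimensions from the slide

# Assumed layouts (cells x genes): one chunk per cell, or one chunk per gene
COPIES = {
    "by_cell": (1, N_GENES),
    "by_gene": (N_CELLS, 1),
}

def elements_read(n_cells, n_genes, chunk):
    """Elements fetched for a chunk-aligned n_cells x n_genes block."""
    n_chunks = math.ceil(n_cells / chunk[0]) * math.ceil(n_genes / chunk[1])
    return n_chunks * chunk[0] * chunk[1]

def pick_copy(n_cells, n_genes, copies=COPIES):
    """Return the name of the copy that would read the fewest elements."""
    return min(copies, key=lambda name: elements_read(n_cells, n_genes, copies[name]))

few_cells_all_genes = pick_copy(1000, N_GENES)
all_cells_few_genes = pick_copy(N_CELLS, 10)
print(few_cells_all_genes, all_cells_few_genes)
```

The trade-off is that the hybrid layout stores the data twice in exchange for cheap reads along both axes.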

11 of 17

Ways to improve the HDF5 format -- gzip vs lz4

  • Both use their default compression parameters
  • Lz4 consistently performs better than gzip

12 of 17

Ways to improve the HDF5 format -- gzip vs lz4

  • Disk usage is comparable

13 of 17

mbenchmark

  • A generic tool that can benchmark any matrix format without rewriting a single line of benchmarking code
    • as long as the matrix object provides two APIs
      • '[' indexing method
      • 'dim' accessor
  • Benchmarking the common matrix operations
    • Subsetting
      • region_selection: contiguous block selection
      • random_slicing: non-contiguous slab selection
    • Traversing
      • rowSums
      • colSums
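
The format-agnostic design can be illustrated with a small Python analogue. It is not the mbenchmark API: `ListMatrix`, `benchmark`, and the two task functions are hypothetical, with `shape` and tuple indexing standing in for R's `dim` and `[`.

```python
import random
import time

class ListMatrix:
    """Toy backend exposing only `shape` and tuple indexing."""
    def __init__(self, data):
        self._data = data
        self.shape = (len(data), len(data[0]))

    def __getitem__(self, key):
        rows, cols = key
        return [[self._data[r][c] for c in cols] for r in rows]

def region_selection(mat, size, rng):
    """Contiguous size x size block at a random offset."""
    r0 = rng.randrange(mat.shape[0] - size + 1)
    c0 = rng.randrange(mat.shape[1] - size + 1)
    return mat[range(r0, r0 + size), range(c0, c0 + size)]

def random_slicing(mat, size, rng):
    """Non-contiguous selection of `size` random rows and columns."""
    rows = sorted(rng.sample(range(mat.shape[0]), size))
    cols = sorted(rng.sample(range(mat.shape[1]), size))
    return mat[rows, cols]

def benchmark(mats, task, size=10, times=3, seed=0):
    """Time `task` on every named matrix; returns {name: [seconds, ...]}."""
    results = {}
    for name, mat in mats.items():
        rng = random.Random(seed)  # identical selections for every backend
        timings = []
        for _ in range(times):
            start = time.perf_counter()
            task(mat, size, rng)
            timings.append(time.perf_counter() - start)
        results[name] = timings
    return results

mat = ListMatrix([[r + c for c in range(50)] for r in range(40)])
res = benchmark({"list": mat}, random_slicing)
block = region_selection(mat, 5, random.Random(1))
```

Because the benchmark code touches only those two members, any backend that implements them can be added to `mats` without changing the timing loop.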

14 of 17

mbenchmark

  • Example for tenX data (10000 x 27998)
  • DelayedArray is natively supported; to benchmark other formats, simply extend DelayedArray with the corresponding backend
    • h5 → HDF5Array
    • bigmemory → bmArray
    • ff → ffArray
    • matter → mattArray
  • Simple one-liner benchmarking call

> res <- mbenchmark(mat.list, type = "subsetting", times = 3)

  • Quick plot

> autoplot(res)

15 of 17

mbenchmark -- Dockerized benchmarker

  • Download benchmarker folder that contains Dockerfile
  • Build the docker image by running command
    • $ docker build -t benchmarker_r .
  • Run the benchmarker
    • $ docker run -v /path/to/data:/data -v /path/to/output:/output benchmarker_r --data-path=/data --output-path=/output --max-percent-of-rows=0.01 --verbose
  • A similar Docker image and command are available for the data converter

16 of 17

Next steps

  • More benchmarks on TileDB
    • Waiting for native support of random slicing in its new release
    • Enable parallel I/O

17 of 17

Additional work

  • Improve sce/DelayedArray (e.g. HDF5Array, tiledbArray, ...) support in MAST (Finak et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278.)
  • Work on BioC paper (effort led by Rob Amezquita and Stephanie Hicks)