2 of 17

Summary

Benchmark HDF5 against different formats

HDF5 vs sqlite
HDF5 vs Kita
HDF5 vs tiledb

Ways to improve the HDF5 format

Hybrid chunking shapes
Use lz4 for data compression

mbenchmark R package

Agnostic about underlying matrix format
User-friendly API for benchmarking common matrix operations

Next steps

3 of 17

Benchmark HDF5 against different formats -- H5 vs Sqlite

Simulated data

Beta distribution
Dims

nGenes <- 3e4
nCells <- c(1e3, 4e3, 7e3, 2e4, 5e4, 1e5)

Sparsity
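To make the simulation setup concrete, here is a minimal sketch of generating a sparse matrix whose non-zero entries follow a Beta distribution. The function name, the Beta shape parameters, the sparsity level, and the small dimensions are all illustrative assumptions; the slide does not state the values actually used.

```python
import random

def simulate_sparse_matrix(n_genes, n_cells, sparsity, alpha=2.0, beta=5.0, seed=0):
    """Return COO triplets (gene, cell, value) for a simulated matrix.

    Each entry is zero with probability `sparsity`; otherwise it is a draw
    from Beta(alpha, beta). All parameter values here are assumptions made
    for illustration, not the benchmark's real settings.
    """
    rng = random.Random(seed)
    triplets = []
    for i in range(n_genes):
        for j in range(n_cells):
            if rng.random() >= sparsity:
                triplets.append((i, j, rng.betavariate(alpha, beta)))
    return triplets

# Dimensions are scaled far down from nGenes = 3e4 so the sketch runs quickly.
coo = simulate_sparse_matrix(n_genes=300, n_cells=100, sparsity=0.9)
density = len(coo) / (300 * 100)
print(f"non-zero entries: {len(coo)}, density: {density:.3f}")
```

Only the non-zero entries are kept, so the same triplets can feed either a sparse in-memory structure or an on-disk coordinate table.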

100s to 1000s of biological samples per study (data set).

Each biological sample (often, not always) analyzed independently.

Primary processing transforms raw data to HDF5 cube of parameters x samples x cells.

Analysis requires frequent access to subsets of this data.

Optimizations: parallelization - one sample per file.

Avoid overhead of transferring entire data sets - HDF5 object store?

4 of 17

Benchmark HDF5 against different formats -- H5 vs Sqlite

Formats

In-memory

sparse
CSR (compressed sparse row)

On-disk

2d dense compressed
Chunked by row (gene)
IO strategy

H5 hyperslab
Chunked read

Sqlite

COO (Coordinate list)

IO pattern

1e3 random rows
1e3 random cols
1e3 x 1e3 sub-mat
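The COO layout in Sqlite and the sub-matrix access pattern can be sketched as follows. The table name, column names, and indexes are hypothetical; the slides do not show the schema that was benchmarked.

```python
import sqlite3

# Tiny dense matrix (genes x cells), used only for illustration.
dense = [
    [0.0, 1.5, 0.0, 0.0],
    [2.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 5.0, 0.0],
]

con = sqlite3.connect(":memory:")
# Hypothetical schema: one table row per non-zero entry (COO).
con.execute("CREATE TABLE mat (gene INTEGER, cell INTEGER, value REAL)")
con.executemany(
    "INSERT INTO mat VALUES (?, ?, ?)",
    [(i, j, v) for i, row in enumerate(dense) for j, v in enumerate(row) if v != 0.0],
)
# Index both coordinates so row and column lookups avoid full table scans.
con.execute("CREATE INDEX idx_gene ON mat (gene)")
con.execute("CREATE INDEX idx_cell ON mat (cell)")

def read_submatrix(con, genes, cells):
    """Fetch a genes x cells sub-matrix and densify it (missing entries are 0)."""
    gene_pos = {g: k for k, g in enumerate(genes)}
    cell_pos = {c: k for k, c in enumerate(cells)}
    out = [[0.0] * len(cells) for _ in genes]
    query = (
        "SELECT gene, cell, value FROM mat WHERE gene IN ({}) AND cell IN ({})"
        .format(",".join("?" * len(genes)), ",".join("?" * len(cells)))
    )
    for g, c, v in con.execute(query, list(genes) + list(cells)):
        out[gene_pos[g]][cell_pos[c]] = v
    return out

sub = read_submatrix(con, genes=[1, 3], cells=[0, 1, 3])
print(sub)
```

Long IN lists can run into SQLite's bound-parameter limit, so for index sets as large as 1e3 a join against a temporary table of indices may be the safer choice.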

5 of 17

Benchmark HDF5 against different formats -- H5 vs Sqlite

Conclusions:

H5 hyperslab selection is slow
Sqlite performs well for small sub-matrix selections but not as well for entire row/col selections

6 of 17

Benchmark HDF5 against different formats -- H5-chunked vs Sqlite-blobs

Sqlite

Store each row (gene) as a single binary blob (similar to an h5 chunk)
Improved row/col-wise selection
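A minimal sketch of the row-as-blob layout, assuming uncompressed 8-byte floats and a hypothetical `mat_rows` table; the actual blob encoding used in the benchmark is not given on the slide.

```python
import sqlite3
from array import array

dense = [
    [0.0, 1.5, 0.0, 0.0],
    [2.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 5.0, 0.0],
]

con = sqlite3.connect(":memory:")
# Hypothetical schema: one blob per gene, keyed by the gene index.
con.execute("CREATE TABLE mat_rows (gene INTEGER PRIMARY KEY, data BLOB)")
con.executemany(
    "INSERT INTO mat_rows VALUES (?, ?)",
    [(i, array("d", row).tobytes()) for i, row in enumerate(dense)],
)

def decode(blob):
    """Turn a stored blob back into a list of floats."""
    row = array("d")
    row.frombytes(blob)
    return row.tolist()

def read_rows(con, genes):
    """Row selection: one primary-key lookup and one blob decode per gene."""
    out = []
    for g in genes:
        (blob,) = con.execute(
            "SELECT data FROM mat_rows WHERE gene = ?", (g,)
        ).fetchone()
        out.append(decode(blob))
    return out

def read_cols(con, cells):
    """Column selection in this sketch: decode every row blob, then pick the cells."""
    out = []
    for (blob,) in con.execute("SELECT data FROM mat_rows ORDER BY gene"):
        row = decode(blob)
        out.append([row[c] for c in cells])
    return out

print(read_rows(con, [1]))
print(read_cols(con, [1, 3]))
```

Each blob plays the role of one row-shaped chunk: a whole gene comes back from a single lookup instead of one table row per non-zero entry.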

7 of 17

Benchmark HDF5 against different formats -- H5 vs HSDS(Kita)

Data (482911 cells x 22 channels)
Chunking dims 120727 x 5
Access pattern

2 channels (random)
cells = list(range(10, 10000, 1000)) (random)

H5 access mode

Default (point-selection, i.e. through h5 hyperslab)
Non-point-selection (i.e. chunked read)

Conclusion

Kita is more scalable than native H5 for random slicing regardless of the network latency

TODO

Not yet tested against a larger data set

8 of 17

Benchmark HDF5 against different formats -- H5 vs tiledb

Data (10000 cells x 27998 genes)
Chunking dims 10000 x 1
tiledb

Implemented tiledb backend for DelayedArray
Storage mode

Dense
Sparse

Access pattern

Continuous block selection

1000 x 1000

Tiledb performs better than h5

9 of 17

Benchmark HDF5 against different formats -- H5 vs tiledb

Access pattern

Random slicing (1000 x 1000)
Not yet natively supported by the tiledb C library, so it is implemented through chunked reads
Tiledb is comparable to h5
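Random slicing through chunked reads can be sketched as below: group the requested indices by the chunk that holds them, read each chunk once, then pick the wanted rows out of it. The chunk size and the `read_chunk` stand-in are hypothetical and do not reflect the tiledb API.

```python
from collections import defaultdict

CHUNK = 4  # hypothetical number of rows per chunk (tile)

def read_chunk(store, chunk_id):
    """Stand-in for one contiguous chunk read from disk."""
    start = chunk_id * CHUNK
    return store[start:start + CHUNK]

def random_rows_via_chunks(store, rows):
    """Return the requested rows, reading every needed chunk exactly once."""
    by_chunk = defaultdict(list)
    for pos, r in enumerate(rows):
        # Remember where each row goes in the output and its offset in the chunk.
        by_chunk[r // CHUNK].append((pos, r % CHUNK))
    out = [None] * len(rows)
    n_reads = 0
    for chunk_id in sorted(by_chunk):
        block = read_chunk(store, chunk_id)
        n_reads += 1
        for pos, offset in by_chunk[chunk_id]:
            out[pos] = block[offset]
    return out, n_reads

store = [[10 * i + j for j in range(3)] for i in range(12)]
rows = [9, 1, 2, 10, 0]
selected, n_reads = random_rows_via_chunks(store, rows)
print(selected, n_reads)
```

The cost is driven by the number of distinct chunks touched rather than the number of requested rows, which is why scattered indices are more expensive than a contiguous block of the same size.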

10 of 17

Ways to improve the HDF5 format -- Hybrid chunking shapes

TenX 1M data (1306127 cells x 27998 genes)
Chunking dims

By cell
By gene
Hybrid (copies of both)

Dispatches each query to the appropriate copy based on the shape of the requested submatrix, to minimize disk IO
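One way such a dispatch rule could work is to estimate, for each copy, how many elements must be read to cover the requested block and pick the cheaper copy. The chunk shapes and the cost model below are assumptions for illustration; the slide does not describe the actual rule.

```python
from math import ceil

N_CELLS, N_GENES = 1306127, 27998

# Assumed chunk shapes (cells x genes): one full cell or one full gene per chunk.
LAYOUTS = {
    "by_cell": (1, N_GENES),
    "by_gene": (N_CELLS, 1),
}

def elements_read(sel_shape, chunk_shape):
    """Rough IO cost of a contiguous, chunk-aligned block: chunks touched x chunk size."""
    n_chunks = ceil(sel_shape[0] / chunk_shape[0]) * ceil(sel_shape[1] / chunk_shape[1])
    return n_chunks * chunk_shape[0] * chunk_shape[1]

def pick_layout(sel_shape):
    """Choose the copy whose chunking needs the fewest elements read."""
    return min(LAYOUTS, key=lambda name: elements_read(sel_shape, LAYOUTS[name]))

print(pick_layout((100, N_GENES)))  # a few cells, all genes
print(pick_layout((N_CELLS, 50)))   # all cells, a few genes
```

The trade-off is storage: keeping both copies roughly doubles the disk footprint in exchange for cheap access along either axis.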

11 of 17

Ways to improve the HDF5 format -- gzip vs lz4

Both use their default compression parameters
Lz4 consistently performs better than gzip

12 of 17

Ways to improve the HDF5 format -- gzip vs lz4

Disk usage is comparable

13 of 17

mbenchmark

A generic tool that can benchmark any matrix format without rewriting a single line of benchmarking code

as long as the matrix object provides two APIs

‘[’ indexing method
‘dim’ accessor

Benchmarking the common matrix operations

Subsetting

region_selection: continuous block selection
random_slicing: non-continuous slab selection

Traversing

rowSums
colSums
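The format-agnostic idea can be illustrated outside R with a small Python analogue: the benchmark function only needs an indexing method and a dimension accessor, so any backend that supplies both can be timed unchanged. This is not the mbenchmark implementation; `ListMatrix` and `benchmark_subsetting` are hypothetical names invented for the sketch.

```python
import random
import time

class ListMatrix:
    """Toy in-memory backend; any object with `shape` and 2-D indexing would do."""
    def __init__(self, rows):
        self._rows = rows
        self.shape = (len(rows), len(rows[0]))

    def __getitem__(self, key):
        row_idx, col_idx = key
        return [[self._rows[i][j] for j in col_idx] for i in row_idx]

def benchmark_subsetting(mats, n=5, times=3, seed=0):
    """Time a random n x n slice on every matrix, using only indexing and shape."""
    rng = random.Random(seed)
    results = {}
    for name, mat in mats.items():
        n_rows, n_cols = mat.shape
        rows = sorted(rng.sample(range(n_rows), min(n, n_rows)))
        cols = sorted(rng.sample(range(n_cols), min(n, n_cols)))
        timings = []
        for _ in range(times):
            start = time.perf_counter()
            mat[rows, cols]
            timings.append(time.perf_counter() - start)
        results[name] = timings
    return results

mats = {"toy": ListMatrix([[10 * i + j for j in range(8)] for i in range(8)])}
res = benchmark_subsetting(mats, n=3, times=2)
print({name: len(t) for name, t in res.items()})
```

Adding a new format then means writing one wrapper class, with no change to the timing code.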

14 of 17

mbenchmark

Example for tenX data (10000 x 27998)
DelayedArray is natively supported. To benchmark a different format, simply extend DelayedArray with the corresponding backend

h5 → HDF5Array
bigmemory → bmArray
ff → ffArray
matter → mattArray

Simple one-liner benchmarking call

> res <- mbenchmark(mat.list, type = "subsetting", times = 3)

Quick plot

> autoplot(res)

15 of 17

mbenchmark -- Dockerized benchmarker

Download benchmarker folder that contains Dockerfile

https://github.com/RGLab/table-testing/blob/master/rbenchmarker/benchmarker

Build the docker image by running command

$ docker build -t benchmarker_r .

Run the benchmarker

$ docker run -v /path/to/data:/data -v /path/to/output:/output benchmarker_r --data-path=/data --output-path=/output --max-percent-of-rows=0.01 --verbose

Similarly, a docker image and command are available for the data converter

https://github.com/RGLab/table-testing/tree/master/rbenchmarker/create_data
Used to create/convert the data matrix to the formats suitable for benchmarker

16 of 17

Next steps

More benchmarks on tiledb

Waiting for the native support of random slicing in its new release
Enable parallel IO

17 of 17

Additional work

Improve sce/DelayedArray (e.g. HDF5Array, tiledbArray, ...) support in MAST (Finak et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278)
Work on BioC paper (effort led by Rob Amezquita and Stephanie Hicks)