Benchmarking on-disk file formats for single-cell data
Mike Jiang, Raphael Gottardo
Summary
Each study (data set) contains hundreds to thousands of biological samples.
Each biological sample is often, though not always, analyzed independently.
Primary processing transforms the raw data into an HDF5 cube of parameters x samples x cells.
Analysis requires frequent access to subsets of this data.
Optimization: parallelize by storing one sample per file.
Avoid the overhead of transferring entire data sets; could an HDF5 object store help?
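To make the data layout in the summary concrete, here is a minimal in-memory sketch in base R. The sizes, the dimension names, and the assumption that every sample has the same number of cells are illustrative only; real samples usually differ in cell count, and the real data lives in HDF5 rather than in an R array.

```r
# Hypothetical sizes, chosen only for illustration.
n_params  <- 4
n_samples <- 3
n_cells   <- 5

# Cube of parameters x samples x cells, filled with placeholder values.
cube <- array(
  seq_len(n_params * n_samples * n_cells),
  dim = c(n_params, n_samples, n_cells),
  dimnames = list(
    paste0("param", seq_len(n_params)),
    paste0("sample", seq_len(n_samples)),
    NULL
  )
)

# A typical analysis request: two parameters, one sample, all cells.
sub <- cube[c("param1", "param3"), "sample2", , drop = FALSE]

# "One sample per file" corresponds to splitting along the sample axis,
# so each piece can be stored and processed independently.
per_sample <- lapply(seq_len(n_samples), function(s) cube[, s, ])
```

Each element of `per_sample` is a parameters x cells matrix, which is the unit a per-sample file would hold and the unit a parallel worker would read.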
Benchmark HDF5 against different formats -- H5 vs SQLite
Conclusions:
Benchmark HDF5 against different formats -- H5-chunked vs SQLite-blobs
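For readers unfamiliar with the blob approach, the sketch below shows what storing a chunk as a blob amounts to: the numeric values are serialized to raw bytes and decoded again on read. It uses base R only and does not touch a database; the chunk contents and the choice of little-endian 8-byte doubles are assumptions for illustration, not the encoding used in the benchmark.

```r
# A hypothetical chunk: 3 parameters x 4 cells of double-precision values.
chunk <- matrix(c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5,
                  6.5, 7.5, 8.5, 9.5, 10.5, 11.5), nrow = 3)

# Serialize to a raw vector; this is the payload a BLOB column would hold.
blob <- writeBin(as.vector(chunk), raw(), size = 8, endian = "little")

# A blob is just bytes, so the shape and type must be stored separately
# (for example in extra columns) to rebuild the matrix on read.
restored <- matrix(
  readBin(blob, what = "double", n = length(chunk), size = 8, endian = "little"),
  nrow = nrow(chunk)
)
```

Unlike an HDF5 chunk, a blob carries no shape or type metadata of its own, so any partial read has to fetch and decode the whole blob first.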
Benchmark HDF5 against different formats -- H5 vs HSDS (Kita)
Benchmark HDF5 against different formats -- H5 vs TileDB
Ways to improve the HDF5 format -- Hybrid chunking shapes
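Why chunk shape matters can be illustrated with simple arithmetic: the number of chunks a read must touch depends on how the chunk shape lines up with the access pattern. The sketch below is base R only; the dataset size, the three chunk shapes, and the helper names `chunks_touched` and `chunks_for_subset` are hypothetical and are not taken from the benchmark.

```r
# Number of chunks a contiguous index range [first, last] touches along one
# axis when that axis is cut into chunks of length `chunk_len` (1-based).
chunks_touched <- function(first, last, chunk_len) {
  (last - 1) %/% chunk_len - (first - 1) %/% chunk_len + 1
}

# Total chunks touched by a rectangular subset of a 2-D dataset.
# `rows` and `cols` are c(first, last); `chunk_dim` is c(row_len, col_len).
chunks_for_subset <- function(rows, cols, chunk_dim) {
  chunks_touched(rows[1], rows[2], chunk_dim[1]) *
    chunks_touched(cols[1], cols[2], chunk_dim[2])
}

# Hypothetical dataset: 20 parameters x 100000 cells.
shapes <- list(
  by_parameter  = c(1, 100000),  # one chunk per parameter (row)
  by_cell_block = c(20, 1000),   # all parameters for a block of cells
  square_ish    = c(5, 10000)    # a compromise between the two
)

# Access pattern 1: one parameter across all cells.
one_param <- sapply(shapes, chunks_for_subset,
                    rows = c(3, 3), cols = c(1, 100000))

# Access pattern 2: all parameters for cells 5001 to 6000.
cell_block <- sapply(shapes, chunks_for_subset,
                     rows = c(1, 20), cols = c(5001, 6000))
```

With these numbers, `one_param` is 1, 100, 10 and `cell_block` is 20, 1, 4 for the three shapes in order. Neither extreme shape serves both patterns well, which is one plausible motivation for mixing chunk shapes. Chunk count is only a proxy: the bytes read per chunk and the decompression cost also matter.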
Ways to improve the HDF5 format -- gzip vs lz4
mbenchmark
> res <- mbenchmark(mat.list, type = "subsetting", times = 3)
> autoplot(res)
mbenchmark -- Dockerized benchmarker
Next steps
Additional work