Updates to rhdf5

Mike Smith

@grimbough

Introduction to rhdf5

One of several packages providing this functionality

Introduction to rhdf5

[Diagram: software layers from the C / C++ library up to analysis tools]

  • C / C++ Library: Rhdf5lib, reached via .Call()
  • R Interface: rhdf5, e.g. H5Fopen(), h5read()
  • Counts Matrix: HDF5Array, DelayedArray
  • Complete SC Dataset: SingleCellExperiment
  • Analysis Tools: scran, zinbwave, scPipe, DropletUtils, scone, scater, ...

Update to underlying HDF5 library

  • Switch from HDF5 version 1.8 to 1.10
  • Motivated by users unable to open files created by other software
  • Not a simple drop-in replacement!
    • The hid_t type changed from a 32-bit to a 64-bit value.
    • The change should be transparent to users
  • Potentially interesting new features, e.g. SWMR (single-writer/multiple-reader)

Performance improvements

  • Low-hanging fruit: classic R inefficiencies
    • Unnecessary reordering
    • Copying rather than preallocating

https://github.com/lgatto/MSnbase/issues/395

Performance improvements

https://github.com/grimbough/rhdf5/issues/31

HDF5 Hyperslabs

  • Regularly spaced selections of elements

  • Defined by offset, count, stride and block
  • Available in rhdf5
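
A minimal sketch of how the four hyperslab parameters combine, written in plain Python so it does not depend on any HDF5 binding. The function name `hyperslab_indices` is hypothetical and only illustrates the 1-D case; positions are 0-based as in the HDF5 C library, whereas R code uses 1-based positions.

```python
def hyperslab_indices(offset, count, stride=1, block=1):
    """Return the 0-based positions selected by a 1-D hyperslab.

    offset: position of the first selected element
    count:  number of blocks
    stride: distance between the starts of consecutive blocks
    block:  number of contiguous elements in each block
    """
    if stride < block:
        raise ValueError("stride must be >= block, otherwise blocks overlap")
    return [offset + i * stride + j
            for i in range(count)    # one iteration per block
            for j in range(block)]   # elements inside a block


# Three blocks of two elements, block starts four positions apart
print(hyperslab_indices(offset=1, count=3, stride=4, block=2))
# [1, 2, 5, 6, 9, 10]
```

A single hyperslab can only describe a regular pattern like this one; irregular selections need several hyperslabs combined.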

HDF5 Hyperslab Unions

  • More complex selections require unions of hyperslabs

  • Performing many unions gets very slow
  • Acknowledged by the HDF5 Group, but no solution suggested, e.g. "Union of non-consecutive hyperslabs is very slow"

Selections with rhdf5

  • In R it is more natural to select by an index vector, e.g. c(1, 4, 7, 8)

Selections with rhdf5 (< 2.27.6)

  • The existing rhdf5 implementation is inefficient: it performs many more unions than necessary
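
One way to cut the number of unions is to merge consecutive indices into contiguous runs before building the selection. The Python sketch below illustrates that general idea only; it is not taken from the rhdf5 source, and the function name `index_to_blocks` is hypothetical. It assumes the index vector is sorted and free of duplicates.

```python
def index_to_blocks(index):
    """Collapse sorted, unique 1-based indices into (start, length) runs."""
    blocks = []
    for i in index:
        if blocks and i == blocks[-1][0] + blocks[-1][1]:
            # i directly follows the last run, so extend that run
            start, length = blocks[-1]
            blocks[-1] = (start, length + 1)
        else:
            # gap before i, so start a new run
            blocks.append((i, 1))
    return blocks


print(index_to_blocks([1, 4, 7, 8]))
# [(1, 1), (4, 1), (7, 2)]
```

Selecting one element at a time would need four hyperslabs for this index; merging runs needs three, and the saving grows with the length of the consecutive stretches.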

Improvements to rhdf5 indexing

https://github.com/grimbough/rhdf5/issues/31

Other additions

  • Reading ‘long’ vectors, e.g. the original 10x 1M Neuron h5 files
  • Writing ‘large’ datasets
  • Tests & settings for file locking issues on Lustre & Solaris
    • h5testFileLocking()
    • h5enableFileLocking()
    • h5disableFileLocking()

Expanding documentation

  • Important to share knowledge / offer advice to users
    • Practical Tips vignette
    • Blog posts
  • Other suggestions?

Acknowledgements

Wolfgang Huber

Martin Morgan

Vince Carey

Daniel van Twisk

Hervé Pagès

Levi Waldron

Aaron Lun

Mike Jiang
