
zfp: Compressed Floating-Point Arrays

Scientific Achievement

Data compression and decompression can be greatly accelerated via parallel execution on today’s multi-core computers, but doing so requires direct access to small, variable-size blocks of compressed data. The index needed to locate these blocks may itself take more storage than the compressed payload. We have developed several compact encodings of 64-bit block offsets that allow any block to be located quickly using as little as 16 bits of storage per block. This enables thousands of blocks to be (de)compressed concurrently on GPUs.
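The arithmetic below shows why the index can rival or exceed the compressed payload. It assumes 3D blocks of 4×4×4 = 64 values (zfp's block size in 3D) and a hypothetical uniform compression level expressed in bits per value. The numbers are illustrative, not measurements, and the helper name `index_overhead` is made up for this sketch.

```python
def index_overhead(bits_per_value: float, index_bits_per_block: int,
                   values_per_block: int = 64) -> float:
    """Ratio of index storage to compressed payload for one block.

    Illustrative only: assumes every block compresses to the same
    average number of bits per value.
    """
    payload_bits = bits_per_value * values_per_block
    return index_bits_per_block / payload_bits


for rate in (0.5, 1.0, 4.0):
    full = index_overhead(rate, 64)     # one raw 64-bit offset per block
    compact = index_overhead(rate, 16)  # a 16-bit-per-block encoding
    print(f"{rate:>4} bits/value: 64-bit offsets {full:.0%}, 16-bit {compact:.0%}")
```

At 0.5 bits/value a block's payload is only 32 bits, so a raw 64-bit offset is twice the size of the data it points to, while a 16-bit encoding adds half.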

Significance and Impact

Most science applications drive lossy compression by prescribing error tolerances rather than compressed size, a zfp capability previously available only on CPUs. Our work allows applications to exploit GPUs through error-bounded, variable-rate, parallel (de)compression using CUDA and HIP. We are separately collaborating with Intel to port these techniques to SYCL.

zfp partitions multidimensional data sets into small cubical blocks (visible on the far right) that are compressed and decompressed independently. Efficiently indexing these blocks enables block-granular random access and parallel decompression on GPUs. This work has resulted in several compact block index encodings that reduce storage by up to 4x while still ensuring constant-time access.
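The block partitioning can be sketched as follows. zfp blocks span 4 values along each dimension (4×4×4 in 3D), and a partially filled block at the array boundary still occupies a whole block slot. The raster ordering of blocks (x varying fastest) and the helper names are assumptions made for illustration, not zfp's documented layout.

```python
from typing import Tuple

BLOCK = 4  # zfp blocks span 4 values along each dimension (4x4x4 in 3D)


def blocks_per_dim(n: int) -> int:
    """Number of blocks along one axis; a partial block at the end still counts."""
    return (n + BLOCK - 1) // BLOCK


def block_index(x: int, y: int, z: int, shape: Tuple[int, int, int]) -> int:
    """Linear index of the block containing value (x, y, z).

    Assumes blocks are numbered in raster order with x varying fastest;
    this ordering is an illustrative assumption.
    """
    nx, ny, _ = shape
    bx, by = blocks_per_dim(nx), blocks_per_dim(ny)
    return (x // BLOCK) + bx * ((y // BLOCK) + by * (z // BLOCK))


shape = (10, 8, 5)  # 3 x 2 x 2 = 12 blocks, including partial ones
print(blocks_per_dim(10) * blocks_per_dim(8) * blocks_per_dim(5))
print(block_index(9, 7, 4, shape))
```

Because each block is compressed independently, the block index is all a reader needs to find which compressed unit holds a given value; the remaining problem is where that unit starts in the stream.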

Technical Approach

Block offsets are computed from interleaved offsets and sizes via prefix sums.
Multiple encodings allow trading access speed for compactness.
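The prefix-sum step can be illustrated as follows: given per-block compressed sizes, an exclusive prefix sum yields each block's starting bit offset. This sketch uses a Hillis-Steele scan, in which every element of a pass can be updated independently (one GPU thread per element); here it runs sequentially. It is a generic technique, not necessarily the scan zfp uses, and it does not model the interleaving of offsets and sizes.

```python
from typing import List


def exclusive_scan(sizes: List[int]) -> List[int]:
    """Exclusive prefix sum via Hillis-Steele doubling.

    Each pass reads only the previous pass's values, so on a GPU every
    element of a pass could be updated by its own thread.
    """
    n = len(sizes)
    inclusive = list(sizes)
    step = 1
    while step < n:
        # Build the next pass from a snapshot of the current one.
        inclusive = [inclusive[i] + (inclusive[i - step] if i >= step else 0)
                     for i in range(n)]
        step *= 2
    # Shift right by one to turn the inclusive scan into offsets.
    return [0] + inclusive[:-1] if n else []


sizes = [37, 12, 64, 5, 20]  # hypothetical compressed block sizes in bits
offsets = exclusive_scan(sizes)
print(offsets)  # [0, 37, 49, 113, 118]
```

The scan takes about log2(n) passes, which is why offsets for many thousands of variable-size blocks can be produced quickly in parallel.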

PI(s)/Facility lead(s): Peter Lindstrom (LLNL)

Collaborating Institutions: University of Utah, Intel

ASCR Program: SciDAC RAPIDS2

ASCR PM: Kalyan Perumalla

Code Developed: https://github.com/LLNL/zfp

LOCAL LAB POC:

Jeff Hittinger (ASCR POC), Kathryn Mohror (ASCR CS POC)

TALKING POINTS:

Expensive movement and storage of large science data can be mitigated via data compression.
High-speed data compression and decompression are achieved using many parallel threads that concurrently process small pieces of a data set.
Because individual pieces compress to different sizes, locating them in linear storage requires producing and consulting an index data structure.
The index itself must be very compact so as not to inflate storage yet simple enough to allow fast lookups.
This work has resulted in several novel index encodings that reduce storage by up to 4x while guaranteeing constant-time random access.
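One way to balance compactness against lookup speed is sketched below. It is a hypothetical chunked index, not zfp's actual format: one 64-bit base offset is stored per chunk of `k` blocks, plus one 16-bit size per block. A lookup adds at most `k - 1` sizes to the chunk base, so its cost is bounded by a constant. The class name `ChunkedIndex` and parameter `k` are invented for illustration.

```python
from typing import List


class ChunkedIndex:
    """Hypothetical compact block index (not zfp's actual format).

    One 64-bit base offset is stored per chunk of `k` blocks, plus one
    16-bit size per block. Locating a block sums at most k - 1 sizes,
    so lookup cost is bounded by the constant k.
    """

    def __init__(self, sizes: List[int], k: int = 8) -> None:
        if any(not 0 <= s < 1 << 16 for s in sizes):
            raise ValueError("block sizes must fit in 16 bits")
        self.k = k
        self.sizes = list(sizes)    # conceptually a uint16 array
        self.bases: List[int] = []  # conceptually a uint64 array
        offset = 0
        for i, s in enumerate(sizes):
            if i % k == 0:
                self.bases.append(offset)  # start of a new chunk
            offset += s

    def offset(self, block: int) -> int:
        """Bit offset of `block`: chunk base plus sizes of earlier blocks in the chunk."""
        start = block - block % self.k
        return self.bases[block // self.k] + sum(self.sizes[start:block])

    def bits_per_block(self) -> float:
        """Average index storage per block."""
        return 16 + 64 / self.k


sizes = [37, 12, 64, 5, 20, 90, 41, 8, 77, 3]  # hypothetical sizes in bits
index = ChunkedIndex(sizes, k=4)
print(index.offset(6))         # 37 + 12 + 64 + 5 + 20 + 90 = 228
print(index.bits_per_block())  # 32.0
```

Raising `k` pushes storage toward 16 bits per block at the cost of longer in-chunk sums, a simple instance of the speed-versus-compactness trade-off. The 16-bit size field assumes every compressed block is smaller than 65,536 bits.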

METADATA:

Name of the associated awarded project: RAPIDS2

PI name(s): Rob Ross, PI (ANL); Peter Lindstrom, Lab PI and zfp lead (LLNL)

Name of the program manager: Kalyan Perumalla

CITATIONS:

Code: https://github.com/LLNL/zfp
An early research prototype of this work was published previously and is now being extended and fully integrated into zfp: L. Noordsij, S. van der Vlugt, M. Bamakhrama, Z. Al-Ars, P. Lindstrom, “Parallelization of Variable Rate Decompression through Metadata,” Euromicro International Conference on Parallel, Distributed and Network-based Processing 2020, doi:10.1109/PDP50117.2020.00045.

AWARDS:

2023 R&D 100 award

BACKGROUND AND CONTEXT INFORMATION:

The dominant bottleneck in today’s high-performance computing is data movement, which not only limits performance but is also the main contributor to power usage. This is true for data transfer across the internet, I/O, communication, host-device transfers, and even moving data between RAM and registers. If data movement does not keep pace with compute throughput, precious resources are wasted as compute cores sit idle waiting for data. One way to reduce data movement is to compress the data, move it in reduced form, and then decompress it. zfp is a high-speed numerical data compressor designed explicitly to reduce data movement in HPC applications. zfp supports not only massive parallelism for batch compression but also fine-grained random access to compressed data, allowing data arrays to be stored in main memory in compressed form and (de)compressed in small pieces on demand. Such random access is complicated when small data chunks compress to variable lengths, because locating a compressed chunk in the compressed stream requires very efficient, low-overhead indexing. Current zfp work focuses on extending GPU support for capturing, encoding, and later decoding offsets to compressed chunks, enabling highly parallel decompression of variable-length chunks.