1 of 15

GPU R&D of SZ Lossy Compression

Luddy School of Informatics, Computing, and Engineering

Dingwen Tao, Associate Professor

Indiana University Bloomington

2 of 15

INDIANA UNIVERSITY BLOOMINGTON

3 of 15


4 of 15

lossy transformation w.r.t. error bound

statistics

lossless encoding

  • [PACT ’20] Parallelize comp. pred-quant
  • [CLUSTER ’21] Parallelize decomp. pred-quant
  • [git::db56f32] Integrate outlier compaction
  • [WIP] Integrate partial/full histogram
  • [WIP] Linear and cubic spline interpolation
  • [CLUSTER ’21] A histogram of highly compressible data also indicates the smoothness of data.
  • [WIP] A statistical model makes GPU branching less intimidating.
  • [PACT ’20] Use canonical Huffman encoding.
  • [IPDPS ’21] Massively parallelized Huffman encoding scheme.
  • [CLUSTER ’21] Use run-length encoding
  • [HPDC ’23] Alternatively, bitshuffle + VLE.
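The prediction-quantization stage above can be illustrated with a minimal 1D Lorenzo sketch (names here are invented, not cuSZ's API). Note the loop-carried dependency: each step predicts from the previously reconstructed value, which is exactly what the parallelization work on the next slides addresses.

```python
import numpy as np

def lorenzo_pq(data, eb):
    """1D Lorenzo prediction-quantization sketch (illustrative, not cuSZ's API).

    Each value is predicted by the previously *reconstructed* neighbor, and the
    prediction error is quantized into bins of width 2*eb, so the
    reconstruction error never exceeds eb.
    """
    quant = np.empty(len(data), dtype=np.int64)
    recon = np.empty(len(data))
    prev = 0.0
    for i, d in enumerate(data):
        pred = prev                                # loop-carried RAW dependency
        code = int(round((d - pred) / (2 * eb)))   # quantize prediction error
        quant[i] = code
        prev = pred + code * 2 * eb                # what the decompressor sees
        recon[i] = prev
    return quant, recon

data = np.array([1.0, 1.3, 1.75, 1.5, 2.5])
quant, recon = lorenzo_pq(data, eb=0.125)
assert np.all(np.abs(recon - data) <= 0.125)   # error bound holds
```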


5 of 15

cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data

Published in 2020 International Conference on Parallel Architectures and Compilation Techniques (PACT’20)

2021 IEEE International Conference on Cluster Computing (CLUSTER’21)

Led by Jiannan Tian from HiPDAC


6 of 15

System Design

Challenges

  • Tight data dependency—loop-carried read-after-write (RAW)—hinders parallelization.
  • Excessive host-device communication when work is assigned based only on CPU/GPU suitability.


7 of 15

Fully Parallelized P+Q
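The trick that removes the RAW dependency is dual-quantization: first prequantize every value onto the error-bound grid (purely elementwise), then take integer deltas of the prequantized values. A minimal NumPy sketch of this two-phase scheme (function names are illustrative, not the cuSZ API):

```python
import numpy as np

def dual_quantize(data, eb):
    """Two-phase dual-quantization sketch (illustrative names, not the cuSZ API).

    Phase 1: prequantize every value onto the error-bound grid; this is
    purely elementwise and trivially parallel.
    Phase 2: Lorenzo-predict on the *prequantized integers*; each delta reads
    only already-final neighbors, so the loop-carried RAW dependency of
    classic prediction-quantization is gone.
    """
    prequant = np.round(data / (2 * eb)).astype(np.int64)  # phase 1
    return np.diff(prequant, prepend=0)                    # phase 2 (data-parallel)

def dequantize(delta, eb):
    # Decompression: a prefix sum undoes the deltas, then scale back.
    return np.cumsum(delta) * (2 * eb)

data = np.array([1.0, 1.3, 1.75, 1.5, 2.5])
delta = dual_quantize(data, eb=0.125)
assert np.all(np.abs(dequantize(delta, eb=0.125) - data) <= 0.125)
```

Both phases map one thread to one element; no thread waits on another's result.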


8 of 15

Adaptive Parallelism

Thread-Count Tuning

GPU Performance Optimization

Canonical Codebook & Huffman Encoding

Codebook construction and encoding are performed in a fine-grained manner:

IPDPS’21: Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures, Tian et al.

IPDPS’22: Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs, Rivera et al.
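Canonical Huffman codes are determined entirely by the per-symbol code lengths, so a decoder needs only a small length table instead of a pointer-chasing tree; this is the property the GPU codebook work exploits. A small illustrative Python sketch (not the actual cuSZ kernel):

```python
def canonical_codes(lengths):
    """Build canonical Huffman codes from per-symbol code lengths.

    Symbols are ordered by (length, symbol) and assigned consecutive codes;
    the code value is left-shifted whenever the length grows. The resulting
    codes are prefix-free and reconstructible from the lengths alone.
    Illustrative sketch, not the actual cuSZ kernel.
    """
    order = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
    codes, code, prev_len = {}, 0, order[0][0]
    for length, sym in order:
        code <<= (length - prev_len)      # lengthen the code when needed
        codes[sym] = format(code, f'0{length}b')
        code += 1
        prev_len = length
    return codes

# code lengths for symbols 0..3; symbol 1 is the most frequent (shortest code)
codes = canonical_codes([2, 1, 3, 3])
vals = list(codes.values())
# prefix-free: no code is a prefix of another
assert all(not a.startswith(b) for a in vals for b in vals if a != b)
```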


9 of 15

Performance Evaluation: Throughput and Quality

cuSZ (as of October 2021):

For the compression kernel:

  • 411× ~ 719× over serial CPU
  • 19.1× ~ 24.8× over OpenMP CPU

For the decompression kernel:

  • 130× ~ 235× over serial CPU
  • 11.8× ~ 16.8× over OpenMP CPU

Rate-Distortion


10 of 15

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs

Published in 2023 ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’23)

Led by Boyuan Zhang from HiPDAC


11 of 15

Motivation

  • SOTA GPU error-bounded lossy compressors suffer from either low throughput or low compression quality
    • Low throughput: cuSZ and MGARD-GPU’s Huffman encoding (and most dictionary encoding algorithms) contain substantial data dependencies, making them difficult to parallelize on GPUs
    • Low quality: cuZFP has slightly higher throughput than cuSZ and MGARD-GPU but much lower compression quality due to its fixed-rate mode


12 of 15

System Design

2. Optimize Bitshuffle on GPUs

  • Fully leverage shared memory in each thread block
  • Use a warp-level vote function to shuffle bits and resolve data-access conflicts
  • Store the result locally to enable coalesced memory access
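The warp-level shuffle can be mimicked on the CPU with a bit-matrix transpose. A NumPy stand-in for the bitshuffle kernel (illustrative only, not FZ-GPU's CUDA code):

```python
import numpy as np

def bitshuffle32(block):
    """Bitshuffle a block of 32-bit words: output bit plane i collects bit i
    of every input word. After error-bounded quantization the high bit
    planes are mostly all-zero, which the downstream encoder can drop.
    NumPy stand-in for the warp-level vote/ballot trick used on the GPU.
    """
    bits = (block[:, None] >> np.arange(32)) & 1          # (n, 32) bit matrix
    return np.packbits(bits.T.astype(np.uint8), axis=1)   # one row per bit plane

block = np.array([1, 3, 2, 0], dtype=np.uint32)
planes = bitshuffle32(block)
assert not planes[2:].any()   # bits 2..31 of every word are zero
```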

3. Fast GPU Lossless Encoder

  • Partition data into chunks and iterate over all data blocks
  • Record whether all values in a block are zero (using 1 flag bit) and copy the data only if not
  • Fuse the bitshuffle kernel with the first phase of our encoding to save one round trip to global memory
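A serial sketch of this zero-block encoding: one flag bit per block, plus a compacted payload of the nonzero blocks. Boolean indexing stands in for the GPU prefix sum that gives each kept block its output offset; all names are invented:

```python
import numpy as np

def zero_block_encode(data, block_size=4):
    """Fixed-length encoder sketch: 1 flag bit per block (1 = block has a
    nonzero value and its bytes are kept; 0 = block dropped entirely).
    Illustrative only, not FZ-GPU's kernel."""
    blocks = data.reshape(-1, block_size)
    flags = blocks.any(axis=1)
    return flags.astype(np.uint8), blocks[flags].ravel()

def zero_block_decode(flags, payload, block_size=4):
    # Scatter kept blocks back; dropped blocks stay all-zero.
    out = np.zeros((len(flags), block_size), dtype=payload.dtype)
    out[flags.astype(bool)] = payload.reshape(-1, block_size)
    return out.ravel()

data = np.array([0, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0], dtype=np.uint8)
flags, payload = zero_block_encode(data)
assert np.array_equal(zero_block_decode(flags, payload), data)
```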

1. Optimize Dual-Quantization

  • Remove shift operations so values are symmetrically distributed around zero
  • Avoid handling outliers separately, for high performance
  • Use 1 bit for the sign of each quantization code instead of two's complement
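The sign-bit choice matters because small negative codes in two's complement set almost every bit (e.g., -1 is 0xFFFFFFFF), destroying the zero bit planes that bitshuffle relies on. An illustrative sketch of the re-encoding (not FZ-GPU's actual kernel):

```python
import numpy as np

def to_sign_magnitude(codes):
    """Re-encode signed quantization codes as sign bit + magnitude.
    Sign-magnitude keeps the high bits zero whenever |code| is small,
    so bitshuffle still produces long runs of zero bit planes.
    Illustrative sketch of the idea, not FZ-GPU's actual kernel."""
    sign = (codes < 0).astype(np.uint32)
    return (sign << np.uint32(31)) | np.abs(codes).astype(np.uint32)

codes = np.array([3, -1, 0, -2], dtype=np.int64)
sm = to_sign_magnitude(codes)
assert sm.dtype == np.uint32   # -1 becomes 0x80000001, not 0xFFFFFFFF
```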


13 of 15

Evaluation: Compression Throughput

  • FZ-GPU achieves a speedup of up to 11.2× over cuSZ and up to 4.2× over cuZFP on A100
  • Our dual-quantization kernel achieves a speedup of up to 1.7×
  • Kernel fusion provides a speedup of up to 1.1×
  • The prefix-sum-based encoding kernel achieves a speedup of up to 1.9×
  • FZ-GPU achieves the best overall GPU-CPU throughput on almost all datasets and evaluated relative error bounds


14 of 15

Evaluation: Rate-distortion and Visualization

  • Compression ratio: FZ-GPU's is up to 1.1× higher than cuSZ's and 1.7× higher than cuZFP's on average
  • SSIM: FZ-GPU has the highest SSIM among all compressors
  • PSNR:
    • versus cuZFP/cuSZx: FZ-GPU's is 1.3×/1.1× higher
    • versus MGARD-GPU: FZ-GPU is slightly lower because MGARD-GPU uses a very expensive multi-grid-based approach for accurate approximation (i.e., compression throughput of 4.9 GB/s vs FZ-GPU’s 65.4 GB/s)


15 of 15

  • With more and more backends added, CUDA-based parallelized SZ evolves into pSZ, incorporating platform-specific runtimes: cuSZ, hipSZ (AMD, in dev.), and dpSZ (data-parallel C++, Intel, in dev.).
  • Architecture-specific optimization?
  • What about debuggability and maintainability?
  • What about toolchain support in real-world HPC environment?
  • [base] The 2023 version.
  • [done] Sub 10-second build process.
  • [WIP] Memory management.
  • [WIP] Logging system.
  • [WIP] Decrease memory footprint.
  • [WIP] Error status reporting.
  • [WIP] Testing.
  • [TODO] (C)FFI.
  • [TODO] CI/CD.
  • Compressor developer
    • Framework stability
  • Technology integrator
    • Ergonomically good APIs
  • Mainstream users
    • Tool stability
    • High quality, high compression ratio, and high processing speed
    • Ease of conducting experiments

portability

userbase and focus

slowly paying down technical debt
