1 of 15

The LC Framework for Synthesizing High-Speed Lossless and Error-Bounded Lossy Compressors

PI: Martin Burtscher (Texas State University)

Co-PI: Sheng Di (Argonne National Laboratory)

Senior advisor: Franck Cappello (ANL)

Ph.D. students: Noushin Azami, Alex Fallin, Yiqian Liu, Brandon Burtchell, and Benila Jerald

DE-SC0022223 and DE-AC02-06CH11357


2 of 15

LC Framework

[Diagram: the preprocessor library (Q_abs_f32, Lor2D_i32, Q_rel_f64, . . .) and the component library (RLE_1, BIT_4, . . ., ZE_8, RE_2) supply the stages of a pipeline, e.g., input → Q_abs_f32 → . . . → BIT_4 → . . . → RE_2 → output]

LC compression pipeline (decompressor is inverse stages in reverse order)


3 of 15

LC Algorithms

Supported lossless algorithms

  • Currently over 50 components
  • 1 to 8 pipeline stages
  • Many billions of distinct compression and matching decompression algorithms
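To make the "many billions" figure concrete, the following sketch counts the pipelines that can be formed from n components in 1 to 8 stages. It assumes that any component may appear in any stage and may repeat, which the slides do not state explicitly; `count_pipelines` is a hypothetical helper, not part of LC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of LC): number of distinct pipelines with
// 1..max_stages stages when each stage can be any of n components.
// Assumes the result fits into 64 bits (true for n = 50, max_stages = 8).
inline std::uint64_t count_pipelines(std::uint64_t n, int max_stages) {
  assert(max_stages >= 1);
  std::uint64_t total = 0;
  std::uint64_t power = 1;
  for (int k = 1; k <= max_stages; k++) {
    power *= n;      // n^k pipelines with exactly k stages
    total += power;  // accumulate over all pipeline lengths
  }
  return total;
}
```

Under these assumptions, `count_pipelines(50, 8)` yields 39,859,693,877,550, i.e., roughly 4 × 10^13 candidate pipelines for the lower bound of 50 components.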

Supported lossy algorithms

  • Currently over 20 preprocessors
  • Followed by 0 to 8 lossless stages
  • Quantization with guaranteed point-wise absolute and/or relative error bounds
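One common way to obtain a point-wise absolute error bound is uniform quantization into bins of width 2·eb, followed by a check of the reconstructed value. The sketch below illustrates that general idea only; it is not LC's implementation, and `quantize_abs`, `dequantize_abs`, and `within_bound` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps x to the index of a bin of width 2*eb (hypothetical helper).
// Assumes the bin index fits into 32 bits.
inline std::int32_t quantize_abs(float x, float eb) {
  assert(eb > 0.0f);
  return static_cast<std::int32_t>(std::lround(static_cast<double>(x) / (2.0 * eb)));
}

// Reconstructs the center of the bin.
inline float dequantize_abs(std::int32_t bin, float eb) {
  return static_cast<float>(bin * (2.0 * eb));
}

// Checks the bound on the value the decompressor would actually produce.
// Floating-point rounding can violate the bound in rare cases, so a
// compressor that guarantees the bound has to handle a 'false' result,
// for example by storing x losslessly (assumption, not shown here).
inline bool within_bound(float x, float eb) {
  const float r = dequantize_abs(quantize_abs(x, eb), eb);
  return std::fabs(static_cast<double>(x) - static_cast<double>(r)) <= static_cast<double>(eb);
}
```

The explicit check is the design point: the guarantee comes from verifying the reconstructed value, not from the rounding formula alone.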

Component library: RLE_1, BIT_4, . . ., ZE_8, RE_2

Preprocessor library: Q_abs_f32, Lor2D_i32, Q_rel_f64, . . .


4 of 15

LC Search and Statistics

Search capabilities

  • Can automatically find an effective compression algorithm for given training data
  • Optimizes for compression ratio alone or for both compression ratio and throughput
  • Exhaustive search, genetic algorithm (GA), and progressive pipeline building

Result statistics

  • Compression ratio, compression and decompression throughput
  • Pareto front of optimal algorithms in search space
  • Remaining entropy and most frequent values at different word sizes
  • All results recorded in CSV files
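As an illustration of the Pareto front mentioned above, the sketch below keeps only those results that are not dominated in both compression ratio and throughput. The `Result` struct and `pareto_front` function are hypothetical and do not reflect LC's CSV layout or internal code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record of one evaluated pipeline.
struct Result {
  double ratio;       // compression ratio (higher is better)
  double throughput;  // throughput (higher is better)
};

// Returns the results that no other result beats or ties in both metrics.
inline std::vector<Result> pareto_front(std::vector<Result> results) {
  // Sort by ratio (descending), breaking ties by throughput (descending).
  std::sort(results.begin(), results.end(), [](const Result& a, const Result& b) {
    if (a.ratio != b.ratio) return a.ratio > b.ratio;
    return a.throughput > b.throughput;
  });
  std::vector<Result> front;
  double best_throughput = -1.0;
  for (const Result& r : results) {
    // Every earlier entry has an equal or higher ratio, so r survives only
    // if its throughput exceeds all throughputs seen so far.
    if (r.throughput > best_throughput) {
      front.push_back(r);
      best_throughput = r.throughput;
    }
  }
  return front;
}
```

For the inputs (2.0, 10), (3.0, 5), (1.5, 8), (3.0, 4), and (1.0, 20), the front consists of (3.0, 5), (2.0, 10), and (1.0, 20).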


5 of 15

Demo

FLDSC_1_1800_3600.dat from the SDRBench suite

./generate_Device_LC-Framework.py

nvcc -O3 -arch=sm_86 -DUSE_GPU -Xcompiler "-O3 -march=native -fopenmp" -o lc lc.cu

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 RLE_4"

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 RLE_1"

./lc FLDSC_1_1800_3600.dat AL "" "BIT_4 RLE_1"

./lc

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+"

./lc FLDSC_1_1800_3600.dat PR "" ".+ .+ .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+ .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "" "DIFF_4 .+ .+ L.+|R.+|Z.+|C.+"

// pipeline script

./scripts/ga_search.py -s 5 -r 5 FLDSC_1_1800_3600.dat

./lc FLDSC_1_1800_3600.dat EX "" ".+ .+ L.+|R.+|Z.+|C.+"
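The stage arguments in the commands above look like regular expressions, one per pipeline stage. Assuming each whitespace-separated expression is matched against whole component names, the sketch below estimates how many pipelines a specification selects. `count_candidates` and `demo_components` are hypothetical, and the name list is only a small subset taken from these slides.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// For each whitespace-separated stage expression, counts the component
// names that match it in full, and returns the product over all stages.
// An empty specification yields 1 (the empty pipeline).
inline std::size_t count_candidates(const std::string& spec,
                                    const std::vector<std::string>& names) {
  std::istringstream tokens(spec);
  std::string expr;
  std::size_t product = 1;
  while (tokens >> expr) {
    const std::regex re(expr);
    std::size_t matches = 0;
    for (const std::string& name : names) {
      if (std::regex_match(name, re)) matches++;
    }
    product *= matches;
  }
  return product;
}

// Illustrative subset of component names that appear in the slides.
inline std::vector<std::string> demo_components() {
  return {"BIT_4", "DIFF_4", "RLE_1", "RLE_4", "RE_2", "ZE_8"};
}
```

With this six-name subset, the specification `DIFF_4 .+ .+ L.+|R.+|Z.+|C.+` selects 1 × 6 × 6 × 4 = 144 pipelines, which shows how fixing a stage or restricting the last stage shrinks the search.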


6 of 15

Performance Optimization

Internal operation

  • Framework breaks data into 16 kB chunks
  • Processes each chunk independently and in parallel
    • Using OpenMP threads on CPU and CUDA/HIP* thread-blocks on GPU
  • Concatenates compressed chunks and includes auxiliary info to quickly find each chunk (pseudo random access, i.e., can find and decode any one chunk while skipping others)

Chunk processing

  • Load chunk data into buffer
  • Perform all data transformations (pipeline stages) on chunk
    • CPU: alternate between two buffers (that fit in the L1 data cache)
    • GPU: alternate between two buffers in shared memory (software-controlled L1 data cache)
  • Store chunk to main memory
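The chunking scheme described above can be sketched on the CPU as follows. This is a simplified illustration under several assumptions: the `Stage` signature, the `Compressed` struct, and `process_chunks` are hypothetical, a failing stage is not handled, and the offset table merely stands in for the auxiliary information that makes each chunk findable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using byte = unsigned char;
constexpr int CS = 16 * 1024;  // chunk size from the slide (16 kB)

// Simplified stage signature modeled on the component interface.
using Stage = bool (*)(int& csize, const byte* in, byte* out);

struct Compressed {
  std::vector<std::size_t> offsets;  // offsets[c] = start of chunk c; last entry = total size
  std::vector<byte> data;            // concatenated chunk outputs
};

// Example stage: inverts every byte (size-preserving).
inline bool invert_stage(int& csize, const byte* in, byte* out) {
  for (int i = 0; i < csize; i++) out[i] = static_cast<byte>(~in[i]);
  return true;
}

inline Compressed process_chunks(const std::vector<byte>& input,
                                 const std::vector<Stage>& stages) {
  const long chunks = static_cast<long>((input.size() + CS - 1) / CS);
  std::vector<std::vector<byte>> results(chunks);

  #pragma omp parallel for  // chunks are independent; ignored without OpenMP
  for (long c = 0; c < chunks; c++) {
    alignas(8) byte buf[2][CS];  // two buffers used alternately
    const std::size_t begin = static_cast<std::size_t>(c) * CS;
    int csize = static_cast<int>(std::min<std::size_t>(CS, input.size() - begin));
    std::copy(input.data() + begin, input.data() + begin + csize, buf[0]);
    int cur = 0;
    for (Stage stage : stages) {
      const bool ok = stage(csize, buf[cur], buf[1 - cur]);
      assert(ok);  // sketch only: no fallback for stages that do not fit
      (void)ok;
      cur = 1 - cur;  // the output buffer becomes the next input
    }
    results[c].assign(buf[cur], buf[cur] + csize);
  }

  // Serial concatenation plus offset table for per-chunk access.
  Compressed out;
  out.offsets.push_back(0);
  for (const std::vector<byte>& r : results) {
    out.data.insert(out.data.end(), r.begin(), r.end());
    out.offsets.push_back(out.data.size());
  }
  return out;
}
```

Because the two buffers total 32 kB, they stay small enough to remain cache-resident on typical CPUs, which is the motivation the slide gives for alternating between them.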


7 of 15

Demo (cont.)

ls -lrt

// redo for other test files

// script to find best pipeline across inputs

// ./scripts/are_components_used_or_not.py 10 FLDSC_1_1800_3600.dat.CR3.csv

./lc FLDSC_1_1800_3600.dat CR "QUANT_ABS_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "QUANT_ABS_R_f32(0.1)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat EX "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat EX "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+" "MAXREL_f32(0.01)"


8 of 15

Extensible Libraries and Scripts

Extensible design

  • User can add preprocessors (+ verifiers) and components to library
  • Simple interface (see later)
  • Automatic parallelization for CPUs, partially automatic parallelization for GPUs

Included scripts

  • For finding best algorithm across a set of inputs
  • For running GA across a set of inputs
  • For generating progressively longer pipelines
  • For emitting standalone compressor and decompressor
  • For analyzing component usage


9 of 15

CPU Component Interface

// CPU encoder

// returns false if the encoded data does not fit in the out array

static inline bool h_[name](int& csize, byte in[CS], byte out[CS]);

// CPU decoder

static inline void h_i[name](int& csize, byte in[CS], byte out[CS]);

// losslessly transform the first csize bytes of the 'in' array

// write the result to the 'out' array

// update csize if the transformed data has a different size than the input

// must be serial code (e.g., cannot use OpenMP)

// are allowed to change the contents of both arrays

// the two arrays are guaranteed to start at an 8-byte aligned address


10 of 15

GPU Component Interface

// GPU encoder

// returns false if the encoded data does not fit in the out array

static __device__ inline bool d_[name](int& csize, byte in[CS], byte out[CS], byte temp[CS]);

// GPU decoder

static __device__ inline void d_i[name](int& csize, byte in[CS], byte out[CS], byte temp[CS]);

// losslessly transform the first csize bytes of the 'in' array

// write the result to the 'out' array

// update csize if the decoded data has a different size than the input data

// must be thread-block-local code

// are allowed to change the contents of all three arrays

// don’t allocate __shared__ memory (use temp, e.g., int* buf = (int*)temp;)

// the three arrays are guaranteed to start at an 8-byte aligned address


11 of 15

CPU Component Example

// invert data (at byte granularity)

static inline bool h_INV_1(int& csize, const byte in[CS], byte out[CS])
{
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
  return true;
}

static inline void h_iINV_1(int& csize, const byte in[CS], byte out[CS])
{
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
}
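A quick way to validate a new component is a round-trip check that encodes a chunk, decodes it, and compares the result with the original. The sketch below does this for the inversion component; the `byte` and `CS` definitions are assumptions made to keep the example self-contained, and `inv_roundtrip_ok` is a hypothetical helper rather than part of the framework.

```cpp
#include <cassert>
#include <cstring>

using byte = unsigned char;    // assumption: LC's byte type is an 8-bit unsigned type
constexpr int CS = 16 * 1024;  // assumption: chunk size of 16 kB

// Encoder and decoder as shown on the slide.
static inline bool h_INV_1(int& csize, const byte in[CS], byte out[CS]) {
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
  return true;
}

static inline void h_iINV_1(int& csize, const byte in[CS], byte out[CS]) {
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
}

// Hypothetical helper: encodes and decodes n bytes and checks that the
// decoded data and size match the original.
inline bool inv_roundtrip_ok(int n) {
  assert(0 <= n && n <= CS);
  alignas(8) byte original[CS];
  alignas(8) byte encoded[CS];
  alignas(8) byte decoded[CS];
  for (int i = 0; i < n; i++) original[i] = static_cast<byte>(i * 31 + 7);
  int csize = n;
  if (!h_INV_1(csize, original, encoded)) return false;
  h_iINV_1(csize, encoded, decoded);
  return csize == n && std::memcmp(original, decoded, static_cast<std::size_t>(n)) == 0;
}
```

Calling `inv_roundtrip_ok` with a full chunk and with a partial chunk exercises both cases a component has to support.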


12 of 15

CPU Preprocessor Interface

// CPU preprocessor encoder

static inline void h_[name](int& size, byte*& data, const int paramc, const double paramv []);

// CPU preprocessor decoder

static inline void h_i[name](int& size, byte*& data, const int paramc, const double paramv []);

// transforms the 'size' bytes in the 'data' array and writes the result either back to the 'data' array or to a new array and then makes 'data' point to this new array

// if the number of bytes changes, the 'size' needs to be updated accordingly

// the data array must start at an 8-byte aligned address

// 'paramc' specifies the number of elements in the 'paramv' array

// the 'paramv' array passes the command-line arguments provided to this preprocessor (e.g., the error bound, data set dimensionality, etc.)

// this code must be manually parallelized (using OpenMP) if desired


13 of 15

GPU Preprocessor Interface

// GPU preprocessor encoder

static inline void d_[name](int& size, byte*& data, const int paramc, const double paramv []);

// GPU preprocessor decoder

static inline void d_i[name](int& size, byte*& data, const int paramc, const double paramv []);

// transforms the 'size' bytes in the 'data' array and writes the result either back to the 'data' array or to a new array and then makes 'data' point to this new array

// if the number of bytes changes, the 'size' needs to be updated accordingly

// the data array must start at an 8-byte aligned address

// 'paramc' specifies the number of elements in the 'paramv' array

// the 'paramv' array passes the command-line arguments provided to this preprocessor (e.g., the error bound, data set dimensionality, etc.)

// must be a host function that launches a kernel to do the preprocessing

// the kernel is allowed to allocate and use shared memory

// 'data' must be in device memory


14 of 15

CPU Preprocessor Example

// add user-provided constant (at byte granularity)

static inline void h_ADD(int& size, byte*& data, const int paramc, const double paramv [])
{
  assert(paramc == 1);
  const byte offset = paramv[0];
  #pragma omp parallel for default(none) shared(size, data, offset)
  for (int i = 0; i < size; i++) {
    data[i] += offset;
  }
}

static inline void h_iADD(int& size, byte*& data, const int paramc, const double paramv [])
{
  ...
  data[i] -= offset;
  ...
}
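The add/subtract pair is lossless because byte arithmetic wraps modulo 256, so subtracting the same constant undoes the addition even when the sum overflows. The sketch below demonstrates this with a round trip. `h_ADD_demo` and `h_iADD_demo` are hypothetical stand-ins modeled on the slide (the slide's decoder body is elided, so the decoder here is an assumption), the OpenMP pragma is omitted for brevity, and `add_roundtrip_ok` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

using byte = unsigned char;  // assumption: 8-bit unsigned type

// Stand-in encoder modeled on the slide's h_ADD (serial for brevity).
static inline void h_ADD_demo(int& size, byte*& data, const int paramc, const double paramv[]) {
  assert(paramc == 1);
  const byte offset = static_cast<byte>(static_cast<int>(paramv[0]));
  for (int i = 0; i < size; i++) {
    data[i] = static_cast<byte>(data[i] + offset);  // wraps modulo 256
  }
}

// Assumed decoder: subtracts the same constant, again modulo 256.
static inline void h_iADD_demo(int& size, byte*& data, const int paramc, const double paramv[]) {
  assert(paramc == 1);
  const byte offset = static_cast<byte>(static_cast<int>(paramv[0]));
  for (int i = 0; i < size; i++) {
    data[i] = static_cast<byte>(data[i] - offset);
  }
}

// Hypothetical helper: applies encoder and decoder in place and checks
// that the size, the pointer, and the contents are unchanged afterwards.
inline bool add_roundtrip_ok(std::vector<byte> values, double constant) {
  const std::vector<byte> original = values;
  int size = static_cast<int>(values.size());
  byte* data = values.data();
  const double paramv[1] = {constant};
  h_ADD_demo(size, data, 1, paramv);
  h_iADD_demo(size, data, 1, paramv);
  return size == static_cast<int>(original.size()) && data == values.data() && values == original;
}
```

For example, 250 + 200 wraps to 194, and 194 - 200 wraps back to 250, so the round trip succeeds for every byte value.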


15 of 15

Languages and Guarantees

Supported languages

  • CPUs: serial and OpenMP C++
  • GPUs: CUDA and HIP*

Guarantees on compressed and decompressed output files

  • Bit-wise compatibility between devices (CPUs and GPUs)
  • Deterministic (fixed input always produces same compressed & decompressed output)
