1 of 15

The LC Framework for Synthesizing High-Speed Lossless and Error-Bounded Lossy Compressors

PI: Martin Burtscher (Texas State University)

Co-PI: Sheng Di (Argonne National Laboratory)

Senior advisor: Franck Cappello (ANL)

Ph.D. students: Noushin Azami, Alex Fallin, Yiqian Liu, Brandon Burtchell, and Benila Jerald

DE-SC0022223 and DE-AC02-06CH11357

2 of 15

LC Framework

Preprocessor library: Q_abs_f32, Lor2D_i32, Q_rel_f64, . . .

Component library: RLE_1, BIT_4, . . ., ZE_8, RE_2

LC compression pipeline (decompressor is inverse stages in reverse order): input → Q_abs_f32 → . . . → BIT_4 → . . . → RE_2 → output
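The "inverse stages in reverse order" rule can be illustrated with a minimal sketch. The `Stage` type, the two toy stages, and the function names are all hypothetical and are not part of LC; the sketch only assumes that every stage has an exact inverse.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical stage type: a forward transform paired with its exact inverse.
struct Stage {
  std::function<Bytes(const Bytes&)> encode;
  std::function<Bytes(const Bytes&)> decode;
};

// Toy stage: bitwise inversion of every byte (it is its own inverse).
inline Bytes invertBytes(const Bytes& in) {
  Bytes out(in.size());
  for (std::size_t i = 0; i < in.size(); i++) out[i] = static_cast<std::uint8_t>(~in[i]);
  return out;
}

// Toy stage: byte-wise difference coding (wraps around modulo 256).
inline Bytes diffEncode(const Bytes& in) {
  Bytes out(in.size());
  std::uint8_t prev = 0;
  for (std::size_t i = 0; i < in.size(); i++) {
    out[i] = static_cast<std::uint8_t>(in[i] - prev);
    prev = in[i];
  }
  return out;
}

// Inverse of diffEncode: a running sum modulo 256.
inline Bytes diffDecode(const Bytes& in) {
  Bytes out(in.size());
  std::uint8_t prev = 0;
  for (std::size_t i = 0; i < in.size(); i++) {
    prev = static_cast<std::uint8_t>(prev + in[i]);
    out[i] = prev;
  }
  return out;
}

// Compression applies the stages front to back.
inline Bytes compress(const std::vector<Stage>& pipeline, Bytes data) {
  for (const Stage& s : pipeline) data = s.encode(data);
  return data;
}

// Decompression applies the inverse stages back to front.
inline Bytes decompress(const std::vector<Stage>& pipeline, Bytes data) {
  for (auto it = pipeline.rbegin(); it != pipeline.rend(); ++it) data = it->decode(data);
  return data;
}
```

With a pipeline such as `{{diffEncode, diffDecode}, {invertBytes, invertBytes}}`, calling `decompress(p, compress(p, x))` returns `x` unchanged, because the last stage applied during compression is the first one undone.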


3 of 15

LC Algorithms

Supported lossless algorithms

Currently over 50 components
1 to 8 pipeline stages
Many billions of distinct compression and matching decompression algorithms
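As a rough check of the "many billions" figure, the sketch below counts the pipelines of a given length range. It assumes exactly 50 components (the slide says "over 50"), that a component may appear in any stage and may repeat, and that every combination is valid; `countPipelines` is a hypothetical helper, not an LC function.

```cpp
#include <cstdint>

// Counts the pipelines with minStages..maxStages stages when every stage
// can independently be any of numComponents components (assumption).
inline std::uint64_t countPipelines(std::uint64_t numComponents, int minStages, int maxStages) {
  std::uint64_t total = 0;
  std::uint64_t power = 1;  // numComponents^k after k iterations
  for (int k = 1; k <= maxStages; k++) {
    power *= numComponents;
    if (k >= minStages) total += power;
  }
  return total;
}
```

Under these assumptions, `countPipelines(50, 1, 8)` yields 39,859,693,877,550, i.e., roughly 4 × 10^13 distinct lossless pipelines.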

Supported lossy algorithms

Currently over 20 preprocessors
Followed by 0 to 8 lossless stages
Quantization with guaranteed point-wise absolute and/or relative error bounds
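One common way to obtain a guaranteed point-wise absolute error bound is to quantize into bins of width 2·eb and then verify the reconstructed value, falling back to lossless storage when the check fails. The sketch below shows that idea only; it is not LC's quantizer, and `quantizeAbs` and `reconstruct` are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Reconstructs the value represented by a bin index for error bound eb.
inline float reconstruct(std::int32_t bin, float eb) {
  return static_cast<float>(bin * 2.0 * static_cast<double>(eb));
}

// Tries to quantize x so that |x - reconstruct(bin, eb)| <= eb.
// Returns false when the bound cannot be guaranteed (non-finite input,
// bin index out of range, or rounding error); the caller would then have
// to keep x losslessly.
inline bool quantizeAbs(float x, float eb, std::int32_t& bin) {
  if (!std::isfinite(x) || !(eb > 0.0f)) return false;
  const double scaled = static_cast<double>(x) / (2.0 * static_cast<double>(eb));
  if (std::fabs(scaled) > 1.0e9) return false;  // would not fit in 32 bits
  bin = static_cast<std::int32_t>(std::lround(scaled));
  // Verify the bound on the value the decoder will actually produce.
  const double error =
      std::fabs(static_cast<double>(x) - static_cast<double>(reconstruct(bin, eb)));
  return error <= static_cast<double>(eb);
}
```

The explicit verification step is what turns "usually within the bound" into a guarantee: floating-point rounding during reconstruction can otherwise push a value slightly past eb.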

Component library: RLE_1, BIT_4, . . ., ZE_8, RE_2

Preprocessor library: Q_abs_f32, Lor2D_i32, Q_rel_f64, . . .


4 of 15

LC Search and Statistics

Search capabilities

Can automatically find effective compression algorithm for given training data
Optimized for compression ratio or both compression ratio and throughput
Exhaustive search, genetic algorithm (GA), and progressive pipeline building

Result statistics

Compression ratio, compression and decompression throughput
Pareto front of optimal algorithms in search space
Remaining entropy and most frequent values at different word sizes
All results recorded in CSV files
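The Pareto front mentioned above can be computed from (compression ratio, throughput) pairs as in the following sketch. The `Result` struct and `paretoFront` function are hypothetical and do not reflect LC's CSV layout; the sketch assumes that higher is better for both metrics.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Hypothetical record of one evaluated pipeline.
struct Result {
  std::string pipeline;
  double ratio;       // compression ratio (higher is better)
  double throughput;  // e.g., GB/s (higher is better)
};

// Returns the results that no other result beats or matches in both metrics.
inline std::vector<Result> paretoFront(std::vector<Result> results) {
  // Sort by ratio (descending); break ties by throughput (descending).
  std::sort(results.begin(), results.end(), [](const Result& a, const Result& b) {
    if (a.ratio != b.ratio) return a.ratio > b.ratio;
    return a.throughput > b.throughput;
  });
  std::vector<Result> front;
  double bestThroughput = -std::numeric_limits<double>::infinity();
  for (const Result& r : results) {
    // Every earlier entry has an equal or higher ratio, so r survives only
    // if it is strictly faster than all of them.
    if (r.throughput > bestThroughput) {
      front.push_back(r);
      bestThroughput = r.throughput;
    }
  }
  return front;
}
```

The returned vector is ordered from highest ratio to highest throughput, which makes the ratio/speed trade-off easy to read off.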


5 of 15

Demo

FLDSC_1_1800_3600.dat from SDRbench suite

./generate_Device_LC-Framework.py

nvcc -O3 -arch=sm_86 -DUSE_GPU -Xcompiler "-O3 -march=native -fopenmp" -o lc lc.cu

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 RLE_4"

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 RLE_1"

./lc FLDSC_1_1800_3600.dat AL "" "BIT_4 RLE_1"

./lc

./lc FLDSC_1_1800_3600.dat CR "" "BIT_4 .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+"

./lc FLDSC_1_1800_3600.dat PR "" ".+ .+ .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+ .+"

./lc FLDSC_1_1800_3600.dat CR "" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "" "DIFF_4 .+ .+ L.+|R.+|Z.+|C.+"

// pipeline script

./scripts/ga_search.py -s 5 -r 5 FLDSC_1_1800_3600.dat

./lc FLDSC_1_1800_3600.dat EX "" ".+ .+ L.+|R.+|Z.+|C.+"
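The pipeline argument in the commands above gives one regular expression per stage, separated by spaces. The sketch below shows how such a specification could expand into candidate pipelines. It assumes that a component is selected when its whole name matches the stage's expression (using `std::regex` ECMAScript syntax); LC's actual matching rules may differ, and `expandPipelines` is a hypothetical name.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Expands a space-separated list of per-stage regular expressions into all
// pipelines whose i-th component name matches the i-th expression.
inline std::vector<std::vector<std::string>> expandPipelines(
    const std::string& spec, const std::vector<std::string>& components) {
  std::vector<std::vector<std::string>> pipelines = {{}};  // one empty pipeline
  std::istringstream stages(spec);
  std::string pattern;
  while (stages >> pattern) {
    const std::regex re(pattern);
    std::vector<std::vector<std::string>> next;
    for (const auto& prefix : pipelines) {
      for (const auto& name : components) {
        if (std::regex_match(name, re)) {  // whole-name match (assumption)
          std::vector<std::string> extended = prefix;
          extended.push_back(name);
          next.push_back(extended);
        }
      }
    }
    pipelines = next;
  }
  return pipelines;
}
```

With the six component names that appear in this deck (BIT_4, DIFF_4, RLE_1, RLE_4, RE_2, ZE_8), the specification `".+ .+ L.+|R.+|Z.+|C.+"` expands to 6 × 6 × 4 = 144 candidates under this assumption, which shows how restricting the last stage shrinks the search space.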


6 of 15

Performance Optimization

Internal operation

Framework breaks data into 16 kB chunks
Processes each chunk independently and in parallel

Using OpenMP threads on CPU and CUDA/HIP* thread-blocks on GPU

Concatenates compressed chunks and includes auxiliary info to quickly find each chunk (pseudo random access, i.e., can find and decode any one chunk while skipping others)
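The chunking and the auxiliary lookup information can be sketched as follows. This is not LC's file format; `splitIntoChunks`, `packChunks`, `chunkAt`, and the offset-table layout are assumptions chosen to show how a table of offsets lets a reader jump to any one chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
constexpr std::size_t kChunkSize = 16 * 1024;  // 16 kB chunks

// Splits the input into chunks of at most kChunkSize bytes.
inline std::vector<Bytes> splitIntoChunks(const Bytes& data) {
  std::vector<Bytes> chunks;
  for (std::size_t begin = 0; begin < data.size(); begin += kChunkSize) {
    const std::size_t end = std::min(begin + kChunkSize, data.size());
    chunks.emplace_back(data.begin() + static_cast<std::ptrdiff_t>(begin),
                        data.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return chunks;
}

// Hypothetical container: concatenated chunks plus an offset table.
struct Packed {
  std::vector<std::size_t> offsets;  // offsets[c] = start of chunk c; last = total size
  Bytes payload;
};

// Concatenates (already encoded) chunks and records where each one starts.
inline Packed packChunks(const std::vector<Bytes>& chunks) {
  Packed result;
  result.offsets.push_back(0);
  for (const Bytes& c : chunks) {
    result.payload.insert(result.payload.end(), c.begin(), c.end());
    result.offsets.push_back(result.payload.size());
  }
  return result;
}

// Extracts one chunk without touching any of the others.
inline Bytes chunkAt(const Packed& p, std::size_t index) {
  return Bytes(p.payload.begin() + static_cast<std::ptrdiff_t>(p.offsets[index]),
               p.payload.begin() + static_cast<std::ptrdiff_t>(p.offsets[index + 1]));
}
```

In a real compressor, each chunk would be encoded between `splitIntoChunks` and `packChunks`, so the chunks would have different sizes and the offset table would be the only way to locate chunk `i` directly.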

Chunk processing

Load chunk data into buffer
Perform all data transformations (pipeline stages) on chunk

CPU: alternate between two buffers (that fit in the L1 data cache)
GPU: alternate between two buffers in shared memory (software-controlled L1 data cache)

Store chunk to main memory
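The load, transform, and store steps with two alternating buffers can be sketched as below. The stage signature mirrors the encoder shape `bool f(int& csize, byte in[CS], byte out[CS])`; the `byte` alias, the value of `CS`, the two toy stages, and `runStages` are assumptions for illustration rather than LC code.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

using byte = unsigned char;    // assumption: LC's byte type is not shown in the slides
constexpr int CS = 16 * 1024;  // chunk size in bytes (16 kB)

using Encoder = bool (*)(int& csize, byte in[CS], byte out[CS]);

// Toy stage: bitwise inversion of every byte.
static inline bool invertStage(int& csize, byte in[CS], byte out[CS]) {
  for (int i = 0; i < csize; i++) out[i] = static_cast<byte>(~in[i]);
  return true;
}

// Toy stage: add one to every byte (wraps around).
static inline bool addOneStage(int& csize, byte in[CS], byte out[CS]) {
  for (int i = 0; i < csize; i++) out[i] = static_cast<byte>(in[i] + 1);
  return true;
}

// Runs all stages on one chunk, alternating between two local buffers.
inline bool runStages(const std::vector<Encoder>& stages, const byte* chunk,
                      int& csize, byte* result) {
  if (csize < 0 || csize > CS) return false;
  alignas(8) byte bufA[CS];
  alignas(8) byte bufB[CS];
  byte* in = bufA;
  byte* out = bufB;
  std::memcpy(in, chunk, static_cast<std::size_t>(csize));  // load chunk into buffer
  for (Encoder stage : stages) {
    if (!stage(csize, in, out)) return false;  // stage output did not fit
    std::swap(in, out);  // this stage's output is the next stage's input
  }
  std::memcpy(result, in, static_cast<std::size_t>(csize));  // store chunk
  return true;
}
```

Because only the two pointers are swapped, no data is copied between stages, and the working set stays at two chunk-sized buffers no matter how many stages the pipeline has.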


7 of 15

Demo (cont.)

ls -lrt

// redo for other test files

// script to find best pipeline across inputs

// ./scripts/are_components_used_or_not.py 10 FLDSC_1_1800_3600.dat.CR3.csv

./lc FLDSC_1_1800_3600.dat CR "QUANT_ABS_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "QUANT_ABS_R_f32(0.1)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat CR "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat EX "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+"

./lc FLDSC_1_1800_3600.dat EX "QUANT_REL_R_f32(0.01)" ".+ .+ L.+|R.+|Z.+|C.+" "MAXREL_f32(0.01)"


8 of 15

Extensible Libraries and Scripts

Extensible design

User can add preprocessors (+ verifiers) and components to library
Simple interface (see later)
Automatic parallelization for CPUs, partially automatic parallelization for GPUs

Included scripts

For finding best algorithm across a set of inputs
For running GA across a set of inputs
For generating progressively longer pipelines
For emitting standalone compressor and decompressor
For analyzing component usage


9 of 15

CPU Component Interface

// CPU encoder

// returns false if the encoded data does not fit in the out array

static inline bool h_[name](int& csize, byte in[CS], byte out[CS]);

// CPU decoder

static inline void h_i[name](int& csize, byte in[CS], byte out[CS]);

// losslessly transform the first csize bytes of the 'in' array

// write the result to the 'out' array

// update csize if the transformed data has a different size than the input

// must be serial code (e.g., cannot use OpenMP)

// are allowed to change the contents of both arrays

// the two arrays are guaranteed to start at an 8-byte aligned address


10 of 15

GPU Component Interface

// GPU encoder

// returns false if the encoded data does not fit in the out array

static __device__ inline bool d_[name](int& csize, byte in[CS], byte out[CS], byte temp[CS]);

// GPU decoder

static __device__ inline void d_i[name](int& csize, byte in[CS], byte out[CS], byte temp[CS]);

// losslessly transform the first csize bytes of the 'in' array

// write the result to the 'out' array

// update csize if the transformed data has a different size than the input data

// must be thread-block-local code

// are allowed to change the contents of all three arrays

// don’t allocate __shared__ memory (use temp, e.g., int* buf = (int*)temp;)

// the three arrays are guaranteed to start at an 8-byte aligned address


11 of 15

CPU Component Example

// invert data (at byte granularity)
static inline bool h_INV_1(int& csize, const byte in [CS], byte out [CS])
{
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
  return true;
}

static inline void h_iINV_1(int& csize, const byte in [CS], byte out [CS])
{
  for (int i = 0; i < csize; i++) {
    out[i] = ~in[i];
  }
}
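The INV_1 example never changes csize and always fits. A component that can shrink or grow the data has to update csize and report when the output does not fit. The sketch below uses a naive run-length scheme for this; `h_RUN_1` and `h_iRUN_1` are hypothetical names that are not in the LC component library, and the `byte` alias and the value of `CS` are assumptions.

```cpp
using byte = unsigned char;    // assumption: LC's byte type is not shown in the slides
constexpr int CS = 16 * 1024;  // assumed chunk size in bytes

// Hypothetical encoder: writes (run length, value) pairs.
// Returns false if the encoded data does not fit in the out array.
static inline bool h_RUN_1(int& csize, const byte in[CS], byte out[CS]) {
  int pos = 0;
  for (int i = 0; i < csize;) {
    int run = 1;
    while (i + run < csize && run < 255 && in[i + run] == in[i]) run++;
    if (pos + 2 > CS) return false;  // no room for another pair
    out[pos++] = static_cast<byte>(run);
    out[pos++] = in[i];
    i += run;
  }
  csize = pos;  // the transformed data has a different size
  return true;
}

// Hypothetical decoder: expands each (run length, value) pair.
// Assumes the input was produced by h_RUN_1, so the output fits in CS bytes.
static inline void h_iRUN_1(int& csize, const byte in[CS], byte out[CS]) {
  int pos = 0;
  for (int i = 0; i + 1 < csize; i += 2) {
    for (int j = 0; j < in[i]; j++) out[pos++] = in[i + 1];
  }
  csize = pos;
}
```

For a full chunk with no repeated neighbors, every input byte would need two output bytes, so the encoder returns false and leaves csize untouched; how the framework reacts to that return value is not covered by this sketch.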


12 of 15

CPU Preprocessor Interface

// CPU preprocessor encoder

static inline void h_[name](int& size, byte*& data, const int paramc, const double paramv []);

// CPU preprocessor decoder

static inline void h_i[name](int& size, byte*& data, const int paramc, const double paramv []);

// transforms the 'size' bytes in the 'data' array and writes the result either back to the 'data' array or to a new array and then makes 'data' point to this new array

// if the number of bytes changes, the 'size' needs to be updated accordingly

// the data array must start at an 8-byte aligned address

// 'paramc' specifies the number of elements in the 'paramv' array

// the 'paramv' array passes the command-line arguments provided to this preprocessor (e.g., the error bound, data set dimensionality, etc.)

// this code must be manually parallelized (using OpenMP) if desired


13 of 15

GPU Preprocessor Interface

// GPU preprocessor encoder

static inline void d_[name](int& size, byte*& data, const int paramc, const double paramv []);

// GPU preprocessor decoder

static inline void d_i[name](int& size, byte*& data, const int paramc, const double paramv []);

// transforms the 'size' bytes in the 'data' array and writes the result either back to the 'data' array or to a new array and then makes 'data' point to this new array

// if the number of bytes changes, the 'size' needs to be updated accordingly

// the data array must start at an 8-byte aligned address

// 'paramc' specifies the number of elements in the 'paramv' array

// the 'paramv' array passes the command-line arguments provided to this preprocessor (e.g., the error bound, data set dimensionality, etc.)

// must be a host function that launches a kernel to do the preprocessing

// the kernel is allowed to allocate and use shared memory

// 'data' must be in device memory


14 of 15

CPU Preprocessor Example

// add user-provided constant (at byte granularity)
static inline void h_ADD(int& size, byte*& data, const int paramc, const double paramv [])
{
  assert(paramc == 1);
  const byte offset = paramv[0];
  #pragma omp parallel for default(none) shared(size, data, offset)
  for (int i = 0; i < size; i++) {
    data[i] += offset;
  }
}

static inline void h_iADD(int& size, byte*& data, const int paramc, const double paramv [])
{
  ...
  data[i] -= offset;
  ...
}


15 of 15

Languages and Guarantees

Supported languages

CPUs: serial and OpenMP C++
GPUs: CUDA and HIP*

Guarantees on compressed and decompressed output files

Bit-wise compatibility between devices (CPUs and GPUs)
Deterministic (fixed input always produces same compressed & decompressed output)
