
Optimized Preprocessing For Scientific Deep Learning Applications

Scientific Achievement

Developed optimized preprocessing pipelines for scientific deep-learning workloads (MLPerf HPC). The optimized preprocessing improved end-to-end performance by up to 10x.

Significance and Impact

Deep-learning workloads are an increasingly significant consumer of HPC compute cycles, so improved preprocessing pipelines are critical for efficient utilization of systems with AI accelerators. Our optimized preprocessing uses novel techniques to improve data processing for the CosmoFlow and DeepCAM applications, and these techniques could be leveraged by other scientific deep-learning applications.

Technical Approach

  • Develop application-specific encoding/decoding for preprocessing (see the sketch after this list):
    • Specialized data formats for both FP16 floating-point and integer-based scientific data.
    • Operator fusion and reordering to improve preprocessing execution.
    • Decoder logic that executes efficiently on both the host and the accelerator.
  • Improve performance by reducing data movement across architectural bottlenecks and enabling better caching.
  • Develop data schemas/metadata optimized for efficient processing on GPU accelerators.
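
The sketch below illustrates the flavor of these ideas, not the paper's actual implementation: an application-specific codec that packs integer-valued volumes into a compact fixed-width format, and a fused decode path that combines decoding, casting to FP16, and normalization in a single pass to reduce intermediate copies. The function names, the volume shape, and the log normalization are illustrative assumptions.

```python
# Minimal sketch of an application-specific codec with a fused decode path.
# Assumes CosmoFlow-style integer volumes whose values fit in 16 bits;
# encode_volume/decode_normalize are hypothetical names, not from the paper.
import numpy as np

VOLUME_SHAPE = (128, 128, 128, 4)  # example sub-volume shape (assumption)

def encode_volume(raw: np.ndarray) -> bytes:
    """Pack integer-valued scientific data into a compact uint16 byte stream
    instead of a generic container format."""
    assert raw.min() >= 0 and raw.max() < 2**16, "values must fit in uint16"
    return raw.astype(np.uint16).tobytes()

def decode_normalize(buf: bytes, shape=VOLUME_SHAPE) -> np.ndarray:
    """Fused decode: bytes -> normalized FP16 tensor in one pass.
    Decoding, casting, and normalization are fused so only one full-size
    intermediate is materialized; the same logic could run on the host or
    be ported to the accelerator."""
    ints = np.frombuffer(buf, dtype=np.uint16).reshape(shape)
    # log1p compresses the heavy-tailed counts (illustrative choice); the
    # final cast to float16 halves the bytes moved to the accelerator.
    return np.log1p(ints.astype(np.float32)).astype(np.float16)

if __name__ == "__main__":
    sample = np.random.randint(0, 1000, size=VOLUME_SHAPE)
    blob = encode_volume(sample)
    tensor = decode_normalize(blob)
    print(tensor.dtype, tensor.shape, len(blob) / sample.nbytes)
```

The design intent this sketch tries to convey is that a format tailored to the data's actual dynamic range, combined with a fused decode/normalize step, moves fewer bytes across the I/O and host-to-device bottlenecks than a generic decode-then-transform pipeline.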

K. Z. Ibrahim and L. Oliker, "Preprocessing Pipeline Optimization for Scientific Deep Learning Workloads," 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, 2022, pp. 1118-1128, doi: 10.1109/IPDPS53621.2022.00112.

Figure: Performance improvement of CosmoFlow deep-learning throughput on GPU-accelerated systems at OLCF and NERSC.