Performance Portability of Sparse Computational Methods on GPU-Accelerated Architectures
Scientific Achievement
Developed an algorithm-centric performance portability metric for evaluating multi-programming-model approaches on DOE GPU-accelerated HPC architectures. Demonstrated its effectiveness for sparse matrix multi-vector solvers.
Significance and Impact
The developed metric and methodology for evaluating performance portability provide a better guide to algorithmic selection than programming-model-centric approaches for algorithms whose behavior is influenced by dataset content, such as sparse matrix methods. Furthermore, a broad range of computational patterns can leverage this methodology.
Research Details
This research studied a wide set of algorithmic and programming models for sparse matrix multi-vector implementations, including vendor-specific programming models (CUDA and HIP) and portability programming models (Kokkos, OpenMP, and OpenACC), across five algorithmic variants.
The study shows that achieving portability depends on the feasibility of expressing an algorithmic variant on each of the programming models.
We developed an algorithm-centric portability metric that enables the evaluation of multi-programming model approaches.
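As an illustration of how such a metric can be computed, the sketch below contrasts a single-model portability score with an algorithm-centric, multi-model score. It uses the widely known harmonic-mean formulation of performance portability over per-platform efficiencies; the specific metric, function names, platforms, and efficiency values here are illustrative assumptions, not the authors' exact formulation or measured data.

```python
# Hedged sketch: harmonic-mean performance portability over platforms,
# and a hypothetical algorithm-centric variant that, for each platform,
# selects the best efficiency among the programming models able to
# express the algorithmic variant.

def performance_portability(efficiency):
    """Harmonic mean of per-platform efficiencies; 0 if any platform fails."""
    if any(e == 0 for e in efficiency.values()):
        return 0.0
    return len(efficiency) / sum(1.0 / e for e in efficiency.values())

def algorithm_centric_pp(eff_by_model):
    """Best efficiency per platform over all models (multi-model selection)."""
    platforms = set().union(*(m.keys() for m in eff_by_model.values()))
    best = {p: max(m.get(p, 0.0) for m in eff_by_model.values())
            for p in platforms}
    return performance_portability(best)

# Hypothetical efficiencies (fraction of achievable peak) per model/platform.
eff = {
    "kokkos": {"A100": 0.60, "MI250X": 0.35},
    "cuda":   {"A100": 0.80, "MI250X": 0.0},   # vendor model: NVIDIA only
    "hip":    {"A100": 0.0,  "MI250X": 0.75},  # vendor model: AMD only
}
print(round(performance_portability(eff["kokkos"]), 3))  # single-model score
print(round(algorithm_centric_pp(eff), 3))               # multi-model score
```

In this sketch, each vendor model alone scores zero (it fails on one platform), while combining models per platform yields a higher score than any single portability model, which is the intuition behind evaluating multi-programming-model approaches.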
Performance portability: portability-model vs. multi-backend approaches.
SciDAC-5 RAPIDS-FastMath
Khaled Z. Ibrahim, Chao Yang, Pieter Maris. Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs, 2022 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), SC22.
The Kokkos performance portability metric varies significantly with problem configuration.
Multi-programming model approaches can improve the observed performance portability across a wide range of configurations.