1 of 24

CAMP First GPU Solver: A Solution to Accelerate Chemistry in Atmospheric Models

7th ENES workshop

Christian Guzman Ruiz, Mario C. Acosta, Matthew Dawson*, Oriol Jorba, Carlos Pérez García-Pando, Kim Serradell

Barcelona Supercomputing Center

*National Center for Atmospheric Research (NCAR)

2 of 24

BSC Departments



3 of 24

Background


5 of 24

Atmospheric models

Atmospheric models are mathematical representations of atmospheric water, gas, and aerosol cycles.

Solving the chemical processes can take up to 90% of the execution time! (Christou et al., 2016)


6 of 24

State of the art - KPP GPU


Michail Alvanos and Theodoros Christoudias, GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model, 2017

  • The Kinetic PreProcessor (KPP) is a code-generation tool for solving chemical mechanisms, typically using Rosenbrock methods

  • KPP is widely used in the atmospheric community

  • The GPU version for the EMAC climate model achieves up to a 20x speedup against a single CPU core and 1.86x against 2 CPUs



8 of 24

CAMP: Chemistry Across Multiple Phases

Dawson, Guzman, Curtis, Acosta, et al., Chemistry Across Multiple Phases (CAMP) version 1.0, GMD 2022

Our objective!


9 of 24

CAMP CPU Solver

  • CAMP uses the Backward Differentiation Formula (BDF) method from CVODE, a solver for ordinary differential equation (ODE) systems.

  • BDF requires a linear solver package. The default option for CPU execution is the KLU sparse algorithm, while CAMP also includes a CUDA version of the Biconjugate Gradient (BCG) algorithm (a minimal setup sketch follows).
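For reference, a minimal sketch of this setup using a SUNDIALS 5.x-style API; the right-hand-side function, sizes, and tolerances below are illustrative placeholders, not CAMP's actual configuration:

```c
#include <cvode/cvode.h>                 /* CVodeCreate, CVodeInit, CVode */
#include <nvector/nvector_serial.h>      /* N_VNew_Serial */
#include <sunmatrix/sunmatrix_sparse.h>  /* SUNSparseMatrix (CSC for KLU) */
#include <sunlinsol/sunlinsol_klu.h>     /* SUNLinSol_KLU */

/* Placeholder right-hand side dy/dt = f(t, y); CAMP supplies its own. */
static int f(realtype t, N_Vector y, N_Vector ydot, void *user_data) {
  /* ... evaluate the chemical mechanism into ydot ... */
  return 0;
}

void bdf_klu_example(void) {
  sunindextype n = 156;                  /* e.g. CB05 species in one cell */
  N_Vector y = N_VNew_Serial(n);
  /* ... set initial concentrations in y ... */

  void *mem = CVodeCreate(CV_BDF);       /* BDF integrator */
  CVodeInit(mem, f, 0.0, y);
  CVodeSStolerances(mem, 1.0e-4, 1.0e-14);  /* illustrative tolerances */

  /* KLU sparse direct solver, the default CPU option */
  SUNMatrix A = SUNSparseMatrix(n, n, n * n, CSC_MAT);
  SUNLinearSolver ls = SUNLinSol_KLU(y, A);
  CVodeSetLinearSolver(mem, ls, A);
  /* a sparse Jacobian routine must also be set via CVodeSetJacFn(...) */

  realtype t;
  CVode(mem, 600.0, y, &t, CV_NORMAL);   /* advance one output step */

  SUNLinSolFree(ls);
  SUNMatDestroy(A);
  N_VDestroy(y);
  CVodeFree(&mem);
}
```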


10 of 24

Implementation

11 of 24

Block-cells (GPU parallelization strategy)

  • Block-cells assigns each atmospheric cell to a GPU thread block

  • Each block uses as many threads as there are chemical species (see the sketch below)

Guzman et al., Studying a new GPU treatment for chemical modules inside CAMP, 19th ECMWF Workshop
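A minimal sketch of the launch geometry this implies; the kernel and variable names are illustrative, not CAMP's actual code:

```cuda
// One thread block per atmospheric cell, one thread per chemical species.
__global__ void block_cells_kernel(double *conc, int n_species) {
  int cell = blockIdx.x;                // this block's atmospheric cell
  int species = threadIdx.x;            // this thread's chemical species
  int i = cell * n_species + species;   // index into the global state
  // ... per-species chemistry for this cell reads/writes conc[i] ...
}

// Launch: grid = number of cells, block = number of species (156 for CB05).
void launch_block_cells(double *d_conc, int n_cells, int n_species) {
  block_cells_kernel<<<n_cells, n_species>>>(d_conc, n_species);
}
```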


12 of 24

Block-cells (GPU parallelization strategy)

  • Higher occupancy than traditional approaches (more threads computing data)

  • 34x speedup against a single CPU thread for the CAMP BCG linear solver

Guzman et al., Studying a new GPU treatment for chemical modules inside CAMP, 19th ECMWF Workshop


13 of 24

Communicating data between threads

  • All communication is performed at the thread-block level

  • An example: a thread triggers an error due to a negative concentration

  • The error is shared with the other threads in the block through shared memory, as in the sketch below
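A minimal sketch of this pattern, with illustrative names (not CAMP's exact code):

```cuda
__global__ void check_negative_kernel(double *conc) {
  // One flag per thread block, visible to all of the block's threads
  __shared__ int error_flag;
  if (threadIdx.x == 0) error_flag = 0;
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (conc[i] < 0.0)
    atomicOr(&error_flag, 1);  // any thread can raise the error
  __syncthreads();             // make the flag visible block-wide

  if (error_flag) {
    // every thread in the block takes the recovery path together
  }
}
```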


14 of 24

Test environment

15 of 24

Hardware


  • CTE-POWER cluster:

    • 2 x IBM POWER9 8335-GTH @ 2.4 GHz (3.0 GHz turbo; 20 cores with 4 threads/core each, 40 cores total per node)

    • 4 x NVIDIA V100 (Volta) GPUs with 16 GB HBM2.

    • Compilers: GCC version 6.4.0 and NVCC version 10.2


16 of 24

Software configuration


Architecture | Parallel resources                                        | Parallelization language
CPU          | 1, 40                                                     | MPI
GPU          | Nº of different chemical concentrations (species × cells) | CUDA

  • The evaluation covers the code inside the outermost BDF loop; code for the earlier initializations is excluded.

  • Configuration:
    • Chemical mechanism: gas-phase chemistry from Carbon Bond 2005 (CB05)
    • Chemical species: 156
    • Cells (ODE systems): 100-10,000
    • GPU shared memory size per block: 256
    • CVODE absolute tolerance: 0.01%
    • BCG tolerance: 1.0e-30


17 of 24

Results

18 of 24

Speed-up

  • Up to 35x speedup on average vs. a single CPU thread

  • Standard deviation around 2


19 of 24

Speed-up against 40 processes

  • 1.2x speed-up against a full CPU node (40 MPI processes)

  • Since cells are computed independently, we estimate a 4.8x speed-up using the full GPU resources of a node (4 GPUs); see the sketch below - Ongoing work
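Because each cell's ODE system is independent, distributing cells across devices is conceptually simple. A purely illustrative sketch of the idea (this is the ongoing work, not the current implementation; all names are placeholders):

```cuda
#include <cuda_runtime.h>

// Each GPU integrates its own contiguous slice of cells.
void solve_cells_multi_gpu(const double *h_conc, int n_cells, int n_species) {
  int n_devices = 0;
  cudaGetDeviceCount(&n_devices);          // e.g. 4 V100s per node
  int per_dev = (n_cells + n_devices - 1) / n_devices;

  for (int dev = 0; dev < n_devices; ++dev) {
    int first = dev * per_dev;
    int count = (n_cells - first < per_dev) ? n_cells - first : per_dev;
    if (count <= 0) break;

    cudaSetDevice(dev);
    size_t bytes = (size_t)count * n_species * sizeof(double);
    double *d_conc = NULL;
    cudaMalloc(&d_conc, bytes);
    cudaMemcpy(d_conc, h_conc + (size_t)first * n_species, bytes,
               cudaMemcpyHostToDevice);
    // Launch the Block-cells kernel on this device's slice of cells:
    // block_cells_kernel<<<count, n_species>>>(d_conc, n_species);
    // ... copy results back and cudaFree(d_conc) ...
  }
}
```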


20 of 24

Kernel profiling

  • Some optimizations have already been applied (such as adjusting the number of registers per thread; see the sketch below)

  • Register usage is likely preventing the kernel from fully utilizing the GPU

  • This usage stems mostly from the algorithm itself, which handles large data structures such as the Jacobian matrix
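For context, the register count can be capped per kernel at compile time. A minimal sketch; the bounds here are illustrative placeholders, not CAMP's actual tuning:

```cuda
// __launch_bounds__ caps registers per thread so more blocks fit per SM;
// the trade-off is possible register spilling to local memory.
// A compiler-wide alternative is: nvcc --maxrregcount=<n>
__global__ void __launch_bounds__(156, 4)  // <=156 threads/block, >=4 blocks/SM
bdf_iteration_kernel(double *conc, int n_species) {
  // ... BDF iteration body; per-thread work on large structures like the
  // Jacobian is what drives the register pressure described above ...
}
```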


21 of 24

Conclusions

22 of 24

Conclusions


  • Our Block-cells strategy increases the number of parallel GPU threads compared to traditional implementations (where Nº threads = Nº cells)

  • The CUDA BDF loop performs up to 35x faster than a single CPU thread
    • 1.2x speed-up against a CPU using the full resources of a node (40 MPI processes).
    • Since the load is independent between cells, we estimate up to a 4.8x speedup using 4 GPUs per node.

  • Kernel profiling suggests that performance is limited by memory usage (registers)

  • Our approach can be applied to other chemical applications thanks to the versatility of the CAMP module.


23 of 24

Future work


  • Add multi-device functionality to compute with up to 4 GPUs per node.

  • Balance load between CPU and GPU architectures.

  • Integrate our implementation inside an atmospheric model.


24 of 24

Thank you

christian.guzman@bsc.es