1 of 24

CAMP First GPU Solver: A Solution to Accelerate Chemistry in Atmospheric Models

7th ENES workshop

Christian Guzman Ruiz, Mario C. Acosta, Matthew Dawson*, Oriol Jorba, Carlos Pérez García-Pando, Kim Serradell

Barcelona Supercomputing Center

*National Center for Atmospheric Research (NCAR)

2 of 24

BSC Departments



3 of 24

Background


5 of 24

Atmospheric models

Atmospheric models are mathematical representations of atmospheric water, gas, and aerosol cycles.

Solving the chemical processes can take up to 90% of the execution time! (Christou et al., 2016)


6 of 24

State of the art - KPP GPU


Michail Alvanos and Theodoros Christoudias, GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model, 2017

  • The Kinetic PreProcessor (KPP) is a code-generation tool for solving chemical mechanisms, typically using Rosenbrock methods

  • KPP is widely used in the atmospheric community

  • The GPU version for the EMAC climate model achieves up to a 20x speedup against a single CPU core and 1.86x against 2 CPUs



8 of 24

CAMP: Chemistry Across Multiple Phases

Dawson, Guzman, Curtis, Acosta, et al., Chemistry Across Multiple Phases (CAMP) version 1.0, GMD 2022

Our objective!


9 of 24

CAMP CPU Solver

  • CAMP uses the Backward Differentiation Formula (BDF) method from CVODE, a solver for ordinary differential equation (ODE) systems.

  • BDF requires a linear solver package. The default option for CPU execution is the KLU sparse algorithm, while CAMP also includes a CUDA version of the Biconjugate Gradient (BCG) algorithm (a minimal setup sketch follows).
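For reference, a minimal sketch of this setup using a SUNDIALS 5.x-style API; the right-hand-side function, sizes, and tolerances below are illustrative placeholders, not CAMP's actual configuration:

```c
#include <cvode/cvode.h>                 /* CVodeCreate, CVodeInit, CVode */
#include <nvector/nvector_serial.h>      /* N_VNew_Serial */
#include <sunmatrix/sunmatrix_sparse.h>  /* SUNSparseMatrix (CSC for KLU) */
#include <sunlinsol/sunlinsol_klu.h>     /* SUNLinSol_KLU */

/* Placeholder right-hand side dy/dt = f(t, y); CAMP supplies its own. */
static int f(realtype t, N_Vector y, N_Vector ydot, void *user_data) {
  /* ... evaluate the chemical mechanism into ydot ... */
  return 0;
}

void bdf_klu_example(void) {
  sunindextype n = 156;                  /* e.g. CB05 species in one cell */
  N_Vector y = N_VNew_Serial(n);
  /* ... set initial concentrations in y ... */

  void *mem = CVodeCreate(CV_BDF);       /* BDF integrator */
  CVodeInit(mem, f, 0.0, y);
  CVodeSStolerances(mem, 1.0e-4, 1.0e-14);  /* illustrative tolerances */

  /* KLU sparse direct solver, the default CPU option */
  SUNMatrix A = SUNSparseMatrix(n, n, n * n, CSC_MAT);
  SUNLinearSolver ls = SUNLinSol_KLU(y, A);
  CVodeSetLinearSolver(mem, ls, A);
  /* a sparse Jacobian routine must also be set via CVodeSetJacFn(...) */

  realtype t;
  CVode(mem, 600.0, y, &t, CV_NORMAL);   /* advance one output step */

  SUNLinSolFree(ls);
  SUNMatDestroy(A);
  N_VDestroy(y);
  CVodeFree(&mem);
}
```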


10 of 24

Implementation

11 of 24

Block-cells (GPU parallelization strategy)

  • Block-cells assigns each atmospheric cell to a GPU thread block

  • Each block uses as many threads as there are chemical species (see the sketch below)

Guzman et al., Studying a new GPU treatment for chemical modules inside CAMP, 19th ECMWF Workshop
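A minimal sketch of the launch geometry this implies; the kernel and variable names are illustrative, not CAMP's actual code:

```cuda
// One thread block per atmospheric cell, one thread per chemical species.
__global__ void block_cells_kernel(double *conc, int n_species) {
  int cell = blockIdx.x;                // this block's atmospheric cell
  int species = threadIdx.x;            // this thread's chemical species
  int i = cell * n_species + species;   // index into the global state
  // ... per-species chemistry for this cell reads/writes conc[i] ...
}

// Launch: grid = number of cells, block = number of species (156 for CB05).
void launch_block_cells(double *d_conc, int n_cells, int n_species) {
  block_cells_kernel<<<n_cells, n_species>>>(d_conc, n_species);
}
```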


12 of 24

Block-cells (GPU parallelization strategy)

  • Higher occupancy than traditional approaches (more threads computing data)

  • 34x speedup against a single CPU thread for the CAMP BCG linear solver

Guzman et al., Studying a new GPU treatment for chemical modules inside CAMP, 19th ECMWF Workshop


13 of 24

Communicating data between threads

  • All communication is performed at the thread-block level

  • An example: a thread triggers an error due to a negative concentration

  • The error is shared with the other threads in the block through shared memory, as in the sketch below
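A minimal sketch of this pattern, with illustrative names (not CAMP's exact code):

```cuda
__global__ void check_negative_kernel(double *conc) {
  // One flag per thread block, visible to all of the block's threads
  __shared__ int error_flag;
  if (threadIdx.x == 0) error_flag = 0;
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (conc[i] < 0.0)
    atomicOr(&error_flag, 1);  // any thread can raise the error
  __syncthreads();             // make the flag visible block-wide

  if (error_flag) {
    // every thread in the block takes the recovery path together
  }
}
```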


14 of 24

Test environment

15 of 24

Hardware


  • CTE-POWER cluster:

    • 2 x IBM POWER9 8335-GTH @ 2.4 GHz (3.0 GHz turbo; 20 cores with 4 threads/core each, 40 cores total per node)

    • 4 x NVIDIA V100 (Volta) GPUs with 16 GB HBM2.

    • Compilers: GCC version 6.4.0 and NVCC version 10.2


16 of 24

Software configuration


Architecture | Parallel resources                                        | Parallelization language
CPU          | 1, 40                                                     | MPI
GPU          | Nº of different chemical concentrations (species × cells) | CUDA

  • The evaluation covers the code inside the outermost BDF loop; code for the earlier initializations is excluded.

  • Configuration:
    • Chemical mechanism: gas-phase chemistry from Carbon Bond 2005 (CB05)
    • Chemical species: 156
    • Cells (ODE systems): 100-10,000
    • GPU shared memory size per block: 256
    • CVODE absolute tolerance: 0.01%
    • BCG tolerance: 1.0e-30


17 of 24

Results

18 of 24

Speed-up

  • Up to 35x speedup on average vs. a single CPU thread

  • Standard deviation around 2


19 of 24

Speed-up against 40 processes

  • 1.2x speed-up against a full CPU node (40 MPI processes)

  • Since cells are computed independently, we estimate a 4.8x speed-up using the full GPU resources of a node (4 GPUs); see the sketch below - Ongoing work
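Because each cell's ODE system is independent, distributing cells across devices is conceptually simple. A purely illustrative sketch of the idea (this is the ongoing work, not the current implementation; all names are placeholders):

```cuda
#include <cuda_runtime.h>

// Each GPU integrates its own contiguous slice of cells.
void solve_cells_multi_gpu(const double *h_conc, int n_cells, int n_species) {
  int n_devices = 0;
  cudaGetDeviceCount(&n_devices);          // e.g. 4 V100s per node
  int per_dev = (n_cells + n_devices - 1) / n_devices;

  for (int dev = 0; dev < n_devices; ++dev) {
    int first = dev * per_dev;
    int count = (n_cells - first < per_dev) ? n_cells - first : per_dev;
    if (count <= 0) break;

    cudaSetDevice(dev);
    size_t bytes = (size_t)count * n_species * sizeof(double);
    double *d_conc = NULL;
    cudaMalloc(&d_conc, bytes);
    cudaMemcpy(d_conc, h_conc + (size_t)first * n_species, bytes,
               cudaMemcpyHostToDevice);
    // Launch the Block-cells kernel on this device's slice of cells:
    // block_cells_kernel<<<count, n_species>>>(d_conc, n_species);
    // ... copy results back and cudaFree(d_conc) ...
  }
}
```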


20 of 24

Kernel profiling

  • Some optimizations have already been applied (such as adjusting the number of registers per thread; see the sketch below)

  • Register usage is likely preventing the kernel from fully utilizing the GPU

  • This usage stems mostly from the algorithm itself, which handles large data structures such as the Jacobian matrix
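For context, the register count can be capped per kernel at compile time. A minimal sketch; the bounds here are illustrative placeholders, not CAMP's actual tuning:

```cuda
// __launch_bounds__ caps registers per thread so more blocks fit per SM;
// the trade-off is possible register spilling to local memory.
// A compiler-wide alternative is: nvcc --maxrregcount=<n>
__global__ void __launch_bounds__(156, 4)  // <=156 threads/block, >=4 blocks/SM
bdf_iteration_kernel(double *conc, int n_species) {
  // ... BDF iteration body; per-thread work on large structures like the
  // Jacobian is what drives the register pressure described above ...
}
```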


21 of 24

Conclusions

22 of 24

Conclusions


  • Our Block-cells strategy increases the number of parallel GPU threads compared to traditional implementations (where Nº threads = Nº cells)

  • The CUDA BDF loop performs up to 35x faster than a single CPU thread
    • 1.2x speed-up against a CPU using the full resources of a node (40 MPI processes).
    • Since the load is independent between cells, we estimate up to a 4.8x speedup using 4 GPUs per node.

  • Kernel profiling suggests that performance is limited by memory usage (registers)

  • Our approach can be applied to other chemical applications thanks to the versatility of the CAMP module.


23 of 24

Future work


  • Add multi-device functionality to compute with up to 4 GPUs per node.

  • Balance load between CPU and GPU architectures.

  • Integrate our implementation inside an atmospheric model.


24 of 24

Thank you

christian.guzman@bsc.es