1 of 18

Mashy Green & Ilektra Christidi

RSE HPC SIG Online Meetup, 19 May 2025

Custom Acceleration Frameworks: The good, the bad, and the ugly

2 of 18

A case study of our experience as RSEs on a project that uses a Custom Acceleration Framework (CAF)

  • Sharing what we hope are useful observations from working with Grid
  • Applicable to other codebases

3 of 18

Software options for offloading

  • Native language (CUDA/HIP)
    • Pros: full control, maximum acceleration possible
    • Cons: difficult (different language), non-portable
  • Custom
    • Pros: optimal balance of performance and abstraction for the specific codebase
    • Cons: lots of work, code-specific, maintenance is up to the application developers
  • Pragma API (OpenACC, OpenMP)
    • Pros: straightforward, familiar, intuitive, “fully” portable
    • Cons: available features and performance depend on compiler maturity
  • Third party (Kokkos, RAJA, SYCL)
    • Pros: portable, higher abstraction than native
    • Cons: performance not optimal for all codes

Note: not an endorsement/rejection of any of the options; there are valid cases for all of them.
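Not from the slides: a minimal illustration of the “Pragma API” option above, assuming a compiler with OpenMP target-offload support (e.g. built with -fopenmp plus a target flag). Without offload support the directive is ignored and the loop simply runs on the host.

```cpp
#include <cstdio>
#include <vector>

// SAXPY offloaded with OpenMP target directives: the pragma maps the arrays
// to the device, distributes the loop over teams/threads, and copies y back.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
  const float* xp = x.data();
  float* yp = y.data();
  const long n = (long)x.size();
  #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (long i = 0; i < n; ++i)
    yp[i] = a * xp[i] + yp[i];
}

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  saxpy(3.0f, x, y);
  std::printf("y[0] = %f\n", y[0]);  // expect 5.0
}
```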

4 of 18

Grid’s CAF

  • Heavy use of C macros
  • Same API for CPU (SIMD) and GPU (SIMT)
  • Kernel abstraction:
    • lambda_apply
    • accelerator_for
  • Memory management:
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions

5 of 18

Grid’s CAF

  • Heavy use of C macros
  • Same API for CPU (SIMD) and GPU (SIMT)
  • Kernel abstraction (see the sketch after this list):
    • lambda_apply
    • accelerator_for
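The slide itself shows Grid code; the block below is not Grid's implementation, just a minimal sketch of the pattern the bullets describe: one user-facing loop macro whose expansion is a plain CPU loop by default, or a device lambda launch when a (hypothetical) GPU flag is defined. The names sketch_for, lambda_kernel and SKETCH_GPU are made up for illustration; Grid's accelerator_for/lambda_apply follow the same general idea but differ in detail.

```cpp
#include <cstdio>

#ifdef SKETCH_GPU   // hypothetical GPU build flag; compile with nvcc --extended-lambda
  template <class Lambda>
  __global__ void lambda_kernel(long n, Lambda lam) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) lam(i);   // each GPU thread handles one iteration (SIMT)
  }
  #define sketch_for(i, n, ...)                                      \
    {                                                                \
      auto lam = [=] __device__ (long i) { __VA_ARGS__ };            \
      lambda_kernel<<<((n) + 127) / 128, 128>>>((long)(n), lam);     \
      cudaDeviceSynchronize();                                       \
    }
#else               // CPU path: the same call site becomes an ordinary loop
  #define sketch_for(i, n, ...)                                      \
    for (long i = 0; i < (long)(n); ++i) { __VA_ARGS__ }
#endif

int main() {
  const long N = 8;
#ifdef SKETCH_GPU
  double* x = nullptr;
  cudaMallocManaged(&x, N * sizeof(double));  // unified memory: same pointer on host & device
#else
  double xbuf[N];
  double* x = xbuf;
#endif
  // Same call site regardless of target; the macro hides the launch details.
  sketch_for(i, N, { x[i] = 2.0 * i; });
  std::printf("x[3] = %f\n", x[3]);
#ifdef SKETCH_GPU
  cudaFree(x);
#endif
  return 0;
}
```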

6 of 18

Grid’s CAF

  • Memory management:
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions

Memory manager

7 of 18

Grid’s CAF

  • Memory management (see the sketch after this list):
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions
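Again, this is not Grid's MemoryManager, only a sketch of the accessor pattern the bullets describe, with hypothetical names (Field, View, Where): objects live in a common pool, a view opened for a given execution space triggers any needed copy, and closing the view records which copy is authoritative so lazy syncing and eviction can happen later.

```cpp
#include <cstdio>
#include <vector>

enum class Where { Host, Device };

struct Field {
  std::vector<double> host;     // host copy in the common pool
  double* device = nullptr;     // device copy (null: not resident)
  bool device_dirty = false;    // true if the device copy is the current one
  explicit Field(std::size_t n) : host(n, 0.0) {}
};

// RAII view: opening decides where the data must be resident; closing
// leaves the bookkeeping in place so later views can sync lazily.
struct View {
  Field& f;
  Where where;
  View(Field& f_, Where w) : f(f_), where(w) {
    if (where == Where::Device) {
      // A real framework would allocate from the device pool and copy
      // host -> device here (e.g. cudaMemcpy / hipMemcpy).
      f.device_dirty = true;
    } else if (f.device_dirty) {
      // A device -> host copy would happen here before host access.
      f.device_dirty = false;
    }
  }
  double& operator[](std::size_t i) { return f.host[i]; }  // host-side stand-in
  ~View() { /* release the view; eviction may happen later under pool pressure */ }
};

int main() {
  Field phi(16);
  {
    View v(phi, Where::Host);   // like an autoview opened for host access
    v[0] = 1.5;
  }                             // view closed; residency bookkeeping updated
  std::printf("phi[0] = %f\n", phi.host[0]);
}
```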

8 of 18

The Good

Low-level optimisation abstracted away from the user

9 of 18

The Good

Data types and operators already optimised

10 of 18

The Bad

  • Hard/unclear how to gain fine-grained control of the memory (see the hypothetical interface sketch after this list)
  • A lot of abstraction without detailed documentation (the same level of docs as generic frameworks cannot be expected, but neither can prior knowledge from new developers, so the right level of documentation is sorely needed)
  • Hard for new developers to understand and use optimally
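A hypothetical interface (none of these calls exist in Grid) illustrating the first bullet: the kind of explicit, fine-grained data-movement control an application developer might want to reach for alongside the automatic pool behaviour.

```cpp
// All names here are invented for illustration.
enum class Residency { HostOnly, DeviceOnly, Both };

struct Field;  // some lattice object managed by the framework

struct MemoryControl {
  // Start an asynchronous host->device copy before a kernel needs the object.
  virtual void prefetchToDevice(Field&) = 0;
  // Drop the device copy now instead of waiting for pool pressure.
  virtual void evictFromDevice(Field&) = 0;
  // Keep the object resident on the device across kernels.
  virtual void pinOnDevice(Field&) = 0;
  // Report where the object currently lives and whether the copies are in sync.
  virtual Residency queryResidency(const Field&) const = 0;
  virtual ~MemoryControl() = default;
};
```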

11 of 18

The Bad

Where is staple (dSdU_mu) allocated and zeroed?

Before

12 of 18

The Bad

Where is staple (dSdU_mu) allocated and zeroed?

Before

After

13 of 18

The Ugly

Profile/Debug this:
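The original slide shows a real Grid kernel and its trace, which is not reproduced here. As a stand-in, a minimal example of why macro-wrapped lambdas are awkward in a profiler or debugger: the whole loop body is a macro argument, so the debugger sees a single source line for the entire expansion, and the work is attributed to a compiler-generated closure type with an unreadable name.

```cpp
#include <cstdio>
#include <typeinfo>

// The entire loop body lives inside a macro argument (as in accelerator_for-style
// macros), so there is one source line for the whole expansion.
#define SKETCH_FOR(i, n, ...)                                   \
  {                                                             \
    auto body = [&](long i) { __VA_ARGS__ };                    \
    std::printf("closure type: %s\n", typeid(body).name());     \
    for (long i = 0; i < (long)(n); ++i) body(i);               \
  }

int main() {
  double sum = 0.0;
  SKETCH_FOR(i, 1000, { sum += 1.0 / (i + 1); })  // one "line" to the debugger
  std::printf("sum = %f\n", sum);
}
```

The printed closure type is a mangled, compiler-generated name; that is roughly what a profiler attributes the work to when the kernel has no name of its own.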

14 of 18

The Ugly

Profile/Debug this:

15 of 18

The Ugly

Profile/Debug this:

16 of 18

The Ugly

Profile/Debug this:

17 of 18

The Ugly

Profile/Debug this:

18 of 18

Conclusions/Recommendations

When designing a CAF

  • Make the various layers of memory access/data movement control available to the application developers
  • Document those layers, how they work, and their APIs
  • Macros/lambdas are not helpful for profiling and debugging (one possible mitigation is sketched below)
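A hypothetical sketch (not Grid's API) of the mitigation hinted at in the last bullet: have the launch macro take a human-readable kernel name and attach it to a scoped marker, so traces show e.g. "toy_reduction" instead of an anonymous closure. In a GPU build the marker could forward the name to NVTX/rocTX ranges; here it is just a timed scope printed to stdout.

```cpp
#include <chrono>
#include <cstdio>

// Prints the kernel name and elapsed time when the scope ends.
struct ScopedMarker {
  const char* name;
  std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
  explicit ScopedMarker(const char* n) : name(n) {}
  ~ScopedMarker() {
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("[kernel] %-16s %lld us\n", name, (long long)us);
  }
};

// The launch macro carries the name through, so every kernel is attributable.
#define NAMED_FOR(name, i, n, ...)                         \
  {                                                        \
    ScopedMarker _m(name);                                 \
    for (long i = 0; i < (long)(n); ++i) { __VA_ARGS__ }   \
  }

int main() {
  double acc = 0.0;
  NAMED_FOR("toy_reduction", i, 1 << 20, { acc += i * 1e-6; })
  std::printf("acc = %f\n", acc);
}
```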