1 of 18

Mashy Green & Ilektra Christidi

RSE HPC SIG Online Meetup, 19 May 2025

Custom Acceleration Frameworks: The good, the bad, and the ugly

2 of 18

A case study of our experience as RSEs on a project that uses a Custom Acceleration Framework (CAF)

  • Sharing what we hope are useful observations from working with Grid
  • Applicable to other codebases

3 of 18

Software options for offloading

  • Native language (CUDA/HIP)
    • Pros: full control, maximum acceleration possible
    • Cons: difficult (different language), non-portable
  • Custom
    • Pros: optimal balance of performance and abstraction for the specific codebase
    • Cons: lots of work, code-specific, maintenance is up to the application developers
  • Pragma API (OpenACC, OpenMP)
    • Pros: straightforward, familiar, intuitive, “fully” portable
    • Cons: available features and performance depend on compiler maturity
  • Third party (Kokkos, RAJA, SYCL)
    • Pros: portable, higher abstraction than native
    • Cons: performance not optimal for all codes

Note: not an endorsement/rejection of any of the options; there are valid cases for all of them.
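Not from the slides: a minimal illustration of the “Pragma API” option above, assuming a compiler with OpenMP target-offload support (e.g. built with -fopenmp plus a target flag). Without offload support the directive is ignored and the loop simply runs on the host.

```cpp
#include <cstdio>
#include <vector>

// SAXPY offloaded with OpenMP target directives: the pragma maps the arrays
// to the device, distributes the loop over teams/threads, and copies y back.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
  const float* xp = x.data();
  float* yp = y.data();
  const long n = (long)x.size();
  #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (long i = 0; i < n; ++i)
    yp[i] = a * xp[i] + yp[i];
}

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  saxpy(3.0f, x, y);
  std::printf("y[0] = %f\n", y[0]);  // expect 5.0
}
```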

4 of 18

Grid’s CAF

  • Heavy use of C macros
  • Same API for CPU (SIMD) and GPU (SIMT)
  • Kernel abstraction:
    • lambda_apply
    • accelerator_for
  • Memory management:
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions

5 of 18

Grid’s CAF

  • Heavy use of C macros
  • Same API for CPU (SIMD) and GPU (SIMT)
  • Kernel abstraction (see the sketch after this list):
    • lambda_apply
    • accelerator_for
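The slide itself shows Grid code; the block below is not Grid's implementation, just a minimal sketch of the pattern the bullets describe: one user-facing loop macro whose expansion is a plain CPU loop by default, or a device lambda launch when a (hypothetical) GPU flag is defined. The names sketch_for, lambda_kernel and SKETCH_GPU are made up for illustration; Grid's accelerator_for/lambda_apply follow the same general idea but differ in detail.

```cpp
#include <cstdio>

#ifdef SKETCH_GPU   // hypothetical GPU build flag; compile with nvcc --extended-lambda
  template <class Lambda>
  __global__ void lambda_kernel(long n, Lambda lam) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) lam(i);   // each GPU thread handles one iteration (SIMT)
  }
  #define sketch_for(i, n, ...)                                      \
    {                                                                \
      auto lam = [=] __device__ (long i) { __VA_ARGS__ };            \
      lambda_kernel<<<((n) + 127) / 128, 128>>>((long)(n), lam);     \
      cudaDeviceSynchronize();                                       \
    }
#else               // CPU path: the same call site becomes an ordinary loop
  #define sketch_for(i, n, ...)                                      \
    for (long i = 0; i < (long)(n); ++i) { __VA_ARGS__ }
#endif

int main() {
  const long N = 8;
#ifdef SKETCH_GPU
  double* x = nullptr;
  cudaMallocManaged(&x, N * sizeof(double));  // unified memory: same pointer on host & device
#else
  double xbuf[N];
  double* x = xbuf;
#endif
  // Same call site regardless of target; the macro hides the launch details.
  sketch_for(i, N, { x[i] = 2.0 * i; });
  std::printf("x[3] = %f\n", x[3]);
#ifdef SKETCH_GPU
  cudaFree(x);
#endif
  return 0;
}
```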

6 of 18

Grid’s CAF

  • Memory management:
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions

Memory manager

7 of 18

Grid’s CAF

  • Memory management (see the sketch after this list):
    • Common memory pool between host & device (virtual or physical)
    • Objects are moved when needed and evicted when they no longer fit
    • Accessed via autoview objects or peek/poke functions
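Again, this is not Grid's MemoryManager, only a sketch of the accessor pattern the bullets describe, with hypothetical names (Field, View, Where): objects live in a common pool, a view opened for a given execution space triggers any needed copy, and closing the view records which copy is authoritative so lazy syncing and eviction can happen later.

```cpp
#include <cstdio>
#include <vector>

enum class Where { Host, Device };

struct Field {
  std::vector<double> host;     // host copy in the common pool
  double* device = nullptr;     // device copy (null: not resident)
  bool device_dirty = false;    // true if the device copy is the current one
  explicit Field(std::size_t n) : host(n, 0.0) {}
};

// RAII view: opening decides where the data must be resident; closing
// leaves the bookkeeping in place so later views can sync lazily.
struct View {
  Field& f;
  Where where;
  View(Field& f_, Where w) : f(f_), where(w) {
    if (where == Where::Device) {
      // A real framework would allocate from the device pool and copy
      // host -> device here (e.g. cudaMemcpy / hipMemcpy).
      f.device_dirty = true;
    } else if (f.device_dirty) {
      // A device -> host copy would happen here before host access.
      f.device_dirty = false;
    }
  }
  double& operator[](std::size_t i) { return f.host[i]; }  // host-side stand-in
  ~View() { /* release the view; eviction may happen later under pool pressure */ }
};

int main() {
  Field phi(16);
  {
    View v(phi, Where::Host);   // like an autoview opened for host access
    v[0] = 1.5;
  }                             // view closed; residency bookkeeping updated
  std::printf("phi[0] = %f\n", phi.host[0]);
}
```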

8 of 18

The Good

Low-level optimisation abstracted away from the user

9 of 18

The Good

Data types and operators already optimised

10 of 18

The Bad

  • Hard/unclear how to gain fine-grained control of the memory (see the hypothetical interface sketch after this list)
  • A lot of abstraction without detailed documentation (the same level of docs as generic frameworks cannot be expected, but neither can prior knowledge from new developers, so the right level of documentation is sorely needed)
  • Hard for new developers to understand and use optimally
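A hypothetical interface (none of these calls exist in Grid) illustrating the first bullet: the kind of explicit, fine-grained data-movement control an application developer might want to reach for alongside the automatic pool behaviour.

```cpp
// All names here are invented for illustration.
enum class Residency { HostOnly, DeviceOnly, Both };

struct Field;  // some lattice object managed by the framework

struct MemoryControl {
  // Start an asynchronous host->device copy before a kernel needs the object.
  virtual void prefetchToDevice(Field&) = 0;
  // Drop the device copy now instead of waiting for pool pressure.
  virtual void evictFromDevice(Field&) = 0;
  // Keep the object resident on the device across kernels.
  virtual void pinOnDevice(Field&) = 0;
  // Report where the object currently lives and whether the copies are in sync.
  virtual Residency queryResidency(const Field&) const = 0;
  virtual ~MemoryControl() = default;
};
```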

11 of 18

The Bad

Where is staple (dSdU_mu) allocated and zeroed?

Before

12 of 18

The Bad

Where is staple (dSdU_mu) allocated and zeroed?

Before

After

13 of 18

The Ugly

Profile/Debug this:
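The original slide shows a real Grid kernel and its trace, which is not reproduced here. As a stand-in, a minimal example of why macro-wrapped lambdas are awkward in a profiler or debugger: the whole loop body is a macro argument, so the debugger sees a single source line for the entire expansion, and the work is attributed to a compiler-generated closure type with an unreadable name.

```cpp
#include <cstdio>
#include <typeinfo>

// The entire loop body lives inside a macro argument (as in accelerator_for-style
// macros), so there is one source line for the whole expansion.
#define SKETCH_FOR(i, n, ...)                                   \
  {                                                             \
    auto body = [&](long i) { __VA_ARGS__ };                    \
    std::printf("closure type: %s\n", typeid(body).name());     \
    for (long i = 0; i < (long)(n); ++i) body(i);               \
  }

int main() {
  double sum = 0.0;
  SKETCH_FOR(i, 1000, { sum += 1.0 / (i + 1); })  // one "line" to the debugger
  std::printf("sum = %f\n", sum);
}
```

The printed closure type is a mangled, compiler-generated name; that is roughly what a profiler attributes the work to when the kernel has no name of its own.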

14 of 18

The Ugly

Profile/Debug this:

15 of 18

The Ugly

Profile/Debug this:

16 of 18

The Ugly

Profile/Debug this:

17 of 18

The Ugly

Profile/Debug this:

18 of 18

Conclusions/Recommendations

When designing a CAF

  • Make the various layers of memory access/data movement control available to the application developers
  • Document those layers, how they work, and their APIs
  • Macros/lambdas are not helpful for profiling and debugging (one possible mitigation is sketched below)
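A hypothetical sketch (not Grid's API) of the mitigation hinted at in the last bullet: have the launch macro take a human-readable kernel name and attach it to a scoped marker, so traces show e.g. "toy_reduction" instead of an anonymous closure. In a GPU build the marker could forward the name to NVTX/rocTX ranges; here it is just a timed scope printed to stdout.

```cpp
#include <chrono>
#include <cstdio>

// Prints the kernel name and elapsed time when the scope ends.
struct ScopedMarker {
  const char* name;
  std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
  explicit ScopedMarker(const char* n) : name(n) {}
  ~ScopedMarker() {
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("[kernel] %-16s %lld us\n", name, (long long)us);
  }
};

// The launch macro carries the name through, so every kernel is attributable.
#define NAMED_FOR(name, i, n, ...)                         \
  {                                                        \
    ScopedMarker _m(name);                                 \
    for (long i = 0; i < (long)(n); ++i) { __VA_ARGS__ }   \
  }

int main() {
  double acc = 0.0;
  NAMED_FOR("toy_reduction", i, 1 << 20, { acc += i * 1e-6; })
  std::printf("acc = %f\n", acc);
}
```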