1 of 9

Performance Portability Pre-ECP and Post-ECP – VTK-m

Kenneth Moreland, Oak Ridge National Laboratory

ECP Annual Meeting

January 19, 2023

Approved for public release


2 of 9

VTK-m: Visualization on Accelerators

Contouring and features

Particle Density

Rendering

Advection and flow

And much more…


3 of 9

VTK-m Designed with Portability in Mind

Execution Environment

Cell Operations

Field Operations

Basic Math

Make Cells

Control Environment

Grid Topology

Array Handle

Invoke

Device Adapter

Allocate

Transfer

Schedule

Sort

Worklet

Over-decompose problem into small units. Same basic algorithm can be parallelized in different ways.
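A minimal sketch of the over-decomposition idea, using only standard C++ rather than the real VTK-m API. The names `SquareWorklet`, `ScheduleFlat`, `ScheduleBlocked`, and `SquareAll` are hypothetical and chosen for illustration. The worklet describes the work for one element; the scheduling strategy is chosen separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "worklet": the work for ONE output element. It knows nothing
// about how the surrounding iteration is scheduled.
struct SquareWorklet {
  const std::vector<double>* in;
  std::vector<double>* out;
  void operator()(std::size_t i) const { (*out)[i] = (*in)[i] * (*in)[i]; }
};

// Strategy A: one flat loop over all indices.
template <typename Worklet>
void ScheduleFlat(const Worklet& worklet, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    worklet(i);
  }
}

// Strategy B: fixed-size blocks, the shape in which a thread pool or GPU grid
// would hand out work. The blocks run one after another in this sketch; a
// real backend would dispatch them concurrently.
template <typename Worklet>
void ScheduleBlocked(const Worklet& worklet, std::size_t n, std::size_t blockSize) {
  assert(blockSize > 0);
  for (std::size_t begin = 0; begin < n; begin += blockSize) {
    const std::size_t end = std::min(begin + blockSize, n);
    for (std::size_t i = begin; i < end; ++i) {
      worklet(i);
    }
  }
}

// Runs the same worklet under either strategy.
std::vector<double> SquareAll(const std::vector<double>& in, bool blocked) {
  std::vector<double> out(in.size());
  const SquareWorklet worklet{&in, &out};
  if (blocked) {
    ScheduleBlocked(worklet, in.size(), 4);
  } else {
    ScheduleFlat(worklet, in.size());
  }
  return out;
}
```

Because each invocation writes only its own output element, both strategies give the same result, which is what lets the scheduler be swapped per device.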

Build an adaption layer. (Based on Blelloch, Vector Models for Data-Parallel Computing)
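As a rough illustration of the adaption layer, the sketch below builds stream compaction from two data-parallel primitives, schedule and exclusive scan, in the spirit of the vector-model approach. `SerialAdapter` and `CopyIf` are hypothetical names, not VTK-m classes; porting to a new device would mean supplying another adapter with the same two functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial "device adapter" exposing two primitives.
struct SerialAdapter {
  // Schedule: call f(i) for every index in [0, n).
  template <typename F>
  static void Schedule(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) {
      f(i);
    }
  }

  // Exclusive prefix sum. Returns the total of all inputs.
  static std::size_t ScanExclusive(const std::vector<std::size_t>& in,
                                   std::vector<std::size_t>& out) {
    out.resize(in.size());
    std::size_t sum = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[i] = sum;
      sum += in[i];
    }
    return sum;
  }
};

// Stream compaction written only against the adapter's primitives:
// flag the kept elements, scan the flags to get output slots, then scatter.
template <typename Adapter, typename Pred>
std::vector<int> CopyIf(const std::vector<int>& in, Pred keep) {
  const std::size_t n = in.size();
  std::vector<std::size_t> flags(n);
  std::vector<std::size_t> offsets;

  Adapter::Schedule(n, [&](std::size_t i) { flags[i] = keep(in[i]) ? 1 : 0; });
  const std::size_t count = Adapter::ScanExclusive(flags, offsets);
  assert(offsets.size() == n);

  std::vector<int> out(count);
  Adapter::Schedule(n, [&](std::size_t i) {
    if (flags[i] != 0) {
      out[offsets[i]] = in[i];
    }
  });
  return out;
}
```

Usage: `CopyIf<SerialAdapter>(values, [](int x) { return x > 0; })` keeps the positive entries in their original order.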


4 of 9

Why we switched to Kokkos

Contour

Streams

Clip

Render

x86

CUDA

Xeon Phi

Surface Normals

Ghost Cells

Warp


5 of 9

Why we switched to Kokkos

Contour

Streams

Clip

Render

x86

CUDA

Xeon Phi

Radeon

Xe

Surface Normals

Ghost Cells

Warp

Surprise!


6 of 9

Kokkos Performance Running VTK-m Benchmarks

Kokkos performed about the same on most benchmarks

Kokkos performed significantly better on a few benchmarks

Kokkos performed much worse on a few benchmarks


7 of 9

Improved VTK-m/Kokkos Performance on Spock

Scope and objectives

    • ECP/VTK-m enables scientific visualization on the emerging processors required for extreme scale computers.
    • Major focus on core functionality of HPC sci-vis software.

Project accomplishment

    • Further testing identified register spilling as a potential cause of the slowdown.
    • Leveraging fixes provided by AMD and the Kokkos team, an update to VTK-m yielded a 12-fold performance improvement.

Impact

    • Benchmarking of VTK-m on Spock revealed performance issues in some circumstances.
      • In some cases, Spock runs were significantly slower than Summit runs.
    • Slowdown was common in cases that required tracing connectivity across unstructured mesh data.
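To make "tracing connectivity" concrete, here is a small sketch in standard C++ of the indirect access pattern involved. The `CellSet` layout (an offsets array plus a flat connectivity array) and the `CellAverages` function are illustrative assumptions, not VTK-m code. Each cell looks up its point ids and then gathers field values through them, so memory access is data-dependent rather than regular.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal unstructured mesh topology.
// Cell c uses the point ids connectivity[offsets[c]] .. connectivity[offsets[c + 1] - 1].
struct CellSet {
  std::vector<std::size_t> offsets;       // size = number of cells + 1
  std::vector<std::size_t> connectivity;  // point ids, cell after cell
};

// Averages a point field onto cells. Every read of pointField goes through
// the connectivity array (a gather), which is the kind of indirection that
// structured grids avoid.
std::vector<double> CellAverages(const CellSet& cells,
                                 const std::vector<double>& pointField) {
  assert(!cells.offsets.empty());
  const std::size_t numCells = cells.offsets.size() - 1;
  std::vector<double> result(numCells, 0.0);
  for (std::size_t c = 0; c < numCells; ++c) {
    const std::size_t begin = cells.offsets[c];
    const std::size_t end = cells.offsets[c + 1];
    double sum = 0.0;
    for (std::size_t k = begin; k < end; ++k) {
      sum += pointField[cells.connectivity[k]];
    }
    if (end > begin) {
      result[c] = sum / static_cast<double>(end - begin);
    }
  }
  return result;
}
```

For two triangles sharing an edge, `offsets = {0, 3, 6}` and `connectivity = {0, 1, 2, 1, 3, 2}` describe the mesh, and the shared points 1 and 2 are read once per cell.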

Figure: Improvements in Kokkos and VTK-m led to a 12-fold speedup of unstructured gradient estimation on Spock.

ECP WBS

2.3.4.13 ECP/VTK-m

PI

Kenneth Moreland, ORNL

Members

LANL, SNL, UO, Kitware

Deliverables

VTK-m Quality Performance Integration 8 - performance evaluation, P6 Activity STDA05-62

VTK-m source code repository available at: https://gitlab.kitware.com/vtk/vtk-m


8 of 9

Challenges with Kokkos

  • Algorithm support is not as mature as array support
    • We have encountered performance issues with some algorithms
      • Example: The bin-based sort is fast, but only if the keys are evenly distributed
    • The algorithm functionality is not as expressive as we need
      • Example: Sort only supports a less-than comparator operator
      • Example: By-key variants either don’t exist or are not production worthy yet
  • Setting/Getting runtime options (e.g., number of threads to use) is screwy
    • Runtime options can only be set on initialization (Kokkos::initialize())
      • Cannot mix command line arguments and initialization settings object
      • Sometimes VTK-m does not get to initialize Kokkos.
    • Cannot get status of options (e.g., how many threads is Kokkos using?)
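For reference, the sort functionality the list above asks for (a by-key sort that accepts an arbitrary comparator) can be sketched serially in standard C++. This is not Kokkos or VTK-m code; `SortByKey` is a hypothetical helper that only shows the expected semantics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Sorts keys with a caller-supplied comparator and applies the same
// permutation to values. Entries with equivalent keys keep their relative order.
template <typename Key, typename Value, typename Compare>
void SortByKey(std::vector<Key>& keys, std::vector<Value>& values, Compare comp) {
  assert(keys.size() == values.size());

  // Sort a permutation of indices instead of the data itself.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return comp(keys[a], keys[b]); });

  // Gather keys and values through the sorted permutation.
  std::vector<Key> sortedKeys;
  std::vector<Value> sortedValues;
  sortedKeys.reserve(keys.size());
  sortedValues.reserve(values.size());
  for (std::size_t i : order) {
    sortedKeys.push_back(keys[i]);
    sortedValues.push_back(values[i]);
  }
  keys.swap(sortedKeys);
  values.swap(sortedValues);
}
```

Usage: `SortByKey(keys, values, std::greater<int>())` orders both arrays by descending key, a case that a less-than-only sort cannot express directly.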


9 of 9

VTK-m is Frontier-Ready

  • Verified to compile and run by Thomas Gibson (AMD Engineer)
  • Performance comparable to Crusher
