1 of 19

Scientific Kernels in Computing & HPC – Lit Survey

June 2024

Raghu Shankar


OCP ODSA HPC & AI Modularity

2 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


3 of 19

Arithmetic Intensity Spectrum of Eight Selected Scientific Kernels


Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

4 of 19

Scientific kernels characteristics


Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

All kernels use double precision (DP). The thread counts listed are the optimal thread numbers when running on Broadwell and KNL.

5 of 19

Scientific Kernels – Details – Ordered by Arithmetic Intensity


General Matrix-Matrix Multiplication (GEMM)

  • One of the most widely used computation kernels in scientific computing. It computes the product of two dense matrices: C = αAB + βC
  • Generally compute-bound, with O(n³) algorithmic complexity
  • One implementation is in PLASMA
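As a minimal sketch (not the PLASMA implementation), the triple-loop form of C = αAB + βC in Python/NumPy, which makes the O(n³) flop count explicit:

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Naive triple-loop GEMM: returns alpha*A@B + beta*C (O(n^3) flops)."""
    n, k = A.shape
    _, m = B.shape
    out = beta * C.copy()
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):          # inner dot product over the shared dim
                acc += A[i, p] * B[p, j]
            out[i, j] += alpha * acc
    return out

rng = np.random.default_rng(0)
A, B, C = rng.random((4, 3)), rng.random((3, 5)), rng.random((4, 5))
assert np.allclose(gemm(2.0, A, B, 0.5, C), 2.0 * A @ B + 0.5 * C)
```

Production GEMMs reach compute-bound throughput by blocking these loops for cache and registers; the arithmetic (2n³ flops over O(n²) data) is what gives the kernel its high arithmetic intensity.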

Cholesky Decomposition

  • Decomposes a positive-definite Hermitian matrix A into a lower triangular matrix L and its conjugate transpose L*: A = LL*
  • A more efficient way to solve systems of linear equations than general LU decomposition
  • Compute-bound, with high arithmetic intensity
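A minimal sketch of solving a linear system via the Cholesky factors, using NumPy's built-in decomposition (an illustration of the A = LL* idea, not the PLASMA routine):

```python
import numpy as np

# Build a random symmetric positive-definite system A x = b.
rng = np.random.default_rng(1)
M = rng.random((5, 5))
A = M @ M.T + 5 * np.eye(5)   # SPD by construction
b = rng.random(5)

L = np.linalg.cholesky(A)     # A = L @ L.T, with L lower triangular
y = np.linalg.solve(L, b)     # forward solve:  L y = b
x = np.linalg.solve(L.T, y)   # backward solve: L.T x = y

assert np.allclose(L @ L.T, A)
assert np.allclose(A @ x, b)
```

Because the two triangular solves are cheap, the factorization dominates the cost, at roughly half the flops of an LU decomposition.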

Stencil

  • A class of iterative kernels that constitute the core of many scientific applications
  • The kernels sweep the same stencil computation over the cells of a given multi-dimensional grid iteratively
  • Stencils can exhibit diverse arithmetic intensities depending on the specific stencil function adopted
  • However, optimized stencil algorithms often achieve high arithmetic intensity through orchestrated spatial and temporal blocking techniques
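A minimal sketch of an iterative stencil sweep (a 5-point Jacobi update on a 2D grid; the grid size and boundary condition are illustrative assumptions, not from the paper):

```python
import numpy as np

def jacobi_sweep(u):
    """One 5-point stencil sweep: each interior cell becomes the mean of
    its four neighbours; boundary cells are left untouched."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((6, 6))
u[0, :] = 1.0          # fixed "hot" boundary on one edge
for _ in range(50):    # sweep the same stencil over the grid iteratively
    u = jacobi_sweep(u)

assert u[1, 2] > u[4, 2]   # heat diffuses inward from the hot edge
```

Each sweep touches O(grid) data and does a handful of flops per cell, which is why blocking across sweeps (temporal blocking) is needed to raise arithmetic intensity.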

Fast Fourier Transform (FFT)

  • Algorithm to compute the discrete Fourier transform (DFT) of a sequence, or its inverse
  • Rapidly converts a signal from its original time or space domain into the frequency domain, or vice versa, by factorizing the DFT matrix into a product of sparse factors
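A minimal sketch contrasting the O(n²) DFT matrix-vector product with the factorized FFT (illustrative; HPC kernels use tuned libraries such as FFTW):

```python
import numpy as np

# A small signal: the FFT gives the same result as the dense DFT matrix,
# but computes it in O(n log n) by factorizing that matrix.
n = 8
x = np.cos(2 * np.pi * 3 * np.arange(n) / n)   # tone at frequency bin 3

W = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
dft = W @ x                 # direct DFT: dense O(n^2) matrix-vector product
fft = np.fft.fft(x)         # fast factorized version

assert np.allclose(dft, fft)
assert np.argmax(np.abs(fft[:n // 2])) == 3   # energy at bin 3, as expected
```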

Sparse Matrix-Vector Multiplication (SpMV)

  • Probably the most used and studied sparse linear-algebra algorithm
  • Multiplies a sparse matrix by a dense vector and returns a dense vector; low arithmetic intensity, normally bounded by memory bandwidth
  • The Compressed Sparse Row (CSR) data-structure-based implementation uses load-balanced data partitioning and SIMD-friendly operations to achieve performance
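A minimal sketch of CSR-based SpMV (illustrative; the CSR5 implementation referenced by the paper adds load balancing and SIMD-friendly layouts on top of this basic loop):

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A stored in Compressed Sparse Row (CSR) form:
    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]   # indirect access into x
    return y

# CSR encoding of A = [[5, 0, 0], [0, 8, 3], [0, 0, 6]]
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
vals    = [5.0, 8.0, 3.0, 6.0]
x = np.array([1.0, 2.0, 3.0])

assert np.allclose(spmv_csr(row_ptr, col_idx, vals, x), [5.0, 25.0, 18.0])
```

The indirect gather `x[col_idx[k]]` is what makes SpMV memory-bound: each nonzero costs one multiply-add but two memory reads.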

Sparse Matrix Transposition (SpTRANS)

  • Transposes a sparse matrix A of order m×n to Aᵀ of order n×m, converting the CSR format to the compressed sparse column (CSC) format, or vice versa
  • Mainly rearranges the nonzero entries of the input matrix, thus requiring little computation
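A minimal sketch of CSR-to-CSC conversion via a histogram and prefix sum (illustrative; not the ScanTrans/MergeTrans implementations). Since CSC of A is exactly CSR of Aᵀ, this is the data movement the SpTRANS kernel performs:

```python
import numpy as np

def csr_to_csc(row_ptr, col_idx, vals, n_cols):
    """Convert CSR to CSC: count entries per column, prefix-sum the counts
    into column offsets, then scatter each nonzero into place."""
    nnz = len(vals)
    col_ptr = np.zeros(n_cols + 1, dtype=int)
    for c in col_idx:                      # histogram of nonzeros per column
        col_ptr[c + 1] += 1
    col_ptr = np.cumsum(col_ptr)           # prefix sum -> column offsets
    row_out = np.zeros(nnz, dtype=int)
    val_out = np.zeros(nnz)
    cursor = col_ptr[:-1].copy()
    for i in range(len(row_ptr) - 1):      # scatter entries column by column
        for k in range(row_ptr[i], row_ptr[i + 1]):
            p = cursor[col_idx[k]]
            row_out[p], val_out[p] = i, vals[k]
            cursor[col_idx[k]] += 1
    return col_ptr, row_out, val_out

# CSR encoding of A = [[5, 0, 0], [0, 8, 3], [0, 0, 6]]
col_ptr, row_out, val_out = csr_to_csc([0, 1, 3, 4], [0, 1, 2, 2],
                                       [5.0, 8.0, 3.0, 6.0], 3)
assert list(col_ptr) == [0, 1, 2, 4]
assert list(row_out) == [0, 1, 1, 2]
```

Almost all of the work is index bookkeeping and memory movement, matching the "little computation" characterization above.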

Sparse Triangular Solve (SpTRSV)

  • Computes the dense solution vector x of a sparse linear system Lx = b, where L is a square lower-triangular sparse matrix and b is a dense vector
  • Unlike other sparse BLAS routines, SpTRSV is generally more difficult to parallelize because it is inherently sequential
  • O(nnz) complexity and arithmetic intensity as low as SpMV's, but often much slower than SpMV due to the dependencies among the components of x
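A minimal sketch of a sequential CSR forward solve (illustrative; it assumes the diagonal entry is stored last in each row). The loop makes the dependency chain visible: x[i] cannot be computed until all earlier components it references are done.

```python
import numpy as np

def sptrsv_csr(row_ptr, col_idx, vals, b):
    """Solve L x = b for sparse lower-triangular L in CSR form
    (assumption: diagonal entry stored last in each row). O(nnz) work."""
    x = np.zeros(len(b))
    for i in range(len(b)):
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]       # depends on earlier x entries
        x[i] = s / vals[row_ptr[i + 1] - 1]    # divide by the diagonal
    return x

# CSR encoding of L = [[2, 0, 0], [1, 3, 0], [0, 4, 5]]
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
b = np.array([2.0, 7.0, 18.0])
x = sptrsv_csr(row_ptr, col_idx, vals, b)
assert np.allclose(x, [1.0, 2.0, 2.0])
```

Parallel SpTRSV implementations analyze the sparsity pattern to find sets of rows with no mutual dependencies (level scheduling), which is why preprocessing libraries like SpMP exist.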

6 of 19

Package Repositories built on kernels


Kernel   | Package                                                      | Dataset               | Output                                              | Repository
GEMM     | PLASMA (Parallel Linear Algebra Software for Multicore Architectures) | Dense matrices | Dataset stats, elapsed exec time, GFLOPS throughput |
Cholesky | PLASMA                                                       | Dense matrices        | Same                                                | Same
SpMV     | Benchmark SpMV using CSR5                                    | 968 matrices          | Same                                                | GitHub TBD
SpTRANS  | ScanTrans, MergeTrans                                        | 968 matrices          | Same                                                |
SpTRSV   | SpMP: Sparse Matrix pre-processing library                   | 968 matrices          | Same                                                |
Stencil  | YASK (Yet Another Stencil Kernel)                            | 3D grid of given size | Same                                                |
FFT      | FFTW v3.3.5                                                  | 3D grid of given size | Same                                                |
Stream   | Stream                                                       | Array of given size   |                                                     |

7 of 19

Platform Configuration – On-package memory at 236 Gflops/sec


  • “SP Perf.” and “DP Perf.” are the theoretical maximum single- and double-precision floating-point operation throughput, respectively
  • “C/” = memory capacity
  • “B/” = memory bandwidth
  • “OPM” is on-package memory
  • “Cache” refers to the upper-level cache with respect to OPM in the memory hierarchy

Note: the performance and bandwidth listed are theoretical values calculated from the spec sheets; the actual numbers can be lower.

Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

236 Gflops/s – Intel Haswell and Broadwell processors with eDRAM

8 of 19

eDRAM Analysis for reference: Performance Gaps and Speedup ~2X


The throughput performance gap (Gflops/sec) limits the achievable speedup.

9 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


10 of 19

GenArchBench: genomics benchmark suite – Summary

  • Evaluated on Fugaku's A64FX processors, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome
  • Most genomics applications are tested and optimized for x86 systems; few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extension (SVE)
  • The authors selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm
  • The GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely used genome-analysis pipelines, including:
    • Basecalling, read mapping, variant calling, and genome assembly
  • The suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically
  • The paper details the optimizations implemented in each kernel, with performance and energy evaluations comparing the four HPC machines
  • The evaluation shows that Graviton3 outperforms the other machines on average
  • The performance of A64FX is significantly constrained by its small memory hierarchy and its latencies
  • https://github.com/LorienLV/genarchbench/releases/tag/1.0.0


11 of 19

Workflow diagram of common genome analysis pipelines: (1) Sequencing, (2) Basecalling, (3a) Genome resequencing, or (3b) Genome assembly. The figure shows the different computational kernels used within each stage or tool.



12 of 19

GenArchBench: 13 multi-threaded CPU kernels from genomic tools (1 of 2)


2. Basecalling

Adaptive Banded Signal to Event Alignment (ABEA)

A redesigned version of the Suzuki-Kasahara dynamic programming algorithm, used to compare raw nanopore signals produced by ONT sequencing machines to a reference genome sequence.

Neural Network-based Base Calling (NN-BASE)

ONT sequencing machines monitor changes in an electrical current as single strands of DNA or RNA pass through a protein nanopore. These changes in the electrical current are then converted to a sequence of nucleotide bases in the basecalling process.

3a. Genome Resequencing

Seed

FM-Index Search (FMI)

A compressed sub-string index based on the Burrows-Wheeler transform. Given a sub-string “s”, the FM-index finds the locations of “s” in the reference genome in O(|s|) time, where |s| is the length of the sub-string.

Chain

Seed Chaining (CHAIN)

Given a set of subsequences (seeds) from a DNA sequence (read), chaining links overlapping seeds to form larger ones (using heuristics).

SIMD Seed Chaining (FAST-CHAIN)

An x86-vectorized version of CHAIN that removes the heuristics to exploit SIMD computation

Extend

Bit-Parallel Myers (BPM)

A dynamic programming algorithm that finds all locations at which a query string of size m matches a reference string of size n with k or fewer differences.

Banded Smith-Waterman (BSW)

A dynamic programming algorithm that computes the local alignment of two sequences of lengths m and n, respectively, in O(mn) time and space.

Wavefront Alignment (WFA)

A pairwise alignment algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process.

Whereas traditional dynamic programming algorithms run in quadratic time, WFA runs in O(ns) time, proportional to the read length n and the alignment score s, using O(s²) memory.

13 of 19

GenArchBench: 13 multi-threaded CPU kernels from genomic tools (2 of 2)


Resequencing

Variant Calling

Neural Network-based Variant Calling (NN-VARIANT)

The process of detecting the differences (variants or mutations) between the aligned reads and the reference genome; a costly process.

Pileup Counting (PILEUP)

Starting with the alignment data of a set of reads aligned to a region of a reference genome, usually a SAM or BAM file, pileup counting is the process of summarizing the base-pair information at each chromosomal position.

3b. Genome Assembly

K-mer Counting (KMER-CNT)

Aims to count the number of occurrences of each k-mer in an input sequence.

De-Bruijn Graph Construction (DBG)

The de Bruijn graph of an input set of reads is used to represent the overlaps between the sub-strings of length k (k-mers) found in the input.

Multiple Sequence Alignment

Partial-Order Alignment (POA)

The construction of an overlap graph from a set of reads leads to an approximate representation of the original sample’s genome.

To determine the consensus genome of the sample, all the reads are aligned against each other in a process called multiple sequence alignment (MSA).
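The KMER-CNT kernel above can be sketched minimally in Python (illustrative; real k-mer counters use hashing and compact encodings to handle whole genomes):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count occurrences of every length-k substring (k-mer) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ACGTACGA", 3)
assert counts["ACG"] == 2          # appears at positions 0 and 4
assert sum(counts.values()) == 6   # 8 - 3 + 1 k-mers in total
```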

14 of 19

Speedups: 1 of 2


15 of 19

Speedups: 2 of 2


16 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


17 of 19

Performance characterization of 64-core SG2042 RISC-V CPU for HPC

  • RISC-V is beginning to gain significant traction in HPC
  • Sophon’s SG2042 is the first mass-produced, commodity-available, high-core-count RISC-V CPU designed for high-performance workloads
  • The paper uses NASA’s NAS Parallel Benchmark (NPB) suite to characterize the performance of the SG2042 against other CPUs implementing the RISC-V, x86-64, and AArch64 ISAs
  • The NPB suite is a collection of benchmarks developed by NASA’s Advanced Supercomputing (NAS) division to characterize Computational Fluid Dynamics (CFD) applications
  • Two models of execution: (1) OpenMP, using shared memory, and (2) MPI, using message passing

  • The SG2042 consistently outperforms all other RISC-V solutions, delivering between a 2.6x and 16.7x performance improvement at the single-core level
  • Compared against the x86-64 and AArch64 CPUs, the SG2042 performs comparatively well on compute-bound algorithms but loses relative performance on memory-bandwidth- or latency-bound algorithms
  • The authors identify the SG2042’s memory subsystem as its greatest bottleneck


18 of 19

NAS Parallel Benchmark (NPB): 5 Kernels & 3 Pseudo Applications


Kernel

Brief Description

Integer Sort

Tests indirect, random memory accesses, which stall a significant fraction of the CPU on cache accesses

Multi Grid

Heavily memory-bound, both in time stalled on cache and main-memory accesses and in the percentage of execution time during which DDR is under high utilization

Embarrassingly Parallel

Designed to test compute performance; far fewer cycles are stalled on memory accesses, and no time is spent at high DDR bandwidth utilization

Conjugate Gradient

Irregular memory accesses and nearest-neighbour communication, which result in around 37% of clock ticks stalled on cache or DDR accesses

Fast Fourier Transform

Requires all-to-all communication between ranks to undertake a parallel transposition of the data

Pseudo Applications: combine multiple kernels to provide more complicated workloads that compute finite-difference solutions to the 3D compressible Navier-Stokes equations

Block Tridiagonal

Based on a Beam-Warming approximation with Gaussian elimination; the resulting equations are block-tridiagonal; stalls the least on memory accesses

LU Gauss Seidel

The LU benchmark solves a block-lower/block-upper triangular approximation using the Gauss-Seidel iterative method

Scalar Pentadiagonal

Based on a Beam-Warming approximation with Gaussian elimination, but the resulting equations are fully diagonalized
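The Conjugate Gradient kernel listed above can be sketched minimally, with a dense matrix standing in for the sparse matrix-vector product that dominates each iteration (illustrative; not the NPB CG benchmark code):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Plain CG for a symmetric positive-definite A: each iteration costs
    one matrix-vector product (the SpMV in the real kernel) plus a few
    vector updates and dot products."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(2 * len(b)):
        Ap = A @ p                      # the (sparse) matrix-vector product
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:       # converged: residual is tiny
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(2)
M = rng.random((6, 6))
A = M @ M.T + 6 * np.eye(6)   # symmetric positive-definite test matrix
b = rng.random(6)
assert np.allclose(A @ conjugate_gradient(A, b), b, atol=1e-8)
```

With a sparse A, the irregular accesses inside the matrix-vector product are what produce the cache and DDR stalls the benchmark measures.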

19 of 19

Performance Characterization – 2 kernels


Integer Sort

Multi Grid

The AMD EPYC has 8 memory controllers and 8 memory channels, connected to DDR4-3200 memory

Skylake performs the best; it has the largest L2 cache, at 1 MB per core