1 of 19

Scientific Kernels in Computing & HPC – Lit Survey

June 2024

Raghu Shankar


OCP ODSA HPC & AI Modularity

2 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


3 of 19

Arithmetic Intensity Spectrum of Eight Selected Scientific Kernels


Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

4 of 19

Scientific kernels characteristics


Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

All kernels use double precision (DP). The thread counts listed are the optimal thread numbers when running on Broadwell and KNL.

5 of 19

Scientific Kernels – Details – Ordered by Arithmetic Intensity


General Matrix-Matrix Multiplication (GEMM)

  • One of the most widely used computation kernels in scientific computing. It computes the product of two dense matrices: C = αAB + βC
  • Generally compute-bound, with O(n³) algorithmic complexity
  • One implementation is in PLASMA
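As a minimal sketch (not the PLASMA implementation), the triple-loop form of C = αAB + βC in Python/NumPy, which makes the O(n³) flop count explicit:

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Naive triple-loop GEMM: returns alpha*A@B + beta*C (O(n^3) flops)."""
    n, k = A.shape
    _, m = B.shape
    out = beta * C.copy()
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):          # inner dot product over the shared dim
                acc += A[i, p] * B[p, j]
            out[i, j] += alpha * acc
    return out

rng = np.random.default_rng(0)
A, B, C = rng.random((4, 3)), rng.random((3, 5)), rng.random((4, 5))
assert np.allclose(gemm(2.0, A, B, 0.5, C), 2.0 * A @ B + 0.5 * C)
```

Production GEMMs reach compute-bound throughput by blocking these loops for cache and registers; the arithmetic (2n³ flops over O(n²) data) is what gives the kernel its high arithmetic intensity.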

Cholesky Decomposition

  • Decomposes a positive-definite Hermitian matrix A into a lower triangular matrix L and its conjugate transpose L*: A = LL*
  • A more efficient way to solve systems of linear equations than general LU decomposition
  • Compute-bound, with high arithmetic intensity
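A minimal sketch of solving a linear system via the Cholesky factors, using NumPy's built-in decomposition (an illustration of the A = LL* idea, not the PLASMA routine):

```python
import numpy as np

# Build a random symmetric positive-definite system A x = b.
rng = np.random.default_rng(1)
M = rng.random((5, 5))
A = M @ M.T + 5 * np.eye(5)   # SPD by construction
b = rng.random(5)

L = np.linalg.cholesky(A)     # A = L @ L.T, with L lower triangular
y = np.linalg.solve(L, b)     # forward solve:  L y = b
x = np.linalg.solve(L.T, y)   # backward solve: L.T x = y

assert np.allclose(L @ L.T, A)
assert np.allclose(A @ x, b)
```

Because the two triangular solves are cheap, the factorization dominates the cost, at roughly half the flops of an LU decomposition.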

Stencil

  • A class of iterative kernels that constitute the core of many scientific applications
  • The kernels sweep the same stencil computation over the cells of a given multi-dimensional grid iteratively
  • Stencils can exhibit diverse arithmetic intensities depending on the specific stencil function adopted
  • However, optimized stencil algorithms often achieve high arithmetic intensity through orchestrated spatial and temporal blocking techniques
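A minimal sketch of an iterative stencil sweep (a 5-point Jacobi update on a 2D grid; the grid size and boundary condition are illustrative assumptions, not from the paper):

```python
import numpy as np

def jacobi_sweep(u):
    """One 5-point stencil sweep: each interior cell becomes the mean of
    its four neighbours; boundary cells are left untouched."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((6, 6))
u[0, :] = 1.0          # fixed "hot" boundary on one edge
for _ in range(50):    # sweep the same stencil over the grid iteratively
    u = jacobi_sweep(u)

assert u[1, 2] > u[4, 2]   # heat diffuses inward from the hot edge
```

Each sweep touches O(grid) data and does a handful of flops per cell, which is why blocking across sweeps (temporal blocking) is needed to raise arithmetic intensity.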

Fast Fourier Transform (FFT)

  • Algorithm to compute the discrete Fourier transform (DFT) of a sequence, or its inverse
  • Rapidly converts a signal from its original time or space domain into the frequency domain, or vice versa, by factorizing the DFT matrix into a product of sparse factors
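A minimal sketch contrasting the O(n²) DFT matrix-vector product with the factorized FFT (illustrative; HPC kernels use tuned libraries such as FFTW):

```python
import numpy as np

# A small signal: the FFT gives the same result as the dense DFT matrix,
# but computes it in O(n log n) by factorizing that matrix.
n = 8
x = np.cos(2 * np.pi * 3 * np.arange(n) / n)   # tone at frequency bin 3

W = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
dft = W @ x                 # direct DFT: dense O(n^2) matrix-vector product
fft = np.fft.fft(x)         # fast factorized version

assert np.allclose(dft, fft)
assert np.argmax(np.abs(fft[:n // 2])) == 3   # energy at bin 3, as expected
```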

Sparse Matrix-Vector Multiplication (SpMV)

  • Probably the most used and studied sparse linear-algebra algorithm
  • Multiplies a sparse matrix by a dense vector and returns a dense vector; low arithmetic intensity, normally bounded by memory bandwidth
  • The Compressed Sparse Row (CSR) data-structure-based implementation uses load-balanced data partitioning and SIMD-friendly operations to achieve performance
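A minimal sketch of CSR-based SpMV (illustrative; the CSR5 implementation referenced by the paper adds load balancing and SIMD-friendly layouts on top of this basic loop):

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A stored in Compressed Sparse Row (CSR) form:
    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]   # indirect access into x
    return y

# CSR encoding of A = [[5, 0, 0], [0, 8, 3], [0, 0, 6]]
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
vals    = [5.0, 8.0, 3.0, 6.0]
x = np.array([1.0, 2.0, 3.0])

assert np.allclose(spmv_csr(row_ptr, col_idx, vals, x), [5.0, 25.0, 18.0])
```

The indirect gather `x[col_idx[k]]` is what makes SpMV memory-bound: each nonzero costs one multiply-add but two memory reads.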

Sparse Matrix Transposition (SpTRANS)

  • Transposes a sparse matrix A of order m×n to Aᵀ of order n×m, converting the CSR format to the compressed sparse column (CSC) format, or vice versa
  • Mainly rearranges the nonzero entries of the input matrix, thus requiring little computation
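A minimal sketch of CSR-to-CSC conversion via a histogram and prefix sum (illustrative; not the ScanTrans/MergeTrans implementations). Since CSC of A is exactly CSR of Aᵀ, this is the data movement the SpTRANS kernel performs:

```python
import numpy as np

def csr_to_csc(row_ptr, col_idx, vals, n_cols):
    """Convert CSR to CSC: count entries per column, prefix-sum the counts
    into column offsets, then scatter each nonzero into place."""
    nnz = len(vals)
    col_ptr = np.zeros(n_cols + 1, dtype=int)
    for c in col_idx:                      # histogram of nonzeros per column
        col_ptr[c + 1] += 1
    col_ptr = np.cumsum(col_ptr)           # prefix sum -> column offsets
    row_out = np.zeros(nnz, dtype=int)
    val_out = np.zeros(nnz)
    cursor = col_ptr[:-1].copy()
    for i in range(len(row_ptr) - 1):      # scatter entries column by column
        for k in range(row_ptr[i], row_ptr[i + 1]):
            p = cursor[col_idx[k]]
            row_out[p], val_out[p] = i, vals[k]
            cursor[col_idx[k]] += 1
    return col_ptr, row_out, val_out

# CSR encoding of A = [[5, 0, 0], [0, 8, 3], [0, 0, 6]]
col_ptr, row_out, val_out = csr_to_csc([0, 1, 3, 4], [0, 1, 2, 2],
                                       [5.0, 8.0, 3.0, 6.0], 3)
assert list(col_ptr) == [0, 1, 2, 4]
assert list(row_out) == [0, 1, 1, 2]
```

Almost all of the work is index bookkeeping and memory movement, matching the "little computation" characterization above.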

Sparse Triangular Solve (SpTRSV)

  • Computes the dense solution vector x of a sparse linear system Lx = b, where L is a square lower-triangular sparse matrix and b is a dense vector
  • Unlike other sparse BLAS routines, SpTRSV is generally more difficult to parallelize because it is inherently sequential
  • O(nnz) complexity and arithmetic intensity as low as SpMV's, but often much slower than SpMV due to the dependencies among the components of x
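A minimal sketch of a sequential CSR forward solve (illustrative; it assumes the diagonal entry is stored last in each row). The loop makes the dependency chain visible: x[i] cannot be computed until all earlier components it references are done.

```python
import numpy as np

def sptrsv_csr(row_ptr, col_idx, vals, b):
    """Solve L x = b for sparse lower-triangular L in CSR form
    (assumption: diagonal entry stored last in each row). O(nnz) work."""
    x = np.zeros(len(b))
    for i in range(len(b)):
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]       # depends on earlier x entries
        x[i] = s / vals[row_ptr[i + 1] - 1]    # divide by the diagonal
    return x

# CSR encoding of L = [[2, 0, 0], [1, 3, 0], [0, 4, 5]]
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
b = np.array([2.0, 7.0, 18.0])
x = sptrsv_csr(row_ptr, col_idx, vals, b)
assert np.allclose(x, [1.0, 2.0, 2.0])
```

Parallel SpTRSV implementations analyze the sparsity pattern to find sets of rows with no mutual dependencies (level scheduling), which is why preprocessing libraries like SpMP exist.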

6 of 19

Package Repositories built on kernels


Kernel   | Package                                                      | Dataset               | Output                                              | Repository
GEMM     | PLASMA (Parallel Linear Algebra Software for Multicore Architectures) | Dense matrices | Dataset stats, elapsed exec time, GFLOPS throughput |
Cholesky | PLASMA                                                       | Dense matrices        | Same                                                | Same
SpMV     | Benchmark SpMV using CSR5                                    | 968 matrices          | Same                                                | GitHub TBD
SpTRANS  | ScanTrans, MergeTrans                                        | 968 matrices          | Same                                                |
SpTRSV   | SpMP: Sparse Matrix pre-processing library                   | 968 matrices          | Same                                                |
Stencil  | YASK (Yet Another Stencil Kernel)                            | 3D grid of given size | Same                                                |
FFT      | FFTW v3.3.5                                                  | 3D grid of given size | Same                                                |
Stream   | Stream                                                       | Array of given size   |                                                     |

7 of 19

Platform Configuration – On-package memory at 236 Gflops/sec


  • “SP Perf.” and “DP Perf.” are the theoretical maximum single- and double-precision floating-point operation throughput, respectively
  • “C/” = memory capacity
  • “B/” = memory bandwidth
  • “OPM” is on-package memory
  • “Cache” refers to the upper-level cache with respect to OPM in the memory hierarchy

Note: the performance and bandwidth listed are theoretical values calculated from the spec sheets; the actual numbers can be lower.

Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017

236 Gflops/s – Intel Haswell and Broadwell processors with eDRAM

8 of 19

eDRAM Analysis for reference: Performance Gaps and Speedup ~2X


The throughput performance gap (Gflops/sec) limits the achievable speedup.

9 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


10 of 19

GenArchBench: genomics benchmark suite – Summary

  • Evaluated on Fugaku's A64FX processors, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome
  • Most genomics applications are tested and optimized for x86 systems; few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extension (SVE)
  • The authors selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm
  • The GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely used genome-analysis pipelines, including:
    • Basecalling, read mapping, variant calling, and genome assembly
  • The suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically
  • The paper details the optimizations implemented in each kernel, with performance and energy evaluations comparing the four HPC machines
  • The evaluation shows that Graviton3 outperforms the other machines on average
  • The performance of A64FX is significantly constrained by its small memory hierarchy and its latencies
  • https://github.com/LorienLV/genarchbench/releases/tag/1.0.0


11 of 19

Workflow diagram of common genome analysis pipelines: (1) Sequencing, (2) Basecalling, (3a) Genome resequencing, or (3b) Genome assembly. The figure shows the different computational kernels used within each stage or tool.



12 of 19

GenArchBench: 13 multi-threaded CPU kernels from genomic tools (1 of 2)


2. Basecalling

Adaptive Banded Signal to Event Alignment (ABEA)

A redesigned version of the Suzuki-Kasahara dynamic programming algorithm, used to compare raw nanopore signals produced by ONT sequencing machines to a reference genome sequence.

Neural Network-based Base Calling (NN-BASE)

ONT sequencing machines monitor changes in an electrical current as single strands of DNA or RNA pass through a protein nanopore. These changes in the electrical current are then converted to a sequence of nucleotide bases in the basecalling process.

3a. Genome Resequencing

Seed

FM-Index Search (FMI)

A compressed sub-string index based on the Burrows-Wheeler transform. Given a sub-string “s”, the FM-index finds the locations of “s” in the reference genome in O(|s|) time, where |s| is the length of the sub-string.

Chain

Seed Chaining (CHAIN)

Given a set of subsequences (seeds) from a DNA sequence (read), chaining links overlapping seeds to form larger ones (using heuristics).

SIMD Seed Chaining (FAST-CHAIN)

An x86-vectorized version of CHAIN that removes the heuristics to exploit SIMD computation

Extend

Bit-Parallel Myers (BPM)

A dynamic programming algorithm that finds all locations at which a query string of size m matches a reference string of size n with k or fewer differences.

Banded Smith-Waterman (BSW)

A dynamic programming algorithm that computes the local alignment of two sequences of lengths m and n, respectively, in O(mn) time and space.

Wavefront Alignment (WFA)

A pairwise alignment algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process.

Whereas traditional dynamic programming algorithms run in quadratic time, WFA runs in O(ns) time, proportional to the read length n and the alignment score s, using O(s²) memory.

13 of 19

GenArchBench: 13 multi-threaded CPU kernels from genomic tools (2 of 2)


Resequencing

Variant Calling

Neural Network-based Variant Calling (NN-VARIANT)

The process of detecting the differences (variants or mutations) between the aligned reads and the reference genome; a costly process.

Pileup Counting (PILEUP)

Starting with the alignment data of a set of reads aligned to a region of a reference genome, usually a SAM or BAM file, pileup counting is the process of summarizing the base-pair information at each chromosomal position.

3b. Genome Assembly

K-mer Counting (KMER-CNT)

Aims to count the number of occurrences of each k-mer in an input sequence.

De-Bruijn Graph Construction (DBG)

The de Bruijn graph of an input set of reads is used to represent the overlaps between the sub-strings of length k (k-mers) found in the input.

Multiple Sequence Alignment

Partial-Order Alignment (POA)

The construction of an overlap graph from a set of reads leads to an approximate representation of the original sample’s genome.

To determine the consensus genome of the sample, all the reads are aligned against each other in a process called multiple sequence alignment (MSA).
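The KMER-CNT kernel above can be sketched minimally in Python (illustrative; real k-mer counters use hashing and compact encodings to handle whole genomes):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count occurrences of every length-k substring (k-mer) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ACGTACGA", 3)
assert counts["ACG"] == 2          # appears at positions 0 and 4
assert sum(counts.values()) == 6   # 8 - 3 + 1 k-mers in total
```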

14 of 19

Speedups: 1 of 2


15 of 19

Speedups: 2 of 2


16 of 19

References – Scientific kernels Lit Survey

Objective: Build and share understanding of kernels to initiate discussions in OCP forum

  1. “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” Ang Li et al., SC17, Nov 2017

  2. “GenArchBench: A Genomics Benchmark Suite for Arm HPC Processors,” Lorien Lopez-Villellas et al., Future Generation Computer Systems, Mar 2024

  3. “Performance Characterization of the 64-core SG2042 RISC-V CPU for HPC,” Nick Brown et al., Jun 2024

NOTE: This is not a comprehensive lit search/survey


17 of 19

Performance characterization of 64-core SG2042 RISC-V CPU for HPC

  • RISC-V is beginning to gain significant traction in HPC
  • Sophon’s SG2042 is the first mass-produced, commodity-available, high-core-count RISC-V CPU designed for high-performance workloads
  • The paper uses NASA’s NAS Parallel Benchmark (NPB) suite to characterize the performance of the SG2042 against other CPUs implementing the RISC-V, x86-64, and AArch64 ISAs
  • The NPB suite is a collection of benchmarks developed by NASA’s Advanced Supercomputing (NAS) division to characterize Computational Fluid Dynamics (CFD) applications
  • Two models of execution: (1) OpenMP, using shared memory, and (2) MPI, using message passing

  • The SG2042 consistently outperforms all other RISC-V solutions, delivering between a 2.6x and 16.7x performance improvement at the single-core level
  • Compared against the x86-64 and AArch64 CPUs, the SG2042 performs comparatively well on compute-bound algorithms but loses relative performance on memory-bandwidth- or latency-bound algorithms
  • The authors identify the SG2042’s memory subsystem as its greatest bottleneck


18 of 19

NAS Parallel Benchmark (NPB): 5 Kernels & 3 Pseudo Applications


Kernel

Brief Description

Integer Sort

Tests indirect, random memory accesses, which stall a significant fraction of the CPU on cache accesses

Multi Grid

Heavily memory-bound, both in time stalled on cache and main-memory accesses and in the percentage of execution time during which DDR is under high utilization

Embarrassingly Parallel

Designed to test compute performance; far fewer cycles are stalled on memory accesses, and no time is spent at high DDR bandwidth utilization

Conjugate Gradient

Irregular memory accesses and nearest-neighbour communication, which result in around 37% of clock ticks stalled on cache or DDR accesses

Fast Fourier Transform

Requires all-to-all communication between ranks to undertake a parallel transposition of the data

Pseudo Applications: combine multiple kernels to provide more complicated workloads that compute finite-difference solutions to the 3D compressible Navier-Stokes equations

Block Tridiagonal

Based on a Beam-Warming approximation with Gaussian elimination; the resulting equations are block-tridiagonal; stalls the least on memory accesses

LU Gauss Seidel

The LU benchmark solves a block-lower/block-upper triangular approximation using the Gauss-Seidel iterative method

Scalar Pentadiagonal

Based on a Beam-Warming approximation with Gaussian elimination, but the resulting equations are fully diagonalized
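The Conjugate Gradient kernel listed above can be sketched minimally, with a dense matrix standing in for the sparse matrix-vector product that dominates each iteration (illustrative; not the NPB CG benchmark code):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Plain CG for a symmetric positive-definite A: each iteration costs
    one matrix-vector product (the SpMV in the real kernel) plus a few
    vector updates and dot products."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(2 * len(b)):
        Ap = A @ p                      # the (sparse) matrix-vector product
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:       # converged: residual is tiny
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(2)
M = rng.random((6, 6))
A = M @ M.T + 6 * np.eye(6)   # symmetric positive-definite test matrix
b = rng.random(6)
assert np.allclose(A @ conjugate_gradient(A, b), b, atol=1e-8)
```

With a sparse A, the irregular accesses inside the matrix-vector product are what produce the cache and DDR stalls the benchmark measures.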

19 of 19

Performance Characterization – 2 kernels


Integer Sort

Multi Grid

The AMD EPYC has 8 memory controllers and 8 memory channels, connected to DDR4-3200 memory

Skylake performs the best; it has the largest L2 cache, at 1 MB per core