Scientific Kernels in Computing & HPC – Lit Survey
June 2024
Raghu Shankar
OCP ODSA HPC & AI Modularity
References – Scientific kernels Lit Survey
Objective: Build and share understanding of kernels to initiate discussions in OCP forum
NOTE: This is not a comprehensive lit search/survey
Arithmetic Intensity Spectrum of 8 selected scientific kernels
Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017
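The spectrum orders kernels by arithmetic intensity (flops per byte of memory traffic). As a minimal sketch of why the kernels land where they do on that spectrum, the following back-of-the-envelope estimates (hypothetical sizes, idealized traffic models) contrast dense GEMM with sparse SpMV:

```python
# Arithmetic intensity = floating-point operations / bytes moved.
# Illustrative estimates for double-precision (8-byte) operands.

def gemm_intensity(n):
    """Dense n x n GEMM: 2*n^3 flops over 3 matrices of n^2 doubles,
    assuming ideal reuse (each matrix read/written once)."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 8
    return flops / bytes_moved

def spmv_intensity(nnz, n):
    """CSR SpMV: 2 flops per nonzero; each nonzero moves a value (8 B)
    plus a column index (4 B), plus vector and row-pointer traffic."""
    flops = 2 * nnz
    bytes_moved = nnz * (8 + 4) + n * (8 + 8 + 4)
    return flops / bytes_moved

print(gemm_intensity(1024))            # tens of flops/byte -> compute-bound
print(spmv_intensity(5 * 4096, 4096))  # well under 1 flop/byte -> memory-bound
```

This is why GEMM sits at the compute-bound end of the spectrum while the sparse kernels cluster at the memory-bound end.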
Scientific kernels characteristics
Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017
All kernels use double precision (DP). The thread counts shown are the optimal thread numbers on Broadwell and KNL
Scientific Kernels – Details – Ordered by Arithmetic Intensity
General Matrix-Matrix Multiplication (GEMM)
Cholesky Decomposition
Stencil
Fast Fourier Transform (FFT)
Sparse Matrix-Vector Multiplication (SpMV)
Sparse Matrix Transposition (SpTRANS)
Sparse Triangular Solve (SpTRSV)
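To make the low-intensity end of the list concrete, here is a minimal CSR sparse matrix-vector multiply in pure Python (hypothetical toy data). The indirect access into `x` through `col_idx` is the irregular memory pattern that makes SpMV memory-bound:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A in Compressed Sparse Row form:
    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]   # indirect, data-dependent access
        y[i] = acc
    return y

# 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 3.0, 2.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```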
Package Repositories built on kernels
Kernel | Package | Dataset | Output | Repository |
GEMM | PLASMA (Parallel Linear Algebra Software for Multicore Architectures) | Dense matrices | Dataset stats, elapsed exec time, GFLOPS throughput | |
Cholesky | PLASMA | Dense matrices | Same | Same |
SpMV | Benchmark SpMV using CSR5 | 968 matrices | Same | GitHub TBD |
SpTRANS | ScanTrans, MergeTrans | 968 matrices | Same | |
SpTRSV | SpMP: Sparse Matrix Pre-processing library | 968 matrices | Same | |
Stencil | YASK (Yet Another Stencil Kernel) | 3D grid of given size | Same | |
FFT | FFTW v3.3.5 | 3D grid of given size | Same | |
Stream | STREAM | Array of given size | | |
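The "Output" column above (elapsed execution time and GFLOPS throughput) can be reproduced with a simple harness. A sketch using NumPy's double-precision GEMM; sizes and repetition count are arbitrary choices, and the reported numbers are machine-dependent:

```python
import time
import numpy as np

def gemm_gflops(n=1024, reps=3):
    """Time C = A @ B for double-precision n x n matrices and report
    throughput, counting 2*n^3 flops per multiply."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b                                   # warm-up (BLAS init, caches)
    t0 = time.perf_counter()
    for _ in range(reps):
        c = a @ b
    elapsed = (time.perf_counter() - t0) / reps
    return elapsed, 2 * n**3 / elapsed / 1e9

elapsed, gflops = gemm_gflops()
print(f"elapsed {elapsed:.4f} s, {gflops:.1f} GFLOPS")
```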
Platform Configuration – On-package memory at 236 Gflops/sec
Note: the performance and bandwidth listed are theoretical values calculated from the spec sheets; actual numbers can be lower
Source: Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, Ang Li, et al., SC17, Nov 2017
Intel Haswell and Broadwell processor with eDRAM
eDRAM Analysis for reference: Performance Gaps and Speedup ~2X
The throughput performance gap (Gflops/s) limits the achievable speedup
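For a memory-bound kernel, the roofline model predicts that the speedup from faster on-package memory is bounded by the bandwidth ratio, not the peak-Gflops gap. A sketch with the paper's 236 Gflops/s peak and otherwise hypothetical bandwidth and intensity values chosen only to illustrate a ~2x bound:

```python
def roofline_perf(peak_gflops, bw_gbs, intensity):
    """Attainable performance = min(peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * intensity)

peak = 236.0                     # Gflops/s, from the platform slide
ddr_bw, edram_bw = 50.0, 100.0   # GB/s, hypothetical spec-sheet values
ai = 0.25                        # flops/byte, e.g. an SpMV-like kernel

speedup = roofline_perf(peak, edram_bw, ai) / roofline_perf(peak, ddr_bw, ai)
print(f"bandwidth-bound speedup: {speedup:.2f}x")  # 2.00x
```

With these numbers both configurations stay far below the compute roof, so the speedup equals the bandwidth ratio; a compute-bound kernel (high intensity) would see no speedup at all.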
GenArchBench: genomics benchmark suite – Summary
Workflow diagram of common genome analysis pipelines: (1) sequencing, (2) basecalling, (3a) genome resequencing or (3b) genome assembly. The figure shows the different computational kernels used within each stage or tool
GenArchBench: 13 multi-threaded CPU kernels from genomic tools: 1 of 2
2. Basecalling | Basecalling | Adaptive Banded Signal to Event Alignment (ABEA) | Redesigned version of the Suzuki-Kasahara dynamic programming algorithm, used to compare raw nanopore signals produced by ONT sequencing machines to a reference genome sequence. |
Neural Network-based Base Calling (NN-BASE) | ONT sequencing machines monitor changes in an electrical current as single strands of DNA or RNA pass through a protein nanopore. These changes in the electrical current are then converted to a sequence of nucleotide bases in the basecalling process | ||
3a. Genome Resequencing | Seed | FM-Index Search (FMI) | Compressed sub-string index based on the Burrows-Wheeler transform. Given a sub-string "s", the FM-index finds the locations of "s" in the reference genome in O(|s|) time, where |s| is the length of the sub-string |
Chain | Seed Chaining (CHAIN) | Given a set of subsequences (seeds) from a DNA sequence (read), chaining consists of linking overlapping seeds to form larger ones (uses heuristics) | |
SIMD Seed Chaining (FAST-CHAIN) | An x86-vectorized version of CHAIN that removes the heuristics to exploit SIMD computation | ||
Extend | Bit-Parallel Myers (BPM) | Dynamic programming algorithm that finds all locations where a query string of size m matches a reference string of size n with k or fewer differences | |
Banded Smith-Waterman (BSW) | Dynamic programming algorithm that computes the local sequence alignment of two sequences of lengths m and n in O(mn) time and space | |
Wavefront Alignment (WFA) | Pairwise alignment algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process. Whereas traditional dynamic programming algorithms run in quadratic time, WFA's time complexity is O(ns), proportional to the read length n and the alignment score s, using O(s²) memory |
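The Extend kernels above are all variants of the same banded dynamic-programming recurrence. As a plain (non-bit-parallel, non-vectorized) sketch of the computation that BPM and BSW accelerate, here is a banded edit distance; the sequences and band width are arbitrary examples:

```python
def banded_edit_distance(query, ref, band):
    """Levenshtein distance restricted to cells with |i - j| <= band;
    returns None if the true distance exceeds the band."""
    m, n = len(query), len(ref)
    INF = float("inf")
    prev = [j if j <= band else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i if i <= band else INF] + [INF] * n
        lo, hi = max(1, i - band), min(n, i + band)
        for j in range(lo, hi + 1):
            cost = 0 if query[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match / mismatch
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1)      # insertion
        prev = cur
    return prev[n] if prev[n] <= band else None

print(banded_edit_distance("GATTACA", "GACTATA", band=2))  # 2
```

Restricting the DP to a diagonal band reduces the work from O(mn) to O(band * m), which is why banded variants dominate the extend stage of read mappers.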
GenArchBench: 13 multi-threaded CPU kernels from genomic tools: 2 of 2
Resequencing | Variant Calling | Neural Network-based Variant Calling (NN-VARIANT) | The process of detecting the differences (variants or mutations) between the aligned reads and the reference genome. This is a costly process .. |
Pileup Counting (PILEUP) | Starting with alignment data of a set of reads aligned to a region of a reference genome (usually a SAM or BAM file), pileup counting is the process of summarizing the base-pair information at each chromosomal position | |
3b. Genome Assembly | | K-mer Counting (KMER-CNT) | Aims to count the number of occurrences of each k-mer in an input sequence |
| De Bruijn Graph Construction (DBG) | A de Bruijn graph of an input set of reads is used to represent the overlaps between the sub-strings of length k (k-mers) found in the input | |
Multiple Seq Alignment | Partial-Order Alignment (POA) | Construction of an overlap graph from a set of reads leads to an approximate representation of the original sample's genome. To determine the consensus genome of the sample, all the reads are aligned against each other in a process called multiple sequence alignment (MSA) |
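Of the assembly kernels above, KMER-CNT is the simplest to state: it reduces to building a histogram of fixed-length substrings. A minimal sketch (toy sequence; real counters also canonicalize reverse complements, which is omitted here):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count occurrences of every length-k substring (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("GATTACAGATTACA", k=3)
print(counts["GAT"])  # 2
print(counts["TAC"])  # 2
```

At genome scale this hash-table histogram becomes memory-bandwidth- and capacity-bound, which is what makes it an interesting architectural benchmark.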
Speedups: 1 of 2
Speedups: 2 of 2
Performance characterization of 64-core SG2042 RISC-V CPU for HPC
NAS Parallel Benchmark (NPB): 5 Kernels & 3 Pseudo Applications
Kernel | Brief Description |
Integer Sort | Tests indirect, random memory accesses, which stall a significant fraction of CPU cycles on cache accesses |
Multi Grid | Heavily memory bound, both in time stalled on cache and main-memory accesses and in the percentage of execution time during which DDR is under high utilization |
Embarrassingly Parallel | Designed to test compute performance; far fewer cycles are stalled on memory accesses, and no time is spent with high DDR bandwidth utilization |
Conjugate Gradient | Irregular memory access and nearest-neighbour communication, which result in around 37% of clock ticks stalled on cache or DDR accesses |
Fast Fourier Transform | Requires all-to-all communication between ranks to undertake a parallel transposition of the data |
Pseudo Applications: combine multiple kernels to provide more complicated workloads – each computes a finite-difference solution to the 3D compressible Navier-Stokes equations | |
Block Tridiagonal | Based on a Beam-Warming approximation with Gaussian elimination; the resulting equations are block-tridiagonal. Stalls the least on memory access |
LU Gauss-Seidel | Solves via a block-lower/block-upper triangular approximation based on the Gauss-Seidel iterative method |
Scalar Pentadiagonal | Beam-Warming approximation with Gaussian elimination, but the resulting equations are fully diagonalized |
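The Conjugate Gradient kernel's irregular access pattern comes from the sparse matvec at the heart of each iteration. A minimal CG sketch (NumPy, dense matrix for brevity, hypothetical SPD system) showing the iteration structure NPB CG benchmarks:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual (this matvec is the SpMV in NPB CG)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD example: A = M^T M + 50 I is symmetric positive definite
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True
```

In NPB the matrix is sparse and distributed, so each `A @ p` also implies the nearest-neighbour communication noted in the table.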
Performance Characterization – 2 kernels
Integer Sort
Multi Grid
AMD EPYC has 8 memory controllers and 8 memory channels, connected to DDR4-3200 memory
Skylake performs the best; it has the largest L2 cache, at 1 MB per core