1 of 18

Guarding Numerics Amidst Rising Heterogeneity

a presentation at correctness-workshop.github.io/2021

by members of the ComPort project

https://xstack-fp.github.io/ (evolving website)

Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li,

Pavel Panchekha, Cindy Rubio-Gonzalez, Zachary Tatlock

1,4: University of Utah, 2: LLNL, 3: PNNL, 5: University of California, Davis, 6: University of Washington

ganesh@cs.utah.edu, ilaguna@llnl.gov, ang.li@pnnl.gov, pavpan@cs.utah.edu, crubio@ucdavis.edu, ztatlock@cs.washington.edu

      • This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, ComPort: Rigorous Testing Methods to Safeguard Software Porting, under Award Numbers DE-SC0022252 (1,4), SCW 1743 (2), SCW 78284 (3), UCD# (5), UW# (6)

2 of 18

Rising Heterogeneity, Mixed-Precision

3 of 18

Rising Heterogeneity: Data Movement Reduction → Precision Reduction

4 of 18

Rising Heterogeneity

  • CPUs along with GPUs and custom accelerators in support of:
    • HPC and ML workloads
  • We focus on the consequences of GPU adoption
    • We first describe the broad spectrum of problems
      • Jointly compiled in our paper
    • Then the specifics of each problem
    • And how to solve them through community effort

5 of 18

GPUs: Moving/Evolving

serving different needs...

6 of 18

Challenges due to Increasing GPU/accelerator adoption

7 of 18

Need better formal error models.

Build trust in them outside of ML.

8 of 18

Need GPU Race Checkers; none available now. Dire need to develop.

9 of 18

Nobody trusts or knows how brittle the code will get, and where. Need efficient analysis tools.

10 of 18

HW Exceptions not reported by many GPUs (currently printfs). Develop analysis tools.
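To make the point about unreported exceptions concrete: IEEE arithmetic lets Inf, NaN, and subnormal results appear without any trap, so a checker has to inspect values explicitly. The sketch below is a CPU-side Python analogue, not GPU code; the helper names `classify` and `checked` are hypothetical and do not come from any tool named in this deck.

```python
import math
import sys

def classify(x: float) -> str:
    """Label a double as 'nan', 'inf', 'subnormal', or 'ok'."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x != 0.0 and abs(x) < sys.float_info.min:
        return "subnormal"
    return "ok"

def checked(op_name: str, result: float, log: list) -> float:
    """Record the operation name if its result is exceptional, then pass the value through."""
    kind = classify(result)
    if kind != "ok":
        log.append((op_name, kind))
    return result

log = []
big = checked("scale", 1e308 * 10.0, log)      # overflows to +inf without raising
diff = checked("cancel", big - big, log)       # inf - inf yields nan without raising
tiny = checked("shrink", 1e-308 / 1e10, log)   # underflows into the subnormal range
fine = checked("add", 1.0 + 2.0, log)          # ordinary result, not logged

for op_name, kind in log:
    print(f"{op_name}: {kind}")
```

Wrapping every operation by hand does not scale, which is why the slide asks for analysis tools that do this automatically.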

11 of 18

Closed-source compilers, Moving Optimization Targets (SFU). Open-Source, Better Specs.
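One way compiler choices can move results is whether `a*b + c` is evaluated with two roundings or contracted into a single-rounding fused multiply-add. The sketch below emulates the fused form with exact rational arithmetic, so it is an illustration rather than a hardware FMA; the function names `unfused` and `fused` and the chosen inputs are assumptions made for this example.

```python
from fractions import Fraction

def unfused(a: float, b: float, c: float) -> float:
    """a*b is rounded to double, then the add is rounded again (two roundings)."""
    return a * b + c

def fused(a: float, b: float, c: float) -> float:
    """Emulated FMA: exact a*b + c in rational arithmetic, rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Inputs chosen so the exact product is 1 - 2**-60, which rounds to 1.0 in double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(unfused(a, b, c))  # the 2**-60 term is lost before the add
print(fused(a, b, c))    # the 2**-60 term survives the single rounding
```

With these inputs the two evaluation strategies disagree in every significant digit, even though both are valid readings of the same source expression.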

12 of 18

Testing Objectives, Oracles, Fuzzing, Scalable Tracing. Open-Source Testing Tool Components to be Shared.
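To make the split between testing objective and oracle concrete, here is a minimal sketch: the objective is to maximize relative error, and the oracle is the same expression evaluated in 60-digit decimal arithmetic. It is an unguided random baseline, far simpler than the guided fuzzers this deck refers to; the expression, the input range, and all function names are assumptions chosen for illustration.

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 60  # oracle precision; 60 digits is an arbitrary choice

def under_test(x: float) -> float:
    """Cancellation-prone double-precision expression."""
    return math.sqrt(x + 1.0) - math.sqrt(x)

def oracle(x: float) -> Decimal:
    """Same expression evaluated in high-precision decimal arithmetic."""
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

def relative_error(x: float) -> float:
    """Testing objective: relative error of the double result against the oracle."""
    ref = oracle(x)
    return float(abs((Decimal(under_test(x)) - ref) / ref))

def random_search(trials: int, seed: int = 0):
    """Sample inputs log-uniformly in [1, 1e15] and keep the worst one found."""
    rng = random.Random(seed)
    worst_x, worst_err = None, -1.0
    for _ in range(trials):
        x = 10.0 ** rng.uniform(0.0, 15.0)
        err = relative_error(x)
        if err > worst_err:
            worst_x, worst_err = x, err
    return worst_x, worst_err

worst_x, worst_err = random_search(500)
print(f"worst input {worst_x:.6e}, relative error {worst_err:.3e}")
```

A guided fuzzer would replace `random_search` with a strategy that uses earlier errors to pick the next inputs; the objective and oracle could stay as shared components.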

14 of 18

Challenges and Solutions (summary slide)

  • FP Formats, Formal Standards, Error Models
    • FMA, SFU, Tensor Cores
  • Exceptions
    • Develop techniques to detect at runtime or pre-analyze and prove absence
  • Schedule-dependency
    • Races, Reduction Order Dependence
  • Compiler optimizations
    • Performance-portability layers can provide a point to inject parametric solutions
  • Mixed precision
    • Not just flashy results but robust engineering, runtime dynamic-range exhaustion detection
  • Testing and Fuzzing
    • Need to specify interfaces, better goal-directed fuzzers
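The reduction-order dependence listed under schedule-dependency can be shown in a few lines: floating-point addition is not associative, so the order in which a schedule combines partial sums changes the result. This is a minimal sequential sketch, assuming a hypothetical helper `reduce_in_order` and hand-picked values; a real parallel reduction would pick its order from the thread schedule.

```python
import math

def reduce_in_order(values, order):
    """Sum values following the given index order, rounding after every add."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

values = [1e16, 1.0, -1e16, 1.0]

left_to_right = reduce_in_order(values, [0, 1, 2, 3])  # the first 1.0 is absorbed by 1e16
regrouped = reduce_in_order(values, [0, 2, 1, 3])      # large terms cancel first
exact = math.fsum(values)                              # correctly rounded reference sum

print(left_to_right, regrouped, exact)  # 1.0 2.0 2.0
```

Both orders are legitimate schedules of the same reduction, yet one of them loses half of the true sum.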

15 of 18

Proprietary Nature of GPUs is a reality

  • Nvidia is dominant
    • Good and bad
  • AMD and Intel on the rise -- but very little experience
    • Documented uses in HPC and ML hard to come by

16 of 18

Tool Landscape (will refine with your help; other tools?)

| Tool | Approach | GPU support | Availability |
|---|---|---|---|
| FPSpy | OS-level instrumentation | No GPU | Available |
| FPChecker (Laguna, ASE'20) | LLVM instrumentation | GPU (initial) | Available |
| Verificarlo / Verrou | Monte Carlo Arithmetic | No GPU | Available |
| FPDebug, NSan (CC'21), FPSanitizer | Shadow Value | No GPU | Some available |
| Herbgrind | Valgrind instrumentation | No GPU | Install issues |
| Shaman (Nestor) | Modeling error (library based) | No GPU | Available |
| Ariadne | Exception triggering | No GPU | ? |
| FLiT | Optimization bisection | No GPU | Available |
| Blossom, S3FP, FPGen | Guided fuzzing | No GPU | Available |

17 of 18

Selected Numerical Issues, Solutions, Actionables

| Issue | Problems, Where Experienced | Status of Solutions | Most Promising Research Needs |
|---|---|---|---|
| New Number Systems, Exception | No common notions of error | Hype has overshot usage, tools | Fix IEEE issues first; Automate through translation; Invest in education |
| Precision tuning | Code can become very brittle at places | Tools to check for blown precision budgets are unavailable | Need precision pressure-relief valves; Avoid Precision Fragmentation; Invest more in data compression ("bulk tuning") |
| Scalable Error Analysis | Many codes have loops; No "one-size fits all" | Not all variables are alike (values, derivatives, FFT) | Domain-specific Error Definitions Appear Inevitable |
| Handling Compiler-Induced Variability | Made difficult by proprietary compilers | Very little progress; compilers don't know what a variable models | Insist on clear compiler specs; Optimize specific to problem semantics |
| Combined HPC and ML | Increasing in Uptake | Hardly any Tools to support SW Testing | Urgent creation of verification benchmarks; Get traction by pitching around Trustworthy AI |
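One possible reading of a "precision pressure-relief valve" is a store that detects when a value no longer fits the reduced format and falls back to a wider one. The sketch below uses Python's `struct` module, whose `"e"` and `"f"` formats are IEEE binary16 and binary32; the helper names `to_half` and `store_with_relief` and the fallback policy are assumptions for illustration, not a method proposed in this deck.

```python
import struct

def to_half(x: float) -> float:
    """Round a double to IEEE binary16 and back; struct raises OverflowError if out of range."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_single(x: float) -> float:
    """Round a double to IEEE binary32 and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def store_with_relief(x: float):
    """Try binary16 first; fall back to binary32 when the half-precision range is exhausted."""
    try:
        return "half", to_half(x)
    except OverflowError:
        return "single", to_single(x)

small = store_with_relief(1000.5)    # fits in half precision
large = store_with_relief(70000.0)   # exceeds the largest finite half value (65504)

print(small)
print(large)
```

The same check could instead log the overflow site, which is closer to the "blown precision budget" detection the table says is missing.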

18 of 18

Concluding Remarks

  • Community Action
    • FPBench.org
    • X-Stack Project
  • Challenge Benchmark Creation
  • Need tool standardization, avoid duplication of effort
    • Incentivize robust tool release, value real impact of tools on HPC codes
      • Change reward metrics!
  • Help from GPU vendors essential to stay abreast
    • Force hands during procurement -- not just for perf but also correctness tools!
    • Not just window-dressing but serious commitment
      • Best Practices to Mix or Change Precision
      • C++11 memory model adoption (amidst CUDA Atomics, older idioms)
      • Weaning users away from coding practices such as the "C volatile holy-water sprinkles"