Guarding Numerics Amidst Rising Heterogeneity
a presentation at correctness-workshop.github.io/2021
by members of the ComPort project
https://xstack-fp.github.io/ (evolving website)
Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li,
Pavel Panchekha, Cindy Rubio-Gonzalez, Zachary Tatlock
1,4: University of Utah, 2: LLNL, 3: PNNL, 5: University of California, Davis, 6: University of Washington
ganesh@cs.utah.edu, ilaguna@llnl.gov, ang.li@pnnl.gov, pavpan@cs.utah.edu, crubio@ucdavis.edu, ztatlock@cs.washington.edu
Rising Heterogeneity, Mixed-Precision
Rising Heterogeneity: Data Movement Reduction → Precision Reduction
Rising Heterogeneity
GPUs: Moving/Evolving
serving different needs...
Challenges due to Increasing GPU/accelerator adoption
Need better formal error models.
Build trust in them outside of ML.
Need GPU Race Checkers; none available now. Dire need to develop.
Nobody trusts or knows how brittle the code will get, and where. Need efficient analysis tools.
HW Exceptions not reported by many GPUs (currently printfs). Develop analysis tools.
Closed-source compilers, Moving Optimization Targets (SFU). Need Open-Source Compilers, Better Specs.
Testing Objectives, Oracles, Fuzzing, Scalable Tracing. Open-Source Testing Tool Components to be Shared.
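The point about unreported hardware exceptions can be made concrete in software. The following is a minimal Python sketch, not the method of any tool named in this talk; the helper name `classify_exceptional` and the sample buffer are hypothetical. It scans a result buffer for NaN, infinite, and subnormal values, the kind of after-the-fact check one falls back on when the hardware raises no trap.

```python
import math
import sys

def classify_exceptional(values):
    """Count NaN, infinite, and subnormal entries in an iterable of floats.

    Hypothetical helper for illustration only; real tools instrument the
    computation itself rather than scanning outputs afterwards.
    """
    counts = {"nan": 0, "inf": 0, "subnormal": 0}
    smallest_normal = sys.float_info.min  # smallest positive normal double
    for v in values:
        if math.isnan(v):
            counts["nan"] += 1
        elif math.isinf(v):
            counts["inf"] += 1
        elif v != 0.0 and abs(v) < smallest_normal:
            counts["subnormal"] += 1
    return counts

# Sample buffer standing in for data copied back from an accelerator.
inf = float("inf")
results = [1.0, inf - inf, 1e308 * 10.0, 5e-324, 0.0]
report = classify_exceptional(results)
print(report)
```

A scan like this only says that an exceptional value exists somewhere in the output. It does not say which operation produced it, and locating that operation is the job of an analysis tool.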
Challenges and Solutions (summary slide)
Proprietary Nature of GPUs is a reality
Tool Landscape (will refine with your help; other tools?)
Tool | Technique | GPU Support | Availability |
FPSpy | OS-level instrum. | No GPU | Available |
FPChecker (Laguna, ASE'20) | LLVM instrumentation | GPU (initial) | Available |
Verificarlo / Verrou | Monte Carlo Arithmetic | No GPU | Available |
FPDebug, NSan (CC'21), FPSanitizer | Shadow Value | No GPU | Some available |
Herbgrind | Valgrind instrum. | No GPU | Install issues |
Shaman (Nestor) | Modeling error (library based) | No GPU | Available |
Ariadne | Exception triggering | No GPU | ? |
FLiT | Optimization bisection | No GPU | Available |
Blossom, S3FP, FPGen | Guided fuzzing | No GPU | Available |
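As a rough illustration of the "Shadow Value" technique listed in the table, the sketch below pairs each double with an exact rational shadow and compares the two at the end. It is a toy under stated assumptions, not the implementation of FPDebug, NSan, or FPSanitizer: the class name `Shadowed` is hypothetical, only `+` and `*` are covered, and inputs are assumed to be finite.

```python
from fractions import Fraction

class Shadowed:
    """A float paired with an exact rational shadow of the same computation.

    Hypothetical illustration; assumes finite inputs.
    """

    def __init__(self, value, shadow=None):
        self.value = float(value)
        # Fraction(float) is exact, so the shadow starts equal to the input.
        self.shadow = Fraction(self.value) if shadow is None else shadow

    def __add__(self, other):
        return Shadowed(self.value + other.value, self.shadow + other.shadow)

    def __mul__(self, other):
        return Shadowed(self.value * other.value, self.shadow * other.shadow)

    def rel_error(self):
        """Relative error of the float result against the exact shadow."""
        if self.shadow == 0:
            return 0.0 if self.value == 0.0 else float("inf")
        diff = abs(Fraction(self.value) - self.shadow)
        return float(diff / abs(self.shadow))

# Cancellation example: the exact answer is 1, the double result is 0.
a, b, c = Shadowed(1e16), Shadowed(1.0), Shadowed(-1e16)
r = (a + b) + c
print(r.value, r.shadow, r.rel_error())
```

The design choice here is an exact shadow, which keeps the sketch short but grows without bound in real codes. Practical shadow-value tools typically use a wider floating-point format instead.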
Selected Numerical Issues, Solutions, Actionables
Issue | Problems, Where Experienced | Status of Solutions | Most Promising Research Needs |
New Number Systems, Exception | No common notions of error | Hype has overshot usage, tools | Fix IEEE issues first; Automate through translation; Invest in education |
Precision tuning | Code can become very brittle at places | Tools to check for blown precision budgets are unavailable | Need precision pressure-relief valves; Avoid Precision Fragmentation; Invest more in data compression ("bulk tuning") |
Scalable Error Analysis | Many codes have loops; No "one-size fits all" | Not all variables are alike (values, derivatives, FFT) | Domain-specific Error Definitions Appear Inevitable |
Handling Compiler-Induced Variability | Made difficult by proprietary compilers | Very little progress; compilers don't know what a variable models | Insist on clear compiler specs; Optimize specific to problem semantics |
Combined HPC and ML | Increasing in Uptake | Hardly any Tools to support SW Testing | Urgent creation of verification benchmarks; Get traction by pitching around Trustworthy AI |
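One common source of the compiler-induced variability listed in the table is reassociation: floating-point addition is not associative, so an optimizer or parallel runtime that changes the reduction order can change the answer. The Python sketch below mimics this by summing the same data in two orders. The function names and the data are hypothetical and chosen to make the effect visible; no particular compiler is being modeled.

```python
def sum_left_to_right(xs):
    """Serial reduction in source order."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_pairwise(xs):
    """Tree-shaped reduction, as a vectorized or parallel build might use."""
    if len(xs) == 0:
        return 0.0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])

# The exact sum of these four values is 2.0.
data = [1e16, 1.0, -1e16, 1.0]
serial = sum_left_to_right(data)
tree = sum_pairwise(data)
print(serial, tree)
```

Both orders are legitimate evaluations of the same source expression, yet they disagree with each other and with the exact sum. Whether the difference matters depends on what the variable models, which is information the compiler does not have.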
Concluding Remarks