Abstract
Potential Impact
Motivation
Geoffrey Fox1, Pete Beckman2, Shantenu Jha3, Piotr Luszczek4, Vikram Jadhao5 — 1University of Virginia, 2Argonne National Laboratory, 3Rutgers University, 4University of Tennessee, 5Indiana University
Selected Highlights
Collaboration Opportunities
Selected References
SBI will make it easier for general users to develop new surrogates and will help make their major performance increases pervasive across DOE computational science. Surrogates are openly available through MLCommons or the SBI repository.
Benefits: techniques and methodology for generating high-performance surrogates, plus examples for use in education and as starting points for new surrogates. We welcome additional surrogates from collaborators.
Surrogate Benchmark Initiative (SBI): FAIR Surrogate Benchmarks Supporting AI and Simulation Research
The five institutions and MLCommons accumulate generative and regressive simulation surrogates and make them available in repositories with FAIR access. We produce a taxonomy across domains and system architectures, with examples. We study performance and accuracy from the AI, system I/O, and communication perspectives, as well as the size and nature of the training set. We examine batching and compression approaches, along with the use of I/O parallelism and improved communication performance.
Easy access to state-of-the-art modern AI is very important, and surrogates are a transformational AI approach to simulations.
1. SBI Web page https://sbi-fair.github.io/ (has full list of publications)
2. E. A. Huerta, et al., FAIR for AI: An interdisciplinary and international community building perspective. Scientific Data, 10(1):487, 2023. URL: https://doi.org/10.1038/s41597-023-02298-6
ANL Scalable Communication Framework for Second-Order Optimizers
Challenges:
ASCR Award DE-SC0023452
ANL I/O Speedup
PyTorch
SOLAR-ANL
Background:
Kronecker-factored Approximate Curvature (K-FAC)
Step 1: Forward and backward computation; Allreduce the gradient ∇L(w).
Step 2: For all layers, compute the Kronecker factors; Allreduce factors A and G.
Step 3: For each assigned layer l, eigendecompose A_l and G_l and compute the preconditioned gradient H_l; Allgather H_l.
Step 4: Update the model weights using the preconditioned gradient H.
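The per-layer preconditioning of Step 3 can be sketched for a single layer in NumPy. This is a minimal illustration under standard K-FAC assumptions, not the SOLAR/ANL implementation; the function name and damping constant are ours:

```python
import numpy as np

def kfac_precondition(grad, A, G, damping=1e-3):
    """Illustrative K-FAC Step 3 for one layer: compute the preconditioned
    gradient H = G^{-1} grad A^{-1} via eigendecomposition of the
    symmetric Kronecker factors A (input covariance) and G (pre-activation
    gradient covariance), with a simple damping term."""
    lam_A, Q_A = np.linalg.eigh(A)   # A = Q_A diag(lam_A) Q_A^T
    lam_G, Q_G = np.linalg.eigh(G)   # G = Q_G diag(lam_G) Q_G^T
    # Rotate the gradient into the joint eigenbasis, divide elementwise by
    # the (damped) eigenvalue products, and rotate back.
    V = Q_G.T @ grad @ Q_A
    V = V / (np.outer(lam_G, lam_A) + damping)
    return Q_G @ V @ Q_A.T
```

In the distributed setting, each worker runs this for its assigned layers only, and the resulting H_l matrices are exchanged with an Allgather, which is the collective whose cost motivates the compression work below.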
Motivation:
Allgather takes up to 50%
Allreduce takes 10%
Existing compressors either significantly impact accuracy or have limited compression ratios.
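To illustrate the ratio/accuracy trade-off (this is a generic top-k sparsifier sketch, not the compressor developed in this work), a lossy compressor drops most entries before the collective, and the dropped mass is exactly where the accuracy impact comes from:

```python
import numpy as np

def topk_compress(tensor, ratio=0.01):
    # Keep only the k largest-magnitude entries (index, value pairs);
    # everything else is discarded, trading accuracy for compression ratio.
    flat = tensor.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    # Scatter the kept values back into a dense zero tensor.
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)
```

A higher `ratio` preserves accuracy but shrinks the compression benefit, which is the tension the improved compressor has to resolve.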
Rutgers: 6 Motifs, with mini-apps
DOE ASCR Computer Science Principal Investigators (PI) Meeting, Atlanta, GA February 5-7, 2024