Surrogate Benchmark Initiative (SBI): FAIR Surrogate Benchmarks Supporting AI and Simulation Research

Geoffrey Fox (1), Pete Beckman (2), Shantenu Jha (3), Piotr Luszczek (4), Vikram Jadhao (5)
(1) University of Virginia, (2) Argonne National Laboratory, (3) Rutgers University, (4) University of Tennessee, (5) Indiana University

ASCR Award DE-SC0023452

Abstract
The five institutions and MLCommons accumulate generative and regressive simulation surrogates and make them available in repositories with FAIR access. We produce a taxonomy across domains and system architectures, with examples. We study performance and accuracy from the AI, system I/O, and communication perspectives, as well as the size and nature of the training set. We examine batching and compression approaches, along with the use of I/O parallelism and improved communication performance.

Motivation

Easy access to state-of-the-art AI is very important, and surrogates are a transformational AI approach to simulation.

Potential Impact

SBI will make it easier for general users to develop new surrogates and will help make their major performance increases pervasive across DOE computational science. Surrogates are openly available through MLCommons or the SBI repository.

Collaboration Opportunities

Benefits: techniques and methodology for generating high-performance surrogates, and examples to use in education and as starting points for new surrogates (a minimal starting-point sketch appears below). We welcome additional surrogates from collaborators.
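As a concrete illustration of the kind of starting-point example the repository aims to provide, the sketch below trains a minimal regressive surrogate: a small network fitted to (input, output) pairs from a simulation, which then stands in for the simulation at inference time. The function expensive_simulation and all sizes are illustrative placeholders, not an SBI benchmark.

    # Minimal regressive surrogate: learn a cheap approximation of a costly solver.
    import torch
    import torch.nn as nn

    def expensive_simulation(x):
        # Stand-in for an expensive simulation code; here a smooth nonlinear map.
        return torch.sin(3 * x) + 0.5 * x ** 2

    # Training set: simulation runs sampled over the input domain.
    x_train = torch.linspace(-2, 2, 1024).unsqueeze(1)
    y_train = expensive_simulation(x_train)

    surrogate = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                              nn.Linear(64, 64), nn.Tanh(),
                              nn.Linear(64, 1))
    opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

    for epoch in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(surrogate(x_train), y_train)
        loss.backward()
        opt.step()

    # The trained network now answers queries far faster than the solver it
    # imitates; accuracy depends on the size and nature of the training set.
    y_fast = surrogate(torch.tensor([[0.5]]))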

Selected Highlights

ANL: Scalable Communication Framework for Second-Order Optimizers

Background: the Kronecker-factored Approximate Curvature (K-FAC) second-order optimizer. Each distributed training iteration proceeds in four steps:

Step 1: Forward and backward computation; Allreduce the gradient ∇L(w).

Step 2: For all layers, compute the Kronecker factors; Allreduce the factors A and G.

Step 3: For each assigned layer l, eigen-decompose A_l and G_l and compute the preconditioned gradient H_l; Allgather H.

Step 4: Update the model weights using the preconditioned gradient H.
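A minimal sketch of this four-step pattern in PyTorch follows, assuming torch.distributed is already initialized. The factor statistics are reduced to identity placeholders, a per-layer broadcast from the assigned worker stands in for the Allgather over H, and precondition is an illustrative helper, not ANL's implementation.

    import torch
    import torch.nn as nn
    import torch.distributed as dist

    def precondition(A, G, grad, damping=1e-3):
        # Standard K-FAC preconditioning: eigen-decompose the factors, rescale
        # the gradient in the Kronecker eigenbasis, and rotate back.
        dA, QA = torch.linalg.eigh(A)
        dG, QG = torch.linalg.eigh(G)
        v = QG.t() @ grad @ QA                                 # into eigenbasis
        v = v / (dG.unsqueeze(1) * dA.unsqueeze(0) + damping)  # rescale
        return QG @ v @ QA.t()                                 # back out

    def kfac_iteration(model, loss_fn, x, y, lr=0.1):
        world, rank = dist.get_world_size(), dist.get_rank()
        linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]

        # Step 1: local forward/backward, then Allreduce the gradient ∇L(w).
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()
        for p in model.parameters():
            dist.all_reduce(p.grad)
            p.grad /= world

        for i, layer in enumerate(linear_layers):
            n_out, n_in = layer.weight.shape

            # Step 2: Kronecker factors A (input statistics) and G (output-grad
            # statistics), Allreduced across workers; identity placeholders
            # stand in for the real batch statistics here.
            A, G = torch.eye(n_in), torch.eye(n_out)
            dist.all_reduce(A); A /= world
            dist.all_reduce(G); G /= world

            # Step 3: only the worker assigned to this layer pays for the eigen
            # decomposition; sharing its H (broadcast here) plays the role of
            # the Allgather over per-layer results.
            H = layer.weight.grad.clone()
            if rank == i % world:
                H = precondition(A, G, layer.weight.grad)
            dist.broadcast(H, src=i % world)

            # Step 4: update the weights with the preconditioned gradient H.
            with torch.no_grad():
                layer.weight -= lr * H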

 

Motivation: in distributed K-FAC training, the Allgather accounts for up to 50% of the time and the Allreduce for about 10%, so communication dominates and is the natural target for compression.

Challenges:

  • Preserving accuracy while achieving a high compression ratio (CR).
  • Achieving end-to-end performance improvement once system and compression/decompression overheads are taken into account.

Existing compressors either significantly impact accuracy or offer only a limited compression ratio; the sketch below illustrates the tradeoff.
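To make the accuracy-versus-CR tension concrete, here is a generic top-k sparsifier with error feedback. This is a textbook compressor, not the project's; topk_compress and topk_decompress are illustrative names. Keeping a fraction ratio of entries gives roughly 1/ratio compression, and everything dropped becomes error unless it is carried forward.

    import torch

    def topk_compress(t, ratio=0.01):
        # Keep the `ratio` fraction of entries largest in magnitude; sending
        # (values, indices) instead of t gives roughly 1/ratio compression.
        flat = t.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)
        return flat[indices], indices, t.shape

    def topk_decompress(values, indices, shape):
        # Receiver's view: a sparse reconstruction of the original tensor.
        flat = torch.zeros(shape.numel())
        flat[indices] = values
        return flat.reshape(shape)

    # Error feedback: accumulate what compression dropped and re-inject it,
    # which is what lets aggressive ratios preserve accuracy over many steps.
    H = torch.randn(512, 512)            # stand-in for a preconditioned gradient
    residual = torch.zeros_like(H)
    vals, idx, shp = topk_compress(H + residual, ratio=0.01)  # communicated
    H_hat = topk_decompress(vals, idx, shp)                   # reconstructed
    residual = (H + residual) - H_hat                         # carry the error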

Rutgers: six application motifs, each with mini-apps.

ANL: I/O speedup with SOLAR, an optimized data-loading framework for PyTorch-based surrogate training.
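For context, the baseline I/O-parallelism pattern that optimized loaders such as SOLAR improve on can be sketched with a stock PyTorch DataLoader. The dataset below is a synthetic placeholder, and none of this is SOLAR's own code.

    import torch
    from torch.utils.data import DataLoader, Dataset

    class SimulationFrames(Dataset):
        # Placeholder dataset; imagine each item being read from disk.
        def __len__(self):
            return 10_000
        def __getitem__(self, i):
            return torch.randn(64), torch.randn(1)   # (features, target)

    loader = DataLoader(
        SimulationFrames(),
        batch_size=256,
        shuffle=True,
        num_workers=4,        # parallel reader processes (I/O parallelism)
        pin_memory=True,      # page-locked buffers for faster host-to-GPU copies
        prefetch_factor=2,    # batches each worker keeps in flight
    )

    for x, y in loader:       # batches arrive while the previous step computes
        pass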

Selected References

1. SBI web page: https://sbi-fair.github.io/ (includes a full list of publications).

2. E. A. Huerta et al. FAIR for AI: an interdisciplinary and international community building perspective. Scientific Data, 10(1):487, 2023. https://doi.org/10.1038/s41597-023-02298-6

DOE ASCR Computer Science Principal Investigators (PI) Meeting, Atlanta, GA, February 5-7, 2024