We optimized the collective communication performance of MFDn (many-fermion dynamics for nucleons), a nuclear quantum many-body calculation code, by using two-sided point-to-point and one-sided MPI in place of MPI collectives. On 30 Perlmutter GPU nodes, we achieved a 7x speedup over the original collective implementation.
Significance and Impact
The optimized communication significantly improves the overall scalability of MFDn on DOE leadership-class high-performance computers such as Perlmutter at NERSC, and enables physicists to study a variety of properties of light nuclei with high fidelity.
Figure: Time cost of 100 Lanczos iterations using MPI collectives, MPI point-to-point, and one-sided MPI. The point-to-point version achieves a 7x speedup using 30 Perlmutter GPU nodes. The one-sided MPI time is slightly higher than point-to-point because each communication epoch is enclosed in a pair of MPI_Win_fence calls, which introduces extra barrier time (see the sketch below).
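To make the fence overhead concrete, the following minimal sketch (illustrative only, not MFDn source code; the buffer sizes, partner rank, and window setup are assumptions) writes the same pairwise exchange both ways. The fenced one-sided epoch synchronizes every rank attached to the window, while the two-sided version synchronizes only the communicating pair.

#include <mpi.h>

/* One-sided version: every epoch is bracketed by two MPI_Win_fence calls,
 * which synchronize all ranks attached to the window (the extra barrier
 * cost noted in the caption). 'win' is assumed to expose the receive storage. */
void exchange_one_sided(double *sendbuf, int n, int partner, MPI_Win win)
{
    MPI_Win_fence(0, win);                          /* open epoch  */
    MPI_Put(sendbuf, n, MPI_DOUBLE, partner,
            0, n, MPI_DOUBLE, win);                 /* write into partner's window */
    MPI_Win_fence(0, win);                          /* close epoch */
}

/* Two-sided version: only the communicating pair synchronizes. */
void exchange_point_to_point(double *recvbuf, double *sendbuf, int n,
                             int partner, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}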
PI(s)/Facility Lead(s): Lenny Oliker (LBL)
Collaborating Institutions: Iowa State University
ASCR Program: SciDAC RAPIDS2, FASTMath
ASCR PM: Kalyan Perumalla (SciDAC RAPIDS2), Steve Lee (FASTMath)
Technical Approach
We use point-to-point communication and one-sided MPI to accelerate MFDn at scale (a minimal sketch of such an exchange follows this list).
We reduce the load imbalance among processes.
We explore the potential of different communication paradigms.
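As a hedged illustration of the first point (an assumed communication pattern, not the actual MFDn kernels; the partner list, counts, and displacements are placeholder parameters), the sketch below replaces an all-to-all style collective with nonblocking point-to-point exchanges, so that each rank communicates only with the ranks it actually shares data with.

#include <mpi.h>

#define MAX_PARTNERS 64   /* assumption for brevity */

/* Each rank posts receives and sends only for its communication partners,
 * instead of calling a blocking collective over the whole communicator. */
void gather_needed_blocks(double *recv, const int *recv_counts, const int *recv_displs,
                          double *send, int send_count,
                          const int *partners, int npartners, MPI_Comm comm)
{
    MPI_Request reqs[2 * MAX_PARTNERS];
    int r = 0;

    for (int i = 0; i < npartners; ++i)         /* post all receives first */
        MPI_Irecv(recv + recv_displs[i], recv_counts[i], MPI_DOUBLE,
                  partners[i], 0, comm, &reqs[r++]);

    for (int i = 0; i < npartners; ++i)         /* then the matching sends */
        MPI_Isend(send, send_count, MPI_DOUBLE,
                  partners[i], 0, comm, &reqs[r++]);

    MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);  /* other work could overlap before this */
}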