Portable Staging for GPU-Based In-Situ Workflows

Scientific Achievement

We extend GPU-memory data staging for coupled scientific workflows with a “dual channel dual staging” (DCDS) mode, in which data moves between GPUs and staging servers over two channels at once: a direct GPU->remote channel and a proxied GPU->CPU->remote channel. We further implement adaptive load balancing across these channels to improve utilization under system variation and different modes of congestion.
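As a rough illustration of simultaneous use of both channels, the sketch below sends one slice of a buffer over each channel concurrently. It is not the DataSpaces implementation: `send_dual_channel`, `send_direct`, and `send_proxied` are hypothetical names, and the two callables stand in for whatever direct and CPU-proxied transports a real system provides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def send_dual_channel(
    payload: bytes,
    direct_bytes: int,
    send_direct: Callable[[bytes], int],   # hypothetical direct GPU->remote transport
    send_proxied: Callable[[bytes], int],  # hypothetical GPU->CPU->remote transport
) -> Tuple[int, int]:
    """Send the first `direct_bytes` over one channel and the rest over the other."""
    direct_part = payload[:direct_bytes]
    proxied_part = payload[direct_bytes:]
    # Two workers so both transfers are in flight at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        direct_future = pool.submit(send_direct, direct_part)
        proxied_future = pool.submit(send_proxied, proxied_part)
        # result() blocks until each transfer finishes and re-raises any error.
        return direct_future.result(), proxied_future.result()
```

The transfer completes only when the slower of the two slices has arrived, which is why the split between the channels matters.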

Significance and Impact

GPU-based computation has taken a prominent role in scientific computing workflows, and machine learning is expected to accelerate this trend. Efficient use of system resources is critical to the performance of memory-intensive applications, and using all available transfer mechanisms improves throughput. Using DCDS, we observed improvements of 49% in write time and 56% in read time over GPU-GPU transfer alone in a LULESH/ZFP coupled workflow.

On Perlmutter, splitting data across both GPU and host channels (dual channel) improves read/write time by 5%/6% over GPU channels alone and 17%/36% over host channels alone. Adding pre-staging of data in host memory (dual channel dual staging) significantly increases these gains, improving read/write time by 60%/35% over GPU channels alone and 65%/56% over host channels alone. These results show the value that host memory and host communication channels can bring to GPU workflows.

Technical Approach

Divide each object transferred between GPUs and staging servers into a “direct channel” slice and a “CPU-proxied channel” slice, according to a “slicing ratio.”
Adapt the slicing ratio at run time to improve utilization of both channels under system variation and different modes of congestion.
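The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the paper: the class name `SliceRatio`, the `smoothing` factor, and the bandwidth-proportional update rule are all hypothetical. The rule rests on a simple observation: if the direct channel delivers bandwidth B_d and the proxied channel B_p, both slices finish at the same time when the direct share is B_d / (B_d + B_p).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SliceRatio:
    """Fraction of each object sent over the direct channel (hypothetical helper)."""

    direct: float = 0.5     # assumed starting point: an even split
    smoothing: float = 0.5  # assumed damping factor to avoid overreacting to noise

    def split(self, nbytes: int) -> Tuple[int, int]:
        """Return (direct_bytes, proxied_bytes) for an object of nbytes."""
        direct_bytes = round(nbytes * self.direct)
        return direct_bytes, nbytes - direct_bytes

    def update(self, direct_bw: float, proxied_bw: float) -> None:
        """Move the ratio toward the share that equalizes finish times."""
        total = direct_bw + proxied_bw
        if total <= 0:
            return  # no usable measurement; keep the current ratio
        target = direct_bw / total
        self.direct += self.smoothing * (target - self.direct)


ratio = SliceRatio()
first_split = ratio.split(1000)
# Suppose the last transfer measured the direct channel at 3x the proxied bandwidth.
ratio.update(direct_bw=30.0, proxied_bw=10.0)
second_split = ratio.split(1000)
```

In this sketch the ratio moves only part of the way toward the measured target on each update, so a single congested transfer does not swing the split too far.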

PI(s)/Facility Lead(s): Philip Davis, Bo Zhang (University of Utah)

Collaborating Institutions: OLCF, Rutgers University

ASCR Program: SciDAC

ASCR PM: Kalyan Perumalla

Publication for this work: Zhang, et al. “Dual Channel Dual Staging: Hierarchical and Portable Staging for GPU-Based In-Situ Workflow,” (Under Submission)

Code Developed: Dual Channel/Dual Staging DataSpaces extensions

LOCAL LAB POC:

Philip Davis, University of Utah

TALKING POINTS:

It is becoming more common for data to be produced on the GPU of one compute node and consumed on the GPU of a different compute node.
This pattern will become even more common, as ML workflows often produce and consume data on the GPU.
Until recently, data exchange required copying data off the GPU to the CPU for transfer.
Previous work in this project created streamlined communication channels for exchanging data directly between GPUs over the network.
In this work, we found that additional performance could be gained by using the new channels while simultaneously buffering data on the CPU, and by performing adaptive load balancing to improve utilization and reduce read/write time.

Name of the associated awarded project: SciDAC RAPIDS Institute

PI name(s): The RAPIDS lead PIs are Rob Ross and Lenny Oliker. The RAPIDS investigators involved in this work are Philip Davis and Bo Zhang (Utah).

Non-RAPIDS collaborators are Keita Teranishi (ORNL), Zhao Zhang (Rutgers), and Manish Parashar (Utah).

Name of the program manager: SciDAC is managed by Kalyan Perumalla. Supporting technology is managed by ASCR research (Margaret Lentz and Hal Finkel).

CITATIONS:

Zhang, et al. “Dual Channel Dual Staging: Hierarchical and Portable Staging for GPU-Based In-Situ Workflow,” (Under Submission)

Zhang, et al. "Optimizing Data Movement for GPU-Based In-Situ Workflow Using GPUDirect RDMA." European Conference on Parallel Processing. Cham: Springer Nature Switzerland, 2023.