We extend the use of GPU memories for data staging in coupled scientific workflows by implementing a “dual-channel dual staging” (DCDS) mode, in which data may be communicated between GPUs and staging servers by simultaneously utilizing both a direct GPU->remote channel and a GPU->CPU->remote proxied channel. We further implement adaptive load balancing across these channels to improve utilization in the face of system variation and different modes of congestion.
Significance and Impact
GPU-based computation has taken a prominent role in scientific computing workflows, and machine learning is expected to accelerate this trend. Efficient use of system resources is critical to the performance of memory-intensive applications, and using all available transfer mechanisms improves throughput. Using DCDS, we observed a 49% and 56% improvement in write and read time, respectively, over GPU-GPU transfer alone in a LULESH/ZFP coupled workflow.
Evaluated on Perlmutter, splitting data between both GPU and host channels (dual channel) improves read/write time by 5%/6% over GPU channels alone and 17%/36% over host channels alone. The addition of pre-staging data in host memory (dual channel dual staging) significantly increases these gains, with an improvement of read/write time of 60%/35% over GPU channels alone and 65%/56% over host channels alone. These results show the value that host memory and communication channels can bring to GPU workflows.
Technical Approach
Divide objects to be transferred between GPUs and staging servers into a “direct channel” slice and a “CPU-proxied channel” slice, according to a “slicing ratio”.
Adapt the slicing ratio across these channels to improve utilization in the face of system variation and different modes of congestion.
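The two steps above can be illustrated with a minimal sketch. It is not the DCDS implementation: the function names, the `smoothing` parameter, and the bandwidth-proportional update rule are all assumptions chosen for illustration. The idea behind the rule is that both slices finish at about the same time when each channel carries a share of the bytes proportional to its observed bandwidth.

```python
def split_by_ratio(nbytes, ratio):
    """Split an object of nbytes into (direct_bytes, proxied_bytes).

    ratio is the hypothetical "slicing ratio": the fraction of the object
    sent over the direct GPU->remote channel. The remainder goes through
    the GPU->CPU->remote proxied channel.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    direct = int(round(nbytes * ratio))
    return direct, nbytes - direct


def adapt_ratio(ratio, direct_bytes, direct_secs,
                proxied_bytes, proxied_secs, smoothing=0.5):
    """Nudge the slicing ratio toward the direct channel's bandwidth share.

    Assumed update rule (not taken from the source): compute the bandwidth
    observed on each channel during the last transfer, then blend the old
    ratio with the direct channel's share of the total bandwidth.
    """
    # Without a usable measurement on both channels, keep the old ratio.
    if min(direct_bytes, proxied_bytes) <= 0:
        return ratio
    if min(direct_secs, proxied_secs) <= 0:
        return ratio
    bw_direct = direct_bytes / direct_secs
    bw_proxied = proxied_bytes / proxied_secs
    target = bw_direct / (bw_direct + bw_proxied)
    return (1.0 - smoothing) * ratio + smoothing * target


if __name__ == "__main__":
    ratio = 0.5
    direct, proxied = split_by_ratio(1000, ratio)
    # Pretend the proxied channel was three times slower on this transfer.
    ratio = adapt_ratio(ratio, direct, 1.0, proxied, 3.0)
    print(direct, proxied, ratio)
```

In this sketch the `smoothing` factor damps the response to a single noisy measurement, so a transient burst of congestion on one channel shifts the ratio only partway toward the new target.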
PI(s)/Facility Lead(s): Philip Davis, Bo Zhang (University of Utah)
Collaborating Institutions: OLCF, Rutgers University
ASCR Program: SciDAC
ASCR PM: Kalyan Perumalla
Publication for this work: Zhang, et al. “Dual Channel Dual Staging: Hierarchical and Portable Staging for GPU-Based In-Situ Workflow,” (Under Submission)