MPI HYBRID & ACCELERATOR WORKING GROUP DISCUSSION
MPI Forum Meeting
May 23, 2022
HYBRID & ACCELERATOR WORKING GROUP
Mission: Improve interoperability of MPI with other programming models
Active topics:
More information: https://github.com/mpiwg-hybrid/hybrid-issues/wiki
COMPLETION CONTINUATIONS
Treat the completion of an MPI operation as a continuation of some activity
“Callback-based Completion Notification Using MPI Continuations,” Joseph Schuchart, Christoph Niethammer, José Gracia, George Bosilca. Parallel Computing, 2021.
“MPI Detach - Asynchronous Local Completion,” Joachim Protze, Marc-André Hermanns, Ali Demiralp, Matthias S. Müller, Torsten Kuhlen. EuroMPI '20.
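The shape of the proposed interface, roughly as in the Schuchart et al. paper (the MPIX_Continue* names and signatures below are taken from that proposal, abbreviated and possibly out of date; they are not part of the MPI standard):

```c
/* Sketch of the MPI Continuations proposal; MPIX_* names come from the
 * paper above, not the MPI standard, and signatures are abbreviated. */

/* Callback invoked once the attached operation completes. */
void on_complete(MPI_Status *statuses, void *user_data) {
    handle_message(user_data);  /* hypothetical application function */
}

MPI_Request cont_req, op_req;
MPIX_Continue_init(..., &cont_req);   /* create a continuation request */
MPI_Irecv(buf, count, MPI_BYTE, src, tag, comm, &op_req);
/* Attach the continuation: completion of op_req triggers on_complete. */
MPIX_Continue(&op_req, on_complete, msg, ..., cont_req);
/* Continuations run as MPI makes progress, e.g. while testing cont_req. */
```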
CLARIFICATION OF THREAD ORDERING
Camp A: MPI must respect an order across threads
Camp B: Operations from different threads are unordered
MPI-3.1 Section 3.5: If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.
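As a concrete illustration (a sketch using standard MPI under MPI_THREAD_MULTIPLE; initialization, thread creation, and error handling elided): if two threads on rank 0 each send one message, rank 1 may match them in either order.

```c
/* Rank 0: two threads issue logically concurrent sends. */
void *worker(void *arg) {
    int val = *(int *)arg;              /* 1 in one thread, 2 in the other */
    MPI_Send(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    return NULL;
}

/* Rank 1: two receives with the same (source, tag) signature. */
int a, b;
MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* (a, b) may be (1, 2) or (2, 1): MPI defines no order across threads. */
```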
ACCELERATOR INFO
“mpi_memory_kind” (string, default: implementation defined)
Is the MPI library CUDA/HIP/Level Zero/etc.-aware?
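A hypothetical query under this proposal (whether the key would live on MPI_INFO_ENV, a communicator, or elsewhere is still under discussion, and the value format shown is illustrative):

```c
char kind[MPI_MAX_INFO_VAL];
int flag;
MPI_Info info;

MPI_Comm_get_info(MPI_COMM_WORLD, &info);
MPI_Info_get(info, "mpi_memory_kind", MPI_MAX_INFO_VAL, kind, &flag);
if (flag && strstr(kind, "cuda")) {
    /* library reports CUDA awareness: device pointers are safe to pass */
}
MPI_Info_free(&info);
```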
ACCELERATOR INFO
“mpi_assert_memory_kind” (string, default: none)
Goal: avoid pointer attribute checking overhead
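The assert key goes the other way: the application promises which memory kinds it will pass, so the library can skip per-buffer pointer attribute queries. A hypothetical usage (the value "system" is illustrative, not from the proposal):

```c
MPI_Info info;
MPI_Info_create(&info);
/* Promise: buffers on this communicator are host memory only, so the
 * library can skip cuPointerGetAttribute-style checks on every call. */
MPI_Info_set(info, "mpi_assert_memory_kind", "system");
MPI_Comm_set_info(MPI_COMM_WORLD, info);
MPI_Info_free(&info);
```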
MPI DATATYPE INFO
int MPI_Type_commit_with_info(MPI_Datatype *datatype, MPI_Info info);
int MPI_Type_get_info(MPI_Datatype datatype, MPI_Info *info_used);
New Functions
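A sketch of how the proposed functions might be used together, with a hypothetical hint (reusing "mpi_memory_kind" as a datatype-level key is an assumption for illustration, not part of the proposal):

```c
MPI_Datatype column;
MPI_Info info, info_used;

MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
MPI_Info_create(&info);
/* Hypothetical hint: buffers of this type live in GPU memory, so the
 * library could pre-build a device-side pack/unpack path at commit time. */
MPI_Info_set(info, "mpi_memory_kind", "cuda");
MPI_Type_commit_with_info(&column, info);

MPI_Type_get_info(column, &info_used);  /* inspect hints the library applied */

MPI_Info_free(&info);
MPI_Info_free(&info_used);
MPI_Type_free(&column);
```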
STREAM TRIGGERED NEIGHBOR EXCHANGE
Simple Ring Exchange Using a CUDA Stream
MPI_Request send_req, recv_req;
MPI_Status sstatus, rstatus;

for (i = 0; i < NITER; i++) {
    if (i > 0) {
        MPI_Wait_enqueue(&recv_req, &rstatus, MPI_CUDA_STREAM, stream);
        MPI_Wait_enqueue(&send_req, &sstatus, MPI_CUDA_STREAM, stream);
    }
    kernel<<<..., stream>>>(send_buf, recv_buf, …);
    if (i < NITER - 1) {
        MPI_Irecv_enqueue(recv_buf, …, &recv_req, MPI_CUDA_STREAM, stream);
        MPI_Isend_enqueue(send_buf, …, &send_req, MPI_CUDA_STREAM, stream);
    }
}
cudaStreamSynchronize(stream);
[Diagram: operations enqueued on the CUDA stream: kernel → Isend → Irecv → Wait → Wait → kernel → Isend → Irecv → …]
ACCELERATOR BINDINGS FOR MPI PARTITIONED APIS
CUDA and SYCL Language Bindings Under Exploration
Keep host only:
int MPI_Psend_init(const void *buf, int partitions, MPI_Count count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Info info, MPI_Request *request)
int MPI_Precv_init(void *buf, int partitions, MPI_Count count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Info info, MPI_Request *request)
int MPI_Start[all] / MPI_Wait[all](...)

Add device bindings:
__device__ int MPI_Pready(int partition, MPI_Request request)
__device__ int MPI_Pready_range(int partition_low, int partition_high, MPI_Request request)
__device__ int MPI_Pready_list(int length, const int array_of_partitions[], MPI_Request request)
__device__ int MPI_Parrived(MPI_Request request, int partition, int *flag)
KERNEL TRIGGERED COMMUNICATION USAGE
Device Code
__device__ void MPI_Pready(int idx, MPI_Request req);

__global__ void kernel(..., MPI_Request *req) {
    int i = my_partition(...);
    // Compute and fill partition i,
    // then mark it as ready
    MPI_Pready(i, req[0]);
}
Partitioned Neighbor Exchange
Host Code
MPI_Request req[2];
MPI_Psend_init(..., &req[0]);
MPI_Precv_init(..., &req[1]);
while (...) {
    MPI_Startall(2, req);
    MPI_Pbuf_prepare_all(2, req);
    kernel<<<..., s>>>(..., req);
    cudaStreamSynchronize(s);
    MPI_Waitall(2, req);
}
MPI_Request_free(&req[0]);
MPI_Request_free(&req[1]);
KERNEL & STREAM TRIGGERED COMMUNICATION USAGE
Device Code
__device__ void MPI_Pready(int idx, MPI_Request req);

__global__ void kernel(..., MPI_Request *req) {
    int i = my_partition(...);
    // Compute and fill partition i,
    // then mark it as ready
    MPI_Pready(i, req[0]);
}
Partitioned Neighbor Exchange
Host Code
MPI_Request req[2];
MPI_Psend_init(..., &req[0]);
MPI_Precv_init(..., &req[1]);
while (...) {
    MPI_Startall_enqueue(2, req, …, s);
    MPI_Pbuf_prepare_all_enqueue(2, req, …, s);
    kernel<<<..., s>>>(..., req);
    cudaStreamSynchronize(s);
    MPI_Waitall_enqueue(2, req, …, s);
}
MPI_Request_free(&req[0]);
MPI_Request_free(&req[1]);
Moving control ops to stream eliminates stream synchronization overhead
KERNEL & STREAM TRIGGERED COMMUNICATION USAGE
(Device and host code identical to the previous slide.)
Allows the sender to wait until the receiver is ready, so MPI_Pready becomes an RDMA write
Thank you!
Wednesdays 10-11am US Eastern Time
https://github.com/mpiwg-hybrid/hybrid-issues/wiki