
MPI HYBRID & ACCELERATOR WORKING GROUP DISCUSSION

MPI Forum Meeting

May 23, 2022


HYBRID & ACCELERATOR WORKING GROUP

Mission: Improve interoperability of MPI with other programming models

Active topics:

  • Continuations proposal [Joseph Schuchart, UTK] #6
  • Clarification of thread ordering rules [Daniel Holmes, Intel] #117
  • Integration with accelerator programming models:
    • Accelerator info keys [Jim Dinan, NVIDIA] #3
    • Stream/Graph Based MPI Operations [Jim Dinan, NVIDIA] #5
    • Accelerator bindings for partitioned communication [Jim Dinan, NVIDIA + Maria Garzaran, Intel] #4
    • Partitioned communication buffer preparation [Ryan Grant, Queen’s U.] #264

More information: https://github.com/mpiwg-hybrid/hybrid-issues/wiki


COMPLETION CONTINUATIONS

  • Interoperability with asynchronous and multithreaded programming models
  • Register callbacks that continue the activity upon completion of an MPI operation

Treat the completion of an MPI operation as the continuation of some activity (sketched below)
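A rough sketch of how this could look, loosely following the API in the Schuchart et al. paper cited below; the continuation calls, their signatures, and the progress model are draft-proposal material (issue #6) and may change:

    /* Sketch only: MPI_Continue_init/MPI_Continue are proposed calls and may
     * differ from the final interface. task_release() is a hypothetical hook
     * into the application's threading runtime. */
    void resume_activity(MPI_Status *status, void *user_data) {
        task_release(user_data);   /* resume the suspended activity */
    }

    ...
    MPI_Request cont_req, op_req;

    /* A continuation request aggregates the registered callbacks. */
    MPI_Continue_init(0 /* flags */, 0 /* max_poll */, MPI_INFO_NULL, &cont_req);
    MPI_Start(&cont_req);

    MPI_Irecv(buf, count, MPI_DOUBLE, src, tag, comm, &op_req);

    /* Attach the callback: it runs once op_req completes. */
    MPI_Continue(&op_req, &resume_activity, task, 0, MPI_STATUS_IGNORE, cont_req);

    /* The runtime's scheduler can poll for progress instead of blocking. */
    int all_done;
    MPI_Test(&cont_req, &all_done, MPI_STATUS_IGNORE);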

“Callback-based completion notification using MPI Continuations,” Joseph Schuchart, Christoph Niethammer, José Gracia, George Bosilca. Parallel Computing, 2021.

“MPI Detach - Asynchronous Local Completion,” Joachim Protze, Marc-André Hermanns, Ali Demiralp, Matthias S. Müller, Torsten Kuhlen. EuroMPI ’20.


CLARIFICATION OF THREAD ORDERING

Camp A: MPI must respect an order across threads

    • Pro: Probably what users expect; ordering can be relaxed with info assertions and per-thread communicators
    • Con: Removes a (questionable) performance optimization opportunity

Camp B: Operations from different threads are unordered

    • Pro: Can enable MPI libraries to optimize performance for multithreaded applications
    • Con: Hard to get ordering when it is needed, and MPI doesn't know what the user considers to be a thread
      • E.g., a user-level thread can be migrated to a different shepherd thread. Does MPI see it?

MPI-3.1 Section 3.5: If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.
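To make the quoted rule concrete, a minimal sketch (assuming an MPI_THREAD_MULTIPLE library and OpenMP for threading): the two sends below are logically concurrent, so rank 1 may match them in either order.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Two threads send with identical signatures: logically concurrent */
            #pragma omp parallel num_threads(2)
            {
                int val = omp_get_thread_num();
                MPI_Send(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            }
        } else if (rank == 1) {
            int a, b;
            MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("got %d then %d\n", a, b);   /* either 0,1 or 1,0 */
        }
        MPI_Finalize();
        return 0;
    }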


ACCELERATOR INFO

“mpi_memory_kind” (string, default: implementation defined)

  • This info key contains a comma-separated list of the memory kinds that the MPI library supports as buffer arguments to MPI routines. Possible values include:
    • “sysmem” - Memory allocated by standard operating system allocators
    • “cuda” - Memory allocated by CUDA memory allocators
    • “hip” - Memory allocated by HIP memory allocators
    • “l0” - Memory allocated by oneAPI Level Zero memory allocators
  • This info key should appear in MPI_INFO_ENV and in the info returned by MPI_SESSION_GET_INFO
  • This info key can be passed in the info argument to MPI_SESSION_INIT to request support for a memory kind
    • Users must check the info on the session to determine what is actually supported; see the sketch below

Is the MPI library CUDA/HIP/L0/etc.-aware?
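A minimal sketch of the intended flow, using standard sessions calls; the "mpi_memory_kind" key is the proposal itself, and the value "cuda" is just an example:

    /* Ask for CUDA support at session creation, then verify the result. */
    MPI_Session session;
    MPI_Info req_info, ses_info;
    char kinds[256];
    int flag;

    MPI_Info_create(&req_info);
    MPI_Info_set(req_info, "mpi_memory_kind", "cuda");
    MPI_Session_init(req_info, MPI_ERRORS_RETURN, &session);
    MPI_Info_free(&req_info);

    /* The request is not binding: check the session's info for the outcome. */
    MPI_Session_get_info(session, &ses_info);
    MPI_Info_get(ses_info, "mpi_memory_kind", sizeof(kinds) - 1, kinds, &flag);
    if (flag && strstr(kinds, "cuda") != NULL) {
        /* CUDA device pointers may be passed as buffer arguments */
    }
    MPI_Info_free(&ses_info);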


ACCELERATOR INFO

“mpi_assert_memory_kind” (string, default: none)

  • This info key contains a comma-separated list of the memory kinds that will be used by operations involving the given MPI object. Possible values include:
    • “sysmem” - Memory allocated by standard operating system allocators
    • “cuda” - Memory allocated by CUDA memory allocators
    • “hip” - Memory allocated by HIP memory allocators
    • “l0” - Memory allocated by oneAPI Level Zero memory allocators

  • Can be applied to MPI communicators and datatypes; see the sketch below

Avoids pointer attribute checking overhead
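A minimal sketch of applying the assertion to a communicator; the info and communicator calls are standard MPI, only the key itself is the proposal:

    /* Assert that all buffers used on gpu_comm are CUDA device memory,
     * so the library can skip per-call pointer attribute queries. */
    MPI_Info info;
    MPI_Comm gpu_comm;

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_memory_kind", "cuda");

    MPI_Comm_dup(MPI_COMM_WORLD, &gpu_comm);
    MPI_Comm_set_info(gpu_comm, info);   /* assertion applies to gpu_comm only */
    MPI_Info_free(&info);

    /* From here on, every buffer passed on gpu_comm must be CUDA memory. */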


MPI DATATYPE INFO

New functions:

int MPI_Type_commit_with_info(MPI_Datatype *datatype, MPI_Info info);

int MPI_Type_get_info(MPI_Datatype datatype, MPI_Info *info_used);

  • Info does not automatically propagate (e.g., when a new datatype is constructed from a datatype that has info set on it); see the sketch below
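A minimal sketch using the two proposed calls above; nrows/ncols are placeholder dimensions, and the memory-kind key follows the previous slides:

    /* Attach a memory-kind assertion at commit time and read it back. */
    MPI_Datatype column;
    MPI_Info info, info_used;

    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_memory_kind", "cuda");
    MPI_Type_commit_with_info(&column, info);
    MPI_Info_free(&info);

    MPI_Type_get_info(column, &info_used);   /* returns the info in effect */

    /* Note: a datatype built from 'column' does NOT inherit this info. */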


STREAM TRIGGERED NEIGHBOR EXCHANGE

Simple Ring Exchange Using a CUDA Stream

MPI_Request send_req, recv_req;
MPI_Status sstatus, rstatus;

for (int i = 0; i < NITER; i++) {
    if (i > 0) {
        /* Wait for the previous iteration's transfers before reusing the buffers */
        MPI_Wait_enqueue(&recv_req, &rstatus, MPI_CUDA_STREAM, stream);
        MPI_Wait_enqueue(&send_req, &sstatus, MPI_CUDA_STREAM, stream);
    }

    kernel<<<..., stream>>>(send_buf, recv_buf, ...);

    if (i < NITER - 1) {
        /* Enqueue the next exchange on the same stream */
        MPI_Irecv_enqueue(recv_buf, ..., &recv_req, MPI_CUDA_STREAM, stream);
        MPI_Isend_enqueue(send_buf, ..., &send_req, MPI_CUDA_STREAM, stream);
    }
}

cudaStreamSynchronize(stream);

[Figure: per-iteration operations enqueued on the CUDA stream: kernel, Isend, Irecv, Wait, Wait, kernel, Isend, Irecv]


ACCELERATOR BINDINGS FOR MPI PARTITIONED APIS

CUDA and SYCL Language Bindings Under Exploration

Keep host only:

int MPI_Psend_init(const void *buf, int partitions, MPI_Count count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Info info, MPI_Request *request)

int MPI_Precv_init(void *buf, int partitions, MPI_Count count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Info info, MPI_Request *request)

int MPI_Start[all](...), MPI_Wait[all](...)

Add device bindings:

__device__ int MPI_Pready(int partition, MPI_Request request)

__device__ int MPI_Pready_range(int partition_low, int partition_high, MPI_Request request)

__device__ int MPI_Pready_list(int length, const int array_of_partitions[], MPI_Request request)

__device__ int MPI_Parrived(MPI_Request request, int partition, int *flag)


KERNEL TRIGGERED COMMUNICATION USAGE

Device Code

__device__ void MPI_Pready(int idx, MPI_Request req);

__global__ void kernel(..., MPI_Request *req) {
    int i = my_partition(...);
    // Compute and fill partition i,
    // then mark i as ready
    MPI_Pready(i, req[0]);
}

Partitioned Neighbor Exchange

Host Code

MPI_Request req[2];

MPI_Psend_init(..., &req[0]);
MPI_Precv_init(..., &req[1]);

while (...) {
    MPI_Startall(2, req);
    MPI_Pbuf_prepare_all(2, req);
    kernel<<<..., s>>>(..., req);
    cudaStreamSynchronize(s);
    MPI_Waitall(2, req);
}

MPI_Request_free(&req[0]);
MPI_Request_free(&req[1]);


KERNEL & STREAM TRIGGERED COMMUNICATION USAGE

Device Code

__device__ void MPI_Pready(int idx, MPI_Request req);

__global__ void kernel(..., MPI_Request *req) {
    int i = my_partition(...);
    // Compute and fill partition i,
    // then mark i as ready
    MPI_Pready(i, req[0]);
}

Partitioned Neighbor Exchange

Host Code

MPI_Request req[2];

MPI_Psend_init(..., &req[0]);
MPI_Precv_init(..., &req[1]);

while (...) {
    MPI_Startall_enqueue(2, req, ..., s);
    MPI_Pbuf_prepare_all_enqueue(2, req, ..., s);
    kernel<<<..., s>>>(..., req);
    cudaStreamSynchronize(s);
    MPI_Waitall_enqueue(2, req, ..., s);
}

MPI_Request_free(&req[0]);
MPI_Request_free(&req[1]);

Moving control ops to stream eliminates stream synchronization overhead


KERNEL & STREAM TRIGGERED COMMUNICATION USAGE

(Device and host code identical to the previous slide.)

Allows the sender to wait until the receiver is ready, so MPI_Pready becomes an RDMA write


Thank you!

Wednesdays 10-11am US Eastern Time

https://github.com/mpiwg-hybrid/hybrid-issues/wiki
