One-sided communication: functions
Questions
What functions should you use for RMA?
Objectives
Learn how to create memory windows.
Learn how to access remote memory windows.
RMA anatomy
One-sided communication in MPI is achieved in three steps, which map onto three sets of functions:
- Windows
Make memory available on each process for remote memory accesses. We use memory windows, which are objects of type MPI_Win providing handles to remotely-accessible memory. MPI provides four collective routines for the creation of memory windows:
  - MPI_Win_allocate allocates memory and creates the window object.
  - MPI_Win_create creates a window from already allocated memory.
  - MPI_Win_allocate_shared creates a window from already allocated MPI shared memory.
  - MPI_Win_create_dynamic creates a window from allocated memory, but the window-memory pairing is deferred.
A handle of type MPI_Win manages memory made available for remote operations on all ranks in the communicator. Memory windows must be explicitly freed after use with MPI_Win_free.
- Load/store
Load, store, and transform data in remote windows. We can identify an origin and a target process. In contrast with two-sided communication, the origin process fully specifies the data transfer: where the data comes from and where it is going to. There are three main groups of MPI routines for this purpose:
  - Put: MPI_Put and MPI_Rput
  - Get: MPI_Get and MPI_Rget
  - Accumulate: MPI_Accumulate, MPI_Raccumulate, and variations thereof.
- Synchronization
Ensure that the data is available for remote memory accesses. The load/store routines are non-blocking and the programmer must take care that subsequent accesses are safe and correct. How synchronization is achieved depends on the one-sided communication paradigm adopted:
  - Active: both origin and target processes play a role in the synchronization. This closely resembles the message passing model of parallel computation.
  - Passive: the origin process orchestrates data transfer and synchronization. Conceptually, this is closely related to the shared memory model of parallel computation: the window is the shared memory in the communicator and every process can operate on it, seemingly independently of the others.
There are three sets of routines currently available in MPI:
  - MPI_Win_fence achieves synchronization in the active target communication paradigm.
  - MPI_Win_post, MPI_Win_start, MPI_Win_complete, and MPI_Win_wait are also used in the active target communication paradigm.
  - MPI_Win_lock and MPI_Win_unlock enable synchronization in the passive target paradigm.
We will discuss synchronization further in the next episode One-sided communication: synchronization.
RMA in action
In this example, we will work with two processes:
- Rank 1 will allocate a buffer and expose it as a window.
- Rank 0 will get the values from this buffer.
You can find a full working solution in the
content/code/day-3/00_rma/solution
folder.
First of all, we create the buffer on all ranks. However, only rank 1 will fill it with some values. We will see that window creation is a collective call for all ranks in the given communicator.
int window_buffer[4] = {0};
if (rank == 1) {
window_buffer[0] = 42;
window_buffer[1] = 88;
window_buffer[2] = 12;
window_buffer[3] = 3;
}
MPI_Win win;
MPI_Win_create(&window_buffer, /* pre-allocated buffer */
(MPI_Aint)4 * sizeof(int), /* size in bytes */
sizeof(int), /* displacement units */
MPI_INFO_NULL, /* info object */
comm, /* communicator */
&win /* window object */);
Every rank now has a window, but only the window on rank 1 holds values different from 0. Before doing anything on the window, we need to start an access epoch:
MPI_Win_fence(0, /* assertion */
win /* window object */);
Process 0 can now load the values into its local memory:
int getbuf[4];
if (rank == 0) {
// Fetch the value from the MPI process 1 window
MPI_Get(&getbuf, /* pre-allocated buffer on RMA origin process */
4, /* count on RMA origin process */
MPI_INT, /* type on RMA origin process */
1, /* rank of RMA target process */
0, /* displacement on RMA target process */
4, /* count on RMA target process */
MPI_INT, /* type on RMA target process */
win /* window object */);
}
We synchronize again once we are done with RMA operations: this closes the access epoch. This is needed even if subsequent accesses are local!
MPI_Win_fence(0, /* assertion */
win /* window object */);
Remember to free the window object!
MPI_Win_free(&win);
Non-blocking vs. RMA
At first glance, one-sided and non-blocking communication appear similar. The key difference lies in the mechanism used for synchronization.
Typealong
Can you re-express the code shown in the type-along in terms of MPI_Isend/MPI_Recv? A full working example is in the content/code/day-3/01_rma-vs-nonblocking/solution folder.
if (rank == 0) {
int sendbuf[4] = {42, 88, 12, 3};
MPI_Request request;
printf("MPI process %d sends values:", rank);
for (int i = 0; i < 4; ++i) {
printf(" %d", sendbuf[i]);
}
printf("\n");
MPI_Isend(&sendbuf, 4, MPI_INT, 1, 0, comm, &request);
/* Here you might do other useful computational work */
// Let's wait for the MPI_Isend to complete before progressing further.
MPI_Wait(&request, MPI_STATUS_IGNORE);
} else if (rank == 1) {
int recvbuf[4];
MPI_Recv(&recvbuf, 4, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
printf("MPI process %d receives values:", rank);
for (int i = 0; i < 4; ++i) {
printf(" %d", recvbuf[i]);
}
printf("\n");
}
// start access epoch
MPI_Win_fence(0, win);
int getbuf[4];
if (rank == 0) {
// Fetch the value from the MPI process 1 window
MPI_Get(&getbuf, 4, MPI_INT, 1, 0, 4, MPI_INT, win);
}
// end access epoch
MPI_Win_fence(0, win);
Window creation
The creation of MPI_Win
objects is a collective operation: each process in
the communicator will reserve the specified memory for remote memory accesses.
Use this function to allocate memory and create a window object out of it.
int MPI_Win_allocate(MPI_Aint size,
int disp_unit,
MPI_Info info,
MPI_Comm comm,
void *baseptr,
MPI_Win *win)
We can expose an array of 10 doubles for RMA with:
// allocate window
double *buf;
MPI_Win win;
MPI_Win_allocate((MPI_Aint)(10 * sizeof(double)), sizeof(double),
MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
// do something with win
// free window and the associated memory
MPI_Win_free(&win);
Parameters
size
Size in bytes.
disp_unit
Displacement unit. If disp_unit = 1, then displacements are computed in bytes. The use of displacement units can help with code readability and is essential for correctness on heterogeneous systems, where the sizes of the basic types might differ between processes. See also Derived datatypes: MPI_Datatype.
info
An info object, which can be used to provide optimization hints to the MPI implementation. Using MPI_INFO_NULL is always correct.
comm
The (intra)communicator.
baseptr
The base pointer.
win
The window object.
With this routine you can tell MPI what memory to expose as a window. The memory must already be allocated and contiguous, since it is specified on input as a base address plus a size in bytes.
int MPI_Win_create(void *base,
MPI_Aint size,
int disp_unit,
MPI_Info info,
MPI_Comm comm,
MPI_Win *win)
Parameters
base
The base pointer.
size
Size in bytes.
disp_unit
Displacement unit. If disp_unit = 1, then displacements are computed in bytes. The use of displacement units can help with code readability and is essential for correctness on heterogeneous systems, where the sizes of the basic types might differ between processes. See also Derived datatypes: MPI_Datatype.
info
An info object, which can be used to provide optimization hints to the MPI implementation. Using MPI_INFO_NULL is always correct.
comm
The (intra)communicator.
win
The window object.
What if the memory is not yet allocated? We advise using MPI_Alloc_mem:
// allocate memory
double *buf;
MPI_Alloc_mem((MPI_Aint)(10 * sizeof(double)), MPI_INFO_NULL, &buf);
// create window
MPI_Win win;
MPI_Win_create(buf, (MPI_Aint)(10 * sizeof(double)), sizeof(double),
MPI_INFO_NULL, MPI_COMM_WORLD, &win);
// do something with win
// free window
MPI_Win_free(&win);
// free memory
MPI_Free_mem(buf);
You must explicitly call MPI_Free_mem
to deallocate memory obtained
with MPI_Alloc_mem
.
Note
The memory window is usually a single array: the size of the window object
then coincides with the size of the array. If the base type of the array
is a simple type, then the displacement unit is the size of that type,
e.g. double
and sizeof(double)
. You should use a displacement
unit of 1 otherwise.
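For instance, a window over a raw byte buffer is addressed with a displacement unit of 1, so that displacements in RMA calls are plain byte offsets. Below is a minimal sketch, assuming an arbitrary 64-byte scratch buffer (the name and size are illustrative, not part of the lesson code):
char scratch[64] = {0}; /* illustrative raw byte buffer */
MPI_Win byte_win;
MPI_Win_create(scratch, /* pre-allocated buffer */
               (MPI_Aint)sizeof(scratch), /* size in bytes */
               1, /* displacement unit: 1 byte */
               MPI_INFO_NULL, /* info object */
               MPI_COMM_WORLD, /* communicator */
               &byte_win /* window object */);
/* ... RMA accesses now use byte offsets as displacements ... */
MPI_Win_free(&byte_win);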
Window creation
Let’s look again at the initial example in the type-along. There we published an already allocated buffer as a memory window. Use the examples above to figure out how to switch to using MPI_Win_allocate.
You can find a scaffold for the code in the
content/code/day-3/02_rma-win-allocate
folder. A working solution is in the
solution
subfolder.
RMA operations
Store data from the origin process to the memory window of the target process. The origin process is the source, while the target process is the destination.
int MPI_Put(const void *origin_addr,
int origin_count,
MPI_Datatype origin_datatype,
int target_rank,
MPI_Aint target_disp,
int target_count,
MPI_Datatype target_datatype,
MPI_Win win)
Load data from the memory window of the target process to the origin process. The origin process is the destination, while the target process is the source.
int MPI_Get(void *origin_addr,
int origin_count,
MPI_Datatype origin_datatype,
int target_rank,
MPI_Aint target_disp,
int target_count,
MPI_Datatype target_datatype,
MPI_Win win)
Parameters
Both MPI_Put
and MPI_Get
are non-blocking: they are completed
by a call to a synchronization routine.
The two functions have the same argument list. Similarly to MPI_Send
and MPI_Recv
, the data is specified by the triplet of address, count,
and datatype.
For the data at the origin process this is: origin_addr
,
origin_count
, origin_datatype
.
On the target process, we describe the buffer in terms of displacement,
count, and datatype: target_disp
, target_count
, target_datatype
.
The address of the buffer on the target process is computed using the base
address and displacement unit of the MPI_Win
object:
target_addr = win_base_addr + target_disp * disp_unit
With MPI_Put
, the origin
triplet specifies the local send
buffer; while with MPI_Get
it specifies the local receive
buffer.
The target_rank
parameter is, as the name suggests, the rank of the
target process in the communicator.
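For example, storing two integers into elements 2 and 3 of the target window could look as follows. This is only a sketch: it assumes a window win created with a displacement unit of sizeof(int), an access epoch that has already been opened (for instance with MPI_Win_fence), and an arbitrary target rank of 1:
int values[2] = {7, 9}; /* illustrative data on the origin process */
/* target address = window base on rank 1 + 2 * disp_unit */
MPI_Put(values, /* origin buffer */
        2, /* origin count */
        MPI_INT, /* origin datatype */
        1, /* rank of target process */
        2, /* displacement on target process */
        2, /* target count */
        MPI_INT, /* target datatype */
        win /* window object */);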
Using MPI_Put
Reorganize the sample code of the previous exercise such that rank 1 stores values into rank 0's memory window with MPI_Put, rather than rank 0 loading them with MPI_Get.
You can find a scaffold for the code in the
content/code/day-3/03_rma-put
folder. A working solution is in the
solution
subfolder.
Store data from the origin process to the memory window of the target process and combine it using one of the predefined MPI reduction operations.
int MPI_Accumulate(const void *origin_addr,
int origin_count,
MPI_Datatype origin_datatype,
int target_rank,
MPI_Aint target_disp,
int target_count,
MPI_Datatype target_datatype,
MPI_Op op,
MPI_Win win)
The argument list to MPI_Accumulate
is the same as for MPI_Put
,
with the addition of the op
parameter with type MPI_Op
, which
specifies which reduction operation to execute on the target process.
This routine is elementwise atomic: accesses from multiple processes will
be serialized in some order and no race conditions can thus occur. You still
need to exercise care though: reductions are only deterministic if the
operation is associative and commutative for the given datatype. For
example, MPI_SUM
and MPI_PROD
are not associative for floating point numbers!
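As an aside, a common idiom is to pair MPI_Accumulate with the predefined MPI_REPLACE operation, which acts as an element-wise atomic alternative to MPI_Put. A minimal sketch, assuming an integer window win, an already opened access epoch, and an arbitrary target rank of 0:
int value = 42; /* illustrative value on the origin process */
/* element-wise atomic store into element 0 of the window on rank 0:
   MPI_REPLACE overwrites the target data instead of combining with it */
MPI_Accumulate(&value, /* origin buffer */
               1, /* origin count */
               MPI_INT, /* origin datatype */
               0, /* rank of target process */
               0, /* displacement on target process */
               1, /* target count */
               MPI_INT, /* target datatype */
               MPI_REPLACE, /* reduction operation */
               win /* window object */);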
Using MPI_Accumulate
You can find a scaffold for the code in the
content/code/day-3/04_rma-accumulate
folder.
Follow the prompts and complete the function calls to:
  - Create a window object from an allocated buffer:
    int buffer = 42;
  - Let each process accumulate its rank in the memory window of the process with rank 0. We want to obtain the sum of the accumulating values.
With 2 processes, you should get the following output to screen:
[MPI process 0] Value in my window_buffer before MPI_Accumulate: 42.
[MPI process 1] I accumulate data 1 in MPI process 0 window via MPI_Accumulate.
[MPI process 0] Value in my window_buffer after MPI_Accumulate: 43.
A working solution is in the solution
subfolder.
Describe the sequence of MPI calls connecting the before and after schemes.
-
  A. Window creation with MPI_Win_allocate.
  B. Window creation with MPI_Win_create followed by MPI_Alloc_mem.
  C. Dynamic window creation with MPI_Win_create_dynamic.
  D. Memory allocation with MPI_Alloc_mem followed by window creation with MPI_Win_create.
-
  A. Window creation with MPI_Win_allocate and MPI_Get from origin process 2 to target process 1.
  B. Window creation with MPI_Win_create_dynamic and MPI_Put from origin process 1 to target process 2.
  C. Window creation with MPI_Win_create and MPI_Get from origin process 1 to target process 2.
  D. Window creation with MPI_Win_create and MPI_Put from origin process 2 to target process 1.
Solution
  - Both options A and D are correct. With option A, we let MPI allocate memory on each process and create an MPI_Win window object. With option D, the memory allocation and window object creation are decoupled and managed by the programmer. If you have the choice, option A should be preferred: the MPI library might be able to better optimize window creation.
  - Option D is correct. The memory is already allocated on each process, maybe through use of MPI_Alloc_mem, and the window can be created with a call to MPI_Win_create. The subsequent data movement is a remote store operation. The call to MPI_Put is issued by process 2, the origin process, to store its C variable to the memory window of process 1, the target process.
Note
There are other routines for RMA operations. We give here a list without going into details:
- Request-based variants
These routines return a handle of type MPI_Request and synchronization can be achieved with MPI_Wait: MPI_Rget, MPI_Rput, MPI_Raccumulate, MPI_Rget_accumulate.
- Specialized accumulation variants
These functions perform specialized accumulations, but are conceptually similar to MPI_Accumulate: MPI_Get_accumulate, MPI_Fetch_and_op, MPI_Compare_and_swap.
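As a brief illustration of the request-based variants, the sketch below retrieves four doubles with MPI_Rget and waits on the returned request. It assumes a window win of doubles and an arbitrary target rank of 1; the passive target synchronization calls (MPI_Win_lock/MPI_Win_unlock) that request-based operations require are discussed in the next episode:
double result[4];
MPI_Request req;
/* request-based RMA operations must be issued within a passive target epoch */
MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
MPI_Rget(result, /* origin buffer */
         4, /* origin count */
         MPI_DOUBLE, /* origin datatype */
         1, /* rank of target process */
         0, /* displacement on target process */
         4, /* target count */
         MPI_DOUBLE, /* target datatype */
         win, /* window object */
         &req /* request handle */);
/* local completion can be awaited like any other request */
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Win_unlock(1, win);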
See also
The lecture covering MPI RMA from EPCC is available here
Chapter 3 of Using Advanced MPI by William Gropp et al. [GHTL14]
Keypoints
The MPI model for remote memory accesses.
Window objects and memory windows.
Timeline of RMA and the importance of synchronization.