Non-blocking collective communication
Questions
Is the synchronization of collective communications avoidable?
Objectives
Understand the utility of non-blocking collective communications
Employ available profiling tools to monitor code behavior
Introduction
Just like for point-to-point messages, applications that use non-blocking collectives can be more efficient than those that use blocking ones, because the latency of the communication can be overlapped with unrelated computation.
Non-blocking barrier synchronization
At first glance, this seems like nonsense. However, if a barrier is needed, it can be quite useful to overlap work with the synchronization. Use cases are rare, but include highly unstructured work described by variable numbers of messages sent between ranks, or very latency-sensitive applications.
Parameters
A communicator comm and a request object that serves as a handle for a later wait call.
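For reference, MPI_Ibarrier takes exactly these two arguments. The following is a minimal sketch (not part of the lesson code) of how independent work can be overlapped with the synchronization before a wait call completes it:
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request request;

    /* Start the barrier without blocking */
    MPI_Ibarrier(MPI_COMM_WORLD, &request);

    /* ... do work here that does not depend on the other ranks ... */

    /* Complete the synchronization before continuing */
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}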
Non-blocking reduce
MPI_Ireduce starts a reduction operation and returns a request in an MPI_Request object. The reduction is complete only when a test on the request succeeds or a wait call returns. Upon completion, the reduced value is available in the root process.
int MPI_Ireduce(const void* sendbuf,
                void* recvbuf,
                int count,
                MPI_Datatype datatype,
                MPI_Op op,
                int root,
                MPI_Comm comm,
                MPI_Request *request)
Parameters
sendbuf, recvbuf, and count are the send buffer on each process, the receive buffer on the root process, and the number of elements to be reduced from each process. datatype is the type of the data to be reduced. op is the reduction operation to be applied to the distributed data. The global result of the reduction is collected in the root process of the communicator comm. The request object that is returned must be used to wait on the communication later.
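As a minimal sketch (not the exercise code; the names local_value and global_sum are illustrative), a non-blocking reduction followed later by a wait might look like this:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float local_value = (float)rank; /* each rank contributes its own value */
    float global_sum = 0.0f;
    MPI_Request request;

    /* Start the reduction; it can proceed while we do other work */
    MPI_Ireduce(&local_value, &global_sum, 1, MPI_FLOAT, MPI_SUM,
                0, MPI_COMM_WORLD, &request);

    /* ... unrelated computation could overlap with the reduction here ... */

    /* Only after the wait is global_sum valid on the root (rank 0) */
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    if (rank == 0)
    {
        fprintf(stderr, "Global sum is %g\n", global_sum);
    }

    MPI_Finalize();
    return 0;
}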
Using MPI_Ireduce to compute a running total in a stencil workflow
You can find a scaffold for the code in the content/code/day-3/00_ireduce
folder. It is quite similar to that for the earlier non-blocking
exercise. A working solution is in the solution
subfolder. Try to compile
with:
mpicc -g -Wall -std=c11 non-blocking-communication-ireduce.c -o non-blocking-communication-ireduce
When you have the code compiling, try to run with:
mpiexec -np 2 ./non-blocking-communication-ireduce
Try to fix the code
Solution
One correct approach is:
    fprintf(stderr, "Doing a non-blocking reduction on step %d\n", step);
    MPI_Ireduce(&local_total, &temporary_total, 1, MPI_FLOAT, MPI_SUM,
                total_root_rank, comm, &total_request);
}
/* Wait for the most recent total heat reduction, 4 steps after it was started */
if (step % 5 == 3 && total_request != MPI_REQUEST_NULL)
{
    MPI_Wait(&total_request, MPI_STATUS_IGNORE);
    total = temporary_total;
    if (rank == total_root_rank)
    {
        fprintf(stderr, "Total after waiting at step %d was %g\n", step, total);
    }
}

... same code as in the original example

/* Now that we have left the main loop, we should wait for
 * the most recent total heat reduction to complete. */
if (total_request != MPI_REQUEST_NULL)
{
    MPI_Wait(&total_request, MPI_STATUS_IGNORE);
There are other approaches that work correctly. Is yours better or worse than this one? Why?
Code analysis
How can you know when a blocking or non-blocking communication is required?
It is cumbersome to analyse code by embedding print statements (printf) in it. For this reason, analysis tools have been written that allow you to monitor the behavior of your code in more detail. Some of these tools are Extrae/Paraver, TAU, and Scalasca, to name only a few.
Here we focus on the combination of the Extrae and Paraver tools, developed at the Barcelona Supercomputing Center (BSC). They support different architectures, including CPUs and GPUs, as well as different parallelisation levels, for instance MPI, OpenMP, and MPI+OpenMP. Extrae is the tool that produces the trace files, while Paraver is the visualiser/analyser.
In order to use Extrae, the code needs to be compiled with the debugging flag (-g). The events that Extrae should monitor, for instance MPI or OpenMP events, are listed in an .xml file (extrae.xml):
<?xml version='1.0'?>
<trace enabled="yes"
home="/software/Extrae/3.8.0-gompi-2020b"
initial-mode="detail"
type="paraver" >
<mpi enabled="yes">
<counters enabled="yes" />
</mpi>
<openmp enabled="no">
<locks enabled="no" />
<counters enabled="no" />
</openmp>
</trace>
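A typical way to run an instrumented executable is to point Extrae at this configuration file, preload its MPI tracing library, and then merge the resulting trace for Paraver. The exact paths and library name depend on your installation, so treat the following commands as a sketch rather than a recipe:
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitrace.so   # C MPI codes; adjust to your install
mpiexec -np 2 ./non-blocking-communication-ireduce
mpi2prv -f TRACE.mpits -o trace.prv                 # merge into a trace readable by Paraver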
For the non-blocking deadlock and overlap cases discussed in the previous lecture, the MPI call events show the following patterns in Paraver:
Notice that the grid size along the horizontal axis was increased to 8000 to make the visualisation clearer. In the overlap case, we can see that some work was interleaved (black region) between the MPI_Isend and MPI_Irecv calls and the waiting call (red rectangles).
See also
Chapter 2 of the Using Advanced MPI book by William Gropp et al. shows examples of using the functions described in this episode. [GHTL14]
https://www.codingame.com/playgrounds/349/introduction-to-mpi/non-blocking-communications
Keypoints
Non-blocking collectives combine the efficiency of collective communications with the possibility of interleaving useful work.
Although it may sound contradictory and of little use, a non-blocking barrier is sometimes handy, for instance when only a notification that the processes have reached the barrier is needed.
There are several tools available that allow you to analyse your code in detail. Here we have described the Extrae and Paraver tools, but there are others available.