Simple collective communication

Questions

  • How can all ranks of my program collaborate with messages?

  • How does collective messaging differ from point-to-point?

Objectives

  • Know the different kinds of collective message operations

  • Understand the terminology used in MPI about collective messages

  • Understand how to combine data from ranks of a communicator in an operation

Introduction

Parallel programs often need to collaborate when passing messages:

  • To ensure that all ranks have reached a certain point (barrier)

  • To share data with all ranks (broadcast)

  • To compute based on data from all ranks (reduce)

  • To rearrange data across ranks for subsequent computation (scatter, gather)

These can all be done with point-to-point messages. However, that requires more code, runs slower, and scales worse than using the optimized collective calls.

There are several other operations that generalize these building blocks:

  • gathering data from all ranks and delivering the same data to all ranks

  • all-to-all scatter and gather of different data to all ranks

Finally, MPI supports reduction operations, where a logical or arithmetic operation can be used to efficiently compute while communicating data.

Barrier

An MPI_Barrier call ensures that all ranks arrive at the call before any of them proceeds past it.

[Figure: MPI_Barrier. All ranks in the communicator reach the barrier before any continue past it.]

MPI_Barrier is blocking (i.e. it does not return until the operation is complete) and introduces collective synchronization into the program. This can be useful, for example, to let a single rank wait for an external event (e.g. files being written by another process) before entering the barrier, rather than having every rank check for it.

When debugging problems in other MPI communication, adding calls to MPI_Barrier can be useful. However, if a barrier is necessary for your program to function correctly, that may suggest your program has a bug!
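
A minimal sketch of how MPI_Barrier is typically called is shown below; the program and its printed messages are illustrative and not taken from the exercise scaffolds.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Work before the barrier may finish at different times on different ranks */
    printf("Rank %d reached the barrier\n", rank);

    /* No rank passes this point until every rank in MPI_COMM_WORLD has called MPI_Barrier */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d passed the barrier\n", rank);

    MPI_Finalize();
    return 0;
}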

Broadcast

An MPI_Bcast call sends data from one rank to all other ranks.

[Figure: MPI_Bcast. After the call, all ranks in the communicator agree on the two values sent.]

MPI_Bcast is blocking and introduces collective synchronization into the program.

This can be useful to allow one rank to share values with all other ranks in the communicator. For example, one rank might read a file and then broadcast its contents to all other ranks. This is usually more efficient than having each rank read the same file.
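
A minimal sketch of a broadcast, assuming rank 0 acts as the root and the two integer values stand in for data it might have read from a file:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root rank fills in meaningful values, e.g. after reading a file */
    int values[2] = {0, 0};
    if (rank == 0) {
        values[0] = 42;
        values[1] = 7;
    }

    /* Every rank calls MPI_Bcast with the same root; afterwards all ranks hold the root's values */
    MPI_Bcast(values, 2, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d has values %d and %d\n", rank, values[0], values[1]);

    MPI_Finalize();
    return 0;
}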

Reduce

An MPI_Reduce call combines data from all ranks using an operation and returns values to a single rank.

[Figure: MPI_Reduce. After the call, the root rank has a value computed by combining a value from every rank in the communicator with an operation.]

MPI_Reduce is blocking and introduces collective synchronization into the program.

There are several pre-defined operations, including arithmetic and logical ones. A full list of operations is available in the linked documentation.

This is useful to allow one rank to compute using values from all ranks in the communicator. For example, the maximum value found over all ranks (and even the rank on which it was found) can be returned to the root rank. Often one simply wants a sum, and for that MPI_SUM is provided.
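
A minimal sketch of a sum reduction delivered to rank 0; the per-rank contribution of rank + 1 is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes its own value to the reduction */
    int contribution = rank + 1;
    int total = 0;

    /* Combine the contributions with MPI_SUM; only the root (rank 0 here) receives the result */
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Sum over all ranks: %d\n", total);
    }

    MPI_Finalize();
    return 0;
}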

Allreduce

An MPI_Allreduce call combines data from all ranks using an operation and returns values to all ranks.

[Figure: MPI_Allreduce. After the call, every rank has a value computed by combining a value from all ranks in the communicator with an operation.]

MPI_Allreduce is blocking and introduces collective synchronization into the program. The pre-defined operations are the same as for MPI_Reduce. MPI_Allreduce is useful when the result of MPI_Reduce is needed on all ranks.
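
A minimal sketch of the same sum reduction, but with the result delivered to every rank; as above, the contribution of rank + 1 is only an example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int contribution = rank + 1;
    int total = 0;

    /* Like MPI_Reduce, but every rank receives the combined result, so no root argument is needed */
    MPI_Allreduce(&contribution, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d sees the sum %d\n", rank, total);

    MPI_Finalize();
    return 0;
}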

Exercise: broadcast and reduce

Use a broadcast and observe the results with reduce

You can find a scaffold for the code in the content/code/day-1/08_broadcast folder. A working solution is in the solution subfolder. Try to compile with:

mpicc -g -Wall -std=c11 collective-communication-broadcast.c -o collective-communication-broadcast

  1. When you have the code compiling, try to run with:

    mpiexec -np 2 ./collective-communication-broadcast
    
  2. Use clues from the compiler and the comments in the code to change the code so it compiles and runs. Try to get all ranks to report success :-)

Exercise: calculating \(\pi\) using numerical integration

Use broadcast and reduce to compute \(\pi\)

\(\pi = 4 \int_{0}^{1} \frac{1}{1+x^2} dx\).

You can find a scaffold for the code in the content/code/day-1/09_integrate-pi folder.

Compile with:

mpicc -g -Wall -std=c11 pi-integration.c -o pi-integration

A working solution is in the solution subfolder.

  1. When you have the code compiling, try to run with:

    mpiexec -np 4 ./pi-integration 10000
    
  2. You can try different numbers of points and see how this affects the result.
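
One possible way to structure such a program is sketched below, combining MPI_Bcast and MPI_Reduce with a midpoint-rule sum over the integrand. This is only a sketch under those assumptions; the scaffold and solution in 09_integrate-pi may be organized differently.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 reads the number of points from the command line and broadcasts it */
    int num_points = 0;
    if (rank == 0) {
        num_points = (argc > 1) ? atoi(argv[1]) : 10000;
    }
    MPI_Bcast(&num_points, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Midpoint rule: each rank sums the strips i = rank, rank + size, ... */
    const double h = 1.0 / num_points;
    double local_sum = 0.0;
    for (int i = rank; i < num_points; i += size) {
        double x = (i + 0.5) * h;
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;

    /* Combine the partial sums on rank 0 */
    double pi = 0.0;
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Computed pi = %.10f with %d points\n", pi, num_points);
    }

    MPI_Finalize();
    return 0;
}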

Tips when using collective communication

Unlike point-to-point messages, collective communication does not use tags. This is deliberate, because collective communication requires all ranks in the communicator to contribute to the work before any rank will return from the call. There’s no facility for more than one collective communication to run at a time on a communicator, so there’s no need for a tag to clarify which communication is taking place.

Quiz: if one rank calls a reduce, and another rank calls a broadcast, is it a problem?

  1. Yes, always.

  2. No, never.

  3. Yes, when they are using the same communicator.

Keypoints

  • Collective communication requires participation of all ranks in that communicator

  • Collective communication happens in order and so no tags are needed.