MPI and threads in practice

Questions

  • When should I consider writing hybrid MPI+OpenMP programs?

  • What should I look out for when writing hybrid MPI+OpenMP programs?

Objectives

  • Estimate the benefits before trying to write code for hybrid parallelism

Using fork-join parallelism

In fork-join parallelism, multiple threads are launched to collaborate on work. Typically regions of parallelism alternate with regions where only one thread works. This enables parallelism to be introduced gradually, and only where profiling shows that it would be most beneficial. In typical implementations, threads are kept idle between parallel regions; this is more efficient than creating and destroying them many times.

../_images/fork-join-parallelism.svg

OpenMP is particularly suited for fork-join parallelism. Beware that each parallel region requires synchronization between threads, which can be costly. Further, the speed-up depends critically on the time spent in the single-threaded regions!

The simplest hybrid approach is often to do the MPI communication in the regions that are single-threaded.

../_images/fork-join-with-mpi.svg

Fork-join parallelism is a natural fit for MPI_THREAD_FUNNELED where fairly simple code can be improved with thread parallelism.

for loops in Fortran/C/C++ can be readily parallelised with OpenMP directives such as #pragma omp parallel for (or !$omp parallel do in Fortran), so applications that already use MPI outside such loops can be converted to hybrid parallelism fairly easily.
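A minimal sketch of this style is shown below (illustrative code, not part of the course material): MPI is initialized requesting MPI_THREAD_FUNNELED, the thread-parallel work is done in an OpenMP loop, and the MPI call stays in the single-threaded region. The loop body, sizes, and variable names are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request funneled support: only the thread that called
     * MPI_Init_thread will make MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
    {
        fprintf(stderr, "MPI_THREAD_FUNNELED not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Fork-join region: thread-parallel work, no MPI calls inside. */
    const int n = 1000000;
    double local_sum = 0.0;
#pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < n; ++i)
        local_sum += 1.0 / (i + 1.0);

    /* Back in the single-threaded region: a safe place for MPI. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global sum = %g\n", global_sum);

    MPI_Finalize();
    return 0;
}

Compile with mpicc -fopenmp, as in the exercise below.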

Using the thread funneled MPI option

You can find a scaffold for the code in the content/code/day-4/01_threading-funneled folder. A working solution is in the solution subfolder. It is quite similar to that for the earlier non-blocking exercise. Try to compile with:

mpicc -g -fopenmp -Wall -std=c11 threading-funneled.c -o threading-funneled
  1. When you have the code compiled, try to run with:

    OMP_NUM_THREADS=2 mpiexec -np 2 ./threading-funneled
    
  2. Try to fix the code so that it compiles, runs, and reports success
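The actual fix belongs in the exercise code and its solution folder. As a generic illustration of the funneled pattern (a hypothetical example, not the solution), a program may also call MPI from inside a parallel region, as long as the call is funneled through the master thread and followed by a barrier before the other threads use the result:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
    {
        fprintf(stderr, "MPI_THREAD_FUNNELED not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int token = -1; /* shared between the threads of each rank */
#pragma omp parallel
    {
        /* ... thread-parallel work would go here ... */

        /* Funneled: MPI may be called inside the parallel region,
         * but only by the master thread (the one that called
         * MPI_Init_thread). */
#pragma omp master
        {
            if (rank == 0)
                token = 42;
            MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }
        /* master has no implied barrier, so synchronize before the
         * other threads read the broadcast value. */
#pragma omp barrier

        printf("rank %d thread %d sees token %d\n",
               rank, omp_get_thread_num(), token);
    }

    MPI_Finalize();
    return 0;
}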

Using OpenMP tasking with MPI

../_images/stencil-with-tasking.svg

Stencil code with halo exchange implemented with OpenMP tasking. One group of threads takes responsibility for the halo exchange and the non-local stencil work; another takes responsibility for the local work. The threads are split statically during each time step, but the number of threads assigned to each part could be tuned over the duration of the program.
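A rough sketch of that idea for a 1-D stencil is shown below (illustrative code only: the array names, the problem size N, and the use of task dependences are assumptions, and MPI_THREAD_MULTIPLE is requested although MPI_THREAD_SERIALIZED would suffice here because only a single task communicates):

#include <mpi.h>
#include <stdio.h>

#define N 1024 /* local number of grid points per rank (illustrative) */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int left = (rank - 1 + size) % size;
    const int right = (rank + 1) % size;

    /* u[0] and u[N+1] are the halo cells. */
    static double u[N + 2], unew[N + 2];
    for (int i = 1; i <= N; ++i)
        u[i] = rank;

#pragma omp parallel
#pragma omp single
    {
        /* One task does the halo exchange; tasks that need the halo
         * cells depend on it. */
#pragma omp task depend(out : u[0], u[N + 1])
        {
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* Non-local stencil points wait for the halo exchange. */
#pragma omp task depend(in : u[0])
        unew[1] = 0.5 * (u[0] + u[2]);
#pragma omp task depend(in : u[N + 1])
        unew[N] = 0.5 * (u[N - 1] + u[N + 1]);

        /* Purely local work overlaps with the communication task. */
#pragma omp taskloop
        for (int i = 2; i <= N - 1; ++i)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    if (rank == 0)
        printf("unew[1] = %g\n", unew[1]);

    MPI_Finalize();
    return 0;
}

Only one task makes MPI calls in this sketch; if several communication tasks could run concurrently, MPI_THREAD_MULTIPLE becomes essential, which is the topic of the next exercise.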

Using the thread multiple MPI option

You can find a scaffold for the code in the content/code/day-4/02_threading-multiple folder. A working solution is in the solution subfolder. Try to compile with:

mpicc -g -fopenmp -Wall -std=c11 threading-multiple.c -o threading-multiple
  1. When you have the code compiled, try to run with:

    OMP_NUM_THREADS=2 mpiexec -np 2 ./threading-multiple
    

  2. Try to fix the code so that it compiles, runs, and reports success. The solution is more involved than in the funneled case, but you can see the kind of approach that can work, and the complexity it entails. Do this only when you really need to!
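For a rough idea of what full thread-multiple support enables (an illustrative sketch, not the exercise code): the example below pairs ranks up and lets every OpenMP thread make its own MPI call concurrently, using the thread number as the message tag. It assumes the paired ranks run with the same OMP_NUM_THREADS.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_THREAD_MULTIPLE: any thread may call MPI at any time. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
    {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int partner = rank ^ 1; /* pair ranks 0-1, 2-3, ... */

    if (partner < size)
    {
        /* Every thread exchanges its own message with the same thread
         * on the partner rank, using the thread number as the tag.
         * Assumes both ranks in a pair run the same number of threads. */
#pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            int sendval = 100 * rank + tid;
            int recvval = -1;
            MPI_Sendrecv(&sendval, 1, MPI_INT, partner, tid,
                         &recvval, 1, MPI_INT, partner, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d thread %d received %d\n", rank, tid, recvval);
        }
    }

    MPI_Finalize();
    return 0;
}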

Setting the proper thread affinity

Setting the affinity, or preferred location, of threads in the hardware is crucial for the performance of hybrid MPI+OpenMP applications, especially on modern architectures that are composed of several non-uniform memory access (NUMA) nodes.

../_images/kebnekaise.png

The Kebnekaise architecture contains two NUMA nodes with 14 cores each. Several levels of cache (L1, L2, and L3) can also be seen in this architecture.

In addition to the physical cores (28 per node on Kebnekaise), logical cores (hardware threads) may be available on your system, but this feature is usually turned off on HPC systems. In the case of Kebnekaise, only one thread can run on each physical core:

$ lscpu | grep -i 'core\|thread\|Socket'

Thread(s) per core:              1
Core(s) per socket:              14
Socket(s):                       2

Without specifying the location of threads, the OS decides where the threads are placed. Binding of OpenMP threads can be controlled with the environment variables OMP_PROC_BIND and OMP_PLACES.

There are several programs available that allow you to see the binding scheme that is being used, for instance the xthi.c program from HPE (Cray) cited at the bottom of this page.

Exercise

Download the xthi.c code and compile it with:

mpicc -fopenmp -Wall -std=c11 xthi.c -o xthi_exe
  1. When you have the code compiled, try to run with:

    export OMP_NUM_THREADS=(Nr. threads)
    export OMP_DISPLAY_ENV=true
    mpiexec -np (Nr. MPI ranks) ./xthi_exe | sort -n -k 4 -k 6
    
  2. You should set the number of threads and MPI ranks so that their product does not exceed the number of physical cores in your system. The variable OMP_DISPLAY_ENV can be used to see the values of the OpenMP environment variables.

Exercise

  1. Export the variables for binding affinity and run the xthi.c code:

    export OMP_NUM_THREADS=(Nr. threads)
    export OMP_DISPLAY_ENV=true
    export OMP_PROC_BIND=close
    export OMP_PLACES=cores
    mpiexec -np (Nr. MPI ranks) ./xthi_exe | sort -n -k 4 -k 6
    
  2. You should set the number of threads and MPI ranks so that their product does not exceed the number of physical cores in your system. The variable OMP_DISPLAY_ENV can be used to see the values of the OpenMP environment variables.

  3. Compare this output with that of the previous exercise. Where are the threads placed?

Tips for implementing hybrid MPI+OpenMP

  • Demonstrate that you need more scaling to solve the problem.

  • Know why you’re adding hybrid parallelism: is it to access more memory, to improve performance, to reduce communication, or a combination?

  • Estimate how much improvement is available, based on existing performance measurements, e.g. profiling to find bottlenecks. If you don’t know how, learn. Access to quality tools at HPC clusters is worth it!

  • Are your external libraries using threading? How should you manage them?

  • To get a good result, you typically have to introduce effective OpenMP parallelism into around 90% of the execution time; otherwise the remaining serial fraction limits the achievable speed-up (Amdahl’s law).

  • Start with master-only or funneled style. Migrate later if measurements suggest it.

  • Initialize data structures inside OpenMP parallel regions, to take advantage of the “first-touch” allocation policy on NUMA nodes (see the sketch after this list).

  • Make use of OpenMP’s conditional compilation features to ensure that the application can still be built without OpenMP.

  • If the application makes use of derived datatypes to pack/unpack noncontiguous data, consider replacing these with user-level pack/unpack routines which can be parallelised with OpenMP.

  • Learn about and use the OpenMP environment variables well

  • Learn how to use the MPI launcher to place the ranks and their threads well. This is different for different applications.
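To illustrate the first-touch tip above, here is a standalone OpenMP-only sketch (array sizes and names are arbitrary). On a NUMA system a memory page is typically placed in the NUMA domain of the thread that first writes to it, so initializing arrays with the same loop structure and schedule as the later compute loops keeps each thread's data local, provided the threads are pinned as discussed in the affinity section.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 22; /* arbitrary array size */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (a == NULL || b == NULL)
        return 1;

    /* First touch: each thread writes the part of the arrays it will
     * later compute on, so (with threads pinned via OMP_PROC_BIND and
     * OMP_PLACES) those pages end up in the thread's NUMA domain. */
#pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
    {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* The compute loop uses the same static schedule, so each thread
     * works mostly on data that is local to its NUMA domain. */
#pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] += 2.0 * b[i];

    printf("a[0] = %g\n", a[0]);
    free(a);
    free(b);
    return 0;
}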

See also

Keypoints

  • Fork-join parallelism with MPI_THREAD_FUNNELED is a cheap way to get improvements, but the benefit is limited

  • More complex multi-threading can do a better job of overlapping communication and computation