Multiple GPU programming with MPI

Questions

What approach should be adopted to extend the synchronous OpenACC and OpenMP offloading models to utilise multiple GPUs across multiple nodes?

Objectives

To learn about combining MPI with either OpenACC or OpenMP offloading models.
To learn about implementing GPU-awareness MPI approach.

Instructor note

30 min teaching
30 min exercises

Introduction

Exploring multiple GPUs (Graphics Processing Units) across distributed nodes offers the potential to fully leveraging the capacity of modern HPC (High-Performance Computing) systems at a large scale. Here one of the approaches to accelerate computing on distributed systems is to combine MPI (Message Passing Interface) with a GPU programming model such as OpenACC and OpenMP application programming interfaces (APIs). This combination is motivated by both the simplicity of these APIs, and the widespread use of MPI.

In this guide we provide readers, who are familiar with MPI, with insighits on implementing a hybrid model in which the MPI communication framework is combined with either OpenACC or OpenMP APIs. A special focus will be on performing point-to-point (e.g. MPI_Send and MPI_Recv) and collective operations (e.g. MPI_Allreduce) from OpenACC and OpenMP APIs. Here we address two scenarios: (i) a scenario in which MPI operations are performed in the CPU-host followed by an offload to the GPU-device; and (ii) a scenario in which MPI operations are performed between a pair of GPUs without involving the CPU-host memory. The latter scenario is referred to as GPU-awareness MPI, and has the advantage of reducing the computing time caused by transferring data via the host-memory during heterogenous communications, thus rendering HPC applications efficient.

This guide is organized as follows: we first introduce how to assign each MPI rank to a GPU device within the same node. We consider a situation in which the host and the device have a distinct memory. This is followed by a presentation on the hybrid MPI-OpenACC/OpenMP offloading with and without the GPU-awareness MPI. Exercises to help understanding these concepts are provided at the end.

Assigning MPI-ranks to GPU-devices

Accelerating MPI applications to utilise multiple GPUs on distributed nodes requires as a first step assigning each MPI rank to a GPU device, such that two MPI ranks do not use the same GPU device. This is necessarily in order to prevent the application from a potential crash. This is because GPUs are designed to handle multiple threading tasks, but not multiple MPI ranks.

One of the way to ensure that two MPI ranks do not use the same GPU, is to determine which MPI processes run on the same node, such that each process can be assigned to a GPU device within the same node. This can be done, for instance, by splitting the world communicator into sub-groups of communicators (or sub-communicators) using the routine MPI_COMM_SPLIT_TYPE().

    ! Split the world communicator into subgroups of commu, each of which
    ! contains processes that run on the same node, and which can create a
    ! shared memory region (via the type MPI_COMM_TYPE_SHARED).
    ! The call returns a new communicator "host_comm", which is created by
    ! each subgroup.        
    call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,&
                            MPI_INFO_NULL, host_comm,ierr)
    call MPI_COMM_RANK(host_comm, host_rank,ierr)

Here, the size of each sub-communicator corresponds to the number of GPUs per node (which is also the number of tasks per node), and each sub-communicator contains a list of processes indicated by a rank. These processes have a shared-memory region defined by the argument MPI_COMM_TYPE_SHARED (see the MPI report) for more details). Calling the routine MPI_COMM_SPLIT_TYPE() returns a sub-communicator labelled in the code above ”host_comm”, and in which MPI-ranks are ranked from 0 to number of processes per node -1. These MPI ranks are in turn assigned to different GPU devices within the same node. This procedure is done according to which directive-based model is implemented. The retrieved MPI ranks are then stored in the variable myDevice. The variable is passed to an OpenACC or OpenMP routine as indicated in the code below.

Example: Assign device

    myDevice = host_rank

    ! Sets the device number and the device type to be used
    call acc_set_device_num(myDevice, acc_get_device_type())

    ! Returns the number of devices available on the host
    numDevice = acc_get_num_devices(acc_get_device_type())

    myDevice = host_rank

    ! Sets the device number to use in device constructs by setting the initial value of the default-device-var 
    call omp_set_default_device(myDevice)

    ! Returns the number of devices available for offloading.
    numDevice = omp_get_num_devices()

Another useful function for retrieving the device number of a specific device, which is useful, e.g., to map data to a specific device is

acc_get_device_num()

omp_get_device_num()

The syntax of assigning MPI ranks to GPU devices is summarised below

Example: Set device

    ! Initialise MPI communication      
    call MPI_Init(ierr)
    ! Identify the ID rank (process)        
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    ! Get number of active processes (from 0 to nproc-1)        
    call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )

    ! Split the world communicator into subgroups of commu, each of which
    ! contains processes that run on the same node, and which can create a
    ! shared memory region (via the type MPI_COMM_TYPE_SHARED).
    ! The call returns a new communicator "host_comm", which is created by
    ! each subgroup.        
    call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,&
                            MPI_INFO_NULL, host_comm,ierr)
    call MPI_COMM_RANK(host_comm, host_rank,ierr)

    ! Gets the node name
    call MPI_GET_PROCESSOR_NAME(name, resulten, ierror)

    myDevice = host_rank

    ! Sets the device number and the device type to be used
    call acc_set_device_num(myDevice, acc_get_device_type())

    ! Returns the number of devices available on the host
    numDevice = acc_get_num_devices(acc_get_device_type())

    ! Initialise MPI communication      
    call MPI_Init(ierr)
    ! Identify the ID rank (process)        
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    ! Get number of active processes (from 0 to nproc-1)        
    call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )

    ! Split the world communicator into subgroups of commu, each of which
    ! contains processes that run on the same node, and which can create a
    ! shared memory region (via the type MPI_COMM_TYPE_SHARED).
    ! The call returns a new communicator "host_comm", which is created by
    ! each subgroup.        
    call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,&
                            MPI_INFO_NULL, host_comm,ierr)
    call MPI_COMM_RANK(host_comm, host_rank,ierr)

    ! Gets the node name
    call MPI_GET_PROCESSOR_NAME(name, resulten, ierror)

    myDevice = host_rank

    ! Sets the device number to use in device constructs by setting the initial value of the default-device-var 
    call omp_set_default_device(myDevice)

    ! Returns the number of devices available for offloading.
    numDevice = omp_get_num_devices()

Hybrid MPI-OpenACC/OpenMP without GPU-awareness approach

After covering how to assign each MPI-rank to a GPU device, we now address the concept of combining MPI with either OpenACC or OpenMP offloading. In this approach, calling an MPI routine from an OpenACC or OpenMP API requires updating the data in the CPU host before and after an MPI call. In this scenario, the data is copied back and forth between the host and the device before and after each MPI call. In the hybrid MPI-OpenACC model, the procedure is defined by specifying the directive update host() for copying the data from the device to the host before an MPI call; and by the directive update device() specified after an MPI call for copying the data back to the device. Similarly in the hybrid MPI-OpenMP. Here, updating the data in the host can be done by specifying the OpenMP directives update device() from() and update device() to(), respectively, for copying the data from the device to the host and back to the device.

To illustrate the concept of the hybrid MPI-OpenACC/OpenMP, we show below an example of an implementation that involves the MPI functions MPI_Send() and MPI_Recv().

Example: Update host/device directives

    !offload f to GPUs
    !$acc enter data copyin(f)

    !update f: copy f from GPU to CPU
    !$acc update host(f)

    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif

    !update f: copy f from CPU to GPU
    !$acc update device(f)

    !offload f to GPUs
    !$omp target enter data device(myDevice) map(to:f)

    !update f: copy f from GPU to CPU
    !$omp target update device(myDevice) from(f)

    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif

    !update f: copy f from CPU to GPU
    !$omp target update device(myDevice) to(f)

Here we present a code example that combines MPI with OpenACC/OpenMP API.

Example: Update host/device directives

    call MPI_Scatter(f_send,np,MPI_DOUBLE_PRECISION,f, np,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD, ierr)

    !offload f to GPUs
    !$acc enter data copyin(f)

    !update f: copy f from GPU to CPU
    !$acc update host(f)

    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif

    !update f: copy f from CPU to GPU
    !$acc update device(f)

    !do something .e.g.
    !$acc kernels
    f = f/2.
    !$acc end kernels

    SumToT=0d0
    !$acc parallel loop reduction(+:SumToT)
    do k=1,np
        SumToT = SumToT + f(k)
    enddo
    !$acc end parallel loop

    !SumToT is by default copied back to the CPU
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )

    !$acc exit data delete(f)

    call MPI_Scatter(f_send,np,MPI_DOUBLE_PRECISION,f, np,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD, ierr)

    !offload f to GPUs
    !$omp target enter data device(myDevice) map(to:f)

    !update f: copy f from GPU to CPU
    !$omp target update device(myDevice) from(f)

    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif

    !update f: copy f from CPU to GPU
    !$omp target update device(myDevice) to(f)

    !do something .e.g.
    !$omp target teams distribute parallel do
    do k=1,np
        f(k) = f(k)/2.
    enddo
    !$omp end target teams distribute parallel do

    SumToT=0d0
    !$omp target teams distribute parallel do reduction(+:SumToT)
    do k=1,np
        SumToT = SumToT + f(k)
    enddo
    !$omp end target teams distribute parallel do  

    !SumToT is by default copied back to the CPU
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )

    !$omp target exit data map(delete:f)

Despite the simplicity of implementing the hybrid MPI-OpenACC/OpenMP offloading, it suffers from a low performance caused by an explicit transfer of data between the host and the device before and after calling an MPI routine. This constitutes a bottleneck in GPU-programming. To improve the performance affected by the host staging during the data transfer, one can implement the GPU-awareness MPI approach as described in the following section.

Hybrid MPI-OpenACC/OpenMP with GPU-awareness approach

The concept of the GPU-aware MPI enables an MPI library to directly access the GPU-device memory without necessarily using the CPU-host memory as an intermediate buffer (see e.g. here). This offers the benefit of transferring data from one GPU to another GPU without the involvement of the CPU-host memory.

To be specific, in the GPU-awareness approach, the device pointers point to the data allocated in the GPU memory space (data should be present in the GPU device). Here, the pointers are passed as arguments to an MPI routine that is supported by the GPU memory. As MPI routines can directly access GPU memory, it offers the possibility of communicating between pairs of GPUs without transferring data back to the host.

In the hybrid MPI-OpenACC model, the concept is defined by combining the directive host_data together with the clause use_device(list_array). This combination enables the access to the arrays listed in the clause use_device(list_array) from the host (see here). The list of arrays, which are already present in the GPU-device memory, are directly passed to an MPI routine without a need of a staging host-memory for copying the data. Note that for initially copying data to GPU, we use unstructured data blocks characterized by the directives enter data and exit data. The unstructured data has the advantage of allowing to allocate and deallocate arrays within a data region.

To illustarte the concept of the GPU-awareness MPI, we show below two examples that make use of point-to-point and collective operations from OpenACC and OpenMP APIs. In the first code example, the device pointer f is passed to the MPI functions MPI_Send() and MP_Recv(); and in the second one, the pointer SumToT is passed to the MPI function MPI_Allreduce. Here, the MPI operations MPI_Send and MPI_Recv as well as MPI_Allreduce are performed between a pair of GPUs without passing through the CPU-host memory.

Example: GPU-awareness: MPI_Send & MPI_Recv

    !Device pointer f will be passed to MPI_send & MPI_recv
    !$acc host_data use_device(f)
    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif
    !$acc end host_data

    !Device pointer f will be passed to MPI_send & MPI_recv
    !$omp target data use_device_ptr(f)
    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif
    !$omp end target data

Example: GPU-awareness: MPI_Allreduce

    !$acc data copy(SumToT)
    !$acc host_data use_device(SumToT)
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )
    !$acc end host_data
    !$acc end data

    !$omp target enter data device(myDevice) map(to:SumToT)
    !$omp target data use_device_ptr(SumToT)
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )
    !$omp end target data
    !$omp target exit data map(from:SumToT)

We provide below a code example that illustrates the implementation of the MPI functions MPI_Send(), MPI_Recv() and MPI_Allreduce() within an OpenACC/OpenMP API. This implementation is specifically designed to support GPU-aware MPI operations.

Example: GPU-awareness approach

    call MPI_Scatter(f_send,np,MPI_DOUBLE_PRECISION,f, np,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD, ierr)

    !offload f to GPUs
    !$acc enter data copyin(f)

    !Device pointer f will be passed to MPI_send & MPI_recv
    !$acc host_data use_device(f)
    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif
    !$acc end host_data

    !do something .e.g.
    !$acc kernels
    f = f/2.
    !$acc end kernels

    SumToT=0d0
    !$acc parallel loop reduction(+:SumToT)
    do k=1,np
        SumToT = SumToT + f(k)
    enddo
    !$acc end parallel loop

    !SumToT is by default copied back to the CPU

    !$acc data copy(SumToT)
    !$acc host_data use_device(SumToT)
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )
    !$acc end host_data
    !$acc end data

    !$acc exit data delete(f)

    call MPI_Scatter(f_send,np,MPI_DOUBLE_PRECISION,f, np,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD, ierr)

    !offload f to GPUs
    !$omp target enter data device(myDevice) map(to:f)

    !Device pointer f will be passed to MPI_send & MPI_recv
    !$omp target data use_device_ptr(f)
    if(myid.lt.nproc-1) then
        call MPI_Send(f(np:np),1,MPI_DOUBLE_PRECISION,myid+1,tag,MPI_COMM_WORLD, ierr)
    endif

    if(myid.gt.0) then
        call MPI_Recv(f(1),1,MPI_DOUBLE_PRECISION,myid-1,tag,MPI_COMM_WORLD, status,ierr)
    endif
    !$omp end target data

    !do something .e.g.
    !$omp target teams distribute parallel do
    do k=1,np
        f(k) = f(k)/2.
    enddo
    !$omp end target teams distribute parallel do

    SumToT=0d0
    !$omp target teams distribute parallel do reduction(+:SumToT)
    do k=1,np
        SumToT = SumToT + f(k)
    enddo
    !$omp end target teams distribute parallel do  

    !SumToT is by default copied back to the CPU

    !$omp target enter data device(myDevice) map(to:SumToT)
    !$omp target data use_device_ptr(SumToT)
    call MPI_ALLREDUCE(MPI_IN_PLACE,SumToT,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr )
    !$omp end target data
    !$omp target exit data map(from:SumToT)

    !$omp target exit data map(delete:f)

The GPU-aware MPI with OpenACC/OpenMP APIs has the capability of directly communicating between a pair of GPUs within a single node. However, performing the GPU-to-GPU communication across multiple nodes requires the the GPUDirect RDMA (Remote Direct Memory Access) technology. This technology can further improve performance by reducing latency.

Compilation process

The compilation process of the hybrid MPI-OpenACC and MPI-OpenMP offloading is described below. This description is given for a Cray compiler of the wrapper ftn. On LUMI-G, the following modules may be necessary before compiling (see the LUMI documentation for further details about the available programming environments):

ml CrayEnv
ml PrgEnv-cray
ml cray-mpich
ml rocm
ml craype-accel-amd-gfx90a

Example: Compilation process

$ ftn -hacc -o mycode.mpiacc.exe mycode_mpiacc.f90

Here, the flags hacc and homp enable the OpenACC and OpenMP directives in the hybrid MPI-OpenACC and MPI-OpenMP applications, respectively.

Enabling GPU-aware support

To enable the GPU-aware support in MPICH library, one needs to set the following environment variable before running the application.

$ export MPICH_GPU_SUPPORT_ENABLED=1

Conclusion

In conclusion, we have presented an overview of a GPU-hybrid programming by integrating GPU-directive models, specifically OpenACC and OpenMP APIs, with the MPI library. The approach adopted here allows us to utilise multiple GPU-devices not only within a single node but it extends to distributed nodes. In particular, we have addressed GPU-aware MPI approach, which has the advantage of enabling a direct interaction between an MPI library and a GPU-device memory. In other words, it permits performing MPI operations between a pair of GPUs, thus reducing the computing time caused by the data locality.

Exercises

We consider an MPI fortran code that solves a 2D-Laplace equation, and which is partially accelerated. The focus of the exercises is to complete the acceleration using either OpenACC or OpenMP API by following these steps.

Access exercise material

Code examples for the exercises below can be accessed in the content/examples/exercise_multipleGPU subdirectory of this repository. To access them, you need to clone the repository:

$ git clone https://github.com/ENCCS/gpu-programming.git
$ cd gpu-programming/content/examples/exercise_multipleGPU
$ ls

Exercise I: Set a GPU device

Implement OpenACC/OpenMP functions that enable assigning each MPI rank to a GPU device.

1.1 Compile and run the code on multiple GPUs.

Exercise II: Apply traditional MPI-OpenACC/OpenMP

2.1 Incoporate the OpenACC directives *update host()* and *update device()* before and after calling an MPI function, respectively.

Note

The OpenACC directive *update host()* is used to transfer data from GPU to CPU within a data region; while the directive *update device()* is used to transfer the data from CPU to GPU.

2.2 Incorporate the OpenMP directives *update device() from()* and *update device() to()* before and after calling an MPI function, respectively.

Note

The OpenMP directive *update device() from()* is used to transfer data from GPU to CPU within a data region; while the directive *update device() to()* is used to transfer the data from CPU to GPU.

2.3 Compile and run the code on multiple GPUs.

Exercise III: Implement GPU-aware support

3.1 Incorporate the OpenACC directive *host_data use_device()* to pass a device pointer to an MPI function.

3.2 Incorporate the OpenMP directive *data use_device_ptr()* to pass a device pointer to an MPI function.

3.3 Compile and run the code on multiple GPUs.

Exercise IV: Evaluate the performance

Evaluate the execution time of the accelerated codes in the exercises II and III, and compare it with that of a pure MPI implementation.