Directive-based models

Questions

  • What are OpenACC and OpenMP offloading?

  • How to write GPU code using directives?

Objectives

  • Understand the process of offloading

  • Understand the differences between OpenACC and OpenMP offloading

  • Understand the various levels of parallelism on a GPU

  • Understand what data movement is

Instructor note

  • 40 min teaching

  • 40 min exercises

The most common directive-based models for GPU parallel programming are OpenMP offloading and OpenACC. The parallelization is done by introducing directives in places which are targeted for parallelization.

  • OpenACC is known to be more descriptive: the programmer uses directives to describe where to parallelize the code and which data to move, leaving more of the decisions about how to do it to the compiler.

  • The OpenMP offloading approach, on the other hand, is known to be more prescriptive: the programmer uses directives to tell the compiler more explicitly how and where to parallelize the code, instead of letting the compiler decide.

In OpenMP/OpenACC, the compiler directives are specified with #pragma in C/C++ or as special comments identified by unique sentinels in Fortran. Compilers can ignore the directives if support for OpenMP/OpenACC is not enabled.

The compiler directives are used for various purposes: for thread creation, workload distribution (work sharing), data-environment management, serializing sections of code or for synchronization of work among the threads.

Execution model

OpenMP and OpenACC use the fork-join model of parallel execution. The program begins as a single thread of execution, the master thread. Everything is executed sequentially until the first parallel region construct is encountered.

[Figure: fork-join model — a master thread forks a team of threads at each parallel region and joins them at its end]

When a parallel region is encountered, the master thread creates a group of threads, becomes the master of this group, and is assigned thread index 0 within the group. There is an implicit barrier at the end of each parallel region.

Offloading Directives

OpenACC

In OpenACC, one of the most commonly used directives is kernels, which defines a region to be transformed into a series of kernels executed in sequence on a GPU. Work sharing is defined automatically for the separate kernels, but tuning prospects are limited.

Example: kernels

#include <stdio.h>
#include <openacc.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX], vecC[NX];
    int i;

    /* Initialization of the vectors */
    for (i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 2.0;
    }

    #pragma acc kernels
    for (i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}

The other OpenACC approach to defining parallel regions is the parallel directive. In contrast to the kernels directive, the parallel directive is more explicit and requires more analysis by the programmer. Work sharing has to be defined manually using the loop directive, but finer tuning is possible. The above example can be rewritten as follows:

Example: parallel loop

#include <stdio.h>
#include <openacc.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX], vecC[NX];
    int i;

    /* Initialization of the vectors */
    for (i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 2.0;
    }

    #pragma acc parallel loop
    for (i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}

Sometimes we can obtain a little more performance by guiding the compiler to make specific choices. OpenACC has four levels of parallelism for offloading execution:

  • gang (coarse grain): the iterations are distributed among the gangs

  • worker (fine grain): worker threads are activated within the gangs and the iterations are shared among them

  • vector: each worker activates its threads working in SIMT fashion and the work is shared among the threads

  • seq: the iterations are executed sequentially

Note

By default, gang, worker and vector parallelism are automatically decided and applied by the compiler.

The programmer can add clauses such as num_gangs, num_workers and vector_length on the parallel region to specify the number of gangs, workers and the vector length (see the sketch after this note).

The optimal numbers are, however, highly dependent on the GPU architecture and the compiler implementation.

There is no thread synchronization at the gang level, which means there may be a risk of race conditions.
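
As a sketch, the vector addition above could be annotated with explicit levels of parallelism as shown below. The clause values are arbitrary placeholders for illustration only, not tuned for any particular GPU.

#include <stdio.h>
#include <openacc.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX], vecC[NX];
    int i;

    /* Initialization of the vectors */
    for (i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 2.0;
    }

    /* The numbers below are illustrative placeholders, not tuned values */
    #pragma acc parallel loop gang worker vector \
            num_gangs(200) num_workers(8) vector_length(64)
    for (i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}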

OpenMP Offloading

With OpenMP, the target directive is used for device offloading.

Example: target construct

#include <stdio.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX], vecC[NX];
    int i;

    /* Initialization of the vectors */
    for (i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 2.0;
    }

    #pragma omp target
    for (i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}

Compared to OpenACC's kernels directive, the target directive alone will not parallelise the underlying loop at all. To achieve proper parallelisation, one needs to be more prescriptive and specify what one wants. OpenMP offloading offers multiple levels of parallelism as well:

  • teams (coarse grain): creates a league of teams with one master thread in each team, but there is no work sharing among the teams

  • distribute: distributes the iterations across the master threads of the teams, but there is no work sharing among the threads within one team

  • parallel do/for (fine grain): threads are activated within one team and the work is shared among them

  • simd: similar to the vector directive in OpenACC

Note

The programmer can add clauses such as num_teams and thread_limit to specify the number of teams and the number of threads within a team (see the sketch after this note).

Threads within a team can synchronize, but there is no synchronization among the teams.

Since OpenMP 5.0, there is a new loop directive available, which has similar functionality to the corresponding directive in OpenACC.
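
As a sketch, the vector addition can be offloaded with all levels of parallelism combined into a single construct. Again, the clause values are arbitrary and for illustration only.

#include <stdio.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX], vecC[NX];
    int i;

    /* Initialization of the vectors */
    for (i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 2.0;
    }

    /* The numbers below are illustrative placeholders, not tuned values */
    #pragma omp target teams distribute parallel for simd \
            num_teams(200) thread_limit(256)
    for (i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}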


Exercise: Change the levels of parallelism

In this exercise we would like to change the levels of parallelism using clauses. First compile and run one of the examples to find out the default number of blocks and threads set by the compiler at runtime. To make a change, add clauses such as num_gangs, num_workers, vector_length for OpenACC, and num_teams, thread_limit for OpenMP offloading.

Remember to set the environment variable by executing export CRAY_ACC_DEBUG=2 before running.

How to compile and run the code interactively:

salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1

module load LUMI/24.03
module load partition/G
module load rocm/6.0.3

# OpenMP
cc -O2 -fopenmp -o ex1 ex1.c
# Only OpenACC Fortran is supported by HPE compiler.

export CRAY_ACC_DEBUG=2
srun ./ex1

Example of a trivially parallelizable vector addition problem:

#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void){
    double vecA[NX], vecB[NX], vecC[NX];

    /* Initialize vectors */
    for (int i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 1.0;
    }

    /* The combined construct must be followed directly by the loop */
    #pragma omp target teams distribute parallel for simd
    for (int i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    return 0;
}

Keypoints

Mapping between OpenACC/OpenMP directives and GPU (HPE implementation)

  NVIDIA         AMD          Fortran OpenACC/OpenMP   C/C++ OpenMP
  Thread block   Work group   gang/teams               teams
  Warp           Wavefront    worker/simd              parallel for simd
  Thread         Work item    vector/simd              parallel for simd

  • Each compiler supports different levels of parallelism

  • The size of gang/team/worker/vector_length can be chosen arbitrarily by the user but there are limits defined by the implementation.

  • The maximum thread/grid/block size can be found via rocminfo/nvaccelinfo

Data Movement

Due to the distinct memory spaces on host and device, transferring data becomes inevitable. New directives are needed to specify how variables are transferred from the host to the device data environment. The commonly transferred items are arrays (and array sections), scalars, pointers, and structure elements. The data clauses used for data movement are summarised in the following table:

  OpenMP             OpenACC          Description
  map(to:list)       copyin(list)     On entering the region, the variables in the list are initialized on the device using the original values from the host.
  map(from:list)     copyout(list)    At the end of the target region, the values of the variables in the list are copied back into the original variables on the host. On entering the region, the variables on the device are not initialized.
  map(tofrom:list)   copy(list)       The effect of both a map-to and a map-from.
  map(alloc:list)    create(list)     On entering the region, the data is allocated on the device but not initialized.
  map(delete:list)   delete(list)     Delete the data on the device.

Note

When mapping data arrays or pointers, be careful about the array section notation:
  • In C/C++: array[lower-bound:length]. The notation :N is equivalent to 0:N.

  • In Fortran: array(lower-bound:upper-bound). The notation :N is equivalent to 1:N.
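
A minimal sketch of the C/C++ section notation: here only the first half of the array is mapped to the device, using the [start:length] form.

#include <stdio.h>

#define NX 102400

int main(void)
{
    double vec[NX];

    for (int i = 0; i < NX; i++) {
        vec[i] = 1.0;
    }

    /* Map only the first half of the array: [start:length] in C/C++ */
    #pragma omp target teams distribute parallel for map(tofrom: vec[0:NX/2])
    for (int i = 0; i < NX / 2; i++) {
        vec[i] = vec[i] + 1.0;
    }

    printf("vec[0] = %f, vec[NX-1] = %f\n", vec[0], vec[NX - 1]);
    return 0;
}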

Data region

A data clause combined with a data directive marks the start of a data region. Based on how the directives create storage, transfer data, and remove storage on the device, data regions fall into two categories: structured data regions and unstructured data regions.

Structured Data Region

A structured data region is convenient for providing persistent data on the device which can be used by subsequent GPU compute constructs.

Syntax for structured data region

#pragma omp target data [clauses]
{structured-block}
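
A minimal sketch of how such a region could be used: the arrays stay resident on the device across the two compute constructs inside the region. The mapping choices here are just one possibility.

#include <stdio.h>

#define NX 102400

int main(void)
{
    double vecA[NX], vecB[NX];

    for (int i = 0; i < NX; i++) {
        vecA[i] = 1.0;
    }

    /* vecA and vecB stay resident on the device for both loops */
    #pragma omp target data map(to: vecA[0:NX]) map(from: vecB[0:NX])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < NX; i++) {
            vecB[i] = 2.0 * vecA[i];
        }

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < NX; i++) {
            vecB[i] = vecB[i] + 1.0;
        }
    }

    printf("vecB[0] = %f\n", vecB[0]);
    return 0;
}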

Unstructured Data Region

In real applications, however, a structured data region is often inconvenient to use. The unstructured data region offers much more freedom, allowing data to be created and deleted on the device at any appropriate point.

Syntax for unstructured data region

#pragma omp target enter data [clauses]
#pragma omp target exit data [clauses]
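
A minimal sketch of the idea, with allocation and deallocation split across different functions, which a structured region cannot express. The helpers init_on_device and free_on_device are illustrative names, not part of any API.

#include <stdlib.h>

#define NX 102400

/* Illustrative helper: allocate the array on the device */
void init_on_device(double *vec, int n)
{
    #pragma omp target enter data map(alloc: vec[0:n])
}

/* Illustrative helper: remove the array from the device */
void free_on_device(double *vec, int n)
{
    #pragma omp target exit data map(delete: vec[0:n])
}

int main(void)
{
    double *vec = malloc(NX * sizeof(double));

    init_on_device(vec, NX);

    /* vec is already present on the device, so no extra mapping is needed */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < NX; i++) {
        vec[i] = 1.0;
    }

    free_on_device(vec, NX);
    free(vec);
    return 0;
}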

Keypoints

Structured Data Region
  • Start and end points within a single subroutine

  • Memory exists within the data region

Unstructured Data Region
  • Multiple start and end points across different subroutines

  • Memory exists until explicitly deallocated

Update

Sometimes variables need to be synchronized between the host and the device memory, e.g. in order to write out variables on the host for debugging or visualization. The update directive is often used in conjunction with unstructured data regions, and a motion clause must be present to control the direction of the data transfer.

Syntax for update directive

#pragma omp target update [clauses]
motion-clause:
          to (list)
          from (list)

Note

  • The update directive can only be used in host code, since data movement must be initiated from the host, i.e. it may not appear inside a compute region.

  • In OpenACC, the motion clause host has been deprecated and renamed self.
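
A minimal sketch of using update to pull intermediate results back to the host inside a data region, e.g. for debugging output; the printed value is only for illustration.

#include <stdio.h>

#define NX 102400

int main(void)
{
    double vec[NX];

    #pragma omp target data map(alloc: vec[0:NX])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < NX; i++) {
            vec[i] = 1.0;
        }

        /* Copy the device values back to the host, e.g. for debugging */
        #pragma omp target update from(vec[0:NX])
        printf("vec[0] on the host is now %f\n", vec[0]);
    }

    return 0;
}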

Exercise: update

Try to figure out the values of x on the host and on the device at each check point.

#include <stdio.h>

int main(void)
{
    int x = 0;

    #pragma omp target data map(tofrom:x)
    {
        /* check point 1 */
        x = 10;
        /* check point 2 */
        #pragma omp target update to(x)
        /* check point 3 */
    }

    return 0;
}

Exercise: Adding data mapping clauses

Add proper data mapping clauses explicitly to the directive below.

#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void){
    double vecA[NX], vecB[NX], vecC[NX];

    /* Initialize vectors */
    for (int i = 0; i < NX; i++) {
        vecA[i] = 1.0;
        vecB[i] = 1.0;
    }

    /* Add mapping clauses here */
    #pragma omp target teams distribute parallel for simd
    for (int i = 0; i < NX; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }

    double sum = 0.0;
    for (int i = 0; i < NX; i++) {
        sum += vecC[i];
    }
    printf("The sum is: %8.6f \n", sum);

    return 0;
}

Optimize Data Transfers

  • Transfer the data explicitly whenever possible

  • Reduce the amount of data mapping between host and device; get rid of unnecessary data transfers

  • Try to keep the data environment resident on the device as long as possible

Pros of directive-based frameworks

  • Incremental programming

  • Porting of existing software requires less work

  • The same code can easily be compiled for CPU or GPU using a compiler flag

  • Low learning curve; no need to know low-level hardware details

  • Good portability


Keypoints

  • OpenACC and OpenMP offloading enable you to annotate your code with special directives to identify regions to be executed in parallel on a GPU.

  • This saves time compared to lower-level approaches, but you need to be mindful of memory movement.