Offloading to GPU

Objectives

  • Understand how to offload code to a device

  • Understand the different constructs for creating parallelism on a device

Host-device model

Since version 4.0, OpenMP supports heterogeneous systems. OpenMP uses the TARGET construct to offload execution from the host to the target device(s), hence the directive name. In addition, the associated data needs to be transferred to the device(s) as well. Once transferred, the target device owns the data, and accesses by the host during the execution of the target region are forbidden.

A host/device model is generally used by OpenMP for offloading:

  • normally there is a single host, e.g. a CPU

  • there are one or multiple target devices of the same kind, e.g. coprocessors, GPUs, FPGAs, …

  • unless unified shared memory is supported, the host and device have separate memory address spaces

Note

Under the following condition, there will be NO data transfer to the device

  • data already exists on the device from a previous execution

Device execution model

The execution on the device is host-centric:

1. the host creates the data environments on the device(s)

2. the host maps data to the device data environment

3. the host offloads OpenMP target regions to the target device to be executed

4. the host transfers data from the device to the host

5. the host destroys the data environment on the device

TARGET construct

The TARGET construct consists of a target directive and an execution region. It is used to transfer both the control flow from the host to the device and the data between the host and device.

Syntax

#pragma omp target [clauses]
     structured-block
clause:
      if([ target:] scalar-expression)
      device(integer-expression)
      private(list)
      firstprivate(list)
      map([map-type:] list)
      is_device_ptr(list)
      defaultmap(tofrom:scalar)
      nowait
      depend(dependence-type : list)

Exercise00: Hello world with OpenMP offloading

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
  int num_devices = omp_get_num_devices();
  printf("Number of available devices %d\n", num_devices);

  #pragma omp target
  {
      if (omp_is_initial_device()) {
        printf("Running on host\n");
      } else {
        int nteams = omp_get_num_teams();
        int nthreads = omp_get_num_threads();
        printf("Running on device with %d teams in total and %d threads in each team\n", nteams, nthreads);
      }
  }

  return 0;
}

Exercise01: Adding TARGET construct

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX], vecB[NX], vecC[NX];
  double r = 0.2;

  /* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

  /* Dot product of two vectors */
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* Calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

Creating parallelism on the target device

The transfer of control flow to the device by the TARGET construct is sequential and synchronous; this is because OpenMP separates offloading from parallelism. One needs to explicitly create parallel regions on the target device to make efficient use of the device(s).

TEAMS construct

Syntax

#pragma omp teams [clauses]
      structured-block
clause:
num_teams(integer-expression)
thread_limit(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier : list)

The TEAMS construct creates a league of one-thread teams, where the initial thread of each team executes concurrently and belongs to its own contention group. The number of teams created is implementation defined, but is no more than num_teams if that clause is specified. The maximum number of threads participating in the contention group that each team initiates is implementation defined as well, unless thread_limit is specified. Threads within a team can synchronize, but there is no synchronization among teams. The TEAMS construct must be strictly nested inside a TARGET construct, with no other directives, statements or declarations in between.

Note

A contention group is the set of all threads that are descendants of an initial thread. An initial thread is never a descendant of another initial thread.

DISTRIBUTE construct

Syntax

#pragma omp distribute [clauses]
      for-loops
clause:
private(list)
firstprivate(list)
lastprivate(list)
collapse(n)
dist_schedule(kind[, chunk_size])

The DISTRIBUTE construct is a coarse-grained worksharing construct that distributes the loop iterations across the initial (master) threads of the teams; there is no worksharing among the threads within a team. There is no implicit barrier at the end of the construct, and no guarantee about the order in which the teams execute.

To further create threads within each team and distribute loop iterations across those threads, we use the PARALLEL FOR/DO constructs.

PARALLEL construct

Syntax

#pragma omp parallel [clauses]
      structured-block
clause:
num_threads(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier : list)

FOR/DO construct

Syntax

#pragma omp for [clauses]
      for-loops
clause:
private(list)
firstprivate(list)
lastprivate(list)
reduction(reduction-identifier : list)
schedule(kind[, chunk_size])
collapse(n)

Keypoints

TEAMS DISTRIBUTE construct
  • Coarse-grained parallelism

  • Spawns multiple single-thread teams

  • No synchronization of threads in different teams

PARALLEL FOR/DO construct
  • Fine-grained parallelism

  • Spawns many threads in one team

  • Threads can synchronize in a team

Exercise02: Adding constructs for parallelism

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX], vecB[NX], vecC[NX];
  double r = 0.2;

  /* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

  /* Dot product of two vectors */
  #pragma omp target
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* Calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

Exercise03: TEAMS vs PARALLEL constructs

We start from the “hello world” example and add the TEAMS and PARALLEL constructs to compare the differences. Furthermore, we use num_teams and thread_limit to limit the number of teams and threads that are generated.

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
  int num_devices = omp_get_num_devices();
  printf("Number of available devices %d\n", num_devices);

  #pragma omp target
  {
      if (omp_is_initial_device()) {
        printf("Running on host\n");
      } else {
        int nteams = omp_get_num_teams();
        int nthreads = omp_get_num_threads();
        printf("Running on device with %d teams in total and %d threads in each team\n", nteams, nthreads);
      }
  }

  return 0;
}

Composite directive

It is convenient to use the composite construct:

  • the code is more portable

  • the compiler figures out the loop tiling, since each compiler supports different levels of parallelism

  • it is nevertheless possible to reach good performance without composite directives

Syntax

#pragma omp target teams distribute parallel for [clauses]
      for-loops

Exercise: Offloading

We will start from the serial version of the heat diffusion code and, step by step, add the directives for offloading and parallelism on the target device. Compare the performance to understand the effects of the different directives. For now we will focus on the core operation only, i.e. the routine evolve in the file core.cpp or core.F90.

For C/C++, you need to add a data mapping clause: map(currdata[0:(nx+2)*(ny+2)], prevdata[0:(nx+2)*(ny+2)])

step 1: adding the TARGET construct

step 2: adding the TARGET TEAMS construct

step 3: adding the TARGET TEAMS DISTRIBUTE construct

step 4: adding the TARGET TEAMS DISTRIBUTE PARALLEL FOR/DO construct

Use a small number of iterations, e.g. ./heat_serial 800 800 10, otherwise it may take a long time to finish.

The exercise is under /content/exercise/offloading