Offloading to GPU

Objectives

Understand and be able to offload code to a device
Understand the different constructs for creating parallelism on a device

Host-device model

Since version 4.0, OpenMP has supported heterogeneous systems. It uses the TARGET construct to offload execution from the host to one or more target devices, which is where the directive gets its name. The data used in the target region must be transferred to the device as well. Once transferred, the target device owns the data, and the host must not access it while the target region executes.

A host/device model is generally used by OpenMP for offloading:

there is normally a single host, e.g. a CPU

one or more target devices of the same kind, e.g. a coprocessor, GPU, FPGA, …

unless unified shared memory is used, the host and the device have separate memory address spaces

Note

Under the following condition, there will be NO data transfer to the device

data already exists on the device from a previous execution

Device execution model

The execution on the device is host-centric:

1. the host creates the data environments on the device(s)

2. the host maps data to the device data environment

3. the host offloads OpenMP target regions to the target device to be executed

4. the host transfers data from the device to the host

5. the host destroys the data environment on the device
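The host-centric steps can be seen in a single target region. The following is a minimal sketch and not part of the lesson's exercises: the function name `scale_on_device` and its parameters are hypothetical, and the explicit `to` and `from` map types are chosen only to make the two transfer directions visible.

```c
/* Hypothetical helper: out[i] = factor * in[i], computed in a target region. */
void scale_on_device(const double *in, double *out, int n, double factor)
{
  /* On entry to the region the host sets up the device data environment:
     in[0:n] is copied to the device (map type "to"), while out[0:n] is
     only allocated there (map type "from").
     The loop is then offloaded and executed on the device.
     On exit out[0:n] is copied back to the host and the device copies
     are released. */
  #pragma omp target map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; i++) {
    out[i] = factor * in[i];
  }
}
```

When the code is built without OpenMP support the pragma is ignored and the loop simply runs on the host, giving the same result.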

TARGET construct

The TARGET construct consists of a target directive and an execution region. It is used to transfer both the control flow from the host to the device and the data between the host and device.

Syntax

#pragma omp target [clauses]
     structured-block

clause:
      if([ target:] scalar-expression)
      device(integer-expression)
      private(list)
      firstprivate(list)
      map([map-type:] list)
      is_device_ptr(list)
      defaultmap(tofrom:scalar)
      nowait
      depend(dependence-type : list)

!$omp target [clauses]
      structured-block
!$omp end target

clause:
      if([ target:] scalar-expression)
      device(integer-expression)
      private(list)
      firstprivate(list)
      map([map-type:] list)
      is_device_ptr(list)
      defaultmap(tofrom:scalar)
      nowait
      depend(dependence-type : list)
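As an illustration of how several of these clauses combine, here is a small sketch. The function `axpy` and the threshold of 1000 elements are assumptions made for this example only; they do not come from the lesson material.

```c
/* Hypothetical helper: y[i] = y[i] + a * x[i]. */
void axpy(double a, const double *x, double *y, int n)
{
  /* if(target: ...)  offload only when the array is large enough to be
                      worth the transfer; otherwise run on the host.
                      The threshold 1000 is an arbitrary example value.
     firstprivate(a)  each execution of the region gets its own copy of a.
     map(to: ...)     x is only read on the device.
     map(tofrom: ...) y is read and written, so it is copied both ways. */
  #pragma omp target if(target: n >= 1000) firstprivate(a) \
      map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++) {
    y[i] += a * x[i];
  }
}
```

The result is the same whether or not the region is offloaded; only the place of execution changes.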

Exercise00: Hello world with OpenMP offloading

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main() 
{
  int num_devices = omp_get_num_devices();
  printf("Number of available devices %d\n", num_devices);

  #pragma omp target 
  {
      if (omp_is_initial_device()) {
        printf("Running on host\n");    
      } else {
        int nteams= omp_get_num_teams(); 
        int nthreads= omp_get_num_threads();
        printf("Running on device with %d teams in total and %d threads in each team\n",nteams,nthreads);
      }
  }
  
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program hello

#ifdef _OPENMP
  use omp_lib
#endif
  implicit none

  integer :: num_devices,nteams,nthreads
  logical :: initial_device

  num_devices = omp_get_num_devices()
  print *, "Number of available devices", num_devices

  !$omp target map(from: initial_device, nteams, nthreads)
    initial_device = omp_is_initial_device()
    nteams= omp_get_num_teams()
    nthreads= omp_get_num_threads()
  !$omp end target 
    if (initial_device) then
      write(*,*) "Running on host"
    else 
      write(*,'(A,I4,A,I4,A)') "Running on device with ",nteams, " teams in total and ", nthreads, " threads in each team"
    end if

end program

Exercise01: Adding TARGET construct

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX],vecB[NX],vecC[NX];
  double r=0.2;

/* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

/* Dot product of two vectors */
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* Calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program dotproduct
  implicit none

  integer, parameter :: nx = 102400
  real, parameter :: r=0.2

  real, dimension(nx) :: vecA,vecB,vecC
  real    :: sum
  integer :: i

  ! Initialization of vectors
  do i = 1, nx
     vecA(i) = r**(i-1)
     vecB(i) = 1.0
  end do

  ! Dot product of two vectors
  do i = 1, nx
     vecC(i) =  vecA(i) * vecB(i)
  end do

  sum = 0.0
  ! Calculate the sum 
  do i = 1, nx
     sum =  vecC(i) + sum
  end do

  write(*,*) 'The sum is: ', sum

end program dotproduct

Solution

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX],vecB[NX],vecC[NX];
  double r=0.2;

/* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

/* dot product of two vectors */
  #pragma omp target
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program dotproduct
  implicit none

  integer, parameter :: nx = 102400
  real, parameter :: r=0.2

  real, dimension(nx) :: vecA,vecB,vecC
  real    :: sum
  integer :: i

  ! Initialization of vectors
  do i = 1, nx
     vecA(i) = r**(i-1)
     vecB(i) = 1.0
  end do

  ! Dot product of two vectors 
  !$omp target  
  do i = 1, nx
     vecC(i) =  vecA(i) * vecB(i)
  end do
  !$omp end target

  sum = 0.0
  ! Calculate the sum 
  do i = 1, nx
     sum =  vecC(i) + sum
  end do

  write(*,*) 'The sum is: ', sum

end program dotproduct

Creating parallelism on the target device

The TARGET construct only transfers the control flow to the device; execution there is sequential and synchronous, because OpenMP separates offloading from parallelism. To make efficient use of the device(s), one needs to explicitly create parallel regions on the target device.

TEAMS construct

Syntax

#pragma omp teams [clauses]
      structured-block

clause:
num_teams(integer-expression)
thread_limit(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier : list)

!$omp teams [clauses]
        structured-block
!$omp end teams

clause:
num_teams(integer-expression)
thread_limit(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier : list)

The TEAMS construct creates a league of teams, each initially consisting of a single thread. The initial threads of the teams execute the region concurrently, and each is in its own contention group. The number of teams created is implementation defined, but is no more than num_teams if that clause is specified. The maximum number of threads participating in the contention group that each team initiates is also implementation defined, unless thread_limit is specified. Threads within a team can synchronize, but there is no synchronization among teams. The TEAMS construct must be contained in a TARGET construct, with no other directives, statements or declarations in between.

Note

A contention group is the set of all threads that are descendants of an initial thread. An initial thread is never a descendant of another initial thread.
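The effect of num_teams can be checked with a short sketch. The helper name `count_teams` is hypothetical, and the serial stand-in functions exist only so that the example also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption for this sketch): one team, numbered 0. */
static int omp_get_num_teams(void) { return 1; }
static int omp_get_team_num(void) { return 0; }
#endif

/* Hypothetical helper: returns how many teams were actually created
   when `requested` teams are asked for. */
int count_teams(int requested)
{
  int nteams = 0;
  #pragma omp target teams num_teams(requested) map(tofrom: nteams)
  {
    /* This block is executed once per team, by the team's initial thread.
       Only team 0 writes the result, so teams do not race on nteams. */
    if (omp_get_team_num() == 0) {
      nteams = omp_get_num_teams();
    }
  }
  return nteams;
}
```

The returned value is at least 1 and at most `requested`; the exact number depends on the implementation and the device.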

DISTRIBUTE construct

Syntax

#pragma omp distribute [clauses]
      for-loops

clause:
private(list)
firstprivate(list)
lastprivate(list)
collapse(n)
dist_schedule(kind[, chunk_size])

!$omp distribute [clauses]
        do-loops
[!$omp end distribute]

clause:
private(list)
firstprivate(list)
lastprivate(list)
collapse(n)
dist_schedule(kind[, chunk_size])

The DISTRIBUTE construct is a coarse-grained worksharing construct: it distributes the loop iterations across the master threads of the teams, but does no worksharing among the threads within a team. There is no implicit barrier at the end of the construct, and no guarantee about the order in which the teams execute.
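A minimal sketch of DISTRIBUTE inside a TEAMS region follows. The function name `multiply_distribute` is an assumption for this example; the loop body mirrors the element-wise product used in the exercises.

```c
/* Hypothetical helper: c[i] = a[i] * b[i], iterations split across teams. */
void multiply_distribute(const double *a, const double *b, double *c, int n)
{
  /* teams creates the league; distribute hands each team a share of the
     iteration space. No additional threads are created inside a team,
     so each team works through its share with a single thread. */
  #pragma omp target teams distribute map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; i++) {
    c[i] = a[i] * b[i];
  }
}
```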

To further create threads within each team and distribute loop iterations across those threads, we will use the PARALLEL FOR/DO constructs.

PARALLEL construct

Syntax

#pragma omp parallel [clauses]
      structured-block

clause:
num_threads(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier : list)

!$omp parallel [clauses]
        structured-block
!$omp end parallel

clause:
num_threads(integer-expression)
default(private | firstprivate | shared | none)
private(list)
firstprivate(list)
shared(list)
copyin(list)
reduction(reduction-identifier : list)

FOR/DO construct

Syntax

#pragma omp for [clauses]
      for-loops

clause:
private(list)
firstprivate(list)
lastprivate(list)
reduction(reduction-identifier : list)
schedule(kind[, chunk_size])
collapse(n)

!$omp do [clauses]
        do-loops
[!$omp end do]

clause:
private(list)
firstprivate(list)
lastprivate(list)
reduction(reduction-identifier : list)
schedule(kind[, chunk_size])
collapse(n)

Keypoints

TEAMS DISTRIBUTE construct

Coarse-grained parallelism
Spawns multiple single-thread teams
No synchronization of threads in different teams

PARALLEL FOR/DO construct

Fine-grained parallelism
Spawns many threads in one team
Threads can synchronize in a team
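The two levels can also be written as separate directives on a nested loop. This is a sketch under assumptions: the function `scale_matrix` and the row-by-row storage layout are hypothetical and chosen only to show which loop goes to which level.

```c
/* Hypothetical helper: multiplies every element of a rows x cols matrix,
   stored row by row in m, by factor. */
void scale_matrix(double *m, int rows, int cols, double factor)
{
  /* Coarse-grained level: the rows are distributed across the teams. */
  #pragma omp target teams distribute map(tofrom: m[0:rows*cols])
  for (int i = 0; i < rows; i++) {
    /* Fine-grained level: the threads of one team share the columns
       of row i and synchronize at the end of the inner loop. */
    #pragma omp parallel for
    for (int j = 0; j < cols; j++) {
      m[i * cols + j] *= factor;
    }
  }
}
```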

Exercise02: Adding constructs for parallelism

/* Copyright (c) 2019 CSC Training */
// Copyright (c) 2021 ENCCS
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX],vecB[NX],vecC[NX];
  double r=0.2;

/* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

/* dot product of two vectors */
  #pragma omp target
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program dotproduct
  implicit none

  integer, parameter :: nx = 102400
  real, parameter :: r=0.2

  real, dimension(nx) :: vecA,vecB,vecC
  real    :: sum
  integer :: i

  ! Initialization of vectors
  do i = 1, nx
     vecA(i) = r**(i-1)
     vecB(i) = 1.0
  end do

  ! Dot product of two vectors 
  !$omp target  
  do i = 1, nx
     vecC(i) =  vecA(i) * vecB(i)
  end do
  !$omp end target

  sum = 0.0
  ! Calculate the sum 
  do i = 1, nx
     sum =  vecC(i) + sum
  end do

  write(*,*) 'The sum is: ', sum

end program dotproduct

Solution

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#include <math.h>
#define NX 102400

int main(void)
{
  double vecA[NX],vecB[NX],vecC[NX];
  double r=0.2;

/* Initialization of vectors */
  for (int i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

/* dot product of two vectors */
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* calculate the sum */
  for (int i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program dotproduct
  implicit none

  integer, parameter :: nx = 102400
  real, parameter :: r=0.2

  real, dimension(nx) :: vecA,vecB,vecC
  real    :: sum
  integer :: i

  ! Initialization of vectors
  do i = 1, nx
     vecA(i) = r**(i-1)
     vecB(i) = 1.0
  end do

  ! Dot product of two vectors 
  !$omp target teams distribute parallel do 
  do i = 1, nx
     vecC(i) =  vecA(i) * vecB(i)
  end do
  !$omp end target teams distribute parallel do

  sum = 0.0
  ! Calculate the sum 
  do i = 1, nx
     sum =  vecC(i) + sum
  end do

  write(*,*) 'The sum is: ', sum

end program dotproduct

Exercise03: TEAMS vs PARALLEL constructs

We start from the “hello world” example and add the TEAMS and PARALLEL constructs to compare the differences. We then use num_teams and thread_limit to limit the number of teams and threads that are generated.

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main() 
{
  int num_devices = omp_get_num_devices();
  printf("Number of available devices %d\n", num_devices);

  #pragma omp target 
  {
      if (omp_is_initial_device()) {
        printf("Running on host\n");    
      } else {
        int nteams= omp_get_num_teams(); 
        int nthreads= omp_get_num_threads();
        printf("Running on device with %d teams in total and %d threads in each team\n",nteams,nthreads);
      }
  }
  
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program hello

#ifdef _OPENMP
  use omp_lib
#endif
  implicit none

  integer :: num_devices,nteams,nthreads
  logical :: initial_device

  num_devices = omp_get_num_devices()
  print *, "Number of available devices", num_devices

  !$omp target map(from: initial_device, nteams, nthreads)
    initial_device = omp_is_initial_device()
    nteams= omp_get_num_teams()
    nthreads= omp_get_num_threads()
  !$omp end target 
    if (initial_device) then
      write(*,*) "Running on host"
    else 
      write(*,'(A,I4,A,I4,A)') "Running on device with ",nteams, " teams in total and ", nthreads, " threads in each team"
    end if

end program

Solution

/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main() 
{
  int num_devices = omp_get_num_devices();
  printf("Number of available devices %d\n", num_devices);

  #pragma omp target 
  #pragma omp teams num_teams(2) thread_limit(3)
  #pragma omp parallel
  {
      if (omp_is_initial_device()) {
        printf("Running on host\n");    
      } else {
        int nteams= omp_get_num_teams(); 
        int nthreads= omp_get_num_threads();
        printf("Running on device with %d teams in total and %d threads in each team\n",nteams,nthreads);
      }
  }
  
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
program hello

#ifdef _OPENMP
  use omp_lib
#endif
  implicit none

  integer :: num_devices,nteams,nthreads
  logical :: initial_device

  num_devices = omp_get_num_devices()
  print *, "Number of available devices", num_devices

  !$omp target map(from: initial_device, nteams, nthreads)
  !$omp teams num_teams(2) thread_limit(3)
  !$omp parallel
    initial_device = omp_is_initial_device()
    nteams= omp_get_num_teams()
    nthreads= omp_get_num_threads()
  !$omp end parallel 
  !$omp end teams
  !$omp end target 
    if (initial_device) then
      write(*,*) "Running on host"
    else 
      write(*,'(A,I4,A,I4,A)') "Running on device with ",nteams, " teams in total and ", nthreads, " threads in each team"
    end if

end program

Composite directive

It is convenient to use the composite construct:

the code is more portable

the compiler is left to figure out the loop tiling, since each compiler supports different levels of parallelism

it is still possible to reach good performance without composite directives

Syntax

#pragma omp target teams distribute parallel for [clauses]
      for-loops

!$omp target teams distribute parallel do [clauses]
        do-loops
[!$omp end target teams distribute parallel do ]
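Clauses of the individual constructs can be placed on the composite directive. As a sketch, the summation from the earlier exercises could be moved onto the device with a reduction clause; the function name `dot` and its signature are assumptions made for this example.

```c
/* Hypothetical helper: returns the dot product of a[0:n] and b[0:n]. */
double dot(const double *a, const double *b, int n)
{
  double sum = 0.0;
  /* reduction(+:sum) gives every thread a private partial sum and
     combines the partial sums across threads and teams at the end.
     map(tofrom: sum) makes sure the combined value is copied back. */
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(tofrom: sum) reduction(+:sum)
  for (int i = 0; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Without the explicit map of `sum`, a compiler following the OpenMP 4.5 rules may treat the scalar as firstprivate on the device, in which case the reduced value would not reach the host.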

Exercise: Offloading

We will start from the serial version of the heat diffusion code and add the directives for offloading and parallelism on the target device step by step. Compare the performance after each step to understand the effects of the different directives. For now we focus on the core operation only, i.e. the routine evolve in the file core.cpp or core.F90.

For C/C++, you need to add a data mapping clause map(currdata[0:(nx+2)*(ny+2)],prevdata[0:(nx+2)*(ny+2)])

step 1: adding the TARGET construct

step 2: adding the TARGET TEAMS construct

step 3: adding the TARGET TEAMS DISTRIBUTE construct

step 4: adding the TARGET TEAMS DISTRIBUTE PARALLEL FOR/DO construct

Use a small number of iterations, e.g. ./heat_serial 800 800 10, otherwise it may take a long time to finish.

The exercise is under /content/exercise/offloading

Solution

// Copyright (c) 2019 CSC Training
// Copyright (c) 2021 ENCCS
// Main solver routines for heat equation solver

#include "heat.h"

// Update the temperature values using five-point stencil
// Arguments:
//   curr: current temperature values
//   prev: temperature values from previous time step
//   a: diffusivity
//   dt: time step
void evolve(field *curr, field *prev, double a, double dt)
{
  // Help the compiler avoid being confused by the structs
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = curr->nx;
  int ny = curr->ny;

  // Determine the temperature field at next time step
  // As we have fixed boundary conditions, the outermost gridpoints
  // are not updated.
  double dx2 = prev->dx * prev->dx;
  double dy2 = prev->dy * prev->dy;
  #pragma omp target teams distribute parallel for \
  map(currdata[0:(nx+2)*(ny+2)],prevdata[0:(nx+2)*(ny+2)])
  for (int i = 1; i < nx + 1; i++) {
    for (int j = 1; j < ny + 1; j++) {
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = i * (ny + 2) + j + 1;
      int jm = i * (ny + 2) + j - 1;
      currdata[ind] = prevdata[ind] + a*dt*
	    ((prevdata[ip] - 2.0*prevdata[ind] + prevdata[im]) / dx2 +
	     (prevdata[jp] - 2.0*prevdata[ind] + prevdata[jm]) / dy2);
    }
  }
}

! Copyright (c) 2019 CSC Training
! Copyright (c) 2021 ENCCS
! Main solver routines for heat equation solver
module core
  use heat

contains

  ! Update the temperature values using five-point stencil
  ! Arguments:
  !   curr (type(field)): current temperature values
  !   prev (type(field)): temperature values from previous time step
  !   a (real(dp)): diffusivity
  !   dt (real(dp)): time step
  subroutine evolve(curr, prev, a, dt)

    implicit none

    type(field),target, intent(inout) :: curr, prev
    real(dp) :: a, dt
    integer :: i, j, nx, ny
    real(dp) :: dx, dy
    real(dp), pointer, contiguous, dimension(:,:) :: currdata, prevdata

    ! Help the compiler avoid being confused
    nx = curr%nx
    ny = curr%ny
    dx = curr%dx
    dy = curr%dy
    currdata => curr%data
    prevdata => prev%data

    ! Determine the temperature field at next time step As we have
    ! fixed boundary conditions, the outermost gridpoints are not
    ! updated.

    !$omp target teams distribute parallel do
    do j = 1, ny
       do i = 1, nx
          currdata(i, j) = prevdata(i, j) + a * dt * &
               & ((prevdata(i-1, j) - 2.0 * prevdata(i, j) + &
               &   prevdata(i+1, j)) / dx**2 + &
               &  (prevdata(i, j-1) - 2.0 * prevdata(i, j) + &
               &   prevdata(i, j+1)) / dy**2)
       end do
    end do
    !$omp end target teams distribute parallel do
  end subroutine evolve

end module core
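One possible variation to try is the collapse clause, which appears in the clause lists of DISTRIBUTE and FOR/DO. The sketch below is not the course solution: it works on plain arrays rather than the `field` type, and the function name `evolve_collapse` and its argument list are assumptions. Whether it is faster than the version above depends on the compiler and the device.

```c
/* Hypothetical variant of the stencil update on plain arrays.
   curr and prev each hold (nx+2)*(ny+2) values, stored row by row. */
void evolve_collapse(double *curr, const double *prev, int nx, int ny,
                     double a, double dt, double dx2, double dy2)
{
  int size = (nx + 2) * (ny + 2);
  /* collapse(2) merges the i and j loops into a single iteration space
     of nx*ny iterations before it is divided among teams and threads.
     curr is mapped tofrom so that the boundary values, which are not
     updated here, survive the copy back to the host. */
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to: prev[0:size]) map(tofrom: curr[0:size])
  for (int i = 1; i < nx + 1; i++) {
    for (int j = 1; j < ny + 1; j++) {
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = i * (ny + 2) + j + 1;
      int jm = i * (ny + 2) + j - 1;
      curr[ind] = prev[ind] + a * dt *
          ((prev[ip] - 2.0 * prev[ind] + prev[im]) / dx2 +
           (prev[jp] - 2.0 * prev[ind] + prev[jm]) / dy2);
    }
  }
}
```

A uniform temperature field is a simple sanity check: every interior point should keep its value, and the boundary of `curr` should stay untouched.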