Why OpenMP offloading?

Questions

  • When and why should I use OpenMP offloading in my code?

Objectives

  • Understand the shared-memory parallel model

  • Understand program execution model

  • Understand basic constructs

Prerequisites

  1. Basic C or FORTRAN

Computing in parallel

The underlying idea of parallel computing is to split a computational problem into smaller subtasks. Many subtasks can then be solved simultaneously by multiple processing units.

../_images/compp.png

Computing in parallel.

How a problem is split into smaller subtasks depends fully on the problem. There are various paradigms and programming approaches for doing this.

Distributed- vs. Shared-Memory Architecture

Most computing problems are not trivially parallelizable, which means that the subtasks need access, from time to time, to some of the results computed by other subtasks. The way subtasks exchange the needed information depends on the available hardware.

../_images/distributed_vs_shared.png

Distributed- vs shared-memory parallel computing.

In a distributed-memory environment each computing unit operates independently of the others. It has its own memory and cannot access the memory on other nodes. Communication is done via the network, and each computing unit runs a separate copy of the operating system. In a shared-memory machine all computing units have access to the same memory and can read or modify the variables within it.

Processes and threads

The type of environment (distributed- or shared-memory) determines the programming model. Two types of parallelism are possible: process-based and thread-based.

../_images/processes-threads.png

For distributed-memory machines, a process-based parallel programming model is employed. The processes are independent execution units which have their own memory address spaces. They are created when the parallel program is started and are only terminated at the end. Communication between them is done explicitly via message passing, for example with MPI.

On shared-memory architectures it is possible to use thread-based parallelism. Threads are lightweight execution units that can be created and destroyed at a relatively small cost. Threads have their own state information, but they share the same memory address space. When needed, communication is done through the shared memory.

Both approaches have their advantages and disadvantages. Distributed machines are relatively cheap to build and have, in principle, “infinite” capacity: one could keep adding more computing units. In practice, the more computing units are used, the more time-consuming the communication becomes. Shared-memory systems can achieve good performance and the programming model is quite simple. However, they are limited by the memory capacity and the memory access speed. In addition, in the shared-memory model it is much easier to create race conditions.

OpenMP

OpenMP is the de facto standard for thread-based parallelism. It is relatively easy to adopt: the technology suite consists of compiler directives, library routines and environment variables. Parallelization is done by providing “hints” (directives) about the regions of code targeted for parallelization, and the compiler then chooses how best to implement these hints. The compiler directives are special comments in Fortran and pragmas in C/C++. If there is no OpenMP support in the system, they are treated as plain comments and the code works just as any other serial code.

Execution model

The OpenMP API uses the fork-join model of parallel execution. The program begins as a single thread of execution, the master thread, and everything is executed sequentially until the first parallel region construct is encountered.

../_images/threads.png

When a parallel region is encountered, the master thread creates a group of threads, becomes the master of this group, and is assigned the thread index 0 within it. There is an implicit barrier at the end of a parallel region.

OpenMP Memory Model

The OpenMP API supports a relaxed-consistency shared-memory model. The global memory is a shared place where all threads can store and retrieve variables. In addition, each thread has its own temporary view of the memory. This temporary view can represent any kind of intervening structure between the thread and the memory, such as machine registers, cache, or other local storage; it allows the thread to cache variables and to avoid going to memory for every reference to a variable. The temporary view of memory is not necessarily consistent with that of other threads. Finally, each thread has access to a part of memory that cannot be accessed by the other threads, the threadprivate memory.

Inside a parallel region there are two kinds of variable access, shared and private. Each reference to a shared variable in the structured block becomes a reference to the original variable, while for each private variable referenced in the structured block a new version of the original variable is created in memory for each thread. In the case of nested parallel regions, a variable which is private in the outer region can be made shared with respect to the inner parallel region.
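
As a minimal sketch of the difference (the variable names, the results array and the num_threads(4) clause are chosen purely for illustration): a variable declared before the parallel region is shared, while a variable listed in a private clause gets a separate copy in each thread.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int results[8] = {0};  /* shared: declared before the region, visible to all threads */
  int tid;               /* listed as private below: each thread gets its own copy */

#pragma omp parallel private(tid) num_threads(4)
  {
    tid = omp_get_thread_num();   /* a different value in each thread's private copy */
    results[tid] = tid * tid;     /* each thread writes only its own slot of the shared array */
  }

  for (int i = 0; i < 4; i++)
    printf("results[%d] = %d\n", i, results[i]);
  return 0;
}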

OpenMP Directives

In OpenMP the compiler directives are specified by using #pragma in C/C++ or as special comments identified by unique sentinels in Fortran. Compilers can ignore the OpenMP directives if support for OpenMP is not enabled.

The general form of an OpenMP directive in C/C++ is:

#pragma omp directive [clauses]

Parallel regions

The compiler directives are used for various purposes: thread creation, workload distribution (work sharing), data-environment management, serializing sections of code, or synchronization of work among the threads. Parallel regions are created using the parallel construct. When this construct is encountered, additional threads are forked to carry out the work enclosed in it.

../_images/omp-parallel.png

Outside of a parallel region there is only one thread, while inside there are N threads.

All threads inside the construct execute the same code; there is no work sharing yet.

#include <stdio.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
  {
    /* every thread in the team prints its own line */
    printf("Hello world!\n");
  }
  return 0;
}

Note that the output from printf/print can come out all mixed up, since the threads write concurrently in no particular order.

Work sharing

In a parallel region all threads execute the same code. The division of work can be done by the user, by assigning different subtasks to different threads based on the thread id (or thread rank), or by using the work-sharing constructs:

  • omp for or omp do: used to split up loop iterations among the threads, also called loop constructs.

  • sections: assigning consecutive but independent code blocks to different threads (see the sketch after this list)

  • single: specifying a code block that is executed by only one thread, a barrier is implied in the end

  • master : similar to single, but the code block will be executed by the master thread only and no barrier implied in the end.

  • task: allows to create units of work dynamically for parallelizing irregular algorithms such as recursive algorithms.

  • workshare: divides the execution of the enclosed structured block into separate units of work. Each unit of work is executed by one thread. (Fortran only)

  • simd: indicates that multiple iterations of the loop can be executed concurrently using SIMD instructions
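
As a hedged sketch of two of the constructs above, sections and single (the messages printed are purely illustrative): consecutive independent blocks are handed to different threads, while a single block is run by only one thread.

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel
  {
#pragma omp sections
    {
#pragma omp section
      printf("section A done by thread %d\n", omp_get_thread_num());
#pragma omp section
      printf("section B done by thread %d\n", omp_get_thread_num());
    }
#pragma omp single
    printf("this line is printed by only one thread\n");
  }
  return 0;
}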

Example of a trivially parallelizable problem using the loop work-sharing construct:

#include <stdio.h>

int main(int argc, char *argv[]) {
  int a[1000];

#pragma omp parallel
  {
    /* the loop iterations are split among the threads of the team */
#pragma omp for
    for (int i = 0; i < 1000; i++) {
      a[i] = 2 * i;
    }
  }
  return 0;
}

The constructs can be combined if one is immediately nested inside another construct.
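
For instance, the parallel and for constructs of the previous example can be merged into one combined construct; the following sketch is equivalent to the loop example above.

#include <stdio.h>

int main(int argc, char *argv[]) {
  int a[1000];

  /* combined construct: creates the team and shares the loop in one directive */
#pragma omp parallel for
  for (int i = 0; i < 1000; i++) {
    a[i] = 2 * i;
  }

  printf("a[999] = %d\n", a[999]);
  return 0;
}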

Clauses

Together with the compiler directives, OpenMP provides clauses that can be used to control the parallelism of regions of code. The clauses specify additional behaviour the user wants to occur; they refer to how variables are visible to the threads (private or shared), synchronization, scheduling, control, etc. The clauses are appended to the directives in the code. Below is a list of the many types of clauses available to programmers:

Data sharing attribute clauses

By default all variables are shared. Sometimes private variables are necessary to avoid race conditions.

  • shared: the data declared outside a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.

  • private: the data declared within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.

  • default: allows the programmer to state that the default data scoping within a parallel region will be either shared, or none for C/C++, or shared, firstprivate, private, or none for Fortran. The none option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.

  • firstprivate: like private except initialized to original value.

  • lastprivate: like private except original value is updated after construct.

  • reduction: a safe way of joining work from all threads after construct.

Below is an example of a reduction without a race condition:

#pragma omp parallel for shared(x,y,n) private(i) reduction(+:asum)
   for(i=0; i < n; i++) {
       asum = asum + x[i] * y[i];
   }

Synchronization clauses

  • critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions (see the sketch after this list).

  • atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical.

  • ordered: the structured block is executed in the order in which iterations would be executed in a sequential loop

  • barrier: each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.

  • nowait: specifies that threads completing assigned work can proceed without waiting for all threads in the team to finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work sharing construct.
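
A minimal sketch of critical and atomic (the variable hits and the messages are illustrative): atomic protects a single memory update, while critical serializes execution of a whole block.

#include <stdio.h>

int main(void) {
  int hits = 0;

#pragma omp parallel
  {
    /* atomic: only this single memory update is protected */
#pragma omp atomic
    hits++;

    /* critical: a whole block, executed by one thread at a time */
#pragma omp critical
    {
      printf("one thread at a time reports in\n");
    }
  }

  printf("total hits: %d\n", hits);
  return 0;
}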

Scheduling clauses

  • schedule (type, chunk): This is useful if the work-sharing construct is a do-loop or for-loop. The iterations in the work-sharing construct are assigned to threads according to the scheduling method defined by this clause (see the sketch after this list). The three types of scheduling are:

  • static: Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter chunk will allocate chunk number of contiguous iterations to a particular thread.

  • dynamic: Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter chunk defines the number of contiguous iterations that are allocated to a thread at a time.

  • guided: A large chunk of contiguous iterations are allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation to a minimum size specified in the parameter chunk.
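
A minimal sketch of the schedule clause (the chunk size 4 and the loop body are only illustrations); dynamic scheduling helps when iterations have uneven cost.

#include <stdio.h>

int main(void) {
  double sum = 0.0;

  /* chunks of 4 iterations are handed out dynamically as threads become free */
#pragma omp parallel for schedule(dynamic, 4) reduction(+:sum)
  for (int i = 0; i < 1000; i++) {
    double x = i * 0.001;
    sum += x * x;   /* stand-in for work whose cost may vary per iteration */
  }

  printf("sum = %f\n", sum);
  return 0;
}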

IF control

  • if: This will cause the threads to parallelize the task only if a condition is met. Otherwise the code block executes serially.
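
A sketch of the if clause (the threshold 10000 and the array size are arbitrary illustrations): the region is parallelized only when the workload is large enough to pay for the threading overhead.

#include <stdio.h>

int main(void) {
  const int n = 500;   /* smaller than the threshold below, so the loop runs serially */
  double a[500];

#pragma omp parallel for if(n > 10000)
  for (int i = 0; i < n; i++) {
    a[i] = 2.0 * i;
  }

  printf("a[%d] = %f\n", n - 1, a[n - 1]);
  return 0;
}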

Initialization

  • firstprivate: the data is private to each thread, but initialized using the value of the variable with the same name from the master thread (see the sketch after this list).

  • lastprivate: the data is private to each thread. The value of this private data will be copied to a global variable with the same name outside the parallel region if the current iteration is the last iteration of the parallelized loop. A variable can be both firstprivate and lastprivate.

  • threadprivate: the data is global, but it is private in each parallel region during run time. The difference between threadprivate and private is the global scope associated with threadprivate and the value preserved across parallel regions.
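
A minimal sketch of firstprivate and lastprivate (the variable names are illustrative): offset enters each thread initialized from the master copy, and last carries the value of the sequentially final iteration back out of the loop.

#include <stdio.h>

int main(void) {
  int offset = 10;   /* copied into each thread's private copy by firstprivate */
  int last = 0;      /* receives the value from the final iteration via lastprivate */

#pragma omp parallel for firstprivate(offset) lastprivate(last)
  for (int i = 0; i < 100; i++) {
    last = i + offset;
  }

  printf("last = %d\n", last);   /* 109: the value from iteration i = 99 */
  return 0;
}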

Reduction

  • reduction (operator | intrinsic : list): the variable has a local copy in each thread, but the values of the local copies will be combined (reduced) into a global shared variable. This is very useful if a particular operation (specified in operator for this particular clause) on a variable runs iteratively, so that its value at a particular iteration depends on its value at a prior iteration. The steps that lead up to the operational increment are parallelized, but the threads update the global variable in a thread-safe manner. This is required, for example, when parallelizing the numerical integration of functions and differential equations.

Others

  • flush: the value of this variable is written back from the register (the temporary view) to memory so that it can be used outside of the parallel part.

  • master: executed only by the master thread (the thread which forked off all the others during the execution of the OpenMP directive). There is no implicit barrier; the other team members (threads) are not required to reach it.

  • collapse: when more than one loop follows a loop construct, it specifies how many loops of the nest should be collapsed into one large iteration space.
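
A minimal sketch of collapse (the array size is illustrative): the two perfectly nested loops are merged into one 100*100 iteration space that is then shared among the threads.

#include <stdio.h>

int main(void) {
  double a[100][100];

  /* collapse(2) merges the two loops into a single 100*100 iteration space */
#pragma omp parallel for collapse(2)
  for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
      a[i][j] = i + 0.5 * j;
    }
  }

  printf("a[99][99] = %f\n", a[99][99]);
  return 0;
}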

Runtime library routines

OpenMP includes an extensive suite of runtime routines. They can be used for many purposes: to modify or check the number of threads, to detect whether the execution context is in a parallel region, to query how many processors the current system has, to set/unset locks, for timing, etc.

The function definitions are in the omp.h header in C/C++ and in the omp_lib module in Fortran. Some very useful routines are:

  • omp_get_num_threads()

  • omp_get_thread_num()

  • omp_get_wtime()

    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
      int omp_rank;

      /* omp_rank must be private, otherwise all threads would write to the same variable */
    #pragma omp parallel private(omp_rank)
      {
        omp_rank = omp_get_thread_num();
        printf("Hello world! by thread %d\n", omp_rank);
      }
      return 0;
    }
    

The portability of the code can be maintained by using conditional compilation with #ifdef _OPENMP.
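
A minimal sketch of such conditional compilation (the nthreads variable is illustrative): the omp.h header and the runtime call are only used when the compiler defines _OPENMP, so the same file also builds as plain serial code.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
  int nthreads = 1;   /* sensible default for a serial build */
#ifdef _OPENMP
#pragma omp parallel
  {
#pragma omp master
    nthreads = omp_get_num_threads();
  }
#endif
  printf("running with %d thread(s)\n", nthreads);
  return 0;
}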

OpenMP environment variables

The OpenMP standard also defines a set of environment variables that all implementations have to support. The environment variables are set before program execution and are read during program start-up. They can be used to control the execution of the parallel code at run time, for example to set the number of threads, specify the binding of the threads, or specify how the loop iterations are divided.

Setting OpenMP environment variables is done the same way you set any other environment variables. For example:

  • csh/tcsh: setenv OMP_NUM_THREADS 8

  • sh/bash: export OMP_NUM_THREADS=8

Here are a few environment variables:

  • OMP_NUM_THREADS: Number of threads to use

  • OMP_PROC_BIND: Bind threads to CPUs

  • OMP_PLACES: Specify the bindings between threads and CPUs

  • OMP_DISPLAY_ENV: Print the current OpenMP environment info on stderr

Compiling an OpenMP program

In order to use OpenMP the compiler needs to have support for it. OpenMP support is enabled by adding an extra compile option:

  • GNU: -fopenmp

  • Intel: -qopenmp

  • Cray: -h omp

  • PGI: -mp[=nonuma,align,allcores,bind]
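
For example, assuming the Hello world program above is saved as hello.c, an illustrative GNU build line would be gcc -fopenmp hello.c -o hello; the exact compiler name and option depend on the system.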

Keypoints

  • OpenMP is the de facto standard for programming shared memory machines

  • thread-based parallel programming model

  • fork-join model

  • global memory, temporary view, thread private memory

  • parallelism is exposed via directives, which are treated as comments if there is no OpenMP support

  • work sharing done via specific constructs

  • clauses provide additional control

  • runtime library provides an extensive suite of routines

  • environment variables can be used to alter execution features of the applications