GPU programming example: stencil computation

Questions

  • How do I compile and run code developed using different programming models and frameworks?

  • What can I expect from the GPU-ported programs in terms of performance gains and trends, and how do I estimate this?

Objectives

  • To show a self-contained example of parallel computation executed on CPU and GPU using different programming models

  • To show the differences and consequences of implementing the same algorithm in the natural “style” of different models/frameworks

  • To discuss how to assess theoretical and practical performance scaling of GPU codes

Instructor note

  • 45 min teaching

  • 45 min exercises

Problem: heat flow in a two-dimensional area

Heat flows in objects according to local temperature differences, as if seeking local equilibrium. The following example defines a rectangular area with two always-warm sides (temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the center. Because of heat diffusion, the temperature of neighboring patches of the area is bound to equalize, changing the overall distribution:

../_images/heat_montage.png

Over time, the temperature distribution progresses from the initial state toward an end state where the upper triangle is warm and the lower one is cold. The average temperature tends to (70 + 85 + 20 + 5) / 4 = 45.

Technique: stencil computation

Heat transfer in the system above is governed by a partial differential equation describing the local variation of the temperature field in time and space. That is, the rate of change of the temperature field \(u(x, y, t)\) over two spatial dimensions \(x\) and \(y\) and time \(t\) (with rate coefficient \(\alpha\)) can be modelled via the equation

\[\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)\]

The standard way to numerically solve differential equations is to discretize them, i.e. to consider only a set (grid) of specific points in the area, at specific moments in time. The partial derivatives are then converted into finite differences between adjacent grid points \(u^{m}(i,j)\), with \(m\) denoting a point in time and \(i, j\) a point on the spatial grid. The temperature change in time at a certain point can now be computed from the values of neighboring points at an earlier time; the same expression, called a stencil, is applied to every point on the grid.
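Written out for the five-point stencil used in this episode (and mirrored directly in the code examples below), the discretized update rule reads

\[u^{m+1}(i,j) = u^{m}(i,j) + \alpha \Delta t \left( \frac{u^{m}(i+1,j) - 2 u^{m}(i,j) + u^{m}(i-1,j)}{\Delta x^2} + \frac{u^{m}(i,j+1) - 2 u^{m}(i,j) + u^{m}(i,j-1)}{\Delta y^2} \right)\]

where \(\Delta x\) and \(\Delta y\) are the grid spacings and \(\Delta t\) is the time step.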

../_images/stencil.svg

This simplified model uses an 8×8 grid of data points (light blue) in state \(m\); each location is updated based on the indicated five-point stencil (yellow) to move to the next time point \(m+1\).

Discussion: stencil applications

Stencil computation is a common occurrence in solving numerical problems. Have you already encountered it? Can you think of a problem that could be formulated this way in your field / area of expertise?

Technical considerations

1. How fast and/or accurate can the solution be?

The spatial resolution of the temperature field is controlled by the number (density) of grid points. Since a full grid update is required to proceed from one time point to the next, the stencil computation is the main target of parallelization (on CPU or GPU).

Moreover, in many cases the chosen time step cannot be arbitrarily large, otherwise the numerical solution becomes unstable; and dense (accurate) grids imply small time steps (see the inset below), which makes an efficient spatial update even more important.
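For the explicit scheme used in these examples, a standard stability analysis yields the bound

\[\Delta t \leq \frac{\Delta x^2 \Delta y^2}{2 \alpha \left( \Delta x^2 + \Delta y^2 \right)}\]

so, on a square grid, halving the grid spacing requires cutting the time step roughly four-fold.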

2. What to do with area boundaries?

Naturally, the stencil expression cannot be applied directly to the outermost grid points, which have no outer neighbors. This can be solved either by changing the expression for those points or by adding an extra layer of grid points that is used when computing the update but is not updated itself; this example uses the latter approach, with points of fixed temperature along the sides.

3. How could the algorithm be optimized further?

An earlier episode already stressed the importance of efficient memory access. In the following examples, each grid point (and its neighbors) is treated mostly independently; however, this also means that with a five-point stencil each grid value may be read up to five times from memory (even if it is the fast GPU memory). By rearranging the order of mathematical operations, it may be possible to reuse these values in a more efficient way.

Another point to note is that, even if the solution is propagated in small time steps, not every step might actually be needed for output. Once some local region of the field has been updated, nothing mathematically prevents updating it for a second time step (even while the rest of the field is still being recalculated), as long as the \(t = m-1\) values for the region boundary are there when needed. (Of course, this is more complicated to implement and would only give benefits in certain cases.)

Poll: which programming model/framework are you most interested in today?

  • OpenMP offloading (C++)

  • SYCL

  • Python (numba/CUDA)

  • Julia

The following guide will aid you in navigating the rest of this section:

Episode guide

  • Sequential and thread-parallel program in C++

  • GPU parallelization: first steps

  • GPU parallelization: data movement

  • Python: JIT and GPU acceleration

  • Julia GPU acceleration

Sequential and thread-parallel program in C++

Trying out code examples

Source files of the examples presented for the rest of this episode are available in the content/examples/stencil/ directory. To download them to your home directory on the cluster, you can use Git:

$ git clone https://github.com/ENCCS/gpu-programming.git
$ cd gpu-programming/content/examples/stencil/
$ ls

Warning

Don’t forget to git pull for the latest updates if you already have the content from the first day of the workshop!

If we assume the grid point values to be truly independent for a single time step, the stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below. (The general structure of the program and the default parameter values for the problem model are also provided in the example sources for reference.) CPU-thread parallelism can then be enabled by a single OpenMP #pragma:

// (c) 2023 ENCCS, CSC and the contributors
#include "heat.h"

// Update the temperature values using five-point stencil
// Arguments:
//   curr: current temperature values
//   prev: temperature values from previous time step
//   a: diffusivity
//   dt: time step
void evolve(field *curr, field *prev, double a, double dt)
{
  // Help the compiler avoid being confused by the structs
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = prev->nx;
  int ny = prev->ny;

  // Determine the temperature field at next time step
  // As we have fixed boundary conditions, the outermost gridpoints
  // are not updated.
  double dx2 = prev->dx * prev->dx;
  double dy2 = prev->dy * prev->dy;

  // Use OpenMP threads for parallel update of grid values
  #pragma omp parallel for
  for (int i = 1; i < nx + 1; i++) {
    for (int j = 1; j < ny + 1; j++) {
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = i * (ny + 2) + j + 1;
      int jm = i * (ny + 2) + j - 1;
      currdata[ind] = prevdata[ind] + a*dt*
	    ((prevdata[ip] - 2.0*prevdata[ind] + prevdata[im]) / dx2 +
	     (prevdata[jp] - 2.0*prevdata[ind] + prevdata[jm]) / dy2);
    }
  }
}
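If you would rather compile the example yourself than use the executables provided in the lumi/ directory, a minimal sketch follows, assuming the Cray compiler wrappers on LUMI (the wildcard is hypothetical; adjust it to the actual list of source files):

$ CC -O2 -fopenmp *.cpp -o stencil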

Callout

If you will be using the Git-stored versions of the executables, you should also make them… well, executable:

$ cd lumi
$ chmod 770 stencil*

CPU parallelization: timings

For later comparison, some benchmarks of the thread-parallel executable are provided below:

Run times of OpenMP-enabled executable, s

Job size         1 CPU core    32 CPU cores
-------------    ----------    ------------
S:2000 T:500          1.390           0.061
S:2000 T:5000        13.900           0.550
S:20000 T:50         15.200          12.340

A closer look reveals that the computation time scales linearly with the number of time steps:

../_images/heat-omp-T.png

However, for larger grid sizes the parallelization becomes inefficient: as the individual chunks of the grid become too large to fit into the CPU cache, the threads become bound by the speed of RAM reads/writes:

../_images/heat-omp-S.png

Discussion: heat flow computation scaling

  1. How is heat flow computation expected to scale with respect to the number of time steps?

    1. Linearly

    2. Quadratically

    3. Exponentially

  2. How is stencil application (grid update) expected to scale with respect to the size of the grid side?

    1. Linearly

    2. Quadratically

    3. Exponentially

  3. (Optional) Do you expect GPU-accelerated computations to suffer from the memory effects observed above? Why or why not?

GPU parallelization: first steps

Let’s apply several techniques presented in previous episodes to make stencil update GPU-parallel.

OpenMP (or OpenACC) offloading requires defining a region to be executed in parallel, as well as the data that shall be copied to / used in GPU memory. Similarly, the SYCL programming model offers convenient ways to define execution kernels, a context to run them in (called a queue), and simplified CPU-GPU transfer of the needed data.

The changes to the stencil update code are shown below for OpenMP, followed by a minimal SYCL sketch:

// (c) 2023 ENCCS, CSC and the contributors
#include "heat.h"

// Update the temperature values using five-point stencil
// Arguments:
//   curr: current temperature values
//   prev: temperature values from previous time step
//   a: diffusivity
//   dt: time step
void evolve(field *curr, field *prev, double a, double dt)
{
  // Help the compiler avoid being confused by the structs
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = prev->nx;
  int ny = prev->ny;

  // Determine the temperature field at next time step
  // As we have fixed boundary conditions, the outermost gridpoints
  // are not updated.
  double dx2 = prev->dx * prev->dx;
  double dy2 = prev->dy * prev->dy;
  
  // Offload value update to GPU target (fallback to CPU is possible)
  #pragma omp target teams distribute parallel for \
  map(currdata[0:(nx+2)*(ny+2)],prevdata[0:(nx+2)*(ny+2)])
  for (int i = 1; i < nx + 1; i++) {
    for (int j = 1; j < ny + 1; j++) {
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = i * (ny + 2) + j + 1;
      int jm = i * (ny + 2) + j - 1;
      currdata[ind] = prevdata[ind] + a*dt*
	    ((prevdata[ip] - 2.0*prevdata[ind] + prevdata[im]) / dx2 +
	     (prevdata[jp] - 2.0*prevdata[ind] + prevdata[jm]) / dy2);
    }
  }
}
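For comparison, here is a minimal SYCL sketch of the same update (illustrative; the function signature and buffer handling are assumptions, and the actual sycl/ sources may differ). Buffers wrap the host arrays, and the SYCL runtime manages host-device data movement:

// Illustrative SYCL sketch, not the exact workshop code
#include <sycl/sycl.hpp>

void evolve_sycl(sycl::queue &q, double *currdata, double *prevdata,
                 int nx, int ny, double a, double dt, double dx2, double dy2)
{
  sycl::range<1> size((nx + 2) * (ny + 2));
  sycl::buffer<double, 1> curr_buf(currdata, size);
  sycl::buffer<double, 1> prev_buf(prevdata, size);

  q.submit([&](sycl::handler &h) {
    sycl::accessor curr{curr_buf, h, sycl::write_only};
    sycl::accessor prev{prev_buf, h, sycl::read_only};
    // One work-item per inner grid point; the boundary layer is not updated
    h.parallel_for(sycl::range<2>(nx, ny), [=](sycl::id<2> id) {
      int i = id[0] + 1;
      int j = id[1] + 1;
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = ind + 1;
      int jm = ind - 1;
      curr[ind] = prev[ind] + a * dt *
          ((prev[ip] - 2.0 * prev[ind] + prev[im]) / dx2 +
           (prev[jp] - 2.0 * prev[ind] + prev[jm]) / dy2);
    });
  });
  // Buffer destructors wait for the kernel and copy data back to the host
}

Note that creating (and destroying) the buffers on every call implies a host-device transfer per time step, the same overhead the naive ports below suffer from and the data movement section removes.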

Loading modules on LUMI

As SYCL is layered on top of the ROCm/HIP (or CUDA) software stack, even running SYCL executables may require the respective modules to be loaded. On the current nodes, this can be done as follows:

salloc -A project_465000485 -N 1 -t 1:00:0 -p standard-g --gpus-per-node=1

module load LUMI/22.08
module load partition/G
module load rocm/5.3.3
module use /project/project_465000485/Easy_Build_Installations/modules/LUMI/22.08/partition/G/
module load hipSYCL

Exercise: naive GPU ports

In the interactive allocation, run (using srun) the provided or compiled executables base/stencil, base/stencil_off and sycl/stencil_naive. Try changing the problem size parameters:

  • srun stencil_naive 2000 2000 5000

To look for:

  • How do computation times change?

  • Do the results align with your expectations?

GPU parallelization: data movement

Why does the porting approach above seem to be so grossly inefficient?

On each step, we:

  • re-allocate GPU memory,

  • copy the data from CPU to GPU,

  • perform the computation,

  • then copy the data back.

This overhead can be reduced by taking care to minimize data transfers between host and device memory:

  • allocate GPU memory once at the start of the program,

  • only copy the data from GPU to CPU when we need it,

  • swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.)

The changes to the stencil update code and the new data-mapping functions are shown below, followed by an outline of how the main program can use them.

// (c) 2023 ENCCS, CSC and the contributors
#include "heat.h"

// Update the temperature values using five-point stencil
// Arguments:
//   curr: current temperature values
//   prev: temperature values from previous time step
//   a: diffusivity
//   dt: time step
void evolve(field *curr, field *prev, double a, double dt)
{
  // Help the compiler avoid being confused by the structs
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = prev->nx;
  int ny = prev->ny;

  // Determine the temperature field at next time step
  // As we have fixed boundary conditions, the outermost gridpoints
  // are not updated.
  double dx2 = prev->dx * prev->dx;
  double dy2 = prev->dy * prev->dy;
  
  // Offload value update to GPU target (fallback to CPU is possible)
  #pragma omp target teams distribute parallel for
  for (int i = 1; i < nx + 1; i++) {
    for (int j = 1; j < ny + 1; j++) {
      int ind = i * (ny + 2) + j;
      int ip = (i + 1) * (ny + 2) + j;
      int im = (i - 1) * (ny + 2) + j;
      int jp = i * (ny + 2) + j + 1;
      int jm = i * (ny + 2) + j - 1;
      currdata[ind] = prevdata[ind] + a*dt*
	    ((prevdata[ip] - 2.0*prevdata[ind] + prevdata[im]) / dx2 +
	     (prevdata[jp] - 2.0*prevdata[ind] + prevdata[jm]) / dy2);
    }
  }
}

// Start a data region and copy temperature fields to the device
void enter_data(field *curr, field *prev)
{
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = prev->nx;
  int ny = prev->ny;

  // adding data mapping here
  #pragma omp target enter data \
  map(to: currdata[0:(nx+2)*(ny+2)], prevdata[0:(nx+2)*(ny+2)])
}

// End a data region and copy temperature fields back to the host
void exit_data(field *curr, field *prev)
{
  double *currdata = curr->data.data();
  double *prevdata = prev->data.data();
  int nx = prev->nx;
  int ny = prev->ny;

  // adding data mapping here
  #pragma omp target exit data \
  map(from: currdata[0:(nx+2)*(ny+2)], prevdata[0:(nx+2)*(ny+2)])
}

// Copy a temperature field from the device to the host
void update_host(field *heat)
{
  double *data = heat->data.data();
  int nx = heat->nx;
  int ny = heat->ny;

  // adding data mapping here
  #pragma omp target update from(data[0:(nx+2)*(ny+2)])
}
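To see how these pieces fit together, here is an illustrative outline of the main loop with a persistent device data region (write_field, output_interval and the function name iterate are hypothetical; the actual main program in the example sources may differ):

// Illustrative outline, not the exact workshop code
#include <utility>   // std::swap
#include "heat.h"

void iterate(field &current, field &previous, double a, double dt,
             int nsteps, int output_interval)
{
  enter_data(&current, &previous);          // copy fields to the device once
  for (int iter = 1; iter <= nsteps; iter++) {
    evolve(&current, &previous, a, dt);     // stencil update runs on the device
    if (iter % output_interval == 0) {
      update_host(&current);                // copy back only when output is needed
      // write_field(&current, iter);       // hypothetical snapshot output
    }
    // Swap the fields between time steps; the device copies stay associated
    // with the underlying buffers, so no host-device transfer happens here.
    std::swap(current, previous);
  }
  exit_data(&current, &previous);           // final copy back to the host
}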

Exercise: updated GPU ports

In the interactive allocation, run (using srun) the provided or compiled executables base/stencil_data and sycl/stencil. Try changing the problem size parameters:

  • srun stencil 2000 2000 5000

To look for:

  • How do computation times change this time around?

  • What is the largest grid and/or the longest propagation time you can get in 10 s on your machine?

Python: JIT and GPU acceleration

As mentioned previously, the Numba package allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, and it can also be used for JIT compilation for GPUs (although AMD GPU support is currently deprecated in Numba versions above 0.53). JIT works well on loop-based, computationally heavy functions, so it is a good choice for the initial version of the code:

from numba import jit


# Update the temperature values using five-point stencil
# Arguments:
#   curr: current temperature field object
#   prev: temperature field from previous time step
#   a: diffusivity
#   dt: time step
def evolve(current, previous, a, dt):
    dx2, dy2 = previous.dx**2, previous.dy**2
    curr, prev = current.data, previous.data
    # Run (possibly accelerated) update
    _evolve(curr, prev, a, dt, dx2, dy2)


@jit(nopython=True)
def _evolve(curr, prev, a, dt, dx2, dy2):
    nx, ny = prev.shape # These are the FULL dims, rows+2 / cols+2
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            curr[i, j] = prev[i, j] + a * dt * ( \
              (prev[i+1, j] - 2*prev[i, j] + prev[i-1, j]) / dx2 + \
              (prev[i, j+1] - 2*prev[i, j] + prev[i, j-1]) / dy2 )

An alternative approach would be to rewrite the stencil update code in NumPy style, exploiting loop vectorization.
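A sketch of such a vectorized version (the function name is illustrative; curr and prev are NumPy arrays that include the fixed boundary layer):

# Vectorized five-point stencil update: whole-array slicing replaces
# the explicit loops over grid points
def _evolve_numpy(curr, prev, a, dt, dx2, dy2):
    curr[1:-1, 1:-1] = prev[1:-1, 1:-1] + a * dt * (
        (prev[2:, 1:-1] - 2*prev[1:-1, 1:-1] + prev[:-2, 1:-1]) / dx2 +
        (prev[1:-1, 2:] - 2*prev[1:-1, 1:-1] + prev[1:-1, :-2]) / dy2)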

Trying out Python examples

You can run the provided code examples on Google Colab (using the instructions provided in the Setup), on your local machine, or on a LUMI node (non-GPU variants). On LUMI, you can set up the Python environment as follows:

$ module load cray-python/3.9.13.1
(install needed dependencies locally)
$ pip3 install --user numba matplotlib
$ cd ../python/
(make sure you have active allocation)
$ srun python3 main.py

A short summary of typical LUMI and Colab runs is provided below:

Run times of Numba JIT-enabled Python program, s

Job size        JIT (LUMI)    JIT (Colab)      Job size      no JIT (Colab)
------------    ----------    -----------      ----------    --------------
S:2000 T:500         1.648          8.495      S:200 T:50             5.318
S:2000 T:200         0.787          3.524      S:200 T:20             1.859
S:1000 T:500         0.547          2.230      S:100 T:50             1.156

Numba’s @vectorize and @guvectorize decorators offer an interface to create CPU- (or GPU-) accelerated Python functions without explicit implementation details. However, such functions become increasingly complicated to write (and for the compiler to optimize) as the computations within them grow more complex.
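As a toy illustration of the interface (not part of the workshop code), an element-wise function can be compiled as follows:

from numba import vectorize

# The function body is written for scalars; Numba broadcasts it over whole
# arrays. Changing target to "parallel" or "cuda" retargets the same code.
@vectorize(["float64(float64, float64, float64)"], target="cpu")
def axpy(a, x, y):
    return a * x + y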

For NVIDIA GPUs, Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. An example of the stencil update written in Numba CUDA is shown in the data movement section, tab “Python”; a similar kernel is sketched below. In this case, the data transfer functions devdata = cuda.to_device(data) and devdata.copy_to_host(data) (see main_cuda.py) are already provided by the Numba package.
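A sketch of what such a kernel can look like (details may differ from the main_cuda.py version; the launch configuration below is hypothetical):

from numba import cuda

@cuda.jit
def _evolve_kernel(curr, prev, a, dt, dx2, dy2):
    nx = prev.shape[0]   # full dimensions, including the boundary layers
    ny = prev.shape[1]
    i, j = cuda.grid(2)  # absolute position of this thread in the grid
    if 1 <= i < nx - 1 and 1 <= j < ny - 1:
        curr[i, j] = prev[i, j] + a * dt * (
            (prev[i+1, j] - 2*prev[i, j] + prev[i-1, j]) / dx2 +
            (prev[i, j+1] - 2*prev[i, j] + prev[i, j-1]) / dy2)

# Hypothetical launch: one thread per grid point, in 16x16 blocks
# blocks = ((nx + 15) // 16, (ny + 15) // 16)
# _evolve_kernel[blocks, (16, 16)](devcurr, devprev, a, dt, dx2, dy2)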

Exercise: CUDA acceleration in Python

Using Google Colab (or your own machine), run the provided Numba CUDA Python program. Try changing the problem size parameters:

  • args.rows, args.cols, args.nsteps = 2000, 2000, 5000 for notebooks,

  • [srun] python3 main.py 2000 2000 5000 for command line.

To look for:

  • How do computation times change?

  • Do you get better performance than from the JIT-compiled CPU version? How far can you push the problem size?

  • Are you able to monitor the GPU usage?

Julia GPU acceleration

A Julia version of the stencil example above can be found below (a simplified version of the HeatEquation module at https://github.com/ENCCS/HeatEquation.jl). The source files are also available in the content/examples/stencil/julia directory of this repository.

To run the example on LUMI CPU partition, type:

$ # interactive CPU node
$ srun --account=project_465000485 --partition=standard --nodes=1 --cpus-per-task=32 --ntasks-per-node=1 --time=01:00:00 --pty bash
$ # load Julia env
$ module purge
$ module use /appl/local/csc/modulefiles
$ module load julia/1.9.0
$ # in directory with Project.toml and source files, instantiate an environment to install packages
$ julia --project -e "using Pkg ; Pkg.instantiate()"
$ # finally run
$ julia --project main.jl

To run on the GPU partition, use the following srun command instead:

$ srun --account=project_465000485 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash

Optional dependency

Note that the Plots.jl dependency is commented out in main.jl and Project.toml. This saves about 2 minutes of precompilation time when you first instantiate the Julia environment. To generate plots, uncomment the Plots.jl dependency in Project.toml, instantiate again, and uncomment the using Plots and visualize() lines in main.jl.

#using Plots
using BenchmarkTools

include("heat.jl")
include("core.jl")


"""
    visualize(curr::Field, filename=:none)

Create a heatmap of a temperature field. Optionally write png file. 
"""    
function visualize(curr::Field, filename=:none)
    background_color = :white
    plot = heatmap(
        curr.data,
        colorbar_title = "Temperature (C)",
        background_color = background_color
    )

    if filename != :none
        savefig(filename)
    else
        display(plot)
    end
end


ncols, nrows = 2048, 2048
nsteps = 500

# initialize current and previous states to the same state
curr, prev = initialize(ncols, nrows)

# visualize initial field, requires Plots.jl
#visualize(curr, "initial.png")

# simulate temperature evolution for nsteps
simulate!(curr, prev, nsteps)

# visualize final field, requires Plots.jl
#visualize(curr, "final.png")

Exercise: Julia port to GPUs

Carefully inspect all Julia source files and consider the following questions:

  1. Which functions should be ported to run on GPU?

  2. Look at the initialize!() function and how it uses the arraytype argument. This could be done more compactly and elegantly, but this solution avoids scalar indexing errors. What are scalar indexing errors?

  3. Try to start sketching GPU-ported versions of the key functions.

  4. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding @btime in front of simulate!() in main.jl. Also benchmark the CPU version, and compare.

See also

This section leans heavily on source code and material created for several other computing workshops by ENCCS and CSC, adapted for the purposes of this lesson. If you want to know more about a specific programming model or framework, definitely check these out!