Advanced exercises

Instructor note

  • 0 min teaching

  • 30 min exercises

Example use case: heat flow in two-dimensional area

Heat flows in objects according to local temperature differences, as if seeking local equilibrium. The following example defines a rectangular area with two always-warm sides (temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the center. Because of heat diffusion, temperature of neighboring patches of the area is bound to equalize, changing the overall distribution:

../_images/heat_montage.png

Over time, the temperature distribution progresses from the initial state toward an end state where upper triangle is warm and lower is cold. The average temperature tends to (70 + 85 + 20 + 5) / 4 = 45.

Technique: stencil computation

Heat transfer in the system above is governed by the partial differential equation(s) describing local variation of the temperature field in time and space. That is, the rate of change of the temperature field \(u(x, y, t)\) over two spatial dimensions \(x\) and \(y\) and time \(t\) (with rate coefficient \(\alpha\)) can be modelled via the equation

\[\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial x^2}\right)\]

The standard way to numerically solve differential equations is to discretize them, i. e. to consider only a set/ grid of specific area points at specific moments in time. That way, partial derivatives \({\partial u}\) are converted into differences between adjacent grid points \(u^{m}(i,j)\), with \(m, i, j\) denoting time and spatial grid points, respectively. Temperature change in time at a certain point can now be computed from the values of neighboring points at earlier time; the same expression, called stencil, is applied to every point on the grid.

../_images/stencil.svg

This simplified model uses an 8x8 grid of data in light blue in state \(m\), each location of which has to be updated based on the indicated 5-point stencil in yellow to move to the next time point \(m+1\).

The following series of exercises uses this stencil example implemented in Julia. The source files listed below represent a simplification of this HeatEquation package, which in turn is inspired by this educational repository containing C/C++/Fortran versions with different parallelization strategies (credits to CSC Finland) (you can also find the source files in the content/code/stencil/ directory of this repository).

#using Plots
using BenchmarkTools

include("heat.jl")
include("core.jl")


"""
    visualize(curr::Field, filename=:none)

Create a heatmap of a temperature field. Optionally write png file. 
"""    
function visualize(curr::Field, filename=:none)
    background_color = :white
    plot = heatmap(
        curr.data,
        colorbar_title = "Temperature (C)",
        background_color = background_color
    )

    if filename != :none
        savefig(filename)
    else
        display(plot)
    end
end


ncols, nrows = 2048, 2048
nsteps = 10

# initialize current and previous states to the same state
curr, prev = initialize(ncols, nrows)

# visualize initial field, requires Plots.jl
#visualize(curr, "initial.png")

# simulate temperature evolution for nsteps
simulate!(curr, prev, nsteps)

# visualize final field, requires Plots.jl
#visualize(curr, "final.png")

Run the code

  • Copy the source files from the code box above.

  • Activate the environment found in the Project.toml file.

  • Run the main.jl code and (optionally) visualise the result by uncommenting the relevant line.

  • Study the code for a while to understand how it works.

Optimise and benchmark

  • Benchmark the evolve!() function in the Julia REPL.

  • Add the @inbounds macro to the innermost loop.

  • Benchmark again and estimate the performance gain.

Multithread

  • Multithread the evolve!() function

  • Benchmark again with different number of threads. It will be most convenient to run these benchmarks from the command line where you can set the number of threads: julia -t <nthreads>.

  • How does it scale?

Adapt the stencil problem for GPU porting

In order to prepare for porting the stencil problem to run on a GPU, it’s wise to modify the code slightly. One approach is to change the evolve! function to accept arrays instead of Field types. For now, we define a new version of this function called evolve2():

function evolve2!(currdata::AbstractArray, prevdata::AbstractArray, dx, dy, a, dt)
    nx, ny = size(currdata) .- 2
    for j = 2:ny+1
        for i = 2:nx+1
            @inbounds xderiv = (prevdata[i-1, j] - 2.0 * prevdata[i, j] + prevdata[i+1, j]) / dx^2
            @inbounds yderiv = (prevdata[i, j-1] - 2.0 * prevdata[i, j] + prevdata[i, j+1]) / dy^2
            @inbounds currdata[i, j] = prevdata[i, j] + a * dt * (xderiv + yderiv)
        end
    end
end
  • In the simulate!() function, update how you call the evolve2!() function.

  • Take a moment to study the initialize() function. Why is the if arraytype != Matrix statement there?

Using SharedArrays with stencil problem

Look again at the double for-loop in the modified evolve! function and think about how you could use SharedArrays. Start from the evolve2!() function defined above, and try to implement a version that accepts SharedArray arrays.

Exercise: Julia port to GPUs

Carefully inspect all Julia source files and consider the following questions:

  1. Which functions should be ported to run on GPU?

  2. Try to start sketching GPU-ported versions of the key functions.

  3. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding @btime in front of simulate!() in main.jl. Benchmark also the CPU version, and compare.

Further considerations:

  1. The kernel function needs to end with return or return nothing.

  2. The arrays are two-dimensional, so you will need both the .x and .y parts of threadIdx(), blockDim() and blockIdx().

    • Does it matter how you match the x and y dimensions of the threads and blocks to the dimensions of the data (i.e. rows and columns)?

  3. You also need to specify tuples for the number of threads and blocks in the x and y dimensions, e.g. threads = (32, 32) and similarly for blocks (using cld).

    • Note the hardware limitations: the product of x and y threads cannot exceed it.

  4. For debugging, you can print from inside a kernel using @cuprintln (e.g. to print thread numbers). It will only print during the first execution - redefine the function again to print again. If you get warnings or errors relating to types, you can use the code introspection macro @device_code_warntype to see the types inferred by the compiler.

  5. Check correctness of your results! To test that evolve! and evolve_gpu! give (approximately) the same results, for example:

  1. Perform some benchmarking of the evolve! and evolve_gpu! functions for arrays of various sizes and with different choices of nthreads. You will need to prefix the kernel execution with the CUDA.@sync macro to let the CPU wait for the GPU kernel to finish (otherwise you would be measuring the time it takes to only launch the kernel):

  2. Compare your Julia code with the corresponding CUDA version to enjoy the (relative) simplicity of Julia!

Create a package

Take the code for the stencil example and convert it into a Julia package! Instructions for creating Julia packages are found in the Introduction to Julia lesson.

Also try to write one or more tests. It can include unit tests, integration tests or an end-to-end test.