Solving heat equation with CUDA

The problem

The heat equation is a partial differential equation that describes the propagation of heat in a region over time. The two-dimensional heat equation can be written as:

\[\frac{\partial U}{\partial t}=a\left(\frac{\partial^2U}{\partial x^2}+\frac{\partial^2U}{\partial y^2}\right)\]

Here \(x\) and \(y\) are the spatial variables and \(t\) is the time; \(U\) is the temperature and \(a\) is the thermal diffusivity. It is common to use \(U\) instead of \(T\) for the temperature, since the same mathematical equation describes diffusion, in which case \(a\) is a diffusion constant.

To formulate the problem, one also needs to specify the initial condition, i.e. the function \(U(x,y,t)\) at \(t=0\). This is sufficient for an infinite domain, where \(x \in (-\infty,+\infty)\) and \(y \in (-\infty,+\infty)\). Since simulating over an infinite spatial domain is not possible, numerical computations are usually done on a finite area. This implies that one has to specify boundary conditions on the edges of the region. That is, if \(x \in (x_0, x_1)\) and \(y \in (y_0, y_1)\), the values of the functions \(U(x_0,y,t)\), \(U(x_1,y,t)\), \(U(x,y_0,t)\) and \(U(x,y_1,t)\) should be set. With the area limited, one can approximate the function \(U(x,y,t)\) with the grid function \(U^n_{ij}\), where \(n=0,1,2,\ldots,N\) is the time step and \(i=0,1,\ldots,N_x-1\) and \(j=0,1,\ldots,N_y-1\) are the spatial indices. The grid is usually defined as a set of equally spaced points, such that \(t_n=n\cdot dt\), \(x_i=x_0+i\cdot dx\) and \(y_j=y_0+j\cdot dy\). The values of the spatial steps \(dx\) and \(dy\) are chosen such that the final grid points are \(x_1\) and \(y_1\) in the two spatial dimensions respectively.


The grid (A) and the template (B) of the numerical scheme for the heat equation. Red dots indicate the grid nodes, where the values are taken from the boundary conditions. Black dots are internal grid nodes. The grid node in which the value is computed is shown in green. There are two examples of how the template (B) is applied to the grid (A).

With the grid defined, one can use the following explicit approximation for the differential equation:

\[\frac{U^{n+1}_{i,j}-U^{n}_{i,j}}{dt}=a\left(\frac{U^n_{i-1,j}-2U^{n}_{i,j}+U^n_{i+1,j}}{dx^2} + \frac{U^n_{i,j-1}-2U^{n}_{i,j}+U^n_{i,j+1}}{dy^2}\right)\]

Here we used the basic approximations for the first and second derivatives, \(\frac{df}{dx}\approx\frac{f(x+dx)-f(x)}{dx}\) and \(\frac{d^2f}{dx^2}\approx\frac{f(x-dx)-2f(x)+f(x+dx)}{dx^2}\). In the equation above, the values on the next time step, \(U^{n+1}_{i,j}\), are unknown. Indeed, if \(n=0\), the only unknown value in the equation for each \(i\) and \(j\) is \(U^1_{ij}\): the rest are determined by the initial condition. Going through all possible values of \(i\) and \(j\), one can compute the grid function \(U^n_{i,j}\) at \(n=1\). This procedure is then repeated for \(n=2\) and so on. Note that the values at the borders cannot be computed from the equation above, since some of the required values would be outside the spatial region (e.g. \(U^n_{i-1,j}\) when \(i=0\)). This is where the boundary conditions are used.

The numerical scheme can then be rearranged to give an explicit expression for these values:

\[U^{n+1}_{i,j}= U^{n}_{i,j} + dt\cdot a\left(\frac{U^n_{i-1,j}-2U^{n}_{i,j}+U^n_{i+1,j}}{dx^2} + \frac{U^n_{i,j-1}-2U^{n}_{i,j}+U^n_{i,j+1}}{dy^2}\right)\]

And this is the expression we are going to use in our code.

The choice of the spatial steps \(dx\) and \(dy\) (or, equivalently, of the number of spatial points \(N_x\) and \(N_y\)) is determined by the required spatial resolution. With the spatial steps set, the time step is limited by the stability of the numerical approximation of the equation, i.e. by:

\[dt \leq \frac{1}{2a}\frac{1}{\frac{1}{dx^2}+\frac{1}{dy^2}}=\frac{dx^2dy^2}{2a(dx^2+dy^2)}\]

We are going to use the maximum possible time step, as determined by the expression above.

Porting the code

Even though the problem is two-dimensional, we are still going to use one-dimensional arrays in CUDA. This allows for explicit control of the memory access pattern: we want neighboring threads to access neighboring elements of an array, so that the limited amount of cache is used efficiently. For that, we will use a helper function that computes the one-dimensional index from the two indices:

int getIndex(const int i, const int j, const int width)
{
    return i * width + j;
}

Here, width is the width of the two-dimensional array, and i and j are the two-dimensional indices.


Mapping of the two-dimensional array into a one-dimensional one. The numbers represent the index of the element in the 1D array.

Since we are going to use a non-trivial data access pattern in the following examples, it is a good idea to constantly check for errors. To avoid overloading the code with extra checks after every API call, we are going to use the cudaGetLastError(..) function from the CUDA API:

This function checks if there were any CUDA API errors in the previous calls and returns cudaSuccess if there were none. We will check the result and print an error message if this was not the case. In order to render a human-friendly string that describes the error, the cudaGetErrorString(..) function from the CUDA API will be used:

This will return a string that we are going to print in case there were errors.
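Put together, a typical error-checking helper looks something like the following. This is a sketch, not the code from the exercises: the macro name CUDA_CHECK and the exit-on-error behavior are our own choices; only cudaGetLastError(..) and cudaGetErrorString(..) come from the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check for errors from the preceding CUDA API calls and kernel launches.
// The macro name and the exit-on-error behavior are illustrative choices.
#define CUDA_CHECK()                                                      \
    do {                                                                  \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)
```

The do/while wrapper makes the macro behave like a single statement, so it can be used safely after any API call or kernel launch.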

We will also need to use the getIndex(..) function from the device code. To do so, we need to ask the compiler to compile it for device execution. This is done by adding a __device__ specifier to its definition, which makes the function available in the device code but not in the host code. Since we are also using it when populating the initial conditions, we need a __host__ specifier for this function as well.
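With both specifiers added, the helper can be called from the host and the device alike:

```cuda
// Compiled for both the host and the device
__host__ __device__ int getIndex(const int i, const int j, const int width)
{
    return i * width + j;
}
```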

Moving data ownership to the device

There are many possibilities to improve the performance of the current implementation. One of them is to reduce the number of host-to-device and device-to-host data transfers. Even though the transfers are relatively fast, they are much slower than accessing the memory in a kernel call. Eliminating the transfers is one of the most basic and most efficient improvements.

Note that in more complicated cases, eliminating the data transfers between host and device can be challenging, for instance, when not all the computational procedures are ported to the GPU. This may happen in the early stages of porting the code, or because it is more efficient to compute some parts of the algorithm on a CPU. In these cases, effort should be made to hide the copies behind the computations: the compute kernels and the copy calls use different resources, so these two operations can be done simultaneously. While the GPU is busy computing, the data can be copied in the background. One should also consider using the CPU efficiently: if everything is computed on the device, the host will be idling, which is a waste of resources. In some cases one can copy some data to the host memory, do the computations there and copy the data back while the device is still computing.
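One common way to overlap the copies with the computation is to issue them asynchronously in separate CUDA streams. The fragment below is a sketch with hypothetical kernel and buffer names, not code from this lesson; note also that the host buffers must be allocated as pinned memory (e.g. with cudaMallocHost(..)) for the copies to be truly asynchronous:

```cuda
// Hypothetical kernel and buffer names, for illustration only
cudaStream_t computeStream, copyStream;
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyStream);

// The kernel runs in one stream...
computeKernel<<<numBlocks, threadsPerBlock, 0, computeStream>>>(d_data);
// ...while a previously computed result is copied to the host in another
cudaMemcpyAsync(h_result, d_result, numBytes, cudaMemcpyDeviceToHost, copyStream);

// Wait for both operations to finish before using the results
cudaStreamSynchronize(computeStream);
cudaStreamSynchronize(copyStream);
```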

Removing unnecessary host-to-device and device-to-host data transfers can also be looked at as a change of data ownership: the device now holds the data and does all the computational procedures, and only occasionally the data is copied back to the CPU, e.g. for output. This is exactly the case in our code: there is nothing to compute between two consecutive time steps, so there is no need to copy the data to the host on each step. The data is only needed on the host for the output.

In the following exercise we will eliminate the unnecessary data transfers and make the device responsible for holding the current data.
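The resulting structure of the main loop can be sketched as follows. This is a fragment with hypothetical names (heatKernel, outputEvery, etc.), not the exercise solution; the point is that only the output steps trigger a copy:

```cuda
// The data lives on the device for the whole simulation
cudaMemcpy(d_Un, h_Un, numBytes, cudaMemcpyHostToDevice);

for (int n = 0; n < numSteps; n++)
{
    heatKernel<<<numBlocks, threadsPerBlock>>>(d_Un, d_Unp1, nx, ny, a, dt, dx, dy);

    // Copy to the host only when the data is needed for the output
    if (n % outputEvery == 0)
    {
        cudaMemcpy(h_Un, d_Unp1, numBytes, cudaMemcpyDeviceToHost);
        // ... write h_Un to a file ...
    }

    // Swap the pointers: the result becomes the input of the next step
    std::swap(d_Un, d_Unp1);
}
```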

Using shared memory

Another useful way to optimize the device code is to reduce the number of global memory calls in the GPU kernel. Even though the memory bandwidth is very high on modern GPUs, many threads are using it, and there is not that much cache to go with it either. Minimizing the calls to the global memory can drastically improve the computational efficiency of the application. Shared memory is the cache memory that is shared between the threads in a block. The access pattern in our GPU kernel is such that neighboring threads access neighboring data, which means that some of the data is accessed by several threads. In fact, each value of the grid function \(U^n_{ij}\) is read 5 times: once as the central point in thread (i,j) and once as a side point in each of the threads (i-1,j), (i+1,j), (i,j-1) and (i,j+1). What can be done instead is, at the beginning of the kernel call, each thread reads the value of its central point into the shared memory. Then we ask all the threads to wait until all the values are read. Once the data is ready, we proceed with the computation. Additionally, one will have to take care of the extra values at the borders of the thread block, which we will see while working on the example.

But before we start, we need to learn about the extra tools that we are going to need. There are two ways of allocating the shared memory: dynamic and static. The dynamic allocation is needed if the size of the required shared memory is not known at compilation time. In our case, we know exactly how much space is needed, so we will be using static allocation. To allocate shared memory of size N, one needs to add the following in the GPU kernel:

__shared__ float s_x[N];

The __shared__ modifier tells the compiler that this array should be allocated in the shared memory space. Note that we used the s_ prefix for the array. This is not necessary, but helps with code transparency.

We will also need to make sure that all threads in the block are done reading data and placing it into the shared memory before the computation starts. This can be done with a call to the __syncthreads() function inside the GPU kernel:

Calling this function blocks the execution of all threads until every thread in the block has reached the point where the call is made. Note that __syncthreads() should be called unconditionally, from all threads in the thread block, so that this point in the code can be reached by all the threads.
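The overall pattern inside the kernel then looks something like the sketch below. This is a simplified illustration that loads only the central points: the handling of the extra halo values at the block borders is omitted, and names such as BLOCK_SIZE and heatKernel are our own choices:

```cuda
#define BLOCK_SIZE 16

// Simplified sketch: each thread loads its central point into shared memory;
// the extra border (halo) values are not handled here.
__global__ void heatKernel(const float* Un, float* Unp1, int nx, int ny,
                           float a, float dt, float dx, float dy)
{
    __shared__ float s_Un[BLOCK_SIZE][BLOCK_SIZE];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Each thread reads its central point from the global memory once
    if (i < nx && j < ny)
        s_Un[threadIdx.x][threadIdx.y] = Un[i * ny + j];

    // Wait until all threads in the block have populated the shared memory
    __syncthreads();

    // ... compute the stencil reading from s_Un instead of Un ...
}
```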

In the following example, we will change the GPU kernel to use the shared memory to hold all the values needed for the current computational time step.