Optimizing the GPU kernel

1. Reduction to a single value

In this part of the workshop, we are going to investigate how an individual kernel can take advantage of the features provided by the GPU. As an example, we are going to use the reduction problem. Consider a vector of N elements that contains floating point numbers from 0 to 1. The task is to compute the sum of all the elements in the vector. The basic approach in a serial code is to designate a single floating point variable and add the elements to it one by one in a loop. Thus, on each iteration, one floating point number is added to the accumulating value, as indicated in the figure below.

../_images/ENCCS-OpenACC-CUDA_Reduction_cpu_1.png

A reduction of an array to a single value. Each iteration (unfolded in the figure) takes one element of the vector and adds it to the accumulated value.
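In plain C, a minimal sketch of this serial reduction could look as follows (the function and variable names are illustrative, not taken from the workshop code):

float reduce_cpu(const float* data, int numElements)
{
    float sum = 0.0f;
    for (int i = 0; i < numElements; i++)
    {
        sum += data[i];  // one element is added to the accumulated value per iteration
    }
    return sum;
}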

On a GPU, many threads are running simultaneously. This means that the simple approach of adding values to a single number will not work out of the box: we simply don't know which of the threads will be adding its respective value at a given time. If two threads try to add to the value simultaneously, we can encounter a problem called a race condition. In particular, if thread number one is to make a binary addition, it should first read the accumulated value. If, at the same time, thread number two does the same thing, it will read the value first as well. If this happens before the first thread saves its result, the result from the first thread will be overwritten and the addition from thread one will be lost. To avoid race conditions, the summation has to be done in a thread-safe fashion. Luckily, CUDA provides atomic functions that ensure that all the operations are done in a thread-safe manner. Atomic functions have the following signature:
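// Single-precision overload of atomicAdd; overloads for other types (int, unsigned int, double, ...) exist as well
float atomicAdd(float* address, float val);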

These can only be called from the device, where the application runs in many threads. The function takes the address where the data will be atomically added, and a value to add. Note that atomic operations can be performed at system level, device level, and thread-block level, and these will result in different performance. Let us first adapt a simple CPU code that accumulates values into a single variable to run on a GPU. Note that this approach is not desirable from either a correctness or a performance standpoint, as we will see later. The figure below illustrates how the CPU algorithm can be adapted to run on a GPU with atomic operations. Since the number of threads on a GPU is a multiple of the block size, some threads may try to access elements that are out of bounds. So either an extra condition should be added, or the array should be extended with zeroes, as shown in the figure.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_1.png

A reduction of an array to a single value on a GPU. The data is copied to the GPU memory, where each thread adds one element to the accumulated value. Note that thread-safe atomic operations have to be used in order to ensure that there are no race conditions. Many threads run simultaneously on a GPU, so there is no need for a loop over the indices.
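A minimal sketch of such a kernel could look like this (the kernel and variable names are illustrative, not taken from the workshop code):

__global__ void reduce_atomic_kernel(const float* data, float* result, int numElements)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < numElements)              // guard against out-of-bounds accesses
    {
        atomicAdd(result, data[i]);   // thread-safe accumulation into a single value
    }
}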

Before we start optimizing the GPU code, we need to fix one big problem with our approach: on both the CPU and the GPU, the sum becomes invalid for arrays of large size. Indeed, we are summing random values between 0 and 1. If the number of these values is large enough, the sum should be approximately half the number of elements. But running the code for \(10^8\) elements results in a number that is significantly lower.

Why is the number significantly lower than expected for large vectors? How can one fix this?

Try running the program for 100000000 elements. What is the expected reduction value? Compare it with what you are getting.

There is another problem with the GPU code as well. The reduction runs in many threads that all access the same location in memory atomically. One should expect a huge queue of threads trying to save their data. The good thing is that solving the first problem also helps us solve the second one, as we will see below.

2. Pair-wise reduction

Let us first fix the CPU code, so that the result will be correct for larger arrays. The figure below shows one way in which the correctness can be fixed even for large arrays. The idea is to make sure that only numbers of similar magnitude are added together. This can be done by summing the elements in pairs. These pair-wise sums should be of similar magnitude as well, so the procedure can be repeated until the final value is obtained.

../_images/ENCCS-OpenACC-CUDA_Reduction_cpu_2.png

A pair-wise reduction algorithm on a CPU. The array is split into pairs, which are added together, resulting in an array of half the size. The procedure is then repeated until all the values are added.

Let us fix the CPU code with the approach described by the figure above.
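A minimal sketch of such a pair-wise CPU reduction could look as follows (it assumes that the input array may be modified in place and that numElements is at least 1; the names are illustrative):

float reduce_cpu_pairwise(float* data, int numElements)
{
    for (int n = numElements; n > 1; n = (n + 1) / 2)
    {
        for (int i = 0; i < n / 2; i++)
        {
            data[i] = data[2 * i] + data[2 * i + 1];  // add the elements pair-wise
        }
        if (n % 2 != 0)
        {
            data[n / 2] = data[n - 1];                // carry the odd element over
        }
    }
    return data[0];
}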

Having this CPU version gives us a reference that can be handy while adapting the GPU code.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_2.png

Mapping the pair-wise addition algorithm to CUDA. Each kernel call does one binary addition per GPU thread. The execution is then returned to the CPU so that all the threads are in sync. The kernel is called again with the new array as an input. This continues until only one element is left. The numbers in the circles indicate which thread does the specific operation. The values that are out of bounds are set to zero to make sure that all threads get the data.

Let us use the same approach to fix the GPU code.
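A minimal sketch of such a kernel could look like this (names are illustrative; out-of-bounds values are treated as zeroes, as in the figure above):

__global__ void reduce_pairwise_kernel(const float* in, float* out, int numElements)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float a = (2 * i     < numElements) ? in[2 * i]     : 0.0f;
    float b = (2 * i + 1 < numElements) ? in[2 * i + 1] : 0.0f;
    if (i < (numElements + 1) / 2)
    {
        out[i] = a + b;   // one binary addition per thread
    }
}

On the host side, the kernel is launched repeatedly, swapping the input and output buffers and halving the number of elements after each call, until only one element is left.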

Now we have ensured that the result is correct. Also note that the performance of the new implementation is quite a lot better: we got rid of the bottleneck of many threads writing to the same memory address simultaneously. In many cases, this first round of optimization would be sufficient for the GPU to outperform the CPU. However, there is still huge room for improvement in terms of performance.

3. Using shared memory

The first issue we are going to address is the number of kernel launches we currently do. Each CUDA API call has an overhead, which we want to reduce. Also, we have to read the input data from and write the output to the global memory in each kernel call. We can address both of these issues by using shared memory. Shared memory allows the GPU threads within a block to communicate with one another. Hence, the reduction of all the values inside a thread block can be done in just one kernel call. The shared memory can be allocated in two ways: statically and dynamically. With the first option, we need to know how much shared memory we are going to need at compile time. To have this memory available, add the following line inside the GPU kernel:

__shared__ float s_data[BLOCK_SIZE];

The __shared__ modifier tells the compiler that this array should be allocated in the shared memory space. Note that we used the s_ prefix for the array. This is not necessary, but it helps with the code transparency.

The second option allows one to define the size of the shared memory array at run time. It is more flexible, since the size needed can vary from one kernel call to another. To declare the shared memory within the kernel, add the following line:

extern __shared__ float s_data[];

Note two differences here. First, the declaration now has the extern keyword. This tells the compiler to expect the size of the shared memory to be defined dynamically. For the same reason, the size of the array is not specified here. Instead, we will need to provide a third argument to the kernel launch configuration:

gpu_kernel<<<numBlocks, threadsPerBlock, sharedMemorySizeInBytes>>>(..)

Note that the size should be specified in bytes (e.g. 4 bytes per element for an array of floats). One extra benefit of the dynamically defined shared memory is that it can easily be recycled within the kernel, i.e. having the following lines in the kernel allows one to use the shared memory for both floating point and integer values:

extern __shared__ float s_dataFloat[];
..
extern __shared__ int s_dataInt[];

Note that one should be careful not to overwrite the data: the same memory address will be used by both arrays. So s_dataInt should only be used when s_dataFloat is not needed any more.

We will need one array element per thread in a block, i.e. the number of elements is equal to the block size. This is defined at compile time, so both options are suitable for us.

Since the threads within a block are executed in parallel, we will also need a means to synchronize them. In CUDA, this can be done with a call to the __syncthreads() function inside the GPU kernel:
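// Barrier: blocks until all threads of the thread block have reached this point
void __syncthreads();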

Calling this function will block all the threads from further execution until they all reach the point where this function call is made. Note that __syncthreads() should be called unconditionally, from all threads in the thread block, so that the point in the code where it is called can be reached by all the threads.

The following figure shows how the modified code will work. We read the data from the global memory into the shared memory, reduce the data to a single value, which is then saved to the global memory before the kernel exits. Note that we will need to synchronize the threads in multiple places to make sure that they have all reached an intermediate checkpoint. A sketch of such a kernel is given after the figure.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_3.png

A reduction algorithm that uses the shared memory. The data is copied to the GPU global memory. Each thread then saves one value into the shared memory. The kernel then executes until all the data from the shared memory is reduced into one value. The procedure repeats until all the remaining data fits into a single thread block. Note that each thread uses its own address in shared memory to save the data. This is done to ensure that the data is not overwritten and to avoid extra synchronizations between threads.
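A minimal sketch of a block-wise reduction with shared memory could look like this (it follows the interleaved scheme from the figure and assumes the block size is a power of two; the names are illustrative):

__global__ void reduce_shared_kernel(const float* in, float* out, int numElements)
{
    extern __shared__ float s_data[];                 // dynamically sized shared memory
    int tid = threadIdx.x;
    int i   = threadIdx.x + blockIdx.x * blockDim.x;

    s_data[tid] = (i < numElements) ? in[i] : 0.0f;   // each thread saves one value
    __syncthreads();                                  // make sure all the loads are done

    for (int s = 1; s < blockDim.x; s *= 2)
    {
        if (tid % (2 * s) == 0)
        {
            s_data[tid] += s_data[tid + s];           // each thread writes to its own address
        }
        __syncthreads();                              // sync before the next iteration
    }

    if (tid == 0)
    {
        out[blockIdx.x] = s_data[0];                  // one partial sum per thread block
    }
}

The kernel is launched with threadsPerBlock * sizeof(float) bytes of dynamic shared memory and is called again on the per-block partial sums until only one value is left.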

4. Reduce thread divergence

In the previous kernel, we ask the thread that corresponds to the value being reduced to do the work. This is not efficient on a GPU: neighboring threads will diverge from one another. The next optimization step will fix that. Let us try to modify the code so that the first threads in the block do the reduction.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_4.png

This figure may look similar to the previous one. But have a look at the numbers in the gray circles. They are the indices of the threads that do the reduction. As one can see, they are now sequential, meaning that neighboring threads are more likely to take the same path in the conditionals. This is especially important for the threads within one warp, where both paths are taken if the divergence occurs.
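Only the reduction loop inside the kernel changes; a sketch of the modified loop (same illustrative names as above) could look like this:

for (int s = 1; s < blockDim.x; s *= 2)
{
    int index = 2 * s * tid;                  // the first, sequential threads do the work
    if (index < blockDim.x)
    {
        s_data[index] += s_data[index + s];
    }
    __syncthreads();
}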

5. Sequential memory access

Now that consecutive threads do the work, we can address another issue with the code: the memory access pattern. Even though the GPU has a relatively fast memory bus, it is utilized by many threads simultaneously. To add to the problem, the cache size is small relative to that of the CPU: GPUs are designed to pack as many cores as possible, so fewer transistors are left for the local memory. This makes the memory access pattern one of the most important things to consider when optimizing kernels.

Let us change the kernel so that sequential GPU threads read sequential memory addresses. Since two values are added at a time, they will be separated by an offset that is large enough to accommodate the other threads. This means that the shared memory array is split into two parts at each iteration: one holding the first values for all the threads, the other holding the second values. The offset, or separation value, is reduced from one iteration to the next as fewer values remain to be reduced.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_5.png

A scheme for the algorithm where the memory is accessed sequentially. At each iteration the values to be reduced are split into two equal parts, which are read sequentially by sequential threads. With fewer values left to reduce, the offset decreases, until it is equal to one for the last pair. Note that all the relevant values are kept at the beginning of the array, thus the data reads are less scattered.
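The reduction loop can be sketched as follows (assuming, as before, that the block size is a power of two):

for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
    if (tid < s)
    {
        s_data[tid] += s_data[tid + s];   // the two values are separated by the offset s
    }
    __syncthreads();
}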

6. Load two values at a time

At the very first iteration, half of the threads are not doing any reduction. The only thing that the second half of the threads are doing is loading the data into the shared memory. This can easily be fixed by loading two numbers in each thread and reducing them before saving the result to the shared memory. In this case all threads will have some computations to do and fewer resources will be wasted.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_6.png

Only the part of the algorithm that needs changing is shown. Each thread now takes two values from the global memory and reduces them immediately, storing the result in the respective location in shared memory.
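Only the loading part of the kernel changes; a sketch (each block now covers 2 * blockDim.x input elements; the names are illustrative):

int tid = threadIdx.x;
int i   = blockIdx.x * (2 * blockDim.x) + threadIdx.x;
float a = (i              < numElements) ? in[i]              : 0.0f;
float b = (i + blockDim.x < numElements) ? in[i + blockDim.x] : 0.0f;
s_data[tid] = a + b;                      // the first reduction happens already during the load
__syncthreads();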

7. Unroll the last warp

GPUs are often referred to as having a Single Instruction Multiple Threads (SIMT) architecture. This is to distinguish them from Single Instruction Multiple Data (SIMD) devices. The main difference is that different threads can execute different instructions. However, this is only true when the threads in question are not in the same warp. A warp is a unit of threads that executes the same instructions for all its threads. Within a warp, any thread divergence will cause both paths to be executed by every thread, even when only one of them takes the alternative path. On NVIDIA GPUs, the warp is a unit of 32 threads, which means that when we get down to that many threads, special care should be taken to make sure that there is no divergence. In fact, even checking the conditional will slow the execution down. The good thing is that, inside the warp, all the threads do the same operation at the same time, which can be used to remove explicit synchronization calls.

In our code, we gradually reduce the number of active threads from the block width down to 2 on the last iteration. When the number of active threads reaches the warp size, all the active threads are within the same warp and we can manually unroll the last iterations. While doing so, we will ask all the threads to do the reduction, not only those that produce the numbers needed at the next iteration. It may look like we are asking the GPU to do extra work but, in fact, we are removing extra conditional checks. Indeed, the inactive threads would have taken a different path where they do nothing. But since there are threads that actually do the work, the inactive threads would idle anyway while this is happening, since they are in the same warp.

../_images/ENCCS-OpenACC-CUDA_Reduction_gpu_7.png

Last-warp reduction for a warp of size 4 (indicated by dashed lines). Only the changed part of the algorithm is shown. Every thread computes the binary reduction at each iteration, which allows one to remove the conditional. Even though this leads to computing values that are not used, the reduction in thread divergence inside a warp gives better performance.
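For the real warp size of 32, a sketch of the unrolled last warp could look like this (it assumes a block size of at least 64 and relies on the classic warp-synchronous trick; on recent architectures explicit __syncwarp() calls or warp shuffle intrinsics are the safer choice):

__device__ void warpReduce(volatile float* s_data, int tid)
{
    // volatile prevents the compiler from caching s_data values in registers
    s_data[tid] += s_data[tid + 32];
    s_data[tid] += s_data[tid + 16];
    s_data[tid] += s_data[tid +  8];
    s_data[tid] += s_data[tid +  4];
    s_data[tid] += s_data[tid +  2];
    s_data[tid] += s_data[tid +  1];
}

In the main reduction loop, the iterations stop at s > 32 and the remaining work is done without conditionals or synchronization:

for (int s = blockDim.x / 2; s > 32; s >>= 1)
{
    if (tid < s)
    {
        s_data[tid] += s_data[tid + s];
    }
    __syncthreads();
}
if (tid < 32)
{
    warpReduce(s_data, tid);
}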

8. Further improvements

There is more one can do with the current code to get even better performance. Please see this excellent presentation from Mark Harris (NVIDIA) for some ideas.