Introduction to CUDA

Hello, CUDA!

Let us start familiarizing ourselves with CUDA by writing a simple “Hello CUDA” program, which will query all available devices and print some information about them. We will start with basic .cpp code, change it so that it can be compiled by the CUDA compiler, and add a couple of CUDA API calls to see what devices are available.
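A minimal sketch of such a program, using the cudaGetDeviceCount(..) and cudaGetDeviceProperties(..) API calls, could look something like this (the particular properties printed here are chosen just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
   int numDevices = 0;
   cudaGetDeviceCount(&numDevices);       // how many CUDA devices are visible
   printf("Found %d CUDA device(s)\n", numDevices);
   for (int d = 0; d < numDevices; d++)
   {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, d);  // fill the properties structure for device d
      printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
             d, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
   }
   return 0;
}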

Allocate memory, transfer data and execute kernels

In the next example we will be adding two vectors together. This will allow us to get familiar with the basic CUDA API functions that are essential for writing code in CUDA.

We now know that a CUDA device is available and that we can execute simple code on it. In CUDA, the developer must control the data flow between host (CPU) and device (GPU) memory. To do so, one must declare the buffers that will be located in the device memory. It is usually convenient to “mirror” the host buffers, declaring and allocating buffers of the same size for the same data on both host and device, although this is not a requirement. Note that in CUDA a device buffer is a plain pointer, so it can easily be confused with a host pointer. Dereferencing a device buffer on the host will most likely lead to a segmentation fault, so one must keep track of where each buffer is located. It is advisable to use a prefix that indicates where the buffer resides, e.g. an h_ prefix for host memory and a d_ prefix for device memory. Declaring a device buffer is as simple as declaring a host buffer. For instance, declaring host and device buffers for a vector of floating point values x should look something like this:

float* h_x;
float* d_x;

Note that we have not yet specified where the buffers are located. This is done when we allocate the memory. On the host this can be done by calling the malloc(..) or calloc(..) function.
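For instance, assuming numElements holds the length of the vector, the host buffer declared above can be allocated with:

h_x = (float*)malloc(numElements*sizeof(float));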

What is the difference between malloc(..) and calloc(..)?

  1. Only calloc(..) can be used for arrays.

  2. The difference is only in signatures, which makes calloc(..) more convenient to use for arrays. Both initialize the memory.

  3. The difference is only in signatures, which makes calloc(..) more convenient to use for arrays. Neither initializes the memory.

  4. Only calloc(..) initializes memory with zeroes.

To allocate a buffer in GPU memory, one has to call the CUDA API function cudaMalloc(..):
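__host__ __device__ cudaError_t cudaMalloc(void** devPtr, size_t size);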

We are now getting used to these functions having access specifiers and returning cudaError_t. As its first argument, the function takes a pointer to the buffer in the device memory. The function allocates size bytes, as specified by the second argument, and updates the provided device buffer with the address of this allocation. Note that the function takes a pointer to the buffer, which is itself a pointer. This allows the function to update the pointer with the address of the allocated memory.
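For instance, allocating the device buffer d_x for numElements floating point values (using the names from the examples above) looks something like this:

cudaMalloc((void**)&d_x, numElements*sizeof(float));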

To release the memory, the cudaFree(..) function should be used:
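__host__ __device__ cudaError_t cudaFree(void* devPtr);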

After the memory is allocated, we need to copy data from the host to the device buffer and back. This is done using the cudaMemcpy(..) function, which has the following signature:
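__host__ cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);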

Both copies to and from the device buffer are done using the same function, and the direction of the copy is specified by the last argument, which is a cudaMemcpyKind enumeration. The enumeration can take the values cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice or cudaMemcpyDefault. All but the last are self-explanatory. Passing cudaMemcpyDefault makes the API deduce the direction of the transfer from the pointer values, but this requires unified virtual addressing. The second-to-last argument is the size of the data to be copied, in bytes. The first two arguments can be either host or device pointers, depending on the direction of the transfer. This is where the h_ and d_ prefixes come in handy: with them we only need to remember the order in which the destination and source arguments are specified. For instance, a host to device copy should look something like this:

cudaMemcpy(d_x, h_x, numElements*sizeof(float), cudaMemcpyHostToDevice);

The names of the buffers suggest that the first argument (the destination) is the device buffer and the second argument (the source) is the host buffer. This means that we are executing a host to device copy, which is specified by the last argument. After the execution on the device is done, the data is in the device memory and the results can be copied back to the host memory using:

cudaMemcpy(h_x, d_x, numElements*sizeof(float), cudaMemcpyDeviceToHost);

What will happen if we execute the code as it is (“Add GPU data management” tab above)?

  1. It will not compile.

  2. The output will be the same - we are still computing everything on the CPU.

  3. The results will be zero.

  4. The results can be anything.

We are finally ready to define the function that will be executed on the device (usually called a GPU kernel). Kernels are defined using another function specifier, called __global__:

__global__ void gpu_kernel(..)

What __global__ essentially means is that the function is called from the host code, but executed on the device. Since this function will be executed by many threads, the return value must be void: otherwise it would not be clear which of the threads should do the return. The rest of the function definition is the same as for any C function: its name has the same limitations as a normal C function, it can have any number of arguments of any type, and it can even be templated. Since the call of the kernel function happens in the host code but it is executed on the device, this place in the code marks a transition from single-thread execution to many-thread execution. One can think of it as a loop, each iteration of which is executed simultaneously. As in a loop, one needs an index to differentiate the threads. Here things get a little more complicated, and we need to step back and recall how GPUs are organized at the hardware level.

The GPU contains several Streaming Multiprocessors (SMs), each with many compute units. Every compute unit can execute commands. So the entire GPU is first divided into streaming multiprocessors, and each multiprocessor contains many execution units. To reflect this hierarchy at the software level, threads are grouped into identically sized blocks. Each block is assigned to a streaming multiprocessor for execution. This collection of thread blocks is usually called a “grid”, which can also be multi-dimensional.

Although it may seem a bit complicated at first, this grouping of threads opens extra opportunities for synchronization and data exchange. Since threads in a block are executed on the same SM, they can share data and communicate quickly. This can be leveraged when designing and optimizing code for GPU execution, and we will touch on this topic later.

Given that the threads on a GPU are organized in a hierarchical manner, the global index of a thread should be computed from its in-block index, the index of its block and the block size. To get the global thread index, one can start the kernel function with:

__global__ void gpu_kernel(..)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;
}

Here, threadIdx.x, blockIdx.x and blockDim.x are built-in variables that are always available inside a device function. They are, respectively, the index of the thread within its block, the index of the block and the size of the block.

Figure: BlocksAndThreads.png

A simple example of the division of threads (green squares) into blocks (cyan rectangles). The equally-sized blocks contain four threads each. The thread index starts from zero in each block. Hence the “global” thread index has to be computed from the thread index, block index and block size. This is illustrated for thread #3 in block #2 (blue numbers). The total number of threads needed for the execution (N) can often not be a multiple of the block size, and some of the threads will be idling or producing unused data (red blocks).

Here, we use a one-dimensional arrangement of blocks and threads (hence the .x). More on multi-dimensional grids and CUDA built-in simple types later; for now we assume that the rest of the components are equal to 1. Since the index i is unique for each thread in the entire grid, it is usually called the “global” index. It is important to notice that the total number of threads in a grid is a multiple of the block size. This is not necessarily the case for the problem that we are solving: the length of the vectors we are summing can be non-divisible by the selected block size. So we either need to make sure that the threads whose index is larger than the size of the vector do not do anything, or add padding to the vectors. We are going to use the former, simpler solution, by adding a conditional after the global thread index is computed:

__global__ void gpu_kernel(..)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < numElements)
   {
      ...
   }
}

Now the vectors can be addressed by the global index inside the conditional, the same way they are addressed in a loop in CPU code. To have access to the buffers, we need to pass the device pointers to the kernel function, just as we do with host pointers in CPU code.
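For the vector addition example, the complete kernel could look something like the sketch below (the kernel name and the second input and output buffers, d_y and d_out, are illustrative):

__global__ void gpu_add_vectors(const float* d_x, const float* d_y, float* d_out, int numElements)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;   // global thread index
   if (i < numElements)                           // guard against out-of-range threads
   {
      d_out[i] = d_x[i] + d_y[i];
   }
}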

Now that the kernel is defined, we can call it from the host code. Since the kernel will be executed in a grid of threads, the kernel launch has to be supplied with the configuration of the grid. In CUDA this is done by adding the kernel configuration, <<<numBlocks, threadsPerBlock>>>, to the function call:

gpu_kernel<<<numBlocks, threadsPerBlock>>>(..)

Here, numBlocks is the total number of thread blocks in the grid and threadsPerBlock is the number of threads in a single block. Note that these values can be integers, or they can be two-dimensional or three-dimensional vectors, if this is more suitable for the kernel. It is natural to use a one-dimensional layout for the vector addition problem, which is itself one-dimensional. In this case, the kernel configuration can be specified by two integer values. The value of threadsPerBlock can be chosen fairly arbitrarily. It should be larger than the number of CUDA cores in an SM to fully occupy the device, but lower than the limit of 1024 (see the technical specifications). Values of 256 or 512 are frequently used. One has to make sure that the total number of threads (i.e. numBlocks*threadsPerBlock) is greater than or equal to the size of the vector, so numBlocks can be defined as numElements/threadsPerBlock + 1, where numElements is the number of elements in the vector.
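Putting this together for the vector addition (reusing the illustrative gpu_add_vectors kernel and the d_y and d_out buffers sketched above), the launch could look like this:

int threadsPerBlock = 256;
int numBlocks = numElements/threadsPerBlock + 1;   // enough blocks to cover the whole vector
gpu_add_vectors<<<numBlocks, threadsPerBlock>>>(d_x, d_y, d_out, numElements);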

Built-in CUDA vector types

CUDA has built-in vector types derived from the basic integer and floating point types. They are structures of 1, 2, 3 and 4 components that can be accessed through the fields x, y, z and w respectively. For instance, the float3 type has x, y and z fields. All these types come with a constructor function, for instance:

int2 make_int2(int x, int y);

Built-in data types are not only convenient to use in many cases, but can also improve the overall performance of the code, since the data in these types is aligned for an optimal access pattern. We already encountered the built-in data types when we computed the global thread index in the previous example. The threadIdx and blockIdx variables are of type uint3, and blockDim is of the closely related dim3 type; both reflect the dimensionality of the grid of threads.
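As a small illustration (not tied to the example above), a three-component vector can be created and its fields accessed like this:

float3 r = make_float3(1.0f, 2.0f, 3.0f);   // constructor function for float3
float norm2 = r.x*r.x + r.y*r.y + r.z*r.z;  // access the individual components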

How can one make the code compile with gcc?

  1. Add CUDA libraries to the LD_LIBRARY_PATH.

  2. Add CUDA include folder to CPATH.

  3. Both 1 and 2.

  4. Define struct float3 {float x, y, z;};

The numBlocks and threadsPerBlock variables are also of the special dim3 type, which is essentially a uint3 with unspecified components initialized to 1. This makes it possible to define just one dimension for a one-dimensional grid, as we did in the example above. Having more than one dimension can be useful when working with two- or three-dimensional spaces, or when working with matrices, as in the following example.
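For instance, a two-dimensional configuration for a hypothetical matrix kernel, with N and M being the matrix dimensions, could be set up like this:

dim3 threadsPerBlock(16, 16);            // 16*16 = 256 threads per block; the third dimension defaults to 1
dim3 numBlocks(N/threadsPerBlock.x + 1,
               M/threadsPerBlock.y + 1); // enough blocks to cover all N x M elements
matrix_kernel<<<numBlocks, threadsPerBlock>>>(..);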