Allocate memory and transfer data

Allocating and releasing the memory on the GPU

In the previous example we learned how to start a GPU kernel. What was missing for it to be able to do anything meaningful is some data. Here, we will learn how to transfer data from host to device and back, and how to do basic operations on it, using an example of element-wise vector addition. Let us first go through the functions we are going to need.

In CUDA, the developer controls the data flow between host (CPU) and device (GPU) memory. To do so, one must declare the buffers that will be located in the device memory. It is usually convenient to “mirror” the host buffers, declaring and allocating buffers for the same data of the same size on both host and device; however, this is not a requirement. Note that in CUDA a device buffer is a basic pointer, and it can be easily confused with a host pointer. Using a device buffer on the host will most likely lead to a segmentation fault, so one must keep track of where the buffers are located. It is advisable to name buffers with a prefix that indicates where the buffer is located, e.g. use the h_ prefix for host memory and the d_ prefix for device memory. Declaring a device buffer is as simple as it is for a host buffer. For instance, declaring host and device buffers for a vector of floating point values x should look something like this:

float* h_x;
float* d_x;

Note that we have not yet specified where the buffers are located. This is decided when we allocate the memory. On the host, the allocation can be done by calling the malloc(..) or calloc(..) function.
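
For instance, a minimal sketch of the host allocation for numElements floating point values, using the h_x buffer declared above, could be:

h_x = (float*)malloc(numElements*sizeof(float));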

What is the difference between malloc(..) and calloc(..)?

  1. Only calloc(..) can be used for arrays.

  2. The difference is only in signatures, which makes calloc(..) more convenient to use for arrays. Both initialize the memory.

  3. The difference is only in signatures, which makes calloc(..) more convenient to use for arrays. Neither initializes the memory.

  4. Only calloc(..) initializes memory with zeroes.

To allocate a buffer in GPU memory, one has to call the CUDA API function cudaMalloc(..):
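
__host__ __device__ cudaError_t cudaMalloc(void** devPtr, size_t size)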

We are now getting used to these functions having access specifiers and returning cudaError_t. As the first argument, the function takes a pointer to the buffer in the device memory. The function allocates size bytes, as specified by the second argument, and updates the provided device buffer, setting it to the address of this allocation. Note that the function takes a pointer to the buffer, which is itself a pointer. This allows the function to modify the pointer and set it to where the memory is allocated.
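
For instance, allocating device memory for the d_x buffer declared above could look like this:

cudaMalloc((void**)&d_x, numElements*sizeof(float));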

To release the memory, the cudaFree(..) function should be used:
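
__host__ __device__ cudaError_t cudaFree(void* devPtr)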

Here, the pointer itself is not updated, so the function takes the device pointer directly rather than its address.
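
Releasing the device buffer allocated above is then simply:

cudaFree(d_x);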

Transferring the data from host to device and back

After the memory is allocated, we need to copy data from the host to the device buffer and back. This is done using the cudaMemcpy(..) function, which has the following signature:
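
__host__ cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)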

Both copies to and from the device buffer are done using the same function, and the direction of the copy is specified by the last argument, which is a cudaMemcpyKind enumeration. The enumeration can take the values cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice or cudaMemcpyDefault. All but the last are self-explanatory. Passing cudaMemcpyDefault will make the API deduce the direction of the transfer from the pointer values, but this requires unified virtual addressing. The second-to-last argument is the size of the data to be copied in bytes. The first two arguments can be either host or device pointers, depending on the direction of the transfer. This is where using the h_ and d_ prefixes comes in handy: this way we only need to remember the order in which the destination and the source arguments are specified. For instance, a host to device copy call should look something like this:

cudaMemcpy(d_x, h_x, numElements*sizeof(float), cudaMemcpyHostToDevice);

The names of the buffers suggest that the first argument (destination) is the device buffer and the second argument (source) is the host buffer. This means that we are executing a host to device copy, which is specified by the last argument. After the execution on the device is done, we have the data in the device memory and the results can be copied back to the host memory using:

cudaMemcpy(h_x, d_x, numElements*sizeof(float), cudaMemcpyDeviceToHost);

Defining the thread grid

Given that the threads on a GPU are organized in a hierarchical manner, the global index of a thread should be computed from its in-block index, the index of the execution block and the execution block size. To get the global thread index, one can start the kernel function with:

__global__ void gpu_kernel(..)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;
}

Here, threadIdx.x, blockIdx.x and blockDim.x are built-in variables that are always available inside the device function. They are, respectively, the index of the thread in a block, the index of the block and the size of the block. Here, we use a one-dimensional arrangement of blocks and threads (hence the .x). More on multi-dimensional grids and CUDA built-in simple types later; for now we assume that the rest of the components are equal to 1.

[Figure: BlocksAndThreads2.png]

A simple example of the division of threads (green squares) into blocks (cyan rectangles). The equally-sized blocks contain four threads each. The thread index starts from zero in each block. Hence the “global” thread index should be computed from the thread index, block index and block size. This is explained for thread #3 in block #2 (blue numbers). The total number of threads that are needed for the execution (N) can often not be a multiple of the block size, so some of the threads will be idling or producing unused data (red blocks).
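
For this highlighted thread, with zero-based block numbering as in CUDA, the global index works out to i = threadIdx.x + blockIdx.x*blockDim.x = 3 + 2*4 = 11.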

It is important to notice that the total number of threads in a grid is a multiple of the block size. This is not necessarily the case for the problem that we are solving: the length of the vectors we are summing can be non-divisible by the selected block size. So we either need to make sure that the threads with an index larger than the size of the vector don't do anything, or add padding to the vectors. We are going to use the former, simpler solution, by adding a conditional after the global thread index is computed:

__global__ void gpu_kernel(..)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < numElements)
   {
      ...
   }
}

Now the vectors can be addressed by the global index inside the conditional, the same way they are addressed in a loop of a CPU code. To have access to the buffers, we need to pass the device pointers as arguments to the kernel function, as we do with host pointers in the CPU code.
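
As an illustration, a vector addition kernel with device pointers passed as arguments could look along these lines (a minimal sketch; the argument names d_y and d_out are placeholders rather than the exact names used in the example code):

__global__ void gpu_kernel(const float* d_x, const float* d_y, float* d_out, int numElements)
{
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < numElements)
   {
      // Each thread adds one pair of elements
      d_out[i] = d_x[i] + d_y[i];
   }
}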

Now that the kernel is defined, we can call it from the host code. Since the kernel will be executed in a grid of threads, the kernel launch should be supplied with the configuration of the grid. In CUDA this is done by adding the kernel configuration, <<<numBlocks, threadsPerBlock>>>, to the function call:

gpu_kernel<<<numBlocks, threadsPerBlock>>>(..)

Here, numBlocks is the total number of thread blocks in the grid and threadsPerBlock is the number of threads in a single block. Note that these values can be integers, or can be two-dimensional or three-dimensional vectors if this is more suitable for the kernel. It is natural to use the one-dimensional layout for the vector addition problem, which is itself one-dimensional. In this case, the kernel configuration can be specified by two integer values. The threadsPerBlock value can be chosen fairly arbitrarily. It should be larger than the number of CUDA cores in the SM to fully occupy the device, but lower than the limit of 1024 (see the technical specifications). Values of 256 or 512 are frequently used. One has to make sure that the total number of threads (i.e. numBlocks*threadsPerBlock) is greater than or equal to the size of the vector. So numBlocks can be defined as numElements/threadsPerBlock + 1, where numElements is the number of elements in the vector.
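
Putting this together, the launch could be configured along these lines (a sketch using the kernel signature from the illustration above; 256 threads per block is just one common choice):

int threadsPerBlock = 256;
int numBlocks = numElements/threadsPerBlock + 1;
gpu_kernel<<<numBlocks, threadsPerBlock>>>(d_x, d_y, d_out, numElements);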

Exercise 1

What will happen if we execute the code as it is (“Add GPU data management” tab above)?

  1. It will not compile.

  2. The output will be the same - we are still computing everything on the CPU.

  3. The results will be zero.

  4. The results can be anything.

In the next example we will be adding two vectors together. This will allow us to get familiar with the basic CUDA API functions that are essential for writing code in CUDA.

Built-in CUDA vector types

CUDA has built-in vector types derived from the basic integer and floating point types. They are structures of 1, 2, 3 and 4 components that can be accessed through the fields x, y, z and w respectively. For instance, the float3 type has x, y and z fields. All these types come with a constructor function, for instance:

int2 make_int2(int x, int y);

Built-in data types are not only convenient to use in many cases, but can also improve the overall performance of the code, since the data in these types are aligned for an optimal access pattern. We already encountered the built-in data types when we were computing the global thread index in the previous example. The threadIdx and blockIdx variables are of type uint3, and blockDim is of the closely related dim3 type; their three components reflect the dimensionality of the grid of threads.
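
As a small illustration, a float3 value can be constructed and its components accessed like this:

float3 v = make_float3(1.0f, 2.0f, 3.0f);
float sum = v.x + v.y + v.z;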

Exercise 2

How can one make the code compile with gcc?

  1. Add CUDA libraries to the LD_LIBRARY_PATH.

  2. Add CUDA include folder to CPATH.

  3. Both 1 and 2.

  4. Define struct float3 {float x, y, z;};

Two-dimensional grids

The numBlocks and threadsPerBlock variables are of another special type, dim3, which behaves like a uint3 whose unspecified components are initialized to 1. This makes it possible to define just one dimension for a one-dimensional grid, as we did in the example above. Having more than one dimension can be useful when working with two- or three-dimensional spaces, or when working with matrices, as in the following example.
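
Before that example, here is a preview of what such a configuration could look like (a sketch; nx, ny and the 16×16 block shape are placeholder choices):

dim3 threadsPerBlock(16, 16);
dim3 numBlocks(nx/threadsPerBlock.x + 1, ny/threadsPerBlock.y + 1);
gpu_kernel<<<numBlocks, threadsPerBlock>>>(..);

Inside the kernel, the second index is then obtained in the same way as the first, e.g. int j = threadIdx.y + blockIdx.y*blockDim.y;.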

Exercise 3