Using CUDA API

List available devices and their properties

Let us start familiarizing ourselves with CUDA by writing a simple “Hello CUDA” program that queries all available devices and prints some information about them. We will start with basic .cpp code, change it so that it is compiled by the CUDA compiler, and add a few CUDA API calls to see which devices are available.

To do that, we need a couple of CUDA API functions. First, we want to ask the API how many CUDA-capable devices are available, which is done with the following function:

cudaGetDeviceCount(..)

__host__ __device__ cudaError_t cudaGetDeviceCount(int* numDevices)

The function calls the API and writes the number of available devices to the address provided as its only argument. There are a couple of things to notice here. First, the function is declared with two CUDA specifiers, __host__ and __device__, which means that it is available in both host and device code. Second, like most CUDA calls, this function returns a value of the cudaError_t enumeration type, which holds an error code if something went wrong and cudaSuccess otherwise. Because the return value carries the status, the actual number of devices comes back through the argument: one declares an integer and passes a pointer to it, and the function updates the value at this address. This type of signature is quite common among CUDA functions: most of them return cudaError_t and take a pointer for the actual output.

With the number of devices known, we can cycle through them and check what kinds of devices are available, along with their names and capabilities. In CUDA, these are stored in the cudaDeviceProp structure. This structure contains extensive information on the device, for instance its name (prop.name), major and minor compute capability (prop.major and prop.minor), number of streaming multiprocessors (prop.multiProcessorCount), core clock in kHz (prop.clockRate) and available global memory in bytes (prop.totalGlobalMem). See the cudaDeviceProp API reference for the full list of fields in the structure. To populate the cudaDeviceProp structure, CUDA provides the cudaGetDeviceProperties(..) function:

cudaGetDeviceProperties(..)

__host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int deviceId)

The function has the __host__ specifier, which means that it cannot be called from device code. It also returns a cudaError_t value, which is cudaErrorInvalidDevice if we try to get the properties of a non-existent device (e.g. when deviceId is equal to or larger than numDevices above). The function takes a pointer to the cudaDeviceProp structure, to which the data is saved, and the integer index of the device to query. The following code retrieves information on the first device in the system (the one with deviceId = 0).

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

Exercise

Getting the information on available devices using CUDA API

#include <stdio.h>

int main()
{
    printf("I can see %d device(s) if the code is compiled with gcc.\n", 0);
    return 0;
}

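#include <stdio.h>

int main()
{
    // Versions are encoded as 1000 * major + 10 * minor
    int driverVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    printf("CUDA driver: %d\n", driverVersion);

    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA runtime: %d\n", runtimeVersion);

    // The count is returned through the pointer, the status through the return value
    int         numDevices = 0;
    cudaError_t stat = cudaGetDeviceCount(&numDevices);
    if (stat != cudaSuccess)
    {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(stat));
        return 1;
    }

    for (int i = 0; i < numDevices; i++)
    {
        cudaDeviceProp prop;
        stat = cudaGetDeviceProperties(&prop, i);
        if (stat != cudaSuccess)
        {
            printf("%d: cudaGetDeviceProperties failed: %s\n", i, cudaGetErrorString(stat));
            continue;
        }

        // clockRate is in kHz, totalGlobalMem is in bytes (size_t, hence %zu)
        printf("%d: %s, CC %d.%d, %d SMs running at %dMHz, %zuMB\n", i, prop.name,
            prop.major, prop.minor,
            prop.multiProcessorCount,
            prop.clockRate/1000,
            prop.totalGlobalMem/1024/1024);
    }

    return 0;
}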

#include <stdio.h>

// Beginning of GPU Architecture definitions
inline int _ConvertSMVer2Cores(int major, int minor) {
  // Defines for GPU Architecture types (using the SM version to determine
  // the # of cores per SM)
  typedef struct {
    int SM;  // 0xMm (hexadecimal notation), M = SM Major version,
    // and m = SM minor version
    int Cores;
  } sSMtoCores;

  sSMtoCores nGpuArchCoresPerSM[] = {
      {0x30, 192},
      {0x32, 192},
      {0x35, 192},
      {0x37, 192},
      {0x50, 128},
      {0x52, 128},
      {0x53, 128},
      {0x60,  64},
      {0x61, 128},
      {0x62, 128},
      {0x70,  64},
      {0x72,  64},
      {0x75,  64},
      {0x80,  64},
      {0x86, 128},
      {-1, -1}};

  int index = 0;

  while (nGpuArchCoresPerSM[index].SM != -1) {
    if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor)) {
      return nGpuArchCoresPerSM[index].Cores;
    }

    index++;
  }
  return 0;
}

int main()
{
    int driverVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    printf("CUDA driver: %d\n", driverVersion);

    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA runtime: %d\n", runtimeVersion);

    int         numDevices = 0;
    cudaError_t stat = cudaGetDeviceCount(&numDevices);
    if (stat != cudaSuccess)
    {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(stat));
        return 1;
    }

    for (int i = 0; i < numDevices; i++)
    {
        cudaDeviceProp prop;
        stat = cudaGetDeviceProperties(&prop, i);
        if (stat != cudaSuccess)
        {
            printf("%d: cudaGetDeviceProperties failed: %s\n", i, cudaGetErrorString(stat));
            continue;
        }

        // Cores per multiprocessor; 0 if the compute capability is not in the table
        int coresPerSM = _ConvertSMVer2Cores(prop.major, prop.minor);

        printf("%d: %s, CC %d.%d, %dx%d=%d@%dMHz CUDA cores, %zuMB\n", i, prop.name,
            prop.major, prop.minor,
            prop.multiProcessorCount, coresPerSM,
            prop.multiProcessorCount*coresPerSM, prop.clockRate/1000,
            prop.totalGlobalMem/1024/1024);
    }

    return 0;
}

We need the compiler to be aware that it is dealing with a source file that may contain CUDA code. To do so, change the extension of the file to .cu. This makes sure that the nvcc compiler adds all the necessary includes and is aware that the code can contain CUDA API calls. We will not be using the GPU yet, only checking whether one is available by calling CUDA API functions.
To get the number of devices, use the cudaGetDeviceCount(..) CUDA API function.
Now that we know how many devices we have, we can cycle through them and get the properties of each one. Loop over the device indices from zero up to, but not including, the number of devices returned by the previous call, and call cudaGetDeviceProperties(..) for each of them. Print the name of each device, its number of multiprocessors and their clock rate.
Note that the total number of CUDA cores is not contained in the cudaDeviceProp structure. This is because devices can have different numbers of CUDA cores per streaming multiprocessor. This number can be up to 192, depending on the major and minor compute capability of the device. The provided “extended” solution includes a helper function from the CUDA SDK examples that returns this number for the given prop.major and prop.minor.