Device discovery

Questions

  • How can we make SYCL aware of the available hardware?

  • Is it possible to specialize our code for the available hardware?

Objectives

  • Learn how to query device information with get_info.

  • Learn how to use and write selectors.

The examples in the What is SYCL? episode highlighted the importance of the queue abstraction in SYCL. All device code is submitted to a queue as actions:

queue Q;

Q.submit(
  /* device code action */
);

The runtime schedules these actions and executes them asynchronously. We will discuss queues in further detail in Queues, command groups, and kernels. At this point, it is important to stress that a queue can be mapped to one device only. The mapping happens at queue construction and cannot be changed afterwards.
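
We can always inspect which device a given queue was bound to, for example by printing its name with the get_info query introduced later in this episode (a minimal sketch):

queue Q;

// the device bound to the queue can be inspected, but not changed
std::cout << "Queue runs on: "
          << Q.get_device().get_info<info::device::name>()
          << std::endl;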

We have four strategies to run our device code:

Somewhere

This is what we have done so far. It is achieved by simply using the default constructor of the queue object:

queue Q;

Most likely, you will want more control over which device the queues in your code use, especially as your SYCL code matures.

In order to gain more control, we will use the following constructor:
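
// declaration as in SYCL 1.2.1; later revisions use a slightly different, templated form
queue(const device_selector &deviceSelector,
      const property_list &propList = {});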

The selector passed as the first parameter lets us specify how the runtime should map the queue to a device.

On the host device

A standards-compliant SYCL implementation will always define a host device and we can bind a queue to it by passing the host_selector object to its constructor:

queue Q{host_selector{}};

The host device makes the host CPU “look like” an independent device, such that device code will run even when no accelerator hardware is available. This is especially useful in three scenarios:

  1. Developing heterogeneous code on a machine without accelerator hardware.

  2. Debugging device code using CPU tooling.

  3. Fallback option to guarantee functional portability.

On a specific class of devices

Such as GPUs or FPGAs. The SYCL standard defines a few selectors for this use case.

queue Q_default{default_selector{}};

queue Q_cpu{cpu_selector{}};

queue Q_gpu{gpu_selector{}};

queue Q_accelerator{accelerator_selector{}};

  • default_selector selects the implementation-defined default device. The choice is not portable between SYCL implementations.

  • cpu_selector selects a CPU device.

  • gpu_selector selects a GPU device.

  • accelerator_selector selects an accelerator device, including FPGAs.

On a specific device in a specific class

For example, on a GPU with well-defined compute capabilities. SYCL defines the device_selector base class, which we can inherit from and customize to our needs:

class special_device_selector : public device_selector {
  /* we will look at what goes here soon! */
};

queue Q{special_device_selector{}};

This is also the most flexible and portable way of parameterizing our code to work on a diverse set of devices.

hipSYCL and HIPSYCL_TARGETS

SYCL is all about being able to write code once and execute it on different hardware. The sample code in the folder content/code/day-1/01_hello-selectors used the default queue constructor: essentially, the runtime decides for us which device will be used. Let us explore how that works with hipSYCL. When configuring the code, we can set the HIPSYCL_TARGETS CMake option to influence this behavior.
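
As a reminder, the configuration and build steps mirror those in the exercises below (assuming the same module setup on the training cluster):

$ module load CMake hipSYCL
$ cmake -S. -Bbuild -DHIPSYCL_TARGETS="cuda:sm_80"
$ cmake --build build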

  1. Compile to target the GPU, using -DHIPSYCL_TARGETS="cuda:sm_80" in the configuration step. What output do you see? Is the code running on the device you expect?

  2. Compile to target the GPU and the host device using OpenMP with -DHIPSYCL_TARGETS="cuda:sm_80;omp". What output do you see now?

  3. Extend the code to create multiple queues, each using one of the standard selectors, and compile with -DHIPSYCL_TARGETS="cuda:sm_80;omp". What output do you expect to see?

To learn more about the compilation model in hipSYCL, check out its documentation.

Introspection with get_info

It is not a good idea to depend explicitly on vendor and/or device names in our program: for maximum portability, our device code should rather be parameterized on compute capabilities, available memory, and so forth. Introspection into such parameters is achieved with the get_info template function:
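
For example, a minimal usage sketch querying the device bound to a queue:

queue Q;
auto dev = Q.get_device();

std::cout << "name: " << dev.get_info<info::device::name>() << "\n"
          << "global memory (bytes): "
          << dev.get_info<info::device::global_mem_size>() << std::endl;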

This method is available for many of the classes defined by the SYCL standard, including, of course, device. The template parameter specifies which information we would like to obtain. In our examples, we use info::device::vendor and info::device::name to build selectors. Valid get_info queries for devices live in the info::device namespace and fall roughly into two groups, namely queries that can:

  1. decide whether a kernel can run correctly on a given device. For example, querying for info::device::global_mem_size and info::device::local_mem_size would return the size, in bytes, of the global and local memory, respectively.

  2. help tune kernel code to a given device. For example, querying info::device::local_mem_type would return which kind of local memory is available on the device: none, dedicated local storage, or an abstraction built using global memory (see the sketch below).
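
For instance, a minimal sketch of the second kind of query, reusing the dev object from the snippet above:

// pick a kernel variant depending on the kind of local memory available
auto lmem = dev.get_info<info::device::local_mem_type>();
if (lmem == info::local_mem_type::local) {
  // dedicated on-chip local storage: a tiled kernel variant can pay off
} else {
  // 'global' (emulated) or 'none': prefer the simpler kernel variant
}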

We will not list all possible device queries here: a complete list is available on the webpage of the standard.

Nosing around on our system

We will write a chatty program to report properties of all devices available on our system. We will have to keep the list of queries at hand for this task.

You can find a scaffold for the code in the content/code/day-1/03_platform-devices/platform-devices.cpp file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder.

The code is a double loop:

  1. First over all platforms

    for (const auto & p : platform::get_platforms()) {
      ...
    }
    

    A platform is an abstraction mapping to a backend.

  2. Then over all devices available on each platform:

    for (const auto& d: p.get_devices()) {
      ...
    }
    
  3. We will query the device with get_info in the inner loop. For example:

    std::cout << "name: " << d.get_info<info::device::name>() << std::endl;
    
  4. Add queries and print out information on global and local memory, work items, and work groups; one possible set of queries is sketched below.
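
One possible set of queries for the last step, as a sketch; feel free to add others from the standard's list:

std::cout << "global memory (bytes): "
          << d.get_info<info::device::global_mem_size>() << "\n"
          << "local memory (bytes): "
          << d.get_info<info::device::local_mem_size>() << "\n"
          << "max compute units: "
          << d.get_info<info::device::max_compute_units>() << "\n"
          << "max work-group size: "
          << d.get_info<info::device::max_work_group_size>() << std::endl;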

Note

Not all queries make sense on all devices:

  • A property might be unlimited on a particular device/backend, e.g., no real maximum allocation size on most host operating systems due to modern virtual memory implementations with first-touch allocation policy.

  • hipSYCL might not know the result, e.g., because the query does not apply to or make sense for the backend, or because the backend cannot provide this information.

Some queries might therefore report alarming-looking values when compiled with hipSYCL, but this is not necessarily an issue.

The info:: namespace is vast! You can query many aspects of a SYCL program at runtime using get_info, not just devices: the classes platform, context, queue, event, and kernel also offer a get_info method. The queries in the info::kernel_device_specific namespace can be helpful for performance tuning.
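
For instance, a short sketch querying the platform that a queue's device belongs to:

// the same get_info mechanism, applied to a platform object
auto p = Q.get_device().get_platform();
std::cout << p.get_info<info::platform::name>() << " ("
          << p.get_info<info::platform::version>() << ")" << std::endl;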

Writing your own selector

Using inheritance

All the standard selectors are derived types of the abstract device_selector class. This class defines, among other things, a pure virtual overload of the function-call operator:

virtual int operator()(const device &dev) const = 0;

The method takes a device object and returns an integer score for it; the device with the highest score gets selected. The runtime will call this method exactly once for each device that it has access to, in order to build the ranking of device scores. Devices whose score is negative are excluded from the ranking altogether. We can write our own selector by simply inheriting from this abstract base class and implementing our own custom logic for scoring devices:

class special_device_selector : public device_selector
{
public:
  int operator()(const sycl::device& dev) const override
  {
    // give a positive score only to GPUs from the vendor we are after
    if (dev.is_gpu()) {
      auto vendorName = dev.get_info<sycl::info::device::vendor>();
      if (vendorName.find("Intel") != std::string::npos) {
        return 1;
      }
    }
    // a negative score excludes the device from selection
    return -1;
  }
};

auto Q = queue { special_device_selector {} };

Write a custom selector

It’s not that far of a stretch to imagine that, in a not-so-distant future, a node in a cluster might be equipped with accelerators from different vendors. In this exercise, you’ll write a selector to score GPUs from different vendors according to your preferences.

You can find a scaffold for the code in the content/code/day-1/02_custom-selectors/custom-selectors.cpp file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder.

  1. Load the necessary modules:

    $ module load CMake hipSYCL
    
  2. Configure, compile, and run the code:

    $ cmake -S. -Bbuild -DHIPSYCL_TARGETS="cuda:sm_80;omp"
    $ cmake --build build -- VERBOSE=1
    $ ./build/custom-selectors
    

Try compiling and executing on a non-GPU node. What happens? How can you make the code more robust?
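
One possible approach, sketched here under the assumption that the runtime throws a sycl::exception when no device receives a non-negative score, is to catch it and fall back to a safe selector:

auto make_queue = []() {
  try {
    // try to grab a device with our custom selector
    return queue{special_device_selector{}};
  } catch (const sycl::exception&) {
    // no suitable device found: fall back to the host device
    return queue{host_selector{}};
  }
};

auto Q = make_queue();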

Using aspects

The standard defines the aspect_selector free function, which returns a selector based on desired device aspects.

Available aspects are defined in the aspect enumeration and can be probed using the has method of the device class. For example, dev.has(aspect::gpu) is equivalent to dev.is_gpu(). A selector for GPUs supporting half-precision floating-point numbers (FP16) and USM device allocations can be implemented with a one-liner:

auto my_selector = aspect_selector(aspect::gpu, aspect::usm_device_allocations, aspect::fp16);
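
The resulting selector can then be passed to the queue constructor just like the built-in ones (assuming a SYCL 2020-conformant implementation):

queue Q{my_selector};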

The aspects defined by the standard are listed here. Currently, we cannot use aspects to filter devices based on vendor.

Keypoints

  • Device selection is essential to tailor execution to the available hardware and is achieved using the device_selector abstraction.

  • Custom selectors with complex logic can be implemented with inheritance.

  • You should use get_info to probe your system. Device selection should be done based on compute capabilities, not on vendor and/or device names.