Queues, command groups, and kernels
Questions
How do we organize work in a SYCL application?
Objectives
- Learn how queues describe the ordering of operations.
- Learn about command groups.
- Understand that kernels are units of parallelism in SYCL.
SYCL queue objects are the abstraction connecting a host program to a single
device. The queue is a central abstraction in SYCL: all device code is
submitted to a queue as actions. The runtime schedules the actions and
executes them asynchronously. The runtime keeps track of action
prerequisites in its scheduling, for example, availability of data.
We can state that the tracking of actions and their dependencies is the essence
of SYCL. The SYCL standard models our program as a task graph, a set of
nodes connected by edges:
- Nodes are actions to be performed on a device, such as the invocation of a kernel or explicit data movements.
- Edges are dependencies between the actions and express when it’s legal for a node to execute. Edges arise most often because of data dependencies between nodes.
The task graph is a directed acyclic graph (DAG): every edge has a direction, and no path leads from a node back to itself. The SYCL runtime resolves the dependencies and thus generates the task graph. Furthermore, it schedules the execution of the nodes, i.e. the traversal of the task graph, completely asynchronously with respect to the host code. We will see in the episode "The task graph: data, dependencies, synchronization" how to manually modify the task graph.
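To make the idea of traversing a task graph concrete, here is a minimal host-only sketch in standard C++. It does not use the SYCL API: `TaskGraph`, `add_node`, `add_edge`, and `schedule` are hypothetical names chosen for illustration, and the ordering is computed with Kahn's algorithm, which is one possible strategy and not necessarily what a SYCL runtime does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-only model of a task graph; none of these names are SYCL API.
struct TaskGraph {
  std::vector<std::string> nodes;  // actions, e.g. kernels or explicit copies
  std::vector<std::pair<std::size_t, std::size_t>> edges;  // (before, after)

  // Add an action and return its index in the graph.
  std::size_t add_node(std::string name) {
    nodes.push_back(std::move(name));
    return nodes.size() - 1;
  }

  // Record that `after` may only run once `before` has completed.
  void add_edge(std::size_t before, std::size_t after) {
    assert(before < nodes.size() && after < nodes.size());
    edges.emplace_back(before, after);
  }

  // Kahn's algorithm: returns one valid execution order.
  // The result has fewer entries than there are nodes if the graph has a cycle.
  std::vector<std::size_t> schedule() const {
    std::vector<std::size_t> pending(nodes.size(), 0);
    std::vector<std::vector<std::size_t>> successors(nodes.size());
    for (const auto &edge : edges) {
      successors[edge.first].push_back(edge.second);
      ++pending[edge.second];
    }

    // Nodes without unmet dependencies are ready to run.
    std::queue<std::size_t> ready;
    for (std::size_t n = 0; n < nodes.size(); ++n)
      if (pending[n] == 0) ready.push(n);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
      const std::size_t n = ready.front();
      ready.pop();
      order.push_back(n);
      // Completing a node may unblock its successors.
      for (std::size_t s : successors[n])
        if (--pending[s] == 0) ready.push(s);
    }
    return order;
  }
};
```

With three nodes chained as "copy to device", "kernel", "copy to host", `schedule()` returns them in exactly that order. If an edge is added that closes a cycle, no valid order exists and the returned vector is shorter than the number of nodes, which is why the graph has to be acyclic.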
Two kinds of actions can be part of the task graph:
- Execution of device code
These actions add nodes to the graph that will, eventually, execute device code. They accept kernel code and its execution space as arguments, and you invoke them as methods on the queue class directly or on the handler class. They come in three flavors, which represent different abstractions for work distribution in SYCL:
  - single_task: as the name says, this will execute one single instance of the kernel code.
  - parallel_for: this will launch a kernel with a given work-size specification in a single instruction, multiple threads (SIMT) fashion.
  - parallel_for_work_group: launches a kernel with hierarchical parallelism. This is only available on the handler class.
- Explicit memory operations.
These actions add nodes to the graph that will, eventually, perform data migrations. You invoke them as methods on the queue class directly or on the handler class:
  - copy: copies data.
  - update_host: updates data in the buffer on the host-side.
  - fill: initializes data in a buffer to the given value.
We have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel.
But how do we write a kernel?
Kernels
Kernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL:
- lambda expressions
Kernels as lambdas are very concise, thanks especially to the capture syntax. They cannot be templated and might be cumbersome to reuse. In some cases, lambdas can be too terse.
+1 as a lambda
[=](id<1> idx) -> void { data_acc[idx] += 1; }
- function objects
A kernel can also be a class that overloads operator(), the function call operator. Function objects can be templated, easily reused, and give full control over what data is passed in and out. They are more verbose.
+1 as a function object
class PlusOne {
public:
  PlusOne(accessor<int> acc) : data_acc_(acc) {}
  /* the call operator of a kernel function object must be const */
  void operator()(id<1> idx) const { data_acc_[idx] += 1; }
private:
  accessor<int> data_acc_;
};
There are no technical reasons to prefer one style over the other: it ultimately boils down to personal preference. Regardless of the chosen style, kernel code has some restrictions:
- It must have void as return type.
- It cannot use runtime type identification (RTTI).
- It cannot dynamically allocate memory.
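The two kernel styles can be compared without a device using a minimal host-only sketch in standard C++. It rests on some assumptions: `host_parallel_for` and `add_two` are hypothetical helpers, not SYCL functions, the raw pointer stands in for the accessor, and the instances run sequentially on the host. Both kernels return void and allocate no memory, in line with the restrictions listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for parallel_for: calls the kernel once per
// index, sequentially. A real SYCL runtime would run the instances on a device.
template <typename Kernel>
void host_parallel_for(std::size_t range, const Kernel &kernel) {
  for (std::size_t idx = 0; idx < range; ++idx) kernel(idx);
}

// Function object kernel: templated on the element type and reusable.
template <typename T>
class PlusOne {
public:
  explicit PlusOne(T *data) : data_(data) { assert(data != nullptr); }
  // Const call operator returning void, as required of kernel code.
  void operator()(std::size_t idx) const { data_[idx] += 1; }

private:
  T *data_;  // plays the role of the accessor in the SYCL version
};

// Apply the same increment twice: once as a function object, once as a lambda.
template <typename T>
void add_two(std::vector<T> &values) {
  if (values.empty()) return;
  T *data = values.data();
  host_parallel_for(values.size(), PlusOne<T>{data});
  host_parallel_for(values.size(),
                    [=](std::size_t idx) -> void { data[idx] += 1; });
}
```

The function object is written once and instantiated for any element type, for example `PlusOne<int>` or `PlusOne<double>`, while the lambda is shorter but tied to the place where it is defined.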
Queues
One queue maps to one device: the mapping happens upon construction of a
queue object and cannot be changed subsequently.
It is not possible to use a single queue object to:
- manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work!
- spread enqueued work over multiple devices.
While these might appear as limitations, we are free to declare as many
queue objects as we like in our programs. It is also valid to create multiple
queues for the same device. Thus, the relation between queues and devices is
many-to-one.
Work on a device can be enqueued with the shortcut methods described above. For
example, we can launch a data-parallel kernel with parallel_for invoked on
the desired queue object:
Creating work on a device using queue shortcuts.
auto Q = queue{my_selector{}};
Q.parallel_for(range<1>{sz}, [=](auto &idx){
/* kernel code */
});
Command groups
A command group handler gives more control over how code is submitted to the
queue. Submission is slightly more verbose, but we get access to features of
hierarchical parallelism.
The abstraction for command groups is the class handler: these objects are
constructed for us by the SYCL runtime. As such, we will meet them only as
arguments of the lambda functions passed to the submit method of our queues.
A command group contains:
- host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission.
- exactly one action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function.
Creating work on a device using a command group handler.
auto Q = queue{my_selector{}};
Q.submit([&](handler &cgh){
  /* host code: sets up the dependencies of this node. It executes **immediately!** */
  accessor acc{B, cgh}; /* B is a buffer created beforehand */
  /* exactly **one** of the available actions. It executes **asynchronously** */
  cgh.parallel_for(range<1>{sz}, [=](auto &idx){
    /* kernel code */
  });
});
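The split between host code that runs immediately and an action that runs later can be mimicked with a minimal host-only sketch in standard C++. `MiniHandler`, `MiniQueue`, and `trace_submission` are hypothetical names and not part of SYCL. The model simply postpones every action until `wait()` is called, whereas a real runtime is free to start an action as soon as its dependencies are satisfied.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-only model; MiniHandler and MiniQueue are not SYCL classes.
class MiniHandler {
public:
  // Register the single action of this command group; it is not run yet.
  void single_task(std::function<void()> action) {
    assert(!action_ && "a command group holds exactly one action");
    action_ = std::move(action);
  }

private:
  friend class MiniQueue;
  std::function<void()> action_;
};

class MiniQueue {
public:
  // Runs the command group function right away (host code), defers the action.
  template <typename CommandGroup>
  void submit(CommandGroup &&command_group) {
    MiniHandler cgh;
    command_group(cgh);
    if (cgh.action_) pending_.push_back(std::move(cgh.action_));
  }

  // Execute all deferred actions in submission order.
  void wait() {
    for (auto &action : pending_) action();
    pending_.clear();
  }

private:
  std::vector<std::function<void()>> pending_;
};

// Returns the order in which the parts of one command group run.
std::vector<std::string> trace_submission() {
  std::vector<std::string> log;
  MiniQueue Q;
  Q.submit([&](MiniHandler &cgh) {
    log.push_back("host code");  // runs during submit
    cgh.single_task([&log]() { log.push_back("action"); });  // runs later
  });
  log.push_back("after submit");
  Q.wait();
  return log;
}
```

In this model `trace_submission()` records "host code" first, then "after submit", and only then "action": the host-side part of the command group has already finished when `submit` returns, while the action is still pending.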
single_task and streams
We’ll walk through the use of the single_task method to create work on a
device.
As the name suggests, this will create a task for sequential execution:
probably not a method you will use often, but definitely something to be
aware of!
The task we would like to perform is a print-out on the device. If you are
familiar with CUDA/HIP, you probably know that printf can be used in
device code. In keeping with C++, the SYCL standard defines a stream
class, which works similarly to the standard streams. A SYCL stream needs a
handler object on construction:
auto out = stream(1024, /* maximum size of output per kernel invocation */
256, /* maximum size before flushing the stream */
cgh);
SYCL streams behave much like standard C++ streams. We can write something to
a stream using operator<<. Note that the end-of-line manipulator is the one
defined by SYCL, sycl::endl, not std::endl:
out << "my message" << sycl::endl;
You can find a scaffold for the code in the
content/code/day-1/04_single-task/single-task.cpp file,
alongside the CMake script to build the executable. You will have to complete
the source code to compile and run correctly: follow the hints in the source
file. A working solution is in the solution subfolder.
- Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode.
- Submit work to the queue using a command group handler.
- Create a stream object.
- Create a single task on the handler printing a string to the stream. A single_task only accepts a function with no input arguments as parameter:
  cgh.single_task([=](){ /* task code */ });
Keypoints
- One queue maps to one device, so that there is no ambiguity in spreading work.
- A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one.
- Enqueuing actions can happen by submitting command groups using the handler class.
- You can also enqueue actions with shortcut methods on the queue class.
- Work can be enqueued with a command group handler. This gives more flexibility over the definition of the corresponding node in the task graph.
- Kernels are callables: either lambda functions or function objects.
- Kernel code can use neither RTTI nor dynamic memory allocation.