Expressing parallelism with SYCL: nd-range data-parallel kernels
Questions
How do we write parallel kernels in SYCL?
Objectives
Learn the difference between task and data parallelism.
Learn about work-items, work-groups, and ND-ranges.
ND-range data-parallel kernels
The basic flavor of data-parallel kernel lets us map 1-, 2-, and 3-dimensional
problems to the available hardware. However, the semantics of the range
class are quite limited: we have no way of expressing notions of locality within
these kernels.
This was quite evident in the matrix multiplication exercise: each work-item
addressed one element in the output matrix and would load a full (contiguous)
row of the left operand and a full (non-contiguous) column of the right operand.
The same operand elements would thus be loaded multiple times, once for each work-item that needs them.
ND-ranges are represented with objects of type nd_range, templated over the number of dimensions:
nd_range constructor
nd_range(range<dimensions> globalSize, range<dimensions> localSize);
These are constructed using two range objects, representing the global and local execution ranges:
The global range gives the total size of the nd_range: a 1-, 2-, or 3-dimensional collection of work-items. In this respect it is exactly like a range object: at their coarsest level the two look the same.
The local range gives the size of each work-group comprising the nd_range.
The implementation can further subdivide each work-group into 1-dimensional sub-groups. Since this is an implementation-dependent feature, the sub-group size cannot be set in the nd_range constructor.
Note that:
The local size in each dimension must divide the corresponding global size exactly.
The contiguous dimensions of the ND-range and its work-groups coincide.
Sub-groups, if available, are along the contiguous dimension of their work-groups.
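For example, a minimal sketch of constructing a 2-dimensional ND-range; the sizes here are purely illustrative:
range<2> global{8, 8};           // 64 work-items in total
range<2> local{4, 4};            // 16 work-items per work-group; divides the global size exactly
nd_range<2> ndr{global, local};  // 2x2 work-groups, each of 4x4 work-items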
Work-groups and work-items in ND-ranges
As with basic ranges, the work-items in an ND-range represent instances of the kernel function:
they cannot communicate with, nor synchronize with, each other; [*]
they may be scheduled for execution in any order.
Work-groups are the novelty of ND-ranges, and the fact that work-items are grouped together gives us more programming tools:
Each work-group has work-group local memory. Each work-item in the group can access this memory. How the memory is made available is up to the hardware and the SYCL implementation.
We can use group-level barriers and fences to synchronize work-items within a group.
We can use group-level collectives for communication (e.g. broadcasts) and computation (e.g. scans and reductions).
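As an illustration only (not part of the exercise code), the sketch below shows local memory, a group barrier, and a group collective in a 1-dimensional kernel. It assumes a SYCL 2020 implementation, a queue Q, a global size N, and a work-group size B that divides N:
Q.submit([&](handler& cgh) {
  // work-group local memory: one tile of B doubles per work-group
  local_accessor<double, 1> tile{range<1>{B}, cgh};

  cgh.parallel_for(nd_range<1>{range<1>{N}, range<1>{B}}, [=](nd_item<1> it) {
    auto lid = it.get_local_id(0);
    tile[lid] = static_cast<double>(it.get_global_id(0));

    // group-level barrier: all work-items in the group now see the full tile
    group_barrier(it.get_group());

    // group-level collective: sum the tile entries across the work-group
    // (the result is unused here; shown only to illustrate the API)
    auto total = reduce_over_group(it.get_group(), tile[lid], plus<>());
  });
});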
Work-items in a work-group are scheduled concurrently onto a single compute unit. However:
There can be many more work-groups in an ND-range than compute units in the hardware. It's fine to oversubscribe the hardware, and if we avoid making our code too device-specific, we have a better chance of achieving portability.
Work-items are not guaranteed to make independent forward progress. Interleaving the execution of work-items only at barriers and fences, effectively running them sequentially, is a valid execution model, and the runtime may decide to do just that.
In ND-range data-parallel kernels, the kernel function passed as second argument to the parallel_for invocation accepts objects of type nd_item, which generalize the item type.
The nd_item gives access to its ids in the global and local ranges with the get_global_id and get_local_id methods, respectively. It is often clearer, however, to first obtain a handle to the work-group (sub-group) with the get_group (get_sub_group) method and then interrogate ids and ranges from the returned group (sub_group) objects.
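For instance, a minimal sketch (the handler cgh and the global and local ranges are assumed to be defined as in the exercise below; the queried values are unused and shown only for illustration):
cgh.parallel_for(nd_range<2>{global, local}, [=](nd_item<2> it) {
  auto gid = it.get_global_id(0);         // index in the global range
  auto lid = it.get_local_id(0);          // index within the work-group

  auto grp = it.get_group();              // handle to the enclosing work-group
  auto wg_id   = grp.get_group_id(0);     // which work-group this is
  auto wg_size = grp.get_local_range(0);  // work-items per group along dimension 0

  auto sg = it.get_sub_group();           // handle to the sub-group, if available
  auto sg_size = sg.get_local_range()[0]; // number of work-items in the sub-group
});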
Less naïve MatMul
Using the ND-range flavor of data-parallelism should let us optimize memory accesses a bit more. In this exercise, we will rewrite the matrix multiplication kernel to use nd_ranges.
Each work-item will compute an element in the result matrix \(\mathbf{C}\) by accessing a full row of \(\mathbf{A}\) and a full column of \(\mathbf{B}\). In contrast to the previous implementation, however, each work-item now belongs to a work-group, so the operand data it loads can be reused by the other work-items in the group, improving locality of accesses.
Don’t do this at home, use optimized BLAS!
You can find a scaffold for the code in the content/code/day-2/02_nd_range-matmul/nd_range-matmul.cpp file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder.
We first create a queue and map it to the GPU, either explicitly:
queue Q{gpu_selector{}};
or implicitly, by compiling with the appropriate HIPSYCL_TARGETS value.
We declare the operands as std::vector<double>; the right-hand side operands are filled with random numbers:
constexpr size_t N = 256;

std::vector<double> a(N * N), b(N * N), c(N * N);

// fill a and b with random numbers in the unit interval
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<double> dist(0.0, 1.0);

std::generate(a.begin(), a.end(), [&dist, &mt]() { return dist(mt); });
std::generate(b.begin(), b.end(), [&dist, &mt]() { return dist(mt); });
while the result matrix is zeroed out:
// zero-out c
std::fill(c.begin(), c.end(), 0.0);
We define buffers for the operands in our matrix multiplication. For example, for the matrix \(\mathbf{A}\):
buffer<double, 2> a_buf(a.data(), range<2>(N, N));
Since we will be using the ND-range version, we will also need a local, 2-dimensional iteration range, with size \(B\times B\):
constexpr size_t B = 4;
We submit work to the queue through a command group handler:
Q.submit([&](handler& cgh) { /* work for the queue */ });
We declare accessors to the buffers. For example, for the matrix \(\mathbf{A}\):
accessor a{ a_buf, cgh };
We also need the global and local 2-dimensional iteration ranges:
range global{N, N};
range local{B, B};
Within the handler, we launch a parallel_for. The parallel region iterates over the 2-dimensional ranges of global and local indices, with an inner loop to span the common dimension of the \(\mathbf{A}\) and \(\mathbf{B}\) operand matrices:
cgh.parallel_for(
    nd_range{ /* global range */, /* local range */ },
    [=](nd_item<2> it) {
      auto j = it.get_global_id(0);
      auto i = it.get_global_id(1);

      for (auto k = 0; k < N; ++k) {
        c[j][i] += ...;
      }
    });
Check that your results are correct.
You can find a scaffold for the code in the content/code/day-2/03_usm-nd_range-matmul/usm-nd_range-matmul.cpp file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder.
We first create a queue and map it to the GPU, either explicitly:
queue Q{gpu_selector{}};
or implicitly, by compiling with the appropriate HIPSYCL_TARGETS value.
We allocate the operands as USM buffers and fill them with random numbers. We can do this with untyped or typed malloc-style or usm_allocator APIs. Should the operands be host, device, or shared allocations?
We allocate the result as a USM buffer and zero it out. We can do this with untyped or typed malloc-style or usm_allocator APIs. Should this be a host, device, or shared allocation?
We submit work to the queue. Note that we need to linearize indices for row-major access to our buffers:
auto irow = ...;
auto jcol = ...;

auto row_major_id = irow * N + jcol;
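As a reminder of the typed, malloc-style USM API mentioned in the allocation steps above, here is a minimal sketch (shared allocations and the variable names are purely illustrative; the exercise asks you to decide which kind of allocation is appropriate):
auto* a = malloc_shared<double>(N * N, Q); // typed, malloc-style shared allocation
auto* c = malloc_shared<double>(N * N, Q);

// ... fill a, compute into c ...

free(a, Q); // USM allocations are released through the SYCL runtime
free(c, Q);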
Check that your results are correct.
Keypoints
ND-range kernels should be used when we need more sophisticated control over performance.
Footnotes