Expressing parallelism with SYCL: nd-range data-parallel kernels

Questions

  • How do we write parallel kernels in SYCL?

Objectives

  • Learn the difference between task and data parallelism.

  • Learn about work-items, work-groups, and ND-ranges.

ND-range data-parallel kernels

The basic flavor of data-parallel kernel lets us map 1-, 2-, and 3-dimensional problems to the available hardware. However, the semantics of the range class is quite limited: it gives us no way to express notions of locality within these kernels. This was quite evident in the matrix multiplication exercise: each work-item addressed one element in the output matrix and loaded a full (contiguous) row of the left operand and a full (non-contiguous) column of the right operand. Each row and column was therefore loaded from global memory many times, once per work-item that needed it.
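
As a reminder, the body of that basic range-based kernel looked roughly like the sketch below; the accessor names a, b, c and the matrix dimension N are illustrative, not necessarily those used in the earlier code:

    // a, b, c: 2-dimensional accessors to the operand and result buffers (illustrative).
    cgh.parallel_for(range<2>{N, N}, [=](item<2> it) {
      // One work-item per element of the output matrix C.
      auto j = it.get_id(0);
      auto i = it.get_id(1);

      double sum = 0.0;
      // Every work-item streams a full row of A and a full column of B
      // from global memory; neighboring work-items reload the same data.
      for (std::size_t k = 0; k < N; ++k) {
        sum += a[j][k] * b[k][i];
      }
      c[j][i] = sum;
    });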

ND-ranges are represented with objects of type nd_range, templated over the number of dimensions. They are constructed from two range objects, representing the global and local execution ranges:

  • The global range gives the total size of the nd_range: a 1-, 2-, or 3-dimensional collection of work-items. At this coarsest level, an nd_range looks exactly like a plain range object.

  • The local range gives the size of each work-group comprising the nd_range.

  • The implementation can further subdivide each work-group into 1-dimensional sub-groups. Since this is an implementation-dependent feature, its size cannot be set in the nd_range constructor.

Note that:

  1. The local size in each dimension must divide the corresponding global size exactly.

  2. The contiguous dimensions of the ND-range and its work-groups coincide.

  3. Sub-groups, if available, are along the contiguous dimension of their work-groups.
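
Putting these pieces together, here is a minimal construction sketch, assuming the sycl namespace is in scope as in the other snippets and using the same sizes as the figure below:

    range<3> global{8, 8, 8};  // 512 work-items in total
    range<3> local{4, 4, 4};   // each work-group holds 64 work-items

    // Valid because 4 divides 8 exactly in every dimension (point 1 above).
    nd_range<3> ndr{global, local};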

[Figure: nd_range.svg]

Depiction of a 3-dimensional ND-range and its further subdivisions. The global execution range is \(8\times 8 \times 8\), thus containing 512 work-items. The global range is further subdivided into 8 work-groups, each consisting of \(4 \times 4 \times 4\) work-items. At an even finer level, each work-group has sub-groups of 4 work-items. The availability of sub-groups is implementation-dependent. Note that the contiguous dimension (dimension 2 in this example) of the ND-range and of its work-groups coincide. Furthermore, sub-groups are laid out along the contiguous dimension of their work-groups. Figure adapted from [RAB+21].

Work-groups and work-items in ND-ranges

As with basic ranges, the work-items in an ND-range represent instances of a kernel function:

  • they cannot communicate or synchronize with each other. [*]

  • they may be scheduled for execution in any order.

Work-groups are the novelty of ND-ranges: grouping work-items together gives us additional programming tools:

  • Each work-group has work-group local memory. Each work-item in the group can access this memory. How the memory is made available is up to the hardware and the SYCL implementation.

  • We can use group-level barriers and fences to synchronize work-items within a group.

  • We can use group-level collectives, either for communication, e.g. broadcasts, or for computation, e.g. scans. A short sketch of the first two tools follows this list.
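
A rough sketch of local memory and a group barrier, assuming a SYCL 2020 implementation that provides local_accessor and group_barrier (older implementations spell these differently), and with data, N, and B as illustrative names:

    // Work-group local scratch space: one slot per work-item in the group.
    local_accessor<double, 1> scratch{range<1>(B), cgh};

    cgh.parallel_for(nd_range<1>{range<1>(N), range<1>(B)}, [=](nd_item<1> it) {
      auto lid = it.get_local_id(0);

      // Each work-item stages one element into local memory...
      // (data: an accessor to some 1-dimensional buffer, illustrative)
      scratch[lid] = data[it.get_global_id(0)];

      // ...and waits until the whole work-group has finished writing.
      group_barrier(it.get_group());

      // From here on, every work-item in the group may read any slot of scratch.
    });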

Work-items in a work-group are scheduled for concurrent execution on a single compute unit, however:

  • There can be many more work-groups in an ND-range than compute units in the hardware. It’s fine to oversubscribe the hardware, and if we avoid making our code too device-specific, we have a better chance of achieving portability.

  • Work-items are not guaranteed to make independent progress. Interleaving the execution of work-items between barriers and fences, effectively running them sequentially, is a valid execution model, and the runtime may well decide to do just that.

In ND-range data-parallel kernels, the kernel function passed as second argument to the parallel_for invocation accepts objects of type nd_item, which generalize the item type. The nd_item gives access to its ids in the global and local ranges with the get_global_id and get_local_id methods, respectively. It is often clearer, however, to first obtain a handle to the work-group (sub-group) with the get_group (get_sub_group) method and then interrogate ids and ranges from the returned group (sub_group) objects.
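
The sketch below shows these queries inside a 2-dimensional kernel; global and local are the ranges used to construct the nd_range, and the use of get_sub_group assumes the implementation exposes sub-groups, as discussed above:

    cgh.parallel_for(nd_range<2>{global, local}, [=](nd_item<2> it) {
      // Coordinates of this work-item in the global and local ranges.
      auto gid = it.get_global_id();  // id<2> within the whole ND-range
      auto lid = it.get_local_id();   // id<2> within its work-group

      // Often clearer: ask the group objects directly.
      auto wg = it.get_group();
      auto group_id = wg.get_group_id();     // which work-group we are in
      auto n_groups = wg.get_group_range();  // how many work-groups there are

      auto sg = it.get_sub_group();
      auto lane = sg.get_local_id();         // position within the sub-group
    });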

Less naïve MatMul

Using the ND-range flavor of data-parallelism should let us optimize memory accesses a bit more. In this exercise, we will rewrite the matrix multiplication kernel to use nd_range objects.

Each work-item will compute an element in the result matrix \(\mathbf{C}\) by accessing a full row of \(\mathbf{A}\) and a full column of \(\mathbf{B}\). However, in contrast with the previous implementation, each work-item is now part of a work-group, and thus the data loaded for the operands can be reused by all work-items in the group, improving the locality of accesses.

[Figure: less_naive_matmul.svg]

Schematic of a less naïve implementation of matrix multiplication: \(C_{ij} = \sum_{k}A_{ik}B_{kj}\). The computation is split into work-groups to optimize the locality of memory accesses. Each work-item (green) computes an element in the result matrix \(\mathbf{C}\) by accessing a full row of \(\mathbf{A}\) and a full column of \(\mathbf{B}\). However, since the work-item is part of a work-group (orange), the data loaded for the operands can be reused by all work-items in the group. Figure adapted from [RAB+21].

Don’t do this at home, use optimized BLAS!

You can find a scaffold for the code in the content/code/day-2/02_nd_range-matmul/nd_range-matmul.cpp file, alongside the CMake script to build the executable. You will have to complete the source code for it to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder.

  1. We first create a queue and map it to the GPU, either explicitly:

    queue Q{gpu_selector{}};
    

    or implicitly, by compiling with the appropriate HIPSYCL_TARGETS value.

  2. We declare the operands as std::vector<double>; the right-hand side operands are filled with random numbers:

    constexpr size_t N = 256;
    std::vector<double> a(N * N), b(N * N);
    
    // fill a and b with random numbers in the unit interval
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    
    std::generate(a.begin(), a.end(), [&dist, &mt]() {
      return dist(mt);
    });
    std::generate(b.begin(), b.end(), [&dist, &mt]() {
      return dist(mt);
    });
    

    while the result matrix is zeroed out:

    std::vector<double> c(N * N);
    
    // zero-out c
    std::fill(c.begin(), c.end(), 0.0);
    
  3. We define buffers for the operands in our matrix multiplication. For example, for the matrix \(\mathbf{A}\):

    buffer<double, 2> a_buf(a.data(), range<2>(N, N));
    

    Since we will be using the ND-range version, we will also need a local, 2-dimensional iteration range, with size \(B\times B\):

    constexpr size_t B = 4;
    
  4. We submit work to the queue through a command group handler:

    Q.submit([&](handler& cgh) {
      /* work for the queue */
    });
    
  5. We declare accessors to the buffers. For example, for the matrix \(\mathbf{A}\):

    accessor a{ a_buf, cgh };
    

    We also need the global and local 2-dimensional iteration ranges:

    range global{N, N};
    range local{B, B};
    
  6. Within the handler, we launch a parallel_for. The parallel region iterates over the 2-dimensional ranges of global and local indices, with an inner loop to span the common dimension of the \(\mathbf{A}\) and \(\mathbf{B}\) operand matrices:

    cgh.parallel_for(
      nd_range{ /* global range */, /* local range */ },
      [=](nd_item<2> it){
        auto j = it.get_global_id(0);
        auto i = it.get_global_id(1);
        for (std::size_t k = 0; k < N; ++k) {
          c[j][i] += ...;
        }
      }
    );
    
  7. Check that your results are correct.
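
    One possible check, sketched below, compares against a straightforward host-side triple loop; it assumes the device results have been copied back into c, e.g. by letting the buffer for \(\mathbf{C}\) go out of scope, and uses <cassert> and <cmath>:

    // Reference result computed on the host.
    std::vector<double> c_ref(N * N, 0.0);
    for (std::size_t i = 0; i < N; ++i) {
      for (std::size_t j = 0; j < N; ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < N; ++k) {
          sum += a[i * N + k] * b[k * N + j];
        }
        c_ref[i * N + j] = sum;
      }
    }

    // Compare element-wise, allowing for floating-point round-off.
    for (std::size_t idx = 0; idx < N * N; ++idx) {
      assert(std::abs(c[idx] - c_ref[idx]) < 1.0e-10);
    }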

Keypoints

  • ND-range kernels should be used for more sophisticated control over performance aspects.

Footnotes