What is SYCL?
Questions
What is SYCL and why should I care?
The high-performance computing landscape is increasingly dominated by machines whose high FLOP count is delivered by heterogeneous hardware: large-core count CPUs in tandem with ever more powerful GPUs are now the norm in the HPC datacenter. This trend is likely to continue, with the appearance of new hardware architectures, sometimes tailored for specific operations.
Each new architecture comes equipped with its own, usually low-level, programming language. Adapting applications for a heterogeneous computing environment proceeds in tight cycles of profiling and porting. These are time-consuming, error-prone, and scarcely portable:
Mastering low-level programming languages may require years.
The codebase can diverge significantly to work with different hardware.
Let’s look at a simple example: the single-precision vector addition kernel \(y := \alpha x + y\), better known as SAXPY:
CUDA:
__global__ void
saxpy(int sz, float a, float* __restrict__ x, float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz) y[i] = a * x[i] + y[i];
}
This code is an example of prescriptive parallelism.
HIP:
__global__ void
saxpy(int sz, float a, float* __restrict__ x, float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz) y[i] = a * x[i] + y[i];
}
This code is an example of prescriptive parallelism.
OpenMP:
void
saxpy(int sz, float a, float *restrict x, float *restrict y)
{
#pragma omp target teams distribute parallel for simd \
    map(to: x[0:sz]) map(tofrom: y[0:sz])
  for (int i = 0; i < sz; ++i) {
    y[i] = a * x[i] + y[i];
  }
}
This code is an example of descriptive parallelism.
OpenACC:
void saxpy(int sz, float a, float *restrict x, float *restrict y) {
#pragma acc kernels
  for (int i = 0; i < sz; ++i)
    y[i] = a * x[i] + y[i];
}
This code is an example of descriptive parallelism.
There are a few things to notice in this example:
We work with prescriptive parallelism with CUDA and HIP: we explicitly divide up the work between threads inside the kernel function.
In the OpenMP and OpenACC models, we describe data movement and worksharing, but rely on the compiler to generate correct code. Hence the qualification of descriptive parallelism.
When using low-level APIs, a certain degree of code duplication between host and device code is inevitable. Parallelization strategies between host and device will be different. On top of this, the expressiveness of low-level language extensions is limited, which might increase the maintenance burden.
Low-level APIs might only be available with proprietary, vendor-specific compilers. This limits both functional and performance portability.
Pragma-based schemes are standardized, but compiler support is not always optimal.
When programming for heterogeneous platforms, there is no silver bullet: we have to carefully evaluate the tradeoffs between performance, productivity, and portability. SYCL is an attempt at improving this state of affairs. [*] It is a standards-based and vendor-agnostic embedded domain-specific language for parallel programming, on heterogeneous and homogeneous architectures. The SYCL standard is developed by the Khronos Group:
It is built as a header-only library for ISO C++17. SYCL code can be compiled with a standards-compliant compiler and the necessary headers: it does not require special compiler extensions.
It is a single-source-style framework. Host and device code are in the same translation unit.
It is asynchronous. The programmer describes computations, memory allocations, and data migrations; the runtime generates a task graph and works its way through it when executing the program.
The SYCL compiler
Even though SYCL is a header-only library for ISO C++17, we still need a SYCL-aware compiler. A plain ISO C++17 compiler does not know how to generate optimal code for core abstractions, such as queues and unified shared memory. A SYCL compiler can optimize our parallel code and also knows which low-level framework to use to target specific architectures. Fortunately, there are many SYCL implementations to choose from!
Hello, SYCL!
Let’s dig in with a “Hello, world” example.
“Hello, world” with SYCL
You can find the file with the complete source code in the
content/code/day-1/00_hello
folder. Worry not about the details in the
code, we will dig into what is happening here at great length during the rest
of the lesson.
#include <cstring>
#include <iostream>
#include <string>

#include <sycl/sycl.hpp>
using namespace sycl;
int
main()
{
  const std::string secret { "Ifmmp-!xpsme\"\012J(n!tpssz-!Ebwf/!"
                             "J(n!bgsbje!J!dbo(u!ep!uibu/!.!IBM\01" };

  const auto sz = secret.size();

  queue Q;

  std::cout << "Running on: " << Q.get_device().get_info<info::device::name>()
            << std::endl;

  char *result = malloc_shared<char>(sz, Q);
  std::memcpy(result, secret.data(), sz);

  Q.parallel_for(
       range<1> { sz },
       [=](id<1> tid) {
         result[tid[0]] -= 1;
       })
      .wait();

  std::cout << result << "\n";

  free(result, Q);

  return 0;
}
Log in to Karolina and clone the repository for this workshop. Navigate to the correct folder. This contains a source file, hello.cpp, and the CMake script to build it. Load the necessary modules:
$ module load hipSYCL CMake/3.20.1-GCCcore-10.2.0
Configure and compile the code:
$ cmake -S. -Bbuild -DHIPSYCL_TARGETS="omp"
$ cmake --build build -- VERBOSE=1
Run the code! What result do you get?
$ ./build/hello
We can configure again to target the GPU:
$ cmake -S. -Bbuild -DHIPSYCL_TARGETS="cuda:sm_80"
$ cmake --build build -- VERBOSE=1
$ ./build/hello
What output do you see? We will talk more about device selection in Device discovery.
This source code introduces a number of fundamental concepts in SYCL:
SYCL is a template library and its classes and functions live in the sycl:: namespace. The SYCL runtime is provided by an optimizing compiler, in our case hipSYCL:
#include <sycl/sycl.hpp>

using namespace sycl;
Host and device code are in the same translation unit.
Thanks to unified shared memory we can use a pointer-based approach to memory management that transparently works across host and devices:
char *result = malloc_shared<char>(sz, Q);
std::memcpy(result, secret.data(), sz);
We will cover memory management in Data management with buffers and accessors and Data management with unified shared memory.
A queue is the mechanism by which we orchestrate work on our devices. For example, getting the device on which our actions will run:
queue Q;
Q.get_device().get_info<info::device::name>();
We will explore get_info and mechanisms for device selection in the Device discovery section.
An action is submitted to a queue and it runs on a device. In this example, our action is a parallel_for on a 1-dimensional range of work-items:
Q.parallel_for(
    range<1>{sz}, /* range of work-items */
    ...
);
Within actions, we execute kernels:
[=](id<1> tid) {
    result[tid[0]] -= 1;
}
The result array is indexed using an id object: a mapping between a range of work-items and available workers. Kernels are either lambda functions or function objects.
Actions are executed asynchronously. The host enqueues work and moves on with its tasks. If results are needed from an action, then we need to wait for it to complete:
Q.parallel_for(
    range<1>{sz}, /* range of work-items */
    [=](id<1> tid) {
        /* kernel code */
        result[tid[0]] -= 1;
    })
    .wait();
We have introduced two functions to manage USM: malloc_shared and free, for memory allocation and deallocation:
malloc_shared
template <typename T>
T* malloc_shared(size_t count,
const queue& syclQueue,
const property_list &propList = {})
Parameters
T
Type of the allocation.
count
How many elements to allocate.
syclQueue
Queue on which to perform the allocation.
propList
Properties of the allocation.
free
void free(void* ptr, sycl::queue& syclQueue)
Parameters
ptr
Memory to deallocate. It must have been allocated using any of the USM functions.
syclQueue
Queue on which to perform the deallocation.
We have handled memory migration using std::memcpy, which is part of the C++ standard.
Keypoints
Footnotes