What is SYCL?
Questions
What is SYCL and why should I care?
The high-performance computing landscape is increasingly dominated by machines whose high FLOP count is delivered by heterogeneous hardware: large-core count CPUs in tandem with ever more powerful GPUs are now the norm in the HPC datacenter. This trend is likely to continue, with the appearance of new hardware architectures, sometimes tailored for specific operations.
Each new architecture comes equipped with its own, usually low-level, programming language. Adapting applications for a heterogeneous computing environment proceeds in tight cycles of profiling and porting. These are time-consuming, error-prone, and scarcely portable:
Mastering low-level programming languages may require years.
The codebase can diverge significantly to work with different hardware.
Let’s look at a simple example: the single-precision vector addition kernel \(y := \alpha x + y\), better known as SAXPY:
CUDA:
__global__ void
saxpy(int sz, float a, float* __restrict__ x, float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz) y[i] = a * x[i] + y[i];
}
This code is an example of prescriptive parallelism.
HIP:
__global__ void
saxpy(int sz, float a, float* __restrict__ x, float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz) y[i] = a * x[i] + y[i];
}
This code is an example of prescriptive parallelism.
OpenMP:
void
saxpy(int sz, float a, float *restrict x, float *restrict y)
{
#pragma omp target teams distribute parallel for simd \
    map(to: x[0:sz]) map(tofrom: y[0:sz])
  for (int i = 0; i < sz; ++i) {
    y[i] = a * x[i] + y[i];
  }
}
This code is an example of descriptive parallelism.
OpenACC:
void saxpy(int sz, float a, float *restrict x, float *restrict y) {
#pragma acc kernels
  for (int i = 0; i < sz; ++i)
    y[i] = a * x[i] + y[i];
}
This code is an example of descriptive parallelism.
There are a few things to notice in this example:
We work with prescriptive parallelism with CUDA and HIP: we explicitly divide up the work between threads inside the kernel function.
In the OpenMP and OpenACC models, we describe data movement and worksharing, but rely on the compiler to generate correct code. Hence the qualification of descriptive parallelism.
When using low-level APIs, a certain degree of code duplication between host and device code is inevitable. Parallelization strategies between host and device will be different. On top of this, the expressiveness of low-level language extensions is limited, which might increase the maintenance burden.
Low-level APIs might only be available with proprietary, vendor-specific compilers. This limits both functional and performance portability.
Pragma-based schemes are standardized, but compiler support is not always optimal.
When programming for heterogeneous platforms, there is no silver bullet: we have to carefully evaluate the tradeoffs between performance, productivity, and portability. SYCL is an attempt at improving this state of affairs. [*] It is a standards-based and vendor-agnostic embedded domain-specific language for parallel programming, on heterogeneous and homogeneous architectures. The SYCL standard is developed by the Khronos Group:
It is built as a header-only library for ISO C++17. SYCL code can be compiled with a standards-compliant compiler and the necessary headers: it does not require special compiler extensions.
It is a single-source-style framework. Host and device code are in the same translation unit.
It is asynchronous. The programmer describes computations, memory allocations, and data migrations; the runtime generates a task graph and works its way through it when executing the program.
The SYCL compiler
Even though SYCL is a header-only library for ISO C++17, we still need a SYCL-aware compiler. A plain ISO C++17 compiler does not know how to generate optimal code for core abstractions, such as queues and unified shared memory. A SYCL compiler can optimize our parallel code and also knows which low-level framework to use to target specific architectures. Fortunately, there are many SYCL implementations to choose from!
Hello, SYCL!
Let’s dig in with a “Hello, world” example.
“Hello, world” with SYCL
You can find the file with the complete source code in the
content/code/day-1/00_hello
folder. Worry not about the details in the
code, we will dig into what is happening here at great length during the rest
of the lesson.
#include <cstring>
#include <iostream>
#include <string>

#include <sycl/sycl.hpp>
using namespace sycl;
int
main()
{
  const std::string secret { "Ifmmp-!xpsme\"\012J(n!tpssz-!Ebwf/!"
                             "J(n!bgsbje!J!dbo(u!ep!uibu/!.!IBM\01" };

  const auto sz = secret.size();

  queue Q;

  std::cout << "Running on: " << Q.get_device().get_info<info::device::name>()
            << std::endl;

  char *result = malloc_shared<char>(sz, Q);
  std::memcpy(result, secret.data(), sz);

  Q.parallel_for(
       range<1> { sz },
       [=](id<1> tid) {
         result[tid[0]] -= 1;
       })
      .wait();

  std::cout << result << "\n";

  free(result, Q);

  return 0;
}
Log in to Karolina and clone the repository for this workshop. Navigate to the correct folder. This contains a source file, hello.cpp, and the CMake script to build it. Load the necessary modules:
$ module load hipSYCL CMake/3.20.1-GCCcore-10.2.0
Configure and compile the code:
$ cmake -S. -Bbuild -DHIPSYCL_TARGETS="omp"
$ cmake --build build -- VERBOSE=1
Run the code! What result do you get?
$ ./build/hello
We can configure again to target the GPU:
$ cmake -S. -Bbuild -DHIPSYCL_TARGETS="cuda:sm_80"
$ cmake --build build -- VERBOSE=1
$ ./build/hello
What output do you see? We will talk more about device selection in Device discovery.
This source code introduces a number of fundamental concepts in SYCL:
SYCL is a template library and its classes and functions live in the sycl:: namespace. The SYCL runtime is provided by an optimizing compiler, in our case hipSYCL:
#include <sycl/sycl.hpp>

using namespace sycl;
Host and device code are in the same translation unit.
Thanks to unified shared memory we can use a pointer-based approach to memory management that transparently works across host and devices:
char *result = malloc_shared<char>(sz, Q);
std::memcpy(result, secret.data(), sz);
We will cover memory management in Data management with buffers and accessors and Data management with unified shared memory.
A queue is the mechanism by which we orchestrate work on our devices. For example, getting the device on which our actions will run:
queue Q;
Q.get_device().get_info<info::device::name>();
We will explore get_info and mechanisms for device selection in the Device discovery section.
An action is submitted to a queue and it runs on a device. In this example, our action is a parallel_for on a 1-dimensional range of work-items:
Q.parallel_for(
    range<1>{sz}, /* range of work-items */
    ...
);
Within actions, we execute kernels:
[=](id<1> tid) {
    result[tid[0]] -= 1;
}
The result array is indexed using an id object: a mapping between a range of work-items and available workers. Kernels are either lambda functions or function objects.
Actions are executed asynchronously. The host enqueues work and moves on with its tasks. If results are needed from an action, then we need to wait for it to complete:
Q.parallel_for(
    range<1>{sz}, /* range of work-items */
    [=](id<1> tid) {
        /* kernel code */
        result[tid[0]] -= 1;
    })
    .wait();
We have introduced two functions to manage USM: malloc_shared and free, for memory allocation and deallocation:
malloc_shared
template <typename T>
T* malloc_shared(size_t count,
const queue& syclQueue,
const property_list &propList = {})
Parameters
T
Type of the allocation.
count
How many elements to allocate.
syclQueue
Queue on which to perform the allocation.
propList
Properties of the allocation.
free
void free(void* ptr, sycl::queue& syclQueue)
Parameters
ptr
Memory to deallocate. It must have been allocated using any of the USM functions.
syclQueue
Queue on which to perform the deallocation.
We have handled memory migration using std::memcpy, which is part of the C++ standard.
Keypoints
Footnotes