Data management with unified shared memory
Questions
What facilities does SYCL offer to manage memory spaces in a heterogeneous environment?
Objectives
Learn about the unified shared memory (USM) API.
In the previous episode we learnt that SYCL offers three abstractions for memory management: the buffer and accessor API and unified shared memory (USM) are the most relevant for our purposes. We will discuss the latter in this episode.
USM is probably the biggest new feature adopted in the SYCL 2020 standard. Why? The value of any pointer returned by a USM allocation on the host is guaranteed to be a valid pointer value on the device as well. We have seen that the buffer-accessor API is powerful and also quite intuitive in a modern C++ setting. However, most programmers are more familiar with pointer-based memory management, especially if they have been working with the lower-level CUDA/HIP programming models. Furthermore, it is difficult to adopt SYCL in an existing codebase if doing so requires radical changes to fundamental infrastructure. USM offers a path forward.
Warning
USM is only available in SYCL for devices that support a unified virtual address space. If you want or need to use USM, be sure to write an appropriate device selector for your queue!
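SYCL 2020 defines USM-related device aspects that can be used both to select a suitable device and to check what a given device supports. The following is a minimal sketch, not part of the exercises; the choice of aspect is illustrative:

```cpp
#include <sycl/sycl.hpp>

#include <iostream>

using namespace sycl;

int main() {
  // request a device that supports shared USM allocations
  queue Q{aspect_selector(aspect::usm_shared_allocations)};

  // the same capabilities can be checked explicitly on any device
  auto dev = Q.get_device();
  std::cout << "Running on: " << dev.get_info<info::device::name>() << '\n'
            << "  device USM: " << dev.has(aspect::usm_device_allocations) << '\n'
            << "  host USM:   " << dev.has(aspect::usm_host_allocations) << '\n'
            << "  shared USM: " << dev.has(aspect::usm_shared_allocations) << '\n';
}
```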
Let’s now analyze the allocation and data movement aspects of USM.
USM memory allocation
There are three types of USM allocations available in the SYCL standard. If you have worked previously with CUDA/HIP, this “menu” of allocations should look fairly familiar to you:
Device allocations will return a pointer to memory physically located on the device.
Host allocations will return a pointer to memory physically located on the host. These are accessible both on the host and the device. In the latter case, however, memory will not migrate to the device automatically, but rather be accessed remotely. This is a crucial aspect to keep in mind for performance!
Shared allocations will return a pointer to the unified virtual address space. Such allocations can be accessed from both host and device, and the memory can migrate freely, without programmer intervention, between host and device. This convenience comes at the cost of increased latency.
The following table summarizes the available USM allocations and their properties.
| Kind   | Host-accessible | Device-accessible | Memory space | Automatic migration |
|--------|-----------------|-------------------|--------------|---------------------|
| Device | No              | Yes               | Device       | No                  |
| Host   | Yes             | Yes               | Host         | No                  |
| Shared | Yes             | Yes               | Shared       | Yes                 |
To perform USM allocations, we need to inform the runtime about which device
we’d like to request memory from. The simplest way is to pass a queue
object
to the allocation functions. The standard provides three APIs for allocating
USM:
C-like (untyped) malloc:

```cpp
void* malloc_device(size_t numBytes, const queue& syclQueue, const property_list& prop_list = {});

void* malloc_host(size_t numBytes, const queue& syclQueue, const property_list& prop_list = {});

void* malloc_shared(size_t numBytes, const queue& syclQueue, const property_list& prop_list = {});
```
C++-like (typed) malloc:

```cpp
template <typename T>
T* malloc_device(size_t count, const queue& syclQueue, const property_list& prop_list = {});

template <typename T>
T* malloc_host(size_t count, const queue& syclQueue, const property_list& prop_list = {});

template <typename T>
T* malloc_shared(size_t count, const queue& syclQueue, const property_list& prop_list = {});
```
A C++ allocator object: the usm_allocator class, which we will use in a bonus exercise later in this episode.
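To make the difference concrete, here is a minimal sketch (sizes and variable names are illustrative) contrasting the untyped and typed allocation calls:

```cpp
queue Q;

// untyped, C-like API: size in bytes, returns void*
void* a = malloc_device(1024 * sizeof(double), Q);

// typed, C++-like API: element count, returns double*
double* b = malloc_shared<double>(1024, Q);
```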
As usual, you need to free any memory you claim dynamically from the runtime.
The free
function also needs information about the location of the memory,
which can be conveniently conveyed by a queue
object:
```cpp
void free(void* ptr, queue& syclQueue);
```
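Putting it together, the typical allocate-use-free pattern looks like the following sketch (names are illustrative); note that free also takes the queue, which identifies the context the memory belongs to:

```cpp
queue Q;

// request 256 doubles of device memory from the device associated with Q
auto* x_d = malloc_device<double>(256, Q);

// ... submit kernels that use x_d ...

// hand the memory back to the runtime
free(x_d, Q);
```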
USM data management
We have claimed memory from the system; now what do we do with it? Usually, we first initialize it with some useful values and then use it in our data-parallel kernels.
Initialization of the allocated memory to a byte value or to an arbitrary pattern can be achieved using the memset and fill methods, respectively, which the SYCL standard provides on the queue and handler classes.
Using fill

```cpp
queue Q;

auto x = malloc_device<double>(256, Q);

// fill all 256 elements with the value 42.0; fill is asynchronous
Q.fill(x, 42.0, 256).wait();
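```

Byte-wise initialization works analogously; as a brief sketch, zeroing out the same allocation with memset (which is also asynchronous) would look like:

```cpp
// set all bytes of the 256-double device allocation to zero
Q.memset(x, 0, 256 * sizeof(double)).wait();
```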
For more complex initialization, a data-parallel kernel is the way to go, and that requires us to learn about USM and data movement.
Movement
Data movement is a big part of achieving performance in a heterogeneous programming environment: data needs to be available at the right time and in the right place for parallel kernels to perform optimally. This is probably old news if you come from a CUDA/HIP background.
USM supports both explicit and implicit data movement strategies.
- Explicit

We have to call memcpy (untyped, C-like API) or copy (typed, C++-like API) explicitly whenever data needs to migrate between host and device. These methods are available on both the queue and handler classes. Remember that methods of the queue and handler classes are asynchronous, and copies are no exception! Explicit data movement is only strictly necessary for host-to-device and device-to-host migrations, since device allocations cannot be accessed directly from the host. (A short sketch of the device-to-host direction follows after this list.)

Explicit data migration
This looks quite like CUDA/HIP! But why would we want to use this? Isn't the whole point of SYCL to not have to think about data movement? While ease of programming is definitely important, we also want a framework that empowers us to take full control whenever we deem it necessary.
```cpp
constexpr auto N = 256;

queue Q;

std::vector<double> x_h(N);
std::iota(x_h.begin(), x_h.end(), 0.0);

auto x_d = malloc_device<double>(N, Q);

// in a handler
Q.submit([&](handler& cgh){
  // untyped API: destination first, then source
  cgh.memcpy(x_d, x_h.data(), N*sizeof(double));
  // or typed API: source first, then destination
  //cgh.copy(x_h.data(), x_d, N);
});

// or on the queue directly
// with typed API
//Q.copy(x_h.data(), x_d, N);
// or untyped API
//Q.memcpy(x_d, x_h.data(), N*sizeof(double));

// copies are ASYNCHRONOUS!!
Q.wait();
```
```cpp
constexpr auto N = 256;

std::vector<double> x_h(N);
std::iota(x_h.begin(), x_h.end(), 0.0);

double* x_d;
cudaMalloc((void**)&x_d, N*sizeof(double));

cudaMemcpy(x_d, x_h.data(), N*sizeof(double), cudaMemcpyHostToDevice);
```
- Implicit
This movement strategy requires no programmer intervention and is relevant for host and shared allocations. When host allocations are accessed on a device, the runtime transfers the data over the appropriate hardware interface; the allocation itself never migrates to the device, so every access incurs latency and repeated accesses are discouraged. Shared allocations are essentially defined by their ability to migrate between host and device: migration happens simply by accessing the same memory location from host or device code.
Host and shared allocations benefit from implicit data movement
In this sample code, we access both the x_h and x_s arrays within kernel code. The former will be transferred from host memory over the appropriate interface, e.g. PCIe; the latter will be migrated to device memory.

```cpp
constexpr auto N = 256;

queue Q;

auto x_h = malloc_host<double>(N, Q);
for (auto i = 0; i < N; ++i) {
  x_h[i] = static_cast<double>(i);
}

auto x_s = malloc_shared<double>(N, Q);

// in a handler
Q.submit([&](handler& cgh){
  cgh.parallel_for(range{N}, [=](id<1> tid){
    // get index out of id object
    auto i = tid[0];
    x_s[i] = x_h[i] + 1.0;
  });
});

// or on the queue directly
//Q.parallel_for(range{N}, [=](id<1> tid){
//  // get index out of id object
//  auto i = tid[0];
//  x_s[i] = x_h[i] + 1.0;
//});

Q.wait();
```
AXPY with SYCL and USM
We will now write an AXPY implementation in SYCL, using unified shared memory. This will be a generic implementation: it will work with any arithmetic type, thanks to C++ templates.
Don’t do this at home, use optimized BLAS!
You can find a scaffold for the code in the
content/code/day-1/06_axpy-usm/axpy.cpp
file, alongside the CMake script
to build the executable. You will have to complete the source code to compile
and run correctly: follow the hints in the source file.
The code fills two raw arrays and passes them to the axpy
function, which accepts a queue
object as first parameter.
You have to allocate the x
and y
operands and complete the
axpy
function:
- Define and allocate raw arrays for the operands. Should these allocations be of host, device, or shared type?
- Fill the operands such that their sum is equal to sz - 1.
- Complete the axpy function.
A working solution is in the solution
subfolder.
In this bonus exercise, we’ll implement AXPY to use USM and
std::vector
.
Containers in the C++ STL can be constructed to use custom
allocators. For example:
```cpp
std::vector x(count, value, alloc);
```
will create a vector with count
elements initialized to value
,
with memory managed by the alloc
object.
The SYCL standard already defines the usm_allocator
class, which
conforms to the C++ allocator interface. Refer to the page in the
standard
for details.
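As a minimal sketch of how the pieces fit together (the choice of a shared allocation here is illustrative; the exercise asks you to decide which kind is appropriate), a usm_allocator is constructed from a queue and then handed to the container:

```cpp
queue Q;

// allocator producing shared USM allocations for the device and context of Q
using shared_alloc_t = usm_allocator<double, usm::alloc::shared>;
shared_alloc_t alloc{Q};

// 1024 doubles initialized to zero, backed by shared USM
std::vector<double, shared_alloc_t> x(1024, 0.0, alloc);
```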
You can find a scaffold for the code in the
content/code/day-1/07_axpy-usm_allocator/axpy.cpp
file, alongside
the CMake script to build the executable. You will have to complete the
source code to compile and run correctly: follow the hints in the
source file.
- Create std::vector objects for the operands and fill them such that their sum is equal to sz - 1. Should these allocations be of host or shared type?
- Complete the axpy function. Why do we take the address of the output vector before submitting the kernel?
A working solution is in the solution
subfolder.
Note
The fact that we use a device-aware allocator for std::vector does not mean that STL objects will magically work within the SYCL runtime. Indeed, using the container object itself inside device code is not legal: it might compile and even appear to work in some cases, but it is to be considered a bug. You can read a complete explanation here.
Keypoints
Unified shared memory (USM) provides a pointer-based API for SYCL. It looks and feels familiar if coming from CUDA/HIP.
It is useful when porting existing code to SYCL, as it might require less pervasive changes to the codebase.
The SYCL standard offers APIs for host, device, and shared allocations.
USM supports both explicit and implicit data movement. The former is only relevant for device allocations.
Implicit data movement simplifies programmers’ life considerably, but we might incur hard-to-control performance penalties.