Preparing code for GPU porting
What are the key steps involved in porting code to take advantage of GPU parallel processing capability?
How can I identify the computationally intensive parts of my code that can benefit from GPU acceleration?
What are the considerations for refactoring loops to suit the GPU architecture and improve memory access patterns?
Are there any tools that can translate automatically between different frameworks?
Get familiar with the steps involved in porting code to GPUs to take advantage of parallel processing capabilities.
Get an idea of how to refactor loops and modify operations to suit the GPU architecture and improve memory access patterns.
Learn to use automatic translation tools to port from CUDA to HIP and from OpenACC to OpenMP.
Instructor note
30 min teaching
20 min exercises
Porting from CPU to GPU
When porting code to take advantage of the parallel processing capability of GPUs, several steps need to be followed and some additional work is required before writing actual parallel code to be executed on the GPUs:
Identify Targeted Parts: Begin by identifying the parts of the code that contribute significantly to the execution time. These are often computationally intensive sections such as loops or matrix operations. The Pareto principle suggests that roughly 10-20% of the code accounts for 80-90% of the execution time.
Equivalent GPU Libraries: If the original code uses CPU libraries like BLAS, FFT, etc, it’s crucial to identify the equivalent GPU libraries. For example, cuBLAS or hipBLAS can replace CPU-based BLAS libraries. Utilizing GPU-specific libraries ensures efficient GPU utilization.
Refactor Loops: When porting loops directly to GPUs, some refactoring is necessary to suit the GPU architecture. This typically involves splitting the loop into multiple steps or modifying operations to exploit the independence between iterations and improve memory access patterns. Each step of the original loop can be mapped to a kernel, executed by multiple GPU threads, with each thread corresponding to an iteration.
Memory Access Optimization: Consider the memory access patterns in the code. GPUs perform best when memory access is coalesced and aligned. Minimizing global memory accesses and maximizing utilization of shared memory or registers can significantly enhance performance. Review the code to ensure optimal memory access for GPU execution.
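The first step above can be made quantitative with Amdahl's law: if a fraction p of the runtime is ported and runs s times faster, the overall speedup is 1 / ((1 - p) + p / s). The following is a minimal sketch; the function name and the numbers (a 20x kernel speedup) are illustrative assumptions, not measurements from any real code.

```python
def amdahl_speedup(ported_fraction, kernel_speedup):
    """Overall speedup when `ported_fraction` of the runtime is accelerated
    by a factor `kernel_speedup` and the rest of the code is unchanged."""
    if not 0.0 <= ported_fraction <= 1.0:
        raise ValueError("ported_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - ported_fraction) + ported_fraction / kernel_speedup)


# Hypothetical case: the hot spots take 80% or 90% of the runtime and
# run 20x faster on the GPU (assumed value, for illustration only).
for fraction in (0.8, 0.9):
    overall = amdahl_speedup(fraction, 20.0)
    upper_bound = 1.0 / (1.0 - fraction)  # limit for an infinitely fast GPU part
    print(f"ported fraction {fraction}: speedup {overall:.2f}, bound {upper_bound:.1f}")
```

The part left on the CPU sets the upper bound, which is why profiling to find the dominant sections comes before any GPU code is written.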
How would this be ported? (n_soap ≈ 100, n_sites ⩾ 10000, k_max ≈ 20*n_sites )
Inspect the following Fortran code (if you don’t read Fortran: do-loops == for-loops)
k2 = 0
do i = 1, n_sites
  do j = 1, n_neigh(i)
    k2 = k2 + 1
    counter = 0
    counter2 = 0
    do n = 1, n_max
      do np = n, n_max
        do l = 0, l_max
          if( skip_soap_component(l, np, n) )cycle

          counter = counter+1
          do m = 0, l
            k = 1 + l*(l+1)/2 + m
            counter2 = counter2 + 1
            multiplicity = multiplicity_array(counter2)
            soap_rad_der(counter, k2) = soap_rad_der(counter, k2) + multiplicity * &
              real( cnk_rad_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_rad_der(k, np, k2)) )
            soap_azi_der(counter, k2) = soap_azi_der(counter, k2) + multiplicity * &
              real( cnk_azi_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_azi_der(k, np, k2)) )
            soap_pol_der(counter, k2) = soap_pol_der(counter, k2) + multiplicity * &
              real( cnk_pol_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_pol_der(k, np, k2)) )
          end do
        end do
      end do
    end do

    soap_rad_der(1:n_soap, k2) = soap_rad_der(1:n_soap, k2) / sqrt_dot_p(i) &
      - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_rad_der(1:n_soap, k2) )
    soap_azi_der(1:n_soap, k2) = soap_azi_der(1:n_soap, k2) / sqrt_dot_p(i) &
      - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_azi_der(1:n_soap, k2) )
    soap_pol_der(1:n_soap, k2) = soap_pol_der(1:n_soap, k2) / sqrt_dot_p(i) &
      - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_pol_der(1:n_soap, k2) )

    if( j == 1 )then
      k3 = k2
    else
      soap_cart_der(1, 1:n_soap, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) &
        - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) &
        - dsin(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2)
      soap_cart_der(2, 1:n_soap, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) &
        - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) &
        + dcos(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2)
      soap_cart_der(3, 1:n_soap, k2) = dcos(thetas(k2)) * soap_rad_der(1:n_soap, k2) &
        + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2)
      soap_cart_der(1, 1:n_soap, k3) = soap_cart_der(1, 1:n_soap, k3) - soap_cart_der(1, 1:n_soap, k2)
      soap_cart_der(2, 1:n_soap, k3) = soap_cart_der(2, 1:n_soap, k3) - soap_cart_der(2, 1:n_soap, k2)
      soap_cart_der(3, 1:n_soap, k3) = soap_cart_der(3, 1:n_soap, k3) - soap_cart_der(3, 1:n_soap, k2)
    end if
  end do
end do
Some steps at first glance:
the code could (has to) be split into 3-4 kernels. Why?
check if there are any variables that could lead to false dependencies between iterations, like the index k2
is it efficient for GPUs to split the work over the index i? What about the memory access? Note the arrays are 2D in Fortran
is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance.
what is the best memory access in a GPU? Review memory access patterns in the code. Minimize global memory access by utilizing shared memory or registers where appropriate. Ensure memory access is coalesced and aligned, maximizing GPU memory throughput
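The question about 2D Fortran arrays can be explored with a small index calculation. Fortran stores arrays in column-major order, so the first index is contiguous in memory. The sketch below is only an illustration (the helper names are made up here, and the sizes are taken from the orders of magnitude quoted above): it computes how far apart the elements touched by two neighbouring threads are when the thread index is mapped to k2.

```python
def column_major_offset(indices, shape):
    """Linear offset (in elements) of a 1-based Fortran index tuple."""
    offset, stride = 0, 1
    for index, extent in zip(indices, shape):
        offset += (index - 1) * stride
        stride *= extent
    return offset


def thread_stride(shape, thread_dim):
    """Distance in elements between the accesses of two neighbouring threads
    when the thread index is mapped to dimension `thread_dim` (0-based)."""
    first = [1] * len(shape)
    second = list(first)
    second[thread_dim] = 2
    return column_major_offset(second, shape) - column_major_offset(first, shape)


n_soap, k2_max = 100, 200000  # illustrative sizes (n_soap ~ 100, k_max ~ 20 * n_sites)

# Layout soap_rad_der(counter, k2): threads run over the second index.
print(thread_stride((n_soap, k2_max), thread_dim=1))  # stride of n_soap elements
# Transposed layout tsoap_rad_der(k2, counter): threads run over the first index.
print(thread_stride((k2_max, n_soap), thread_dim=0))  # stride of 1 element
```

A stride of one element between neighbouring threads is the pattern that allows coalesced loads and stores; a stride of n_soap elements spreads each access over a different memory segment.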
Refactored code!
Registers are limited: a larger kernel uses more registers, which results in fewer active threads (lower occupancy).
In order to compute soap_rad_der(is,k2) the GPU thread needs access to all the previously computed values soap_rad_der(1:n_soap,k2), because they enter the dot product.
In order to compute soap_cart_der(1, 1:n_soap, k3) it is required to have access to all values (k3+1:k3+n_neigh(i)-1).
Note the indices in the first part. The matrices are transposed for better access patterns.
! Kernel 1: one thread per pair index k2, working on the transposed arrays
!$omp target teams distribute parallel do private(i, counter, counter2, k, multiplicity)
do k2 = 1, k2_max
  i = list_of_i(k2)
  counter = 0
  counter2 = 0
  do n = 1, n_max
    do np = n, n_max
      do l = 0, l_max
        if( skip_soap_component(l, np, n) ) then
          cycle
        endif
        counter = counter+1
        do m = 0, l
          k = 1 + l*(l+1)/2 + m
          counter2 = counter2 + 1
          multiplicity = multiplicity_array(counter2)
          tsoap_rad_der(k2,counter) = tsoap_rad_der(k2,counter) + multiplicity * &
            real( tcnk_rad_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_rad_der(k2,k,np)) )
          tsoap_azi_der(k2,counter) = tsoap_azi_der(k2,counter) + multiplicity * &
            real( tcnk_azi_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_azi_der(k2,k,np)) )
          tsoap_pol_der(k2,counter) = tsoap_pol_der(k2,counter) + multiplicity * &
            real( tcnk_pol_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_pol_der(k2,k,np)) )
        end do
      end do
    end do
  end do
end do

! Before the next part the variables are transposed again to their original layout.

! Kernel 2: one team per k2; the threads of the team reduce the three dot products
!$omp target teams distribute private(i, locdot_rad_der, locdot_azi_der, locdot_pol_der)
do k2 = 1, k2_max
  i = list_of_i(k2)
  locdot_rad_der = 0.d0
  locdot_azi_der = 0.d0
  locdot_pol_der = 0.d0
  !$omp parallel do reduction(+:locdot_rad_der,locdot_azi_der,locdot_pol_der)
  do is = 1, n_soap
    locdot_rad_der = locdot_rad_der + soap(is, i) * soap_rad_der(is, k2)
    locdot_azi_der = locdot_azi_der + soap(is, i) * soap_azi_der(is, k2)
    locdot_pol_der = locdot_pol_der + soap(is, i) * soap_pol_der(is, k2)
  enddo
  dot_soap_rad_der(k2) = locdot_rad_der
  dot_soap_azi_der(k2) = locdot_azi_der
  dot_soap_pol_der(k2) = locdot_pol_der
end do

! Kernel 3: normalization, using the dot products stored by kernel 2
!$omp target teams distribute private(i)
do k2 = 1, k2_max
  i = list_of_i(k2)
  !$omp parallel do
  do is = 1, n_soap
    soap_rad_der(is, k2) = soap_rad_der(is, k2) / sqrt_dot_p(i) &
      - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_rad_der(k2)
    soap_azi_der(is, k2) = soap_azi_der(is, k2) / sqrt_dot_p(i) &
      - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_azi_der(k2)
    soap_pol_der(is, k2) = soap_pol_der(is, k2) / sqrt_dot_p(i) &
      - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_pol_der(k2)
  end do
end do

! Kernel 4: Cartesian derivatives for every pair except the first one of each site
!$omp target teams distribute private(k3)
do k2 = 1, k2_max
  k3 = list_k2k3(k2)
  !$omp parallel do private(is)
  do is = 1, n_soap
    if( k3 /= k2 )then
      soap_cart_der(1, is, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(is, k2) &
        - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(is, k2) &
        - dsin(phis(k2)) / rjs(k2) * soap_azi_der(is, k2)
      soap_cart_der(2, is, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(is, k2) &
        - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(is, k2) &
        + dcos(phis(k2)) / rjs(k2) * soap_azi_der(is, k2)
      soap_cart_der(3, is, k2) = dcos(thetas(k2)) * soap_rad_der(is, k2) &
        + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(is, k2)
    end if
  end do
end do

! Kernel 5: accumulate the contributions of the other pairs into the first pair (k3) of each site
!$omp target teams distribute private(k3)
do i = 1, n_sites
  k3 = list_k3(i)
  !$omp parallel do private(is, k2)
  do is = 1, n_soap
    ! assuming list_k3(i) is the k2 index of the first pair (j == 1) of site i,
    ! the remaining pairs of that site are k3+1 ... k3+n_neigh(i)-1
    do k2 = k3+1, k3+n_neigh(i)-1
      soap_cart_der(1, is, k3) = soap_cart_der(1, is, k3) - soap_cart_der(1, is, k2)
      soap_cart_der(2, is, k3) = soap_cart_der(2, is, k3) - soap_cart_der(2, is, k2)
      soap_cart_der(3, is, k3) = soap_cart_der(3, is, k3) - soap_cart_der(3, is, k2)
    end do
  end do
end do
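The refactored loops read the lookup arrays list_of_i, list_k2k3 and list_k3, but their construction is not shown. The sketch below is one plausible way to build them from n_neigh, assuming that list_k3(i) holds the k2 index of the first pair (j == 1) of site i, which is what the original loop stores in k3. The function name build_index_lists is made up for this illustration, and the values are kept 1-based to match the Fortran code.

```python
def build_index_lists(n_neigh):
    """Build 1-based lookup lists that replace the running counter k2.

    n_neigh[i-1] is the number of (i, j) pairs of site i.
    Returns (list_of_i, list_k3, list_k2k3), indexed as in the Fortran code
    but stored in ordinary 0-based Python lists.
    """
    list_k3 = []      # k2 index of the first pair (j == 1) of each site
    list_of_i = []    # site index i of each pair k2
    next_k2 = 1
    for i, count in enumerate(n_neigh, start=1):
        list_k3.append(next_k2)
        list_of_i.extend([i] * count)
        next_k2 += count
    # For each pair k2, the k2 index of the first pair of the same site
    list_k2k3 = [list_k3[i - 1] for i in list_of_i]
    return list_of_i, list_k3, list_k2k3


# Tiny illustrative system with three sites
list_of_i, list_k3, list_k2k3 = build_index_lists([3, 2, 4])
print(list_of_i)   # site of every pair
print(list_k3)     # first pair of every site
print(list_k2k3)   # first pair of the site each pair belongs to
```

With these lists every iteration over k2 can compute its own i and k3, so the sequential dependency carried by the counter k2 in the original loop nest disappears.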
Identify equivalent GPU libraries for CPU-based libraries and use them to ensure efficient GPU utilization.
Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time.
The need to refactor loops to suit the GPU architecture.
Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns.
Porting between different GPU frameworks
You might also find yourself in a situation where you need to port a code from one particular GPU framework to another. This section gives an overview of different tools that enable converting CUDA and OpenACC codes to HIP and OpenMP, respectively. This conversion process enables an application to target various GPU architectures, specifically NVIDIA and AMD GPUs. Here we focus on the hipify and clacc tools. This guide is adapted from the NRIS documentation.
Translating CUDA to HIP with Hipify
In this section, we cover the use of the hipify-perl and hipify-clang tools to translate a CUDA code to HIP.
The hipify-perl tool is a script based on perl that translates CUDA syntax into HIP syntax (see e.g. here for more details). For instance, in a CUDA code that incorporates the CUDA functions cudaMalloc and cudaDeviceSynchronize, the tool will substitute cudaMalloc with the HIP function hipMalloc. Similarly, the CUDA function cudaDeviceSynchronize will be substituted with the HIP function hipDeviceSynchronize. We list below the basic steps to run hipify-perl on LUMI-G.
Step 1: Generating the hipify-perl script
$ module load rocm/5.2.3
$ hipify-clang --perl
Step 2: Running the generated hipify-perl script
$ hipify-perl >
Step 3: Compiling the generated HIP code with hipcc
$ hipcc --offload-arch=gfx90a -o program.hip.exe
Despite the simplicity of the use of hipify-perl, the tool might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it substitutes *cuda* with *hip*). In addition, hipify-perl lacks the ability to distinguish device/host function calls. The alternative here is to use the hipify-clang tool, as described in the next section.
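To see why purely textual substitution can be fragile, consider the toy example below. It is not how hipify-perl is implemented; both functions and the small mapping table are assumptions made for this illustration. It contrasts a blind replacement of "cuda" by "hip" with a replacement restricted to known API identifiers.

```python
import re

cuda_source = """
float *d_x;
int use_cuda_path = 1;   // user variable, not an API name
cudaMalloc(&d_x, n * sizeof(float));
cudaDeviceSynchronize();
"""


def naive_translate(source):
    """Toy translator: blindly replace every 'cuda' with 'hip'."""
    return source.replace("cuda", "hip")


def token_translate(source, mapping):
    """Toy translator: replace only whole identifiers listed in `mapping`."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], source)


# Illustrative mapping with the two functions mentioned in the text
api_map = {
    "cudaMalloc": "hipMalloc",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

print(naive_translate(cuda_source))           # also renames use_cuda_path
print(token_translate(cuda_source, api_map))  # leaves use_cuda_path untouched
```

Neither toy understands the code: deciding whether a call runs on the host or on the device requires parsing the source, which is the motivation for a compiler-based tool.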
As described in the HIPIFY documentation, the hipify-clang tool is based on clang for translating CUDA sources into HIP sources. The tool is more robust for translating CUDA codes compared to the hipify-perl tool. Furthermore, it facilitates the analysis of the code by providing assistance.
In short, hipify-clang requires LLVM+CLANG and CUDA. Details about building hipify-clang can be found here. Note that hipify-clang is available on LUMI-G. The issue however might be related to the installation of the CUDA toolkit. To avoid any eventual issues with the installation procedure we opt for a CUDA singularity container. Here we present a step-by-step guide for running hipify-clang:
Step 1: Pulling a CUDA singularity container, e.g.
$ singularity pull docker://
Step 2: Loading a rocm module and launching the CUDA singularity container
$ module load rocm/5.2.3
$ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif
where the current directory $PWD in the host is mounted to that of the container, and the directory /opt in the host is mounted to that inside the container.
Step 3: Setting the environment variable $PATH. In order to run hipify-clang from inside the container, one can set the environment variable $PATH that defines the path to look for the binary hipify-clang.
$ export PATH=/opt/rocm-5.2.3/bin:$PATH
Note that the rocm version we used is rocm-5.2.3.
Step 4: Running hipify-clang from inside the singularity container
$ hipify-clang -o --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include
Here the cuda path and the path to the include files should be specified. The CUDA source file is passed as the input argument and the generated HIP code is written to the file named after -o.
The syntax for the compilation process of the generated HIP code is similar to the one described in the previous section (see Step 3 in the hipify-perl section).
Code examples for the Hipify exercises can be accessed in the content/examples/exercise_hipify subdirectory by cloning this repository:
$ git clone
$ cd gpu-programming/content/examples/exercise_hipify
$ ls
Exercise I : Translate a CUDA code to HIP with hipify-perl
1.1 Generate the hipify-perl script.
1.2 Convert the CUDA code located in /exercise_hipify/Hipify_perl to HIP with the hipify-perl tool.
1.3 Compile the generated HIP code with the hipcc compiler wrapper and run it.
Exercise II : Translate a CUDA code to HIP with hipify-clang
2.1 Convert the CUDA code located in /exercise_hipify/Hipify_clang to HIP with the hipify-clang tool.
2.2 Compile the generated HIP code with the hipcc compiler wrapper and run it.
Translating OpenACC to OpenMP with Clacc
Clacc is a tool to translate an OpenACC application to OpenMP offloading with the Clang/LLVM compiler environment. Note that the tool is specific to OpenACC C, while OpenACC Fortran is already supported on AMD GPU. As indicated in the GitHub repository, the compiler Clacc is the Clang’s executable in the subdirectory /bin of the /install directory, as described below.
In the following we present a step-by-step guide for building and using Clacc:
Step 1: Building and installing Clacc.
$ git clone -b clacc/main
$ cd llvm-project
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../install \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_RUNTIMES=openmp \
    -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    ../llvm
$ make
$ make install
Step 2: Setting up environment variables to be able to work from the /install directory, which is the simplest way. We assume that the /install directory is located in the path /project/project_xxxxxx/Clacc/llvm-project. For more advanced usage, which includes for instance modifying Clacc, we refer readers to “Usage from Build directory”.
$ export PATH=/project/project_xxxxxx/Clacc/llvm-project/install/bin:$PATH
$ export LD_LIBRARY_PATH=/project/project_xxxxxx/Clacc/llvm-project/install/lib:$LD_LIBRARY_PATH
Step 3: Source to source conversion of the openACC_code.c code to be printed out to the file openMP_code.c:
$ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-ompx-hold openACC_code.c > openMP_code.c
Here the flag -fopenacc-structured-ref-count-omp=no-ompx-hold is introduced to disable the ompx_hold map type modifier, which is used by the OpenACC copy clause translation. The ompx_hold is an OpenMP extension that might not be supported yet by other compilers.
Step 4: Compiling the code with the cc compiler wrapper
module load CrayEnv
module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.2.3
cc -fopenmp -o executable openMP_code.c
Access exercise material
Code examples for the Clacc exercise can be accessed in the content/examples/exercise_clacc subdirectory by cloning this repository:
$ git clone
$ cd gpu-programming/content/examples/exercise_clacc
$ ls
Exercise : Translate an OpenACC code to OpenMP
Convert the OpenACC code located in /exercise_clacc with the Clacc compiler.
Compile the generated OpenMP code with the cc compiler wrapper and run it.
Translating CUDA to SYCL/DPC++ with SYCLomatic
Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI Basekit.
It is not installed on LUMI, but the general workflow is similar to that of hipify-clang and also requires an existing CUDA installation:
$ dpct
$ cd dpct_output/
$ icpx -fsycl program.dp.cpp
SYCLomatic can migrate larger projects by using the -in-root and -out-root flags to process directories recursively. It can also use a compilation database (supported by CMake and other build systems) to deal with more complex project layouts.
Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions, and thus cannot be directly used with other SYCL implementations, such as AdaptiveCpp (hipSYCL). The --no-incremental-migration flag can be added to the dpct command to minimize, but not completely avoid, the use of this compatibility layer. Removing it completely requires manual effort, since some CUDA concepts cannot be directly mapped to SYCL.
Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide warps. If the target hardware is different (e.g., AMD MI250 GPUs, used in LUMI, have a warp size of 64), the algorithms might need to be adjusted manually.
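The effect of a hard-coded warp size can be illustrated with plain index arithmetic. The sketch below is an assumption-laden toy (the function name and the block size of 256 are chosen for illustration): it compares the lane index obtained from the common CUDA idiom tid & 31 with the lane index on 64-wide hardware.

```python
def warp_layout(block_size, warp_size):
    """Return the number of warps per block and the (warp id, lane id)
    pair of every thread in the block."""
    n_warps = -(-block_size // warp_size)  # ceiling division
    ids = [(tid // warp_size, tid % warp_size) for tid in range(block_size)]
    return n_warps, ids


block_size = 256  # illustrative block size
for warp_size in (32, 64):
    n_warps, ids = warp_layout(block_size, warp_size)
    print(f"warp size {warp_size}: {n_warps} warps, thread 40 -> {ids[40]}")

# A lane index computed with the hard-coded mask assumes 32-wide warps
tid = 40
hard_coded_lane = tid & 31
print(hard_coded_lane, warp_layout(block_size, 64)[1][tid][1])
```

On 64-wide hardware thread 40 sits in lane 40 of the first group, while the hard-coded mask places it in lane 8, so warp-level reductions or shuffles written around the value 32 would combine the wrong threads.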
This concludes a brief overview of the usage of available tools to convert CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. In general the translation process for large applications might be incomplete and thus requires manual modification to complete the porting process. It is however worth noting that the accuracy of the translation process requires that applications are written correctly according to the CUDA and OpenACC syntaxes.
See also
Useful tools exist to automatically translate code from CUDA to HIP and SYCL and from OpenACC to OpenMP, but the results may require manual modifications.