The GPU hardware and software ecosystem

Questions

  • What are the differences between GPUs and CPUs?

  • What GPU software stacks are available? What do they provide?

Objectives

  • Understand the fundamental differences between GPUs and CPUs

  • Explore the major GPU software suites available, such as CUDA, ROCm, and oneAPI, and gain a basic understanding of them

Instructor note

  • 20 min teaching

  • 0 min exercises

Overview of GPU hardware

../_images/CPUAndGPU.png

A comparison of the CPU and GPU architecture. CPU (left) has complex core structure and pack several cores on a single chip. GPU cores are very simple in comparison, they also share data and control between each other. This allows to pack more cores on a single chip, thus achieving very high compute density.

Accelerators are a separate main circuit board with the processor, memory, power management, etc. It is connected to the motherboard with CPUs via PCIe bus. Having its own memory means that the data has to be copied to and from it (not neceseraly true anymore). CPU acts as a main processor, controlling the execution workflow. It copies the data from its own memory to the GPU memory, executes the program and copies the results back. GPUs runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management. With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory very quickly. This makes the data management and its access pattern is essential on the GPU. Accelerators like to be overloaded with the number of threads, because they can switch between threads very quickly. This allows to hide the memory operations: while some threads wait, others can compute.

A very important feature of the accelerators is their scalability. Computational cores on accelerators are usually grouped into multiprocessors. The multiprocessors share the data and logical elements. This allows to achieve a very high density of compute elements on a GPU. This also allows the scaling: more multiprocessors means more raw performance and this is very easy to achieve with more transistors available.

How do GPUs differ from CPUs?

CPUs and GPUs were designed with different goals in mind. While the CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing many thousands of them in parallel. GPUs were initially developed for highly-parallel task of graphic processing and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. More transistors dedicated to data processing is beneficial for highly parallel computations; the GPU can hide memory access latencies with computation, instead of relying on large data caches and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors.

CPU

GPU

General purpose

Highly specialized for parallelism

Good for serial processing

Good for parallel processing

Great for task parallelism

Great for data parallelism

Low latency per thread

High-throughput

Large area dedicated cache and control

Hundreds of floating-point execution units

GPU platforms

GPUs come together with software stacks or APIs that work in conjunction with the hardware and give a standard way for the software to interact with the GPU hardware. They are used by software developers to write code that can take advantage of the parallel processing power of the GPU, and they provide a standard way for software to interact with the GPU hardware. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, in order to facilitate the developers to optimize and write correct codes, debugging and profiling tools are also included.

NVIDIA, AMD, and Intel are the major companies which design and produces GPUs for HPC providing each its own suite CUDA, ROCm, and respectively oneAPI. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty. There are also cross-platform APIs such DirectCompute (only for Windows operating system), OpenCL, and SYCL.

CUDA

Compute Unified Device Architecture is the parallel computing platform from NVIDIA. The CUDA API provides a comprehensive set of functions and tools for developing high-performance applications that run on NVIDIA GPUs. It consists of two main components: the CUDA Toolkit and the CUDA driver. The toolkit provides a set of libraries, compilers, and development tools for programming and optimizing CUDA applications, while the driver is responsible for communication between the host CPU and the device GPU. CUDA is designed to work with programming languages such as C, C++, and Fortran.

CUDA API provides many highly optimize libraries such as: cuBLAS (for linear algebra operations, such a dense matrix multiplication), cuFFT (for performing fast Fourier transforms), cuRAND (for generating pseudo-random numbers), cuSPARSE (for sparse matrices operations). Using these libraries, developers can quickly and easily accelerate complex computations on NVIDIA GPUs without having to write low-level GPU code themselves.

There are several compilers that can be used for developing and executing code on NVIDIA GPUs: nvcc. The latest versions are based on the widely used LLVM (low level virtual machine) open source compiler infrastructure. nvcc produces optimized code for NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs.

In addition to this are provided nvc (C11 compiler), nvc++ (C++17 compiler), and nvfortran (ISO Fortran 2003 compiler). These compilers can as well create code for execution on the NVIDIA GPUs, and also support GPU and multicore CPU programming with parallel language features, OpeanACC and OpenMP.

When programming mistakes are inevitable they have to be fixed as soon as possible. The CUDA toolkit includes the command line tool cuda-gdb which can be used to find errors in the code. It is an extension to GDB, the GNU Project debugger. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code, allowing simultaneous debugging of both GPU and CPU code within the same application. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments.

In addition to this the command line tool compute-sanitizer can be used to look exclusively for memory access problems: unallocated buffers, out of bounds accesses, race conditions, and uninitialized variables.

Finally, in order to utilize the GPUs at maximum some performance analysis tools. NVIDIA provides NVIDIA Nsight Systems and NVIDIA Nsight Compute tools for helping the developers to optimize their applications. The former, NVIDIA Nsight Systems, is a system-wide performance analysis tool that provides detailed metrics on both CPU and GPU usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight Compute, is a kernel-level performance analysis tool that allows developers to analyze the performance of individual CUDA kernels. It provides detailed metrics on kernel execution, including memory usage, instruction throughput, and occupancy. These tools have graphical which can be used for all steps of the performance analysis, however on supercomputers it is recommended to use the command line interface for collecting the information needed and then visualize and analyse the results using the graphical interface on personal computers.

Apart from what was presented above there are many others tools and features provided by NVIDIA. The CUDA ecosystem is very well developed.

ROCm

ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. The ROCm platform is built on the foundation of open portability, supporting environments across multiple accelerator vendors and architectures. In some way it is very similar to CUDA API. It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs. For debugging, it provides the command line tool rocgdb, while for performance analysis rocprof and roctracer. In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP). HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms. It provides the hipcc compiler driver, which will call the appropriate toolchain depending on the desired platform. On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. The header file contains mostly inlined functions and thus has very low overhead. The code is then compiled with nvcc, the standard C++ compiler provided with CUDA. On AMD platforms, libraries are prefixed by roc, which can be called directly from HIP. In order to make portable calls, one can call the libraries using hip-prefixed wrappers. These wrappers can be used at no performance cost and ensure that HIP code can be used on other platforms with no changes. Libraries included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA.

ROCm also integrates with popular machine learning frameworks such as TensorFlow and PyTorch and provides optimized libraries and drivers to accelerate machine learning workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD accelerators to train and deploy machine learning models efficiently.

oneAPI

Intel oneAPI is a unified software toolkit developed by Intel that allows developers to optimize and deploy applications across a variety of architectures, including CPUs, GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks, enabling developers to leverage the full potential of heterogeneous computing environments. With oneAPI, the developers can write code once and deploy it across different hardware targets without the need for significant modifications or rewriting. This approach promotes code reusability, productivity, and performance portability, as it abstracts the complexities of heterogeneous computing and provides a consistent programming interface based on open standards.

The core of suite is Intel oneAPI Base Toolkit, a set of tools and libraries for developing high-performance, data-centric applications across diverse architectures. It features an industry-leading C++ compiler that implements SYCL, an evolution of C++ for heterogeneous computing. It includes the Collective Communications Library, the Data Analytics Library, the Deep Neural Networks Library, the DPC++/C++ Compiler, the DPC++ Library, the Math Kernel Library, the Threading Building Blocks, debugging tool Intel Distribution for GDB, performance analysis tools Intel Adviser and Intel Vtune Profiler, the Video Processing Library, Intel Distribution for Python, the DPC++ Compatibility Tool, the FPGA Add-on for oneAPI Base Toolkit, the Integrated Performance Primitives. This can be complemented with additional toolkits. The Intel oneAPI HPC Toolkit contains DPC++/C++ Compiler, Fortran and C++ Compiler Classic, debugging tools Cluster Checker and Inspector, Intel MPI Library, and performance analysis tool Intel Trace Analyzer and Collector.

oneAPI supports multiple programming models and programming languages. It enables developers to write OpenMP codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers and as well SYCL programs for GPUs and FPGAs using the DPC++ compiler. Initially, the DPC++ compiler only targeted Intel GPUs using the oneAPI Level Zero low-level programming interface, but now support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has been added. Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms.

Differences and similarities

GPUs in general support different features, even among the same producer. In general newer cards come with extra features and sometimes old features are not supported anymore. It is important when compiling to create binaries targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices, while a binary build for older devices might not run efficiently on newer architectures. In CUDA the compute capability which is targeted is specified by the -arch=sm_XY, where X specifies the major architecture and it is between 1 and 9, and Y the minor. When using HIP on NVIDIA platforms one needs to use compiling option --gpu-architecture=sm_XY, while on AMD platforms --offload-arch=gfxabc ( where abc is the architecture code such as 90a for the MI200 series or 908 for MI100 series). Note that in the case of portable (single source) programs one would specify openmp as well as target for compilation, enabling to run the same code on multicore CPU.

Terminology

Hardware

NVIDIA

AMD

Intel

Streaming processor/streaming core

SIMD lane

Processing element

SIMT unit

SIMD unit

Vector engine (XVE)

Streaming Multiprocessor (SM)

Computing Unit (CU)

Xe-core / Execution unit (EU)

GPU processing clusters (GPC)

Compute Engine

Xe-slice

Please keep in mind, that this table is only a rough approximation. Each GPU architecture is different, and it’s impossible to make a 1-to-1 mapping between terms used by different vendors.

Summary

  • GPUs are designed to execute thousands of threads simultaneously, making them highly parallel processors. In contrast, CPUs excel at executing a smaller number of threads in parallel.

  • GPUs allocate a larger portion of transistors to data processing rather than data caching and flow control. This prioritization of data processing enables GPUs to effectively handle parallel computations and hide memory access latencies through computation.

  • GPU producers provide comprehensive toolkits, libraries, and compilers for developing high-performance applications that leverage the parallel processing power of GPUs. Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel).

  • These platforms offer debugging tools (e.g., cuda-gdb, rocgdb) and performance analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, rocprof, roctracer) to facilitate code optimization and ensure efficient utilization of GPU resources.

Exercises

GPUs and memory

Which statement about the relationship between GPUs and memory is true?

    1. GPUs are not affected by memory access latencies.

    1. GPUs can run out of memory quickly with many cores trying to access the memory simultaneously.

    1. GPUs have an unlimited cache size.

    1. GPUs prefer to run with a minimal number of threads to manage memory effectively.

Keypoints

  • GPUs vs. CPUs, key differences between them

  • GPU software suites, support specific GPU features, programming models, compatibility

  • Applications of GPUs