Modern HPC architectures

Objectives

  • Understand how modern HPC hardware is built and why.

  • Learn about the memory hierarchy on Dardel and LUMI.

Quantum chemistry has evolved hand-in-hand with advances in computer hardware: each new computing architecture that becomes available allows leaps both in the size of molecular systems that can be studied and the methods that can be applied.

All hardware available today, from your phone to any large-scale data center, is engineered to be parallel. Multiple cores, multiple threads, and multiple vector lanes are all, to a lesser or greater extent, built into the hardware at our disposal. This evolution of the hardware has brought a paradigm shift in software development: developers must explicitly think about parallelism and how to harness it to improve performance. Most importantly, applications need to be scalable in order to exploit all hardware resources available today, but also in the future, as more parallel resources are packed into chips.

Moore’s law *

Back in 1965, Gordon Moore observed that the number of transistors in a dense integrated circuit doubles about every two years. This observation has since been dubbed Moore's law. Packing more transistors into the same area means each element is smaller, and smaller elements can switch faster, so higher clock rates can be achieved. Higher clock rates mean higher instruction throughput and thus higher performance.

If we look at the historical trends over the past 50 years of microprocessor development, we indeed notice the doubling in transistor counts asserted by Moore's law. It is also quite clear that in the early 2000s the corresponding increase in clock rates stalled: modern chips have clock rates in the 3 GHz range. As a consequence, code will not become more performant on newer architectures without significant programmer intervention.

The evolution of microprocessors. The number of transistors per chip doubles roughly every two years. However, this can no longer be translated into higher core frequencies because of power consumption limits. Before 2000, the increase in single-core clock frequency was the major source of performance gains; the mid-2000s mark the transition towards multi-core processors.

What happened? Processor designers hit three walls:

  • Power consumption wall. Power draw scales roughly as the third power of the clock rate, so doubling the frequency would multiply the power consumed by about eight. The heat generated by denser transistor packing cannot be dissipated effectively by air cooling alone, and higher clock rates would result in power-inefficient computation.

  • Memory wall. While clock rates for computing chips have risen steadily over the past decades, the same has not been the case for memory, especially memory residing off-chip. Read/write operations on memory that is far away, in time and space, from the computing chips are expensive, both in terms of time and power. Algorithms bound by memory performance will not gain from faster computation.

  • Instruction-level parallelism wall. Automatic parallelization at the instruction level, through superscalar execution, pipelining, prefetching, and branch prediction, leads to more complicated chip designs and only affords constant, not scalable, speedup factors.

To compensate, chip designers have increased the number of physical cores on the same die to keep improving performance. More cores, more threads, and wider vector instruction sets all expose more hardware parallelism for applications to take advantage of, as sketched in the example below.
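As a minimal sketch of these two levels of on-chip parallelism, the C example below uses OpenMP to spread a vector update over the available cores and to ask the compiler to vectorize the loop body. The array size and initial values are arbitrary choices for illustration; compile with OpenMP enabled (e.g. `cc -fopenmp`).

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* The classic axpy kernel y = a*x + y: each core works on its own chunk of
 * the arrays, and within a chunk the compiler may use vector instructions. */
void daxpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

int main(void)
{
    const int n = 1 << 20;  /* ~1 million elements; size is arbitrary */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);

    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    daxpy(n, 3.0, x, y);

    printf("y[0] = %f, computed with up to %d threads\n",
           y[0], omp_get_max_threads());
    free(x);
    free(y);
    return 0;
}
```

The same source code can then exploit however many cores and however wide vector units the hardware provides, which is exactly the kind of scalability discussed above.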

The memory hierarchy

A modern CPU consists of multiple cores on the same chip die. Each core has its own arithmetic logic unit (ALU) and control unit.

_images/cpu.svg

A schematic view of a modern multicore CPU. Each purple-shaded box is a single core, with its own arithmetic logic unit (ALU), control unit, and L1 cache (yellow-shaded box). Groups of cores might share the L2 cache (blue-shaded box), which is larger and slower than L1. Groups of cores in the chip share the L3 cache (orange-shaded box), in turn larger and slower than L2. The CPU has access to off-chip dynamic random access memory (DRAM), which is usually on the order of hundreds of gigabytes. Access to DRAM is much slower than access to caches due to lower memory clock rates and poorer locality.

As mentioned, clock rates for memory have not seen the same rise as for computing chips. This has a direct impact on performance: a read/write operation can require multiple clock cycles, during which the cores might sit idle. This delay is called latency, and its impact depends on the locality of the memory being accessed. Modern CPUs have a memory hierarchy:

  • Registers are very small and very fast units of memory that store the data about to be processed by the core.

  • Caches come in multiple levels; with each level, the latency typically increases by an order of magnitude:

    • L1 cache is per-core memory where both instructions and data to be processed are stored for fast retrieval into the registers. Its size is on the order of tens of kilobytes per core.

    • L2 cache. Its size is on the order of hundreds of kilobytes per core and it might be shared by groups of cores.

    • L3 cache. It is shared among cores, either in subgroups or all of them, and its size is on the order of tens of megabytes per group of cores.

  • DRAM. This is the main off-chip memory. Nowadays, HPC clusters have hundreds of gigabytes of RAM per node. Its latency is usually two orders of magnitude larger than that of the L3 cache.

We can see that the closer the memory is to the core, the faster it can be accessed. Unfortunately, on-chip memory is rather small.
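To get a feel for how much this hierarchy matters, here is a small, hedged C benchmark (the matrix size and timing method are arbitrary choices) that sums the same matrix twice: first row by row, visiting consecutive addresses served from cache, then column by column, with a stride that defeats the caches. On most machines the second traversal is several times slower, even though it performs exactly the same arithmetic.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096  /* the ~134 MiB matrix is far larger than the caches */

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    double *m = malloc((size_t)N * N * sizeof *m);
    for (size_t i = 0; i < (size_t)N * N; ++i) m[i] = 1.0;

    struct timespec t0, t1, t2;
    double sum_rows = 0.0, sum_cols = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; ++i)          /* row-wise: consecutive addresses */
        for (int j = 0; j < N; ++j)
            sum_rows += m[(size_t)i * N + j];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int j = 0; j < N; ++j)          /* column-wise: stride of N doubles */
        for (int i = 0; i < N; ++i)
            sum_cols += m[(size_t)i * N + j];

    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("row-wise:    %.3f s (sum = %.0f)\n", elapsed(t0, t1), sum_rows);
    printf("column-wise: %.3f s (sum = %.0f)\n", elapsed(t1, t2), sum_cols);
    free(m);
    return 0;
}
```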

As an example, the AMD EPYC 7742 CPUs on Dardel have 64 cores each and the following cache hierarchy (these sizes can also be queried at run time, as sketched after the list):

  • L1 instruction cache of 32 KiB per core, for a total of 2 MiB.

  • L1 data cache of 32 KiB per core, for a total of 2 MiB.

  • L2 cache of 512 KiB per core, for a total of 32 MiB.

  • L3 cache of 16 MiB shared among groups of 4 cores, for a total of 256 MiB.
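Assuming a Linux system with glibc (as on Dardel's nodes), the cache sizes can be queried at run time with the `sysconf` calls below. Note that the `_SC_LEVEL*_CACHE_SIZE` names are a GNU extension rather than standard POSIX and may report 0 on some platforms.

```c
#include <stdio.h>
#include <unistd.h>

/* Query cache sizes through glibc's sysconf extensions. */
int main(void)
{
    printf("L1d cache: %ld KiB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L1i cache: %ld KiB\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
    printf("L2 cache:  %ld KiB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
    printf("L3 cache:  %ld KiB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
    return 0;
}
```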

Multiprocessor systems and non-uniform memory access

Multiple multicore CPUs can be packaged together on the same node, each in its own socket. The CPUs communicate through fast point-to-point channels. In this architecture, each CPU is attached to its own off-chip memory. As a result, access to memory is not equal across the sockets.

Off-chip memory accesses thus become non-uniform: the CPU on socket 0 (socket 1) experiences higher latency and, possibly, reduced bandwidth when accessing DRAM attached to the CPU on socket 1 (socket 0). To further complicate matters, cores within each socket might also be arranged in non-uniform memory access (NUMA) domains, so that even cores on the same socket can experience different latency and bandwidth when accessing memory.

_images/numa.svg

Schematic view of a typical dual-socket node on a modern cluster. The node houses two CPUs, one per socket, each with 64 cores. The cores are arranged in a configuration with 4 NUMA domains per socket (NPS4), each containing 16 cores.
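For application code, the main practical consequence of NUMA is the first-touch policy used by Linux: a memory page is typically placed in the NUMA domain of the core that first writes to it. The hedged C/OpenMP sketch below therefore initializes an array with the same loop structure and schedule that later processes it, so each thread places the pages it will use in its own domain; thread pinning (e.g. via `OMP_PROC_BIND` and `OMP_PLACES`) is left to the environment.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const size_t n = 1 << 26;  /* 512 MiB of doubles; size is arbitrary */
    double *a = malloc(n * sizeof *a);

    /* First touch: each thread initializes the part of the array it will
     * later use, so the pages end up in that thread's NUMA domain. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = 0.0;

    /* Same static schedule: each thread mostly accesses local memory. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i] + 1.0;

    printf("sum = %.0f with up to %d threads\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}
```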

The Dardel system at PDC

Dardel is the new high-performance computing cluster at PDC: it currently has a CPU partition, while a GPU partition is planned.

https://www.pdc.kth.se/polopoly_fs/1.1053343.1614296818!/image/3D%20marketing%201%20row%20cropped%201000pW%20300ppi.jpg

Anatomy of a supercomputer:

  • Dardel consists of several cabinets (also known as racks)

  • Each cabinet is filled with many blades

  • A single blade hosts two nodes

  • A node has two AMD EPYC 7742 CPUs, each with 64 cores clocked at 2.25 GHz

Different types of compute nodes in the CPU partition:

  • 488 x 256 GB (SNIC thin nodes)

  • 20 x 512 GB (SNIC large nodes)

  • 8 x 1024 GB (SNIC huge nodes)

  • 2 x 2048 GB (SNIC giant nodes)

  • 36 x 256 GB (KTH industry/business research nodes)

The performance of the CPU partition is 2.279 petaFLOPS according to the Top500 list (Nov 2021).
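As a rough consistency check (and assuming each Zen 2 core can execute 16 double-precision floating-point operations per cycle through its two 256-bit FMA units), the theoretical peak of the 554 CPU nodes listed above is about 554 × 128 cores × 2.25 GHz × 16 FLOP/cycle ≈ 2.55 petaFLOPS, which would put the measured HPL result at roughly 90% of peak.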

Keypoints

  • It is no longer possible to achieve higher clock rates: higher-performing hardware instead packs multiple computational cores on the same die.

  • Multicore machines give us access to more parallelism, but this needs to be harnessed with careful software design.

  • Understanding the existing memory hierarchy is essential for efficient use of the hardware.

*

This section is adapted, with permission, from the training material for the ENCCS CUDA workshop.

The data in this plot is collected by Karl Rupp and made available on GitHub.