.. _modern-hpc-architectures:

==========================
Modern HPC architectures
==========================

.. objectives::

   - Understand how modern HPC hardware is built and why.
   - Learn about the memory hierarchy on `Dardel `_ and `LUMI `_.

Quantum chemistry has evolved hand-in-hand with advances in computer hardware:
each new computing architecture that becomes available allows leaps both in the
size of molecular systems that can be studied and in the methods that can be
applied.
All hardware available today, from your phone to any large-scale data center,
is engineered to be parallel. Multiple cores, multiple threads, and multiple
vector lanes are all, to a lesser or greater extent, **built into** the
hardware at our disposal.
The evolution of the hardware has brought a paradigm shift in software
development: it is *mandatory* for developers to explicitly think about
parallelism and how to harness it to improve performance.
Most importantly, applications need to be *scalable* in order to exploit all
the hardware resources available today, but also in the future, as more
parallel resources are packed into chips.

Moore's law [*]_
================

Back in 1965, Gordon Moore observed that the number of transistors in a dense
integrated circuit doubles about every two years. This observation has since
been dubbed **Moore's law**.
Packing more transistors means shrinking each individual element, which in
turn allows higher clock rates. Higher clock rates mean higher instruction
throughput and thus higher performance.
If we look at the historical trends of the past 50 years of microprocessor
development, we indeed notice the doubling in transistor numbers asserted by
Moore's law. It is also quite clear that in the early 2000s the corresponding
increase in clock rates stalled: modern chips have clock rates in the 3 GHz
range. Code will not be more performant on newer architectures without
*significant* programmer intervention.

.. chart:: charts/microprocessor-trend-data.json

   The evolution of microprocessors. The number of transistors per chip
   increases every 2 years or so. However, this can no longer be translated
   into higher core frequencies, due to power consumption limits. Before 2000,
   the increase in single-core clock frequency was the major source of
   performance gains. The mid-2000s mark a transition towards multicore
   processors. [*]_

What happened? Processor designers hit three walls:

- **Power consumption wall**. Power consumption scales roughly as the third
  power of the clock rate, and the heat generated by a denser transistor
  packing cannot be dissipated effectively by air cooling alone. Higher clock
  rates would result in power-inefficient computation.
- **Memory wall**. While clock rates for computing chips have seen a steady
  rise over the past decades, the same has not been the case for memory,
  especially memory residing *off-chip*. Read/write operations on memory that
  is far away, in time and space, from the computing chips are expensive, both
  in terms of time and of required power. Algorithms bound by memory
  performance will not gain from faster computation.
- **Instruction-level parallelism wall**. Automatic parallelization at the
  instruction level, through superscalar instructions, pipelining, prefetching,
  and branch prediction, leads to more complicated chip designs and only
  affords *constant*, not *scalable*, speedup factors.

To compensate, chip designs have instead increased the number of physical
cores on the same die. More cores, more threads, and wider vector instruction
sets all contribute to expose more *hardware parallelism* for applications to
take advantage of.
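How much hardware parallelism a given machine exposes can be inspected
directly from the command line. As a minimal sketch, the ``lscpu`` utility
available on most Linux systems summarizes sockets, cores, hardware threads,
and supported instruction-set extensions; the exact fields and flags reported
will vary from machine to machine.

.. code-block:: shell

   # Summarize the number of sockets, cores per socket, and hardware threads
   # per core on the current machine.
   lscpu | grep -E "^(Model name|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|CPU\(s\))"

   # The "Flags" line lists the supported instruction-set extensions; on
   # x86_64 CPUs, entries such as avx2 or avx512f indicate the available
   # vector widths.
   lscpu | grep "^Flags" | tr ' ' '\n' | grep -E "^(sse|avx)" | sort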
The memory hierarchy
====================

A modern CPU consists of multiple cores on the same chip die. Each core has
its own **arithmetic logic unit** (ALU) and **control unit**.

.. figure:: img/cpu.svg
   :align: center
   :scale: 80%

   A schematic view of a modern multicore CPU. Each purple-shaded box is a
   single core, with its own **arithmetic logic unit** (ALU), **control
   unit**, and **L1 cache** (yellow-shaded box). Groups of cores *might* share
   the **L2 cache** (blue-shaded box), which is larger and slower than L1.
   Groups of cores in the chip share the **L3 cache** (orange-shaded box), in
   turn larger and slower than L2. The CPU has access to off-chip **dynamic
   random access memory** (DRAM), which is usually of the order of hundreds of
   gigabytes. Access to DRAM is much slower than access to the caches, due to
   lower memory clock rates and locality.

As mentioned, clock rates for memory have not seen the same rise as those for
computing chips. This has a direct impact on performance: a read/write
operation can require multiple clock cycles, during which the cores might sit
idle. This delay is called **latency**, and its impact depends on the
*locality* of the memory being accessed. Modern CPUs have a **memory
hierarchy**:

- **Registers** are very small and very fast units of memory that store the
  data about to be processed by the core.
- Multiple levels of cache, where the latency typically increases by an order
  of magnitude with each level:

  - **L1 cache** is per-core memory where both instructions and data to be
    processed are stored for fast retrieval into the registers. Its size is on
    the order of *tens of kilobytes* per core.
  - **L2 cache**. Its size is on the order of *hundreds of kilobytes* per core
    and it might be shared by groups of cores.
  - **L3 cache**. It is shared among cores, either subgroups or all of them,
    and its size is on the order of *tens of megabytes* per *group* of cores.

- **DRAM** is the main off-chip memory. Nowadays, HPC clusters have *hundreds
  of gigabytes* of RAM per node. Its latency is usually two orders of
  magnitude larger than that of the L3 cache.

We can see that the closer the memory is to the core, the faster it can be
accessed. Unfortunately, on-chip memory is rather small. As an example, the
`AMD EPYC 7742 `_ CPUs on Dardel have 64 cores and the following cache
hierarchy:

- L1 instruction cache of 32 KiB per core, for a total of 2 MiB.
- L1 data cache of 32 KiB per core, for a total of 2 MiB.
- L2 cache of 512 KiB per core, for a total of 32 MiB.
- L3 cache of 16 MiB shared among 16 cores, for a total of 256 MiB.

Multiprocessor systems and non-uniform memory access
=====================================================

Multiple multicore CPUs can be combined on the same node, each sitting in its
own **socket**. The CPUs communicate through fast point-to-point channels. In
this architecture, each CPU is attached to its own off-chip memory. As a
result, access to memory is not equal across the CPUs in the node: off-chip
memory accesses become **non-uniform**. The CPU in socket 0 (socket 1)
experiences higher latency and, possibly, reduced bandwidth when accessing
DRAM attached to the CPU in socket 1 (socket 0).
To further complicate matters, *cores* on each socket might also be arranged
in **non-uniform memory access** (NUMA) domains: cores within the same socket
might experience different latency and bandwidth when accessing memory.

.. _numa:

.. figure:: img/numa.svg
   :align: center
   :scale: 80%

   Schematic view of a typical dual-socket node on a modern cluster. The node
   houses two 64-core CPUs, one per socket. The cores are arranged in a
   configuration with 4 NUMA domains per socket (NPS4), each with 16 cores.
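In practice, this means that *where* a process runs and *where* its memory is
allocated both matter for performance. As a minimal sketch (``./my_app`` is a
placeholder for an actual executable), the ``numactl`` tool can pin a process
and its memory allocations to a chosen NUMA domain:

.. code-block:: shell

   # Run the placeholder application ./my_app with its threads restricted to
   # the cores of NUMA domain 0 and its memory allocated within that domain.
   numactl --cpunodebind=0 --membind=0 ./my_app

   # Alternatively, interleave memory allocations across domains 0 and 1,
   # which can help bandwidth-bound applications spanning several domains.
   numactl --interleave=0,1 ./my_app

We will use ``numactl`` again below to inspect the NUMA layout of Dardel's
nodes.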
The Dardel system at PDC
========================

Dardel is the new high-performance cluster at PDC: it has a CPU *partition*,
and a GPU *partition* is planned.

.. image:: https://www.pdc.kth.se/polopoly_fs/1.1053343.1614296818!/image/3D%20marketing%201%20row%20cropped%201000pW%20300ppi.jpg
   :align: center

Anatomy of a supercomputer:

- Dardel consists of several *cabinets* (also known as racks).
- Each cabinet is filled with many *blades*.
- A single blade hosts two *nodes*.
- A node has two AMD EPYC 7742 CPUs, each with 64 cores clocked at 2.25 GHz.

Different types of compute nodes in the CPU partition:

- 488 x 256 GB (SNIC thin nodes)
- 20 x 512 GB (SNIC large nodes)
- 8 x 1024 GB (SNIC huge nodes)
- 2 x 2048 GB (SNIC giant nodes)
- 36 x 256 GB (KTH industry/business research nodes)

The performance of the CPU partition is 2.279 petaFLOPS according to the
Top500 list (Nov 2021).
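These figures can also be queried from the system itself. The sketch below
assumes you are logged in to a cluster running SLURM, such as Dardel;
partition names, node counts, and node names depend on the current
configuration.

.. code-block:: shell

   # List partitions with their node counts, cores per node, and memory per
   # node (in MB).
   sinfo --format="%P %D %c %m"

   # Inspect a single node in detail; replace <nodename> with a name taken
   # from the sinfo output. The report includes sockets, cores per socket,
   # threads per core, and available memory.
   scontrol show node <nodename>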
.. typealong:: Exploring the memory hierarchy on Dardel

   Each of Dardel's nodes is dual-socket: memory latency and bandwidth will
   differ depending on which CPU/core accesses the off-chip memory.
   We will use the `numactl `_ command-line tool to get a description of the
   NUMA domains on the Dardel login and compute nodes. To do so:

   - Log in to Dardel:

     .. code-block:: shell

        ssh <username>@dardel.pdc.kth.se

   - Run the command on the login node:

     .. code-block:: shell

        numactl --hardware

   - Request a short interactive allocation and run the command on a compute
     node:

     .. code-block:: shell

        salloc -N 1 -t 00:05:00 -p main -A edu22.veloxchem --reservation velox-lab1
        srun -n 1 numactl --hardware
        exit

   Note that the login node and the compute node give very different output.
   On the login node, the output looks like the following:

   .. code-block:: text

      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 0 size: 257342 MB
      node 0 free: 70756 MB
      node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
      node 1 size: 258019 MB
      node 1 free: 49565 MB
      node distances:
      node   0   1
        0:  10  32
        1:  32  10

   While on the compute node, the output looks like the following:

   .. code-block:: text

      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 0 size: 31620 MB
      node 0 free: 30673 MB
      node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
      node 1 size: 32249 MB
      node 1 free: 31150 MB
      node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
      node 2 size: 32249 MB
      node 2 free: 30757 MB
      node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 32237 MB
      node 3 free: 31752 MB
      node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
      node 4 size: 32249 MB
      node 4 free: 31783 MB
      node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
      node 5 size: 32249 MB
      node 5 free: 31813 MB
      node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
      node 6 size: 32249 MB
      node 6 free: 31198 MB
      node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
      node 7 size: 32246 MB
      node 7 free: 31375 MB
      node distances:
      node   0   1   2   3   4   5   6   7
        0:  10  12  12  12  32  32  32  32
        1:  12  10  12  12  32  32  32  32
        2:  12  12  10  12  32  32  32  32
        3:  12  12  12  10  32  32  32  32
        4:  32  32  32  32  10  12  12  12
        5:  32  32  32  32  12  10  12  12
        6:  32  32  32  32  12  12  10  12
        7:  32  32  32  32  12  12  12  10

   Since the actual calculations will be run on the compute nodes, we need to
   take a closer look at the ``numactl`` output there.

   #. There are 8 NUMA domains: ``available: 8 nodes (0-7)``
   #. The indices of the hardware threads in domain 0, together with the total
      and free amounts of memory:

      .. code-block:: text

         node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
         node 0 size: 31620 MB
         node 0 free: 30673 MB

   #. The indices of the hardware threads in domain 1, together with the total
      and free amounts of memory:

      .. code-block:: text

         node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
         node 1 size: 32249 MB
         node 1 free: 31150 MB

   #. The distances between domains. These numbers give a measure of the
      latency incurred when accessing memory in one NUMA domain from another:

      .. code-block:: text

         node distances:
         node   0   1   2   3   4   5   6   7
           0:  10  12  12  12  32  32  32  32
           1:  12  10  12  12  32  32  32  32
           2:  12  12  10  12  32  32  32  32
           3:  12  12  12  10  32  32  32  32
           4:  32  32  32  32  10  12  12  12
           5:  32  32  32  32  12  10  12  12
           6:  32  32  32  32  12  12  10  12
           7:  32  32  32  32  12  12  12  10
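Before settling on a particular distribution of MPI ranks and OpenMP threads
over these NUMA domains, it can be useful to check where processes and threads
actually end up. The following is a minimal sketch, assuming a job script or
interactive allocation and an OpenMP runtime that supports the OpenMP 5.0
affinity display (``./my_app`` is again a placeholder executable):

.. code-block:: shell

   # Ask SLURM to report the CPU mask each task is bound to when it starts.
   srun --cpu-bind=verbose -n 8 ./my_app

   # Ask the OpenMP runtime to print the core affinity of every thread.
   export OMP_DISPLAY_AFFINITY=TRUE
   srun -n 8 ./my_app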
.. typealong:: Efficient utilization of resources/hardware on Dardel

   Nowadays more and more scientific software is parallelized with both MPI
   and OpenMP. When running hybrid MPI/OpenMP software on multicore compute
   nodes, there are many possible combinations of MPI processes and OpenMP
   threads. On Dardel this can be controlled in the SLURM job script, where
   the user can specify the number of MPI tasks per node. The OpenMP threads
   can be controlled by environment variables such as ``OMP_NUM_THREADS`` and
   ``OMP_PLACES``.

   Here are some examples of requesting different combinations of MPI tasks
   and OpenMP threads on Dardel. Please note that ``--cpus-per-task`` should
   be set to 2x ``OMP_NUM_THREADS``, because simultaneous multithreading (SMT)
   is turned on.

   - 128 MPI x 1 OMP

     .. code-block:: shell

        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=128

        export OMP_NUM_THREADS=1
        export OMP_PLACES=cores

   - 64 MPI x 2 OMP

     .. code-block:: shell

        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=64
        #SBATCH --cpus-per-task=4   # 2x2 because of SMT

        export OMP_NUM_THREADS=2
        export OMP_PLACES=cores

   - 8 MPI x 16 OMP

     .. code-block:: shell

        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=8

        export OMP_NUM_THREADS=16
        export OMP_PLACES=cores

   - 2 MPI x 64 OMP

     .. code-block:: shell

        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=2
        #SBATCH --cpus-per-task=128   # 64x2 because of SMT

        export OMP_NUM_THREADS=64
        export OMP_PLACES=cores

   - 1 MPI x 128 OMP

     .. code-block:: shell

        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1

        export OMP_NUM_THREADS=128
        export OMP_PLACES=cores

.. keypoints::

   - Significantly higher clock rates are no longer attainable:
     higher-performing hardware instead packs more computational cores on the
     same die.
   - Multicore machines give us access to more parallelism, but this needs to
     be harnessed with careful software design.
   - Understanding the existing memory hierarchy is essential for efficient
     use of the hardware.

.. [*] This section is adapted, with permission, from the training material
       for the `ENCCS CUDA workshop `_.
.. [*] The data in this plot was collected by Karl Rupp and made available
       `on GitHub `_.