.. _performance-theory:


Performance theory
==================

.. objectives::

   - Understand the different meanings of performance.
   - Become familiar with theoretical models of performance analysis.


Quantum chemical programs must be written in an explicitly parallel fashion in
order to harness the computational power offered by the increasing number of
cores that can be packed on modern chips. The effort expended in parallelizing
a program, or in improving its parallelization, needs to be measured against
suitable metrics to be justified.

The term "performance" has many meanings in this context. A code has improved
performance compared to another if it:

- Shows lower **latency**, *i.e.* it computes a single result in less time.
- Shows higher **throughput**, *i.e.* it computes a batch of results at a
  higher rate.
- Consumes less power.

In quantum chemistry we are mostly interested in reducing the **total time to
solution**, *i.e.* the latency of a computation. It is thus quite natural to
evaluate parallel performance in terms of the *speedup* and *efficiency*
metrics:

Speedup
   The ratio of serial and parallel times to solution.

   .. math::

      \mathrm{speedup} = \frac{T_{1}}{T_{P}}

Efficiency
   The speedup per parallel worker.

   .. math::

      \mathrm{efficiency} = \frac{\mathrm{speedup}}{P} = \frac{T_{1}}{PT_{P}}

The serial time to solution :math:`T_{1}` might be that of the *serialized*
parallel program, that is, the parallel algorithm running with :math:`P=1`
workers. Speedup and efficiency are then said to be *relative*. The metrics
are instead said to be *absolute* if :math:`T_{1}` is measured with the best
available serial algorithm, which may well be one that does not parallelize.

Observed speedups are usually *sublinear*: running on :math:`P` workers is a
factor of :math:`< P` faster than running on one. Ideally, one would want to
observe *linear* speedups, but this is very hard to achieve in practice, due,
for example, to the overhead incurred in distributing computation among
workers. In rare cases, *superlinear* speedup might be observed, usually due
to favorable combinations of cache-friendliness and efficient use of
vectorization.


Scalability
-----------

*Scalability* is the ability of hardware and software to deliver increased
performance as a function of the resources used. There are two metrics to
consider:

Strong scalability
   The ability of an algorithm to obtain speedup for a **fixed problem size**
   and an **increasing number of workers**.

Weak scalability
   The ability of an algorithm to obtain speedup for a **problem size that
   increases proportionally** with an **increasing number of workers**.

Let's take a closer look at the measured latencies :math:`T_{1}` and
:math:`T_{P}`. Both are the sum of the work executed serially and the work
executed in parallel. For :math:`T_{1}` it holds:

.. math::

   T_{1} = W_{\mathrm{ser}} + W_{\mathrm{par}}

while for :math:`T_{P}` we have:

.. math::

   T_{P} \geq W_{\mathrm{ser}} + \frac{W_{\mathrm{par}}}{P}

with equality only for a "perfect" parallelization.


.. _amdahl:

Amdahl's law
~~~~~~~~~~~~

Let's assume that there is a *constant* fraction :math:`f` of :math:`T_{1}`
that is serial. Then:

.. math::

   \begin{align}
   W_{\mathrm{ser}} &= fT_{1} \\
   W_{\mathrm{par}} &= (1-f)T_{1}
   \end{align}

This gives an upper bound on the achievable speedup (and hence efficiency) as
a function of the serial fraction:

.. math::

   \mathrm{speedup} \leq \frac{1}{f + \frac{1-f}{P}}

also known as **Amdahl's law** :cite:`Amdahl1967-cv`.
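Before looking at the charts below, it can help to evaluate Amdahl's bound
numerically. The following is a minimal Python sketch, not tied to any
particular library; the helper names ``amdahl_speedup`` and
``amdahl_efficiency`` are chosen here purely for illustration.

.. code-block:: python

   def amdahl_speedup(f, p):
       """Upper bound on the speedup for serial fraction f and p workers."""
       return 1.0 / (f + (1.0 - f) / p)


   def amdahl_efficiency(f, p):
       """Upper bound on the efficiency, i.e. speedup per worker."""
       return amdahl_speedup(f, p) / p


   for f in (0.01, 0.05, 0.25):
       for p in (1, 4, 16, 64):
           print(
               f"f = {f:.2f}, P = {p:3d}: "
               f"speedup <= {amdahl_speedup(f, p):6.2f}, "
               f"efficiency <= {amdahl_efficiency(f, p):5.2f}"
           )

For :math:`f = 0.05` and :math:`P = 64`, for example, the speedup cannot
exceed roughly 15, however well the parallel part is implemented.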
.. chart:: charts/amdahl-speedup.json

   Strong scaling, in terms of achievable **speedup**, as a function of the
   number of workers, according to Amdahl's law. We consider different values
   of the serial fraction :math:`f`.

.. chart:: charts/amdahl-efficiency.json

   Strong scaling, in terms of achievable **efficiency**, as a function of the
   number of workers, according to Amdahl's law. We consider different values
   of the serial fraction :math:`f`.

Quite pessimistically, in the limit of an infinite number of workers, strong
scalability is essentially bound by the serial fraction:

.. math::

   \mathrm{speedup}_{\infty} \leq \frac{1}{f}


.. _gustafson:

Gustafson's law
~~~~~~~~~~~~~~~

Twenty years after Amdahl published his paper on strong scalability, Gustafson
noted an important fact: more capable hardware is usually applied to larger
problems, leading to observed speedups exceeding what would be predicted by
Amdahl's law :cite:`Gustafson1988-ch`. How is this possible? We must make two
assumptions:

#. That the serial part of our algorithm remains constant when the problem
   size increases.
#. That the parallel part grows *linearly* with the number of workers.

.. figure:: img/amdahl-vs-gustafson.svg
   :align: center
   :scale: 80%

   In a strong scalability regime, the amount of work remains constant, while
   the capability of the machine to perform parallel work increases.
   Asymptotically, the speedup is dominated by the fraction of work executed
   serially, as observed by Amdahl. However, increasing the amount of work
   proportionally to the parallel capabilities will eventually render the
   serial portion insignificant for the total time to solution.

When exploring weak scalability, the **scaled speedup** is our figure of
merit:

.. math::

   \text{scaled speedup} = P + (1-P)f

This metric grows *linearly* with the number of workers.

.. keypoints::

   - Time to solution is the performance metric most often used in quantum
     chemistry.
   - Strong scalability of an application depends on the fraction of serial
     work performed.
   - In practice, more capable hardware is employed on larger problems: we
     might be able to observe weak scalability even without removing serial
     bottlenecks.
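As a closing illustration, the short Python sketch below contrasts Amdahl's
fixed-size bound with Gustafson's scaled speedup for the same serial fraction;
the function names and the value :math:`f = 0.05` are chosen here only for the
sake of the example.

.. code-block:: python

   def amdahl_speedup(f, p):
       """Fixed problem size: speedup bounded by 1 / (f + (1 - f) / p)."""
       return 1.0 / (f + (1.0 - f) / p)


   def gustafson_scaled_speedup(f, p):
       """Problem size grows with p: scaled speedup is p + (1 - p) * f."""
       return p + (1 - p) * f


   f = 0.05  # serial fraction, arbitrary choice for illustration
   for p in (1, 8, 64, 512):
       print(
           f"P = {p:3d}: Amdahl <= {amdahl_speedup(f, p):6.2f}, "
           f"scaled speedup = {gustafson_scaled_speedup(f, p):7.2f}"
       )

While the Amdahl bound saturates below :math:`1/f = 20`, the scaled speedup
keeps growing roughly linearly with the number of workers.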