Performant PME simulations
Questions
What considerations are important when using PME
Objectives
Know how to assign the PME workload
Background on the PME algorithm
Most systems of interest in biomolecular MD have inhomogeneous distributions of partially charged particles. It turns out that simply neglecting interactions beyond a cut-off is not accurate enough. Even extremely large cut-offs only reduce the size of the truncation artefacts; they do not eliminate them. Instead, most turn to some form of the Ewald method, where the shape of the short-range interaction is modified, and a compensating correction is made by doing extra work in a concurrent long-range component.
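The splitting behind the Ewald method can be illustrated numerically: the 1/r Coulomb interaction is decomposed into a rapidly decaying short-range term (safe to truncate at a cut-off) and a smooth long-range term (handled in reciprocal space in a real PME implementation). A minimal sketch, where the splitting parameter beta is an arbitrary illustration value, not a GROMACS default:

```python
import math

def coulomb(r):
    return 1.0 / r

def short_range(r, beta):
    # Rapidly decaying part: negligible beyond a modest cut-off
    return math.erfc(beta * r) / r

def long_range(r, beta):
    # Smooth part: computed via 3D-FFTs in a real PME implementation
    return math.erf(beta * r) / r

beta = 3.0  # splitting parameter, arbitrary value for illustration
for r in (0.5, 1.0, 1.5):
    # The two parts sum exactly back to the full Coulomb interaction
    assert abs(short_range(r, beta) + long_range(r, beta) - coulomb(r)) < 1e-12

# At r = 1.5 the short-range part is already vanishingly small,
# which is why truncating it there introduces no meaningful error
assert short_range(1.5, beta) < 1e-8
```

The larger beta is, the faster the short-range part decays (permitting a shorter cut-off), but the finer the reciprocal-space grid must be; this trade-off is exactly what PME tuning exploits.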
That concurrent long-range force work can be computed on either the CPU or the GPU. When run on an NVIDIA GPU, the cuFFT library is used to implement the 3D-FFT part of the long-range component. When run on the CPU, a similar library is used, normally FFTW or Intel's MKL.
PME tuning
One of the most useful attributes of the PME algorithm is that the share of the computational work between the two components can be varied. Scaling the short-ranged cutoff and the 3D-FFT grid spacing by the same factor gives a model that is just as accurate an approximation, while reducing the workload of one component and increasing the workload of the other. So the user input defines the expected quality of the electrostatic approximation, and the actual implementation can choose an equivalent scheme that minimizes the total execution time.
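That rescaling can be sketched as follows. This is a toy illustration, not GROMACS code: real mdrun tuning picks from a discrete set of candidate grids, and the nominal values below are hypothetical.

```python
def rescale_pme(cutoff, grid_spacing, scale):
    """Scale the cut-off and grid spacing by the same factor.

    A larger scale means more short-range (pair interaction) work and
    less long-range (3D-FFT) work, and vice versa, at roughly the same
    accuracy of the electrostatic approximation.
    """
    return cutoff * scale, grid_spacing * scale

# Nominal user-requested settings (nm), hypothetical values
cutoff, spacing = 1.0, 0.125

# Shift work towards the short-range component, e.g. onto a fast GPU
new_cutoff, new_spacing = rescale_pme(cutoff, spacing, 1.2)
assert abs(new_cutoff - 1.2) < 1e-12
# Coarser grid -> smaller 3D-FFT -> less long-range work
assert abs(new_spacing - 0.15) < 1e-12

# The cutoff-to-spacing ratio that governs accuracy is unchanged
assert abs(new_cutoff / new_spacing - cutoff / spacing) < 1e-9
```

This is why tuning can trade work between the CPU and GPU without the user needing to change the requested accuracy.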
The PME tuning is on by default whenever it is likely to be useful, can be forced on with gmx mdrun -tunepme, and forced off with gmx mdrun -notunepme. In practice, mdrun does such tuning in the first few thousand steps, and then uses the result of the optimization for the remaining time.
3.1 Quiz: mdrun also has to compute the van der Waals interactions between particles. Should the cutoff for those be changed to match the tuned electrostatic cutoff?
Yes, keep it simple
Yes, van der Waals interactions are not important
No, they’re so cheap it doesn’t matter
No, the van der Waals radii are critical for force-field accuracy
Solution
Changing the van der Waals cutoff unbalances the force field, because the parameters for different interactions are optimized in context with each other. Even making it longer can upset the balance. So during PME tuning, mdrun must preserve the van der Waals cutoff. This means that tuning to electrostatic cutoffs shorter than the van der Waals cutoff is less effective, because the pair lists must always be large enough for the van der Waals interactions. In practice, that possibility was rarely interesting for performance anyway.
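The consequence for the pair lists can be sketched numerically. This is a toy model, not GROMACS code; the cubic scaling simply reflects the volume of the cutoff sphere, and the 1.0 nm van der Waals cutoff is a hypothetical example value:

```python
def pair_list_cutoff(elec_cutoff, vdw_cutoff):
    # The pair list must hold every pair within the larger of the two cutoffs
    return max(elec_cutoff, vdw_cutoff)

def relative_pair_cost(cutoff, reference=1.0):
    # Neighbour count scales with the volume of the cutoff sphere
    return (cutoff / reference) ** 3

# With a fixed 1.0 nm van der Waals cutoff, shrinking the electrostatic
# cutoff to 0.9 nm does not shrink the pair list at all ...
assert pair_list_cutoff(0.9, 1.0) == 1.0

# ... while growing the electrostatic cutoff to 1.2 nm makes the pair
# work roughly 73% more expensive (1.2**3 = 1.728)
assert abs(relative_pair_cost(pair_list_cutoff(1.2, 1.0)) - 1.728) < 1e-9
```

So tuning can only usefully trade in one direction: lengthening the electrostatic cutoff to shrink the FFT grid, never shortening it below the van der Waals cutoff.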
MD workflows using PME
3.2 Quiz: When would it be most likely to benefit from moving PME interactions to the GPU?
Few bonded interactions and relatively weak CPU
Few bonded interactions and relatively strong CPU
Many bonded interactions and relatively weak CPU
Many bonded interactions and relatively strong CPU
Solution
Running two tasks on the GPU adds overhead there, which offsets any benefit from the work running faster on the GPU. If the CPU is powerful enough to finish all of its work before the GPU finishes the short-ranged work, then leaving the PME work on the CPU is best.
The PME task can be moved to the same GPU as the short-ranged task. This comes with the same kinds of challenges as moving the bonded task to the GPU.
It turns out that the latter part of the PME task (the 3D-FFTs and the reciprocal-space solve) is harder to make run fast on a GPU than the first part (spreading the charges onto the grid and gathering the forces), particularly when a short-ranged task is also running on the same GPU. GROMACS permits that second part to be run on the CPU instead.
Explore performance with PME
Make a new folder for this exercise, e.g. mkdir using-pme; cd using-pme.
Download the run input file prepared to do 20000 steps of a PME simulation. We’ll use it to experiment with task assignment.
Download the job submission script, in which you will see several lines marked **FIXME**. Remove each **FIXME**, replacing it with options that achieve the goal stated in the comment before that line. You will need to refer to the information above to do so. Save the file and exit.
Submit the script to the SLURM job manager with sbatch script.sh. If it succeeds, it will reply with something like Submitted batch job 4565494. The job manager will write the terminal output to a file named like slurm-4565494.out. It may take a few minutes to start and a few more minutes to run.
While it is running, you can use tail -f slurm*out to watch the output. When it says “Done”, the runs are finished. Use Ctrl-C to exit the tail command that you ran.
Once the first trajectory completes, exit tail and use less default.log to inspect the output. Find the “Mapping of GPU IDs…” section. Does what you read there agree with what you just learned? Then, find where the PME tuning took place. Hint: search for “pme grid”. What minimum value do you expect based on the van der Waals cutoff? What does the tuned value tell you about the performance of the tasks on the GPU on this machine?
The *.log files contain the performance (in ns/day) of each run on the last line. Use tail *log to see the last chunk of each log file. Have a look through the log files and see what you can learn. What differs from the log files from previous exercises?
Solution
You can download a working version of the batch submission script. Its diff from the original is:
--- /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/exercises/using-pme/script.sh
+++ /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/answers/using-pme/script.sh
@@ -19,13 +19,13 @@
srun gmx mdrun $options -g default.log
# Run mdrun assigning only the non-bonded interactions to the
# GPU and PME to the CPU
-srun gmx mdrun $options -g manual-nb.log **FIXME**
+srun gmx mdrun $options -g manual-nb.log -nb gpu -pme cpu
# Run mdrun assigning the non-bonded interactions and all of
# the PME task to the GPU
-srun gmx mdrun $options -g manual-nb-pmeall.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pmeall.log -nb gpu -pme gpu
# Run mdrun assigning the non-bonded interactions and just
# the first part of the PME task to the GPU
-srun gmx mdrun $options -g manual-nb-pmefirst.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pmefirst.log -nb gpu -pme gpu -pmefft cpu
# Let us know we're done
echo Done
Sample output it produced is available:
The tails of those log files are
==> default.log <==
-----------------------------------------------------------------------------
Total 19.079 913.607 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 381.558 19.079 1999.9
(ns/day) (hour/ns)
Performance: 90.582 0.265
Finished mdrun on rank 0 Thu Sep 9 10:25:42 2021
==> manual-nb.log <==
PME 3D-FFT 1 20 20002 4.644 222.376 11.8
PME solve Elec 1 20 10001 0.546 26.143 1.4
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 790.275 39.514 2000.0
(ns/day) (hour/ns)
Performance: 43.735 0.549
Finished mdrun on rank 0 Thu Sep 9 10:27:23 2021
==> manual-nb-pmeall.log <==
-----------------------------------------------------------------------------
Total 18.967 908.261 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 379.323 18.967 1999.9
(ns/day) (hour/ns)
Performance: 91.115 0.263
Finished mdrun on rank 0 Thu Sep 9 10:28:18 2021
==> manual-nb-pmefirst.log <==
-----------------------------------------------------------------------------
Total 25.932 1241.797 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 518.627 25.932 1999.9
(ns/day) (hour/ns)
Performance: 66.642 0.360
Finished mdrun on rank 0 Thu Sep 9 10:29:29 2021
Depending on the underlying variability of the performance of this trajectory on this hardware, we might be able to observe which configuration corresponds to the default, and whether offloading all or part of the PME work is advantageous. Run the scripts a few times to get a crude impression of that variability!
Running update and constraints on the GPU
Recall that earlier we said that the dominant operations are arithmetic and data movement. We can eliminate much of the data movement by moving most of the computation to the GPU, including the reduction, update, and constraints phases.
Note that not all combinations of algorithms are supported, but where they are, the benefit of also running the update on the GPU is substantial.
Explore GPU updates
Using the same folder and topol.tpr file from the above exercise, download the job submission script, in which you will again see FIXME comments. Replace them to make it run NB, PME, and the update on the GPU, and optionally also the bonded work. Save and exit.
Run the script and observe the performance as before. Is that better or worse than earlier?
Solution
You can download a working version of the batch submission script. Its diff from the original is:
--- /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/exercises/using-pme/all-on-gpu.sh
+++ /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/answers/using-pme/all-on-gpu.sh
@@ -16,10 +16,10 @@
options="-nsteps 20000 -resetstep 19000 -ntomp $SLURM_CPUS_PER_TASK -pin on -pinstride 1"
# Run mdrun assigning the non-bonded, PME, and update work to the GPU
-srun gmx mdrun $options -g manual-nb-pme-update.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pme-update.log -nb gpu -pme gpu -update gpu
# Run mdrun assigning the non-bonded, PME, bonded, and update
# work to the GPU
-srun gmx mdrun $options -g manual-nb-pme-bonded-update.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pme-bonded-update.log -nb gpu -pme gpu -bonded gpu -update gpu
# Let us know we're done
echo Done
Sample output it produced is available:
The tails of those log files are
==> manual-nb-pme-bonded-update.log <==
-----------------------------------------------------------------------------
Total 13.602 651.337 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 271.997 13.602 1999.7
(ns/day) (hour/ns)
Performance: 127.056 0.189
Finished mdrun on rank 0 Thu Sep 9 10:45:31 2021
==> manual-nb-pme-update.log <==
-----------------------------------------------------------------------------
Total 12.897 617.605 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 257.917 12.897 1999.8
(ns/day) (hour/ns)
Performance: 133.995 0.179
Finished mdrun on rank 0 Thu Sep 9 10:44:53 2021
Depending on the underlying variability of the performance of this trajectory on this hardware, we might be able to observe whether also running the update on the GPU is advantageous. You should observe that it is a clear improvement here. Run the scripts a few times to get a crude impression of that variability!
Keypoints
The PME workload can be run on a GPU in a few different ways
The relative strength of CPU and GPU and the simulation system determines the most efficient way to assign the tasks. The default is not always best.
When supported, moving the whole MD workload to the GPU provides good improvements.