Performant PME simulations
Questions
What considerations are important when using PME
Objectives
Know how to assign the PME workload
Background on the PME algorithm
Most systems of interest in biomolecular MD have inhomogeneous distributions of partially charged particles. It turns out that simply neglecting interactions beyond a cut-off is not accurate enough. Even extremely large cut-offs only reduce the size of the truncation artefacts; they do not eliminate them. Instead, most turn to some form of the Ewald method, where the shape of the short-range interaction is modified, and a compensating correction is made by doing extra work in a concurrent long-range component.
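The splitting behind the Ewald method can be illustrated numerically: the 1/r Coulomb interaction is decomposed into a rapidly decaying short-range term (safe to truncate at a cut-off) and a smooth long-range term (handled in reciprocal space in a real PME implementation). A minimal sketch, where the splitting parameter beta is an arbitrary illustration value, not a GROMACS default:

```python
import math

def coulomb(r):
    return 1.0 / r

def short_range(r, beta):
    # Rapidly decaying part: negligible beyond a modest cut-off
    return math.erfc(beta * r) / r

def long_range(r, beta):
    # Smooth part: computed via 3D-FFTs in a real PME implementation
    return math.erf(beta * r) / r

beta = 3.0  # splitting parameter, arbitrary value for illustration
for r in (0.5, 1.0, 1.5):
    # The two parts sum exactly back to the full Coulomb interaction
    assert abs(short_range(r, beta) + long_range(r, beta) - coulomb(r)) < 1e-12

# At r = 1.5 the short-range part is already vanishingly small,
# which is why truncating it there introduces no meaningful error
assert short_range(1.5, beta) < 1e-8
```

The larger beta is, the faster the short-range part decays (permitting a shorter cut-off), but the finer the reciprocal-space grid must be; this trade-off is exactly what PME tuning exploits.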
That concurrent long-range force work can be computed on either the CPU or the GPU. When run on an NVIDIA GPU, the cuFFT library is used to implement the 3D-FFT part of the long-range component. When run on the CPU, a similar library is used, normally FFTW or Intel's MKL.
PME tuning
One of the most useful attributes of the PME algorithm is that the share of the computational work between the two components can be varied. Scaling the short-ranged cutoff and the 3D-FFT grid spacing by the same factor gives a model that is just as accurate an approximation, while reducing the workload of one component and increasing the workload of the other. So the user input defines the expected quality of the electrostatic approximation, and the actual implementation can choose an equivalent scheme that minimizes the total execution time.
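That rescaling can be sketched as follows. This is a toy illustration, not GROMACS code: real mdrun tuning picks from a discrete set of candidate grids, and the nominal values below are hypothetical.

```python
def rescale_pme(cutoff, grid_spacing, scale):
    """Scale the cut-off and grid spacing by the same factor.

    A larger scale means more short-range (pair interaction) work and
    less long-range (3D-FFT) work, and vice versa, at roughly the same
    accuracy of the electrostatic approximation.
    """
    return cutoff * scale, grid_spacing * scale

# Nominal user-requested settings (nm), hypothetical values
cutoff, spacing = 1.0, 0.125

# Shift work towards the short-range component, e.g. onto a fast GPU
new_cutoff, new_spacing = rescale_pme(cutoff, spacing, 1.2)
assert abs(new_cutoff - 1.2) < 1e-12
# Coarser grid -> smaller 3D-FFT -> less long-range work
assert abs(new_spacing - 0.15) < 1e-12

# The cutoff-to-spacing ratio that governs accuracy is unchanged
assert abs(new_cutoff / new_spacing - cutoff / spacing) < 1e-9
```

This is why tuning can trade work between the CPU and GPU without the user needing to change the requested accuracy.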
The PME tuning is on by default whenever it is likely to be useful, can be forced on with gmx mdrun -tunepme, and forced off with gmx mdrun -notunepme. In practice, mdrun does such tuning in the first few thousand steps, and then uses the result of the optimization for the remaining time.
3.1 Quiz: mdrun also has to compute the van der Waals interactions between particles. Should the cutoff for those be changed to match the tuned electrostatic cutoff?
Yes, keep it simple
Yes, van der Waals interactions are not important
No, they’re so cheap it doesn’t matter
No, the van der Waals radii are critical for force-field accuracy
Solution
Changing the van der Waals cutoff unbalances the force field, because the parameters for different interactions are optimized in context with each other. Even making it longer can upset the balance. So during PME tuning, mdrun must preserve the van der Waals cutoff. This means that tuning to electrostatic cutoffs shorter than the van der Waals cutoff is less effective, because the pair lists must always be large enough for the van der Waals interactions. In practice, that possibility was rarely interesting for performance anyway.
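The consequence for the pair lists can be sketched numerically. This is a toy model, not GROMACS code; the cubic scaling simply reflects the volume of the cutoff sphere, and the 1.0 nm van der Waals cutoff is a hypothetical example value:

```python
def pair_list_cutoff(elec_cutoff, vdw_cutoff):
    # The pair list must hold every pair within the larger of the two cutoffs
    return max(elec_cutoff, vdw_cutoff)

def relative_pair_cost(cutoff, reference=1.0):
    # Neighbour count scales with the volume of the cutoff sphere
    return (cutoff / reference) ** 3

# With a fixed 1.0 nm van der Waals cutoff, shrinking the electrostatic
# cutoff to 0.9 nm does not shrink the pair list at all ...
assert pair_list_cutoff(0.9, 1.0) == 1.0

# ... while growing the electrostatic cutoff to 1.2 nm makes the pair
# work roughly 73% more expensive (1.2**3 = 1.728)
assert abs(relative_pair_cost(pair_list_cutoff(1.2, 1.0)) - 1.728) < 1e-9
```

So tuning can only usefully trade in one direction: lengthening the electrostatic cutoff to shrink the FFT grid, never shortening it below the van der Waals cutoff.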
MD workflows using PME
3.2 Quiz: When would it be most likely to benefit from moving PME interactions to the GPU?
Few bonded interactions and relatively weak CPU
Few bonded interactions and relatively strong CPU
Many bonded interactions and relatively weak CPU
Many bonded interactions and relatively strong CPU
Solution
Running two tasks on the GPU adds overhead there, which offsets any benefit from the work running faster on the GPU. If the CPU is powerful enough to finish all of its work before the GPU finishes the short-ranged work, then leaving the PME work on the CPU is best.
The PME task can be moved to the same GPU as the short-ranged task. This comes with the same kinds of challenges as moving the bonded task to the GPU.
It turns out that the latter part of the PME task (the 3D-FFTs and the reciprocal-space solve) is harder to make run fast on a GPU than the first part (spreading the charges onto the grid and gathering the forces), particularly when a short-ranged task is also running on the same GPU. GROMACS permits that second part to be run on the CPU instead.
Explore performance with PME
Make a new folder for this exercise, e.g. mkdir using-pme; cd using-pme.
Download the run input file prepared to do 20000 steps of a PME simulation. We’ll use it to experiment with task assignment.
Download the job submission script, in which you will see several lines marked **FIXME**. Remove each **FIXME**, replacing it with options that achieve the goal stated in the comment before that line. You will need to refer to the information above to do so. Save the file and exit.
Submit the script to the SLURM job manager with sbatch script.sh. If it succeeds, it will reply with something like Submitted batch job 4565494. The job manager will write the terminal output to a file named like slurm-4565494.out. It may take a few minutes to start and a few more minutes to run.
While it is running, you can use tail -f slurm*out to watch the output. When it says “Done”, the runs are finished. Use Ctrl-C to exit the tail command that you ran.
Once the first trajectory completes, exit tail and use less default.log to inspect the output. Find the “Mapping of GPU IDs…” section. Does what you read there agree with what you just learned? Then, find where the PME tuning took place. Hint: search for “pme grid”. What minimum value do you expect based on the van der Waals cutoff? What does the tuned value tell you about the performance of the tasks on the GPU on this machine?
The *.log files contain the performance (in ns/day) of each run on the last line. Use tail *log to see the last chunk of each log file. Have a look through the log files and see what you can learn. What differs from the log files from previous exercises?
Solution
You can download a working version of the batch submission script. Its diff from the original is:
--- /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/exercises/using-pme/script.sh
+++ /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/answers/using-pme/script.sh
@@ -19,13 +19,13 @@
srun gmx mdrun $options -g default.log
# Run mdrun assigning only the non-bonded interactions to the
# GPU and PME to the CPU
-srun gmx mdrun $options -g manual-nb.log **FIXME**
+srun gmx mdrun $options -g manual-nb.log -nb gpu -pme cpu
# Run mdrun assigning the non-bonded interactions and all of
# the PME task to the GPU
-srun gmx mdrun $options -g manual-nb-pmeall.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pmeall.log -nb gpu -pme gpu
# Run mdrun assigning the non-bonded interactions and just
# the first part of the PME task to the GPU
-srun gmx mdrun $options -g manual-nb-pmefirst.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pmefirst.log -nb gpu -pme gpu -pmefft cpu
# Let us know we're done
echo Done
Sample output it produced is available:
The tails of those log files are
==> default.log <==
-----------------------------------------------------------------------------
Total 19.079 913.607 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 381.558 19.079 1999.9
(ns/day) (hour/ns)
Performance: 90.582 0.265
Finished mdrun on rank 0 Thu Sep 9 10:25:42 2021
==> manual-nb.log <==
PME 3D-FFT 1 20 20002 4.644 222.376 11.8
PME solve Elec 1 20 10001 0.546 26.143 1.4
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 790.275 39.514 2000.0
(ns/day) (hour/ns)
Performance: 43.735 0.549
Finished mdrun on rank 0 Thu Sep 9 10:27:23 2021
==> manual-nb-pmeall.log <==
-----------------------------------------------------------------------------
Total 18.967 908.261 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 379.323 18.967 1999.9
(ns/day) (hour/ns)
Performance: 91.115 0.263
Finished mdrun on rank 0 Thu Sep 9 10:28:18 2021
==> manual-nb-pmefirst.log <==
-----------------------------------------------------------------------------
Total 25.932 1241.797 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 518.627 25.932 1999.9
(ns/day) (hour/ns)
Performance: 66.642 0.360
Finished mdrun on rank 0 Thu Sep 9 10:29:29 2021
Depending on the underlying variability of the performance of this trajectory on this hardware, we might be able to observe which configuration corresponds to the default, and whether offloading all or part of the PME work is advantageous. Run the scripts a few times to get a crude impression of that variability!
Running update and constraints on the GPU
Recall that earlier we said that the dominant operations are arithmetic and data movement. We can eliminate much of the data movement by moving most of the computation to the GPU, including the reduction, update, and constraints phases.
Note that not all combinations of algorithms are supported, but where they are, the benefit of also running the update on the GPU is substantial.
Explore GPU updates
Using the same folder and topol.tpr file from the above exercise, download the job submission script, in which you will again see FIXME comments. Replace them to make it run NB, PME, and the update on the GPU, and optionally also the bonded work. Save and exit.
Run the script and observe the performance as before. Is that better or worse than earlier?
Solution
You can download a working version of the batch submission script. Its diff from the original is:
--- /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/exercises/using-pme/all-on-gpu.sh
+++ /home/runner/work/gromacs-gpu-performance/gromacs-gpu-performance/content/answers/using-pme/all-on-gpu.sh
@@ -16,10 +16,10 @@
options="-nsteps 20000 -resetstep 19000 -ntomp $SLURM_CPUS_PER_TASK -pin on -pinstride 1"
# Run mdrun assigning the non-bonded, PME, and update work to the GPU
-srun gmx mdrun $options -g manual-nb-pme-update.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pme-update.log -nb gpu -pme gpu -update gpu
# Run mdrun assigning the non-bonded, PME, bonded, and update
# work to the GPU
-srun gmx mdrun $options -g manual-nb-pme-bonded-update.log **FIXME**
+srun gmx mdrun $options -g manual-nb-pme-bonded-update.log -nb gpu -pme gpu -bonded gpu -update gpu
# Let us know we're done
echo Done
Sample output it produced is available:
The tails of those log files are
==> manual-nb-pme-bonded-update.log <==
-----------------------------------------------------------------------------
Total 13.602 651.337 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 271.997 13.602 1999.7
(ns/day) (hour/ns)
Performance: 127.056 0.189
Finished mdrun on rank 0 Thu Sep 9 10:45:31 2021
==> manual-nb-pme-update.log <==
-----------------------------------------------------------------------------
Total 12.897 617.605 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 257.917 12.897 1999.8
(ns/day) (hour/ns)
Performance: 133.995 0.179
Finished mdrun on rank 0 Thu Sep 9 10:44:53 2021
Depending on the underlying variability of the performance of this trajectory on this hardware, we might be able to observe whether also running the update on the GPU is advantageous. You should observe that it is a clear improvement here. Run the scripts a few times to get a crude impression of that variability!
Keypoints
The PME workload can be run on a GPU in a few different ways
The relative strength of CPU and GPU and the simulation system determines the most efficient way to assign the tasks. The default is not always best.
When supported, moving the whole MD workload to the GPU provides good improvements.