Performant PME simulations

Questions

  • What considerations are important when running simulations that use PME?

Objectives

  • Know how to assign the PME workload

Background on the PME algorithm

Most systems of interest in biomolecular MD have inhomogeneous distributions of partially charged particles. It turns out that simply neglecting Coulomb interactions beyond a cut-off is not accurate enough. Even extremely large cut-offs only reduce the size of the truncation artefacts; they do not eliminate them. Instead, most simulations use some form of the Ewald method, where the shape of the short-range interaction is modified, and a compensating correction is made by doing extra work in a concurrent long-range component.

../_images/ewald-corrected-force.svg

Decomposing the Coulomb interaction (blue) into short- and long-ranged contributions. A Gaussian potential is subtracted from the \(\frac{1}{r}\) Coulomb potential so that its forces (red) go to zero at smaller \(r\). That Gaussian is added back in the so-called long-range (“PME”) component (green) so that the full all-vs-all Coulomb interaction is modelled accurately. The advantage is that the long-range component is smoothly varying and can be computed efficiently with a 3D-FFT.
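
In the standard Ewald formulation this split is expressed with the error function and its complement (the subtracted Gaussian integrates to \(\operatorname{erf}\)):

\[ \frac{1}{r} = \frac{\operatorname{erfc}(\beta r)}{r} + \frac{\operatorname{erf}(\beta r)}{r} \]

The first term decays rapidly and is evaluated directly within the cut-off; the second term is smooth everywhere and is the part computed in reciprocal space with the 3D-FFT. The splitting parameter \(\beta\) controls the width of the Gaussian and thus how the work is shared between the two components.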

That concurrent force work can be computed on either the CPU or the GPU. When run on an Nvidia GPU, the cuFFT library is used to implement the 3D-FFT part of the long-range component. When run on the CPU, a similar library is used, normally MIT’s FFTW or Intel’s MKL.
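
Which CPU FFT library GROMACS uses is chosen when it is built. As a sketch (the exact options depend on your GROMACS version and installation):

    # configure the CPU 3D-FFT backend at build time
    cmake .. -DGMX_FFT_LIBRARY=fftw3   # FFTW (the common default)
    cmake .. -DGMX_FFT_LIBRARY=mkl     # Intel MKL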

PME tuning

One of the most useful attributes of the PME algorithm is that the share of the computational work between the two components can be varied. Scaling the short-ranged cut-off and the 3D-FFT grid spacing by the same factor gives a model that is just as accurate an approximation, while reducing the workload of one component and increasing the workload of the other. So the user input defines the expected quality of the electrostatic approximation, and the actual implementation is free to choose an equivalent scheme that minimizes the total execution time.
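
For example, the following (hypothetical) pairs of .mdp settings request approximately the same accuracy, because the cut-off and the maximum grid spacing are scaled by the same factor; the second pair shifts work from the 3D-FFT to the short-ranged component:

    ; baseline
    rcoulomb        = 1.0    ; short-range Coulomb cut-off (nm)
    fourier-spacing = 0.125  ; maximum PME grid spacing (nm)

    ; scaled by 1.2: more short-range work, a coarser (cheaper) PME grid
    rcoulomb        = 1.2
    fourier-spacing = 0.15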

PME tuning is on by default whenever it is likely to be useful; it can be forced on with gmx mdrun -tunepme and forced off with gmx mdrun -notunepme. In practice, mdrun does such tuning over the first few thousand steps, and then uses the result of the optimization for the remaining time.
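
For example, when benchmarking it can be useful to disable the tuning so that every run uses exactly the cut-off and grid specified in the .tpr file (the output name here is just an illustration):

    gmx mdrun -notunepme -deffnm default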

3.1 Quiz: mdrun also has to compute the van der Waals interactions between particles. Should the cut-off for those be changed to match the tuned electrostatic cut-off?

  1. Yes, keep it simple

  2. Yes, van der Waals interactions are not important

  3. No, they’re so cheap it doesn’t matter

  4. No, the van der Waals radii are critical for force-field accuracy

MD workflows using PME

../_images/molecular-dynamics-workflow-on-cpu-and-one-gpu.svg

Typical GROMACS simulation running on a GPU, with only the short-ranged interactions offloaded to the GPU. This can be selected with gmx mdrun -nb gpu -pme cpu -bonded cpu.

3.2 Quiz: When is a simulation most likely to benefit from moving the PME interactions to the GPU?

  1. Few bonded interactions and relatively weak CPU

  2. Few bonded interactions and relatively strong CPU

  3. Many bonded interactions and relatively weak CPU

  4. Many bonded interactions and relatively strong CPU

The PME task can be moved to the same GPU as the short-ranged task. This comes with the same kinds of challenges as moving the bonded task to the GPU.

../_images/molecular-dynamics-workflow-short-range-gpu-pme-gpu-bonded-cpu.svg

Possible GROMACS simulation running on a GPU, with both short-ranged and PME tasks offloaded to the GPU. This can be selected with gmx mdrun -nb gpu -pme gpu -bonded cpu.

It turns out that the latter part of the PME task (the 3D-FFT work) is harder to make run fast on a GPU than the first part, particularly when a short-ranged task is also running on the same GPU. GROMACS permits that second part to be run on the CPU instead.

../_images/molecular-dynamics-workflow-short-range-gpu-pme-gpu-pmefft-cpu-bonded-cpu.svg

Possible GROMACS simulation running on a GPU, with short-ranged and the first part of the PME task offloaded to the GPU. This can be selected with gmx mdrun -nb gpu -pme gpu -pmefft cpu -bonded cpu.
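
In summary, the three task assignments illustrated above differ only in the mdrun options, with the bonded work kept on the CPU in each case:

    gmx mdrun -nb gpu -pme cpu -bonded cpu              # PME entirely on the CPU
    gmx mdrun -nb gpu -pme gpu -bonded cpu              # PME entirely on the GPU
    gmx mdrun -nb gpu -pme gpu -pmefft cpu -bonded cpu  # PME on the GPU, 3D-FFT on the CPU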

Explore performance with PME

Make a new folder for this exercise, e.g. mkdir using-pme; cd using-pme.

Download the run input file prepared to do 20000 steps of a PME simulation. We’ll use it to experiment with task assignment.

Download the job submission script, where you will see several lines marked FIXME. Replace each FIXME with settings that achieve the goal stated in the comment before that line; you will need to refer to the information above to do so. Save the file and exit.
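
As a sketch, the completed portion of such a script might look like the following (the scheduler options and output names are placeholders; your cluster and the downloaded script will differ):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    # short-ranged work on the GPU, PME on the CPU
    gmx mdrun -s topol.tpr -deffnm default -nb gpu -pme cpu -bonded cpu
    # short-ranged and PME work both on the GPU
    gmx mdrun -s topol.tpr -deffnm pme-gpu -nb gpu -pme gpu -bonded cpu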

Submit the script to the SLURM job manager with sbatch script.sh. If it succeeds, it will reply with something like Submitted batch job 4565494. The job manager will write terminal output to a file named like slurm-4565494.out. The job may take a few minutes to start and a few more minutes to run.

While it is running, you can use tail -f slurm*out to watch the output. When it says “Done”, the runs are finished. Use Ctrl-C to exit the tail command that you ran.

Once the first trajectory completes, exit tail and use less default.log to inspect the output. Find the section starting “Mapping of GPU IDs…”. Does what you read there agree with what you just learned?

Then, find where the PME tuning took place. Hint: search for “pme grid”. What minimum value do you expect based on the van der Waals cut-off? What does the tuned value tell you about the performance of the tasks on the GPU on this machine?
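
One quick way to find both sections (assuming the log file is named default.log, as above):

    grep -A 1 "Mapping of GPU IDs" default.log   # task-to-GPU assignment
    grep -i "pme grid" default.log               # PME tuning progress and result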

The *.log files contain the performance (in ns/day) of each run on the last line. Use tail *log to see the last chunk of each log file. Have a look through the log files and see what you can learn. What differs from the log files of previous exercises?

Running update and constraints on the GPU

Recall that earlier we said that the dominant operations are arithmetic and data movement. We can eliminate a lot of the data movement by moving most of the computation to the GPU, including the reduction, update, and constraints phases.

../_images/molecular-dynamics-workflow-all-on-gpu.svg

Moving the update and constraints to the GPU as well. Now there is much less data movement and the whole calculation is much more efficient. Generally the bonded forces should remain on the CPU, which is otherwise idle. Run this way using gmx mdrun -nb gpu -pme gpu -update gpu.

Note that not all combinations of algorithms are supported but, where they are, running the update on the GPU as well typically gives a substantial performance benefit.

Explore GPU updates

Using the same folder and topol.tpr file from the above exercise, download the job submission script, where you will again see FIXME comments. Replace them to run the non-bonded (NB), PME, and update work on the GPU, and perhaps also the bonded work. Save and exit.
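
The relevant mdrun line will end up looking something like this sketch (whether offloading the bonded work also helps depends on your CPU and GPU):

    # non-bonded, PME, and the update/constraints all on the GPU
    gmx mdrun -nb gpu -pme gpu -update gpu -bonded cpu
    # optionally move the bonded work too
    gmx mdrun -nb gpu -pme gpu -update gpu -bonded gpu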

Run the script and observe the performance as before. Is that better or worse than earlier?

Keypoints

  • The PME workload can be run on a GPU in a few different ways

  • The relative strengths of the CPU and GPU, together with the simulation system, determine the most efficient way to assign the tasks. The default is not always best.

  • When supported, moving the whole MD workload to the GPU provides good improvements.