# Performant PME simulations

Questions

• What considerations are important when using PME?

Objectives

• Know how to assign the PME workload

## Background on the PME algorithm

Most systems of interest in biomolecular MD have inhomogeneous distributions of partially charged particles. It turns out that simply neglecting interactions beyond a cut-off is not accurate enough. Even extremely large cut-offs only reduce the magnitude of the truncation artefacts; they do not eliminate them. Instead, most simulations turn to some form of the Ewald method, where the shape of the short-range interaction is modified, and a compensating fix is made by doing extra work in a concurrent long-range component.

That concurrent force work can be computed on either the CPU or the GPU. When run on an Nvidia GPU, the cuFFT library is used to implement the 3D-FFT part of the long-range component. When run on the CPU, a similar library is used, normally MIT's FFTW or Intel's MKL.
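In GROMACS, the CPU FFT library is selected when the code is built, not at run time. A sketch of the relevant CMake option (the flag name is the standard GROMACS build option; the build directory layout here is an assumption):

```shell
# From a GROMACS build directory: choose the CPU 3D-FFT backend at build time.
# GMX_FFT_LIBRARY is the standard GROMACS CMake option.
cmake .. -DGMX_FFT_LIBRARY=fftw3   # or: -DGMX_FFT_LIBRARY=mkl
```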

## PME tuning

One of the most useful attributes of the PME algorithm is that the share of the computational work between the two components can be varied. Scaling the short-ranged cutoff and the 3D-FFT grid spacing by the same factor gives a model that is just as accurate an approximation, while reducing the workload of one component and increasing the workload of the other. So the user input can be used to define the expected quality of the electrostatic approximation, and the actual implementation can do something equivalent that minimizes the total execution time.

PME tuning is on by default whenever it is likely to be useful. It can be forced on with `gmx mdrun -tunepme`, and forced off with `gmx mdrun -notunepme`. In practice, mdrun does such tuning in the first few thousand steps, and then uses the result of the optimization for the remaining run time.
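As a sketch, the relevant job-script lines look like the following (`-tunepme`/`-notunepme` are standard mdrun options; the input file name `topol.tpr` is assumed):

```shell
# Force the PME load-balancing on (this is the default when likely useful)
gmx mdrun -s topol.tpr -tunepme

# Disable it, e.g. when benchmarking a fixed cutoff and grid
gmx mdrun -s topol.tpr -notunepme
```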

3.1 Quiz: mdrun also has to compute the van der Waals interactions between particles. Should the cutoff for those be changed to match the tuned electrostatic cutoff?

1. Yes, keep it simple

2. Yes, van der Waals interactions are not important

3. No, they’re so cheap it doesn’t matter

4. No, the van der Waals radii are critical for force-field accuracy

## MD workflows using PME

3.2 Quiz: When would a simulation be most likely to benefit from moving the PME interactions to the GPU?

1. Few bonded interactions and relatively weak CPU

2. Few bonded interactions and relatively strong CPU

3. Many bonded interactions and relatively weak CPU

4. Many bonded interactions and relatively strong CPU

The PME task can be moved to the same GPU as the short-ranged task. This comes with the same kinds of challenges as moving the bonded task to the GPU.

It turns out that the latter part of the PME task is harder to make run fast on a GPU than the first part, particularly when a short-ranged task is also running on the same GPU. GROMACS permits that second part to be run on the CPU instead.
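A sketch of job-script lines for these assignments (`-nb`, `-pme`, and `-pmefft` are standard mdrun options; the input file name `topol.tpr` is assumed):

```shell
# Short-ranged (NB) and the whole PME task on the GPU
gmx mdrun -s topol.tpr -nb gpu -pme gpu

# As above, but run the latter (3D-FFT) part of PME on the CPU instead
gmx mdrun -s topol.tpr -nb gpu -pme gpu -pmefft cpu
```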

## Explore performance with PME

Make a new folder for this exercise, e.g. `mkdir using-pme; cd using-pme`.

Download the run input file prepared to do 20000 steps of a PME simulation. We’ll use it to experiment with task assignment.

Download the job submission script, in which you will see several lines marked **FIXME**. Replace each **FIXME** to achieve the goal stated in the comment before that line. You will need to refer to the information above to do so. Save the file and exit.

Submit the script to the SLURM job manager with `sbatch script.sh`. On success, it will reply with something like `Submitted batch job 4565494`. The job manager will write terminal output to a file named like `slurm-4565494.out`. It may take a few minutes to start and a few more minutes to run.

While it is running, you can use `tail -f slurm*out` to watch the output. When it says "Done", the runs are finished. Use Ctrl-C to exit the `tail` command that you ran.

Once the first trajectory completes, exit `tail` and use `less default.log` to inspect the output. Find the "Mapping of GPU IDs…" section. Does what you read there agree with what you just learned?

Then, find where the PME tuning took place. Hint: search for "pme grid". What minimum value do you expect, based on the van der Waals cutoff? What does the tuned value tell you about the relative performance of the tasks on the GPU on this machine?
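One quick way to locate the tuning table is a search like this sketch (it assumes the log file from the run above is named `default.log`; the amount of context to show is a matter of taste):

```shell
# Show lines around each "pme grid" entry that the tuning wrote to the log
grep -B 1 -A 4 "pme grid" default.log
```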

The `*.log` files contain the performance (in ns/day) of each run on the last line. Use `tail *log` to see the last chunk of each log file. Have a look through the log files and see what you can learn. What differs from the log files of previous exercises?

## Running update and constraints on the GPU

Recall that earlier we said that the dominant operations are arithmetic and data movement. We can eliminate much of the data movement by moving most of the computation to the GPU, including the reduction, update, and constraints phases.

Note that not all combinations of algorithms are supported, but where they are, the benefit of also running the update on the GPU can be substantial.

Using the same folder and `topol.tpr` file from the above exercise, download the job submission script, where you will again see FIXME comments. Replace them to make the run execute NB, PME, and the update on the GPU, as well as perhaps the bonded work. Save and exit.
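A sketch of the full-offload mdrun line (all flags are standard mdrun options; whether `-update gpu` is accepted depends on the algorithms chosen in your `.tpr` file):

```shell
# NB, PME, bonded work, and the update/constraints all on the GPU
gmx mdrun -s topol.tpr -nb gpu -pme gpu -bonded gpu -update gpu
```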