Using multiple GPUs

Questions

  • How can one get more performance by using more than one GPU?

Objectives

  • Know how to use several GPUs efficiently.

Nodes with several GPUs

It is common to have more than one GPU in a single node. To use them, we can leverage the domain decomposition machinery: it is already set up so that communication between ranks is minimal. The GPUs are assigned to ranks, and each GPU does all the computations for its rank's domain.
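The mapping of ranks onto GPUs is controlled on the gmx mdrun command line; mdrun usually assigns the GPUs automatically, but the mapping can also be set explicitly with -gputasks. As a minimal sketch (thread counts and file names are placeholders, not the course's exact settings), starting one thread-MPI rank per GPU could look like this:

```bash
# A minimal sketch: one thread-MPI rank per GPU. With PME offloaded to a GPU,
# one rank is dedicated to PME (-npme 1) and the other does the PP work.
# -ntomp 10 assumes 20 CPU cores on the node; file names are placeholders.
gmx mdrun -ntmpi 2 -ntomp 10 -npme 1 \
          -nb gpu -pme gpu \
          -s topol.tpr -deffnm default
```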

Run simulation on two GPUs simultaneously

Make a new folder for this exercise, e.g. mkdir using-pme-multigpu; cd using-pme-multigpu.

Download the run input file prepared to do 20000 steps of a PME simulation. We’ll use it to experiment with task assignment.

Download the job submission script, in which you will see several lines marked **FIXME**. Remove each **FIXME** and edit the line so that it achieves the goal stated in the comment before it; you will need to refer to the information above to do so. Save the file and exit.
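If you want something to check your edits against, here is a purely hypothetical sketch of how a completed script could be structured; the actual course script, module name, SLURM account, partition, GPU type, and input file name will differ:

```bash
#!/bin/bash
#SBATCH --partition=gpu            # assumed partition name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:v100:2          # request two GPUs on one node (assumed GPU type)
#SBATCH --time=00:30:00
#SBATCH --account=<project>        # placeholder

module load gromacs                # assumed module name

# One thread-MPI rank per GPU, as sketched above.
srun gmx mdrun -ntmpi 2 -ntomp 10 -npme 1 -nb gpu -pme gpu \
     -s topol.tpr -deffnm default

echo Done
```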

Submit the script to the SLURM job manager with sbatch many-gpus.sh. It will reply something like Submitted batch job 4565494 if it succeeds. The job manager will write terminal output to a file named like slurm-4565494.out. The job may take a few minutes to start and a few more minutes to run.

While it is running, you can use tail -f slurm*out to watch the output. When it says “Done” then the runs are finished. Use Ctrl-C to exit the tail command that you ran.

Once the first trajectory completes, exit tail and use less default.log to inspect the output. Find the section starting with “Mapping of GPU IDs…”. Does what you read there agree with what you just learned?
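If you prefer not to scroll through the whole log, a quick way to pull out that section is, for example:

```bash
# Print the GPU task assignment reported in the log
# (adjust the pattern if your GROMACS version words it differently)
grep -A 2 "Mapping of GPU IDs" default.log
```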

The *.log files report the performance (in ns/day) of each run near the end. Use tail *.log to see the last chunk of each log file. Look through the log files and see what you can learn. What differs from the log files of previous exercises? What is the performance gain compared to a single GPU?
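One convenient way to compare the runs is to extract the Performance line that GROMACS writes near the end of every log file:

```bash
# -H prints the file name in front of each match
grep -H "Performance:" *.log
```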

Using direct communications between GPUs

On Puhti, the GPUs within a node are connected with NVLink, which allows them to communicate directly. This can be explored in GROMACS, although the feature is still experimental and not as thoroughly tested, so use it with caution and report your experience to the GROMACS developers! To enable it, set two environment variables: GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS. The first enables direct halo exchange between domains on the GPUs; the second allows the PP and PME ranks to communicate directly.
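For example, the variables could be exported in the job script before launching mdrun (a minimal sketch; the mdrun invocation itself stays as before):

```bash
# Enable the experimental direct GPU communication paths
export GMX_GPU_DD_COMMS=true        # GPU halo exchange between domains
export GMX_GPU_PME_PP_COMMS=true    # direct PME <-> PP communication

srun gmx mdrun ...                   # then launch mdrun as in the previous exercise
```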

Run simulation on two GPUs with direct GPU communication

Download the job submission script, in which you will see several lines marked **FIXME**. Remove each **FIXME** and edit the line so that it achieves the goal stated in the comment before it; you will need to refer to the information above to do so. Save the file and exit.

Submit the script to the SLURM job manager with sbatch many-gpus_direct-comms.sh. It will reply something like Submitted batch job 4565494 if it succeeds. The job manager will write terminal output to a file named like slurm-4565494.out. The job may take a few minutes to start and a few more minutes to run.

While it is running, you can use tail -f slurm*out to watch the output. When it says “Done” then the runs are finished. Use Ctrl-C to exit the tail command that you ran.

Once the first trajectory completes, exit tail and use less default.log to inspect the output. Find the section starting with “Mapping of GPU IDs…”. Does what you read there agree with what you just learned?

The *.log files report the performance (in ns/day) of each run near the end. Use tail *.log to see the last chunk of each log file. Look through the log files and see what you can learn. What differs from the log files of previous exercises? What is the performance gain compared to a single GPU, and does direct communication help compared to the previous two-GPU run?

The most efficient way to use multi-GPU systems

As before, the scaling when going from one GPU to two is not linear. This is expected: each GPU now has less to compute, and the GPUs have to communicate with each other; on top of that, this communication cannot easily be hidden behind computation. To make the best use of the resources, you can execute ensemble runs. Try the multi-dir approach we used before to see which configuration gives the best cumulative performance. Try assigning more than one rank to a single GPU: this lets communication, CPU work, and GPU work overlap more efficiently. Try leaving the bonded computation and/or the update and constraints on the CPU: you have 10 CPU cores per GPU, and it would be a waste to keep them idle.
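As a hypothetical sketch (directory names, task and thread counts are illustrative, and an MPI-enabled build such as gmx_mpi is assumed), a four-member ensemble sharing two GPUs could be launched like this:

```bash
# Four simulations in run1..run4, one MPI rank each, so two ranks share each GPU.
# Bonded forces are kept on the CPU to use the otherwise idle cores;
# -update cpu could be added to move the update and constraints there too.
srun --ntasks=4 gmx_mpi mdrun -multidir run1 run2 run3 run4 \
     -ntomp 5 -nb gpu -pme gpu -bonded cpu \
     -deffnm default
```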

Keypoints

  • One can use several GPUs for a single run.

  • Ensemble runs allow communication to overlap with computation, thus using the resources more efficiently.