Understanding Navigator GPU nodes

Questions

  • What hardware is available on Navigator’s GPU nodes?

  • How do we organize computational work to go where it should?

Objectives

  • Understand that the time to transfer data between compute units is finite

  • Understand that managing the locality of data (and the tasks that use them) is critical for performance

Running jobs on Navigator

When we request GPU nodes on Navigator, we ask SLURM for some number of CPUs and GPUs. The SLURM job scheduler can handle quite complex assignments, but today we’ll keep it simple and focus solely on jobs that use one or more GPUs, each matched with a group of 20 CPU cores.

For example (adapted from https://www.uc.pt/lca/ClusterResources/Navigator/running), to get a single GPU, 20 nearby CPU cores and some memory for 20 minutes, charged to the project for this workshop, we could use a job script like

#!/bin/bash

#SBATCH --account training
#SBATCH --nodes 1          # One node
#SBATCH --ntasks 1         # One task
#SBATCH --cpus-per-task=20 # 20 cores per task
#SBATCH --time 0-00:20     # Runtime in D-HH:MM
#SBATCH --partition gpu    # Use the GPU partition
#SBATCH --gres=gpu:v100:1  # One GPU requested
#SBATCH --mem=20GB         # Total memory requested

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # One OpenMP thread per allocated core
module load GROMACS/2021.3-foss-2020b        # Load the GROMACS module built for this system

srun gmx mdrun                               # Launch the GROMACS simulation via srun
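
The same pattern scales to more GPUs by adding another task and another matching group of 20 cores per extra GPU. The header lines below are a sketch only: the gres string for two GPUs, and whether a Navigator GPU node actually offers two V100s and 40 cores, are assumptions to check against the Navigator documentation.

#SBATCH --ntasks 2         # Two tasks, one per GPU
#SBATCH --cpus-per-task=20 # Still 20 cores per task
#SBATCH --gres=gpu:v100:2  # Two GPUs requested (assumed to be available on one node)
#SBATCH --mem=40GB         # Memory request scaled with the number of tasks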

If the single-GPU script above were saved in a file called run-script.sh, we could submit it to the Navigator batch queue with sbatch run-script.sh. For this workshop, jobs will start quickly because we have a dedicated reservation.
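
Once the job is submitted, the usual SLURM commands can be used to follow it. A minimal sketch, assuming the standard SLURM tools are available on the Navigator login node:

sbatch run-script.sh   # Submit the script; SLURM prints the job ID
squeue -u $USER        # List our pending and running jobs
scancel <jobid>        # Cancel a job, using the ID reported by sbatch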

Keypoints

  • HPC nodes have internal structure that affects performance

  • Expect to see many clusters that have multiple GPUs per node