Tutorial: Quantum ESPRESSO on HPC systems

Credit

This material is adapted with permission from https://gitlab.com/QEF/materials-for-max-coe-enccs-workshop-2022

Slides

Setup


EXERCISE 0 - First serial run

Files needed:

 1  #!/bin/bash
 2  #SBATCH --job-name=USERjob
 3  #SBATCH --nodes 1
 4  #SBATCH --time=00:00:30
 5  #SBATCH --partition=cpu
 6  #SBATCH --reservation=maxcpu
 7  #SBATCH --ntasks-per-socket=...
 8  #SBATCH --ntasks-per-node=...
 9  #SBATCH --hint=nomultithread
10  #SBATCH --cpus-per-task=...
11  #SBATCH --output=sysout.out
12  #SBATCH --error=syserr.err
13
14  export EXDIR=${PWD}/..
15
16  module purge
17  module load QuantumESPRESSO/7.1-foss-2022a
18
19  export INDIR=${EXDIR}/inputs
20  export ESPRESSO_PSEUDO=${EXDIR}/../pseudo
21
22  # number of threads (serial)
23  export OMP_NUM_THREADS=...
24  # execute PW
25  mpirun ...

We will only use pw.x in this hands-on.
The pre-installed module for QE v7.1 is loaded in the batch file:

$ module load QuantumESPRESSO/7.1-foss-2022a
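
Before submitting anything, you can quickly verify on the login node that the module actually provides the pw.x executable (an optional sanity check, not part of the exercise):

$ module load QuantumESPRESSO/7.1-foss-2022a
$ which pw.x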

Check that the module works by submitting a quick serial test.
Open the Slurm batch file ex0-run.slurm, fill in the dots with the right numbers/commands (dots at lines 7, 8, 10, 23, 25), and submit it:

$ sbatch ./ex0-run.slurm

To see the submission status of your job:

$ squeue -u YOUR_USERNAME
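
If you want a hint before looking at the solution, one possible serial fill-in is sketched below (an assumption of a one-task, one-thread run, not necessarily the reference solution; the input file name is a placeholder for one of the files in ${INDIR}):

    #SBATCH --ntasks-per-socket=1     # line 7: a single MPI task
    #SBATCH --ntasks-per-node=1       # line 8: a single MPI task on the node
    #SBATCH --cpus-per-task=1         # line 10: one core for that task
    export OMP_NUM_THREADS=1          # line 23: no threading for a serial run
    mpirun -np 1 pw.x -i ${INDIR}/<input file> > pw_serial.out   # line 25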

Solution


EXERCISE 1 - Parallelization with pools

Files needed:

#!/bin/bash
#SBATCH --job-name=USERjob
#SBATCH --nodes 1
#SBATCH --exclusive
#SBATCH --time=00:20:00
#SBATCH --partition=cpu
#SBATCH --reservation=maxcpu
#SBATCH --ntasks-per-socket=2
#SBATCH --ntasks-per-node=16
#SBATCH --hint=nomultithread
#SBATCH --cpus-per-task=1
#SBATCH --output=sysout.out
#SBATCH --error=syserr.err
# # SBATCH --mail-user=YOUR_EMAIL - if you want

module purge
module load QuantumESPRESSO/7.1-foss-2022a

export EXDIR=${PWD}/..
export INDIR=${EXDIR}/inputs
export ESPRESSO_PSEUDO=${EXDIR}/../pseudo
export OMP_NUM_THREADS=1

mpiopt="-mca pml ucx -mca btl ^uct,tcp,openib,vader --map-by socket:PE=1 --rank-by core --report-bindings "

for ip in ...
do
  mpirun $mpiopt -np 16 pw.x -npool "$ip" -ndiag 1 -i ${INDIR}/pw.CuO.scf.in > pw_CuO_${ip}pools.out
done

Try to predict the best value of npool, then verify your guess by performing a number of runs.

  1. Open the batch file ex1-pools.slurm and customize the user-related SLURM options such as job-name and mail-user (optional).
    Replace the dots with a list of proper values for npool, e.g.:

    for ip in 1 2 3 4 5 6    # not necessarily the right values here!
    do
    
  2. Submit the job file:

    $ sbatch ./ex1-pools.slurm
    
  3. Look for total WALL time at the end of each output file:

            PWSCF        :   3m26.15s CPU   3m30.27s WALL
    

    and plot TIME(npool) (a grep one-liner for collecting the timings is sketched after this list). Which is the best npool value? Why?

  4. You can try with different numbers of MPI tasks.
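
To collect the timings from all the outputs at once, a grep along these lines can help (assuming the file naming used by the loop in the batch file):

    $ grep "PWSCF.*WALL" pw_CuO_*pools.out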

Solution


EXERCISE 2 - Parallelization of the eigenvalue problem

Files needed:

#!/bin/bash
#SBATCH --job-name=USERjob
#SBATCH --nodes 1
#SBATCH --exclusive
#SBATCH --time=00:20:00
#SBATCH --partition=cpu
#SBATCH --reservation=maxcpu
#SBATCH --ntasks-per-socket=2
#SBATCH --ntasks-per-node=16
#SBATCH --hint=nomultithread
#SBATCH --cpus-per-task=1
#SBATCH --output=sysout.out
#SBATCH --error=syserr.err
# # SBATCH --mail-user=YOUR_EMAIL - if you want

module purge
module load QuantumESPRESSO/7.1-foss-2022a

export EXDIR=${PWD}/..
export INDIR=${EXDIR}/inputs
export ESPRESSO_PSEUDO=${EXDIR}/../pseudo
export OMP_NUM_THREADS=1

mpiopt="-mca pml ucx -mca btl ^uct,tcp,openib,vader --map-by socket:PE=1 --rank-by core --report-bindings "

for id in ....
do
  mpirun $mpiopt -np 16 pw.x -npool 4 -ndiag "$id" -i ${INDIR}/pw.CuO.scf.in > pw_CuO_${id}diag.out
done

Play with the ndiag parameter by performing a number of runs and observing the variations (if any) in the WALL time.
You can also change the fixed value of npool (the default value for this exercise is 4); a sketch of a nested scan over both parameters is given after the steps below.

  1. Replace the dots with a list of proper values for ndiag, e.g.:

    for id in 1 2 3 4 5 6     # not necessarily the right values here!
    do
    
  2. Submit the job file:

    $ sbatch ./ex2-diag.slurm
    
  3. Check the total WALL time at the end of each output file and plot TIME(ndiag).
    Which is the best ndiag value (if any)? Why?
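
If you want to scan npool and ndiag together, the single loop in the batch file can be replaced by a nested one along these lines (the values below are placeholders, not a recommendation):

    for np in 1 2 4
    do
      for id in 1 4
      do
        mpirun $mpiopt -np 16 pw.x -npool "$np" -ndiag "$id" -i ${INDIR}/pw.CuO.scf.in > pw_CuO_${np}pools_${id}diag.out
      done
    done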

Solution


EXERCISE 3 - MPI + OpenMP parallelization

Files needed:

#!/bin/bash
#SBATCH --job-name=USERjob
#SBATCH --nodes 1
#SBATCH --exclusive
#SBATCH --time=00:20:00
#SBATCH --partition=cpu
#SBATCH --reservation=maxcpu
#SBATCH --ntasks-per-socket=2
#SBATCH --ntasks-per-node=...
#SBATCH --hint=nomultithread
#SBATCH --cpus-per-task=...
#SBATCH --output=sysout.out
#SBATCH --error=syserr.err
# # SBATCH --mail-user=YOUR_EMAIL - if you want

module purge
module load QuantumESPRESSO/7.1-foss-2022a

export EXDIR=${PWD}/..
export INDIR=${EXDIR}/inputs
export ESPRESSO_PSEUDO=${EXDIR}/../pseudo

mpiopt="-mca pml ucx -mca btl ^uct,tcp,openib,vader --map-by socket:PE=4 --rank-by core --report-bindings "

for nthr in ....
do
  export OMP_NUM_THREADS="$nthr"
  mpirun $mpiopt -np ... pw.x -npool ... -i ${INDIR}/pw.CuO.scf.in > pw_run16x${nthr}.out
done

Find out how best to exploit the available CPU resources by playing with the MPI-related parameters (number of tasks, npool) together with the number of threads; a possible scan is sketched after the hints below.
Use the batch file ex3-omp.slurm to submit your jobs (modify it as you see fit).
Hints:

  1. Know the size of your node, i.e. the number of cores at your disposal;

  2. See how the run time of your jobs scales when varying just the number of MPI tasks (keep 1 thread per task at first).
    Adapt the npool parameter at each run.

  3. Now you can start to explore the OpenMP parallelization by varying the number of threads (avoid hyperthreading).

  4. Plot the WALL time as a function of the number of MPI tasks and OpenMP threads. Which is the best configuration for this exercise?
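
As a starting point for the scan, one possibility is to keep the product of MPI tasks and OpenMP threads equal to the number of physical cores of the node, along these lines (a sketch: the core count is an assumption to verify against the node specification, and the Slurm directives, the --map-by socket:PE=... binding and the -npool value must be adapted consistently):

    ncores=128                        # assumed physical cores per node - check your system
    for ntask in 16 32 64 128
    do
      export OMP_NUM_THREADS=$(( ncores / ntask ))
      mpirun $mpiopt -np ${ntask} pw.x -npool 4 -i ${INDIR}/pw.CuO.scf.in > pw_run${ntask}x${OMP_NUM_THREADS}.out
    done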

Solution

EXERCISE 4 - Hands on the GPUs

Files needed:

 1  #!/bin/bash
 2  #SBATCH --job-name=USERjob
 3  #SBATCH --nodes 1
 4  #SBATCH --exclusive
 5  #SBATCH --time=00:10:00
 6  #SBATCH --partition=cpu
 7  #SBATCH --reservation=maxcpu
 8  #SBATCH --ntasks-per-node=16
 9  #SBATCH --cpus-per-task=8
10  #SBATCH --output=sysout.out
11  #SBATCH --error=syserr.err
12  # # SBATCH --mail-user=YOUR_EMAIL - if you want
13
14  # module purge
15  # module use ${WORK}/modules
16  # echo $WORK
17  # module load QuantumESPRESSO/DEV-NVHPC-21.2-FIXMAG
18  module purge
19  module load QuantumESPRESSO/7.1-foss-2022a
20
21  export EXDIR=${PWD}/..
22  export INDIR=${EXDIR}/inputs
23  export ESPRESSO_PSEUDO=${EXDIR}/../pseudo
24
25  export OMP_NUM_THREADS=4
26
27  mpiopt="-mca pml ucx -mca btl ^uct,tcp,openib,vader --map-by socket:PE=4 --rank-by socket --report-bindings "
28
29  mpirun $mpiopt -np 16 pw.x -i ${INDIR}/pw.CnSnI3.in > pw.CnSnI3.cpu.out

Test the power of the GPUs (roughly).

  1. Know the size of your node. Look at: https://doc.vega.izum.si/general-spec/
    How many cores? How many GPUs?

  2. Use the batch file ex4-gpu.slurm to submit a few MPI+GPU runs (the dots to fill are at lines 7, 9, 10, 25, 27, 29); a possible starting point is sketched after this list.
    You can also check what happens with more MPI tasks than GPUs.

  3. Enable OpenMP threading. Do you see any improvement?

  4. Consider your best CPU run. How many GPUs were necessary to match the performance?
    If you don’t have your optimized CPU batch file from exercise 3, you can use the one in the reference folder.
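
For the GPU runs, the overall shape of the batch file could look like the sketch below; the partition name, the GPU request, the task count and the GPU-enabled module (the one commented out in the file above) are assumptions to check against the Vega documentation:

    #SBATCH --partition=gpu          # GPU partition (name to be verified on Vega)
    #SBATCH --gres=gpu:4             # request the node's GPUs (gres name to be verified)
    #SBATCH --ntasks-per-node=4      # one MPI task per GPU is the usual starting point

    module purge
    module use ${WORK}/modules
    module load QuantumESPRESSO/DEV-NVHPC-21.2-FIXMAG   # GPU-enabled build

    export OMP_NUM_THREADS=1         # enable threading later, as in step 3
    mpirun -np 4 pw.x -i ${INDIR}/pw.CnSnI3.in > pw.CnSnI3.gpu.out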

Solution