Scaling study: self-consistent field calculations
Objectives
Learn how to run a self-consistent field (SCF) calculation.
Keypoints
Run an SCF calculation.
Perform scaling tests of the SCF calculation.
In this exercise we will run self-consistent field (SCF) calculations of zinc porphyrin and oligomers of guanine.
An SCF calculation consists essentially of three parts:
1. Initial guess for the density matrix \(\mathbf{D}\).
2. Formation of the Fock matrix:
\[F_{\mu\nu} = h_{\mu\nu} + \sum_{\kappa\lambda} \left[ 2(\mu\nu|\kappa\lambda) - (\mu\kappa|\nu\lambda) \right]D_{\lambda\kappa}\]
by assembling one-electron integrals:
\[h_{\mu\nu} = \left\langle \chi_{\mu} \left| -\frac{1}{2}\nabla^{2} -\sum_{A=1}^{N_\mathrm{at}} \frac{Z_{A}}{|\mathbf{R}_A - \mathbf{r}|} \right| \chi_{\nu} \right\rangle\]
and two-electron integrals:
\[(\mu\nu|\kappa\lambda) = \left\langle \chi_{\mu}(\mathbf{r})\chi_{\nu}(\mathbf{r}) \left| \frac{1}{|\mathbf{r} - \mathbf{r}^{\prime}|} \right| \chi_{\kappa}(\mathbf{r}^{\prime})\chi_{\lambda}(\mathbf{r}^{\prime}) \right\rangle\]
3. Diagonalization of the Fock matrix and formation of the new density matrix.
While the last step formally scales with the third power of the dimension of the Fock matrix, it is usually the formation of the matrix itself in step 2 that is the most time-consuming. The formation of the Fock matrix and in particular the computation of the electron-repulsion integrals (ERIs) are natural targets for code optimization and parallelization.
Figure: A schematic view of the SCF algorithm. The formation of the Fock matrix is the most time-consuming step in the algorithm.
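To make the three steps above concrete, here is a minimal NumPy sketch of a closed-shell SCF cycle. It is only an illustration under simplifying assumptions: the basis is taken as orthonormal (overlap matrix equal to the identity) and the one- and two-electron "integrals" are random symmetrized arrays, so the printed numbers are meaningless. It does show where the cost goes, however: the Fock build contracts a four-index tensor with the density matrix, while the diagonalization acts only on a two-index matrix.

import numpy as np

# Toy dimensions: number of basis functions and doubly occupied orbitals.
rng = np.random.default_rng(1)
nbf, nocc = 10, 3

# Stand-ins for the one-electron matrix h and the ERIs (mu nu|kappa lambda),
# symmetrized so that the Fock matrix comes out symmetric.
h = rng.normal(size=(nbf, nbf))
h = 0.5 * (h + h.T)
eri = rng.normal(size=(nbf,) * 4)
eri = 0.5 * (eri + eri.transpose(1, 0, 2, 3))
eri = 0.5 * (eri + eri.transpose(0, 1, 3, 2))
eri = 0.5 * (eri + eri.transpose(2, 3, 0, 1))

# Step 1: initial guess for the density matrix (here from the core Hamiltonian).
_, C = np.linalg.eigh(h)
D = C[:, :nocc] @ C[:, :nocc].T

for iteration in range(5):
    # Step 2: Fock build, F = h + sum_{kl} [2(mn|kl) - (mk|nl)] D_lk  -- the O(N^4) step.
    J = np.einsum("mnkl,lk->mn", eri, D)
    K = np.einsum("mknl,lk->mn", eri, D)
    F = h + 2.0 * J - K

    # Step 3: diagonalize F and form the new density matrix -- formally O(N^3).
    _, C = np.linalg.eigh(F)
    D = C[:, :nocc] @ C[:, :nocc].T

    # Closed-shell electronic energy expression tr[D (h + F)]; meaningless for random integrals.
    energy = np.einsum("mn,nm->", h + F, D)
    print(f"iteration {iteration + 1}: electronic 'energy' {energy:.6f}")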
VeloxChem adopts a hybrid MPI+OpenMP parallelization strategy to achieve efficient Fock builds:
We spread the workload across multiple nodes using the message passing interface (MPI). In this context, a node might be a physical machine (blade), a socket, or a NUMA domain.
On each node, we spawn multiple threads using OpenMP. All threads share the memory space on the node.
In contrast, the first and third steps are carried out using OpenMP threading on a single MPI process. *
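To see the hybrid layout in action, here is a tiny standalone sketch (not part of VeloxChem; it assumes mpi4py is installed) that you can launch under srun inside a job allocation, e.g. srun python check_layout.py. Every MPI rank prints its index together with the OpenMP thread count it would use.

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()     # index of this MPI process (its "rank")
nranks = comm.Get_size()   # total number of MPI processes

# OpenMP reads the thread count from the environment; each rank gets its own team of threads.
nthreads = os.environ.get("OMP_NUM_THREADS", "unset")

print(f"MPI rank {rank} of {nranks}: OMP_NUM_THREADS = {nthreads}")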
In this exercise, we want to explore three aspects of the SCF code in VeloxChem:
Strong scaling: how does the time-to-solution change for a fixed-size problem using an increasing number of workers? We will use a zinc-porphyrin system for this investigation.
Scaling of SCF: how does the computational cost change when the number of basis functions increases? We will use a series of guanine oligomers for this purpose.
How do MPI and OpenMP settings affect the performance of an SCF calculation on a single node?
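As a reminder of the quantities involved in the strong-scaling study: the speedup on N nodes is S(N) = T(1)/T(N), and the parallel efficiency is E(N) = S(N)/N, ideally close to 1. A minimal sketch with placeholder timings, not real measurements:

import numpy as np

nodes = np.array([1, 2, 3, 4])
t_wall = np.array([111.6, 58.3, 41.2, 32.0])   # placeholder wall times in seconds

speedup = t_wall[0] / t_wall     # S(N) = T(1) / T(N)
efficiency = speedup / nodes     # E(N) = S(N) / N

for n, s, e in zip(nodes, speedup, efficiency):
    print(f"{n} node(s): speedup {s:.2f}x, efficiency {e:.2f}")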
Systems and input files
Below are the input files for the SCF calculations of zinc porphyrin and the guanine oligomers. You can read more about the VeloxChem input keywords on this page. The timing option in the @scf section will print out a detailed breakdown of the time spent within each task of every SCF iteration:
@jobs
task: scf
@end
@method settings
basis: def2-sv(p)
@end
@scf
timing: yes
@end
@molecule
charge: 0
multiplicity: 1
xyz:
- Zinc porphyrin
You can download the full input file with:
wget https://raw.githubusercontent.com/ENCCS/veloxchem-hpc/main/content/inputs/zn-ph.inp
- Guanine oligomers
You can download the full input files with:
wget https://raw.githubusercontent.com/ENCCS/veloxchem-hpc/main/content/inputs/g{1,2,3,4}.inp
Using the SLURM scheduler
Dardel, like virtually all modern HPC clusters, uses SLURM as its job scheduler.
A sample submission script:
 1  #!/bin/bash
 2
 3  #SBATCH -J myjob
 4  #SBATCH -t 01:00:00
 5  #SBATCH -p main
 6  #SBATCH -A edu22.veloxchem
 7  #SBATCH --reservation velox-lab1
 8
 9  #SBATCH --nodes=1
10  #SBATCH --ntasks-per-node=8
11
12  module load PDC/21.11
13  module load veloxchem/1.0dev
14
15  export OMP_NUM_THREADS=16
16  export OMP_PLACES=cores
17
18  job=porphyrin
19  srun vlx ${job}.inp ${job}.out
Line 4 sets the limit on the total runtime of the job allocation.
Line 5 specifies which partition to use. Check here for the partitions available on Dardel.
Lines 9 and 10 request computational resources:
each Dardel node is dual-socket and consists of two AMD EPYC 7742 CPUs (2 × 64 = 128 cores).
on each node, we spawn 8 MPI ranks, 4 per socket, since numactl --hardware shows that each socket is in NPS-4 configuration (4 NUMA domains per socket). Thus 8 MPI ranks per node, one per NUMA domain, will be used in the calculation.
Line 15 sets 16 OpenMP threads for each MPI rank. This ensures full usage of all cores in the NUMA domain assigned to each rank (8 ranks × 16 threads = 128 cores).
Finally, line 16 specifies that threads should be pinned to cores.
This sample script uses a single full node on Dardel.
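The 8 ranks and 16 threads per rank quoted above follow directly from the node topology. A short arithmetic check, using the hardware figures of a Dardel node as described above:

# Dardel compute node: 2 sockets x AMD EPYC 7742 (64 cores each), NPS-4 per socket.
sockets_per_node = 2
cores_per_socket = 64
numa_domains_per_socket = 4

mpi_ranks_per_node = sockets_per_node * numa_domains_per_socket      # 8, one rank per NUMA domain
omp_threads_per_rank = cores_per_socket // numa_domains_per_socket   # 16, one thread per core

print(f"MPI ranks per node:      {mpi_ranks_per_node}")
print(f"OpenMP threads per rank: {omp_threads_per_rank}")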
Exercise
We are going to plot speedup (or scaled speedup) with respect to the number of nodes used. VeloxChem prints out the total execution time at the end of each output file:
!========================================================================================================================!
! VeloxChem execution completed at Wed Mar 2 10:51:14 2022. !
!========================================================================================================================!
! Total execution time is 111.56 sec. !
!========================================================================================================================!
With the timing option set to yes in the input file, you will also get a detailed breakdown of timings in each SCF iteration:
Timing (in sec)
---------------
FockBuild ERI CompEnergy FockDiag
Iteration 1 3.12 3.12 0.04 0.44
Iteration 2 3.50 3.50 0.04 0.48
Iteration 3 3.50 3.50 0.04 0.52
Iteration 4 3.88 3.88 0.04 0.55
Iteration 5 3.88 3.88 0.04 0.58
Iteration 6 3.87 3.87 0.04 0.63
Iteration 7 3.88 3.88 0.04 0.68
Iteration 8 3.87 3.87 0.04 0.74
Iteration 9 3.88 3.87 0.04 0.79
Iteration 10 3.88 3.88 0.04 0.79
Iteration 11 3.88 3.88 0.04 0.81
Iteration 12 3.87 3.87 0.04 0.81
Iteration 13 3.87 3.87 0.04 0.81
Iteration 14 3.87 3.87 0.04 0.81
Iteration 15 3.87 3.87 0.04 0.80
Iteration 16 3.88 3.88 0.04 0.81
Iteration 17 3.87 3.87 0.04 0.81
Iteration 18 3.87 3.87 0.04 0.84
Iteration 19 3.88 3.88 0.04 0.82
Iteration 20 3.87 3.87 0.04
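Rather than copying these numbers by hand, you can extract them from the output files with a short script. The sketch below assumes the output format shown above; the file name in the commented usage example is hypothetical.

import re
from pathlib import Path

def parse_output(filename):
    """Return the total execution time (s) and the per-iteration FockBuild timings (s)."""
    text = Path(filename).read_text()

    total = None
    match = re.search(r"Total execution time is\s+([\d.]+)\s+sec", text)
    if match:
        total = float(match.group(1))

    # Timing table rows look like "Iteration  1   3.12  3.12  0.04  0.44";
    # the first number after the iteration index is the FockBuild time.
    fock_build = [float(m.group(1)) for m in re.finditer(r"Iteration\s+\d+\s+([\d.]+)", text)]

    return total, fock_build

# Hypothetical usage:
# total, fock = parse_output("zn-ph-1node.out")
# print(total, sum(fock))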
Run the zinc porphyrin example on 1, 2, 3 and 4 nodes. Use 8 MPI ranks per node and 16 OpenMP threads per rank.
Gather total execution times from the output files and plot the speedup against the number of nodes.
Gather the breakdown of timings per SCF iteration from the output files. Plot the speedup of FockBuild with respect to the number of nodes.
Run the guanine monomer (g1.inp), dimer (g2.inp), trimer (g3.inp), and tetramer (g4.inp) on 1, 2, 3, and 4 nodes, respectively. Use 8 MPI ranks per node and 16 OpenMP threads per rank.
Gather total execution times from the output files and plot the computational cost (taking into account the number of nodes) against the number of basis functions. Use a logarithmic scale on both axes.
Gather the breakdown of timings per SCF iteration from the output files. Plot the computational cost of FockBuild with respect to the number of basis functions. A sketch of how to estimate the effective scaling exponent from such data is given after this exercise list.
Run the zinc porphyrin example. Compare the total execution time and the timings for FockBuild obtained using:
1 node, 8 MPI ranks, 16 OpenMP threads
1 node, 2 MPI ranks, 64 OpenMP threads, i.e. with the following settings in the submission script:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=128   # 64x2 because of SMT

export OMP_NUM_THREADS=64
export OMP_PLACES=cores
Both calculation setups use one full node on Dardel, but they are not equally efficient. What might explain the difference?
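For the guanine series, a useful way to summarize the results is the effective scaling exponent x in cost ≈ A · N_basis^x, obtained as the slope of a linear fit in log-log space. A sketch with placeholder numbers; replace them with the basis-set sizes and node-weighted timings from your own outputs:

import numpy as np
from scipy import stats

n_basis = np.array([200, 400, 600, 800])    # placeholder basis-set sizes for g1..g4
cost = np.array([1.0, 7.5, 23.0, 52.0])     # placeholder cost, e.g. wall time x number of nodes

# Fit log(cost) = x * log(n_basis) + log(A); the slope x is the effective scaling exponent.
slope, intercept, r_value, p_value, std_err = stats.linregress(np.log(n_basis), np.log(cost))
print(f"effective scaling exponent: {slope:.2f}")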
Plotting your results
You can launch a MyBinder instance to analyze your results in a Jupyter notebook and adapt the sample script below.
import numpy as np
import plotly.graph_objects as go
from scipy import stats

# insert your data!
nnodes = np.array([1, 2, 3, 4])
# insert your data!
timings = np.array([1, 0.6, 0.4, 0.35])

speedup = timings[0] / timings

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        name="SCF",
        x=nnodes,
        y=speedup,
        mode="markers",
        hovertemplate="~%{y:.2f}x<extra></extra>",
    )
)

# generate linear fit
slope, intercept, r_value, p_value, std_err = stats.linregress(nnodes, speedup)
line = slope * nnodes + intercept

fig.add_trace(
    go.Scatter(
        x=nnodes,
        y=line,
        mode="lines",
        name=f"Fit: y = {slope:.3f}x + {intercept:.3f}",
    )
)

fig.update_layout(
    title="SCF Speedup",
    xaxis_title="Number of nodes",
    yaxis_title="Speedup",
    height=500,
    width=600,
)
fig.show()
* MPI processes are usually also called ranks.