                   :-)  GROMACS - gmx mdrun, 2021.3  (-:

                            GROMACS is written by:
 Andrey Alekseenko     Emile Apol          Rossen Apostolov     Paul Bauer
 Herman J.C. Berendsen Par Bjelkmar        Christian Blau       Viacheslav Bolnykh
 Kevin Boyd            Aldert van Buuren   Rudi van Drunen      Anton Feenstra
 Gilles Gouaillardet   Alan Gray           Gerrit Groenhof      Anca Hamuraru
 Vincent Hindriksen    M. Eric Irrgang     Aleksei Iupinov      Christoph Junghans
 Joe Jordan            Dimitrios Karkoulis Peter Kasson         Jiri Kraus
 Carsten Kutzner       Per Larsson         Justin A. Lemkul     Viveca Lindahl
 Magnus Lundborg       Erik Marklund       Pascal Merz          Pieter Meulenhoff
 Teemu Murtola         Szilard Pall        Sander Pronk         Roland Schulz
 Michael Shirts        Alexey Shvetsov     Alfons Sijbers       Peter Tieleman
 Jon Vincent           Teemu Virolainen    Christian Wennberg   Maarten Wolf
 Artem Zhmurov
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2021.3
Executable:   /veracruz/projects/t/training/gromacs-2021.3/bin/gmx
Data prefix:  /veracruz/projects/t/training/gromacs-2021.3
Working dir:  /veracruz/home/m/mabraham/git/gromacs-gpu-performance-master/content/exercises/using-pme
Process ID:   174568
Command line:
  gmx mdrun -ntmpi 1 -noconfout -resetstep 10000 -g manual-nb -nb gpu -pme cpu

GROMACS version:    2021.3
Verified release checksum is c5bf577cc74de0e05106b7b6426476abb7f6530be7b4a2c64f637d6a6eca8fcb
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX_512
FFT library:        fftw-3.3.9-sse2-avx
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /apps/software/GCCcore/9.3.0/bin/cc GNU 9.3.0
C compiler flags:   -mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler:       /apps/software/GCCcore/9.3.0/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA compiler:      /apps/software/CUDAcore/11.0.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Thu_Jun_11_22:26:38_PDT_2020;Cuda compilation tools, release 11.0, V11.0.194;Build cuda_11.0_bu.TC445_37.28540450_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-Wno-deprecated-gpu-targets;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_53,code=compute_53;-gencode;arch=compute_80,code=compute_80;-use_fast_math;;-mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA driver:        11.10
CUDA runtime:       11.0


Running on 1 node with total 80 cores, 80 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
    Family: 6   Model: 85   Stepping: 4
    Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl avx512secondFMA clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    Number of AVX-512 FMA units: 2
  Hardware topology: Only logical processor count
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla V100-PCIE-16GB, compute cap.: 7.0, ECC: yes, stat: compatible


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C. Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------

++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.5053201
-------- -------- --- Thank You --- -------- --------

The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 20

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 20000
   init-step                      = 0
   simulation-part                = 1
   mts                            = false
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 1993
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 0
   nstvout                        = 0
   nstfout                        = 0
   nstlog                         = 0
   nstcalcenergy                  = 10
   nstenergy                      = 0
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 10
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1
   epsilon-r                      = 1
   epsilon-rf                     = 1
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 1
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 96
   fourier-ny                     = 96
   fourier-nz                     = 128
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = V-rescale
   nsttcouple                     = 10
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = No
   pcoupltype                     = Semiisotropic
   nstpcouple                     = -1
   tau-p                          = 5
   compressibility (3x3):
      compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   ref-p (3x3):
      ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   refcoord-scaling               = COM
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   qm-opts:
      ngQM                        = 0
   constraint-algorithm           = Lincs
   continuation                   = false
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 45
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
   grpopts:
     nrdf:        67964.4     49247.5      196258
     ref-t:           310         310         310
     tau-t:           0.1         0.1         0.1
     annealing:        No          No          No
     annealing-npoints:  0           0           0
     acc:               0           0           0
     nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 10 to 100, rlist from 1 to 1.132

1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 1 MPI thread

Non-default thread affinity set, disabling internal thread affinity
Using 20 OpenMP threads

The -resetstep functionality is deprecated, and may be removed in a future version.

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1066 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1066 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1066 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Long Range LJ corr.: 6.7202e-04

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
  outer list: updated every 100 steps, buffer 0.132 nm, rlist 1.132 nm
  inner list: updated every  14 steps, buffer 0.001 nm, rlist 1.001 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 100 steps, buffer 0.283 nm, rlist 1.283 nm
  inner list: updated every  14 steps, buffer 0.057 nm, rlist 1.057 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Removing pbc first time

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------

The number of constraints is 13620

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

The -noconfout functionality is deprecated, and may be removed in a future version.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------

There are: 141677 Atoms

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  System

RMS relative constraint deviation after constraining: 3.88e-06
Initial temperature: 279.104 K

Started mdrun on rank 0 Thu Sep 9 10:25:46 2021

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
    8.51519e+02    8.91827e+04    8.15628e+04    1.56545e+04    1.81361e+03
  Improper Dih.          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.
    3.30515e+03    4.75419e+04    3.40936e+05    7.67685e+04   -3.46877e+04
   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.   Total Energy
   -2.51729e+06    2.10380e+04   -1.87332e+06    3.63069e+05   -1.51025e+06
  Conserved En.    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.51025e+06    2.78605e+02   -3.54583e+02    2.48652e+03    3.78959e-06

step  900: timed with pme grid 96 96 128, coulomb cutoff 1.000: 1554.6 M-cycles
step 1100: timed with pme grid 84 84 112, coulomb cutoff 1.116: 1305.7 M-cycles
step 1300: timed with pme grid 72 72 96, coulomb cutoff 1.302: 1075.1 M-cycles
step 1500: timed with pme grid 64 64 84, coulomb cutoff 1.472: 986.6 M-cycles
step 1500: the maximum allowed grid scaling limits the PME load balancing to a coulomb cut-off of 1.563
step 1700: timed with pme grid 60 60 80, coulomb cutoff 1.563: 942.1 M-cycles
step 1900: timed with pme grid 64 64 80, coulomb cutoff 1.545: 956.2 M-cycles
step 2100: timed with pme grid 64 64 84, coulomb cutoff 1.472: 984.6 M-cycles
step 2300: timed with pme grid 64 64 96, coulomb cutoff 1.465: 1019.2 M-cycles
step 2500: timed with pme grid 80 80 96, coulomb cutoff 1.288: 1142.7 M-cycles
step 2700: timed with pme grid 80 80 100, coulomb cutoff 1.236: 1164.7 M-cycles
step 2900: timed with pme grid 80 80 104, coulomb cutoff 1.189: 1192.7 M-cycles
step 3100: timed with pme grid 80 80 108, coulomb cutoff 1.172: 1206.2 M-cycles
step 3300: timed with pme grid 84 84 108, coulomb cutoff 1.145: 1264.9 M-cycles
              optimal pme grid 60 60 80, coulomb cutoff 1.563

step 10000: resetting all time and cycle counters

Restarted time on rank 0 Thu Sep 9 10:26:43 2021

           Step           Time
          20000       40.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
    4.39138e+04    8.77100e+04    8.18219e+04    1.53269e+04    1.91457e+03
  Improper Dih.          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.
    3.28573e+03    4.34871e+04    3.30255e+05    7.62739e+04   -3.46877e+04
   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.   Total Energy
   -2.49297e+06    4.32923e+03   -1.83934e+06    4.02728e+05   -1.43661e+06
  Conserved En.    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.50946e+06    3.09038e+02   -3.54583e+02    9.00010e+01    3.91816e-06

Energy conservation over simulation part #1 of length 40 ns, time 0 to 40 ns
  Conserved energy drift: 1.40e-04 kJ/mol/ps per atom


        <======  ###############  ==>
        <====  A V E R A G E S  ====>
        <==  ###############  ======>

        Statistics over 20001 steps using 2001 frames

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
    4.09673e+04    8.82822e+04    8.19917e+04    1.66794e+04    1.88034e+03
  Improper Dih.          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.
    3.32440e+03    4.36054e+04    3.30595e+05    7.66853e+04   -3.46877e+04
   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.   Total Energy
   -2.49512e+06    5.60827e+03   -1.84019e+06    4.03868e+05   -1.43632e+06
  Conserved En.    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.50984e+06    3.09912e+02   -3.54583e+02    1.21126e+02    0.00000e+00

   Total Virial (kJ/mol)
    1.28727e+05    3.38679e+02   -2.48871e+01
    3.38283e+02    1.28160e+05   -4.91087e+00
   -2.40573e+01   -3.71763e+00    1.29182e+05

   Pressure (bar)
    1.27361e+02   -7.43006e+00    8.13203e-01
   -7.42198e+00    1.38735e+02    5.49980e-01
    7.96263e-01    5.25619e-01    9.72825e+01

      T-Protein         T-DOPC  T-Water_and_ions
    3.09767e+02    3.09128e+02    3.10160e+02


       P P   -   P M E   L O A D   B A L A N C I N G

 NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
       you might not have reached a good load balance.
 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.000 nm  1.001 nm      96  96 128   0.117 nm  0.320 nm
   final    1.563 nm  1.564 nm      60  60  80   0.188 nm  0.500 nm
 cost-ratio           3.81             0.24
 (note that these numbers concern only part of the total PP and PME load)


        M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check            2192.954896       19736.594     0.0
 NxN QSTab Elec. + LJ [F]            2692271.571840   110383134.445    84.9
 NxN QSTab Elec. + LJ [V&F]           299440.249472    17666974.719    13.6
 Calc Weights                           4250.735031      153026.461     0.1
 Spread Q Bspline                      90682.347328      181364.695     0.1
 Gather F Bspline                      90682.347328      544094.084     0.4
 3D-FFT                               104472.126168      835777.009     0.6
 Solve PME                                36.003600        2304.230     0.0
 Shift-X                                  14.309377          85.856     0.0
 Virial                                  141.863722        2553.547     0.0
 Stop-CM                                  14.309377         143.094     0.0
 Calc-Ekin                               283.495677        7654.383     0.0
 Lincs                                   136.213620        8172.817     0.0
 Lincs-Mat                               715.871580        2863.486     0.0
 Constraint-V                           1251.905178       11267.147     0.0
 Constraint-Vir                          111.669558        2680.069     0.0
 Settle                                  326.492646      120802.279     0.1
-----------------------------------------------------------------------------
 Total                                               129942634.917   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 20 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   20        101       1.169         55.957   3.0
 Launch GPU ops.        1   20      10001       0.797         38.162   2.0
 Force                  1   20      10001       0.408         19.514   1.0
 PME mesh               1   20      10001      29.177       1397.202  73.8
 Wait Bonded GPU        1   20       1001       0.035          1.656   0.1
 Wait GPU NB local      1   20      10001       0.236         11.321   0.6
 NB X/F buffer ops.     1   20      19901       3.452        165.284   8.7
 Update                 1   20      10001       1.808         86.571   4.6
 Constraints            1   20      10001       1.773         84.886   4.5
 Rest                                           0.661         31.658   1.7
-----------------------------------------------------------------------------
 Total                                         39.514       1892.211 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   20      10001      14.416        690.352  36.5
 PME gather             1   20      10001       9.493        454.568  24.0
 PME 3D-FFT             1   20      20002       4.644        222.376  11.8
 PME solve Elec         1   20      10001       0.546         26.143   1.4
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      790.275       39.514     2000.0
                 (ns/day)    (hour/ns)
Performance:       43.735        0.549
Finished mdrun on rank 0 Thu Sep 9 10:27:23 2021