Profiling and debugging


Download slides

Omnitrace and Omniperf

Download slides




Reservation: enccs_3


  • Get the exercise:

cp -r /global/training/enccs/exercises/HIP-Examples/mini-nbody/hip/ .

  • Compile and run the code

cd hip
hipcc -I../ -DSHMOO nbody-orig.cpp -o nbody-orig
./nbody-orig 65536
65536, 161.871
  • Check the file



./$EXE 65536

  • The binary is called nbody-orig

  • Use rocprof with --stats

rocprof --stats nbody-orig 65536

RPL: on '221130_200946' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_200946_3670592'
RPL: result dir '/tmp/rpl_data_221130_200946_3670592/input_results_221130_200946'
ROCProfiler: input from "/tmp/rpl_data_221130_200946_3670592/input.xml"
  0 metrics
65536, 159.960

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_200946_3670592/input_results_221130_200946
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating

Files with the prefix results are created

  • Check the file results.csv

You can see information for each kernel call, including its duration

 cat results.csv
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649
1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847
2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728
3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585
4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946
5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505
6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872
7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752
8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793
9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267
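In the rows above, the last field appears to be the kernel duration in nanoseconds, and it equals the difference between the third-to-last and fourth-to-last fields. The header line is not shown here, so reading those two fields as end and begin timestamps is an assumption. A minimal sketch that checks this relation on the rows above and recomputes the call count, total and average that rocprof writes to results.stats.csv:

```shell
# Sample rows copied from the results.csv listing above. A temporary file is
# used so that the sketch runs without rocprof.
csv=$(mktemp)
cat > "$csv" <<'EOF'
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649
1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847
2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728
3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585
4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946
5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505
6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872
7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752
8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793
9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267
EOF

# The kernel name contains commas, so fields are counted from the END of the line.
# Assumed meaning: $(NF-3) = begin timestamp, $(NF-2) = end timestamp, $NF = duration (ns).
summary=$(awk -F, '
    {
        if ($(NF-2) - $(NF-3) != $NF) mismatches++   # duration should equal end - begin
        total += $NF
        calls++
    }
    END { printf "%d,%d,%d,%d\n", calls, total, total / calls, mismatches }
' "$csv")

# Format: calls,total_ns,average_ns,rows_where_duration_differs
echo "$summary"   # prints: 10,261124944,26112494,0
rm -f "$csv"
```

The first three values match the line in results.stats.csv. On the cluster the same awk program can be pointed at results.csv directly; if the file starts with a header line, skip it (for example with an `NR > 1` condition).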

  • Check the statistics result file, one line per kernel

cat results.stats.csv
"bodyForce(Body*, float, int) [clone .kd]",10,261124944,26112494,100.0
  • Profile the HIP calls with --hip-trace

rocprof --stats --hip-trace nbody-orig 65536
RPL: on '221130_201416' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_201416_3670892'
RPL: result dir '/tmp/rpl_data_221130_201416_3670892/input_results_221130_201416'
ROCTracer (pid=3670915):
65536, 161.051
hsa_copy_deps: 0
scan ops data 29:30
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating
dump json 19:20
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating
dump json 51:52
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating
dump json 9:10
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating

Now we have new files with hip in their names, as listed above. Check the file results.hip_stats.csv

 cat results.hip_stats.csv
  • Also profile the HSA API with --hsa-trace

rocprof --stats --hip-trace --hsa-trace nbody-orig 65536
RPL: on '221130_201737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_221130_201737_3671219'
RPL: result dir '/tmp/rpl_data_221130_201737_3671219/input_results_221130_201737'
ROCProfiler: input from "/tmp/rpl_data_221130_201737_3671219/input.xml"
  0 metrics
ROCTracer (pid=3671242):
65536, 155.978

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_201737_3671219/input_results_221130_201737
hsa_copy_deps: 1
scan hsa API data 5953:5954
hsa_copy_deps: 0
scan hip API data 51:52
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating
dump json 9:10
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hsa_stats.csv' is generating
dump json 5963:5964
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating
dump json 19:20
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating
dump json 51:52
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating
  • See the content of the file results.hsa_stats.csv

cat results.hsa_stats.csv

  • Download the results.json file to your laptop

From your laptop: scp -i id_ed25519 -P 8560 enccs_tr_0@ .

  • Visit the web page:

  • In the menu at the top left, click “Open Trace File”

  • Select the file results.json

Zoom in/out: W/S. Move left/right: A/D.

Read about the counters: vim /global/software/rocm/rocm-5.3.0/lib/rocprofiler/gfx_metrics.xml

  • Create a file named rocprof_counters.txt with the following contents:

cat rocprof_counters.txt
pmc : Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize
pmc : SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize
pmc : L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict
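One way to create this file from the shell is a here-document. The sketch below writes the three pmc lines shown above and counts the counters requested on each line. Judging by the rocprof log in this section, each pmc line is collected in its own pass over the application (7, 7 and 6 metrics); reading the log that way is an interpretation, not something the tool output states explicitly.

```shell
# Write the three pmc lines from the listing above into rocprof_counters.txt.
cat > rocprof_counters.txt <<'EOF'
pmc : Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize
pmc : SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize
pmc : L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict
EOF

# Count the counters on each pmc line: every field except "pmc" and ":".
counts=$(awk '$1 == "pmc" { printf "%s%d", sep, NF - 2; sep = " " }' rocprof_counters.txt)
echo "$counts"   # prints: 7 7 6
```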
  • Run rocprof with the counters file

 rocprof --timestamp on -i rocprof_counters.txt  nbody-orig 65536
RPL: on '221130_205737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip'
RPL: profiling '"nbody-orig" "65536"'
RPL: input file 'rocprof_counters.txt'
RPL: output dir '/tmp/rpl_data_221130_205737_3673574'
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input0.xml"
  gpu_index =
  kernel =
  range =
  7 metrics
    Wavefronts, VALUInsts, VFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, WriteSize
65536, 155.389

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input1.xml"
  gpu_index =
  kernel =
  range =
  7 metrics
    SALUInsts, SFetchInsts, LDSInsts, FlatLDSInsts, GDSInsts, SALUBusy, FetchSize
65536, 156.996

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737
RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737'
ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input2.xml"
  gpu_index =
  kernel =
  range =
  6 metrics
    L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, ALUStalledByLDS, LDSBankConflict
65536, 155.264

ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737
File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/rocprof_counters.csv' is generating
  • Contents of the rocprof_counters.csv file

cat rocprof_counters.csv
0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792
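The row is hard to read without its header, which is not shown here. The sketch below labels the counter values under one assumption: the 20 counter columns sit immediately before the 4 trailing timestamps and follow the order of the names in rocprof_counters.txt. Check that against the header of your own rocprof_counters.csv before relying on it.

```shell
# Counter names in the order they appear in rocprof_counters.txt.
names="Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict"

# Row copied from the rocprof_counters.csv listing above.
row='0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792'

# The kernel name contains commas, so the counter columns are located by
# counting back from the end of the line: 4 timestamps, then n counters.
pairs=$(printf '%s\n' "$row" | awk -F, -v names="$names" '
    {
        n = split(names, name, " ")
        first = NF - 4 - n + 1          # index of the first counter column (assumed layout)
        for (i = 1; i <= n; i++)
            printf "%s=%s\n", name[i], $(first + i - 1)
    }')
echo "$pairs"
```

Under that assumption the output starts with `Wavefronts=2048.0000000000` and includes `L2CacheHit=96.5684331443`.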


We have made special builds of the Omnitools (omnitrace and omniperf) for use in the exercises.

  • Load Omnitrace

module use --append /global/training/enccs/modules/
module load omnitrace/1.7.3
  • Reserve a GPU

  • Check the available options and their values; the second command also prints a description of each option

srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description

  • Create an Omnitrace configuration file with a description for each option

srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all

  • Tell Omnitrace to use this configuration file: export OMNITRACE_CONFIG_FILE=/path/omnitrace_all.cfg

  • Get the file: cp /global/software/rocm/rocm-5.3.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp .

  • Compile: hipcc -o MatrixTranspose MatrixTranspose.cpp

  • Execute the binary: time srun -n 1 --gpus 1 ./MatrixTranspose and check the duration

Dynamic instrumentation

  • Execute dynamic instrumentation: time srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose and check the duration

  • Check which functions the binary calls and which of them get instrumented: nm --demangle MatrixTranspose | egrep -i ' (t|u) '

  • Available functions to instrument: srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose

    • The --simulate option means that the binary is not executed

Binary rewriting

  • Binary rewriting: srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose

    • This creates a new instrumented binary called matrix.inst

  • Execute the new instrumented binary: time srun -n 1 --gpus 1 ./matrix.inst and check the duration

  • See the list of the instrumented GPU calls: cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt


  • Copy the file perfetto-trace.proto to your laptop, open the web page, click to open a trace, and select the file

Hardware counters

  • See a list of all the counters: srun -n 1 --gpus 1 omnitrace-avail --all

  • Declare in your configuration file: OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy

  • Execute: srun -n 1 --gpus 1 ./matrix.inst, then copy the Perfetto file to your laptop and visualize it


In your configuration file, set OMNITRACE_USE_SAMPLING = true and OMNITRACE_SAMPLING_FREQ = 100, then execute and visualize

Kernel timings

  • Open the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information)

  • To see the kernels gathered together, make sure that your configuration file has OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, then execute the code and open the file omnitrace-binary-output/timestamp/wall_clock.txt again
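The settings in this section can be edited by hand, but a small helper makes them repeatable. The sketch below defines a hypothetical shell function, set_cfg (not part of Omnitrace), that replaces an existing KEY = VALUE line or appends one if the key is missing. It assumes the configuration file uses the KEY = VALUE form in which the settings are written here, and it is demonstrated on a temporary file rather than on omnitrace_all.cfg.

```shell
# Hypothetical helper (not an Omnitrace tool): set KEY = VALUE in a config file.
# Assumes one "KEY = VALUE" assignment per line and no "|" in the value.
set_cfg() {
    cfg_key=$1; cfg_value=$2; cfg_file=$3
    if grep -q "^${cfg_key}[[:space:]]*=" "$cfg_file"; then
        # Key exists: rewrite that line via a temporary copy.
        sed "s|^${cfg_key}[[:space:]]*=.*|${cfg_key} = ${cfg_value}|" "$cfg_file" > "$cfg_file.tmp" &&
            mv "$cfg_file.tmp" "$cfg_file"
    else
        # Key missing: append a new assignment.
        printf '%s = %s\n' "$cfg_key" "$cfg_value" >> "$cfg_file"
    fi
}

# Demonstration on a temporary file with one made-up starting line.
cfg=$(mktemp)
printf 'OMNITRACE_USE_TIMEMORY                 = false\n' > "$cfg"

set_cfg OMNITRACE_USE_TIMEMORY true "$cfg"
set_cfg OMNITRACE_FLAT_PROFILE true "$cfg"
cat "$cfg"
```

To apply it to a real configuration, pass the path stored in OMNITRACE_CONFIG_FILE as the third argument.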


We have made special builds of the Omnitools (omnitrace and omniperf) for use in the exercises.

  • Load Omniperf

module use --append /global/training/enccs/modules/
module load omniperf/1.0.4
  • Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed

 salloc -N1 -p MI250 --reservation=enccs_3 --gpus=1 --time 00:10:00
cp /global/training/enccs/omniperf/1.0.4/share/sample/vcopy.cpp .
hipcc -o vcopy vcopy.cpp
srun -n 1 --gpus 1 omniperf profile -n vcopy_all -- ./vcopy 1048576 256
  • Run srun -n 1 --gpus 1 omniperf profile -h to see all the options

  • A workload named vcopy_all (the argument of -n) has now been created in the directory workloads, so we can analyze it

 srun -n 1 --gpus 1 omniperf analyze -p workloads/vcopy_all/mi200/ &> vcopy_analyze.txt

The analysis does not need srun, but we use it to keep everybody from running on the login node. Explore the file vcopy_analyze.txt

  • We can select specific IP Blocks, like:

srun -n 1 --gpus 1 omniperf analyze -p workloads/vcopy_all/mi200/ -b 7.1.2

This requires knowing the code of the IP block

  • If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute:

omniperf analyze -p workloads/vcopy_all/mi200/ --gui
  • Open the web page: