# Affinity — Placement, Ordering and Binding

## Slides

## Exercises
##### Case 1: MPI + OpenMP

* Set up a helper script for setting GOMP_CPU_AFFINITY to indicate CPU core affinity:

```bash
#!/bin/bash
export global_rank=${OMPI_COMM_WORLD_RANK}
export local_rank=${OMPI_COMM_WORLD_LOCAL_RANK}
export ranks_per_node=${OMPI_COMM_WORLD_LOCAL_SIZE}

# Default to 96 hardware threads unless the caller sets NUM_CPUS
if [ -z "${NUM_CPUS}" ]; then
  let NUM_CPUS=96
fi
# Spread the ranks evenly across the node unless RANK_STRIDE is set
if [ -z "${RANK_STRIDE}" ]; then
  let RANK_STRIDE=${NUM_CPUS}/${ranks_per_node}
fi
# Place threads on consecutive hardware threads unless OMP_STRIDE is set
if [ -z "${OMP_STRIDE}" ]; then
  let OMP_STRIDE=1
fi

# Compute the first and last hardware thread for this rank
cpu_list=($(seq 0 ${NUM_CPUS}))
let cpu_start_index=$(( ($RANK_STRIDE*${local_rank}) ))
let cpu_start=${cpu_list[$cpu_start_index]}
let cpu_stop=$(($cpu_start+$OMP_NUM_THREADS*$OMP_STRIDE-1))

export GOMP_CPU_AFFINITY=$cpu_start-$cpu_stop:$OMP_STRIDE

# Launch whatever command was passed to the script
"$@"
```
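The script must be executable before `mpirun` can launch it (a standard step, not specific to this example):

```bash
chmod +x helper.sh
```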
Run the application using the helper script:
`OMP_NUM_THREADS=2 mpirun -np 2 ./helper.sh ./hello_mpi_omp`
* You’ll find that the threads of ranks 0 and 1 are bound to HWTs 0-1 and 48-49, respectively: with 96 HWTs and 2 ranks per node, the default RANK_STRIDE is 48.
`OMP_NUM_THREADS=2 RANK_STRIDE=2 mpirun -np 2 ./helper.sh ./hello_mpi_omp`
* Setting RANK_STRIDE to 2 packs all threads onto HWTs 0-3: rank 0’s threads land on HWTs 0 and 1, rank 1’s on 2 and 3.
`OMP_NUM_THREADS=2 RANK_STRIDE=4 OMP_STRIDE=2 mpirun -np 2 ./helper.sh ./hello_mpi_omp`
* Setting RANK_STRIDE=4 and OMP_STRIDE=2 places rank 0’s threads on HWTs 0 and 2, and rank 1’s threads on HWTs 4 and 6 (the dry-run sketch below reproduces these numbers).
* Feel free to use this script on your own system by setting NUM_CPUS accordingly.
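To check how RANK_STRIDE and OMP_STRIDE interact before launching anything, the following dry-run sketch (a hypothetical standalone script, not part of the exercise) reproduces the helper script’s arithmetic and prints the GOMP_CPU_AFFINITY string each local rank would receive:

```bash
#!/bin/bash
# Dry run of the Case 1 binding arithmetic; no MPI launch required.
# Usage: [NUM_CPUS=..] [RANK_STRIDE=..] [OMP_STRIDE=..] [OMP_NUM_THREADS=..] ./dryrun.sh <ranks_per_node>
NUM_CPUS=${NUM_CPUS:-96}
OMP_NUM_THREADS=${OMP_NUM_THREADS:-2}
ranks_per_node=${1:-2}
RANK_STRIDE=${RANK_STRIDE:-$((NUM_CPUS/ranks_per_node))}
OMP_STRIDE=${OMP_STRIDE:-1}

for local_rank in $(seq 0 $((ranks_per_node-1))); do
  cpu_start=$((RANK_STRIDE*local_rank))
  cpu_stop=$((cpu_start+OMP_NUM_THREADS*OMP_STRIDE-1))
  echo "rank ${local_rank}: GOMP_CPU_AFFINITY=${cpu_start}-${cpu_stop}:${OMP_STRIDE}"
done
```

For example, `OMP_NUM_THREADS=2 RANK_STRIDE=4 OMP_STRIDE=2 ./dryrun.sh 2` prints the bindings produced by the third command above.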
The above example and script can be found in the ~/exercises/affinity/hello_mpi_omp directory.
##### Case 2: MPI + OpenMP + HIP
* Allocate a node: `salloc -N1 --exclusive -p MI250 -w mun-node-5`
* Download hello_jobstep.cpp from here: https://code.ornl.gov/olcf/hello_jobstep
* Load the necessary modules into the environment: `module load rocm openmpi/4.1.4-gcc`
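Optionally, sanity-check what the modules set up (`ROCM_PATH` is typically exported by the `rocm` module; treat that as an assumption about the local module configuration):

```bash
# Confirm the compilers are on PATH and ROCM_PATH is set
which mpic++ hipcc
echo "ROCM_PATH=${ROCM_PATH:?not set}"
```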
* Use this simpler Makefile to compile:

```makefile
SOURCES = hello_jobstep.cpp
OBJECTS = $(SOURCES:.cpp=.o)
EXECUTABLE = hello_jobstep

CXX = /global/software/openmpi/gcc/ompi/bin/mpic++
CXXFLAGS = -fopenmp -I${ROCM_PATH}/include -D__HIP_PLATFORM_AMD__
LDFLAGS = -L${ROCM_PATH}/lib -lhsa-runtime64 -lamdhip64

all: $(EXECUTABLE)

# Pattern rule: compile each .cpp into a .o
%.o: %.cpp
	$(CXX) $(CXXFLAGS) -o $@ -c $<

$(EXECUTABLE): $(OBJECTS)
	$(CXX) $(CXXFLAGS) $(OBJECTS) -o $@ $(LDFLAGS)

clean:
	rm -f $(EXECUTABLE)
	rm -f $(OBJECTS)
```
* Compile: `make`
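If the build succeeds, one way to confirm that the HIP and HSA runtimes named in `LDFLAGS` were actually linked in:

```bash
# Expect libamdhip64 and libhsa-runtime64 to resolve against ${ROCM_PATH}/lib
ldd ./hello_jobstep | grep -E 'amdhip64|hsa-runtime64'
```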
* Set up a helper script that sets ROCR_VISIBLE_DEVICES and GOMP_CPU_AFFINITY to indicate GPU and CPU core affinity, respectively:
```bash
#!/bin/bash
export global_rank=${OMPI_COMM_WORLD_RANK}
export local_rank=${OMPI_COMM_WORLD_LOCAL_RANK}
export ranks_per_node=${OMPI_COMM_WORLD_LOCAL_SIZE}

# Defaults for a 128-HWT, 8-GCD MI250 node; all can be overridden by the caller
if [ -z "${NUM_CPUS}" ]; then
  let NUM_CPUS=128
fi
if [ -z "${RANK_STRIDE}" ]; then
  let RANK_STRIDE=${NUM_CPUS}/${ranks_per_node}
fi
if [ -z "${OMP_STRIDE}" ]; then
  let OMP_STRIDE=1
fi
if [ -z "${NUM_GPUS}" ]; then
  let NUM_GPUS=8
fi
if [ -z "${GPU_START}" ]; then
  let GPU_START=0
fi
if [ -z "${GPU_STRIDE}" ]; then
  let GPU_STRIDE=1
fi

# Compute the first and last hardware thread for this rank
cpu_list=($(seq 0 127))
let cpus_per_gpu=${NUM_CPUS}/${NUM_GPUS}
let cpu_start_index=$(( ($RANK_STRIDE*${local_rank})+${GPU_START}*$cpus_per_gpu ))
let cpu_start=${cpu_list[$cpu_start_index]}
let cpu_stop=$(($cpu_start+$OMP_NUM_THREADS*$OMP_STRIDE-1))

# Pick the GCD for this rank; gpu_list reorders the GCDs to match
# the node's CPU-to-GPU topology
gpu_list=(4 5 2 3 6 7 0 1)
let ranks_per_gpu=$(((${ranks_per_node}+${NUM_GPUS}-1)/${NUM_GPUS}))
let my_gpu_index=$(($local_rank*$GPU_STRIDE/$ranks_per_gpu))+${GPU_START}
let my_gpu=${gpu_list[${my_gpu_index}]}

export GOMP_CPU_AFFINITY=$cpu_start-$cpu_stop:$OMP_STRIDE
export ROCR_VISIBLE_DEVICES=$my_gpu

# Launch whatever command was passed to the script
"$@"
```
Run the application using the helper script:
`OMP_NUM_THREADS=2 mpirun -np 8 ./helper.sh ./hello_jobstep`
* Runs 8 ranks with 2 threads each, assigning each rank one GCD in the order given by gpu_list and binding 2 CPU cores from each rank’s set of 16 cores.
`OMP_NUM_THREADS=2 mpirun -np 16 ./helper.sh ./hello_jobstep`
* Runs 2 ranks per GCD, packed closely (for example, ranks 0 and 1 run on GCD 4), binding 2 cores from each rank’s set of 8 cores; the dry-run sketch below reproduces this mapping.
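As a quick check of the rank-to-GCD mapping, a similar dry-run sketch (again hypothetical, mirroring the helper script’s arithmetic) prints the GCD each local rank would receive:

```bash
#!/bin/bash
# Dry run of the rank -> GCD mapping used by the helper script above.
# Usage: [NUM_GPUS=..] [GPU_START=..] [GPU_STRIDE=..] ./gpu_dryrun.sh <ranks_per_node>
NUM_GPUS=${NUM_GPUS:-8}
GPU_START=${GPU_START:-0}
GPU_STRIDE=${GPU_STRIDE:-1}
ranks_per_node=${1:-8}
gpu_list=(4 5 2 3 6 7 0 1)
ranks_per_gpu=$(((ranks_per_node+NUM_GPUS-1)/NUM_GPUS))

for local_rank in $(seq 0 $((ranks_per_node-1))); do
  my_gpu_index=$((local_rank*GPU_STRIDE/ranks_per_gpu + GPU_START))
  echo "rank ${local_rank}: ROCR_VISIBLE_DEVICES=${gpu_list[my_gpu_index]}"
done
```

`./gpu_dryrun.sh 8` reproduces the first mapping above; `./gpu_dryrun.sh 16` shows ranks 0 and 1 sharing GCD 4.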
The above example and scripts can be found in the ~/exercises/affinity/hello_jobstep directory.