OpenACC: Analysis
Steps in porting code to GPUs
The three key steps in porting code to a high-performance accelerated version:
1. Analyze/identify parallelism
2. Express data movement and parallelism
3. Optimize data movement and loop performance
Then go back to step 1!
Analyze your code to determine the most likely places needing parallelization or optimization.
Parallelize your code, starting with the most time-consuming parts, and check for correctness.
Optimize your code to improve the observed speed-up from parallelization.
One should generally start the process at the top with the analyze step. For complex applications, it’s useful to have a profiling tool available to learn where your application spends its execution time and to focus your efforts there. Since our example code is quite a bit simpler than a full application, we’ll skip profiling and simply analyze the code by reading it.
Compiler diagnostics
Compiler diagnostics are usually the first thing to check when starting OpenACC work.
They can tell you which operations were actually performed:
Data copies that were made
If and how the loops were parallelized
The diagnostics are very compiler-dependent:
Compiler flags
Level and formatting of the information
Diagnostics are controlled by a compiler flag:
-Minfo=option
Useful options:
accel – operations related to the accelerator
all – print all compiler output
intensity – print loop computational intensity info
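For example, to compile one of the vector addition codes below with accelerator and intensity diagnostics enabled (the source file name matches the examples that follow):

$ pgcc -O3 -acc -Minfo=accel,intensity sum_parallel.c -o sum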
Work sharing construct: loop
- Define a loop to be parallelized
C/C++:
#pragma acc loop [clauses]
Fortran:
!$acc loop [clauses]
Must be followed by a C/C++ or Fortran loop construct.
- Can be combined with the parallel and kernels constructs:
#pragma acc kernels loop / !$acc kernels loop
#pragma acc parallel loop / !$acc parallel loop
- Similar in functionality to the OpenMP for/do construct
- Loop index variables are private variables by default
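As a minimal sketch of the combined construct (a hypothetical SAXPY-style example, not one of the course exercises):

#include <stdio.h>

#define N 1024

int main(void)
{
    float x[N], y[N];

    /* Initialize input vectors on the host */
    for (int i = 0; i < N; i++) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    /* Combined construct: parallel region and work-shared loop in one
       directive; the loop index i is private by default */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++) {
        y[i] = 2.0f * x[i] + y[i];
    }

    printf("y[10] = %f\n", y[10]);
    return 0;
}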
Example: Adding two vectors
Exercise skeleton (serial version):
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define NX 102400
int main(void)
{
double vecA[NX], vecB[NX], vecC[NX];
double sum;
int i;
/* Initialization of the vectors */
for (i = 0; i < NX; i++) {
vecA[i] = 1.0 / ((double) (NX - i));
vecB[i] = vecA[i] * vecA[i];
}
/* TODO:
* Implement vector addition on device with OpenACC
* vecC = vecA + vecB
*/
for (i = 0; i < NX; i++) {
vecC[i] = vecA[i] + vecB[i];
}
sum = 0.0;
/* Compute the check value */
for (i = 0; i < NX; i++) {
sum += vecC[i];
}
printf("Reduction sum: %18.16f\n", sum);
return 0;
}
Solution using the parallel and loop constructs (sum_parallel.c):
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define NX 102400
int main(void)
{
double vecA[NX], vecB[NX], vecC[NX];
double sum;
int i;
/* Initialization of the vectors */
for (i = 0; i < NX; i++) {
vecA[i] = 1.0 / ((double) (NX - i));
vecB[i] = vecA[i] * vecA[i];
}
#pragma acc data copy(vecA,vecB,vecC)
{
#pragma acc parallel
{
#pragma acc loop
for (i = 0; i < NX; i++) {
vecC[i] = vecA[i] + vecB[i];
}
}
}
sum = 0.0;
/* Compute the check value */
for (i = 0; i < NX; i++) {
sum += vecC[i];
}
printf("Reduction sum: %18.16f\n", sum);
return 0;
}
Solution using the kernels construct (sum_kernels.c):
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define NX 102400
int main(void)
{
double vecA[NX], vecB[NX], vecC[NX];
double sum;
int i;
/* Initialization of the vectors */
for (i = 0; i < NX; i++) {
vecA[i] = 1.0 / ((double) (NX - i));
vecB[i] = vecA[i] * vecA[i];
}
#pragma acc data copy(vecA,vecB,vecC)
{
#pragma acc kernels
for (i = 0; i < NX; i++) {
vecC[i] = vecA[i] + vecB[i];
}
}
sum = 0.0;
/* Compute the check value */
for (i = 0; i < NX; i++) {
sum += vecC[i];
}
printf("Reduction sum: %18.16f\n", sum);
return 0;
}
Example: -Minfo diagnostics
$ pgcc -g -O3 -acc -Minfo=acc sum_parallel.c -o sum
main:
21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
23, Generating Tesla code
25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
$ pgcc -g -O3 -acc -Minfo=accel sum_kernels.c -o sum
main:
21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
23, Loop is parallelizable
Generating Tesla code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Example: PGI_ACC_TIME=1
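PGI_ACC_TIME is a runtime switch of the PGI OpenACC runtime: when set, a timing summary is printed at program exit. A typical way to enable it in a batch job (the srun invocation is an assumption matching the Slurm output files below):

$ export PGI_ACC_TIME=1
$ srun ./sum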
$ cat slurm-13186502.out #output of sum_parallel.c
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 451
21: data region reached 2 times
21: data copyin transfers: 3
device time(us): total=245 max=100 min=71 avg=81
29: data copyout transfers: 3
device time(us): total=206 max=72 min=67 avg=68
23: compute region reached 1 time
23: kernel launched 1 time
grid: [800] block: [128]
elapsed time(us): total=41 max=41 min=41 avg=41
Reduction sum: 1.2020569031119108
$ cat slurm-13186514.out #output of sum_kernels.c
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 453
21: data region reached 2 times
21: data copyin transfers: 3
device time(us): total=247 max=100 min=72 avg=82
26: data copyout transfers: 3
device time(us): total=206 max=73 min=66 avg=68
23: compute region reached 1 time
23: kernel launched 1 time
grid: [800] block: [128]
elapsed time(us): total=40 max=40 min=40 avg=40
Reduction sum: 1.2020569031119108
NVIDIA nvprof
NVIDIA nvprof provides a simple command-line interface for collecting profiles on a target system without using the GUI (its role is taken over by the newer Nsight tools).
GPU profiling capabilities: high-level usage statistics, timeline collection, analysis metrics
Basic CPU sampling
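Typical invocations (himeno.x refers to the benchmark profiled later in this section; the output file name is just an illustration):

$ nvprof ./himeno.x                # print a GPU usage summary at exit
$ nvprof -o profile.out ./himeno.x # record a timeline to open in the visual profiler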
The Himeno stencil benchmark
Solves the pressure Poisson equation of an incompressible fluid solver using the Jacobi iteration method.
(https://blogs.fau.de/hager/archives/7850)
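For reference, one common Jacobi sweep for the Poisson equation \( \nabla^2 p = f \) on a uniform grid with spacing \( h \) updates every interior point from its six neighbors:

\[
p^{(n+1)}_{i,j,k} = \frac{1}{6}\Bigl( p^{(n)}_{i+1,j,k} + p^{(n)}_{i-1,j,k} + p^{(n)}_{i,j+1,k} + p^{(n)}_{i,j-1,k} + p^{(n)}_{i,j,k+1} + p^{(n)}_{i,j,k-1} - h^2 f_{i,j,k} \Bigr)
\]

The Himeno kernel below generalizes this with per-point coefficient arrays (a, b, c) and a relaxation parameter omega; note the 1/6 factor stored in a[i][j][k][3].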
Himeno code (C version)
/*********************************************************************
This benchmark test program measures the floating-point
performance and memory access speed of a CPU.
Modification is needed for testing the target computer!!
Please adjust parameter : nn so that all calculations take
about one minute to execute. The original parameter set is for
a PC with a 200 MHz MMX Pentium, whose score using this
benchmark test is about 32.3 MFLOPS.
If you have any questions, please ask me via email.
written by Ryutaro HIMENO, October 3, 1998.
Version 2.0
----------------------------------------------
Ryutaro Himeno, Dr. of Eng.
Head of Computer Information Center,
The Institute of Physical and Chemical Research (RIKEN)
Email : himeno@postman.riken.go.jp
---------------------------------------------------------------
You can adjust the size of this benchmark code to fit your target
computer. In that case, please choose one of the following sets of
(mimax,mjmax,mkmax):
small : 129,65,65
medium: 257,129,129
large : 513,257,257
ext.large: 1025,513,513
This program measures computer performance in MFLOPS
by using a kernel that appears in the linear solver of the pressure
Poisson equation included in an incompressible Navier-Stokes solver.
A point-Jacobi method is employed in this solver.
------------------
Finite-difference method, curvilinear coordinate system
Vectorizable and parallelizable on each grid point
No. of grid points : imax x jmax x kmax including boundaries
------------------
A,B,C:coefficient matrix, wrk1: source term of Poisson equation
wrk2 : working area, OMEGA : relaxation parameter
BND:control variable for boundaries and objects ( = 0 or 1)
P: pressure
-----------------
*********************************************************************/
#include <stdio.h>
#include "himeno_C.h"
#ifdef SMALL
#define MIMAX 129
#define MJMAX 65
#define MKMAX 65
#endif
#ifdef MIDDLE
#define MIMAX 257
#define MJMAX 129
#define MKMAX 129
#endif
#ifdef LARGE
#define MIMAX 513
#define MJMAX 257
#define MKMAX 257
#endif
#ifdef DOUBLE_PRECISION
typedef double real;
#else
typedef float real;
#endif
static real p[MIMAX][MJMAX][MKMAX];
static real a[MIMAX][MJMAX][MKMAX][4],
b[MIMAX][MJMAX][MKMAX][3],
c[MIMAX][MJMAX][MKMAX][3];
static real bnd[MIMAX][MJMAX][MKMAX];
static real wrk1[MIMAX][MJMAX][MKMAX],
wrk2[MIMAX][MJMAX][MKMAX];
#define NN 3
double second();
real jacobi(int);
void initmt();
static int imax, jmax, kmax;
static real omega;
int
main()
{
int i, j, k;
real gosa;
double cpu0, cpu1, nflop, xmflops2, score;
omega = 0.8;
imax = MIMAX-1;
jmax = MJMAX-1;
kmax = MKMAX-1;
/*
* Initializing matrices
*/
initmt();
printf("mimax = %d mjmax = %d mkmax = %d\n",MIMAX, MJMAX, MKMAX);
printf("imax = %d jmax = %d kmax =%d\n",imax,jmax,kmax);
/*
* Start measuring
*/
cpu0 = second();
/*
* Jacobi iteration
*/
gosa = jacobi(NN);
cpu1 = second();
cpu1 = cpu1 - cpu0;
nflop = (kmax-2)*(jmax-2)*(imax-2)*34;
if(cpu1 != 0.0)
xmflops2 = nflop/cpu1*1.0e-6*(real)NN;
score = xmflops2/32.27;
printf("\ncpu : %f sec.\n", cpu1);
printf("Loop executed for %d times\n",NN);
printf("Gosa : %e \n",gosa);
printf("MFLOPS measured : %f\n",xmflops2);
printf("Score based on MMX Pentium 200MHz : %f\n",score);
// Now estimate how many iterations could be done in 20s
int nn2 = 20.0/cpu1*NN;
cpu0 = second();
gosa = jacobi(nn2);
cpu1 = second();
cpu1 = cpu1 - cpu0;
nflop = (kmax-2)*(jmax-2)*(imax-2)*34;
if(cpu1 != 0.0)
xmflops2 = nflop/cpu1*1.0e-6*(real)nn2;
score = xmflops2/32.27;
printf("\ncpu : %f sec.\n", cpu1);
printf("Loop executed for %d times\n",nn2);
printf("Gosa : %e \n",gosa);
printf("MFLOPS measured : %f\n",xmflops2);
printf("Score based on MMX Pentium 200MHz : %f\n",score);
return (0);
}
void initmt()
{
int i,j,k;
// TODO: Implement data initialization with OpenACC on device
// TODO: Implement computation with OpenACC on device
for(i=0 ; i<imax ; ++i)
for(j=0 ; j<jmax ; ++j)
for(k=0 ; k<kmax ; ++k){
a[i][j][k][0]=0.0;
a[i][j][k][1]=0.0;
a[i][j][k][2]=0.0;
a[i][j][k][3]=0.0;
b[i][j][k][0]=0.0;
b[i][j][k][1]=0.0;
b[i][j][k][2]=0.0;
c[i][j][k][0]=0.0;
c[i][j][k][1]=0.0;
c[i][j][k][2]=0.0;
p[i][j][k]=0.0;
wrk1[i][j][k]=0.0;
bnd[i][j][k]=0.0;
}
// TODO: Implement computation with OpenACC on device
for(i=0 ; i<imax ; ++i)
for(j=0 ; j<jmax ; ++j)
for(k=0 ; k<kmax ; ++k){
a[i][j][k][0]=1.0;
a[i][j][k][1]=1.0;
a[i][j][k][2]=1.0;
a[i][j][k][3]=1.0/6.0;
b[i][j][k][0]=0.0;
b[i][j][k][1]=0.0;
b[i][j][k][2]=0.0;
c[i][j][k][0]=1.0;
c[i][j][k][1]=1.0;
c[i][j][k][2]=1.0;
p[i][j][k]=(real)(k*k)/(real)((kmax-1)*(kmax-1));
wrk1[i][j][k]=0.0;
bnd[i][j][k]=1.0;
}
}
real jacobi(int nn)
{
int i,j,k,n;
real gosa, s0, ss;
// TODO: Implement data initialization with OpenACC on device
for(n=0;n<nn;++n){
gosa = 0.0;
// TODO: Implement computation with OpenACC on device
for(i=1 ; i<imax-1 ; ++i)
for(j=1 ; j<jmax-1 ; ++j)
for(k=1 ; k<kmax-1 ; ++k){
s0 = a[i][j][k][0] * p[i+1][j ][k ]
+ a[i][j][k][1] * p[i ][j+1][k ]
+ a[i][j][k][2] * p[i ][j ][k+1]
+ b[i][j][k][0] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
- p[i-1][j+1][k ] + p[i-1][j-1][k ] )
+ b[i][j][k][1] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
- p[i ][j+1][k-1] + p[i ][j-1][k-1] )
+ b[i][j][k][2] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
- p[i+1][j ][k-1] + p[i-1][j ][k-1] )
+ c[i][j][k][0] * p[i-1][j ][k ]
+ c[i][j][k][1] * p[i ][j-1][k ]
+ c[i][j][k][2] * p[i ][j ][k-1]
+ wrk1[i][j][k];
ss = ( s0 * a[i][j][k][3] - p[i][j][k] ) * bnd[i][j][k];
gosa = gosa + ss*ss;
wrk2[i][j][k] = p[i][j][k] + omega * ss;
}
printf("nn %d, gosa %f\n",n,gosa);
// TODO: Implement computation with OpenACC on device
for(i=1 ; i<imax-1 ; ++i)
for(j=1 ; j<jmax-1 ; ++j)
for(k=1 ; k<kmax-1 ; ++k)
p[i][j][k] = wrk2[i][j][k];
} /* end n loop */
return(gosa);
}
#include <sys/time.h>
double second()
{
struct timeval tm;
double t ;
static int base_sec = 0,base_usec = 0;
gettimeofday(&tm, NULL);
if(base_sec == 0 && base_usec == 0)
{
base_sec = tm.tv_sec;
base_usec = tm.tv_usec;
t = 0.0;
} else {
t = (double) (tm.tv_sec-base_sec) +
((double) (tm.tv_usec-base_usec))/1.0e6 ;
}
return t ;
}
$ srun -n 1 nvprof --cpu-profiling on --cpu-profiling-mode top-down ./himeno.x
======== CPU profiling result (top down):
Time(%) Time Name
85.74% 18.16s jacobi
7.46% 1.58s initmt
0.94% 200ms | ???
6.75% 1.43s __c_mcopy4_sky
0.05% 10ms __c_mcopy4
Summary
Profiling is essential for optimization
nvprof and NVVP (NVIDIA Visual Profiler) for the NVIDIA platform