Tuning

We will now combine what we have learnt and apply it to the problem of tuning the well-known matrix multiplication algorithm. Matrix multiplication has several advantages as an example, including:

  • It is well known and relatively simple (a few lines at our starting point).

  • It is eminently tunable, as we will see. There is quite a difference in performance between a naive version and a heavily tuned one (that is why you probably use a library for this operation).

  • It is used in many contexts, as are algorithms with a similar structure such as convolutions.

  • It probably consumes more compute cycles than (almost) any other algorithm, given that it is what powers large language models such as GPT-4. Transformers, as the underlying neural network architecture is called, consist almost entirely of matrix multiplications.

Matrix-matrix multiplication

Here is a naive matrix multiply with the same structure as the classic definition. The outer two loops iterate over the result array while the innermost loop computes the sum.

void matmul(double* a, double* b, double* c, long n) {
  for(long i = 0; i < n; i++) {
    for(long j = 0; j < n; j++) {
      double sum = 0;
      for(long k = 0; k < n; k++) {
        sum += b[i*n + k] * c[k*n + j];
      }
      a[i*n + j] = sum;
    }
  }
}

Note

This code uses explicit index calculations. For an N × N matrix M stored in row-major order, the element M[i][j] is stored at offset i*N + j from the start of the (one-dimensional) array.

This is exactly how the C compiler implements multidimensional arrays, so if you have

double A[100][80];

the reference A[i][j] will effectively be translated to A[i*80 + j]. This does not work if the compiler does not know the size of the inner dimension (the row length), which is necessarily the case here since we want to use the matmul() function for matrices of different sizes.
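As a small illustration, the offset computation can be wrapped in a helper (the name idx() is ours; the code in this chapter writes the expression out inline):

static inline long idx(long i, long j, long n) {
  return i*n + j;   /* row-major offset of element (i, j) for row length n */
}

so that a[idx(i, j, n)] is equivalent to a[i*n + j].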

You can build this code using the following command:

gcc -o mm ex-mm-ref.c ex-mm-main.c -lm

This gives you an executable program called mm.

Exercise

Try running mm with inputs of different sizes and measure the execution time using the time command. Sizes around 1000 should run in a few seconds.
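For example:

time ./mm 1000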

We know from inspecting the code that matrix multiplication is cubic in the size of the matrix (for square matrices). Does that theory hold in practice as well?

Exercise

Try to fit a cubic function (degree three polynomial) to the measurements in the previous exercise.
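A simple way to do this, assuming the model \(T(n) \approx c_3 n^3\): compute \(T(n)/n^3\) for each measurement. If the cubic theory holds, this ratio should be roughly constant, and its value is then an estimate of the coefficient \(c_3\).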

The question then becomes: Is this a good result? Should we be satisfied and call it a day or should we invest time and effort into trying to make the code faster?

One way to approach this question is to try to find a lower bound for the possible execution time and compare that lower bound to the observed performance. Such a bound will often come from reasoning about the resources necessarily consumed by the program.

What resources can be used to form this bound depends on what techniques one is prepared to use to increase performance. For instance, there are algorithms for matrix multiplication that have a better complexity than \(O(n^3)\), one example being Strassen’s algorithm which is approximately \(O(n^{2.8})\). So if we are willing to switch algorithms, we cannot use the number of arithmetic operations of our current program to compute our lower bound.

For the remainder of this chapter, we will keep the algorithm fixed, so the number of arithmetic operations can be used to establish a lower bound. But what are those arithmetic operations? The obvious answer is that they are additions and multiplications. However, most modern processors provide fma (fused multiply-add) instructions which compute \(A = B + C \times D\), mapping very well to the sum += ... * ... of the inner loop of our program. Thus we will say that we need to execute \(N^3\) fma operations (for instance \(10^9\) of them for \(N = 1000\)).

So at what rate can a machine execute fma operations? That depends on three different factors:

  • SIMD width: Does the machine have SIMD fma instructions that perform multiple operations in a single instruction, and if so, how many?

  • Instruction throughput: How many fma instructions can the machine execute per cycle? If an fma takes more than a cycle, this number is less than one.

  • Clock frequency: Given DVFS (dynamic voltage and frequency scaling), which changes the clock frequency to balance power and performance depending on system load and other factors, there is no simple number to plug in here.

The SIMD width is a feature of the instruction set architecture and can be found in the relevant manuals. The throughput of the fma instruction is a feature of the implementation, that is, the specific processor model. It is typically documented in either an optimization manual or a hardware manual for that processor model. As for the clock frequency, there is typically documentation about the interval in which it falls, but that interval is generally wide, so that information is of limited value.

We will handle the varying clock frequency by using the Linux perf tool to measure the number of cycles the program executed rather than the elapsed time. This is a good choice if the transformations we intend to do on the program for the purpose of tuning do not affect the clock frequency. This may not always be true, but in our experience it is close enough for the purpose of gauging the room for improvement.

Now, as usual, I do not know what kind of machine you, dear student, are running your code on (you are running code, right?).

Example

For the Core i7-8550U of the T580, the greatest SIMD width implemented is 256 bits (the AVX2 instruction set extension). Since the element type of the matrix multiplication program is double, we have four elements per vector.

When it comes to the microarchitectures of common x86 implementations, one of the best references is the web site maintained by the Danish scientist Agner Fog. For the present purpose, we consult the instruction tables document. The Core i7-8550U is based on the Kaby Lake core, which shares its microarchitecture with the preceding Skylake core, so we check the entry for the VFMADD instruction family in the Skylake tables. In the reciprocal throughput column we find the value 0.5, which means that the throughput is 2 VFMADD family instructions per cycle. The reciprocal throughput can be interpreted as the resource cost in cycles of a VFMADD family instruction.

Putting these pieces of information together, we find that this processor can do up to eight fma operations per cycle (four elements per vector times two instructions per cycle).

Next we need to figure out the number of cycles the program spends per (logical) fma operation. Of course we have a rough idea, since we found 3.1 ns per operation to be a good fit above and the clock frequency of our Core i7 will be somewhere around 2.5 to 3.5 GHz, giving an expected interval of 8-11 cycles.

Exercise

Find the number of cycles by running perf as follows:

perf stat ./mm <size>

The conclusion is then that this version of the program is a factor of 80-100 from the lower bound, so a tuning effort is warranted.

An observation we can immediately make is that we forgot to turn on optimization when we compiled the matrix multiplication program. Oops!

Exercise

Explore how the performance in time or cycles changes with optimization level. For GCC, interesting alternatives are -O, -O2 and -O3. There are also more specialized flags which are sometimes useful.
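For example, an optimized build could look as follows (here we call the resulting executable mmO, the name that appears in the perf run below; the exact optimization level is yours to choose):

gcc -O3 -o mmO ex-mm-ref.c ex-mm-main.c -lm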

So, we got quite a boost from turning on optimization, as expected given all of the index calculations in the source code. At this point, the best thing to do is to try to understand the performance we have as far as possible before trying to improve it. What barriers do we have to overcome to get further?

The first step here is to think about our program in the context of what we know of the machine that is executing it and formulating some hypotheses regarding what happens.

Exercise

Take a moment to think about potential performance limiters in the code. You may want to start from the simple performance equation:

\(N_C = N_I \times CPI\)

where \(N_C\) is the number of cycles, \(N_I\) the number of instructions executed, and \(CPI\) the average number of cycles per instruction.

Also, what hints do we get from the shape of the graphs?

Now that we have a few hypotheses, we know what to look for so the next steps are:

  • Look at the generated code, to see if there is anything surprising. We will only be interested in the innermost loop, and there we expect to see the two memory references, the arithmetic (separate mul and add, or an fma), updates to the pointers (separate updates because of the different strides), and finally loop control in the form of a compare and a conditional branch.

  • Use perf again to get the CPI value (reported as IPC since it is expected that there are more instructions than cycles). We can also use the tool to verify that there are really many cache misses and we can get an estimate of how these impact performance.
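For instance, a first look at the cache behaviour can be had with an event list along these lines (these are generic perf event names; the exact set of available events varies between machines and perf versions):

perf stat -e cycles,instructions,cache-references,cache-misses ./mm <size>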

Here are example values for my machine:

Example

Assembly for the innermost loop of the matrix multiplication program, generated by GCC under maximum optimization (-O3). Comments are mine.

.L4:                             ; Assembly label to branch to
      movsd    (%rax), %xmm0     ; Read an element of b into register xmm0
      mulsd    (%rdx), %xmm0     ; Multiply xmm0 with the element from c
      addq     $8, %rax          ; Next element of b (a double is 8 bytes)
      addq     %rcx, %rdx        ; Next element of c
      addsd    %xmm0, %xmm1      ; Add product to partial sum in xmm1
      cmpq     %rax, %rsi        ; Have we reached the end of the row of b?
      jne      .L4               ; Loop closing branch

This is good code; we have nothing to complain about. The compiler has not generated any fma instruction, probably because we have not told it that one is available.

Recall that register operands are indicated with % and immediates with $. Parentheses around a register indicate a memory reference to the address in the register (register indirect addressing).

We now run perf stat with a small matrix size and enough repetitions to get a reasonable run time.

$ perf stat ./mmO 100 150
Ok = 1

Performance counter stats for './mmO 100 150':

       326,34 msec task-clock                #    0,999 CPUs utilized
            1      context-switches          #    0,003 K/sec
            0      cpu-migrations            #    0,000 K/sec
          200      page-faults               #    0,613 K/sec
1 186 216 578      cycles                    #    3,635 GHz
2 389 839 498      instructions              #    2,01  insn per cycle
  342 460 826      branches                  # 1049,404 M/sec
    2 276 417      branch-misses             #    0,66% of all branches

 0,326737273 seconds time elapsed

 0,322730000 seconds user
 0,003984000 seconds sys

We can combine the information in these two experiments. The number of iterations of the innermost loop that we execute is \(100 \times 150^3 = 3.375 \times 10^8\), which we can multiply by the 7 instructions we have in that loop and compare to the number of instructions executed, verifying that we spend about 99% of all instructions within that loop.

We can also see that we execute on average two instructions per cycle, which tallies with the roughly 3.5 cycles per fma we measured previously (7 instructions per iteration at 2 instructions per cycle is 3.5 cycles per iteration, and each iteration performs one fma).

This is surprising. According to Agner Fog’s tables (and Intel’s documentation), the addsd instruction has a latency of 4 cycles, so the loop-carried dependence on the partial sum should limit us to one iteration every four cycles. The explanation is that while the instruction window is not large enough to cover the 150 iterations of the summation loop, it does cover about 28 iterations, so the processor can in fact execute the last iterations of one loop together with the first iterations of the next. This is also consistent with the increase in cycles per fma as the matrix size increases; the 28 overlapped iterations become a smaller and smaller part of the whole trip count of the summation loop.

[Figure: The instruction window in the middle of an iteration of the j loop. Several iterations of the k loop are visible.]

[Figure: The instruction window at the boundary between two iterations of the j loop. Parallelism between the iterations is possible.]

DAXPY

At this point, we can conclude that with this program structure, where the innermost loop computes the sum, we will not get below four cycles per iteration of that loop for large matrix sizes. This leaves two possibilities:

  1. Change the nesting of the loops so that for instance the j loop becomes the innermost. Then there will be no data dependences between the iterations of the innermost loop, enabling more parallelism.

  2. Do the sums for several elements of the result matrix a at the same time. This also increases parallelism, but this time within each iteration of the innermost loop.

We will try alternative 1 first. The inner loop of this version is that of the venerable DAXPY routine, which is why we will refer to this version as the DAXPY version. This loop ordering was used by the vector computers of old as a building block for matrix multiplication precisely because the loop iterations are independent.

void matmul(double* a, double* b, double* c, long n) {
  for(long i = 0; i < n; i++) {
    for(long j = 0; j < n; j++) a[i*n + j] = 0;
    for(long k = 0; k < n; k++) {
      for(long j = 0; j < n; j++) {
        a[i*n + j] += b[i*n + k] * c[k*n + j];
      }
    }
  }
}

As we can see, the reference to b is invariant with respect to the innermost (j) loop, while the references to the two other matrices have stride 1. That is because j occurs in their index expressions with coefficient 1.

You can build this code as follows:

gcc -O -o mmD ex-mm-daxpy.c ex-mm-main.c -lm

Exercise

Explore the performance of the DAXPY variant!

It is instructive to take a look at the assembly code of this version, just to check what it looks like.

Exercise

Check the assembly code generated from the DAXPY version! You get it by the following command:

gcc -O -o mmD.s -S ex-mm-daxpy.c

The resulting code should be about a hundred lines and you should find the innermost loop where you have multiplication and addition operations (or an fma) with an xmm or ymm register as operand (if you are using an x86 machine; if you’re on an ARM the registers will be d0, d1 and so on).

The performance limitation for this version is the number of instructions or uops (work) rather than latency (span) as for the baseline implementation. There are also no major cache effects as long as at least the c matrix fits in the L3 cache (an \(n \times n\) matrix of doubles occupies \(8n^2\) bytes, about 8 MB for \(n = 1000\)).

Hence, we need to get more work done per instruction. The generated code leaves room for improvement in several respects:

  • Only scalar arithmetic gets generated; no vector operations.

  • No fma instructions are used, doubling the number of arithmetic instructions.

  • Only a quarter (two out of eight) of the instructions are arithmetic while the processor supports up to half.

Vectorization

We will start by addressing the first point. There are in general several ways to get vectorized code:

  • The easiest is if the compiler can take scalar code and convert it to vectorized code with no change in the source. This is called autovectorization; it has been with us for fifty years and has been at least somewhat useful for forty of them. Nowadays, compilers like GCC, Clang and Intel’s icc often do a reasonable job.

  • Special intrinsic functions, typically built into the compiler, can be used. There is typically one intrinsic for each SIMD instruction. These intrinsics are therefore machine dependent.

  • GCC and Clang provide a way to define vector types like “vector of four doubles” that automatically inherit many operators. Hence, if x and y are vectors of the same size and element type, x + y is their element-wise sum. A small demonstration follows below.
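As a minimal sketch of the third approach (this is the same mechanism used in the explicitly vectorized version later in this chapter):

#include <stdio.h>

/* GCC/Clang vector extension: a 32-byte vector of four doubles. */
typedef double __attribute__((vector_size(32))) v4d_t;

int main(void) {
  v4d_t x = {1.0, 2.0, 3.0, 4.0};
  v4d_t y = {10.0, 20.0, 30.0, 40.0};
  v4d_t z = x + y;                 /* element-wise addition */
  v4d_t w = 2.0 * x;               /* scalar-vector multiplication */
  printf("%f %f\n", z[0], w[3]);   /* vectors can be indexed like arrays */
  return 0;
}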

Exercise

Let’s try autovectorization! On the version of GCC that I have, autovectorization is included at the -O3 optimization level, so we recompile our code thus:

gcc -O3 -o mmD ex-mm-daxpy.c ex-mm-main.c -lm

Do we get any performance difference?

Example

The effect of -O3 on the Core i7-8550U. The assembly code of the innermost loop now looks as follows:

.L7:
        movupd   (%rax,%r9), %xmm0     ; Read elements of c
        movupd   (%rcx,%r9), %xmm2     ; Read elements of a
        mulpd    %xmm1, %xmm0          ; Multiply elements from b and c
        addpd    %xmm2, %xmm0          ; Add to elements of a
        movups   %xmm0, (%rcx,%r9)     ; Write sum to a
        addq     $16, %r9              ; Update offsets
        cmpq     %r9, %rbp             ; Done?
        jne      .L7                   ; Branch back if not

This code uses addresses of the form (%rax,%r9) which refer to the memory location whose address is the sum of the contents of the two registers. So rax holds the address of the current row in c while rcx points at the start of the current row in a.

The use of xmm registers, as well as the 16 being added to the offset in r9, indicates that the compiler has chosen a SIMD vector length of 128 bits, or two double values. Hence each iteration of the innermost loop now computes two fma operations, so the slightly-above-1.2 cycles per fma we measure corresponds to an inner-loop iteration time of just under 2.5 cycles on average.

Given the number of variants of the x86 architecture, what code should be generated by a compiler like GCC? If GCC generates code very conservatively, using only instructions that are included in all processors that have been produced for many years, the code will run on just about any x86 machine out there. However, it will not take advantage of more recent improvements like SIMD vector extensions.

The most common compromise struck by GCC on 64-bit systems (almost all today) is to target the first 64-bit version of the x86 architecture, often referred to as x86_64. This includes 128-bit SIMD vector support (SSE2), but not the 256-bit AVX extensions. This default is decided when GCC is compiled, so it is up to the distribution, for instance Ubuntu, to make the choice.

It is always possible to tell GCC to generate code for a particular instruction set using the -march option. If you want code that uses all of the features of the machine on which you run the compiler, you can use -march=native.

Example

Let us see what code we get if we tell GCC to generate code specifically for the Core i7-8550U. We do this by compiling the matrix multiplication program as follows:

gcc -O3 -march=native -S -o mmD.s ex-mm-daxpy.c

This gives us the following code for the innermost loop:

.L7:
        vmovupd     (%rax,%rdi), %ymm0          ; Read elements from c
        vfmadd213pd (%rcx,%rdi), %ymm1, %ymm0   ; fma with element from a
        vmovupd     %ymm0, (%rcx,%rdi)          ; Write back to a
        addq        $32, %rdi                   ; Increment offset
        cmpq        %rdi, %rbx                  ; Done?
        jne         .L7                         ; Branch back if not

This is excellent code:

  • Finally, we actually get the fma instruction. The pd at the end says that it is packed (that is, a vector operation) and that the element type is double. The 213 controls the interpretation of the operands; in this case we have approximately:

    ymm0 = memory[rcx+rdi] + ymm0*ymm1
    
  • Both the use of the 256 bit ymm registers and the value 32 being added to the offset in rdi tells us that we now get 256-bit code.

Exercise

Explore the performance effect of compiling for your native machine! Compile with

gcc -O3 -march=native -o mmD ex-mm-daxpy.c ex-mm-main.c -lm

Do we get any performance difference?

One phenomenon that we now encounter for the first time is the cost of unaligned data accesses. When the matrix size is divisible by two but not by four (as for instance 150), the length of a row is not an integral number of 32-byte SIMD vectors. Rather, each row is “something-and-a-half” vectors long (for n = 150, a row is 1200 bytes, or 37.5 vectors). Thus every other row will not be vector aligned. This will lead to vector accesses that straddle a cache line boundary, necessitating two tag look-ups with the attendant increase in access time. Note that this happens not only at the start of such rows, but throughout every other iteration of the j loop.

One way to handle this problem is to use a padded representation of matrices. Thus, during the mapping of the two index dimensions to the single linear offset, we round the row length up to the next multiple of four (in this case).
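The rounding itself could look like this (a plausible sketch of the rtvl() helper used below; the actual rtvl.h may differ in details, and the -DRTVL switch described in the exercise below selects between rounding and identity):

#ifdef RTVL
static inline long rtvl(long n) {
  return (n + 3) & ~3L;   /* round up to the next multiple of 4 */
}
#else
static inline long rtvl(long n) {
  return n;               /* padding disabled: identity */
}
#endif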

This is implemented in the padded version of the code below, where rtvl() (round to vector length) is a function that rounds its argument up to the next multiple of 4:

#include "rtvl.h"

void matmul(double* a, double* b, double* c, long n) {
  long rn = rtvl(n);
  for(long i = 0; i < n; i++) {
    for(long j = 0; j < n; j++) a[i*rn + j] = 0;
    for(long k = 0; k < n; k++) {
      for(long j = 0; j < n; j++) {
        a[i*rn + j] += b[i*rn + k] * c[k*rn + j];
      }
    }
  }
}

Note that we only use the rounded size for the index computations, not for the loop bounds. For this code to work correctly, the caller must also use the same padding.
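For instance, a caller might allocate a matrix like this (a sketch; the variable names are ours):

long rn = rtvl(n);
double* a = malloc(n * rn * sizeof(double));   /* n rows of rn doubles each */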

Exercise

Explore the performance effect of padding! Compile with

gcc -O3 -march=native -DRTVL -o mmDp ex-mm-daxpy-p.c ex-mm-main.c -lm

The -DRTVL enables the rtvl() function which otherwise just returns its argument. This allows us to use the same ex-mm-main.c with both padded and unpadded matrix multiplication functions.

Do we get any performance difference?

We are now a factor of four to five above the ideal performance and we execute at about three instructions per cycle out of the theoretical maximum of four, so in order to get a substantial improvement, we must execute fewer instructions.

The innermost loop (in the example above) contains one vector fma instruction, two data movement instructions and three housekeeping instructions (pointer update, loop control). We need to get more arithmetic relative to the rest of the operations.

There are two potential ways to accomplish this:

  • Loop unrolling would decrease the housekeeping fraction.

  • Register blocking would increase the compute fraction.

Out of these two, we will do the register blocking, since we cannot perform loop unrolling as a source-level transformation: it would interfere with the autovectorization done by the compiler.

Blocking for registers

Register blocking looks at the memory references in the innermost loop and attempts to amortize them over more computation. We get the additional computation by executing multiple iterations of the enclosing loops at once. Here is the new version of the code:

 1#include "rtvl.h"
 2#include <string.h>
 3
 4void matmul(double* a, double* b, double* c, long n) {
 5  long rn = rtvl(n);
 6  long m = n%4;
 7  for(long i = 0; i < n; i+=2) {
 8    memset(a + i*rn, 0, 2*rn*sizeof(double));
 9    for(long k = 0; k < n-3; k+=4) {
10      double b00 = b[(i+0)*rn + k + 0];
11      double b01 = b[(i+0)*rn + k + 1];
12      double b02 = b[(i+0)*rn + k + 2];
13      double b03 = b[(i+0)*rn + k + 3];
14      double b10 = b[(i+1)*rn + k + 0];
15      double b11 = b[(i+1)*rn + k + 1];
16      double b12 = b[(i+1)*rn + k + 2];
17      double b13 = b[(i+1)*rn + k + 3];
18      #pragma omp simd
19      for(long j = 0; j < n; j++) {
20        double s0 = a[i*rn + j];
21        double s1 = a[(i+1)*rn + j];
22        s0 += b00 * c[k*rn + j];
23        s1 += b10 * c[k*rn + j];
24        s0 += b01 * c[(k+1)*rn + j];
25        s1 += b11 * c[(k+1)*rn + j];
26        s0 += b02 * c[(k+2)*rn + j];
27        s1 += b12 * c[(k+2)*rn + j];
28        s0 += b03 * c[(k+3)*rn + j];
29        s1 += b13 * c[(k+3)*rn + j];
30        a[    i*rn + j] = s0;
31        a[(i+1)*rn + j] = s1;
32      }
33    }
34    for( long k = n-m; k < n; k++ ) {
35      #pragma omp simd
36      for( long j = 0; j < n; j++ ) {
37        a[    i*rn + j] += b[    i*rn + k] * c[k*rn + j];
38        a[(i+1)*rn + j] += b[(i+1)*rn + k] * c[k*rn + j];
39      }
40    }
41  }
42}

There is quite a lot to unpack here:

  • Looking at the innermost loop, lines 19-32, we see eight += operators, working on scalar variables rather than directly on the a matrix. There are two partial sum variables, s0 and s1, each used four times.

  • In the innermost loop, we see that k, k+1, k+2, and k+3 are used and the k loop (line 9) uses an increment of 4.

  • The reads and writes to the a matrix use both i and i+1 in the index expressions, and the i loop (line 7) has an increment of 2.

  • Putting the last two observations together, we have blocked with respect to the i and k loops with blocking factors of 2 and 4, respectively, which also tallies with the eight (\(2 \times 4\)) += operators in the innermost loop.

  • The k loop (line 9) only iterates while k < n-3 in order not to overshoot if n is not a multiple of four. The remaining 0-3 iterations are performed in lines 34-40.

  • There is no corresponding provision for odd values of n in the i loop (line 7), so this version of the code only works for even matrix sizes (including those not a multiple of four).

  • There are two OpenMP pragmas (lines 18 and 35). These instruct the compiler to SIMD-vectorize the code. Unfortunately, the blocked inner loop is too big for GCC to SIMD-vectorize without this explicit request.

The strategy behind the block selection (2 in the i direction and 4 in the k direction) is that blocking in each direction amortizes the memory accesses that do not depend on that loop variable. There are two accesses to the a matrix (a read and a write), both independent of k, so blocking k by 4 brings the number of a accesses from two per fma down to one half per fma.

As for the c[k*rn + j] access from the non-blocked version, it is independent of i, so the same matrix element will be reused for different values of that variable. In the original, there is one read per fma operation, so the blocking factor of 2 for i gives us one half read of c for each fma. Exchanging the blocking factors would give us one reference to a and a quarter of a reference to c for every compute instruction, which is clearly worse (1.25 rather than 1 memory access per fma).

We can verify this analysis by referring to the innermost loop in lines 19-32 and counting

... += ... * ...

statements and unique memory references (the compiler deals with multiple copies of the same expression).

  • Two reads of a: a[i*rn + j], a[(i+1)*rn + j]

  • Four reads of c: c[k*rn + j], c[(k+1)*rn + j], c[(k+2)*rn + j], c[(k+3)*rn + j]

  • Two writes to a: a[i*rn + j], a[(i+1)*rn + j]

Thus we are now at one memory access per fma operation rather than three as in the original code.

What stops us from improving the ratio between compute and memory references even further by choosing even larger blocking factors? The most concrete limiting factor is the additional number of registers that would be needed for the bik variables. These depend on both i and k but are independent of j, which is why they are loop invariant. There are 16 vector registers in the architecture and we need a few for common subexpressions and the like, so blocking 4 by 4 or larger would lead to some of the bik being spilled to memory by the compiler.

Example

Let us look at the generated code for the inner loop:

gcc -O3 -march=native -S -o mmDb.s -fopenmp -DRTVL ex-mm-block.c

Here is the inner loop:

.L5:
        vmovupd (%rcx,%rax), %ymm1                ; Read from c
        vmovapd %ymm10, %ymm0                     ; ymm0 will be overwritten
        vfmadd213pd     (%r15,%rax), %ymm1, %ymm0 ; fma, a[...] is first operand
        vfmadd213pd     (%r14,%rax), %ymm9, %ymm1 ; fma, a[...] is first operand
        vmovupd (%r10,%rax), %ymm2                ; Read from c
        vfmadd231pd     %ymm2, %ymm8, %ymm0       ; fma
        vfmadd231pd     %ymm2, %ymm7, %ymm1       ; fma
        vmovupd (%rdx,%rax), %ymm2                ; Read from c
        vfmadd231pd     %ymm2, %ymm6, %ymm0       ; fma
        vfmadd132pd     %ymm5, %ymm1, %ymm2       ; fma (partial sum written to ymm2)
        vmovupd (%rsi,%rax), %ymm1                ; Read from c
        vfmadd231pd     %ymm1, %ymm4, %ymm0       ; fma
        vfmadd132pd     %ymm3, %ymm2, %ymm1       ; fma
        vmovupd %ymm0, (%r15,%rax)                ; Write to a
        vmovupd %ymm1, (%r14,%rax)                ; Write to a
        addq    $32, %rax                         ; Increment offset
        cmpq    %rax, %rbx                        ; Are we done?
        jne     .L5                               ; Loop back

This is about as good code as you can expect. There are eight fma instructions and eight memory references, two of which are combined with the compute operations. Three different variants of the fma instruction are used. They differ in which instruction operand plays what role in the computation. The reason for the different variants is that two of the operands are special:

  • The first operand can be read from memory.

  • The third operand will be both read and written (since the instruction format does not accommodate four operands).

Note also that the memory references use different base registers together with a common offset in rax.

In total, there are 18 instructions in the loop body. With an ideal execution rate of four instructions per cycle, this amounts to 4.5 cycles for 8 vector fma operations, or about 0.14 cycles per scalar fma (each vector fma performs four scalar ones).

Exercise

Explore the performance effect of register blocking! Compile with

gcc -O3 -march=native -fopenmp -DRTVL -o mmDb ex-mm-block.c ex-mm-main.c -lm

Do we get any performance difference?

Blocking for the cache

For the larger matrix sizes, we lose some time to cache misses, so it seems worthwhile to try blocking for the cache as well. The target of this blocking is the c matrix, which is indexed by k and j and hence independent of i. So if we can limit the amount of data from this matrix touched within each iteration of the i loop, that data will fit in the L1 or L2 caches.

Here is the code:

#include "rtvl.h"
#include <string.h>

void matmul(double* a, double* b, double* c, long n) {
  const long rn = rtvl(n);
  const long m = n%4;
  const long jb = n > 512 ? 512 : n;
  const long kb = n < 256 ? 4 * (1024 / jb) : 16;
  for(long jj = 0; jj < n; jj += jb) {
    long jlim = jj+jb <= n ? jj+jb : n;
    for(long kk = 0; kk < n; kk += kb) {
      long klim = kk+kb <= n ? kk+kb : n;
      for(long i = 0; i < n; i+=2) {
        if(kk == 0) {
          memset(a + (i+0)*rn + jj, 0, (jlim-jj)*sizeof(double));
          memset(a + (i+1)*rn + jj, 0, (jlim-jj)*sizeof(double));
        }
        for(long k = kk; k < klim-3; k+=4) {
          double b00 = b[(i+0)*rn + k + 0];
          double b01 = b[(i+0)*rn + k + 1];
          double b02 = b[(i+0)*rn + k + 2];
          double b03 = b[(i+0)*rn + k + 3];
          double b10 = b[(i+1)*rn + k + 0];
          double b11 = b[(i+1)*rn + k + 1];
          double b12 = b[(i+1)*rn + k + 2];
          double b13 = b[(i+1)*rn + k + 3];
          #pragma omp simd
          for(long j = jj; j < jlim; j++) {
            double s0 = a[i*rn + j];
            double s1 = a[(i+1)*rn + j];
            s0 += b00 * c[k*rn + j];
            s1 += b10 * c[k*rn + j];
            s0 += b01 * c[(k+1)*rn + j];
            s1 += b11 * c[(k+1)*rn + j];
            s0 += b02 * c[(k+2)*rn + j];
            s1 += b12 * c[(k+2)*rn + j];
            s0 += b03 * c[(k+3)*rn + j];
            s1 += b13 * c[(k+3)*rn + j];
            a[    i*rn + j] = s0;
            a[(i+1)*rn + j] = s1;
          }
        }
        if(klim == n) {
          for( long k = n-m; k < n; k++ ) {
            #pragma omp simd
            for( long j = jj; j < jlim; j++ ) {
              a[    i*rn + j] += b[    i*rn + k] * c[k*rn + j];
              a[(i+1)*rn + j] += b[(i+1)*rn + k] * c[k*rn + j];
            }
          }
        }
      }
    }
  }
}

Exercise

Explore the performance effect of cache and register blocking! Compile with

gcc -O3 -march=native -fopenmp -DRTVL -o mmDbb ex-mm-bigblock.c ex-mm-main.c -lm

Do we get any performance difference?

Can we do even better? We have an inner loop with as many fma operations as memory accesses, and it is difficult to improve on that ratio without risking running out of vector registers. The main problem is still that we both read and write the partial sums in the a matrix, a consequence of the DAXPY loop ordering. In order to push further, we need to explore the alternative to the DAXPY loop ordering that we considered above.

Explicit vectorization

We now return to having the k loop as the innermost loop, just as in the textbook definition of matrix multiplication. We will however compute several sums in parallel.

In contrast to the development of the DAXPY variant, we will only present the final version, which is vectorized and blocked for registers, L1 and L3.

In contrast to the DAXPY versions, we have used explicit vectorization. With the loop structure we use in this code, we could not get the compiler to vectorize reliably. We use the vector constructs implemented by GCC and Clang, and thus we avoid using vector intrinsics. The vectorization is requested by declaring a vector type:

typedef double __attribute__((vector_size(32))) v4d_t;

This declaration creates a type of vectors of double of size 32 bytes. Since a double is eight bytes in size, such a vector has four elements.

With this type declaration in hand we can now create arrays of such vectors in the usual way and we can also add, multiply, etc using the ordinary C operators. Vector variables can be initialized using the {...} syntax and can also be indexed to retrieve individual components.

We use vector access for the a and c matrices, accessing them through pointers of type v4d_t*. We have vectorized in the j dimension, which is the stride-1 direction for these matrices. From the b matrix, the accesses are invariant with respect to j, which means that we want to do scalar-vector multiplication with c. This functionality is also supported by the vector programming model.
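For instance, if s is a double and v a v4d_t, the expression s * v multiplies each element of v by s, which is exactly the shape of the b-times-c products in the inner loop.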

We have also used the unroll pragma of GCC. Together with constant loop bounds, this guarantees that the loops are completely unrolled and that arrays used in the loop and indexed with the associated index variables will be converted to sets of scalar variables. We could not use this construct in the DAXPY versions, since the presence of inner loops, even if they were marked for complete unrolling, disabled vectorization.

We also transpose the c matrix, which is a common optimization with this loop ordering. Consider the innermost loop (without other optimizations):

for(long k = 0; k < n; k++) {
  sum += b[i*n + k] * c[k*n + j];
}

The b matrix is accessed with unit stride (for row-major layout) since k has coefficient 1 in the index expression. In contrast, k has coefficient n in the access to the c matrix. Hence it is a good idea to transpose this matrix and instead use the following code:

for(long k = 0; k < n; k++) {
  sum += b[i*n + k] * cT[k + j*n];
}

Here cT is the transposed version.
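For reference, a whole-matrix transpose might look like the sketch below (the version in this chapter instead transposes one block at a time, as described next):

void transpose(double* cT, const double* c, long n) {
  // cT[j][k] = c[k][j] for an n x n row-major matrix
  for(long k = 0; k < n; k++)
    for(long j = 0; j < n; j++)
      cT[j*n + k] = c[k*n + j];
}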

In this version of the code, we do the transpose for blocks of the c matrix rather than once for the whole matrix. This gives us a block of the transposed matrix stored contiguously, eliminating the risk of cache conflicts between rows.

We combine the transpose with vectorization by transposing the vectorized version of the c matrix. We thus essentially get a three-dimensional array for the c block, with j as the outermost dimension, k as the middle dimension, and then j again within the SIMD vector words (in the code below, vector j of row k of the block is stored at vcT[(j-jj)*koff + k-kk]).

For simplicity, this version of the code only handles matrix sizes that are multiples of 16.

Here is the code:

#include <malloc.h>
#include <stdlib.h>

typedef double __attribute__((vector_size(32))) v4d_t;

#define YB 4
#define XB 2
static const long yB = YB;
static const long xB = XB;

#define min(a,b) ((a) < (b) ? (a) : (b))

void matmul(double* a, double* b, double* c, long n) {
  v4d_t* va = (v4d_t*) a;
  v4d_t* vb = (v4d_t*) b;
  v4d_t* vc = (v4d_t*) c;
  _Alignas(64) v4d_t vcT[512];
  long JB =  8;
  long KB = 64;
  long IB = n <= 512 ? n : 4*KB;

  long KBB = IB;
  long no4 = n/4;
  double* cT = (double*) vcT;

  for(long ii = 0; ii < n; ii += IB) {
    long ilim = min(n, ii+IB);
    for(long kkk = 0; kkk < n; kkk += KBB) {
      long kklim = min(n, kkk+KBB);
      for(long jj = 0; jj < no4; jj += JB) {
        long jlim = min(no4, jj + JB);
        for(long kk = kkk; kk < kklim; kk += KB) {
          long klim = min(kklim, kk + KB);
          long koff = klim - kk;
          // Transpose part of the c matrix
          for(long k = kk; k < klim; k++) {
            for(long j = jj; j < jlim; j+=2) {
              vcT[(j-jj)*koff + k-kk] = vc[k*no4 + j];
              vcT[(j-jj+1)*koff + k-kk] = vc[k*no4 + j + 1];
            }
          }
          for(long i = ii; i < ilim; i+=xB ) {
            for(long j = jj; j < jlim; j += yB) {
              // Prefetch the part of the a matrix that we will update
              #pragma GCC unroll xB
              for(long ui = 0; ui < xB; ui++) {
                #pragma GCC unroll yB/2
                for(long uj = 0; uj < yB; uj+=2) {
                  __builtin_prefetch(va + (i+ui)*no4 + j + uj, 1);
                }
              }
              // Do the summation loop
              v4d_t sum[XB][YB] = {0};
              for(long k = kk; k < klim; k++ ) {
                #pragma GCC unroll xB
                for(long ui = 0; ui < xB; ui++) {
                  #pragma GCC unroll yB
                  for(long uj = 0; uj < yB; uj++) {
                    sum[ui][uj] += b[(i+ui)*n + k] * vcT[(j-jj+uj)*koff + k-kk];
                  }
                }
              }
              // Update the a matrix
              if(kk == 0) {
                // First k block so we have no previous sum to add to
                #pragma GCC unroll 2
                for(long ui = 0; ui < xB; ui++) {
                  #pragma GCC unroll yB
                  for(long uj = 0; uj < yB; uj++) {
                    va[(i+ui)*no4 + j + uj] = sum[ui][uj];
                  }
                }
              } else {
                // We will add the results of the k-loop to the previous partial sums
                #pragma GCC unroll 2
                for(long ui = 0; ui < xB; ui++) {
                  #pragma GCC unroll yB
                  for(long uj = 0; uj < yB; uj++) {
                    va[(i+ui)*no4 + j + uj] += sum[ui][uj];
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Exercise

Explore the performance of the explicitly vectorized version! Compile with

gcc -O3 -march=native -o mmDx ex-mm-explicit.c ex-mm-main.c -lm

Do we get any performance difference? Do not forget that the code only supports matrix sizes that are multiples of 16!