CS267 Assignment 2 - Matrix Multiply Optimization

Steve Chien, Ravi Subramanian, Yunfei Deng

A. Optimizations
1. Inner Loop

One key to improving matrix multiply performance is to maximize the
number of floating point operations performed per memory access.  This
can be done at all levels of the memory hierarchy (registers, L1, and
L2), but is most important at the register level.  Thus, we decided
that our inner loop should be small enough to fit entirely within
registers.  From Sun's documentation, we find that we have 32 64-bit
floating point registers to use.

The naive three-loops matrix multiply must being in two matrix entries
into registers for each multiplication, gaining only 0.5
multiplies/memory op. Blocking with 2x2 matrices allows 1
multiply/memory op (two 2x2 matrices = 8 entries, and multiplying two
2x2 matrices requires 8 multiplies).  Our approach achieves 1.57
multiplies/memory op, as follows:

If we wish to find C = AxB, we compute C in chunks of 4x3 at a time.
Suppose we wish to compute C[0-3,0-2].  We do this using the loop
below:

	load C[0-3,0-2]
	for k = 0 to M-1 do
            load A(0,k), A(1,k), A(2,k), A(3,k)
            load B(k,0), B(k,1), B(k,2)
            compute the 12 products of the A and B entries
	    accumulate into C
        end for
	store into C

Thus we achieve 12 multiples for every seven values of A and B brought
in, for a 1.57 ratio.  Furthermore, we can fit all of this into
registers: 12 registers for C, 12 temporary registers for the results
of the multiplies, and 7 registers for A and B, for a total of 31.
The fact that there are 12 multiples and 12 adds per loop gives good
balance to the instructions.

After examining the assembly produced for this inner loop, we found
that there is danger that the registers needed for loading A and B
entries in iteration i may not be available because they are still
being used in multiplies in iteration i-1.  We tried seeing if using a
3x3 inner loop (1.5 multiples/memory op) would help, but this led to a
significant performance penalty.

While we know that other groups achieve very good performance with 2x2
matrix multiplies, we feel that the 3x4 is an excellent inner loop to
use.  Most of our performance increase comes from this idea.

2. L1 Blocking

We next optimized for the next level of the memory hierarchy.  Our
results from membench (see graph in nowmem.ps) show a 16 KB L1 data 
cache, confirmed by Sun's documentation.

Our inner loop as written above will run from row and column 0 to M-1,
gaining very poor L1 cache behavior, a fact we verified
experimentally. We therefore decided to block for L1.  One
inconvenient attribute about our loop is that it works in 4x3 blocks,
so that any natural block size must be a multiple of 12.  We chose 24
as our block size, since we can fit 3 (one for each of A,B, and C)
24x24 matrices in L1. (One matrix is 24x24x8 <4800 bytes).  We found
that this preserved performance of our inner loop to matrix sizes well
over 100x100.

3. Loop Unrolling

Another benefit of having a fixed L1 block size is that most of the
time we will be computing results in known 24x24 blocks.  This
constant allowed us to unroll our inner loop 24 times, removing the
loop branch prediction and hopefully making it easier for the compiler
to schedule the code.  For the fringe block cases where this was
feasible, we also unrolled those loop.

Unrolling the loop gave a good boost to performance.

4. Software Pipelining

We also manually software pipelined the unrolled inner loop.  Each
iteration of the inner loop is of the form L M A, where L is load, M
is multiply, and A is add.  If unrolled as L M A L M A L M A etc.,
each M phase may have to wait for the L phase, and each A phase may
wait for the M phase.  We software pipelined the loop to become A(1)
M(2) L(3) A(2) M(3) L(4) A(3) M(4) L(5) etc. where the number in
parentheses indicates to which iteration of the loop each phase
belongs.

We expected that this would all little benefit, since the compiler
could already schedule the fully unrolled loop, but were surprised
when this gave a nice boost to performance.

5. Special Cases

To speed computation of the fringe blocks, we used many special cases
including a different function for handling each block size less than
4x3.  We thus have separate cases for computing 1x1, 1x2, 2x1, 2x2,
3x1, 3x2, 4x1, and 4x2 blocks.  Each of these cases is broken into two
subcases, one for when we are computing a chain of 24 products, and
one for a shorter chain.  The first case is unrolled and pipelined.

Since fringe cases are not called as frequently as the main inner
loop, the benefits of these were marginal though noticeable.

6. Compiler Flags

None of this would have been possible had it not been for choosing
good compiler flags.  Flags that guarantee alignment of
double-precision variables and promise that there is no memory
aliasing were very helpful, as was the xO5 optimization level.

Optimization Not Implemented

1. More Special Cases

One idea which would have given a nice benefit would have been to
carry special casing to is extreme, and have a separate inner loop
routine for every matrix size.  Currently we spend a lot of time
computing offsets such as B[M], which requires a multiply by 8 and
addition.  By trying a few experiments with constant sizes, we found
that a good performance gain can come from doing this.  However we
shied away from writing all that code.

2. L2 Blocking

We could not discern any noticeable improvement from L2 blocking and
thus decided not to use it.

3. Strassen

We considered using Strassen at the higher levels, but decided that
the memory bandwidth required might be too costly.  This avenue
certainly deserves exploration, however.

4. 4x4 Inner Loop

We briefly tried a 4x4 inner loop due to its better being able to
handle powers of 2 and its making fringe cases easier.  Unfortunately
a 4x4 inner loop does not fit in registers (it requires 16 destination
registers, 16 temporaries, and 8 A and B entries), and we found that
3x4 gave better performance.  Still, since we assume we will be tested
on many powers of 2, this might have been good to insert as a special
case.

B. Results on NOW/ Analysis

These are our results on the NOW (see also the graph in nowperf.ps), 
and for comparison the results on

Our Results (NOW)

Size:  20  |  CPU: 100%  |  mflop/s: 140.206
Size:  24  |  CPU: 100%  |  mflop/s: 223.628
Size:  31  |  CPU: 100%  |  mflop/s: 180.598
Size:  32  |  CPU: 100%  |  mflop/s: 176.936
Size:  48  |  CPU: 100%  |  mflop/s: 232.231
Size:  64  |  CPU: 100%  |  mflop/s: 204.284
Size:  73  |  CPU: 100%  |  mflop/s: 209.208
Size:  96  |  CPU: 100%  |  mflop/s: 225.439
Size:  97  |  CPU: 100%  |  mflop/s: 209.237
Size: 127  |  CPU: 100%  |  mflop/s: 194.397
Size: 128  |  CPU: 100%  |  mflop/s: 177.664
Size: 129  |  CPU: 100%  |  mflop/s: 193.884
Size: 144  |  CPU: 100%  |  mflop/s: 206.301
Size: 163  |  CPU: 100%  |  mflop/s: 196.733
Size: 191  |  CPU: 100%  |  mflop/s: 195.462
Size: 192  |  CPU: 100%  |  mflop/s: 202.250
Size: 229  |  CPU: 100%  |  mflop/s: 177.998
Size: 255  |  CPU: 100%  |  mflop/s: 173.977
Size: 256  |  CPU: 100%  |  mflop/s: 143.595
Size: 257  |  CPU: 100%  |  mflop/s: 171.982
Size: 319  |  CPU: 100%  |  mflop/s: 172.054
Size: 320  |  CPU: 100%  |  mflop/s: 179.221
Size: 321  |  CPU: 100%  |  mflop/s: 181.721
Size: 417  |  CPU: 100%  |  mflop/s: 179.954
Size: 479  |  CPU: 100%  |  mflop/s: 178.189
Size: 480  |  CPU: 100%  |  mflop/s: 178.751
Size: 511  |  CPU:  84%  |  mflop/s: 127.399
Size: 512  |  CPU: 100%  |  mflop/s: 117.396

Notice the following:

1. Performance is best at multiples of 24 (24,48,96,144,192), where we
achieve over 200 mflop/s and frequently over 220 mflop/s.  Our less
efficient handling of special cases mean we have lower performance on
other sizes.

2. We do particularly poorly at large powers of 2, where cache
thrashing is taking place.  Columns of size 256 or 512 (2K and 4K) fit
neatly into cache and knock each other out.  Thus the dips near 128,
256, and 512.  We do much better at 255 and 257.

3. Our performance holds up quite well for larger matrices, though the
slow deterioration is an unfortunate result of weaker locality.
Strangely, adding a larger blocking level did not help.

C. PC/Linux Comparison

We next tried our program on a PC with a Pentium II processor, also
with a 16 KB cache.  (See pcmem.ps.)  The performance is shown below:

Size:  20  |  CPU:  36%  |  mflop/s:  32.242
Size:  24  |  CPU:  36%  |  mflop/s: 102.982
Size:  31  |  CPU:  36%  |  mflop/s:  46.239
Size:  32  |  CPU:  36%  |  mflop/s:  44.243
Size:  48  |  CPU:  35%  |  mflop/s:  99.103
Size:  64  |  CPU:  36%  |  mflop/s:  44.652
Size:  73  |  CPU:  36%  |  mflop/s:  84.348
Size:  96  |  CPU:  36%  |  mflop/s:  97.795
Size:  97  |  CPU:  35%  |  mflop/s:  92.549
Size: 127  |  CPU:  36%  |  mflop/s:  69.827
Size: 128  |  CPU:  36%  |  mflop/s:  63.922
Size: 129  |  CPU:  36%  |  mflop/s:  67.886
Size: 144  |  CPU:  36%  |  mflop/s:  96.159
Size: 163  |  CPU:  33%  |  mflop/s:  59.540
Size: 191  |  CPU:  36%  |  mflop/s:  57.736
Size: 192  |  CPU:  36%  |  mflop/s:  82.603
Size: 229  |  CPU:  31%  |  mflop/s:  70.158
Size: 255  |  CPU:  36%  |  mflop/s:  62.167
Size: 256  |  CPU:  36%  |  mflop/s:  59.794
Size: 257  |  CPU:  36%  |  mflop/s:  62.488
Size: 319  |  CPU:  36%  |  mflop/s:  78.348
Size: 320  |  CPU:  36%  |  mflop/s:  77.767
Size: 321  |  CPU:  36%  |  mflop/s:  75.960
Size: 417  |  CPU:  36%  |  mflop/s:  79.571
Size: 479  |  CPU:  36%  |  mflop/s:  71.344
Size: 480  |  CPU:  36%  |  mflop/s:  90.271
Size: 511  |  CPU:  36%  |  mflop/s:  66.311
Size: 512  |  CPU:  36%  |  mflop/s:  66.579

Our performance here shows the same characteristics as the NOW
performance, though obviously everything runs much slower.  There are
two good reasons for this.  The first is that the Intel chips have far
fewer floating point registers (eight).  Thus our inner loop overflows
the registers and runs very slowly.  Second, we spent a fair amount of
time finding the right compiler options for cc on the NOW.  We suspect
that gcc is not as good a compiler as Sun's, especially considering
that Sun was compiling for its own machine.