Matrix multiply summary

Group:
David Bindel
Tak Pui Lou
Yihan Shao


Contents
--------

lmbench-results -- Output of lmbench on fiat (the PPro I ran tests on)
membench-us1.ps -- Memory benchmark graph for the UltraSPARC
test-ppro-3nl.out -- Test output for three-nested-loops on PPro
test-us1-3nl.out  -- Test output for three-nested-loops on NOW 
test-ppro-opt.out -- Test output for our routine on PPro 
test-us1-opt.out  -- Test output for our routine on NOW 
test-us1.ps -- Test results on NOW
test-ppro.ps -- Test results on PPro
code/* -- Our matrix multiply code, test driver, and Makefile


Code structure
--------------

Our routine uses allocated workspace to store blocked row-major format
versions of B and C, with individual blocks that are 4-by-1.  The
computation is then organized in a "sum of outer products" style, i.e.
the outermost loop is over the (blocked) k.  The innermost loop runs
over (blocked) j, so that all computations involving one blocked row
of B will be performed before touching any of the remaining rows.  As
long as a block row (4m elements) fits largely into cache (L1 or L2),
we will have low self-interference at that level.  Since the L1 cache
on the UltraSPARCs is 16K, enough to store 512 columns, we wouldn't
expect too much self-interference on a row of B for the sizes tested
(though cross-interference would make a difference).  Further, this
allows us to access the elements of B with unit stride, and so we hope
to be able to use the entire contents of each cache line once it is
loaded.

This type of copying also protects us (some) from the sort of
cross-interference that can occur when entries of A, B, and C all
conflict for cache lines.

We could improve things further by doing explicit L1 and L2 blocking,
but we didn't get around to making a (working) version that did that
well.

At the innermost level of the computation, we multiply a 4-by-4 block
from A by each vector from a 4-by-1 block from a blocked row of B.
This multiplication is hand-unrolled.  The 4-by-4 submatrix of A is
loaded into registers before the start of the blocked j loop.

Four separate internal routines are provided to deal with the four
possible fringe-widths.

The kernels for our code are written in Fortran 77.  For a while, we
pursued Fortran and C versions in parallel, but we decided that
fiddling with C constructs to increase optimization potential was more
difficult than just letting the Fortran optimizer do those operations
for us.  A C routine allocates workspace, calls the kernel for the
correct fringe-width, and switches to the provided unoptimized blocked
code if not enough space is available.



Performance
-----------

Results from a typical timing run on the NOW are included in
test-us1-opt.out; for comparison, results for the three-nested loops
code are in the file test-us1-3nl.out.  Graphs comparing the
performance of our code versus the three nested loops for the NOW and
a Pentium Pro 200MHz with 16K L1 data cache and 256K L2 cache are in
timing-us1.ps and timing-ppro.ps, respectively.  Maximum theoretical
performance on the UltraSPARC is 334 Mflop/s; on the Pentium it is 200
MFlop/s.

The general results on the NOW are fairly intuitive.  Our code starts
a little slow; copying operands costs a bit.  We could have computed
the switchover and used a separate routine optimized for small
matrices, among other optimizations that we contemplated but didn't
do.  Then it peaks, probably while we're still getting some cache
reuse from C as well as from a row of B.  Then we get gradual
degradation of performance as we get less reuse from C due to
self-interference, and probably start getting increasing
cross-interference, too.

The three-nested-loops structure has performance dips around powers of
two due to interference effects; our version actually has little
performance peaks near some powers of two (and multiples).  I'm not
completely sure why this is the case, actually.

The "optimized" code actually runs somewhat slower on the Pentium than
the nested loops (for most of sizes).  The graph is rather
unilluminating after that; there are too many spikes to see what is
going on.  We can see the cache interference effects at power-of-two
matrix sizes for the three nested loops; the Mflop/s drops noticeably.
We also see that things get slower for both codes as whatever we cache
reuse we have drops off.  Beyond that, I'm not confident that most of
the other features aren't largely noise; I used Francois's code,
complete with diagnostic "am I sharing the processor" loop, and there
was some noticeable variance in the timing of that loop, too.  This is
not to say that there aren't other odd effects; just that I'm not sure
what they are, and certainly don't think they are reliably exposed by
this graph.

We would not expect the Pentium code to work particularly efficiently
in large part because it is not a load-store architecture like the
UltraSPARC.  The UltraSPARC has 32 floating point registers, which let
us keep a submatrix of A and a column of B in registers.  The Pentium
does not really have any general purpose registers (EAX-EDX don't
really count), and if I recall correctly, the floating point unit's
storage is organized as an odd sort of stack.  Regardless, the effect
is that the loads and stores of the submatrices of A and columns of B
don't go to registers; they go to the program stack.  That doesn't do
good things for performance.