Matrix multiply summary Group: David Bindel Tak Pui Lou Yihan Shao Contents -------- lmbench-results -- Output of lmbench on fiat (the PPro I ran tests on) membench-us1.ps -- Memory benchmark graph for the UltraSPARC test-ppro-3nl.out -- Test output for three-nested-loops on PPro test-us1-3nl.out -- Test output for three-nested-loops on NOW test-ppro-opt.out -- Test output for our routine on PPro test-us1-opt.out -- Test output for our routine on NOW test-us1.ps -- Test results on NOW test-ppro.ps -- Test results on PPro code/* -- Our matrix multiply code, test driver, and Makefile Code structure -------------- Our routine uses allocated workspace to store blocked row-major format versions of B and C, with individual blocks that are 4-by-1. The computation is then organized in a "sum of outer products" style, i.e. the outermost loop is over the (blocked) k. The innermost loop runs over (blocked) j, so that all computations involving one blocked row of B will be performed before touching any of the remaining rows. As long as a block row (4m elements) fits largely into cache (L1 or L2), we will have low self-interference at that level. Since the L1 cache on the UltraSPARCs is 16K, enough to store 512 columns, we wouldn't expect too much self-interference on a row of B for the sizes tested (though cross-interference would make a difference). Further, this allows us to access the elements of B with unit stride, and so we hope to be able to use the entire contents of each cache line once it is loaded. This type of copying also protects us (some) from the sort of cross-interference that can occur when entries of A, B, and C all conflict for cache lines. We could improve things further by doing explicit L1 and L2 blocking, but we didn't get around to making a (working) version that did that well. At the innermost level of the computation, we multiply a 4-by-4 block from A by each vector from a 4-by-1 block from a blocked row of B. This multiplication is hand-unrolled. The 4-by-4 submatrix of A is loaded into registers before the start of the blocked j loop. Four separate internal routines are provided to deal with the four possible fringe-widths. The kernels for our code are written in Fortran 77. For a while, we pursued Fortran and C versions in parallel, but we decided that fiddling with C constructs to increase optimization potential was more difficult than just letting the Fortran optimizer do those operations for us. A C routine allocates workspace, calls the kernel for the correct fringe-width, and switches to the provided unoptimized blocked code if not enough space is available. Performance ----------- Results from a typical timing run on the NOW are included in test-us1-opt.out; for comparison, results for the three-nested loops code are in the file test-us1-3nl.out. Graphs comparing the performance of our code versus the three nested loops for the NOW and a Pentium Pro 200MHz with 16K L1 data cache and 256K L2 cache are in timing-us1.ps and timing-ppro.ps, respectively. Maximum theoretical performance on the UltraSPARC is 334 Mflop/s; on the Pentium it is 200 MFlop/s. The general results on the NOW are fairly intuitive. Our code starts a little slow; copying operands costs a bit. We could have computed the switchover and used a separate routine optimized for small matrices, among other optimizations that we contemplated but didn't do. Then it peaks, probably while we're still getting some cache reuse from C as well as from a row of B. Then we get gradual degradation of performance as we get less reuse from C due to self-interference, and probably start getting increasing cross-interference, too. The three-nested-loops structure has performance dips around powers of two due to interference effects; our version actually has little performance peaks near some powers of two (and multiples). I'm not completely sure why this is the case, actually. The "optimized" code actually runs somewhat slower on the Pentium than the nested loops (for most of sizes). The graph is rather unilluminating after that; there are too many spikes to see what is going on. We can see the cache interference effects at power-of-two matrix sizes for the three nested loops; the Mflop/s drops noticeably. We also see that things get slower for both codes as whatever we cache reuse we have drops off. Beyond that, I'm not confident that most of the other features aren't largely noise; I used Francois's code, complete with diagnostic "am I sharing the processor" loop, and there was some noticeable variance in the timing of that loop, too. This is not to say that there aren't other odd effects; just that I'm not sure what they are, and certainly don't think they are reliably exposed by this graph. We would not expect the Pentium code to work particularly efficiently in large part because it is not a load-store architecture like the UltraSPARC. The UltraSPARC has 32 floating point registers, which let us keep a submatrix of A and a column of B in registers. The Pentium does not really have any general purpose registers (EAX-EDX don't really count), and if I recall correctly, the floating point unit's storage is organized as an odd sort of stack. Regardless, the effect is that the loads and stores of the submatrices of A and columns of B don't go to registers; they go to the program stack. That doesn't do good things for performance.