Sometimes the simplest solution is the best. Other times, you have to deal with the memory hierarchy, fight the underlying processor, and ponder the guts of the compiler---the best solution becomes far from simple. You and your partner are going to be optimizing a matrix multiplication kernel for the Pentium IV. Good luck!
You need to tune a square matrix multiplication routine. We're providing two example implementations: the trivial unblocked one and an almost-trivial blocked one. The unblocked routine runs as poorly as you'd expect. The provided blocked routine doesn't do much better. You are to improve the performance of the blocked routine (or write your own) on a Pentium IV workstation (or other machine of your choice---let me know).
The performance of the two routines is shown above. The naive unblocked routine is the "B=1" curve, and is just the standard three nested loops. The blocked code (block size of 30, selected as the "best" in an exhaustive search of square tile sizes up to B=256) is shown in green. The peak performance of the machine used is 667 MFlop/s, and both implementations achieve less than a quarter of that performance. However, the unblocked routine's performance completely dies on matrices larger than 255 x 255, while the blocked routine gives more consistent performance for all the tested sizes. Note the performance dips at powers of 2 (cache conflicts). The machine used for this diagram has a 2 Mbyte external cache, and 3 matrices * 256x256 entries * 8 bytes/entry is 1.5 Mbytes. Obviously, there's overhead in both operation count and memory usage.
You need to write a dgemm.c that contains a function with the following C signature:
void
square_dgemm (const unsigned M,
const double *A, const double *B, double *C)
The necessary files are in tuning-matmul.tar. Included are the following:
Here is one particularly good example done last year. This handily beat the vendor implementation on square matrix sizes (by 5-10% on large sizes) on a Sun Ultra I/170 workstation.
Your group needs to submit your group's dgemm.c, Makefile (for compiler options) and a write-up. Your write-up should contain
To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. For the last requirement, try your tuned dgemm.c on another hardware platform (like Millennium or your home machine) and explain why it performs poorly. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)
Please tar up your group's dgemm.c, write-up, and associated files and mail me either an encoded tar file (uuencode or Base-64) or a URL from which I can retrieve the tar file.
The race will be held during a discussion section, exact date to be announced.
I've collected the associated files in a page, and the required matrix multiplication files are packages in a tar file.
Need information on gcc compiler flags? Read the manual. Want to do inline assembly in gcc? Here's a starting point (I'll try to find more useful links on inline assembly).
Intel has information on the Pentium IV processor. Local machines you can log into (using ssh) are saab.cs.berkeley.edu and camaro.cs.berkeley.edu. (NOTE: There is a primary user for camaro, so please talk to me about scheduling time to use it.)
Sun has some information on the UltraSPARC-IIi processor. You can get more direct information by logging into vw.cs.berkeley.edu or maruti.cs.berkeley.edu and running the following commands:
Like camaro,
Another set of Pentium III class machines is available through the
Millennium cluster.
The preceeding link has info on getting started.
To get more information on the performance of the memory
hierarchy on your machine, use a
previous class's membench program.
Another useful benchmark is available as lat_mem_rd benchmark
in lmbench.
The optimizations used in PHiPAC and ATLAS may be interesting.
Note: You cannot use PHiPAC or ATLAS to generate your matrix
multiplication kernels. You can write your own code
generator, however.
Previous years have worked on variations of this assignment.
Check out CS 267
Spring 1999,
Spring 1998 (?), and Spring
1996. The last one has an interesting twist; some students
beat IBM's own matrix multiplication routines.
Here is some relevant reading material. Definitely take a look at
this stuff.
Main CS267, Spring 2000
page and
the files associated with this assignment.
Readings