Sometimes the simplest solution is the best. Other times, you have to deal with the memory hierarchy, and the best solution is far from simple. You and your partner are going to be optimizing a matrix multiplication kernel for the NoW. Do it well; your entry will be pitted against your classmates' entries in a race.
A few deficiencies have been brought to my attention:
First, you need to pick a partner. If you're in EECS, your partner cannot be. Since this assignment relies on low-level computer architecture knowledge, we need the EECS folks to help. One person from your group should mail me with your roster as soon as you can.
You need to tune a square matrix multiplication routine. We're providing two example implementations: the trivial unblocked one and an almost-trivial blocked one. The unblocked routine runs as poorly as you'd expect. The provided blocked routine doesn't do much better. You are to improve the performance of the blocked routine (or write your own) on the NoW.
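For orientation, here is a minimal sketch of what a blocked multiply can look like. It is not the provided blocked routine: the names blocked_dgemm_sketch, block_multiply, min_u, and BLOCK_SIZE are made up for illustration, and the block size of 32 is an arbitrary starting point. It also assumes column-major storage and that the routine accumulates C := C + A*B; confirm both against the provided files before relying on them.

```c
#include <assert.h>

/* Hypothetical block size; choosing it well is part of the tuning. */
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 32u
#endif

static unsigned
min_u (unsigned a, unsigned b)
{
  return a < b ? a : b;
}

/* C_block += A_block * B_block, where the blocks are mb x kb, kb x nb,
   and mb x nb pieces of matrices with leading dimension lda.
   ASSUMPTION: column-major storage, entry (i,j) at X[i + j*lda]. */
static void
block_multiply (unsigned lda, unsigned mb, unsigned nb, unsigned kb,
                const double *A, const double *B, double *C)
{
  unsigned i, j, k;
  for (j = 0; j < nb; ++j)
    for (i = 0; i < mb; ++i)
      {
        double cij = C[i + j*lda];
        for (k = 0; k < kb; ++k)
          cij += A[i + k*lda] * B[k + j*lda];
        C[i + j*lda] = cij;
      }
}

/* ASSUMPTION: accumulates C := C + A*B for M x M matrices. */
void
blocked_dgemm_sketch (const unsigned M,
                      const double *A, const double *B, double *C)
{
  unsigned i, j, k;
  assert (M == 0 || (A && B && C));   /* cheap sanity check */
  /* Walk the matrices one block at a time; edge blocks may be smaller. */
  for (j = 0; j < M; j += BLOCK_SIZE)
    for (i = 0; i < M; i += BLOCK_SIZE)
      for (k = 0; k < M; k += BLOCK_SIZE)
        block_multiply (M,
                        min_u (BLOCK_SIZE, M - i),
                        min_u (BLOCK_SIZE, M - j),
                        min_u (BLOCK_SIZE, M - k),
                        A + i + k*M, B + k + j*M, C + i + j*M);
}
```

The idea is that each call to block_multiply touches only three small blocks, so they have a chance of staying in cache while they are reused. The loop order, the block size, and the inner kernel are all things you would want to experiment with.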
The performance of the two routines is shown above. The peak performance of the machine used is around 600 MFlop/s, and neither of these achieves even a quarter of that performance. However, the unblocked routine's performance completely dies on matrices larger than 255 x 255, while the blocked routine gives (somewhat) consistent performance for all the tested sizes. The machine used for this diagram has a 2 Mbyte external cache, and 3 matrices * 256x256 entries * 8 bytes/entry is 1.5 Mbytes. Obviously, there's overhead in both operation count and memory usage. (The peak performance of the machines you'll be using is 334 MFlop/s, and the cache size is 512 Kbytes.)
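The cache arithmetic in the paragraph above is easy to redo for other sizes. The following is a small sketch; the function names are made up for illustration, and it assumes 8-byte entries as stated above. It deliberately ignores associativity, line size, and everything else competing for the cache, so treat its answers as rough upper bounds rather than predictions.

```c
#include <assert.h>
#include <stddef.h>

/* 8 bytes per double-precision entry, as in the text above. */
#define BYTES_PER_ENTRY 8u

/* Bytes needed to hold `count` square n x n matrices. */
size_t
working_set_bytes (size_t n, size_t count)
{
  return count * n * n * BYTES_PER_ENTRY;
}

/* Largest n such that three n x n matrices fit in cache_bytes.
   Simplification: pure capacity, no conflict misses or other data. */
size_t
largest_resident_n (size_t cache_bytes)
{
  size_t n = 0;
  while (working_set_bytes (n + 1, 3) <= cache_bytes)
    ++n;
  return n;
}

/* Re-check the figure quoted in the text: 3 * 256 * 256 * 8 bytes
   is 1.5 Mbytes, i.e. 1536 Kbytes. */
void
check_quoted_figure (void)
{
  assert (working_set_bytes (256, 3) == 1536u * 1024u);
}
```

Under these simplifications, largest_resident_n gives 295 for a 2 Mbyte cache and 147 for a 512 Kbyte cache, so on the smaller cache the three matrices stop fitting well before 256 x 256.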
You need to write a dgemm.c that contains a function with the following C signature:
void
square_dgemm (const unsigned M,
              const double *A, const double *B, double *C)
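As a starting point, a dgemm.c satisfying that signature could be as small as the sketch below. This is an untuned triple loop written for illustration, not a copy of the provided basic_dgemm.c. It assumes column-major storage and that the routine accumulates C := C + A*B; neither is spelled out here, so check the provided files for the actual convention.

```c
#include <assert.h>

/* dgemm.c: minimal, untuned sketch with the required signature.
   ASSUMPTIONS: column-major storage, so entry (i,j) of an M x M
   matrix X lives at X[i + j*M], and the result is C := C + A*B. */
void
square_dgemm (const unsigned M,
              const double *A, const double *B, double *C)
{
  unsigned i, j, k;
  assert (M == 0 || (A && B && C));   /* cheap sanity check */
  for (j = 0; j < M; ++j)
    for (i = 0; i < M; ++i)
      {
        /* Accumulate the dot product of row i of A and column j of B. */
        double cij = C[i + j*M];
        for (k = 0; k < M; ++k)
          cij += A[i + k*M] * B[k + j*M];
        C[i + j*M] = cij;
      }
}
```

Keeping a simple version like this around is handy as a correctness reference while you tune.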
The necessary files are in assignment2.tar. Included are the following:
Your group needs to submit your group's dgemm.c, Makefile (for compiler options) and a write-up. Your write-up should contain
To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. For the last requirement, try your tuned dgemm.c on another hardware platform (like Millennium or the T3E) and explain why it performs poorly. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)
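For the graphs, the usual conversion from a timing to MFlop/s counts 2*n^3 floating-point operations for an n x n multiply (n^3 multiplies and n^3 adds). A minimal sketch follows; the function names are made up for illustration, and the provided timing code may already report this number for you.

```c
#include <assert.h>

/* Floating-point operations in one n x n matrix multiply:
   n^3 multiplies plus n^3 adds. */
double
dgemm_flops (unsigned n)
{
  double d = (double) n;
  return 2.0 * d * d * d;
}

/* Rate in MFlop/s for one multiply that took `seconds`. */
double
mflops (unsigned n, double seconds)
{
  assert (seconds > 0.0);
  return dgemm_flops (n) / seconds / 1.0e6;
}

/* Fraction of the machine's peak rate, e.g. peak = 334.0 MFlop/s
   for the machines you'll be using. */
double
fraction_of_peak (double measured_mflops, double peak_mflops)
{
  assert (peak_mflops > 0.0);
  return measured_mflops / peak_mflops;
}
```

Plotting mflops against n for both your dgemm.c and basic_dgemm.c, with the peak drawn as a horizontal line, makes it easy to see where each routine falls out of cache.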
Please tar up your group's dgemm.c, write-up, and associated files and mail me either an encoded tar file (uuencode or Base-64) or a URL from which I can retrieve the tar file.
The race will be held during a discussion section, exact date to be announced.
I've collected the associated files in a page, and the required matrix multiplication files are packaged in a tar file.
The Friday, 4 February discussion section will be devoted to showing you how to work on the NoW and to discussing this assignment. Further written advice will be added here soon.
There is an old, but mostly accurate, NoW tutorial from a previous version of this class. Last year's TA has some additional information. Since then, the NoW has added a batch queuing system to help acquire reliable timings. I'll add more information on using that soon.
Sun has some information on the UltraSPARC-I processor, the one in the NoW boxes. There's also a brochure with a bit of information on the machines themselves. You can get more direct information by logging into the NoW and running the following commands:
To get more information on the performance of the memory hierarchy on the NoW, use a previous class's membench program. For other machines, look at the lat_mem_rd benchmark in lmbench. If you want to try lmbench on the T3E, mail me. I think I have lat_mem_rd working. I also have T3E and gcc-x86 ports of membench, but they only give qualitative results.
The optimizations used in PHiPAC and ATLAS may be interesting. Note: You cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can write your own code generator, however.
Previous years have worked on variations of this assignment. Check out Spring 1999, Spring 1998 (?), and Spring 1996. The last one has an interesting twist; some students beat IBM's own matrix multiplication routines...
Oh, and there will be a prize for the fastest matrix multiplication kernel...
E. Jason Riedy