Optimizing Matrix Multiply
Summer 2001

(Note: This page is an adaptation.)

Sometimes the simplest solution is the best. Other times, you have to deal with the memory hierarchy, fight the underlying processor, and ponder the guts of the compiler---the best solution becomes far from simple. You and your partner are going to be optimizing a matrix multiplication kernel for the Pentium IV. Good luck!

Problem

You need to tune a square matrix multiplication routine. We're providing two example implementations: the trivial unblocked one and an almost-trivial blocked one. The unblocked routine runs as poorly as you'd expect. The provided blocked routine doesn't do much better. You are to improve the performance of the blocked routine (or write your own) on a Pentium IV workstation (or other machine of your choice---let me know).

The performance of the two routines is shown above. The naive unblocked routine is the "B=1" curve, and is just the standard three nested loops. The blocked code (block size of 30, selected as the "best" in an exhaustive search of square tile sizes up to B=256) is shown in green. The peak performance of the machine used is 667 MFlop/s, and both implementations achieve less than a quarter of that performance. However, the unblocked routine's performance completely dies on matrices larger than 255 x 255, while the blocked routine gives more consistent performance for all the tested sizes. Note the performance dips at powers of 2 (cache conflicts). The machine used for this diagram has a 2 Mbyte external cache, and 3 matrices * 256x256 entries * 8 bytes/entry is 1.5 Mbytes. Obviously, there's overhead in both operation count and memory usage.

You need to write a dgemm.c that contains a function with the following C signature:

      void
      square_dgemm (const unsigned M, 
                    const double *A, const double *B, double *C)
    

The necessary files are in tuning-matmul.tar. Included are the following:

Makefile
a sample Makefile, with some useful compiler options,
basic_dgemm.c
a very simple square_dgemm implementation,
blocked_dgemm.c
a slightly more complex square_dgemm implementation (check the comments...),
matmul.c
the driver program, and
timing.gnuplot
a sample gnuplot script to display the results.

Here is one particularly good example done last year. This handily beat the vendor implementation on square matrix sizes (by 5-10% on large sizes) on a Sun Ultra I/170 workstation.

Submission

Your group needs to submit your group's dgemm.c, Makefile (for compiler options) and a write-up. Your write-up should contain

To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. For the last requirement, try your tuned dgemm.c on another hardware platform (like Millennium or your home machine) and explain why it performs poorly. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)

Please tar up your group's dgemm.c, write-up, and associated files and mail me either an encoded tar file (uuencode or Base-64) or a URL from which I can retrieve the tar file.

The race will be held during a discussion section, exact date to be announced.

Resources

I've collected the associated files in a page, and the required matrix multiplication files are packages in a tar file.

Need information on gcc compiler flags? Read the manual. Want to do inline assembly in gcc? Here's a starting point (I'll try to find more useful links on inline assembly).

Intel has information on the Pentium IV processor. Local machines you can log into (using ssh) are saab.cs.berkeley.edu and camaro.cs.berkeley.edu. (NOTE: There is a primary user for camaro, so please talk to me about scheduling time to use it.)

Sun has some information on the UltraSPARC-IIi processor. You can get more direct information by logging into vw.cs.berkeley.edu or maruti.cs.berkeley.edu and running the following commands:

Like camaro, has a primary user so please check with me if you need to do heavy or timing critical runs.

Another set of Pentium III class machines is available through the Millennium cluster. The preceeding link has info on getting started.

To get more information on the performance of the memory hierarchy on your machine, use a previous class's membench program. Another useful benchmark is available as lat_mem_rd benchmark in lmbench.

The optimizations used in PHiPAC and ATLAS may be interesting. Note: You cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can write your own code generator, however.

Previous years have worked on variations of this assignment. Check out CS 267 Spring 1999, Spring 1998 (?), and Spring 1996. The last one has an interesting twist; some students beat IBM's own matrix multiplication routines.

Readings

Here is some relevant reading material. Definitely take a look at this stuff.


Main CS267, Spring 2000 page and the files associated with this assignment.