CS267 Assignment 1: Optimize Matrix Multiplication

Due Monday 12 February 2006 at 11:59PM

[ Problem | Implementation | Submission | Resources ]

Problem

You will optimize a routine to multiply square matrices. Matrix multiplication is a basic building block in many scientific computations; and since it is an O(n3) algorithm, these codes often spend a lot of their time in matrix multiplication. The most naive code to multiply matrices is short, sweet, simple, and very slow:

  for i = 1 to n
    for j = 1 to m
      for k = 1 to m
        C[i,j] = C[i,j] + A[i,k] * B[k,j]
      end
    end
  end

We're providing implementations of the trivial unoptimized matrix multiply code in C and Fortran, and a very simple blocked implementation in C. We're also providing a wrapper for the BLAS (Basic Linear Algebra Subroutines) library. The BLAS is a standard interface for building blocks like matrix multiplication, and many vendors provide optimized versions of the BLAS for their platforms. You may want to compare your routines to codes that others have optimized; if you do, think about the relative difficulty of using someone else's library to writing and optimizing your own routine. Knowing when and how to make use of standard libraries is an important skill in building fast programs!

The unblocked routines we provide do as poorly as you'd expect, and the blocked routine provided doesn't do much better.

The performance of the two routines is shown above. The naive unblocked routine is the "three loops" curve, and is just the standard three nested loops. The blocked code (block size of 57, selected as the "best" in an exhaustive search of square tile sizes up to 64) is shown in blue.

Characterizing the peak performance is slightly tricky. The processors here are AMD Opteron running at 2.2GHz (theoretical peak performance is 4.4 GFlop/s). These routines are achieving less than 25% of peak.

However, the unblocked routine's performance completely dies on matrices larger than 511 x 511, while the blocked routine gives more consistent performance for all the tested sizes (note, however, the huge dip in performance for 256 x 256 and 512 x 512 matrices). Note the performance dips at powers of 2 (cache conflicts). The machine used for this diagram has a 1 MiB L2 cache, and 3 matrices * 256x256 entries * 8 bytes/entry is 1.5 MiB, leading to the first mystery.

Implementation

You need to write a dgemm.c that contains a function with the following C signature:

      void
      square_dgemm (const unsigned M,
                    const double *A, const double *B, double *C)

If you would prefer, you can also write a Fortran function in a file fdgemm.f:

        subroutine sdgemm(M, A, B, C)
c
c       .. Parameters ..
        integer M
        double precision A(M,M)
        double precision B(M,M)
        double precision C(M,M)

Note that the matrices are stored in column-major order. Also, your program will actually be doing a multiply and add operation:

    C := C + A*B

Look at the code in basic_dgemm.c if you find this confusing.

The necessary files are in matmul.tar.gz. Included are the following:

Makefile
a sample Makefile, with some basic rules,
matmul.c
the driver program,
basic_dgemm.c
a very simple square_dgemm implementation,
blocked_dgemm.c
a slightly more complex square_dgemm implementation
fortran_dgemm.f
a very simple Fortran sdgemm implementation,
fortran_dgemm_wrapper.c
a wrapper that lets the C driver program call the Fortran implementation,
cfortran.h and cfortran.copyright
routines to make C and Fortran play nice, and
blas_dgemm.c
another wrapper that lets the C driver program call the dgemm routine in BLAS implementations,
timing.gnuplot
a sample gnuplot script to display the results,
matmul-hc.tar.gz
version of matmul.tar.gz using the PAPI library (load the papi module before compiling)

We will be testing on the 2.2GHz Opteron nodes on the Jacquard cluster. These are dual-processor nodes (but you will be using only one processor of these). See Jacquard Essentials for more detailed information on how to use the Jacquard cluster (and do read it before using Jacquard for the first time). Also see the Jacquard cluster page at NERSC for more information.

The y-axis in the gnuplot script plots is labeled MFlop/s, but really it should be "MFlop/s assuming you were using the standard algorithm." If you use something like Strassen's, we will still measure time/(2n3). However, the numerical error checking may haunt you.

You will get a lot of improvement from tiling, but you may want to play with other optimizations, too. Strassen-like algorithms, copy optimizations, recursive data layout, ... you can get some pretty elaborate optimization strategies. We'll cover some of the possibilities in discussion, but you might want to do some independent reading, too. The off-limit "optimizations" are those which weaken the floating point system: No flush-to-zero mode.

Submission

Your group (which we will be assign) needs to submit your group's dgemm.c (or fdgemm.c), Makefile (for compiler options) and a write-up. Your write-up should contain

To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. For the last requirement, try your tuned dgemm.c on another hardware platform (like the Itanium2 CITRIS nodes, a Sun box, Pentium 4, etc.) and explain why it performs poorly. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)

Please tar up your group's dgemm.c, write-up, and associated files and mail me (Marghoob) either an URL from which I can retrieve the tar file or an encoded tar file (uuencode or Base-64).

The race will be held during a discussion section, exact date to be announced.

Resources


[ Main CS 267 | GSI Page ] Last updated Jan 27, 2007