Ben
Schwarz Jimmy Su
bschwarz@cs jimmysu@cs
In the follow sections we describe optimizations that we have implemented to
increase the performance of matrix-matrix multiplies.
· Register Tiling -- We found register tiling to be the single most important optimization. Due to the large number of registers---128 are available for floating point on the Itanium 2---we found it prudent to create a code generator to emit tiling code. This was also helpful in determining which tiling sizes to use, as we were able to generate all the possibilities and search the entire space. Our results indicate that the best tiling size is a 24x4 block, which uses almost all the available registers on the machine. In the following sections we show an optimization to reduce the storage needed for each block. The top 10 tiling sizes are displayed in the following chart. It should be noted that almost all of the top sizes use most of the available floating point registers.
Table 1. Best Register Tiling Sizes
|
Rank |
Size |
Registers |
|
1 |
24x4 |
121 |
|
2 |
12x8 |
105 |
|
3 |
17x5 |
103 |
|
4 |
13x7 |
105 |
|
5 |
20x4 |
101 |
|
6 |
15x6 |
106 |
|
7 |
14x6 |
99 |
|
8 |
16x5 |
97 |
|
9 |
9x11 |
109 |
|
10 |
10x9 |
101 |
· Cache Tiling -- On the Itanium 2 architecture, the L1 data cache is dedicated to integer types. Consequently, there will be no improvement from tiling the matrix operations to fit inside the L1 cache. We decided instead to tile for the L2 cache, which can hold 256 kilobytes of data. Our code generator is able to emit tiling code for both the L2 cache and also registers, which enabled us to do an exhaustive search for the best L2-blocking size. We found that L2-blocking is not as beneficial as register blocking. In most cases, there was no discernable difference between the block sizes. We did notice that by increasing the tiling size to larger blocks, we were able to stop some of the performance degradation on the larger-sized matrices. For this reason, we decided to tile for the L3 cache--which is 6 megabytes--as opposed to the smaller L2 cache. Without tiling for the L3 cache, we notice a dropoff of about 500 Mflops when the matrix size exceeds the size of the cache. With tiling, the dropoff is still evident, but not as larger. A similar, but smaller dropoff is evident when the matrix size exceeds the capacity of the L2 cache, but we were unable to prevent the performance loss by tiling. We chose to use an L3 tiling size of 240x80.
·
· Compiler Flags -- We tried using both gcc and ecc for this assignment. The executable produced by ecc is consistently better than the one by gcc. The difference can be as much as a factor of three. The default optimization level for ecc is -O2. We also tried -O3 for ecc. Counter intuitively, this option actually decreased performance by 10%. We also tried labeling local variables with the register qualifier. Both compilers ignore those annotations. We confirmed this by examining the executable; they were identical regardless of register qualifiers.
· Choice of Tile Size -- Although our search told us the best register tiling size was 24x4, we chose the second best. The reason can be seen in the following figure. There are intervals of size 24 between peaks. The peaks appear when the matrix size is a multiple of 24. The gradual after a peak corresponds to fringe handling. The graph was obtained when we used function generation to handle fringes. We chose 12x8 over 24x4 due to the reduced interval size.
Intervals Resulting
from Tiling Size

· Efficient Use of Storage -- A naive implementation of matrix-matrix multiply requires 3*m*n storage for each mxn block in the matrix. This is a serious bottleneck, especially when considering register tiling and how valuable registers can be. We chose to re-implement matrix-matrix multiply using a more efficient algorithm that we found in an open-source BLAS implementation. The idea behind the algorithm for computing [C] = [A] x [B] is to take the first element in B, and multiply it by the entire first column of A, storing the results in the first column of C. The process is then repeated for the second element in the first column of B, multiplying it by the second column of A and storing the results in the first column of C. This algorithm is more efficient because we can multiply all of [A] by one column of [B] and store the results in one column of [C]. The storage required at any given time is only nxm + n + 1, which will always be smaller than 3xnxm. This enabled us to use much larger register tiling sizes and get better performance. Due to the strategy we used to do the matrix multiplication, only one element of matrix B needs to be in a register. Initially, our register blocking chose to do the multiplication between axb and bxa blocks. Later we discovered that we can widen the bxa blocks to bxc blocks, where c is a multiple of a. This reduces the number of times we have to load an axb block into registers. We experimented with different sizes for c, and found that with c=3a, there was a gain of 100 Mflops in performance.
· Efficient Fringe Handling --
We provided three different solutions for fringe handling. They are simple
padding, padding only on the fringes, and function generation. Each method is
described below.
Simple Padding. This method makes a copy of the
A and B matrices if their dimensions are not a multiple of the tiling size.
Each column of the matrix is copied separately with the padding at the end of
make it a multiple of 12. Additional columns are added to make the number of
columns a multiple of 8. The padding is filled with zeros. If the dimensions of
matrix C are not a multiple of the tiling size, then the result of the
computation is stored in a buffer. At the end of the matrix multiplication, the
matrix is copied back to C. The advantage of this method is that no additional
fringe handling matrix multiply code needs to be generated. All multiplications
go at full speed with some wasted computation with the additional zeros. This
method works well for small matrices that have fringe areas close to the tiling
size. The downside of this method is that the overhead of copying the matrices
back and forth is not linear in terms of the dimensions of the matrix.
Fringe Padding. This is an improvement
upon the previous method. Padding is only done on the fringes to make the
fringe a multiple of the tiling size. This has the advantage over the previous
method that the overhead of copying is linear. But it complicates the matrix
multiply code. Several gotos were used to handle the fringe case, because the
padded fringes are stored in their own buffers separate from the storage of the
original matrices. The graph below shows the performance of simple padding and
fringe padding on small matrices. The saw glass can be explained by the fill
ratio of the matrix. At the peak, there is zero overhead from padding, since
the entire matrix is a multiple of the tiling size. The bottom occurs when the
fringe is small, so more padding is needed, thus there are more wasted
computation.
Fringe Padding vs.
Simple Padding

Function Generation. This was the first solution we attempted at fringe
handling. We generate 125 block matrix multiply routines to handle all the
possible fringe sizes. The routines work on rectangular blocks of size axb and
bxc. The values of a, b, and c can take the values from the following set, {1,
2, 4, 8, 16}. This ensures that we can always a divide the fringe area into
sizes that the routines can handle. We learned of this idea from [1]. The
amount of code generated for fringe handling using this method is large,
resulting in an additional 16,000 lines of C code. In terms of performance, it
has problems with instruction cache misses. The problem is illustrated in the
graph below. When no fringe handling is needed, only the full speed matrix
multiplication routine is executed. In that case, number of instruction cache
misses is low. When the fringe size is closed to a power of two, several fringe
handling routines are called. To minimize the number of instruction cache
misses, we used the procedure sorting optimization to place frequently called
fringe handling routines close together in the code. It reduced the number of
instruction cache misses. But the number of fringe handling routines called can
be large, sometimes more than 60 of them are called
during one matrix multiply. At the end, we chose to use the fringe padding
method to handle fringes.
I-Cache Performance

We did not try to implement a recursive data layout organization due to lack of
time. Also, our results indicate that the performance is mostly affected by
register blocking, and tiling in other levels of the cache is not as important.
Therefore the recursive data layout may not be worth the time to implement. We
also did not implement Strassen's method for matrix multiplication due to lack
of time. It is our prediction that Strassen's method will not work well on the
Itanium III architecture, however. Since a normal matrix-matrix multiply on a
2x2 matrix takes 8 multiplies and 7 adds, the multiplies and adds can be fused
together in the floating point unit and they may be executed for free.
Strassen's method increases the number of floating point adds so much that you
have to execute about 10 more on top of the fused computation. This was also a
motivating factor in deciding whether or not to implement it.
Graph 1. Matrix-Matrix Multiply Results

We have also evaluated our implementation on an architecture that we did not
optimize for--the Pentium III. The results are below.
Graph
2: MFlops on Pentium III

We tested our implementation on the slower nodes of the Millennium Cluster. The results follow.
Graph
3: MFlops on Millennium (slow nodes)

· [1] J. Bilmes, K. Asanovic, C. Chin,
J. Demmel. Optimizing Matrix Multiply using PHiPAC: a Portable,
High-Performance, ANSI C Coding Methodology. In International Conference on Supercomputing, 1996.
· Code generator: Codegen4d.java
· square_dgemm: scg12x8.c
· Fringe-handling generator: fringesig.h Fringetot.c.