Optimizing Matrix-Matrix Multiply

Ben Schwarz         Jimmy Su
bschwarz@cs         jimmysu@cs


Optimizations Tested


In the follow sections we describe optimizations that we have implemented to increase the performance of matrix-matrix multiplies.

·  Register Tiling -- We found register tiling to be the single most important optimization. Due to the large number of registers---128 are available for floating point on the Itanium 2---we found it prudent to create a code generator to emit tiling code. This was also helpful in determining which tiling sizes to use, as we were able to generate all the possibilities and search the entire space. Our results indicate that the best tiling size is a 24x4 block, which uses almost all the available registers on the machine. In the following sections we show an optimization to reduce the storage needed for each block. The top 10 tiling sizes are displayed in the following chart. It should be noted that almost all of the top sizes use most of the available floating point registers.

Table 1. Best Register Tiling Sizes

Rank
(by MFlops)

 Size 

Registers
Used

1

24x4

121

2

12x8

105

3

17x5

103

4

13x7

105

5

20x4

101

6

15x6

106

7

14x6

99

8

16x5

97

9

9x11

109

10

10x9

101

 

·  Cache Tiling -- On the Itanium 2 architecture, the L1 data cache is dedicated to integer types. Consequently, there will be no improvement from tiling the matrix operations to fit inside the L1 cache. We decided instead to tile for the L2 cache, which can hold 256 kilobytes of data. Our code generator is able to emit tiling code for both the L2 cache and also registers, which enabled us to do an exhaustive search for the best L2-blocking size. We found that L2-blocking is not as beneficial as register blocking. In most cases, there was no discernable difference between the block sizes. We did notice that by increasing the tiling size to larger blocks, we were able to stop some of the performance degradation on the larger-sized matrices. For this reason, we decided to tile for the L3 cache--which is 6 megabytes--as opposed to the smaller L2 cache. Without tiling for the L3 cache, we notice a dropoff of about 500 Mflops when the matrix size exceeds the size of the cache. With tiling, the dropoff is still evident, but not as larger. A similar, but smaller dropoff is evident when the matrix size exceeds the capacity of the L2 cache, but we were unable to prevent the performance loss by tiling. We chose to use an L3 tiling size of 240x80.

·  Loop Unrolling -- When carrying out register tiling, the code generator does not unroll the innermost loop. The unrolling factor can be specified as a parameter to generator, which tells is how many times unroll. The advantage of unrolling is that less instructions are executed--in particular, some branch instructions are avoided--and the generated code may also enable further optimizations inside the compiler. For instance, some array references can be subscripted with constants instead of variables. The disadvantage is that the volume of code increases so there may be more misses in the instruction cache and more page faults. We exhaustively searched the unrolling space for all the register tiling combinations. Ultimately, the results indicate that unrolling does not increase performance in a noticeable way, but can certainly hinder performance when the code spills over the size of the instruction cache. Degradations up to 50% were seen. We chose not to include loop unrolling in our optimized matrix-matrix multiply because our evidence suggests it does not help much, and it may make our code perform poorly on other architectures.

·  Compiler Flags -- We tried using both gcc and ecc for this assignment. The executable produced by ecc is consistently better than the one by gcc. The difference can be as much as a factor of three. The default optimization level for ecc is -O2. We also tried -O3 for ecc. Counter intuitively, this option actually decreased performance by 10%. We also tried labeling local variables with the register qualifier. Both compilers ignore those annotations. We confirmed this by examining the executable; they were identical regardless of register qualifiers.

·  Choice of Tile Size -- Although our search told us the best register tiling size was 24x4, we chose the second best. The reason can be seen in the following figure. There are intervals of size 24 between peaks. The peaks appear when the matrix size is a multiple of 24. The gradual after a peak corresponds to fringe handling. The graph was obtained when we used function generation to handle fringes. We chose 12x8 over 24x4 due to the reduced interval size.

Intervals Resulting from Tiling Size

 

·  Efficient Use of Storage -- A naive implementation of matrix-matrix multiply requires 3*m*n storage for each mxn block in the matrix. This is a serious bottleneck, especially when considering register tiling and how valuable registers can be. We chose to re-implement matrix-matrix multiply using a more efficient algorithm that we found in an open-source BLAS implementation. The idea behind the algorithm for computing [C] = [A] x [B] is to take the first element in B, and multiply it by the entire first column of A, storing the results in the first column of C. The process is then repeated for the second element in the first column of B, multiplying it by the second column of A and storing the results in the first column of C. This algorithm is more efficient because we can multiply all of [A] by one column of [B] and store the results in one column of [C]. The storage required at any given time is only nxm + n + 1, which will always be smaller than 3xnxm. This enabled us to use much larger register tiling sizes and get better performance. Due to the strategy we used to do the matrix multiplication, only one element of matrix B needs to be in a register. Initially, our register blocking chose to do the multiplication between axb and bxa blocks. Later we discovered that we can widen the bxa blocks to bxc blocks, where c is a multiple of a. This reduces the number of times we have to load an axb block into registers. We experimented with different sizes for c, and found that with c=3a, there was a gain of 100 Mflops in performance.

·  Efficient Fringe Handling -- We provided three different solutions for fringe handling. They are simple padding, padding only on the fringes, and function generation. Each method is described below.

Simple Padding. This method makes a copy of the A and B matrices if their dimensions are not a multiple of the tiling size. Each column of the matrix is copied separately with the padding at the end of make it a multiple of 12. Additional columns are added to make the number of columns a multiple of 8. The padding is filled with zeros. If the dimensions of matrix C are not a multiple of the tiling size, then the result of the computation is stored in a buffer. At the end of the matrix multiplication, the matrix is copied back to C. The advantage of this method is that no additional fringe handling matrix multiply code needs to be generated. All multiplications go at full speed with some wasted computation with the additional zeros. This method works well for small matrices that have fringe areas close to the tiling size. The downside of this method is that the overhead of copying the matrices back and forth is not linear in terms of the dimensions of the matrix.

Fringe Padding. This is an improvement upon the previous method. Padding is only done on the fringes to make the fringe a multiple of the tiling size. This has the advantage over the previous method that the overhead of copying is linear. But it complicates the matrix multiply code. Several gotos were used to handle the fringe case, because the padded fringes are stored in their own buffers separate from the storage of the original matrices. The graph below shows the performance of simple padding and fringe padding on small matrices. The saw glass can be explained by the fill ratio of the matrix. At the peak, there is zero overhead from padding, since the entire matrix is a multiple of the tiling size. The bottom occurs when the fringe is small, so more padding is needed, thus there are more wasted computation.

Fringe Padding vs. Simple Padding


Function Generation. This was the first solution we attempted at fringe handling. We generate 125 block matrix multiply routines to handle all the possible fringe sizes. The routines work on rectangular blocks of size axb and bxc. The values of a, b, and c can take the values from the following set, {1, 2, 4, 8, 16}. This ensures that we can always a divide the fringe area into sizes that the routines can handle. We learned of this idea from [1]. The amount of code generated for fringe handling using this method is large, resulting in an additional 16,000 lines of C code. In terms of performance, it has problems with instruction cache misses. The problem is illustrated in the graph below. When no fringe handling is needed, only the full speed matrix multiplication routine is executed. In that case, number of instruction cache misses is low. When the fringe size is closed to a power of two, several fringe handling routines are called. To minimize the number of instruction cache misses, we used the procedure sorting optimization to place frequently called fringe handling routines close together in the code. It reduced the number of instruction cache misses. But the number of fringe handling routines called can be large, sometimes more than 60 of them are called during one matrix multiply. At the end, we chose to use the fringe padding method to handle fringes.

I-Cache Performance

 


Optimizations Not Tested


We did not try to implement a recursive data layout organization due to lack of time. Also, our results indicate that the performance is mostly affected by register blocking, and tiling in other levels of the cache is not as important. Therefore the recursive data layout may not be worth the time to implement. We also did not implement Strassen's method for matrix multiplication due to lack of time. It is our prediction that Strassen's method will not work well on the Itanium III architecture, however. Since a normal matrix-matrix multiply on a 2x2 matrix takes 8 multiplies and 7 adds, the multiplies and adds can be fused together in the floating point unit and they may be executed for free. Strassen's method increases the number of floating point adds so much that you have to execute about 10 more on top of the fused computation. This was also a motivating factor in deciding whether or not to implement it.


Results

 

Graph 1. Matrix-Matrix Multiply Results



We have also evaluated our implementation on an architecture that we did not optimize for--the Pentium III. The results are below.

Graph 2: MFlops on Pentium III

We tested our implementation on the slower nodes of the Millennium Cluster. The results follow.

Graph 3: MFlops on Millennium (slow nodes)


References

 

·  [1] J. Bilmes, K. Asanovic, C. Chin, J. Demmel.  Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology.  In International Conference on Supercomputing, 1996.


Code

 

·  Code generator: Codegen4d.java

·  square_dgemm: scg12x8.c

·  Fringe-handling generator: fringesig.h Fringetot.c.