DGEMM Optimization Results

Patrick R. Eaton and Michael H. Scott

This summary describes the results of experimenting with various optimizations of the double-precision generic matrix-matrix multiply code for square matrices on the Sun Ultra-SPARC I processor. These experiments were performed in connection with an assignment in CS267.

These experiments focus on the multiplication of square matrices. It is assumed that the data is stored in column major order. That is, rows, which are normally accessed with unit stride in C, are accessed with a stride equal to the dimension of the matrix. Columns are accessed with unit stride.

Background

The optimizations examined were largely influenced by the memory hierarchy of the SPARC-based machines used in the NOW for which the optimizations were targeted.

Machines containing the Ultra-SPARC I processor are equipped with two levels of cache. The L1 cache is divided into a data cache and instruction cache. The data cache is a 16 KB direct-mapped cache that is virtually indexed and virtually tagged. It is organized as 512 lines of 32 bytes each. Each line is divided into two 16-byte sub-blocks. The instruction cache is also a 16 KB cache that is virtually indexed and virtually tagged organized as 512 lines of 32 bytes each. It differs from the data cache, however, in that it is two-way set-associative.

The Ultra-SPARC I is designed to work with L2 caches of size between 512 KB and 4 MB. The machines in the NOW contain the minimum of 512 bytes.

The organization and performance of this cache hierarchy can be observed through simple memory benchmarks, such as the one shown below.

Plot of memory access times over data sets of varying
            sizes

This graph shows the time required to scan various sized data sets at several different strides. Time is shown on the vertical axis; stride is shown on the horizontal axis. Data sets of size from 4 KB to 4MB are plotted. It can be seen that small data sets (less than 16 KB) fit completely in the L1 cache and are thus accessed in constant time. Medium sized data sets (between 32 KB and 512 KB) fit completely in the L2 cache. These data sets also exhibit nearly constant time look up but at a slow rate because the data must be promoted into the L1 cache. Finally, large data sets (larger that 1 MB) overextend the abilities of the cache causing long access times.

Optimizations

We focussed on three types of optimizations: blocking, loop unrolling, pipelining/prefetching. Each of these will be discussed in turn.

Blocking

Blocking is the process of dividing a large problem into smaller chunks (blocks) to solve. One key reason to approach a problem through blocking is to improve cache effectiveness. Smaller problems are more likely to fit into the cache. This increases the performance of an algorithm because the data that it accesses is returned more quickly.

In order to provide flexibility for experimentation, we developed a general algorithm that could adapt to various, even non-square, block sizes. We discovered that larger block sizes performed better than smaller ones. This inspired an adaptive approach to the computation of the block size so that the same algorithm could be used for all matrix sizes. The key number used in blocking the matrix is computed as the largest power of two less than or equal to the dimension of the matrix up to a maximum of 128 elements.

We also discovered that performance increased when different matrices were accessed in different patterns. Surprisingly, it was most effective to access columns in a block in a serial fashion, one at a time. While working with a single column, all row blocks were scanned for computation of the partial result. Then the next column would be considered for a given set of row blocks. Working with columns individually did have the benefit of eliminating all but one type of fringe blocks, partially empty blocks at the edges of the matrices.

Loop Unrolling

Loop unrolling is the process of manually repeating the body of a loop several times. The key advantage of this technique is the reduced loop overhead and the introduction of new instructions which the compiler can use to create a more dense schedule. One disadvantage of this technique is that not all loops will contain a number of iterations that is a multiple of the amount of unrolling. This requires special clean-up code to account for partial iterations.

We found the optimal amount of unrolling to be four iterations. Performance fell with unrolling of two or eight iterations.

Software Pipelining/Prefetching

Software pipelining is a technique that divides work done in a single loop iteration into several pieces. The tasks are then spread over several iterations. The goal is to increase available parallelism, by increasing the distance between dependent tasks and still performing the same amount of work each iteration.

We examined software pipelining in hopes of increasing the distance of a load from its first use thereby hiding the load latency. Because of the wide range of times to access memory, however, we were unable to develop a software pipelining solution that worked well in all instances.

Final Solution

The final solution we implemented combined the blocking and loop unrolling techniques described above. The solution did not include, however, any type of software pipelining or data prefetching. The performance of this solution is presented below.

Performance

A graph showing the performance of the optimized algorithm as a function of matrix size is shown below. Also displayed on the plot is the performance of two other matrix multiply algorithms, a basic nested loop approach and a simple blocking approach.

Performance of optimized dgemm.

This graph shows that the optimizations are very effective when the size of the matrix is less than 255. The optimized code runs more than twice as fast as the blocked code and roughly 40% faster than the basic code. Unfortunately, the optimized code performs poorly on large matrices. This shows that our blocking approach is not as effective as the simple blocking approach on larger matrices. Consequently, larger matrices outstrip the cache's ability to keep needed data close to the processor.

Other Techniques

Several other techniques were considered but not implemented due to time and complexity. A further study of matrix-matrix multiplication may examine these ideas.

Memory Copy Operations

We have clearly seen that supplying the processor with enough data from memory is one of the biggest obstacles hindering performance on a matrix-matrix multiply. It would be interesting to examine if the overhead of copying the matrices into a more suitable memory layout could be amortized for large matrices. In the computation, one matrix is scanned with unit stride. The other, however, is scanned with a stride equal to the dimension of the matrix. If this matrix could be copied so that it would also be accessed with unit stride, it is conceivable that the caching benefits would pay for the costs of copying the matrix.

Matrix Symmetry and Antisymmetry

In an effort to improve data locality in the L1 cache, a well known property of square matrices may be applied. Any square matrix A can be decomposed into symmetric and antisymmetric parts, As and Aa, which are defined as follows,

As = 1/2 (A + AT)

Aa = 1/2 (A - AT).

The notation MT denotes the transposition of matrix M. The computation of C = AB can thus be decomposed to

C = (As + Aa) (Bs + Ba)

C = As Bs + As Ba + Aa Bs + Aa Ba.

The topologies of matrices A and B can be altered so the upper triangular region (including the diagonal) contained the symmetric part of each matrix while the anti-symmetric part was stored in the lower triangular region. This method requires four matrix-matrix multiplications, as seen above, but each multiplication involves multiplying n2/2 elements which would effectively "double" the amount of data in the L1 cache. This "doubling" is due to the fact that the elements of a symmetric matrix are related by Aji = Aij, and those of an antisymmetric matrix by Aji = -Aij. A blocked matrix multiply routine which understood this form of matrix storage could be written. The initial cost of rearranging the matrix data would certainly outweigh any speedup in performance for small matrix multiplications. However, it would be interesting to examine if multiplication of sufficiently large matrices could amortize this overhead. Alternatively, if these matrix decompositions have already been calculated for another purpose, this approach to multiplication could prove faster.

Performance on Millennium

Unfortunately, the memory benchmark used to show characteristics of the memory hierarchy on the NOW is not portable and could not be run on Millennium. Furthermore, knowing only that the Millennium runs on Pentium Pro and Pentium-II processors does not help us in determining the characteristics; both chips run with several different cache sizes. However, some conclusions can be drawn from the graph of performance shown below.

Performance of optimized dgemm on Millennium.

First, it can be seen that the optimizations do indeed target the Ultra-SPARC I process. The optimizations actually slow execution on the Pentium-class machines. As on the Ultra-SPARC, the blocked algorithm performs consistently across a wide range of input sizes. Also, as on the Ultra-SPARC, the basic algorithm performs better than the blocked algorithm on small inputs but slows considerably as the input size grows larger than the cache. Both algorithms suffer considerably when the input size is a power of two. Interestingly, the optimized algorithm actually performs slightly better on inputs of size power of two. We believe this shows that the optimized algorithm has begun to make more effective use of the cache.


Patrick R. Eaton / eaton@cs.berkeley.edu
Michael H. Scott / mhscott@ce.berkeley.edu

Last modified: Wed Feb 16 12:40:32 PST 2000