We ran the search scripts to find the best register and L1 cache blocking parameters for six commercial workstation systems. These systems have different instruction set architectures and widely varying microarchitectures and memory hierarchies. The results are summarized in Table 1.
Table 1: Workstation system details and results of matrix multiply parameter search.
The SGI R8K and R10K searches used the newer version of the code generator and search scripts, while other results were obtained with the alpha release. Figures 3-6 plot the performance of the resulting routines for all square matrices M = K = N = D, where D runs over powers of 2 and 3, multiples of 10, and primes, up to a maximum of 300. We compare with the performance of a vendor-optimized BLAS GEMM where available. In each case, PHiPAC yields a substantial fraction of peak performance and is competitive with vendor BLAS. Due to limited availability, we could only perform an incomplete search on the R8K and R10K, and so these are preliminary performance numbers. There is also potential for improvement on the other machines when we rerun with the newer version. For completeness, we also show the poor performance obtained when compiling a simple three nested loop version of GEMM with FORTRAN or C optimizing compilers.
Figure 3: Performance of single precision matrix multiply on a
Sparcstation-20/61.
Figure: Performance of single precision matrix multiply on a
100 MHz SGI Indigo R4K. We show the SGEMM from SGI's
libblas_mips2_serial.a library.
Figure 5: Performance of double precision matrix multiply on a
HP 712/80i. We show DGEMM from the
pa1.1 version of libvec.a in HP's compiler distribution.
Figure 6: Performance of double precision matrix multiply on an
IBM RS/6000-590. We show the DGEMM from
IBM's POWER2-optimized ESSL library.
Figure 7: Preliminary performance of double precision matrix multiply
on an SGI R8K Power Challenge. We show the
DGEMM from SGI's R8K-optimized libblas.
Figure 8: Preliminary performance of double precision matrix multiply
on an SGI R10K Octane. We show the
DGEMM from SGI's R10K-optimized libblas.
The PHiPAC methodology can also improve performance even if there is no scope for memory blocking. In Figure 9 we plot the performance of a dot product code generated using PHiPAC techniques. Although the parameters used were obtained using a short manual search, we can see that we are competitive with the assembly-coded SGI BLAS SDOT.
Figure 9: Performance of single precision unit-stride dot-product on a
100 MHz SGI R4K.
In some of the plots, we see that PHiPAC routines suffer from cache conflicts. Our measurements exaggerate this effect by including all power-of-2 sized matrices, and by allocating all regions contiguously in memory. For matrix multiply, we can reduce cache conflicts by copying to contiguous memory when pathological strides are encountered [LRW91]. Unfortunately, this approach does not help dot product. One drawback of the PHiPAC approach is that we can not control the order compilers schedule independent loads. We've occasionally found that exchanging two loads in the assembly output for dot product can halve the number of cache misses where conflicts occur, without otherwise impacting performance.