next up previous
Next: StatusAvailability, and Future Up: Optimizing Matrix Multiply using Previous: Cache Block Search

Results

 

We ran the search scripts to find the best register and L1 cache blocking parameters for six commercial workstation systems. These systems have different instruction set architectures and widely varying microarchitectures and memory hierarchies. The results are summarized in Table 1.

 

  table1494


Table 1: Workstation system details and results of matrix multiply parameter search.

The SGI R8K and R10K searches used the newer version of the code generator and search scripts, while other results were obtained with the alpha release. Figures 3-6 plot the performance of the resulting routines for all square matrices M = K = N = D, where D runs over powers of 2 and 3, multiples of 10, and primes, up to a maximum of 300. We compare with the performance of a vendor-optimized BLAS GEMM where available. In each case, PHiPAC yields a substantial fraction of peak performance and is competitive with vendor BLAS. Due to limited availability, we could only perform an incomplete search on the R8K and R10K, and so these are preliminary performance numbers. There is also potential for improvement on the other machines when we rerun with the newer version. For completeness, we also show the poor performance obtained when compiling a simple three nested loop version of GEMM with FORTRAN or C optimizing compilers.

   figure1536
Figure 3: Performance of single precision matrix multiply on a Sparcstation-20/61.

   figure1541
Figure: Performance of single precision matrix multiply on a 100 MHz SGI Indigo R4K. We show the SGEMM from SGI's libblas_mips2_serial.a library.

   figure1546
Figure 5: Performance of double precision matrix multiply on a HP 712/80i. We show DGEMM from the pa1.1 version of libvec.a in HP's compiler distribution.

   figure1552
Figure 6: Performance of double precision matrix multiply on an IBM RS/6000-590. We show the DGEMM from IBM's POWER2-optimized ESSL library.

   figure1557
Figure 7: Preliminary performance of double precision matrix multiply on an SGI R8K Power Challenge. We show the DGEMM from SGI's R8K-optimized libblas.

   figure1562
Figure 8: Preliminary performance of double precision matrix multiply on an SGI R10K Octane. We show the DGEMM from SGI's R10K-optimized libblas.

The PHiPAC methodology can also improve performance even if there is no scope for memory blocking. In Figure 9 we plot the performance of a dot product code generated using PHiPAC techniques. Although the parameters used were obtained using a short manual search, we can see that we are competitive with the assembly-coded SGI BLAS SDOT.

   figure1568
Figure 9: Performance of single precision unit-stride dot-product on a 100 MHz SGI R4K.

In some of the plots, we see that PHiPAC routines suffer from cache conflicts. Our measurements exaggerate this effect by including all power-of-2 sized matrices, and by allocating all regions contiguously in memory. For matrix multiply, we can reduce cache conflicts by copying to contiguous memory when pathological strides are encountered [LRW91]. Unfortunately, this approach does not help dot product. One drawback of the PHiPAC approach is that we can not control the order compilers schedule independent loads. We've occasionally found that exchanging two loads in the assembly output for dot product can halve the number of cache misses where conflicts occur, without otherwise impacting performance.


next up previous
Next: StatusAvailability, and Future Up: Optimizing Matrix Multiply using Previous: Cache Block Search

Richard Vuduc
Tue Nov 18 15:58:12 PST 1997