
   
The matrix multiply routines

All experiments were performed on a Sun Ultra-I/170 workstation with a 512-KByte L2 cache. The three automatically generated routines were:

1. A routine that uses only fully unrolled register-level blocking, with software pipelining. The block size is $2 \times 1 \times 8$.

2. A routine that uses L1 blocking with the same register-blocked core, which is embedded within the L1 routine. The M-loop of the L1 routine has been eliminated; the L1 block size in the K and N dimensions is $62 \times 64$.

3. A routine that uses L2, L1, and register-level blocking. The core and the L1 routine are the same as in the routines above. The L2 block size is $400 \times 496 \times 64$.
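
The nesting of the three blocking levels can be sketched in C as follows. This is a hand-written illustration, not the generated code: it assumes row-major storage, assumes that every block size above is given in M, K, N order, and uses hypothetical names (`core`, `mm_reg`, `mm_l1`, `matmul_blocked`). The small accumulator array stands in for registers, and the loops that a generator would fully unroll and software-pipeline are left as ordinary loops.

```c
#include <stddef.h>

/* Block sizes from the text, read as M x K x N (an assumption). */
enum { M0 = 2, N0 = 8 };               /* register block; the K step is 1 */
enum { K1 = 62, N1 = 64 };             /* L1 block in K and N */
enum { M2 = 400, K2 = 496, N2 = 64 };  /* L2 block */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Register-level core: C(mb x nb) += A(mb x kb) * B(kb x nb),
   with mb <= M0 and nb <= N0. acc[][] plays the role of registers. */
static void core(size_t mb, size_t kb, size_t nb,
                 const double *A, size_t lda,
                 const double *B, size_t ldb,
                 double *C, size_t ldc)
{
    double acc[M0][N0] = {{0.0}};
    for (size_t k = 0; k < kb; ++k)
        for (size_t i = 0; i < mb; ++i)
            for (size_t j = 0; j < nb; ++j)
                acc[i][j] += A[i * lda + k] * B[k * ldb + j];
    for (size_t i = 0; i < mb; ++i)
        for (size_t j = 0; j < nb; ++j)
            C[i * ldc + j] += acc[i][j];
}

/* Routine 1 style: register blocking only. */
static void mm_reg(size_t mb, size_t kb, size_t nb,
                   const double *A, size_t lda,
                   const double *B, size_t ldb,
                   double *C, size_t ldc)
{
    for (size_t i = 0; i < mb; i += M0)
        for (size_t j = 0; j < nb; j += N0)
            core(min_sz(M0, mb - i), kb, min_sz(N0, nb - j),
                 A + i * lda, lda, B + j, ldb, C + i * ldc + j, ldc);
}

/* Routine 2 style: L1 blocking in K and N only (no M-loop at this level);
   all mb rows are passed through each K1 x N1 panel of B. */
static void mm_l1(size_t mb, size_t kb, size_t nb,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t k = 0; k < kb; k += K1)
        for (size_t j = 0; j < nb; j += N1)
            mm_reg(mb, min_sz(K1, kb - k), min_sz(N1, nb - j),
                   A + k, lda, B + k * ldb + j, ldb, C + j, ldc);
}

/* Routine 3 style: L2 blocking around the L1 routine.
   Computes C += A * B for row-major A (M x K), B (K x N), C (M x N). */
void matmul_blocked(size_t M, size_t K, size_t N,
                    const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; i += M2)
        for (size_t k = 0; k < K; k += K2)
            for (size_t j = 0; j < N; j += N2)
                mm_l1(min_sz(M2, M - i), min_sz(K2, K - k), min_sz(N2, N - j),
                      A + i * K + k, K, B + k * N + j, N, C + i * N + j, N);
}
```

Under the assumed ordering the sizes nest evenly: the L2 block's K extent of 496 is eight L1 blocks of 62, its N extent of 64 is one L1 block, and the L1 block's N extent of 64 is eight register blocks of 8. The `min_sz` calls only handle matrices whose dimensions are not multiples of the block sizes.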



Richard Vuduc
1998-12-15