The matrix multiply routines
All experiments were performed on a Sun Ultra-I/170 workstation with
a 512-KByte L2 cache. The three automatically generated routines were:
1. A routine that uses only fully unrolled register-level blocking.
The block size is
.
It also uses software
pipelining.
2. A routine that uses L1 blocking with the same register-blocked
core, which is embedded within the L1 routine. However, the
M-loop of the L1 routine has been eliminated. The L1 block
size in the K and N dimensions is
.
3. A routine that uses L2, L1, and register-level blocking.
The core is the same as in the routines above. However, the
L2 block size is
,
and the L1
routine is the same as above.
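
The layering described in the list above can be sketched in C as follows. This is a minimal hand-written illustration, not the generated code: the names `core_2x2` and `matmul_blocked` and the block sizes `MR`, `NR`, `KB`, and `NB` are hypothetical, chosen only to make the example concrete, and do not reproduce the sizes used in the experiments. The sketch assumes row-major storage, reads "the M-loop has been eliminated" as meaning that cache blocking is applied only in K and N, and omits software pipelining and the L2 level for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes, for illustration only (not the paper's values). */
enum { MR = 2, NR = 2, KB = 64, NB = 64 };

/* Register-level core: C[MR x NR] += A[MR x kc] * B[kc x NR].
   The MR x NR block is fully unrolled and held in scalar variables. */
static void core_2x2(size_t kc, const double *a, size_t lda,
                     const double *b, size_t ldb, double *c, size_t ldc)
{
    double c00 = c[0],   c01 = c[1];
    double c10 = c[ldc], c11 = c[ldc + 1];
    for (size_t k = 0; k < kc; ++k) {
        double a0 = a[k],       a1 = a[lda + k];
        double b0 = b[k * ldb], b1 = b[k * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    c[0] = c00;   c[1] = c01;
    c[ldc] = c10; c[ldc + 1] = c11;
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and contiguous.
   Cache blocking is applied in K and N only; M is traversed directly
   in steps of the register block. Requires m % MR == 0 and n % NR == 0. */
void matmul_blocked(size_t m, size_t n, size_t k,
                    const double *A, const double *B, double *C)
{
    assert(m % MR == 0 && n % NR == 0);
    for (size_t kk = 0; kk < k; kk += KB) {
        size_t kc = (k - kk < KB) ? k - kk : KB;      /* K extent of this block */
        for (size_t jj = 0; jj < n; jj += NB) {
            size_t nc = (n - jj < NB) ? n - jj : NB;  /* N extent of this block */
            for (size_t i = 0; i < m; i += MR)
                for (size_t j = jj; j < jj + nc; j += NR)
                    core_2x2(kc, A + i * k + kk, k,
                             B + kk * n + j, n,
                             C + i * n + j, n);
        }
    }
}
```

An L2 level would, under the same assumptions, add one more pair of outer loops that walk larger blocks and call the L1 routine on each.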
Richard Vuduc
1998-12-15