
Matrix Multiply Generator

 

mm_gen is a generator that produces C code, following the PHiPAC coding guidelines, for one variant of the matrix multiply operation $C = \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where op(A), op(B), and C are, respectively, $M \times K$, $K \times N$, and $M \times N$ matrices, $\alpha$ and $\beta$ are scalar parameters, and op(X) is either transpose(X) or just X. Our individual procedures have a lower-level interface than a BLAS GEMM and have no error checking. For optimal efficiency, error checking should be performed by the caller when necessary rather than unnecessarily by the callee. We create a full BLAS-compatible GEMM by generating all required matrix multiply variants and linking with our GEMM-compatible interface that includes error checking.
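To make the operation concrete, the following is a minimal, unoptimized reference sketch of the multiply described above. It is not mm_gen output: the name ref_mm, its argument order, the row-major layout, and the transA/transB flags are assumptions made for illustration. Like the generated procedures, it does no error checking.

```c
#include <stddef.h>

/* Hypothetical reference routine (not produced by mm_gen).
 * Computes C = alpha * op(A) * op(B) + beta * C, where op(A) is M x K,
 * op(B) is K x N, and C is M x N. All arrays are assumed row-major.
 * A nonzero transA (transB) selects op(X) = transpose(X), in which case
 * the stored array is K x M (N x K). lda, ldb, ldc are the row strides
 * of the stored arrays. No error checking is performed. */
void ref_mm(int transA, int transB, size_t M, size_t K, size_t N,
            double alpha, const double *A, size_t lda,
            const double *B, size_t ldb,
            double beta, double *C, size_t ldc)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < K; k++) {
                /* Element (i,k) of op(A) and element (k,j) of op(B). */
                double a = transA ? A[k * lda + i] : A[i * lda + k];
                double b = transB ? B[j * ldb + k] : B[k * ldb + j];
                acc += a * b;
            }
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
    }
}
```

A routine of this shape is useful mainly as a correctness check against generated code; each of the four transpose combinations corresponds to one variant that a full GEMM needs.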

mm_gen produces a cache-blocked matrix multiply [GL89, LRW91, MS95], restructuring the algorithm for unit stride, and reducing the number of cache misses and unnecessary loads and stores. Under control of command line parameters, mm_gen can produce blocking code for any number of levels of memory hierarchy, including register, L1 cache, TLB, L2 cache, and so on. mm_gen's code can also perform copy optimization [LRW91], optionally with a different accumulator precision. The latest version can also generate the innermost loop with various forms of software pipelining.
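The idea of cache blocking with a unit-stride inner loop can be sketched by hand for a single level of blocking. The code below is an illustrative assumption, not what mm_gen emits: the name blocked_mm, the row-major contiguous layout, and the block-size arguments MB, KB, and NB (in elements, each assumed to be at least 1) are all hypothetical. It computes C += A * B with no transposes or scaling.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical one-level blocked multiply: C += A * B.
 * A is M x K, B is K x N, C is M x N, all row-major and contiguous.
 * MB, KB, NB are block sizes in elements and must be >= 1. */
void blocked_mm(size_t M, size_t K, size_t N,
                const double *A, const double *B, double *C,
                size_t MB, size_t KB, size_t NB)
{
    for (size_t i0 = 0; i0 < M; i0 += MB) {
        size_t i1 = min_sz(i0 + MB, M);
        for (size_t k0 = 0; k0 < K; k0 += KB) {
            size_t k1 = min_sz(k0 + KB, K);
            for (size_t j0 = 0; j0 < N; j0 += NB) {
                size_t j1 = min_sz(j0 + NB, N);
                /* Multiply one A block by one B block into one C block. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t k = k0; k < k1; k++) {
                        double a = A[i * K + k]; /* loaded once, reused along j */
                        for (size_t j = j0; j < j1; j++) {
                            /* Unit stride in j for both B and C. */
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

Each block of C is revisited once per block of K, so the working set of the three inner loops is bounded by the block sizes rather than by the full matrix dimensions. Generated code would additionally unroll the innermost block into explicit register-level operations.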

A typical invocation of mm_gen is:

   mm_gen -cb M0 K0 N0 [ -cb M1 K1 N1 ] ...
where the register blocking is $M_0$, $K_0$, $N_0$, the L1-cache blocking is $M_1$, $K_1$, $N_1$, etc. The parameters $M_0$, $K_0$, and $N_0$ are specified in units of matrix elements, i.e., single, double, or extended precision floating-point numbers; $M_1$, $K_1$, and $N_1$ are specified in units of register blocks; $M_2$, $K_2$, and $N_2$ are in units of L1 cache blocks; and so on. For a particular cache level, say $i$, the code accumulates into a C destination block of size $M_i \times N_i$ units and uses A source blocks of size $M_i \times K_i$ units and B source blocks of size $K_i \times N_i$ units (see Figure 1).
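Because each level is expressed in units of the level below it, the size of a level-i block in matrix elements is the product of the parameters for levels 0 through i. The helper below is a small sketch of that arithmetic; the type block_dims and the function block_in_elements are hypothetical names, not part of mm_gen.

```c
#include <stddef.h>

/* One "-cb M K N" triple. Hypothetical type for illustration only. */
typedef struct {
    size_t m, k, n;
} block_dims;

/* levels[0] holds the register blocking in matrix elements; levels[i]
 * (i > 0) is in units of level i-1 blocks. Returns the level-`level`
 * blocking converted to matrix elements. */
block_dims block_in_elements(const block_dims *levels, size_t level)
{
    block_dims e = {1, 1, 1};
    for (size_t i = 0; i <= level; i++) {
        e.m *= levels[i].m;
        e.k *= levels[i].k;
        e.n *= levels[i].n;
    }
    return e;
}
```

As a made-up example, the parameters `-cb 2 2 2 -cb 10 20 30` would give an L1 blocking of 20, 40, 60 in elements: a C destination block of 20 x 60 elements, A source blocks of 20 x 40 elements, and B source blocks of 40 x 60 elements.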

Figure 1: Matrix blocking parameters





Richard Vuduc
Tue Nov 18 15:58:12 PST 1997