mm_gen is a generator that produces C code, following the PHiPAC coding guidelines, for one variant of the matrix multiply operation $C = \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where op(A), op(B), and C are respectively $M \times K$, $K \times N$, and $M \times N$ matrices, $\alpha$ and $\beta$ are scalar parameters, and op(X) is either transpose(X) or just X. Our individual procedures have a lower-level interface than a BLAS GEMM and perform no error checking. For optimal efficiency, error checking should be performed by the caller when necessary rather than unnecessarily by the callee. We create a full BLAS-compatible GEMM by generating all required matrix multiply variants and linking them with our GEMM-compatible interface, which includes error checking.
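The split between an unchecked low-level procedure and a checked GEMM-style wrapper can be sketched as follows. This is only an illustration: the names `mm_kernel_nn` and `checked_gemm_nn`, the row-major layout, the argument order, and the error codes are all assumptions made for the example, not the interface that mm_gen actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unchecked kernel for the no-transpose variant:
 * C = alpha * A * B + beta * C, row-major storage (an assumption).
 * The asserts are debug-only and compile away under -DNDEBUG,
 * so a release build performs no argument checking here. */
static void mm_kernel_nn(int M, int K, int N,
                         const double *A, const double *B, double *C,
                         int lda, int ldb, int ldc,
                         double alpha, double beta)
{
    assert(lda >= K && ldb >= N && ldc >= N);
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = 0; k < K; k++)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
    }
}

/* Hypothetical checked wrapper: validates arguments once, then calls the
 * kernel. Returns 0 on success or the 1-based position of the first bad
 * argument (an illustrative convention, not the BLAS one). */
int checked_gemm_nn(int M, int K, int N,
                    const double *A, const double *B, double *C,
                    int lda, int ldb, int ldc,
                    double alpha, double beta)
{
    if (M < 0) return 1;
    if (K < 0) return 2;
    if (N < 0) return 3;
    if (A == NULL) return 4;
    if (B == NULL) return 5;
    if (C == NULL) return 6;
    if (lda < K) return 7;
    if (ldb < N) return 8;
    if (ldc < N) return 9;
    mm_kernel_nn(M, K, N, A, B, C, lda, ldb, ldc, alpha, beta);
    return 0;
}
```

A caller that already knows its arguments are valid, for example an outer blocking loop, would call the kernel directly and skip the checks.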
mm_gen produces a cache-blocked matrix multiply [GL89, LRW91, MS95], restructuring the algorithm for unit stride, and reducing the number of cache misses and unnecessary loads and stores. Under control of command line parameters, mm_gen can produce blocking code for any number of levels of memory hierarchy, including register, L1 cache, TLB, L2 cache, and so on. mm_gen's code can also perform copy optimization [LRW91], optionally with a different accumulator precision. The latest version can also generate the innermost loop with various forms of software pipelining.
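As a rough illustration of cache blocking with a unit-stride inner loop, the hand-written sketch below applies a single level of blocking to $C \leftarrow C + AB$. It is not mm_gen output: the function name `blocked_mm`, the row-major contiguous layout, and the run-time block sizes `mb`, `kb`, `nb` are assumptions for the example, and it omits copy optimization, multiple blocking levels, and software pipelining.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B with one level of blocking.
 * A is M x K, B is K x N, C is M x N, all row-major and contiguous
 * (assumed layout). Block sizes are in matrix elements. */
void blocked_mm(int M, int K, int N,
                const double *A, const double *B, double *C,
                int mb, int kb, int nb)
{
    assert(mb > 0 && kb > 0 && nb > 0);
    for (int i0 = 0; i0 < M; i0 += mb) {
        int i_end = min_int(i0 + mb, M);
        for (int k0 = 0; k0 < K; k0 += kb) {
            int k_end = min_int(k0 + kb, K);
            for (int j0 = 0; j0 < N; j0 += nb) {
                int j_end = min_int(j0 + nb, N);
                /* Multiply one mb x kb block of A by one kb x nb block
                 * of B, accumulating into an mb x nb block of C. */
                for (int i = i0; i < i_end; i++) {
                    for (int k = k0; k < k_end; k++) {
                        double a = A[i * K + k];   /* reused across j */
                        /* Innermost loop walks B and C with unit stride. */
                        for (int j = j0; j < j_end; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

The `min_int` clamps handle matrix dimensions that are not multiples of the block sizes; a generator working with fixed block sizes could instead emit separate cleanup code for the fringes.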
A typical invocation of mm_gen is:
    mm_gen -cb M0 K0 N0 [ -cb M1 K1 N1 ] ...

where the register blocking is $M_0$, $K_0$, $N_0$, the L1-cache blocking is $M_1$, $K_1$, $N_1$, etc. The parameters $M_0$, $K_0$, and $N_0$ are specified in units of matrix elements, i.e., single, double, or extended precision floating-point numbers; $M_1$, $K_1$, and $N_1$ are specified in units of register blocks; $M_2$, $K_2$, and $N_2$ are in units of L1 cache blocks; and so on. For a particular cache level, say $i$, the code accumulates into a C destination block of size $M_i \times N_i$ units and uses A source blocks of size $M_i \times K_i$ units and B source blocks of size $K_i \times N_i$ units (see Figure 1).
Figure 1: Matrix blocking parameters
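Because each level's blocking parameters are expressed in units of the previous level's blocks, the extent of a level-i block in matrix elements is the running product of the parameters up to that level. The helper below sketches that conversion under this reading of the text; the `blocking` struct and the function `blocking_in_elements` are hypothetical names introduced for the example and are not part of mm_gen.

```c
#include <assert.h>
#include <stddef.h>

/* One "-cb M K N" triple (hypothetical representation). */
typedef struct { long m, k, n; } blocking;

/* Convert per-level blocking parameters, each given in units of the
 * previous level's block, into extents measured in matrix elements.
 * cb[0] is the register blocking, cb[1] the L1 blocking, and so on.
 * out[i] receives the level-i extents in elements. */
void blocking_in_elements(const blocking *cb, size_t levels, blocking *out)
{
    blocking acc = {1, 1, 1};
    for (size_t i = 0; i < levels; i++) {
        assert(cb[i].m > 0 && cb[i].k > 0 && cb[i].n > 0);
        acc.m *= cb[i].m;
        acc.k *= cb[i].k;
        acc.n *= cb[i].n;
        out[i] = acc;
    }
}
```

For example, with the purely illustrative parameters `-cb 2 4 3 -cb 12 8 10`, the register-level C block is 2 x 3 elements, and the L1-level blocks span 24 x 30 elements of C, 24 x 32 elements of A, and 32 x 30 elements of B.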