mm_gen is a generator that produces C code, following the
PHiPAC coding guidelines, for one variant of the matrix multiply
operation $C = \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$,
where op(A), op(B), and C are, respectively, $M \times K$,
$K \times N$, and $M \times N$ matrices, $\alpha$ and $\beta$
are scalar parameters, and op(X) is
either transpose(X) or just X. Our individual procedures have a
lower-level interface than a BLAS GEMM and have no error checking.
For optimal efficiency, error checking should be performed by the
caller when necessary rather than unnecessarily by the callee. We
create a full BLAS-compatible GEMM by generating all required matrix
multiply variants and linking them with our GEMM-compatible interface,
which includes error checking.
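The split between an unchecked low-level routine and a checked GEMM-style interface can be sketched as follows. This is only an illustration of the idea, not code produced by mm_gen: the names mm_nn_unchecked and gemm_checked, the row-major layout, the double element type, and the integer error codes are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical low-level routine for one variant (no transposes):
 * C = alpha * A * B + beta * C, with A being M x K, B being K x N,
 * and C being M x N, all row-major with leading dimensions
 * lda, ldb, ldc. No argument checking is done here; the caller
 * is responsible for passing valid arguments. */
void mm_nn_unchecked(int M, int K, int N,
                     const double *A, int lda,
                     const double *B, int ldb,
                     double *C, int ldc,
                     double alpha, double beta)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
        }
    }
}

/* Hypothetical checked wrapper: validates the arguments once, then
 * dispatches to the unchecked routine. Returns 0 on success and a
 * nonzero code (an assumption of this sketch) on a bad argument. */
int gemm_checked(int M, int K, int N,
                 const double *A, int lda,
                 const double *B, int ldb,
                 double *C, int ldc,
                 double alpha, double beta)
{
    if (M < 0 || K < 0 || N < 0)
        return 1;                       /* negative dimension */
    if (lda < K || ldb < N || ldc < N)
        return 2;                       /* leading dimension too small */
    if (A == NULL || B == NULL || C == NULL)
        return 3;                       /* missing operand */
    mm_nn_unchecked(M, K, N, A, lda, B, ldb, C, ldc, alpha, beta);
    return 0;
}
```

A caller that already knows its arguments are valid, such as an outer blocked loop, would call the unchecked routine directly and pay no checking cost per call.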
mm_gen produces a cache-blocked matrix multiply [GL89, LRW91, MS95], restructuring the algorithm for unit-stride access and reducing the number of cache misses and unnecessary loads and stores. Under control of command-line parameters, mm_gen can produce blocking code for any number of levels of the memory hierarchy, including register, L1 cache, TLB, L2 cache, and so on. mm_gen's code can also perform copy optimization [LRW91], optionally with a different accumulator precision. The latest version can also generate the innermost loop with various forms of software pipelining.
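To make the idea of cache blocking with unit-stride inner loops concrete, here is a hand-written sketch of a single level of blocking. It is not the code mm_gen emits: the function name mm_blocked, the run-time block-size parameters mb, kb, nb, the contiguous row-major layout, and the fixed operation C += A * B are assumptions chosen to keep the example short. Copy optimization, multiple blocking levels, and software pipelining are omitted.

```c
#include <stddef.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Illustrative one-level blocked multiply: C += A * B, where
 * A is M x K, B is K x N, and C is M x N, all contiguous and
 * row-major. mb, kb, nb are the block sizes and must be positive;
 * as with the low-level routines in the text, nothing is checked. */
void mm_blocked(int M, int K, int N, int mb, int kb, int nb,
                const double *A, const double *B, double *C)
{
    for (int i0 = 0; i0 < M; i0 += mb) {
        int i_end = min_int(i0 + mb, M);
        for (int k0 = 0; k0 < K; k0 += kb) {
            int k_end = min_int(k0 + kb, K);
            for (int j0 = 0; j0 < N; j0 += nb) {
                int j_end = min_int(j0 + nb, N);
                /* Multiply one block of A by one block of B. The
                 * blocks are reused while they are still in cache. */
                for (int i = i0; i < i_end; i++) {
                    double *c_row = C + (size_t)i * N;
                    for (int k = k0; k < k_end; k++) {
                        /* Load the A element once into a local. */
                        double a = A[(size_t)i * K + k];
                        const double *b_row = B + (size_t)k * N;
                        /* Unit-stride access to both B and C. */
                        for (int j = j0; j < j_end; j++)
                            c_row[j] += a * b_row[j];
                    }
                }
            }
        }
    }
}
```

The min_int calls handle dimensions that are not multiples of the block sizes, so the routine gives the same result as the unblocked triple loop for any positive block sizes; only the order of memory accesses changes.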
A typical invocation of mm_gen is:
    mm_gen -cb M0 K0 N0 [ -cb M1 K1 N1 ] ...
where the register blocking is
Figure 1: Matrix blocking parameters