In this section, we examine the code produced by mm_gen for the operation C = C + A*B where A (respectively B, C) is an M×K (respectively K×N, M×N) matrix. Figure 2 lists the L1 cache blocking core code comprising the 3 nested loops, M, N, and K. mm_gen does not vary the loop permutation [MS95, LRW91] because the resulting gains in locality are subsumed by the method described below.
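For reference, the operation the generated code must preserve is the textbook triple loop below (a minimal sketch assuming row-major storage; matmul_ref is an illustrative name, not mm_gen output):

```c
#include <assert.h>

/* Reference semantics of C = C + A*B for row-major matrices:
   A is M x K, B is K x N, C is M x N. */
static void matmul_ref(int M, int K, int N,
                       const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```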
The outer M loop in Figure 2 maintains pointers c0 and a0 to rows of register blocks in the C and A matrices. It also maintains end pointers (ap0_endp and cp0_endp) used for loop termination. The middle N loop maintains a pointer b0 to columns of register blocks in the B matrix, and maintains a pointer cp0 to the current C destination register block. The N loop also maintains separate pointers (ap0_0 through ap0_{M0-1}) to successive rows of the current A source block. It also initializes a pointer bp0 to the current B source block. We assume local variables can be held in registers, so our code uses many pointers to minimize both memory references and integer multiplies.
The K loop iterates over source matrix blocks and accumulates into the same destination block. We assume that the floating-point registers can hold an M0×N0 accumulator block, so this block is loaded once before the K loop begins and stored after it ends. The K loop updates the set of pointers to the A source block, one of which is used for loop termination.
The parameter K0 controls the extent of inner-loop unrolling, as can be seen in Figure 2. The unrolled core loop performs K0 outer products accumulating into the C destination block. We code the outer products by loading one row of the B block and one element of the A block at a time, then performing N0 multiply-accumulates. The C code uses M0 + N0 memory references per 2*M0*N0 floating-point operations in the inner K loop, while holding M0*N0 + N0 + 1 values in local variables. While the intent is that these local variables map to registers, the compiler is free to reorder all of the independent loads and multiply-accumulates to trade increased memory references for reduced register usage. The compiler also requires additional registers to name intermediate results propagating through machine pipelines.
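The shape of this K loop can be sketched by hand for hypothetical blocking parameters M0 = N0 = 2 and K0 = 2 (mm_gen emits this pattern for arbitrary M0, N0, K0; all identifiers below are illustrative, though ap0_0, ap0_1, and bp0 mirror the pointer names used in the text). The 2×2 accumulator block is loaded into locals once before the K loop and stored once after it; each unrolled step is one outer product:

```c
#include <assert.h>

/* Sketch of the register-blocked K loop with M0 = N0 = 2, K0 = 2.
   a0/b0 point at the current A/B source blocks, c at the C destination
   block; lda, ldb, ldc are the row strides of A, B, C. */
static void kloop_2x2(int K, int lda, int ldb, int ldc,
                      const double *a0, const double *b0, double *c)
{
    double c00 = c[0],   c01 = c[1];       /* load accumulator block once */
    double c10 = c[ldc], c11 = c[ldc + 1];
    const double *ap0_0 = a0;              /* row 0 of the A source block */
    const double *ap0_1 = a0 + lda;        /* row 1 of the A source block */
    const double *bp0 = b0;
    for (int k = 0; k < K; k += 2) {       /* K assumed a multiple of K0 = 2 */
        double b0v = bp0[0], b1v = bp0[1];      /* one row of the B block */
        double a0v = *ap0_0++, a1v = *ap0_1++;  /* one element per A row */
        c00 += a0v * b0v; c01 += a0v * b1v;
        c10 += a1v * b0v; c11 += a1v * b1v;
        b0v = bp0[ldb]; b1v = bp0[ldb + 1];     /* second unrolled step */
        a0v = *ap0_0++; a1v = *ap0_1++;
        c00 += a0v * b0v; c01 += a0v * b1v;
        c10 += a1v * b0v; c11 += a1v * b1v;
        bp0 += 2 * ldb;
    }
    c[0] = c00;   c[1] = c01;              /* store block after the K loop */
    c[ldc] = c10; c[ldc + 1] = c11;
}
```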
The code we have so far described is valid only when M, K, and N are integer multiples of M0, K0, and N0, respectively. In the general case, mm_gen also includes code that operates on power-of-two-sized fringe strips, i.e., of widths 2^0 up through the largest power of two less than L, where L is M0, K0, or N0. We can therefore manage any fringe size from 1 to L-1 by executing an appropriate combination of fringe code. The resulting code size growth is logarithmic in the register blocking (i.e., O(log M0 log K0 log N0)) yet maintains good performance. To reduce the demands on the instruction cache, we arrange the code into several independent sections, the first handling the matrix core and the remainder handling the fringes.
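The fringe-dispatch idea amounts to reading off the binary digits of the fringe size: a fringe of width f (1 ≤ f ≤ L-1) is covered by exactly the power-of-two strip variants corresponding to the set bits of f. In mm_gen each width has its own generated code section; the sketch below (fringe_widths is an illustrative name) merely records which strip widths would execute:

```c
#include <assert.h>

/* Decompose a fringe width f into power-of-two strip widths.
   Returns the number of strips and fills widths_out with each
   strip width, smallest first. */
static int fringe_widths(int f, int widths_out[32])
{
    int n = 0;
    for (int w = 1; f > 0; w <<= 1, f >>= 1)
        if (f & 1)
            widths_out[n++] = w;  /* run the strip code of width w */
    return n;
}
```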
The latest version of the generator can optionally produce code with a software-pipelined inner loop. Each outer product consists of a load, a multiply, and an accumulate section. We group these sections into three software pipelined code variants: two two-stage pipes with stages [load-multiply, accumulate] and [load, multiply-accumulate], and a three-stage pipe with stages [load, multiply, accumulate]. Recent results show that this software pipelining can result in an appreciable speedup.
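The two-stage [load, multiply-accumulate] organization can be illustrated on a simple dot product (a hand-written sketch, not generator output; dot_pipelined is an illustrative name). Iteration i's operands are loaded while iteration i-1's multiply-accumulate retires, so the load latency overlaps the arithmetic:

```c
#include <assert.h>

/* Two-stage software pipeline for a dot product: stage 1 loads the
   next operands, stage 2 multiply-accumulates the previous ones. */
static double dot_pipelined(int n, const double *x, const double *y)
{
    double acc = 0.0;
    if (n <= 0) return acc;
    double xv = x[0], yv = y[0];     /* prologue: first load stage */
    for (int i = 1; i < n; i++) {
        double xn = x[i], yn = y[i]; /* load stage for iteration i */
        acc += xv * yv;              /* MAC stage for iteration i-1 */
        xv = xn; yv = yn;
    }
    acc += xv * yv;                  /* epilogue: final MAC stage */
    return acc;
}
```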
Because of the separation between matrix dimension and matrix stride, we can implement higher levels of cache blocking as calls to lower level routines with appropriately sized sub-matrices.
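A sketch of that composition, with hypothetical routines matmul_l1 and matmul_l2 standing in for the generated L1 core and an outer blocking level (the dimension/stride separation is carried by the lda/ldb/ldc parameters, so sub-matrices are addressed in place):

```c
#include <assert.h>

/* Stand-in for the generated L1 core: C = C + A*B on sub-matrices
   whose dimensions (M, K, N) are independent of the strides. */
static void matmul_l1(int M, int K, int N, int lda, int ldb, int ldc,
                      const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < K; k++)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

/* Outer blocking level: walks MB x KB x NB blocks and calls the
   lower-level routine on each sub-matrix with the original strides. */
static void matmul_l2(int M, int K, int N, int lda, int ldb, int ldc,
                      int MB, int KB, int NB,
                      const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i += MB)
        for (int k = 0; k < K; k += KB)
            for (int j = 0; j < N; j += NB) {
                int mb = M - i < MB ? M - i : MB;
                int kb = K - k < KB ? K - k : KB;
                int nb = N - j < NB ? N - j : NB;
                matmul_l1(mb, kb, nb, lda, ldb, ldc,
                          A + i * lda + k, B + k * ldb + j,
                          C + i * ldc + j);
            }
}
```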