In this section, we examine the code produced by mm_gen for the
operation C = C + A*B where A (respectively B, C) is an
M
K (respectively K
N, M
N) matrix.
Figure 2 lists the L1 cache blocking core code
comprising the 3 nested loops, M, N, and K.
mm_gen does not vary the loop permutation [MS95, LRW91]
because the resulting gains in locality are subsumed by the method
described below.
The outer M loop in Figure 2 maintains pointers c0 and a0 to rows of register blocks in the A and C matrices. It also maintains end pointers (ap0_endp and cp0_endp) used for loop termination. The middle N loop maintains a pointer b0 to columns of register blocks in the B matrix, and maintains a pointer cp0 to the current C destination register block. The N loop also maintains separate pointers (ap0_0 through ap0_1) to successive rows of the current A source block. It also initializes a pointer bp0 to the current B source block. We assume local variables can be held in registers, so our code uses many pointers to minimize both memory references and integer multiplies.
The K loop iterates over source matrix blocks and accumulates into the
same
destination block. We assume that the
floating-point registers can hold a
accumulator block,
so this block is loaded once before the K loop begins and stored after
it ends. The K loop updates the set of pointers to the A source
block, one of which is used for loop termination.
The parameter
controls the extent of inner loop unrolling as can
be seen in Figure 2. The unrolled core loop performs
outer products accumulating into the C destination block. We
code the outer products by loading one row of the B block, one element
of the A block, then performing
multiply-accumulates. The C
code uses
memory references per
floating-point
operations in the inner K loop, while holding
values
in local variables. While the intent is that these local variables
map to registers, the compiler is free to reorder all of the
independent loads and multiply-accumulates to trade increased memory
references for reduced register usage. The compiler also requires
additional registers to name intermediate results propagating through
machine pipelines.
The code we have so far described is valid only when M, K, and N are
integer multiples of
,
, and
respectively. In the
general case, mm_gen also includes code that operates on
power-of-two sized fringe strips, i.e.,
through
where L is
,
, or
. We can
therefore manage any fringe size from 1 to L-1 by executing an
appropriate combination of fringe code. The resulting code size
growth is logarithmic in the register blocking (i.e.,
) yet maintains good performance. To reduce the
demands on the instruction cache, we arrange the code into several
independent sections, the first handling the matrix core and the
remainder handling the fringes.
The latest version of the generator can optionally produce code with a software-pipelined inner loop. Each outer product consists of a load, a multiply, and an accumulate section. We group these sections into three software pipelined code variants: two two-stage pipes with stages [load-multiply, accumulate] and [load, multiply-accumulate], and a three-stage pipe with stages [load, multiply, accumulate]. Recent results show that this software pipelining can result in an appreciable speedup.
Because of the separation between matrix dimension and matrix stride, we can implement higher levels of cache blocking as calls to lower level routines with appropriately sized sub-matrices.