The register block search evaluates all combinations of
and
where
and where
is
the number of machine floating-point registers. We search the above
for
where
but is
adjustable. Empirically,
has never shown
appreciable benefit.
Our initial strategy [BAD
96] benchmarked a set of square
matrices that fit in L1 cache. We then chose the L0 parameters that
achieved the highest performance. While this approach gave good
performance, the searches were time consuming.
We noted that that the majority of the computation, especially for
larger matrices, is performed by the core
register blocked code. Our newer search strategy, therefore, produces
code containing only the core, which decreases compile time, and for
each L0 parameter set, we benchmark only a single matrix multiply with
size
,
, and
. The parameter k is chosen
such that the three matrices are no larger than L1 cache (we call this
a ``fat'' dot-product). While this case ignores the cost of the M-
and N-loop overhead, we have so far found this approach to produce
comparable quality code in much less time than the previous strategy
(e.g., 5 hours vs. 24 hours on the SGI Indigo R4K).
We initially thought the order in which we searched the L0 parameter space could have an effect on search time and evaluated orders such as i-j-k, random, best-first, and simulated annealing. We found, however, that the added search complexity outweighed any benefit so we settled on a random strategy.