The search script takes parameters describing the machine architecture, including the number of integer and floating-point registers and the sizes of each level of cache. For each combination of generator parameters and compilation options, the matrix multiply search script calls the generator, compiles the resulting routine, links it with timing code, and benchmarks the resulting executable.
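The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual search script: the `benchmark` callable is a hypothetical stand-in for the generate, compile, link, and time pipeline, and the parameter tuples and option strings in the toy example are made up.

```python
import itertools


def search(generator_params, compiler_options, benchmark):
    """Try every (generator parameters, compiler options) combination.

    `benchmark` is a hypothetical callable that stands in for generating,
    compiling, linking, and timing one routine; it returns a rate where
    higher is better. Returns (best_rate, best_params, best_options).
    """
    best = None
    for params, opts in itertools.product(generator_params, compiler_options):
        rate = benchmark(params, opts)
        if best is None or rate > best[0]:
            best = (rate, params, opts)
    return best


def toy_benchmark(params, opts):
    """Made-up scoring function used only to exercise `search`."""
    m0, k0, n0 = params
    return m0 * k0 * n0 + (1 if opts == "-O3" else 0)


TOY_PARAMS = [(2, 2, 2), (4, 4, 1), (2, 4, 4)]  # hypothetical block sizes
TOY_OPTIONS = ["-O2", "-O3"]                    # hypothetical compiler flags
```

In a real search, `benchmark` would invoke the generator and compiler as external processes and parse the timing output; here it is kept abstract so the control flow stays visible.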
To produce a complete BLAS GEMM routine, we find separate parameters
for each of the three cases $A \times B$, $A^T \times B$, and $A \times B^T$
($A^T \times B^T$ has code identical to $A \times B$). For each
case, we first find the best register (or L0) parameters for
in-L1-cache matrices, then find the best L1 parameters for in-L2-cache
matrices, etc. While this strategy is not guaranteed to find the best
L0 core for out-of-L1-cache matrices, the resulting cores have
performed well in practice.
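The level-by-level strategy can be sketched as a greedy search that fixes the best choice at each memory level before moving outward. This is an illustrative sketch under stated assumptions: `benchmark` is a hypothetical callable that times a routine built from the blocking chosen so far, and the rates in `TOY_RATES` are invented to show how a greedy choice at the register level can miss the overall best combination.

```python
def greedy_search(candidates_per_level, benchmark):
    """Pick the best candidate for level 0, then level 1, and so on.

    candidates_per_level[i] lists candidate parameters for level i
    (0 = registers, 1 = L1, ...). `benchmark` is a hypothetical callable
    that takes the tuple of choices made so far and returns a rate
    (higher is better), measured on matrices sized for that level.
    """
    chosen = []
    for candidates in candidates_per_level:
        best = max(candidates, key=lambda c: benchmark(tuple(chosen) + (c,)))
        chosen.append(best)
    return tuple(chosen)


# Invented rates: 4 wins at level 0, but (2, 16) is the best pair overall.
TOY_RATES = {
    (2,): 10, (4,): 12,
    (2, 16): 30, (4, 16): 20,
    (2, 32): 25, (4, 32): 22,
}


def toy_benchmark(blocking):
    """Look up the made-up rate for a blocking tuple."""
    return TOY_RATES[blocking]
```

With these toy rates, `greedy_search([[2, 4], [16, 32]], toy_benchmark)` settles on `(4, 32)` even though `(2, 16)` scores higher, which is the kind of miss the text acknowledges. The greedy approach trades that risk for a search space that grows additively rather than multiplicatively with the number of levels.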