The search script takes parameters describing the machine architecture, including the number of integer and floating-point registers and the size of each level of cache. For each combination of generator parameters and compilation options, the matrix multiply search script calls the generator, compiles the resulting routine, links it with timing code, and benchmarks the resulting executable.
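The loop described above can be sketched as follows. This is a minimal illustration, not the actual search script: the callables `generate`, `build`, and `benchmark` are hypothetical stand-ins for invoking the generator, compiling and linking with the timing code, and running the executable, and the sketch assumes that `benchmark` returns a rate where higher is better.

```python
from itertools import product

def search(gen_params, compile_opts, generate, build, benchmark):
    """Exhaustively try every (generator parameters, compiler options) pair.

    generate, build and benchmark are caller-supplied, hypothetical hooks:
      generate(params)   -> source of the candidate routine
      build(source, opt) -> executable linked with timing code
      benchmark(exe)     -> measured rate (assumed: higher is better)
    Returns (rate, params, opts) for the fastest combination, or None
    if there were no combinations to try.
    """
    best = None
    for params, opts in product(gen_params, compile_opts):
        source = generate(params)    # call the generator
        exe = build(source, opts)    # compile and link with timing code
        rate = benchmark(exe)        # time the resulting executable
        if best is None or rate > best[0]:
            best = (rate, params, opts)
    return best
```

Passing the three steps in as functions keeps the sketch independent of any particular compiler or timing harness.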
To produce a complete BLAS GEMM routine, we find separate parameters for each of the three cases A×B, A^T×B, and A×B^T (A^T×B^T has code identical to A×B). For each case, we first find the best register (or L0) parameters for in-L1-cache matrices, then find the best L1 parameters for in-L2-cache matrices, and so on for each further level of the memory hierarchy. Because the L0 parameters are chosen using only in-L1-cache timings, this strategy is not guaranteed to find the best L0 core for out-of-L1-cache matrices, but the resulting cores have performed well in practice.
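The level-by-level strategy amounts to a greedy search, which might be sketched as below. The function names, the `measure` callback, and the representation of parameters as a flat list are assumptions made for illustration; the sketch only shows the order in which levels are fixed, not how the real script represents or times its candidates.

```python
def staged_search(candidates_per_level, measure):
    """Greedy search, one memory level at a time.

    candidates_per_level[i] lists candidate parameters for level i
    (0 = registers, 1 = L1, ...). measure(chosen, level) is a hypothetical
    callback that benchmarks the parameters in `chosen` on matrices that
    fit in the next cache level and returns a rate (higher is better).
    """
    chosen = []
    for level, candidates in enumerate(candidates_per_level):
        # Parameters for earlier levels stay fixed; only this level varies.
        best = max(candidates, key=lambda c: measure(chosen + [c], level))
        chosen.append(best)
    return chosen

def tune_all_cases(cases, candidates_per_level, measure_case):
    """Run the staged search separately for each transpose case.

    measure_case(case, chosen, level) is a hypothetical per-case benchmark.
    """
    return {
        case: staged_search(
            candidates_per_level,
            lambda chosen, level, case=case: measure_case(case, chosen, level),
        )
        for case in cases
    }
```

Since each level is tuned with the earlier levels already frozen, the number of benchmarks grows with the sum of the per-level candidate counts rather than their product, which is the trade-off the text describes.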