The register block search evaluates all combinations of $M_0$ and $N_0$ where $1 \le M_0 N_0 \le N_R$ and where $N_R$ is the number of machine floating-point registers. We search the above for $1 \le K_0 \le K_0^{\max}$, where $K_0^{\max}$ is adjustable. Empirically, $K_0 > 1$ has never shown appreciable benefit.
Our initial strategy [BAD 96] benchmarked a set of square matrices that fit in L1 cache. We then chose the L0 parameters that achieved the highest performance. While this approach gave good performance, the searches were time-consuming.
We noted that the majority of the computation, especially for larger matrices, is performed by the core register-blocked code. Our newer search strategy, therefore, produces code containing only the core, which decreases compile time, and for each L0 parameter set, we benchmark only a single matrix multiply with size $M = M_0$, $N = N_0$, and $K$. The parameter $K$ is chosen such that the three matrices are no larger than L1 cache (we call this a ``fat'' dot-product). While this case ignores the cost of the M- and N-loop overhead, we have so far found this approach to produce comparable quality code in much less time than the previous strategy (e.g., 5 hours vs. 24 hours on the SGI Indigo R4K).
We initially thought that the order in which we searched the L0 parameter space could affect search time, and we evaluated orderings such as i-j-k, random, best-first, and simulated annealing. We found, however, that the added complexity of the more sophisticated orderings outweighed any benefit, so we settled on a random strategy.