next up previous
Next: Cache Block Search Up: Matrix Multiply Search Scripts Previous: Matrix Multiply Search Scripts

Register (L0) Parameter Search

 

The register block search evaluates all combinations of tex2html_wrap_inline1947 and tex2html_wrap_inline1951 where tex2html_wrap_inline2043 and where tex2html_wrap_inline2045 is the number of machine floating-point registers. We search the above for tex2html_wrap_inline2047 where tex2html_wrap_inline2049 but is adjustable. Empirically, tex2html_wrap_inline2051 has never shown appreciable benefit.

Our initial strategy [BAD tex2html_wrap_inline1913 96] benchmarked a set of square matrices that fit in L1 cache. We then chose the L0 parameters that achieved the highest performance. While this approach gave good performance, the searches were time consuming.

We noted that that the majority of the computation, especially for larger matrices, is performed by the core tex2html_wrap_inline2053 register blocked code. Our newer search strategy, therefore, produces code containing only the core, which decreases compile time, and for each L0 parameter set, we benchmark only a single matrix multiply with size tex2html_wrap_inline2055 , tex2html_wrap_inline2057 , and tex2html_wrap_inline2059 . The parameter k is chosen such that the three matrices are no larger than L1 cache (we call this a ``fat'' dot-product). While this case ignores the cost of the M- and N-loop overhead, we have so far found this approach to produce comparable quality code in much less time than the previous strategy (e.g., 5 hours vs. 24 hours on the SGI Indigo R4K).

We initially thought the order in which we searched the L0 parameter space could have an effect on search time and evaluated orders such as i-j-k, random, best-first, and simulated annealing. We found, however, that the added search complexity outweighed any benefit so we settled on a random strategy.



Richard Vuduc
Tue Nov 18 15:58:12 PST 1997