We perform the L1 cache blocking search after the best register blocking is known. We would like to make the L1 blocks large to increase data reuse, but larger L1 blocks increase the probability of cache conflicts [LRW91]. Tradeoffs between M- and N-loop overheads, memory access patterns, and TLB structure also affect the best L1 size. We currently perform a relatively simple search of the L1 parameter space. For the $D \times D$ square case, we search the neighborhood centered at $3D^2 = L1$, where $L1$ is the L1 cache size in elements. We set $M_1$ to the values $\lfloor \phi D / M_0 \rfloor$, where $\phi \in \{0.25, 0.5, 1.0, 1.5, 2.0\}$ and $D = \sqrt{L1/3}$. $K_1$ and $N_1$ are set similarly. We benchmark the resulting 125 combinations with matrix sizes that either fit in L2 cache, or are within some upper bound if no L2 cache exists. The L2 cache blocking search, when necessary, is performed in a similar way.
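
The candidate generation described above can be sketched as follows. This is a minimal illustration, not the actual search code. It assumes that three square $D \times D$ blocks should together fill the L1 cache, that each L1 block dimension is expressed in multiples of the corresponding register-block size, and that five scaling factors are tried per dimension (which is what yields $5^3 = 125$ combinations). The function name `l1_candidates`, the parameter names, the specific factors in `PHI`, and the clamp to a minimum of one register block are all assumptions made for illustration.

```python
import itertools
import math

# Assumed scaling factors: five per dimension gives 5**3 = 125 candidates.
PHI = (0.25, 0.5, 1.0, 1.5, 2.0)


def l1_candidates(l1_elems, m0, k0, n0, phis=PHI):
    """Return a list of (M1, K1, N1) candidates in units of register blocks.

    l1_elems   -- L1 cache size in matrix elements (not bytes)
    m0, k0, n0 -- register-block sizes chosen by the earlier search
    """
    # Side length D of a square block such that 3 * D**2 == l1_elems.
    d = math.sqrt(l1_elems / 3)

    def sizes(r0):
        # Scale D by each factor and convert to whole register blocks.
        # The max(1, ...) clamp is an illustrative safeguard for tiny caches.
        return [max(1, math.floor(phi * d / r0)) for phi in phis]

    # Cartesian product over the three dimensions.
    return list(itertools.product(sizes(m0), sizes(k0), sizes(n0)))


# Hypothetical example: 3072 elements of L1 and a 2 x 4 x 2 register block.
cands = l1_candidates(l1_elems=3072, m0=2, k0=4, n0=2)
```

Each tuple in `cands` would then be benchmarked and the fastest kept. An exhaustive product is affordable here because the candidate count is fixed and small, independent of the cache size.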