We perform the L1 cache blocking search after the best register
blocking is known. We would like to make the L1 blocks large to
increase data reuse, but larger L1 blocks increase the probability of
cache conflicts [LRW91]. Tradeoffs between M- and N-loop
overheads, memory access patterns, and TLB structure also affect the
best L1 size. We currently perform a relatively simple search of the
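L1 parameter space. For the $D \times D$ square case, we search the
neighborhood centered at $3D^2 = L1$, where $L1$ is the L1 cache size in
elements. We set $M_1$ to the values $\lfloor \phi \times D / M_0 \rfloor$,
where $\phi \in \{0.25, 0.5, 1.0, 1.5, 2.0\}$ and $D = \sqrt{L1/3}$.
$K_1$ and $N_1$ are set
similarly. We benchmark the resulting 125 combinations with matrix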
sizes that either fit in the L2 cache or, if no L2 cache exists, stay
within some upper bound. The L2 cache blocking search, when necessary, is
performed in a similar way.
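
The enumeration of L1 candidates can be illustrated with a minimal sketch. It is not the actual search script: it assumes the candidate rule floor(phi * D / M0) with phi in {0.25, 0.5, 1.0, 1.5, 2.0} and D = sqrt(L1/3), applied independently to each of the three loop dimensions, and the names `l1_candidates`, `l1_elems`, `m0`, `k0`, and `n0` are hypothetical. Benchmarking each candidate is left out.

```python
import math
from itertools import product

# Assumed multipliers applied around the square-case block edge D.
PHI = (0.25, 0.5, 1.0, 1.5, 2.0)


def l1_candidates(l1_elems, m0, k0, n0):
    """Enumerate (M1, K1, N1) candidates for the L1 blocking search.

    l1_elems   -- L1 cache size in elements (not bytes)
    m0, k0, n0 -- register-blocking factors found by the earlier search
    """
    # Three D x D blocks fill the cache when 3 * D^2 = L1.
    d = math.sqrt(l1_elems / 3)

    def axis(r0):
        # Five candidates per dimension, scaled by the register block size.
        return [math.floor(phi * d / r0) for phi in PHI]

    # Cartesian product of the three axes: 5 * 5 * 5 = 125 combinations.
    return list(product(axis(m0), axis(k0), axis(n0)))


if __name__ == "__main__":
    # Hypothetical machine: 768-element L1 cache, 2 x 4 x 1 register block.
    candidates = l1_candidates(768, m0=2, k0=4, n0=1)
    print(len(candidates), candidates[0], candidates[-1])
```

For very small caches or large register blocks the floor can reach zero; a real search would presumably discard such candidates, which this sketch does not do.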