The problem. It used to be that understanding microprocessors
was easy. Testing sufficed to verify their correctness, and linear formulas accurately
explained their performance. Today, processors baffle their own creators. The
reason is
Our approach. The overall goal of the BAFL project is to develop
a robust foundation for guiding micro-architectural innovations as transistor
counts surpass one billion. Specifically, we are developing methods for finding
and eliminating bottlenecks---program instructions and processor resources
responsible for lost performance and wasted power consumption. This task is
complex because the more parallel the machine, the harder it is to tell which
execution events (e.g., cache misses, ALU operations, message transactions)
constrained the execution, and which had their execution times tolerated.
Our framework is built on dependence-graph analysis of a program's execution,
implementable entirely in hardware. The framework allows a qualitatively new
way of thinking about performance. For example, by representing micro-execution
events and dependences as a suitable dependence graph, its critical path
automatically determines which processor stage (e.g., fetch, execute, or
commit) is a bottleneck, and also for which dynamic instructions.
Our solutions attack performance understanding problems for which no (or no
efficient) methods existed. These problems span the entire processor life
cycle, for example:
Criticality. Our ISCA'01
paper made three notable contributions to computer architecture: First, it
developed a dependence-graph abstraction of a program executing on a given
micro-architecture. The abstraction established a formal basis for our work and
also demystified the fuzzy notion of the ``critical path,'' previously used
informally (and sometimes incorrectly) in justifying new processor designs.
Second, we solved the open problem of how to compute the critical path directly
by the processor. Based on a novel randomized algorithm, our
critical-path analyzer, implemented in hardware, does so very efficiently,
without actually building a dependence graph, and with only a tiny storage
array.
Finally, observing that past criticality of an instruction correlates with its
future criticality, we turned the critical-path analyzer into a criticality
predictor, thus facilitating the design of first truly cost-aware processor
policies. One such policy, for instruction scheduling, was able to speed up
programs by up to 22%---without beefing up the design with additional
resources, by simply changing the execution order of instructions.
Slack. Perhaps the greatest challenge of upcoming processors are
technological constraints, such as limited power consumption and long wire
delays. To accommodate them, future designs will be non-uniform: rather
than constraining all resources equally, they will offer them at multiple
quality levels, with the fast and power-hungry ones (e.g., high-speed
functional units) intended for the more critical instructions. The key problem
is how to dynamically assign instructions to resources so that the performance
penalty of the lower-quality resources is completely hidden.
Our ISCA'02 paper solves this problem by exploring a refined notion of criticality, slack. Defined as the number of cycles the instruction can be delayed without slowing down the (overall) execution, slack provides a natural solution---if only we could find it at run-time. Our paper distinguished several notions of slack, and showed that slack is plentiful in the SPEC2000 workload. To exploit it, we developed a hardware slack analyzer (thanks to a reduction trick, as simple the criticality analyzer) and used it to systematically devise a control policy for a power-friendly processor in which one half is clocked at half the frequency. With existing policies, this processor executes programs 10% slower than a full-frequency one; with our slack-based policy, it incurs no performance degradation.