BAFL: Bottleneck Analysis of Fine-grain Parallelism

The problem. It used to be that understanding microprocessors was easy. Testing sufficed to verify their correctness, and linear formulas accurately explained their performance. Today, processors baffle their own creators. The reason is Moore's Law: to obtain speedup, processor designers turn the growing supply of transistors into growing parallelism, at a growing number of levels (e.g., out-of-order execution, pipelined memory hierarchy, multi-threading). While the effects of parallelism on verification have already been recognized (e.g., via model checking), the problem of performance complexity has been attacked only with ad hoc methods.

Our approach. The overall goal of the BAFL project is to develop a robust foundation for guiding micro-architectural innovations as transistor counts surpass one billion. Specifically, we are developing methods for finding and eliminating bottlenecks---program instructions and processor resources responsible for lost performance and wasted power consumption. This task is complex because the more parallel the machine, the harder it is to tell which execution events (e.g., cache misses, ALU operations, message transactions) constrained the execution, and which had their execution times tolerated.

Our framework is built on dependence-graph analysis of a program's execution, implementable entirely in hardware. The framework allows a qualitatively new way of thinking about performance. For example, by representing micro-execution events and dependences as a suitable dependence graph, its critical path automatically determines which processor stage (e.g., fetch, execute, or commit) is a bottleneck, and also for which dynamic instructions.

Our solutions attack performance understanding problems for which no (or no efficient) methods existed. These problems span the entire processor life cycle, for example:

  • processor policies: a processor capable of discerning which instructions (would) hurt performance can schedule instructions and allocate resources to avoid stalls, thus increasing its raw performance;
  • power consumption: a processor capable of analyzing which of its resources are not bottlenecks at a given moment can reconfigure itself, scaling down power-hungry units;
  • feedback-directed optimizations: performance monitoring hardware aware of parallelism is able to determine the actual contribution of cache misses and other ``bad'' events to the execution time, enabling accurate performance-tuning tools and machine-aware compiler optimizations; and
  • balanced processor design: the ability to measure the contribution of any resources will help design and size the resources so that none is excessively slower than the other, reducing the human design cost by avoiding the design-space search used today.

Results

Criticality. Our ISCA'01 paper made three notable contributions to computer architecture: First, it developed a dependence-graph abstraction of a program executing on a given micro-architecture. The abstraction established a formal basis for our work and also demystified the fuzzy notion of the ``critical path,'' previously used informally (and sometimes incorrectly) in justifying new processor designs.

Second, we solved the open problem of how to compute the critical path directly by the processor. Based on a novel randomized algorithm, our
critical-path analyzer, implemented in hardware, does so very efficiently, without actually building a dependence graph, and with only a tiny storage array.

Finally, observing that past criticality of an instruction correlates with its future criticality, we turned the critical-path analyzer into a criticality predictor, thus facilitating the design of first truly cost-aware processor policies. One such policy, for instruction scheduling, was able to speed up programs by up to 22%---without beefing up the design with additional resources, by simply changing the execution order of instructions.

Slack. Perhaps the greatest challenge of upcoming processors are technological constraints, such as limited power consumption and long wire delays. To accommodate them, future designs will be non-uniform: rather than constraining all resources equally, they will offer them at multiple quality levels, with the fast and power-hungry ones (e.g., high-speed functional units) intended for the more critical instructions. The key problem is how to dynamically assign instructions to resources so that the performance penalty of the lower-quality resources is completely hidden. 

Our ISCA'02 paper solves this problem by exploring a refined notion of criticality, slack. Defined as the number of cycles the instruction can be delayed without slowing down the (overall) execution, slack provides a natural solution---if only we could find it at run-time. Our paper distinguished several notions of slack, and showed that slack is plentiful in the SPEC2000 workload. To exploit it, we developed a hardware slack analyzer (thanks to a reduction trick, as simple the criticality analyzer) and used it to systematically devise a control policy for a power-friendly processor in which one half is clocked at half the frequency. With existing policies, this processor executes programs 10% slower than a full-frequency one; with our slack-based policy, it incurs no performance degradation.

Interaction Cost. Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other.  This parallelism makes it difficult to both (a) assign a cost to an event (e.g., to one of two overlapping cache misses) and (b) assign blame for each cycle (e.g., for a cycle where many, overlapping resources are active).  Our MICRO‘03 introduces a new model for understanding event costs to facilitate processor design and optimization.

First, we observe that everything in a machine (instructions, hardware structures, events) can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization. Finally, we propose performance-monitoring hardware for measuring interaction costs that is suitable for modern processors.

People

Papers

Funding

  • NSF
  • Intel
  • IBM