Efficiency vs. Peak Performance: A Vote for Efficiency
===============================================================================

Daniel M. Pressel
Computer Scientist
U.S. Army Research Laboratory, APG, MD

For many years now, any time someone has suggested making significant changes to the design of a RISC processor, the knee-jerk response has been to point out how such a change would adversely affect the processor's peak level of performance. Now that an increasing number of systems are showing signs of hitting the memory wall, it may be time to revisit some of these decisions.

The basic postulates are as follows:

1) Architectures that are deeply pipelined, or that have other impediments to instruction scheduling, are difficult for compilers to schedule and will generally achieve very poor levels of performance.

2) It is of questionable value to produce a blazingly fast processor if it will spend most of its time stalled on cache misses.

3) It is of questionable value to produce a blazingly fast processor if it will spend most of its time either executing instructions that the compiler did not really need to generate, or speculatively executing instructions that have only a small probability of graduating.

Some of the changes worth reconsidering are as follows:

1) Very large TLBs, which can significantly reduce the rate of TLB misses. In extreme cases this can increase the speed of some code segments by one to two orders of magnitude. (Variable page sizes are frequently a less effective solution.) A sketch of the kind of access pattern that provokes this behavior appears at the end of this note.

2) A hierarchy of caches with large on-chip caches (along the lines of the HP PA-8500 or the new RM7000 from QED) and an even larger off-chip cache can produce very significant performance gains for certain important classes of scientific codes. The main problem is that many of these codes will require implementation-level tuning to take full advantage of the change; the blocking sketch at the end of this note illustrates the kind of tuning involved. Unfortunately, many design studies seem to assume fixed coding and fixed problem sizes and therefore miss the potential of this type of change.

3) Using 64-bit instruction encodings. With the much larger on-chip instruction caches that are now available, the cost of this decision is not excessive, and it can allow the compiler to generate far more efficient code (e.g., by making a larger number of registers architecturally visible and by allowing larger offsets and constants to be encoded directly in the instruction).

4) Incorporating fully pipelined hardware support for 128-bit floating-point arithmetic. This can be used to improve the performance of some math functions and has the potential to significantly improve the speed of operations involving banded matrices.

Clearly, most of these changes are of limited value for the PC and low-end workstation market. However, if mass-market processors are to be used in machines such as the IBM SP and the SGI Origin 2000, then they must be well tuned for the classes of problems solved on those machines. Based on my experience, each of these changes has the potential to benefit one or more important classes of scientific code.
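
As a rough illustration of the TLB point (item 1 above), the following C sketch walks a large row-major matrix in column order. The matrix dimension and the assumed 4 KB page size are made-up values chosen for exposition, not measurements from any particular machine or benchmark.

    /*
     * Illustrative only: a column-order sweep over a row-major array.
     * N and the 4 KB page size are assumptions for exposition.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096   /* 4096 x 4096 doubles = 128 MB */

    int main(void)
    {
        double *a = calloc((size_t)N * N, sizeof *a);
        if (a == NULL)
            return 1;

        double sum = 0.0;

        /*
         * Consecutive inner-loop accesses are N * sizeof(double) = 32 KB
         * apart, so with 4 KB pages nearly every reference lands on a
         * different page.  A small TLB thrashes on this loop nest; a very
         * large TLB (or, less effectively, larger pages) keeps the
         * translations resident.
         */
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += a[i * N + j];

        printf("%f\n", sum);
        free(a);
        return 0;
    }

In practice the loop order would simply be interchanged here, but many real codes have data-dependent or multi-dimensional access patterns where such a fix is not available, which is when TLB reach becomes the limiting factor.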
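The second sketch shows the kind of implementation-level tuning mentioned in item 2: a blocked (tiled) matrix multiply. The tile size BS is an assumption that would have to be retuned for each level of the cache hierarchy and for each problem size; it is not a recommendation for any specific processor.

    /* Illustrative only: BS is an assumed tuning parameter. */
    #include <stddef.h>

    #define BS 64   /* three 64x64 tiles of doubles = 96 KB working set */

    /* c += a * b for n x n row-major matrices; caller zeroes c. */
    void matmul_blocked(size_t n, const double *a, const double *b, double *c)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS) {
                    size_t imax = (ii + BS < n) ? ii + BS : n;
                    size_t kmax = (kk + BS < n) ? kk + BS : n;
                    size_t jmax = (jj + BS < n) ? jj + BS : n;
                    /* Each tile's working set stays resident in a large
                       on-chip cache instead of streaming from memory. */
                    for (size_t i = ii; i < imax; i++)
                        for (size_t k = kk; k < kmax; k++) {
                            double aik = a[i * n + k];
                            for (size_t j = jj; j < jmax; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
                }
    }

A design study that fixes both the source code and the problem size would never see the benefit of a larger cache here, because the benefit only appears once BS is retuned to match the new cache capacity.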