Efficiency vs. Peak Performance: A Vote for Efficiency
===============================================================================

Daniel M. Pressel
Computer Scientist
U.S. Army Research Laboratory, APG, MD

For many years now, any time someone has suggested making significant changes to the design of a RISC processor, the knee-jerk response has been to point out how such a change would adversely affect the processor's peak level of performance. Now that an increasing number of systems are showing signs of hitting the memory wall, it may be time to revisit some of these decisions.

The basic postulates are as follows:

1) Architectures that are deeply pipelined, or that have other impediments to instruction scheduling, are difficult for compilers to schedule and will generally achieve very poor levels of performance.

2) It is of questionable value to produce a blazingly fast processor if it will spend most of its time stalled on cache misses.

3) It is of questionable value to produce a blazingly fast processor if it will spend most of its time either executing instructions that the compiler did not really need to generate, or speculatively executing instructions that have only a small probability of graduating.

Some of the changes worth reconsidering are as follows:

1) Very large TLBs, which can significantly reduce the rate of TLB misses. In extreme cases this can increase the speed of some code segments by one to two orders of magnitude. (Variable page sizes are frequently a less effective solution.) A sketch of the kind of access pattern that provokes this behavior appears at the end of this note.

2) A hierarchy of caches with large on-chip caches (along the lines of the HP PA-8500 or the new RM7000 from QED) and an even larger off-chip cache can produce very significant performance gains for certain important classes of scientific codes. The main problem is that many of these codes will require implementation-level tuning to take full advantage of the change; the blocking sketch at the end of this note illustrates the kind of tuning involved. Unfortunately, many design studies seem to assume fixed coding and fixed problem sizes and therefore miss the potential of this type of change.

3) Using 64-bit instruction encodings. With the much larger on-chip instruction caches that are now available, the cost of this decision is not excessive, and it can allow the compiler to generate far more efficient code (e.g., by making a larger number of registers architecturally visible and by allowing larger offsets and constants to be encoded directly in the instruction).

4) Incorporating fully pipelined hardware support for 128-bit floating-point arithmetic. This can be used to improve the performance of some math functions and has the potential to significantly improve the speed of operations involving banded matrices.

Clearly, most of these changes are of limited value for the PC and low-end workstation market. However, if mass-market processors are to be used in machines such as the IBM SP and the SGI Origin 2000, then they must be well tuned for the classes of problems solved on those machines. Based on my experience, each of these changes has the potential to benefit one or more important classes of scientific code.
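
As a rough illustration of the TLB point (item 1 above), the following C sketch walks a large row-major matrix in column order. The matrix dimension and the assumed 4 KB page size are made-up values chosen for exposition, not measurements from any particular machine or benchmark.

    /*
     * Illustrative only: a column-order sweep over a row-major array.
     * N and the 4 KB page size are assumptions for exposition.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096   /* 4096 x 4096 doubles = 128 MB */

    int main(void)
    {
        double *a = calloc((size_t)N * N, sizeof *a);
        if (a == NULL)
            return 1;

        double sum = 0.0;

        /*
         * Consecutive inner-loop accesses are N * sizeof(double) = 32 KB
         * apart, so with 4 KB pages nearly every reference lands on a
         * different page.  A small TLB thrashes on this loop nest; a very
         * large TLB (or, less effectively, larger pages) keeps the
         * translations resident.
         */
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += a[i * N + j];

        printf("%f\n", sum);
        free(a);
        return 0;
    }

In practice the loop order would simply be interchanged here, but many real codes have data-dependent or multi-dimensional access patterns where such a fix is not available, which is when TLB reach becomes the limiting factor.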
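The second sketch shows the kind of implementation-level tuning mentioned in item 2: a blocked (tiled) matrix multiply. The tile size BS is an assumption that would have to be retuned for each level of the cache hierarchy and for each problem size; it is not a recommendation for any specific processor.

    /* Illustrative only: BS is an assumed tuning parameter. */
    #include <stddef.h>

    #define BS 64   /* three 64x64 tiles of doubles = 96 KB working set */

    /* c += a * b for n x n row-major matrices; caller zeroes c. */
    void matmul_blocked(size_t n, const double *a, const double *b, double *c)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS) {
                    size_t imax = (ii + BS < n) ? ii + BS : n;
                    size_t kmax = (kk + BS < n) ? kk + BS : n;
                    size_t jmax = (jj + BS < n) ? jj + BS : n;
                    /* Each tile's working set stays resident in a large
                       on-chip cache instead of streaming from memory. */
                    for (size_t i = ii; i < imax; i++)
                        for (size_t k = kk; k < kmax; k++) {
                            double aik = a[i * n + k];
                            for (size_t j = jj; j < jmax; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
                }
    }

A design study that fixes both the source code and the problem size would never see the benefit of a larger cache here, because the benefit only appears once BS is retuned to match the new cache capacity.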