The use of a standard linear algebra library interface, such as BLAS [LHKK79, DCHH88, DCDH90], enables portable application code to obtain high-performance provided that an optimized library (e.g., [AGZ94, KHM94]) is available and affordable.
Developing an optimized library, however, is a difficult and time-consuming task. Even excluding algorithmic variants such as Strassen's method [BLS91] for matrix multiplication, these routines have a large design space with many parameters such as blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. Furthermore, these parameters have complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.
Various strategies can be used to produced optimized routines for a given platform. For example, the routines could be manually written in assembly code, but fully exploring the design space might then be infeasible, and the resulting code might be unusable or sub-optimal on a different system.
Another commonly used method is to code using a high level language but with manual tuning to match the underlying architecture [AGZ94, KHM94]. While less tedious than coding in assembler, this approach still requires writing machine specific code which is not performance-portable across a range of systems.
Ideally, the routines would be written once in a high-level language
and fed to an optimizing compiler for each machine. There is a large
literature on relevant compiler techniques, many of which use matrix
multiplication as a test case
[WL91, LRW91, MS95, ACF95, CFH95, SMP
96]
. While these
compiler heuristics generate reasonably good code in general, they
tend not to generate near-peak code for any one operation. A
high-level language's semantics might also obstruct aggressive
compiler optimizations. Moreover, it takes significant time and
investment before compiler research appears in production compilers,
so these capabilities are often unavailable. While both
microarchitectures and compilers will improve over time, we expect it
will be many years before a single version of a library routine can be
compiled to give near-peak performance across a wide range of
machines.
We have developed a methodology, named PHiPAC, for developing Portable High-Performance linear algebra libraries in ANSI C. Our goal is to produce, with minimal effort, high-performance linear algebra libraries for a wide range of systems. The PHiPAC methodology has three components. First, we have developed a generic model of current C compilers and microprocessors that provides guidelines for producing portable high-performance ANSI C code. Second, rather than hand code particular routines, we write parameterized generators [ACF95, MS95] that produce code according to our guidelines. Third, we write scripts that automatically tune code for a particular system by varying the generators' parameters and benchmarking the resulting routines.
We have found that writing a parameterized generator and search scripts for a routine takes less effort than hand-tuning a single version for a single system. Furthermore, with the PHiPAC approach, development effort can be amortized over a large number of platforms. And by automatically searching a large design space, we can discover winning yet unanticipated parameter combinations.
Using the PHiPAC methodology, we have produced a portable, BLAS-compatible matrix multiply generator. The resulting code can achieve over 90% of peak performance on a variety of current workstations, and is sometimes faster than the vendor-optimized libraries. We focus on matrix multiplication in this paper, but we have produced other generators including dot-product, AXPY, and convolution, which have similarly demonstrated portable high performance.
We concentrate on producing high quality uniprocessor libraries for
microprocessor-based systems because multiprocessor libraries, such as
[CDD
96], can be readily built from uniprocessor
libraries. For vector and other architectures, however, our machine
model would likely need substantial modification.
Section 2 describes our generic C compiler and microprocessor model, and develops the resulting guidelines for writing portable high-performance C code. Section 3 describes our generator and the resulting code for a BLAS-compatible matrix multiply. Section 4 describes our strategy for searching the matrix multiply parameter space. Section 5 presents results on several architectures comparing against vendor-supplied BLAS GEMM. Section 6 describes the availability of the distribution, and discusses future work.