DOE2000 Progress Report

Jim Demmel - UC Berkeley

Jack Dongarra - University of Tennessee and ORNL


Recent Awards

James Demmel was recently elected to the National Academy of Engineering for "contributions to numerical linear algebra and scientific computing".

NetSolve

Overview

NetSolve enables users to solve complex scientific problem remotely. The system allows users to access both hardware and software computational resources distributed across a network. NetSolve searches for computational resources on a network, chooses the best one available, and using retry for fault-tolerance solves a problem, and returns the answers to the user. A load-balancing policy is used by the NetSolve system to ensure good performance by enabling the system to use the computational resources available as efficiently as possible. Our framework is based on the premise that distributed computations involve resources, processes, data, and users, and that secure yet flexible mechanisms for cooperation and communication between these entities is the key to metacomputing infrastructures.

NetSolve has been ported to every major UNIX platform. A Windows 95/98/NT version of the client (C, MATLAB, and Mathematica) is available and a server version is being developed. The following numerical and scientific libraries have been integrated into NetSolve computational servers: ARPACK, BLAS, FFTPACK, FitPack, ItPack, LAPACK, MinPack, PETSc, AZTEC, and ScaLAPACK. These numerical libraries cover several fields of scientific computing; Linear Algebra, Optimization, Fast Fourier Transforms, etc. This includes only the software that has been integrated by the NetSolve team.

The NetSolve client is now available from C, Matlab and Mathematica interfaces for MS-Windows platform. The server and agent are soon to follow. The Mathematica interface is also available from UNIX platforms.

Current Activities:

Most of the changes since version 1.1 were a re-implementation of the internals and a redesign of the internal data structures, and hence transparent at the user-level. In addition, a number of improvements have been made to the system and are outlined below.

NetSolve version 1.2, which was developed during the summer and fall of 1998, has now been officially released after undergoing testing and bug fixes. As previously reported, this version is a complete rewrite of the software with much tighter code and more modularity. The most important addition is the new DoAll (aka. task farming) interface. The DoAll interface was motivated by an application with researchers at the Salk Institute and is designed to make it simple to implement and optimize embarrassingly parallel applications within NetSolve. Other enhancements to Ver. 1.2 include a Win32 (95/98/NT) implementation of the client interface for MATLAB, C, and Mathematica, which opens access to NetSolve-enabled grids to a much larger user community.

This FY also saw the continued of the integration of NetSolve with other Metacomputing environments. The NetSolve and the Globus and Legion groups collaborated to make the systems interoperate; the first substantial results of this effort were presented in January 1999. We also began actively examining methods for adding improved security and access control into NetSolve, focusing on the integration with Globus tools for this purpose.

Progress is being made on various other aspects of NetSolve, which we expect to come to fruition later this quarter. Improvements will include the following:

  • Customizations of the NetSolve server to integrate more numerical and scientific libraries, such as Aztec and PETSc.
  • The NetSolve Java API and GUI has been upgraded to Java ver. 1.2.x and is currently undergoing testing and bug fixes before it is released.
  • Following the Win32 port of the NetSolve client, the NetSolve agent and server are being ported to Windows NT. This will permit much more flexibility in building NetSolve enabled grids.
  • The client code has been ported to the Macintosh. After testing and bug fixes and it will be integrated into the NetSolve system.
  • We have begun a collaboration with the University of Utah to put SCIrun over NetSolve.
  • Future Activities:

    The research that we propose takes two forms: research on NetSolve, and research with NetSolve. Re- search on NetSolve concentrates on improving the design and implementation of the NetSolve system. Areas for such improvement include the following:
  • Design and implementation of advanced inter-server fault-tolerance and load balancing.
  • Data caching between computations and improved resource management by the agents.
  • Dynamic loading of software modules from repositories.
  • Integration of other meta-computing software (e.g. Globus/Nexus, Legion).
  • Incorporating the Network Weather Service for help with scheduling.
  • Research with NetSolve focuses on providing intra-server functionalities. As continued development expands the collection of high-performance server software, Netsolve enables us to integrate a variety of techniques for fault-tolerance and load-balancing and to compare their performance directly with each other. These techniques may be general-purpose (e.g. based on checkpointing and rollback recovery), or specific to certain computations on certain platforms. In any event, by developing these techniques within the context of NetSolve we can ascertain their real performance on real problems in real applications. Through NetSolve we will also be able to immediately deploy the best of these techniques to yield direct positive benefits to the end user.


    ATLAS

    Overview:

    The existing BLAS have proven to be very effective in facilitating the production of portable, efficient software for sequential, vector and shared memory high-performance computers. However, hand-optimized BLAS are expensive and tedious to produce for any particular architecture, and in general will only be created when there is a large enough market, which is not the true for all platforms. The process of generating an optimized set of BLAS for a new architecture or a slightly different machine version can be a time consuming process. The programmer must understand the architecture, how the memory hierarchy can be used to provide data in an optimal fashion, how the functional units and registers can be manipulated to generate the correct operands at the correct time, and how best to use the compiler optimization. ATLAS is an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. ATLAS has been designed to automate much of this process. ATLAS was motivated by prior work on PHIPAC at UC Berkeley. This version of ATLAS focuses on the Level 3 BLAS. ATLAS optimizes the operations to account for many parameters such as blocking factors, loop unrolling depths, software pipelining strategies, loop ordering, register allocations, and instruction scheduling. ATLAS does this by carrying out a parameter study on the target architecture and generating the "optimal" code. ATLAS' performance wins are evident in a number of cases. On a system where there is presently no vendor supplied BLAS (viz. DEC ALPHA 21164 running Linux), the portable Fortran reference implementation gets less than 8% of the performance that ATLAS achieves. On platforms where vendors have hand-tuned this kernel library, ATLAS produces software that runs at roughly the same rate. ATLAS generally requires only 1-2 hours on new hardware in order to tune itself to the same level of efficiency.

    Current Activities:

    We have developed a general methodology for the generation of the efficient linear algebra kernels. In our approach, we have isolated the machine-specific features of the operation to several routines, all of which deal with performing an optimized on-chip, cache contained, (i.e., in Level 1 (L1) cache) matrix multiply. This section of code is automatically created by a code generator which uses timings to determine the correct blocking and loop unrolling factors to perform an optimized on-chip multiply. The user may directly supply the code generator with as much detail as desired (i.e., the user may explicitly indicate the L1 cache size, the blocking factor(s) to try, etc); if such details are not provided, the generator will determine appropriate settings via timings. While the present release is for the Level 3 BLAS, much of the technology and approach developed here can be generalized relatively easily. We therefore plan to support the level 1 and 2 BLAS in the near future. Indeed, we plan to extend the ATLAS concept to cover other operations as well. In particular, we would like to provide sparse linear algebra and Java language support. The release of ATLAS as well as documentation can be found at http://www.netlib.org/atlas/.

    Future Activities:

    We plan to experimented with a variety of techniques for optimizing sparse matrix vector multiplication to take instruction sets, functional units, and memory hierarchies into account. Sparse matrix-vector multiplication is of course the inner loop in any iterative solver, even multigrid, since it includes all the interpolation, restriction, and smoothing operations. The structural properties of the application leads to sparse matrices that feature a sufficiently regular pattern, so that the automatic optimization techniques already integrated in ATLAS can be successfully re-used and applied to generate the appropriate basic sparse linear algebra kernels needed in many applications. Our plan for achieving the necessary and exceptionally high degree of portability and optimization leverages the experience of our team in developing ATLAS technology. We will build on our success with this technology and extend our research to automate the generation of the most heavily used kernels used in the shared- and distributed-memory versions of ASCI and SSI related applications.

    Second, we intend to develop the required technology to enable experimentation of promising new dynamic or run-time automated optimization techniques. The optimization strategies we have explored so far are inherently static, that is, optimization is performed during the installation process of the code generator, i.e., at compile time. This approach is well-suited for the computational building blocks of many ASCI and SSI related applications, because of the structural properties of the sparse matrices arising from the applications. In a more general setting however, these static automated optimization techniques may be insufficient. So, we propose to enhance this static approach with dynamic and run-time techniques. Indeed, during the execution of a program implementing an iterative linear system solver, a few basic computational kernels such as the sparse matrix-vector multiply are called a large number of times. This suggests the possibility of performing optimizations at run-time, as well as at compile-time. In this scheme, the first call to such a dynamically optimized subprogram diagnoses the overall structure of the sparse matrix operand, and a first-guess approach is used to perform the operation. The time for this first try will be retained for the next call. On the next call, we can try a slightly different mechanism for the operation. As the function is called iteratively, mechanisms yielding performance wins are retained, while those options that fail to increase efficiency are discarded. This is precisely the method we presently use at compile time for the dense linear algebra kernel generator to specialize the code to unknown hardware. The non-uniform nature of sparse operands, coupled with the iterative nature of the call, allows for the possibility of runtime optimizations, as the cost of the structural analysis and subsequent algorithmic tests are amortized over the many calls to the routine.


    Basic Linear Algebra Subprogram Technical Forum

    Overview:

    Linear algebra, particularly the solution of linear systems of equations and eigenvalue problems, is fundamental to most calculations in scientific computing. Designers of computer programs involving linear algebraic operations have frequently chosen to implement certain low level operations such as the dot product as separate subprograms. This may be observed both in many published codes and in codes written for specific applications at many computer installations.

    This approach encourages the idea of structured programming and improves the self-documenting quality of the software by specifying basic building blocks and identifying these operations with unique mnemonic names. Since a significant amount of execution time in complicated linear algebraic programs may be spent in a few low level operations, reducing the execution time spent in these operations reflects an overall reduction in the execution time of the program. The programming of some of these low level operations involves algorithmic and implementation subtleties that are likely to be ignored in the typical applications programming environment. If there could be general agreement on standard names and parameter lists for some of these basic operations, then portability and efficiency could also be achieved.

    The Level 1 Basic Linear Algebra Subprograms (BLAS) and test suite were the result of a collaborative project in 1973-77. Following the distribution of the initial version of the specifications to people active in the development of numerical linear algebra software, a series of open meetings were held at conferences. The meetings were attended by approximately thirty people, and as a result, extensive modifications were made in an effort to improve the design and make the subroutines more robust. Specifications for the Level 2 and 3 BLAS were drawn up in 1984-86 and 1987-88. These specifications made it possible to construct new software packages based on block-partitioned algorithms to more effectively utilize the memory hierarchy of modern computers. To a great extent, the user community embraced the BLAS, not only for performance reasons, but also because developing software around a core of common routines like the BLAS is good software engineering practice. Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable software to achieve high performance with portable code.

    Current Activities:

    In the spirit of the BLAS open forum meetings and the standardization efforts of the MPI and HPF forums, a technical forum was established to consider expanding the Basic Linear Algebra Subprograms (BLAS) in a number of directions in the light of modern software, language, and hardware developments. The BLAST Forum meetings began in November 1995 at the University of Tennessee. Universities and software and hardware vendors hosted meetings. Various working groups were established to consider issues such as the overall functionality, language interfaces, sparse BLAS, distributed memory BLAS, extended and mixed precision BLAS (XBLAS), interval BLAS, and extensions to the existing BLAS. UC Berkeley is responsible for the XBLAS. To facilitate group discussions outside of the meetings, a number of mailing lists were established. The rules of the forum were adopted from those used for the MPI and HPF forums. The efforts of these working groups are summarized in the chapters of the BLAST document.

    The goals of the forum have been to stimulate thought and discussion for the functionality and future development of a set of standards for basic matrix data structures in both dense and sparse applications in sequential and SMP settings. The major aims of these standards are to enable linear algebra libraries (both public domain and commercial) to interoperate efficiently and easily. Numerical methods for sparse and dense matrices require highly efficient kernels that provide functionality similar to that in the traditional BLAS on sequential machines. In addition, iterative and sparse direct methods require additional functionality not in traditional BLAS. Hardware and software vendors, independent software vendors, application programmers, and higher level library writers all benefit from the efforts of this forum and are the end users of these interfaces.

    Future Activities:

    We expect to complete this effort in FY2000. We would develop a reference implementation with the help of others from the community. UC Berkeley is designing a macro-based approach to generating the reference implementation and test code from a few templates; this is motivated by the need to manage and maintain what will be a large number of similar subroutines. We are collaborating with David Bailey at NERSC on developing the underlying high precision algorithms. We plan to use these routines to support the exploitation of high accuracy in linear algebra algorithms, such as SuperLU.

    SuperLU

    Overview:

    SuperLU is one of the fastest implementations of sparse Gaussian elimination with partial pivoting for fully nonsymmetric matrices. It has been released as a sequential code optimized for cache based machines, and also for SMPs. We are now working on distributed memory version and incomplete factorizations.

    SuperLU is joint work with Xiaoye Li at NERSC.

    The main challenge in implementing sparse Gaussian elimination is that any naive and even most sophisticated implementations of sparse Gaussian elimination will spend most of their time not doing floating point operations at full speed, but rather accessing and updating the linked data structures that store the sparse factors. For cache-based workstations with multiple functional units, sparsity also means that it is hard to attain the full speed of BLAS3 operations, which assume that floating point operands are stored in consecutive memory locations. We introduced a number of innovations in the sequential code to overcome these obstacles. First, we use we use Gilbert and Peierl's depth-first-search with Eisenstat and Liu's symmetric structural reductions to minimize the cost of data structure update and traversal. Second, we use unsymmetric supernodes, with supernode-panel updates and two-dimensional data partitioning, to automatically find and exploit small dense matrices that arise during the factorization, to get the most out of BLAS kernels. For example, the performance on an IBM RS6000/590 ranges up to 124 Mflops out of a peak of 266 Mflops, compared to 168 Mflops for dense LU, with similar performance on the MIPS R8K, and DEC Alpha.

    The parallel implementation is more challenging because of its dynamic and somewhat unpredictable way of generating work and intermediate results at run time. In addition to the innovation of the last paragraph, we had to redefine the panels and supernodes, and then schedule two types of parallel tasks using them via a shared task queue. One task is factoring the independent panels on the disjoint subtrees in the column elimination tree of $A$, and the other task is updating a panel by previously computed supernodes. No global synchronization is used in the algorithm. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8 and 0.8 Gigaflops, on the 12 processor Power Challenge, 8 processor Cray C90, 16 processor Cray J90, and 8 processor AlphaServer 8400, respectively.

    Both the uniprocessor and SMP versions of SuperLU have been released, with extensive documentation.

    Current Activities:

    The distributed memory version is still more challenging because the cost of communication makes it impossible to scalably exploit the fine grain load balancing, synchronization and communication used by the SMP algorithm above. Here is our approach. There has been success in parallelizing sparse Cholesky scalably, because in Cholesky any pivot order is numerically stable, so all the load balancing, data structure construction and communication scheduling can be performed before the numerical factorization phase (i.e. statically) and so carefully optimized. Our goal is to make general sparse Gaussian elimination as scalable as Cholesky, which provides a natural ``upper bound'' on scalability for us. Our innovation is to eliminate the apparent need to pivot dynamically (during the numerical factorization), thus letting us do the same static optimizations as in Cholesky. We maintain numerical stability by several techniques. First, we use ``static pivoting'', which uses a bipartite graph matching algorithm of Duff and Koster to permute the rows of the matrix to make its diagonal as large as possible. Then rows and column are permuted symmetrically to simultaneously keep the diagonal fixed and maximize sparsity and parallelism. Then the load balancing, data structure construction, and communication scheduling is done for this fixed order of rows and columns. During numerical factorization small pivots can still arise, so they are handled by a mixture of numerical techniques that do not change any data structures, including artificially enlarging tiny pivots, iterative refinement to `'clean up'' afterwards, and judicious use of extra precision. This combination of techniques was successful at maintaining both numerical stability and scalabilty on a wide range of test matrices.

    We presented a paper on this work at SuperComputing 98, where we obtained over 8 Gflops on a 512 processor Cray T3E, on a matrix of dimension nearly 52000 arising the device simulation.

    In joint work with Bill McCurdy at LBNL, who is working on a quantum chemistry problem calculating electron impact ionization cross-sections, we solved a problem of dimension n=209764 in 2 minutes on a 16 processor Cray T3E, and a problem of size n=736164 on a 24 processor ASCI Blue Pacific in 21 minutes.

    Future Activities:

    We will soon finalize the calling sequence and documentation of Distributed SuperLU and release it.

    Only the numerical phase has been parallelized, so work remains to make the entire algorithm scalable by parallelizing the symbolic phases as well. In particular, we need to parallelize the algorithm by Duff and Koster that we use, both to make the code scalable and to make it completely public domain, since the Duff/Koster code is part of the Harwell subroutine library.

    We need to use a judicious amount of extra precision to make SuperLU numerically reliable. We will be using technology developed as part of the XBLAS (described above) to enable this.

    We are working on an incomplete factorization version of SuperLU. This is of interest for all architectures, from uniprocessor workstations to distributed memory machines. The efficiency challenges are even greater than with Gaussian elimination, since the suppression of fill-in makes it harder to mask memory latency.

    Here are slides from a recent talk at Petaflops 99. These slides include material on Prometheus (see below) and SuperLU.


    Parallel Eigensolvers and the SVD

    Overview:

    We are incorporating a number of improvements into the eigensolvers currently in LAPACK and ScaLAPACK. Algorithms for finding eigenvalues and eigenvectors of symmetric matrices (and definite pencils) as well as the SVD of general matrices have three phases:
  • Phase 1: Reduction to condensed form (tridiagonal for eigenvalue problems and bidiagonal for the SVD)
  • Phase 2: Finding the eigenvalues and eigenvectors (or SVD) of the condensed form
  • Phase 3: Transforming the eigenvectors back to be those of the original matrix.
  • All three phases permit significant algorithmic improvements.

    Current Activities:

    The first set of improvements involve Phases 1 and 3 in ScaLAPACK. This work was initially performed by Ken Stanley of Berkeley in collaboration with Greg Henry of Intel and Mark Sears of SNL, and was a runner up for the Gordon Bell Prize (peak performance) at Supercomputing 98, achieving a sustained performance of well over 600 Gflops on ASCI Red, 1/3 of machine peak, on a quantum chemistry application called MP-Quest.

    We are currently trying to incorporate the improvements in the ASCI Red code back into ScaLAPACK to make them generally available. The approach involves modifying a basic ScaLAPACK design principle, which is to use the same data layout for all phases of the algorithm. Instead, we use different layouts optimized for each phase, and pay the price of converting layouts between phases; this turned out to be quite advantageous overall, doubling the speed in some cases. More important, it makes the issue of choosing a good layout the algorithms responsibility instead of the users, so performance is uniformly high nearly independently of the format chosen by the user.

    Another modification requires a new BLAS2 routine. This is based on the observation that the two matrix-vector multiplications y=T*x and z = TT*w can be performed in about the same time as one of them, since the time for accessing T in memory outweighs the cost of the floating point arithmetic. This involves collaboration with the Basic Linear Algebra Subprogram Technical Forum described above, to incorporate this new BLAS 2 routine into the standard.

    The second set of improvements involve Phase 2. This involves completing and integrating a new algorithm that is optimal in two senses: it has perfect output complexity O(kn) to compute k eigenvectors of length n (O(1) work per solution component), and is embarrassingly parallel (each eigenvector can be computed independently in parallel). Because of the dual optimality we call this algorithm the Holy Grail. This is based on work of Beresford Parlett, his graduate student Inderjit Dhillon (whose thesis has been nominated for the 1999 Household Prize for best thesis in numerical linear algebra over the previous 3 years), and is also being done in collaboration with Osni Marques of NERSC. We anticipate releasing sequential code for LAPACK in Spring 1999, just for the symmetric tridiagonal and dense symmetric eigenvalue problems.

    Future Activities:

    Both the current activities mentioned above will be continued. In particular, we have only the integrated the Holy Grail into two symmetric eigenvalue code in LAPACK. It still needs to be integrated into the band and generalized symmetric eigenproblem, the singular value decomposition (bidiagonal, banded, dense, generalized), and the corresponding ScaLAPACK codes.

    Pbody

    Overview:

    Pbody is not part of our original DOE2000 proposal, but is work that had been going on independently, and as it has reached maturity, I have supported it with DOE2000 funds by paying for the graduate student responsible, David Blackston, for the current semester.

    Pbody is a parallel library of N-body methods, i.e. for quickly calculating the electrostatic or gravitational forces or potentials on N particles. It is portable across a range of parallel machines, permits the user a wide range of choices in how the computation is done, and is efficient.

    The motivation for Pbody is two-fold. First, there are a variety of algorithms for the N-body problem with varying accuracies, efficiencies, load balancing techniques, and communication mechanisms. Prior implementations had been handcrafted for a particular machine, particular algorithm (Barnes-Hut, Fast Multipole, ...), etc. Not all useful algorithms were available on all machines. Second, we noticed that we could implement all these algorithms and their variations in a unified fashion (using so-called Bulk-Synchronous Programming (BSP)) that made it possible to implement all versions of all algorithms portably across all machines, with relatively little loss of efficiency.

    Current Activities:

    We are collaborating with Andy Neureuther at Berkeley, who has a program called EBEAM for simulating electron beam semiconductor manufacturing devices. EBEAM had been using another N-body code called DPMTA (from Duke), which we replaced by PBody. For large problems Pbody was over 50 times faster than DPMTA (which is nonadaptive, whereas Pbody is adaptive) in computing the forces on the electrons. Indeed, the bottleneck is no longer the N-body calculation but elsewhere in EBEAM.

    Future Activities:

    We will be doing extensive performance evaluation on a variety of platforms, and packaging PBody for general release. The graduate student responsible for this work plans to write up his thesis and graduate by summer 1999.

    Prometheus

    Overview:

    Prometheus is not part of our original DOE2000 proposal, and has not been supported by DOE2000. I include it here because it is one of the largest applications I know of that is built on PETSc. It was built by my former PhD student and current postdoc Mark Adams, who was jointly advised by Prof. Robert Taylor of Civil Engineering. Mark has been supported by ASCI and other sources.

    Prometheus is a scalable solver for linear systems arising from finite element problems on irregular meshes. It is based on a variant of algebraic/geometric multigrid. It takes as input the finite element mesh and stiffness matrix, and automatically builds the sequence of coarse meshes, the restriction and interpolation operators, and the coarse grid operators. It has been used to solve 26M degree of freedom problems on a 640 Processor Cray T3E with 55% parallel efficiency, and 20% efficiency on a 336 processor ASCI Blue Pacific. (This benchmark involved a cube with multiple imbedded thin shells, and strong material discontinuities and near incompressibility, i.e. a challenging problem.)

    Here are slides from a recent talk at Petaflops 99. These slides include material on Prometheus and SuperLU.