NetSolve has been ported to every major UNIX platform. A Windows 95/98/NT version of the client (C, MATLAB, and Mathematica) is available and a server version is being developed. The following numerical and scientific libraries have been integrated into NetSolve computational servers: ARPACK, BLAS, FFTPACK, FitPack, ItPack, LAPACK, MinPack, PETSc, AZTEC, and ScaLAPACK. These numerical libraries cover several fields of scientific computing; Linear Algebra, Optimization, Fast Fourier Transforms, etc. This includes only the software that has been integrated by the NetSolve team.
The NetSolve client is now available from C, Matlab and Mathematica interfaces for MS-Windows platform. The server and agent are soon to follow. The Mathematica interface is also available from UNIX platforms.
NetSolve version 1.2, which was developed during the summer and fall of 1998, has now been officially released after undergoing testing and bug fixes. As previously reported, this version is a complete rewrite of the software with much tighter code and more modularity. The most important addition is the new DoAll (aka. task farming) interface. The DoAll interface was motivated by an application with researchers at the Salk Institute and is designed to make it simple to implement and optimize embarrassingly parallel applications within NetSolve. Other enhancements to Ver. 1.2 include a Win32 (95/98/NT) implementation of the client interface for MATLAB, C, and Mathematica, which opens access to NetSolve-enabled grids to a much larger user community.
This FY also saw the continued of the integration of NetSolve with other Metacomputing environments. The NetSolve and the Globus and Legion groups collaborated to make the systems interoperate; the first substantial results of this effort were presented in January 1999. We also began actively examining methods for adding improved security and access control into NetSolve, focusing on the integration with Globus tools for this purpose.
Progress is being made on various other aspects of NetSolve, which we expect to come to fruition later this quarter. Improvements will include the following:
Research with NetSolve focuses on providing intra-server functionalities. As continued development expands the collection of high-performance server software, Netsolve enables us to integrate a variety of techniques for fault-tolerance and load-balancing and to compare their performance directly with each other. These techniques may be general-purpose (e.g. based on checkpointing and rollback recovery), or specific to certain computations on certain platforms. In any event, by developing these techniques within the context of NetSolve we can ascertain their real performance on real problems in real applications. Through NetSolve we will also be able to immediately deploy the best of these techniques to yield direct positive benefits to the end user.
Second, we intend to develop the required technology to enable experimentation of promising new dynamic or run-time automated optimization techniques. The optimization strategies we have explored so far are inherently static, that is, optimization is performed during the installation process of the code generator, i.e., at compile time. This approach is well-suited for the computational building blocks of many ASCI and SSI related applications, because of the structural properties of the sparse matrices arising from the applications. In a more general setting however, these static automated optimization techniques may be insufficient. So, we propose to enhance this static approach with dynamic and run-time techniques. Indeed, during the execution of a program implementing an iterative linear system solver, a few basic computational kernels such as the sparse matrix-vector multiply are called a large number of times. This suggests the possibility of performing optimizations at run-time, as well as at compile-time. In this scheme, the first call to such a dynamically optimized subprogram diagnoses the overall structure of the sparse matrix operand, and a first-guess approach is used to perform the operation. The time for this first try will be retained for the next call. On the next call, we can try a slightly different mechanism for the operation. As the function is called iteratively, mechanisms yielding performance wins are retained, while those options that fail to increase efficiency are discarded. This is precisely the method we presently use at compile time for the dense linear algebra kernel generator to specialize the code to unknown hardware. The non-uniform nature of sparse operands, coupled with the iterative nature of the call, allows for the possibility of runtime optimizations, as the cost of the structural analysis and subsequent algorithmic tests are amortized over the many calls to the routine.
This approach encourages the idea of structured programming and improves the self-documenting quality of the software by specifying basic building blocks and identifying these operations with unique mnemonic names. Since a significant amount of execution time in complicated linear algebraic programs may be spent in a few low level operations, reducing the execution time spent in these operations reflects an overall reduction in the execution time of the program. The programming of some of these low level operations involves algorithmic and implementation subtleties that are likely to be ignored in the typical applications programming environment. If there could be general agreement on standard names and parameter lists for some of these basic operations, then portability and efficiency could also be achieved.
The Level 1 Basic Linear Algebra Subprograms (BLAS) and test suite were the result of a collaborative project in 1973-77. Following the distribution of the initial version of the specifications to people active in the development of numerical linear algebra software, a series of open meetings were held at conferences. The meetings were attended by approximately thirty people, and as a result, extensive modifications were made in an effort to improve the design and make the subroutines more robust. Specifications for the Level 2 and 3 BLAS were drawn up in 1984-86 and 1987-88. These specifications made it possible to construct new software packages based on block-partitioned algorithms to more effectively utilize the memory hierarchy of modern computers. To a great extent, the user community embraced the BLAS, not only for performance reasons, but also because developing software around a core of common routines like the BLAS is good software engineering practice. Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable software to achieve high performance with portable code.
The goals of the forum have been to stimulate thought and discussion for the functionality and future development of a set of standards for basic matrix data structures in both dense and sparse applications in sequential and SMP settings. The major aims of these standards are to enable linear algebra libraries (both public domain and commercial) to interoperate efficiently and easily. Numerical methods for sparse and dense matrices require highly efficient kernels that provide functionality similar to that in the traditional BLAS on sequential machines. In addition, iterative and sparse direct methods require additional functionality not in traditional BLAS. Hardware and software vendors, independent software vendors, application programmers, and higher level library writers all benefit from the efforts of this forum and are the end users of these interfaces.
SuperLU is joint work with Xiaoye Li at NERSC.
The main challenge in implementing sparse Gaussian elimination is that any naive and even most sophisticated implementations of sparse Gaussian elimination will spend most of their time not doing floating point operations at full speed, but rather accessing and updating the linked data structures that store the sparse factors. For cache-based workstations with multiple functional units, sparsity also means that it is hard to attain the full speed of BLAS3 operations, which assume that floating point operands are stored in consecutive memory locations. We introduced a number of innovations in the sequential code to overcome these obstacles. First, we use we use Gilbert and Peierl's depth-first-search with Eisenstat and Liu's symmetric structural reductions to minimize the cost of data structure update and traversal. Second, we use unsymmetric supernodes, with supernode-panel updates and two-dimensional data partitioning, to automatically find and exploit small dense matrices that arise during the factorization, to get the most out of BLAS kernels. For example, the performance on an IBM RS6000/590 ranges up to 124 Mflops out of a peak of 266 Mflops, compared to 168 Mflops for dense LU, with similar performance on the MIPS R8K, and DEC Alpha.
The parallel implementation is more challenging because of its dynamic and somewhat unpredictable way of generating work and intermediate results at run time. In addition to the innovation of the last paragraph, we had to redefine the panels and supernodes, and then schedule two types of parallel tasks using them via a shared task queue. One task is factoring the independent panels on the disjoint subtrees in the column elimination tree of $A$, and the other task is updating a panel by previously computed supernodes. No global synchronization is used in the algorithm. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8 and 0.8 Gigaflops, on the 12 processor Power Challenge, 8 processor Cray C90, 16 processor Cray J90, and 8 processor AlphaServer 8400, respectively.
Both the uniprocessor and SMP versions of SuperLU have been released, with extensive documentation.
We presented a paper on this work at SuperComputing 98, where we obtained over 8 Gflops on a 512 processor Cray T3E, on a matrix of dimension nearly 52000 arising the device simulation.
In joint work with Bill McCurdy at LBNL, who is working on a quantum chemistry problem calculating electron impact ionization cross-sections, we solved a problem of dimension n=209764 in 2 minutes on a 16 processor Cray T3E, and a problem of size n=736164 on a 24 processor ASCI Blue Pacific in 21 minutes.
Only the numerical phase has been parallelized, so work remains to make the entire algorithm scalable by parallelizing the symbolic phases as well. In particular, we need to parallelize the algorithm by Duff and Koster that we use, both to make the code scalable and to make it completely public domain, since the Duff/Koster code is part of the Harwell subroutine library.
We need to use a judicious amount of extra precision to make SuperLU numerically reliable. We will be using technology developed as part of the XBLAS (described above) to enable this.
We are working on an incomplete factorization version of SuperLU. This is of interest for all architectures, from uniprocessor workstations to distributed memory machines. The efficiency challenges are even greater than with Gaussian elimination, since the suppression of fill-in makes it harder to mask memory latency.
Here are slides from a recent talk at Petaflops 99. These slides include material on Prometheus (see below) and SuperLU.
All three phases permit significant algorithmic improvements.
We are currently trying to incorporate the improvements in the ASCI Red code back into ScaLAPACK to make them generally available. The approach involves modifying a basic ScaLAPACK design principle, which is to use the same data layout for all phases of the algorithm. Instead, we use different layouts optimized for each phase, and pay the price of converting layouts between phases; this turned out to be quite advantageous overall, doubling the speed in some cases. More important, it makes the issue of choosing a good layout the algorithms responsibility instead of the users, so performance is uniformly high nearly independently of the format chosen by the user.
Another modification requires a new BLAS2 routine. This is based on the observation that the two matrix-vector multiplications y=T*x and z = TT*w can be performed in about the same time as one of them, since the time for accessing T in memory outweighs the cost of the floating point arithmetic. This involves collaboration with the Basic Linear Algebra Subprogram Technical Forum described above, to incorporate this new BLAS 2 routine into the standard.
The second set of improvements involve Phase 2. This involves completing and integrating a new algorithm that is optimal in two senses: it has perfect output complexity O(kn) to compute k eigenvectors of length n (O(1) work per solution component), and is embarrassingly parallel (each eigenvector can be computed independently in parallel). Because of the dual optimality we call this algorithm the Holy Grail. This is based on work of Beresford Parlett, his graduate student Inderjit Dhillon (whose thesis has been nominated for the 1999 Household Prize for best thesis in numerical linear algebra over the previous 3 years), and is also being done in collaboration with Osni Marques of NERSC. We anticipate releasing sequential code for LAPACK in Spring 1999, just for the symmetric tridiagonal and dense symmetric eigenvalue problems.
Pbody is not part of our original DOE2000 proposal, but is work that had been going on independently, and as it has reached maturity, I have supported it with DOE2000 funds by paying for the graduate student responsible, David Blackston, for the current semester.
Pbody is a parallel library of N-body methods, i.e. for quickly calculating the electrostatic or gravitational forces or potentials on N particles. It is portable across a range of parallel machines, permits the user a wide range of choices in how the computation is done, and is efficient.
The motivation for Pbody is two-fold. First, there are a variety of algorithms for the N-body problem with varying accuracies, efficiencies, load balancing techniques, and communication mechanisms. Prior implementations had been handcrafted for a particular machine, particular algorithm (Barnes-Hut, Fast Multipole, ...), etc. Not all useful algorithms were available on all machines. Second, we noticed that we could implement all these algorithms and their variations in a unified fashion (using so-called Bulk-Synchronous Programming (BSP)) that made it possible to implement all versions of all algorithms portably across all machines, with relatively little loss of efficiency.
Prometheus is not part of our original DOE2000 proposal, and has not been supported by DOE2000. I include it here because it is one of the largest applications I know of that is built on PETSc. It was built by my former PhD student and current postdoc Mark Adams, who was jointly advised by Prof. Robert Taylor of Civil Engineering. Mark has been supported by ASCI and other sources.
Prometheus is a scalable solver for linear systems arising from finite element problems on irregular meshes. It is based on a variant of algebraic/geometric multigrid. It takes as input the finite element mesh and stiffness matrix, and automatically builds the sequence of coarse meshes, the restriction and interpolation operators, and the coarse grid operators. It has been used to solve 26M degree of freedom problems on a 640 Processor Cray T3E with 55% parallel efficiency, and 20% efficiency on a 336 processor ASCI Blue Pacific. (This benchmark involved a cube with multiple imbedded thin shells, and strong material discontinuities and near incompressibility, i.e. a challenging problem.)
Here are slides from a recent talk at Petaflops 99. These slides include material on Prometheus and SuperLU.