Getting Software via the WWW

Design and Implementation of LAPACK and ScaLAPACK

(CS 267, Feb 23 1995)

Announcement

Keep reading chapter 5 of the text.

Getting software and reports with the WWW

One of the major practical uses of the WWW is accessing public-domain software and technical reports. We will discuss one repository of numerical software and related reports, called Netlib. If you clone this window and click here, you will get a one-page listing of all the libraries available within netlib. Here is a short guided tour.
  • LAPACK, or Linear Algebra PACKage, is a library of linear algebra routines for high-performance workstations, vector machines, and shared memory parallel computers. We will discuss its design in detail later. LAPACK is written in Fortran 77; clapack is a C version. All source code can be downloaded into your own directory by appropriate pointing and clicking.
  • ScaLAPACK, or Scalable LAPACK, is a distributed memory version of LAPACK, which we will also discuss.
  • There are also many more specialized libraries of software for doing linear algebra, such as for sparse matrices (sparse, sparse-blas, sparspak), solving ordinary differential equations (ode, odepack), partial differential equations (pltmg, bihar, fishpack, pdes), numerical integration (quadpack), Fast Fourier Transforms, or FFTs (fftpack, vfftpk), curve fitting (fitpack), splines (pppack, dierckx), statistics (gcv), optimization (opt), and so on. A one-line description of each library in Netlib is available here.
  • There is also a lot of systems software of various kinds. pvm, pvm3, and hence are portable message passing systems for parallel computing. paragraph helps visualize message traffic in a parallel computer. f2c translates programs written in Fortran to C. bmp is a Fortran multiple precision package (a more modern one called mpfun is available from dbailey@nas.nasa.gov). jakef is a precompiler which takes a set of Fortran subroutines computing a (multivariate) function, and automatically produces another set of subroutines computing the derivative of the function. This is useful for optimization problems (a more modern precompiler called adifor is available from bischof@mcs.anl.gov). paranoia is W. Kahan's program for finding bugs in floating point arithmetic.
  • Benchmarks of various kinds are available, both source code and tech reports (in postscript) of performance results. There are "golden oldies" like Whetstones and Livermore Loops, more recent ones like the Linpack Benchmark (Gaussian elimination), and ones still under construction, like parkbench, which is intended to contain real applications running on distributed memory parallel computers, written either using message passing or in High Performance Fortran.
  • The Conferences Database is a list of conferences in parallel computing and numerical analysis.
  • Since netlib is so large, a keyword-style search facility is available to help find things. A more ambitious search facility called GAMS (Guide to Available Mathematical Software) is available here. There is also a book of templates (this is almost a megabyte, and so takes a while to download), which tries to offer numerical advice on which algorithm to use for solving sparse systems of linear equations (see below). More such advice facilities are planned for the future.
  • To attest to the popularity of this service, usage statistics are also available. As of Feb 1995, there were approximately 9400 requests per day.

    Overview of LAPACK and ScaLAPACK

    As stated above, LAPACK and ScaLAPACK are libraries of linear algebra routines for solving standard linear algebra problems on shared memory and distributed memory machines, respectively. In addition, LAPACK is well suited for workstations like the RS6000/590 and vector computers like a single processor Cray C90. We will outline the contents of these libraries, and discuss the design of the basic linear equation solver in some detail. The major issues will include
  • How to reorganize an algorithm, most naturally expressed using vector operations or matrix-vector multiplication, so that it uses matrix-matrix multiplication instead. This is necessary to run at top speed in the presence of a memory hierarchy.
  • How to lay out dense matrices across multiple processor memories. We will study the tradeoffs among load balancing, minimizing communication, and exploiting the memory hierarchy.
  • How to design parallel software to be portable across a variety of architectures. Since machines change relatively rapidly, this is essential if we want to avoid constantly rewriting applications software.
  • We will use Matlab notation to express our algorithms below. Recall that

  • A(i:j, k:l) refers to the submatrix of A in rows i through j and columns k through l. A(:, k:l) refers to columns k through l, and A(i:j, :) refers to rows i through j.
  • A' is the transpose of A. We assume all matrices are real; otherwise A' means conjugate-transpose.
  • A*B is the product of matrices A and B. It is defined if A has as many columns as B has rows. A.*B is the component-wise product of A and B. It is defined if A and B have the same number of rows and the same number of columns. A.^2 is the same as A.*A.
  • sum(x) is the sum of the entries of x.
  • norm(x) is the Euclidean norm of x: sqrt(sum(x.^2))
  • The six most basic problems of linear algebra are summarized below; for details, see any good textbook on numerical linear algebra. In a future lecture we will give examples of where they arise in real physical applications. A short runnable Matlab sketch of all six appears just after this list.

  • Solve a system of linear equations Ax=b, where A is a general matrix. In matlab this may be solved by x = A\b.
  • Solve a symmetric, positive definite (spd) system of linear equations Ax=b. Symmetry means that A=A', and positive definiteness is equivalent to being able to write A = B*B' for some nonsingular matrix B. In matlab this may also be solved by x = A\b.
  • Solve a least squares problem, i.e. find the x which minimizes norm(Ax-b). Typically A has more rows than columns, so the system is overdetermined, and norm(A*x-b) cannot equal zero. In matlab this may also be solved by x = A\b. This problem comes up in fitting lines and curves to data.
  • Find the eigenvalues lambda_i and (possibly) eigenvectors x_i of A:
             A*x_i = lambda_i * x_i, 
    
    where A is a general matrix. If we let X be the matrix whose columns are the eigenvectors x_1 through x_n, and D be the diagonal matrix whose diagonal entries are lambda_1 through lambda_n, then we may also write this as A*X = X*D, or A = X*D*inv(X), if X is nonsingular (the most common case). In matlab this may be solved by [X,D]=eig(A).

    In physical problems, eigenvalues are often frequencies of vibration, and the eigenvectors are the shapes of the vibrational modes. For a matlab animation of the vibrational modes of a simple bridge (of interest to anyone designing, or driving across, a bridge in earthquake country), type demo in matlab, click on continue, and follow the pull-down menus to Fun/Extras/Miscellaneous/Bending.

  • Find the eigenvalues lambda_i and possibly eigenvectors x_i of a symmetric matrix A. In matlab this may also be solved by [X,D]=eig(A). In contrast to the case of general A, we know that the eigenvalues must be real, and the eigenvectors can be chosen orthogonal to one another. In particular, X can be chosen to be an orthogonal matrix, i.e. X*X' = I.
  • Find the singular value decomposition (SVD) of A:
         A = U*S*V'. 
    
    Here the columns of U and V are called singular vectors, and U and V are orthogonal matrices. S is a diagonal matrix, and its diagonal entries are called singular values. It turns out that the singular values are the square roots of the eigenvalues of either A'*A or A*A', and the singular vectors are the eigenvectors of these two matrices.
  • After a summary of general computational aspects of these problems, we will discuss solving problem 1, Ax=b, using Gaussian elimination.
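
    To make the six problems concrete, here is a minimal Matlab sketch that solves each one with standard built-in functions (backslash, chol, eig, svd); the particular matrices and right hand sides are made up purely for illustration.

        A = [4 1; 1 3];  b = [1; 2];         % small spd matrix and right hand side (made up)
        x  = A \ b;                          % 1) general linear system A*x = b
        R  = chol(A);                        % 2) spd system: Cholesky factor, A = R'*R ...
        x2 = R \ (R' \ b);                   %    ... then two triangular solves give x again
        C  = [1 1; 1 2; 1 3];  d = [1; 2; 2];
        y  = C \ d;                          % 3) least squares: minimizes norm(C*y - d)
        [X, D]    = eig(rand(4));            % 4) eigenvalues/eigenvectors of a general matrix
        [Xs, Ds]  = eig(A);                  % 5) symmetric eigenproblem: real Ds, orthogonal Xs
        [U, S, V] = svd(C);                  % 6) singular value decomposition, C = U*S*V'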

    The most important determiner of which algorithm to use for these problems is whether the matrix A is dense or sparse. A dense matrix has all (or mostly) nonzero entries, without any special properties that would let one represent it in much less than n^2 words. Sparse matrices are the complement of dense matrices: they have lots of zero entries, or are otherwise representable in much less than n^2 words. The algorithms for sparse matrices, which are very common in science and engineering, are often rather different from algorithms for dense matrices, and we will discuss them in detail later.

    For a small example click here. The first picture is a mesh around a wing. The physical problem is to compute the airflow around the wing at the mesh points. This leads to solving a sparse n-by-n system of spd linear equations Ax=b, where n=4253 and A has the "sparsity structure" in the top left plot of the second picture (black dots represent nonzero entries, the rest are zero). Only 0.16% of the 4253^2 entries of A are nonzero. The top right plot of the second picture is a different view of the matrix, with the rows and columns permuted. If we did Gaussian elimination on the spd matrix A without taking sparsity into account, it would take (1/3)*n^3 = 2.6e10 floating point operations. If we do Gaussian elimination on the matrix in the top left, and take sparsity into account by not doing unnecessary operations on zero entries, it performs over 2200 times fewer operations, resulting in the Cholesky factor at the bottom left. If we do Gaussian elimination on the matrix at the top right, we get the Cholesky factor at the bottom right for 4 times fewer operations still. This matrix is available in matlab by typing demo, and clicking on continue/Matlab_Visit/Matrices/Select_a_demo/Cmd_Line_Demos/NASA_Airfoil; all the pictures and other manipulations are one-liners in matlab. Designing and implementing parallel sparse matrix algorithms is an active field of research, and at least one project is available in this area. LAPACK and ScaLAPACK currently deal mostly with dense matrices.
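
    The same phenomenon can be seen in a few lines of Matlab on a model problem (a 2-D Poisson matrix from the gallery collection is used here as a stand-in for the airfoil matrix); this sketch compares the fraction of nonzeros and the fill-in of the Cholesky factor before and after a fill-reducing reordering of the rows and columns.

        A  = gallery('poisson', 30);         % sparse spd matrix of order 900 from a 5-point Laplacian
        density = nnz(A) / numel(A)          % fraction of nonzero entries (well under 1%)
        R1 = chol(A);       nnz(R1)          % Cholesky factor in the original ordering
        p  = symamd(A);                      % approximate minimum degree reordering
        R2 = chol(A(p,p));  nnz(R2)          % usually much less fill, hence fewer operations
        b  = ones(size(A,1), 1);
        x  = A \ b;                          % backslash exploits sparsity automatically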

    (Temporary pointer to nested dissection illustration)

    The next important distinction is between the first three problems (solving Ax=b and least squares problems) and the last three problems (eigenvalues and the SVD). The first three problems have exact, finite solution procedures, while the last three can only be solved approximately. More precisely, O(n^3) algorithms exist for each of the first three which would yield the exact solution in the absence of roundoff errors. (Analyzing roundoff is a topic for a numerical analysis class, and we will only mention it here when it requires us to modify the algorithm to make it reliable.) In contrast, the last three problems only have iterative solutions (for 5-by-5 matrices or larger), because they implicitly require finding the roots of the characteristic polynomial of the matrix A, p(x) = determinant(A - x*I) (for the SVD, substitute A'*A for A). As shown by Abel and Galois, no finite formula using only arithmetic operations and radicals can compute the roots of a general polynomial of degree 5 or greater. A large variety of iterative solutions exist, and in practice the best iterations converge so reliably and so rapidly (quadratically or cubically) that they are often treated as if they were finite solutions.
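
    The connection to polynomial root-finding can be seen in the following tiny Matlab sketch (for illustration only; going through the characteristic polynomial is numerically a bad idea, and it is not what LAPACK does):

        A = rand(6);                 % a random 6-by-6 matrix, made up for illustration
        lambda = eig(A);             % eigenvalues, computed iteratively (the QR iteration)
        c  = poly(A);                % coefficients of the characteristic polynomial det(x*I - A)
        mu = roots(c);               % its roots agree with lambda, up to roundoff and ordering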

    The latter three problems are all solved in 2 phases: a (finite) algorithm to reduce the original dense matrix to a more compactly representable form with the same eigenvalues (or singular values), followed by an iterative phase on the compact form. We will again distinguish the first three problems from the initial phases of the last three: for the first three problems, it is known how to do almost all the work by using matrix-matrix multiplication (BLAS3). In contrast, the first phases of the last three problems are more difficult to convert to BLAS3; the algorithm in LAPACK does about 50% BLAS3 and 50% matrix-vector multiplication (BLAS2). More recently, work on "multi-step band-reduction" by C. Bischof, X. Sun, and others has shown that if no eigenvectors are desired, one can use nearly 100% BLAS3. If eigenvectors are desired, one either has to use BLAS2 or increase the operation count significantly (by 2n^3).

    In other words, we can in principle take nearly full advantage of the memory hierarchy for the first three problems, but not the second three, at least if eigenvectors are desired and we do not want to increase the operation count significantly.
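
    As a point of reference, here is a tiny Matlab sketch of the three BLAS levels (the sizes are arbitrary); what matters is the ratio of arithmetic to data touched, which determines how well the memory hierarchy can be exploited:

        n = 500;  alpha = 2;
        x = rand(n,1);  y = rand(n,1);  A = rand(n);  B = rand(n);
        y = alpha*x + y;     % BLAS1 (saxpy): O(n)   flops on O(n)   data
        y = A*x;             % BLAS2 (gemv):  O(n^2) flops on O(n^2) data
        C = A*B;             % BLAS3 (gemm):  O(n^3) flops on O(n^2) data -- most reuse per word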

    Now consider just the last three problems. The iterative phase of finding the eigenvalues of a general matrix is much more expensive than the initial reduction phase. In contrast, the best recent iterations for the symmetric eigenvalue problem and SVD are in principle much cheaper than the corresponding reduction phases (this was not true until recently).

    Good parallel software for the first three problems exists in LAPACK and ScaLAPACK. Software for the second three problems exists in LAPACK, with recent better algorithms on the way. ScaLAPACK currently only solves one of the second three problems (the symmetric eigenproblem), and the Feb 1995 release makes certain unfortunate compromises between parallel efficiency and accuracy, which better algorithms should make unnecessary. Class projects to improve and expand ScaLAPACK are available.

    It is interesting to ask whether the best parallel algorithm is always a parallel version of the best serial algorithm. This is usually the case, and in fact the search for better parallel algorithms has sometimes led to better serial algorithms as well. However, there are some interesting exceptions. For example, certain "banded" linear systems can be solved more quickly using parallel-prefix than Gaussian elimination, but the results are numerically unstable in the presence of roundoff, and so potentially much less accurate. Currently, LAPACK and ScaLAPACK use different algorithms for finding eigenvalues, but we expect them to use the same algorithm eventually, once we better understand the problem.

    Quick review of Gaussian Elimination

    Gaussian elimination to solve Ax=b starts with a dense matrix, and adds multiples of each row to subsequent rows in order to create zeros below the diagonal, ending up with an upper triangular matrix we will call U. Then, one solves a linear system with U, which can be done by substitution starting with the last variable. The simplest way one could write this algorithm is as follows, where we ignore transformations of b for the time being:
        for i = 1 to n-1    ... for each column i, zero it out below the diagonal
            for j = i+1 to n    ... each row j below row i
                for k = i to n      ... add a multiple of row i to row j
                    A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
    

    Let us improve this rudimentary implementation step by step. First, we remove the constant (A(j,i)/A(i,i)) from the innermost loop:

        for i = 1 to n-1    
            for j = i+1 to n    
                m = A(j,i)/A(i,i)
                for k = i to n      
                    A(j,k) = A(j,k) - m * A(i,k)
    
    Next, we avoid computing results we know, i.e. the zeros below the diagonal:
        for i = 1 to n-1    
            for j = i+1 to n    
                m = A(j,i)/A(i,i)
                for k = i+1 to n      
                    A(j,k) = A(j,k) - m * A(i,k)
    
    It will be convenient to store the multipliers m in the implicitly created zeros below the diagonal, so we can use them later to transform the right hand side b:
        for i = 1 to n-1    
            for j = i+1 to n    
                A(j,i) = A(j,i)/A(i,i)
            for j = i+1 to n    
                for k = i+1 to n      
                    A(j,k) = A(j,k) - A(j,i) * A(i,k)
    
    Now we use Matlab notation to express the algorithm even more compactly. Note that the last two loops are really multiplying the column vector A(i+1:n, i) times the row vector A(i, i+1:n), and subtracting the resulting (n-i)-by-(n-i) matrix from A(i+1:n, i+1:n):
        for i = 1 to n-1
            A(i+1:n, i) = A(i+1:n, i) / A(i,i)
            A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n)
    
    Thus, the inner loop consists of one BLAS1 operation (scaling A(i+1:n,i) by a constant) and one BLAS2 operation (subtracting the rank-one matrix A(i+1:n,i)*A(i,i+1:n) from A(i+1:n, i+1:n)). Here is a picture of what happens at step i.
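
    For concreteness, the compact algorithm above can be written as the following runnable (but unpivoted, and therefore not yet reliable) Matlab function; lu_nopivot is just a name chosen here for this sketch, not an LAPACK routine:

        function [L, U] = lu_nopivot(A)
        % LU factorization without pivoting, exactly as in the loop above.
        % Fails (divides by zero) if some pivot A(i,i) becomes zero.
        n = size(A, 1);
        for i = 1:n-1
            A(i+1:n, i)     = A(i+1:n, i) / A(i,i);                       % BLAS1: scale the column
            A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n);  % BLAS2: rank-one update
        end
        L = eye(n) + tril(A, -1);   % unit lower triangular factor (the multipliers)
        U = triu(A);                % upper triangular factor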

    Let M denote the strictly lower triangular matrix of multipliers stored below the diagonal of A after the algorithm ends, and let L = I + M. We state the following easy lemma without proof.

    Lemma. (LU Factorization). If the above algorithm terminates (i.e. it did not try to divide by zero) then A = L*U.

    Now we can state our complete algorithm for solving A*x=b:

       1) Factorize A = L*U.
       2) Solve L*y = b for y by doing forward substitution.
       3) Solve U*x = y for x by doing backward substitution.
    
    Then x is the solution we seek because A*x = L*(U*x) = L*y = b.
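
    Using the lu_nopivot sketch above together with Matlab's backslash (which recognizes triangular matrices and substitutes accordingly), the three steps look like this; the 2-by-2 example is made up, and chosen so that no zero pivot occurs:

        A = [4 3; 6 3];   b = [10; 12];      % made-up example with nonzero pivots
        [L, U] = lu_nopivot(A);              % step 1: factorize A = L*U
        y = L \ b;                           % step 2: forward substitution
        x = U \ y;                           % step 3: backward substitution
        norm(A*x - b)                        % should be zero up to roundoff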

    Let us compute the serial and PRAM complexity of this algorithm. From the matlab code, it is easy to see that the number of floating point operations performed (serial complexity) is

     sum_{i=1}^{n-1}  [     (n-i)      ... cost of A(i+1:n,i) / A(i,i)
                        + 2*(n-i)^2 ]  ... cost of A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
       = (2/3)*n^3 + O(n^2)
    
    Solving L*y=b and U*x=y each costs only about n^2 operations, so the total cost is still (2/3)*n^3 + O(n^2).
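
    One can check the operation count numerically; for example, with n = 200 (an arbitrary choice) the exact sum and the leading term (2/3)*n^3 agree to within about one percent:

        n = 200;
        i = 1:n-1;
        exact  = sum( (n-i) + 2*(n-i).^2 )   % flops summed over all steps i
        approx = (2/3)*n^3                   % leading term of the count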

    Assuming we have one processor per matrix entry, and that communication is free (the PRAM model), the cost is

     sum_{i=1}^{n-1}  [   1    ... cost of A(i+1:n,i) / A(i,i)
                        + 2 ]  ... cost of A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
       = 3*(n-1)
    
    since all the divisions in the inner loop can be done at once, all the multiplications can be done at once, and all the subtractions can be done at once. Thus, a lot of parallelism is available in this algorithm for us to exploit. One can also confirm that solving L*y=b and U*x=y in the straightforward ways below:
       y(1:n) = b(1:n)         ... in practice, y is overwritten on b
       for i = 1 to n-1
           y(i+1:n) = y(i+1:n) - y(i)*L(i+1:n,i)
    
       x(1:n) = y(1:n)         ... in practice, x is overwritten on y, i.e. b
       for i = n downto 1
           x(i) = x(i)/U(i,i)
           x(1:i-1) = x(1:i-1) - x(i)*U(1:i-1,i)
    
    costs (1+2*(n-1)) + (1+n+2*(n-1)) = 5n-2, for a grand total of 8*n-5. Interestingly, in the PRAM model, solving triangular systems with the most straightforward algorithms is about as expensive as LU factorization, whereas on a serial machine triangular system solving is much cheaper than factorization.
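
    The two substitution loops above can likewise be written as a runnable Matlab function (lu_solve is just a name chosen here for this sketch); it assumes L is unit lower triangular, as produced by the factorization:

        function x = lu_solve(L, U, b)
        % Solve L*U*x = b by forward then backward substitution,
        % following the column-oriented loops above.
        n = length(b);
        y = b;
        for i = 1:n-1                                   % forward substitution, L unit lower triangular
            y(i+1:n) = y(i+1:n) - y(i) * L(i+1:n, i);
        end
        x = y;
        for i = n:-1:1                                  % backward substitution with U
            x(i) = x(i) / U(i,i);
            x(1:i-1) = x(1:i-1) - x(i) * U(1:i-1, i);
        end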

    Here are some obvious problems with this algorithm, which we need to address:

  • If A(i,i) is zero, the algorithm cannot proceed. If A(i,i) is tiny, we will also have numerical problems.
  • The majority of the work is done by calls to the BLAS2 (rank-one update), which does not exploit a memory hierarchy as well as a BLAS3 operation like matrix-matrix multiplication.
  • In the next lecture we will address both these issues.