Design and Implementation of LAPACK and ScaLAPACK

• Overview of LAPACK and ScaLAPACK
• Quick review of Gaussian Elimination
• Pivoting to avoid tiny A(i,i)
• How to Use BLAS3 in Gaussian Elimination with Partial Pivoting
• Performance of LAPACK LU Decomposition
• How to Layout Matrices on Distributed Memory Machines
• Distributed Gaussian elimination with a 2D Block Cyclic Layout
• Modeling the Performance of Distributed Gaussian Elimination
• Software Structure of ScaLAPACK
• Getting software and reports with the WWW
• Overview of LAPACK and ScaLAPACK

LAPACK and ScaLAPACK libraries of linear algebra routines. LAPACK stands for Linear Algebra PACKage, and ScaLAPACK stands for Scalable LAPACK. LAPACK and ScaLAPACK contain routines for standard linear algebra problems on shared memory and distributed memory machines, respectively. In addition, LAPACK is well suited for workstations like the RS6000/590 and vector computers like a single processor Cray C90. LAPACK is written in Fortran 77, but a C version called CLAPACK is also available. A detailed user guide for LAPACK and CLAPACK is available, and one for ScaLAPACK is forthcoming.

We will outline the contents of these libraries, and discuss the design of the basic linear equation solver in some detail. The major issues will include

• How to reorganize an algorithm, most naturally expressed using either vector operations or perhaps matrix-vector multiplication, to use matrix-matrix multiplication. This is necessary to run at top speed in the presence of a memory hierarchy (see Lecture 2).
• How to layout dense matrices across multiple processor memories. We will study the tradeoffs among load balancing, minimizing communication, and exploiting the memory hierarchy.
• How to design parallel software to be portable across a variety of architectures. Since machines change relatively rapidly, this is essential if we want to avoid constantly rewriting applications software.
• We will use Matlab notation to express our algorithms below. Recall that

• A(i:j, k:l) refers to the submatrix of A in rows i through j and columns k through l. A(:, k:l) refers to columns k through l, and A(i:j, :) refers to rows i through j.
• A' is the transpose of A. We assume all matrices are real; otherwise A' means conjugate-transpose.
• A*B is the product of matrices A and B. It is defined if A has as many columns as B has rows. A.*B is the component-wise product of A and B. If is defined if A and B have the same number of rows and the same number of columns. A.^2 is the same as A.*A.
• sum(x) is the sum of the entries of x.
• norm(x) is the Euclidean norm of x: sqrt(sum(x.^2))
• The 6 most basic problem of linear algebra are summarized below. For details, see any good text book in numerical linear algebra, such as "Matrix Computations" by Golub and Van Loan, or "Numerical Linear Algebra" by J. Demmel. Lectures 11, 12 and 13 describe how some of these problems arise in doing simulations of physical systems.

• Solve a system of linear equations Ax=b, where A is a general matrix. In matlab this may be solved by x = A\b.
• Solve a symmetric, positive definite (spd) system of linear equations Ax=b. Symmetry means that A=A', and positive definiteness is equivalent to being able to write A = B*B' for some other matrix B. In matlab this may also be solved by x = A\b.
• Solve a least squares problem, i.e. find the x which minimizes norm(Ax-b). Typically A has more rows than columns, so the system is overdetermined, and norm(A*x-b) cannot equal zero. In matlab this may also be solved by x = A\b. This problem comes up in fitting lines and curves to data.
• Find the eigenvalues lambda_i and (possibly) eigenvectors x_i of A:
```         A*x_i = lambda_i * x_i,
```
where A is a general matrix. If we let X be the matrix whose columns are the eigenvectors x_1 through x_n, and D be the diagonal matrix whose diagonal entires are lambda_1 through lambda_n, then we may also write this as A*X = X*D, or A = X*D*inv(X), if X is nonsingular (the most common case). In matlab this may be solved by [X,D]=eig(A).

In physical problems, eigenvalues are often frequencies of vibration, and the eigenvectors are the shapes of the vibrational modes. For a matlab animation of the vibrational modes of a simple bridge (of interest to anyone designing, or driving across, a bridge in earthquake country), type demo in matlab, click on continue, and follow the pull-down menus to Fun/Extras/Miscellaneous/Bending.

• Find the eigenvalues lambda_i and possibly eigenvectors x_i of a symmetric matrix A. In matlab this may also be solved by [X,D]=eig(A). In contrast to the case of general A, we know that the eigenvalues must be real, and the eigenvectors must be orthogonal to one another. In particular, X is can be chosen to be an orthogonal matrix, i.e. X*X' = I.
• Find the singular value decomposition (SVD) of A:
```       A = U*S*V'.
```
Here the columns of U and V are called singular vectors, and U and V are orthogonal matrices. S is a diagonal matrix, and its diagonal entries are called singular values. It turns out that the singular values are the square roots of the eigenvalues of either A'*A or A*A', and the singular vectors are the eigenvectors of these two matrices.
• After a summary of general computational aspects of these problems, we will discuss solving problem 1, Ax=b, using Gaussian elimination.

The most important determiner of which algorithm to use for these problems is whether the matrix A is dense or sparse. A dense matrix has all (or mostly) nonzero entries, without any special properties that would let one represent using many fewer than n^2 independent numbers. Sparse matrices either have lots of zero entries, or are otherwise much more compactly representable than n^2 numbers. The algorithms for sparse matrices, which are very common in science and engineering, are often rather different than algorithms for dense matrices, and we will discuss them in detail in other lectures.

For a small example of a sparse matrix, click here. The first picture is a mesh around a wing. The physical problem is to compute the airflow around the wing at the mesh points. This leads to solving a sparse n-by-n system of spd linear equations Ax=b, where n=4253 and A has the "sparsity structure" in the top left plot of the second picture (black dots represent nonzero entries, the rest are zero). Only .16% of the 4253^2 entries of A are nonzero. The top right plot of the second picture is a different view of the matrix, with the rows and columns permuted. If we did Gaussian elimination on the spd matrix A without taking sparsity into account, it would take (1/3)*n^3 = 2.6e10 floating point operations. If we do Gaussian elimination on the matrix in the top left, and take sparsity into account by not doing unnecessary operations on zero entries, it does over 2200 times fewer operations, resulting in the Cholesky factor at the bottom left. If we do Gaussian elimination on the matrix at the top right, we get the Cholesky factor at the bottom right for 4 times fewer operations still. This matrix is available in matlab by typing demo, and clicking on

• continue
• Matlab_Visit
• Matrices
• Select_a_demo
• Cmd_Line_Demos
• NASA_Airfoil
• (All the pictures and other manipulations are one-liners in matlab.) Designing and implementing parallel sparse matrix algorithms is a currently active field of research. LAPACK and ScaLAPACK currently deal mostly with dense matrices, although some sparse matrix software is available, with more on the way.

The next important distinction among linear algebra problems is between the first two problems (solving Ax=b and least squares problems), and the last problem (computing eigenvalues and the SVD). The first two problems have exact, finite solution procedures, and the latter can only be solved approximately. The first two problems have exact finite solutions in the sense that algorithms doing O(n^3) operations exist for each, which would yield the exact solution in the absence of round-off errors. (Analyzing roundoff is a topic for a numerical analysis class, and we will only mention it here when it requires us to modify the algorithm to make it reliable.) In contrast, to compute eigenvalues requires iterative solutions, because they implicitly require finding the roots of the characteristic polynomial of the matrix A: p(x)=determinant(A-x*I) (for the SVD substitute A'*A for A). As shown by Galois, no finite algorithm can exist to find the roots of a general polynomial of degree 5 or greater. A large variety of iterative solutions exist. In practice, the best iterations converge so reliably and so rapidly (quadratically, where the error is squared at every step, or even cubically) that they are often considered to be finite solutions.

Eigenvalue and singular value problems are all solved in 2 phases: a (finite) algorithm to reduce the original dense matrix to a more compactly representable form with the same eigenvalues (or singular values), and an iterative phase on the compact form. We will again distinguish the first two problems from the initial phases of the latter problems: For the first two problems, it is known how to do almost all the work by using matrix-matrix multiplication (BLAS3), which according to Lecture 2 is the fastest available linear algebra building block. In contrast, the first phases of the latter problems are more difficult to convert to BLAS3. The algorithm in LAPACK does 50% of its work in the BLAS3 and 50% in the slower matrix-vector multiplication (BLAS2). More recently, work on "multi-step band-reduction" by C. Bischof, X. Sun, and others has shown that if no eigenvectors are desired, one can use nearly 100% BLAS3. If eigenvectors are desired, one either has to use BLAS2 or increase the operation count significantly (by 2n^3).

In other words, we can in principle take nearly full advantage of the memory hierarchy for the first two problems, but not the latter ones, at least if eigenvectors are desired and we do not want to increase the operation count significantly.

Now consider just eigenvalue and singular value problems. The iterative phase of finding the eigenvalues of a general matrix is much more expensive than the initial reduction phase. In contrast, the best recent iterations for the symmetric eigenvalue problem and SVD are in principle much cheaper than the corresponding reduction phases (this was not true until recently).

Good parallel software for the first two problems exists in LAPACK and ScaLAPACK. Software for eigenvalue and singular value problems exists in LAPACK, with recent better algorithms on the way. ScaLAPACK currently only solves the symmetric eigenproblem, and the Feb 1995 release makes certain unfortunate compromises between parallel efficiency and accuracy, which future algorithms should make unnecessary. Class projects to improve and expand ScaLAPACK are available.

It is interesting to ask whether the best parallel algorithm is always a parallel version of the best serial algorithm. This is usually the case, and in fact the search for better parallel algorithms has sometimes led to better serial algorithms as well. However, there are some interesting exceptions. For example, certain "banded" linear systems can be solved more quickly using parallel-prefix than Gaussian elimination, but the results are numerically unstable in the presense of roundoff, and so potentially much less accurate. Currently, LAPACK and ScaLAPACK use different algorithms for finding eigenvalues, but we expect them to use the same algorithm eventually, once we better understand the problem.

Quick review of Gaussian Elimination

Gaussian elimination to solve Ax=b starts with a dense matrix, and adds multiples of each row to subsequent rows in order to create zeros below the diagonal, ending up with an upper triangular matrix we will call U. Then, one solves a linear system with U, which can be done by substitution starting with the last variable. The simplest way one could write this algorithm is as follows, where we ignore transformations of b for the time being:
``` ... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
... each row j below row i
for j = i+1 to n
... add a multiple of row i to row j
for k = i to n
A(j,k) = A(j,k) -
(A(j,i)/A(i,i)) * A(i,k)
```

Let us improve this rudimentary implementation step by step. First, we remove the constant (A(j,i)/A(i,i)) from the innermost loop:

```  for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i to n
A(j,k) = A(j,k) - m * A(i,k)
```
Next, we avoid computing results we know, i.e. the zeros below the diagonal:
```  for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - m * A(i,k)
```
It will be convenient to store the multipliers m in the implicitly created zeros below the diagonal, so we can use them later to transform the right hand side b:
```  for i = 1 to n-1
for j = i+1 to n
A(j,i) = A(j,i)/A(i,i)
for j = i+1 to n
for k = i+1 to n
A(j,k) = A(j,k) - A(j,i) * A(i,k)
```
Now we use Matlab notation to express the algorithm even more compactly. Note that the last two loops are really multiplying the column vector A(i+1:n, i) times the row vector A(i, i+1:n), and add the resulting (n-i)-by-(n-i) matrix to A(i+1:n, i+1:n):
```  for i = 1 to n-1
A(i+1:n, i) = A(i+1:n, i) / A(i,i)
A(i+1:n, i+1:n) = A(i+1:n, i+1:n) -
A(i+1:n, i)*A(i, i+1:n)
```
Thus, the inner loop consists of one BLAS1 operation (scaling A(i+1:n,i) by a constant), and one BLAS2 operation (adding a rank-one matrix -A(i+1:n,i)*A(i,i+1:n), to A(i+1:n, i+1:n)). Here is a picture of what happens at step i.

Call the strictly lower triangular matrix of multipliers which is stored below the diagonal of A after the algorithm ends M, and let L=I+M. Recall that U is the upper triangular matrix stored in the upper triangular of A. We state the following easy lemma without proof.

Lemma. (LU Factorization). If the above algorithm terminates (i.e. it did not try to divide by zero) then A = L*U.

Now we can state our complete algorithm for solving A*x=b:

```   1) Factorize A = L*U.
2) Solve L*y = b for y
by forward substitution.
3) Solve U*x = y for x
by backward substitution.
```
Then x is the solution we seek because A*x = L*(U*x) = L*y = b.

Let us compute the serial and PRAM complexity of this algorithm. From the matlab code, it is easy to see that the number of floating point operations performed (serial complexity) is

``` sum_{i=1}^{n-1}
[ (n-i)
... cost of A(i+1:n,i+1:n)/A(i,i),
+ 2*(n-i)^2 ]
... cost of A(i+1:n,i+1:n) -
...         A(i+1:n,i)*A(i,i+1:n).
= (2/3)*n^3 + O(n^2)
```
Solving L*y=b and U*x=b each only cost n^2, so the total cost is still (2/3)*n^3 + O(n^2).

Assuming we have one processor per matrix entry, and that communication is free (the PRAM model), the cost is

``` sum_{i=1}^{n-1}
[ 1
... cost of A(i+1:n,i+1:n)/A(i,i)
+ 2 ]
... cost of A(i+1:n,i+1:n) -
... A(i+1:n,i)*A(i,i+1:n)
= 3*(n-1)
```
since all the divisions in the inner loop can be done at once, all the multiplications can be done at once, and all the subtractions can be done at once. Thus, a lot of parallelism is available in this algorithm for us to exploit. One can also confirm that solving L*y=b and U*x = y in straightforward ways:
``` ... in practice, y is overwritten on b
y(1:n) = b(1:n)
for i = 1 to n-1
y(i+1:n) = y(i+1:n) - y(i)*L(i+1:n,i)

... in practice, x is overwritten on y,
... i.e. on b
x(1:n) = y(1:n)
for i = n downto 1
x(i) = x(i)/U(i,i)
x(1:i-1) = x(1:i-1) - x(i)*U(1:i-1,i)
```
costs (1+2*(n-1)) + (1+n+2*(n-1)) = 5n-2, for a grand total of 8*n-5. Interestingly, in the PRAM model, solving triangular systems with the most straigthtforward algorithms is about as expensive as LU factorization, whereas triangular system solving was much cheaper on a serial machine.

Here are some obvious problems with this algorithm, which we need to address:

• If A(i,i) is zero, the algorithm cannot proceed. If A(i,i) is tiny, we will also have numerical problems.
• The majority of the work is done by calls to the BLAS2 (rank-one update), which does not exploit a memory hierarchy at well as a BLAS3 operation like matrix-matrix multiplication (see Lecture 2).
• First we will deal with zero or tiny A(i,i) (by pivoting), and then show how to convert BLAS2 to BLAS3.

Pivoting to avoid tiny A(i,i)

It is obvious why the algorithm fails when some A(i,i) is exactly zero. Sometimes failure is appropriate: if the entire first column of the matrix is zero, for example, the matrix is singular and the user deserves a warning. On the other hand, if the matrix is perfectly nonsingular, such as
```     A= [ 0 1 ]
[ 1 0 ]
```
then a warning is not deserved.

Even if A(i,i) is not exactly zero, but merely tiny, trouble is possible because of roundoff errors in floating point arithmetic. We illustrate with a 2-by-2 example. Suppose for simplicity we compute in 3-decimal digit arithmetic, so it is easy to see what happens. Let us solve Ax=b where

```     A= [ 1e-4  1 ]  and  b= [ 1 ]  .
[  1    1 ]          [ 2 ]
```
The correct answer to 3 decimal places is easily confirmed to be
```     x = [ 1 ]  .
[ 1 ]
```
The result of LU decomposition on A is as follows, where fl(a op b) is the floating point result of a op b:
```   L = [    1        0 ]  =  [  1   0 ]
[ fl(1/1e-4)  1 ]     [ 1e4  1 ]
... no error yet
and
U = [ 1e-4      1      ] = [ 1e-4   1  ]
[  0   fl(1-1e4*1) ]   [  0   -1e4 ]
... error in the 4th place
```
The only rounding error committed is replacing 1-1e4*1 by -1e4, an error in the fourth decimal place. Let's see how close L*U is to A:
```     L * U = [  1e-4   1  ]
[   1     0  ]
```
The (2,2) entry is completely wrong. If we continue to try to solve A*x=b, the subsequent solves of L*y=b and U*x=y yield
```   y = [      1      ] = [   1  ]
[ fl(1-1e4*1) ]   [ -1e4 ]
... error in the 4th place
and
x = [ x1 ] = [ fl((1-x2*1)/1e-4) ]
[ x2 ]   [  fl(-1e4/(-1e4))  ]
= [ 0 ]
[ 1 ]
... no error in either component
```
We see that the computed answer is completely wrong, even though there were only two floating errors in the 4th decimal place, and even though the matrix A is "well-conditioned", i.e. changing the entries of A or b by eps<<1 only changes the true solution by O(eps).

The problem can be traced to the first rounding error, U(2,2) = fl(1-1e4*1) = -1e4. Note that if we were to change A(2,2) from 1 to any other number z in the range from -5 to 5, we would get the same value of U(2,2) = fl(z-1e4*1) = -1e4, and so the final computed value of x would be the same, independent of z. In other words, the algorithm "forgets" the value of A(2,2), and so can't possibly get the right answer. This phenomenon is called numerical instability, and must be eliminated to yield a reliable algorithm.

The standard solution to this problem is called partial pivoting, which means reordering the rows of A so that A(i,i) is large at each step of the algorithm. In particular, at the beginning of step i of the algorithm, row i is swapped with row k>i if |A(k,i)| is the largest entry among |A(i:n,i)|. This yields the following algorithm.

```  Gaussian elimination with Partial Pivoting,
BLAS2 implementation

for i = 1 to n-1
find and record k where
|A(k,i)| = max_{i<=j<=n} |A(j,i)|
if |A(k,i)|=0, exit with a warning
that A is singular, or nearly so
if i != k, swap rows i and k of A
A(i+1:n, i) = A(i+1:n, i) / A(i,i)
...  each quotient lies in [-1,1]
A(i+1:n, i+1:n) = A(i+1:n, i+1:n) -
A(i+1:n, i)*A(i, i+1:n)
```

If we apply this algorithm to our 2-by-2 example, we get a very accurate answer.

We may describe this algorithm as computing the factorization P*A=L*U, where L and U are lower and upper triangular as before, and P is a permutation matrix, i.e. the identity matrix with permuted columns. P*A is the matrix A with its rows permuted. As a consequence of pivoting, each entry of L has absolute value at most 1. Using this new factorization of A, solving A*x=b only requires the additional step of permuting b according to P.

Note that we have several choices as to when to swap rows i and k. Indeed, we could use indirect addressing and not swap them at all, but then we couldn't use the BLAS, so this would be slow. Different implementations make different choices about when to swap.

How to Use BLAS3 in Gaussian Elimination with Partial Pivoting

Most of the work in the algorithm occurs in the rank-one update in its last line, a BLAS2 operation which does 2*m operations on about m data items (where m=(n-i)^2). As we know from Lecture 2, a BLAS3 operation like matrix-matrix multiplication can run much faster than BLAS2, so our goal is to show how to reorganize Gaussian elimination to use matrix-matrix multiplication to do most of its work. A standard trick for changing BLAS2 into BLAS3 in linear algebra codes is delayed updating. (In the linear algebra literature, it is also called blocking, but that is a heavily overloaded term in parallel computing!) This means saving up a sequence of BLAS2 operations, and applying them all at once using BLAS3.

For LU decomposition, this means that we will process the matrix in blocks of b columns at a time, rather than one column at a time. b is called the block size. We will do a complete LU decomposition just of the b columns in the current block, essentially using the above BLAS2 code. Then we will update the remainder of the matrix doing b rank-one updates all at once, which turns out to be a single matrix-matrix multiplication, where the common matrix dimension is b. b will be chosen in a machine dependent way to maximize performance. A good value of b will have the following properties:

• b is small enough so that the b columns currently being LU-factorized fit in the fast memory (cache, say) of the machine.
• b is large enough to make matrix-matrix multiplication fast.
• ``` Gaussian elimination with Partial Pivoting,
BLAS3 implementation

... process matrix b columns at a time
for ib = 1 to n-1 step b
... point to end of block of b columns
end = min(ib+b-1,n)

... LU factorize A(ib:n,ib:end) with BLAS2
for i = ib to end
find and record k where
|A(k,i)| = max_{i<=j<=n} |A(j,i)|
if |A(k,i)|=0, exit with a warning
that A is singular, or nearly so
if i != k, swap rows i and k of A
A(i+1:n, i) = A(i+1:n, i) / A(i,i)
... only update columns i+1 to end
A(i+1:n, i+1:end) = A(i+1:n, i+1:end)
- A(i+1:n, i)*A(i, i+1:end)
endfor

... Let LL be the b-by-b lower triangular
... matrix whose subdiagonal entries are
... stored in A(ib:end,ib:end), and with
... 1s on the diagonal. Do delayed update
... of A(ib:end, end+1:n) by solving
... n-end triangular systems
... (A(ib:end, end+1:n) is pink below)
A(ib:end, end+1:n) =
LL \ A(ib:end, end+1:n)

... do delayed update of rest of matrix
... using matrix-matrix multiplication
... (A(end+1:n, end+1:n) is green below)
... (A(end+1:n, ib:end) is blue below)
A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
- A(end+1:n,ib:end)*A(ib(end,end+1:n)

endfor
```

The LU factorization of A(ib:n,ib:end) uses the same algorithm as before. Solving a system of n-end equations with triangular coefficient matrix LL is a single call to a BLAS3 subroutine (strsm) designed for that purpose. No work or data motion is required to refer to LL; it is done with a pointer. When n>>b, almost all the work is done in the final line, which multiplies an (n-end)-by-b matrix times a b-by-(n-end) matrix in a single BLAS3 call (to sgemm). Here is a picture, where the final line subtracts the product of the blue and pink submatrices from the green submatrix.

Here is another way to derive this algorithm. Suppose that we are at the top of the ib loop, midway through the algorithm. Then implicitly we have computed the following factorization, where all the submatrices are the same size as in the above figure (for simplicity we ignore pivoting):

```        ib-1   b  n-b-ib
ib-1  [ A11  A12  A13  ]
b    [ A21  A22  A23  ]
n-b-ib [ A31  A32  A33  ]

[ L11  0  0 ]   [ U11 U12  U13  ]
= [ L21  I  0 ] * [  0  A'22 A'23 ]
[ L31  0  I ]   [  0  A'32 A'33 ]
```
(all matrices are partitioned conformally). Once we finish the i-loop, which LU factorizes A(ib:n,ib:end), we have updated this factorization to
```    [ A11  A12  A13  ]
[ A21  A22  A23  ]
[ A31  A32  A33  ]

[ L11  0  0 ]   [ U11   U12   U13  ]
= [ L21  I  0 ] * [  0  L22*U22 A'23 ]
[ L31  0  I ]   [  0  L32*U22 A'33 ]
```
Following the triangular solve with LL, we have further updated this factorization to
```    [ A11  A12  A13  ]
[ A21  A22  A23  ]
[ A31  A32  A33  ]

[ L11  0  0 ]   [ U11   U12     U13   ]
= [ L21  I  0 ] * [  0  L22*U22 L22*U23 ]
[ L31  0  I ]   [  0  L32*U22   A'33  ]
```
The final matrix multiply updates this to
```    [ A11  A12  A13  ]
[ A21  A22  A23  ]
[ A31  A32  A33  ]

[ L11  0  0 ]
= [ L21 L22 0 ]
[ L31 L32 I ]

[ U11  U12       U13        ]
* [  0   U22       U23        ]
[  0    0   A'33 - L32*U23  ]

[ L11  0  0 ]   [ U11  U12  U13   ]
= [ L21 L22 0 ] * [  0   U22  U23   ]
[ L31 L32 I ]   [  0    0   A''33 ]
```
At this point, we are done with the outer loop, and have computed another b columns of L and b rows of U. The algorithm continues by processing the next b columns of A''33.

The LAPACK source for this routine, sgetrf, may be viewed by clicking here.

Performance of LAPACK LU Decomposition

We discuss the performance of the LAPACK LU decomposition routine, as well as some other routines. The first plot below shows the performance of matrix-matrix multiplication, matrix-vector multiplication, and LU decomposition on the Cray C90, both on 1 processor (dotted lines) and on 16 processors (solid lines). The speed of each routine divided by peak machine speed (efficiency) is shown. Matrix-matrix multiply is fastest, running at over 90% of peak for large matrices, followed by matrix-vector multiplication, and finally LU decomposition. The most important feature of this plot is that efficiency increases with increasing dimension. This is a common phenomenon for matrix-matrix multiply based algorithms as we discussed in Lecture 2. The second important feature is that efficiency is higher on one processor than 16 processors. This happens because keeping the matrix dimension fixed and increasing the number of processors means there is less work to do per processor.

Now we fix the dimension at N=1000 and compare the efficiencies on a variety of architectures, from large vector processors like the Cray C90 and Convex C4640, to shared-memory multiple processor versions of both machines, to workstations like the DEC Alpha, IBM RS6000/590, and SGI Power Challenge. The vital statistics of these machines are given in the following table. All results are for 64-bit precision.

```Machine          #Procs  Clock    Peak  Block
Speed  Mflops Size b
(MHz)
---------------------------------------------
Convex C4640          1    135     810     64
Convex C4640          4    135    3240     64
Cray C90              1    240     952    128
Cray C90             16    240   15238    128
DEC Alpha 3000-500X   1    200     200     32
IBM RS 6000/590       1     66     264     64
SGI Power Challenge   1     75     300     64
```
The figure below plots several efficiency measures for the machines in the table. There is one column in the plot for each machine. First consider the green line, which measures the fraction of peak performance attained by 1000-by-1000 matrix-multiplication. We use the best matrix-multiplication available for the machine, typically supplied by the vendor. For all machines except the DEC Alpha 3000-500X, efficiency exceeds 90%.

Now consider the red line, which is the speed of LU decomposition divided by the speed of matrix-multiply. The block size b is given in the above table; it was chosen by brute force search to optimize the performance. This curve measures how well we achieved our goal of making LU decomposition as fast as the matrix-multiply in its inner loop. The best efficiency, 95%, is attained on a single processor Cray C90, and the worst efficiency, 50%, is attained on a 16 processor Cray C90. This drop in performance is due to the fact that multiplying (n-i)-by-128 times 128-by-(n-i) matrices does not yield enough work to keep all 16 processors busy, although it is enough for 1 processor. The workstations (Alpha, RS6000, SGI) are all close to 70% efficient, which is reasonably satisfactory.

Finally, the blue line measures the efficiency of our LU decomposition with respect to "best effort", which means a version of LU decomposition crafted by each vendor especially for its machine. These vendor-supplied codes use very similar techniques to the ones described above, but try to improve performance by using a variable block size, assembly language, or an algorithm corresponding to a different permutation of the 3 nested loops in the original algorithm (all 3!=6 permutations lead to viable algorithms). The blue curve measures the extent to which we have sacrificed performance to gain portability: We have not lost much on the large vector machines, but 25%-30% on the workstations. The vendors typically invest a large amount of effort in optimizing 1000-by-1000 LU because it is one of the most popular benchmarks used to compare high performance machines: the LINPACK Benchmark. It would be nice if vendors had the resources to highly optimize all the library routines one might want to use, but few do, so portable high-performance libraries remain of great interest.

The LINPACK Benchmark is named after the predecessor of LAPACK. It originally consisted just of timings for 100-by-100 matrices, solved using a fixed Fortran program (sgefa) from the LINPACK linear algebra library. The data is an interesting historical record, with literally every machine for the last 2 decades listed in decreasing order of speed, from the largest supercomputers to a hand-held calculator. As machines grew faster and vendors continued to optimize their LINPACK benchmark performance (sometimes by using special compiler flags that worked only for the LINPACK benchmark!), the timings for 100-by-100 matrices grew less and less indicative of typical machine performance. Then 1000-by-1000 matrices were introduced. In contrast to the 100-by-100 case, which required vendors to use identical Fortran code, vendors were permitted to modify Gaussian elimination anyway they wanted for the 1000-by-1000 case, provided they got an accurate enough answer. Finally, a third benchmark was added for large parallel machines, which measured their speed on the largest linear system that would fit in memory, as well as the size of the system required to get half the Mflop rate of the largest matrix.

Here is a plot analogous to the last one but for 100-by-100 matrices. Generally speaking, efficiencies are lower for n=100 than n=1000, dramatically so for large machines like the Cray, less so for workstations. Rather than comparing LAPACK performance to the ``best effort'', it is compared to the LINPACK benchmark discussed above, i.e. performance using the LINPACK benchmark code sgefa. This ratio is as high as 2.46 for, but is as low as .73, a testimony to the efforts of the manufacturers to tune their compilers to work well on sgefa.

For an analysis and summary of the performance of other LAPACK routines besides LU, click here.

How to Layout Matrices on Distributed Memory Machines

Now we consider data layout of matrices on distributed memory machines, with the goal of making Gaussian elimination as efficient as possible. We will discuss a sequence of data layouts, starting with the most simple, obvious, and inefficient one, and work our way up to the more complicated but efficient one ultimately used. Even though our justification is based on Gaussian elimination, analysis of many other algorithms has led to the same set of layouts. As a result, these layouts have been standardized as part of the High Performance Fortran standard, with corresponding data declarations as part of that language.

The two main issues in choosing a data layout for Gaussian elimination are

• load balance, or splitting the work reasonably evenly among the processors throughout the algorithm, and
• ability to use the BLAS3 during computations on a single processor, to account for the memory hierarchy on each processor.
• It will help to remember the pictorial representation of Gaussian elimination below. As the algorithm proceeds, it works on successively smaller square southeast corners of the matrix. In addition, there is extra BLAS2 work to factorize the submatrix A(ib:n,ib:end).

For convenience we will number the processors from 0 to p-1, and matrix columns (or rows) from 0 to n-1. The following figure shows a sequence of data layouts we will consider. In all cases, each submatrix is labeled with the number of the processor (from 0 to 3) that contains it. Processor 0 owns the shaded submatrices.

Consider the first layout, the Column Blocked Layout. In this layout, column i is stored on processor floor(i/c) where c=ceiling(n/p) is the maximum number of columns stored per processor. In the figure n=16 and p=4. This layout does not permit good load balancing, because as soon as the first c columns are complete, processor 0 is idle for the rest of the computation. The transpose of the this layout, the Row Blocked Layout, has a similar problem.

The second layout, the Column Cyclic Layout, tried to address this problem by assigning column i to processor i mod p. In the figure, n=16 and p=4. With this layout, each processor owns approximately 1/p-th of the square southeast corner of the matrix, so the load balance is good. However, the fact that single columns are stored rather than blocks means we cannot use the BLAS2 to factorize A(ib:n,ib:end), and may not be able to use the BLAS3 to update A(end+1:n,end+1:n). The transpose of this layout, the Row Cyclic Layout, has a similar problem.

The third layout, the Column Block Cyclic Layout, is a compromise between the last two. We choose a block size b, divide the columns into groups of size b, and distribute these groups in a cyclic manner. This means column i is stored in processor (floor(i/b)) mod p. In fact, this layout includes the first two as the special cases b=c=ceiling(n/p) and b=1, respectively. In the figure n=16, p=4 and b=2. For b larger than 1, this has a slightly worse balance than the Column Cyclic Layout, but can use the BLAS2 and BLAS3. For b less than c, it has a better load balance than the Columns Blocked Layout, but can only call the BLAS on smaller subproblems, and so perhaps take less advantage of the local memory hierarchy (recall our plot of performance of the BLAS3, where performance improved for larger matrix sizes). However, this layout has the disadvantage that the factorization of A(ib:n,ib:end) will take place on perhaps just on processor (in the natural situation where column blocks in the layout correspond to column blocks in Gaussian elimination). This would be a serial bottleneck.

This serial bottleneck is eased by the fourth layout in the figure, the Row and Column (or 2D) Block Cyclic Layout. Here we think of our p processors arranged in an prow-by-pcol rectangular array of processors, indexed in a 2D fashion by (pi,pj), 0 <= pi < prow and 0 <= pj < pcol. All the processors (i,j) with a fixed j are referred to as processor column j. All the processors (i,j) with a fixed i are referred to as processor row i. Matrix entry (i,j) is mapped to processor (pi,pj) by mapping i to pi and j to pj independently using the formula for a block cyclic layout:

```    pi = floor(i/brow) mod prow, where
brow = block size in the row direction

pj = floor(j/bcol) mod pcol, where
bcol = block size in the column direction
```
Thus, this layout includes all the previous layouts, and their transposes, as special cases. In the figure, n=16, p=4, prow=pcol=2, and brow=bcol=2. This layout permits pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol. This is the layout that we will use for Gaussian elimination.

Since Gaussian elimination is not entirely "symmetric" (there is more communication required during the BLAS2 factorization of A(ib:n,ib:end) than during the BLAS3 update of A(ib:end,end+1:n)), it is not clear that choosing an entirely symmetric layout (prow=pcol) is most efficient. Indeed, later we will see that choosing prow There is one more layout shown, the Block Skewed Layout, which is not a special case of 2D Block Cyclic Layout. This layout has feature that each row and each column is shared among all p processors, so p-fold parallelism is available for any row operation or any column operation. In contrast, the 2D block cyclic layout can have at most sqrt(p)-fold parallelism in all the rows and all the columns. The Block Skewed Layout is not useful for Gaussian elimination, but can be useful in a variety of other matrix operations, so we mention it here.

Distributed Gaussian elimination with a 2D Block Cyclic Layout

The following figure shows how the BLAS3 algorithm is performed on a 2D block cyclic layout. The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b=brow=bcol. Shaded regions indicate busy processors or communication performed.

For example, step (1) requires a reduction operation among the prow processors owning the current column, in order to find the largest absolute entry (pivot) and broadcast its index. Step (2) requires communication among processors in that column to exchange rows, and broadcast the pivot row to all processors in the column. Steps (3) and (4) are local computation, BLAS1 and BLAS2, respectively. Steps (5) and (6) involve communication, broadcasting pivot information and swapping rows among all the other p-prow processors. Step (7) is more communication, sending LL to all pcol processors in its row. Step (8) is local computation. Steps (9) and (10) are communication, sending matrix blocks right and down respectively, in preparation for the local matrix multiplies that update the trailing southeast square matrix in step (11).

It is unnecessary to have a barrier between each step of the algorithm. In other words, there can be parallelism with some steps of the algorithm forming a pipeline. For example, consider steps 9, 10 and 11, where most of the communication and floating point work occurs. As soon as a processor received its desired (blue) subblock of A(end+1:n,ib:end) from the left, and (pink) subblock of A(ib:end:end+1:n) from above, it may begin its local matrix multiply to update its part of the (green) submatrix A(end+1:n,end+1:n). As soon as the leftmost b columns of A(end+1:n,end+1:n) are updated, their LU factorization may begin while the remaining columns of the green submatrix are being updated by other processors.

The ScaLAPACK software which implements this is located here. To view the top level routine that performs Gaussian elimination, pdgetrf.f, click here.

Modeling the Performance of Distributed Gaussian Elimination

We will use the performance model introduced in Lecture 9:
• One floating point operation at top speed (i.e. the speed of matrix multiplication) costs one time unit.
• Sending a message of n words from one processor to another costs alpha + beta*n time units.
• We assume no collective communications like broadcasts are available.
• We have p processor arranged in a prow-by-pcol grid.
• b is the block size in the 2D block cyclic layout.
• n is the matrix dimension.
• We will express the running time as a sum:
```   Time =   msgs * alpha
+ words * beta + flops   time units
```
where msgs is the number of messages sent (not counting those sent in parallel), words is the number of words sent (again not counting those sent in parallel), and flops is the number of floating point operations performed (again, not counting those in parallel). To make the presentation simple, we will only look at steps 9, 10 and 11 in detail. It turns out these account for almost all of words and flops, which are the most important terms for large n.

The implementation we discuss will overlap steps 10 and 11 of the algorithm as described above. More precisely,

• In step 9, we will use a tree-based broadcast in each processor column. with the first processor sending to two others, each of these two sending to 2 others, and so on, so that a total of log_2 prow message sending steps are required in each processor column. Since the size of a message is b*(n-end)/pcol, the total number of time units for step 9 is
```    Time for step 9 =
( log_2 prow ) *
( alpha + (b*(n-end)/pcol)*beta )
time units
```
• In step 10, we will use a ring based broadcast, with each processor communicating to its neighbor on the right, so that a total of pcol message sending steps are needed. Since the size of a message is b*(n-end)/prow, this would appear to take
```    ( pcol ) *
( alpha + (b*(n-end)/prow)*beta )
time units.
```
However, since only the leftmost processor column in the green submatrix needs to receive its message, pass it on, and update its submatrices, before proceeding to LU factorize the next block column, we only charge
```    Time for step 10 =
2 * ( alpha + (b*(n-end)/prow)*beta )
time units.
```
• In step 11, each processor needs to do a local matrix multiply of a (n-end)/pcol-by-b matrix by a b-by-(n-end)/prow matrix, which costs
```    Time for step 11 = 2*b*(n-end)^2/p
time units.
```
• To estimate the total cost for Distributed Gaussian Elimination, we need to sum the above three contributions for end = b, 2*b, 3*b, ... , n-b to get
``` Time for all steps 9, 10 11 =
( ( n * (log_2 prow) + 2 ) / b )
* alpha
+ ( n^2 *
( (log_2 prow)/(2*pcol) + 1/prow )
) * beta
+ ( (2/3)*n^3/p )
```
Taking the other steps of algorithm into account increases the coefficients of alpha and beta, yielding
``` Time for distributed Gaussian elimination =
( n * (6+ log_2 prow) ) * alpha
+ ( n^2 *
( 2*(prow-1)/p +
(pcol-1)/p +
(log_2 prow)/(2*pcol)
)
) * beta
+ ( (2/3)*n^3/p )
```
We compute the efficiency by dividing the serial time, (2/3)*n^3, by p times the Time just displayed, yielding
```  Efficiency = 1 / (  1
+ (1.5*p*(6 + log_2 prow)/n^2)
* alpha
+ ( 1.5*
( 2*prow + pcol
+ (prow * log_2 prow)/2 -3 )
/n )
* beta )
```
Let us examine this formula, and ask when it is likely to be close to 1, i.e. when we can expect good efficiency. First, if communication were free, so alpha=0 and beta=0, efficiency would be 1. This means we have successfully balanced the load (modulo the lower order terms we are ignoring). Then, for fixed p, prow, pcol, alpha and beta, the efficiency approaches 1 as n grows. This is because we are doing O(n^3) floating point operations, but communicating only O(n^2) words, so eventually the floating point, which is load balanced, swamps the communication, so efficiency is good. The efficiency also grows if alpha and beta shrink, i.e. as communication becomes less expensive.

Now we consider the choice of prow and pcol. The expression

```   2*prow + pcol + (prow * log_2 prow)/2
= 2*prow +p/prow + (prow * log_2 prow)/2
```
is minimized when prow is slightly smaller than sqrt(p), meaning we want a prow-by-pcol processor grid which is slightly longer than it is tall. (In practice, for a given p, we are limited to choosing prow and pcol among the factors of p, unless we want to have idle processors.)

Overall, assuming prow ~ pcol ~ sqrt(p), and ignoring the log terms, the efficiency is

```  Efficiency = 1 / ( 1 + O( p/n^2 * alpha ) +
O( sqrt(p)/n * beta ) )
= E( n^2/p )
```
i.e. an increasing function of n^2/p. But n^2/p is the amount of data stored per processor, since an n-by-n matrix takes up n^2 words. In other words, if we let the "total problem size" n^2 grow proportionally to the number of processor, we expect the efficiency to be approximately constant. Let us confirm this hypothesis with an experiment.

We will use the Intel Gamma, which has the following machine parameters:

• Sending a message costs 100 + 3.14n microseconds, where n is the number of 64-bit words.
• Matrix-multiply runs at 35 Megaflops. Thus, alpha = 100/(1/35) = 3500, and beta = 3.14/(1/35) = 78.4.
• There are 128 processors. We will run timings for the following configurations:
```        p    prow     pcol
---    ----     ----
1      1        1
16      4        4
32      4        8
64      8        8
128      8       16
```
Memory per processor will be taken as 8*n^2/p bytes = 2, 3, 4, 5 and 6.25 Megabytes.
• Click here for the predicted run times on the Gamma, and click here for the actual run times; they are satisfyingly close together, and confirm that the efficiency is to first order simply an increasing function of n^2/p. The predicted times are labelled by the memory used per processor in Megabytes, ranging from 2 to 6.25.

Software Structure of ScaLAPACK

A great deal of software infrastructure is needed to implement the ScaLAPACK algorithms described above. As we have said in previous lectures, the design goal is to build software layers that hide the details of lower levels, and so make it simple to write (and debug and maintain) software at higher levels. In this section we will describe the layers inside ScaLAPACK. They are depicted in the figure below:

Below the dotted line is local software, and above the dotted line is global software. Local software either only runs on one processor, and is not parallel, or refers to other processors explicitly to do communication. Global software refers to data structures, and operations on them, that are spread out over the whole parallel machine (or selected subsets).

We have already discussed the BLAS, or Basic Linear Algebra Subroutines, at length in Lecture 2, and will not dwell on them here. The use of the BLAS, especially BLAS3, in LAPACK is discussed above. Neither the BLAS nor LAPACK are explicitly parallel.

(We note that on shared memory parallel machines, it is possible to hide all the parallelism within the BLAS, by doing matrix-matrix multiplication in parallel for example, and so to run LAPACK in parallel without changing any software. The extra complexities of ScaLAPACK are due to the lack of a shared address space. Also, there are some LAPACK routines for eigenvalue problems, whose ScaLAPACK counterparts use different algorithms. We discuss these in other lectures.)

The BLACS are the Basic Linear Algebra Communication Subroutines. These routines are message passing libraries which let the user send, receive, broadcast, reduce and do other basic operations on the matrix data types used in ScaLAPACK algorithms, namely rectangular or trapezoidal matrices, and submatrices of other matrices. The BLACS hide (almost all!) the differences between the underlying message passing libraries (PVM, MPI, MPL for the IBM SP-1 and SP-2, CMMD for the CM-5, and NX for the Intel Paragon). This permits the software using the BLACS to be the same across platforms. The design of the BLACS is discussed in detail in "A User's Guide to the BLACS v1.0," by J. Dongarra and R. C. Whaley, LAPACK Working Note 94, University of Tennessee Report CS-95-286, April 1995.

(There are two caveats to this claim. First, PVM requires a slightly different initialization sequence. Second, there are various tuning parameters to suggest how the BLACS can best use the underlying processor topology (ring, 2D-mesh, tree, etc.).)

The PBLAS are the Parallel BLAS. They implement BLAS operations on matrices distributed in a 2D block-cyclic fashion as described above.

Given all this infrastructure, the high level ScaLAPACK code for Gaussian elimination (pdgetrf.f) has essentially the same structure as the sequential LAPACK code (sgetrf.f). This is not just esthetically pleasing, but essential to be able to understand and maintain the large body of software comprising LAPACK, CLAPACK and SCLAPACK (currently between 2 and 3 million lines, some of it generated automatically from other parts).

In order to appreciate the complexity hidden by this layering, let us briefly examine how 2D block cyclic matrices are represented. internally in the PBLAS. If fact, let us recall how matrices are represented in Fortran on a single sequential machine. Matrices in Fortran are stored columnwise (not rowwise, as in C), which means that the leading dimensions change fastest as we traverse memory. So the simplest representation is

• a pointer to A(1,1) (Fortran starts at 1, whereas C starts at 0).
• the number of rows M, and
• the number of columns N.
• But this is not adequate to refer to a submatrix of another matrix, as the following picture shows, because a submatrix of another matrix does not necessarily occupy consecutive memory locations, as the following figure shows:
So to represent the 4-by-3 submatrix of the larger 11-by-8 matrix in the above figure we need
• a pointer to A(5,4), the upper left corner of the submatrix,
• the number of rows, M=4,
• the number of columns, N=3,
• the leading dimension, LDA=11, of the original matrix A.
• Thus, if B is the 4-by-3 submatrix, we can find the location of B(i,j) by the formula
``` loc(B(i,j)) = loc(A(5,4)) + (i-1) + (j-1)*LDA
```

Yet more information is needed by the PBLAS to represent (an arbitrary submatrix) of a 2D block cyclic matrix spread out over an PROW-by-PCOL grid of processors. Indeed, each matrix has a 9-element descriptor containing

1. TYPE (2D block cylic or 1D)
2. Number of rows M
3. Number of column N
4. Row block size MB
5. Column block size NB
6. Processor row 0 <= RSRC < PROW in which A(1,1) lies
7. Processor column 0 <= CSRC < PCOL in which A(1,1) lies
8. BLACS context
9. Leading dimension LDA of local matrix containing A
The BLACS context is used to distinguish messages used to operate on A from other messages sent to operate on some other matrix B. (Recall that message passing libraries include "message types" in messages in order to distinguish among messages from different senders.) This is needed to keep different operations from interfering with one another. Contexts are provided by calling BLACS routines like BLACS_GRIDINIT when constructing A in the first place.

To refer to a submatrix of such a matrix, we separately need the indices I and J of its upper left corner A(I,J), the number of its rows, and the number of its columns.

For details on the design of the PBLAS, see

• "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, LAPACK Working Note 100, University of Tennessee Report CS-95-292, May 1995,
• "PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers," by Jaeyoung Choi, Jack J. Dongarra, and David W. Walker LAPACK Working Note 57, University of Tennessee Report CS-93-187, May 1993, and
• "SUMMA: Scalable Universal Matrix Multiplication Algorithm," by R. A. van de Geijn and J. Watts, LAPACK Working Note 96, University of Tennessee Report CS-95-286, April 1995.
• Getting software and reports with the WWW

One of the major practical uses of the WWW is accessing public-domain software and technical reports. We will discuss one repository of numerical software and related reports, called Netlib, which contains many libraries besides the LAPACK and ScaLAPACK libraries discussed above. If you clone this window and click here, you will get a one-page listing of all the libraries available within netlib. Here is a short guided tour.
• LAPACK, its C version CLAPACK, and ScaLAPACK were discussed above. All source code can be down-loaded into your own directory by appropriate pointing and clicking.
• There are also many more specialized libraries of software for doing linear algebra, such as for sparse matrices (sparse, sparse-blas, sparspak), solving ordinary differential equations (ode, odepack), partial differential equations (pltmg, bihar, fishpack, pdes), numerical integration (quadpack), Fast Fourier Transforms, or FFTs (fftpack, vfftpk), curve fitting (fitpack), splines (pppack, dierckx), statistics (gcv), optimization (opt), and so on. A one-line description of each library in Netlib is available here.
• There is also a lot of systems software of various kinds. pvm, pvm3, and hence are portable message passing systems for parallel computing. paragraph helps visualize message traffic in a parallel computer. f2c translates program written in Fortran to C. bmp is a Fortran multiple precision package (a more modern one called mpfun is available from dbailey@nas.nasa.gov). jakef is a precompiler which takes a set of Fortran subroutines computing a (multivariate) function, and automatically produces another set of subroutines computing the derivative of the function. This is useful for optimization problems (a more modern precompiler call adifor is available from bischof@mcs.anl.gov). paranoia is W. Kahan's program for finding bugs in floating point arithmetic.
• Benchmarks of various kinds are available, both source code and tech reports (in postscript) of performance results. There are "golden oldies" like Whetstones and Livermore Loops, more recent ones like the Linpack Benchmark (Gaussian elimination), and ones still under construction, like parkbench, which is intended to contain real applications running on distributed memory parallel computers, written either using message passing or in High Performance Fortran.
• The Conferences Database is a list of conference in parallel computing and numerical analysis.
• Since netlib is so large, a keyword-style search facility is available to help find things. A more ambitious search facility called GAMS (Guide to Available Mathematical Software) is available here. GAMS classifies "all" mathematical problems into a tree. By specializing the problem and so traversing the tree, one is eventually led to software to solve it. There is also a book of templates (this is almost a megabyte, and so takes a while to download), which try to offer numerical advice on which algorithm to use for solving sparse systems of linear equations (see below). More such advice facilities are planned for the future.
• To attest to the popularity of this service, usage statistics are also available. During the first two months of 1996, there were over 17000 requests per day.