The two main issues in choosing a data layout for Gaussian elimination are

It will help to remember the pictorial representation of Gaussian elimination below. As the algorithm proceeds, it works on successively smaller square southeast corners of the matrix. In addition, there is extra BLAS2 work to factorize the submatrix A(ib:n,ib:end).

For convenience we will number the processors from 0 to p-1, and matrix columns (or rows) from 0 to n-1. The following figure shows a sequence of data layouts we will consider. In all cases, each submatrix is labeled with the number of the processor (from 0 to 3) that contains it. Processor 0 owns the shaded submatrices.

Consider the first layout, the *Column Blocked Layout*.
In this layout, column
i is stored on processor floor(i/c) where c=ceiling(n/p) is the maximum
number of columns stored per processor. In the figure n=16 and p=4.
This layout does not permit good load balancing, because as soon
as the first c columns are complete, processor 0 is idle for the rest of
the computation. The transpose of the this layout, the Row Blocked Layout,
has a similar problem.

The second layout, the *Column Cyclic Layout*,
tried to address this problem
by assigning column i to processor i mod p. In the figure, n=16 and p=4.
With this layout, each processor owns approximately 1/p-th of the square
southeast corner of the matrix, so the load balance is good. However, the
fact that single columns are stored rather than blocks means we cannot use
the BLAS2 to factorize A(ib:n,ib:end), and may not be able to use the BLAS3
to update A(end+1:n,end+1:n). The transpose of this layout, the Row Cyclic
Layout, has a similar problem.

The third layout, the *Column Block Cyclic Layout*, is a compromise between
the last two. We choose a block size b, divide the columns into groups of
size b, and distribute these groups in a cyclic manner. This means column i
is stored in processor (floor(i/b)) mod p. In fact, this layout includes
the first two as the special cases b=c=ceiling(n/p) and b=1, respectively.
In the figure n=16, p=4 and b=2.
For b larger than 1, this has a slightly worse balance than the Column
Cyclic Layout, but can use the BLAS2 and BLAS3. For b less than c,
it has a better load balance than the Columns Blocked Layout, but can only
call the BLAS on smaller subproblems, and so perhaps take less advantage
of the local memory hierarchy (recall our plot of performance of the BLAS3,
where performance improved for larger matrix sizes). However, this layout
has the disadvantage that the factorization of A(ib:n,ib:end) will take
place on perhaps just on processor (in the natural situation where
column blocks in the layout correspond to column blocks in Gaussian
elimination). This would be a serial bottleneck.

This serial bottleneck is eased by the fourth layout in the figure, the
*Row and Column (or 2D) Block Cyclic Layout*.
Here we think of our p processors
arranged in an prow-by-pcol rectangular array of processors, indexed in
a 2D fashion by (pi,pj), 0 <= pi < prow and 0 <= pj < pcol.
All the processors (i,j) with a fixed j are referred to as
*processor column j*.
All the processors (i,j) with a fixed i are referred to as
*processor row i*.
Matrix entry
(i,j) is mapped to processor (pi,pj) by mapping i to pi and j to pj
independently using the formula for a block cyclic layout:

pi = floor(i/brow) mod prow, where brow = block size in the row direction pj = floor(j/bcol) mod pcol, where bcol = block size in the column directionThus, this layout includes all the previous layouts, and their transposes, as special cases. In the figure, n=16, p=4, prow=pcol=2, and brow=bcol=2. This layout permits pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol. This is the layout that we will use for Gaussian elimination.

Since Gaussian elimination is not
entirely "symmetric" (there is more communication required during
the BLAS2 factorization of A(ib:n,ib:end) than during the BLAS3 update
of A(ib:end,end+1:n)), it is not clear that choosing an entirely
symmetric layout (prow=pcol) is most efficient. Indeed, later we will
see that choosing prow*Block Skewed Layout*, which is
not a special case of 2D Block Cyclic Layout. This layout has feature that
each row and each column is shared among all p processors, so p-fold
parallelism is available for any row operation or any column operation.
In contrast, the 2D block cyclic layout can have at most sqrt(p)-fold
parallelism in all the rows and all the columns. The Block Skewed Layout
is not useful for Gaussian elimination, but can be useful in a variety of
other matrix operations, so we mention it here.

For example, step (1) requires a reduction operation among the prow processors owning the current column, in order to find the largest absolute entry (pivot) and broadcast its index. Step (2) requires communication among processors in that column to exchange rows, and broadcast the pivot row to all processors in the column. Steps (3) and (4) are local computation, BLAS1 and BLAS2, respectively. Steps (5) and (6) involve communication, broadcasting pivot information and swapping rows among all the other p-prow processors. Step (7) is more communication, sending LL to all pcol processors in its row. Step (8) is local computation. Steps (9) and (10) are communication, sending matrix blocks right and down respectively, in preparation for the local matrix multiplies that update the trailing southeast square matrix in step (11).

It is unnecessary to have a barrier between each step of the algorithm. In other words, there can be parallelism with some steps of the algorithm forming a pipeline. For example, consider steps 9, 10 and 11, where most of the communication and floating point work occurs. As soon as a processor received its desired (blue) subblock of A(end+1:n,ib:end) from the left, and (pink) subblock of A(ib:end:end+1:n) from above, it may begin its local matrix multiply to update its part of the (green) submatrix A(end+1:n,end+1:n). As soon as the leftmost b columns of A(end+1:n,end+1:n) are updated, their LU factorization may begin while the remaining columns of the green submatrix are being updated by other processors.

The ScaLAPACK software which implements this is located here. To view the top level routine that performs Gaussian elimination, pdgetrf.f, click here.

Time = msgs * alpha + words * beta + flops time unitswhere msgs is the number of messages sent (not counting those sent in parallel), words is the number of words sent (again not counting those sent in parallel), and flops is the number of floating point operations performed (again, not counting those in parallel). To make the presentation simple, we will only look at steps 9, 10 and 11 in detail. It turns out these account for almost all of words and flops, which are the most important terms for large n.

The implementation we discuss will overlap steps 10 and 11 of the algorithm as described above. More precisely,

To estimate the total cost for Distributed Gaussian Elimination, we need to sum the above three contributions for end = b, 2*b, 3*b, ... , n-b to getTime for all steps 9, 10 11 = ( ( n * (log_2 prow) + 2 ) / b ) * alpha + ( n^2 * ( (log_2 prow)/(2*pcol) + 1/prow ) ) * beta + ( (2/3)*n^3/p )Taking the other steps of algorithm into account increases the coefficients of alpha and beta, yielding

Time for distributed Gaussian elimination = ( n * (6+ log_2 prow) ) * alpha + ( n^2 * ( 2*(prow-1)/p + (pcol-1)/p + (log_2 prow)/(2*pcol) ) ) * beta + ( (2/3)*n^3/p )We compute the efficiency by dividing the serial time, (2/3)*n^3, by p times the Time just displayed, yielding

Efficiency = 1 / ( 1 + (1.5*p*(6 + log_2 prow)/n^2) * alpha + ( 1.5* ( 2*prow + pcol + (prow * log_2 prow)/2 -3 ) /n ) * beta )(ib:end, end+1:n)Let us examine this formula, and ask when it is likely to be close to 1, i.e. when we can expect good efficiency. First, for fixed p, prow, pcol, alpha and beta, the efficiency approaches 1 as n grows. This is because we are doing O(n^3) floating point operations, but communicating only O(n^2) words, so eventually the floating point, which is load balanced, swamps the communication, so efficiency is good. The efficiency also grows if alpha and beta shrink, i.e. as communication becomes less expensive.

Now we consider the choice of prow and pcol. The expression

2*prow + pcol + (prow * log_2 prow)/2 = 2*prow +1/prow + (prow * log_2 prow)is minimized when prow is slightly smaller than sqrt(p), meaning we want a prow-by-pcol processor grid which is slightly longer than it is tall. (In practice, for a given p, we are limited to choosing prow and pcol among the factors of p, unless we want to have idle processors.)

Overall, assuming prow ~ pcol ~ sqrt(p), and ignoring the log terms, the efficiency is

Efficiency = 1 / ( 1 + O( p/n^2 * alpha ) + O( sqrt(p)/n * beta ) ) = E( n^2/p )i.e. an increasing function of n^2/p. But n^2/p is the amount of data stored per processor, since an n-by-n matrix takes up n^2 words. In other words, if we let the "total problem size" n^2 grow proportionally to the number of processor, we expect the efficiency to be approximately constant. Let us confirm this hypothesis with an experiment.

We will use the Intel Gamma, which has the following machine parameters:

Click here for the predicted run times on the Gamma, and click here for the actual run times; they are satisfyingly close together, and confirm that the efficiency is to first order simply an increasing function of n^2/p. The predicted times are labelled by the memory used per processor in Megabytes, ranging from 2 to 6.25.