We will show how to implement matrix multiplication C=C+A*B on several of the communication networks discussed in the last lecture, and develop performance models to predict how long they take. We will see that the performance depends on several factors.
We will begin by examining the 1D Blocked Layout on a bus without broadcast (like a single ethernet), a bus with broadcast, and a ring of processors.
The algorithm depends on the following simple formula:
C(i) = C(i) + A*B(i) = C(i) + sum_{j=0}^{p-1} A(j)*B(j,i)
Since processor i owns C(i) and B(i), but not each A(j) as required by the
formula, the algorithm will have to send each A(j) to each processor.
This is similar to the Sharks and Fish 2 problem, fish moving under gravity,
where each fish has to visit each processor in order to perform the computation.
Our first model of a bus is that at most one processor may send, and at most one may receive at a time. A floating point operation costs 1 time unit, message startup costs alpha time units, and the cost per work is beta time units. In other words, sending a message of n words takes alpha + beta*n time units. We use synchronous send and receive.
Here is the algorithm:
Matrix multiplication with a 1D blocked layout on a bus without broadcast,
with a barrier
C(MYPROC) = C(MYPROC) + A(MYPROC)*B(MYPROC,MYPROC)
for i=0 to p-1
for j=0 to p-1 except i
if ( MYPROC = i ) send A(i) to processor j
if ( MYPROC = j )
receive A(i) from processor i
C(MYPROC) = C(MYPROC) + A(i)*B(i,MYPROC)
endif
barrier()
end for
end for
The cost of the arithmetic in the inner loop is 2*n*(n/p)^2 = 2*n^3/p^2,
and the cost of the communication in the inner loop is alpha + n*(n/p)*beta,
so the overall time is
Time = (p*(p-1)+1)*(2*n^3/p^2) + (p*(p-1))*( alpha + n*(n/p)*beta)
~ 2*n^3 + p^2*alpha + p*n^2*beta
where we have ignored lower order terms in p in the second formula.
This is more than the serial time, 2*n^3, and grows with p rather than
shrinking with p, and so is a rather poor parallel algorithm!
We can hope to do better by removing the barrier in order to overlap computation and communication, and to send A(i) to several processor so they can operate in parallel. Here is the algorithm without a barrier:
Matrix multiplication with a 1D blocked layout on a bus without broadcast,
no barrier
C(MYPROC) = C(MYPROC) + A(MYPROC)*B(MYPROC,MYPROC)
for i=0 to MYPROC-1
receive A(i) from processor i
C(MYPROC) = C(MYPROC) + A(i)*B(i,MYPROC)
end for
for i=0 to p-1 except MYPROC
send A(MYPROC) to processor i
end for
for i=MYPROC+1 to p-1
receive A(i) from processor i
C(MYPROC) = C(MYPROC) + A(i)*B(i,MYPROC)
end for
This program is non-deterministic, in the sense
that the sends and receives in the algorithm without a barrier do not
necessarily have to occur in the same order as the algorithm with a barrier.
For example, the two communications from processor 1 to 3, and from
processor 2 to 0, may occur in either order.
Technically, we say the partial orders imposed on the communication
events by the p programs on each processor do not constitute a
total order.
This race condition is not important because the same results are computed
no matter what the order is.
Let us analyze the performance of this algorithm. Intuitively, if the cost of a communication step,
cm = alpha + n*(n/p)*beta,
is sufficiently smaller than the cost of arithmetic in the inner loop,
ar = 2*n^3/p^2,
then efficiency should be high. Conversely, if communication is comparable
to, or dominates, computation, we expect low efficiency. If we keep p fixed
and let n grow, then since ar grows like n^3 and cm like n^2, we expect
ar to dominate cm for large enough problems, and high efficiency to be
achieved.
Now we will quantify this intuition. Let us draw the time-line for this algorithm, assuming that cm <= ar. In the time-line, a block of time labeled i->j refers to communication from processor i to j and takes time cm, and a block of time labeled jC refers to processor j computing, and takes time ar. For simplicity we assume communications occur in the same order as the program with a barrier, although as stated above this should not matter.
time --->
| 0C |
| 1C |
...
| p-1C |
| 0->1 | 1C |
| 0->2 | 2C |
| 0->3 | 3C |
...
|0->p-1| p-1C |
...
As shown in the diagram, the overall computation forms a pipeline.
To compute the duration of the pipeline, we need to consider whether
there are any "bubbles" in the pipeline, i.e. if the communication
steps cannot happen without intervening delays.
In particular, the next communication to follow 0->p-1 is 1->0. 1->0 will
be able to start without delay after 0->p-1 completes provided processor
1 is not busy, i.e. provided computation 1C has completed. This will be the
case if ar <= (p-2)*cm, in which case the total time is
Time = p*(p-1)*cm + 2*ar
Using the inequality ar <= (p-2)*cm and some algebra yields
2*n^3/p = p*ar < Time <= (p^2+p-4)*cm
The lower bound 2*n^3/p is (nearly) attained when cm equals its lower bound
ar/(p-2). This corresponds to (nearly) perfect speedup, since the serial
time is 2*n^3. Thus, if communication is so fast that cm ~ ar/(p-2),
the fact that the bus is a serial bottleneck does not matter.
If cm is even smaller than ar/(p-2), so there are
"bubbles" in the pipeline, the speedup can improve a little, but not much.
As cm grows, the speedup shrinks. When cm=ar, the running time is (p^2-p+2)*ar ~ 2*n^3, or approximately equal to the serial running time. In other words, parallelism does not help at all. If cm is larger than ar, the running time is slower than the serial algorithm. These possibilities are captured in the efficiency formula
Efficiency = Serial_Time / ( p * Parallel_Time )
= 1 / ( 2/p + (p-1)*cm/ar )
= 1 / ( 2/p + ((p^3-p^2)/n^3)*alpha/2 + ((p^2-p)/n)*beta/2 )
and we are assuming 1/(p-2) <= cm/ar <= 1.
When cm/ar is close to 1/(p-2), efficiency
is near 1. When cm/ar is close to 1, efficiency is nearly 1/p, i.e. there
is no speedup at all.
>From the expression for n/p, we see that cm/ar is small when n>>p, and when
alpha and beta are not too large. One could easily plug in particular values
of n, p, alpha and beta to measure the efficiency for a particular
matrix size on a particular machine.
Here is the algorithm:
Matrix multiplication with a 1D blocked layout on a bus with broadcast
C(MYPROC) = C(MYPROC) + A(MYPROC)*B(MYPROC,MYPROC)
for i=0 to p-1
if ( MYPROC = i )
broadcast A(i)
else
receive A(i) from processor i
endif
C(MYPROC) = C(MYPROC) + A(i)*B(i,MYPROC)
end for
Assuming the same communication model, the time is now
Time = p*2*n^3/p^2 + p*( alpha + n^2/p*beta )
= 2*n^3/p + p*alpha + n^2*beta
so
Efficiency = Serial_Time / ( p * Parallel_Time )
= 1 / ( 1 + cm/ar )
= 1 / ( 1 + (p^2*alpha)/(2*n^3) + (p*beta)/(2*n) )
In contrast to the bus without broadcast, the cm/ar term in the
denominator of the efficiency is p-1 times smaller, and so efficiency is
a much less sensitive function of p.
As before, since we expect alpha>>1 and beta>>1, we require n>>p for
parallelism to be effective. However, since there is p times less communication
time, our lower bound on n/p for effective parallelism is much lower than
for a bus without broadcast, quantifying our intuition that more
communication helps parallelism.
Matrix multiplication with a 1D blocked layout on a ring
copy A(MYPROC) into T
C(MYPROC) = C(MYPROC) + T*B(MYPROC,MYPROC)
for i=1 to p-1
send T to processor MYPROC+1 mod p
receive T from processor MYPROC-1 mod p
C(MYPROC) = C(MYPROC) + T * B( (MYPROC-i) mod p, MYPROC )
end for
Assuming the same communication model, the time is now
Time = p*2*n^3/p^2 + (p-1)*( alpha + n^2/p*beta )
= 2*n^3/p + (p-1)*alpha + (p-1)/p * n^2*beta
and the efficiency is
Efficiency = 1 / ( 1 + (p-1)/p * cm/ar )which is just slightly faster than the time for the broadcast bus.
It is easy to see that this is essentially optimal, for a 1D blocked layout on a bus or ring, because the matrix multiplication formula we use requires every product A(i)*B(i,j) to eventually accumulate on processor j, which entails some data of size n*(n/p) moving from each processor i to each processor j, and this requires at least p-1 messages of size n^2/p.
Since a ring may be embedded in a mesh or hypercube, the same algorithm also works on these networks. But as we will see, much better algorithms are available.
Cannon's matrix multiplication algorithm
for all (i=0 to s-1) ... "skew" A
Left-circular-shift row i of A by i,
so that A(i,j) is overwritten by A(i, (j+i) mod s)
end for
for all (i=0 to s-1) ... "skew" B
Up-circular-shift column i of B by i,
so that B(i,j) is overwritten by B( (i+j) mod s, j)
end for
for k=0 to s-1
for all (i=0 to s-1, j=0 to s-1)
C(i,j) = C(i,j) + A(i,j)*B(i,j)
Left-circular-shift each row of A by 1,
so that A(i,j) is overwritten by A(i, (j+1) mod s)
Up-circular-shift each column of B by 1,
so that B(i,j) is overwritten by B( (i+1) mod s, j)
end for
end for
This algorithm works because
C(i,j) = C(i,j) + sum_{k=0}^{s-1} A(i,k)*B(k,j)
= C(i,j) + sum_{k=0}^{s-1} A(i, (i+j+k) mod s)*B( (i+j+k) mod s, j)
In other words, after each data movement A(i,k) and B(k,j) arrive at
processor P(i,j), and are multiplied and accumulated into C(i,j). The k's
occur in different permutations on different processors, as described by
(i+j+k) mod s.
For example, in the picture above the 3 colored blocks are accumulated to form C(1,2). First the blue blocks A(1,0) and B(0,2) are co-resident in P(1,2) and multiplied and accumulated. Then the green blocks A(1,1) and B(1,2) are multiplied and accumulated, and finally the red blocks A(1,2) and B(2,2) are multiplied and accumulated.
This algorithm is well suited to an s-by-s mesh of processors, for which we will now measure the performance.
The total cost is therefore
Time = 2*n^3/p + 3*sqrt(p)*alpha + 3*n^2/sqrt(p) * beta
Comparing with the time for a ring, we see the arithmetic time is the
same, but the communication time is about sqrt(p) times smaller. This
reflects the fact that the bisection width of a mesh is sqrt(p) times
as large as a ring.
Since a ring can be embedded in a hypercube, the same algorithm can be used there. But it turns out there is something even better.
This mapping makes it clear that columns of the matrix are mapped to
independent sub-hypercubes of the original hypercube, and the same for
columns. For example, the blue column above is mapped into the blue
sub-hypercube below, and the same for the red row and red sub-hypercube.
Thus, we can use the more numerous wires of the hypercube to accelerate the skewing phase of Cannon's algorithm. This variation is due to Dekel, Nassimi and Sahni. We assume without loss of generality that A(i,j) is stored on processor i*2^m + j.
Dekel, Nassimi and Sahni's hypercube matrix multiply
for k= 0 to m-1
jk = 2^k and j ... "logical and" of 2^k and j
ik = 2^k and i ... "logical and" of 2^k and j
for all (i=0 to s-1, j=0 to s-1)
swap A(i, j xor ik) and A(i, j) ... "exclusive or" of j and jk
swap B(jk xor i, j) and B(i, j) ... "exclusive or" of jk and i
end for
for k=0 to s-1
for all (i=0 to s-1, j=0 to s-1)
C(i,j) = C(i,j) + A(i,j)*B(i,j)
Left-circular-shift each row of A by 1, in Gray code order
Up-circular-shift each column of B by 1, in Gray code order
end for
end for
The algorithm works as follows. After the skewing phase, A(i,j) has
been moved to A(i,j xor i). This is accomplished by changing one
bit of j at a time (from bit k=0 to bit k=m-1) to match the corresponding
bit of j xor i. j xor ik can differ from j only in the k-th bit, so swapping
with processor requires nearest neighbor communication only.
Similarly, B(i,j) is moved to B(j xor i, j), one bit at a time.
The cost of this skewing phase is 2*m*(alpha + (n/s)^2*beta), a factor of
(sqrt(p)/2)/(2*m) = sqrt(p)/(2*log_2 p) times faster than Cannon
on a mesh. The subsequent multiply-accumulate phase does not change in cost.
There is one more hypercube matrix-multiplication algorithm. To go faster than the last algorithm, it assumes that all log(p) wires connected to each processor can be used simultaneously, to get log(p) parallelism in communication; this corresponds to the hardware available on the CM-2. Rather than describe this algorithm, due to Ho, Johnsson and Edelman, in detail, we refer the reader to paper Parallel Numerical Linear Algebra, by Demmel, Heath, and van der Vorst, in volume 7 of the Class Reference Material. We reproduce the results below, for an n*2^n -by- n*2^n matrix on a 2*n-dimensional hypercube:
Algorithm message startups data sending steps arithmetic steps
Cannon 2*(2^n-1) 2*n^2*(2^n-1) 2*n^3*2^n
Dekel et al n+2^n-1 n^3 + n^2*(2^n-1) 2*n^3*2^n
Ho et al n+2^n-1 n^3 + n*(2^n-1) 2*n^3*2^n
The final algorithm uses "all the wires all the time" on a hypercube,
and is optimal in this sense. It was used in the matrix-multiplication
routine in the CMSSL library on the CM-2.