# Solving the Discrete Poisson Equation using Jacobi, SOR, Conjugate Gradients, and the FFT

• Review of the Discrete Poisson Equation
• The Complexity of Solving the Discrete Poisson Equation
• Jacobi's Method
• Successive Overrelaxation (SOR)
• Fast Fourier Transform (FFT)
• Review of the FFT
• Multiplying by Q using the FFT
• A High Level Serial O(m log m) Implementation of the FFT
• Detailed FFT Algorithm
• Parallelizing the FFT of a single vector
• Parallelizing Higher Dimensional FFTs
• ## Review of the Discrete Poisson Equation

In Lecture 13 we discussed Poisson's equation, which arises in heat flow, electrostatics, gravity, and other situations. In 2-dimensions the equation was
```                        d^2 u(x,y)   d^2 u(x,y)
2D-Laplacian(u)  =   ---------- + ----------  = f(x,y)
d x^2          d y^2
```
for (x,y) in a region Omega in the (x,y) plane, say the unit square 0 < x,y < 1. To make the solution well defined, we also need to specify boundary conditions, i.e. the value of u(x,y) on the boundary of Omega. We will consider the simple case
```        u(x,y) = 0 if (x,y) is on the boundary of Omega
```
This equation describes the steady-state temperature of a uniform square plate with the boundaries held at temperature u=0, and f(x,y) equaling the external heat supplied at (x,y).

In 3 dimensions, the equation is quite similar,

```                       d^2 u(x,y,z)   d^2 u(x,y,z)   d^2 u(x,y,z)
3D-Laplacian(u)  =  ------------ + ------------ + ------------
d x^2           d y^2          d z^2

= f(x,y,z)
```
for (x,y,z) in a region Omega in (x,y,z) space. Boundary conditions are also needed, i.e. the value of u specified on the boundary of Omega.

For simplicity of presentation, we will discuss only the solution of Poisson's equation in 2D; the 3D case is analogous.

Recalling Lecture 13 again, we discretize this equation by using finite differences: We use an (n+1)-by-(n+1) grid on Omega = the unit square, where h=1/(n+1) is the grid spacing. We let U(i,j) be the approximate solution at x=i*h and y=j*h. This is shown below for n=7, including the system of linear equations it leads to:

The above linear equation relating U(i,j) and the value at its neighbors (indicated by the blue stencil) must hold for 1 <= i,j <= n, giving us N=n^2 equations in N unknowns. When (i,j) is adjacent to a boundary (i=1 or j=1 or i=n or j=n), one or more of the U(i+-1, j+-1) values is on the boundary and therefore 0. b(i,j) = -f(i*h,j*h)*h^2, the scaled value of the right-hand-side function f(x,y) at the corresponding grid point (i,j).

To write this as a linear system in the more standard form P*Uhat = bhat, where Uhat and bhat are column vectors, we need to choose a linear ordering of the unknowns U(i,j). For example, the natural row ordering shown below

leads to the linear system P*Uhat = bhat:

For the most part, we will describe our algorithms in term of the square array U(i,j) rather than the column vector Uhat. But for analysis it is convenient to be able to describe the algorithms using both formulations.

## The Complexity of Solving the Discrete Poisson Equation

We first presented this table of algorithms and their complexities in Lecture 13, but did not discuss them in detail. All table entries are meant in the O( ) sense.

The first column in the table identifies the algorithm, except the last entry, which gives a simple Lower Bound on the running time for any algorithm. The Lower Bound is obtained as follows. For the serial time, the time required simply to print each of the N solution components is N. For the PRAM time, which assumes as many processors as we like and assumes communication is free, we note that the inverse inv(P) of the discrete Poisson matrix P is dense, so that each component of the solution Uhat = inv(P)*bhat is a nontrivial function of each of the N components of bhat. The time required to compute any nontrivial function of N values in parallel is log N.

The second column says whether the algorithm is of Type D=Direct, which means that after a finite number of steps it produces the exact answer (modulo roundoff error), or of Type I=Indirect, which means that one step of the algorithm decreases the error by a constant factor rho<1, where rho depends on the algorithm and N. This means that if one wants the final error to be epsilon times smaller than the initial error, one must take m steps where rho^m <= epsilon. To compute the complexities in the table, we choose m so that rho^m is about as small as the discretization error , i.e. the difference between the true solution x(i*h,j*h) and the exact discrete solution U(i,j). There is no point in making the error in the computed U(i,j) any smaller than this, since this could only decrease the more significant error measure, the difference between the true solution u(i*h,j*h) and the computed U(i,j), by a factor of 2.

The second and third columns give the running time for the algorithm on a serial machine and a PRAM, respectively. Recall that on a PRAM we can have as many processors as we need (shown in the last column), and communication is free. Thus, the PRAM time is a lower bound for any implementation on a real parallel machine. Finally, the fifth column indicated how much storage is needed. LU decomposition requires significantly more storage than the other methods, which require just a constant amount of storage per grid point.

```
Algorithm   Type  Serial    PRAM          Storage  #Procs
Time      Time
---------   ----  ------    ---------     --------  ------
Dense LU      D   N^3           N           N^2      N^2
Band LU       D   N^2           N         N^(3/2)     N
Inv(P)*bhat   D   N^2         log N         N^2      N^2
Jacobi        I   N^2           N            N        N
Sparse LU     D   N^(3/2)     N^(1/2)     N*log N     N
CG            I   N^(3/2)   N^(1/2)*log N    N        N
SOR           I   N^(3/2)   N^(1/2)          N        N
FFT           D   N*log N     log N          N        N
Multigrid     I   N           (log N)^2      N        N
Lower Bound       N           log N          N

Key to abbreviations:
Dense LU    : Gaussian elimination, treating P as dense
Band LU     : Gaussian elimination, treating P as zero
outside a band of half-width n-1 near diagonal
Sparse LU   : Gaussian elimination, exploiting entire
zero-structure of P
Inv(P)*bhat : precompute and store inverse of P,
multiply it by right-hand-side bhat
SOR         : Successive Overrelaxation
FFT         : Fast Fourier Transform based method
```

We include methods like Dense LU (Gaussian Elimination) even though these are much slower than the fastest methods, because these slower methods solve much more general linear systems than the much faster but more specialized algorithms like Multigrid. This table illustrates that there is often an enormous payoff for using an algorithm specially tuned for the system at hand. Band LU is Gaussian Elimination specialized to take advantage of the fact that none of the zero entries of P more than n entries away from the diagonal ever "fill-in" during Gaussian Elimination, so we can avoid both storing them and doing arithmetic with them. Sparse LU exploits the fact that many other entries in the L and U factors of P are zero, and avoids both computing with or storing any zero entries; this requires interesting linked data structures we will discuss in later lectures. Inv(P)*bhat means precomputing and storing the exact N-by-N inverse of P (so we do not count the cost of this precomputation in the complexity), and then doing a matrix-vector multiplication. On a serial machine, this is no faster than any of the subsequent methods, and uses much more storage.

Jacobi, SOR (Successive OverRelaxation) and CG (Conjugate Gradients) can be thought of as performing most nearest neighbor operations on the grid (CG requires summing across the grid too). In other words, information at one grid point can only propagate to an adjacent grid point in 1 iteration. Since the solution U(i,j) at each grid point depends on the value of b(l,m) at every other grid point, it take n steps for information to propagate all the way across the grid. This would lead us not to expect a good solution in many fewer than n=sqrt(N) steps, which is what these methods require.

We can formalize this argument using the picture available here, or the less detailed picture below. Indeed, we can only show a lower bound of O(log n)=O(log N) steps for any "nearest neighbor" iterative method (rather than sqrt(n)). This picture shows the solution of the Poisson equation when b is zero expect for the value 1 in the center. The solution decays logarithmically from the center. If the solution could propagate only from grid point to grid point starting from 0 initial values, then after m < < n steps, one would still have a zero solution outside distance m from the center, even though the solution is large there; this means the error is large, and convergence is impossible. In fact, one can show that m=O(log n)=O(log N) steps are needed to multiply the error by a constant g<1, independent of n.

The FFT (Fast Fourier Transform) and Multigrid can be faster than nearest-neighbor methods because they move information across the grid in larger steps than merely between neighbors. It is interesting that Multigrid is asymptotically the fastest on a serial machine, attaining the lower bound of N steps. In contrast, FFT is fastest on a PRAM, attaining a lower bound of log N steps. One of our goals in the next two lectures is to do more detailed performance modeling of these methods, taking communication costs into account, to see which one we expect to be fastest in practice.

In addition to the methods in this table being in increasing order of speed for solving Poisson's equation, they are (roughly) in order of increasing specialization, in the sense that Dense LU can be used in principle to solve any linear system, whereas the FFT and Multigrid only work on equations quite similar to Poisson's equation. In practice, one wants the fastest method suitable for one's problem, and it often takes a great deal of specialized knowledge to make this decision. Fortunately, on-line help is available to help make this decision at the "templates" subdirectory of Netlib, especially the on-line book "Templates for the Solution of Linear Systems".

## Jacobi's Method

To derive Jacobi's algorithm, we write the discrete Poisson equation as
```   U(i,j) =
( U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1)
+ b(i,j) )/4
```
We let U(i,j,m) be our approximation for U(i,j) after the m-th iteration. We can use any initial guess U(i,j,0); U(i,j,0)=0 will do. To get a formula for U(i,j,m+1) in terms of the previous iteration U(i,j,m), we simply use
```   U(i,j,m+1) =
( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m)
+ b(i,j) )/4
```
In other words, U(i,j,m+1) is just a weighted average of its four neighboring values and b(i,j).

We analyze the convergence rate of Jacobi as follows. We present only the final results; for details, see any good text on numerical linear algebra, such as "Lecture Notes on Numerical Linear Algebra", J. Demmel, Berkeley Lecture Notes in Mathematics, Mathematics Dept, UC Berkeley To state the result we define the error as the root-sum-of-squares of the error at each grid point:

```   error(m) = sqrt( sum_{ij} ( U(i,j,m) - U(i,j) )^2 )
```
It turns out that Jacobi decreases the error by a constant factor rho_Jacobi(n) < 1, depending on dimension n, at each step:
```   error(m) <= rho_Jacobi(n) * error(m-1)
<= ( rho_Jacobi(n) )^m * error(0)
```
where
```   rho_Jacobi(n) = cos(pi/(n+1))
```
When n is large, the interesting case, we can approximate rho_Jacobi(n) by
```   rho_Jacobi(n) ~ 1 - .5*(pi/(n+1))^2
```
which gets closer to 1 as n grows. So to decrease the error by a factor of 2, say, requires taking m iterations where
```      .5 > ( rho_Jacobi(n) )^m
~ ( 1 - .5*(pi/(n+1))^2 )^m
~ 1 - .5*(pi/(n+1))^2 * m     ... when n is large
```
or
```       m ~ ((n+1)/pi)^2
```
In other words, convergence gets slower for large problems, with the number of steps proportional to N = n^2. This make the serial complexity
```    Serial complexity of Jacobi
= number_of_steps * cost_per_step
= O(N) * O(N)
= O(N^2)
```
This is easy to implement in parallel, because each grid cell value may be updated independently. On a PRAM, this means the cost per step is O(1), lowering the PRAM complexity to O(N). On a realistic parallel machine, one would simply assign grid point (i,j) to processors in a block fashion as shown below, and have each processor update the values of U(i,j,m+1) at the grid point it owns. This requires U values on the boundary of the blocks to be communicated to neighboring processors. When n >> p, so that each processor owns a large number n^2/p of grid points, the amount of data communicated, n/p words with each neighbor, will be relatively small.

To express the parallel running time in detail, we let

```     p = number of processors
f = time per flop
alpha = startup for a message
beta = time per word in a message
```
Then
```  Time(Jacobi)
= number_of_steps * cost_per_step
= O(N) * ( (N/p)*f + alpha + (n/p)*beta )
= O(N^2/p)*f + O(N)*alpha + O(N^(3/2) /p)*beta
```
where O(N/p) flops are needed to update local values, alpha is the start up cost of messages to each neighbor, and O(n/p) boundary values are communicated to neighbors. Later we will compare this expression to similar expressions for other methods.

The phenomenon of convergence slowing down, and needing to take more steps for large problems, is a common phenomenon for many methods and problems. It is not true for Multigrid, however, which takes a constant number of iterations to decrease the error by a constant factor, and so makes it very attractive.

## Successive Overrelaxation (SOR)

This method is similar to Jacobi in that it computes an iterate U(i,j,m+1) as a linear combination of its neighbors. But the linear combination and order of updates are different. We motivate it by describing two improvements on Jacobi.

The first improvement is motivated by examining the serial implementation, and noticing that while evaluating

```  U(i,j,m+1)
= ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m)
+ b(i,j) )/4
```
new values for the solution at some of the grid points on the right hand side are already available, and may be used. This leads to the algorithm
```     U(i,j,m+1) = (U(i-1,j,latest) + U(i+1,j,latest) +
U(i,j-1,latest) + U(i,j+1,latest) + b(i,j) )/4
```
where "latest" is either m or m+1, depending on whether that particular grid point has been updated already. This depends on the order in which we loop through the (i,j) pairs. For example, if we use the same rowwise ordering as in an earlier figure, we get the so-called Gauss-Seidel algorithm:
```   for i=1 to n
for j=1 to n
U(i,j,m+1) = (U(i-1,j,m+1) + U(i+1,j,m) +
U(i,j-1,m+1) + U(i,j+1,m) + b(i,j) )/4
end for
end for
```
This particular update order cannot be straightforwardly parallelized, because there is a recurrence, with each U(i,j,m+1) depending on the immediately previously updated value U(i-1,j,m+1).

So instead we loop through the grid points in the Red-Black order (named after the resemblance to a chess board) shown below.

The idea is to update all the black points, followed by all the red points. The black grid points are connected only to red grid points, and so can all be updated simultaneously using the most recent red values. Then the red values can be updated using the most recent black values. This yields the following algorithm:

```  for all black (i,j) grid points
U(i,j,m+1) = (U(i-1,j,m) + U(i+1,j,m) +
U(i,j-1,m) + U(i,j+1,m) + b(i,j))/4
end for
for all red (i,j) grid points
U(i,j,m+1) = (U(i-1,j,m+1) + U(i+1,j,m+1) +
U(i,j-1,m+1) + U(i,j+1,m+1) + b(i,j))/4
end for
```
This can implemented in two parallel steps, one for updating the red points, and one for the black.

By itself, this idea does not accelerate convergence. It turns out to converge twice as fast (rho_Gauss_Seidel(n) = rho_Jacobi(n)^2), but on a PRAM each parallel Gauss-Seidel step takes the same time as two parallel Jacobi steps, so there is no overall benefit.

The second improvement is called Successive Overrelaxation, and depends on the Red-Black ordering for its speedup. Let us rewrite the algorithm so far as

```     U(i,j,m+1) = some_expression
... an abbreviation for the
... update scheme above
= U(i,j,m) + (some_expression - U(i,j,m))
= U(i,j,m) + correction(i,j,m)
```
The idea behind overrelaxation is that if correction(i,j,m) is a good direction in which to move U(i,j,m) to make it a better solution, one should move even farther in that direction, by a factor w>1:
```     U(i,j,m+1) = U(i,j,m) + w * correction(i,j,m)
```
w is called the overrelaxation factor (w<1 would correspond to underrelaxation, which is not worth doing). The word successive refers to using the latest data to evaluate the correction.

Altogether, the SOR(w) algorithm is as follows:

```  for all black (i,j) grid points
U(i,j,m+1) = U(i,j,m) +
w * ( U(i-1,j,m) + U(i+1,j,m) +
U(i,j-1,m) + U(i,j+1,m) + b(i,j)
- 4*U(i,j,m) )/4
end for

for all red (i,j) grid points
U(i,j,m+1) = U(i,j,m) +
w * ( U(i-1,j,m+1) + U(i+1,j,m+1) +
U(i,j-1,m+1) + U(i,j+1,m+1) + b(i,j)
- 4*U(i,j,m) )/4
end for
```
As before, we measure the error of iteration m by
```   error(m) = sqrt( sum_{ij} ( U(i,j,m) - U(i,j) )^2 )
```
As with Jacobi, SOR(w) decreases the error by a constant factor rho_SOR(n,w) < 1 at each step:
```   error(m) <= rho_SOR(n,w) * error(m-1)
<= ( rho_SOR(n,w) )^m * error(0)
```
We want to pick w to minimize rho_SOR(n,w). There turns out to be a simple expression for the minimizing w,
```     w_opt = 2 / ( 1 + sin(pi/(n+1)) )
```
which is between 1 and 2. The corresponding rho_SOR(n,w_opt) is
```  rho_SOR(n,w_opt)
= ( cos(pi/(n+1)) / ( 1 + sin(pi/(n+1)) ) )^2
```
As with Jacobi, rho_SOR(n,w_opt) approaches 1 as n grows, so that convergence slows down, but it slows down much less than for Jacobi:
```     rho_SOR(n,w_opt) ~ 1 - 2*pi/(n+1)
```
Therefore, reducing the error by a factor of 2, say, requires taking m iterations where
```      .5 > ( rho_SOR(n,w_opt) )^m
~ ( 1 - 2*pi/(n+1) )^m
~ 1 - 2*pi/(n+1) * m      ... when n is large
```
or
```       m ~ (n+1)/(2*pi)
```
This is approximately the square root of the number of steps Jacobi requires. This is shown in the figures below.

A complexity analysis similar to that for Jacobi shows that

```  Time(SOR)
= number_of_steps * cost_per_step
= O(sqrt(N)) * ( (N/p)*f + alpha + (n/p)*beta )
= O(N^(3/2) /p)*f + O(sqrt(N))*alpha + O(N /p)*beta
```
where O(N/p) flops are needed to update local values, alpha is the start up cost of messages to each neighbor, and O(n/p) boundary values are communicated to neighbors.

This method, called CG for short, is suitable for solving any linear system where the coefficient matrix A is both symmetric, i.e. A = A', and positive definite. Three equivalent definitions of positive definiteness are
• all of A's eigenvalues are positive,
• x'*A*x > 0 for all nonzero vectors x, and
• an LU factorization of A of the form A = L*L' exists (the so-called Cholesky factorization).
• Here is the CG algorithm:
```     x = initial guess for inv(A)*b
r = b - A*x   ... residual, to be made small
p = r         ... initial "search direction"
repeat
v = A*p
... matrix-vector multiply
a = ( r'*r ) / ( p'*v )
... dot product (BLAS1)
x = x + a*p
... compute updated solution
... using saxpy (BLAS1)
new_r = new_r - a*v
... compute updated residual
... using saxpy (BLAS1)
g = ( new_r'*new_r ) / ( r'*r )
... dot product (BLAS1)
p = new_r + g*p
... compute updated search direction
... using saxpy (BLAS1)
r = new_r
until ( new_r'*new_r small enough )
```
Here is a rough motivation for this algorithm. CG maintains 3 vectors at each step, the approximate solution x, its residual r=A*x-b, and a search direction p, which is also called a conjugate gradient. At each step x is improved by searching for a better solution in the direction p, yielding an improved solution x+a*p. This direction p is called a gradient because we are in fact doing gradient descent on a certain measure of the error (namely sqrt(r'*inv(A)*r)). The directions pi and pj from steps i and j of the algorithm are called conjugate, or more precisely A-conjugate, because they satisfy pi'*A*pj = 0 if i.ne.j. One can also show that after i iterations xi is the "optimal" solution among all possible linear combinations of the form
```    a0*x + a1*(A*x) + a2*(A^2*x) + a3*(A^3*x) + ... + ai*(A^i*x)
```

For most matrices, the majority of work is in the sparse matrix-vector multiplication v=A*p in the first step. For Poisson's equation, where we can think of p and v living on a square grid, this means computing

```     v(i,j) = 4*p(i,j) - p(i-1,j) - p(i+1,j)
- p(i,j-1) - p(i,j+1)
```
which is nearly identical to the inner loop of Jacobi or SOR in the way it is parallelized. We will deal with more general techniques for sparse-matrix-vector multiplication in a later lecture.

The other operations in CG are easy to parallelize. The saxpy operations are embarrassingly parallel, and require no communication. The dot-products require local dot-products and then a global add using, say, a tree to form the sum in log(p) steps.

The rate of convergence of CG depends on a number kappa called the condition number of A. It is defined at the ratio of the largest to the smallest eigenvalue of A (and so is always at least 1). A roughly equivalent quantity is norm(inv(A))*norm(A), where the norm of a matrix is the magnitude of the largest entry. The larger the condition number, the slower the convergence. One can show that the number of CG iterations required to reduce the error by a constant g<1 is proportional to the square root of kappa. For Poissons's equation, kappa is O(N), so n=sqrt(N) iterations are needed. This means SOR and CG take about the same number of steps. An advantage of CG over SOR is that CG is applicable to a much larger class of problems.

## Fast Fourier Transform (FFT)

We begin by deriving the algorithm to solve the discrete Poisson Equation, then show how to apply the FFT to the problem, and finally discuss parallelizing the FFT. We will assume that the reader is familiar with the FFT, and so describe the serial algorithm only briefly.

We are going to rewrite the discrete Poisson equation as a slightly different matrix equation. To motivate it, consider the continuous Poisson equation

```          d^2 u(x,y)   d^2 u(x,y)
---------- + ----------  = f(x,y)
d x^2          d y^2
```
and discretize one derivative term at a time. Then
```          d^2 u(i*h,j*h)   U(i-1,j) - 2*U(i,j) + U(i+1,j)
-------------- ~ ------------------------------
d x^2                   h^2
```
Now consider the n-by-n matrix whose (i,j) entry is the right-hand-side of the above equation. We claim that this matrix may be written as the matrix-matrix product (-1/h^2)*T*U, where U is the n-by-n matrix of U(i,j)'s, and T is the familiar symmetric tridiagonal matrix
```
[  2  -1                 ]
[ -1   2  -1             ]
T = [     -1   2  -1         ]
[       ... ... ...      ]
[             -1  2  -1  ]
[                -1   2  ]
```
A formal proof requires the simple computation:
```  (T*U)(i,j) = sum_k T(i,k)*U(k,j)
=  T(i,i-1)*U(i-1,j)
+ T(i,i)*U(i,j)
+ T(i,i+1)*U(i+1,j)
= -U(i-1,j) + 2*U(i,j) - U(i+1,j)
```
since only 3 entries of row i of T are nonzero. A completely analogous argument shows that
```  d^2 u(i*h,j*h)   U(i-1,j) - 2*U(i,j) + U(i+1,j)
-------------- ~ ------------------------------
d x^2                   h^2

-1
= --- * (U*T)(i,j)
h^2
```
and so the Discrete Poisson equation may be written
```      T*U + U*T = B
```
where the (i,j)-th entry of B is b(i,j), as before.

Here is how we will solve T*U+U*T=B for U. Suppose we know how to factorize T = Q*Lambda*inv(Q), where Q is a known nonsingular matrix, inv(Q) is its inverse, and Lambda is a known diagonal matrix. (Later we will see that is simply another way to say that the columns of Q are eigenvectors of T and the diagonal entries of Lambda are eigenvalues. We will also write down explicit formulas for Q and Lambda.) Substituting this expression for T into T*U+U*T=B yields

```   (Q*Lambda*inv(Q))*U + U*(Q*Lambda*inv(Q)) = B       .
```
Now premultiply this equation by inv(Q) and postmultiply it by Q to get
```   Lambda*(inv(Q)*U*Q) + (inv(Q)*U*Q)*Lambda = inv(Q)*B*Q
```
or, letting Ubar=inv(Q)*U*Q and Bbar=inv(Q)*B*Q,
```   Lambda*(Ubar) + (Ubar)*Lambda = Bbar    .
```
We want to solve this equation for Ubar, because then we can compute U=Q*Ubar*inv(Q). The (j,k)-th entry of this last equation is simple to evaluate, since Lambda is diagonal:
```   Lambda(j,j)*Ubar(j,k) + Ubar(j,k)*Lambda(k,k) = Bbar(j,k)
```
We may now solve for Ubar(j,k) as follows:
```   Ubar(j,k) = Bbar(j,k)/(Lambda(j,j)+Lambda(k,k))
```
This lets us state the overall high-level algorithm to solve T*U+U*T=B for U as follows:
```  (1) Compute Bbar = inv(Q)*B*Q
(2) For each j,k compute
Ubar(j,k) = Bbar(j,k)/(Lambda(j,j)+Lambda(k,k))
(3) Compute U = Q*Ubar*inv(Q)
```
Step (2) of this algorithm costs just 2*n^2=2*N operations, and is embarrasingly parallel. Steps (1) and (3) appear to cost 2 matrix-matrix multiplies by Q and and 2 by inv(Q), for a cost of O(n^3)=O(N^(3/2)). This would be no cheaper than SOR. But we will soon see that Q is very special, so that in fact
• inv(Q)=Q, and
• multiplying Q times a vector can be done by an FFT in O(n log n) flops. So multiplying Q times a square matrix takes n*O(n log n) = O(n^2 log n) = O(N log N) flops.
• This lowers the overall sequential cost to O(N log N), which is very fast.

Now we will go back and show how the factorization T = Q*Lambda*inv(Q) arises. Suppose q(j) is an eigenvector of T and lambda(j) is its corresponding eigenvalue. This means T*q(j) = lambda(j)*q(j). We can write this equation for j=1 to n as the following single matrix equation:

```   T*[ q(1), ... , q(n) ] = [ q(1)*lambda(1), ... , lambda(n)*q(n) ]
```
or
```   T*Q = Q*Lambda  ,
```
where Q is a matrix whose columns q(j) are the eigenvectors:
```   Q = [q(1), q(2), ... q(n)]
```
and Lambda is a diagonal matrix of eigenvalues
```   Lambda(j,j) = lambda(j)    .
```
Now, assuming Q is nonsingular, postmultiply T*Q = Q*Lambda by inv(Q) to get T=Q*Lambda*inv(Q) as desired.

It turns out we know an explicit expression for the eigenvalues lambda(i) and eigenvectors q(i) of T:

```   lambda(j) = 2*(1-cos(j*pi/(n+1))

and

Q(j,k) = sin(j*k*pi/(n+1)) * sqrt(2/(n+1)) = Q(k,j)       .
```
We will give an intuitive meaning to this expression in
Lecture 20. In the meantime one can simply confirm that T*q(j) = lambda(j)*q(j) using elementary trigonometry.

Q has the further attractive property of being orthogonal, which means that its columns are orthogonal to one another and have unit length. This implies that inv(Q) = Q'=Q, and so

```   T = Q*Lambda*inv(Q) = Q*Lambda*Q      .
```
This lets us simplify our algorithm to
```  (1) Compute Bbar = Q*B*Q
(2) For each j,k compute
Ubar(j,k) = Bbar(j,k)/(Lambda(j,j)+Lambda(k,k))
(3) Compute U = Q*Ubar*Q
```
This makes it clear that all we need is a fast algorithm for multiplying a matrix (or vector) by Q. To show how to do this, we first review the FFT.

### Review of the FFT

For ease of notation, just in this section we will let i = sqrt(-1), and index matrices and vectors from 0 to m-1 instead of 1 to m (i.e. C-style instead of Fortran-style).

Definition. The Discrete Fourier Transform of an m-by-1 vector v is the matrix-vector product F*v, where F is an m-by-m matrix defined as follows. Let

```    omega = exp(2*pi*i/m) = cos(2*pi/m) + i*sin(2*pi/m),
```
i.e. a complex number with absolute value 1, whose m-th power is 1 (a so-called m-th root of unity). Then for 0 <= j,k <= m-1, F(j,k) = omega^(j*k).

A standard fact about F is that inv(F) = complex_conjugate(F)/m, so that F/sqrt(m) is a unitary matrix (the complex equivalent of an orthogonal matrix). This means that multiplying by F and inverse(F) are essentially equivalent problems. (This can be confirmed by elementary trigonometry.)

A typical use of the DFT is filtering, as illustrated below. The top left plot is the signal s=sin(7t)+.5*sin(5t), for t at 128 equally spaced points between 0 and 2*pi. The top right 2 plots are the real and imaginary part of the fft of the signal; there are essentially two pairs of spikes, corresponding to the two sine waves. The middle left plot is s with random noise added to it, which is bounded in magnitude by .75. Its fft is also shown to the right. Finally, the signal+noise is filtered by taking the fft of the signal+noise and zeroing out any components less than .25 in magnitude. This accurately restores the original signal, in the bottom left graph. More sophisticated filtering schemes are also used.

Another possibly use of the 2D DFT of a matrix X, i.e. computing Y=F*X (replacing each column of X by its DFT) and then Y*F (replacing each row of Y by its DFT), besides solving the Poission equation, is data compression. The first image below is a 200x320 pixel array. The second image is gotten by taking the 2D FFT of the first image, zeroing out all but the largest 2.5% of the components, and taking the inverse 2D FFT. Thus, keeping only 2.5% of the data in the original image lets us approximately reproduce the original image. (Actually we need more than 2.5%, since we need the real and imaginary parts of the FFT, as well as the indices of the nonzero components; this may be as much as 4 times as much data, or 10% of the original.) Other, more sophisticated data compression schemes are often used in practice.

The Fast Fourier Transform (FFT) is an algorithm to compute the Discrete Fourier Transform F*v in O(m log m) time instead of O(m^2) time. We discuss it in more detail below, but first we will show how multiplying by F and multiplying by Q are closely related.

### Multiplying by Q using the FFT

Now we briefly show how being able to multiply quickly by F enables us to multiply quickly by Q. This is not the best way to do it in practice, but has the virtue of being simple to explain. We continue to use our convention that matrices are numbered starting at 0 instead of 1. Let m=2*(n+1), where F is m-by-m and Q is n-by-n, so
```     F(j,k) = cos(2*pi*j*k/(2*(n+1))) + i * sin(2*pi*j*k/(2*(n+1)))
= cos(pi*j*k/(n+1) + i * sin(pi*j*k/(n+1))
```
By comparing this expression with our earlier expression for Q,
```     Q(j,k) = sin((j+1)*(k+1)*pi/(n+1)) * sqrt(2/(n+1)) = Q(k,j)       .
```
we see that Q is the imaginary part of a submatrix of F
```     Q = Imag( F(1:n, 1:n) )
```
(Imag(x) denotes the imaginary part of x). Thus, continuing to use Matlab notation, one can write
```     Q*v = Imag( ( F*vv )(1:n) )    where  vv = [ 0; v; zeros(n,1) ]
```
In other words, we could extend v with one 0 at the top and n zeros at the bottom to get vv, do a 2*(n+1) point FFT on vv to get F*vv, and extract the imaginary part of components 1 through n from components 0 through 2*n+1 of F*vv. Matlab code for a function fast_sine_transform(V) which performs Q*V for a matrix V is as follows (note that it uses Matlab's built-in fft function).
```     % Fast Sine Transform on input matrix V
% of dimension n-by-m
function Y=fast_sine_transform(V)
[n,m]=size(V);
V1=[zeros(1,m);V;zeros(n+1,m)];
V2=imag(fft(V1));
% In Matlab vectors and matrices are indexed
% starting at 1, not 0
Y=V2(2:n+1,:);
```

Using fast_sine_transform, a Matlab implementation of the entire FFT-based Poisson solver is

```    % Solve the discrete Poisson equation
% on an n-by-n grid with right hand side b
function X=Poisson_FFT(B)
[n,m]=size(b);
% Form eigenvalues of matrix T(nxn)
L=2*(1-cos((1:n)*pi/(n+1)));
% Form reciprocal sums of eigenvalues
% Include scale factor 2/(n+1)
LL=(2/(n+1))*ones(n,n)./(L'*ones(1,n)+ones(n,1)*L);
% Solve, using the Fast Sine Transform
X = fast_sine_transform(b');
X = fast_sine_transform(X');
X = LL.*X;
X = fast_sine_transform(X');
X = fast_sine_transform(X');
```
In practice, one can multiply by Q quickly (by a constant factor) without using the complex FFT, but rather another divide-and-conquer algorithm very similar to the one described below. Much software for this and related problems is available on
Netlib, in the subdirectories
• FFTPACK, which contains optimized versions of the FFT, fast_sine_transform (multiplication by Q), fast_cosine_transform, and related routines,
• VFFTPACK, which contains vectorized versions of many algorithms from FFTPACK, which have been optimized for the Cray vector machines, and
• FISHPACK, which contains fast solvers for Poisson's equation based on a related O(N log N) algorithm called cyclic reduction.
• ### A High Level Serial O(m log m) Implementation of the FFT

We briefly show why we can evaluate the FFT in O(m log m) operations, leaving details to a course in algorithm design. We assume m=2^s is a power of 2.

We begin by showing how to interpret the product F*v as polynomial evaluation of the polynomial

```     V(x) = sum_{k=0}^{m-1} x^k * v(k)
```
at the points x = omega^0, omega^1, omega^2, ..., omega^(m-1):
```     (F*v)(j) = sum_{k=0}^{m-1} F(j,k) * v(k)
= sum_{k=0}^{m-1} omega^(j*k) * v(k)
= sum_{k=0}^{m-1} (omega^j)^k * v(k)
= V(omega^j)
```
Conversely, one can interpret multiplication by inv(F) as polynomial interpolation, i.e. computing the coefficients of a polynomial given its value at m points omega^j.

Now break up the polynomial V(x) into two parts, those consisting of even powers of x, and those consisting of odd powers of x:

```      V(x) = v(0) + x^2 * v(2) + x^4 * v(4) + ...
+ x^1 * v(1) + x^3 * v(3) + ...
= (v(0) + x^2 * v(2) + x^4 * v(4) + ...)
+ x * (v(1) + x^2 * v(3) + x^4 * v(5) + ...)
(*)        = V_even(x^2) + x * V_odd(x^2)
```
Since V is a polynomial of degree m-1, V_even and V_odd are polynomials of degree m/2 - 1, which we need to evaluate at the points (omega^j)^2, for 0 <= j <= m-1. But this is really just (m/2) different points omega^(2*j) for 0 <= j <= (m/2)-1 since
```    (omega^(j+m/2))^2
= (omega^j * omega^(m/2))^2
= (omega^j)^2 * omega^m
= (omega^j)^2
... since omega^m = 1 by construction
```
Thus, evaluating V(x) at m values omega^j is equivalent to evaluating two half sized problems (V_odd and V_even at half as many points), and then combining their values in a simple way. This leads to the following recursive, divide-and-conquer algorithm for the FFT:
```  function x = FFT(v,omega,m)
if m=1 then
return v(0)
else
v_even = FFT(v(0:2:m-2), omega^2, m/2)
... v(0:2:m-2) = v(0,2,4,...,m-2)

v_odd  = FFT(v(1:2:m-1), omega^2, m/2)
... v(1:2:m-1) = v(1,3,5,...,m-1)

omegav = [ omega^0, omega^1, omega^2, ... ,
omega^((m/2)-1) ]
... in practice these coefficients are
... precomputed and stored in a table

return x = [v_even + ( omegav .* v_odd ),
v_even - ( omegav .* v_odd )]
...  omegav .* v_odd is the component-wise
...  product of these two vectors
end if
```
The reason for using +omegav.*v_odd and -omegav.*v_odd is that when x = omega^(j+(m/2)) in equation (*) above,
```      omega^(j+(m/2)) = omega^j * omega^(m/2) = -omega*j
```

We compute the complexity of this algorithm as follows. Let FFT_time(m) be the cost of the algorithm on an input of size m. Then from the structure of the algorithm

```     FFT_time(m) = 2 * FFT_time(m/2) + O(m)
```
where 2*FFT_time(m/2) is the cost of the two recursive calls, and O(m) is the cost of m/2 products, sums and differences in computing x. Change variables to s = log_2 m to get
```     FFT_time(s) = 2 * FFT_time(s-1) + O(2^s)
```

This simple recurrence has the value FFT_time(s) = O(2^s) + O(2^s) + ... O(2^s) (s terms), or O(s*2^s) = O(m log m), as desired.

It is not necessary for m to be a power of 2 to use the divide-and-conquer technique in this algorithm. Indeed, Gauss first invented this algorithm for m=12 centuries ago. To be efficient, m should be a product of many small prime factors.

### Detailed FFT Algorithm

The call-tree of the recursive algorithm above is a complete binary tree of s levels, with the left and right children of each node corresponding to calling the FFT to get v_even and v_odd, respectively. Such a tree is shown below for m=16. We abbreviate the function call FFT( v(0),v(1),v(2),...,v(15) ) both by F( 0,1,2,...,15 ) and by F( xxxx_2 ), where xxxx_2 represents all the bit patterns of the subscripts of the v's appearing in the call (16 patterns in this case, since each x in the bit pattern may be independently 0 or 1).
```                                             F( 0,1,2,...,15 )
= F( xxxx_2 )
|        |
----------------------        -----------------------
|                                                   |
F( even )                                                 F( odd )
= F( 0,2,4,...,14)                                        = F( 1,3,5,...,15 )
= F( xxx0_2 )                                             = F( xxx1_2 )
|       |                                                  |      |
-----------       ------------                             -----------      -------------
|                            |                             |                             |
F( 0,4,8,12 )                F( 2,6,10,14 )                F( 1,5,9,13 )                 F( 3,7,11,15 )
= F( xx00_2 )                = F( xx10_2 )                 = F( xx01_2 )                 = F( xx11_2 )
|   |                         |   |                         |   |                          |   |
------   -------              ------   -------              ------   -------              -------   --------
|              |              |              |              |              |              |                |
F( 0,8 )       F( 4,12 )      F( 2,10 )      F( 6,14 )      F( 1,9 )       F( 5,13 )      F( 3,11 )        F( 7,15 )
= F( x000_2 )  = F( x100_2 )  = F( x010_2 )  = F( x110_2 )  = F( x001_2 )  = F( x101_2 )  = F( x011_2 )    = F( x111_2 )
|       |       |      |      |       |      |       |      |        |     |       |      |       |        |       |
|       |       |      |      |       |      |       |      |        |     |       |      |       |        |       |
F(0)    F(8)    F(4)   F(12)  F(2)    F(10)  F(6)    F(14)  F(1)     F(9)  F(5)    F(13)  F(3)    F(11)    F(7)    F(15)
0000    1000    0100   1100   0010    1010   0110    1110   0001     1001  0101    1101   0011    1011     0111    1111
```

Practical FFT algorithms are not recursive, but iterative. The iterative implementation loops across each level of the tree illustrated above, beginning at the leaves, and moving toward the root. Note that at the bottommost level, we combine pairs of values whose bit patterns differ in the leftmost bit (for example 0 and 8, 4 and 12, 2 and 10, etc.) At the next level, we combine pairs of values whose bit patterns differ in the next-to-leftmost bit (eg 0 and 4, 8 and 12, 2 and 6, 10 and 14, etc.) This leads to the following algorithm. We use the notation k=k(0)k(1)...k(s-2)_2 to mean the binary representation of the (s-1)-bit integer k; each k(j) is either 0 or 1.

```  for j = 0 to s-1     ... for each level of the call tree
... from bottom to top

for  k = 0 to (m/2)-1     ... for each vector entry

t0 = v[k(0)k(1)...k(j-1)0k(j)...k(s-2)]
... the subscript is equal to k with
... 0 inserted before bit j

t1 = v[k(0)k(1)...k(j-1)1k(j)...k(s-2)]
... the subscript is equal to k with
... 1 inserted before bit j

w = omega^[0k(j-1)...k(0)0...0]
... a power of omega stored in a table

x = w.*t1   ... componentwise multiplication

v[k(0)k(1)...k(j-1)0k(j)...k(s-2)] = t0 + x
... update location of v containing t0

v[k(0)k(1)...k(j-1)1k(j)...k(s-2)] = t0 - x
... update location of v containing t1
end for
end for
```
The algorithm corresponds to the recursive version because setting the j-th bit of a subscript to 0 is equivalent to taking the even subset of subscripts, and setting the j-th bit to 1 is equivalent to taking the odd subset.

The algorithm overwrites the input vector v with a permutation of the answer F*v. The permutation is called the bit-reverse permutation; component

```      k = k(0)k(1)...k(s-2)k(s-1)_2
```
of the output vector contains component
```     reverse(k) = k(s-1)k(s-2)...k(2)k(0)_2
```
of F*v.

Here is another, equivalent way to see the pattern of which components of v get combined with which other components at each step of the algorithm. As before, we illustrate in the case m = 2^4 = 16. Then we may partially evaluate the algorithm to see that

``` When j=0          When j=1         When j=2         When j=3
When k=0          When k=0         When k=0         When k=0
t0 = v[0000]     t0 = v[0000]     t0 = v[0000]     t0 = v[0000]
t1 = v[1000]     t1 = v[0100]     t1 = v[0010]     t1 = v[0001]
When k=1          When k=1         When k=1         When k=1
t0 = v[0001]     t0 = v[0001]     t0 = v[0001]     t0 = v[0010]
t1 = v[1001]     t1 = v[0101]     t1 = v[0011]     t1 = v[0011]
When k=2          When k=2         When k=2         When k=2
t0 = v[0010]     t0 = v[0010]     t0 = v[0100]     t0 = v[0100]
t1 = v[1010]     t1 = v[0110]     t1 = v[0110]     t1 = v[0101]
When k=3          When k=3         When k=3         When k=3
t0 = v[0011]     t0 = v[0011]     t0 = v[0101]     t0 = v[0110]
t1 = v[1011]     t1 = v[0111]     t1 = v[0111]     t1 = v[0111]
When k=4          When k=4         When k=4         When k=4
t0 = v[0100]     t0 = v[1000]     t0 = v[1000]     t0 = v[1000]
t1 = v[1100]     t1 = v[1100]     t1 = v[1010]     t1 = v[1001]
When k=5          When k=5         When k=5         When k=5
t0 = v[0101]     t0 = v[1001]     t0 = v[1001]     t0 = v[1010]
t1 = v[1101]     t1 = v[1101]     t1 = v[1011]     t1 = v[1011]
When k=6          When k=6         When k=6         When k=6
t0 = v[0110]     t0 = v[1010]     t0 = v[1100]     t0 = v[1100]
t1 = v[1110]     t1 = v[1110]     t1 = v[1110]     t1 = v[1101]
When k=7          When k=7         When k=7         When k=7
t0 = v[0111]     t0 = v[1011]     t0 = v[1101]     t0 = v[1110]
t1 = v[1111]     t1 = v[1111]     t1 = v[1111]     t1 = v[1111]
```

### Parallelizing the FFT of a single vector

We may represent the above tabular representation of our algorithm execution by drawing the graph below, an m-by-(s+1) (16-by-5) grid of points. Each column of m points represents the values in the array v at some stage in the computation. The leftmost column, column 0, represents the original data. The next column, column 1, represents the data after the computations performed when j=0, the next column represents the data after j=1, and so on, until the rightmost column represents the final data.

There is an edge from a grid point v0 in column j to a grid point v1 in column j+1, if the data at v1 depends on the data at v0. From examining the algorithm, we see that there is an edge from both grid points

```     [k(0)k(1)...k(j-1)0k(j)...k(s-2)]  and
[k(0)k(1)...k(j-1)1k(j)...k(s-2)]  in column j
```
to the same grid points points in column j+1, for a total of 4 edges connecting this pair of points. These 4 edges form what looks like a "butterfly" in the figure below, where we have color-coded the edges so that edges corresponding to the same butterfly connecting adjacent columns have the same color. Here is the graph in the case m=16. The first column of butterflies connects grid points whose numbers differ in bit 0 (the leftmost bit). For example, the black butterfly connects 0000 and 1000. The second column of butterflies connects grid points whose numbers differ in bit 1; for example the black butterfly connects 0000 and 0100. The next black butterfly connect 0000 and 0010, and the rightmost black butterfly connects 0000 and 0001.

This analysis is enough to compute the PRAM complexity of the FFT: Each of the s rightmost columns can be computed in one time step using m processors, for a complexity of O(s) = O(log m).

For a real parallel implementation, we need to take communication costs into account, and this is dictated by our choice of layout: which processors own which entries of the vector v. The following discussion may be found in "LogP: Towards a realistic model of parallel computation", by D. Culler, R. Karp, D. Patterson et al, which is in the class reader.

Let us consider two obvious candidates which we considered before in the case of Gaussian elimination: blocked and cyclic. The block case is shown below, with m=16 and p=4. The processor numbers are color coded. Edges which cross processor boundaries, and so require communication, are black. Edges within a processor, which require no communication, are color coded by the local processor color. In the block case, note that the leading log(p) bits of an address determine the processor (green is processor 00, blue is 01, magenta is 10, and brown is 11). Thus, in general, the first log(p) stages require communication, because addresses of data to be combined differ in their leading log(p) bits. The last log(m/p) stages are local, and require no communication.

The cyclic case is shown below, color coded in the same way. In this case, the trailing log(p) bits of an address determine the processor. Therefore, the last log(p) stages require communication, and the first log(m/p) stages are local.

Let us analyze the parallel complexity of these algorithms. As before we use the parameters.

```     p = number of processors
f = time per flop
alpha = startup for a message
beta = time per word in a message
```
Then both the block and cyclic implementations require
```  Time(block_FFT) = Time(cylic_FFT) =
2*m*log(m)/p * f     ... 2*m/p flops at log(m) stages
+ log(p) * alpha      ... one message per processor per
... stage for log(p) stages
+ m*log(p)/p * beta   ... m/p words per message
```

An alternative implementation is start with a cyclic implementation, and then transpose the data to have a block layout after the first log(p) steps. This way the first log(p) stages require no communication, then a transpose (a significant amount of communication) is performed, and finally the last log(m/p) stages are again local. (This assumes m>=p^2, which we now assume, since we expect many more points than processors. Otherwise, there may be more transposes required.) This is shown below.

```   1) perform "local FFTs"
2) transpose
3) perform "local FFTs"
```
(The "local FFTs" have the same structure as an FFT, but may use different powers of omega as coefficients.)

The reason we call converting from cyclic to block layout (or back) transposing is shown below for m=16 and p=4. When the data layout is seen in this way, it clearly looks like transposing a matrix.

The computational cost of this algorithm is the same as before: 2*m*log(m)/p*f. The communication cost is entirely the cost of transposing a matrix. This in turn depends on details of how many messages can be overlapped in the network. In the most pessimistic case, where no communication can overlap, so a processor can have only one outstanding message to another processor at a time, the cost is as follows:

```  Time(transpose_FFT) (non-overlapped) =
= 2*m*log(m)/p * f       ... 2*m/p flops at log(m) stages
+ (p-1) * alpha         ... p-1 messages per processor
... to transpose
+ m*(p-1)/p^2 * beta    ... m/p^2 words per message
```
Comparing Time(transpose_FFT) with Time(block_FFT) = Time(cyclic_FFT), we see that the transpose_FFT sends less data overall, but sends (p-1)/log p times as many messages. Thus, in this non-overlapping case, one expects the transpose_FFT to do well in a low latency, high bandwidth environment, or when m is very large and p small, and the block or cyclic to do well in a high latency, low bandwidth environment, or when m is not as large but p is large.

In the most optimistic situation, where the communication hardware and software permit communications to overlap so that one can have many messages to other processors outstanding at a time, one pays only a latency cost of alpha, rather than p-1 alpha. This lowers the cost to

```  Time(transpose_FFT) (overlapping) =
= 2*m*log(m)/p * f       ... 2*m/p flops at log(m) stages
+ alpha                 ... p-1 overlapping messages
... per processor to transpose
+ m*(p-1)/p^2 * beta    ... m/p^2 words per message
```
which is always better than either cylic_FFT or block_FFT. In fact, one can show it is close to optimal (see "LogP: Towards a realistic model of parallel computation" for details).

The above algorithm does not attempt to return the answer in sorted order; it comes out in a permuted order, which is adequrate for many applications. Sorting the final answer would require more communication (essentially another transpose).

For a radically different way to parallelize the 1D FFT, which uses the Fast Multipole Method to lower the communication needed when one insists on sorted inputs and outputs, see the paper The Future Fast Fourier Transform? on Alan Edelman's homepage. See Lecture 24 and especially Lecture 25 for detail on the Fast Multipole Method.

### Parallelizing Higher Dimensional FFTs

FFTs in 2 or 3 dimensions are defined as 1D FFTs for the vectors in all dimensions. We will discuss the 2D FFT in some detail, since the 3D case is analogous. A 2D FFT requires 1D FFTs on all rows and FFTs on all columns. Again, data layout is the paramount issue. In some applications, the data layout will be constrained by other parts of the application. For example, if the input is represented in a skewed layout, in which each processor owns 1/p of every row and column, then it may be better to redistribute the data before beginning the FFT. In other applications, the computations outside the forward and backward FFTs may be entirely local to each grid point, so the layout can be chosen to optimize only the FFT.

There are three obvious algorithm and layout choices that one might consider for the 2D case:

2D Blocked Layout

Assume we are given the n-by-n grid of data in a 2D blocked layout, using an s-by-s processor grid, where s = sqrt(p). Thus, each processor owns an n/s-by-n/s subgrid, as shown below.

The algorithm is now easy to describe. Let N=n2. We first do 1D FFTs on all the rows using the 1D parallel FFT algorithm from above. In these row FFTs, s processors cooperate to solve each row, e.g., P0, P1, P2, and P3 solve the first n/s rows in the picture. The other processors proceed in parallel, so there total running time is that of n/s 1D FFTs of size n on s processors.

```  Time(RowFFT) =
O((n*log(n)/s)) * (n/s) * f  ... O(n*log(n)/s) flops per FFT
... times n/s FFTs
+ x*alpha                   ... s-1 messages per processor
... to transpose
... x=1 for overlap
... x=s-1 for no overlap
+ n^2*(s-1)/s^3 * beta      ... (n/s^2)*(n/s) words per message

= O(N*log N / p)*f + x*alpha + N*((s-1)/s)/p *beta
```
where x=1 if all messages can overlap, and x=s-1=sqrt(p)-1 if they cannot overlap.

The column FFTs have the same cost, and unlike some of the other strategies described below, no global redistribution of data is required. (Note that, depending on 2D array layout, one of the dimensions will not have stride one accesses, even at the bottom of the FFT butterfly.)

The total cost is therefore Time(RowFFT)+Time(ColumnFFT) = 2*Time(RowFFT) shown above.

Block Row Layout with Transpose

There is another much simpler implementation of this algorithm, which does not require parallelizing the FFT at all. If one can in fact fully overlap the latency cost of transposition, its cost is identical to the algorithm above. In the somewhat more general and realistic situation of nonoverlapping communication, it costs the same in flops and bandwidth, and sqrt(p) times as much in latency:

```  1) Layout out the data by block row
(processor i gets rows i*(n/p) through (i+1)*(n/p)-1)
2) Perform FFTs on locally stored n/p rows. These require
no communication.
3) Transpose the data, so processor i gets columns i*(n/p) through
4) Perform FFTs on local n/p columns. These again require no
communication.
```
The cost of this is
```   O((n*log(n)) * (n/p) * f   ... O(n*log(n)) flops per FFT
... times n/p FFTs
+ x*alpha                 ... p-1 messages per processor
... to transpose
... x=1 for overlap
... x=p-1 for no overlap
+ n^2*(p-1)/p^2 * beta    ... (n^2/p^2) words per message

= O(N*log N / p)*f + x*alpha + N*((p-1)/p)/p *beta
```
Block Row Layout without a Transpose

Another alternative is to use the block row layout above, but to leave the data in this layout for the column FFTs. In this case, the column FFTs will be 1D parallel FFTs, each on p processors. We avoid the transpose at the cost of more expensive 1D FFTs.

For more details, including comparisons of some of these models to actual measurements, see "Performance Modeling and Composition: A Case Study in Cell Simulation" by S. Steinberg, J. Yang, and K. Yelick. This paper actually uses the parallel solution of Poisson's equation as one step in a more complicated problem involving simulating the motion of biological cells immersed in moving fluids. These techniques have been used to study blood flow in the heart, embryo growth, platelet aggregation in blood clotting, sperm motility, and other biological processes.

The 3D case

As mentioned ealier, the 3D case is entirely analogous to the 2D. One can imagine using a 3D blocked layout on a c x c x c processor grid (where c = p1/3). Note that the three dimensions need not be identical for the algorithm to work correctly. Alternatively, each processor could own complete planes, so the FFTs in 2 dimensions would require no communication; the 3rd dimension could be acheived with either a tranpose or a parallel 1D FFT.