The FACR algorithm in an MIMD environment

Material Information

The FACR algorithm in an MIMD environment
Turnbull, Thomas
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:
Physical Description:
vii, 59 leaves : illustrations ; 29 cm


Subjects / Keywords:
Poisson's equation ( lcsh )
Fourier analysis ( lcsh )
Algorithms ( lcsh )
Algorithms ( fast )
Fourier analysis ( fast )
Poisson's equation ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 56-57).
Submitted in partial fulfillment of the requirements for the degree of Master of Science, Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Thomas Turnbull.

Record Information

Source Institution:
|University of Colorado Denver
Holding Location:
|Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
17998076 ( OCLC )
LD1190.L62 1986m .T87 ( lcc )

Full Text
Thomas Turnbull
B.S., Purdue University, 1975
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirement for the degree of
Master of Science
Department of Mathematics

This thesis for the Master of Science degree by-
Thomas Turnbull
has been approved for the
Department of
Roland A. Sweet

Turnbull, Thomas (M.S., Mathematics)
The FACR Algorithm in an MIMD Environment
Thesis directed by Assistant Professor William L. Briggs
Three direct methods for solving Poissons
equation in two dimensions are discussed. The methods
of cyclic reduction and Fourier analysis are presented
as foundations to the introduction of the FACR
algorithm. This algorithm combines both cyclic
reduction and Fourier analysis to produce a method
which is superior to each of the other two individual
methods operating independently. Particular emphasis
is given to the implementation of the FACR algorithm on
a multiprocessor MIMD computer, Denelcors HEP. The
paper concludes with an analysis of the algorithm in
terms of the total work function and the optimum
parameter for using FACR.

The author would like to acknowledge the
assistance and guidance provided by Prof. Bill Briggs.
During the past year, his friendly advice and attention
to detail have been invaluable. The time and effort he
has devoted to this project is greatly appreciated.
Profs. Steve McCormick and Roland Sweet also made
valuable suggestions and comments to improve this
thesis. The author would like to thank them for their
assistance. Computer time on the HEP was provided by
Los Alamos National Laboratory. The research for this
project was supported in part by NSF Grant No.

II. STATEMENT OF THE PROBLEM....................... 6
III. CYCLIC REDUCTION (CORF)....................... 11
IV. FAST FOURIER TRANSFORM (FFT).................. 20
Introduction................................ 38
Total Work: Case 1.......................... 40
Total Work: Case 2.......................... 41
Analysis.................................... 43
VIII. CONCLUSION.................................... 54
BIBLIOGRAPHY........................................ 56
A. TRIDIAGONAL SOLVERS........................... 58

1. Optimum ft

1. Equal Performance Lines....................... 46
2. HEP Timings for N = 128...................... 49
3. HEP Timings for N = 256...................... 50
4. Work versus J for N = 128.................... 51
5. Work versus 1 for N = 256

Elliptic partial differential equations (PDEs)
occur in a great many problems throughout the sciences,
engineering, and applied mathematics. The solutions to
problems in areas such as fluid dynamics, aerodynamics,
and electrodynamics, to name just a few, all involve
elliptic PDEs at some point. The single most important
and often occuring elliptic PDE is Poissons equation.
Because of its great importance, the numerical solution
of Poissons equation has been studied by numerical
analysts for decades. The reason so much attention and
research has been given to this single equation is not
immediately obvious. An accurate numerical method for
solving Poissons equation has been known for quite
some time. If someone were merely interested in
solving this equation a few times, no extra effort
would be required or desired. However, for some of the
more complex time-dependent models, say an oil-field
model or wing design problem, the numerical solution to
Poissons equation is not just needed a few times, but
rather hundreds or thousands of times at each time
step. Therefore, a method which solves this type of

equation very quickly and very efficiently is vital, if
the overall problem is to be solved in some reasonable
period of time.
The trend recently has been to find more
efficient algorithms to solve Poissons equation.
Coupled with this mathematical' endeavor has been the
engineering effort to design and build faster and more
efficient machines on which to run these new
algorithms. With each new computer, algorithms are
created to take advantage of the technology and
architecture. The outcome of all this has been a
series of "fast Poisson solvers" designed to work very
effectively on specific kinds of architectures.
Initially, with the advent of the high-speed
computers, iterative methods (for example, Young
[1971]) for solving Poissons equation were developed
and utilized. These methods were easy to program and
the new computers were very effective at producing a
series of iterates which, at least in theory, converged
to the approximate numerical solution. However, the
rates of convergence, the required number of iteration
steps, and the overall efficiency of these algorithms
was still somewhat in question. In many cases, the
time required to solve these kinds of problems using
iterative methods was prohibitive. Other, more direct
methods were needed. Two such direct methods appeared
in the mid-1960s. The advantage of these new direct

methods rested in their computational speed and minimum
storage requirements, as well as their "non-iterative"
nature. By that we mean that these direct methods
yield a solution in only one "pass" through the
The first method, known as cyclic reduction,
was originally proposed by Hockney and Golub (Hockney
[1965]) with various modifications by Buneman [1969],
Buzbee, et al. [1970], Sweet [1974, 1977], and Heller
[1976]. The second method basically evolved from two
major sources. Hockney [1965] developed a method for
solving Poissons equation using Fourier analysis or
matrix decomposition. During the same time period the
Fast Fourier Transform method (FFT) was developed by
Cooley and Tukey [1965], with variations made later by
Gentleman and Sande [1966] and Stockham (Cochran, et
al. [1967]). The combination of the FFT method with
Hockneys Fourier analysis produced an efficient
algorithm for solving Poissons equation. Some of the
modifications for both of these methods will be
discussed in later sections.
Although both these algorithms are very quick
and efficient at solving the discrete Poisson equation,
Hockney [1965] noticed that by combining both methods,
a further decrease in computational time could be
achieved. The method which combines cyclic reduction
and FFT is known as the Fourier Analysis and Cyclic

Reduction algorithm (FACR). Initially, Hockney used
only one step of cyclic reduction to reduce the problem
size by half before applying the FFT method. Later,
Hockney [1970] developed a more general algorithm, in
which 1 steps of cyclic reduction are performed
before the FFT method is used on the remaining
equations. The final solution is then obtained by
J steps of backsubstitution. This algorithm is
known as FACR(l). Swarztrauber [1977] calculated
the optimum 1 for the sequential version of
FACR(l) .
The FACR(l) algorithm will be studied in
some detail in this paper, with particular emphasis
given to the implementation of the algorithm on an MIMD
computer. Different computer architectures require
either totally new algorithms or modifications to old
ones in order to take advantage of the new technology.
Both the cyclic reduction and FFT portions of the
FACR(l) algorithm are well-suited for a
multiprocessor environment, at least up to some point.
It is the intent of this paper to show to what extent
this algorithm can take advantage of the new
multiprocessor technology.
In closing, it should be noted that with the
introduction of the raultigrid methods in the
late-1970's, iterative methods are once again becoming
increasingly popular. Although computational speeds

are up from previous iterative methods such as
Gauss-Seidel or SOR, storage requirements may be
greater than for direct methods. However, multigrid
does seem to be a very competitive iterative method in
the numerical solutions of partial differential
Additional bibliographical and survey material
on partial differential equations and parallel
algorithms used in their solutions can be found in
Heller [1978], Kadalbajoo and Bharadwaj [1984], and
Ortega and Voigt [1985].

Consider Poissons equation in two dimensions:
Au (x, y) =Uxx +Uyy = f (x, y) , (1)
where A is the Laplacian operator. Elliptic PDEs,
such as Poissons equation, are well-posed only if
considered as boundary value problems instead of as
initial value problems. Therefore, a region R on which
Poisson's equation is valid must be specified. For
simplicity, let R be the rectangle 0 < x,y < 2rt,
and let dR denote the boundary of R. Next we must
specify what kind of boundary conditions the PDE
satisfies. There are three important general types:
Dirichlet, Neumann and Periodic. Dirichlet boundary
conditions prescribe the value of the solution on a
boundary, e.g. u(0,y) = A(y) and u(2ff,y) = B(y),
whereas Neumann boundary conditions prescribe the value
of the normal derivative of the solution on a boundary,
e.g. u n (x, 0) = C(x) and un(x,2rr) = D(x). Periodic
boundary conditions are characterized as:
u(x,0) = u(x,2rt) and un(x,0) = Un(x,2ir).
For the purposes of this paper, let us consider
Poissons equation with Dirichlet boundary conditions

in both the x- and y-directions such that:
u(0,y) = u(2it,y) 0 and u(x,0) = u(x,2ir) = 0.
Numerically we do not actually solve Poissons equation
as stated above, but rather we solve a discretized form
of it. We begin by superimposing a rectangular grid
over the region R. We let Ax = 2tt/M and Ay = 2tT/N for
some positive integers M and N. The values at the grid
points are defined as:
x i = i Ax and y j = j Ay
for 0 < i < M and 0 < j < N. Therefore, the grid size
is (M+l)x(N+l), but because of the Dirichlet boundary
conditions there are only (M-1)(N-1) unknown values at
the interior grid points. All grid points on dR are
considered to have the value zero.
We now use a finite difference scheme to
approximate the derivatives in (1) above. Define
Ui j = u(x(i),y(j)). The standard second order finite
difference approximations are used:
Ui 41 | j 2ui j + U i l j
Uu = ------------------------ + 0 ( ( Ax ) 2 )
( Ax) 2
U| j + l 2ui j + U i j -l
U y y = --------------------- + O ( ( Ay ) 2 )
(Ay) 2
where 0((Ax)2) and 0((Ay)2) represent truncation
errors, i.e. the remaining terms in the Taylor series
expansion of the finite difference approximation for
the second derivative. In theory, as Ax and

Ay are allowed to get smaller and smaller, i.e.
as we increase M and N, the approximation will improve.
Substituting these values into Poissons equation
Au = f(x,y) and letting fij = f(x(i),y(j))
we arrive at the following expression:
Ui+i,j 2uij + Ui-i,j + (Ax/Ay)2 [uj.j+i 2u i j
+ + tij = (Ax)2 f i j
for 1 < i < M-l and 1 < j < N-l. Here tij is the total
truncation error.
At this point we drop the truncation error tij,
and replace the exact solution u,j with an approximate
solution Vi j. Notice that Vij no longer satisfies
Poissons equation, but rather satisfies the discrete
Poissons equation Aavij = fij. Consider the special
case of a uniform grid where M=N, i.e. Ax=Ay=h.
The discrete problem now has the form:
Vi-i, j + v'i.j-i 4v i j + Vi,j + i + Vi + i,j = h 2f i j = Fij
for 1 < i < M-l and 1 < j < N-l. This finite
difference approximation is represented by the
following 5-point stencil:
1-4 1
The task now is to transform this finite
difference approximation into the matrix equation
Av = F. For various values of i and j, we have the
following individual equations:

i-H II rH j = l Vo 1 + v10 4v11 + V l 2 + V 2 1 = Fll
i = l, CM II VO 2 + Vll 4V12 + Vl3 + V22 = F12
i-1, CO II Vo 3 + Vl 2 4vi 3 + V 1 4 + V23 = F 1 3

i = l, j=N-l VO , N - 1 + Vl ,11-2 - 4v1 N 1 + Vl.N
+ V 2 N 1 = F 1 ,

Note that all the terms with either subscript equal to
zero are boundary values and have the value zero.
Putting these equations in matrix form Av = F, and
numbering the unknowns as
V = (Vll V1 2 V1 3 ... Vl.N-l V 2 1 V22 ....)
we have the following system:
X* -
-4 1 0 1 0 0 .... Vi 1 F11
1 -4 1 0 1 0 .... Vl 2 F i 2
0 1 -4 1 '... 0 0 1 till Vl 3 F 1 3
1 0 0 0 . -4 1 0 f ( V 2 1 = F 2 1
0 1 0 0 . 1 -4 1 V 2 2 F 2 2
0 0 1 0 . 0 1 t -4 1 ... V 2 3 F 2 3
The unknown vector v and the right-hand side
vector F can be written in partitioned form. Let
v i
mm -
V i l Fi 1
V i 2 F i 2
V i 3 and F i = F i 3
Vi ,N-1 F i N 1
- __ ..
1 < i < M-l;
i.e. the jth column of unknown grid points and
right-hand side function. Then


Vi F i
V 2 F 2
V 3 and F = f3
VM 1 Fm-i
In general, for Dirichlet boundary conditions
with arbitrary M and N, matrix A has the following
f orm:
* t

. I
where T has dimension (N-l)x(N-l). I is the identity
matrix and also has dimension (N-l)x(N-l). Matrix T is
said to be tridiagonal, while matrix A is said to be
block-tridiagonal of block dimension (M-l)x(M-l) and is
denoted A = [I T I]. Since blocks T and I have
dimension (N-l)x(N-l), and there are (M-l) such block
rows and columns in A, then A has dimension
(M-l)(N-l)x(M-l)(N-l) .
For the model problem we used the function
f(x,y) = -(p2 + q2)sin(px)sin(qy)
as the right-hand side function. The exact solution to
Poissons equation can easily be computed to be:
u(x,y) = sin(px)sin(qy).
We now turn our attention to the first part of
the FACR(l) algorithm, the method of cyclic reduction.

Consider the model problem Au(x,y) = f(x,y)
with u(x,y) = 0 on the boundary and M = N = 2k + 1.
Although Sweet [1977] has generalized the cyclic
reduction algorithm for arbitrary size problems, this
assumption that M is a power of 2 will simplify our
discusssion. In order to better understand the general
cyclic reduction algorithm, an example with N = M = 8,
i.e. k = 2, will first be presented.
For M = 8, we have h = Ax = Ay = 1/8. We
therefore have 7 x 7 = 49 unknown values at the
interior grid points. The matrix equation Av = F has
the form:
T I 0 0 0 0 o" "v 1 ~ f r ~Fi~
I T I 0 0 0 0 V 2 f 2 f2
0 I T I 0 0 0 v3 f 3 F 3
0 0 I T I 0 0 V 4 = h2 f 4 f4
0 0 0 I T I 0 v5 f 5 f5
0 0 0 0 I T I V6 f6 F6
0 0 0 0 0 I T_ v7 W *4 7- -Ft J
where T is the 7x7 tridiagonal matrix [1 -4 1] and
I is the 7x7 identity matrix. Vectors v and F have
been partitioned as in Chapter II:
vs = (vu,...,Vi7)T and Fi = (Fi i . ,Fi 7) T, 1 < i< 7.
The Dirichlet boundary conditions impose the additional

condition that vo = Va = 0.
The resulting system of equations is:
Tv i + IV 2 = F i (la)
Ivi + Tv 2 + IV 3 = f2 (lb)
Iv 2 + Tv 3 + IV 4 = f3 (lc)
IV 3 + TV 4 + IV 5 f4 (Id)
Iv 4 + Tv 5 + Iv 6 = Fs (le)
Ivs + Tv 6 + IV 7 - F6 (If)
Iv6 + Tv 7 - f7 (lg)
We now proceed to reduce the number of unknowns by
eliminating the odd numbered columns. This can be
accomplished by taking linear combinations of these
columns with adjacent columns. In particular, one can
multiply (lb) by -T and add the result to (la) and
(lc). This produces the new set of equations
(21 T2)v2 + Iv4 = Fi TF2 + F3 = F2<1>. (2a)'
We define T = 21 T2. If we perform similar
operations on (Id) and (If), we arrive at a new,
reduced system of equations:
T ( 1 ) V 2 + IV 4 = F2 (2a)
Iv2 + T <1>V4 + IV 6 II (2 b)
Iv 4 + T ( 1 > V 6 = Fe 1> (2c)
We could now solve this reduced system by some
particular method, or repeat the above process to
reduce the system even further as follows. Again, we
multiply (2b) by -T and add the result to (2a) and (2c)
to produce the single set of equations:
[21 (T(D)]vi = F2 ( i >-T< i )F4 < 1 >+F6 < 1 ) = F4<2>. (3)
Again, let T<2> = [21 (T^1))2]* Therefore (3) can be
written as:

T<2>v4 = F40). (3)
Since this is a single set of equations we can
no longer reduce this system any further. We must now
solve for the unknown vector v4 in (3). Once v4 is
known, the unknown vectors V2 and V6 can then be
recovered. For example, from (2c) we see that if
v4 is known, then
T<1>v6 = Fe <1> Iv4
which is just another matrix equation which can be
solved. Now that the even terms are known, the odd
terms can also be recovered.
The statement was made that v4 must be
determined from (3). We need to consider exactly how
this equation is to be solved most efficiently.
Although T is tridiagonal, the matrices TO),
T< 2 ) etc. are not. In fact, they fill rapidly
and additional storage and computational effort is
therefore required for solving their equations.
Therefore, it would appear as though we are not able to
apply a tridiagonal solver to (3). However, if we
examine the matrices TO) further, some
interesting properties are found which will reveal an
effective method for solving equations of the form (3).
Denote TO) = T. Then TO) = 21 (TO)) 2,
TO) = 21 (TO))z; in general, TO> = 21 (TO-D)2.
If we consider TO) as a second degree polynomial in
TO) = t, then we can write TO) as:

T)2 = 2 T2.
Also T<2) is a second degree polynomial in T*1*, but
T<1> = 2 T2, so
T<2> = p 4(T) = 2 (2 T 2)2 = -2 + 4T2 T4
which is a fourth degree polynomial in T. In general,
T<1> is a 2r-th degree polynomial in T.
We have transformed the sequence of matrices
T into a sequence of polynomials p r (x):
Pi (X) = X
P 2 (x) = 2 x2
P4 (x) = -2 + 4x2 x4
P8 (x) = -2 16x2 + 16x4
P r (x) = 2 [p r- 1(X) ] 2
2 2
Therefore solving the equation T<2>V4 = F4(2 > is
equivalent to solving p4 the possibility of factoring the polynomial p r (x).
the form:
Pi (x) =
P 2 (x) =
P 4 (x) =
x = -2cos0. Then the polynomials take on
x = -2cos8
2 x2 = 2 4cos20 = 2(1 2cos20)
-2 + 16cos20 16cos48 = 2(-l+8cos28-8cos40)

p r (x) = 2 [p r-1 (x)] = -2cos(2 r8) .
2 2

The roots of p r (x) = 0 are those values of 0 which
satisfy: cos(2r8) = 0, or 2r0 = ( 2 j 1) rr/2 so
0j(r) = (2j-l)TT/2r + i 1 < j < 2'.
Therefore, the roots of p r (x) are
a j<*> = 2cos[(2 j 1)ff/2r + 1].
This implies that we may write p r (x), and hence, T
in factored form:
2 r
P r (x) = n (x - < r > ) .
2 j = l
In matrix form this produces the relationship:
2 r
p r (T) = T = I! (T Ctjd-JI).
2 j = l
Incidentally, the polynomials p r (x) are really
forms of the Chebyshev polynomials of the first kind,
C r. In particular, p r (x) = -C r (x). It is known
2 2 2
that the roots of C r (x) are of the form:
aj = 2cos[ (2j-l)TT/2r + 1] 1 < j < 2 *
which agrees with our results above.
We now have an expression for the non-
tridiagonal matrix T in terms of the tridiagonal
matrix T. This implies that instead of trying to solve
T< 2)V4 = Fi(2 ) , (3)
we can solve an easier problem, namely

T<2>v4 = {- n (T aj<2>I)}v4 = F4<2>. (4)
j = l
This requires four recursive tridiagonal solves, using
a tridiagonal solver such as CROUT. (The CROUT routine
used is given in Appendix A.) Equation (4) can be
expanded as:
_(T-ai(2)i)(T-a2(2>i)(T-a3(2>i)(T-a4<2>i)v4 = f4<2>. (5)
For simplicity, write (5) as:
T1T2T3T4V4 F 4 ^ 2 ^
where T, = (T (Xi(2>I). To solve (5) recursively,
first let wi = T2T3T4V4, and solve
7 -Tiwi = F4<2)
for Wi. Let w2 = T3T4V4, and solve
T2W2 = Wi
for w2 Continuing in this fashion, we now have
w2 = T3T4V4, so let W3 = T4V4, and solve
T3W3 = w 2
for W3. Finally, we have w3 = T4V4, and we solve this
system for V4. Since all of the matrices are tri-
diagonal, the computations are considerably reduced.
Using this method then, we could completely
solve our system of equations, except for one major
difficulty! The method as just described is
numerically unstable due to the way the right-hand
sides Fi are computed. This instability is
based on the fact that the matrices T used to
compute Fi quickly become ill-conditioned as

r increases. Buzbee et al. [1970] provide an analysis
of this instability. Buneman [1969] has given a method
for computing the right-hand sides which, although
mathematically equivalent to the above algorithm,
eliminates this instability. Essentially Buneman
factors out this instability by rewriting the
right-hand side functions in terms of intermediate
values, denoted qi The Buneman variant of
the cyclic reduction algorithm will now be given by way
of the example for the case M = 8. Recall that the
superscripts denote the particular stage of cyclic
(1) Reduction Phase:
(a) Define: qi(0) = Fi, 1 < i < 7 and Vo = vs = 0,
(b) Eliminate the odd blocks of equations, and
solve for the intermediate values:
q2 ( 1 ) = qi<0 > + q3 ( ) - 2T-iq2(o)
q4 ( 1 ) = q3(0 > + qs ( ) - 2T"iq4(0 >
qe*1* = q5(0) + q7 ( o ) - 2T-iq6 < 0 >
(c) Eliminate columns 2 and 6:
q4<2> = q2<1>-qa(0>+q<1>-q5<0>+q6<1> + (T<>)-i(qi<>
(2) Backsubstitution Phase:
(a) Solve for the middle block of unknowns:
v4 = (q2(15+q6(1}-q4(2>) + (T<2>)"1(q4(2>-vo-vj)
(b) Solve for the other even blocks of unknowns:

v2 = (qi<0>+q3<0>-q2<1>) + (T<2))-i(q2(i)-v0-v4)
V6 = {q5(0}+q7(0}-q6(1}) + (T(2>)- 1(q6(1>-v4-vs)
(c) Finally solve for the odd blocks of unknowns:
Vi = T1(q i < 0 > Vi-i Viti) for i = 1,3,5,7.
Note that the terms involving (TMr>)-1 are not directly
computed, i.e. we do not actually compute the inverse
of T. Instead, the method previously discussed for
decomposing T into linear factors of T is used.
For example, to calculate the term
(T(2))-i(q4(2j _vo v8) = (T<2>)-iq4<2>
we first let (T<2))-iq4(z ) = Wo, so
q4(2) = T(2)Wo = {- n (T (Jj ( 2 )i) }Wo .
j = l
We then solve for Wo by performing a series of
recursive tridiagonal solves as demonstrated earlier.
This algorithm just given is known as Cyclic
Odd-even Reduction and Factorization, or CORF, and
incorporates one of Bunemans variants. The general
CORF algorithm as given in Swarztrauber [1977] follows.
(1) Reduction Phase:
( o )
(a) qj =Fj, j = 1,2,...,M-1
( 1 ) ( 0 ) ( 0 ) ( 0 )
qi = qj-1 + qj + i 2T-!qj j = 2,4,...,M-2

( r )
(b) The remaining qj are determined for r = 2,...,k:
Cr) (r-l) ( r- 2 ) (r-1) (r-2) (r-1)
qj = qj-2h q j h + qj qj + h + qj + 2 h
(r-1) (r-2) (r-1) (r-2) (r-1)
+ (T )-1(qj-3h qj-2h + qj-h 2qj
(r-2) (r-1) (r-2)
+ qj + h q j + 2 h + qj + sh ) (6b)
for j = 2r,2 2r,3 2r,...M-2r, and where h = 2r-2.
(2) Backsubstitution Phase:
( -1 ) ( 1 ) ( 0 )
(a) Define qj-1/2 + qj+1/2 qj = v0 = vM = 0, then 1 2
1 (r-l) (r-l) (r)
Vj = (q j- 2 h + q j + 2 h qj )
( r ) ( r )
+ (T )Mqj Vj-4h Vj+4h) (7)
for r = k,...,0 and j = 2r,3 2r,...,M-2r.
We now direct our attention to the second part
of the FACR(J) algorithm, the Fourier analysis or
matrix decomposition method.

Poissons equation Au(x,y) = f(x,y) and the
resulting matrix formulation Av = F can also be solved
by the method of Fourier analysis. We begin with a
general discussion of the FFT method and then show its
applicability to solving Poissons equation.
Given a sequence (xj), 0 < j < N-l, we define
the discrete Fourier Transform (DFT) as the new
sequence {Xk}>
n 1
Xk = H x j e-2*1**/" 0 1 k < N-l
j = o
where eikx = cos(kx) + isin(kx). There is a similar
expression for the inverse DFT:
1 n 1
Xj = H Xk e2"ijk/N 0 < j < N-l.
N j = 0
Both of these expressions involve complex matrix
multiplication, and hence require roughly N2 complex
additions and multiplications. For large N, the cost
in evaluating and storing these sequences would be
prohibitive.. Therefore, another method must be used.
Cooley and Tukey [1965] devised an algorithm which
reduces the amount of work required to compute the

sequence {Xk}. One particular version of that
algorithm is the splitting algorithm, which we now
To begin with, it is important to note that the
sequence {Xk} is periodic of period N. In particular: N 1 Xk + N = H Xj e" 2ffi 3 < k + N ) / N j = 0
N 1 = H X j e- 2ffi i k/N 6-2^1^ j = 0
but e-Zffi- = cos (2ff j) isin(2itj) = 1 for all integer
values of j, 0 < j < N-l, so N 1 Xk + N = H Xj e-2ffi Jk/H = Xk. j = 0
Assuming N is even, split {xj} into even and
odd parts: N
y j = x 2 j and Zj x 2 jii 0 < j < - 1. 2
Then we can write the expression for {Xk) in two parts:
( N / 2 ) 1 ( N / 2 ) 1
Xk = H yj e-2* < 2 j ) k / N + ZI Z j e-2i ( 2 j + 1 ) k/N
j = 0 j = 0
( N / 2 ) 1
£ y i e-2ffikj/(N/2)
j = 0
+ ( N / 2 ) 1 e-2irik/N ^ Zj e-2ffijk/(N/2). 3 = 0
Note that the two summation terms are merely the DFTs
for the sequences {yj} and {zj}, which only have length

N/2 compared to length N for sequence {xj}. Let {Yj}
and {Z j} be the DFTs of the sequences {yj} and {zj},
respectively. Then we have the relation:
Xk = Yu + e-***/N Zk 0 < k < - 1 .
The sequences {Yk} and {Zk} are now periodic of period
N/2, so we have:
Xk+N/2 = Yk + N/ 2 + e*2ffi(k + N''2)''N Z k + N / 2
But e_2iri (k + N/ 2)/N e 2tti k / n e-i1T = e_2ffik/,N, since
e_i1t = cosit isinfl = -1. Therefore, we have the two
Yk + e- 2"i k / Zk
Yk e-2"* / h Zk.
0 £ k < -
Assuming N/2 is even, we can now apply the
splitting algorithm to the sequences {Yk} and {Zk},
which have length N/2, to produce other sequences of
length N/4. In general, if N = 2p, we can perform
p repetitions of the splitting algorithm. This changes
the problem of transforming a single sequence of length
N to the problem of transforming N sequences of length
1. A sequence of length 1 is just a single point, and
the transform of a point is the point itself. This
method is known as the Fast Fourier Transform, and is
one of the versions originally proposed by Cooley and

To illustrate this algorithm more fully,
consider the case N = 8. Let the input sequence {xj}
be defined by Xj = j. The DFT of {x,} is
Xk = Z Xj e-fli jk/4 0 < k £ 7 .
j = o
In order to directly calculate the terms {Xk}, we
would have to do a matrix multiplication. For example
for k = 4, we have:
X4 = Zxje*iri>>
i = o
=[0123456 7][1 -11-11-11 -1]t = -4
However, the splitting algorithm proceeds as follows,
(i) Ordering Phase
This phase involves recursive reordering of
the 8 terms of the sequence into even and odd parts.
It requires log28 = 3 steps to arrive at 8 sequences
of length 1.
It is interesting to note that the resulting
sequence is in bit-reversed order of the original
sequence. For example, since lio = 0012,

bit-reversed representation would be 1002 = 4io.
Notice that 4 is in the position 1 held originally,
while 1 is in the position 4 held originally.
(ii) Combining Phase
This phase involves pairwise combinations of
the form (1) above in each of log28 = 3 steps. We
first combine 8 sequences of length 1 into 4 sequences
of length 2 using (1) with N = 2.
Xk = Yk + e_ffik Zk and Xku = Yk e_,|ri k Zk for k = 0.
When k = 0 we have that Xo = Yo + Zo = 0 + 4 = 4 and
Xi = I 0 II 0 N 1 0 4 = -4. This is true for the f irst
two terms in each of the 4 sequences. Then we combine
4 sequences of length 2 into 2 sequences of length 4
using (1) with N = 4.
Xk = Yk + e-*iK/z Zk and Xk+2 = Yk e-*i/2 zk
for 0 < k < 1. For k = 0, Xo = Y0 + Zo = 4 + 8 = 12
and X2 = Yo Z0 = 4 8 = -4. Finally, for the last
step, N = 8, we combine the 2 sequences of length 4
into the single sequence of length 8. Equation (1) is
Xk = Yk + e-ffik/'4 Zk and Xk + 4 = Yk e-ir,k/4 Zk
for 0 < k < 3. The calculations for all terms are
summarized in the following diagram.

N=2 N=4 N=8
This splitting algorithm has a considerable
savings in the required number of operations. At each
of log2N steps, N/2 pairs must be computed using
(1). Each pair requires one complex multiply and two
complex additions. Thus, (N/2)log2N complex
multiplies and Nlog2N complex additions are
required for the full splitting algorithm. Recalling
that the cost of performing the DFT using matrix
multiplication was roughly N2 complex additions
and multiplies, one can readily see the advantage to
this algorithm.
One of the disadvantages of this algorithm as
presented is the reordering phase. The terms of the
input sequence (xj} are reordered so that the
terms of the output sequence (Xk) are in the
correct order. There are other versions which avoid
reordering the input sequence, but then the output
sequence is not in the correct order (Gentleman-Sande
[1966]). Other versions avoid reordering of either the
input or output sequences, but cannot be performed "in

place," i.e. an extra array of storage is required
(Stockham, in Cochran, et al. [1967]). There are still
other FFT algorithms which require neither reordering
nor extra storage (Temperton [1983]). We will describe
the method used in this paper after the following
discussion which considers the applicability of the FFT
method in solving Poissons equation.
Recall from Chapter II that the discrete
Poissons equation has the matrix representation
Av = F, where A = [I T I] and T is tridiagonal.
Individual equations of Av = F can be written as
Ivi-i +Tvi + Ivi+i =Fi 1 < i < M-l. (2)
The matrix T can be diagonalized by a matrix Q, whose
columns are the eigenvectors of T: T = QDQ-1, where
D= diag(Xi,i2 . ,Xn ) and \i are the eigenvalues of T.
Equation (2) then becomes
Ivi-i + QDQ_1Vi + Ivi+i = Fi.
Define Wi = Q_1Vi and gi = Q_1Fi. We now apply the FFT
-by transforming along the columns of array F. This
corresponds to multiplying both sides of the equation
by Q 1 :
Q-1Vi-i + Q_1QDQ"^v i + Q*Vi + i Q-*Fj,
Wj-i + Dwi + win = gi 1 < i < M-l_. (3)
Since D is just a diagonal matrix with eigenvalues of
T, which may be computed, (3) reduces to solving a
tridiagonal system for each 1 < j < N-l:

Wi-i, j + XjWij + Witi.j = gij 1 S i < M-l
with Dirichlet boundary conditions woj = wmj = 0. Once
the Wi are found, the solutions Vj can be recovered by
performing the inverse transform on Wi.
Vj = Qwi 1 < i < M-1.
Reordering of either the input or output
sequences can be avoided by the following technique.
The columns of the F array are left in natural order
and the forward FFT is performed using the
Gentleman-Sande version of the FFT. This results in a
bit-reversed ordering of the sequences gi in (3).
However, the tridiagonal solves can still be performed
correctly by reordering the eigenvalues Xj to correspond
to the order of gi. This bit-reversed ordering of
the eigenvalues is done during the pre-processing phase
by means of a permutation vector. The inverse
transform is then performed on Wi using the
Cooley-Tukey version of the FFT. This method requires
that the input sequence be in scrambled order and
transforms the sequence, to produce a sequence which is
in the natural order. The overall result is that the
input sequence is used in the natural order, and the
output sequence also arrives in the natural order.
Additionally, the complete algorithm is done "in-place"
and therefore does not require any extra storage

The combination of cyclic reduction and Fourier
analysis, or matrix decomposition, produces an
algorithm which, at least in theory, is more efficient
than either method working independently. Briefly,
FACR(l) begins with 1 steps of reduction at which point
the reduced system of equations is solved using
Fourier analysis. The remaining unknowns not deter-
mined by the FFT method are then recovered by ft steps
of the backsubstitution phase of the cyclic reduction
algorithm. The values of 1 range from 1=0, which
corresponds to solving the system using just FFTs, to
1 = log2M 1, which corresponds to using just
cyclic reduction.
As an example, consider the case where M = 8
and let 1=1. The first step of cyclic reduction
produces three equation:
q2 ( i ) = qi(0 > + q3 ( o ) - 2T-iq2( >
q4d) = q3(0 > + qs ( o ) - 2T- iq4 < 0 >
qe ( 1 > = qs(0 > + q7 ( o ) - 2T-iq6< 0 >
The resulting three equations in v2, V4 and V6 that now
need to be solved via the FFT algorithm have the form:

Vo + T ( 1 > V 2 + V4 = - T c i ) (qi() + qa < 0 > - q 2 ( 1 > )
2 -
+ q2 < 1 > = f 2 (2a)
V 2 + T ( 1 > V 4 + V 6 T ( i )(q3(o> + q 5 ( 0 } - q 4 ( 1 >)
2 + q4(i) = f 4 (2b)
V 4 + T( 1 >v6 + v8 = - T C i )(q5() + q 7 < 0 > - qe ( 1 5 )
2 + qe*1) = f 6 (2c)
where vo = vs = 0 due to the Dirichlet boundary-
conditions. This system can be solved by FFTs as long
as the required right-hand sides can be formed. This
involves some pre-processing of the intermediate
vectors q2(1), q4(1)> and qe <1>, which should be done
"in place" to minimize the number of required arrays.
Notice that we can rewrite the right-hand side
of (2a) as follows:
- T(D[qi(o) + q3(o) q2d) + 2 ( T < > ) Aq2 < 1 > ] .
The term 2(T<1>)-1q2(1) can be computed by letting
z = 2(T(1>)-Aq2(15 and then solving the nested tri-
diagonal systems
T< i )z = 2q2<1>.
This requires only an extra real vector, and not an
extra array. Now the correct right-hand sides fcan
be formed and sent to the FFTs.
The FFT portion proceeds as before. Define

Wj = Q-1v j and gj = Q_1fj. Multiplying (2a,b,c)
by Q_1 gives:
wj-2 + Q-iT <1> Qw j + Wj + 2 = Q_1fj = gj j = 2,4,6. (3)
Recall that Q_1T(1>Q = D, where D is a diagonal matrix
containing the eigenvalues of TM1). Therefore (3)
reduces to solving a set of tridiagonal systems. Once
w2, w4, and we are computed, the inverse transform
v, = Qwj is used to recover the vectors V2, V4, and V6.
The FFT program then returns the vectors v2, V4,
and v6 to the cyclic reduction program, where the odd
vectors are computed in one step of backsubstitution
Vj = (T(1>)- 1(qj<0) Vj-i vj + 1) j = 1,3,5,7.
The general FACR(J) algorithm as given in
Swarztrauber [1977] begins with d steps of cyclic
reduction using (6a) and (6b) from Chapter III. The
resulting reduced linear system can then be written
v + Td)v + v = T J 28 j j + 21 2 J-28-1 J + 2H-* j
+ q = f
j j
for j = 2a, 2&+1,..., M-2&. This system is solved using
Fourier methods. Define w, = Q-1Vj and gj = Q_1fj.
Multiplying by Q_1 gives:
j 28
Q 1T ( a ) Qw + W
J + 28

But this is just a set of tridiagonal systems of order
M/2&-1 which can be solved easily. Once the Wj are
determined, the desired vectors Vj can be recovered by
performing the inverse transform Vj = Qwj, for
j = 21, 21+M-21. The remaining unknowns can then
be computed by J steps of the backsubstitution phase
for r = 0 using (7) from Chapter III.

It is important to realize that the FACR(5)
algorithm we have described adapts itself very well in
an MIMD environment. MIMD, or multiple instruction
stream-multiple data stream, is the acronym used for a
computer which can perform different computations on
different sets of data simultaneously. The particular
MIMD computer on which the FACR'(l) algorithm was
implemented is the Denelcor HEP, or Heterogeneous
Element Processor. The HEP is a shared memory
computer, which implies that since all the processors
share the same global memory, special synchronization
mechanisms must exist. These mechanisms are supported
by the programming language. The HEP has a maximum
execution rate of 10 million instructions per second
(MIPS) with average speeds of up to 3 million
floating-point operations per seconds (MFLOPS). This
is true for the one PEM (Process Execution Module)
system used, although faster rates can be achieved with
more PEMs. The HEP can support up to 50 independent
user-created processes. Both the arithmetic units and
instruction stream are pipelined. This technique

improves speed by filling the pipeline with
instructions from independent instruction streams. An
instruction progresses to the next stage of the
pipeline every clock cycle of 100 nano-seconds (ns).
Additions and multiplications take 8 clock cycles or
800 ns, while divisions require 1700 ns. Further
technical aspects of the HEP may be found in Jordan
[1983] .
HEP Parallel FORTRAN is an extension of
standard FORTRAN 77. It supports two major
modifications. First, several separate processes can
be created and allowed to run in parallel. The
standard technique here is to design a subroutine to
perform a particular task and then "create" copies of
this subroutine to run as individual processes. This
is accomplished by the command CALL CREATE. For
example, let NP be the number of processes and REDUCE
be the name of the subroutine which performs the
reduction phase of cyclic reduction. In order to start
all processes working, the following code is used.
DO 50 KK = 2, NP
The main program calls NP-1 processes to perform
REDUCE, then calls REDUCE itself so it is not idle
while the other processes are working.
The second extension of standard FORTRAN 77

allows for interaction among processes by introducing
special asynchronous variables, which have both a value
and a state. The state is set either "full" or
"empty." An asynchronous variable can only be read
when full and written to when empty. This prevents two
different processes from trying to read or write to the
same asynchronous variable at the same time. These
variables may be denoted with a $, such as $1 or $JBAR,
but this is not necessary. When an asynchronous
variable is first used it is set "full" with some
initial value by the statement
or it can be set "empty" for later use by the statement
In the first case, $NB has the initial value 1 and a
full state, while in the latter case it has no
particular value initially and an empty state .
In order to ensure that each process is doing
its share of the work, scheduling techniques are
employed. The two most common ones used are
pre-scheduling, where the work is statically divided
among all the processes at the beginnning, or
self-scheduling, where each process may obtain more
work whenever it is free to do so. Since the amount of
work to be done is already known in advance for the
FACR(jl) algorithm, pre-scheduling was used. An
example of the code for pre-scheduling follows. $1 has

already been declared in COMMON with both the main
program and the subroutine.
Main program: CALL ASET ($1,2)
Subroutine: JJ = IAINC ($1,2)
DO 30 K = JJ, N-2, NP*2
The first process to reach this block of code in the
subroutine assigns the value of $1 to the variable JJ,
i.e. JJ = 2. It then increments the value of $1 by 2
by the command IAINC which stands for Integer
Asynchronous Increment. Now the next process to reach
this code will start out with JJ having the value 4.
If NP = 4 and N = 16, the DO statement schedules the
work among all four processes as follows:
Process #1 DO 30 K = 2, 14, 8 or K = 2 & 10
Process #2 DO 30 K = 4, 14, 8 or K = 4 & 12
Process #3 DO 30 K = 6, 14, 8 or K = 6 & 14
Process #4 DO 30 K = 8, 14, 8 or K = 8 .
Recall that in the reduction phase of cyclic
reduction the first step was to eliminate the odd
columns. In the case of N = 16, the above scheduling
allows this phase to proceed in parallel with each
process performing the necessary steps on at least two
of the even columns. All even indices 2-14 are used.
Recall that columns 0 and 16 have value 0 and therefore
need not be computed.
A final sychronization technique used on the
HEP is the "barrier." In any parallel algorithm, there

will be places where all processes must stop and
perform some sort of sequential code, be it updating
information or starting off a new block of code all at
the same time. A barrier is essentially a section of
code in which each process must wait for all other
processes to arrive. After the last process arrives,
it may update some information, usually re-initializing
variables. At that point, all processes are allowed to
"drop out" of the barrier and continue on. Barriers
must be used sparingly in order to maintain the
parallelism of the algorithm. An example of a barrier
in HEP FORTRAN follows.

IB = IAINC ($NB,1) (2)
CALL ASET ($NB,1) (4)

. sequential code or update
Line 1 sets $JBAR empty so it cannot be read.
Line 2 increments $NB by one each time a process
reaches it and acts as a counter for the number of
processes to arrive. ($NB has previously be set to 1.)
Line 3 will be false for all processes except the last
one, so all processes proceed to Line 7 and wait for
$JBAR to be set full. The NPth process will proceed
into the IF-loop, re-initialize $NB to 1 on Line 4 for

the next pass through this code, perform any sequential
code or update as necessary, and then set $JBAR full
with an arbitrary value of 0 on Line 5. Now that $JBAR
has been set full, all the other processes can exit the
"wait full" command on Line 7 and proceed with the rest
of the program.

Swarztrauber [1977] gives an asymptotic
operation count for serial FACR(g) as O(N2log2log2N)
compared to 0(N2log2N) for either cyclic reduction or
Fourier analysis. Additionally, he computes the optimum
1 to be log2log2N 1. In this chapter we will present
a detailed operation count for the multiprocessor
version of FACR(g) and determine an optimum g for this
algorithm. There are two cases to consider in the
multiprocessor FACR(g).
At each step of cyclic reduction the problem
size is essentially halved. After the first step,
i.e. g = 1, there are N/2 remaining columns. After the
second step, i.e. g = 2, there are N/4 remaining
columns. In general, after the gth step of cyclic
reduction, there are N2_8 remaining columns to be solved
by Fourier analysis. It is clear that at some point
there will be more processors available than individual
tasks, i.e. remaining columns. The dividing point
between Case 1 and Case 2 is a function of three

N the size of the problem, N = 2k+1 = M
p the number of processors
ft the number of cyclic reduction steps performed.
Case 1 occurs for values of p < N2~i, while Case 2 occurs
for p > N2"H. In Case 1 all processors remain busy and
all processors perform the same amount of work.
(Technically this is not completely accurate since the
amount of work may not divide evenly among the number
of processors. However, we make the assumption that
the work does divide evenly.) In Case 2 there will be
more processors than required to complete the job, so
some processors will be idle.
As mentioned earlier, adds and multiplies
require approximately 800 ns whereas a divide requires
1700 ns. Therefore, when counting operations, adds and
multiplies are consider one floating-point operation,
or flop, while divides are considers as two flops. We
ignore the overhead for synchronization and interprocess
communication since it becomes negligible for large
problem sizes. Let Wr, Wf, and Wb denote the work
required for a single processor in the reduction phase,
FFT phase, and backsubstitution phase respectively. Let
Wi and W2 denote the total work for Case 1 and Case 2,
respectively. We first consider the operation count
for Case 1.

Total Work: Case 1
(i) Reduction Phase:
N a n
Wr = [3N + 9N] + n ------ [12N + 2*-i9N]
2p r = 2 p2 r
9N 2J N2
= --- + [15/2 6*21-1].
2p P
The operation count for the stage r = 1 is different
than for all other values of r. That count is
represented by the term before the summation. The
values 3N and 12N represent the cost of processing the
right-hand sides of (6a) and (6b), respectively, in
Chapter III. The value 9N represents the work required
for a single tridiagonal solve, roughly 3N adds and
multiplies and 3N divides, which cost twice as much.
(ii) FFT Phase:
N2-a 2 9N22-a
Wf = --- [ 4N + 2a-9N] + [5N2log2N-2-a] + -------
= ----- (13 + 9-21 + 10 logzN).
The first term represents the pre-processing cost as
described in Chapter V. The middle term is the cost of
the FFT algorithm (see Briggs, et al. [1986]) exclusive
of the required tridiagonal solves which is accounted
for in the last term.

(iii) Backsubstitution Phase:
N 8- 1 N
Wb = --- [ 2N + 9N] + H ------ [ 6N + 2r 9N]
2p r=1 2pr+1
9N 2J N2
= ----- + (4 3*21-1) .
2p P
The first term is the count when r = 0, while the
values r = 1,...,!-1 are represented by the summation
term. Again, the values 2N and 6N represent the cost
of processing the right-hand sides of (7) in Chapter
III. We therefore arrive at the total work function
for Case 1 as:
Wi = Wr + Wf + Wb = W i (N p, ft)
= [91 + 41/2 9*2i-l + 13*2-1 + 10 logaN-2-1].
Total Work: Case 2
(i) Reduction Phase:
We define s as the stage where processors first
become idle:
s = max{0, k [log2p]},
where [log2p] denotes the greatest integer function.
The work function, in terms of s, is then given by:
N s N
Wr = * [3N + 9N] + H ------ [12N + 2 r-i9N]
2p r=2 p2r
+ H [12N + 2 r- 19N j
r = s + 1

9N2s 15N2 12N 2 2 ~ 8
= ---- + 9N2& 9N 23 +--------------- + 12N& 12Ns.
2p 2p p
Again, the first term represents the count at stage
r = 1. The first summation term is the work done while
all processors are still active. The second summation
term represents that portion of the work done by only
some of the processors.
(ii) FFT Phase:
Wf for Case 2 is the same as Wf for Case 1
provided N > 2p. This restriction does not limit the
analysis since it still encompasses most problem sizes
of interest with a realistic number of processors.
(iii) Backsubstitution Phase:
N s N
Wb = [ 2N + 9N ] + S ---- [ 6N + 2r-i-9N]
2p r = 2 P2 r
+ 51 [ 6N + 2 r 1 9N ]
r = s + 1
9N2s 4N2 6N22_ 8
= ---- + 9N-21 9N > 28 + -- ------- + 6Nft 6Ns.
The significance of these terms is similar to the ones
from the Reduction Phase above. The total work
function for Case 2 is given by:
W2 = Wr + Wf + Wb = W2(N,p,!l)

9N2 s 4 IN 2 18N 2 2 s
= ---- + 9N 21+ 1 9N 2 3 + 1 + - ----- + 18Nft
P 2p p
13N22-a ION 2 log 2N 2
- 18Ns + -------- + --------------- .
Recall that N = 2k+1, so k = log2N 1. Therefore,
for s = k [log2p], we have s = log2(N/p) -1, and we
can then eliminate s from the above expression.
9N2 23N 2
W2 = ---- log2 (N/p) + 9N-21+1 +------------18N + 18Nft
P 2p
13N228 ION2log2N'2
- 18Nlog2(N/p) + + --------- .
Several useful and important results can now be
derived from the total work functions given. Foremost
is the computation of an optimum ft for the algorithm.
In particular, we would like to have an expression or
graph that would determine the best ft to use for
a particular problem size N and a given number of
processors p. This can be accomplished by two
different methods.
First, an analytic solution for an optimum
ft can be arrived at by considering a standard
minimization problem. Consider
Wi for N > 2Ip
W2 for N < 2&p.

and solve dW/dft = 0 for ft. Define fti* and ft2* to be
the optimum ft for Case 1 and Case 2 respectively,
i.e. Wi(fti*) = 0 and W2(ft2*) = 0. We then have the
fti = log2[.77 logzN .39]
ft2* = log2[-l + {(9p+9.7Nlog2N+12.5N)/9p}wz] _ .47 .
The second way to determine an optimum ft
is a graphical approach. If we compute and graph the
equal performance curves
W (N, p, fti) = W(N, p, ft2 ) for ft2 = fti + 1,
we can then use the graph with an appropriate N and p
and arrive at the optimum ft. In order to take
into account both Case 1 and Case 2 in these equal
performance curves, we must consider three parts to
each curve.
ft 2
(i) Wi(N,p,fti) = W1 ( N p, ft 2 ) for N > 2 p
(ii) Wi(N,p,fti) = W2(N,p,ft2 ) for fti 2 p ft 2 < N < 2 p
(iii) W 2 ( N p, fti ) = W 2 ( N, p, ft2 ) for N £ fti 2 p.
Using the work functions derived earlier in this chapter
we have the following general formulas for the equal
performance curves for cases (i), (ii), and (iii).
(9-2 + 5)/10
(i) N = 2

ft 2
(ii) 9N(la+l) + (18p-9N)log2(N/p) + 18p(l-ft2-2 )
ft 2
+ 2 N(10 log2N 23) = 0
N(13 + 10 log 2N)
(iii) P = ---------;--------
2J2 ft2
9-2 + 18*2
Figure 1 graphically depicts the equal per-
formance curves for values of 1 = 1 to { = 5. Notice
that the scales for this graph are log2(N/p) versus
log2N. Given a particular problem size N and a given
number of processors p, one can use this figure to find
the optimum ft of the multiprocessor FACR(ft) algorithm.
Recall that the work function for the FFT phase is only
valid for N > 2p, so the shaded portion of the graph
has no validity. Table 1 provides a summary of the
theoretical results on the optimum 1, both analytically
and graphically. For values of N = 128 and N = 256,
and values of p from 10 to 50, the table provides the
necessary information to use Figure 1. The optimum ft
read from the graph, labeled "GRAPH ft" is given as well
as the optimum ft (either fti* or ft2 *) as computed
analytically, labeled "COMP ft", by (1) above. A direct
comparison can then be made. It is important to realize
that only integer values of ft are valid for the FACR(ft)
algorithm, however, the analytic results for the optimum
ft do provide a trend.

log 2N
Equal Performance Lines

N = 128
N = 256
p log 2 (N/p) GRAPH & COMP a CASE
10 3.68 2 2.32 1
12 3.42 2 2.32 1
14 3.19 2 2.32 1
16 3 2 2.32 1
18 2.83 2 2.32 1
20 2.68 2 2.32 1
22 2.54 2 2.32 1
24 2.42 2 2.32 1
26 2.3 2 2.3 N
28 2.19 2 2.19 N
30 2.09 2 2.09 N
32 2 2 2 N
34 1.91 2 1.91 N
36 1.83 2 1.83 N
38 1.75 2 1.75 N
40 1.68 2 1.68 2
42 1.61 2 1.64 2
44 1.54 2 1.6 2
46 1.48 2 1.56 2
48 1.42 2 1.52 2
50 1.36 2 1.49 2
10 4.68 3 2.53 1
12 4.42 3 2.53 1
14 4.19 3 2.53 1
16 4 3 2.53 1
18 3.83 3 2.5 3 1
20 3.68 3 2.53 1
22 3.54 3 2.53 1
24 3.42 3 2.53 1
26 3.3 3 2.53 1
28 3.19 3 2.53 1
30 3.09 3 2.53 1
32 3 3 2.53 1
34 2.91 3 2.53 1
36 2.83 2 2.53 1
38 2.75 2 2.53 1
40 2.68 2 2.53 1
42 2.61 2 2.53 1
44 2.54 2 2.53 1
46 2.48 2 2.48 N
48 2.42 2 2.42 N
50 2.36 2 2.36 N

The final column in Table 1 indicates in which
case the parameters N, p, and ft lie. In some instances
the optimum ft, either fti or ft2 *, satisfies neither the
Case 1 criterion p < N2_& nor the Case 2 criterion
p > N2*. In these instances an "N" is entered in the
column to indicate neither case is satisfied. However,
an optimum value for ft can still be derived using
the information at the boundary between the two cases,
p = N2-&, or ft = log2(N/p). This value is placed in
the computed ft column when neither case is satisfied.
Notice that these boundary values provide for a smooth
transition between Case 1 values of ft and Case 2 values
of ft.
Figures 2 and 3 present actual timings from the
HEP for the FACR(ft) algorithm for N = 128 and N = 256,
respectively. The values of ft ranged from ft = 0 to
ft = log2N 1. With these figures we can compare the
theoretical results for the optimum J with actual
results based on the HEP. Notice that in both figures
ft = 2 was optimum with 16 processors producing the
fastest times. These figures suggest that the speed of
the algorithm, which equates to the amount of work, is
dependent on the value of ft used. This can be verified
by plotting the work function versus various values of
ft. Figures 4 and 5 represent the work functions for
N = 128 and N = 256, respectively. Three representative
values of p are shown, namely p = 8, 16, and 32. The

0 2 4 6 8 10 12 14 16 18 20
HEP Timings for N

HEP Timings for N = 256

WORK (x 1000)
Work versus ft for N

WORK (x 1000)
Work versus J for N

graphs clearly indicate that there exists some optimum
value of £ which minimizes the amount of work. These
minimum values can be read directly, and correspond to
the optimum values of ft from Table 1. Additionally, the
amount of work required to solve Poissons equation can
be determined for different values of ft.

We have considered how Poissons equation in
two dimensions can be solved, first by discretizing the
continuous problem, and then by applying a numerical
algorithm to the discrete formulation. The methods of
cyclic reduction and Fourier analysis provide two
direct methods for solving the discrete Poissons
equation. We have demonstrated the superior efficiency
of a third method, which combines the other two. This
FACR(ft) algorithm was particularly well-suited for
implementation on an MIMD computer, specifically, the
Denelcor HEP. The multiprocessor version of FACR(8)
was discussed relative to the special FORTRAN
extensions required for its implementation.
Theoretical results concerning the operation count and
the total work functions were presented. The FACR(&)
algorithm operates most efficiently when an optimum
value of ft is used. The determination of this optimum
value was explored, both analytically and graphically.
Both methods gave similar results, which were
substantiated by experimental timings on the HEP.
Figure 1 provides a useful tool for determining the

optimum 1 for the multiprocessor FACR(J). This direct
method can provide a competitive alternative to other
methods used in the solution of Poissons equation.

W. Briggs, L. Hart, R. Sweet and A. OGallagher,
[1986], Multiprocessor FFT Methods, SIAM
J. Sci. Stat. Comput., to appear.
0. Buneman, [1969], A Compact Non-iterative Poisson
solver, Rep. 294, Stanford University-
Institute for Plasma Research, Stanford, Ca.
B. Buzbee, G. Golub and C. Nielson, [1970], Oh Direct
Methods for Solving Poissons Equations,
SIAM J. Numer. Anal., 7, pp.627-656.
W. Cochran, et al., [1967], What is the Fast Fourier
Transform?, IEEE Trans. Audio and Elect.,
AU-15, pp.45-55.
J. Cooley and J. Tukey, [1965], An Algorithm for the
Machine Calculation of Complex Fourier
Series, Math. Comp., 19, pp.297-310.
W. Gentleman and G. Sande, [1966], Fast Fourier
Transforms for Fun and Profit, 1966 Fall
Joint Computer Conf., AFIPS Proc. 29,
D. Heller, [1976], Some Aspects of the Cyclic
Reduction Algorithm for Block Tridiagonal
Linear Systems, SIAM J. Numer. Anal., 13,
--------, [1978], A Survey of Parallel Algorithms in
Numerical Linear Algebra, SIAM Rev., 20,
R. Hockney, [1965], A Fast Direct Solution of
Poissons Equation using Fourier Analysis,
J. ACM, 12, pp.95-113.
--------, [1970], The Potential Calculation and Some
Applications, Meth. Comput. Phys., 9,

H. Jordan, [1983], Experience with Pipelined Multiple
Instruction Streams, Department of
Electrical and Computer Engineering, University
of Colorado, Report CSDG 83-4.
M. Kadalbajoo and K. Bharadwaj, [1984], Fast
Elliptical Solvers An Overview, Appl.
Math. Comput., 14, pp.331-355.
J. Ortega and R. Voigt, [1985], Solution of Partial
Differential Equations on Vector and Parallel
Computers, SIAM Rev., 27, pp.149-240.
P. Swarztrauber, [1977], The Methods of Cyclic
Reduction, Fourier Analysis and the FACR
Algorithm for the Discrete Solution of
Poisson's Equation on a Rectangle, SIAM
Rev., 19, pp.490-501.
R. Sweet, [1974], A Generalized Cyclic Reduction
Algorithm, SIAM J. Numer. Anal., 11,
--------, [1977], A Cyclic Reduction Algorithm for
Solving Block Tridiagonal Systems of Arbitrary
Dimension, SIAM J. Numer. Anal., 14,
C. Temperton, [1983], A Note on the Prime Factor FFT
Algorithm, J. Comp. Phys., 52, pp.198-204.
D. Young, [1971], Iterative Solution of Large Linear
Systems, Academic Press, New York.

The tridiagonal solver used in this paper is
Crout's method, which decomposes the matrix T into the
product of a lower triangular matrix, L, and an upper
triangular matrix, U. For Crouts method, the diagonal
elements of U are ls.
REAL D(0:256),RT(256),XX(0:256)
C C = -2-2*(DX*DX)/(DY*DY)
DO 30 J = 0, N-l
D(J) = 0.0
DO 70 II = 1, NSOL
D(1) = C RT(NSOL+(11-1) )
DO 40 J = 2, N-l
D(J) = D(1) 1.0/D(J-1)
XX(1) = XX(1)/D(1)
DO 50 J = 2, N-l
XX(J) = (XX(J)-XX(J-l))/D(J)
DO 60 J = N-2, 1, -1
XX(J) = XX(J) XX(J+l)/D(J)

After the research for this paper was
completed, it was brought to the authors attention
that a slightly more efficient variation of Crouts
method existed. The method differs only in that
instead of storing the diagonal elements in D(I), the
reciprocals of the diagonal elements are stored. The
rest of the code remains unchanged. The result of this
modification is that the divides performed in the
forward and back substitution phases are now changed to
multiplies. This is particularly important on the HEP,
since divides cost twice as.much as multiplies. Recall
that for the CEOUT routine the operation count is 9N.
(3N adds/multiplies + 2*3N divides = 9N) The improved
method has an operation count of only 7N. (3N adds +
2N multiplies + 2*N divides = 7N) However, the
qualitative results for the model presented in Chapter
VII remain essentially unchanged by this modification.