THE FACR ALGORITHM IN AN MIMD ENVIRONMENT
by
Thomas Turnbull
B.S., Purdue University, 1975
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirement for the degree of
Master of Science
Department of Mathematics
1986
This thesis for the Master of Science degree by
Thomas Turnbull
has been approved for the
Department of
Mathematics
by
Roland A. Sweet
Date
Turnbull, Thomas (M.S., Mathematics)
The FACR Algorithm in an MIMD Environment
Thesis directed by Assistant Professor William L. Briggs
Three direct methods for solving Poisson's
equation in two dimensions are discussed. The methods
of cyclic reduction and Fourier analysis are presented
as foundations to the introduction of the FACR
algorithm. This algorithm combines both cyclic
reduction and Fourier analysis to produce a method
which is superior to each of the other two individual
methods operating independently. Particular emphasis
is given to the implementation of the FACR algorithm on
a multiprocessor MIMD computer, Denelcor's HEP. The
paper concludes with an analysis of the algorithm in
terms of the total work function and the optimum
parameter for using FACR.
ACKNOWLEDGMENTS
The author would like to acknowledge the
assistance and guidance provided by Prof. Bill Briggs.
During the past year, his friendly advice and attention
to detail have been invaluable. The time and effort he
has devoted to this project is greatly appreciated.
Profs. Steve McCormick and Roland Sweet also made
valuable suggestions and comments to improve this
thesis. The author would like to thank them for their
assistance. Computer time on the HEP was provided by
Los Alamos National Laboratory. The research for this
project was supported in part by NSF Grant No.
DMS-8504350.
CONTENTS
CHAPTER
I. INTRODUCTION AND HISTORICAL PERSPECTIVE........ 1
II. STATEMENT OF THE PROBLEM....................... 6
III. CYCLIC REDUCTION (CORF)....................... 11
IV. FAST FOURIER TRANSFORM (FFT).................. 20
V. FOURIER ANALYSIS AND CYCLIC REDUCTION (FACR)... 28
VI. THE HEP AND MULTIPROCESSOR FACR(ℓ)............. 32
VII. OPERATION COUNT, TOTAL WORK, AND OPTIMUM ℓ..... 38
Introduction................................ 38
Total Work: Case 1.......................... 40
Total Work: Case 2.......................... 41
Analysis.................................... 43
VIII. CONCLUSION.................................... 54
BIBLIOGRAPHY........................................ 56
APPENDIX
A. TRIDIAGONAL SOLVERS........................... 58
TABLES
Table
1. Optimum ℓ................................... 47
FIGURES
Figure
1. Equal Performance Lines....................... 46
2. HEP Timings for N = 128...................... 49
3. HEP Timings for N = 256...................... 50
4. Work versus ℓ for N = 128.................... 51
5. Work versus ℓ for N = 256.................... 52
CHAPTER I
INTRODUCTION AND HISTORICAL PERSPECTIVE
Elliptic partial differential equations (PDEs)
occur in a great many problems throughout the sciences,
engineering, and applied mathematics. The solutions to
problems in areas such as fluid dynamics, aerodynamics,
and electrodynamics, to name just a few, all involve
elliptic PDEs at some point. The single most important
and often-occurring elliptic PDE is Poisson's equation.
Because of its great importance, the numerical solution
of Poisson's equation has been studied by numerical
analysts for decades. The reason so much attention and
research has been given to this single equation is not
immediately obvious. An accurate numerical method for
solving Poisson's equation has been known for quite
some time. If someone were merely interested in
solving this equation a few times, no extra effort
would be required or desired. However, for some of the
more complex time-dependent models, say an oil-field
model or wing-design problem, the numerical solution to
Poisson's equation is not needed just a few times, but
rather hundreds or thousands of times at each time
step. Therefore, a method which solves this type of
equation very quickly and very efficiently is vital, if
the overall problem is to be solved in some reasonable
period of time.
The trend recently has been to find more
efficient algorithms to solve Poisson's equation.
Coupled with this mathematical endeavor has been the
engineering effort to design and build faster and more
efficient machines on which to run these new
algorithms. With each new computer, algorithms are
created to take advantage of the technology and
architecture. The outcome of all this has been a
series of "fast Poisson solvers" designed to work very
effectively on specific kinds of architectures.
Initially, with the advent of high-speed
computers, iterative methods (for example, Young
[1971]) for solving Poisson's equation were developed
and utilized. These methods were easy to program and
the new computers were very effective at producing a
series of iterates which, at least in theory, converged
to the approximate numerical solution. However, the
rates of convergence, the required number of iteration
steps, and the overall efficiency of these algorithms
were still somewhat in question. In many cases, the
time required to solve these kinds of problems using
iterative methods was prohibitive. Other, more direct
methods were needed. Two such direct methods appeared
in the mid-1960s. The advantage of these new direct
methods rested in their computational speed and minimum
storage requirements, as well as their "non-iterative"
nature. By that we mean that these direct methods
yield a solution in only one "pass" through the
algorithm.
The first method, known as cyclic reduction,
was originally proposed by Hockney and Golub (Hockney
[1965]) with various modifications by Buneman [1969],
Buzbee, et al. [1970], Sweet [1974, 1977], and Heller
[1976]. The second method basically evolved from two
major sources. Hockney [1965] developed a method for
solving Poisson's equation using Fourier analysis or
matrix decomposition. During the same time period the
Fast Fourier Transform method (FFT) was developed by
Cooley and Tukey [1965], with variations made later by
Gentleman and Sande [1966] and Stockham (Cochran, et
al. [1967]). The combination of the FFT method with
Hockney's Fourier analysis produced an efficient
algorithm for solving Poisson's equation. Some of the
modifications for both of these methods will be
discussed in later sections.
Although both these algorithms are very quick
and efficient at solving the discrete Poisson equation,
Hockney [1965] noticed that by combining both methods,
a further decrease in computational time could be
achieved. The method which combines cyclic reduction
and FFT is known as the Fourier Analysis and Cyclic
Reduction algorithm (FACR). Initially, Hockney used
only one step of cyclic reduction to reduce the problem
size by half before applying the FFT method. Later,
Hockney [1970] developed a more general algorithm, in
which ℓ steps of cyclic reduction are performed
before the FFT method is used on the remaining
equations. The final solution is then obtained by
ℓ steps of back-substitution. This algorithm is
known as FACR(ℓ). Swarztrauber [1977] calculated
the optimum ℓ for the sequential version of
FACR(ℓ).
The FACR(ℓ) algorithm will be studied in
some detail in this paper, with particular emphasis
given to the implementation of the algorithm on an MIMD
computer. Different computer architectures require
either totally new algorithms or modifications to old
ones in order to take advantage of the new technology.
Both the cyclic reduction and FFT portions of the
FACR(ℓ) algorithm are well-suited for a
multiprocessor environment, at least up to some point.
It is the intent of this paper to show to what extent
this algorithm can take advantage of the new
multiprocessor technology.
In closing, it should be noted that with the
introduction of the multigrid methods in the
late 1970s, iterative methods are once again becoming
increasingly popular. Although computational speeds
are up from previous iterative methods such as
GaussSeidel or SOR, storage requirements may be
greater than for direct methods. However, multigrid
does seem to be a very competitive iterative method in
the numerical solutions of partial differential
equations.
Additional bibliographical and survey material
on partial differential equations and parallel
algorithms used in their solutions can be found in
Heller [1978], Kadalbajoo and Bharadwaj [1984], and
Ortega and Voigt [1985].
CHAPTER II
STATEMENT OF THE PROBLEM
Consider Poisson's equation in two dimensions:
Δu(x,y) = u_xx + u_yy = f(x,y),   (1)
where Δ is the Laplacian operator. Elliptic PDEs,
such as Poisson's equation, are well-posed only if
considered as boundary value problems instead of as
initial value problems. Therefore, a region R on which
Poisson's equation is valid must be specified. For
simplicity, let R be the rectangle 0 < x,y < 2π,
and let ∂R denote the boundary of R. Next we must
specify what kind of boundary conditions the PDE
satisfies. There are three important general types:
Dirichlet, Neumann, and periodic. Dirichlet boundary
conditions prescribe the value of the solution on a
boundary, e.g. u(0,y) = A(y) and u(2π,y) = B(y),
whereas Neumann boundary conditions prescribe the value
of the normal derivative of the solution on a boundary,
e.g. u_n(x,0) = C(x) and u_n(x,2π) = D(x). Periodic
boundary conditions are characterized as:
u(x,0) = u(x,2π) and u_n(x,0) = u_n(x,2π).
For the purposes of this paper, let us consider
Poisson's equation with Dirichlet boundary conditions
in both the x- and y-directions such that:
u(0,y) = u(2π,y) = 0 and u(x,0) = u(x,2π) = 0.
Numerically we do not actually solve Poisson's equation
as stated above, but rather we solve a discretized form
of it. We begin by superimposing a rectangular grid
over the region R. We let Δx = 2π/M and Δy = 2π/N for
some positive integers M and N. The values at the grid
points are defined as:
x_i = iΔx and y_j = jΔy
for 0 ≤ i ≤ M and 0 ≤ j ≤ N. Therefore, the grid size
is (M+1)×(N+1), but because of the Dirichlet boundary
conditions there are only (M-1)(N-1) unknown values at
the interior grid points. All grid points on ∂R are
considered to have the value zero.
We now use a finite difference scheme to
approximate the derivatives in (1) above. Define
u_ij = u(x_i, y_j). The standard second-order finite
difference approximations are used:

u_xx = [u_{i+1,j} - 2u_ij + u_{i-1,j}] / (Δx)²  +  O((Δx)²)

u_yy = [u_{i,j+1} - 2u_ij + u_{i,j-1}] / (Δy)²  +  O((Δy)²)
where O((Δx)²) and O((Δy)²) represent truncation
errors, i.e. the remaining terms in the Taylor series
expansion of the finite difference approximation for
the second derivative. In theory, as Δx and
Δy are allowed to get smaller and smaller, i.e.
as we increase M and N, the approximation will improve.
Substituting these values into Poisson's equation
Δu = f(x,y), and letting f_ij = f(x_i, y_j),
we arrive at the following expression:

u_{i+1,j} - 2u_ij + u_{i-1,j} + (Δx/Δy)²[u_{i,j+1} - 2u_ij + u_{i,j-1}] + t_ij = (Δx)² f_ij

for 1 ≤ i ≤ M-1 and 1 ≤ j ≤ N-1. Here t_ij is the total
truncation error.
At this point we drop the truncation error t_ij
and replace the exact solution u_ij with an approximate
solution v_ij. Notice that v_ij no longer satisfies
Poisson's equation, but rather satisfies the discrete
Poisson equation Δ_h v_ij = f_ij. Consider the special
case of a uniform grid where M = N, i.e. Δx = Δy = h.
The discrete problem now has the form:

v_{i-1,j} + v_{i,j-1} - 4v_ij + v_{i,j+1} + v_{i+1,j} = h²f_ij = F_ij
for 1 ≤ i ≤ M-1 and 1 ≤ j ≤ N-1. This finite
difference approximation is represented by the
following 5-point stencil:

        1
     1 -4  1
        1
The task now is to transform this finite
difference approximation into the matrix equation
Av = F. For various values of i and j, we have the
following individual equations:
i=1, j=1:    v_01 + v_10 - 4v_11 + v_12 + v_21 = F_11
i=1, j=2:    v_02 + v_11 - 4v_12 + v_13 + v_22 = F_12
i=1, j=3:    v_03 + v_12 - 4v_13 + v_14 + v_23 = F_13
    ...
i=1, j=N-1:  v_{0,N-1} + v_{1,N-2} - 4v_{1,N-1} + v_{1,N} + v_{2,N-1} = F_{1,N-1}

Note that all the terms with either subscript equal to
zero are boundary values and have the value zero.
Putting these equations in matrix form Av = F, and
numbering the unknowns as
v = (v_11, v_12, v_13, ..., v_{1,N-1}, v_21, v_22, ...)^T,
we have the following system:
 -4  1  0  .  1  0  0  .        v_11      F_11
  1 -4  1  .  0  1  0  .        v_12      F_12
  0  1 -4  .  0  0  1  .        v_13      F_13
  .  .  .  .  .  .  .  .   ×    ...   =   ...
  1  0  0  . -4  1  0  .        v_21      F_21
  0  1  0  .  1 -4  1  .        v_22      F_22
  0  0  1  .  0  1 -4  .        v_23      F_23
  .  .  .  .  .  .  .  .        ...       ...
The unknown vector v and the right-hand side
vector F can be written in partitioned form. Let
v_i = (v_i1, v_i2, v_i3, ..., v_{i,N-1})^T and F_i = (F_i1, F_i2, F_i3, ..., F_{i,N-1})^T,
1 ≤ i ≤ M-1;
i.e. the ith column of unknown grid points and
right-hand side function. Then
v = (v_1, v_2, v_3, ..., v_{M-1})^T and F = (F_1, F_2, F_3, ..., F_{M-1})^T.
In general, for Dirichlet boundary conditions
with arbitrary M and N, matrix A has the following
form:

    T  I
    I  T  I
       I  T  I
          .  .  .
             I  T
where T has dimension (N-1)×(N-1). I is the identity
matrix and also has dimension (N-1)×(N-1). Matrix T is
said to be tridiagonal, while matrix A is said to be
block-tridiagonal of block dimension (M-1)×(M-1) and is
denoted A = [I T I]. Since blocks T and I have
dimension (N-1)×(N-1), and there are (M-1) such block
rows and columns in A, then A has dimension
(M-1)(N-1)×(M-1)(N-1).
For the model problem we used the function
f(x,y) = -(p² + q²)sin(px)sin(qy)
as the right-hand side function. The exact solution to
Poisson's equation can easily be computed to be:
u(x,y) = sin(px)sin(qy).
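To make the discretization concrete, the block system Av = F can be assembled and checked against this exact solution. The following NumPy sketch is illustrative only (it is not part of the thesis, and the dense solve merely stands in for the fast methods developed in later chapters); M, p, and q below are arbitrary test choices:

```python
import numpy as np

def poisson_solve(M, p=1, q=1):
    """Solve the discrete Dirichlet problem on (0, 2*pi)^2 with M = N
    and compare against the exact u(x, y) = sin(p*x)*sin(q*y)."""
    h = 2 * np.pi / M
    n = M - 1                                   # interior points per direction
    x = h * np.arange(1, M)
    # T = tridiag(1, -4, 1) of order n; A = [I T I] of block order n
    T = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    A = np.kron(np.eye(n), T) \
        + np.kron(np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1), np.eye(n))
    X, Y = np.meshgrid(x, x, indexing="ij")     # unknowns ordered column by column
    F = (h ** 2 * (-(p ** 2 + q ** 2) * np.sin(p * X) * np.sin(q * Y))).reshape(-1)
    v = np.linalg.solve(A, F).reshape(n, n)
    u = np.sin(p * X) * np.sin(q * Y)           # exact solution
    return float(np.max(np.abs(v - u)))

err = poisson_solve(32)
```

For this right-hand side the maximum error behaves like the O(h²) truncation error of the 5-point scheme.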
We now turn our attention to the first part of
the FACR(ℓ) algorithm, the method of cyclic reduction.
CHAPTER III
CYCLIC REDUCTION (CORF)
Consider the model problem Δu(x,y) = f(x,y)
with u(x,y) = 0 on the boundary and M = N = 2^(k+1).
Although Sweet [1977] has generalized the cyclic
reduction algorithm for arbitrary size problems, this
assumption that M is a power of 2 will simplify our
discussion. In order to better understand the general
cyclic reduction algorithm, an example with N = M = 8,
i.e. k = 2, will first be presented.
For M = 8, we have h = Δx = Δy = 2π/8. We
therefore have 7 x 7 = 49 unknown values at the
interior grid points. The matrix equation Av = F has
the form:
T I 0 0 0 0 0     v_1          f_1       F_1
I T I 0 0 0 0     v_2          f_2       F_2
0 I T I 0 0 0     v_3          f_3       F_3
0 0 I T I 0 0     v_4   = h²   f_4   =   F_4
0 0 0 I T I 0     v_5          f_5       F_5
0 0 0 0 I T I     v_6          f_6       F_6
0 0 0 0 0 I T     v_7          f_7       F_7
where T is the 7×7 tridiagonal matrix [1 -4 1] and
I is the 7×7 identity matrix. Vectors v and F have
been partitioned as in Chapter II:
v_i = (v_i1,...,v_i7)^T and F_i = (F_i1,...,F_i7)^T, 1 ≤ i ≤ 7.
The Dirichlet boundary conditions impose the additional
condition that v_0 = v_8 = 0.
The resulting system of equations is:
Tv_1 + Iv_2        = F_1   (1a)
Iv_1 + Tv_2 + Iv_3 = F_2   (1b)
Iv_2 + Tv_3 + Iv_4 = F_3   (1c)
Iv_3 + Tv_4 + Iv_5 = F_4   (1d)
Iv_4 + Tv_5 + Iv_6 = F_5   (1e)
Iv_5 + Tv_6 + Iv_7 = F_6   (1f)
Iv_6 + Tv_7        = F_7   (1g)
We now proceed to reduce the number of unknowns by
eliminating the odd-numbered columns. This can be
accomplished by taking linear combinations of these
columns with adjacent columns. In particular, one can
multiply (1b) by -T and add the result to (1a) and
(1c). This produces the new set of equations
(2I - T²)v_2 + Iv_4 = F_1 - TF_2 + F_3 = F_2^(1).
We define T^(1) = 2I - T². If we perform similar
operations on (1d) and (1f), we arrive at a new,
reduced system of equations:
T^(1)v_2 + Iv_4        = F_2^(1)   (2a)
Iv_2 + T^(1)v_4 + Iv_6 = F_4^(1)   (2b)
Iv_4 + T^(1)v_6        = F_6^(1)   (2c)
We could now solve this reduced system by some
particular method, or repeat the above process to
reduce the system even further as follows. Again, we
multiply (2b) by -T^(1) and add the result to (2a) and (2c)
to produce the single set of equations:
[2I - (T^(1))²]v_4 = F_2^(1) - T^(1)F_4^(1) + F_6^(1) = F_4^(2).   (3)
Again, let T^(2) = 2I - (T^(1))². Therefore (3) can be
written as:
T^(2)v_4 = F_4^(2).   (3)
Since this is a single set of equations we can
no longer reduce this system any further. We must now
solve for the unknown vector v_4 in (3). Once v_4 is
known, the unknown vectors v_2 and v_6 can then be
recovered. For example, from (2c) we see that if
v_4 is known, then
T^(1)v_6 = F_6^(1) - Iv_4,
which is just another matrix equation which can be
solved. Now that the even terms are known, the odd
terms can also be recovered.
The statement was made that v_4 must be
determined from (3). We need to consider exactly how
this equation is to be solved most efficiently.
Although T is tridiagonal, the matrices T^(1),
T^(2), etc. are not. In fact, they fill in rapidly,
and additional storage and computational effort is
therefore required for solving their equations.
Therefore, it would appear as though we are not able to
apply a tridiagonal solver to (3). However, if we
examine the matrices T^(r) further, some
interesting properties are found which will reveal an
effective method for solving equations of the form (3).
Denote T^(0) = T. Then T^(1) = 2I - (T^(0))²,
T^(2) = 2I - (T^(1))²; in general, T^(r) = 2I - (T^(r-1))².
If we consider T^(1) as a second degree polynomial in
T^(0) = T, then we can write T^(1) as:
T^(1) = p_2(T) = 2I - T².
Also T^(2) is a second degree polynomial in T^(1), but
T^(1) = 2I - T², so
T^(2) = p_4(T) = 2I - (2I - T²)² = -2I + 4T² - T⁴,
which is a fourth degree polynomial in T. In general,
T^(r) is a 2^r-th degree polynomial in T.
We have transformed the sequence of matrices
T^(r) into a sequence of polynomials p_{2^r}(x):
p_1(x) = x
p_2(x) = 2 - x²
p_4(x) = -2 + 4x² - x⁴
p_8(x) = -2 + 16x² - 20x⁴ + 8x⁶ - x⁸
p_{2^r}(x) = 2 - [p_{2^(r-1)}(x)]²
Therefore solving the equation T^(2)v_4 = F_4^(2) is
equivalent to solving p_4(T)v_4 = F_4^(2). We now consider
the possibility of factoring the polynomial p_{2^r}(x).
Let x = 2cosθ. Then the polynomials take on
the form:
p_1(x) = x = 2cosθ
p_2(x) = 2 - x² = 2 - 4cos²θ = -2(2cos²θ - 1)
       = -2cos2θ
p_4(x) = -2 + 16cos²θ - 16cos⁴θ = -2(8cos⁴θ - 8cos²θ + 1)
       = -2cos4θ
p_{2^r}(x) = 2 - [p_{2^(r-1)}(x)]² = -2cos(2^r θ).
The roots of p_{2^r}(x) = 0 are those values of θ which
satisfy cos(2^r θ) = 0, or 2^r θ = (2j-1)π/2, so
θ_j^(r) = (2j-1)π/2^(r+1),   1 ≤ j ≤ 2^r.
Therefore, the roots of p_{2^r}(x) are
α_j^(r) = 2cos[(2j-1)π/2^(r+1)].
This implies that we may write p_{2^r}(x), and hence T^(r),
in factored form:
             2^r
p_{2^r}(x) = -∏ (x - α_j^(r)).
             j=1
In matrix form this produces the relationship:
                     2^r
p_{2^r}(T) = T^(r) = -∏ (T - α_j^(r) I).
                     j=1
Incidentally, the polynomials p_{2^r}(x) are really
forms of the Chebyshev polynomials of the first kind,
C_{2^r}. In particular, p_{2^r}(x) = -C_{2^r}(x). It is known
that the roots of C_{2^r}(x) are of the form:
α_j = 2cos[(2j-1)π/2^(r+1)],   1 ≤ j ≤ 2^r,
which agrees with our results above.
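The factored form can be checked numerically on the 7×7 matrix T of the M = 8 example. The following NumPy sketch is illustrative only (not part of the thesis); it forms T^(1) and T^(2) by the recursion and compares T^(2) with the product of the shifted factors. The leading minus sign reflects the leading coefficient -1 of p_4(x):

```python
import numpy as np

n = 7
# T = tridiag(1, -4, 1), as in the M = 8 example
T = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
I = np.eye(n)

T1 = 2 * I - T @ T          # T^(1) = 2I - T^2
T2 = 2 * I - T1 @ T1        # T^(2) = 2I - (T^(1))^2

r = 2
alphas = [2 * np.cos((2 * j - 1) * np.pi / 2 ** (r + 1))
          for j in range(1, 2 ** r + 1)]      # roots a_j^(2), j = 1..4
prod = I.copy()
for a in alphas:
    prod = prod @ (T - a * I)                 # (T - a_1 I)...(T - a_4 I)

ok = np.allclose(T2, -prod)                   # T^(2) = -(product of factors)
```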
We now have an expression for the non-
tridiagonal matrix T^(r) in terms of the tridiagonal
matrix T. This implies that instead of trying to solve
T^(2)v_4 = F_4^(2),   (3)
we can solve an easier problem, namely
               4
T^(2)v_4 = -{ ∏ (T - α_j^(2) I) }v_4 = F_4^(2).   (4)
              j=1
This requires four recursive tridiagonal solves, using
a tridiagonal solver such as CROUT. (The CROUT routine
used is given in Appendix A.) Equation (4) can be
expanded as:
-(T - α_1^(2)I)(T - α_2^(2)I)(T - α_3^(2)I)(T - α_4^(2)I)v_4 = F_4^(2).   (5)
For simplicity, write (5) as:
T_1T_2T_3T_4v_4 = -F_4^(2),
where T_j = (T - α_j^(2)I). To solve (5) recursively,
first let w_1 = T_2T_3T_4v_4, and solve
T_1w_1 = -F_4^(2)
for w_1. Let w_2 = T_3T_4v_4, and solve
T_2w_2 = w_1
for w_2. Continuing in this fashion, we now have
w_2 = T_3T_4v_4, so let w_3 = T_4v_4, and solve
T_3w_3 = w_2
for w_3. Finally, we have w_3 = T_4v_4, and we solve this
system for v_4. Since all of the matrices are tri-
diagonal, the computations are considerably reduced.
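The recursive solve itself can be sketched in the same way. This is an illustrative NumPy fragment, not the thesis's code; np.linalg.solve stands in for a tridiagonal routine such as CROUT, and the leading minus sign of the factorization is absorbed into the right-hand side:

```python
import numpy as np

n = 7
T = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
I = np.eye(n)
T1 = 2 * I - T @ T
T2 = 2 * I - T1 @ T1                     # T^(2)

rng = np.random.default_rng(0)           # arbitrary test right-hand side
F4 = rng.standard_normal(n)

alphas = [2 * np.cos((2 * j - 1) * np.pi / 8) for j in (1, 2, 3, 4)]
w = -F4                                  # T1 T2 T3 T4 v4 = -F4
for a in alphas:                         # w1 = T1^{-1}(-F4), w2 = T2^{-1} w1, ...
    w = np.linalg.solve(T - a * I, w)    # each solve is tridiagonal in practice

residual_ok = np.allclose(T2 @ w, F4)    # w is the solution of T^(2) v4 = F4
```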
Using this method then, we could completely
solve our system of equations, except for one major
difficulty! The method as just described is
numerically unstable due to the way the right-hand
sides F_j^(r) are computed. This instability is
based on the fact that the matrices T^(r) used to
compute F_j^(r) quickly become ill-conditioned as
r increases. Buzbee et al. [1970] provide an analysis
of this instability. Buneman [1969] has given a method
for computing the right-hand sides which, although
mathematically equivalent to the above algorithm,
eliminates this instability. Essentially Buneman
factors out this instability by rewriting the
right-hand side functions in terms of intermediate
values, denoted q_j^(r). The Buneman variant of
the cyclic reduction algorithm will now be given by way
of the example for the case M = 8. Recall that the
superscripts denote the particular stage of cyclic
reduction.
(1) Reduction Phase:
(a) Define: q_i^(0) = F_i, 1 ≤ i ≤ 7, and v_0 = v_8 = 0.
(b) Eliminate the odd blocks of equations, and
solve for the intermediate values:
q_2^(1) = q_1^(0) + q_3^(0) - 2T⁻¹q_2^(0)
q_4^(1) = q_3^(0) + q_5^(0) - 2T⁻¹q_4^(0)
q_6^(1) = q_5^(0) + q_7^(0) - 2T⁻¹q_6^(0)
(c) Eliminate columns 2 and 6:
q_4^(2) = q_2^(1) - q_3^(0) + q_4^(1) - q_5^(0) + q_6^(1) + (T^(1))⁻¹(q_1^(0)
        - q_2^(1) + q_3^(0) - 2q_4^(1) + q_5^(0) - q_6^(1) + q_7^(0)).
(2) Back-substitution Phase:
(a) Solve for the middle block of unknowns:
v_4 = ½(q_2^(1) + q_6^(1) - q_4^(2)) + (T^(2))⁻¹(q_4^(2) - v_0 - v_8)
(b) Solve for the other even blocks of unknowns:
v_2 = ½(q_1^(0) + q_3^(0) - q_2^(1)) + (T^(1))⁻¹(q_2^(1) - v_0 - v_4)
v_6 = ½(q_5^(0) + q_7^(0) - q_6^(1)) + (T^(1))⁻¹(q_6^(1) - v_4 - v_8)
(c) Finally solve for the odd blocks of unknowns:
v_i = T⁻¹(q_i^(0) - v_{i-1} - v_{i+1}) for i = 1,3,5,7.
Note that the terms involving (T^(r))⁻¹ are not directly
computed, i.e. we do not actually compute the inverse
of T^(r). Instead, the method previously discussed for
decomposing T^(r) into linear factors of T is used.
For example, to calculate the term
(T^(2))⁻¹(q_4^(2) - v_0 - v_8) = (T^(2))⁻¹q_4^(2),
we first let (T^(2))⁻¹q_4^(2) = w_0, so
                  4
q_4^(2) = T^(2)w_0 = -{ ∏ (T - α_j^(2)I) }w_0.
                  j=1
We then solve for w_0 by performing a series of
recursive tridiagonal solves as demonstrated earlier.
The algorithm just given is known as Cyclic
Odd-even Reduction and Factorization, or CORF, and
incorporates one of Buneman's variants. The general
CORF algorithm as given in Swarztrauber [1977] follows.
(1) Reduction Phase:
(a) q_j^(0) = F_j,  j = 1,2,...,M-1

q_j^(1) = q_{j-1}^(0) + q_{j+1}^(0) - 2T⁻¹q_j^(0),  j = 2,4,...,M-2   (6a)

(b) The remaining q_j^(r) are determined for r = 2,...,k:

q_j^(r) = q_{j-2h}^(r-1) - q_{j-h}^(r-2) + q_j^(r-1) - q_{j+h}^(r-2) + q_{j+2h}^(r-1)
        + (T^(r-1))⁻¹(q_{j-3h}^(r-2) - q_{j-2h}^(r-1) + q_{j-h}^(r-2) - 2q_j^(r-1)
        + q_{j+h}^(r-2) - q_{j+2h}^(r-1) + q_{j+3h}^(r-2))   (6b)

for j = 2^r, 2·2^r, 3·2^r, ..., M-2^r, and where h = 2^(r-2).

(2) Back-substitution Phase:
(a) Define q_{j-1/2}^(-1) + q_{j+1/2}^(-1) - q_j^(0) = v_0 = v_M = 0; then

v_j = ½(q_{j-2h}^(r-1) + q_{j+2h}^(r-1) - q_j^(r))
        + (T^(r))⁻¹(q_j^(r) - v_{j-4h} - v_{j+4h})   (7)

for r = k,...,0 and j = 2^r, 3·2^r, ..., M-2^r.
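For reference, the whole reduce-solve-back-substitute pattern for M = 8 can be sketched as follows. This is an illustrative NumPy fragment, not the thesis's HEP code; it uses the plain right-hand-side recursion of equations (1)-(3) rather than the stabilized Buneman form (harmless at this size), and dense solves stand in for the factored tridiagonal solves:

```python
import numpy as np

n = 7
T = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
I = np.eye(n)
T1 = 2 * I - T @ T                         # T^(1)
T2 = 2 * I - T1 @ T1                       # T^(2)

rng = np.random.default_rng(1)             # arbitrary test blocks F_1 .. F_7
F = [None] + [rng.standard_normal(n) for _ in range(7)]

# Reduction: eliminate the odd columns, then columns 2 and 6.
F2_1 = F[1] - T @ F[2] + F[3]
F4_1 = F[3] - T @ F[4] + F[5]
F6_1 = F[5] - T @ F[6] + F[7]
F4_2 = F2_1 - T1 @ F4_1 + F6_1

# Back-substitution: v_4 first, then v_2 and v_6, then the odd blocks.
v = [np.zeros(n) for _ in range(9)]        # v[0] = v[8] = 0
v[4] = np.linalg.solve(T2, F4_2)
v[2] = np.linalg.solve(T1, F2_1 - v[4])
v[6] = np.linalg.solve(T1, F6_1 - v[4])
for i in (1, 3, 5, 7):
    v[i] = np.linalg.solve(T, F[i] - v[i - 1] - v[i + 1])

# Residual of the original block equations (1a)-(1g).
res = max(float(np.max(np.abs(T @ v[i] + v[i - 1] + v[i + 1] - F[i])))
          for i in range(1, 8))
```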
We now direct our attention to the second part
of the FACR(ℓ) algorithm, the Fourier analysis or
matrix decomposition method.
CHAPTER IV
FAST FOURIER TRANSFORM (FFT)
Poisson's equation Δu(x,y) = f(x,y) and the
resulting matrix formulation Av = F can also be solved
by the method of Fourier analysis. We begin with a
general discussion of the FFT method and then show its
applicability to solving Poisson's equation.
Given a sequence {x_j}, 0 ≤ j ≤ N-1, we define
the discrete Fourier transform (DFT) as the new
sequence {X_k},
        N-1
X_k  =  Σ  x_j e^(-2πijk/N),   0 ≤ k ≤ N-1,
        j=0
where e^(ikx) = cos(kx) + isin(kx). There is a similar
expression for the inverse DFT:
        1  N-1
x_j  =  ―  Σ  X_k e^(2πijk/N),   0 ≤ j ≤ N-1.
        N  k=0
Both of these expressions involve complex matrix
multiplication, and hence require roughly N² complex
additions and multiplications. For large N, the cost
in evaluating and storing these sequences would be
prohibitive. Therefore, another method must be used.
Cooley and Tukey [1965] devised an algorithm which
reduces the amount of work required to compute the
sequence {Xk}. One particular version of that
algorithm is the splitting algorithm, which we now
present.
To begin with, it is important to note that the
sequence {X_k} is periodic of period N. In particular:
           N-1
X_{k+N} =  Σ  x_j e^(-2πij(k+N)/N)
           j=0
           N-1
        =  Σ  x_j e^(-2πijk/N) e^(-2πij),
           j=0
but e^(-2πij) = cos(2πj) - isin(2πj) = 1 for all integer
values of j, 0 ≤ j ≤ N-1, so
           N-1
X_{k+N} =  Σ  x_j e^(-2πijk/N) = X_k.
           j=0
Assuming N is even, split {x_j} into even and
odd parts:
y_j = x_{2j} and z_j = x_{2j+1},   0 ≤ j ≤ N/2 - 1.
Then we can write the expression for {X_k} in two parts:
       (N/2)-1                      (N/2)-1
X_k  =   Σ   y_j e^(-2πi(2j)k/N) +    Σ   z_j e^(-2πi(2j+1)k/N)
        j=0                          j=0
       (N/2)-1                             (N/2)-1
     =   Σ   y_j e^(-2πijk/(N/2)) + e^(-2πik/N) Σ  z_j e^(-2πijk/(N/2)).
        j=0                                 j=0
Note that the two summation terms are merely the DFTs
for the sequences {yj} and {zj}, which only have length
N/2 compared to length N for sequence {x_j}. Let {Y_k}
and {Z_k} be the DFTs of the sequences {y_j} and {z_j},
respectively. Then we have the relation:
X_k = Y_k + e^(-2πik/N) Z_k,   0 ≤ k ≤ N/2 - 1.
The sequences {Y_k} and {Z_k} are now periodic of period
N/2, so we have:
X_{k+N/2} = Y_{k+N/2} + e^(-2πi(k+N/2)/N) Z_{k+N/2}.
But e^(-2πi(k+N/2)/N) = e^(-2πik/N) e^(-iπ) = -e^(-2πik/N), since
e^(-iπ) = cosπ - isinπ = -1. Therefore, we have the two
relations:
X_k       = Y_k + e^(-2πik/N) Z_k
X_{k+N/2} = Y_k - e^(-2πik/N) Z_k,     0 ≤ k ≤ N/2 - 1.   (1)
Assuming N/2 is even, we can now apply the
splitting algorithm to the sequences {Yk} and {Zk},
which have length N/2, to produce other sequences of
length N/4. In general, if N = 2^p, we can perform
p repetitions of the splitting algorithm. This changes
the problem of transforming a single sequence of length
N to the problem of transforming N sequences of length
1. A sequence of length 1 is just a single point, and
the transform of a point is the point itself. This
method is known as the Fast Fourier Transform, and is
one of the versions originally proposed by Cooley and
Tukey.
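The splitting idea can be stated compactly in code. The following recursive sketch is illustrative only (the thesis's actual implementation is an in-place, non-recursive FFT); it builds a length-N transform from two length-N/2 transforms using relation (1), with a direct O(N²) DFT included for comparison:

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: X_k = sum_j x_j e^{-2*pi*i*jk/N}."""
    N = len(x)
    return [sum(x[j] * cmath.exp(-2 * cmath.pi * 1j * j * k / N) for j in range(N))
            for k in range(N)]

def fft(x):
    """Recursive splitting (Cooley-Tukey) algorithm; len(x) a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)                   # the transform of a point is the point
    Y = fft(x[0::2])                     # transform of the even-indexed part
    Z = fft(x[1::2])                     # transform of the odd-indexed part
    X = [0j] * N
    for k in range(N // 2):
        w = cmath.exp(-2 * cmath.pi * 1j * k / N)
        X[k] = Y[k] + w * Z[k]           # relation (1)
        X[k + N // 2] = Y[k] - w * Z[k]
    return X

X = fft([float(j) for j in range(8)])    # X[0] = 28, X[4] = -4
```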
To illustrate this algorithm more fully,
consider the case N = 8. Let the input sequence {x_j}
be defined by x_j = j. The DFT of {x_j} is
        7
X_k  =  Σ  x_j e^(-πijk/4),   0 ≤ k ≤ 7.
       j=0
In order to directly calculate the terms {X_k}, we
would have to do a matrix multiplication. For example,
for k = 4, we have:
        7
X_4  =  Σ  x_j e^(-πij)
       j=0
     = [0 1 2 3 4 5 6 7][1 -1 1 -1 1 -1 1 -1]^T = -4.
However, the splitting algorithm proceeds as follows.
(i) Ordering Phase
This phase involves recursive reordering of
the 8 terms of the sequence into even and odd parts.
It requires log₂8 = 3 steps to arrive at 8 sequences
of length 1:
0 1 2 3 4 5 6 7
0 2 4 6   1 3 5 7
0 4   2 6   1 5   3 7
It is interesting to note that the resulting
sequence is in bit-reversed order of the original
sequence. For example, since 1₁₀ = 001₂, its
bit-reversed representation would be 100₂ = 4₁₀.
Notice that 4 is in the position 1 held originally,
while 1 is in the position 4 held originally.
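The ordering phase amounts to permuting the input into bit-reversed order, which a short sketch (illustrative, not from the thesis) makes explicit:

```python
def bit_reverse_order(N):
    """Permutation produced by the ordering phase, for N = 2^p."""
    p = N.bit_length() - 1
    # reverse the p-bit binary representation of each index
    return [int(format(i, "0{}b".format(p))[::-1], 2) for i in range(N)]

order = bit_reverse_order(8)   # [0, 4, 2, 6, 1, 5, 3, 7]
```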
(ii) Combining Phase
This phase involves pairwise combinations of
the form (1) above in each of log₂8 = 3 steps. We
first combine 8 sequences of length 1 into 4 sequences
of length 2 using (1) with N = 2:
X_k = Y_k + e^(-πik) Z_k and X_{k+1} = Y_k - e^(-πik) Z_k for k = 0.
When k = 0 we have that X_0 = Y_0 + Z_0 = 0 + 4 = 4 and
X_1 = Y_0 - Z_0 = 0 - 4 = -4. The same combination is
made for each of the 4 sequences. Then we combine
4 sequences of length 2 into 2 sequences of length 4
using (1) with N = 4:
X_k = Y_k + e^(-πik/2) Z_k and X_{k+2} = Y_k - e^(-πik/2) Z_k
for 0 ≤ k ≤ 1. For k = 0, X_0 = Y_0 + Z_0 = 4 + 8 = 12
and X_2 = Y_0 - Z_0 = 4 - 8 = -4. Finally, for the last
step, N = 8, we combine the 2 sequences of length 4
into the single sequence of length 8. Equation (1) is
X_k = Y_k + e^(-πik/4) Z_k and X_{k+4} = Y_k - e^(-πik/4) Z_k
for 0 ≤ k ≤ 3. The calculations for all terms are
summarized in the following diagram.
[Diagram: the three combining steps, N = 2, 4, 8, applied to the
reordered input 0, 4, 2, 6, 1, 5, 3, 7.]
This splitting algorithm has a considerable
savings in the required number of operations. At each
of the log₂N steps, N/2 pairs must be computed using
(1). Each pair requires one complex multiply and two
complex additions. Thus, (N/2)log₂N complex
multiplies and Nlog₂N complex additions are
required for the full splitting algorithm. Recalling
that the cost of performing the DFT using matrix
multiplication was roughly N2 complex additions
and multiplies, one can readily see the advantage to
this algorithm.
One of the disadvantages of this algorithm as
presented is the reordering phase. The terms of the
input sequence {x_j} are reordered so that the
terms of the output sequence {X_k} are in the
correct order. There are other versions which avoid
reordering the input sequence, but then the output
sequence is not in the correct order (Gentleman-Sande
[1966]). Other versions avoid reordering of either the
input or output sequences, but cannot be performed "in
place," i.e. an extra array of storage is required
(Stockham, in Cochran, et al. [1967]). There are still
other FFT algorithms which require neither reordering
nor extra storage (Temperton [1983]). We will describe
the method used in this paper after the following
discussion which considers the applicability of the FFT
method in solving Poisson's equation.
Recall from Chapter II that the discrete
Poisson equation has the matrix representation
Av = F, where A = [I T I] and T is tridiagonal.
Individual equations of Av = F can be written as
Iv_{i-1} + Tv_i + Iv_{i+1} = F_i,   1 ≤ i ≤ M-1.   (2)
The matrix T can be diagonalized by a matrix Q, whose
columns are the eigenvectors of T: T = QDQ⁻¹, where
D = diag(λ_1, λ_2, ..., λ_{N-1}) and the λ_j are the eigenvalues of T.
Equation (2) then becomes
Iv_{i-1} + QDQ⁻¹v_i + Iv_{i+1} = F_i.
Define w_i = Q⁻¹v_i and g_i = Q⁻¹F_i. We now apply the FFT
by transforming along the columns of array F. This
corresponds to multiplying both sides of the equation
by Q⁻¹:
Q⁻¹v_{i-1} + Q⁻¹QDQ⁻¹v_i + Q⁻¹v_{i+1} = Q⁻¹F_i,
or
w_{i-1} + Dw_i + w_{i+1} = g_i,   1 ≤ i ≤ M-1.   (3)
Since D is just a diagonal matrix with the eigenvalues of
T, which may be computed, (3) reduces to solving a
tridiagonal system for each 1 ≤ j ≤ N-1:
w_{i-1,j} + λ_j w_ij + w_{i+1,j} = g_ij,   1 ≤ i ≤ M-1,
with Dirichlet boundary conditions w_{0j} = w_{Mj} = 0. Once
the w_i are found, the solutions v_i can be recovered by
performing the inverse transform on w_i:
v_i = Qw_i,   1 ≤ i ≤ M-1.
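The matrix decomposition solve can be sketched end to end. This is an illustrative NumPy fragment, not the thesis's code: multiplication by the known sine eigenvector matrix of T stands in for the FFT, and dense solves stand in for a tridiagonal routine such as CROUT:

```python
import numpy as np

M = N = 8
n = N - 1
T = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

# Known spectrum of T = tridiag(1, -4, 1) of order N-1:
# eigenvalues -4 + 2*cos(j*pi/N), eigenvectors sin(i*j*pi/N).
modes = np.arange(1, N)
lam = -4 + 2 * np.cos(modes * np.pi / N)
Q = np.sqrt(2.0 / N) * np.sin(np.outer(np.arange(1, N), modes) * np.pi / N)

rng = np.random.default_rng(2)
F = rng.standard_normal((M - 1, n))       # row i-1 holds the block F_i

G = F @ Q                                  # g_i = Q^{-1} F_i  (Q is orthogonal)
W = np.empty_like(G)
for j in range(n):                         # one tridiagonal solve per mode
    A_j = lam[j] * np.eye(M - 1) \
        + np.diag(np.ones(M - 2), 1) + np.diag(np.ones(M - 2), -1)
    W[:, j] = np.linalg.solve(A_j, G[:, j])
V = W @ Q.T                                # inverse transform v_i = Q w_i

# residual of the block equations v_{i-1} + T v_i + v_{i+1} = F_i
res = 0.0
for i in range(M - 1):
    lhs = T @ V[i] \
        + (V[i - 1] if i > 0 else 0) + (V[i + 1] if i < M - 2 else 0)
    res = max(res, float(np.max(np.abs(lhs - F[i]))))
```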
Reordering of either the input or output
sequences can be avoided by the following technique.
The columns of the F array are left in natural order
and the forward FFT is performed using the
Gentleman-Sande version of the FFT. This results in a
bit-reversed ordering of the sequences g_i in (3).
However, the tridiagonal solves can still be performed
correctly by reordering the eigenvalues λ_j to correspond
to the order of g_i. This bit-reversed ordering of
the eigenvalues is done during the preprocessing phase
by means of a permutation vector. The inverse
transform is then performed on w_i using the
Cooley-Tukey version of the FFT. This method requires
that the input sequence be in scrambled order and
transforms the sequence to produce a sequence which is
in the natural order. The overall result is that the
input sequence is used in the natural order, and the
output sequence also arrives in the natural order.
Additionally, the complete algorithm is done "in place"
and therefore does not require any extra storage
arrays.
CHAPTER V
FOURIER ANALYSIS AND CYCLIC REDUCTION (FACR)
The combination of cyclic reduction and Fourier
analysis, or matrix decomposition, produces an
algorithm which, at least in theory, is more efficient
than either method working independently. Briefly,
FACR(ℓ) begins with ℓ steps of reduction, at which point
the reduced system of equations is solved using
Fourier analysis. The remaining unknowns not deter-
mined by the FFT method are then recovered by ℓ steps
of the back-substitution phase of the cyclic reduction
algorithm. The values of ℓ range from ℓ = 0, which
corresponds to solving the system using just FFTs, to
ℓ = log₂M - 1, which corresponds to using just
cyclic reduction.
As an example, consider the case where M = 8
and let ℓ = 1. The first step of cyclic reduction
produces three equations:
q_2^(1) = q_1^(0) + q_3^(0) - 2T⁻¹q_2^(0)
q_4^(1) = q_3^(0) + q_5^(0) - 2T⁻¹q_4^(0)
q_6^(1) = q_5^(0) + q_7^(0) - 2T⁻¹q_6^(0)
The resulting three equations in v_2, v_4, and v_6 that now
need to be solved via the FFT algorithm have the form:
v_0 + T^(1)v_2 + v_4 = ½T^(1)(q_1^(0) + q_3^(0) - q_2^(1)) + q_2^(1) = f_2   (2a)

v_2 + T^(1)v_4 + v_6 = ½T^(1)(q_3^(0) + q_5^(0) - q_4^(1)) + q_4^(1) = f_4   (2b)

v_4 + T^(1)v_6 + v_8 = ½T^(1)(q_5^(0) + q_7^(0) - q_6^(1)) + q_6^(1) = f_6   (2c)

where v_0 = v_8 = 0 due to the Dirichlet boundary
conditions. This system can be solved by FFTs as long
as the required right-hand sides can be formed. This
involves some preprocessing of the intermediate
vectors q_2^(1), q_4^(1), and q_6^(1), which should be done
"in place" to minimize the number of required arrays.
Notice that we can rewrite the right-hand side
of (2a) as follows:

-(1/2) T^(1) [q_1^(0) + q_3^(0) - q_2^(1) - 2(T^(1))^(-1) q_2^(1)].

The term 2(T^(1))^(-1) q_2^(1) can be computed by letting
z = 2(T^(1))^(-1) q_2^(1) and then solving the nested tridiagonal
systems

T^(1) z = 2 q_2^(1).

This requires only an extra real vector, and not an
extra array. Now the correct right-hand sides f_j can
be formed and sent to the FFTs.
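Each of the tridiagonal solves involved here has the constant-diagonal form tridiag(1, b, 1) z = rhs. A minimal LU (Thomas-algorithm) sketch of such a solve is shown below; Python and the names are ours for illustration, not the thesis's CROUT routine (that FORTRAN routine appears in Appendix A), and b = -4 in the test is just a diagonally dominant stand-in:

```python
def solve_tridiag(b, rhs):
    """Solve tridiag(1, b, 1) z = rhs by LU factorization (Thomas algorithm).

    'b' is the constant diagonal entry; the off-diagonals are 1, as in the
    model Poisson operator. Returns a new list; rhs is not modified.
    """
    n = len(rhs)
    d = [0.0] * n                      # diagonal of the L factor
    z = [0.0] * n
    d[0] = b
    z[0] = rhs[0] / d[0]
    for j in range(1, n):              # factor and forward-substitute
        d[j] = b - 1.0 / d[j - 1]
        z[j] = (rhs[j] - z[j - 1]) / d[j]
    for j in range(n - 2, -1, -1):     # back-substitute (U has unit diagonal)
        z[j] = z[j] - z[j + 1] / d[j]
    return z
```

The U factor has unit diagonal and superdiagonal entries 1/d_j, which is exactly the Crout convention used by the thesis's solver.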
The FFT portion proceeds as before. Define
w_j = Q^(-1) v_j and g_j = Q^(-1) f_j. Multiplying (2a,b,c)
by Q^(-1) gives:

w_(j-2) + Q^(-1) T^(1) Q w_j + w_(j+2) = Q^(-1) f_j = g_j,     j = 2, 4, 6.     (3)

Recall that Q^(-1) T^(1) Q = D, where D is a diagonal matrix
containing the eigenvalues of T^(1). Therefore (3)
reduces to solving a set of tridiagonal systems. Once
w_2, w_4, and w_6 are computed, the inverse transform
v_j = Q w_j is used to recover the vectors v_2, v_4, and v_6.
The FFT program then returns the vectors v_2, v_4,
and v_6 to the cyclic reduction program, where the odd
vectors are computed in one step of back-substitution:

v_j = (T^(0))^(-1) (q_j^(0) - v_(j-1) - v_(j+1)),     j = 1, 3, 5, 7.
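The diagonalization Q^(-1) T^(1) Q = D works because T^(1) is a polynomial in the tridiagonal matrix T = tridiag(1, b, 1), whose eigenvectors are the sine vectors the FFT implicitly applies: for an order M-1 system, the kth eigenvector has components sin(jkπ/M) with eigenvalue b + 2cos(kπ/M). A small numerical check of this classical fact (Python and the names are ours, purely for illustration):

```python
import math

def apply_T(b, v):
    """Apply T = tridiag(1, b, 1) to the vector v (zero Dirichlet ends)."""
    n = len(v)
    return [(v[j - 1] if j > 0 else 0.0) + b * v[j]
            + (v[j + 1] if j < n - 1 else 0.0) for j in range(n)]

def check_eigenpair(b, M, k):
    """Return max |T q_k - lambda_k q_k| for the k-th sine eigenvector
    of the order M-1 matrix tridiag(1, b, 1)."""
    q = [math.sin(j * k * math.pi / M) for j in range(1, M)]  # eigenvector
    lam = b + 2.0 * math.cos(k * math.pi / M)                 # eigenvalue
    Tq = apply_T(b, q)
    return max(abs(Tq[i] - lam * q[i]) for i in range(M - 1))
```

The boundary rows need no special treatment because sin(0) = sin(kπ) = 0, which is exactly the Dirichlet condition v_0 = v_M = 0.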
The general FACR(ℓ) algorithm as given in
Swarztrauber [1977] begins with ℓ steps of cyclic
reduction using (6a) and (6b) from Chapter III. The
resulting reduced linear system can then be written

v_(j-2^ℓ) + T^(ℓ) v_j + v_(j+2^ℓ)
    = -(1/2) T^(ℓ) (q_(j-2^(ℓ-1))^(ℓ-1) + q_(j+2^(ℓ-1))^(ℓ-1) - q_j^(ℓ)) + q_j^(ℓ) = f_j

for j = 2^ℓ, 2^(ℓ+1), ..., M - 2^ℓ. This system is solved using
Fourier methods. Define w_j = Q^(-1) v_j and g_j = Q^(-1) f_j.
Multiplying by Q^(-1) gives:

w_(j-2^ℓ) + Q^(-1) T^(ℓ) Q w_j + w_(j+2^ℓ) = Q^(-1) f_j = g_j.

But this is just a set of tridiagonal systems of order
M/2^ℓ - 1 which can be solved easily. Once the w_j are
determined, the desired vectors v_j can be recovered by
performing the inverse transform v_j = Q w_j, for
j = 2^ℓ, 2^(ℓ+1), ..., M - 2^ℓ. The remaining unknowns can then
be computed by ℓ steps of the back-substitution phase,
using (7) from Chapter III, finishing with r = 0.
CHAPTER VI
THE HEP AND MULTIPROCESSOR FACR(ℓ)
It is important to realize that the FACR(ℓ)
algorithm we have described adapts itself very well to
an MIMD environment. MIMD, or multiple instruction
stream, multiple data stream, is the acronym used for a
computer which can perform different computations on
different sets of data simultaneously. The particular
MIMD computer on which the FACR(ℓ) algorithm was
implemented is the Denelcor HEP, or Heterogeneous
Element Processor. The HEP is a shared memory
computer, which implies that since all the processors
share the same global memory, special synchronization
mechanisms must exist. These mechanisms are supported
by the programming language. The HEP has a maximum
execution rate of 10 million instructions per second
(MIPS) with average speeds of up to 3 million
floating-point operations per second (MFLOPS). This
is true for the one PEM (Process Execution Module)
system used, although faster rates can be achieved with
more PEMs. The HEP can support up to 50 independent
user-created processes. Both the arithmetic units and
the instruction stream are pipelined. This technique
improves speed by filling the pipeline with
instructions from independent instruction streams. An
instruction progresses to the next stage of the
pipeline every clock cycle of 100 nanoseconds (ns).
Additions and multiplications take 8 clock cycles or
800 ns, while divisions require 1700 ns. Further
technical aspects of the HEP may be found in Jordan
[1983].
HEP Parallel FORTRAN is an extension of
standard FORTRAN 77. It supports two major
modifications. First, several separate processes can
be created and allowed to run in parallel. The
standard technique here is to design a subroutine to
perform a particular task and then "create" copies of
this subroutine to run as individual processes. This
is accomplished by the command CALL CREATE. For
example, let NP be the number of processes and REDUCE
be the name of the subroutine which performs the
reduction phase of cyclic reduction. In order to start
all processes working, the following code is used.
DO 50 KK = 2, NP
CALL CREATE (REDUCE)
50 CONTINUE
CALL REDUCE
The main program creates NP-1 processes to perform
REDUCE, then calls REDUCE itself so it is not idle
while the other processes are working.
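The same create-then-join-in pattern, in which the main program starts NP-1 copies of a worker and then runs the worker itself, can be sketched with a modern threading library. Python and the names are ours for illustration; the HEP mechanism is the CALL CREATE extension above:

```python
import threading

def start_workers(worker, np_count):
    """Start np_count - 1 extra copies of 'worker', then run it in the
    calling thread as well, mirroring the CALL CREATE / CALL REDUCE idiom."""
    threads = [threading.Thread(target=worker) for _ in range(np_count - 1)]
    for t in threads:
        t.start()
    worker()                 # the main thread does its share instead of idling
    for t in threads:
        t.join()
```

Having the main thread participate keeps all NP streams busy, which is the point the thesis makes about the main program calling REDUCE itself.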
The second extension of standard FORTRAN 77
allows for interaction among processes by introducing
special asynchronous variables, which have both a value
and a state. The state is set either "full" or
"empty." An asynchronous variable can only be read
when full and written to when empty. This prevents two
different processes from trying to read or write to the
same asynchronous variable at the same time. These
variables may be denoted with a $, such as $I or $JBAR,
but this is not necessary. When an asynchronous
variable is first used it is set "full" with some
initial value by the statement
CALL ASET ($NB,1),
or it can be set "empty" for later use by the statement
CALL SETE ($NB).
In the first case, $NB has the initial value 1 and a
full state, while in the latter case it has no
particular value initially and an empty state.
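The full/empty discipline can be sketched with a condition variable. The class below is our illustrative Python analogue of a HEP asynchronous variable, not Denelcor's implementation, and it follows the rule just stated: read only when full, write only when empty:

```python
import threading

class AsyncVar:
    """Sketch of a HEP-style asynchronous variable: a value plus a
    full/empty state. Reads block until full and leave the cell empty;
    writes block until empty and leave it full."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def aset(self, value):              # like CALL ASET: write, mark full
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def read(self):                     # read when full, mark empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value
```

Because a read empties the cell and a write fills it, two processes can never read or write the same cell simultaneously, which is the protection described above.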
In order to ensure that each process is doing
its share of the work, scheduling techniques are
employed. The two most common ones used are
prescheduling, where the work is statically divided
among all the processes at the beginning, or
selfscheduling, where each process may obtain more
work whenever it is free to do so. Since the amount of
work to be done is already known in advance for the
FACR(ℓ) algorithm, prescheduling was used. An
example of the code for prescheduling follows. $I has
already been declared in COMMON with both the main
program and the subroutine.
Main program:  CALL ASET ($I,2)
Subroutine:    JJ = IAINC ($I,2)
               DO 30 K = JJ, N-2, NP*2
The first process to reach this block of code in the
subroutine assigns the value of $I to the variable JJ,
i.e., JJ = 2. It then increments the value of $I by 2
via the command IAINC, which stands for Integer
Asynchronous Increment. The next process to reach
this code will start out with JJ having the value 4.
If NP = 4 and N = 16, the DO statement schedules the
work among all four processes as follows:
Process #1 DO 30 K = 2, 14, 8 or K = 2 & 10
Process #2 DO 30 K = 4, 14, 8 or K = 4 & 12
Process #3 DO 30 K = 6, 14, 8 or K = 6 & 14
Process #4 DO 30 K = 8, 14, 8 or K = 8.
Recall that in the reduction phase of cyclic
reduction the first step was to eliminate the odd
columns. In the case of N = 16, the above scheduling
allows this phase to proceed in parallel with each
process performing the necessary steps on at least two
of the even columns. All even indices 2-14 are used.
Recall that columns 0 and 16 have value 0 and therefore
need not be computed.
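The index arithmetic of the prescheduled DO loop is easy to reproduce. The sketch below (Python and the function name are ours, for illustration) computes the set of columns handled by one process and reproduces the schedule listed above:

```python
def prescheduled_indices(process_id, np_count, n):
    """Columns handled by one process under DO 30 K = JJ, N-2, NP*2
    prescheduling: process i (1-based) starts at JJ = 2*i and strides
    by 2*NP over the even columns 2 .. N-2."""
    jj = 2 * process_id
    return list(range(jj, n - 1, 2 * np_count))   # upper bound N-2 inclusive
```

Together the NP processes cover every even column exactly once, which is what makes the static division safe here: the amount of work is known in advance.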
A final synchronization technique used on the
HEP is the "barrier." In any parallel algorithm, there
will be places where all processes must stop and
perform some sort of sequential code, be it updating
information or starting off a new block of code all at
the same time. A barrier is essentially a section of
code in which each process must wait for all other
processes to arrive. After the last process arrives,
it may update some information, usually reinitializing
variables. At that point, all processes are allowed to
"drop out" of the barrier and continue on. Barriers
must be used sparingly in order to maintain the
parallelism of the algorithm. An example of a barrier
in HEP FORTRAN follows.
CALL ASET($NB,1)
CALL SETE ($JBAR) (1)
IB = IAINC ($NB,1) (2)
IF (IB.EQ.NP) THEN (3)
CALL ASET ($NB,1) (4)
. sequential code or update
CALL ASET ($JBAR,0) (5)
ENDIF (6)
IDUMB = IWAITF ($JBAR)               (7)
Line 1 sets $JBAR empty so it cannot be read.
Line 2 increments $NB by one each time a process
reaches it and acts as a counter for the number of
processes to arrive. ($NB has previously been set to 1.)
Line 3 will be false for all processes except the last
one, so all processes proceed to Line 7 and wait for
$JBAR to be set full. The NPth process will proceed
into the IF block, reinitialize $NB to 1 on Line 4 for
the next pass through this code, perform any sequential
code or update as necessary, and then set $JBAR full
with an arbitrary value of 0 on Line 5. Now that $JBAR
has been set full, all the other processes can exit the
"wait full" command on Line 7 and proceed with the rest
of the program.
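The same counter-and-release logic can be sketched in a modern threading library. The class below is our illustrative Python analogue (names ours): the counter plays the role of $NB, and releasing all waiters plays the role of setting $JBAR full:

```python
import threading

class CountingBarrier:
    """Sketch of the HEP counter barrier: each arriving thread increments
    a counter; the last arrival runs an optional sequential section,
    resets the counter, and releases everyone (the 'wait full' on $JBAR)."""
    def __init__(self, parties):
        self._parties = parties
        self._count = 0
        self._generation = 0
        self._cond = threading.Condition()

    def wait(self, sequential=None):
        with self._cond:
            gen = self._generation
            self._count += 1
            if self._count == self._parties:    # last process to arrive
                if sequential is not None:
                    sequential()                # update/reinitialize variables
                self._count = 0                 # like CALL ASET ($NB,1)
                self._generation += 1
                self._cond.notify_all()         # like setting $JBAR full
            else:
                while gen == self._generation:  # like IWAITF ($JBAR)
                    self._cond.wait()
```

The generation counter makes the barrier reusable on the next pass through the code, just as reinitializing $NB does in the FORTRAN version.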
CHAPTER VII
OPERATION COUNT, TOTAL WORK AND OPTIMUM ℓ
Introduction
Swarztrauber [1977] gives an asymptotic
operation count for serial FACR(ℓ) as O(N^2 log2 log2 N),
compared to O(N^2 log2 N) for either cyclic reduction or
Fourier analysis. Additionally, he computes the optimum
ℓ to be log2 log2 N - 1. In this chapter we will present
a detailed operation count for the multiprocessor
version of FACR(ℓ) and determine an optimum ℓ for this
algorithm. There are two cases to consider in the
multiprocessor FACR(ℓ).
At each step of cyclic reduction the problem
size is essentially halved. After the first step,
i.e., ℓ = 1, there are N/2 remaining columns. After the
second step, i.e., ℓ = 2, there are N/4 remaining
columns. In general, after the ℓth step of cyclic
reduction, there are N·2^(-ℓ) remaining columns to be solved
by Fourier analysis. It is clear that at some point
there will be more processors available than individual
tasks, i.e., remaining columns. The dividing point
between Case 1 and Case 2 is a function of three
parameters:
N   the size of the problem, N = 2^(k+1) = M
p   the number of processors
ℓ   the number of cyclic reduction steps performed.
Case 1 occurs for values of p < N·2^(-ℓ), while Case 2 occurs
for p > N·2^(-ℓ). In Case 1 all processors remain busy and
all processors perform the same amount of work.
(Technically this is not completely accurate since the
amount of work may not divide evenly among the number
of processors. However, we make the assumption that
the work does divide evenly.) In Case 2 there will be
more processors than required to complete the job, so
some processors will be idle.
As mentioned earlier, adds and multiplies
require approximately 800 ns whereas a divide requires
1700 ns. Therefore, when counting operations, adds and
multiplies are counted as one floating-point operation,
or flop, while divides are counted as two flops. We
ignore the overhead for synchronization and interprocess
communication since it becomes negligible for large
problem sizes. Let Wr, Wf, and Wb denote the work
required for a single processor in the reduction phase,
FFT phase, and back-substitution phase, respectively. Let
W1 and W2 denote the total work for Case 1 and Case 2,
respectively. We first consider the operation count
for Case 1.
Total Work: Case 1
(i) Reduction Phase:

Wr = (N/2p)[3N + 9N] + Σ_{r=2}^{ℓ} (N/(p·2^r))[12N + 2^(r-1)·9N]

   = 9ℓN^2/2p + (N^2/p)[15/2 - 6·2^(1-ℓ)].
The operation count for the stage r = 1 is different
from that for all other values of r. That count is
represented by the term before the summation. The
values 3N and 12N represent the cost of processing the
right-hand sides of (6a) and (6b), respectively, in
Chapter III. The value 9N represents the work required
for a single tridiagonal solve: roughly 3N adds and
multiplies and 3N divides, which cost twice as much.
(ii) FFT Phase:

Wf = (N·2^(-ℓ)/p)[4N + 2^ℓ·9N] + 2[5N^2·2^(-ℓ)·log2 N]/p + 9N^2·2^(-ℓ)/p

   = (N^2·2^(-ℓ)/p)(13 + 9·2^ℓ + 10·log2 N).

The first term represents the preprocessing cost as
described in Chapter V. The middle term is the cost of
the FFT algorithm (see Briggs, et al. [1986]), exclusive
of the required tridiagonal solves, which are accounted
for in the last term.
(iii) Back-substitution Phase:

Wb = (N/2p)[2N + 9N] + Σ_{r=1}^{ℓ-1} (N/(p·2^(r+1)))[6N + 2^r·9N]

   = 9ℓN^2/2p + (N^2/p)(4 - 3·2^(1-ℓ)).
The first term is the count when r = 0, while the
values r = 1, ..., ℓ-1 are represented by the summation
term. Again, the values 2N and 6N represent the cost
of processing the right-hand sides of (7) in Chapter
III. We therefore arrive at the total work function
for Case 1 as:

W1 = Wr + Wf + Wb = W1(N, p, ℓ)

   = (N^2/p)[9ℓ + 41/2 - 9·2^(1-ℓ) + 13·2^(-ℓ) + 10·2^(-ℓ)·log2 N].
Total Work: Case 2
(i) Reduction Phase:
We define s as the stage where processors first
become idle:
s = max{0, k - [log2 p]},

where [log2 p] denotes the greatest integer function.
The work function, in terms of s, is then given by:

Wr = (N/2p)[3N + 9N] + Σ_{r=2}^{s} (N/(p·2^r))[12N + 2^(r-1)·9N]
     + Σ_{r=s+1}^{ℓ} [12N + 2^(r-1)·9N]

   = 9N^2·s/2p + 15N^2/2p - 12N^2·2^(-s)/p + 9N·2^ℓ - 9N·2^s + 12Nℓ - 12Ns.
Again, the first term represents the count at stage
r = 1. The first summation term is the work done while
all processors are still active. The second summation
term represents that portion of the work done by only
some of the processors.
(ii) FFT Phase:
Wf for Case 2 is the same as Wf for Case 1
provided N > 2p. This restriction does not limit the
analysis since it still encompasses most problem sizes
of interest with a realistic number of processors.
(iii) Back-substitution Phase:

Wb = (N/2p)[2N + 9N] + Σ_{r=2}^{s} (N/(p·2^r))[6N + 2^(r-1)·9N]
     + Σ_{r=s+1}^{ℓ} [6N + 2^(r-1)·9N]

   = 9N^2·s/2p + 4N^2/p - 6N^2·2^(-s)/p + 9N·2^ℓ - 9N·2^s + 6Nℓ - 6Ns.
The significance of these terms is similar to the ones
from the Reduction Phase above. The total work
function for Case 2 is given by:

W2 = Wr + Wf + Wb = W2(N, p, ℓ)

   = 9N^2·s/p + 41N^2/2p - 18N^2·2^(-s)/p + 9N·2^(ℓ+1) - 9N·2^(s+1)
     + 18Nℓ - 18Ns + 13N^2·2^(-ℓ)/p + 10N^2·2^(-ℓ)·log2 N/p.
Recall that N = 2^(k+1), so k = log2 N - 1. Therefore,
for s = k - [log2 p], we have s = log2(N/p) - 1, and we
can then eliminate s from the above expression:

W2 = (9N^2/p)·log2(N/p) + 5N^2/2p + 9N·2^(ℓ+1) + 18Nℓ
     - 18N·log2(N/p) - 18N + 13N^2·2^(-ℓ)/p + 10N^2·2^(-ℓ)·log2 N/p.
Analysis
Several useful and important results can now be
derived from the total work functions given. Foremost
is the computation of an optimum ℓ for the algorithm.
In particular, we would like to have an expression or
graph that would determine the best ℓ to use for
a particular problem size N and a given number of
processors p. This can be accomplished by two
different methods.
First, an analytic solution for an optimum
ℓ can be arrived at by considering a standard
minimization problem. Consider

W(N, p, ℓ) = W1 for N > 2^ℓ·p,
             W2 for N < 2^ℓ·p,

and solve dW/dℓ = 0 for ℓ. Define ℓ1* and ℓ2* to be
the optimum ℓ for Case 1 and Case 2, respectively,
i.e., dW1/dℓ(ℓ1*) = 0 and dW2/dℓ(ℓ2*) = 0. We then have the
following:
ℓ1* = log2[0.77·log2 N - 0.39]                                         (1)

ℓ2* = log2[-1 + {(9p + 9.7N·log2 N + 12.5N)/9p}^(1/2)] - 0.47.
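The rounded constants in ℓ1* come from differentiating W1: setting dW1/dℓ = 0 gives 2^ℓ = (ln 2/9)(10 log2 N - 5), i.e., ℓ1* = log2[(10 ln 2/9) log2 N - 5 ln 2/9] ≈ log2[0.77 log2 N - 0.39]. A quick check against the COMP ℓ column of Table 1 (Python, ours, for illustration):

```python
import math

def l1_star(n):
    """Analytic Case 1 optimum from dW1/dl = 0:
    l1* = log2[(10 ln2 / 9) * log2(N) - 5 ln2 / 9],
    whose rounded coefficients are 0.77 and 0.39."""
    c1 = 10.0 * math.log(2.0) / 9.0    # ~0.77
    c2 = 5.0 * math.log(2.0) / 9.0     # ~0.39
    return math.log2(c1 * math.log2(n) - c2)
```

Evaluating at N = 128 and N = 256 reproduces the tabulated values 2.32 and 2.53.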
The second way to determine an optimum ℓ
is a graphical approach. If we compute and graph the
equal performance curves

W(N, p, ℓ1) = W(N, p, ℓ2)  for  ℓ2 = ℓ1 + 1,

we can then use the graph with an appropriate N and p
and arrive at the optimum ℓ. In order to take
into account both Case 1 and Case 2 in these equal
performance curves, we must consider three parts to
each curve.
(i)   W1(N, p, ℓ1) = W1(N, p, ℓ2)   for N > 2^(ℓ2)·p
(ii)  W1(N, p, ℓ1) = W2(N, p, ℓ2)   for 2^(ℓ1)·p < N < 2^(ℓ2)·p
(iii) W2(N, p, ℓ1) = W2(N, p, ℓ2)   for N ≤ 2^(ℓ1)·p.
Using the work functions derived earlier in this chapter,
we have the following general formulas for the equal
performance curves for cases (i), (ii), and (iii):

(i)   N = 2^((9·2^(ℓ2) + 5)/10)

(ii)  9N(ℓ2 + 1) + (18p - 9N)·log2(N/p) + 18p(1 - ℓ2 - 2^(ℓ2))
      + 2^(-ℓ2)·N(10·log2 N - 23) = 0

(iii) p = N(13 + 10·log2 N) / (9·2^(2ℓ2) + 18·2^(ℓ2)).
Figure 1 graphically depicts the equal performance
curves for values of ℓ2 = 1 to ℓ2 = 5. Notice
that the scales for this graph are log2(N/p) versus
log2 N. Given a particular problem size N and a given
number of processors p, one can use this figure to find
the optimum ℓ of the multiprocessor FACR(ℓ) algorithm.
Recall that the work function for the FFT phase is only
valid for N > 2p, so the shaded portion of the graph
has no validity. Table 1 provides a summary of the
theoretical results on the optimum ℓ, both analytic
and graphical. For values of N = 128 and N = 256,
and values of p from 10 to 50, the table provides the
necessary information to use Figure 1. The optimum ℓ
read from the graph, labeled "GRAPH ℓ," is given, as well
as the optimum ℓ (either ℓ1* or ℓ2*) as computed
analytically by (1) above, labeled "COMP ℓ." A direct
comparison can then be made. It is important to realize
that only integer values of ℓ are valid for the FACR(ℓ)
algorithm; however, the analytic results for the optimum
ℓ do provide a trend.
FIGURE 1
Equal Performance Lines (axes: log2(N/p) versus log2 N)
N = 128

 p    log2(N/p)   GRAPH ℓ   COMP ℓ   CASE
 10     3.68         2       2.32      1
 12     3.42         2       2.32      1
 14     3.19         2       2.32      1
 16     3            2       2.32      1
 18     2.83         2       2.32      1
 20     2.68         2       2.32      1
 22     2.54         2       2.32      1
 24     2.42         2       2.32      1
 26     2.3          2       2.3       N
 28     2.19         2       2.19      N
 30     2.09         2       2.09      N
 32     2            2       2         N
 34     1.91         2       1.91      N
 36     1.83         2       1.83      N
 38     1.75         2       1.75      N
 40     1.68         2       1.68      2
 42     1.61         2       1.64      2
 44     1.54         2       1.6       2
 46     1.48         2       1.56      2
 48     1.42         2       1.52      2
 50     1.36         2       1.49      2

N = 256

 p    log2(N/p)   GRAPH ℓ   COMP ℓ   CASE
 10     4.68         3       2.53      1
 12     4.42         3       2.53      1
 14     4.19         3       2.53      1
 16     4            3       2.53      1
 18     3.83         3       2.53      1
 20     3.68         3       2.53      1
 22     3.54         3       2.53      1
 24     3.42         3       2.53      1
 26     3.3          3       2.53      1
 28     3.19         3       2.53      1
 30     3.09         3       2.53      1
 32     3            3       2.53      1
 34     2.91         3       2.53      1
 36     2.83         2       2.53      1
 38     2.75         2       2.53      1
 40     2.68         2       2.53      1
 42     2.61         2       2.53      1
 44     2.54         2       2.53      1
 46     2.48         2       2.48      N
 48     2.42         2       2.42      N
 50     2.36         2       2.36      N

TABLE 1
OPTIMUM ℓ
The final column in Table 1 indicates in which
case the parameters N, p, and ℓ lie. In some instances
the optimum ℓ, either ℓ1* or ℓ2*, satisfies neither the
Case 1 criterion p < N·2^(-ℓ) nor the Case 2 criterion
p > N·2^(-ℓ). In these instances an "N" is entered in the
column to indicate neither case is satisfied. However,
an optimum value for ℓ can still be derived using
the information at the boundary between the two cases,
p = N·2^(-ℓ), or ℓ = log2(N/p). This value is placed in
the computed ℓ column when neither case is satisfied.
Notice that these boundary values provide for a smooth
transition between Case 1 values of ℓ and Case 2 values
of ℓ.
Figures 2 and 3 present actual timings from the
HEP for the FACR(ℓ) algorithm for N = 128 and N = 256,
respectively. The values of ℓ ranged from ℓ = 0 to
ℓ = log2 N - 1. With these figures we can compare the
theoretical results for the optimum ℓ with actual
results based on the HEP. Notice that in both figures
ℓ = 2 was optimum, with 16 processors producing the
fastest times. These figures suggest that the speed of
the algorithm, which equates to the amount of work, is
dependent on the value of ℓ used. This can be verified
by plotting the work function versus various values of
ℓ. Figures 4 and 5 represent the work functions for
N = 128 and N = 256, respectively. Three representative
values of p are shown, namely p = 8, 16, and 32. The
FIGURE 2
HEP Timings for N = 128

FIGURE 3
HEP Timings for N = 256

FIGURE 4
Work versus ℓ for N = 128

FIGURE 5
Work versus ℓ for N = 256
graphs clearly indicate that there exists some optimum
value of ℓ which minimizes the amount of work. These
minimum values can be read directly, and correspond to
the optimum values of ℓ from Table 1. Additionally, the
amount of work required to solve Poisson's equation can
be determined for different values of ℓ.
CHAPTER VIII
CONCLUSION
We have considered how Poisson's equation in
two dimensions can be solved, first by discretizing the
continuous problem, and then by applying a numerical
algorithm to the discrete formulation. The methods of
cyclic reduction and Fourier analysis provide two
direct methods for solving the discrete Poisson's
equation. We have demonstrated the superior efficiency
of a third method, which combines the other two. This
FACR(ℓ) algorithm was particularly well-suited for
implementation on an MIMD computer, specifically, the
Denelcor HEP. The multiprocessor version of FACR(ℓ)
was discussed relative to the special FORTRAN
extensions required for its implementation.
Theoretical results concerning the operation count and
the total work functions were presented. The FACR(ℓ)
algorithm operates most efficiently when an optimum
value of ℓ is used. The determination of this optimum
value was explored, both analytically and graphically.
Both methods gave similar results, which were
substantiated by experimental timings on the HEP.
Figure 1 provides a useful tool for determining the
optimum ℓ for the multiprocessor FACR(ℓ). This direct
method can provide a competitive alternative to other
methods used in the solution of Poisson's equation.
BIBLIOGRAPHY
W. Briggs, L. Hart, R. Sweet and A. O'Gallagher, [1986], Multiprocessor FFT Methods, SIAM J. Sci. Stat. Comput., to appear.

O. Buneman, [1969], A Compact Non-iterative Poisson Solver, Rep. 294, Stanford University Institute for Plasma Research, Stanford, CA.

B. Buzbee, G. Golub and C. Nielson, [1970], On Direct Methods for Solving Poisson's Equations, SIAM J. Numer. Anal., 7, pp. 627-656.

W. Cochran, et al., [1967], What is the Fast Fourier Transform?, IEEE Trans. Audio and Elect., AU-15, pp. 45-55.

J. Cooley and J. Tukey, [1965], An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comp., 19, pp. 297-301.

W. Gentleman and G. Sande, [1966], Fast Fourier Transforms for Fun and Profit, 1966 Fall Joint Computer Conf., AFIPS Proc. 29, pp. 563-578.

D. Heller, [1976], Some Aspects of the Cyclic Reduction Algorithm for Block Tridiagonal Linear Systems, SIAM J. Numer. Anal., 13, pp. 484-496.

_____, [1978], A Survey of Parallel Algorithms in Numerical Linear Algebra, SIAM Rev., 20, pp. 740-777.

R. Hockney, [1965], A Fast Direct Solution of Poisson's Equation using Fourier Analysis, J. ACM, 12, pp. 95-113.

_____, [1970], The Potential Calculation and Some Applications, Meth. Comput. Phys., 9, pp. 135-211.

H. Jordan, [1983], Experience with Pipelined Multiple Instruction Streams, Department of Electrical and Computer Engineering, University of Colorado, Report CSDG 83-4.

M. Kadalbajoo and K. Bharadwaj, [1984], Fast Elliptic Solvers - An Overview, Appl. Math. Comput., 14, pp. 331-355.

J. Ortega and R. Voigt, [1985], Solution of Partial Differential Equations on Vector and Parallel Computers, SIAM Rev., 27, pp. 149-240.

P. Swarztrauber, [1977], The Methods of Cyclic Reduction, Fourier Analysis and the FACR Algorithm for the Discrete Solution of Poisson's Equation on a Rectangle, SIAM Rev., 19, pp. 490-501.

R. Sweet, [1974], A Generalized Cyclic Reduction Algorithm, SIAM J. Numer. Anal., 11, pp. 506-520.

_____, [1977], A Cyclic Reduction Algorithm for Solving Block Tridiagonal Systems of Arbitrary Dimension, SIAM J. Numer. Anal., 14, pp. 706-720.

C. Temperton, [1983], A Note on the Prime Factor FFT Algorithm, J. Comp. Phys., 52, pp. 198-204.

D. Young, [1971], Iterative Solution of Large Linear Systems, Academic Press, New York.
APPENDIX A
TRIDIAGONAL SOLVERS
The tridiagonal solver used in this paper is
Crout's method, which decomposes the matrix T into the
product of a lower triangular matrix, L, and an upper
triangular matrix, U. For Crout's method, the diagonal
elements of U are 1's.
      SUBROUTINE CROUT(XX,NSOL,N,C,RT)
      REAL D(0:256),RT(256),XX(0:256)
C     C     = -2 - 2*(DX*DX)/(DY*DY)
C     D(I)  = DIAGONAL ELEMENTS OF L MATRIX
C     NSOL  = NUMBER OF RECURSIVE TRIDIAGONAL SOLVES
C     RT(I) = ROOTS OF CHEBYSHEV POLYNOMIAL
C     XX(I) = RIGHT-HAND SIDE FUNCTIONS
      DO 30 J = 0, N-1
         D(J) = 0.0
30    CONTINUE
      DO 70 II = 1, NSOL
         D(1) = C - RT(NSOL+(1-II))
C        LU-DECOMPOSITION
         DO 40 J = 2, N-1
            D(J) = D(1) - 1.0/D(J-1)
40       CONTINUE
C        FORWARD SUBSTITUTION
         XX(1) = XX(1)/D(1)
         DO 50 J = 2, N-1
            XX(J) = (XX(J)-XX(J-1))/D(J)
50       CONTINUE
C        BACK-SUBSTITUTION
         DO 60 J = N-2, 1, -1
            XX(J) = XX(J) - XX(J+1)/D(J)
60       CONTINUE
70    CONTINUE
      RETURN
      END
After the research for this paper was
completed, it was brought to the author's attention
that a slightly more efficient variation of Crout's
method existed. The method differs only in that
instead of storing the diagonal elements in D(I), the
reciprocals of the diagonal elements are stored. The
rest of the code remains unchanged. The result of this
modification is that the divides performed in the
forward and back substitution phases are now changed to
multiplies. This is particularly important on the HEP,
since divides cost twice as much as multiplies. Recall
that for the CROUT routine the operation count is 9N
(3N adds/multiplies + 2*3N divides = 9N). The improved
method has an operation count of only 7N (3N adds +
2N multiplies + 2*N divides = 7N). However, the
qualitative results for the model presented in Chapter
VII remain essentially unchanged by this modification.
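The reciprocal-storage variation can be sketched as follows (Python and the names are ours, for illustration; the thesis's actual routine is the FORTRAN CROUT above, and the constant-diagonal system tridiag(1, b, 1) stands in for T shifted by a Chebyshev root):

```python
def solve_reciprocal(b, rhs):
    """Crout-style solve of tridiag(1, b, 1) x = rhs storing reciprocals
    of the pivots, so the forward and back substitutions use multiplies
    instead of divides (cheaper on the HEP)."""
    n = len(rhs)
    rd = [0.0] * n                     # reciprocals of the L-factor diagonal
    x = list(rhs)
    rd[0] = 1.0 / b
    x[0] = x[0] * rd[0]
    for j in range(1, n):              # one divide per row in the factorization
        rd[j] = 1.0 / (b - rd[j - 1])  # d_j = b - 1/d_{j-1}, stored as 1/d_j
        x[j] = (x[j] - x[j - 1]) * rd[j]
    for j in range(n - 2, -1, -1):     # multiplies only in back substitution
        x[j] = x[j] - x[j + 1] * rd[j]
    return x
```

Only the factorization retains a divide per row; both substitution sweeps become multiplies, which is the source of the 9N-to-7N improvement described above.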