Citation
Parallel compact symmetric Fast Fourier Transforms

Material Information

Title:
Parallel compact symmetric Fast Fourier Transforms
Creator:
Henson, Van Emden
Publication Date:
Language:
English
Physical Description:
134 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Fourier transformations -- Computer programs ( lcsh )
Fourier transformations -- Computer programs ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaf 105).
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Department of Mathematics.
Statement of Responsibility:
by Van Emden Henson.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
19818055 ( OCLC )
ocm19818055
Classification:
LD1190.L62 1988m .H46 ( lcc )

Full Text
PARALLEL COMPACT SYMMETRIC
FAST FOURIER TRANSFORMS
by
Van Emden Henson
B.S., University of Utah, 1979
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Mathematics
1988


This thesis for the Master of Science Degree by
Van Emden Henson
has been approved for the
Department of
Mathematics
by
Roland A. Sweet
Date


Henson, Van Emden (M.S., Mathematics)
Parallel Compact Symmetric Fast Fourier Transforms
Thesis directed by Associate Professor William L. Briggs
The Fast Fourier Transform (FFT) has become an enormously popular
algorithm because it requires O(N log N) arithmetic operations to compute the
transform of a complex vector of length N, instead of the O(N²) operations
required to compute the transform as a matrix-vector product. The description
"compact symmetric" refers to a family of FFT algorithms that use minimal
storage and arithmetic for data sequences possessing certain symmetries. These
FFTs depend on the fact that when symmetric sequences are split (the basic
process underlying the conventional FFT), the resulting subsequences inherit
symmetries. Deriving the symmetric FFT consists of modifying the conventional
FFT "combine" formulas to take advantage of the symmetries in the subse-
quences.
In this thesis, compact symmetric algorithms are derived for parallel
processing machines. Serial codes for these FFTs are implemented as a prelude
to parallelizing the algorithms, on both shared and distributed memory (hyper-
cube) systems. For both architectures, the parallel implementation is compli-
cated by the uneven distribution of work induced by splitting symmetric
sequences. During each pass through the data there may be several distinct
combine formulas that must be performed. Also, the portion of the data vector
to be processed with each type of combine formula changes with each pass
through the data. For the hypercube, the number of communication steps is
minimized and all the communication is shown to be "nearest neighbor".


Performance models are derived to predict the amount of speed-up that
may be expected as the number of processors is increased. Factors included in
the models are arithmetic operations, calls to the transcendental libraries,
overhead for the fork-join operations and, in the case of the hypercube, the
communication steps. Actual processing times are given for certain compact symmetric
FFTs, determined on a 24-processor Sequent Balance and a 32-node Intel
Hypercube. Experimental speed-up curves are shown and are compared to the
theoretical curves of the performance models.
The form and content of this abstract are approved. I recommend its publication.
Signed


for
Teri


ACKNOWLEDGEMENTS
I wish to thank William L. Briggs and Roland A. Sweet for their gui-
dance and encouragement. The Argonne National Laboratory provided time on
the Sequent Balance 21000 Multiprocessor. This work was supported in part by
National Science Foundation Grant number DMS-8611325. Special thanks go to
my wife, Teri, who kept me from going crazy while I wrote this thesis, and to
Craig Rasmussen, who tried to make me crazy.


CONTENTS
CHAPTER
I. INTRODUCTION..........................................1
Purpose of the Study..................................2
Scope of the Study....................................2
Arrangement of the Thesis.............................3
II. THE CANONICAL FAST FOURIER TRANSFORMS.................5
The Discrete Fourier Transform........................6
Cooley-Tukey Decimation in Time.......................7
Gentleman-Sande Decimation in Frequency..............14
A Radix 3 Decimation in Time Algorithm...............21
Mixed Radix Algorithms...............................25
III. THE SYMMETRIES......................................27
Real Data............................................28
Real and Even Data...................................29
Real and Odd Data....................................30
Quarter Wave Even Data...............................32
Quarter Wave Odd Data................................33
Conjugate Symmetric Data.............................34
Quarter Wave Conjugate Symmetric Data................35
Other Symmetries.....................................36
IV. SYMMETRIC FFT ALGORITHMS............................37
A Symmetric Transform for R Sequences................38
A Symmetric Transform for QE Sequences...............39


A Symmetric Transform for QO Sequences..................41
A Symmetric Algorithm for E Sequences...................42
A Symmetric Transform for O Sequences...................45
A Symmetric Algorithm for CS Sequences..................46
The Ordered-to-Scrambled Algorithms.....................49
An Ordered-to-Scrambled Algorithm for R Sequences.......50
An Ordered-to-Scrambled Algorithm for E Sequences.......52
An Ordered-to-Scrambled Algorithm for O Sequences.......54
V. A COMPACT SYMMETRIC TRANSFORM FOR
REAL AND EVEN DATA ON A
SHARED MEMORY MACHINE.............................56
Serial Implementation of the
Swarztrauber E Transform............................56
Serial Implementation of the
Ordered-to-Scrambled E Transform....................63
Savings from the Compact Symmetric Algorithm............66
The Parallel Algorithm..................................68
Decreases in Parallelism................................71
Complexity of the Parallel Algorithm....................71
Simplifying the Performance Model.......................74
The Parallel Implementation.............................75
VI. A PARALLEL COMPACT SYMMETRIC TRANSFORM
FOR REAL DATA ON A DISTRIBUTED
MEMORY MACHINE....................................80
The Hypercube...........................................80
The FFT on a Hypercube..................................81
Performance of the Hypercube FFT........................83


The Compact E Transform and the Hypercube........84
A Compact Hypercube Algorithm for R Sequences...86
Complexity of the Compact Real Algorithm........93
Complexity of the Hypercube R Algorithm.........94
The Parallel Implementation.....................96
Performance of the Hypercube R Transform........98
VII. CONCLUSIONS....................................103
REFERENCES...........................................105
APPENDIX 1...........................................106
APPENDIX 2......................................... 109
APPENDIX 3...........................................112
APPENDIX 4...........................................122
APPENDIX 5...........................................125
APPENDIX 6...........................................134


FIGURES
Figure
2.1 Data Flow and Storage, Cooley-Tukey Algorithm............11
2.2 Cooley-Tukey Scrambled-to-Ordered Algorithm..............13
2.3 Cooley-Tukey Ordered-to-Scrambled Algorithm..............15
2.4 Gentleman-Sande Ordered-to-Scrambled Algorithm...........18
2.5 Gentleman-Sande Scrambled-to-Ordered Algorithm...........20
2.6 Radix 3 Decimation-in-Time Algorithm.....................24
2.7 Mixed Radix Algorithm....................................26
4.1 Schematic Diagram of the Swarztrauber E Algorithm........44
4.2 Schematic Diagram of the Briggs CS Algorithm.............47
5.1 Uncompacted Swarztrauber E Algorithm.....................58
5.2 Compact Symmetric Swarztrauber E Algorithm...............60
5.3 Compact Symmetric Ordered-to-Scrambled E Algorithm.......65
5.4 Performance Curves for Long Sequences, BRIPAR............77
5.5 Performance Curves for Short Sequences, BRIPAR...........78
6.1 Cooley-Tukey Algorithm on the Hypercube..................82
6.2 BRICOS Algorithm on the Hypercube........................85
6.3 Storage Order Requirements, FFSWEET Algorithm............89
6.4 Sequence Separation, FFSWEET Algorithm...................91
6.5 FFSWEET on the Hypercube.................................92
6.6 Performance Curve, N = 4096, FFSWEET ....................99
6.7 Performance Curve, N = 2048, FFSWEET ...................100


CHAPTER I
INTRODUCTION
Since its introduction in 1965 [1], the Fast Fourier Transform (FFT) has
become one of the most widely used algorithms of computational mathematics.
Modern signal processing, spectroscopy, image processing, and techniques for
solving partial differential equations all depend heavily on the FFT. Its enor-
mous popularity is due largely to the fact that the FFT requires O(NlogN)
arithmetic operations to compute the transform of a complex vector of length N,
instead of the 0(N2) operations required to compute the transform as a
matrix-vector product. The term "compact symmetric" refers to a family of
FFT algorithms that uses minimal storage and arithmetic for data sequences
possessing certain symmetries. The first such algorithm, generally attributed to
Edson [2], computes the transform of a real vector using half the storage and
half the operations used by the original FFT. It has long been known that
further savings are possible when the data has additional symmetries, but with
the exception of one little-publicized algorithm by Gentleman [3], such
transforms were performed by pre- and post-processing of data for use with con-
ventional FFTs [4].
In recent papers by Swarztrauber [5] and Briggs [6], compact algorithms
were developed for sequences with real, even, odd, quarter wave even, and quar-
ter wave odd symmetries. The new compact FFTs depend on the fact that
when symmetric sequences are split the resulting subsequences inherit certain
symmetries. Performing the FFT then consists of modifying the Cooley-Tukey


combine formulas to account for the changing symmetries of the partial sub-
sequences. These transforms can all be performed in-place.
Purpose of the Study
The purpose of this study is twofold. The principal aim is to develop
parallel algorithms for the symmetric FFTs. Such algorithms are developed for
two classes of parallel processors: shared memory machines and distributed
memory machines. The implementation of these algorithms is considered in
detail, both from a theoretical standpoint and through experimentation.
A secondary goal of the study is to collect in one place the derivations of
the data symmetries and the symmetric algorithms. Principally this involves
bringing together the derivations of Swarztrauber and Briggs, and providing a
framework in which all of the algorithms can be related to each other. Two new
algorithms are derived, the ordered-to-scrambled transforms for even and odd
real sequences, following a method indicated by Briggs.
Scope of the Study
The list of symmetries and algorithms that are treated in this thesis is
by no means exhaustive. The symmetries discussed are those that commonly
occur in signal processing and in the solution of partial differential equations:
real, conjugate symmetric, even, odd, and quarter wave symmetries are con-
sidered. Other symmetries, such as those used in the study of crystallographic
groups, are not treated.
The presentation of existing algorithms begins with the "canonical"
Cooley-Tukey and Gentleman-Sande algorithms, because the derivations under-
lying those methods form the basis for all of the symmetric algorithms. The


symmetric algorithms themselves are derived in detail. Serial versions for
several of these algorithms are implemented, which had not been done prior to
this study. The data structures and performance characteristics of these imple-
mentations are carefully examined, as they have a great impact on the parallel
algorithms.
Two parallel symmetric algorithms, one for each of the parallel architec-
tures under consideration, are presented in detail. Various strategies of paralleli-
zation are considered, and these two algorithms are implemented on parallel pro-
cessors. Theoretical performance models of the parallel algorithms are developed
to predict the amount of speed-up that may be expected as the number of pro-
cessors is increased. Factors included in the models are the arithmetic opera-
tions, calls to the transcendental libraries, processor-to-processor communication,
and overhead for fork-join operations. Actual processing times are given for
sequences of various lengths and for varying numbers of processors. The actual
speedups and model speedups are compared and analyzed.
Arrangement of the Thesis
The thesis is arranged into seven chapters and six appendices. Roughly
speaking, the thesis is in two parts. The first four chapters are entirely theoreti-
cal, and largely expository. The next two chapters are oriented toward issues of
implementation.
Chapter II is devoted to derivations of the original FFTs, the two
Cooley-Tukey algorithms and the two Gentleman-Sande algorithms. Also
included is a brief discussion of the transforms for sequences whose lengths are
not powers of two. Chapter III is an examination of the symmetries. For each
of the symmetries discussed, consideration is given to the subsequences induced


by splitting a sequence bearing the given symmetry. Usually the subsequences
are found to carry symmetries imparted by the symmetry of the parent
sequence. The nature of the Discrete Fourier Transform of sequences with each
type of symmetry is also examined, and symmetries in the transformed sequence
are identified. In Chapter IV the symmetries in the subsequences and
transforms of symmetric sequences are used to derive FFTs that transform the
symmetric sequences efficiently. For most of the symmetries, two different algo-
rithms are derived, for different orderings of the data.
The second part of the thesis consists of the derivation and implementa-
tion of parallel algorithms. Chapter V is devoted to the transform of a real and
even sequence on a shared memory computer. The algorithm is developed
theoretically, and the details of the implementation are considered. A perfor-
mance model is developed and theoretical performance is compared with actual
timings. Chapter VI is a treatment of a parallel compact symmetric transform
on a distributed memory machine, specifically a hypercube. A hypercube algo-
rithm for a conventional FFT is given, as a method of introducing the nature of
the architecture and the problems faced in devising algorithms for it. A hyper-
cube FFT for strictly real data is then implemented, using an algorithm due to
Sweet [7]. As in Chapter V, a performance model is developed and theoretical
performance is compared with actual timings. The appendices consist of the
FORTRAN codes for the algorithms implemented as a part of the study, and
the experimental timing data for the parallel algorithms.


CHAPTER II
THE CANONICAL FAST FOURIER TRANSFORMS
It is a misconception to think that there is such a thing as "the FFT".
Despite common reference to it, the term "the FFT" refers not to an algorithm,
but rather to a very large and rapidly growing family of algorithms. The
members of this family are related more in their purpose than in their structure,
that is, all of them compute the Discrete Fourier Transform more efficiently
than by matrix-vector multiplication. Beyond this, however, there is a wide
variety of approaches as to how this may be accomplished. Most FFT algo-
rithms fall into a relatively small number of categories: Decimation-in-Time,
Decimation-in-Frequency, Mixed Radix Variations, In-Place Algorithms, Self-
sorting Algorithms, Symmetric Transforms, Prime Factor Algorithms. These
categories are not sharply defined, however. For example, the algorithms dis-
cussed in this thesis are all symmetric algorithms, but they can also be classified
as either decimation-in-time or decimation-in-frequency. Additionally, they are
all in-place algorithms. Similarly, Otto [10] has recently made a significant con-
tribution to the category of symmetric transforms by developing symmetric
prime factor algorithms, while Temperton [11] has developed a family of self-
sorting prime factor algorithms.
The purpose of this chapter is to develop some of the classic FFTs in
order to lay the groundwork for all that follows. The original FFT algorithm,
the Cooley-Tukey decimation-in-time algorithm, forms the starting point for this
survey.


The Discrete Fourier Transform
Suppose a complex sequence, {x_n} = {x_0, x_1, ..., x_{N-1}}, is given, for
which the Discrete Fourier Transform is desired. The DFT of the sequence is
given by
$$X_k = \sum_{n=0}^{N-1} x_n e^{-i2\pi nk/N}, \qquad k = 0, 1, \ldots, N-1$$
and is usually termed the forward transform. It is relatively easy to show that
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{i2\pi nk/N}, \qquad n = 0, 1, \ldots, N-1$$
which is therefore called the Inverse Discrete Fourier Transform (IDFT). These
expressions may be written somewhat more compactly by making the substitution
ω_N = e^{-i2π/N}, in which case the DFT and IDFT become
$$X_k = \sum_{n=0}^{N-1} x_n \omega_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (2.1)$$
and
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \omega_N^{-nk}, \qquad n = 0, 1, \ldots, N-1 \qquad (2.2)$$
respectively. It should be noted that there is no general agreement regarding
these definitions. Many authors define the forward transform with ω^{-nk} and
the IDFT using ω^{nk}. Others multiply both formulas by 1/√N instead of
multiplying the IDFT by 1/N. When there is no ambiguity about the length of the
sequence, the subscript on ω is frequently dropped, so that ω = e^{-i2π/N}.


The definition given for the DFT is easily recognizable as a matrix-
vector product, where the vector is the input data sequence {x_n} and the matrix
is one in which the (n,k)th entry is the complex exponential ω^{nk}:
$$\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ \vdots \\ X_{N-1} \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2N-2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \omega^{N-1} & \omega^{2N-2} & \cdots & \omega^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-1} \end{bmatrix}$$
As a matrix-vector multiplication, the DFT requires N² complex multiplications
and N² − N complex additions. Counting multiplications the same as additions,
this is 7N² − 2N real operations.
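To make the definition concrete, the matrix-vector form of the DFT can be computed directly. The following is an illustrative Python sketch (the thesis's own codes, in the appendices, are FORTRAN); it uses this chapter's convention ω_N = e^{-i2π/N} and exhibits the O(N²) cost:

```python
import cmath

def dft(x):
    """Direct DFT as a matrix-vector product:
    X_k = sum_n x_n * w**(n*k), with w = exp(-2*pi*i/N).
    Requires O(N^2) operations, as counted above."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]
```

For example, the DFT of a constant sequence is concentrated entirely in X_0, while the DFT of a unit impulse at n = 0 is constant.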
Cooley-Tukey Decimation-in-Time
For convenience, the length of the sequence, N, is assumed to be a
power of two. Suppose the sequence {x_n} is split into two subsequences {y_n}
and {z_n}, whose elements are y_n = x_{2n} and z_n = x_{2n+1}. This particular
operation is called decimation-in-time. The name refers to the decimation of the
sequence {x_n} by removing every second point, and is called decimation-in-time
because in many applications the input sequence {x_n} is a time series. Then
the DFT can be split into two sums, involving the elements of these two
subsequences, as follows:
$$X_k = \sum_{n=0}^{N/2-1} x_{2n} \omega_N^{2nk} + \sum_{n=0}^{N/2-1} x_{2n+1} \omega_N^{(2n+1)k}.$$


Noting that ω_N² = ω_{N/2}, this becomes
$$X_k = \sum_{n=0}^{N/2-1} y_n \omega_{N/2}^{nk} + \omega_N^k \sum_{n=0}^{N/2-1} z_n \omega_{N/2}^{nk}$$
for k = 0, 1, ..., N/2 − 1. Comparing the two summations in this equation with
equation (2.1), the definition of the DFT, it is seen that they are themselves
DFTs, of the length-N/2 subsequences {y_n} and {z_n} respectively, and are
denoted {Y_k} and {Z_k}. This implies that if the transforms of the even and odd
subsequences are available, then the first half of the desired transform {X_k} is
obtained by
$$X_k = Y_k + \omega_N^k Z_k, \qquad k = 0, 1, \ldots, \frac{N}{2}-1.$$
It is easily seen that the DFT is periodic, with a period equal to the length of
the sequence, and thus Y_k = Y_{k+N/2} and Z_k = Z_{k+N/2}. The second half of the
desired transform may be obtained by substituting k + N/2 for k, which gives
$$X_{k+N/2} = Y_{k+N/2} + \omega_N^{k+N/2} Z_{k+N/2}, \qquad k = 0, 1, \ldots, \frac{N}{2}-1.$$
Using the periodicity property, and noting as well that ω_N^{N/2} = e^{-iπ} = −1, this
equation simplifies to
$$X_{k+N/2} = Y_k - \omega_N^k Z_k, \qquad k = 0, 1, \ldots, \frac{N}{2}-1.$$
If the transforms {Y_k} and {Z_k} are available, each of half the length of
the original sequence, then the entire transform of the original sequence can be


obtained by computing
$$X_k = Y_k + \omega^k Z_k \qquad (2.3)$$
$$X_{k+N/2} = Y_k - \omega^k Z_k \qquad (2.4)$$
for k = 0, 1, ..., N/2 − 1. Equations (2.3) and (2.4) are called the Cooley-Tukey
combine formulas, or "butterfly relation".
Computing the DFTs {Y_k} and {Z_k} by matrix multiplication requires
2[7(N/2)² − 2(N/2)] real operations. Applying the butterfly relations to combine the
two half-length transforms into the full-length transform requires 5N real
operations, so that computing {X_k} in this fashion requires (7/2)N² + 3N operations.
This is a savings of nearly one half, achieved by a single application of the split-
ting process.
There is no reason to stop after one splitting. Since half the work in cal-
culating {Xfc} can be saved using this process, then half the work in calculating
{ Yk} and {Z^} can also be saved by applying the same process to the calculation
of those DFTs. The Cooley-Tukey FFT algorithm proceeds by recursively split-
ting the input vector until eventually N sequences of length 1 are produced.
However, the DFT of a sequence of length one is the sequence itself. The but-
terfly relations are then applied to build DFTs of length two from pairs of the
length one DFTs. Then DFTs of length four are constructed using pairs of the
length two DFTs, and so on. This process continues until finally two transforms
of length N/2 are combined to form the length-N transform of the original input
sequence.
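The recursion just described can be sketched in a few lines. This is an illustrative Python version (not the thesis's FORTRAN implementation), assuming N is a power of two: it splits the input into even- and odd-indexed subsequences, transforms each recursively, and applies the combine formulas (2.3) and (2.4):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)      # the DFT of a length-one sequence is itself
    Y = fft(x[0::2])        # transform of the even-indexed subsequence
    Z = fft(x[1::2])        # transform of the odd-indexed subsequence
    w = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    for k in range(N // 2):
        t = w ** k * Z[k]
        X[k] = Y[k] + t             # (2.3): X_k = Y_k + w^k Z_k
        X[k + N // 2] = Y[k] - t    # (2.4): X_{k+N/2} = Y_k - w^k Z_k
    return X
```

Note that the recursive slicing performs, implicitly, exactly the splitting that the ordering phase of Figure 2.1 performs explicitly.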


The process is illustrated in Figure 2.1, which shows the splittings and
recombinations involved in computing the Cooley-Tukey FFT of a sequence of
length 8. The first three columns of the figure represent the ordering phase of
the transform, and show the recursive splitting of the sequence into shorter and
shorter sequences. At the end of the ordering phase, the original data is said to
be in scrambled order. (This is also referred to as bit-reversed order because if
the indices of the original sequence are written in binary notation, then reversing
the order of the bits on each index will produce the scrambled ordering.) Data
is moved, but no computation is performed during the ordering phase. The
remaining four columns of the figure are the combine phase of the algorithm,
and show how successively longer DFTs are built up from shorter ones by
repeated application of the butterfly relations (the origin of the term "butterfly
relation" is apparent from the figure). When the process is complete, the DFT
of the original sequence has been produced, with the elements in natural order.
This is an example of a scrambled-to-ordered algorithm. Since during a given
pass through the data each butterfly involves two elements which are not used
elsewhere during that pass, the entire transform can be performed in-place, that
is, without requiring an extra work array in the implementation.
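The bit-reversed ordering mentioned above is easy to generate. As an illustrative sketch in Python (the helper name is mine, not the thesis's):

```python
def bit_reverse_permutation(n_bits):
    """Return the indices 0 .. 2**n_bits - 1 in bit-reversed (scrambled)
    order: write each index with n_bits binary digits, then reverse the
    digit order."""
    N = 1 << n_bits
    return [int(format(i, '0%db' % n_bits)[::-1], 2) for i in range(N)]
```

For N = 8 this yields [0, 4, 2, 6, 1, 5, 3, 7], the scrambled ordering produced by the splittings in the first columns of Figure 2.1.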
There is an important relationship between the subsequences created
during the ordering phase, and those generated during the combine phase. The
subsequences of a given length formed during the recombinations are exactly the
DFTs of the corresponding subsequences that occur in the ordering phase. For
example, in Figure 2.1 the first splitting produces the length-4 vector
{x_0, x_2, x_4, x_6}. The DFT of this vector occurs as the vector {Y_k} in the
second to last column of Figure 2.1.


Figure 2.1
Data storage and flow for the Cooley-Tukey Decimation-in-Time algorithm for
N = 8. The first three columns are successive applications of decimation,
eventually producing column 4, the DFTs of length one. The lines indicate those
data items used to produce a new value. For example, in the last pass the lines
running from the elements Y_1 and Z_1 to the element X_1 in the final column
represent the operation X_1 = Y_1 + ω¹Z_1. The powers of ω are not indicated
on this diagram.


Each pass through the data requires N/2 butterflies. Each butterfly
requires one complex multiplication and two complex additions, or ten real
operations, so each pass through the data entails 5N real operations. There are
log₂N passes through the data, so the FFT requires 5N log₂N real operations,
which is a tremendous savings over the 7N² − 2N real operations required to
compute the DFT by a matrix-vector multiplication.
Diagrams of the sort displayed in Figure 2.1 are known as data flow
diagrams, and are a basic tool for analyzing FFT algorithms. It is more common,
however, to draw them without including the ordering phase, which
involves no computation, and without labelling the intermediate sequences.
Instead, the power of ω used in each butterfly is often indicated, and all such
powers are converted to powers of ω_N, regardless of the length of the intermediate
sequence involved. The conventional data flow diagram for the sequence of
length 8 is shown in Figure 2.2. In the Cooley-Tukey scrambled-to-ordered
algorithm, the powers of ω are computed and applied in natural order. During the
first pass through the data, each element is combined, using the butterfly, with
an element immediately adjacent to it. This is known as a butterfly of stride 1.
During the second pass, the butterflies are all stride 2. With each succeeding
pass, as the subsequence lengths double, so does the stride length. During the
last pass through the data both the sequence lengths and the stride lengths are
N/2.
It is possible to rearrange the signal flow graph in Figure 2.2 so that the
input sequence is in natural order, by interchanging the positions of x_4 with x_1
and x_3 with x_6. By interchanging the corresponding positions in the other two
columns (and moving the lines and powers of ω along with the positions), the


Figure 2.2
Conventional data flow diagram for the Cooley-Tukey scrambled-to-ordered
algorithm for N = 8. The lines indicate the data items that are combined
together to form new values. The powers of ω are indicated on this diagram,
appearing at the terminal ends of the lines that indicate the butterflies in which
they are used.


data flow diagram in Figure 2.3 is produced. None of the computation has been
altered, so this is still the Cooley-Tukey decimation-in-time algorithm. There is
no longer an ordering phase; the sequence is scrambled during the computation.
As a result the butterfly stride lengths, which begin at N/2 in the first pass, are
halved with each succeeding pass. This computation now features the input
data in natural order, and the output sequence is in scrambled order. Such an
algorithm is called an ordered-to-scrambled transform. In this case the powers of
ω are computed in scrambled order during each pass.
The Cooley-Tukey scrambled-to-ordered algorithm (Figure 2.2) is
characterized by taking its input data in scrambled order, producing its output
data in natural order, having stride lengths that double with each succeeding
pass, and using powers of ω that are computed in natural order. The Cooley-
Tukey ordered-to-scrambled algorithm (Figure 2.3) is characterized by having
its input data in natural order, its output data in scrambled order, decreasing
stride lengths, and powers of ω that are computed in scrambled order.
Gentleman-Sande Decimation in Frequency
Neither of the Cooley-Tukey algorithms is particularly convenient. It is
not a simple matter to compute the scrambled ordering, but it must be computed
for both algorithms, either to scramble the input data or to compute the
powers of ω. Shortly after the Cooley-Tukey paper introduced the FFT, Gentleman
and Sande [12] introduced a new algorithm that computes the FFT of a
naturally ordered sequence using naturally ordered powers of ω. This algorithm
is called decimation-in-frequency because the derivation follows closely the
derivation of the Cooley-Tukey algorithm, except that the sequence that is split
into its even and odd subsequences is the transformed data. If the input


Figure 2.3
Data flow diagram for the Cooley-Tukey ordered-to-scrambled algorithm for
N = 8. The lines indicate the data items that are combined together. The
powers of ω are indicated on this diagram, appearing at the terminal ends of the
lines that indicate the butterflies in which they are used. The powers of ω
appear in scrambled order.


to a DFT is a time series, then the transform of the sequence is said to be in the
frequency domain.
It is assumed that the transform sequence, {X_k}, is in hand, and that
the IDFT, {x_n}, is desired. It is also assumed that the length of the sequence
{X_k} is a power of two. If the sequence {X_k} is split into two subsequences
{Y_k} and {Z_k}, whose elements are Y_k = X_{2k} and Z_k = X_{2k+1}, then the
frequency domain sequence has been decimated. The IDFT can be split into the
two sums involving the elements of these two subsequences by writing
$$x_n = \frac{1}{N} \sum_{k=0}^{N/2-1} X_{2k}\, \omega_N^{-2nk} + \frac{1}{N} \sum_{k=0}^{N/2-1} X_{2k+1}\, \omega_N^{-(2k+1)n}$$
$$= \frac{1}{N} \sum_{k=0}^{N/2-1} Y_k\, \omega_{N/2}^{-nk} + \frac{\omega_N^{-n}}{N} \sum_{k=0}^{N/2-1} Z_k\, \omega_{N/2}^{-nk}$$
for n = 0, 1, ..., N/2 − 1.
Evidently the two summations in this equation are themselves IDFTs, of
the length-N/2 subsequences {Y_k} and {Z_k} respectively. This implies that if the
inverse transforms of the even and odd subsequences are available, then the first
half of the desired inverse transform {x_n} can be computed by
$$x_n = \frac{1}{2}\left(y_n + \omega_N^{-n} z_n\right), \qquad n = 0, 1, \ldots, \frac{N}{2}-1.$$
The second half of the desired inverse transform may be obtained by substituting
n + N/2 for n to give


$$x_{n+N/2} = \frac{1}{2}\left(y_{n+N/2} + \omega_N^{-(n+N/2)} z_{n+N/2}\right), \qquad n = 0, 1, \ldots, \frac{N}{2}-1.$$
Using the periodicity of the IDFT, and noting as well that ω_N^{-N/2} = e^{iπ} = −1,
this equation simplifies to
$$x_{n+N/2} = \frac{1}{2}\left(y_n - \omega_N^{-n} z_n\right), \qquad n = 0, 1, \ldots, \frac{N}{2}-1.$$
This set of relationships will take a decimated (scrambled order) set of
transform coefficients {X_k} and perform the inverse transform on them,
producing the input sequence {x_n} in natural order. However, this is not what is
desired. Instead, it would be useful to operate on naturally ordered data and
produce the transform coefficients in scrambled order. To do this, it is necessary
to formally invert the process just described. This is readily done, producing
$$y_n = x_n + x_{n+N/2} \qquad (2.5)$$
and
$$z_n = \omega_N^n \left(x_n - x_{n+N/2}\right) \qquad (2.6)$$
for n = 0, 1, ..., N/2 − 1. Equations (2.5) and (2.6) together are called the
Gentleman-Sande butterfly.
What transpires with the application of these formulas is illustrated in
Figure 2.4, and can be described as follows: the input data sequence {x_n}, in
natural order, is "uncombined" using the Gentleman-Sande butterfly, into the
sequences {y_n} and {z_n}. These two sequences, by the manner in which they
are constructed, are the IDFTs of the subsequences {Y_k} and {Z_k} which would
be produced if {X_k} were split into its even and odd subsequences. The


Figure 2.4
Data flow diagram for the Gentleman-Sande ordered-to-scrambled algorithm for
N = 8. The lines indicate data items that are combined together. The powers
of ω are indicated on this diagram, appearing at the terminal ends of the lines
that indicate the butterflies in which they are used. The powers of ω appear
in natural order, and are only applied to the bottom half of the butterfly.


sequences {yn} and {zn} axe then recursively uncombined into subsequences
themselves. The four short sequences thus produced are the IDFTs of the four
sequences that would be produced if the even and odd subsequences of {X/j}
were themselves split into even and odd subsequences. This process continues
until, after log2iV passes through the data, N sequences of length one are pro-
duced. By the manner of their construction these sequences axe the IDFTs of
the elements of {X*}, in scrambled order, exactly what is desired.
Examination of Figure 2.4 reveals the structure of the Gentleman-
Sande decimation-in-frequency FFT. The input data, {x_n}, are in natural order,
while the output transform coefficients, {X_k}, are in scrambled order. The
stride lengths begin at N/2 and decrease by half with each pass through the data.
The powers of ω are computed in natural order, which is an attractive feature of
this algorithm. Each pass through the data requires exactly N/2 Gentleman-
Sande butterflies, while each butterfly requires one complex multiplication and
two complex additions, the same amount of work as is required for a Cooley-
Tukey butterfly. Thus the Gentleman-Sande algorithm requires the same
number of operations as the Cooley-Tukey algorithm.
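The whole ordered-to-scrambled algorithm is short enough to sketch in full. The following in-place Python version is an illustration under the convention ω = e^{-2πi/N}; the names `gs_fft` and `bitrev` are chosen here, not taken from the thesis. The stride starts at N/2 and halves with each pass, and the output lands in bit-reversed (scrambled) order, which the final check undoes before comparing against a naive DFT.

```python
import cmath

def dft(x):
    # Naive O(N^2) reference DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def gs_fft(x):
    # In-place Gentleman-Sande decimation-in-frequency FFT.
    # Natural-order input; output in bit-reversed (scrambled) order.
    a = list(x)
    N = len(a)
    h = N // 2                                       # stride starts at N/2 ...
    while h >= 1:
        w = cmath.exp(-2j * cmath.pi / (2 * h))      # root of unity for this pass
        for start in range(0, N, 2 * h):
            for n in range(h):                       # N/2 butterflies per pass
                u, v = a[start + n], a[start + n + h]
                a[start + n] = u + v                 # (2.5)
                a[start + n + h] = w ** n * (u - v)  # (2.6)
        h //= 2                                      # ... and halves each pass
    return a

def bitrev(k, bits):
    # Reverse the low 'bits' bits of k.
    r = 0
    for _ in range(bits):
        r, k = (r << 1) | (k & 1), k >> 1
    return r

N = 8
x = [complex(n % 5, n % 3) for n in range(N)]
scrambled = gs_fft(x)
X = dft(x)
assert all(abs(scrambled[j] - X[bitrev(j, 3)]) < 1e-9 for j in range(N))
```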
The Gentleman-Sande algorithm can also be rearranged to operate on
scrambled data and produce ordered coefficients, if desired. The process is as
before: swap those elements that interchange natural and scrambled order (along
with their lines and powers of ω), in all of the columns. This is equivalent to
scrambling the input data, beginning with short strides and then doubling stride
length with each pass, and computing the powers of ω in natural order while
performing the Gentleman-Sande butterflies. This algorithm is illustrated in
Figure 2.5.


Figure 2.5
Data flow diagram for the Gentleman-Sande scrambled-to-ordered algorithm for
N = 8. The lines indicate data items that are combined together. The powers
of ω are indicated on this diagram, appearing at the terminal ends of the lines
that indicate the butterflies in which they are used, and in the Gentleman-Sande
combine are only applied to the bottom half of the butterfly.


The four algorithms in Figures 2.2 through 2.5, respectively the
Cooley-Tukey scrambled-to-ordered, the Cooley-Tukey ordered-to-scrambled,
the Gentleman-Sande ordered-to-scrambled, and the Gentleman-Sande
scrambled-to-ordered, are the four canonical in-place FFT algorithms. Almost
all of the algorithms in this thesis are derived by following the derivation of one
of these four algorithms. In practice, the Cooley-Tukey ordered-to-scrambled
and the Gentleman-Sande scrambled-to-ordered algorithms are never used.
Generally the input data is transformed using the Gentleman-Sande algorithm
operating on ordered data, producing scrambled coefficients. Work performed in
the transform domain, such as filtering or tri-diagonal system solving, is per-
formed on the scrambled coefficients. The inverse transform is then obtained
using the Cooley-Tukey scrambled-to-ordered algorithm. In this way the scram-
bled ordering never needs to be computed.
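This forward/inverse pairing can be checked directly. In the sketch below (illustrative names, convention ω = e^{-2πi/N}), a Gentleman-Sande pass structure produces scrambled coefficients, and a Cooley-Tukey pass structure with conjugated powers of ω and a final division by N consumes them in the same scrambled order, so no reordering step is ever performed.

```python
import cmath

def gs_fft(x):
    # Gentleman-Sande DIF: natural order in, scrambled order out.
    a = list(x)
    N = len(a)
    h = N // 2
    while h >= 1:
        w = cmath.exp(-2j * cmath.pi / (2 * h))
        for start in range(0, N, 2 * h):
            for n in range(h):
                u, v = a[start + n], a[start + n + h]
                a[start + n], a[start + n + h] = u + v, w ** n * (u - v)
        h //= 2
    return a

def ct_inverse(X):
    # Cooley-Tukey DIT pass structure with conjugated twiddles:
    # scrambled order in, natural order out; the 1/N makes it the inverse.
    a = list(X)
    N = len(a)
    h = 1
    while h < N:
        w = cmath.exp(2j * cmath.pi / (2 * h))
        for start in range(0, N, 2 * h):
            for n in range(h):
                u, v = a[start + n], w ** n * a[start + n + h]
                a[start + n], a[start + n + h] = u + v, u - v
        h *= 2
    return [v / N for v in a]

x = [complex(n, -n * n) for n in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(ct_inverse(gs_fft(x)), x))
```

Because the scrambling is only a permutation, elementwise work such as filtering can be applied to the scrambled coefficients between the two calls.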
A Radix 3 Decimation-in-Time Algorithm
The algorithms presented so far require that the length of the sequence,
N, is a power of two. While this is useful in many situations, it means that only
a very limited number of sequences can be transformed efficiently. It is highly
desirable to have algorithms that efficiently compute the transforms of
sequences whose length is not a power of two. The decimation-in-time and
decimation-in-frequency derivations can be extended easily to sequences whose
length is the power of some number other than two; in fact any number could
be used as the base, or radix, for the sequence length. A radix 3 algorithm is
presented here as an example.
Suppose {x_n} is a sequence of length N, where N = 3^m. Let {x_n} be
divided into three subsequences {y_n}, {z_n}, and {w_n}, where y_n = x_{3n},
z_n = x_{3n+1}, and w_n = x_{3n+2}, for n = 0, 1, ..., N/3 - 1. Then the DFT of
{x_n} can be written as

    X_k = Σ_{n=0}^{N/3-1} x_{3n} ω_N^{3nk} + Σ_{n=0}^{N/3-1} x_{3n+1} ω_N^{(3n+1)k} + Σ_{n=0}^{N/3-1} x_{3n+2} ω_N^{(3n+2)k}

        = Σ_{n=0}^{N/3-1} y_n ω_N^{3nk} + ω_N^k Σ_{n=0}^{N/3-1} z_n ω_N^{3nk} + ω_N^{2k} Σ_{n=0}^{N/3-1} w_n ω_N^{3nk}

        = Σ_{n=0}^{N/3-1} y_n ω_{N/3}^{nk} + ω_N^k Σ_{n=0}^{N/3-1} z_n ω_{N/3}^{nk} + ω_N^{2k} Σ_{n=0}^{N/3-1} w_n ω_{N/3}^{nk}

for k = 0, 1, ..., N - 1. The three summations in this last expression are them-
selves DFTs of the length N/3 subsequences {y_n}, {z_n}, and {w_n} respectively.
Denoting these DFTs {Y_k}, {Z_k}, and {W_k}, the first one-third of the desired
transform {X_k} is obtained from

    X_k = Y_k + ω_N^k Z_k + ω_N^{2k} W_k,   k = 0, 1, ..., N/3 - 1 .    (2.7)
Noting that the DFTs {Y_k}, {Z_k}, and {W_k} are periodic with period N/3, the
remaining two-thirds of {X_k} are obtained from

    X_{k+N/3} = Y_k + ω_N^{k+N/3} Z_k + ω_N^{2k+2N/3} W_k    (2.8)

and

    X_{k+2N/3} = Y_k + ω_N^{k+2N/3} Z_k + ω_N^{2k+4N/3} W_k ,    (2.9)

for k = 0, 1, ..., N/3 - 1.


Equations (2.7), (2.8), and (2.9) are the radix 3 butterfly formulas. The
algorithm for transforming a sequence whose length is a power of three uses
these formulas exactly as the butterfly formulas were used in the power of two
Cooley-Tukey algorithm. The original sequence is split into three subsequences,
an operation that occurs recursively until sequences of length one are achieved.
At this point the original data is in a scrambled order, similar to the bit-reversed
order featured in the radix 2 algorithms. The location of a given sequence ele-
ment in this ordering is obtained by writing the original index of the element in
base three, and reversing the digits. The butterfly relations are then applied to
produce sequences of length 3. This recombination process is then followed
recursively. After log_3 N passes, three transforms of length N/3 are combined into
the transform {X_k}. Figure 2.6 is the flow diagram for a sequence of length 9.
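A direct recursive rendering of this radix 3 algorithm is sketched below. It is illustrative code, not the thesis's: recursion stands in for the in-place scrambled ordering, the convention ω = e^{-2πi/N} is assumed, and the result is checked against a naive DFT for N = 9.

```python
import cmath

def dft(x):
    # Naive O(N^2) reference DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def fft3(x):
    # Radix 3 decimation-in-time FFT for len(x) a power of three.
    N = len(x)
    if N == 1:
        return list(x)
    Y, Z, W = fft3(x[0::3]), fft3(x[1::3]), fft3(x[2::3])
    w = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    for k in range(N // 3):
        for j in range(3):                 # j = 0, 1, 2 gives (2.7), (2.8), (2.9)
            s = k + j * N // 3
            X[s] = Y[k] + w ** s * Z[k] + w ** (2 * s) * W[k]
    return X

x = [complex(n, (n * n) % 7) for n in range(9)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft3(x), dft(x)))
```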
The process by which the radix 3 algorithm was constructed can be
easily applied to sequences whose length is the power of any number M. Let
N = M^m, and assume the sequence {x_n} has been divided into M subsequences,
{y_n^{(p)}}, for p = 0, 1, ..., M - 1, where y_n^{(p)} = x_{Mn+p}. If the DFTs of the
sequences {y_n^{(p)}} are designated {Y_k^{(p)}}, then following a similar development as
was used in deriving the radix 2 and radix 3 transforms will lead to the radix M
butterfly formulas:

    X_{k+jN/M} = Σ_{p=0}^{M-1} ω_N^{(k+jN/M)p} Y_k^{(p)}    (2.10)

for k = 0, 1, ..., N/M - 1 and j = 0, 1, ..., M - 1.
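Formula (2.10) translates almost verbatim into a recursive sketch. The code below is an illustration under the convention ω = e^{-2πi/N} (the function name `fft_radix` is mine); it is checked against a naive DFT with M = 4 and N = 16.

```python
import cmath

def dft(x):
    # Naive O(N^2) reference DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def fft_radix(x, M):
    # Radix M decimation-in-time FFT for len(x) a power of M,
    # applying (2.10): X_{k+jN/M} = sum_p w^{(k+jN/M) p} Y_k^{(p)}.
    N = len(x)
    if N == 1:
        return list(x)
    sub = [fft_radix(x[p::M], M) for p in range(M)]   # the M subsequence DFTs
    w = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    for k in range(N // M):
        for j in range(M):
            s = k + j * N // M
            X[s] = sum(w ** (s * p) * sub[p][k] for p in range(M))
    return X

x = [complex(3 * n % 11, n % 4) for n in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_radix(x, 4), dft(x)))
```

With M = 2 the same routine reduces to the familiar radix 2 Cooley-Tukey recursion.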
For any radix M, the splitting process can be applied to the transform
coefficients as well as to the input data. If this is done and the butterflies
thus obtained are inverted, decimation-in-frequency algorithms (after the


Figure 2.6
Data flow diagram for a decimation-in-time radix 3 scrambled-to-ordered algo-
rithm, for N = 9. The lines indicate data items that are combined together.
The powers of ω are not indicated on this diagram.


pattern of the Gentleman-Sande algorithm) will be produced.
Mixed Radix Algorithms
While the addition of arbitrary radices to the family of FFTs greatly
increases the number of sequences that can be transformed efficiently, this
number is still severely limited. A class of algorithms known as the mixed radix
algorithms can be developed that dramatically increases the number of
sequences for which efficient FFTs are available. A mixed radix algorithm
operates on a sequence whose length, N, is the product of powers of K different
radices, N = Π_{j=1}^{K} M_j^{m_j}. The idea underlying this class of algorithms is
quite simple, and is illustrated with the transform of a sequence of length
24 = 3 · 2^3. The length 24 sequence can be considered as three sequences of
length 8. The DFTs of these subsequences can be calculated using a Cooley-
Tukey radix 2 scrambled-to-ordered algorithm. This produces three length 8
DFTs, which can then be combined, using the radix 3 butterflies, into the
desired transform of the original sequence. Figure 2.7 is a flow diagram for this
transform.
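The N = 24 example can be sketched directly. In this illustration (convention ω = e^{-2πi/N}; any radix 2 FFT would serve for the length-8 transforms, so a naive DFT stands in for it here), the three decimated subsequences are transformed and then merged by a single radix 3 pass.

```python
import cmath

def dft(x):
    # Naive O(N^2) reference DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 24
x = [complex(n % 7, (3 * n) % 5) for n in range(N)]

# Three length-8 transforms of the decimated subsequences:
Y, Z, W = dft(x[0::3]), dft(x[1::3]), dft(x[2::3])

# One radix 3 pass combines them into the length-24 transform:
w = cmath.exp(-2j * cmath.pi / N)
X = [0j] * N
for k in range(8):
    for j in range(3):
        s = k + 8 * j
        X[s] = Y[k] + w ** s * Z[k] + w ** (2 * s) * W[k]

assert all(abs(a - b) < 1e-9 for a, b in zip(X, dft(x)))
```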


Figure 2.7
Data flow diagram for a decimation-in-time mixed radix scrambled-to-ordered
algorithm, for N = 24. The lines indicate data items that are combined
together. The powers of ω are not indicated on this diagram. The scrambled
order is not the scrambling of the 24 point sequence, but the scrambling of the
three length 8 subsequences. The first three passes represent a radix 2 algo-
rithm to form length 8 transforms. The last pass is a radix 3 algorithm to form
the length 24 transform from the three length 8 transforms.


CHAPTER III
THE SYMMETRIES
Symmetric input data occurs in many applications, and the capability to
efficiently transform symmetric sequences has long been desired. Swarztrauber
[5] introduced FFT algorithms for certain symmetric sequences. These
transforms were derived by following the Cooley-Tukey splitting approach, and
taking advantage of symmetries in the input data, the subsequences formed in
the splittings, and the resultant transforms. This group of algorithms, like the
original Cooley-Tukey algorithm, operates on the input data in scrambled order
and produces the transform coefficients in natural order. Briggs [6] derived
another set of symmetric algorithms by formally inverting some of the
transforms introduced by Swarztrauber. The Briggs transforms operate on data
in natural order, and produce the transform coefficients in scrambled order. In
this respect, the Swarztrauber and Briggs transforms are similar to the Cooley-
Tukey and Gentleman-Sande transform pairs, in that they serve as complemen-
tary operations in many applications. The naturally ordered data may be
transformed using one of the Briggs algorithms, producing scrambled coeffi-
cients. Further operations, such as filtering, spectral analysis, or tri-diagonal
solving may be performed on the coefficients in scrambled order, and the resul-
tant data may then be returned to the original domain in natural order using
one of the Swarztrauber algorithms.
Seven data symmetries are important to this study, and the derivations
of their transforms (in both directions) are collected in Chapter IV. These


symmetries are: real data, real even and real odd data, real quarter wave even
and real quarter wave odd data, conjugate symmetric data, and quarter wave
conjugate symmetric data. To derive the compact transforms for these sym-
metries, it is necessary to analyze two aspects of each symmetry. First it must
be determined what symmetry is present in the DFT of a given symmetric
sequence. Secondly, it is necessary to determine what symmetries are induced
in the even and odd subsequences when the symmetric sequence is split, as in
the Cooley-Tukey algorithm.
It is assumed that the sequence is periodic, that is x_n = x_{N+n}. Unless
otherwise indicated by an explicit subscript, ω shall be assumed to mean ω_N.
Real Data
A strictly real sequence (hereafter R) is one in which each element is its
own complex conjugate, that is x_n = \bar{x}_n. Examining the (N-k)th element of
the DFT,

    X_{N-k} = Σ_{n=0}^{N-1} x_n ω^{n(N-k)} = Σ_{n=0}^{N-1} x_n ω^{nN} ω^{-nk}
            = Σ_{n=0}^{N-1} x_n ω^{-nk} = \overline{Σ_{n=0}^{N-1} \bar{x}_n ω^{nk}}
            = \overline{Σ_{n=0}^{N-1} x_n ω^{nk}} = \bar{X}_k .

Thus the DFT of an R sequence has the property that X_k = \bar{X}_{N-k}. A sequence
possessing this property is called conjugate symmetric (CS). Since the DFT of
an R sequence is CS, it follows that only half of the transform should actually be
computed in an efficient algorithm, with the remaining coefficients obtained by
taking the complex conjugate of those computed.
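This conjugate symmetry is easy to observe numerically. The following sketch (illustrative, with ω = e^{-2πi/N}) transforms an arbitrary real sequence and checks X_k = \bar{X}_{N-k}, indices taken modulo N.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [1.0, 5.0, -2.0, 3.0, 0.5, 2.0, -1.0, 4.0]   # an R sequence
X = dft(x)
# Conjugate symmetry: X_k = conj(X_{N-k}), indices mod N
assert all(abs(X[k] - X[(N - k) % N].conjugate()) < 1e-9 for k in range(N))
```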


Splitting an R sequence into its even-numbered and odd-numbered ele-
ments produces two real-valued sequences which are of half the length of the
original. No additional symmetry is induced by the splitting of an R sequence,
however.
Real and Even Data
An even sequence is one in which x_n = x_{N-n}. Let E denote a sequence
which is both real and even. Since the E sequence is real, as well as even, it
must have a transform that is conjugate symmetric, as discussed in the preced-
ing paragraph. Then

    X_{N-k} = Σ_{n=0}^{N-1} x_n ω^{n(N-k)} = Σ_{n=0}^{N-1} x_n ω^{-nk} .

Reversing the order of summation and noting that (by periodicity) x_N = x_0 and
also that ω^{Nk} = 1,

    X_{N-k} = Σ_{n=0}^{N-1} x_{N-n} ω^{-(N-n)k} = Σ_{n=0}^{N-1} x_{N-n} ω^{nk} .

Since x_{N-n} = x_n this becomes

    X_{N-k} = Σ_{n=0}^{N-1} x_n ω^{nk} = X_k .

Therefore an E sequence transforms into another E sequence. Only half of the


input sequence is required and only half of the transform needs to be computed,
since the remaining halves of both can be obtained by symmetry.
Suppose an E sequence {x_n} is split into two subsequences, {y_n}, con-
sisting of the elements with even-numbered indices, and {z_n}, the elements
whose indices are odd. Both of these sequences must be real, since their parent
is a real sequence. Examination of the two subsequences reveals that

    y_n = x_{2n} = x_{N-2n} = y_{N/2-n}

and

    z_n = x_{2n+1} = x_{N-2n-1} = z_{N/2-n-1} .

The subsequence {y_n} is itself an E sequence of length N/2, while the subse-
quence {z_n} has a symmetry called quarter wave even (QE).
Real and Odd Data
An odd sequence is one in which x_n = -x_{N-n}. A sequence that is both
real and odd is denoted an O sequence. The O sequence is real, and has a
transform which is conjugate symmetric. Following a line of argument similar to
that used for the transform of an E sequence, the (N-k)th element of the DFT
is given by

    X_{N-k} = Σ_{n=0}^{N-1} x_n ω^{n(N-k)} = Σ_{n=0}^{N-1} x_n ω^{-nk} .

Again appealing to periodicity and reversing the order of summation,

    X_{N-k} = Σ_{n=0}^{N-1} x_{N-n} ω^{-(N-n)k} = Σ_{n=0}^{N-1} x_{N-n} ω^{nk} .

Since x_{N-n} = -x_n it follows that

    X_{N-k} = -Σ_{n=0}^{N-1} x_n ω^{nk} = -X_k .

Since the O sequence is real, its transform is also CS, so that \bar{X}_k = X_{N-k} = -X_k,
which can hold only if {X_k} is a strictly imaginary sequence. Therefore the
transform of the O sequence is a purely imaginary odd sequence. Again, it is
only necessary to compute half of the transform, obtaining the other half by the
symmetry.
When an O sequence {x_n} is split into {y_n}, consisting of the elements
x_{2n}, and {z_n}, consisting of the elements x_{2n+1}, it is seen that

    y_n = x_{2n} = -x_{N-2n} = -y_{N/2-n}

and

    z_n = x_{2n+1} = -x_{N-2n-1} = -z_{N/2-n-1} .

The subsequence {y_n} is another O sequence. The subsequence {z_n} has a sym-
metry called quarter wave odd (QO).


Quarter Wave Even Data
A quarter wave even (QE) sequence of length N is one in which
x_n = x_{N-n-1}. To find the symmetry in the DFT of a QE sequence, begin with
the DFT formula. Written with the summation reversed (and without appeal-
ing to periodicity), it may be seen that

    X_k = Σ_{n=0}^{N-1} x_{N-n-1} ω^{(N-n-1)k} = ω^{-k} Σ_{n=0}^{N-1} x_{N-n-1} ω^{-nk} .

Since x_{N-n-1} = x_n,

    X_k = ω^{-k} Σ_{n=0}^{N-1} x_n ω^{-nk} = ω^{-k} \overline{Σ_{n=0}^{N-1} x_n ω^{nk}} = ω^{-k} \bar{X}_k .

Multiplying both sides of this last expression by ω^{k/2} it is seen that

    ω^{k/2} X_k = ω^{-k/2} \bar{X}_k = \overline{ω^{k/2} X_k} ,

which implies that both the real and imaginary parts of X_k can be represented
by a strictly real quantity, namely \tilde{X}_k = ω^{k/2} X_k. Since the QE sequence is real,
the CS nature of its transform allows the entire transform to be recovered if
only half the coefficients are actually computed.
If a QE sequence is split into its even-numbered and odd-numbered ele-
ments, each of the resulting subsequences has no symmetry by itself (except
that it is real), but taken together they have the intersequence symmetry

    y_{N/2-n-1} = x_{N-2n-2} = x_{2n+1} = z_n .


This means that a QE sequence splits into two R subsequences, one of which is
redundant.
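The quarter wave even transform property can be observed numerically. In this sketch (illustrative, with ω = e^{-2πi/N}), a real palindromic sequence x_n = x_{N-n-1} is transformed and ω^{k/2} X_k is checked to be strictly real.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [1.0, -2.0, 3.0, 0.5, 0.5, 3.0, -2.0, 1.0]   # QE: x_n = x_{N-n-1}
X = dft(x)
w = cmath.exp(-2j * cmath.pi / N)
Xt = [w ** (k / 2) * X[k] for k in range(N)]      # X~_k = w^{k/2} X_k
assert all(abs(t.imag) < 1e-9 for t in Xt)        # strictly real
```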
Quarter Wave Odd Data
A quarter wave odd (QO) sequence has the property that
x_n = -x_{N-n-1}. The derivation of its transform symmetry is nearly identical to
that of the QE sequence. The summation in the DFT is reversed, giving

    X_k = Σ_{n=0}^{N-1} x_{N-n-1} ω^{(N-n-1)k} = ω^{-k} Σ_{n=0}^{N-1} x_{N-n-1} ω^{-nk} .

Since x_{N-n-1} = -x_n,

    X_k = -ω^{-k} Σ_{n=0}^{N-1} x_n ω^{-nk} = -ω^{-k} \overline{Σ_{n=0}^{N-1} x_n ω^{nk}} = -ω^{-k} \bar{X}_k .

Multiplying both sides of this expression by ω^{k/2},

    ω^{k/2} X_k = -ω^{-k/2} \bar{X}_k = -\overline{ω^{k/2} X_k} .

Evidently both the real and imaginary parts
of the transform element X_k can be represented by a strictly imaginary quan-
tity, namely \tilde{X}_k = ω^{k/2} X_k. The CS nature of {X_k} allows the entire transform to
be represented by half the coefficients.
Let a QO sequence be split into its even and odd subsequences. Except
that they are real, the resulting subsequences have no symmetry by themselves.


Considered together, though, they have another intersequence symmetry given
by

    y_{N/2-n-1} = x_{N-2n-2} = -x_{2n+1} = -z_n ,

and a QO sequence, like the QE, splits into two R subsequences, one of which
is redundant.
Conjugate Symmetric Data
It has been shown that the transform of an R sequence is conjugate
symmetric (CS). It is obvious that the IDFT of a CS sequence must therefore
be strictly real. It is also important to ascertain the nature of the forward
transform of a CS sequence. Suppose {x_n} is CS, that is x_{N-n} = \bar{x}_n. Then rev-
ersing the order of summation in the DFT and invoking periodicity yields

    X_k = Σ_{n=0}^{N-1} x_{N-n} ω^{(N-n)k} = ω^{Nk} Σ_{n=0}^{N-1} x_{N-n} ω^{-nk}
        = Σ_{n=0}^{N-1} x_{N-n} ω^{-nk} .

Using the CS nature of {x_n} this becomes

    X_k = Σ_{n=0}^{N-1} \bar{x}_n ω^{-nk} = \overline{Σ_{n=0}^{N-1} x_n ω^{nk}} = \bar{X}_k ,

which implies that {X_k} is an R sequence. The CS symmetry of the input
implies that the complex sequence {x_n} can be represented by the N real values
that comprise the real and imaginary parts of the first half of the sequence. The
transform consists of N strictly real quantities.
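A short numeric check of this fact (illustrative sketch, ω = e^{-2πi/N}): build a CS sequence from an arbitrary first half, forcing x_0 and x_{N/2} real, and confirm its DFT is strictly real.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [0j] * N
x[0], x[N // 2] = 2.0 + 0j, -1.0 + 0j            # these two elements must be real
for n in range(1, N // 2):
    x[n] = complex(n, 2 * n - 1)
    x[N - n] = x[n].conjugate()                  # CS: x_{N-n} = conj(x_n)
X = dft(x)
assert all(abs(v.imag) < 1e-9 for v in X)        # the transform is strictly real
```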


Suppose that a CS sequence {x_n} is split into its even and odd subse-
quences. Then

    y_n = x_{2n} = \bar{x}_{N-2n} = \bar{y}_{N/2-n} ,

implying that the subsequence {y_n} is itself a CS sequence. Examining the
subsequence {z_n}, it may be seen that

    z_n = x_{2n+1} = \bar{x}_{N-2n-1} = \bar{z}_{N/2-n-1} ,

which introduces a new symmetry, similar to the QE except for the complex
conjugate. It is called a quarter wave conjugate symmetric (QCS) sequence.
Quarter Wave Conjugate Symmetric Data
The symmetry in the transform of a QCS sequence may now be deter-
mined. Assume that {x_n} is QCS, that is x_n = \bar{x}_{N-n-1}. Reversing the order
of summation in the DFT yields

    X_k = Σ_{n=0}^{N-1} x_{N-n-1} ω^{(N-n-1)k} = ω^{-k} Σ_{n=0}^{N-1} x_{N-n-1} ω^{-nk}
        = ω^{-k} Σ_{n=0}^{N-1} \bar{x}_n ω^{-nk} = ω^{-k} \bar{X}_k .

Like the transforms of the QE and QO sequences derived above, this transform
has the property that each element is related to its complex conjugate by a rota-
tion through an argument dependent on the index k. Both the real and ima-
ginary parts can thus be represented by the strictly real quantity \tilde{X}_k = ω^{k/2} X_k.


Consider the splitting of a QCS sequence into its even-numbered and
odd-numbered elements. Each of the resulting subsequences has no symmetry
by itself, but taken together they have yet another intersequence symmetry

    y_n = x_{2n} = \bar{x}_{N-2n-1} = \bar{z}_{N/2-n-1} .

Thus a QCS sequence splits into two subsequences, one of which is redundant.
Briggs [6] referred to these sequences as dual sequences and denoted them D and
D*.
Other Symmetries
The symmetries discussed above do not form an exhaustive list of such
transforms. Several symmetries have been omitted, most notably the ω-even
and ω-odd symmetries. Transforms for sequences with these symmetries may
be found in [6]. The symmetries that arise out of crystallographic studies have
not been treated at all. The symmetries that have been included are those that
are needed to develop the parallel algorithms in Chapters V and VI.


CHAPTER IV
SYMMETRIC FFT ALGORITHMS
Chapter III was concerned with presenting the preliminary material
necessary to develop direct transforms for symmetric data. In this chapter the
symmetric transforms themselves will be derived. Swarztrauber [5] derived the
symmetric transforms taking scrambled data to ordered coefficients, while
Briggs [6] derived most of the transforms that operate on ordered data and pro-
duce scrambled coefficients. Both directions will be considered here, beginning
with the scrambled-to-ordered transforms. It is assumed that the original
sequences are of length N = 2^m.
Recall that the original Cooley-Tukey transform, derived in Chapter II,
operated on data in scrambled order to produce coefficients in natural order.
The same Cooley-Tukey approach is followed here for each of the symmetries in
question. The symmetric sequence is recursively split until sequences of length
one are obtained, which are their own transforms. Then the appropriate com-
bine formulas, or butterflies, are applied to build up longer transforms from pairs
of shorter ones. This process is applied recursively until eventually the
transforms of two sequences of length N/2 are combined to produce the desired
transform of the original sequence.
The approach taken by Swarztrauber is to determine the symmetries in
the subsequences as the original data is recursively split, and to note the sym-
metries that are present in the transforms of the subsequences. By substituting


these symmetries into the Cooley-Tukey combine formulas, equations (2.3) and
(2.4), the butterfly relations that combine the various symmetric transforms into
transforms of longer symmetric sequences are derived.
A Symmetric Transform for R Sequences
In Chapter III it was shown that if an R sequence {x_n} is split into its
two subsequences, {y_n} and {z_n}, these subsequences retain the R symmetry of
the parent, but that no further symmetry is induced. Being R sequences, how-
ever, both have transforms that (like the transform of the parent) are CS
sequences. Thus Y_k = \bar{Y}_{N/2-k} and Z_k = \bar{Z}_{N/2-k}, and it is necessary to have only
one half of each of these sequences. The first quarter of the desired transform
{X_k} is computed directly from equation (2.3). To compute the second quarter
of {X_k}, substitute N/2 - k for k into equation (2.3), which yields

    X_{N/2-k} = Y_{N/2-k} + ω^{N/2-k} Z_{N/2-k} .

Invoking the CS nature of {Y_k} and {Z_k}, it follows that

    X_{N/2-k} = \bar{Y}_k - ω^{-k} \bar{Z}_k .

Taking the conjugate of both sides of this last equation, the two combine formu-
las that together produce the first half of {X_k} from the first halves of {Y_k} and
{Z_k} are

    X_k = Y_k + ω^k Z_k    (4.1)

and

    \bar{X}_{N/2-k} = Y_k - ω^k Z_k    (4.2)

for k = 0, 1, ..., N/4. The remaining half of {X_k} may be obtained by symmetry.
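The RtoR combine can be checked at a single level. The sketch below (illustrative, ω = e^{-2πi/N}; the half-length transforms are computed naively rather than recursively) applies (4.1) and (4.2) to the even- and odd-indexed halves of a real sequence and recovers its full DFT.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [1.0, 4.0, -2.0, 0.5, 3.0, -1.0, 2.5, 0.0]        # an R sequence
Y, Z = dft(x[0::2]), dft(x[1::2])                      # two length-4 CS transforms
w = cmath.exp(-2j * cmath.pi / N)
X = [0j] * N
for k in range(N // 4 + 1):
    X[k] = Y[k] + w ** k * Z[k]                        # (4.1)
    X[N // 2 - k] = (Y[k] - w ** k * Z[k]).conjugate() # (4.2)
for k in range(1, N // 2):
    X[N - k] = X[k].conjugate()                        # CS symmetry fills the rest
Xref = dft(x)
assert all(abs(X[k] - Xref[k]) < 1e-9 for k in range(N))
```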
Equations (4.1) and (4.2) are called the "RtoR" combine formulas,
because they combine the transforms of real sequences into the transform of a
real sequence. They will be important in other transforms.
The symmetric FFT taking scrambled R sequences to ordered CS coeffi-
cients consists of recursively splitting the original sequence into sequences of
length one, throwing away the second half, and then recursively applying equa-
tions (4.1) and (4.2) to form the transforms of longer and longer sequences.
Swarztrauber refers to this as the Edson algorithm for real data, but there is
some confusion about this. Edson never published his algorithm, and the origi-
nal reference is given by Bergland [2], who cites private communication as his
source. The algorithm described by Bergland, however, is not the one just out-
lined; rather he described an algorithm that operates on ordered data to produce
scrambled coefficients. The algorithm described here is thus attributable to
Swarztrauber.
A Symmetric Transform for QE Sequences
It has been noted that in splitting a QE sequence {x_n} one of the result-
ing subsequences is redundant. Theoretically {X_k} can be recovered entirely
from the transform of {y_n} if an appropriate "combine" formula can be derived,
while the transform of {z_n} need be neither computed nor stored. Let {Y_k} be
the transform of {y_n}. In Chapter III it was shown that {X_k} can be
represented by a strictly real sequence {\tilde{X}_k}. Since {y_n} and {z_n} have the
intersequence symmetry that z_n = y_{N/2-n-1}, it follows that

    Z_k = Σ_{n=0}^{N/2-1} z_n ω_{N/2}^{nk} = Σ_{n=0}^{N/2-1} y_{N/2-n-1} ω_{N/2}^{nk} .

Reversing the order of the summation and noting that the y_n are real,

    Z_k = ω_{N/2}^{-k} Σ_{n=0}^{N/2-1} y_n ω_{N/2}^{-nk} = ω_{N/2}^{-k} \bar{Y}_k .

This relationship and the relation \tilde{X}_k = ω^{k/2} X_k are substituted into the RtoR
combine formulas (4.1) and (4.2) to yield the butterfly relations by which the
transform of a QE sequence can be formed from the transform of its nonredun-
dant R subsequence:

    \tilde{X}_k = 2 Re[ω^{k/2} Y_k],   k = 0, 1, ..., N/4    (4.3)

and

    \tilde{X}_{N/2-k} = -2 Im[ω^{k/2} Y_k],   k = 0, 1, ..., N/4 - 1 .    (4.4)

Thus N/2 real values are required to represent the N complex values of the
transform of a real and quarter wave even sequence. Equations (4.3) and (4.4)
together are called the "RRQE" combine, since they combine the transforms of
two R sequences (one redundant) into the transform of a QE sequence.
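A single-level numeric check of the RRQE combine (illustrative sketch, ω = e^{-2πi/N}; the nonredundant subsequence's transform is computed naively): the real values \tilde{X}_k produced by (4.3) and (4.4) are mapped back through X_k = ω^{-k/2} \tilde{X}_k and compared with the parent's DFT.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
y = [1.0, -2.0, 3.0, 0.5]                   # nonredundant R subsequence, y_n = x_{2n}
x = [0.0] * N                               # rebuild the QE parent for checking
for n in range(N // 2):
    x[2 * n] = y[n]
    x[2 * n + 1] = y[N // 2 - n - 1]        # z_n = y_{N/2-n-1}
Y = dft(y)
w = cmath.exp(-2j * cmath.pi / N)
Xt = [0.0] * (N // 2 + 1)
for k in range(N // 4 + 1):
    Xt[k] = 2 * (w ** (k / 2) * Y[k]).real              # (4.3)
for k in range(N // 4):
    Xt[N // 2 - k] = -2 * (w ** (k / 2) * Y[k]).imag    # (4.4)
Xref = dft(x)
assert all(abs(w ** (-k / 2) * Xt[k] - Xref[k]) < 1e-9
           for k in range(N // 2 + 1))      # X_k = w^{-k/2} X~_k
```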
The algorithm for transforming a QE sequence begins by splitting the
QE sequence into its two R subsequences. The first is used, but the second
need not be retained. The nonredundant R subsequence is then recursively split
until it comprises R sequences of length one, which are their own CS
transforms. The RtoR formulas are then applied recursively to build up CS
transforms of increasing length until the N/2-point CS transform, {Y_k}, of the
nonredundant R sequence has been obtained. The RRQE combine formulas are
then applied to produce the N/2 real values comprising the first half of {\tilde{X}_k}.
The entire complex sequence {X_k} may be obtained from the first half of {\tilde{X}_k}
by symmetry.
A Symmetric Transform for QO Sequences
Let {x_n} be a QO sequence. The development of a symmetric FFT for
this case closely follows the development of the FFT for QE sequences. Split-
ting a QO sequence, like splitting a QE sequence, produces two R sequences
with an intersequence symmetry, rendering one of the R subsequences redun-
dant. Again, {X_k} can be recovered entirely from the transform of {y_n} once an
appropriate combine formula is derived. Since {y_n} and {z_n} have the interse-
quence symmetry that z_n = -y_{N/2-n-1}, a development similar to that followed
for the transform of a QE sequence yields the relationship Z_k = -ω_{N/2}^{-k} \bar{Y}_k.
This, along with the symmetry for the transform of a QO sequence (derived in
Chapter III), may be substituted into equations (4.1) and (4.2) to yield

    \tilde{X}_k = 2i Im[ω^{k/2} Y_k],   k = 0, 1, ..., N/4

and

    \tilde{X}_{N/2-k} = -2i Re[ω^{k/2} Y_k],   k = 0, 1, ..., N/4 - 1 ,

which are then called the "RRQO" combine formulas.


The symmetric algorithm for transforming a QO sequence is completely
analogous to that for the QE sequence. The QO sequence is split into its two R
subsequences, only the first of which is retained. This nonredundant R subse-
quence is then recursively split until it consists of sequences of length one,
which are their own CS transforms. The RtoR formulas are then applied recur-
sively to build up CS transforms of increasing length until the CS transform,
{Y_k}, of the nonredundant R sequence has been obtained. The RRQO combine
formulas are then applied to produce the N/2 imaginary values comprising the
first half of {\tilde{X}_k}. The entire complex sequence {X_k} may then be obtained
from these values by symmetry.
A Symmetric Algorithm for E Sequences.
An E sequence {x_n} splits into an E subsequence {y_n}, consisting of the
even-numbered elements, and a QE subsequence {z_n}, consisting of the odd-
numbered elements, as was shown in Chapter III. It was also shown that {Y_k},
the transform of the E subsequence, is also an E sequence with Y_k = Y_{N/2-k}, and
that {Z_k}, the transform of the QE subsequence, can be represented by the
strictly real sequence {\tilde{Z}_k}, where \tilde{Z}_k = ω^k Z_k. To compute the first quarter of
{X_k} these relations are substituted into equation (2.3) to give

    X_k = Y_k + ω^k Z_k = Y_k + ω^k ω^{-k} \tilde{Z}_k
        = Y_k + \tilde{Z}_k,   k = 0, 1, ..., N/4 .    (4.5)

Computing the second quarter of {X_k} is accomplished by appealing to equation
(4.2); since every quantity involved is real, the conjugates may be dropped, yielding

    X_{N/2-k} = Y_k - ω^k Z_k
              = Y_k - \tilde{Z}_k,   k = 0, 1, ..., N/4 - 1 .    (4.6)

Equations (4.5) and (4.6), which are together called an "EQE" type combina-
tion, produce the first N/2 + 1 values of the transform {X_k}. Since {X_k} is real
and even the remaining values may be obtained by symmetry.
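One level of the EQE combination can be checked numerically. In this sketch (illustrative, ω = e^{-2πi/N}; subsequence transforms computed naively), a real even sequence is split, the real representation \tilde{Z}_k = ω^k Z_k is formed, and (4.5)/(4.6) plus evenness reproduce the full transform.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [2.0, 1.0, -3.0, 0.5, 4.0, 0.5, -3.0, 1.0]      # E: x_n = x_{N-n}
Y, Z = dft(x[0::2]), dft(x[1::2])                   # E and QE subsequence transforms
w = cmath.exp(-2j * cmath.pi / N)
Zt = [(w ** k * Z[k]).real for k in range(N // 2)]  # Z~_k = w^k Z_k is real
X = [0j] * N
for k in range(N // 4 + 1):
    X[k] = Y[k] + Zt[k]                             # (4.5)
for k in range(N // 4):
    X[N // 2 - k] = Y[k] - Zt[k]                    # (4.6)
for k in range(1, N // 2):
    X[N - k] = X[k]                                 # evenness fills the rest
Xref = dft(x)
assert all(abs(X[k] - Xref[k]) < 1e-9 for k in range(N))
```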
Suppose the transform of an E sequence is desired. The symmetric algo-
rithm, proposed by Swarztrauber, is schematically illustrated in Figure 4.1. The
input sequence is split, producing one E subsequence and one QE sequence.
The E subsequence splits into another E and QE pair, while the QE sequence
splits into two R sequences, one of which is redundant. The splitting pro-
cess continues, with each E generating an E and QE pair, each QE splitting into
an R (and a redundant R), and each R sequence splitting into two more R
sequences, until sequences of length one are produced (left of the bar in Figure
4.1). No work has yet been performed, although the data have been moved, and
are now said to be in scrambled order.
The algorithm can now begin combining short transforms into longer
transforms, using the appropriate combine formulas. In general, each pass has
one EQE combination, followed by an RRQE, followed by a series of RtoR com-
binations. At the second to last pass, there are only the RRQE and EQE types,
while the final pass involves only the EQE combination, performed on sequences
of half the length of the original input. In practice, none of the redundant R
sequences is computed or stored, as is indicated in the figure. The program uses
only N/2 + 1 storage locations. The combine phase of the algorithm begins with
the input data in scrambled order, and proceeds through log_2 N passes, until the


Figure 4.1
Schematic diagram of the Swarztrauber Algorithm. F{ } indicates the Discrete
Fourier Transform. The asterisks represent the redundant R sequences which
are not calculated or stored. The portion of the diagram to the left of the
column of bars is the ordering phase, that to the right of the bars is the combine
phase. The column immediately to the left of the column of bars is the E
sequence in scrambled order; the final F{E} on the right is the output
transform, in natural order.


transform coefficients are produced in natural order.
A Symmetric Transform for O Sequences
Suppose {x_n} is an O sequence with x_n = -x_{N-n}. Then if {x_n} is split
so that y_n = x_{2n} and z_n = x_{2n+1}, the sequence {y_n} is itself an O sequence,
while {z_n} is QO. It was shown in Chapter III that the transforms of the subse-
quences are strictly imaginary, and have the symmetries Y_{N/2-k} = -Y_k and
\tilde{Z}_k = ω^k Z_k. Substituting these relations into equations (2.3) and (2.4) produces
the required formulas that combine the imaginary transforms of an O and a QO
sequence into the strictly imaginary transform of the original O sequence. They
are given by

    X_k = Y_k + \tilde{Z}_k

and

    \bar{X}_{N/2-k} = Y_k - \tilde{Z}_k ,

for k = 0, 1, ..., N/4.
These two equations comprise the "OQO" combine formulas. Only half of the
short sequences are needed, and only half of the desired transform {X_k} need be
computed, the remainder being available by symmetry.
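A single-level check of the OQO combine (illustrative sketch, ω = e^{-2πi/N}; subsequence transforms computed naively). Since the transform of an O sequence is purely imaginary and odd, the remaining coefficients are filled with X_{N-k} = -X_k.

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT with w = exp(-2*pi*i/N).
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

N = 8
x = [0.0, 1.5, -2.0, 0.5, 0.0, -0.5, 2.0, -1.5]   # O: x_n = -x_{N-n}
Y, Z = dft(x[0::2]), dft(x[1::2])                  # O and QO subsequence transforms
w = cmath.exp(-2j * cmath.pi / N)
Zt = [w ** k * Z[k] for k in range(N // 2)]        # Z~_k = w^k Z_k (purely imaginary)
X = [0j] * N
for k in range(N // 4 + 1):
    X[k] = Y[k] + Zt[k]                            # OQO combine
    X[N // 2 - k] = (Y[k] - Zt[k]).conjugate()
for k in range(1, N // 2):
    X[N - k] = -X[k]                               # imaginary odd symmetry
Xref = dft(x)
assert all(abs(X[k] - Xref[k]) < 1e-9 for k in range(N))
```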
The symmetric algorithm for transforming an O sequence follows the
same pattern as that for the E sequence. Figure 4.1 may be used to visualize
the algorithm if the substitutions O for E and QO for QE are made throughout.
The combine phase begins with the O data in scrambled order, with the redun-
dancies removed. During each pass the first combine is an OQO, the second an
RRQO, and all remaining combines are the RtoR type.


A Symmetric Algorithm for CS Sequences
It has been shown that the transform of a CS sequence is an R sequence. If
a CS sequence is split, the resulting subsequences {y_n} and {z_n} are CS and
QCS respectively. When these are split in turn, the CS subsequence again splits
into two shorter sequences, the first of which is CS, the second QCS. Splitting
the QCS subsequence produces a D, D* pair, one of which is redundant. By
itself a D sequence has no symmetry, but is simply a complex (C) sequence. If a
D sequence is split, its subsequences inherit no symmetry and are simply C
sequences. Briggs [6] developed an FFT for CS sequences based on these split-
tings. Figure 4.2 illustrates the nature of the algorithm. The CS sequence is
split recursively until sequences of length one are produced. The combine phase
then begins, and with each pass through the data, the first combination must
form the transform of a CS sequence from the shorter transforms of a CS and a
QCS pair, the next combination must form the transform of a QCS from the
transform of its nonredundant D subsequence, and all remaining combinations
must form the transform of a D sequence from the transforms of its subse-
quences.
To complete the derivation of the CS FFT it remains to derive the
equations by which the various combinations are made. The equations for com-
bining the transforms of a CS and a QCS set of subsequences into the transform
of their CS parent begin with the standard Cooley-Tukey formulas, equations
(2.3) and (2.4), to give

    X_k = Y_k + ω^k Z_k   and   X_{N/2+k} = Y_k - ω^k Z_k ,

for k = 0, 1, ..., N/2 - 1. Since {Y_k} is an R sequence and {Z_k} can be represented
by a strictly real sequence {\tilde{Z}_k} where Z_k = ω^{-k} \tilde{Z}_k, these equations become


Figure 4.2
Schematic diagram of the Briggs Algorithm for CS Data. F{ } indicates the
Discrete Fourier Transform. The asterisks represent the redundant D*
sequences which are not calculated or stored. The portion of the diagram to the
left of the column of bars is the ordering phase, that to the right of the bars is
the combine phase. The column immediately to the left of the column of bars is
the CS sequence in scrambled order; the final F{CS} on the right is the
transform in natural order.


    X_k = Y_k + \tilde{Z}_k    (4.7)

and

    X_{N/2+k} = Y_k - \tilde{Z}_k ,    (4.8)

for k = 0, 1, ..., N/2 - 1, and together are called the "CSQCS" combine.
Forming the transform of a QCS from the transform of its nonredundant
D subsequence depends on the fact that if \{y_n\} and \{z_n\} are the D and D*
subsequences (length N/2) of a QCS sequence \{x_n\}, they have the intersequence
symmetry y_n = x_{2n} = \bar{x}_{N-2n-1} = \bar{z}_{N/2-n-1}. Therefore

    Y_k = \sum_{n=0}^{N/2-1} y_n\,\omega_{N/2}^{nk}
        = \sum_{n=0}^{N/2-1} \bar{z}_{N/2-n-1}\,\omega_{N/2}^{nk}
        = \omega_{N/2}^{-k}\sum_{m=0}^{N/2-1} \bar{z}_m\,\omega_{N/2}^{-mk}
        = \omega^{-2k}\,\bar{Z}_k   \quad (\omega_{N/2} = \omega^2)

for k = 0, 1, ..., N/2 - 1, or equivalently Z_k = \omega^{-2k}\,\bar{Y}_k.
Substituting this into equations (4.1) and (4.2) yields

    X_k = Y_k + \omega^{-k}\bar{Y}_k \quad\text{and}\quad X_{N/2+k} = Y_k - \omega^{-k}\bar{Y}_k ,

and since \{X_k\} is the transform of a QCS sequence it can be represented by
the strictly real sequence \{\hat{X}_k\}, where X_k = \omega^{-k/2}\hat{X}_k, yielding

    \hat{X}_k = \omega^{k/2} Y_k + \omega^{-k/2}\bar{Y}_k = 2\,\mathrm{Re}[\omega^{k/2} Y_k]   (4.9)

    \hat{X}_{N/2+k} = -i\left(\omega^{k/2} Y_k - \omega^{-k/2}\bar{Y}_k\right) = 2\,\mathrm{Im}[\omega^{k/2} Y_k]   (4.10)
for k = 0, 1, ..., N/2 - 1. These are the combine formulas needed to compute the
transform of a QCS sequence from the transform of its nonredundant D subse-
quence. Equations (4.9) and (4.10) are called a "QCSD" combine.
Finally, to complete the algorithm for CS data, it remains to find the
equations by which the transform of a nonredundant D sequence may be found
from the transforms of its C subsequences. Since these are merely complex-
valued sequences, the transforms of the C subsequences may be combined using
the Cooley-Tukey combine formulas to produce the transform of the D sequence
(in this setting it will be called the "DCC" combine):

    X_k = Y_k + \omega^k Z_k   (4.11)

and

    X_{N/2+k} = Y_k - \omega^k Z_k ,   (4.12)

for k = 0, 1, ..., N/2 - 1. This may be done with no extra storage required since
the storage locations that would otherwise be assigned to the redundant D*
transform are available.
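In code, the DCC combine is simply the classical radix-2 combine applied to the two complex half-transforms. The following is a minimal numpy sketch (the function name is hypothetical, and it assumes the same \omega = e^{-2\pi i/N} convention that numpy's FFT uses):

```python
import numpy as np

def dcc_combine(Y, Z):
    # "DCC" combine, eqs (4.11)-(4.12): X_k = Y_k + w^k Z_k and
    # X_{N/2+k} = Y_k - w^k Z_k, with w = exp(-2*pi*i/N).
    M = len(Y)                     # M = N/2
    N = 2 * M
    w = np.exp(-2j * np.pi * np.arange(M) / N)
    return np.concatenate([Y + w * Z, Y - w * Z])

rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)  # a D (plain complex) sequence
X = dcc_combine(np.fft.fft(x[0::2]), np.fft.fft(x[1::2]))
assert np.allclose(X, np.fft.fft(x))
```

The assertion confirms that combining the transforms of the even- and odd-indexed subsequences reproduces the full transform.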
The Ordered-to-Scrambled Algorithms
It is generally convenient to have an algorithm that operates on the
input data in natural order, producing the coefficients in scrambled order.
Briggs [6] developed such algorithms for sequences which are real, quarter wave
even, and quarter wave odd. The derivations for these transforms are collected
here. Also, following his lead, an ordered-to-scrambled algorithm for an E
sequence is developed. The process by which these algorithms are derived is to
formally invert the corresponding scrambled-to-ordered algorithm. For example,
the algorithm just discussed takes a scrambled CS sequence to its R transform
in natural order. What is desired is the opposite: an algorithm that works on R
data in natural order to obtain scrambled CS coefficients.
An Ordered to Scrambled Algorithm for R Sequences
To derive this algorithm it is necessary to formally invert the Briggs
algorithm for CS sequences. Since an R sequence is the transform of a CS
sequence, the inverted algorithm follows Figure 4.2 backwards, from the right
side to the middle bar. Beginning with the transform of a CS sequence in
natural order (which is the input R sequence), it is possible to "uncombine" it
into the transforms of its CS and QCS subsequences. These in turn are uncom-
bined into the transforms of their subsequences, and so on. After log_2 N passes
through the data, length-one transforms are produced, which are the transform
coefficients, in scrambled order, of the original R sequence.
It is necessary to formally invert the CSQCS, QCSD, and DCC combine
formulas from the scrambled-to-ordered transform for CS sequences. The
CSQCS combine relations, equations (4.7) and (4.8), are easy to invert, and pro-
duce the uncombine relations
    Y_k = \tfrac{1}{2}\left(X_k + X_{N/2+k}\right), \quad k = 0, 1, \ldots, N/2-1   (4.13)

and

    \hat{Z}_k = \tfrac{1}{2}\left(X_k - X_{N/2+k}\right), \quad k = 0, 1, \ldots, N/2-1 .   (4.14)
Inverting the QCSD combine formulas (4.9) and (4.10) is a bit more tedious,
leading to the uncombine relations

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(\hat{X}_k\cos\tfrac{\pi k}{N} - \hat{X}_{N/2+k}\sin\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/2-1   (4.15)

and

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(\hat{X}_k\sin\tfrac{\pi k}{N} + \hat{X}_{N/2+k}\cos\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/2-1 .   (4.16)
Inverting equations (4.11) and (4.12), the DCC combine formulas, and at the
same time separating real and imaginary parts for storage in a real array, leads
to the following four relations:

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(\mathrm{Re}[X_k] + \mathrm{Re}[X_{N/2+k}]\right)   (4.17)

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{N/2+k}]\right)   (4.18)

    \mathrm{Re}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2+k}]\right)\cos\tfrac{2\pi k}{N} - \left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{N/2+k}]\right)\sin\tfrac{2\pi k}{N}\right\}   (4.19)

    \mathrm{Im}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2+k}]\right)\sin\tfrac{2\pi k}{N} + \left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{N/2+k}]\right)\cos\tfrac{2\pi k}{N}\right\}   (4.20)

where k = 0, 1, ..., N/2 - 1.
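The four relations can be checked numerically: applied to the transform of a complex (D) sequence, they should recover the transforms of its two C subsequences. A numpy sketch (not the thesis code):

```python
import numpy as np

# Check of the DCC uncombine, eqs (4.17)-(4.20): recover Y and Z from X.
rng = np.random.default_rng(2)
N = 16
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # a D (plain complex) sequence
X = np.fft.fft(x)
k = np.arange(N // 2)
t = 2 * np.pi * k / N
dR = X[k].real - X[N//2 + k].real            # Re[X_k] - Re[X_{N/2+k}]
dI = X[k].imag - X[N//2 + k].imag            # Im[X_k] - Im[X_{N/2+k}]
Y = 0.5 * ((X[k].real + X[N//2 + k].real) + 1j * (X[k].imag + X[N//2 + k].imag))
Z = 0.5 * ((dR * np.cos(t) - dI * np.sin(t)) + 1j * (dR * np.sin(t) + dI * np.cos(t)))
assert np.allclose(Y, np.fft.fft(x[0::2]))   # transform of the even C subsequence
assert np.allclose(Z, np.fft.fft(x[1::2]))   # transform of the odd C subsequence
```

In complex form the same computation is Y_k = (X_k + X_{N/2+k})/2 and Z_k = \omega^{-k}(X_k - X_{N/2+k})/2.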
These uncombine formulas provide the tools necessary to calculate the
FFT of an R sequence, beginning with the data in natural order. The sequence
is uncombined into the transforms of a CS and a QCS pair of subsequences
using equations (4.13) and (4.14). These same formulas are then applied to
uncombine the first of the subsequences thus generated into two smaller subse-
quences, while the transform of the QCS sequence generated in the first step is
converted into the transform of its D subsequence using equations (4.15) and
(4.16). At the next step the sequences thus far generated are uncombined
again. The transform of the D sequence will be uncombined into two C
transforms using equations (4.17), (4.18), (4.19), and (4.20). Eventually
sequences of length one are produced, which are the desired CS coefficients, in
scrambled order, of the original R sequence.
An Ordered-to-Scrambled Algorithm for E Sequences
The method by which Briggs developed the algorithm just presented for
calculating the FFT of an ordered R sequence can be utilized to derive the com-
pact symmetric transform of an E sequence in natural order. In this case the
algorithm that is formally inverted is the Swarztrauber algorithm for E
sequences in scrambled order. Since an E sequence is itself the transform of
another E sequence, the inverted algorithm can be thought of as Figure 4.1 run-
ning from right to left. The inverted algorithm begins with the transform of an
E sequence in natural order, which is represented by the right side of the figure.
This is then uncombined into the E and QE transforms of its subsequences.
These in turn are uncombined into the transforms of their subsequences, and so
on. Again, after log_2 N passes through the data, length-one transforms are pro-
duced, which are the transform coefficients, in scrambled order, of the original E
sequence.
To invert the Swarztrauber algorithm, it is necessary to formally invert
the EQE, RRQE, and RtoR combine formulas. The EQE combine formulas,
equations (4.5) and (4.6), are easily inverted, leading to the uncombine formulas
    Y_k = \tfrac{1}{2}\left(X_k + X_{N/2-k}\right), \quad k = 0, 1, \ldots, N/4   (4.21)

and

    \hat{Z}_k = \tfrac{1}{2}\left(X_k - X_{N/2-k}\right), \quad k = 0, 1, \ldots, N/4 - 1 .   (4.22)
Inverting equations (3.3) and (3.4), the RRQE formulas, produces the uncom-
bine relations

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(X_k\cos\tfrac{\pi k}{N} + X_{N/2-k}\sin\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4   (4.23)

and

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(X_k\sin\tfrac{\pi k}{N} - X_{N/2-k}\cos\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4 - 1 .   (4.24)
The RtoR combine formulas, equations (4.1) and (4.2), may be inverted and the
real and imaginary parts separated for storage in a real array, leading to:

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(\mathrm{Re}[X_k] + \mathrm{Re}[X_{N/2-k}]\right)   (4.25)

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{N/2-k}]\right)   (4.26)

    \mathrm{Re}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2-k}]\right)\cos\tfrac{2\pi k}{N} - \left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{N/2-k}]\right)\sin\tfrac{2\pi k}{N}\right\}   (4.27)

    \mathrm{Im}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2-k}]\right)\sin\tfrac{2\pi k}{N} + \left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{N/2-k}]\right)\cos\tfrac{2\pi k}{N}\right\}   (4.28)

where the real parts of both \{Y_k\} and \{Z_k\} are calculated for k = 0, 1, ..., N/4 and
the imaginary parts of both sequences are calculated for k = 0, 1, ..., N/4 - 1.
These formulas are recursively applied to the E sequence in natural
order to eventually produce the scrambled coefficients of its transform, which is
itself an E sequence. During each pass through the data, the first uncombine
performed consists of equations (4.21) and (4.22), the inverted EQE combine,
separating the transform of an E sequence into the transform of a half length E
sequence and a QE sequence. Then the transform of a QE sequence is uncom-
bined, using equations (4.23) and (4.24), into the transform of its nonredundant
R subsequence. All other uncombines on each pass use equations (4.25), (4.26),
(4.27), and (4.28), the inverted RtoR formulas, to produce the transforms of two
R subsequences from their parent R subsequence. This process continues until
the length one sequences are produced, which are the real and even elements, in
scrambled order, of the transform of the original E sequence.
An Ordered-to-Scrambled Transform for O Sequences
Earlier in this chapter it was shown that the Swarztrauber algorithm for
E sequences has a near twin for O sequences, and that Figure 4.1 for E
sequences can also be used to examine the algorithm for O sequences, with the
substitutions of O for E and QO for QE. The alterations in the computation are
also simple, substituting the OQO formulas for the EQE (these two sets are
actually identical), the RRQO for the RRQE, and using the RtoR formulas as
originally derived.
An ordered-to-scrambled algorithm can then be derived by formally
inverting this twin algorithm. Since the EQE and OQO combine formulas are
identical, the uncombine formulas will also be identical. The RtoR formulas for
the symmetric scrambled-to-ordered O transform are the same ones used in the
Swarztrauber E transform so the inverted algorithm will use the same uncom-
bine formulas. Only the RRQO combine formulas differ from their counter-
parts, and that difference is slight. Inverting the RRQO formulas produces the
corresponding uncombine formulas

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(X_k\sin\tfrac{\pi k}{N} - X_{N/2-k}\cos\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4

    \mathrm{Im}[Y_k] = -\tfrac{1}{2}\left(X_k\cos\tfrac{\pi k}{N} + X_{N/2-k}\sin\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4 - 1 .
The algorithm proceeds as did the algorithm for ordered E sequences.
With each pass through the data, the first uncombine performed is the inverted
OQO, separating the transform of an O sequence into the transforms of its half
length O and QO subsequences. The second uncombine of each pass alters the
transform of a QO sequence into the transform of its nonredundant R subse-
quence, using the RRQO uncombine formulas. All other uncombines on each
pass use the inverted RtoR formulas to produce the transforms of two R subse-
quences from their parent R subsequence. This process continues until the
length one sequences are produced, which are the purely imaginary elements, in
scrambled order, of the transform of the original O sequence.


CHAPTER V
A COMPACT SYMMETRIC TRANSFORM FOR REAL AND EVEN
DATA ON A SHARED MEMORY MACHINE
Many of the algorithms presented in the previous chapter were derived
and published at the time the current study was begun, but none had actually
been implemented in software. As a prelude to implementing parallel versions
of compact symmetric algorithms, serial versions were needed.
Serial Implementation of the Swarztrauber E Transform
Consider the Swarztrauber E algorithm, which is displayed schematically
in Figure 4.1. This algorithm operates on an E sequence in scrambled order to
produce its transform, another E sequence, in natural order. After the data has
been scrambled, the first recombination during each pass computes the
transform of an E sequence from the transforms of an E and a QE sequence
using the EQE formulas
    X_k = Y_k + \hat{Z}_k , \quad k = 0, 1, \ldots, N/4

and

    X_{N/2-k} = Y_k - \hat{Z}_k , \quad k = 1, \ldots, N/4 .
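These EQE formulas can be checked quickly with a library FFT, assuming \omega = e^{-2\pi i/N} so that \hat{Z}_k = \omega^k Z_k is the strictly real representation of the QE transform (a sketch only):

```python
import numpy as np

N = 32
n = np.arange(N)
x = np.cos(2 * np.pi * n / N) + 0.5 * np.cos(6 * np.pi * n / N)  # x_n = x_{N-n}: an E sequence
X = np.fft.fft(x)                       # real and even, since x is real and even
Y = np.fft.fft(x[0::2])                 # transform of the E subsequence
Z = np.fft.fft(x[1::2])                 # transform of the QE subsequence
Zhat = np.exp(-2j * np.pi * np.arange(N // 2) / N) * Z  # w^k Z_k, strictly real
assert np.allclose(Zhat.imag, 0, atol=1e-10)
k = np.arange(N // 4 + 1)
assert np.allclose(X[k], Y[k] + Zhat[k].real)           # EQE: X_k = Y_k + Zhat_k
assert np.allclose(X[N // 2 - k], Y[k] - Zhat[k].real)  # EQE: X_{N/2-k} = Y_k - Zhat_k
```

The check also confirms that only the real half-length quantities Y and \hat{Z} are ever needed.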
The second combination on each pass is an RRQE, that converts the transform
of a nonredundant R subsequence into the transform of its parent QE sequence
using the formulas
    X_k = 2\,\mathrm{Re}[\omega^{k/2} Y_k] , \quad k = 0, 1, \ldots, N/4

and

    X_{N/2-k} = -2\,\mathrm{Im}[\omega^{k/2} Y_k] , \quad k = 0, 1, \ldots, N/4 - 1 .
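The values produced here are the strictly real representation of the QE transform. A numpy check (taking N to be the length of the QE sequence itself, and \hat{X}_k = \omega^{k/2} X_k as its real representation; a sketch under those assumptions):

```python
import numpy as np

N = 16
h = np.random.default_rng(1).standard_normal(N // 2)
z = np.concatenate([h, h[::-1]])              # z_n = z_{N-1-n}: a QE sequence
X = np.fft.fft(z)
Xhat = np.exp(-1j * np.pi * np.arange(N) / N) * X   # w^{k/2} X_k, strictly real
assert np.allclose(Xhat.imag, 0, atol=1e-10)
Y = np.fft.fft(z[0::2])                       # transform of the nonredundant R subsequence
k = np.arange(N // 4 + 1)
w = np.exp(-1j * np.pi * k / N)               # w^{k/2}
assert np.allclose(Xhat[k].real, 2 * np.real(w * Y[k]))
assert np.allclose(Xhat[N // 2 - k].real, -2 * np.imag(w * Y[k]))
```

The even-indexed samples of a QE sequence form its nonredundant R subsequence; the odd-indexed samples are their reversal, which is why Y alone suffices.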
After the EQE and RRQE combines have been performed on a given pass, all
remaining pairs of sequences are the transforms of two R subsequences which
must be combined, using the RtoR formulas

    X_k = Y_k + \omega^k Z_k

and

    X_{N/2-k} = \bar{Y}_k - \omega^{-k}\bar{Z}_k

for k = 0, 1, ..., N/4 - 1.
Figure 5.1 is the data flow diagram that results from implementing these
combine formulas on a sequence of length 32 in a complex array of length 32.
The E sequence is in scrambled order on the left side of the diagram, and the
transform is in natural order on the right. The EQE, RRQE, and RtoR formu-
las take advantage of the symmetries in the intermediate transforms so that the
application of any of these combines requires as input only one-half of each of
the subsequences involved. Only one half the output is computed, as the
remainder is available by symmetry. The effects are clearly shown by the large
number of data locations that have no involvement in the transform. All of the
original input data, all of the output data, and many of the intermediate quanti-
ties are strictly real. Some of the intermediate values are complex, however. By
separating the real and imaginary parts of these values and using some of the
redundant storage locations generated by the symmetries, the entire algorithm
Figure 5.1
Data flow diagram for scrambled-to-ordered E transform, N = 32. During each
pass through the data, the first set of lines is the EQE combine, the second the
RRQE combine, and all other sets are RtoR combines. The input is a length 32
sequence in scrambled order. Much of the data is redundant, unused in the
transform. The input data is reordered so that all non-redundant data is taken
from the first half of the sequence.
can be performed in a real array of length 17, or N/2 + 1.
By removing the redundant data and storing all complex subsequences
in a real array, the compact symmetric FFT can be obtained. Figure 5.2 is a
data flow diagram of the actual implementation, computing the transform of a
length 32 sequence using a 17-point storage array. Several features are apparent
in Figure 5.2. The input data is in scrambled order, but it is not the order that
one obtains by simple decimation (odd/even splitting). Rather, this scrambled
order is a result of both the decimation and the removal of the redundant data.
As a result, the algorithm for scrambling the data for use in this transform is
somewhat more complicated than that required for a conventional decimation-
in-time algorithm.
The algorithm appears to be much more complicated in the last three
passes than in the first two passes. The simplicity of the first two passes is a bit
deceptive. It results from the fact that a conjugate symmetric sequence, where
X_k = \bar{X}_{N-k}, has strictly real values as its X_0 and X_{N/2} elements. In the first two
passes, the CS sequences generated are respectively of length 2 and 4, so that in
these two passes the RtoR combines are dominated by strictly real values.
Beginning with the third pass the complexity of the RtoR butterfly is apparent.
(If the original input sequence were longer than 32, the flow diagram would be
dominated by these RtoR butterflies.)
The structure of the RRQE combine is best seen in the fourth pass,
where RRQE butterflies make up the bottom half of the diagram. An impor-
tant feature of the combine is that the stride is not constant. The first butterfly
has a stride of 6, the second a stride of 4, the third a stride of 2. This forms the
"star" pattern of connections in the flow diagram, and is due to the storage
Figure 5.2
Storage and data flow diagram for compact scrambled-to-ordered real and even
transform, N = 32. R and I refer to the real and imaginary parts of the com-
plex quantity. During each pass through the data, the first set of lines is the
EQE combine, the second the RRQE combine, and all other sets are RtoR com-
bines.
pattern of the input and output sequences involved in the RRQE combine. The
complex sequences that form the input to the RRQE combine are arranged with
all the real parts stored consecutively, followed by all the imaginary parts stored
in reverse order. After the RRQE combine the resulting transform of the QE
sequence is stored with the real elements in natural order. Two other possi-
ble storage schemes suggest themselves. The most obvious is to keep the real
and imaginary parts of the complex values together, as they would be in a com-
plex array. This results in uniform butterfly strides of 1, but also forces an
unusually complicated ordering on \{Z_k\}. Secondly, the real parts could be
stored consecutively followed by the imaginary parts in natural order. This also
results in a constant stride length, but again forces an extremely complicated
storage pattern on \{Z_k\}.
The EQE combines, best examined in the final pass in Figure 5.2, have
two "star" patterns. The one in the lower half of the combine is a result of the
storage order in the \{Z_k\} sequence generated by the RRQE combine of the pre-
vious pass. The one in the upper part of the combine is generated by the back-
wards running index, N/2 - k, which appears in all of the combine formulas. The
cumulative effect of these patterns on the EQE combine is interesting. The two
transform elements F0 and Zq, occupying storage locations 0 and 9, are used in
the EQE butterfly to produce the transform elements X0 and X16, which are
stored in locations 0 and 16. So this particular butterfly must read the contents
of locations 0, 9, while writing to locations 0 and 16. Similarly, the butterfly
that combines the transform elements Y_7 and Z_7 to form the elements X_7 and
X_9 must read the contents of locations 7 and 16 while writing to locations 7 and
9. Evidently, individual EQE butterflies cannot be performed in-place. A useful
observation, though, is that these two butterflies together read and write to
locations 0, 7, 9, and 16, and that no other butterfly needs access to any of these
locations. The EQE combine can be performed in-place so long as the butter-
flies are performed in the appropriate pairs. Such a pair is termed an EQE
unit.
The serial implementation of the Swarztrauber E algorithm is described
by the pseudocode
set LEN = 1
For j = 0 to log2 N - 1
    double LEN
    For k = 0 to LEN/8
        perform EQE units
    If j < log2 N - 1
        for k = 0 to LEN/4
            perform RRQE butterflies
    If j < log2 N - 2
        for m = 1 to N/(2 LEN) - 1
            for k = 0 to LEN/4
                perform RtoR combines for the
                m-th RtoR sequence
stop
The actual FORTRAN code for the algorithm, the subroutine
SWACOS, is given as Appendix 1. (The name SWACOS derives from
Swarztrauber COSine transform. The DFT of an E sequence is the discrete
cosine transform of that sequence.)
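That identification can be checked directly: the DFT of a real and even sequence is real, and each coefficient is a cosine sum (a small numpy sketch, not the thesis code):

```python
import numpy as np

N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * n / N) + 2.0      # real and even: x_n = x_{(N-n) % N}
X = np.fft.fft(x)
assert np.allclose(X.imag, 0, atol=1e-12)            # transform is strictly real
# each coefficient is a cosine sum, i.e. a discrete cosine transform value
C = np.cos(2 * np.pi * np.outer(np.arange(N), n) / N)
assert np.allclose(X.real, C @ x)
```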
Serial Implementation of the Ordered-to-Scrambled E Transform
In Chapter IV an ordered-to-scrambled algorithm for E sequences was
derived. The derivation consisted of formally inverting the Swarztrauber E algo-
rithm, and running Figure 4.1 backwards. The derivation continued by invert-
ing the EQE, RRQE, and RtoR butterfly formulas to produce the corresponding
uncombine formulas. The remainder of this chapter will be concerned with the
ordered-to-scrambled algorithm, so the names EQE, RRQE, and RtoR will
hereafter refer to the uncombine formulas. The EQE uncombine formulas
    Y_k = \tfrac{1}{2}\left(X_k + X_{N/2-k}\right), \quad k = 0, 1, \ldots, N/4

    \hat{Z}_k = \tfrac{1}{2}\left(X_k - X_{N/2-k}\right), \quad k = 0, 1, \ldots, N/4 - 1

produce the transforms of an E and a QE subsequence by uncombining the
transform of their parent E sequence. The RRQE uncombine, which produces
the transform of a nonredundant R sequence from the transform of its parent
QE sequence, is given by

    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(X_k\cos\tfrac{\pi k}{N} + X_{N/2-k}\sin\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(X_k\sin\tfrac{\pi k}{N} - X_{N/2-k}\cos\tfrac{\pi k}{N}\right), \quad k = 0, 1, \ldots, N/4 - 1 .
The RtoR uncombine formulas, uncombining the transform of an R
sequence into the transforms of its two R subsequences, are
    \mathrm{Re}[Y_k] = \tfrac{1}{2}\left(\mathrm{Re}[X_k] + \mathrm{Re}[X_{N/2-k}]\right)

    \mathrm{Im}[Y_k] = \tfrac{1}{2}\left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{N/2-k}]\right)

    \mathrm{Re}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2-k}]\right)\cos\tfrac{2\pi k}{N} - \left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{N/2-k}]\right)\sin\tfrac{2\pi k}{N}\right\}

    \mathrm{Im}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{N/2-k}]\right)\sin\tfrac{2\pi k}{N} + \left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{N/2-k}]\right)\cos\tfrac{2\pi k}{N}\right\}

where the real parts of both \{Y_k\} and \{Z_k\} are calculated for k = 0, 1, ..., N/4 and
the imaginary parts of both sequences are calculated for k = 0, 1, ..., N/4 - 1.
The implementation of this algorithm is quite similar to SWACOS, as
would be expected from the relationship of the algorithms to each other. Figure
5.3 is the flow diagram corresponding to Figure 5.2. The diagram looks very
nearly like the mirror image of the previous figure. In this algorithm the first
pass is entirely an EQE uncombine. The second pass features an EQE uncom-
bine followed by an RRQE uncombine. Beginning with the third pass, each pass
begins with an EQE uncombine, which is followed by an RRQE uncombine,
after which all uncombines are of the RtoR type. The storage patterns of
SWACOS have been retained. Consequently, the EQE butterflies must be per-
formed in two-butterfly units in the uncombine case as well. The pseudocode
description of this routine (which is named BRICOS, for Briggs Cosine
Transform) is
set LEN = N
For j = 0 to log2 N - 1
    For k = 0 to LEN/8
        perform EQE uncombine units
    If j >= 1
        for k = 0 to LEN/4
            perform RRQE uncombines
    If j >= 2
Figure 5.3
Storage and data flow diagram for compact real and even transform. N = 32,
taking ordered data to scrambled coefficients. R and I refer to the real and ima-
ginary parts of the complex quantity. During each pass through the data, the
first set of lines is the EQE uncombine, the second the RRQE uncombine, and
all other sets are RtoR uncombines.
        for m = 1 to 2^(j-1) - 1
            for k = 0 to LEN/4
                perform RtoR uncombines for the
                m-th RtoR sequence
    halve LEN
stop
The FORTRAN code is given as Appendix 2.
Savings from the Compact Symmetric Algorithm
To analyze the algorithm, let the passes through the data be indexed
j = 0, 1, ..., log_2 N - 2. The last pass, j = log_2 N - 1, is considered as a special
case. The scalar multiplication by one-half occurs in all of the formulas, and
may be performed at the end of the algorithm. For each pass through the data,
it is necessary to count the number of butterflies (or units) of each type, and to
compute the total number of operations performed. All of the work is per-
formed in strictly real arithmetic, so the operation count is for real operations.
The jth pass through the data requires (N/8)(1/2)^j EQE units, beginning
with pass j = 0, and each such unit entails 2 butterflies requiring 2 additions
each. The RRQE uncombine entails 4 multiplications and 2 additions, and on
the jth pass (N/4)(1/2)^j such butterflies are required, beginning with pass j = 1. The
RtoR uncombine consists of 4 multiplications and 6 additions. Beginning with
the pass j = 2, each RtoR sequence requires (N/4)(1/2)^j RtoR butterflies, and there
are 2^{j-1} - 1 such sequences. The last pass through the data is considered
separately because in this pass the sequences are all of length 2 and all the
butterfly types reduce to a simple butterfly requiring 2 additions. There are
N/4 such butterflies.
By summing the operation counts for each type of uncombine over the
number of passes through the data, an analytic expression can be written that
may be used to count the total number of operations:
    4 \sum_{j=0}^{\log_2 N - 2} \frac{N}{8}\left(\frac{1}{2}\right)^j
    + 6 \sum_{j=1}^{\log_2 N - 2} \frac{N}{4}\left(\frac{1}{2}\right)^j
    + 10 \sum_{j=2}^{\log_2 N - 2} (2^{j-1}-1)\,\frac{N}{4}\left(\frac{1}{2}\right)^j
    + 2\,\frac{N}{4} ,
where multiplications and additions are counted equally. The first term of this
equation is the operation count for the EQE units, the second the count for the
RRQE uncombines, and the third represents the operations involved in perform-
ing the RtoR butterflies. The final term reflects the work involved in the last
pass through the data. By summing the geometric series in each term, this
equation can be readily evaluated. The transform of an E sequence of length N
using the BRIGOS algorithm requires iVlogoiV2iVreal arithmetic operations,
and uses a total of +1 storage locations.
2
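The series can be summed numerically as a check on this count. The sketch below uses the idealized (possibly fractional) per-pass butterfly counts from the text; the exact sum also carries a trailing constant of +2 that is dropped above:

```python
import math

def bricos_op_count(N):
    """Sum the per-pass real-operation counts for the compact
    real-and-even FFT (idealized, fractional counts allowed)."""
    J = int(math.log2(N))
    total = 0.0
    for j in range(J - 1):                       # passes j = 0 .. log2(N)-2
        total += 4 * (N / 8) * 0.5**j            # EQE units: 4 additions each
        if j >= 1:
            total += 6 * (N / 4) * 0.5**j        # RRQE: 4 mults + 2 adds
        if j >= 2:
            total += 10 * (2**(j - 1) - 1) * (N / 4) * 0.5**j  # RtoR: 4 mults + 6 adds
    total += 2 * (N / 4)                         # last pass: N/4 two-add butterflies
    return total

for N in (16, 32, 1024):
    closed_form = 1.25 * N * math.log2(N) - 2 * N + 2
    assert abs(bricos_op_count(N) - closed_form) < 1e-6
```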
By comparison, performing the same transform of an E sequence by
placing the input sequence into the real part of a complex array and using a
conventional Cooley-Tukey FFT requires 2N storage locations and a real opera-
tion count of 5iVlog2iV. Thus the compact symmetric FFT requires one fourth
the storage as its conventional counterpart, and requires somewhat less than one
fourth the arithmetic. Performing the same transform by traditional pre- and
post-processing methods [4] utilizes more storage locations and entails
a somewhat greater real-operation count than the compact symmetric
transform.
The Parallel Algorithm
Once the serial algorithms are in hand, consideration may be given to
the construction of parallel versions. The nature of the parallel algorithms for
computing the transform of symmetric sequences is illustrated by examining a
shared memory parallel implementation of BRICOS (which is named BRIPAR).
Only this ordered-to-scrambled case is considered, but the commentary and
analysis extend readily to the scrambled-to-ordered SWACOS, although the
computational details differ slightly. Further, for shared memory computers,
the extensions are immediate to parallel algorithms for conjugate symmetric,
real, real and odd, and real quarter wave (even or odd) sequences. Chapter VI
will deal with the problem of implementing compact symmetric transforms on a
distributed memory architecture.
The basic assumption regarding the hardware is that the number of pro-
cessors is small compared to the sequence lengths (coarse grained processing),
and that all of the processors share a common memory. It is assumed that there
are no explicit communication costs in the algorithm. There will, however, be
some overhead that must be paid for fork-join operations, and there will be some
implicit communication cost in the form of memory bus contention. Since all
processors have equal access to all of the data, the algorithm is distributed
among the processors.
Two basic parallelization strategies present themselves. First, one might
compute the total amount of work to be performed during each pass through
the data, and attempt to balance the load as evenly as possible by giving each
processor different chunks of the algorithm. In general, the processors will then
do different types of work. For example, suppose two processors, p_0 and p_1, are
performing a typical pass through the algorithm. EQE butterflies are much
cheaper than RRQE, which are somewhat cheaper than RtoR. Also, there are
greater numbers of RtoR, and fewer EQE and RRQE, as the algorithm
proceeds. In order to balance the work load, p0 might have to perform all the
EQE, all the RRQE, and some of the RtoR, while p_1 performs only RtoR but-
terflies. As the number of butterflies of each type changes, the load distribution
must be recomputed at the beginning of each pass through the data.
The second parallelization strategy is a good deal simpler to implement,
and is the one that has been chosen for this study. At the beginning of the jth
pass through the data (j = 0, 1, ..., log_2 N - 1), copies of the subroutine are "forked"
to the processors. The "units" of type EQE are then distributed as evenly as
possible across the processors. If the current pass is not the first, RRQE butter-
flies are distributed as evenly as possible across the processors. After the second
pass, RtoR butterflies are required. The number of RtoR butterflies per
sequence decreases with each pass, but the number of sequences increases.
There are two cases that must be handled. If the number of RtoR butterflies
per sequence is greater than the number of processors, the algorithm distributes
the butterflies as evenly as possible across the processors, each of which strides
through the sequences performing its designated butterflies. This mode is called
scheduling on butterflies. If, however, there are fewer butterflies per sequence
than processors, then the algorithm distributes the sequences across the proces-
sors as best it can, and each processor must compute all the butterflies for each
of its sequences. This mode is called scheduling on sequences. At the end of
each pass the processors are joined in a synchronization step. If the current pass
is the last pass, all the butterflies are identical, and these are distributed across
the processors. A pseudocode description of this scheme for p processors is
set LEN = N
For j = 0 to log2 N - 2
    Deal out the LEN/8 EQE uncombine
    units to the p processors as evenly as possible
    Deal out the LEN/4 RRQE butterflies
    to the p processors as evenly as possible
    If LEN/4 >= p then
        Deal out the LEN/4 RtoR butterflies
        to the p processors as evenly as possible and let
        each processor stride through the 2^(j-1) - 1 sequences
        performing its butterflies.
    Else
        Deal out the 2^(j-1) - 1 RtoR sequences
        to the p processors as evenly as possible and let
        each processor perform the LEN/4
        RtoR butterflies for each sequence.
    Synchronize the p processors
    halve LEN
For j = log2 N - 1
    Deal out the N/4 Last Pass butterflies to the p
processors as evenly as possible
stop.
The FORTRAN code for this algorithm is given in Appendix 3.
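The switch between the two scheduling modes can be sketched in a few lines; the helper name and structure below are illustrative, not the thesis FORTRAN:

```python
import math

def schedule_rtor(N, p):
    """Sketch of the per-pass RtoR scheduling decision (hypothetical helper):
    returns a list of (pass, mode) pairs for passes that do RtoR work."""
    modes = []
    LEN = N
    for j in range(int(math.log2(N)) - 1):       # passes 0 .. log2(N)-2
        if j >= 2:                               # RtoR uncombines start on pass 2
            butterflies_per_seq = LEN // 4
            mode = "butterflies" if butterflies_per_seq >= p else "sequences"
            modes.append((j, mode))
        LEN //= 2
    return modes

# with N = 256 and p = 8 the algorithm starts scheduled on butterflies and
# switches to scheduling on sequences once LEN/4 drops below p
print(schedule_rtor(256, 8))
```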
Decreases in Parallelism
Considering the chosen parallel strategy, there are two causes of
decreased parallelism. The first is simple divisibility. When the number of pro-
cessors does not evenly divide the number of work units to be performed, there
will be a time in which some processors are busy while others must wait. The
second cause is the duplication of effort required when the algorithm switches
from scheduling on butterflies to scheduling on sequences. In scheduling on but-
terflies, each processor calculates only one set of cosines for each set of butter-
flies it performs. When scheduled on sequences, however, all of the processors
must calculate all of the cosines for each sequence, implying duplication of
effort.
Complexity of the Parallel Algorithm
To predict the speedup due to the parallel implementation, considera-
tion must be given to several factors: the changing amount of work of each
uncombine type, the cost of the change from scheduling on butterflies to
scheduling on sequences, the divisibility problems, and the cost of the fork-join
operation. This leads to an analytic expression involving six terms: the cost of
the fork-joins, the cost of the EQE units, the cost of the RRQE butterflies, the
cost of the RtoR scheduled on butterflies, the cost of the RtoR scheduled on
sequences, and the cost of the last pass through the data.
The jth pass through the data requires (N/8)(1/2)^j EQE units, beginning
with pass j=0, and each such unit entails 2 butterflies requiring 2 additions
each. These units are distributed among the processors as evenly as possible.
During each pass the amount of time spent performing EQE units can be readily
determined. It is the time required to execute four additions multiplied by the
least integer greater than or equal to the ratio of units to processors. Letting a
represent the time required to execute a single addition, the total time spent
performing EQE units for the transform is obtained by summing over the passes
through the data, producing the term
    4a \sum_{j=0}^{\log_2 N - 2} \left\lceil \frac{(N/8)(1/2)^j}{p} \right\rceil .
A similar argument is followed to determine the amount of time spent
performing RRQE uncombines. On the jth pass, (N/4)(1/2)^j such butterflies are
required, beginning with pass j = 1. Letting A represent the time required to
perform the 4 multiplications and 2 additions required by an RRQE uncombine
leads to the term

    A \sum_{j=1}^{\log_2 N - 2} \left\lceil \frac{(N/4)(1/2)^j}{p} \right\rceil
for the total time spent performing RRQE uncombines.
The RtoR uncombine consists of 4 multiplications, 6 additions, and
requires 2 calls to the transcendental library. The time spent performing the
arithmetic (excluding the cosines) of a single RtoR uncombine is designated B_1.
Beginning with the pass j = 2, each RtoR sequence requires (N/4)(1/2)^j RtoR butter-
flies, and there are 2^{j-1} - 1 such sequences. Two different terms are required to
express the time spent in RtoR calculations, one for the portion of the algorithm
that is scheduled on butterflies, and one for the portion scheduled on sequences.
Only two cosines are required by any processor while scheduled on butterflies, so
the amount of time spent in this situation is given by
    \sum_{j=2}^{LT-1} \left( B_1\,(2^{j-1}-1) \left\lceil \frac{(N/4)(1/2)^j}{p} \right\rceil + 2c \right),
where LT is the index of the first pass in which the RtoR uncombines are
scheduled on sequences rather than butterflies, and c is the cost of a single
cosine evaluation. Letting B_2 represent the labor
of an RtoR uncombine including the calls to the cosine library, the time spent in
RtoR computation while scheduled on sequences is
    B_2 \sum_{j=LT}^{\log_2 N - 2} \left\lceil \frac{2^{j-1}-1}{p} \right\rceil \frac{N}{4}\left(\frac{1}{2}\right)^j .
The last pass through the data is simple to analyze because in this pass
the sequences are all of length 2 and all the butterfly types reduce to a simple
butterfly requiring 2 additions. There are N/4 such butterflies, which are distri-
buted as evenly as possible among the processors, leading to the term

    2a \left\lceil \frac{N/4}{p} \right\rceil
for the time spent in the last pass.
The time spent performing forking operations is given by the expression
    T_f(p,N) = \alpha_1 + \beta_1(p-1) + (\log_2 N - 1)\left(\alpha_2 + \beta_2(p-1)\right),

where \alpha_1 is the cost of the first fork on the first processor and \beta_1 is the cost per
additional processor for the first fork. All succeeding fork calls have a cost of \alpha_2
for the first processor and \beta_2 for each additional processor.
Thus the six-term complexity equation that models the performance of
the parallel algorithm is
    T_p = T_f(p,N)
        + 4a \sum_{j=0}^{\log_2 N - 2} \left\lceil \frac{(N/8)(1/2)^j}{p} \right\rceil
        + A \sum_{j=1}^{\log_2 N - 2} \left\lceil \frac{(N/4)(1/2)^j}{p} \right\rceil
        + \sum_{j=2}^{LT-1} \left( B_1\,(2^{j-1}-1) \left\lceil \frac{(N/4)(1/2)^j}{p} \right\rceil + 2c \right)
        + B_2 \sum_{j=LT}^{\log_2 N - 2} \left\lceil \frac{2^{j-1}-1}{p} \right\rceil \frac{N}{4}\left(\frac{1}{2}\right)^j
        + 2a \left\lceil \frac{N/4}{p} \right\rceil .
Simplifying the Performance Model
The complexity equation is difficult to analyze because of the greatest
integer function that occurs in most of the terms. In cases where the number of
processors is a power of two, the greatest integer functions are easily computed,
and after some algebraic labor, the complexity equation reduces to
    T_p = T_f(p,N) + \frac{B_1}{8}\,\frac{N}{p}\log_2 N + \frac{B_2 - B_1}{8}\,\frac{N}{p}\log_2 p + \frac{(3a + c - B_1)N}{4p}
        + (A + 4a + 2c)\log_2 p - 2c\log_2 N - A + 2c .
This equation can then be used to predict the performance of the algo-
rithm. It is worthwhile to examine the equation qualitatively prior to examining
numerical results. Regrouping the terms of the equation, the structure of the
performance model is seen to consist of five dominant terms:
$$T_p = T_f(p,N) + O\!\left(\frac{N}{p}\log_2 N\right) + O\!\left(\frac{N}{p}\right) + O\!\left(\frac{N}{p}\log_2 p\right) + O\!\left(\log_2 p\right)\,.$$
Each of the terms of this equation can be identified with the phenomenon it
represents. The first, Tf(p,N), is the overhead required to fork processes. The
next two terms, O((N/p) log2 N) + O(N/p), together represent perfect speedup relative
to the serial algorithm. The remaining two terms reflect decreased parallelism.
The first, O((N/p) log2 p), is the amount of time spent in the duplication of
effort in changing from RtoR scheduled on butterflies to RtoR scheduled on
sequences. The last term, O(log2 p), represents the amount of time spent in
EQE and RRQE butterflies after the sequences become so short that there are
fewer butterflies than there are processors.
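The qualitative behavior of the five dominant terms can be sketched numerically. The following Python fragment is illustrative only: the coefficients c1 through c4 and the fork-overhead value are hypothetical placeholders, and only the asymptotic shape of each term comes from the model above. Evaluating the terms separately makes it easy to see how the O((N/p) log2 p) duplication term grows relative to the perfect-speedup terms as p increases.

```python
import math

def model_terms(N, p, c1=1.0, c2=1.0, c3=1.0, c4=1.0, fork=0.0):
    """Five dominant terms of the performance model:
    T_p = T_f + O((N/p)log N) + O(N/p) + O((N/p)log p) + O(log p).
    The coefficients c1..c4 and `fork` are illustrative placeholders, not
    values derived in the text."""
    return (fork,
            c1 * (N / p) * math.log2(N),   # perfect-speedup term
            c2 * (N / p),                  # perfect-speedup term
            c3 * (N / p) * math.log2(p),   # butterfly-to-sequence duplication
            c4 * math.log2(p))             # short-sequence EQE/RRQE butterflies
```

For fixed N, the ratio of the fourth term to the second grows like log2 p, which is consistent with the loss of efficiency observed below for short sequences on many processors.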
The Parallel Implementation
BRIPAR, the parallel algorithm taking ordered E data to scrambled
coefficients, was implemented on a shared memory Sequent Balance multiprocessor.
The maximum number of processors available to one user was 23. All of
the processors had access, through a common bus, to all of the data. To deter-
mine predicted performance curves, the timing parameters of the machine were
obtained from the Sequent documentation, and then verified by experiment.
Timings of the actual transforms were obtained for sequences of various
lengths, and speedups compared with the values predicted by the performance
model. All of the timings were obtained many times, and the numbers and figures
given are averaged values. The timings proved to be remarkably reproducible
from one experiment to the next. The results are shown in two separate
charts: Figure 5.4 shows the performance characteristics of long sequences and
Figure 5.5 displays those of shorter sequences. The actual timing data are given
in Appendix 4.
For very long sequences (N = 2^15, N = 2^16, N = 2^17), the computer implementation
of the algorithm performed very well. There is good speedup for all
sequence lengths, and speedup generally increases with increasing processors.
The maximum speedup achieved was just over 17, occurring on the longest
sequence when transformed on 21 processors. Model speedups were obtained for
those cases in which the number of processors is a power of two. The model
speedups for the various sequence lengths are indicated with circles, squares, and
triangles. For long sequences, the actual speedup very nearly matches the
model speedup.
On shorter sequences (N = 2^14, N = 2^13, N = 2^12), the computer implementation
performed less well, both from a standpoint of measured speedup
alone, and when compared with the model speedup. In all cases, there is a signi-
ficant decrease in efficiency as the number of processors is increased, and on
each curve there is an "optimal" number of processors, after which the computa-
tion requires more time to perform as the number of processors is increased. On
a sequence of length N = 2^14, for example, the best performance was achieved
using 17 processors, resulting in a speedup of approximately 8. On an N = 2^13-point
sequence, however, the best results occurred with 9 processors, but with a
speedup of only 5. Additionally, as sequence lengths become shorter, the actual
performance differs more and more from the predicted curve. This may be
attributed to two factors. First, the overhead of loop indexing is not included in
the model. As sequences become shorter, the loop indexing represents an
increasing fraction of the algorithm. A more significant factor is related to the
memory management of the Sequent Balance. Essentially, as sequences become


Figure 5.4
Measured Speedup for long sequences (E transform on shared memory).
Speedup curves are shown for N = 2^17 (#), N = 2^16 (x), and N = 2^15 (+). Perfect
speedup is represented by the diagonal line. Theoretical speedup is plotted
at p = 2, 4, 8 for N = 2^17 (triangle), N = 2^16 (square), and N = 2^15 (circle).
At p = 4 only the square is plotted, as all three computed values fell within the
size of the square.


Figure 5.5
Measured Speedup for "short" sequences (E transform on shared memory)
Speedup curves are shown for N = 2^14 (#), N = 2^13 (x), and N = 2^12 (+). Perfect
speedup is represented by the diagonal line. Theoretical speedup is plotted
at p = 2, 4, 8 for N = 2^14 (solid triangle), N = 2^13 (solid square), and N = 2^12
(solid circle). At p = 4 only the square is plotted, as all three computed values
fell within the size of the square.


shorter, memory accesses by the processors become more frequent, and the bus
becomes saturated. Consequently, performance is degraded as the processors
begin to compete with each other for access to the memory. (It should be noted
that the new generation of Sequent multiprocessors, the Symmetry family, use a
different memory management scheme designed specifically to eliminate this
effect.)
In all cases parallelization has significantly improved the performance of
the symmetric transform. It is conjectured that optimizing the code will lessen
the gap between theoretical and actual performance for short sequences.
Despite the degradation in the performance as the sequences become shorter,
shared memory parallelization of the compact symmetric transforms appears to
be highly useful and desirable.


CHAPTER VI
A PARALLEL COMPACT SYMMETRIC TRANSFORM FOR REAL
DATA ON A DISTRIBUTED MEMORY MACHINE
Chapter V was concerned with implementing parallel compact sym-
metric FFTs on a shared memory architecture. The other major category of
parallel processors is that of distributed memory architectures. The problems
faced in the implementation of algorithms for distributed memory machines are
entirely different from those encountered in shared memory implementations. The
architecture examined in this study is the hypercube, which is among the most
common distributed architectures.
The Hypercube
A hypercube consists of p = 2^d processors, where d is the dimension of
the machine. Each processor has access only to a local memory. Since there is
no shared memory, if a processor needs to access information held by some other
processor, the information must be passed as a message between the processors.
This implies that message passing between the processors is an integral part of
any hypercube algorithm. The situation is further complicated by the fact that
not all of the processors are connected to each other. In a hypercube of dimen-
sion d, a given processor can communicate directly with exactly d other proces-
sors. If a communication link exists between two given processors, they are said
to be nearest neighbors.
The connection scheme of the hypercube may be summarized briefly:
Let the processors be numbered from 0 to p - 1, and consider the d-digit binary
representation of the processor numbers. If the binary representations of two
processor numbers differ by exactly one bit then there is a connection between
them. For example, a dimension 2 cube consists of four processors whose
numbers are 00, 01, 10, and 11. Processor 10 is connected to 00 and 11, but not
to 01.
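The bit-flip rule above can be sketched in a few lines of Python (the function name is ours, not from the text):

```python
def neighbors(proc: int, d: int) -> list[int]:
    """Nearest neighbors of processor `proc` in a dimension-d hypercube.

    Two processors are connected exactly when the d-digit binary
    representations of their numbers differ in a single bit, so the
    neighbors are obtained by flipping each of the d bits in turn."""
    return [proc ^ (1 << bit) for bit in range(d)]

# The dimension-2 example from the text: processor 10 (decimal 2) is
# connected to 11 (decimal 3) and 00 (decimal 0), but not to 01.
print(neighbors(0b10, 2))   # prints [3, 0]
```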
The FFT on a Hypercube
The canonical FFTs developed in Chapter II fit extremely well on a
hypercube architecture. Figure 6.1 illustrates this fact with the data flow
diagram of a Cooley-Tukey transform of a 32-point sequence on a dimension 3
(p = 8) hypercube. The 32 data points have been divided among the processors
in chunks of four points per processor. The boundaries between processors are
indicated by the horizontal lines cutting across the flow diagram. Anytime a
butterfly connection line crosses one of these processor divisions, a communica-
tion is required. Since a processor always has immediate access to its own local
memory, the horizontal connections in the butterflies have been removed, to
preserve clarity. During the first two (log2 N - d) passes through the data, all of
the work performed by any processor is internal, requiring no information from
any other processor. During the remaining three passes, each processor must
receive a copy of all data held by some other processor before it can begin to
compute butterflies.
The communication required by the FFT fits perfectly on the hyper-
cube. In Figure 6.1, processor 4 must receive data from processor 5 in the third
pass, from processor 6 in the fourth pass, and from processor 0 in the final pass.
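The communication schedule just described for processor 4 follows directly from the bit-flip connection rule. A small sketch (the function name is ours; the pass numbering assumes the first communication pass flips bit 0, as in the N = 32 example):

```python
def comm_partner(proc: int, comm_pass: int) -> int:
    """Partner of `proc` in the comm_pass-th communication pass (1-based).

    Each of the last d passes pairs processors whose numbers differ in a
    single bit, so the partner is found by flipping one bit per pass."""
    return proc ^ (1 << (comm_pass - 1))

# Processor 4 (binary 100) exchanges data with 5 (101), 6 (110), and 0 (000),
# matching the three communication passes described for the N = 32 example.
print([comm_partner(4, m) for m in (1, 2, 3)])   # prints [5, 6, 0]
```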


Figure 6.1
Data flow diagram for the Cooley-Tukey transform on a hypercube, N = 32.
The horizontal lines represent the boundaries between processors. All of the
communication required by the algorithm is nearest neighbor communication.


But the binary representations of 0, 5, and 6 all differ from that of 4 by exactly
one bit, indicating that there is always a direct communication pathway avail-
able. It is apparent that the algorithm requires only nearest-neighbor communi-
cations. In the general case, there will be log2 N - d passes in which the work of
each processor is entirely internal, followed by d passes in which nearest-
neighbor communication is required.
During the last d passes, in which communication is required, the
amount of work each processor must perform changes somewhat. After receiv-
ing the necessary data from a neighbor, each processor computes only half of a
butterfly. In a Cooley-Tukey combine this means that one processor computes
$$X_k = Y_k + \omega^k Z_k\,,$$
while the neighboring processor computes
$$X_{k+\frac{N}{2}} = Y_k - \omega^k Z_k\,,$$
for each of the N/p butterflies the two processors are sharing. Therefore, both of
the processors have to perform the complex multiplication that is only calculated
once in a serial algorithm.
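The duplicated twiddle multiplication can be illustrated directly. In this sketch the input values are illustrative, and the sign convention chosen for the root of unity is an assumption:

```python
import cmath

N, k = 32, 3
w = cmath.exp(-2j * cmath.pi / N)     # an Nth root of unity (sign convention assumed)
Y_k, Z_k = 1.0 + 2.0j, 0.5 - 1.0j     # illustrative butterfly inputs

# After the exchange, each processor holds copies of Y_k and Z_k and forms
# the twiddle product w^k * Z_k independently -- the duplicated work.
t_A = (w ** k) * Z_k                  # computed on the first processor
t_B = (w ** k) * Z_k                  # recomputed on the neighboring processor

X_lo = Y_k + t_A                      # X_k, kept by the first processor
X_hi = Y_k - t_B                      # X_{k+N/2}, kept by the neighbor

# Consistency check against the ordinary serial butterfly:
assert abs((X_lo + X_hi) - 2 * Y_k) < 1e-12
```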
Performance of the Hypercube FFT
Modeling the Cooley-Tukey FFT for the hypercube is straightforward.
There are log2 N - d passes in which no communication is required. During these
passes each processor performs N/(2p) butterflies, each consisting of two complex
additions and one complex multiplication, or ten real operations. This is followed by
d passes in which each processor must communicate with another. During these
passes each processor performs one complex multiplication and one complex
addition, or eight real operations, on each of N/p half-butterflies. Additionally,
each processor must transmit N/p complex values to its neighbor. If the cost of
starting up a data transmission is designated α and the transmission cost is β
per complex value (where α and β are measured in flops), then the performance
of the hypercube Cooley-Tukey FFT is
$$T_p = 10\,\frac{N}{2p}\left(\log_2 N - d\right) + 8d\,\frac{N}{p} + d\left(\alpha + \beta\,\frac{N}{p}\right),$$
which simplifies to
$$T_p = \frac{5N}{p}\log_2 N + (3+\beta)\,\frac{N}{p}\log_2 p + \alpha\log_2 p\,.$$
The first term is a perfect speedup term, the second represents the effects of the
data transmissions and the extra multiplications caused by the half-butterflies,
and the last term is the cost of starting up data transmission on each pass in
which communication is required.
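Under the stated assumptions (α and β in flops, p = 2^d), the model and its simplification can be checked numerically. The function names here are ours:

```python
import math

def t_parallel(N: int, p: int, alpha: float, beta: float) -> float:
    """Hypercube Cooley-Tukey model, in flops:
    10*(N/(2p)) ops per internal pass over (log2 N - d) passes, 8*(N/p)
    flops per communication pass, plus a message start-up (alpha) and
    beta*N/p transmission cost per communication pass, with d = log2 p."""
    d = int(math.log2(p))
    return (10 * (N / (2 * p)) * (math.log2(N) - d)
            + 8 * d * (N / p)
            + d * (alpha + beta * N / p))

def speedup(N: int, p: int, alpha: float, beta: float) -> float:
    t_serial = 5 * N * math.log2(N)   # (N/2)*log2 N butterflies at 10 flops
    return t_serial / t_parallel(N, p, alpha, beta)
```

Expanding the first form and substituting d = log2 p reproduces the simplified three-term expression (5N/p) log2 N + (3+β)(N/p) log2 p + α log2 p, so the two forms agree exactly.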
The Compact E Transform and the Hypercube
Having seen how nicely a Cooley-Tukey FFT fits on the hypercube, it is
natural to wonder whether the compact symmetric algorithms are amenable to
hypercube processing. If an attempt were made to implement the BRICOS
algorithm on a hypercube, difficulties would arise immediately. It was shown in
Chapter V that the data flow diagram for BRICOS (Figure 5.3) is characterized
by "star" patterns of data connections. These patterns are caused by the
backward-running index -k, and by the storage patterns of the intermediate
sequences. Figure 6.2 is another version of the same data flow diagram for the
length 32 sequence; it indicates the difficulties encountered in attempting to put


Figure 6.2
Storage and data flow diagram for BRICOS on a hypercube, with N = 32. Data
flow is identical to that in Figure 5.3. Only those data connections that require
non-nearest-neighbor communications are shown. The BRICOS algorithm does
not map well onto the hypercube.


this algorithm on a hypercube. The processor divisions of the hypercube are
indicated as in Figure 6.1. Only those data connections that require non-nearest-neighbor
communication are shown. If a required data connection is either
nearest neighbor or internal it is not shown. It is evident that a sizeable portion
of the BRICOS algorithm does not fit well on the hypercube architecture.
Furthermore, there seems to be no obvious rearrangement of the data that
would eliminate the problems.
The backwards running index is a prominent feature of the algorithms
for E and O sequences, both ordered-to-scrambled and scrambled-to-ordered. It
also occurs in the Swarztrauber R algorithm, and therefore in the RtoR com-
bines that make up the bulk of the algorithms for QE and QO sequences.
Implementing any of these algorithms on a hypercube remains an open problem.
Two of the symmetric algorithms presented in Chapter IV are not
characterized by the backwards running index. The scrambled-to-ordered
transform for a CS sequence, and its formal inverse, the ordered-to-scrambled
transform for R sequences, are symmetric algorithms for which hypercube imple-
mentations may be found.
A Compact Hypercube Algorithm for R Sequences
An ordered-to-scrambled transform for R sequences was presented in the
previous chapter. It was derived by Briggs [6], who inverted the algorithm for
scrambled CS sequences to produce the new transform. Pictorially the algo-
rithm for R sequences consists of running Figure 4.2 from right to left. It con-
sists of formulas that uncombine the transform of a CS sequence into the
transforms of a CS and a QCS sequence, formulas that convert the transform of
a QCS sequence into the transform of its non-redundant D subsequence, and
formulas that uncombine the transform of a D sequence into the transforms of
its two complex subsequences. Since these uncombine formulas are produced by
inverting the scrambled CS algorithm, they are called the CSQCS, DQCS, and
DCC uncombines.
The CSQCS uncombine formulas are
$$Y_k = \tfrac{1}{2}\left(X_k + X_{k+\frac{N}{2}}\right), \qquad k = 0,1,\ldots,\tfrac{N}{2}\,,$$
and
$$Z_k = \tfrac{1}{2}\left(X_k - X_{k+\frac{N}{2}}\right), \qquad k = 0,1,\ldots,\tfrac{N}{2}-1\,,$$
while the DQCS uncombine relations are
$$\mathrm{Re}[Y_k] = \tfrac{1}{2}\left(X_k\cos\tfrac{2\pi k}{N} - X_{\frac{N}{2}+k}\sin\tfrac{2\pi k}{N}\right), \qquad k = 0,1,\ldots,\tfrac{N}{2}\,,$$
and
$$\mathrm{Im}[Y_k] = \tfrac{1}{2}\left(X_k\sin\tfrac{2\pi k}{N} + X_{\frac{N}{2}+k}\cos\tfrac{2\pi k}{N}\right), \qquad k = 0,1,\ldots,\tfrac{N}{2}-1\,.$$
The DCC uncombine formulas consist of the following four equations:
$$\mathrm{Re}[Y_k] = \tfrac{1}{2}\left(\mathrm{Re}[X_k] + \mathrm{Re}[X_{k+\frac{N}{2}}]\right)$$
$$\mathrm{Im}[Y_k] = \tfrac{1}{2}\left(\mathrm{Im}[X_k] + \mathrm{Im}[X_{k+\frac{N}{2}}]\right)$$
$$\mathrm{Re}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{k+\frac{N}{2}}]\right)\cos\tfrac{2\pi k}{N} - \left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{k+\frac{N}{2}}]\right)\sin\tfrac{2\pi k}{N}\right\}$$
$$\mathrm{Im}[Z_k] = \tfrac{1}{2}\left\{\left(\mathrm{Re}[X_k] - \mathrm{Re}[X_{k+\frac{N}{2}}]\right)\sin\tfrac{2\pi k}{N} + \left(\mathrm{Im}[X_k] - \mathrm{Im}[X_{k+\frac{N}{2}}]\right)\cos\tfrac{2\pi k}{N}\right\}$$
where k = 0,1,...,N/2 - 1. None of the formulas involves a backward-running
index, but that by itself does not guarantee that the algorithm will fit on the
hypercube. It remains to find a storage pattern that requires only nearest neigh-
bor communication, and to map the required communications onto the hyper-
cube.
The algorithm that accomplishes this is due to Sweet [7], who also pro-
vided a serial implementation. The CSQCS uncombine has the same data
access structure as the standard Cooley-Tukey combine in that the kth and
(k + N/2)th elements of the longer sequence are matched with the kth components
of the shorter sequences. The CSQCS uncombine fits on a hypercube exactly as
the Cooley-Tukey combine does.
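A minimal sketch of this access pattern follows. The halving formulas track the uncombine relations given above as reconstructed; the function name and exact index ranges are assumptions:

```python
def csqcs_style_uncombine(X):
    """Split a length-2m sequence into two length-m sequences using the
    Cooley-Tukey-style index pattern: element k is paired with element
    k + m, exactly as in a standard combine, so the data movement maps
    onto the hypercube in the same nearest-neighbor fashion."""
    m = len(X) // 2
    Y = [(X[k] + X[k + m]) / 2 for k in range(m)]
    Z = [(X[k] - X[k + m]) / 2 for k in range(m)]
    return Y, Z

# Round trip: the forward combine X[k] = Y[k] + Z[k], X[k+m] = Y[k] - Z[k]
# recovers the input, confirming this is a combine/uncombine pair.
X = [float(i) for i in range(8)]
Y, Z = csqcs_style_uncombine(X)
```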
The situation is more complicated in the case of the DQCS uncombine,
that converts a real sequence {X_k} into a complex sequence {Y_k}. There are
several possible storage patterns for the real and imaginary parts of {Y_k}, as was
observed in the case of the RtoR combine in the Swarztrauber E algorithm. On
the hypercube, where the data is distributed among the processors instead of
residing in global memory, the choice of storage pattern is critical. Sweet noted
that keeping the real and imaginary parts of {Y_k} together and performing the
uncombines in two butterfly units (much as the EQE butterfly in the real and
even algorithms must be performed) had the effect of preserving the Cooley-
Tukey data pattern in the DQCS uncombine. An example of this is given in
Figure 6.3A. An important result is that {Y_k} is stored in an even/odd
decimated order, with its even elements grouped together, followed by its odd
elements.
This storage pattern of the output sequence from the DQCS has a pro-
found effect on the structure of the algorithm. In the following pass the output


Figure 6.3
Storage order requirements of FFSWEET, imposed by the uncombine structures.
Figure 6.3A shows how the DQCS uncombine has the effect of separating
even- and odd-numbered components of {Y_k}. Figure 6.3B shows how the DCC
uncombine acts on the output of the DQCS uncombine in the previous pass to
insert components of a second sequence {Z_k} between the even-numbered components
and the odd-numbered components of {Y_k}.


from the DQCS serves as the input to the DCC uncombine. The DCC uncom-
bine has the effect of creating two sequences {Y_k} and {Z_k} from the single
input sequence. If this uncombine, using the kth and (k + N/2)th elements of the
longer sequence to produce the kth elements of the shorter sequences, is performed
in-place, the effect is to group the even elements of both short sequences
in the storage locations that held the even elements of the long sequence, and a
similar pattern holds for the odd components. This is seen in Figure 6.3B. As
the algorithm proceeds, even components of a sequence are separated from the
odd components of the sequence by data belonging to other sequences. Since
the output from a DCC uncombine is acted upon by another DCC uncombine in
the following pass, the even and odd components of a sequence become
separated by distances that double with each pass. This is seen in Figure 6.4A,
where the progress of the sequence separation is followed through the algorithm.
In the final pass through the data, sequences of length two are uncom-
bined into sequences of length one. In this final pass all of the butterflies reduce
to a simple addition and subtraction, where the lone odd component of each
sequence must be added to and subtracted from the even component. But these
are precisely the components that have ended up in widely separated storage
locations. Figure 6.4B shows the data connections that must be made in the
final pass through the data. There is one butterfly of stride one, then two of
stride two, then four of stride four, and so on. All of these stride lengths and
locations are identical to butterflies that are used somewhere in the Cooley-
Tukey algorithm, and therefore they fit on the hypercube in a nearest neighbor
fashion.
Figure 6.5 is a data flow diagram for the ordered-to-scrambled R algo-
rithm as implemented on the hypercube. The input sequence is a real sequence