PAGE 1
PREAND POSTPROCESSING FOURIER TRANSFORM ALGORITHMS FOR REAL SYMMETRIC DATA USING A COMPLEX FFT by Agnes Anne O'Gallagher B. A., University of Colorado, 1984 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Department of Mathematics 1991
PAGE 2
This thesis for the Master of Science degree by Agnes Anne O'Gallagher has been approved for the Department of Mathematics by j.L./3/ 'It Date ____
PAGE 3
O'Gallagher, Agnes Anne (M.S., Mathematics) Preand PostProcessing Fourier Transform Algorithms for Real Symmetric Data Using a Complex FFT Thesis directed by Professor Roland A. Sweet In this paper we explore FFT algorithms for symmetric data that use preand postprocessing based on a complex FFT routine. The implementa tion of these routines in the new package VCFFTPK is then discussed. Finally, the performance of these routines is compared to that of routines in VFFTPAK which use preand postprocessing based on a real routine. An attempt is made to discover how much is gained by avoiding extra passes through the data. The form and content of this abstract are approved. I recommend its publi cation. Roland A. Sweet
PAGE 4
Contents Acknowledgements ix 1 Introduction ................. . . . . . 1 2 Some Properties of Symmetric Sequences and Their Fourier Transforms . . . . . . 2.1 2.2 2.3 2.4 Symmetries . . . . . . . . Symmetries Produced by the Transform 2.2.1 Note on Odd Symmetric Sequences ..... Definition of the Transform in Trigonometric Form Splitting Equations ............ 4 4 4 7 7 8 2.4.1 Splitting Equations for Real Data . . . . . . 9 3 The Algorithms . . 10 3.1 The Real Algorithm 13 3.2 The Sine Algorithm . . . . . . . . . . .. . 15 IV
PAGE 5
3.2.1 3.2.2 The Original Algorithm The Combined Algorithm 3.3 The Cosine Algorithm . . . 3.4 The QuarterWave Even Algorithm 3.4.1 The Inverse Transform 3.4.2 The Forward Transform 3.5 The QuarterWave Odd Algorithm 3.5.1 3.5.2 The Inverse Transform The Forward Transform 4 Implementation, Operation Counts and Timings 4.1 VRCFTF: The Real Algorithm 4.2 VSINC: The Sine Algorithm 4.3 VCOSC: The Cosine Algorithm 4.4 The QuarterWave Algorithms v 15 17 19 23 24 27 29 30 32 32 34 36 38 39
PAGE 6
5 Conclusion . . . . . . . . . . . . . . 44 References . . . . . . . . . . . . . . 48 Vl
PAGE 7
List of Figures 1 The combined symmetric algorithms replace the real FFT of the original symmetric algorithms with the three main steps of the real algorithm. . . . . . . . . . . . . . . 12 Vll
PAGE 8
List of Tables 1 2 3 4 5 6 7 8 9 10 11 A Symmetric Vector and Its Transform Definition of Pertinent Symmetries ........ Symmetries of Transforms of Symmetric Sequences Operation Counts for Vectors of Length N = 2P The R Routines: VRCFTF vs VRFFTF The 0 Routines: VSINC vs VSINT The E Routines: VCOSC vs VCOST The Inverse QE Routines: VCSQCB vs VCOSQF The Forward QE Routines: VCSQCF vs VCOSQB The Inverse QO Routines: VSNQCB vs VSINQF The Forward QO Routines: VSNQCF vs VSINQB Vlll 3 4 7 34 36 38 40 42 43 43 44
PAGE 9
ACKNOWLEDGEMENTS I wish to thank Roland A. Sweet for his teaching and guidance, William L. Briggs for his assistance and the generous contribution of his time, Paul N. Swarztrauber for his help in working out the algorithms and John M. Gary for his understanding and encouragement. I also wish to thank The National Center for Atmospheric Research and The National Institute of Standards and Technology for the use of their computing facilities and Kenneth J. Sewell, without whom this project could not have been done. IX
PAGE 10
1 Introduction A computational problem of great and widespread interest is the calculation of the discrete Fourier transform (DFT) of a vector. Its many applications provide the motivation for the development of efficient algorithms. The forward DFT, {Xk}, of a vector {xj} of length N [1] is defined as 1 Nl i2w"Je xk = I: Xjek = 0, 1, 2, ... N1 N j=O where i = /(1). The inverse DFT is given by x 1 j = 0,1,2, ... N 1. (1) (2) The factor of 1/ N in equation ( 1) will be neglected for the remainder of this paper. If calculated as defined, using a matrixvector product, the DFT requires O(N2 ) operations. The 1965 introduction by Cooley and Tukey[2] of the Fast Fourier Transform (FFT) showed how this operation count can be reduced to O(N log N). In 1970, the first algorithms designed for symmetric sequences [3] were developed. These algorithms use a general FFT with preand postprocessmg. 1
PAGE 11
In 1984, Swarztrauber (1], presented a set of preand postprocessing al gorithms for real, symmetric data that were based on a real, periodic transform. He implemented these in FFTPAK. A vectorized version of this package, VFFT PAK, provided increased efficiency by adding the capability of processing many vectors at once. In the same paper, Swarztrauber showed how to use a complex routine with preand postprocessing to produce the transform of real data. Although not described explicitly, another set of algorithms is implied which combines these two approaches. In this combined algorithm, one would use a complex transform with preand postprocessing in place of the real transform used in FFTP AK and VFFTP AK. It was thought that the performance of these combined algorithms could exceed that of those in VFFTP AK because the use of the complex routine halves the length of the sequence and in doing so reduces the number of passes through the data for some cases. It is these combined algorithms that will be discussed in this paper. As part of this project they have been implemented, in vectorized form, in a new package called VCFFTPK. In order to make the characteristics of symmetric sequences concrete let us take a specific case. Table 1 shows a vector of length N = 8 and its transform. The input vector, {xi}, is real and odd symmetric (xNi =xi) Notice that 2
PAGE 12
Table 1: A Symmetric Vector and Its Transform INPUT VECTOR TRANSFORM real 1magmary real imaginary 0.000 0.000 0.000 0.000 0.087 0.000 0.000 2.692 0.950 0.000 0.000 '0.771 0.472 0.000 0.000 1.109 0.000 0.000 0.000 0.000 0.4 72 0.000 0.000 1.109 0.950 0.000 0.000 0.771 0.087 0.000 0.000 2.692 the transform, {Xk}, is imaginary and conjugate symmetric (XNk = Xk) As will be shown, the transform of a real and odd vector will always exhibit these symmetries. They therefore require only a quarter of the computation that would be required for a general complex vector. Since such vectors often arise in applications, algorithms that exploit these symmetries are worthwhile. Before the algorithms are presented some familiar properties of discrete Fourier transforms will be defined in order to provide a basis for the discussion. 3
PAGE 13
2 Some Properties of Symmetric Sequences and Their Fourier Transforms 2.1 Symmetries Several different symmetries are pertinent to the work that follows. For convenience they will be defined in Table 2. Each definition refers to a sequence {xj} for j = 0, 1,2, .. N 1. Table 2: Definition of Pertinent Symmetries SEQUENCE ABBREV DEFINITION Real R XjXj Imaginary I x; = x; Even E Xj = XNj Odd 0 Xj = XNj Quarterwave even QE Xj = XNj1 Quarterwave odd QO Xj = XNj1 Conjugate Symmetric cs Xj = XNj 2.2 Symmetries Produced by the Transform It can be shown that the transforms of symmetric vectors themselves exhibit different symmetries. For example the transform of a real sequence is conjugate symmetric. To show this, begin with the definition of the forward 4
PAGE 14
transform from equation (1) N1 Xk 2:::: Xje"}j". j=O Substituting Nk for k in the above sum gives N1 2:::: Xje j=O ihj(Nh) N N1 . . ""' '''JJN ''flk L....J x;e e j=O N1 .. ""' .2 l2'11"Jk L.J x;e' 'KJe N j=O Using the fact that ei2"i = 1, we have that N1..,=c"" "i/" L.J x;e j=O If z = re;e is a complex number, then z = rei0 Since Xj = Xj, it follows that As a second example, it will be shown that the transform of a real and odd vector is imaginary. As in the above case, we start with the definition of the forward discrete Fourier transform ( 1): = N1 "'"' i2 .. }k L...J x;e N j=O 5
PAGE 15
Taking the conjugate of both sides and noting that x; = x; yields Reversing the order of the summation gives N1 ( 1 'l2..NJ L..J XNj N xk j=O = N1 .. 2:: .2 k '2'l'I"Jk X e" ?r e N NJ j=O N1 .. "' ''il' L....J XNje i=O Now since x; is odd, Xj = XNj This results in N1 .. "' xk ..., x;e i=O N1 .. '"""' '2"''"1" L..J x;e N j=O Table 3 shows the definition of each of the symmetries that we will investigate and the symmetry displayed by the corresponding transform[l]. 6
PAGE 16
Table 3: Symmetries of Transforms of Symmetric Sequences INPUT VECTOR TRANSFORM Real xk = xNk Real Even xk =Xk Real Odd xk = xk Real Quarterwave even Xk = Xk Real Quarterwave odd Xk = Xk 2.2.1 Note on Odd Symmetric Sequences For an odd symmetric sequence XNf2 = XNN/2 = xN;2 Therefore xN;2 = 0. Also XN = XNN = xo. But by periodicity XN = xo. This means XN = Xo = xo =? Xo = 0. So for any odd sequence Xo = XNf2 = 0. 2.3 Definition of the Transform in Trigonometric Form Let {x;} be a real sequence of length N where N is even. Then equation (2) for the inverse DFT can be written as Xj (3) As shown above, for {xj} real, {Xk} is conjugate symmetric. Therefore (4) 7
PAGE 17
Defining (5) (6) where and S'(Z) signify the real and imaginary parts of Z respectively, ( 4) becomes N a0 { kj27r kj27r } .aN/2 Xj = 2+ akcos(JT)+bksm(JT) +(1)1 2. (7) Using (2), (5), and (6) it can be shown that (8) N1 k .2 2 I; Xj sin( 7r) k = 1, 2, 3, ... N /2 1. j=O (9) The equations (7), (8), and (9) comprise the trigonometric, or real, form of the transform for a real sequence, while equations ( 5) and ( 6) define the relationship between the real and complex forms. 2.4 Splitting Equations The CooleyTukey algorithm, which is the original fast Fourier transform, is based on splitting the sum in equation (1) into two sums as follows: (10) 8
PAGE 18
= Defining ,li:_l 2 i21f'h Yk I; x2;e =NTf', (11) j=O ,t!_l 2 i2'1f 'h zk I; x 2;+1 e =NTf', (12) j=O the transform can then be calculated by the general Splitting Equations: k = 0,1,2, ... ,Nj2 1, (13) k=0,1,2, ... ,Nj21. (14) The second equation is deduced from the periodicity of {Yk} and { Zk} 2.4.1 Splitting Equations for Real Data If the sequence {x;} is real, then the Fourier coefficients {Xk} fork = N/2+1, N/2+2, N/2+3, ... N can be computed from the conjugate symmetry of the transform. It is therefore necessary to compute {Xk} fork= 0, 1, 2, ... N /2 only. For the first N /4 of these coefficients we use (15) 9
PAGE 19
Fork= Nj4 + 1, N/4 + 2, N/4 + 3, ... N/2, substituting N/2 k into equation (15) results in By conjugate symmetry of {Yk} and {Zk} we have (16) Using (15) and (16) we are able to compute {Xk} fork= 0,1,2, ... ,N/2. 3 The Algorithms In [1] Swarztrauber presents a set of algorithms for real sequences with even (E), odd (0), quarterwave even (QE) and quarterwave odd (QO) symmetry. Each of these algorithms uses preand postprocessing around a real, periodic FFT. The input vector is preprocessed to give a new vector (which is also real). The real FFT is then used to find the transform of this new vector. In the postprocessing step, the transform of the original vector is calculated from the output of the real FFT. In the same paper, a preand postprocessing algorithm based on a complex routine is also described. This algorithm is for a general, real sequence 10
PAGE 20
(R). The preprocessing step produces a complex vector from the real input. A complex transform is then used to find the transform of this vector. Finally, the transform of the original, real vector is obtained in the postprocessing step. Implicit in these descriptions is another set of algorithms that combines these two approaches. Figure 1 gives a graphic representation of the algorithms and their combination. The call to the real FFT in the original algorithm for real, symmetric data, is replaced by the entire algorithm for general, real data. These combined algorithms have not, to our knowledge, been implemented and it was thought that their efficiency might exceed that of the original symmetric algorithms. Even if they did not, because the complex FFT routine is often the first one to be optimized in a new computing environment, it was thought this set of routines might still prove useful. In the rest of this section, these algorithms will be described in more detail. First, what we will call "the real algorithm" will be described. It is the one for general, real data which uses a complex FFT and which is mentioned above. Then, the "original" and "combined" algorithms will be given for the sine transform. For the rest of the symmetries, only the combined algorithm will be given. 11
PAGE 21
Synunetric Algorithms j ; 1 d i j accept rea symmetnc ata 1 j L.___________ J preprocess real FFT postprocess r1 I I joutput synunetric transform! I I LJ Real Algorithm ! i I 1 accept real data I I I L__________ 1 preprocess complex FFT postprocess r1 I ' l output real transform l I I L _________________ ____ Figure 1: The combined synunetric algorithms replace the real FFT of the original synunetric algorithms with the three main steps of the real algorithm. 12
PAGE 22
3.1 The Real Algorithm 1. Given the real sequence {xj} for j = 0,1,2, ... N1 where N is even, compute two real sequences, {YJ} and {zj}, of length N/2 as follows: Yi X2j j = 0,1,2, ... ,N/2 1, Zj = X2j+1 j=0,1,2, ... ,N/21. Then compute the complex sequence {wj} of length N/2 Wj = Yi+iz; j=0,1,2, ... N/21. 2. Calculate {Wk}, the Fourier transform of {w;}. 3. Recover{Yk} and {Zk}, the transforms of {y;} and {z;}, with the following formulas: for k = 0,1,2, ... lN/4J, (17) for k = 0, 1, 2, ... lN/4J (18) where lN/4J is the integer closest to and less than N/4. 4. Use the splitting equations (15) and (16) for real sequences to get {Xk}, the transform of {x;}: for k = 0,1,2, ... ,N/4, (19) 13
PAGE 23
for k=0,1,2, ... ,N/4l. (20) If N is not divisible by 4, then both of these formulas must run from k = 0,1,2, ... lN/4J. The other half of the transform is calculated using the conjugate symmetry of XNk = xk fork=1,2,3, ... ,Nj2l. We now derive the formulas for {Yk} and {Zk} used in step 3. Because the Fourier transform is linear (21) Therefore Since {y;} and {z;} are real, their transforms are conjugate symmetric and Taking conjugates of both sides, we have (22) Adding and subtracting (21) and (22) gives (17) and (18). 14 L
PAGE 24
L 3.2 The Sine Algorithm In this section the algorithm for an odd symmetric sequence is given. To be more accurate, it is an algorithm for a real sequence of length n which treats the data as if they had been extended to become an 0 sequence of length N = 2( n + 1 ). Because of this imposed symmetry, the output sequence is purely 1magmary. Recall from the discussion of the properties of an odd sequence that x0 and xN/2 = Xn+l are zero. So the input sequence is {x;} for j = 1,2,3, ... n. First, the algorithm sketched out in (1] for such a vector will be presented followed by the combined algorithm that has been implemented in this project. 3.2.1 The Original Algorithm 1. Given a sequence {x;} for j = 1,2,3, ... n where n is odd, construct the sequence { e;} as follows: J1f e; = (x;Xn+tj) + (x; + Xn+!j) sm(1 ) j = 1, 2, ... n (23) n+ eo = 0. Note that the definition for e0 is consistent with the general definition of { e;} and with the fact, discussed in the last section, that for an 0 15
PAGE 25
sequence, x0 = X!:i. = 0. The vector { ej} is thus defined for j = 0, 1, 2, ... n 2 giving n + 1 values, which is an even number. 2. Use the real periodic transform to compute the Fourier transform of {ej} in trigonometric form. From (7) "t'1 Co { kj27r kj21r } e; = + L Ck cos()+ dk sm() 2 k=l n + 1 n + 1 +(1)jC(n+l)/2 (24) 2 The transform gives us the ck and the dk. 3. As shown in section 2.2, the transform of {xj} will be imaginary. Looking at (7) again, but this time applying it to {xj} instead of {ej} it can be seen that n kj1r Xj = L bk sm(). k=1 n + 1 (25) Substituting (25) into (23) and remembering that n is odd so that (n+1 )/2 is an integer, we get (26) Finally, comparing (24) and (26) we see that we can produce the sine coef16
PAGE 26
Wj = Yi+izi j=0,1,2, ... ,(n+1)/21. Note that {wj} is complex and half as long as {ej}. 3. Call a complex FFT to compute {Wk}, the Fourier transform of {w;}. {Wk} will be a complex sequence of length '(n + 1)/2. This will be in complex, as opposed to trigonometric, form. 4. Calculate {Yk} and {Zk}, the transforms of {y;} and {z;}, as in step 3 of the real algorithm. 1 Yk = 2(wk + w"k), 1 Zk 2i(WkW"k) k=0,1,2, ... ,(n+1)/4l. Note that Wk is not defined fork= (n+1)/2, so in order to use this formula for k = 0 we must use the periodicity of the transform (W(n+l)(z = Wo). 5. Calculate { Ek} as in step 4 of the real algorithm. k = O,l,2, ... ,(n+ 1)/4 1. Thus { Ek} is defined for k = 0, 1, 2, ... ( n + 1 )/2. 18
PAGE 27
L 6. Translate this into trigonometric form by using (5) and (6). Ck 21R(Ek) k=0,1,2, ... ,(n+1)/2, dk 2'
PAGE 28
described. Only the combined algorithm, which is the one that was implemented during this project, will be discussed. 1. Given a sequence, {x;} of length n, where n is odd, produce the new sequence { e;} as follows: forj=0,1,2, ... ,n 2. (28) n1 Note that {e;} has length n1 and that all the input data are used in calculating { e;}. 2. Construct the real vectors {y;} and {z;}, and the complex vector {w;} all of length (n1)/2 using the following: w;y;+iz; forj=0,1,2, ... ,(n1)/2l. 3. Call a complex FFT to compute the Fourier transform of {w;}. The result will be a complex sequence, {Wk}, where 0 :S k :S (n 1)/21 and will be in complex, as opposed to trigonometric form. By periodicity, Wcnl)/2 = W0 a fact that will be used in the next step. 20
PAGE 29
4. Recover the transforms of {Yi} and {zi} as in step 3 of the real algorithm. 5. Now calculate {Ek} which is the transform of { ei} using (15) and (16), the splitting equations for a real sequence. ik2.r Ek = Yk+en.:rzk fork=0,1,2, ... ,(n1)/4, ih2'11' Yken.:r zk fork= 0, 1,2, ... (n 1)/41. Thus {Ek} is defined for 0 :S k :S (n1)/2 6. Translate this into trigonometric form by using(5) and (6). Ck21R(Ek) fork=0,1,2, ... ,(n1)/2, dk 28'(Ek) fork=0,1,2, ... ,(n1)/2l. 7. As in the Sine algorithm, it is necessary to obtain the transform of the original vector from { ck} and { dk}. Referring to (7) we see that Co 2 { kj27r kj27r } ei2+ L ckcos(n_1)+dksm(n_1 ) k=l +(1)jC(nl)/2 (29) 2 21
PAGE 30
As we know from section 2.2, for an even symmetric sequence {xj}, the transform is real which means (7) reduces to n2 k ao J1f an1 Xj =+ L...., ak cos()+ ( 1)1. 2 k=l n1 2 Substituting (30) into (28) yields n2"ll [ kj27r ao + L a2kcos() + (a2k+l k=1 n 1 +( 1)ian1 kj27r ] a2k1) sm() n 1 (30) (31) Comparing (29) and (31) yields the following set of identities for the { ak} which are the cosine coefficients of the original input. co ao = 2' an1 = C(n1)/2 2 a2k = Ck fork=1,2,3, ... ,(nl)/21, a2k+la2k1 = dk fork=1,2,3, ... ,(n 1)/21. The coefficient a1 is needed to start the recurrence relation in the last of these identities and it must be calculated explicitly. Using (8) and the fact that { Xj} is real and even, we have that n2 a, = i=l n1 22
PAGE 31
3.4 The QuarterWave Even Algorithm We now proceed to the description of the combined algorithm for real quarterwave even data. As for the algorithms already described, the input data need not have any symmetry other than beirtg real. It is processed as if it had been extended to be a quarterwave even symmetric vector. For these algorithms, n, the length of the input sequence, must be even, and N = 2n where N is the length of the extended vector. The extension of the data is such that Xj = XNj1 Although the transform of a quarterwave even vector is complex, Swarztrauber [1] has shown that it can be represented by a vector which is real. That is, if {Xk} is the transform of the QE vector {xi}, then Xk = eibc/N Xk where xk is real. Note that as defined above, N is always even. With that restriction, the trigonometric form of the transform is: [k(2j + 1 )7r] 4 LJ Xj cos N .1=0 (32) ao [k(2j + 1 )1r] + LJ akcos N 2 k;1 23
PAGE 32
where While for E and 0 data, the transform is its own inverse (neglecting sealing), this is not true for quarterwave data. Both QE and QO data require distinct forward and backward algorithms. For reasons that will become apparent, it is more convenient to discuss the inverse transform first. 3.4.1 The Inverse Transform 1. Given {ak} fork = O,l,2, ... ,n1, where {ak} is the nonredundant portion of the transform of a QE vector in trigonometric form (recall from Section 2.2 that the transform of a real vector will be conjugate symmetric), compute {ek} as follows: ek = (ak+ank)cos()(akank)sm() fork= O,l, ... ,n1. (33) 2n 2n This definition of {ek} requires an which is not given. However, from (32), and remembering that 2n = N, it can be shown that an = 0. 2. Construct the real vectors {Yk} and {zk}, and the complex vect?r {wk} 24
PAGE 33
all of length n/2 using the following: Wk = Yk+izk fork=0,1,2, ... ,n/2l. 3. Now calculate the Fourier transform {W;} of the complex vector {wk} by calling the forward complex routine. 4. Recover {Yj} and { Z;}, the transforms of {Yk} and { zk} as is done in step 3 of the real algorithm. 1 2(W;+W?i) forj=0,1,2, ... ,n/4 1( Z; = :W;W,._;) forj=0,1,2, ... ,n/4. 5. Obtain {E;}, which is the transform of {ek}, by using the splitting equations for a real sequence. E; ii21f for j = 0, 1, 2, ... n/4, Y:+e n Z 1 1 E!Ji ij21r for j = 0,1,2, ... ,n/4 1. Yje n Z; 6. Translate this into trigonometric form by using (5) and (6). c; = 2!R(E;)forj=0,1,2, ... ,n/2, d; 2'2r(E;)forj=1,2,3, ... ,n/2 1. 25
PAGE 34
7. As in the algorithms already discussed, the results can be obtained from the { Cj} and { dj} with the following justification: By (7) we know that Co [ kj27f kj27f ] k'"n/2 ek = + L., Cj cos()+ dj sm() + ( 1) . (34) 2 i=1 n n 2 Then substituting (32) into (33) we see that y1 [ kj27f kj27f ] ek = 4xo + 4 L (x2j + X2j1) cos()+ (x2jX2j1) sm() n n +4( 1)kXn1 (35) Comparing (35) and (34) we get the following set offormulas relating { Xj} Co 8xo, (36) Cj 4(X2 j + X 2j1) J = 1, 2, 3, ... n/21, (37) 4(X2jX 2j1) J = 1, 2, 3, ... n/21, Cn 2 Inverting these we get: Xo X2j X2j1 Xn1 = = = 26 1 8Co, 1 (c +d) 8 1 1 1 (c d) 8 J J 1 (38) (39)
PAGE 35
By these last formulas we can obtain {xj}, the nonredundant portion of the QE vector. Note that this inverse transform for quarterwave even data closely follows the pattern of the cosine and sine transforms. The forward transform is the mirror image of this algorithm. 3.4.2 The Forward Transform 1. Given a vector {xi} for j = 0,1,2, ... ,nl, compute the two vectors {cj} and {d;} using (36) through (39). 2. In the original quarterwave even algorithm, one would do an inverse transform at this point. One would treat { Cj} and { dj} as the trigonometric form of the transform of { ek} (from the inverse algorithm), and use an inverse transform to obtain { ek} The combined algorithm will also obtain { ek} but because it calls the inverse complex routine instead of the real periodic, more preand postprocessing steps are necessary. The first of these is to obtain {Ej}, the complex form of the transform of { ek}, by using the inverse of the definition of the trigonometric form of the transform for a real vector. i"R(Ei) c ; for j = 0,1,2, ... ,n/2, 27
PAGE 36
d; 2 forj=l,2,3, ... ,n/2l. 3. Now use the inverse of the splitting equations for a real vector (in this case {ek}) to get {Yj} and {Z;}. This corresponds to step 5 of the inverse algorithm. 1 Yj 2(E;+Elji) forj=0,1,2, ... ,n/4, 1 ill:!. Z; 2e n (E;E!fi) forj=O,l,2, ... ,nj4. 4. Calculate {W;}, the transform of {wk}, by inverting formulas in step 3 of the inverse algorithm to get W; Yj +iZ; for j = 0,1,2, ... ,n/4, W"fi YjiZ; forj=0,1,2, ... ,n/4l. 5. Call the inverse, complex transform with {W;}, which is complex and n/2 in length, to obtain { wk} ZkS'(wk) fork=0,1,2, ... nj21. 28
PAGE 37
Then compute { ek}, the vector defined in step 1 of the inverse algorithm, as follows: ( 40) e2k+1 Zk fork = 0, n/2 1. ( 41) 7. Finally, we are able to compute { ak} using the inverse of the formula in step 1. k1r k1r ak (ek + enk)cos()(ekenk)sm() fork= 0,1,2, ... ,nl. 2n 2n 3.5 The QuarterWave Odd Algorithm The algorithm for real quarterwave odd data is very similar to the one for quarterwave even data. Therefore only a short description will be given. Again, n, the length of the input sequence, must be even, and N = 2n where N is the length of the extended vector. But this time the extension of the data is such that Xj = XNj1 The transform of a quarterwave odd vector is complex, but Swarztrauber [1] has shown that if {Xk} is the transform of the QO vector {xi}, then Xk = eikn/N Xk where Xk is strictly imaginary. 29
PAGE 38
I I I I I I Because N is always even, the trigonometric form of the transform is: !f1 k bk = 4 'E Xjsin; (2j + 1), j::::;Q ( 42) [bk (2j + 1)] + ( 1)ibN/2 where 3.5.1 The Inverse Transform 1. Given {bk} fork= 1,2,3, ... ,n, where {bk} is the nonredundant portion of the transform of a QO vector in trigonometric form (since the data are real the transform will be conjugate symmetric), compute {ek} as follows: br br ek = (bk+bnk) cos()+ (bkbnk) sin() fork = 0, 1, ... n1. ( 43) 2n 2n The formula for { ek} requires {bk} to be defined for 0 :'0: k :'0: n, and we are not provided with b0 But notice that using the above definition for 2. Follow steps 2 through 6 of the QE algorithm to produce the { Cj} defined for 0 :'0: j :'0: n/2 and the {dj} defined for 0 :'0: j :'0: n/21 30
PAGE 39
3. To develop the formulas for the relationship between these coefficients and the {x;}, substitute (42) into (43) to obtain if1 [ kj27r kj27r ] ek 4xo + 4 I; (x2;x2;1) cos()+ (x2; +x2;1) sm() j;;;;;l n n 4(1)kXn1 (44) As in the QE case, by comparing ( 44) and (34) we find a set of relationships between {x;}, {c;} and {d;}: 8xo, (45) 4(x2iX2j1l j = 1, 2, 3, ... n/21, (46) d; = 4(x 2 ; + x2;_1 ) j = 1, 2, 3, ... n/21, ( 47) Inverting these expressions gives 1 Xo 8Co' 1 Xzj (c +d) 8 J J. 1 Xzj1 8(c;d;), 1 Xn1 C.!!.. 8 2 from which we can obtain {x;}. 31 1
PAGE 40
3.5.2 The Forward Transform 1. Given a vector { Xj} for j = 0, 1, 2, ... n1 compute the two vectors { Cj} and {dj} using (45), (46), (47), and (48). 2. Follow steps 2 through 6 of the QE forward 3. Finally, we are able to compute {bk} using the inverse of the formula in step 1: krr: b: bk = (ekenk)cos() + (ek + enk)sm() fork= 0,1,2, ... ,nl. 2n 2n 4 Implementation, Operation Counts and Tim. 1ngs We now address the topics of the implementation of these algorithms in VCFFTPK and the performance of these new codes. As stated earlier, the impetus behind this project was the desire to ascertain whether reducing the number of passes through the data saved a sufficient amount of execution time to compensate for the added operations required by these algorithms. Therefore, in each case, the performance of the new routine is measured against that of the analogous routine out of VFFTPAK. The codes for symmetric data are each compared to the routine that performs the "original" version of the 32
PAGE 41
I I I 1. I I I I same algorithm (the VCFFTPK routine does the "combined" version). The "real algorithm," which is implemented in the VCFFTPK routine VCFFTF, is compared to the VFFTP AK routine VRFFTF which does the transform of general, real data. Although VRFFTF does not use preand postprocessing, its comparison with VCFFTF makes sense, not only because it produces the same transform, but because VRFFTF provides the base for the preand postprocessing routines in VFFTP AK just as VCFFTF provides the basis for the routines in VCFFTPK. In our description of each of the algorithms, it was noted that there is a restriction on the length of the input vector. This restriction is due to the use of the complex routine. In each case, the input vector is processed to become a new vector whose length is a function of the length of the input vector. Because the new vector is passed to the complex routine, it must have an even number of elements. Therefore, the input is required, depending on the case, to be either even or odd: odd in the cosine and sine cases, even in the real and quarterwave cases. The operations counts for the VCFFTPK routines are presented in Table 4 along with those of the analogous routines from VFFTPAK. The counts for VFFTPAK are taken from [1] and those for VCFFTPK are deduced from the 33
PAGE 42
Table 4: Operation Counts for Vectors of Length N = 2P VFFTPAK VCFFTPK DATA adds mults adds mults R (3p5)k + 2 (p3)N + 4 3p'f 2 (p 2)N 4 E (3p + 2) + 5 (3p + + 1 (p+ 3)k7 0 (3p3) + 1 (3p + 2). 3 (p2);;6 QE h (3p + (3p2)+ 4 (p3)+ 2 (p2) h6 QO h (3p2h +4 (p3h +2 (3p+3h (p2h6 same source. They are, therefore, theoretical counts rather than actual tallies. Although the names of the vectorized packages are given, these counts apply to the scalar algorithms. For the vectorized version, given m vectors of length N, each of the counts above would be multiplied by m. They are valid for N ::0: 16. The implementation of each of the algorithms is discussed below. The conclusions reached as a result of this project are given after this discussion. The results presented in Tables 5 through 11 are of timings done on a single processor of the Cray YMP at NIST in Gaithersburg, Maryland. Preliminary testing was done on the Cray YMP at NCAR in Boulder, Colorado. 4.1 VRCFTF: The Real Algorithm The algorithm described in section 3.1 has been implemented under the name VRCFTF. It was first coded in Fortran exactly as described above. Then the complex arithmetic was converted to real. Since, in Fortran, the complex 34
PAGE 43
vector {wj} and the real vector {xj} are stored in exactly the same way, step 1 was thereby eliminated. Instead of creating {w;}, the routine merely passes the input vector to the complex routine. Next, steps 3 and 4 were combined into one postprocessing loop. This loop was arranged so that the calculations would be done in place, avoiding a final pass through the data to copy the results back into the input vector. When this process was completed, the resulting routine performed only one pass through the data. Finally, the routine was vectorized by adding another dimension to all the arrays and changing each DO loop to a double DO loop. The complex routine used is VCFFTF which is a vectorized version of CFFTF from FFTPAK. The original version was written by Swarztrauber. It uses a GentlemanSande algorithm [4]. Table 5 shows the results of the timings for VRCFTF and how they compare to those of VRFFTF. The times for VCFFTF are also shown. Each line shows a different case in which N is the length of the sequences to be transformed and m is the number of sequences. The error is the maximum difference between the output of VRCFTF and that of VRFFTF. The numbers under the names of the routines are the times that routine took to execute in milliseconds. The times for VRCFTF include the time spent in VCFFTF. 35
PAGE 44
Table 5: The R Routines: VRCFTF vs VRFFTF m N error VRCFTF VCFFTF VRFFTF 64 16 0.62E14 0.09 0.05 0.07 64 32 0.28E13 0.16 0.10 0.14 64 64 O.llE13 0.41 0.29 0.30 64 128 0.57E13 0.86 0.64 0.73 64 256 0.14E13 1.81 1.36 1.58 64 512 0.57E13 3.93 3.04 3.61 64 1024 0.20E13 9.29 7.52 7.72 64 2048 O.llE12 19.82 16.29 17.78 4.2 VSINC: The Sine Algorithm The algorithm described in section 3.2.2 is coded under the name VSINC. In programming this algorithm, we hoped to improve the performance of the original sine algorithm described in section 3.2.1 That algorithm is implemented in VSINT and is part ofVFFTPAK. The operation counts for the two algorithms are shown in Table 4. Originally written as described in section 3.2.2 with a separate step, and a separate data access for each of the 7 steps in the algorithm, the routine was eventually compressed to include one preprocessing step, which combined steps 1 and 2, the call to the complex routine, and one postprocessing step which combined steps 4 through 7. The idea behind combining these loops is to 36
PAGE 45
minimize the number of data accesses and maximize the amount of computation per access thereby optimizing the speed of the routine. Unlike the real case, for the sine routine and all of the rest of these cases, the awkwardness of coding in place is avoided. Since all of these algorithms have a preprocessing step before the call to the complex subroutine and a post processing step after it, the data flow is more natural. The first loop reads from the input vector and writes into a work vector. The work vector is then passed to the complex routine as input. Intermediate results are passed back in the same vector. It is then possible to read from the work vector and write back into the input vector during the last pass. After the steps were combined it was possible, in both the preprocessing loop and the postprocessing loop, to cut the number of data accesses in half again. From (23) it can be seen that eni uses the same data as ei. We therefore structured the preprocessing loop so that these were done in one iteration, thus preventing that data from being brought from memory twice. Similarly, during postprocessing, bk and bnk need the same data. To take advantage of this, we ran the recurrence relation in (27) both ways. Thus it was possible and advantageous for the work to be doubled up at each iteration and the rumber of iterations to be cut in half. 37
PAGE 46
Table 6: The 0 Routines: VSINC vs VSINT m n error VSINC VCFFTF VSINT VRFFTF 64 15 0.29E13 0.13 0.06 0.14 0.08 64 31 0.50E13 0.21 0.10 0.22 0.14 64 63 0.69E13 0.60 0.35 0.54 0.36 64 127 0.12E12 1.13 0.67 1.14 0.77 64 255 0.22E12 2.20 1.35 2.16 1.52 64 511 0.28E12 4.67 2.98 4.72 3.44 64 1023 0.51E12 10.83 7.45 9.95 7.42 64 2047 0.83E12 22.83 16.08 22.25 17.18 Table 6 shows the results of the timings of VSINC and VSINT. The times for VCFFTF and VRFFTF, the complex and real routines called by these routines respectively, are also given. Note that in this case n is the length of the input vector. The operation counts given in Table 4 are in terms of N = 2(n+ 1). The other definitions are the same here as they were for the real case. 4.3 VCOSC: The Cosine Algorithm The implementation of the algorithm described in section 3.3 is the VCFFTPK routine VCOSC. Essentially, the intention and theory behind this process was identical to that of VSINC. It, too, was compressed into a single preprocessing step, the call to the complex routine, and a single postprocessing step in order to minimize passes through the data. Here, too, the loops were doubled up to 38
PAGE 47
reduce the number of data accesses. However, this case presented a few more complications than the sine case did. In the original sine algorithm, at the last step, the starting values for the recurrence relation are available. At the same place in the original cosine algorithm, these values must be computed explicitly. In order to run the recurrence relation both directions, as we had done in the sine routine, an_2 had to be calculated as well as a1. Since both of these calculations are summations, it was possible to do them, one addition per iteration, in the loops that accomplish the preand postprocessing. Therefore, although they add to the number of calculations being done, they do not add significantly to the data motion. But, as coded, this routine increases the number of adds in Table 4 by N /4 and the number of multiplications by the same amount. Table 7 shows the results of the timings of VCOSC and VCOST. The definitions for the labels in this table are the same as those given for the sine case except that N = 2(n 1). 4.4 The QuarterWave Algorithms As noted above, the transform of quarterwave symmetric data is not its own inverse. Forward and inverse routines are required. It is somewhat 39
PAGE 48
I I I I I I I I I I I I I I Table 7: The E Routines: VCOSC vs VCOST m n error vcosc VCFFTF VCOST VRFFTF 64 17 0.28E13 0.12 0.05 0.12 O.Q7 64 33 0.46E13 0.23 0.10 0.25 0.14 64 65 0.57E13 0.54 0.29 0.49 0.29 64 129 0.91E13 1.13 0.63 1.10 0.71 64 257 0.13E12 2.35 1.35 2.30 1.53 64 513 0.22E12 4.97 2.97 5.05 3.45 64 1025 0.35E12 11.43 7.45 10.47 7.44 64 2049 0.48E12 23.96 16.03 23.18 17.15 arbitrary which algorithm is called the forward and which is called the backward transform. The QO and QE routines in VCFFTPK correspond to the definition given above in sections 3.4 and 3.5 and to the algorithms described in [1]. The forward routines are named VSNQCF and VCSQCF. The inverse routines are VSNQCB and VCSQCB. However, the corresponding routines in VFFTPAK have their names switched. VSINQF and VCOSQF do what we have been calling the inverse transform. VSINQB and VCOSQB do the forward transform. Rather than redefine the transforms, we have chosen to leave the VCFFTPK routines named as they are and note clearly in the documentation that VSNQCF produces the same results as VSINQB, while VSNQCB produces the same results as VSINQF. Similarly, for the QE routines, VCSQCF corresponds to VCOSQB 40
PAGE 49
and VCSQCB to VCOSQF. This is why, in the timing results that follow, VFFTPAK forward routines are compared to VCFFTPK inverse ones and viceversa. Notice, too, that the forward VCFFTPK routines call VCFFTB and the inverse ones call VCFFTF, respectively the inverse and forward complex routines We went about coding these two sets of routines in different ways. For the QE routines, we took the corresponding vectorized routines from VFFTPAK and edited them to call VCFFTF and VCFFTB (the vectorized version of the inverse complex routine from FFTPAK) with additional postprocessing. However, the QO routines in VFFTPAK use a different algorithm in which they reverse the order of the input vector, call the QE routines, and then change every other sign in the output vector. In other words, another layer of preand postprocessing and a subroutine call have been added. For this reason, and in order to confirm our understanding of the algorithms, we wrote the QO routines in VCFFTPK from scratch. The procedure was very similar to that used for the sine and cosine routines. The development of these codes went much more smoothly than that of the QE routines. However, the end results of the two processes are very similar. For the QO codes, we wrote scalar versions of the forward and backward real algorithm discussed in 3.1 and called these from scalar QO routines that 41
PAGE 50
I I I I I I I I I I I I I Table 8: The Inverse QE Routines: VCSQCB vs VCOSQF m n error VCSQCB VCFFTF VCOSQF VRFFTF 64 16 0.16E13 0.11 0.05 0.13 0.07 64 32 0.25E13 0.22 0.10 0.26 0.15 64 64 0.28E13 0.52 0.29 0.52 0.30 64 128 0.50E13 1.08 0.63 1.14 0.73 64 256 0.85E13 2.28 1.37 2.45 1.60 64 512 0.99E13 4.81 3.01 5.28 3.58 64 1024 0.13E12 11.27 7.64 11.32 7.95 64 2048 0.23E12 23.35 16.18 24.30 17.72 did just the first and the last step of the algorithm outlined in 3.5. Once that was producing the correct result, we pulled the subroutines up into the main routine. Then the compression and vectorization proceeded just as it did for the sine and cosine routines. Notice that the operations counts in Table 4 are identical for the QE and QO routines. The results of the timings are given in Tables 8, 9, 10, and 11. For all of them, N = 2n and the other definitions remain as they are for the other timings. 42
PAGE 51
Table 9: The Forward QE Routines: VCSQCF vs VCOSQB m n error VCSQCF VCFFTB VCOSQB VRFFTB 64 16 0.32E13 0.11 0.05 0.13 O.Q7 64 32 0.28E13 0.22 0.10 0.27 0.16 64 64 0.39E13 0.52 0.29 0.54 0.32 64 128 0.43E13 1.09 0.63 1.18 0.76 64 256 0.50E13 2.30 1.37 2.54 1.69 64 512 0.46E13 4.85 3.01 5.38 3.73 64 1024 0.57E13 11.17 7.51 11.51 8.16 64 2048 0.53E13 23.53 16.21 24.96 18.36 Table 10: The Inverse QO Routines: VSNQCB vs VSINQF m n error VSNQCB VCFFTF VSINQF VRFFTF 64 16 0.80E14 0.11 0.05 0.14 O.D7 64 32 0.28E13 0.21 0.10 0.29 0.15 64 64 0.15E13 0.51 0.29 0.58 0.30 64 128 0.43E13 1.07 0.64 1.27 0.73 64 256 0.36E13 2.22 1.36 2.65 1.58 64 512 0.85E13 4.74 3.03 5.71 3.57 64 1024 0.85E13 10.93 7.52 11.96 7.72 64 2048 0.17E12 23.17 16.34 26.26 17.77 43
PAGE 52
I I I I I I I I I I I I I I I I I I I I I Table 11: The Forward QO Routines: VSNQCF vs VSINQB m n error VSNQCF VCFFTB VSINQB VRFFTB 64 16 0.20E13 0.12 0.05 0.15 0.07 64 32 0.32E13 0.25 0.10 0.30 0.16 64 64 0.21E13 0.58 0.29 0.59 0.32 64 128 0.36E13 1.22 0.63 1.29 0.76 64 256 0.31E13 2.52 1.36 2.71 1.65 64 512 0.43E13 5.35 3.02 5.84 3.71 64 1024 0.36E13 12.14 7.50 12.23 8.01 64 2048 0.50E13 25.53 16.25 26.73 18.26 5 Conclusion Looking at the timings for the 0 routines we see that, for half of the cases shown, the VCFFTPK routines run slightly faster than the VFFTPAK routines. In general, the cases where this is true are the ones in which an extra pass through the data is avoided, that is, every other case. To see this, it must be noted that the number of passes performed by either VRFFTF or VCFFTF depend on the number of factors N has, where N is the length of the vector passed to these subroutines (as distinguished from the length of the original input vector). The timings were all done on vectors for which N is a power of two. Recall, also, that if VRFFTF is passed a vector of length N, VCFFTF is passed one of length N/2. Let N = 2P where pis 44
PAGE 53
even. Then N has p/2 factors of 4. N/2 has p/21 factors of 4 plus a factor of 2. So both algorithms require the same number of passes. However, if p is odd, N has (p1 )/2 factors of 4 plus an additional factor of 2 while N /2 has only the (p1 )/2 factors of 4, giving the algorithm based on the complex routine the advantage of one fewer pass. Thus, only every other power of two provides VCFFTPK with an advantage over VFFTPAK. This alternation can be detected in many of the timing tables. Even when these routines require the extra pass through the data, they perform better than would be predicted by the number of operations added to the algorithm by the extra preand postprocessing. This phenomenon is also seen in Tables 8 and 9 where the QE routines in VCFFTPK are shown to be faster than those in VFFTPAK for all of the cases run. We theorize that this is due to the fact that the VFFTPAK routines call VRFFTF which scales the results using an extra loop, thereby adding a pass through the data. The VCFFTPK routines call VCFFTF which does no scaling. Instead the scaling is combined with the postprocessing pass. With this approach, even though the scaling costs about n = N /2 more multiplies it doesn't cost the extra memory accesses the other routine incurs. In other words, we suspect that any speedup the VCFFTPK routines display is due to coding details rather than to the 45
PAGE 54
I I I I I I I algorithms themselves. We hope to establish this more firmly by doing further research. The cosine routine does not perform as well when compared to its analog in VFFTP AK. This can be explained by the fact that in order to run the recur' sion both ways in the postprocessing loop, we calculate a.,._2 explicitly thereby adding n/2 operations. If this was not done, and the last loop was only run one direction, more data accesses would be required. Either way, the complications make this algorithm slower than VCOST. The excellent performance of the QO routines, VSNQCF and VSNQCB, when compared to the VFFTPAK routines VSINQB and VSINQF, can be explained by the fact that the QO routines from VFFTP AK call the QE ones from that package. This design seems to sacrifice efficiency in favor of reducing the number of distinct routines. It is not surprising that the explicitly coded QO routines in VCFFTPK outperform these. The R routine does poorly when compared to VRFFTF. But notice that even if a pass through the data is saved inside of VCFFTF, it is added on again by the postprocessing loop. So this routine has no chance to make up for the extra operations it has to do. 46
PAGE 55
In conclusion, the combined algorithms used in VCFFTPK perform about as well, but no better, than the ones used in VFFTPAK. An extra pass through the data is, indeed, costly; but the time saved by avoiding it is cancelled out by the extra operations required to do so. In addition to the algorithm described in section 3.1, and used by this package, Swarztrauber [1] describes another algorithm for real data that uses a complex routine. That algorithm combines two real vectors rather than splitting one and, therefore, has no restriction on the length of the vector. It also requires less postprocessing. It was thought to be less promising for implementation on a vector computer because it requires movement in both the rows and the columns of the vector. Nevertheless, a set of routines based on that algorithm, instead of the one we have used, would be an interesting candidate for further study. 47
PAGE 56
References [1] Swarztrauber, Paul N., Symmetric FFTs, Mathematics of Computation, Volume47, Number 175, pp. 323346,1986. [2] Cooley, J. W., and Tukey, J. W., An Algorithm for the Machine Calculation of Complex Fourier Series, Mathematics of Computation, Volume 19, pp. 297301, 1965. [3] Cooley, J. W., Lewis, P. A. W., and Welsh, P. D., The Fast Fourier Transform Algorithm: Programming Considerations in the Calculation of Sine, Cosine, and Laplace Transforms, J. Sound Vibration, Volume 12, pp. 315337, 1970. [4] Swarztrauber, Paul N., FFT Algorithms for Vector Computers, Parallel Computing 1, pp. 4563, 1984. 48
