THEORY AND IMPLEMENTATION OF A MULTIPROCESSOR
SPECTRUM ANALYZER
by
TIMOTHY PAUL GROTH
B.S.E.E./ Colorado State University, 1981
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Electrical Engineering and Computer
Science
1989
r
Ml
t
. fit
This thesis for the Master of Science degree by
Timothy Paul Groth
has been approved for the
Department of
Electrical Engineering and Computer Science
by
Daniel F. Michaels
Date
4/ / T~j
Ill
Groth, Timothy Paul (M. S., Electrical Engineering)
Theory and Implementation of a Multiprocessor Spectrum
Analyzer
Thesis directed by Associate Professor Daniel F.
Michaels
The spectrum analyzer is a widely used
instrument to observe phenomena in the frequency domain.
An implementation is realized using a Personal Computer
acting as a host to stateoftheart coprocessors. The
host processor downloads the executable programs to the
coprocessors, and transfers data between them during
execution, using the Direct Memory Access Controller
(DMAC) for some of these transfers. One coprocessor
calculates the Fast Fourier Transform (FFT), while
another displays the resulting data. Execution times
were benchmarked in this system, and were used to simu
late an expanded multiprocessing system. The simulation
was computed utilizing commercially available software,
NETWORK II.5. Parallel execution of the Fast Fourier
Transform algorithm is also explored.
CONTENTS
CHAPTER
I. INTRODUCTION ................................. 1
II. BACKGROUND.................................... 8
The Fourier Transform ...................... 8
Imperfections of the FFT................... 10
FFT of a real function..................... 17
Decimation and Interpolation .............. 23
III. IMPLEMENTATION............................... 25
Multiprocessor architecture ............... 25
Antialiasing Filter ...................... 25
Data Acquisition........................... 28
DSP Coprocessor Board ..................... 29
GSP Coprocessor Board ..................... 30
Software................................... 32
Simulation
41
V
IV. RESULTS...................................... 49
Implementation ............................ 49
Simulation................................. 51
Parallelism of the FFT..................... 54
The Perfect Shuffle....................... 55
The N Cube................................ 60
System using 4 PEs........................ 64
Other Programs............................. 81
V. DISCUSSION................................... 83
BIBLIOGRAPHY ....................................... 87
APPENDIX
A. CODE EXECUTED ON THE AT
USED FOR THE SPECTRUM ANALYZER............. 89
B. CODE EXECUTED ON THE DSP
USED FOR THE SPECTRUM ANALYZER............ 116
C. CODE EXECUTED ON THE GSP
USED FOR THE SPECTRUM ANALYZER............ 139
D. LIST OF NUMBER NINE SUPPLIED
BINDING FUNCTIONS ........................ 162
E. l. SIMULATION MODULES.......................... 165
E.2. NET LIST OF SIMULATION...................... 168
E.3. RESULTS OF SIMULATION WITH SNAPSHOT......... 186
E.4. STATUS PLOTS OF SIMULATION ............. 210
E. 5. UTILIZATION PLOTS OF SIMULATION ............ 214
F. OTHER PROGRAMS EXECUTED ON THE PC.......... 235
G. BITREVERSAL................................. 248
H. ONE NODE OF THE FFT ON THE N CUBE
IN TMS320C25 ASSEMBLY .................... 249
vi
TABLES
TABLE
1. Comparison of Window Functions .............. 15
2. Comparison of Discrete Fourier Transform
to floating and integer Fast Fourier
Transform................................ 18
3. Functions used in SPECTRUM................... 33
4. Benchmarks of 256 point complex FFT.......... 49
5. Task benchmarks.............................. 50
6. Utilization of Processing Elements .......... 52
7. Amount of times instructions executed ....... 52
8. Time for PC main Ram to fill up.............. 52
9. 256 point FFT with N PEs..................... 67
10. 4 PEs with variable FFT length............... 67
11. Twiddle factors used in PE butterflys ....... 74
12. Speed up using global ram,
fixed FFT length......................... 76
13. Speed up using global ram, fixed PE.......... 77
14. Reverse digits for N = 8
248
vii
FIGURES
FIGURE
1. AT with peripheral boards 4
2. Signal flow graph of 16 point FFT 9
3. Fourier Transform of sine wave 11
4. Fourier Transform of rectangular window ... 11
5. Spreading of spectrum 11
6 Picketfence effect 13
7. Integer multiples of time record reciprocal 13
8. Processors used in the Spectrum Analyzer 26
9. AntiAlias Filter 27
10. Bitmap with windows 32
11. Display of selected window function 35
12. Display of spectrum 36
13. Flow Chart of the Spectrum Analyzer 38
14. Data flow through System 39
15. Multiple spectrums displayed together 41
16. Block diagram of Hardware in the Simulation 43
17. Buffers used in the Simulation 45
18. PEs and Perfect Shuffle routing Network ... 57
19. Perfect Shuffle with PEs shown in circles 59
viii
20. A3 Cube of 8 nodes......................... 60
21. C2 routing function after stage 1 ......... 61
22. Routing functions after stage 2 and 3 ..... 62
23. N Cube with PEs shown in circles........... 63
24. Flow Diagram of Perfect Shuffle
with 4 PEs............................... 65
25. Flow diagram of N Cube with 4 PEs.......... 66
26. Architecture for the Perfect Shuffle ...... 69
27. Architecture for the N Cube.............. 70
28. FFT utilizing Perfect Shuffle ............. 72
29. FFT utilizing N Cube....................... 73
30. Pseudo code for entire FFT................. 79
31. Bit reversal as data is written out ....... 81
CHAPTER I
INTRODUCTION
The Spectrum Analyzer, which has been in exis
tence for years, is widely used and an important instru
ment in many fields of study. The purpose of this
Thesis is to research the problems involved in the de
velopment of a multiprocessor implementation of a Spec
trum Analyzer. This implementation uses a PC AT acting
as host to two highperformance coprocessors, one for
DSP tasks and one for graphics display tasks. The sys
tem was developed and benchmarked, and the results were
used in further study to simulate an expanded system,
and to conjecture on a system that executes the FFT in
parallel.
Chapter II of this Thesis is a review of back
ground material. The Discrete Fourier Transform (DFT)
algorithm, which converts data from the time domain to
the frequency domain, is reviewed in Chapter II. The
Fast Fourier Transform (FFT) was derived from the DFT
due to symmetry and other mathematical simplifications
by Cooley and Tukey[l]. It is possible to use an N
point complex FFT to compute a 2N point real FFT, and
this derivation is presented in Chapter II.
I
2
There are other possible methods that can be
used to implement a spectrum analyzer other than the
FFT, such as using a bank of filters with a meter at the
output of each filter. However, if it were desirable to
see closely spaced spectral lines each filter would have
to possess a small bandwidth and a large number of fil
ters would be necessary. Consequently this method is
not practical. A way to avoid the need for a large
amount of filters is to use only one filter and sweep it
through the frequency range of interest. If the filter
is swept too fast, then errors will occur since the fil
ter will not have time to respond, at each frequency.
Such implementations are common in industry, but this
type of spectrum analyzer can't capture transient
events, so its applicability is limited. Use of the FFT
offers the advantage of both the parallel filter method
and the swept filter method. In fact the FFT method be
haves like a parallel filter analyzer with hundreds of
filters. The FFT is executed on a digital computer,
leading to the use of a Personal Computer (PC AT) based
system.
The AT contains expansion slots so that co
processor boards can be added to obtain a multiprocess
ing system. In this implementation a Digital Signal
Processor (DSP), a Graphics Signal Processor (GSP), and
an analog to digital converter are added to the AT, all
3
residing on daughter boards, Figure 1. Naturally higher
throughput, faster execution and simple system growth
paths are some of the benefits of such a parallel sys
tem, when compared to a system that utilizes only one
processor.
The AT serves as a host to the other boards by
downloading program code, controlling execution, and
transferring data between them. It uses a series of
flags for synchronization, known as semaphores. The FFT
executes on the DSP board, Direct Memory Access (DMA) is
used to efficiently transfer data from the A to D board
into AT main ram, and the GSP board displays the data.
The DSP, DMA, GSP and the AT processor execute in paral
lel with each other to increase the overall efficiency
of the system. The implementation is described in
Chapter III.
The software for the system implementation was
developed on the same AT that the spectrum analyzer runs
on. The software, for the AT and the Graphic Signal
Processor board, was developed with C compilers. The C
language is more efficient, than other highlevel lan
guages since it is at a level closer to assembly lan
guage than other highlevel languages. C deals with the
same sort of objects that most computers do, such as
characters, numbers, and addresses. C provides no
inputoutput facilities so these higher level mechanisms
FIGURE 1: AT with peripheral boards
5
must be provided by called functions contained in a
library. It provides pointers and the ability to do ad
dress arithmetic. It can perform call by reference by
passing a pointer to a function. The function may
change the object to which the pointer points[2]. The
FFT software for the DSP coprocessor was written in as
sembly language to maximize performance. The FFT was
also coded in C and executed on both the GSP board and
the AT for comparison purposes.
The degree at which an application is decomposed
and structured for processing is termed granularity.
Finegrain programming takes parallel procedures to the
instruction level, whereas largegrained parallel pro
cessing involves dividing the process into functions and
tasks; and medium grain involves further destructuring
of the tasks[3]. In this system, functional tasks, such
as dsp and graphics display, are broken down and exe
cuted concurrently on the different coprocessors so it
is defined as a mediumgrain system.
In the ideal case, all processors would run at
100% efficiency and speed up would be directly related
to the number of processors in the system. Data would
flow into, through, and out of the system (to the dis
play) at an even rate and without delays. Also the re
sults of the FFTs would be stored in main ram on the AT
so that no lapse in data acquisition would be en
6
countered. The GSP board would then pull the data out
of the AT buffer as it is ready to display it. In the
system developed for this thesis; however, the GSP and
its tasks proved to be a bottleneck; consequently, the
ram buffers in the AT eventually fill up, so that the
incoming data is lost.
This scheme was simulated on a software package
produced by CACI. The system was expanded to include 4
DSP coprocessors so that 4 spectrums can be calculated
at the same time. By running a simulation, a great deal
of insight into a system can be achieved without actu
ally building the system. NETWORK II.5 allows simula
tion of numerous devices with user selectable parame
ters. In the simulation PE's are used for the proces
sors, busses are defined to interconnect them, ram is
simulated as a storage device with their contents being
files; and modules are used to simulate the software
which executes on the PEs. Semaphores are used in the
simulation for synchronizing the modules. The package
offers several statistical reports, such as utilization
of processing elements, status plots showing when pro
cessing elements were active, and a snapshot report to
show the status of the entire simulation at a certain
point in time, to name a few. The simulation is intro
duced in Chapter III with the results in Chapter IV.
7
With the FFT being the most important algorithm
of the spectrum analyzer, it deserved some additional
study. The ultimate spectrum analyzer implementation
can be accomplished if the parallelism inherent in the
FFT algorithm is exploited. It can theoretically be ex
ecuted on 1 to N/2 processing elements, where N is the
number of data samples being transformed. Using the
benchmarks achieved from the implementation, the paral
lelism of the FFT process is investigated in Chapter IV.
The speedup is determined for different interprocessor
communication schemes, such as the N Cube and the Per
fect Shuffle, using the DSP coprocessors as the process
ing elements. The code for one of the nodes on the N
Cube was written for the DSP.
CHAPTER II
BACKGROUND
The Fourier Transform
The Fast Fourier Transform is simply an effi
cient method for computing the DFT[4]. Due to symmetry
and other mathematical simplifications that can be made,
the amount of complex operations for a Cooley Tukey FFT,
which is the basis for the FFT algorithm is approxi
mately Nlog2N. The amount of complex operations for a
DFT is approximately N2.
The FFT laid out as a signal flow graph for N =
16 points is shown in Figure 2.
The transformation from the time domain to the
frequency domain is based on the Fourier Transform which
is
oo
Sx(f) = JxCt) e^n dt (2.x)
00
where x(t) is the time domain representation of the
signal x and Sx(f) is the frequency domain representa
tion of the signal x[6]. To compute this transform for
sampled data the desired result is discrete so the
transform becomes
9
X=A+B
y=(a~b) jyk
FIGURE 2: Signal flow graph of 16 point FFT
Source: Hwang and Briggs, Computer Architecture
and Parallel Processing. New York: Wiley and Sons, 1985
p. 368
10
00
Sx(kAf) = Jx(t) e;i27rkAft dt (22)
oo
where k = 0 +1, 2, +3 .
Since an integral needs to be evaluated this can
be accomplished by adding up a series of narrow
rectangles so the equation becomes
S(kAf) = At l x(nAt) e"j27rkAfnAt . (23)
n=oo
The transform must be limited to a finite
interval so that the limits on the summation start at 0
and end at Nl. The spacing between the frequency
components is given by 1/T and t is T/N where N is the
number of samples. The transform then becomes[6]
S (kAf) = Vx(nAt) e;i27rkn/N (24)
X W A
n=0
Imperfections of the FFT
The three most often encountered problems in
using the FFT is leakage, the picketfence effect and
aliasing[7]. Figure 3 shows a sine wave and its corre
sponding continuous fourier transform.
However, since only a finite part of the signal
can be sampled and analyzed, this process can be thought
of as looking at the signal through a rectangular
11
window. The window and its continuous Fourier transform
is shown in Figure 4.
s()
sen
t t
0 f
FIGURE 3: Fourier Transform of sine wave
Source: Bergland, "A guided tour of the fast
Fourier transform", IEEE Spectrum, JUL 1989. p. 45.
W(t) W(()
r~ 1 /
T/2 0 T/2 t _%A'r 0 f
FIGURE 4: Fourier Transform of rectangular window
Source: Bergland, "A guided tour of the fast
Fourier transform", IEEE Spectrum, JUL 1989. p. 45.
The portion of the signal that is analyzed is
represented by as the product of the signal with the
rectangular window. The corresponding convolution in
the frequency domain results in the blurring of S(f)
into two (sin x) / x shaped pulses, Figure 5.
s(t) w(t)
vy'
S(f)W(0
FIGURE 5: Spreading of spectrum
Source: Bergland, "A guided tour of the fast
Fourier transform", IEEE Spectrum, JUL 1989. p. 45.
12
Consequently this series of spurious peaks or sidelobes
is called leakage.
However leakage can be reduced so that a result
closer to that of Figure 3 is produced by multiplying
the signal in the time domain by something other than a
rectangular window, such as the Hanning, Hamming,
etc[8].
The picketfence effect is caused by the
inability of the DFT to observe the spectrum as a con
tinuous function, because the spectrum is limited to
integer multiples of the reciprocal of the length of the
sampled data in the time domain. That is, the sampled
bins in the frequency domain are seperated from each
other in frequency, by the reciprocal of the amount of
samples in the time domain multiplied by the sampling
frequency. Consequently the observation of the spectrum
is analogous to looking at it through a "picketfence".
At worst case with a frequency component halfway between
two computed frequency components, the amplitude of the
signal is reduced to .637 in both of the adjacent
spectral windows, when using a rectangular window
function, as shown in Figure 6. The circles represent
the points where the frequency domain is sampled noted
as spectral windows. Take note that if the transformed
frequencies lie at integer multiples of the time record
reciprocal, neither the picket fence effect or leakage
13
occurs, as indicated by Figure 7.
$
FIGURE 6: Picketfence effect
$(f)
FIGURE 7: Integer multiples of time record reciprocal
The picketfence effect, like leakage can also
be alleviated by using a window function other than the
rectangular function. In order for the window function
to have minimal effect on the desired spectrum, since
the spectrum of the window is convolved in the frequency
domain, the window spectrum should approximate an
impulse function[4]. Unfortunately, this cannot be
realized since this would require an infinitely long
14
window. The spectrum of the window, in general consists
of a main lobe and various side lobes located on either
side of the main lobe. Therefore the main lobe should
be as narrow as possible, and the maximum side lobe
level should be as small as possible relative to the
main lobe. These two criteria cannot be simultaneously
optimized, consequently most usable window functions are
a compromise of these two factors[4].
The window functions that were implemented were
the rectangular window with
w(t) = 1 for t < r/2
(25)
= 0 elsewhere;
the triangular window with
w(t) = 1 2*111/t for 111 < t/2
(26)
0
elsewhere;
the Hanning window with
w(t) = cos^7rt/r for t < r/2
(27)
0 elsewhere;
and the Hamming window with
15
w(t) = .54 + 46COS (27Tt/r) for t
= 0 elsewhere.
Where r is the length of the time sampled block. Table
1 lists the amplitude of adjacent spectral windows when
a frequency component is halfway between two computed
frequency components; the transformed signal was multi
plied with the corresponding window function in the time
domain.
TABLE 1: Comparison of Window Functions
Adiacent Frequency Windows
Window Function 1 2 3 4
Rectangular .637 .212 .127 .091
Triangular .811 .090 .032 .017
Hanning .849 .170 .024 .008
Hamming .818 .113 .002 .007
As can be seen the triangular, Hanning and Hamming
window functions will result in better amplitude
accuracy when the component is between two computed
frequency components than the rectangular window
function.
Aliasing refers to the fact that a high
frequency can impersonate a low frequency if the
16
sampling rate is too low. The Nyquist criteria states
that aliasing does not occur if the sampling rate is at
least twice the highest frequency of interest[9]. A
lowpass filter called an antialiasing filter is used
to attenuate any frequencies higher than that of
interest.
The remaining concern, noted here is that there
will be some quantization error due to the use of
integer math in the DSP chip on the FFT algorithm. The
error of a fixed point FFT when compared to a floating
point FFT is
rms(error) = 247 [2 (M+3)/22 B^3j j (29)
where M is the amount of stages given by log2N and B is
the number of bits in the number[10]. A comparison was
done on the AT of a floating point FFT and an integer
FFT, with the first 4 real points out of 512 set to 64
with the remaining at 0. These two results are also
compared with the mathematically calculated discrete
fourier transform (DFT) of the same function, which is
considered to be the exact result. The calculated DFT
used is shown here.
 DFT = 256  cos (nx/N) cos (2n7r/N)  . (210)
The difference of the floating point FFT with the DFT
was calculated, as well as the difference of the integer
17
FFT with the DFT. The maxnorm is defined as the maximum
absolute number in a vector[11], which is useful in
measuring error when a calculated vector is subtracted
from the exact vector, as it was here. The maxnorm of
the exact FFT minus the floating point FFT is 0, whereas
the maxnorm of the exact FFT minus the integer FFT is
10.4. When a 256 point complex FFT is used to calculate
the 512 real point FFT as prescribed in the next section
the max norm for the integer FFT is 5.3. A subset of
the data is shown in Table 2.
FFT of a real function
Due to the symmetry of the FFT, it can be shown
that a N point complex FFT can be used to calculate a 2N
point real FFT[1]. Let h(n) and g(n) be N point real
sequences. Then
y (n) = h(n) + jg(n) (211)
also
h(n) < > H(k) (212)
g(n) < > G(k) (213)
h(n) + jg(n) < > H(k) + jG(k) . (214)
H(k) and G(k) are both complex; therefore, in the right
hand side of the above equation, the overall real and
18
TABLE 2: Comparison of Discrete Fourier Transform to
floating and integer Fast Fourier Transform1
/ exact floatina inteaer float dif int dif
214 56.7694 56.7694 58 0.0000 1.2306
215 55.8280 55.8280 57 0.0000 1.1720
216 54.8581 54.8581 54 0.0000 0.8581
217 53.8602 53.8602 53 0.0000 0.8602
218 52.8350 52.8350 53 0.0000 0.1650
219 51.7830 51.7830 53 0.0000 1.2170
220 50.7047 50.7047 51 0.0000 0.2953
221 49.6008 49.6008 50 0.0000 0.3992
222 48.4719 48.4719 48 0.0000 0.4719
223 47.3185 47.3185 49 0.0000 1.6815
224 46.1414 46.1414 45 0.0000 1.1414
225 44.9412 44.9412 45 0.0000 0.0588
226 43.7184 43.7184 43 0.0000 0.7184
227 42.4737 42.4737 43 0.0000 0.5263
228 41.2078 41.2078 40 0.0000 1.2078
229 39.9214 39.9214 41 0.0000 1.0786
230 38.6151 38.6151 38 0.0000 0.6151
231 37.2896 37.2896 38 0.0000 0.7104
232 35.9456 35.9456 35 0.0000 0.9456
233 34.5837 34.5837 34 0.0000 0.5837
234 33.2047 33.2047 33 0.0000 0.2047
235 31.8092 31.8092 32 0.0000 0.1908
236 30.3980 30.3980 30 0.0000 0.3980
237 28.9718 28.9718 30 0.0000 1.0282
238 27.5313 27.5313 27 0.0000 0.5313
239 26.0772 26.0772 28 0.0000 1.9228
240 24.6102 24.6102 25 0.0000 0.3898
241 23.1312 23.1312 25 0.0000 1.8688
242 21.6407 21.6407 22 0.0000 0.3593
243 20.1397 20.1397 22 0.0000 1.8603
244 18.6287 18.6287 19 0.0000 0.3713
245 17.1086 17.1086 19 0.0000 1.8914
246 15.5801 15.5801 17 0.0000 1.4199
247 14.0439 14.0439 17 0.0000 2.9561
248 12.5008 12.5008 12 0.0000 0.5008
249 10.9517 10.9517 11 0.0000 0.0483
250 9.3971 9.3971 10 0.0000 0.6029
251 7.8380 7.8380 11 0.0000 3.1620
252 6.2750 6.2750 7 0.0000 0.7250
253 4.7089 4.7089 5 0.0000 0.2911
254 3.1406 3.1406 5 0.0000 1.8594
255 1.5707 1.5707 5 0.0000 3.4293
1
xThis table is a partial
complex FFT
listing of a 512 point
19
imaginary parts are not directly identifiable. Let
Y(k) = H(k) + jG(k).
(215)
Now substitute Nk for k.
Y(Nk) = H(Nk) + jG(Nk).
(216)
The real parts of H(k) and G(k) are even and the
imaginary parts are odd. Consequently,
Y(Nk) = H*(k) + jG*(k)
(217)
Taking the complex conjugate of both sides
Y*(Nk) = H(k) jG(k). (218)
Adding equations 215 and 218
* Y(k) = H(k) + jG(k)
+ Y fNk^ = Hm ~ iG(k)
H(k) = .5[Y(k) + Y (Nk)] (219)
and subtracting equation 215 from 218
Y*(Nk) = H(k) jG(k)
 fY(k)____= H(k) + iG(k) 1
G(k) = .5j[Y (Nk) Y(k)]
(220)
20
Re[Y(k)] = R(k) (221)
Im[Y(k)] = jl(k) (222)
Equation 219 is rewritten here
H(k) = .5[Y(k) + Y*(Nk)]. (223)
H(k) = .5[R(k) + R(Nk) + j(I(k) I(Nk))]. (224)
Equation 220 is rewritten here
G(k) = .5j[Y*(Nk) Y(k)] (225)
G(k) = ,5[I(k) + I(Nk) + j(R(Nk) R(k))]. (226)
Equations 224 and 226 are the FFTs of two
separate real N point sequences computed with one N
point complex FFT. To compute a 2N point real FFT with
a N point complex algorithm the previous results are
used. The DFT algorithm is written
X(k) = J x(n) Wnk. (227)
n=0
The amount of points is N, but the amount of points
could just as well be 2N, so that the DFT could also be
written
21
2N_1 nk X(k) = l x(n) W2N n=0 (2
by substituting 2N for N. Now, break x(n) into two
sequences
h(n) = x(2n) (2
g(n) = x(2n+l) (2
this results in
X(k) = l x(2n) W2Nk+ S1x(2n+1) W^n+1)k (2
n=0 n=0
and
X(k)
Nl nk k N_1
1 h(n) WN + W2N l g(n)
n=0 n=0
(2
so that
X(k) = H(k) + W
N7G(k)
(2
Using the previous results by substituting
equation 223 and equation 225 into equation 233
= 5 [ Y (k) + Y* (Nk) + j wJj/?Y*(Nk)Y(k)) ] . (2
28)
29)
30)
31)
32)
33)
34)
X(k)
22
X(k) = 5 [R (k) + R(Nk) + j (I(k)I(Nk) ) +
W^2(I(k) + I(Nk) + j(R(Nk) R(k))]. (235)
X(k) = .5[R(k) + R(Nk) + j(I(k)I(Nk)) +
(cos(k7r/N) j sin(k7r/N)) *
(I (k) + I(Nk) + j(R(Nk) R(k)))]. (236)
After multiplying this equation out, and combining like
terms the following results,
X (k) = 5 [R(k) + R(Nk) + cos (k7r/N) (I (k) + I (Nk) ) +
sin k7r/N(R(Nk) R(k)) +
j (I (k) I (Nk) + cos(k7T/N) (R(Nk) R(K) )
 sin (k7r/N) (I (k) + I (Nk))) ]. (237)
These results are applied to a CooleyTukey
Radix2, DIF FFT program that was obtained from Burris
and Parks[12]. The FFT algorithm was converted from
Fortran to C and can be found in Appendix C as function
FFTSCAL1, which executes on the GSP. It can also be
found in Appendix B in the subroutine FFT, which exe
cutes on the DSP. This algorithm was also obtained from
Burris and Parks[12] with modifications made, and with
these results applied to it.
23
Decimation and Interpolation
An area worth some discussion (but which was not
implemented) is oversampling, decimation and interpola
tion of the sampled data. If the data can be sampled at
a faster than required rate (oversampled), then decima
tion is used to decrease the sample rate. The data is
digitally low passed filtered and the next dsp function
that takes place such as the FFT would only use every
other sample for a decimation of 2, for example. The
advantage of this is that the analog antialias filter
requirements can be relaxed, and it becomes possible to
do some of the antialias filtering in the digital
domain.
Interpolation is the opposite of decimation so
that the sampling rate is increased instead of de
creased. For example if a 2 times sampling rate were
desired, a 0 would be inserted between each sample and
this data stream would then be digitally low passed
filtered. In essence, a low pass filter is simply an
averaging process, so that at the output of the filter,
the inserted 0's become the average of the adjacent
samples; thus interpolation has taken place. Interpola
tion, along with decimation is useful when a noninteger
sample rate conversion is required. For example if a 2/3
sample rate conversion is desired, then an interpolation
of 2 and a decimation of 3 is performed. The interpola
24
tion must occur before the decimation[13]. The C code,
for an FIR low pass filter, which will exhibit linear
phase, is program LOWPASS found in Appendix F.
CHAPTER III
IMPLEMENTATION
Multiprocessor architecture
In a multiprocessing environment, it is conve
nient to call each processor a processing element (PE).
The system developed for this Thesis contains five pro
cessing elements as shown in Figure 8. The analyzer is
based on a Zenith 8 MHz AT, which serves to download
programs, and transfer data to and from the coproces
sors.
Antialiasing Filter
The antialias filter was designed with a Bessel
Function realization, to obtain as close as possible a
linear phase response. See Figure 9 for a schematic
diagram. It is a low pass filter with a 30 KHz cutoff.
The transfer function is[14]
21
___________________25.75x10_____________________ . (31)
s4 + 884.87 x 103s3 + 352.36 x 109s2 + 72.752 X 1015S + 6.438 X 1021
There is a three input summer at the input of
the filter so that three separate sources can be summed
together if so desired.
26
A TO D BOARD
12 BIT
A TO 0
CONVERTER
TIMER
COUNTER
E X PA N S I ON SLOT
OMA
6 2 3 7
CPU
80266
DIGITAL SIGNAL
COPROCESSOR
TMS320C26
10 MIPS
2 4 K
MEMORY
E X PA N S I 0 N SLOT
5 12 K RAM
E X PA N S 1 0 N SLOT
1 1
HOST PC
GSP
82 7 8 6
128 K
N N I OS
EPROM
T MS3 4 010
6 MIPS
1 MEGABYTE
OtSPLAY RAM
128 K USE R
PROGRAM RAM
GRAPHICS COPROCESSOR
MONITOR
FIGURE 8: Processors used in the Spectrum Analyzer
FM
FIGURE 9: AntiAlias Filter
w
28
Data Acquisition
With a 15 microsecond A to D converter, and us
ing DMA, 60KHz throughput can be achieved with a Dash 16
Metrabyte data aquisition board[15]. Both the DMAC and
the Metrabyte board are controlled via the AT through
the program that was written in C. AT DMA channel 1 is
used since this is a free channel. Memory is allocated
with a C library function and a corresponding pointer is
passed to page register 1, which is external to the
DMAC, and to the base address register in the DMAC. The
DMAC count register is set up to transfer 512 words of
data. The mode register in the DMAC is set for single
transfer mode, so that the DMAC will relinquish the bus
after each transfer. This must be the case, so that
DMAC channel 0 can perform a refresh to the dynamic ram
in the AT. The Metrabyte board contains programmable
timers that connect to the dma DREQ pin. Consequently
the sample rate is controlled by setting the appropriate
registers on the A to D board. To start the DMA trans
fer DREQ for channel 1 is unmasked by writing the
appropriate command to the mask register in the DMA.
See function getdma in Appendix A for more details. The
data is transferred to AT main ram to be loaded to the
DSP board at a later time.
29
DSP Coprocessor Board
The Digital Signal Processor (DSP) daughter
board uses a TMS320C25 produced by Texas Instruments.
It is a Software Development System (SWDS) that has a
software monitor to debug software with. Instead of
using the monitor, the board is controlled by the AT,
which at initial program execution downloads the
program that the TMS320C25 will execute. A 256 point
FFT is used to calculate a real 512 point FFT, (as de
scribed in the Background section). In order to trans
fer data to the DSP board, the AT puts the TMS320C25
into hold mode, so the AT can directly control the local
coprocessor. The AT then transfers data to the ram on
the DSP board.
The TMS320C25 can execute at 10 MIPS, due in
part to its internal Harvard architecture and pipelining
of instructions. In a Harvardtype architecture, pro
gram and data memory reside in separate address spaces,
and use seperate hardware busses to allow a full overlap
of instruction fetch and execution. There are 544 words
of on chip ram, and the data to be FFT'd is moved there.
The DSP central arithmetic logic unit contains a 16bit
scaling register, a 16 x 16bit parallel multiplier
which is implemented in hardware, a 32bit Arithmetic
Logic Unit (ALU), a 32 bit accumulator, and some addi
tional scalers available at the outputs of both the ac
30
cumulator and the multiplier[16],[17]. Moreover, the
TMS320C25 architecture allows moves, shifts, multiplies,
and accumulations to occur in a single 100 nsec machine
cycle. When data is loaded into the accumulator, it can
be left shifted up to 16 bits, and when data leaves the
accumulator it can be left shifted 0 to 7 bits in the
same cycle. Consequently, this processor is excellent
for performing tasks which are math intensive, such as
DSP applications. To take full advantage of these
capabilities, and to execute the FFT in minimum time,
the software was written in assembly language for the
DSP.
GSP Coprocessor Board
The graphics board is a model Pepper SGT,
acquired from Number Nine Computer Corporation. It
contains a Texas Instruments TMS34010 GSP[18], which
executes at 6 MIPS, and an Intel 82786 coprocessor.
Number Nine provided an operating system called Number
Nine Intelligent Operating System (NNIOS), to ease the
development process[19],[20]. There is a large range of
C library functions to facilitate the drawing of lines,
rectangles, etc., along with an eprom on board the gsp
for fast execution. Communications between the host AT
and the GSP board is handled by a device driver, as part
of the overall operating system. Once a device driver
is installed in the AT, it looks very similar to BIOS,
31
to DOS. It is called by DOS on one side and communi
cates with the GSP board on the other. The device
driver provides a standard interface to the C calling
functions that use the device despite any idiosyncrasies
of the device. The device driver also would allow a
different type Number Nine Graphics board to be in
stalled without changing the C programs that use it.
Installing the corresponding device driver with the
different GSP board would be the only requirement, if
the C program is designed with this purpose in mind.
There is a provision, known as a user function,
in the GSP board to download TMS34010 executable code.
There is 128K of instruction ram for this purpose. The
GSP board also contains 128K of NNIOS eprom and 1 Meg of
display ram with the capability to expand to 4 Meg of
display ram. It was programmed for 640 x 480 display
resolution at 8 bits per pixel so that 256 colors can be
displayed at once if desired, out of a 16,777,216 color
palette.
The Intel 82786 provides for multiple bitmap
management, hardware windows and display field. The
bitmaps in memory are allocated dynamically so that
bitmaps of different sizes may coexist. The software
can then draw into each bitmap separately. The display
of the drawn images is then viewed through hardware
windows. More than one window can be associated with a
32
particular bitmap, and can overlap each other and roam
over their respective bitmap. Figure 10 shows a bitmap
with multiple windows.
f
_______________________^
7
><
a.
windows may
include any
portion of
the bitmap *
^windows may
overlap and
share common
data in
windows may roam within the
limits of the bitmap perimeter 
(A)
jC.
windows may include entire bitmap
X640 PIXELS
7.
Z4 BJTS/PIXEL
BITMAP
FIGURE 10: Bitmap with windows
Source: "Pepper SGT User's Guide", Number Nine
Computer Corporation, Cambridge MA, 19861989, p. 39.
The display field is the viewable area of the
display monitor. Until NNIOS instructs the board to
display a window the default field color is displayed.
The windows simply display over the field.
Software
Numerous programs were written for the spectrum
analyzer, as well as programs that simulated certain
portions of the spectrum analyzer. The most significant
33
program titled SPECTRUM can be found in Appendix A. The
files used in SPECTRUM are noted in Table 3 with their
corresponding processors.
TABLE 3. Functions used in SPECTRUM
80286 AT Processor
allocbuf
downld
getjnt
getdma
dmadone
fftdone
c25hold
rectangular
triangular
hanning
hamming
TMS320C25 DSP
fftdiv
Program FFTDIV which is downloaded to the DSP coproces
sor, for the fft execution is found in Appendix B. The
FFT algorithm was derived from Burrus and Parks[12] and
modified to allow a N point complex FFT to calculate a
2N point real FFT as discussed in Chapter II.
The spectrum analyzer functions as follows. The
function allocbuf is used to allocate memory in AT main
ram to store data that is transferred in via dma from
the A to D board. Program fftdiv, which was written in
TMS320C25 assembly language, is then downloaded to the
DSP coprocessor for execution. The DSP is held in both
reset and hold mode by the AT during this process, and
34
any other process that requires any data transfer be
tween the two. The AT then displays a menu to the four
window functions, rectangular, triangular, hanning and
hamming for user selection. At this point the user may
also exit the program and return to DOS if so desired.
The selected window function is calculated and loaded
into the DSP board by calling the corresponding window
function call via the switch statement. The communica
tion channel to the GSP board is then opened up to cre
ate bitmaps and windows for the drawing environment.
The working palette is set up with the desired colors.
The selected window function is then displayed, as shown
in Figure 11, until the user presses any key such as the
space bar.
After this occurs the x and y axis are drawn in
preparation for spectrum displays. Data is then trans
ferred via dma in from the A to D board to AT host mem
ory to start the spectrum display process. The first
FFT cannot start until the first block of data is ready,
so the AT polls the status of Chan 1 dma until this is
the case by function call dmadone. The data is then
loaded into the DSP board for FFT execution. Imme
diately after the AT gives the command for FFT execution
it also starts transferring in data from the A to D.
From this point forward FFT execution will occur in
parallel with data acquisition for the next FFT. At
35
FIGURE 11: Display of selected vindow function
the completion of the first FFT execution, and the com
pletion of the next dma process, the data is loaded back
into the AT main ram and the data that was just dma'd,
is loaded into the DSP card for the next FFT. The AT
polls the XF pin on the DSP chip to determine when FFT
execution is complete, via function call fftdone. The
two function calls dmadone and fftdone can be thought of
as semaphores used for synchronization. Execution does
not continue until their conditions are satisfied. A
pointer, that points to the data that was FFT'd is given
to the GSP card for display. Consequently at this time,
I
I
I
36
data is being transferred in via dma, the DSP is
executing an FFT, and the two processors on the GSP
board are busy displaying the data, all concurrently.
See Figure 12 for a spectrum display of three signal
generators.
FIGURE 12; Display of spectrum
Figure 13 shows a flow chart of the spectrum
analyzer and Figure 14 shows how the data flows through
the system.
The FFT and dma transfer of data occur much
faster than the time for displaying the data, conse
quently these two processes are held up while waiting
for the display to finish, causing a lapse in the data
FIGURE 13; Flow Chart of the Spectrum Analyzer
(FIGURE continued next page)
(FIGURE 13 continued)
FIGURE 13; Flow Chart of the Spectrum Analyzer
EXTERNAL
FIGURE 14: Data flow through System
40
stream. If the dma could constantly be transferring
data with the FFT execution and display process keeping
up with it, or even occurring at a faster rate, then the
spectrum analyzer would be considered to be "real time".
The process of displaying spectrums continues until the
user presses any key, in which case the current spectrum
will stay frozen on the screen until another key is
pressed. The user then has the option of selecting a
new window function or terminating the program. The en
tire process then starts over as noted above.
Another program of interest is program SPECREP8,
which is found in Appendix A. It opens up multiple
bitmaps and windows for the display of numerous spec
trums so that a history can be observed. The latest or
newest spectrum is displayed on the bottom window and
the previous spectrums scroll up the screen. See Figure
15. It uses the userfunction FFTSCAL1, which executes
on the GSP, found in Appendix C. Program SPECREP8
prompts the user for the desired amount of spectrums to
be displayed.
42
ments, software requirements and a great deal of gained
insight into the system. NETWORK II.5 allows the simu
lation of PEs, busses, storage devices, files, software
modules along with the capability to control these
devices. In this simulation binary semaphores, which
can hold a true or false value, are used for processor
synchronization.
The system simulated uses four DSP boards, a AT,
with its on board DMAC, one GSP board, and a multiplexed
Analog to Digital Converter board. These processors and
peripherals are characterized with the same parameters
of performance, as the hardware used in the previously
described spectrum analyzer. See Figure 16 for a block
diagram of the Hardware.
In the case of the DSP boards the simulation
assumes dual port ram so that data is dma'd into the DSP
board while the DSP is executing the FFT. The dual port
ram can be thought of as a series of block buffers ar
ranged in a circular fashion. The first block of data
is loaded into block 1, after which case the DSP
executes the FFT and returns the result to the same
location. While this FFT is executing the DMA loads
data into block 2. The DSP executes the FFT faster than
the time it takes for the A to D board to convert an
entire block of data. Consequently the next FFT
execution occurs immediately after the last word of data
HARDWARE INTERCONNECTION DIAGRAM
* *PE DMA CONTROLL* * * * * * * *
* PE HOST PC * * ER * * PE DSP1 * * PE DSP2 * * PE DSP3 * * PE DSP4
* * * * * * * * * *
PCBUSXX
DSP_BUS1:
DSP_BUS2:
DSP_BUS3
DSP_BUS4:
VIDEO_BUS:
X
X
X
X
X
X
X
X
X
X
X
X
MAINRAM * * DSP_RAM1
DSP_RAM2
DSP RAH3
DSP RAM4
GSP_RAM
PE GSP
PC_BUS
DSP_BUS1
DSP BUS2
DSP~BUS3
DSP_BUS4
VIDEO_BUS
X
X
X
X
X
X
DISPLAY
A TO_Dl
A TO D2
FIGURE 16: Block diagram of Hardware in the Simulation
u
44
is loaded into the second buffer. This process contin
ues in a circular fashion, that is when the last block
is reached, the next block, is block 1. The AT will
remove the data from block 1 and store it in main ram
for later display after the FFT in block 1 is complete
and so on. In this system it is advantageous to use the
circular buffer so that if the AT is held up or inter
rupted for some reason, it is not required to move the
data off the buffer until it becomes entirely full. In
an operating system environment when a process passes
data to another process through a circular buffer, it is
known as a pipe. The circular buffer scheme gives the
AT more time for overhead, and would allow more spec
trums to be stored, before an overflow in ram space oc
curs, as opposed to a two buffer (block) scheme where
the dma is transferring data to one block while the DSP
is executing the FFT on the other. Figure 17 shows a
diagram of the buffers and the processes producing and
consuming data in the buffers.
The data acquisition board has the ability to
multiplex up to 16 channels of input into one A to D
which has a worse case 15 uS conversion time. Conse
quently, this conversion time is fast enough so that one
A to D can be used for four channels sampled at a 125 uS
rate, (although in the simulation four A to Ds are used
FIGURE 17
Buffers used in the Simulation
(J1
46
with the understanding that only one multiplexed A to D
is necessary). The GSP will display the four channels
in four different windows, but in order to keep the
simulation at a reasonable length they are grouped
together into one, with the understanding that four
channels are being displayed.
The purpose of the simulation is to identify
bottle necks and to gain insight into the system. In
this particular system, the GSP board does not display
the spectrums as fast as the DSP board is producing
them. The data is stored in main ram while the AT waits
for the GSP board to finish displaying the spectrum, in
which case the AT loads the next set of data points to
be displayed. The GSP board is obviously the major
bottle neck, consequently, main ram will eventually
overflow. The question is how many spectrums can be
displayed before this occurs, and how long does it take
for this to occur. This is the major theme that this
particular simulation addresses, as well as the
utilization of the various devices.
The simulation is entered using a CACI editor
that prompts the user through the input to generate a
Net list which is found in Appendix E.2. The software
module diagram that is generated from the Net list is
found in Appendix E.l.
47
The software modules give the best indication of
how the simulation works. Some of the semaphores for
the four channels have identical names with the addition
of a suffix number indicating the channel number the
semaphore corresponds to. Since these semaphores func
tion the same on a per channel basis, the suffix number
of the semaphore is left off in this discussion. Ini
tially the semaphores are initialized to their proper
values. It takes no simulation time to set or reset a
semaphore. The DMAC starts transferring data at the
beginning of the simulation, and continues to do so for
the length of the simulation. The DMA modules chain to
each other so when one transfer finishes the next one
starts. After 512 transfers occur semaphore s_dma_done
is set so that the FFT will start execution on the cor
responding DSP board. The DSP module will then start
execution and set the semaphore s_start_dma, so that the
dma will transfer data while the FFT is executing. Also
semaphore s_dma_done is reset so that the DSP will not
start another FFT execution after it finishes the first
one; it must wait for the next dma transfer to complete.
After the FFT completes execution the DSP sets semaphore
s_fft_done signaling the AT to transfer the data to its
main ram for later display. The AT will reset
s_fft_done, transfer the data and set semaphore
s_display_xfer, so that the data will be loaded into the
48
GSP. After the data is loaded into the GSP, semaphore
s_display is set so that the GSP will start execution.
Every time the GSP finishes a display it resets
semaphore s_display so that another block of data will
be downloaded to it for another display.
CHAPTER IV
RESULTS
Implementation
The execution times of the FFT as run on the
various processors are noted in Table 4.
TABLE 4; Benchmarks of 256 point complex FFT
256 pt complex FFT
80286 (8MHz) C Lang. C Lang. Assembly 32 bit float 16 bit Integer 16 bit Integer 5000 msec 2000 msec 72 msec1
TMS320C25 Assembly 16 bit Integer 9 msec
GSP C Lang. 16 bit Integer 300 msec
Take note that the 80287 math coprocessor was
not installed in the PC. Other benchmarks of interest
are shown in Table 5.
Note that the GSP user function calls NNIOS dis
play functions from compiled TMS34010 code. The most
impressive result was the execution speed of the DSP,
^Courtesy Dan Michaels
50
TABLE 5: Task benchmarks
Data Acquisition and Transfer
From A/D to host memory (DMA) 8.5 msec
(512 word transfer at 60KHz
sampling rate)
512 word transfer from PC memory
to DSP 5 msec
512 word transfer from PC memory
toGSP 13 msec
DSP Time
256 square roots
128 point complex fft
Rearrange the 256 point data into
512 real data
10 msec
5 msec
< 1 msec
GSP Time
Display 512 spectrum points .5 sec
Userfunction used to display .3 sec
whereas the most discouraging result was the display
speed of the GSP board. The Number Nine Intelligent
Operating System (NNIOS) is the likely cause of this
inefficiency. Perhaps, this could be overcome by by
passing the operating system entirely and programming in
assembly language. Unfortunately, with the particular
graphics board that was available, it was not recom
mended that this be done, and there was little
51
documentation to support this kind of effort. Other
alternatives would be to use a different graphics board
or even design one specific to this application.
Simulation
When sampling the data at 8 KHz, the PC was able
to remove data fast enough from the DSP cards so that no
overflow occurred there. However, after 7.76 seconds
the 448 K byte of main ram available for spectrum
storage filled up, resulting in 7 spectrums of each
channel being displayed. The display of spectrums will
continue until the ram has been emptied out. This will
result in a total of 121 spectrums of each channel
displayed.
The sampling rate was varied to study the sensi
tivity of the system. Table 6 shows the utilization of
the processors with various sampling rates. The amount
of times FFT1 through FFT4 each were executed by their
respective DSPs, and the amount of times Draw Line was
executed by the GSP is shown in Table 7. The amount of
time it took for the PC main ram to fill up is shown in
Table 8.
52
TABLE 6: Utilization of Processing Elements
Sample Rate Host PC DMAC PE DSP1..4 GSP
2KHz 12.7% .14% 5% 92%
4KHz 20% .28% 10.1% 92%
8KHz 34.96% .56% 20.3% 91.4%
10KHz 42.2% .7% 25.3% 91.7%
TABLE 7: Amount of times instructions executed
INSTRUCTION
Sample Rate FFT1..4 DRAW LINE
2KHz 151 37
4KHz 129 16
8KHz 121 7
10KHz 119 6
TABLE 8: Time for PC main Ram to fill up
Sample Time
Rate
2KHz 38 sec
4KHz 16.5 sec
8KHz 7.76 sec
10KHz
6.1 sec
53
The results of the simulation with the sampling
rate at 8KHz are found in Appendix E.3. When the
sampling rate was set above 10 KHz the PC could not
remove the data from the DSP coprocessors fast enough
resulting in meaningless results.
It is interesting to note that although the PC
is simply transferring data, its utilization was at 35%,
for the 8KHz sampling rate case, which is higher than
one would expect. Most likely the data transfer time
could have been greatly decreased if the code for the PC
had been written in assembler, rather than Microsoft
Quick C version 1. These utilization results are very
valuable in determining which PEs should execute and
carry out which functions. For example, if enough
execution time remained for the PC, then it would be
possible for the PC to perform a running time average on
the data while waiting for display.
Status plots for the simulation that sampled at
8KHz, are contained in Appendix E.4. An X indicates
that the device was active or that a semaphore was set.
Appendix E.5 contains utilization plots. If the device
was active for an entire time bin, then that corresponds
to 100% utilization in that bin. If the device was
active for one half the length of the particular bin
interval, then this would correspond to 50% utilization
in that bin.
54
Parallelism of the FFT
It is beneficial to take the previous results to
conjecture and demonstrate certain aspects of a parallel
system. If the FFT is executed on a machine that has a
single instruction stream and a multiple data stream
(SIMD), then a certain amount of the FFT butterflys can
be executed in parallel. This class of computer organi
zation corresponds to array processors which have multi
ple processing elements supervised by one control unit.
All the processing elements receive the same instruction
broadcast but operate on different data sets from dis
tinct data streams. The execution of the FFT in paral
lel results in a finegrain system.
By referring to Figure 2 in Chapter II it can be
seen that the reguired amount of PEs to perform the
parallel Npoint FFT is N/2[5]. In the serial method
log2N stages of computations, with N/2 bufferfly opera
tions per stage are reguired, giving a time proportional
to N/2 log2N. Using the array processor, the N/2 but
terfly operations are performed in parallel so that
log2N parallel operations are performed. The speed up
is given by the time to execute the algorithm with one
processor divided by the time to execute the algorithm
with P processors. When there is insignificant inter
processor communication time, the speed up is N/2. The
55
efficiency is the speed up, divided by the amount of
processors, which in this case would be 100%.
The Perfect Shuffle
If the machine is loosely coupled, that is each
PE has its own private memory then it becomes necessary
for the data to be transferred via some sort of network
from PE to PE. It can be shown that an interconnecting
pattern known as the Perfect Shuffle[22] is useful in
transferring the data. Other applications of the
Perfect Shuffle are polynomial evaluation, sorting, and
matrix transposition[22].
The Perfect Shuffle can be thought of as letting
the N points of the FFT represent a vector of length N,
with 0 being the first element indice, i being the i^h
element, and Nl being the last element indice in the
vector. After transmitting the vector through the
Perfect Shuffle network, the indices are mapped in an
interleaved pattern dictated by[22],
x(i) = 2i o <= i <= (N/2) 1 (41)
= 2i + 1 N N/2 <= i <= N 1. (42)
Another view of the shuffle is related to the
binary representation of the indices of the elements in
the vector. The ith element is shuffled to position i'
where i' is the number obtained by cyclically rotating
the bits one bit position to the left. Let the binary
expansion of i be the following[22]:
56
i = imi2ltl 1 + im22m 2 + + ii2 + ^o* (43)
Then i' is given by
i# = im22m_1 + im32TCl~2 + + Lo2 + inii (4~4)
where
m = log2N. (45)
So for example if N = 8, then element 5 which is
101, will move to position Oil or 3 after the Perfect
Shuffle. Notice that both the first and last elements
in the vector never move positions after any amount of
perfect shuffles.
The Perfect Shuffle routing network along with
the processing elements that calculate the butterflys is
shown in Figure 18, which was modified from a drawing
obtained from Stone[22]. Notice that, at each stage,
every PE always transfers its data to the same next PE.
57
FIGURE 18: PEs and Perfect Shuffle routing Network
Adapted from Harold S. Stone, "Parallel Processing
with the Perfect Shuffle", IEEE Transactions on Comput
ers, Vol. C20, Feb 1971, p. 155.
58
The following sequence repeats m times to compute the
FFT:
1) shuffle;
2) N/2 butterflys in parallel;
3) transfer results back into input
of shuffle network.
Figure 2 of Chapter II taken from Hwang and
Briggs[5] was modified to show how the PEs are utilized
as dictated by the Perfect Shuffle shown in Figure 18.
The PEs are indicated inside of the butterfly circles,
Figure 19.
59
X=A+B
Y=(AB)Wk
FIGURE 19; Perfect Shuffle with PEs shown in circles
Adapted from Hwang and Briggs, Computer Architec
ture and Parallel Processing. New York: Wiley and Sons,
1985 p. 368
60
The N Cube
The cube interconnection network is another
scheme worthy of discussion. Figure 20 shows a 3 cube
with 8 nodes.
000 001
S\oiO
101
110 111
FIGURE 20: A3 Cube of 8 nodes
Source: Hwang and Briggs, Computer Architecture
and Parallel Processing. New York: Wiley and Sons, 1985
p. 342.
A PE is located at each node. Take note that in the
FFT, the items being paired in a butterfly at stage k
1.U
are those whose indices differ in the (log2N k l)tn
bit position of the binary representation. In the N
Cube the neighboring indices only differ in one bit
position, making it ideal for fft execution[5]. A cube
network with M PEs corresponds to an N Cube with N =
log2M. The PE at the vertex may have its address repre
sented by A = (an1.. .a^aQ) 2 where 0 <= A <= Nl. Conse
quently, the ndimensional cube network of N PEs is
specified by the following n routing functions:
61
ci(anl*aiao) anl'*ai+idiail*ao (46)
for i = 0,1,2,...,nl
In other words if i = 0, then the neighbors that differ
from each other in bit position 0 communicate to each
other.
Initially each processor in the 3 cube contains
two different time samples when executing a 16 point DIF
FFT. After the first 8 (stage 1) butterflys are
executed in parallel the cube uses the C2 routing
function as shown in Figure 21 by the bold lines.
000 001
110 111
FIGURE 21: C2 routing function after stage 1
Adapted from HWang and Briggs, Computer Architec
ture and Parallel Processing. New York: Wiley and Sons,
1985 p. 342
Each neighbor must interchange a word of information to
its corresponding neighbor as dictated by the routing
path. After the second stage routing function ^ is
62
utilized, and after stage 3 routing function CQ is uti
lized as shown in Figure 22.
000 001
s. 010
k 1
110 111
Routing function
C^ after stage 2
Routing function
Cq after stage 3
FIGURE 22: Routing functions after stage 2 and 3
Adapted from Hwang and Briggs, Computer Architec
ture and Parallel Processing. New York: Wiley and Sons,
1985 p. 368
Notice that the Perfect Shuffle utilizes the
same routing function regardless of stage, whereas the N
Cube utilizes N different routing functions.
Again Figure 2 of Chapter II was taken from
Hwang and Briggs and modified to show how the PEs are
utilized as dictated by the N Cube shown in Figure 20.
The PEs are indicated inside of the butterfly circles,
Figure 23.
63
FIGURE 23: N Cube with PEs shown in circles
Adapted from Hwang and Briggs, Computer Architec
ture and Parallel Processing. New York: Wiley and Sons,
1985 p. 368
64
System using 4 PEs
For the purpose of this thesis, Hwang and
Briggs', and Stone's results are used to investigate a
practical system that only utilizes several PEs. Be
cause of cost and space considerations an implementation
using several PEs could prove to be advantageous. The
flow diagram of a 16 point FFT utilizing the Perfect
Shuffle network with 4 PEs was derived by applying the
Perfect Shuffle to the flow diagram, Figure 24, as it
was for the N Cube in Figure 25. Figure 25 was taken
from Hwang and Briggs with the appropriate Twiddle
Factors added to the diagram.
In either the Perfect Shuffle or the N Cube
network, given a certain amount of PEs, a certain number
of Parallel data transfers and a certain amount of par
allel butterflys result as indicated by Table 9. These
numbers were achieved from formulas given by Hwang and
Briggs for the N Cube[5]. Also their respective total
amount of network transfers is shown.
STAGE 1
STAGE 2
STAGE 3
STAGE 4
PE 0
PE 1
PE 2
PE 3
FIGURE 24: Flow Diagram of Perfect Shuffle with 4 PEs
PE 0
PE 1
PE 2
PE 3
FIGURE 25: Flow diagram of N Cube with 4 PEs
Adapted form Hwang and Briggs, Computer
Architecture and Parallel Processing. New York:
Wiley and Sons, 1985 p. 371
ON
ON
67
TABLE 9: 256 point FFT with N PEs
Number Parallel Parallel Parallel Perfect N Cube
of PEs data data butterflvs shuffle network
transfers transfers network transfers
Perfect Shuffle N Cube transfers
1 0 0 1024 0 0
2 128 64 512 128 128
4 128 64 256 384 256
8 96 48 128 672 384
16 64 32 64 960 512
32 40 20 32 1240 640
64 24 12 16 1512 768
128 14 7 8 1778 896
Note that the columns parallel data transfers
and parallel butterflys designate the overall number of
events that occur for each PE, not the total number of
events. For example, in the 256 point FFT, 1024 butter
flys are always executed; so with 128 PEs, 128 butter
flys are executed at a time and 8 parallel butterflys
are executed in total. Table 10 shows the same data
TABLE 10: 4 PEs with variable FFT lencrth
FFT Parallel Parallel Parallel Perfect N Cube
lenath data data butter Shuffle network
transfers transfers flvs network transfers
Perfect Shuffle N Cube transfers
64 32 16 48 96 64
128 64 32 112 192 128
256 128 64 256 384 256
512 256 128 576 768 512
1024 512 256 1280 1536 1024
2048 1024 512 2816 3072 2048
with the amount of PEs held at 4 and letting the FFT
length vary.
68
The architecture of the Perfect Shuffle using
four TMS320C25s is shown in Figure 26. The blocks noted
as buffers accept data written to them, which is stored
so that a different PE can read the data a short time
later. The outlined arrows indicate which PE is writing
to the buffer, where the solid arrow indicates which PE
reads the buffer, during the network transfer. PE 1 and
PE 4 write one complex word and read one complex word,
whereas PE 2 and PE 3 write two complex words and read
two complex words. Consequently the amount of instruc
tions to determine the network transfer time is four
writes and four reads. If "looped code" is used, there
is also an if statement required to determine if there
is a parallel transfer for the particular stage that
just completed execution. Because the route through the
network is the same for any transfer, no time is
necessary to determine and/or switch the route.
Processor synchronization can be achieved through a
hardware interrupt scheme, where a processor that wrote
a word out to a buffer then interrupts the processor
that must read it. Another scheme would use a semaphore
that resides in the buffer that can be set and reset.
The architecture for the N Cube is shown in
Figure 27. Since the PEs must exchange data with neigh
69
FIGURE 26
Architecture for the Perfect Shuffle
A to D
DUAL PORT
RAM
WR/ WR/
DMAC
GLOBAL
RAM
FIGURE 27; Architecture for the N Cube
71
boring nodes, dual port ram was used, so that both pro
cessors write the complex data to the ram at the same
time and then read the complex data at the same time.
Consequently the network transfer time can be determined
by two writes, two reads, in addition to an if statement
determining if a network transfer must take place and an
if statement determining which route must be taken, if
looped code is used. Again processor synchronization
can be achieved through interrupts or semaphores resid
ing in the dual port ram. Figure 27 shows an interrupt
scheme so that when a PE writes to dual port ram the PE
that must read it is interrupted. The architecture for
the Perfect Shuffle and the N Cube is similar to the
SIMD machine but lacks a Control Unit (CU) to broadcast
an instruction and to control the PEs.
The pseudo code for a 256 point FFT utilizing
the Perfect Shuffle is shown in Figure 28. The pseudo
code shown in Figure 29 for the 256 FFT that utilizes
the N Cube is very similar. The only difference is that
the routing function must be determined and used. Note
that all four processors are executing these exact same
routines in parallel.
The twiddle factors used for butterfly calcula
tion by the 4 PEs in the N Cube through out the FFT are
shown in Table 11.
72
PROCEDURE {256 POINT FFT UTILIZING PERFECT SHUFFLE}
Read sampled data from global ram;
Perform 6 stages of FFT;
WHILE stage 7 or stage 8
DO Begin
FOR i: = 1 to 32 step 1
Begin {FOR}
Route via Perfect Shuffle;
Calculate Butterfly
End {FOR}
End {WHILE}
Transfer FFT result to global ram;
END{PROCEDURE}
FIGURE 28: FFT utilizing Perfect Shuffle
PROCEDURE {256 POINT FFT UTILIZING N CUBE}
Read sampled data from global ram;
Perform 6 stages of FFT;
WHILE stage 7
DO Begin
FOR i := 1 to 32 step 1
Begin {FOR}
Route via Cj;
Calculate Butterfly
End {FOR}
End {WHILE}
WHILE stage 8
DO Begin
FOR i := 1 to 32 step 1
Begin {FOR}
Route via Cgl
Calculate Butterfly
End {FOR}
End {WHILE}
Transfer FFT result to global ram;
END{PROCEDURE}
FIGURE 29: FFT utilizing N Cube
74
TABLE 11: Twiddle factors used in PE butterflys
STAGE Twiddle Factors PE offset
1 Increment by 4 1
1 of each
2 Increment by 8 2
2 of each
3 Increment by 16 4
4 of each
4 Increment by 32 8
8 of each
5 Increment by 64 16
16 of each
6 Increment by 128 (0) 32
32 of each
7 Increment by 256 (0) 64
32 of each
8 All PE butterflys 0
use Twiddle factor 0
Note: All numbers are modulus 128
Note that each PE calculates 32 butterflys per stage, so
that in the first stage for example, the first PE uses
Twiddle factor 0 for the first butterfly, Twiddle factor
4 for the next butterfly and so on. The column PE off
set designates the offset of the Twiddle factors used in
the next PE's butterfly calculations. In other words
the second PE will use Twiddle factor 1 for the first
butterfly, Twiddle factor 5 for the next butterfly and
so on.
75
The Twiddle factors used in the Perfect Shuffle
Network are the same for the first six stages and the
eighth stage as it was for the N Cube. However, in the
seventh stage the first two PEs use Twiddle factor 0 and
the following two PEs use Twiddle factors 64.
The Cube Network utilizes less total network
transfers than the Perfect Shuffle network and it leads
to near symmetry in the software, that executes at the
nodes. Consequently, the pseudo code, for one of the
four nodes, of the 256 Point FFT utilizing the N Cube
was expanded into TMS320C25 assembly language, which is
contained in Appendix H. The assembly language for the
Perfect Shuffle network and the other nodes on the cube
is very similar. Interrupts are used to reset a binary
semaphore to synchronize the processors after a network
transfer. The cycle times of the assembly instructions
are given in the assembly manual allowing the speed up
to be calculated by comparing the FFT section of a
single PE to four PEs. The speed up was found to be
close to 4 resulting in nearly 100 % efficiency. Note
that this speed up calculation does not include bit
reversal and the math required at the end of the FFT
algorithm to convert the array so that a N point complex
FFT can be used to calculate a 2N point real FFT, as
discussed in Chapter II. The speed up of the Perfect
76
Shuffle scheme would be slightly lower than the N Cube
scheme but both appear to be nearly linear.
It is of interest to note the speed up of the
FFT if global ram is used for interprocessor communica
tion. The first case uses 4 PEs on the same circuit
board where the global ram can transfer data at 0 wait
states with the TMS320C25. The second case noted uses
one PE per circuit board, utilizing PC ram as the global
ram, also the host PC must intervene to assist in inter
processor communication. The transfer time for the PC
used in this calculation is 10 us per word taken from
the previous benchmark, found at the beginning of this
Chapter. The speed ups that can be achieved on a 256
point FFT are noted here in Table 12.
TABLE 12; Speed up using global ram, fixed FFT length
Speed Up
Number of PEs Case 1 Case 2
(0 wait state) (PC ram)
2 1.9 .81
4 3.8 .58
8 6.8 .43
16 10.8 .33
32 14.5 .27
64 16.4 .23
128 16.5 .19
If the amount of PEs are set at a constant of 4 and the
length of the FFT is allowed to vary, the following
speed ups result as shown in Table 13.
77
TABLE 13; Speed up using global ram, fixed PE
FFT length Speed Up
Case 1 Case 2
(0 wait state) (PC ram)
64 3.7 .46
128 3.7 .52
256 3.8 .58
512 3.8 .68
1024 3.8 .69
2048 3.8 .74
The pseudo code, for one node of the two
dimensional cube, for the entire FFT process which
includes loading in the window function, data to be
transformed, the FFT, bit reversal, and the complex N to
real 2N conversion is shown in Figure 30. The data
array to be transformed and transformed array is ar
ranged so that the first position is real directly fol
lowed by its imaginary component. Initially a window
function is downloaded to the board which doesn't change
until the user desires a different window function. In
the next step, the two complex data items for the but
terfly are obtained by reading every fourth complex sam
ple in the first half of the array and pairing it with
every fourth complex sample in the second half of the
array. Note that this data is also multiplied by its
corresponding window data. The data is then fourier
transformed as shown previously. The data is then moved
back to the global ram. Note that two complex words of
78
PROCEDURE {512 Point Real FFT Utilizing N Cube}
N := 256;
/* Get window function */
FOR I := 0 to 2N 1 step 4
BEGIN FOR
/* Real position */
WINDOW[l/4] := GLOBAL[l];
/* Imag position */
WINDOW[l/4 + 1] := GLOBAL[l+1];
END FOR
/* Get Data from global ram and window */
FOR I := 0 to N/2 1 step 4
BEGIN FOR
/* Real position */
INPUT[4(l/4)] := GLOBAL[2l] WINDOW[l/2];
/* Imag position */
INPUT[4(l/4) + 1] := GLOBAL[2(l+1)] WINDOW[l/2+1];
/* Real position */
INPUT[4(l/4)+2] := GLOBAL[N+2l] WINDOW[N+l/2];
/* Imag position */
INPUT[4(l/4)+3] := GLOBAL[N+2(1+1)] WINDOW[N+l/2+1];
END FOR
FFT
/* Write Data back to Global Ram */
FOR I := 0 to N/2 Step 4
BEGIN FOR
/* Real position */
GLOBAL[4*l] := INPUT[I];
/* Imag position */
GLOBAL[4*l+1] := INPUT[I+1];
/* Real position */
GLOBAL[4*l+2] := INPUT[l+2];
/* Imag position */
GLOBAL[4*l+3] := INPUT[l+3];
END FOR
FIGURE 30: Pseudo code for entire FFT
(FIGURE continued next page)
79
(FIGURE Continued)
/* Bit Reversal */
/* PE 0 will do 1/4 of the bit reversals */
FOR I := 0 to N/4 1 step 1
BEGIN FOR
/* Temporary store real number */
TEMPR := GLOBAL[2l];
/* Temporary store imag number */
TEMPI := GLOBAL[2l+1];
/* Find the bit reverse of index I */
J := BITREVERSE(I);
/* Exchange data with bit reversed data */
/* Real position */
GLOBAL[2l] := GLOBAL[2J];
/* Imag position */
GLOBAL[2l+1] := GLOBAL[2J+1];
GLOBAL[2J] := TEMPR;
GLOBAL[2J+1] := TEMP;
END FOR
/* Real position */
/* Imag position */
/* Convert array from N point complex to 2N real */
/* PE 0 will do 1/4 of the conversion */
FOR I := 0 to N/4 1 step 1
BEGIN FOR
K := I
/* Real position */
GLOBAL[2K] := .5(GLOBAL[2l] + GLOBAL[2N2l] +
COS(Itt/N) (GLOBAL[2l+1] +
GLOBAL[2N2l+1]) + SIN(Itt/N) *
(GLOBAL[2N2l] GLOBAL[2l]));
/* Imag position */
GLOBAL[2K+1] := .5(GLOBAL[2l+1] GLOBAL[2N2l+1] +
COS(Itt/N) (GLOBAL[2N2l] 
GLOBAL[2l]) SIN(k/N) *
(GLOBAL[2N2K+1 ] + GLOBAL[2N+1]));
END FOR
FIGURE 30: Pseudo code for entire FFT
80
data are written to every 16^ position in the global
array. This allows two complex words of data from the
following three PEs to be written in succession, since
each PE takes up 4 locations in contiguous memory. In
other words, after FFT execution, the first PE outputs
its first butterfly pair result to the first position in
the global array. The next PE outputs its first butter
fly pair result to the next location in the global array
and so forth. After each PE accomplishes writing its
first butterfly pair, the process starts over with the
first PE writing its second butterfly pair result to the
next successive position in the global array. The sec
ond PE then writes its second butterfly pair result to
the next position in the global array and so on. After
this process, the data is then bit reversed after which,
it is converted. The function BITREVERSE is described
in Appendix G. Notice that whenever a global ram access
is attempted that the three other PEs will also be con
tending for the Global ram since the software in the
nodes is nearly symmetrical. Consequently delays will
occur here, unless of course a 4 ported storage device
is used, which would be the most desirable case. It is
also important to note that when the transformed data is
written back to global ram it can initially be written
to the correct bit reversed position, as shown in Figure
31.
81
FOR I := 0 to N/2 1 Step 4
BEGIN FOR
J := BITREVERSE (4I/2);
/* Real position */
GLOBAL[2J] := INPUT[I];
/* Imag position */
GLOBAL[2J+1] := INPUT[I+1];
J := BITREVERSE (4I/2 + 1);
/* Real position */
GLOBAL[2J] := INPUT[l+2];
/* Imag position */
GLOBAL[2J+1] := INPUT[l+3];
END FOR
FIGURE 31: Bit Reversal as data is written out
It appears not to be advantageous to calculate
the conversion process of using a N point complex FFT to
generate a 2N real FFT before the data is written out to
global ram. The algorithm to do this would be somewhat
complex and the network transfers are not symmetrical
resulting in too much time wasted in the interprocessor
communication. Also, when the amount of PEs reaches a
certain number it will no longer be necessary to use
this technique to speed up the FFT process.
Other Programs
Several programs of interest were written for
the PC which simulated sampled data by loading a partic
82
ular test pattern into an array. One of them plots the
phase response as well as the magnitude response of the
FFT. It is program PHASEPEP which is found in Appendix
F. An inverse FFT, program FINVSFFT, is also found in
Appendix F, which was derived from the FFT algorithm by
negating the sign of the sine terms in the butterfly.
This is done since j is replaced by j which can be seen
by comparing the formula of the DFT and the inverse DFT.
The DFT is
Nl
X(m) = X x(n)Wmn (46)
n=0
and the inverse DFT is
Nl
x(n) = 1/N Â£ X(m)W"mn
m=0
(47)
CHAPTER V
DISCUSSION
Using stateoftheart daughter boards with a PC
acting has host, a multiprocessor spectrum analyzer was
built. The same PC that was used as the spectrum ana
lyzer was also used to develop software for it. This
sort of environment is quite useful for experimentation
and benchmarking of different sorts of algorithms and
multiprocessing arrangements.
The Fast Fourier Transform algorithm is at the
heart and the most important process of the spectrum an
alyzer. Consequently some of the pitfalls of using a
digital system for such a process was studied. Some of
the major problems are leakage and the picketfence ef
fect. These two problems are closely related to each
other and can be explained by a convolution in the fre
quency domain. Several window functions are suggested
to improve the amplitude accuracy of the spectrum. It
was found that an N point complex FFT could be manipu
lated to execute a 2N real point FFT due to certain sym
metries, cutting the execution time of the FFT down
roughly in half.
84
Whereas the DSP coprocessor achieved some im
pressive results, the GSP lead to some very disappoint
ing results, more than likely due to its onboard oper
ating system. Also, the DSP was programmed directly in
assembly language whereas the GSP wasn't. The GSP's
operating system was very strong in the setting up of
multiple windows and bitmaps, but consequently the speed
of plotting data suffered.
If a "real time" spectrum analyzer, where no
data loss, is required, the spectrums can be buffered up
in the PC main ram and removed for display as the GSP is
ready for them. Consequently they would not be dis
played in real time but the spectrums themselves would
represent a continuous record of events that occurred in
the past, up until the PC main ram fills up. Such a
system was simulated using NETWORK II.5 to investigate
potential bottle necks and to gain insight into such a
system. The DMAC on the PC mother board transfers data
from the multiplexed A to D to four DSP coprocessors.
The PC is then required to transfer the results from the
DSP coprocessors to main ram. If the PC bus does become
a bottleneck then the pipe between the DSP board and the
PC would overflow and this would show up in the simula
tion as a DSP memory overflow. This event does not
occur when the sampling rate is held below 10 KHz,
85
instead the PC main memory eventually overflows, due to
the GSP processing bottleneck.
Because of the importance of the FFT algorithm
in this system, the parallelism of this algorithm was
investigated. It was found that an N point FFT could be
theoretically speeded up by a factor of N/2, which is
quite significant. This speed up assumes no overhead in
interprocessor communication. The architecture of a
loosely coupled machine was studied using four DSPs.
Two interprocessor networks were studied, the Perfect
Shuffle and the N Cube. The code for one of the Nodes
of the N Cube was written in order to investigate the
speed up. It was found that the speed up was nearly
linear and directly proportional to the amount of PEs,
resulting in nearly 100% efficiency, of the FFT algo
rithm. This speed up calculation assumes that the input
data is initially contained in the PEs, and the calcula
tion does not incorporate the bit reversal or any other
functions other than the FFT.
An improved architecture could be devised that
would use dual port ram between the host and the copro
cessors. DMA would be used for all data transfers to
minimize the transfer time and host CPU impact. DSP
boards could be built with multiple processors utilizing
an interprocessor network such as the N Cube to take
advantage of the inherent parallelism of the FFT algo
86
rithm. A GSP which is specifically designed for fast
displays without the overhead of an operating system
could be used.
The TMS34010 GSP used in the Number Nine copro
cessor board utilizes the inner loop of Bresenham's
linedrawing algorithm[18]. This algorithm is useful
when drawing diagonal lines, but strictly horizontal and
vertical lines were drawn in this implementation. Con
sequently, in a GSP designed specifically for this ap
plication, there should be a utility that allows verti
cal and horizontal lines to be drawn very quickly.
In the ideal system, the processing of tasks
would be evenly distributed over the Processing Ele
ments. Consequently, it would be useful to implement
some sort of a SIMD machine for the GSP. The GSP could
then perform a running RMS average of the spectrums for
example. A running average algorithm is easily paral
lelized. If the video ram is designed with multiple
ports then the line drawing algorithm can also be
parallelized.
BIBLIOGRAPHY
[ 1] E. Oran Brigham, The Fast Fourier Transform.
Englewood Cliffs: PrenticeHall,Inc., 1974.
[2] J. Virgil Hornstein, "Parallel Processing Attacks
RealTime World", MiniMicro Systems. DEC 1986,
pp.6577.
[ 3] Brian W.Kernighan, Dennis M.Ritchie, The C
Programming Language. Englewood Cliffs: Prentice
Hall, Inc., 1978.
[ 4] William D. Stanley, Gary R. Dougherty, Ray
Dougherty, Digital Signal Processing. Reston:
Reston Publishing Company, Inc., 1984.
[ 5] Kai Hwang, Faye A. Briggs, Computer Architecture
and Parallel Processing. New York: McGrawHill
Book Company, 1984.
[ 6] ,"The Fundamentals of Signal Analysis", Hewlett
Packard. Application Note 243. February 1985.
[ 7] G. D. Bergland, "A guided tour of the fast Fourier
transform", IEEE Spectrum. JUL 1989, pp.4152.
[8] R.B. Blackman, J.W. Tukey, The Measurement of
Power Spectra From the Point of View of
Communications Engineering. New York: 1958.
[ 9] Leon W. Couch, Digital and Analog Communication
Systems. New York: Macmillan Publishing Co., Inc.,
1983.
[10] Peter D. Welch, "A FixedPoint Fast Fourier
Transform Error Analysis, IEEE Trans. Audio
Electroacoust.. vol. AU17, JUN 1969, pp.151157
[11] Richard L. Burden and J. Douglas Faires, Numerical
Analysis. Boston: Prindle, Weber and Schmidt,
1985.
[12] C.S. Burrus, T.W. Parks, DFT/FFT and Convolution
Algorithms. Theory and Implementation. New York:
John Wiley and Sons, 1985.
88
[13] R.E. Crochiere, L.R. Rabiner, Multirate Digital
Signal Processing. Englewood Cliffs, Prentice
Hall, Inc., 1983.
[14] Aram Budak, Passive and Active Network Analysis
and Synthesis. Boston: Houghton Mifflin Company,
1974.
[15] ,"Dash16/16F Manual", MetraByte Corporation,
Taunton Mass.
[16] ,"TMS320C25 User's Guide", Texas Instruments,
Dallas Texas, 1986.
[17] ,"Digital Signal Processing Applications",
Texas Instruments, Dallas Texas,1986.
[18] ,"TMS34010 User's Guide", Texas Instruments,
Dallas Texas, 1986.
[19] ,"NNIOS Programmer's Guide Version 3.04",
Number Nine Computer Corporation, Cambridge MA,
1987.
[20] ,"PEPPER SGT User's Guide", Number Nine
Computer Corporation, Cambridge MA, 1987
[21] ,"Network II.5", CACI, Los Angeles CA, 1985.
[22] Harold S. Stone, "Parallel Processing with the
Perfect Shuffle", IEEE Transactions on Computers.
Vol. C20, FEB 1971, pp.153161.
APPENDIX A
CODE EXECUTED ON THE AT USED FOR THE SPECTRUM ANALYZER
/*
Program spectrum uses allocated memory for dma data, uses SWDS
TMS320C25 DSP daughter board for fft execution. The fft uses a N point fft to
calculate a 2N point FFT CoolyTukey Radix 2, DIF FFT Program. Spectrum analyzer
uses metrabyte aquisition board, Texas Instruments SWDS board, and Number Nine
SGT graphics board. Asks for desired window function. This source code is
compiled with Microsoft QuickC version 1 using the medium model, single segment
for data and multiple segments for program code. The magnitude of the fftd data is
calculated on the TMS320C25. The data is adjusted for the display on the DSP. The
GSP draws xy axis and displays spectrum.
*/
#inc!ude
#include
#include
#include
#include "nnios.h
#include "nnioslib.h"
typedef struct
{
unsigned char page;
unsigned char upper;
unsigned char lower;
JBUFFER;
/* Required for all NNIOS programs */
/* Language binding header file */
/* Page for dma data */
/* Upper byte for dma */
/* Lower byte for dma */
/* Data is inputted for fft calculation on the SWDS here */
#define DATA_BUFFER
OxDOOOAOOO;
/* Window function loaded on SWDS at this location */
#define WINDOW_FUNCTION OxDOOOCOOO;
/* Write to this location for SWDS hardware control,
read for hardware status information */
#define C25_CONTROL OxDOOOEOOO;
90
#define CLRBRK_RESET 0x32;
#define HOLD_RESET 0x3A;
#define RUN OxFA;
#define C25_COMMAND 0xD0009000;
#define RUN_FFT 0x01;
#define STOP 0x00;
DEV device;
BITMAP bitmap, bitmapx, bitmapy;
WINDOW window, windowx, windowy;
DEVCON dev;
DEVPREF pref;
HANDLE font;
POINT strSize;
BYTE message [64];
COLOR palette[13];
double nx[2050];
double ny[2050];
double wr[300];
/* Clear breakpoint and hold */
/* Hold, internal clock,
memory control = 01; on SWDS */
/* Internal clock, mem control = 01 */
/* SWDS software command location */
/* Software command run fft */
/* Software command stop */
/* Info about the current device */
/* Data for the apps bitmaps */
/* And for the apps windows */
/* Current device configuration */
/* Number of NNIOS devices present */
/* Handle of loaded text font */
/* Size of a text string in pixels */
/* Buffer for text message */
/* Array to store palette values */
/* new real output array */
/* new imag output array */
/* magnitude */
/* pointers to sampled array */
unsigned short *indataO, allocbufO;
short far *window_function;
unsigned short far *c25_command;
short far *fft_data;
unsigned char far *c25_control;
BUFFER bufferO; /* Buffers for dma data */
main 0
{
WORD fft;
int n, m, I, i, n 1, j, k, n2, ia,t, ie, displ,xx,yy, cc, loop,
spectrum,window_fnc;
double c,s,p1aIxt,yt,outIpi1d,b1maximum;
time_t start, finish;
/* load sampled data */
indataO = allocbuf(&bufferO); /* allocate memory */
windowjunction = WINDOW_FUNCTION;
fft_data = DATA_BUFFER;
c25_control = C25CONTROL;
c25_command = C25_COMMAND;
n = 256; /* length of FFT */
m = 8; /* dim of FFT */
pi = 3.141592654;
/* Down load fft code to SWDS */
downld ("fftdiv. msbyfftdiv. Isb);
printf (D\n\nSelect window function");
printf ("\n\n1. Rectangular (default)");
printf (\n2. Triangular");
printf (\n3. Hanning");
printf ("\n4. Hamming");
printf ("\n5. Exit Program\n\n);
window_fnc = get_intO;
if (window_fnc = = 5)
exit(0); /* exit program here */
*c25_control = CLRBRK_RESET;
*c25_control = HOLD_RESET;
c25holdQ;
/* Clear breakpoint */
/* Hold C25 */
/* Wait for C25 to hold */
printf ("\nPlease stand by, calculating window
function for downloading to SWDS and for
Display An");
92
printf ("\nPress space bar after observing window
functional);
switch (window_f nc)
{
case 1:
rectangular(window_f unction);
sprintf (message,"Rectangular window
function. Press space bar.);
break;
case 2:
triangular(window_function);
sprintf (message,Triangular window
function.
Press space bar.);
break;
case 3:
hanning(window_function);
sprintf (message,"Hanning window
function. Press space bar.);
break;
case 4:
hamming(window_function);
sprintf (message,"Hamming window
function. Press space bar.");
break;
default:
rectangular(window_function);
sprintf (message,"Rectangular window
function. Press space bar.);
break;
>
/*
Check to see that the NNIOS device driver is installed. If the driver is present,
CheckNNIOSO initializes the language binding global function pointer, "Global, to the
address returned by the driver and CheckNNIOSO returns FALSE. If the driver has
not been installed, no NNIOS applications can run, the value of CheckNNIOSO will be
TRUE, and the application should print an error message and terminate cleanly.
*/
if (CheckNNIOS 0) /* Try to initialize NNIOS */
{
printf (This application requires the NNIOS
device driver.\n\r");
exit (1);
}
93
/*
NNIOS is present. Now determine the number of the preferred device, the preferred
application window size, and the preferred bitmap size. Although the preferred device
number will typically be 1, multiple Pepper boards can be installed. This program will
simply run on the device specified by the preference. The preference information can
be changed by using the binding function SetDevPreferenceO Alternately, an
application program can examine all installed boards and determine the most
appropriate one for its use. Because the NNIOS device driver can be installed even
when no PEPPER boards are present, check the preferred device number for a value
>= 0. Alternatively, use the binding function DeviceCountO to determine the number
of boards in the system.
*/
GetDevPreference(&pref);
if (pref.devNumber == 0)
{
printf (No Pepper Graphics System products are
installed.\n\r);
exit (1);
}
/*
At least one device is installed, so get info about the preferred device. The data
returned in the pref structure describes the preferred Pepper product installed and
the segment address it occupies.
Devicelnfo (&device, pref.devNumber);
/*
Now use the device description data (in the structure "device") to open a
communication channel to the board. The NNIOS language binding services one of
four possible channels at a given time and defaults to the one selected with the
binding function OpenCommChannelO This function initializes the language binding
variable "NNIOS" to point to the local function handler for the opened Communication
Channel.
*/
OpenCommChannel (&device, "NNIOS Program");
BitCreate (&bitmap, pref.window.extents.x,
pref.window.extents.y,
pref.depth, FALSE, 0, Template");
BitCreate (&bitmapx, 1024, 60, pref. depth, FALSE,
0, "XAXIS");
BitCreate (&bitmapy, 128, 480, pref.depth, FALSE,
0, "YAXIS");
BMClear (bitmap.handle, (CINDEX) 0);
