A NEURAL FILTER FOR NOISE REDUCTION
IN WAVEFORM AND IMAGE DATA
by
Charles R. Barnes
B.S., Oklahoma State University, 1978
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
1995
This thesis for the Master of Science
degree by
Charles R. Barnes
has been approved
by
Jay Rothman
Date
Barnes, Charles R. (M.S. Computer Science)
A Neural Filter for Noise Reduction in Waveform and
Image Data
Thesis directed by Associate Professor William J. Wolfe
ABSTRACT
Signal processing applications have become more
important as data transfer rates and image resolution
have continued to increase. However, this increased
throughput or resolution cannot be realized if the
received signal is noisy or corrupted. In this paper, I
present a feedforward neural network design for
filtering noise from waveform and image data. To
provide suitable background information, a brief
description of basic concepts is provided for Median and
FIR style filters. The design concepts and enhancements
to neural networks are then discussed followed by the
design of a feedforward multilayer neural filter based
on a modified backpropagation algorithm. The remainder
of the paper is a presentation of research results on
two forms of additive noise corrupted data signals,
waveform and block image data. Neural filter results
are then compared against those achieved by applying the
Median and FIR filters on the same noisy data sets. In
conclusion I summarize the test results by comparing the
neural filter design to that of more conventional
methods and describe areas of research which should be
explored in greater depth.
This abstract accurately represents the content of the
candidate's thesis. I recommend its publication.
CONTENTS
Chapter
1. Introduction ..................................... 1
1.1 Description of the Filtering Problem ............ 5
1.2 Conventional Algorithms vs. Neural Networks ... 8
1.3 Parallel Processing ............................ 10
1.4 Notation ....................................... 17
2. Image Noise ..................................... 18
3. Image Enhancement Methods ....................... 23
3.1 Conventional Image Enhancement Techniques ...... 24
3.1.1 Linear Image Filtering ...................... 25
3.1.2 Nonlinear Image Filtering ................... 29
3.2 Neural Network Image Enhancement Technique .... 35
4. FIR Filter Basics .............................. 38
4.1 Description ................................... 38
4.2 Digital Filter Characteristics ................ 41
4.3 Filtering Effects on Image Data ................ 44
5. Neural Network Basics ........................... 47
5.1 Basic Processing Element ....................... 47
5.2 Activation Functions ........................... 51
5.3 Error Surfaces ................................. 52
5.4 Interconnectivity Architecture ................. 56
5.5 Training Methodology ........................... 57
5.6 Backpropagation Algorithm ..................... 59
5.7 Modifications to the Basic
Backpropagation Algorithm ..................... 65
5.8 Effect of Hidden Layers ....................... 71
5.9 Effects of Network Overtraining ................ 75
6. Neural Filter Design ........................... 77
6.1 Description .................................... 77
6.2 Why Implement a Filter in Neural
Networks ...................................... 80
6.3 Neural Filter Architecture .................... 82
7. Test Data Generation .......................... 88
7.1 Waveform Data Signals ......................... 89
7.2 Image Data Signals ............................ 92
7.3 FIR Filter Coefficients Generation ............ 94
8. Test Results .................................. 96
8.1 Waveform Test Results ......................... 96
8.1.1 FIR Filter versus Neural Filter Test ........ 125
8.2 Image Data Test Results ....................... 128
8.2.1 Full Image Neural Filter Results ............ 152
8.2.2 Subset Block Image Test ..................... 152
8.3 Test Results Summary .......................... 165
9. Further Research Work ......................... 170
Appendix
A. Program Listings ................................ 173
References ......................................... 303
FIGURES
Figure
1.1 Human Cell Connectivity ....................... 12
1.2 Neural Network Classes ........................ 14
2.1 Spike Induced Noise ........................... 19
2.2 Gaussian Noise Effect on Image ................ 21
3.1 Block Diagram of Linear System ................ 26
3.2 Block Diagram of Linear Filter ................ 29
3.3 Mean Filter Averaging ......................... 29
3.4 Median Filter ................................. 33
4.1 FIR Filter Architecture ....................... 44
5.1 Standard Network Node Connection .............. 48
5.2 Common Activation Functions ................... 50
5.3 Linear Function Error Surface ................. 55
5.4 Nonlinear Function Error Surface .............. 55
5.5 Multilayer Network Structure .................. 56
5.6 Backpropagation Algorithm Diagram ............. 60
5.7 Local versus Global Minima .................... 66
5.8 Hidden Layer Solution Bounds .................. 72
5.9 Effect of Too Many Hidden Layer Nodes ......... 74
5.10 Number of Nodes versus Generalization ......... 75
5.11 Effect of Network Overtraining ................ 76
6.1 FIR Filter Structure .......................... 78
6.2 Neural Filter Structure ....................... 80
6.3 Effect of K on Function Slope ................. 85
6.4 Neural Filter Network Configuration ......... 87
8.1 2 Signals with Spike Noise ( 0 & 5M ) 105
8.2 2 Signals with Spike Noise ( 15M & 15F ) .... 106
8.3 2 Signals with Spike Noise ( 15N ) 107
8.4 2 Signals with Gaussian Noise ( O & D ) 108
8.5 2 Signals with Gaussian Noise (15M & 15F) ... 109
8.6 2 Signals with Gaussian Noise ( 15N ) 110
8.7 Signal with Uniform Noise ( 0 & 15M ) 111
8.8 Signal with Uniform Noise ( 15F & 15N ) 112
8.9 Signal with Spike Noise ( 0 & D ) 113
8.10 Signal with Spike Noise ( 15M & 15F ) 114
8.11 Signal with Spike Noise ( 15N ) 115
8.12 Signal with Gaussian Noise ( 0 & D ) 116
8.13 Signal with Gaussian Noise ( 15M & 21F ) .... 117
8.14 Signal with Gaussian Noise ( 21N ) ......... 118
8.15 2 Signals Jamming Effect (O&D) ....... 119
8.16 2 Signals  Jamming Effect ( 15M & 15F ) .... 120
8.17 2 Signals  Jamming Effect ( 15N ) ......... 121
8.18 2 Signals Low Freq Extraction (O&D) ... 122
8.19 2 Signals Low Freq Extraction (15M & 15F).. 123
8.20 2 Signals Low Freq Extraction ( 15N ) .... 124
8.21 2 Layer Network ............................... 126
8.22 Image with Spike Noise ........................ 138
8.23 Image with Spike Noise ( S & DP ) .......... 139
8.24 Image with Spike Noise ( 5M & DP ) 140
8.25 Image with Spike Noise ( 15M & DP ) 141
8.26 Image with Spike Noise ( N & DP ) 142
8.27 Image with Spike Noise  Row Graph (0 & N) .. 143
8.28 Image with Spike Noise  Row Graph ( M ) .... 144
8.29 Image with Spike Noise  Row Graph ( N ) .... 145
8.30 Image with Gaussian Noise ( G & DP ) 146
8.31 Image with Gaussian Noise ( 5M & DP ) ...... 147
8.32 Image with Gaussian Noise ( N & DP ) 148
8.33 Image with Gaussian Noise  Row Graph (O&G).. 149
8.34 Image with Gaussian Noise  Row Graph (M&N).. 150
8.35 Image with Spike Noise FIR Filter ........... 151
8.36 Block Image with Spike Noise ( O & S ) ..... 157
8.37 Block Image with Spike Noise ( M & N ) .... 158
8.38 Block Image with Gaussian Noise (O&G) ... 159
8.39 Block Image with Gaussian Noise ( M & N ) ... 160
8.40 Block Image with Spike Noise  Cross (O&S)... 161
8.41 Block Image with Spike Noise  Cross (M&N)... 162
8.42 Block Image w/Gaussian Noise  Cross (O&G)... 163
8.43 Block Image w/Gaussian Noise  Cross (M&N)... .164
TABLES
Table
3.1 Mean/Median Filtered Data Sequence ............ 34
3.2 Mean/Median Filtered Slight Shading
Data Sequence ................................. 35
8.1 Test Results for Waveform Data ............... 100
8.2 Neural Filter Convergence Iterations ......... 102
8.3 Filter Performance Comparison Waveform .... 104
8.4 FIR/Neural Filter Coefficients ............... 127
8.5 Neural Filter Training Parameters
for Image Data ............................... 131
8.6 Median/Neural Filter Test/Training Results .. 133
8.7 FIR Filter Test/Training Results ............. 135
8.8 Filter Performance Comparison Image ........ 136
8.9 Neural Filter Training Parameters
for Block Image .............................. 154
8.10 Block Image Filter Results ................... 155
8.11 Filter Performance Comparison Block
Data Images .................................. 156
1. Introduction
The areas of signal processing and video image
reconstruction have seen dramatic growth in the past few
decades. Expanding availability and usage of newer
computerized multimedia and higher resolution video by
the general population places new demands on the current
industry. On the technical side, faster data transfer
protocols and CPU speeds have demanded that signal
processing tasks be completed in an ever-decreasing time
slice window. Both demands force the industry to develop
more powerful tools and new algorithmic procedures to
tackle increasingly difficult problems with the same
relative amount of time and effort.
As a short term answer to that demand, the past two
decades have seen marked increases in computing power
including increased memory capacity and processing speed
in conventional von Neumann architecture computers.
Increased computing power has allowed technology to
reduce system physical size while simultaneously
providing increased capability and data throughput rate.
However, as computational power increased, data
throughput bottlenecks were revealed as a basic problem
of conventional serial-pipeline von Neumann computers.
Data input rates have been increased, but
processing in many cases still proceeds in a serial
fashion, handing intermediary results from one
computation off to another in the processing chain. In
today's complex systems, however, there are tasks which
do not depend upon receiving data from another process.
These tasks are nevertheless forced to suffer processing
delays until they are allowed resource time in a serial
process.
Image processing and noise filtering are areas of
signal processing which require a large number of
computations to be performed in a "timely manner". High
resolution imagery tends to compound the problem. With
an increased number of pixel elements per image, the
processing time increases accordingly and requires a new
architecture which will process the increased data load
in a shorter timeframe. In answer to demands for
increasing effective throughput, approaches such as
massively parallel computation are continuing to be
researched and implemented in the real world.
Parallelization has been approached by two different
methodologies:
Multitasking: multiple processes are run on a single
computer (time-sharing a single resource), making it
appear that more than one process is running
simultaneously.

Parallel hardware: more than one processor is utilized,
with each processor simultaneously running its own
program procedure. If data sharing is necessary, data is
piped from one processor to another.
Multitasking finds itself hampered by the same von
Neumann bottleneck regarding processing capability.
While several tasks may appear to the user to be running
in parallel, there is still a single resource providing
the main computational horsepower. To process multiple
task programs, the resource slices the total CPU
processing time and allocates a time slice to each
running task. Since time slicing does not provide a true
parallel computation path, computational time is still
sequential and provides no speedup.
The use of parallel hardware realizes the true
potential of parallel processing by providing a parallel
computational path for each task and is the method
currently of interest in the artificial intelligence
community. By providing truly parallel computational
paths, processing time is no longer dictated by serial
execution but instead by the number of parallel processing elements
installed in the system. Parallel computations allow
more complex problems to be solved without having to
abide by linear time calculations.
Parallel computation research led many researchers
to begin drawing correlations between the processing
capabilities of the human system and creation of
intelligent machines which could "think" or adapt to
their environments. The human standing at a street
corner stop light is immersed in the demand for parallel
signal processing. The senses are flooded from multiple
simultaneous stimuli while the system may be forced to
consider thousands of decisions within a short timeframe.
Is the light red or green? Are there obstacles in the
way which must be maneuvered around in order to cross the
street? Is traffic coming which may or may not stop? In
addition to these case dependent conditions, additional
stimuli from our five senses and subconscious actions
related to pumping blood, walking, and breathing air into
our lungs are also being processed. Each of these
actions represents a "hard" problem in today's data
processing environment but are performed simultaneously
by the human system. An obvious conclusion is that all
of these decisions cannot be performed by a serialized
first come, first served computation process like a
standard computer program.
Processing of multiple stimuli can be regarded as a
data filtering problem whereby objects of interest are
extracted from the environment while interfering stimuli
can be removed or suppressed. As described above,
processing of this data becomes cumbersome when
performed in a serial fashion but does not suffer the
same limitation when processed in some parallel fashion.
Parallel noise filtering can thus be performed on more
complex data sources without the normal increased
processing time.
1.1 Description of the Filtering Problem
The problem addressed in this paper is the design of
a feedforward neural network used as a one-dimensional
neural filter for providing enhancement and noise
filtering of image and waveform data. The purpose of
this work is to show that a parallel neural network
structure can perform the same filtering processes
currently performed by serial systems. The benefit is
that if a neural implementation can be proven effective,
its implementation in a parallel hardware architecture
will result in increased performance and less processing
time than current systems. While application of neural
networks to image processing is not a new concept, papers
detailing the actual design and results associated with
noise filtering are rare. Noise filtering has typically
been associated with pattern recognition problems, in
which a network has been trained
through autoassociation to recreate an image which is
incomplete or distorted with noise. Since the network
has effectively learned a codebook of results, a noisy
image can be filtered to return a close facsimile of the
original. In contrast, this paper attempts to show how a
network can be created whose purpose is extraction of an
underlying signal from a noisy environment. This noisy
environment may or may not be a noise source which was
encountered during the network training phase. In
addition, the test signal will be a variation of the
training waveform signal or a different type of image.
Test results will be shown to compare how such a neural
filter performs in comparison to conventional linear or
nonlinear filtering methodologies. Two forms of data
will be used for comparison, waveform and image data.
Waveform data is similar to that received from a serial
data source (noise generator or serial pattern generator)
and rarely will have identical values for adjacent
elements. Image data is similar to waveform data in that
each row represents a serial data stream. The difference
is that image data generally has higher pixel to pixel
correlation since adjacent points rarely have a large
variance unless at an edge in the image. The difficulty
is that image data can have an almost infinite number of
combinations which depend on the subject, lighting
conditions, and type of introduced noise. This can make
reconstruction and noise filtering of image data a much
more difficult task. To test the performance of
filtering methods, each data source is corrupted with
simulated sensor noise and used as the input data
sequence to a Median, FIR, and neural filter. Test
results will be compared to show the overall performance
of the neural filter versus conventional filtering
techniques.
The following chapters will provide the necessary
background information on the types of noise encountered
in image data, conventional filtering methods, and the
basics of FIR filters, Median filters and neural
networks. A neural network is created which performs a
function similar to that of a FIR filter but with a
nonlinear transfer function. Several variations of these
neural filters are described along with performance
results to show the benefits of a neural filter for
enhancement and noise removal in corrupted image data.
1.2 Conventional Algorithms versus Neural Networks
Neural networks have been described as a parallel
latticework of simple processing elements possessing no
predetermined programming model or series of solution
steps. This differs from a conventional computing
science algorithm which has a formal programming
structure in place and a series of sequential solution
steps.
Conventional algorithms suffer in the face of
uncertainty. What is an algorithm to do if it receives
unpredicted, incorrect or incomplete information? For
complex problems, it is difficult to anticipate every
case condition which may occur during operation for each
algorithm variable. A parameter set state which has not
been anticipated can cause a normally stable algorithm to
converge to a wrong solution or may cause an infinite
search loop condition. In general, the more complex the
problem, the harder it is to account for every possible
scenario.
Neural networks have an advantage in that they
possess no predetermined programming model or solution.
During the learning/training phase, the weighted
connections between layers are constantly updated to
create the solution weight matrix. The inherent
distributed knowledge architecture allows a trained
network to create a generalized solution matrix where no
individual node element is of greater importance than any
other node. This generalized solution allows the network to
continue functioning in a degraded mode (equivalent to
processing fuzzy or incomplete information) which may
suffer slightly from the loss of some knowledge but will
not prevent a viable solution from being achieved.
While neural networks are ideal vehicles for solving
certain types of problems, it requires mention that
neural networks are not applicable for all problems.
Problem solutions which require a high level of accuracy
should use more conventional algorithmic solutions.
1.3 Parallel Processing
Parallel computation is one arena addressed under
the broad title of Parallel Distributed Processing (PDP)
described by Rumelhart and McClelland [1] in their 1986
publications. While true parallel computational hardware
is currently available, the area of PDP to be discussed
in this paper is neural networks. Specifically, the
topic involves the use of Artificial Neural Networks (ANN)
to simulate parallel processing capabilities related to
image data enhancement. The basic premise behind PDP is
that information transfer occurs via interaction of a
large number of simple processing elements or "nodes",
each of which receives and sends inhibitory or excitatory
signals to other nodes. This parallel interaction
between multiple simple elements is the basis of neural
network processing. A neural network can be visualized
as a latticework connection between multiple simple
processing elements. In their simplest form, each node
receives input from one element and outputs the result to
another element; each acting as a link in the lattice.
Each unit can only retain information about a single
value (in some cases this is related to binary switching
theory, an element has a value of 1 or 0; on or off) and
interconnections between elements are provided by
weighted arcs. A further description of this
connectivity will be addressed in Chapter 5. This simple
methodology (single nodes connected by weighted arcs)
allows knowledge or information to be transferred between
network elements while maintaining a distributed
knowledge representation. Distributed knowledge is both
a strength and weakness of neural networks. By
distributing the data, slight aberrations in the network
do not prevent it from performing its job; however, if a
network is not performing as expected, its solution
cannot be examined from a theoretical or programmatic
standpoint to identify the problem area.
This lack of the ability to break apart a neural network
and identify how it correlates data or how it "learns"
its weighting matrix leads to the mysticism and "black
magic" reputation of neural networks.
What working model can be used to show the
advantages of massively parallel processing? The common
comparison model for describing a working complex
parallel interconnection of simple processing elements
and development of intelligent thought or "thinking
machines" is the human brain. When biologists attempt to
calculate the total number of cells and associated
synaptic connectivity between cells in the human brain,
the numbers become staggering in magnitude. Scientists
estimate that the number of neuron cells in a human brain
can be on the order of 10^10, with neuron cell fanout
interconnection levels approaching 10^4 [2] (Figure 1.1).
By comparison, today's complex neural network
applications may consist of less than 50 node elements
and less than 500 interconnections. No attempt is made
to imply that artificial neural networks are attempting
to duplicate the complexity found in the human system.
The information to be gained from such a complex system
is instead how to draw knowledge from parallelization of
simple elements to form a more complex structure.

Figure 1.1 Human Cell Connectivity

By observing the connectivity and signal processing
capabilities of the human brain, research points to the
use of parallel system architectures as a viable solution
to difficult or "hard" real world problems.
Neural networks are divided into two general
classes, recurrent networks and feedforward networks as
shown in Figure 1.2.
Recurrent networks are a self contained matrix of
elements which reach a solution or convergence by
relaxing to a state representing a stable condition.
Since there is no specific input to output data flow, the
network node inhibition and excitation connections are
used to represent the problem. Figure 1.2 shows that a
recurrent network possesses no directed network flow.
Instead, recurrent networks can receive activation from
any direction allowing them to contain feedback or closed
loops within the network. Any node may simultaneously
have an effect on any other node within the network due
to a feedback connection weight arc. The strength of
recurrent networks is drawn from this full-connectivity
architecture: no single element is separate, and the
network must converge as a single entity. Because of their feedback
capability, stability and control of network oscillations
are problems which must be addressed with a recurrent
network.
Figure 1.2 Neural Network Classes
Feedforward or "associative" networks are designed
to receive a set of inputs and propagate this data
forward through the network from layer to layer. The
result or solution appears at a set of output nodes on
the other side of the network. Conner [3, p.140] defines
feedforward networks as: "The term 'feedforward' states
that in the network architecture no processing element
output can be an input for a processing element on the
same layer or a preceding layer". In conventional
architectures, feedback from downstream layer nodes or
nodes on the same layer are not allowed which separates
them from recurrent networks. A feedforward network is
trained by comparing the output result obtained from the
network with an expected or correct solution. The
difference between the actual and expected output is then
fed back or "backpropagated" through the network from the
output back to the input elements while adjusting the
interconnection weights between elements as it proceeds.
These adjustments are done to modify the current state of
the network matrix to improve its probability of success
on the next input. This propagation of error back
through the network causes the network to learn a
solution to the problem it is trying to solve. A
feedforward network starts its learning process knowing
nothing about the solution other than the training set
data. The learning process relies on data training pairs
X and Y; given input matrix X, the desired output matrix
is to be Y. When the network reaches a solution (or
convergence) the resulting interconnection matrices
represent a solution set to the problem. In this way,
the learning process embeds the problem solution in the
architectural fabric of the network. This learning
process is why neural networks are useful in performing
data transformation processing. Neural networks are not
programmed with an explicit algorithm to solve a
problem; instead, they must learn to solve the problem
from representative sets of training data. One
network architecture can perform different forms of
processing based upon the data on which it has been
trained.
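The training procedure described above can be sketched in miniature. The toy network below (layer sizes, learning rate, and the single training pair are illustrative choices, not values taken from this thesis) propagates an input forward through one hidden layer, then backpropagates the output error to adjust the interconnection weights:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy network: 2 inputs -> 2 hidden nodes -> 1 output node.
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input-to-hidden weights
W2 = [random.uniform(-1, 1) for _ in range(2)]                      # hidden-to-output weights
eta = 0.5                                                           # learning rate

def forward(x):
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    y = sigmoid(sum(W2[j] * h[j] for j in range(2)))
    return h, y

def train_step(x, target):
    h, y = forward(x)
    # Output-layer error term (delta) for the sigmoid activation.
    dy = (target - y) * y * (1.0 - y)
    # Propagate the error back to the hidden layer before updating weights.
    dh = [dy * W2[j] * h[j] * (1.0 - h[j]) for j in range(2)]
    for j in range(2):
        W2[j] += eta * dy * h[j]
        for i in range(2):
            W1[j][i] += eta * dh[j] * x[i]
    return (target - y) ** 2

# Train on one input/output pair and watch the squared error shrink.
errs = []
for _ in range(200):
    errs.append(train_step([1.0, 0.0], 1.0))
print(errs[0], errs[-1])  # the error decreases as the weights converge
```

This illustrates the key idea in the text: the "solution" ends up embedded entirely in the weight matrices W1 and W2, with no explicit program describing it.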
This paper will examine the use of neural networks
as a means of achieving noise filtering and providing
image reconstruction to video and waveform data. Since
the algorithmic matrices needed to provide
such filtering are not known in advance, feedforward
neural networks will be used to solve this problem. As
data is fed into the network, an input/output training
set will be used to force the system to "learn" a
solution given no preconceived input knowledge.
1.4 Notation
The following standard conventions will be used in
this paper:
Vectors: Vectors will be represented by bold lowercase
letters (e.g., a or c). Elements of a vector will be
shown as plain-type subscripts on the vector name
(e.g., a_i is the ith element of a).

Matrices: Matrices will be represented by bold uppercase
letters (e.g., X or Y). A pair of plain-type subscripts
on a matrix name identifies an element of that matrix
(e.g., X_ij is the element in the ith row and jth
column of matrix X).

Scalars: Scalars will be represented by either upper- or
lowercase letters in plain type.
2. Image Noise
Image data is very rarely received at its final
destination in an ideal noise free form. Each time a
sensor or transmission medium is used, the signal is
susceptible to the introduction of noise. Even the
particular type of processing algorithm used for image
recovery can be responsible for introducing distortion
into the perceived image. The following description will
briefly introduce the three most common forms of
introduced noise visible in a data image; impulse,
gaussian, and narrowband jamming.
Impulse noise is commonly referred to as "salt and
pepper" or "sparkle" noise. Atmospheric conditions such
as electrical storms or line transients are frequently
responsible for briefly disturbing the state level of a
signal. Because of this transient behavior, the
disturbance is typically randomly distributed and may
affect only one or two adjacent pixel elements. The
visible impact on an actual data image is a change in a
given pixel's intensity level in apparent random
locations throughout the processing line or image.
Figure 2.1 shows a representative image line segment
illustrating the effect of impulse noise. The gray scale
surface should be of near uniform intensity but impulse
noise has introduced pixel intensity changes which may be
of any valid value in the representative gray scale range
(0 to 255 for an 8-bit scale code). Because of their
random occurrence and variable intensity levels, a
conventional linear filtering technique such as mean
filtering ( section 3.1.1 ) is not effective in removing
impulse noise effects.
122 124 124 117 120 37 121 124 123 246 122 121
Spike Noise Effects
Figure 2.1 Spike Induced Noise
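As an illustrative sketch, the data sequence of Figure 2.1 can be run through simple 3-point mean and median filters (the window size is an assumed choice for illustration). The mean filter smears the spike values 37 and 246 into their neighbors, while the median filter rejects them outright:

```python
import statistics

# The near-uniform gray-scale row from Figure 2.1, with spike values 37 and 246.
row = [122, 124, 124, 117, 120, 37, 121, 124, 123, 246, 122, 121]

def window(data, i, size=3):
    half = size // 2
    return data[max(0, i - half): i + half + 1]

# 3-point mean filter: each spike is averaged into the surrounding pixels.
mean_out = [round(statistics.mean(window(row, i))) for i in range(len(row))]

# 3-point median filter: an isolated spike is never the middle-ranked value,
# so it is simply discarded.
median_out = [statistics.median(window(row, i)) for i in range(len(row))]

print(mean_out)
print(median_out)
```

This is why Section 3.1.1's mean filtering is described as ineffective against impulse noise: the corruption is reduced in amplitude but spread to previously clean pixels, whereas the median output restores the near-uniform surface.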
Gaussian noise is a frequent result of sensor
electronics or electronic equipment responsible for the
reception/transmission of data. Wideband thermal noise
is the most common type of noise inherent in electronic
components; it exists due to the Brownian motion of electrons
in a conductor and is proportional to the temperature
acting upon the circuitry [4]. A second term to describe
the effect is "white noise" because it has been shown
theoretically and experimentally that the noise has a
uniform distribution from low up to high frequencies.
Due to its uniform distribution across the frequency
spectrum, thermal noise is frequently modeled as a white
gaussian noise source. The effect of gaussian noise is
to create a "grainy" image due to the slight variations
present in adjacent pixel elements as shown in the
representative image scan in Figure 2.2. Image
observation shows that multiple adjacent elements may be
affected simultaneously, as opposed to the isolated
elements affected by impulse noise. Because the noise is distributed
across all elements, standard nonlinear filters such as
the median filter ( section 3.1.2 ) normally have little
or no effect on gaussian noise.
Figure 2.2 Gaussian Noise Effect on Image
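A minimal sketch of the graininess effect, using an assumed noise level (sigma) that is not taken from the thesis experiments: additive Gaussian noise perturbs every pixel of a uniform-intensity row, rather than isolated elements.

```python
import random
import statistics

random.seed(1)

# A uniform-intensity image row; sigma is an illustrative noise level.
clean = [128.0] * 64
sigma = 10.0
noisy = [x + random.gauss(0.0, sigma) for x in clean]

# Unlike a spike, the perturbation touches every pixel, producing the
# "grainy" appearance of small variations on all adjacent elements.
deviation = statistics.pstdev(noisy)
print(round(deviation, 1))  # close to sigma
```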
Narrowband jamming is similar to the effect of
impulse noise but instead of a random spectral
distribution it is typically seen as a narrowband, high
amplitude signal present at all times within a specific
frequency range. Narrowband noise may occur either
accidentally due to its location near a noise source of
the given frequency or deliberately as a means for
someone to disrupt reception of the data signal. If
nothing is known about the signal to be received or the
jamming signal, it may be difficult or impossible to
retrieve the original transmitted data because of the
overlapping frequency spectra. Linear filtering methods
have been shown ineffective in removal of jamming noise
due to their inability to separate the effects of
overlapping frequencies. Narrowband jamming is most
obvious when observing waveform signal data from a
transmitter. By placing a jamming signal sufficiently
close to the desired input signal frequency, the waveform
is distorted and may not be seen due to the masking
effect of the other carrier frequency. The effects of
narrowband jamming on waveform signals are shown in the
Chapter 8 test results.
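The jamming scenario can be sketched as follows (the frequencies and amplitudes are illustrative assumptions, not values from the Chapter 8 tests): a weak desired tone is summed with a stronger narrowband jammer at a nearby frequency, and a crude DFT bin magnitude confirms that both tones are present in the received waveform.

```python
import math

# Desired low-frequency signal plus a nearby high-amplitude narrowband jammer.
N = 256
f_sig, f_jam = 5, 8          # cycles per record (illustrative)
a_sig, a_jam = 1.0, 4.0      # the jammer dominates the desired carrier

signal = [a_sig * math.sin(2 * math.pi * f_sig * n / N) for n in range(N)]
jammed = [s + a_jam * math.sin(2 * math.pi * f_jam * n / N)
          for s, n in zip(signal, range(N))]

# Single-bin DFT magnitude: both tones appear, and in the time domain the
# weaker waveform is masked by the stronger carrier.
def bin_mag(x, k):
    re = sum(v * math.cos(2 * math.pi * k * n / N) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * k * n / N) for n, v in enumerate(x))
    return math.hypot(re, im) * 2 / N

print(round(bin_mag(jammed, f_sig), 2),
      round(bin_mag(jammed, f_jam), 2))  # recovers the amplitudes 1.0 and 4.0
```

When the two frequencies are this close, a simple linear filter sharp enough to remove the jammer also distorts the desired signal, which is the difficulty the text describes.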
3. Image Enhancement Methods
The average individual is repeatedly exposed to
image data every day. The TV screen floods us with images
ranging from late night movies to weather image pictures
shown on the evening news. Newspapers print images which
may have been digitized from still photographs or
videotape and transferred onto newsprint. An inherent
problem is that each time information is gathered via
some form of sensor, noise can be introduced into the
final product. This noise effectively masks some image
elements and reduces the clear/clean image that we
expect. For example, image sensors on satellites are
susceptible to a variety of noise sources including
wideband thermal, impulse, and jamming which create
erroneous bit values or sparkles in the image data. If
this data were to be looked at in its raw format, the
noise may be observable as mismatched coloration on a
uniform surface or a perceived blurring of the image.
Current communication and data transmission systems all
use some form of enhancement process on the noisy data
set to extract and recreate a smooth undistorted image.
Image enhancement for removal or reduction of noise
impacts is important both aesthetically and financially.
Aesthetically, as viewers we do not want to see fuzzy or
distorted imagery. While an image may be discernible
from the background noise, anything less than a sharp
image is considered unacceptable. From a cost
perspective, noise constitutes a degradation in the
overall transmission quality requiring retransmission or
reprocessing of a received image. This overhead wastes
transmission capacity of the transmitter and indirectly
causes increased cost or risks reduced quality of the
image.
3.1 Conventional Image Enhancement Techniques
Image enhancement can be performed using either
linear or nonlinear processes on today's computer
equipment. The benefits and disadvantages of each method
are briefly described below. Because
of the breadth and countless variations associated with
today's data filtering, only a few common methods will be
shown as representative examples. It is not the
intention of this paper to attempt to cover all current
methodologies and techniques but rather to compare a few
standard filter types operating on a small application
problem.
3.1.1 Linear Image Filtering
Linear systems are identified as the convolution of
an input sequence with the impulse response of the system
( Figure 3.1 ). Three important characteristics of
linear systems are linearity, causality and time
invariance. Linearity [5] states that for a system with
an input/output relationship of:
x(t) => y(t) (3.1)
if x(t) = 0 for t = t0, t0 + 1, t0 + 2,
..., then
y(t) = 0 for t = t0, t0 + 1, t0 + 2, ...
Causality deals with the system impulse response and
states that h(t) = 0 for t < 0. In this context, t =
-2 implies the condition that occurs 2 time steps in the
future after time t = 0. For physical implementations,
nothing is known about the input after time point t =
0, therefore all such points occurring after t = 0 are
assumed to be zero. Causal systems are shown to operate
only on inputs occurring at or prior to time t = 0.
Time invariance describes the input/output sequence
based on time. For a system to be time-invariant, any
delayed input x( t - t0 ) has an output y( t - t0 ).
function response will be the same, independent of the
time when the input is applied to the system [6].
Figure 3.1 Block Diagram of Linear System
If the impulse response of the system in Figure 3.1
is assumed to be the impulse response of a filter
function, the model is transformed into that of a linear
filter as shown in Figure 3.2. Linear filters convolve
an input sequence with a coefficient vector ( comprised
of the filter coefficients) and provide an output which
is the summation of all input connections. Since the
output is a result of pure summation, the slope of the
function is linear with respect to the inputs and filter
performance becomes entirely dependent on the chosen
coefficient matrix. The coefficient matrix is a mask
which defines the number of pixel elements used in the
linear combination. The computation performs a sum of
products operation of the coefficient mask and a block
of pixels which are centered at a pixel element. The
resulting output produces a single pixel value which is
positioned at the previously centered pixel element. If
the impulse response coefficients are assumed to define
the finite length of a window function, the linear filter
is called a finite impulse response ( FIR ) filter,
described in further detail in Chapter 4. Filter
characteristics are defined by adjusting the impulse
response vector. By proper selection a linear filter can
be designed to functionally perform as a lowpass,
bandpass, or highpass filter based on the element values
in each of the impulse response coefficients. Several
optimization programs [7] such as Parks-McClellan (using
the Remez exchange algorithm) exist which can be used to
create an optimal filter design. Parks-McClellan
provides as output a set of filter coefficients which
constitute the optimal fit between desired and realizable
filter frequency responses. Optimal design in this case
is defined as the point where the maximum error between
the desired and realized frequency responses has been
minimized.
A popular filter used for noise suppression which
requires little upfront knowledge or additional
processing is the averaging or "mean" filter. The filter
sums the value of each element in the window mask and
averages the sum total based on the number of window
elements. This can easily be implemented by summation of
data in a register or memory location for N data samples.
An averaging window of length 9 is illustrated in Figure
3.3. Mean filters are effective in the removal of
Gaussian type wideband noise but tend to add a blurring
effect due to the loss of high frequency components from
the image. For impulse type noise, the mean filter's
performance is ineffective because the intensity change
of a single element ( either high or low ) is averaged
into the resulting output value. A high intensity value
in an otherwise low intensity grid can create a
"splotched" effect in the reproduced image.
Yn = H( ) * h( )
Single point output y at time n = input vector H( )
convolved with filter vector h( )
h( ) = filter vector coefficients
H( ) = input data sequence vector
Figure 3.2 Block Diagram of Digital Filter
40 58 60 60 61 35 33 96 37 34 32 80 82 87 105 90
|-- Averaging Window (9) --|
Average = 480 / 9 = 53
Figure 3.3 Mean Filter Averaging
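The averaging step of Figure 3.3 can be sketched in a few lines of Python ( a modern illustration, not part of the original implementation; rounding to the nearest integer is an assumption chosen to match the tabulated outputs ):

```python
def mean_filter(data, width):
    """Slide a window of `width` samples across `data` and emit the
    average of each window, rounded to the nearest integer."""
    outputs = []
    for start in range(len(data) - width + 1):
        window = data[start:start + width]
        outputs.append(round(sum(window) / width))
    return outputs

sequence = [40, 58, 60, 60, 61, 35, 33, 96, 37, 34, 32, 80, 82, 87, 105, 90]
print(mean_filter(sequence, 9))  # first window: 480 / 9 = 53
```

The first window sums to 480, giving the average of 53 shown in the figure; note how the impulse value 96 is smeared into every window that contains it.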
3.1.2 Nonlinear Image Filtering
One particular problem experienced in image
enhancement and filtering is the preservation of sharp
transition edges when reconstructing an image picture.
This problem occurs because many image filtering
techniques use a moving average window to determine the
intensity (or gray scale value) assigned to a pixel point
as the window is scanned across the image. This moving
window can be either a onedimensional or twodimensional
structure and has the effect of removing spurious or
impulse noise. One technique for onedimensional
nonlinear processing is median filtering.
Median filters perform operations on a local
neighborhood sequence of n adjacent pixel elements
similar to that described for the "mean" filter ( Figure
3.4 ). Instead of averaging all pixel values, a graded
sequential pixel string is created with each element's
position based on its intensity level, lowest to highest.
The filter then selects the middle or median of this
sequential string and assigns the pixel intensity value
to the output image location at time y( t ). The median
filter input window is then shifted one location ( to
left or right ) and the process repeated for time y( t +
1 ). Table 3.1 shows the result for a median filter of
length 9 operating on an input sequence of 16 pixel
elements. The advantage of median filtering is that
single displaced elements will be removed or dampened.
Removal of spurious elements occurs because these single
points will normally be located on either the higher or
lower edges of the sequential string and not at the
median point.
While median filtering is very good at removal of
salt and pepper or speckle noise, it adds some distortion
to the recovered image. Distortion occurs because
whenever a sharp transition edge is encountered ( such as
the edge of a building ) the edge will not be recorded in
the reconstructed image until enough sequential pixels
are of a similar pixel intensity to become the median
value. The result may be shifting or blurring of the
edge or elimination of the edge if it is not sufficiently
wide with respect to the width of the filter window.
When the median or mean type filters operate on an image
or waveform with minimal contrast elements, the effect is
to lose these minor variations when reconstructing the
filtered output ( Table 3.2 ). While this may be the
desired result, it can have the effect of removing any
minor shading effects and will make the reconstructed
image look washed out and flat.
Two effects of median filtering can be seen in the
output values shown in Table 3.1. First, median
filtering removes the impulse noise element in the
sequence window at the expense of replacing the original
image gradual gray scale pixel elements by the sequence
average. This replacement causes some loss of contrast
in the reproduced image based on the length of filter
employed. The second effect is a blurring or perceived
widening of transition boundaries due to replacement of
each sequence element by the median value. In the test
case, 5 low contrast, 2 high contrast, and 1 impulse
element are replaced by 4 low contrast and 4 high
contrast elements. Filtering has removed the impulse but
smears the transition edges of the image based on the
length of the filter string.
40 58 60 60 61 35 33 96 37 34 32 80 82 87 105 90
|-- Median Window (9) --|
Median = 58

Input Image Sequence =
40 58 60 60 61 35 33 96 37 34 32 80 82 87 105 90
Impulse Noise Element = 96
Median Filter Width = 9
First Ordered String =
33 35 37 40 58 60 60 61 96
Median Element Position = (n-1) / 2 + 1
n = Median Filter Width
First Median Value = 58 (position 5 in string)
Figure 3.4 Median Filter
Time | Window Data Sequence        | Image Element | Median Output | Median Data Sequence         | Mean Output
  0  | 40 58 60 60 61 35 33 96 37  |      61       |      58       | 33 35 37 40 58 60 60 61 96   |     53
  1  | 58 60 60 61 35 33 96 37 34  |      35       |      58       | 33 34 35 37 58 60 60 61 96   |     53
  2  | 60 60 61 35 33 96 37 34 32  |      33       |      37       | 32 33 34 35 37 60 60 61 96   |     50
  3  | 60 61 35 33 96 37 34 32 80  |      96       |      37       | 32 33 34 35 37 60 61 80 96   |     52
  4  | 61 35 33 96 37 34 32 80 82  |      37       |      37       | 32 33 34 35 37 61 80 82 96   |     54
  5  | 35 33 96 37 34 32 80 82 87  |      34       |      37       | 32 33 34 35 37 80 82 87 96   |     57
  6  | 33 96 37 34 32 80 82 87 105 |      32       |      80       | 32 33 34 37 80 82 87 96 105  |     65
  7  | 96 37 34 32 80 82 87 105 90 |      80       |      82       | 32 34 37 80 82 87 90 96 105  |     71
Table 3.1 Mean/Median Filtered Data Sequence
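The Median Output column of Table 3.1 can be reproduced with a short Python sketch ( a modern illustration; the sequence and window width are taken from the table ):

```python
def median_filter(data, width):
    """For each window, order the samples by intensity and emit the
    middle ( median ) element, as described in section 3.1.2."""
    middle = (width - 1) // 2  # position (n-1)/2 in the ordered string
    outputs = []
    for start in range(len(data) - width + 1):
        ordered = sorted(data[start:start + width])
        outputs.append(ordered[middle])
    return outputs

sequence = [40, 58, 60, 60, 61, 35, 33, 96, 37, 34, 32, 80, 82, 87, 105, 90]
print(median_filter(sequence, 9))  # [58, 58, 37, 37, 37, 37, 80, 82]
```

The impulse value 96 never reaches the output: it always sorts to the upper edge of the ordered string, which is exactly the dampening effect described above.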
Time | Window Data Sequence | Original Element | Median Output | Mean Output
  0  | 30 32 34 36 32 34 36 |        36        |      34       |     33
  1  | 32 34 36 32 34 36 38 |        32        |      34       |     34
  2  | 34 36 32 34 36 38 31 |        34        |      34       |     34
  3  | 36 32 34 36 38 31 30 |        36        |      34       |     34
  4  | 32 34 36 38 31 30 33 |        38        |      33       |     33
  5  | 34 36 38 31 30 33 35 |        31        |      34       |     34
  6  | 36 38 31 30 33 35 34 |        30        |      34       |     34
Table 3.2 Mean/Median Filtered Slight
Shading Data Sequence
3.2 Neural Network Image Enhancement Technique
Neural networks appear as one solution to the
problem of noise reduction in images. Training sets can
be created which use impaired data ( blurred or noisy
images ) as the input set and use an ideal noise-free
output image ( typically the original image ) as the desired
output. When the network is forced to train on this form
of data set it develops an inherent immunity to certain
levels of noise or erroneous data bits. There is another
added benefit to using neural networks over conventional
linear filtering approaches. By their inherent
architecture, feedforward networks are comprised of
multiple "layers" or groupings of nodes arranged in the
lattice structure noted earlier. The input layer
receives data from the external environment and feeds it
to a middle or "hidden" layer for further data
processing. In most cases the number of hidden layer
nodes is much less than that provided at the input layer.
This reduction forces the network to generate more
generalized solutions to the problem. This mismatch of
nodes leads to a property which has been called
"dimensionality reduction" [8] or a compression of the
input representation. For example, if there are 96 nodes
in the input layer mapping into 8 nodes in the hidden
layer the input data representation has been compressed
by a factor of 8/96 or a 12:1 compression. Klimasauskas
[8] claims that this nodal reduction forces the network
to develop feature detectors which encode the data.
These feature detectors create localized filters or
feature clusters which help the network remove small or
nonrecurring features. The contention is that if the
signal being collected has an added noise component, this
noise will be random or nonstationary with respect to
other image features. By proper selection of the number
of feature detectors the network will force the data to
flow through at least one localized filter. By
compressing the data representation to fewer possible
states, image processing removes some of these random
noise components. While this approach is not present in
every network design it shows that neural networks used
for filtering (referred to as neural filters) can play an
important role in image processing and image enhancement.
4. FIR Filter Basics
This section provides an overview of Finite Impulse
Response (FIR) filters and their role in noise filtering.
4.1 Description
Data filtering is analogous to sifting through an
input data stream to remove unwanted or unnecessary
elements from the data set without destruction of the
underlying structure. Digital filters make use of the
architectural design of modern computer systems to create
filtering functions which directly remove noise or
corruption from an input data stream without first
passing the signal through a discrete analog filter. By
removing the dependence on analog components, the
performance of digital filters over time is predictable
and not subject to changes in voltage, temperature, or
component aging. By providing direct control over the
filter impulse function coefficients, a digital filter
can be created with sharp cutoff characteristics, linear
phase response, and tight passband/stopband boundary
specifications which are difficult to implement in the
analog arena. The field of image processing makes heavy
use of digital filters to provide consistent filtering
and enhancement of captured images by removal of additive
sensor or transmission effects.
Digital filters have two general classes, finite
impulse response (FIR) and infinite impulse response
(IIR) filters. IIR filters are able to perform like
filtering systems which possess infinite length memory.
This gives them the advantage of requiring fewer
coefficients than a FIR filter to provide the same filter
frequency response. An impulse state occurring in the
data stream will see its effects diminished over time but
some small component will always exist. Disadvantages of
IIR filters are that their infinite length nature makes
these filters prone to stability and linear phase
response problems. Lack of stability can cause the
filter output to oscillate or have the output grow
without bound. Linear phase response is necessary for
many communication or data processing applications
because the phase transfer relationship is what
guarantees minimum distortion for a signal passing
through the filter. Because greater care is necessary to
design and implement IIR filters, many simple digital
filters are based on the FIR class.
Compared to an IIR filter, a FIR filter uses a
finite number of samples, similar to a window, from the
input data set. By using a finite number of elements,
the response to any impulse appearing in the data dies
out after a finite number of samples have been processed.
This characteristic makes them inherently stable because
there is no need to account for any infinite term series
or decay rates. FIR filters are also shown to provide
linear phase response if the upper and lower impulse
response coefficients are mirror images. Linear response is
highly desirable in the design of filters which require a
flat response across their signal passband. While the
number of elements or taps on the filter are greater for
a FIR than for an equivalent IIR filter, their ease of
design and stable characteristics make FIR filters easy
to model and compare against a neural filter. When
selecting the desired neural network architecture to use
in implementing a FIR filter, the lack of feedback
elements or recursive connections makes them ideal for
implementation in a feedforward network using the
backpropagation algorithm.
4.2 Digital Filter Characteristics
A digital filter can be characterized by its impulse
response h(k) or its transfer function H(z) (Figure 3.2).
The output is calculated from a convolution of the input
data vector μ(k) and the filter's impulse response h(k)
as follows:
The filter is said to be causal or realizable if the
output at n = n0 is dependent only on values of the input
for n <= n0. This implies that the impulse response h(n)
is 0 for n < 0 and these elements will not be reflected
in the filter's output performance [9]. The transfer
function H(z) can be described as an Nth order rational
function by the sequence:
y(n) = Σ(k) μ(k) h(n-k)    ( 4.1 )

H(z) = ( a0 + a1 z^-1 + a2 z^-2 + ... + aN z^-N ) /
       ( 1 + b1 z^-1 + b2 z^-2 + ... + bN z^-N )    ( 4.2 )
A finite impulse response ( FIR ) filter is defined
as a digital filter whose number of h(k) terms is finite
in length ( referred to as the number of taps or order of
the filter ) and possesses no poles in the function
equation. This can be shown as a transfer function H(z)
with N-1 poles at z = 0 and N-1 zeros anywhere on the z
plane. Equation 4.2 utilizing only zeros of the function
reduces to:
H(z) = h(0) + h(1) z^-1 + h(2) z^-2 +    ( 4.3 )
       ... + h(N) z^-N
     = a0 + a1 z^-1 + a2 z^-2 + ... + aN z^-N
Performing the product summation (4.1) on the reduced
impulse response (4.3) provides the output of the desired
Nth order ( N tap ) FIR filter. Using N + 1 input data
points in equation 4.3 yields the FIR filter output
equation:
y(k) = h(0)μ(k) + h(1)μ(k-1) + h(2)μ(k-2) +
       ... + h(n)μ(k-n)
for k = 0,1,2,...,n    ( 4.4 )
The direct form output of a FIR filter can be
written as a convolutional sum of the filter impulse
response and the delayed input samples:
y(n) = Σ(k) h(k) x(n-k)    ( 4.5 )
x(n-k) = delayed input sample
As defined previously, a FIR filter will have a linear
phase characteristic if its unit sample response h(n)
satisfies the condition:
h(n) = h(M - 1 - n)    ( 4.6 )
n = 0,1,...,M - 1
(where M is the filter length)
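Equations 4.4 through 4.6 can be sketched directly in Python ( a modern illustration; the 5-tap coefficient values below are illustrative and not taken from the thesis ):

```python
def fir_filter(h, x):
    """Direct-form FIR output y(n) = sum over k of h(k) x(n-k), as in
    ( 4.5 ). Input samples before time 0 are taken as zero ( causal )."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y

def has_linear_phase(h):
    """Check the symmetry condition h(n) = h(M - 1 - n) of ( 4.6 )."""
    M = len(h)
    return all(h[n] == h[M - 1 - n] for n in range(M))

h = [0.1, 0.2, 0.4, 0.2, 0.1]  # symmetric coefficients, hence linear phase
print(has_linear_phase(h))                # True
print(fir_filter(h, [1, 0, 0, 0, 0, 0]))  # response dies out after 5 taps
```

Feeding a unit impulse through the filter returns the coefficients themselves followed by zeros, illustrating the finite-length response that makes FIR filters inherently stable.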
The architecture of a conventional FIR filter is
shown in Figure 4.1.
Figure 4.1 FIR Filter Architecture
4.3 Filtering Effects on Image Data
Image enhancement is generally referred to as the
sharpening or smoothing of an image to bring out details
which may be masked or distorted in the original image.
Image filters are required to deal with two main
frequency component elements: high frequency edges and
low frequency surfaces. High frequency elements
represent sharp transition edges or in gray scale images
appear as abrupt changes in the gray level intensity of
adjacent image pixels. Because of their abrupt
transition, edges possess high levels of spectral power
due to their combined high frequency components. This is
an element of basic mathematics whereby a sharp
transition resembles a square wave edge and square waves
are the summation of multiple sinewave elements. Low
frequency elements are representative of broad relatively
constant gray levels which represent the surface area of
objects or background imagery. By possessing a more
uniform intensity level, their spectral power is low
being composed of few sinewave elements.
Image components can be described more clearly by
imagining a photo or video image of a house. High
frequency image components are represented by the house
edges, window frames, chimneys or doors because of the
high contrast separations between adjacent pixels. Low
frequency components would be represented as the walls of
the house or the actual door where adjacent pixels show
very little change in the intensity level. Passing the
image through both highpass and lowpass frequency filters
demonstrates how the two filters differ in the features
that they will detect [10]. A high frequency (highpass)
filter allows the user to see the house edges but tends
to attenuate the walls, dropping the observed wall
intensity level. A low frequency ( lowpass ) filter will
show the house walls but the sharp edges of the house
separating it from the background image would be blurred
and distorted. This blurring represents loss of the
sharp contrast edges and a loss of spectral power. In
both cases, use of a single type filter results in the
loss of information.
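The complementary behavior of the two filter types can be illustrated on a one-dimensional "wall meets background" edge; the 3-tap kernels below are illustrative choices, not filters used in this thesis:

```python
def convolve3(kernel, signal):
    """Apply a centered 3-tap kernel; boundary samples are clamped
    ( the edge value is reused ) to keep the output the same length."""
    out = []
    for i in range(len(signal)):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, len(signal) - 1)]
        out.append(kernel[0] * left + kernel[1] * signal[i] + kernel[2] * right)
    return out

edge = [10, 10, 10, 10, 90, 90, 90, 90]  # wall intensity meets background
lowpass = [1/3, 1/3, 1/3]                # keeps surfaces, blurs the edge
highpass = [-1, 2, -1]                   # keeps the edge, flattens surfaces

print([round(v) for v in convolve3(lowpass, edge)])  # [10, 10, 10, 37, 63, 90, 90, 90]
print(convolve3(highpass, edge))                     # [0, 0, 0, -80, 80, 0, 0, 0]
```

The lowpass output preserves the two surface levels but spreads the transition across extra samples, while the highpass output responds only at the transition and returns zero on the flat surfaces; each filter alone discards the information the other keeps.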
Image enhancement requires a gentle blending of the
best properties for both highpass and lowpass filters.
When viewing a weather satellite image, it is necessary
to define and correctly identify the sharp contours of a
landmass or storm front while maintaining the slight gray
scale levels of the storm cell surface area.
5. Neural Network Basics
Neural networks consist of four basic elements: a
processing element, an activation function, an
interconnectivity architecture, and a training
methodology. A neural network architecture is crafted by
meshing each of the basic elements based on the type of
problem to be solved and the desired input/output format
of the data. Once the desired architecture has been
chosen, the learning algorithms are responsible for
feeding input data and propagating the resulting training
errors back into the network to provide movement around
the error surface of the function during its search for a
global solution.
5.1 Basic Processing Element
A processing element receives inputs from three
sources: connection arcs from other processing elements,
self connectivity, and an external bias ( Figure 5.1 ).
Each processing element generates an output result by
summing the matrix multiplication of the unit's inputs by
the connection arc weights plus the addition of the bias
to form the net input:
neti(t) = Σ(j) xij aj(t) + bi

neti(t) = net input into unit i at time t
xij     = connection weight from unit j to unit
          i, xii is the self connection weight
aj(t)   = output or activation value of unit j at
          time t
bi      = bias input into unit i
n       = number of units in the network
t       = time
Figure 5.1 Standard Network Node Connection
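The net input computation above can be written out for a single node in a short Python sketch ( the weights, inputs, and bias below are illustrative values chosen for this sketch, not values from the thesis ):

```python
import math

def net_input(weights, activations, bias):
    """neti(t) = sum over j of xij * aj(t) + bi: the weighted sum of
    incoming activations plus the node bias."""
    return sum(w * a for w, a in zip(weights, activations)) + bias

def sigmoid(x, K=1.0):
    """Sigmoidal activation f(x) = 1 / ( 1 + e^-Kx ), eq. ( 5.2 )."""
    return 1.0 / (1.0 + math.exp(-K * x))

weights = [0.5, -0.25, 0.1]  # connection arc weights into unit i
inputs = [1.0, 2.0, 3.0]     # activation values from the previous layer
bias = 0.2                   # external bias bi

net = net_input(weights, inputs, bias)  # 0.5 - 0.5 + 0.3 + 0.2 = 0.5
print(sigmoid(net))                     # activation value passed onward
```

The nonlinear activation applied to this linearly combined net input is what produces the activation value forwarded to the next layer, as described in section 5.2.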
The summed output of the unit represents the full input
connectivity for a given node in the network layer. The
value provided to the next network layer is the result of
applying a nonlinear function to this linearly combined
output parameter. The nonlinear result is referred to as
the activation value for the node defined by:
ai(t) = f( neti(t) )    ( 5.1 )
The function f is the activation function or update rule
and determines how the processing element will behave
given a specific net input. Common activation
functions include ( Figure 5.2 ):
Sigmoidal/Hyperbolic Tangent:
Sigmoidal:
f(x) = 1 / ( 1 + e^-Kx )    ( 5.2 )
Range value 0 < f(x) < 1
Hyperbolic:
f(x) = tanh(Kx)    ( 5.3 )
Range value -1 < f(x) < 1
Sine function:
f(x) = sin(Kx)    ( 5.4 )
Binary Threshold:
f(x) = 1, if x > 0    ( 5.5 )
f(x) = 0, if x <= 0
Linear:
f(x) = x    ( 5.6 )
Linear Threshold:
f(x) = -1, if x < -1    ( 5.7 )
f(x) = x,  if -1 <= x <= 1
f(x) = 1,  if x > 1
Figure 5.2 Common Activation Functions
( showing the Hyperbolic Tangent and Hard Limiter functions )
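The range limits quoted in equations 5.2 through 5.7 can be checked with a brief Python sketch ( K = 1 is assumed ):

```python
import math

def sigmoid(x):            # eq. ( 5.2 ): output stays inside ( 0, 1 )
    return 1.0 / (1.0 + math.exp(-x))

def binary_threshold(x):   # eq. ( 5.5 ): hard limiter, 0 or 1
    return 1 if x > 0 else 0

def linear_threshold(x):   # eq. ( 5.7 ): linear ramp clamped to [-1, 1]
    return max(-1.0, min(1.0, x))

for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 3), round(math.tanh(x), 3),
          binary_threshold(x), linear_threshold(x))
```

The smooth sigmoid and hyperbolic tangent curves are continuously differentiable, which is why they suit backpropagation ( section 5.6 ), while the discontinuous threshold functions are not.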
5.2 Activation Functions
The activation function is responsible for providing
a nonlinear mapping of the linear node summation and
provides the power of neural networks. Without the use
of a nonlinear activation transformation, all neural
networks would be reduced to a single network layer. The
single layer explanation is simple to understand. Basic
mathematics show that a series of linear functions can be
replaced by a single linear function which provides an
overall linear transformation. Since multiple layers of
linear elements could be combined into a single linear
layer function, additional layers would never be
necessary. The network would consist only of an input, a
linear translation, and an output. This would be the
equivalent of a linear FIR filter. Linear activation
functions will be found in neural architectures but
generally only as the output layer function providing a
linear combination of the hidden layer outputs.
When designing a network, choice of an activation
function helps determine the overall performance. Areas
which an activation function affects include:
1) How well a network may generalize
2) How difficult the network is to train
3) Real-time data processing speed
4) The output data range boundaries.
Multiple literature sources [11], [12] describe advantages
and disadvantages for each type of activation function.
For feedforward networks utilizing backpropagation, one
design criterion is that it is necessary to use an
activation function which provides a continuously
differentiable curve. The requirement for a continuous
function stems from the need to take the derivative of
the output function during error propagation ( see
section 5.6). From the above list, both sigmoid and sine
style functions provide smoothly continuous curves which
do not contain discontinuities around the origin. Most
implementations of feedforward backpropagation networks
are found to use these or their equivalent ( particularly
the sigmoid function ) as the de facto standard
activation.
5.3 Error Surfaces
The complexity of an activation function is
representative of the peaks and valleys created by the
multidimensional meeting of layer planes and determines
the form taken by a function's error surface. A simple
error surface can be shown by the linear activation
function. As shown in Figure 5.3, linear functions have
an easily definable bowl shape. An optimal solution
using this error surface requires only that the bowl
curvature be followed in a descending fashion until the
minimum error point is achieved ( at the bottom of the
bowl ). Using this basic architecture, no matter where
the starting point on the error surface, an optimal
solution will always be found. However, section 5.2
showed that linear functions have a very limited range in
neural networks and therefore do not represent the normal
error surface. A contrasting nonlinear sigmoid function
space is shown in Figure 5.4. When considering the
output error surface for a single processing element, the
optimal function solution is not as easily identifiable.
Because of the fluctuations in the hyperspace error
surface, one cannot simply descend to the bottom of the
bowl. While a starting point on one of the sloping edges
or trough may eventually lead to a solution, a starting
point in one of the flat regions ( corresponding to the
minimum and maximum boundary limits of the function ) may
prevent a solution from ever being found. Because
starting points are normally chosen during the
initialization phase of a network by random selection,
the initial state for a node cannot be predetermined
unless forced to a known condition by the user.
The difficulty in traversing a node error surface
using nonlinear functions is compounded by the fact that
the input to each layer node (except for the input layer)
can be viewed as the accumulated sum of error surfaces
from each of its previous layer interconnections. At
each update time period t, the current position for a
single node on its activation curve is determined based
on its cumulative net input. When these are combined by
a node into a single error surface, the nonlinear meshing
results in a complex surface where the optimal solution
may not be evident ( Figure 5.4 ). In addition, the
error surface may be littered with numerous locally
optimal holes ( called local minima ) which may prevent
the network from finding the globally optimal solution
region. If the network falls into a sufficiently strong
local minimum, training effectively ceases and no
additional number of iterations will result in change.
Mechanisms to help avoid local minima are discussed
further in section 5.7.
Figure 5.3 Linear Function Error Surface
Figure 5.4 Nonlinear Function Error Surface
5.4 Interconnectivity Architecture
Parallel distributed processing ( PDP ) models are a
connected latticework of nodes interconnected by a set of
"weighted" arcs. Figure 5.5 shows the generalized
network structure for a multilayer system.
Feedforward Network Architecture
Input Layer Hidden Layer Output Layer
Figure 5.5 Multilayer Network Structure
A network structure consists of three major layers:
input, hidden, and output. The input layer consists of n
input nodes which provide connectivity from the external
environment to the network. These inputs may represent
the output signals from hardware sensors, the output of
an Analog-to-Digital converter (A/D) which has converted
an analog signal into its digital equivalent, or the
tapped inputs from a delayed input signal serial data
stream. The input layer nodes are connected to the j
layer nodes of the secondary ( or hidden ) layer.
Feedforward networks may have one or more hidden layers
in their architectural makeup. Finally, the hidden layer
nodes are connected to the k layer nodes of the output
layer. The number of nodes present in the output layer
is dependent upon the problem to be solved. Single
condition outputs may have only one output layer node
whereas pattern recognition problems may require tens of
nodes to represent each of the pixel elements in an image
grid.
5.5 Training Methodology
The learning mechanism determines the steps a
network will follow in trying to achieve minimum error or
convergence. In recurrent networks constraint
satisfaction is maximized to yield a stable network
condition. In feedforward networks which utilize
backpropagation, the output is compared with a desired or
ideal output training value. Training by comparison of
the computed output with that of a known desired result
is commonly referred to as "Supervised Learning" [13].
Since the desired output is known, the variance is easily
computed as desired minus calculated. This difference,
or error value, is then passed in a reverse direction
from the output towards the input while adjusting the
weighted connections at each level. One pattern training
cycle has occurred when the error has propagated back to
the input and all interconnection node and bias weights
have been modified to follow the gradient descent curve.
Once a pattern has been propagated both directions
through the network, a new input data set is provided at
the input layer. These feedforward/comparison/feedback
steps are repeated until a desired mean error result is
achieved or it is determined that convergence will not
occur and the process is terminated.
When a network is first initialized it begins with a
randomized set of weighted connections and an external
bias. By presenting input vectors and backpropagating
the resulting error component, the weight matrix is
continually changed in an attempt to minimize the
resulting error.
5.6 Backpropagation Algorithm
The basic architecture of a neural network using the
backpropagation algorithm can be described as an input
layer, one or more hidden layers, and an output layer.
The layers are interconnected as shown in Figure 5.5.
The learning process for backpropagation is best
described as an iterative procedure which makes two
effective passes through the network for each pattern
presented at its inputs. During the forward pass, input
data is passed through several layers of nonlinear
transformations to provide a calculated output. During
the reverse pass, computed error is fed back through
these nonlinear transformations and layer to layer
connection weights are updated to minimize the least
squares error of the network. The forward/reverse
process is repeated until the total error reaches some
predetermined threshold value. The components of the
backpropagation algorithm are shown in Figure 5.6.
Yn    represents the input data set vector
      n = 1,2,...,i is the number of inputs
Ys    represents the calculated output pattern
ds    represents the desired output pattern
      s = 1,2,...,k is the number of outputs
j     represents the number of nodes in the hidden
      layer
Wij   represents the connection weights between
      the input layer and hidden layer
Wjk   represents the connection weights between
      the hidden layer and the output layer
bx    represents the node bias term for node x
Figure 5.6 Backpropagation Algorithm Diagram
For each processing element (node), an inner product of
the inputs and connecting weights is calculated then
summed with the node bias as follows:
x_j = Σ(n=1..i) W_nj Y_n + b_j    ( 5.8 )
The value x_j is transformed into a modified value xmod_j by
passing x_j through the activation function (Figure 5.2).
f(x) = 1 / ( 1 + e^(-Kx) )
xmod_j = f(x_j)    ( 5.9 )
To calculate the output layer nodes, the inner product of
xmod_j and connection weight matrix W_jk (plus the bias) is
calculated and passed through the activation function
yielding the calculated outputs ymod_k.
y_k = Σ(n=1..j) W_nk xmod_n + b_k    ( 5.10 )
ymod_k = f(y_k)    ( 5.11 )
Total error calculation is the summation of squared errors
between calculated and desired outputs for each of the
output layer nodes.
E = 1/2 Σ(n=1..k) ( d_n - ymod_n )^2    ( 5.12 )
Applying the chain rule of differentiation to the error
function with respect to the computed output ymod_k and the
weight matrices results in the delta change equation:
Δ_k = ( d_k - ymod_k ) s'( ymod_k )    ( 5.13 )
For the sigmoidal function (5.2), the derivative s' is:
s'( ymod_k ) = ymod_k ( 1 - ymod_k )    ( 5.14 )
The interconnection weights between layer j and layer k
are updated using the existing weight, the derivative
error component, and a learning parameter η. The
parameter η indicates what portion of the feedback error
should be combined during this pass. The weight update
is performed for each hidden to output layer connection
and the bias:
w_jk(t+1) = w_jk(t) + η Δ_k x_j    ( 5.15 )
x_j = output of hidden layer node j
This error propagation and weight update applies to the
output layer where the desired output pattern value is
known.
For the hidden layer(s), there is no prior knowledge
of the desired output value. Therefore, to determine the
error content, the value must be calculated recursively
from the output layer error and interconnection weights.
The difference is found in the Δ_j calculation. The
error component must be calculated as the sum error of
all connections between that layer and the output.
Δ_j = s'( xmod_j ) Σ(n=1..k) Δ_n w_jn    ( 5.16 )
s'( xmod_j ) = xmod_j ( 1 - xmod_j )    ( 5.17 )
w_ij(t+1) = w_ij(t) + η Δ_j x_i    ( 5.18 )
x_i = input layer node i
The weight update equation used in backpropagation
is a gradient descent method called the Generalized Delta
Rule. The generalized delta rule causes the weight
vector to move such that if the error surface were a
paraboloid its projection would move down the steepest
descent of the error bowl surface ( Figure 5.3 ) as it
approaches an optimal solution and would converge at the
lowest point on the bowl ( point C ). However, in the
backpropagation algorithm, error surfaces do not
represent simple paraboloids but instead are complex
multidimensional input and output spaces defined by
complex functions ( Figure 5.4 ). This multidimensional
surface creates multiple local minima, peaks, and troughs
in the error surface which must be traversed to find the
global minimum error.
Updating of the interconnection weight matrix for a
multilayer network in the reverse pass has two forms of
calculations (shown in equations 5.15 and 5.18). For the
output layer, the error signal is multiplied by the
derivative of the activation function f( ) for use in
updating the matrix. For the hidden layer(s), a
recursive calculation is used to combine the error
contribution for each post layer node connection to the
hidden layer node because true error is only known for
the output layer. The calculations are continued until
the least squared error E is less than (or equal to) some
predetermined error threshold value after which the
training loop is terminated.
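The forward/reverse cycle described by equations 5.8 through 5.18 can be sketched in code. The following is a minimal illustration only; the layer sizes, learning rate, and function names are my own choices for the sketch, not values taken from this thesis:

```python
import math, random

def f(x, K=1.0):
    # Activation function f(x) = 1 / (1 + e^(-Kx))  (eq. 5.2)
    return 1.0 / (1.0 + math.exp(-K * x))

def train_step(Y, d, W_ij, b_j, W_jk, b_k, eta=0.5):
    """One forward/reverse backpropagation cycle (eqs. 5.8 - 5.18).
    W_ij is an i x j input-to-hidden matrix, W_jk a j x k
    hidden-to-output matrix; both are updated in place."""
    i, j, k = len(Y), len(b_j), len(b_k)
    # Forward pass, hidden layer (eqs. 5.8 - 5.9)
    xmod = [f(sum(W_ij[n][m] * Y[n] for n in range(i)) + b_j[m])
            for m in range(j)]
    # Forward pass, output layer (eqs. 5.10 - 5.11)
    ymod = [f(sum(W_jk[n][s] * xmod[n] for n in range(j)) + b_k[s])
            for s in range(k)]
    # Output layer deltas (eqs. 5.13 - 5.14)
    dk = [(d[s] - ymod[s]) * ymod[s] * (1.0 - ymod[s]) for s in range(k)]
    # Hidden layer deltas, computed recursively from the output
    # errors and interconnection weights (eqs. 5.16 - 5.17)
    dj = [xmod[m] * (1.0 - xmod[m]) *
          sum(dk[s] * W_jk[m][s] for s in range(k)) for m in range(j)]
    # Weight and bias updates (eqs. 5.15 and 5.18)
    for m in range(j):
        for s in range(k):
            W_jk[m][s] += eta * dk[s] * xmod[m]
    for s in range(k):
        b_k[s] += eta * dk[s]
    for n in range(i):
        for m in range(j):
            W_ij[n][m] += eta * dj[m] * Y[n]
    for m in range(j):
        b_j[m] += eta * dj[m]
    # Total squared error (eq. 5.12)
    return 0.5 * sum((d[s] - ymod[s]) ** 2 for s in range(k))
```

Repeated calls with the same training pair drive the returned error toward the desired threshold, as described above.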
5.7 Modifications to the Basic Backpropagation Algorithm
The backpropagation algorithm suffers from two major
problems: slow convergence time and a tendency to become
stuck in local minima. While slow convergence times
force an increased number of iterations before an
acceptable convergent result is reached, local minima can
cause a network to never reach convergence. Two
different modifications to the basic backpropagation
algorithm can be used to help minimize the effects of
these problems: momentum and simulated annealing.
Momentum is an attempt to model in neural networks
the actions of an object moving on a physical surface.
The standard analogy for momentum is to envision rolling
a ball down the curve of the function surface. Section
5.3 described how the error surface in a network is not a
smooth constantly decreasing slope but instead contains
many anomalies which will impede the movement of the
ball. These anomalies may take the form of depressions
or small hills in the error surface ( Figure 5.7 ).
Sample Network Function Error Surface
Figure 5.7 Local versus Global Minima
Backpropagation uses a gradient descent approach to
follow the decreasing slope of the function curve in
order to find the surface minima. If the ball is moving
at a slow rate, determined by the update step size, it
may become trapped in a small valley depression (point A)
and not be able to continue its path towards the function
global minima ( point B ). This can result in a network
exhibiting poor convergence ( suboptimal solution ) or no
convergence. The solution is to add an additional term
to the update rule which takes into account past delta
changes and adds a small portion of this summation to the
term at its next update time period.
To help the network escape shallow local surface
minima, the standard update learning rule (5.15) is
modified to add a momentum term ( this occurs for all
layers, interconnection wjk is shown as an example ) :
w_jk(t+1) = w_jk(t) + η Δ_k x_j + α( w_jk(t) - w_jk(t-1) )
α = momentum term value    ( 5.19 )
The typical momentum term uses the effective connection
value for the last two time periods to determine an
average rate of change and direction of movement. This
parameter is weighted and added as part of the overall
delta weight change for a node during the error
propagation update. The basis for adding this momentum
term is defined by physics. A ball rolling down a hill
develops some amount of inertia which allows it to pass
through a small depression or valley without having its
progress impeded. If the hill it encounters coming out
of the valley is too steep, the ball will not be able to
climb the grade and will become trapped in the valley.
The corollary in neural networks is that if the gradient
descent is following the slope of an error surface, small
depressions or local minima can be overcome if they are
shallow. The inertia parameter is replaced by an average
of several past changes to a node connection indicating
the rate of change that the node has been experiencing.
By adding a fraction of this rate of change, the network
is "nudged" in the same direction it has been traveling
even if the current update would normally have indicated
that no change was required. The addition of momentum to
the update rule has been shown [14] to aid a network in
escaping from localized minima.
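The momentum-augmented update of equation 5.19 can be sketched as a small helper; the function and parameter names here are illustrative assumptions:

```python
def momentum_update(w, delta_step, prev_change, eta=0.5, alpha=0.9):
    # w(t+1) = w(t) + eta * (delta term) + alpha * (w(t) - w(t-1))  (eq. 5.19)
    # `delta_step` stands for the gradient term (Delta * x) and
    # `prev_change` for the previous weight change w(t) - w(t-1).
    change = eta * delta_step + alpha * prev_change
    return w + change, change

# Even when the current gradient term is zero (a flat spot or a
# shallow local minimum), the momentum term keeps the weight moving:
w, prev = momentum_update(0.0, delta_step=1.0, prev_change=0.0)
w2, prev = momentum_update(w, delta_step=0.0, prev_change=prev)
```

The second update above moves the weight even though its own gradient term is zero, which is exactly the "nudge" described in the text.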
Simulated annealing [2], [15], [16] is a process used
to make the network more active during its initial
learning stages and decreasing the effect as the network
trains. When a network is initialized, usually the
activations and weights are set to randomly selected
values. As the network is trained, the gradient descent
method starts to follow a path in search of a global
minima. Depending on the shape of the error surface, it
may be possible that if the search had started on the
other side of a peak, the network would have converged to
a completely different point. The problem is defined as
a failure to explore the entire state space when
searching for the global solution. Simulated annealing
adds a parameter in the calculation of the nonlinear
function ( 5.20 ) which causes the output to be slightly
more chaotic and may force the function into another
region of the state space. This chaotic movement would
seem to defeat the purpose of normal gradient descent.
The benefit of this chaotic behavior is that the
calculation will not necessarily follow the next
descending element on the error surface but instead may
ricochet around the state space in the initial stages of
training. By exposing the network to more regions of the
surface, another path may be defined which leads to a
more optimal solution. To ensure that the network does
not ricochet around the state space forever, the effect
of annealing is reduced by periodically decreasing the
value of the temperature coefficient. Provision of a
broad exploration space during the early training stages
will hopefully have situated the network to a better
point on the surface where it can continue its gradient
descent search. The decreasing annealing parameter will
not have the effect of bouncing the network out of its
ideal state region but may help it to bounce out of
encountered local minima. At some stage in training, the
annealing parameter has been reduced to a minimal value
and no longer has a dramatic effect on the computation.
f(x) = 1 / ( 1 + exp( -Kx / T ) )    ( 5.20 )
T = temperature coefficient
Simulated annealing is another mechanism (similar to
momentum) based in physics, associating the movement of
molecules in metal as the material is heated and then
allowed to gradually cool. In metallurgy, this annealing
process is used to allow the molecules to find an optimum
alignment structure which results in a stronger bond
between molecules. This strong bonding results in high
strength properties in the metal. Simulating this
process in a neural network forces the network to examine
more than one possible solution path during the training
phase. As the network decides which path provides the
best chance for a solution, the effect is reduced over
time. Continued cooling ( reduction of the annealing
parameter ) provides small random movements in the
gradient algorithm allowing it to escape shallow local
minima. By randomly moving through the state space, the
number of iterations necessary to find convergence may be
reduced if the network starts its final search process at
a point closer to the acceptable solution.
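A sketch of the temperature-scaled activation function (5.20) and a cooling step follows; the cooling rate shown is an arbitrary illustrative value, not one used in this thesis:

```python
import math

def annealed_activation(x, K=1.0, T=1.0):
    # f(x) = 1 / (1 + exp(-Kx / T))   (eq. 5.20)
    # A high temperature T flattens the sigmoid, pushing outputs
    # toward 0.5 and making the search more chaotic; a low T
    # sharpens it back toward the ordinary activation.
    return 1.0 / (1.0 + math.exp(-K * x / T))

def cool(T, rate=0.95):
    # Periodic reduction of the temperature coefficient.
    return T * rate
```

As T is cooled toward a minimal value, the function's effect on the computation fades, matching the behavior described above.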
5.8 Effect of Hidden Layers
When Minsky and Papert [17] published the results of
their work on perceptron networks, they demonstrated that
networks with a maximum of two layers (input and output)
could never solve problems that were not linearly
separable. This presented a severe limitation because
most real world problems of any importance are of the
nonlinearly separable variety. Later work by Werbos
[18], Rumelhart and McClelland [1], and others showed
that by adding a middle or hidden layer containing
nonlinear outputs, complex state spaces could be created
which remove the nonlinearly separable constraint. The
strength of neural networks is their ability to find
solutions which exist in a very complex state surface.
Decision regions in this state space may take different
forms depending on the number of network layers as
illustrated in Figure 5.8. Lippmann [19] is a frequently
cited reference for his description of the number of
layers required by a multilayer perceptron to define a
given solution region. Based on the intersection of
cutting surface hyperplanes in multidimensional space, a
maximum of three layers ( 2 hidden, 1 output ) are shown
to be adequate to describe any arbitrary solution region.
For this reason most feedforward style neural networks
rarely consist of more than 3-4 layers regardless of the
complexity of the error surface.
Single Layer      Two Layers      Three Layers
Figure 5.8 Hidden Layer Solution Bounds
While the number of hidden layers for a particular
type of problem seems to have defined boundaries, one
open issue in PDP research is determination of the number
of nodes required in the hidden layer. Researchers have
proposed solutions to this problem for identifying excess
unused nodes, but in general the number is problem
dependent and must be discovered in a trial and error
fashion. Test of the solution is achieved when the
network has converged during a training phase and
performs well during the ensuing test phase. If enough
nodes exist with the appropriate connectivity, the
network will be able to achieve a generalized solution to
test case input data sets. For a network with too few
nodes, convergence may not occur. If too many nodes
exist, the network may train and achieve high scores (low
error) on training data sets but fail when tested on
similar data sets which it has not previously seen
(Figure 5.9).
Poor generalization versus free parameters is
observable in the areas of statistics and curve fitting.
Too many free parameters results in overfitting of the
problem set. When too many free parameters (weighted
connections and nodes) are available, the network will be
able to follow small details and noise in the input data
set. When this occurs, the network has not succeeded in
creating a generalized solution to the problem but has
instead memorized the data patterns. Memorization is
easily identified by excellent match results on training
data set patterns but poor performance on new input
pattern data sets ( Figure 5.10 ).
[ Plot of identification accuracy versus test data sequence,
comparing a network with the minimal required hidden nodes
against a network with excessive hidden layer nodes ]
Figure 5.9 Effect of Too Many Hidden Layer Nodes
In a properly connected network it is not the patterns
themselves which are stored but the connection weights
between nodes which allow the patterns to be recreated.
A general statement is that if there is a large enough
number of connections from the input to a sufficient
number of hidden layer nodes, any mapping from input to
output can be performed.
Overfitting of data during training,    Generalized fit to data; training
poor generalization on test data        data and test data results similar
Figure 5.10 Number of Nodes versus Generalization
5.9 Effects of Network Overtraining
Poor generalization performance (section 5.8) can
also be observed if a network is allowed to train far
beyond normal convergence ( Figure 5.11 ). Test results
by Hecht-Nielsen [20] have shown the effect on final
network performance based on number of training cycles.
If a training cycle is stopped near the first point of
convergence or after a predefined number of training
cycles, the generalized performance is indicated by the
upper dashed line. If training is allowed to continue,
after a while performance peaks and begins to decline.
Hecht-Nielsen attributes this effect mainly to self-
organizational structures but feels that the effect is
similar to the "too many nodes" problem whereby the extra
training forces the network to more closely align with
the training data. This alignment then results in less
than optimal results on unseen test data.
[ Plot of identification accuracy versus training cycles, comparing
training stopped near first convergence against a network trained
far beyond convergence ]
Figure 5.11 Effect of Network Overtraining
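The overtraining guard described above can be sketched as a simple stopping test on held-out identification accuracy; the function name and `patience` parameter are illustrative assumptions, not values from this thesis:

```python
def should_stop(accuracy_history, patience=3):
    # Stop once held-out accuracy has failed to improve for
    # `patience` consecutive checks, before the extra training
    # aligns the network too closely with the training data.
    if len(accuracy_history) <= patience:
        return False
    best_so_far = max(accuracy_history[:-patience])
    return all(a <= best_so_far for a in accuracy_history[-patience:])
```

When accuracy is still improving the test returns False; a sustained decline past the peak triggers termination.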
6. Neural Filter Design
6.1 Description
Neural filters will be described as a broadband
discrete-time FIR type filter implemented by a neural
network architecture. The requirements are that it shall
be capable of learning the functions and coefficient
matrices required to generate a filter output y(n)
representing the output of a nonconventional FIR filter.
The term nonconventional FIR is applied to differentiate
the neural result from a normal FIR. As shown in Chapter
4, the direct form realization of a FIR filter can be
shown by the equation:
y(n) = Σ(k=0..N-1) h(k) x(n-k)    ( 6.1 )
     = h(0) x(n) + h(1) x(n-1) +
       ... + h(N-1) x(n-(N-1))
This is functionally equivalent to providing a serial
input data stream to a FIR filter ( Figure 6.1 )
containing N-1 tap delays of the input x(n). The tap
delay elements are multiplied by the impulse response
coefficients matrix and summed to yield the single FIR
filter output element at time point n.
Figure 6.1 FIR Filter Structure
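The direct-form realization of equation 6.1 can be sketched directly; the moving-average coefficients in the usage example are an arbitrary illustration, not coefficients from this thesis:

```python
def fir_filter(x, h):
    # Direct-form FIR: y(n) = sum over k of h(k) * x(n-k)  (eq. 6.1)
    N = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(N):
            if n - k >= 0:      # taps before the first sample are zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

# A 4-tap moving-average example: the output ramps up as the
# delay line fills, then settles at the input level.
smoothed = fir_filter([1, 1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25])
```

Each output element is the inner product of the tap-delayed inputs with the impulse response coefficients, exactly as described for Figure 6.1.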
A neural network architecture can provide similar
functionality in a multilayer feedforward network
(Figure 6.2). The difference between filter types is due
to their overall purpose. A FIR is normally configured
to enhance or suppress a particular frequency spectral
band, resulting in Low-Pass and High-Pass filters. The
drawback of applying this filter to a waveform containing
both high and low frequency components plus noise is that
one of the desired frequency components will be treated
as additional noise. A Low-Pass filter will therefore
suppress the high frequency noise components and the high
frequency signal component; it cannot differentiate
between the two. A Band-Pass style FIR filter can be
created to pass both signal frequencies but will also
pass any noise components that fall within the passband.
The purpose of the neural filter is to combine effects
similar to both a Low-Pass and High-Pass filter while
suppressing additive noise.
In this paper, the delay network responsible for
conversion of a serial input data stream to a parallel
data format will not be implemented. The network will be
trained on a parallel input data vector which represents
the current input data value and N-1 previous data
points. The parallel vector is equivalent to processing
a data stream through a single input serializing FIR
filter front end incorporating a multitap delay and has
no impact on the final processing capabilities of the
neural filter.
Input Layer Hidden Layer Output Layer
Figure 6.2 Neural Filter Structure
6.2 Why Implement a Filter in Neural Networks
Since FIR and Median filters are already
successfully implemented by a computer algorithm, what
are the advantages of implementing a similar filter
function on a neural network? Four possible reasons are
suggested:
1) The parallelism of neural networks should
reduce calculation time by providing parallel
computation on each of the filter taps.
Current computer implementations use a
sequential process of 1 multiply/add for each
tap. Parallel matrix computations would make
the calculation time equivalent to that of
infinite impulse response ( IIR ) filters which
require fewer taps than an FIR but use feedback
loops which make them prone to instability.
2) Median filters perform well in reduction or
removal of impulse type noise but do not
perform as well on distributed noise sources
such as gaussian or uniform noise. The result
is a blurred image or a choppy recovered
waveform signal. It would be desirable to
achieve the spike suppression capability
without the degraded broadband performance.
3) Standard computerized filters provide no
graceful degradation method for dealing with
failures in hardware or unexpected data values.
Since neural networks are designed around a
distributed knowledge base, they have some
inherent fault tolerance built into the
functions. This can be an advantage when
trying to develop a hardware based neural
filter in VLSI hardware.
4) The neural network method may show improved
overall performance over standard linear
type filters when the signal is extremely noisy
because the filter can be created using noisy
data during the training phase. In addition
the network may be able to extract out signals
that the other filter methods cannot handle.
6.3 Neural Filter Architecture
A standard feedforward neural network contains a
minimum of three layers and employs full connectivity
between the input layer and each of the hidden layer
nodes ( Figure 6.2 ) and between the hidden layer and
output layer nodes. Each hidden layer node receives the
product of the input matrix X convolved with the
corresponding interconnection weight matrix W for all
input elements, each input element providing a weighted
connection arc to each of the hidden layer nodes.
The neural filter architecture to be used for noise
filtering is a threelayer feedforward network with full
connectivity between layers ( Figure 6.2 ). This network
architecture has been used by other researchers [21], [22]
as a means to separate a desired signal from an additive
noise source. The input matrix X consists of data
waveform or image pixels presented as a parallel word to
the input nodes. The input data propagates from the
input layer to the output while undergoing a
transformation through the hidden layer nonlinearity
function. If this data were to be received in a serial
fashion and allowed to pass through a series of single
time delay elements, the input structure would be
identical to that of a FIR filter. Since this work is
not concerned with creation of a serial to parallel shift
function, the data will be presented in parallel while
simulating a serial input data stream. Each time slice
will be represented by shifting the input data set one
position to the right and adding the next data set
element into the leftmost slot. By presenting parallel
data in this fashion, the network data at each time
period simulates the appearance of a serial input data
stream.
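The shifting presentation just described can be sketched as follows; the function name and zero-initialized window are my own assumptions for the sketch:

```python
def serial_to_parallel(stream, N):
    # Simulate a serial input stream with an N-wide parallel window:
    # at each time slice the window shifts one position to the right
    # and the newest sample enters the leftmost slot.
    window = [0.0] * N
    frames = []
    for sample in stream:
        window = [sample] + window[:-1]
        frames.append(list(window))
    return frames
```

Presenting `frames` one row at a time gives the network the same data it would see behind a multi-tap delay line.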
The hidden layer activation function was chosen such
that the network can handle either waveform or image
pixel data. While test waveform data is most commonly
associated with sine or cosine waves in the range of -1
to +1, image data is based on the gray scale coding for
display on a computer monitor. This gray scale code
contains pixel elements with values of 0 to 255. Scaling
can reduce this range to 0 to 1 and a bias/multiplier can
be used on the waveform data to create the same range.
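The two scalings just described can be sketched as:

```python
def scale_pixel(p):
    # Gray scale pixel value 0..255 reduced to the range 0..1.
    return p / 255.0

def scale_waveform(v):
    # Waveform sample -1..+1 mapped to 0..1 via a bias and multiplier.
    return (v + 1.0) * 0.5
```

Both data types then share the 0-to-1 range expected by the common sigmoidal activation function.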
To provide a common activation function, the standard
sigmoidal is used as the nonlinear function. Two
modifications to the standard backpropagation algorithm
are used, momentum ( 5.19 ) and an adjustment of the bias
parameter K in the activation function ( 5.2 ). The
effect of adjusting K relates to the steepness of the
slope of the function ( Figure 6.3 ) and can alter the
speed with which a network will converge.
Neural filter training consists of using the
Supervised Learning method. During the test generation
phase prior to running the network, data training pairs
are created. These training pairs consist of the desired
output result (normally the clean data signal) and the
noisy corrupted version to be used as the network input
data matrix. Network convergence (or solution) is
assumed to occur when the calculated error is equal to or
less than some desired error value. For all cases of
the neural filter, the calculated error is the average
error over multiple training data patterns. The
advantage of this method is to force the network to have
a low average error over a large number of consecutive
random training patterns rather than terminating training
when the first pattern is found to have a low enough
error. While this method of determining network
convergence is not always adequate, it places higher
restrictions on the network to achieve a more optimal
solution region.
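The average-error convergence test described above can be sketched as:

```python
def converged(pattern_errors, threshold):
    # Declare convergence only when the *average* error over a batch
    # of training patterns is at or below the desired error value,
    # not when a single lucky pattern happens to score well.
    return sum(pattern_errors) / len(pattern_errors) <= threshold
```

A batch containing one large error can therefore hold the network in training even if other patterns already score well.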
Figure 6.3 Effect of K on Function Slope
The training mechanism used for the feedforward
network is shown in Figure 6.4. Two inputs are provided
to the training network, the original clean signal and
the noise source altered signal. The noisy source data
is provided as a parallel data word of width N to each of
the input layer nodes ( one data bit for each network
node ). The output layer receives a parallel data word
of width M to each of the output layer nodes ( one data
bit for each node ) representing the clean source signal.
For this version of supervised learning where the output
layer contains a single node, the desired output is
represented by the middle element of the clean input
signal. Error calculation is accomplished by applying
the noisy signal to the network inputs and comparing the
output node value against the desired clean value. The
difference error is calculated and backpropagated through
the network to adjust the layer interconnection weights.
[ Diagram of the training configuration: the clean signal x(t)
provides the desired output while its noise-corrupted version
feeds the network inputs, whose calculated output is compared
against the desired value ]
Figure 6.4 Neural Filter Network Configuration
7. Test Data Generation
Two types of data can be used to test the overall
performance of a conventional or neural filter. Waveform
data represents a sequence of elements which may normally
be received in a serial fashion from a piece of test
equipment or a downlinked satellite transmission.
Depending on the data being transmitted, there may or may
not be a tight correlation between adjacent signal
elements. Image data can be represented as a superset of
waveform data. Image data may be received in either a
serial or parallel format with the difference being pixel
to pixel correlation. In an image, adjacent pixels
generally are found to have a high correspondence value
with its neighbors to either side ( the exception being
sharp transition image edges or the ends of image rows).
Both types of data are generated by the programs
described below and used as the test data streams to each
type of data filter.
7.1 Waveform Data Signals
Waveform data is created using a C program called
multwav3.cpp ( see Appendix A for code listing ). Two
options are given to the user upon initialization; single
or dual waveform generation. All waveform creation uses
a sinewave as the baseline signal. The user is prompted
to enter the frequency of the baseline signal to be
created. This is followed by a prompt for the desired
sample frequency to be used in the creation of sampled
elements from the baseline signal. If the dual waveform
option has been selected, the computer will prompt for
the frequency of the second signal and request how it is
to be implemented. The two options are: baseline signal
is the combination of signal frequency 1 and signal
frequency 2, or signal frequency 2 is to act as an
interference source (or jammer signal). The amplitude of
the second signal is requested relative to the strength
of the baseline.
For example:
f_signal1 = 357 Hz
f_signal2 = 2598 Hz @ 0.8 ( interference signal )
f_sample  = 17598 Hz
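Under the assumption of simple additive sinusoids, such a test waveform could be generated as follows; this is an illustrative sketch, not the multwav3.cpp listing from Appendix A:

```python
import math

def make_waveform(f1, f_sample, n_samples, f2=None, amp2=0.0):
    # Sampled sine baseline, optionally with a second sinusoid acting
    # as an interference (jammer) component at relative amplitude amp2.
    out = []
    for n in range(n_samples):
        t = n / f_sample
        v = math.sin(2.0 * math.pi * f1 * t)
        if f2 is not None:
            v += amp2 * math.sin(2.0 * math.pi * f2 * t)
        out.append(v)
    return out

# The example parameters above: a 357 Hz baseline sampled at
# 17598 Hz with a 2598 Hz jammer at 0.8 relative amplitude.
wave = make_waveform(357.0, 17598.0, 200, f2=2598.0, amp2=0.8)
```

The resulting sample sequence is then corrupted with additive noise to form the network's training input.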
