Motion estimation techniques and their application to digital image compression

Material Information

Motion estimation techniques and their application to digital image compression
Schramke, Richard W
Publication Date:
Physical Description:
ix, 91 leaves : illustrations ; 29 cm


Subjects / Keywords:
Image processing -- Digital techniques ( lcsh )
Image processing -- Digital techniques ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references.
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Department of Electrical Engineering, Department of Computer Science and Engineering.
Statement of Responsibility:
by Richard W. Schramke.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
28863611 ( OCLC )
LD1190.E54 1993m .S383 ( lcc )

Full Text
Richard W. Schramke
B.S., Pennsylvania State University, 1984
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Electrical Engineering
and Computer Science

This thesis for the Master of Science degree by
Richard Wayne Schramke
has been approved for the
Department of
Electrical Engineering and Computer Science
Dr. Jay Rothman
Date:

Schramke, Richard Wayne (M.S., Electrical Engineering)
Motion Estimation Techniques and their Application to Digital Image Compression
Thesis directed by Dr. Joe Thomas
In the last ten years, considerable research has been done in the area of image
sequence processing. Much of this research has been prompted by the desire for
advanced commercial applications such as videoteleconferencing, high-definition digital
television and multimedia systems.
This thesis investigates the theory behind image sequence processing and two
types of techniques used in motion estimation for image compression applications:
correlation-based and gradient-based techniques. A description of various, specific
implementations for each type of technique is then given. Before investigating motion
estimation techniques, however, a variety of traditional digital image compression
techniques are described in order to demonstrate the need for motion estimation in
advanced compression applications.
The form and content of this abstract are approved. I recommend its publication.

This thesis is dedicated to my wife, Corine, who supported me and/or left me
alone (whichever was appropriate at the time) throughout its development and my
graduate coursework, even though she would rather have been the one to pursue a
Master's degree. This thesis is also dedicated to Robyn and Erik, my "big" kids, who
are old enough (12 and 10 years old, respectively) to realize what time they gave up
with me for it, without complaining, yet without completely realizing its significance
(although neither can imagine writing a report longer than 20 pages); and to Diane, who
involuntarily shared the first nine months of her life with it, but also made it take twice
as long to complete, thanks to her distracting cuteness.

I'd like to acknowledge the following people, without whom, either directly or
indirectly, this thesis would not have been possible:
Vance McCollough, who provided a year's worth of guidance and stimulating
discussion, and helped to keep this thesis from becoming my "lifetime achievement,"
Joe Thomas, who stepped in to advise me on this thesis when I needed it,
my father and mother, Dick and Jean Schramke, who provided me with the
education and inquisitiveness to learn, and, from an early age, encouraged me to do my
best and to do what I enjoyed, and provided me with what I needed to do that,
David Keller, my 9th grade math teacher, who gave me my first chance to
"play" with computers, yet always reminded me to keep my priorities straight, and
Tom Vito, my "mentor", who taught me "real" engineering and originally
sparked my interest in the topic of this thesis.

1. INTRODUCTION
1.1 Image Compression Applications
1.2 Thesis Objectives
1.3 Outline
2. DIGITAL IMAGE COMPRESSION
2.1 Technique Classifications
2.2 General Compression Techniques
2.3 Statistical Compression Techniques
2.4 Spatial-Domain Compression Techniques
2.5 Transform-Domain Compression Techniques
2.6 Interframe Compression Techniques
2.7 Compression Technique Standards
3. MOTION ESTIMATION THEORY
3.1 Image Sequence Processing
3.2 Optical Flow Theory
3.3 Motion Estimation Techniques
4. CONCLUSIONS
5. FURTHER STUDY

2.7-1 Run-Length Component Value Category Definitions
2.7-2 Sample Huffman Coding Table for AC Component Symbols
3.3-1 Blockmatching Techniques Computational Complexity
3.3-2 Typical Hierarchical Blockmatching Parameters

2.2-1 5-Bit Uniform Quantizer Example
2.3-1 Vector Quantization Example Histogram
2.3-2 Codebook Example #1
2.3-3 Codebook Example #2
2.3-4 Codebook Example #3
2.4-1 Typical DPCM Encoder
2.4-2 Pixel Relationships
2.7-1 JPEG Encoder
2.7-2 JPEG Decoder
2.7-3 DCT Coefficient Arrangement
2.7-4 "Zig-Zag" Run-Length Encoding
2.7-5 Run-Length Encoding Symbol Construction
2.7-6 H.261 Interframe Encoder
2.7-7 H.261 Interframe Decoder
2.7-8 MPEG Encoder Block Diagram
2.7-9 MPEG Decoder Block Diagram
2.7-10 MPEG Prediction Scheme
3.2-1 Aperture Problem Example
3.3-1 Motion Estimation Correlation Function
3.3-2 Motion Vector Example
3.3-3 Offset Sampling
3.3-4 log D Search Technique Example
3.3-5 Cross Search Technique Example
3.3-6 One-at-a-Time Search Technique Example
3.3-7 Conjugate Direction Search Technique Example
3.3-8 Hierarchical Blockmatching, Level 3 Example
3.3-9 Hierarchical Blockmatching, Level 2 Example
3.3-10 Hierarchical Blockmatching, Level 1 Example
3.3-11 Limb and Murphy Displacement Estimation
3.3-12 Method of Steepest Descent Gradient Estimation
3.3-13 Walker and Rao Gradient Calculations

The intent of image compression is to minimize the amount of data required to
represent an image [1,2]. There are many ways to perform this operation, each with
strengths, weaknesses, and practical limitations, and each suited to the particular
application in which it is to be used. In this chapter we investigate some of the
fundamental image compression applications and narrow the focus of this thesis to one
particular area of image compression interest: motion estimation.
1.1 Image Compression Applications
There are two general categories into which image compression applications can
be divided: 1) those for the transmission of images, and 2) those for the storage of
images. Fundamentally, however, the theory and associated algorithms used to
implement the compression schemes for both are the same.
The first of these application types, the transmission of images, encompasses a
wide variety of applications, including:
communications: where the upcoming communication standard, known as the
Integrated Services Digital Network (ISDN), is intended to allow not only
conventional audio traffic, but also videoteleconferencing [3-7];

broadcasting: where the resolution of standard television signals will not only be
increased in High-Definition Television (HDTV), but digital standards are also
being proposed at the same resolution as HDTV but within the bandwidth
limitations of the current standard television signals [3-5, 7-11];

remote sensing: which not only includes weather and land resource management
satellites, such as LANDSAT [1,3], but which also includes the video signals
which are sent back from the Space Shuttle [12];

military: where there is a desire to use remotely-piloted vehicles (RPVs) in place
of human pilots [1,13]; and,

business: where digital graphics and FAX machines have become commonplace [3,5].
The second of these application types, the storage of images, is an ever-
expanding area of interest which includes:
medical diagnosis: where digitized images are used routinely in such technologies
as computed radiography, computed tomography (CT) scans, and magnetic
resonance imaging (MRI) [3,13,14];

multimedia: which is one of the fastest growing applications, and includes such
technologies as Compact Disc-Interactive (CDI) and Digital Video-Interactive
(DVI), and is primarily targeted for advertising and training applications
[3,7,15]; and,

photojournalism: which is probably the newest application, and represents a
change from traditional film-based photography to digital data-based
photography, with such products as digital cameras and photo Compact
Discs (CDs) [3,15].
1.2 Thesis Objectives
The purpose of this thesis is to: 1) summarize the fundamental theories behind
current motion estimation techniques, 2) investigate the role motion estimation plays in
advanced digital image compression applications, and 3) describe a variety of general
and specific motion estimation implementations.
1.3 Outline
In Chapter 2, historical digital image compression techniques are described to
understand what limitations are imposed by these techniques. Three current image
compression standards are described, two of which are meant for motion image
compression applications. This chapter primarily establishes what role motion
estimation plays in more advanced compression applications.
In Chapter 3, the fundamental theories behind motion estimation techniques are
explored. This investigation begins with a description of general image sequence
processing and then goes into optical flow theory, upon which the majority of motion
estimation techniques are based. The chapter continues by describing the two primary
types of techniques used in motion estimation: correlation-based and gradient-based. It
concludes with a description of a variety of specific implementations of each type of
motion estimation technique.

In order to understand what role motion estimation plays in digital image
compression, we must first understand some of the fundamental image compression
techniques used. This chapter reviews some of the common and more advanced
algorithms currently used in digital image compression. Then, specific examples of
still-image and motion-image compression standards are described.
2.1 Technique Classifications
In essence, we can classify image compression techniques in at least three
different ways. The first of these ways is based on how well we can reconstruct the
original image from the compressed version; this describes the technique as being either
lossless or lossy. The second way is based on what type of data redundancy is being
exploited by the technique: spatial, transform or temporal redundancy. And, finally,
the third way is based on what data is used by the technique: intraframe or interframe
data [13, 16, 17].
A lossless technique, also known as a bit-preserving or reversible technique
[14], is one for which, upon decompressing the data, the original and decompressed
data are exactly the same. Of course, this assumes that the process through which the
compressed data is passed, be it a storage or transmission media, does not introduce
errors. A lossy data compression technique, also known as an irreversible technique
[14], on the other hand, does not reproduce the original data exactly upon
decompression, regardless of whether the data was sent through a noiseless channel or not.

Techniques which are based primarily on whether they are lossless or lossy are
generally statistical compression techniques, and these techniques need not be limited to
just image compression applications.
All image compression techniques make use of one of three forms of data
redundancy in order to achieve data reduction gains. These three forms, or domains of
redundancy are: 1) spatial, 2) transform, and 3) temporal.
Spatial redundancy, sometimes also referred to as the data-domain [17], makes
use of the fact that the intensity of pixels, particularly those surrounding the one being
described, are related in some way. This redundancy is somewhat intuitive since we
can observe that neighboring pixels normally have similar amplitudes. Of course, this
is not necessarily true when we observe a portion of an image near the edge of an object.
Transform redundancy in images is not something intuitively obvious,
generally. There are a number of ways to transform image data, however, into a
domain [17] which may exhibit a high degree of localized redundancy. For instance, a
solid portion of the image will have a large content of low-frequency energy, while a
highly-textured region will have a large content of high-frequency energy. In general,
then, the transform domain of one small region will be very similar to that of its
neighboring region.
And, finally, the temporal redundancy of a sequence of related images is quite
possibly the most intuitive redundancy. From one frame to the next, in most cases,
there is very little change in pixel values. If this were not the case, there would be a
jerkiness to the sequence which, to a human observer, would be extremely annoying.
It is this form of redundancy, in time, which gives us the greatest compression gain.

If data from a single, independent image is the only data used in the
compression process, then the technique is known as an intraframe technique.
However, if data from a sequence of related images is used, the technique is known as
an interframe technique.
At first, this classification may seem to be the same as saying that spatial-
domain and transform-domain techniques are the same as intraframe, and temporal-
domain techniques are the same as interframe techniques. Although both of these
statements are true, this classification cannot be limited to only these cases. In general,
all interframe techniques not only make use of temporal redundancy, but they may also
use data in the spatial- or transform-domain to determine and encode this redundancy
[5, 17].
2.2 General Compression Techniques
Although there seems to be quite a difference in how digital signals are
processed compared to their analog equivalent, in essence the only real difference is that
a digital signal is one which is discrete in both time and amplitude. In general, we can
choose the interval at which these two discrete components of a digital image are
sampled to effectively compress our data. In this section, we will investigate what
varying the sample interval does to our representation of the image, and its effect on the
quantity of data, as a simple form of data compression.
Digitization changes a continuous, analog signal into discrete, digital form,
creating digital values representing the intensity of a particular location on the image.
This location is commonly referred to as a pixel or pel (picture element) [14].
Representing any signal, whether digital or analog, by a set of discrete levels is known

as quantization. Quantized signal intensity values are also commonly known as PCM
(Pulse Code Modulation) levels [13].
Although the terms "digitization" and "quantization" are sometimes used
interchangeably, they are not equivalent processes; a digitized signal is always
quantized, but a quantized signal is not necessarily digitized, since the original signal
may have already been in digital form.
Whenever a continuous signal is digitized, some amount of information is
always lost, since a finite-sized number can not represent the infinite number of
intensity levels associated with a continuous signal. The more bits which are used to
represent the continuous pixel intensities, the greater is the system's ability to accurately
represent that intensity in discrete form [2].
For instance, if only a single bit is used, then the intensity represented by that
bit may only be one of two values, normally white or black. However, if 8 bits are
used, the intensity may now be represented by 256 values, which, for a monochrome
image, is normally gray levels ranging from white to black. We can see that the more
bits which are used the more accurate the representation can be. But, we can also see
that an infinite number of intensities can never be represented by a discrete value, since
an infinite number of bits would be required!
The dynamic range of a discrete signal is another way in which we may
describe the capability of a system to represent the continuous pixel intensity. By
definition it is the difference in amplitude between the smallest and largest intensity
which a digital word may discriminate, normally expressed in dB (decibels). Each bit
represents 6 dB of dynamic range, and, therefore, our 8-bit quantization represents a
dynamic range of 48 dB [19].
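The 6 dB-per-bit rule of thumb can be checked against the exact expression, 20·log10(2^N) dB for N bits. A short Python sketch (added here for illustration, not part of the original text):

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Exact dynamic range of an N-bit quantizer in decibels:
    20*log10 of the number of distinct levels (roughly 6 dB per bit)."""
    return 20 * math.log10(2 ** bits)

print(dynamic_range_db(8))  # just over 48 dB, matching the 8-bit figure above
print(dynamic_range_db(1))  # a single bit contributes about 6.02 dB
```

The exact per-bit figure is 20·log10(2) ≈ 6.02 dB, so the 48 dB quoted for 8-bit quantization is the rounded value.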
Normally, a uniform quantizer is used to digitize continuous signals, since these
quantizers are generally easy to implement. When a signal is uniformly quantized,
this means that the discrete amplitude levels generated by the quantizer are equally

spaced between the minimum and maximum values [1]. An example transfer function
for a uniform quantizer is shown in Figure 2.2-1.
However, nonuniform quantizers, or companders, may also be used. A
nonuniform quantizer may be implemented by performing a nonlinear transformation
on the sampled signal [1]. In general this is done in order to perform rudimentary data
compression. While not actually eliminating data, as one would normally consider data
compression, it maximizes the use of the limited number of quantization levels. The
form of this nonlinear transform is primarily based on the probability of a level being
contained in the image [16].
Figure 2.2-1 5-Bit Uniform Quantizer Example
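A uniform quantizer like the one in Figure 2.2-1 can be sketched in a few lines of Python (an illustrative sketch added here, not the thesis's own implementation): the input range is split into 2^n equal steps, and reconstruction returns the midpoint of the selected step.

```python
def uniform_quantize(x: float, n_bits: int, lo: float = 0.0, hi: float = 1.0) -> int:
    """Map a continuous amplitude in [lo, hi] to the index of one of
    2**n_bits equally spaced quantization intervals."""
    levels = 2 ** n_bits
    step = (hi - lo) / levels
    # Clamp to the valid range, then take the index of the containing interval.
    x = min(max(x, lo), hi - 1e-12)
    return int((x - lo) / step)

def dequantize(index: int, n_bits: int, lo: float = 0.0, hi: float = 1.0) -> float:
    """Reconstruct at the midpoint of the chosen interval, so the
    quantization error is at most half a step."""
    step = (hi - lo) / 2 ** n_bits
    return lo + (index + 0.5) * step
```

With 5 bits, for example, `uniform_quantize(0.5, 5)` selects level 16 of 32, and `dequantize` reconstructs it as 0.515625, within half a step (1/64) of the input.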

However, this transform is normally determined in such a manner as to reduce
the mean squared error of the quantized signal [20]. Such a transform is called a
Lloyd-Max Quantizer when the following error, E, is minimized:
E = ∫ [x - û(x)]² p_u(x) dx = E[(u - û)²]    (2.2-1)
where u is the continuous pixel intensity with probability density function p_u(x),
û is the discrete variable representing u, and E[·] is the expected value of the random
variable.
It can be shown [2] that the solution to this equation, which minimizes E, is a
function of only the total number of quantization levels and the probability of each
distinct value.
For a Lloyd-Max Quantizer it can also be shown [2] that: 1) the output is an
unbiased estimate of the input values, and 2) the quantization error is orthogonal to the
quantizer input. However, as we can see from Equation 2.2-1, this quantizer is only
optimal for a particular image, and is not optimal for any other image which does not
have the same histogram.
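The Lloyd-Max solution can also be found numerically with the iterative procedure commonly called the Lloyd algorithm: alternately assign each sample to its nearest reconstruction level, then move each level to the mean (centroid) of the samples assigned to it. The Python sketch below is an illustration added here, not an implementation from the thesis:

```python
def lloyd_max(samples, n_levels, iters=50):
    """Iterative Lloyd-Max design on a list of sample values: alternate
    (1) nearest-level assignment and (2) replacing each level with the
    mean of its assigned samples. Each pass cannot increase the MSE."""
    lo, hi = min(samples), max(samples)
    # Start with levels spread uniformly across the sample range.
    levels = [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        cells = [[] for _ in range(n_levels)]
        for x in samples:
            nearest = min(range(n_levels), key=lambda i: abs(x - levels[i]))
            cells[nearest].append(x)
        # Empty cells keep their previous level.
        levels = [sum(c) / len(c) if c else levels[i]
                  for i, c in enumerate(cells)]
    return levels
```

Because the levels depend on the sample histogram, a quantizer designed this way for one image is, as noted above, not optimal for an image with a different histogram.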
Based on these results, and the amount of tolerable distortion, a limited amount
of image compression can be done by carefully choosing the quantizer transform and
size, such that an optimum quantizer is used for the desired number of bits. For
instance, most commercial, monochrome frame grabbers today are designed for 8-bit
PCM quantization. However, for a uniform quantizer it is known [2] that only 6 bits
are required by the human visual system; and, furthermore, if the optimum quantizer is
used for each particular image, only 5 bits are required. Therefore, we can effectively
compress our image by 8:5, or a factor of 1.6, by designing an optimum quantizer
designed around visual quantization requirements. Although not mathematically

lossless, such a quantizer would be considered visually lossless, since a human
observer would not be able to detect the losses.
Just as the dynamic range of the image data may be chosen to trade off storage
for image quality, so may the rate at which the image is sampled. It is obvious that the
lower the rate at which the image is sampled, the smaller the amount of data that will be
generated will be. The lower bound on the sampling rate, however, is limited by the
quality of the image desired.
Nyquist's sampling theorem states that the lowest sampling rate allowed must
be at least twice the highest frequency of the signal that we wish to represent.
Translated for image sampling, this means that the sampling rate is directly related to
how sharply edges in the image may be represented.
Another way to look at the effect the sampling rate has on the image quality is to
consider what physical limitations it places on the image representation. For instance,
if an image is first sampled at frequency fs, then each pixel has a size of l/fs x l/fs. If
the sampling rate is cut in half, then each pixel now has a size of 2/fs x 2/fs. We can
see that each pixel sampled at half the rate would then represent twice the horizontal and
vertical size of the pixels sampled at fs; it also means that, where there were once 4
distinct pixel values, there is now only one. In effect, the values of the original 4 pixels
are averaged to form the new, single pixel value.
For areas of constant intensity, such as a blank wall, we can see that this
averaging would have absolutely no effect on the image quality. However, if the image
has lots of detail in it, where adjacent pixel intensities can vary greatly, we can
intuitively see that this will have a substantial effect on image quality.
Suppose that our original image was simply a checker board, where adjacent
pixels alternated between white and black, in both the horizontal and vertical direction.

If we sampled at half the rate, we would be averaging two white and two black pixel
values, to generate a composite gray pixel. Instead of having a checker board pattern,
we would then have an image which would just be constant gray throughout.
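The checkerboard argument can be reproduced directly. The sketch below (illustrative Python added here, with a small 4x4 board standing in for an image) averages each 2x2 block of pixels into one, which is the effect of halving the sampling rate:

```python
def downsample_2x(img):
    """Halve the sampling rate in each direction by averaging each
    2x2 block of pixels into a single pixel."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

# A checkerboard of white (255) and black (0) pixels ...
board = [[255 if (r + c) % 2 == 0 else 0 for c in range(4)] for r in range(4)]
# ... collapses to uniform mid-gray, exactly as argued in the text.
print(downsample_2x(board))  # [[127.5, 127.5], [127.5, 127.5]]
```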
Subsampling the image, therefore, creates an intensity blurring effect. If, then,
the desire is to compress data through a subsampling technique, we must determine
what level of blurring is acceptable, primarily based on general, a priori knowledge of
our images, and choose our minimum sampling rate accordingly.
Standard television stations in the United States broadcast 60 interlaced fields
per second, which produces a frame rate of 30 frames per second. This frame rate was
chosen for a number of reasons, primarily because it exceeds the rate at which flicker is
perceived by the human brain, and also due to the frequency of AC power in the United
States. However, what if 15 frames per second were determined to be sufficient? Just
cutting the frame rate in half would also cut our data rate in half, producing an effective
compression ratio of 2:1. This is the simplest way in which varying the frame rate may
contribute to compression gain [1].

2.3 Statistical Compression Techniques
Theoretically, there is a maximum compression ratio possible for any particular
sampled data set, beyond which lossless data compression is not possible. This
limitation is based on Shannon's fundamental work in the area of information theory,
and is known as the source data entropy [18,21-23]. According to this theory, the
minimum average number of bits, H, by which any particular sample in the data set can
be represented is:
H = -Σ_k p_k log2(p_k)    (2.3-1)
where p_k is the probability that the particular value, k, occurs in the data set,
and H is the entropy, or average number of bits used to represent any possible value in
the data set. For a typical monochrome image, the maximum lossless compression
ratio is rarely better than 3:1 or 4:1 [24].
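Equation 2.3-1 is straightforward to evaluate for a sample set. The Python sketch below (an illustration added here, using an empirical histogram in place of true probabilities) computes H:

```python
from collections import Counter
from math import log2

def entropy_bits(pixels):
    """Shannon entropy H = -sum(p_k * log2(p_k)) of a sample set:
    the minimum average bits per sample for lossless coding."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(entropy_bits([0, 255] * 100))       # two equally likely values: 1 bit/sample
print(entropy_bits([0, 1, 2, 3] * 25))    # four equally likely values: 2 bits/sample
print(entropy_bits([128] * 100))          # a constant image carries no information
```

A typical 8-bit monochrome image has an entropy well above 2 bits/pixel, which is why the lossless ratio quoted above rarely exceeds 3:1 or 4:1.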
It is this fundamental law of information theory that all data compression
techniques must abide by if they are to be considered lossless. However, depending on
what amount of distortion is acceptable, a lossy data compression technique can achieve
significant compression with very little perceivable, or at least objectionable, distortion.
Many of the image compression techniques used are lossy but may be considered to be
visually lossless [14].
The lossless data compression technique which is the most optimal from an
entropy perspective is Huffman encoding [25]. This technique, which relies on
statistical knowledge of the data to be compressed, creates variable-length codes from
the fixed-length, quantized values.

Observing Equation 2.3-1, it is apparent that if the variable code length l_k for
each value k is chosen to be
l_k = -log2(p_k)    (2.3-2)
then the average code length will exactly equal the calculated entropy.
However, code lengths must be integer values, and the length l_k, calculated in this
manner, is generally not an integer. Therefore, we must choose the code length to be
the smallest integer value greater than that given in Equation 2.3-2, or
l_k = ⌈-log2(p_k)⌉    (2.3-3)
which will always give us an average length greater than the entropy. This
approach to choosing the variable code lengths is known as Shannon-Fano coding
[21,26], but does not, in fact, produce an optimal code. This is due to the fact that the
code lengths must be integer values.
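For dyadic probabilities (exact powers of 1/2), the ceiling in Equation 2.3-3 changes nothing and the average length meets the entropy exactly; for other distributions it can cost up to one extra bit per symbol. A small sketch (illustrative Python with hypothetical probabilities, added here):

```python
from math import ceil, log2

def shannon_fano_lengths(probs):
    """Code lengths from Equation 2.3-3: l_k = ceil(-log2(p_k))."""
    return [ceil(-log2(p)) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]          # dyadic: the ceiling is exact
lengths = shannon_fano_lengths(probs)       # [1, 2, 3, 3]
average = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
print(lengths, average, entropy)            # average equals entropy here: 1.75 bits
```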
The encoding technique which is optimal, however, is the Huffman code [25]
This code is produced with the intent of assigning the smallest code lengths to the most
probable data values, and the largest code lengths to the least probable values. To
increase the compression efficiency, pairs, triples, etc., of data may be encoded using
the Huffman technique. The more data values which are included in a particular code,
the closer the average code length will be to the theoretical entropy limit.
There are some inherent problems with using variable length codes, however.
One requirement is that, since the decoding process must distinguish all codes
uniquely, one code must not be a prefix to any other code [25]. The way in which
Huffman codes are designed guarantees that this requirement is met, and that there is no
ambiguity between codes.
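The Huffman construction can be sketched with a priority queue. This Python illustration (added here, and coding single values rather than the pairs and triples mentioned above) repeatedly merges the two least probable entries, prefixing a 0 bit onto one side and a 1 bit onto the other, which is what guarantees the prefix-free property:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman codebook (value -> bit string) from the empirical
    frequencies of `data`; shorter codes go to likelier values."""
    freq = Counter(data)
    # Heap entries: (count, tiebreak, {value: code-so-far}).
    heap = [(c, i, {v: ""}) for i, (v, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol alphabet
        (_, _, codes), = heap
        return {v: "0" for v in codes}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in left.items()}
        merged.update({v: "1" + code for v, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
print(codes)  # the most frequent value 'a' receives a 1-bit code
```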
Another requirement, in order to maintain the calculated average data
compression for a variable length code, is that the required statistics, in particular the

probability that each data value occurs, be known prior to coding, and that the decoding
process have a codebook which maps the variable-length codes to their corresponding
data values. In general, the exact statistical nature of the data being compressed is not
known. However, if common statistical information is available about normal data
of the type being compressed, such as the probability of a particular character occurring
for text processing applications, the code can be built based on this more general
statistical data. As long as the data being compressed does, in fact, behave like the
normal data, the compression obtained should be close to optimal. Unless the exact
nature of the data is known, however, the actual compression ratio will be lower than
the optimal rate for which the code was designed.
Vector quantization is a lossy coding technique which maps all image intensity
values to a smaller subset of intensity values, or vectors, the collection of which are
known as a codebook. The codebook is primarily chosen to produce minimal
distortion to the images being compressed, based on some known fidelity criteria.
Preferably, relatively few vectors will be chosen for the codebook to represent a
relatively large number of intensity values, while maintaining a desired reconstruction
quality [27, 28].
As an example, suppose we have 16 possible intensity values, represented by a
4-bit number, and wish to create a codebook of only 4 vectors, represented by a 2-bit
number. This would give us an effective compression ratio of 2:1.
There are many ways to choose a codebook, few of which may minimize
distortion. Assume that a histogram for our example data is shown in Figure 2.3-1. A
simple way to create the codebook might be to ignore the lower two bits of our intensity
value, and only transmit the upper two bits. This codebook selection requires no
knowledge of what the statistical nature of our data is. Our codebook, mapping, and

relative distortion, would appear as illustrated in Figure 2.3-2. (For simplicity's sake,
we are using the absolute difference as our distortion measure for these examples.)
If we change the mapping slightly, to more closely reflect this data set, we see
that the distortion changes, as shown in Figure 2.3-3. Now, let us choose what
appears to be a better codebook, based solely on our histogram, as shown in Figure
2.3-4. We see that the distortion is significantly reduced, by a factor of 6.3:1 from
Example #1 to Example #3.
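The distortion totals in Figures 2.3-2 and 2.3-4 can be reproduced mechanically. In the Python sketch below (added for illustration; the quantities are those tabulated in the examples from the Figure 2.3-1 histogram), a codebook is represented simply by the reconstruction value assigned to each of the 16 intensities:

```python
# Histogram quantities for intensity values 0..15, as tabulated
# in the codebook examples (read from Figure 2.3-1).
quantity = [0, 1, 2, 4, 7, 30, 40, 30, 7, 4, 2, 1, 0, 0, 0, 0]

def total_distortion(reconstruction):
    """Sum of |value - reconstructed value| weighted by how often each
    intensity occurs (the absolute-difference fidelity measure)."""
    return sum(q * abs(v - reconstruction[v]) for v, q in enumerate(quantity))

# Example #1: drop the two low-order bits (reconstruct to 0, 4, 8, 12).
example1 = [(v >> 2) << 2 for v in range(16)]
# Example #3: a codebook chosen to match the histogram peak (4, 5, 6, 7).
example3 = [4, 4, 4, 4, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7]

print(total_distortion(example1))  # 228, matching Figure 2.3-2
print(total_distortion(example3))  # 36, matching Figure 2.3-4
```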
There are many techniques which may be used in the selection of a vector
codebook. It is beyond the scope of this thesis, however, to discuss any particular
techniques. A good discussion of different types of vector quantization techniques may
be found in [14].
Intensity Value
Figure 2.3-1 Vector Quantization Example Histogram

Intensity   Codebook   Reconst.   Absolute
  Value       Value      Value      Diff.    Quantity   Distortion
    0           0          0          0          0           0
    1           0          0          1          1           1
    2           0          0          2          2           4
    3           0          0          3          4          12
    4           1          4          0          7           0
    5           1          4          1         30          30
    6           1          4          2         40          80
    7           1          4          3         30          90
    8           2          8          0          7           0
    9           2          8          1          4           4
   10           2          8          2          2           4
   11           2          8          3          1           3
   12           3         12          0          0           0
   13           3         12          1          0           0
   14           3         12          2          0           0
   15           3         12          3          0           0

Total Distortion: 228

Figure 2.3-2 Codebook Example #1
Intensity   Codebook   Reconst.   Absolute
  Value       Value      Value      Diff.    Quantity   Distortion
    0           0          3          3          0           0
    1           0          3          2          1           2
    2           0          3          1          2           2
    3           0          3          0          4           0
    4           1          6          2          7          14
    5           1          6          1         30          30
    6           1          6          0         40           0
    7           1          6          1         30          30
    8           2          8          0          7           0
    9           2          8          1          4           4
   10           2          8          2          2           4
   11           2          8          3          1           3
   12           3         12          0          0           0
   13           3         12          1          0           0
   14           3         12          2          0           0
   15           3         12          3          0           0

Total Distortion: 89
Figure 2.3-3 Codebook Example #2

Intensity   Codebook   Reconst.   Absolute
  Value       Value      Value      Diff.    Quantity   Distortion
    0           0          4          4          0           0
    1           0          4          3          1           3
    2           0          4          2          2           4
    3           0          4          1          4           4
    4           0          4          0          7           0
    5           1          5          0         30           0
    6           2          6          0         40           0
    7           3          7          0         30           0
    8           3          7          1          7           7
    9           3          7          2          4           8
   10           3          7          3          2           6
   11           3          7          4          1           4
   12           3          7          5          0           0
   13           3          7          6          0           0
   14           3          7          7          0           0
   15           3          7          8          0           0

Total Distortion: 36
Figure 2.3-4 Codebook Example #3
2.4 Spatial-Domain Compression Techniques
Beyond the general and statistical data compression methods discussed thus
far, any further compression of the data should make use of one or more known
characteristics of the data being compressed. Spatial-domain compression techniques
make use of the fact that neighboring pixels tend to be highly correlated in intensity [1,
13, 29-31].

2.4.1 Run-Length Coding
When consecutive pixels have equal intensity values, data compression may be
achieved by representing these values in the form of a run-length code. A run-length
code consists of two parts: an intensity value and a run-length. The intensity value
simply indicates what the common value of the pixels is, and the run-length specifies
how many pixels, in a row, have this value. Obviously, the more pixels which have
this intensity value, the greater the compression ratio will be when using run-length
coding techniques.
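A run-length code of this form is simple to sketch. The following is an illustrative Python implementation (the function names are mine, not taken from any standard):

```python
def run_length_encode(pixels):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([p, 1])     # start a new run
    return [tuple(r) for r in runs]

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the pixel row."""
    return [value for value, length in runs for _ in range(length)]
```

For a bi-level row such as `[0]*6 + [1]*3 + [0]*7`, the encoder emits just three pairs, `(0, 6), (1, 3), (0, 7)`, and decoding them restores the original row exactly.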
This technique, however, is normally associated with the compression of bi-level,
or black-and-white, images, where each intensity value may be represented by a
single bit. One such application would be the transmission of faxes, where each
pixel is specified as either black or white. When combined with Huffman
coding techniques, optimized to represent typical run-lengths for the associated
application, run-length coding of bi-level images can yield significant compression
gains [32].
The majority of real-world, multi-level images, however, where intensity levels
are represented by more than one bit, do not normally have significant runs of pixels
with equal values. Even if, to the human eye, the intensity values appear to be
equal, most values will, in fact, deviate about an average value by some number of
intensity levels. In this case, then, not only will very little compression be
achieved, but quite often the compressed image will contain more data than
the original.
One multi-level image application which does use run-length coding, and
achieves significant compression gains, is the area of computer-generated graphics.
When a computer is used to generate images, such as briefing charts and graphical
diagrams, where there are only a relatively small number of colors, run-length coding
tends to perform well. Another set of applications where run-length coding performs
well, though not in the spatial domain, is in some of the current digital image
compression standards, such as those discussed in Section 2.7.
2.4.2 DPCM
Differential PCM (DPCM) data compression, in the spatial-domain, makes use
of the fact that the intensity of a pixel, although not exactly the same, is normally
closely related to the intensity of pixels around it. In DPCM, therefore, the difference
of the current pixel, from one or more neighboring pixels, is transmitted rather than its
actual intensity value [1]. Figure 2.4-1 shows a general block diagram for the
implementation of a DPCM encoder, where: u(k) is our current pixel, û(k) is our
prediction of u(k), e(k) is the error between our predicted and actual pixel
values, and ẽ(k) is our quantized error, which is transmitted.
There are two characteristics of the DPCM encoder which should be noted.
First, the predicted value for the current pixel, û(k), is some function of previous
pixels, or, more accurately, of previous prediction values and errors, û(k) and ẽ(k). This
means that the encoder must have some memory, or delay capability. And, second, the
transmitted value is the prediction error, following a quantization function. This
quantization function effectively determines the number of bits used to represent the
pixel value, but it also affects the prediction error. The choice of both of these
functions, prediction and quantization, greatly influences the effectiveness of the
DPCM encoder.
For the simplest case of a DPCM encoder, assume that our prediction for the
current pixel is equal to the value of its leftmost neighbor, A, as shown in Figure 2.4-2
[13]. In this case, our quantized error, ẽ(k), would be equal to our actual error, e(k),
and our prediction, û(k), would be the intensity of the previous pixel, u(k-1).
Therefore, our transmitted value in this case would be:

    e(k) = u(k) - u(k-1)                                          (2.4-1)
It is obvious, however, that, to some extent, the more pixels we use in our
predictor, the better our prediction should be; and, since we are sending the prediction
error, the more compact our transmitted values will be. The number of pixels used in
the prediction is considered to be the order of the predictor. Some of the more
commonly used predictors are [14]:

    û(k) = 0.97A
    û(k) = 0.50A + 0.50C
    û(k) = 0.90A - 0.81B + 0.90C
    û(k) = 0.75A - 0.50B + 0.75C
    û(k) = A - B + C

Figure 2.4-2 Pixel Relationships
The efficiency of any particular DPCM scheme is directly related to the selection
of both the predictor and quantizer functions. These two functions are, necessarily,
inter-related and must, therefore, be mutually chosen. In most cases a Lloyd-Max
quantizer is the optimal choice for a given predictor [33].
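The feedback loop of Figure 2.4-1 can be sketched in a few lines. This is a minimal first-order DPCM coder with a uniform quantizer whose step size (4) is an illustrative assumption; a practical coder would use one of the predictors above with a Lloyd-Max quantizer. Note that the encoder predicts from the decoder's reconstruction, not from the original pixel, so quantization errors do not accumulate:

```python
def dpcm_encode(pixels, step=4):
    """Closed-loop first-order DPCM: predict each pixel from the decoder's
    reconstruction of its left neighbor, then quantize the prediction error."""
    codes = []
    prediction = 0                               # assumed initial predictor state
    for u in pixels:
        e = u - prediction                       # prediction error e(k)
        code = round(e / step)                   # quantized error (transmitted)
        codes.append(code)
        prediction = prediction + code * step    # track the decoder's state
    return codes

def dpcm_decode(codes, step=4):
    """Rebuild pixels by accumulating the dequantized errors."""
    pixels = []
    prediction = 0
    for code in codes:
        value = prediction + code * step
        pixels.append(value)
        prediction = value
    return pixels
```

With step size 1 the coder is lossless; with step size 4 each reconstructed pixel stays within half a step (2 levels) of the original, since the closed loop prevents drift.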
2.4.3 ADPCM
While the majority of DPCM predictors and quantizers are globally applied
across an image, Adaptive DPCM (ADPCM) attempts to choose a more optimal
predictor and/or quantizer based on localized image statistics. For ADPCM, an
effective scheme must be employed to determine which DPCM coder is optimal
for a particular segment of image data. Along with prediction
errors, then, information must be transmitted which identifies the DPCM coding
scheme being used for each particular portion of the image data. While, in most cases,
the prediction errors are significantly reduced in magnitude, in this manner, the
overhead data associated with coder identification must also be accounted for when
determining the effective compression ratios obtained.
2.4.4 Bit-Plane Coding
Bit-plane coding, like DPCM, assumes that neighboring pixels are closely
related in intensity. However, unlike DPCM, which takes advantage of the fact that the
intensity difference between two neighboring pixels is small, bit-plane coding assumes
that the significant bits of neighboring pixels are highly correlated. For instance, for 8-bit
intensity values, bit-plane coding takes advantage of the fact that the most significant
bit, bit 7, will normally be the same for adjoining pixels. Likewise, there will be some
correspondence in the other bit planes which will allow the data to be compressed.
In general, the more significant the bit-plane, the higher the correspondence
between pixel values there will be. This is particularly true when there are areas of
highly uniform intensity. However, this is not always the case. For example, take the
condition, for 8-bit values, where the uniform intensity is either 127 (most significant
bit is 0, all others are 1) or 128 (most significant bit is 1, all others are 0). From a
DPCM perspective this deviation is not large at all, but in bit-plane coding it means that
every bit plane experiences a change when going from one of these values to the other.
Such is also a problem when coding near the edge of an object, where the background
and object may have a high degree of contrast.
To compensate for the former of these effects, Gray codes may be used in place
of the typical, linear binary quantization values. Gray codes are designed such that, for
all intensities which differ by only one quantization level, the codes differ in only one
bit plane, as well; likewise, for differences of two quantization levels, the codes differ
in at most two bit planes, etc. Using Gray codes, therefore, makes bit-plane coding
more of a linear function of the image intensities.
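The 127/128 example above is easy to verify. A small sketch using the standard binary-reflected Gray code (n XOR n>>1); the helper names are illustrative:

```python
def to_gray(n):
    """Convert a binary intensity value to its binary-reflected Gray code."""
    return n ^ (n >> 1)

def changed_bit_planes(a, b):
    """Bit planes (0 = least significant) that differ between two 8-bit codes."""
    return [bit for bit in range(8) if (a ^ b) >> bit & 1]
```

In plain binary, moving from 127 to 128 flips all eight bit planes; in Gray code (127 maps to 64, 128 maps to 192) only bit plane 7 changes, and any two adjacent intensities differ in exactly one plane.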
If we encode each of the bit planes independently, it is intuitive to use run-length
coding to represent each plane. This is particularly true for the more significant
bits, which will tend to have rather long runs of ones and zeros. However, depending
on our application and intensity characteristics, we could also encode only the more
significant bits using run-length coding, while encoding some number of the less
significant bits using DPCM or some other encoding technique. The

particular implementation used to encode bit-planes is not as important as the fact that
we make use of some statistical nature of the bit planes to accomplish data reduction.
One useful feature of bit-plane coding is that the image can be progressively
transmitted or stored, essentially varying the frame rate. This is particularly important
for applications which require human observation, but where communication
bandwidth, or storage availability, is limited. If the most significant bit planes are
transmitted first, a coarse image, in amplitude, will appear on the user's monitor. As
the transmission progresses, and more bit planes are sent, more amplitude detail will be
added to the image, thereby removing intensity coarseness. Although not technically
significant, this technique has been found to be somewhat more acceptable to users, in
some applications, than equivalent scan transmissions, where full resolution is sent
for each pixel, in order, and the image is built along scan lines. This method of
transmission is sometimes also referred to as hierarchical coding.
2.4.5 Subband Coding
Although the final spatial-domain technique we investigate may not seem like
one, in concept, it is. Like bit-plane coding, subband coding techniques decompose an
image into smaller, more granular subimages. But, unlike bit-plane coding,
which decomposes the image as a function of intensity, subband coding creates
subimages by decomposing the image as a function of frequency.
At first, subband coding may seem to be a transform-domain technique, due to
the fact that frequency is the primary component exploited. However, the
implementation of this technique, in fact, makes it a spatial-domain technique.
Subband coding decomposes the image into subimages by applying finite-impulse-response
(FIR) filters at appropriate frequencies, not by transforming the image into the
frequency (or transform) domain.

Subband coding generates a subimage for each frequency band into which we
filter the original image. If, for instance, we use two-dimensional quadrature mirror
filters, which allow alias-free reconstruction, we get four subimages, each
containing a quarter of the information from the original. Each subimage will show
strong components for areas of the original image which fall within the response
curve of the particular filter used. For example, if a high-pass filter is used, edges and
highly textured areas will have strong components, while, if a low-pass filter is used,
smooth areas will have strong components.
Each of these subimages will normally exhibit better statistical characteristics
than the aggregate image. One of the previously discussed procedures, particularly
DPCM coding, may then be used to encode each of the subimages. And, just like bit-
plane coding, subband coding allows for effective progressive transmission of the
image to take place. For instance, the subimages may be transmitted from the lowest-
frequency components to highest. Then, as each subimage is received, the image will
improve from being a blurry image, with no high-frequency components, to a sharp
image, containing all of the original image frequency components [34].
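The filtering itself can be illustrated with the simplest possible filter pair. The sketch below stands in the 2-tap Haar average/difference filters for the QMF filters of the text (an assumption made for brevity; the Haar pair also allows exact reconstruction) and produces the four subimages of one decomposition level:

```python
def haar_split(row):
    """One low/high split of an even-length row with the 2-tap Haar pair."""
    low = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    high = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return low, high

def subband_decompose(image):
    """Split an image (even dimensions) into four quarter-size subimages."""
    # Horizontal filtering: split every row into low and high halves.
    lows, highs = zip(*(haar_split(row) for row in image))

    def split_columns(rows):
        # Vertical filtering: split each column, then transpose back to rows.
        l_cols, h_cols = zip(*(haar_split(list(col)) for col in zip(*rows)))
        to_rows = lambda cols: [list(r) for r in zip(*cols)]
        return to_rows(l_cols), to_rows(h_cols)

    ll, lh = split_columns(lows)     # smooth subimage + vertical detail
    hl, hh = split_columns(highs)    # horizontal detail + diagonal detail
    return ll, lh, hl, hh
```

For a perfectly smooth (constant) block, all of the information lands in the low-low subimage and the three detail subimages are zero, matching the intuition that smooth areas produce strong low-pass components.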
2.5 Transform-Domain Compression Techniques
When transform-domain techniques are used, the spatial image data is
transformed into some other domain. The primary reason a transformation is done
is to decorrelate the spatial image data into orthogonal components. The
transformation thus produces components whose coefficients maximize the
amount of information, or energy, each represents. In image compression
applications, then, this means that a major component will contain more information,
per bit, than the equivalent data did in the spatial domain. For instance, if the spatial
image is decomposed into two equal-sized components, and the major component

represents 95% of the spatial-image information, then an effective compression ratio of
almost 2:1 is directly obtained.
In addition, based on the particular transform-domain in to which the image is
decomposed, if certain components of the domain are known to contain little or no
significant data, then these components can be greatly, or even completely, suppressed
with little degradation to the original image. For example, most natural images, in the
frequency-domain, contain little or no energy in the high-frequency components.
Knowing this, therefore, these components can be quantized to very few bits, relative
to the lower frequency components, or completely ignored, with little degradation to the
reconstructed image quality.
Two other characteristics of transforms, besides decorrelation, are important
when it comes to using them for image compression applications. The first of these is
that the transform should ideally decorrelate the spatial-image independent of the actual
image data. In other words, the optimum transform should not depend, in any way, on
the particular statistics of the spatial image. And the other characteristic, which makes a
transform truly practical, is that there must be some efficient, or fast implementation for
it [35].
The following sections briefly describe a few of the more common transform
domains into which images are converted for compression applications.
2.5.1 KLT
The Karhunen-Loeve Transform (KLT) is the theoretically optimal transform,
from an energy-packing perspective [2, 36]; this means that a minimal set of
coefficients may be found which represents more information, per number of
coefficients, than any other possible transformation. The KLT is sometimes also
referred to as a Hotelling, principal component, or eigenvector transform.

The KLT is essentially found by calculating the covariance matrix of an image.
It can be shown [17] that the basis vector terms are then given by:

    λ_k = (1 - ρ²) / (1 + ρ² - 2ρ cos ω_k)

where N is the order of the transform and ρ is the inter-element correlation.

However, even though the KLT completely fulfills the decorrelation criterion
desired, it fails both the independence and implementation criteria. Because the
optimal KLT is a function of the image covariance matrix, it is totally dependent
upon the image statistics; in other words, a different, optimal KLT exists for each
image to be compressed. And, secondly, no fast implementation for this transform has
been found [2].

Due to these last two failures, then, the KLT is not used for general image
compression. However, a number of the transforms to be discussed in the following
sections, while not optimal, closely approximate the decorrelation characteristics of
the KLT while also better fulfilling the independence and implementation criteria.

2.5.2 DFT
The discrete Fourier transform (DFT) is well known throughout digital signal
processing [2, 37], primarily for signal analysis and filtering applications. However,
in the process of transforming an image into spectral components, like the KLT, it is
also a useful transform for image compression applications.
The two-dimensional DFT pair is defined as [1]:

    F(u,v) = (1/MN) Σ_{j=0}^{M-1} Σ_{k=0}^{N-1} f(j,k) exp[-i2π(uj/M + vk/N)]        (2.5-2)

    f(j,k) = Σ_{u=0}^{M-1} Σ_{v=0}^{N-1} F(u,v) exp[i2π(uj/M + vk/N)]                (2.5-3)

where f(j,k) is the pixel (spatial) intensity and F(u,v) is the spectral intensity for
an MxN block of image data.
While the DFT does not decorrelate the energy as well as the optimal KLT, it is
still 91% efficient [17], which is quite good for most applications. In addition,
the DFT components are nearly as independent as the KLT's, and, due to the fact that
the transformation kernels are separable [17], there is a fast, efficient implementation
for it based on the one-dimensional FFT.
However, as we can see from Equations 2.5-2 and 2.5-3, although our image is
real-valued in the spatial domain, it is complex in the transform domain. This means
that there will be twice as much data per original image pixel after transformation. In
essence, then, the basic DFT increases our data by a factor of two, rather than
compressing it. Therefore, although the DFT is extremely efficient, from a transform
perspective, it does not inherently contribute to our data compression application.

2.5.3 DCT
In order to overcome the problem associated with the DFT being a complex
transform, real-valued implementations were investigated which exhibit similar
efficiency characteristics. The most common of these is the discrete cosine transform
(DCT), which transforms real-valued spatial data to real-valued spectral data [38]. The
DCT has been shown to be 91% efficient [17] compared to the KLT.
The two-dimensional DCT pair is defined as [14]:

    F(u,v) = [4 C(u) C(v) / MN] Σ_{j=0}^{M-1} Σ_{k=0}^{N-1} f(j,k) cos[(2j+1)uπ / 2M] cos[(2k+1)vπ / 2N]        (2.5-4)

    f(j,k) = Σ_{u=0}^{M-1} Σ_{v=0}^{N-1} C(u) C(v) F(u,v) cos[(2j+1)uπ / 2M] cos[(2k+1)vπ / 2N]                 (2.5-5)

where: C(w) = 1/√2 for w = 0, and C(w) = 1 for w ≠ 0.
Since the transformation kernel for the DCT is separable [17], it also has a fast
and efficient implementation. For these reasons, the DCT is currently the most widely
used transform in digital image processing.
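Equations 2.5-4 and 2.5-5 translate directly into code. Below is a minimal, unoptimized Python sketch of the pair (a practical implementation would exploit the separable kernel and an FFT-style fast algorithm):

```python
import math

def C(w):
    """DCT normalization term: 1/sqrt(2) for w = 0, otherwise 1."""
    return 1 / math.sqrt(2) if w == 0 else 1.0

def dct2(f):
    """Forward 2-D DCT of an M x N block, per Equation 2.5-4."""
    M, N = len(f), len(f[0])
    return [[(4 * C(u) * C(v) / (M * N)) * sum(
                f[j][k]
                * math.cos((2 * j + 1) * u * math.pi / (2 * M))
                * math.cos((2 * k + 1) * v * math.pi / (2 * N))
                for j in range(M) for k in range(N))
             for v in range(N)] for u in range(M)]

def idct2(F):
    """Inverse 2-D DCT, per Equation 2.5-5."""
    M, N = len(F), len(F[0])
    return [[sum(C(u) * C(v) * F[u][v]
                 * math.cos((2 * j + 1) * u * math.pi / (2 * M))
                 * math.cos((2 * k + 1) * v * math.pi / (2 * N))
                 for u in range(M) for v in range(N))
             for k in range(N)] for j in range(M)]
```

On a smooth 4x4 ramp block, the round trip reconstructs the input to machine precision, and most of the energy lands in the low-frequency coefficients near F(0,0), illustrating the energy-packing property discussed above.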

Although beyond the scope of investigation for this thesis, two relatively new
transforms have begun to play a prominent role in image compression research and
deserve mention: the wavelet [39-42] and fractal [43, 44] transforms.
2.6 Interframe Compression Techniques
To achieve a greater compression factor than any of the previously described
techniques allow requires that the temporal redundancy of a sequence of images be
exploited. Techniques which exploit temporal redundancy are known, generically, as
interframe compression techniques. The compression techniques described in the
previous sections made use of one- or two-dimensional data only; interframe techniques
extend these other techniques to a third dimension: time.
2.6.1 DPCM
The most straightforward extension of prior techniques is to DPCM coding.
The simplest example of this would be if the predictor for a pixel in the current frame is
its intensity in the previous frame. The change in intensity between the current and
previous frame, then, would be encoded, just as the difference between the current and
previous pixel was encoded for intraframe DPCM encoding in Equation 2.4-1.
This technique, of course, could be further extended to use any combination of
spatial and temporal pixels in the predictor. For instance, more than one frame could
be used, as could any number of neighboring pixels in any of those frames or the
current one.

2.6.2 Conditional Replenishment
Conditional replenishment compression techniques produce an effective data
compression by essentially changing the frame rate [45]. Like progressive coding, as
described in Section 2.2.3, not all of the image is transmitted at every frame time.
Instead, only certain pixel values are updated during each frame time.
The particular pixels which are updated, unlike in progressive coding, are not
uniformly chosen, but are chosen based on some condition, hence the name of the
technique. The primary condition used to determine updates is some threshold on the
DPCM value; if a pixel's intensity has changed by more than some given
amount, an updated intensity value is encoded and transmitted.
Although this represents a significant savings in data to be sent, provided the
threshold is chosen properly, it presents a challenge in designating the location of the
pixel in an effective manner. This designation could be done through either addresses
or, more typically, run-length encoding all pixels which do not exceed the threshold.
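As a sketch, conditional replenishment over a row of pixels might look like the following; the threshold value and the (address, value) update format are illustrative assumptions, with run-length addressing left out for brevity:

```python
def replenish(previous, current, threshold=4):
    """Select only pixels whose change from the previous frame exceeds
    the threshold; return them as (address, new_value) update pairs."""
    return [(i, cur)
            for i, (prev, cur) in enumerate(zip(previous, current))
            if abs(cur - prev) > threshold]

def apply_updates(previous, updates):
    """Decoder side: patch the retained previous frame with the updates."""
    frame = list(previous)
    for i, value in updates:
        frame[i] = value
    return frame
```

Only pixels that moved past the threshold are transmitted; sub-threshold changes are simply carried over from the previous frame, which is exactly where the technique's losses come from.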
2.6.3 HYBRID
Much of the current direction for image compression research stems from the
early hybrid encoding techniques described by Habibi [12, 46, 47] and others [38, 48,
49]. Although primarily aimed at intraframe compression, many of the hybrid
techniques are easily extended to be interframe techniques and represent significant data
compression. In general, the hybrid techniques make use of a 2-D transformation
followed by a DPCM section which predicts the current transform components as a
function of either intra- or interframe components.

2.6.4 Motion-Compensated Coding
Like DPCM coding, which attempts to predict the intensity of a pixel from
intraframe and/or interframe neighboring pixels, motion-compensated image coding
attempts to predict the intensity of pixels based on their expected "motion". This
motion can be predicted through either object-based or pixel-based motion estimation.
An object-based, or knowledge-based, motion prediction technique attempts to segment
an image into distinct objects. Then, information from the whole, segmented object is
used to determine the magnitude and direction of the object's motion. If a description of
the object's shape and an associated motion vector are transmitted, this may represent
significant compression for that particular portion of the image, since individual pixel
information need not be sent. This technique, however, implicitly assumes that: 1)
objects may be satisfactorily segmented, 2) the objects are rigid in nature, and 3) the
intensity, or illumination, of the object is constant. All three of these assumptions
tend not to hold in practice, so this technique has not had a significant amount of
research or applications associated with it.
Most current motion estimation techniques make use of pixel-based techniques.
These techniques attempt to determine the motion in an image based on a small set of
pixels, without attempting to associate them with each other. It is these pixel-based
techniques which the remainder of this thesis will investigate.
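As a preview of the pixel-based methods examined later, the following is a minimal full-search block-matching sketch: every candidate displacement within a search window is scored by mean absolute difference (MAD), and the best-scoring displacement becomes the block's motion vector. The block size, window size, and function names are illustrative assumptions:

```python
def mad(prev, cur, bx, by, dx, dy, n):
    """Mean absolute difference between the n x n block at (bx, by) in the
    current frame and its (dx, dy)-displaced position in the previous frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total / (n * n)

def full_search(prev, cur, bx, by, n=4, w=2):
    """Exhaustively test every displacement within +/- w pixels and return
    the motion vector (dx, dy) with the lowest matching error."""
    best = None
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            if not (0 <= by + dy and by + dy + n <= len(prev)
                    and 0 <= bx + dx and bx + dx + n <= len(prev[0])):
                continue            # displaced block would fall outside the frame
            err = mad(prev, cur, bx, by, dx, dy, n)
            if best is None or err < best[0]:
                best = (err, (dx, dy))
    return best[1]
```

If a patch moves two pixels right and one pixel down between frames, the search over the block at its new position returns the vector (-2, -1), pointing back at the patch's location in the previous frame.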

2.7 Compression Technique Standards
There are currently three international standard image compression techniques in
use, or proposed for use, with continuous-tone, digital images. The first of these, the
JPEG (Joint Photographic Experts Group) standard, is used for general-purpose
compression of still-images; the other two, H.261 (sometimes known as px64) and
MPEG (Moving Picture Experts Group), are used in the compression of motion image
sequences, for video teleconferencing and motion image storage, respectively. These
three standards are further described in the following sections.
2.7.1 JPEG
The JPEG image compression standard [3,14,50,51] is designed for the
transmission of still images with a varying degree of user-selectable reconstruction
quality. In general, JPEG is a statistically lossy compression standard, although, at
relatively large compression ratios, on the order of 10:1, it is nearly visually lossless.
There are four primary components to the JPEG compression standard, each of
which has been discussed, in part, in the previous sections on standard image
compression techniques. The basic JPEG encoder is composed of a DCT transform, a
custom quantizer, a run-length encoder and a custom entropy encoder, as shown in
Figure 2.7-1. The basic JPEG decoder performs the inverse of each of these sections,
as shown in Figure 2.7-2, but in reverse order from the encoder.

Figure 2.7-1 JPEG Encoder

Figure 2.7-2 JPEG Decoder
The DCT, and conversely the inverse DCT (IDCT), are essentially the same as
those given in Equations 2.5-4 and 2.5-5. The JPEG implementation, however, has
slightly different scaling factors in order to make them power preserving:
    F(u,v) = [C(u) C(v) / 4] Σ_{j=0}^{7} Σ_{k=0}^{7} f(j,k) cos[(2j+1)uπ / 16] cos[(2k+1)vπ / 16]        (2.7-1)

    f(j,k) = Σ_{u=0}^{7} Σ_{v=0}^{7} C(u) C(v) F(u,v) cos[(2j+1)uπ / 16] cos[(2k+1)vπ / 16]              (2.7-2)

where M = N = 8, corresponding to the 8x8 pixel transform blocks on
which the JPEG standard is based. The DCT produces an F(u,v) which has coefficient
values arranged as shown in Figure 2.7-3. The "DC" component, as shown,

represents the average value of the pixel intensities within the 8x8 block, while the
"AC" components represent increasing frequency components as shown.
Once transformed, each of the frequency components, F(u,v), is quantized
based on a customizable normalization array, Q(u,v). The resulting quantized
components, F̂(u,v), are defined to be:

    F̂(u,v) = Nearest Integer [ F(u,v) / Q(u,v) ]                                     (2.7-3)

Figure 2.7-3 DCT Coefficient Arrangement

where a typical quantization array used in the JPEG studies has been:

    Q(u,v) =  | 16  11  10  16  24  40  51  61 |
              | 12  12  14  19  26  58  60  55 |
              | 14  13  16  24  40  57  69  56 |
              | 14  17  22  29  51  87  80  62 |                                     (2.7-4)
              | 18  22  37  56  68 109 103  77 |
              | 24  35  55  64  81 104 113  92 |
              | 49  64  78  87 103 121 120 101 |
              | 72  92  95  98 112 100 103  99 |
Notice in Equation 2.7-3 that the larger Q(u,v) is, the fewer levels
F(u,v) will be quantized into. Therefore, the areas of Q(u,v) which contain smaller
values, particularly toward the upper-left corner, will be quantized into relatively fine
levels compared to those areas which contain larger values, such as toward the lower-right
corner, which will be quantized into extremely coarse levels.
Looking once again at Figure 2.7-3, notice also that the lower-frequency
components, including DC, of F(u,v), in the upper-left corner, are allocated more
quantization levels than are the higher-frequency components in the lower-right corner.
As described in the discussion of transform coding in Section 2.5, for natural images
the majority of significant visual information is contained in the lower-frequency
components, while the high-frequency components contain relatively less. Therefore,
the quantization array Q(u,v) given in Equation 2.7-4 allocates more levels to the
most significant components and fewer levels to the relatively insignificant ones.
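Equation 2.7-3, applied with the array of Equation 2.7-4, can be sketched directly; `quantize` and `dequantize` are my names, not JPEG terminology:

```python
# Typical luminance quantization array Q(u,v) from Equation 2.7-4.
Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize(F):
    """F_hat(u,v) = nearest integer of F(u,v) / Q(u,v), per Equation 2.7-3."""
    return [[int(round(F[u][v] / Q[u][v])) for v in range(8)] for u in range(8)]

def dequantize(Fq):
    """Decoder side: scale each quantized component back up by Q(u,v)."""
    return [[Fq[u][v] * Q[u][v] for v in range(8)] for u in range(8)]
```

Because the lower-right entries of Q are large, modest high-frequency coefficients quantize to zero, which is what sets up the long zero runs exploited by the run-length stage that follows.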
Quantizing the DCT transform components with an array such as Equation 2.7-4
also produces a large number of zeros for the high-frequency components,
which naturally leads to the first encoding function: run-length encoding. If the
components are run-length encoded in the order shown in Figure 2.7-4, there will be
significantly long runs of zeros among the later-encoded components, since these are
the higher-frequency components. This zig-zag pattern is, in fact, how the JPEG data
is run-length encoded.

Figure 2.7-4 Zig-Zag Run-Length Encoding
The form of this run-length encoding is somewhat different from the scheme
used for bi-level images, in a number of ways. First, the DC component, unlike the
rest, is a DPCM value. This DC DPCM value is based on the DC value of the previous
DCT block.
Secondly, only AC components which have a value of zero are actually run-length
encoded. This is due to the fact that these components are not bi-level, and,
therefore, it is relatively unlikely that the non-zero values will produce any significant
runs. Figure 2.7-5 illustrates how symbols are created to describe the run-length for
components which equal zero, along with the absolute values of an AC component.

Figure 2.7-5 Run-Length Encoding Symbol Construction
Referring to Figure 2.7-5, the four bits of which n is comprised represent the
number of zero components which precede this particular non-zero component. Run-lengths
of 0 to 15 may be represented by n; actually, a run-length of 16 may also be
represented by a single symbol, and run-lengths greater than 16 may be represented by
multiple symbols. The value k, which also consists of four bits, defines the category,
or range, in which the component value exists, and is followed by the component value
and sign, represented by k bits. The actual component value ranges are defined in
Table 2.7-1. Note that a run of 16 zeros may be represented in a single symbol by
setting n equal to 15, k equal to zero, and sending no component value bits.
Table 2.7-1 Run-Length Component Value Category Definitions

k     AC Coefficient Range (+/-)
0     0
1     1
2     2 - 3
3     4 - 7
4     8 - 15
5     16 - 31
6     32 - 63
7     64 - 127
8     128 - 255
9     256 - 511
10    512 - 1023

In addition, rather than coding a large number of zeros near the end of the
block, a special End-Of-Block (EOB) code is sent if there are no other non-zero
components in the block; since the block size is fixed, the actual number of zeros
may therefore be calculated as 64 minus the number of actual component values sent.
This EOB symbol is defined as the case when n and k both equal zero.
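The symbol construction of Figure 2.7-5, including the 16-zero-run and EOB special cases, can be sketched as follows. The category computation follows Table 2.7-1 (it is simply the bit length of the magnitude); the (n, k, value) tuple format is an illustrative stand-in for the packed bits:

```python
def category(value):
    """Table 2.7-1 category k: the number of bits needed for |value|."""
    return abs(value).bit_length()

def rle_symbols(ac_coefficients):
    """Build (zero-run n, category k, value) symbols for one block's 63
    zig-zag-ordered AC coefficients, in the style of Figure 2.7-5."""
    symbols, run = [], 0
    for c in ac_coefficients:
        if c == 0:
            run += 1
            continue
        while run > 15:                 # a full run of 16 zeros: n = 15, k = 0
            symbols.append((15, 0, None))
            run -= 16
        symbols.append((run, category(c), c))
        run = 0
    if run:                             # trailing zeros collapse into EOB (0, 0)
        symbols.append((0, 0, None))
    return symbols
```

A block whose AC coefficients are `[5, 0, 0, -3]` followed by 59 zeros encodes as just three symbols: `(0, 3, 5)`, `(2, 2, -3)`, and the EOB.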
The final step in the JPEG standard is entropy encoding of the run-length
symbols for the AC components of each block. This is done through a user-defined
Huffman coding scheme, two of which may be defined in the header information for
the baseline JPEG system. An example Huffman code table, taken from [14], is
shown in Table 2.7-2.
There are two things, in particular, which should be mentioned concerning
implementation of the JPEG standard. First of all, there is no specification for error
correction of the compressed image data. This is of considerable concern since
variable-length coding is being used in the final stage of the compression standard.
And secondly, multi-plane images (i.e. color, RGB, YIQ, etc.) are compressed one
plane at a time, with each plane receiving the same compression techniques as described
above, individually.
2.7.2 H.261 (Px64)
The H.261 CCITT standard [51-53] is used for the transmission of motion
images in video-telephone applications. It is also commonly referred to as the
"Px64" standard because it compresses the picture-phone images into bandwidths
which are integer multiples of a single 64 Kb channel. The 64 Kb channel was chosen
as the base bandwidth primarily due to ISDN (Integrated Services Digital Network)
applications; however, when used on a single 64 Kb channel the quality of the images
is generally quite low.

Table 2.7-2 Sample Huffman Coding Table for AC Component Symbols

Zero Run  Category  Code Length  Codeword
0         1         2            00
0         2         2            01
0         3         3            100
0         4         4            1011
0         5         5            11010
0         6         6            111000
0         7         7            1111000
1         1         4            1100
1         2         6            111001
1         3         7            1111001
1         4         9            111110110
2         1         5            11011
2         2         8            11111000
3         1         6            111010
3         2         9            111110111
4         1         6            111011
5         1         7            1111010
6         1         7            1111011
7         1         8            11111001
8         1         8            11111010
9         1         9            111111000
10        1         9            111111001
11        1         9            111111010
EOB                 4            1010

A block diagram of the interframe implementation of the H.261 standard is
shown in Figure 2.7-6. It is important to notice how similar a major portion of this
standard is to the JPEG standard described in Section 2.7.1, as shown in the
shaded box at the top of the figure. Unlike the equivalent functions in JPEG, the
quantization and entropy tables under the H.261 standard are not defined by the user.
Except for this difference, however, this section of the H.261 system is essentially the
same as that of JPEG.
There are two fundamental differences between the JPEG and H.261 standards.
The first is the addition of an error correction function prior to transmission of the
compressed image. This allows the system to compensate for bit errors encountered
between the encoder and decoder. Such errors are quite common in the practical
application of the standard, and the BCH error correction specified allows the system to
correct up to 2 bit errors in any 512-bit block of transmitted data. The JPEG
standard, however, does not specify such a function, since the requirements for an error
correction scheme are application dependent, but the JPEG system is very general in
nature.

Figure 2.7-6 H.261 Interframe Encoder
The primary difference between the JPEG and H.261 standards, though, is the
addition of the motion estimation and compensation functions. As can be seen in
Figure 2.7-6, data is tapped from the "JPEG-like" section of the encoder, dequantized
and returned to the spatial domain. This process allows the motion prediction section
of the system to base the estimation and compensation of the motion on the lossy
frame, which is all that is available at the decoder. In this way the motion vectors
calculated at the encoder are valid for the image sequence which is reconstructed at the
decoder.
The difference between this lossy frame and the predicted frame is then stored
in a frame memory, which makes it available for prediction of the next frame. This
buffer is what allows this to be an interframe compression technique.
As briefly discussed in Section 2.6.4, a motion-compensated compression
technique essentially considers each pixel, or spatial block of pixels, to be either "static"
or "in motion". In order to compensate for pixels which are in motion, an attempt must
be made to estimate this motion. This function, for the H.261 standard, is performed
in the box identified as "motion estimation." A natural by-product of this process is a
set of motion vectors which indicate the most likely displacement of the pixels,
referenced to the previously transmitted frame. For static pixels, this motion vector
would simply be "zero" in both displacement directions.
These motion vectors, as shown, are then passed on to a motion compensation
function which modifies the previous frame based on the calculated motion of the
pixels. Once filtered, this predicted frame is subtracted from the current frame to
produce a prediction error. It is this prediction error which is actually transmitted to the
decoder. In order for the decoder to correctly reproduce the image sequence, however,

Figure 2.7-7 H.261 Interframe Decoder
the predicted motion vectors must also be sent along with the prediction error. These
motion vectors are, therefore, entropy encoded and sent to the decoder for each frame
along with the frame prediction error.
Reconstruction of the image sequence is essentially the reverse of the encoding
process, as it was for the JPEG standard, with the exception that the motion estimation
function need not be performed. A block diagram of the H.261 decoder is shown in
Figure 2.7-7, illustrating this reverse process. Notice that the motion compensation
functions performed in the decoder are exactly the same as those used in the encoder.
2.7.3 MPEG
The MPEG motion-image compression standard [54-56] is a general-purpose
technique intended for use in many differing applications, unlike the H.261 standard
which was designed primarily for video telephone applications. It is very similar in
design to the H.261 [51], but has one significant additional requirement to allow
bidirectional and random-access retrieval. One of the prime applications MPEG is

intended to support is future digital videocassette recording, and bidirectional retrieval
would allow a consumer to preview the recording while it is being played in reverse, or
starting at any arbitrary location. The intended data rate for the MPEG standard is 1.5
Mbit/sec for the video portion, with an additional 64, 128 or 192 kbit/sec for any
audio, and an extension to this standard, tentatively called MPEG II or
MPEG++, will allow HDTV-quality video to be transmitted at 20 Mbit/sec or greater.
Although the MPEG block diagrams shown in Figures 2.7-8 and 2.7-9 look
very similar to those of the H.261 standard, as shown in Figures 2.7-6 and 2.7-7, the
standard, when finalized, will not dictate any particular implementation; the only
requirements for MPEG-compliant encoders and decoders will be that they provide the
specified functionality while adhering to a specified data structure and syntax for the
transmitted or stored bit stream. This syntax will include specifications for six layers,
supporting such functions as DCT, motion compensation, resynchronization, random
access, bit rate and buffer size.
Figure 2.7-8 MPEG Encoder Block Diagram

Figure 2.7-9 MPEG Decoder Block Diagram
The addition of bidirectional and random-access retrieval capability to the
MPEG standard, compared to the unidirectional nature of H.261, represents a
significant emphasis on motion estimation techniques. The MPEG standard defines
three types of frames which support these retrieval capabilities, as illustrated in Figure 2.7-10.
Each of the three frame types, as shown, serves a particular prediction function.
The Intra-picture (I) frame is coded without reference to any other frame, and,
therefore, serves as a fixed reference frame for the other frame types. The I frame
allows for the random-access retrieval ability of the MPEG standard, but only
compresses the particular frame being represented to a moderate extent, on the order
of the JPEG compression standard.

Figure 2.7-10 MPEG Prediction Scheme
The Predicted (P) frame makes use of motion estimation to predict the current
frame based on past I or P frames. The compression obtained by the P frames is
roughly equal to that of the H.261 compression standard. The Bidirectional (B)
frame makes use of both the past and future I and P reference frames to predict the
current frame. The B frame achieves significant compression gain due to the fact that it
can represent areas which will be uncovered in the future. In effect, the B frames
represent an interpolation between both past and future reference frames.
The frequency and ratio of these three frame types are up to the user. The choice
of these parameters, however, directly affects the performance as well as effective
compression ratio. For example, the further apart the I frames are placed, the greater
the overall compression ratio will be, but the longer the delay in reconstructing the
image sequence when randomly accessing the frames.
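The standard leaves this mix of frame types to the encoder. As a purely illustrative sketch (the parameters n and m and the function name are our own, not part of the MPEG specification), a display-order frame-type pattern for one group of pictures could be generated as:

```python
def gop_pattern(n, m):
    """Display-order frame-type pattern for one hypothetical group of pictures.

    n: number of frames from one I frame to the next (group length).
    m: spacing between anchor (I or P) frames; m - 1 B frames lie between anchors.
    """
    types = []
    for i in range(n):
        if i == 0:
            types.append('I')   # intra-coded, random-access reference frame
        elif i % m == 0:
            types.append('P')   # predicted from the previous anchor frame
        else:
            types.append('B')   # interpolated between surrounding anchors
    return ''.join(types)

# gop_pattern(12, 3) -> 'IBBPBBPBBPBB'
```

Widening n raises the compression ratio but, as noted above, lengthens the delay when reconstructing the sequence from a random access point.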

In this chapter we investigate the basic theory upon which the majority of
current motion estimation techniques are based. Many of the techniques used for
motion estimation are based on the fundamentals of image sequence processing and are
either directly or indirectly related to the general concepts upon which optical flow
theory is based.
There are two major categories into which most current motion estimation
techniques may be classified: correlation-based and gradient-based techniques. This
chapter will also describe the general characteristics and differences between these two
techniques as well as several of the more common implementations for each.
3.1 Image Sequence Processing
Image sequence processing is a relatively new, specialized area of image
processing research. In essence, it is a three-dimensional signal processing technique,
where the three dimensions consist of not only the two spatial dimensions of the frame
(i.e., x and y), but also the temporal dimension. Primarily, image sequence
processing techniques seek to make use of the natural temporal redundancy found in a
sequence of images. In essence, all interframe compression techniques described in the
previous chapter also fall under the heading of image sequence processing.
Image sequence processing, however, is much more than just interframe
compression. Motion estimation, in particular, is used not only in the data
compression, but also such general applications as image enhancement, restoration,
spatiotemporal interpolation and decimation, target tracking, segmentation and image
understanding [4,57].
Image enhancement and restoration applications use image sequence processing
techniques to filter noise [58,59] and restore images degraded by blur [60]. It is

intuitive that, if a sequence of images is degraded by random, uncorrelated noise, then
successive frames of the images may be used to effectively cancel out the effects of the
noise. This is most useful when the noise is created at the sensor, be it a CCD array,
film or a storage media [4]. Restoration of a blurred image is most often needed when
a frame of motion video is to be shown in a still-frame environment. The blurring
effect may be caused by such things as object motion, sensor motion, an out-of-focus
lens, or atmospheric turbulence [4].
Different frame rates are used for different motion video applications. For
instance, the standard frame rate for U.S. television is 30 frames per second, while, in
Europe, it is 25 frames per second, and for motion films it is 24 or 72 frames per
second. To convert from one frame rate to another requires either an interpolation or
decimation from one original frame to the next. If only the spatial components of the
original frames are used in these processes, then only the pixel intensities are
interpolated or decimated. This will create a blurring effect to the predicted frame. To
avoid this effect, the motion of the pixels must also be factored in to the prediction,
creating a much more realistic frame.
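The blurring effect of purely spatial interpolation can be reproduced numerically. In this illustrative sketch (the one-dimensional "frames" and their values are our own), a bright object moves two pixels between frames; blending the frames per pixel, without motion compensation, smears the object across both positions:

```python
def blend_frames(prev_frame, next_frame, alpha):
    """Per-pixel linear interpolation between two frames (no motion used)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(prev_frame, next_frame)]

# One-dimensional "frames": a bright object two pixels wide moves right.
prev_frame = [0, 0, 9, 9, 0, 0, 0]
next_frame = [0, 0, 0, 0, 9, 9, 0]
middle = blend_frames(prev_frame, next_frame, 0.5)
# middle -> [0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 0.0]
# The object is smeared over four pixels at half intensity, the blurring
# effect described above; a motion-compensated interpolator would instead
# place the full-intensity object at an intermediate position.
```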
Another application for motion estimation is in the determination of real-world,
or three-dimensional, object motion. This is particularly useful in target tracking
applications, where moving objects may much more easily be identified and tracked
when the temporal nature of the image sequence is exploited. Similarly, an image may
also be classified into areas of static and moving objects. This form of segmentation of
the image facilitates a greater degree of image understanding than spatial data alone.

3.2 Optical Flow Theory
The fundamental theory upon which most current motion estimation techniques
are based is the concept of optical flow [61-65]. This theory attempts to relate the physical,
three-dimensional world to what the human visual system or the sensor perceives in
only two dimensions. In particular, it places mathematical constraints on how image
intensities, and object motion in particular, are interpreted from one temporal frame to
the next.
In its most basic form, optical flow attempts to relate the pixels of one frame to
those in the next. In most instances, pixels nominally maintain their previous intensity,
either in the same position, for static objects, or in a translated position, for moving
objects.
As in the principle of the conservation of energy, which states that energy may
be transformed from one kind to another, but it cannot be created or destroyed [66],
the theory of optical flow attempts to define a similar principle. The principle of optical
flow, however, tries to generalize that pixels, or objects, may change slightly in
intensity from one frame to another, but can neither dynamically appear nor disappear.
This, of course, is not strictly true, but, for those areas which do obey this principle, it
presents an extremely reliable framework from which to estimate object motion. This
area usually includes the majority of the frame. Those areas which do not obey this
principle, however, represent significant problems to most current motion estimation
techniques.
Motion estimation techniques which are based on optical flow principles,
therefore, attempt to match the pixels of one frame to that in the next. How this
matching takes place, or, in other words, how a technique specifically finds and
determines that a pixel in one frame corresponds to that in another, signifies the primary
difference in the existing motion estimation algorithms. How well each algorithm

handles areas which do not obey the optical flow constraints, however, also
significantly influences each algorithm's overall effectiveness.
A more theoretical, mathematical description of optical flow theory will be
described later, in Section 3.3.2, when we investigate gradient-based motion estimation
techniques.
The primary assumption made by most motion estimation techniques, based on
optical flow principles, is that of the smoothness constraint. This constraint requires
that the optic flow field in the area surrounding the pixel being evaluated varies
smoothly. In fact, this is a reasonable assumption to make since the majority of pixels
will be physically related to those surrounding them. This applies to not only static
objects, but also objects which are in motion, since, for rigid objects, the velocity
associated with surrounding pixels will be the same as the one being evaluated.
There are two types of areas which routinely go against the optical flow
smoothness constraint. One of these areas may be referred to as occlusions, while the
other is generally known as the aperture problem [67,68]. To estimation algorithms,
however, the effect is normally the same, regardless of which form of the constraint is
not being satisfied. The main difference between the two is where in the frame the area
is located.
An occlusion is normally found away from the edge of the frame, and may
occur due to one of a couple of reasons. One such reason is that an object has moved
and, therefore, has revealed a portion of the background which had previously been
obscured, and now obscures a portion of the background which had previously been
visible. Another possible source of occlusion is if a three-dimensional object is

rotating, thereby possibly exposing a portion of itself which was previously obscured,
and/or obscuring a portion of itself that was previously visible. In both of these cases it
will appear to a two-dimensional sensor that a pixel intensity value has either been
dynamically created or destroyed, even though this is not true in the three-dimensional
world.
The aperture problem is closely related to that of occlusion, especially in the
effect it has on motion estimation algorithms. It primarily occurs at the image edges,
however. When an object is entering the field of view from one of the edges, it appears
to the two-dimensional sensor that an object is dynamically being created; likewise, if
an object moves off the edge, it appears that an object is dynamically being destroyed.
The aperture problem, however, accounts for another estimation problem which
occlusion does not. Not only does the aperture problem introduce the complexity
associated with created pixels, but it also obscures the capability to estimate partially
visible portions of an object. This is particularly true when a small neighborhood is
used in the motion estimation process.
The following example, as presented in [62], illustrates the ambiguity
associated with the aperture problem. Take the case where the area we are imaging, as
shown by the box in Figure 3.2-1a, only shows a portion of an object, in this case a
rod, at time t. If the rod moves to the right with a velocity of Δx/Δt, it will appear as
shown in Figure 3.2-1b at time t+Δt. If, however, the rod moves to the right and
bottom with a velocity of 1.41 Δx/Δt, it will appear as shown in Figure 3.2-1c at the
same time, t+Δt. Notice that the rod, as viewed through our aperture, appears exactly
the same, even though there was only a horizontal component of velocity in the first
case, but both a horizontal and vertical component in the second case. Note also that,
both the direction of the motion as well as the velocity between the two cases were
indistinguishable. This example significantly illustrates the effects of the aperture
problem when the object whose motion is being estimated is near the edge of our
aperture.

Figure 3.2-1 Aperture Problem Example
3.3 Motion Estimation Techniques
The primary problem each motion estimation technique tries to solve is
determining which parts or pixels of the current frame are associated with those in the
previous temporal frame [5]. In essence, all techniques try to find the highest
correlation between frames with the least amount of computation or time necessary or,
conversely, the least amount of distortion for a given amount of computation. This problem may best be illustrated by
Figure 3.3-1 [62] which illustrates the correlation function between blocks in two
frames. Any motion estimation technique should find this peak, which
corresponds to the motion vector associated with the block.
Another way to illustrate motion vectors obtained from our motion estimation is
shown in Figure 3.3-2. On the left the image from the previous frame is shown, where
the moving object is the triangle, and the circle remains stable. On the right is the
current image, from which we would like to estimate motion. The shaded triangle
indicates the location of the triangle in the current frame, and the arrow indicates what
would be calculated as our motion vector for the triangle.
Figure 3.3-1 Motion Estimation Correlation Function

Figure 3.3-2 Motion Vector Example
Correlation-based techniques, also commonly referred to as blockmatching,
feature-based or region-matching techniques, attempt to directly associate features or
regions from one frame to those in another. This is primarily done by comparing a block of
pixels in one frame to a spatially-shifted block of pixels in another frame. A motion
vector for the block is then estimated by looking for the "best match" between fixed and
shifted blocks within a localized neighborhood.
Gradient-based techniques, also commonly referred to as spatiotemporal or
optic flow techniques, attempt to estimate motion by calculating the change, or gradient,
of image intensities in the spatial and temporal domains. These calculated changes then
are used to determine the motion associated with the particular pixel or block of pixels.
3.3.1 Correlation-Based Techniques
The correlation-based motion estimation techniques investigated as part of this
thesis, and the vast majority of the currently published techniques [5,69,70-72,94],
attempt to determine object motion by searching for the block of pixels in one frame
which most closely match or are highly correlated with those in a previous frame. In

general, this match must take place within a fairly limited neighborhood of pixels, the
maximum distance of which is known or assumed a priori.
There are two areas which determine how well correlation-based techniques
perform. The first determines how many iterations it takes to find a reliable match, and
is referred to as the search method. This is the primary area used to differentiate one
particular implementation from another. The second area determines how "good" the
match is, and is referred to as the matching criteria. Although this is a major factor in
the performance, it is rarely used to distinguish particular implementations.
The only way to guarantee that the best match is found would be to
systematically slide the reference block across all locations in the frame until the best
match was found. Although it would guarantee the best match, it would be far from the
optimal search technique in terms of computational efficiency.
There is one basic assumption which may be made which improves this
efficiency without significantly decreasing potential accuracy. This assumption is that
the motion we are estimating may not exceed a fixed number of pixels per frame. This
maximum displacement is highly dependent, however, on the physical characteristics of
the scenes being imaged. However, if properly specified, it may represent a significant
savings in computations.
For instance, assume that we have an image consisting of 640x480 pixels and
are dividing our image into 8x8 blocks. If a full, exhaustive search is performed it
would require 632x472 = 298,304 block-comparison operations; however, if we were
able to determine that our maximum displacement would be 32 pixel locations, then we
could cut this down to 56x56 = 3,136 operations, which represents a savings of
nearly 100:1!
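The displacement-limited exhaustive search can be sketched as follows. This is an illustration only: the frame representation, the sum-of-absolute-differences comparison measure, and the boundary handling are our own choices (the matching criteria actually used are discussed later in this chapter):

```python
def exhaustive_match(ref, cur, bx, by, n, dmax):
    """Exhaustively search a +/- dmax neighbourhood for the displacement
    that minimizes the sum of absolute differences (SAD) between the n x n
    block at column bx, row by of `cur` and a displaced block in `ref`.
    Frames are lists of rows; the returned (dx, dy) points from the current
    block back to its best match in the reference frame."""
    h, w = len(ref), len(ref[0])
    best, best_sad = (0, 0), float('inf')
    for dy in range(-dmax, dmax + 1):
        for dx in range(-dmax, dmax + 1):
            x, y = bx + dx, by + dy
            if not (0 <= x <= w - n and 0 <= y <= h - n):
                continue  # displaced block would fall outside the frame
            sad = sum(abs(cur[by + r][bx + c] - ref[y + r][x + c])
                      for r in range(n) for c in range(n))
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best
```

For a full-frame search the two displacement loops would instead cover every legal block position, which is exactly the cost the paragraph above tallies.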

Intuitively we might assume that the best match will always be found when all pixels
in our blocks correspond; however, there are four particular situations which make this
assumption an invalid one. Two of these situations, violations of the assumptions that
pixels are not "created" or "destroyed" and that pixel intensity remains constant, are directly related to optic
flow constraints. Unfortunately, in real images both are routinely violated, particularly
due to the problem of occlusions and the fact that variable object illumination and reflectance
properties are rarely overcome. Both of these cause problems, however, to not only
exhaustive search techniques, but also correlation-based and gradient-based motion
estimation techniques.
The other two conditions do not relate directly to optic flow theory, but cause
particular problems to correlation-based techniques, nonetheless. The first deals with
the inherent "noise" associated with the image signal, regardless of whether this noise
is generated at the sensor or internally. Noise will cause the intensity of a pixel to
randomly vary, regardless of whether there is an object in motion or not. In this case,
an exact match cannot be made since the pixel has, in fact, changed in value, even if it
represents exactly the same point on the object.
The second problem deals with the fact that an object is not guaranteed to move
an integral number of pixels from one frame to the next. Under this condition, it means
that the intensities of the pixels from one frame to the next will not be exactly translated
spatially. This is illustrated through a one-dimensional example in Figure 3.3-3. The
reference frame values are shown in Figure 3.3-3a while motion equivalent to one-half
pixel (to the right) is shown in Figure 3.3-3b. Notice how the pixel intensity values are
significantly different for the shifted "image" even though all other conditions were
met. None of these problems are limited only to exhaustive search techniques; they are
problems for any of the correlation-based techniques.
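The sub-pixel effect of Figure 3.3-3 can be reproduced numerically. Here a continuous triangular intensity profile (our own illustrative choice, not the figure's exact values) is sampled at integer pixel positions before and after a half-pixel shift:

```python
def sample(intensity, n):
    """Sample a continuous 1-D intensity profile at integer pixel positions."""
    return [intensity(i) for i in range(n)]

def profile(x):
    """Triangular intensity profile of a small object centred at pixel 3."""
    return max(0.0, 2.0 - abs(x - 3))

reference = sample(profile, 7)                       # [0, 0, 1, 2, 1, 0, 0]
half_shift = sample(lambda x: profile(x - 0.5), 7)   # [0, 0, 0.5, 1.5, 1.5, 0.5, 0]
# No integer translation of `reference` reproduces `half_shift` exactly,
# so an exact block match is impossible for sub-pixel motion.
```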

Figure 3.3-3 Offset Sampling

log D SEARCH
One of the fundamental blockmatching techniques, presented by Jain and Jain in
[5,73,74], is commonly referred to as the "log D" technique. A fundamental
assumption this technique makes is that the distortion function, which reflects how well
the reference block is correlated with blocks in the previous frame, is monotonically
increasing along any of the three axes tested (0, 45 and 90 degrees).
In the log D search, the correlation at nine particular points is calculated, one
corresponding to the origin and the other eight along each axis. The point which
corresponds to the highest correlation is then chosen as the new origin, and the search
is continued in a like manner with successively smaller search areas. Once the search

area is narrowed down to a 3x3 area, this 3x3 area is exhaustively searched and the
point with the highest correlation is chosen as the motion vector for the block.
An example of the log D search process is illustrated in Figure 3.3-4, with the
estimated motion vector being shown by the arrow. This example assumes an a priori
maximum displacement of dmax = 7 pixels. The points marked with a "1" indicate the
first nine points searched, and correspond to a search area of one-half of the maximum
displacement, or 4 pixels. In this example the highest correlation was found at the
point (x+4,y+4). Next, the search area is cut in half again, with a maximum
displacement of 2 pixels, and the correlation of the new origin and points marked with a
"2" are compared to each other. This highest correlation this time was found at the
point (x+6,y+2), which once again becomes our new origin. This time, however, the
search is concluded since the search area is now 3x3. The correlation for each of the
remaining 9 points are calculated, and the highest correlation becomes our overall
estimate for the block displacement, in this case representing a block displacement, or
motion vector, of (7,1).
This technique is known as "log D" due to the fact that the number of search
steps, or levels, n, it takes to estimate the motion is:
n = log2 (dmax + 1) (3.3-1)
where dmax is the maximum displacement to be estimated, and equal to 7 for
this example. The total number of distortion, or correlation, calculations which would
have to be performed, then, would be:
c = 1 + 8 log2 (dmax + 1) (3.3-2)
or 25 for our example.
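The procedure can be sketched against an abstract distortion function rather than actual image blocks; the step-halving schedule mirrors Equation 3.3-1, while the tie-breaking and boundary handling are our own choices:

```python
def log_d_search(cost, dmax):
    """2-D logarithmic ('log D') search.  `cost` maps a displacement
    (dx, dy) to a distortion value, lower being better.  The search starts
    with a step of half the maximum displacement and halves it each level,
    testing the origin plus its 8 neighbours at the current step size."""
    x = y = 0
    step = (dmax + 1) // 2        # e.g. 4 for dmax = 7
    while step >= 1:
        candidates = [(x + sx * step, y + sy * step)
                      for sx in (-1, 0, 1) for sy in (-1, 0, 1)]
        candidates = [(cx, cy) for cx, cy in candidates
                      if abs(cx) <= dmax and abs(cy) <= dmax]
        x, y = min(candidates, key=cost)   # best match becomes new origin
        step //= 2
    return x, y
```

With dmax = 7 the step sequence is 4, 2, 1: the three levels predicted by Equation 3.3-1, and a monotonic distortion surface with its minimum at (7,1) leads the search to the same motion vector as the worked example above.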

Figure 3.3-4 log D Search Technique Example

CROSS SEARCH
The Cross Search technique, as presented by Ghanbari in [75], is based on the
log D method but attempts to cut down on the number of distortion calculations
required by limiting the number of checks at each level of the search. Rather than
checking the 9 points which the log D technique does, the Cross Search only checks 5,
one at the origin and the other four on the 45 degree axis.
The example shown in Figure 3.3-4, which uses the log D approach, is used
once again in Figure 3.3-5, but this time using the Cross Search technique. Notice
that, although the Cross Search results are the same as the log D results, in this case,
the answers would have been different if, in fact, the actual location had been at
(x+7,y+2) rather than (x+7,y+l).

Figure 3.3-5 Cross Search Technique Example
The computational efficiency of the Cross Search technique, however, is better
than that of log D Search. Although the number of search levels for the Cross Search
are the same as given in Equation 3.3-1, the number of distortion computations are
reduced to:
c = 1 + 4 log2 (dmax + 1)
or 13 for our example, which represents a savings of 12 computations.

CONJUGATE DIRECTION SEARCH
The Conjugate Direction Search, as presented by Srinivasan and Rao in [5,72,
74], is not based on any of the techniques presented so far and is claimed to be much
more efficient. Its computational performance, however, as we will see, is not fixed or
predictable like the other approaches, and is dependent upon the actual block correlation
statistics. An upper bound, though, may be determined for the worst-case block match.
In the Conjugate Direction Search, a particular axis is chosen on which to begin
the search. In the example shown in Figure 3.3-6 our initial direction will be along the
x axis. The search begins at the origin, and the distortion at each point on either side of
the origin, in the x direction in this example, is calculated. The origin is then shifted to
the minimum of the three points; if the minimum is the origin, then the search in this
direction is concluded, otherwise the search continues as described until the minimum
is the origin or the maximum displacement is reached.
The previous steps, as indicated by the points marked by a "1" in Figure 3.3-6,
determine our minimum along the x axis; the search continues in the same manner, but
this time along the y axis, as indicated by the points marked with a "2". If the search is
concluded when the minimum distortion in the y direction is found, the search
technique is known as a "One-at-a-Time" Search, which is an extremely simplistic
technique compared to the remainder of the Conjugate Direction Search technique.
To complete the Conjugate Direction Search, the search continues in a direction
conjugate to the previously found minimum, as shown in Figure 3.3-7. Let's assume
that the search in the x direction does not line us up directly with the expected
result. The search then continues, as the points marked with a "3" indicate, along the
line which is formed by the vector from the origin through the point marked with a
circled "2". Like the searches in the x and y directions, the search continues until a point
is found which is bracketed by two larger values, in this case indicated by the circled
"3".

Figure 3.3-6 One-at-a-Time Search Example
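The One-at-a-Time stage illustrated above can be sketched against an abstract distortion function; the stopping rule and boundary handling are our own choices, and the full Conjugate Direction Search would add the conjugate-line stage as a third pass:

```python
def one_at_a_time(cost, dmax):
    """'One-at-a-Time' search: walk along the x axis until the distortion
    stops decreasing, then repeat along the y axis.  `cost` maps a
    displacement (dx, dy) to a distortion value, lower being better."""
    def walk(x, y, axis):
        while True:
            here = cost((x, y))
            steps = [(x + axis[0], y + axis[1]), (x - axis[0], y - axis[1])]
            steps = [(sx, sy) for sx, sy in steps
                     if abs(sx) <= dmax and abs(sy) <= dmax]
            best = min(steps, key=cost, default=(x, y))
            if cost(best) >= here:
                return x, y        # bracketed by two larger values: stop
            x, y = best
    x, y = walk(0, 0, (1, 0))      # minimize along the x axis first
    return walk(x, y, (0, 1))      # then along the y axis
```

For a separable, monotonic distortion surface this two-pass walk already reaches the minimum; the conjugate third pass matters when the surface's axes are not aligned with x and y.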
A comparison of the maximum number of searches required for each of the
previously discussed block-matching techniques is shown in Table 3.3-1. Note that the
maximum complexity for the Conjugate Direction search will be equal to that of the
One-at-a-Time search. Although it appears, based on Table 3.3-1, that the conjugate
direction search is much less efficient than the log D and Cross Search techniques, it
must be remembered that the maximum and average number of searches for these
techniques are the same, as shown in the Table, and will always be the same. It was
found experimentally, however, that the Conjugate Direction Search technique, for
which the maximum and average number of searches are not necessarily the same,

Figure 3.3-7 Conjugate Direction Search Example
actually took 2 to 3 times less searches than the log D technique, with nearly
comparable performance [72].
Table 3.3-1 Maximum Blockmatching Computational Complexities

Maximum        Exhaustive   log D    Cross    Conjugate
Displacement   Search       Search   Search   Direction Search
3              16           17       9        10
7              64           25       13       18
15             256          33       17       34
31             1,024        41       21       66
63             4,096        49       25       130
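The columns of Table 3.3-1 can be regenerated from the formulas given earlier. The exhaustive column follows the table's (dmax+1)^2 convention, and the conjugate-direction bound 2*dmax + 4 is inferred from the tabled values rather than stated in the text:

```python
import math

def search_costs(dmax):
    """Maximum number of block comparisons for each technique, per
    Table 3.3-1.  `levels` is Equation 3.3-1; the conjugate-direction
    worst-case bound 2*dmax + 4 is inferred from the tabled values."""
    levels = int(math.log2(dmax + 1))
    return {
        'exhaustive': (dmax + 1) ** 2,      # the table's convention
        'log_d': 1 + 8 * levels,            # Equation 3.3-2
        'cross': 1 + 4 * levels,
        'conjugate': 2 * dmax + 4,
    }

# search_costs(7) -> {'exhaustive': 64, 'log_d': 25, 'cross': 13, 'conjugate': 18}
```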

There is one predominant condition which causes problems for the previously
described blockmatching techniques. It manifests itself by driving the estimation
toward a falsely matched minimum. In other words, the image sequence, and therefore
the criteria from which the motion is estimated, may not be a smooth and monotonically
decreasing function. This is normally caused by image noise, but may also be due to
naturally occurring phenomena in the imaged sequence. These discrepancies may then
cause the blockmatching technique to produce a better match for "motion" which is not
associated with any real, physical movement of objects. If this false minimum occurs in
the early stages of the blockmatching evaluations, all future evaluations will continue to
narrow in on a motion vector which is not real.
One approach to solve this problem is to run a low-pass filter across the images
before one of the previously described blockmatching techniques is used on them. This
effectively cuts down on high-frequency components which are most apt to produce
false minima. However, it also causes the technique to evaluate only the coarse
features of the sequence, which only allows us to estimate the motion vector to a
coarse spatial resolution.
Although these two extremes seem contradictory, from an evaluation
perspective each presents a particular strength which can be used to
produce more accurate motion estimates. The hierarchical blockmatching technique,
as presented by Bierling [76], uses both extremes in just this way.
The hierarchical technique starts at a macro level, evaluating across large areas,
and uses fairly narrow low-pass filters in the early stages of evaluation. These early
stages allow estimates to be made, each based upon previous estimates which are more
coarse than the current one, which are relatively isolated from the problems associated
with noise and local minima. As the evaluations continue, more fine resolution data
and evaluation criteria are used to refine the estimate down to the pixel level. The

theory is that, by the time the fine resolution stages are implemented, the estimate
should be relatively close to the actual motion vector, and false, local minima should
not exist.
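The coarse-to-fine structure can be sketched abstractly as follows. The per-level filtering and subsampling described below are collapsed into a single generic cost function here, so this illustrates only the update loop, not a full implementation:

```python
def hierarchical_search(cost, max_updates):
    """Coarse-to-fine refinement: at each level, exhaustively refine the
    current estimate within that level's maximum update displacement.
    A real implementation would also low-pass filter and subsample the
    frames at each level; `cost` stands in for the per-level criterion."""
    x = y = 0
    for d in max_updates:              # e.g. [7, 3, 1] for Levels 3..1
        x, y = min(((x + dx, y + dy)
                    for dx in range(-d, d + 1)
                    for dy in range(-d, d + 1)),
                   key=cost)           # refine around the previous estimate
    return x, y
```

Each level starts from the previous level's estimate, so the fine levels only need to search a small neighbourhood, which is the source of the technique's robustness to local minima.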
There are a number of parameters used to define the data and criteria used at
each level of evaluation, and any of the previously discussed blockmatching techniques
may be used at each level, but the log D approach is the one originally proposed, and
primarily used in actual implementations. A table showing "typical" values for each
parameter, for each level, is given in Table 3.3-2.
The first step at each stage is to filter the image using a low-pass filter of the
size shown in the "Filter Size" row. This filter is normally implemented as a median
filter, which, although not truly a low-pass filter, is extremely easy to implement and
is computationally simple. Next, the window in which the estimation is to be made is
chosen as shown in the "Measurement Window Size" row, which will allow a
maximum displacement of "Maximum Displacement" to be made. This displacement is
normally determined using the "log D" technique. If the whole measurement window
were used in the evaluation, it would be computationally expensive, and, therefore, for
large measurement windows, this evaluation is instead done through subsampling, as
given in the "Subsampling in Criterion Computation" row.
An example of the last three stages of hierarchical blockmatching, using the
parameters of Table 3.3-2, is shown in Figures 3.3-8 through 3.3-10. The evaluations
performed, starting at Level 3 for this example, are shown in Figure 3.3-8.

Table 3.3-2 Typical Hierarchical Block Matching Parameters

Parameters at Level #:                 5     4     3     2     1
Filter Size                            10    10    5     5     3
Maximum (update) Displacement          31    15    7     3     1
Measurement Window Size                128   64    64    28    12
Subsampling in Criterion Computation   8     8     4     4     2
These evaluations are performed according to the log D technique, starting at the
position indicated by a "0", at the origin, and ending, for this example, with a motion
estimate as indicated by the "X" at location (+6,+2). Notice the maximum displacement
being estimated at this level, indicated by the dark shaded-line box, and the large size of
the measurement window, two of which are shown by narrow solid-line boxes. Based
on the specified subsampling, only those points at which an x-grid line crosses a y-grid
line are used in the criterion evaluation.
The next level, Level 2, as shown in Figure 3.3-9, begins its search at the
point estimated in the Level 3 evaluation. Once again, the estimate obtained at this level
is indicated by an "X", at (+7,-1), with the maximum displacement estimate indicated
by a dark shaded-line box, and the measurement window size by a narrow solid-line.
Keeping in mind the change of scale between Level 3 and this level, notice how the size
of the measurement window and maximum displacement have decreased. In this case
the subsampling is the same as the previous level, and the points used in criterion
evaluation are once again spaced apart as far as the gridlines indicate.
Finally, the last evaluation level, Level 1, as shown in Figure 3.3-10, is
searched in the same manner as the previous levels, beginning at the previous
level's estimated location. The arrows shown beginning at (0,0) and proceeding to
(+7,-2) indicate how each level incrementally contributes to the final estimate,
shown by an "X".

Figure 3.3-8 Hierarchical Blockmatching, Level 3 Example
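The level-by-level refinement walked through above can be sketched as below, assuming a reduced three-level schedule in the spirit of Table 3.3-2. For brevity the sketch uses a crude moving-average filter and an exhaustive search at each level rather than the log D search; the schedule and all names are illustrative.

```python
import numpy as np

# Hypothetical three-level schedule (filter size, max update displacement),
# coarse to fine, loosely patterned after Table 3.3-2:
LEVELS = [(5, 7), (3, 3), (1, 1)]

def box_filter(img, size):
    """Separable moving-average filter (the 'not truly low-pass' filter)."""
    if size <= 1:
        return img.astype(float)
    k = np.ones(size) / size
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def hierarchical_match(prev, cur, cx, cy, win):
    """Refine one block's motion estimate level by level: each level filters
    both frames, searches a shrinking neighbourhood around the estimate
    handed down from the level above, and accumulates the update."""
    dx = dy = 0
    for fsize, max_d in LEVELS:
        p, c = box_filter(prev, fsize), box_filter(cur, fsize)
        ref = p[cy:cy + win, cx:cx + win]
        best = min(
            ((ux, uy) for ux in range(-max_d, max_d + 1)
                      for uy in range(-max_d, max_d + 1)),
            key=lambda u: np.abs(
                ref - c[cy + dy + u[1]:cy + dy + u[1] + win,
                        cx + dx + u[0]:cx + dx + u[0] + win]).mean())
        dx, dy = dx + best[0], dy + best[1]
    return dx, dy
```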

Figure 3.3-9 Hierarchical Blockmatching, Level 2 Example

Figure 3.3-10 Hierarchical Blockmatching, Level 1 Example

MATCHING CRITERIA

Up to this point, the criterion used to determine the "best match" of displaced
measurement windows has been left undefined. There are primarily three ways in
which this match is calculated, each of which may be used with any of the block-
matching techniques discussed thus far. Depending on what characteristics are
determined to be most important, however, the matching criterion may be chosen
to be any function desired [73,77].
The principal technique used to produce an estimate of how well one block
matches another is the Mean Squared Error (MSE) function. This definition of the MSE
function is:

MSE_{ij}(\Delta x, \Delta y, \Delta t) = \frac{1}{N^2} \sum_{m=1}^{N} \sum_{n=1}^{N} \left[ f(iN+m,\, jN+n,\, t+\Delta t) - f(iN+m+\Delta x,\, jN+n+\Delta y,\, t) \right]^2

where the block size is NxN, the image intensity value at location (x,y) and
time t is f(x,y,t), and block (i,j) is being evaluated for an estimated relative motion of
(\Delta x, \Delta y) over the interval \Delta t.
The squaring operation used in the MSE evaluation is computationally
expensive, and hard to implement efficiently in hardware. Therefore, another common
function, known as the Mean Absolute Difference (MAD), is often used, which is much
simpler to implement computationally. This function is very similar to the MSE method,
and is defined to be:

MAD_{ij}(\Delta x, \Delta y, \Delta t) = \frac{1}{N^2} \sum_{m=1}^{N} \sum_{n=1}^{N} \left| f(iN+m,\, jN+n,\, t+\Delta t) - f(iN+m+\Delta x,\, jN+n+\Delta y,\, t) \right|
The final matching evaluation technique is not used nearly as often as the other
two, but is considered to be nearly as accurate as the MSE approach while no more
computationally intensive than the MAD approach [73]. It is known as Pel
Difference Classification (PDC), or the thresholding technique, and attempts to classify each pixel
intensity as either being a match or not with the reference frame.

The total number of matches is used to determine which block has the best
match. This can be illustrated mathematically as:

PDC_{ij}(\Delta x, \Delta y, \Delta t) = \sum_{m=1}^{N} \sum_{n=1}^{N} T(iN+m,\, jN+n;\, \Delta x, \Delta y)

where

T = \begin{cases} 1 & \text{if } \left| f(iN+m,\, jN+n,\, t+\Delta t) - f(iN+m+\Delta x,\, jN+n+\Delta y,\, t) \right| \le I_t \\ 0 & \text{otherwise} \end{cases}

and I_t is a set intensity threshold chosen to specify whether the intensities
match or not.
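The three criteria can be sketched as follows; note that, unlike MSE and MAD, the PDC score counts matching pels and is therefore maximised rather than minimised (function names are illustrative):

```python
import numpy as np

def mse(block, cand):
    # Mean squared error between reference block and displaced candidate
    return np.mean((block.astype(float) - cand) ** 2)

def mad(block, cand):
    # Mean absolute difference: no multiplications, cheaper in hardware
    return np.mean(np.abs(block.astype(float) - cand))

def pdc(block, cand, i_t):
    # Pel Difference Classification: count of pels within threshold i_t
    return int(np.sum(np.abs(block.astype(float) - cand) <= i_t))
```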
3.3.2 Gradient-Based Techniques
Gradient-based motion estimation techniques [5,9,57,78-80] attempt to relate
image time differences (motion) with an associated spatial difference (gradient), and are
derived directly from optical flow theory. In fact, much of the early work on gradient-
based technique drove the development of a unified optical flow theory.
The first step in describing optical flow is to represent a series of images, in
which objects may be in motion, as a function f(x,y,t) which has been expanded into a
Taylor series:

f(x+dx,\, y+dy,\, t+dt) = f(x,y,t) + \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial t}dt + \text{higher-order terms} \quad (3.3-7)

In general, the higher-order terms may be ignored. In order to estimate the
motion, as seen in our discussion of block-matching techniques, we attempt to find a
dx and dy such that:

f(x+dx,\, y+dy,\, t+dt) = f(x,y,t)
in which case, Equation 3.3-7 reduces to:

0 = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial t}dt

Rearranging this equation, the following relationship may be established:

-\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \quad (3.3-10)
Since the partial derivatives of f with respect to t and the spatial gradient may be
obtained directly from the image sequence intensity values, this leaves just the motion
vector to be solved for. How we solve for this velocity, and how well the technique
performs is what differentiates the different gradient-based techniques described in the
following sections.
Defining the velocity, or motion vector, as u = \left[ \frac{dx}{dt} \;\; \frac{dy}{dt} \right]^T, we can write Equation 3.3-10 more succinctly as:

-\frac{\partial f}{\partial t} = \nabla f \cdot u

where \nabla f is the spatial gradient.
3.3.2.1 Limb & Murphy (1975)
One of the earliest attempts to solve for the motion vectors was described by
Limb and Murphy [74,81-83] and became the founding classical technique for
gradient-based: motion estimation. Their approach was extremely simplistic, accounting
for only one moving object, and estimating motion in the horizontal direction alone.

To compute the absolute speed, u, Limb and Murphy used the following
relationship:

u = \frac{\sum_{M} \left| f(x,y,t+\Delta t) - f(x,y,t) \right|}{\sum_{M} \left| f(x,y,t) - f(x-\Delta x,\,y,\,t) \right|} \quad (3.3-12)

which is illustrated by way of Figure 3.3-11, where M is the area in which
motion is detected. Notice that the hashed area corresponds to the numerator of
Equation 3.3-12 and the "height", or intensity difference, is the denominator.
Figure 3.3-11 Limb and Murphy Displacement Estimation

It is important to realize that, based on their technique, the displacement may
only be estimated along a single direction, and that only the magnitude of the
displacement is calculated.
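A minimal sketch of the ratio estimate, assuming a simple nonzero-frame-difference test stands in for the moving-area mask M (the function name and the mask are assumptions):

```python
import numpy as np

def limb_murphy_speed(prev, cur, dx0=1):
    """Ratio of summed |frame difference| to summed |horizontal element
    difference| over the moving area M: a magnitude-only, one-direction
    speed estimate in the spirit of Limb and Murphy."""
    fd = np.abs(cur.astype(float) - prev)
    ed = np.abs(prev[:, dx0:].astype(float) - prev[:, :-dx0])
    moving = fd[:, dx0:] > 1e-6        # crude stand-in for the mask M
    return fd[:, dx0:][moving].sum() / ed[moving].sum()
```

For a linear ramp shifted three pels horizontally the ratio is exactly 3; on real imagery it is only an approximation, and only of the speed's magnitude.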
3.3.2.2 Cafforio & Rocca (1976, 1983)
Cafforio and Rocca [74,82-86,93] expanded on this estimation, making Limb
and Murphy's approach a simplified special case of their own. Cafforio and Rocca started
with the following relationship:

\Delta_T(x,y) = f(x,y,t) - f(x - u\,d_0,\; y - v\,d_0,\; t - dt)
            = \frac{\partial f}{\partial x}\,u\,d_0 + \frac{\partial f}{\partial y}\,v\,d_0 + n(x,y) \quad (3.3-13)

where d_0 is the sampling distance, u\,d_0 is the horizontal component of
displacement, v\,d_0 is the vertical component of displacement, and n(x,y) is any noise
plus the effects of dropping the higher-order terms of the Taylor series of Equation 3.3-7.

If we then define:

\Delta_T = f(x,y,t) - f(x,y,t - dt)
\Delta_X = f(x,y,t) - f(x - d_0,\, y,\, t)
\Delta_Y = f(x,y,t) - f(x,\, y - d_0,\, t) \quad (3.3-14)

where \Delta_T is the frame difference, \Delta_X is the horizontal element difference, and
\Delta_Y is the vertical element difference.

Combining the definitions of Equation 3.3-14 with those of 3.3-13 leads to:

\Delta_T = u\,\Delta_X + v\,\Delta_Y + n(x,y) \approx u\,\Delta_X + v\,\Delta_Y

if we assume that n(x,y) is small relative to the frame difference.
In the case of pure translational motion, the frame difference and both element
differences are known, leaving us to solve for u and v only, which happen to be the
scaling components of our motion vector (multiplying each by the sampling distance,
d_0 gives us the actual motion vectors). Since both u and v are constants, we can solve
for them using linear regression techniques, such that [84]:

u = \frac{\sigma_{\Delta_T}}{\sigma_{\Delta_X}} \cdot \frac{r_{\Delta_T\Delta_X} - r_{\Delta_X\Delta_Y}\,r_{\Delta_T\Delta_Y}}{1 - r_{\Delta_X\Delta_Y}^2}

v = \frac{\sigma_{\Delta_T}}{\sigma_{\Delta_Y}} \cdot \frac{r_{\Delta_T\Delta_Y} - r_{\Delta_X\Delta_Y}\,r_{\Delta_T\Delta_X}}{1 - r_{\Delta_X\Delta_Y}^2}

where \sigma_{\Delta_T}^2 is the variance of \Delta_T, and r_{\Delta_T\Delta_Y} is the correlation of \Delta_T to \Delta_Y,
the remaining terms being defined similarly.
If we assume that r_{\Delta_X\Delta_Y} is negligible, then these estimates essentially revert to
those found by Limb and Murphy, but for two dimensions instead of just one.
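The same estimate can be written compactly as a least-squares fit of the frame difference to the spatial gradients, which is the regression principle behind the correlation-coefficient form above; `np.linalg.lstsq` replaces the explicit variances and correlations (a sketch, with illustrative names):

```python
import numpy as np

def differential_motion(prev, cur):
    """Least-squares fit of (prev - cur) to the spatial gradients of prev,
    recovering a small global translation (sx, sy) in pels."""
    p, c = prev.astype(float), cur.astype(float)
    gy, gx = np.gradient(p)                       # spatial gradients
    A = np.column_stack([gx.ravel(), gy.ravel()])
    b = (p - c).ravel()                           # frame difference
    (sx, sy), _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    return sx, sy
```

Like all differential methods, this is only valid for displacements small enough that the first-order Taylor expansion holds.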

3.3.2.3 Netravali & Robbins (1979)
The estimation approach presented by Cafforio and Rocca made a motion
estimation based on a single sampling sequence. In 1979, Netravali and Robbins [4,5,
74,83,87-89] extended this approach to make the technique inherently recursive, or
iterative. Their approach has commonly become known as the pel recursive technique
since it bases the current estimate on past ones. Netravali and Robbins attempted to
minimize the squared displaced frame difference of the predicted frame through the
steepest descent gradient approach.
To describe their approach, a few definitions are required. If we define the location
of the current pixel being evaluated to be:

r = [x \;\; y]^T \quad (3.3-18)

and the displacement estimation for this pixel at iteration i to be:

d^i(r; t, \Delta t) = \left[ d_x^i(r; t, \Delta t) \;\; d_y^i(r; t, \Delta t) \right]^T \quad (3.3-19)

then the current displaced frame difference would be:

dfd(r, d^i) = f\!\left(r + d^i(r; t, \Delta t),\; t + \Delta t\right) - f(r, t) \quad (3.3-20)
For perfectly estimated motion, this displaced frame difference would be zero;
but in reality it never is due to noise. Netravali and Robbins attempted to minimize the
square of the displaced frame difference through the method of steepest descents. This
is illustrated in Figure 3.3-12.
The method of steepest descent can be simply expressed by the following
relationship:

d^{i+1} = d^i + u^i

where d^i is the previous estimate, d^{i+1} is the new estimate, and u^i is the
update term. Netravali and Robbins proposed that this update term be:

u^i = -\frac{\varepsilon}{2}\, \nabla_d \left[ dfd(r, d^i) \right]^2 \quad (3.3-22)

where \varepsilon is a positive constant which regulates the step size, or convergence
rate, and \nabla_d is the spatial gradient operator with respect to the current motion
estimate.
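A single-pel sketch of the recursion is shown below. Two liberties are taken for the toy example: the displacement is applied to the previous frame (one common convention), and the step size is much larger than the 1/1024 suggested in the original so that a single pel converges quickly. All names are illustrative.

```python
import numpy as np

def bilinear(img, yy, xx):
    """Sample img at a fractional (row, col) with bilinear interpolation."""
    y0, x0 = int(np.floor(yy)), int(np.floor(xx))
    wy, wx = yy - y0, xx - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x0 + 1]
            + wy * (1 - wx) * img[y0 + 1, x0] + wy * wx * img[y0 + 1, x0 + 1])

def pel_recursive(prev, cur, y, x, eps=4.0, iters=300):
    """Steepest-descent recursion d <- d - eps * dfd * grad(f) at one pel;
    the estimate is generally a subsample (non-integer) displacement."""
    gy, gx = np.gradient(prev.astype(float))
    dx = dy = 0.0
    for _ in range(iters):
        dfd = cur[y, x] - bilinear(prev, y - dy, x - dx)
        gxs = bilinear(gx, y - dy, x - dx)
        gys = bilinear(gy, y - dy, x - dx)
        dx -= eps * dfd * gxs
        dy -= eps * dfd * gys
    return dx, dy
```

Note that a single pel can only recover the displacement component along the local gradient (the aperture problem), so the test below checks that the displaced frame difference is driven toward zero rather than the exact vector.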

It is important to notice that this approach, unlike those of block matching, will
generally not give an estimate that is an integer multiple of the sampling distance; and,
for most cases,'the estimate will be for a subsample distance. Walker & Rao (19841
Netravali and Robbins, in their original paper, suggested that 1/1024 be used as
the rate of convergence, \varepsilon. Walker and Rao [4,90,91] observed that the estimate
provided by Netravali and Robbins' approach converged rather slowly due to the rate
of convergence being a fixed value. Walker and Rao proposed, rather, that this step
size be adaptive, based on the magnitude of the update term. This intuitively makes
sense since a large update term indicates that the estimate is still far from converging,
while a small update term means that the estimate is near or at convergence. Their
approach has commonly become known as the adaptive pel recursive method.

Walker and Rao proposed that the adaptive value for \varepsilon be:

\varepsilon = \frac{1}{2\left\| \nabla f\!\left(r + d^i,\, t + \Delta t\right) \right\|^2}
They found, through their evaluations of both this estimate and the original one
proposed by Netravali and Robbins, that this adaptive technique performed
significantly better than the original technique, but was slightly more difficult to
implement.

Walker and Rao calculated their gradient terms as shown in Figure 3.3-13.

Figure 3.3-13 shows a 3x3 neighbourhood of intensities I_1 through I_9, where
I_k is the intensity of the pixel at location k and f(r,t) = I_5; the spatial gradient at the
centre pel is estimated from the surrounding intensities.

Figure 3.3-13 Walker and Rao Gradient Calculations
Of course, the particular method Walker and Rao used for calculating the
gradient is not the only way. Any number of neighboring pixel values may be used as
long as the gradient is normalized properly. Increasing the number of points used will
decrease the effect of noise, thereby increasing the accuracy of the gradient calculation;
however, this is at the expense of computational complexity.
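A central-difference gradient over the 3x3 neighbourhood, together with the adaptive step size commonly attributed to Walker and Rao, can be sketched as follows; both are one plausible reading, not necessarily the authors' exact formulas:

```python
import numpy as np

def grad_3x3(img, y, x):
    """Central-difference gradient from the 3x3 neighbourhood of
    Figure 3.3-13: horizontal from I6 and I4, vertical from I8 and I2
    (one plausible reading of the figure)."""
    fx = (img[y, x + 1] - img[y, x - 1]) / 2.0
    fy = (img[y + 1, x] - img[y - 1, x]) / 2.0
    return fx, fy

def adaptive_eps(fx, fy, floor=1e-6):
    """Adaptive step size, commonly stated as 1 / (2 * ||grad f||^2);
    the floor guards against flat (zero-gradient) regions."""
    return 1.0 / (2.0 * max(fx * fx + fy * fy, floor))
```

Using more neighbouring pixels in the gradient would reduce the effect of noise, exactly as the text notes, at the cost of extra computation.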
3.3.2.5 Horn & Schunk (1981)
In 1981, Horn and Schunk [61,62, 64, 82, 83,92] attempted to describe the
fundamental theories behind motion estimation. In the process they related motion
estimation to the principles of optical flow theory, and presented an advanced gradient-

based motion estimation technique derived directly from optical flow theory. Horn and
Schunk defined the Displaced Frame Difference (DFD) as:

DFD = f(x+dx,\, y+dy,\, t+dt) - f(x,y,t) \approx \left( \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial t} \right) dt
This technique is driven by the smoothness constraint, and attempts to minimize
the optical flow error by driving down the squared-error defined by:

\varepsilon^2 = \left( f_x u + f_y v + f_t \right)^2 + \lambda \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) \quad (3.3-25)

where f_x u + f_y v + f_t is the estimation error due to real-world constraints, and
u_x^2 + u_y^2 + v_x^2 + v_y^2 is the deviation from the smoothness constraint.

The strength of the smoothness constraint components, u_x, u_y, v_x, v_y, is
controlled by varying \lambda; small values of \lambda represent little smoothing, while large values
represent significant smoothing. The actual value for \lambda is normally chosen heuristically
and is highly subjective relative to the application.

The motion vector components, u and v, may be found by minimizing R in the following:

R = \iint \left( f_x u + f_y v + f_t \right)^2 + \lambda \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) dx\,dy \quad (3.3-26)

If we define:

\left( u_{xx} + u_{yy} \right) = \nabla^2 u \approx u_{av} - u
\left( v_{xx} + v_{yy} \right) = \nabla^2 v \approx v_{av} - v

where u_{av} and v_{av} are local neighbourhood averages of u and v, and then
differentiate Equation 3.3-26 with respect to u, we obtain:

f_x^2 u + f_x f_y v + f_x f_t = \lambda \left( u_{av} - u \right)

and differentiating with respect to v we obtain:

f_x f_y u + f_y^2 v + f_y f_t = \lambda \left( v_{av} - v \right)

Rearranging terms we get the following general equations:

u = u_{av} - f_x \frac{P}{D} \qquad v = v_{av} - f_y \frac{P}{D}

where:

P = f_x u_{av} + f_y v_{av} + f_t \qquad D = \lambda + f_x^2 + f_y^2

In reality, u and v are solved iteratively, using the Gauss-Seidel method, by
starting with u and v at zero and continuing until some minimum error measurement
is obtained for:

u^{k+1} = u_{av}^{k} - f_x\, \frac{f_x u_{av}^{k} + f_y v_{av}^{k} + f_t}{\lambda + f_x^2 + f_y^2}

v^{k+1} = v_{av}^{k} - f_y\, \frac{f_x u_{av}^{k} + f_y v_{av}^{k} + f_t}{\lambda + f_x^2 + f_y^2}
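The iteration can be sketched as below; for brevity it uses a Jacobi-style sweep with a four-neighbour average rather than a true Gauss-Seidel update, and the border handling is illustrative:

```python
import numpy as np

def horn_schunck(prev, cur, lam=0.1, iters=100):
    """Iterate u = u_av - fx*P/D and v = v_av - fy*P/D over the whole field;
    lam plays the role of lambda in Eq. 3.3-26."""
    f0 = prev.astype(float)
    fy_, fx_ = np.gradient(f0)             # spatial gradients
    ft_ = cur.astype(float) - f0           # temporal difference
    u = np.zeros_like(f0)
    v = np.zeros_like(f0)

    def avg(w):
        """Four-neighbour average; borders are left at zero (illustrative)."""
        out = np.zeros_like(w)
        out[1:-1, 1:-1] = 0.25 * (w[:-2, 1:-1] + w[2:, 1:-1]
                                  + w[1:-1, :-2] + w[1:-1, 2:])
        return out

    D = lam + fx_ ** 2 + fy_ ** 2
    for _ in range(iters):
        uav, vav = avg(u), avg(v)
        P = fx_ * uav + fy_ * vav + ft_
        u = uav - fx_ * P / D
        v = vav - fy_ * P / D
    return u, v
```

For a unit ramp, a half-pel horizontal shift is analytically identical to a -0.5 intensity change, so the interior flow converges to u = 0.5, v = 0.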

There is currently a great deal of interest in achieving extremely high amounts of
digital image compression for full-motion video sequences by exploiting the estimation
of image motion. This interest is evident both in the volume of research papers being
published on the subject as well as the activities of image compression standardization
committees. Much work still needs to be done in the area of motion estimation,
however, in order to make it realizable for real-world applications.
This thesis established the role that motion estimation plays with respect to
image compression by first investigating some of the more traditional image
compression techniques, developed primarily for still-image compression. This
investigation was important not only to establish the significance of motion estimation
in order to achieve high compression ratios, but also because motion estimation will
normally be used in conjunction with one or more of these techniques.
The fundamental theory behind most current motion estimation techniques was
then described. This theory, known as optical flow theory, attempts to provide a
universal framework in which to describe the phenomena associated with motion as it is
projected from a 3-dimensional world onto a 2-dimensional sensor.
The two primary approaches to motion estimation, correlation-based and
gradient-based, were described, both generally and through specific examples of
various techniques previously reported in technical articles. Both approaches were
found to have strengths and weaknesses which had to be evaluated based on the
particular application they were to be used in. Correlation-based techniques were found
to be computationally, as well as intuitively, simpler than gradient-based
techniques, but suffered from poor estimation resolution and were more susceptible to
converging on false local minima due to noise. Gradient-based techniques, while more
accurate and robust with respect to noise, were found to be computationally more
expensive and did not deal well with occlusions.

The study of motion estimation, and image sequence processing in general, is
still in its infancy. The information and techniques presented in this thesis are still
primarily a laboratory curiosity and must be extended to real-world applications in the
near future in order to be useful. Many of the most significant problems for motion
estimation, such as noise and occlusions, have not been seriously dealt with in the
research described in this thesis due to the controlled image sequence samples used in
evaluating the techniques. One area of further study would be to investigate and
categorize the effects of various real-world problems on each of the techniques.
The hierarchical blockmatching technique described earlier, in my
opinion, is really the only technique which has been designed, from the start, to deal
with both of the real-world problems of noise and occlusions. The gradient-
based techniques, however, appear to be much more robust, provided that the problem
of occlusions could be overcome and higher-efficiency algorithms are developed.
Therefore, further study is required to make gradient-based techniques more practical
for real-world use.
visual system [95-97], it is hard to determine the effectiveness of image compression to
real world applications [98-99]. The mean-squared error and other similar methods are
not particularly applicable, and subjective assessment appears to be the only possible
real measurement of success. So, further study would be warranted to attempt to
measure the effectiveness quantitatively, particularly for image sequences, which would
then allow techniques to be developed which exploited particular characteristics of the
human visual system and made compression technique evaluations less subjective.

[1] R.C. Gonzalez and P. Wintz, Digital Image Processing, Second Edition, Addison-Wesley Publishing Co., 1987.
[2] A.K. Jain, Image Data Compression: A Review, Proc. IEEE, vol. 69, no. 3, pp. 349-389, March 1981.
[3] M. Rabbani, Review of Digital Image Compression World Standards, Course Notes: Electronic Imaging 91 East, September 30,1991.
[4] M.I. Sezan, Image Sequence Processing, Course Notes: Electronic Imaging 91 East, September 30,1991.
[5] A.N. Netravali and B.G. Haskell, Digital Pictures: Representation and Compression, Plenum Press, 1988.
[6] M.L. Liou, "Visual Telephony as an ISDN Application," IEEE Comm. Mag., Vol. 28, no. 2, pp. 30-38, February 1990
[7] G. Privat and E. Petajan, "Processing Hardware for Real-Time Video Coding," IEEE Micro, Vol. 12, no. 5, pp. 9-12, October 1992
[8] P. Fockens and A.N. Netravali, Digital Spectrum-Compatible HDTV: The Zenith/AT&T Broadcast Proposal, Advanced Imaging, pp. 14-20, April 1992.
[9] R.J. Jurgen, "The Challenges of Digital HDTV," IEEE Spectrum, Vol. 28, no. 4, pp. 28-30, 71-73, April 1991
[10] E. Petajan, "Digital Video Coding Techniques for US High-Definition TV," IEEE Micro, Vol. 12, no. 5, pp. 13-21, October 1992
[11] O. Duardo, S.C. Knaver, J.N. Mailhot, K. Mondal and T.C. Poom, "DSC-HDTV Video Decoder System," IEEE Micro, Vol. 12, no. 5, pp. 22-27, October 1992
[12] A. Habibi and B.H. Batson, Potential Digitization/Compression Techniques for Shuttle Video, IEEE Trans. Comm., vol. COM-26, no. 11, pp. 1671-1682, November 1978.
[13] M.P. Ekstrom, Ed., Digital Image Processing Techniques, Academic Press, 1984
[14] M. Rabbani and P.W. Jones, Digital Image Compression Techniques, SPIE Optical Engineering Press, 1991
[15] R.K. Jurgen, Putting the Standards to Work, IEEE Spectrum, vol. 29, no. 3, pp. 28-30, March 1992.
[16] A.N. Netravali and J.O. Limb, Picture Coding: A Review, Proc. IEEE, vol. 68, no. 3, pp. 366-406, March 1980.

[17] R.J. Clarke, Transform Coding of Images, Academic Press, 1985.
[18] J.A. Storer, Data Compression: Methods and Theory, Computer
Science Press, 1988.
[19] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals,
Prentice-Hall, 1978
[20] S.P. Lloyd, Least Squares Quantization in PCM, IEEE Trans. Info.
Theory, Vol. IT-28, no. 2, pp. 129-137, March 1982
[21] C.E. Shannon, "A Mathematical Theory of Communication," Bell
System Technical Journal, Vol. 27, pp. 379-423, July 1948
[22] V. Cappellini, Ed., Data Compression and Error Control Techniques
with Applications, Academic Press, 1985.
[23] D. Welsh, Codes and Cryptography, Clarendon Press, 1988
[24] M. Rabbani, Ed., Selected Papers on Image Coding and Compression,
S PIE Optical Engineering Press, 1992
[25] D.A. Huffman, A Method for the Construction of Minimum-
Redundancy Codes, Proc. IRE, Vol. 40, no. 10, pp. 1098-1101,
September 1952
[26] R.M. Fano, "The Transmission of Information," Technical Report No.
65, Research Laboratory of Electronics, M.I.T., Cambridge, Mass.,
[27] Y. Linde, A. Buzo, and R.M. Gray, "An Algorithm for Vector
Quantization Design," IEEE Trans. Comm., Vol. COM-28, no. 1, pp.
84-95, January 1980
[28] R.M. Gray, "Vector Quantization," IEEE ASSP Mag., Vol. 1, no. 2,
pp. 4-29, April 1984
[29] R.J. Offen, VLSI Image Processing, McGraw-Hill Book Co., 1985.
[30] G. Pujolle, D. Seret, D. Dromard, and E. Horlait, Integrated Digital
Communications Networks, Vol. 1, John Wiley & Sons, 1988.
[31] A. Habibi, Survey of Adaptive Image Coding Techniques, IEEE
Trans. Comm., vol. COM-25, no. 11, pp. 1275-1284, November
[32] T.S. Huang, "Coding of Two-Tone Images," IEEE Trans. Comm.,
Vol. COM-25, no. 11,1406-1424, November 1977
[33] D.K. Sharma and A.N. Netravali, "Design of Quantizers for DPCM
Coding of Picture Signals," IEEE Trans. Comm., Vol. COM-25, no.
11, pp. 1267-1274, November 1977

[34] J.W. Woods and S.D. O'Neil, "Subband Coding of Images," IEEE
Trans. Acoust., Speech, Signal Proc., Vol. ASSP-34, no. 5,
pp. 1278-1288, October 1986
[35] P.A. Wintz, "Transform Picture Coding," Proc. IEEE, Vol. 60, no.7,
pp. 809-820, July 1972
[36] A.K. Jain, Advances in Mathematical Models for Image Processing,
Proc. IEEE, vol. 69, pp. 502-528, May 1981.
[37] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing,
Prentice-Hall, 1989
[38] J.A. Roese, Interframe Coding of Digital Images Using Transform and
Hybrid Transform/Predictive Techniques, USCIPI Report 700, Image
Processing Institute, 1976.
[39] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, Image
Coding Using Wavelet Transforms, IEEE Trans. Image Proc., Vol. 1,
no. 2, pp. 205-220, April 1992
[40] A.S. Lewis and G. Knowles, Image Compression Using the 2-D
Wavelet Transform, IEEE Trans. Image Proc., Vol. 1, no. 2, pp.
244-250, April 1992
[41] O. Rioul and M. Vetterli, Wavelet and Signal Processing, IEEE Sig.
Proc. Magazine, Vol. 8, no. 4, pp. 14-38, October 1991
[42] W.R. Zettler, J. Huffman and D.C.P. Linden, "Application of
Compactly Supported Wavelets to Image Compression,", Image Proc.
Alg., Tech.,, Proc. SPIE, Vol. 1244, pp. 150-160, 1990
[43] M. Barnsley, Fractals Everywhere, Academic Press, 1988
[44] A.E. Jacquin, Image Coding Based on a Fractal Theory of Iterated
Contractive Image Transformations, IEEE Trans. Image Proc., Vol.
1, no. 1, pp. 18-30, January 1992
[45] B.G. Haskell, P.L. Gordon, R.L. Schmidt, and J.V. Scattaglia,
Interframe Coding of 525-Line, Monochrome Television at 1.5
Mbits/s, IEEE Trans. Comm., vol. COM-25, no. 11, pp. 1339-1348,
November 1977.
[46] A. Habibi, Hybrid Coding of Pictorial Data, IEEE Trans. Comm.,
vol. COM-22, no. 5, pp. 614-624, May 1974.
[47] A. Habibi, An Adaptive Strategy for Hybrid Image Coding, IEEE
Trans. Comm., vol. COM-29, no. 12, pp. 1736-1740, December
[48] A. Puri and H.M. Hang, Adaptive Schemes for Motion-Compensated
Coding, Visual Communications and Image Processing 88, T.R.
Hsing, Ed., SPIE Vol. 1001, pp. 925-935, 1988

[49] T.O. Tam and J.A. Stuller, "Line-Adaptive Hybrid Coding of Images,"
IEEE Trans. Comm., Vol. COM-31, no. 3, pp. 445-450, March 1983
[50] G.K. Wallace, The JPEG Still Picture Compression Standard,
Comm. ACM, vol. 34, no. 4, pp. 30-44, April 1991.
[51] P.H. Ang, P.A. Ruetz, and D. Auld, Video Compression Makes Big
Gains, IEEE Spectrum, vol. 28, no. 11, pp. 16-19, October 1991.
[52] M.L. Liou, Overview of the px64 kbit/s Video Coding Standard,
Comm. ACM, vol. 34, no. 4, pp. 59-63, April 1991.
[53] CCITT Video Compression Databook, LSI Logic, September 1991
[54] Coded Representation of Picture, Audio and Multimedia/Hypermedia
Information (Tentative Title), ISO/IEC JTC 1/SC 29, December 1991
[55] R.K. Jurgen, An Abundance of Video Formats, IEEE Spectrum, vol.
29, no. 3, pp. 26-28, March 1992.
[56] D.L. Gall, MPEG: A Video Compression Standard for Multimedia
Applications, Comm. ACM, vol. 34, no. 4, pp. 46-58, April 1991.
[57] D.H. Ballard and C.M. Brown, Computer Vision, Prentice-Hall, Inc.,
[58] E. DuBois and S. Sabri, Noise Reduction in Image Sequences Using
Motion-Compensated Temporal Filtering, IEEE Trans. Comm., vol.
COM-32, no. 7, pp. 826-831, July 1984.
[59] D. Martinez and J.S. Lim, Implicit Motion Compensated Noise
Reduction of Motion Video Scenes, IEEE International Conference on
ASSP, pp. 375-378, March 1985
[60] R.K. Ward and B.E.A. Saleh, Deblurring Random Blur, IEEE Trans.
Acoust., Speech, Signal Proc., vol. 35, no. 10, pp. 1494-1498,
October 1987.
[61] K.D. Skifstad, High-Speed Range Estimation Based on Intensity
Gradient Analysis, Springer-Verlag, 1991.
[62] A. Singh, Optic Flow Computation: A Unified Perspective, IEEE
Computer Science Press, 1991.
[63] R.J. Schalkoff, Tutorial on Image Motion Analysis, Course Notes:
Electronic Imaging 91 East, September 30,1991.
[64] J.K. Aggarwal and N. Nandhakumar, "On the Computation of Motion
from Sequences of Images-A Review," Proc. IEEE, Vol. 76, no. 8,
pp. 917-935, August 1988
[65] D. Heeger, A Model for Extraction of Image Flow, Proc. First Inti.
Conf. Computer Vision, pp. 181-190, 1987

[66] R. Resnick and D. Halliday, Physics, John Wiley & Sons, 1977
[67] C. Labit and A. Benveniste, "Motion Estimation in a Sequence of Television Pictures," in Image Sequence Processing and Dynamic Scene Analysis, T.S. Huang, Ed., Springer-Verlag, 1983, pp. 292- 306
[68] R. Lenz and A. Gerhard, "Image Sequence Coding Using Scene Analysis and Spatio-Temporal Interpolation," in Image Sequence Processing and Dynamic Scene Analysis, T.S. Huang, Ed., Springer- Verlag, 1983, pp. 264-274
[69] R. Srinivasan and K.R. Rao, Motion-Compensated Coder for Videoconferencing, IEEE Trans. Comm., vol. COM-35, no. 3, pp. 297-304, March 1987.
[70] S. Kappagantula and K.R. Rao, Motion Compensated Interframe Image Prediction, IEEE Trans. Comm., vol. COM-33, no. 9, pp. 1011-1015, September 1985.
[71] C.D. Bowling and R.A. Jones, Motion Compensated Image Coding with a Combined Maximum A Posteriori and Regression Algorithm, IEEE Trans. Comm., vol. COM-33, no. 8, pp. 844-857, August 1985.
[72] R. Srinivasan and K.R. Rao, Predictive Coding Based on Efficient Motion Estimation, IEEE Trans. Comm., vol. COM-33, no. 8, pp. 888-896, August 1985.
[73] J.R. Jain and A.K. Jain, Displacement Measurement and Its Application in Interframe Image Coding, IEEE Trans. Comm., vol. COM-29, no. 12, pp. 1799-1808, December 1981.
[74] H.G. Musmann, P. Pirsch and H.J. Grallert, "Advances in Picture Coding," Proc. IEEE, Vol. 73, no. 4, pp. 523-548, April 1985
[75] M. Ghanbari, The Cross-Search Algorithm for Motion Estimation, IEEE Trans. Comm., vol. 38, no. 7, pp. 950-953, July 1990.
[76] M. Bierling, Displacement Estimation by Hierarchical Blockmatching, Visual Communications and Image Processing '88, T.R. Hsing, Ed., SPIE Vol. 1001, pp. 942-951, 1988
[77] D.C. Burdick, M.L. Marcuse and J.D. Mislan, Computer Aided Target Tracking in Motion Analysis Studies, Image-Based Motion Measurement, J.S. Walton, Ed., SPIE Vol. 1356, pp. 89-100, 1990
[78] K.A. Prabhu and A.N. Netravali, Motion Compensated Component Color Coding, IEEE Trans. Comm., vol. COM-30, no. 12, pp. 2519-2527, December 1982.
[79] R.A. Jones and C.D. Bowling, An Adaptive Gradient Approach to Displacement Estimation, in Image Sequence Processing and Dynamic Scene Analysis, T.S. Huang, Ed., Springer-Verlag, 1983, pp. 235-

[80] Z. Houkes, "Motion Parameter Estimation in TV-Pictures," in Image
Sequence Processing and Dynamic Scene Analysis, T.S. Huang, Ed.,
Springer-Verlag, 1983, pp. 249-274
[81] J.O. Limb and J.A. Murphy, "Measuring the Speed of Moving Objects
from Television Signals," IEEE Trans. Comm., Vol. COM-23, no. 4,
pp. 474-478, April 1975
[82] T.S. Huang, Ed., Image Sequence Analysis, Springer-Verlag, 1981.
[83] H.C. Bergmann, Analysis of Different Displacement Estimation
Algorithms for Digital Television Signals, in Image Sequence
Processing and Dynamic Scene Analysis, T.S. Huang, Ed., Springer-
Verlag, 1983, pp. 215-234.
[84] C. Cafforio and F. Rocca, "Methods for Measuring Small
Displacements of Television Images," IEEE Trans. Info. Theory, Vol.
IT-22, no. 5, pp. 573-579, September 1976
[85] C. Cafforio and F. Rocca, "The Differential Method for Image Motion
Estimation," in Image Sequence Processing and Dynamic Scene
Analysis, T.S. Huang, Ed., Springer-Verlag, 1983, pp. 104-124
[86] G. Tziritas, Displacement Estimation for Image Predictive Coding and
Frame Motion-Adaptive Interpolation, Visual Communications and
Image Processing 88, T.R. Hsing, Ed., SPIE Vol. 1001, pp. 936-
941, 1988
[87] A.N. Netravali and J.D. Robbins, Motion-Compensated Television
Coding: Part I, Bell System Technical Journal, vol. 48, no. 3, pp.
631-670, March 1979.
[88] J.D. Robbins and A.N. Netravali, Recursive Motion Compensation: A
Review, in Image Sequence Processing and Dynamic Scene Analysis,
T.S. Huang, Ed., Springer-Verlag, 1983, pp.75-103.
[89] B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-
Hall, 1985
[90] D.R. Walker and K.R. Rao, Motion-Compensated Coder, IEEE
Trans. Comm., vol. COM-35, no. 11, pp.1171-1178, November
[91] D.R. Walker and K.R. Rao, Improved Pel-Recursive Motion
Compensation, IEEE Trans. Comm., vol. COM-32, no. 10, pp.
1128-1134, October 1984.
[92] B.K.P. Horn and B.G. Schunk, "Determining Optical Flow," Artificial
Intelligence, Vol. 17, pp. 185-203,1981
[93] S. Sabri, Movement-Compensated Interframe Prediction for NTSC
Colour TV Signals, in Image Sequence Processing and Dynamic Scene
Analysis, T.S. Huang, Ed., Springer-Verlag, 1983, pp. 156-199.

[94] H. Gharavi and M. Mills, Blockmatching Motion Estimation Algorithms: New Results, IEEE Trans. Circ., Systems, vol. 37, no. 5, pp. 649-651, May 1990.
[95] S.A. Werness, Statistical Evaluation of Predictive Data Compression Systems, IEEE Trans. Acoust., Speech, Signal Proc., vol. 35, no. 8, pp. 1190-1198, August 1987.
[96] N.B. Nill, A Visual Model Weighted Cosine Transform for Image Compression and Quality Assessment, IEEE Trans. Comm., vol. COM-33, no. 6, pp. 551-557, June 1985.
[97] C.F. Hall and E.L. Hall, "A Nonlinear Model for the Spatial Characteristics of the Human Visual System, IEEE Trans. Systems, Man, Cybernetics, vol. SMC-7, no. 3, pp. 161-170, March 1977.
[98] D.J. Sakrison, "On the Role of the Observer and a Distortion Measure in Image Transmission," IEEE Trans. Comm., Vol. COM-25, no. 11, pp. 1251-1267, November 1977
[99] L.D. Davisson, "Rate-Distortion Theory and Applications," Proc. IEEE, Vol. 60, no. 7, pp. 800-808, 1972

Although not directly referenced in the thesis, the following references were also
investigated and served primarily as background reading material:
P.M. Farrelle, Recursive Block Coding for Image Data Compression,
Springer-Verlag, 1990.
D. J. Connor and J.O. Limb, Properties of Frame-Difference Signals
Generated by Moving Images, IEEE Trans. Comm. Vol. COM-22,
no. 10, pp. 1564-1575, October 1974
E. H. Adelson and J.R. Bergen, Spatiotemporal Energy Models for the
Perception of Motion, J. Optical Soc. America, Vol. 2, pp. 284-299,
R.N. Braithwaite and M.P. Beddoes, Iterative Methods for Solving the
Gabor Expansion: Considerations of Convergence, IEEE Trans.
Image Proc., Vol. 1, no. 2, pp. 243-244, April 1992
J. Weng, N. Ahuja, and T.S. Huang, Motion and Structure From
Point Correspondences with Error Estimation: Planar Surfaces, IEEE
Trans. Signal Processing, vol. 39, no. 12, pp. 2691-2717, December
R. K. Jurgen, Digital Video, IEEE Spectrum, vol. 29, no. 3, pp. 24-
26, March 1992.
N.M. Namazi and C.H. Lee, Nonuniform Image Motion Estimation
from Noisy Data, IEEE Trans. Acoust., Speech, Signal Proc., vol.
38, no. 2, pp. 364-366, February 1990.
S. A. Mahmoud, A New Technique for Velocity Estimation of Large
Moving Objects, IEEE Trans. Signal Processing, vol. 39, no. 3, pp.
741-743, March 1991.
S. D. Blostein and T.S. Huang, Detecting Small, Moving Objects in
Image Sequences Using Sequential Hypothesis Testing, IEEE Trans.
Signal Processing, vol. 39, no. 7, pp. 1611-1629, July 1991.
N.M Namazi and D.W. Foxell, On the Convergence of the Generalized
Maximum Likelihood Algorithm for Nonuniform Image Motion
Estimation, IEEE Trans. Image Proc., Vol.l, no. 1, pp. 116-119,
January 1992
C.G. Broncelet, Jr., J.R. Cobbs and A.R. Moser, "Error Free
Compression of Medical X-Ray Images," Vis. Comm. & Image Proc.,
Proc. SPIE, Vol. 1001, pp. 269-276,1988
T. R. Natarajan and N. Ahmed, "On Interframe Transform Coding,"
IEEE Trans. Comm., Vol. COM-25, no. 11, pp. 1323-1329,
November 1977
S. Sabri, "Movement Compensated Interframe Prediction for NTSC
Color TV Signals," IEEE Trans. Comm., Vol. COM-32, no. 8, pp.
954-968, August 1984