An adaptive statistical filter for detecting motion in MPEG-1 video streams for surveillance applications

Material Information

An adaptive statistical filter for detecting motion in MPEG-1 video streams for surveillance applications
Osminer, Matthew Corey
Place of Publication:
Denver, Colo.
University of Colorado Denver
Publication Date:
Physical Description:
94 leaves : illustrations ; 28 cm

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer Science
Committee Chair:
Alaghband, Gita
Committee Members:
Clark, John
Choi, Min


Subjects / Keywords:
Stochastic processes ( lcsh )
MPEG (Video coding standard) ( lcsh )
Electronic surveillance ( lcsh )
Motion ( lcsh )
Demodulation (Electronics) ( lcsh )
Demodulation (Electronics) ( fast )
Electronic surveillance ( fast )
Motion ( fast )
MPEG (Video coding standard) ( fast )
Stochastic processes ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaf 94).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Matthew Corey Osminer.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
51775244 ( OCLC )
LD1190.E52 2002m .O75 ( lcc )

Full Text
Matthew Corey Osminer
B.S., University of Colorado at Boulder, 1994
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

This thesis for the Master of Science Computer Science
degree by
Matthew Corey Osminer
has been approved
Gita Alaghband
Min Choi

Osminer, Matthew Corey (M.S., Computer Science)
An Adaptive Statistical Filter for Detecting Motion in MPEG-1 Video Streams for
Surveillance Applications
Thesis directed by Professor Gita Alaghband
This paper presents a method for detecting motion in MPEG-1 video scenes for
surveillance applications, using stochastic filtering. The method aims to minimize
computational requirements by detecting motion without first decompressing the
MPEG video, for use in real time on a limited-resource video acquisition device.
A scalable software architecture was developed that provides the framework not
only for the final motion detection algorithm, but also for an MPEG analysis tool
to analyze MPEG video streams. The output from the analysis tool was used to
formulate a basic motion detection test and two stochastic tests. The basic test
provided high motion detection accuracy for scenes with high average luminance,
but failed to detect, or erroneously detected, motion in scenes with medium
and low average luminance. The two stochastic tests improved medium-luminance
scene detection accuracy, but still failed to resolve low-luminance scenes
properly.
While the motion detection framework for this project focused on a specific
hardware platform, the method and framework developed are by no means
exclusive to this platform and the resulting detection algorithms may be used in
various applications that require a characterization of motion in an MPEG video
stream.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Gita Alaghband

I would like to acknowledge:
Dr. Mike Perkins for his MPEG expertise and undying friendship, support,
and encouragement.
The Vantum Corporation for their generous donation of a Vantum Cld
Video Appliance.
My family, for getting me into engineering in the first place.
Most of all, my wife, Teresa, who has patiently stood by and been ready to
help out at a moment's notice with anything I could possibly hope to ask.

1. Introduction.....................................................1
2. MPEG Compression Theory and Bitstream Syntax.................... 3
2.1 MPEG-1 Video Structure...........................................3
2.2 MPEG-1 Motion Estimation.........................................6
2.3 Block Matching...................................................8
2.3.1 Mean Absolute Difference (MAD)...................................8
2.3.2 Mean Square Difference (MSD).....................................9
2.3.3 Block Comparison Performance.....................................9
2.4 Searching for Matching Blocks...................................10
2.4.1 Pyramid Search..................................................10
2.4.2 Full Search.....................................................13
2.4.3 Restricted Search...............................................13
2.5 Spatial Compression.............................................13
3. Encoding Apparatus and Software Architecture....................16
3.1 Encoding Hardware Platform......................................16
3.2 MPEG Software Framework.........................................17
3.3 Algorithm Constraints...........................................18
3.4 Software Architecture...........................................18
3.4.1 Element Handling................................................21
3.4.2 Element Handler Processing Modules..............................22
4. Motion Detection Filter Development.............................24
4.1 Overview........................................................24
4.2 Sum and Threshold Test..........................................24
4.3 Statistical Filtering Approach..................................26
4.3.1 Spurious Vector Characterization................................27
4.3.2 Methods.........................................................32
4.4 Spurious Motion Vector Filter Tests.............................34
4.4.1 Adaptive Threshold Test.........................................35
4.4.2 Weighted Vector Threshold Test..................................37
4.4.3 Weighted Vector Ratio Threshold Test............................40
4.5 Performance Comparisons.........................................47
4.5.1 Cost Analysis...................................................52

4.5.2 Performance Conclusions......................................55
5. Summary......................................................56
A. Adaptive Threshold Test Plots................................58
B. Weighted Vector Threshold Plots..............................64
C. Weighted Vector Ratio Threshold Plots (T=16).................70
D. Weighted Vector Ratio Threshold Plots (T=32).................76
E. Weighted Vector Ratio Threshold Plots (T=64)................. 82
F. Motion Scene Event Timetables................................88
References.......................................................... 94

Figure 2.1 MPEG Bitstream Hierarchy.....................................4
Figure 2.2 Pyramid Search............................................ 12
Figure 2.3 Spatial Compression Signal Chain............................14
Figure 3.1 Encoder Configuration.......................................16
Figure 3.2 Parser Data Flow............................................20
Figure 4.1 General Statistical Detection Framework.....................26
Figure 4.2 The Probability Space for Statistical Detection.............27
Figure 4.3 Scene Content for Development...............................29
Figure 4.4 Spurious Vector Characterization Framework..................29
Figure 4.5 Probability of Spurious Motion vs. Luminance for a Static Scene.31
Figure 4.6 Mean Spurious Vector Magnitude vs. Luminance for Static Scenes.. 32
Figure 4.7 Erroneous Motion Detection Before Variance Compensation.....33
Figure 4.8 Unfiltered Mean Vector Sum vs. Mean Image Luminance.........36
Figure 4.9 WVTT Detection Threshold and Curve Fit......................38
Figure 4.10 Medium Scene Detection WVTT....................................39
Figure 4.11 Mean Vector Sum vs. Mean Luminance for WVRTT...................42
Figure 4.12 Erroneous Static Scene Detection (T=16)........................43
Figure 4.13 WVRTT Missed Motion for Low Luminance (T=16)...................44
Figure 4.14 Comparison of Medium and Low Luminance Samples.................45
Figure 4.15 Erroneous Motion Detection Using WVRTT (T=32)..................46
Figure 4.16 Motion Detection, Low Luminance (T=64).........................47
Figure 4.17 Missed Motion Due to Excessive Threshold (Motion 5, T=32)......51
Figure 4.18 Excessive Filtering Using the Weighted Threshold (Motion 5)....52

Table 4.1 Motion Scene Comments..................................49
Table 4.2 Filter Error Summary...................................50
Table 4.3 Operation Costs for Determining P of Each Algorithm....53
Table 4.4 Summary of Test Processing Costs.......................54
Table 4.5 Summary of Decision Costs..............................55
Table 4.6 Filter Cost Summary....................................55

1. Introduction
Digital video is increasingly finding its way into many video applications that
were once traditionally held by analog video. Applications such as video
broadcast, long distance learning, video conferencing, portable camcorders, and
motion pictures all make use of digital video. Unfortunately uncompressed digital
video consumes large amounts of storage space when recorded, or transmission
bandwidth when streamed. To address storage and bandwidth issues a number of
video compression standards have been developed. A popular video compression
standard is the Moving Picture Experts Group standard one, or MPEG-1 standard
[1]. MPEG-1 leverages many advances in video compression technology to
create a standards-based format for storing and broadcasting digital video.
MPEG-1 encoding technology has advanced to the point where single chip
hardware encoder implementations have been developed. Such encoder chips
have allowed the development of low cost, stand alone video appliances that can
encode video and record it to a hard disk, or stream it to a network in real-time.
Such appliances find their way into surveillance, long distance learning, or
networked video broadcast applications. Unfortunately, the use of hardware
encoder chips sometimes means that access to the original raw video may not
always be available such that image analysis can be done independent of the
encoder.
An appliance that requires motion detection capability has been developed around
such a hardware encoder. The appliance requires real-time detection and response
behaviors so that various system behaviors can occur such as enabling video
recording to disk, or recording of motion events accurately indexed to MPEG
video streams such that search and retrieval of interesting video is possible.
The video appliance will be the target of the final algorithm implementation, and
is presented in Section 3.1.
Access to motion characterization software that works with MPEG video directly
would aid multimedia description standards such as MPEG-7, which may wish to
operate on any arbitrary MPEG video clip [2]. MPEG-7 defines an XML schema
for describing recorded multimedia content so that multimedia may be searched
and retrieved by its contents like any other text-based document. Clearly video
search and retrieval gives digital video a distinct advantage over traditional analog
recording methods, when considering the amount of time it may require to search
through hundreds of archived VHS cassettes for a specific event.

Understanding the concepts of MPEG compression methodology and bitstream
syntax is necessary when approaching the statistical filter development
presented in Section 4. Section 2 will provide the theoretical foundations of
MPEG bitstream syntax and compression techniques such that filter development
can be explored.
As already stated, the targeted video appliance will be presented in Section 3.1.
An overview of the video appliance's configuration and limitations will be of use
when considering the software architecture and algorithms developed.
A great deal of time was spent developing a robust software architecture that
would accommodate the limited processor and memory resources of the video
appliance, while still allowing access to the MPEG elements required for motion
detection. An overview of the software architecture and its primary requirements
are presented in Sections 3.2 and 3.3.
Finally Section 4 describes the results provided by the software framework,
including filter design and theory, along with presentation of experimental results.
It was observed that the number of spurious motion vectors (discussed in Section
2.2) for a scene increases as luminance decreases, leading to increased difficulty
when attempting to detect motion in medium to low luminance scenes. Accurate
identification of spurious motion vectors under various luminance conditions is
the focus of each of the stochastic filtering tests. As will be discussed in Section
2.2 accurate identification of spurious versus actual motion vectors is difficult due
to the techniques MPEG uses when generating motion vectors. Three tests are
presented, each attempting to improve on its predecessor. The first test is a
basic adaptive threshold test that attempts to compensate for spurious motion
vectors as scene luminance drops. The basic test serves as the baseline for
comparison of stochastic efforts. The second and third tests attempt to identify
and filter spurious motion vectors using measured stochastic observations.
A summary of test results in Section 5, as well as some potential directions for
future research, conclude this paper.

2. MPEG Compression Theory and Bitstream Syntax
In its most basic form MPEG video is simply a bitstream of encoded data
representing a sequence of pictures over time. The MPEG-1 standard for video
prescribes a process for compressing video streams both temporally and spatially,
and encoding the compressed information into standard bitstream syntax. When
first approaching video compression it is sometimes helpful to keep in mind that
any video sequence, be it analog or digital, is simply a series of still pictures that
are presented in rapid sequence. The concept of stringing successive images
together to simulate motion extends back a few hundred years, when people
discovered the human brain, coupled with bio-optical effects, interprets rapidly
moving pictures as an animated or moving sequence. With the invention of the
Kinetoscope by Thomas Edison around 1891, film-based motion pictures were fast
under way to becoming the popular form of entertainment they are today, with
digital video becoming the new format of choice. MPEG-1 video encoding and
syntax relies heavily on the idea of video as a sequence of pictures. The
following sections provide an overview of both MPEG video structure, and
compression techniques specific to the software and filter development discussed
in Sections 3 and 4 respectively.
2.1 MPEG-1 Video Structure
The MPEG video standard specifies a series of hierarchical start codes that are
prefixed by twenty-three '0' bits followed by a '1' bit, followed by 8 bits
representing a numeric code used to interpret the bits following the start code
prefix. Each start code can be thought of as a sub-element of a sequence of
pictures. The start code hierarchy can be divided into start of sequence, group of
pictures, picture, slice, and macroblock layer. Each start code level begins a finer
grained view of the series of pictures, from groups of pictures down to sub-block
representations of each picture. Figure 2.1 illustrates the breakdown of MPEG's
bitstream hierarchy and should be referenced for the following paragraphs.
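In the byte-aligned stream, the 23-zeros-then-one prefix appears as the three bytes 0x00 0x00 0x01, with the next byte selecting the element type. The following is a minimal scanning sketch; the three code values shown are the MPEG-1 sequence-header, group-of-pictures, and picture start codes, while the function name and dictionary are illustrative only:

```python
START_CODES = {           # a few well-known MPEG-1 video start codes
    0xB3: "sequence_header",
    0xB8: "group_of_pictures",
    0x00: "picture",
}

def find_start_codes(data: bytes):
    """Yield (offset, code_byte) for every 0x000001 prefix in `data`.
    A prefix at the very end of the buffer, with no code byte after
    it, is ignored."""
    i = data.find(b"\x00\x00\x01")
    while i != -1 and i + 3 < len(data):
        yield i, data[i + 3]
        i = data.find(b"\x00\x00\x01", i + 3)
```

A scanner like this gives the offsets at which a parser, such as the one described in Section 3, could dispatch element-specific handlers.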

Figure 2.1 MPEG Bitstream Hierarchy
For MPEG elementary video streams, that is, streams that contain only video, the
MPEG hierarchy begins with a start code that indicates a start of sequence (not
pictured). The start of sequence code indicates that a start of sequence header
follows the start code. The start of sequence header contains general information
about the pictures following the start code, including horizontal and vertical
resolution, bitrate, and picture rate information. The start of sequence header
contains little of interest for current motion detection efforts and is therefore
not parsed.
A group of pictures start code follows the sequence header, which is immediately
followed by a group of pictures header. Sequences of pictures are grouped into
blocks of Bipredictive (B) and Predictive (P) frames which are always preceded
with an Intra (I) frame. An I-frame represents a snapshot of the current scene that
can be decoded without reference to any other picture. I-Frames, along with the
other frame types presented here will be discussed in further detail in Section 2.2.
The set of pictures between group-of-picture start codes is termed a group of
pictures, or GOP. The MPEG specification provides numerous ways of defining
the structure of groups of pictures, in terms of picture type depending on

application needs. For this application a (15, 3) GOP structure will be used. A
(15, 3) GOP structure simply specifies 15 frames between each I-frame, and 3
frames between P-frames (or between a P-frame and the preceding I-frame). The
GOP header encodes
fundamental timing information to aid in the display of the GOP as well as
structure information about the GOP. Again the motion detection algorithm
developed has no dependence on the GOP header and therefore the GOP header is
not processed.
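Under the (15, 3) interpretation above, the display-order frame-type pattern of one GOP can be generated in a few lines. This is purely an illustration of the structure, not encoder code:

```python
def gop_pattern(n=15, m=3):
    """Frame types, in display order, for an (n, m) GOP as described
    above: n frames per GOP, a P-frame every m frames, B-frames in
    between, with an I-frame opening the group."""
    return "".join("I" if i == 0 else
                   "P" if i % m == 0 else
                   "B" for i in range(n))

# gop_pattern() -> "IBBPBBPBBPBBPBB"
```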
A picture start code follows the group of pictures header. A picture start code
indicates the beginning of a new picture in the group of pictures and is
immediately followed by a picture header that provides details about how the
picture is encoded so that it may be decoded. Picture header information includes
picture type (commonly I, B, or P frame), as well as information about how
motion vectors are encoded within the picture if the picture is a B or P frame. The
motion detection algorithm makes use of both picture type and motion vector
information and as such both values are parsed by the software framework.
Each picture is further subdivided into slices by slice start codes followed by a
slice header. Slices represent a series of macroblocks in raster scan order. A
picture must have at least one slice, and may contain one or more macroblocks
per slice if desired. The exact semantics of slice length within a picture is left up
to the encoder. The slice header encodes relatively little information beyond the
quantizer scale used for spatial compression, and is not used by the motion
detection algorithm directly. Slices are, however, parsed so that the
macroblocks within the slice can be decoded for access to motion and luminance
information.
Each macroblock is preceded by a macroblock header that provides specific
details about motion vectors (discussed in Section 2.2), and quantization scales
used to encode video. Also contained within a macroblock are six block
structures that represent the discrete cosine transform (DCT) coefficients for the
luminance, and chrominance values used to reconstruct the pixel information at
the decoder. There are four luminance blocks, and two chrominance blocks per
macroblock. Each macroblock encodes a 16x16 pixel region of a picture. The
motion detection algorithm makes use of the DCT coefficients, and therefore the
macroblock and associated blocks are parsed and partially decoded.
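The six-block layout can be made concrete with a short sketch. The function name and the list-of-rows representation are illustrative, not part of the thesis framework:

```python
def macroblock_blocks(y, cb, cr):
    """Split one macroblock's samples into its six coded 8x8 blocks:
    four luminance blocks (top-left, top-right, bottom-left, and
    bottom-right quadrants of the 16x16 region) plus one Cb and one
    Cr block. `y` is 16x16 as a list of rows; `cb` and `cr` are
    already 8x8 because of 4:2:0 chrominance subsampling."""
    def tile(img, r0, c0):
        return [row[c0:c0 + 8] for row in img[r0:r0 + 8]]
    return [tile(y, 0, 0), tile(y, 0, 8),
            tile(y, 8, 0), tile(y, 8, 8), cb, cr]
```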

2.2 MPEG-1 Motion Estimation
MPEG makes use of many methods for data compression; however, temporal
compression through the use of motion estimation is a key contributor for
video compression. MPEG is not the first video encoding
algorithm to make use of temporal compression, and in fact temporal compression
forms the basis of most other video compression techniques [3]. The motion
estimation products of MPEG encoding are decoded and analyzed when
ascertaining the presence of motion for each filter attempt presented in Section 4.
Therefore an understanding of what motion estimation is and how it works is
mandatory in comprehending the filter approach discussed in Section 4.
The idea behind motion estimation is as follows. Imagine a series of images that
show an empty park devoid of motion. As the images proceed in time a ball is
thrown across the scene. For discussion it is assumed the ball first appears in the
second picture of the sequence and has been thrown such that the ball does not
rotate. The goal is to encode the images and send them to a decoder for
reconstruction with as few bits as possible while maintaining image quality.
Consider the following strategy. Send as the first frame a picture of the empty
park, in its entirety. When the ball appears in the second picture only send the
decoder information about the ball, because the decoder already knows about the
park. In other words we tell the decoder to repeat the first image of the park that
it decoded, but add the ball to it at a specified location, sending the bits needed to
represent the ball. Intuitively the ball, being smaller than the image as a whole,
will require substantially fewer bits to describe to the decoder as opposed to a full
image. Thus a substantial number of bits have been saved by not reiterating the
static park background, but instead sending only the difference between the first
and second images. As the sequence progresses the ball changes location as it
moves across the scene, uncovering the park behind it, but covering a new area of
the park. The succeeding frames could be encoded in one of two ways at this
point. As before each image could be compared against the preceding image and
the difference, e.g. the ball, could be sent to the decoder. Alternately the encoder
could simply recognize the ball and send the decoder an X, Y displacement for
the ball from one frame to the next. Because the park scene behind the ball is
already known by the initial reference, and the ball was fully described as it
moved into the scene, the decoder can reconstruct the image without requiring a
full description of either the park or the ball. Because less information is
required, sending an XY displacement is much more efficient than re-encoding
the entire area of the scene where the ball was, and currently is. The XY
displacement vector is called a motion vector in MPEG vernacular. As will be
discussed later in this section, an encoder can be tricked during its block matching

efforts, providing the illusion motion is present in the scene when it is really not.
Therefore the term motion vector is somewhat of a misnomer as the XY
displacement vector is really describing a match vector. Motion vectors describe
how far and in what direction a block of pixels has moved between a reference
image and the current image. Thus motion vectors provide a more optimal
compression solution than sending picture updates in terms of literal bits. The
challenge for the encoder is determining what the ball looks like and where it
moved to. The encoder must perform some kind of search algorithm to match the
ball's current position relative to its previous position in the reference image. The
entire encoding process described is known as motion estimation.
Obviously most real-world encoding scenarios are not as friendly as the
hypothetical ball-across-the-park example. Real encoders often have to deal
with scene changes, lighting differences as objects move, system noise, and any
number of other challenges. In reality an encoder may also encode motion
vectors to indicate not only motion but also frame redundancy due to static pattern
matching, as might occur with a checkerboard. Nevertheless motion estimation
remains one of the prime compressing agents of video compression, providing up
to 50:1 compression ratios for MPEG-1 video [3].
MPEG makes use of motion estimation in a slightly different way than described
for the ball example. Unlike the ball example MPEG attempts to make best use of
both past and future frames in some instances, as well as just past frames in other
instances to provide an optimal encoding and compression. The term "future"
isn't meant to imply that the encoder is clairvoyant, but rather that the encoder
buffers a few frames such that better estimations for the frame being encoded
can be made based on past and future data. Much like the ball example, MPEG fully
encodes a reference frame, called an intra-coded frame, or I frame. I frames
represent an image in its entirety and can be decoded without reference to any
other frame. Frames that make predictions based on only past images are called
predictive frames, or P frames. Frames that make use of both future and/or past
frames are called bi-predictive frames, or B frames. Neither B nor P frames
can be decoded without reference to other frames in the sequence. They both
use the same motion estimation techniques, but B frames may use a past and
future reference frame for block matching, averaging the results of the
comparison. The use of both past and future frames in this manner can lead to
improved block match accuracy, down to quarter pixel resolutions [4].
MPEG, as with many other video and image compression algorithms such as
JPEG, breaks an image into smaller more manageable pieces called macroblocks.
Macroblocks represent a 16x16 pixel region of an image. Because MPEG

specifies that raw video be in a 4:2:0 format, a macroblock comprises a 16x16
array of luminance values, with two 8x8 arrays of chrominance data [4]. Each of
the macroblock sub-arrays is called a block. Most video encoders use a block
match heuristic, with a search algorithm to estimate motion. The idea is to
compare a block from the current frame to blocks of a reference frame using a
block match heuristic to determine the best match, and then encode the best match
as a motion vector.
There is a clear trade off between the computational effort required for a search
and the quality of the match. If more effort is spent comparing blocks more
accurate assessments of motion can be made, which leads to better predictions,
thereby using fewer bits. Encoder designers must weigh quality and encoding
effort carefully when selecting an algorithm for use. Furthermore, as MPEG only
specifies the output semantics for encoding motion vector information and not the
specific algorithm implementation, encoder compression efficiency can vary,
which is why a very specific platform is targeted for the motion detection filter
under development for this research. A poor motion estimation routine may miss
potential matches and therefore need more bits, whereas a good motion estimator
is able to match blocks more frequently and leverage better compression by
encoding motion vectors more accurately.
2.3 Block Matching
Before a search can be performed it is necessary to formulate a heuristic by which
blocks can be tested against for a match. For each block in the current picture a
match metric is computed for some subset of blocks in the reference picture. The
metrics are then compared and the closest match is used to compute an optimal
motion vector. Two of the most commonly used block matching metrics are
Mean Absolute Difference (MAD), and Mean Square Difference (MSD) [5,6].
2.3.1 Mean Absolute Difference (MAD)
Mean Absolute Difference is computed by summing up the absolute values of a
direct pixel-by-pixel difference within the block. Equation (2.1) presents MAD
explicitly:
$$\mathrm{MAD} = \frac{1}{mn}\sum_{p=1}^{m}\sum_{q=1}^{n}\bigl|A[p,q] - B[p,q]\bigr| \qquad (2.1)$$

where A and B represent two blocks of pixels that are m by n in size. A[p,q]
represents the value of the pixel in the p-th row and the q-th column. For two
exactly identical blocks MAD is zero, with MAD increasing as the blocks
increasingly differ. Thus when we look for the best possible match we take the
lowest MAD value found. Note that the Mean Absolute Difference is also
sometimes known as the Mean Absolute Error (MAE).
MAD is commonly selected for motion estimation techniques when using a
Pyramid Search algorithm, which follows. Because MAD works with a
summation of absolute pixel differences, texture information about the image is
lost, resulting in a loss of block match accuracy for highly redundant average
values across an image [7].
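Equation (2.1) translates directly into code. The following is a minimal Python sketch, operating on blocks given as plain lists of rows rather than any particular image representation:

```python
def mad(a, b):
    """Mean Absolute Difference between two equally sized blocks of
    pixels, each given as a list of rows. Identical blocks score 0;
    the lower the MAD, the better the match."""
    m, n = len(a), len(a[0])
    total = sum(abs(a[p][q] - b[p][q])
                for p in range(m) for q in range(n))
    return total / (m * n)
```

In a search loop only the ordering of the scores matters, so the 1/(mn) normalization is often dropped, giving the Sum of Absolute Differences (SAD).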
2.3.2 Mean Square Difference (MSD)
The Mean Square Difference is very similar to the Mean Absolute Difference
except the difference between pixels is squared. Explicit definition of the MSD is
presented in equation (2.2):

$$\mathrm{MSD} = \frac{1}{mn}\sum_{p=1}^{m}\sum_{q=1}^{n}\bigl(A[p,q] - B[p,q]\bigr)^{2} \qquad (2.2)$$

where A, B, p, and q are equivalently defined per the MAD.
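A sketch of equation (2.2) differs from the MAD version only in squaring the per-pixel difference, which penalizes large mismatches more heavily:

```python
def msd(a, b):
    """Mean Square Difference between two equally sized blocks of
    pixels (lists of rows): the per-pixel difference is squared
    before averaging."""
    m, n = len(a), len(a[0])
    total = sum((a[p][q] - b[p][q]) ** 2
                for p in range(m) for q in range(n))
    return total / (m * n)
```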
2.3.3 Block Comparison Performance
Encoder developers are again faced with a trade off when considering block
matching metrics versus block match search algorithms. If we consider a single
block comparison using the MAD algorithm, the number of operations performed
for a 16x16 macroblock would be n² (assuming n = m for each macroblock) add
and subtraction operations for the luminance set, or 2n² operations in all for
luminance data, assuming n = 16. Coupled with two 8x8 computations for
chrominance we have an additional 2c² operations, letting c be the width of a
square chrominance block. This computes to 2·16² + 2·8² = 512 + 128 = 640
operations per block comparison. Assume a search range of plus/minus 32 pixels
per block is desired. A total of 32 · 32 = 1024 blocks are therefore available for
comparison, which leads to 655,360 operations to complete a search. Typically
the number of block comparisons becomes a restricting factor in terms of
processing requirements [1]. For this reason many encoders perform the chosen

block comparison algorithm on the luminance components of each pixel instead
of the full set of pixel components [5]. This allows the encoder to search a wider
area of the image for more likely block matches and therefore improved image
quality.
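The operation-count arithmetic above can be checked in a few lines; the variable names exist only for this calculation and are not from any encoder API:

```python
n = 16                        # width of a (square) luminance macroblock
c = 8                         # width of a (square) chrominance block
ops_per_comparison = 2 * n**2 + 2 * c**2   # adds + subtracts, Y and C
candidates = 32 * 32          # candidate blocks in a +/-32 pixel range
total_ops = ops_per_comparison * candidates
print(ops_per_comparison, total_ops)       # 640 655360, as in the text
```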
Encoders that rely on the luminance information alone can present a problem
when developing a motion detection algorithm based on motion estimation
results. For example when an image is completely dark (e.g. all luminance values
are ideally 0) how does one concretely say which black block best matches
another? This can lead to substantial "motion vector noise", as was observed in
the first basic algorithm attempt during this research. Motion matches that do not
correspond to true physical motion may not be detrimental from a compression
standpoint, but they can wreak havoc on a motion estimation algorithm, and
therefore are of concern if motion vector information is to be used for a motion
detection algorithm.
2.4 Searching for Matching Blocks
Armed with a means of comparing blocks we now turn to the problem of how to
search for matching blocks. Choosing the proper search algorithm is critical if a
good motion estimation, and therefore compression ratio, is to occur. If a search
algorithm is suboptimal, motion may be missed that could potentially be encoded
in a more compact form, thereby wasting precious bits that could be used to
improve image quality. By contrast if large quantities of processing speed are
required the cost of the encoder hardware is increased. Thus there is a tradeoff
between search effort and encoder cost.
It is useful to look at some of the more common search algorithms that are widely
used to gain some understanding of how the encoder may behave under certain
situations. Generally speaking, when performing a search it can be shown that
better block match results can be obtained if not only full pixel offsets between a
reference and comparison image are considered, but also half pixel offsets.
Interpolation is used to predict half-pixel macroblocks for purposes of this
enhancement [4,5].
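The half-pixel prediction mentioned above can be sketched as a simple bilinear average. This sketch uses truncating division rather than the exact rounding the standard prescribes, and assumes the caller stays inside the image bounds:

```python
def half_pel(img, r2, c2):
    """Sample `img` at half-pixel resolution. (r2, c2) are coordinates
    in half-pixel units, so (3, 5) addresses row 1.5, column 2.5.
    Half-pel samples average the bracketing full pixels; full-pel
    positions fall through to the pixel itself."""
    r, c = r2 // 2, c2 // 2      # integer part of the position
    dr, dc = r2 % 2, c2 % 2      # 1 where a half-pel offset applies
    total = (img[r][c] + img[r][c + dc] +
             img[r + dr][c] + img[r + dr][c + dc])
    return total // 4
```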
2.4.1 Pyramid Search
A pyramid search is the most commonly used algorithm for motion estimation,
and is in fact the search algorithm used by the hardware encoder for the targeted
video appliance. The pyramid search is popular because it strikes a balance

between match accuracy and computing requirements [3,5]. A pyramid search
begins by low-pass filtering and sub-sampling the reference and comparison
images a number of times. The search then begins by comparing the lowest
resolution image pair using a local search effort, with the area of locality largely
dependent on how far we expect an object to move within the image. The results
of the lowest resolution comparison are fed into the next resolution level's
comparison search for refinement. The process repeats until we reach the original
resolution images and a motion vector result is produced. Figure 2.2 illustrates
the pyramid process.

[Figure: the reference and comparison images are repeatedly low-pass filtered
and sub-sampled; search results propagate back up through the levels.]
Figure 2.2 Pyramid Search
Note that when the frame is subjected to a low-pass filter the macroblock
luminance information tends toward an overall DC luminance value. When we
subsample the image we effectively quarter the total search space, greatly
reducing the computation cost of block matching. By using the subsampled data
to seed the next stage of the search algorithm the overall search space is again
limited, yielding increasingly better results as we unwind our filtering and
subsampling.
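The coarse-to-fine behavior described above can be captured in a minimal two-level sketch. All names and the 2x2 averaging are assumptions for illustration; the hardware encoder's actual pyramid is proprietary, but the seed-and-refine structure matches the description:

```python
# Illustrative two-level pyramid search: subsample both images, find the
# best coarse offset, then refine it at full resolution near that seed.

def subsample(img):
    """Average 2x2 neighborhoods, halving each dimension."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) // 4
             for x in range(w)] for y in range(h)]

def sad(ref, cur, bx, by, dx, dy, bs):
    """Sum of absolute differences between a block and its shifted match."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(ref[by+y+dy][bx+x+dx] - cur[by+y][bx+x])
    return total

def best_offset(ref, cur, bx, by, bs, cands):
    h, w = len(ref), len(ref[0])
    valid = [(dx, dy) for dx, dy in cands
             if 0 <= bx+dx and bx+dx+bs <= w and 0 <= by+dy and by+dy+bs <= h]
    return min(valid, key=lambda d: sad(ref, cur, bx, by, d[0], d[1], bs))

def pyramid_search(ref, cur, bx, by, bs, rng):
    # Coarse level: half resolution, half block size, half search range.
    cands = [(dx, dy) for dx in range(-rng//2, rng//2 + 1)
                      for dy in range(-rng//2, rng//2 + 1)]
    cdx, cdy = best_offset(subsample(ref), subsample(cur),
                           bx//2, by//2, bs//2, cands)
    # Fine level: refine the scaled-up coarse result by +/-1 pixel.
    seeds = [(2*cdx+ex, 2*cdy+ey) for ex in (-1, 0, 1) for ey in (-1, 0, 1)]
    return best_offset(ref, cur, bx, by, bs, seeds)
```

Seeding the full-resolution search with the scaled coarse result restricts the fine search to a 3x3 neighborhood instead of the full window, which is where the computational savings come from.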
2.4.2 Full Search
A full search algorithm exhaustively tests every possible block for a match within
a search range. While extremely accurate, the number of comparisons required is
usually computationally prohibitive, being on the order of 23 million operations
per block search over a 300 x 150 pixel region [1].
2.4.3 Restricted Search
A restricted search is, as the name implies, a class of search algorithm that
attempts to use prior knowledge or assumptions about the image data in question
to limit the search space that must be examined for each block in the image. A
number of restriction criteria are possible. For instance, only blocks that exhibited
motion in the reference frame could be examined, or an extremely coarse search
could be performed first, followed by a finer search in areas that appear to exhibit
motion per the coarse search. A pyramid search is very much a restricted search in
this sense.
2.5 Spatial Compression
Apart from temporal compression through motion estimation, MPEG also makes
use of intra-frame compression techniques to compress individual pictures. Three
techniques are used for spatial compression: spatial-to-frequency transforms,
quantization, and run-length Huffman encoding. The spatial compression process
is summarized in Figure 2.3 and is actually a slightly modified form of the JPEG
image compression standard, optimized for video sequences.

Figure 2.3 Spatial Compression Signal Chain
Spatial to frequency transforms are performed using the two dimensional discrete
cosine transform (DCT). The DCT is a derivative of the Fourier Transform and
can be expressed as shown in equation ( 2.3).
F(u, v) = (1/4) C(u) C(v) Σ_{x=0}^{7} Σ_{y=0}^{7} f(x, y) cos((2x+1)uπ/16) cos((2y+1)vπ/16)    ( 2.3 )

x, y are spatial coordinates in the pixel domain
u, v are coordinates in the transform domain
C(u), C(v) = 1/√2 for u, v = 0
C(u), C(v) = 1 otherwise
The DCT takes as input an 8x8 block of pixel data, and generates an 8x8 block of
coefficients representing the frequency components of the original data. The
DCT has the effect of compressing all of the spatial signal energy toward the
upper left quadrant of the frequency array. The coefficient at location (0,0) is
more commonly known as the DC coefficient and represents the average value of
the source array. The DC coefficient becomes important when analysis of image
luminance vs. spurious motion vectors is performed.
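Equation (2.3) can be implemented directly for illustration (real encoders use fast factored DCTs, but the output is identical). Note that for a uniform 8x8 block of value k the DC coefficient comes out as 8k, which is why it can serve as a stand-in for the average block luminance:

```python
import math

# Direct (slow) evaluation of the 2-D DCT of equation (2.3) on an
# 8x8 block of pixel values f; returns the 8x8 coefficient block F.

def dct_8x8(f):
    def C(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (f[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            F[u][v] = 0.25 * C(u) * C(v) * s
    return F
```

For a uniform block of 100s, F[0][0] evaluates to 800 while every AC coefficient is essentially zero, matching the "energy compacted toward the upper left" behavior described above.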

The DCT is applied to a 16x16 macroblock by dividing the 16x16 pixel array into
luminance and chrominance values. MPEG specifies that chrominance must be
subsampled in 4:2:0 format, or two chrominance values per four luminance
values. Chrominance comprises two components, Cr and Cb. As the macroblock
is 16x16 in the luminance dimension, with two 8x8 arrays of chrominance, a
total of six 8x8 blocks are DCT transformed for each macroblock.
After a DCT transform is performed on each 8x8 data block, the blocks are
compressed by applying a frequency dependent quantization value. The
quantization value resamples each DCT coefficient to a coarser scale, allowing
fewer bits to represent a given range of values. Quantization is a lossy process
and therefore degrades the original image. The MPEG specification provides
numerous quantization tables that are applied according to the data being
compressed to attempt to preserve picture quality as much as possible, while still
maintaining a designated encoded bit rate. One exception to the use of variable
quantization tables is the quantization of the DC coefficient which is always
quantized by a factor of 8 because the human eye is exceptionally sensitive to
luminance changes [4]. After quantization many of the higher frequency
coefficients (those toward the lower right frequency quadrant) are reduced to zero.
Finally the quantized coefficients are run-length Huffman encoded. Because there
is, on average, high correlation between adjacent blocks, the DC coefficients are
encoded differentially with respect to one another. Again this reduces the maximum range
of values that must be encoded, thereby reducing the run length encoding bits
needed for each value. After run length encoding the image information is
inserted into the MPEG stream.
The decoding process, which must be at least partially undertaken if luminance
data is to be used for improving the motion detection algorithm, is the reverse of
the encoding process. Computing expense is kept to a minimum for motion
detection by only considering the DC coefficient (the average block luminance),
thereby avoiding a more expensive inverse DCT.

3. Encoding Apparatus and Software Architecture
The overall data acquisition and analysis apparatus relies on both hardware and
software components. The hardware component is an off-the-shelf video
acquisition and compression platform. The software framework was developed
during this research and represents a scalable MPEG parsing, decoding, and
analysis framework.
3.1 Encoding Hardware Platform
The encoding platform is a Vantum Cld networked video appliance, also known
as a Vantum Video Appliance (VVA). The Cld video appliance comprises a
complete end-to-end solution for video capture and encoding, ideal for
surveillance applications. Figure 3.1 illustrates the video capture and encoding
chain.
Figure 3.1 Encoder Configuration
The encoding chain comprises a Lens & CCD assembly, a video A/D, an
LSILogic MPEG encoder chip, and a host processor which consumes the encoded
video stream from the encoder. Supporting components, such as memory and
clocks have been omitted for brevity.

Image capture begins with the lens, which focuses light onto a CCD array. A
Tokina model TVR 3314 varifocal lens was used. The TVR 3314 has a 3.3-8mm
focal length and up to a 1.4F aperture. The lens is mounted on an off the shelf
camera board assembly that comprises a CCD array and electronics assembly to
convert the CCD output to an NTSC video signal.
The Video A/D is a Philips Semiconductors SAA7113HB 9-bit video decoder
which converts the analog output of the camera board module into an ITU 656
digital video stream. The Video A/D is accurate to 1 LSB on a 9 bit output bus.
The LSILogic encoder is a DVX-IDVX-BB1 video encoder. The DVX is a
specialized video encoding chip capable of encoding multiple video formats by
downloading encoder specific microcode. MPEG-1 version 1.02 microcode
provided by LSILogic was used for data compression on all sample sequences
acquired. The MPEG-1 microcode can consume ITU 656 video from the video
A/D and compress it into an MPEG-1 video stream in real-time. The compressed
output is exposed to the host processor via a PCI bus master transfer into a shared
SDRAM memory. The DVX makes only the compressed MPEG video available
for host use, hence the motivation for developing a motion detection algorithm
that operates on MPEG compressed video. For purposes of motion detection algorithm
development the encoder was configured to stream at one megabit per second,
SIF resolution, with a 15,3 closed group of pictures (GOP) configuration: 15
frames between each I-frame, with a P-frame every third frame. Thus the
encoding sequence is IBBPBBPBBPBBPBB, repeating thereafter. The 15,3 GOP
structure was chosen so that proprietary frame rate adjustment software could be
implemented; therefore the GOP structure is inflexible.
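The 15,3 pattern can be generated mechanically, which makes the structure explicit (a small illustrative sketch in display order):

```python
# Frame types for a closed n,m GOP in display order (illustrative):
# frame 0 is the I-frame, every m-th frame thereafter is a P-frame,
# and the remaining frames are B-frames.

def gop_pattern(n=15, m=3):
    return ''.join('I' if i == 0 else ('P' if i % m == 0 else 'B')
                   for i in range(n))
```

With the defaults this reproduces exactly the IBBPBBPBBPBBPBB sequence stated above.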
The host processor is an IDT79RC32V334-150BB from IDT semiconductors.
The 79RC32 is a MIPS-R3000 based RISC processor running at 150MHz. The
79RC32 will be running the final algorithm for motion detection, and will
examine the MPEG-1 compressed video from the DVX video encoder. The
79RC32 performs a variety of other functions in the system, including network
streaming, data recording, web hosting, and audio compression. As a result the
number of CPU cycles and memory available for motion detection is extremely limited.
3.2 MPEG Software Framework
The MPEG software framework is responsible for consuming MPEG-1 video,
analyzing the video bitstream, and determining whether motion is present within
the frame. The framework was developed with the VVA target hardware in mind
for motion detection, as well as to provide a backbone upon which MPEG
analysis tools and algorithm development could proceed efficiently.
3.3 Algorithm Constraints
As with any software effort it is desirable to start with some kind of requirements
and constraints to help guide design and program output. Having the requirements
explicitly stated helped to clarify the exact goals of the project. The following
list summarizes the requirements and constraints used for this effort:
- MPEG-1 compressed video as input
- Encoded MPEG-1 video will be at SIF resolution (352x240)
- Encoded MPEG-1 video will be at thirty frames per second from an NTSC source
- Encoded MPEG-1 video will be at 1 megabit per second
- Minimal use of memory and CPU cycles
- Output a simple Boolean indicating motion is present or absent
- Motion detection must be done in real-time
- Captured scene content will not contain scene changes, and will contain
relatively consistent backgrounds, as per a standard surveillance installation
- Camera will be fixed mounted; motion detection for this effort need not
compensate for camera motion
- Calibration of the motion detection filter is allowable for each individual
camera installation
Some requirements, such as memory and CPU utilization, are exceedingly
difficult to quantify in explicit terms; therefore a best effort was made during
software design and implementation to achieve these goals. The primary focus is
development of a reliable detection scheme for fixed point surveillance
applications.
3.4 Software Architecture
While a variety of open source MPEG decoding libraries are available, none of
these libraries were written with memory and CPU constraints in mind, as they
target resource rich PC or workstation platforms and full decode capability. It
was also deemed an exceedingly tedious task to extract only the pertinent code
required for motion vector extraction, and extracting motion estimation code
from a full decoder tended to lead to a software solution that was difficult to
maintain and extend for various algorithmic approaches. In some instances open
source licenses would also require full disclosure of any modified form of the
original decoder, which is not desirable for commercial deployment.
To address the issues of maintenance, extensibility, and licensing concerns an
MPEG parsing framework was constructed. The MPEG framework developed
for this project is somewhat unique in that it supports a completely pluggable
architecture, thereby allowing developers to choose exactly what portions of the
MPEG stream they wish to parse and then write modules to handle parsed video
in whatever fashion desired.
By designing a pluggable framework full control of desired processing is possible
by removing unnecessary modules from the framework, allowing minimization of
memory and CPU utilization. Furthermore it was possible to quickly implement
video analysis modules to aid in statistical profiling of encoder output, and
various motion detection modules. Because the tools and algorithm variants are
all pluggable modules they can easily be removed upon algorithm deployment,
saving valuable memory resources.
The architecture of the MPEG parsing framework revolves around the structure of
the MPEG bitstream. A necessary requirement for the developers of the MPEG
standard was the ability to seek through encoded video. The MPEG standard
specifies a series of hierarchical start codes: each begins with a prefix of
twenty-three 0 bits followed by a 1, followed by eight bits forming a start code
value which is used to interpret the following bits. MPEG, as stated previously,
is a hierarchy of start codes, where each start code can be thought of as
introducing a sub-element of a contiguous series of pictures.
The software framework leverages the natural elements defined by the MPEG
start codes to provide a single threaded, event driven, processing model. Each
MPEG element: sequence, GOP, picture, slice, and macroblock, has its own
associated parser and entity object. Each time an element is successfully parsed
the resulting MPEG element is handed to an onElement handler for further processing.

Figure 3.2 Parser Data Flow
Figure 3.2 illustrates data flow within the parser framework. An MPEG
elementary video stream is fed into the parser framework. A start code parser
locates start codes, interprets each one, and invokes the appropriate element
parser. When element parsing is complete an entity object
representing the MPEG element is passed to a handler event method which can
utilize the entity as desired.
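A minimal start-code scanner illustrates the seek behavior the framework relies on; this is a hypothetical helper, not the thesis framework's code:

```python
# Hypothetical sketch: walk an MPEG-1 elementary stream and report
# each start code prefix position (bytes 00 00 01) together with the
# code byte that follows it, which selects the element parser.

def find_start_codes(data: bytes):
    codes = []
    i = 0
    while True:
        i = data.find(b'\x00\x00\x01', i)
        if i < 0 or i + 3 >= len(data):
            return codes
        codes.append((i, data[i + 3]))  # (offset, start code value)
        i += 3
```

Because the prefix is byte-aligned and cannot occur in legal coded data, a scanner like this can resynchronize (or seek) from any point in the stream.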
Note that the Macroblock Parser and Block Parser are not members of the start
code processing chain. This is because neither macroblocks nor blocks begin
with start code prefixes; both are instead decoded directly from the MPEG
bitstream so that motion vector and luminance information may be extracted.
In theory any of the parsing agents could be arbitrarily removed, reducing the
amount of processing overhead and memory required. Unfortunately there are

some dependencies in the parsing chain where certain segments of the bitstream
cannot be parsed without further information from higher layers in the hierarchy.
Lower levels of the hierarchy rely on higher levels and therefore parsers must be
provided for higher level elements, if lower level elements are to be parsed. For
this project it was not necessary to parse any of the sequence header, or GOP
header packets thus those parsers were not implemented.
3.4.1 Element Handling
By and large, the element handlers contain enough implementation detail that
the MPEG streams can be at least partially decoded. The final
framework includes several event handlers, each with a specific purpose in the
analysis and motion detection chains. Specifically the framework implements an
onPicture, onSlice, onMacroblock, and onBlock handler. Each handler delegates
to pluggable sub-processing modules that perform stream analysis, pre-processing
for motion detection, or actual motion detection. Following a top down approach
the event handlers behave as follows:
The onPicture handler is called each time the parser completes parsing of a
picture start code and header. The onPicture handler checks the previous picture
type for two cases. If the previous picture was an I-frame then a new average
image luminance array is applied to several data analysis modules, a motion
vector filter, and finally a threshold generator. If the previous picture was a P-
frame then a new motion vector magnitude array is computed, filtered, and finally
checked for motion. Finally the analysis modules are each updated with new
picture information and the new picture type is saved to a local state variable for
the next onPicture event.
The onSlice handler is called each time a new slice start code and header are
parsed. The onSlice handler simply makes a call to an average luminance
computation module. The luminance computation module resets its internal DC
coefficient state to a known starting value for each slice, as defined by the MPEG
specification.
The onMacroblock handler updates the state of a motion decoder that decodes
motion vector magnitude information from the macroblock. The average
luminance module is also updated allowing DC coefficient information to be
parsed from the block for luminance decoding.

Finally the onBlock handler updates the average luminance module so it can
complete luminance decoding and build an average luminance image for the frame.
Together these events and associated sub-modules decode motion vector and
luminance information, which are in turn used for motion detection.
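The pluggable handler model described above can be sketched as a small subscription bus; every name here is an assumption, since the framework's actual interfaces are not reproduced in the text:

```python
# Illustrative event bus for the pluggable element-handler model:
# parsers emit elements by type, and processing modules subscribe
# only to the element types they need.

class EventBus:
    def __init__(self):
        self.handlers = {}

    def subscribe(self, element_type, handler):
        self.handlers.setdefault(element_type, []).append(handler)

    def emit(self, element_type, element):
        # Element types with no subscribers are simply ignored,
        # which is how unneeded parsing work drops out of the chain.
        for handler in self.handlers.get(element_type, []):
            handler(element)
```

Modules register for the element types they care about (pictures, slices, macroblocks, blocks), and removing a module removes its processing cost, mirroring the memory and CPU minimization goal stated above.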
3.4.2 Element Handler Processing Modules
Rather than implementing a monolithic block of code a series of processing
modules were developed to handle different aspects of the MPEG stream as
different elements arrived. Each module is plugged into a global event handling
framework that delegates events as they come in, as well as coordinating
inter-module communication. As previously mentioned, several modules exist,
and the sum can be broken into four categories: decoding, analysis,
pre-processing, and motion detection modules.
The decoding modules comprise a motion vector decoder, and DC luminance
decoder. As we are targeting a low memory, bandwidth limited CPU
environment it became clear that decoding every motion vector and luminance
value available could become prohibitive. For this reason a few concessions for
both motion vector, and DC luminance values were made when decoding the
MPEG stream.
Recall motion vectors appear in both P and B frames. P-frames reference a single
picture, while B-frames reference one or two frames. Because P-frames always
reference a single image, their motion vectors require fewer operations to decode.
Furthermore the use of P-frames limits the number of images per second we must
analyze for motion, due to the chosen 15,3 GOP structure per Section 3.3. Recall
every 15th frame is an I-frame, while every third frame within the 15 is a P-frame.
Since frames arrive at 30 frames per second and we wish to detect motion in
real-time, but are limited in CPU cycles by our target hardware platform, we
allow roughly 100ms of time for motion processing by choosing P-frames over
the more frequent B-frames, which would only allow 33 to 66ms of time. By
reducing the number of frames we analyze we have effectively reduced our
motion detection sampling rate, and therefore become less sensitive to motion in
a scene; to satisfy the real-time and platform constraints this loss of sensitivity
is deemed acceptable for a near term implementation of a detection algorithm,
because it is assumed motion will be present in the scene for more than 100ms at
any given time. Given a faster processor we could, in theory, process all motion
vectors for B and P frames, increasing sensitivity. However for surveillance

applications where the moving objects of interest are likely to be present in the
scene for more than 100ms, there is relatively little benefit to this exercise as a
means of increasing sensitivity.
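The timing figures above follow from simple arithmetic on the 15,3 GOP at 30 frames per second:

```python
# Back-of-the-envelope processing budgets for the 15,3 GOP at 30 fps,
# matching the figures quoted in the text (illustrative arithmetic only).

FPS = 30
FRAME_MS = 1000 / FPS              # ~33 ms between captured frames

b_frame_budget_ms = 1 * FRAME_MS   # B-frames can arrive every frame: ~33 ms
p_frame_budget_ms = 3 * FRAME_MS   # P-frames arrive every 3rd frame: ~100 ms
luma_update_ms = 15 * FRAME_MS     # I-frames arrive every 15th frame: ~500 ms
```

Choosing P-frames therefore triples the per-analysis time budget relative to processing every frame, at the cost of a lower motion sampling rate.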
As future sections will show, and as expected per discussion in Section 2.3, it was
discovered that the DVX encoder becomes extremely sensitive to low lighting
conditions, generating a substantial amount of spurious motion vectors. The
initial static filtering efforts turned into more of an adaptive filtering problem with
the filter parameters adjusting to current lighting levels. A luminance decoder
module was implemented to acquire the necessary luminance data from the image
such that the motion detection filter could be dynamically adjusted according to
source data.
Luminance in the compressed image has a much higher resolution than motion
vectors, due to the four 8x8 blocks of luminance data per macroblock versus a
single motion vector per macroblock. Furthermore luminance data can be
updated on any frames should a B or P frame exhibit a macroblock that cannot be
encoded by a motion vector. Observe that luminance may therefore be updated
every frame should the encoder be unable to make a successful block match, or
the image changes completely.
For a real-time processing scenario, should luminance decoding be done for every
frame, it is implied that we cannot exceed 33ms per image luminance update.
Again CPU limitations prevent updates at this rate. For simplicity it was decided
that the average image luminance be updated every I-frame, or every 15 frames.
Updating luminance values at this rate means that the filter parameters are only
updated every half second (15 frames × 33ms per frame ≈ 500ms). Thus if a sudden
lighting change takes place during a sequence of images, or a scene change
occurs (as if we were encoding a motion picture), we may erroneously detect
motion because our adaptive filter parameters are no longer valid for the new
average picture luminance. As the VVA is targeting surveillance and similar
applications and not motion picture encoding, this is an acceptable trade-off at
present: scene changes will not occur, and a sudden lighting change, while
generating an erroneous motion event, will be pessimistic in nature and not
detrimental to detecting true motion.

4. Motion Detection Filter Development
4.1 Overview
Two approaches were taken when developing the motion detection algorithm for
MPEG. The first approach, termed a Sum and Threshold Test, assumed basic use
of the encoder's motion estimation routines: that those routines were near perfect
and thus encoded motion vectors representing actual motion in the scene. The
second approach used statistical modeling to filter out spurious motion vectors
so that the basic approach with an adaptive threshold would work. The basic
approach resulted in successful motion detection on only high luminance scenes,
at an extremely minimal cost in processing and memory. Empirical observation
of the basic detection method's behavior indicated difficulties ascertaining the
presence of motion in low lighting conditions due to spurious motion vectors.
Two statistical models were devised to filter spurious motion vectors based on
image luminance, such that detection was more accurate at lower image
luminance values.
4.2 Sum and Threshold Test
The Sum and Threshold approach was devised based on the assumption that the
encoder's motion estimation routines accurately assessed motion in a scene. For
each vector the magnitude was computed, and all vector magnitudes were
summed across each image. Equation ( 4.1 ) describes the sum and threshold
computation:

s = Σ_{i=1}^{n} √(x_i² + y_i²)    ( 4.1 )

where n is the number of motion vectors in the image and x_i and y_i represent
the x and y displacement values of the i-th motion vector. The sum s was then
compared against a pre-determined threshold. If s was greater than the threshold
then motion was indicated as present, and when s was less than the threshold
motion was indicated as absent.
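The Sum and Threshold test amounts to a few lines of code; the default threshold below reflects the empirically tuned value of about 4000 discussed in this section (the function name is illustrative):

```python
import math

# Sketch of the Sum and Threshold test of equation (4.1): sum the
# magnitudes of all motion vectors in a picture and compare against
# a fixed threshold.

def sum_and_threshold(vectors, threshold=4000.0):
    """vectors: iterable of (x, y) motion vector displacements."""
    s = sum(math.hypot(x, y) for x, y in vectors)
    return s > threshold
```

A handful of small vectors (noise) stays below the threshold, while a large aggregate displacement trips it.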
The sum and threshold approach was found to produce acceptable results for a
very strict class of picture sequences. Specifically, sequences of pictures that
contain no scene changes and a relatively high average luminance responded
extremely well to the sum and threshold method. For purposes of
experimentation the threshold was manually tuned by adjusting the threshold level
until various static scenes produced no false positives, while introducing moderate
amounts of motion into the scene generated motion events fairly accurately. A
threshold of about 4000 was the end result. It was observed that as the average
picture luminance fell, that is, as the sequence of pictures became darker, the sum
and threshold technique broke down and began reporting numerous false
positives, ultimately detecting motion at all times under near dark conditions.
Analysis showed that as the average picture luminance decreased the number and
magnitude of erroneous motion vectors increased. The specifics of the luminance
analysis will be presented during statistical filter development in Section 4.3.
A recap of the thought experiment from Section 2.3.3 illustrates why the sum and
threshold breaks down under low lighting conditions. Imagine two pictures in a
sequence of pictures that are to be compared using block matching and motion
estimation by a hypothetical encoder. Both pictures are perfectly black, that is to
say all luminance values in both images have a value of zero. Therefore when
comparing any block with another block it is determined that all blocks in the
image match, including the block at the origin of the search. It is therefore easiest
just to encode a motion vector with a magnitude of 0, to indicate that the decoder
need not consider the block for spatial location translation at all. Under this
condition MPEG, in fact, does not encode any macroblock information at all to
indicate to the decoder no changes are necessary for the new frame. However
when real system issues are considered, such as thermal noise in the CCD array,
and noise in the acquisition channel from the video amplifier and analog to digital
converter, block matching becomes more difficult as block comparisons no longer
yield perfect comparisons. How does the encoder distinguish one black block
from another in the presence of noise? The result is spurious motion vectors as
the encoder is tricked into believing it sees motion based on noise, rather than
actual image signal.
The observations from the basic sum and threshold approach were used to guide
statistical filter development, resulting in a filter that provides moderate
improvement for motion detection across a wide range of luminance conditions.

4.3 Statistical Filtering Approach
Based on the results of the sum and threshold experiment several methods were
tested in an attempt to improve motion detection for a wider range of scene
luminance values. A stochastic approach was taken because stochastic methods
are most easily applied when system noise is independent identically distributed,
as would seem the case for motion vector analysis.
[Figure: an Image Luminance Decoder feeding a Detection Threshold Generator]
Figure 4.1 General Statistical Detection Framework
Figure 4.1 shows the generic detection framework that served as the basis for each
statistical filtering method. Moving from right to left observe that the sum and
threshold test forms the basis for the general statistical detection framework with
two exceptions. First the detection threshold is being dynamically generated
based on the mean image luminance. The mean image luminance is updated on
each I-frame for reasons described in Section 3.4. Second the motion vectors in
each P-frame go through a spurious motion filter before being summed. The
problem of accurately detecting motion using the general framework presented
now breaks down into two sub-problems. The first problem is determining a set
of rules to identify an observed motion vector as spurious. The second
problem is determining how to set the detection threshold such that static scenes
do not generate excessive motion events, while actual motion events are correctly
detected.

4.3.1 Spurious Vector Characterization
Formally, the problem space for characterizing spurious motion vectors, such that
suitable filtering criteria could be deduced, is defined as follows. Define the
variable S_l to be the set of all possible outcomes for an observed motion vector
at luminance l. For motion detection we define two possible events on the set S_l:
motion is present, and motion is not present, which we will call m and m^c
respectively. Define A to be the set of all subsets of S_l. Figure 4.2 illustrates the
probability space upon which we will operate.
Figure 4.2 The Probability Space for Statistical Detection
In order to characterize the behavior of the sample space S_l it is necessary to first
decide upon a reasonable experiment that allows observation of motion vector
behavior for known motion conditions under different lighting conditions. As
each S_l comprises two events m and m^c, measuring the characteristics of
either m or m^c yields, by exclusion, the characteristics of the other. After careful
thought it was decided that characterization of spurious motion when no motion
was present would be the easiest case to empirically measure, as characterizing
actual motion would be dependent on too many factors, including object size,
speed, and texture.
The spurious characterization experiment was to acquire five samples of a static
scene under different lighting conditions. The contents of the sample static scenes
were carefully selected in an attempt to reduce pattern redundancy. Certain scene
contents could compromise the validity of measured results by causing mismatches other

than those attributed to image luminance. For example a checkerboard pattern
may easily trick the encoder into matching different areas of the checkerboard,
leading to erroneous motion vectors attributable to scene content rather than
luminance. It was decided that a scene with interesting texture and contrast would
provide the best subject for spurious characterization when attempting to apply
the end algorithm to a surveillance application. It is fair to assume that calibration
of the VVA is possible for each installation to compensate for highly redundant
static scenes, per the checkerboard argument, but better results are expected if
scene redundancy is kept to a minimum.
It was also necessary to further restrict the static scenes by not rapidly varying the
lighting conditions within a scene. Each scene's luminance was adjusted
between recordings, so that each scene was of constant luminance. Varying
luminance in this fashion assures that the encoder's block match routines are not
confused by lighting shifts during encoding. This is a requirement because we
only intend to update the scene's luminance values once per I-frame, which
occurs at a 2 Hz rate. Naturally, slowly occurring luminance changes, such as
sunrise or sunset, will be acceptable as they allow ample time for the filter to
update its luminance, and therefore threshold, values and adapt. Furthermore it can
be reasoned that shifting suddenly from a dark scene to a light scene would be
acceptable, as the sum and threshold setting would be higher for dark images than
for lighter images, due to the shift in the noise floor. A light shift of this sort could
potentially lead to a missed motion event, but given a 500ms luminance update
rate (two I-frames per second) dark to light shifts are not of concern.
A textured white background with a flower arrangement was chosen as the final
scene content, as shown in Figure 4.3. The flower arrangement provides
interesting non-repetitive texture across the scene as well as excellent luminance
contrast.

Figure 4.3 Scene Content for Development
Each static scene was restricted to three hundred frames in length and six scene
samples were recorded over various lighting conditions. Lighting was provided
by a high intensity halogen lamp with a potentiometer that allowed the lamp to be
varied from pitch black to full intensity. Each resulting MPEG sequence was fed
into the software framework and processed as per Figure 4.4. The end results
were tabulated and recorded to text files for post analysis.
Figure 4.4 Spurious Vector Characterization Framework

Data acquisition and encoding was performed by the VVA. After encoding, the
MPEG video stream was parsed for motion vector and luminance content. The
motion vector data was used to generate vector magnitudes, while luminance data
was used to compute average scene luminance. Luminance information for each
block associated with a motion vector was also made available to the final
probability computation block. The probability of a spurious motion vector v
given some luminance l, P[v | l], was computed by simply dividing the number of
macroblocks with motion seen by the total number of macroblocks for a given
luminance. As the scene is static, it is known a priori that any motion vectors seen
under these conditions are spurious.
Since the intent of using the sum and threshold detection scheme was maintained,
another reasonable element to measure was the mean value of the spurious motion
vectors for a given luminance value, or E[v | l], such that spurious motion vector
magnitude could be observed.
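The per-luminance statistics just described can be sketched as follows. This is a minimal illustration rather than the thesis framework's actual code; the input format (an iterable of per-macroblock luminance and vector-magnitude pairs) is an assumption made for the sketch:

```python
from collections import defaultdict

def spurious_stats(macroblocks):
    """Estimate P[v | l] and E[v | l] from the macroblocks of a static scene.

    `macroblocks` yields (luminance, vector_magnitude) pairs, where the
    magnitude is 0 when no motion vector was emitted for the macroblock.
    Because the scene is static, every nonzero vector is spurious a priori.
    """
    total = defaultdict(int)      # macroblocks seen at each luminance
    moved = defaultdict(int)      # macroblocks carrying a motion vector
    mag_sum = defaultdict(float)  # summed spurious vector magnitudes

    for lum, mag in macroblocks:
        total[lum] += 1
        if mag > 0:
            moved[lum] += 1
            mag_sum[lum] += mag

    p = {l: moved[l] / total[l] for l in total}        # P[v | l]
    e = {l: mag_sum[l] / moved[l] for l in moved}      # E[v | l]
    return p, e
```

Luminance values never observed in the calibration scenes simply have no entry, which corresponds to the data blackout regions discussed below.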
The results of the experiment were two plots: one showing P[v | l], and a
second showing E[v | l]. Figure 4.5 illustrates the resulting probability that a spurious
motion vector is obtained vs. luminance. As expected from the basic sum and
threshold observations the probability of a spurious vector is extremely large for
low luminance conditions as evidenced by large probability and magnitude peaks
for low luminance values. Increased numbers of spurious vectors would indicate
an increased spurious vector sum, which would therefore require higher
luminance thresholds, experimentally confirming why the basic
sum and threshold methodology fails. Even under full intensity light, it was difficult to
obtain luminance values above two hundred twelve. It is assumed that the scene did
not contain brilliant enough whites or sufficient lighting to generate luminance
values above the two hundred twelve level. Under full darkness conditions a
luminance value less than twenty-seven could likewise not be obtained. Presumably,
low luminance values could not be obtained due to the ambient noise floor of the
CCD, analog to digital converter, and video buffering amplifiers. For motion
analysis, motion vectors that fall on macroblocks with luminance values in a data
blackout, that is, at points where no motion vector statistics were obtained, will be
omitted. Omission of course means that motion cannot be detected should the
image contain large amounts of motion at these luminance intensities. When
evaluating spurious filter performance motion will be induced using the same
static scene background so as not to change the parameters of the experiment
excessively, thereby eliminating concern over data blackout effects. Restricting
motion acquisition according to available luminance is an acceptable trade-off for
a surveillance application as long as device calibration is performed across several
lighting extremes.

Figure 4.5 Probability of Spurious Motion vs. Luminance for a Static Scene
Figure 4.6 presents the plot showing E[v | l]. Not surprisingly, luminance values
below 75 show dramatically increased magnitudes, which almost perfectly
corresponds to a jump in spurious probability per Figure 4.5. Presumably this is
because the motion estimation routines of the encoder are able to find better block
matches due to increased block texture for higher luminance images.

Figure 4.6 Mean Spurious Vector Magnitude vs. Luminance for Static Scenes
4.3.2 Methods
A standard procedure was established for detection threshold curve generation. In
each filter test the spurious motion vector filter was implemented and executed
against each of the five static scenes used to characterize P[v | l] and E[v | l]. A
plot of the mean vector magnitude sum vs. mean image luminance for the entire
static scene was generated. The resulting plot was then fit with an estimation
routine, based on the mean vector sum vs. mean image luminance curve trend.
Most curves were exponential in nature; however, the weighted threshold plot
required a piecewise linear model to fit correctly. The final curve fit's
characteristics were then used to dynamically generate the detection threshold
based on the current mean image luminance. The final spurious motion vector
filter, and associated detection threshold generator was then run against the five
static scenes a final time to evaluate static scene performance.

Early experiments showed substantial amounts of erroneous motion detection in
the static scenes 100% of the time. The erroneous motion detection occurred
because the detection threshold generator's curve fit was calibrated to the mean
motion vector sum magnitude, and was therefore subject to variance in the vector
magnitude sum over time, where the variance registered as an erroneous
detection. Figure 4.7 illustrates the end effect for one such experiment.
Figure 4.7 Erroneous Motion Detection Before Variance Compensation
Shown is a vector magnitude sum over time for a static scene of luminance value
82. Overlaid on the magnitude sum plot is a heavy-set plot showing the motion
detection threshold during motion detection events. Which is to say, if motion
is not being detected the detection threshold plot is 0, otherwise the plot shows the
current vector magnitude sum threshold. Note that in Figure 4.7 the detection
threshold hovers about the mean value of the magnitude sum over time, and that
variance in the magnitude sum over time causes erroneous motion detection
events, as we know this scene is static.

To compensate for the variance in the vector magnitude sum two standard
deviations were added to each mean value before a curve fit was performed. The
end effect was a slightly elevated detection threshold curve that eliminated most
erroneous static scene motion events due to magnitude sum variance.
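The variance-compensated threshold generation described above can be sketched as a log-linear least-squares fit. This is an illustrative reconstruction, assuming an exponential model T(l) = a·exp(b·l) and per-scene (luminance, mean, standard deviation) samples; the thesis's actual fitting routine is not shown:

```python
import math

def fit_threshold_curve(points):
    """Fit T(l) = a * exp(b * l) to (luminance, mean, std) samples taken
    from static scenes.  Two standard deviations are added to each mean
    before fitting, per the variance compensation described above, which
    slightly elevates the resulting detection threshold curve.
    """
    xs = [l for l, mean, std in points]
    ys = [math.log(mean + 2.0 * std) for l, mean, std in points]

    # Ordinary least squares on the log-transformed sums.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = math.exp((sy - b * sx) / n)

    return lambda lum: a * math.exp(b * lum)  # detection threshold T(l)
```

The returned function plays the role of the detection threshold generator: at runtime it is evaluated at the current mean image luminance to produce the magnitude-sum threshold.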
4.4 Spurious Motion Vector Filter Tests
Two spurious motion vector filters were devised and tested for motion detection,
based on the general detection framework of Figure 4.1. As already outlined,
each test followed three procedural steps. First a set of spurious motion vector
detection rules were devised and implemented. The spurious detection rules were
then executed against the five static test scenes. A mean magnitude sum vs. mean
image luminance graph was generated for the static scenes and the magnitude sum
variance was computed. A detection threshold curve fit was produced and a
detection threshold generation function was implemented based on the plot
behavior. Finally tests were run against static scenes and scenes with motion and
detection results were tabulated.
A total of six motion scenes were acquired. As calibration to a specific
background may be required, the flower scene shown in Figure 4.3 was used as a
background for motion scenes. A small card with a dark green and a white side
was suspended by a thread and passed through the scene to produce motion.
Luminance conditions were again varied with a lamp and potentiometer, and two
motion samples were acquired for a high, medium, and low luminance value. The
two motion samples at each luminance setting comprised slow and fast motion
events. Rate of motion is an important consideration as faster moving objects
should, in general, generate larger vector magnitudes, while slower objects should
generate smaller vector magnitudes, by the nature of the frame rate and distance
traveled based on object speed.
Each motion scene's motion events were visually analyzed one picture at a time
such that a plot of visually detected motion could be overlaid onto each motion
graph to ascertain detection accuracy. To assist with this process an MPEG tool
from the University of California, Berkeley's multimedia group was downloaded
and applied to the motion scenes. The tool mpeg2decoder was run against each
motion scene generating a series of numbered still image files. Each image file
allowed motion to be precisely pinpointed and characterized versus picture
number and therefore time, including details such as color facing of the card, and
aspect ratio of the card. Appendix F contains time tables of interesting motion
events for each motion scene. When examining motion detection plots versus

time in the following sections it is necessary to note that the card always enters
the motion scene from the left hand side on the first pass through the scene. The
position of the light source causes the card's shadow to always precede the card as
it enters from the left. All visually inspected motion events begin with
the card's shadow, and not the card itself. This observation will be important
when considering motion detection accuracy.
Three filter approaches were taken. To establish a base case an adaptive detection
threshold was attempted without spurious filtering. The second test weighted
each motion vector according to the probability of error, and then compared the
weighted vector against the expected spurious vector magnitude. If the weighted
vector was less than the expected spurious magnitude it was thrown out. Finally a
ratio test was applied where each vector was again weighted according to the
probability of error, and then divided by the expected spurious vector magnitude.
The resulting ratio was compared against a pre-set threshold. If the magnitude
ratio was less than the threshold the motion vector was again thrown out.
4.4.1 Adaptive Threshold Test
The adaptive threshold test serves as a base case to compare further tests against
in terms of detection accuracy. The adaptive filter test extends the basic sum and
threshold detection scheme to attempt to compensate for increased spurious
motion vectors under low luminance conditions, by adapting the detection
threshold to the mean image luminance. No attempt is made to identify and
eliminate spurious motion vectors.
As per the outlined procedure the five static scenes were run through the detection
framework and the mean magnitude sum vs. mean image luminance curve shown
in Figure 4.8 was generated.

Figure 4.8 Unfiltered Mean Vector Sum vs. Mean Image Luminance
As expected the mean sum of all vectors for a scene with a given mean luminance
value, increases as mean scene luminance decreases, and in fact does so
exponentially. Again the standard deviation of each static scene was computed
over time and two standard deviations were added to the mean vector sum curve
before a curve fit was performed, accounting for the slightly greater magnitude of
the curve fit indicated by the diamond trend line in Figure 4.8. The generated
threshold was then applied to each of the five static scenes, and the six motion
scenes. As expected, due to the curve fit, motion was not detected in any of the static
scenes. When run against the scenes with motion content the high luminance
scenes resulted in 100% motion detection accuracy. The scenes with medium
luminance values resulted in accurate detection of motion events, but also
generated numerous false detections, and low luminance scenes showed no
detection of motion events whatsoever.

4.4.2 Weighted Vector Threshold Test
The weighted vector threshold test was the first attempt to use the statistical
characteristics gathered in Section 4.3.1 to identify and eliminate spurious motion
vectors. The Weighted Vector Threshold Test (WVTT) was formulated with the
idea that if the probability of a vector for a given luminance is high the vector
should contribute as little as possible to the final vector magnitude sum. Each
vector was weighted according to
vw(l) = v(1 − P[v | l]),  retained if vw(l) ≥ E[v | l]
where v represents the observed vector, and P[v | l] is the probability of a
spurious vector for luminance l. If vw(l) is less than the expected mean vector
magnitude for the given luminance the vector is thrown out, otherwise the
weighted vector is added to the image magnitude sum for comparison with the
generated detection threshold.
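The WVTT filtering rule can be sketched as follows; representing the calibrated P[v | l] and E[v | l] tables as dictionaries keyed by luminance is an assumption of the sketch, not the thesis implementation:

```python
def wvtt_sum(vectors, p, e):
    """Weighted Vector Threshold Test: weight each observed vector
    magnitude by (1 - P[v | l]) and discard it if the weighted magnitude
    falls below the expected spurious magnitude E[v | l].

    `vectors` yields (magnitude, luminance) pairs for one picture;
    `p` and `e` map a luminance value to P[v | l] and E[v | l].
    """
    total = 0.0
    for mag, lum in vectors:
        weighted = mag * (1.0 - p.get(lum, 0.0))
        if weighted >= e.get(lum, 0.0):   # keep only plausible motion
            total += weighted             # contribute to the image sum
    return total  # compared against the generated detection threshold
```

With P[v | l] near 1 at low luminance, almost any vector is attenuated below E[v | l] and discarded, which matches the sharp drop in the mean vector sum seen below a luminance of 82.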
Figure 4.9 shows the results of applying the WVTT spurious filtering rule to the
five static scenes.

Note that the mean vector sum for the scene is linear, but requires two intersecting
lines to model as the mean vector sum drops off rapidly for a mean luminance
value of 82. The sudden drop in mean vector magnitude can be attributed to the
high probability of a spurious vector, and large mean vector magnitude values
below a luminance value of 82, per Figure 4.5 and Figure 4.6. Essentially vectors
are increasingly filtered out for images below a mean luminance of 82. For this
test the detection threshold function was generated by fitting two least-squares
lines, with an intersection point at a luminance value of 82. Per
discussion in Section 4.3.2 two standard deviations were again added before a
least squares fit was performed, accounting for the greater offset of each fitted
line. The resulting detection threshold fit is represented by the diamond and
square trend lines.
Application of the WVTT method to the six motion scenes again resulted in 100%
motion detection accuracy for the high luminance scene samples on both slow and
fast motion events. For the medium luminance scenes all motion was again
detected accurately, but several false detections also occurred. The number of

false detection events was significantly less than those of the Adaptive Threshold
test. Figure 4.10 illustrates the results of motion detection for a medium luminance
scene with rapid motion versus time. The dotted square plot represents the
visually determined motion events within the scene (magnitude is arbitrary). The
heavy-set solid plot represents the presence of motion and motion detection
threshold for motion that has been identified by the detection framework. For
Figure 4.10 we see a false motion detection event around 4.7 seconds, and again
at 12.7 and 13.8 seconds. Motion is accurately detected at 6.8 and 9.2 seconds.
Per previous discussion the card's shadow actually precedes the card itself when
first entering the scene. In the case of this plot the card's shadow appears at
roughly 6.5 seconds and the card fully appears at 6.7 seconds. The magnitude
peaks reflect these events accurately. At approximately 6.8 seconds the card turns
perpendicular to the camera, as indicated by the sudden drop in the magnitude
sum peak at that time. Just after 7 seconds the card has again started turning

toward the camera as evidenced by the peak at around 7 seconds. The card leaves
the scene at approximately 7.7 seconds as evidenced by the sudden magnitude
drop off. Further event indexing is possible comparing scene information from
the tables in Appendix F, with the plots in Appendices A through E.
A final point of interest is the dashed plot representing the unfiltered motion
vector magnitude sum. The unfiltered magnitude sum allows observation of the
spurious motion vector filter's effect on the magnitude sum over time. In the case
of this plot relatively little data is eliminated as the filtered magnitude sum is not
substantially less than the unfiltered sum. Also note that relatively few peaks are
actually smoothed out of the filtered waveform. Filtering of the spurious
magnitude peaks is desirable as it could help to eliminate spurious motion
detection events due to the generated detection threshold. In theory, if all
spurious vectors were eliminated, the static scene magnitude would be zero, and
any magnitude level above zero would represent motion in the scene. Because
motion estimation works upon block matches, and scene redundancy may yield
coincidental block matches, it is nearly impossible to filter out all spurious vectors. A general
movement of the static scene mean magnitude sum toward zero indicates that
vectors are being properly categorized as spurious. Attenuation of actual motion
magnitude peaks indicates some actual motion vectors are being filtered. An
ideal filtering condition would leave the motion peaks virtually unchanged, and
static scene magnitudes at zero.
Under low luminance conditions the WVTT test accurately detected motion, but
also generated so many false detection events that it would be difficult to make
use of the motion state changes. However, when compared to the Adaptive Filter
method which detected no motion, the WVTT method shows some promise as
motion was at least detected.
4.4.3 Weighted Vector Ratio Threshold Test
The WVTT method introduced the idea of determining whether a vector is
spurious by comparing the magnitude of an observed motion vector with that of
the expected spurious vector magnitude for a given luminance. Extending the
idea a little further a ratio test was devised. The Weighted Vector Ratio
Threshold Test (WVRTT) seeks to classify observed motion vectors according to
their magnitude with respect to the expected spurious vector magnitude. The
WVRTT closely parallels signal to noise (SNR) testing commonly used under
deterministic detection methods. The observed motion vector magnitude is
considered to be the signal level, while the expected spurious vector magnitude is

the noise floor of the system. By forming a ratio, a statement about observed
vector magnitude may be made uniformly across all luminance values.
A test ratio for each observed motion vector magnitude was generated by

r(l) = v(1 − P[v | l]) / E[v | l]

where v again represents the observed vector magnitude, P[v | l] represents the
probability of a spurious motion vector for a given luminance and E[v | l]
represents the expected spurious mean vector magnitude for a given luminance.
r(l) was then compared to a pre-determined threshold value, T. Values of r(l)
below the threshold were eliminated. Threshold values of 16, 32, and 64 were all
tested. Note that the WVTT method is really a special case of the WVRTT
method where the threshold value is equal to 1.
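The WVRTT ratio test might be sketched as below. Whether a surviving vector contributes its raw or its weighted magnitude to the image sum is not stated in the text, so the raw magnitude is assumed here, and the dictionary form of the P[v | l] and E[v | l] tables is likewise an assumption:

```python
def wvrtt_sum(vectors, p, e, ratio_threshold=16.0):
    """Weighted Vector Ratio Threshold Test: form the SNR-like ratio
    r(l) = v * (1 - P[v | l]) / E[v | l] and discard any vector whose
    ratio falls below a preset threshold (16, 32, and 64 were tested).
    Setting ratio_threshold=1 reduces this to the WVTT rule.
    """
    total = 0.0
    for mag, lum in vectors:
        expected = e.get(lum, 0.0)
        if expected <= 0.0:
            continue  # no spurious statistics at this luminance (blackout)
        r = mag * (1.0 - p.get(lum, 0.0)) / expected
        if r >= ratio_threshold:
            total += mag  # surviving vector's magnitude (assumed raw)
    return total
```

Raising the ratio threshold makes it harder for both spurious and genuine vectors to survive, which is the reduced slow-motion sensitivity discussed for T=64 below.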
Per standard procedure the spurious motion vector rules were implemented and
executed against the five static scenes, generating a detection generation function.
It was expected that because the WVTT method is a special case of the WVRTT
method where the threshold is 1, an exponential result would again be obtained.

Figure 4.11 Mean Vector Sum vs. Mean Luminance for WVRTT
Figure 4.11 shows the composite results of testing for all threshold tests. As
expected each test resulted in an exponential trend line. Detection threshold
generation again followed a curve fit on the mean trend line plus two standard
deviations.
Results for a Threshold Value of 16
A threshold value of 16 run against the static scenes resulted in a single false
motion detection event for the static scene representing mean image luminance of
82. The erroneous event is shown in Figure 4.12.

Note the significant drop in mean sum magnitude over time, which is an indicator
that vectors are being properly classified as spurious. Also note that for the first
time peak smoothing began to occur. For instance the peaks exhibited by the
unfiltered sum around 4.5 to 5.5 seconds were smoothed out.
For high luminance motion scenes motion was again detected with 100%
accuracy as expected. For medium luminance scenes motion is detected 100%
accurately for the first time. The dark luminance motion samples were interesting
in that clear motion peaks occur but the generated motion detection threshold was
too high to trigger a motion detection event.

Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 70, Fast Motion)
Figure 4.13 WVRTT Missed Motion for Low Luminance (T=16)
Figure 4.13 illustrates an instance of missed motion events that, had the threshold
been properly generated, should have been relatively easy to detect due to significant
magnitude sum peaks at 6 and 8.5 seconds.
When compared to the Adaptive Threshold, or WVTT method substantially
improved results are seen for the medium luminance images. Despite the
difficulty in detecting motion in a dark scene, when considering a final algorithm
these results may be acceptable as the image quality for the medium and dark
luminance samples is drastically different.

Figure 4.14 Comparison of Medium and Low Luminance Samples
Figure 4.14 shows the medium and dark luminance samples used for motion
scenes. The dark luminance sample is almost devoid of detail providing relatively
little information that would be of use in identifying a perpetrator.
Results for a Threshold Value of 32
Attempting to improve detection results further a threshold value of 32 was
applied to the static and motion scenes. Under this threshold value motion was
again falsely detected for the static scene representing mean luminance 82. A single false
event was noted, as with a threshold value of 16.
For motion scenes the results for high luminance were again 100% correct
detection. Medium luminance scenes also illustrated 100% accurate detection as
per a threshold value of 16. For the low luminance motion samples motion was
once more not detected, but distinct peaks were again present per Figure 4.13.
Fundamentally the improvement using a threshold value of 32 is unremarkable
when compared to a threshold value of 16.
Results for a Threshold Value of 64
It is expected that raising the threshold value should, in general, decrease the
static scene values as it becomes increasingly difficult for spurious motion vectors
to exceed the threshold limit. By a similar token it also becomes more difficult
for actual motion vectors to exceed the threshold. The net effect of raising the
ratio threshold is therefore reduced sensitivity to slow moving objects.

Running a threshold value of 64 against the static scenes actually increased the
number of erroneous detection events, however the overall variance of the mean
magnitude sum over time was much less than previous threshold values.
Compare Figure 4.12 and Figure 4.15.
Figure 4.15 Erroneous Motion Detection Using WVRTT (T=64)
Note that the mean value of the magnitude sum has been reduced by a difference
of almost 500 by doubling the threshold value. Furthermore the peaks in the
threshold 64 plot are substantially less than those of the threshold 16 plot,
evidence that fewer vectors exceed a ratio threshold
of 64 than exceed a threshold value of 16. Because the variance
of the plot is also reduced the generated detection threshold is much closer to the
noise level of a given luminance which therefore increases the chance of an
erroneous motion detection event.
Looking at each motion scene it should come as no surprise that motion is
detected in the high luminance scenes with 100% accuracy. Because of the

tighter curve fit due to lower static magnitude sum variance, however, motion
detection on the medium luminance scenes still detects all motion events, but also
generates several false positives. Examining the low luminance scenes, fast
motion is again not detected; however, motion in the slow motion scene is detected.
Figure 4.16 shows the first detected motion for low luminance scenes.
4.5 Performance Comparisons
The performance of each filter can be measured by its ability to correctly detect
motion under various lighting conditions and motion rates in addition to
considering the computational cost of each filter. For each static and motion
scene the number of false and missed detection events was tabulated. A motion
event was not counted as a miss as long as motion of any kind was detected
during the visual evaluation range. This is because various pre and post detection
methods may be applied to make a global statement about motion. For example if
motion is detected the motion state may be locked for a period of time using a

hysteresis time value. Using hysteresis ensures that recorded motion scenes are
not so short as to be useless, and further serves to filter aspect ratio changes or
pauses in target motion. Furthermore in most applications it is reasonable to
assume that once motion is present in the scene it is highly probable motion will
continue to be in the scene for a period of time.
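The hysteresis behavior described above might be sketched as a simple hold-time latch. This is an illustrative assumption; the thesis describes the behavior but not an implementation:

```python
class MotionHysteresis:
    """Latch the motion state on for `hold_time` seconds after the last
    raw detection, filtering brief pauses in target motion and dropouts
    caused by aspect ratio changes of the moving object.
    """
    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.last_detect = None  # time of most recent raw detection

    def update(self, t, raw_motion):
        """Feed one raw detection sample at time t (seconds); return the
        hysteresis-filtered motion state."""
        if raw_motion:
            self.last_detect = t
        return (self.last_detect is not None
                and t - self.last_detect <= self.hold_time)
```

Recording control driven by the filtered state then rides through short gaps in the raw detection signal instead of toggling on every dropout.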
One advantage of digital video recording over traditional analog recording is that
video can be pre-recorded before video recording is enabled. During pre-
recording the VVA is continuously recording video to disk, storing a
predetermined number of seconds of data. Should the amount of pre-
recorded media exceed the prescribed pre-recording time, the oldest
media is thrown away in favor of new media at a rate that matches the incoming
media. In other words a pre-recording queue of predetermined length can be
implemented such that when the queue is full old frames are removed from the
queue as new ones are pushed into the queue. When recording is enabled the
queue is flushed to disk and new media sent directly to disk until recording is
disabled, at which point the queue begins to fill again. The end effect of pre-
recording and hysteresis is that the VVA will record several seconds of media just
prior to, and just after, a motion event. Using these recording methods thereby
allows a motion detection scheme to ignore the delay between a visual sighting of
the card's shadow and actual motion detection, or the possibility of a non-
detection event after motion has been detected and is actually still present.
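The pre-recording queue just described maps naturally onto a fixed-length ring buffer; the sketch below is illustrative and not the VVA's actual implementation:

```python
from collections import deque

class PreRecordBuffer:
    """Fixed-length pre-recording queue: while recording is disabled,
    the oldest frames are dropped as new ones arrive; when recording is
    enabled, the queued frames are flushed to disk ahead of the live
    stream, preserving the seconds just prior to the motion event.
    """
    def __init__(self, max_frames):
        self.queue = deque(maxlen=max_frames)

    def push(self, frame):
        # A bounded deque discards its oldest element when full.
        self.queue.append(frame)

    def flush(self):
        """Return all buffered frames and reset the queue."""
        frames, self.queue = list(self.queue), deque(maxlen=self.queue.maxlen)
        return frames
```

Sizing `max_frames` to the product of the frame rate and the desired pre-record time gives the prescribed number of seconds of history.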
Table 4.1 summarizes the contents of each motion file used for detection filter
evaluation. Great care was taken not to adjust the lighting between slow and fast
motion testing, however luminance variance due to system noise, and motion
content as the card moved about the scene caused variances in the end recordings.

Scene Motion Count Comments
Motion 0 4 High Luminance, fast motion. Card moves from left to right, right to left, and left to right.
Motion 1 2 Medium Luminance, fast motion. Card moves from left to right, and right to left.
Motion 2 2 Low Luminance, fast motion. Card moves from left to right, right to left, and left to right.
Motion 3 1 High Luminance, slow motion. Lighting unchanged from that of the fast motion scene. Card moves slowly from left to right.
Motion 4 1 Medium Luminance, slow motion. Lighting unchanged from that of the fast motion scene. Card moves slowly from left to right.
Motion 5 1 Low Luminance, slow motion. Lighting unchanged from that of the fast motion scene. Card moves slowly from left to right.
Table 4.1 Motion Scene Comments
Motion count indicates the number of visually measured motion events within the
scene. Table 4.2 summarizes overall filter error based on the number of motion
events missed, and false events reported for each filter and file. The compiled
results from Table 4.1 and Table 4.2 form the basis of how filter performance will
be evaluated.

Scene Scene Mean Luminance Adaptive Threshold Results WVTT Results WVRTT Results
False / Missed / Actual False / Missed / Actual T=16 T=32 T=64
Static 0 30.08 0/0/0 0/0/0 0/0/0 0/0/0 0/0/0
Static 1 55.06 0/0/0 0/0/0 0/0/0 0/0/0 0/0/0
Static 2 82.82 0/0/0 0/0/0 0/0/0 0/0/0 0/0/0
Static 3 133.34 0/0/0 1/0/0 1/0/0 1/0/0 2/0/0
Static 4 134.82 0/0/0 3/0/0 0/0/0 0/0/0 1/0/0
Static 5 130.56 0/0/0 0/0/0 0/0/0 0/0/0 0/0/0
Motion 0 135.39 0/0/4 0/0/4 0/0/4 0/0/4 0/0/4
Motion 1 115.06 5/0/2 3/0/2 0/0/2 0/0/2 2/0/2
Motion 2 70.98 0/2/2 21/0/2 0/2/2 0/2/2 0/2/2
Motion 3 137.64 0/0/1 0/0/1 0/0/1 0/0/1 0/0/1
Motion 4 110.44 9/0/1 8/0/1 0/0/1 0/0/1 4/0/1
Motion 5 71.11 0/1/1 25/0/1 0/1/1 0/1/1 0/0/1
Table 4.2 Filter Error Summary
Each entry in Table 4.2 represents the number of false detection events, missed
motion events (by comparing the visual and filter detection events), and actual
motion events (by visual inspection). Therefore for static scenes 0/0/0 is a perfect
score, whereas a perfect score for motion scenes would be 0/0/-. More generally
motion scenes receive a perfect score when the number of false and missed events
is 0. Note that some filters, such as the WVTT for Motion 2, generate an
inordinate number of false detections, and some filters completely miss motion
events such as the WVRTT (T=16) test for Motion 5. Investigation into the
reasons behind erroneous or missed events is paramount in evaluating why a filter
failed to work well, and increases understanding such that new filter approaches
may be formulated. Generally error occurs for one of several reasons in almost
all of the filters and scenes tested. For erroneous motion events the filter
threshold was typically so close to the static scene magnitude level that variance
in the magnitude sum was enough to falsely trigger a detection event as was the
case for Figure 4.12.

For missed events one of two things typically occurred. The threshold curve fit
was so poor that the threshold was set too high for small motion events to register
as was the case in Figure 4.17 where the threshold is at roughly 2200 for the
scene, or the spurious filter was too aggressive and omitted too many actual
motion vectors, resulting in a loss of any viable magnitude sum peaks thereby
complicating detection, as would be the case for Figure 4.18, had a valid threshold
been selected.
Figure 4.17 Missed Motion Due to Excessive Threshold (Motion 5, T=32)

Figure 4.18 Excessive Filtering Using the Weighted Threshold (Motion 5)
Given data for filter accuracy under various conditions how is a filter evaluated as
a whole in terms of performance? Certainly there are consequences to generating
an erroneous event in terms of system response. For example if a system policy
exists such that media is recorded to disk each time a motion event occurs, a
spurious detection event would potentially waste disk space. If motion events
were being used to bookmark recorded media there is a cost associated with
user time wasted looking at recorded media that does not contain actual motion.
These kinds of considerations can be applied to build a cost analysis that allows
deterministic measurement of filter performance.
4.5.1 Cost Analysis
When evaluating filter performance an acceptable balance between missed events,
erroneous events, and actual detection is sought. Five costs will be considered for
each test. The cost of processing, the cost of a correct detection when motion is
present, the cost of a correct non-detection when motion is not present, the cost of
a missed detection when motion is present, and finally the cost of an erroneous
detection when motion is not present. Apart from the cost of processing the costs
being analyzed are those of the correct detection, and those of incorrect detection.

In statistical vernacular these misdetection errors are also known as type I (false
alarm) and type II (miss) errors respectively. The cost of generated threshold computation is not
considered because the threshold is only updated twice per second and requires
relatively few mathematical computations.
The cost can formally be defined as follows. Let Pi represent the cost of
processing for a given algorithm, and Dijk represent the cost of a decision for type I
and type II errors, where i identifies the test being considered, j is the detection
indicated, and k is the validity of the detection. For example Di01 would represent
the cost of choosing a non-motion event when motion was actually present,
termed a "miss". Di10 would be termed a "false alarm". Thus the net cost Ci for any
algorithm can be determined by

Ci = Pi + Σ (j=0 to 1) Σ (k=0 to 1) Dijk ( 4.4 )

where each decision made during a scene contributes its cost Dijk to the sum.
The cost of processing can be equated to the number of operations required for
each filter and is summarized in Table 4.3.
Operation Cost
Comparison 1
Add/Subtract 2
Multiply/Divide 3
Table 4.3 Operation Costs for Determining P of Each Algorithm
Each test's operation cost can easily be computed by analysis of the approach.
Table 4.4 presents a summary of each test's cost.

Test Comparisons Additions / Subtractions Multiplications / Divisions Net Cost
Adaptive Threshold 0 0 0 0
WVTT 1 0 1 4
WVRTT 1 0 2 7
Table 4.4 Summary of Test Processing Costs
The difficulty now becomes assigning the cost of detection error D. Because
correctly detecting motion is the desired behavior, a cost of zero will be
assigned to any condition that is not a type I or type II error. Explicitly, when j =
k then Dijk = 0. What then, is the cost when motion is incorrectly identified?
Turning back to a surveillance application and considering the consequences of
error will allow a cost basis to be formulated. As two costs (a miss and a false
alarm) are sought it is sufficient to simply assign a higher cost to one or the other
event according to which error type is most favorable.
Consider the cost of a miss. A clear savings in disk space is obtained, as no
video is recorded. However, when an end user is relying on the device to obtain
evidence to prosecute for stolen goods, the potential cost to that user could be
substantial, as no evidence would be obtained. Furthermore, the user's
perception of product performance could be compromised, reducing potential sales
of VVAs in the future. The cost of lost revenue and customer confidence is
therefore deemed quite high relative to disk consumption.
Conversely, a false alarm consumes more disk space for storage, but the
customer will more than likely have the video needed for evidence. More time
must be spent on the customer's part searching through false alarms, but
overall satisfaction would remain higher.
Based on these observations, the cost of missing a motion event is judged
substantially higher than that of reporting erroneous events. Therefore a cost
of 2 is assigned to the case of a missed motion event, and a cost of 1 is
assigned to a spurious motion event. The cost of each decision, D_ijk, is
summarized in Table 4.5.

Decision      D_ijk   Cost
No Motion     D_i00   0
Miss          D_i01   2
False Alarm   D_i10   1
Motion        D_i11   0
Table 4.5 Summary of Decision Costs
Computation of equation ( 4.4 ) is now possible using Table 4.2, Table 4.4, and
Table 4.5 collectively. Let i = { 0, 1, 2 } denote the Adaptive Threshold,
WVTT, and WVRTT tests respectively.
Filter (i)   Processing Cost (P_i)   Miss Cost (D_i01)   False Alarm Cost (D_i10)   Net Cost
0            0                       6                   14                         20
1            4                       0                   61                         65
2, T=16      7                       6                   1                          14
2, T=32      7                       6                   1                          14
2, T=64      7                       4                   9                          20
Table 4.6 Filter Cost Summary
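Each net cost in Table 4.6 follows directly from equation ( 4.4 ): the processing cost plus each error count weighted by its decision cost from Table 4.5. A sketch of the computation; the per-filter miss and false-alarm counts below are back-computed from the aggregate costs in Table 4.6, so treat them as assumptions for illustration:

```python
# Decision costs D_ijk from Table 4.5 (j = detection, k = validity).
D = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 0}

def net_cost(p, misses, false_alarms):
    """Equation (4.4): C_i = P_i + sum over (j, k) of N_ijk * D_ijk."""
    return p + misses * D[(0, 1)] + false_alarms * D[(1, 0)]

# (P_i, misses, false alarms); error counts inferred from Table 4.6.
filters = {
    "Adaptive Threshold": (0, 3, 14),
    "WVTT":               (4, 0, 61),
    "WVRTT T=16":         (7, 3, 1),
    "WVRTT T=32":         (7, 3, 1),
    "WVRTT T=64":         (7, 2, 9),
}

for name, args in filters.items():
    print(name, net_cost(*args))
```

This reproduces the Net Cost column: 20, 65, 14, 14, and 20.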
4.5.2 Performance Conclusions
Considering the results of Table 4.6, it is clear that in terms of overall cost
the WVRTT provides the best value and performance for surveillance
applications. The accepted tradeoffs for this choice are a greater user
tolerance for false alarms, increased processing effort, and increased storage
space when recording is controlled by motion detection.

5. Summary
In this contribution, a pluggable parsing architecture aimed at minimizing
resource utilization for MPEG-1 video has been developed, and several spurious
motion vector filters and motion detection schemes have been presented. These
efforts are moderately successful at detecting motion events in medium- to
high-luminance scenes with a reasonable degree of accuracy, for fairly minimal
operating cost.
Accurately detecting motion in low-luminance scenes remains an issue: attempts
to detect motion in such scenes were unreliable, often failing to detect motion
at all.
Despite being targeted at a resource-poor environment, the software framework
and current tests are immediately useful on any platform capable of working
with MPEG video. Various video applications can benefit from the tests derived
thus far; for example, processing MPEG video for MPEG-7 motion descriptors,
analysis of MPEG-encoded VHS tapes, and automated security systems may all make
use of motion events.
Future efforts could improve filter accuracy by observing how spurious vector
magnitudes vary with respect to the mean value. Using the variance of the
spurious vector magnitudes may allow more accurate assessment of observed
motion vector validity based on statistical distribution, rather than simple
magnitude analysis alone.
Analysis of neighboring blocks' motion vectors can also be done in tandem with
magnitude analysis to identify spurious motion vectors. For instance, if an
observed block and its neighbors all have motion vectors, and those vectors all
point in the same direction, it is highly unlikely that the observed motion
vector is spurious. A median filter may also be applied to try to localize
motion events based on neighboring blocks. Statistical characterization of
static-scene spurious motion vectors can still contribute to the spurious
filtering effort in tandem with neighboring block analysis.
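The neighbor-consistency idea above can be sketched as follows. This is an illustrative outline only: the grid layout, the 4-neighborhood, and the angle threshold are assumptions for the sketch, not part of the implemented system.

```python
import math

def direction(v):
    """Angle of a motion vector (dx, dy) in radians."""
    return math.atan2(v[1], v[0])

def likely_spurious(grid, r, c, max_angle=math.pi / 4):
    """Flag a block's vector as spurious if any 4-neighbor's vector
    points in a substantially different direction (> max_angle)."""
    v = grid[r][c]
    if v is None:
        return False  # no vector here, nothing to flag
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            n = grid[nr][nc]
            if n is None:
                continue
            diff = abs(direction(v) - direction(n))
            diff = min(diff, 2 * math.pi - diff)  # wrap around the circle
            if diff > max_angle:
                return True  # a neighbor disagrees in direction
    return False
```

For example, in a 3x3 grid where every block's vector points right, the center block is not flagged; flipping one neighbor's vector to point upward flags it.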
Furthermore, a variety of additional extensions may be made to allow motion
detection on sequences containing scene changes by extracting luminance
information from all frames, rather than just I-frames.


A. Adaptive Threshold Test Plots






B. Weighted Vector Threshold Plots


[Plot: Magnitude Sum vs. Time (Mean Luminance 55); series: Unfiltered Mag Sum]

[Plot: Magnitude Sum vs. Time (Mean Luminance 30); series: Magnitude Sum, Motion Detect, Unfiltered Mag Sum]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 135, Fast Motion); series: Visible Detection, Statistical Detection, Unfiltered Mag]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 110, Slow Motion); series: Visual Detection, Statistical Detection, Unfiltered Mag]


C. Weighted Vector Ratio Threshold Plots (T=16)

[Plot: Magnitude Sum vs. Time (Mean Luminance 82); series: Magnitude Sum, Motion Detect, Unfiltered Mag Sum]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 135, Fast Motion); series: Visible Detection, Statistical Detection, Unfiltered Mag]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 137, Slow Motion); series: Visual Detection, Statistical Detection, Unfiltered Mag]



D. Weighted Vector Ratio Threshold Plots (T=32)

[Plot: Magnitude Sum vs. Time (Mean Luminance 82); series: Magnitude Sum, Motion Detect, Unfiltered Mag Sum]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 115, Fast Motion); series: Visual Detection, Statistical Detection, Unfiltered Mag]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 110, Slow Motion); series: Visual Detection, Statistical Detection, Unfiltered Mag]


E. Weighted Vector Ratio Threshold Plots (T=64)
[Plot: Magnitude Sum vs. Time (Mean Luminance 140); series: Magnitude Sum, Motion Detect, Unfiltered Mag Sum]

[Plot: Detection Threshold and Magnitude vs. Time (Mean Scene Luminance 110, Slow Motion); series: Magnitude, Visual Detection, Unfiltered Mag]


F. Motion Scene Event Timetables
Mean Luminance 135, Fast Motion
Time (s) Comments
6.633 Shadow Appears on left side of scene
6.93 Edge of card, white side, appears. Card moving right.
7.062 White side of square fully visible
7.392 Card turns perpendicular to camera
7.656 White side of square starts to turn toward camera
8.151 Card turns perpendicular to camera
8.217 Card leaves right side of scene
10.956 Square appears at right with white side facing camera
11.154 Full square in scene. Glare from lighting.
11.319 Card turns perpendicular to camera.
11.484 Card has turned with green side facing camera in full
11.649 Card off screen at left. Shadow still visible.
11.946 Shadow Exits
12.177 Shadow appears at left
12.309 Green corner of card appears on screen
12.408 Card fully on screen, green side fully facing camera
12.573 Card perpendicular to camera
12.705 Card has rotated white side fully to camera.
12.771 Card 30% off scene at right
12.837 Card fully off right side of scene
13.365 50% card showing on right side of scene. White side.
13.431 Card fully on screen. Good glare from white face.
13.728 Card fully off screen at left. Shadow still present.
13.86 Shadow departs scene.

Mean Luminance 115, Fast Motion
Time (s) Comments
6.600 Card just entering scene on green side
6.699 Card fully on screen, slightly cocked with green side facing camera
6.864 Card perpendicular to camera
6.996 Card begins showing white side
7.392 Card turns perpendicular to camera
7.557 Card turned so some green side showing
7.623 Card leaves scene at right
8.844 Card enters scene, green side facing
8.976 Card fully in scene
9.240 Card dips downward then continues toward left
9.57 Card turned perpendicular to camera
9.636 White side starts showing
9.867 Full white side exposure
10.065 Card turned perpendicular to camera
10.263 Green side showing and leaving at left
10.461 Shadow departs scene

Mean Luminance 70, Fast Motion
Time (s) Comments
5.181 Shadow Appears on left side of scene
5.511 Card's full shadow at left
5.61 Card begins entering scene, green side facing camera.
5.94 Card fully in scene.
6.237 Card turned perpendicular to camera
6.27 Card disappears in background
6.369 White side begins turning toward camera
6.534 Full white side facing camera
6.831 Card turned perpendicular to camera
6.996 Green side starting to turn toward camera
7.062 Green side fully facing camera. Begins to leave scene at right
7.194 Card leaves scene
8.448 Card enters from right, green side facing
8.514 Card is lost in background
8.580 White side begins turning toward camera
8.679 Full white side facing camera
8.811 Card turned perpendicular to camera
8.943 Card turned full white side to camera
9.075 Card turned perpendicular to camera
9.141 Card turned white side to camera, starting to leave scene at left
9.240 Card fully out of scene, shadow still present
9.405 Shadow departs scene

Mean Luminance 137, Slow Motion
Time (s) Comments
8.085 Full card shadow at left
8.184 Green side starts appearing
9.339 Full card on screen
9.504 Card turned perpendicular to camera
9.570 White side begins showing
9.900 Full white side facing camera with good glare
10.263 Card turned perpendicular to camera
10.296 Green side starts showing
10.494 Full green side facing camera
10.824 Card turned perpendicular to camera
11.055 Full white side facing, with good glare
11.319 Card turned perpendicular to camera
11.517 Full green side facing camera
11.847 Card turned perpendicular to camera
12.078 Full white side facing camera
12.375 Card turned perpendicular to camera
12.573 Full green side facing camera
12.936 Card turned perpendicular to camera
13.233 Full white side facing camera
13.266 Starting to leave scene at right
13.464 50% off screen
13.563 Off screen

Mean Luminance 110, Slow Motion
Time (s) Comments
6.039 Shadow starting to enter at left
6.996 Full Shadow
7.227 Card enters scene at left
7.359 Card turns toward camera, green side facing
8.085 Full green side facing camera
8.943 Card turned perpendicular to camera
9.306 White side fully facing camera
9.900 Card turned perpendicular to camera
10.230 Green side fully facing camera
10.758 Card turned perpendicular to camera
10.824 White side starting to show and card getting ready to leave scene at right
11.22 Card leaves scene fully.