Citation
Smart surveillance : spatial tracking with computer vision and machine learning

Material Information

Title:
Smart surveillance : spatial tracking with computer vision and machine learning
Creator:
Bollig, Charles
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science
Committee Chair:
Connors, Dan
Committee Members:
Liu, Chao
Vahid, Alireza

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Charles Bollig. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text
Smart Surveillance: Spatial Tracking With Computer Vision and Machine
Learning
University of Colorado
Denver
Charles Bollig
Fall 2018


Copyright
Copyright 2018. All Rights Reserved.


Approval
Smart Surveillance: Spatial Tracking With Computer Vision and Machine Learning
Author: Charles Bollig
Approved:
Dan Connors, PhD Thesis Advisor
Chao Liu, PhD Committee Member
Alireza Vahid, PhD
Committee Member


Abstract
Historically, video surveillance has heavily relied on a person (or people) sitting behind an array of monitors, alert for any kind of suspicious behavior. However, this method of monitoring requires constant vigilance, stamina, and exacting attention to detail to be effective as a means of security. Deep learning has made "out of the box" computer vision technologies accessible to those with little more than a basic computer science background. This has allowed engineers and scientists to adapt these easily accessible computer vision (along with simple machine learning) tools into "smart visual surveillance" systems. This research explores the viability of a spatial tracking surveillance system utilizing fundamental components of these extremely accessible tools. Using generalized linear regression models, we translate object-detection camera positions into real-world locations. In configuring the spatial tracking system and training the regression models, we explore several methods of real-world mapping, including GPS and manually measured distances. Suitability of the models is assessed using four evaluation methods: 'Leave One Out,' 'Train-to-Test,' 'Path Approximation,' and 'Moving Target.' Several trials in different locations are conducted to assess varying conditions, as well as the development and repeatability of the configuration process. In addition, we design and build a web-based application as a proof of concept for the user-facing spatial tracking surveillance system.


Acknowledgements
Courtney Mattson -
Thank you for all of your support. I never would have been able to complete this without your patience as an assistant and genius as a scientist.
You are a rock. I hope you get all the grants!
Alex Mattson -
Thanks for volunteering your time!


Contents
1 Purpose 1
2 Background 2
2.1 Traditional Surveillance ................................................ 2
2.2 Computer Vision’s Role .................................................. 3
2.3 Mapping to Real Space.................................................... 5
2.4 Realistic Situation ..................................................... 7
2.5 Computer Vision.......................................................... 8
2.5.1 ’Out of the Box’.................................................. 8
2.5.2 Yolo v3 .......................................................... 9
2.6 Related Research........................................................ 10
2.6.1 SLAM............................................................. 10
2.7 Spatial Tracking Smart Surveillance System.............................. 12
2.7.1 Definition....................................................... 12
2.7.2 Requirements..................................................... 12
3 Methods 16
3.1 Tools................................................................... 16
3.1.1 Software......................................................... 16
3.1.2 Hardware......................................................... 20
3.2 Procedure Overview ..................................................... 20
3.3 Client ................................................................. 20
3.3.1 Obtain Training Data ............................................ 21
3.3.2 Build Models..................................................... 21
3.3.3 Determine Best Model............................................. 21
3.3.4 Export Model and Deploy on Camera................................ 24
3.4 Server.................................................................. 24
4 Data/Analysis 25
4.1 Synthesis............................................................... 25
4.2 Trial 1: Field in Stapleton (Denver, CO)................................ 27
4.2.1 Overview ........................................................ 27
4.2.2 Analysis......................................................... 30
4.3 Trial 2: Soccer Field................................................... 30
4.3.1 Overview ........................................................ 30
4.3.2 GPS.............................................................. 33
4.3.3 Without AreaBoundingBox ........................................ 34
4.3.4 With AreaBoundingBox........................................... 40
4.3.5 Analysis......................................................... 43


4.4 Trial 3: Tennis Court.............................................. 44
4.4.1 Overview ...................................................... 44
4.4.2 Without AreaBoundingBox......................................... 47
4.4.3 With AreaBoundingBox............................................ 49
4.4.4 Analysis....................................................... 51
4.5 Trial 4: Tennis Court 2............................................ 52
4.5.1 Overview ...................................................... 52
4.5.2 Without HeightBoundingBox....................................... 55
4.5.3 With HeightBoundingBox.......................................... 57
4.5.4 HeightBoundingBox Replacing Y-Coordinate Pixel Value........... 59
4.5.5 Analysis....................................................... 62
5 Making the Geo-Tracking System 63
6 Discussion 68
6.1 Roadblocks and Issues................................................. 68
6.2 Missed Opportunities and Future Work.................................. 69


List of Figures
1.1 Translating Pixel-to-Real.................................................. 1
2.1 Yolo Neural Net Framework [9] 9
2.2 SLAM Example Mapping [14]................................................. 11
2.3 System "Depth" Problem .................................................. 13
2.4 System "Lens Distortion" Problem......................................... 14
4.1 Trial 1 Camera Images..................................................... 28
4.2 Trial 1 Camera Position Plot.............................................. 28
4.3 Trial 1 Camera Data Relationships......................................... 29
4.4 Trial 1 GPS and Network Position Locations................................ 29
4.5 Trial 2 Camera Images..................................................... 30
4.6 Trial 2 Camera Position Plot.............................................. 31
4.7 Trial 2 Camera Data Relationships......................................... 32
4.8 Trial 2 Collected GPS and Measured Distance Location Points............... 32
4.9 Trial 2 Path.............................................................. 33
4.10 Trial 2 GPS comparison.................................................... 33
4.11 Trial 2 Measured LOO w/out AreaBoundingBox Accuracy................... 34
4.12 Trial 2 Measured LOO w/out AreaBoundingBox Path Approx................ 35
4.13 Trial 2 Measured LOO w/out AreaBoundingBox Moving Target.............. 35
4.14 Trial 2 Measured LOO w/out AreaBoundingBox Error...................... 36
4.15 Trial 2 Measured LOO w/out AreaBoundingBox Path Approx................ 37
4.16 Trial 2 Measured LOO w/out AreaBoundingBox Moving Target.............. 37
4.17 Trial 2 Measured Train2Test w/out AreaBoundingBox Accuracy............ 38
4.18 Trial 2 Measured T2T w/out AreaBoundingBox Error...................... 39
4.19 Trial 2 Measured T2T w/out AreaBoundingBox Plot....................... 39
4.20 Trial 2 Measured T2T w/out AreaBoundingBox Moving Target.............. 40
4.21 Trial 2 Measured LOO With AreaBoundingBox Accuracy.................... 41
4.22 Trial 2 Measured LOO With AreaBoundingBox Error ...................... 41
4.23 Trial 2 Measured LOO With AreaBoundingBox Path Approx................. 42
4.24 Trial 2 Measured LOO With AreaBoundingBox Moving Target............... 42
4.25 Trial 3 Camera Images..................................................... 44
4.26 Trial 3 Camera Position Plot.............................................. 45
4.27 Trial 3 Camera Data Relationships......................................... 45
4.28 Trial 3 Collected measured distance location points....................... 46
4.29 Trial 3 Path.............................................................. 46
4.30 Trial 3 Measured LOO w/o AreaBoundingBox Accuracy......................... 47
4.31 Trial 3 Measured LOO w/o Area BoundingBox Error........................... 47
4.32 Trial 3 Measured LOO w/o Area BoundingBox Path Approx..................... 48
4.33 Trial 3 Measured LOO w/o Area BoundingBox Moving Target................... 48
4.34 Trial 3 Measured LOO With Area BoundingBox Accuracy....................... 49


4.35 Trial 3 Measured LOO With AreaBoundingBox Error ...................... 50
4.36 Trial 3 Measured LOO With AreaBoundingBox Path Approx................. 50
4.37 Trial 3 Measured LOO With AreaBoundingBox Moving Target............... 51
4.38 Trial 4 Camera Image ............................................... 52
4.39 Trial 4 Camera Position Plot........................................... 53
4.40 Trial 4 Camera Data Relationships......................................... 54
4.41 Trial 4 Collected measured distance location points....................... 54
4.42 Trial 4 Path.............................................................. 55
4.43 Trial 4 Measured LOO w/o HeightBoundingBox Accuracy................... 55
4.44 Trial 4 Measured LOO w/o HeightBoundingBox Error...................... 56
4.45 Trial 4 Measured LOO w/o HeightBoundingBox Path Approx. and Moving
Target 1................................................................... 56
4.46 Trial 4 Measured LOO w/o HeightBoundingBox Path Approx. and Moving
Target 2................................................................... 57
4.47 Trial 4 Measured LOO With HeightBoundingBox Accuracy.................. 58
4.48 Trial 4 Measured LOO With HeightBoundingBox Error .................... 58
4.49 Trial 4 Measured LOO With HeightBoundingBox Path Approx. and Moving
Target 1................................................................... 59
4.50 Trial 4 Measured LOO With HeightBoundingBox Path Approx. and Moving
Target 2................................................................... 59
4.51 Trial 4 Measured LOO With HeightBoundingBox Replacing Y-Coordinate
Accuracy................................................................... 60
4.52 Trial 4 Measured LOO With HeightBoundingBox Replacing Y-Coordinate
Error...................................................................... 60
4.53 Trial 4 Measured LOO With HeightBoundingBox Path Approx. and Moving
Target 1................................................................... 61
4.54 Trial 4 Measured LOO With HeightBoundingBox Path Approx. and Moving
Target 2................................................................... 61
5.1 Spatial Tracking Surveillance System First Iteration...................... 63
5.2 Spatial Tracking Surveillance System Second Iteration..................... 64
5.3 User Interface Server Architecture Diagram................................. 65
5.4 Spatial Tracking Surveillance System Second Iteration Error Visualization 66
5.5 Spatial Tracking Surveillance System Final Iteration...................... 66


List of Abbreviations
PoC - Proof of Concept
CNN - Convolutional Neural Network
DL - Deep Learning
YOLO - You Only Look Once
SGD - Stochastic Gradient Descent
SLAM - Simultaneous Localization and Mapping
OpenCV - Open Source Computer Vision Library
GPS - Global Positioning System
NTP - Network Time Protocol
AJAX - Asynchronous JavaScript and XML


1 Purpose
The goal of this project was to investigate a method of translating a target's pixel position in a camera scene to a real-world spatial location and, ultimately, to incorporate the results of this research into the creation of a spatial tracking system.
Figure 1.1: Translating Pixel-to-Real
The primary goals of this research were to:
1. Explore which regression algorithm(s) helps translate camera (x,y) to spatial (x,y).
2. Create a PoC of a working smart surveillance system.
3. Create a reusable framework based on refining an experimental method.


2 Background
2.1 Traditional Surveillance
Surveillance exists in a plethora of different forms: monitoring of crowds, tracking of an individual, security maintenance of infrastructure, assessment of cyber threats, radar, etc. Traditional video surveillance has heavily relied on a person (or people) sitting behind an array of monitors, alert for any kind of suspicious behavior.
Video surveillance exists for two purposes: “first, it is a means to monitor evolving activity in real time with the possibility for intervention.” Secondly, as a means to “provide an important visual record of a space at any given time, so that it may be viewed retrospectively to detect the presence of a particular person or object in a particular place, or to trace the onset and development of an event that has already occurred” [5]. When used effectively, video surveillance is an invaluable tool in the maintenance of public safety.
However, this method of monitoring requires constant vigilance, stamina, and exacting attention to detail to be effective as a means of surveillance. “These supervisory monitoring tasks are all similar in that they involve concentration for long periods on a complex and dynamically evolving situation, whereby particular rules are kept in mind to identify visual threats in the environment. Given the potential high cost of a detection failure (e.g., bomb explosion, drowning, mid-air collision, cyber-attack), high expectations are placed upon surveillance personnel to ensure safety and security.” [5].
One of the most glaring obstacles to the efficacy of surveillance personnel is the sheer volume of information for which operators are responsible. Sometimes, up to fifty screens are to be monitored at once by a single individual. Operators have reported feeling overwhelmed, and experimental studies into divided visual attention have found that the "visual system becomes noticeably stretched and performance declines when monitoring four scenes concurrently." Moreover, the number of potential camera feeds to select from often exceeds the number of physically available screens on which to view them. Furthermore, the process of constantly switching context between camera feeds introduces inefficiencies in monitoring. "The ability of an operator to perceive the elements and events of his or her work environment with respect to time or space, to understand their meaning in relation to the tasks at play, and to foresee how they may change over time—will be reduced each time until the operator has reassessed the new scene and updated their situational model" [5].
Operators in charge of these feeds are not invulnerable to one of the most pernicious perpetrators of human error: boredom. Considering the long shifts these operators work (12-hour shifts, 4 days on with 2 days off), it is not surprising that many important incidents can go completely undetected for several brief, critical moments or even for long periods of time. This phenomenon has been labeled 'inattentional blindness.'
Security surveillance systems are crucial in situations related to public safety resulting from criminal or threatening activity. Ubiquitously, security surveillance devices can be found littered throughout public places: parks, schools, government buildings, etc. [6]. However, for the reasons outlined above, automating and streamlining the surveillance process for the benefit of the surveillance operators could provide a much-desired boon in the interest of “maintaining the peace” in public and private areas.
2.2 Computer Vision’s Role
The technological evolution of surveillance systems can be divided into three epochs. An example of the initial generation consists of a series of Closed Circuit TV (CCTV) cameras connected to a central control room (obvious disadvantages have been outlined in the previous section). The second generation involves the early stages of “intelligent,” but rudimentary computer vision detection aided technologies. Finally, the third generation consists of advanced detection and tracking computer vision technologies on a large, geographically distributed scale [6].
Over the past three decades, there has been an ever-growing interest in human detection and tracking. Video acquisition technology is one of the fundamental aspects concerning this interest [2]. Beyond simple surveillance, the automatic tracking of humans in video has always been a cross-domain research area with applications in many different domains [2]. The advancements of "Deep Learning" (DL) technologies, inspired by the organic architecture of the human cortex, along with parallel processing hardware, have allowed engineers and scientists to adapt computer vision technologies into 'smart visual surveillance' systems [2]. Smart visual surveillance deals with the real-time monitoring of objects within an environment. The primary goal is to provide automatic interpretation and rudimentary analysis of the scene to understand activities and interactions of observed agents, specifically human beings [2].
A taxonomy divides surveillance within the video domain into four groups: visual information representation, regions-of-interest location, tracking and identification, and knowledge extraction [6]. Visual information representation refers to representing the information contained in the visual data, for example, converting pixel information to a feature space. Regions-of-interest location refers to focusing on the locations in the image from which information about activities can be extracted. Tracking and identification identifies agents based on a-priori information (such as facial recognition) and follows the trajectories of the agents within the scene. Knowledge extraction refers to the representation of a scene, given the previously mentioned steps, for the purpose of performing analysis and making inferences [6].
One of the great challenges of automatic surveillance systems is that to interpret what is happening in a scene, a sequence of problems needs to be solved, including (but not limited to): detection, recognition, pose estimation, tracking, and re-identification [6]. The specific focus area of this research will address one of those primary challenges related to automatic surveillance: tracking.
Of course, it is important to recognize that these automated systems do not necessarily provide a fail-safe solution. The role of the surveillance operator is still very much necessary. These systems, and the research for the system described in this thesis, are to be considered a tool available to the operators, not a replacement. Over-reliance on automated systems can create a sense of false security, whereby operators become complacent, assuming incidents will be consistently detected and handled by the automated system. No system is perfect [5]. Computer vision software, although increasingly robust, is prone to failing under sub-optimal conditions.
2.3 Mapping to Real Space
First, it is important to clarify several distinctions:
1. The differences between classification, detection, and tracking.
2. Tracking as it is traditionally accepted within the realm of computer vision versus tracking as it is being used within the context of this research.
We shall refer to tracking in the context of this research as 'spatial tracking.' Furthermore (for the sake of clarification), "scene" is used to describe the video feed from a camera. "Frame" is used to describe a single image, or time slice, of the scene.
Frame: f_x
Scene: (f_0, f_1, f_2, ..., f_x, ..., f_n)
Classification is the process by which objects found in a frame are organized into a logical category. For example, if an animal and a chair are identified in an image, two potential classifications are of “living” and “non-living” objects.
Detection is the process by which objects are identified and mapped within a frame. With most general purpose object detectors, objects are detected and subsequently encapsulated within a bounding box denoting the object’s position within the frame. Detection can be thought of as classification and localization, localization being the position of the object in the frame. This differs from the tracking of an object in a scene in that the detection of an object from one frame to the next is completely independent of previous or future frames.
Tracking, as understood within the realm of computer vision, is accepted as the analysis of video sequences for establishing the location of a target object over a time sequence [13]. The referenced location of a target is within the visual scene captured by the camera - essentially tracking the detected target on camera from frame to frame. This technique is useful when there is a desire to isolate and focus on one (or several) distinct objects within a scene.
Tracking can be divided into five main groups based on the method of tracking [13]:
1. Trackers that optimize the direct match between a single model of the target and the incoming image on the basis of a representation of its appearance.
2. Trackers that match the target to the image but hold more than one model of the target, allowing for long-term memory of the target's appearance.
3. Trackers that maximize the match subject to explicit constraints derived from the motion, coherence, and elasticity of the target.
4. Trackers that do not perform matching on the basis of appearance but instead maximize the discrimination of the target from the background, learning to label pixels as target pixels or background pixels.
5. Trackers that maximize discrimination with explicit constraints.
Spatial tracking describes the process of translating a target object's position in a scene to a real-world, physical location. In many (most) "off the shelf" computer vision detection technologies, the general location of a detected object or tracked target is represented by a bounding box. Extrapolating a bit from the bounding box's placement and dimensions, we can determine the relative location of an object within a scene. From here, with a bit of configuration, we should be able to translate this information into a known physical location, given a-priori knowledge of what location the camera's visual feed captures.
For any kind of real visual surveillance system, context is just as important (if not more important) as any computation speed, detection/tracking accuracy, or performance metric of the computer vision component. In situations in which public safety is at risk, it is imperative that surveillance operators and first responders are able to discern more information about a scene than where a target exists in a camera feed at a single moment in time.


2.4 Realistic Situation
The motivation for this research centers around an extremely relevant situation and concern in current U.S. culture: the active shooter scenario. Airports, schools, government buildings, restaurants, parks, casinos, neighborhoods - over the past several years all these locations (and many more) have played host to situations involving an individual(s) with a firearm(s) taking their vengeance and frustrations out on innocent people. One thing that these locations have in common: during the incident, many of the locations were monitored by video surveillance systems. While not insinuating that the situations could have been wholly prevented using computer vision technologies (the solution is more political than technological), additional surveillance tools in the hands of security surveillance operators and first responders could be valuable in mitigating damage to human life.
Imagine a situation in which an individual is an active threat to public safety, either by being in possession of a non-concealed firearm or by having already opened fire with said firearm. The surveillance operators may or may not have already identified that threat. Depending on their familiarity with which camera feed corresponds to which geographical area, they may or may not be able to identify where the threat is at any point in time. When first responders arrive, they are largely in the same situation as the surveillance operators, but with a massive disadvantage given their very likely unfamiliarity with the layout of the location. While responders and surveillance operators are using valuable time figuring out where the target currently is or has been, the likelihood of harm to human life increases. A system that can detect an active/potential threat, tag such an individual, and spatially track him/her through real space could be invaluable to security/first responders that need to act on such information - lowering the likelihood of harm to human life. After all, in situations such as these, every moment is crucial.
This proposed system utilizes computer vision technologies in two ways:
1. Detecting a threat and tagging for tracking.
2. Spatially tracking that individual through real space.


The focus of this research is on the latter: creation of a system that can utilize data provided by “out of the box” computer vision technologies and translate that data into a real-world location. An active-shooter scenario is an extreme example. Such a system could also be used for something as innocuous as tracking the movement of animals through a wildlife preserve to protect against poachers.
2.5 Computer Vision
2.5.1 ’Out of the Box’
Of the many applications of computer vision, one of the most powerful and easily accessible is simple (not to be confused with trivial) object detection and classification within a scene or frame. Very easily, one can retrieve the open source code from an online repository and begin detecting objects in images within minutes. This allows for all sorts of possibilities for the specific usage of this extremely generalized technology. Joseph Redmon, one of the pioneers of the exceptionally versatile and powerful detection tool YOLO, summarizes it best with a simple statement: "What are we going to do with these detectors now that we have them?" [8].
Now, it is important to clarify that the focus of this research is tangential to computer vision and object detection, but not narrowly focused upon the subject. Instead, we are treating these object detection technologies as a tool for the research, rather than the subject of the research itself. Just as one would expect to be able to purchase software 'out of the box' to begin designing architectural structures without having to dig through the source code to modify or understand how and why the software works, we are using computer vision object detection tools 'out of the box' in order to create our system. The same applies to the machine learning algorithms that are used for the pixel-to-real mapping. The focus of the research is translating a target object's position on a video feed to a real-world location for spatial tracking. In the interest of creating a smart surveillance system that is not strongly dependent on the optimization or customization of a specific tool, we are using an open source object detection tool that is robust, but more importantly, general enough that any object detection technology that utilizes bounding boxes to identify the location of an object in an image (most do) could be easily substituted. We have decided to use YOLOv3 for just this purpose.
2.5.2 Yolo v3
You Only Look Once (YOLO), popularized by Joseph Redmon of the University of Washington, is an "advanced approach [to] object detection" tool. "YOLO applies a single CNN to the entire image which further divides the image into grids. Prediction of bounding boxes and respective confidence score are calculated for each grid. These bounding boxes are analyzed by the confidence score. The architecture of YOLO has 24 convolutional layers and 2 fully connected layers" [11].
Figure 2.1: Yolo Neural Net Framework [9]
The 'out of the box' YOLO detector takes an input image and resizes it to 448x448 pixels. The image passes through the convolutional network, producing an output of a 7x7x30 tensor (a multidimensional array). The tensor gives information about the coordinates of each bounding box's rectangle and a probability distribution over all classes the system is trained for. A confidence-score threshold eliminates class labels scoring less than 30%; this threshold can be modified [11].
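As a minimal illustration of this thresholding step (our own sketch in Python, not YOLO's actual source; the detection tuple layout and variable names are assumptions made only for illustration):
# Minimal sketch: keep only detections whose confidence score clears the
# (modifiable) threshold. Each detection is assumed to be a tuple of
# (class_label, confidence, x, y, width, height).
def filter_detections(detections, conf_threshold=0.30):
    return [d for d in detections if d[1] >= conf_threshold]

# Example: only the 'person' detection at 0.91 confidence survives the 30% cut.
raw = [("person", 0.91, 220, 140, 60, 180), ("dog", 0.12, 40, 300, 30, 25)]
print(filter_detections(raw))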
Other CNN-based systems reuse classifiers or localizers to detect an object in an image, meaning that these models are applied to an image at multiple locations with different scales. Compared to other CNN-based methods, YOLO has several advantages in both speed and precision [11].


First, YOLO applies a single CNN for both classification and localization; a single CNN simultaneously predicts multiple-bounding boxes and class probabilities. This means that YOLO is a comparatively fast object detector [9].
Secondly, YOLO reasons globally about an image when making predictions. It ’sees’ the entire image during training and test to implicitly encode contextual information about classes and their appearance, allowing for fewer background errors compared to other object detectors[9].
Finally, YOLO learns generalized representations of objects. Since YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs [9].
One of the drawbacks of YOLO relevant to this research is an issue related to the localization of smaller objects within frames, stemming from spatial constraints on bounding box predictions. However, by the time YOLOv3 was released, many of the issues related to the localization of smaller objects had been attenuated [9][8].
YOLOv3 provides what we need for this research, namely a general-purpose object detector that is robust, fast, versatile, and easy to use. "The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance" [9].
2.6 Related Research
2.6.1 SLAM
The subject of the research in this thesis requires that information about the environment be known a-priori, meaning that the location is known before configuration and deployment of the system. It is worth noting a similar technique within the realm of computer vision that is used for both mapping and geo-location without such a-priori requirements: Simultaneous Localization and Mapping (SLAM). "SLAM is the process whereby an entity (robot, vehicle, or central processing unit with sensor devices carried by an individual) has the capacity for building a global map of the visited environment and, at the same time, utilizing this map to deduce its own location at any moment" [4]. Simply, SLAM techniques map an unknown environment and track a user entity inside of the developing map, at the same time. In this context, localization refers to determining, in an exact manner, the current pose of a user within an environment [14].
Figure 2.2: SLAM Example Mapping [14]
Initially, the processes of mapping and localization were studied as independent problems. However, without a-priori knowledge of an environment, these two issues are entirely dependent upon each other. For precise localization within an environment, a correct map is necessary, but in order to construct a correct map, it is necessary to be properly localized within an environment. Thus, the two problems are simultaneous [4]. Different implementations of SLAM use different techniques, including various exteroceptive sensors (sonar, etc.), GPS devices, external odometers, and laser scanners [14]. Laser sensors and sonar allow for precise and very dense information about the environment; however, they suffer from being less useful in cluttered environments or for recognizing specific objects. GPS suffers from signal deprivation.
When cameras are used as the only means of exteroceptive sensing, this is known as visual SLAM, vision SLAM, vision-only SLAM, or camera SLAM. When proprioceptive sensors are added, allowing an entity to obtain measurements like velocity, position change, and acceleration, this is known as visual-inertial SLAM, which complements visual SLAM with increased accuracy and robustness [14]. Examples of proprioceptive sensors include: encoders, accelerometers, and gyroscopes.
Similar to the research in this thesis, cameras used for visual SLAM must be calibrated ’off-line’ for intrinsic and extrinsic parameters prior to the execution of the SLAM process. Calibration must consider the camera’s static position in space as well as geometry (focal length and principal point) [14].


Visual SLAM primarily works by tracking a set of points through successive camera frames. With these points, the system is able to triangulate its position in an environment while, at the same time, mapping the environment as it passes through. The tracked points are known as landmarks: regions in the real world described by 3D position and appearance information. A salient point is the 2D representation of a landmark that the entity uses for tracking [4].
Primary applications of SLAM include: automatic car piloting on unrehearsed off-road terrains; rescue tasks for high-risk or difficult-navigation environments; planetary, aerial, terrestrial, and oceanic exploration; augmented reality applications where virtual objects are included in real-world scenes; visual surveillance systems; and medicine [4].
As mentioned previously, no system is perfect and without some point of failure. Many visual SLAM systems fail when operating under sub-optimal conditions, including: external environments, dynamically changing environments, environments with too few salient features, large-scale environments, dynamic movement of the camera(s), or occlusion of the sensor(s) [4].
2.7 Spatial Tracking Smart Surveillance System
2.7.1 Definition
This smart spatial tracking system will be capable of translating object-detection-derived pixel data into a real-world, two-dimensional spatial location, with a mean error of no more than 3 [ft] (36 [in]) between the predicted location and the ground truth location.
We will use machine learning algorithms to implement the pixel-to-real translation. Furthermore, we will build a web-based application that will showcase the complete PoC spatial tracking smart surveillance system.
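As a sketch of how the 3 [ft] acceptance criterion can be checked, assuming predicted and ground-truth locations are available as (x, y) pairs in inches (the function and variable names below are ours, introduced only for illustration):
import math

ACCEPTANCE_IN = 36.0  # 3 [ft] tolerance from the system definition

# Mean Euclidean distance between predicted and ground-truth real-world points.
def mean_error(predicted, ground_truth):
    dists = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(dists) / len(dists)

pred = [(10.0, 112.0), (55.0, 240.0)]
truth = [(12.0, 120.0), (60.0, 230.0)]
err = mean_error(pred, truth)
print(f"mean error = {err:.1f} [in], acceptable = {err <= ACCEPTANCE_IN}")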
2.7.2 Requirements
In creating a spatial tracking system that is considered "smart surveillance," we needed to define a product that is highly generalized. It should be able to be implemented in different scenarios using different configurations. This requires that the system necessarily be robust against different vantage points and camera orientations.
Depth Problem
One major issue and difficulty in creating the system is object depth. Object depth is simply an object's distance away from the camera - the vertical axis of the two-dimensional, top-down view of the scene.
For example: consider a spatial tracking system that is positioned to monitor a busy shopping mall corridor. Once having identified a target, the system should be able to track the location of an individual whether the camera is mounted 200ft above the scene or much lower; perhaps, closer to ground-level where it has better resolution to identify targets.
Figure 2.3: System "Depth" Problem
Imagining this scenario, the visual feeds from the two camera positions would look very different. Shown in the image above and to the left, the camera positioned 200ft above the scene would have access to distinct pixel coordinates for a target - both horizontal positioning and distance (depth) away from the camera. Being able to map to a distinct (x,y) real-world position would be a matter of finding an appropriate transformation from every camera point to every possible real point in the scene. However, in the latter situation, the camera would only have access to distinct horizontal pixel positioning. Determining depth simply from the location of the detected object would be difficult. If the camera is exactly at ground level, detecting object depth solely from object location is impossible: every object would appear to have the same y-coordinate location. This situation is shown in the image above and to the right.
This is the "depth" problem. To meet the requirement of functioning properly despite different vantage points and camera orientations, our system will need to be able to adapt to this issue. We will attempt to use attributes of the bounding box of a detected object as an additional indicator of depth, or as the sole indicator of depth in the case of a ground-level camera.
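The sketch below illustrates the idea with three candidate feature layouts for the regression input; the array shapes and variable names are assumptions for illustration, not the exact feature encoding used later in the thesis.
import numpy as np

# One detection: bounding-box centre (cx, cy) in pixels plus the box width/height.
cx, cy, box_w, box_h = 315.0, 402.0, 58.0, 171.0

# Elevated camera: distinct y-coordinates already encode depth.
features_elevated = np.array([[cx, cy]])

# Lower camera: supplement the pixel coordinates with a bounding-box attribute...
features_with_height = np.array([[cx, cy, box_h]])

# ...or, for a ground-level camera, let the box height stand in for depth entirely.
features_height_only = np.array([[cx, box_h]])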
Lens Distortion
Figure 2.4: System "Lens Distortion" Problem
Another issue with creating a generalized system that is able to transform pixel coordinates to a real-world location is lens distortion. Optical lens distortion, or lens error, occurs when a camera-captured image appears distorted from reality - a rounded deviation from a rectilinear projection. This is a result of the design of the optical component of the camera. An example of this phenomenon is shown above. To the right, the points are each representative of where an individual stood. For each position, an image was captured and the individual was detected, capturing the pixel location within the scene. To the left is each one of the points where the individual stood from the perspective of the camera. The red dot represents the camera location. From inspection, there is a clear distortion between the camera view and reality.
The focus of this research is not to correct these image errors. Furthermore, we want to be able to create a generalized system that should, ideally, be able to function with any camera feed suitable for object detection.
As can be seen in the image, ignoring any slight offset of the camera from a center-line, there is clear distortion between the real positions and what is captured. Again, the focus of this research is not to correct these image errors. Instead, we will attempt to use machine learning to "learn" the mapping from the camera locations to the real-world positions.
Repeatability
A generalized smart spatial tracking system requires that a procedure to implement such a technology be reproducible. We will outline steps used to implement such a system, including: finding a suitable mapping (pixel-to-real) model, assessing how well the model performs, and deploying the model on the system. Once successful, we will repeat these steps in a different trial location to prove the repeatability of the process.


3 Methods
3.1 Tools
3.1.1 Software
OpenCV
We used OpenCV in python to be able to capture single image frames for calibration as well as record small sections of video for testing and accuracy assessment.
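A minimal sketch of this kind of OpenCV capture code, assuming a single USB webcam at device index 0; the file names, clip length, and frame size are placeholders rather than values taken from the thesis.
import cv2

cap = cv2.VideoCapture(0)  # webcam at device index 0 (assumed)

# Grab a single frame for calibration/training-point capture.
ok, frame = cap.read()
if ok:
    cv2.imwrite("calibration_frame.jpg", frame)

# Record a short clip (~5 s at 30 fps) for testing and accuracy assessment.
fourcc = cv2.VideoWriter_fourcc(*"XVID")
writer = cv2.VideoWriter("test_clip.avi", fourcc, 30.0, (1280, 720))
for _ in range(150):
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)

writer.release()
cap.release()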
YOLO
Following what was mentioned and explained in the previous chapter, we used YOLO as a general-purpose object detector, focusing on the 'person' class. When a person target is detected within a frame, we extract the center point of where the target was detected, as well as different components of the bounding box covering the target. We then wrote this information to a file to be used for training/configuration for spatial tracking.
After finding an adequate equation to map the center point and area of the bounding box to real-world coordinates, we used an additional YOLO Python script to act as the data collection portion of the geo-tracking system. The second YOLO script collects data akin to the first, except that it transports the data to a Bokeh server, which acts as the visualization portion of the spatial tracking system. Prior to sending the HTTP request, the vision portion runs the collected point through the configured system to obtain the desired real-world point. The Bokeh server accepts an HTTP POST request containing the transformed data.
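A sketch of what the detection-logging side of this looks like, assuming one CSV row per 'person' detection; the column order and file format are our assumptions rather than the thesis's exact layout (the deployment-side POST is sketched in Section 3.3.4).
import csv

# Append one row per detected 'person': bounding-box centre, width, height, and area.
def log_detection(path, cx, cy, box_w, box_h):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([cx, cy, box_w, box_h, box_w * box_h])

# Example: one training point captured while the target stands at a known location.
log_detection("training_points.csv", 315.0, 402.0, 58.0, 171.0)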


Bokeh
"Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets." (Taken from the Bokeh official website). After finding an adequate model to map pixel data to real-world coordinates, we use Bokeh as the visual component of the spatial tracking system, visualizing the geo-tracking of the objects captured by the YOLO script.
Qpython
"QPython is a script engine which runs Python programs on Android devices." (Taken from the QPython official website). Instead of building an entire Android app for collecting GPS coordinates, we decided to use a simple Python script. Unfortunately, after our first trial, we realized that there is a major flaw with GPS collection using an Android phone and its embedded GPS receiver: often, the GPS coordinates are not updated, giving stale and useless data. We fixed this in subsequent trials by forcing fresh GPS data.
Scikit-learn
We used Scikit-learn's various open source machine learning algorithms to evaluate a mapping from the camera-collected pixel data to real-world coordinates. The general forms of the regression algorithms are:
Simple Linear Regression: y(x) = w_0 + w_1 x
Multivariable Regression: y(x_1, x_2, ..., x_p) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p
Polynomial Regression: y(x) = w_0 + w_1 x + w_2 x^2 + ... + w_p x^p
Where...
w = constant coefficient (or weight)
y = dependent value(s)
x = independent value(s)
The regression algorithms used in the configuration phase to build a pixel-to-real mapping are:
General Linear Regression
The generalized linear regression method determines the optimal curve fit by minimizing the distance between the curve's predicted values and the experimental data points. To do this, error is represented by the sum of the squared residuals, E. The error is calculated as shown below and minimized:
E = Σ_{i=1}^{n} r_i^2
Where r_i is the difference between the actual and predicted values at the i-th point.
Ridge Regression
The ridge regression model is used to prevent overfitting that can occur with other models, such as the generalized linear regression model. It introduces a small amount of bias into the least squares method in order to decrease the long-term variance. This regression model minimizes E, as defined in the following equation:
E = Σ_{i=1}^{n} r_i^2 + λ × (m^2 + p^2)
Here, a penalty term is added to the traditional least squares method, where lambda (0 < λ < ∞) determines how severe the penalty is, m represents the slope of the line, and p represents other parameters contributing to the curve. Increasing values of lambda cause the predictions of dependent variables to become decreasingly sensitive to changes in the independent variables. This model is most effective when you know most/all of your optimization parameters are informative.


Lasso Regression
Lasso regression is very similar to ridge regression—it adds a penalty term to the error equation, which introduces a small amount of bias. The superficial difference between these two methods is that lasso regression takes the absolute value of the optimization parameters rather than squaring them:
E = Σ_{i=1}^{n} r_i^2 + λ × (|m| + |p|)
Therefore, lasso regression can eliminate the dependence between the predictions of dependent variables and changes to the independent variables and therefore is better at reducing the variance in models that contain uninformative variables. In these instances, the uninformative variables will be excluded from the final equations making them simpler and easier to interpret.
ElasticNet Regression
The elastic net regression combines the lasso regression penalty with the ridge regression penalty—each maintaining separate lambda terms, as illustrated in the equation below:
E = Σ_{i=1}^{n} r_i^2 + λ_1 × (m^2 + p^2) + λ_2 × (|m| + |p|)
In doing so, it combines the strengths of both lasso and ridge regression. This is an ideal model if you don't know how useful some of your parameters are, or if you have far too many variables to know about each one. This method is especially useful in situations where there are correlations between different optimization parameters within the model.
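As a sketch of how these four candidate models can be instantiated and fitted with Scikit-learn, assuming X holds the camera-derived pixel features and Y the corresponding real-world (x, y) locations in inches; the sample values and alpha settings below are illustrative, not tuned values from the trials.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# X: camera-derived features (pixel x, pixel y); Y: real-world (x, y) in inches.
X = np.array([[320, 410], [300, 380], [250, 300], [400, 290], [500, 350]], dtype=float)
Y = np.array([[0, 60], [12, 120], [36, 300], [-60, 320], [-120, 180]], dtype=float)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit every candidate; model selection happens later via LOO / Train-to-Test.
fitted = {name: model.fit(X, Y) for name, model in candidates.items()}
print(fitted["ridge"].predict([[310, 400]]))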
These models were chosen because of past experience and success using one or more of them in different machine learning applications. One of the considerations for this research was whether or not to spend a significant amount of time tuning each model to achieve maximum accuracy. We decided against this, choosing instead to focus our efforts on building the system itself, proving the repeatability of the process, and building a PoC of the full system. The rationale behind this can be found in Chapter 6, Future Work.


3.1.2 Hardware
Lenovo Ideapad 500
• Processor: Intel(R) Xeon(R) 3.7 GHz, 4 Core(s)
• Memory: 32 GB
• Operating System: Ubuntu 16.04
Logitech c270 Webcam
• Max Resolution: 720p/30fps
• Focus Type: Fixed focus
• Lens Technology: Standard
• Field of View: 60 degrees
Samsung Galaxy S5
• Processor: Quad-core 2.5 GHz Krait 400
• Chipset: Qualcomm MSM8974AC Snapdragon 801 (28 nm)
• Memory: 32 GB
• Operating System: Android 4.4.2 (KitKat)
3.2 Procedure Overview
There are two necessary components of the system: the client camera(s) and the server user interface. The client captures the data related to target position and translates that into a real-world location (pixel-to-real), which is sent to the server. The server user interface accepts messages from the client camera and visually displays the results to the user, completing the surveillance system.
3.3 Client
The client camera configuration and deployment process for each camera is unique, but follows the same basic procedure:


• Obtain Training Data
• Build Models
• Determine Best Model
• Export Model and Deploy on Camera
A more detailed description of the experimentation that led to the development of the above process is given in Chapter 4.
3.3.1 Obtain Training Data
Obtaining training points involves coordinating a real-world location with a location detected by the camera being used - either via GPS and/or physical measurement. Training points are obtained one-by-one via a target moving from location to location. At each location: a GPS and/or network-based location is captured using a GPS device, camera coordinates are determined using object detection software on an individual standing in a stationary position, and the physical distance relative to the camera is found using measuring tools.
3.3.2 Build Models
Using analytical tools, we take a sample of the aforementioned general linear regression models. With these models, we fit the dependent variable camera coordinate training data values to the independent variable ground truth training data values (GPS, measured distance, etc.).
3.3.3 Determine Best Model
Taking the sample of the fitted general linear regression models, we determine the 'best' model through the 'Leave One Out' and 'Train-to-Test' methods. Accuracy is determined by a 'Hit or Miss' score, wherein a point predicted using a regression model is either inside or outside a threshold given by a predetermined Euclidean distance from the ground truth point it is attempting to predict. Accuracy of a model is determined by the total number of predicted points considered accurate divided by the total number of points. In addition, we determine the model with the lowest mean error of predicted point distance from ground truth.
The model(s) with the lowest mean error and highest model accuracy will be deemed the ’best’ model(s) to use. We will then assess this model(s) with the ’Path Approximation’ and ’Moving Target’ methods to determine how it/they perform when introduced to new data.
Leave One Out
The 'Leave One Out' method of accuracy evaluation involves building a regression model with all but one of the training points. Then, the fitted regression model is applied to the 'left out' training point to calculate a predicted point. The predicted point is then assessed for its accuracy with the previously described 'Hit or Miss' method. The model is then rebuilt leaving out another point. This process is repeated until all points have been left out. Again, overall model accuracy is determined by the total number of predicted points considered accurate divided by the total number of points.
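A compact sketch of 'Leave One Out' combined with the 'Hit or Miss' score, written with Scikit-learn's LeaveOneOut splitter as an implementation convenience; the 36 [in] threshold and the array names reuse the assumptions from the earlier sketch and are not necessarily the thesis's exact code.
import numpy as np
from sklearn.model_selection import LeaveOneOut

# X: camera pixel features, Y: ground-truth real-world points (see earlier sketch).
def loo_evaluate(model, X, Y, hit_threshold=36.0):
    hits, errors = 0, []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], Y[train_idx])
        pred = model.predict(X[test_idx])[0]
        dist = float(np.linalg.norm(pred - Y[test_idx][0]))  # Euclidean error
        errors.append(dist)
        hits += dist <= hit_threshold                         # 'Hit or Miss' score
    return hits / len(X), float(np.mean(errors))              # accuracy, mean error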
Train-to-Test
The ’Train-to-Test’ method of accuracy evaluation involves building a regression model with all of the training points. Then, the fitted model is, again, applied to the training dependent variable data to yield predicted points. All points are then evaluated using the ’Hit or Miss’ method, which will eventually yield overall model accuracy.
The lowest mean error is calculated during the above 'Leave One Out' and 'Train-to-Test' methods. In addition to determining whether a predicted point is inside or outside a threshold given by a predetermined Euclidean distance from the ground truth point using the 'Hit or Miss' method, we determine the actual distance away from the ground truth. For each model, these values are averaged to determine the mean error.
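Stated as formulas (our notation, consistent with the description above), with ground-truth points g_i, predicted points p̂_i, N training points, and hit threshold τ:
Hit_i = 1 if ||p̂_i - g_i|| ≤ τ, else 0
Accuracy = (1/N) Σ_{i=1}^{N} Hit_i
Mean Error = (1/N) Σ_{i=1}^{N} ||p̂_i - g_i||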
After determining the best accuracy and lowest mean error, we undertake another round of evaluation using the 'Path Approximation' and 'Moving Target' methods to determine how well the chosen regression model(s) perform when introduced to new camera data.
Path Approximation
Originally, this step of the procedure was designed to measure real-time data's ground truth error, in the form of a GPS location versus a predicted point produced from a regression algorithm. However, during the process of obtaining data, we discovered that not only was GPS data completely inaccurate as a ground truth, but collecting real-time data was impossible with the technology available to us. We still needed a way to gauge how well the model would perform when introduced to new data, so we developed the path approximation method. Dynamic motion of a target moving between known points is captured, recording the relevant dependent variable camera data values. Then, the chosen regression algorithm is applied to the camera data, yielding a predicted 'trial path.' Comparing the video feed to the predicted values, we qualitatively assess the feasibility of the dynamic model by interpolating between the known path and the 'trial path.' This is a novel qualitative determination of approximately how well the model performs when introduced to new data.
Moving Target
Along the same lines as the 'Path Approximation' method, we had to modify our original plan to circumvent the use of real-time data. We moved from real-time data capture to analysis of selected frames from dynamic motion at known positions. We would capture footage of a test target moving to several predetermined, known positions. The exact frames in which the target was captured at those physical locations were extracted from the video footage. Then, we apply our chosen regression algorithm to those points to yield predicted points. Finally, we determine the Euclidean distance from the predetermined locations to the predicted points to determine the mean error. Again, this determines approximately how well the model performs when introduced to new data.


3.3.4 Export Model and Deploy on Camera
Once a model has been chosen for the stationary camera scene, it must be integrated into the object detection YOLO software. With Scikit-learn, transferring a model from one program to another was accomplished using the following library and code snippets (Python 3):
from joblib import dump, load
# export model
dump(model, 'model-name.joblib')
# import model
model = load('model-name.joblib')
Once we have imported our model into our client camera, we begin surveillance of a scene. When a target object is detected, the pixel information is collected. Applying our model to the pixel information with our pixel-to-real mapping, we obtain a real-world spatial location. Once we are able to acquire a spatial location, we send this data to our user interface server using a POST request.
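A sketch of this deployment step, under the assumptions that the exported model consumes the same pixel features used during training and that the user interface server listens at a placeholder address; the payload keys are illustrative.
import requests
from joblib import load

model = load("model-name.joblib")               # model exported during configuration
SERVER_URL = "http://192.168.0.10:5006/update"  # placeholder address for the UI server

# Called once per detection: pixel features in, real-world location out, then POST.
def report_detection(features, camera_id="camera-0"):
    # features: the same pixel-derived values the model was trained on,
    # e.g. [x_coordinate, HeightBoundingBox]
    real_x, real_y = model.predict([features])[0]
    requests.post(SERVER_URL, json={"camera": camera_id,
                                    "x": float(real_x),
                                    "y": float(real_y)})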
3.4 Server
HTTP Server
The user interface server idly waits for HTTP POST requests from the client camera(s). When it receives a request, it appends this incoming data to the data structure containing the position information, or creates the data structure if it does not already exist.
Using AJAX, the server asynchronously updates the display of the smart surveillance system. More about the design and implementation of the server is provided in Chapter 5.
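As a minimal stand-in for the Bokeh-based server described in Chapter 5, the following Flask sketch shows the POST-handling behaviour described above (append to the per-camera position structure, or create it if it does not yet exist); Flask itself, the route, and the payload keys are our assumptions for illustration.
from flask import Flask, request, jsonify

app = Flask(__name__)
positions = {}  # camera id -> list of (x, y) real-world points

@app.route("/update", methods=["POST"])
def update():
    data = request.get_json()
    cam = data.get("camera", "camera-0")
    # Create the data structure for a new camera, or append to the existing one.
    positions.setdefault(cam, []).append((data["x"], data["y"]))
    return jsonify(status="ok", count=len(positions[cam]))

if __name__ == "__main__":
    app.run(port=5006)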


4 Data/Analysis
This chapter will show the experimental development for the first three steps of the client camera configuration process: Obtaining training data, building the models, and determining the best model to deploy on the camera client. We show the methods used to analyze the data that was collected to build the mapping from pixel-to-real, i.e. finding an appropriate mapping model and assessing how well it performs. Based on the methods and tools described in the previous chapter, we obtained the following results for four trials in different locations.
4.1 Synthesis
From Trial 1 to Trial 4, we were able to develop a process for finding an appropriate mapping model and assessing how well it performs. A visual flow of the entire process can be found in the appendix.
Trial 1
Trial 1 was beneficial in that we developed a successful method for obtaining training data. We found that our proposed method of GPS data polling did not yield the desired results: the GPS data would not necessarily update for each new captured position. We modified our GPS collection code for the next trial to force an update.
Trial 2
In Trial 2, we solved the 'lens distortion' issue outlined in Chapter 2 Requirements. We were able to successfully map pixel data to real data for the spatial tracking surveillance system. For the process: we successfully improved our GPS data collection method, devised and implemented a 'measured distance' method for training data collection, tested the process of determining the best model(s) for pixel-to-real mapping, and evaluated the chosen model(s). We discovered that, although our GPS collection method had improved, GPS, as it was available to us, was not a viable method for building the system.
After completing the process of determining the best mapping model to use, the evaluation showed that the best model(s) were not quite sufficient to use in the system through the ’Moving Target’ evaluation method. The lowest mean error was well over 10ft (120 [in]).
Additionally, we determined that the ’Train-to-Test’ method was not a suitable assessment tool for model accuracy, as it led to over-fitting.
We decided to increase the number of collected training points in the next trial in an attempt to improve accuracy.
Trial 3
Trial 3 was considered the first successful trial of our research. The increase in number of training points did yield an increase in accuracy. After determining the best model(s) and evaluating performance, we were able to obtain a result of less than 12.00[in] mean error using our ’’Moving Target” method. Unfortunately, this was only using (xcoordinate, ycoordinate) as a predictive parameter. We decided to move forward with this model to build an iteration of the PoC of the spatial tracking smart surveillance system.
In addition, we discovered that using Area_BoundingBox as a predictive parameter, despite it being correlated with distance from the camera, yielded completely unreliable results. We decided to discontinue usage of Area_BoundingBox as a predictive parameter. However, we still needed some way of solving our "depth" issue outlined in Chapter 2 Requirements to have a completed PoC.
We decided to increase the number of collected training points in the next trial in an attempt to, again, improve accuracy.
Trial 4
Trial 4 was used to prove repeat-ability of our process, solving our 'repeat-ability' issue outlined in Chapter 2 Requirements. With an increase in the number of training points, there was an increase in accuracy and a decrease in mean error with the LOO method. However, when assessing the best model(s) with the 'Path Approximation' and 'Moving Target' methods, there was a slight decrease in performance. Whereas in the last trial we were able to obtain a mean error of 12 [in] using our "Moving Target" method and (x_coordinate, y_coordinate) as predictive parameters, we experienced an increase to a 30.02 [in] mean error. However, this is still within the tolerance of acceptance for the system. We will move forward with this model to build a second iteration of the PoC of the geo-tracking smart surveillance system.
In addition, using Height_BoundingBox instead of Area_BoundingBox as a predictive parameter, we were able to solve our "depth" issue outlined in Chapter 2 Requirements. Performing the "Moving Target" evaluation using (x_coordinate, Height_BoundingBox) as predictive parameters, we were able to obtain a mean error of 35.92 [in]. This is within our tolerance of acceptance for the system.
With this trial, we had successfully completed our system as it was outlined in Chapter
2.
4.2 Trial 1: Field in Stapleton (Denver, CO)
4.2.1 Overview
The first trial for the project involved polling GPS and network position data to collect configuration data for the spatial tracking system.
We utilized a park in the Stapleton area of Denver, CO. This location was chosen for its open space and relative lack of pedestrian foot traffic at that time of day. The other object detection categories for YOLOv3 were omitted from the collected data. In addition, care was taken to ensure that only one relevant object (person) was detected per capture. We collected a total of 75 points between camera captured
(x_coordinate, y_coordinate, Area_BoundingBox) and GPS/network collected (latitude, longitude).
Figure 4.1: Trial 1 Camera Images
Object detection capture in Stapleton field
The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded size of the bounding box. Notice that there is an inverse relationship between proximity to the camera and recorded yCoordinate of the object.
Figure 4.2: Trial 1 Camera Position Plot
x_coordinate and y_coordinate annotated with Area_BoundingBox (greater y values correspond to closer proximity to the camera).
There is a direct correlation between the bounding box area and the y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera.
Figure 4.3: Trial 1 Camera Data Relationships
Relationship between y_coordinate and Area_BoundingBox
The collected data for the GPS positioning is shown below and left. Upon inspection, there are far fewer distinct GPS points collected than corresponding camera image capture points. Later, we found out that there is a quirk to general-purpose mobile GPS radios: although they appear to be constantly updating, the data is often stale. Although 75 points were collected from unique locations, not all of the data is unique. More on this topic will be explained in a later section. The network latitude and longitude collected locations are shown below and right.

Figure 4.4: Trial 1 GPS and Network Position Locations
The positioning collection program was configured to collect both GPS and network data if the network was within range. Halfway through the data collection, we found the
mobile device had connectivity to the network. We obtained 50 network latitude and longitude locations (versus 75 GPS). Of course, we later found the previous issue with GPS stale location data persisted with the network data as well: although 50 points were collected from unique locations, not all of the data is unique.
4.2.2 Analysis
Based upon the inconsistency of the data collected, we made an educated assumption that an accurate geo-tracking model would be next to impossible to create. We decided not to collect dynamic video data and build a ’’trial path.” We did conduct analysis on the data. Unfortunately, as expected, the analysis did not yield favorable results.
This trial taught us the importance of being able to accurately collect distinct training data points to build a model. Although we were unable to progress with the data collected, we devised a method to collect more consistent data during the next iteration.
4.3 Trial 2: Soccer Field
4.3.1 Overview
Figure 4.5: Trial 2 Camera Images
Object detection capture on soccer field (inference time: 698.96 ms)
The second trial for the project involved polling GPS and manually measuring predetermined position data to collect training configuration data for the spatial tracking system.
We utilized a soccer field in Denver, CO. This location was chosen for its open space and relative lack of pedestrian foot traffic at that time of day. The other object detection categories for YOLOv3 were omitted from the collected data. In addition, again, care was taken to ensure that only one relevant object (person) was detected per capture.
We collected a total of 17 points between camera captured (x_coordinate, y_coordinate, Area_BoundingBox), GPS (latitude, longitude), and measured distance in inches (x, y).
The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded size of the bounding box. Notice that there is an inverse relationship between proximity to the camera and recorded yCOordinate of the object.
Figure 4.6: Trial 2 Camera Position Plot
x_coordinate and y_coordinate annotated with Area_BoundingBox
There is a direct correlation between Area_BoundingBox and y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera.
Figure 4.7: Trial 2 Camera Data Relationships
Relationship between y_coordinate and Area_BoundingBox
The collected data for the GPS positioning is shown below and left. The GPS collection method was updated for this trial, and all of the GPS points are unique for each point collected. The collected measured distance positions are shown below and right. The measured distance points and the GPS positions represent the same physical locations.
Figure 4.8: Trial 2 Collected GPS and Measured Distance Location Points
Beginning with this trial, we started using our four methods of evaluating the accuracy of a model: 'Leave One Out,' 'Train-to-Test,' 'Path Approximation,' and 'Moving Target.' The trial path for Path Approximation is shown below.
Figure 4.9: Trial 2 Path
Trial path on soccer field shown as alternating colors
4.3.2 GPS
After Trial 1, we were suspicious of the consistency of the GPS data. We decided to compare our GPS data against our measured data. To do this, we normalized the GPS data to fit on the same distance scale as the measured distance data, i.e. converting from degrees in latitude and longitude to inches in x and y. The results are displayed below.
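One way such a degree-to-inch normalization can be performed is sketched below (a minimal flat-earth approximation for a small area; the constant and function name are ours and are not necessarily the exact method used). The resulting comparison is displayed in the figure that follows.

import math

# Roughly 69 miles per degree of latitude, converted to inches.
INCHES_PER_DEGREE_LAT = 69.0 * 5280 * 12

def latlon_to_inches(lat, lon, lat0, lon0):
    """Convert (lat, lon) to local (x, y) offsets in inches from an origin
    (lat0, lon0), assuming the surveyed area is small."""
    y = (lat - lat0) * INCHES_PER_DEGREE_LAT
    # Longitude degrees shrink with the cosine of the latitude.
    x = (lon - lon0) * INCHES_PER_DEGREE_LAT * math.cos(math.radians(lat0))
    return x, y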
Figure 4.10: Trial 2 GPS Comparison
Trial 2 comparison between normalized GPS positions and measured positions (average error: 416.41 [in])
From the above figure, it can be discerned that the GPS points collected do not give accurate location data. Several of the GPS location points are clustered together, even
though they were recorded at physical locations that were several tens of feet apart. In addition, the mean error between the measured distance location and the observed GPS location was 416.41 [in] (over 30ft).
Even after modifying our GPS collection methods, our data was still highly inconsistent and inaccurate. After this observation, we made the decision not to continue using GPS as a part of the spatial tracking system. We continued our analysis with the measured distance data.
4.3.3 Without Area_BoundingBox
Leave One Out
Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy for the chosen regression algorithms are shown below.
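Conceptually, this Leave One Out / "Hit or Miss" computation can be sketched as follows (a minimal sketch assuming scikit-learn; the 36-inch threshold comes from the evaluation described here, while the function and variable names are ours). The accuracy plots for the chosen algorithms follow.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def loo_hit_rate(X_pixels, Y_inches, degree=2, threshold=36.0):
    """Leave-one-out 'hit or miss' accuracy: a prediction counts as a hit if
    it lands within `threshold` inches of the measured ground-truth point."""
    hits, errors = 0, []
    for train_idx, test_idx in LeaveOneOut().split(X_pixels):
        model = make_pipeline(PolynomialFeatures(degree), ElasticNet())
        model.fit(X_pixels[train_idx], Y_inches[train_idx])
        err = np.linalg.norm(model.predict(X_pixels[test_idx]) - Y_inches[test_idx])
        errors.append(err)
        hits += int(err <= threshold)
    return hits / len(X_pixels), float(np.mean(errors))

# X_pixels: N x 2 array of (x_coordinate, y_coordinate) pixel values
# Y_inches: N x 2 array of measured (x, y) positions in inches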
Figure 4.11: Trial 2 Measured LOO w/out Area_BoundingBox Accuracy
Trial 2 Measured Leave One Out accuracy w/out Area_BoundingBox within 36.0 [in] of ground truth
The above plot shows that many of the regression algorithms perform similarly according to the 'Hit or Miss' accuracy method we developed. Although no algorithm performs ideally, Elastic-Net performs slightly better than the others. Using the best overall accuracy at the specified degree of polynomial regression for our 'Path Comparison' method yields the following plot for Elastic-Net as a degree-2 polynomial.
Figure 4.12: Trial 2 Measured LOO w/out Area_BoundingBox Path Approximation
Trial 2 Measured ElasticNet w/out Area_BoundingBox Path Approximation (Polynomial Degree 2 Regression)
Upon inspection of the above plot, one can discern a rough approximation of the path that was followed during the creation of the 'trial path.' Of course, there is a large amount of inaccuracy, but the path points are consistently aligned with the general direction of what is expected.
Figure 4.13: Trial 2 Measured LOO w/out Area_BoundingBox Moving Target
Trial 2 Measured ElasticNet w/out Area_BoundingBox Moving Target Evaluation (average error: 153.00 [in])
The Moving Target evaluation above shows that although the path points are consistently aligned with the general direction of what is expected, there is still much error at the ’’test point” locations.
As another assessment of the Leave One Out method, we calculated the mean error for all points collected as a part of the training configuration data. The results are shown below as a plot along with the mean error at each degree.
Figure 4.14: Trial 2 Measured LOO w/out Area_BoundingBox Error
Trial 2 Measured LOO w/out Area_BoundingBox mean error from ground truth (no bounding box area)
Again, many of the algorithms perform similarly. However, Ridge performs slightly better than the others. Using the best overall accuracy at the specified degree of polynomial regression for our 'Path Comparison' and 'Moving Target' methods yields the following plots for Ridge as a degree-2 polynomial.
Figure 4.15: Trial 2 Measured LOO w/out Area_BoundingBox Path Approximation
Trial 2 Measured Ridge w/out Area_BoundingBox Path Approximation (Polynomial Degree 2 Regression, no bounding box area)
Figure 4.16: Trial 2 Measured LOO w/out Area_BoundingBox Moving Target
Trial 2 Measured Ridge w/out Area_BoundingBox Moving Target Evaluation (average error: 153.21 [in])
Similar to the previous ElasticNet ’’Path Approximation” evaluation, there is a large amount of inaccuracy, but the path points are consistently aligned with the general direction of what is expected. Ridge performed similarly to ElasticNet for the ’Moving
Target’ assessment, just with slightly more error.
Train-to-Test
Using the ’Train-to-Test’ method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy for the chosen regression algorithms are shown below.
Figure 4.17: Trial 2 Measured Train2Test w/out Area_BoundingBox Accuracy
Trial 2 Measured Train-to-Test accuracy w/out Area_BoundingBox within 18.0 [in] of ground truth
The above plot shows that Linear Regression achieves almost 100% accuracy after fourth degree polynomial regression. In addition, there is a slight decrease in accuracy after second degree polynomial regression. The earlier 'Leave One Out' evaluation informed us that second degree polynomial regression consistently yields the best accuracy.
In order to evaluate this further, we calculated the mean error for all points collected as a part of the training configuration data. The results are shown below as a plot along with the mean error at each degree.
Figure 4.18: Trial 2 Measured T2T w/out Area_BoundingBox Error
Trial 2 Measured Train-to-Test w/out Area_BoundingBox mean error from ground truth (no bounding box area)
The above plot further shows that mean error for Linear Regression goes to almost zero for fifth degree polynomial regression. The plots of the ’’Train-to-Test” method for fifth degree polynomial regression are shown below.
Figure 4.19: Trial 2 Measured T2T w/out Area_BoundingBox Plot
Trial 2 Measured Train-to-Test w/out Area_BoundingBox plot of training data: actual vs. predicted soccer field locations (polynomial degree 5, Linear Regression)
We had a suspicion that the accuracy of this model was a result of over-fitting. We
conducted the ’Moving Target’ evaluation. The result is shown on the following plot.
Figure 4.20: Trial 2 Measured T2T w/out Area_BoundingBox Moving Target
Trial 2 Measured Train-to-Test w/out Area_BoundingBox Moving Target Evaluation (average error: approximately 3492 [in])
The above plot shows that the "Train-to-Test" method of evaluating model accuracy for the system is likely not a valuable addition. Although the model is able to correctly predict training points, over-fitting means that any new data introduced to the model will be highly inaccurate. We decided to discontinue using the 'Train-to-Test' method.
4.3.4 With Area_BoundingBox
We needed a method of solving the 'depth problem' outlined in Chapter 2. As there is a direct positive correlation between Area_BoundingBox and y_coordinate, we decided to use Area_BoundingBox as an additional predictive parameter.
Leave One Out
Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.
Figure 4.21: Trial 2 Measured LOO With Area_BoundingBox Accuracy
Trial 2 Measured Leave One Out accuracy with Area_BoundingBox within 36.0 [in] of ground truth
Figure 4.22: Trial 2 Measured LOO With Area_BoundingBox Error
Trial 2 Measured Leave One Out with Area_BoundingBox mean error from ground truth
Similar to the model constructed without Area_BoundingBox as a predictive parameter, Elastic-Net performs slightly better than the others. Using the best overall accuracy and lowest error at the specified degree of polynomial regression for our 'Path Comparison' and 'Moving Target' methods yields the following plots for Elastic-Net as a degree-2 polynomial.
Figure 4.23: Trial 2 Measured LOO With Area_BoundingBox Path Approximation
Trial 2 Measured ElasticNet With Area_BoundingBox Path Approximation (Polynomial Degree 2 Regression)
Figure 4.24: Trial 2 Measured LOO With Area_BoundingBox Moving Target
Trial 2 Measured ElasticNet With Area_BoundingBox Moving Target Evaluation (average error: 559.03 [in])
Although ElasticNet was predicted to be as accurate as the first portion of the trial without using the bounding box area as a predictive parameter, the plots above show that, similar to the issue with over-fitting, there is a definite decrease in performance
when introducing new data.
4.3.5 Analysis
Trial 2 taught several important lessons going forward with developing the system:
First, GPS (as it is available for this research) is an inaccurate tool for recording one’s physical location. However, given that ElasticNet was able to provide some degree of realistic predictive functionality, this is not to say that GPS absolutely should not be used. It would be interesting to be able to re-run this trial with a more accurate GPS tool.
Secondly, using the bounding box area as a predictive parameter yielded less than favorable results when introduced to new data. The likely explanation is that the bounding box area varies based on the pose a "person" is detected in; the bounding box for YOLOv3 encloses the total (x, y) pixel area of the detection and can vary greatly. In addition, another obvious flaw in using the bounding box area as a predictive parameter is running the model on different sized individuals after training. In such a case, the variation in bounding box size due to differences in human sizes would create inaccuracies in the spatial tracking position.
Thirdly, the 'Train-to-Test' method of assessing model accuracy or finding the lowest mean error should not be relied upon as an evaluation method. Attempting to find the best performer by either criterion will likely result in a model that is over-fit to the training data and not robust to new data points.
Finally, we were able to make progress on finding a model that would yield accurate results in a spatial tracking system. However, there is still a major concern related to accuracy. For the next trial, we decided to double the number of points collected to see how that would affect the accuracy of the eventual model.
4.4 Trial 3: Tennis Court
4.4.1 Overview
The third trial for the project involved, again, manually measuring distances for predetermined positions to collect configuration data for the geo-tracking system.
Figure 4.25: Trial 3 Camera Images
Object detection capture on tennis court
We utilized a tennis court in Denver, CO. Again, this location was chosen for its open space and relative lack of pedestrian foot traffic at that time of day. The other object detection categories for YOLOv3 were omitted from the collected data. In addition, care was taken to ensure that only one relevant object (person) was detected per capture.
We collected a total of 34 points between camera captured (x_coordinate, y_coordinate, Area_BoundingBox) and measured distances in inches (x, y).
The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded size of the bounding box. Notice that there is an inverse relationship between proximity to the camera and recorded yCOordinate of the object.
Figure 4.26: Trial 3 Camera Position Plot
x_coordinate and y_coordinate annotated with Area_BoundingBox
There is a direct correlation between Area_BoundingBox and y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera.
Figure 4.27: Trial 3 Camera Data Relationships
Relationship between y_coordinate and Area_BoundingBox
The measured distance positions are shown below.
Figure 4.28: Trial 3 Collected Measured Distance Location Points
Tennis court calculated positions
For this trial we used three methods of evaluating the accuracy of a model: Leave One Out, Path Approximation, and Moving Target. The trial path for Path Approximation is shown below.
4.4.2 Without AreaBoundingBox
Leave One Out
Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.
Figure 4.30: Trial 3 Measured LOO w/o Area_BoundingBox Accuracy
Trial 3 Measured Leave One Out accuracy w/o Area_BoundingBox within 36.0 [in] of ground truth
Figure 4.31: Trial 3 Measured LOO w/o Area_BoundingBox Error
Trial 3 Measured Leave One Out w/o Area_BoundingBox mean error from ground truth (no bounding box area)
Based on the results of assessing ’’Hit or Miss” accuracy and finding the lowest of the
mean error for the predicted points, ElasticNet, once again, was determined to be the ideal model to use. Path Comparison and Moving Target assessments for ElasticNet as a degree-3 polynomial are shown below.
Figure 4.32: Trial 3 Measured LOO w/o Area_BoundingBox Path Approximation
Trial 3 Measured ElasticNet w/out Area_BoundingBox Path Approximation (Degree 3 Regression)
Figure 4.33: Trial 3 Measured LOO w/o Area_BoundingBox Moving Target
Trial 3 Measured LOO w/o Area_BoundingBox Moving Target Evaluation (average error: 12.00 [in])
Both the ’’Moving Target” and ’’Path Approximation” methods yielded ideal results. The predicted points in the ’’Path Approximation” methods clearly follow the ’’trial
path.” The ’’trial points” in comparison to ground truth points are able to yield a mean error that is within 12.0 [in].
4.4.3 With Area_BoundingBox
Again, we needed a method of solving the 'depth problem' outlined in Chapter 2. As there is a direct positive correlation between Area_BoundingBox and y_coordinate, we decided to use Area_BoundingBox as an additional predictive parameter once again, since the inaccuracy in the last trial may have been a result of a lack of training data.
Leave One Out
Based on the results of Trial 2, we expected that using bounding box area as a predictive parameter would yield an inaccurate model that is not robust to new data. However, in the interest of testing, we decided to analyze the data. The results follow:
Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.
Figure 4.34: Trial 3 Measured LOO With Area_BoundingBox Accuracy
Trial 3 Measured Leave One Out accuracy with Area_BoundingBox within 18.0 [in] of ground truth
Figure 4.35: Trial 3 Measured LOO With Area_BoundingBox Error
Trial 3 Measured Leave One Out with Area_BoundingBox mean error from ground truth
Once again, ElasticNet shows the overall lowest mean error and highest accuracy according to our Hit or Miss method. The Path Approximation and Moving Target evaluations are shown below for degree-3 ElasticNet.
Figure 4.36: Trial 3 Measured LOO With Area_BoundingBox Path Approximation
Trial 3 Measured ElasticNet With Area_BoundingBox Path Approximation (Degree 3 Regression)
Figure 4.37: Trial 3 Measured LOO With Area_BoundingBox Moving Target
Trial 3 Measured LOO With Area_BoundingBox Moving Target Evaluation (average error: 44.53 [in])
As expected, the mean error including the bounding box area as a predictive parameter resulted in lower accuracy than simply using the center point.
4.4.4 Analysis
Trial 3 taught several important lessons going forward with developing the system:
First, the increase in the number of training points yields better overall accuracy and lower mean error after model training. For the next trial, an even greater number of points will be used during the configuration phase.
Second, although the bounding box area is an important parameter to consider (for instances where y_coordinate does not change drastically), using it as a predictive parameter only acts as a detriment to accuracy. Further exploration is needed to find a way to include the bounding box area in the system.
Third, this trial should be considered the first ’’successful” trial wherein the system could actually be deployed. We will move on to building an iteration of the PoC system with the degree-3 ElasticNet model that we have created.
4.5 Trial 4: Tennis Court 2
4.5.1 Overview
The fourth trial for the project involved, again, manually measuring distances for predetermined positions to collect configuration data for the geo-tracking system. We used this trial to prove repeat-ability and explore further ways of improving system accuracy.
Figure 4.38: Trial 4 Camera Image
Object detection capture on tennis court (inference time: 630.47 ms)
After the success with the previous tennis court, we decided to repeat the trial in a new location with more data points. We utilized a tennis court in Mequon, WI. The same considerations as in previous trials were taken with this trial.
We collected an increased total of 53 points between camera captured (x_coordinate, y_coordinate, Height_BoundingBox) and measured distances in inches (x, y).
The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded height of the bounding box. Notice that
there is an inverse relationship between proximity to the camera and recorded y_coordinate of the object.
Figure 4.39: Trial 4 Camera Position Plot
x_coordinate and y_coordinate annotated with Height_BoundingBox (greater y values correspond to closer proximity to the camera).
There is a direct correlation between Height_BoundingBox and y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera; therefore its height should be comparatively greater as well.
Figure 4.40: Trial 4 Camera Data Relationships
Relationships between x_coordinate and y_coordinate with Height_BoundingBox
The measured distance positions are shown below. The numbers show the order in which the images were taken.
Figure 4.41: Trial 4 Collected Measured Distance Location Points
Tennis court measured positions, numbered in the order the images were taken
Repeating the steps that gave success for the last trial, we used three methods of evaluating the accuracy of a model: Leave One Out, Path Approximation, and Moving Target. The trial path for Path Approximation is shown below.
Figure 4.42: Trial 4 Path
4.5.2 Without HeightBoundingBox
Leave One Out
Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.
Figure 4.43: Trial 4 Measured LOO w/o Height_BoundingBox Accuracy
Trial 4 Measured Leave One Out accuracy w/o Height_BoundingBox within 36.0 [in] of ground truth
Figure 4.44: Trial 4 Measured LOO w/o Height_BoundingBox Error
Trial 4 Measured Leave One Out w/o Height_BoundingBox mean error from ground truth (no bounding box height)
Based on the results of assessing ’’Hit or Miss” accuracy and finding the lowest of the mean error for the predicted points, Linear Regression and Lasso were determined to be the ideal models to use. ’’Path Comparison” and ’’Moving Target” assessments for the two models as degree-3 and degree-4 polynomials, respectively, are shown below.
Figure 4.45: Trial 4 Measured LOO w/o Height_BoundingBox Path Approximation and Moving Target 1
Trial 4 Measured Distance Linear Regression w/out Height_BoundingBox Path Approximation and Moving Target Evaluation (Degree 3 Regression)
Figure 4.46: Trial 4 Measured LOO w/o Height_BoundingBox Path Approximation and Moving Target 2
Trial 4 Measured Distance Lasso Regression w/out Height_BoundingBox Path Approximation and Moving Target Evaluation (Degree 4 Regression, average error: 35.45 [in])
Both the ’’Moving Target” and ’’Path Approximation” methods yielded results that were not quite as good as expected. Linear Regression performed the best with a mean error of 30[in]. However, from the results of Trial 3, we were expecting a mean error closer to 12 [in], especially considering we provided more training points.
4.5.3 With Height_BoundingBox
The results of previous trials showed that using Area_BoundingBox as a predictive parameter yielded poor accuracy. However, we still needed an alternate predictive parameter to counteract the "depth" issue for instances in which there is no change in the pixel y-coordinate of detected objects. After careful analysis of the data, we realized that, although Area_BoundingBox is positively correlated with the y-coordinate pixel location value, the real correlation is derived from Height_BoundingBox, since Area_BoundingBox = Height_BoundingBox × Width_BoundingBox.
Leave One Out
The results of accuracy as well as the mean error for the chosen regression algorithms using the Height BoundingBox are shown below.
Figure 4.47: Trial 4 Measured LOO With Height_BoundingBox Accuracy
Trial 4 Measured Leave One Out accuracy with Height_BoundingBox within 18.0 [in] of ground truth
Figure 4.48: Trial 4 Measured LOO With Height_BoundingBox Error
Trial 4 Measured Leave One Out with Height_BoundingBox mean error from ground truth
Based on the results of assessing ’’Hit or Miss” accuracy and finding the lowest of the mean error for the predicted points, Lasso and Ridge were determined to be the ideal models to use. ’’Path Comparison” and ’’Moving Target” assessments for Lasso and Ridge regressions as degree-3 polynomials using Height BoundingBox as a predictive parameter are shown below.
Figure 4.49: Trial 4 Measured LOO With Height_BoundingBox Path Approximation and Moving Target 1
Trial 4 Measured Distance Lasso Regression With Height_BoundingBox Path Approximation and Moving Target Evaluation (Degree 3 Regression, average error: 63.24 [in])
Figure 4.50: Trial 4 Measured LOO With Height_BoundingBox Path Approximation and Moving Target 2
Trial 4 Measured Distance Ridge Regression With Height_BoundingBox Path Approximation and Moving Target Evaluation (Degree 3 Regression)
Using Height_BoundingBox in addition to the (x, y) pixel coordinates as predictive parameters yields a result that is much better than using Area_BoundingBox. However, the qualitative accuracy obtained with both models is not as precise as just using the (x, y) pixel coordinates. In addition, there is an increase in mean error of approximately 30 [in] (100 [in] with Ridge Regression) with the "Moving Target" evaluation.
4.5.4 Height_BoundingBox Replacing Y-Coordinate Pixel Value
We still needed a way of solving the "depth" problem. We decided to replace the pixel y-coordinate value with Height_BoundingBox in the model. The results of accuracy as well as the mean error for the chosen regression algorithms, replacing the pixel y-coordinate of the detected object with Height_BoundingBox, are shown below.
Figure 4.51: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate Accuracy
Trial 4 Measured Leave One Out accuracy with Height_BoundingBox replacing the y-coordinate, within 18.0 [in] of ground truth
Figure 4.52: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate Error
Trial 4 Measured Leave One Out with Height_BoundingBox replacing the y-coordinate, mean error from ground truth
Based on the results of assessing ’’Hit or Miss” accuracy and finding the lowest of the mean error for the predicted points, Lasso and Elastic-Net were determined to be the ideal models to use. ’’Path Comparison” and ’’Moving Target” assessments for Lasso and Elastic-Net regressions as degree-3 polynomials using Height BoundingBox as a predictive
parameter replacing the Y-Coordinate pixel value are shown below.
Figure 4.53: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate, Path Approximation and Moving Target 1
Trial 4 Lasso Regression with Height_BoundingBox replacing the y-coordinate pixel value, Path Approximation and Moving Target Evaluation (Degree 3 Regression, average error: 38.03 [in])
Figure 4.54: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate, Path Approximation and Moving Target 2
Trial 4 ElasticNet Regression with Height_BoundingBox replacing the y-coordinate pixel value, Path Approximation and Moving Target Evaluation (Degree 3 Regression)
With both models, we found that the "Path Approximation" evaluation using x_coordinate and Height_BoundingBox as predictive parameters does not yield a predicted path that is quite as well aligned with the trial path as using just x_coordinate and y_coordinate. However, upon visual inspection, it is clear that the predicted path does follow the expected trial path, just with extra noise. In addition, the results of the "Moving Target" evaluation show that mean error is only slightly increased, by 8 [in] and 5 [in] respectively, compared with using just x_coordinate and y_coordinate as predictive parameters. The three feature configurations compared in this trial are summarized in the sketch below.
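To state those configurations concretely (a minimal sketch; the array and mode names are ours), the candidate design matrices differ only in which pixel attributes are stacked as columns:

import numpy as np

def build_features(x_pix, y_pix, h_box, mode):
    """Assemble the design matrix for one of the three Trial 4 configurations."""
    if mode == "xy":    # (x_coordinate, y_coordinate)
        return np.column_stack([x_pix, y_pix])
    if mode == "xyh":   # (x_coordinate, y_coordinate, Height_BoundingBox)
        return np.column_stack([x_pix, y_pix, h_box])
    if mode == "xh":    # (x_coordinate, Height_BoundingBox)
        return np.column_stack([x_pix, h_box])
    raise ValueError("unknown mode: %s" % mode)

# x_pix, y_pix, h_box are 1-D arrays of per-detection pixel attributes.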
4.5.5 Analysis
The primary goal of Trial 4 was to prove repeat-ability of the steps taken to build a model that would be able to translate pixel data into a real-world geo-location on our smart surveillance system. The secondary goal was to explore other methods of improving model accuracy for this translation process.
We succeeded in our primary goal. The success from Trial 3 was seen in Trial 4. We will use several of the models derived from this trial to build the PoC geo-tracking system.
Interestingly, our secondary goal was accomplished, but with a slight caveat: accuracy seemed to decrease from the previous trial. This was not expected, as our intuition led us to believe that more training points would lead to greater accuracy. On the other hand, we were able to develop a successful solution to our "depth" problem by utilizing Height_BoundingBox instead of Area_BoundingBox as a predictive parameter. Using all three attributes (x_coordinate, y_coordinate, and Height_BoundingBox) yields worse results than using the pair of x_coordinate and Height_BoundingBox; however, even that pair's accuracy is lower than the pair of x_coordinate and y_coordinate.
5 Making the Geo-Tracking System
The PoC design of the user interface portion of the system took into consideration the necessities required by the surveillance operators described in Chapter 1. Therefore, from the beginning, we decided to design the system to have both an "instantaneous" geo-tracking capability and a "historical" geo-tracking capability.
A system such as this is easily monetized, and its source code is likely proprietary, which is why our initial search yielded very little in the way of a dynamic tool that would be able to plot our desired geo-location points as we needed. So, we decided to build the system ourselves. We built the system in steps, slowly incorporating additional functionality and modifications with each iteration.
The surveillance system, itself, was built before the process of pixel-to-real mapping
was developed. This was with the assumption that GPS would be a viable mapping
option for the final system.
First Iteration
Figure 5.1: Spatial Tracking Surveillance System First Iteration
Visualization 1 (dynamic, constant motion view) and Visualization 2 (history)
The first iteration of the system used a simple plotting library to visualize the two important concepts of our smart surveillance system: instantaneous and historical positioning. We displayed several target points randomly moving to new locations within each one’s respective immediate surrounding every 0.5[s]. This represented the movement of several target objects. We maintained the history of each point’s previous movements within the scene. Putting these two plots together allowed us to visualize a rudimentary smart surveillance system that is both able to track a target’s current position and display a history of past movement.
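A minimal sketch of that simulated motion (assuming a handful of targets; the step size and names are ours) could look like the following:

import random
import time

# Current (x, y) position of each simulated target and its movement history.
targets = {i: [random.uniform(0, 100), random.uniform(0, 100)] for i in range(3)}
history = {i: [] for i in targets}

def step(max_move=5.0):
    """Move each target a small random amount and record its trail."""
    for i, (x, y) in targets.items():
        targets[i] = [x + random.uniform(-max_move, max_move),
                      y + random.uniform(-max_move, max_move)]
        history[i].append(tuple(targets[i]))

# Update every 0.5 s, as in the first iteration of the interface:
# while True:
#     step()
#     time.sleep(0.5)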
Second Iteration
Figure 5.2: Spatial Tracking Surveillance System Second Iteration
Geo-tracking view rendered over a Google Maps basemap of downtown Denver
The second iteration moved from concept to a real user interface. We decided to use the Python "Bokeh" library for our visualization purposes. Built into the library is a tool that allows a plot to be rendered as HTML code; thus, we are able to visualize the plot in an internet browser.
As a design choice, we decided to make the back-end data-handling portion of the system an HTTP server. Using an HTTP server, the system idly waits for new data to
be sent from the client surveillance cameras. The server accepts POST requests in the JSON form of:

{
    x: x_coor,
    y: y_coor
}
The POST request is created on the client side camera (See Chapter 2 Procedure). The server appends this incoming data to the data structure containing the position information, or creates the data structure if it does not already exist. AJAX is used to asynchronously update the data to allow access by the user interface portion of the server without disrupting the page, itself. Every 0.1 [s], this process checks whether new data has been sent, and, if so, fetches the new data. At the same time, the user interface portion of the server maintains variables containing both the historical and current position data. If new data has been fetched, both of these variables are updated.
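On the browser side, this polling behavior maps naturally onto Bokeh's AjaxDataSource (a minimal sketch, not the exact production configuration; the URL matches the /data route in the appendix code, and the column names are assumptions):

from bokeh.models import AjaxDataSource
from bokeh.plotting import figure

# Poll the server every 100 ms and append whatever new points arrive.
source = AjaxDataSource(data_url="http://localhost:5000/data",
                        polling_interval=100, mode="append")

p = figure(title="Instantaneous and historical positions")
p.circle(x="x", y="y", source=source)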
A structural diagram of the architecture is shown below:
Figure 5.3: User Interface Server Architecture Diagram
Using the unique ability of Bokeh plots to be rendered as HTML, we recreate the plots with the new data and seamlessly reload the HTML being displayed in the browser. The Google Maps API is used to show a real-world location via latitude and
longitude coordinates.
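For the map-backed view, Bokeh's gmap figure wraps the Google Maps API (a minimal sketch; the API key, coordinates, and zoom level below are placeholders of ours):

from bokeh.models import ColumnDataSource, GMapOptions
from bokeh.plotting import gmap

# Placeholder coordinates near downtown Denver; "GOOGLE_API_KEY" stands in
# for a real Google Maps API key.
map_options = GMapOptions(lat=39.7392, lng=-104.9903, map_type="roadmap", zoom=16)
plot = gmap("GOOGLE_API_KEY", map_options, title="Target positions")

positions = ColumnDataSource(data=dict(lat=[39.7392], lon=[-104.9903]))
plot.circle(x="lon", y="lat", size=8, source=positions)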
As mentioned previously, this system was created before the process of Chapter 2 was complete. We were unaware that GPS would not work as a viable mapping method. So, we created additional functionality that would allow us to interactively visualize the error of the model versus ground truth GPS data, both instantaneous and cumulative. This is shown below.
Figure 5.4: Spatial Tracking Surveillance System Second Iteration Error Visualization
Instantaneous error and cumulative error displayed alongside the map view
Final Iteration
Figure 5.5: Spatial Tracking Surveillance System Final Iteration
The final iteration of the smart surveillance PoC geo-tracking system was created after completing experimentation conducted in Chapter 2. Using what was already created, we stripped out the Google Maps API functionality from the user interface. Returning to the
first iteration of the system, we decided to display instantaneous and historical positions separately. Additionally, as a design choice, we decided to show the visual camera feed for coordination with the geo-tracking system. This way, a surveillance operator would not only be able to easily track a target’s position, but also be able to assess his/her behavior.
The error visualization portion was removed as it is impossible to accomplish without being able to check against an associated real-time ground truth.
6 Discussion
6.1 Roadblocks and Issues
Ultimately, we were able to create a functional geo-tracking system that utilizes the basic components of most ’’off the shelf’ object detection technologies. However, along the way, there were several roadblocks that led to less accurate results and unusable data/methods, forcing adaptation. This section will discuss some of the issues encountered throughout the process.
GPS
Originally, the system was intended to work with GPS coordinates. We would sample a target standing at a GPS location and, at the same time, record the target's position in the scene as (x, y) pixel values and a bounding box measure (Height_BoundingBox or Area_BoundingBox).
However, we quickly discovered that not only was the recorded GPS data extremely inaccurate, but the readily available GPS location provided by a cell phone is not updated in real-time. As a result, we devised the scheme of utilizing measured or easily calculated position points as a ground truth on which to train our regression models.
Real Time
The original concept also proposed using real time data collection to determine accuracy and error, as well as build a model.
To build a model, we would constantly take GPS and object detection data, querying an NTP server to sync time stamps between the two. Using this method, we would have hundreds of points from which to build a model. Accuracy and error would be determined in a similar way: The predicted points would be evaluated against the ground truth real-
time GPS data. However, as a result of the non real-time GPS issue, these both became impossible.
Measured Distances
Since the GPS data was not reliable, we were forced to use the ’’measured distance” approach. Unfortunately, this approach greatly limited the number of accurate data points we could collect because we always needed some constant reference distance. This is why we used the soccer fields and tennis courts.
Multiple Cameras
After the success of Trial 3, it was suggested that we utilize multiple cameras on the same scene to boost accuracy of the system. So, we procured an additional camera and modified the data collection phase to account for multiple camera inputs. Unfortunately, the additional camera was not suited for the experiments. It failed to capture consistent images and video data. Object detection on the data from that camera was almost impossible.
6.2 Missed Opportunities and Future Work
The process of conducting this research and writing this thesis has taught us that a good project never really has a point where it is ’’complete.” What was given in this thesis was ’’laying the groundwork.” It provides the basis from which to build bigger, better, and more accurate systems. Given more resources and time, we could expand this project to much more than what it is. In addition to remarking on some of the missed opportunities, this section will discuss some of the adaptations, modifications, and improvements that can be built on top of what has already been accomplished. Many of the ideas described here can be considered independent research projects, by themselves.
GPS
In the previous section, we explained the issue with using GPS. However, this does not mean that GPS technology is completely unusable. We’ve proven that a system can be
created with accurate and consistent ground-truth data. If a dedicated and accurate GPS could be used during the training and evaluation phases, we believe a much more accurate and easily-configurable system could be created.
Multiple Cameras
Again, we have already discussed the missed opportunity of being able to build a system that uses multiple cameras. Future work would entail running the experiments and system on camera hardware that is identical.
Stitching together multiple scenes
Of all of the potential future works, this is the one we would be the most excited to design and build. The whole premise of the project is a ’’smart surveillance system.” The implementation that we have built serves that purpose, but, currently, only for a single ’’scene.” For a usable and robust surveillance system, we would need multiple overlapping cameras that would be able to survey multiple connected scenes, in their entirety. For example, the security cameras of a park.
In theory, each camera client would have a specific tag which would be used to map the already collected locations (cam data to geo-location) data to a specific region in the park by the server. Of course, if GPS was an option, the re-mapping would be unnecessary.
Different hardware
Attaching a device dedicated to neural network calculations to each camera would greatly improve speed. Running the current implementation on a basic laptop yields an inference time of approximately 0.65 [s] on average. This means that the frame rate is not nearly high enough for true real-time object detection. To reduce run time and approximate achievable real-time inference, we lowered the frame rate of the captured video used in the PoC.
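As a back-of-the-envelope illustration of that trade-off (the 30 fps capture rate here is an assumption, not a measured value):

import math

inference_time_s = 0.65   # average per-frame inference time on the laptop
source_fps = 30           # assumed capture frame rate

# Process only every k-th frame so inference can keep pace with capture.
frames_to_skip = math.ceil(inference_time_s * source_fps)   # 20 frames
effective_fps = source_fps / frames_to_skip                 # 1.5 fps
print(frames_to_skip, effective_fps)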
Target Tracking
As mentioned in Chapter 1, this research focused on the geo-tracking aspect of the smart-surveillance system. However, for a true smart surveillance system, the target tracking portion would need to be implemented, as well. It would be necessary to be able to isolate a target object from background objects for the complete system to function as envisioned.
Different Targets
In creating the model for each trial, we only used one target with static dimensions, meaning the model was only trained on this data. Intuitively, this means that introducing another target with much different dimensions (much taller, for example) will output different results. Whether or not this yields a significant impact on model accuracy is unknown, however.
In addition, for each of our trials, the training data and model all involved the target standing straight up, i.e., no crouching, jumping, standing on an object, etc. Again, all of these aspects would affect the model and its subsequent accuracy.
These both need to be explored.
Tuning Model Accuracy
One of the main issues with the experiments conducted during the creation of this system was the lack of dedicated hardware and a work space. So, we focused our efforts on building the system itself, through proving the repeat-ability of the process and the PoC of the final smart surveillance system. We chose a regression model for pixel-to-real translation based on the highest accuracy and lowest mean error from the LOO method. However, this is not to say that the accuracy within the model itself was as tuned and accurate as it could possibly be. Being able to tune accuracy requires repeated trials with the same configuration in the same static camera location. This was not possible given our circumstances.
Posterior Smoothing
One of the notable traits encountered in even our best models from Trial 3 and Trials 4 was the noise that was a part of our predicted data, meaning the points tended to ’’jump
around” erratically. This can clearly be seen on many of the ’’Path Approximation” evaluations. The noise is a result of the bounding box of the detected target constantly shifting in size.
The whole object detection technique is centered around the bounding box of the detected target. The x_coordinate, y_coordinate, and Height_BoundingBox are all derived from the bounding box. The model is trained on a small sample of data: one set of static pixel data (bounding box location and dimensions) is captured at each designated location. With the bounding box constantly shifting in size, the pixel-to-real translation will give differing results, even for a target in the same approximate location. The chosen models from our trials performed admirably and allowed for the creation of the PoC. However, there was still a great deal of obvious discrepancy between the trial path and the predicted path due to noise.
One hypothesis on how to correct this is to make the bounding box more consistent throughout the target detection process. However, this may over-complicate the detection process, further exacerbating our ’’depth” problem.
Another hypothesis is related to ’’posterior smoothing.” Posterior smoothing, as we envision it, is predicting the placement of a point based on some combination of xcoordinate, y coordinate, Height Bounding Box, and the prior data from the previous several points.
Point_Real = f(x_pixel, y_pixel, Height_BoundingBox) + f(Point_Real-1, Point_Real-2)
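A minimal sketch of that idea, assuming a simple weighted blend (the weights and function name are ours and would need tuning against real data):

def smooth_point(raw_pred, prev1, prev2, w_raw=0.6, w1=0.3, w2=0.1):
    """Blend the current regression output with the two previous real-world
    points to damp frame-to-frame jitter in the predicted position."""
    x = w_raw * raw_pred[0] + w1 * prev1[0] + w2 * prev2[0]
    y = w_raw * raw_pred[1] + w1 * prev1[1] + w2 * prev2[1]
    return (x, y)

# Example: the regression says (400, 950), but the last two smoothed points
# were (390, 940) and (388, 936); the blended output moves less abruptly.
# smooth_point((400, 950), (390, 940), (388, 936))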


Bibliography
Cited works
[2] Antonio Brunetti et al. “Computer vision and deep learning techniques for pedestrian detection and tracking: A survey”. English. In: Neurocomputing 300 (2018), pp. 17-33.
[4] Jorge Fuentes-Pacheco, Jose Ruiz-Ascencio, and Juan M. Rendon-Mancha. “Visual simultaneous localization and mapping: a survey”. English. In: Artificial Intelligence Review 43.1 (2015), pp. 55-81.
[5] Helen M. Hodgetts et al. “See No Evil: Cognitive Challenges of Security Surveillance and Monitoring”. In: Journal of Applied Research in Memory and Cognition 6.3 (2017), pp. 230-243. ISSN: 2211-3681. DOI: 10.1016/j.jarmac.2017.05.001. URL: http://www.sciencedirect.com/science/article/pii/S2211368117300207.
[6] Antonio C. Nazare Jr. and William Robson Schwartz. “A scalable and flexible framework for smart video surveillance”. In: Computer Vision and Image Understanding 144 (2016). Individual and Group Activities in Video Event Analysis, pp. 258-275. ISSN: 1077-3142. DOI: 10.1016/j.cviu.2015.10.014. URL: http://www.sciencedirect.com/science/article/pii/S1077314215002349.
[8] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Improvement”. English. In: (2018).
[9] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. English. In: (2015).


[10] Young-Woo Seo and Myung Hwangbo. “A Computer Vision System for Lateral Localization”. In: Journal of Field Robotics 32.7 (), pp. 1004-1014. DOI: 10.1002/rob.21576. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/rob.21576. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/rob.21576.
[11] Shubham Shinde, Ashwin Kothari, and Vikram Gupta. “YOLO based Human Action Recognition and Localization”. In: Procedia Computer Science 133 (2018). International Conference on Robotics and Smart Manufacturing (RoSMa2018), pp. 831-838. ISSN: 1877-0509. DOI: 10.1016/j.procs.2018.07.112. URL: http://www.sciencedirect.com/science/article/pii/S1877050918310652.
[13] A. W. M. Smeulders et al. “Visual Tracking: An Experimental Survey”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.7 (2014), pp. 1442-1468. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2013.230.
[14] Carlos Sanchez et al. “Localization and tracking in known large environments using portable real-time 3D sensors”. English. In: Computer Vision and Image Understanding 149 (2016), pp. 197-208.
Supplementary works
[1] Maha M. Azab, Howida A. Shedeed, and Ashraf S. Hussein. “New technique for online object tracking-by-detection in video”. English. In: IET Image Processing 8.12 (2014), pp. 794-803.
[3] Moonsub Byeon et al. “Unified optimization framework for localization and tracking of multiple targets with multiple cameras”. English. In: Computer Vision and Image Understanding 166 (2018), pp. 51-65.
[7] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. English. In: (2016).
[12] R. L. Simpson. “Computer vision: an overview”. In: IEEE Expert 6.4 (1991), pp. 11-15. ISSN: 0885-9000. DOI: 10.1109/64.85917.


Appendix
Experiments
The experimental data and analysis for each trial can be found here: GITHUB LINK
The above figure shows the experimental process used for each trial in the configuration of the client cameras. The figure below gives a summary of the experimental analysis for each of the four trials.




Code
The following code was used to collect the GPS data on a Samsung Galaxy S5 running QPython:

import time
import androidhelper as android

# Resume the sample count from any readings already saved on the phone
count = 0
with open("/storage/emulated/0/qpython/test.csv", "r") as f:
    for line in f:
        count += 1

def write_coordinates(droid):
    global count
    droid.eventClearBuffer()
    droid.makeToast("Reading GPS data")
    droid.startLocating()
    print("reading GPS ...")
    time.sleep(1)
    # Poll until the location provider returns a fresh, non-empty reading
    loc = droid.readLocation().result
    while loc == {}:
        loc = droid.readLocation().result
    print(loc)
    if loc is not None:
        coor = ""
        if loc.get('network'):
            coor += "network,%s,%s" % (loc['network']['latitude'],
                                       loc['network']['longitude'])
        if loc.get('gps'):
            coor += " gps,%s,%s" % (loc['gps']['latitude'],
                                    loc['gps']['longitude'])
        coor += "\n"
        # Append the reading to the CSV on the phone's storage
        f = open("/storage/emulated/0/qpython/test.csv", "a")
        f.write(coor)
        f.close()
        print(count)
        print(coor)
        count += 1
    droid.stopLocating()

if __name__ == '__main__':
    droid = android.Android()
    inp = "y"
    while inp != "q":
        write_coordinates(droid)
        inp = input("Next?")
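Each confirmed position appends one row of the form network,<lat>,<lon> gps,<lat>,<lon> to test.csv on the phone, and entering q at the Next? prompt ends the collection session.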


The following code runs on the user-interface server:

from datetime import timedelta
from functools import update_wrapper, wraps

from flask import Flask, jsonify, make_response, request, current_app
from jinja2 import Template
from six import string_types

from bokeh.models import AjaxDataSource, ColumnDataSource, Range1d
from bokeh.plotting import figure
from bokeh.layouts import row
from bokeh.embed import components
from bokeh.resources import INLINE
from bokeh.util.string import encode_utf8

app = Flask(__name__)

# Global Variables
source = ColumnDataSource({'x_GPS': [], 'y_GPS': [], 'x_CAM': [], 'y_CAM': []})
xl = []
yl = []

### DON'T TOUCH THIS ---------------------------------------------------------
def crossdomain(origin=None, methods=None, headers=None, max_age=21600,
                attach_to_all=True, automatic_options=True):
    """Decorator to set crossdomain configuration on a Flask view.
    For more details about it refer to: http://flask.pocoo.org/snippets/56/"""
    if methods is not None:
        methods = ', '.join(sorted(x.upper() for x in methods))
    if headers is not None and not isinstance(headers, string_types):
        headers = ', '.join(x.upper() for x in headers)
    if not isinstance(origin, string_types):
        origin = ', '.join(origin)
    if isinstance(max_age, timedelta):
        max_age = max_age.total_seconds()

    def get_methods():
        if methods is not None:
            return methods
        options_resp = current_app.make_default_options_response()
        return options_resp.headers['allow']

    def decorator(f):
        @wraps(f)
        def wrapped_function(*args, **kwargs):
            if automatic_options and request.method == 'OPTIONS':
                resp = current_app.make_default_options_response()
            else:
                resp = make_response(f(*args, **kwargs))
            if not attach_to_all and request.method != 'OPTIONS':
                return resp

            h = resp.headers
            h['Access-Control-Allow-Origin'] = origin
            h['Access-Control-Allow-Methods'] = get_methods()
            h['Access-Control-Max-Age'] = str(max_age)
            requested_headers = request.headers.get(
                'Access-Control-Request-Headers'
            )
            if headers is not None:
                h['Access-Control-Allow-Headers'] = headers
            elif requested_headers:
                h['Access-Control-Allow-Headers'] = requested_headers
            return resp

        f.provide_automatic_options = False
        return update_wrapper(wrapped_function, f)
    return decorator
### /DON'T TOUCH THIS --------------------------------------------------------

@app.route('/data', methods=['POST', 'GET', 'OPTIONS'])
@crossdomain(origin="*", methods=['GET', 'POST'], headers=None)
def update():
    """Client cameras POST predicted (x, y) positions; GET polls return the history."""
    global xl, yl
    if request.method == 'POST':
        x = request.form.get("x")
        y = request.form.get("y")
        if x is not None and y is not None:
            xl.append(float(x))
            yl.append(float(y))
    # Guard against an empty history on the first poll
    x_inst = [xl[-1]] if xl else []
    y_inst = [yl[-1]] if yl else []
    return jsonify(x_pos=xl[:], y_pos=yl[:], x_inst=x_inst, y_inst=y_inst)

template = Template('''
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Tennis Court Camera</title>
{{ js_resources }}
{{ css_resources }}
</head>
<body>
{{ plot_div }}
{{ plot_script }}
</body>
</html>
''')

@app.route("/", methods=['POST', 'GET'])
def user_interface():
    global xl, yl
    # The history source appends every reported position; the instantaneous
    # source keeps only the most recent one.
    source = AjaxDataSource(data_url="http://localhost:5000/data",
                            polling_interval=200, mode='append',
                            if_modified=True, max_size=30)
    inst_source = AjaxDataSource(data_url="http://localhost:5000/data",
                                 polling_interval=150, mode='replace')
    source.data = dict(x_pos=[], y_pos=[], x_inst=[], y_inst=[])

    left, right, bottom, top = 0, 420, 0, 1000

    fig1 = figure(plot_width=600, plot_height=600, title="History")
    fig1.x_range = Range1d(left, right)
    fig1.y_range = Range1d(bottom, top)
    fig1.axis.visible = False
    fig1.toolbar.logo = None
    fig1.toolbar_location = None
    fig1.circle('x_pos', 'y_pos', fill_color='red', fill_alpha=0.7,
                source=source, size=15)

    fig2 = figure(plot_width=600, plot_height=600, title="Current Position")
    fig2.x_range = Range1d(left, right)
    fig2.y_range = Range1d(bottom, top)
    fig2.axis.visible = False
    fig2.toolbar.logo = None
    fig2.toolbar_location = None
    fig2.circle('x_inst', 'y_inst', fill_color='blue', fill_alpha=0.7,
                source=inst_source, size=15)

    js_resources = INLINE.render_js()
    css_resources = INLINE.render_css()
    script, div = components(row(fig1, fig2))
    html = template.render(plot_script=script, plot_div=div,
                           js_resources=js_resources,
                           css_resources=css_resources)
    return encode_utf8(html)

app.run()
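As a quick sanity check, the /data endpoint can be exercised from any HTTP client before wiring up the cameras. The snippet below is only an illustrative test and assumes the server above is running locally on port 5000.

import requests

# Report one predicted real-world position (in inches), then poll the endpoint
# the same way the browser-side AjaxDataSource does.
requests.post("http://localhost:5000/data", data={"x": 210.0, "y": 480.0})
print(requests.get("http://localhost:5000/data").json())
# e.g. {'x_inst': [210.0], 'x_pos': [210.0], 'y_inst': [480.0], 'y_pos': [480.0]}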


The following code was added to the YOLOv3 'object-detection.py' source code by Joseph Redmon so that the camera client can obtain pixel data, translate it to a real-world position, and transmit it to the user-interface server (server address: localhost:5000):

import requests
from joblib import load

# Load the fitted pixel-to-real regression model
model = load('cam_regr_model.joblib')

# In the postprocess() function, after determining the bounding box location and
# dimensions (left, top, width, height, and classIds[i] are set by the surrounding code):
if classIds[i] == 0:  # class 0 is the 'person' class
    # Obtain the center point and bounding box height of the detection
    x, y = (left + (left + width)) / 2, (top + (top + height)) / 2
    # Translate pixel-to-real: (x, height) -> real-world location
    # (predict() expects a 2D array; (x, y) is used instead for an elevated camera)
    X_real, Y_real = model.predict([[x, height]])[0]
    # Send a POST request to the user-interface server
    requests.post("http://localhost:5000/data", data={'x': X_real, 'y': Y_real})
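For completeness, the 'cam_regr_model.joblib' file loaded above is produced during the configuration step (Section 3.3.4) by exporting the chosen regression model with joblib. A minimal sketch of that export, using placeholder training arrays rather than the trial data, looks like the following.

import numpy as np
from joblib import dump
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Placeholder configuration data: (x_pixel, bounding-box height) -> measured (x, y) in inches
X_cam = np.array([[120, 85], [300, 60], [450, 40], [520, 95]])
Y_real = np.array([[48, 120], [180, 300], [330, 540], [400, 96]])

best_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_cam, Y_real)
dump(best_model, 'cam_regr_model.joblib')  # this is the file the camera client loads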


Full Text

PAGE 1

SmartSurveillance:SpatialTracking WithComputerVisionandMachine Learning CharlesBollig Fall2018

PAGE 2

Copyright Copyright2018.AllRightsReserved.

PAGE 3

Approval SmartSurveillance:SpatialTrackingWithComputerVisionand MachineLearning Author:CharlesBollig Approved: DanConnors,PhD ThesisAdvisor ChaoLiu,PhD CommitteeMember AlirezaVahid,PhD CommitteeMember

PAGE 4

Abstract Historically,videosurveillancehasheavilyreliedonapersonorpeoplesittingbehind anarrayofmonitors,alertforanykindofsuspiciousbehavior.However,thismethod ofmonitoringrequiresconstantvigilance,stamina,andexactingattentiontodetailto beeectiveasameansofsecurity.Deeplearninghasmade"outofthebox"computer visiontechnologiesaccessibletothosewithlittlemorethanabasiccomputersciencebackground.Thishasallowedengineersandscientiststoadapttheseeasilyaccessiblecomputervisionalongwithsimplemachinelearningtoolsintosmartvisualsurveillance" systems.Thisresearchexplorestheviabilityofaspatialtrackingsurveillancesystem utilizingfundamentalcomponentsoftheseextremelyaccessibletools.Usinggeneralized linearregressionmodels,wetranslateobject-detectioncamerapositionsintoreal-world locations.Inconguringthespatialtrackingsystemandtrainingtheregressionmodels, weexploreseveralmethodsofrealworldmapping,includingGPSandmanuallymeasured distances.Suitabilityofthemodelsareassessedusingfourevaluationmethods:'Leave OneOut,''Train-to-Test,''PathApproximation,'and'MovingTarget.'Severaltrialsin dierentlocationsareconductedtoassessvaryingconditions,aswellasthedevelopment andrepeat-abilityofthecongurationprocess.Inaddition,wedesignandbuildawebbasedapplicationasaproofofconceptfortheuser-facingspatialtrackingsurveillance system.

PAGE 5

Acknowledgements CourtneyMattsonThankyouforallofyoursupport.Ineverwouldhavebeenabletocomplete thiswithoutyourpatienceasanassistantandgeniusasascientist. Youarearock.Ihopeyougetallthegrants! AlexMattsonThanksforvolunteeringyourtime!

PAGE 6

Contents 1Purpose1 2Background2 2.1TraditionalSurveillance...........................2 2.2ComputerVision'sRole...........................3 2.3MappingtoRealSpace............................5 2.4RealisticSituation..............................7 2.5ComputerVision...............................8 2.5.1'OutoftheBox'............................8 2.5.2Yolov3................................9 2.6RelatedResearch...............................10 2.6.1SLAM.................................10 2.7SpatialTrackingSmartSurveillanceSystem................12 2.7.1Denition...............................12 2.7.2Requirements.............................12 3Methods16 3.1Tools......................................16 3.1.1Software................................16 3.1.2Hardware...............................20 3.2ProcedureOverview.............................20 3.3Client.....................................20 3.3.1ObtainTrainingData........................21 3.3.2BuildModels.............................21 3.3.3DetermineBestModel........................21 3.3.4ExportModelandDeployonCamera................24 3.4Server.....................................24 4Data/Analysis25 4.1Synthesis....................................25 4.2Trial1:FieldinStapletonDenver,CO..................27 4.2.1Overview...............................27 4.2.2Analysis................................30 4.3Trial2:SoccerField.............................30 4.3.1Overview...............................30 4.3.2GPS..................................33 4.3.3w/outArea BoundingBox ........................34 4.3.4WithArea BoundingBox .........................40 4.3.5Analysis................................43

PAGE 7

4.4Trial3:TennisCourt.............................44 4.4.1Overview...............................44 4.4.2WithoutArea BoundingBox .......................47 4.4.3WithArea BoundingBox .........................49 4.4.4Analysis................................51 4.5Trial4:TennisCourt2............................52 4.5.1Overview...............................52 4.5.2WithoutHeight BoundingBox ......................55 4.5.3WithHeight BoundingBox ........................57 4.5.4Height BoundingBox ReplacingY-CoordinatePixelValue.......59 4.5.5Analysis................................62 5MakingtheGeo-TrackingSystem63 6Discussion68 6.1RoadblocksandIssues............................68 6.2MissedOpportunitiesandFutureWork...................69

PAGE 8

ListofFigures 1.1TranslatingPixel-to-Real...........................1 2.1YoloNeuralNetFramework[9].......................9 2.2SLAMExampleMapping[14]........................11 2.3System"Depth"Problem..........................13 2.4System"LensDistortion"Problem.....................14 4.1Trial1CameraImages............................28 4.2Trial1CameraPositionPlot.........................28 4.3Trial1CameraDataRelationships.....................29 4.4Trial1GPSandNetworkPositionLocations................29 4.5Trial2CameraImages............................30 4.6Trial2CameraPositionPlot.........................31 4.7Trial2CameraDataRelationships.....................32 4.8Trial2CollectedGPSandMeasuredDistanceLocationPoints......32 4.9Trial2Path..................................33 4.10Trial2GPScomparison...........................33 4.11Trial2MeasuredLOOw/outArea BoundingBox Accuracy..........34 4.12Trial2MeasuredLOOw/outArea BoundingBox PathApprox........35 4.13Trial2MeasuredLOOw/outArea BoundingBox MovingTarget.......35 4.14Trial2MeasuredLOOw/outArea BoundingBox Error............36 4.15Trial2MeasuredLOOw/outArea BoundingBox PathApprox........37 4.16Trial2MeasuredLOOw/outArea BoundingBox MovingTarget.......37 4.17Trial2MeasuredTrain2Testw/outArea BoundingBox Accuracy.......38 4.18Trial2MeasuredT2Tw/outArea BoundingBox Error............39 4.19Trial2MeasuredT2Tw/outArea BoundingBox Plot.............39 4.20Trial2MeasuredT2Tw/outArea BoundingBox MovingTarget.......40 4.21Trial2MeasuredLOOWithArea BoundingBox Accuracy..........41 4.22Trial2MeasuredLOOWithArea BoundingBox Error............41 4.23Trial2MeasuredLOOWithArea BoundingBox PathApprox.........42 4.24Trial2MeasuredLOOWithArea BoundingBox MovingTarget.......42 4.25Trial3CameraImages............................44 4.26Trial3CameraPositionPlot.........................45 4.27Trial3CameraDataRelationships.....................45 4.28Trial3Collectedmeasureddistancelocationpoints............46 4.29Trial3Path..................................46 4.30Trial3MeasuredLOOw/oArea BoundingBox Accuracy...........47 4.31Trial3MeasuredLOOw/oArea BoundingBox Error.............47 4.32Trial3MeasuredLOOw/oArea BoundingBox PathApprox..........48 4.33Trial3MeasuredLOOw/oArea BoundingBox MovingTarget........48 4.34Trial3MeasuredLOOWithArea BoundingBox Accuracy..........49

PAGE 9

4.35Trial3MeasuredLOOWithArea BoundingBox Error............50 4.36Trial3MeasuredLOOWithArea BoundingBox PathApprox.........50 4.37Trial3MeasuredLOOWithArea BoundingBox MovingTarget.......51 4.38Trial4CameraImage............................52 4.39Trial4CameraPositionPlot.........................53 4.40Trial4CameraDataRelationships.....................54 4.41Trial4Collectedmeasureddistancelocationpoints............54 4.42Trial4Path..................................55 4.43Trial4MeasuredLOOw/oHeight BoundingBox Accuracy..........55 4.44Trial3MeasuredLOOw/oHeight BoundingBox Error............56 4.45Trial4MeasuredLOOw/oHeight BoundingBox PathApprox.andMoving Target1....................................56 4.46Trial4MeasuredLOOw/oHeight BoundingBox PathApprox.andMoving Target2....................................57 4.47Trial4MeasuredLOOWithHeight BoundingBox Accuracy.........58 4.48Trial4MeasuredLOOWithHeight BoundingBox Error...........58 4.49Trial4MeasuredLOOWithHeight BoundingBox PathApprox.andMoving Target1....................................59 4.50Trial4MeasuredLOOWithHeight BoundingBox PathApprox.andMoving Target2....................................59 4.51Trial4MeasuredLOOWithHeight BoundingBox ReplacingY-Coordinate Accuracy....................................60 4.52Trial4MeasuredLOOWithHeight BoundingBox ReplacingY-Coordinate Error......................................60 4.53Trial4MeasuredLOOWithHeight BoundingBox PathApprox.andMoving Target1....................................61 4.54Trial4MeasuredLOOWithHeight BoundingBox PathApprox.andMoving Target2....................................61 5.1SpatialTrackingSurveillanceSystemFirstIteration............63 5.2SpatialTrackingSurveillanceSystemSecondIteration...........64 5.3UserInterfaceServerArchitectureDiagram.................65 5.4SpatialTrackingSurveillanceSystemSecondIterationErrorVisualization66 5.5SpatialTrackingSurveillanceSystemFinalIteration............66

PAGE 10

ListofAbbreviations PoC-ProofofConcept CNN-ConvolutionalNeuralNetwork DL-DeepLearning YOLO-YouOnlyLookOnce SGD-StochasticGradientDescent SLAM-SimultaneousLocalizationandMapping OpenCV-OpenSourceComputerVisionLibrary GPS-GlobalPositioningSystem NTP-NetworkTimeProtocol AJAX-AsynchronousJavascriptandHTML

PAGE 11

1Purpose Thegoalofthisprojectwastoinvestigateamethodoftranslatingatarget'spixelposition onacamerascenetoareal-worldspatiallocation;ultimately,incorporatingtheresults ofthisresearchintothecreationofaspatialtrackingsystem. Figure1.1:TranslatingPixel-to-Real Theprimarygoalsofthisresearchwereto: 1.Explorewhichregressionalgorithmshelpstranslatecamerax,ytospatialx,y. 2.CreateaPoCofaworkingsmartsurveillancesystem. 3.Createareusableframeworkbasedonreninganexperimentalmethod. 1

PAGE 12

2Background 2.1TraditionalSurveillance Surveillanceexistsinaplethoraofdierentforms:monitoringofcrowds,trackingofan individual,securitymaintenanceofinfrastructure,assessmentofcyberthreats,radar,etc. Traditionalvideosurveillancehasheavilyreliedonapersonorpeoplesittingbehind anarrayofmonitors,alertforanykindofsuspiciousbehavior. Videosurveillanceexistsfortwopurposes:rst,itisameanstomonitorevolvingactivityinrealtimewiththepossibilityforintervention."Secondly,asameanstoprovide animportantvisualrecordofaspaceatanygiventime,sothatitmaybeviewedretrospectivelytodetectthepresenceofaparticularpersonorobjectinaparticularplace,or totracetheonsetanddevelopmentofaneventthathasalreadyoccurred"[5].Whenused eectively,videosurveillanceisaninvaluabletoolinthemaintenanceofpublicsafety. However,thismethodofmonitoringrequiresconstantvigilance,stamina,andexacting attentiontodetailtobeeectiveasameansofsurveillance.Thesesupervisorymonitoringtasksareallsimilarinthattheyinvolveconcentrationforlongperiodsonacomplex anddynamicallyevolvingsituation,wherebyparticularrulesarekeptinmindtoidentify visualthreatsintheenvironment.Giventhepotentialhighcostofadetectionfailure e.g.,bombexplosion,drowning,mid-aircollision,cyber-attack,highexpectationsare placeduponsurveillancepersonneltoensuresafetyandsecurity."[5]. Onofthemostglaringobstaclestotheecacyofsurveillancepersonnelisthesheer volumeofinformationforwhichoperatorsareresponsible.Sometimes,uptoftyscreens aretobemonitoredatoncebyasingleindividual.Feelingsoffeelingoverwhelmedhave beenreported,aswellasexperimentalstudiesintodividedvisualattention,wherebythe visualsystembecomesnoticeablystretchedandperformancedeclineswhenmonitor2

PAGE 13

ingfourscenesconcurrently."Moreover,thenumberofpotentialcamerafeedstoselect fromoftenexceedsthenumberofphysicallyavailablescreensonwhichtoviewthem. Furthermore,theprocessofconstantlyswitchingcontextbetweencamerafeedsintroducesinecienciesinmonitoring.Theabilityofanoperatortoperceivetheelements andeventsofhisorherworkenvironmentwithrespecttotimeorspace,tounderstand theirmeaninginrelationtothetasksatplay,andtoforeseehowtheymaychangeover time|willbereducedeachtimeuntiltheoperatorhasreassessedthenewsceneand updatedtheirsituationalmodel"[5]. Operatorsinchargeofthesefeedsarenotinvulnerablefromoneofthemostpernicious perpetratorsofhumanerror:boredom.Consideringthelongshiftsoftheseoperators hr4daysonwith2dayso,itisnotsurprisingthatmanyimportantincidentscan gocompletelyundetectedforseveralbrief,criticalmomentsorevenforlongperiodsof time.Thisphenomenonhasbeenlabeled'inattentionalblindness.' Securitysurveillancesystemsarecrucialinsituationsrelatedtopublicsafetyresulting fromcriminalorthreateningactivity.Ubiquitously,securitysurveillancedevicescanbe foundlitteredthroughoutpublicplaces:parks,schools,governmentbuildings,etc.[6]. However,forthereasonsoutlinedabove,automatingandstreamliningthesurveillance processforthebenetofthesurveillanceoperatorscouldprovideamuch-desiredboon intheinterestofmaintainingthepeace"inpublicandprivateareas. 2.2ComputerVision'sRole Thetechnologicalevolutionofsurveillancesystemscanbedividedintothreeepochs. AnexampleoftheinitialgenerationconsistsofaseriesofClosedCircuitTVCCTV camerasconnectedtoacentralcontrolroomobviousdisadvantageshavebeenoutlinedin theprevioussection.Thesecondgenerationinvolvestheearlystagesofintelligent,"but rudimentarycomputervisiondetectionaidedtechnologies.Finally,thethirdgeneration consistsofadvanceddetectionandtrackingcomputervisiontechnologiesonalarge, geographicallydistributedscale[6]. Overthepastthreedecades,therehasbeenanever-growinginterestinhumandetection 3

PAGE 14

andtracking.Videoacquisitiontechnologyisoneofthefundamentalaspectsconcerning thisinterest[2].Besidessimplesurveillance,theautomatictrackingofhumansinvideo hasalwaysbeenacross-domainresearchareawithmanyapplicationsinmanydierent domains[2].TheadvancementsofDeepLearning"DLtechnologiesandparallelCPU processinghardware,inspiredbytheorganicarchitectureofthehumancortex,haveallowedengineersandscientiststoadaptcomputervisiontechnologiesinto'smartvisual surveillance'systems[2].Smartvisualsurveillancedealswiththereal-timemonitoring ofobjectswithinanenvironment.Theprimarygoalistoprovideautomaticinterpretationandrudimentaryanalysisofthescenetounderstandactivitiesandinteractionsof observedagents,specicallyhumanbeings[2]. Ataxonomydividessurveillancewithinthevideodomainintofourgroups:visualinformationrepresentation,regionsofinterestlocation,trackingandidentication,and knowledgeextraction[6].Visualinformationrepresentationreferstorepresentingthe informationcontainedinthevisualdata.Forexample,convertingpixelinformationtoa featurespace.Regionsofinterestlocationisthefocusingonlocationsoftheimagewhere informationregardingactivitiescanbeextracted.Trackingandidenticationidenties agentsbasedona-prioriinformationsuchasfacialrecognitionandfollowsthetrajectoriesoftheagentswithinthescene.Knowledgeextractionreferstotherepresentation ofascenegiventhepreviouslymentionedstepsfortheuseofperforminganalysisand makinginferences[6]. Oneofthegreatchallengesofautomaticsurveillancesystemsisthattointerpretwhat ishappeninginascene,asequenceofproblemsneedstobesolved,includingbutnot limitedto:detection,recognition,poseestimation,tracking,andre-identication[6]. Thespecicfocusareaofthisresearchwilladdressoneofthoseprimarychallenges relatedtoautomaticsurveillance:tracking. Ofcourse,itisimportanttorecognizethattheseautomatedsystemsdonotnecessarily provideafail-safesolution.Theroleofthesurveillanceoperatorisstillverymuchnecessary.Thesesystemsandtheresearchforthesystemdescribedinthisthesisaretobe consideredatoolavailabletotheoperators,notareplacement.Over-relianceonauto4

PAGE 15

matedsystemscancreateasenseoffalsesecurity;wherebyoperatorsbecomecomplacent {assumingincidentswillbeconsistentlydetectedandhandledbytheautomatedsystem. Nosystemisperfect.[5].Computervisionsoftware,althoughincreasinglyrobust,is pronetofailingundersub-optimalconditions. 2.3MappingtoRealSpace First,itisimportanttoclarifyseveraldistinctions: 1.Thedierencesbetweenclassication,detection,andtracking. 2.Trackingasitistraditionallyacceptedwithintherealmofcomputervisionversus trackingasitisbeingusedwithinthecontextofthisresearch. Weshallrefertotrackinginthecontextofthisresearchas'spatialtracking.'Furthermore forthesakeofclarication,scene"isusedtodescribethevideofeedfromacamera. Frame"isusedtodescribeasingleimage,ortimeslice,ofthescene. Frame : f x Scene : f 0 ;f 1 ;f 2 ;:::f x :::f n Classicationistheprocessbywhichobjectsfoundinaframeareorganizedintoa logicalcategory.Forexample,ifananimalandachairareidentiedinanimage,two potentialclassicationsareofliving"andnon-living"objects. Detectionistheprocessbywhichobjectsareidentiedandmappedwithinaframe. Withmostgeneralpurposeobjectdetectors,objectsaredetectedandsubsequentlyencapsulatedwithinaboundingboxdenotingtheobject'spositionwithintheframe.Detection canbethoughtofasclassicationandlocalization,localizationbeingthepositionofthe objectintheframe.Thisdiersfromthetrackingofanobjectinasceneinthatthe detectionofanobjectfromoneframetothenextiscompletelyindependentofprevious orfutureframes. Tracking,asunderstoodwithintherealmofcomputervision,isacceptedastheanalysis ofvideosequencesforestablishingthelocationofatargetobjectoveratimesequence 5

PAGE 16

[13].Thereferencedlocationofatargetiswithinthevisualscenecapturedbythecamera {essentiallytrackingthedetectedtargetoncamerafromframetoframe.Thistechnique isusefulwhenthereisadesiretoisolateandfocusononeorseveraldistinctobjects withinascene. Trackingcanbedividedintovemaingroupsbasedonthemethodoftracking:trackers whichoptimizethedirectmatchbetweenasinglemodelofthetargetandtheincoming imageonthebasisofarepresentationofitsappearance,trackersthatmatchbetween thetargetandimagebutholdmorethatonemodelofthetargetallowingforlongterm memoryofthetarget'sappearance,trackersthatperformamaximizationofthematch butwithexplicitconstraintsderivedfromthemotion,coherence,andelasticityofthe target,trackersthatdonotperformmatchingonbasisofappearancebutmaximizethe discriminationofthetargetfromthebackground,learningtolabelpixelsastargetpixels orbackgroundpixels,andtrackersthatmaximizediscriminationwithexplicitconstraints [13]. Spatialtrackingdescribestheprocessoftranslatingatargetobject'spositionina scenetoareal-world,physicallocation.Inmanymostotheshelf"computervision detectiontechnologies,thegenerallocationofadetectedobjectortrackedtargetis representedbyaboundingbox.Extrapolatingabitfromboundarybox'splacementand dimensions,wecandeterminetherelativelocationofanobjectwithinascene.From here,withabitofconguration,weshouldbeabletotranslatethisinformationtoa knownphysicallocationgivena-prioriknowledgeofwhatlocationthecamera'svisual feedcaptures. Foranykindofrealvisualsurveillancesystem,contextisjustimportantifnotmore importantasanykindofcomputationspeed,detection/trackingaccuracy,orperformancemetricofthecomputervisioncomponent.Insituationsinwhichpublicsafetyis atrisk,itisimperativethatsurveillanceoperatorsandrstresponderscanareableto discernmoreinformationaboutascenethanwhereatargetexistsinacamerafeedata singlemomentintime. 6

PAGE 17

2.4RealisticSituation ThemotivationforthisresearchcentersaroundanextremelyrelevantsituationandconcernincurrentU.S.culture:theactiveshooterscenario.Airports,schools,government buildings,restaurants,parks,casinos,neighborhoods{overthepastseveralyearsall theselocationsandmanymorehaveplayedhosttosituationsinvolvinganindividualswitharearmstakingtheirvengeanceandfrustrationsoutoninnocentpeople. Onethingthattheselocationshaveincommon:duringtheincidentmanyofthelocationsweremonitoredbyvideosurveillancesystems.Whilenotinsinuatingthatthe situationscouldhavebeenwhollypreventedusingcomputervisiontechnologiesthesolutionismorepoliticalthantechnological,additionalsurveillancetoolsinthehands ofsecuritysurveillanceoperatorsandrstresponderscould/canbeavaluabletoolto mitigatedamagetohumanlife. Imagineasituationinwhichanindividualexistsasanactivethreattopublicsafety eitherthroughbeinginpossessionanon-concealedrearmorhavingalreadyopenedre withsaidrearm.Thesurveillanceoperatorsmayormaynothavealreadyidentied thatthreat.Dependingontheirfamiliaritywithwhichcamerafeedcorrespondstowhich geographicalarea,theymayormaynotbeabletoidentifywherethethreatisatany pointintime.Whenrstrespondersarrive,theyarelargelyinthesamesituationasthe surveillanceoperatorsbutwithamassivedisadvantagegiventheir,verylikely,unfamiliaritywiththelayoutofthelocation.Whilerespondersandsurveillanceoperatorsare usingvaluabletimeguringoutwherethetargetcurrentlyisorhasbeen,thelikelihood ofharmtohumanlifeincreases.Asystemthatcandetectanactive/potentialthreat,tag suchanindividual,andspatiallytrackhim/herthroughrealspacecouldbeinvaluableto security/rstrespondersthatneedtoactonsuchinformation{loweringthelikelihood ofharmtohumanlife.Afterall,insituationssuchasthese,everymomentiscrucial. Thisproposedsystemutilizescomputervisiontechnologiesintwoways: 1.Detectingathreatandtaggingfortracking. 2.Spatialtrackingthatindividualthroughrealspace. 7

PAGE 18

Thefocusofthisresearchisonthelatter:creationofasystemthatcanutilizedata providedbyoutofthebox"computervisiontechnologiesandtranslatethatdatainto areal-worldlocation.Anactive-shooterscenarioisanextremeexample.Suchasystemcouldalsobeusedforsomethingasinnocuousastrackingthemovementofanimals throughawildlifepreservetoprotectagainstpoachers. 2.5ComputerVision 2.5.1'OutoftheBox' Ofthemanyapplicationsofcomputervision,oneofthemostpowerfulandeasilyaccessibleissimplenottobeconfusedwithtrivialobjectdetectionandclassicationwithin asceneorframe.Veryeasily,onecanretrievetheopensourcecodefromanonline repositoryandbegindetectingobjectsinimageswithinminutes.Thisallowsforallsorts ofpossibilitiesforthespecicusageofthisextremelygeneralizedtechnology.Joseph Redmond,oneofthepioneersoftheexceptionallyversatileandpowerfuldetectiontool YOLO,summarizesitbestwithasimplestatement:Whatarewegoingtodowith thesedetectorsnowthatwehavethem?"[8]. Now,itisimportanttoclarifythatthefocusofthisresearchistangentialtocomputer visionandobjectdetection,butnotnarrowlyfocuseduponthesubject.Instead,weare treatingtheseobjectdetectiontechnologiesasmoreofatoolfortheresearch,rather thanthesubjectofresearch,itself.Justasonewouldexpecttobeabletopurchasea software'outofthebox'tobegindesigningarchitecturalstructureswithouthavingto digthroughthesourcecodetomodifyorunderstandhowandwhythesoftwareworks, weareusingcomputervisionobjectdetectiontools'outofthebox'inordertocreate oursystem.Thesameappliestothemachinelearningalgorithmsthatareusedforthe pixel-to-realmapping.Thefocusoftheresearchistranslatingatargetobject'sposition onavideofeedtoareal-worldlocationforspatialtracking.Intheinterestofcreating asmartsurveillancesystemthatisnotstronglydependentontheoptimizationorcustomizationofaspecictool,weareusinganopensourceobjectdetectiontoolthatis robust,butmoreimportantly,generalenoughthatanyobjectdetectiontechnologythat 8

PAGE 19

utilizesboundingboxestoidentifythelocationofanobjectinanimagemostdocould beeasilysubstituted.WehavedecidedtouseYOLOv3forjustthispurpose. 2.5.2Yolov3 YouOnlyLookOnceYOLO,popularizedbyJosephRedmondoftheUniversityof Washington,isanadvancedapproachobjectdetection"tool.YOLOappliesasingleCNNtotheentireimagewhichfurtherdividestheimageintogrids.Prediction ofboundingboxesandrespectivecondencescorearecalculatedforeachgrid.These boundingboxesareanalyzedbythecondencescore.ThearchitectureofYOLOhas24 convolutionallayersand2fullyconnectedlayers"[11]. Figure2.1:YoloNeuralNetFramework[9] 'Outofthebox'YOLOdetectortakesaninputimageandresizesitto448x448pixels. Theimagepassesthroughtheconvolutionalnetworkwithanoutputofa7x7x30tensor amultidimensionalarray.Thetensorgivesinformationaboutthecoordinatesofthe boundingbox'srectangleandprobabilitydistributionoverallclassesthesystemistrained for.Thethresholdofcondencescoreseliminatesclasslabelsscoringlessthan30%,which canbemodied[11]. OtherCNN-basedsystemsreuseclassiersorlocalizerstodetectanobjectinanimage, meaningthatthesemodelsareappliedtoanimageatmultiplelocationswithdierent scales.ComparedtootherCNN-basedmethods,YOLOhasseveraladvantagesinboth speedandprecision[11]. 9

PAGE 20

First,YOLOappliesasingleCNNforbothclassicationandlocalization;asingleCNN simultaneouslypredictsmultiple-boundingboxesandclassprobabilities.Thismeansthat YOLOisacomparativelyfastobjectdetector[9]. Secondly,YOLOreasonsgloballyaboutanimagewhenmakingpredictions.It'sees' theentireimageduringtrainingandtesttoimplicitlyencodecontextualinformation aboutclassesandtheirappearance,allowingforfewerbackgrounderrorscomparedto otherobjectdetectors[9]. Finally,YOLOlearnsgeneralizedrepresentationsofobjects.SinceYOLOishighly generalizable,itislesslikelytobreakdownwhenappliedtonewdomainsorunexpected inputs[9]. OneofthedrawbacksofYOLOrelevanttothisresearchistheissuerelatedtolocalizationofsmallerobjectswithinframes,relatingtospatialconstraintsonboundingbox predictions.However,bythetimeYOLOv3wasreleased,manyoftheissuesrelatedto localizationofsmallerobjectshavebeenattenuated.[9][8] YOLOv3provideswhatweneedforthisresearcha,namelyageneral-purposeobject detectorthatisrobust,fast,versatile,andeasytouse.Theresultingsystemisinteractiveandengaging.WhileYOLOprocessesimagesindividually,whenattachedtoa webcamitfunctionslikeatrackingsystem,detectingobjectsastheymovearoundand changeinappearance"[9]. 2.6RelatedResearch 2.6.1SLAM Thesubjectoftheresearchinthisthesisrequiresthatinformationabouttheenvironment bea-priori,meaningthatthelocationisknownbeforecongurationanddeploymentofthe system.Itisworthnotingasimilartechniquewithintherealmofcomputervisionthatis usedforbothmappingandgeo-locationwithoutsucha-priorirequirements:Simultaneous LocalizationandMappingSLAM.SLAMistheprocesswherebyanentityrobot, vehicle,orcentralprocessingunitwithsenordevicescarriedbyanindividualhasthe capacityforbuildingaglobalmapofthevisitedenvironmentand,atthesametime, 10

PAGE 21

utilizingthismaptodeduceitsownlocationatanymoment[4].Simply,SLAMtechniques mapanunknownenvironmentandtrackauserentityinsideofthedevelopingmap,at thesametime.Inthiscontext,localizationreferstodetermining,inanexactmanner, thecurrentposeofauserwithinanenvironment[14]. Figure2.2:SLAMExampleMapping[14] Initially,theprocessesofmappingandlocalizationwerestudiedasindependentproblems.However,withouta-prioriknowledgeofanenvironment,thesetwoissuesareentirelydependentuponeachother.Forpreciselocalizationwithinanenvironment,a correctmapisnecessary,butinordertoconstructacorrectmap,itisnecessarytobe properlylocalizedwithinanenvironment.Thus,thetwoproblemsaresimultaneous[4]. DierentimplementationsofSLAMusedierenttechniques,including:variousexteroceptivesensorssonar,etc.,GPSdevices,externalodometers,andlaserscanners[14]. Lasersensorsandsonarallowforpreciseandverydenseinformationabouttheenvironment,however,suerfrombeinglessusefulinclutteredenvironmentsorforrecognizing specicobjects.GPSsuersfromsignaldeprivation. Whencamerasareusedastheonlymeansofexteroceptivesensing,thisisknownas visualSLAM,visionSLAM,vision-onlySLAM,orcameraSLAM.Whenproprioceptivesensorsareadded,allowinganentitytoobtainmeasurementslikevelocity,position change,andacceleration,thisisknownasvisual-inertialSLAM,whichcomplementsvisualSLAMwithincreasedaccuracyandrobustness[14].Examplesofproprioceptive sensorsinclude:encoders,accelerometers,andgyroscopes. Similartotheresearchinthisthesis,camerasusedforvisualSLAMmustbecalibrated 'o-line'forintrinsicandextrinsicparameterspriortotheexecutionoftheSLAMprocess. Calibrationmustconsiderthecamera'sstaticpositioninspaceaswellasgeometryfocal lengthandprincipalpoint[14]. 11

PAGE 22

VisualSLAMprimarilyworksbytrackingasetofpointsthroughsuccessivecamera frames.Withthesepoints,thesystemisabletotriangulatepositioninanenvironment, whileatthesametime,mappingtheenvironmentasitpassesthrough.Thetracked pointsareknownaslandmarks,oraregionintherealworlddescribedby3Dposition andappearanceinformation.Asalientpointisthe2Drepresentationofalandmarkthat theentityusesfortracking[4]. PrimaryapplicationsofSLAMinclude:automaticcarpilotingonunrehearsedo-road terrains,rescuetasksforhigh-riskordicult-navigationenvironments,planetary,aerial, terrestrial,andoceanicexploration,augmentedrealityapplicationswherevirtualobjects areincludedinreal-worldscenes,visualsurveillancesystems,andmedicine[4]. Asmentionedpreviously:nosystemisperfectandwithoutsomepointoffailure.Many visualSLAMsystemsfailwhenoperatingundersub-optimalconditions,including:externalenvironments,dynamicallychangingenvironments,environmentswithtoofewsalient features,inlargescaleenvironments,dynamicmovementofthecameras,orwhenocclusionofthesensorsoccur[4]. 2.7SpatialTrackingSmartSurveillanceSystem 2.7.1Denition Thissmartspatialtrackingsystemwillbecapableoftranslatingobject-detectionderived pixeldataintoareal-world,two-dimensionalspatiallocation,withameanerrorofno morethan3[ft][in]distancebetweenpredictedlocationtogroundtruthlocation. Wewillusemachinelearningalgorithmstoimplementthepixel-to-realtranslation. Furthermore,wewillbuildaweb-basedapplicationthatwillshowshowcasethecomplete PoCspatialtrackingsmartsurveillancesystem. 2.7.2Requirements Increatingaspatialtrackingsystemthatisconsidered"smartsurveillance,"weneeded todeneaproductthatishighlygeneralized.Itshouldbeabletobeimplementedin dierentscenariosusingdierentcongurations.Thisrequiresthatthesystemnecessar12

PAGE 23

ilyberobustagainstdierentvantagepointsandcameraorientations. DepthProblem Onemajorissueanddicultyincreatingthesystemisobjectdepth.Objectdepthis simplythe2Dverticaldistanceofanobjectfromthecamera. Forexample:consideraspatialtrackingsystemthatispositionedtomonitorabusy shoppingmallcorridor.Oncehavingidentiedatarget,thesystemshouldbeableto trackthelocationofanindividualwhetherthecameraismounted200ftabovethescene ormuchlower;perhaps,closertoground-levelwhereithasbetterresolutiontoidentify targets. Figure2.3:System"Depth"Problem Imaginingthisscenario,thevisualfeedfromthetwocamerapositionswouldlookvery dierent.Shownintheimageaboveandtotheleft,thecamerapositioned200ftabove thescenewouldhaveaccesstodistinctpixelcoordinatesforatarget-bothhorizontal positioninganddistancingdepthawayfromthecamera.Beingabletomaptoadistinct x,yreal-worldpositionwouldbeamatterofndinganappropriatetransformationfrom everycamerapointtoeverypossiblerealpointinthescene.However,inconsideringthe lattersituation,thecamerawouldonlyhaveaccesstodistincthorizontalpixelpositioning. Beingabletodeterminedepthsimplyfromthelocationofthedetectedobjectwouldbe dicult.Ifthecameraisabsolutelyatground-level,beingabletodetectobjectdepth 13

PAGE 24

solelyfromobjectlocationisimpossible:everyobjectwouldappeartohavethesame y coordinate location.Thissituationisshownintheimageaboveandtotheright. Thisisthe"depth"problem.Fortherequirementofbeingablefunctionproperlydespitedierentvantagepointsandcameraorientations,oursystemwillneedtobeableto adapttothisissue.Wewillattempttouseattributesoftheboundingboxofadetected objectasanadditionalindicatorofdepth,orasthesoleindicatorofdepthinthecaseof aground-levelcamera. LensDistortion Figure2.4:System"LensDistortion"Problem Anotherissuewithcreatingageneralizedsystemthatisabletotransformpixelcoordinatestoareal-worldlocationislensdistortion.Opticallensdistortionorlenserroroccurs whenacameracapturedimageappearsdistortedfromreality-aroundeddeviationfrom arectilinearprojection.Thisisaresultofofthedesignoftheopticalcomponentof thecamera.Anexampleofthisphenomenonisshownabove.Totheright,thepoints areeachrepresentativeofwhereanindividualstood.Foreachposition,animagewas capturedandtheindividualwasdetected,capturingthepixellocationwithinthescene. Totheleftiseachoneofthepointswheretheindividualstoodfromtheperspectiveof 14

PAGE 25

thecamera.Thereddotrepresentsthecameralocation.Frominspection,thereisaclear distortionbetweenthecameraviewandreality. Thefocusofthisresearchisnottocorrecttheseimageerrors.Furthermore,wewant tobeabletocreateageneralizedsystemthatshould,ideally,beabletofunctionwith anycamerafeedsuitableforobjectdetection. Ascanbeseenintheimage,ignoringanyslightosetofthecamerafromacenter-line, thereiscleardistortionthatoccursbetweentherealandwhatiscaptured.Again,the focusofthisresearchisnottocorrecttheseimageerrors.Instead,wewillattemptto usemachinelearningto"learn"themappingfromthecameralocationstothereal-world positions. Repeatability Ageneralizedsmartspatialtrackingsystemrequiresthataproceduretoimplementsuch atechnologybereproducible.Wewilloutlinestepsusedtoimplementsuchasystem, including:ndingasuitablemappingpixel-to-realmodel,assessinghowwellthemodel performs,anddeployingthemodelonthesystem.Oncesuccessful,wewillrepeatthese stepsinadierenttriallocationtoprovetherepeatabilityoftheprocess. 15

PAGE 26

3Methods 3.1Tools 3.1.1Software OpenCV WeusedOpenCVinpythontobeabletocapturesingleimageframesforcalibrationas wellasrecordsmallsectionsofvideofortestingandaccuracyassessment. YOLO Followingwhatwasmentionedandexplainedinthepreviouschapter,weusedYOLOas ageneral-purposeobjectdetector,focusingonthe'person'class.Whenapersontarget isdetectedwithinaframe,weextractthecenterpointofwherethetargetwasdetected aswellasdierentcomponentsoftheboundingboxcoveringthetarget.Wethenwrote thisinformationtoaletobeusedfortraining/congurationforspatialtracking. Afterndinganadequateequationtomapthecenterpointandareaofthebounding boxtorealworldcoordinates,weusedanadditionalYOLOpythonscripttoactasthe datacollectionportionofthegeo-trackingsystem.TheYOLOscriptcollectsdataakin tothepreviousYOLOscript,excepttransportstoabokehserver,whichactsasthe visualizationportionofthespatialtrackingsystem.PriortosendingtheHTTPrequest, thevisionportionrunsthecollectedpointthroughtheconguredsystemtoobtainthe desiredreal-worldpoint.ThebokehserveracceptsanHTTPPOSTrequestcontaining thetransformeddata. 16

PAGE 27

Bokeh "Bokehisaninteractivevisualizationlibrarythattargetsmodernwebbrowsersforpresentation.Itsgoalistoprovideelegant,conciseconstructionofversatilegraphics,andto extendthiscapabilitywithhigh-performanceinteractivityoververylargeorstreaming datasets."TakenfromBokehocialwebsite.Weusebokehtovisualizethegeo-tracking oftheobjectscapturedbytheYOLOscript.Afterndinganadequatemodeltomap pixeldatatoreal-worldcoordinates,weusebokehtoactasthevisualcomponentofthe spatialtrackingsystem. Qpython QPythonisascriptenginewhichrunsPythonprogramsonandroiddevices."Taken fromQpythonocialwebsite.Insteadofbuildinganentireandroidappforcollecting GPScoordinates,wedecidedtouseasimplepythonscript.Unfortunately,afterourrst trial,werealizedthatthereisamajorawwithGPScollectionusinganandroidphone andembeddedGPSreceiver:often,theGPScoordinatesarenotupdated,givingstale, anduselessdata.WexedthisinsubsequenttrialstoforcefreshGPSdata. Scikit-learn WeusedSciKit-learn'svariousopensourcemachinelearningalgorithmsevaluateamappingfromthecameracollectedpixeldatatorealworldcoordinates.Thegeneralformof thetypesofregressionalgorithmsare: SimpleLinearRegression : y x = w 0 + w 1 x MultivariableRegression : y x 1 ;x 2 ;:::x p = w 0 + w 1 x 1 + w 2 x 2 + ::: + w p x p PolynomialRegression : y x = w 0 + w 1 x + w 2 x 2 + ::: + w p x p Where... w = constantcoefficient orweight y = dependentvalue s 17

PAGE 28

x = independentvalue s Theregressionalgorithmsusedinthecongurationphasetobuildapixel-to-realmappingare: GeneralLinearRegression Thegeneralizedlinearregressionmethoddeterminestheoptimalcurvetbyminimizing thedistancebetweenthecurvespredictedvaluesandtheexperimentaldatapoints.Todo this,errorisrepresentedbythesumoftheresidualssquared,E.Theerroriscalculated asshownbelowandminimized: E = n X i =1 r 2 i Wherer i isthedierencebetweentheactualandpredictedvaluesatthei th point. RidgeRegression Theridgeregressionmodelisusedtopreventoverttingthatcanoccurwithothermodels, suchasthegeneralizedlinearregressionmodel.Itintroducesasmallamountofbiasto theleastsquaresmethodinordertodeceasethelong-termvariance.Thisregression modelminimizes,E,asdenedinthefollowingequation: E = n X i =1 r 2 i + m 2 + p 2 Here,apenaltytermisaddedtothetraditionalleastsquaresmethodwherelambda << 1 determineshowseverethepenaltyisandmrepresentstheslopeoftheline andprepresentsotherparameterscontributingtothecurve.Increasingvaluesoflambda causethepredictionsofdependentvariablestobecomedecreasinglysensitivetochanges intheindependentvariables.Thismodelismosteectivewhenyouknowmost/allof youroptimizationparametersareinformative. 18

PAGE 29

LassoRegression Lassoregressionisverysimilartoridgeregression|itaddsapenaltytermtotheerror equation,whichintroducesasmallamountofbias.Thesupercialdierencebetween thesetwomethodsisthatlassoregressiontakestheabsolutevalueoftheoptimization parametersratherthansquaringthem: E = n X i =1 r 2 i + j m j + j p j Therefore,lassoregressioncaneliminatethedependencebetweenthepredictionsof dependentvariablesandchangestotheindependentvariablesandthereforeisbetterat reducingthevarianceinmodelsthatcontainuninformativevariables.Intheseinstances, theuninformativevariableswillbeexcludedfromthenalequationsmakingthemsimplerandeasiertointerpret. ElasticNetRegression Theelasticnetregressioncombinesthelassoregressionpenaltywiththeridgeregression penalty|eachmaintainingseparatelambdaterms,asillustratedintheequationbelow: E = n X i =1 r 2 i + 1 m 2 + p 2 + 2 j m j + j p j Indoingso,itcombinesthestrengthsofbothlassoandridgeregression.Thisisan idealmodelifyoudon'tknowhowusefulsomeofyourparametersareofifyouhavefar toomanyvariablestoknowabouteachone.Thismethodisespeciallyusefulinsituations wheretherearecorrelationsbetweendierentoptimizationparameterswithinthemodel. Thesemodelswerechosenbecauseofpastexperienceandsuccessusingoneormoreof themodelsindierentmachinelearningapplications.Oneoftheconsiderationsforthis researchwaswhetherornottospendasignicantamountoftimetuningeachmodel toachievemaximumaccuracy.Wedecidedagainstthis,decidingtofocusoureortson buildingthesystem,itself,throughprovingtherepeat-abilityoftheprocessandbuilding aPoCofthefullsystem.TherationalbehindthiscanbefoundinChapter6FutureWork. 19

PAGE 30

3.1.2Hardware LenovoIdeapad500 Processor:IntelRXeonR3.7GHz,4Cores Memory:32Gb OperatingSystem:Ubuntu16.04 Logitechc270Webcam MaxResolution:720p/30fps FocusType:Fixedfocus LensTechnology:Standard FieldofView:60degrees SamsungGalaxys5 Processor:Quad-core2.5GHzKrait400 Chipset:QualcommMSM8974ACSnapdragon801nm Memory:32Gb OperatingSystem:Android4.4.2KitKat 3.2ProcedureOverview Therearetwonecessarycomponentsofthesystem:theclientcamerasandtheserver userinterface.Theclientcapturesthedatarelatedtotargetpositionandtranslatesthat intoareal-worldlocationpixel-to-real,whichissenttotheserver.Theserveruser interfaceacceptsmessagesfromtheclientcameraandvisuallydisplaystheresultstothe user,completingthesurveillancesystem. 3.3Client Theclientcameracongurationanddeploymentprocessforeachcameraisunique,but followsthesamebasicprocedure: 20

PAGE 31

ObtainTrainingData BuildModels DetermineBestModel ExportModelandDeployonCamera Amoredetaileddescriptionoftheexperimentationthatledtothedevelopmentofthe aboveprocessisgiveninChapter2. 3.3.1ObtainTrainingData Obtainingtrainingpointsinvolvescoordinatingareal-worldlocationwithalocationdetectedwiththecamerabeingused-eitherGPSand/orphysicalmeasurement.Training pointsareobtainedone-by-oneviaatargetmovingfromalocationtolocation.Ateach location:GPS/winetwork-basedlocationiscapturedusingaGPSdevice,cameracoordinatesaredeterminedusingobjectdetectionsoftwareonanindividualstandingin stationaryposition,andphysicaldistancerelativetocameraisfoundusingmeasuring tools. 3.3.2BuildModels Usinganalyticaltools,wetakeasampleoftheaforementionedgenerallinearregression models.Withthesemodels,wetthedependentvariablecameracoordinatetraining datavaluestotheindependentvariablegroundtruthtrainingdatavaluesGPS,measureddistance,etc.. 3.3.3DetermineBestModel Takingthesampleofthettedgenerallinearregressionmodels,wedeterminethe'best' modelthroughthe'LeaveOneOut'and'Train-to-Test'methods.Accuracyisdetermined bya'HitorMiss'score,whereinapointpredictedusingaregressionmodeliseitherinside oroutsideathresholdgivenbyapredeterminedeuclideandistancefromthegroundtruth pointitisattemptingtopredict.Accuracyofamodelisdeterminedbythetotalnumber 21

PAGE 32

ofpredictedpointsconsideredaccuratedividedbythetotalnumberofpoints.Inaddition, wedeterminethemodelwithlowestmeanerrorofpredictedpointdistancefromground truth. Themodelswiththelowestmeanerrorandhighestmodelaccuracywillbedeemed the'best'modelstouse.Wewillthenassessthismodelswiththe'PathApproximation'and'MovingTarget'methodstodeterminehowit/theyperformwhenintroduced tonewdata. LeaveOneOut The'LeaveOneOut'methodofaccuracyevaluationinvolvesbuildingaregressionmodel withallbutoneofthetrainingpoints.Then,thettedregressionmodelisappliedto the'leftout'trainingpointtocalculateapredictedpoint.Thepredictedpointisthen assessedforitsaccuracywiththepreviouslydescribed'HitorMiss'method.Themodel isthenrebuiltleavingoutanotherpoint.Thisprocessesisrepeateduntilallpointshave beenleftout.Again,overallmodelaccuracyisdeterminedbythetotalnumberofpredictedpointsconsideredaccuratedividedbythetotalnumberofpoints. Train-to-Test The'Train-to-Test'methodofaccuracyevaluationinvolvesbuildingaregressionmodel withallofthetrainingpoints.Then,thettedmodelis,again,appliedtothetraining dependentvariabledatatoyieldpredictedpoints.Allpointsarethenevaluatedusing the'HitorMiss'method,whichwilleventuallyyieldoverallmodelaccuracy. Thelowestmeanerroriscalculatedduringtheabove'LeaveOneOut'and'Trainto-Test'methods.Inadditiontodeterminingwhetherornotapredictedpointiseither insideoroutsideathresholdgivenbyapredeterminedeuclideandistancefromtheground truthpointusingthe'HitorMiss'method,wedeterminetheactualdistanceawayfrom thegroundtruth.Foreachmodel,thesevaluesareaveragedtodeterminethemeanerror. Afterdeterminingthebestaccuracyandlowestmeanerror,weundertakeanother roundofevaluationusingthe'PathApproximation'and'MovingTarget'methodstodeterminehowwellthechosenregressionmodelsperformwhenintroducedtonewcamera 22

PAGE 33

data. PathApproximation Originally,thisstepoftheprocedurewasdesignedtomeasurereal-timedata'sground trutherrorintheformofaGPSlocationversusapredictedpointproducedfromaregressionalgorithm.However,duringtheprocessofobtainingdata,wediscoveredthat notonlyisGPSdatacompletelyinaccurateasagroundtruth,butcollectingreal-time datawasimpossibleforthetechnologyavailabletous.Westillneededawaytogauge howwellthemodelwouldperformwhenintroducedtonewdata.Wedevelopedthepath approximationmethod.Dynamicmotionofatargetmovingbetweenknownpointsis captured,recordingtherelevantdependentvariablecameradatavalues.Then,thechosenregressionalgorithmisappliedtothecameradata,yieldingapredicted'trialpath.' Comparingthevideofeedtothepredictedvalues,wequalitativelyassessthefeasibility ofthedynamicmodelbyinterpolatingbetweentheknownpathandthe'trialpath.'This isanovelqualitativedeterminationofapproximatelyhowwellthemodelperformswhen introducedtonewdata. MovingTarget Alongthesamelinesofthe'PathApproximation'method,wehadtomodifyouroriginal plantocircumventtheuseofreal-timedata.Wemodiedfromreal-timedatacaptureto analysisofselectedframesfromdynamicmotionatknownpositions.Wewouldcapture footageofatesttargetmovingtoseveralpre-determined,knownpositions.Theexact framesofwhenthetargetwascapturedinthosephysicallocationswereextractedfrom thevideofootage.Then,applyingourchosenregressionalgorithm,weapplythemodelto thosepointstoyieldpredictedpoints.Finally,wedeterminetheeuclideandistancefrom thepredeterminedlocationstothepredictedpointstodeterminethemeanerror.Again, thisdeterminesapproximatelyhowwellthemodelperformswhenintroducedtonewdata. 23

PAGE 34

3.3.4ExportModelandDeployonCamera Onceamodelhasbeenchosenforthestationarycamerascene,itmustbeintegrated intotheobjectdetectionYOLOsoftware.WithScikit-learn,transferringamodelfrom oneprogramtoanotherwasaccomplishedusingthefollowinglibraryandcodesnippets Python3: from joblib import dump,load #exportmodel dumpmodel,'model name.joblib' #importmodel model=load'model name.joblib' Oncewehaveimportedourmodelintoourclientcamera,webeginsurveillanceofa scene.Whenatargetobjectisdetected,thepixelinformationiscollected.Applyingour modeltothepixelinformationwithourpixel-to-realmapping,weobtainareal-world spatiallocation.Onceweareabletoacquireaspatiallocation,wesendthisdatatoour userinterfaceserverusingaPOSTrequest. 3.4Server HTTPServer TheuserinterfaceserveridlywaitsforHTTPPOSTrequestsfromtheclientcameras. Whenitreceivesarequest,itappendsthisincomingdatatothedatastructurecontaining thepositioninformation,orcreatesthedatastructureifitdoesnotalreadyexist. UsingAJAX,theserverasynchronouslyupdatesthedisplayofthesmartsurveillance system.MoreaboutthedesignandimplementationoftheserverisprovidedinChapter 5. 24

PAGE 35

4Data/Analysis Thischapterwillshowtheexperimentaldevelopmentfortherstthreestepsoftheclient cameracongurationprocess:Obtainingtrainingdata,buildingthemodels,anddeterminingthebestmodeltodeployonthecameraclient.Weshowthemethodsusedto analyzethedatathatwascollectedtobuildthemappingfrompixel-to-real,i.e.nding anappropriatemappingmodelandassessinghowwellitperforms.Basedonthemethods andtoolsdescribedinthepreviouschapter,weobtainedthefollowingresultsforfour trialsindierentlocations. 4.1Synthesis FromTrial1toTrial4,wewereabletodevelopaprocessforndinganappropriate mappingmodelandassessinghowwellitperforms.Avisualowoftheentireprocess canbefoundintheappendix. Trial1 Trial1wasbenecialinthatwedevelopedasuccessfulmethodforobtainingtraining data.WefoundthatoursupposedmethodofGPSdatapollingdidnotyieldthedesired results:theGPSdatawouldnotnecessarilyupdateforeachnewcapturedposition.We modiedourGPScollectioncodeforthenexttrialtoforceanupdate. Trial2 InTrial2,wesolvedour'lensdistortion'issueoutlinedinChapter2Requirements.We wereabletosuccessfullymappixeldatatorealdataforthespatialtrackingsurveillance system.Fortheprocess:wesuccessfullyimprovedourGPSdatacollectionmethod, 25

PAGE 36

devisedandimplementeda'measureddistance'methodfortrainingdatacollection,were abletotesttheprocessofdeterminingthebestmodelsforpixel-to-realmapping,and evaluatedthechosenmodels.Wediscoveredthat,althoughourGPScollectionmethod hadimproved,GPS,asitwasavailabletous,wasnotaviablemethodtobuildthe system. Aftercompletingtheprocessofdeterminingthebestmappingmodeltouse,theevaluationshowedthatthebestmodelswerenotquitesucienttouseinthesystemthrough the'MovingTarget'evaluationmethod.Thelowestmeanerrorwaswellover10ft [in]. Additionally,wedeterminedthatthe'Train-to-Test'methodwasnotasuitableassessmenttoolformodelaccuracy,asitledtoover-tting. Wedecidedtoincreasethenumberofcollectedtrainingpointsinthenexttrailinan attempttoimproveaccuracy. Trial3 Trial3wasconsideredtherstsuccessfultrialofourresearch.Theincreaseinnumber oftrainingpointsdidyieldanincreaseinaccuracy.Afterdeterminingthebestmodels andevaluatingperformance,wewereabletoobtainaresultoflessthan12.00[in]mean errorusingour"MovingTarget"method.Unfortunately,thiswasonlyusing x coordinate , y coordinate asapredictiveparameter.Wedecidedtomoveforwardwiththismodelto buildaniterationofthePoCofthespatialtrackingsmartsurveillancesystem. Inaddition,wediscoveredthatusingArea BoundingBox asapredictiveparameter,despite beingcorrelatedwiththedistanceawayfromthecamera,yieldedcompletelyunreliable results.WedecidedtodiscontinueusageoftheArea BoundingBox asapredictiveparameter. However,westillneededsomewayofsolvingour"depth"issueoutlinedinChapter2 RequirementstohaveacompletedPoC. Wedecidedtoincreasethenumberofcollectedtrainingpointsinthenexttrailinan attemptto,again,improveaccuracy. 26

PAGE 37

Trial4 Trial4wasusedtoproverepeat-abilityofourprocess,solvingour'repeat-ability'issue outlinedinChapter2Requirements.Withanincreaseinthenumberoftrainingpoints, therewasanincreaseinaccuracyanddecreaseinmeanerrorwiththeLOOmethod. However,whenassessingthebestmodelswiththe'PathApproximation'and'Moving Target'methods,therewasaslightdecreaseinperformance.Whereinthelasttrial, wewereabletoobtainameanerrorof12[in]usingour"MovingTarget"methodand x coordinate ,y coordinate asapredictiveparameter,weexperiencedanincreasetoa30.02[in] meanerror.However,thisisstillwithintoleranceofacceptanceforthesystem.Wewill moveforwardwiththismodeltobuildaseconditerationofthePoCofthegeo-tracking smartsurveillancesystem. Inaddition,usingHeight BoundingBox insteadofArea BoundingBox asapredictiveparameter,wewereabletosolveour"depth"issueoutlinedinChapter1Requirements.Performingthe"MovingTarget"evaluationusing x coordinate ,Height BoundingBox asapredictive parameter,wewereabletoobtainameanerrorof35.92[in].Thisiswithinourtolerance ofacceptanceforthesystem. Withthistrial,wehadsuccessfullycompletedoursystemasitwasoutlinedinChapter 2. 4.2Trial1:FieldinStapletonDenver,CO 4.2.1Overview ThersttrialfortheprojectinvolvedpollingGPSandnetworkpositiondatatocollect congurationdataforthespatialtrackingsystem. WeutilizedaparkintheStapletonareaofDenver,CO.Thislocationwaschosenforitsopenspaceandrelativelackofpedestrianfoottracatthattimeofday. TheotherobjectdetectioncategoriesforYOLOv3wereomittedfromthecollected data.Inaddition,carewastakentoensurethatonlyonerelevantobjectperson wasdetectedpercapture.Wecollectedatotalof75pointsbetweencameracaptured x coordinate ;y coordinate ;Area BoundingBox andGPS/Networkcollectedlatitude,longitude. 27

PAGE 38

Figure4.1:Trial1CameraImages ObjectdetectioncaptureinStapletoneld Theplotbelowshowsthecaptureddatafromthecamerarepresentedasxandy coordinateswithannotationsdenotingtherecordedsizeoftheboundingbox.Noticethat thereisaninverserelationshipbetweenproximitytothecameraandrecorded y coordinate oftheobject. Figure4.2:Trial1CameraPositionPlot x coordinate and y coordinate annotatedwithArea BoundingBox Thereisadirectcorrelationbetweentheboundingboxareaandthe y coordinate .This relationshipisshownintheplotbelow.Thisistobeexpectedastheboundingbox coveringadetectedobjectislargerwithcloserproximitytothecamera. 28

PAGE 39

Figure4.3:Trial1CameraDataRelationships Relationshipsbetween y coordinate withArea BoundingBox ThecollecteddatafortheGPSpositioningisshownbelowandleft.Uponinspection, therearefarfewerdistinctGPSpointscollectedthatcorrespondingcameraimagecapture points.Later,wefoundoutthatthereisaquirktogeneral-purposemobileGPSradios: althoughtheyappeartobeconstantlyupdating,thedataisoftenstale.Although75 pointswecollectedfromuniquelocations,notallthedataisunique.Moreonthistopic willbeexplainedinalatersection.Thenetworklatitudeandlongitudecollectedlocations areshownbelowandright. Figure4.4:Trial1GPSandNetworkPositionLocations ThepositioningcollectionprogramwasconguredtocollectbothGPSandnetwork dataifthenetworkwaswithinrange.Halfwaythroughthedatacollection,wefoundthe 29

PAGE 40

mobiledevicehadconnectivitytothenetwork.Weobtained50networklatitudeand longitudelocationsversus75GPS.Ofcourse,laterwefoundthepreviousissuewith GPSstalelocationdatapersistedwiththenetworkdata,aswell:Although50pointswe collectedfromuniquelocations,notallthedataisunique. 4.2.2Analysis Basedupontheinconsistencyofthedatacollected,wemadeaneducatedassumption thatanaccurategeo-trackingmodelwouldbenexttoimpossibletocreate.Wedecided nottocollectdynamicvideodataandbuilda"trialpath."Wedidconductanalysison thedata.Unfortunately,asexpected,theanalysisdidnotyieldfavorableresults. Thistrialtaughtustheimportanceofbeingabletoaccuratelycollectdistincttrainingdatapointstobuildamodel.Although,wewereunabletoprogresswiththedata collected,wedevisedamethodtocollectmoreconsistentdataduringthenextiteration. 4.3Trial2:SoccerField 4.3.1Overview Figure4.5:Trial2CameraImages ObjectdetectioncaptureonSoccerField 30


The second trial for the project involved polling GPS and manually measuring predetermined position data to collect training configuration data for the spatial tracking system.

We utilized a soccer field in Denver, CO. This location was chosen for its open space and relative lack of pedestrian foot traffic at that time of day. The other object detection categories for YOLOv3 were omitted from the collected data. In addition, again, care was taken to ensure that only one relevant object (person) was detected per capture. We collected a total of 17 points between camera-captured (x_coordinate, y_coordinate, Area_BoundingBox), GPS (latitude, longitude), and measured distance in inches (x, y).

The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded size of the bounding box. Notice that there is an inverse relationship between proximity to the camera and the recorded y_coordinate of the object.

Figure 4.6: Trial 2 Camera Position Plot (x_coordinate and y_coordinate annotated with Area_BoundingBox)

There is a direct correlation between the bounding box area and the y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera.

The collected data for the GPS positioning is shown below and left.


Figure 4.7: Trial 2 Camera Data Relationships (relationship between y_coordinate and Area_BoundingBox)

The GPS collection method was updated for this trial. All of the GPS points are unique for each point collected. The collected measured distance positions are shown below and right. The measured distance points and the GPS positions are representative of the same physical locations.

Figure 4.8: Trial 2 Collected GPS and Measured Distance Location Points

Beginning with this trial, we started using our four methods of evaluating the accuracy of a model: 'Leave One Out,' 'Train-to-Test,' 'Path Approximation,' and 'Moving Target.' The trial path for Path Approximation is shown below.


Figure 4.9: Trial 2 Path (trial path on soccer field shown as alternating colors)

4.3.2 GPS

After Trial 1, we were suspicious of the consistency of the GPS data. We decided to compare our GPS data against our measured data. To do this, we normalized the GPS data to fit on the same distance scale as the measured distance data, i.e., converting from degrees in latitude and longitude to inches in x and y. The results are displayed below.

Figure 4.10: Trial 2 GPS Comparison (comparison between GPS and measured distances)

From the above figure, it can be discerned that the GPS points collected do not give accurate location data. Several of the GPS location points are clustered together, even


though they were recorded at physical locations that were several tens of feet apart. In addition, the mean error between the measured distance location and the observed GPS location was 416.41 [in] (over 30 ft).

Even after modifying our GPS collection methods, our data was still highly inconsistent and inaccurate. After this observation, we made the decision not to continue using GPS as a part of the spatial tracking system. We continued our analysis with the measured distance data.

4.3.3 w/out Area_BoundingBox

Leave One Out

Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy for the chosen regression algorithms are shown below for the Path Approximation evaluation.

Figure 4.11: Trial 2 Measured LOO w/out Area_BoundingBox Accuracy (Trial 2 measured Leave One Out accuracy w/out Area_BoundingBox, within 36.0 [in] of ground truth)

The above plot shows that many of the regression algorithms perform similarly according to the 'Hit or Miss' accuracy method we had developed. Although none of the algorithms performs ideally, ElasticNet performs slightly better than the others. Using the best overall accuracy at the specified degree of polynomial regression for our 'Path Comparison' method yields the following plot for ElasticNet as a degree-2 polynomial.


Figure 4.12: Trial 2 Measured LOO w/out Area_BoundingBox Path Approx. (Trial 2 measured ElasticNet w/out Area_BoundingBox Path Approximation)

Upon inspection of the above plot, one can discern a rough approximation of the path that was followed during the creation of the 'trial path.' Of course, there is a large amount of inaccuracy, but the path points are consistently aligned with the general direction of what is expected.

Figure 4.13: Trial 2 Measured LOO w/out Area_BoundingBox Moving Target (Trial 2 measured ElasticNet w/out Area_BoundingBox Moving Target evaluation)

The Moving Target evaluation above shows that although the path points are consistently aligned with the general direction of what is expected, there is still much error at the "test point" locations.
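For reference, the 'Hit or Miss' accuracy and mean error figures reported throughout this chapter can be computed from predicted and ground-truth (x, y) points roughly as sketched below. The helper names are ours, and the tolerance argument simply mirrors the 36.0 [in] threshold used in the plots above; this is an illustration of the metrics as we describe them, not a listing of our analysis scripts.

import numpy as np

def mean_error(pred, truth):
    # Mean Euclidean distance [in] between predicted and ground-truth (x, y) points.
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.mean(np.linalg.norm(pred - truth, axis=1)))

def hit_or_miss_accuracy(pred, truth, tolerance_in=36.0):
    # Fraction of predictions that land within tolerance_in inches of ground truth.
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.mean(np.linalg.norm(pred - truth, axis=1) <= tolerance_in))

# Placeholder points in inches, purely for illustration.
predicted = [(10.0, 22.0), (105.0, 240.0)]
measured = [(0.0, 30.0), (120.0, 250.0)]
print(mean_error(predicted, measured), hit_or_miss_accuracy(predicted, measured))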


As another assessment of the Leave One Out method, we calculated the mean error for all points collected as a part of the training configuration data. The results are shown below as a plot along with the mean error at each degree.

Figure 4.14: Trial 2 Measured LOO w/out Area_BoundingBox Error (Trial 2 measured LOO w/out Area_BoundingBox mean error)

Again, many of the algorithms perform similarly. However, Ridge performs slightly better than the others. Using the best overall accuracy at the specified degree of polynomial regression for our 'Path Comparison' and 'Moving Target' methods yields the following plots for Ridge as a degree-2 polynomial.


Figure 4.15: Trial 2 Measured LOO w/out Area_BoundingBox Path Approx. (Trial 2 measured Ridge w/out Area_BoundingBox Path Approximation)

Figure 4.16: Trial 2 Measured LOO w/out Area_BoundingBox Moving Target (Trial 2 measured Ridge w/out Area_BoundingBox Moving Target evaluation)

Similar to the previous ElasticNet "Path Approximation" evaluation, there is a large amount of inaccuracy, but the path points are consistently aligned with the general direction of what is expected. Ridge performed similarly to ElasticNet for the 'Moving


Target' assessment, just with slightly more error.

Train-to-Test

Using the 'Train-to-Test' method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy for the chosen regression algorithms are shown below.

Figure 4.17: Trial 2 Measured Train2Test w/out Area_BoundingBox Accuracy (Trial 2 measured Train-to-Test accuracy w/out Area_BoundingBox, within 18.0 [in] of ground truth)

The above plot shows that Linear Regression achieves almost 100% accuracy after fourth-degree polynomial regression. In addition, there is a slight decrease in accuracy after second-degree polynomial regression. The last 'Leave One Out' method informed us that second-degree polynomial regression consistently yields the best accuracy.

In order to evaluate this further, we calculated the mean error for all points collected as a part of the training configuration data. The results are shown below as a plot along with the mean error at each degree.


Figure 4.18: Trial 2 Measured T2T w/out Area_BoundingBox Error (Trial 2 measured Train-to-Test w/out Area_BoundingBox mean error)

The above plot further shows that the mean error for Linear Regression goes to almost zero for fifth-degree polynomial regression. The plots of the "Train-to-Test" method for fifth-degree polynomial regression are shown below.

Figure 4.19: Trial 2 Measured T2T w/out Area_BoundingBox Plot (Trial 2 measured Train-to-Test w/out Area_BoundingBox plot of training data)


We had a suspicion that the accuracy of this model was a result of over-fitting. We conducted the 'Moving Target' evaluation. The result is shown on the following plot.

Figure 4.20: Trial 2 Measured T2T w/out Area_BoundingBox Moving Target (Trial 2 measured Train-to-Test w/out Area_BoundingBox Moving Target evaluation)

The above plot shows that the "Train-to-Test" method of evaluating model accuracy for the system is likely not a valuable addition. Although the model will be able to correctly predict training points, over-fitting means that predictions on any new data introduced to the model will be highly inaccurate. We decided to discontinue using the 'Train-to-Test' method.

4.3.4 With Area_BoundingBox

We needed a method of solving the 'depth problem' outlined in Chapter 2. As there is a direct positive correlation between Area_BoundingBox and y_coordinate, we decided to use Area_BoundingBox as an additional predictive parameter.

Leave One Out

Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.


Figure 4.21: Trial 2 Measured LOO With Area_BoundingBox Accuracy (Trial 2 measured Leave One Out accuracy with Area_BoundingBox, within 36.0 [in] of ground truth)

Figure 4.22: Trial 2 Measured LOO With Area_BoundingBox Error (Trial 2 measured Leave One Out with Area_BoundingBox mean error)

Similar to the model constructed without using Area_BoundingBox as a predictive parameter, ElasticNet performs slightly better than the others. Using the best overall accuracy and lowest error at the specified degree of polynomial regression for our 'Path Comparison' and 'Moving Target' methods yields the following plots for ElasticNet as a degree-2 polynomial.


Figure 4.23: Trial 2 Measured LOO With Area_BoundingBox Path Approx. (Trial 2 measured ElasticNet with Area_BoundingBox Path Approximation)

Figure 4.24: Trial 2 Measured LOO With Area_BoundingBox Moving Target (Trial 2 measured LOO with Area_BoundingBox Moving Target evaluation)

Although ElasticNet was predicted to be as accurate as in the first portion of the trial, without using the bounding box area as a predictive parameter, the plots above show that, similar to the issue with over-fitting, there is a definite decrease in performance


when introducing new data.

4.3.5 Analysis

Trial 2 taught several important lessons going forward with developing the system:

First, GPS as it is available for this research is an inaccurate tool for recording one's physical location. However, given that ElasticNet was able to provide some degree of realistic predictive functionality, this is not to say that GPS absolutely should not be used. It would be interesting to be able to re-run this trial with a more accurate GPS tool.

Secondly, using the bounding box area as a predictive parameter yielded less than favorable results when introduced to new data. The likely explanation for this is that the bounding box area varies based on what pose a "person" is detected in, as the bounding box for YOLOv3 encloses that total (x, y) pixel area and can vary greatly. In addition, another obvious flaw in using the bounding box area as a predictive parameter is running the model on different-sized individuals after training. In such a case, the variation in bounding box size based on the differences in sizes of humans would create inaccuracies in the spatial tracking position.

Thirdly, the 'Train-to-Test' method of assessing model accuracy or finding the lowest mean error should not be used as a reliable evaluation method. Attempting to find the best performing model by either criterion will likely result in a model that is over-fit to the training data and not robust to new data points.

Finally, we were able to make progress on finding a model that would yield accurate results in a spatial tracking system. However, there is still a major concern related to accuracy. For the next trial, we decided to double the number of points collected to see how that would affect the accuracy of the eventual model.
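Because 'Leave One Out' remains the evaluation method we rely on going forward, a minimal sketch of how such an evaluation could be set up with scikit-learn is shown below: a polynomial-feature pipeline wrapped around one of the candidate regressors, scored by holding out one configuration point at a time. The regularization value shown is a placeholder rather than the setting used in our analysis scripts.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import LeaveOneOut

def loo_mean_error(X_pixels, Y_real, degree=2):
    # X_pixels: (n, 2) pixel features, e.g. (x_coordinate, y_coordinate)
    # Y_real:   (n, 2) measured real-world (x, y) positions in inches
    X, Y = np.asarray(X_pixels, float), np.asarray(Y_real, float)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_pipeline(
            PolynomialFeatures(degree),
            MultiOutputRegressor(ElasticNet(alpha=1.0)))  # alpha is a placeholder
        model.fit(X[train_idx], Y[train_idx])
        errors.append(np.linalg.norm(model.predict(X[test_idx]) - Y[test_idx]))
    return float(np.mean(errors))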


4.4 Trial 3: Tennis Court

4.4.1 Overview

The third trial for the project involved, again, manually measuring distances for predetermined positions to collect configuration data for the geo-tracking system.

Figure 4.25: Trial 3 Camera Images (object detection capture on tennis court)

We utilized a tennis court in Denver, CO. Again, this location was chosen for its open space and relative lack of pedestrian foot traffic at that time of day. The other object detection categories for YOLOv3 were omitted from the collected data. In addition, care was taken to ensure that only one relevant object (person) was detected per capture. We collected a total of 34 points between camera-captured (x_coordinate, y_coordinate, Area_BoundingBox) and measured distances in inches (x, y).

The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded size of the bounding box. Notice that there is an inverse relationship between proximity to the camera and the recorded y_coordinate of the object.


Figure 4.26: Trial 3 Camera Position Plot (x_coordinate and y_coordinate annotated with Area_BoundingBox)

There is a direct correlation between the bounding box area and the y_coordinate. This relationship is shown in the plot below. This is to be expected, as the bounding box covering a detected object is larger with closer proximity to the camera.

Figure 4.27: Trial 3 Camera Data Relationships (relationship between y_coordinate and Area_BoundingBox)

The measured distance positions are shown below.


Figure 4.28: Trial 3 Collected Measured Distance Location Points

For this trial we used three methods of evaluating the accuracy of a model: Leave One Out, Path Approximation, and Moving Target. The trial path for Path Approximation is shown below.

Figure 4.29: Trial 3 Path


4.4.2 Without Area_BoundingBox

Leave One Out

Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.

Figure 4.30: Trial 3 Measured LOO w/o Area_BoundingBox Accuracy (Trial 3 measured Leave One Out accuracy w/o Area_BoundingBox, within 36.0 [in] of ground truth)

Figure 4.31: Trial 3 Measured LOO w/o Area_BoundingBox Error (Trial 3 measured Leave One Out w/o Area_BoundingBox mean error)

Based on the results of assessing "Hit or Miss" accuracy and finding the lowest of the


mean error for the predicted points, ElasticNet, once again, was determined to be the ideal model to use. Path Comparison and Moving Target assessments for ElasticNet as a degree-3 polynomial are shown below.

Figure 4.32: Trial 3 Measured LOO w/o Area_BoundingBox Path Approx. (Trial 3 measured ElasticNet w/out Area_BoundingBox Path Approximation)

Figure 4.33: Trial 3 Measured LOO w/o Area_BoundingBox Moving Target (Trial 3 measured LOO w/o Area_BoundingBox Moving Target evaluation)

Both the "Moving Target" and "Path Approximation" methods yielded ideal results. The predicted points in the "Path Approximation" method clearly follow the "trial


path."The"trialpoints"incomparisontogroundtruthpointsareabletoyieldamean errorthatiswithin12.0[in]. 4.4.3WithArea BoundingBox Again,weneededamethodofsolvingthe'depthproblem'outlinedinChapter2.As thereisadirectpositivecorrelationbetweenArea BoundingBox andy coordinate ,wedecided touseArea BoundingBox asanadditionalpredictiveparameteronceagainastheinaccuracy inthelasttrialmayhavebeenaresultofalackoftrainingdata. LeaveOneOut BasedontheresultsofTrial2,weexpectthetheresultsofusingboundingboxareaasa predictiveparameterswillyieldaninaccuratemodel,notrobusttonewdata.However, intheinterestoftesting,wedecidedtoanalyzethedata.Theresultsfollow: UsingtheLeaveOneOutmethodofaccuracyanalysis,weobtainedameasureof accuracyforallpointscollectedasapartofthecongurationdata.Plotsofaccuracyas wellasthemeanerrorforthechosenregressionalgorithmsareshownbelow. Figure4.34:Trial3MeasuredLOOWithArea BoundingBox Accuracy Trial3MeasuredLeaveOneOutAccuracyWithArea BoundingBox within18.0[in]of groundtruth 49


Figure 4.35: Trial 3 Measured LOO With Area_BoundingBox Error (Trial 3 measured Leave One Out with Area_BoundingBox mean error)

Once again, ElasticNet shows the overall lowest mean error and highest accuracy according to our Hit or Miss method. The Path Approximation and Moving Target evaluations are shown below for degree-3 ElasticNet.

Figure 4.36: Trial 3 Measured LOO With Area_BoundingBox Path Approx. (Trial 3 measured ElasticNet with Area_BoundingBox Path Approximation)


Figure 4.37: Trial 3 Measured LOO With Area_BoundingBox Moving Target (Trial 3 measured LOO with Area_BoundingBox Moving Target evaluation)

As expected, including the bounding box area as a predictive parameter resulted in higher mean error and lower accuracy than simply using the center point.

4.4.4 Analysis

Trial 3 taught several important lessons going forward with developing the system:

First, the increase in the number of training points yields better overall accuracy and lower mean error after model training. For the next trial, an even greater number of points will be used during the configuration phase.

Second, although bounding box area is an important parameter to use in instances where y_coordinate doesn't change drastically, using the bounding box area as a predictive parameter only acts as a detriment to accuracy. Further exploration is needed to find a way to include the bounding box area in the system.

Third, this trial should be considered the first "successful" trial wherein the system could actually be deployed. We will move on to building an iteration of the PoC system with the degree-3 ElasticNet model that we have created.
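A minimal sketch of how the chosen model can be fit on the configuration points and serialized for the camera client is shown below. The pipeline mirrors the degree-3 ElasticNet described above, and the joblib file name matches the one loaded by the client code in the Appendix; the placeholder training arrays and the regularization setting are illustrative assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.multioutput import MultiOutputRegressor
from joblib import dump

# Placeholder configuration data: pixel (x_coordinate, y_coordinate) -> real (x, y) [in].
X = np.array([[120, 400], [300, 380], [480, 360], [200, 300], [420, 290]], dtype=float)
Y = np.array([[60, 120], [180, 150], [300, 180], [120, 300], [280, 330]], dtype=float)

model = make_pipeline(
    PolynomialFeatures(degree=3),
    MultiOutputRegressor(ElasticNet(alpha=1.0)))  # alpha is a placeholder
model.fit(X, Y)

# Serialize the fitted pipeline for the camera client, which loads it at start-up.
dump(model, 'cam_regr_model.joblib')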


4.5 Trial 4: Tennis Court 2

4.5.1 Overview

The fourth trial for the project involved, again, manually measuring distances for predetermined positions to collect configuration data for the geo-tracking system. We used this trial to prove repeatability and explore further ways of improving system accuracy.

Figure 4.38: Trial 4 Camera Image (object detection capture on tennis court)

After the success with the previous tennis court, we decided to repeat the trial in a new location with more data points. We utilized a tennis court in Mequon, WI. The same considerations as in previous trials were taken with this trial. We collected an increased total of 53 points between camera-captured (x_coordinate, y_coordinate, Height_BoundingBox) and measured distances in inches (x, y).

The plot below shows the captured data from the camera represented as x and y coordinates with annotations denoting the recorded height of the bounding box. Notice that


there is an inverse relationship between proximity to the camera and the recorded y_coordinate of the object.

Figure 4.39: Trial 4 Camera Position Plot (x_coordinate and y_coordinate annotated with Height_BoundingBox)

There is a direct correlation between the bounding box height and the y_coordinate. This relationship is shown in the plot below. This is to be expected: as the bounding box covering a detected object is larger with closer proximity to the camera, the height should be comparatively larger as well.


Figure 4.40: Trial 4 Camera Data Relationships (relationships between x_coordinate and y_coordinate with Height_BoundingBox)

The measured distance positions are shown below. The numbers show the order in which the images were taken.

Figure 4.41: Trial 4 Collected Measured Distance Location Points

Repeating the steps that gave success for the last trial, we used three methods of evaluating the accuracy of a model: Leave One Out, Path Approximation, and Moving Target. The trial path for Path Approximation is shown below.


Figure 4.42: Trial 4 Path

4.5.2 Without Height_BoundingBox

Leave One Out

Using the Leave One Out method of accuracy analysis, we obtained a measure of accuracy for all points collected as a part of the configuration data. Plots of accuracy as well as the mean error for the chosen regression algorithms are shown below.

Figure 4.43: Trial 4 Measured LOO w/o Height_BoundingBox Accuracy (Trial 4 measured Leave One Out accuracy w/o Height_BoundingBox, within 36.0 [in] of ground truth)


Figure 4.44: Trial 4 Measured LOO w/o Height_BoundingBox Error (Trial 4 measured Leave One Out w/o Height_BoundingBox mean error)

Based on the results of assessing "Hit or Miss" accuracy and finding the lowest of the mean error for the predicted points, Linear Regression and Lasso were determined to be the ideal models to use. "Path Comparison" and "Moving Target" assessments for the two models as degree-3 and degree-4 polynomials, respectively, are shown below.

Figure 4.45: Trial 4 Measured LOO w/o Height_BoundingBox Path Approx. and Moving Target 1 (Trial 4 measured distance Linear Regression w/out Height_BoundingBox Path Approximation and Moving Target evaluation)


Figure 4.46: Trial 4 Measured LOO w/o Height_BoundingBox Path Approx. and Moving Target 2 (Trial 4 measured distance Lasso Regression w/out Height_BoundingBox Path Approximation and Moving Target evaluation)

Both the "Moving Target" and "Path Approximation" methods yielded results that were not quite as good as expected. Linear Regression performed the best with a mean error of 30 [in]. However, from the results of Trial 3, we were expecting a mean error closer to 12 [in], especially considering we provided more training points.

4.5.3 With Height_BoundingBox

The results of previous trials showed that using Area_BoundingBox as a predictive parameter yielded poor accuracy. However, we still needed an alternate predictive parameter to counteract the "depth" issue for instances in which there is no change in the pixel Y-coordinate of detected objects. After careful analysis of the data, we realized that, although Area_BoundingBox is positively correlated with the Y-coordinate pixel location value, the real correlation is derived from Height_BoundingBox, as

Area_BoundingBox = Height_BoundingBox × Width_BoundingBox.

Leave One Out

The results of accuracy as well as the mean error for the chosen regression algorithms using Height_BoundingBox are shown below.


Figure 4.47: Trial 4 Measured LOO With Height_BoundingBox Accuracy (Trial 4 measured Leave One Out accuracy with Height_BoundingBox, within 18.0 [in] of ground truth)

Figure 4.48: Trial 4 Measured LOO With Height_BoundingBox Error (Trial 4 measured Leave One Out with Height_BoundingBox mean error)

Based on the results of assessing "Hit or Miss" accuracy and finding the lowest of the mean error for the predicted points, Lasso and Ridge were determined to be the ideal models to use. "Path Comparison" and "Moving Target" assessments for Lasso and Ridge regressions as degree-3 polynomials using Height_BoundingBox as a predictive parameter are shown below.


Figure 4.49: Trial 4 Measured LOO With Height_BoundingBox Path Approx. and Moving Target 1 (Trial 4 measured distance Lasso Regression with Height_BoundingBox Path Approximation and Moving Target evaluation)

Figure 4.50: Trial 4 Measured LOO With Height_BoundingBox Path Approx. and Moving Target 2 (Trial 4 measured distance Ridge Regression with Height_BoundingBox Path Approximation and Moving Target evaluation)

Using Height_BoundingBox in addition to the (x, y) pixel coordinates as predictive parameters yields a result that is much better than using Area_BoundingBox. However, the qualitative accuracy obtained with both models is not as precise as just using the (x, y) pixel coordinates. In addition, there is an increase in mean error of approximately 30 [in] with Ridge Regression on the "Moving Target" evaluation.

4.5.4 Height_BoundingBox Replacing Y-Coordinate Pixel Value

We still needed a way of solving the "depth" problem. We decided to replace the pixel Y-coordinate value with Height_BoundingBox in the model. The results of accuracy as well as the mean error for the chosen regression algorithms, replacing the pixel Y-coordinate of the detected object with Height_BoundingBox, are shown below.


Figure 4.51: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate Accuracy (Trial 4 measured Leave One Out accuracy with Height_BoundingBox replacing Y-coordinate, within 18.0 [in] of ground truth)

Figure 4.52: Trial 4 Measured LOO With Height_BoundingBox Replacing Y-Coordinate Error (Trial 4 measured Leave One Out with Height_BoundingBox replacing Y-coordinate mean error)

Based on the results of assessing "Hit or Miss" accuracy and finding the lowest of the mean error for the predicted points, Lasso and ElasticNet were determined to be the ideal models to use. "Path Comparison" and "Moving Target" assessments for Lasso and ElasticNet regressions as degree-3 polynomials using Height_BoundingBox as a predictive


parameter replacing the Y-coordinate pixel value are shown below.

Figure 4.53: Trial 4 Measured LOO With Height_BoundingBox Path Approx. and Moving Target 1 (Trial 4 Lasso Regression with Height_BoundingBox replacing Y-coordinate pixel value, Path Approximation and Moving Target evaluation)

Figure 4.54: Trial 4 Measured LOO With Height_BoundingBox Path Approx. and Moving Target 2 (Trial 4 ElasticNet Regression with Height_BoundingBox replacing Y-coordinate pixel value, Path Approximation and Moving Target evaluation)

With both models, we found that the "Path Approximation" evaluation using x_coordinate and Height_BoundingBox as predictive parameters does not yield a predicted path that is quite as well aligned with the trial path as using just x_coordinate and y_coordinate. However, upon visual inspection, it is clear that the predicted path does follow the expected trial path, just with extra noise. In addition, the results of the "Moving Target" evaluation show that the mean error is only slightly increased, by 8 [in] and 5 [in], respectively, relative to using just x_coordinate and y_coordinate as predictive parameters.
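To make the comparison between predictive-parameter sets concrete, the sketch below runs the same Leave One Out error idea used earlier in this chapter over the three combinations discussed in this trial: (x, y), (x, Height), and (x, y, Height). The placeholder data, the Lasso regularization value, and the column layout are assumptions for illustration only.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import LeaveOneOut

def loo_error(X, Y, degree=3):
    # Leave One Out mean Euclidean error [in] for a degree-3 Lasso pipeline.
    errors = []
    for tr, te in LeaveOneOut().split(X):
        m = make_pipeline(PolynomialFeatures(degree),
                          MultiOutputRegressor(Lasso(alpha=1.0)))  # placeholder alpha
        m.fit(X[tr], Y[tr])
        errors.append(np.linalg.norm(m.predict(X[te]) - Y[te]))
    return float(np.mean(errors))

# Placeholder rows of [x_coordinate, y_coordinate, Height_BoundingBox] and measured (x, y) [in].
pixels = np.array([[120, 400, 210], [300, 380, 200], [480, 360, 190],
                   [200, 300, 150], [420, 290, 145], [310, 250, 120]], dtype=float)
real = np.array([[60, 120], [180, 150], [300, 180],
                 [120, 300], [280, 330], [200, 420]], dtype=float)

for name, cols in {'x,y': [0, 1], 'x,height': [0, 2], 'x,y,height': [0, 1, 2]}.items():
    print(name, loo_error(pixels[:, cols], real))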


4.5.5 Analysis

The primary goal of Trial 4 was to prove repeatability of the steps taken to build a model that would be able to translate pixel data into a real-world geo-location in our smart surveillance system. The secondary goal was to explore other methods of improving model accuracy for this translation process.

We succeeded in our primary goal. The success from Trial 3 was seen in Trial 4. We will use several of the models derived from this trial to build the PoC geo-tracking system.

Interestingly, our secondary goal was accomplished, but with a slight caveat: accuracy seemed to decrease from the previous trial. This is not expected, as our intuition leads us to believe that more training points would lead to greater accuracy. On the other hand, we were able to develop a successful solution to our "depth" problem by utilizing Height_BoundingBox instead of Area_BoundingBox as a predictive parameter. Using all three attributes, x_coordinate, y_coordinate, and Height_BoundingBox, yields worse results than using the pair of x_coordinate and Height_BoundingBox; that pair's accuracy, in turn, is lower than the pair of x_coordinate and y_coordinate.


5 Making the Geo-Tracking System

The PoC design of the user interface portion of the system took into consideration the necessities required by the surveillance operators described in Chapter 1. Therefore, from the beginning, we decided to design the system to have both an "instantaneous" geo-tracking capability and a "historical" geo-tracking capability.

A system such as this is easily monetized, and source code is likely proprietary, meaning that our initial search yielded very little in the way of a dynamic tool that would be able to plot our desired geo-location points as we needed. So, we decided to build the system ourselves. We built the system in steps, slowly incorporating additional functionality and modifications with each iteration.

The surveillance system itself was built before the process of pixel-to-real mapping was developed. This was with the assumption that GPS would be a viable mapping option for the final system.

First Iteration

Figure 5.1: Spatial Tracking Surveillance System First Iteration


The first iteration of the system used a simple plotting library to visualize the two important concepts of our smart surveillance system: instantaneous and historical positioning. We displayed several target points randomly moving to new locations within each one's respective immediate surroundings every 0.5 [s]. This represented the movement of several target objects. We maintained the history of each point's previous movements within the scene. Putting these two plots together allowed us to visualize a rudimentary smart surveillance system that is both able to track a target's current position and display a history of past movement.

Second Iteration

Figure 5.2: Spatial Tracking Surveillance System Second Iteration

The second iteration moved from concept to a real user interface. We decided to use the Python "Bokeh" library for our visualization purposes. Built into the library is a tool that allows the plot to be rendered as HTML code. Thus, we are able to visualize the plot in an internet browser.

As a design choice, we decided to make the back-end data-handling portion of the system an HTTP server. Using an HTTP server, the system idly waits for new data to


be sent from the client surveillance cameras. The server accepts POST requests in the JSON form of:

{ "x": x_coor, "y": y_coor }

The POST request is created on the client-side camera (see Chapter 2, Procedure). The server appends this incoming data to the data structure containing the position information, or creates the data structure if it does not already exist. AJAX is used to asynchronously update the data to allow access by the user interface portion of the server without disrupting the page itself. Every 0.1 [s], this process checks whether new data has been sent, and, if so, fetches the new data. At the same time, the user interface portion of the server maintains variables containing both the historical and current position data. If new data has been fetched, both of these variables are updated.

A structural diagram of the architecture is shown below:

Figure 5.3: User Interface Server Architecture Diagram

Using the unique ability of Bokeh plots to be rendered as HTML, we recreate the plots with the new data and seamlessly reload the HTML being displayed in the browser.

Google's Google Maps API is used to show a real-world location via latitude and


longitude coordinates.

As mentioned previously, this system was created before the process of Chapter 2. We were unaware that GPS would not work as a viable mapping method. So, we created additional functionality that would allow us to interactively visualize the error of the model versus ground truth GPS data, both instantaneous and cumulative. This is shown below.

Figure 5.4: Spatial Tracking Surveillance System Second Iteration Error Visualization

Final Iteration

Figure 5.5: Spatial Tracking Surveillance System Final Iteration

The final iteration of the smart surveillance PoC geo-tracking system was created after completing the experimentation conducted in Chapter 2. Using what was already created, we stripped out the Google Maps API functionality from the user interface. Returning to the


first iteration of the system, we decided to display instantaneous and historical positions separately. Additionally, as a design choice, we decided to show the visual camera feed for coordination with the geo-tracking system. This way, a surveillance operator would not only be able to easily track a target's position, but also be able to assess his/her behavior.

The error visualization portion was removed, as it is impossible to accomplish without being able to check against an associated real-time ground truth.
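As a concrete illustration of the client-to-server protocol described above, the sketch below shows what a camera client might send for each new geo-location estimate: an HTTP POST whose x and y fields the user-interface server appends to its position history. The endpoint path and form encoding follow the server code listed in the Appendix; the coordinate values themselves are placeholders.

import requests

def send_position(x_real, y_real, server="http://localhost:5000/data"):
    # POST one geo-location estimate; the server appends it to the position history.
    requests.post(server, data={"x": x_real, "y": y_real})

# Placeholder position in inches.
send_position(210.0, 480.0)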


6 Discussion

6.1 Roadblocks and Issues

Ultimately, we were able to create a functional geo-tracking system that utilizes the basic components of most "off the shelf" object detection technologies. However, along the way, there were several roadblocks that led to less accurate results and unusable data/methods, forcing adaptation. This section will discuss some of the issues encountered throughout the process.

GPS

Originally, the system was intended to work with GPS coordinates. We would sample a target standing at a GPS location and, at the same time, record the target's position in the scene as (x, y) pixel values and either Height_BoundingBox or Area_BoundingBox. However, we quickly discovered that not only was the recorded GPS data extremely inaccurate, but the readily available GPS location provided by a cell phone is not updated in real time. As a result, we devised the scheme of utilizing measured or easily calculated position points as a ground truth on which to train our regression models.

Real Time

The original concept also proposed using real-time data collection to determine accuracy and error, as well as to build a model.

To build a model, we would constantly take GPS and object detection data, querying an NTP server to sync timestamps between the two. Using this method, we would have hundreds of points from which to build a model. Accuracy and error would be determined in a similar way: the predicted points would be evaluated against the ground truth real-


time GPS data. However, as a result of the non-real-time GPS issue, these both became impossible.

Measured Distances

Since the GPS data was not reliable, we were forced to use the "measured distance" approach. Unfortunately, this approach greatly limited the number of accurate data points we could collect because we always needed some constant reference distance. This is why we used the soccer fields and tennis courts.

Multiple Cameras

After the success of Trial 3, it was suggested that we utilize multiple cameras on the same scene to boost the accuracy of the system. So, we procured an additional camera and modified the data collection phase to account for multiple camera inputs. Unfortunately, the additional camera was not suited for the experiments. It failed to capture consistent image and video data. Object detection on the data from that camera was almost impossible.

6.2 Missed Opportunities and Future Work

The process of conducting this research and writing this thesis has taught us that a good project never really has a point where it is "complete." What was given in this thesis was "laying the groundwork." It provides the basis from which to build bigger, better, and more accurate systems. Given more resources and time, we could expand this project into much more than what it is. In addition to remarking on some of the missed opportunities, this section will discuss some of the adaptations, modifications, and improvements that can be built on top of what has already been accomplished. Many of the ideas described here can be considered independent research projects by themselves.

GPS

In the previous section, we explained the issue with using GPS. However, this does not mean that GPS technology is completely unusable. We've proven that a system can be


created with accurate and consistent ground-truth data. If a dedicated and accurate GPS could be used during the training and evaluation phases, we believe a much more accurate and easily configurable system could be created.

Multiple Cameras

Again, we have already discussed the missed opportunity of being able to build a system that uses multiple cameras. Future work would entail running the experiments and system on camera hardware that is identical.

Stitching together multiple scenes

Of all of the potential future works, this is the one we would be the most excited to design and build. The whole premise of the project is a "smart surveillance system." The implementation that we have built serves that purpose, but, currently, only for a single "scene." For a usable and robust surveillance system, we would need multiple overlapping cameras that would be able to survey multiple connected scenes in their entirety. For example, the security cameras of a park.

In theory, each camera client would have a specific tag, which would be used by the server to map the already collected location (camera) data to geo-location data for a specific region in the park. Of course, if GPS were an option, the re-mapping would be unnecessary.

Different hardware

Attaching a device dedicated to neural network calculations to each camera would greatly improve speed. Running the current implementation on a basic laptop yields an inference time of approximately 0.65 [s] on average. This means that the frame rate is not nearly high enough for true real-time object detection. To reduce runtime and approximate achievable real-time inference, we lowered the frame rate of the captured video used in the PoC.
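To illustrate the frame-rate reduction mentioned above, the sketch below processes only every Nth frame of a capture so that the effective processing interval roughly matches the ~0.65 [s] inference time observed on the laptop. The OpenCV capture loop, the assumed 30 fps source, and the placeholder file name are illustrative assumptions, not part of the PoC code.

import cv2

INFERENCE_TIME_S = 0.65   # approximate YOLOv3 inference time on a basic laptop
CAPTURE_FPS = 30.0        # assumed camera frame rate
SKIP = max(1, int(round(CAPTURE_FPS * INFERENCE_TIME_S)))  # process roughly every 20th frame

cap = cv2.VideoCapture("trial_path.mp4")  # placeholder video file name
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % SKIP == 0:
        # Run object detection and pixel-to-real translation on this frame only.
        pass
    frame_idx += 1
cap.release()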


Target Tracking

As mentioned in Chapter 1, this research focused on the geo-tracking aspect of the smart surveillance system. However, for a true smart surveillance system, the target tracking portion would need to be implemented as well. It would be necessary to be able to isolate a target object from background objects for the complete system to function as envisioned.

Different Targets

In creating the model for each trial, we only used one target with static dimensions, meaning the model was only trained on this data. Intuitively, this means that introducing another target with much different dimensions (much taller, for example) will output different results. Whether or not this yields a significant impact on model accuracy is unknown, however.

In addition, for each of our trials, the training data and model all involved the target standing straight up, i.e., no crouching, jumping, standing on an object, etc. Again, all of these aspects would affect the model and subsequent accuracy.

These both need to be explored.

Tuning Model Accuracy

One of the main issues with the experiments conducted during the creation of this system was the lack of dedicated hardware and a workspace. So, we focused our efforts on building the system itself, through proving the repeatability of the process and the PoC of the final smart surveillance system. We chose a regression model for pixel-to-real translation based on highest accuracy and lowest mean error from the LOO method. However, this is not to say that the accuracy of the model itself was as tuned and accurate as it could possibly be. Being able to tune accuracy requires repeated trials with the same configuration in the same static camera location. This was not possible given our circumstances.

Posterior Smoothing

One of the notable traits encountered in even our best models from Trial 3 and Trial 4 was the noise that was a part of our predicted data, meaning the points tended to "jump


around"erratically.Thiscanclearlybeseenonmanyofthe"PathApproximation" evaluations.Thenoiseisaresultoftheboundingboxofthedetectedtargetconstantly shiftinginsize. Thewholeobjectdetectiontechniqueiscenteredaroundtheboundingboxofthe detectedtarget.Thex coordinate ,y coordinate ,andHeight BoundingBox areareallderivedfrom theboundingbox.Themodelistrainedonasmallsampleofdata:onesetofstaticpixel databoundingboxlocationanddimensionsiscapturedateachdesignatedlocation. Withtheboundingboxconstantlyshiftinginsize,thepixel-to-realtranslationwillgive dieringresults,evenforatargetinthesameapproximatelocation.Thechosenmodels fromourtrialsperformedadmirablyandallowedforthecreationofthePoC.However, therewasstillagreatdealofobviousdiscrepancybetweenthetrialpathandthepredicted pathduetonoise. Onehypothesisonhowtocorrectthisistomaketheboundingboxmoreconsistent throughoutthetargetdetectionprocess.However,thismayover-complicatethedetection process,furtherexacerbatingour"depth"problem. Anotherhypothesisisrelatedto"posteriorsmoothing."Posteriorsmoothing,aswe envisionit,ispredictingtheplacementofapointbasedonsomecombinationofx coordinate , y coordinate ,Height BoundingBox ,andthepriordatafromthepreviousseveralpoints. Point Real = f x pixel ;y pixel ;Height BoundingBox + f Point Real )]TJ/F28 7.9701 Tf 6.587 0 Td [(1 ;Point Real )]TJ/F28 7.9701 Tf 6.586 0 Td [(2






Appendix

Experiments

The experimental data and analysis for each trial can be found here: GITHUBLINK

The above figure shows the experimental process used for each trial in the configuration of the client cameras. The figure below gives a summary of the experimental analysis for each of the four trials.


Code

The following code is what was used to collect the GPS data on a Samsung Galaxy S5 (run under QPython, using the androidhelper scripting API):

import sys
import time
import androidhelper as android

# Count how many coordinate rows have already been written.
count = 0
with open("/storage/emulated/0/qpython/test.csv", "r") as f:
    for line in f:
        count += 1

def write_coordinates(droid):
    # Poll the device for a GPS/network fix and append it to the CSV log.
    global count
    droid.eventClearBuffer()
    droid.makeToast("Reading GPS data")
    droid.startLocating()
    print("reading GPS ...")
    time.sleep(1)
    loc = droid.readLocation().result
    # Keep polling until the API returns a non-empty location dict.
    while loc == {}:
        loc = droid.readLocation().result
    # for i in loc:
    #     print(loc)
    print(loc)
    if loc is not None:
        coor = ""
        if loc['network']:
            coor += " "
            coor += "network,%s,%s" % (loc['network']['latitude'], loc['network']['longitude'])
        if loc['gps']:
            coor += " "
            coor += "gps,%s,%s" % (loc['gps']['latitude'], loc['gps']['longitude'])
        coor += "\n"
        f = open("/storage/emulated/0/qpython/test.csv", "a")
        f.write(coor)
        f.close()
        print(count)
        print(coor)
        count += 1
    droid.stopLocating()

if __name__ == '__main__':
    droid = android.Android()
    inp = "y"
    while inp != "q":
        write_coordinates(droid)
        inp = input("Next?")


The following code is what is run on the user interface server:

from flask import Flask, jsonify
from jinja2 import Template
import math
from six import string_types
import numpy as np
from flask import Flask, jsonify, make_response, request, current_app
from datetime import timedelta
from functools import update_wrapper, wraps
import time
import random
from bokeh.plotting import ColumnDataSource
from bokeh.plotting import figure
from bokeh.models import AjaxDataSource, GMapOptions
from bokeh.embed import components
from bokeh.resources import INLINE
from bokeh.util.string import encode_utf8
from bokeh.plotting import gmap
from bokeh.sampledata import us_states
from bokeh.models import Range1d
from bokeh.layouts import row

app = Flask(__name__)

# Global Variables
source = ColumnDataSource({'x_GPS': [], 'y_GPS': [], 'x_CAM': [], 'y_CAM': []})
x1 = []
y1 = []

### DON'T TOUCH THIS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def crossdomain(origin=None, methods=None, headers=None,
                max_age=21600, attach_to_all=True,
                automatic_options=True):
    """
    Decorator to set cross-domain (CORS) configuration on a Flask view.
    For more details about it refer to: http://flask.pocoo.org/snippets/56/
    """
    if methods is not None:
        methods = ', '.join(sorted(x.upper() for x in methods))
    if headers is not None and not isinstance(headers, string_types):
        headers = ', '.join(x.upper() for x in headers)
    if not isinstance(origin, string_types):
        origin = ', '.join(origin)
    if isinstance(max_age, timedelta):
        max_age = max_age.total_seconds()

    def get_methods():
        options_resp = current_app.make_default_options_response()
        return options_resp.headers['allow']

    def decorator(f):
        @wraps(f)
        def wrapped_function(*args, **kwargs):
            if automatic_options and request.method == 'OPTIONS':
                resp = current_app.make_default_options_response()
            else:
                resp = make_response(f(*args, **kwargs))
            if not attach_to_all and request.method != 'OPTIONS':


                return resp

            h = resp.headers
            h['Access-Control-Allow-Origin'] = origin
            h['Access-Control-Allow-Methods'] = get_methods()
            h['Access-Control-Max-Age'] = str(max_age)
            requested_headers = request.headers.get('Access-Control-Request-Headers')
            if headers is not None:
                h['Access-Control-Allow-Headers'] = headers
            elif requested_headers:
                h['Access-Control-Allow-Headers'] = requested_headers
            return resp

        f.provide_automatic_options = False
        return update_wrapper(wrapped_function, f)
    return decorator
### DON'T TOUCH THIS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@app.route('/data', methods=['POST', 'GET', 'OPTIONS'])
@crossdomain(origin="*", methods=['GET', 'POST'], headers=None)
def update():
    # Receive new (x, y) position data from a camera client and return the
    # full history plus the most recent point for the AJAX data sources.
    global x1, y1
    if request.method == 'POST':
        x = request.form.get("x")
        y = request.form.get("y")
        if x is not None and y is not None:
            x1.append(float(x))
            y1.append(float(y))
    if x1:
        return jsonify(x_pos=x1[:], y_pos=y1[:], x_inst=[x1[-1]], y_inst=[y1[-1]])
    # Guard against requests that arrive before any position data has been posted.
    return jsonify(x_pos=[], y_pos=[], x_inst=[], y_inst=[])

template = Template('''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Tennis Court Camera</title>
{{ js_resources }}
{{ css_resources }}
</head>
<body>
{{ plot_div }}
{{ plot_script }}
</body>
</html>
''')

@app.route("/", methods=["POST", "GET"])
def user_interface():
    streaming = True
    global x1, y1
    source = None
    source = AjaxDataSource(data_url="http://localhost:5000/data",
                            polling_interval=200, mode='append',
                            if_modified=True, max_size=30)
    inst_source = AjaxDataSource(data_url="http://localhost:5000/data",
                                 polling_interval=150, mode='replace')
    source.data = dict(x_pos=[], y_pos=[], x_inst=[], y_inst=[])

    left, right, bottom, top = 0, 420, 0, 1000
    fig1 = figure(plot_width=600, plot_height=600, title="History")
    fig1.x_range = Range1d(left, right)
    fig1.y_range = Range1d(bottom, top)
    fig1.axis.visible = False


    fig1.toolbar.logo = None
    fig1.toolbar_location = None
    fig1.circle('x_pos', 'y_pos', fill_color='red', fill_alpha=0.7, source=source, size=15)

    fig2 = figure(plot_width=600, plot_height=600, title="Current Position")
    fig2.x_range = Range1d(left, right)
    fig2.y_range = Range1d(bottom, top)
    fig2.axis.visible = False
    fig2.toolbar.logo = None
    fig2.toolbar_location = None
    fig2.circle('x_inst', 'y_inst', fill_color='blue', fill_alpha=0.7, source=inst_source, size=15)

    js_resources = INLINE.render_js()
    css_resources = INLINE.render_css()
    script, div = components(row(fig1, fig2), INLINE)
    html = template.render(
        plot_script=script,
        plot_div=div,
        js_resources=js_resources,
        css_resources=css_resources)
    # return str(x1)
    return encode_utf8(html)

app.run()


The following code was added to the YOLOv3 'object_detection.py' source code by Joseph Redmon to allow our camera to obtain pixel data, translate it to a geo-location position, and transmit it to the user-interface server (server address: localhost:5000):

import requests
from joblib import load

# Load fitted model
model = load('cam_regr_model.joblib')

...

# In the postprocess function, after determining bounding box location and dimensions...
# Class id 0 is the 'person' class.
if classIds[i] == 0:
    # Obtain the bounding box center point (height is already available here).
    x, y = (left + (left + width)) / 2, (top + (top + height)) / 2

    # Translate pixel-to-real: (x, height) -> geo-location.
    # The fitted scikit-learn model expects a 2-D feature array.
    X_real, Y_real = model.predict([[x, height]])[0]
    # Alternative configuration: (x, y) -> geo-location.
    # X_real, Y_real = model.predict([[x, y]])[0]

    # Send a POST request to the user interface server's /data endpoint.
    requests.post("http://localhost:5000/data", data={'x': X_real, 'y': Y_real})