Citation
Interactive full-body motion capture using infrared sensor network

Material Information

Title:
Interactive full-body motion capture using infrared sensor network
Creator:
Duong, Son Trong
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file.

Subjects

Subjects / Keywords:
Computer animation ( lcsh )
Computer animation ( fast )
Genre:
non-fiction ( marcgt )
Abstract:
Traditional motion capture (mocap) has been well studied in visual science for a long time, and more techniques are introduced each year to improve the quality of mocap data. Until a few years ago, however, the field was mostly concerned with capturing precise animation to be used, after post-processing, in applications such as studying biomechanics or rigging models for film. These data sets are normally captured in complex laboratory environments with sophisticated equipment, making motion capture a field that is largely exclusive to professional animators. In addition, obtrusive sensors must be attached to actors and calibrated within the capturing system, resulting in limited and unnatural motion. In recent years, the rise of computer vision and interactive entertainment has opened the gate for a different type of motion capture, one that requires no markers or mechanical sensors. Furthermore, a wide array of low-cost devices with primitive and limited functions has been released that are easy to use for less mission-critical applications. Besides the traditional problems of markerless systems, such as data synchronization and occlusion, these devices have other limitations such as low resolution, excessive signal noise, and a narrow tracking range. In this thesis I describe a new technique that processes data from multiple infrared sensors to enhance the flexibility and accuracy of markerless mocap. The method analyzes each individual sensor's data, then decomposes and rebuilds it into a uniform skeleton shared across all sensors. We then assign criteria that define the confidence level of the signal captured by each sensor. Each sensor operates in its own process and communicates through MPI. After each sensor provides its data to the main process, we synchronize the data from all sensors into the same coordinate space. Finally, we rebuild the final skeleton representation by picking the combination of the most confident information. Our method emphasizes minimal calculation overhead for better real-time performance while maintaining good scalability. The specific contributions of this thesis are: first, the technique offers a more accurate and precise mocap by ensuring that every involved joint is properly tracked by at least one sensor at all times; second, the method alleviates intrinsic shortfalls of the device such as noise and occlusion; third, it provides flexibility beyond the geometric range limitation of a single sensor, allowing greater freedom of movement for the actor; and finally, it does not require lengthy calibration and pre-processing procedures, making the setup straightforward and easy to deploy in many applications.

Notes

Thesis:
Thesis (M.S.)--University of Colorado Denver. Computer science and engineering
Bibliography:
Includes bibliographic references.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Son Trong Duong.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
861795238 ( OCLC )
ocn861795238



Full Text
INTERACTIVE FULL-BODY MOTION CAPTURE USING
INFRARED SENSOR NETWORK.
By
SON TRONG DUONG
B.S., Computer Science and Engineering, University of Colorado, 2008
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science and Engineering
2012


This thesis for the Master of Science degree by
Son Trong Duong
has been approved for the
Computer Science and Engineering program
by
Min-Hyung Choi, Chair
Gita Alaghband
Ellen Gethner
Bryan Leister
November 12th, 2012


Son Trong Duong (M.S., Computer Science and Engineering).
Interactive Full-Body Motion Capture Using Infrared Sensor Network.
Thesis directed by Associate Professor Min-Hyung Choi.
ABSTRACT
Traditional motion capture (mocap) has been well studied in visual science for a long time, and more techniques are introduced each year to improve the quality of mocap data. Until a few years ago, however, the field was mostly concerned with capturing precise animation to be used, after post-processing, in applications such as studying biomechanics or rigging models for film. These data sets are normally captured in complex laboratory environments with sophisticated equipment, making motion capture a field that is largely exclusive to professional animators. In addition, obtrusive sensors must be attached to actors and calibrated within the capturing system, resulting in limited and unnatural motion. In recent years, the rise of computer vision and interactive entertainment has opened the gate for a different type of motion capture, one that requires no markers or mechanical sensors. Furthermore, a wide array of low-cost devices with primitive and limited functions has been released that are easy to use for less mission-critical applications. Besides the traditional problems of markerless systems, such as data synchronization and occlusion, these devices have other limitations such as low resolution, excessive signal noise, and a narrow tracking range. In this thesis I describe a new technique that processes data from multiple infrared sensors to enhance the flexibility and accuracy of markerless mocap. The method analyzes each individual sensor's data, then decomposes and rebuilds it into a uniform skeleton shared across all sensors. We then assign criteria that define the confidence level of the signal captured by each sensor. Each sensor operates in its own process and communicates through MPI. After each sensor provides its data to the main process, we synchronize the data from all sensors into the same coordinate space. Finally, we rebuild the final skeleton representation by picking the combination of the most confident information. Our method emphasizes minimal calculation overhead for better real-time performance while maintaining good scalability. The specific contributions of this thesis are: first, the technique offers a more accurate and precise mocap by ensuring that every involved joint is properly tracked by at least one sensor at all times; second, the method alleviates intrinsic shortfalls of the device such as noise and occlusion; third, it provides flexibility beyond the geometric range limitation of a single sensor, allowing greater freedom of movement for the actor; and finally, it does not require lengthy calibration and pre-processing procedures, making the setup straightforward and easy to deploy in many applications.
The form and content of this abstract are approved. I recommend its publication.
Approved by: Min-Hyung Choi


DEDICATION
I dedicate this work to my parents, Khoat and Ngoc Anh. Throughout their lives they have given up a lot to support me on my academic path. It is through their many sacrifices that I have achieved what I have today.


TABLE OF CONTENTS

I. INTRODUCTION ------------------------------------------------------ 1
     Overview ------------------------------------------------------- 1
     The Kinect Device ---------------------------------------------- 2
     Motivation ----------------------------------------------------- 2
II. RELATED WORKS ----------------------------------------------------- 4
     Markerless Motion Capture -------------------------------------- 4
     Motion Analysis ------------------------------------------------ 6
     Occlusion And Collision ---------------------------------------- 7
     Calibration And Setup ------------------------------------------ 8
     Unsynchronized Camera Tracking --------------------------------- 10
III. BASE CONCEPT AND DESIGN ------------------------------------------ 12
     Initial Concept ------------------------------------------------ 12
     Device Limitation ---------------------------------------------- 13
     MPI.NET -------------------------------------------------------- 14
     The Kinect SDK ------------------------------------------------- 15
     Accuracy Versus Speed ------------------------------------------ 16
     High-Level Concept --------------------------------------------- 16
          Data Structure -------------------------------------------- 16
          Synchronized Sensor Space --------------------------------- 17
          Ranking System -------------------------------------------- 18
IV. METHOD AND IMPLEMENTATION ----------------------------------------- 19
     Joint Info Processing ------------------------------------------ 20
     Smooth Filter -------------------------------------------------- 20
     Building The Angle Set ----------------------------------------- 21
     Skeleton Reconstruction ---------------------------------------- 24
     Data Synchronization ------------------------------------------- 28
     Data Calibration ----------------------------------------------- 29
     Merging The Final Skeleton ------------------------------------- 32
V. EXPERIMENTS AND RESULTS -------------------------------------------- 38
     Calibration Test ----------------------------------------------- 38
     Construction Test ---------------------------------------------- 42
     Modeling Test -------------------------------------------------- 44
VI. DISCUSSION -------------------------------------------------------- 46
VII. CONCLUSION ------------------------------------------------------- 49
REFERENCES ------------------------------------------------------------ 50


LIST OF FIGURES
FIGURE
II.1: Synthetic and real data ------------------------------------------ 4
II.2: Visual hull ------------------------------------------------------ 6
II.3: Kinect decision forest tree -------------------------------------- 7
II.4: Infrared signal interference ------------------------------------- 9
III.1: Basic workflow --------------------------------------------------- 12
IV.1: Updated workflow -------------------------------------------------- 19
IV.2: Joint structure --------------------------------------------------- 23
IV.3: Multiple occlusions ----------------------------------------------- 24
IV.4: The rotation model ------------------------------------------------ 25
IV.5: Displacement between two skeletons -------------------------------- 31
IV.6: Skeleton space synchronization ------------------------------------ 33
IV.8: Lead token sensor placement --------------------------------------- 36
V.3: Calibration sample ------------------------------------------------- 41
V.4: Skeleton merging --------------------------------------------------- 43
V.5: Skinned model demonstration ---------------------------------------- 44


LIST OF TABLES
TABLE
III.2: High-level data structure ---------------------------------------- 16
IV.7: Joint ranking system ---------------------------------------------- 35
IV.9: Joint selection --------------------------------------------------- 37
V.1: 600-frame calibration error ---------------------------------------- 39
V.2: Average error over 100 frames -------------------------------------- 39


CHAPTER I
INTRODUCTION
Overview
Technologies and methods that produce high-quality mocap data have been available for quite some time. However, they are usually out of reach of mainstream users because of both their cost and their complexity. Traditional motion capture methods, such as optical or magnetic systems, require a cumbersome setup of markers and/or sensors on the actor. There has also been research on markerless motion capture, but it likewise requires complex setup and calibration in a laboratory environment. Another drawback is that the price points of these systems are usually quite high and the data itself needs a lot of post-processing. These traditional methods can yield very accurate data, but the data can only be saved and replayed in a simulation for modeling or study, since the methods are not viable for real-time recording. In recent years the entertainment industry has started introducing cheap and simple devices that bring some basic motion capture technology to the mainstream market. The Wiimote and PlayStation Move carry tracking sensors that record hand movements, determine the motion (or pose) the user is performing, and apply that motion in the application. Later, combined with computer vision technology, came devices like the PlayStation Eye and the Microsoft Kinect. These devices employ an infrared sensor to track the actor's position and allow some virtual interaction with on-screen objects. While they do not measure up to traditional motion capture systems in terms of accuracy and detail, all of these devices are meant for interactive use, so the motion affects their application in real time.


The Kinect Device
One of the first and most advanced motion sensors released to mainstream users. It
comes with two sensors, an infrared and a RGB sensor, this research only focuses on
the use of the infrared sensor although we might explore the potential of the RGB
sensor in the future. The infrared sensor (depth sensor) has an effective range from
0.8m to 4m from the sensor, with an FOV of 47 vertical and 57 horizontal and it
operates at 30 frames per second [5], The reason it is chosen for this research is
because it already has an active support platform to work with the device output
(work on this research has begun since the beta) and thus let us delves into our
research topic more readily. The RGB sensor while not mentioned in this paper still
has potential benefits for future research and it will be more convenience to receive
the data stream from the same device. However the theories presented here is based
on receiving and processing data from any infrared device, and should still work with
a different set up, like the Playstation Eye with iPisofts Markerless Motion Capture
suit.
Motivation
As mentioned above, for various reasons the motion capture scene has until now been more or less exclusive to industry professionals. The release of consumer-priced sensors opened a floodgate of new opportunities to a much larger user base. Since its inception there have been various works, from fun community hacks such as using the Kinect to handle computer interaction, to monitoring patients [41], to testing its use in robotics and surgery [42]. However, as one would expect from a device at such a low price, it comes with many drawbacks, such as inconsistent data, noise, very limited actor orientation, and sometimes simple guesswork (inferred joints). This research is meant to address and overcome some of these disadvantages while remaining simple and inexpensive enough that it can still be used interactively and at a cost that is not prohibitive for mainstream users. Specifically, these are the limitations that this research seeks to address, correct, or improve:
- Limited actor orientation: the sensor can only detect an actor facing it directly.
- Inaccurate joint tracking: due to either occlusion or orientation.
- Conflicting and corrupted data: due to signal interference.
- Scalability: the method should be able to track up to 360° of viewpoints for the full body.


CHAPTER II
RELATED WORKS
Markerless Motion Capture
The idea of markerless motion capture has been an active area of work for quite some time. While cumbersome, a traditional marker-based motion capture model can reliably relay the joint positions on the active model thanks to the attached markers, which can be easily identified by a few parameters. A markerless motion capture process does not have this data and thus has to construct the joint information based on visual cues. A comparison of the two methods, based on video images, was given by Lanshammar and Persson [40]. Some popular approaches to constructing the data include using depth images [1,7,19], RGB image segmentation [5,8], or both [11].
Figure II.1: Synthetic and real data in Kinect. The depth sensor returns the depth map of the actor; a texture map with body-part labelling (color-coded) is then applied to it to define the joint regions. Figure taken from MS Research Cambridge [1].
The depth image method first involves generating an intensity map based on the depth information of each pixel. Data calibration, such as near and far planes, can be used to remove the environment background so that only the actor's information is constructed. The next step is to define a region for each key body part. This can be done either by texture map targeting or by establishing local planars [44].


In other methods, a visual hull of the actor is typically reconstructed into a workable model [4,17,36]. A visual hull can be described as a 3D approximation of the object's shape, in this case a human actor. The silhouette of the actor for each camera is calculated from a background subtraction using the depth image generated by an infrared sensor. A hull of the human model is then constructed to represent the actor in 3D. There are several methods for constructing one, such as building it from segmented body parts; one approach is to use super-quadric shapes with a joint hierarchy holding them together. Our main interest in this research is the accuracy of the joint hierarchy itself rather than the model presentation. Once we have the correct joint data, we can use it to rig any general human mesh for demonstration purposes. In cases where an exact model presentation is needed, a full body scan of the actor is required. S. Corazza et al. provided another method, constructing a visual hull based on a point cloud system with embedded kinematic constraints; while this method produces a very accurate model, it is computationally intensive [2,8].
Figure II.2: Visual hull reconstruction concept. The intersection of the sensors' views is used to approximate the actor model. Figure from Stanford University [2].


Motion Analysis
Upon reconstructing the model, the next step in markerless motion capture is to recognize the actor's motion based on the visual cues. The methods vary depending on the needs of the application. In the case of a single sensor, the interest lies more in detecting a specific set of poses for interaction; this case usually treats the depth image as a 2D image [13,26,28]. To analyze real 3D motion, the process usually involves a set of training data [21,24,29], typically built by training on a wide variety of human poses [30,33,34]. In general, markerless mocap is about understanding a set of poses and interpreting the motion in between. The quality of the final animation depends on the original training set (synthetic images) and the decision tree [1,38,39]. A synthetic image is created from a random set of parameters, including a depth image and 3D textured meshes. The Kinect's pose classifier is trained with 3 trees of depth 20 from 1 million synthetic images; the process takes about 24 hours on a 1000-core cluster.
Figure II.3: The Kinect decision forest used to classify a pose. For a pixel x in image I and body part c, the T trees of the forest are averaged as P(c|I,x) = (1/T) Σ_{t=1..T} Pt(c|I,x). Figure taken from MS Research Cambridge [1].


Occlusion And Collision
One of the most significant disadvantages of markerless motion capture is occlusion. Since the data comes from the view of a sensor rather than being generated on the actor, it is quite common that certain joints will not be seen by a sensor in certain poses. A common way to deal with this issue on a single sensor is an inference or prediction system: using the positions of the other visible joints to guess where the occluded joint is. This can be quite confusing not only for the occluded joint but also for the joint in front, since the algorithm may fail to separate the two joints. Vladimir Dochev, Tzvetomir Vassilev, and Bernhard Spanlang proposed a method that uses multiple layers of depth map to detect occlusion [3]. It divides the depth map into several layers and separates the body parts based on their depth level. Two depth maps of the actor are taken, from the front and the back, then decomposed and labelled into different body parts as described in the section above. Each body part is then given a bounding box (BB) volume. The BB volumes are used to detect collisions between body parts, which in turn determine which section of the depth map needs to be divided into layers. The collision test is evaluated per bounding box, comparing the front and back depth-map extents of each body part in x and y, scaled by the map resolution (the exact expression is given in [3]).
This method can be used with any infrared sensor. In our prototype, however, the collision approach is generally too expensive for an interactive application. Another approach uses an image-space interference test to detect collisions, as described in [26], but it is also too expensive, processing-wise, for our purpose.


Sensor Calibration And Setup
Since most markerless motion capture systems require quite a few sensors to record full-body motion, the sensors have to be calibrated so that data from each device can be combined. A typical problem of markerless systems is noise. Noise is an inherent issue with this model, even among high-end systems; it can come from air and lighting conditions and the surface properties of the tracked object, among other things. Professional-grade sensors, however, do have features that address other sources of noise. For example, within the same system each sensor will most likely have a unique identifier and operate on a different infrared frequency, so signal interference between sensors is not a concern. This is a problem with consumer-grade sensors (especially mass-produced sensors like the Kinect): when working with multiple devices, all the infrared signals from each sensor operate on the same frequency. This means that when more than one sensor is pointed in the same direction, they will create some noise and interference. An illustration of this problem is shown in Figure II.4.
An infrared sensor shoots a beam at the object and waits for a reflected signal; from this wait time the depth (distance from the sensor) of a particular point is constructed. Due to scattering when a beam hits a surface, it is quite common for the signal to be reflected in directions other than back toward the original sensor, and it is quite possible for another sensor to pick this signal up. In professional systems this is not an issue, since each sensor only picks up the signal of its corresponding frequency while ignoring signals from other sensors. But since all of our sensors operate on the same frequency, they can interfere with one another. Our experience showed that the level of interference depends on the density and orientation of the sensors. For example, two sensors with a parallel orientation will suffer a lot more interference compared to two sensors that are perpendicular to each other. We find that with enough spread (about 4 m apart) and wide enough angles (around 60-120 degrees) the interference does not have a serious effect when working on the skeleton data, and there are suggestions for correcting and improving the quality of the depth image. One way to improve the depth image is to correct the noise issue; one method uses Kalman filters to reduce fluctuation through adaptive hole-filling using spatial-temporal information [9,10,11]. The depth data is the base from which all of our calculations are derived, so ideally we would want a smooth and stable data set to work with, but each filter comes with an overhead cost that needs to be considered for speed. Kai Berger et al. provided a comprehensive method of setting up and calibrating four infrared sensors [6] using Bouguet's checkerboard tool under different lighting conditions [22]. This method calibrates and aligns the depth sensors using an RGB camera: a checker pattern is printed on aluminum foil and used as the point of reference, providing a unique, predictable IR pattern for the calibration. We found that the matching-silhouette approach using the depth map, while it might provide a more accurate motion capture (because the skeleton data is extracted from it), also has some limitations. It requires an elaborate calibration process to align each sensor, and this is only valid at a specific point in space; the displacement of an actor can result in the data from each sensor becoming out of phase.
Unsynchronized Camera Tracking
There has been some research in this direction. Technically, this is the problem of tracking an object with a single moving camera, as described in [4]. It is mainly used for scene reconstruction rather than motion capture, because a single camera cannot cover all angles at the same time. Unsynchronized multi-camera animation tracking, like the approach proposed by N. Hasler et al. [20], treats the multi-camera case as an extension of the single-moving-camera case for the calibration process. In both cases, the main difficulty lies in identifying the sensor position in 3D space relative to the scene or the tracked object. A common technique is the Structure-from-Motion (SfM) approach. This technique uses continuous 2D images taken from a moving sensor and identifies common features between images to reconstruct the scene in 3D. First, each camera is calibrated using a feature-based method against a static background; in the case of an infrared sensor this is the distance from the feature point P to the sensor. In each frame, the 2D image information of P is recorded and matched with the same point in 3D space. As the camera moves around, a set of consecutive frames is compared and the 3D representation of P is created, usually by triangulation. In the case of multiple sensors, the authors of the method proposed in [20] extend the single-sensor case: instead of moving one sensor and taking frames continuously, multiple sensors take several frames at the same time. Each sensor calculates its own set of SfM data as described above, based on its own projection matrix. The theory of SfM reconstruction in this case is: given a consistent structure in each image In, the point P(x,y) in each image can be calculated by the projection matrix Pn. The method used in this research is similar to the pairwise matching described in [45]; we further enhance this technique by constructing matching planes between the coordinate spaces.


CHAPTER III
BASE CONCEPT AND DESIGN
Initial Concept
Our approach to markerless motion tracking can be summarized as follows:
- Process the data from each sensor and reconstruct a more accurate, kinematics-based data set.
- Construct a uniform skeleton model on each sensor.
- Use a smart calibration system to reduce complexity.
- Establish a ranking system to merge the data from all sensors, as well as providing tools for further optimization.
- Maintain good flexibility and scalability.
Figure III.1: Basic workflow. The data from each sensor (Sensor 1 through Sensor n) is decomposed and then reconstructed before being merged.


Device Limitation
While it is a great device for the price, a consumer-grade device does have quite a few limitations, especially when operating alone:
- As mentioned above, the skeleton data is mainly built as a pose-recognition feature in general markerless animation capture. The device calibrates itself once it believes it has the full body of the actor in view. However, this calibration pose only works well when the actor faces the device directly. Through experimentation we find that the data is generally reliable when the actor does not turn more than 30° to either side of the sensor (giving an effective arc of 60°); past this point the data becomes unreliable, even for joints that are reported as properly tracked. This is not entirely an issue of occlusion (although that is part of the problem); the main problem is that without an appropriate calibration pose, the skeleton tracking is simply confused about the full-body presentation. While it might still be able to track certain parts of the body (we found that the leg area can still be tracked correctly even at 90°), in general it is too unreliable for the entire body.
- As a single device, the Kinect also shares the common problem of markerless motion capture when it comes to occlusion. The Kinect SDK deals with this using inferred joints: when the sensor cannot see a joint, it tries to guess that joint's position based on the positions of the other joints. We find that while this works for simple poses, such as when an arm is stretched out of the sensor's range, it does not work for more complex occlusions, such as when the arm points directly at the device (i.e., the hand joint will overlap the wrist, elbow and shoulder joints at the same time).


- Noise: another typical problem of markerless motion capture is noise in the signal (motion capture with markers also suffers from noise, although to a lesser extent). It is even more of a problem with cheap devices: mass-produced devices like the Kinect do not have unique infrared frequencies, so they create even more noise through interference.
MPI.NET
The Kinect driver at the time of this writing has another limitation: while it is possible for one computer to connect and run several Kinects at the same time, skeleton tracking can only be activated on the first device on that computer. We have tried using file I/O to transfer data between separate processes, but this proved to be very slow even between only two processes. Other considerations we have taken into account in this research are:
- Each sensor has to perform a fair amount of data transformation natural to any animation application.
- Since our focus is real-time performance, each device needs to maintain its own data stream. We observe that, when active, each sensor can use 60% to 80% of the bandwidth of a USB 2.0 control hub.
- We need a good and reliable method of synchronization between devices.
To address these points:
- Each sensor runs in its own process, in parallel.
- We assign each sensor to a separate USB hub.
- We use a thread-safe mechanism to ensure all sensors are on the same frame.
These considerations led us to use MPI to handle the data exchange between our processes. In particular, MPI.NET is an MPI wrapper for C# that allows us to launch multiple processes, each of which starts and handles its own sensor device. We found the messaging overhead between processes acceptable, and it can be optimized by reducing the complexity of the messages to the bare minimum. Also, since MPI handles communication between processes on the same computer and communication between nodes in a network with few differences, using MPI implies greater scaling flexibility should we want to use more devices. Another useful feature of MPI is its barrier synchronization routine, which we use to synchronize our sensors. Assuming we have a powerful enough computer or network, each sensor should run at the same steady frame rate; however, some data lag might appear during the gathering process. There are several ways of making sure we are processing the same set of frames from all sensors, such as time stamps or a sound cue from each sensor. One final advantage that made us choose MPI over other means of communication is its scalability: there is very little difference between handling processes on one computer and processes across a network, which allows us to add more sensors and computing nodes as required by our accuracy needs and available processing power, without modifying the base code too much.
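For illustration, a minimal sketch of this process layout is shown below. It uses MPI.NET's Environment, Communicator, Barrier, and Gather calls, which appear again in Chapter IV; the per-sensor payload is just a placeholder array and the merge step is a stub, not the exact classes of our prototype.

using MPI; // MPI.NET

class SensorNode
{
    static void Main(string[] args)
    {
        // One MPI process per sensor; rank 0 also acts as the main (merging) process.
        using (new MPI.Environment(ref args))
        {
            Intracommunicator comm = Communicator.world;

            comm.Barrier();                     // all sensors start together
            // sensor start-up would go here (one Kinect per process / USB hub)
            comm.Barrier();

            for (int frame = 0; frame < 600; frame++)
            {
                // Placeholder payload: 20 joints x 3 coordinates from this sensor.
                double[] skeleton = new double[60];

                comm.Barrier();
                double[][] all = comm.Gather(skeleton, 0);  // collect on rank 0
                comm.Barrier();

                if (comm.Rank == 0)
                {
                    // all[i] holds the skeleton from sensor i; the merging step
                    // described in Chapter IV would run here.
                }
            }
        }
    }
}

Each rank would be launched with mpiexec, one process per sensor, with rank 0 doubling as the main process.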
The Kinect SDK
This is an API provided by Microsoft for using the Kinect on Windows machines. It gives the user access to the raw data from the sensor, specifically the depth image stream and the RGB image stream. Using the depth image, a 20-joint skeleton model can be derived; this is what we are interested in for this research. In particular, we are interested in the position and tracking state of each joint.


Accuracy Versus Speed
Before we present the core of our method, we need to mention the balance between accuracy and speed. A single infrared/RGB device is neither speedy nor accurate by average animation-capture standards. In particular, most consumer-grade devices, even when working in optimal conditions, are capped at 30 frames per second (Kinect, PlayStation Eye). This means we cannot make the device any faster, so the goal of our research is to produce a better result from the raw data. But another goal, as mentioned, is to maintain the interactive element. For academic purposes we will mention several techniques at different points where users can decide whether they are willing to sacrifice more speed for better accuracy, but the direction of this research and the implementation of the proof of concept/demo is speed driven.
High-Level Concept
Data Structure: our data structure has to handle the following tasks: capture the raw data for processing, prepare the data to be sent and received via MPI, reconstruct each skeleton at the main process, and merge them into the final skeleton. For those purposes we have the following classes:

Table III.2: High-level data structure.

Class jointArray     Class angleSet        Class processSkeleton
jointType            angleType             jointArray
jointStatus          angleStatus (0, 1)    synchAngle (q)
jointPosition        rotationAngle (θ)     confidentAngle (α, β)
                     rotationAxis (n)


- jointArray: this class exists at each child process and is largely self-explanatory. It is the placeholder for the raw data from the Kinect. It also exists at the end of the main process, where it holds the final skeleton.
- angleSet: this class holds all the data we process at each child node. It does not contain the skeleton data itself, but a skeleton can be constructed from it. The original spine coordinate is carried over to serve as the starting point when the new skeleton is constructed.
- processSkeleton: this class is created using data from angleSet; it contains a copy of a kinematically built jointArray and the confidence angles. This is what we send to the main process. Once it arrives, the data is used to construct the synchAngle variable at the main process.
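A minimal C# sketch of how these classes might be laid out is given below. The field names follow Table III.2, and the [Serializable] attribute reflects the fact that MPI.NET serializes message payloads between processes; the exact types in our prototype may differ.

using System;

[Serializable]
public class JointArray
{
    public int[]    JointType;       // one entry per tracked joint (20 on the Kinect)
    public bool[]   JointStatus;     // true = usable, false = inferred or unknown
    public float[,] JointPosition;   // [joint, xyz]
}

[Serializable]
public class AngleSet
{
    public int[]    AngleType;       // which of the 19 angles this entry describes
    public bool[]   AngleStatus;     // true when all three member joints are tracked
    public float[]  RotationAngle;   // theta for each angle
    public float[,] RotationAxis;    // rotation axis n for each angle
}

[Serializable]
public class ProcessSkeleton
{
    public JointArray Joints;         // kinematically rebuilt, uniform skeleton
    public float      SynchAngle;     // q, used for run-time space synchronization
    public float      ConfidentAlpha; // upper-body confidence angle
    public float      ConfidentBeta;  // lower-body confidence angle
}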
Synchronized Sensor Space: as mentioned above, each sensor needs to understand its position relative to the others for the merging process to be meaningful. While there are methods to pre-calibrate the sensors based on their physical setup, we find them too restrictive for our purpose, for these reasons:
- The process is complicated for end users.
- It requires third-party tools.
- It is restrictive: each time the sensors are set up differently, or the actor moves out of the designated spot, the system may need to recalibrate.
What we do instead is implement a smart calibration method that can self-calibrate, as long as the sensors are in reasonable positions, using the information from each skeleton. It also allows the actors to move relatively freely without worrying about de-synchronizing the data from each sensor. In other words, we do our calibration and synchronization at run time. Setup information given at input time can help, but it is not critical (right now we keep it as a safety measure).


Ranking system: as mentioned above, one of the biggest drawbacks of an infrared device is that it only returns optimal data when the actor faces the sensor directly, and the quality degrades as the actor turns away from it. We propose to address this problem with a priority ranking system based on how far the actor has turned away from the sensor. How this is calculated is described later in the method section.


CHAPTER IV
METHOD AND IMPLEMENTATION
Based on the description in the previous section, a more detailed workflow is shown in Figure IV.1.


Joint Info Processing
We start by extracting the raw data to fill our jointArray for each device. This includes the position of each joint and a status flag: true means the data point is good and can be used; false means the joint will not be considered for further calculation. For every joint Jn(x, y, z, status):
    o Tracked: indicates the joint is actually seen by the sensor and properly tracked. We set the status flag to true.
    o Inferred: the joint is not actually seen, but its position is extrapolated from other joints (for example, based on the wrist and shoulder positions the sensor can guess the elbow position). This is mainly how occluded joints are constructed. For our purpose this data is not good enough, but we may fall back on it in case a joint cannot be seen by any sensor. We set the status flag to false.
    o Unknown: there is no information about this joint from the sensor (either the sensor does not see it, or the joint is outside the FOV or the min/max distance). We will not use it for calculation, so the status flag is set to false.
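As an illustration, the mapping from the SDK's tracking state to our status flag might be written as follows, assuming the Kinect for Windows SDK 1.x Joint and JointTrackingState types; the helper name is illustrative.

using Microsoft.Kinect; // Kinect for Windows SDK 1.x

static class JointStatusMapper
{
    // Only joints the sensor actually sees are marked usable.
    public static bool ToStatusFlag(Joint joint)
    {
        switch (joint.TrackingState)
        {
            case JointTrackingState.Tracked:
                return true;   // seen and tracked: usable for angle construction
            case JointTrackingState.Inferred:
                return false;  // guessed from neighbours: kept only as a fallback
            default:
                return false;  // not tracked / unknown
        }
    }
}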
Smooth Filter
To alleviate the noise problem and obtain more accurate data we need some filters. There are many types of filters that can be employed for processing signal noise, such as the Savitzky-Golay algorithm, which is based on least-squares polynomial smoothing [42]. The balance between accuracy and speed is an important consideration for choosing the right type of filter. Exponential or double exponential smoothing filters can give good results but are also very expensive for interactive applications. Most device APIs and libraries already come with a few filters for the noise generated on a particular sensor, so that is not our research interest. We focus on dealing with the differences between data generated by different sensors, which even in the case of perfect synchronization is still not guaranteed to be identical, given the nature of markerless motion capture. In this thesis we tackle the issue using several methods; while they are not independent filters, they are used directly as part of our data construction. We construct a uniform model on each sensor based on kinematic data, which balances speed and accuracy. The reason we decompose and reconstruct our skeleton data, instead of using the raw joint position data from each sensor, is to ensure uniformity between our presentations. It may seem redundant, because the new data is calculated from the raw data, but by reconstructing it using only the angle and rotation of each bone, we can impose the same offset for the bone sections between the joints across all sensors. The benefit is that our new skeleton has the same orientation values as the raw data, but the noise in the specific joint locations is reduced. Later on we also use a joint-average filter, which is cheap but reliable if we are able to feed it with good data, which is provided by our uniform models and the ranking system. A short white paper by Mehran Azimi [43] gives readers a brief overview of all the filters mentioned so far in this section (except the ranking system).
Building The Angle Set
Using the information from jointArray, we break the 20 joints down into 19 sets of angles Ax(Ja, Jb, Jc), where x identifies a particular angle and Ja, Jb, Jc are three connected joints from jointArray.


Upper body:
Aspine(Jspine, JshoulderC)    // This is an exception: only two joints.
Aneck(Jspine, JshoulderC, Jhead)
AshoulderR(Jspine, JshoulderC, JshoulderR)
AbicepR(JshoulderC, JshoulderR, JelbowR)
AforearmR(JshoulderR, JelbowR, JwristR)
AhandR(JelbowR, JwristR, JhandR)
AshoulderL(Jspine, JshoulderC, JshoulderL)
AbicepL(JshoulderC, JshoulderL, JelbowL)
AforearmL(JshoulderL, JelbowL, JwristL)
AhandL(JelbowL, JwristL, JhandL)

Lower body:
AhipCenter(Jspine, JshoulderC, JhipC)
AhipR(JhipC, Jspine, JhipR)
AthighR(JhipR, JhipC, JkneeR)
AlegR(JkneeR, JhipR, JankleR)
AfootR(JankleR, JkneeR, JfootR)
AhipL(JhipC, Jspine, JhipL)
AthighL(JhipL, JhipC, JkneeL)
AlegL(JkneeL, JhipL, JankleL)
AfootL(JankleL, JkneeL, JfootL)
We will then use these 19 angles to calculate the new 20 joint locations for our uniform skeleton.


Figure IV.2: The joint structure and the relationships between joints.
The method to construct Ax is as follows. First we check the status of the three joints that make up the angle; for our purpose, (Ja.Status && Jb.Status && Jc.Status) has to be true for the set to be taken into calculation. This is meant to enhance the accuracy and authenticity of the later reconstructed joints, instead of relying only on the status of each individual joint. For example, in the pose where the actor points a hand directly at the sensor, only the hand joint is visible and tracked correctly, while the wrist and elbow joints are occluded (and thus inferred). While the correct position of the hand might be sufficient for a simple tracking application, this is not a reliable construct in terms of hierarchical kinematics.


Figure IV.3: The case of multiple occlusions. While the data might still be accurate enough for a simple application, the kinematic information between these three joints is not reliable.
Once all joints in Ax(Ja, Jb, Jc) are verified, we calculate the angle between them by first computing and normalizing the two vectors a and b, then the angle θ between them, as well as their axis of rotation n:

a = (Ja - Jb).Normalize();
b = (Jc - Jb).Normalize();
θ = Acos(DotProduct(a, b));
n = CrossProduct(a, b).Normalize();

Also, for each Ax that is verified, we set the associated angle's status to true, indicating that it can be used in reconstruction.
Skeleton Reconstruction
Once the data of the angleSet A is complete, we begin our first reconstruction. Starting from the spine joint, we apply a forward kinematic process based on the angular information and pre-defined offsets to map out the skeleton. The most critical part of this reconstruction is the rotation matrix that allows us to calculate a new joint by rotating a vector by a certain angle around an arbitrary axis, using a single rotation matrix R(n, θ). To rotate a unit vector b to the new position b' around an arbitrary axis n by an angle θ using the rotation matrix R, the derivation is as follows:

b R(n, θ) = b'    (1)

Decompose b into a component b|| parallel to n and a component b⊥ orthogonal to n:

b = b|| + b⊥    (2)

with b|| = (b · n) n and b⊥ = b - (b · n) n.

Next we need to rotate b⊥ about n to get b⊥'. To do this we need a new vector T that is perpendicular to both b|| and b⊥ and has the same length as b⊥, so T = n × b⊥. Then:

b⊥' = b⊥ cos θ + T sin θ    (3)

where

T = n × b⊥ = n × (b - b||) = n × b - n × b|| = n × b    (since n × b|| = 0)

Substituting b⊥ and T into (3) we have:

b⊥' = cos θ (b - (b · n) n) + sin θ (n × b)

Substituting (3) into (2) we have:

b' = cos θ (b - (b · n) n) + sin θ (n × b) + (b · n) n    (4)
With this we can now construct our rotation matrix R(n, θ). Let o, p, q be the three component unit vectors of the identity matrix:

    | 1 0 0 |
V = | 0 1 0 |
    | 0 0 1 |

We now apply equation (4) to rotate each component vector; we do this in column form and apply the transpose later. For the first column o = (1, 0, 0):

o' = cos θ (o - (o · n) n) + sin θ (n × o) + (o · n) n

     | cos θ (1 - nx²) + nx²         |
   = | nx ny (1 - cos θ) + nz sin θ  |    (5)
     | nx nz (1 - cos θ) - ny sin θ  |

Applying the same process to the vectors p and q, we have:

     | nx ny (1 - cos θ) - nz sin θ  |
p' = | ny² (1 - cos θ) + cos θ       |    (6)
     | ny nz (1 - cos θ) + nx sin θ  |

     | nx nz (1 - cos θ) + ny sin θ  |
q' = | ny nz (1 - cos θ) - nx sin θ  |    (7)
     | nz² (1 - cos θ) + cos θ       |

Now we apply the transpose to (5), (6), (7) to form the rotation matrix R(n, θ), whose rows are the transposed columns o', p', q':

R(n, θ) = | cos θ (1 - nx²) + nx²          nx ny (1 - cos θ) + nz sin θ   nx nz (1 - cos θ) - ny sin θ |
          | nx ny (1 - cos θ) - nz sin θ   ny² (1 - cos θ) + cos θ        ny nz (1 - cos θ) + nx sin θ |
          | nx nz (1 - cos θ) + ny sin θ   ny nz (1 - cos θ) - nx sin θ   nz² (1 - cos θ) + cos θ      |

Let Rn be the homogeneous rotation matrix built from R(n, θ) for joint n. We then multiply by the scalar Bn (the bone offset, which varies for each specific joint) to get the new joint location:

J_new = | Rn 0 | · Bn
        | 0  1 |
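For illustration, the same rotation can be applied directly from equation (4) without forming the full matrix R(n, θ); the sketch below assumes System.Numerics.Vector3 and an illustrative bone-length parameter.

using System;
using System.Numerics;

static class ForwardKinematics
{
    // Rotate vector b around unit axis n by angle theta (equation (4) above).
    public static Vector3 Rotate(Vector3 b, Vector3 n, float theta)
    {
        float c = (float)Math.Cos(theta);
        float s = (float)Math.Sin(theta);
        float dot = Vector3.Dot(b, n);
        return c * (b - dot * n) + s * Vector3.Cross(n, b) + dot * n;
    }

    // Forward-kinematic step: the child joint lies along the rotated bone
    // direction, scaled by the fixed bone length Bn shared by all sensors.
    public static Vector3 PlaceJoint(Vector3 parent, Vector3 boneDirection,
                                     Vector3 n, float theta, float boneLength)
    {
        Vector3 rotated = Rotate(Vector3.Normalize(boneDirection), n, theta);
        return parent + boneLength * rotated;
    }
}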
For joints that are properly calculated thanks to the availability of the associated angle Ax, we set the status of the joint to true. For joints where the associated angle data is not available, we temporarily fill in the original sensor data but set the joint status to false. Structure-wise, this new jointArray is the same as the initial jointArray, with the difference that it is constructed using hierarchical kinematics as opposed to raw data. There are two purposes to constructing this new jointArray:
- As discussed above, it helps us build a more reliable model based on the kinematic structure and thus minimizes the error that can be produced by the raw data (noise).
- By using the angle information with the same pre-defined offsets on all sensors, we are able to produce a uniform skeleton model across all processes and eliminate the differences resulting from the distance and scaling of each sensor. The uniform skeleton model is a critical key in our calibration and synchronization.
We then embed this new jointArray into our processSkeleton class with the addition of the two confidence angles (explained later), and send this information from each child process to the main process.
Data Synchronization
Assuming we have a powerful enough system that each sensor can stream its own data without a performance hit, there is still the issue of communication lag between sensors. We assume each device has a unique time stamp for each data frame, starting from the time the sensor is started. This still leaves the issue of ensuring that all devices are started at the same time, since in our observation, even within the same machine, each device can have up to one second of lag between initializations. We could address this by analyzing the data stream and looking for a common point of interest (i.e., the frame where the same action takes place), calculating the offset between these frames, and applying it for the duration of the process. We find that using MPI's barrier routine around the gather is a better way to ensure we get the closest frames, using two pairs of barriers:
- First, we place a barrier before and after all sensor initialization:

Intracommunicator comm = Communicator.world;
comm.Barrier();
sensor.start();
comm.Barrier();

- Second, we also place a barrier before and after each frame is gathered:

comm.Barrier();
Skeleton[] skeletonArray = comm.Gather(skeletondata, 0);
comm.Barrier();

At 30 frames per second we allow a margin of error of 0.2 s for each frame.
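One simple way to enforce this tolerance, assuming one device timestamp (in seconds) per gathered frame and illustrative names, might be:

using System;

static class FrameAlignment
{
    const double ToleranceSeconds = 0.2;

    // Frames gathered after the barrier are accepted only when every sensor's
    // timestamp lies within the tolerance of the reference sensor (rank 0).
    public static bool FramesAligned(double[] timestamps)
    {
        double reference = timestamps[0];
        foreach (double t in timestamps)
            if (Math.Abs(t - reference) > ToleranceSeconds) return false;
        return true;
    }
}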
Data Calibration
Up until now, all the tasks have been carried out on an individual basis by each process working with its respective sensor. This is where our method differs from the traditional calibration discussed in previous works. Usually, data about the setup of the sensors is given at input time to ensure each sensor can understand and interpret the others; in other methods, the setup data is used in post-processing when the data is merged. The method we describe does the opposite: each sensor works independently without needing the calibration data of the other devices, but each produces a uniform skeleton model in its respective world space. When these uniform skeletons are sent to the space of the reference sensor, they are synchronized with that space at run time.
When each skeleton arrives at the main process, it is put into the world space of the reference sensor. This is the sensor, defined by the user, whose coordinate space provides the orientation and displacement for the final model and to which all other sensors are synched. We can call this our active work space, or global space. At first the skeletons from the other sensors will appear at arbitrary locations and orientations within the reference space, because their original local spaces were all different. But thanks to the uniform structure of all the skeletons, it is possible to synchronize them to the same point through a series of translations and rotations, and the information needed can be calculated from each skeleton itself.


Translation: the translation offset is simple to calculate. Similar to the approach in [20], the offset between two data sets can be obtained by calculating the difference between the same feature points in each frame of reference. However, we do not have to rely on a static cue as described there; thanks to our earlier preparation we already have the reference points we need. Ideally, if each sensor returned a perfect set of data, each of our 20 joints could be used as a reference point, but this will most likely not be the case. All we need is the distance between each child skeleton and the reference skeleton. For this calculation to be accurate we need to pick a center point on each skeleton. We choose the spine joint, given its position in the middle of the skeleton; we also found that it is the most reliably tracked joint at all times, since it is not affected by occlusion in most poses. It is also the root joint for our skeleton construction, so it is a natural selection. For each skeleton T we calculate a displacement vector m with respect to the reference skeleton R:

mx = Rspine.x - Tspine.x
my = Rspine.y - Tspine.y
mz = Rspine.z - Tspine.z

For each joint J in skeleton T:

J = J.Translate(mx, my, mz);

To further enhance the accuracy of the translation, we also calculate the difference between the hip-center joints and average the two values.
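A minimal sketch of this translation step, with skeletons stored as System.Numerics.Vector3 arrays indexed by joint id (names are illustrative), might be:

using System.Numerics;

static class SkeletonTranslation
{
    // Translate a child skeleton into the reference skeleton's space using the
    // spine displacement, averaged with the hip-center displacement.
    public static void AlignTranslation(Vector3[] child, Vector3[] reference,
                                        int spine, int hipCenter)
    {
        Vector3 mSpine = reference[spine] - child[spine];
        Vector3 mHip = reference[hipCenter] - child[hipCenter];
        Vector3 m = (mSpine + mHip) * 0.5f;      // averaged displacement vector m

        for (int j = 0; j < child.Length; j++)
            child[j] += m;                        // J = J.Translate(mx, my, mz)
    }
}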


Figure IV.5: The displacement between two skeletons
After the translation we need a rotation to put all skeletons in the same orientation; this process is more complicated than the translation above. We establish the idea of a plane on which each skeleton is based. This plane is constructed using a set of three joints on the skeleton; choosing which three joints are used, and under what circumstances, requires these considerations:
- They have to be well representative of the body orientation.
- They need to be the least affected by movement.
Picking the correct three joints to construct this plane is a difficult task because the human body can twist. The spine joint is an obvious candidate for the first joint of any set, given its location. The set of (spine, right shoulder, left shoulder) joints can construct a good plane to represent the upper body, but not necessarily the entire body if the actor is twisting the hip; the same applies to the set of (hip center, right hip, left hip). The safest method to construct the plane is to use a calibration pose for each sensor in which the actor is not twisted. A more dynamic way is to average two or more planes from the same skeleton. We then use the cross product to find the normal vector of each plane, and calculate the angle q between each plane and the reference plane. Since all skeletons are already at the same root after the previous translation, we create a rotation axis around the Y axis through the spine joint, then rotate all the joints of each skeleton around this axis by their respective q. The result is that the skeletons from all sensors are now in the same reference frame.
We believe there is a more reliable way to calculate the angle q through the use of an augmented sensor. Such a sensor can be set to track only the body orientation instead of performing full-body tracking, and can thus be set up and calibrated specifically for this purpose. For example, if a sensor can reliably track the two shoulder joints, we can use that data to calculate the body angle from the Z coordinates of the two shoulders:

q = arctan2(shoulderRight.Z - shoulderLeft.Z, shoulderLeft.X - shoulderRight.X)
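For illustration, the plane normal, the resulting heading, and the rotation of an already-translated child skeleton about the Y axis through its spine might be sketched as follows (System.Numerics; the shoulder-based plane follows the discussion above, and q would be taken as the difference between the reference and child headings):

using System;
using System.Numerics;

static class SkeletonRotation
{
    // Heading of the body plane built from (spine, left shoulder, right shoulder).
    public static float BodyHeading(Vector3 spine, Vector3 shoulderL, Vector3 shoulderR)
    {
        Vector3 normal = Vector3.Normalize(
            Vector3.Cross(shoulderL - spine, shoulderR - spine));
        return (float)Math.Atan2(normal.X, normal.Z);   // angle about the Y axis
    }

    // Rotate every joint of the child skeleton by q around the Y axis through
    // its spine joint, so both skeletons share the same orientation.
    public static void AlignRotation(Vector3[] child, Vector3 spine, float q)
    {
        Matrix4x4 r = Matrix4x4.CreateRotationY(q, spine);
        for (int j = 0; j < child.Length; j++)
            child[j] = Vector3.Transform(child[j], r);
    }
}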
Merging The Final Skeleton
Now that we have all the skeleton data synchronized, we use it to reconstruct our final skeleton. Recall from the earlier discussion in this section that the data from each individual skeleton might not be complete or accurate because:
- Joints may be occluded, so their positions are inferred rather than reconstructed, due to the lack of kinematic data.
- Joint quality degrades as the actor turns away from the sensor.
Our reconstruction process selects the best possible data from each skeleton for each joint. To help with this selection process we propose a ranking and token system.


Figure IV.6: The process of synchronizing the local space of a skeleton into the space of the reference skeleton. (a): a skeleton from a child process; (b): the reference skeleton. We establish a plane for each skeleton using its joint information. In (c), as a result of the previous translation, the two planes are already in the same local space, so we only need the angle between their normal vectors. (d) shows that the two planes nearly overlap after the rotation, completing the synchronization of the two skeletons they are based on.
Recall from the previous section that our processSkeleton class has a confidentAngle with two angles, α and β. These angles track how far the actor has turned away from the camera. Their concept and construction are similar to the planes of the previous section, in that each plane is constructed from a set of three joints on the skeleton, but with a few key differences:
- The planes are compared against the sensor's XY plane rather than the reference plane: the angles are calculated between the normal vector of the skeleton's plane and the Z axis of the sensor.
- There are two angles because α represents the orientation of the upper body (above the spine joint) while β represents the orientation of the lower body (below the spine joint). Unlike the angle used for sensor calibration, which has to uniquely reflect the sensor's orientation, these angles can differ on the same skeleton. For example, when the actor twists in a pose, a certain sensor might return good data for the half of the body facing it, while it is better to take the data for the other half from another sensor. To improve accuracy, both vectors Vα and Vβ are calculated using a half-body average:

Vα = Average(Cross(spine, left shoulder, shoulder center).Normalize(),
             Cross(spine, right shoulder, shoulder center).Normalize())

Vβ = Average(Cross(spine, left hip, hip center).Normalize(),
             Cross(spine, right hip, hip center).Normalize())


Table IV.7: Joint ranking system.

Degree of turning x      Rank    Description
|x| > 60°                4       Data degradation can be very bad; ignore this joint if better data is available from other skeletons.
30° < |x| <= 60°         3       If this is α, ignore it when better data is available; if this is β, average the value with other skeletons of the same rank.
10° < |x| <= 30°         2       Data can generally be considered stable enough; average with other skeletons of the same rank if possible.
|x| <= 10°               1       The best possible data; it can be used standalone, though one might consider averaging with rank-2 skeletons. Also consider giving this skeleton set the lead token.
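A direct mapping from the confidence angle to the ranks of Table IV.7 might be sketched as:

using System;

static class ConfidenceRanking
{
    // degrees = how far the actor has turned away from this sensor.
    public static int RankFromAngle(float degrees)
    {
        float a = Math.Abs(degrees);
        if (a <= 10f) return 1;   // facing the sensor: best possible data
        if (a <= 30f) return 2;   // stable enough; average with other rank-2 data
        if (a <= 60f) return 3;   // degraded; prefer better data when available
        return 4;                 // heavily degraded; last resort only
    }
}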
Before merging we will briefly explain the purpose of the lead token. One thing to
note is the sensor with the lead token is not necessary the same sensor being used as
the reference; although it can be and technically is ideal for the role. However
repeatedly changing the reference sensor is not desirable in term of speed since it
will need more synchronization pass (more rotation), and it can cause inconsistency
in term of the actors location and orientation. Our main purpose of introducing this
token aside from pointing out which sensor has the best data, it also provides a
heuristic locality for adjacent sensors. For example, the two sensors immediately
next to the lead sensors will most likely hold better data than those that are farther
away; this can be used for data optimization without paying for expensive arithmetic.
It also provides a good indicator for the case when we have no choice but to change
the reference camera. This usually happens when the actor have turn so far away (>
|60|) away from the reference sensor that the data it receives cannot be used to
reliably calculate the reference angle, in this case we have to change our reference
sensor to one that is closer to the direction the actor is facing (and then synch them
35


back to the original orientation). Technically, every sensor can hold the lead token
when the actor is facing them, but arbitrary making every sensor into a potential
reference has the disadvantages described above. It is better (and cheaper) to specify
a specific set of potential sensors to be used as reference, and switching between
them depending on their proximity to the sensor thats holding the lead token.
Figure IV.8: Let Black, Yellow, Red serve as our potential reference sensor.
If Green is hold the lead token, then Black will serve as our reference, if
blue is holding the lead token than we pick Red as our reference sensor.
Another advantage of this setup is that the user or application can add or remove
sensors between the reference sensors without needing to recalibrate the whole
system each time.
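The reference switch illustrated in Figure IV.8 amounts to choosing, from a fixed set of potential references, the sensor nearest to the lead-token holder. The sketch below shows one way this could look, assuming sensor indices are assigned in order around the capture area; it is a hypothetical illustration, not the actual selection code.

using System;

static class ReferenceSelection
{
    // Among a fixed set of potential reference sensors, pick the one closest
    // (wrapping around the sensor ring) to the sensor holding the lead token.
    public static int PickReference(int leadSensor, int[] potentialReferences, int sensorCount)
    {
        int best = potentialReferences[0];
        int bestDistance = int.MaxValue;
        foreach (int candidate in potentialReferences)
        {
            int diff = Math.Abs(candidate - leadSensor);
            int ringDistance = Math.Min(diff, sensorCount - diff); // circular layout
            if (ringDistance < bestDistance)
            {
                bestDistance = ringDistance;
                best = candidate;
            }
        }
        return best;
    }
}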
Finally, after all the necessary preparation, we merge the data and construct our final
skeleton with a simple process. We consider all the candidates for each joint from
each skeleton. Starting from the sensor with the lead token, we check all adjacent
sensors for their joint status and rank. The user can set a constraint on how many
sensors away from the lead sensor to consider, in order to reduce the amount of calculation.
The joint selection is described in Table IV.9 below.
Table IV.9: Joint selection depending on its status and rank.

Status True:
  Rank 1 - Take.
  Rank 2 - Take; average with others of the same condition.
  Rank 3 - Take or ignore depending on which joint; ignore if a better joint is available.
  Rank 4 - Ignore if any better joint is available; only take it if all other skeletons are in the same condition.
Status False:
  Rank 1 - Ignore. If the joint is reported false on all sensors, take it based on the confidence value.
  Rank 2 - Ignore.
  Rank 3 - Ignore.
  Rank 4 - Ignore.
Note that according to the table, the status of a joint takes precedence over the ranking
system. The reason is that even if the actor is facing the camera, a false
status means the joint is occluded, and it is therefore better to use the joint from another
sensor that can track it properly, even if from a less than optimal angle. In the case
where a joint cannot be tracked by any sensor, we use the inferred data from the top-ranked
sensor, assuming it is in the best position to guess the missing joint.
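A minimal sketch of this per-joint selection is given below. It is a simplified reading of Table IV.9 under stated assumptions (candidates tied at the best rank are averaged, and inferred data is used only when no sensor tracked the joint); the types and names are hypothetical rather than the classes used in the prototype.

using System.Collections.Generic;
using System.Linq;

// Per-joint merge sketch: tracked candidates beat inferred ones, a lower rank
// beats a higher rank, and candidates tied at the best rank are averaged.
struct JointCandidate
{
    public double X, Y, Z;  // joint position in the synchronized reference space
    public bool Tracked;    // status flag: true = joint actually seen by the sensor
    public int Rank;        // 1 (best) .. 4 (worst), from Table IV.7
}

static class SkeletonMerger
{
    public static (double X, double Y, double Z) MergeJoint(List<JointCandidate> candidates)
    {
        // Status takes precedence over rank: fall back to inferred data only
        // when no sensor tracked this joint at all.
        var pool = candidates.Where(c => c.Tracked).ToList();
        if (pool.Count == 0)
            pool = candidates;

        // Keep the candidates sharing the best (lowest) rank and average them.
        int bestRank = pool.Min(c => c.Rank);
        var best = pool.Where(c => c.Rank == bestRank).ToList();
        return (best.Average(c => c.X), best.Average(c => c.Y), best.Average(c => c.Z));
    }
}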


CHAPTER V
EXPERIMENTS AND RESULT
We ran a setup with three Kinect sensors through various test configurations to verify the
concepts we have introduced so far.
Calibration Test
The first test checks the accuracy of our smart calibration and
synchronization method. To perform this test we prepare three sets of data:
The skeleton data from the reference sensor (skeleton A).
The skeleton data from a child sensor calibrated at input time with physical
measurements (skeleton B).
The skeleton data from a child sensor synchronized with data calculated at run
time (skeleton C).
Our metric is as follows: for each joint x we calculate the distance between that joint
and the root position. For each of the n joints we obtain D_Ax, D_Bx and D_Cx
for skeletons A, B and C respectively. Then we calculate the percent errors E_B and E_C as:
E_Bx = |D_Bx - D_Ax| / |D_Ax| * 100
E_Cx = |D_Cx - D_Ax| / |D_Ax| * 100
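As a small illustrative helper (the names are hypothetical, not from the actual test code), the per-joint error against the reference skeleton A can be computed as:

using System;

static class CalibrationError
{
    // Percent error of a joint-to-root distance dSx (skeleton under test, B or C)
    // against the reference skeleton's distance dAx.
    public static double PercentError(double dAx, double dSx)
    {
        return Math.Abs(dSx - dAx) / Math.Abs(dAx) * 100.0;
    }
}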
The graphs below compare the two types of calibration. The data is taken over
600 frames (about 20 seconds) in which the actor performs a variety of poses;
each line is the error averaged over all joints.
Overall, the traditional calibration edges out the run-time calibration method by
about 1.5% on average, with roughly 3% error versus about 4.5%. We believe this error is
sufficiently small to justify the advantages of the proposed method.


Table V.1: The percentage of error between the two types of calibration.
The B (blue) line represents the traditional input-time calibration while the C (red) line
represents the run-time calibration.
Table V.2: The same data set as V.1, but with the values averaged over every 100
frames to better illustrate the differences between the two calibration methods.
Note that we ignore the spike values shown in the graph, since they represent
frames in which a particular sensor could not produce good data; on those frames the
data would likely be provided by another sensor.


We include here a few screenshots from our proof-of-concept prototype to
demonstrate the visual concepts mentioned so far. The first set of screens
demonstrates the flexibility and accuracy of our calibration method.
For the first set of three screens in Figure V.3, we arbitrarily place two sensors a few meters
away from each other and eyeball them toward the same general direction.
The first screen (a) shows the skeletons from each sensor after the child
skeleton (blue) is sent and placed in the space of the reference skeleton (red).
The second screen (b) shows the result of our synchronization. As shown, we
achieve a fairly good overall overlap between the two skeletons. This result was
achieved without any input-time calibration data (we do not actually have it,
since we deliberately set up this example casually); as described in our method,
the calibration data came from our main process itself at run time.
The third screen (c): we left the program running and had a person pick
up the child (blue) sensor and arbitrarily move it around. It was
displaced by around a meter in several directions with some small turns left
and right. The screen was captured during this process and shows that our
method was able to keep up with the changes in real time, with the two
skeletons still closely matched.


Figure V.3: a) The data from two sensors placed casually, before synchronization.
b) The skeletons synchronized into the same coordinate space using our new method,
with the sensors still at their original positions. c) One sensor is displaced during
run time; the displacement does not noticeably affect our synchronization.


Construction Test
The second set of screens demonstrates our merging process and how a new
skeleton can be constructed using partial data from multiple skeletons.
For the set of three screens in Figure V.4 we again set up two sensors, but their locations
are more deliberate and closer to an actual camera arrangement for a traditional
markerless motion capture system. The child sensor (green) is placed about 2 meters
away from the reference (red) sensor, with their lines of sight meeting at an angle of
about 30 degrees. As a reminder, we do not make this information
available to our system at input time; it figures it out itself.
The first screen (a): the two skeletons from the two sensors before synchronization.
We can notice a few differences compared to the first screen of the previous set. The first is
the angle between the two skeletons. Another difference is that the two
skeletons are rendered using raw data, so they are not uniform. We include this
screen to emphasize the differences between them caused by scaling due to
distance, camera lens and orientation.
The second screen (b): the two skeletons synchronized, essentially the same concept as
the second screen of the first set.
The third screen (c): we see another skeleton (blue), which is the newly merged
skeleton. We also render the original two skeletons (pre-synchronization) to serve as
a reference. We purposely stretch the right arm completely outside the red sensor's
FOV, causing its data to be unavailable. The green sensor, being at a
different angle, is still able to track that arm, as shown on the green skeleton, and
its data is therefore used to construct the missing arm on the blue skeleton (after
synchronization and merging).


Figure V.4: Skeleton merging. a) Raw data from the two skeletons. b) The two skeletons
synchronized. c) Merged data, demonstrating the use of different skeletons to construct missing joints.


Modelling Test
Visually, this allows us to render poses that would otherwise be impossible to render
with one camera, specifically poses where the actor turns at a large angle and/or with
occlusion. The last set of pictures shows the skeleton data rendered with an actual
skinned model to demonstrate what is possible before and after applying our
proposed method.
Figure V.5: Skinned model demonstration
Here we can see the result of the same pose from the same actor:


The first screen (a): with one sensor, the actor faces the sensor directly. There is no
occlusion and the pose is rendered correctly.
The second screen (b): the same pose, but the actor has turned about 70 degrees away from
the sensor. It is easy to see that the pose is not rendered correctly: the overall
orientation is mixed up and the occluded arm has a strange rendering. This is not
a pose-specific problem but rather what generally happens to the model when an
actor turns too far away from the sensor, for the reasons mentioned earlier.
The third screen (c): with the combination of multiple sensors, the pose is now
rendered in good quality. The orientation is correct and the occluded part is also
rendered satisfactorily.
We matched our three-sensor setup against the existing XNA Avatar demo from
Microsoft. In the Avatar demo, the tracker cannot properly track the actor at a body
orientation of more than 30 degrees away from the sensor, and occlusion results in a
lot of strange animation on the avatar. Our setup, on the other hand, shows a very
large improvement, reliably tracking up to 80 degrees of body orientation
away from the reference sensor.


CHAPTER VI
DISCUSSION
Most consumer grade sensors have a capture rate of 30 FPS, so if an actor moves too fast
there may be loss of data; this prevents the tracking of fast movements. In the case of
the Kinect, we used arm rotation as a measurement and found that more
than three rotations per second is too fast for the sensor to track properly. These devices also
have a rather low depth resolution (640x480 for the Kinect), so the quality of the data
is limited. This is a device limitation that can only be fixed by using better hardware.
An infrared device is not meant to capture 360-degree movement. It has
no real way to determine whether it is looking at the front or the back of an actor (the
depth map is essentially identical from the front and the back). This means a sensor behind
the actor and a sensor in front might return conflicting data while both are considered
high confidence. We are exploring methods to help with this issue, such as
the facial recognition algorithm using RGB images proposed in [7]. Another method is to
place a color marker on the front and back of the actor and use the sensor to assist the
process. Our lead token may be able to help speed up this process. In theory, it is
possible to cover 360-degree movement without completely surrounding
an actor with sensors if we can build a good extrapolation algorithm.
The usage of MPI does introduce some overhead due to messaging. This happens
with inter-process communication on one computer and is also expected in a
multi-computer setup. In our three-sensor setup the lag between two of the sensors is around
half a second, but the third sensor has a significant lag of up to 1.8 seconds due to bus
contention. For further optimization the algorithm could be modified with a shared
memory architecture or thread spreading to alleviate the problem. When more
computing nodes (sensor hubs) are added, some communication lag between nodes is
expected; we believe this issue can be minimized with a good local network. Also note
that our setup has a cap on overhead. In most conventional multi-sensor setups,
adding more sensors improves the result but also means more computation due to the
increased data volume. In our setup, adding more sensors improves fidelity without
adding data volume. This is due to the locality established by the lead sensor, which
allows us to pick a specific set of data to work with based on the body orientation.
This gives our model a great degree of flexibility in terms of scaling.
Our three-sensor setup serves as the test bed and proof of concept. With enough resources
and further tuning the system can be scaled up to 360-degree real-time body tracking.
This can have a wide range of applications in entertainment, research and training.
One suggestion that came up near the end of this research is that the system is currently
good enough for slow to moderate activities such as Tai Chi training; with better
hardware it may be improved to cover other sports such as golf training. The
ability to capture a larger range of movement can also offer more flexibility in the
entertainment field, allowing users to have an enhanced digital reality beyond the current
limitations of the devices.
So far the method does not have a way to properly process complex poses such as a body
bending forward or backward, as this nullifies the effectiveness of the confidence angles
because the normal vectors cannot be appropriately calculated. We believe that if we can
identify two sensors on the sides of the body, we can set additional conditions based on
the upper and lower body positions these two sensors provide and thereby rectify this
problem. However, this is also a problem with the pose classifier trained on the depth
images, so we may have to improve that part as well in order to support a wider range
of recognized poses.
At the beginning of this research we considered synchronizing the depth images from
each device as described in [21]. This method could provide better synchronization
data since the joint data is built from the depth image, but it proved to be
very restrictive, as it takes much more calculation to calibrate the different depth images
into the same space. Even after doing that, the actor is rooted to one spot, since any
displacement can put the images out of phase. By letting each sensor work on the
best possible construct it can achieve and merging them later, we simplified the calibration
process while achieving better flexibility. We are also considering improving our prototype and
adding a UI to allow testing in more conditions and easier interpretation of test data.


CHAPTER VII
CONCLUSION
Our experiments have introduced a method of merging data from multiple
infrared devices, and we showed that this method can be applied with good results
to consumer grade devices on the market. We introduced a simple but
efficient calibration process that lessens the constraints of the setup
procedure (our prototype showed the method works reasonably well even with a casually
eyeballed setup). The method works thanks to the construction of a uniform
skeleton and the selection of data, which enhance the capacity of the sensors and provide
more accurate tracking. This in turn gives more flexibility to the actions that
can be performed by the actor (greater turning angles, a bigger field of movement, and
more reliable handling of occluded poses). The combination of MPI for direct
communication and the lead token/reference sensors gives our method a good level of
scalability, since the calculation complexity does not scale with data volume.
Depending on the needs of the application, the required accuracy and the desired
freedom of orientation, more sensors can be added without many changes to the
base method. The method runs with limited overhead and
performs reasonably well in a real-time environment, so it can be used in
interactive software or in robotics, since the data reproduces an animation. All
of this is done using devices readily available to mainstream users, and thus
may have a wide range of applications.


REFERENCES
[1] J. Shotton, A. Fitzgibbon, M. Cook et al. Real-Time Human Pose Recognition
in Parts from Single Depth Images. Microsoft Research Cambridge & Xbox Incubation.
[2] S. Corazza, L. Mundermann, A.M. Chaudhari et al. A Markerless Motion
Capture System to Study Musculoskeletal Biomechanics: Visual Hull and Simulated
Annealing Approach. Annals of Biomedical Engineering, Vol 34, No. 6, June 2006.
[3] V. Dhochev, T. Vassilev, and B. Spanlang. Image-space Based Collision
Detection in Cloth Simulation on Walking Humans. International Conference on
Computer Systems and Technologies CompSysTech 2004.
[4] S. Izadi, D. Kim, O. Hilliges et al. KinectFusion: Real-time 3D reconstruction
and interaction using a moving depth sensor. Microsoft Research Cambridge, UK.
[5] Y. Liu, C. Stoll, J. Gall et al. Markerless Motion Capture of Interacting
Characters using Multi-View Image Segmentation. Automation Department, TNList,
Tsinghua University.
[6] K. Berger, K. Ruhl, Y. Schroeder et al. Markerless Motion Capture using
Multiple Color-Depth Sensors. The Eurographics Association, 2011.
[7] L. Ballan and G.M. Cortelazzo. Marker-less motion capture of skinned models
in a four sensor set-up using optical flow and silhouettes. Proceedings of 3DPVT08 -
the Fourth International Symposium on 3D Data Processing, Visualization and
Transmission, 2008.
[8] L. Mundermann, S. Corazza and T.P Andriacchi. The evolution of methods for
the capture of human movement leading to markerless motion capture for
biomechanical applications. For submission to Journal of NeuroEngineering and
Rehabilitation.
[9] M. Camplani and L. Salgado. Adaptive Spatio-Temporal filter for low-cost
sensor depth maps. University of Politecnica de Madrid, Spain, 2012
[10] M. Camplani and L. Salgado. Efficient Spatio-Temporal Hole Filling Strategy
for Kinect Depth Maps. University of Politecnica de Madrid, Spain, 2012.
[11] S. Matyunin, D. Vatolin, Y. Berdnikov, and M. Smirnov. Temporal filtering for
depth maps generated by Kinect depth sensor. 3DTV Conference: The True Vision -
Capture, Transmission and Display of 3D Video, May 2011.
[12] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D
object dataset. In IEEE International Conference on Robotics and Automation, May
2011.
[13] M. Siddiqui and G. Medioni. Human pose estimation from a single view point,
real time range sensor. In CVGG at CVPR, 2010.


Y. Zhu and K. Fujimura. Constrained optimization for human pose estimation from
depth sequences. In Proc ACCV, 2007.
[14] S. Knoop, S. Vacek and R. Dillmann. Sensor fusion for 3D human body
tracking with an articulated 3D body model. In Proc ICRA, 2006.
[15] A. Cappozzo, F. Catani, U. Della Croce and A. Leardini. Position and
orientation in space of bones during movement: anatomical frame definition and
orientation. Clin. Biomech. 10:171-178, 1995
[16] A. Laurentini. The visual hull concept for silhouette based image
understanding. IEEE PAMI 16: 150-162, 1994.
[17] I. A. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion
with occlusion based on active multi-viewpoint selection. In Proc. IEEE CVPR, 81-87,
1996.
[18] W. Matusik, C. Buehler, R. Raskar, S. Gortler and L. McMillan. Image-based
visual hulls. Proc. ACM SIGGRAPH, 369-374, 2000.
[19] N. Hasler, B. Rosenhahn, T. Thormahlen et al. Markerless motion capture with
unsynchronized moving cameras. Computer Vision and Pattern Recognition (CVPR),
IEEE 2009, pp. 224-231.
[20] Y. Kim, D. Chan, C. Theobalt, S. Thrun. Design and calibration of a multi-view
TOF sensor fusion system. Computer Vision and Pattern Recognition Workshops, 2008.
[21] J. Bouguet. Camera Calibration Toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj/calib_doc/, 2010.
[22] E. De Aguiar, C. Stoll, N. Ahmed et al. Performance capture from sparse
multi-view video. In ACM Transaction on Graphic vol 27, p. 98, 2008.
[23] L. Guan, J. Franco, M. Pollefeys: 3d object re-construction with
heterogeneous sensor data. 4th International Symposium on 3D Data Processing,
Visualization and Transmission. Atlanta, GA, 2008.
[24] G. Baciu, W.S. Wong, H. Sun. An Image-Based Collision Detection Algorithm.
The Journal of Visualization and Computer Animation, pp. 181-192, 1999.
[25] A. Balan, L. Sigal, M. Black, J. Davis and H. Haussecker. Detailed human
shape and pose from images. In CVPR, 2007
[26] A. Sundaresan and R. Chellappa. Multi-sensor tracking of articulated human
motion using motion and shape. In Asian Conference on Computer Vision, Hyderabad, 2006.


[27] T.B Moeslund and E. Granum. A Survey of computer vision-based human
motion capture. In International Conference on Face and Gesture Recognition, 2000.
[28] M. Yamamoto, A. Sata, S. Kawada, T. Kondo and Y. Osaki. Incremental
tracking of human actions from multiple views. In CVPR, p2-7, 1998.
[29] D.M Gavrila and L.S Davis. 3-D model-based tracking of humans in action:
A multi-view approach. In Computer Vision and Pattern Recognition, p73-80, 1996.
[30] I. Kakadiaris and D. Metaxas Model-Based Estimation of 3D Human Motion.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol22 no 12,
December 2000.
[31] T. Cham and J.M. Rehg. A multiple hypothesis approach to figure tracking. In
Computer Vision and Pattern Recognition, v.2, June 1999.
[32] D. M Gavrila. The visual analysis of human movement: A survey. Computer
Vision and image understanding: CVIU, 73pp 82-98, 1999.
[33] T. Tan, L. Wang, W. Hu. Recent developments in human motion analysis.
Pattern Recognition, 36 pp 585-601, March 2003.
[34] C. Menier, E. Boyer, and B. Raffin. 3D skeleton-based body recovery. In Proc.
of the 3rd International Symposium on 3D Data Processing, Visualization and
Transmission, Chapel Hill, June 2006.
[35] R. Kehl and L. Van Gool. Markerless tracking of complex human motion from
multiple views. Computer Vision and Image Understanding, 104(2), pp. 190-209, 2006.
[36] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14: pp. 239-256, 1992.
[37] C. Cedras and M. Shah: Motion-based recognition: a survey. Image and
Vision Computing 1995, 13(2) pp 129-155.
[38] J. Aggarwal, Q. Cai: Human motion analysis: a review. Computer Vision and
Image Understanding 73(3): pp82-98, 1999.
[39] H. Lanshammar, T. Persson, V. Medved. Comparison between a marker-based
and a marker-free method to estimate centre of rotation using video image analysis. In
Second World Congress of Biomechanics, 1994.
[40] E. E. Stone and M. Skubic. Evaluation of an Inexpensive Depth Sensor for
Passive In-Home Fall Risk Assessment. University of Missouri, 2011.
[41] R. W. Schafer. What is a Savitzky-Golay Filter? IEEE Signal Processing
Magazine, pp. 111-116, July 2011.


[42] M. Azimi. Skeletal Joint Smoothing White Paper. MSDN digital library, 2012.
[43] V. Lepetit, P. Lagger and P. Fua. Randomized trees for real-time keypoint
recognition. In Proc. CVPR, pp. 775-781, 2005.
[44] T. Thormahlen, N. Hasler, M. Wand, and H.-P. Seidel. Merging of unconnected
feature tracks for robust camera motion estimation from video. In CVMP, Nov. 2008.
Full Text

PAGE 1

INTERACTIVE FULL BODY MOTION CAPTURE USING INFRARED SENSOR NETWORK. By SON TRONG DUONG B.S Computer Science and Engineering, University of Colorado. (2008) A Thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment Of the requirements for the degree of Master of Science in Computer Science and Engineering 2012

PAGE 2

ii This t hesis for the Master of Science d e g ree by Son Trong Duong h as been approved for the Computer Science and Engin eering program by Min Hyung Choi, Chair Gita Alaghband Ellen Gethner Bryan Leister November 12 th 2012

PAGE 3

iii Son Trong Duong (M.S., Computer Science and Engineering). Interactive Full Body Motion Capture Using Infrared Sensor Network Thesis directed by Associate Professor Min Hyung Choi. ABSTRACT Traditional motion capture (mocap) has been well studied in visual science for a long time. More and more techniques are introduced each year to improve the quality of the mocap data. However up until a few years ago the field is mostly about capturing the precise animation to be used in different application after post processing such as studying biomechanics or rigging models in movies These data sets are normally captured in complex laboratory environments with sophisticated equipment thus making motion capture a field that is mostly exclusive to professional animators. In addition obtrusive sensors must be attached to actors and calibrated within the capturing system, result ing in limited and unnatural motion. In recent year the rise of computer vision and interactive entertainment opened the gate for a different type of motion capture which focuses on producing marker or mechanical sensorless motion capture Furthermore a wi de array of low cost but with primitive and limited functions device are released that are easy to use for less mission critical applications Beside the traditional problems of markerless systems such as data synchronization, and occlusion, these devices also have other limitation such as low resolution, excessive signal noise and narrow tracking range. In this thesis I will describe a new technique of using multiple infrared devices to process data from multiple infrared sensors to enhance the flexibilit y and accuracy of the markerless mocap The method involves analyzing each individual sensor data, decompose and rebuild them into a uniformed skeleton across all sensors. We then assign criteria to

PAGE 4

iv define the confidence level of captured signal from senso r. Each sensor operates on its own process and communicates through MPI. After each sensor provides the data to the main process, we synchronize data from all sensors into the same coordinate space. Finally we rebuild the final skeleton presentation by pic king data with a combination of the most confident information Our method emphasizes on the need of minimum calculation overhead for better real time performance while being able to maintain good scalability. These are specific contributions of this thesi s : f irst this technique offer s a more accurate and precise mocap by making sure all the involved joint s are properly tracked by at least one sensor at all time Second, this method alleviates intrinsic shortfall of the device such as noise and occlusion. Third, it provide s greater flexibility outside the geometric range limitation of one sensor which all ow s for greater movement freedom of an actor A nd finally it does not require lengthy calibration and pre processing procedures making this setup much more straightforward and easy to deploy in many application cases. The form and content of this abstract are approved. I recommend its publication Approved by: Min Hyung Choi

PAGE 5

v DEDICATION I dedicated this work to my parents Khoat and Ngoc Anh. Throughout the ir life they have given up a lot to support me in my academic path. It is through their many scarifies that I can achieve what I have today.

PAGE 6

vi TABLE OF CONTENTS I. INTRODUCTION Overview -----------------------------------------------------------------------1 The Kinect Device -------------------------------------------------------------2 Motivation ---------------------------------------------------------------------2 II. RELATED WORK S Markerless Motion Capture --------------------------------------------------4 Motion Analysis --------------------------------------------------------------6 Occlusion And Collision ------------------------------------------------------7 Calibration And Setup ---------------------------------------------------------8 Unsynchronized Camera Tracking ------------------------------------------10 III. BASE CONCEPT AND DESIGN Initial Concept -----------------------------------------------------------------12 Device Limitation -------------------------------------------------------------13 MPI.NET -----------------------------------------------------------------------14 Kinect SDK --------------------------------------------------------------------15 Accuracy Versus Speed -------------------------------------------------------16 High Level Concept Data Structure -------------------------------------------------------16 Synchronized Sensor Space ---------------------------------------17 Ranking System ----------------------------------------------------1 8 IV. METHOD AND IMPLEMENTATION Joint Info Processing ----------------------------------------------------------20 S mooth Filter ------------------------------------------------------------------20 Building The Angle Set -------------------------------------------------------21

PAGE 7

vii Skeleton Reconstruction ------------------------------------------------------24 Data Synchronization ---------------------------------------------------------28 Data Calibration --------------------------------------------------------------29 Merging The Final Skeleton -------------------------------------------------32 V. EXPERIMENT AND RESULT Calibration Test ---------------------------------------------------------------38 Construction Test --------------------------------------------------------------42 Modeling Test -----------------------------------------------------------------44 VI. DISCUSSION ----------------------------------------------------------------46 VII. CONCLUSION ---------------------------------------------------------------49 REFERENCE S ----------------------------------------------------------------5 0

PAGE 8

viii LIST OF FIGURES FIGURE II.1: Synthetic and real data --------------------------------------------------------------4 II.2: Visual Hull ---------------------------------------------------------------------------6 II.3: Kinect decision forest tree ----------------------------------------------------------7 II.4: Infrared signal interference ---------------------------------------------------------9 III.1: Basic workflow ---------------------------------------------------------------------12 IV.1: Updated workflow ------------------------------------------------------------------19 IV.2: Joint Structure -----------------------------------------------------------------------23 IV.3: Multiple occlusions. ----------------------------------------------------------------24 IV.4: The rotation model -----------------------------------------------------------------25 IV. 5 : Displace between two skeleton ----------------------------------------------------31 IV.6 : Skeleton space synchronization. --------------------------------------------------33 IV.8 : Lead token sen sor placement ------------------------------------------------------36 V.3: C alibration sample ------------------------------------------------------------------41 V.4: Skeleton merging --------------------------------------------------------------------43 V.5: Skinned model demonstration ------------------------------------------------------44

PAGE 9

ix LIST OF TABLES TABLE III.2: High level data structure -----------------------------------------------------------16 IV.7: Joint ranking system. ---------------------------------------------------------------35 IV.9: Joint selection -----------------------------------------------------------------------37 V.1: 600 frame calibration error. ---------------------------------------------------------39 V.2: Average over 100 frame error -------------------------------------------------------39

PAGE 10

1 CHAPTER I INTRODUCTION Overview Technologies and methods that produce high quality mocap data have been available for quite some time. However they are usually out of reach of the mainstream users for both their cost and complexity. The traditional methods of motion cap like optical or magnetic require a cumbersome setup of markers and/or sen sors on the actor. There have also been research about markerless motion capture but they also require complex set up and calibration in a laboratory environment. Another drawback is the price points of these systems are usually quite high and the data it self needs a lot of post processing. These traditional methods can yield very accurate data, but they can only be saved and used in a simulation for either modeling or study ing since they are not viable for real time recording. In recent years the entert ainment industry has started introducing cheap and simple device s to stimulate some basic motion capture technology to the mainstream market. The Wiimote and Playstation Move have tracking sensor to record the hand movements to determine the motion (or pos e) that the user is performing and apply the motion into the application. Later on combining with computer vision technology we have devices like the Playstation Eye and Microsoft Kinect. These devices employ an infrared sensor to track the positio n and allow some virtual interaction with on screen objects While they do not measure up to the traditional motion capture system in term of accuracy and detail, all of these devices are meant for int eractive use, so the motion a ffect s their application i n real time.

PAGE 11

2 The Kinect Device One of the first and most advanced motion sensor s released to mainstream users. It comes with two sensor s an infrared and a RGB sensor this research only focuses on the use of the infrared sensor although we might explore the potential of the RGB sensor in the future. The infrared sensor (depth sensor) has an effective range from 0.8m to 4m from the sensor, with an FOV of 47 o vertical and 57 o horizontal and it operates at 30 frame s per second [5]. The reason it is ch osen for this research is because it already has an active support platform to work with the device output (work on this research has begun since the beta) and thus let us delves into our research topic more readily The RGB sensor whil e not me ntioned in this paper still has potential benefit s for future research and it will be more convenience to receive the data stream from the same device. However t he theories presented here is based on receiving and processing data from any infrared device, and should still work with suit. Motivation As mention above for various reason s up until now the motion capture scene is more or less exclusive to industrial professional. The release of the mainstream consumer price sensors opened a floodgate of new oppo rtunities to a much larger user base Since its inception there have been variou s works from fun community hacks like using the Kinect to handle computer interaction, monitoring patience [41] to testing its usage in robotic and surgical field [42]. However as one would expect from a device at such a low price it comes with many draw b ack such as inconsistent data, noise, very limited actor orientation and sometime simply guess work (inferred). This

PAGE 12

3 research is meant to address and overcome some of these disadvantage s while remains simple and inexpensive enough that it can still be used interactively and at a cost not prohibit to mainstream users. Specifically these are the limitation that this research seeks to address and correct or improve: Limited actor orientation: sensor can only detect actor facing it directly. Inaccurate joint t racking: due to either occlusion or orientation. Conflicting and corrupted data: due to signal interference. Scalable method that can be used to track up to 360 viewpoints for full body

PAGE 13

4 CHAPTER II RELATED WORKS Markerless Motion Capture T he idea of using markerless motion capture has been an active work for quite some time. While cumbersome, a traditional motion cap model using markers can reliably relay s the joint position on active model thanked to the attached markers which can be easil y identified by some parameter s A markerless motion capture process does not have this data and as thus, has to construct the joint information base on the visual cue. A comparison ab out the two methods was given b y Lanshammar and Persson base d on video i mage [ 40]. Some popular approaches to constructing the data include using depth images [ 1, 7,19] and R G B image segmentation [5,8] or both[ 11 ]. Figure II: Synthetic and real data in Kinect. The depth sensor returns the depth map of the actor. Then a text ure map with body part labelling (color coded) is applied on it to define the joint region. Figure taken from MS Reseach Cambridge[1] Depth i mage method first involves generating an intensity map base on the depth info of each pixel. Data calibration such as near and far plane can be used to remove The next part is to define a region for each key body part. This can be done by either texture map targeting or establishing local plan ar s [ 44].

PAGE 14

5 In other method typically a visual hull of the actor is reconstructed into a workable model [4,17,36]. A visual hull can be described as shapes in this case a human actor. The silhouette of the actor for e ach camera is calculated from a background subtraction using the depth image generated from an infrared sensor. T he hull of a human model is then constructed to represent the actor in 3D. There are several methods on how to construct one such as constructi ng it from segmented body part One approach is to use super quadric shape with a joint hierarchy to hold them together. Our main interest in this research is on the accuracy of the joint hierarchy itself instead of the model presentation. Once we have the correct joint data we can then use it to rig any general human mesh for demonstrating purpose. In case of an absolute model presentation is needed, a full body scan of the actor will be needed. S. Corazza et al provided another method by constructing a vi sual hull base on a point cloud system with embedded kinematic constraint this method while produce a very accurate model it is computing intensive [ 2 ,8 ]. Figure II. 2 : Visual Hull reconstruction concept. The intersection of each sensor s are used to approximates the actor model. Figure from Standford University [2]

PAGE 15

6 Motion A nalysis U pon reconstructing the model, the next step in mark e rless motion capture is how to d on the cue. Depending on the need of the application the methods are varied. In the case of single sensor the interest lie s more within detecting a specific set of pose s for interaction. This case usually treat s the depth image as a 2D space image [13 ,26,28 ]. To be able to analyze real 3D mo tion the process usually involved a set of training data. [21,24,29] This is usually done by training the data through a variety of human pose [30,33,34 ] In general, markerless motion cap is about understanding a set of pose s an d interpret ing the motion in between. The quality of the final animation depends on the original training set (synthesis image) and the decision tree. [1,38,39] Synthesis image is created from a random set of parameter including a depth image and 3D texture meshes. For the Kinect its pose classif ier is trained with 3 tree of depth 20 from 1million synthesis images, the process takes about 24 hours on a 1000 core cluster. Figure II. 3 : The Kinect decision forest tree to determine a pose. With the x pixel in image I for the body part c. The traverse formula is whefda F igure taken from MS Research Cambridge [1]

PAGE 16

7 Occlusions And Collision O ne of the most significant disadvantages of markerless motion capture is occlusion Since the data come from a view of a sensor instead of being generated on the actor, it is quite common that certain joints will not be seen by a sensor at certain pose. A common method to deal with this issue on single sensor by using an inferred or prediction system : as in using the position of other visible joint to make a guess for where the occluded joint is. This can be quite confusing not only for the occluded joint but also for the joint in front since the algorithm might be confused in sep arating the 2 joints. Vladimir Dochev, Tzvetomir Vassilev, and Bernhard Spanlang proposed a method of using multiple layers of depth map to detect occlusion [ 3]. I t divides the depth map into several layers and separated the body parts base on their depth level. Two depth maps of the actor are taken from the front and back then decomposed and labeled into difference body part s like described in the above session. Each body part is then given a bounding box (BB) volume. The BB volume is then used to detect c ollision between body parts which is then determine which session of the depth map need to be divided into layer. The test for collision is calculated per: This method can be used by any infrared sensor In our prototype process however the collision approach is generally too expensive for interactive application. Another approach using Image Space interference test to detect collision like described in [26] but it is too expensive processing wise for our purpose.

PAGE 17

8 Sensor Calibration And Setup S ince most markerless motion capture system will require quite a few sensor s to be able to record the full body motion, they have to be calibrated so data from each device can be combined together. A typical problem of markerless systems is noise. Noise is an inherited issue with the model even among high end system. They can come from air and lighting condition, surface property of the tracked object among other thing. Professional grade sensor however does have benefit to address other source of noise. For example in the same system most likely each sensor will have a unique iden tifier and operate on a different infrared frequency worry about signal interference between sensors. This is a problem with consumer grade sensor (especially for mass produce sensor like Kinect) W hen working with multiple devices al l the infrared signals on each sensor will operate on the same frequency. This means when more than one sensor is pointed at the same direction, they will create some noise and interference. An illustration of this problem is showed in the figure below. Figure II.4: Infrared signal interference An infrared sensor shoot s a beam at the object and wait s for a reflective signal, based on this wait time the depth (distance from the sensor) of a particular point is constructed,. Due to the scattering rate when a beam hit s the surface, it is quite common for this signal to be reflect in different direction a nd not just toward the

PAGE 18

9 case of professional systems this is not an issue since each sensor will only pick up the signal of its corresponding frequency while ignoring signal from other sensor. But since all of our sensors operate on the same frequency they can cause interference. Our experience showed that the level of interference depend s on the density and orientation of the sensors. For example two sensors with a par allel orientation will suffer a lot more interference comparing to two sensors that are perpendicular to each other. We find that with enough spread (about 4m apart) and wide enough angles (around 60 120 degree ) the interference should not have serious eff ect when working on the skeleton data, and there are suggestion to correct and improve the quality of the Depth Image One way to improve the depth image is to correct the noise issue. A method of using the Kalman filters to reduce the fluctuation by adapt ive hole filling using spatial temporal information [ 9,10,11]. While the depth data is the base where all of our calculation is derived from so ideally we would want a smooth and stable data set to work with, but each filter will come with an overhead cost that needs to be considered for speed. K ai Berger et al provided a comprehensive method of setting up and calibrate four infrared sensors [6] using the This method is used to calibrate a nd align the depth sensor s using an RGB camera. A checker pattern is printed on an aluminum foil and used as the point of reference. The purpose is to provide a unique predictable IR pattern to be used in the calibration We found that the matching si lhou e t te approach using Depth Map while might provide a more accurate motion cap (because the Skeleton data is extracted from it) it also has some limitation. It requires an elaborated calibration process to align each sensor and this

PAGE 19

10 is only at a specific point in space. The displacement of an actor can result in the data from each sensor becomes out of phase. Unsynchronized Camera Tracking T here has been some research into this direction. Technically this is the behavior of tracking an object with a single moving camera like described in [4]. This is mainly used for scene reconstruction and not motion for the reason it cannot cover all angle at the same time. Unsynchronized multi camera animation tracking like the one proposed by N.Ha sler et al[2 0 ] use s an approach that treat the multi camera case as an extensi on of the single moving camera for the calibration process. In both case s the main difficult y lie e with identifying the sensor position in a 3D space in relative to the scene or tracking object. A common technique is using the Structure from Motion (SfM) approach. This technique employs the use of continuous 2D image s taking from a moving sensor and identify common feature between images to reconstruct the scene in 3D. First each camera is calibrated using a feature based method against a static background. In the case of infrared sensor this is the distance trajectory from the feature point P to the sensor. In e ach frame the 2D image information of P is recorded and match with th e same point in 3D space. As the camera moves around and a set of consecutive frames are compared and the 3D presentation of P is created usually by triangulation. In the case of multiple sensors, the authors of the method proposed in [20] extend the case of one sensor. Instead of moving one sensor and take different fr ames continuously, the process is replaced with multiple sensors taking several frames at the same time. Each sensor calculates its own set of SfM data like described above based on their own projection matrix. The theory of SfM reconstruction in this case is: given consistent structure in each

PAGE 20

11 image I n the point P x,y in each image can be calculated by the projection matrix P n The method used in this research is similar to the pairwise matching described in [45], we further enhance this technique by cons tructing matching planes between each coordinate space.

PAGE 21

12 CHAPTER III BASE CONCEPT AND DESIGN Initial Concept Our approach to markerless motion tracking can be summarized as follow: Process the data from each sensor and reconstructed more accurate kinemati c base data set. Construct a uniform skeleton model on each sensor. Use a smart calibration sy stem to reduce complexity. Establish a ranking system to merge the data from all sensor s as well as providing tools for further optimization. Good flexibility and scalability. Raw data Decomposed Reconstruct Calibrate and synchronize. Establish Rank Merge Sensor 1 Sensor 2 Sensor n Raw data Decomposed Reconstruct Raw data Decomposed Reconstruct Figure III.1: Basic workflow

PAGE 22

13 Device Limitation W hile it is a great device at the price point consumer grade device does have a quite a few limitations especially when operate alone: As mentioned above the skeleton data is mainly built as a pose recognition feature in general markerless animation capture. The device calibrates itself once it believes it has the full body of the actor in view. However this calibration pose only works well when the actor facing directly at the device. Thr ough experimentation we find that the data is generally reliable when the actor do not turn more than 30 o to one side of the sensor (giving an effective arch of 60 o ), passing this point the data become unreliable, even with joints that is reported properly tracked. This is not entirely an issue of occlusion (although it is part of the problem), the main problem is without a n appropriate calibration pose, the skeleton tracking is simply confused about the full body presentation While it might still be able to track certain part of body (we found that the legs area can still operate correctly even at a 90 o ) in general i body. As a single device the Kinect also shares the common problem of markerless motion capture when it tr ies to track occlusion. The Kinect SDK has a way of dealing with this using inferred joint. When the sensor cannot see a joint, it will try to guess that joint position based on the position of the other joints. We find that while this work for simple pose like when an arm is stretched across outside the sensor range, it does not work in term of other complex occlusion like when we point the arm directly at the device. (i.e the arm joint will ove rlap the wrist, elbow and shoulder joint at the same time).

PAGE 23

14 No is e: another typical problem of markerless motion capture is the noise from the signal (actually motion capture with marker also has problem with noise, although to lesser extend). It is even more of a problem with cheap device s since being mass produced produce device like Kinect do not have unique infrared frequency so they will create even more noise due to interference. MPI.NET T he Kinect driver at the time of this writing possible for a computer to be able to c onnect and run several Kinects at the same time, the skeleton tracking can only be activated on the first devices on the same computer W e have tried to use file I/O to transfer data between separate process but this is proven to be very slow even only between two processes Other considerations we have taken in general when working with our research are : Each sensor will have to process a fair amount of data transformation natured to any animation application. S ince our focus is real time performance, this means each device will need to maintain its own data stream. We observe that when active each sensor can use up 60% to 80% the bandwidth of an USB.2.0 Control hub. We need a good and reliable method of synchron ization between devices. To which we address: Each sensor should run on its own process in parallel. We assign each sensor to a separate USB hub. We need a thread safe mechanism to ensure all sensor s are at the same frame. These consideration s lead us to t he usage MPI in order to handle the data exchange between our processes. In particular MPI.NET is a MPI wrapper for C# that

PAGE 24

15 allows us to launch multiple processes that each starts and handle s its own sensor device. We found that the overhead of messaging b etween process is acceptable, and can be optimized by reduce the complexity of the message to bare minimum. Also since MPI can handle co mmunication between processes on the same computer or communication between nodes in a network without many differences, the use of MPI thus can imply greater scaling flexibility in the case we want to use more devices. Another useful feature of MPI is its Barrier Synchronization routine, in which we use to synchronize our sensor. Assuming we have a powerful enough computer or network, each sensor should be running at the same steady frame rate. However some data lag might appear during the gathering process. There are several way s of making sure we are processing the same set of frames from all sensors such as time stamp or using sound cue from each sensor. One final advantage that makes us choose MPI over other mean of communication is due to its scalability. There are very little differences between handling processes in one computer or processes across a network, this wi ll allow us to add more sensor s and computing node s as needed by too much The Kinect SDK This is an API provided by Microsoft to use with Kinect on window machine. It provides the user with access to the raw data from the sensor, specifically the depth image stream and the RGB image stream. Using the depth image a 20 joints skeleton sted in the position and tracking state of each joint.

PAGE 25

16 Accuracy VS Speed B efore we present the core of our method we need to mention the balance between Accuracy and Speed. A single infrared/RBG is neither a speedy nor accurate by the average animation capture standard. Especially most consumer grade device e ven when working at optimal condition its sensor is normally capped at 30 frames per second (Kinect, Playstation Eyes) This means we cannot make the device any faster, so the goal of our research i s to produce a better result of the raw data. But another goal as mentioned is to maintain its interactive element. For academic purpose we will willing to scarify more speed for better accuracy, but the direction of this research and the implementation of the proof of concept/demo will be speed driven. High Level Concept Dat a Structure: our data structure ha s to be able to handle these following tasks : capture the raw data fo r process, prepare the data to be sent and received via MPI, reconstruct each skeleton at the main process and merging into the final skeleton. For those purpose we have the following classes: Table III 2 : high level data structure. Class JointArray Class angleSet Class process Skeleton jointType. angleType JointArray jointStatus. angleStatus (0,1) synchAngle jointPosition. rotationAngle. ) rotationAxis ( n )

PAGE 26

17 j ointArray: this class exist s at each child process and is p retty much self explanatory. It is the place holder for the raw data from the Kinect. It also exists at the end of the main process where it is used to hold the final skeleton angleSet: this class is where we hold all the data we processed at each child node, they d ed from them The original spine coordinate is carried over to serve as the starting point when the new skeleton is constructed. process Skeleton : this class is created using data from angleSet it contain s a copy of a kinematic built jointArray, and the confident angles. This is what we send to the main process. Once arrived, the data will be used to construct the variable synchAngle at the main process. Synchronized Sensor Space: as mentioned above, each sensor needs to be able to understand their relative position to one another in order for the merging process to ha ve a meaningful meaning. And while there are method s to pre calibrate the sensors base on their setup, we feel it too restrictive for our purpose for these reasons: The process is complicate for end users. It requires third party tools. Restrictive: each time the sensors are set up differently or the actor moves out of the designate spot the system may need to recalibrate. What we will do here is to implement a smart calibration method that will be able to self calibre as long as the sensors are in reasonable position using the info from each skeleton. I t wi ll also allow the actors to move relatively freely without worrying about de synch the data from each sensor. In another word, we will do our calibration and synchronization at run time. Input time set up information can help, but not critical (Right now w e have it as a safety measure).

PAGE 27

18 Ranking system: as mention above one of the biggest drawback of infrared device is it only return s optimal data w hen the actor face the sensor directly, and the quality degrade s as the actor turn away from it. We propose a method to address this problem by distributing a priority ranking system based on how far away an actor turned away from sensor How this is calculated will be described later under the method session.

PAGE 28

19 CHAPTER IV METHOD AND IMPLEMENTATION Base d on the description from the previous session, a more detail workflow is updated: Figure IV .1: Updated work flow jointArray (position,status) angleArray axis n) processSkeleton (jointArray, confidentAngle( processSkel eton (jointArray, confidentAngle( synchAngle added ) Array of processSkeleton Calibrating and syncronize Array of processSkeleton Establish rank and token. Merge final jointArray. jointArray ( position,status) angleArray axis n) processSkeleton (jointArray, confidentAngle( jointArray (position,status) angleArray axis n) processSkeleton (jointArray, confidentAngle( MPI Sensor 1 Sensor 2 Sensor n

PAGE 29

20 Joint Info Processing We start by extracting the raw data to fill our jointArray for each device. These includes the position of each joint and a status flag; true means the joint will not be con sidered for future calculation, false means the data point is good and can be used. For every joint J n (x,y,z, status): o Tracked: indicate the joint is actually seen by the sensor and properly tracked. We set the status flag to true o Inferred: the joint is not actually seen, but its position is extrapolated using other joint. (For example, base d on the wrist and shoulder position the sensor can guess our elbow position). This is mainly how occluded joints are constructed For our purpose this data is not goo d enough but we may fall back on it in case a joint cannot be seen by any sensor. We set the status flag to false o Unknow: there is no information about this joint from sensor (Either it doesn't see it, or the joint is outside the FOV or min/max distance) W e Smooth Filter T o alleviate the noise problem in order to get more accurate data we need some filter s There are many type s of filter that can be employed for processing signal noise such as the Savitzky Golay algorithm that is based on a Least Squares Polynomial smoothing [42]. The balance between accuracy and speed is an important consideration for choosing the right type of filter. Exponential or double Expon ential Smoothing Filter can give good result but also very expensive for interactive application s Most API and l ibrary already comes with a few filters for the

PAGE 30

21 n oise generated on a particular sensor n in dealing with the difference between data generated by different sensors that even in the case of perfect synchronization will still not guarantee to be 100% the same giving the nature of markerless motion capture. In this paper we tackle the issue u sing several methods, while they will not be independent filters they are used directly as part of our data construction. W e construct a uniform model on our sensor base on kinematic data which is balance between speed and accuracy. The reason why we decom posed and reconstruct our skeleton data instead of using the raw joint position data from each sensor is to ensure the uniform between our presentations. It may seems redundant due to our new data is calculated based on the raw data, but by reconstructing them using only the angle and rotation of each bone, we can impose the same offset for the bone sections between the joint across all sensor s This has the benefit of our new skeleton will have the same orientation value as the raw data, but the noise of t he specific joint location is reduced. Later on we also use a joint average filter which is cheap, but re liable if we a re able to feed it with good data which are provided by our uniform models and a ranking system. A short white paper by Mehran Azimi in [ 43] will give reader s a brief idea of all the filters mentioned so far in this session (except the ranking system). Building T he Angleset U sing the information from jointArray, we break down the 20 joint s into 19 set of angles A x (J a J b J c ) where x is the angle from one particular joint to another and Ja, Jb, Jc are 3 connected joints from jointArray.


Upper Body:
Aspine(Jspine, JshoulderC)                // This is an exception (only two joints).
Aneck(Jspine, JshoulderC, Jhead)
AshoulderR(Jspine, JshoulderC, JshoulderR)
AbicepR(JshoulderC, JshoulderR, JelbowR)
AforarmR(JshoulderR, JelbowR, JwristR)
AhandR(JelbowR, JwristR, JhandR)
AshoulderL(Jspine, JshoulderC, JshoulderL)
AbicepL(JshoulderC, JshoulderL, JelbowL)
AforarmL(JshoulderL, JelbowL, JwristL)
AhandL(JelbowL, JwristL, JhandL)

Lower Body:
Ahipcenter(Jspine, JshoulderC, JhipC)
AhipR(JhipC, Jspine, JhipR)
AthighR(JhipR, JhipC, JkneeR)
AlegR(JkneeR, JhipR, JankleR)
AfootR(JankleR, JkneeR, JfootR)
AhipL(JhipC, Jspine, JhipL)
AthighL(JhipL, JhipC, JkneeL)
AlegL(JkneeL, JhipL, JankleL)
AfootL(JankleL, JkneeL, JfootL)

We will then use these 19 angles to calculate the new 20 joint locations for our uniform skeleton.


Figure IV.2: the joint structure and their relationships (Spine, Head, Shoulder Center, Shoulders, Elbows, Wrists, Hands, Hip Center, Hips, Knees, Ankles and Feet).

The method to construct Ax is as follows. First we check the status of the 3 joints that make up the angle: for our purpose (Ja.Status && Jb.Status && Jc.Status) has to be true for the set to be taken into calculation. This requirement is meant to enhance the accuracy and authenticity of the later reconstructed joint, instead of relying only on the status of each individual joint. For example, in the pose where the actor points the hand directly at the sensor, only the hand joint is visible and tracked correctly while the wrist and elbow joints are occluded (and thus inferred). While for a simple tracking application the correct position of the hand might be sufficient, this is not a reliable construct in terms of hierarchical kinematics.


Figure IV.3: the case of multiple occlusions (visible vs. invisible joints). While it might still be accurate enough for a simple application, the kinematic information between these 3 joints is not reliable.

Once all joints in Ax(Ja, Jb, Jc) are verified, we calculate the angle θ between them as well as their axis of rotation n:

a = (Ja - Jb).Normalize();
b = (Jc - Jb).Normalize();
θ = acos(a · b);
n = CrossProduct(a, b).Normalize();

Also, for each Ax that is verified, we set the associated flag to true, indicating it can be used in reconstruction.
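A minimal sketch of this angle-set construction (System.Numerics is used here only for illustration; the AngleSet struct and BuildAngleSet helper are hypothetical names, not the thesis code):

using System;
using System.Numerics;

struct AngleSet
{
    public float Angle;     // angle theta at the middle joint Jb
    public Vector3 Axis;    // rotation axis n
    public bool Verified;   // true only if all three joints are properly tracked
}

static AngleSet BuildAngleSet(Vector3 ja, bool jaStatus,
                              Vector3 jb, bool jbStatus,
                              Vector3 jc, bool jcStatus)
{
    var set = new AngleSet();

    // All three joints must be properly tracked for the set to be trusted.
    if (!(jaStatus && jbStatus && jcStatus))
        return set;                       // Verified stays false

    Vector3 a = Vector3.Normalize(ja - jb);
    Vector3 b = Vector3.Normalize(jc - jb);

    float dot = Vector3.Dot(a, b);
    if (dot > 1f) dot = 1f; else if (dot < -1f) dot = -1f;   // guard acos domain

    set.Angle = (float)Math.Acos(dot);
    set.Axis = Vector3.Normalize(Vector3.Cross(a, b));
    set.Verified = true;
    return set;
}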


Skeleton Reconstruction

Once the data of the angleSet A is complete, we begin our first reconstruction. Starting from the spine joint we apply a forward kinematic process based on the angular information and pre-defined offsets to map out the skeleton. The most critical part of this reconstruction process is the rotation matrix that allows us to calculate a new joint by rotating a vector by a certain angle θ around an arbitrary axis. To rotate a unit vector b about an arbitrary axis n by an angle θ, the derivation is described as follows:

Figure IV.4: the rotation model (the vector b, its components b∥ and b⊥, the auxiliary vector T and the axis n).

(1) Decompose b into a component b∥ parallel to n and a component b⊥ orthogonal to n:
    b = b∥ + b⊥

(2) The rotated vector b' keeps the parallel part and rotates only the orthogonal part about n:
    b' = b∥ + b⊥'

To rotate b⊥ about n, we define a new vector T that is perpendicular to both b∥ and b⊥ and has the same length as b⊥:
    T = n × b⊥

Now we can calculate b⊥':
(3) b⊥' = b⊥ cos θ + T sin θ

with
    b∥ = (b · n) n
    b⊥ = b − b∥ = b − (b · n) n
    T  = n × b⊥ = n × (b − b∥) = n × b − n × b∥ = n × b        (since n × b∥ = 0)

Substituting b⊥ and T into (3) we have:
    b⊥' = (b − (b · n) n) cos θ + (n × b) sin θ

Substituting (3) into (2) we have:
(4) b' = (b · n) n + (b − (b · n) n) cos θ + (n × b) sin θ

We will now apply equation (4) to rotate each component unit vector p = (1, 0, 0), q = (0, 1, 0) and r = (0, 0, 1) of the original matrix. We will do this in column form and apply a transpose on them later.

(5) p' = ( nx²(1 − cos θ) + cos θ,
           nx ny(1 − cos θ) + nz sin θ,
           nx nz(1 − cos θ) − ny sin θ )

Applying the same process to the vectors q and r, we have:

(6) q' = ( nx ny(1 − cos θ) − nz sin θ,
           ny²(1 − cos θ) + cos θ,
           ny nz(1 − cos θ) + nx sin θ )

(7) r' = ( nx nz(1 − cos θ) + ny sin θ,
           ny nz(1 − cos θ) − nx sin θ,
           nz²(1 − cos θ) + cos θ )

Now applying the transpose to (5), (6), (7) to form the rotation matrix:

    R(n, θ) = | nx²(1 − cos θ) + cos θ        nx ny(1 − cos θ) + nz sin θ    nx nz(1 − cos θ) − ny sin θ |
              | nx ny(1 − cos θ) − nz sin θ   ny²(1 − cos θ) + cos θ         ny nz(1 − cos θ) + nx sin θ |
              | nx nz(1 − cos θ) + ny sin θ   ny nz(1 − cos θ) − nx sin θ    nz²(1 − cos θ) + cos θ      |

With this rotation matrix for a joint's axis n, we then multiply by a scalar Bn (the pre-defined bone offset, which varies for each specific joint) to get the new joint location:

    Jn_new = Jparent + Bn · (v̂ R(n, θ))

where v̂ is the unit direction of the parent bone being rotated.

For joints that are properly calculated thanks to the availability of the associated angle Ax, we set the status of the joint to true. For joints whose associated angle data is not available, we temporarily fill in the original sensor data, but set the joint status to false. Structure-wise this new jointArray is the same as the initial jointArray, with the difference that it is constructed using hierarchical kinematics as opposed to raw data. There are two purposes to constructing this new jointArray:

As discussed above, it helps us build a more reliable model based on the kinematic structure, and thus minimizes the error that can be produced from the raw data (noise).

By using the angle information with the same pre-defined offsets on all sensors, we are able to produce a uniform skeleton model across all processes and eliminate the differences resulting from the distance and scaling of each sensor. The uniform skeleton model is a critical key in our calibration and synchronization.

We then embed this new jointArray into our processSkeleton class, with the addition of the two confidentAngles (explained later). Then we send this information from each child process to the main process.
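As an illustrative sketch of this forward kinematic step (System.Numerics is assumed; Matrix4x4.CreateFromAxisAngle builds the same axis-angle rotation as R(n, θ) above, and the parameter names and parent-bone bookkeeping are assumptions for illustration rather than the thesis code):

using System.Numerics;

// Rebuild one child joint from its parent joint, the direction of the parent
// bone, the angle set (axis n, angle theta) and the pre-defined offset Bn.
static Vector3 ReconstructJoint(Vector3 parentJoint,
                                Vector3 parentBoneDirection,
                                Vector3 axisN, float theta,
                                float boneOffsetBn)
{
    // R(n, theta): rotation about the arbitrary axis n by theta,
    // equivalent to the matrix assembled from equations (5)-(7).
    Matrix4x4 rotation = Matrix4x4.CreateFromAxisAngle(Vector3.Normalize(axisN), theta);

    // Rotate the parent bone direction, scale it by the offset Bn and attach
    // it to the parent joint to obtain the new joint location.
    Vector3 rotated = Vector3.TransformNormal(Vector3.Normalize(parentBoneDirection), rotation);
    return parentJoint + boneOffsetBn * rotated;
}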


Data Synchronization

Assuming we have a powerful enough system that each sensor can stream its own data without a performance hit, there is still the issue of communication lag between the sensors. We assume each device has a unique time stamp for each data frame, starting from the time the sensor is started. This still leaves the issue of ensuring all devices are started at the same time, since in our observation even within the same machine each device can have up to one second of lag between their initializations. We could handle this by analysing the data stream and looking for a common point of interest (i.e. the frame where the same action takes place), calculating the offset between these frames and applying it for the duration of the process. We find that using MPI barriers is a simpler way to ensure we get the closest frames. We use two barriers. First we place a barrier before and after all sensor initialization:

Intracommunicator comm = Communicator.world;
comm.Barrier();
sensor.start();
comm.Barrier();

Second, we also place a barrier before and after each frame is gathered:


comm.Barrier();
Skeleton[] skeletonArray = comm.Gather(skeletondata, 0);
comm.Barrier();

At 30 frames per second we allow a margin of error of 0.2 s for each frame.
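A condensed sketch of this barrier-and-gather loop (MPI.NET as in the snippets above; the Sensor type, the GetNextFrame and ProcessFrames helpers and the explicit tolerance check are illustrative assumptions, not the exact thesis code):

using MPI;
using Microsoft.Kinect;

// Each process runs this loop for its own sensor; rank 0 is the main process.
static void CaptureLoop(Intracommunicator comm, Sensor sensor)
{
    // Barriers around initialization so every sensor starts at (nearly) the same time.
    comm.Barrier();
    sensor.start();
    comm.Barrier();

    const double frameTolerance = 0.2;   // seconds allowed between gathered frames

    while (true)
    {
        comm.Barrier();
        Skeleton skeletondata = sensor.GetNextFrame();            // hypothetical helper
        Skeleton[] skeletonArray = comm.Gather(skeletondata, 0);  // non-null on rank 0 only
        comm.Barrier();

        if (comm.Rank == 0)
        {
            // Frames whose time stamps differ by more than the tolerance are dropped.
            ProcessFrames(skeletonArray, frameTolerance);         // hypothetical helper
        }
    }
}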


Data Calibration

Up until now all the tasks carried out so far have been performed on an individual basis, with each process working with its respective sensor. This is where our method differs from the traditional calibration discussed in previous works. Usually the data about the setup of the sensors is given at input time to ensure each sensor can understand and interpret the others; in other methods the setup data is used in post-processing when the data is merged. The method we are describing does the opposite. Each sensor works independently without needing the calibration data of any other device, but they all produce a uniform skeleton model in their respective world spaces. When these uniform skeletons are sent to the space of the reference sensor, they are synchronized with that space at run time.

When each skeleton arrives at the main process, it is put into the world space of the reference sensor. This is the sensor, defined by the user, whose coordinate space provides the orientation and displacement for the final model and to which all other sensors are synched. We can call this our active work space, or global space. At first the skeletons from the other sensors will appear at arbitrary locations and orientations within the reference space, because their original local spaces were all different. We synchronize them to the same point through a series of translations and rotations, and the information needed can be calculated from each skeleton itself.

Translation: the translation offset is simple to calculate. Similar to the approach in [20] for the offset between two data sets, we can get this information by calculating the difference between the same feature points in each frame of reference. However, thanks to our preparation we already have the reference point that we need. Ideally, if each sensor were able to return a perfect set of data, then each of our 20 joints could be used as a reference point, but this will most likely not be the case. All we need is the displacement between each child skeleton and the reference skeleton. For this calculation to be accurate we need to pick a center point on each skeleton. We choose the spine joint for this, given its position in the middle of the skeleton. We also found that this is the most reliably tracked joint at all times, since it is not affected by occlusion in most poses. For each skeleton T we calculate a displacement vector m with respect to the reference skeleton R:

m_x = R_spine.x − T_spine.x
m_y = R_spine.y − T_spine.y
m_z = R_spine.z − T_spine.z

For each joint J in skeleton T:

J = J.Translate(m_x, m_y, m_z);

To further enhance the accuracy of the translation, we also calculate the difference between the hip center joints and average the two values.
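A minimal sketch of this translation step (System.Numerics.Vector3 stands in for the joint positions; averaging the spine and hip-center offsets follows the description above, and the array layout is an illustrative assumption):

using System.Numerics;

// Translate every joint of a child skeleton so that it lines up with the
// reference skeleton, using the spine and hip-center joints as anchors.
static void AlignTranslation(Vector3[] childJoints, Vector3[] referenceJoints,
                             int spineIndex, int hipCenterIndex)
{
    Vector3 spineOffset = referenceJoints[spineIndex] - childJoints[spineIndex];
    Vector3 hipOffset   = referenceJoints[hipCenterIndex] - childJoints[hipCenterIndex];
    Vector3 m = (spineOffset + hipOffset) * 0.5f;   // averaged displacement vector

    for (int j = 0; j < childJoints.Length; j++)
        childJoints[j] += m;
}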


Figure IV.5: The displacement between two skeletons.

After the translation we need a rotation in order to put all skeletons in the same orientation; this process is more complicated than the translation above. We establish the idea of a plane that each skeleton is based on. This plane is constructed using a set of 3 joints on the skeleton; which 3 joints are used, and under what circumstances, has to take these considerations into account: they have to be well representative of the body orientation, and they need to be the least affected by movement. Picking the correct 3 joints to construct this plane is a difficult task because the human body can twist. The spine joint is an obvious candidate for the first joint of any set, given its location. The set of (spine, right shoulder, left shoulder) joints can construct a good plane to represent the upper body, but not necessarily the entire body if the actor is twisting the hip; the same applies for the set of (hip center, right hip, left hip). The safest method to construct the plane is to use a calibration pose for each sensor when the actor is not twisted. A more dynamic way of doing it is to average two or more planes from the same skeleton. We then use the cross product to find the normal vector of each plane and calculate the angle between each plane and the reference plane. Since all skeletons are already at the same root from the previous translation, we create a rotation axis parallel to the Y axis through the spine joint, and rotate all the joints of each skeleton around this axis by their respective angle. The result is that all skeletons from each sensor are now in the same reference frame.
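A sketch of this rotation step (System.Numerics is assumed; the plane is built from three joints as described above, and measuring the signed angle between the two normals in the horizontal XZ plane is an illustrative choice, not necessarily the thesis implementation):

using System;
using System.Numerics;

// Normal of the plane through three joints (e.g. spine, right shoulder, left shoulder).
static Vector3 PlaneNormal(Vector3 a, Vector3 b, Vector3 c)
{
    return Vector3.Normalize(Vector3.Cross(b - a, c - a));
}

// Rotate every joint of an already-translated child skeleton about the Y axis
// through its spine joint so that its plane normal matches the reference normal.
static void AlignRotation(Vector3[] childJoints, Vector3 childNormal,
                          Vector3 referenceNormal, int spineIndex)
{
    // Signed angle between the two normals, measured in the horizontal (XZ) plane.
    float angle = (float)(Math.Atan2(referenceNormal.X, referenceNormal.Z)
                        - Math.Atan2(childNormal.X, childNormal.Z));

    Matrix4x4 rotation = Matrix4x4.CreateRotationY(angle, childJoints[spineIndex]);
    for (int j = 0; j < childJoints.Length; j++)
        childJoints[j] = Vector3.Transform(childJoints[j], rotation);
}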


We believe there is a more reliable way to calculate the angle through the use of an augmented sensor. This sensor can be set to track only the body orientation instead of performing full body tracking, and can thus be set up and calibrated specifically for this purpose. For example, if a sensor can reliably track the two shoulder joints, we can use the data to calculate the body angle based on the coordinates of the two shoulders as:

atan2(shoulderRight.Z − shoulderLeft.Z, shoulderLeft.X − shoulderRight.X)

Merging The Final Skeleton

Now that we have all skeleton data synchronized, we will use it to reconstruct our final skeleton. Recalling from the previous discussion in this section, the data from each individual skeleton might not be complete or accurate because:

Joints are occluded, and thus their positions are inferred rather than reconstructed, due to the lack of kinematic data.

Joints are degraded because the actor is turning away from the sensor.

Our reconstruction process will basically select the best possible data from each skeleton for each joint. To help with this selection process we propose a ranking and token system.


Figure IV.6: the process of synchronizing the local space of a skeleton into the space of the reference skeleton. (a): a skeleton from a child process. (b): the reference skeleton. We establish a plane for each skeleton using their joint info. (c): as a result of the previous translation the two planes are already in the same local space, so we only need the angle between their normal vectors. (d): the two planes should nearly overlap each other after the rotation.

Again recalling from the previous section, our processSkeleton class carries two confidentAngle values that indicate how far the actor is turning away from the camera. Their concept and construction are similar to the construction of the planes in the previous section, in that the planes are constructed using a set of 3 joints on the skeleton, but with a few key differences:


The confidentAngle is measured between the skeleton's plane and the Z axis of the sensor, rather than between two skeleton planes. There are two such angles: one for the upper body (above the spine joint) and one for the lower body (below the spine joint). Unlike the single angle used for the sensor calibration, the confidence angle can vary between the two halves of the same skeleton. For example, when an actor twists in the same pose, a certain sensor might return good data for the facing half of the body while the other half is better captured by another sensor. The two plane normals used are:

Cross(spine, right shoulder, shoulder center).normalize() for the upper body, and
Cross(spine, right hip, hip center).normalize() for the lower body.
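A sketch of how such a confidence angle could be computed (System.Numerics is assumed; interpreting the turn as the angle between the plane normal and the sensor's Z axis measured in the XZ plane is an illustrative reading of the description above):

using System;
using System.Numerics;

// Confidence angle for one half of the body: how far that half is turned away
// from the sensor, derived from the plane through three of its joints.
static float ConfidenceAngleDegrees(Vector3 spine, Vector3 jointA, Vector3 jointB)
{
    Vector3 normal = Vector3.Normalize(Vector3.Cross(jointA - spine, jointB - spine));

    // Angle between the plane normal and the sensor's Z axis in the XZ plane.
    float radians = (float)Math.Atan2(normal.X, normal.Z);
    return radians * 180.0f / (float)Math.PI;
}

// Usage (joint positions come from the uniform skeleton of one sensor):
//   float upper = ConfidenceAngleDegrees(spine, rightShoulder, shoulderCenter);
//   float lower = ConfidenceAngleDegrees(spine, rightHip, hipCenter);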


Table IV.7: Joint Ranking System

Degree of turning (x)    Rank    Description
|x| > 60°                4       Data degradation can be very bad. Ignore this joint if better data is available from other skeletons.
60° > |x| > 30°          3       Data degradation is noticeable; recommend averaging the value with other skeletons of the same rank.
30° > |x| > 10°          2       Data can generally be considered stable enough; recommend averaging with other skeletons of the same rank if possible.
|x| < 10°                1       The best possible data; it can be used standalone, although one might want to consider averaging with skeletons in rank 2. Also consider giving this skeleton set the lead token.

Before merging we will briefly explain the purpose of the lead token. One thing to note is that the sensor with the lead token is not necessarily the same sensor being used as the reference, although it can be, and technically it is ideal for the role. However, repeatedly changing the reference sensor is not desirable in terms of speed, since it needs more synchronization passes (more rotations), and it can cause inconsistency in the final model's location and orientation. Our main purpose in introducing this token, aside from pointing out which sensor has the best data, is that it also provides a heuristic locality for adjacent sensors. For example, the two sensors immediately next to the lead sensor will most likely hold better data than those that are farther away; this can be used for data optimization without paying for expensive arithmetic. It also provides a good indicator for the case when we have no choice but to change the reference camera.


This usually happens when the actor has turned so far away (more than 60°) from the reference sensor that the data it receives cannot be used to reliably calculate the reference angle; in this case we have to change our reference sensor to one that is closer to the direction the actor is facing (and then synch it back to the original orientation). Technically, every sensor can hold the lead token when the actor is facing it, but arbitrarily making every sensor into a potential reference has the disadvantages described above. It is better (and cheaper) to specify a set of potential sensors to be used as references, and switch between them.

Figure IV.8: Let Black, Yellow and Red serve as our potential reference sensors. If Green holds the lead token, then Black will serve as our reference; if Blue holds the lead token, then we pick Red as our reference sensor.

Another advantage of this setup is that the user or application can add or remove sensors between the reference sensors without needing to recalibrate the whole system each time. Finally, after all the necessary preparation, we merge the data and construct our final skeleton with a simple process. We consider all the candidates for each joint from each skeleton. Starting from the sensor with the lead token, we check all adjacent sensors for their joint status and rank.


The user can set a constraint on how many sensors away from the lead sensor to consider, to reduce the amount of calculation. The joint selection is described in Table IV.9 below.

Table IV.9: Joint selection depending on its status and rank.

Rank 1 - Status True: Take. Status False: Ignore; if the joint is reported false on all sensors, then take based on confidence value.
Rank 2 - Status True: Take; average with others of the same condition. Status False: Ignore.
Rank 3 - Status True: Take or ignore depending on which joint; ignore if there are better joints available. Status False: Ignore.
Rank 4 - Status True: Ignore if any better joint is available; only take if all other skeletons are in the same condition. Status False: Ignore.

Note that according to the table, the status of a joint takes precedence over the ranking system. The reason is that even if the actor is facing the camera, a false status means the joint is being occluded, so it is better to use data from a sensor that can track it properly, even though from a less than optimal angle. In the case where a joint cannot be tracked by any sensor, we will use the inferred data from the top-ranked sensor, assuming it is in the best position to guess the missing joint.
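A condensed sketch of this selection step (the rank thresholds follow Table IV.7 and the selection rules follow Table IV.9; the JointCandidate type and the plain averaging are illustrative assumptions, not the exact thesis code):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

// One candidate for a joint, coming from one sensor's synchronized skeleton.
struct JointCandidate
{
    public Vector3 Position;
    public bool Status;       // true = reconstructed from kinematic data
    public float TurnAngle;   // confidence angle (degrees) of that body half
}

static int Rank(float turnAngleDegrees)
{
    float a = Math.Abs(turnAngleDegrees);
    if (a < 10f) return 1;
    if (a < 30f) return 2;
    if (a < 60f) return 3;
    return 4;
}

static Vector3 MergeJoint(List<JointCandidate> candidates)
{
    // Status takes precedence over rank: prefer properly reconstructed joints.
    var usable = candidates.Where(c => c.Status).ToList();
    if (usable.Count == 0)
        // No sensor reconstructed this joint; fall back to the inferred data
        // of the best-ranked (most front-facing) sensor.
        return candidates.OrderBy(c => Rank(c.TurnAngle)).First().Position;

    // Keep only the best available rank and average the candidates of that rank.
    int bestRank = usable.Min(c => Rank(c.TurnAngle));
    var best = usable.Where(c => Rank(c.TurnAngle) == bestRank).ToList();

    Vector3 sum = Vector3.Zero;
    foreach (var c in best) sum += c.Position;
    return sum / (float)best.Count;
}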


CHAPTER V
EXPERIMENTS AND RESULTS

We ran a setup with 3 Kinect sensors in various test configurations to verify the concepts introduced so far.

Calibration Test

The first test we ran checks the accuracy of our smart calibration and synchronization method. To perform this test we prepared three sets of data:

The skeleton data from the reference sensor (skeleton A).
The skeleton data from a child sensor calibrated at input time with physical measurements (skeleton B).
The skeleton data from a child sensor synchronized with data calculated at run time (skeleton C).

Our metric is as follows: for each joint x we calculate the distance between that joint and the root position, giving D_Ax, D_Bx and D_Cx for skeletons A, B and C. Then we calculate the percent error E against the reference as E_B = |D_Bx − D_Ax| / D_Ax × 100% (and similarly E_C for skeleton C).

The graphs below compare the two types of calibration; the data is taken over 600 frames (about 20 seconds) during which the actor performs a variety of poses. Each line is the error averaged over all joints. Overall, the traditional calibration edges out the run-time calibration method by about 1.5% on average, with about 3% error versus about 4.5%. We believe this error is sufficiently small to justify the advantages of the proposed method.
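A small sketch of this metric (the percent-error form given above is assumed; joint positions are System.Numerics.Vector3 and the root is the spine joint):

using System;
using System.Numerics;

// Percent error of a calibrated skeleton against the reference skeleton,
// averaged over all joints, using root-relative joint distances.
static float AveragePercentError(Vector3[] reference, Vector3[] calibrated, int rootIndex)
{
    float total = 0f;
    int count = 0;
    for (int x = 0; x < reference.Length; x++)
    {
        if (x == rootIndex) continue;
        float dRef = Vector3.Distance(reference[x], reference[rootIndex]);
        float dCal = Vector3.Distance(calibrated[x], calibrated[rootIndex]);
        total += Math.Abs(dCal - dRef) / dRef * 100f;
        count++;
    }
    return total / count;
}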


Table V.1: the percentage of error between two different types of calibration. The B (blue) line represents the traditional input-time calibration while the C (red) line represents the run-time calibration.

Table V.2: the same data set as V.1, but the values are averaged over every 100 frames to better illustrate the differences between the two calibration methods. Some frames represent cases where a particular sensor cannot come up with good data; on those frames the data will likely be provided by another sensor.


We include here a few screenshots from our proof-of-concept prototype to demonstrate the visual concepts mentioned so far. This first set of screens demonstrates the flexibility and accuracy of our calibration method. For the first set of 3 screens in Figure V.3: we arbitrarily dropped two sensors a few meters away from each other and eyeballed them toward the same general direction. The first screen (a) shows the skeletons from each sensor after the child skeleton (blue) is sent and put in the space of the reference skeleton (red). The second screen (b) shows the result of our synchronization; as shown, we achieve a pretty good overall overlap between the two skeletons. This result was achieved without any physical measurement: as described in our method, the calibration data came from our main process itself at run time. For the third screen (c), we left the program running and had a person pick up the child (blue) sensor and arbitrarily move it around. It was displaced by around a meter in several directions with some small turns left and right; the screen was captured during this process and it shows our method was able to keep up with the changes in real time, with the two skeletons still closely matched.


Figure V.3: (a) The data from 2 casually placed sensors before synchronization. (b) The skeletons synced into the same coordinate space using our new method, with the sensors still at their original positions. (c) One sensor is displaced (during run time); we can see that the action does not affect our synchronization by much.


Construction Test

The second set of screens demonstrates our merging process and how a new skeleton can be constructed using partial data from multiple skeletons. For the set of 3 screens in Figure V.4 we again set up two sensors, but their locations are more deliberate and closer to an actual camera arrangement for a traditional markerless motion capture system. The child sensor (green) is put about 2 meters away from the reference (red) sensor, and their direct lines of sight form an angle of 30 degrees at their intersection. None of this setup information is available to our system at input time; it will figure it out itself. The first screen (a) shows the two skeletons from the two sensors before synching. We can notice a few differences here compared to the first screen of the previous set. First is the angle between the two skeletons. Another difference is that the two skeletons are rendered on the same screen to emphasize the differences between them due to scaling caused by distance, camera lens and orientation. The second screen (b) shows the two skeletons synched, basically the same concept as the second screen from the first set. In the third screen (c) we see another skeleton (blue); this is the newly merged skeleton. We also rendered the original two skeletons (pre-synching) to serve as reference. We purposely stretched the right arm completely outside the red sensor's FOV, causing its data to be unavailable. But the green sensor, being at a different angle, is still able to track that arm, as shown on the green skeleton, and it was therefore used to construct the missing arm on the blue skeleton (after synching and merging).


Figure V.4: Skeleton merging. (a) Raw data of the two skeletons. (b) The two skeletons synched. (c) Merged data, demonstrating the use of different skeletons to construct missing joints.


Modelling Test

Visually this allows us to render poses that would otherwise be impossible to render with one camera, specifically poses with the actor turning at a great angle and/or with occlusion. The last set of pictures shows the skeleton data rendered with an actual skinned model, to demonstrate what is possible before and after applying our proposed method.

Figure V.5: Skinned model demonstration.

Here we can see the result of the same pose from the same actor:


The first screen (a): with one sensor, the actor faces the sensor directly. There is no occlusion and the pose is rendered correctly. The second screen (b): the same pose, but the actor turns about 70° away from the sensor. It is easy to see that the pose is not rendered correctly. The overall orientation is mixed up and the occluded arm has a strange rendering. This is not a pose-specific problem, but in general what happens to the model when an actor turns too far away from the sensor, for the reasons we mentioned earlier. The third screen (c): with the combination of multiple sensors the pose is now rendered in good quality. The orientation is correct and the occluded part is also rendered satisfactorily. We matched our three-sensor setup against the existing XNA Avatar demo from Microsoft. In the Avatar demo, the tracker cannot properly track the actor at a body orientation of more than 30 degrees away from the sensor, and occlusion results in a lot of strange animation on the avatar. Our setup on the other hand shows a very large improvement, with reliable tracking of up to 80 degrees of body orientation away from the reference sensor.


CHAPTER VI
DISCUSSION

Most consumer grade sensors have a frame rate of 30 FPS, so if an actor moves too fast there might be loss of data. This prevents the tracking of fast movements. In the case of the Kinect we use the rotation of the arm as a measurement, and we find that more than three rotations per second is too fast for the sensor to track properly. These sensors also have a rather low depth resolution (640x480 for the Kinect), so the quality of the data is limited. This is a device limitation that can only be fixed by using better hardware.

An infrared device is not meant to capture 360-degree movements. It has no real way to determine whether it is looking at the face or at the back of an actor (the depth map is identical from the front and back). This means that sensors at the back and front of an actor might return conflicting data while both can be considered to have high confidence values. We are exploring some methods to help with this issue, like the facial recognition algorithm using the RGB image proposed in [7]. Another method is to place a color marker on the front and back of the actor and use the sensor to help the process. Our lead token may be able to help speed up this process. In theory, it is possible to cover 360-degree movements without completely surrounding the actor with sensors if we can build a good extrapolation algorithm.

The usage of MPI does introduce some overhead due to messaging. This happens with inter-process communication on one computer and is also expected in a multi-computer setup. In our three-sensor setup the lag between two of the sensors is around half a second, but the third sensor has a significant lag of up to 1.8 seconds due to bus contention. For further optimization the algorithm can be modified with a shared memory architecture or thread spreading to alleviate the problem.


When more computing nodes (sensor hubs) are added, a lag is expected in the communication between nodes; we believe this issue can easily be minimized with a good local network. Also note that we have a cap on overhead in our setup. In most conventional multi-sensor setups, adding more sensors improves the result but also means more computation due to the increase in data volume. Adding more sensors in our setup, while improving fidelity, will not add more data volume. This is due to the locality established by the lead sensors, which allows us to pick a specific set of data to work with based on the body orientation. This gives our model a great degree of flexibility in terms of scaling.

Our three-sensor setup serves as a test bed and proof of concept. With enough resources and further tuning, the system can be scaled up to 360-degree real-time body tracking. This can have a wide range of applications in entertainment, research, and training. One suggestion that came up near the end of this research is that right now the system is good enough for slow to moderate activities like Tai Chi training. With better hardware the system may be improved to include other sports like golf training. The ability to have a larger range of movement can also offer more flexibility in the entertainment field, allowing users to have an enhanced digital reality beyond the current limitations of the devices.

So far the method does not have a way to properly process complex poses like a body bending forward or backward, as this nullifies the effectiveness of the confidence angles because the normal vectors cannot be appropriately calculated. We believe that if we can identify two sensors on the sides of the body, we can set additional conditions based on the upper and lower body positions these two sensors provide, and rectify this problem.


However, this is also a problem with the pose classifier used during the training of the depth image, so we may have to improve that part as well in order to obtain a wider pose recognition pattern.

At the beginning of this research we considered synchronizing the depth images from each device, as described in [21]. This method could provide better synchronization data, since the joint data is built from the depth image. However, it proved to be very restrictive, as it takes a lot more calculation to calibrate the different depth images into the same space. Even after that is done, the actor is rooted to one spot, since any displacement can put the images out of phase. By letting each sensor work on the best possible construct it can achieve and merging later, we simplified the calibration process while achieving better flexibility. We also consider improving our prototype and adding a UI to allow us to test in more conditions and to make interpretation of the test data easier.


CHAPTER VII
CONCLUSION

Our experiments so far have introduced a method of merging data from multiple infrared devices, and we showed that this method can be applied with good results when used with consumer grade devices on the market. We introduced a simple but efficient calibration process which helps lessen the constraints in the setup process (our prototype showed the method works reasonably well even with a casual, eyeballed setup). This method works thanks to the construction of a uniform skeleton and a selection of data that enhances the capabilities of the sensors and provides more accurate tracking. This in turn gives more flexibility to the actions that can be performed by the actor (a greater turn angle, a bigger field of movement, and more reliable handling of occluded poses). The combination of using MPI for direct communication and the lead token/reference sensors gives our method a good level of scalability, since the calculation complexity does not scale with data volume. Depending on the needs of the application, the degree of accuracy/complexity required and the desired freedom of angle, more sensors can be added without making a lot of changes to the base method. The method runs with limited overhead and performs reasonably well in a real-time environment, so it can be used in interactive software or in the field of robotics, since the data reproduces a natural animation. All of this is done using devices that are readily available to the mainstream user, and thus may have a wide range of applications.


REFERENCES

[1] J. Shotton, A. Fitzgibbon, M. Cook et al. Real-Time Human Pose Recognition in Parts from Single Depth Images. Microsoft Research Cambridge & Xbox Incubation.
[2] S. Corazza, L. Mundermann, A. M. Chaudhari et al. A Markerless Motion Capture System to Study Musculoskeletal Biomechanics: Visual Hull and Simulated Annealing Approach. Annals of Biomedical Engineering, Vol. 34, No. 6, June 2006.
[3] V. Dhochev, T. Vassilev, and B. Spanlang. Image-space Based Collision Detection in Cloth Simulation on Walking Humans. International Conference on Computer Systems and Technologies.
[4] S. Izadi, D. Kim, O. Hilliges et al. KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Sensor. Microsoft Research Cambridge, UK.
[5] Y. Liu, C. Stoll, J. Gall et al. Markerless Motion Capture of Interacting Characters using Multi-View Image Segmentation. Automation Department, TNList, Tsinghua University.
[6] K. Berger, K. Ruhl, Y. Schroeder et al. Markerless Motion Capture using Multiple Color-Depth Sensors. The Eurographics Association, 2011.
[7] L. Ballan and G. M. Cortelazzo. Marker-less motion capture of skinned models in a four sensor set-up using optical flow and silhouettes. Proc. of the Fourth International Symposium on 3D Data Processing, Visualization and Transmission, 2008.
[8] L. Mundermann, S. Corazza and T. P. Andriacchi. The evolution of methods for the capture of human movement leading to markerless motion capture for biomechanical applications. For submission to Journal of NeuroEngineering and Rehabilitation.
[9] M. Camplani and L. Salgado. Adaptive Spatio-Temporal Filter for Low-Cost Sensor Depth Maps. Universidad Politecnica de Madrid, Spain, 2012.
[10] M. Camplani and L. Salgado. Efficient Spatio-Temporal Hole Filling Strategy for Kinect Depth Maps. Universidad Politecnica de Madrid, Spain, 2012.
[11] S. Matyunin, D. Vatolin, Y. Berdnikov, and M. Smirnov. Temporal filtering for depth maps generated by Kinect depth sensor. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, May 2011.
[12] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. Robotics and Automation, IEEE International Conference, May 2011.
[13] M. Siddiqui and G. Medioni. Human pose estimation from a single view point, real-time range sensor. In CVGG at CVPR, 2010.


Y. Zhu and K. Fujimura. Constrained optimization for human pose estimation from depth sequences. In Proc. ACCV, 2007.
[14] S. Knoop, S. Vacek and R. Dillmann. Sensor fusion for 3D human body tracking with an articulated 3D body model. In Proc. ICRA, 2006.
[15] A. Cappozzo, F. Catani, U. Della Croce and A. Leardini. Position and orientation in space of bones during movement: anatomical frame definition and orientation. Clin. Biomech. 10:171-178, 1995.
[16] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE PAMI 16:150-162, 1994.
[17] I. A. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In Proc. IEEE CVPR, pp. 81-87, 1996.
[18] W. Matusik, C. Buehler, R. Raskar, S. Gortler and L. McMillan. Image-based visual hulls. Proc. ACM SIGGRAPH, pp. 369-374, 2000.
[19] N. Hasler, B. Rosenhahn, T. Thormahlen et al. Markerless motion capture with unsynchronized moving cameras. Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 224-231.
[20] Y. Kim, D. Chan, C. Theobalt, S. Thrun. Design and calibration of a multi-view sensor fusion system. Computer Vision and Pattern Recognition Workshops, 2008.
[21] J. Bouguet. Sensor calibration toolbox. http://www.vision.caltech.edu/bouguetj/calib_doc, 2010.
[22] E. De Aguiar, C. Stoll, N. Ahmed et al. Performance capture from sparse multi-view video. In ACM Transactions on Graphics, vol. 27, p. 98, 2008.
[23] L. Guan, J. Franco, M. Pollefeys. 3D object reconstruction with heterogeneous sensor data. 4th International Symposium on 3D Data Processing, Visualization and Transmission, Atlanta, GA, 2008.
[24] G. Baciu, W. S. Wong, H. Sun. An Image-Based Collision Detection Algorithm. The Journal of Visualization and Computer Animation, pp. 181-192, 1999.
[25] A. Balan, L. Sigal, M. Black, J. Davis and H. Haussecker. Detailed human shape and pose from images. In CVPR, 2007.
[26] A. Sundaresan and R. Chellappa. Multi-sensor tracking of articulated human motion using motion and shape. In Asian Conference on Computer Vision, Hyderabad, 2006.


[27] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. In International Conference on Face and Gesture Recognition, 2000.
[28] M. Yamamoto, A. Sata, S. Kawada, T. Kondo and Y. Osaki. Incremental tracking of human actions from multiple views. In CVPR, pp. 2-7, 1998.
[29] D. M. Gavrila and L. S. Davis. 3-D model-based tracking of humans in action: A multi-view approach. In Computer Vision and Pattern Recognition, pp. 73-80, 1996.
[30] I. Kakadiaris and D. Metaxas. Model-Based Estimation of 3D Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, December 2000.
[31] T. Cham and J. M. Rehg. A multiple hypothesis approach to figure tracking. In Computer Vision and Pattern Recognition, v. 2, June 1999.
[32] D. M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding (CVIU), 73, pp. 82-98, 1999.
[33] T. Tan, L. Wang, W. Hu. Recent developments in human motion analysis. Pattern Recognition 36, pp. 585-601, March 2003.
[34] C. Menier, E. Boyer, and B. Raffin. 3D skeleton-based body recovery. In Proc. of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill, June 2006.
[35] R. Kehl and L. V. Gool. Markerless tracking of complex human motion from multiple views. Computer Vision and Image Understanding, 104(2), pp. 190-209, 2006.
[36] P. Besl and H. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, pp. 239-256, 1992.
[37] C. Cedras and M. Shah. Motion-based recognition: a survey. Image and Vision Computing, 13(2), pp. 129-155, 1995.
[38] J. Aggarwal, Q. Cai. Human motion analysis: a review. Computer Vision and Image Understanding, 73(3), pp. 82-98, 1999.
[39] H. Lanshammar, T. Persson, V. Medved. Comparison between a marker-based and a marker-free method to estimate centre of rotation using video image analysis. In Second World Congress of Biomechanics, 1994.
[40] E. E. Stone and M. Skubic. Evaluation of an Inexpensive Depth Sensor for Passive In-Home Fall Risk Assessment. University of Missouri, 2011.
[41] R. W. Schafer. What is a Savitzky-Golay Filter? IEEE Signal Processing Magazine, pp. 111-116, July 2011.


[42] M. Azimi. Skeletal Joint Smoothing White Paper. MSDN Digital Library, 2012.
[43] V. Lepetit, P. Lagger and P. Fua. Randomized trees for real-time keypoint recognition. In Proc. CVPR, pp. 775-781, 2005.
[44] T. Thormahlen, N. Hasler, M. Wand, and H. P. Seidel. Merging of unconnected feature tracks for robust camera motion estimation from video. In CVMP, Nov. 2008.