Preventing overfitting in convolutional neural networks during classification of multivariate spatiotemporal data

Material Information

Preventing overfitting in convolutional neural networks during classification of multivariate spatiotemporal data
Lees, W. Max
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science
Committee Chair:
Banaei-Kashani, Farnoush
Committee Members:
Biswas, Ashis
Choi, Min

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright W. Max Lees. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.


B.A., University of Colorado, 2013
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Computer Science Program

This thesis for the Master of Science degree by W. Max Lees has been approved for the Computer Science Program by
Farnoush Banaei-Kashani, Advisor
Ashis Biswas
Min Choi
Date: December 16, 2017

Lees, W. Max (M.S., Computer Science Program)
Thesis directed by Assistant Professor Farnoush Banaei-Kashani
Multivariate spatiotemporal (MVS) data, such as player locations on a field, troop locations in battle, or human joint locations, can yield useful information when classified. Convolutional neural networks (CNNs) have shown superior results for classification of MVS data. Despite their power, however, convolutional networks suffer from overfitting. As a result, they achieve lower accuracy on unseen data and hence limited generalization. Many methods have been proposed to address the overfitting problem with CNNs, such as dropout, batch normalization, and data augmentation. We identify one structural source of overfitting within convolutional neural networks that existing methods have not addressed, and we present a novel convolutional layer that aims to address this problem. Additionally, we present a form of data augmentation designed for MVS data. We evaluate the benefits of our proposed convolutional layer and show that it significantly decreases overfitting and improves training time. We also show that our data augmentation algorithm improves on existing forms of time series data augmentation. We have performed extensive empirical studies to evaluate both of our contributions using real human gait data, which demonstrated up to 25% improvement over the state-of-the-art MVS classification methods.
The form and content of this abstract are approved. I recommend its publication.
Approved: Farnoush Banaei-Kashani

Firstly, I would like to thank my advisor Assistant Professor Farnoush Banaei-Kashani for helping me find a research topic worth exploring and for pushing me to perform my best work. Additionally, I would like to thank Kyle Estler and Oliver Batista for helping design the data collection procedures and for taking the time to collect a lot of the gait data.

I. INTRODUCTION
II. RELATED WORK
III. BACKGROUND
   Multivariate Spatiotemporal Data
   Convolutional Neural Networks
      Convolutional Layer
      Pooling
      Fully Connected Layers
      Activation Functions
IV. CONVOLUTIONAL COLUMNS
   Structural Cause of Overfitting in CNNs
   Convolutional Columns
V. DATA AUGMENTATION
VI. GAIT DATA SET
VII. EVALUATION
   Whole System
   Convolutional Columns
   Data Augmentation
VIII. CONCLUSION AND FUTURE WORK
   Future Work
REFERENCES

1 Types of gaits collected in the data set
2 Joints captured within the data set

1 The convolutional bundle architecture
2 An example of completion values for a single sequence
3 A view of how data artifacts are smoothed in favor of general trends when using GTW. The figure on the left is the original data. The figure on the right is the smoothed data
4 A single frame in the gait MVS data set
5 The layout of the M-CNN model
6 A comparison of the aggregate system versus a standard CNN model
8 Results from CC-4 in comparison to M-CNN
9 Comparison of M-CNN, CC-4, and DEEP
10 Comparison of M-CNN, CC-4, and PARA
11 The three different convolutional column layer models in log scale
12 The testing accuracy of the three different data aggregation models versus the training epoch

A great deal of useful information can be gained from algorithms able to classify Multivariate Spatiotemporal (MVS) data. MVS data contains two or three dimensional location data of multiple objects tracked through time. This type of data is a subclass of time series and sequence data.
Examples of MVS data include battlefield troop movements, player locations on a field, ride sharing car routes, and human joint locations.
With these examples, a myriad of different useful data classification applications could be considered. For troop movements, potential battlefield tactics could be determined. With player locations, the plays and strategies of the other team could be predicted. For ride sharing, data mining could predict the likelihood of an individual needing a ride in a particular area. The particular action an individual is performing or their identity could all be gleaned by using the joint locations.
Classification of MVS data poses a unique problem, however. Because MVS data is temporal, data that should have the same label can take place in different lengths or dilations of time and at entirely different locations within a time series. There is no guarantee of temporal alignment.
Using the human action recognition example, if the action 'stand up' takes 0.5 seconds in one instance and 0.7 seconds in another, an algorithm attempting to identify the 'stand up' action must be able to recognize both of these instances. Additionally, in one 'stand up' instance, the action may take place at the beginning of the time series. In another instance, the 'stand up' action may occur at the end. Algorithms that identify similarities in MVS data require either the ability to handle possible differences in the time domain or must use some sort of heuristic pre-alignment.
Similarly, because MVS data is multivariate, the multiple variables may each contain different time dilations. To use the human action recognition example again, consider a system that must identify whether two people are shaking hands based on the location of their arm and hand joints. The system must be robust enough to identify a handshake when both individuals raise their hands at the same time as well as when there is some lag between one individual and the other raising their hands. The system must still be able to identify both possible scenarios as a handshake.
Classification of MVS data is generally approached using three different families of classification methods.
The first is Dynamic Time Warping (DTW) and related algorithms. DTW determines the closest possible distance between two unaligned time series sequences while applying a temporal alignment to the sequences that maintains temporal ordering. Because of this temporal alignment, DTW is able to handle different time scales and dilations within the data.
DTW algorithms have been shown to be extremely adept at classifying time series data. One shortcoming is their reliance on comparing unknown sequences to large sets of known sequences. This requirement means that as the number of known sequences increases, the required time to identify an unknown sequence increases.
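The alignment described above can be sketched with the classic dynamic programming recurrence. This is a minimal illustration for 1-D sequences with an absolute-difference local distance; real DTW implementations add windowing constraints and lower bounds, and the function name here is ours, not from the thesis.

```python
def dtw_distance(a, b):
    """Return the DTW distance between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Temporal ordering is preserved: each step may only move
            # forward in one or both sequences.
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

Because the repeated value 2 in `[1, 2, 2, 3]` can be aligned to a single time step, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 even though the sequences have different lengths.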
Feature based classification is another well explored form of classification on sequence data. Hand selected features are extracted from all known sequences. These features are then extracted from any unknown sequences as well and are compared with the features of the known labels. This final comparison can be achieved by anything from Euclidean distance to support vector machines.
Because these features are extracted from the original data, the actual comparison is no longer time series data. Instead the classification is occurring with features such as stride length (for gait identification) or average distance from goal (for soccer play identification).
For feature based classification to work well, however, the features that are extracted must be carefully considered and hand crafted based on knowledge of the data and the domain of the data. Additionally, because features are fed to other existing classification algorithms, we consider feature extraction to be a preprocessing step rather than a classification algorithm in its own right.
Deep learning is the final classification method. Convolutional neural networks (CNNs) have gained state of the art classification results on MVS data. By learning useful features, they are able to leverage the power of feature based preprocessing in order to achieve excellent classification results. Again, the temporal problem is mostly sidestepped by extracting extratemporal features before classification.
CNNs are prone to learning biases that exist in the training set rather than learning generalizations. This learning of biases is known as overfitting. Several tools have been studied for reducing overfitting in all forms of deep learning, both external and structural, with varying levels of effectiveness. None of them entirely removes the problem of overfitting, however.
A CNN's feature extractors must still learn how to handle temporal dilations as well. Data augmentation algorithms have been proposed to help improve the temporal generalization of CNNs on time series data.
We propose two complementary augmentations to CNNs for MVS data. Specifically, we propose a new type of convolutional layer to prevent overfitting. We also propose a form of data augmentation tailored to MVS data that removes the necessity to learn temporal generality. We demonstrate the effectiveness of both methods on the classification of human gait data.

Starting in 1994, DTW and its variants have been obtaining state-of-the-art results on classification of sequence data [3]. The power of DTW comes from its ability to align mismatched sequences during comparison.
Canonical Time Warping (CTW) was introduced in 2009 by Zhou et al. [30] as a combination of DTW and canonical correlation analysis to produce an alignment through perturbations in both space and time that improved on the accuracy of previous DTW variants.
DMW was introduced by Gong et al. [7] as a variation of DTW that can compare sequences in different modalities (such as 2D to 3D) by first mapping them to the same low dimensional space. It improves upon the accuracy of both CTW and DTW.
One shortcoming of DTW and its variants is their requirement that all known sequences be compared to unknown sequences during classification. In 2015, Keogh et al. introduced exact indexing of DTW [13], removing the necessity of comparing all known examples. Even with the introduction of indexing, however, about 20%-30% (depending on the length of the sequences being compared) of known sequences need to be compared to unknown sequences. Despite this reduction in the number of sequences that must be compared, the necessity to compare large amounts of sequence data is still a shortcoming of DTW and its variants.
More recently, the use of feature based classification of sequences has been explored. Feature based methods preprocess the data by extracting useful features from the raw data set before applying a classification algorithm such as support vector machines (SVMs) or k-nearest neighbors (k-NN). By reducing the dimensionality of the input to a small set of useful extra-temporal features, these classification algorithms are able to distinguish between classes fairly accurately. Much of the computation can occur before classification time and must only occur once. They are, therefore, much faster at labeling unknown data than DTW variants.
As early as the 1990s, feature based methods were used for classification. Chuzhanova et al. in 1998 used selected features to classify RNA sequence data [4], passing the features on to a 10-NN algorithm with great results.
In the early 2000s, feature based methods began to be applied to MVS data. One feature that has become extremely popular for MVS classification of individuals through their gait is the 2D binary silhouette [26], [18]. This silhouette is generated from a single gait cycle. We consider this a feature based approach because the classification algorithms must only consider a subset of the gait features. In this case, the shape and area of a 2D cross section of the individual's gait is the extracted feature.
More recently, Gianaria et al. used a limited set of features such as arm length and stride length to identify individuals based on their gait signature [6]. They passed these generated features on to an SVM to generate the actual classification.
All of these are powerful examples of feature based classification. The shortcoming of feature methods, though, is their reliance on hand selected features in order to perform well. The selection of these features requires special domain and data specific information. Additionally, they rely on the power of existing classification methods to perform the actual classification step.
In recent years, machine learning algorithms that are able to learn useful feature extractions have been shown to be extremely successful in the classification of MVS data. In 2014, Zheng et al. showed that CNNs could get state of the art classification results on time series data [29]. In 2015, Yang et al. [28] demonstrated the power of CNNs specifically on MVS data for human action recognition. They both used ensemble methods by treating each variable as a separate sequence. These separate sequences were passed to parallel CNNs before the results were aggregated using fully connected layers.
These CNNs are able to achieve classification results equal to or better than other methods. They do not require special hand selected feature extraction like feature based classification, nor do they require comparison against large sets of known labelled sequences like DTW variants.
As with many learning models, CNNs can easily overfit the training data. The result of overfitting is low accuracy during the classification of data not seen in the training phase. This weakness is especially apparent on data sets that contain relatively few training samples compared to the number of parameters in the model.
We categorize these forms of overfitting reduction into two classes. The first class contains external forms of overfitting reduction. These forms do not change the actual network itself but rather change peripheral components that affect how the network is used. The second class is structural. These modifications change how the learning algorithm works.
One external method for reducing overfitting is increasing the number of training examples. The amount of overfitting is proportional to the ratio of the number of parameters within the learning algorithm to the number of training examples [1]. By increasing the number of training examples, the ratio decreases. Increasing the size of the training data set is not always possible, though, as data collection can be both expensive and time consuming.
Early stopping is another common external method to reduce overfitting during training [21]. Using cross validation, the training is automatically stopped when some threshold of the ratio between training accuracy and cross validation accuracy is breached. This is another extremely effective technique but requires that the data set be large enough to accommodate cross validation, which is not always the case.
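The stopping rule can be sketched as a training loop that watches held-out accuracy. The `train_epoch` and `validate` callables, and the patience-based criterion used here (stop after several epochs without improvement), are illustrative assumptions, not the exact criterion from [21].

```python
def train_with_early_stopping(train_epoch, validate,
                              max_epochs=100, patience=5):
    """Train until validation accuracy fails to improve for
    `patience` consecutive epochs; return the best accuracy seen."""
    best_val = 0.0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch()                     # one pass over the training set
        val_acc = validate()              # accuracy on held-out data
        if val_acc > best_val:
            best_val = val_acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                         # validation accuracy has plateaued
    return best_val
```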
Data augmentation is another form of external overfitting reduction. The goal of data augmentation is to artificially inflate the size of the training data set. Simard et al. showed that using data augmentation could improve accuracy and reduce overfitting in CNNs [22]. They demonstrated that the only necessity for data augmentation to improve a CNN's results is that the underlying data has some form of translation invariance and that the transformations applied to the data are label preserving.
Alex Krizhevsky et al. [15] demonstrated the effectiveness of data augmentation when used for image classification in CNNs. They randomly skew, rotate, and flip images within the training set to artificially create new training examples. Unfortunately, rotating and skewing multivariate sequence data would be quite confusing to a classifier because, while in images all data points represent the same type of data (i.e., the coloration or saturation of a specific location within the image), not all variables in a multivariate sequence are of the same type.

For example, one variable could represent the values of a stock while the variable directly neighboring it could represent the GDP of Norway. Rotating the stock value into the GDP variable space would most likely confound the classifier because the meaning of the data has changed. The transformation may not be label preserving and does not represent natural variation in the data.
Le et al. introduced a set of data augmentation techniques that can be used on time series data [16]. In their paper they introduce window slicing (WS), in which they extract random sub-samples along the time domain of the entire data set to use as new training data. They also introduce window warping (WW), in which small windows within the data are either doubled or halved in the time dimension and then reinserted into their original position in the data. These newly warped sequences are used as new training examples. They recorded the best results when using both WW and WS. We will compare our data augmentation technique to theirs.
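The two operations can be sketched on plain 1-D sequences. The function names, the window parameters, and the simple repeat/subsample warp used here are illustrative assumptions; Le et al. describe the technique, not this exact code.

```python
import random

def window_slice(seq, window_len, rng=random):
    """Window slicing: extract a random contiguous sub-sequence
    to serve as a new training example."""
    start = rng.randrange(len(seq) - window_len + 1)
    return seq[start:start + window_len]

def window_warp(seq, start, length, factor=2):
    """Window warping: double (factor=2) or halve (factor=0.5) a
    window in the time dimension and reinsert it in place."""
    window = seq[start:start + length]
    if factor >= 1:
        # stretch: repeat each value `factor` times
        warped = [v for v in window for _ in range(int(factor))]
    else:
        # compress: keep every (1/factor)-th value
        warped = window[::int(round(1 / factor))]
    return seq[:start] + warped + seq[start + length:]
```

For example, doubling the middle window of `[1, 2, 3, 4]` yields `[1, 2, 2, 3, 3, 4]`, a label-preserving dilation of the original sequence.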
As stated previously, the amount of overfitting is proportional to the ratio of the number of parameters within the learning algorithm to the number of training examples. Reducing the number of parameters by reducing the size of the network, then, can also reduce overfitting. This is a structural method in which the actual network architecture is changed. Reducing the size of the network reduces its representational power as well. For sufficiently difficult classification problems, both wider [25] and deeper [9] networks have been shown to classify more accurately, so reducing the size of the network may not always be ideal given a hard enough classification problem.
A very common structural method to reduce overfitting in all types of learning algorithms is dropout [24]. Dropout deactivates values within the model during the training phase with some set probability. The result is a network that does not rely on internal co-adaptation of weights to regularize errors within itself. The effect of dropout on convolutional layers, though, is unpredictable.
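The deactivation step can be sketched as follows. This is the common "inverted dropout" variant, in which survivors are rescaled at training time so the expected activation is unchanged; the flat-list formulation is an illustrative simplification of the per-tensor operation frameworks apply.

```python
import random

def dropout(values, p, rng=random):
    """Zero each activation with probability p; scale survivors by
    1/(1-p) so the expected value of each activation is unchanged."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

At test time the layer is simply the identity: because of the 1/(1-p) rescaling during training, no compensation is needed when dropout is disabled.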
Batch normalization is another form of overfitting reduction [12]. With multiple layers in deep learning algorithms all being trained simultaneously, a very common problem is covariate shift. Batch normalization attempts to reduce covariate shift by maintaining the distribution of values for each layer's output. While batch normalization is a very useful development in deep learning that we recommend always be used, it does not address the structural problem of overfitting that we define and attempt to address.
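The normalization can be sketched for a single feature across one mini-batch. The epsilon term and the learned scale (gamma) and shift (beta) parameters follow the usual formulation from [12]; treating the batch as a flat list of scalars is an illustrative simplification.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations for one feature to zero
    mean and unit variance, then apply the learned affine transform."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    # eps guards against division by zero when the batch is constant
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta
            for v in batch]
```

Because gamma and beta are learned, the network can undo the normalization where that helps; the point is that each layer sees inputs with a stable distribution during training.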
Ensemble methods have also been shown to help reduce overfitting [23]. Using multiple classifiers and aggregating their results can dramatically improve the overall performance of any classification task. Any improvement to the individual classifiers will improve the overall performance of the ensemble, however, so we choose to focus on individual models.

Multivariate Spatiotemporal Data
MVS data is a subset of time series data that consists of multiple variables all containing spatial data.
To represent an MVS data point we use a 3D matrix. One dimension represents time, one dimension represents the set of variables, and the final dimension represents the spatial dimension. This last dimension is of either size two or three depending on whether we are considering 2D or 3D spatial data. For the purposes of this paper, we will assume all MVS data is in three dimensional Euclidean space. All of the algorithms discussed in this paper are trivially converted from three dimensional space to two dimensional space.
Let x_it be the location of variable i at time step t within the x-dimension of Euclidean space. Let y_it and z_it be the y-dimensional and z-dimensional counterparts, respectively. The value of a variable for each time step can be defined as shown in equation 1.
If m is the total number of time steps within an MVS data point and n is the total number of variables contained within an MVS data point, then the data point can be represented as shown in equation 2.
v_it = [ x_it  y_it  z_it ]   (1)

    [ v_11       v_12       v_13       ...  v_1(m-1)       v_1m     ]
    [ v_21       v_22       v_23       ...  v_2(m-1)       v_2m     ]
V = [   .          .          .                 .            .      ]   (2)
    [ v_(n-1)1   v_(n-1)2   v_(n-1)3   ...  v_(n-1)(m-1)   v_(n-1)m ]
    [ v_n1       v_n2       v_n3       ...  v_n(m-1)       v_nm     ]
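As a minimal sketch of this representation (the variable and time step counts here are illustrative, not from the data set), an MVS data point with n variables, m time steps, and three spatial coordinates is an n x m x 3 structure:

```python
# Hypothetical example: two tracked joints over four frames.
n_variables, m_steps = 2, 4

# mvs_point[i][t] holds [x_it, y_it, z_it] for variable i at time t;
# the coordinate values below are placeholders, not real gait data.
mvs_point = [[[float(i), float(t), 0.0]
              for t in range(m_steps)]
             for i in range(n_variables)]
```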
Convolutional Neural Networks
CNNs were designed initially for classifying image data. They are specially built to learn to recognize useful features within raw image color or saturation data. The architecture is effective because it takes advantage of some underlying assumptions about images that allow both a reduction in architecture complexity and better generalization. Some of these underlying assumptions about images hold true for multivariate time series as well.
For images, CNNs assume specific values within an image are correlated with collocated data points. They also assume features are location invariant. This location invariance and correlation allow for some simplifications in the internal representation of the data, which in turn reduces the complexity of the model.
Specific values within MVS data are related to each other along the temporal axis. Additionally, there would be no reason to group single variable sequences into multivariate sequences unless they had some correlation; correlations exist between different variables as well. Based on the type of data, features may be temporally invariant as well. A handshake is a handshake whether it occurs at the beginning or the end of the sequence.
Almost all CNNs contain some organization of three common components. These components are the convolutional layer, the pooling layer, and the fully connected layer.
Convolutional Layer. The convolutional layer is comprised of a set of kernels. The height and width of the kernels are hyperparameters. A hyperparameter is any value that is set during the construction of a machine learning model. Hyperparameters affect the power of the model and can be easily modified in order to properly tune it.
The depth of the kernels are equal to the number of input channels. We will see what input channels are shortly. All of the kernels within a single layer are the same size. The number of kernels within the layer is also a hyperparameter.
The input to a convolutional layer is a three dimensional matrix. The first and second dimensions are called the height and width of the input. The third dimension is called the input channels.
For the first convolutional layer in a network in which raw data is the input, we make the following considerations:
For images, the height and width of the input are simply the height and width of the image. For grey scale images, there is only one input channel, which represents the saturation at each location. For color images, there are three input channels that contain the red, green, and blue saturation for each location in the image.
For multivariate time series data, either the height or the width is the time dimension of the data. The remaining dimension contains the different variables. So for players on a basketball court, for example, the height dimension may contain all of the different players. There are two or three input channels containing the x, y, and possibly z coordinates separately for each variable per time step.
The input of deeper convolutional layers within the network are harder to characterize because they consist of the output from shallower layers. We will explain how to interpret this output shortly. But first, let’s examine their structure.
The output of a convolutional layer is also a three dimensional matrix. Again, the first and second dimensions are the height and width. The third dimension is called the output channels. The number of output channels is equal to the number of kernels in the layer.
To generate the output, the kernels are swept along the input. The dot product between the kernels’ values and each subsection of the input which are of equal size to the kernel are calculated. The results of these dot products are then passed on as the output of the convolutional layer. The nth kernel generates the nth output channel. These channels are called the activations or activation maps of the kernels.
While it is fairly standard for the dot products to be calculated between each subsection of the input, the step size taken between each calculation is another hyperparameter. Setting a step size of (2,3), for example, would mean that the activations of the kernels would only be calculated every other location along the height and every third location along the width of the input. For the remainder of the paper, we will assume a step size of (1,1) unless otherwise stated.
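The sweep and step size described above can be sketched as follows. This is a minimal single-kernel, single-channel illustration with no padding (the function name and argument order are ours); real layers stack many kernels over many input channels.

```python
def convolve2d(inp, kernel, step=(1, 1)):
    """Slide `kernel` over `inp` and return the activation map:
    one dot product per (step-spaced) location of the input."""
    kh, kw = len(kernel), len(kernel[0])
    sh, sw = step
    out = []
    for i in range(0, len(inp) - kh + 1, sh):
        row = []
        for j in range(0, len(inp[0]) - kw + 1, sw):
            # dot product between the kernel and the input subsection
            acc = sum(inp[i + a][j + b] * kernel[a][b]
                      for a in range(kh) for b in range(kw))
            row.append(acc)
        out.append(row)
    return out
```

With a step size of (1,1) every subsection contributes an activation; a step of (2,2) skips every other location along both axes and shrinks the output accordingly.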
Another hyperparameter that can be adjusted is the padding that is added to the input height and width before processing. Padding can be added for two reasons. If the step size is not set to (1,1), depending on the ratio of input height and width to kernel shape, there may exist values within the input that are not used to calculate activations. By adding padding to the input, we can guarantee all input values contribute to the activations. Additionally, padding can be added to ensure that the output height and width are equal to the input height and width. Padded values are usually set to zero.
In standard convolutional layers, in simplified terms, the output can be interpreted as follows. The value located at (x, y, z) in the output can be considered the probability of the zth learned pattern being present at (x, y) in the input. These values are calculated directly by the convolutional kernels. Let Kz(x, y) be the zth kernel's activation for location (x, y) in the input. Let O(x, y, z) be the output of the convolutional layer at location (x, y, z).

O(x, y, z) = Kz(x, y)   (3)
The entire output is the likelihood of the set of learned features existing at each location of the input where each channel represents a single feature.
Pooling. The standard pooling layer is a form of down sampling. Like the convolutional layer, the input of the pooling layer is a three dimensional matrix. The first two dimensions are the height and width and the third dimension contains the input channels.
A window is swept through the height and width of the input. Depending on the type of pooling, an operation reduces the set of values within the window to a single value. For max pooling, the maximum value within the window is selected. For average pooling, the average value of the window is calculated.
The window size and the step size are both selected as hyperparameters. Generally, these windows are of size 2x2 because anything larger can be very destructive. The step size is usually the same size as the window size so no inputs are examined multiple times and no inputs are skipped.
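The standard case described above, a 2x2 window with a matching step size, can be sketched for a single channel; the function name and the single-channel simplification are ours.

```python
def max_pool(inp, window=2):
    """Down-sample a 2-D grid by taking the maximum of each
    window x window tile; step size equals the window size, so no
    input value is examined twice and none is skipped."""
    return [[max(inp[i + a][j + b]
                 for a in range(window) for b in range(window))
             for j in range(0, len(inp[0]) - window + 1, window)]
            for i in range(0, len(inp) - window + 1, window)]
```

Average pooling is identical except the `max` over each tile is replaced by the tile's mean.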
There also exist other less standard forms of pooling.
Global channel averaging is a pooling operation in which the entire contents of each input channel are averaged together into a single value. There is also a method for pooling between channels called Channel-Max, in which the maximum value at each (x, y) location in the input is taken along the z axis. The result is a single channel output with the same width and height as the input.
All of these have different uses and interpretations but their goal is to reduce the dimensionality of the input in some useful way.
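Both operations can be sketched as follows, assuming the input is a list of channels, each an equally sized 2-D grid (the function names are ours).

```python
def global_channel_average(channels):
    """Global channel averaging: collapse each channel to the mean
    of all of its values, yielding one value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in channels]

def channel_max(channels):
    """Channel-Max: take the maximum across channels at each (x, y)
    location, producing a single-channel output with the same
    height and width as the input."""
    return [[max(ch[i][j] for ch in channels)
             for j in range(len(channels[0][0]))]
            for i in range(len(channels[0]))]
```

Note how the two reduce along opposite axes: global averaging collapses the spatial dimensions and keeps the channels, while Channel-Max collapses the channels and keeps the spatial dimensions.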
Fully Connected Layers. Each fully connected layer has a single matrix as its weights. The input of a fully connected layer is a single vector. The only requirement is that the input vector length and the width of the layer's matrix are equal. The layer simply performs a dot product between the input vector and the matrix and passes the new vector out as output.
Additionally, there may be a bias added to the output. A bias is simply a vector whose size is the same as the height of the layer's matrix. To add the bias, the output vector and the bias are simply added together. Note that some papers implicitly include a bias when discussing fully connected layers.
Fully connected layers are ubiquitous in deep learning because they are able to learn nonlinear mappings between their input and their output. In CNNs, these layers are typically saved for the end of the network and are used to perform the actual classification once the features have been extracted by the convolutional layers.
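The layer's computation can be sketched as a matrix-vector product plus an optional bias; representing the weights as a list of rows is an illustrative choice.

```python
def fully_connected(weights, x, bias=None):
    """Return weights @ x (+ bias). Each row of `weights` produces one
    output value, so the output length equals the matrix height, and
    each row's length must equal the input vector's length."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out
```

On its own this mapping is linear; the nonlinearity comes from the activation function applied to the output, as described next.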
Activation Functions. Activation functions are functions that take in some input vector or matrix and feed each value through a function. Common functions include the sigmoid function, the hyperbolic tangent function, and the rectifier linear unit (ReLU).
Activation functions are used to control how information flows through the network. For example, using the ReLU, we can limit the network to only process positively correlated information. In addition, backpropagation through ReLU layers prevents applying gradient descent to values that are negatively correlated.
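The three functions named above can be sketched element-wise over a vector:

```python
import math

def relu(xs):
    """Rectified linear unit: pass positive values, zero the rest."""
    return [max(0.0, x) for x in xs]

def sigmoid(xs):
    """Squash each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def tanh(xs):
    """Squash each value into (-1, 1)."""
    return [math.tanh(x) for x in xs]
```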

Despite suffering from less overfitting than traditional forms of deep learning, CNNs still have structural problems through which overfitting is not entirely removed.
Essentially, this overfitting comes from the feature generation process. To generate a feature at some layer within a CNN, all features from a lower level feature map must be considered. This is the case even when only a subset of the lower level features are indicators of this specific higher level feature.
We first explore exactly what this structural problem is. We then propose a new form of convolutional layer that helps to relieve this problem by allowing small subsets of these lower level features to self organize into useful high level features.
Structural Cause of Overfitting in CNNs
Given some arbitrary convolutional layer, k, within a CNN, there exists some theoretical set of features, F_k, that the kernels of this convolutional layer may learn. Let us limit the set of features in F_k to only features that lead to proper classification.
These features in F_k are constructed by combining some subset of the features learned in the previous layer, F_{k-1}. In the special case where k = 1 (the layer under consideration is the first layer in the network), F_{k-1} instead contains the set of useful features that exist in the original input.
For any classification task, there must exist some feature in F_{k-1} that is used to construct only a subset of the features in F_k. Stated differently, not all features in F_{k-1} are used to create all features in F_k. We will call one such feature from F_{k-1} that is not universally useful f.
To illustrate this, let's assume for a moment that no such feature f exists. The result would be that each higher level feature in F_k is a combination of all lower level features in F_{k-1}. All features in F_k would then be identical, every class would be made of the same exact features, and classification would be impossible. So there must exist at least one feature that can distinguish between different classes. For any sufficiently difficult classification task, there must be a large set of such features.
There are two possibilities for f. Case A: no kernel in the (k-1)th layer learns to recognize f, or Case B: at least one kernel does.
In Case A, f is not a feature recognized by the (k-1)th layer. The subset of features in F_k that rely on f will either not be recognizable or must rely on a smaller set of indicators. In the latter situation, the level-k feature may be harder to recognize from this smaller set of level-(k-1) features. Both outcomes of Case A can lead to lower generalization.
In Case B, where f is a feature recognized by the (k-1)th layer, the presence or absence of f will be passed along to the kth layer. The subset of F_k that is not constructed using f must either learn to ignore the feature or, if the training data happens to be biased, may incorrectly learn to rely on the feature, resulting in false positives. In the former outcome, parameter space that could be used for learning useful information must instead be used to interpret non-useful features from the previous layer. In the latter outcome, the network has overfit the training data and reduced its generalization capacity.
An important note: the smaller the training data set, the greater the bias in the training data, and therefore the more likely Case B will result in overfitting.
One potential solution is to reduce the size of the layer so that less-used features are never learned, in favor of more widely useful features. This forces the network towards Case A. As we have already discussed, wider and deeper networks are more expressive, so for sufficiently difficult classification tasks, reducing the size of the network may not be a desirable option.
If we want to keep our layer large, pushing our network towards Case B, we can increase the training data size to reduce the bias in our training data. Data augmentation can be used to get much of the benefit of increasing the training set size without the requirement of collecting new data.
Neither of these solutions address the actual problem, however, which is that each kernel in layer k must consider all learned features from layer k — 1 whether they are useful or not.

We propose a fully differentiable, trainable architecture that allows lower level features to self organize into higher level features. These high level features will be created without the necessity of examining all combinations of low level features.
Convolutional Columns
Let's assume for a minute that we know the exact size of F_k. Let's also assume that we know exactly which features from F_{k-1} are needed for recognition of each feature in layer k. Let the ith feature be f_{i,k} ∈ F_k. We could create an activation function, A(), that would return 1 if some acceptable percentage, τ, of these features existed and 0 otherwise.
We'll assume we have the indices for the set of features from F_{k-1} that are needed to recognize f_{i,k}. We'll call the set of these indices P_i ⊆ F_{k-1}.
    A(i, k) = 1   if τ ≤ (1/|P_i|) Σ_{j∈P_i} a(j, k-1)
              0   if τ > (1/|P_i|) Σ_{j∈P_i} a(j, k-1)        (4)

where a(j, k-1) denotes the activation of the jth feature at layer k-1.
We will call the summation of lower level activations V().
    V(P_i, k) = Σ_{j∈P_i} a(j, k-1)        (5)
Notice how only the necessary features from F_{k-1} are considered when generating these higher level features. Additionally, note that we can reinterpret V() as a summation of votes generated by the set of lower level features, where each lower level feature votes on whether the higher level feature exists.
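The idealized V() and A() described here can be sketched with binary toy activations (the values, τ, and the index set P_i below are hypothetical, chosen only for illustration):

```python
import numpy as np

def V(P_i, a_prev):
    """Summation of lower level activations ("votes") over the index set P_i."""
    return sum(a_prev[j] for j in P_i)

def A(P_i, a_prev, tau):
    """Step activation: fires iff at least a fraction tau of the lower
    level features indexed by P_i are active."""
    return 1 if V(P_i, a_prev) >= tau * len(P_i) else 0

# Hypothetical binary activations from layer k-1 (six features).
a_prev = np.array([1, 0, 1, 1, 0, 1])
present = A([0, 2, 3], a_prev, tau=0.75)   # all three indicators active -> 1
absent = A([1, 4, 5], a_prev, tau=0.75)    # only one of three active -> 0
```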
Unfortunately, assuming we know exactly which features or how many features will be useful is not a good assumption. A better model would be one that can learn how low level features can be organized into useful high level features. By making some less severe assumptions we can create an approximate A() that is differentiable and, therefore, able to learn the proper feature combinations.
We would like to achieve this goal by forcing a reinterpretation of the meaning of outputs from convolutional layers. Instead of learning whether a pattern exists, we would like kernels to implicitly learn how their pattern contributes to a higher level pattern. In order to do this, we add an additional set of steps to a standard convolutional layer.
Let's say we want to generate an activation map for our selected level k, one for each of some number of patterns α, where α ≤ |F_k|. Let's also assume these patterns are recognizable from some bounded number β of lower level patterns, where β ≤ |F_{k-1}|. We first create α parallel standard convolutional layers, each containing β kernels. Let x be the output of the previous layer in the network, and let w_i be a function that generates a standard convolutional activation map for the ith kernel. We define the output of one of these parallel layers as:
    O(x) = Σ_{i=1}^{β} tanh(w_i(x))        (6)
Essentially, each kernel now generates a vote in [-1, 1] about whether the kernel's learned pattern is useful in combination with the other kernels within the layer. These votes are then all tallied. The result is an activation map of some implicit high level feature generated from the lower level features. The outputs from the parallel layers are concatenated together to create the final set of α higher level activation maps.
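A minimal NumPy sketch of such a layer of parallel bundles, assuming 1-D inputs and toy values for α and β (the helper names and shapes are illustrative, not the thesis implementation):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Standard 'valid' cross-correlation of a 1-D signal with one kernel."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[t:t + len(kernel)], kernel) for t in range(n)])

def bundle(x, kernels):
    """One bundle: each of the beta kernels produces an activation map,
    tanh squashes each map into votes in [-1, 1], and the votes are
    summed into a single implicit higher level activation map."""
    return sum(np.tanh(conv1d_valid(x, k)) for k in kernels)

def bundle_layer(x, all_bundles):
    """alpha parallel bundles; their output maps are concatenated
    (stacked here) into the layer's alpha higher level activation maps."""
    return np.stack([bundle(x, ks) for ks in all_bundles])

rng = np.random.default_rng(0)
x = rng.standard_normal(20)                               # toy 1-D input
all_bundles = [[rng.standard_normal(3) for _ in range(4)]  # beta = 4
               for _ in range(2)]                          # alpha = 2
out = bundle_layer(x, all_bundles)
# out has shape (2, 18): alpha maps, each of length 20 - 3 + 1
```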
We chose the hyperbolic tangent because some features may be negatively correlated with the target high level feature. We wanted to allow these negatively correlated features to produce negative votes.
The result of our new architecture is that we replace V() from equation 5 with the summation of hyperbolic tangent kernel activations and convert the function A() from equation 4 into a continuous function rather than a step function. Through stochastic gradient descent or one of its variants, we can train the kernels to approximate A() for a useful set of features.
Convolutional bundles provide an extra benefit as well. By generating higher level features using the summation of votes rather than an additional convolutional layer, we are forcing these kernels to learn useful features. Features can no longer be ignored by higher level layers; they are all considered equally during higher level feature generation and must therefore be useful. This prevents the co-adaptation of weights.

Figure 1: The convolutional bundle architecture.
Figure 1 shows the convolutional bundle architecture.
Because of the possibility of saturation in the hyperbolic tangent activation function and its negative effects on training large networks [2], we recommend using residual connections to propagate the gradient through the entire network.
Despite the process of summing these low level features forcing them to learn good generalizations, there is still the possibility that some of these features are less useful than others. As a result, we want to allow the network to decide how useful they are after they have been learned. So, once our network has converged during training, we swap out the vote-summing step for a 1x1 convolution with its weights initialized to all 1s.
The 1x1 convolution with all weights set to 1 starts out performing exactly the same operation as summing the votes. The network is then trained again, allowing these 1x1 convolutions to learn a parametric combination of the low level features. This fine-tunes the network and produces higher accuracy.
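The equivalence between summing the votes and a 1x1 convolution initialized to ones can be checked directly (illustrative NumPy sketch with toy values):

```python
import numpy as np

def one_by_one_conv(maps, w):
    """A 1x1 convolution across channels is a weighted sum of the input
    maps at every position; `maps` has shape (channels, length)."""
    return np.tensordot(w, maps, axes=1)

# beta = 3 vote maps, each of length 5.
votes = np.random.default_rng(1).standard_normal((3, 5))
w = np.ones(3)                       # weights initialized to all 1s
summed = votes.sum(axis=0)           # the original summation of votes
assert np.allclose(one_by_one_conv(votes, w), summed)
# During fine tuning, w is then free to reweight the low level features.
```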
The new layer does have a few shortcomings that should be mentioned. First, as with standard convolutional layers, the number of features to learn must be selected. In standard convolutional layers, this is equivalent to choosing the number of kernels; in convolutional bundles, it is equivalent to choosing the number of parallel bundles.
Additionally, as with standard convolutional layers, there is no guarantee that the learned features are unique at a given layer. For convolutional bundles, this is twofold: the kernels learned within a single bundle can recognize duplicate features, and the implicit high level features can be duplicated across the entire layer.
Convolutional bundles introduce a new hyperparameter. The number of kernels within a single bundle must be chosen. This adds a layer of complexity while tuning the network.
Lastly, the higher level features that each bundle learns may share some component low level features. The detection of these low level features is not shared across bundles, however, meaning there may be significant duplication of effort between bundles.
Despite these shortcomings, we have found that these convolutional bundles outperform standard convolutional layers when classifying MVS data. We show the results of our experimentation in Chapter 7.

One method for reducing overfitting in machine learning algorithms like CNNs is to increase the size of the training data set. When collecting new data is not possible, one method for generating additional training data is data augmentation. By applying transformations to the pre-existing training set, "new" data can be generated. There are two requirements for these transformations to be useful. First, the transformations must be label retaining, meaning the applied transformation does not change the actual class of the data point. Second, they must imitate natural variance within the data.
For time series data, window slicing (WS) has been proposed as a form of cropping. A random range within the time domain is selected and extracted from the original data point. We employ this method in our data augmentation scheme. Window warping (WW) was proposed as a form of re-scaling for time series data. Again a random time range is selected in the original data. This time frame is then either doubled or halved and reinserted back into the original time series.
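A minimal sketch of these two transformations, assuming sequences shaped (time, features) and using linear interpolation for the warped window (illustrative, not the implementations used in the thesis):

```python
import numpy as np

def window_slice(x, crop_len, rng):
    """WS: extract a random contiguous range along the time axis."""
    start = rng.integers(0, len(x) - crop_len + 1)
    return x[start:start + crop_len]

def window_warp(x, win_len, rng, factor=2):
    """WW: pick a random window, stretch it by `factor` (or compress it
    with factor=0.5) via linear interpolation, then reinsert it."""
    start = rng.integers(0, len(x) - win_len + 1)
    window = x[start:start + win_len]
    new_len = int(win_len * factor)
    src = np.linspace(0, win_len - 1, new_len)
    warped = np.stack([np.interp(src, np.arange(win_len), window[:, j])
                       for j in range(x.shape[1])], axis=1)
    return np.concatenate([x[:start], warped, x[start + win_len:]])

rng = np.random.default_rng(0)
x = rng.standard_normal((40, 3))   # 40 time steps, 3 variables
sliced = window_slice(x, 30, rng)            # shape (30, 3)
warped = window_warp(x, 10, rng, factor=2)   # 10 steps -> 20; shape (50, 3)
```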
Unfortunately, with WW, generating a training set rich enough to learn good generalizations of time scale may require a prohibitively large amount of augmented data.
We introduce a new form of scaling specifically suited for MVS data that removes the necessity of generating large amounts of training data. We call our method global time warping (GTW).
To perform GTW, we apply the following steps. We first select the target number of time steps, γ, as a hyperparameter. We then generate a linearly spaced vector of length γ with values between 0 and 1. We call this our target vector, C.
    C = ( 0, 1/(γ-1), 2/(γ-1), ..., (γ-2)/(γ-1), 1 )
We call C our target completion percentage for each time step in our target time series. For each time series to be scaled, we generate the total distance, d_t, at each time step t using the Euclidean distance. Let u be the total number of dimensions in each data point.

Figure 2: An example of completion values for a single sequence.
For example, MVS data that contains x, y, and z data would have u = 3.
    d_t = Σ_i sqrt( Σ_{j=1}^{u} (v_{i,j,t} - v_{i,j,t-1})² )

where v_{i,j,t} is the value of the jth dimension of the ith variable at time step t.
We then generate the completion percentage, c_t, for each time step, t, in the series. Let m be the total number of time steps in the sequence.

    c_t = ( Σ_{s=2}^{t} d_s ) / ( Σ_{s=2}^{m} d_s )
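One plausible NumPy reading of this distance and completion-percentage computation, assuming d_t sums each variable's Euclidean displacement between consecutive frames (a sketch of the description above, not the thesis code):

```python
import numpy as np

def completion_percentages(seq):
    """Completion percentage for each time step of one MVS sequence.

    seq has shape (m, n_vars, u): m time steps, n_vars tracked variables,
    u spatial dimensions (u = 3 for x, y, z data). d_t sums, over all
    variables, the Euclidean distance moved between steps t-1 and t; c_t
    is the cumulative fraction of the total distance covered by step t.
    """
    steps = np.diff(seq, axis=0)                    # (m-1, n_vars, u)
    d = np.linalg.norm(steps, axis=2).sum(axis=1)   # (m-1,)
    return np.concatenate([[0.0], np.cumsum(d)]) / d.sum()

# Toy sequence: one variable moving one unit per step along x.
seq = np.zeros((5, 1, 3))
seq[:, 0, 0] = [0, 1, 2, 3, 4]
c = completion_percentages(seq)   # [0, 0.25, 0.5, 0.75, 1.0]
```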
Figure 2 shows an example of completion values for a single sequence.
Next, for each dimension of each variable in our time series, we generate the 1D cubic spline interpolation using all points in the original sequence. The value of the original sequence is the dependent variable and the completion percentage is the independent variable.
We then use these cubic spline interpolations to generate the new values for our target sequence. The value of the ith variable at some completion percentage c ∈ C is defined as:

    V_{i,c} = ( S_{i,x}(c), S_{i,y}(c), S_{i,z}(c) )

Figure 3: A view of how data artifacts are smoothed in favor of general trends when using GTW. The figure on the left is the original data. The figure on the right is the smoothed data.
In this case, S_{i,x}(), S_{i,y}(), and S_{i,z}() are the cubic spline interpolation piecewise functions for the ith variable and dimensions x, y, and z respectively.
For the values c ∈ C, we generate a new time series of the following form, where n is the total number of variables:

    V_{1,c_1}    V_{1,c_2}    V_{1,c_3}    ...  V_{1,c_{γ-1}}    V_{1,c_γ}
    V_{2,c_1}    V_{2,c_2}    V_{2,c_3}    ...  V_{2,c_{γ-1}}    V_{2,c_γ}
    ...
    V_{n-1,c_1}  V_{n-1,c_2}  V_{n-1,c_3}  ...  V_{n-1,c_{γ-1}}  V_{n-1,c_γ}
    V_{n,c_1}    V_{n,c_2}    V_{n,c_3}    ...  V_{n,c_{γ-1}}    V_{n,c_γ}
This new series is now scaled to a new time length and aligned to a global reference. Both up-scaling and down-scaling work with this algorithm. Additionally, because of the alignment along completion percentage, general trends become more important than the exact shape of the original data. See Figure 3 to see how general trends are strengthened and artifacts are smoothed.
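The spline resampling step can be sketched with SciPy's CubicSpline for a single variable (illustrative; assumes SciPy is available and that the completion percentages are strictly increasing):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def gtw_resample(var_seq, c, gamma):
    """Resample one variable's trajectory onto gamma globally aligned
    completion percentages using 1-D cubic splines.

    var_seq: (m, u) positions of a single variable; c: (m,) strictly
    increasing completion percentages; gamma: target number of steps.
    """
    C = np.linspace(0.0, 1.0, gamma)   # target completion vector
    # One spline per spatial dimension; completion percentage is the
    # independent variable, the original values the dependent variable.
    return np.stack([CubicSpline(c, var_seq[:, j])(C)
                     for j in range(var_seq.shape[1])], axis=1)

# Toy example: up-scale a 5-step, 2-D trajectory to 9 steps.
c = np.linspace(0.0, 1.0, 5)
var_seq = np.stack([np.arange(5.0), np.arange(5.0) ** 2], axis=1)
out = gtw_resample(var_seq, c, 9)   # shape (9, 2); endpoints preserved
```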
If the speed at which some feature occurs is integral to the data's labels, GTW will not be beneficial, as it destroys the relative time scale of the data. Additionally, because Euclidean distance and cubic spline interpolation are used to generate points that do not exist in the original data, the space in which the data exists must be metric. GTW cannot be used for all types of time series data.
We have found that GTW improves accuracy over using no data augmentation when applied to MVS data. Additionally, we have found that GTW outperforms WW and WS on the same data. The results of our experimentation can be found in Chapter 7.

Table 1: Types of gaits collected in the data set.
Name  Type
S1    Walk straight towards primary Kinect
S2    Walk straight away from the primary Kinect
C1    Walk straight towards primary Kinect while holding briefcase
C2    Walk straight towards primary Kinect while holding cell phone
C3    Walk straight towards primary Kinect with arms crossed across participant's chest
C4    Walk straight towards primary Kinect with hands in pockets
C5    Walk straight towards primary Kinect while participant attempts to "disguise" their natural walking motion
In order to properly demonstrate the power of our algorithms, we generated a new human gait data set. The whole data set consists of 50 individuals, each performing between 3 and 10 different types of walks. The data was captured using the Microsoft Kinect and contains the 3D locations of the 20 joints that the Kinect tracks. Each gait is captured from two view angles. Table 1 lists the types of gaits that were collected in the data set. In total, 842 gait sequences were collected.
In order to create a consistent data set, the following considerations were made. For every gait collected, standard fluorescent lighting was used and blinds were closed to minimize glare, shadows, and sunlight. Participants were asked to remove jackets and bags and to not carry any objects unless specifically required for the test. If participants were wearing movement-impeding footwear such as high heels or boots, they were asked to remove the footwear. Participants were asked to walk naturally at about 3 miles per hour. Additionally, the material the participants walked on was standard linoleum tile. No slick or sticky materials were used to change the gait.

Figure 4: A single frame in gait MVS data set.
The locations and angles of the Kinect sensors were restricted by their technical specifications. Their maximum field of view is 43 degrees vertical by 57 degrees horizontal. With an optimal far-field distance of 11 feet and an optimal near-field distance of 4 feet from the Kinect, the resulting optimal viewing range was a walk distance of 7 feet. We set the Kinect sensor 32 inches from the ground in an attempt to center the Kinect on the average human's height.
For the actual data, we wanted a path that created the least amount of joint overlap. If joints overlap from the view of the Kinect, the accuracy of the occluded joints becomes low. We found that a straight walk towards the Kinect sensor was the best path to prevent any accuracy loss.
In addition, we used a second Kinect placed perpendicular to the viewing direction of the primary Kinect in order to create a more difficult test set with noise from occluded joints and multiple viewing angles. The different types of data were all tracked and demarcated properly.
The data itself consists of 20 joints per time step, collected at roughly 45 frames per second. Each joint's x, y, and z locations were recorded for every time step. Table 2 lists the joints that were captured in the data. They are the standard joints used by the Kinect.

Table 2: Joints captured within the data set.
Head Shoulder-Center Shoulder-Right Shoulder-Left Elbow-Right
Elbow-Left Wrist-Right Wrist-Left Hand-Right Hand-Left
Spine Hip-Center Hip-Right Hip-Left Knee-Right
Knee-Left Ankle-Right Ankle-Left Foot-Right Foot-Left

Figure 5: The layout of the M-CNN model.
To test our different algorithms, we first selected a standard base model. We used this model as a baseline against which to compare our contributions.
The base model, which we will call M-CNN (multichannel CNN), consists of three convolutional layers, two max pooling layers and two fully connected layers. Additionally, M-CNN employs batch normalization and dropout. Random noise is added to the input during training. Figure 5 shows the M-CNN layout.
In order to fit different lengths of sequences, for M-CNN only the first 40 time steps of each sequence were kept. If a sequence was shorter than 40 time steps, zero padding was added.
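This length-fitting step can be sketched as follows (illustrative NumPy, not the thesis code; the 60 features below assume 20 joints times 3 coordinates):

```python
import numpy as np

def fit_length(seq, target=40):
    """Keep the first `target` time steps; zero-pad shorter sequences.
    seq has shape (time, features)."""
    if len(seq) >= target:
        return seq[:target]
    pad = np.zeros((target - len(seq), seq.shape[1]))
    return np.concatenate([seq, pad])

long_seq = np.ones((55, 60))    # 20 joints x 3 coordinates per step
short_seq = np.ones((25, 60))
assert fit_length(long_seq).shape == (40, 60)
assert fit_length(short_seq).shape == (40, 60)
assert fit_length(short_seq)[-1].sum() == 0    # padded tail is zero
```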
In order to ensure the results of our tests are accurate, we trained and tested each algorithm 10 separate times and averaged the results. Each algorithm was trained for 1000 epochs, at which point training appeared to have converged.
For the gait data set, the test set consists of 150 samples. The test samples were randomly selected from the data set in order to not bias the test cases. The specific samples are kept constant through all training and testing cases.
Whole System
Figure 6 shows the accuracy versus training epoch of the entire proposed system, which we call GaitNet. This system includes both GTW data augmentation and convolutional column layers for the first and third convolutional layers. We include M-CNN as a reference.
Convolutional Columns
To test convolutional columns, there are three convolutional layers we can transform into convolutional columns. Models CC-1 through CC-3 are all networks in which we transform a single convolutional layer into a convolutional column. CC-4 replaces two convolutional layers with convolutional columns.

Figure 6: A comparison of the aggregate system versus a standard CNN model.

Figure 7: The layout of the CC-4 model.
Additionally, using CC-4, we adjust both α and β to see how different values affect the classification results. We use CC-1 through CC-3 to test how the location of the convolutional column affects classification.
Figure 8 shows the test accuracy versus training epoch of the CC-4 model in comparison with M-CNN. There are two important points to note here. First, the final accuracy is considerably better. Additionally, CC-4 trained faster than M-CNN.
Figure 8: Results from CC-4 in comparison to M-CNN.

Because our convolutional columns implicitly generate higher level features, they are essentially two layers built into one. We wanted to ensure that their benefit was not being realized simply by adding extra layers. We created a new model, called DEEP, that adds two new layers to M-CNN. We compare this model with CC-4 and M-CNN in Figure 9. Not unexpectedly, the accuracy decreases. We have increased the number of parameters in our model, causing it to overfit rather than generalize well.
Additionally, we wanted to ensure that the gain in accuracy wasn’t due to parallel convolutional layers. We constructed a model we called PARA that replaced the convolutional columns in CC-4 with the same number of parallel standard convolutional layers. Figure 10 shows the results of this network versus CC-4 and M-CNN. Just as DEEP led to increased overfitting by increasing the number of parameters, so did PARA.
The benefit of convolutional columns comes from the actual structural changes it introduces rather than simply increasing the network size.
Figure 11 shows the test accuracy versus training epoch of the CC-1 through CC-3 models in comparison to M-CNN. For CC-1 and CC-3, we again see that training times are improved; M-CNN's final test accuracy is equal to that of these networks, however. With CC-2, there seems to be no difference in training time or accuracy compared with M-CNN.
Data Augmentation
To test our form of data augmentation, we again start with M-CNN, which uses no augmentation. The GTW model applies cropping and scaling to the incoming training data using GTW. Finally, to compare their accuracy with existing methods, the WW model applies WW and WS to the input of CONV-1. For both the WW and GTW models, the target cropping size was the same. Both the potential amount of cropping and the size of the final scaled sequence can be modified for data augmentation.
Figure 12 shows the accuracy versus training epoch for all three models. Not unexpectedly, any data augmentation significantly improves the accuracy of the model during testing.

Figure 9: Comparison of M-CNN, CC-4, and DEEP.
Figure 10: Comparison of M-CNN, CC-4, and PARA.

Figure 11: The three different convolutional column layer models in log scale.
Note that GTW outperforms WW and WS for MVS data.

Figure 12: The testing accuracy of the three different data augmentation models versus the training epoch.

Convolutional neural networks have achieved state-of-the-art results on many machine learning tasks. They leverage the power of automatic feature learning to classify unlabeled data. Multivariate time series data shares many properties with images that make CNNs ideal for classifying it. Despite their specialized architecture, these CNNs suffer from overfitting.
We proposed two solutions for reducing overfitting of CNNs for multivariate metric time series data. We developed a data augmentation algorithm for metric time series data that improves upon existing data augmentation techniques for time series data. Our algorithm allows for scaling to any target size and simultaneously globally aligns the data.
Additionally, we developed a structural change to convolutional layers. This layer attempts to reduce the amount of complex co-adaptation of weights within the network and also reduce the chance of bias being learned from training data. We compared these layers’ power to standard convolutional layers to show the improvement in accuracy.
Future Work
We plan to explore which types of time series benefit most from GTW. Additionally, we want to explore how convolutional bundles could be used with image data and other forms of data that have state-of-the-art classification results from CNNs. Finally, we would like to examine how to automatically learn global alignment vectors that produce better results than linearly spaced vectors.

[1] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in neural information processing systems, pages 81-90, 1989.
[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157-166, 1994.
[3] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359-370. Seattle, WA, 1994.
[4] Nadia A. Chuzhanova, Antonia J. Jones, and Steve Margetts. Feature selection for genetic sequence classification. Bioinformatics (Oxford, England), 14(2):139-143, 1998.
[5] Deepjoy Das and Alok Chakrabarty. Human gait recognition using deep neural networks. In Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, page 132. ACM, 2016.
[6] Elena Gianaria, Marco Grangetto, Maurizio Lucenteforte, and Nello Balossino. Human classification using gait features. In International Workshop on Biometric Authentication, pages 16-27. Springer, 2014.
[7] Dian Gong and Gerard Medioni. Dynamic manifold warping for view invariant action recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 571-578. IEEE, 2011.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346-361. Springer, 2014.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
[10] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] Yuchi Huang, Xiuyu Sun, Ming Lu, and Ming Xu. Channel-max, channel-drop and stochastic max-pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 9-17, 2015.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.

[13] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and information systems, 7(3):358-386, 2005.
[14] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097-1105, 2012.
[16] Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, 2016.
[17] Yann LeCun and Corinna Cortes. MNIST handwritten digit database., 2010.
[18] Lily Lee and W Eric L Grimson. Gait analysis for recognition and classification. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 155-162. IEEE, 2002.
[19] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[20] Gerard Medioni, Chi-Keung Tang, and Mi-Suen Lee. Tensor voting: Theory and applications. In Proceedings of RFIA, volume 2000, 2000.
[21] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761-767, 1998.
[22] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958-962, 2003.
[23] Peter Sollich and Anders Krogh. Learning with ensembles: How overfitting can be useful. In Advances in neural information processing systems, pages 190-196, 1996.
[24] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929-1958, 2014.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1-9, 2015.
[26] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE transactions on pattern analysis and machine intelligence, 25(12):1505-1518, 2003.

[27] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1-10, 2015.
[28] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, pages 3995-4001, 2015.
[29] Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. Time series classification using multi-channels deep convolutional neural networks. In International Conference on Web-Age Information Management, pages 298-310. Springer, 2014.
[30] Feng Zhou and Fernando Torre. Canonical time warping for alignment of human behavior. In Advances in neural information processing systems, pages 2286-2294, 2009.

Full Text


PREVENTINGOVERFITTINGINCONVOLUTIONALNEURALNETWORKS DURINGCLASSIFICATIONOFMULTIVARIATESPATIOTEMPORALDATA by W.MAXLEES B.A.,UniversityofColorado,2013 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof MasterofScience ComputerScienceProgram 2017


ThisthesisfortheMasterofSciencedegreeby W.MaxLees hasbeenapprovedforthe ComputerScienceProgram by FarnoushBanaei-Kashani,Advisor AshisBiswas MinChoi Date:December16,2017 ii


Lees,W.MaxM.S.,ComputerScienceProgram PREVENTINGOVERFITTINGINCONVOLUTIONALNEURALNETWORKSDURINGCLASSIFICATIONOFMULTIVARIATESPATIOTEMPORALDATA ThesisdirectedbyAssistantProfessorFarnoushBanaei-Kashani ABSTRACT MultivariatespatiotemporalMVSdata,suchasplayerlocationsonaeld,trooplocationinbattle,orhumanjointlocations,canyieldusefulinformationwhenclassied. ConvolutionalneuralnetworksCNNshaveshownsuperiorresultsforclassicationofthe MVSdata.Despitetheirpower,however,convolutionalnetworkssuerfromovertting.As aresult,theyresultinloweraccuracyonunseendata;hence,limitedgeneralization.Many methodsareproposedtoaddresstheoverttingproblemwithCNNs,suchasdropout,batch normalization,anddataaugmentation.Weidentifyonestructuralsourceofovertting withinconvolutionalneuralnetworksthatexistingmethodshavenotaddressed.Wepresent anovelconvolutionallayerthataimstoaddressthisproblem.Additionally,wepresenta formofdataaugmentationdesignedforMVSdata.Weevaluatethebenetsofourproposedconvolutionallayerandshowthatitsignicantlydecreasesoverttingandimproves trainingtime.Wealsoshowthatourdataaugmentationalgorithmimprovesonexisting formsoftimeseriesdataaugmentation.Wehaveperformedextensiveempiricalstudiesto evaluatebothofourcontributionsusingrealhumangaitdata,whichdemonstratedupto 25%improvementoverthestate-of-the-artMVSclassicationmethods. Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:FarnoushBanaei-Kashani iii


ACKNOWLEDGMENTS Firstly,IwouldliketothankmyadvisorAssistantProfessorFarnoushBanaei-Kashanifor helpingmendaresearchtopicworthexploringandforpushingmetoperformmybest work.Additionally,IwouldliketothankKyleEstlerandOliverBatistaforhelpingdesign thedatacollectionproceduresandfortakingthetimetocollectalotofthegaitdata. iv


TABLEOFCONTENTS CHAPTER I.INTRODUCTION..................................1 II.RELATEDWORK..................................4 III.BACKGROUND...................................9 MultivariateSpatiotemporalData..........................9 ConvolutionalNeuralNetworks............................9 ConvolutionalLayer...............................10 Pooling......................................12 FullyConnectedLayers.............................13 ActivationFunctions...............................13 IV.CONVOLUTIONALCOLUMNS..........................14 StructuralCauseofOverttinginCNNs......................14 ConvolutionalColumns................................16 V.DATAAUGMENTATION..............................20 VI.GAITDATASET..................................24 VII.EVALUATION....................................27 WholeSystem.....................................27 ConvolutionalColumns................................27 DataAugmentation..................................29 VIII.CONCLUSIONANDFUTUREWORK......................32 FutureWork......................................32 REFERENCES.......................................33 v


LIST OF TABLES

1. Types of gaits collected in the dataset
2. Joints captured within the dataset


LIST OF FIGURES

1. The convolutional bundle architecture
2. An example of completion values for a single sequence
3. A view of how data artifacts are smoothed in favor of general trends when using GTW. The figure on the left is the original data. The figure on the right is the smoothed data.
4. A single frame in the gait MVS dataset
5. The layout of the M-CNN model
6. A comparison of the aggregate system versus a standard CNN model
8. Results from CC-4 in comparison to M-CNN
9. Comparison of M-CNN, CC-4, and DEEP
10. Comparison of M-CNN, CC-4, and PARA
11. The three different convolutional column layer models in log scale
12. The testing accuracy of the three different data aggregation models versus the training epoch


CHAPTER I
INTRODUCTION

A great deal of useful information could be gained from algorithms able to classify data in the form of multivariate spatiotemporal (MVS) data. MVS data contains two- or three-dimensional location data of multiple objects tracked through time. This type of data is a subclass of time series and sequence data.

Examples of MVS data include battlefield troop movements, player locations on a field, ride-sharing car routes, and human joint locations.

With these examples, a myriad of different useful data classification applications could be considered. For troop movements, potential battlefield tactics could be determined. With player locations, the plays and strategies of the other team could be predicted. For ride sharing, data mining could predict the likelihood of an individual needing a ride in a particular area. The particular action an individual is performing, or their identity, could be gleaned by using the joint locations.

Classification of MVS data poses a unique problem, however. Because MVS data is temporal, data that should have the same label can take place in different lengths or dilations of time and at entirely different locations within a time series. There is no guarantee of temporal alignment.

Using the human action recognition example, if the action 'stand up' takes 0.5 seconds in one instance and 0.7 seconds in another, an algorithm attempting to identify the 'stand-up' action must be able to recognize both of these instances. Additionally, in one 'stand-up' instance, the action may take place at the beginning of the time series. In another instance, the 'stand-up' action may occur at the end. Algorithms that identify similarities in MVS data require either the ability to handle possible differences in the time domain or must use some sort of heuristic pre-alignment.

Similarly, because MVS data is multivariate, the multiple variables may all contain different time dilations. To use the human action recognition example again, consider a system that must identify whether two people are shaking hands based on the location of


their arm and hand joints. The system must be robust enough to identify a handshake when both individuals raise their hands at the same time as well as when there is some lag between one individual and the other raising their hands. The system must still be able to identify both possible scenarios as a handshake.

Classification of MVS data is generally considered the domain of three different forms of classification.

The first is Dynamic Time Warping (DTW) and related algorithms. DTW determines the closest possible distance between two unaligned time series sequences while applying a temporal alignment to the sequences that maintains temporal ordering. Because of this temporal alignment, DTW is able to handle different time scales and dilations within the data.

DTW algorithms have been shown to be extremely adept at classifying time series data. One shortcoming is their reliance on comparing unknown sequences to large sets of known sequences. This requirement means that as the number of known sequences increases, the required time to identify an unknown sequence increases.

Feature-based classification is another well-explored form of classification on sequence data. Hand-selected features are extracted from all known sequences. These features are then extracted from any unknown sequences as well and are compared with the features of the known labels. This final comparison can be achieved by anything from Euclidean distance to support vector machines.

Because these features are extracted from the original data, the actual comparison no longer involves time series data. Instead the classification occurs on features such as stride length for gait identification or average distance from goal for soccer play identification.

For feature-based classification to work well, however, the features that are extracted must be carefully considered and handcrafted based on knowledge of the data and the domain of the data. Additionally, because features are fed to other existing classification algorithms, we consider feature extraction to be a preprocessing step rather than a classification algorithm in its own right.

Deep learning is the final classification method. Convolutional neural networks (CNNs) have attained state-of-the-art classification results on MVS data. By learning useful features, they are able to leverage the power of feature-based preprocessing in order to achieve excellent classification results. Again, the temporal problem is mostly sidestepped by extracting extra-temporal features before classification.

CNNs are prone to learning biases that exist in the training set rather than learning generalizations. This learning of biases is known as overfitting. There are several tools that have been studied for reducing overfitting in all forms of deep learning, both external and structural, with varying levels of effectiveness. None of them entirely removes the problem of overfitting, however.

The CNN's feature extractors must still learn how to handle temporal dilations as well. There are data augmentation algorithms proposed to help improve the temporal generalization of CNNs on time series data.

We propose two complementary augmentations to CNNs for MVS data. Specifically, we propose a new type of convolutional layer to prevent overfitting. We also propose a form of data augmentation tailored to MVS data that removes the necessity to learn temporal generality. We demonstrate the effectiveness of both methods on the classification of human gait data.
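The temporal alignment that DTW performs can be sketched as a classic dynamic program. This is a minimal univariate version with an absolute-difference cost; the function name and example values are ours, and the thesis applies the idea to multivariate sequences:

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal dynamic-time-warping distance between two 1-D sequences.

    D[i, j] holds the cheapest cost of aligning a[:i] with b[:j] while
    maintaining temporal ordering, so one sample may align with several
    samples of the other sequence (handling time dilations)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# the same shape at a different time dilation aligns with zero cost
print(dtw_distance([0, 1, 2, 1, 0], [0, 1, 1, 2, 2, 1, 0]))  # → 0.0
```

Note that each classification still requires running this O(nm) comparison against many stored sequences, which is exactly the shortcoming discussed above.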


CHAPTER II
RELATED WORK

Starting in 1994, DTW and its variants have been obtaining state-of-the-art results on classification of sequence data [3]. The power of DTW comes from its ability to align mismatched sequences during comparison.

Canonical Time Warping (CTW) was introduced in 2009 by Zhou et al. [30] as a combination of DTW and canonical component analysis to produce an alignment, through perturbations in both space and time, that improved the accuracy of previous DTW variants.

DMW was introduced by Gong et al. [7] as a variation of DTW that can compare sequences in different modalities (such as 2D to 3D) by first mapping them to the same low-dimensional space. It improves upon the accuracy of both CTW and DTW.

One shortcoming of DTW and its variants is their requirement that all known sequences be compared to unknown sequences during classification. In 2015, Keogh et al. introduced exact indexing of DTW [13], removing the necessity of comparing all known examples. Even with the introduction of indexing, however, about 20%-30% (depending on the length of the sequences being compared) of known sequences need to be compared to unknown sequences. Despite this reduction in the number of sequences that must be compared, the necessity to compare large amounts of sequence data is still a shortcoming of DTW and its variants.

More recently, the use of feature-based classification of sequences has been explored. Feature-based methods preprocess the data by extracting useful features from the raw data set before applying a classification algorithm such as support vector machines (SVMs) or k-nearest neighbors (k-NN). By reducing the dimensionality of the input to a small set of useful extra-temporal features, these classification algorithms are able to distinguish between classes fairly accurately. Much of the computation can occur before classification time and must only occur once. They are, therefore, much faster at labeling unknown data than DTW variants.

As early as the 1990s, feature-based methods were used for classification. Chuzhanova et al. in 1998 used special features to classify RNA sequence data [4], passing the features


on to a 10-NN algorithm with great results.

In the early 2000s, feature-based methods began to be applied to MVS data. One feature that has become extremely popular for MVS classification of individuals through their gait is the 2D binary silhouette [26][18]. This silhouette is generated from a single gait cycle. We consider this a feature-based approach because the classification algorithms must only consider a subset of the gait features. In this case, the shape and area of a 2D cross section of the individual's gait is the extracted feature.

More recently, Gianaria et al. used a limited set of features such as arm length and stride length to identify individuals based on their gait signature [6]. They passed these generated features on to an SVM to generate the actual classification.

All of these are powerful examples of feature-based classification. The shortcoming of feature methods, though, is their reliance on hand-selected features in order to perform well. The selection of these features requires special domain- and data-specific information. Additionally, they rely on the power of existing classification methods to perform the actual classification step.

In recent years, machine learning algorithms that are able to learn useful feature extractions have been shown to be extremely successful in the classification of MVS data. In 2014, Zheng et al. showed that CNNs could get state-of-the-art classification results on time series data [29]. In 2015, Yang et al. [28] demonstrated the power of CNNs specifically on MVS data for human action recognition. They both used ensemble methods by treating each variable as a separate sequence. These separate sequences were passed to parallel CNNs before the results were aggregated using fully connected layers.

These CNNs are able to achieve classification results equal to or above those of other methods. They do not require special hand-selected feature extraction like feature-based classification, nor do they require comparison against large sets of known labelled sequences like DTW variants.

As with many learning models, CNNs can easily overfit the training data. The result of


overfitting is low accuracy during the classification of data not seen in the training phase. This weakness is especially apparent on datasets that contain relatively few training samples compared to the number of parameters in the model.

We categorize these forms of overfitting reduction into two classes. The first class contains external forms of overfitting reduction. These forms do not change the actual network itself but rather change peripheral components that affect how the network is used. The second class is structural. These modifications change how the learning algorithm works.

One external method for reducing overfitting is increasing the number of training examples. The amount of overfitting is directly proportional to the ratio of the number of parameters within the learning algorithm to the number of training examples [1]. By increasing the number of training examples, this ratio decreases. Increasing the size of the training dataset is not always possible, though, as data collection can be both expensive and time consuming.

Early stopping is another common external method to reduce overfitting during training [21]. Using cross validation, the training is automatically stopped when some threshold of the ratio between training accuracy and cross-validation accuracy is breached. This is another extremely effective technique but requires that the dataset be large enough to accommodate cross validation, which is not always the case.

Data augmentation is another form of external overfitting reduction. The goal of data augmentation is to artificially inflate the size of the training dataset. Simard et al. showed that using data augmentation could improve accuracy and reduce overfitting in CNNs [22]. They demonstrated that the only requirements for data augmentation to improve a CNN's results are that the underlying data has some form of translation invariance and that the transformations applied to the data are label preserving.

Alex Krizhevsky et al. [15] demonstrated the effectiveness of data augmentation when used for image classification in CNNs. They randomly skew, rotate, and flip images within the training set to artificially create new training examples. Unfortunately, rotating and


skewing multivariate sequence data would be quite confusing to a classifier because, while in images all data points represent the same type of data (i.e., the coloration or saturation of a specific location within the image), not all types in a multivariate sequence are the same.

For example, one variable could represent the values of a stock while the variable directly neighboring it could represent the GDP of Norway. Rotating the stock value into the GDP variable space would most likely confound the classifier because the meaning of the data has changed. The transformation may not be label preserving and does not represent natural variation in the data.

Le et al. introduced a set of data augmentation techniques that could be used on time series data [16]. In their paper they introduce window slicing (WS), in which they extract random sub-samples along the time domain of the entire dataset to use as new training data. They also introduce window warping (WW), in which they select small windows within the data to either double or halve in the time dimension and then reinsert into their original position in the data. These newly warped sequences are used as new training examples. They recorded the best results when using both WW and WS. We will compare our data augmentation technique to theirs.

As stated previously, the amount of overfitting is directly proportional to the ratio of the number of parameters to the number of training examples. Reducing the number of parameters by reducing the size of the network, then, can also reduce overfitting. This is a structural method in which the actual network architecture is changed. Reducing the size of the network reduces the representational power of the network as well. For sufficiently difficult classification problems, both wider [25] and deeper [9] networks have been shown to classify more accurately. So reducing the size of the network may not always be ideal given a hard enough classification problem.

A very common structural method to reduce overfitting in all types of learning algorithms is dropout [24]. Dropout deactivates values within the model during the training phase with some set probability. The result is a network that does not rely on internal co-adaptation
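The WS and WW transformations described above can be sketched as follows. This is our own minimal reading of the two operations (function names, the fixed random seed, and nearest-neighbour resampling are our choices, not Le et al.'s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def window_slice(x, slice_len):
    """WS: crop a random window along the time axis (time is axis 0)."""
    start = rng.integers(0, len(x) - slice_len + 1)
    return x[start:start + slice_len]

def window_warp(x, win_len, factor=2):
    """WW: stretch (factor=2) or shrink (factor=0.5) a random window in
    time, then splice it back between the untouched head and tail."""
    start = rng.integers(0, len(x) - win_len + 1)
    window = x[start:start + win_len]
    new_len = max(1, int(win_len * factor))
    # nearest-neighbour resampling of the window along time
    idx = np.linspace(0, win_len - 1, new_len).round().astype(int)
    return np.concatenate([x[:start], window[idx], x[start + win_len:]])

series = np.arange(20.0)
print(window_slice(series, 10).shape)   # (10,)
print(window_warp(series, 4, 2).shape)  # (24,)
```

Each call produces one "new" training example; note that WW changes the overall sequence length, which the training pipeline must accommodate.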


of weights to regularize errors within itself. The effect of dropout on convolutional layers, though, is unpredictable.

Batch normalization is another form of overfitting reduction [12]. With multiple layers in deep learning algorithms all being trained simultaneously, a very common problem is covariate shift. Batch normalization attempts to reduce covariate shift by maintaining the distribution of values for each layer's output. While batch normalization is a very useful discovery in deep learning that we recommend always be used, it does not address the structural problem of overfitting that we define and attempt to address.

Ensemble methods have also been shown to help reduce overfitting [23]. Using multiple classifiers and aggregating their results can dramatically improve the overall performance of any classification task. Any improvement in the individual classifiers will improve the overall performance of the ensemble, however, so we choose to focus on individual models.
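The dropout mechanism described above (deactivating values with some set probability during training) can be sketched in a few lines. This is a generic inverted-dropout sketch of our own, not code from [24]:

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each value with probability p during training.

    The surviving values are scaled by 1/(1-p) ("inverted dropout") so the
    expected activation stays constant; at inference time values simply
    pass through unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((4, 4))
train_out = dropout(a, 0.5, rng)                 # mix of 0.0 and 2.0 values
infer_out = dropout(a, 0.5, rng, training=False)  # identical to the input
print(train_out, infer_out.sum())
```

Because a different random mask is drawn each forward pass, no unit can rely on any particular co-adapted partner being present.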


CHAPTER III
BACKGROUND

Multivariate Spatiotemporal Data

MVS data is a subset of time series data that consists of multiple variables, each containing spatial data.

To represent an MVS data point we use a 3D matrix. One dimension represents time, one dimension represents the set of variables, and the final dimension represents the spatial dimension. This last dimension is of either size two or three depending on whether we are considering 2D or 3D spatial data. For the purposes of this paper, we will assume all MVS data is in three-dimensional Euclidean space. All of the algorithms discussed in this paper are trivially converted from three-dimensional space to two-dimensional space.

Let $x_{it}$ be the location of variable $i$ at time step $t$ within the x-dimension of Euclidean space. Let $y_{it}$ and $z_{it}$ be the y-dimensional and z-dimensional counterparts respectively. The value of a variable for each time step can be defined as shown in equation (1).

$$v_{it} = [x_{it}, y_{it}, z_{it}] \qquad (1)$$

If $m$ is the total number of time steps within an MVS data point and $n$ is the total number of variables contained within an MVS data point, then the data point can be represented as shown in equation (2).

$$\begin{bmatrix} v_{11} & v_{12} & v_{13} & \cdots & v_{1(m-1)} & v_{1m} \\ v_{21} & v_{22} & v_{23} & \cdots & v_{2(m-1)} & v_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ v_{(n-1)1} & v_{(n-1)2} & v_{(n-1)3} & \cdots & v_{(n-1)(m-1)} & v_{(n-1)m} \\ v_{n1} & v_{n2} & v_{n3} & \cdots & v_{n(m-1)} & v_{nm} \end{bmatrix} \qquad (2)$$

Convolutional Neural Networks

CNNs were designed initially for classifying image data. They are specially built to learn to recognize useful features within the raw image color or saturation data. The architecture
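The representation in equations (1) and (2) maps directly onto a 3D array. A minimal sketch (the array shape ordering and the example joint values are our own choices for illustration):

```python
import numpy as np

n, m = 20, 150             # e.g. 20 tracked joints over 150 time steps
mvs = np.zeros((n, m, 3))  # axes: variable i, time step t, (x, y, z)

# v_it = [x_it, y_it, z_it]: the value of variable i at time step t
i, t = 4, 99
mvs[i, t] = [0.31, 1.20, 0.05]

print(mvs.shape)  # (20, 150, 3)
print(mvs[i, t])  # the 3-D location of joint 4 at time step 99
```

For 2D spatial data the last axis simply has size two instead of three, matching the remark above about converting between the two cases.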


is effective because it takes advantage of some underlying assumptions about images that both allow for a reduction in architecture complexity and produce better generalization. Some of these underlying assumptions about images hold true for multivariate time series as well.

For images, they assume specific values within an image are correlated with all collocated data points. They also assume features are location invariant. This location invariance and correlation allow for some simplifications in the internal representation of data, which in turn reduces the complexity of the model.

Specific values within MVS data are related to each other along the temporal axis. Additionally, there would be no reason to group single-variable sequences into multivariate sequences unless they had some correlation; correlations exist between different variables as well. Based on the type of data, features may be temporally invariant as well. A handshake is a handshake whether it occurs at the beginning or the end of the sequence.

Almost all CNNs contain some organization of three common components. These components are the convolutional layer, the pooling layer, and the fully connected layer.

Convolutional Layer. The convolutional layer is comprised of a set of kernels. The height and width of the kernels are hyperparameters. A hyperparameter is any value that is set during the construction of a machine learning model. Hyperparameters affect the power of the model and can be easily modified in order to properly tune it.

The depth of the kernels is equal to the number of input channels. We will see what input channels are shortly. All of the kernels within a single layer are the same size. The number of kernels within the layer is also a hyperparameter.

The input to a convolutional layer is a three-dimensional matrix. The first and second dimensions are called the height and width of the input. The third dimension is called the input channels.

For the first convolutional layer in a network, in which raw data is the input, we make the following considerations:

For images, the height and width of the input are simply the height and width of the


image. For greyscale images, there is only one input channel, which represents the saturation at each location. For color images, there are three input channels that contain the red, green, and blue saturation for each location in the image.

For multivariate time series data, either the height or the width is the time dimension of the data. The remaining of the two dimensions holds the different variables. So for players on a basketball court, the height dimension may contain all of the different players, for example. There are two or three input channels containing the x, y, and possibly z coordinates separately for each variable per time step.

The inputs of deeper convolutional layers within the network are harder to characterize because they consist of the output from shallower layers. We will explain how to interpret this output shortly. But first, let's examine their structure.

The output of a convolutional layer is also a three-dimensional matrix. Again, the first and second dimensions are the height and width. The third dimension is called the output channels. The number of output channels is equal to the number of kernels in the layer.

To generate the output, the kernels are swept along the input. The dot products between the kernels' values and each subsection of the input (which are of equal size to the kernel) are calculated. The results of these dot products are then passed on as the output of the convolutional layer. The $n$th kernel generates the $n$th output channel. These channels are called the activations or activation maps of the kernels.

While it is fairly standard for the dot products to be calculated between each subsection of the input, the step size they take between each calculation is another hyperparameter. Setting a step size of (2, 3), for example, would mean that the activations of the kernels would only be calculated every other location along the height and every third location along the width of the input. For the remainder of the paper, we will assume a step size of (1, 1) unless otherwise stated.

Another hyperparameter that can be adjusted is the padding that is added to the input height and width before processing. Padding can be added for two reasons. If the step size is


not set to (1, 1), depending on the ratio of input height and width to kernel shape, there may exist values within the input that are not used to calculate activations. By adding padding to the input, we can guarantee all input values contribute to the activations. Additionally, padding can be added to ensure that the output height and width are equal to the input height and width. Padded values are usually set to zero.

In standard convolutional layers, in simplified terms, the output can be interpreted as follows. The value located at $(x, y, z)$ in the output can be considered the probability of the $z$th learned pattern being present at $(x, y)$ in the input. These are calculated directly by the convolutional kernels. Let $K_z(x, y)$ be the $z$th kernel's activation for location $(x, y)$ in the input. Let $O(x, y, z)$ be the output of the convolutional layer at location $(x, y, z)$.

$$O(x, y, z) = K_z(x, y) \qquad (3)$$

The entire output is the likelihood of the set of learned features existing at each location of the input, where each channel represents a single feature.

Pooling. The standard pooling layer is a form of downsampling. Like the convolutional layer, the input of the pooling layer is a three-dimensional matrix. The first two dimensions are the height and width and the third dimension contains the input channels.

A window is swept through the height and width of the input. Depending on the type of pooling, an operation reduces the set of values within the window to a single value. For max pooling, the maximum value within the window is selected. For average pooling, the average value of the window is calculated.

The window size and the step size are both selected as hyperparameters. Generally, these windows are of size 2x2 because anything larger can be very destructive. The step size is usually the same as the window size so no inputs are examined multiple times and no inputs are skipped.

There also exist other less standard forms of pooling.

Global channel averaging is a pooling operation in which the entire contents of each input
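The kernel sweep that produces equation (3) can be written out naively. A minimal unvectorized sketch, assuming valid (no-padding) convolution; the function name and shapes are our illustrative choices:

```python
import numpy as np

def conv2d(x, kernels, step=(1, 1)):
    """Sweep each kernel over the input; every dot product between a kernel
    and an equal-sized input patch becomes one activation.

    x: (H, W, C_in); kernels: (K, kh, kw, C_in) -> output (H', W', K)."""
    H, W, _ = x.shape
    K, kh, kw, _ = kernels.shape
    sh, sw = step
    out = np.empty(((H - kh) // sh + 1, (W - kw) // sw + 1, K))
    for z in range(K):                      # z-th kernel -> z-th output channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i * sh:i * sh + kh, j * sw:j * sw + kw]
                out[i, j, z] = np.sum(patch * kernels[z])
    return out

x = np.random.default_rng(0).random((10, 8, 3))    # e.g. 10 variables x 8 steps, 3 coords
k = np.random.default_rng(1).random((5, 3, 3, 3))  # five 3x3 kernels
print(conv2d(x, k).shape)  # (8, 6, 5)
```

Setting `step=(2, 3)` reproduces the "every other row, every third column" behaviour described above, at the cost of a smaller output map.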


channel are averaged together into a single value. There is a method for pooling between channels called channel-max, in which the maximum value of each $(x, y)$ location in the input is taken along the $z$ axis. The result is a single-channel output with the same width and height as the input.

All of these have different uses and interpretations, but their goal is to reduce the dimensionality of the input in some useful way.

Fully Connected Layers. Each fully connected layer has a single matrix as its weights. The input of a fully connected layer is a single vector. The only requirement is that the input vector length and the width of the layer's matrix are equal. The layer simply performs a dot product between the input vector and the matrix and passes the new vector out as output.

Additionally, there may be a bias added to the output. A bias is simply a vector whose size is the same as the height of the layer's matrix. To add the bias, the output vector and the bias are simply added together. Note that some papers on convolutional neural networks implicitly include a bias when discussing fully connected layers.

Fully connected layers are ubiquitous in deep learning because they are able to learn non-linear mappings between their input and their output. In CNNs, these layers are typically saved for the end of the network and are used to perform the actual classification once the features have been extracted by the convolutional layers.

Activation Functions. Activation functions are functions that take in some input vector or matrix and feed each value through a function. Common functions include the sigmoid function, the hyperbolic tangent function, and the rectified linear unit (ReLU).

Activation functions are used to control how information flows through the network. For example, using the ReLU, we can limit the network to only process positively correlated information. In addition, backpropagation through ReLU layers prevents applying gradient descent to values that are negatively correlated.
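The pooling variants above are short array operations. A minimal sketch of non-overlapping 2x2 max pooling and global channel averaging (function names and the toy input are ours):

```python
import numpy as np

def max_pool(x, win=2):
    """Non-overlapping max pooling: step size equals window size, so no
    input is examined twice and none is skipped."""
    H, W, C = x.shape
    x = x[:H - H % win, :W - W % win]  # trim to a multiple of the window
    return x.reshape(H // win, win, W // win, win, C).max(axis=(1, 3))

def global_channel_average(x):
    """Average the entire contents of each channel to a single value."""
    return x.mean(axis=(0, 1))

x = np.arange(32.0).reshape(4, 4, 2)
print(max_pool(x).shape)                # (2, 2, 2): height and width halved
print(global_channel_average(x).shape)  # (2,): one value per channel
```

Swapping `.max(axis=(1, 3))` for `.mean(axis=(1, 3))` gives average pooling; channel-max would instead take `x.max(axis=2)`.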


CHAPTER IV
CONVOLUTIONAL COLUMNS

Despite CNNs suffering from less overfitting than traditional forms of deep learning, they still have structural problems through which overfitting is not entirely removed.

Essentially this overfitting comes from the feature generation process. For generation of a feature at some layer within a CNN, all features from a lower-level feature map must be considered. This is the case even when only a subset of the lower-level features are indicators of this specific higher-level feature.

We first explore exactly what this structural problem is. We then propose a new form of convolutional layer that helps to relieve this problem by allowing small subsets of these lower-level features to self-organize into useful high-level features.

Structural Cause of Overfitting in CNNs

Given some arbitrary convolutional layer, $k$, within a CNN, there exists some theoretical set of features, $F_k$, that the kernels of this convolutional layer may learn. Let's limit the set of features in $F_k$ to be only features that lead to proper classification.

These features in $F_k$ are constructed by combining some subset of the features learned in the previous layer, $F_{k-1}$. In the special case where $k = 1$ (the layer we are considering is the first layer in the network), $F_{k-1}$ instead contains the set of useful features that exists in the original input.

For any classification task, there must exist some feature in $F_{k-1}$ that is only used to construct some subset of the features that exist in $F_k$. Or, to state it a different way, not all features in $F_{k-1}$ are used to create all features in $F_k$. We will call one of these features from $F_{k-1}$ that is not universally useful $\hat{f}$.

To illustrate this, let's assume for a moment that there is no such feature as $\hat{f}$. The result would be that each higher-level feature in $F_k$ is a combination of all lower-level features in $F_{k-1}$. All features in $F_k$ would be identical. The classes would be made of the same exact features. Classification would be impossible. So there must exist at least one feature that


can distinguish between different classes. For any sufficiently difficult classification task, there must be a large set of such features.

There are two possibilities for $\hat{f}$. Case A: no kernels in the $(k-1)$th layer will learn to recognize $\hat{f}$, or Case B: at least one of the kernels will.

In Case A, $\hat{f}$ is not a feature recognized by the $(k-1)$th layer. The subset of features in $F_k$ that rely on $\hat{f}$ will either not be recognizable or must rely on a smaller set of indicators. In the latter situation, this level-$k$ feature may be harder to recognize with this smaller set of level-$(k-1)$ features. Both of the outcomes in Case A could lead to lower generalization.

In Case B, where $\hat{f}$ is a feature recognized by the $(k-1)$th layer, the presence or absence of $\hat{f}$ will be passed along to the $k$th layer. The subset of $F_k$ that is not constructed using $\hat{f}$ must either learn to ignore the feature or, if the training data happens to be biased, may learn incorrectly to rely on the feature, resulting in false positives. In the former outcome, parameter space that could be used for learning useful information must instead be used to interpret non-useful features from the previous layer. In the latter outcome, the network has overfit the training data and reduced its generalization capacity.

An important note: the smaller the training dataset size, the greater the bias in the training data, meaning the more likely Case B will result in overfitting.

One potential solution is to reduce the size of the layer so that less used features are never learned in favor of more widely useful features. This forces the network towards Case A. As we have already discussed, wider and deeper networks are more expressive. So for sufficiently difficult classification tasks, reducing the size of the network may not be a desirable option.

If we want to keep our layer large, pushing our network towards Case B, we can increase the training data size to reduce the bias in our training data. Data augmentation can be used to get much of the benefit of increasing the training set size without the requirement of collecting new data.

Neither of these solutions addresses the actual problem, however, which is that each kernel in layer $k$ must consider all learned features from layer $k-1$ whether they are useful or not.


We propose a fully differentiable, trainable architecture that allows lower-level features to self-organize into higher-level features. These high-level features will be created without the necessity of examining all combinations of low-level features.

Convolutional Columns

Let's assume for a minute that we know the exact size of $F_k$. Let's also assume that we know exactly which features from $F_{k-1}$ are needed for recognition of each feature in layer $k$. Let the $i$th feature be $f_{ik} \in F_k$. We could create an activation function, $A$, that would return 1 if some acceptable percentage, $\alpha$, of these features existed and 0 otherwise.

We'll assume we have the indices for the set of features from $F_{k-1}$ that are needed to recognize $f_{ik}$. We'll call the set of these indices $P_i(F_{k-1})$.

$$A(i, k) = \begin{cases} 1 & \alpha < \frac{1}{|P_i|} \sum_{j \in P_i} A(j, k-1) \\ 0 & \alpha \geq \frac{1}{|P_i|} \sum_{j \in P_i} A(j, k-1) \end{cases} \qquad (4)$$

We will call the summation of lower-level activations $V$.

$$V(P_i, k) = \sum_{j \in P_i} A(j, k-1) \qquad (5)$$

Notice how only the necessary features from $F_{k-1}$ are considered when generating these higher-level features. Additionally, note that we can reinterpret $V$ as a summation of votes generated by the set of lower-level features, where each lower-level feature is voting on whether the higher-level feature exists.

Unfortunately, assuming we know exactly which features or how many features will be useful is not a good assumption. A better model would be one that can learn how low-level features can be organized into useful high-level features. By making some less severe assumptions we can create an approximation of $A$ that is differentiable and, therefore, able to learn the proper feature combinations.

We would like to achieve this goal by forcing a reinterpretation of the meaning of outputs from convolutional layers. Instead of learning whether a pattern exists, we would like kernels


to implicitly learn how their pattern contributes to a higher-level pattern. In order to do this, we add an additional set of steps to a standard convolutional layer.

Let's say we want to generate an activation map for our selected level $k$. We want to do so for each of some number of patterns, $\gamma = |F_k|$. Let's also say we can assume these patterns are recognizable from some bounded number, $\beta$, of lower-level patterns, where $\beta \leq |F_{k-1}|$. $\beta$ is also less than the total number of patterns we learned in layer $k-1$.

We first create $\gamma$ parallel standard convolutional layers, each containing $\beta$ kernels. Let $x$ be the output of the previous layer in the network. Let $K_i$ be a function that generates a standard convolutional activation map for the $i$th kernel. We define the output of one of these parallel layers as:

$$O(x) = \sum_{i=1}^{\beta} \tanh(K_i(x))$$

Essentially each kernel is now generating a vote between $[-1, 1]$ about whether the kernel's learned pattern is useful in combination with the other kernels within the layer. These votes are then all tallied. The result is an activation map of some implicit high-level feature generated from the lower-level features. The outputs from the parallel layers are concatenated together in order to create the final set of $\gamma$ higher-level activation maps.

We chose the hyperbolic tangent because some features may be negatively correlated with the target high-level feature. We wanted to allow these negatively correlated features to be able to produce negative votes.

The result of our new architecture is that we are replacing $V$ from equation (5) with the summation of hyperbolic tangent kernel activations and converting the function $A$ from equation (4) into a continuous function rather than a step function. Through stochastic gradient descent or one of its variants, we can train the kernels to approximate $A$ for a useful set of features.

Convolutional bundles provide an extra benefit as well. By generating higher-level features using the summation of votes rather than an additional convolutional layer, we are
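The vote-and-tally step of one bundle is compact in array form. A minimal sketch operating on pre-computed raw kernel activations (the function name, the bundle size, and the random maps are our illustrative choices):

```python
import numpy as np

def bundle_output(activation_maps):
    """One convolutional bundle: each of the beta kernels casts a vote in
    [-1, 1] via tanh, and the votes are tallied into a single implicit
    higher-level activation map.

    activation_maps: (beta, H, W) raw kernel activations K_i(x)."""
    return np.tanh(activation_maps).sum(axis=0)

beta = 4
maps = np.random.default_rng(0).standard_normal((beta, 6, 6))
out = bundle_output(maps)

# the tallied votes for each location are bounded by the bundle size
print(out.shape, float(out.min()) >= -beta, float(out.max()) <= beta)
```

Running gamma such bundles in parallel and stacking their single-map outputs along the channel axis yields the gamma higher-level activation maps described above.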


forcing these kernels to learn useful features. Features can no longer be ignored by higher-level layers. They are all considered equally during higher-level feature generation and must therefore learn useful features. We are preventing the co-adaptation of weights.

Figure 1 shows the convolutional bundle architecture.

[Figure 1: The convolutional bundle architecture.]

Because of the possibility of saturation in the hyperbolic tangent activation function and its negative effects on training large networks [2], we recommend using residual connections to propagate the gradient through the entire network.

Despite the process of summing these low-level features forcing them to learn good generalizations, there is still the possibility that some of these features are less useful than others. As a result, we want to allow the network to decide how useful they are after they have been learned. So, once our network has converged during training, we swap out the process of summing the votes with a 1x1 convolution with the initial values set to all 1s.

The 1x1 convolution with all values set to 1 starts out performing the exact same operation as summing the results. The network is then trained again, allowing these 1x1 convolutions to learn a parametric combination of these low-level features. This fine-tunes the network and produces higher accuracy.

The new layer does have a few shortcomings that should be mentioned. First, like
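The claim that the all-ones 1x1 convolution starts out identical to the vote sum can be checked directly. A minimal sketch (the function name and shapes are ours; across channels, a 1x1 convolution is just a weighted sum of the per-kernel vote maps):

```python
import numpy as np

def one_by_one_conv(activation_maps, weights):
    """A 1x1 convolution across channels: a learned weighted sum of the
    per-kernel vote maps.

    activation_maps: (beta, H, W); weights: (beta,) -> (H, W)."""
    return np.tensordot(weights, activation_maps, axes=1)

votes = np.tanh(np.random.default_rng(0).standard_normal((4, 6, 6)))
w = np.ones(4)                    # initialised to all 1s ...
fine = one_by_one_conv(votes, w)  # ... so this equals the plain vote sum
print(np.allclose(fine, votes.sum(axis=0)))  # → True
```

Training then resumes with `w` as free parameters, letting the network down-weight less useful low-level features without disturbing what has already converged.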


with standard convolutional layers, the number of features to learn must be selected. In convolutional layers, this is equivalent to choosing the number of kernels. In convolutional bundles, this is equivalent to choosing the number of parallel bundles.

Additionally, as with standard convolutional layers, there is no guarantee that the learned features are unique at that layer. For convolutional bundles, this is twofold. The kernels learned within a single bundle can recognize duplicate features, and the implicit high-level features can be the same across the entire layer.

Convolutional bundles introduce a new hyperparameter. The number of kernels within a single bundle must be chosen. This adds a layer of complexity while tuning the network.

Lastly, the higher-level features that each bundle learns may share some component low-level features. The detection of these low-level features is not shared across bundles, however, meaning there may be a significant duplication of effort between bundles.

Despite these shortcomings, we have found that these convolutional bundles outperform standard convolutional layers when classifying MVS data. We show the results of our experimentation in Chapter 7.


CHAPTER V
DATA AUGMENTATION

One method for reducing overfitting in machine learning algorithms like CNNs is to increase the size of the training dataset. When collecting new data is not possible, one method for generating additional training data is data augmentation. By applying transformations to the pre-existing training set, "new" data can be generated. There are two requirements for these transformations to be useful. First, these transformations must be label retaining, meaning the applied transformation does not change the actual class of the data point. These transformations must also imitate natural variance within the data.

For time series data, window slicing (WS) has been proposed as a form of cropping. A random range within the time domain is selected and extracted from the original data point. We employ this method in our data augmentation scheme. Window warping (WW) was proposed as a form of re-scaling in time series data. Again, a random time range is selected in the original data. This time frame is then either doubled or halved and reinserted back into the original time series.

Unfortunately, with WW, a training dataset that contains a rich enough amount of data to learn good generalizations of time scale may either be prohibitively large to generate or require a very large original training set.

We introduce a new form of scaling specifically suited for MVS data that removes the necessity of generating large amounts of training data. We call our method global time warping (GTW).

To perform GTW, we apply the following steps. We first select the target number of time steps, $\tau$, as a hyperparameter. We then generate a linearly spaced vector of length $\tau$ with values between 0 and 1. We call this our target vector, $\hat{C}$:

$$\hat{C} = \left(0,\ \frac{1}{\tau - 1},\ \frac{2}{\tau - 1},\ \ldots,\ \frac{\tau - 2}{\tau - 1},\ 1\right)$$

We call $\hat{C}$ our target completion percentage for each time step in our target time series.

For each time series to be scaled, we generate the total distance, $d_t$, for each time step $t$ using the Euclidean distance. Let $u$ be the total number of dimensions in each data point.


Figure 2: An example of completion values for a single sequence.

For example, MVS data that contains x, y, and z data would have $u = 3$.

$$d_t = \sum_{i=0}^{t-1} \sqrt{\sum_{j=1}^{u} \left(x_{j,i+1} - x_{j,i}\right)^2}$$

We then generate the completion percentage, $c_t$, for each time step, $t$, in the series. Let $m$ be the total number of time steps in the sequence.

$$c_t = \frac{d_t}{d_m}$$

Figure 2 shows an example of completion values for a single sequence.

Next, for each dimension in each variable of our time series, we generate the 1D cubic spline interpolation using all points in the original sequence. The value of the original sequence is the dependent variable and the completion percentage is the independent variable.

We then use these cubic spline interpolations to generate our new values for our target sequence. The value of the $i$th variable for some completion percent $c \in \mathbb{R}$ is defined as:

$$\hat{v}_{ic} = \left[S_{ix}(c),\ S_{iy}(c),\ S_{iz}(c)\right]$$


Figure 3: A view of how data artifacts are smoothed in favor of general trends when using GTW. The figure on the left is the original data; the figure on the right is the smoothed data.

In this case, $S_{ix}$, $S_{iy}$, and $S_{iz}$ are the cubic spline interpolation piecewise functions for the $i$th variable and dimensions $x$, $y$, and $z$ respectively.

For values $\hat{c} \in \hat{C}$, we generate a new time series of the following form:

$$\begin{bmatrix}
\hat{v}_{1\hat{c}_1} & \hat{v}_{1\hat{c}_2} & \hat{v}_{1\hat{c}_3} & \cdots & \hat{v}_{1\hat{c}_{\tau-1}} & \hat{v}_{1\hat{c}_\tau} \\
\hat{v}_{2\hat{c}_1} & \hat{v}_{2\hat{c}_2} & \hat{v}_{2\hat{c}_3} & \cdots & \hat{v}_{2\hat{c}_{\tau-1}} & \hat{v}_{2\hat{c}_\tau} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\hat{v}_{(n-1)\hat{c}_1} & \hat{v}_{(n-1)\hat{c}_2} & \hat{v}_{(n-1)\hat{c}_3} & \cdots & \hat{v}_{(n-1)\hat{c}_{\tau-1}} & \hat{v}_{(n-1)\hat{c}_\tau} \\
\hat{v}_{n\hat{c}_1} & \hat{v}_{n\hat{c}_2} & \hat{v}_{n\hat{c}_3} & \cdots & \hat{v}_{n\hat{c}_{\tau-1}} & \hat{v}_{n\hat{c}_\tau}
\end{bmatrix}$$

This new series is now scaled to a new time length and aligned to a global reference. Both up-scaling and down-scaling work with this algorithm. Additionally, because of the alignment along completion percentage, general trends become more important than the actual shape of the original data. See Figure 3 to see how general trends are strengthened and artifacts are smoothed.

If the speed at which some feature occurs is integral to the data's labels, GTW will not be beneficial, as it destroys the relative time scale of the data. Additionally, because of the use of Euclidean distance and cubic spline interpolation to generate points that do not exist in


the original data, the space in which the data exists must be metric. GTW cannot be used for all types of time series data.

We have found that GTW improves accuracy over using no data augmentation when applied to MVS data. Additionally, we have found that GTW outperforms WW and WS on the same data. The results of our experimentation can be found in Chapter 7.
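The GTW steps above (cumulative Euclidean distance, completion percentages, cubic spline interpolation against completion, evaluation at the linearly spaced target vector) can be sketched in Python. This is our own minimal illustration, not the thesis code: it treats one variable as a single (m, u) array and assumes every consecutive pair of points is distinct, so that the completion percentages are strictly increasing as `CubicSpline` requires.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def gtw(seq, tau):
    """Globally time-warp a sequence to tau time steps.

    seq: array of shape (m, u) -- m time steps, u metric dimensions.
    Returns an array of shape (tau, u).
    """
    # d_t: cumulative Euclidean distance travelled up to each time step.
    steps = np.linalg.norm(np.diff(seq, axis=0), axis=1)
    d = np.concatenate([[0.0], np.cumsum(steps)])

    # c_t = d_t / d_m: completion percentage per original time step.
    c = d / d[-1]

    # Target completion vector: tau linearly spaced values in [0, 1].
    c_hat = np.linspace(0.0, 1.0, tau)

    # One cubic spline per dimension, with completion percentage as the
    # independent variable, evaluated at the target completion values.
    return np.stack(
        [CubicSpline(c, seq[:, j])(c_hat) for j in range(seq.shape[1])],
        axis=1,
    )

# A toy 2-D zig-zag up-scaled from 5 to 9 time steps.
seq = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
warped = gtw(seq, tau=9)
```

Because the splines interpolate the original points, the warped sequence still begins and ends exactly where the original does; only the sampling positions along the trajectory change.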


Table 1: Types of gaits collected in the dataset.

Name  Type
S1    Walk straight towards the primary Kinect
S2    Walk straight away from the primary Kinect
C1    Walk straight towards the primary Kinect while holding a briefcase
C2    Walk straight towards the primary Kinect while holding a cellphone
C3    Walk straight towards the primary Kinect with arms crossed across the participant's chest
C4    Walk straight towards the primary Kinect with hands in pockets
C5    Walk straight towards the primary Kinect while the participant attempts to "disguise" their natural walking motion

CHAPTER VI
GAIT DATASET

In order to properly demonstrate the power of our algorithms, we generated a new human gait dataset. The whole dataset consists of 50 individuals, each performing a number of different types of walks ranging from 3 to 10. The data was captured using the Microsoft Kinect. The data contains the 3D joint locations of the 20 joints that are captured by the Microsoft Kinect. Each gait is captured from two view angles. Table 1 contains a list of the possible types of gaits that were collected in the dataset. In total, there were 842 gait sequences collected.

In order to create a consistent dataset, the following considerations were made. For every gait collected, standard fluorescent lighting was used and blinds were closed to minimize glare, shadows, and sunlight. Participants were asked to remove jackets and bags and to not carry any objects unless specifically required for the test. If the participants were wearing movement-impeding footwear such as high heels or boots, they were asked to remove the footwear. Participants were asked to walk naturally at about 3 miles per hour. Additionally, the material the participants walked on was standard linoleum tile. No slick or sticky materials were used to change the gait.

Figure 4: A single frame in the gait MVS dataset.

The location and angles of the Kinect sensors were restricted by their technical specifications. Their maximum field of view was 43° vertical by 57° horizontal. With an optimal far-field distance of 11 feet and an optimal near-field distance of 4 feet from the Kinect, the resulting optimal viewing range was a walk distance of 7 feet. We set the Kinect sensor 32 inches from the ground in an attempt to center the Kinect with the average human's height.

For the actual data, we wanted a path that created the least amount of joint overlap. If joints overlap from the view of the Kinect, the accuracy of the occluded joints becomes low. We found that a straight walk towards the Kinect sensor was the best path to prevent any accuracy loss.

In addition, we used an additional Kinect placed horizontally to the direction of the primary Kinect in order to create a more difficult test set with noise from occluded joints and multiple viewing angles. The different types of data were all tracked and demarcated properly.

The data itself consists of 20 joints per time step collected at roughly 45 frames per second. Each joint's x, y, and z locations were recorded for every time step. Table 2 lists the joints that were captured in the data. They are the standard joints used by the Kinect.


Table 2: Joints captured within the dataset.

Head         Shoulder-Center  Shoulder-Right  Shoulder-Left  Elbow-Right
Elbow-Left   Wrist-Right      Wrist-Left      Hand-Right     Hand-Left
Spine        Hip-Center       Hip-Right       Hip-Left       Knee-Right
Knee-Left    Ankle-Right      Ankle-Left      Foot-Right     Foot-Left


Figure 5: The layout of the M-CNN model.

CHAPTER VII
EVALUATION

To test our different algorithms, we first selected a standard base model. We used this model as a baseline to compare our contributions against.

The base model, which we will call M-CNN (multichannel CNN), consists of three convolutional layers, two max pooling layers, and two fully connected layers. Additionally, M-CNN employs batch normalization and dropout. Random noise is added to the input during training. Figure 5 shows the M-CNN layout.

In order to fit different lengths of sequences, for M-CNN only the first 40 time steps were saved. If the sequence was shorter than 40 time steps, zero padding was added.

In order to ensure the results of our tests are accurate, we trained and tested each algorithm 10 separate times. The results were then averaged. Each algorithm was trained for 1000 epochs, at which point their training seemed to have converged.

For the gait dataset, the test set consists of 150 samples. The test samples were randomly selected from the dataset in order to not bias the test cases. The specific samples are kept constant through all training and testing cases.

Whole System

Figure 6 shows the accuracy versus training epoch of the entire proposed system, referred to as GaitNet. This system includes both GTW data augmentation and convolutional column layers for the first and third convolutional layers. We include M-CNN as a reference.

Convolutional Columns

To test convolutional columns, there are three convolutional layers we can transform into convolutional columns. Models CC-1 through CC-3 are all networks in which we transform


Figure 6: A comparison of the aggregate system versus a standard CNN model.

Figure 7: The layout of the CC-4 model.

a single convolutional layer into a convolutional column. CC-4 replaces two convolutional layers with convolutional columns.

Additionally, using CC-4, we adjust both hyperparameters to see how different values affect the classification results. We use CC-1 through CC-3 to test how the location of the convolutional column affects classification.

Figure 8 shows the test accuracy versus training epoch of the CC-4 model in comparison with M-CNN. There are two important pieces to note here. First, the final accuracy is considerably better. Additionally, CC-4 trained faster than M-CNN.

Figure 8: Results from CC-4 in comparison to M-CNN.

Because our convolutional columns are essentially two layers built into one (they implicitly generate higher-level features), we wanted to ensure that just adding these extra layers was not where their benefit was being realized. We created a new model we called DEEP that added two new layers to M-CNN. We compare this model with CC-4 and M-CNN in Figure 9. Not unexpectedly, the accuracy decreases. We have increased the number of parameters in our model, causing it to overfit rather than generalizing well.

Additionally, we wanted to ensure that the gain in accuracy wasn't due to parallel convolutional layers. We constructed a model we called PARA that replaced the convolutional columns in CC-4 with the same number of parallel standard convolutional layers. Figure 10 shows the results of this network versus CC-4 and M-CNN. Just as DEEP led to increased overfitting by increasing the number of parameters, so did PARA.

The benefit of convolutional columns comes from the actual structural changes they introduce rather than simply increasing the network size.

Figure 11 shows the test accuracy versus training epoch of the CC-1 through CC-3 models in comparison to M-CNN. For CC-1 and CC-3, we again see that training times are improved. M-CNN's test accuracy is equal with these networks, however. With CC-2, there seems to be no difference in training time or accuracy in comparison with M-CNN.

Data Augmentation

To test our form of data augmentation, we again start with M-CNN, in which we don't use any augmentation. The GTW model applies cropping and scaling to the incoming training data using GTW. Finally, to compare their accuracy with existing methods, the WW model applies WW and WS to the input of CONV-1. For both the WW and GTW models, the target cropping size was the same. Both the potential amount of cropping and the size of the final scaled sequence can be modified for data augmentation.

Figure 12 shows the accuracy versus training epoch for all three models. Not unexpectedly, any data augmentation significantly improves the accuracy of the model during testing.
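The fixed-length preprocessing used for M-CNN earlier in this chapter (keeping only the first 40 time steps and zero-padding shorter sequences) can be sketched as follows. This is our own illustration; the function name and argument layout are ours, not the thesis code.

```python
import numpy as np

def to_fixed_length(seq, length=40):
    """Truncate a sequence to `length` time steps, or zero-pad it.

    seq: array of shape (m, features), e.g. features = 20 joints x 3 coords.
    Returns an array of shape (length, features).
    """
    m, features = seq.shape
    if m >= length:
        # Keep only the first `length` time steps.
        return seq[:length]
    # Zero-pad shorter sequences at the end.
    out = np.zeros((length, features), dtype=seq.dtype)
    out[:m] = seq
    return out

short = np.ones((25, 60))   # 25 time steps, 20 joints x 3 coordinates
long = np.ones((55, 60))    # 55 time steps
padded = to_fixed_length(short)
truncated = to_fixed_length(long)
```

Either way, every sequence fed to the network has the same (40, features) shape, which the fixed-size convolutional and fully connected layers require.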


Figure 9: Comparison of M-CNN, CC-4, and DEEP.

Figure 10: Comparison of M-CNN, CC-4, and PARA.


Figure 11: The three different convolutional column layer models in log scale.

Note that GTW outperforms WW and WS for MVS data.


Figure 12: The testing accuracy of the three different data augmentation models versus the training epoch.


CHAPTER VIII
CONCLUSION AND FUTURE WORK

Convolutional neural networks have achieved state-of-the-art results for many machine learning tasks. They leverage the power of automatic feature learning to identify unlabeled data. Multivariate time series data shares many properties with images that make CNNs ideal for classifying multivariate time series. Despite their specialized architecture, these CNNs suffer from overfitting.

We proposed two solutions for reducing overfitting of CNNs for multivariate metric time series data. We developed a data augmentation algorithm for metric time series data that improved upon existing data augmentation techniques for time series data. Our algorithm allows for scaling to any target size and simultaneously globally aligns the data.

Additionally, we developed a structural change to convolutional layers. This layer attempts to reduce the amount of complex co-adaptation of weights within the network and also reduce the chance of bias being learned from training data. We compared these layers' power to standard convolutional layers to show the improvement in accuracy.

Future Work

We plan to explore potential time series that benefit more from WTS. Additionally, we want to explore how convolutional bundles could be used with image data and other forms of data that have state-of-the-art classification results from CNNs. Finally, we would like to examine how to automatically learn global alignment vectors that produce better results than linearly spaced vectors.


REFERENCES

[1] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in Neural Information Processing Systems, pages 81-90, 1989.

[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166, 1994.

[3] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359-370. Seattle, WA, 1994.

[4] Nadia A. Chuzhanova, Antonia J. Jones, and Steve Margetts. Feature selection for genetic sequence classification. Bioinformatics (Oxford, England), 14:139-143, 1998.

[5] Deepjoy Das and Alok Chakrabarty. Human gait recognition using deep neural networks. In Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, page 132. ACM, 2016.

[6] Elena Gianaria, Marco Grangetto, Maurizio Lucenteforte, and Nello Balossino. Human classification using gait features. In International Workshop on Biometric Authentication, pages 16-27. Springer, 2014.

[7] Dian Gong and Gerard Medioni. Dynamic manifold warping for view invariant action recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 571-578. IEEE, 2011.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346-361. Springer, 2014.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[10] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[11] Yuchi Huang, Xiuyu Sun, Ming Lu, and Ming Xu. Channel-max, channel-drop and stochastic max-pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 9-17, 2015.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.


[13] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7:358-386, 2005.

[14] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[16] Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, 2016.

[17] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[18] Lily Lee and W Eric L Grimson. Gait analysis for recognition and classification. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 155-162. IEEE, 2002.

[19] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[20] Gerard Medioni, Chi-Keung Tang, and Mi-Suen Lee. Tensor voting: Theory and applications. In Proceedings of RFIA, volume 2000, 2000.

[21] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11:761-767, 1998.

[22] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958-962, 2003.

[23] Peter Sollich and Anders Krogh. Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems, pages 190-196, 1996.

[24] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[26] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1505-1518, 2003.


[27] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1-10, 2015.

[28] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, pages 3995-4001, 2015.

[29] Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. Time series classification using multi-channels deep convolutional neural networks. In International Conference on Web-Age Information Management, pages 298-310. Springer, 2014.

[30] Feng Zhou and Fernando Torre. Canonical time warping for alignment of human behavior. In Advances in Neural Information Processing Systems, pages 2286-2294, 2009.