Citation
Multispectral skeleton-based human action recognition

Material Information

Title:
Multispectral skeleton-based human action recognition
Creator:
Lang, Bo
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2019
Language:
English

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering
Committee Chair:
Liu, Chao
Committee Members:
Lei, Tim
Harid, Vijay

Notes

Abstract:
Automatic recognition of human actions is an important and challenging problem in the surveillance and intelligent transportation areas. The dynamics of human body skeletons convey significant information for human action recognition, which has attracted much attention in computer vision. Conventional pose estimation approaches for obtaining skeleton data operate on visible color imaging data and are therefore affected by lighting conditions, whereas a thermal camera detects the human body reliably regardless of lighting. On the other hand, thermal data lose the fine visual details of human objects, especially at long distances. In this paper, we first propose a multispectral pose estimation algorithm to generate skeleton body-keypoint data from multispectral images or videos. Then, to capture richer dependencies beyond the fixed skeleton graph, we introduce a multi-task feedback action recognition model. In the structural task S (skeleton graph), we use a multi-filter selective convolution network with attention to extract features, while an attention-based GCN is applied in the actional task A (an action graph composed of action links inferred directly from the action) to learn actional graph features. Furthermore, to select discriminative temporal information, we add time links across several frames, and an attention mechanism is employed to enhance the information of key time links. A feedback module is connected between pairs of blocks to iteratively update the action-link information. Our overall model stacks attention-based graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features from multispectral data for action recognition in different environments.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Bo Lang. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.



Full Text
MULTISPECTRAL SKELETON-BASED HUMAN ACTION RECOGNITION
by
BO LANG
B.S., Electrical Engineering, University of Colorado Denver, 2016
A thesis submitted to the Faculty of the Graduate School of the University of Colorado Denver in partial fulfillment of the requirements for the degree of Master of Science
Electrical Engineering, 2019


This thesis for the Master of Science degree by
Bo Lang
has been approved for the
Department of Electrical Engineering by
Chao Liu, Chair
Tim Lei
Vijay Harid
©2019 BO LANG
ALL RIGHTS RESERVED


Lang, Bo (M.S., Electrical Engineering)
Multispectral Skeleton-based Human Action Recognition
Thesis directed by Assistant Professor Chao Liu
ABSTRACT
Automatic recognition of human actions is an important and challenging problem in the surveillance and intelligent transportation areas. The dynamics of human body skeletons convey significant information for human action recognition, which has attracted much attention in computer vision. Conventional pose estimation approaches for obtaining skeleton data operate on visible color imaging data and are therefore affected by lighting conditions, whereas a thermal camera detects the human body reliably regardless of lighting. On the other hand, thermal data lose the fine visual details of human objects, especially at long distances. In this paper, we first propose a multispectral pose estimation algorithm to generate skeleton body-keypoint data from multispectral images or videos. Then, to capture richer dependencies beyond the fixed skeleton graph, we introduce a multi-task feedback action recognition model. In the structural task S (skeleton graph), we use a multi-filter selective convolution network with attention to extract features, while an attention-based GCN is applied in the actional task A (an action graph composed of action links inferred directly from the action) to learn actional graph features. Furthermore, to select discriminative temporal information, we add time links across several frames, and an attention mechanism is employed to enhance the information of key time links. A feedback module is connected between pairs of blocks to iteratively update the action-link information. Our overall model stacks attention-based graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features from multispectral data for action recognition in different environments.


The form and content of this abstract are approved. I recommend its publication.
Approved: Chao Liu


ACKNOWLEDGMENT
I would like to express my deepest thanks to my advisor, Dr. Chao Liu, for the valuable guidance and suggestions he has given me in this project. Without him, this work would have been impossible. I also thank the committee members, Dr. Tim Lei and Dr. Vijay Harid, for spending their valuable time reviewing this report and attending my defense.
Thank you to my friend Yan Pang for getting me up to speed on graph convolutional networks and offering advice throughout the thesis process.


Table of Contents
CHAPTER I ................................................................ 1
    Role and Challenge of Action Recognition in Intelligence Area ....... 1
    Skeleton-Based Human Action Recognition ............................. 1
        Related Works ................................................... 2
    Pose Estimation Methods ............................................. 3
        Related Works ................................................... 4
    Multispectral Skeleton-based Human Action Recognition ............... 4
CHAPTER II ............................................................... 6
    Introduction of OpenPose ............................................ 6
    Materials & Methods ................................................. 8
        RGB & Thermal human skeleton detection dataset .................. 8
        Data Fusion ..................................................... 9
        Model Fusion .................................................... 11
    Multispectral Fusion Pose Estimation Network ........................ 14
    Experiments & Results ............................................... 14
CHAPTER III .............................................................. 16
    Introduction of Graph Convolutional Network ......................... 16
        Overview ........................................................ 16
        Related Works ................................................... 17
    Materials & Methods ................................................. 21
        Backgrounds ..................................................... 21
        Objectives ...................................................... 22
        Graph Attention Network ......................................... 23
        Multi-Filters GCN ............................................... 25
        Temporal Convolution Network .................................... 27
        Two Stream Feedback Attention Based GCN ......................... 29
            Structural Task (S) ......................................... 29
            Action Task (A) ............................................. 33
            Feedback Model .............................................. 35
CHAPTER IV ............................................................... 37
    Datasets ............................................................ 37
    Model Setting ....................................................... 38
    Comparisons to the State-of-the-Art ................................. 38
CHAPTER V ................................................................ 41
REFERENCES ............................................................... 42


FIGURES
Figure
I.1 Skeleton sequence of N frames for action walking ..................... 1
I.2 Some pose estimation results ......................................... 3
I.3 The pipeline of the proposed model ................................... 4
II.1 Expression of body keypoints ........................................ 6
II.2 Architecture of OpenPose ............................................ 7
II.3 Comparison results of RGB and Thermal image ......................... 8
II.4 Equipment for data collection ....................................... 9
II.5 Data fusion process ................................................. 10
II.6 Some results of data fusion ......................................... 10
II.7 Early Fusion ........................................................ 11
II.8 Middle Fusion ....................................................... 11
II.9 Late Fusion ......................................................... 12
II.10 Examples of skeleton detection results on RGB & Thermal dataset .... 13
II.11 Architecture of multispectral fusion pose estimation model ......... 14
II.12 Pose estimation results for the fusion images ...................... 15
III.1 Euclidean Structure ................................................ 16
III.2 Non-Euclidean Structure ............................................ 16
III.3 The proposed partitioning strategy for constructing convolution .... 18
III.4 ST-GCN network ..................................................... 18
III.5 An example of the skeleton graph ................................... 20
III.6 The pipeline of the AS-GCN ......................................... 21
III.7 The attention algorithm ............................................ 24
III.8 Multi-filters graph convolution .................................... 26
III.9 Temporal convolution in ST-GCN ..................................... 27
III.10 Structural-Temporal graph ......................................... 28
III.11 The pipeline of 2S-ATGCN model .................................... 29
III.12 The architecture of selective filter network ...................... 30
III.13 An AST-GCN block .................................................. 33
III.14 Establishment process of the actional links ....................... 33
III.15 Feedback and updates of the actional links ........................ 36


TABLES
Table
II.1 Comparison results of three different fusion models ................. 13
II.2 Comparison results of different fusion processes .................... 15
IV.1 Comparison of action recognition performance on NTU-RGB+D ........... 39
IV.2 Comparison of action recognition performance on Kinetics ............ 40


CHAPTER I
INTRODUCTION
Role and Challenge of Action Recognition in Intelligence Area
Computer vision (CV) is an interdisciplinary scientific field that aims to build smart applications that understand the contents of images and videos the way humans do. Human behavior analysis and description is a research hotspot that has received wide attention in the fields of machine learning and computer vision. Human action recognition (HAR) is a broad research area that focuses on identifying specific movements or behaviors of humans using imaging sensors. It is popular for its wide applicability in automatically retrieving specific actions in videos from visual features; for example, healthcare monitoring systems and indoor/outdoor activity surveillance and emergency assistance systems have been developed around HAR. In recent years, a variety of action recognition algorithms have been proposed and have achieved good results. However, using RGB video data as the system input can cause problems under occlusion, complex backgrounds, and low illumination. Therefore, accurate action recognition in video is still a challenging task.
Skeleton-Based Human Action Recognition
Figure 1. Skeleton sequence of N frames for action walking


Skeleton-based action recognition is widely used in recent applications due to its robustness to illumination and scene changes.
The dynamic skeletal modality can be represented naturally by a time series of body key joints in the form of 2D or 3D coordinates. Human motion can then be identified by analyzing its motion patterns. With the development of cost-effective depth sensors (e.g., RealSense[1], Kinect[2]) and pose estimation algorithms (e.g., OpenPose[3]), applications of human action recognition based on 2D or 3D body keypoints have received extensive attention, and many advanced methods have been proposed in the past few years.
Related Works
Current skeleton-based action recognition algorithms can be divided into two approaches: handcrafted-feature based and non-handcrafted-feature (learned-feature) based. For the first approach, a wide variety of state-of-the-art algorithms have been proposed: Vemulapalli et al.[4] represented human skeletons as points in a Lie group and implemented the temporal modelling and classification in the Lie algebra; Wang et al.[5] designed an algorithm to capture action patterns based on local occupancy features; and two kernel-based tensor representations were proposed by Koniusz et al.[6] to capture the compatibility between two action sequences and the dynamic information of a single action.
On the other hand, the non-handcrafted-feature based methods learn the human action features from data automatically. One of the most popular techniques uses recurrent-neural-network (RNN) based models to capture the temporal dependencies between consecutive frames. Zhu et al.[7] introduced a regularized LSTM model for co-occurrence feature learning. Song et al.[8] developed a spatio-temporal attention model to allocate different weights to different frames and joints in the video. Convolutional neural networks (CNNs) have also achieved remarkable results, such as the residual temporal CNN[9], the information enhancement model[10], and new representations of body keypoint data with CNNs[11].
Recently, the graph-based approach has drawn a lot of attention due to its more flexible use of the body joint relations. S. Yan et al.[12] proposed the Spatial-Temporal Graph Convolutional Network (ST-GCN), which automatically learns both spatial and temporal patterns from data; motivated by ST-GCN, M. Li et al.[13] further proposed the actional-structural graph convolution network (AS-GCN), which combines actional links and structural links to generate the skeleton graph; in addition, an adaptive graph convolutional network was proposed[14] to adaptively learn the graph topology for different GCN layers and skeleton samples in an end-to-end manner.
In this paper, we adopt the graph-based approach for the action recognition part. Different from previous methods, we capture more useful non-local information about actions by introducing an additional actional graph, and we also use an attention model to pay different levels of attention to joints and frames.
Pose Estimation Methods
Human pose estimation is one of the key challenges that has been studied for many years in the computer vision area. It uses the spatial information of body keypoints, obtained directly from images or videos, to establish a mathematical model of a specific posture. Therefore, the accuracy of detecting keypoints from images or videos has a significant impact on the estimation of human posture. Human pose estimation, which aims to obtain skeleton data from input videos or images, is heavily used for action recognition. However, many unavoidable factors, such as small and hard-to-capture skeletal points, occlusion by disturbing objects, and lighting changes, make this a challenging task.
Figure 2. Some pose estimation results


Related Works
Most recent pose estimation approaches adopt ConvNets as their primary building block, largely replacing handcrafted features and graphical models; this strategy has made significant advances on standard benchmarks. Toshev et al.[15] first presented a cascade of DNN regressors to increase the precision of joint localization. Z. Cao et al.[3] introduced Part Affinity Fields (PAFs), which represent association scores, to encode the location and orientation of the limbs of individuals in the image. Although much research focused on pose estimation already performs well on common datasets such as MPII and COCO, the accuracy is still affected by lighting conditions. For example, in dim light a camera can hardly capture the visible human body, so the accuracy of keypoint detection cannot be guaranteed, whereas a thermal camera detects the human body reliably regardless of lighting conditions. However, there is currently no RGB-Thermal dataset for human pose estimation. Therefore, the first contribution of this work is to create the first RGB-Thermal human action dataset for training a fused pose estimation model that achieves higher accuracy across different environments.
Multispectral Skeleton-based Human Action Recognition
Figure 3. The pipeline of the proposed model
Figure 3 illustrates the pipeline of the proposed model. There are two sub-networks in our method: multispectral pose estimation via a deep fusion network, and a multi-task graph-based convolutional network with attention mechanisms. The multispectral pose estimation algorithm generates the body key-joint data from multispectral images or videos. The data is usually a sequence of frames, and each frame yields a set of joint coordinates. We then organize the outputs of the fusion network into a graphical structure based on the dependencies between human joints and feed them into the two branches of GCNs to recognize the action label. A feedback module between the two tasks propagates information and provides rewards during the training process. We first introduce the pose estimation part as follows.


CHAPTER II
MULTISPECTRAL POSE ESTIMATION VIA DEEP FUSION NEURAL NETWORK
Introduction of OpenPose
A human pose skeleton represents the orientation of a person in a graphical format. Essentially, it is a set of coordinates that can be connected to describe the pose of the person. Each coordinate in the skeleton is known as a keypoint. A valid connection between two keypoints is known as a pair (or a limb). Note that not all keypoint combinations give rise to valid pairs. A sample human pose skeleton is shown below.
Figure 4. Expression of body keypoints
Multi-Person pose estimation is more difficult than the single person case as the location and the number of people in an image are unknown. Typically, we can tackle the above issue using one of two approaches:
The simple approach is to incorporate a person detector first, then estimate the parts and calculate the pose for each person; this is known as the top-down approach. The other approach is to detect all parts in the image (i.e., the parts of every person) and then associate the parts belonging to distinct persons; this is known as the bottom-up approach.


OpenPose[3] is one of the most popular bottom-up approaches for multi-person human pose estimation: it first detects body parts belonging to every person in the image, then assigns the parts to distinct individuals. Shown below is the architecture of the OpenPose model, which we use as the base for model fusion.
Figure 5. Architecture of OpenPose
The OpenPose network first extracts features from the input image using the first 10 layers of VGG-19, a convolutional neural network. These features are then fed into two parallel convolutional branches. The first branch predicts a set of 18 confidence maps, each representing a specific part of the body pose skeleton. The second branch predicts a set of part affinity fields (PAFs) that encode the location and orientation of limbs. Successive stages refine the predictions made by each branch. Through these steps, human pose skeletons can be estimated and assigned to every person in the image. This is the basic pose estimation model we fuse to obtain human skeletons from multispectral data.
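To make the two-branch design concrete, the following is a minimal PyTorch sketch (ours, not the authors' released code) of such a first stage: a truncated VGG-19 feature extractor followed by parallel confidence-map and PAF branches. The branch-head layer sizes are simplifying assumptions.

import torch
import torch.nn as nn
import torchvision

class OpenPoseStage1(nn.Module):
    def __init__(self, n_parts=18, n_pafs=38):      # 18 confidence maps, 19 limbs x 2 PAF channels
        super().__init__()
        vgg = torchvision.models.vgg19(weights=None).features
        self.backbone = nn.Sequential(*list(vgg)[:23])   # first 10 conv layers of VGG-19
        def branch(out_ch):                              # simplified prediction head
            return nn.Sequential(
                nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, out_ch, 1))
        self.conf_branch = branch(n_parts)   # branch 1: confidence maps S
        self.paf_branch = branch(n_pafs)     # branch 2: part affinity fields L
    def forward(self, x):
        f = self.backbone(x)
        return self.conf_branch(f), self.paf_branch(f)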


Materials & Methods
RGB & Thermal human skeleton detection dataset
Much research on real-time multi-person pose estimation has already achieved high accuracy on conventional visible color imaging data. However, that accuracy is affected by lighting and distance conditions as well as cluttered backgrounds. Thermal cameras, which sense long-wavelength infrared, are insensitive to the intensity of visible light. On the other hand, thermal data always lose the fine visual details of human objects, especially at long distances.
Figure 6. Comparison results of RGB and Thermal image
The two pairs of images above compare RGB and thermal photos taken at the same time against the same background. Visible cameras, much like our eyes, often have trouble seeing through naturally occurring visual obscurants that block reflected light. In the left RGB image of the first row, we cannot see anything in the dark scene; because thermal radiation passes through these visual barriers, the thermal camera easily and clearly captures the human body, as shown in the right image. In the second row, relatively good lighting increases the sharpness of the RGB image, but the thermal image may become blurred when the scene is too bright.
Data Collection
Equipment: FLIR Duo R camera, 250 W power station, monitor
Figure 7. Equipment for data collection
We recorded the RGB & Thermal videos with a FLIR Duo R multispectral camera under different location, time, and weather conditions to ensure dataset integrity and diversity. We then developed the world's first multispectral human skeleton detection dataset by extracting frames from the videos and creating annotation files that record the coordinates of 17 key joints of the human body: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. For the deep learning process, the dataset is split into two subsets: Train (25716 color and 25716 thermal images) and Test (6429 color and 6429 thermal images). The complete fusion neural network is trained and tested on this dataset.
Data Fusion
We aim to fuse thermally captured information with visible color imaging to improve the overall accuracy of multi-person pose estimation. To achieve optimal results, we divide the integration process into two parts: data fusion and model fusion.
In the data fusion part, we first convert the original RGB images to the CIELAB color space, then replace the channel L, which expresses lightness from black (0) to white (100), with the single-channel thermal image T to get an integrated image.
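A minimal sketch of this channel-replacement step, assuming the thermal frame has already been registered (aligned) to the RGB view; the function and variable names are ours:

import cv2
import numpy as np

def fuse_lab(rgb: np.ndarray, thermal: np.ndarray) -> np.ndarray:
    """rgb: HxWx3 uint8; thermal: HxW uint8, registered to the RGB view."""
    lab = cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB)
    # L := T, i.e. swap the lightness channel for the thermal channel
    lab[:, :, 0] = cv2.resize(thermal, (rgb.shape[1], rgb.shape[0]))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)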
Figure 8. Data fusion process
Image Results
Shown below are some typical data fusion results. We found that the clarity of the image and the adaptability to dark environments improve significantly.
Figure 9. Some results of data fusion
The original data of the two groups were taken at night and in the afternoon, respectively. We can hardly see people in the dark environment, and neither can the visible color camera. Conversely, thermal images become blurred in brighter conditions. After data fusion, the integrated image preserves the fine visual details of human objects and also captures useful information against dark backgrounds.
Model Fusion
In this section, we introduce three different fusion neural networks: early, middle, and late fusion. Each model is trained independently on our data, and the results are tested, compared, and ranked on their performance.
Figure 10. Early Fusion (EF)
Our early fusion architecture, shown in the figure above, concatenates the feature maps from the color and thermal branches in the middle of the convolutional layers. Afterwards, a 1 x 1 convolutional layer reduces the dimension of the concatenated layer. The output connects to the rest of the VGG layers and the Part Affinity Field & Confidence Map stages to simultaneously predict detection confidence maps and affinity fields that encode part-to-part association. To guide the network to iteratively predict confidence maps of body parts in the first branch and PAFs in the second branch, two loss functions are applied at the end of each stage, one at each branch.
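The fusion junction itself reduces to a channel concatenation followed by a 1 x 1 convolution. A hedged PyTorch sketch (the channel count is an assumption):

import torch
import torch.nn as nn

class FusionJunction(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # 1x1 convolution that maps the concatenated maps back to one branch's width
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
    def forward(self, f_rgb, f_thermal):
        return self.reduce(torch.cat([f_rgb, f_thermal], dim=1))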


In the middle fusion model, the RGB and thermal images are first analyzed by separate convolutional networks (initialized with the first 10 layers of VGG-19 and fine-tuned), generating two sets of feature maps F1 and F2. We then fuse F1 and F2 to get a set of integrated feature maps F that is input to the first stage of each branch, an iterative prediction architecture that refines the predictions over successive stages.
Figure 13. Late Fusion (LF)
In this model the fusion is performed at the end of Stage 3, hence the name late fusion. Each output (S, L, F) from the RGB image is concatenated with its counterpart from the thermal image, followed by a dimensionality reduction; the fused maps are the input to the next stage.
Test Results
We trained and tested these three models on the same dataset. Our results showed that late fusion performed the best of the three proposed models, and all of our models outperformed the baseline for pose estimation. This is because in LF each original visible color image and thermal image is separately analyzed by a multi-layer convolutional network before fusion, so the respective features can be fully extracted. The more discriminative features the model extracts, the better the resulting performance.
Some representative test images are shown below, and the comparison results of the three fusion models are listed in the table.


Fusion   Head   Shoulder   Elbow   Wrist   Hip    Knee   Ankle   mAP
EF       91.2   87.6       77.7    66.8    75.4   68.9   61.7    75.6
MF       92.4   90.4       80.9    70.8    79.5   73.1   66.5    79.1
LF       93.7   91.4       81.4    72.5    77.7   73.0   68.1    79.7
Table 1. Comparison results of three different fusion models
Figure 14. Examples of skeleton detection results on the RGB & Thermal dataset
The first two rows show the indoor scenarios and the last two rows represent the outdoor scenarios. Benefiting from the thermal data, the entire model performs well under low illumination and even in dark environments.


Multispectral Fusion Pose Estimation Network
To combine the advantages of the two fusion processes into an optimal model, we use the output of the data fusion model (the integrated image) as a third input to the model fusion network; here we use late fusion (LF), the best-performing variant. We separately concatenate the produced detection confidence maps S, part affinity fields L, and image features F from each stream, and the integrated prediction is then refined by each subsequent stage. Finally, human pose skeletons can be estimated and assigned to each person in the fused image. The architecture shown below is our complete multispectral fusion pose estimation neural network.
Figure 15. Architecture of multispectral fusion pose estimation model
Experiments & Results
In the training code, we adjusted the number of epochs to our dataset, which was sufficient to obtain a reasonable model. It is worth mentioning that training on a GPU is much faster than on a CPU. The base learning rate was 4e-4 with a batch size of 16, limited by the GPU memory. We initially set the maximum number of iterations to 5000 and observed an obvious downward trend of the loss, from millions to hundreds; the loss was still falling sharply when all the epochs finished, so we increased max_iter to 20000 and finally reached a loss of around 30. The learning rate is a hyperparameter that controls how much we adjust the weights of the network in response to the loss gradient: the lower the value, the slower we travel along the downward slope. A lower learning rate ensures that we do not skip over local minima, but it can also mean taking a long time to converge. The model was trained and evaluated on an Nvidia GTX 1070 Ti graphics card, which took around 6 hours. The accuracy of the model can improve with an expanded dataset and more precise labels.
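For reference, the reported hyperparameters map onto a training loop like the sketch below; the optimizer choice (Adam) and the dummy model, input size, and loss are our assumptions, not the thesis' exact setup.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 38, 3, padding=1)                     # stand-in for the fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)  # base learning rate from the text
max_iter, batch_size = 20000, 16                           # values reported above
for it in range(max_iter):
    images = torch.randn(batch_size, 3, 368, 368)          # placeholder input batch
    loss = model(images).pow(2).mean()                     # placeholder loss term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Shown below are some results of human pose skeleton output on the fused images.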
Figure 16. Pose estimation results for the fusion images
Fusion Process              mAP
Data Fusion                 77.8
Late Fusion                 79.7
Data Fusion + Late Fusion   82.6
Table 2. Comparison results of different fusion processes


CHAPTER III
HUMAN ACTION RECOGNITION MODEL
Introduction of Graph Convolutional Network
Overview
The data processed by traditional CNNs has a Euclidean structure, based on a matrix in which pixels are arranged. CNNs cannot process data with a non-Euclidean structure because traditional discrete convolution cannot maintain translation invariance on such data: in a topological graph, each vertex may have a different number of adjacent vertices, so a convolution kernel of fixed size cannot be applied. Many important real-world datasets come in the form of graphs or networks (social networks, knowledge graphs, etc.), which are topological diagrams of the relationships between vertices and edges in the graph-theoretic sense. Graph convolutional networks (GCNs), which generalize convolutional neural networks (CNNs) to graphs of arbitrary structure, have therefore received increasing attention. The application of GCNs to extract features from dynamic graphs over large-scale datasets, e.g. human skeleton sequences, is yet to be fully explored.
Figure 17. Euclidean Structure
Figure 18. Non-Euclidean Structure


GCNs are a very powerful neural network architecture for machine learning on graphs. Feature extraction on graphs divides into the vertex (spatial) domain and the spectral domain. Spectral methods utilize the eigenvalues and eigenvectors of the graph Laplacian matrix: they perform graph convolution in the frequency domain with the help of the graph Fourier transform[16], which avoids extracting locally connected regions from the graph at each convolutional step.
Most skeleton-based human action recognition approaches apply GCNs in the spatial domain, where the convolution filters operate directly on the graph vertices and their neighbors based on manually designed rules[17]. Node sequence selection is the process of identifying, for each input graph, a sequence of nodes for which receptive fields are created. For each of the identified nodes, the nodes of its neighborhood are the candidates for the receptive field, which is constructed by normalizing the assembled neighborhood.
Related Works
Recently, with the flexibility to exploit the body joint relations, the graph-based approach draws much attention.
Some typical GCN-based approaches learn the graphs adaptively from data, which captures useful non-local information about actions. To capture joint dependencies, recent methods construct a skeleton graph whose vertices are joints and whose edges are bones, and apply graph convolutional networks (GCNs) to extract correlated features. In this section, the implementation process is briefly introduced.
Notations
Consider a skeleton graph G = (V, E), where V is the set of n body joints and E is a set of m limbs. The node set V = \{v_{ti} \mid t = 1, \ldots, T,\; i = 1, \ldots, N\} includes all the joints in a skeleton sequence with T frames. The edge set E is composed of two subsets: the first depicts the intra-skeleton connections at each frame, denoted E_S = \{v_{ti} v_{tj} \mid (i, j) \in H\}, where H is the set of naturally connected human body joints; the second contains the inter-frame edges, which connect the same joint in consecutive frames, E_F = \{v_{ti} v_{(t+1)i}\}. All edges in E_F for one particular joint i therefore represent its trajectory over time. Let A \in \{0, 1\}^{n \times n} be the adjacency matrix of the skeleton graph, which fully describes the skeleton structure, with A_{ij} = 1 if the i-th and j-th joints are connected and 0 otherwise.
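A small sketch of how A can be built from the edge set H; the 17-keypoint edge list below matches the joints named in Chapter II, but its exact ordering is our assumption:

import numpy as np

N = 17
H = [(0, 1), (0, 2), (1, 3), (2, 4),           # nose-eyes, eyes-ears
     (5, 7), (7, 9), (6, 8), (8, 10),          # arms
     (5, 6), (5, 11), (6, 12), (11, 12),       # shoulders and torso
     (11, 13), (13, 15), (12, 14), (14, 16)]   # legs
A = np.zeros((N, N), dtype=np.float32)
for i, j in H:
    A[i, j] = A[j, i] = 1.0                    # A_ij = 1 iff joints i and j are connected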
To implement the label map and capture more refined location information, a common strategy divides the neighbor set into three subsets: 1) the root node itself; 2) the centripetal group: neighboring nodes closer to the gravity center of the skeleton than the root node; 3) the centrifugal group: the remaining neighbors. The average coordinate of all skeleton joints in a frame is treated as its gravity center.
Figure 19. The proposed partitioning strategy for constructing convolution
Spatial-Temporal GCN
Figure 20. ST-GCN network
The Spatial-Temporal Graph Convolutional Network (ST-GCN), which extends graph neural networks to a spatial-temporal graph model, was proposed for skeleton-based action recognition[12]. This deep learning network takes as input the estimated joint locations in the pixel coordinate system, which can be obtained with the publicly available OpenPose[3] toolbox on every frame of the clips.
ST-GCN consists of a series of ST-GCN blocks, each of which applies a spatial graph convolution followed by a temporal convolution to extract spatial and temporal features alternatingly. The last ST-GCN block is connected to a fully connected classifier to generate the final predictions. The key component in ST-GCN is the spatial graph convolution operation, which computes a weighted average of neighboring features for each joint.
The intra-body connections of joints within a single frame are represented by an adjacency matrix A, plus an identity matrix I representing the self-connections that carry each node's own features. ST-GCN uses the 1-neighbor set of joint nodes in all cases.
For partitioning strategies with multiple subsets, the adjacency matrix is decomposed into several matrices A_j such that A + I = \sum_j A_j. Let f_{in} \in \mathbb{R}^{n \times d_{in}} be the input features of all joints in one frame, where d_{in} is the input feature dimension, and f_{out} \in \mathbb{R}^{n \times d_{out}} be the output features obtained from the spatial graph convolution, where d_{out} is the output feature dimension. The spatial graph convolution is

f_{out} = \sum_j M_j \otimes \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} f_{in} W_j

where \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} is the normalized adjacency matrix for each partition group, and M_j and W_j are learnable weight matrices for each partition group that capture edge weights and feature importance, respectively. ST-GCN makes reasonable use of prior knowledge to give more attention to joints with large movements, which is implicitly reflected in this weight distribution strategy.
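A compact PyTorch sketch of this partitioned spatial graph convolution (single frame, no batch dimension; the initialization details are our assumptions):

import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    def __init__(self, A_parts, d_in, d_out):
        super().__init__()
        norm = []
        for A in A_parts:                                 # list of (n, n) partition matrices A_j
            lam = A.sum(1).clamp(min=1e-6) ** -0.5        # Lambda_j^{-1/2} as a vector
            norm.append(lam[:, None] * A * lam[None, :])  # Lambda^{-1/2} A_j Lambda^{-1/2}
        self.register_buffer("A_norm", torch.stack(norm))     # (J, n, n)
        self.M = nn.Parameter(torch.ones(self.A_norm.shape))  # learnable edge masks M_j
        self.W = nn.Parameter(torch.randn(len(A_parts), d_in, d_out) * 0.01)
    def forward(self, f_in):                              # f_in: (n, d_in)
        out = 0
        for j in range(self.A_norm.size(0)):              # sum over partition groups
            out = out + (self.M[j] * self.A_norm[j]) @ f_in @ self.W[j]
        return out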
ST-GCN uses a TCN as the temporal convolution operation to extract temporal features; because the feature maps have a fixed shape, a conventional convolutional layer can perform the temporal convolution. Compared with convolution on images, the shape of the last three dimensions of the feature map output by ST-GCN is (C, V, T), corresponding to the shape (C, H, W) of an image feature map: the image width W corresponds to the number of key frames T, and the image height H corresponds to the number of joints V.
Actional-Structural GCN
Although ST-GCN extracts features of joints that are directly connected through bones, long-range joint pairs that may cover critical motion patterns are largely ignored. For example, when walking, hands and feet are strongly related. Although ST-GCN attempts to aggregate a wider range of features with a layered GCN, node features may be weakened during long diffusion. To capture richer action-specific latent dependencies, AS-GCN was further proposed[13].
In this model, the authors infer actional links (A-links) from data to capture the latent dependencies between any pair of joints, and propose an A-link inference module (AIM) with an encoder-decoder structure.
Figure 21. An example of the skeleton graph with structural links and actional links
The actional-structural graph is constructed as G = (V, E), where E is the set of generalized links. Compared with the traditional skeleton graph, a new subset of the edge set consisting of actional links is introduced. These A-links are activated by actions and may exist between arbitrary pairs of joints. A trainable A-link inference module (AIM), consisting of an encoder and a decoder, automatically infers the A-links from actions.
The actional graph convolution (AGC), which uses C types of A-links to capture inter-joint motion dependencies, is then defined. To diffuse information over a long range, the network uses high-order polynomials of the adjacency matrices A_j from ST-GCN, giving the S-links. With an L-order polynomial, the structural graph convolution (SGC) can directly reach the L-hop neighbors to increase the receptive field.
\mathrm{AGC}: \quad f_{act} = \mathrm{AGC}(f_{in}) = \sum_{c=1}^{C} \tilde{A}_{act}^{(c)} f_{in} W_{act}^{(c)}

\mathrm{SGC}: \quad f_{struc} = \mathrm{SGC}(f_{in}) = \sum_{l=1}^{L} \sum_{p \in P} M_{struc}^{(p,l)} \otimes \big(\tilde{A}^{(p)}\big)^{l} f_{in} W_{struc}^{(p,l)}

where the graph transition matrix \tilde{A}^{(p)} provides a good initialization for the edge weights, which stabilizes the learning of M_{struc}^{(p,l)}. To integrally capture both actional and structural information among arbitrary joints, the AGC and SGC are combined into the actional-structural graph convolution (ASGC). Mathematically, the ASGC operation is formulated as

f_{out} = \mathrm{ASGC}(f_{in}) = f_{struc} + \lambda f_{act} \in \mathbb{R}^{n \times d_{out}}

where \lambda is a trade-off weight between the structural and actional features.
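Reduced to matrix products, the ASGC combination looks like the following sketch, where both graph convolutions are simplified to a single hop and the value of lambda is arbitrary:

import torch

def asgc(f_in, A_struc, W_struc, A_act, W_act, lam=0.5):
    f_struc = A_struc @ f_in @ W_struc   # structural graph convolution (1-hop sketch)
    f_act = A_act @ f_in @ W_act         # actional graph convolution on inferred A-links
    return f_struc + lam * f_act         # f_out = f_struc + lambda * f_act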
The AS-GCN block uses the same method as ST-GCN to capture inter-frame action features: one layer of temporal convolution (TCN) is applied along the time axis to extract the temporal feature of each joint independently. The architecture of AS-GCN is shown below.
Figure 22. The pipeline of the AS-GCN
Both algorithms are significant in the field of action recognition. ST-GCN is a pioneer of human action recognition algorithms based on graph convolutional networks and provides a good foundation for subsequent algorithms. AS-GCN, which extracts more discriminative features through body structure and action, also achieves a large improvement over the previous state-of-the-art methods.
Materials & Methods
Backgrounds
Graph convolutional networks on the vertex domain have reached unprecedented accuracy on the NTU-RGB+D dataset[18] and can comprehensively extract spatial features of the body structure. However, compared with Long Short-Term Memory (LSTM) networks, which have a strong ability to capture temporal features, the TCN used to extract temporal features may not perform as well.


In the above methods, once the mining is done, the degrees of importance of joints/features are fixed and do not change across temporal frames and sequences. As an action proceeds, the associated, possibly arbitrary, pairs of joints may also change accordingly. For example, the joints "hand", "elbow", and "head" are discriminative for the action "drinking", while the joints of the legs can be considered noise. For the actions "running" or "skiing", action-synergistic links indicate that arms and feet are not connected but are correlated.
For action recognition, not all frames in the sequence have the same importance. Some frames capture less meaningful information, or even carry misleading information associated with other types of actions, while other frames carry the key information[19]. A number of methods have been proposed that use key frames as representations for recognizing the action class.
One utilizes a frame distillation network (FDNet) to distil a fixed number of key frames from the input sequences with a deep progressive reinforcement learning method before feeding the output into a GCN to recognize the action label[20]. Another employs AdaBoost to select the most discriminative key frames for action recognition[19]. Using key frames helps eliminate noise frames that contain unwanted and distracting information; however, compared with holistic approaches that use all the frames, it loses some information.
Objectives
Objective 1
We aim to propose a spatial attention module, an attention-based graph convolution that determines the degrees of importance of joints or limbs by adding a learnable weight matrix to the co-occurrence links based on their contents.
Objective 2
Based on the idea of TCN, we extend the range of the existing temporal graph beyond time edges between consecutive frames only: the joint in the current frame is connected to the same joint in the next few frames, since temporal features should be determined by the current frame together with the previous frames. We then introduce a temporal attention model that determines the degree of importance of each time edge within the frames. Instead of skipping frames, it allocates different attention weights to different time edges to automatically exploit their respective discriminative power and focus more on the important frames.
Objective 3
To fully extract features of the intra-body structure within a single frame, we use a multi-filter GCN that enlarges the neighbor range beyond 1-neighbors and thereby widens the receptive field over the considered graph.
Objective 4
We further propose a multi-task feedback model that forms discriminative action-correlated links on the original skeleton graph by propagating the attention probability parameters and iteratively updating them between the two corresponding blocks of each task. Multi-filter convolution is then applied to the graph containing the newly generated action edges to improve the extraction of motion and structural features.
Graph Attention Network
In this section, we utilize an attention-based architecture, the graph attention network (GAT)[21], to perform node classification of skeleton data. Attention mechanisms have recently become almost an essential standard in many sequence-based tasks[22, 23]. It is effective for the action recognition model to focus on the most relevant parts of the input human pose skeleton when making decisions. The graph attentional layer allows (implicitly) assigning different importances to different nodes within a neighborhood while dealing with neighborhoods of different sizes.


Figure 23. The attention algorithm
The input to the graph attention layer is a set of node features of the skeleton data, h = \{h_1, h_2, \ldots, h_N\}, h_i \in \mathbb{R}^F, where N is the number of body joints and F is the number of features per joint. The layer produces a new set of node features of potentially different cardinality F', h' = \{h'_1, h'_2, \ldots, h'_N\}, h'_i \in \mathbb{R}^{F'}, as its output.
To obtain sufficient expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To that end, as an initial step, a shared linear transformation, parametrized by a weight matrix W \in \mathbb{R}^{F' \times F}, is applied to every node. Self-attention is then performed on the nodes to compute attention coefficients that indicate the importance of node j's features to node i; the relationship between two nodes is thus determined by the features of both:

e_{ij} = a(W h_i, W h_j)
We only compute e_{ij} for nodes j \in N_i, where N_i is the neighborhood of node i in the graph. To make the coefficients easy to compare across different nodes, the softmax function normalizes them over all choices of j:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}


In summary, the features of two body joints are first transformed into new expressive features by the linear transformation, and the attention coefficient, the parameter used in the weighted summation at each convolution, is then computed, so there is an attention coefficient between any two joints.
The attention mechanism a is a single-layer feedforward neural network parametrized by the weight vector \mathbf{a}, with a LeakyReLU nonlinearity applied, so the full computation is:

\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}[W h_i \,\|\, W h_k]\right)\right)}
As opposed to GCNs, this model allows assigning different importance to nodes of the same neighborhood, enabling a leap in model capacity.
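A compact sketch of this attention layer over a dense adjacency mask (assuming self-loops are present so every row has at least one neighbor):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    def __init__(self, f_in, f_out):
        super().__init__()
        self.W = nn.Linear(f_in, f_out, bias=False)
        self.a = nn.Linear(2 * f_out, 1, bias=False)   # attention vector a
    def forward(self, h, adj):                         # h: (N, F), adj: (N, N) in {0, 1}
        wh = self.W(h)                                 # (N, F')
        N = wh.size(0)
        pairs = torch.cat([wh.repeat_interleave(N, 0), wh.repeat(N, 1)], dim=1)
        e = F.leaky_relu(self.a(pairs)).view(N, N)     # e_ij for every ordered pair
        e = e.masked_fill(adj == 0, float("-inf"))     # keep only j in N_i
        alpha = torch.softmax(e, dim=1)                # normalized coefficients alpha_ij
        return alpha @ wh                              # attention-weighted node features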
In the action task A, the action graph, which has no naturally connected body limbs and consists only of dotted action links, is used as the input. We adapt the GAT model to calculate the attention coefficients between pairs of joints and then form the potential action links, which are used to construct the new actional-structural graph.
Multi-Filters GCN
We apply a new definition of graph convolutional filter, presented in [24], in this section. It generalizes the most commonly adopted filter by adding a hyper-parameter that controls the distance of the considered neighborhood. The idea builds on shortest paths: the adjacency matrix A of a graph can be seen as the matrix of shortest paths of length 1, and the identity matrix I as the matrix of shortest paths of length 0:

SP^{j}_{u,v} = \begin{cases} 1 & \text{if } sp(u,v) = j \\ 0 & \text{otherwise} \end{cases}

The first layers of the network are again stacked graph convolutional layers, defined as

H^{l+1} = \sigma\left(D^{-1}\left(SP^{0} + SP^{1}\right) H^{l} W^{l}\right)

Here, A = SP^0 + SP^1 is the adjacency matrix of the undirected graph G with added self-connections, W^{l} is a layer-specific trainable weight matrix, and \sigma(\cdot) denotes an activation function such as \mathrm{ReLU}(\cdot) = \max(0, \cdot). H^{l} \in \mathbb{R}^{N \times D} is the matrix of activations in the l-th layer, with H^{0} = X.
We choose to keep the contributions of nodes at different shortest-path distances separate, which is equivalent to defining multiple graph convolution filters, one per shortest-path distance. The parametric graph convolution is then

H^{r,\,l+1} = \Big\Vert_{j=0}^{r}\; \sigma\left((D^{j})^{-1} SP^{j} H^{l} W^{j,l}\right)

where \Vert is the vertical concatenation of vectors and j is the shortest-path distance. The parameter r controls the maximum distance of the considered neighborhood, as well as the dimensionality of the output.
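A NumPy/SciPy sketch of this parametric convolution; the helper names are ours:

import numpy as np
from scipy.sparse.csgraph import shortest_path

def parametric_graph_conv(A, H, Ws, r):
    """A: (n, n) adjacency, H: (n, d) features, Ws: list of r+1 (d, d') weights."""
    D = shortest_path(A, unweighted=True)          # pairwise shortest-path distances
    outs = []
    for j in range(r + 1):
        SPj = (D == j).astype(float)               # SP^j: 1 iff shortest path has length j
        deg = SPj.sum(1, keepdims=True).clip(1)    # (D^j)^{-1} row normalization
        outs.append(np.maximum(SPj @ H @ Ws[j] / deg, 0))  # ReLU activation
    return np.concatenate(outs, axis=1)            # concatenation over all j <= r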
Figure 24. Multi-filters graph convolution
By definition, the receptive field of a graph convolutional filter parametrized by r and applied to vertex v includes all nodes at shortest-path distance at most r from v. When we stack multiple layers of the parametric convolution, the receptive field grows in the same way: the size-r parametric convolution filter applied at layer l to vertex v covers all vertices within shortest-path distance l times r of v.
Action recognition cannot be determined only by a small range of joint/limb features. Traditional methods diffuse information in a local range, which may ignore discriminative motion patterns covering long-distance joints. We can enlarge the receptive field by adopting the multi-filter GCN to propagate useful features over a wider range of the body structure. We then concatenate the outputs of the convolutions for all j up to r to obtain higher-order spatial features for action recognition. The richer the captured action-specific latent dependencies, the better the performance.
Temporal Convolution Network
GCNs help us learn the local features of adjacent joints in space. On this basis, we need to learn the local features of joint changes over time. How to superimpose temporal features on a graph is one of the open problems for graph networks, with two main approaches: temporal convolution (TCN) and sequence models (LSTM).
Our AST-GCN model uses one layer of TCN along the time axis, which extracts the temporal feature of each joint independently but shares the weights across joints, to capture the inter-frame action features.
Figure 25. Temporal convolution in ST-GCN
ST-GCN proposes a very simple strategy to extend the spatial graph CNN to the spatial-temporal domain: the concept of neighborhood is extended to include temporally connected joints as

B(v_{ti}) = \{ v_{qj} \mid d(v_{tj}, v_{ti}) \le K,\; |q - t| \le \lfloor \Gamma/2 \rfloor \}

where the temporal kernel size \Gamma controls the temporal range included in the neighbor graph.
When constructing the temporal graph, we add time links between the joints in the current frame and the same joints in the next few frames. This is because, when recognizing an action, the temporal feature of the joints and limbs should contain not only the information of the current frame but also that of the previous frames.
Figure 26. Structural-Temporal graph
The dashed lines in the figure above represent the original time edges, and the solid lines are the additional time links between the same joints across several frames.
To determine the degree of importance of each time link within the frames, we adapt an attention mechanism that calculates the attention coefficients \alpha_{i(i+k)} between pairs of the same joint in the temporal graph. The model can then assign different weights to the time edges, capturing more relevant information and suppressing noisy or misleading features:

\alpha_{i(i+k)} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{T}[W h_i \,\|\, W h_{i+k}]\right)\right)}{\sum_{k \in T} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{T}[W h_i \,\|\, W h_{i+k}]\right)\right)}

where T is the time range. Through this operation, the regularized attention coefficients between the same joint in different frames are obtained, and they are used to predict the output feature h'_i of each joint as

h'_i = \sigma\Big(\sum_{k \in T} \alpha_{i(i+k)}\, W h_{i+k}\Big)
We use the nonlinear function \sigma(\cdot) = ReLU due to its good convergence properties. The temporal attention graph convolution network can then be trained to implicitly learn meaningful temporal features.
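A sketch of this temporal attention for a single joint track, where K plays the role of the time range T above; the loop form is for clarity, not efficiency:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, f_in, f_out, K=3):             # K: temporal link range
        super().__init__()
        self.K = K
        self.W = nn.Linear(f_in, f_out, bias=False)
        self.w = nn.Linear(2 * f_out, 1, bias=False)  # attention vector w
    def forward(self, h):                             # h: (T, F), one joint over T frames
        wh = self.W(h)                                # (T, F')
        T = wh.size(0)
        out = torch.zeros_like(wh)
        for t in range(T):
            ks = torch.arange(t, min(t + self.K + 1, T))   # same joint, next K frames
            e = F.leaky_relu(self.w(torch.cat(
                [wh[t].expand(len(ks), -1), wh[ks]], dim=1))).squeeze(-1)
            alpha = torch.softmax(e, dim=0)           # normalized time-link weights
            out[t] = F.relu(alpha @ wh[ks])           # weighted sum over time links
        return out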
Two Stream Feedback Attention Based GCN
We further propose the Two-Stream Feedback Attention-Based GCN (2S-ATGCN) model, which stacks attention-based structural graph convolution and temporal convolution as a basic building block to learn both spatial and temporal features for action recognition.
After multiple spatial and temporal feature aggregations, 2S-ATGCN extracts high-level semantic information across time. Links correlated with the discriminative action are formed on the original skeleton graph by propagating the attention probability parameters and iteratively updating them between the two corresponding blocks of each task. To classify actions, we apply global average pooling over the joint and temporal dimensions of the feature maps output by the backbone Task S network to obtain a feature vector, which is finally fed into a softmax classifier to obtain the predicted class label \hat{y}. The standard cross-entropy loss is used as the loss function for action recognition:

L_{recog} = -\mathbf{y}^{T} \log(\hat{\mathbf{y}})
The architecture shown below is the pipeline of our 2S-ATGCN model.
Figure 27. The pipeline of 2S-ATGCN model
Structural Task (S)
The input to the structural task (S) is a sequence of body joints in the form of 2D or 3D coordinates. We construct the spatial-temporal graph on the skeleton sequences in two steps: first, the joints within one frame are connected with edges according to the connectivity of the human body structure; then each joint is connected to the same joint in the next few frames. Task S consists of a series of Attention-based Structural-Temporal GCN (AST-GCN) blocks. Each block contains a Selective Multi-Filter Attention-based Structural GCN (AS-GCN) followed by an attention-based temporal convolution, which alternatingly extract effective spatial and temporal features. The result is then classified into the corresponding action category by a standard softmax classifier.
Selective Multi-Filters AS-GCN (SF-ASGC)
Figure 28. The architecture of selective filter network
In standard graph convolutional networks (GCNs), the receptive fields of the artificial neurons in each layer share the same neighbor range. We aim to obtain rich information at different receptive fields by applying convolution filters of different sizes, which determines the filter sizes of the multi-filter GCN. We want the network to adaptively adjust its receptive field size according to the multiple scales of the input information.
We integrate the idea of the Selective Kernel unit[25], in which multiple branches with different kernel sizes are fused using a softmax attention guided by the information in those branches. Different attentions over the branches yield different effective receptive field sizes for the neurons in the fusion layer. Instead of concatenating the outputs of the convolutions for all r-neighbors, we integrate SKNet, which can adjust the neurons' receptive field size, to propose our Selective Multi-Filter structural graph attention convolution.
This selective filter network consists of a triplet of operators: Split, Fuse, and Select. The Split operator generates multiple paths with various filter sizes corresponding to different neuron RF sizes. The Fuse operator combines and aggregates the information from the multiple paths to obtain a global and comprehensive representation for the selection weights. The Select operator aggregates the feature maps of the differently sized filters according to the selection weights.
Split: For the feature vector X on a node, consisting of the coordinate vectors and the estimation confidence of the i-th joint on frame t, we by default conduct three transformations with filter sizes whose node-neighbor ranges are 1, 2, and 3, producing outputs U', U'', and U'''. Each transformation is composed of an efficient structural graph convolution (SGC), Batch Normalization, and a ReLU function in sequence.
Fuse: Our goal is to enable neurons to adaptively adjust their RF sizes based on the stimulus content. The basic idea is to use gates to control the flow of information from the multiple branches, which carry information at different scales, into the neurons of the next layer. The gate integrates information from all branches. We first fuse the results from the multiple branches by element-wise summation:

U = U' + U'' + U'''

Then channel-wise statistics s \in \mathbb{R}^{C} are generated by global average pooling to embed global information. Specifically, the c-th element of s is calculated by shrinking U_c over the spatial-temporal dimensions V \times T, where V is the number of joints and T is the number of frames:

s_c = F_{gp}(U_c) = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{t=1}^{T} U_c(i, t)
In addition, a compact feature z \in \mathbb{R}^{d \times 1} is created to guide the precise and adaptive selection; it is obtained through a simple fully connected (fc) layer that reduces the number of dimensions for efficiency:


z = F_{fc}(s) = \delta(B(W s))

where \delta is the ReLU function[26], B denotes Batch Normalization[27], and W \in \mathbb{R}^{d \times C}.
Select: A soft attention across channels, guided by the compact feature descriptor z, is used to adaptively select different spatial scales of information. Specifically, a softmax operator is applied to the channel-wise digits:

a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}}, \quad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}}, \quad c_c = \frac{e^{C_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}}

where A, B, C \in \mathbb{R}^{C \times d} and a, b, c denote the soft attention vectors for U', U'', and U''', respectively. In the three-branch case the matrix B is redundant because a_c + b_c + c_c = 1. The final feature map Y is obtained through the attention weights on the various kernels:

Y_c = a_c \cdot U'_c + b_c \cdot U''_c + c_c \cdot U'''_c, \qquad a_c + b_c + c_c = 1

where Y = [Y_1, Y_2, \ldots, Y_C], Y_c \in \mathbb{R}^{T \times V}. Note that we give the formula for the three-branch case; situations with more branches can easily be deduced by extending the equations above.
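A sketch of this Split-Fuse-Select gating for the three-branch case on skeleton feature maps of shape (batch, C, T, V); the bottleneck dimension d is a free parameter:

import torch
import torch.nn as nn

class SelectiveFuse(nn.Module):
    def __init__(self, C, d):
        super().__init__()
        self.z = nn.Sequential(nn.Linear(C, d), nn.BatchNorm1d(d), nn.ReLU())
        self.heads = nn.Linear(d, 3 * C)              # the A, B, C matrices, stacked
        self.C = C
    def forward(self, U1, U2, U3):                    # branch outputs U', U'', U'''
        U = U1 + U2 + U3                              # Fuse: element-wise summation
        s = U.mean(dim=(2, 3))                        # global average pool -> (batch, C)
        z = self.z(s)                                 # compact descriptor z
        abc = self.heads(z).view(-1, 3, self.C)
        w = torch.softmax(abc, dim=1)                 # per-channel softmax, a_c+b_c+c_c = 1
        a, b, c = w[:, 0], w[:, 1], w[:, 2]
        expand = lambda x: x[:, :, None, None]        # broadcast over (T, V)
        return expand(a) * U1 + expand(b) * U2 + expand(c) * U3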
Clearly, the importance and correlation of the human body structural links differ across motion sequences. Taking the action "handshaking" as an example: the two persons approach each other and stretch out their hands, so the relations of the arms appear more important than those of the legs. We therefore introduce the Selective Multi-Filter Graph Attention Convolution, which uses the attention mechanism to calculate attention coefficients on the structural links instead of using traditional GCNs, and can thus assign different degrees of relation to the connected joint pairs of the body.
AST-GCN Block
To integrally capture the structural features among skeleton joints, we develop the selective multi-filter ASGC. Different time links have different degrees of importance and robustness to variations. To capture the inter-frame action features, we use attention-based temporal convolution (AT-TCN) along the time axis, which extracts the temporal feature of each keypoint independently and enables the network to pay different levels of attention to different joints and to assign different degrees of importance to the various time links as an action proceeds. The AST-GCN block, which consists of SF-ASGC, AT-TCN, and other operations (batch normalization (BN), ReLU, and residual connections), is shown below.
Figure 29. An AST-GCN block
Since SF-ASGC and AT-TCN learn spatial and temporal features, respectively, we concatenate both layers into an AST-GCN block to extract spatial-temporal features from various actions. Note that SF-ASGC is a single operation extracting only spatial information from the structural graph, whereas the AST-GCN block, comprising a series of operations, extracts both spatial and temporal information from the spatial-temporal graph.
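Continuing the previous sketch (and its imports), a single AST-GCN block can be outlined as below. The attention-based temporal convolution is reduced here to a plain temporal convolution with the 9-frame kernel used in our experiments; the residual path and the BN/ReLU/dropout placement are assumptions about the exact ordering, not the thesis implementation.

```python
class ASTGCNBlock(nn.Module):
    """One AST-GCN block: SF-ASGC (spatial) followed by a temporal
    convolution, with BN, ReLU and a residual connection."""

    def __init__(self, c_in, c_out, stride=1, t_kernel=9):
        super().__init__()
        self.spatial = SelectiveFilterGraphConv(c_in, c_out)   # sketch above
        pad = (t_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, (t_kernel, 1), (stride, 1), (pad, 0)),
            nn.BatchNorm2d(c_out), nn.Dropout(0.5),            # dropout 0.5 per model settings
        )
        # residual: identity when shapes match, otherwise a strided 1x1 conv
        self.residual = (nn.Identity() if c_in == c_out and stride == 1
                         else nn.Conv2d(c_in, c_out, 1, (stride, 1)))

    def forward(self, x, adj_powers):
        y = self.temporal(self.spatial(x, adj_powers))
        return F.relu(y + self.residual(x))
```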
Action Task (A)
In this section, we introduce the action task (A), in which the actional links are formed and combined with the generated skeleton graph for further feature extraction.
Figure 30. Establishment process of the actional links

Many human activities require far-apart joints to move collaboratively, resulting in non-physical dependencies among joints. To capture the corresponding dependencies of various actions, we introduce actional links, which are activated by actions and may exist between arbitrary pairs of joints.
The input to the action task (A) is also the skeleton sequence represented by the 2D or 3D coordinates of each human joint in each frame. However, the action graph constructed in task A differs from the one in the structural task S. We connect each joint with all other joints except its naturally connected neighbors to represent the potential actional edges, shown as dashed lines in the action graph. Note that this action graph does not include any edges of the human body's structural connectivity.

AAT-GCN Block
To automatically infer the actional links from actions, we develop the AAT-GCN block, which consists of a graph attention convolution followed by an attention-based temporal convolution. First, we calculate the attention coefficients between pairs of joints:

$$e_{ij} = a(Wh_i, Wh_j)$$
which indicate the importance of joint j's features to joint i. In its most general formulation, the model allows every joint to attend to every other joint, dropping all structural information. We perform masked attention to inject the graph structure into the mechanism: we only compute $e_{ij}$ for joints $j \in N_i$, where $N_i$ is some neighborhood of joint i. In all our experiments, these are exactly all joints of the generated actional graph other than i.
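As an illustration of how the candidate actional links can be encoded, a minimal NumPy sketch is given below: every joint is a candidate neighbor of every other joint except itself and its natural bone neighbors. The bone list and joint count are placeholders.

```python
import numpy as np

def actional_candidate_mask(bone_pairs, v=25):
    # 1 marks a candidate actional link; 0 marks self-loops and physical bones
    mask = np.ones((v, v), dtype=np.float32) - np.eye(v, dtype=np.float32)
    for i, j in bone_pairs:
        mask[i, j] = mask[j, i] = 0.0
    return mask

# e.g. a fragment of a 25-joint skeleton's bone list (hypothetical indices)
example_mask = actional_candidate_mask([(0, 1), (1, 20), (20, 2)], v=25)
```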
We apply a LeakyReLU activation to the attention mechanism $a(Wh_i, Wh_j)$, which is parametrized by a weight vector a, and obtain the attention coefficients (link probabilities) over the sequence of a given action:

$$\alpha_{ij} = \frac{\exp(\text{LeakyReLU}(e_{ij}))}{\sum_{k \in N_i} \exp(\text{LeakyReLU}(e_{ik}))}$$
These coefficients assign different degrees of importance to the probable actional links, from which we infer the actional links shown as solid lines of different thickness. The thicker the line, the greater the actional correlation between the two joints; the thinner the line, the less synergy the two joints exhibit in this series of actions. The formed actional links are combined into the structural graph of task S for a more comprehensive extraction of features. Meanwhile, the actional links are iteratively updated based on the output of the structural task (S) blocks, gradually forming a more complete action graph of the input skeleton sequence.
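A minimal sketch of this masked attention is shown below, assuming single-head attention and the concatenation form $a(Wh_i, Wh_j) = a^T[Wh_i \| Wh_j]$ used in GAT [21]; it pairs with the candidate mask sketched earlier. Module and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionalLinkAttention(nn.Module):
    """h: (N, V, C) joint features; mask: (V, V) with 1 on candidate actional
    links and 0 elsewhere (each row must keep at least one candidate)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.W = nn.Linear(c_in, c_out, bias=False)
        self.a = nn.Linear(2 * c_out, 1, bias=False)   # attention vector a

    def forward(self, h, mask):
        wh = self.W(h)                                 # (N, V, C')
        n, v, c = wh.shape
        # pairwise concatenation [W h_i || W h_j] -> raw scores e_ij
        pair = torch.cat([wh.unsqueeze(2).expand(n, v, v, c),
                          wh.unsqueeze(1).expand(n, v, v, c)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(torch.as_tensor(mask).unsqueeze(0) == 0, float('-inf'))
        alpha = torch.softmax(e, dim=-1)               # actional link probabilities
        return alpha, alpha @ wh                       # coefficients + aggregated features
```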
For the temporal domain, the temporal graphs in both tasks are the same, constructed by connecting each joint with the same joint in the next t frames. Whereas the AT-TCN in the AST-GCN block of the structural task S extracts the temporal features of the human body structure over the action sequence, the AT-TCN in the AAT-GCN block of the action task A extracts the temporal features of potential action co-occurrence and controls the amount of information carried by each time edge across frames, thereby optimizing the attention coefficients.
Feedback Model
Here we introduce a feedback model which propagates the attention probability parameters between the two corresponding blocks of each task for iterative updating. The actional links generated by task A are combined into the skeleton graph of the structural task S, and the renewed actional-structural graph is used as input to the next AST-GCN block. The output features extracted by SF-ASGC and AT-TCN are then employed in the feedback module to adjust the attention coefficients (link probabilities) of the actional links, and are fed back to task A to adjust the formed action edges or to generate more correlated co-occurrence links.
First, we set the attention parameters (link probabilities) $\alpha_{ij}$ of the actional links in the original structural graph of task S to 0, meaning that there are no co-occurrence links in the graph. The parameters are then computed in the AAT-GCN block of task A, and the attention vector $[\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iV}]$, where V is the number of joints, is transformed by the feedback module to update the values in task S.
By assigning different degrees of importance to these co-occurrence links based on the attention vector, the actional links are further formed in the structural graph for the convolution operation in the AST-GCN block. The updating process from the AAT-GCN block to the AST-GCN block is shown below.
Figure 31. Feedback and updates of the actional links

For the backpropagation, the attention vector is re-computed through the AST-GCN block and output as $\alpha'_{ij}$. Thereafter, the feedback module updates the actional graph in task A by adjusting the attention vector $[\alpha'_{i1}, \alpha'_{i2}, \ldots, \alpha'_{iV}]$. Note that SF-ASGC initially attends only to the structural links in the original skeleton graph; as action edges are generated, the task-S block simultaneously calculates the attention parameters of both the action edges and the structure edges. The attention coefficients are fed back to task A to adjust the actional graph for the next convolution operation in the AAT-GCN block, and the refreshed parameters, further transformed, continue to influence the actional links in the spatial graph. As the network progresses, based on the feedback from task S, the actional links generated by task A become more and more accurate.
In general, the SF-ASGC network extracts only the features of the human body structure at the beginning, and gradually comes to extract the features of both the structural edges and the actional edges. The attention vector, which expresses the degree of potential relation between arbitrary joints across different action classes, is iteratively updated between the two task models, finally forming a reliable and complete actional-structural graph. The rich spatial and temporal features are then fully extracted by SF-ASGC and AT-TCN in the main network for the action classifier to recognize the action class.
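The iterative exchange between the two tasks can be summarized by the schematic forward pass below. Every name here (the paired block lists, the feedback module, the assumption that each task-S block also returns a refined attention matrix) is a placeholder for exposition, not the exact thesis implementation.

```python
import torch

def two_stream_forward(x, ast_blocks, aat_blocks, feedback_mlp, skeleton_adj):
    """Schematic only. ast_blocks / aat_blocks are the paired task-S and
    task-A blocks; feedback_mlp transforms the attention matrix between
    tasks. x: (N, C, T, V); skeleton_adj: (V, V) physical-bone adjacency."""
    v = skeleton_adj.shape[0]
    alpha = torch.zeros(v, v)                       # actional links start at probability 0
    feat_s, feat_a = x, x
    for ast_block, aat_block in zip(ast_blocks, aat_blocks):
        alpha, feat_a = aat_block(feat_a, alpha)    # task A infers link probabilities
        graph = skeleton_adj + feedback_mlp(alpha)  # merge actional links into task S
        feat_s, alpha = ast_block(feat_s, graph)    # task S refines features and alpha
        alpha = feedback_mlp(alpha)                 # feed refreshed links back to task A
    return feat_s
```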
CHAPTER IV
EXPERIMENTS & RESULTS
In this section, we evaluate the performance of the 2S-ATGCN model on skeleton-based action recognition. We experiment on two large-scale action recognition datasets with vastly different properties: the Kinetics human action dataset (Kinetics) [3] and NTU-RGB+D [18].
Datasets
NTU-RGB+D: NTU-RGB+D, currently one of the largest and most widely used indoor-captured action recognition datasets, contains 56,880 skeleton action sequences performed by one or two subjects and categorized into 60 classes. It provides the 3D spatial coordinates of 25 joints for each human in an action. Two evaluation protocols are recommended: Cross-Subject and Cross-View. In Cross-Subject, 40,320 samples performed by 20 subjects form the training set, and the rest belong to the test set. Cross-View assigns data according to camera views, where the training and test sets have 37,920 and 18,960 samples, respectively. Training clips in this setting come from camera views 2 and 3, and the evaluation clips are all from camera view 1.
Kinetics-Skeleton: The DeepMind Kinetics human action dataset [3] contains around 300,000 video clips retrieved from YouTube. The videos cover as many as 400 human action classes, ranging from daily activities and sports scenes to complex actions with interactions. Each clip in Kinetics lasts around 10 seconds. It provides only raw video clips without skeleton data.
Since this work focuses on our skeleton-based action recognition model, we use the fusion pose estimation model to output 2D coordinates (X, Y) in the pixel coordinate system and confidence scores C from the videos, resized to a resolution of 340 x 256, estimating the locations of 18 joints on every frame of the clips. For multi-person cases, we select the body with the highest average joint confidence in each clip. We thus represent each joint as a tuple (X, Y, C), and a skeleton frame is recorded as an array of 18 tuples; one clip with T frames is thereby transformed into a skeleton sequence of these tuples. In practice, we represent the clips as tensors of dimensions (3, T, 18). For simplicity, we pad each sequence by repeating the data from the start until T = 300 in total. Following the evaluation method of ST-GCN, we train the models on the training set and report the top-1 and top-5 accuracies on the validation set.
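A minimal NumPy sketch of this packing and padding step is given below; the per-frame input format is an assumption about how the fusion pose estimator's output is stored, not a fixed interface.

```python
import numpy as np

def clip_to_tensor(frames, t_out=300, v=18):
    """Pack per-frame joint tuples into a (3, T, V) array. `frames` is a list
    of (V, 3) arrays holding (X, Y, C) per joint -- an assumed storage format
    for the fusion pose estimator's output."""
    seq = np.stack(frames)                        # (T_in, V, 3)
    reps = -(-t_out // seq.shape[0])              # ceiling division
    seq = np.tile(seq, (reps, 1, 1))[:t_out]      # repeat from the start to T = 300
    return seq.transpose(2, 0, 1)                 # (3, T, V), channels = (X, Y, C)
```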
Model Setting
We construct the backbone of the structural task S with 9 AST-GCN blocks, where the feature dimensions are 64, 128 and 256 for each group of three blocks. These layers use a temporal kernel size of 9. We randomly discard features with a probability of 0.5 after each AST-GCN unit to avoid overfitting. The stride of the 4th and 7th temporal convolution layers is set to 2 to serve as pooling layers. Thereafter, global pooling is performed on the resulting tensors to obtain a 256-dimensional feature vector for each sequence, which is finally fed to a SoftMax classifier.
The backbone of the action task A consists of 8 AAT-GCN blocks, and the convolution filter is performed only on the 1-neighborhood of each joint. Task A is trained with stochastic gradient descent at a learning rate of 0.01, reduced by a factor of 0.1 every 10 epochs. In the SF-ASGC, we set the maximum neighbor range r to 3, which means the convolution filters are performed on each vertex and its neighbors within ranges 1 to 3. Note that when r = 1, the corresponding structural links are exactly the physical skeleton itself. We use PyTorch 0.4.1 and train the model for 100 epochs on 2 GTX-TITAN GPUs with a batch size of 32. The whole action recognition network is trained with the SGD algorithm, whose learning rate is initially 0.1, decaying by a factor of 0.1 every 20 epochs.
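For reference, the optimizer and step-decay schedule for the whole network can be set up in PyTorch as below. The stand-in model and random batch are placeholders so the snippet runs on its own; momentum and weight decay are not specified in the text and are therefore omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Stand-in for the full 2S-ATGCN; any nn.Module with the right I/O shape works here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 300 * 18, 60))

optimizer = optim.SGD(model.parameters(), lr=0.1)  # initial learning rate 0.1
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # x0.1 every 20 epochs

for epoch in range(100):
    # one dummy batch of 32 clips, shape (3, T=300, V=18), 60 action classes
    clips = torch.randn(32, 3, 300, 18)
    labels = torch.randint(0, 60, (32,))
    optimizer.zero_grad()
    loss = F.cross_entropy(model(clips), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```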
Comparisons to the State-of-the-Art
We compare the final recognition model with state-of-the-art skeleton-based action recognition methods on both the NTU-RGB+D and Kinetics-Skeleton datasets. Table 3 and Table 4 show the performance comparisons for the NTU and Kinetics datasets, respectively.
On NTU-RGB+D, we train 2S-ATGCN on the two recommended benchmarks, Cross-Subject and Cross-View, and report the top-1 classification accuracies in the test phase. We compare against a handcrafted-feature-based method [4], RNN/CNN-based deep learning models [9, 18, 29, 30] and recent GCN-based methods [12, 13, 20, 31]. Specifically, ST-GCN combines GCN with temporal CNN to capture spatial-temporal features, and AS-GCN extends the skeleton graphs by introducing actional links and non-local structural links to capture useful long-range dependencies. SR-TSL [31] uses gated recurrent units (GRU) to propagate messages on graphs and LSTM to learn the temporal features. Table 3 shows the comparison. Our model achieves state-of-the-art performance with a large margin on both benchmarks, which verifies its superiority.
Methods Cross-Subject Cross-View
Lie Group (Vemulapalli et al. 2014) 50.1% 52.8%
H-RNN (Du, Wang, and Wang 2015) 59.1% 64.0%
Deep LSTM (Shahroudy et al. 2016) 60.7% 67.3%
PA-LSTM (Shahroudy et al. 2016) 62.9% 70.3%
ST-LSTM+TS (Liu et al. 2016) 69.2% 77.7%
Temporal Conv (Kim and Reiter 2017) 74.3% 83.1%
C-CNN + MTLN (Ke et al. 2017) 79.6% 84.8%
ST-GCN (S. Yan et al. 2018) 81.5% 88.3%
DPRL(Y. Tang et al. 2018) 83.5% 89.8%
SR-TSL (C. Si et al. 2018) 84.8% 92.4%
AS-GCN (M. Li et al. 2019) 86.8% 94.2%
2S-ATGCN (our model) 88.2% 96.3%
Table 3. Comparison of action recognition performance on NTU-RGB+D.
On the Kinetics dataset, we compare our model with six state-of-the-art methods: a handcrafted method called Feature Encoding [32]; Deep LSTM and the Temporal Convolution network [9, 18], implemented as two deep learning models; and the graph-based ST-GCN, AS-GCN and 2s-AGCN. Table 4 shows the top-1 and top-5 classification accuracies. The proposed 2S-ATGCN outperforms the other competitive methods.
Methods Top-1 Acc Top-5 Acc
Feature Enc. (Fernando et al. 2015) 14.9% 25.8%
Deep LSTM (Shahroudy et al. 2016) 16.4% 35.3%
Temporal Conv (Kim and Reiter 2017) 20.3% 40.0%
ST-GCN (S. Yan et al. 2018) 30.7% 52.8%
AS-GCN (M. Li et al. 2019) 34.8% 56.5%
2s-AGCN (L. Shi et al. 2019) 36.1% 58.7%
2S-ATGCN (our model) 37.5% 60.3%
Table 4. Comparison of action recognition performance on Kinetics.
CHAPTER V
CONCLUSIONS
We first propose the fusion pose estimation model, which outputs skeleton data from multispectral images or videos. We then develop the 2S-ATGCN model for skeleton-based action recognition. The action task A captures actional dependencies and contributes the formed actional links to the structural graph in the structural task S. We also expand the convolution filter sizes to represent higher-order relationships. The generalized actional-structural graphs are fed to the AST-GCN blocks for a better representation of actions. An additional attention algorithm assigns different degrees of importance to both spatial links and time links, enabling the network to focus on the most relevant parts of the body and the most relevant frames. We validate the fusion pose estimation model on our RGB & Thermal dataset, where it achieves a large improvement in dark environments. Moreover, on two challenging large-scale datasets, the proposed skeleton-based action recognition model outperforms the previous state-of-the-art skeleton-based models. The combination of the attention-based model and the multi-filters selective network further improves action recognition performance. With the branch network gradually combining new actional links into the main skeleton graph, the structural network obtains a better representation of actions.
At the same time, based on backpropagation and updating, the generation of action edges in task A becomes more and more complete and accurate. By introducing attention mechanisms into the temporal convolution, the network can automatically exploit the different levels of importance of different frames by allocating different attention weights to the time links. The flexibility of the 2S-ATGCN model also opens up many possible directions for future work. For example, how to incorporate contextual information, such as scenes, objects and interactions, into 2S-ATGCN becomes a natural question.
REFERENCES
1. Oh, H.-W., System and method for real-sense acquisition. 2012, Google Patents.
2. Kay, W., et al., The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
3. Cao, Z., et al. Realtime multi-person 2d pose estimation using part affinity fields, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
4. Vemulapalli, R., F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
5. Wang, J., et al. Mining actionlet ensemble for action recognition with depth cameras, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012. IEEE.
6. Koniusz, P., A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. in European Conference on Computer Vision. 2016. Springer.
7. Zhu, W., et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. in Thirtieth AAAI Conference on Artificial Intelligence. 2016.
8. Song, S., et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data, in Thirty-first AAAI conference on artificial intelligence. 2017.
9. Kim, T.S. and A. Reiter. Interpretable 3D human action analysis with temporal convolutional networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2017. IEEE.
10. Liu, M., H. Liu, and C. Chen, Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 2017. 68: p. 346-362.
11. Li, C., et al. Skeleton-based action recognition with convolutional neural networks, in 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). 2017. IEEE.
12. Yan, S., Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition, in Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
13. Li, M., et al. Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
14. Shi, L., et al. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
15. Toshev, A. and C. Szegedy. Deeppose: Human pose estimation via deep neural networks, in Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
16. Shuman, D.I., et al., The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053, 2012.
17. Niepert, M., M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs, in International conference on machine learning. 2016.
18. Shahroudy, A., et al. NTU RGB+D: A large scale dataset for 3D human activity analysis. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
19. Liu, L., et al., Learning discriminative key poses for action recognition. IEEE transactions on cybernetics, 2013. 43(6): p. 1860-1870.
20. Tang, Y., et al. Deep progressive reinforcement learning for skeleton-based action recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
21. Velickovic, P., et al., Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
22. Chorowski, J.K., et al. Attention-based models for speech recognition, in Advances in neural information processing systems. 2015.
23. Gehring, J., et al., A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.
24. Tran, D.V., N. Navarin, and A. Sperduti. On filter size in graph convolutional networks, in 2018 IEEE Symposium Series on Computational Intelligence (SSCI). 2018. IEEE.
25. Li, X., et al. Selective Kernel Networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
26. Nair, V. and G.E. Hinton. Rectified linear units improve restricted boltzmann machines, in Proceedings of the 27th international conference on machine learning (ICML-10). 2010.
27. Ioffe, S. and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
28. Veeriah, V., N. Zhuang, and G.-J. Qi. Differential recurrent neural networks for action recognition, in Proceedings of the IEEE international conference on computer vision. 2015.
29. Liu, J., et al. Spatio-temporal LSTM with trust gates for 3D human action recognition. in European Conference on Computer Vision. 2016. Springer.
30. Ke, Q., et al. A new representation of skeleton sequences for 3D action recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
31. Si, C., et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning, in Proceedings of the European Conference on Computer Vision (ECCV). 2018.
32. Fernando, B., et al. Modeling video evolution for action recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Full Text

PAGE 1

MULTISPECTRAL SKELETON BASED HUMAN ACTION RECOGNITION by BO LANG B .S., Electrical Engineering, University of Colorado Denver, 201 6 A thesis submitted to the Faculty of the Graduate School of the University of Colorado Denver in partial fulfillment of the requirements for the degree of Master of Science Electrical Engineering, 201 9

PAGE 2

This thesis for the Master of Science degree by B o Lang has been approved for the Department of Electrical Engineering by Chao Liu , Chair Tim Lei Vijay Harid © 201 9 BO LANG ALL RIGHTS RESERVE D

PAGE 3

I Bo, Lang (M.S., Electrical Engineering) Multispectral Skeleton based Human Action Recognition Thesis directed by Assistant Professor Chao Liu ABSTRACT Automatic recognition of human actions is an important and challenging problem in surveillance and intelligence transportation areas. Dynamics of human body skeletons convey significant information for human action recognition, which attracted much attention in computer vision. Conventional pose estimation approaches for getting skeleton data which are done on visible color imaging data would be affected by lighting condition. Whereas thermal camera is stable to human body detection regardless of the lighting condition. On the contrary, thermal data always lose the fine visual details of human objects, especially at long distance. In this paper, w e first proposed a multispectral pose estimation algorithm to generate the skeleton body key point s data from multispectral images or videos. Then, to capture richer dependencies besides the fixed skeleton graphs, we introduce a Multi task feedback action r ecognition model. In the structural task S (skeleton graph), we use multi selective filters convolution network with attention to extract features while attention based GCN is applied to t he actional task A (action graph composed of action links, directly from action) to learn actional graph features. Furthermore, to select discriminative temporal information, we increase time links among several frames and the attention mechanism is employed to enhance information of key time links. A feedback module is connected between pairs of blocks to iteratively update the action links information. Our overall model stacks attention based graph convolution and temporal convolution as a basic bu ilding block, to learn both spatial and temporal features from multispectral data for action recognition in different environment .

PAGE 4

II The form and content of this abstract are approved. I recommend its publication. Approved: Chao Liu

PAGE 5

III ACKNOWLEDGMENT I would like to express my deepest thanks to my advisor, Dr. Chao Liu for his valuable guidance and suggestions he has given me in this project. Without him, this work would have been impossible. Also, I thank the committee members, Dr. Tim Lei and Dr. Harid Vijay for spending their valuable time reviewing this report and attending my defense. Thank you to my friend Yan Pang for getting me up to speed on graph convolution networks and granting your advice during the thesis process.

PAGE 6

IV T able of C ontents CHAPTER I ................................ ................................ ................................ ................................ ................ 1 Role and Challenge of Action Recognition in Intelligence Area ................................ ........................ 1 Skeleton Based Human Action Recognition ................................ ................................ ...................... 1 Related Works ................................ ................................ ................................ ............................. 2 Pose Estimation Methods ................................ ................................ ................................ .................... 3 Related Works ................................ ................................ ................................ ............................. 4 Multispectral Skeleton based Human Action Recognition ................................ ................................ . 4 CHAPTER II ................................ ................................ ................................ ................................ ............... 6 Introduction of OpenPose ................................ ................................ ................................ ................... 6 Materials & Methods ................................ ................................ ................................ .......................... 8 RGB & Thermal human skeleton detection dataset ................................ ................................ .... 8 Data Fusion ................................ ................................ ................................ ................................ . 9 Model Fusion ................................ ................................ ................................ ............................ 11 Multispectral Fusion Pose Estimation Network ................................ ................................ ................ 14 Experiments & Results ................................ ................................ ................................ ..................... 14 CHAPTER ................................ ................................ ................................ ................................ ........... 16 Introduction of Graph Co nvolutional Network ................................ ................................ ................. 16 Overview ................................ ................................ ................................ ................................ ... 16 Related Works ................................ ................................ ................................ ........................... 17 Materials & Methods ................................ ................................ ................................ ........................ 21 Backgrounds ................................ ................................ ................................ ............................. 21 Objectives ................................ ................................ ................................ ................................ . 22 Graph Attention Network ................................ ................................ ................................ ......... 23 Multi Filters GCN ................................ ................................ ................................ ..................... 25

PAGE 7

V Temporal Convolution Network ................................ ................................ ............................... 27 Two Steam Feedback Attention Based GCN ................................ ................................ .................... 29 Structural Task (s) ................................ ................................ ................................ ..................... 29 Action Task (A) ................................ ................................ ................................ ........................ 33 Feedback Model ................................ ................................ ................................ ........................ 35 ................................ ................................ ................................ ................................ ........... 37 Datasets ................................ ................................ ................................ ................................ ..... 37 Model Setting ................................ ................................ ................................ ............................ 38 Comp arisons to the State of the Art ................................ ................................ ......................... 38 CHAPTER V ................................ ................................ ................................ ................................ ............ 41 REFERENCES ................................ ................................ ................................ ................................ ......... 42

PAGE 8

VI FIGURES Figure I .1 Skeleton sequence of N frames for action walking 1 I .2 Some pose estimation results 3 I .3 The pipeline of the proposed model 4 II. 1 Expression of body keypoints 6 II. 2 Architecture of OpenPose 7 II. 3 Comparison results of RGB and Thermal image 8 II. 4 Equipments for data collection 9 II. 5 Data fusion process 10 II. 6 Some results of data fusion 10 II. 7 Early Fusion 11 II. 8 Middle Fusion 11 II. 9 Late Fusion 12 II. 10 Examples of skeleton detection results on RGB & Thermal dataset 1 3 II. 11 Architecture of multispectral fusion pose estimation model 1 4 II. 12 Pose estimation results for the fusion images 1 5 III.I Euclidean Structure 1 6 III.2 Non Euclidean Structure 16 III.3 The proposed partitioning strategy for constructing convolution 1 8 III.4 ST GCN network 18 III.5 An example of the skeleton graph 20 III.6 The pipeline of the AS GCN 21 III.7 The attention algorithm 24 III.8 Multi filters graph convolution 26 III.9 Temporal convolution in ST GCN 27 III.10 Structural Temporal graph 28

PAGE 9

VII III.11 The pipeline of 2S ATGCN model 29 III.12 The architecture of selective filter network 30 III.13 An AST GCN block 33 III.14 Establishment process of the a ctional links 33 III.15 Feedback and updates of the actional links 36

PAGE 10

VIII TABLES Table II. 1 Comparison results of three different fusion models 13 II. 2 Comparison results of different fusion process 15 . 1 Comparison of action recognition performance on NTU RGB+D 39 . 2 Comparison of action recognition performance on Kinetics 40

PAGE 11

1 CHAPTER I INTRODUCTION Role and Challenge of Action Recognition in Intelligence Area Computer vision (CV) is an interdisciplinary scientific field that aims to build smart applications to understand the contents of images and videos as human understanding. Human behavior analysis and description is a research hotspot that has been widely concerned in in the field of machine learning and computer vision. And h uman acti on recognition (HAR) is a broad research area which focuses on identif ying specific movement s or behavior s of human using imaging sensor. It is popular for its wide applicability in using visual feature to retriev e specific action s in videos automatically . For example, healthcare monitoring sy s tem, and indoor/outdoor activity surveillance & emergency assistance system are developed using the concept of HAR . In recent years, a variety of action recognition algorithms have been proposed and achi eved good results. However, using RGB video data as an input to the system could cause several problems in environments such as occlusion or complex backgrounds and low illumination. Therefore, accurate action recognition in video is still a challenging ta sk. S keleton Based Human Action Recognition Figure 1. Skeleton sequence of N frames

PAGE 12

2 Skeleton based action recognition is widely used in recent applications due to its robustness to illumination and scene changes. The dynamic skeletal modality can be represented naturally by a time series of body key joint s with the form of 2D or 3D coordinates. Human motion can then be identified by analyzing its motion patterns. With the development of cost effective depth sensors (eg, RealSense [1] , Kinect [2] ) and pose estimation algorithms (eg, Openpose [3] ) , the application of human action recognition based on 2D or 3D body keypoints ha ve received extensive attention . A nd there have been many advanced methods that have been proposed in the past few years. R elated Works Current skeleton based action recognition algorithms can be divided into two approaches: handcrafted feature based and non handcrafted feature (learned feature) based. For the first approach, a wide variety of the state of the art algorithms have been proposed: Vemulapalli et al. [4] represented the human skeleton s as point s in Lie group, and implemented the temporal modelling and classification in Lie algebra ; Wang et al. [5] design ed algorithm to capture action patterns based on the local occupancy features ; And t wo kernel based tensor representations were proposed by Koniusz et al. [6] to capture compatibility between two action sequences and dynamic information for a single action . On the other hand, the non handcrafted feature based methods learn the human action features from data automatically . One type of the mos t popular techniques was to use recurrent neural network (RNN) based model to capture the temporal dependencies between consecutive frames . Zhu et al . [7] in troduced a regularized LSTM model for co occurrence feature learning. Song et al. [8] developed a spatio temporal attention model to allocate different weights to different frames and joints in the video. Besides, c onv olutional neural networks (CNN) also achieved remarkable results, such as residual temporal CNN [9] , information enhancement model [10] and new representation of body keypoints data with convolutional neural networks (CNN) [11] . Recently, the graph based approach has drawn a lot of attention due to its more flexible use of the body

PAGE 13

3 joint relations, S. Yan et al. [12] proposed a Spatial Temporal Graph Convolutional Networks (ST GCN) which automatically learn ed both the spatial and temporal patterns from data ; Motivated by ST GCN, M. Li et al. [13] further pro posed an actional structural graph convolution network (AS GCN) by combining actional links and structural links to generate the skeleton graph ; Besides, an adaptive graph convolutional network was proposed [14] to a daptively learn the topology of the graph for different GCN layers and skeleton samples in an end to end manner. In this paper, we ad a pt the graph based approach for action recognition part . Different from any previous method, we capture more useful non l ocal information about actions by introduce addition actional graph , and we also use attention model to pay different levels of attention to the joints or frames . P ose Estimation Methods Human pose estimation is one of the key challenges that have been studied for many years in computer vision area . It is based on the spatial information of the body keypoints obtained directly from images or videos to establish a mathematical model of the specific posture . Therefore, the accuracy of detecting keypoints from images or videos has a significant impact on the estimation of human posture. Human poses estimation which aims to obtain the skeleton data from input video or imag e is heavily used for the action recognition. However, m any inevitable factors such as small and hard to capture skeletal points , occlusion of disturbing objects as well as the lighting changes, make this process a challenging task . Figure 2 . Some pose estimation results

PAGE 14

4 R elated Works Most recent pose estimation approches commonly adapt ConvNets as their primary building block, largely replacing handcrafted features and graphical models; this strategy has made significant advances in standard benchmarks. Toshev et al. [15] first presen ted a cascade of such DNN regressors for increasing prec ision of joint localization . Z. Cao et al. [ 3 ] introduced Part Affinity Fields (PAFs) , which represent the association scores , to encode the location and orientation of limbs parts with individuals in the image. Although m any researches that focus ed on p ose e stimation have already performed well on common datasets such as MPII and COCO, the ir accurac ies are still affected by light ing condition . For example, in the case of dim light, camera can hard ly capture the visible human body. So, the accuracy of keypoints detection cannot be guaranteed. Whereas t hermal camera is stable to human body detection regardless of lighting condition . However, there are currently no RGB Thermal dataset for human pose estimation . Therefore, the first contribution of this wo rk is to create the first RGB Thermal human action datase t for training a fusion model of pose estimation which c ould get higher accuracy under different environment. Multispectral Skeleton based Human Action Recognition Figure 3 illustrates the pipeline of the proposed model. T here are two sub networks in our method: multispectral pose estimation via deep fusion network and multi task graph based convolutional network with attention mechanisms . The multi spectral pose estimation algorithm aims to generate the body Figure 3 . The pipeline of the proposed model

PAGE 15

5 keyjoints data from multispectral images or videos. Usually the data is a sequence of frames, each frame will have a set of joint coordinates. Then, we organize the outputs of the fusion network into a graphical structure based on the dependencies between human joints, and feed them into the two branches of GCNS to recognize the action label. There is a feedback module between the two tasks which propagates information and provides rewards during the training process . W e first introduce the pose estimation part as follows.

PAGE 16

6 CHAPTER II MULTISPECTRAL POSE ESTIMA TION VIA DEEP F USION NEURAL NETWORK Introduction of OpenPose A Human Pose Skeleton represents the orientation of a person in a graphical format. Essentially, it is a set of coordinates that can be connected to describe the pose of the person. Each co ordinate in the skeleton is known as a keypoint . A valid connectio n between two parts is known as a pair (or a limb). Note that, not all part combinations give rise to valid pairs. A sample human pose skeleton is shown below. Multi Person pose estimation is more difficult tha n the single person case as the location and the number of people in an image are unknown. Typically, we can tackle the above issue using one of two approaches: The simple approach is to incorporate a person detector first, followed by estimating the parts and then calculating the pose for each person. This method is known as the top down approach. Another approach is to detect all parts in the image (i.e. parts of every person), followed by associating parts belonging to distinct persons. This method is kn own as the bottom up approach. Figure 4 . Expression of body keypoints

PAGE 17

7 OpenPose [ 3 ] is one of the most popular bottom up approaches for multi person human pose estimation , which first detects body parts belonging to every person in the image, followed by assigning parts to distinct individuals. S hown below is the architecture of the OpenPose model which we use as the base for model fusion. The OpenPose network first extracts features from the input image using the first 10 layers of VGG 19 which is a kind of CNN. These features are then fed into two parallel convolutional layer branches. The first branch predicts a set of 18 confidence maps, each repres enting a specific portion of the body pose skeleton. The second branch predicts a set of partial affinity fields (PAFs) that encode the location and orientation of limbs. Successive stages are used to refine the predictions made by each branch. Through the above steps, human pose skeletons can be estimated and assigned to every person in the image. This is the basic pose estimation model used for fusion to get human skeleton from multispectral data. Figure 5 . Architecture of OpenPose

PAGE 18

8 Materials & Methods RGB & Thermal human skeleton detecti on dataset Many researches that focus on Real Time Multi Person Pose Estimation have already achieved great accuracy on the conventional visible color imaging data. However, the accuracy would be affected by lighting and distance conditions as well as the cluttered backgrounds. Thermal cameras, which have long wavelength infrared, are stable by the intensity of light. On the contrary, thermal data always lose the fine visual details of human objects, especially at long distance. T hese two pairs of images above show the comparison of the RGB and thermal photos taken in the same time and background . Visible cameras, much like our eyes, often have trouble seeing through naturally occurring visual obscurants that block reflected light . From the left RGB image of the first row, we cannot see anything under the dark scenarios. However, because thermal radiation passes through these Figure 6 . Comparison results of RGB and Thermal image

PAGE 19

9 visual barriers, thermal cameras can easily and clearly capture the h uman body figure shown in the right image. In the second row, applying relatively good lighting conditions on both types of images is likely to increase the sharpness for RGB images, but for thermal images it may be too bright for the image to be blurred. Data Collection Equipment: FLIR Duo R Camera, 250w Power Station, Monitor We record the RGB & Thermal videos through F ILR DUO R multispectral camera under different location, time and weather conditions to e nsure data set integrity and diversity . Then we d evelop the multispectral human skeleton detection dataset by e xtract ing frames from videos and mak ing the annotation files which record For deep learning process, the dataset is separated as two sub datasets: Train (25716 color and 25716 thermal images) and Test (6429 color and 6429 thermal). The complete fusion neural network will be trained and tested with the same dataset. D ata Fusion We aim to fuse thermal captured information with visible color imaging to improve the overall accuracy of multi person pose estimation. To achieve o ptimal results , w e divide the integration process into two parts, one is data fusion and the other is model fusion. In th e data fusion part, f irst ly we convert original RGB images to CIELAB color space , then replace the channel L which expre ss the lightness from black (0) to white (100) by the channel T of the thermal Figure 7 . Equipments for data collection

PAGE 20

10 image which has single channel to get an integrated image . I mage R esults Shown below are some typical data fusion results. We found that the clarity of the image and the adaptability in the dark environment have improved significantly. The original data of the two groups were taken at night and after noon respectively. W e can hardly see people in the dark light environment , so do the visible color camera. On the contrary, thermal images become blurred i n brighter conditions . After the data fusion part, t he integrated image preserves fine Figure 8 . Data fusion process Figure 9 . Some results of data fusion

PAGE 21

11 visual details of human objects and can also detects useful information in the dark background . M odel Fusion In this section, we introduce three different fusion neural networks which are ear ly, middle and late. Each individual model is trained independently on our data. The results from the models are tested, compared, and ranked on their performance. Our architecture, shown in Fig. 1 , c oncatenates the feature maps from color and thermal branches immediately after the middle of convolutional layers. Afterwards, we introduce a 1×1 convolutional layer that reduces the dimension of concatenate layer. The output connects the rest of VGG layer s and Part Affinity Fields & Confidence Map stages to simultaneously predict detection confidence maps and affinity fields that encode part to part association. To guide the network to iteratively predict confidence maps of body parts in the first branch a nd PAFs in the second branch, two loss functions are applied at the end of each stage, one at each branch respectively. Figure 1 0 . Early Fusion (EF) Figure 11 . Middle Fusion ( M F)

PAGE 22

12 I n the middle fusion model, b oth RGB and thermal image are first analyzed by a convolutional network (initialized by the first 10 layers of VGG 19 and fine tuned ), generating a set of feature maps F1 and F2. Then we fuse F1 and F2 to get a set of integrated feature maps F that is input to the first stage of each branch which is an iterative prediction architecture to refine the predictions over succe s sive stages . In this model we d o the fusion part in the end of stage3 so that it is called late fus ion . Concatenate each output (S, L, F) from RGB image with the one from thermal image and perform dimensionality reduction operations. The fusion parameters are the input for the next stage. T est R esults We have trained and test these three models on the same dataset. Our results showed that the late fusion performed the best out of all three of our proposed models and all of our models performed the baseline for pose estimation. This is because in LF , eac h original visible color image and thermal image are separately analyzed by multi layer convolutional network and then fused, and the respective features can be fully extracted. The more discriminative features the model extract, the better performance we got in the results. S ome obvious image test results are showed below and the c omparison results of three different fusion models are listed in the table . Figure 13 . Late Fusion ( L F)

PAGE 23

13 Figure 1 4. Examples of skeleton detection results on RGB & Thermal dataset The first two rows show the indoor scenarios and the last two rows represent the outdoor scenarios. Benefit from the thermal data, the entire model performs well under low illumination or even dark environments. Outdoor RGB Outdoor Thermal Indoor RGB Indoor Thermal Table 1. Comparison results of three different fusion models

PAGE 24

14 M ultispectral Fusion P ose Estimation Network In order to combine the advantages of the two fusion process to get the optimal model, we use the output of the data fusion model that is the integrated image as the third input of the model fusion , here we use the late fusion (LF) which is the best one. We separately concatenate produce d detection confidence maps S and part affinity fields L as well as the image feature F from each stream, then the integrated prediction is used to be fine tuned by each subsequent stage. Finally, h uman pose skeletons can be estimated and assigned to each person in the fusion image through the process. The architecture shown below is our complete m ult ispectral f usion p ose e stimation neural n etwork . Experiments & Results In the training code, we adjusted the number of epochs to our dataset, which was enough to obtain a not bad model. It is worth mentioning that we are training with GPU which is much faster than training with CPU. In this step, the based learning rate was 4 e 4 with batch size 16 which is depended on the GPU memory. We set up 5000 max iteration at the beginning and found that the downward trend of loss was obvious from millions to hundreds. Until all the epoch runs, it was still falling sharply. After that we increased the number of max_iter to 20000 and finally got a loss of around 3 0 . The learning rate is a hyperparameter that controls how much we adjust the weight of the network based on the loss gradient. The lower the value, the slower we travel along the downward slope. Using a n optimizer adjusting lower learning rate in terms of makes sure that we do not miss any local minima, which could also mean that Figure 15 . Ar chitecture of multispectral fusion pose estimation model .

PAGE 25

15 The model was trained and evaluated on a Nvidia GTX 1070Ti gr aphics card took around 6 hours. The accuracy of the model can improve with the expansion of the dataset and precise label. Shown below are some results of human pose skeleton output in the fusion image. Figure 16 . Pose estimation results for the fusion images Table 2 . Comparison results of different fusion process

PAGE 26

16 CHAPTER HUMAN ACTION RECOGNITION MODEL Introduction of Graph Convolutional Network Overview The data processed by traditional CNN is in the form of Euclidean Structure based on a matrix in which pixels are arranged. CNN cannot process the data of Non Euclidean Structure because the traditional discrete convolution cannot maintain translation invariance on the data of Non Euclidean Structure . That is, in the topology diag ram, the number of adjacent vertices of each vertex may be different, then convolution operation cannot be performed with a convolution kernel of the same size. Recently m any important real world datasets come in the form of graphs or networks: social netw orks, knowledge graphs, etc . It is a topological diagram of the relationship between vertices and edges in graph theory. Graph Neural networks (GCNs), which generalize convolution neural networks (CNNs) to graphs of arbitrary structures, have received incr easing attention. The application of GCNs to extract features from dynamic graphs over large scale datasets, e.g. human skeleton sequences, are yet to be explored. Figure 17. Euclidean Structure Figure 18. Non Euclidean Structure

PAGE 27

17 GCNs are a very powerful neural network architecture for machine learning on graphs. The feature extraction of the graph is divided into vertex domain (spatial domain) and spectral domain. S pectral perspective methods utilize the eigen values and eigenvectors of graph Laplace matrices. It performs the graph convolution in frequency domain with the help of graph Fourier transform [16] , which does not need to extract locally connected regions from graphs for each convolutional step . Most of the skeleton based human action recognition approaches apply GCN s on the spatial domain , where the convolution filters are performed directly on the graph vertices and their neighbors based on the manually designed rules [17] . Node sequence selection is the process of identifying, for each input graph, a sequence of nodes for which receptive fields are created. For each of the n odes identified in the previous step, the nodes of the neighborhood are the candidates for the receptive field. The receptive field for a node is constructed by normalizing the neighborhood assembled. Related Works Recently, with the flexibility to exploit the body joint relations, the graph based approach draws much attention. Some typical GCNs based approach es are adopted to learn the graphs adaptively from data, which captures useful non local information about actions. To capture joint dependencies, rec ent methods construct a skeleton graph whose vertices are joints and edges are bones, and apply graph convolutional networks (GCN) to extract correlated features . In this section, t he implementation process is briefly introduced Notations Consider a skele ton graph as G = (V, E), where V is the set of n body joints and E is a set of m limbs. In this graph, the node set includes the all the joints in a skeleton sequence with T frame . Formally, the edge set E is composed of two subsets, the first subset depicts the intra skeleton connection at each frame, denoted as , where H is the set of naturally connected human body joints. The second subset contains t he inter frame edges, which connect the same joints in consecutive frames as . Therefore , all edges in for one particular joint i will represent its trajectory over time. Let be the adjacent matrix of the skeleton graph which

PAGE 28

18 fully describes the skeleton structure , where A i,j = 1 if the i th and the j th joints are connected and 0 otherwise. For implement ing the label map and captur ing more refined location information , t here is a strategy to divide the neighbor set into three subsets: 1) the root node itself; 2) centripetal group: the neighboring nodes that are closer to the gravity center of the skeleton than the root node; 3) otherwise the centrifugal group. Here the average coordinate of all joints in the skel eton at a frame is treated as its gravity center . Spatial Temporal GCN Spatial Temporal Graph Convolutional Networks (ST GCN) , which extend s graph neural networks to a spatial temporal graph model , is proposed for skeleton based action recognition [12] . This deep learning network use the estimated joint locations in the pixel coordinate system as input which can be estimated by the publ ic available OpenPose [3] toolbox on every frame of the clips. ST GCN consists of a series of the ST GCN block , e ach of which applies a spatial graph convolution followed by a temporal convolution to extract spatial and temporal features alternatingly . The last ST GCN block is connected to a fully connected classifier to generate final predictions. The key component Figure 19. The proposed partitioning strateg y for constructing convolution operations Figure 20. ST GCN network

PAGE 29

19 in ST GCN is the spatial graph convolution operation, which introduces weighted average of neighboring features for each joint. The intra b ody connections of joints within a single frame are represented by an adjacency matrix A and an identity matrix I representing self connections which contains the features o f node itself. In this work they use the 1 neighbor set of joint nodes for all cases . For partitioning strategies with multiple subsets , t he adjacency matrix is dismantled into several matrixes where . Let be the input features of all joints in one frame, where is the input feature d imension, and be the output features obtained from spatial graph convolution, where is the dimension of output features . The spatial graph convolution is where is the normalized adjacent matrix for each partition group , M and W are learnable weight matrix es for each partition group to capture edge weights and feature importance, respectively . STGCN makes reasonable use of prior knowledge to give more attention to j oints with large movements, which is potentially reflected in the weight distribution strategy. ST GCN uses TCN as the temp oral convolution operation to e xtract tim e features , because of the fixed shape, they use the traditional convolution layer to complete the time convolution operation . C ompared with the convolution operation of the image , t he shape of the last three dimensions of the feature map outputted by ST GCN is , which corresponds to the shap e of the image feature map. The width of the image W corresponds to the number of key frames T. The height of the image H corresponds to the number of joints V. A ctional S tructural GCN Although ST GCN extracts features of joints that are directly c onnected through bones, long distance joints that may cover critical motion patterns are largely ignored. For example, when walking, hands and feet are strongly related. ST GCN though attempts to use a layered GCN to aggregate a wider range of features, node characteristics may be weakened during long diffusion. Then, to capture richer action -

PAGE 30

20 specific latent dependencies , AS GCN is further proposed [13] . In this model, the author data driven infer the actional links (A links) to capture the latent dependen cies between any joints and an A link inference module (AIM) with an encoder decoder structure is proposed. T he actional structural graph is constructed as G = (V, E) and the E is the set of generalized links . Compared with the traditional skeleton graph, a new subset including actional links of the edge set is introduced . These A links are activated by actions and might exist between arbitrary pair of joints. They develop a trainable A link inference mod ule (AIM), which consists of an encoder and a decoder , to automatically infer the A links from actions . T he actional graph convolution (AGC) which uses C types of A l ink s to capture inter joint motion dependencies is defined . For diffus ing information in a long range , the network use s the high order polynomial of the adjacent matrix in ST GCN , indicating the S links . With the L order polynomial , the structural graph convolution (SGC) can directly reach the L hop neighbors to increase the receptive field. where the graph transition matrix provides the nice initialization for edge weights, which stabilizes Figure 21. An example of the skeleton graph

PAGE 31

21 the learning of . To integrally capture both actional and structural information among arbitrary joints , the AGC and SGC are combined to develop the actional structural graph convolution (ASGC). Mathema tically, the ASGC operation is formulated as AS GCN block uses the same method as ST GCN t o capture the inter frame action features , which is applying one layer of temporal convolution (TCN) a long the time axis to extract the temporal feature of each joint independently . The architecture of AS GCN is shown below. Both algorithms have great significance in the field of action recognition. ST GCN is a pioneer and innovator of human action recognition algorithms based on graph convolution network , which provides a good idea and foundation for subsequent algorithms. ASGCN which extracts more discriminative features through body structure and a ction also achieves large improvement compared with the previous state of art methods . Materials & Methods Backgrounds The graph convolutional network on vertex domain has reached an unprecedented height on the NTU RGB+D dataset [18] and can comprehensively extract spatial features of the body structure. However, compared with Long Short Term Memory (LSTM) which has s trong acquisition ability for time features, t he performance of TCN extracting temporal features may not be as good as LSTM . Figure 22. The pipeline of the AS GCN


In the above methods, once the mining is done, the degrees of importance of joints and features are fixed and do not change across temporal frames and sequences. As an action proceeds, the associated pairs of joints may also change; in running, for example, synergistic action links indicate that the arms and feet are not connected but are correlated. Moreover, not all frames in a sequence have the same importance for action recognition. Some frames capture less meaningful information, or even carry misleading information associated with other types of actions, while other frames carry the key information [19]. A number of methods have been proposed that use key frames as representations for recognizing the action class. One utilizes a frame distillation network (FDNet) to distill a fixed number of key frames from the input sequences with a deep progressive reinforcement learning method before feeding the output into a GCN to recognize the action label [20]. Another employs AdaBoost to select the most discriminative key frames [19]. Using key frames helps eliminate noisy frames that contain unwanted and distracting information; however, compared with holistic approaches that use all frames, it loses some information.

Objectives

Objective 1. We aim to propose a spatial attention module, or attention-based graph convolution, that determines the degrees of importance of joints or limbs by adding a learnable weight matrix to the co-occurrence links based on their contents.

Objective 2. Based on the idea of TCN, we extend the range of the existing temporal graph so that it does not only include time edges between consecutive frames: each joint in the current frame is connected to the same joint in the next few frames, since the temporal features should be determined by the current frame together with the previous frames. We then introduce a temporal attention model that determines the degree of importance of each time edge within these frames; instead of skipping frames, it allocates different attention weights to different time edges to automatically exploit their respective discriminative power and focus more on the important frames.


Objective 3. To fully extract features of the intra-body structure within a single frame, we use a multi-filters GCN that enlarges the range of neighbors beyond the 1-neighbor set and widens the receptive field over the considered graph.

Objective 4. We further propose a multi-task feedback model that forms discriminative action-correlated links on the original skeleton graph by propagating the attention probability parameters and iteratively updating them between the two corresponding blocks of each task. Multi-filters convolution is then applied to the graph containing the newly generated action edges to improve the extraction of motion and structural features.

Graph Attention Network

In this section, we utilize an attention-based architecture, the graph attention network (GAT) [21], to perform node classification on skeleton data. Attention mechanisms have recently become an almost essential standard in many sequence-based tasks [22, 23]. It is effective for an action recognition model to focus on the most relevant parts of the input human pose skeleton when making decisions. The graph attentional layer allows (implicitly) assigning different importances to different nodes within a neighborhood while dealing with neighborhoods of different sizes.


The input to the graph attention layer is a set of node features of the skeleton data, $\mathbf{h} = \{\vec{h}_1, \ldots, \vec{h}_N\}$ with $\vec{h}_i \in \mathbb{R}^{F}$, where $N$ is the number of body joints and $F$ is the number of features per joint. The layer generates a new set of node features of potentially different cardinality $F'$, $\mathbf{h}' = \{\vec{h}'_1, \ldots, \vec{h}'_N\}$ with $\vec{h}'_i \in \mathbb{R}^{F'}$, as its output. To obtain enough expressiveness to transform the input features into higher-level features, at least one learnable linear transformation is required. To that end, as an initial step, a shared linear transformation parametrized by a weight matrix $W \in \mathbb{R}^{F' \times F}$ is applied to every node. Self-attention is then performed on the nodes to compute the attention coefficients

$$e_{ij} = a\!\left(W\vec{h}_i, W\vec{h}_j\right),$$

which indicate the importance of node $j$'s features to node $i$; the relationship between two nodes is thus determined by the features of both nodes. We only compute $e_{ij}$ for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is the neighborhood of node $i$ in the graph. To make the coefficients easy to compare across different nodes, the softmax function is applied to normalize them over all choices of $j$:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}.$$

Figure 23. The attention algorithm


In summary, the features of two body joints are first used to generate new expressive features through a linear transformation; the attention coefficient, which is the parameter used to perform the weighted summation at each convolution, is then calculated, so that there is an attention coefficient between any two joints. The attention mechanism is a single-layer feedforward neural network parametrized by a weight vector $\vec{a}$ and activated with LeakyReLU, so the coefficients are specifically expressed as

$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[W\vec{h}_i \,\|\, W\vec{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(\mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[W\vec{h}_i \,\|\, W\vec{h}_k\right]\right)\right)},$$

where $\|$ denotes concatenation. As opposed to GCNs, this model allows assigning different importance to nodes of the same neighborhood, enabling a leap in model capacity. In action task A, the action graph, which has no natural body-limb connections but consists only of the dotted action links, is taken as input. We aim to adapt the GAT model to calculate the attention coefficient between pairs of joints and then form the possible latent action links used to construct the new actional-structural graph.
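The following is a minimal single-head graph attention layer implementing the formula above, assuming PyTorch; the initialization and the dense pairwise computation are illustrative simplifications, and every joint is assumed to have at least one neighbor so the softmax is well defined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttention(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)
        self.a = nn.Parameter(torch.empty(2 * out_features))
        nn.init.normal_(self.a, std=0.1)

    def forward(self, h, adj):
        # h: (N, F) joint features; adj: (N, N) binary neighborhood matrix.
        Wh = self.W(h)                                    # (N, F')
        N = Wh.size(0)
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every pair (i, j).
        pairs = torch.cat(
            [Wh.unsqueeze(1).expand(N, N, -1),
             Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)   # (N, N, 2F')
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)
        # Mask non-neighbors so softmax normalizes only over N_i.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                  # alpha_ij over j
        return alpha @ Wh                                 # weighted sum
```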


Multi-Filters GCN

In this section we apply a new definition of the graph convolutional filter, presented in [24]. It generalizes the most commonly adopted filter by adding a hyperparameter that controls the distance of the considered neighborhood, following an approach based on shortest paths: the adjacency matrix $A$ of a graph can be seen as the matrix of the shortest paths of length 1, while the identity matrix $I$ is the matrix of the shortest paths of length 0. The first layers of the network are again stacked graph convolutional layers, defined as

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right).$$

Here, $\tilde{A} = A + I$ is the adjacency matrix of the undirected graph $G$ with added self-connections, $\tilde{D}$ is its degree matrix, and $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, and $H^{(l)}$ is the matrix of activations in the $l$-th layer, with $H^{(0)} = X$. We decide to keep the contributions of the nodes at different shortest-path distances separate, which is equivalent to defining multiple graph convolutional filters, one per shortest path. The parametric graph convolution is then

$$H^{(l+1)} = \Big\Vert_{j=0}^{r} \, \sigma\!\left(\tilde{D}_j^{-\frac{1}{2}} A_j \tilde{D}_j^{-\frac{1}{2}} H^{(l)} W_j^{(l)}\right),$$

where $\Vert$ is the vertical concatenation of vectors, $j$ is the length of the shortest path, and $A_j$ is the matrix of the shortest paths of length $j$. The parameter $r$ controls the maximum distance of the considered neighborhood and the dimensionality of the output.

Figure 24. Multi-filters graph convolution

It is easy to see that, by definition, the receptive field of a graph convolutional filter parameterized by $r$ and applied to vertex $v$ includes all nodes at shortest-path distance at most $r$ from $v$. When multiple layers of the parametric convolution are stacked, the receptive fields grow in the same way: the receptive field of a parametric convolution filter of size $r$ applied at layer $l$ to vertex $v$ includes all vertices at shortest-path distance up to $l \cdot r$ from $v$.
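A sketch of the parametric graph convolution, assuming PyTorch and precomputed, normalized shortest-path matrices $A_j$ (with $A_0 = I$); the class and argument names are ours.

```python
import torch
import torch.nn as nn


class ParametricGraphConv(nn.Module):
    def __init__(self, in_features, out_features, A_list):
        # A_list: r+1 normalized (V, V) matrices, A_list[j] holding the
        # shortest paths of exactly length j (so A_list[0] is the identity).
        super().__init__()
        self.register_buffer("A", torch.stack(list(A_list)))
        self.lin = nn.ModuleList(
            [nn.Linear(in_features, out_features, bias=False)
             for _ in A_list])

    def forward(self, h):
        # h: (V, F). One filter per shortest-path distance j = 0..r; the
        # outputs are concatenated so each distance keeps its contribution.
        outs = [torch.relu(self.A[j] @ lin(h))
                for j, lin in enumerate(self.lin)]
        return torch.cat(outs, dim=-1)       # (V, (r+1) * out_features)
```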


Action recognition cannot be determined only by a small range of joint and limb features. Traditional methods diffuse information in a local range, which may ignore discriminative motion patterns covering long-distance joints. By adopting multi-filters GCNs we can enlarge the receptive field and propagate useful features across a wider range of the body structure. We then concatenate the outputs of the convolutions for all $j \le r$ to obtain higher-order spatial features for action recognition: the richer the captured action-specific latent dependencies, the better the results.

Temporal Convolution Network

GCNs help us learn the local features of adjacent joints in space. On this basis, we need to learn the local features of joint changes over time. How to superimpose temporal features on a graph is one of the problems faced by graph networks, with two main approaches: temporal convolution (TCN) and sequence models (LSTM). Our AST-GCN model uses one layer of TCN along the time axis, which extracts the temporal feature of each joint independently while sharing the weights across joints, to capture the inter-frame action features. ST-GCN proposes a very simple strategy to extend the spatial graph CNN to the spatial-temporal domain: the concept of neighborhood is extended to also include temporally connected joints,

$$B(v_{ti}) = \left\{ v_{qj} \;\middle|\; d(v_{tj}, v_{ti}) \le K,\; |q - t| \le \lfloor \Gamma/2 \rfloor \right\},$$

where the temporal range $\Gamma$ controls how many neighboring frames are included in the neighbor graph.

Figure 25. Temporal convolution in ST-GCN

When constructing the temporal graph, we decide to add time links between each joint in the current frame and the same joint in the next few frames.


This is because, when recognizing an action, the temporal feature of a joint or limb should contain not only the information of the current frame but also that of the previous frames.

Figure 26. Structural-temporal graph

The dashed lines in the figure above represent the original time edges, and the solid lines are the additional time links between the same joints across several frames. To determine the degree of importance of each time link within these frames, we adapt an attention mechanism that calculates the attention coefficients between pairs of the same joint in the temporal graph. The model can then assign different weights to the time edges, capturing more relevant information while eliminating noisy or misleading features:

$$\beta_{t t'} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[W\vec{h}_{i,t} \,\|\, W\vec{h}_{i,t'}\right]\right)\right)}{\sum_{|t - s| \le T_r} \exp\!\left(\mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[W\vec{h}_{i,t} \,\|\, W\vec{h}_{i,s}\right]\right)\right)},$$

where $T_r$ is the time range. Through the above operation, the regularized attention coefficients between the same joint in different frames are obtained, which can be used to predict the output feature of each joint as

$$\vec{h}'_{i,t} = \mathrm{ReLU}\!\left(\sum_{|t - t'| \le T_r} \beta_{t t'}\, W\vec{h}_{i,t'}\right).$$

We use the nonlinear ReLU function because of its good convergence properties. The temporal attention graph convolution network can thus be trained to implicitly learn the meaningful temporal features.
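A hedged sketch of this temporal attention in PyTorch: each joint attends to the same joint within a window of frames, with a banded mask implementing the added time links. The scoring function mirrors the GAT-style formula above, and the module name and initialization are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    def __init__(self, channels, time_range=3):
        super().__init__()
        self.W = nn.Linear(channels, channels, bias=False)
        self.a = nn.Parameter(torch.empty(2 * channels))
        nn.init.normal_(self.a, std=0.1)
        self.time_range = time_range

    def forward(self, x):
        # x: (N, T, V, C). Build a band mask so frame t attends only to
        # frames t' with |t - t'| <= time_range (the added time links).
        N, T, V, C = x.shape
        h = self.W(x)
        q = h.unsqueeze(2).expand(N, T, T, V, C)
        k = h.unsqueeze(1).expand(N, T, T, V, C)
        e = F.leaky_relu(torch.cat([q, k], dim=-1) @ self.a)  # (N, T, T, V)
        t = torch.arange(T, device=x.device)
        band = (t[:, None] - t[None, :]).abs() <= self.time_range
        e = e.masked_fill(~band[None, :, :, None], float("-inf"))
        beta = torch.softmax(e, dim=2)            # attention over frames t'
        return F.relu(torch.einsum("ntsv,nsvc->ntvc", beta, h))
```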


Two-Stream Feedback Attention-Based GCN

We further propose the Two-Stream Feedback Attention-Based GCN (2S-ATGCN) model, which stacks attention-based structural graph convolution and temporal convolution as a basic building block to learn both spatial and temporal features for action recognition. After multiple spatial and temporal feature aggregations, 2S-ATGCN extracts high-level semantic information across time. Links correlated with the discriminative action are formed on the original skeleton graph by propagating the attention probability parameters and iteratively updating them between the two corresponding blocks of each task. To classify actions, we apply global average pooling over the joint and temporal dimensions of the feature maps output by the backbone Task S network, obtaining a feature vector that is finally fed into a softmax classifier to predict the class label; the standard cross-entropy loss is used as the loss function. The architecture shown below is the pipeline of our 2S-ATGCN model.

Figure 27. The pipeline of the 2S-ATGCN model
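The classification head described above reduces to a few lines; a minimal sketch in PyTorch, with the channel width and class count taken from the settings reported later.

```python
import torch
import torch.nn as nn


class ActionHead(nn.Module):
    def __init__(self, channels=256, num_classes=60):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat):
        # feat: (N, C, T, V) output of the Task S backbone.
        vec = feat.mean(dim=(2, 3))     # global average pool -> (N, C)
        return self.fc(vec)             # logits for the softmax classifier


# Training uses the standard cross-entropy loss on the logits:
criterion = nn.CrossEntropyLoss()
```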


Structural Task (S)

The input to the structural task (S) is a sequence of body joints in the form of 2D or 3D coordinates. We construct the spatial-temporal graph on the skeleton sequence in two steps: first, the joints within one frame are connected with edges according to the connectivity of the human body structure; then, each joint is connected to the same joint in the next few frames. Task S consists of a series of Attention-based Structural-Temporal GCN (AST-GCN) blocks. Each block contains a Selective Multi-filters Attention-based Structural GCN (SF-ASGC) followed by an attention-based temporal convolution, alternately extracting effective spatial and temporal features; the result is then classified into the corresponding action category by the standard softmax classifier.
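The two-step construction can be made concrete as follows; a small NumPy sketch under the assumption that the skeleton is given as a bone list, with illustrative parameter names.

```python
import numpy as np


def build_graph(num_joints, bones, num_frames, time_range=2):
    """Adjacency over T*V spatio-temporal nodes: intra-frame bone edges
    plus links from each joint to the same joint in the next few frames."""
    V, T = num_joints, num_frames
    A = np.zeros((T * V, T * V))
    for t in range(T):
        # Step 1: connect joints within frame t along the body structure.
        for i, j in bones:
            A[t * V + i, t * V + j] = A[t * V + j, t * V + i] = 1
        # Step 2: connect each joint to the same joint in the next frames.
        for dt in range(1, time_range + 1):
            if t + dt < T:
                for v in range(V):
                    A[t * V + v, (t + dt) * V + v] = 1
                    A[(t + dt) * V + v, t * V + v] = 1
    return A


# Example: a toy 3-joint chain observed over 4 frames.
A = build_graph(3, [(0, 1), (1, 2)], 4)
```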


Selective Multi-Filters AS-GCN (SF-ASGC)

In standard graph convolutional networks (GCNs), the receptive fields of the artificial neurons in each layer are designed to share the same range of neighbors. We aim to obtain rich information at different receptive-field sizes by applying convolution filters of different sizes, which determines the filter sizes of the corresponding multi-filters GCN. We further want the network to adaptively adjust its receptive-field size according to the multiple scales of the input information. We therefore integrate the idea of the Selective Kernel unit [25], in which multiple branches with different kernel sizes are fused using softmax attention guided by the information in those branches; different attention over the branches yields different effective receptive-field sizes for the neurons in the fusion layer.

Figure 28. The architecture of the selective filter network

Instead of concatenating the outputs of the convolutions for all $r$ neighborhoods, we integrate the SK unit, which can adapt the receptive-field size, to propose our selective multi-filters structural graph attention convolution. This selective filter network consists of a triplet of operators: Split, Fuse and Select. The Split operator generates multiple paths with various filter sizes, corresponding to different receptive-field sizes of neurons. The Fuse operator combines and aggregates the information from the multiple paths to obtain a global, comprehensive representation for the selection weights. The Select operator aggregates the feature maps of the differently sized filters according to those selection weights.

Split: For the feature vector $X$ on a node, consisting of the coordinate vector and the estimation confidence of the $i$-th joint on frame $t$, by default we first conduct three transformations with filter sizes whose node-neighbor ranges are 1, 2 and 3, respectively, outputting $U_1$, $U_2$ and $U_3$. The transformations are composed of efficient structural graph convolutions (SGC), batch normalization and a ReLU function, in sequence.

Fuse: Our goal is to enable the neurons to adaptively adjust their receptive-field sizes based on the stimulus content. The basic idea is to use gates to control the flows of information from the multiple branches, which carry information at different scales, into the neurons of the next layer. To achieve this, a gate integrates the information from all branches. We first fuse the results from the multiple branches by element-wise summation:

$$U = U_1 + U_2 + U_3,$$

then generate channel-wise statistics $s \in \mathbb{R}^{C}$ by simply using global average pooling to embed the global information. Specifically, the $c$-th element of $s$ is calculated by shrinking $U$ through the spatial-temporal dimensions $V \times T$, where $V$ is the number of joints and $T$ is the number of frames:

$$s_c = \frac{1}{V \times T} \sum_{v=1}^{V} \sum_{t=1}^{T} U_c(v, t).$$

In addition, a compact feature $z \in \mathbb{R}^{d}$ is created to guide the precise and adaptive selection, achieved through a simple fully connected (fc) layer that reduces the number of dimensions for efficiency:

$$z = \mathcal{F}_{fc}(s) = \delta\!\left(\mathcal{B}(W s)\right),$$


where $\delta$ is the ReLU function [26], $\mathcal{B}$ denotes batch normalization [27], and $W \in \mathbb{R}^{d \times C}$.

Select: A soft attention across channels, guided by the compact feature descriptor $z$, is used to adaptively select different spatial scales of information. Specifically, a softmax operator is applied on the channel-wise digits:

$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}}, \qquad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}}, \qquad c_c = \frac{e^{C_c z}}{e^{A_c z} + e^{B_c z} + e^{C_c z}},$$

where $A$, $B$, $C$ and $a$, $b$, $c$ denote the soft attention matrices and vectors for $U_1$, $U_2$ and $U_3$, respectively. In the case of three branches, the matrix $B$ is redundant because $a_c + b_c + c_c = 1$. The final feature map $Y$ is obtained through the attention weights on the various kernels:

$$Y_c = a_c \cdot U_{1,c} + b_c \cdot U_{2,c} + c_c \cdot U_{3,c}, \qquad a_c + b_c + c_c = 1.$$

Note that we provide the formula for the three-branch case; situations with more branches are easily deduced by extending the equations above. Clearly, the importance and correlation of the human body's structural links differ across motion sequences: in a handshake, where the hands stretch out, the relations along the arms appear more important than those along the legs. We therefore introduce the selective multi-filters graph attention convolution, adapting the attention mechanism to calculate the attention coefficients on the structural links instead of using traditional GCNs, so that different degrees of relation can be paid to the pairs of joints in the body connectivity.
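A compact sketch of the Fuse and Select operators for the three-branch case, assuming PyTorch; the reduced dimensionality d and the module name are illustrative.

```python
import torch
import torch.nn as nn


class SelectiveFuse(nn.Module):
    def __init__(self, channels, d=32, branches=3):
        super().__init__()
        self.fc = nn.Sequential(                 # z = ReLU(BN(W s))
            nn.Linear(channels, d, bias=False),
            nn.BatchNorm1d(d),
            nn.ReLU(inplace=True))
        # One selection matrix (A, B, C) per branch.
        self.select = nn.Linear(d, channels * branches, bias=False)
        self.branches = branches

    def forward(self, feats):
        # feats: list of 3 tensors (N, C, T, V), one per filter size.
        U = torch.stack(feats).sum(0)                    # element-wise sum
        s = U.mean(dim=(2, 3))                           # GAP over V x T
        z = self.fc(s)                                   # compact feature
        logits = self.select(z).view(-1, self.branches, U.size(1))
        w = torch.softmax(logits, dim=1)                 # a + b + c = 1
        return sum(w[:, k, :, None, None] * feats[k]
                   for k in range(self.branches))        # final map Y
```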


AST-GCN Block

To integrally capture the structural features among skeleton joints, we developed the selective multi-filters ASGC. Different time links, in turn, have different degrees of importance and robustness to variations. To capture the inter-frame action features, we use attention-based temporal convolution (AT-TCN) along the time axis, which extracts the temporal feature of each keypoint independently while enabling the network to pay different levels of attention to different joints and to assign different degrees of importance to the various time links as an action proceeds. The AST-GCN block, which consists of SF-ASGC, AT-TCN and further operations (batch normalization (BN), ReLU and residual blocks), is shown below. Since SF-ASGC and AT-TCN learn spatial and temporal features, respectively, we concatenate both layers into an AST-GCN block to extract spatial-temporal features from various actions. Note that SF-ASGC is a single operation that extracts only spatial information from the structural graph, while the AST-GCN block, comprising a series of operations, extracts both spatial and temporal information from the spatial-temporal graph.

Figure 29. An AST-GCN block
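One AST-GCN block can then be assembled as below; a sketch assuming PyTorch, where `sf_asgc` and `at_tcn` stand for the layers defined above and are assumed to preserve the (N, C, T, V) layout and channel count so the residual connection applies directly.

```python
import torch.nn as nn


class ASTGCNBlock(nn.Module):
    def __init__(self, sf_asgc, at_tcn, channels):
        super().__init__()
        self.spatial = sf_asgc            # selective multi-filters ASGC
        self.temporal = at_tcn            # attention-based temporal conv
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, graph):
        res = x                                        # residual branch
        y = self.relu(self.bn1(self.spatial(x, graph)))
        return self.relu(self.bn2(self.temporal(y)) + res)
```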


Action Task (A)

In this section we introduce the action task (A), in which actional links are formed and combined into the generated skeleton graph for further feature extraction. Many human activities require joints that are far apart to move collaboratively, resulting in non-physical dependencies among joints. To capture the corresponding dependencies of various actions, we introduce actional links, which are activated by actions and may exist between arbitrary pairs of joints.

Figure 30. Establishment process of the actional links

The input to the action task (A) is also the skeleton sequence, represented by the 2D or 3D coordinates of each human joint in each frame. However, the action graph we construct in task A differs from the one in the structural task S: each joint is connected with all other joints except its natural human-body connections, representing the potential actional edges, which are shown as dashed lines in the action graph. Note that this action graph does not include any edges from the connectivity of the human body structure.

AAT-GCN Block

To automatically infer the actional links from actions, we develop the AAT-GCN block, which consists of a graph attention convolution followed by an attention-based temporal convolution. First, we calculate the attention coefficients between pairs of joints, allowing every joint to attend to every other joint and dropping all structural information. We perform masked attention to inject the graph structure into the mechanism: we only compute $e_{ij}$ for joints $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is some neighborhood of joint $i$ in the skeleton graph; in all our experiments, these are exactly all other joints of the generated actional graph except $i$. We apply a LeakyReLU activation function to the attention mechanism, which is parametrized by a weight vector, and finally obtain the attention coefficients (link probabilities) over the sequences of various actions, which assign different degrees of importance to the probable actional links; the inferred actional links are shown as solid lines of different thickness. The thicker a line, the greater the actional correlation between the two joints; the thinner a line, the less synergy the two joints have in this series of actions. The formed actional links are combined into the structural graph of task S for a more comprehensive extraction of features. Meanwhile, the actional links are iteratively updated based on the output of the structural task (S) blocks, gradually forming a more complete action graph of the input skeleton sequence.
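A minimal sketch of this masked attention, assuming PyTorch: physical-bone pairs (and self-loops) are masked out so that the softmax runs only over the candidate action edges; the function and argument names are ours.

```python
import torch
import torch.nn.functional as F


def infer_action_links(h, W, a, bone_mask):
    # h: (V, C) joint features; W: (C, C') transform; a: (2C',) weight
    # vector; bone_mask: (V, V) True where a physical bone or self-loop
    # exists -- those pairs are excluded from the actional candidates.
    Wh = h @ W                                            # (V, C')
    V = Wh.size(0)
    pairs = torch.cat([Wh.unsqueeze(1).expand(V, V, -1),
                       Wh.unsqueeze(0).expand(V, V, -1)], dim=-1)
    e = F.leaky_relu(pairs @ a, negative_slope=0.2)       # (V, V)
    e = e.masked_fill(bone_mask, float("-inf"))
    return torch.softmax(e, dim=-1)   # actional-link probabilities per joint
```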


For the temporal domain, the temporal graphs in both tasks are the same, constructed by connecting each joint with the same joint in the next $t$ frames. While AT-TCN in the AST-GCN block of the structural task S extracts the temporal features of the human body structure over the action sequence, we apply AT-TCN in the AAT-GCN block of action task A to extract the temporal features of the potential action co-occurrences and to control the amount of information carried by each time edge, thereby optimizing the attention coefficients.

Feedback Model

Here we introduce a feedback model that propagates the attention probability parameters between the two corresponding blocks of each task for iterative updating. The actional links generated by task A are combined into the skeleton graph of the structural task S, and the renewed actional-structural graph is taken as input to the next AST-GCN block. The output features extracted by SF-ASGC and AT-TCN are then employed by the feedback module to adjust the attention coefficients (link probabilities) of the actional links and are fed back to task A to adjust the formed action edges or to generate more correlated co-occurrence links. First, we set the attention parameters (link probabilities) of the actional links in the original structural graph of task S to 0, meaning that the graph contains no co-occurrence links at all. The parameters are then computed in the AAT-GCN block of task A, and the attention vector $\alpha$ over the $v \times v$ joint pairs, where $v$ is the number of joints, is transformed by the feedback module to update the values in task S. By assigning different degrees of importance to these co-occurrence links based on the attention vector, the actional links are progressively formed in the structural graph for the convolution operation in the AST-GCN block. The updating process from the AAT-GCN block to the AST-GCN block is shown below.


Figure 31. Feedback and updates of the actional links

For the backpropagation, the attention vector is re-computed through the AST-GCN block and output as $\alpha'$. Thereafter, the feedback module updates the actional graph in task A by adjusting the values of the attention vector. Note that while SF-ASGC initially attends only to the structural links of the original skeleton graph, once action edges have been generated the task S block simultaneously computes attention parameters for both the action edges and the structural edges. The values of the attention coefficients are fed back to task A to adjust the actional graph for the next convolution operation in the AAT-GCN block, and the freshly computed parameters, further transformed, continue to influence the actional links in the spatial graph. As the network progresses, the actional links generated by task A become more and more accurate based on the feedback from task S. In general, the SF-ASGC network extracts only features of the human body structure at the beginning, and gradually comes to extract features of both the structural edges and the actional edges. The attention vector, which expresses the degree of potential relation between arbitrary joints for the different action classes, is iteratively updated between the two task models and finally forms a reliable and complete actional-structural graph. Finally, the rich spatial and temporal features fully extracted by SF-ASGC and AT-TCN in the main network are passed to the action classifier to recognize the action class.
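At a high level, the interplay of the two tasks can be sketched as the loop below; the block interfaces are assumptions for illustration only, since the text leaves the exact module signatures open.

```python
import torch


def forward_with_feedback(x, task_a_blocks, task_s_blocks, num_joints):
    # Actional-link probabilities start at zero: no co-occurrence links.
    p_links = torch.zeros(num_joints, num_joints)
    feat = x
    for block_a, block_s in zip(task_a_blocks, task_s_blocks):
        p_links = block_a(feat, p_links)        # task A refines the A-links
        feat, p_links = block_s(feat, p_links)  # task S consumes and feeds back
    return feat
```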


EXPERIMENTS & RESULTS

In this section we evaluate the performance of the 2S-ATGCN model on skeleton-based action recognition experiments. We experiment on two large-scale action recognition datasets with vastly different properties: the Kinetics human action dataset (Kinetics) [3] and NTU RGB+D [18].

Datasets

NTU RGB+D: NTU RGB+D, currently one of the largest and most widely used indoor-captured action recognition datasets, contains 56,880 skeleton action sequences performed by one or two subjects and categorized into 60 classes. It provides the 3D spatial coordinates of 25 joints for each human in an action. Two evaluation protocols are recommended: Cross-Subject and Cross-View. In Cross-Subject, 40,320 samples performed by 20 subjects form the training set, and the rest belong to the test set. Cross-View assigns data according to camera views, with 37,920 training and 18,960 test samples: training clips come from camera views 2 and 3, and the evaluation clips all come from camera view 1.

Kinetics-Skeleton: The DeepMind Kinetics human action dataset [3] contains around 300,000 video clips retrieved from YouTube. The videos cover as many as 400 human action classes, ranging from daily activities and sports scenes to complex actions with interactions. Each clip in Kinetics lasts around 10 seconds, and only the raw video clips are provided, without skeleton data. Since this work focuses on our skeleton-based action recognition model, we use the fusion pose estimation model, which outputs 2D coordinates (X, Y) in the pixel coordinate system together with confidence scores C, on the videos resized to 340 × 256 to estimate the locations of 18 joints on every frame of each clip. For multi-person cases, we select the body with the highest average joint confidence in each clip. We thus represent each joint with a tuple (X, Y, C), and a skeleton frame is recorded as an array of 18 tuples; in this way, one clip with T frames is transformed into a skeleton sequence of these tuples. In practice, we represent each clip as a tensor of dimensions (3, T, 18). For simplicity, we pad every sequence by repeating the data from the start until T = 300. Following the evaluation method of ST-GCN, we train the models on the training set and report the top-1 and top-5 accuracies on the validation set.
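A small sketch of this packing step, assuming NumPy and per-frame arrays of 18 (X, Y, C) tuples; the function name is ours.

```python
import numpy as np


def pack_clip(frames, T=300):
    # frames: list of (18, 3) arrays holding (X, Y, C) for each joint.
    data = np.stack(frames)                     # (t, 18, 3)
    reps = int(np.ceil(T / len(data)))
    data = np.concatenate([data] * reps)[:T]    # repeat from start up to T
    return data.transpose(2, 0, 1)              # final tensor (3, T, 18)
```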


Model Setting

We construct the backbone of the structural task S with 9 AST-GCN blocks, where the feature dimensions are 64, 128 and 256 for each group of three blocks, and the temporal kernel size is 9. We randomly discard features with a probability of 0.5 after each AST-GCN unit to avoid overfitting, and the stride of the temporal convolution layers of the 4th and 7th blocks is set to 2 to act as pooling layers. Thereafter, global pooling is performed on the resulting tensors to obtain a 256-dimensional feature vector for each sequence, which is finally provided to the softmax classifier. The backbone of the action task A consists of 8 AAT-GCN blocks, whose convolution filter is performed only on the 1-neighborhood of each joint; it is trained by stochastic gradient descent with a learning rate of 0.01, reduced by a factor of 0.1 every 10 epochs. In the SF-ASGC, we set the maximum neighbor range r to 3, which means the convolution filters operate on the vertices and their neighbors at distances 1 through 3; note that when r = 1, the corresponding structural links are exactly the physical skeleton itself. We use PyTorch 0.4.1 and train the model for 100 epochs on 2 GTX TITAN GPUs with a batch size of 32. The whole action recognition network is trained with the SGD algorithm, with a learning rate initially of 0.1, decaying by a factor of 0.1 every 20 epochs.
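The reported optimization schedule for the whole network corresponds to a few lines of PyTorch; a sketch with a stand-in model, since only the schedule is the point here.

```python
import torch
import torch.nn as nn

# Stand-in for the full 2S-ATGCN network; any nn.Module would do here.
model = nn.Linear(256, 60)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Decay the learning rate by a factor of 0.1 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):
    # ... one epoch of cross-entropy training with batch size 32 ...
    scheduler.step()
```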


Comparisons to the State of the Art

We compare the final recognition model on skeleton-based action recognition tasks with the state-of-the-art methods on both the NTU RGB+D and Kinetics-Skeleton datasets; Table 3 and Table 4 show the performance comparisons for the NTU and Kinetics datasets, respectively. On NTU RGB+D, we train 2S-ATGCN on the two recommended benchmarks, Cross-Subject and Cross-View, and report the top-1 classification accuracies in the test phase. We compare against a handcrafted-feature-based method [28], RNN/CNN-based deep learning models [9, 18, 29, 30] and recent GCN-based methods [12, 13, 20, 31]. Specifically, ST-GCN combines GCN with temporal CNN to capture spatial-temporal features; AS-GCN extends the skeleton graphs by introducing actional links and non-local structural links to capture useful long-range dependencies; and SR-TSL [31] uses gated recurrent units (GRU) to propagate messages on graphs and LSTM to learn the temporal features. As Table 3 shows, our model achieves state-of-the-art performance with a large margin on both benchmarks, which verifies its superiority. On the Kinetics dataset, we compare our model with six state-of-the-art methods: Feature Encoding [32], a handcrafted method; Deep LSTM and the temporal convolution network [9, 18], two deep learning models; and ST-GCN, AS-GCN and 2s-AGCN. Table 4 reports the top-1 and top-5 classification performance; the proposed 2S-ATGCN outperforms the other competitive methods.

Table 3. Comparison of action recognition performance on NTU RGB+D


Table 4. Comparison of action recognition performance on Kinetics


CHAPTER V

CONCLUSIONS

We first proposed the fusion pose estimation model, which outputs skeleton data from multispectral images or videos. We then developed the 2S-ATGCN model for skeleton-based action recognition. The action task A captures actional dependencies and contributes actional links to the structural graph of the structural task S. We also expanded the convolution filter sizes to represent higher-order relationships. The generalized action-structural graphs are fed to the AST-GCN blocks for a better representation of actions, and an additional attention algorithm assigns different degrees of importance to both spatial links and time links, so that the network can focus on the most relevant body parts and frames. We validated the fusion pose estimation model on our RGB & thermal datasets, where it achieves a large improvement in dark environments. Moreover, on two challenging large-scale datasets, the proposed skeleton-based action recognition model outperforms the previous state-of-the-art skeleton-based models. The combination of the attention-based model and the multi-filters selective network further improves the performance in action recognition. As the branch network gradually combines the new actional links into the main skeleton graph, the structural network obtains a better representation of actions; at the same time, through back-propagation and updating, the generation of action edges in task A becomes more and more complete and accurate. By introducing attention mechanisms into the temporal convolution, the network can automatically exploit the different levels of importance of different frames by allocating different attention weights to the time links. The flexibility of the 2S-ATGCN model also opens up many possible directions for future work; for example, how to incorporate contextual information, such as scenes, objects and interactions, into 2S-ATGCN becomes a natural question.


REFERENCES

1. Oh, H.W., System and method for real sense acquisition. 2012, Google Patents.
2. Kay, W., et al., The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
3. Cao, Z., et al., Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
4. Vemulapalli, R., F. Arrate, and R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
5. Wang, J., et al., Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012. IEEE.
6. Koniusz, P., A. Cherian, and F. Porikli, Tensor representations via kernel linearization for action recognition from 3D skeletons. In European Conference on Computer Vision, 2016. Springer.
7. Zhu, W., et al., Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
8. Song, S., et al., An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
9. Kim, T.S. and A. Reiter, Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017. IEEE.
10. Liu, M., H. Liu, and C. Chen, Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 2017. 68: p. 346-362.
11. Li, C., et al., Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017. IEEE.
12. Yan, S., Y. Xiong, and D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.


13. Li, M., et al., Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
14. Shi, L., et al., Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
15. Toshev, A. and C. Szegedy, DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
16. Shuman, D.I., et al., The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053, 2012.
17. Niepert, M., M. Ahmed, and K. Kutzkov, Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2016.
18. Shahroudy, A., et al., NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
19. Liu, L., et al., Learning discriminative key poses for action recognition. IEEE Transactions on Cybernetics, 2013. 43(6): p. 1860-1870.
20. Tang, Y., et al., Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
21. Veličković, P., et al., Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
22. Chorowski, J.K., et al., Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, 2015.
23. Gehring, J., et al., A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.
24. Tran, D.V., N. Navarin, and A. Sperduti, On filter size in graph convolutional networks. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 2018. IEEE.
25. Li, X., et al., Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.


26. Nair, V. and G.E. Hinton, Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
27. Ioffe, S. and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
28. Veeriah, V., N. Zhuang, and G.-J. Qi, Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
29. Liu, J., et al., Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, 2016. Springer.
30. Ke, Q., et al., A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
31. Si, C., et al., Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
32. Fernando, B., et al., Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.