Kamil Sidor, Marian Wysocki.
Abstract
In this paper we propose a way of using depth maps transformed into 3D point clouds to classify human activities. The activities are described as time sequences of feature vectors based on the Viewpoint Feature Histogram (VFH) descriptor computed using the Point Cloud Library. Recognition is performed by two types of classifiers: (i) a k-nearest neighbors (k-NN) classifier with the Dynamic Time Warping (DTW) measure, and (ii) bidirectional long short-term memory (BiLSTM) deep learning networks. We discuss reducing the k-NN classification time by introducing a two-tier model, and improving BiLSTM-based classification via transfer learning and by combining multiple networks with a fuzzy integral. Our classification results on two representative datasets, the University of Texas at Dallas Multimodal Human Action Dataset and the Microsoft Research (MSR) Action 3D Dataset, are comparable to or better than the current state of the art.
Keywords: BiLSTM; VFH descriptor; activity recognition; dynamic time warping; multiple network fusion; point clouds; transfer learning
Year: 2020 PMID: 32455931 PMCID: PMC7285378 DOI: 10.3390/s20102940
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Representative works using depth data for activity recognition.
| Work | Method | Classifier | Dataset | Efficiency [%] |
|---|---|---|---|---|
| Wanging et al. [ | | bi-gram with | MSR Action 3D * | AS1: 72.9 |
| Vieira et al. [ | | SVM | MSR Action 3D * | AS1: 84.7 |
| Wang et al. [ | | SVM | MSR Action 3D ** | 86.2 |
| Yang et al. [ | | SVM | MSR Action 3D * | AS1: 96.2 |
| Chen et al. [ | | Collaborative | MSR Action 3D * | AS1: 96.2 |
| Oreifej et al. [ | | SVM | MSR Action 3D ** | 88.89 |
| | | | MSR Hand Gesture | 92.45 |
| | | | 3D Action Pairs | 96.67 |
| Kim et al. [ | | SVM | MSR Action 3D * | 90.45 |
| Wang et al. [ | | CNN | MSRC-12 Kinect Gesture * | 93.12 |
| | | | G3D Dataset * | 94.24 |
| | | | UTD-MHAD * | 85.81 |
| Kamel et al. [ | | CNN | MSR Action 3D * | 94.51 |
| Hou et al. [ | | CNN | MSR Action 3D * | 94.51 |
| Wang et al. [ | | SVM | MSR Action 3D * | 88.2 |
| Yang, Tian [ | | Naïve-Bayes- | MSR Action 3D * | AS1: 74.5 |
| Luo et al. [ | | SVM | MSR Action 3D ** | 96.7 |
| | | | MSR DailyActivity | AS1: 97.2 |
CNN: convolutional neural network; MAD: multimodal action dataset; MHAD: multimodal human action dataset; MSR: Microsoft Research; SVM: support vector machine; UTD: University of Texas at Dallas. The MSR datasets (Action 3D and Activity 3D) are divided into three subsets: AS1, AS2, and AS3. MSR Action 3D and UTD-MHAD are described in Section 5. * Leave-one-subject-out protocol. ** The first five subjects are used for training and the remaining five subjects are used for testing.
Figure 1. Values of the surface shape component of the Viewpoint Feature Histogram (VFH).
Figure 2. VFH histograms generated for point clouds representing two body postures.
Figure 3. Visualization of the operation of the DTW algorithm with a transformation window width b.
Figure 4. Long short-term memory (LSTM) network architecture.
Figure 5. Bidirectional long short-term memory (BiLSTM) flow of data at time step t.
Figure 6. Activities in the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD).
Figure 7. The considered decompositions of the person bounding box: (a) vertical division into two cells, (b) horizontal division into four cells, (c) cross-division into four cells, (d) division into six cells.
Figure 8. Sample standardized runs of mean and standard deviation for the ϕ feature of the VFH descriptor for two different activities.
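The k-NN classifier named in the abstract compares whole activity sequences with DTW, and the Figure 3 caption refers to a warping window of width b. A minimal sketch of that combination follows, in plain Python. It assumes a Euclidean local cost between feature vectors, a symmetric band |i - j| <= b, and majority voting with ties going to the nearest neighbor. None of these details is confirmed by this section, and the names `dtw_distance` and `knn_classify` are illustrative only.

```python
import math
from collections import Counter


def dtw_distance(a, b, window):
    """DTW distance between two sequences of equal-dimension feature vectors.

    `window` plays the role of the band width b: cell (i, j) of the cost
    matrix is only visited when |i - j| <= window (an assumed band shape).
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or the end cell is unreachable.
    w = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = math.dist(a[i - 1], b[j - 1])  # assumed Euclidean local cost
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def knn_classify(query, train, k, window):
    """Label `query` by majority vote among its k DTW-nearest training sequences.

    train: list of (sequence, label) pairs. Ties go to the nearest neighbor.
    """
    ranked = sorted(train, key=lambda item: dtw_distance(query, item[0], window))
    nearest = [label for _, label in ranked[:k]]
    votes = Counter(nearest)
    top = max(votes.values())
    return next(label for label in nearest if votes[label] == top)


# Toy usage with hypothetical one-dimensional features and labels.
train = [
    ([(0.0,), (1.0,), (2.0,)], "wave"),
    ([(0.0,), (0.0,), (1.0,), (2.0,)], "wave"),
    ([(5.0,), (5.0,)], "clap"),
]
predicted = knn_classify([(0.0,), (1.0,), (2.1,)], train, k=3, window=1)
```

In the paper's setting each sequence element would be a VFH-based feature vector rather than a single number; the toy data above only exercises the control flow.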
Comparison of the recognition rates for the UTD-MHAD set, k-NN with DTW, and LOSO eight-fold cross-validation; the best values are marked in bold.
| Number of the Nearest Neighbors | Recognition Rate [%] | | | | |
|---|---|---|---|---|---|
| | Without Division | Vertical Division into 2 Cells | Cross Division into 4 Cells | Horizontal Division into 4 Cells | Division into 6 Cells |
| k = 1 | 74.78 | 80.24 | 82.34 | 85.48 | 86.37 |
| k = 2 | 72.80 | 81.52 | 81.99 | 83.37 | 84.20 |
| k = 3 | 75.59 | 81.40 | 82.10 | 85.59 | 86.63 |
| k = 4 | 76.99 | 80.82 | 84.19 | 86.75 | 87.56 |
| k = 5 | 78.96 | 80.83 | 83.97 | 87.10 | 88.15 |
| k = 6 | 79.43 | 83.26 | 83.96 | 87.42 | 88.03 |
| k = 7 | 79.08 | 82.91 | 83.27 | 87.55 | 86.98 |
| k = 8 | 78.50 | 83.02 | 83.15 | | 86.64 |
| k = 9 | | 83.37 | | 87.67 | 87.57 |
| k = 10 | 79.43 | | 84.54 | 87.90 | |
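The captions in this part of the paper refer to LOSO (leave-one-subject-out) cross-validation, with eight folds for UTD-MHAD and ten for MSR Action 3D, that is, one fold per subject. The sketch below shows one way such a split and the resulting recognition rate could be computed. It assumes each sample carries a subject identifier and that the reported rate is the mean of the per-fold accuracies; pooling all test samples is another common convention, and this section does not say which one the authors use. The function names are illustrative.

```python
def loso_folds(samples):
    """Yield (held_out_subject, train, test), one fold per subject.

    samples: list of (subject_id, sequence, label) tuples.
    """
    subjects = sorted({subject for subject, _, _ in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def loso_recognition_rate(samples, classify):
    """Mean per-fold accuracy in percent.

    classify(train, sequence) -> predicted label; any classifier can be
    plugged in here, for example k-NN with DTW.
    """
    rates = []
    for _, train, test in loso_folds(samples):
        hits = sum(classify(train, seq) == label for _, seq, label in test)
        rates.append(100.0 * hits / len(test))
    return sum(rates) / len(rates)
```

Because the held-out subject never appears in the training part of a fold, the rate measures generalization to unseen performers rather than to unseen repetitions by known performers.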
Comparison of the recognition rates for the MSR-Action 3D set, k-NN with DTW, and LOSO ten-fold cross-validation.
| Number of the Nearest Neighbors | Recognition Rate [%] | | | | |
|---|---|---|---|---|---|
| | Without Division | Vertical Division into 2 Cells | Cross Division into 4 Cells | Horizontal Division into 4 Cells | Division into 6 Cells |
| k = 1 | 59.56 | 69.66 | 69.75 | 69.20 | 77.09 |
| k = 2 | 52.55 | 67.89 | 68.68 | 64.66 | 75.42 |
| k = 3 | 58.07 | 73.89 | 73.63 | 69.91 | 79.78 |
| k = 4 | 60.41 | 71.56 | 73.15 | 70.28 | |
| k = 5 | 59.93 | 72.30 | 73.90 | 71.16 | 81.19 |
| k = 6 | 61.21 | 74.96 | 73.15 | 69.21 | 80.51 |
| k = 7 | 63.53 | 74.73 | 73.60 | 71.20 | 81.05 |
| k = 8 | 63.54 | 75.49 | 73.61 | 70.40 | 80.08 |
| k = 9 | 62.87 | 75.68 | | 70.85 | 80.29 |
| k = 10 | | | 74.62 | | 79.43 |
Comparison of the recognition rates for the UTD-MHAD set and LOSO eight-fold cross-validation.
| Method | Recognition Rate [%] |
|---|---|
| Wang et al. [ | 85.81 |
| Hou et al. [ | 86.97 |
| Kamel et al. [ | 88.14 |
| Our work | |
Comparison of the recognition rates for the UTD-MHAD set, with realizations 1 and 2 in the training set and realizations 3 and 4 in the test set.
| Method | Recognition Rate [%] |
|---|---|
| Chen et al. [ | 85.10 |
| Mandany et al. [ | 93.26 |
| Our work | |
Comparison of the recognition rates for the MSR-Action 3D dataset.
| Data Set | Chen et al. [ | | | Proposed Method | | |
|---|---|---|---|---|---|---|
| | Test A | Test B | Test C | Test A | Test B | Test C |
| AS1 | 97.3 | 98.6 | 96.2 | 100 | 95.3 | 87.8 |
| AS2 | 96.1 | 98.7 | 83.2 | 94.9 | 93.5 | 86.2 |
| AS3 | 98.7 | 100 | 92 | 100 | 94.7 | 90.5 |
| Average | 97.4 | | | | 94.5 | 88.1 |
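The Average row of the table above appears to be the plain mean of the AS1, AS2, and AS3 rows. The short check below recomputes those means from the per-subset values as printed. The column keys are made up for this example, and the recomputed figures may differ from the published ones in the last digit, since the authors may have averaged unrounded rates.

```python
# Per-subset recognition rates [%] copied from the table above.
# Column order: Chen et al. Test A/B/C, then the proposed method Test A/B/C.
columns = ["chen_A", "chen_B", "chen_C", "proposed_A", "proposed_B", "proposed_C"]
rows = {
    "AS1": [97.3, 98.6, 96.2, 100.0, 95.3, 87.8],
    "AS2": [96.1, 98.7, 83.2, 94.9, 93.5, 86.2],
    "AS3": [98.7, 100.0, 92.0, 100.0, 94.7, 90.5],
}


def column_means(rows, columns):
    """Unweighted mean of every column over the subsets."""
    n = len(rows)
    return {
        name: sum(values[i] for values in rows.values()) / n
        for i, name in enumerate(columns)
    }


means = column_means(rows, columns)
for name in columns:
    print(f"{name}: {means[name]:.2f}")
```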
Classification time.
| Division of the Bounding Box | Average Classification Time [ms] | |
|---|---|---|
| | UTD-MHAD | MSR-Action 3D |
| Without division | 81.2 | 38.8 |
| Vertical division into 2 cells | 86.0 | 41.3 |
| Cross division into 4 cells | 95.5 | 52.5 |
| Horizontal division into 4 cells | 97.1 | 53.2 |
| Division into 6 cells | 112.9 | 57.4 |
Recognition rates for the UTD-MHAD dataset and two variants of feature reduction (eight-fold cross-validation LOSO).
| Number of the Nearest Neighbors | Recognition Rate [%] | | |
|---|---|---|---|
| | All Features | V1: Features | V2: Features |
| k = 1 | 86.37 | 86.28 | 85.59 |
| k = 2 | 84.20 | 84.20 | 85.13 |
| k = 3 | 86.63 | 87.80 | 85.71 |
| k = 4 | 87.56 | 87.45 | 86.52 |
| k = 5 | 88.15 | 87.92 | 87.23 |
| k = 6 | 88.03 | 87.68 | 87.34 |
| k = 7 | 86.98 | 87.46 | 87.45 |
| k = 8 | 86.64 | | 88.39 |
| k = 9 | 87.57 | 87.92 | |
| k = 10 | | 87.92 | 87.92 |
Recognition rates for the MSR Action 3D dataset and two variants of feature reduction (ten-fold cross-validation LOSO).
| Number of the Nearest Neighbors | Recognition Rate [%] | | |
|---|---|---|---|
| | All Features | V1: Features | V2: Features |
| k = 1 | 77.09 | 80.21 | 79.03 |
| k = 2 | 75.42 | 75.40 | 75.66 |
| k = 3 | 79.78 | 78.84 | 80.97 |
| k = 4 | | 79.96 | 80.56 |
| k = 5 | 81.19 | 79.79 | 81.45 |
| k = 6 | 80.51 | 79.65 | |
| k = 7 | 81.05 | 79.87 | 82.10 |
| k = 8 | 80.08 | 80.13 | 80.45 |
| k = 9 | 80.29 | | 82.09 |
| k = 10 | 79.43 | 72.46 | 82.62 |
Recognition rates for the UTD-MHAD and k-NN using representatives (eight-fold cross-validation LOSO).
| Number of | Recognition Rate [%] | | | | | |
|---|---|---|---|---|---|---|
| | k1 = 5 | | | k1 = 10 | | |
| | All Features | V1 | V2 | All Features | V1 | V2 |
| 1 | 85.94 | 86.28 | 86.29 | 85.24 | 86.40 | 85.59 |
| 2 | 84.90 | 85.36 | 85.71 | 84.43 | 85.24 | 85.12 |
| 3 | 86.63 | 87.80 | 85.71 | 86.51 | 87.21 | 85.24 |
| 4 | 86.87 | 87.46 | 87.22 | 87.10 | 87.33 | 85.71 |
| 5 | | | 87.46 | 87.45 | | 86.18 |
| 6 | 87.45 | 87.46 | 87.57 | | 87.57 | 86.64 |
| 7 | 86.64 | 88.03 | 86.76 | 86.52 | 86.53 | 86.06 |
| 8 | 85.59 | 87.33 | 87.11 | 86.87 | 87.57 | 86.76 |
| 9 | 86.98 | 86.76 | | 86.99 | 86.87 | |
| 10 | 86.87 | 87.45 | 86.99 | 86.98 | 86.41 | 86.87 |
Figure 9. Recognition rates for the UTD-MHAD dataset: original k-NN (left) and k-NN using representatives (right).
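The abstract attributes the shorter k-NN classification time to a two-tier model, and the tables above report a "k-NN using representatives" with a parameter k1. This section does not spell out the procedure, so the sketch below is only one plausible reading: the first tier compares the query with a single representative per class and keeps the k1 closest classes, and the second tier runs ordinary k-NN with DTW on the training sequences of those classes alone. Choosing the class medoid as the representative, using unconstrained DTW, and all function names are assumptions made for this illustration.

```python
import math
from collections import Counter


def dtw(a, b):
    """Unconstrained DTW distance between two sequences of feature vectors."""
    prev = [0.0] + [math.inf] * len(b)
    for x in a:
        cur = [math.inf]
        for j, y in enumerate(b, start=1):
            cur.append(math.dist(x, y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def medoids(train):
    """Assumed representative per class: the sequence with the smallest
    summed DTW distance to the other sequences of the same class."""
    by_label = {}
    for seq, label in train:
        by_label.setdefault(label, []).append(seq)
    return {
        label: min(seqs, key=lambda s: sum(dtw(s, t) for t in seqs))
        for label, seqs in by_label.items()
    }


def two_tier_classify(query, train, reps, k1, k):
    # Tier 1: keep the k1 classes whose representative is closest to the query.
    shortlist = set(sorted(reps, key=lambda label: dtw(query, reps[label]))[:k1])
    # Tier 2: ordinary k-NN with DTW, restricted to the shortlisted classes.
    pool = sorted(
        ((dtw(query, seq), label) for seq, label in train if label in shortlist),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in pool[:k])
    return votes.most_common(1)[0][0]
```

Under this reading the saving comes from the second tier: DTW is evaluated against one representative per class plus the members of k1 classes, instead of against every training sequence. The representatives would be computed once, offline.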
Recognition rates for the MSR Action 3D dataset using representatives (ten-fold cross-validation LOSO).
| Number of the | Recognition Rate [%], k1 = 5 | | |
|---|---|---|---|
| | All Features | V1 | V2 |
| 1 | 83.12 | 83.07 | 83.05 |
| 2 | 83.86 | 80.86 | 81.18 |
| 3 | | 83.94 | 83.19 |
| 4 | 84.64 | 84.32 | 82.05 |
| 5 | 84.82 | | 83.45 |
| 6 | 84.55 | 84.50 | 83.45 |
| 7 | 84.55 | 84.28 | |
| 8 | 83.74 | 84.66 | 83.78 |
| 9 | 84.32 | 85.00 | 83.18 |
| 10 | 83.95 | 84.87 | 83.45 |
Figure 10. Recognition rates for the MSR Action 3D set: original k-NN (left) and k-NN using representatives (right).
Recognition rates [%] for the MSR Action 3D set and k-NN using representatives - best results for AS1, AS2, AS3 subsets (ten-fold cross-validation LOSO).
| AS1 | | | AS2 | | | AS3 | | |
|---|---|---|---|---|---|---|---|---|
| All | V1 | V2 | All | V1 | V2 | All | V1 | V2 |
| 84.4 | 87.0 | | | 82.5 | 80.9 | | 89.6 | 86.0 |
Comparison of average classification times for tested sets and k-NN using representatives.
| Variants | Average Classification Time | | |
|---|---|---|---|
| | UTD-MHAD | | MSR Action 3D |
| | k1 = 5 | k1 = 10 | k1 = 5 |
| All features | 46.2 | 59.1 | 22.5 |
| V1 | 40.9 | 53.3 | 16.9 |
| V2 | 37.9 | 49 | 11.3 |
Recognition rates [%] obtained using weight transfer in the bidirectional long short-term memory (BiLSTM) network for the UTD-MHAD set (eight-fold cross-validation LOSO).
| | Training 1 (Random Starting Weights) | Training 2 | Training 3 |
|---|---|---|---|
| All the features | 80.70 | 82.68 | 82.45 |
| Variant V1 | 80.24 | 83.14 | 84.48 |
| Variant V2 | 81.98 | 83.27 | 82.33 |
Recognition rates [%] obtained using weight transfer in the BiLSTM network for the MSR Action 3D set (ten-fold cross-validation LOSO): all features / features in variant V1 / features in variant V2.
| Subset | First Training | Second Training | Third Training |
|---|---|---|---|
| AS1 | 83.60/83.58/88.22 | 85.19/87.31/86.59 | 85.72/86.89/87.73 |
| AS2 | 83.43/83.24/85.63 | 81.16/82.43/82.41 | 84.11/82.01/84.55 |
| AS3 | 87.64/87.24/86.27 | 86.90/87.51/84.62 | 88.03/89.34/87.14 |
Comparison of the best recognition rates [%] obtained by various methods (LOSO cross-validation).
| Dataset | k-NN | | | BiLSTM | | | BiLSTM + Fuzzy Integral | | |
|---|---|---|---|---|---|---|---|---|---|
| | All Features | V1 | V2 | All Features | V1 | V2 | All Features | V1 | V2 |
| AS1 | 87.80 | 89.89 | | 85.72 | 87.31 | 88.22 | 85.26 | 87.77 | 90.34 |
| AS2 | 86.23 | 85.85 | 84.40 | 84.11 | 83.24 | 85.63 | | 85.36 | 86.23 |
| AS3 | 90.5 | 92.28 | | 88.03 | 89.34 | 87.14 | 90.21 | 88.49 | 89.30 |
| UTD-MHAD | 88.58 | | 88.50 | 82.68 | 83.48 | 83.27 | 84.89 | 84.77 | 84.77 |
Figure 11. Comparison of the best recognition rates [%] obtained by various methods.
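The "BiLSTM + Fuzzy Integral" columns refer to combining the outputs of several trained networks with a fuzzy integral. This section does not say which integral or fuzzy measure is used, so the following is a sketch of one common choice: the Sugeno integral over a Sugeno λ-measure. The density values (one per network, expressing how much that network is trusted), the example scores, and all function names are assumptions made for illustration.

```python
def solve_lambda(densities):
    """Nonzero root lam > -1 of prod(1 + lam * g_i) = 1 + lam."""
    if len(densities) < 2 or not all(0.0 < g < 1.0 for g in densities):
        raise ValueError("need at least two densities, each strictly in (0, 1)")
    total = sum(densities)
    if abs(total - 1.0) < 1e-6:
        return 0.0  # densities already sum to 1: the measure is additive

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -1e-7  # f(lo) > 0 > f(hi)
    else:
        lo, hi = 1e-7, 1.0    # f(lo) < 0; widen until f(hi) > 0
        while f(hi) <= 0.0:
            hi *= 2.0
    for _ in range(200):      # plain bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def sugeno_integral(scores, densities, lam):
    """Fuse the scores that several networks assign to one class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    measure = 0.0
    best = 0.0
    for i in order:
        # g(A with source i added) = g(A) + g_i + lam * g(A) * g_i
        measure = measure + densities[i] + lam * measure * densities[i]
        best = max(best, min(scores[i], measure))
    return best


def fuse(score_rows, densities):
    """score_rows[n][c] is the score of network n for class c.

    Returns (index of the winning class, fused score per class).
    """
    lam = solve_lambda(densities)
    n_classes = len(score_rows[0])
    fused = [
        sugeno_integral([row[c] for row in score_rows], densities, lam)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=fused.__getitem__), fused


# Hypothetical example: three networks, two classes.
winner, fused = fuse([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]], [0.8, 0.7, 0.6])
```

A practical difference from simple score averaging is that the fused score depends on which subsets of networks agree, weighted by the measure, rather than on each network independently.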