Haoran Wei, Roozbeh Jafari, Nasser Kehtarnavaz.
Abstract
This paper presents the simultaneous use of video images and inertial signals, captured at the same time by a video camera and a wearable inertial sensor, within a fusion framework in order to achieve more robust human action recognition than is possible when each sensing modality is used individually. The data captured by these sensors are converted into 3D video images and 2D inertial images, which are then fed as inputs into a 3D convolutional neural network and a 2D convolutional neural network, respectively, for action recognition. Two types of fusion are considered: decision-level fusion and feature-level fusion. Experiments are conducted on the publicly available UTD-MHAD dataset, in which simultaneous video images and inertial signals are captured for a total of 27 actions. The results indicate that both the decision-level and feature-level fusion approaches yield higher recognition accuracies than either sensing modality used individually. The highest accuracy, 95.6%, is obtained with the decision-level fusion approach.
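As a rough, hypothetical illustration of the input preparation described in the abstract (not the authors' exact preprocessing), the Python sketch below stacks sampled video frames into a 320 × 240 × 32 volume and row-stacks resampled inertial channels into an 8 × 50 image. The two target sizes come from the network input layers reported in the architecture tables below; the frame sampling scheme, the six-channel assumption, and the channel tiling used to reach eight rows are all assumptions.

```python
import numpy as np

def make_video_volume(frames, depth=32):
    """Uniformly sample `depth` grayscale frames and stack them into a volume.

    frames: (num_frames, 240, 320) array, assumed already resized to the
    3D CNN's spatial input size. Returns a (320, 240, 32) volume, matching
    the input layer reported in the architecture table.
    """
    idx = np.linspace(0, len(frames) - 1, depth).round().astype(int)
    return frames[idx].transpose(2, 1, 0)

def make_inertial_image(signals, length=50, rows=8):
    """Resample multichannel inertial signals and stack them into a 2D image.

    signals: (num_channels, num_samples), e.g. 3-axis accelerometer plus
    3-axis gyroscope (6 channels). Resampling to 50 columns matches the
    2D CNN input width; tiling channels to fill 8 rows is an assumption.
    """
    ch, n = signals.shape
    t = np.linspace(0, n - 1, length)
    resampled = np.stack([np.interp(t, np.arange(n), s) for s in signals])
    reps = -(-rows // ch)  # ceiling division
    return np.tile(resampled, (reps, 1))[:rows]

# Example with synthetic data standing in for one action sample
video = np.random.rand(90, 240, 320)   # ~3 s of 30 fps grayscale frames
imu = np.random.rand(6, 180)           # 6 inertial channels
print(make_video_volume(video).shape)  # (320, 240, 32)
print(make_inertial_image(imu).shape)  # (8, 50)
```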
Keywords: decision-level and feature-level fusion for action recognition; deep learning-based action recognition; fusion of video and inertial sensing for action recognition
Year: 2019 PMID: 31450609 PMCID: PMC6749419 DOI: 10.3390/s19173680
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Actions in the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD).
| Action Number | Hand Actions | Action Number | Leg Actions |
|---|---|---|---|
| 1 | right arm swipe to the left | 22 | jogging in place |
| 2 | right arm swipe to the right | 23 | walking in place |
| 3 | right hand wave | 24 | sit to stand |
| 4 | two hand front clap | 25 | stand to sit |
| 5 | right arm throw | 26 | forward lunge (left foot forward) |
| 6 | cross arms in the chest | 27 | squat (two arms stretch out) |
| 7 | basketball shoot | ||
| 8 | right hand draw x | ||
| 9 | right hand draw circle (clockwise) | ||
| 10 | right hand draw circle (counter clockwise) | ||
| 11 | draw triangle | ||
| 12 | bowling (right hand) | ||
| 13 | front boxing | ||
| 14 | baseball swing from right | ||
| 15 | tennis right hand forehand swing | ||
| 16 | arm curl (two arms) | ||
| 17 | tennis serve | ||
| 18 | two hand push | ||
| 19 | right hand knock on door | ||
| 20 | right hand catch an object | ||
| 21 | right hand pick up and throw |
Figure 1. Example video volume used as input to the 3D convolutional neural network.
Figure 2. Example inertial signal image used as input to the 2D convolutional neural network.
Figure 3. Network architecture used for the decision-level fusion of video and inertial sensing modalities.
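This record does not spell out the combination rule used in the decision-level fusion. Below is a minimal sketch assuming the per-class softmax scores of the two networks are merged by a weighted average before taking the argmax; the averaging rule and the equal weighting are assumptions, not the confirmed method.

```python
import numpy as np

def decision_level_fusion(video_probs, inertial_probs, w_video=0.5):
    """Fuse per-class softmax scores from the two networks.

    video_probs, inertial_probs: arrays of shape (27,), each summing to 1.
    A weighted average is one common score-fusion rule; the weight here
    is an assumption.
    """
    fused = w_video * video_probs + (1.0 - w_video) * inertial_probs
    return int(np.argmax(fused))

# Example: two synthetic score vectors over the 27 UTD-MHAD actions
rng = np.random.default_rng(0)
p_video = rng.dirichlet(np.ones(27))
p_inertial = rng.dirichlet(np.ones(27))
print(decision_level_fusion(p_video, p_inertial))
```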
Architecture and training parameters of the 3D convolutional neural network.
| Layers and Training Parameters | Values |
|---|---|
| Input layer | 320 × 240 × 32 |
| 1st 3D convolutional layer | 16 filters, filter size 3 × 3 × 3, stride 1 × 1 × 1 |
| 2nd 3D convolutional layer | 32 filters, filter size 3 × 3 × 3, stride 1 × 1 × 1 |
| 3rd 3D convolutional layer | 64 filters, filter size 3 × 3 × 3, stride 1 × 1 × 1 |
| 4th 3D convolutional layer | 128 filters, filter size 3 × 3 × 3, stride 1 × 1 × 1 |
| All 3D max pooling layers | pooling size 2 × 2 × 2, stride 2 × 2 × 2 |
| 1st fully connected layer | 256 units |
| 2nd fully connected layer | 27 units |
| Dropout layer | 50% |
| Initial learning rate | 0.0016 |
| Learning rate drop factor | 0.5 |
| Learning rate drop period | 4 epochs |
| Max epochs | 20 |
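A minimal PyTorch sketch of a 3D CNN matching the layer sizes in the table above. The ReLU activations, 'same' padding, and dropout placement are assumptions, since the table does not specify them.

```python
import torch
import torch.nn as nn

class Video3DCNN(nn.Module):
    """3D CNN following the layer sizes in the table above.

    Input: a 1-channel video volume of 32 frames at 240 x 320, i.e.
    tensors of shape (batch, 1, 32, 240, 320). ReLU activations, 'same'
    padding, and the dropout placement are assumptions.
    """
    def __init__(self, num_classes=27):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for c_in, c_out in zip(chans, chans[1:]):
            blocks += [
                nn.Conv3d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2, stride=2),
            ]
        self.features = nn.Sequential(*blocks)
        # After four 2x2x2 poolings: (128, 2, 15, 20) -> 76800 features
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 2 * 15 * 20, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = Video3DCNN()
print(model(torch.randn(1, 1, 32, 240, 320)).shape)  # torch.Size([1, 27])
```

The learning-rate settings in the table map naturally onto PyTorch's `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.5)` with an initial rate of 0.0016, halving the rate every 4 epochs over 20 epochs of training.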
Architecture and training parameters of the 2D convolutional neural network.
| Layers and Training Parameters | Values |
|---|---|
| Input layer | 8 × 50 |
| 1st 2D convolutional layer | 16 filters, filter size 3 × 3, stride 1 × 1 |
| 2nd 2D convolutional layer | 32 filters, filter size 3 × 3, stride 1 × 1 |
| 3rd 2D convolutional layer | 64 filters, filter size 3 × 3, stride 1 × 1 |
| All 2D max pooling layers | pooling size 2 × 2, stride 2 × 2 |
| 1st fully connected layer | 256 units |
| 2nd fully connected layer | 27 units |
| Dropout layer | 50% |
| Initial learning rate | 0.0016 |
| Learning rate drop factor | 0.5 |
| Learning rate drop period | 4 epochs |
| Max epochs | 20 |
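An analogous PyTorch sketch of the 2D CNN per the table above, under the same assumptions (ReLU activations, 'same' padding, dropout placement).

```python
import torch
import torch.nn as nn

class Inertial2DCNN(nn.Module):
    """2D CNN following the layer sizes in the table above.

    Input: a 1-channel 8 x 50 inertial image, i.e. tensors of shape
    (batch, 1, 8, 50). ReLU activations, 'same' padding, and the
    dropout placement are assumptions.
    """
    def __init__(self, num_classes=27):
        super().__init__()
        chans = [1, 16, 32, 64]
        blocks = []
        for c_in, c_out in zip(chans, chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ]
        self.features = nn.Sequential(*blocks)
        # After three 2x2 poolings: (64, 1, 6) -> 384 features
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 6, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = Inertial2DCNN()
print(model(torch.randn(2, 1, 8, 50)).shape)  # torch.Size([2, 27])
```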
Figure 4. Network architecture used for the feature-level fusion of video and inertial sensing modalities.
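A minimal sketch of feature-level fusion consistent with this figure, assuming the two 256-unit fully connected outputs (the feature size both tables above report) are concatenated and passed to a joint 27-way classifier; the exact fusion point and classifier head are assumptions, not the confirmed implementation.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate per-modality feature vectors and classify jointly.

    Assumes each branch yields a 256-dimensional feature vector taken
    from its first fully connected layer.
    """
    def __init__(self, feat_dim=256, num_classes=27):
        super().__init__()
        self.head = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(2 * feat_dim, num_classes),
        )

    def forward(self, video_feat, inertial_feat):
        return self.head(torch.cat([video_feat, inertial_feat], dim=1))

fusion = FeatureLevelFusion()
print(fusion(torch.randn(2, 256), torch.randn(2, 256)).shape)  # [2, 27]
```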
Average recognition accuracy for the video-only sensing modality, the inertial-only sensing modality, and the feature-level and decision-level fusion of the two modalities.
| Approaches | Average Accuracy (%) |
|---|---|
| Video only | 76.0 |
| Inertial only | 90.3 |
| Feature-level fusion of video and inertial | 94.1 |
| Decision-level fusion of video and inertial | 95.6 |
Figure 5. Recognition accuracy of the four sensing approaches across the eight subjects in the UTD-MHAD dataset.
Figure 6. Confusion matrix for the video-only sensing modality.
Figure 7. Confusion matrix for the inertial-only sensing modality.
Figure 8. Confusion matrix for the feature-level fusion of video and inertial sensing modalities.
Figure 9. Confusion matrix for the decision-level fusion of video and inertial sensing modalities.