| Literature DB >> 30696100 |
Haibin Yu, Guoxiong Pan, Mian Pan, Chong Li, Wenyan Jia, Li Zhang, Mingui Sun.
Abstract
Recently, egocentric activity recognition has attracted considerable attention in the pattern recognition and artificial intelligence communities because of its wide applicability in medical care, smart homes, and security monitoring. In this study, we developed and implemented a deep-learning-based hierarchical fusion framework for the recognition of egocentric activities of daily living (ADLs) in a wearable hybrid sensor system comprising motion sensors and cameras. Long short-term memory (LSTM) and a convolutional neural network (CNN) are used in different layers to perform egocentric ADL recognition from the motion sensor data and the photo stream, respectively. The motion sensor data are used solely to classify activities by motion state, while the photo stream is used for further, more specific activity recognition within each motion-state group. Thus, both the motion sensor data and the photo stream work in their most suitable classification mode, significantly reducing the negative influence of sensor differences on the fusion results. Experimental results show that the proposed method is not only more accurate than the existing direct fusion method (by up to 6%) but also avoids the time-consuming computation of optical flow required by the existing method, which makes the proposed algorithm less complex and more suitable for practical application.
Keywords: deep learning; egocentric activity recognition; hierarchical fusion framework; wearable sensor system
Year: 2019 PMID: 30696100 PMCID: PMC6386921 DOI: 10.3390/s19030546
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
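The abstract describes a two-layer decision flow: an LSTM first assigns a motion state from the motion-sensor data, and a group-specific CNN then recognizes the concrete activity from the photo stream within that motion-state group. A minimal PyTorch sketch of this flow is given below; the layer sizes, window length, and activity lists are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the hierarchical fusion idea (assumed shapes and class lists).
import torch
import torch.nn as nn

MOTION_STATES = ["lying", "sedentary", "standing", "walking"]   # layer 1 (IMU)
GROUP_ACTIVITIES = {                                             # layer 2 (photo stream)
    "lying": ["napping", "using phone"],                         # hypothetical per-group lists
    "sedentary": ["eating", "reading", "watching TV"],
    "standing": ["talking", "washing utensils"],
    "walking": ["walking outside", "shopping"],
}

class IMULSTM(nn.Module):
    """LSTM that maps a window of motion-sensor samples to a motion state."""
    def __init__(self, n_channels=6, hidden=64, n_states=len(MOTION_STATES)):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_states)

    def forward(self, x):            # x: (batch, time, channels)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])        # motion-state logits

class GroupCNN(nn.Module):
    """Small CNN standing in for the fine-tuned VGG-16 used within each group."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(16, n_classes)

    def forward(self, img):          # img: (batch, 3, H, W)
        return self.fc(self.features(img).flatten(1))

imu_net = IMULSTM()
group_nets = {g: GroupCNN(len(a)) for g, a in GROUP_ACTIVITIES.items()}

def recognize(imu_window, photo):
    """Layer 1: motion state from the IMU; layer 2: specific activity from the photo."""
    state = MOTION_STATES[imu_net(imu_window).argmax(1).item()]
    activities = GROUP_ACTIVITIES[state]
    activity = activities[group_nets[state](photo).argmax(1).item()]
    return state, activity

# Example with random inputs: a 2 s window of 6-channel IMU data and one 224x224 photo.
print(recognize(torch.randn(1, 100, 6), torch.randn(1, 3, 224, 224)))
```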
Figure 1. Overall architecture of the proposed hierarchical deep fusion framework.
Figure 2. Hierarchical relationship of the mapping.
Figure 3. Hierarchical relationship of the correspondence defined by Equation (5): (a) the original correspondence; (b) the converted group correspondence.
Figure 4. Schematic diagram of the correspondence between sensor data and images (or image sequences): (a) when the sampling rate of the images is low, there is no overlap between the time windows; (b) when the sampling rate of the images is higher and the time window is wider, the time windows overlap.
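As a concrete illustration of the correspondence in Figure 4, the snippet below assigns each image to the motion-sensor time window that contains its timestamp; the window length and sampling rates are assumptions chosen only for the example.

```python
# Assign each photo to the IMU window [start, start + WINDOW_SEC) containing its timestamp.
import numpy as np

WINDOW_SEC = 2.0                                    # assumed IMU window length (s)
image_times = np.arange(0.5, 60.0, 5.0)             # assumed low-frame-rate photo stream (s)

def window_index(t):
    """Index of the non-overlapping IMU window that contains time t."""
    return int(t // WINDOW_SEC)

pairs = [(t, window_index(t)) for t in image_times]
# With a higher image rate and wider (sliding) windows, neighbouring windows overlap
# and a single image can fall into several of them; here each image maps to one window.
print(pairs[:3])
```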
Figure 5. Architecture of the VGG-16 network.
Figure 6. Architecture of the proposed fine-tuned CNN for low-frame-rate photo streams.
Figure 7. Architecture of the proposed CNN-LSTM network for high-frame-rate photo streams.
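Figures 6 and 7 build the per-group photo-stream classifiers on VGG-16. The sketch below shows one plausible CNN-LSTM arrangement in PyTorch, where per-frame VGG-16 convolutional features feed an LSTM that labels the whole high-frame-rate sequence; the layer sizes and the use of torchvision's vgg16 are illustrative assumptions, not the paper's exact architecture.

```python
# Plausible CNN-LSTM sketch: VGG-16 features per frame -> LSTM over the sequence.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CNNLSTM(nn.Module):
    def __init__(self, n_classes, hidden=256):
        super().__init__()
        backbone = vgg16(weights=None)        # use pretrained ImageNet weights in practice
        self.features = backbone.features     # convolutional part of VGG-16 (512 channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, frames):                # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.pool(self.features(frames.flatten(0, 1))).flatten(1)  # (b*t, 512)
        _, (h, _) = self.lstm(feats.view(b, t, -1))
        return self.fc(h[-1])                 # one activity label per image sequence

logits = CNNLSTM(n_classes=15)(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)                           # torch.Size([2, 15])
```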
Figure 8. A specific fusion example of the proposed hierarchical deep fusion framework applied to the self-built eButton hybrid dataset.
Figure 9. Appearance of the eButton device and two possible ways to wear it.
Table 1. Number of time segments in the training and test sets.
| Dataset | Wearer | LY | SD | ST | WK | Total |
|---|---|---|---|---|---|---|
| Training set | W1 | 593 | 607 | 602 | 576 | 2378 |
| | W2 | 626 | 605 | 621 | 523 | 2375 |
| Test set | W1 | 178 | 1332 | 113 | 146 | 1769 |
| | W2 | 199 | 1133 | 120 | 98 | 1550 |
Table 2. Number of images in the training and test sets.
| Dataset | Wearer | CU | ET | EM | MT | NP | RD | SP | SW | TK | TU | TP | WO | WU | TV | WT | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training set | W1 | 139 | 115 | 117 | 153 | 153 | 146 | 170 | 127 | 79 | 106 | 185 | 188 | 102 | 84 | 97 | 1961 |
| | W2 | 119 | 149 | 105 | 120 | 107 | 84 | 112 | 123 | 80 | 95 | 106 | 97 | 113 | 101 | 108 | 1619 |
| Test set | W1 | 138 | 155 | 59 | 92 | 178 | 101 | 149 | 113 | 70 | 99 | 184 | 146 | 90 | 70 | 125 | 1769 |
| | W2 | 95 | 159 | 87 | 95 | 197 | 91 | 91 | 120 | 42 | 98 | 79 | 98 | 95 | 94 | 109 | 1550 |
Figure 10. Example image of each activity in the training set. Images (a–o) correspond to CU, ET, EM, MT, NP, RD, SP, SW, TK, TU, TP (driving), WO, WU, TV, and WT, respectively.
Table 3. The egocentric activities and their corresponding categories in the Multimodal Dataset.
| No. | Activity |
|---|---|
| 1 | walking (WK) |
| 2 | walking upstairs (WK-US) |
| 3 | walking downstairs (WK-DS) |
| 4 | riding elevator up (RD-VU) |
| 5 | riding elevator down (RD-VD) |
| 6 | riding escalator up (RD-SU) |
| 7 | riding escalator down (RD-SD) |
| 8 | sitting (SI) |
| 9 | eating (ET) |
| 10 | drinking (DR) |
| 11 | texting (TX) |
| 12 | making phone calls (MP) |
| 13 | working at PC (PC) |
| 14 | reading (RD) |
| 15 | writing sentences (WT) |
| 16 | organizing files (OF) |
| 17 | running (RN) |
| 18 | doing push-ups (DPU) |
| 19 | doing sit-ups (DSU) |
| 20 | cycling (CY) |
Table 4. All grouping methods established while adjusting the grouping scheme for the proposed algorithm on the Multimodal Dataset.
Figure 11. Architecture of the multistream direct fusion method proposed in [14].
Figure 12. Confusion matrices for the classification results of the inertial measurement unit (IMU) sensor data: (a) W1; (b) W2.
Table 5. F1 accuracy of the classification results on the IMU sensor data.
| Wearer | LY | SD | ST | WK | Avg. |
|---|---|---|---|---|---|
| W1 | 1.0000 | 0.8482 | 0.9000 | 0.9388 | 0.9217 |
| W2 | 1.0000 | 0.9212 | 0.7168 | 0.7945 | 0.8581 |
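The F1 values in this and the following tables follow the standard per-class definition F1 = 2PR/(P + R) (P = precision, R = recall), and the Avg. column is the unweighted mean over classes. The snippet below reproduces that computation on made-up labels, not the paper's data.

```python
# Per-class F1 and the macro ("Avg.") value on made-up motion-state labels.
from sklearn.metrics import f1_score

y_true = ["LY", "SD", "SD", "ST", "WK", "WK", "SD", "ST"]
y_pred = ["LY", "SD", "ST", "ST", "WK", "SD", "SD", "ST"]

classes = ["LY", "SD", "ST", "WK"]
per_class = f1_score(y_true, y_pred, labels=classes, average=None)
macro_avg = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes
print(dict(zip(classes, per_class.round(4))), round(macro_avg, 4))
```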
Figure 13. W1’s confusion matrices for the output of the network corresponding to each group: (a) lying; (b) sedentary; (c) standing; (d) walking.
Figure 14. W2’s confusion matrices for the output of the network corresponding to each group: (a) lying; (b) sedentary; (c) standing; (d) walking.
Table 6. F1 accuracy of the lying group (LY).
| LY | NP | TU | Avg. |
|---|---|---|---|
| W1 | 0.9899 | 0.9794 | 0.9846 |
| W2 | 0.9493 | 0.8939 | 0.9216 |
Table 7. F1 accuracy of the sedentary group (SD).
| SD | CU | ET | EM | MT | RD | SP | TK | TU | TP | TV | WT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W1 | 0.8814 | 0.9152 | 0.9212 | 0.8471 | 0.8323 | 0.9727 | 0.7475 | 0.7373 | 0.9375 | 0.9080 | 0.9422 | 0.8766 |
| W2 | 0.9000 | 0.8765 | 0.9483 | 0.4154 | 0.5357 | 0.9732 | 0.4571 | 0.5616 | 0.9946 | 0.8099 | 0.7500 | 0.7475 |
Table 8. F1 accuracy of the standing group (ST).
| ST | ET | EM | RD | SP | SW | TK | TU | TP | WU | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| W1 | 0.8805 | 0.9202 | 0.8715 | 0.9630 | 0.8988 | 0.8293 | 0.8402 | 0.9202 | 0.8770 | 0.8890 |
| W2 | 0.9067 | 0.7843 | 0.5673 | 0.8916 | 0.6667 | 0.4632 | 0.7344 | 0.9892 | 0.9071 | 0.7678 |
Table 9. F1 accuracy of the walking group (WK).
| WK | SP | WO | Avg. |
|---|---|---|---|
| W1 | 0.9780 | 0.9796 | 0.9788 |
| W2 | 0.9933 | 0.9932 | 0.9932 |
Figure 15. Confusion matrices for the hierarchical fusion results: (a) W1; (b) W2.
Table 10. F1 accuracy of the hierarchical fusion results.
| Wearer | CU | ET | EM | MT | NP | RD | SP | SW | TK | TU | TP | WO | WU | TV | WT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W1 | 0.8701 | 0.7188 | 0.6250 | 0.8304 | 0.9975 | 0.7630 | 0.9215 | 0.8772 | 0.7423 | 0.6992 | 0.9091 | 0.9000 | 0.8166 | 0.8966 | 0.9375 | 0.8336 |
| W2 | 0.9158 | 0.8057 | 0.8364 | 0.6265 | 0.9972 | 0.6170 | 0.8504 | 0.6061 | 0.6218 | 0.5623 | 0.9865 | 0.8915 | 0.7712 | 0.8000 | 0.7939 | 0.7788 |
Figure 16. W1’s confusion matrix using a single sensor: (a) IMU; (b) photo stream.
Figure 17. W2’s confusion matrix using a single sensor: (a) IMU; (b) photo stream.
Figure 18. Confusion matrices for the direct fusion results: (a) W1; (b) W2.
Table 11. F1 accuracy of the direct fusion results.
| Wearer | CU | ET | EM | MT | NP | RD | SP | SW | TK | TU | TP | WO | WU | TV | WT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W1 | 0.6626 | 0.8111 | 0.8810 | 0.7795 | 0.9540 | 0.6857 | 0.9392 | 0.8322 | 0.6863 | 0.5086 | 0.9036 | 0.9510 | 0.5038 | 0.7816 | 0.7358 | 0.7744 |
| W2 | 0.8912 | 0.8308 | 0.6598 | 0.5537 | 0.9622 | 0.4895 | 0.9565 | 0.7064 | 0.4444 | 0.5130 | 0.9892 | 0.8985 | 0.5970 | 0.6081 | 0.7785 | 0.7253 |
Figure 19. F1 accuracies of the four methods shown as bar graphs for (a) W1 and (b) W2.
Table 12. Average F1 accuracy over the 10 splits on the motion sensor data in the Multimodal Dataset.
| WK/WK-US | WK-DS | SD/ST/CY | RN | DPU | DSU | Avg. |
|---|---|---|---|---|---|---|
| 0.9082 | 0.8974 | 0.9825 | 0.9322 | 0.9000 | 0.9564 | 0.9294 |
Figure 20. Confusion matrices for the two splits with (a) the lowest accuracy and (b) the highest accuracy on the motion sensor data in the Multimodal Dataset.
Figure 21. Lowest- and highest-accuracy confusion matrices for the VGG16-LSTM network corresponding to each group: (a,b) the WK/WK-US group; (c,d) the SD/ST/CY group.
Table 13. Average F1 accuracy of the WK/WK-US group.
| WK | WK-US | Avg. |
|---|---|---|
| 0.948 | 0.945 | 0.947 |
Table 14. Average F1 accuracy of the SD/ST/CY group.
| RD-VU | RD-VD | RD-SU | RD-SD | SI | ET | DR | TX | MP | PC | RD | WT | OF | CY | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.631 | 0.705 | 0.924 | 0.842 | 0.880 | 0.864 | 0.713 | 0.843 | 0.599 | 0.978 | 0.835 | 0.876 | 0.714 | 0.866 | 0.805 |
Figure 22. Confusion matrices with the (a) lowest accuracy and (b) highest accuracy among the 10 splits after hierarchical fusion.
Table 15. Average F1 accuracy of the 10 splits for each activity to be recognized.
| WK | WK-US | WK-DS | RD-VU | RD-VD | RD-SU | RD-SD | SI | ET | DR | TX | MP | PC | RD | WT | OF | RN | DPU | DSU | CY | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.881 | 0.832 | 0.897 | 0.631 | 0.698 | 0.915 | 0.837 | 0.880 | 0.851 | 0.687 | 0.843 | 0.551 | 0.978 | 0.747 | 0.875 | 0.721 | 0.932 | 0.900 | 0.956 | 0.823 | 0.822 |
Figure 23. Accuracy comparison of the different grouping methods shown in Table 4, used in the proposed hierarchical deep fusion framework.
Table 16. Comparison between the direct fusion proposed in [14] (with average pooling and with maximum pooling) and the hierarchical fusion proposed in this paper on the Multimodal Dataset.
Table 17. Measured calculation times in Equations (12) and (13).
| 9.709 | 0.659 | 2.395 | 0.659 | 4.103 |