German I. Parisi, Cornelius Weber, Stefan Wermter.
Abstract
The visual recognition of complex, articulated human movements is fundamental for a wide range of artificial systems oriented toward human-robot communication, action classification, and action-driven perception. These challenging tasks may generally involve the processing of a huge amount of visual information and learning-based mechanisms for generalizing a set of training actions and classifying new samples. To operate in natural environments, a crucial property is the efficient and robust recognition of actions, also under noisy conditions caused by, for instance, systematic sensor errors and temporarily occluded persons. Studies of the mammalian visual system and its outperforming ability to process biological motion information suggest separate neural pathways for the distinct processing of pose and motion features at multiple levels and the subsequent integration of these visual cues for action perception. We present a neurobiologically-motivated approach to achieve noise-tolerant action recognition in real time. Our model consists of self-organizing Growing When Required (GWR) networks that obtain progressively generalized representations of sensory inputs and learn inherent spatio-temporal dependencies. During the training, the GWR networks dynamically change their topological structure to better match the input space. We first extract pose and motion features from video sequences and then cluster actions in terms of prototypical pose-motion trajectories. Multi-cue trajectories from matching action frames are subsequently combined to provide action dynamics in the joint feature space. Reported experiments show that our approach outperforms previous results on a dataset of full-body actions captured with a depth sensor, and ranks among the best results for a public benchmark of domestic daily actions.Entities:
Keywords: action recognition; depth information; neural networks; robot perception; self-organizing learning; visual processing
Year: 2015 PMID: 26106323 PMCID: PMC4460528 DOI: 10.3389/fnbot.2015.00003
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
Precision, recall, and F-score (in %) of our approach evaluated on the 12 activities of the CAD-60 dataset, compared with other algorithms.
| Algorithm | Precision | Recall | F-score |
| Sung et al. | 67.9 | 55.5 | 61.1 |
| Ni et al. | 75.9 | 69.5 | 72.1 |
| Koppula et al. | 80.8 | 71.4 | 75.8 |
| Gupta et al. | 78.1 | 75.4 | 76.7 |
| Gaglio et al. | 77.3 | 76.7 | 77.0 |
| Zhang and Tian | 86.0 | 84.0 | 85.0 |
| Zhu et al. | 93.2 | 84.6 | 88.7 |
| **Our approach** | **91.9** | **90.2** | **91.0** |
| Faria et al. | 91.1 | 91.9 | 91.5 |
| Shan and Akella | 93.8 | 94.5 | 94.1 |
Bold values indicate the classification results for our algorithm.
Figure 1. GWR-based architecture for the processing of pose-motion samples. (1) Hierarchical processing of pose and motion features in parallel. (2) Integration of neuron trajectories in the joint pose-motion feature space.
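The integration stage of Figure 1 can be made concrete with a few lines of Python. In this illustrative sketch (not the authors' exact implementation), `pose_w` and `motion_w` are the prototype matrices of two already trained GWR networks, and the temporal window length is an assumed parameter:

```python
import numpy as np

# Sketch of the integration stage in Figure 1. Two already trained GWR
# networks (prototype matrices `pose_w` and `motion_w`) quantize frame-wise
# pose and motion features; the prototypes of matching frames are then
# concatenated into joint pose-motion trajectory samples for the next
# network in the hierarchy. Names and the window length are assumptions.

def best_prototype(prototypes, x):
    """Return the prototype (weight vector) closest to input x."""
    return prototypes[np.argmin(np.linalg.norm(prototypes - x, axis=1))]

def joint_trajectories(pose_feats, motion_feats, pose_w, motion_w, window=3):
    """Concatenate pose/motion prototypes over a sliding temporal window."""
    samples = []
    for t in range(window - 1, len(pose_feats)):
        chunk = [np.concatenate([best_prototype(pose_w, pose_feats[k]),
                                 best_prototype(motion_w, motion_feats[k])])
                 for k in range(t - window + 1, t + 1)]
        samples.append(np.concatenate(chunk))
    return np.array(samples)   # training input for the integration GWR
```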
Growing When Required
| 1: | Start with a set A of two nodes with random weight vectors w_1 and w_2. |
| 2: | Initialize an empty set of connections E = ∅. |
| 3: | At each iteration, generate an input sample ξ according to the input distribution. |
| 4: | For each node n, compute the distance to the input ||ξ − w_n||. |
| 5: | Select the best matching node b and the second-best matching node s such that: b = arg min_{n∈A} ||ξ − w_n|| and s = arg min_{n∈A\{b}} ||ξ − w_n||. |
| 6: | Create a connection (b, s) in E if it does not exist, and set its age to 0. |
| 7: | Calculate the activity of the best matching unit: a = exp(−||ξ − w_b||). |
| 8: | If a < a_T (activation threshold) and the firing counter h_b < h_T (firing threshold), add a new node r between the winner and the input, with w_r = (w_b + ξ)/2; create connections (r, b) and (r, s), and remove (b, s). |
| 9: | Else, i.e., no new node is added, adapt the positions of the winning node and its neighbours i: Δw_b = ε_b · h_b · (ξ − w_b) and Δw_i = ε_n · h_i · (ξ − w_i), with learning rates 0 < ε_n < ε_b < 1. |
| 10: | Increment the age of all edges connected to b. |
| 11: | Reduce the firing counters of the best matching unit and its neighbours according to Equation (2). |
| 12: | Remove all edges with ages larger than a_max, and remove nodes without edges. |
| 13: | If the stop criterion is not met, go to step 3. |
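For concreteness, the listing above can be condensed into a compact Python sketch. Hyperparameter values are illustrative placeholders, and the firing-counter update of step 11 is simplified to a plain exponential decay, since Equation (2) is not reproduced in this section:

```python
import numpy as np

class GWR:
    """Compact sketch of Growing When Required (steps 1-13 above).
    Hyperparameter values are illustrative placeholders."""

    def __init__(self, dim, a_T=0.90, h_T=0.10, eps_b=0.2, eps_n=0.006,
                 max_age=50, tau_b=3.33, tau_n=14.33, seed=0):
        rng = np.random.default_rng(seed)
        # step 1: two nodes with random weights; step 2: no connections yet
        self.w = [rng.standard_normal(dim), rng.standard_normal(dim)]
        self.h = [1.0, 1.0]          # firing counters, start at h_0 = 1
        self.edges = {}              # (i, j) with i < j  ->  edge age
        self.a_T, self.h_T = a_T, h_T
        self.eps_b, self.eps_n = eps_b, eps_n
        self.max_age, self.tau_b, self.tau_n = max_age, tau_b, tau_n

    def _neighbours(self, i):
        return [a if b == i else b for (a, b) in self.edges if i in (a, b)]

    def train_step(self, xi):
        # steps 4-5: distances to the input; best and second-best nodes
        d = np.array([np.linalg.norm(xi - w) for w in self.w])
        b, s = (int(k) for k in np.argsort(d)[:2])
        # step 6: (re)create the edge between b and s with age 0
        self.edges[(min(b, s), max(b, s))] = 0
        # step 7: activity of the best matching unit
        a = np.exp(-d[b])
        if a < self.a_T and self.h[b] < self.h_T:
            # step 8: insert a new node halfway between the winner and xi
            r = len(self.w)
            self.w.append(0.5 * (self.w[b] + xi))
            self.h.append(1.0)
            del self.edges[(min(b, s), max(b, s))]
            self.edges[(min(r, b), max(r, b))] = 0
            self.edges[(min(r, s), max(r, s))] = 0
        else:
            # step 9: adapt winner and its neighbours (eps_n < eps_b)
            self.w[b] = self.w[b] + self.eps_b * self.h[b] * (xi - self.w[b])
            for n in self._neighbours(b):
                self.w[n] = self.w[n] + self.eps_n * self.h[n] * (xi - self.w[n])
        # step 10: age all edges connected to b
        for e in self.edges:
            if b in e:
                self.edges[e] += 1
        # step 11 (simplified): exponential decay of the firing counters
        self.h[b] -= self.h[b] / self.tau_b
        for n in self._neighbours(b):
            self.h[n] -= self.h[n] / self.tau_n
        # step 12: prune stale edges (removal of isolated nodes is
        # omitted here to keep node indices stable)
        self.edges = {e: age for e, age in self.edges.items()
                      if age <= self.max_age}


# step 13 / usage: train on 1000 normally distributed samples (cf. Figure 2)
if __name__ == "__main__":
    gwr = GWR(dim=2)
    for xi in np.random.default_rng(1).standard_normal((1000, 2)):
        gwr.train_step(xi)
    print(len(gwr.w), "nodes,", len(gwr.edges), "edges")
```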
Figure 2. A GWR network trained with a normally distributed training set of 1000 samples, resulting in 556 nodes and 1145 connections.
Figure 3. Activation values for the network trained in Figure 2. Noisy samples lie under the novelty threshold a_T = 0.1969 (green line).
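The green line in Figure 3 corresponds to a simple novelty test on the best-matching activation (step 7 of the listing above); a minimal sketch, assuming Euclidean distance and the threshold value quoted in the caption:

```python
import numpy as np

def is_novel(xi, prototypes, a_T=0.1969):
    """True if the best-matching activation a = exp(-||xi - w_b||) falls
    below the threshold, i.e., the sample is flagged as noise/novelty."""
    a = np.exp(-min(np.linalg.norm(xi - w) for w in prototypes))
    return a < a_T
```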
Figure 4. Representation of full-body movements from our action dataset. We estimate three centroids C1 (green), C2 (yellow), and C3 (blue) for the upper, middle, and lower body, respectively. The segment slopes θ_u and θ_l describe the posture in terms of the overall orientation of the upper and lower body.
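A hedged sketch of these posture features, where the joint-index split into upper, middle, and lower body is an assumption chosen only for illustration:

```python
import numpy as np

def posture_features(joints):
    """joints: (N, 2) array of joint positions for one frame (image plane).
    Returns the three centroids and the two segment slopes of Figure 4."""
    c1 = joints[:8].mean(axis=0)      # C1: upper body   (assumed split)
    c2 = joints[8:12].mean(axis=0)    # C2: middle body
    c3 = joints[12:].mean(axis=0)     # C3: lower body
    theta_u = np.arctan2(c1[1] - c2[1], c1[0] - c2[0])  # upper orientation
    theta_l = np.arctan2(c2[1] - c3[1], c2[0] - c3[0])  # lower orientation
    return c1, c2, c3, theta_u, theta_l
```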
Figure 5. Daily actions from the CAD-60 dataset (RGB and depth images with skeleton).
Training results on the two datasets. For each trained network along the hierarchy, the table shows the resulting number of nodes and connections, and the activation threshold a_T.
Figure 6. Confusion matrices for our dataset of 10 actions, showing better results for our GWR-based architecture (average accuracy 94%) compared to our previous GNG-based approach (89%).
Precision, recall, and F-score (in %) of our approach on the five environments of the CAD-60 dataset. (A sketch of how these metrics are derived from a confusion matrix follows the table.)
| Environment | Activity | Precision | Recall | F-score |
| Office | Talking on the phone | 94.1 | 92.8 | 93.4 |
| | Drinking water | 92.9 | 91.5 | 92.2 |
| | Working on computer | 94.3 | 93.9 | 94.1 |
| | Writing on whiteboard | 95.7 | 94.0 | 94.8 |
| | Average | 94.3 | 93.1 | 93.7 |
| Kitchen | Drinking water | 93.2 | 91.4 | 92.3 |
| | Cooking (chopping) | 86.4 | 86.7 | 86.5 |
| | Cooking (stirring) | 88.2 | 86.2 | 87.2 |
| | Opening pill container | 90.8 | 84.6 | 87.6 |
| | Average | 89.7 | 87.2 | 88.4 |
| Bedroom | Talking on the phone | 93.7 | 91.9 | 92.8 |
| | Drinking water | 90.9 | 90.3 | 90.6 |
| | Opening pill container | 90.8 | 90.1 | 90.4 |
| | Average | 91.8 | 91.7 | 91.7 |
| Bathroom | Wearing contact lens | 91.2 | 87.0 | 89.1 |
| | Brushing teeth | 90.6 | 88.0 | 89.3 |
| | Rinsing mouth | 87.9 | 85.8 | 86.8 |
| | Average | 89.9 | 86.9 | 88.4 |
| Living room | Talking on the phone | 94.8 | 92.1 | 93.4 |
| | Drinking water | 91.7 | 90.8 | 91.2 |
| | Relaxing on couch | 93.9 | 91.7 | 92.8 |
| | Talking on couch | 94.7 | 93.2 | 93.9 |
| | Average | 93.8 | 92.0 | 92.9 |
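The per-class metrics reported above follow the standard definitions; a short sketch computing them from a confusion matrix:

```python
import numpy as np

# Standard per-class precision, recall, and F-score from a confusion
# matrix, as reported in the tables above (values in percent).

def prf(confusion):
    """confusion[i, j]: samples of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    precision = tp / confusion.sum(axis=0)   # column sums: predicted counts
    recall = tp / confusion.sum(axis=1)      # row sums: true counts
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score
```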