Alessandro Manzi, Paolo Dario, Filippo Cavallo.
Abstract
Human activity recognition is an important area in computer vision, with a wide range of applications including ambient assisted living. In this paper, an activity recognition system based on skeleton data extracted from a depth camera is presented. The system uses machine learning techniques to classify actions that are described by a small set of basic postures. The training phase creates several models, related to the number of clustered postures, by means of a multiclass Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO). The classification phase adopts the X-means algorithm to find the optimal number of clusters dynamically. The contribution of the paper is twofold: first, to perform activity recognition employing features based on a small number of informative postures, extracted independently from each activity instance; second, to assess the minimum number of frames needed for adequate classification. The system is evaluated on two publicly available datasets, the Cornell Activity Dataset (CAD-60) and the Telecommunication Systems Team (TST) Fall detection dataset. The number of clusters needed to model each instance ranges from two to four elements. The proposed approach achieves excellent performance using only about 4 s of input data (~100 frames) and outperforms the state of the art when it uses approximately 500 frames on the CAD-60 dataset. These results are promising for tests in real contexts.
Keywords: RGB-D camera; SVM; SMO; assisted living; clustering; X-means; depth camera; human activity recognition; skeleton data
Year: 2017 PMID: 28492486 PMCID: PMC5470490 DOI: 10.3390/s17051100
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Example of activity feature instances using a sliding window of elements. Duplicates are discarded.
Figure 2. Subset example of activity feature instances using a window length equal to 5 and a skeleton of seven joints (torso omitted).
Figure 3. Software architecture of the training phase. The skeleton data are gathered from the depth camera, and the skeleton features are selected and normalized. Then, the input is clustered several times to find the informative postures for the sequence. The activity features are generated from the obtained basic postures, and, finally, a classifier is trained and the corresponding models are generated.
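The training pipeline of Figure 3 (normalize skeleton features, cluster them into a few basic postures, build activity features, train a multiclass SVM) can be sketched as below. This is an illustrative approximation, not the authors' code: it uses a shared k-means posture codebook and plain posture-histogram features, whereas the paper clusters postures per activity instance and builds sequence-based features; scikit-learn's `SVC` is used because its libsvm backend trains with an SMO-type solver, as in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_activity_model(sequences, labels, n_postures=4, rng=0):
    # Learn a shared codebook of basic postures from all training frames
    # (simplification: the paper clusters each activity instance separately).
    all_frames = np.vstack(sequences)
    codebook = KMeans(n_clusters=n_postures, n_init=10, random_state=rng).fit(all_frames)
    # Represent each sequence as a normalized histogram of posture labels.
    feats = []
    for frames in sequences:  # frames: (n_frames, n_joint_coordinates)
        h = np.bincount(codebook.predict(frames), minlength=n_postures).astype(float)
        feats.append(h / h.sum())
    # Multiclass SVM; libsvm's solver behind SVC is SMO-based.
    clf = SVC(kernel="rbf").fit(np.array(feats), labels)
    return codebook, clf
```

A new sequence is classified by mapping its frames through the same codebook and feeding the resulting histogram to the trained SVM.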
Figure 4. Software architecture of the testing phase. As for the training phase, the skeleton features are extracted from the depth camera. Then, the optimal number of clusters is calculated using the X-means algorithm. A classifier is applied using the previously trained model and the generated activity features.
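The dynamic choice of the number of posture clusters can be approximated with a BIC-based search, since BIC is the criterion X-means uses when deciding whether to split a centroid. The sketch below is a simplified stand-in, not a faithful X-means: it searches k exhaustively in the paper's observed range (2-4) instead of splitting incrementally, and uses a pooled spherical-Gaussian BIC; a full implementation is available, e.g., in the pyclustering library.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_n_postures(frames, k_min=2, k_max=4, rng=0):
    # Pick k by a simplified spherical-Gaussian BIC, the criterion
    # X-means uses to decide whether a centroid should be split.
    n, d = frames.shape
    best_k, best_bic = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=rng).fit(frames)
        counts = np.bincount(km.labels_, minlength=k)
        var = max(km.inertia_ / max(n - k, 1), 1e-12)  # pooled spherical variance
        log_lik = ((counts * np.log(counts / n)).sum()      # mixture-weight term
                   - 0.5 * n * d * np.log(2 * np.pi * var)  # Gaussian normalizer
                   - km.inertia_ / (2 * var))               # fit term
        bic = log_lik - 0.5 * k * (d + 1) * np.log(n)       # parameter penalty
        if bic > best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On well-separated posture groups the penalty stops the search from over-splitting, so the selected k tracks the number of genuinely distinct postures in the sequence.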
Number of clusters obtained with the X-means algorithm for different input frame split sizes, for each activity of the CAD-60 dataset.
| Activity / Frame Split | 100 | 300 | 500 | 700 |
|---|---|---|---|---|
| talking on the phone | 3 | 4 | 4 | 4 |
| writing on whiteboard | 2 | 2 | 2 | 2 |
| drinking water | 3 | 4 | 4 | 4 |
| rinsing mouth with water | 3 | 4 | 4 | 4 |
| brushing teeth | 4 | 4 | 4 | 4 |
| wearing contact lenses | 4 | 4 | 4 | 4 |
| talking on couch | 4 | 3 | 4 | 4 |
| relaxing on couch | 3 | 4 | 4 | 4 |
| cooking (chopping) | 2 | 4 | 4 | 4 |
| cooking (stirring) | 4 | 4 | 4 | 4 |
| opening pill container | 4 | 4 | 4 | 4 |
| working on computer | 2 | 2 | 2 | 4 |
Overall precision, recall, and accuracy values for the “new person” test, using dynamic clustering and a sliding activity window of 11 elements on CAD-60.
| Frame Split | Precision | Recall | Accuracy |
|---|---|---|---|
| 100 | 0.963 | 0.958 | 0.984 |
| 300 | 0.950 | 0.958 | 0.986 |
| 500 | 1 | 1 | 1 |
| 700 | 1 | 1 | 1 |
The confusion matrix of the “new person” test case, using a sliding activity window of 11 elements on CAD-60 with a frame split of 100 frames.
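The overall values in the tables above can be derived from such a confusion matrix. The sketch below assumes macro-averaging of precision and recall over classes (rows = ground truth, columns = predictions), a common convention for CAD-60 evaluations; the record itself does not spell out the averaging used.

```python
import numpy as np

def summarize_confusion(cm):
    """Macro-averaged precision/recall and overall accuracy from a
    confusion matrix with true classes on rows, predictions on columns."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                                        # correct predictions per class
    precision = np.mean(tp / np.maximum(cm.sum(axis=0), 1e-12))  # per predicted class
    recall = np.mean(tp / np.maximum(cm.sum(axis=1), 1e-12))     # per true class
    accuracy = tp.sum() / cm.sum()
    return precision, recall, accuracy
```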
State-of-the-art precision and recall values (%) on the CAD-60 dataset.
| Algorithm | Precision | Recall |
|---|---|---|
| Zhu et al. | 93.2 | 84.6 |
| Faria et al. | 91.1 | 91.9 |
| Shan et al. | 93.8 | 94.5 |
| Parisi et al. | 91.9 | 90.2 |
| Cippitelli et al. | 93.9 | 93.5 |
| Our First Method | 99.8 | 99.8 |
| Current Method | 100 | 100 |
Number of clusters obtained with the X-means algorithm for different input frame split sizes, for each activity of the TST dataset.
| Category | Activity / Frame Split | 30 | 50 | 60 | 100 |
|---|---|---|---|---|---|
| ADL | sit on chair | 3 | 3 | 4 | 4 |
| ADL | walk and grasp | 4 | 4 | 3 | 4 |
| ADL | walk back and forth | 3 | 3 | 3 | 3 |
| ADL | lie down | 3 | 4 | 4 | 4 |
| Fall | frontal fall | 3 | 3 | 3 | 4 |
| Fall | backward fall | 2 | 3 | 4 | 2 |
| Fall | side fall | 3 | 4 | 4 | 4 |
| Fall | backward fall and sit | 2 | 2 | 2 | 2 |
Overall precision, recall, and accuracy values using dynamic clustering and a sliding activity window of five elements on TST.
| Frame Split | ADL Precision | ADL Recall | ADL Accuracy | Fall Precision | Fall Recall | Fall Accuracy |
|---|---|---|---|---|---|---|
| 30 | 0.956 | 0.911 | 0.927 | 0.687 | 0.516 | 0.527 |
| 50 | 0.953 | 0.906 | 0.921 | 0.767 | 0.667 | 0.667 |
| 60 | 1 | 1 | 1 | 0.826 | 0.819 | 0.805 |
| 100 | 1 | 1 | 1 | 0.937 | 0.950 | 0.933 |
The confusion matrix of the TST Fall activity with 100 input frames.