| Literature DB >> 30103525 |
Alexander Diete, Timo Sztyler, Heiner Stuckenschmidt.
Abstract
Working with multimodal datasets is a challenging task, as it requires annotations that are often time-consuming and difficult to acquire. This includes, in particular, video recordings, which often need to be watched in full before they can be labeled. Additionally, other modalities such as acceleration data are often recorded alongside a video. For that purpose, we created an annotation tool that enables the annotation of datasets containing video and inertial sensor data. In contrast to most existing approaches, we focus on semi-supervised labeling support to infer labels for the whole dataset. This means that after labeling a small set of instances, our system is able to provide labeling recommendations. We aim to rely on the acceleration data of a wrist-worn sensor to support the labeling of a video recording. For that purpose, we apply template matching to identify time intervals of certain activities. We test our approach on three datasets: one containing warehouse picking activities, one consisting of activities of daily living, and one about meal preparation. Our results show that the presented method is able to give annotators hints about possible label candidates.
Keywords: activity recognition; machine learning; multimodal labeling
Year: 2018 PMID: 30103525 PMCID: PMC6112036 DOI: 10.3390/s18082639
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Comparison of different annotation tools and their respective focuses.
| Approach | Idea | Method | Pros and Cons |
|---|---|---|---|
| Our approach | Suggest labels based on a small subset of annotations; in-depth analysis of different preprocessing methods and variants of dynamic time warping. | The main focus is the analysis of different variants of dynamic time warping used for label and clustering suggestions (see the sketch below this table). The methods were applied mostly to wrist-worn sensors. Evaluation considers the time offset to the correct label and the recall of the method. | An in-depth analysis across different datasets with different configurations. On the downside, the tool itself is only a prototype and its usability has not been tested. |
| Label Movie | Design a complete multimedia annotation tool with automatic annotation and crowd-sourcing capabilities. | Uses dynamic time warping and SVM time-series prediction with a focus on the usability of the application. Classification results are shown to the user in a Gram matrix. Focus on crowd-sourcing capabilities, combining domain experts' and technical experts' knowledge. | The tool is fully developed with a lot of functionality, especially the capability for crowd-sourcing. On the downside, the evaluation of the tool lacks detail and the tool is not publicly available. |
| Multimodal Multisensor Activity Annotation Tool | A multimodal annotation tool that is able to handle multiple sensor types such as video, depth, and body-worn sensors. | The focus is on capturing many different types of sensors and displaying them in a useful fashion. In contrast to the other methods, this tool is able to capture sensors live and synchronize them. | Live capturing of different types of sensors is integrated and the tool seems to be designed very concisely. However, automatic annotation is not yet integrated, although the architecture allows for it. |
| Smart Video Browsing | Automatically segment videos into different parts using clustering methods to improve navigation within a video. | For clustering, the tool uses color and motion features to distinguish different parts of the video, which the user can then browse. | The tool does not rely on pretrained methods and can thus easily be used. It does not, however, provide automatic labeling functionality. |
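To make the core technique named above concrete, the following is a minimal sketch of dynamic time warping between a labeled template and a candidate window of acceleration data. The function name, the 1-D signals, and the absolute-difference local cost are our own assumptions for illustration; the paper's exact DTW variants and preprocessing are not reproduced here.

```python
import numpy as np

def dtw_distance(template: np.ndarray, window: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping distance between two 1-D signals.

    `template` is a labeled example (e.g., one grabbing motion) and
    `window` is a candidate interval from the unlabeled recording.
    """
    n, m = len(template), len(window)
    # cost[i, j] = minimal accumulated cost of aligning template[:i] with window[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - window[j - 1])  # local distance
            # extend the cheapest of: diagonal match, insertion, deletion
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return float(cost[n, m])
```

In practice, a warping-window constraint (e.g., a Sakoe-Chiba band) keeps the quadratic cost manageable on long recordings.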
Figure 1. (a) Schematic of the shelves in our test environment; (b) Angle features from the wristband used for matching.
Figure 2. Photos from the recording process of our datasets. (a) Picking scenario: participant walking towards and away from shelves; (b) ADL scenario: a glass of water and a pillbox on the table.
Figure 3. Basic approach for finding matches in a dataset. Blue boxes represent data, white boxes the processing of data.
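One plausible reading of the match-finding pipeline that Figure 3 outlines is sketched below, reusing the `dtw_distance` sketch above. The window length, step size, and greedy candidate pruning are our own assumptions; none of them are specified in this record.

```python
def find_match_candidates(stream, templates, step=10, k=5):
    """Slide a window over the unlabeled stream, score it against every
    labeled template with DTW, and keep the k cheapest non-overlapping
    intervals as labeling suggestions."""
    scored = []
    for tpl in templates:
        w = len(tpl)  # tie the window length to the template length
        for start in range(0, len(stream) - w + 1, step):
            d = dtw_distance(tpl, stream[start:start + w])
            scored.append((d, start, start + w))
    scored.sort()  # most similar (lowest distance) first
    kept = []
    for d, s, e in scored:
        # greedy suppression: skip intervals overlapping an already kept one
        if all(e <= ks or s >= ke for _, ks, ke in kept):
            kept.append((d, s, e))
        if len(kept) == k:
            break
    return kept
```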
Figure 4. Developed labeling tools: the first version runs in a browser, the second is a standalone application. (a) Old web application; (b) New desktop application.
Figure 5. ADL experiment settings. Each parameter is set for a specific configuration of the matching algorithm.
Recognition performance of template matching for the picking dataset. The overlap (avg. 69%) excludes outliers and represents only the best match within a dataset. Cases 2, 6, and 7 contain two grabbing activities; the second-activity rows therefore only have entries for these cases.
| Dataset | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Overlap | 0.43 | 0.67 | 0.78 | 0.52 | 0.72 | 0.74 | 0.99 |
| 1st activity | 5.02 | 2.49 | 2.55 | 4.23 | 2.86 | 2.43 | 2.04 |
| 2nd activity | | 2.22 | | | | 4.11 | 2.60 |
| 1st activity | 1.41 | 1.89 | 0.91 | 0.86 | 0.71 | 2.88 | 0.65 |
| 2nd activity | | 1.81 | | | | 2.61 | 2.91 |
| 1st activity | 1.65 | 0.74 | 1.40 | 1.46 | 0.63 | 0.68 | 1.99 |
| 2nd activity | | 0.68 | | | | 1.52 | 1.43 |
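The overlap reported above can be read as the temporal intersection over union between the best match and the ground-truth interval. Whether the authors normalize by the union or by the label length is not stated in this record, so the helper below is one plausible interpretation:

```python
def interval_overlap(pred, truth):
    """Intersection over union of two time intervals (start, end) in seconds."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0

# Example: a match [2.0 s, 5.0 s] against a label [2.5 s, 5.5 s] overlaps by ~0.71.
print(interval_overlap((2.0, 5.0), (2.5, 5.5)))
```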
Figure 6. Overall estimate of the grabbing start (a) and end (b) points for the picking dataset. Cases 2, 6, and 7 contain two activities and therefore also two crosses in the plot.
Results for matching activities of daily living. For each case, we report the minimum and maximum distance to the activities as well as the median distance. Bold values show the best results.
| Acceleration | Reduce dim. | Approach | Subject 1 | Subject 2 |
|---|---|---|---|---|
| Raw | Yes | Zeroline | [0 s - 28.9 s] | [0 s - 39.3 s] |
| | | Delta | [0.2 s - 5.1 s] | [0 s - 1.4 s] |
| | No | Zeroline | ∅ | ∅ |
| | | Delta | [0.1 s - 14.5 s] | [0 s - 39.3 s] |
| Gravity | Yes | Zeroline | [0 s - 20.1 s] | [0 s - 39.3 s] |
| | | Delta | [0 s - 1.5 s] | [0 s - 12.5 s] |
| | No | Zeroline | ∅ | ∅ |
| | | Delta | [0.2 s - 10.9 s] | [0.2 s - 41.3 s] |
| Linear | Yes | Zeroline | [1.2 s - 82 s] | [0 s - 48.4 s] |
| | No | Zeroline | ∅ | ∅ |
| | | Delta | [0.4 s - 9.6 s] | [0 s - 21 s] |
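The first three columns of this table enumerate preprocessing variants. Gravity/linear separation and a delta transform are standard operations and can be sketched as below; the filter constant, the magnitude-based reading of "Reduce dim.", and the difference-based reading of "Delta" are our assumptions, and the "Zeroline" variant is not sketched because its definition is not recoverable from this record alone.

```python
import numpy as np

def split_gravity(raw: np.ndarray, alpha: float = 0.9):
    """Split raw 3-axis acceleration (shape (n, 3)) into a slowly varying
    gravity estimate and the remaining linear acceleration, using an
    exponential low-pass filter with assumed smoothing constant alpha."""
    gravity = np.empty_like(raw)
    gravity[0] = raw[0]
    for t in range(1, len(raw)):
        gravity[t] = alpha * gravity[t - 1] + (1 - alpha) * raw[t]
    return gravity, raw - gravity

def reduce_dim(signal: np.ndarray) -> np.ndarray:
    """Collapse the three axes to a single magnitude channel; one plausible
    reading of the 'Reduce dim.' column."""
    return np.linalg.norm(signal, axis=1)

def delta(signal: np.ndarray) -> np.ndarray:
    """Read 'Delta' as consecutive sample differences of the signal."""
    return np.diff(signal, axis=0)
```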
Figure 7. Results of matching activities of daily living with different numbers of templates used for matching and different values for k. The color shows the average distance (in ms) of a match to a label.
Figure 8. Recall of the results for both hands depending on the value of k used for candidate selection. An overlap with the ground-truth labels is counted as a true positive.
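Following the counting rule in the Figure 8 caption (any overlap with a ground-truth label is a true positive), recall over the top-k candidates could be computed as below; the data structures are our own assumptions:

```python
def recall_at_k(candidates, labels, k):
    """candidates: (distance, start, end) tuples sorted ascending by distance;
    labels: (start, end) ground-truth intervals. A label counts as found as
    soon as any top-k candidate interval overlaps it."""
    top_k = candidates[:k]
    hits = sum(
        1 for ls, le in labels
        if any(s < le and e > ls for _, s, e in top_k)
    )
    return hits / len(labels) if labels else 0.0
```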
Figure 9. ROC curve for both hands. Without the candidate selection, this plot shows the overall performance of the dynamic time warping algorithm. Again, it can be seen that the performance on the left-hand data is not as consistent as on the right-hand data.
Figure 10. Dendrogram of the clustering of the templates in the kitchen dataset. Marked boxes are activities using the same item.
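A dendrogram like the one in Figure 10 can be produced by agglomerative clustering over pairwise DTW distances between templates. The sketch below uses SciPy and reuses the `dtw_distance` sketch above; the "average" linkage criterion is an assumption, as the paper's choice is not stated in this record.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cluster_templates(templates, names):
    """Hierarchically cluster templates from their pairwise DTW distances
    and draw the dendrogram (cf. Figure 10)."""
    n = len(templates)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(templates[i], templates[j])  # sketch from above
            dist[i, j] = dist[j, i] = d
    # linkage() expects the condensed form of the symmetric distance matrix
    z = linkage(squareform(dist), method="average")
    dendrogram(z, labels=names)
    plt.show()
```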