Thi-Hoa-Cuc Nguyen, Jean-Christophe Nebel, Francisco Florez-Revuelta.
Abstract
Video-based recognition of activities of daily living (ADLs) is being used in ambient assisted living systems in order to support the independent living of older people. However, current systems based on cameras located in the environment present a number of problems, such as occlusions and a limited field of view. Recently, wearable cameras have begun to be exploited. This paper presents a review of the state of the art of egocentric vision systems for the recognition of ADLs following a hierarchical structure: motion, action and activity levels, where each level provides higher semantic information and involves a longer time frame. The current egocentric vision literature suggests that ADL recognition is mainly driven by the objects present in the scene, especially those associated with specific tasks. However, although object-based approaches have proven popular, object recognition remains a challenge due to the intra-class variations found in unconstrained scenarios. As a consequence, the performance of current systems is far from satisfactory.
Keywords: activity recognition; ambient assisted living; egocentric vision; wearable cameras
Year: 2016 PMID: 26751452 PMCID: PMC4732105 DOI: 10.3390/s16010072
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Human behaviour analysis tasks: classification (reprinted from [9]).
Figure 2. Pipeline for human behaviour analysis at the motion level.
Figure 3. Gaze prediction without reference to saliency or an activity model [32]. Egocentric features (head/hand motion and hand location/pose) are leveraged to predict gaze. A model that accounts for eye-hand and eye-head coordination, combined with the temporal dynamics of gaze, is designed for gaze prediction. Only egocentric videos are used, and the performance is compared to the ground truth acquired with an eye tracker (reprinted from [32]).
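The approach in Figure 3 predicts gaze purely from egocentric cues such as hand position and head motion. As a rough, hedged illustration of that idea (not the eye-hand/eye-head coordination model of [32]), the sketch below regresses a 2-D gaze point from hypothetical per-frame hand-position and head-motion features with a plain ridge regressor on synthetic data.

```python
# Hedged sketch: regress a 2-D gaze point from egocentric cues.
# This is NOT the coordination model of [32]; features and data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical per-frame features: [hand_x, hand_y, head_flow_dx, head_flow_dy]
X = rng.uniform(0.0, 1.0, size=(500, 4))
# Synthetic "gaze" that loosely follows the hand with a head-motion offset
# (stands in for eye-tracker ground truth; purely illustrative)
gaze = 0.7 * X[:, :2] + 0.2 * X[:, 2:] + rng.normal(0.0, 0.02, size=(500, 2))

model = Ridge(alpha=1.0).fit(X[:400], gaze[:400])   # train on the first 400 frames
pred = model.predict(X[400:])                       # predict gaze for the rest
err = np.linalg.norm(pred - gaze[400:], axis=1).mean()
print(f"mean gaze error (normalised image coordinates): {err:.3f}")
```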
Figure 4. A part-based object model for a stove in an activities of daily living (ADL) dataset using a histogram of oriented gradients (HOG) descriptor (reprinted and adapted from [11]).
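Since the part-based model in Figure 4 is built on HOG features, the following minimal sketch shows how a single HOG descriptor can be extracted from an image window with scikit-image; the window and the cell/block parameters are illustrative, not the configuration used in [11].

```python
# Minimal HOG extraction sketch (scikit-image); cell/block sizes are
# illustrative defaults, not the configuration used in [11].
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

# Placeholder image window (e.g., a candidate "stove" region)
window = np.random.rand(128, 128, 3)

features = hog(
    rgb2gray(window),
    orientations=9,          # gradient orientation bins
    pixels_per_cell=(8, 8),  # cell size in pixels
    cells_per_block=(2, 2),  # blocks used for local contrast normalisation
    block_norm="L2-Hys",
)
print(features.shape)        # one long descriptor for the whole window
```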
Combination of features and machine learning methods for object recognition with egocentric vision. GTEA, Georgia Tech Egocentric Activities.
| Target | Paper (Year) | Approach | Dataset | Results |
|---|---|---|---|---|
| Handled objects | [ ] | Standard SIFT + multi-class SVM | Intel 42 objects | 12% |
| | [ ] | Background segmentation + temporal integration + SIFT, HOG + SVM | Intel 42 objects | 86% |
| | [ ] | Background segmentation + multiple instance learning + transductive SVM [ ] | GTEA | Not given in a table, but stated as 35% according to [ ] |
| Active objects (manipulated or observed) | [ ] | Background segmentation + colour and texture histogram + SVM super-pixel classifier | GTEA Gaze, GTEA Gaze+ | n/a |
| | [ ] | Part-based object detector (latent SVM) on active images + spatial + skin detector | ADL | n/a |
| | [ ] | Part-based object detector (latent SVM) + “salient” assignment based on estimated gaze point | ADL | n/a |
| | [ ] | Visual attention maps (spatial + geometry + temporal) combined with SURF + BoVW + SVM | GTEA, GTEA Gaze, ADL | 36.8%; 12% |
| | [ ] | | Their own dataset | 50% |
| General objects (all objects in the scene) | [ ] | Part-based object detector (2010) + latent SVM | ADL | 19.9% (fridge) to 69% (TV) |
See Section 5 for details; a minimal sketch of one such feature + classifier pipeline follows below.
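As a rough, non-authoritative sketch of the kind of pipeline listed in the table (local features, a bag-of-visual-words encoding and an SVM classifier), the code below combines OpenCV SIFT descriptors, a k-means vocabulary and a multi-class SVM. The variables `frames` and `labels` are hypothetical placeholders, and none of this reproduces the implementation of any specific cited work.

```python
# Hedged sketch of a SIFT + bag-of-visual-words + SVM object classifier,
# in the spirit of the pipelines listed in the table above; it is NOT the
# implementation of any cited work. Assumes OpenCV (with SIFT) and scikit-learn.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(image_gray):
    """Extract 128-D SIFT descriptors from a grayscale frame."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def bovw_histogram(desc, vocabulary):
    """Encode descriptors as a normalised bag-of-visual-words histogram."""
    words = vocabulary.predict(desc.astype(np.float32)) if len(desc) else []
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.n_clusters + 1))
    return hist / max(hist.sum(), 1)

def train_object_classifier(frames, labels, vocab_size=200):
    """frames: list of grayscale images; labels: object class per frame (hypothetical data)."""
    all_desc = [sift_descriptors(f) for f in frames]
    vocabulary = KMeans(n_clusters=vocab_size, n_init=10).fit(
        np.vstack([d for d in all_desc if len(d)]))
    X = np.array([bovw_histogram(d, vocabulary) for d in all_desc])
    clf = SVC(kernel="rbf").fit(X, labels)   # multi-class SVM (one-vs-one)
    return vocabulary, clf
```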
Figure 5. Pixel-level hand detection under varying illumination and hand poses (reprinted from [50]).
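Pixel-level hand detection of the kind shown in Figure 5 amounts to labelling every pixel as hand or background. The sketch below is a generic illustration using simple colour features and a random forest, not the illumination-adaptive method of [50]; `train_images` and `train_masks` are assumed annotated data.

```python
# Hedged sketch of per-pixel hand detection: classify each pixel from simple
# colour features with a random forest. A generic illustration of pixel-level
# labelling, not the method of [50]; training data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(image_rgb):
    """Per-pixel colour features; real systems add texture and context."""
    img = image_rgb.astype(np.float32) / 255.0
    return img.reshape(-1, 3)                # (H*W, 3) RGB values

def train_hand_detector(train_images, train_masks):
    X = np.vstack([pixel_features(im) for im in train_images])
    y = np.concatenate([m.reshape(-1) for m in train_masks])   # 1 = hand pixel
    return RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y)

def detect_hands(model, image_rgb):
    h, w = image_rgb.shape[:2]
    prob = model.predict_proba(pixel_features(image_rgb))[:, 1]
    return prob.reshape(h, w) > 0.5           # binary hand mask
```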
Figure 6. Action recognition based on changes in the state of objects (reprinted from [64]).
Figure 7. Flow chart for indoor office task classification (reprinted from [15]).
Figure 8. Temporal pyramid representation of a video sequence (reprinted from [12]).
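A temporal pyramid, as in Figure 8, pools per-frame features over the whole clip, then over halves, quarters and so on, and concatenates the results so that the representation encodes roughly when features occur. A minimal sketch, assuming generic per-frame feature vectors rather than the exact features of [12]:

```python
# Minimal temporal pyramid sketch: per-frame feature vectors are pooled over
# the whole clip, then over halves, quarters, ... and concatenated. The number
# of levels and the per-frame features are illustrative.
import numpy as np

def temporal_pyramid(frame_features, levels=3):
    """frame_features: (T, D) array; returns a (D * (2**levels - 1)) vector."""
    T = len(frame_features)
    pooled = []
    for level in range(levels):
        n_segments = 2 ** level
        for s in range(n_segments):
            start = s * T // n_segments
            end = (s + 1) * T // n_segments
            segment = frame_features[start:max(end, start + 1)]
            pooled.append(segment.mean(axis=0))   # average pooling per segment
    return np.concatenate(pooled)

# Example: 90 frames, 20-D per-frame feature (e.g., an object score histogram)
video = np.random.rand(90, 20)
print(temporal_pyramid(video).shape)   # (140,) = 20 * (1 + 2 + 4)
```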
Figure 9. Detection of hands and objects in the ADL dataset (reprinted from [11]).
Figure 10. Model of the graph-based framework. An activity y is modelled as a sequence of actions, and each action is represented by objects and hands. During testing, object and hand labels are assigned to image regions (reprinted from [8]).
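The hierarchy described in the caption of Figure 10 (an activity as a sequence of actions, each grounded in object and hand regions) can be pictured with the toy data structures below; the class and field names are hypothetical and only mirror the caption, not the graphical model of [8].

```python
# Toy data structures mirroring the hierarchy in Figure 10
# (activity -> sequence of actions -> object/hand regions).
# Names are hypothetical; this is not the graphical model of [8].
from dataclasses import dataclass, field
from typing import List

@dataclass
class Region:
    bbox: tuple            # (x, y, w, h) image region
    label: str             # object or hand label assigned at test time

@dataclass
class Action:
    verb: str              # e.g. "take", "pour"
    objects: List[Region] = field(default_factory=list)
    hands: List[Region] = field(default_factory=list)

@dataclass
class Activity:
    name: str              # e.g. "make coffee"
    actions: List[Action] = field(default_factory=list)

activity = Activity("make coffee", [
    Action("take", objects=[Region((10, 20, 40, 40), "cup")],
           hands=[Region((60, 80, 50, 50), "right hand")]),
    Action("pour", objects=[Region((15, 25, 45, 45), "kettle")]),
])
print(len(activity.actions))
```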
Figure 11. Visual explanation of the proposed method. The region around the fixation point is extracted and encoded using a gradient-based template. These templates are used to build the vocabulary, which is then applied to generate a bag-of-words (BoW) representation for an activity (reprinted from [73]).
Figure 12. Various graphical models for activity and object recognition, in which A, O, R and V represent activity, object, RFID and video frame, respectively (reprinted from [42]).
Datasets for activity recognition in egocentric vision.
| Dataset | Description | Citations * |
|---|---|---|
| Activities of Daily Living (ADL) [ ] | Unconstrained: A dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object tracks, hand positions and interaction events. | 143 |
| The University of Texas at Austin Egocentric (UT Ego) Dataset [ ] | Unconstrained: The UT Ego Dataset contains 4 videos captured from head-mounted cameras. Each video is about 3–5 h long, captured in a natural, uncontrolled setting. The Looxcie wearable camera was used, which captures video at 15 fps at 320 × 480 resolution. The videos capture a variety of daily activities. | 134 |
| First-person social interactions dataset [ ] | Unconstrained: This dataset contains day-long videos of 8 subjects spending their day at Disney World Resort in Orlando, Florida. The cameras are mounted on a cap worn by the subjects. Elan annotations are provided for the number of active participants in the scene and the type of activity: walking, waiting, gathering, sitting, buying something, eating, etc. | 100 |
| Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) [ ] | Constrained: Multimodal dataset of 18 subjects cooking 5 different recipes (brownies, pizza, etc.). | 83 |
| Georgia Tech Egocentric Activities (GTEA) [ ] | Constrained: This dataset contains 7 types of daily activities, each performed by 4 different subjects. The camera is mounted on a cap worn by the subject. | 74 |
| Georgia Tech Egocentric Activities-Gaze+ [ ] | Constrained: This dataset consists of 7 meal preparation activities collected using eye-tracking glasses, each performed by 10 subjects. Subjects perform the activities based on given cooking recipes. | 63 |
| EDSH-kitchen [ ] | Unconstrained: A video taken in a kitchenette area while making tea. | 58 |
| Zoombie Dataset [ ] | Unconstrained: This dataset consists of three egocentric videos containing indoor and outdoor scenes where hands are purposefully extended outwards to capture the change in skin colour. | 58 |
| Jet Propulsion Laboratory (JPL) First-person Interaction Dataset [ ] | Constrained: This dataset is composed of human activity videos taken from a first-person viewpoint. It particularly aims to provide first-person videos of interaction-level activities, recording how things visually look from the perspective of a person participating in the interactions. | 46 |
| Intel 42 Egocentric Objects dataset [ ] | Unconstrained: This is a dataset for the recognition of handled objects using a wearable camera. It includes ten video sequences from two human subjects manipulating 42 everyday object instances. Not currently available. | 33 |
| The Hebrew University of Jerusalem (HUJI) EgoSeg Dataset [ ] | Unconstrained: This dataset consists of 29 videos captured from an egocentric camera and annotated in Elan format. The videos prefixed with “youtube*” were downloaded from YouTube; the rest were taken by Hebrew University of Jerusalem researchers and contain various daily activities. | 18 |
| National University of Singapore (NUS) First-person Interaction Dataset [ ] | Unconstrained: 260 videos covering 8 interactions captured from 2 perspectives (third-person and first-person) to create a total of 16 action classes, such as handshake and opening doors, captured with a GoPro camera. | 5 |
| LENA [ ] | Unconstrained: This Google Glass life-logging dataset contains 13 distinct activities performed by 10 different subjects. Each subject recorded 2 clips for each activity, so each activity category has 20 clips; each clip lasts exactly 30 s. The set of activities is: watching videos, reading, using the Internet, walking straight, walking back and forth, running, eating, walking up and down stairs, talking on the phone, talking to people, writing, drinking and housework. | 2 |
* Citation counts obtained from Google Scholar on 20 October 2015.