Carlos Fernando Crispim-Junior, Alvaro Gómez Uría, Carola Strumia, Michal Koperski, Alexandra König, Farhood Negin, Serhan Cosar, Anh Tuan Nghiem, Duc Phu Chau, Guillaume Charpiat, Francois Bremond.
Abstract
Visual activity recognition plays a fundamental role in several research fields as a way to extract semantic meaning from images and videos. Prior work has mostly focused on classification tasks, where a label is given for a video clip. However, real-life scenarios require a method to browse a continuous video flow, automatically identify relevant temporal segments and classify them according to target activities. This paper proposes a knowledge-driven event recognition framework to address this problem. The novelty of the method lies in the combination of a constraint-based ontology language for event modeling with robust algorithms to detect, track and re-identify people using color-depth sensing (Kinect® sensor). This combination makes it possible to model and recognize longer, more complex events and to incorporate domain knowledge and 3D information into the same models. Moreover, the ontology-driven approach enables human understanding of system decisions and facilitates knowledge transfer across different scenes. The proposed framework is evaluated on real-world recordings of seniors carrying out unscripted, daily activities in hospital observation rooms and nursing homes. Results demonstrate that the proposed framework outperforms state-of-the-art methods across a variety of activities and datasets, and that it is robust to variable and low-frame-rate recordings. Further work will investigate how to extend the proposed framework with uncertainty management techniques to handle strong occlusion and ambiguous semantics, and how to exploit it to further support clinicians in the timely diagnosis of cognitive disorders, such as Alzheimer's disease.
Keywords: activities of daily living; activity recognition; assisted living; color-depth sensing; complex events; knowledge representation; people detection and tracking; senior monitoring
Year: 2017 PMID: 28661440 PMCID: PMC5539795 DOI: 10.3390/s17071528
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Knowledge-driven framework for visual event recognition. Firstly, (0) an estimation of the ground plane is computed using the vertices of semantic zones; (1) video frame acquisition is performed using a color-depth sensor. Then, (2) the people-detection module analyzes the video frame for instances of physical objects of type person; for each instance found, it adjusts the person's height using ground-plane information. (3) The tracking step analyzes the set of detected people in the current and previous frames for appearance matching and trajectory estimation; (4) event recognition takes as input the information from all previous steps and evaluates which event models in its knowledge base are satisfied. Recognized events are added to each tracked person's history, and steps 2–4 are then repeated for the next frame. (A) Prior knowledge about the problem corresponds to semantic information about the scene geometry; (B) the knowledge base corresponds to the set of events of interest.
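The per-frame loop described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: the function and class names (`detect_people`, `track`, `recognize_events`, the dictionary-based frame format) are hypothetical stand-ins, not the authors' actual API, and detection/appearance matching are reduced to trivial placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    pid: int
    position: tuple                       # (x, y) on the estimated ground plane
    height: float                         # height corrected with ground-plane info
    events: list = field(default_factory=list)

def detect_people(frame, ground_plane):
    """Step 2: find person instances and adjust height using the ground plane."""
    return [Person(pid=d["id"], position=d["pos"],
                   height=d["h"] - ground_plane["offset"])
            for d in frame["detections"]]

def track(previous, current):
    """Step 3: match current detections to the previous frame (appearance
    matching in the real system; matched by id here) and carry the
    recognized-event history forward."""
    prev = {p.pid: p for p in previous}
    for p in current:
        if p.pid in prev:
            p.events = prev[p.pid].events
    return current

def recognize_events(people, event_models):
    """Step 4: evaluate which event models are satisfied for each person
    and append recognized events to that person's history."""
    for p in people:
        for name, model in event_models.items():
            if model(p):
                p.events.append(name)
    return people

def run(frames, ground_plane, event_models):
    """Repeat steps 2-4 for every acquired frame (step 1)."""
    tracked = []
    for frame in frames:
        detected = detect_people(frame, ground_plane)
        tracked = track(tracked, detected)
        tracked = recognize_events(tracked, event_models)
    return tracked
```

Event models are represented here as plain predicates over a tracked person; in the paper they are constraint-based ontological models (Figure 2).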
Figure 2. Video event ontology language. Three main concept branches are defined: physical objects, video events and constraints. Physical objects are abstractions of real-world objects. Video events describe the types of event templates available for activity modeling. Constraints describe the relations among physical objects and among an activity's components (sub-events).
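The three ontology branches in Figure 2 suggest a simple data structure for an event model: the physical objects it binds, the sub-events (components) it is built from, and the constraints relating them. The sketch below is hypothetical; the class name, fields and lambda-based constraint syntax are illustrative and do not reproduce the paper's actual ontology language.

```python
from dataclasses import dataclass

@dataclass
class EventModel:
    name: str
    physical_objects: list   # abstractions of real-world objects, e.g. a person, a zone
    components: list         # sub-events that must already be recognized
    constraints: list        # predicates over the bound physical objects

    def satisfied(self, bindings, history):
        """An event is recognized when all its sub-events appear in the
        person's history and all constraints hold on the bindings."""
        return (all(c in history for c in self.components)
                and all(constraint(bindings) for constraint in self.constraints))

# Illustrative model: a "Prepare drink" event requiring the person to be in
# the kitchen zone and close to the counter (names and threshold invented).
prepare_drink = EventModel(
    name="Prepare drink",
    physical_objects=["person", "zone:kitchen"],
    components=["In_kitchen_zone"],
    constraints=[lambda b: b["person"]["distance_to_counter"] < 1.0],
)
```

Because models are declarative data rather than trained weights, a recognized event can be explained by listing which components and constraints fired, which is what enables the human-understandable decisions and cross-scene knowledge transfer claimed in the abstract.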
Figure 3Physical objects integrate 3D visual information into the ontological events.
Figure 4. Contextual zones define geometric regions (red polygons, CHUN dataset) that carry semantic information about daily activities.
Figure 5Monitored scene at the nursing home apartment: (A) living area camera displays an “exit restroom” event and (B) bed area camera displays an “enter in bed” event.
Recognition of IADLs (CHUN dataset), F1-score (%).
| Event | DT-HOG | DT-HOF | DT-MBH | Proposed |
|---|---|---|---|---|
| Prepare drink | 58.61 | 47.33 | 63.09 | 74.07 |
| Prepare drug box | 60.14 | 70.97 | 27.59 | 90.91 |
| Read | 51.75 | 56.26 | 65.87 | 83.33 |
| Search bus line | 66.67 | 63.95 | 42.52 | 60.00 |
| Talk on telephone | 92.47 | 46.62 | 72.61 | 95.00 |
| Water plant | 42.58 | 13.08 | 24.83 | 72.22 |
| Mean ± SD | 62.0 ± 17.0 | 49.7 ± 20.3 | 49.4 ± 20.6 | 79.3 ± 13.0 |
SD: standard deviation of the mean.
Recognition of a physical task in the CHUN dataset.
| Task | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|
| Walking 8 m | 90.75 | 93.10 | 91.91 |
N: 58 participants; 7 min. each; Total: 406 min.
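The third score column in these tables is the harmonic mean of recall and precision, i.e. the F1-score; a quick check for the "Walking 8 m" row:

```python
def f1(recall, precision):
    """F1-score: harmonic mean of recall and precision (both in %)."""
    return 2 * recall * precision / (recall + precision)

print(round(f1(90.75, 93.10), 2))  # → 91.91, matching the reported value
```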
Recognition of IADLs in the CHUN dataset.
| IADL | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|
| Prepare drink | 89.4 | 71.9 | 79.7 |
| Prepare drug box | 95.4 | 95.4 | 95.4 |
| Talk on telephone | 89.6 | 86.7 | 88.1 |
| Water plant | 74.1 | 69.0 | 71.5 |
| Mean | 87.1 | 81.0 | 85.3 |
N: 45 participants; 15 min. each; Total: 675 min.
Recognition of IADLs (GAADRD dataset), F1-score (%).
| Event | DT-HOG | DT-HOF | DT-MBH | Proposed |
|---|---|---|---|---|
| Account Balance | 44.96 | 34.71 | 42.98 | 66.67 |
| Prepare Drink | 81.66 | 44.87 | 52.00 | 100.00 |
| Prepare Drug Box | 14.19 | 0.00 | 0.00 | 57.14 |
| Read Article | 52.10 | 42.86 | 33.91 | 63.64 |
| Talk on telephone | 82.35 | 0.00 | 33.76 | 100.00 |
| Turn on radio | 85.71 | 42.52 | 58.16 | 94.74 |
| Water Plant | 0.00 | 0.00 | 0.00 | 52.63 |
| Mean ± SD | 51.8 ± 34.4 | 23.6 ± 22.3 | 31.5 ± 23.3 | 76.4 ± 21.0 |
Recognition of events in the nursing home dataset.
| Event | Recall D1 (%) | Precision D1 (%) | Recall D2 (%) | Precision D2 (%) | Recall D3 (%) | Precision D3 (%) |
|---|---|---|---|---|---|---|
| Enter restroom | 100.0 | 100.0 | 100.0 | 84.2 | 61.7 | 100.0 |
| Exit restroom | 100.0 | 34.8 | 100.0 | 41.0 | 100.0 | 81.4 |
| Leave room | 91.1 | 100.0 | 63.0 | 100.0 | 96.7 | 100.0 |
| Enter room | 79.7 | 100.0 | 61.1 | 100.0 | 98.3 | 100.0 |
| Sit in armchair | 100.0 | 100.0 | 87.5 | 100.0 | 100.0 | 45.4 |
| Mean (living-area camera) | 94.2 | 87.0 | 82.3 | 85.0 | 91.3 | 85.4 |
| Enter bed | 100.0 | 100.0 | 100.0 | 62.5 | 100.0 | 77.8 |
| Bed exit | 50.0 | 100.0 | 100.0 | 100.0 | 100.0 | 77.8 |
| Mean (bed-area camera) | 75.0 | 100.0 | 100.0 | 81.2 | 100.0 | 77.8 |
N: 1 participant, 72 h of recording per sensor.