Edvard Heikel, Leonardo Espinosa-Leal.
Abstract
Indoor scene recognition and semantic information can be helpful for social robots. Recently, researchers in indoor scene recognition have incorporated object-level information and shown improved performance. In line with these advances, this paper demonstrates that scene recognition can be performed using object-level information alone. A state-of-the-art object detection model was trained to detect objects typically found in indoor environments and then used to detect objects in scene data. These predicted objects were then used as features to predict room categories. This paper successfully combines approaches conventionally used in computer vision and natural language processing (YOLO and TF-IDF, respectively). These approaches could also be useful in embodied research and dynamic scene classification, which we elaborate on.
Keywords: TF-IDF; object detection; scene classification; scene recognition
Year: 2022 PMID: 35893087 PMCID: PMC9330118 DOI: 10.3390/jimaging8080209
Source DB: PubMed Journal: J Imaging ISSN: 2313-433X
Figure 1. Visualization of the pipeline. Top: general diagram of the modules for object detection and room prediction. Bottom: step-by-step scheme. (A) Train YOLO to detect indoor objects. (B) Perform object detection on scene data (examples use IOD155 with conf. thresh = 0.25; the images are from the ADE20K dataset). (C) Transform predicted object labels into TF-IDF input features. (D) Train a classifier to predict the room category from these input features.
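Steps (C) and (D) of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the detections and room labels are made up, the IDF formula is one common smoothed variant, and the nearest-neighbour rule with cosine similarity is a stand-in for the classifier, which is not named here. All function names (`fit_idf`, `tfidf`, `predict_room`) are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every object label seen in training."""
    n = len(docs)
    df = Counter(label for doc in docs for label in set(doc))
    return {label: math.log((1 + n) / (1 + count)) + 1 for label, count in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector (stored as a dict) for one image's detected labels."""
    tf = Counter(label for label in doc if label in idf)  # unseen labels are ignored
    vec = {label: count * idf[label] for label, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {label: v / norm for label, v in vec.items()}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

# Hypothetical detector output: one list of object labels per training image.
train = [
    (["bed", "lamp", "pillow", "pillow"], "bedroom"),
    (["bed", "wardrobe", "lamp"], "bedroom"),
    (["oven", "sink", "refrigerator"], "kitchen"),
    (["sink", "microwave", "oven", "lamp"], "kitchen"),
]
idf = fit_idf([labels for labels, _ in train])
train_vecs = [(tfidf(labels, idf), room) for labels, room in train]

def predict_room(labels):
    """Stand-in classifier: room of the most similar training image."""
    query = tfidf(labels, idf)
    return max(train_vecs, key=lambda item: cosine(query, item[0]))[1]

print(predict_room(["pillow", "bed"]))
print(predict_room(["refrigerator", "sink", "lamp"]))
```

Treating each image as a "document" whose "words" are object labels is what lets a text-retrieval weighting such as TF-IDF down-weight objects that appear in many room types (here, `lamp`) relative to more distinctive ones.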
Evaluating YOLO.
| Model | Precision | Recall | mAP@.50 | mAP@.50:.95 |
|---|---|---|---|---|
| IOD90 | 0.526 | 0.601 | 0.553 | 0.416 |
| IOD155 | 0.455 | 0.469 | 0.417 | 0.309 |
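For readers unfamiliar with the column names, the sketch below illustrates the convention usually meant by mAP@.50 and mAP@.50:.95: a predicted box counts as a match when its intersection over union (IoU) with a ground-truth box reaches a threshold, either 0.50 alone or each of the ten thresholds from 0.50 to 0.95 in steps of 0.05. This is an assumption about the standard COCO-style evaluation, not a description of the authors' evaluation code, and the boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and ground-truth boxes for the same object.
predicted = (0, 0, 10, 10)
ground_truth = (2, 0, 12, 10)

overlap = iou(predicted, ground_truth)

# IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
matched_at = [t for t in thresholds if overlap >= t]

print(f"IoU = {overlap:.3f}")
print("counts as a match at thresholds:", matched_at)
```

Because the stricter thresholds reject loosely localized boxes, mAP@.50:.95 is typically lower than mAP@.50, as it is for both models in the table.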
Figure 2. Scene recognition for IOD90 and IOD155: results across confidence thresholds (0.001, 0.25, 0.50, 0.75) for the validation and test sets. Also shown is whether all detected objects or only single instances (sets of objects) are used to predict the room category.
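The two settings varied in Figure 2 can be illustrated as follows: the confidence threshold decides which detections are kept, and the "all objects" versus "set of objects" option decides whether repeated labels are retained. This is a minimal sketch with made-up detections; `labels_for_image` and its parameters are hypothetical names, not part of any detector's API.

```python
def labels_for_image(detections, conf_thresh, unique=False):
    """Keep labels whose confidence reaches the threshold.

    detections: list of (label, confidence) pairs for one image.
    unique=False keeps every detected instance; unique=True keeps each label once.
    """
    kept = [label for label, conf in detections if conf >= conf_thresh]
    return sorted(set(kept)) if unique else kept

# Hypothetical detector output for one image.
detections = [("chair", 0.91), ("chair", 0.62), ("table", 0.48), ("lamp", 0.02)]

for thresh in (0.001, 0.25, 0.50, 0.75):
    all_objects = labels_for_image(detections, thresh)
    object_set = labels_for_image(detections, thresh, unique=True)
    print(thresh, all_objects, object_set)
```

A very low threshold such as 0.001 passes nearly every candidate detection to the room classifier, while a high threshold keeps only the most certain objects and may leave some images with few or no features.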
Scene recognition: model comparisons (IOD90 and IOD155 at conf. thresh = 0.001).
| Dataset | Model | Top-1 (Val) | Top-1 (Test) | Top-5 (Val) | Top-5 (Test) |
|---|---|---|---|---|---|
| ADE20K | ResNet18+LSA [ | 53.77% | - | 75.65% | - |
| Places365 | VGG [ | 55.24% | 55.19% | 84.91% | 85.01% |
| Places365 | ResNet152 [ | 53.63% | 54.65% | 85.08% | 85.07% |
| Places365-Home | ResNet50 [ | 83.46% | 92.03% | - | - |
| Places365-Home | ResNet50+Word2Vec [ | 83.67% | 93.27% | - | - |
| Places365-Home | CBORM [ | 85.80% | - | - | - |
| Places365-Home | OTS [ | 85.90% | - | - | - |
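The Top-1 and Top-5 columns in the comparison table report top-k accuracy: a prediction counts as correct when the true room category is among the k highest-scoring categories. The sketch below shows the computation on made-up classifier scores; `top_k_accuracy` is a hypothetical helper, and k = 2 is used only because the toy example has three categories.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of images whose true label is among the k highest-scoring classes.

    scores: one dict per image mapping class name to classifier score.
    """
    hits = 0
    for class_scores, truth in zip(scores, true_labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)

# Hypothetical classifier scores for four images.
scores = [
    {"bedroom": 0.7, "kitchen": 0.2, "bathroom": 0.1},
    {"bedroom": 0.5, "kitchen": 0.3, "bathroom": 0.2},
    {"bedroom": 0.1, "kitchen": 0.3, "bathroom": 0.6},
    {"bedroom": 0.1, "kitchen": 0.8, "bathroom": 0.1},
]
true_labels = ["bedroom", "kitchen", "bedroom", "kitchen"]

print("top-1:", top_k_accuracy(scores, true_labels, k=1))
print("top-2:", top_k_accuracy(scores, true_labels, k=2))
```

Top-k accuracy can only stay the same or rise as k grows, which is why the Top-5 figures in the table exceed the corresponding Top-1 figures.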