David Santos, Eric Lopez-Lopez, Xosé M. Pardo, Roberto Iglesias, Senén Barro, Xosé R. Fdez-Vidal.
Abstract
Scene recognition is still a very important topic in many fields, and that is certainly the case in robotics. Nevertheless, this task is view-dependent, which implies the existence of preferable directions when recognizing a particular scene; both human and computer vision-based classification often turn out to be biased towards such directions. In our case, instead of trying to improve the generalization capability across different view directions, we have opted for the development of a system capable of filtering out noisy or meaningless images while retaining those views from which correct identification of the scene is likely feasible. Our proposal works with a heuristic metric based on the detection of key points in 3D meshes (Harris 3D). This metric is later used to build a model that combines a Minimum Spanning Tree (MST) and a Support Vector Machine (SVM). We have performed an extensive number of experiments through which we have addressed (a) the search for efficient visual descriptors, (b) the analysis of the extent to which our heuristic metric resembles human criteria for relevance and, finally, (c) the experimental validation of our complete proposal. In the experiments, we have used both a public image database and images collected at our research center.
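As a very rough, hypothetical sketch of the pipeline described above: score each candidate view with a relevance metric, keep only views scoring above a threshold, and pass the survivors to a trained scene classifier. All names and the threshold here are placeholders; the paper's actual metric relies on Harris 3D key points on 3D meshes, and the filtering is driven by an MST rather than a fixed threshold.

```python
# Hypothetical pipeline skeleton: filter views by a relevance score, then
# label the survivors. The metric, threshold, and classifier are stand-ins.

def filter_and_classify(views, relevance, classify, threshold=0.5):
    """Return (label, view) pairs for the views deemed relevant enough."""
    kept = [v for v in views if relevance(v) >= threshold]
    return [(classify(v), v) for v in kept]

# Toy stand-ins: each view is (name, score); the "classifier" echoes the name.
views = [("blurry", 0.1), ("doorway", 0.7), ("kitchen", 0.9)]
result = filter_and_classify(views,
                             relevance=lambda v: v[1],
                             classify=lambda v: v[0])
print(result)  # only the two sufficiently relevant views are labeled
```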
Keywords: image collection summarization; meaningful images; scene recognition
Year: 2019 PMID: 31540453 PMCID: PMC6767273 DOI: 10.3390/s19184024
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. (a) Concentrated key points in an image with few objects. (b) High scattering of key points in an extensive view.
Figure 2. (a) A narrow view showing many key points on many close objects. (b) A close view, but with high scattering of interest points in 3D.
Figure 3. Representation of an RGB-D image as a cube. The XY plane is divided into a set of M regions.
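One way to quantify the key-point scattering contrasted in Figures 1 and 2 is to count how many cells of a grid over the XY plane (as in Figure 3) contain at least one key point. The sketch below is illustrative only; the grid size, normalization, and 2D simplification are assumptions, not the paper's exact Harris 3D metric.

```python
# Hypothetical scattering measure: fraction of grid cells over the XY plane
# that contain at least one key point. Grid size m is an assumption.

def grid_coverage(keypoints, width, height, m=4):
    """Fraction of the m x m grid cells occupied by at least one key point."""
    occupied = set()
    for x, y in keypoints:
        cx = min(int(x * m / width), m - 1)   # column index of the cell
        cy = min(int(y * m / height), m - 1)  # row index of the cell
        occupied.add((cx, cy))
    return len(occupied) / (m * m)

# Concentrated key points occupy few cells; scattered ones occupy many.
concentrated = [(10, 10), (12, 11), (11, 13)]
scattered = [(10, 10), (300, 20), (50, 400), (600, 450)]
print(grid_coverage(concentrated, 640, 480))  # low coverage
print(grid_coverage(scattered, 640, 480))     # higher coverage
```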
Figure 4. Example of a Minimum Spanning Tree.
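For reference, a Minimum Spanning Tree like the one in Figure 4 can be computed with Prim's algorithm. The graph below is a made-up toy example; the paper does not specify which MST algorithm it uses.

```python
# Prim's algorithm over a small undirected weighted graph (illustrative data).
import heapq

def minimum_spanning_tree(adj, start):
    """adj: {node: [(weight, neighbor), ...]} for an undirected graph.
    Returns (total_weight, list of (u, v) tree edges)."""
    visited = {start}
    heap = [(w, start, v) for w, v in adj[start]]
    heapq.heapify(heap)
    total, tree = 0, []
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue                      # edge closes a cycle; skip it
        visited.add(v)
        total += w
        tree.append((u, v))
        for w2, nxt in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return total, tree

graph = {
    "A": [(1, "B"), (4, "C")],
    "B": [(1, "A"), (2, "C"), (6, "D")],
    "C": [(4, "A"), (2, "B"), (3, "D")],
    "D": [(6, "B"), (3, "C")],
}
total, tree = minimum_spanning_tree(graph, "A")
print(total, tree)  # 6 [('A', 'B'), ('B', 'C'), ('C', 'D')]
```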
Performance of an SVM classifier using different feature representations. These results are the average over four test sets (4-fold cross-validation). The top-3 scores are shown in bold.
| Representation | All Images (11 Classes) |
|---|---|
| LSBP | 78.75% |
| LDBP | 80.78% |
| LMBP | 59.54% |
| Gist | 77.35% |
| SURF | 77.48% |
| LMBP+SURF | 79.07% |
| LBP+SURF | |
| Resnet | |
| VGG | |
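The evaluation protocol behind the table above (an SVM scored with 4-fold cross-validation, averaging the fold accuracies) can be sketched with scikit-learn. The dataset and kernel below are stand-ins, not the paper's descriptors or tuning.

```python
# Sketch of 4-fold cross-validation of an SVM; iris is a placeholder dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=4)  # one accuracy per fold
print(scores, scores.mean())  # the reported figure is the mean of the folds
```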
Figure 5. Original test image on the left, and new test image, obtained from the original one by mirror reflection, on the right.
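Mirror-reflected test images like those in Figure 5 amount to a horizontal flip of the image array. A minimal sketch, assuming H x W x C arrays; the image data is illustrative.

```python
# Horizontal (left-right) mirror reflection of an image array.
import numpy as np

def mirror(image):
    """Left-right reflection of an H x W x C image array."""
    return image[:, ::-1, :]

img = np.arange(12).reshape(2, 2, 3)             # tiny 2x2 RGB "image"
assert np.array_equal(mirror(mirror(img)), img)  # reflecting twice restores it
print(mirror(img).shape)
```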
Performance of an SVM classifier with different feature representations, using test sets augmented with mirror-symmetric images.
| Feature Representation | 11 Classes |
|---|---|
| Census Transform (LSBP) | 72.12% |
| LDBP (LSBP+LMBP) | 73.62% |
| LMBP | 53.80% |
| LDBP (LSBP+LMBP)+SURF | |
| Resnet | |
| VGG | |
Summary of the second data set and participants in the poll.
| Class | Images | Voters |
|---|---|---|
| 0 (Instr. Lab.) | 61 | 30 |
| 1 (Laboratories) | 92 | 30 |
| 2 (Common staff areas, floors 1 and 2) | 144 | 30 |
| 3 (Common staff areas, ground floor) | 122 | 29 |
| 4 (Kitchen) | 96 | 30 |
Figure 6. Snapshot of the screen taken during the second experiment, while a poll was being carried out.
Figure 7. Score accumulated by the images in the first third, second third and last third of the ranking obtained using our metric.
Comparison of the performance achieved with models trained using either all the images available or a subset of images selected according to their coverage. The performance shown is the one achieved when the SVM classifies all testing images.
| Training Set | LBP+SURF | Resnet | VGG |
|---|---|---|---|
| all train | | | |
Comparison of the performance achieved with models trained using either all the images available or a subset of images selected according to their coverage. The performance shown is the one achieved when the SVM classifies the Q1-testing images.
| Training Set | LBP+SURF | Resnet | VGG |
|---|---|---|---|
| all train | | | |
Performance achieved with the combination MST+SVM for scene recognition described in Section 4. The SVM was trained using only the most relevant training images. The first column shows the results without the MST (the test images are used after all of them have been sorted using the coverage metric). The second and third columns show the role of the MST in automating the filtering of images; no sorting of the test set was carried out in these cases. We can also see how many of the 293 relevant images pass the MST filtering. The percentages are given with respect to the total number of test-set images that have been labeled by the SVM.
| Descriptor | Sorted Test (No MST) | Restrictive Filter | Relaxed Filter | All Test | All Training and All Test |
|---|---|---|---|---|---|
| training images | 170 | 170 | 170 | 170 | 515 |
| labeled images | 293 | 76 | 205 | 884 | 884 |
| LBP+SURF | right = 270, wrong = 23 (7.8%) | right = 68, wrong = 8 (10.5%) | right = 184, wrong = 21 (10.2%) | right = 707, wrong = 177 (20.0%) | right = 786, wrong = 98 (11.1%) |
| Resnet | right = 271, wrong = 22 (7.5%) | right = 69, wrong = 7 (9.2%) | right = 173, wrong = 32 (15.6%) | right = 681, wrong = 203 (23.0%) | right = 785, wrong = 99 (11.2%) |
| VGG | right = 259, wrong = 34 (11.6%) | right = 70, wrong = 6 (7.9%) | right = 183, wrong = 22 (10.7%) | right = 680, wrong = 204 (23.1%) | right = 756, wrong = 128 (14.5%) |