Kengda Huang1, Wujie Zhou1,2, Meixin Fang2.
Abstract
In recent years, the prediction of salient regions in RGB-D images has become a focus of research. Compared with its RGB counterpart, saliency prediction for RGB-D images is more challenging. In this study, we propose a novel deep multimodal fusion autoencoder for the saliency prediction of RGB-D images. The core trainable autoencoder of the RGB-D saliency prediction model takes two raw modalities (RGB and depth/disparity information) as inputs and their corresponding eye-fixation attributes as labels. The autoencoder comprises four main networks: a color channel network, a disparity channel network, a feature concatenation network, and a feature learning network. It can mine the complex relationship between color and disparity cues and make the most of their complementary characteristics. Finally, the saliency map is predicted via a feature combination subnetwork, which combines the deep features extracted from the prior-learning and convolutional feature-learning subnetworks. We compare the proposed autoencoder with other saliency prediction models on two publicly available benchmark datasets. The results demonstrate that the proposed autoencoder outperforms these models by a significant margin.
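The abstract names four subnetworks but this record gives no layer details, so the following is only a structural sketch in NumPy: `conv_like` and the channel widths (16 and 8) are placeholders, not the paper's architecture. It shows how the two raw modalities could flow through the four stages named above (color branch, disparity branch, concatenation, feature learning) to produce a single saliency map:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_like(x, out_ch):
    """Stand-in for a convolutional layer: a random linear map over
    channels followed by ReLU. Weights are untrained placeholders."""
    w = rng.standard_normal((out_ch, x.shape[0])) * 0.1
    return np.maximum(np.tensordot(w, x, axes=([1], [0])), 0.0)

def predict_saliency(rgb, disparity):
    """Four-stage flow named in the abstract (widths are assumptions)."""
    f_rgb = conv_like(rgb, 16)        # color channel network (sketch)
    f_disp = conv_like(disparity, 16) # disparity channel network (sketch)
    fused = np.concatenate([f_rgb, f_disp], axis=0)  # feature concatenation
    feat = conv_like(fused, 8)        # feature learning network (sketch)
    sal = feat.mean(axis=0)           # collapse channels to one map
    sal -= sal.min()
    return sal / (sal.max() + 1e-8)   # normalize to [0, 1]

rgb = rng.random((3, 32, 32))   # channels-first RGB image
disp = rng.random((1, 32, 32))  # single-channel disparity map
sal_map = predict_saliency(rgb, disp)
print(sal_map.shape)  # (32, 32)
```

The point of the sketch is the data flow, not the weights: each modality is encoded separately before fusion, matching the abstract's claim that the model exploits the complementary characteristics of color and disparity cues.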
Year: 2021 PMID: 34035801 PMCID: PMC8116150 DOI: 10.1155/2021/6610997
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. The architecture of the proposed autoencoder.
The evaluation results of various saliency models.
| Datasets | Criteria | Itti | GBVS | QFT | Fang | Qi | DeepFix | ML-net | DVA | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| NUS | CC | 0.341 | 0.396 | 0.163 | 0.333 | 0.371 | 0.4322 | 0.446 | 0.4549 | 0.5310 |
| | KLDiv | 1.457 | 1.374 | 1.795 | 1.560 | 1.505 | 1.8138 | 1.780 | 2.4349 | 1.2323 |
| | AUC | 0.788 | 0.824 | 0.682 | 0.795 | 0.806 | 0.7699 | 0.766 | 0.7236 | 0.8501 |
| | NSS | 1.236 | 1.441 | 0.568 | 1.209 | 1.357 | 1.6608 | 1.821 | 1.7962 | 2.1195 |
| NCTU | CC | 0.449 | 0.533 | 0.292 | 0.542 | 0.595 | 0.7974 | 0.696 | 0.6834 | 0.8034 |
| | KLDiv | 0.738 | 0.619 | 0.893 | 0.674 | 0.616 | 1.3083 | 0.900 | 1.1045 | 0.3593 |
| | AUC | 0.753 | 0.789 | 0.698 | 0.806 | 0.816 | 0.8650 | 0.835 | 0.8035 | 0.8671 |
| | NSS | 0.978 | 1.184 | 0.695 | 1.264 | 1.373 | 1.8575 | 1.588 | 1.5546 | 1.8405 |
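The four criteria in the table are standard saliency-evaluation metrics. The record does not state which exact variants the paper uses, so the following is a minimal sketch of three of them (CC, KLDiv, NSS) under their common definitions; the AUC variant (e.g., AUC-Judd vs. shuffled AUC) is not specified here, so it is omitted:

```python
import numpy as np

def cc(pred, gt):
    """Linear correlation coefficient (CC) between two saliency maps."""
    return float(np.corrcoef(pred.ravel(), gt.ravel())[0, 1])

def kldiv(pred, gt, eps=1e-12):
    """Kullback-Leibler divergence of the ground-truth density from the
    predicted density; both maps are renormalized to sum to 1."""
    p = pred.ravel() / (pred.sum() + eps)
    g = gt.ravel() / (gt.sum() + eps)
    return float(np.sum(g * np.log(eps + g / (p + eps))))

def nss(pred, fixation_mask):
    """Normalized Scanpath Saliency: mean of the standardized prediction
    at the fixated pixel locations."""
    s = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(s[fixation_mask.astype(bool)].mean())

# Sanity check: a map compared with itself gives CC = 1 and KLDiv near 0.
m = np.arange(16, dtype=float).reshape(4, 4) + 1.0
print(round(cc(m, m), 3))  # 1.0
```

Under these definitions, higher CC, AUC, and NSS and lower KLDiv indicate better agreement with the eye-fixation ground truth, which is the direction of the gains reported for the proposed model in the table.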
Figure 2. The results of various saliency models. (a) RGB. (b) GT. (c) Itti. (d) GBVS. (e) QFT. (f) Fang. (g) Qi. (h) DeepFix. (i) ML-net. (j) DVA. (k) Proposed.
The prediction performances of models A, B, and C, as well as our proposed autoencoder.
| Datasets | Criteria | Model A | Model B | Model C | Proposed |
|---|---|---|---|---|---|
| NUS | CC | 0.5220 | 0.5227 | 0.5097 | 0.5310 |
| | KLDiv | 1.2538 | 1.3408 | 1.5606 | 1.2323 |
| | AUC | 0.8353 | 0.8351 | 0.7841 | 0.8501 |
| | NSS | 2.1198 | 2.1727 | 2.1301 | 2.1195 |
| NCTU | CC | 0.7607 | 0.8043 | 0.7967 | 0.8034 |
| | KLDiv | 0.3900 | 0.4152 | 0.3869 | 0.3593 |
| | AUC | 0.8552 | 0.8641 | 0.8618 | 0.8671 |
| | NSS | 1.7348 | 1.8914 | 1.8227 | 1.8405 |
Figure 3. Some failure cases. (a) RGB. (b) Ground truth. (c) Proposed.
Figure 4. Some failure cases. (a) RGB. (b) GT. (c) DeepFix. (d) ML-net. (e) DVA. (f) Proposed.