| Literature DB >> 33005008 |
Dario Zanca, Marco Gori, Stefano Melacci, Alessandra Rufa.
Abstract
Visual attention refers to the human brain's ability to select relevant sensory information for preferential processing, improving performance in visual and cognitive tasks. It proceeds in two phases: one in which visual feature maps are acquired and processed in parallel, and another in which the information from these maps is merged in order to select a single location to be attended for further, more complex computations and reasoning. Its computational description is challenging, especially if the temporal dynamics of the process are taken into account. Numerous methods to estimate saliency have been proposed in the last three decades. They achieve almost perfect performance in estimating saliency at the pixel level, but the way they generate shifts in visual attention fully depends on winner-take-all (WTA) circuitry. WTA is implemented by the biological hardware in order to select the location of maximum saliency, toward which overt attention is directed. In this paper we propose a gravitational model to describe attentional shifts. Every single feature acts as an attractor, and the shifts are the result of the joint effects of the attractors. In the proposed framework, the assumption of a single, centralized saliency map is no longer necessary, though still plausible. Quantitative results on two large image datasets show that this model predicts shifts more accurately than winner-take-all.
Year: 2020 PMID: 33005008 PMCID: PMC7530662 DOI: 10.1038/s41598-020-73494-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
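To make the dynamics described in the abstract concrete, the sketch below treats every pixel of a pre-attentive (saliency) map as a point mass that attracts the gaze, and integrates the resulting damped Newtonian motion with simple Euler steps. This is a minimal illustrative reading of the idea, not the authors' implementation; the function name, damping constant, and step sizes are all made-up choices.

```python
import numpy as np

def gravitational_scanpath(saliency, n_steps=200, dt=0.1,
                           damping=0.5, eps=1e-6):
    """Toy sketch of a gravity-driven gaze trajectory.

    Every pixel of `saliency` acts as an attractor whose mass is its
    saliency value; the gaze moves under the joint pull of all of them.
    Illustrative reading of the paper's idea, not the authors' code.
    """
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    pos = np.array([h / 2.0, w / 2.0])   # start at the image center
    vel = np.zeros(2)
    trajectory = [pos.copy()]
    for _ in range(n_steps):
        dy, dx = ys - pos[0], xs - pos[1]
        dist3 = (dy ** 2 + dx ** 2 + eps) ** 1.5
        # Inverse-square attraction, weighted by the saliency "mass".
        force = np.array([(saliency * dy / dist3).sum(),
                          (saliency * dx / dist3).sum()])
        vel += dt * (force - damping * vel)  # damped Newtonian update
        pos += dt * vel
        pos = np.clip(pos, [0, 0], [h - 1, w - 1])
        trajectory.append(pos.copy())
    return np.array(trajectory)
```

A fuller model would also need some mechanism, analogous to inhibition of return, to reduce the mass of already-attended attractors so that the gaze keeps exploring rather than settling on a single equilibrium.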
Results on MIT1003.
| Model | Pre-attentive maps | SED | TDE | STDE |
|---|---|---|---|---|
| GRAV | Basic | **7.68 (0.65)** | | |
| GRAV | Itti | | **228.08 (76.97)** | **0.79 (0.06)** |
| WTA | Basic | 8.41 (0.50) | 425.27 (66.87) | 0.65 (0.04) |
| WTA | Itti | 8.41 (0.49) | 417.12 (65.99) | 0.66 (0.04) |
In bold, the best results on average; standard deviations are given in round brackets. Note that each model's performance is essentially unchanged across the different sets of pre-attentive features. The gravitational model performs better than the winner-take-all model on every metric.
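For contrast, the WTA baseline used in both tables can be paraphrased as: repeatedly select the global maximum of the saliency map, then suppress a neighborhood around it (inhibition of return) so that the next winner lies elsewhere. A minimal sketch in the style of the classical Itti-Koch circuitry; the inhibition radius is an arbitrary illustrative choice:

```python
import numpy as np

def wta_scanpath(saliency, n_fixations=10, inhibition_radius=30):
    """Winner-take-all scanpath with inhibition of return.

    Each fixation lands on the current global saliency maximum, which
    is then zeroed out so the next winner is a different location.
    Sketch of the classical baseline, not the paper's code.
    """
    sal = saliency.astype(float).copy()
    h, w = sal.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        fixations.append((y, x))
        # Inhibition of return: suppress a disk around the winner.
        sal[(ys - y) ** 2 + (xs - x) ** 2 <= inhibition_radius ** 2] = 0.0
    return fixations
```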
Results on CAT2000.
| Model | Pre-attentive maps | SED | TDE | STDE |
|---|---|---|---|---|
| GRAV | Basic | **13.81 (2.01)** | | |
| GRAV | Itti | | **458.76 (110.79)** | |
| WTA | Basic | 14.48 (2.07) | 762.99 (100.94) | 0.66 (0.03) |
| WTA | Itti | 14.48 (2.07) | 766.06 (101.92) | 0.66 (0.03) |
In bold, the best results on average; standard deviations are given in round brackets.
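As a reading aid for both tables: SED is a string edit distance between scanpaths (lower is better), while TDE and STDE denote the time-delay embedding distance and its scaled variant (for STDE, higher is better, consistent with GRAV outperforming WTA above). A minimal SED sketch following the common recipe of discretizing fixations on a grid and computing a Levenshtein distance; the grid size here is an arbitrary choice:

```python
import numpy as np

def string_edit_distance(scanpath_a, scanpath_b, shape, grid=(5, 5)):
    """SED sketch: map each fixation (y, x) to a grid-cell symbol,
    then compute the Levenshtein distance between the symbol strings.
    Illustrative, not the paper's exact evaluation code."""
    def to_string(scanpath):
        pts = np.asarray(scanpath, dtype=float)
        gy = np.clip((pts[:, 0] * grid[0] / shape[0]).astype(int), 0, grid[0] - 1)
        gx = np.clip((pts[:, 1] * grid[1] / shape[1]).astype(int), 0, grid[1] - 1)
        return [y * grid[1] + x for y, x in zip(gy, gx)]
    a, b = to_string(scanpath_a), to_string(scanpath_b)
    # Standard dynamic-programming Levenshtein distance.
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]
```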
Figure 1. Example of a simulated scanpath. This example shows a borderline case in which the scanpath generated with WTA is unnatural because it focuses exclusively on borders with high center-surround differences. The GRAV approach, in contrast, makes it possible to generate fixations on the center of mass. Consequently, the large amount of variation in the random noise on the right makes it more interesting than the square on the left.
Figure 2. Field network. This network realizes the computation of a quantity proportional to the functional associated with the gravitational field. The black line at the top is illustrative: it shows a qualitative example of the distribution of cones in the retina. The maximum corresponds to the center of the fovea; the characteristic blind spot is also illustrated.
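The caption's "functional associated with the gravitational field" is not spelled out in this record; a plausible form, in our own notation rather than the paper's, is the standard superposition of inverse-square attractions:

$$\mathbf{E}(\mathbf{x}) \;=\; \sum_{i} m_i \, \frac{\mathbf{x}_i - \mathbf{x}}{\lVert \mathbf{x}_i - \mathbf{x} \rVert^{3}}$$

where $m_i$ is the saliency "mass" located at pixel $\mathbf{x}_i$ and $\mathbf{x}$ is the current gaze position. The retinal cone-density profile sketched in the figure would then act as a fovea-centered weighting on the $m_i$.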
Collection of datasets.
| Dataset name | Details |
|---|---|
| MIT1003 | This dataset contains 1003 natural indoor and outdoor scenes. They are sampled with variable sizes, where each dimension is in [405, 1024] px. The database contains 779 landscape images and 228 portrait images. Fixations of 15 human subjects are provided for 3 s of free-viewing observation. |
| CAT2000 | A collection of 2000 images is provided as the training portion of this dataset. Semantic content varies widely across twenty different categories. The resolution of the images is 1920 × 1080 px. |
To ensure a proper evaluation of the proposed model, a large collection of static images from two different datasets has been used. Eye-tracking data were collected in a free-viewing setup. Details for each dataset are given in the right column.