Dario Cazzato, Marco Leo, Cosimo Distante, Holger Voos.
Abstract
The automatic detection of eye positions, their temporal consistency, and their mapping into a line of sight in the real world (i.e., finding where a person is looking) is referred to in the scientific literature as gaze tracking. This has become a very hot topic in the field of computer vision during the last decades, with a surprising and continuously growing number of application fields. A very long journey has been made since the first pioneering works, and the continuous search for more accurate solutions has been further boosted in the last decade, when deep neural networks revolutionized the whole machine learning area, and gaze tracking as well. In this arena, it is increasingly useful to find guidance in survey/review articles that collect the most relevant works, state the pros and cons of existing techniques, and introduce a precise taxonomy. Such manuscripts allow researchers and technicians to choose the best way to move towards their application or scientific goals. The literature contains holistic and specifically technological survey documents (even if not up to date), but, unfortunately, no overview discusses how the great advancements in computer vision have impacted gaze tracking. This work represents an attempt to fill that gap. It also introduces a wider point of view that leads to a new taxonomy (extending the consolidated ones) by considering gaze tracking as a more exhaustive task that aims at estimating the gaze target from different perspectives: from the eye of the beholder (first-person view), from an external camera framing the beholder's face, from a third-person view looking at the scene in which the beholder is placed, and from an external view independent of the beholder.
Keywords: computer vision; gaze estimation; gaze tracking; review; survey
Year: 2020 PMID: 32635375 PMCID: PMC7374327 DOI: 10.3390/s20133739
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Four examples of gaze tracking systems.
Figure 2. A block diagram of a typical gaze estimation framework.
Figure 3. The new proposed taxonomy for categorizing gaze tracking techniques. It considers two different branches related to what is framed in the acquired image: the eye/face of the beholder, or the scene observed by the beholder.
Figure 4. An example of a third-person view in which several people are in the scene. Unlike gaze estimation from a single user facing a sensor, here the sensors are placed ecologically in the environment. The task comes with many challenges arising from the need to consider postural, relational, semantic, and contextual information.
Figure 5. A block diagram of a typical gaze tracking solution based on the analysis of the eye/face.
Figure 6. Scheme of an end-to-end system based on the analysis of the eye/face.
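To make the end-to-end scheme of Figure 6 concrete, the following is a minimal sketch (not the authors' model; the layer sizes and the 36×60 eye-crop size are illustrative assumptions) of a CNN that regresses the two gaze angles directly from a normalized eye patch:

```python
# Minimal end-to-end gaze regressor sketch (PyTorch). All layer sizes are
# illustrative; this is not the architecture of any specific cited work.
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor over a 1 x 36 x 60 grayscale eye crop.
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected head mapping the features to (yaw, pitch).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 6 * 12, 500), nn.ReLU(),
            nn.Linear(500, 2),
        )

    def forward(self, x):
        return self.head(self.features(x))

model = GazeRegressor()
eyes = torch.randn(8, 1, 36, 60)  # a batch of normalized eye crops
angles = model(eyes)              # shape (8, 2): yaw and pitch
```

Trained with a simple regression loss (e.g., mean squared error) against ground-truth angles, such a module replaces the separate feature-extraction and gaze-mapping stages of Figure 5 with a single learned mapping.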
A summary of relevant datasets for gaze estimation. cont. stands for continuous, E/F for eye/face, pos. for position, coord. for coordinates, img. for images.
| Dataset | Taxonomy | Type of Content | # People | # Head Poses | User Distance | Gaze Target | # Images | Data Type | Resolution |
|---|---|---|---|---|---|---|---|---|---|
| Columbia [ | Beholder E/F | facial img. (lab setup) | 56 | 5 | 2 m | 21 | 5880 | color | |
| EYEDIAP [ | Beholder E/F | facial img. (lab setup) | 16 | cont. | 0.8–1.2 m | cont. | videos (>4 h) | color + depth | |
| UT Multiview [ | Beholder E/F | eye region img. + synthetic 3D model (lab setup) | 50 | 8 | 0.6 m | 160 | 64,000 | color | |
| MPIIGaze [ | Beholder E/F | facial img. (consumer camera) | 15 | cont. | 0.4–0.6 m | cont. | 213,659 | color | |
| TabletGaze [ | Beholder E/F | facial img. (consumer camera) | 51 | cont. | 0.3–0.5 m | 35 | videos (∼24 h) | color | |
| GazeCapture [ | Beholder E/F | facial img. (consumer camera) | 1474 | cont. | very close | cont. | 2,445,504 | color | |
| RT-GENE [ | Beholder E/F | facial + wearable device img. (lab setup) | 15 | cont. | 0.5–2.9 m | cont. | 122,531 | color + depth | |
| SynthesEyes [ | Beholder E/F | synthetic eye patches | n.a. | n.a. | adjustable | cont. | 11,382 | color | |
| UnityEyes [ | Beholder E/F | synthetic eye patches | n.a. | n.a. | adjustable | cont. | 1,000,000 | color | |
| GazeGAN [ | Scene Saliency | scene + fixations | 10 | free-viewing | 0.6 m | cont. (24” display) | 1900 | color | |
| DHF1K [ | Scene Saliency | scene + fixations | 17 | headrest constrained | 0.68 m | cont. (19” display) | 1 k videos (600 k frames) | color | |
| SALICON [ | Scene Saliency | scene + visual attention | 16 | free-viewing | variable | cont. | 10,000 images | color | |
| EMOd [ | Scene Saliency | scene + eye movements + emotions + objects + semantics | 16 | free-viewing | variable | cont. (22” display) | 1019 | color | |
| EGTEA Gaze+ [ | Egocentric Vision | scene + 2D gaze point on the frame | 32 | free-viewing | variable | cont. | videos (29 h) | color | |
| GazeFollow [ | Third-person view | scene + head pos. + target pos. | 130,339 | free-viewing | variable | cont. | 122,143 images | color | var. |
| Gaze360 [ | Third-person view | subject + target + 3D gaze | 238 | free-viewing | variable | cont. | 129 K training, 17 K validation, 26 K test | color | |
| Object Referring [ | Third-person view | object bounding box + language description + gazes | 20 | free-viewing | variable | cont. | 5000 stereo videos | color | |
| VideoGaze [ | Third-person view | scene + target + gaze annotations | 47,456 | free-viewing | variable | cont. | 140 movies | color | var. |
| VideoCoAtt [ | Third-person view | heads + gaze directions | n.a. | free-viewing | variable | cont. | 380 videos (492,100 frames) | color | |
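Many of the datasets above annotate the gaze target as a 3D point (e.g., on a display) rather than as a direction. Below is a hedged sketch of the usual conversion, assuming both the eye centre and the target are expressed in camera coordinates; the function names and the example values are made up for illustration:

```python
import numpy as np

def gaze_direction(eye_center, target):
    """Unit gaze vector pointing from the 3D eye centre to the 3D target."""
    v = np.asarray(target, dtype=float) - np.asarray(eye_center, dtype=float)
    return v / np.linalg.norm(v)

def to_yaw_pitch(g):
    """One common convention for turning a unit gaze vector into angles (degrees)."""
    yaw = np.degrees(np.arctan2(-g[0], -g[2]))
    pitch = np.degrees(np.arcsin(-g[1]))
    return yaw, pitch

eye = [0.0, 0.0, 0.0]           # eye centre in metres (camera frame, illustrative)
target = [0.05, -0.02, -0.55]   # fixated point, e.g., on the display
print(to_yaw_pitch(gaze_direction(eye, target)))
```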
A schematic representation of the most important works cited in this document. For each entry, we sum up the taxonomy, technical requirements, other constraints, and performance. The main datasets used by the authors in the experimental phase are reported. If no dataset is reported, experiments were done on data not made publicly available or proprietary. Reproducibility in the table means that the code has been made available by the authors and the presented results have been gathered on publicly available data; calib. stands for calibration, h.p. for head pose, DL for deep learning.
| Method | Taxonomy | Tech. Req. | Speed/Load | Constraints | Reproducibility | Metric | Notes |
|---|---|---|---|---|---|---|---|
| [ | Feature | Head-mounted | | person calib. | | | |
| [ | Feature | RGB Camera | 28 fps | person calib., fixed h.p. | | | |
| [ | Feature | RGB Camera | | | | 3.63–6.03, 2.86–4.76 | |
| [ | Geometric | RGB Camera | | person calib. | | 2.18 | |
| [ | Geometric | RGBD Camera | 12 fps | person calib. | | 1.38–2.71 | |
| [ | Geometric | RGBD Camera | 8.66 fps | | | | |
| [ | Geometric | RGBD Camera | - | person calib. (1 point) | | | |
| [ | End-to-end | RGB Camera | 10–15 fps | | ✓ | 2.58 cm (iTracker), 3.63 cm (MPIIGaze), 3.17 cm (TabletGaze) | analysis of calib./non-calib. cases |
| [ | End-to-end | RGB Camera | | | ✓ | 6.3 | |
| [ | End-to-end | RGB Camera | 1000 fps | | | 5.6 | face alignment time to be considered |
| [ | End-to-end | RGB Camera | | | ✓ | <5 | also 2D error analysis computed in the manuscript |
| [ | End-to-end | RGB Camera | | | ✓ | | |
| [ | End-to-end | RGB Camera | 25.3 fps | | ✓ | | results with inpainting during the training |
| [ | End-to-end | RGB Camera | 666 fps | person calib. | | | analysis on eye patch |
| [ | End-to-end | RGB Camera | | | | | |
| [ | Feature | RGBD Camera | | | | | complete invariance analysis in the manuscript |
| [ | End-to-end | RGB Camera | | | ✓ | 3.4 | temporal model |
| [ | First-person | Glasses | | | | AUC, AAE | temporal dynamics |
| [ | First-person | Glasses | | | ✓ | AUC, AAE | DL for temporal dynamics and image regions |
| [ | First-person | Glasses | 13–18 fps | | | AAE | gaze as a classification and regression problem |
| [ | First-person | Glasses | | | | AAE | unsupervised modelling of spatial-temporal features |
| [ | Visual Saliency | RGB Scene | | | | AUC | empirical parameter settings |
| [ | Visual Saliency | RGB (Scene + Camera) | | fixed h.p. and positions | | AUC | top-down and bottom-up maps |
| [ | Visual Saliency | RGBD Scene | | | | AUC | depth, color, motion are learned |
| [ | Visual Saliency | RGB Scene | | | | sAUC, AUC-j, NSS, IG | cross-disciplinary priors |
| [ | Visual Saliency | RGB Scene | | | ✓ | AUC, NSS | multiple learned priors and features |
| [ | Visual Saliency | RGB Scene | | | | 3.8 cm | adaptable to any CNN model |
| [ | Third-person | | 10 fps | only h.p. | | <0.5 rad | realistic settings |
| [ | Third-person | RGB Scene | | | ✓ | AUC | single picture |
| [ | Third-person | RGB Scene | | | | AUC | cross-domain, handles gaze out of frame |
| [ | Third-person | RGB Scene | | | | (mean) Acc@1 | object referring |
| [ | Third-person | RGB Scene | | | ✓ | | shared attention, time-varying attention targets |
| [ | Third-person | RGB Scene | | | | | end-to-end, joint attention videos |
| [ | Third-person | RGB (Scene + Camera) | | | | AUC | gaze across views |
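The numeric entries in the Metric column appear to be mean angular errors in degrees (entries in cm are instead Euclidean distances on the screen plane; AUC, NSS, and AAE are standard saliency/gaze metrics). A short sketch of the angular-error computation, using made-up vectors:

```python
import numpy as np

def mean_angular_error(pred, gt):
    """Mean angle in degrees between rows of two (N, 3) gaze-vector arrays."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

pred = np.array([[0.00, 0.0, -1.0], [0.10, 0.0, -1.0]])  # predicted gaze vectors
gt   = np.array([[0.02, 0.0, -1.0], [0.00, 0.0, -1.0]])  # ground-truth vectors
print(mean_angular_error(pred, gt))  # a few degrees
```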
A summary of the main attributes of a gaze estimation system, highlighting the advantages and disadvantages of each category.
| Attribute | Category | Advantages | Disadvantages |
|---|---|---|---|
| Intrusiveness | Intrusive | Most precise methods | Difficult to use with reluctant people |
| | Non-intrusive | Usable with reluctant people | Cannot track all of the eye features |
| Number of sensors | Single | Cheaper | Less precision |
| | Multi | 3D reconstruction possible | More expensive |
| Type of Camera | RGB | More usability | Less precision |
| | RGBD | 3D data available | More expensive |
| User Calibration | Calibrated | More precise | Difficult to use with reluctant people |
| | User Independent | Portable | Less precise |
| Speed | Real-time | Suitable for most real applications | Less precision |
| | Near real-time/Offline | More precision | Can be employed only in a subset of applications |