Bor-Jiunn Hwang, Hui-Hui Chen, Chaur-Heh Hsieh, Deng-Yu Huang.
Abstract
Experimental observations show that consecutive gaze positions in visual behavior are correlated over time. Previous studies on gaze point estimation usually train models on individual images and ignore the sequential relationship between them. This paper uses videos instead of images as input so that temporal features can be considered alongside spatial features to improve accuracy. To capture both kinds of features at once, a convolutional neural network (CNN) and a long short-term memory (LSTM) network are combined into a single model: the CNN extracts spatial features, and the LSTM models temporal correlations. This paper presents a CNN Concatenating LSTM network (CCLN) that concatenates spatial and temporal features to improve gaze estimation performance when time-series videos are used as training input. The proposed model is further optimized by exploring the number of LSTM layers and the influence of batch normalization (BN) and a global average pooling (GAP) layer on CCLN. It is generally believed that larger amounts of training data lead to better models. To provide data for training and prediction, we propose a method for constructing video datasets for gaze point estimation. We also study the effectiveness of several commonly used general models and the impact of transfer learning. Extensive evaluation shows that the proposed method achieves better prediction accuracy than existing CNN-based methods. The best model reaches an accuracy of 93.1%, and the general model MobileNet reaches 92.6%.
Keywords: convolutional neural network (CNN); deep learning; gaze tracking; long short-term memory (LSTM)
Year: 2022 PMID: 35062502 PMCID: PMC8781122 DOI: 10.3390/s22020545
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. System architecture.
Figure 2. Most gaze points are distributed near the center of the object of interest [28].
Figure 3. Comparison of the gaze points of different participants from frame 2387 to frame 2598.
The details of six animated videos [15].
| Title | Resolution (Pixel) | Length (s) |
|---|---|---|
| Reach | 1920 × 1080 | 193 |
| Mr Indifferent | 1920 × 1080 | 132 |
| Jinxy Jenkins & Lucky Lou | 1920 × 1080 | 193 |
| Changing Batteries | 1920 × 1080 | 333 |
| Pollo | 1920 × 1080 | 286 |
| Pip | 1920 × 1080 | 245 |
Figure 4. The distribution results after labeling as blocks.
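
Labeling gaze points as blocks turns gaze estimation into a classification problem. The sketch below shows one way such a label could be derived from a pixel coordinate on a 1920 × 1080 frame (the resolution listed for the videos). The function name `block_label` and the 4 × 4 grid are hypothetical; this record does not give the block layout actually used.

```python
def block_label(x, y, width=1920, height=1080, cols=4, rows=4):
    """Map a gaze point (in pixels) to a row-major block index.

    The grid size (cols x rows) is an illustrative assumption,
    not a value taken from the paper.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("gaze point lies outside the frame")
    col = int(x * cols // width)   # column of the block containing x
    row = int(y * rows // height)  # row of the block containing y
    return row * cols + col


# A gaze point at the frame center falls in row 2, column 2 of a 4 x 4 grid.
center_label = block_label(960, 540)
```

With a coarser or finer grid, only `cols` and `rows` change; the number of classes the model must predict is `cols * rows`.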
The number of videos by length.
| Length (s) | Amount |
|---|---|
| 1 | 991 |
| 2 | 21 |
| 3 | 5 |
Figure 5. The distribution results after data augmentation.
Figure 6. The architecture of the CCLN gaze estimation network.
The types of LSTM Block architectures to be evaluated.
| Type | CNN | LSTM |
|---|---|---|
| 1 | CNN Block | unidirectional |
| 2 | CNN Block | bidirectional |
| 3 | CNN Block | unidirectional, multi-layer |
| 4 | CNN Block | bidirectional, multi-layer |
The accuracy compared with [15] for Type 1.
| | [15] | Proposed CCLN |
|---|---|---|
| Accuracy | 0.862 | 0.891 |
The accuracies for Type 1 and Type 2.
| | Type 1 | Type 2 |
|---|---|---|
| Accuracy | 0.891 | 0.915 |
Figure 7. The accuracy of the different LSTM architectures with multiple layers.
Figure 8. The total parameters of unidirectional and bidirectional LSTM for different numbers of layers.
The performance when using GAP.
| | 1-Layer Bidirectional LSTM | 1-Layer Bidirectional LSTM + GAP |
|---|---|---|
| Total parameters | 6,345,157 | 5,839,301 |
| Accuracy | 0.915 | 0.927 |
| Loss | 0.361 | 0.324 |
| F1 score | 0.915 | 0.929 |
| Recall | 0.919 | 0.935 |
| Precision | 0.916 | 0.929 |
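
In the table above, the F1 score of the 1-layer bidirectional LSTM (0.915) is lower than both its recall (0.919) and its precision (0.916). The harmonic mean of two numbers never falls below the smaller one, so these values would be consistent with class-wise (macro) averaging, where precision, recall, and F1 are each computed per class and then averaged. This record does not state the averaging scheme, so the sketch below is only an illustration of that effect on made-up labels; `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (not data from the paper): class "A" has high precision and
# low recall, class "B" the opposite.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B"]
precision, recall, f1 = macro_scores(y_true, y_pred)
```

Here macro precision and macro recall are both 0.75, while macro F1 is about 0.667, because each per-class F1 is pulled toward the weaker of that class's two scores before averaging.
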
The performance when using Dropout and BN.
| | 1-Layer Bidirectional LSTM + GAP + Dropout | 1-Layer Bidirectional LSTM + GAP + BN | 1-Layer Bidirectional LSTM + GAP + Dropout + BN |
|---|---|---|---|
| Total parameters | 5,839,301 | 5,841,349 | 5,841,349 |
| Accuracy | 0.918 | 0.931 | 0.926 |
| Loss | 0.302 | 0.359 | 0.351 |
| F1 score | 0.919 | 0.930 | 0.925 |
| Recall | 0.923 | 0.935 | 0.931 |
| Precision | 0.920 | 0.930 | 0.926 |
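
The parameter totals in the table above can be sanity-checked with simple arithmetic. Dropout has no parameters, and a batch normalization layer is commonly counted as four values per feature (scale, shift, moving mean, moving variance) when non-trainable statistics are included in the total. Under that counting convention, which is an assumption here, the 2,048 extra parameters in the BN columns would correspond to a single BN layer over 512 features. The helper name `batch_norm_params` is hypothetical.

```python
def batch_norm_params(num_features):
    """Parameters of one batch-normalization layer, counting
    gamma, beta, moving mean, and moving variance per feature."""
    return 4 * num_features


# Totals taken from the table above.
total_dropout_only = 5_839_301  # GAP + Dropout (dropout adds nothing)
total_with_bn = 5_841_349       # GAP + BN

bn_delta = total_with_bn - total_dropout_only
implied_features = bn_delta // 4  # feature width implied by the delta
```

The 512-feature width is inferred from the totals rather than stated in this record, so it should be read as a consistency check, not as a documented layer size.
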
The performance for various general models.
| CNN Model | VGG16 | VGG19 | ResNet50 | DenseNet121 | MobileNet |
|---|---|---|---|---|---|
| Pre-trained | ImageNet | ImageNet | ImageNet | ImageNet | ImageNet |
| Total parameters | 19,194,725 | 24,504,421 | 29,061,157 | 27,294,917 | 6,636,389 |
| Accuracy | 0.6 | 0.616 | 0.685 | 0.705 | 0.71 |
| Loss | 1.5 | 1.38 | 1.18 | 1.06 | 1 |
| F1 score | 0.588 | 0.596 | 0.671 | 0.693 | 0.7 |
| Recall | 0.605 | 0.621 | 0.692 | 0.709 | 0.718 |
| Precision | 0.597 | 0.601 | 0.664 | 0.688 | 0.7 |
The performance of MobileNet [36] cascaded with multi-layer unidirectional LSTM.
| LSTM Layer | 1 Layer | 2 Layers | 3 Layers |
|---|---|---|---|
| Pre-trained | ImageNet | ImageNet | ImageNet |
| Total parameters | 6,636,389 | 8,735,589 | 10,834,789 |
| Accuracy | 0.71 | 0.625 | 0.572 |
| Loss | 1 | 1.27 | 1.45 |
| F1 score | 0.7 | 0.607 | 0.54 |
| Recall | 0.718 | 0.634 | 0.578 |
| Precision | 0.7 | 0.618 | 0.537 |
The performance of MobileNet [36] cascaded with multi-layer bidirectional LSTM.
| LSTM Layer | 1 Layer | 2 Layers | 3 Layers |
|---|---|---|---|
| Pre-trained | ImageNet | ImageNet | ImageNet |
| Total parameters | 6,900,581 | 13,196,133 | 19,491,685 |
| Accuracy | 0.914 | 0.876 | 0.863 |
| Loss | 0.358 | 0.549 | 0.71 |
| F1 score | 0.913 | 0.879 | 0.865 |
| Recall | 0.92 | 0.878 | 0.868 |
| Precision | 0.914 | 0.889 | 0.87 |
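
The growth in total parameters across the two MobileNet + LSTM tables can be reproduced with the standard LSTM parameter count, 4 × (h × (d + h) + h) for input size d and hidden size h, assuming one bias vector per gate. The sketch below checks that each added layer matches this formula with h = 512. That hidden size is inferred from the totals, not stated in this record, and the helper names are hypothetical.

```python
def lstm_params(input_dim, hidden_dim):
    """Weights and biases of one unidirectional LSTM layer
    (four gates, one bias vector per gate)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def bilstm_params(input_dim, hidden_dim):
    """A bidirectional layer runs two independent LSTMs."""
    return 2 * lstm_params(input_dim, hidden_dim)


# Total parameters for 1, 2, and 3 LSTM layers, from the two tables above.
uni_totals = [6_636_389, 8_735_589, 10_834_789]
bi_totals = [6_900_581, 13_196_133, 19_491_685]

uni_deltas = [b - a for a, b in zip(uni_totals, uni_totals[1:])]
bi_deltas = [b - a for a, b in zip(bi_totals, bi_totals[1:])]

HIDDEN = 512  # assumed hidden size, chosen because it matches the deltas

# An added unidirectional layer receives the 512-dim output of the layer below.
uni_layer = lstm_params(HIDDEN, HIDDEN)
# An added bidirectional layer receives the concatenated 2 x 512 output.
bi_layer = bilstm_params(2 * HIDDEN, HIDDEN)
```

Each extra bidirectional layer costs roughly three times as much as a unidirectional one: it doubles the number of LSTMs and also doubles the input width seen by each of them.
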
The performance of MobileNet with or without adopting GAP.
| GAP | Without | With |
|---|---|---|
| Pre-trained | ImageNet | ImageNet |
| Total parameters | 6,900,581 | 6,394,725 |
| Accuracy | 0.914 | 0.926 |
| Loss | 0.358 | 0.317 |
| F1 score | 0.913 | 0.926 |
| Recall | 0.92 | 0.929 |
| Precision | 0.914 | 0.928 |
The performance of MobileNet using dropout or BN.
| | Without | Dropout | BN | Dropout + BN |
|---|---|---|---|---|
| Pre-trained | ImageNet | ImageNet | ImageNet | ImageNet |
| Total parameters | 6,394,725 | 6,394,725 | 6,396,773 | 6,396,773 |
| Accuracy | 0.926 | 0.92 | 0.923 | 0.922 |
| Loss | 0.317 | 0.305 | 0.407 | 0.287 |
| F1 score | 0.926 | 0.92 | 0.924 | 0.921 |
| Recall | 0.929 | 0.924 | 0.929 | 0.925 |
| Precision | 0.928 | 0.923 | 0.925 | 0.924 |
The performance of transfer learning.
| Pre-Trained | Add Face and Eyes | Add Face | Add Eyes |
|---|---|---|---|
| Total parameters | 6,394,725 | 6,394,725 | 6,394,725 |
| Accuracy | 0.919 | 0.924 | 0.891 |
| Loss | 0.341 | 0.332 | 0.537 |
| F1 score | 0.915 | 0.922 | 0.813 |
| Recall | 0.918 | 0.921 | 0.875 |
| Precision | 0.911 | 0.924 | 0.883 |