Haoran Wang, Yue Zhang, Xiaosheng Yu.
Abstract
In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and challenging task. Image captioning, the automatic generation of a natural language description of the content observed in an image, is an important part of scene understanding and combines knowledge from computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper surveys the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are presented. Finally, the paper highlights some open challenges in the image captioning task.
Year: 2020 PMID: 32377178 PMCID: PMC7199544 DOI: 10.1155/2020/3062706
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Method based on the visual detector and language model.
Figure 2. Model based on the encoder-decoder framework.
Figure 3. (a) Scaled dot-product attention. (b) Multihead attention.
Figure 4. (a) Global attention model and (b) local attention model.
Figure 5. Adaptive attention model with visual sentinel.
Figure 6. Semantic attention.
Figure 7. SCA-CNN model.
Figure 8. Areas of attention.
Figure 9. Deliberate attention framework.
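Figures 1 and 2 depict the typical captioning pipeline: a visual encoder extracts image features, and a language-model decoder generates the caption word by word. As an illustration only, the structure can be sketched with a toy, untrained model; the weights, dimensions, and five-word vocabulary below are assumptions for demonstration, not the architecture of any surveyed paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "a", "dog", "runs"]
V, D, H = len(vocab), 8, 16

# "Encoder": in a real captioner this is a CNN; here a fixed random
# projection of a flattened image stands in for the feature extractor.
W_enc = rng.standard_normal((H, 64)) * 0.1
def encode(image):
    return np.tanh(W_enc @ image.ravel())

# Decoder: a single-layer RNN language model over the toy vocabulary.
W_emb = rng.standard_normal((V, D)) * 0.1   # word embeddings
W_xh = rng.standard_normal((H, D)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_out = rng.standard_normal((V, H)) * 0.1

def caption(image, max_len=5):
    h = encode(image)                 # image features initialize the state
    tok = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        x = W_emb[tok]
        h = np.tanh(W_xh @ x + W_hh @ h)
        tok = int(np.argmax(W_out @ h))   # greedy decoding
        if vocab[tok] == "<end>":
            break
        words.append(vocab[tok])
    return " ".join(words)

img = rng.standard_normal((8, 8))
print(caption(img))
```

Since the weights are untrained, the output is arbitrary; the point is the control flow: encode once, then feed each predicted token back into the decoder until `<end>` or a length limit.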
Comparison of attention mechanism modeling methods.
| Ref. | Attention name | Method | Comment |
|---|---|---|---|
| [ ] | Soft attention | Assigns a probability, derived from the context vector, to every word in the input sentence when computing the attention distribution | Parameterized |
| [ ] | Hard attention | Focuses on a single randomly chosen location, using Monte Carlo sampling to estimate the gradient | Stochastic |
| [ ] | Multihead attention | Linearly projects multiple pieces of information selected from the input in parallel, using multiple keys, values, and queries | Linear projection |
| [ ] | Scaled dot-product attention | Executes a single attention function over key, value, and query matrices | High speed |
| [ ] | Global attention | Considers the hidden states of all encoder steps; the attention weight distribution is obtained by comparing the current decoder hidden state with each encoder hidden state | Comprehensive |
| [ ] | Local attention | First predicts a position, then computes attention weights within a window to the left and right of that position, and finally forms the weighted context vector | Reduces computation cost |
| [ ] | Adaptive attention | Defines an adaptive context vector modeled as a mixture of the spatially attended image features and a visual sentinel vector, trading off how much new information the network takes from the image against what it already knows in the decoder memory | Decides when and where to attend in order to extract meaningful information for the sequence of words |
| [ ] | Semantic attention | Selects semantic concepts and incorporates them into the hidden state and output of the LSTM | Selective |
| [ ] | Spatial and channel-wise attention | Selects semantic attributes based on the needs of the sentence context | Multiple semantics |
| [ ] | Areas of attention | Models the dependencies between image regions, caption words, and the state of the RNN language model | Interaction |
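Scaled dot-product attention, listed above, is the one mechanism with a compact closed form: softmax(QKᵀ/√d_k)V. A minimal NumPy sketch (the toy shapes and random inputs are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (2, 4): one context vector per query
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Multihead attention (Figure 3b) runs several such functions in parallel on linearly projected copies of Q, K, and V and concatenates the results; the scaling by √d_k keeps the dot products from saturating the softmax.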
Summary of the number of images in each dataset.
| Dataset name | Train | Valid | Test |
|---|---|---|---|
| MSCOCO | 82783 | 40504 | 40775 |
| Flickr8k | 6000 | 1000 | 1000 |
| Flickr30k | 28000 | 1000 | 1000 |
| PASCAL 1K | — | — | 1000 |
| AIC | 210000 | 30000 | 30000 |
| STAIR | 82783 | 40504 | 40775 |
Figure 10. An example image from the MSCOCO dataset.
Scores of attention mechanisms based on the evaluations above.
| Ref. | Attention model | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|
| [ ] | Soft attention | 24.3 | 23.9 | — | — |
| [ ] | Hard attention | 25.0 | 23.0 | 51.6 | 86.5 |
| [ ] | Multihead/scaled dot-product | 28.4 | — | — | — |
| [ ] | Global/local attention | 25.9 | — | — | — |
| [ ] | Adaptive attention | 33.2 | 26.6 | 55.0 | 108.5 |
| [ ] | Semantic attention | 30.4 | 24.3 | 53.5 | 94.3 |
| [ ] | Spatial and channel-wise | 31.1 | 25.4 | 53.0 | 94.3 |
| [ ] | Areas of attention | 31.9 | 25.2 | — | 98.1 |
| [ ] | Deliberate attention | 37.5 | 28.5 | 58.2 | 125.6 |
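BLEU-4, the first metric in the table, scores a candidate caption by the geometric mean of its clipped 1- to 4-gram precisions against the reference, multiplied by a brevity penalty. A simplified single-reference sketch (real evaluations use multiple references and smoothing; the example sentences are invented):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Single reference, uniform weights."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

cand = "a man riding a horse on the beach".split()
ref = "a man is riding a horse on the beach".split()
print(round(bleu(cand, ref), 3))
```

METEOR additionally aligns stems and synonyms, ROUGE-L measures the longest common subsequence, and CIDEr weights n-grams by TF-IDF consensus across references, which is why the four columns rank systems somewhat differently.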