Pengfei Cao, Zhongyi Yang, Liang Sun, Yanchun Liang, Mary Qu Yang, Renchu Guan.
Abstract
Automatically describing the content of an image in natural language has drawn much attention because it not only integrates computer vision and natural language processing but also has practical applications. Using an end-to-end approach, we propose a bidirectional semantic attention-based guiding of long short-term memory (Bag-LSTM) model for image captioning. The proposed model consciously refines image features using previously generated text. By fine-tuning the parameters of the convolutional neural network, Bag-LSTM obtains more text-related image features via feedback propagation than other models. In contrast to existing guidance-LSTM methods, which directly add image features into each unit of an LSTM block, our fine-tuned model dynamically leverages text-conditional image features, acquired by the semantic attention mechanism, as guidance information. Moreover, we use a bidirectional gLSTM as the caption generator, which can learn long-term relations between visual features and semantic information by making use of both historical and future context. In addition, variations of the Bag-LSTM model are proposed to better describe high-level visual-language interactions. Experiments on the Flickr8k and MSCOCO benchmark datasets demonstrate the effectiveness of the model compared with baseline algorithms; for example, its CIDEr score is 51.2% higher than that of BRNN.
Keywords: Bidirectional guiding LSTM; Convolution neural network; Image captioning; Semantic attention mechanism
Year: 2019 PMID: 35035261 PMCID: PMC8758065 DOI: 10.1007/s11063-018-09973-5
Source DB: PubMed Journal: Neural Process Lett ISSN: 1370-4621 Impact factor: 2.908
Fig. 1 The framework of the proposed model consists of a CNN, a guiding Bi-LSTM, and an attention mechanism
Fig. 2 The framework of the Bag-LSTM model: a CNN for visual representation and a Bi-gLSTM for caption generation; the visual features v and the generated semantics are combined using an attention mechanism
Fig. 3 Bidirectional guiding of long short-term memory network (Bi-gLSTM). Each box represents a gLSTM unit. The Bi-LSTMs summarize semantic information from both forward and backward directions
Fig. 4 Model comparison: (a) attention model proposed in [28]; (b) Bawg-LSTM, (c) Bbag-LSTM, and (d) Bdag-LSTM are the newly proposed models
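The captions above describe combining visual features v with the generated semantics through an attention mechanism. As a rough illustration only, the sketch below computes soft attention weights over a few image-region feature vectors given a text state vector. The dot-product scoring function, the names `attend`, `regions`, and `state`, and the toy numbers are all assumptions; this record does not give the scoring function the paper actually uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, text_state):
    """Return (weights, context) for soft attention.

    Assumption: relevance is a plain dot product between each region
    feature and the text state; the paper may use a learned scorer.
    """
    scores = [sum(r * t for r, t in zip(region, text_state))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Context = attention-weighted sum of the region features.
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context


# Hypothetical toy data: three 2-d region features and one text state.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [2.0, 0.0]
weights, guidance = attend(regions, state)
```

In this toy setup the first region aligns best with the text state, so it receives the largest weight, and `guidance` plays the role of the text-conditional feature that would be fed to the caption generator.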
Performance of the proposed model on Flickr8k across BLEU-N (N=1, 2, 3, 4), METEOR and CIDEr
| Dataset | Model | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
|---|---|---|---|---|---|---|---|
| Flickr8k | BRNN | 57.9 | 38.3 | 24.5 | 16.0 | – | – |
| Flickr8k | Mao et al. | 58.0 | 28.0 | 23.0 | – | – | – |
| Flickr8k | Google NIC | | 41.0 | 27.0 | – | – | – |
| Flickr8k | VggNet+RNN | 56.2 | 37.5 | 24.5 | 16.6 | – | – |
| Flickr8k | GoogLeNet+RNN | 56.5 | 38.5 | 27.7 | 16.3 | – | – |
| Flickr8k | Bag-LSTM+mean | 59.2 | 40.4 | 27.0 | | | |
| Flickr8k | Bag-LSTM+log | 58.7 | 40.7 | 26.3 | 17.6 | 18.1 | 42.8 |
| Flickr8k | Bawg-LSTM+mean | | 40.5 | | 17.5 | 18.0 | |
| Flickr8k | Bawg-LSTM+log | 58.5 | | 27.4 | 17.7 | 17.8 | 43.6 |
| Flickr8k | Bbag-LSTM+mean | 59.3 | | | 17.0 | 17.6 | 42.9 |
| Flickr8k | Bbag-LSTM+log | 59.5 | 40.7 | 26.9 | | | 42.4 |
| Flickr8k | Bdag-LSTM+mean | 58.3 | 39.7 | 26.7 | 17.8 | 18.0 | 42.1 |
| Flickr8k | Bdag-LSTM+log | 58.1 | 40.4 | 26.4 | 17.5 | 17.7 | 42.3 |
(–) indicates unreported scores. The numbers in bold are the top 2 results of each metric
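The B-1 to B-4 columns report BLEU-N scores, which are built on clipped (modified) n-gram precision. The sketch below shows only that building block on a made-up caption pair; the sentences and the function names `ngrams` and `modified_precision` are illustrative assumptions, and a full BLEU score additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference captions."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it appears
    # in the most generous reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical example captions (not taken from Flickr8k or MSCOCO).
candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split()]
p1 = modified_precision(candidate, references, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, references, 2)  # 3 of 5 bigrams match
```

The table values appear to be such scores scaled to a 0 to 100 range, although this record does not say so explicitly.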
Performance of the proposed model on MSCOCO compared with other baselines across multiple evaluation metrics
| Dataset | Model | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
|---|---|---|---|---|---|---|---|
| MSCOCO | Google NIC | 66.6 | 46.1 | 32.9 | 24.6 | – | – |
| MSCOCO | BRNN | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | 66.0 |
| MSCOCO | Log Bilinear | 70.8 | 48.9 | 34.4 | 24.3 | 20.0 | – |
| MSCOCO | Bi-LSTM | 67.2 | 49.2 | 35.2 | 24.4 | – | – |
| MSCOCO | ATT-FCN | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | – |
| MSCOCO | LRCN | 62.8 | 46.1 | 32.9 | 24.6 | – | – |
| MSCOCO | Soft-Attention | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | – |
| MSCOCO | Hard-Attention | 71.8 | 50.4 | 35.7 | 25.0 | 23.0 | – |
| MSCOCO | Sentence-condition | | 54.6 | 40.4 | 29.8 | 24.5 | 95.9 |
| MSCOCO | Pedersoli et al. | 71.0 | 30.1 | – | – | 24.5 | 93.7 |
| MSCOCO | RIC with STL | 68.7 | 47.8 | 33.1 | 22.0 | 20.5 | – |
| MSCOCO | G-MLE | – | – | 39.3 | 29.9 | 24.8 | |
| MSCOCO | G-GAN | – | – | 30.5 | 20.7 | 22.4 | 79.5 |
| MSCOCO | CNN+CNN | 68.5 | 51.1 | 36.9 | 26.7 | 23.4 | 84.4 |
| MSCOCO | Bag-LSTM+mean | 71.7 | 54.5 | | | | |
| MSCOCO | Bag-LSTM+log | 71.9 | 54.5 | 40.0 | 29.1 | 24.3 | 96.2 |
| MSCOCO | Bawg-LSTM+mean | 71.9 | 54.5 | | | | |
| MSCOCO | Bawg-LSTM+log | | | 40.2 | 29.2 | 24.6 | 97.9 |
| MSCOCO | Bbag-LSTM+mean | 71.1 | 53.8 | 40.1 | 30.0 | 24.7 | 97.7 |
| MSCOCO | Bbag-LSTM+log | | | 40.2 | 29.3 | 24.4 | 97.5 |
| MSCOCO | Bdag-LSTM+mean | 70.6 | 53.7 | 39.7 | 29.8 | 24.9 | 97.7 |
| MSCOCO | Bdag-LSTM+log | 71.6 | 53.7 | 39.2 | 28.5 | 24.3 | 96.2 |
The numbers in bold are the top 2 results of each metric
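The abstract states a CIDEr score 51.2% higher than BRNN. The arithmetic below is a small worked check under two assumptions that this record does not confirm: that the figure is a relative improvement, and that the baseline is BRNN's MSCOCO CIDEr of 66.0. The helper names are hypothetical.

```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


def score_for_gain(baseline, gain_percent):
    """Score that would be gain_percent higher than baseline."""
    return baseline * (1.0 + gain_percent / 100.0)


brnn_cider = 66.0  # BRNN CIDEr on MSCOCO, from the table above

# Assumed reading: 51.2% is a relative gain, implying a CIDEr near 99.8.
implied = score_for_gain(brnn_cider, 51.2)

# For comparison, a CIDEr of 97.9 (Bawg-LSTM+log) is about 48.3% above BRNN.
gain_bawg_log = relative_gain(97.9, brnn_cider)
```

Under this reading the best CIDEr score would be close to 99.8, which is higher than every CIDEr value that remains legible in the table.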
Performance of the proposed sentence selection algorithms compared with the traditional selection method on MSCOCO dataset
| Dataset | Model | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
|---|---|---|---|---|---|---|---|
| MSCOCO | Bag-LSTM+mean | 71.7 | 54.5 | | | | |
| MSCOCO | Bag-LSTM+log | | 54.5 | 40.0 | 29.1 | 24.3 | 96.2 |
| MSCOCO | Bag-LSTM+sum | 71.7 | | 40.3 | 29.6 | 24.5 | 97.7 |
| MSCOCO | Bawg-LSTM+mean | 71.9 | 54.5 | | | | |
| MSCOCO | Bawg-LSTM+log | | | 40.2 | 29.2 | 24.6 | 97.9 |
| MSCOCO | Bawg-LSTM+sum | 71.9 | 54.6 | 40.5 | 29.6 | 24.4 | 97.8 |
| MSCOCO | Bbag-LSTM+mean | 71.1 | 53.8 | 40.1 | | | |
| MSCOCO | Bbag-LSTM+log | | | 40.2 | 29.3 | 24.4 | 97.5 |
| MSCOCO | Bbag-LSTM+sum | 71.4 | 53.9 | | 29.5 | 24.3 | 95.4 |
| MSCOCO | Bdag-LSTM+mean | 70.6 | | 39.7 | | | |
| MSCOCO | Bdag-LSTM+log | | 53.7 | 39.2 | 28.5 | 24.3 | 96.2 |
| MSCOCO | Bdag-LSTM+sum | 71.5 | 53.6 | | 29.4 | 24.1 | 94.9 |
The best results of each model on each metric are marked in bold
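This record names the sentence selection variants (+mean, +log, +sum) but does not define them. Purely as an illustrative guess, the sketch below treats them as three ways of aggregating per-word probabilities when choosing between a forward and a backward candidate caption. The aggregation rules, the function names `score` and `select`, the captions, and the probabilities are all hypothetical and should not be read as the paper's algorithm.

```python
import math


def score(word_probs, how):
    """Aggregate per-word probabilities into one caption score.

    Assumed (not confirmed by the source) meanings of the variant names:
      "sum"  -> sum of word probabilities
      "log"  -> sum of log probabilities
      "mean" -> mean word probability (length-normalised)
    """
    if how == "sum":
        return sum(word_probs)
    if how == "log":
        return sum(math.log(p) for p in word_probs)
    if how == "mean":
        return sum(word_probs) / len(word_probs)
    raise ValueError("unknown selection rule: " + how)


def select(candidates, how):
    """Return the caption whose aggregated score is highest."""
    best_caption, _ = max(candidates, key=lambda c: score(c[1], how))
    return best_caption


# Hypothetical candidates: (caption, per-word probabilities).
forward = ("a man riding a wave", [0.9, 0.8, 0.8, 0.9, 0.7])
backward = ("a surfer on a wave in the ocean",
            [0.9, 0.5, 0.8, 0.9, 0.7, 0.8, 0.9, 0.8])
candidates = [forward, backward]

picked = {how: select(candidates, how) for how in ("sum", "log", "mean")}
```

With these toy numbers the plain sum favours the longer caption, while the log and mean rules favour the shorter one, which shows why the choice of rule can change the reported scores.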
Fig. 5 Examples from the MSCOCO dataset, which visualize the generation of captions
Fig. 6 Qualitative results for images with high loss values on the MSCOCO testing split. The top four examples (green solid box) show that the proposed model can generate acceptable captions. The bottom four examples (red dashed box) indicate that the model can be misled by incorrect visual information. (Color figure online)