
Hierarchical LSTMs with Adaptive Attention for Visual Captioning.

Lianli Gao, Xiangpeng Li, Jingkuan Song, Heng Tao Shen.   

Abstract

Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the" and "a"). However, non-visual words can be predicted reliably by a language model alone, without consulting visual signals or attention; imposing the attention mechanism on them can mislead the decoder and degrade overall captioning performance. Furthermore, a hierarchy of LSTMs enables richer representations of visual data, capturing information at different scales. Motivated by these observations, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework uses spatial or temporal attention to select specific regions or frames when predicting visually grounded words, while the adaptive attention decides whether to rely on visual information or on the language context. In addition, the hierarchical LSTMs are designed to consider both low-level visual information and high-level language context simultaneously to support caption generation. We design hLSTMat as a general framework and first instantiate it for video captioning; we then refine it and apply it to the image captioning task. To demonstrate the effectiveness of the proposed framework, we evaluate our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most evaluation metrics for both tasks, and the effect of each important component is examined in an ablation study.
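The adaptive-attention idea described in the abstract can be sketched in a few lines: attend over frame features with the current decoder hidden state, then blend the attended visual context with a language-context ("sentinel") vector via a scalar gate. The sketch below is a minimal numpy illustration under assumed dimensions and a simplified scalar gate; it is not the authors' implementation, and the names (`V`, `h`, `s`, `beta`) are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 video frames, shared feature/hidden size 8.
num_frames, d = 5, 8
V = rng.standard_normal((num_frames, d))  # encoder frame features
h = rng.standard_normal(d)                # current decoder LSTM hidden state
s = rng.standard_normal(d)                # language-context ("sentinel") vector

# Temporal attention: score each frame against the hidden state,
# normalize to weights, and form the attended visual context.
alpha = softmax(V @ h)
c_visual = alpha @ V

# Adaptive gate beta in (0, 1): how much to rely on language context
# instead of the visual context (simplified to a scalar sigmoid score).
beta = 1.0 / (1.0 + np.exp(-(s @ h)))

# Adaptive context used to predict the next word: for non-visual words
# the model can push beta toward 1 and ignore the visual signal.
c = beta * s + (1.0 - beta) * c_visual
```

In the full model, `beta` and the attention scores are produced by learned projections and trained end-to-end; the point here is only the gating structure that lets non-visual words bypass visual attention.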

Year:  2019        PMID: 30668467     DOI: 10.1109/TPAMI.2019.2894139

Source DB:  PubMed          Journal:  IEEE Trans Pattern Anal Mach Intell        ISSN: 0162-8828            Impact factor:   6.226


  6 in total

1.  Travel demand and distance analysis for free-floating car sharing based on deep learning method.

Authors:  Chen Zhang; Jie He; Ziyang Liu; Lu Xing; Yinhai Wang
Journal:  PLoS One       Date:  2019-10-16       Impact factor: 3.240

2.  An Improved Sign Language Translation Model with Explainable Adaptations for Processing Long Sign Sentences.

Authors:  Jiangbin Zheng; Zheng Zhao; Min Chen; Jing Chen; Chong Wu; Yidong Chen; Xiaodong Shi; Yiqi Tong
Journal:  Comput Intell Neurosci       Date:  2020-10-23

3.  CNN-LSTM Hybrid Real-Time IoT-Based Cognitive Approaches for ISLR with WebRTC: Auditory Impaired Assistive Technology.

Authors:  Meenu Gupta; Narina Thakur; Dhruvi Bansal; Gopal Chaudhary; Battulga Davaasambuu; Qiaozhi Hua
Journal:  J Healthc Eng       Date:  2022-02-21       Impact factor: 2.682

4.  Research on Video Captioning Based on Multifeature Fusion.

Authors:  Hong Zhao; Lan Guo; ZhiWen Chen; HouZe Zheng
Journal:  Comput Intell Neurosci       Date:  2022-04-28

5.  Cross-Modal Search for Social Networks via Adversarial Learning.

Authors:  Nan Zhou; Junping Du; Zhe Xue; Chong Liu; Jinxuan Li
Journal:  Comput Intell Neurosci       Date:  2020-07-11

6.  An Overview of Image Caption Generation Methods. [Review]

Authors:  Haoran Wang; Yue Zhang; Xiaosheng Yu
Journal:  Comput Intell Neurosci       Date:  2020-01-09
