Lian Zhou, Yuejie Zhang, Yugang Jiang, Tao Zhang, Weiguo Fan.
Abstract
Visual and semantic saliency are important in image captioning. However, without a saliency predictor, single-phase image captioning can exploit only limited saliency and therefore benefits little from it. In this paper, a novel saliency-enhanced re-captioning framework based on two-phase learning is proposed to improve single-phase image captioning. In the framework, visual saliency and semantic saliency are distilled from the first-phase model and fused with the second-phase model, so that the model boosts itself. The visual saliency mechanism can generate a saliency map and a saliency mask for an image without learning a saliency map predictor. The semantic saliency mechanism sheds some light on the properties of the words in a caption whose part of speech is noun. Furthermore, a third type of saliency, sample saliency, is proposed to explicitly compute the saliency degree of each sample, which makes image captioning more robust. How to combine these three types of saliency for a further performance boost is also examined. Our framework can treat an image captioning model as a saliency extractor, which may benefit other captioning models and related tasks. Experimental results on both the Flickr30k and MSCOCO datasets show that the saliency-enhanced models obtain promising performance gains.
Year: 2019 PMID: 31331893 DOI: 10.1109/TIP.2019.2928144
Source DB: PubMed Journal: IEEE Trans Image Process ISSN: 1057-7149 Impact factor: 10.856