
MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network.

Liang Peng, Yang Yang, Zheng Wang, Zi Huang, Heng Tao Shen.   

Abstract

Visual Question Answering (VQA) is the task of answering natural language questions about the content of visual images. Most recent VQA approaches apply attention mechanisms to focus on the relevant visual objects and/or model the relations between objects with off-the-shelf visual relation reasoning methods. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly for lack of sufficient relational knowledge. Second, they seldom exploit the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that extract not only fine-grained, precise binary relations between objects but also more sophisticated trinary relations. Both kinds of question-related visual relations provide richer and deeper visual semantics, thereby improving the visual reasoning ability for question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
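The abstract's core idea of question-adaptive attention over pairwise object relations can be illustrated with a minimal NumPy sketch. This is not MRA-Net's actual architecture (the paper's exact projections, fusion scheme, and trinary module are not reproduced here); all dimensions, the concatenation-based pair encoding, and the additive fusion are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes (not from the paper)
n_obj, d = 4, 8                        # detected objects, feature dim
V = rng.standard_normal((n_obj, d))    # appearance feature per object
q = rng.standard_normal(d)             # pooled question embedding

# Binary relation features: one vector per ordered object pair.
# Here a pair is encoded by concatenating the two object features and
# projecting back to d dims -- an illustrative choice, not MRA-Net's design.
W_pair = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
pairs = [(i, j) for i in range(n_obj) for j in range(n_obj) if i != j]
R = np.stack([np.concatenate([V[i], V[j]]) @ W_pair for i, j in pairs])

# Question-adaptive relation attention: score each pair against the question
scores = R @ q / np.sqrt(d)
alpha = softmax(scores)                # attention weights over object pairs
relation_feat = alpha @ R              # attended relation feature

# Question-guided appearance attention, then a simple additive fusion
# (a placeholder for the paper's reconciliation of the two feature types)
appearance_feat = softmax(V @ q / np.sqrt(d)) @ V
fused = appearance_feat + relation_feat
print(fused.shape)  # prints (8,)
```

Extending the same pattern to trinary relations would mean enumerating object triples instead of pairs and encoding each triple before the question-conditioned scoring step.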


Year:  2021        PMID: 32750794     DOI: 10.1109/TPAMI.2020.3004830

Source DB:  PubMed          Journal:  IEEE Trans Pattern Anal Mach Intell        ISSN: 0162-8828            Impact factor:   6.226


  1 in total

1.  Deep Modular Bilinear Attention Network for Visual Question Answering.

Authors:  Feng Yan; Wushouer Silamu; Yanbing Li
Journal:  Sensors (Basel)       Date:  2022-01-28       Impact factor: 3.576

