Zeyd Boukhers1, Timo Hartmann1, Jan Jürjens1,2.
Abstract
Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models based on generating counterfactual images. Specifically, the generated image should differ minimally from the original image while leading the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
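The counterfactual criterion described in the abstract (a small change to the image that flips the VQA answer) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_vqa` is a hypothetical stand-in for a real VQA model, images are flat lists of pixel intensities, and the norm order and the `budget` threshold are arbitrary choices rather than the values used in the paper. The realism constraint mentioned in the abstract is not modelled.

```python
def change_norm(original, counterfactual, order=2):
    """Norm of the pixel-wise difference between two equally sized images.

    Images are flat lists of floats; the norm order is an assumption.
    """
    diffs = [abs(c - o) for o, c in zip(original, counterfactual)]
    return sum(d ** order for d in diffs) ** (1.0 / order)


def is_counterfactual(vqa, original, counterfactual, question, budget):
    """True if the answer changes while the image change stays within budget."""
    flipped = vqa(original, question) != vqa(counterfactual, question)
    return flipped and change_norm(original, counterfactual) <= budget


def toy_vqa(image, question):
    """Hypothetical VQA stand-in: answers from the mean pixel intensity."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


# Hypothetical data: brightening 4 of 16 pixels flips the toy answer.
original = [0.4] * 16
edited = [0.4] * 12 + [0.9] * 4

print(toy_vqa(original, "Is the image bright or dark?"))
print(toy_vqa(edited, "Is the image bright or dark?"))
print(is_counterfactual(toy_vqa, original, edited, "Is the image bright or dark?", 1.5))
```

Separating the answer-flip check from the change-size check keeps the two requirements visible: an edit that flips the answer but exceeds the budget is rejected, as is an edit that stays small but leaves the answer unchanged.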
Keywords: GAN; ML interpretability; UXE; VQA
Year: 2022 PMID: 35336415 PMCID: PMC8953790 DOI: 10.3390/s22062245
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Summary of variables used in this paper.
| Variable | Description |
|---|---|
| | The counterfactual generator proposed in this paper. |
| | The VQA system (i.e., MUTAN [ |
| I | Original image |
| | Question about |
| | The answer of |
| | |
| | |
| | The counterfactual image of |
| | The answer of |
| | An image generated by |
| | The attention map of |
| | The attention map of |
Figure 1. Overview of the proposed architecture inspired by Pan et al. [7].
Figure 2. Example question-image pair from the VQAv1 dataset [3]. The red bounding box indicates the question-critical object.
Figure 3. Examples of the counterfactual generator's output for color-based questions from the VQAv1 [3] validation set. Left: original image I and the corresponding attention map M. Right: generated counterfactual image and the corresponding attention map.
Figure 4. Examples of the counterfactual generator's output for shape-based questions from the VQAv1 [3] validation set. Left: original image I and the corresponding attention map M. Right: generated counterfactual image and the corresponding attention map.
Figure 5. Example outputs of the Grad-CAM algorithm applied to MUTAN for color-based questions. Left: original image. Center: interpolated attention map projected on the original image. Right: the background image.
Figure 6. Frequency histogram of participants’ answers to the question: “Does the picture look photoshopped?”.
Mean and standard deviation of the norm computed across the training and validation sets, split by category.
| | | Training Set | | Validation Set | |
|---|---|---|---|---|---|
| | All | | | | |
| | Color | | | | |
| | Shape | | | | |
| Same VQA Answers | All | | | | |
| | Color | | | | |
| | Shape | | | | |
| Different VQA Answers | All | | | | |
| | Color | | | | |
| | Shape | | | | |
Bold is used to highlight the best/worst result.
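
Per-category statistics of the kind this table summarizes (mean and standard deviation of a norm, split by question category) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: the norm is taken over the pixel-wise difference between an original image and its counterfactual, the norm order defaults to 2, the population standard deviation is used, and the sample data and the function name `norm_stats_by_category` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def norm_stats_by_category(samples, order=2):
    """Return {category: (mean, std)} of the difference norm.

    `samples` is an iterable of (category, original, counterfactual), where
    images are flat lists of numbers. An extra "all" key pools every sample.
    The norm order and the use of population std are assumptions.
    """
    norms = defaultdict(list)
    for category, original, counterfactual in samples:
        value = sum(
            abs(c - o) ** order for o, c in zip(original, counterfactual)
        ) ** (1.0 / order)
        norms[category].append(value)
        norms["all"].append(value)
    return {cat: (mean(vals), pstdev(vals)) for cat, vals in norms.items()}


# Hypothetical toy data: two "color" samples and one "shape" sample.
samples = [
    ("color", [0, 0], [3, 4]),
    ("color", [0, 0], [0, 0]),
    ("shape", [1, 1], [1, 2]),
]

stats = norm_stats_by_category(samples)
for category, (mu, sigma) in stats.items():
    print(category, mu, sigma)
```

The same function can be run separately on the subsets where the VQA answer stayed the same and where it changed, which mirrors the row grouping of the table.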