Dhruv Sharma, Sanjay Purushotham, Chandan K Reddy.
Abstract
Medical images are difficult for a person without expertise to comprehend. Medical practitioners are scarce across the globe, and the high number of cases often causes them physical and mental fatigue, which induces human errors during diagnosis. In such scenarios, an additional opinion can boost the confidence of the decision maker. It is therefore crucial to have a reliable visual question answering (VQA) system that provides a 'second opinion' on medical cases. However, most existing VQA systems cater to real-world problems and are not specifically tailored for medical images. Moreover, a VQA system for medical images must cope with the limited amount of training data available in this domain. In this paper, we develop MedFuseNet, an attention-based multimodal deep learning model for VQA on medical images that takes these challenges into account. MedFuseNet aims to maximize learning with minimal complexity by breaking the problem into simpler tasks and predicting the answer. We tackle two types of answer prediction: categorization and generation. We conducted an extensive set of quantitative and qualitative analyses to evaluate the performance of MedFuseNet. Our experiments demonstrate that MedFuseNet outperforms state-of-the-art VQA methods, and visualization of the captured attention showcases the interpretability of our model's predicted results.
Year: 2021 PMID: 34615894 PMCID: PMC8494920 DOI: 10.1038/s41598-021-98390-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Sample radiology scans and the corresponding question-answer pairs from the MED-VQA and PathVQA datasets. The first three (a–c) belong to the MED-VQA dataset and the last one (d) belongs to the PathVQA dataset.
Figure 2. A high-level model design for the task of VQA. The model has four major components: image feature extraction, question feature extraction, feature fusion combined with the attention mechanism, and answer categorization or generation, depending on the task.
Notations used in this paper.
| Notation | Description |
|---|---|
| | Input image |
| | Image feature vector |
| | Attended image feature vector |
| | Input question |
| | Question feature vector |
| | Attended question feature vector |
| | Combined feature vector |
| | Attention output for the |
| | LSTM output for the |
| | Number of attention glimpses |
| | Actual answer |
| | Predicted answer |
| | Actual answer sequence |
| | Predicted answer sequence |
| | Model parameters |
| | Loss function |
| | Possible set of answers |
| | Vocabulary of words in answers |
| | Inner product operation |
| | Batch size |
| | Image Attention |
| | Question Attention |
| | Decoder Attention |
Figure 3. Our end-to-end framework for Medical Visual Question Answering for answer categorization. It takes the medical image and the associated question as inputs and extracts features from each. The question features are further processed using the question attention mechanism. The attended question features and the image features are then passed through the image attention mechanism to obtain the attended image features. These attended vectors are finally combined using MFB and fed to the answer classification module.
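The fusion step named in the Figure 3 caption can be illustrated with a minimal sketch of multimodal factorized bilinear (MFB) pooling as it is commonly described: project both vectors, multiply element-wise, sum-pool over windows of size `k`, then apply power and L2 normalization. The function name `mfb_fuse`, the dimensions, and the random projection matrices `U` and `V` are assumptions made for illustration; the caption does not state the paper's actual hyperparameters or how attention is wired into the fusion.

```python
import math
import random


def mfb_fuse(image_vec, question_vec, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling (sketch).

    U has shape len(image_vec) x (k * o) and V has shape
    len(question_vec) x (k * o); the result has o entries.
    """
    def project(vec, mat):
        cols = len(mat[0])
        return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
                for j in range(cols)]

    img_proj = project(image_vec, U)
    q_proj = project(question_vec, V)
    # Element-wise product of the two projected vectors.
    joint = [a * b for a, b in zip(img_proj, q_proj)]
    # Sum pooling over non-overlapping windows of size k.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Power normalization: sign(z) * sqrt(|z|).
    powered = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    # L2 normalization (guard against an all-zero vector).
    norm = math.sqrt(sum(z * z for z in powered)) or 1.0
    return [z / norm for z in powered]


# Hypothetical sizes: 6-d image features, 4-d question features, k=2, o=3.
random.seed(0)
k, o = 2, 3
image_vec = [random.gauss(0, 1) for _ in range(6)]
question_vec = [random.gauss(0, 1) for _ in range(4)]
U = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(6)]
V = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(4)]
fused = mfb_fuse(image_vec, question_vec, U, V, k)
```

In a trained model the projections would be learned and the inputs would be the attended image and question features; here they are random only to keep the sketch self-contained.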
Figure 4. The architecture used for the answer generation task. This module takes the image and the question as input. It generates feature vectors for both and produces the combined vector by fusing them with MFB as part of the image-question co-attention mechanism. An LSTM-based decoder then generates the answer. The two major components of this decoder are the attention mechanism and teacher forcing. The attention mechanism helps the model focus on different parts of the image while generating each word, and teacher forcing helps the model converge faster.
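Teacher forcing, mentioned in the Figure 4 caption, can be sketched with a toy decoder loop. At each step the decoder receives either the ground-truth previous token (teacher forcing) or its own previous prediction (free running). The `step_fn` below is a hypothetical lookup table standing in for one LSTM decoder step with attention; it is not the paper's model, and the tokens are chosen only for illustration.

```python
def decode(step_fn, init_state, start_token, target, teacher_forcing):
    """Run a decoder for len(target) steps and return its predictions.

    step_fn(prev_token, state) -> (predicted_token, new_state) is a
    placeholder for one decoder step.
    """
    state = init_state
    prev = start_token
    predictions = []
    for gold in target:
        pred, state = step_fn(prev, state)
        predictions.append(pred)
        # Teacher forcing feeds the ground-truth token to the next step;
        # otherwise the model consumes its own (possibly wrong) prediction.
        prev = gold if teacher_forcing else pred
    return predictions


# Toy "model": a next-token table that makes one mistake after "anoxic".
NEXT_TOKEN = {"<s>": "anoxic", "anoxic": "lung", "brain": "injury"}


def toy_step(prev_token, state):
    return NEXT_TOKEN.get(prev_token, "<unk>"), state + 1


target = ["anoxic", "brain", "injury"]
forced = decode(toy_step, 0, "<s>", target, teacher_forcing=True)
free = decode(toy_step, 0, "<s>", target, teacher_forcing=False)
```

With teacher forcing the single mistake stays isolated (`forced` ends in `"injury"`), whereas in free-running mode the wrong token is fed back and the error propagates to the next step. During training the loss would be computed between the predictions and `target` in either mode.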
Train, validation, and test splits for the yes-no type question-answer pairs in MED-VQA dataset.
| Split | Modality | Plane | Organ |
|---|---|---|---|
| Train | 3200 | 3200 | 3200 |
| Validation | 500 | 500 | 500 |
| Test | 125 | 125 | 125 |
Train, validation, and test splits for the yes-no type question-answer pairs in PathVQA dataset.
| Split | Medical Images | ‘Yes’ type QA Pairs | ‘No’ type QA Pairs |
|---|---|---|---|
| Train | 4271 | 9305 | 9163 |
| Validation | 1176 | 2359 | 2335 |
| Test | 942 | 1874 | 1853 |
Comparison of MedFuseNet with the baseline models on MED-VQA answer classification dataset.
| Methods | Accuracy (Modality) | Accuracy (Plane) | Accuracy (Organ) | AUC-ROC (Modality) | AUC-ROC (Plane) | AUC-ROC (Organ) | AUC-PRC (Modality) | AUC-PRC (Plane) | AUC-PRC (Organ) |
|---|---|---|---|---|---|---|---|---|---|
| VIS + LSTM | 0.704(0.012) | 0.701(0.017) | 0.652(0.020) | 0.899(0.012) | 0.851(0.011) | 0.775(0.015) | 0.478(0.024) | 0.453(0.022) | 0.456(0.025) |
| d-LSTM + n-CNN | 0.723(0.014) | 0.719(0.018) | 0.672(0.022) | 0.909(0.010) | 0.862(0.014) | 0.777(0.017) | 0.474(0.025) | 0.459(0.023) | 0.450(0.027) |
| SAN | 0.669(0.013) | 0.729(0.015) | 0.669(0.023) | 0.926(0.011) | 0.870(0.011) | 0.783(0.015) | 0.459(0.025) | 0.415(0.023) | 0.406(0.026) |
| HiCAt | 0.760(0.010) | 0.740(0.015) | 0.668(0.018) | 0.929(0.011) | 0.869(0.010) | 0.797(0.014) | 0.468(0.023) | 0.431(0.025) | 0.430(0.028) |
| BAN[ | 0.820(0.011) | 0.766(0.016) | 0.800(0.016) | 0.600(0.024) | 0.521(0.022) | 0.456(0.025) | |||
| 0.746(0.015) | 0.942(0.010) | 0.901(0.010) | |||||||
Comparison of MedFuseNet with the baseline models on PathVQA yes-no answer type dataset.
| Methods | Accuracy |
|---|---|
| VIS + LSTM | 0.603(0.025) |
| d-LSTM + n-CNN | 0.607(0.021) |
| SAN | 0.627(0.023) |
| HiCAt | 0.629(0.018) |
| BAN | 0.604(0.021) |
Comparison of MedFuseNet with the baseline models on answer generation dataset.
| Dataset | Method | BLEU-1 | BLEU-2 | BLEU-3 | F-1 |
|---|---|---|---|---|---|
| MED-VQA | BAN + Decoder | 0.266(0.015) | 0.013(0.002) | ||
| 0.076(0.005) | 0.229(0.012) | ||||
| PathVQA | BAN + Decoder | 0.542(0.023) | 0.216(0.023) | 0.054(0.008) | 0.378(0.009) |
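For readers unfamiliar with the BLEU columns above, the following is a minimal sketch of a single-reference, cumulative BLEU-n score: the geometric mean of clipped n-gram precisions for orders 1 to n, multiplied by a brevity penalty. Whether the paper uses cumulative or individual n-gram scores, any smoothing, or multiple references is not stated here, so this is an assumption-laden illustration rather than the evaluation code behind the table. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it
    # appears in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Cumulative BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


reference = ["anoxic", "brain", "injury"]
exact = bleu(["anoxic", "brain", "injury"], reference, 3)
short = bleu(["brain", "injury"], reference, 1)
```

Because no smoothing is applied, any answer with zero matching n-grams at some order scores 0 for that BLEU-n, which is one reason higher-order BLEU scores tend to drop sharply for short generated answers.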
Performance metric scores for the ablation study experiments on MED-VQA dataset.
| Question Category | Image Feature | MCB (BERT) | MCB (XLNet) | MUTAN (BERT) | MUTAN (XLNet) | MFB (BERT) | MFB (XLNet) |
|---|---|---|---|---|---|---|---|
| Category 1 Modality | VGG16 | 0.718(0.019) | 0.697(0.018) | 0.751(0.016) | 0.686(0.019) | 0.805(0.012) | 0.680(0.019) |
| DenseNet121 | 0.704(0.015) | 0.675(0.019) | 0.768(0.014) | 0.688(0.021) | 0.813(0.014) | 0.675(0.020) | |
| ResNet152 | 0.731(0.014) | 0.663(0.017) | 0.783(0.018) | 0.716(0.017) | 0.701(0.018) | ||
| Category 2 Plane | VGG16 | 0.706(0.018) | 0.697(0.016) | 0.750(0.017) | 0.605(0.022) | 0.749(0.014) | 0.629(0.019) |
| DenseNet121 | 0.719(0.016) | 0.643(0.018) | 0.754(0.016) | 0.643(0.017) | 0.757(0.011) | 0.655(0.021) | |
| ResNet152 | 0.712(0.015) | 0.659(0.019) | 0.763(0.015) | 0.693(0.019) | 0.735(0.016) | ||
| Category 3 Organ System | VGG16 | 0.718(0.018) | 0.625(0.015) | 0.785(0.012) | 0.683(0.016) | 0.692(0.019) | |
| DenseNet121 | 0.753(0.013) | 0.630(0.018) | 0.774(0.015) | 0.696(0.018) | 0.774(0.012) | 0.720(0.016) | |
| ResNet152 | 0.669(0.016) | 0.672(0.013) | 0.705(0.016) | 0.649(0.019) | 0.746(0.010) | 0.682(0.015) | |
| Category 1 Modality | VGG16 | 0.845(0.011) | 0.697(0.016) | 0.896(0.010) | 0.710(0.015) | 0.738(0.015) | |
| DenseNet121 | 0.854(0.013) | 0.675(0.018) | 0.898(0.010) | 0.659(0.014) | 0.934(0.010) | 0.703(0.016) | |
| ResNet152 | 0.861(0.012) | 0.703(0.018) | 0.906(0.011) | 0.740(0.017) | 0.942(0.013) | 0.700(0.014) | |
| Category 2 Plane | VGG16 | 0.833(0.012) | 0.697(0.018) | 0.866(0.011) | 0.718(0.017) | 0.899(0.013) | 0.729(0.014) |
| DenseNet121 | 0.832(0.013) | 0.743(0.017) | 0.867(0.012) | 0.801(0.013) | 0.894(0.012) | 0.839(0.015) | |
| ResNet152 | 0.840(0.010) | 0.685(0.017) | 0.881(0.010) | 0.849(0.014) | 0.891(0.013) | ||
| Category 3 Organ System | VGG16 | 0.655(0.015) | 0.619(0.019) | 0.689(0.014) | 0.622(0.017) | 0.691(0.014) | 0.730(0.016) |
| DenseNet121 | 0.667(0.013) | 0.700(0.016) | 0.691(0.013) | 0.626(0.018) | 0.690(0.013) | 0.650(0.014) | |
| ResNet152 | 0.803(0.010) | 0.674(0.018) | 0.795(0.014) | 0.800(0.010) | 0.790(0.015) | ||
| Category 1 Modality | VGG16 | 0.322(0.019) | 0.312(0.017) | 0.379(0.017) | 0.373(0.020) | 0.590(0.016) | 0.352(0.019) |
| DenseNet121 | 0.287(0.021) | 0.310(0.019) | 0.407(0.016) | 0.390(0.019) | 0.572(0.018) | 0.219(0.021) | |
| ResNet152 | 0.361(0.021) | 0.208(0.018) | 0.469(0.017) | 0.343(0.019) | 0.224(0.018) | ||
| Category 2 Plane | VGG16 | 0.252(0.018) | 0.368(0.018) | 0.331(0.019) | 0.370(0.021) | 0.439(0.017) | 0.288(0.020) |
| DenseNet121 | 0.269(0.017) | 0.279(0.021) | 0.347(0.018) | 0.335(0.021) | 0.437(0.019) | 0.351(0.019) | |
| ResNet152 | 0.248(0.020) | 0.293(0.021) | 0.365(0.017) | 0.321(0.020) | 0.435(0.017) | ||
| Category 3 Organ System | VGG16 | 0.341(0.016) | 0.348(0.020) | 0.393(0.018) | 0.289(0.019) | 0.443(0.019) | 0.351(0.016) |
| DenseNet121 | 0.364(0.018) | 0.420(0.018) | 0.377(0.016) | 0.289(0.021) | 0.433(0.021) | 0.330(0.018) | |
| ResNet152 | 0.428(0.017) | 0.322(0.017) | 0.473(0.019) | 0.396(0.018) | 0.352(0.018) | ||
Accuracy scores for the ablation study experiments of PathVQA yes-no answer type dataset.
| Image Feature | MCB (BERT) | MCB (XLNet) | MUTAN (BERT) | MUTAN (XLNet) | MFB (BERT) | MFB (XLNet) |
|---|---|---|---|---|---|---|
| VGG16 | 0.614(0.014) | 0.502(0.012) | 0.637(0.014) | 0.513(0.013) | 0.507(0.014) | |
| DenseNet121 | 0.609(0.013) | 0.503(0.014) | 0.624(0.012) | 0.514(0.013) | 0.636(0.013) | 0.507(0.012) |
| ResNet152 | 0.611(0.015) | 0.505(0.014) | 0.620(0.013) | 0.505(0.012) | 0.621(0.013) | 0.503(0.015) |
Image Attention visualization for SAN, Hie. Co-Att, and MedFuseNet.
Figure 5. Co-attention maps for a sample case, showing the attention span of MedFuseNet over the input image together with the corresponding question attention. (a) Displays the image attention map and the corresponding question attention map for category 1 (modality), (b) for category 2 (plane), and (c) for category 3 (organ).
Figure 6. The attention maps produced by MedFuseNet while generating the words in the answer. There are three cases: (a) sarcoidosis in the genitourinary system, (b) anoxic brain injury, and (c) Salter-Harris fracture in the bone.