| Literature DB >> 35789785 |
Qiaoying Teng1, Zhe Liu2, Yuqing Song2, Kai Han2, Yang Lu1.
Abstract
Deep learning has demonstrated remarkable performance in the medical domain, with accuracy that rivals or even exceeds that of human experts. However, a significant problem is that these models are "black-box" structures: opaque, non-intuitive, and difficult for people to understand. This lack of interpretability, trust, and transparency creates a barrier to the application of deep learning models in clinical practice. To overcome this problem, several studies on interpretability have been proposed. Therefore, in this paper, we comprehensively review the interpretability of deep learning in medical diagnosis based on the current literature, including common interpretability methods used in the medical domain, various applications of interpretability for disease diagnosis, prevalent evaluation metrics, and several disease datasets. In addition, the challenges of interpretability and future research directions are discussed. To the best of our knowledge, this is the first time that the various applications of interpretability methods for disease diagnosis have been summarized.
Keywords: Applications; Deep learning; Disease diagnosis; Interpretability methods
Year: 2022 PMID: 35789785 PMCID: PMC9243744 DOI: 10.1007/s00530-022-00960-4
Source DB: PubMed Journal: Multimed Syst ISSN: 0942-4962 Impact factor: 2.603
Fig. 1Taxonomy of interpretability methods
Fig. 2The process of LRP. Each neuron redistributes to the lower layer as much as it has received from the higher layer [25]
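The redistribution rule in the caption can be sketched for a single fully connected layer. The snippet below is an illustrative NumPy sketch of the LRP-epsilon rule, not the cited implementation; all function and variable names are ours. Relevance flows from one layer to the layer below in proportion to each input's contribution, so the total relevance is (approximately) conserved:

```python
import numpy as np

def lrp_linear(a, w, relevance, eps=1e-6):
    """LRP-epsilon step for one fully connected layer.

    a: lower-layer activations, shape (n_in,)
    w: weights, shape (n_in, n_out)
    relevance: upper-layer relevance, shape (n_out,)
    Returns the relevance redistributed to the lower layer, shape (n_in,).
    """
    z = a @ w                        # total contribution arriving at each upper neuron
    z = z + eps * np.sign(z)         # small stabilizer against division by zero
    s = relevance / z                # relevance per unit of contribution
    return a * (w @ s)               # each input gets a share proportional to a_i * w_ij

# Conservation check: the lower layer receives what the upper layer held.
a = np.array([1.0, 2.0, 0.5])
w = np.array([[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]])
r_upper = np.array([0.7, 0.3])
r_lower = lrp_linear(a, w, r_upper)
```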
Fig. 3The process of generating class activation maps: The predicted score is mapped back to the previous convolutional layer to produce the class activation maps [32]
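This mapping step can be sketched in NumPy (illustrative only; `class_activation_map` and its arguments are our own names). The class activation map is a weighted sum of the last convolutional layer's feature maps, using the fully connected weights of the target class:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) output of the last convolutional layer;
    fc_weights: (num_classes, C) weights of the layer after global average pooling."""
    w = fc_weights[class_idx]                     # (C,) weights of the target class
    cam = np.tensordot(w, feature_maps, axes=1)   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0)                      # keep only positive evidence
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1] for display
```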
Fig. 4Grad-CAM overview: Given an image as input, we can obtain the predicted score of the target class by forward propagating the image through the convolutional neural network and task-specific network. Then the gradient of the target class is set to 1, and the others are set to 0. This signal is then backpropagated to the rectified convolutional feature maps of interest, which we combine to compute the coarse Grad-CAM localization, which represents where the model has to look to make the particular decision. Finally, we pointwise multiply the heatmap with guided backpropagation to get Guided Grad-CAM visualizations that are both high-resolution and concept-specific [33]
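The coarse localization step of Grad-CAM can be sketched in a few lines of NumPy (a simplified illustration, assuming the feature maps and their gradients have already been extracted from the network; names are ours):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps: (C, H, W) activations of the chosen convolutional layer;
    gradients: (C, H, W) gradients of the target-class score w.r.t. those activations."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pool the gradients
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted combination of maps
    cam = np.maximum(cam, 0)                           # ReLU: keep positive influence only
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
```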
Fig. 5Score-CAM overview: Score-CAM has two phases. Phase 1 first extracts the activation maps; each activation map then works as a mask on the original image, and its forward-pass score on the target class is obtained. This is repeated N times, once per activation map. In Phase 2, the result is generated by a linear combination of the score-based weights and the activation maps [34]
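A minimal sketch of the two phases (illustrative only; `score_fn` stands in for a forward pass through the real network, and the activation maps are assumed to be upsampled to the input size):

```python
import numpy as np

def score_cam(image, activation_maps, score_fn):
    """image: (H, W) input; activation_maps: (N, H, W), already upsampled to the
    input size; score_fn: callable returning the target-class score for an image."""
    weights = np.empty(len(activation_maps))
    for i, act in enumerate(activation_maps):
        rng = act.max() - act.min()
        mask = (act - act.min()) / (rng + 1e-8)   # normalize each map to [0, 1]
        weights[i] = score_fn(image * mask)       # Phase 1: score of the masked input
    cam = np.tensordot(weights, activation_maps, axes=1)  # Phase 2: linear combination
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)
```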
Overview of the advantages and disadvantages of common CAM-based methods and how they work
| Method | Advantages | Disadvantages | How they work |
|---|---|---|---|
| CAM [ | Identifies important areas for image classification | Needs a global average pooling layer; can only visualize the output of the last convolutional layer | A linear combination of weights and activation maps |
| Grad-CAM [ | Applies to a broader range of networks without requiring a global average pooling layer; can visualize any layer of a deep network | Gradient issues such as saturation; lacks the ability to highlight fine-grained details; performance drops when localizing multiple occurrences of the same class; cannot capture the entire object in completeness | Weights the 2D activations by the average gradient |
| Guided Grad-CAM [ | Provides a fine-grained localization map; applies to a broader range of networks without requiring a global average pooling layer; can visualize any layer of a deep network | Gradient issues such as saturation | Acquires a fine-grained localization map by combining Grad-CAM and Guided Backpropagation |
| Grad-CAM++ [ | More suitable for localizing multiple targets of the same category | Using second-order derivatives requires more complex computation | Like Grad-CAM, but uses second-order gradients |
| Score-CAM [ | Gradient-free localization; high performance | – | Perturbs the image with the scaled activations and measures how the output changes |
| Ablation-CAM [ | Gradient-free localization; can fuse with pixel-space gradient visualizations to generate high-resolution localization maps | Requires more computation time | Zeroes out activations and measures how the output drops |
| Group-CAM [ | Requires less computation; more efficient; can be used as a data-augmentation trick to fine-tune classification models | – | Adopts the "split-transform-merge" strategy to generate saliency maps |
Fig. 6Illustration of surrogate interpretability methods, where g is a white-box model trained to mimic the behavior of the "black-box" model f, so that g approximates f [30]
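A minimal example of the surrogate idea, assuming a LIME-style local linear surrogate (all names here are ours, not from the cited work): sample perturbations around an input, query the black box, and fit a white-box linear model g whose coefficients serve as feature attributions:

```python
import numpy as np

def linear_surrogate(black_box, x, n_samples=500, sigma=0.1, seed=0):
    """Fit a local linear model g(x) ~ f(x) around x by sampling perturbations
    and solving least squares; the coefficients act as feature attributions."""
    rng = np.random.default_rng(seed)
    X = x + sigma * rng.standard_normal((n_samples, x.size))  # perturbed neighbors
    y = np.array([black_box(row) for row in X])               # query the black box
    A = np.column_stack([X, np.ones(n_samples)])              # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                                # weights, intercept

f = lambda v: 3.0 * v[0] - 2.0 * v[1]   # toy "black box" for demonstration
w, b = linear_surrogate(f, np.array([1.0, 1.0]))
```

Because the toy black box is itself linear, the surrogate recovers its coefficients exactly; for a real model the fit is only valid in the local neighborhood defined by `sigma`.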
Fig. 7The process of knowledge distillation. A smaller student model is trained to mimic a pretrained larger teacher model [44]
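The soft-target part of this training objective can be sketched as follows (an illustrative NumPy version of the temperature-softened cross-entropy commonly used in knowledge distillation; not the cited paper's code):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T                         # temperature softens the distribution
    e = np.exp(z - z.max())                # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the temperature-softened teacher and student
    distributions: the soft-target term the student minimizes."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))
```

The loss is smallest when the student's softened distribution matches the teacher's, which is what "mimicking" the teacher means in practice.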
Fig. 8The applications of interpretability methods in disease diagnosis
The experimental results of diagnosing various eye diseases (e.g., diabetic retinopathy, glaucoma) on different datasets
| Task | Author | Network | Dataset | Interpretability | Sensitivity | Specificity | Accuracy | AUC |
|---|---|---|---|---|---|---|---|---|
| Diabetic Retinopathy classification | Jiang et al. [ | Integrated model | Private dataset | CAM | 85.57% | 90.85% | 88.21% | 0.946 |
| Diabetic Retinopathy classification | Jiang et al. [ | Based on ResNet | Private dataset | Grad-CAM | 93.90% | 94.40% | 94.20% | 0.989 |
| Detect referable diabetic retinopathy (RDR) | Chetoui et al. [ | EfficientNet | EyePACS | Grad-CAM | 91.70% | 98.90% | – | 0.984 |
| | | | APTOS 2019 | | 91.40% | 97.20% | – | 0.966 |
| Detect vision-threatening diabetic retinopathy | Chetoui et al. [ | EfficientNet | EyePACS | Grad-CAM | 98.10% | 93.70% | – | 0.99 |
| | | | APTOS 2019 | | 99.10% | 92.50% | – | 0.998 |
| Glaucoma Diagnosis | Liao et al. [ | EAMNet | ORIGA dataset | Evidence Activation Mapping | – | – | – | 0.88 |
| Glaucoma Detection | Li et al. [ | AG-CNN | LAG database | Attention | 95.40% | 95.20% | 95.30% | 0.975 |
| | | | RIM-ONE database | | 84.80% | 85.50% | 85.20% | 0.916 |
| Retinal OCT image classification | Fang et al. [ | LACNN | UCSD dataset | Attention | 86.80% | 86.20% | 90.10% | – |
| | | | NEH dataset | | 99.33% | 99.39% | – | 0.994 |
Selected experimental results for COVID-19 diagnosis
| Task | Author | Network | Dataset | Interpretability method | Accuracy | Precision | Sensitivity/recall | Specificity | F1 score |
|---|---|---|---|---|---|---|---|---|---|
| COVID-19 Detection | Alshazly et al. [ | ResNet101 | SARS-CoV-2 CT-scan | t-SNE, Grad-CAM | 99.40% | 99.60% | 99.80% | 99.60% | 99.40% |
| | | DenseNet201 | COVID19-CT | | 92.90% | 91.30% | 93.70% | 92.20% | 92.50% |
| COVID-19 Detection | Shi et al. [ | - | COVID-19 X-Ray dataset (private) | Attention | 0.9411 | 0.9673 | 0.978 | – | 0.9726 |
| | | | COVID-19 CT dataset (private) | | 0.8654 | 0.913 | 0.8513 | – | 0.8811 |
| COVID-19 Detection | Brunese et al. [ | VGG-16 | Private dataset | Grad-CAM | 0.98 | – | 0.87 | 0.94 | 0.89 |
| COVID-19 Detection | Karim et al. [ | Ensemble model | Private chest X-ray dataset (balanced / imbalanced) | LRP, Grad-CAM++ | – | 0.904 / 0.877 | 0.905 / 0.881 | 0.905 / 0.879 | – |
| COVID-19 Classification | Wu et al. [ | Res2Net | Private COVID-CS dataset | Activation mapping | – | – | 95% | 93% | – |
Selected experimental results for diagnosing brain diseases
| Task | Author | Network | Dataset | Interpretability method | Accuracy | Precision | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|---|---|---|---|
| Diagnosis of AD | Nigri et al. [ | AlexNet | private MRI dataset | Swap Test | – | – | – | – | 0.923 |
| Diagnosis of AD by part-of-speech (PoS) features | Wang et al. [ | C-Attention-FT | DementiaBank dataset | Attention | 92.20% | 93.50% | 97.10% | – | 0.971 |
| Diagnosis of AD by language embedding features | Wang et al. [ | C-Attention-Embedding | DementiaBank dataset | Attention | 84.50% | 88.50% | 92% | – | 0.837 |
| Diagnosis of AD by both PoS and language embedding features | Wang et al. [ | C-Attention-FT+Embedding | DementiaBank dataset | Attention | 91.50% | 96.90% | 92.20% | – | 0.977 |
| Classification of NC (normal control) and AD | Achilleos et al. [ | – | ADNI-1 | Rule extraction | 91% | – | 87% | 95% | – |
| Detect PD | Magesh et al. [ | VGG16 | PPMI database | LIME | 95.20% | – | 97.50% | 90.90% | – |
Note: C-Attention-FT, C-Attention-Embedding and C-Attention-FT+Embedding denote the attention model with only part-of-speech (PoS) features, only the language embeddings, and a unified architecture (both PoS features and language embeddings), respectively
The common disease datasets
| Anatomy | Medical diagnosis | Dataset | Modality | Dataset size | Url |
|---|---|---|---|---|---|
| Eye | Diabetic Retinopathy | Messidor-2 | Retinal fundus images | 1748 images | |
| | Retinopathy grade; risk of macular edema | Messidor | Fundus color numerical images | 1200 images | |
| | Diabetic retinopathy grade; diabetic macular edema | IDRiD | Color fundus images | 516 images | |
| | Diabetic Retinopathy | e-ophtha-EX | Color fundus images | 82 images | |
| | | e-ophtha-MA | Color fundus images | 381 images | |
| | Diabetic retinopathy (DR) | DiaRetDB1 V2.1 | Digital images of eye fundus | 89 images | |
| | Retinal image processing | STARE | Retinal fundus images | 400 images | |
| | Diabetic retinopathy grade | APTOS 2019 | Fundus photography | 3662 training images, 1928 test images | |
| Lung | Lung nodules | LIDC-IDRI | CT scans | 1018 scans | |
| | Lung cancer | NIH Chest X-ray | X-ray images | Over 100,000 images | |
| | Lung nodules | LUNA16 | CT scans | 888 scans | |
| | 14 common thorax disease categories | ChestX-ray14 | X-ray images | 112,120 images | |
| Brain | Alzheimer’s disease | ADNI-1 | Multi-site MRI and PET data | Participants: 200 normal controls, 400 mild cognitive impairment (MCI), 200 mild AD | |
| | | ADNI GO | Multi-site MRI and PET data | Participants: 200 early MCI (EMCI, the mildest symptomatic phase of AD), 500 normal controls and MCI | |
| | | ADNI2 | Multi-site MRI and PET data | Participants: 150 normal controls, 450–500 CN and MCI, 350 EMCI, 150 LMCI, 200 mild AD | |
| | | ADNI3 | Multi-site MRI and PET data | Participants: 430–830 normal controls, 425–835 MCI, 215–335 mild Alzheimer’s disease dementia | |
| Parkinson’s disease | PPMI | DaTscan SPECT images | 642 images | ||
| Skin | Skin lesions | ISIC 2018 challenge Task1 | Dermoscopic lesion images | 2594 images and 2594 corresponding ground truth response masks | |
| ISIC 2018 challenge Task2 | Dermoscopic lesion images | 2594 images and 12,970 corresponding ground truth response masks (5 for each image) | |||
| ISIC 2018 challenge Task3 | Dermoscopic lesion images | 10,015 images and 1 ground truth response CSV file |