Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis.

Amitojdeep Singh, Janarthanam Jothi Balaji, Mohammed Abdul Rasheed, Varadharajan Jayakumar, Rajiv Raman, Vasudevan Lakshminarayanan.

Abstract

BACKGROUND: The lack of explanations for the decisions made by deep learning algorithms has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Attribution methods explaining deep learning models have been tested on medical imaging problems. The performance of various attribution methods has been compared for models trained on standard machine learning datasets but not on medical images. In this study, we performed a comparative analysis to determine the method with the best explanations for retinal OCT diagnosis.
METHODS: A well-known deep learning model, Inception-v3, was trained to diagnose 3 retinal diseases: choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. The explanations from 13 different attribution methods were rated by a panel of 14 clinicians for clinical significance. Feedback was obtained from the clinicians regarding the current and future scope of such methods.
RESULTS: An attribution method based on Taylor series expansion, called Deep Taylor, was rated the highest by clinicians with a median rating of 3.85/5. It was followed by Guided backpropagation (GBP), and SHapley Additive exPlanations (SHAP).
CONCLUSION: Explanations from the top methods were able to highlight the structures for each disease - fluid accumulation for CNV, the boundaries of edema for DME, and bumpy areas of retinal pigment epithelium (RPE) for drusen. The most suitable method for a specific medical diagnosis task may be different from the one considered best for conventional tasks. Overall, there was a high degree of acceptance from the clinicians surveyed in the study.

Keywords: choroidal neovascularization; deep learning; diabetic macular edema; drusen; explainable AI; image processing; machine learning; optical coherence tomography; retina

Year: 2021 PMID: 34177258 PMCID: PMC8219310 DOI: 10.2147/OPTH.S312236

Source DB: PubMed Journal: Clin Ophthalmol ISSN: 1177-5467

Introduction

Retinal diseases are prevalent among large sections of society, especially the aging population and those with systemic diseases such as diabetes.1 It is estimated that the number of Americans over 40 years with a diabetic retinopathy (DR) diagnosis will rise threefold from 5.5 million in 2005 to 16 million in 2050.2 For each decade of age after 40, the prevalence of low vision and blindness increases by a factor of three.3 Long wait times in the developed world and lack of access to healthcare in developing countries lead to delays in diagnosis and, in turn, to deteriorated vision and even irreversible blindness. This places a financial and psychological burden on patients, and a burden on the healthcare system due to higher treatment costs in the later stages. Tackling such challenges and providing efficient health services requires advanced tools to help health care professionals. Artificial intelligence (AI), especially deep learning (DL), which is modeled after the human neural system,4 has produced promising results in many areas including ophthalmology. It is used for tasks like disease detection,5 segmentation,6 and quality enhancement7 of optical coherence tomography (OCT) scans and fundus photographs. Convolutional neural networks (CNNs) are the most common form of deep learning algorithm used for image classification tasks like retinal disease detection and have shown promising results.8–10 Even though these algorithms show performance comparable to that of clinicians, the applications of DL methods in ophthalmology are limited. A major barrier to adoption is the “black-box” nature of these algorithms: unlike a clinician, they cannot explain how they arrived at a particular decision.
The other challenges include medico-legal and technical issues, which could involve new legislation, user-centric systems, and improved training.11 Various explainability methods have been developed and applied to different areas including medical imaging.12 Most explainability methods evaluate the contribution of each pixel of the image to the model output and hence are called attribution methods. Almost all studies, especially those for ophthalmic diagnosis, use a single explainability method and do not provide comparisons with alternatives.13,14 We argue that an explainability method that performs best on standard computer vision datasets may not be the most suitable for OCT images, which have a different data distribution from real-world images. Previously,15,16 we compared multiple explainability methods quantitatively for their ability to highlight the part of the image that had the most impact on the model decision. We also did an exploratory qualitative analysis using ratings from 3 optometrists, and the results showed the need for a more detailed analysis to judge these methods.16 In this study, we compare and evaluate 13 explainable deep learning methods for diagnosis of three retinal conditions: choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. These methods were rated by a panel of 14 eye care professionals (10 ophthalmologists and 4 optometrists). Their observations regarding the clinical significance of these methods, preferences regarding AI systems, and suggestions for future implementations are also analyzed herein.

Methods

In this section, we discuss the deep learning model used to detect the diseases along with a brief overview of the explainability methods used to generate the heatmaps of the regions the model considered for making the decisions.

Model

A CNN called Inception-v3,17 which is used for many computer vision tasks including the diagnosis of retinal images, was used to classify the data from the UCSD OCT dataset18 into 4 classes: CNV, DME, drusen, and normal. This dataset has OCT images taken from adult cohorts during routine clinical care, retrospectively selected for the diagnosed conditions (CNV, DME, drusen, and normal) from electronic medical record databases at various eye care centers between July 2013 and March 2017. Only horizontal foveal cross-section OCT scans were extracted, in a standard format. The model was trained on 84,000 images and tested on 1000 images (250 from each class). This resulted in a test accuracy of 99.3%. The confusion matrix showing the relationship between true and predicted classes is shown in Table 1. It compares the label (diagnosis) predicted by the model on the X-axis (columns) with the true label (ground truth) on the Y-axis (rows).

Table 1

Confusion Matrix for the Model on the Test Set of 1000 Images

	Predicted Label
		CNV	DME	Drusen	Normal	Total
True Label	CNV	249	0	1	0	250
	DME	1	249	0	0	250
	Drusen	3	0	247	0	250
	Normal	0	0	2	248	250
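
As a quick check of the reported figures, the following minimal sketch recomputes the overall accuracy from the counts in Table 1 and also derives per-class recall and precision. The per-class metrics are not reported in the source; they are included only to illustrate how the matrix is read (rows are true labels, columns are predicted labels).

```python
# Confusion matrix from Table 1: rows are true labels, columns are predicted labels.
classes = ["CNV", "DME", "Drusen", "Normal"]
confusion = [
    [249, 0, 1, 0],    # true CNV
    [1, 249, 0, 0],    # true DME
    [3, 0, 247, 0],    # true Drusen
    [0, 0, 2, 248],    # true Normal
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Per-class recall: correct predictions / images whose true label is that class.
recall = {c: confusion[i][i] / sum(confusion[i]) for i, c in enumerate(classes)}
# Per-class precision: correct predictions / images predicted as that class.
precision = {
    c: confusion[i][i] / sum(row[i] for row in confusion)
    for i, c in enumerate(classes)
}

print(f"accuracy = {accuracy:.3f}")
for c in classes:
    print(f"{c}: recall = {recall[c]:.3f}, precision = {precision[c]:.3f}")
```

The diagonal sums to 993 of 1000 images, which matches the stated test accuracy of 99.3%.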

Explainability with Attributions

The attribution methods used in this study can be categorized into 3 types, apart from the baseline occlusion method, which involves covering parts of the image to see the impact on the output. There are many methods to explain deep learning models, and we chose the 13 most common ones that were applicable to the underlying Inception-v3 model. The function-based methods derive attributions directly from the model gradients and include Gradient and SmoothGrad.19 The signal-based methods analyze the flow of information (signal) through the layers of the neural network and include DeConvNet,20 Guided Backpropagation (GBP),21 and Saliency.22 The methods based completely on attributions include Deep Taylor,23 DeepLIFT,24 Integrated Gradients (IG),25 input times gradient, Layerwise Relevance Propagation26 with Epsilon (LRP EPS) and Z rules (LRP Z), and SHAP.27 SHAP and DeepLIFT are considered state-of-the-art on standard machine learning datasets and have a stronger theoretical background, while IG is commonly used for retinal images.13,14 The heatmaps produced by the attribution methods for 3 correctly classified examples and 1 incorrectly classified example are shown in Figures 1 and 2. It must be noted that certain methods such as Deep Taylor and Saliency provide only positive evidence. Those providing both positive and negative evidence have some high-frequency noise (negative evidence) that can be removed in practice but is retained here to compare the original outputs.
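
To make the occlusion baseline concrete, here is a minimal sketch of the idea: cover one window of the image at a time and record how much the model score drops. It is an illustration under stated assumptions, not the study's implementation. The function names, the non-overlapping stride, the zero baseline value, and the toy scoring function standing in for the classifier are all hypothetical.

```python
def occlusion_map(image, score_fn, window, baseline=0.0):
    """Assign to each window the drop in score observed when it is covered."""
    height, width = len(image), len(image[0])
    reference = score_fn(image)
    heatmap = [[0.0] * width for _ in range(height)]
    # Assumption: non-overlapping windows (stride equal to the window size).
    for top in range(0, height, window):
        for left in range(0, width, window):
            rows = range(top, min(top + window, height))
            cols = range(left, min(left + window, width))
            occluded = [row[:] for row in image]
            for r in rows:
                for c in cols:
                    occluded[r][c] = baseline
            # Positive drop: the covered region supported the prediction.
            drop = reference - score_fn(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


# Hypothetical stand-in for a classifier: the "disease score" is the mean
# intensity of the lower-right 2x2 quadrant of a 4x4 image.
def toy_score(image):
    return sum(image[r][c] for r in (2, 3) for c in (2, 3)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(image, toy_score, window=2)
for row in heatmap:
    print(row)
```

Only the lower-right window receives a non-zero attribution, because it is the only region the toy score depends on. With a real network, `score_fn` would return the predicted probability of the class being explained.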

Figure 1

Heatmaps for scans with the larger pathologies: (top) choroidal neovascularization (CNV) and (bottom) diabetic macular edema (DME). For each case, Row 1: Input image, DeConvNet, Deep Taylor, DeepLIFT. Row 2: Gradient, GBP, Input times gradient, IG. Row 3: LRP EPS, LRP Z, Occlusion, Saliency. Row 4: SHAP Random, SHAP Selected, SmoothGrad. The scale in the bottom right shows that the parts highlighted in magenta provide positive evidence for the presence of a disease, while those in blue provide negative evidence, indicating that the image is closer to normal. Deep Taylor and GBP perform the best, while SHAP highlights partial but precise regions. The fluid accumulation for CNV and the edges of the edema for DME were highlighted by the better performing methods.

Figure 2

Heatmaps for 2 scans with drusen, the smaller pathology. Top: correct diagnosis. Bottom: incorrect diagnosis. The pathological structures are smaller than in the previous two conditions, and as a result most of the methods also highlight regions outside them. SHAP is the most precise here. In the incorrect case there is more negative evidence (blue), especially with occlusion. The performance of the methods can be observed in terms of positive highlights of the bumpy RPE.

The heatmaps generated by the 13 methods for 20 images from each disease category were evaluated by the 14 clinicians. The group had a median experience of 5 years in retinal diagnosis, including 4 years with OCT imaging. The average number of images rated per week was approximately 40, with all the clinicians having prior experience analyzing retinal SD-OCT images. They rated the explanations from 0 (not relevant) to 5 (fully relevant). The scores of each clinician were normalized by subtracting the respective mean and then rescaling between 0 and 5.
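
The normalization step can be sketched as follows. The text does not specify how the rescaling was done, so this is one plausible reading: each rater's scores are centered on that rater's own mean, and the pooled centered scores are then min-max rescaled to the 0 to 5 range. The function name and the example ratings are hypothetical. Note that if the min-max step were instead applied separately per rater, the mean subtraction would have no effect on the result.

```python
def normalize_ratings(ratings_by_rater, low=0.0, high=5.0):
    """Center each rater's scores on their own mean, then min-max rescale
    the pooled centered scores to the range [low, high]."""
    centered = {}
    for rater, scores in ratings_by_rater.items():
        mean = sum(scores) / len(scores)
        centered[rater] = [s - mean for s in scores]
    pooled = [s for scores in centered.values() for s in scores]
    min_c, max_c = min(pooled), max(pooled)
    if max_c == min_c:
        raise ValueError("all centered scores are identical; cannot rescale")
    scale = (high - low) / (max_c - min_c)
    return {
        rater: [low + (s - min_c) * scale for s in scores]
        for rater, scores in centered.items()
    }


# Made-up ratings for illustration: P1 is a lenient rater, P2 a strict one.
raw = {"P1": [3, 4, 5], "P2": [0, 1, 2]}
normalized = normalize_ratings(raw)
print(normalized)
```

In this toy case both raters end up with the same normalized scores, since they differ only by a constant offset, which is the kind of rater bias the mean subtraction is meant to remove.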

Results

Here we provide quantitative and qualitative results of this study. The ratings from clinicians and the survey used to collect the feedback are available on request.

Comparison Between Methods

The violin plots of the normalized scores of raters for all the methods across 60 scans are shown in Figure 3. The estimated probability density of each method is shown by the thickness of the violin plot. Table 2 gives the rating data for all conditions and methods. Deep Taylor, with the highest median rating of 3.85, was judged the best performing method. It is relatively simple to compute and involves a Taylor series expansion of the signal at the neurons. It was considerably ahead of GBP, the next best method (median 3.29), which was closely followed by SHAP with a selected background (3.26) and then a random background (3.23).

Figure 3

Violin plots of normalized ratings of all methods. The breadth of the plot shows the probability density of the data and the median value is reported on top of the plots. Deep Taylor was rated the highest overall followed by GBP and SHAP.

Table 2

Median Ratings (with IQR) for Each Disease for All Attribution Methods. Deep Taylor (Bold) Had the Highest Ratings

	Median Rating (IQR)
Method	CNV	DME	Drusen	All
DcNet	2.17 (1.71–2.61)	2.47 (1.74–3.09)	2.32 (1.71–2.61)	2.32 (1.71–2.82)
DTaylor	3.80 (3.22–4.05)	3.48 (3.09–3.99)	3.99 (3.58–4.56)	3.85 (3.23–4.07)
DLift-Res	2.44 (1.85–2.72)	2.44 (1.96–2.53)	2.53 (2.32–3.09)	2.47 (2.06–2.82)
Grad	2.32 (1.77–2.53)	2.47 (2.19–2.95)	2.44 (2.03–2.61)	2.44 (1.96–2.72)
GBP	3.23 (3.09–3.80)	3.26 (3.07–3.80)	3.71 (3.22–3.99)	3.29 (3.09–3.97)
I*Grad	2.50 (2.32–2.95)	2.47 (2.28–2.82)	2.53 (2.44–3.04)	2.50 (2.32–2.95)
IG	2.50 (2.32–2.95)	2.47 (2.19–2.82)	2.57 (2.44–3.20)	2.50 (2.32–2.95)
LRP.E	2.50 (2.32–2.95)	2.50 (2.32–2.95)	2.53 (2.41–3.04)	2.50 (2.32–2.95)
LRP.Z	2.50 (2.32–2.95)	2.50 (2.32–2.95)	2.53 (2.41–3.04)	2.50 (2.32–2.95)
Occ64	1.71 (1.55–1.96)	1.71 (1.42–1.85)	1.71 (1.42–1.96)	1.71 (1.52–1.96)
Saliency	2.47 (1.74–3.29)	2.72 (1.74–3.29)	2.61 (1.74–3.29)	2.61 (1.74–3.29)
SHAP-R	3.23 (2.53–3.85)	3.23 (2.53–3.85)	3.58 (2.89–3.96)	3.23 (2.53–3.85)
SHAP-S	3.23 (2.53–3.85)	3.23 (2.53–3.85)	3.53 (2.61–3.96)	3.26 (2.53–3.96)
SmoothGrad	2.45 (1.85–2.95)	2.47 (1.96–3.09)	2.47 (1.85–3.04)	2.47 (1.93–3.04)
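
A median with interquartile range, as in the "median (IQR)" entries of Table 2, can be computed along the following lines. This is only a sketch: the scores are made up, the helper name is hypothetical, and the source does not say which quartile convention was used, so the `inclusive` method of Python's `statistics.quantiles` is an assumption.

```python
import statistics


def median_iqr(scores):
    """Return (median, first quartile, third quartile) of a list of scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q2, q1, q3


# Made-up normalized ratings for one hypothetical method (not study data).
scores = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
median, q1, q3 = median_iqr(scores)
print(f"{median:.2f} ({q1:.2f}-{q3:.2f})")
```

Different quartile conventions can shift the IQR bounds slightly for small samples, so values computed this way may not match a published table exactly.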

IG, commonly employed in the literature for generating heatmaps for retinal diagnosis,13,14 received a median score of only 2.5. It is known to be strongly related and, in some cases, mathematically equivalent28 to LRP EPS, which was reflected in their similar ratings. Using the Z rule of LRP instead of the EPS rule was not found to make much difference, and the simple-to-compute input times gradient performed reasonably well. DeepLIFT could not be tested with its newer RevealCancel rule due to compatibility issues with the model architecture, and the older Rescale rule had a below-average performance. As expected, the baseline occlusion, which used a sliding window of size 64 to cover pixels and then compute their significance, performed worse than the attribution-based methods. Most of the methods have the majority of the values around the median, indicating consistent ratings across images and raters. Both cases of SHAP, as well as Saliency, have particularly elongated distributions. For SHAP, the curve is widest around 4, indicating good ratings for many cases. However, the values around 2.5, due to lower coverage of pathology, drive the overall median lower. In the case of Saliency, the ratings are spread from about 4.5 to 1.5, with many of them around the 3.25 and 1.75 marks. The former is due to larger coverage of the pathological region and the latter is due to the fact that it frequently missed regions. Hence, despite its better median value, it is not as suitable as lower-rated methods such as IG, where the bulk of the values are around the median.

Comparison Between Raters

The Spearman’s rank correlation was used to compare the ratings of the clinicians with each other. This non-parametric test assesses the relationship between two variables, in this case the ratings of images by two different clinicians. The correlations between the ratings of all 14 clinicians for the 60 images and 13 methods are shown in Figure 4. P1 to P10 are ophthalmologists while P11 to P14 are optometrists.
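
For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the two rank vectors, with tied values given the average of their ranks. The sketch below is a plain-Python illustration with made-up ratings from two hypothetical clinicians; it is not the analysis code used in the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up ratings from two hypothetical clinicians for the same five heatmaps.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 1, 4, 3, 5]
rho = spearman(rater_a, rater_b)
print(round(rho, 3))
```

Applying `spearman` to every pair of raters yields a symmetric matrix of the kind visualized in Figure 4.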

Figure 4

Spearman correlation for clinicians’ ratings.

Most of the values are around 0.5, indicating an overall moderate agreement between clinicians. The highest correlation was 0.76, between P10 and P13. A slight negative correlation was found between P1 and P11 as well as between P2 and P11. The rater P11 had relatively less experience with OCT, which could have resulted in a lower correlation with other clinicians. This suggests that the background and training (ie, prior experience) of clinicians affected their ratings of the system.

Qualitative Observations

In this section, the qualitative feedback given by the clinicians regarding the performance of the system, potential use cases, and other suggestions is summarized. A survey was collected from the clinicians after the study to seek their opinions. It is notable that 79% (11/14) of the clinicians who participated in the study indicated a preference for having an explainable system assisting them in practice, reaffirming the need for such a system in the clinical community. One of the ophthalmologists gave their feedback on the system as: “It is a definite boon to the armamentarium as far as screening and diagnosis is concerned on a mass scale or in a telemedicine facility.” The clinicians noted an overall better coverage of the pathology by Deep Taylor as the reason for its higher ratings; however, all methods except SHAP were found to mainly detect the boundaries. SHAP was observed to identify regions inside the edema as well, though its partial coverage of the region led to a lower score. The noise (represented in blue), especially in the case of LRP, was found to be a distraction by some clinicians and can be removed for actual implementation. Most of the clinicians identified telemedicine and tertiary care centres as potential sites that can utilize this system. It was suggested that it can be used for screening in places with a large number of patients and an insufficient number of clinicians. It could help clinicians by categorizing the scans with suspected conditions and thus allow them to focus their attention on examining the areas of the images highlighted by the algorithm. This can improve efficiency, save time, and therefore optimize patient care. Another application could be archival and data management, where the heatmaps could be used for separating images faster.

Discussion

Along with a comparison of various available attribution methods to explain deep learning models, this study validated their results through ratings from a large panel of clinicians. Most of them were not involved in the design process but were generally positive about the utility of the system. A method based on Taylor series expansion, known as Deep Taylor, received the highest ratings. Apart from highlighting the markers of the disease, it also focused on the structures that could indicate further proliferation, eg, the RPE in the case of mild drusen. However, the methods with stronger theoretical foundations did not perform well when compared to Deep Taylor. It should be noted that the original goal of these techniques is to generate a true representation of the features learned by a model for a given task. Hence, the heatmaps generated are affected both by the model and the attribution method. It must also be noted that a significant issue with GBP, the second highest rated method in this study, is that it acts as an edge detector and does not actually reveal the model’s decision-making process.29,30 The dataset used here labeled only the primary diagnosis; however, the clinicians were able to identify secondary diagnoses for some images from their evaluation. Also, due to the nature of the dataset, the study is limited to a single orientation of the OCT scan, which might differ between the images. All clinicians preferred to have a presentation of the scan position on fundus images in addition to OCT for a better understanding of the scanned area. A system that uses a combination of fundus images, OCT, and patient data (eg, Mehta et al31) could be useful in practice. Another application of an explainability system could be as a tool for self-learning. The system can be further developed to encompass other diseases and fine-tuned for the specific imaging modality, considering variables such as noise, illumination, field position, etc.
Currently, OCT is not used in screening because the devices are expensive as well as bulky. Given recent advances in low-cost portable OCT devices,32 it is possible to integrate an explainable diagnosis system on a laptop or mobile device for teleophthalmology purposes, which would be invaluable to the clinical community.

Conclusion

This is, to the best of our knowledge, one of the first studies to look at a qualitative comparison of various explainable AI methods performed by a large panel of clinicians. A method based on Taylor series expansion, known as Deep Taylor, received the highest ratings, outperforming methods with a stronger theoretical background and better results on standard datasets. In addition to highlighting the pre-existing pathology, it could also highlight markers for further proliferation. A more detailed analysis of specific retinal structures highlighted by the algorithms in comparison to clinical evaluation is currently underway. Positive feedback about the use of such a system was received from the panel of clinicians. Future enhancements of the system could make it a trustworthy diagnostic assistant, helping to resolve the lack of access to ophthalmic healthcare.

11 in total

1. Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search.

Authors: Leyuan Fang; David Cunefare; Chong Wang; Robyn H Guymer; Shutao Li; Sina Farsiu
Journal: Biomed Opt Express Date: 2017-04-27 Impact factor: 3.732

2. Managing diabetic macular edema: The leading cause of diabetes blindness.

Authors: Pedro Romero-Aroca
Journal: World J Diabetes Date: 2011-06-15

3. Using a Deep Learning Algorithm and Integrated Gradients Explanation to Assist Grading for Diabetic Retinopathy.

Authors: Rory Sayres; Ankur Taly; Ehsan Rahimy; Katy Blumer; David Coz; Naama Hammel; Jonathan Krause; Arunachalam Narayanaswamy; Zahra Rastegar; Derek Wu; Shawn Xu; Scott Barb; Anthony Joseph; Michael Shumski; Jesse Smith; Arjun B Sood; Greg S Corrado; Lily Peng; Dale R Webster
Journal: Ophthalmology Date: 2018-12-13 Impact factor: 12.079

4. Ophthalmic diagnosis using deep learning with fundus images - A critical review.

Authors: Sourya Sengupta; Amitojdeep Singh; Henry A Leopold; Tanmay Gulati; Vasudevan Lakshminarayanan
Journal: Artif Intell Med Date: 2019-11-22 Impact factor: 5.326

5. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning.

Authors: Daniel S Kermany; Michael Goldbaum; Wenjia Cai; Carolina C S Valentim; Huiying Liang; Sally L Baxter; Alex McKeown; Ge Yang; Xiaokang Wu; Fangbing Yan; Justin Dong; Made K Prasadha; Jacqueline Pei; Magdalene Y L Ting; Jie Zhu; Christina Li; Sierra Hewett; Jason Dong; Ian Ziyar; Alexander Shi; Runze Zhang; Lianghong Zheng; Rui Hou; William Shi; Xin Fu; Yaou Duan; Viet A N Huu; Cindy Wen; Edward D Zhang; Charlotte L Zhang; Oulan Li; Xiaobo Wang; Michael A Singer; Xiaodong Sun; Jie Xu; Ali Tafreshi; M Anthony Lewis; Huimin Xia; Kang Zhang
Journal: Cell Date: 2018-02-22 Impact factor: 41.582

6. First Clinical Application of Low-Cost OCT.

Authors: Ge Song; Kengyeh K Chu; Sanghoon Kim; Michael Crose; Brian Cox; Evan T Jelly; J Niklas Ulrich; Adam Wax
Journal: Transl Vis Sci Technol Date: 2019-06-28 Impact factor: 3.283

7. Automated Detection of Glaucoma With Interpretable Machine Learning Using Clinical Data and Multimodal Retinal Images.

Authors: Parmita Mehta; Christine A Petersen; Joanne C Wen; Michael R Banitt; Philip P Chen; Karine D Bojikian; Catherine Egan; Su-In Lee; Magdalena Balazinska; Aaron Y Lee; Ariel Rokem
Journal: Am J Ophthalmol Date: 2021-05-02 Impact factor: 5.258

8. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation.

Authors: Sebastian Bach; Alexander Binder; Grégoire Montavon; Frederick Klauschen; Klaus-Robert Müller; Wojciech Samek
Journal: PLoS One Date: 2015-07-10 Impact factor: 3.240

9. Weakly supervised lesion localization for age-related macular degeneration detection using optical coherence tomography images.

Authors: Hyun-Lim Yang; Jong Jin Kim; Jong Ho Kim; Yong Koo Kang; Dong Ho Park; Han Sang Park; Hong Kyun Kim; Min-Soo Kim
Journal: PLoS One Date: 2019-04-05 Impact factor: 3.240