Yuexin Cai1,2, Jin-Gang Yu3, Yuebo Chen1,2, Chu Liu1,2, Lichao Xiao3, Emad M Grais4, Fei Zhao4, Liping Lan1,2, Shengxin Zeng1,2, Junbo Zeng1,2, Minjian Wu1,2, Yuejia Su1,2, Yuanqing Li3, Yiqing Zheng5,2. 1. Department of Otolaryngology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, Guangdong Province, China. 2. Institute of Hearing and Speech-Language Science, Sun Yat-Sen University, Guangzhou, Guangdong Province, China. 3. Department of Automation Science and Engineering, South China University of Technology School, Guangzhou, Guangdong, China. 4. Centre for Speech and Language Therapy and Hearing Science, Cardiff School of Sport and Health Sciences, Cardiff Metropolitan University, Cardiff, UK. 5. Department of Otolaryngology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, Guangdong Province, China zhengyiq@mail.sysu.edu.cn.
The two-stage approach for model development attempts to replicate the human visual attention system and the procedure of clinical judgement made by an experienced otolaryngologist examining the tympanic membrane from whole view to partial image.
The prediction results of the two models, each having different functionality and strength, were averaged to use the advantages of each.
No medical history or hearing information was provided to the deep learning model or the otolaryngologists, which may have compromised the diagnostic accuracy.
All the otoscopic images were taken after cerumen had been removed, if needed.
Introduction
Otitis media (OM) is a common otological disease with a number of forms, including acute otitis media (AOM), otitis media with effusion (OME) and chronic suppurative otitis media (CSOM), all of which can be treated with appropriate medical care.1 It is a primary reason for people to seek medical care, antibiotic prescription and surgery.2 Without early diagnosis and appropriate treatment, deterioration and even irreversible complications may occur.3 History taking, otoscopy and otoendoscopy are essential first steps in the evaluation of patients with OM in Ear, Nose and Throat and Audiology Clinics. The clear view or visual image obtained from otoscopy or otoendoscopy provides important information for the initial diagnosis of otological diseases.1–3
Misdiagnosis is most likely to occur if the medical doctor lacks experience in the use of otoscopy or otoendoscopy.4 5 For example, Sorrento and Pichichero6 found that the correct diagnosis rate of OM by paediatricians was only 50%, in comparison with 73% by otolaryngologists. Poor diagnostic accuracy leads to misdiagnosis and delay in treatment, which may cause preventable complications.2 7 8 A new diagnostic strategy for patients with OM needs to be developed in order to improve diagnostic accuracy.
In recent years, artificial intelligence has been applied to medical image analysis to help clinical interpretation and medical diagnosis by image classification, segmentation and matching.7–10 In particular, convolutional neural networks (CNNs) have been widely used and demonstrate good performance in the automated classification of medical images, including diabetic retinopathy detection,9 10 skin cancer classification11 12 and congenital cataract detection.13 However, few machine-learning studies have been conducted in the field of otology for the automated diagnosis of ear diseases from otoscopic images.
Myburgh et al14 built a neural network using 389 images to classify five categories of video-otoscopic image, namely normal tympanic membrane (TM), obstructing wax or foreign body in the external ear canal, AOM, OME and CSOM, achieving a classification accuracy of 86.84%. Recently, a machine-learning model was generated by combining two of the best performing models (Inception-V3 and ResNet101).7 The model used 10 544 otoendoscopic images to classify six categories of ear disease: normal, attic retraction, tympanic perforation, otitis externa with myringitis, otitis externa without myringitis and tumour. This learning model achieved an accuracy of 93.67%. However, the reliability of the combined classification method appears questionable, since simply aggregating two independent models is hard to justify in terms of reliability and interpretability without an effective underlying rationale. Somewhat differently, Lee et al1 developed a heat map using a CNN model to distinguish between the normal middle ear condition and ears with chronic OM in an inactive phase, in order to detect the presence of perforation. The accuracy was 91.0%; however, the CNN model had poor consistency in determining the perforated area of the TM, which might be due to the relatively small sample. In addition, the model has not been improved through validation.
In the present study, an algorithm that combined results from two CNNs used in two separate stages was developed for the classification of normal eardrum, OME and CSOM in an active or inactive phase. The network in the first stage dealt with the whole image, while in the second stage the network made its decision based on a discriminative segment of the TM image. Instead of using manual annotation to locate the discriminative part, an interpretation method called Class Activation Maps (CAMs)15 was used to identify the discriminative segment automatically and without extra annotation.
The rationale underlying the two-stage approach was based on the human visual attention system together with the procedure of clinical judgement made by an experienced otolaryngologist when they undertake examination of the TM from whole view to partial image. The human visual attention system selectively concentrates on parts of the visual space to capture salient information rather than processing the whole scene.16 Otolaryngologists can easily identify the salient lesion areas in an otoscopic image, mainly due to the attention mechanism of the human visual perception system.17 18 Lesions in otoscopic images always attract most of the otolaryngologist’s attention during the physical examination on patients. This strategy of attention-based CNN has been previously applied to several other medical image analyses.17 19–23 Therefore, it is logical to incorporate the lesion attention mechanism into the otoscopic image classification models.
Materials and methods
Otoendoscopic image data acquisition
Images were collected retrospectively from 2022 patients who had attended the Department of Otorhinolaryngology, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, China between 2015 and 2019. The anonymised images were identified and categorised using the clinical diagnostic information. The four conditions of the middle ear included in this study were: normal middle ear; OME; CSOM in active phase (CSOMa); CSOM in inactive phase (CSOMi).
Normal TM images were obtained from healthy ears in subjects with normal hearing thresholds determined by pure tone audiometry and a type A tympanogram. All patients with OME were confirmed by a type B tympanogram. CSOMa was further confirmed if the patient reported increased otorrhoea at the time the images were taken.
TM image data acquisition
TM images were taken by otolaryngologists using a 4 mm STORZ 0° endoscope (KARL STORZ, Germany) and a video-recording system. All images were taken before any surgery. Images were saved as JPEG graphic files with resolutions ranging from 500×500 to 700×700 pixels. Three images were chosen from each of the 2022 patients, giving a total of 6066 images categorised into four conditions: 1040 images of normal TM, 2613 images of OME, 1662 images of CSOMa and 751 images of CSOMi. Figure 1 shows examples of the TMs in each category together with a clear definition.
Figure 1
Classification tree for the four diagnostic classes. A total of 6066 images are included and categorised to four conditions: 1040 images of normal tympanic membrane, 2613 images of otitis media effusion, 1662 images of chronic suppurative otitis media in active phase and 751 images of chronic suppurative otitis media in inactive phase.
Labelling of images
All CSOM images were labelled according to surgical findings of TM perforation and postoperative pathological reports if available. CSOMa was further confirmed if the patient reported increased otorrhoea at the time the images were taken. OME was labelled if tympanocentesis found fluid in the tympanic cavity.
Two-stage model for classification
One commonly used strategy for classifying TM images is to first pretrain a CNN model on a large-scale natural image dataset, such as the popular ImageNet,24 which includes over 1 million images, and then fine tune the model on the target training dataset. However, a major disadvantage of such a strategy is that the images in the target dataset have to be downsampled to a much lower resolution in order to fit the mandatory input size of the CNN model; for instance, a 224×224 pixel resolution is required for ResNet. The downsampling can cause severe information loss and thereby performance degradation.
In order to boost performance and make full use of all discriminative parts in the input image, a two-stage classification pipeline that resembles the human attention mechanism was used in the present study. The pipeline involved two separate CNN models, called the main model and the focal model, which provided predictions based on the whole image and on important parts of the image, respectively.
The main model acted as the main classifier of images as well as a filter to remove irrelevant parts of an image for the focal model. By fine tuning a network pretrained on ImageNet to the task, we produced the main classifier for decision-making based on the whole image. Important parts of the input image could be located by training a detection network on manual annotations such as bounding boxes. Instead, we extracted information from the trained classification network, which provides weak clues to the location of important parts automatically. The CAM15 is a simple but effective method to visualise parts of interest from a network by projecting the weights of the output layer back onto the feature maps obtained from the last convolutional layer. By calculating CAMs from input images, the location of important parts could be highlighted.
After upsampling the CAMs to the same size as the original image, a discriminative patch with a higher resolution could be cropped from the original image. This patch contained more information and improved the classification greatly. In summary, for each input image, the main model outputs a series of scores indicating the classification result while at the same time providing a heat map from CAMs to help locate discriminative patches.
The focal model acts as the focal classifier of images by focusing on the discriminative patches. From a single image, it was possible to extract many patches simply by random cropping, a widely used scheme for data augmentation in CNN models.25 26 However, patches selected under guidance are more likely to be relevant and discriminative, leading to better performance. This model was fine tuned on a series of patches selected from the CAM by sliding windows of different sizes. As shown in figure 2, the sliding-window strategy for patch selection is to use a window to scan through the whole image with a fixed step size, for example, 16 pixels in both row and column directions. At each location, a patch can be cropped from the image and a score assigned to this patch according to the binarised heat map. The score of the patch was calculated by averaging the intensity values of pixels within the corresponding window in the heat map. Several different window sizes were predefined. The image patches cropped at various sizes and locations were then ordered according to their scores, and those with the highest scores were selected. To avoid the window sliding out of the image, locations very close to the image boundaries were not considered. In the test phase, the maximum point of the heat map indicated the most probable location of an object, and a patch was selected around it.
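The projection that produces a CAM can be written as a weighted sum of the last-layer feature maps. The sketch below is only an illustration under stated assumptions: it uses plain Python lists so that it runs without dependencies, the toy feature maps and weights are hypothetical, and the `normalise` helper (rescaling to [0, 1]) is an added convenience rather than a step described in the paper. A real pipeline would take the feature maps and output-layer weights from the trained network.

```python
from typing import List

Matrix = List[List[float]]


def class_activation_map(feature_maps: List[Matrix], class_weights: List[float]) -> Matrix:
    """Weighted sum of the last-layer feature maps for one class.

    feature_maps: K maps of size H x W from the last convolutional layer.
    class_weights: the K output-layer weights attached to the chosen class.
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("one weight per feature map is required")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalise(cam: Matrix) -> Matrix:
    """Rescale a map to [0, 1] (an assumption made here for display/thresholding)."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Hypothetical toy input: two 2x2 feature maps and the weights of one class.
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(maps, [0.5, 1.0])
heat = normalise(cam)
```

The resulting low-resolution map would then be upsampled to the size of the original image before any patch is cropped.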
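The sliding-window selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy 6×6 mask and the tie-breaking rule are assumptions, and keeping the top patches separately for each window size is one reading of the text. Windows are only placed where they fit entirely inside the image.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]
Patch = Tuple[float, int, int, int]  # (score, top, left, size)


def window_score(mask: Mask, top: int, left: int, size: int) -> float:
    """Mean of the binarised heat-map values inside one square window."""
    total = sum(mask[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size))
    return total / (size * size)


def select_patches(mask: Mask, sizes: Sequence[int], step: int, per_size: int) -> List[Patch]:
    """Scan the mask with each window size and keep the best-scoring windows."""
    height, width = len(mask), len(mask[0])
    selected: List[Patch] = []
    for size in sizes:
        candidates: List[Patch] = []
        # Only positions where the window stays inside the image are visited.
        for top in range(0, height - size + 1, step):
            for left in range(0, width - size + 1, step):
                candidates.append((window_score(mask, top, left, size), top, left, size))
        # Highest score first; ties are broken by position (an assumption).
        candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
        selected.extend(candidates[:per_size])
    return selected


# Hypothetical 6x6 binarised heat map with a 2x2 salient block in the middle.
mask = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (2, 3):
        mask[y][x] = 1

patches = select_patches(mask, sizes=(2, 4), step=2, per_size=1)
```

In the paper's setting the step would be 16 pixels and the mask would have the size of the original image; the small numbers here only keep the example readable.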
Figure 2
Every patch in the same image. The sliding-window strategy for patch selection is to take a window to scan through the whole image with a fixed step size (eg, 16 pixels in both row and column directions). At each location, a patch can be cropped from the image and a score assigned to this patch according to the binarised heat map. The score of this patch is calculated by averaging the intensity values of pixels within the corresponding window in the heat map. Several different window sizes are predefined. Then, the image patches cropped at various sizes and locations (eg, the red and green boxes) are ordered according to their scores, and the top ones with high scores are selected (eg, the green box). CAM, Class Activation Map.
As discussed above, the two models have different functionality and strength owing to the difference in the scale they handle. The main model provides a global result, while the focal model works on patches containing discriminative and local features. We merged the prediction results of the two models to use the advantage of each by averaging the classification output scores. The complete classification pipeline of the method is shown in figure 3.
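Averaging the output scores of the two models can be illustrated with a few lines of Python. The score vectors below are hypothetical and chosen so that the merged decision differs from that of the main model alone; the function name and class order are assumptions made for the example.

```python
from typing import List, Sequence

CLASSES = ["Normal", "OME", "CSOMa", "CSOMi"]


def average_scores(main_scores: Sequence[float], focal_scores: Sequence[float]) -> List[float]:
    """Element-wise mean of the class scores from the main and focal models."""
    if len(main_scores) != len(focal_scores):
        raise ValueError("both models must score the same classes")
    return [(m + f) / 2.0 for m, f in zip(main_scores, focal_scores)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical scores for one image (not taken from the paper).
main = [0.10, 0.20, 0.40, 0.30]    # main model favours CSOMa
focal = [0.05, 0.05, 0.30, 0.60]   # focal model favours CSOMi
merged = average_scores(main, focal)
label = CLASSES[argmax(merged)]
```

Here the focal model's stronger evidence from the cropped patch shifts the merged prediction to CSOMi, which is the kind of complementary behaviour the averaging is meant to exploit.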
Figure 3
The complete classification pipeline of the method. The main classifier provides a global result and the focal model works on patches containing discriminative and local features. The prediction results of the two models are merged by averaging the classification output scores obtained from these two models. CAMs, Class Activation Maps.
Experiment settings and procedure
All experiments were performed using an Intel Xeon E5-2620 CPU and an NVIDIA TITAN Xp GPU. The deep learning framework was developed in Python 3.6 using Keras with a TensorFlow backend.
To evaluate the method, two common backbone networks, that is, ResNet27 and Inception-V3,26 were used in the experiments. More specifically, both networks were trained for 30 epochs using the Adam optimiser. The learning rate was initially set to 0.0001 and decayed by a factor of 0.1 every 10 epochs.
During the training phase, online data augmentation was adopted, including random shifting, shearing, zooming and flipping. The input size of ResNet is 224×224, while Inception-V3 uses 299×299. The input image was resized to fit the input size of the networks using bilinear interpolation. In addition, all the experiments were conducted using patient-level threefold cross-validation. Three images were chosen at the same time from each patient. Because the dataset was split for cross-validation by patient rather than by image, no samples from the same patient appeared in both the training and testing sets. All images chosen for further analysis were handled in the same way. Overall accuracy was calculated as the number of correctly classified images divided by the total number of considered images. All results were the average performance over the three folds. The F1 Score was used to evaluate the performance for each type of TM image, with the advantage of considering both precision and recall.
The training process included three steps, as follows.
First, a CNN model pretrained on ImageNet was fine tuned with the training dataset of TM images to obtain the main model.
Second, several local patches were located in each image by using the CAM15 derived from the main model. The patches acquired over the whole dataset were then pooled to train the focal model, which had the same network structure as the main model. Since CAMs are class specific, and the true class labels of images in the testing phase were unavailable, we averaged the CAMs of all classes to obtain a general attention map.
It is noteworthy that patches were selected differently in the training and testing phases. The original heat map was continuously valued, so we turned it into a binary image (pixels taking the value 0 or 1) by choosing a threshold and setting all pixels above this value to 1 and those below it to 0. During the training phase, the general attention maps were binarised with a threshold of 0.5. Sliding windows of 300, 400 and 500 pixels were used, and two patches were selected for each size, in order to keep all useful information. However, in the testing phase, considering the heavy computation cost, we only selected a 400×400 patch with its centre located on the maximum point of the general attention map.
Third, a pretrained network used as the focal model was fine tuned on the selected patches, helping to achieve a better performance. The patches for training the focal model were selected from each image in the corresponding training set.
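The general attention map and the test-phase patch can be sketched as below. This is an illustration under assumptions rather than the authors' code: the CAMs are taken to be already upsampled to the image size, pixels exactly equal to the threshold are mapped to 0, and the box is shifted back inside the image when the maximum lies near a border (the paper does not say how that case was handled). All function names and the 5×5 toy maps are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def general_attention_map(cams: List[Matrix]) -> Matrix:
    """Pixel-wise mean of the class-specific CAMs."""
    height, width = len(cams[0]), len(cams[0][0])
    return [[sum(cam[y][x] for cam in cams) / len(cams) for x in range(width)]
            for y in range(height)]


def binarise(heat: Matrix, threshold: float = 0.5) -> List[List[int]]:
    """Pixels above the threshold become 1, all others 0."""
    return [[1 if v > threshold else 0 for v in row] for row in heat]


def centred_box(heat: Matrix, size: int) -> Tuple[int, int, int]:
    """Square box of the given size centred on the maximum of the map.

    Returns (top, left, size). The box is clamped to the image, which is an
    assumption made for this sketch.
    """
    height, width = len(heat), len(heat[0])
    if size > height or size > width:
        raise ValueError("patch is larger than the image")
    best_y, best_x = max(((y, x) for y in range(height) for x in range(width)),
                         key=lambda p: heat[p[0]][p[1]])
    top = min(max(best_y - size // 2, 0), height - size)
    left = min(max(best_x - size // 2, 0), width - size)
    return top, left, size


def blank(n: int) -> Matrix:
    return [[0.0] * n for _ in range(n)]


# Hypothetical 5x5 CAMs for two classes.
cam_a = blank(5)
cam_a[4][4] = 1.0
cam_b = blank(5)
cam_b[4][4] = 0.5
cam_b[0][0] = 0.5

general = general_attention_map([cam_a, cam_b])
mask = binarise(general, threshold=0.5)   # used for training-phase scoring
box = centred_box(general, size=3)        # test-phase patch (400 in the paper)
```

In the test phase only `centred_box` is needed, which is why it is much cheaper than scanning every window position.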
Patient and public involvement
Due to the retrospective nature of this study, patients and the public were not involved in the study design and research analysis.
Results
The classification results are shown in table 1. Although the accuracy of the focal classifier was lower in both backbones, the focal classifier can help the main classifier achieve better performance. The consistent performance achieved by using two different backbones indicates that our method is robust and insensitive to the choice of the backbone network.
Table 1
Comparison results of various methods on our dataset
| Method | Backbone | F1 Score: Normal | F1 Score: OME | F1 Score: CSOMa | F1 Score: CSOMi | Overall accuracy |
| Main classifier | Inception-V3 | 0.9178±0.0224 | 0.9613±0.0032 | 0.9028±0.0085 | 0.8103±0.0062 | 0.9219±0.0068 |
| Focal classifier | Inception-V3 | 0.9294±0.0085 | 0.9589±0.0031 | 0.8793±0.0113 | 0.7583±0.0444 | 0.9078±0.0117 |
| Our pipeline | Inception-V3 | 0.9485±0.0065 | 0.9470±0.0017 | 0.9099±0.0104 | 0.8180±0.0295 | 0.9330±0.0081 |
| Main classifier | ResNet50 | 0.9033±0.0189 | 0.9500±0.0141 | 0.9133±0.0047 | 0.8133±0.0170 | 0.9162±0.0056 |
| Focal classifier | ResNet50 | 0.9333±0.0047 | 0.9633±0.0094 | 0.8900±0.0082 | 0.7500±0.0283 | 0.9126±0.0045 |
| Our pipeline | ResNet50 | 0.9433±0.0125 | 0.9684±0.0132 | 0.9167±0.0047 | 0.8237±0.0171 | 0.9337±0.0051 |
The best results in each set of experiments are in bold. All results are reported as the mean and SD over the three folds. We used the F1 Score to measure the performance for each type of image in order to consider both precision and recall. The overall accuracy is calculated as the ratio of the number of correctly classified images to the total number of images in the test set.
CSOMa, chronic suppurative otitis media in active phase; CSOMi, chronic suppurative otitis media in inactive phase; OME, otitis media with effusion.
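Both metrics can be derived from a confusion matrix. The sketch below assumes a matrix indexed as `confusion[true][predicted]` and uses hypothetical counts for three classes; it is meant only to make the definitions concrete, using F1 = 2TP / (2TP + FP + FN), which is equivalent to the harmonic mean of precision and recall.

```python
from typing import List

Confusion = List[List[int]]  # confusion[true_class][predicted_class]


def overall_accuracy(confusion: Confusion) -> float:
    """Correctly classified images divided by all images."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def f1_scores(confusion: Confusion) -> List[float]:
    """Per-class F1 score computed from true/false positives and negatives."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(n)) - tp  # predicted c, truly another class
        fn = sum(confusion[c]) - tp                       # truly c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical counts for three classes (not results from the paper).
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 0, 10]]
accuracy = overall_accuracy(confusion)
f1 = f1_scores(confusion)
```

In the paper these values are computed per fold and then reported as the mean and SD over the three folds.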
To provide a comparison with human experts, five doctors with a variety of experience were invited to label a subset of the test dataset. A total of 270 images from 90 subjects were randomly selected for each type of image, so that 1080 images in total were used in this evaluation. Two associate chief doctors achieved accuracies of 91.02% and 87.50%, while two attending doctors achieved accuracies of 86.57% and 79.44%, respectively. A primary doctor achieved an accuracy of 79.07%. Figure 4 displays the confusion matrices of our method using ResNet50 and of three doctors with differing accuracies. The confusion matrices were consistent with clinical experience, indicating that it is more difficult to distinguish normal from OME and CSOMa from CSOMi than to distinguish normal from pathological images or OME from CSOM. For both our method and the experts, it was difficult to distinguish the slight differences between normal images and OME, or between CSOMa and CSOMi. The results from each stage of our method were combined across the three folds, that is, the whole dataset was considered in calculating the confusion matrices.
Figure 4
Confusion matrices for each stage in our method with ResNet50 and three human experts. The row axis indicates the prediction while the column axis represents the ground truth. Results from the test sets of the three folds are combined to report the performance on the whole dataset. (A, B) show the results of the two major classifiers of our method, while (C) reports the result of averaging the two. In addition, the overall accuracies of the three experts are 79.07%, 86.57% and 91.02% (D, E, F). CSOMa, chronic suppurative otitis media in active phase; CSOMi, chronic suppurative otitis media in inactive phase; OME, otitis media with effusion.
The performance on the two challenging binary classification problems, normal versus OME and CSOMa versus CSOMi, was further evaluated. The receiver operating characteristic (ROC) curve is an efficient tool to evaluate the overall quality of a classification model across all decision thresholds. Figure 5 shows the average ROC curves of our method with ResNet50 as backbone. True positive and false positive rates were calculated for each doctor and marked on the figure. As figure 5A shows, in assessing the images of the normal TM and OME, the performance of an inexperienced primary doctor was significantly different from the judgements reported by relatively experienced attending doctors and associate chief doctors. Moreover, with regard to the more challenging task of distinguishing the two stages of CSOM, only one associate chief doctor performed better than the method proposed in this paper.
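The comparison between a model's ROC curve and a doctor's single operating point can be made concrete with a short sketch. The labels, scores and doctor decisions below are hypothetical, and the threshold sweep is a simple textbook construction rather than the plotting code used for figure 5.

```python
from typing import List, Sequence, Tuple


def rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for binary labels."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / positives, fp / negatives


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tpr, fpr = rates(y_true, y_pred)
        points.append((fpr, tpr))
    return points


# Hypothetical example: 1 = OME, 0 = normal.
y_true = [1, 1, 0, 0]
model_scores = [0.9, 0.4, 0.6, 0.1]   # model's score for the positive class
doctor_labels = [1, 0, 0, 0]          # a doctor gives hard decisions only

curve = roc_points(y_true, model_scores)
doctor_tpr, doctor_fpr = rates(y_true, doctor_labels)
```

A doctor contributes one (FPR, TPR) point because a human rater gives a single decision per image, whereas the model's continuous scores trace a full curve as the threshold varies.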
Figure 5
Receiver operating characteristic curve for classification of two challenging situations, comparing the method using ResNet50 with human experts. The red curve is the average of the three folds' performance and the other curves show the result for each fold. Our method can achieve a performance similar to that of the associate chief doctor. AUC, area under the curve; CSOMa, chronic suppurative otitis media in active phase; CSOMi, chronic suppurative otitis media in inactive phase; OME, otitis media with effusion.
Figure 6 shows the intermediate results of typical samples using the method proposed in this study. The green box indicates the patch used in the test phase, and the translucent mask shows the areas used to select patches during the training phase. The core areas of the TM were successfully detected with the weakly supervised approach.
Figure 6
Typical samples of each situation, including normal, OME, CSOMa and CSOMi. From left to right, each column shows original image, the CAM of normal, the CAM of OME, the CAM of CSOMa, the CAM of CSOMi, the averages of CAM, the selected box and the patch for the focal classifier. CAM, Class Activation Map; CSOMa, chronic suppurative otitis media in active phase; CSOMi, chronic suppurative otitis media in inactive phase; OME, otitis media with effusion.
Discussion
In this study, we propose a two-stage CNN method for otoscopic image classification. To the best of our knowledge, this is the first time that an attention-based model combining global and local information has been used in the field of otoscopic image analysis. This deep learning model adopts attention mechanisms to focus on salient lesion areas in the TM while correcting image alignment to reduce the impact of noise. The results show that the classification performance reaches the diagnostic level of an associate professor in otolaryngology in identifying normal, OME, CSOMa and CSOMi from otoscopic images.
Previous studies have built machine-learning models for ear disease diagnosis and achieved high diagnostic accuracy.1 7 8 14 A recent study compared nine transfer learning models and assembled two of them (Inception-V3 and ResNet101) to build a deep learning model to automatically diagnose ear disease.7 They achieved an accuracy of 93.67% using a large database of 10 544 images. However, only 86% accuracy was achieved when the number of images for training was reduced to 5000. In this study, an equivalently high accuracy of 93.37% was achieved using only 6066 images. The two models were assembled in a different way from that reported in the previous study.7 First, a main classifier was tuned on the entire TM images acquired by otoscope. Second, we calculated CAMs from the trained main classifier to locate TM details that could be used to improve performance. Finally, another pretrained network was tuned on the selected patches as the focal classifier, enabling the main classifier to perform better. As table 1 shows, this assembled system improved the accuracy over the single model method.
The achievement of high accuracy with a relatively small database may be attributed to the combined use of the main classifier and the focal classifier with CAM. CAM highlights the important area in a trained network,15 which relates to attention mechanisms.
The attention mechanism of the model was consistent with the way an experienced otolaryngologist concentrates on local lesion-related areas in otoscopic examination. Based on our findings, CAM focused mainly on the perforation areas of the TM in CSOM images, whereas in OME images it highlighted a shortened or absent cone of light as well as colour changes in the eardrum. Therefore, CAM was used as a guide to select discriminative parts and to fine tune another network to focus on those parts, which provide higher resolution and more information. Consequently, the focal classifier bases its results on partial images. When information from both the whole image and the partial image was considered, the overall accuracy of both backbone networks was enhanced by 1%–2%. Further improvement in the accuracy of automated detection of pathological changes in otoscopic images is vital to support the clinical diagnosis of OME by paediatricians, physicians and junior otolaryngologists.
It is of note that there are a few limitations in the present study. First, no medical history or hearing information was provided to the deep learning model or the otolaryngologists, which may have compromised the diagnostic accuracy. For example, CSOM is often accompanied by symptoms of recurrent otorrhoea and hearing loss,28 and doctors can greatly improve their accuracy of diagnosis by asking for a history. In addition, all the otoscopic images used in the experiment were taken after cleaning the external auditory canal. If the model is applied at home or in a community hospital, it may be affected by cerumen. In further studies, more ear disease images should be collected in order to train a more robust and practical network.
In addition, the current deep learning model can be improved by training with non-image information such as hearing audiometry, tinnitus, ear fullness, duration of history and presence of fever for better diagnosis accuracy.
Conclusion
In this study, the assembled classifier, which combines a main classifier and a focal classifier guided by CAMs, achieves a high accuracy in the diagnosis of OM from endoscopic images based on a relatively small database. This deep learning model is useful in helping junior otolaryngologists and non-otolaryngologists to diagnose ear disease early. Further study will consider more ear diseases together with patients' information such as medical history, hearing thresholds obtained from pure tone audiometry and middle ear function assessed by tympanometry to improve diagnostic accuracy.