Achim Hekler, Jakob N Kather, Eva Krieghoff-Henning, Jochen S Utikal, Friedegund Meier, Frank F Gellrich, Julius Upmeier Zu Belzen, Lars French, Justin G Schlager, Kamran Ghoreschi, Tabea Wilhelm, Heinz Kutzner, Carola Berking, Markus V Heppt, Sebastian Haferkamp, Wiebke Sondermann, Dirk Schadendorf, Bastian Schilling, Benjamin Izar, Roman Maron, Max Schmitt, Stefan Fröhling, Daniel B Lipka, Titus J Brinker.
Abstract
Recent studies have shown that deep learning is capable of classifying dermatoscopic images at least as well as dermatologists. However, many studies in skin cancer classification utilize non-biopsy-verified training images. This imperfect ground truth introduces a systematic error, but its effects on classifier performance are currently unknown. Here, we systematically examine the effects of label noise by training and evaluating convolutional neural networks (CNNs) with 804 images of melanoma and nevi labeled either by dermatologists or by biopsy. The CNNs are evaluated on a test set of 384 images by means of 4-fold cross-validation, comparing the outputs with either the corresponding dermatological or the biopsy-verified diagnosis. When training and test labels share the same ground truth, the CNNs achieve accuracies of 75.03% (95% CI: 74.39-75.66%) for dermatological and 73.80% (95% CI: 73.10-74.51%) for biopsy-verified labels. However, if the CNN is trained and tested with different ground truths, accuracy drops significantly to 64.53% (95% CI: 63.12-65.94%, p < 0.01) on a non-biopsy-verified and to 64.24% (95% CI: 62.66-65.83%, p < 0.01) on a biopsy-verified test set. In conclusion, deep learning methods for skin cancer classification are highly sensitive to label noise, and future work should use biopsy-verified training images to mitigate this problem.
Keywords: artificial intelligence; dermatology; label noise; melanoma; nevi; skin cancer
Year: 2020 PMID: 32435646 PMCID: PMC7218064 DOI: 10.3389/fmed.2020.00177
Source DB: PubMed Journal: Front Med (Lausanne) ISSN: 2296-858X
Statistical evaluation of the primary study endpoint (accuracy) and the secondary study endpoints (sensitivity and specificity) for the four scenarios: training with the majority-decision ground truth (MD)/testing with MD, training with MD/testing with the biopsy ground truth (BIO), training with BIO/testing with MD, and training with BIO/testing with BIO.
| Endpoint | Train MD / test MD | Train MD / test BIO | Train BIO / test MD | Train BIO / test BIO |
|---|---|---|---|---|
| Mean accuracy | 75.03% | 64.24% | 64.53% | 73.80% |
| 95% CI accuracy | 74.39–75.66% | 62.66–65.83% | 63.12–65.94% | 73.10–74.51% |
| Mean sensitivity | 76.76% | 69.65% | 64.31% | 75.98% |
| 95% CI sensitivity | 75.36–78.15% | 67.92–71.37% | 62.74–65.88% | 74.69–77.26% |
| Mean specificity | 73.00% | 59.05% | 64.79% | 71.85% |
| 95% CI specificity | 71.10–74.90% | 56.56–61.54% | 63.20–66.38% | 71.08–72.61% |
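The endpoints in the table above can be illustrated with a minimal Python sketch. It computes accuracy, sensitivity, and specificity for binary labels (melanoma coded as 1, nevus as 0) and summarizes per-run values as a mean with a 95% confidence interval. The record does not state how the intervals were derived, so the Student's t interval over 10 runs used here is an assumption, and the function names and all example numbers are hypothetical.

```python
import math
import statistics


def confusion_metrics(predictions, reference):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumed coding: 1 = melanoma (positive class), 0 = nevus.
    """
    pairs = list(zip(predictions, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # fraction of melanomas detected
    specificity = tn / (tn + fp)  # fraction of nevi correctly rejected
    return accuracy, sensitivity, specificity


def mean_with_ci(values, t_crit=2.262):
    """Return (mean, lower, upper) of a two-sided 95% interval.

    Assumption: a Student's t interval; t_crit=2.262 is the critical
    value for 9 degrees of freedom, i.e. 10 simulation runs.
    """
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


# Hypothetical data: one set of predictions scored against a reference labeling.
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
md_labels = [1, 1, 0, 0, 1, 0, 0, 1]
acc, sens, spec = confusion_metrics(predictions, md_labels)

# Hypothetical per-run accuracies from 10 simulation runs.
run_accuracies = [0.74, 0.75, 0.76, 0.75, 0.74, 0.76, 0.75, 0.75, 0.74, 0.76]
mean_acc, ci_low, ci_high = mean_with_ci(run_accuracies)
```

Scoring the same predictions once against MD labels and once against BIO labels, as in the mixed scenarios, only requires calling `confusion_metrics` with a different `reference` list.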
Figure 1. Boxplot of the achieved accuracies, sensitivities, and specificities over 10 simulation runs. A total of 804 biopsy-verified images of nevi and melanoma are labeled both by the majority decision of several dermatologists (MD) and by a biopsy-verified ground truth (BIO). All combinations of these two ground truths are applied to the training and test sets, so that a total of four scenarios can be distinguished (MD/MD, BIO/MD, BIO/BIO, MD/BIO). 4-fold cross-validation is applied on a test set of 384 images to evaluate the performance of the algorithm.
Figure 2. Receiver operating characteristic (ROC) curves of all four scenarios. To calculate the ROC curves, the outputs for each image are averaged over the 10 simulation runs.
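The averaging step described in the Figure 2 caption can be sketched in Python as follows: per-image outputs are averaged across runs, then a ROC curve is traced by sweeping a threshold over the averaged scores. This is an illustrative sketch only. The function names and example scores are hypothetical, and the threshold sweep with trapezoidal area is a standard construction rather than a method confirmed by the record.

```python
def average_outputs(runs):
    """Average per-image scores over runs.

    runs: list of score lists, one per simulation run, aligned by image.
    """
    n_runs = len(runs)
    return [sum(scores) / n_runs for scores in zip(*runs)]


def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    An image is called positive when its score is >= the threshold;
    thresholds are the distinct scores, swept from highest to lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical outputs of two runs for four test images (1 = melanoma).
runs = [[0.9, 0.2, 0.6, 0.4],
        [0.7, 0.4, 0.8, 0.2]]
labels = [1, 0, 1, 0]

mean_scores = average_outputs(runs)
curve = roc_points(mean_scores, labels)
area = auc(curve)
```

Changing `labels` from one ground truth to the other while keeping `mean_scores` fixed gives the curves for the two test conditions of a single trained model.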
Figure 3. On the left, the 4 test images are shown for which the outputs of the CNN trained with dermatological labels and those of the CNN trained with biopsy-verified labels differ the most. The upper part of each image gives the majority decision of the dermatologists (MD) and the result of the biopsy (BIO). On the right, boxplots of the corresponding outputs of the two CNNs over the 10 simulation runs are presented.
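One way to select the images shown in Figure 3 is to rank test images by how far apart the two CNNs' outputs are. The caption does not say how "differ the most" was measured, so the sketch below assumes the absolute difference between the mean outputs over all runs. The function name, image identifiers, and scores are hypothetical.

```python
def most_divergent_images(outputs_md, outputs_bio, k=4):
    """Return the k image ids with the largest gap between the two CNNs.

    outputs_md / outputs_bio: dict mapping image id -> list of outputs over
    runs, for the CNN trained with MD labels and with BIO labels.
    Assumption: the gap is the absolute difference of the mean outputs.
    """
    gaps = {}
    for image_id in outputs_md:
        mean_md = sum(outputs_md[image_id]) / len(outputs_md[image_id])
        mean_bio = sum(outputs_bio[image_id]) / len(outputs_bio[image_id])
        gaps[image_id] = abs(mean_md - mean_bio)
    return sorted(gaps, key=gaps.get, reverse=True)[:k]


# Hypothetical outputs over two runs for three test images.
outputs_md = {"img_a": [0.9, 0.8], "img_b": [0.5, 0.5], "img_c": [0.2, 0.3]}
outputs_bio = {"img_a": [0.1, 0.2], "img_b": [0.5, 0.4], "img_c": [0.7, 0.6]}

top_two = most_divergent_images(outputs_md, outputs_bio, k=2)
```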