Marc Combalia, Noel Codella, Veronica Rotemberg, Cristina Carrera, Stephen Dusza, David Gutman, Brian Helba, Harald Kittler, Nicholas R Kurtansky, Konstantinos Liopyris, Michael A Marchetti, Sebastian Podlipnik, Susana Puig, Christoph Rinner, Philipp Tschandl, Jochen Weber, Allan Halpern, Josep Malvehy.
Abstract
BACKGROUND: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presented with images of disease categories that are not included in the training dataset or images drawn from statistical distributions with significant shifts from training distributions. We aimed to simulate these real-world scenarios and evaluate the effects of image source institution, diagnoses outside of the training set, and other image artifacts on classification accuracy, with the goal of informing clinicians and regulatory agencies about safety and real-world accuracy.
Year: 2022 PMID: 35461690 PMCID: PMC9295694 DOI: 10.1016/S2589-7500(22)00021-8
Source DB: PubMed Journal: Lancet Digit Health ISSN: 2589-7500
Figure 1: Algorithm accuracy across all submissions, by dataset, metadata use, and diagnostic class
(A) Boxplot and table showing median (IQR) for balanced accuracy across all participant submissions for each test set partition (p<0·001 for all comparisons).
(B) Boxplot of diagnosis-specific balanced accuracies for each diagnostic class.
(C) Comparison of balanced accuracy over all submissions with and without clinical metadata. AK=actinic keratosis. BCC=basal cell carcinoma. BCN=Hospital Clinic Barcelona. BKL=benign keratosis. DF=dermatofibroma. HAM=Medical University of Vienna. MEL=melanoma. NT=not trained. NV=nevi. SCC=squamous cell carcinoma. VASC=vascular lesions.
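Balanced accuracy, the headline metric in Figure 1, is the unweighted mean of per-class recalls, so rare diagnoses count as much as common ones. A minimal sketch of that computation (the labels below are toy values for illustration, not ISIC data):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes):
    """Unweighted mean of per-class recalls (multi-class balanced accuracy)."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.sum() == 0:
            continue  # class absent from this test partition
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))

# Toy example with 3 classes: recalls are 2/3, 1.0, and 0.0
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
print(balanced_accuracy(y_true, y_pred, 3))  # mean of [2/3, 1, 0] = 5/9
```

Because each class contributes equally to the mean, a classifier that ignores a rare class (such as dermatofibroma) is penalised as heavily as one that ignores nevi.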
Figure 2: Confusion matrix, separated into nine groups for each diagnostic category in the test set
Values represent the proportion of test-set images assigned the classification specified by each column, averaged over the top 25 algorithms. The reference row of each group shows the aggregate values for each diagnosis. Subsequent rows include stratifications across artifacts (ie, crust, hair, pen marks), anatomical site, and source institution. Upper extremity refers to arms and hands (not palms). Lower extremity refers to legs and feet (not soles). AK=actinic keratosis. BCC=basal cell carcinoma. BCN=Hospital Clinic Barcelona. BKL=benign keratosis. DF=dermatofibroma. HAM=Medical University of Vienna. MEL=melanoma. NT=not trained. NV=nevi. SCC=squamous cell carcinoma. VASC=vascular lesion.
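The row-normalised proportions described in the caption can be sketched as follows (an illustrative helper, not the study's evaluation code): each row of the matrix is a true diagnosis, and dividing by the row total turns raw counts into the per-diagnosis proportions shown in the figure.

```python
import numpy as np

def row_normalised_confusion(y_true, y_pred, n_classes):
    # Count co-occurrences: rows = true diagnosis, columns = predicted class.
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # Normalise each row to proportions; rows with no images stay all-zero.
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```

Stratified rows (eg, only images with pen marks, or only images from one institution) are obtained by applying the same computation to the corresponding subset of `y_true` and `y_pred`.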
Figure 3: Confusion matrix of the diagnoses comprising the NT category
The confusion matrix shows which of the trained categories each NT diagnosis was confused for, measured across the top 25 algorithms. AK=actinic keratosis. BCC=basal cell carcinoma. BKL=benign keratosis. DF=dermatofibroma. MEL=melanoma. NT=not trained. NV=nevi. SCC=squamous cell carcinoma. VASC=vascular lesion.
Goal distribution of diagnoses included in a set of 30 images in the reader study
| Diagnosis | Goal number |
|---|---|
| Actinic keratosis | 1 |
| Basal cell carcinoma | 6 |
| Benign keratosis | 3 |
| Dermatofibroma | 1 |
| Melanoma | 1 |
| Not trained | 5 |
| Nevi | 8 |
| Squamous cell carcinoma | 1 |
| Vascular lesion | 1 |
Summary of reader accuracy versus that of automated classifiers
| Diagnosis | Readers | All algorithms | Top 3 algorithms |
|---|---|---|---|
| AK | 0·43 (0·23–0·63) | 0·44 (0·42–0·46) | 0·83 (0·77–0·89) |
| BCC | 0·70 (0·61–0·79) | 0·80 (0·77–0·82) | 0·91 (0·88–0·95) |
| BKL | 0·48 (0·36–0·60) | 0·37 (0·35–0·39) | 0·43 (0·37–0·50) |
| DF | 0·50 (0·30–0·71) | 0·33 (0·30–0·36) | 0·73 (0·50–0·95) |
| MEL | 0·62 (0·53–0·71) | 0·58 (0·56–0·60) | 0·70 (0·64–0·77) |
| NV | 0·56 (0·46–0·66) | 0·76 (0·74–0·79) | 0·76 (0·74–0·77) |
| NT | 0·26 (0·17–0·35) | 0·06 (0·05–0·08) | 0·01 (0·01–0·02) |
| SCC | 0·65 (0·46–0·83) | 0·31 (0·29–0·33) | 0·62 (0·55–0·69) |
| VASC | 0·83 (0·68–0·97) | 0·46 (0·43–0·49) | 0·79 (0·66–0·92) |
Data are accuracy, presented as mean (95% CI). Mean count of correct reader classifications in batches of 30 lesions was 15·7 (95% CI 14·46–16·94). Mean count of correct classifications by the best algorithm in batches of 30 lesions was 18·95 (18·20–19·70). AK=actinic keratosis. BCC=basal cell carcinoma. BKL=benign keratosis. DF=dermatofibroma. MEL=melanoma. NT=not trained. NV=nevi. SCC=squamous cell carcinoma. VASC=vascular lesion.
Top three algorithms (average) performed >20% better than readers.
Readers performed ≥20% better than algorithms.
Figure 4: Receiver operating characteristic curves for the expert readers on grouped malignant diagnoses (A) and NT class (B) as compared with the top three algorithms
Crosses represent the average sensitivity and specificity of the readers, with the length of the bars corresponding to the 95% CI. AI=artificial intelligence. NT=not trained. SROC=summary receiver operating characteristic curve.
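Each reader contributes a single (sensitivity, specificity) point to the ROC plot, whereas an algorithm's continuous malignancy score traces a full curve as the decision threshold is swept. A minimal sketch of one operating point (toy scores and a hypothetical helper name, not the study's code):

```python
import numpy as np

def operating_point(scores, labels, threshold):
    """Sensitivity and specificity at one threshold for a binary
    malignant-vs-benign score (label 1 = malignant). Sweeping the
    threshold over all score values traces the ROC curve."""
    pred = scores >= threshold
    labels = labels.astype(bool)
    sensitivity = (pred & labels).sum() / labels.sum()
    specificity = (~pred & ~labels).sum() / (~labels).sum()
    return sensitivity, specificity

scores = np.array([0.9, 0.8, 0.3, 0.2])
labels = np.array([1, 0, 1, 0])
print(operating_point(scores, labels, 0.5))  # (0.5, 0.5)
```

Lowering the threshold raises sensitivity at the cost of specificity, which is why a reader's cross can sit above or below an algorithm's curve at different operating points.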