| Literature DB >> 36171272 |
Seung Seog Han1,2, Cristian Navarrete-Dechent3, Konstantinos Liopyris4, Myoung Shin Kim5, Gyeong Hun Park6, Sang Seok Woo7, Juhyun Park8, Jung Won Shin8, Bo Ri Kim8, Min Jae Kim8, Francisca Donoso3, Francisco Villanueva3, Cristian Ramirez3, Sung Eun Chang9, Allan Halpern10, Seong Hwan Kim11, Jung-Im Na12.
Abstract
Model Dermatology ( https://modelderm.com ; Build2021) is a publicly testable neural network that can classify 184 skin disorders. We aimed to investigate whether our algorithm can classify clinical images of an Internet community along with tertiary care center datasets. Consecutive images from an Internet skin cancer community ('RD' dataset, 1,282 images posted between 25 January 2020 to 30 July 2021; https://reddit.com/r/melanoma ) were analyzed retrospectively, along with hospital datasets (Edinburgh dataset, 1,300 images; SNU dataset, 2,101 images; TeleDerm dataset, 340 consecutive images). The algorithm's performance was equivalent to that of dermatologists in the curated clinical datasets (Edinburgh and SNU datasets). However, its performance deteriorated in the RD and TeleDerm datasets because of insufficient image quality and the presence of out-of-distribution disorders, respectively. For the RD dataset, the algorithm's Top-1/3 accuracy (39.2%/67.2%) and AUC (0.800) were equivalent to that of general physicians (36.8%/52.9%). It was more accurate than that of the laypersons using random Internet searches (19.2%/24.4%). The Top-1/3 accuracy was affected by inadequate image quality (adequate = 43.2%/71.3% versus inadequate = 32.9%/60.8%), whereas participant performance did not deteriorate (adequate = 35.8%/52.7% vs. inadequate = 38.4%/53.3%). In this report, the algorithm performance was significantly affected by the change of the intended settings, which implies that AI algorithms at dermatologist-level, in-distribution setting, may not be able to show the same level of performance in with out-of-distribution settings.Entities:
Mesh:
Year: 2022 PMID: 36171272 PMCID: PMC9519737 DOI: 10.1038/s41598-022-20632-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
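The Top-1/Top-3 accuracies quoted in the abstract are standard multi-class metrics: a prediction counts as correct if the ground-truth class appears among the model's k highest-ranked outputs. Below is a minimal sketch of the metric itself; the array names and the random example data are illustrative assumptions, not the authors' code or the Model Dermatology API.

```python
# Minimal sketch of Top-k accuracy, the metric reported in the abstract.
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int = 3) -> float:
    """probs: (n_samples, n_classes) predicted class probabilities;
    labels: (n_samples,) integer ground-truth class indices."""
    # Indices of the k highest-probability classes for each sample
    top_k = np.argsort(probs, axis=1)[:, -k:]
    # A sample is a hit if its true label appears among those k classes
    hits = np.any(top_k == labels[:, None], axis=1)
    return float(hits.mean())

# Hypothetical example with 184 classes, as in Model Dermatology
rng = np.random.default_rng(0)
probs = rng.random((100, 184))
probs /= probs.sum(axis=1, keepdims=True)
labels = rng.integers(0, 184, size=100)
print(top_k_accuracy(probs, labels, k=1), top_k_accuracy(probs, labels, k=3))
```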
Summary of the test datasets.
| | RD | SNU | Edinburgh | TeleDerm |
|---|---|---|---|---|
| Number of cases | 1282 | 2101 | 1300 | 340 |
| Source | Internet community | Tertiary care center | Tertiary care center | Teledermatology |
| Photographer | Patient | Physician | Professional photographer | Patient |
| Fitzpatrick skin type | – | 3–4 | 1–2 | 1–4 |
| Number of disease classes | 62 | 133 | 10 | 87 |
| Inflammatory Dermatitis | 12 (0.9%) | 131 (6.2%) | – | 82 (24.1%) |
| Acne/rosacea | – | 51 (2.4%) | – | 78 (22.9%) |
| Autoimmune | 2 (0.2%) | 90 (4.3%) | – | 34 (10.0%) |
| Papulosquamous | – | 105 (5.0%) | – | 17 (5.0%) |
| Others inflammatory | 30 (2.3%) | 159 (7.6%) | – | 14 (4.1%) |
| Viral infection | 39 (3.0%) | 144 (6.9%) | – | 22 (6.5%) |
| Fungal infection | 7 (0.5%) | 85 (4.0%) | – | 20 (5.9%) |
| Bacterial infection | 3 (0.2%) | 125 (5.9%) | – | 9 (2.6%) |
| Parasitic infection | – | 15 (0.7%) | – | 1 (0.3%) |
| Benign neoplastic | 896 (69.9%)a | 620 (29.5%) | 819 (63.0%) | 28 (8.2%) |
| Malignant neoplastic | 123 (9.6%)a | 182 (8.7%) | 481 (37.0%) | 4 (1.2%) |
| Alopecia, scarring | – | – | – | 8 (2.4%) |
| Alopecia, non-scarring | – | 20 (1.0%) | – | 7 (2.1%) |
| Others | 170 (13.3%) | 374 (17.8%) | – | 16 (4.7%) |
a The ground truth of the RD dataset was determined by a vote of five specialists, whereas malignancies in the other datasets were confirmed by pathological examination.
Figure 1. Binary classification for determining suspected malignancy using the Internet community (RD) dataset. (a) TEST = RD dataset (1,282 images). (b) TEST = RDadequate subset (787 adequate images). (c) TEST = RDinadequate subset (495 inadequate images). Red dot (TH1): the algorithm at the high-sensitivity threshold; blue dot (TH2): the algorithm at the high-specificity threshold; black dots: the six general physicians; green dots: the laypersons (cluster); ×: sensitivity and specificity derived from the participants' Top-3 diagnoses; +: sensitivity and specificity derived from the participants' Top-1 diagnoses. The shaded area indicates the 95% confidence interval.
Figure 2. Binary classification for determining malignancy using the hospital (SNU and Edinburgh) datasets. (a) TEST = Edinburgh dataset (1,300 images). (b) TEST = SNU dataset (2,201 images). (c) TEST = SNU public subset (240 images). Red dot (TH1): the algorithm at the high-sensitivity threshold; blue dot (TH2): the algorithm at the high-specificity threshold; +: the average of dermatologists, residents, and laypersons in the previous study [17]; the mean sensitivity/specificity on the 240 test images was adopted from that study, and the results of the reader study are available at https://doi.org/10.6084/m9.figshare.6454973. The shaded area indicates the 95% confidence interval. The TeleDerm dataset was excluded from this malignancy analysis because it contains only four malignancies.
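Figures 1 and 2 mark two operating points on the algorithm's ROC curve: TH1, a high-sensitivity threshold, and TH2, a high-specificity threshold. The sketch below shows one common way such operating points can be chosen from validation scores; the target values (0.90) and the synthetic scores are assumptions for illustration, not the thresholds or procedure used in the paper.

```python
# Illustrative selection of high-sensitivity (TH1) and high-specificity (TH2)
# operating points on an ROC curve (not the authors' actual procedure).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def pick_operating_points(y_true, malignancy_score,
                          target_sensitivity=0.90, target_specificity=0.90):
    """y_true: 1 for malignant, 0 for benign; malignancy_score: model output."""
    fpr, tpr, thresholds = roc_curve(y_true, malignancy_score)
    # TH1: highest threshold whose sensitivity (TPR) reaches the target
    th1 = thresholds[np.argmax(tpr >= target_sensitivity)]
    # TH2: lowest threshold whose specificity (1 - FPR) still reaches the target
    idx = np.where((1 - fpr) >= target_specificity)[0][-1]
    th2 = thresholds[idx]
    return th1, th2, roc_auc_score(y_true, malignancy_score)

# Hypothetical scores for demonstration only
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
score = 0.3 * y + rng.random(500)
print(pick_operating_points(y, score))
```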
Sensitivity, specificity, positive predictive value, and negative predictive value in the binary-class classification.
| Operating point | Test dataset | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| TH1 (high sensitivity) | RD/1,282 images | 67.5 (58.5–75.6) | 77.0 (74.4–79.3) | 23.6 (20.8–26.6) | 95.7 (94.6–96.7) |
| TH1 (high sensitivity) | RDadequate/787 images | 74.1 (64.7–82.4) | 73.1 (69.9–76.4) | 25.0 (21.8–28.3) | 95.9 (94.5–97.2) |
| TH1 (high sensitivity) | RDinadequate/495 images | 52.6 (36.8–68.4) | 82.7 (79.4–86.2) | 20.2 (14.3–26.3) | 95.5 (93.9–97.0) |
| TH1 (high sensitivity) | SNU/2,201 images | 90.1 (85.1–94.0) | 91.7 (90.4–92.8) | 50.6 (46.9–54.5) | 99.0 (98.5–99.4) |
| TH1 (high sensitivity) | SNU subset/240 images | 85.0 (75.0–95.0) | 94.0 (90.5–97.0) | 74.4 (64.0–85.7) | 96.9 (94.9–99.0) |
| TH1 (high sensitivity) | Edinburgh/1,300 images | 97.7 (96.3–99.0) | 52.0 (48.5–55.6) | 54.5 (52.7–56.4) | 97.5 (95.9–98.8) |
| TH2 (high specificity) | RD/1,282 images | 44.7 (36.6–53.7) | 91.8 (90.2–93.4) | 36.7 (30.7–43.5) | 94.0 (93.2–94.9) |
| TH2 (high specificity) | RDadequate/787 images | 51.8 (41.2–62.4) | 90.6 (88.3–92.7) | 40.0 (32.7–47.8) | 94.0 (92.7–95.2) |
| TH2 (high specificity) | RDinadequate/495 images | 29.0 (15.8–42.1) | 93.7 (91.5–95.8) | 27.3 (15.8–40.5) | 94.0 (93.0–95.2) |
| TH2 (high specificity) | SNU/2,201 images | 80.8 (74.7–86.8) | 95.9 (95.0–96.8) | 65.5 (60.1–70.6) | 98.1 (97.6–98.7) |
| TH2 (high specificity) | SNU subset/240 images | 77.5 (62.5–90.0) | 95.0 (92.0–97.5) | 76.2 (64.6–86.9) | 95.5 (92.8–97.9) |
| TH2 (high specificity) | Edinburgh/1,300 images | 90.6 (87.9–93.1) | 77.3 (74.6–80.2) | 70.1 (67.7–72.9) | 93.4 (91.6–95.0) |
Values are percentages with 95% confidence intervals. PPV positive predictive value, NPV negative predictive value, TH1 high-sensitivity threshold, TH2 high-specificity threshold (see Figs. 1 and 2).
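Each row of the table reports the four standard metrics derived from a 2×2 confusion matrix. The sketch below computes them from confusion-matrix counts; the counts shown are hypothetical, and the Wilson interval is only one possible confidence-interval choice, which may not match the method used by the authors.

```python
# Sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix,
# with Wilson 95% CIs; the example counts are hypothetical, not from the paper.
from statsmodels.stats.proportion import proportion_confint

def binary_metrics(tp, fp, fn, tn, alpha=0.05):
    """tp/fp/fn/tn: confusion-matrix counts for the 'malignant' class."""
    pairs = {
        "sensitivity": (tp, tp + fn),  # true positives among all malignant cases
        "specificity": (tn, tn + fp),  # true negatives among all benign cases
        "PPV": (tp, tp + fp),          # precision of a 'malignant' call
        "NPV": (tn, tn + fn),          # reliability of a 'benign' call
    }
    out = {}
    for name, (num, den) in pairs.items():
        low, high = proportion_confint(num, den, alpha=alpha, method="wilson")
        out[name] = (100 * num / den, 100 * low, 100 * high)  # as percentages
    return out

print(binary_metrics(tp=83, fp=270, fn=40, tn=889))  # hypothetical counts
```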
Figure 3. Flowchart of the RD dataset.