| Literature DB >> 33764307 |
Roman C Maron1, Achim Hekler1, Eva Krieghoff-Henning1, Max Schmitt1, Justin G Schlager2, Jochen S Utikal3,4, Titus J Brinker1.
Abstract
BACKGROUND: Studies have shown that artificial intelligence achieves similar or better performance than dermatologists in specific dermoscopic image classification tasks. However, artificial intelligence is susceptible to the influence of confounding factors within images (eg, skin markings), which can lead to false diagnoses of cancerous skin lesions. Image segmentation can remove lesion-adjacent confounding factors but greatly change the image representation.Entities:
Keywords: artifacts; artificial intelligence; confounding factors; deep learning; dermatology; diagnosis; image segmentation; melanoma; neural networks; nevus
Year: 2021 PMID: 33764307 PMCID: PMC8074854 DOI: 10.2196/21695
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1Typical artifacts encountered in dermoscopic image databases. Panels A-D show an exemplary range of artifacts often found in dermoscopic images, which are (left to right) color charts and hair, text, ruler markings, and marker ink. Panels E-H show how a corresponding segmented image could look like, with the surrounding artifacts removed but artifacts within the lesion (Panel E) still visible.
Figure 2Flowchart of the study design. A training data set consisting of images from 2 different sources was either segmented or not segmented and split into 2 smaller partitions based on image origin (HAM or ISIC). An individual classifier was then trained on each of the 4 training sets and evaluated on a multi-source test set, which underwent a preprocessing step that equaled the training data preprocessing. Training and evaluation were repeated a total of 5 times for a more robust measure. HAM: human against machine data set; ISIC: international skin imaging collaboration data set; PH2: hospital Pedro Hispano data set; PROP: proprietary data set.
Overview of the balanced accuracy and area under the receiver operating characteristic curve for each type of classifier across the holdout, external, and overall test set.
| Test set components, metric | Trained classifiers | ||||||
| HAMa segmented (%) | HAM unsegmented (%) | ISICb segmented (%) | ISIC unsegmented (%) | ||||
|
| |||||||
|
| Balanced accuracy, mean (SD) | 87.6 (1.4) |
| 77.1 (1.5) |
| ||
|
| AUROCd, mean (SD) | 0.95 (0.006) |
| 0.839 (0.008) |
| ||
|
| |||||||
|
| Balanced accuracy, mean (SD) |
| 57.6 (4.1) |
| 77.6 (1.7) | ||
|
| AUROC, mean (SD) |
| 0.647 (0.025) |
| 0.851 (0.018) | ||
|
| |||||||
|
| Balanced accuracy, mean (SD) |
| 66.7 (3.2) | 77.4 (1.5) |
| ||
|
| AUROC, mean (SD) |
| 0.763 (0.02) | 0.856 (0.005) |
| ||
aHAM: human against machine data set.
bISIC: International Skin Imaging Collaboration data set.
cThe italicized data indicate the higher metric when comparing between classifiers trained on a segmented/unsegmented version of the same data set.
dAUROC: area under the receiver operating characteristic curve.
Overview of the balanced accuracy and area under the receiver operating characteristic curve for each type of classifier across the external test set’s 3 individual components.
| External test set components, metric | Trained classifiers | ||||||||
| HAMa segmented (%) | HAM unsegmented (%) | ISICb segmented (%) | ISIC unsegmented (%) | ||||||
|
| |||||||||
|
| Balanced accuracy, mean (SD) |
| 58.9 (3.1) | 74.1 (3.6) |
| ||||
|
| AUROCe, mean (SD) | 0.628 (0.005) |
| 0.827 (0.019) |
| ||||
|
| |||||||||
|
| Balanced accuracy, mean (SD) |
| 63.2 (7.1) |
| 83.7 (0.8) | ||||
|
| AUROC, mean (SD) |
| 0.894 (0.021) |
| 0.912 (0.018) | ||||
|
| |||||||||
|
| Balanced accuracy, mean (SD) | 71.1 (1.8) |
| 68.7 (1.6) |
| ||||
|
| AUROC, mean (SD) | 0.825 (0.033) |
|
| 0.814 (0.015) | ||||
aHAM: human against machine data set.
bISIC: International Skin Imaging Collaboration data set.
cIf classifiers were trained on HAM images, the first external test set component consists of ISIC and vice versa.
dThe italicized data indicate the higher metric when comparing between classifiers trained on a segmented/unsegmented version of the same data set.
eAUROC: area under the receiver operating characteristic curve.
fPH2: hospital Pedro Hispano data set.
gPROP: proprietary data set.
Figure 3Exemplary predictions of a classifier trained on unsegmented HAM (left) and ISIC (right) images and evaluated on unsegmented and cropped PH2 images. The target class (ground truth) for each lesion is displayed to the left, with the classifier’s output probability for the target class on top. An output probability larger than 50% corresponds to a correct classification, which is also indicated by a blue frame, whereas an orange frame denotes an incorrect classification. HAM: human against machine data set; ISIC: international skin imaging collaboration data set; PH2: hospital Pedro Hispano data set.
Figure 4Exemplary predictions of a classifier trained on unsegmented HAM (left) and ISIC (right) images and evaluated on PH2 images with different segmentation masks. PH2 images in the SM column were segmented using the segmentation model. PH2 images in the GT column were segmented using dermatologist-generated ground truth segmentation masks. The target class (ground truth) for each lesion is displayed to the left, with the classifier’s output probability for the target class on top. An output probability larger than 50% corresponds to a correct classification, which is also indicated by a blue frame, whereas an orange frame denotes an incorrect classification. GT: ground truth; HAM: human against machine data set; ISIC: international skin imaging collaboration data set; PH2: hospital Pedro Hispano data set; SM: segmentation model.