Roxana Daneshjou1,2, Kailas Vodrahalli3, Roberto A Novoa1,4, Melissa Jenkins1, Weixin Liang5, Veronica Rotemberg6, Justin Ko1, Susan M Swetter1, Elizabeth E Bailey1, Olivier Gevaert2, Pritam Mukherjee2, Michelle Phung1, Kiana Yekrang1, Bradley Fong1, Rachna Sahasrabudhe1, Johan A C Allerup1, Utako Okata-Karigane7, James Zou2,3,5,8, Albert S Chiou1.
Abstract
An estimated 3 billion people lack access to dermatological care globally. Artificial intelligence (AI) may aid in triaging skin diseases and identifying malignancies. However, most AI models have not been assessed on images of diverse skin tones or uncommon diseases. Thus, we created the Diverse Dermatology Images (DDI) dataset, the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. We show that state-of-the-art dermatology AI models exhibit substantial limitations on the DDI dataset, particularly on dark skin tones and uncommon diseases. We find that dermatologists, who often label AI datasets, also perform worse on images of dark skin tones and uncommon diseases. Fine-tuning AI models on the DDI images closes the performance gap between light and dark skin tones. These findings identify important weaknesses and biases in dermatology AI that should be addressed for reliable application to diverse patients and diseases.
Year: 2022 PMID: 35960806 PMCID: PMC9374341 DOI: 10.1126/sciadv.abq6147
Source DB: PubMed Journal: Sci Adv ISSN: 2375-2548 Impact factor: 14.957
Fig. 1. DDI dataset and algorithm performance.
Row 1: Performance of all three AI models and the majority vote of an ensemble of dermatologists on the entire DDI dataset (A), FST I–II (B), and FST V–VI (C). Row 2: Performance across the DDI common diseases dataset with the performance of all algorithms and ensemble of dermatologists on the entire DDI common diseases dataset (D), FST I–II (E), and FST V–VI (F). Row 3: Example images from the entire DDI dataset for all skin tones (G), FST I–II (H), and FST V–VI (I). Photo Credit: DDI dataset, Stanford School of Medicine.
Fig. 2. Algorithm performance after fine-tuning.
Performance of DeepDerm (A) and HAM10000 (B) after fine-tuning on the DDI dataset (as described in Materials and Methods), compared to baseline (first three bars in each panel). Fine-tuning closes the gap between FST I–II and FST V–VI performance and improves overall performance. Ninety-five percent confidence intervals are calculated by bootstrapping across the 20 seeds for both baseline and fine-tuned models to allow direct comparison.
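The bootstrapped confidence intervals described in the Fig. 2 caption can be sketched as below. This is a minimal illustration only: the function name `bootstrap_ci`, the percentile-bootstrap approach, the use of ROC-AUC as the metric, and the per-seed values are all assumptions for demonstration, not the paper's exact pipeline or data.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=None):
    """Percentile bootstrap CI for the mean of per-seed performance scores.

    Resamples the per-seed scores with replacement n_boot times, averages
    each resample, and takes the (alpha/2, 1 - alpha/2) quantiles of the
    resulting distribution of means.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Each row is one bootstrap resample of the 20 per-seed scores.
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    means = resamples.mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Hypothetical ROC-AUC values from 20 training seeds (illustrative only).
seed_aucs = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.74, 0.71, 0.70, 0.72,
             0.69, 0.73, 0.71, 0.70, 0.72, 0.68, 0.74, 0.70, 0.71, 0.72]
mean_auc, (lo, hi) = bootstrap_ci(seed_aucs, seed=0)
```

Bootstrapping over seeds (rather than reporting a single run) captures the variability introduced by random initialization and training, which is why the caption applies it to both baseline and fine-tuned models for a direct comparison.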