Roxana Daneshjou (1,2), Mary P Smith (3), Mary D Sun (4), Veronica Rotemberg (5), James Zou (6,7,8)
1. Stanford Department of Dermatology, Stanford School of Medicine, Redwood City, California.
2. Stanford Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California.
3. Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York.
4. Currently a medical student at Icahn School of Medicine at Mount Sinai, New York, New York.
5. Dermatology Service, Memorial Sloan Kettering Cancer Center, New York, New York.
6. Department of Electrical Engineering, Stanford University, Stanford, California.
7. Department of Biomedical Data Science, Stanford University, Stanford, California.
8. Chan Zuckerberg Biohub, San Francisco, California.
Abstract
IMPORTANCE: Clinical artificial intelligence (AI) algorithms have the potential to improve clinical care, but fair, generalizable algorithms depend on the clinical data on which they are trained and tested.
OBJECTIVE: To assess whether data sets used for training diagnostic AI algorithms addressing skin disease are adequately described and to identify potential sources of bias in these data sets.
DATA SOURCES: In this scoping review, PubMed was used to search for peer-reviewed research articles published between January 1, 2015, and November 1, 2020, with the following paired search terms: deep learning and dermatology, artificial intelligence and dermatology, deep learning and dermatologist, and artificial intelligence and dermatologist.
STUDY SELECTION: Studies that developed or tested an existing deep learning algorithm for triage, diagnosis, or monitoring using clinical or dermoscopic images of skin disease were selected, and the articles were independently reviewed by 2 investigators to verify that they met selection criteria.
CONSENSUS PROCESS: Data set audit criteria were determined by consensus of all authors after reviewing existing literature to highlight data set transparency and sources of bias.
RESULTS: A total of 70 unique studies were included. Among these studies, 1 065 291 images were used to develop or test AI algorithms, of which only 257 372 (24.2%) were publicly available. Only 14 studies (20.0%) included descriptions of patient ethnicity or race in at least 1 data set used. Only 7 studies (10.0%) included any information about skin tone in at least 1 data set used. Thirty-six of the 56 studies developing new AI algorithms for cutaneous malignant neoplasms (64.3%) met the gold standard criteria for disease labeling. Public data sets were cited more often than private data sets, suggesting that public data sets contribute more to new development and benchmarks.
CONCLUSIONS AND RELEVANCE: This scoping review identified 3 issues in data sets that are used to develop and test clinical AI algorithms for skin disease that should be addressed before clinical translation: (1) sparsity of data set characterization and lack of transparency, (2) nonstandard and unverified disease labels, and (3) inability to fully assess patient diversity used for algorithm development and testing.
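The proportions reported in the RESULTS can be rechecked from the counts given in the abstract. The following is a minimal sketch, not part of the original study: the dictionary labels and the `percent` helper are illustrative names, and only the numerators and denominators come from the abstract.

```python
# Counts as reported in the abstract: (numerator, denominator).
# The labels and the helper below are illustrative, not from the study.
counts = {
    "publicly available images": (257_372, 1_065_291),
    "studies describing ethnicity or race": (14, 70),
    "studies describing skin tone": (7, 70),
    "malignancy studies meeting gold standard labeling": (36, 56),
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Recompute each reported proportion from its raw counts.
percentages = {label: percent(p, w) for label, (p, w) in counts.items()}

for label, value in percentages.items():
    print(f"{label}: {value}%")
```

Each recomputed value can then be compared against the percentage printed next to the corresponding count in the RESULTS.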