Ryan J Crowley (1,2), Yuan Jin Tan (1,3), John P A Ioannidis (1,3,4,5,6)
1. Meta-Research Innovation Center at Stanford, Stanford University, Stanford, California, USA
2. Department of Bioengineering, Stanford School of Engineering, Stanford University, Stanford, California, USA
3. Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, California, USA
4. Stanford Prevention Research Center, Department of Medicine, Stanford Medicine, Stanford University, Stanford, California, USA
5. Department of Biomedical Data Science, Stanford Medicine, Stanford University, Stanford, California, USA
6. Department of Statistics, School of Humanities and Sciences, Stanford University, Stanford, California, USA
Abstract
OBJECTIVE: Machine learning (ML) diagnostic tools have significant potential to improve health care. However, methodological pitfalls may affect the diagnostic test accuracy studies used to appraise such tools. We aimed to evaluate the prevalence and reporting of design characteristics within this literature and to assess empirically whether design features are associated with different estimates of diagnostic accuracy.
MATERIALS AND METHODS: We systematically retrieved 2 × 2 tables (n = 281) describing the performance of ML diagnostic tools, derived from 114 publications in 38 meta-analyses, from PubMed. Extracted data included test performance, sample sizes, and design features. A mixed-effects metaregression was run to quantify the association between design features and diagnostic accuracy.
RESULTS: Participant ethnicity and blinding in test interpretation were unreported in 90% and 60% of studies, respectively. Reporting was occasionally lacking even for rudimentary characteristics such as study design (28% unreported). Internal validation without appropriate safeguards was used in 44% of studies. Several design features were associated with larger estimates of accuracy, including an unreported study design (relative diagnostic odds ratio [RDOR], 2.11; 95% confidence interval [CI], 1.43-3.1), a case-control design (RDOR, 1.27; 95% CI, 0.97-1.66), and recruitment of participants for the index test (RDOR, 1.67; 95% CI, 1.08-2.59).
DISCUSSION: Experimental details were significantly underreported. Study design features may affect estimates of diagnostic performance in the ML diagnostic test accuracy literature.
CONCLUSIONS: This study identifies pitfalls that threaten the validity, generalizability, and clinical value of ML diagnostic tools and provides recommendations for improvement.
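For readers unfamiliar with the effect measures above, the sketch below (not the authors' code) illustrates how a diagnostic odds ratio (DOR) is computed from a 2 × 2 table and how a metaregression coefficient on log(DOR) maps to the relative DOR (RDOR). The example table values, the function name, and the 0.5 continuity correction for zero cells are illustrative assumptions; the correction is a common convention in diagnostic meta-analysis.

```python
import math

def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR = (TP * TN) / (FP * FN).

    If any cell is zero, 0.5 is added to every cell, a common
    continuity correction in diagnostic meta-analysis (assumed here).
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    return (tp * tn) / (fp * fn)

# Hypothetical 2 x 2 table: 90 true positives, 10 false positives,
# 15 false negatives, 85 true negatives.
dor = diagnostic_odds_ratio(90, 10, 15, 85)
print(f"DOR = {dor:.1f}")  # DOR = 51.0

# In a mixed-effects metaregression on log(DOR), a binary design
# covariate with coefficient beta implies RDOR = exp(beta). For
# example, beta = 0.747 gives RDOR ~ 2.11, i.e., studies with that
# feature report DORs about 2.1 times larger, all else equal.
print(f"RDOR for beta = 0.747: {math.exp(0.747):.2f}")
```

An RDOR above 1 therefore means the design feature is associated with inflated accuracy estimates; a 95% CI excluding 1 indicates the association is statistically significant at the 5% level.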