Andrew F Voter (1), Ece Meram (2), John W Garrett (2), John-Paul J Yu (3). (1) School of Medicine and Public Health, University of Wisconsin-Madison, Madison, Wisconsin. (2) Department of Radiology, University of Wisconsin-Madison, Madison, Wisconsin. (3) Department of Radiology, University of Wisconsin-Madison, Madison, Wisconsin; Department of Biomedical Engineering, College of Engineering, University of Wisconsin-Madison, Madison, Wisconsin; Department of Psychiatry, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin. Electronic address: jpyu@uwhealth.org.
Abstract
OBJECTIVE: To determine the institutional diagnostic accuracy of an artificial intelligence (AI) decision support system (DSS), Aidoc, in diagnosing intracranial hemorrhage (ICH) on noncontrast head CTs and to assess the potential generalizability of an AI DSS.
METHODS: This retrospective study included 3,605 consecutive, emergent, adult noncontrast head CT scans performed between July 1, 2019, and December 30, 2019, at our institution (51% female subjects; mean age, 61 ± 21 years). Each scan was evaluated for ICH by both a certificate of added qualification (CAQ)-certified neuroradiologist and Aidoc. We determined the diagnostic accuracy of the AI model and performed a failure mode analysis with quantitative CT radiomic image characterization.
RESULTS: Of the 3,605 scans, 349 cases of ICH (9.7% of studies) were identified. The neuroradiologist and Aidoc interpretations were concordant in 96.9% of cases; the overall sensitivity, specificity, positive predictive value, and negative predictive value were 92.3%, 97.7%, 81.3%, and 99.2%, respectively, with the positive predictive value unexpectedly lower than in previously reported studies. Prior neurosurgery, type of ICH, and number of ICHs were significantly associated with decreased model performance. Quantitative image characterization with CT radiomics failed to reveal significant differences between concordant and discordant studies.
DISCUSSION: This study revealed decreased diagnostic accuracy of an AI DSS at our institution. Despite extensive evaluation, we were unable to identify the source of this discrepancy, raising concerns about the generalizability of these tools given their indeterminate failure modes. These results further highlight the need for standardized study designs that allow rigorous, reproducible site-to-site comparison of emerging deep learning technologies.
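The four accuracy metrics reported in the abstract all derive from a standard 2 × 2 confusion matrix. A minimal sketch of that calculation is below; the counts used are an approximate reconstruction back-calculated from the reported percentages and the 349 ICH-positive scans (out of 3,605), not figures taken from the published data.

```python
# Diagnostic-accuracy metrics from a 2x2 confusion matrix, as percentages.
def diagnostic_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, PPV, NPV) given true/false
    positive and negative counts from a reference-standard comparison."""
    sensitivity = tp / (tp + fn) * 100  # detected fraction of true ICH
    specificity = tn / (tn + fp) * 100  # correctly cleared fraction of non-ICH
    ppv = tp / (tp + fp) * 100          # positive calls that were true ICH
    npv = tn / (tn + fn) * 100          # negative calls that were truly clear
    return sensitivity, specificity, ppv, npv

# Hypothetical counts, chosen to be consistent with the reported
# 92.3% / 97.7% / 81.3% / 99.2% figures (349 positives, 3,256 negatives).
sens, spec, ppv, npv = diagnostic_metrics(tp=322, fp=74, fn=27, tn=3182)
print(f"Sensitivity {sens:.1f}%, Specificity {spec:.1f}%, "
      f"PPV {ppv:.1f}%, NPV {npv:.1f}%")
```

Note that with only 9.7% disease prevalence, even a high-specificity model yields a comparatively modest PPV: the 74 false positives are of the same order as the 322 true positives, which is why PPV (81.3%) lags the other metrics.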