Thomas Kau1,2, Mindaugas Ziurlys3, Manuel Taschwer3, Anita Kloss-Brandstätter4, Günther Grabner5, Hannes Deutschmann6. 1. Department of Radiology, Landeskrankenhaus Villach, Nikolaigasse 43, 9500, Villach, Austria. thomas.kau@kabeg.at. 2. Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 9, 8036, Graz, Austria. thomas.kau@kabeg.at. 3. Department of Radiology, Landeskrankenhaus Villach, Nikolaigasse 43, 9500, Villach, Austria. 4. Carinthia University of Applied Sciences, Europastrasse 4, 9500, Villach, Austria. 5. Department of Medical Engineering, Carinthia University of Applied Sciences, Primoschgasse 8, 9020, Klagenfurt, Austria. 6. Division of Neuroradiology, Vascular and Interventional Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 9, 8036, Graz, Austria.
Abstract
PURPOSE: To assess an FDA-approved and CE-certified deep learning (DL) software application compared with the performance of human radiologists in detecting intracranial hemorrhages (ICH).
METHODS: Within a 20-week trial from January to May 2020, 2210 adult non-contrast head CT scans were performed in a single center and automatically analyzed by an artificial intelligence (AI) solution with workflow integration. After excluding 22 scans due to severe motion artifacts, images were retrospectively assessed for the presence of ICHs by a second-year resident and a certified radiologist under simulated time pressure. Disagreements were resolved by a subspecialized neuroradiologist serving as the reference standard. We calculated interrater agreement and diagnostic performance parameters, including the Breslow-Day and Cochran-Mantel-Haenszel tests.
RESULTS: An ICH was present in 214 out of 2188 scans. Interrater agreement between the resident and the certified radiologist was very high (κ = 0.89) and even higher between the resident and the reference standard (κ = 0.93). The software delivered 64 false-positive and 68 false-negative results, yielding an overall sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of 68.2%, 96.8%, 69.5%, 96.6%, and 94.0%, respectively. Corresponding values for the resident were 94.9%, 99.2%, 93.1%, 99.4%, and 98.8%. The accuracy of the DL application was inferior (p < 0.001) to that of both the resident and the certified radiologist.
CONCLUSION: A resident under time pressure outperformed an FDA-approved DL program in detecting ICH in CT scans. Our results underline the importance of thoughtful workflow integration and post-approval validation of AI applications in various clinical environments.
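The interrater agreement reported above is Cohen's kappa, which corrects observed agreement for chance agreement derived from each reader's marginal rates. A minimal sketch of the computation follows; the 2×2 agreement counts are hypothetical (the abstract reports only the resulting kappa values, not the underlying tables), chosen merely to show the mechanics at the study's sample size of 2188 scans.

```python
# Cohen's kappa for two readers' binary ICH calls on the same scans.
# The counts passed below are HYPOTHETICAL illustration values, not
# the study's actual agreement table.
def cohens_kappa(a: int, b: int, c: int, d: int) -> float:
    """a: both positive, b: A+/B-, c: A-/B+, d: both negative."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # chance agreement from the two readers' marginal positive/negative rates
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa(200, 10, 14, 1964)  # hypothetical 2x2 counts, n = 2188
print(f"kappa = {kappa:.2f}")
```

High raw agreement alone can be misleading when one class (here, no hemorrhage) dominates; kappa discounts the agreement expected by chance from that imbalance.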
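The software's performance figures in the RESULTS section follow directly from the counts the abstract reports (214 ICH-positive of 2188 scans; 64 false positives and 68 false negatives). A short sketch reconstructing the confusion matrix and recomputing the five metrics:

```python
# Reconstruct the DL software's confusion matrix from the counts stated
# in the abstract and recompute the reported performance metrics.
positives = 214   # scans with ICH (reference standard)
total = 2188      # scans analyzed after exclusions
fp = 64           # false positives reported for the software
fn = 68           # false negatives reported for the software

tp = positives - fn          # true positives: 146
tn = total - positives - fp  # true negatives: 1910

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
accuracy = (tp + tn) / total

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}, "
      f"PPV {ppv:.1%}, NPV {npv:.1%}, accuracy {accuracy:.1%}")
```

Each value rounds to the figure given in the abstract (68.2%, 96.8%, 69.5%, 96.6%, 94.0%), confirming the internal consistency of the reported counts.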