David W G Langerhuizen (1), Anne Eva J Bulstra (2), Stein J Janssen (1), David Ring (3), Gino M M J Kerkhoffs (1), Ruurd L Jaarsma (2), Job N Doornberg (2)
1. D. W. G. Langerhuizen, S. J. Janssen, G. M. M. J. Kerkhoffs, Department of Orthopaedic Surgery, Amsterdam Movement Sciences (AMS), Amsterdam University Medical Centre, Amsterdam, The Netherlands.
2. A. E. J. Bulstra, R. L. Jaarsma, J. N. Doornberg, Flinders University, Department of Orthopaedic & Trauma Surgery, Flinders Medical Centre, Adelaide, Australia.
3. D. Ring, Department of Surgery and Perioperative Care, Dell Medical School, The University of Texas at Austin, Austin, TX, USA.
Abstract
BACKGROUND: Preliminary experience suggests that deep learning algorithms are nearly as good as humans at detecting common, displaced, and relatively obvious fractures (such as distal radius or hip fractures). However, it is not known whether this is also true for subtle or relatively nondisplaced fractures that are often difficult to see on radiographs, such as scaphoid fractures.
QUESTIONS/PURPOSES: (1) What is the diagnostic accuracy, sensitivity, and specificity of a deep learning algorithm in detecting radiographically visible and occult scaphoid fractures using four radiographic imaging views? (2) Does adding patient demographic information (age and sex) improve the diagnostic performance of the deep learning algorithm? (3) Do orthopaedic surgeons achieve better diagnostic accuracy, sensitivity, and specificity than the deep learning algorithm? (4) What is the interobserver reliability among five human observers and between human consensus and the deep learning algorithm?
METHODS: We retrospectively searched the picture archiving and communication system (PACS) to identify 300 patients with a radiographic scaphoid series, continuing until we had 150 fractures (127 visible on radiographs and 23 visible only on MRI) and 150 non-fractures, each with a corresponding CT or MRI as the reference standard for fracture diagnosis. At our institution, MRI is usually ordered for patients with scaphoid tenderness and normal radiographs, and CT for patients with a radiographically visible scaphoid fracture. We used a deep learning algorithm (a convolutional neural network [CNN]) for automated fracture detection on radiographs. Deep learning, an advanced subset of artificial intelligence, stacks layers of artificial neurons that are loosely modeled on nerve cells. CNNs, which are essentially deep learning algorithms resembling interconnected neurons in the human brain, are most commonly used for image analysis. Area under the receiver operating characteristic curve (AUC) was used to evaluate the algorithm's diagnostic performance. An AUC of 1.0 would indicate perfect prediction, whereas 0.5 would indicate that a prediction is no better than a flip of a coin. The probability of a scaphoid fracture generated by the CNN, sex, and age were included in a multivariable logistic regression to determine whether the demographic variables would improve the algorithm's diagnostic performance. Diagnostic performance characteristics (accuracy, sensitivity, and specificity) and reliability (kappa statistic) were calculated for the CNN and for the five orthopaedic surgeon observers in our study.
RESULTS: The algorithm had an AUC of 0.77 (95% CI 0.66 to 0.85), 72% accuracy (95% CI 60% to 84%), 84% sensitivity (95% CI 74% to 94%), and 60% specificity (95% CI 46% to 74%). Adding age and sex did not improve diagnostic performance (AUC 0.81 [95% CI 0.73 to 0.89]). Orthopaedic surgeons had better specificity (93% [95% CI 93% to 99%]; p < 0.01), while accuracy (84% [95% CI 81% to 88%]) and sensitivity (76% [95% CI 70% to 82%]; p = 0.29) did not differ between the algorithm and human observers. Although the CNN was less specific in diagnosing relatively obvious fractures, it detected five of six occult scaphoid fractures that were missed by all human observers. The interobserver reliability among the five surgeons was substantial (Fleiss' kappa = 0.74 [95% CI 0.66 to 0.83]), but the reliability between the algorithm and human observers was only fair (Cohen's kappa = 0.34 [95% CI 0.17 to 0.50]).
CONCLUSIONS: Initial experience with our deep learning algorithm suggests that it has trouble identifying scaphoid fractures that are obvious to human observers. The CNN made thirteen false-positive suggestions on radiographs that the five surgeons classified correctly. Research with larger datasets, preferably also including information from physical examination, or further algorithm refinement is merited.
LEVEL OF EVIDENCE: Level III, diagnostic study.
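The performance measures named in the abstract (accuracy, sensitivity, specificity, AUC, and Cohen's kappa) can be made concrete with a short sketch. This is an illustration of the standard definitions only, not the authors' analysis code. All counts, labels, and scores below are hypothetical and are not the study's data, and the function and class names are invented for this example. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is one common way to obtain it.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """2x2 confusion counts against the reference standard (CT or MRI)."""
    tp: int  # fracture present, reader says fracture
    fn: int  # fracture present, reader says no fracture
    tn: int  # no fracture, reader says no fracture
    fp: int  # no fracture, reader says fracture


def sensitivity(c: Counts) -> float:
    # Share of true fractures that were detected.
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    # Share of non-fractures that were correctly cleared.
    return c.tn / (c.tn + c.fp)


def accuracy(c: Counts) -> float:
    # Share of all cases classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random fracture case scores higher than a random
    non-fracture case; ties count as half. 1.0 is perfect, 0.5 is chance."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Agreement between two raters, corrected for chance agreement.
    Undefined (division by zero) if expected agreement equals 1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    expected = sum((a.count(k) / n) * (b.count(k) / n) for k in categories)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    c = Counts(tp=40, fn=10, tn=30, fp=20)
    print(sensitivity(c), specificity(c), accuracy(c))

    # Hypothetical fracture probabilities from a classifier.
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))

    # Hypothetical binary calls from an algorithm and a human consensus.
    print(cohen_kappa([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1, 1, 0]))
```

Sensitivity and specificity use different denominators (fracture cases and non-fracture cases, respectively), which is why a reader can be sensitive but not specific, as reported for the CNN. Kappa subtracts the agreement expected by chance, so two readers who agree on most cases can still have only a moderate kappa.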