Chi-Tung Cheng, Chih-Chi Chen, Fu-Jen Cheng, Huan-Wu Chen, Yi-Siang Su, Chun-Nan Yeh, I-Fang Chung, Chien-Hung Liao.
Abstract
BACKGROUND: Hip fracture is the most common type of fracture in elderly individuals. Numerous deep learning (DL) algorithms for plain pelvic radiographs (PXRs) have been applied to improve the accuracy of hip fracture diagnosis. However, their efficacy is still undetermined.
Keywords: algorithms; artificial intelligence; computer; deep learning; diagnosis; hip fracture; human augmentation; neural network
Year: 2020; PMID: 33245279; PMCID: PMC7732715; DOI: 10.2196/19416
Source DB: PubMed; Journal: JMIR Med Inform
Figure 1. (A) Pelvic x-ray (PXR) with hip fracture. (B) PXR with left hip fracture, shown with a human-algorithm integration–enhanced reference image.
Figure 2. The validation flow of the human-algorithm integration (HAI) system performance test. PXR: plain pelvic radiograph.
Figure 3. The flow of clinical integration of the human-algorithm integration (HAI) system into the emergency department and real-world data validation. DCNN: deep convolutional neural network; Grad-CAM: gradient-weighted class activation mapping; PACS: picture archiving and communication system.
Demographic data of the physician participants (n=34).
| Physician characteristics | Values |
|---|---|
| Age in years, median (IQR) | 29.00 (27.00-32.00) |
| Years of practice, median (IQR) | 3.00 (2.00-5.75) |
| Sex, n (%) | |
| Male | 27 (79) |
| Female | 7 (21) |
| Specialty, n (%) | |
| General surgeon | 21 (62) |
| Emergency physician | 6 (18) |
| Postgraduate-year doctor | 3 (9) |
| Radiologist | 2 (6) |
| Orthopedic surgeon | 2 (6) |
The physician-alone performance and human-algorithm integration (HAI) performance of the physician participants (n=34); the Wilcoxon signed-rank test was used to compare the physician-alone and the HAI performance.
| Measures | Physician-alone performance | HAI performance | P value^a |
|---|---|---|---|
| Sensitivity, median (IQR) | 0.95 (0.90-1.00) | 0.99 (0.96-1.00) | <.001 |
| Specificity, median (IQR) | 0.90 (0.82-0.94) | 0.95 (0.90-0.98) | <.001 |
| Accuracy, median (IQR) | 0.90 (0.88-0.94) | 0.96 (0.93-0.98) | <.001 |
| Human-algorithm agreement, κ, median (IQR) | 0.69 (0.63-0.74) | 0.80 (0.76-0.82) | <.001 |
^a All P values are statistically significant.
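The human-algorithm agreement above is reported as κ. As a minimal sketch of how such an agreement score can be computed, the following assumes the statistic is Cohen's kappa on binary fracture/no-fracture reads (the record does not state which kappa variant was used). The function name and the toy label lists are hypothetical and are not study data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of binary labels (0 or 1)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of reads")
    n = len(rater_a)
    # Observed agreement: fraction of images where both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rate of positive reads.
    pos_a = sum(rater_a) / n
    pos_b = sum(rater_b) / n
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    if expected == 1.0:
        # Both raters always give the same single label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical reads for ten images (1 = fracture, 0 = no fracture).
physician = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
algorithm = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohens_kappa(physician, algorithm)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 images while chance agreement is 0.5, so κ comes out at 0.6. A per-physician κ computed this way could then be summarized as a median (IQR) across participants, as in the table.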
Figure 4. The receiver operating characteristic curve of the algorithm performance on test images. Green spots: participants’ performance; red spots: participants’ performance with human-algorithm integration (HAI) assistance; cross mark: the cut-off performance of the algorithm presented to the physician.
Figure 5. Examples of inconsistencies between the participants and the algorithm. (A) A pelvic x-ray (PXR) without hip fracture that was overdiagnosed by the algorithm. No participant overdiagnosed in the physician-alone test, and only 1 (2.9%) participant overdiagnosed in the human-algorithm integration (HAI) test. (B) A PXR with left hip fracture. In the physician-alone test, 12 (35.3%) participants missed this fracture. In the HAI test, only 1 (2.9%) participant missed this fracture. (C) A PXR with left hip fracture that was missed by the algorithm. In the physician-alone test, 4 (11.8%) participants missed this fracture. In the HAI test, 3 (8.8%) participants missed this fracture. (D) A PXR without hip fracture. In the physician-alone test, 18 (52.9%) participants overdiagnosed this image. In the HAI test, only 5 (14.7%) participants overdiagnosed this image.
The physician-alone performance and human-algorithm integration (HAI) performance of the physicians by specialty (n=34).
| Physician characteristics and performance | Primary physicians: general surgeons (n=21) | Primary physicians: emergency-department physicians (n=6) | Primary physicians: postgraduate-year physicians (n=3) | Consulting physicians: radiologists (n=2) | Consulting physicians: orthopedic surgeons (n=2) | P value |
|---|---|---|---|---|---|---|
| Age in years, median (IQR) | 28.00 (27.00-30.00) | 33.50 (29.75-37.25) | 26.00 (25.50-26.50) | 39.00 (36.00-42.00) | 33.50 (32.75-34.25) | .006^a |
| Years of experience, median (IQR) | 3.00 (2.00-4.00) | 5.50 (2.75-9.00) | 1.00 (1.00-1.00) | 6.50 (6.25-6.75) | 12.50 (10.75-14.25) | .003^a |
| Physician-alone performance | | | | | | |
| Human-algorithm agreement, κ, median (IQR) | 0.69 (0.63-0.72) | 0.75 (0.67-0.80) | 0.44 (0.32-0.53) | 0.79 (0.78-0.79) | 0.72 (0.71-0.74) | .027^a |
| Accuracy, median (IQR) | 0.90 (0.88-0.92) | 0.95 (0.91-0.97) | 0.70 (0.66-0.76) | 0.96 (0.96-0.97) | 0.94 (0.93-0.94) | .013^a |
| Sensitivity, median (IQR) | 0.94 (0.90-0.98) | 1.00 (0.98-1.00) | 0.58 (0.42-0.74) | 0.99 (0.98-0.99) | 1.00 (1.00-1.00) | .003^a |
| Specificity, median (IQR) | 0.90 (0.82-0.96) | 0.91 (0.83-0.94) | 0.82 (0.77-0.91) | 0.94 (0.94-0.94) | 0.87 (0.85-0.88) | .855 |
| HAI performance | | | | | | |
| Human-algorithm agreement, κ, median (IQR) | 0.80 (0.76-0.82) | 0.81 (0.78-0.83) | 0.76 (0.74-0.78) | 0.78 (0.76-0.80) | 0.82 (0.82-0.82) | .496 |
| Accuracy, median (IQR) | 0.95 (0.94-0.97) | 0.98 (0.97-0.99) | 0.91 (0.89-0.91) | 0.97 (0.96-0.97) | 1.00 (1.00-1.00) | .011^a |
| Sensitivity, median (IQR) | 0.98 (0.96-1.00) | 1.00 (1.00-1.00) | 0.90 (0.89-0.95) | 0.97 (0.95-0.98) | 1.00 (1.00-1.00) | .121 |
| Specificity, median (IQR) | 0.94 (0.90-0.98) | 0.97 (0.94-0.98) | 0.84 (0.83-0.89) | 0.97 (0.96-0.97) | 1.00 (1.00-1.00) | .071 |
^a P value is statistically significant.
A comparison of the primary physician performance with the human-algorithm integration (HAI) performance, divided by physician experience (n=30); the Wilcoxon signed-rank test was used to compare the physician-alone performance and the HAI performance.
| Primary physician characteristics and performance | Novice group (n=18) | Experienced group (n=12) | P value |
|---|---|---|---|
| Age in years, median (IQR) | 27.00 (27.00-28.00) | 32.00 (30.75-34.25) | <.001^a |
| Years of experience, median (IQR) | 2.00 (2.00-3.00) | 5.00 (4.00-6.25) | <.001^a |
| Human-algorithm agreement, κ, median (IQR) | | | |
| Physician alone | 0.66 (0.62-0.72) | 0.69 (0.64-0.77) | .330 |
| HAI | 0.77 (0.71-0.80) | 0.82 (0.79-0.82) | .008^a |
| Paired test, P value | .0001^a | .001^a | |
| Accuracy, median (IQR) | | | |
| Physician alone | 0.90 (0.82-0.92) | 0.90 (0.89-0.96) | .279 |
| HAI | 0.94 (0.91-0.97) | 0.97 (0.95-0.98) | .020^a |
| Paired test, P value | .0023^a | .0032^a | |
| Sensitivity, median (IQR) | | | |
| Physician alone | 0.91 (0.83-0.95) | 0.98 (0.94-1.00) | .017^a |
| HAI | 0.97 (0.94-1.00) | 1.00 (0.97-1.00) | .043^a |
| Paired test, P value | .0028^a | .0313^a | |
| Specificity, median (IQR) | | | |
| Physician alone | 0.89 (0.84-0.94) | 0.86 (0.81-0.94) | .733 |
| HAI | 0.94 (0.88-0.96) | 0.96 (0.92-0.98) | .215 |
| Paired test, P value | .1067 | .0049^a | |
^a P value is statistically significant.
Clinical validation of the human-algorithm integration (HAI) system in emergency departments.
| Algorithm-only vs. HAI diagnosis | Fracture (+) | Fracture (–) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Accuracy, % (95% CI) |
|---|---|---|---|---|---|
| Algorithm only | | | 91.01 (86.92-94.16) | 94.06 (90.88-96.39) | 92.67 (90.26-94.65) |
| (+) | 243 | 19 | | | |
| (–) | 24 | 301 | | | |
| HAI | | | 99.25 (97.32-99.91) | 95.31 (92.39-97.35) | 97.10 (95.40-98.30) |
| (+) | 265 | 15 | | | |
| (–) | 2 | 305 | | | |
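The sensitivity, specificity, and accuracy in the clinical validation table follow directly from the reported 2×2 counts. The sketch below recomputes the point estimates, reading the rows as the diagnosis, (+) or (–), and the columns as the true fracture status. The class and function names are illustrative assumptions, and the confidence intervals are not reproduced because the record does not state how they were derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # diagnosis (+), fracture (+)
    fp: int  # diagnosis (+), fracture (-)
    fn: int  # diagnosis (-), fracture (+)
    tn: int  # diagnosis (-), fracture (-)


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, and accuracy as percentages."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": 100 * c.tp / (c.tp + c.fn),
        "specificity": 100 * c.tn / (c.tn + c.fp),
        "accuracy": 100 * (c.tp + c.tn) / total,
    }


# Counts taken from the clinical validation table.
algorithm_only = ConfusionCounts(tp=243, fp=19, fn=24, tn=301)
hai = ConfusionCounts(tp=265, fp=15, fn=2, tn=305)

for label, counts in (("Algorithm only", algorithm_only), ("HAI", hai)):
    m = diagnostic_metrics(counts)
    print(
        f"{label}: sensitivity {m['sensitivity']:.2f}%, "
        f"specificity {m['specificity']:.2f}%, accuracy {m['accuracy']:.2f}%"
    )
```

Both settings cover the same 587 radiographs (267 with fracture, 320 without). The gain with HAI comes mainly from missed fractures falling from 24 to 2, with false positives dropping from 19 to 15.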