Patrick Tobler¹, Joshy Cyriac¹, Balazs K Kovacs¹, Verena Hofmann¹, Raphael Sexauer¹, Fabiano Paciolla¹, Bram Stieltjes¹, Felix Amsler², Anna Hirschmann³.
Abstract
OBJECTIVES: To evaluate the performance of a deep convolutional neural network (DCNN) in detecting and classifying distal radius fractures, metal, and cast on radiographs, using labels based on radiology reports. The secondary aim was to evaluate the effect of training set size on the algorithm's performance.
Keywords: Deep learning; Radiography; Radius fractures
Year: 2021 PMID: 33742228 PMCID: PMC8379111 DOI: 10.1007/s00330-021-07811-2
Source DB: PubMed Journal: Eur Radiol ISSN: 0938-7994 Impact factor: 5.315
Fig. 1 Flowchart demonstrates the selection of training and test sets for wrist radiographs. Exclusion criteria marked with an asterisk (*) apply only to the test sets. One radiograph could be eligible for multiple fracture classification labels. Test set A was rated by two musculoskeletal radiology experts and serves as the standard of reference. Test set B is a subset of A and was used to compare three radiology residents with the algorithms
Fig. 2 Flowchart shows the training and test architecture. Top: set-up of the training subsets and their sizes. All subsets matched the predefined sizes except the largest, which contained all radiographs available for each category. Middle: set-up used to develop the artificial intelligence (AI) algorithms. Performance was evaluated with the area under the receiver operating characteristic curve (AUC), Youden's J statistic (J), and accuracy. Bottom: set-up used to determine the radiology residents' performance and to evaluate performance on test set A (AI only) and test set B (AI and radiology residents). *Metal and cast detection training sets were limited to 9,000 images and included both views simultaneously; the model was therefore used directly, avoiding the prediction-splitting and -averaging steps (see middle set-up). DCNN = deep convolutional neural network
Fig. 3 Performance of the artificial intelligence models for distal radius fracture detection and classification, and for cast and metal detection, on test set A. Performance was measured with the area under the receiver operating characteristic curve (AUC). The graph shows the effect of incrementally increasing the training subset size from 500 (subset 1) to 9,000 (subset 12) radiographs on model performance, as well as the possible performance variation per training subset
Spearman’s correlation coefficient (ρ) between training subset size and model performance, measured by the area under the receiver operating characteristic curve (AUC), with two separate analyses.
| | View | All models: ρ | p value | Best models: ρ | p value |
|---|---|---|---|---|---|
| Fracture | Frontal | 0.947 | | 1.000 | |
| | Lateral | 0.946 | | 0.964 | |
| Fragment displacement | Frontal | 0.595 | | 1.000 | |
| | Lateral | −0.119 | 0.464 | 0.000 | 1.000 |
| Joint involvement | Frontal | 0.046 | 0.780 | −0.800 | 0.200 |
| | Lateral | 0.200 | 0.215 | −0.400 | 0.600 |
| Multiple fragments | Frontal | 0.856 | | 1.000 | |
| | Lateral | 0.489 | | 0.800 | 0.200 |
| Metal | Both | 0.740 | | 0.522 | 0.067 |
| Cast | Both | 0.722 | | 0.305 | 0.335 |
Best models: the best model per training subset, as measured by AUC. A p value < 0.05 was considered statistically significant
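The table above reports Spearman's ρ between training subset size and AUC. The coefficient is simply the Pearson correlation of the ranks; a minimal pure-Python sketch follows (an assumed helper, not the study's code, and the data below are invented for illustration, not the paper's measurements):

```python
def rank(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over the tie group
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented learning-curve data: AUC rising monotonically with subset size.
sizes = [500, 1000, 2000, 4000, 6000, 9000]
aucs = [0.80, 0.85, 0.90, 0.93, 0.95, 0.97]
print(round(spearman_rho(sizes, aucs), 3))  # monotone increase gives rho = 1.0
```

A strictly monotone relationship yields ρ = 1.0 regardless of its shape, which is why ρ (rather than Pearson's r) suits a saturating learning curve.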
Performance of best artificial intelligence (AI) algorithms and standard of reference (test set A) and of AI and radiology residents (test set B)
| | Test set A, AI: AUC | 95% CI | Accuracy | Test set B, AI: AUC | 95% CI | Test set B, residents: AUC | 95% CI | p value |
|---|---|---|---|---|---|---|---|---|
| Fracture | 0.975 | 0.957–0.992 | 0.938 | 0.981 | 0.963–0.998 | 0.983 | 0.965–1.000 | 0.864 |
| Fragment displacement | 0.589 | 0.463–0.715 | 0.597 | 0.736 | 0.624–0.847 | 0.916 | 0.871–0.961 | |
| Joint involvement | 0.618 | 0.516–0.720 | 0.637 | 0.654 | 0.549–0.760 | 0.898 | 0.841–0.956 | |
| Multiple fragments | 0.842 | 0.774–0.911 | 0.782 | 0.851 | 0.780–0.922 | 0.905 | 0.853–0.956 | 0.112 |
| Metal* | 0.989 | 0.982–0.996 | 0.976 | | | | | |
| Cast* | 1.000 | 1.000–1.000 | 1.000 | | | | | |
Algorithm data are per-study average analysis results. AUC = area under the receiver operating characteristic curve; CI = confidence interval. Test set B: a p value < 0.05 was considered statistically significant. *Values of the best model are given; views were not considered in these categories
Fig. 4 Area under the receiver operating characteristic curve (AUC) of the per-study average of the best artificial intelligence (AI) algorithms and of the radiology resident analysis. J = Youden’s J statistic
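The two metrics reported throughout (AUC and Youden's J) can be sketched in a few lines of plain Python. This is an assumed illustration, not the study's pipeline: AUC via the pairwise (Mann-Whitney) formulation, and J as the maximum of sensitivity + specificity − 1 over candidate thresholds, using invented fracture/no-fracture scores:

```python
def auc(labels, scores):
    """Probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_youden_j(labels, scores):
    """Maximum over thresholds of sensitivity + specificity - 1."""
    best = -1.0
    for t in sorted(set(scores)):  # each observed score as a threshold
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        best = max(best, j)
    return best

# Invented per-study scores: 1 = fracture, 0 = no fracture.
y = [0, 0, 0, 1, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(auc(y, s))            # 1.0: every positive outscores every negative
print(best_youden_j(y, s))  # 1.0 at any threshold between 0.4 and 0.7
```

Unlike accuracy, both metrics are threshold-aware: AUC summarizes ranking quality across all thresholds, while J picks the single operating point with the best sensitivity/specificity trade-off.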
Interobserver agreement between standard of reference, radiology residents, and per-study average of the best artificial intelligence (AI) algorithms
| Rater pair | | Fracture detection | Fracture classification: fragment displacement | Fracture classification: joint involvement | Fracture classification: multiple fragments |
|---|---|---|---|---|---|
| Standard of reference | Radiology residents | 0.88 | 0.57 | 0.69 | 0.62 |
| AI | Standard of reference | 0.83 | 0.24 | 0.28 | 0.51 |
| AI | Radiology residents | 0.84 | 0.21 | 0.26 | 0.63 |
| Reader 1 | Reader 2 | 0.85 | 0.54 | 0.40 | 0.31 |
| Reader 1 | Reader 3 | 0.90 | 0.63 | 0.55 | 0.48 |
| Reader 2 | Reader 3 | 0.88 | 0.47 | 0.57 | 0.52 |
| Reader 1 | Standard of reference | 0.87 | 0.61 | 0.51 | 0.60 |
| Reader 2 | Standard of reference | 0.88 | 0.37 | 0.61 | 0.42 |
| Reader 3 | Standard of reference | 0.87 | 0.46 | 0.63 | 0.52 |
Radiology residents comprise readers 1–3. Kappa values are interpreted according to Landis and Koch [17]
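The interobserver agreement above is Cohen's kappa, i.e. observed agreement corrected for chance agreement. A minimal sketch (an assumed helper, not the study's code, with invented ratings) together with the Landis & Koch verbal bands:

```python
def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    cats = set(a) | set(b)
    # chance agreement: product of each rater's marginal frequencies
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

def landis_koch(kappa):
    """Verbal strength-of-agreement label per Landis & Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for upper, label in bands if kappa <= upper)

# Invented binary ratings (1 = fracture, 0 = no fracture), not the study's data.
reader_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
reader_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
k = cohen_kappa(reader_a, reader_b)
print(round(k, 2), landis_koch(k))  # 0.52 moderate
```

Note how 80% raw agreement shrinks to κ ≈ 0.52 once the imbalanced marginals are accounted for; this is why the table reports kappa rather than simple percent agreement.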