| Literature DB >> 30177857 |
Kiwook Kim, Sungwon Kim, Young Han Lee, Seung Hyun Lee, Hye Sun Lee, Sungjun Kim.
Abstract
The purpose of this study was to evaluate the performance of a deep convolutional neural network (DCNN) in differentiating between tuberculous and pyogenic spondylitis on magnetic resonance (MR) imaging, compared with the performance of three skilled radiologists. This retrospective clinical study used spine MR images of 80 patients with tuberculous spondylitis and 81 patients with pyogenic spondylitis, bacteriologically and/or histologically confirmed from January 2007 to December 2016. Supervised training and validation of the DCNN classifier were performed with four-fold cross validation on a patient-level independent split. The object detection and classification model was implemented as a DCNN and was designed to calculate deep-learning scores for individual patients to reach a conclusion. Three musculoskeletal radiologists blindly interpreted the images. The diagnostic performances of the DCNN classifier and of the three radiologists were expressed as receiver operating characteristic (ROC) curves, and the areas under the ROC curves (AUCs) were compared using a bootstrap resampling procedure. The AUC of the DCNN classifier (0.802) did not differ significantly from the pooled AUC of the three readers (0.729; P = 0.079). In differentiating between tuberculous and pyogenic spondylitis on MR images, the performance of the DCNN classifier was comparable to that of three skilled radiologists.
Year: 2018 PMID: 30177857 PMCID: PMC6120953 DOI: 10.1038/s41598-018-31486-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
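The abstract describes supervised training with four-fold cross validation on a patient-level independent split, i.e. all images from one patient stay in the same fold. A minimal sketch of such a split (the `patient_level_folds` helper and shuffle-based assignment are assumptions for illustration; the paper does not describe its exact fold-assignment procedure):

```python
import random

def patient_level_folds(patient_ids, n_folds=4, seed=0):
    """Assign whole patients (not individual MR slices) to folds, so no
    patient's images appear in both training and validation sets.

    patient_ids: iterable of patient identifiers (one per image or per patient).
    Returns a list of n_folds lists of patient ids.
    """
    ids = sorted(set(patient_ids))        # deduplicate: one entry per patient
    rng = random.Random(seed)
    rng.shuffle(ids)                      # random assignment is an assumption
    return [ids[i::n_folds] for i in range(n_folds)]
```

Training on three folds and validating on the held-out fold, rotated four times, yields the four subgroup AUCs reported below.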
Comparison of baseline demographics and clinical characteristics between patients with tuberculous or pyogenic spondylitis.
| Variables† | Tuberculous (n = 80) | Pyogenic (n = 81) | P Value‡ |
|---|---|---|---|
| Female, n (%) | 49 (61.3) | 40 (49.4) | 0.130 |
| Age (y), median (IQR) | 59 (38–71) | 64 (56–72) | 0.011 |
| **Symptoms, n (%)** | | | |
| Fever | 10 (12.5) | 35 (43.2) | <0.001 |
| Intermittent fever | 5 | 5 | 0.048* |
| Back pain | 69 (86.3) | 75 (92.6) | 0.108 |
| Acute (≤4 weeks) | 27 | 61 | <0.001* |
| Subacute or chronic (>4 weeks) | 42 | 15 | <0.001* |
| Neurological symptom | 29 (36.3) | 40 (49.4) | 0.092 |
| **Laboratory findings, median (IQR)** | | | |
| Hematocrit | 36.9 (34.6–39.5) | 34.7 (31.8–37.5) | 0.001 |
| WBC | 7,285 (5,835–9,295) | 8,720 (6,890–11,970) | <0.001 |
| % Neutrophils | 70.4 (62.4–75.4) | 75.8 (68.4–84.5) | 0.001 |
| ESR | 61.5 (42.0–94.8) | 79 (57–105) | 0.014 |
| CRP | 23.8 (7.8–48.7) | 56.3 (19–179.9) | <0.001 |
| Procalcitonin | 0.08 (0.06–0.24), 16/80 | 0.17 (0.07–0.79), 40/81 | 0.040 |
| Albumin | 3.9 (3.6–4.2) | 3.4 (2.9–3.8) | <0.001 |
Abbreviations: IQR, interquartile range; WBC, white blood cell; ESR, erythrocyte sedimentation rate; CRP, C-reactive protein.
†Values represent the number of subjects (%) or median (IQR).
‡Values were obtained using Student t, Fisher exact, or chi-square test as appropriate.
*p values for subgroup comparison.
Area under the receiver operating characteristics curve for the deep convolutional neural network classifier based on four-fold cross validation.
| Subgroup | AUC | 95% CI | Sensitivity/Specificity (%)† |
|---|---|---|---|
| Subgroup 1 | 0.723 | (0.563–0.883) | |
| Subgroup 2 | 0.856 | (0.723–0.989) | |
| Subgroup 3 | 0.853 | (0.732–0.973) | |
| Subgroup 4 | 0.761 | (0.606–0.916) | |
| Total | 0.802 | (0.733–0.872) | 85.0/67.9 |
Abbreviations: AUC, area under the receiver operating characteristics curve; CI, confidence interval.
†Optimal threshold by the Youden index.
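The footnote above selects the operating threshold by the Youden index, J = sensitivity + specificity − 1, maximized over candidate thresholds. A minimal sketch, assuming scores are predicted probabilities of tuberculous spondylitis and labels code tuberculous as 1 (a hypothetical encoding; the paper's implementation is not shown):

```python
def youden_threshold(scores, labels):
    """Return the score threshold maximizing Youden's J and the J value.

    scores: predicted probabilities for the positive class (tuberculous).
    labels: 1 = tuberculous, 0 = pyogenic (hypothetical encoding).
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Applied to the deep-learning scores of the whole cohort, this kind of search yields a threshold like the 0.31 reported in the study.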
Comparison of the diagnostic performances of the deep convolutional neural network classifier and three radiologists derived from the confusion matrix.
| | Sensitivity (%) (95% CI) | TP/(TP + FN) | Specificity (%) (95% CI) | TN/(TN + FP) | Accuracy (%) (95% CI) | (TP + TN)/(TP + FN + TN + FP) |
|---|---|---|---|---|---|---|
| DCNN | 85.0 (74.9–91.7) | 68/80 | 67.9 (56.5–77.6) | 55/81 | 76.4 (69.1–82.7) | 123/161 |
| Reader 1 | 72.5 (61.9–81.1) | 58/80 | 67.9 (56.5–77.6) | 55/81 | 70.2 (63.1–77.3) | 113/161 |
| Reader 2 | 72.5 (61.9–81.1) | 58/80 | 66.7 (55.9–76.0) | 54/81 | 69.6 (62.5–76.7) | 112/161 |
| Reader 3 | 70.0 (59.2–78.9) | 56/80 | 71.6 (61.0–80.3) | 58/81 | 70.8 (63.8–77.8) | 114/161 |
Abbreviations: CI, confidence interval; TP, true positive; FN, false negative; TN, true negative; FP, false positive; DCNN, Deep convolutional neural network classifier.
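The point estimates in the table above follow directly from the confusion-matrix counts (e.g. 68/80 = 85.0% sensitivity for the DCNN). A minimal sketch of the computation, using Wilson score intervals as one plausible CI method (the paper does not state which interval it used, so the CI bounds here may differ slightly from Table 3):

```python
from math import sqrt

def binom_metrics(tp, fn, tn, fp, z=1.96):
    """Sensitivity, specificity, and accuracy with Wilson 95% CIs.

    Returns {metric: (point_estimate, ci_low, ci_high)} as proportions.
    Wilson intervals are an assumption; the study's exact CI method is unstated.
    """
    def wilson(k, n):
        p = k / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return p, center - half, center + half
    return {
        "sensitivity": wilson(tp, tp + fn),
        "specificity": wilson(tn, tn + fp),
        "accuracy": wilson(tp + tn, tp + fn + tn + fp),
    }
```

For the DCNN row, `binom_metrics(68, 12, 55, 26)` reproduces the 85.0/67.9/76.4 point estimates.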
Comparison of the diagnostic performances of the deep convolutional neural network classifier and three radiologists expressed as the area under the receiver operating characteristics curves using bootstrapping (1000 bootstrap samples).
| | AUC | 95% CI | P value† |
|---|---|---|---|
| DCNN | 0.802 | (0.733–0.872) | |
| Reader 1 | 0.733 | (0.658–0.808) | 0.109 |
| Reader 2 | 0.723 | (0.647–0.799) | 0.066 |
| Reader 3 | 0.734 | (0.658–0.811) | 0.122 |
| Pooling‡ | 0.729 | (0.657–0.796) | 0.079 |
Abbreviations: AUC, area under the receiver operating characteristics curve; CI, confidence interval; DCNN, Deep convolutional neural network classifier.
†Comparison with DCNN.
‡Pooled performance of three readers calculated by multi-reader multi-case receiver operating characteristic analysis under the assumption of random readers and random cases.
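The AUC comparison above resamples cases with replacement (1000 bootstrap samples). A minimal sketch of a paired case-level bootstrap for the difference between two classifiers' AUCs, using the Mann–Whitney formulation of the AUC (the exact resampling and p-value scheme of the study is not specified here, so this is one common variant, not the authors' code):

```python
import random

def auc(scores, labels):
    """Mann–Whitney AUC: probability a positive case outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(s1, s2, labels, n_boot=1000, seed=0):
    """Paired bootstrap over cases: observed AUC difference and a two-sided
    p-value from the share of resampled differences on either side of zero."""
    rng = random.Random(seed)
    n = len(labels)
    observed = auc(s1, labels) - auc(s2, labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            diffs.append(auc([s1[i] for i in idx], ys)
                         - auc([s2[i] for i in idx], ys))
    below = sum(d <= 0 for d in diffs) / n_boot
    above = sum(d >= 0 for d in diffs) / n_boot
    return observed, 2 * min(below, above)
```

With the DCNN scores as `s1` and a reader's confidence scores as `s2` over the same 161 cases, this procedure yields p-values of the kind tabulated above.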
Figure 1. Receiver operating characteristic curves of the deep convolutional neural network (DCNN) classifier and three radiologists.
Inter-observer agreement on five-point confidence scale scores among the three radiologists.
| | Kappa value† | 95% CI |
|---|---|---|
| Readers 1 and 2 | 0.6728 | 0.6090–0.7366 (substantial agreement) |
| Readers 1 and 3 | 0.6644 | 0.6027–0.7261 (substantial agreement) |
| Readers 2 and 3 | 0.7399 | 0.6923–0.7876 (substantial agreement) |
Abbreviations: CI, confidence interval.
†We used the linear weighted kappa to account for partial agreement because the outcome encompasses ordinal scoring.
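The footnote above specifies a linear weighted kappa, which penalizes rater disagreements in proportion to their distance on the ordinal scale. A minimal self-contained sketch, assuming the five-point confidence scores are coded 1–5 (a hypothetical encoding):

```python
def linear_weighted_kappa(r1, r2, k=5):
    """Linear weighted Cohen's kappa for two raters on a k-point ordinal scale.

    r1, r2: paired integer scores in 1..k for the same cases.
    Weight for a (i, j) disagreement is |i - j| / (k - 1).
    """
    n = len(r1)
    obs = [[0.0] * k for _ in range(k)]          # observed joint distribution
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1 / n
    p1 = [sum(row) for row in obs]               # rater 1 marginals
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    num = sum(abs(i - j) / (k - 1) * obs[i][j]
              for i in range(k) for j in range(k))
    den = sum(abs(i - j) / (k - 1) * p1[i] * p2[j]
              for i in range(k) for j in range(k))
    return 1 - num / den
```

Values of 0.61–0.80 under this statistic are conventionally read as "substantial agreement", matching the interpretation in the table.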
Figure 2. A flowchart of the patient selection process.
Figure 3. A flowchart of the study process from image preprocessing to ground-truth making. The image of step (c) and the ground-truth box of step (d) were used as input data for the training phase. Abbreviations: DICOM, Digital Imaging and Communications in Medicine; PNG, Portable Network Graphics; ROI, Region of Interest.
Figure 4. Two examples of probability measurements within each image. (a) A 72-year-old woman with pyogenic spondylitis. Sky-blue boxes indicate pyogenic lesions detected by the deep convolutional neural network classifier. The probability of pyogenic spondylitis for this image is 1.00, the highest value among the detected pyogenic lesions. Because no tuberculous lesion is observed, the probability of tuberculous spondylitis is 0. (b) A 31-year-old woman with tuberculous spondylitis. Two red boxes indicate tuberculous lesions and a sky-blue box indicates a pyogenic lesion. The probabilities of tuberculous and pyogenic spondylitis are 0.99 and 1.00, respectively. Abbreviations: Pyo, pyogenic spondylitis; Tb, tuberculous spondylitis.
Figure 5. An example of deep-learning scoring in a 30-year-old woman with tuberculous spondylitis. The trained deep convolutional neural network classifier detected lesions in each image and displayed their positions using rectangular boxes. Each box was colored according to the class of the lesion and assigned a probability (tuberculous spondylitis, red box; pyogenic spondylitis, sky-blue box). The highest probabilities for tuberculous and pyogenic spondylitis obtained from each magnetic resonance image were each summed over all slices. The deep-learning score was defined as the ratio of the summed probability for tuberculous spondylitis to the summed probability for tuberculous and pyogenic spondylitis combined. The final deep-learning score was 0.81, higher than the selected threshold value (0.31, Youden index), so the case was diagnosed as tuberculous spondylitis. Abbreviations: DL, deep-learning; Pyo, pyogenic spondylitis; Tb, tuberculous spondylitis.
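The patient-level aggregation described in Figure 5 can be sketched as follows. Per slice, the highest probability of each class among the detected boxes is kept; these maxima are summed over slices, and the deep-learning score is the tuberculous share of the combined sum. The `("Tb", probability)` tuple encoding and the zero-score fallback when nothing is detected are assumptions for illustration:

```python
def deep_learning_score(slice_detections, threshold=0.31):
    """Aggregate per-slice detections into the patient-level deep-learning score.

    slice_detections: one list per MR slice of (cls, prob) boxes,
    cls in {"Tb", "Pyo"} (hypothetical encoding).
    Returns (score, diagnosis); threshold 0.31 is the study's Youden cutoff,
    and classifying scores exactly at the cutoff as tuberculous is an assumption.
    """
    tb_sum = pyo_sum = 0.0
    for boxes in slice_detections:
        # keep only the most confident box of each class in this slice
        tb_sum += max((p for c, p in boxes if c == "Tb"), default=0.0)
        pyo_sum += max((p for c, p in boxes if c == "Pyo"), default=0.0)
    score = tb_sum / (tb_sum + pyo_sum) if tb_sum + pyo_sum else 0.0
    return score, ("tuberculous" if score >= threshold else "pyogenic")
```

For the Figure 4(b) slice alone (Tb 0.99, Pyo 1.00), the score would be 0.99/1.99 ≈ 0.50; summing over all slices of a study is what produced the 0.81 in Figure 5.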