Kyong Joon Lee, Inseon Ryoo, Dongjun Choi, Leonard Sunwoo, Sung-Hye You, Hye Na Jung.
Abstract
OBJECTIVES: This study aimed to compare the diagnostic performance of a deep learning algorithm trained with a single view (anterior-posterior (AP) or lateral) against that of an algorithm trained with multiple views (both views together) in the diagnosis of mastoiditis on mastoid series, and to compare the diagnostic performance of the algorithm with that of radiologists.
Year: 2020 PMID: 33176335 PMCID: PMC7657495 DOI: 10.1371/journal.pone.0241796
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Typical images of each labeling category.
(a,b) AP view (a) and lateral view (b) show bilateral, clear mastoid air cells (red circles) with a honeycombing pattern, category 0. (c,d) The right ear (red circles) on the AP view (c) and lateral view (d) shows slightly increased haziness of the mastoid air cells, suggesting category 1, while the left ear (white circles) on both views shows bony defects with air cavities, suggesting category 3. (e,f) AP view (e) and lateral view (f) show bilateral, total haziness and sclerosis of the mastoid air cells (red circles), suggesting category 2.
Fig 2. Location of center points for right and left cropping in the AP view.
The yellow dots represent the center points used to crop the right and left ears.
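The per-ear cropping described above can be sketched as a simple center-based patch extraction. This is a generic illustration; `crop_around`, its coordinate convention, and the patch size are assumptions, not the paper's actual preprocessing code.

```python
def crop_around(image, cx, cy, size):
    """Extract a size x size patch centered at (cx, cy).

    image is a 2-D list (rows of pixel values). The (cx, cy) center
    points would correspond to the yellow dots marked per ear.
    """
    top, left = cy - size // 2, cx - size // 2
    return [row[left:left + size] for row in image[top:top + size]]
```

In practice each radiograph would yield two such patches, one per ear, each fed to the network as an independent sample.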
Fig 3. Network architectures for predicting mastoiditis.
The CNNs (convolutional neural networks) for single views (a) illustrate a process in which the AP and lateral views are trained separately; the CNN for multiple views (b) illustrates a process in which both views are trained simultaneously. Layer dimensions after Log-Sum-Exp pooling are also marked: [1] denotes a 1×1 vector and [2] a 1×2 vector.
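The Log-Sum-Exp pooling named in the caption is a smooth stand-in for max pooling over per-location scores. A minimal stdlib sketch, assuming a smoothing parameter `r` (the paper's setting is not given in this excerpt):

```python
import math

def log_sum_exp_pool(scores, r=5.0):
    """Log-Sum-Exp pooling over a list of per-location scores.

    A smooth approximation to max pooling: large r approaches the max,
    small r approaches the mean. The max-shift keeps the exponentials
    numerically stable.
    """
    m = max(scores)
    total = sum(math.exp(r * (s - m)) for s in scores)
    return m + math.log(total / len(scores)) / r
```

Because the result lies between the mean and the max, a single strongly abnormal region can dominate the pooled score without ignoring the rest of the map entirely.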
Fig 7. Class activation mappings of true positive (a), true negative (b), false positive (c), false negative (d), and postoperative state (e) examples.
(a) Lesion-related regions in the mastoid air cells are detected on both the AP and lateral views. (b) No specific region is detected on either view. (c) A false lesion-related region is detected on the lateral view. (d) Equivocal haziness is suspected on both views; the algorithm diagnosed this case as normal. An AP view including both sides (right upper) shows marked asymmetry, suggesting an abnormality on the right side. (e) Both views show lesion-related regions.
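Class activation maps of the kind shown in Fig 7 are commonly computed as a classifier-weighted sum of the final convolutional feature maps (Zhou et al.-style CAM). A generic sketch under that assumption, not the paper's exact implementation:

```python
def class_activation_map(feature_maps, weights):
    """Weighted sum of final-conv feature maps for one target class.

    feature_maps: list of K maps, each a 2-D list of floats;
    weights: the K classifier weights for the class of interest.
    The resulting map highlights locations that drove the prediction.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(feature_maps, weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * fmap[i][j]
    return cam
```

The map is then typically upsampled to the input resolution and overlaid on the radiograph as a heat map.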
Baseline characteristics of all data sets.
| Characteristic | Training set (n = 8278) | Validation set (n = 918) | Gold standard test set (n = 792) | Temporal external test set (n = 294) | Geographic external test set (n = 308) |
|---|---|---|---|---|---|
| Number of patients | 4139 | 459 | 396 | 147 | 154 |
| Age | | | | | |
| <20 years | 322 | 35 | 40 | 4 | 4 |
| 20~29 years | 262 | 32 | 16 | 4 | 7 |
| 30~39 years | 444 | 68 | 47 | 10 | 15 |
| 40~49 years | 912 | 97 | 85 | 21 | 28 |
| 50~59 years | 1131 | 114 | 119 | 60 | 59 |
| 60~69 years | 774 | 80 | 55 | 37 | 26 |
| 70~79 years | 258 | 30 | 27 | 8 | 14 |
| ≥80 years | 36 | 3 | 7 | 3 | 1 |
| Sex | | | | | |
| Female | 2353 | 247 | 225 | 70 | 69 |
| Male | 1786 | 212 | 171 | 77 | 85 |
| Label (based on conventional radiography) | | | | | |
| 0, Normal | 3155 | 349 | 261 | 159 | 175 |
| 1, Abnormal | 5123 | 569 | 531 | 135 | 133 |
| Mild | 1806 | 200 | 129 | 56 | 44 |
| Severe | 3317 | 369 | 402 | 76 | 89 |
| Postop | - | - | - | 3 | - |
| Label (based on CT) | | | | | |
| 0, Normal | - | - | 353 | - | - |
| 1, Abnormal | - | - | 439 | - | - |
| Mild | - | - | 157 | - | - |
| Severe | - | - | 258 | - | - |
| Postop | - | - | 24 | - | - |
Comparison of diagnostic performance between the algorithm using a single view and the algorithm using multiple views in each data set, based on labels from conventional radiography.
| Dataset | Comparison | AUC (single view) | AUC (multiple views) | P* |
|---|---|---|---|---|
| Validation set | Single view (AP) vs Multiple views | 0.955 (0.943–0.968) | 0.968 (0.959–0.977) | <0.001 |
| | Single view (Lateral) vs Multiple views | 0.946 (0.932–0.959) | 0.968 (0.959–0.977) | <0.001 |
| Gold standard test set | Single view (AP) vs Multiple views | 0.964 (0.953–0.975) | 0.971 (0.962–0.981) | 0.017 |
| | Single view (Lateral) vs Multiple views | 0.953 (0.940–0.966) | 0.971 (0.962–0.981) | <0.001 |
| Temporal external test set | Single view (AP) vs Multiple views | 0.952 (0.931–0.974) | 0.978 (0.965–0.990) | 0.002 |
| | Single view (Lateral) vs Multiple views | 0.961 (0.942–0.980) | 0.978 (0.965–0.990) | 0.004 |
| Geographic external test set | Single view (AP) vs Multiple views | 0.961 (0.942–0.980) | 0.965 (0.948–0.981) | 0.246 |
| | Single view (Lateral) vs Multiple views | 0.942 (0.918–0.966) | 0.965 (0.948–0.981) | 0.003 |
Data are shown to three decimal places, with 95% confidence intervals in parentheses.
AUC: Area under the receiver operating characteristic (ROC) curve.
P*: P value of the one-sided DeLong test for two correlated ROC curves (alternative hypothesis: the AUC for multiple views is greater than the AUC for the single view).
*: P < 0.05 was considered significant.
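The one-sided DeLong test named above compares two AUCs computed from the same cases (paired scores). A stdlib-only sketch of that test; the function and variable names are illustrative, not from the paper's code:

```python
import math

def _psi(x, y):
    # Mann-Whitney kernel: 1 if the positive outranks the negative, 0.5 on ties.
    return 1.0 if x > y else (0.5 if x == y else 0.0)

def delong_one_sided(labels, scores_a, scores_b):
    """One-sided DeLong test that AUC(b) > AUC(a) for paired ROC curves."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos), len(neg)

    def components(scores):
        # DeLong structural components per positive (V10) and negative (V01).
        v10 = [sum(_psi(scores[i], scores[j]) for j in neg) / n for i in pos]
        v01 = [sum(_psi(scores[i], scores[j]) for i in pos) / m for j in neg]
        return v10, v01

    v10a, v01a = components(scores_a)
    v10b, v01b = components(scores_b)
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m

    def cov(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

    # Variance of the paired AUC difference from the component covariances.
    var = (cov(v10a, v10a) + cov(v10b, v10b) - 2 * cov(v10a, v10b)) / m \
        + (cov(v01a, v01a) + cov(v01b, v01b) - 2 * cov(v01a, v01b)) / n
    z = (auc_b - auc_a) / math.sqrt(var)
    p = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # upper-tail normal p
    return auc_a, auc_b, p
```

Because the same patients score both models, the covariance terms shrink the variance of the AUC difference relative to an unpaired comparison, which is the point of using DeLong's paired form here.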
Comparison of diagnostic performance on the gold standard test set between the deep learning algorithm (using multiple views) and radiologists, based on labels from the standard reference (temporal bone CT).
| Reader | Cutoff | Sensitivity | P | Specificity | P |
|---|---|---|---|---|---|
| Deep learning algorithm | Optimal cutoff | 96.4% (423/439, 94.1–97.9%) | | 74.5% (263/353, 69.6–79.0%) | |
| | Cutoff for 95% sensitivity | 98.6% (433/439, 97.0–99.5%) | | 58.9% (208/353, 53.6–64.1%) | |
| | Cutoff for 95% specificity | 95.7% (420/439, 93.3–97.4%) | | 79.3% (280/353, 74.7–83.4%) | |
| Radiologist | Radiologist 1 | 95.9% (421/439, 93.6–97.6%) | 0.752 | 68.8% (243/353, 63.7–73.6%) | 0.018 |
| | Radiologist 2 | 96.1% (422/439, 93.9–97.7%) | 1.000 | 68.6% (242/353, 63.4–73.4%) | 0.012 |
Data are percentages with numerator/denominator and 95% confidence intervals in parentheses.
P: P values comparing sensitivities/specificities between the deep learning algorithm at its optimal cutoff and the radiologists, determined by McNemar's test.
*: P < 0.05 was considered significant.
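McNemar's test, as used above, compares paired readers via the discordant cases only. A minimal sketch with continuity correction; the counts in the usage example are illustrative, not the paper's actual discordant tables:

```python
import math

def mcnemar(b, c):
    """Continuity-corrected McNemar test on discordant pair counts.

    b: cases one reader classified correctly and the other missed;
    c: the reverse. Returns the chi-square statistic (1 df) and its
    p-value, computed from the normal tail via erfc.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1 df) survival function
    return chi2, p
```

The concordant cases (both right or both wrong) carry no information about which reader is better and drop out of the statistic entirely.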
Diagnostic performance of the deep learning algorithm in all test sets, based on labels from conventional radiography.
| Diagnostic performance | | Gold standard test set | Temporal external test set | Geographic external test set |
|---|---|---|---|---|
| AUC | | 0.971 (0.962–0.981) | 0.978 (0.965–0.990) | 0.965 (0.948–0.981) |
| Optimal cutoff | Sensitivity | 91.3% (485/531, 88.6–93.6%) | 91.1% (123/135, 85.0–95.3%) | 85.7% (114/133, 78.6–91.2%) |
| | Specificity | 89.3% (233/261, 84.9–92.8%) | 90.6% (144/159, 84.9–94.6%) | 90.3% (158/175, 84.9–94.2%) |
| Cutoff for 95% sensitivity | Sensitivity | 96.8% (514/531, 94.9–98.1%) | 97.8% (132/135, 93.6–99.5%) | 97.0% (129/133, 92.5–99.2%) |
| | Specificity | 75.5% (197/261, 69.8–80.6%) | 79.2% (126/159, 72.1–85.3%) | 80.6% (141/175, 73.9–86.2%) |
| Cutoff for 95% specificity | Sensitivity | 89.3% (474/531, 86.3–91.8%) | 90.4% (122/135, 84.1–94.8%) | 85.0% (113/133, 77.7–90.6%) |
| | Specificity | 92.7% (242/261, 88.9–95.6%) | 93.7% (149/159, 88.7–96.9%) | 91.4% (160/175, 86.3–95.1%) |
Data are percentages with numerator/denominator and/or 95% confidence intervals in parentheses.
AUC: Area under the receiver operating characteristic curve.
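The tables report an "optimal cutoff" alongside cutoffs fixed at 95% sensitivity and 95% specificity. This excerpt does not state the optimality criterion; the Youden index (sensitivity + specificity − 1) is one common choice, sketched here under that assumption:

```python
def youden_cutoff(labels, scores):
    """Pick the score threshold maximizing Youden's J = sens + spec - 1.

    labels are 1 (abnormal) / 0 (normal); scores are model outputs.
    Sweeps every observed score as a candidate threshold.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

The fixed-sensitivity and fixed-specificity cutoffs in the tables would be chosen the same way, except the sweep keeps the lowest threshold whose sensitivity (or specificity) stays at or above 95% instead of maximizing J.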