Jieun Koh, Eunjung Lee, Kyunghwa Han, Eun-Kyung Kim, Eun Ju Son, Yu-Mee Sohn, Mirinae Seo, Mi-Ri Kwon, Jung Hyun Yoon, Jin Hwa Lee, Young Mi Park, Sungwon Kim, Jung Hee Shin, Jin Young Kwak.
Abstract
The purpose of this study was to evaluate and compare the diagnostic performances of a deep convolutional neural network (CNN) and expert radiologists for differentiating thyroid nodules on ultrasonography (US), and to validate the results in multicenter data sets. This multicenter retrospective study collected 15,375 US images of thyroid nodules for algorithm development (n = 13,560, Severance Hospital, SH training set), the internal test (n = 634, SH test set), and the external test (n = 781, Samsung Medical Center, SMC set; n = 200, CHA Bundang Medical Center, CBMC set; n = 200, Kyung Hee University Hospital, KUH set). Two individual CNNs and two classification ensembles (CNNE1 and CNNE2) were tested to differentiate malignant from benign thyroid nodules. The CNNs demonstrated high areas under the curve (AUCs) for diagnosing malignant thyroid nodules (0.898–0.937 for the internal test set and 0.821–0.885 for the external test sets). The AUC was significantly higher for CNNE2 than for the radiologists in the SH test set (0.932 vs. 0.840, P < 0.001), and did not differ significantly between CNNE2 and the radiologists in the external test sets (P = 0.113, 0.126, and 0.690). The CNN showed diagnostic performance comparable to expert radiologists for differentiating thyroid nodules on US in both the internal and external test sets.
Year: 2020 PMID: 32943696 PMCID: PMC7498581 DOI: 10.1038/s41598-020-72270-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Diagram of the study cohort. For the algorithm development, 13,560 images of thyroid nodules were collected from Severance Hospital (SH training set). For the internal test, 634 images of thyroid nodules were additionally obtained from Severance Hospital (SH test set). For the external test, 1,181 images of thyroid nodules were obtained from three different hospitals (Samsung Medical Center, SMC set; CHA Bundang Medical Center, CBMC set; Kyung Hee University Hospital, KUH set). From each of the four test sets, 200 images were selected, and four readers retrospectively reviewed two sets of images each to compare diagnostic performance between expert radiologists and the CNN.
Figure 2. Image acquisition process. To extract the ROI without unnecessary interference from the color bounding box used to indicate the ROI's border, location information was harvested and applied to a duplicate image that did not have the ROI box drawn on it.
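As a rough illustration (not the authors' code), the extraction step in Fig. 2 amounts to reading the box coordinates from the annotated image and cropping the clean duplicate; the `(x, y, width, height)` coordinate convention below is an assumption:

```python
import numpy as np

def crop_roi(clean_image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the ROI from the duplicate image that has no colored
    bounding box burned in. `box` = (x, y, width, height) harvested
    from the annotated copy (hypothetical coordinate convention)."""
    x, y, w, h = box
    return clean_image[y:y + h, x:x + w]

# toy example: a 10x10 stand-in "image", ROI at (x=2, y=3), 4 wide, 5 tall
img = np.arange(100).reshape(10, 10)
roi = crop_roi(img, (2, 3, 4, 5))
print(roi.shape)  # (5, 4) — rows are height, columns are width
```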
Figure 3. Structure of the CNN with fine-tuning. The last few layers are modified to produce two output results.
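A minimal sketch of the modified head described in Fig. 3: the pretrained backbone's feature vector feeds a replacement linear layer with two outputs (benign, malignant) followed by softmax. The 512-dimensional feature size and the weight initialization here are hypothetical, not taken from the paper:

```python
import numpy as np

def two_class_head(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Replacement classification head: one linear layer mapping
    backbone features to two logits, then a numerically stable softmax."""
    logits = features @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feat = rng.standard_normal(512)           # assumed backbone feature size
W = rng.standard_normal((512, 2)) * 0.01  # freshly initialized 2-output layer
b = np.zeros(2)
p = two_class_head(feat, W, b)            # [P(benign), P(malignant)]
print(p.sum())  # probabilities sum to 1
```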
Figure 4. Structure of the classification ensemble. When multiple CNNs were selected, the probability results were collected from each CNN as shown in Fig. 3, and these probabilities were averaged to generate a new probability for the final decision.
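The averaging in Fig. 4 can be sketched as follows (toy probabilities, not study data):

```python
import numpy as np

def ensemble_probability(member_probs: list) -> np.ndarray:
    """Average per-CNN malignancy probabilities into one final score.
    member_probs: list of arrays, one per member CNN, each (n_nodules,)."""
    return np.mean(np.stack(member_probs, axis=0), axis=0)

# two hypothetical member CNNs scoring three nodules
p1 = np.array([0.90, 0.40, 0.55])
p2 = np.array([0.80, 0.20, 0.75])
p_ens = ensemble_probability([p1, p2])
print(p_ens)  # [0.85 0.3  0.65]
```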
Baseline characteristics of the study cohorts.
| Characteristic | SH training set (n = 13,560) | SH test set (n = 634) | SMC set (n = 781) | CBMC set (n = 200) | KUH set (n = 200) |
|---|---|---|---|---|---|
| Age, mean ± SD (years) | 47.4 ± 13.7 | 44.6 ± 13.0 | 47.2 ± 12.9 | 48.7 ± 13.0 | 49.6 ± 13.9 |
| Size, mean ± SD (mm) | 20.3 ± 11.4 | 19.6 ± 12.3 | 23.6 ± 13.4 | 21.2 ± 11.5 | 22.4 ± 11.3 |
| Female | 10,675 (78.7%) | 484 (76.3%) | 571 (73.1%) | 161 (80.5%) | 151 (75.5%) |
| Male | 2,885 (21.3%) | 150 (23.7%) | 210 (26.9%) | 39 (19.5%) | 49 (24.5%) |
| Malignancy | 7,160 (52.8%) | 539 (85.0%) | 538 (68.9%) | 118 (59.0%) | 98 (49.0%) |
| Benign | 6,400 (47.2%) | 95 (15.0%) | 243 (31.1%) | 82 (41.0%) | 102 (51.0%) |
| Papillary cancera | 6,478 (96.5%) | 519 (96.3%) | 405 (75.3%) | 116 (98.3%) | 97 (99.0%) |
| Follicular cancer | 148 (2.2%) | 10 (1.9%) | 126 (23.4%) | 0 | 0 |
| Medullary cancer | 30 (0.4%) | 6 (1.1%) | 3 (0.6%) | 1 (0.8%) | 0 |
| Anaplastic cancer | 20 (0.3%) | 0 | 1 (0.2%) | 1 (0.8%) | 1 (1.0%) |
| Other | 36 (0.5%) | 4 (0.7%) | 3 (0.6%) | 0 | 0 |
aCancer subtype is listed only for surgically confirmed cases.
Diagnostic performances of expert radiologists and CNNE2.
| | SH test set (n = 200) | | | | SMC set (n = 200) | | | | CBMC set (n = 200) | | | | KUH set (n = 200) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Reader 2 | Reader 3 | Averageb | CNNE2 | Reader 3 | Reader 4 | Average | CNNE2 | Reader 1 | Reader 4 | Average | CNNE2 | Reader 1 | Reader 2 | Average | CNNE2 |
| Sensitivity (%)a | 89.2 (83.5–93.1) | 94.0 (89.2–96.7) | 91.6 (87.3–94.5) | 83.7 (77.3–88.6) | 93.7 (88.3–96.7) | 88.7 (82.4–93.0) | 91.2 (86.4–94.4) | 78.2 (70.6–84.2) | 89.0 (82.0–93.5) | 90.7 (84.0–94.8) | 89.8 (84.4–93.5) | 94.1 (88.1–97.2) | 91.8 (84.5–95.9) | 91.8 (84.5–95.9) | 91.8 (85.5–95.5) | 91.8 (84.5–95.9) |
| Specificity (%)a | 67.7 (50.5–81.1) | 50.0 (33.8–66.2) | 58.8 (44.1–72.1) | 91.2 (76.0–97.1) | 39.7 (28.0–52.7) | 56.9 (44.0–68.9) | 48.3 (37.1–59.6) | 93.1 (83.0–97.4) | 67.1 (56.2–76.4) | 45.1 (34.7–56.0) | 56.1 (46.5–65.2) | 62.2 (51.3–72.0) | 60.8 (51.0–69.8) | 71.6 (62.1–79.5) | 66.2 (57.7–73.7) | 59.8 (50.0–68.9) |
| Accuracy (%)a | 85.5 (79.9–89.7) | 86.5 (81.0–90.6) | 86.0 (81.3–89.7) | 85.0 (79.4–89.3) | 78.0 (71.7–83.2) | 79.5 (73.3–84.5) | 78.8 (73.2–83.4) | 82.5 (76.6–87.2) | 80.0 (73.9–85.0) | 72.0 (65.4–77.8) | 76.0 (70.4–80.8) | 81.0 (75.0–85.9) | 76.0 (69.6–81.4) | 81.5 (75.5–86.3) | 78.8 (73.3–83.4) | 75.5 (69.1–81.0) |
| PPV (%)a | 93.1 (87.9–96.1) | 90.2 (84.8–93.8) | 91.6 (86.7–94.8) | 97.9 (93.7–99.3) | 79.2 (72.4–84.7) | 83.4 (76.6–88.6) | 81.2 (74.7–86.3) | 96.5 (91.1–98.7) | 79.6 (71.8–85.6) | 70.4 (62.7–77.1) | 74.7 (67.3–80.8) | 78.2 (70.6–84.2) | 69.2 (60.8–76.6) | 75.6 (67.1–82.5) | 72.3 (64.3–79.1) | 68.7 (60.3–76.1) |
| NPV (%)a | 56.1 (40.8–70.3) | 63.0 (43.8–78.8) | 58.8 (43.8–72.4) | 53.5 (40.7–65.8) | 71.9 (54.2–84.7) | 67.4 (53.2–78.9) | 69.1 (55.4–80.2) | 63.5 (52.8–73.0) | 80.9 (69.8–88.6) | 77.1 (63.2–86.8) | 79.3 (68.9–86.9) | 87.9 (76.8–94.1) | 88.6 (78.8–94.2) | 90.1 (81.5–95.0) | 89.4 (81.3–94.3) | 88.4 (78.5–94.1) |
| AUC | 0.842 (0.771–0.914) | 0.838 (0.762–0.913) | 0.840 (0.806–0.873) | 0.932 (0.885–0.978) | 0.799 (0.734–0.863) | 0.847 (0.793–0.901) | 0.823 (0.706–0.940) | 0.899 (0.858–0.940) | 0.850 (0.798–0.902) | 0.810 (0.754–0.866) | 0.830 (0.752–0.909) | 0.885 (0.839–0.930) | 0.842 (0.790–0.894) | 0.897 (0.855–0.940) | 0.870 (0.752–0.987) | 0.854 (0.800–0.908) |
| F1 | 91.1 | 92.0 | 91.6 | 90.3 | 85.8 | 86.0 | 85.9 | 86.4 | 84.0 | 79.3 | 81.5 | 85.4 | 79.0 | 83.0 | 80.9 | 78.6 |
aTo calculate the diagnostic performances of each cohort, a cut-off value of 0.6 for cancer probability was used for CNNE2 and ACR TI-RADS category 4 was used for readers.
bAverage of the two readers' performances for each test set.
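Footnote a's cutoff rule can be illustrated with a short sketch (toy probabilities and labels, not study data): CNNE2 outputs at or above 0.6 are called malignant, and sensitivity, specificity, and accuracy follow from the resulting confusion counts.

```python
import numpy as np

def binary_metrics(probs, labels, cutoff=0.6):
    """Diagnostic performance at a probability cutoff (0.6 for CNNE2
    per footnote a). labels: 1 = malignant, 0 = benign."""
    pred = (np.asarray(probs) >= cutoff).astype(int)
    labels = np.asarray(labels)
    tp = np.sum((pred == 1) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }

# six hypothetical nodules: three malignant, three benign
m = binary_metrics([0.95, 0.70, 0.30, 0.61, 0.10, 0.55], [1, 1, 1, 0, 0, 0])
print(m)
```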
Figure 5. ROC curves of CNNE2 and expert radiologists for differentiating thyroid nodules. (A) AUC of CNNE2 was significantly higher than that of radiologists in the SH test set (0.932 vs. 0.840, P < 0.001). AUC of CNNE2 was higher than that of radiologists in the SMC set (B) and CBMC set (C) without statistical significance (0.899 vs. 0.823 and 0.885 vs. 0.830; P = 0.113 and 0.126). (D) AUC of radiologists was higher than that of CNNE2 in the KUH set without statistical significance (0.870 vs. 0.854, P = 0.690). (Black: CNNE2; blue: reader average; red and orange: individual readers.)
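The AUCs compared above are empirical areas under ROC curves; a minimal sketch using the Mann–Whitney formulation (toy scores and labels, not study results, and not the paper's statistical software):

```python
def auc_from_scores(scores, labels):
    """Empirical AUC via the Mann–Whitney U statistic: the probability
    that a randomly chosen malignant nodule (label 1) scores higher
    than a randomly chosen benign one (label 0), counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfectly separating scorer reaches AUC = 1.0
print(auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # 1.0
```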