Ozan Oktay1, Jay Nanavati1, Anton Schwaighofer1, David Carter1, Melissa Bristow1, Ryutaro Tanno1, Rajesh Jena1, Gill Barnett1, David Noble2,3, Yvonne Rimmer2, Ben Glocker1, Kenton O'Hara1, Christopher Bishop1, Javier Alvarez-Valle1, Aditya Nori1.
Abstract
Importance: Personalized radiotherapy planning depends on high-quality delineation of target tumors and surrounding organs at risk (OARs). This process puts additional time burdens on oncologists and introduces variability among both experts and institutions.
Objective: To explore clinically acceptable autocontouring solutions that can be integrated into existing workflows and used in different domains of radiotherapy.
Design, Setting, and Participants: This quality improvement study used a multicenter imaging data set comprising 519 pelvic and 242 head and neck computed tomography (CT) scans from 8 distinct clinical sites and patients diagnosed with either prostate or head and neck cancer. The scans were acquired as part of treatment dose planning from patients who received intensity-modulated radiation therapy between October 2013 and February 2020. Fifteen different OARs were manually annotated by expert readers and radiation oncologists. The models were trained on a subset of the data set to automatically delineate OARs and evaluated on both internal and external data sets. Data analysis was conducted from October 2019 to September 2020.
Main Outcomes and Measures: The autocontouring solution was evaluated on external data sets, and its accuracy was quantified with volumetric agreement and surface distance measures. Models were benchmarked against expert annotations in an interobserver variability (IOV) study. Clinical utility was evaluated by measuring time spent on manual corrections and annotations from scratch.
Year: 2020 PMID: 33252691 PMCID: PMC7705593 DOI: 10.1001/jamanetworkopen.2020.27426
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Autosegmentation Performance on 3 Head and Neck Data Sets
Values are Dice score, mean (SD); agreement rows report Fleiss κ.
| Data set | Brainstem | Mandible | Spinal cord | Globe, left | Globe, right | Parotid, left | Parotid, right | SMG, left | SMG, right |
|---|---|---|---|---|---|---|---|---|---|
| IOV-10 | |||||||||
| Annotator 1 | 89.3 (4.2) | 98.6 (1.0) | 92.9 (1.5) | 96.4 (0.9) | 96.5 (1.1) | 92.7 (3.5) | 92.7 (3.5) | 92.3 (3.4) | 92.3 (2.6) |
| Annotator 2 | 91.8 (2.0) | 98.5 (0.5) | 91.8 (2.3) | 95.6 (1.3) | 96.7 (1.1) | 91.1 (4.3) | 91.2 (3.7) | 91.3 (4.7) | 91.3 (5.4) |
| Annotator 3 | 89.6 (2.7) | 96.9 (1.0) | 81.9 (7.3) | 96.5 (0.8) | 95.7 (1.0) | 88.2 (3.8) | 90.1 (2.8) | 91.6 (2.8) | 90.3 (8.0) |
| Ensemble | 88.5 (2.0) | 97.0 (1.0) | 87.7 (3.6) | 94.8 (1.0) | 94.5 (1.9) | 88.5 (2.3) | 87.8 (4.1) | 87.0 (2.9) | 85.1 (5.3) |
| Agreement between annotators, κ | 0.831 | 0.971 | 0.836 | 0.927 | 0.939 | 0.838 | 0.845 | 0.848 | 0.836 |
| Agreement between annotators and model, κ | 0.806 | 0.966 | 0.844 | 0.917 | 0.931 | 0.852 | 0.825 | 0.803 | 0.794 |
| Main data set, ensemble | 85.0 (3.7) | 95.7 (2.3) | 84.0 (3.8) | 92.9 (1.6) | 93.1 (1.5) | 87.9 (3.8) | 87.8 (4.3) | 87.5 (2.3) | 86.7 (3.5) |
| External data set, ensemble | 84.9 (6.8) | 93.8 (2.5) | 80.3 (7.7) | 92.7 (3.6) | 93.3 (1.4) | 84.3 (4.6) | 84.5 (4.3) | 83.3 (9.1) | 78.2 (21.1) |
| External data set, | 79.1 (9.6) | 93.8 (1.6) | 80.0 (7.8) | 91.5 (2.1) | 92.1 (1.9) | 83.2 (5.4) | 84.0 (3.7) | 80.3 (7.8) | 76.0 (16.5) |
| External data set, radiographer | 89.5 (2.2) | 93.9 (2.3) | 84.0 (4.8) | 92.9 (1.9) | 93.0 (1.7) | 86.7 (3.5) | 87.0 (3.1) | 83.3 (19.7) | 74.9 (30.2) |
Abbreviations: IOV, interobserver variability; SMG, submandibular glands.
IOV-10 data set included 10 images. In the IOV study, a subset of the main data set was annotated multiple times by 2 radiation oncologists and a trained reader. The proposed model was then compared against each human expert. The statistical agreement between annotators and model was measured with Fleiss κ values.
Main data set included 20 images.
External data set included 26 images. For the external data set, the reference ground truth contours were delineated by an expert head and neck oncologist, and IOV between clinical experts was measured by comparing the reference contours with those produced by an experienced radiographer.[15]
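The two agreement measures in the table above can be illustrated with a minimal Python sketch. It assumes each contour is available as a flat sequence of 0/1 voxel labels and that Fleiss κ is computed voxel by voxel across raters; the record does not state how the authors arranged their inputs, so the function names `dice_score` and `fleiss_kappa` and the toy masks are hypothetical, and this is not the study's implementation.

```python
from collections import Counter


def dice_score(mask_a, mask_b):
    """Dice overlap, in percent, between two binary masks (flat 0/1 sequences)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 100.0  # assumption: two empty masks count as perfect agreement
    return 100.0 * 2.0 * intersection / total


def fleiss_kappa(ratings):
    """Fleiss kappa, where ratings[i][j] is the label rater j gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)
    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall share of each label.
    p_e = sum(
        (sum(cnt[cat] for cnt in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_e == 1.0:
        return 1.0  # every rater used a single label everywhere
    return (p_bar - p_e) / (1.0 - p_e)


# Toy masks (made up for illustration, not study data).
rater_1 = [0, 1, 1, 1, 0, 0]
rater_2 = [0, 1, 1, 0, 0, 0]
rater_3 = [0, 1, 1, 1, 0, 0]
print(dice_score(rater_1, rater_2))
print(round(fleiss_kappa(list(zip(rater_1, rater_2, rater_3))), 3))
```

In practice the masks would be 3-dimensional arrays, and an array library would replace the explicit loops; the arithmetic stays the same.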
Figure 1. Qualitative Evaluation of Expert and Autogenerated Contours on Head and Neck Computed Tomography Scans
Autosegmentation Performance on 3 Pelvic Data Sets
Values are Dice score, mean (SD), unless the row label or footnotes indicate another measure.
| Data set | Femur, left | Femur, right | Bladder | Rectum | Prostate | SV |
|---|---|---|---|---|---|---|
| IOV-10 | ||||||
| Annotator 1 | 98.79 (0.33) | 98.72 (0.40) | 97.31 (1.54) | 90.42 (5.75) | 89.83 (4.82) | 83.47 (7.65) |
| Annotator 2 | 99.63 (0.12) | 99.63 (0.11) | 98.20 (0.65) | 95.49 (1.90) | 88.66 (6.67) | 82.98 (11.71) |
| Annotator 3 | 99.51 (0.17) | 99.43 (0.17) | 98.10 (0.71) | 91.78 (4.73) | 85.44 (8.26) | 78.02 (13.55) |
| Ensemble | 98.94 (0.34) | 98.92 (0.34) | 97.00 (1.27) | 89.90 (4.13) | 88.05 (1.43) | 81.18 (5.66) |
| Agreement between annotators, κ | 0.985 | 0.984 | 0.962 | 0.864 | 0.787 | 0.685 |
| Agreement between annotators and model, κ | 0.982 | 0.981 | 0.959 | 0.852 | 0.820 | 0.732 |
| Main data set, ensemble | 98.52 (0.50) | 98.50 (0.58) | 95.68 (2.56) | 87.73 (4.03) | 87.17 (3.70) | 80.69 (5.91) |
| External data set, ensemble | 98.04 (1.02) | 98.02 (1.13) | 95.84 (1.82) | 87.03 (3.01) | 86.51 (4.74) | 80.13 (7.00) |
| P value, external vs main data set | .04 | .04 | .10 | .07 | .42 | .91 |
| Main data set, ensemble, MD | 0.25 (0.09) | 0.25 (0.10) | 0.69 (0.20) | 1.71 (0.86) | 1.62 (0.52) | 1.07 (0.41) |
| External data set, ensemble, MD | 0.30 (0.16) | 0.30 (0.18) | 0.81 (0.37) | 2.19 (1.19) | 1.73 (0.58) | 1.19 (0.56) |
| IOV-10 data set, ensemble, MD | 0.15 (0.04) | 0.15 (0.05) | 0.56 (0.20) | 1.48 (0.80) | 1.43 (0.39) | 0.96 (0.36) |
| IOV-10 data set, annotators, MD | 0.10 (0.07) | 0.11 (0.07) | 0.40 (0.19) | 1.03 (1.01) | 1.41 (0.91) | 1.07 (0.88) |
| Main data set, ensemble, HD | 1.20 (0.22) | 1.19 (0.25) | 2.42 (0.67) | 7.57 (5.54) | 4.32 (1.77) | 3.71 (1.47) |
| External data set, ensemble, HD | 1.40 (0.56) | 1.39 (0.71) | 2.86 (0.78) | 8.96 (6.71) | 5.06 (2.09) | 4.29 (2.35) |
| IOV-10 data set, ensemble, HD | 1.03 (0.12) | 1.02 (0.12) | 2.85 (0.62) | 6.64 (3.89) | 4.07 (1.04) | 3.64 (1.59) |
| IOV-10 data set, annotators, HD | 0.74 (0.46) | 0.72 (0.49) | 2.45 (0.95) | 6.30 (5.16) | 5.27 (2.74) | 5.24 (3.36) |
Abbreviations: HD, Hausdorff distance; IOV, interobserver variability; MD, mean surface distance; SV, seminal vesicles.
IOV-10 data set included 10 images. In the IOV study, the results for each annotator are reported separately to show the distribution and how they compared with autogenerated contours. The statistical agreement between annotators and model was measured with Fleiss κ values.
Main data set included 49 images.
External data set included 83 images.
The statistical significance of differences between Dice scores on external and main data sets was assessed with the Mann-Whitney test.
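The Mann-Whitney comparison named in the footnote above can be sketched as follows. This is a minimal pure-Python version that assumes a two-sided test with the normal approximation and a tie correction; the record does not say which variant (exact or approximate, with or without continuity correction) the authors used, and the function name `mann_whitney_u` and the sample values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u_x, n1 * n2 - u_x)
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


# Made-up per-scan Dice scores, for illustration only (not study data).
main_scores = [87.1, 88.4, 86.9, 89.0, 87.7]
external_scores = [86.2, 85.9, 87.3, 84.8, 86.6]
u_stat, p_value = mann_whitney_u(main_scores, external_scores)
print(u_stat, p_value)
```

For small samples an exact test is usually preferable to the normal approximation; SciPy's `scipy.stats.mannwhitneyu` provides both.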
Figure 2. Interexpert Variability in Prostate Contour Annotations
CT indicates computed tomography.
Figure 3. Integration of the Proposed Segmentation Models Into Radiotherapy Planning Workflow