| Literature DB >> 33102623 |
Joseph Ross Mitchell1, Konstantinos Kamnitsas2, Kyle W Singleton3, Scott A Whitmire3, Kamala R Clark-Swanson3, Sara Ranjbar3, Cassandra R Rickertsen3, Sandra K Johnston3,4, Kathleen M Egan5, Dana E Rollison5, John Arrington6, Karl N Krecke7, Theodore J Passe7, Jared T Verdoorn7, Alex A Nagelschneider7, Carrie M Carr7, John D Port7, Alice Patton7, Norbert G Campeau7, Greta B Liebo7, Laurence J Eckel7, Christopher P Wood7, Christopher H Hunt7, Prasanna Vibhute7, Kent D Nelson7, Joseph M Hoxworth7, Ameet C Patel7, Brian W Chong7, Jeffrey S Ross7, Jerrold L Boxerman8, Michael A Vogelbaum9, Leland S Hu3,7, Ben Glocker2, Kristin R Swanson3,10.
Abstract
Purpose: Deep learning (DL) algorithms have shown promising results for brain tumor segmentation in MRI. However, validation is required prior to routine clinical use. We report the first randomized and blinded comparison of DL and trained-technician segmentations. Approach: We compiled a multi-institutional database of 741 pretreatment MRI exams. Each contained a postcontrast T1-weighted scan, a T2-weighted fluid-attenuated inversion recovery scan, and at least one technician-derived tumor segmentation. The database included 729 unique patients (470 males and 259 females). Of these exams, 641 were used to train the DL system and 100 were reserved for testing. We developed a platform to enable qualitative, blinded, controlled assessment of lesion segmentations made by technicians and the DL method. On this platform, 20 neuroradiologists performed 400 side-by-side comparisons of segmentations on 100 test cases, scoring each segmentation between 0 (poor) and 10 (perfect). Agreement between technician and DL segmentations was also evaluated quantitatively using the Dice coefficient, which produces values between 0 (no overlap) and 1 (perfect overlap).
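The Dice coefficient mentioned in the abstract can be computed directly from two binary masks; a minimal NumPy sketch with toy 1-D masks (illustrative only, not study data):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy "technician" and "DL" segmentations over five voxels
tech = np.array([0, 1, 1, 1, 0])
dl   = np.array([0, 0, 1, 1, 1])
print(round(dice(tech, dl), 3))  # 2*2 / (3+3) = 0.667
```

Identical masks give 1.0, disjoint non-empty masks give 0.0, matching the range the abstract describes.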
Keywords: brain tumors; deep learning; observer studies; segmentation; validation
Year: 2020 PMID: 33102623 PMCID: PMC7567400 DOI: 10.1117/1.JMI.7.5.055501
Source DB: PubMed Journal: J Med Imaging (Bellingham) ISSN: 2329-4302
Fig. 1 Our segmentation review software running on AWS AppStream 2.0. AppStream allows the developer to run Windows in a virtual machine on AWS and display the output to a remote instance of Google Chrome. Any application that can be installed in Windows can be installed in the virtual machine. We developed our own application in Python 3.6 and Qt 5. The program launched two instances of Insight Segmentation and Registration Toolkit (ITK)-SNAP (top-left and top-right windows) to display an MRI exam from the test set along with the manual technician and automatic DL tumor segmentations (red overlays). The display order is randomized, and the viewer is blinded to the source of each segmentation. Lesion A is always displayed in the top-left window and lesion B in the top right. The viewer can zoom in and out and move the cursor (crosshairs) to any location in the MRI volume. The two ITK-SNAP instances are synchronized so that they always show the same location. The bottom window provides slider widgets that allow the viewer to quickly and easily score the quality of each segmentation, along with widgets for moving forward or backward through the exams in their assigned group.
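The blinded, randomized panel assignment described in the caption can be sketched as follows; this is a hypothetical illustration (function and key names are assumptions, not the authors' implementation):

```python
import random

def assign_panels(exam_ids, seed=0):
    """For each exam, randomly assign the technician and DL segmentations
    to display panels 'A' (left) and 'B' (right), keeping a private key so
    scores can be unblinded later. Hypothetical sketch only."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    key = {}
    for exam in exam_ids:
        sources = ["technician", "deep_learning"]
        rng.shuffle(sources)  # viewer never sees which source landed where
        key[exam] = {"A": sources[0], "B": sources[1]}
    return key

key = assign_panels(["exam_001", "exam_002", "exam_003"])
```

The reviewer scores panels A and B; only the held-back key links those scores to their sources.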
Primary sources for the exams processed in this study. In total, eight North American academic cancer centers, three public domain datasets, and one consortium dataset contributed exams. “Study source” indicates the origin of the MRI exams. “N” indicates the number of exams contributed. “Age” is the mean age (±standard deviation) of the patients when the exam was obtained. “M/F (not specified)” indicates the number of male (M) and female (F) patients in the group; the number of patients whose sex was not specified is given in brackets. “Study dates” lists the range of years over which the exams were acquired, with the median year in brackets. The last row provides summary values for the entire cohort.
| | Study source | N | Age | M/F (not specified) | Study dates |
|---|---|---|---|---|---|
| 1 | Cancer centers (8) | 525 | — | 338/187 | 2000 to 2016 (2008) |
| 2 | TCGA-GBM | 101 | — | 63/38 | 1996 to 2008 (2001) |
| 3 | TCIA | 85 | — | 33/24 (28) | 1990 to 2005 (1994) |
| 4 | Ivy GAP | 18 | — | 7/11 | 1996 to 2000 (1997) |
| 5 | Radiation therapy oncology group | 12 | — | 10/2 | 2009 to 2011 (2010) |
| | Overall | 741 | — | 451/262 (28) | 1990 to 2016 (2006) |
Ivy GAP: Ivy Glioblastoma Atlas Project.
The different types of brain tumors and their frequencies, as reported in the patient cohort.
| | Tumor type | N | % |
|---|---|---|---|
| | Glioblastomas | 463 | 62.5 |
| 1 | Glioblastoma multiforme | 449 | — |
| 2 | Glioblastoma multiforme with oligodendroglial component | 7 | — |
| 3 | Giant cell glioblastoma | 4 | — |
| 4 | Glioblastoma multiforme, small cell type | 2 | — |
| 5 | Glioblastoma multiforme with sarcomatous differentiation | 1 | — |
| | Astrocytomas | 77 | 10.4 |
| 6 | Astrocytoma | 38 | — |
| 7 | Anaplastic astrocytoma | 28 | — |
| 8 | Diffuse astrocytoma | 7 | — |
| 9 | Infiltrating fibrillary astrocytoma | 2 | — |
| 10 | Gemistocytic astrocytoma | 1 | — |
| 11 | Pleomorphic xanthoastrocytoma | 1 | — |
| | Oligodendrogliomas | 37 | 5.0 |
| 12 | Oligodendroglioma | 27 | — |
| 13 | Anaplastic oligodendroglioma | 10 | — |
| | Mixed and other | 19 | 2.5 |
| 14 | Anaplastic oligoastrocytoma | 9 | — |
| 15 | Gliosarcoma | 5 | — |
| 16 | Oligoastrocytoma | 2 | — |
| 17 | Ganglioglioma | 1 | — |
| 18 | Diffuse pontine intrinsic glioma | 1 | — |
| 19 | Low-grade glioma | 1 | — |
| | Not specified | 145 | 19.6 |
| | Total | 741 | 100 |
The whole-tumor mean and median Dice coefficient, recall, and precision over the 100 test cases. All values range from 0 to 1, with higher values indicating better performance.
| | Dice | Recall | Precision |
|---|---|---|---|
| Mean | 0.87 | 0.87 | 0.88 |
| Median | 0.90 | 0.91 | 0.90 |
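The three agreement metrics in the table are all derived from the same voxel-overlap counts; a minimal sketch on toy binary masks (array values are illustrative, not study data):

```python
import numpy as np

def overlap_metrics(truth, pred):
    """Whole-tumor Dice, recall, and precision from two binary masks.
    `truth` is the reference (technician) mask, `pred` the DL mask."""
    truth, pred = truth.astype(bool), pred.astype(bool)
    tp = np.logical_and(truth, pred).sum()    # voxels both call tumor
    fp = np.logical_and(~truth, pred).sum()   # predicted-only voxels
    fn = np.logical_and(truth, ~pred).sum()   # missed tumor voxels
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

truth = np.array([1, 1, 1, 1, 0, 0])
pred  = np.array([1, 1, 1, 0, 1, 0])
m = overlap_metrics(truth, pred)  # dice = recall = precision = 0.75 here
```

Dice is the harmonic mean of recall and precision, which is why the three columns in the table track each other closely.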
Fig. 2 The two test exams with the lowest Dice coefficients (poorest agreement between the technician- and DeepMedic-defined tumor regions) among the 100 test exams. Tumor segmentations are indicated in red. The mean neuroradiologist score for each exam is displayed in the top-left corner of the axial view. Exam A (top two rows) had the lowest Dice coefficient among the 100 test exams. The disagreement between the two segmentation sources occurred primarily in the periventricular regions, where the technician labeled hyperintense regions as tumor while DeepMedic did not. Periventricular hyperintensities are linked to small blood vessel disease and increased risk of stroke and dementia, and their prevalence increases with age in the general population; however, they are typically not associated with neoplasia. Exam B (bottom two rows) was tied with another exam (not shown) for the second-lowest Dice coefficient. The disagreement in exam B was widespread, and both segmentations missed areas of enhancement in the T1c scan.
Fig. 3 The distribution of technician-measured tumor volumes and their relationship to Dice coefficients. (a) The distribution of tumor volumes, which ranged from 5.07 to 300.84 ml with a mean of 88.98 ml; the median technician-measured tumor volume was 78.20 ml. (b) Linear regression (blue line) between Dice coefficient and technician-measured tumor volume. The fit suggests a slight increase in Dice coefficient with increasing lesion volume; however, this relationship is weak. The shaded blue region indicates the 95% confidence interval of the linear regression.
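The volume-versus-Dice fit in panel (b) is an ordinary least-squares line; a sketch with NumPy using made-up (volume, Dice) pairs (illustrative only, not the study's measurements):

```python
import numpy as np

# Hypothetical (volume in ml, Dice) pairs for illustration
vol    = np.array([10.0, 40.0, 80.0, 120.0, 200.0, 300.0])
dice_v = np.array([0.82, 0.85, 0.88, 0.86, 0.90, 0.91])

# Degree-1 polynomial fit = ordinary least-squares regression line
slope, intercept = np.polyfit(vol, dice_v, 1)
pred = slope * vol + intercept

# R^2 quantifies how much of the Dice variance volume explains;
# a value near 0 would indicate a weak relationship, as in the caption
ss_res = np.sum((dice_v - pred) ** 2)
ss_tot = np.sum((dice_v - dice_v.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

A positive slope with a modest R² reproduces the qualitative pattern the caption describes.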
Fig. 4 The distribution of scores for the manual technician and automatic DL segmentations on the test exams. Twenty neuroradiologists each performed 20 blinded and randomized side-by-side comparisons of the technician and DL segmentations on the 100 test exams. Scores ranged from 0 (no overlap with the MRI-visible tumor) to 10 (perfect match with the MRI-visible tumor). The technician and DL segmentations had median scores of 7 and 8, respectively. The magnitude of the difference in mean scores was 0.34, which was significantly different from 0 (two-sided test). Additional details are provided in the text.
Fig. 5 The two test exams with the largest differences (deltas) between the neuroradiologists' mean scores for the technician and DeepMedic segmentations. Tumor segmentations are indicated in red. The mean neuroradiologist score for each exam is displayed in the top-left corner of the axial view. Delta is defined as the difference between the mean scores for the two segmentation sources. Exam C (top two rows) had the largest score difference in favor of the DeepMedic segmentation; the technician did not label the enhancing core of the tumor in exam C. Exam D (bottom two rows) had the largest score difference in favor of the technician segmentation; DeepMedic did not label extensive regions of enhancement in the T1c scan in exam D.
Fig. 6 Example output from our DL system for automatic brain tumor segmentation. The system loads an MRI exam containing a T1-weighted postcontrast scan (T1c) and a FLAIR scan, and accepts input from a wide range of MRI scanners with varying scan parameters. We designed the system to perform the following steps automatically, without additional input: (1) enhance the MRI scans to remove artifacts; (2) identify the brain within the MRI scan (strip the skull), even in the presence of significant pathology or surgical interventions; (3) segment the brain tumor; and (4) coregister the Harvard-Oxford probabilistic atlas to the brain. The last step is optional and used for visualization purposes. In this image, the tumor is red; other colors indicate various atlas regions. The top and bottom rows show 3D and 2D views of the output data, respectively. Several atlas regions in the vicinity of the tumor have been made transparent in the 3D view to aid tumor visualization.
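The four automatic steps in the caption form a fixed pipeline; a framework-agnostic sketch where every stage is a caller-supplied function (all names are hypothetical, and the real system implements these stages with trained DL models and registration tools):

```python
def segment_exam(t1c, flair, enhance, skull_strip, segment, register_atlas=None):
    """Run the caption's four steps in order on one MRI exam.
    Each stage is passed in as a callable so the sketch stays generic."""
    t1c, flair = enhance(t1c), enhance(flair)            # (1) artifact reduction
    brain_mask = skull_strip(t1c)                        # (2) skull stripping
    tumor_mask = segment(t1c, flair, brain_mask)         # (3) tumor segmentation
    atlas = register_atlas(t1c) if register_atlas else None  # (4) optional atlas
    return tumor_mask, atlas

# Trivial stand-in stages to show the call pattern
result = segment_exam(
    "t1c_volume", "flair_volume",
    enhance=lambda vol: vol,
    skull_strip=lambda vol: "brain_mask",
    segment=lambda t1c, flair, mask: "tumor_mask",
)
```

Keeping the atlas step optional mirrors the caption: steps (1)–(3) produce the segmentation, and step (4) only serves visualization.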
The agreement between manual and DL tumor segmentation, expressed as the mean or median Dice coefficient over the test set for multiple neural nets. The Dice coefficients for the Heidelberg datasets are for contrast-enhancing tumor regions. Dice coefficients for all other entries are for whole-tumor segmentation. “MRI series” is the number of series required as input. “Val. Set Size” refers to the validation set size. The first three deep nets were the top-scoring solutions for the multimodal BraTS challenge from 2017. Networks 4 through 7 were the top-scoring solutions from BraTS 2018. The Heidelberg solution was trained using a fivefold cross-validation on 455 exams, i.e., the dataset was divided into five groups of 91 exams each. In each fold, four of these groups (364 exams) were used for training and one group (91 exams) was used for validation. The resulting five deep neural networks were then used as an ensemble to segment a separate sequence of 239 exams from the same institution. Then, the Heidelberg ensemble was used to segment 2034 exams acquired from 38 institutions as part of a clinical trial (EORTC). DeepMedic is our ensemble of five networks applied to 100 of our test studies. Additional details are provided in the text.
| | Neural network | Dataset | MRI series | Ensemble size | Training set size | Val. set size | Test set size | Test median Dice | Test mean Dice |
|---|---|---|---|---|---|---|---|---|---|
| 1 | EMMA | BraTS 2017 | 4 | 21 | 285 | 46 | 146 | N/A | 0.88 |
| 2 | Cascaded CNNs | BraTS 2017 | 4 | 9 | 285 | 46 | 146 | N/A | 0.87 |
| 3 | Brain Tumor U-Net | BraTS 2017 | 4 | 15 | 285 | 46 | 146 | N/A | 0.86 |
| 4 | NVDLMED | BraTS 2018 | 4 | 10 | 285 | 66 | 191 | 0.92 | 0.88 |
| 5 | MIC-DKFZ | BraTS 2018 | 4 | 10 | 285 | 66 | 191 | 0.92 | 0.88 |
| 6 | DeepSCAN | BraTS 2018 | 4 | 12 | 285 | 66 | 191 | 0.92 | 0.89 |
| 7 | DL_86_61 | BraTS 2018 | 4 | 7 | 285 | 66 | 191 | 0.92 | 0.88 |
| 8 | Heidelberg | Heidelberg EORTC | 4 | 5 | 364 | 91 | 2273 | 0.89 to 0.91 | N/A |
| 9 | DeepMedic | This study | 2 | 5 | 641 | 0 | 100 | 0.90 | 0.87 |
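Row 9's "ensemble of five networks" implies fusing per-voxel predictions from several models; a minimal sketch using simple mean-then-threshold fusion (an assumption for illustration, which may differ from the actual fusion rule):

```python
import numpy as np

def ensemble_segment(prob_maps, threshold=0.5):
    """Fuse per-voxel tumor probabilities from several models by
    averaging, then threshold to a binary tumor mask."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)

# Three toy 3-voxel probability maps standing in for five 3-D volumes
maps = [np.array([0.9, 0.2, 0.6]),
        np.array([0.8, 0.4, 0.4]),
        np.array([0.7, 0.1, 0.55])]
mask = ensemble_segment(maps)  # mean probs [0.8, 0.233, 0.517] -> [1, 0, 1]
```

Averaging before thresholding lets models that disagree on a voxel (as in the third position above) be settled by the majority signal rather than any single network.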
Fig. 7 Boxplots showing the distribution of radiologists’ score differences by test group. The R numbers correspond to individual radiologists. For example, R01 refers to radiologist #1. Each row of plots corresponds to a specific group of 20 test exams. Thus, radiologists R01 through R04 all scored the same 20 exams in group 1. The score difference is defined as the radiologist score for the technician segmentation minus the radiologist score for the DL segmentation. Negative values indicate that the DL segmentation was assigned a higher (better) score than the technician segmentation. Each box shows the range of data values between the first and third quartiles. The horizontal line within each box indicates the median value. The whiskers indicate the range of values. Outliers are indicated by small circles beyond the whiskers. Variability between radiologists, both within and between groups, is evident as differing box sizes and whisker lengths.