Inter-rater agreement in glioma segmentations on longitudinal MRI.

M Visser¹, D M J Müller², R J M van Duijn³, M Smits⁴, N Verburg², E J Hendriks³, R J A Nabuurs², J C J Bot³, R S Eijgelaar⁵, M Witte⁵, M B van Herk⁶, F Barkhof⁷, P C de Witt Hamer⁸, J C de Munck³.

Abstract

BACKGROUND: Tumor segmentation of glioma on MRI is a technique to monitor, quantify and report disease progression. Manual MRI segmentation is the gold standard but very labor intensive. At present the quality of this gold standard is not known for different stages of the disease, and prior work has mainly focused on treatment-naive glioblastoma. In this paper we studied the inter-rater agreement of manual MRI segmentation of glioblastoma and WHO grade II-III glioma for novices and experts at three stages of disease. We also studied the impact of inter-observer variation on extent of resection and growth rate.
METHODS: In 20 patients with WHO grade IV glioblastoma and 20 patients with WHO grade II-III glioma (defined as non-glioblastoma) both the enhancing and non-enhancing tumor elements were segmented on MRI, using specialized software, by four novices and four experts before surgery, after surgery and at time of tumor progression. We used the generalized conformity index (GCI) and the intra-class correlation coefficient (ICC) of tumor volume as main outcome measures for inter-rater agreement.
RESULTS: For glioblastoma, segmentations by experts and novices were comparable. The inter-rater agreement of enhancing tumor elements was excellent before surgery (GCI 0.79, ICC 0.99), poor after surgery (GCI 0.32, ICC 0.92), and good at progression (GCI 0.65, ICC 0.91). For non-glioblastoma, the inter-rater agreement was generally higher between experts than between novices. The inter-rater agreement between experts was excellent before surgery (GCI 0.77, ICC 0.92), reasonable after surgery (GCI 0.48, ICC 0.84), and good at progression (GCI 0.60, ICC 0.80). The inter-rater agreement between novices was good before surgery (GCI 0.66, ICC 0.73), poor after surgery (GCI 0.33, ICC 0.55), and poor at progression (GCI 0.36, ICC 0.73). Further analysis showed that the lower inter-rater agreement of segmentation on postoperative MRI could only partly be explained by the smaller volumes and fragmentation of residual tumor. The median interquartile range of extent of resection between raters was 8.3% and of growth rate was 0.22 mm/year.
CONCLUSION: Manual tumor segmentations on MRI have reasonable agreement for use in spatial and volumetric analysis. Agreement in spatial overlap is of concern with segmentation after surgery for glioblastoma and with segmentation of non-glioblastoma by non-experts.

Keywords: Glioblastoma; Glioma; Inter-rater agreement; Low-grade glioma; MRI; Manual segmentation

Year: 2019 PMID: 30825711 PMCID: PMC6396436 DOI: 10.1016/j.nicl.2019.101727

Source DB: PubMed Journal: Neuroimage Clin ISSN: 2213-1582 Impact factor: 4.881

Introduction

Glioma is the most common primary brain tumor in adults (Crocetti et al., 2012; Ostrom et al., 2017). Gliomas are classified by histological type and malignancy grade (Louis et al., 2007). Despite surgical resection, radiotherapy and chemotherapy, the survival of glioma patients is limited, with a two-year survival of 15% for glioblastoma (WHO grade IV) and 85% for diffuse low-grade glioma and a ten-year survival of 2% and 58% respectively (Ostrom et al., 2017). Although glioma segmentation on MRI is not generally considered to be part of standard care, it is useful in clinical practice for documentation, prediction of survival, treatment planning, assessment of quality of care, and treatment response measurement. For diagnosis and surgical planning, several MRI sequences are typically applied to assess tumor location and extent (T1-, T2- and T2-FLAIR-weighted images) and the integrity of the blood brain barrier (T1-weighted images after administration of a gadolinium-based contrast agent). Tumors are segmented on pre- and postoperative MR scans for volumetric analysis and calculation of the extent of resection (EOR). The EOR is an important predictor of survival for gliomas (Brown et al., 2016; Lacroix et al., 2001; Sanai and Berger, 2018). Tumor segmentation is standard practice for planning of radiotherapy. During radiological follow-up the tumor volume is monitored, and timing of second line treatment is based on tumor growth. Quantitation of MRI tumor volumes has proven to be valuable for studying autonomous growth (Gui et al., 2018; Mandonnet et al., 2013; Mandonnet et al., 2008), quantification of the effects of (pharmacological) interventions (Ben Abdallah et al., 2018b; Mandonnet et al., 2010; Pallud et al., 2012a; Pallud et al., 2012b), and statistical maps of care (De Witt Hamer et al., 2013; Mandonnet et al., 2007) and disease mechanisms (Amelot et al., 2017; Ellingson et al., 2013; Wang et al., 2014). 
These examples show that examination of glioma on MRI by human experts is important. Ideally, observer variation should be small. Theoretically, this variation could be reduced or eliminated by semi-automated or completely automated MRI segmentation algorithms, and such algorithms are being developed (Cordova et al., 2014; Gooya et al., 2012; Meier et al., 2016; Menze et al., 2015; Porz et al., 2016, 2014; Zaouche et al., 2018). To date the work on automatic segmentation has primarily focused on the segmentation of preoperative MR scans of patients with glioblastoma. However, the most recent BRATS tumor segmentation benchmarking challenges have put automatic detection of tumor volume change on follow-up MR scans on the agenda (Crimi et al., 2018, 2016). Manual segmentation by experts is still considered to be the gold standard and therefore required for quantitative interpretation of MR images and for the validation of automated segmentation algorithms. Reproducibility of manual segmentations has been investigated previously by others (Ben Abdallah et al., 2018a, 2016; Bø et al., 2017; Cattaneo et al., 2005; Gutman et al., 2013; Huber et al., 2015; Kleesiek et al., 2016; Kubben et al., 2010; Provenzale et al., 2009; Provenzale and Mancini, 2012; Sorensen et al., 2001; Weltens et al., 2001). Most of this work was focused on manual segmentation of preoperative MRI in glioblastoma, although a few of these studies consider longitudinal data (Huber et al., 2015; Kleesiek et al., 2016; Kubben et al., 2010; Meier et al., 2016). Two studies (Ben Abdallah et al., 2016; Huber et al., 2015) have addressed the issue of required level of expertise, albeit for preoperative MRI. Both these studies indicated no significant influence of either clinical expertise, or the years of experience on the reproducibility of the segmentations. 
Since manual segmentation of 3D MRI is labor intensive, even when semi-automated methods are used, many studies are based on a limited number of included scans and raters. Finally, most studies address segmentation of glioblastoma and relatively few studies address lower grade gliomas, although in more recent studies lower grade glioma segmentation is being studied as well (Ben Abdallah et al., 2016; Bø et al., 2017). In this study, we aim to establish the reproducibility of manual raters in the case of glioma segmentation on MRI, and the impact on extent of resection and growth rate measurements. We will therefore analyze the reproducibility of glioma segmentations at three MRI scan time points by eight raters with two levels of expertise for glioblastoma and non-glioblastomas.

Methods

Patients

Patients were randomly selected from a cohort treated at the Neurosurgical Center Amsterdam of the VU medical center (Amsterdam, The Netherlands) between 2009 and 2013 with standard T2-FLAIR-, T2- and T1-weighted images before and after contrast agent administration. All series were obtained at 3 time points: 1) preoperative, i.e. before first-time resective surgery, 2) postoperative, and 3) at disease progression. For interpretation of post-surgical ischemia, diffusion-weighted imaging on MRI after surgery was included as well. MR data from 20 patients with histopathologically confirmed WHO grade IV glioblastoma and from 20 patients with WHO grade II-III glioma were included. All 20 gadolinium-enhancing gliomas had a histopathological diagnosis of glioblastoma WHO grade IV. Of the 20 non-enhancing gliomas, 12 were astrocytoma WHO grade II, four oligodendroglioma WHO grade II, three oligoastrocytoma WHO grade II, and one anaplastic astrocytoma WHO grade III; we refer to these as non-glioblastomas. The preoperative MRI was made on average within one week before resection. The MRI after surgery was made within 72 h after resection for glioblastomas and on average at four months after resection for non-glioblastomas. The MRI at progression was the scan that demonstrated the first tumor progression according to tumor board meeting consensus. The institutional review board at the VU medical center Amsterdam approved this study (case nr. 2014.336), after which the data were gathered retrospectively from the clinical workflow. All patients provided written informed consent for use of their clinical data for medical research. The imaging was analyzed after anonymization in accordance with the Personal Data Protection Act.

MR-imaging

Imaging was performed on a variety of systems (Siemens, model Sonata or Avanto; GE medical systems, model Signa HDxt or DISCOVERY MR750; Toshiba, model Titan3T; Philips, model Panorama HFO or Ingenuity) with a field strength of 1 T (1% of all scans), 1.5 T (62% of all scans) or 3 T (37% of all scans). The standardized protocol included non-enhanced axial T1-weighted spin echo images [repetition time/echo time (TR/TE) 520–600/8–12 ms] with 5-mm slice thickness and axial T2-weighted turbo spin echo images (TR/TE 5190–8670/93–101 ms) with 5-mm slice thickness. Sagittal 3D turbo fluid-attenuated inversion-recovery (FLAIR) images [repetition time/echo time/inversion time (TR/TE/TI) 6500/355/2200 ms] with 1.3-mm slice thickness and axial single shot spin echo echo-planar diffusion-weighted (DWI) images (TR/TE 3400/122 ms) with 5-mm slice thickness were also acquired. Diffusion gradients were applied along three orthogonal directions using b-values of 0, 500 and 1000 s/mm². Apparent diffusion coefficient (ADC) maps were calculated from the DWI images. Post-contrast (0.2 mmol/kg) sagittal 3D T1-weighted MPRAGE gradient-echo (T1c) images (TR/TE/TI 2300-2700/5-4.5/950 ms) with 1- to 1.5-mm slice thickness were obtained. All the DICOM images of the preoperative MRI, the postoperative MRI and the MRI at progression were loaded in the Elements environment (BrainLab™ GmbH, Feldkirchen, Germany) and were rigidly registered to the post-contrast T1-weighted MRI per time point using the Image Fusion tool to facilitate visual comparison of scans. For non-glioblastomas, both immediate and late postoperative MRI were available to raters to discern regions of postsurgical diffusion restriction from residual tumor.

Manual segmentation

Four experts and four novices acted as raters and segmented each glioma at each time point. The experts consisted of three neuro-radiologists (E1, E2 and E3) and one neurosurgeon (E4) with 8, 20, 18, and 20 years of clinical experience, respectively. The novices consisted of three neurosurgical residents (N1, N3 and N4), and one neuro-radiology resident (N2) with 1, 5, 3, and 3 years of clinical experience. Raters were blinded to the histopathological diagnosis and clinical follow-up of patients. Raters were asked to delineate both the non-enhancing and the contrast-enhancing tumor elements, if present, for all three MRI time points in each of 40 patients. To facilitate MRI interpretation, raters were acquainted with the VASARI criteria (Visually AcceSAble Rembrandt Images, as proposed by The Cancer Imaging Archive (Clark et al., 2013)), but no segmentation rules were imposed. The raters were asked to segment the enhancing tumor elements on post-contrast T1-weighted images and to include enclosed necrosis or cysts. Furthermore, they were requested to segment the non-enhancing tumor elements on T2/FLAIR-weighted images. A volume of zero was assigned when a rater determined absence of enhancing or non-enhancing elements. Segmentations were made with the semi-automatic SmartBrush tool (Elements©, BrainLab™ GmbH, Feldkirchen, Germany) approved for use in clinical practice. Raters were instructed in the use of the software and practiced their skills with MRI sets for preoperative, postoperative and progression time points from two test patients, one contrast-enhancing case and one non-enhancing case. Afterwards they received feedback on their use of the software and on the requirements for the segmentations. From this point on no further feedback was provided. Raters were blinded to the segmentations of the co-raters and received the 40 MRI sets in identical order. The order of the MRI sets was randomized to ensure mixing of glioma grades.

Statistical analysis

First, we evaluated the agreement in the detection of any enhancing or non-enhancing tumor tissue between expert and novice raters using bar plots, as raters may not necessarily agree on tumor presence. Second, we determined the inter-rater agreement in volume measurements derived from the segmentations using the intra-class correlation coefficient (ICC) (McGraw and Wong, 1996; Shrout and Fleiss, 1979). The specific ICC model used to quantify the inter-rater agreement on volume is the ICC(A,1) from McGraw and Wong (1996). ICC scores below 0.4 were considered as poor agreement, 0.4–0.6 as reasonable, 0.6–0.7 as good, and 0.7–1 as excellent (Bartko, 1991; Cicchetti, 1994). Third, we determined the inter-rater agreement in spatial overlap using the generalized conformity index (GCI) (Kouwenhoven et al., 2009), which quantifies the spatial overlap among multiple spatial objects. This is a mathematical generalization of the well-known Jaccard score, which quantifies the overlap of two volumes as the ratio between the volume of their intersection and the volume of their union. When the segmented set by rater j is indicated as $A_j$ and its volume by $\mathrm{Vol}(A_j)$, the GCI is expressed as:

$$\mathrm{GCI} = \frac{\sum_{\text{pairs } i<j} \mathrm{Vol}(A_i \cap A_j)}{\sum_{\text{pairs } i<j} \mathrm{Vol}(A_i \cup A_j)}$$

where $\sum_{\text{pairs } i<j}$ indicates summation over all combinations of unique pairs of raters. For two raters the GCI equals the Jaccard score, GCI = Vol(A1 ∩ A2)/Vol(A1 ∪ A2). The GCI was calculated separately for experts and novices, for each MRI time point of every patient. Raters who detected no tumor in a patient, i.e. a volume of zero, were omitted from the GCI calculation for that patient. A GCI of zero denotes no spatial overlap at all and a GCI of one denotes complete spatial overlap among raters. Scores of 0.7–1.0 are regarded as excellent (Bartko, 1991; Zijdenbos et al., 1994). The distributions of spatial overlap scores were visualized in scatter plots and boxplots.
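As an illustration of the GCI definition above, the following is a minimal sketch and not the authors' implementation. It assumes that each rater's segmentation is available as a Python set of voxel coordinates on a common grid, so that volumes are proportional to voxel counts and the voxel size cancels in the ratio. The function name `gci` and the choice to return `None` when fewer than two raters detected tumor are assumptions made for this example.

```python
from itertools import combinations


def gci(segmentations):
    """Generalized conformity index for a list of voxel-coordinate sets.

    Hypothetical helper: each element is the set of voxels one rater marked
    as tumor. Raters with an empty segmentation are omitted, mirroring the
    rule described in the text.
    """
    nonempty = [seg for seg in segmentations if seg]
    if len(nonempty) < 2:
        # The index is undefined when fewer than two raters detected tumor.
        return None
    intersection_total = 0
    union_total = 0
    # Sum intersection and union sizes over all unique pairs of raters.
    for a, b in combinations(nonempty, 2):
        intersection_total += len(a & b)
        union_total += len(a | b)
    return intersection_total / union_total


if __name__ == "__main__":
    rater_1 = {(0, 0, 0), (0, 0, 1)}
    rater_2 = {(0, 0, 1), (0, 0, 2)}
    rater_3 = {(0, 0, 1)}
    print(gci([rater_1, rater_2]))           # two raters: plain Jaccard score
    print(gci([rater_1, rater_2, rater_3]))  # three raters: pairwise sums
```

With only two raters the function reduces to the Jaccard score, which is a convenient sanity check for any implementation of the index.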
Differences in distributions between experts and novices were tested using the Fisher-Pitman permutation test (Ludbrook and Dudley, 1998). Fourth, to evaluate when expert knowledge is required, we also determined the Jaccard indices between expert consensus and novice consensus segmentations. Majority voting over multiple raters is a well-established method to obtain a consensus segmentation that is a better ground truth than a single rater's segmentation (Kittler et al., 1996). For these consensus segmentations, a voxel-wise majority vote of at least two of four raters was used. Fifth, to evaluate the impact on clinical volumetric analysis, we calculated the extent of resection based on the pre- and postoperative MRI and the growth rate based on the postoperative and progression MRI for each rater. The extent of resection (EOR) was based on volumes of enhancing elements for glioblastoma and on volumes of non-enhancing elements for non-glioblastoma:

$$\mathrm{EOR} = \frac{V_{\mathrm{pre}} - V_{\mathrm{post}}}{V_{\mathrm{pre}}} \times 100\%$$

where $V_{\mathrm{pre}}$ and $V_{\mathrm{post}}$ are the pre- and postoperative volumes of one rater. The growth rate was calculated as the difference between the mean tumor diameters divided by the time interval in years (Mandonnet et al., 2008), in which:

$$D_{\mathrm{mean}} = (2V)^{1/3}$$

where $D_{\mathrm{mean}}$ is the mean tumor diameter of the volume $V$ of one rater. For the clinical volumetric analyses we used the interquartile range as the measure of dispersion between the non-normally distributed measurements of raters per case.
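The volumetric measures can be sketched in a few lines. This is an illustrative example under stated assumptions, not the authors' code: volumes are taken in mL and converted to mm³, the mean diameter is assumed to follow the definition D = (2V)^(1/3) attributed to Mandonnet et al. (2008), and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, since the source does not state which quantile definition was used. All function names are hypothetical.

```python
import statistics


def extent_of_resection(v_pre, v_post):
    """Extent of resection in percent from pre- and postoperative volumes."""
    if v_pre <= 0:
        raise ValueError("preoperative volume must be positive")
    return 100.0 * (v_pre - v_post) / v_pre


def mean_diameter_mm(volume_ml):
    """Mean tumor diameter in mm, assuming D = (2V)^(1/3) with V in mm^3."""
    volume_mm3 = volume_ml * 1000.0  # 1 mL equals 1000 mm^3
    return (2.0 * volume_mm3) ** (1.0 / 3.0)


def growth_rate_mm_per_year(v_start_ml, v_end_ml, interval_years):
    """Difference in mean diameter divided by the time interval in years."""
    if interval_years <= 0:
        raise ValueError("time interval must be positive")
    delta = mean_diameter_mm(v_end_ml) - mean_diameter_mm(v_start_ml)
    return delta / interval_years


def interquartile_range(values):
    """Spread of one measure across raters for a single case."""
    # Assumption: "inclusive" quartiles; other definitions give other values.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


if __name__ == "__main__":
    # Made-up volumes for eight raters of one hypothetical patient (mL).
    pre = [30.1, 31.5, 29.8, 32.0, 30.7, 33.2, 28.9, 31.1]
    post = [2.1, 1.8, 3.0, 2.4, 0.0, 2.9, 1.5, 2.2]
    eor_per_rater = [extent_of_resection(a, b) for a, b in zip(pre, post)]
    print(interquartile_range(eor_per_rater))
```

Computing the measure per rater first and then taking the interquartile range per case keeps the dispersion on the same scale as the clinical quantity (percent or mm/year).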

Results

Patient characteristics

Patients with glioblastoma had a mean age of 61.4 years (range 41.8–72.6) and consisted of 10 females and 10 males. Patients with non-glioblastoma had a mean age of 36.9 years (range 18.6–53.7) and consisted of 8 females and 12 males. The time between preoperative MRI and surgery was on average 7.8 days for glioblastoma and 53.6 days for non-glioblastoma. The time between surgery and the postoperative MRI was on average 1.2 days for glioblastoma and 4.11 months for non-glioblastoma. The time between surgery and the progressive MRI was on average 13.7 months (range: 5.6–30.7) for glioblastoma and 28.9 months (range 6.2–60.7) for non-glioblastoma. Enhancing and non-enhancing tumor were not treated as mutually exclusive by the raters, therefore overlap is present between the segmentations of enhancing and non-enhancing tumor. The average contrast-enhancing (with enclosed necrosis) tumor volume was 32.2 mL for glioblastoma and 0.8 mL for non-glioblastoma on the preoperative MRI, 2.7 and 0.0 mL on the postoperative MRI, and 24.2 and 5.2 mL on the progressive MRI. The average non-enhancing tumor volume was 88.6 mL for glioblastoma and 45.3 mL for non-glioblastoma on the preoperative MRI, 37.2 and 8.4 mL on the postoperative MRI, and 78.0 and 25.7 mL on the progressive MRI. The tumor was located in the left hemisphere in 8 patients with glioblastoma, and in 9 patients with non-glioblastoma. Detailed patient characteristics are presented in Table 1.

Table 1

Patient characteristics.

Glioblastoma							Non-glioblastoma
Pat	Path	Sex	Age	T1	T2	T3	Pat	Path	Sex	Age	T1	T2	T3
1	GB	F	67.1	13	0	415	21	A2	F	53.7	26	91	1746
2	GB	F	72.1	6	1	229	22	O2	M	44.7	1	111	1033
3	GB	M	65.3	2	0	920	23	A2	F	23.1	67	111	188
4	GB	F	66.1	2	1	310	24	A2	M	30.1	53	77	861
5	GB	M	66.7	1	1	474	25	A2	M	18.6	1	184	1477
6	GB	F	64.0	15	1	274	26	A2	F	21.8	9	92	1538
7	GB	M	45.4	4	3	591	27	O2	M	52.6	67	143	1595
8	GB	M	52.8	9	3	255	28	A2	F	35.5	46	108	1820
9	GB	M	61.3	7	0	279	29	A2	M	30.8	255	102	686
10	GB	M	70.5	2	0	184	30	A2	F	28.6	111	127	207
11	GB	M	75.5	1	1	188	31	OA2	M	34.8	109	1	191
12	GB	F	66.2	8	2	540	32	A2	F	48.2	2	99	573
13	GB	M	71.6	10	1	825	33	A2	M	29.1	12	101	1438
14	GB	M	55.1	2	3	770	34	A2	M	23.0	60	145	965
15	GB	F	42.3	5	1	732	35	A2	M	41.6	1	90	903
16	GB	M	73.0	18	1	329	36	OA2	F	39.5	1	61	183
17	GB	F	47.2	3	1	168	37	A3	M	52.8	8	170	306
18	GB	F	41.8	21	1	267	38	A2	M	37.8	52	161	186
19	GB	F	72.6	8	1	204	39	O2	F	44.3	68	380	402
20	GB	M	51.4	19	2	278	40	OA2	M	46.7	126	112	1038

Pat: patient number, Path: histopathological diagnosis, Age: age in years, T1: time of preoperative scans (days before surgery), T2: time of postoperative scans (days after surgery), T3: time of progression scans (days after surgery), GB: glioblastoma, A2: astrocytoma grade II, O2: oligodendroglioma grade II, OA2: oligoastrocytoma grade II, A3: anaplastic astrocytoma grade III.

Tumor tissue detection

The number of raters that identified any tumor is plotted in Fig. 1. Zero raters would represent perfect agreement on absence of tumor, and four raters would represent perfect agreement on presence of tumor.

Fig. 1

Bar plots of the number of patients with corresponding number of expert (EX) and novice (NO) raters detecting any enhancing tumor and any non-enhancing tumor for glioblastoma and non-glioblastoma in MRIs preoperative, postoperative and at progression.

Experts and novices perfectly agreed on the presence of any enhancing tumor for glioblastoma and on any non-enhancing tumor for non-glioblastoma patients on preoperative MRIs. Few experts and even fewer novices detected enhancing tumor in non-glioblastoma patients preoperatively. In postoperative MRIs both experts and novices considerably disagreed on the presence of enhancing tumor in glioblastoma patients. Experts more frequently agreed perfectly on enhancing tumor presence than novices; novices more frequently agreed perfectly on enhancing tumor absence in postoperative MRIs. Experts generally agreed on tumor presence in non-glioblastoma patients postoperatively, whereas novices disagreed in one third of these patients. At progression, experts and novices generally agreed on the presence of any enhancing tumor in glioblastoma and perfectly agreed on any non-enhancing tumor in non-glioblastoma patients. Experts more frequently identified enhancing tumor in non-glioblastoma patients at progression than novices. All experts and novices identified non-enhancing tumor in all glioblastoma and non-glioblastoma patients.

ICC of tumor volume

The ICCs of tumor volumes are shown in Table 2. Agreement in volume measurements among experts is excellent at all three time points for enhancing tumor elements in glioblastoma patients and excellent for non-enhancing tumor elements in non-glioblastoma patients (ICC ≥ 0.8). In contrast, the non-enhancing elements in glioblastoma patients have poor to fair agreement for both experts and novices. The agreement among experts is generally better than among novices.
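For readers who want to reproduce this type of volume agreement score, the sketch below computes ICC(A,1), the two-way, absolute-agreement, single-measurement form described by McGraw and Wong (1996), from the mean squares of a two-way layout. It is an illustrative example rather than the authors' analysis code: the function name `icc_a1` and the input layout (one row per patient, one column per rater) are assumptions, and confidence intervals are not computed.

```python
def icc_a1(ratings):
    """ICC(A,1) for a table with one row per subject and one column per rater.

    Uses ICC(A,1) = (MSR - MSE) / (MSR + (k - 1) * MSE + (k / n) * (MSC - MSE)),
    where MSR, MSC and MSE are the mean squares for rows (subjects),
    columns (raters) and residual error.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


if __name__ == "__main__":
    # Made-up tumor volumes (mL) for five patients segmented by four raters.
    volumes = [
        [30.2, 31.0, 29.5, 30.8],
        [12.1, 11.4, 13.0, 12.5],
        [55.3, 54.0, 56.1, 55.9],
        [4.2, 5.0, 3.8, 4.6],
        [21.7, 22.3, 20.9, 21.5],
    ]
    print(icc_a1(volumes))
```

Because the absolute-agreement form includes the rater mean square in the denominator, a rater who systematically segments larger volumes lowers the score even when the ranking of patients is preserved.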

Table 2

Intra-class correlation coefficient (ICC) with 95% confidence intervals for experts and novices.

Histology group	Contrast	Rater	Preoperative	Postoperative	Progression
GB	Enhancing	Experts	0.99 (0.98–1.00)	0.92 (0.85–0.97)	0.91 (0.82–0.96)
GB	Enhancing	Novices	0.98 (0.96–1.00)	0.60 (0.39–0.78)	0.97 (0.95–0.99)
GB	Non-enhancing	Experts	0.61 (0.41–0.79)	0.25 (0.05–0.52)	0.53 (0.24–0.76)
GB	Non-enhancing	Novices	0.55 (0.24–0.78)	0.15 (0.00–0.38)	0.40 (0.09–0.67)
Non-GB	Enhancing	Experts	0.28 (0.07–0.55)	⁎	1.00 (1.00–1.00)
Non-GB	Enhancing	Novices	0.57 (0.35–0.77)	⁎	0.66 (0.47–0.83)
Non-GB	Non-enhancing	Experts	0.92 (0.81–0.97)	0.84 (0.70–0.93)	0.80 (0.65–0.91)
Non-GB	Non-enhancing	Novices	0.73 (0.40–0.89)	0.55 (0.32–0.76)	0.73 (0.46–0.88)

GB: glioblastoma.

⁎ No enhancing elements were identified for non-glioblastomas in the postoperative MRI, with the exception of 2 disjoint residual volumes each by a different rater.

Spatial overlap

Results for spatial agreement are represented as box-plots of the GCI between raters in Fig. 2, demonstrating that experts generally achieve a higher agreement in spatial overlap than the novices. For non-enhancing tumor segmentations of glioblastoma on postoperative MRI, experts had a significantly higher spatial overlap than novices with a median GCI of 0.30 versus 0.15 (p = .002). For non-enhancing tumor segmentations of non-glioblastoma at all MRI time points, experts had a significantly higher spatial agreement than novices with a median GCI of 0.79 versus 0.67 (p = .001) on preoperative MRI, 0.52 versus 0.35 (p = .007) on postoperative MRI and 0.64 versus 0.38 (p < .001) at progression.
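The p-values above come from a Fisher-Pitman permutation test. As a hedged illustration of the general idea, the sketch below implements an exact two-sample permutation test on the difference in group means. The source does not state whether a paired or unpaired variant was used, so the independent two-sample form shown here is an assumption, as is the function name. Exact enumeration is only practical for small groups; for 20 versus 20 scores a random (Monte Carlo) subset of relabelings would be needed instead.

```python
from itertools import combinations


def permutation_test_mean_difference(x, y):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way of relabeling the pooled values into groups of the
    original sizes and counts how often the absolute mean difference is at
    least as large as the observed one.
    """
    pooled = list(x) + list(y)
    n_x = len(x)
    n_y = len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n_x - sum(y) / n_y)

    n_relabelings = 0
    n_extreme = 0
    for indices in combinations(range(len(pooled)), n_x):
        sum_x = sum(pooled[i] for i in indices)
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        n_relabelings += 1
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_relabelings


if __name__ == "__main__":
    # Made-up GCI scores for a handful of patients.
    experts = [0.79, 0.81, 0.74, 0.77, 0.83]
    novices = [0.67, 0.70, 0.58, 0.72, 0.61]
    print(permutation_test_mean_difference(experts, novices))
```

The test makes no distributional assumption beyond exchangeability of the labels under the null hypothesis, which suits bounded overlap scores such as the GCI.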

Fig. 2

Box plots of the spatial overlap among experts (EX) and novices (NO) measured as generalized conformity index for enhancing tumor and non-enhancing tumor segmentations of 20 glioblastoma and 20 non-glioblastoma patients in MRIs taken at preoperative, postoperative and progression time points. Each dot represents the agreement among raters for one patient's MRI. Indices above 0.7 are considered excellent. The median of measurements and interquartile distances are plotted as boxes, which were omitted when fewer than five data points were present. Few data points were available for enhancing tumor segmentations in non-glioblastoma, because the generalized conformity index could not be calculated when fewer than two observers detected tumor.

The spatial agreement was invariably highest for preoperative segmentations and lowest for postoperative segmentations. Agreement on enhancing tumor in glioblastoma was excellent among both experts and novices on preoperative MRI and at progression. Spatial agreement was lowest for enhancing tumor in glioblastoma on postoperative MRI, although this was affected by a substantial inter-observer disagreement on the presence of any enhancing tumor. Agreement on non-enhancing tumor was excellent among experts segmenting non-glioblastoma, and lowest among novices segmenting non-enhancing tumor for glioblastoma. Spatial overlap agreement was generally higher for enhancing tumor in glioblastoma than for non-enhancing tumor in non-glioblastoma at all MRI timings. To explore potential causes of the low spatial overlap agreement of postoperative enhancing tumor in glioblastoma patients, we hypothesized that lower object volumes and a higher level of fragmentation may contribute to this. The scatter plots in Fig. 3A confirm that in particular enhancing tumor volumes smaller than 10 mL in glioblastoma come with a strikingly lower agreement. As tumor volumes on postoperative MRI are typically smaller than 10 mL, this may partly explain the low agreement. A similar small volume effect was observed in non-enhancing tumor segmentations of non-glioblastomas in Fig. 3B.

Fig. 3

Spatial overlap agreement as generalized conformity index versus tumor volume (average over experts) of enhancing tumor (A) and non-enhancing tumor (B) segmentations for glioblastomas and non-glioblastomas at subsequent MRI timings. Each dot represents the agreement of spatial overlap among experts on one patient's MRI. For enhancing tumor at the postoperative phase, spatial overlap increases after artificial dilation of the segmentation (grey dots), but not to the level of progression segmentations of the same volume.

To take this one step further, we artificially dilated the enhancing tumor segmentations of glioblastomas with a 10 mm spherical structuring element and recalculated the overlap of the dilated volumes (grey symbols, middle panel Fig. 3A). Although the overlap increases, it is still lower than that of undilated object volumes of similar size. Therefore, the lower agreement could not be fully explained by a small volume effect. In addition, we compared the fragmentation of the tumor segmentations by calculating the number of connected components for patients with an enhancing tumor volume smaller than 10 mL. The average number of fragments was 2.14 ± 1.35 (SD) on postoperative MRI and 1.92 ± 1.96 at progression. Therefore, fragmentation of tumor segmentations did not fully explain the lower agreement on postoperative MRI either.
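The two operations used in this analysis, dilation with a spherical structuring element and counting connected components, can be sketched as follows. This is an illustrative example under explicit assumptions and not the authors' pipeline: segmentations are sets of integer voxel coordinates on an isotropic grid, the dilation radius is given in voxel units (a 10 mm radius would have to be converted using the voxel size), and fragments are defined by face adjacency (6-connectivity), which the source does not specify. Both function names are hypothetical.

```python
from collections import deque

# Face-adjacent neighbours (6-connectivity); an assumption for this sketch.
FACE_NEIGHBOURS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def dilate(voxels, radius):
    """Dilate a voxel set with a spherical structuring element (voxel units)."""
    r = int(radius)
    offsets = [
        (dx, dy, dz)
        for dx in range(-r, r + 1)
        for dy in range(-r, r + 1)
        for dz in range(-r, r + 1)
        if dx * dx + dy * dy + dz * dz <= radius * radius
    ]
    return {
        (x + dx, y + dy, z + dz)
        for (x, y, z) in voxels
        for (dx, dy, dz) in offsets
    }


def count_fragments(voxels):
    """Number of connected components in a voxel set (breadth-first search)."""
    remaining = set(voxels)
    fragments = 0
    while remaining:
        fragments += 1
        queue = deque([remaining.pop()])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in FACE_NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    queue.append(neighbour)
    return fragments


if __name__ == "__main__":
    residual = {(0, 0, 0), (1, 0, 0), (5, 5, 5)}
    print(count_fragments(residual))             # two separate fragments
    print(count_fragments(dilate(residual, 4)))  # fragments may merge after dilation
```

Dilation can merge nearby fragments, so fragment counts before and after dilation are not directly comparable.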

Majority voting consensus

Subsequently the spatial overlap agreement was determined between each rater's segmentations and the majority vote for experts and novices combined (Fig. 4). The plots shown in Fig. 4 show a similar trend as the group-wise analysis shown in Fig. 2. Again, the highest agreement was observed on preoperative MRIs, followed by MRIs at the time of progression, and lowest agreement for postoperative MRIs. The comparison against the majority vote allowed for scrutiny on the individual level, showing for the non-glioblastoma patients that one novice (N4) performed at a level similar to that of the experts. We also compared the majority vote for experts and for novices (Fig. 5) which shows that the novice consensus is comparable to the expert consensus for enhancing tumor on preoperative MRI and at progression for glioblastoma and for non-enhancing tumor on preoperative MRI for non-glioblastoma. Novice consensus shows only moderate agreement with expert consensus for enhancing tumor on postoperative MRI for glioblastoma and for non-enhancing tumor on postoperative MRI and at progression for non-glioblastoma.
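A voxel-wise majority vote and the Jaccard comparison against it can be sketched as below. This is a minimal illustration, not the authors' implementation: segmentations are assumed to be sets of voxel coordinates, the function names are hypothetical, and the default threshold of two votes follows the "at least two of four raters" rule stated in the methods. The threshold applied when all eight raters are combined is not given in the source, so it is left as a parameter.

```python
from collections import Counter


def majority_vote(segmentations, min_votes=2):
    """Consensus voxel set: voxels marked by at least `min_votes` raters."""
    votes = Counter(voxel for seg in segmentations for voxel in set(seg))
    return {voxel for voxel, count in votes.items() if count >= min_votes}


def jaccard(a, b):
    """Jaccard index of two voxel sets; None when both sets are empty."""
    union = a | b
    if not union:
        return None
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Made-up segmentations of four raters on a tiny grid.
    raters = [
        {(0, 0, 0), (0, 0, 1), (0, 0, 2)},
        {(0, 0, 1), (0, 0, 2)},
        {(0, 0, 2), (0, 0, 3)},
        set(),  # this rater detected no tumor
    ]
    consensus = majority_vote(raters, min_votes=2)
    for index, seg in enumerate(raters, start=1):
        print(index, jaccard(seg, consensus))
```

Comparing each rater with the consensus, rather than only with the other raters of the same group, is what allows the individual-level comparison described above.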

Fig. 4

Box plots of agreement between majority vote of all eight raters and each of the individual raters, as Jaccard index for enhancing tumor and non-enhancing tumor segmentations in glioblastoma and non-glioblastoma at the three MRI time points. Each dot represents the agreement between the consensus and the individual rater for one patient's segmentation. The first four subplots represent the experts, the second four refer to the novices. The median of measurements and interquartile distances are plotted as boxes, which were omitted when fewer than five data points were measured.

Fig. 5

Boxplots of agreement between rater and majority vote consensus of experts and novices combined measured as Jaccard index for enhancing and non-enhancing tumor segmentations in glioblastoma and non-glioblastoma at three MRI timings. Each dot represents the agreement between a rater's segmentations and the majority vote consensus of all raters for one patient's segmentation. Indices above 0.7 are considered excellent. The median of measurements and interquartile distances are plotted as boxes, which were omitted when fewer than five data points were measured.

Clinical volumetric analysis: extent of resection and growth rate

The variation in extent of resection and growth rate between raters is plotted in Fig. 6. The agreement between raters on the extent of resection of glioblastoma is excellent with a median interquartile range of 1.2% and below 10% in 18 (90%) of 20 cases. At higher extents of resection the variation between raters is lower. For non-glioblastoma, the agreement between raters on the extent of resection is less than glioblastoma but still reasonable with a median interquartile range of 8.3% and below 10% in 10 (50%) of 20 cases. A correlation between extent of resection and variation between raters seems absent.

Fig. 6

The variation in extent of resection and growth rate for glioblastoma and non-glioblastoma between eight raters per patient. In each plot patients are sorted by median extent of resection and growth rate, respectively. Each dot represents the calculation for one patient of one rater. Experts and novices are labelled according to the legend. The median of measurements and interquartile distances are plotted as boxes. The quartile coefficients of dispersion are plotted below the boxplots.

The agreement between raters on the growth rate of glioblastoma is quite high with a median interquartile range of 0.42 mm/y and below 1 mm/y in 16 (80%) of 20 cases. The agreement on growth rate is not correlated with growth rates. For non-glioblastoma, the agreement on growth rate was higher than for glioblastoma with a median interquartile range of 0.22 mm/y and below 1 mm/y in 18 (90%) of 20 cases. At lower growth rates the variation between raters is lower.

Discussion

In this study we present a comprehensive and systematic analysis of inter-rater agreement in glioma segmentations addressing glioblastoma and non-glioblastoma, at different stages of disease, and comparing experts, with extensive clinical experience, and novices, with limited training. Our main findings are that (1) the agreement on presence and overlap of preoperative tumor segmentations was high and of post-operative tumor segmentations was low, (2) experts demonstrated higher levels of agreement than novices, in particular for non-enhancing tumor segmentations in non-glioblastoma and (3) the agreement on enhancing tumor in non-glioblastoma and on non-enhancing tumor in glioblastoma was very low. The inter-rater agreement on postoperative MRI is problematic. Raters disagree considerably on tumor presence, experts and novices alike, and even more so for enhancing tumor in glioblastoma than for non-enhancing tumor in non-glioblastoma. A possible explanation is that MRIs made a few days after glioblastoma surgery suffer from surgical artefacts, such as blood clots, luxury perfusion of post resection ischemia or contusion, distortion of tissue and blood vessels. Misinterpretation of these surgical artefacts may be diminished by subtraction of the T1-weighted MRI before contrast from the T1-weighted MRI after contrast. Many of these artefacts have resolved in the months after non-glioblastoma resection, which explains the higher agreement between raters in this patient population. This time to postoperative MRI is not available in patients with glioblastoma because radiotherapy, inducing further treatment artefacts, usually follows shortly. Segmentation for non-enhancing tumor in glioblastomas on postoperative MRI has a low inter-rater agreement and is deemed to be ill-defined as a ground truth due to poor spatial overlap and volume agreement. 
The main reason is that some raters attempted to distinguish non-enhancing tumor portions from pure edema in glioblastoma within T2/FLAIR hyper-intense regions, whereas others considered all hyper-intensity to be tumor. A clear instruction to include all hyper-intensity may improve the agreement. Common reasons for disagreement of enhancing portions consisted of small linear enhancement at the border of the resection cavities, which was considered to be sulcal vasculature or gliosis by some raters and residual tumor by others. Furthermore, some raters identified small multifocal enhancing nodules at distance from the resection cavity that were overlooked or considered normal vasculature by others, which resulted in poor volume overlap. In non-enhancing tumor segmentations of lower-grade glioma, novices typically identified tumor in the uncus adjacent to the tumor on T2/FLAIR-weighted MRI, which contained intensities similar to the contralateral uncus according to experts. Similarly novices included the hyper-intensity of the cortex adjacent to the sulci, where experts restricted their segmentation from sulcus to sulcus. The inter-rater agreement on MRI at progression was slightly lower than the inter-rater agreement on preoperative MRI, and higher than on postoperative MRI, which is in agreement with the relative volumes. The agreement in spatial overlap for non-glioblastoma segmentation found in this study, with a GCI of 0.60 among experts, is in agreement with that found by others (Gui et al., 2018) based on two experts segmenting two MRIs. Novices can replace experts in segmentations of enhancing tumor in glioblastoma on MRI at progression. Nevertheless, experts seem to be required for non-enhancing tumor segmentation in non-glioblastomas on MRI at progression. MRIs at progression of non-glioblastomas are difficult to interpret because these suffer from artefacts from radiation therapy that cannot be discerned from disease progression (Tensaouti et al., 2017). 
The combination of results from experts and novices may overlook the performance of individual raters and may therefore be an oversimplification. Interestingly, the comparison of individual raters with the consensus of all raters shows that one novice (the last in Fig. 4) seems to provide segmentations of similar quality to those of experts. For glioblastomas, the spatial overlap agreement between raters was high on preoperative MRI, which is not surprising given the unambiguous distinction between contrast-enhancing tumor and non-enhancing surrounding tissue. At progression the contrast becomes more ambiguous due to treatment effects such as pseudo-progression or radiation-induced necrosis (Tensaouti et al., 2017). The contrast becomes even more ambiguous on postoperative MRI with small fragmented residual tumor in the presence of surgical artefacts. Of note is that despite the lack of spatial overlap agreement, the volume ICC scores in glioblastoma are high, particularly among experts, in contrast to findings by others (Kubben et al., 2010). Perhaps this discrepancy is due to the agreement on absence of residual tumor in several of our patients, whereas in the previously published study (Kubben et al., 2010) all 8 patients had postoperative residue. Our data indicate that, despite low agreement in spatial overlap, agreement is reasonable for volume measurements, which are commonly used to determine the extent of resection. The impact of inter-rater disagreement on common clinical volumetric analyses such as the extent of resection and the growth rate appears to be limited. The extent of resection calculations for glioblastoma justify the use of exact percentages by a single rater for cohort reports. Extent of resection calculations for non-glioblastoma are subject to more variation, and would therefore likely be better represented by categories such as near-complete, subtotal and partial resection.
Furthermore, the agreement in growth rate calculations justifies the use of exact growth rates by a single rater, even more so for non-glioblastoma than for glioblastoma.
An important aspect that affects scores like Dice and Jaccard (of which the GCI is an extension) is the effect of small volumes, which biases these scores to be lower as volume decreases. Distance measures are considered less susceptible to this small volume bias (e.g. Dubuisson and Jain, 1994; Steenbakkers et al., 2005), but they require correlated surfaces to establish a distance measure, and this is undefined in case of multiple tumor fragments, as is common for glioma segmentations. Possible causes for the poor to moderate spatial overlap agreement as described by the GCI for postoperative data include the relatively small volumes and tumor fragmentation. However, we showed that the enhancing tumor segmentations at progression of glioblastoma patients have similar fragmentation but were associated with a higher spatial overlap. Even when the postoperative segmentations were artificially dilated to reduce the volume effect, the overlaps stayed well below those at progression. Therefore, we conclude that segmentations on postoperative imaging are more complex than those at progression.
This study has some limitations. We have used a commercial semi-automatic segmentation tool, which may not be available to other users. We selected this tool because it is time-efficient and intuitive and is in common use in clinical settings for the treatment of patients with brain tumors. Furthermore, we adopted the VASARI criteria for radiological definitions of glioblastoma, which are based on standard T1- and T2-weighted sequences. These standard sequences are known to perform poorly in distinguishing tumor infiltration from normal brain (Verburg et al., 2017). Perhaps better performance can be expected from (combinations of) advanced imaging, which should then be used to improve tumor segmentation.
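The small-volume bias of overlap scores mentioned in this part of the discussion can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the analysis used in the study: segmentations are represented as sets of voxel coordinates, the GCI is taken to be the sum of pairwise intersections divided by the sum of pairwise unions (the commonly cited generalization of Jaccard), and two empty segmentations are scored as full agreement. The `cube` helper is hypothetical.

```python
from itertools import combinations


def dice(a, b):
    """Dice score of two voxel sets; two empty sets count as agreement (assumption)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard score of two voxel sets; two empty sets count as agreement (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def generalized_conformity_index(segmentations):
    """Assumed GCI: summed pairwise intersections over summed pairwise unions."""
    pairs = list(combinations(segmentations, 2))
    total_intersection = sum(len(a & b) for a, b in pairs)
    total_union = sum(len(a | b) for a, b in pairs)
    return total_intersection / total_union if total_union else 1.0


def cube(side, offset=0):
    """Voxel set of a cube with the given side, shifted `offset` voxels along x."""
    return {(x + offset, y, z)
            for x in range(side) for y in range(side) for z in range(side)}


# The same one-voxel shift between two raters, on a large and on a small object.
print(dice(cube(20), cube(20, offset=1)))  # 0.95
print(dice(cube(4), cube(4, offset=1)))    # 0.75

# Three raters, each shifted one voxel further, on the small object.
print(generalized_conformity_index([cube(4), cube(4, 1), cube(4, 2)]))  # 0.5
```

An identical boundary disagreement costs the small object far more overlap than the large one, which is why low scores on small postoperative residuals cannot be read as low rater skill without correcting for volume.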
Our study is an extension of the current literature, summarized in Table 3, which often focuses on glioblastoma with manual segmentations on preoperative MRI as reference to evaluate novel (semi-)automatic tumor segmentation algorithms. In the recent literature, more and more expert segmentations are made publicly available (e.g. the BRATS data; Menze et al., 2015; Bakas et al., 2017) and are being used for the validation of (semi-)automatic algorithms (e.g. Zaouche et al., 2018). However, such data sets are of limited value when each segmentation results from a single rater and the inter-rater variability is unknown.

Table 3

An overview of previous studies on inter-rater agreement.

Authors	Year	Low grade: Pre	Low grade: Post	Low grade: Prog	High grade: Pre	High grade: Post	High grade: Prog	#Exp	#Nov	Context
Weltens et al., 2001	2001				4			6	3	Added value of MRI to CT for segmentation.
Cattaneo et al., 2005	2005				7				5^⁎	idem
Provenzale et al., 2009	2009						22^⁎⁎	8		Reproducibility of 2D tumor dimensions.
Kubben et al., 2010	2010				8	8		2	1	Manual PreOp/PostOp glioblastoma segmentation
Gooya et al., 2012	2012				10				2a	GLISTR
Provenzale and Mancini, 2012	2012				5		5b	3	4	Reproducibility of 2D tumor dimensions.
Cordova et al., 2014	2014				37	37		1e	2	Semi-automatic segmentation.
Porz et al., 2014	2014				25			1c	1c	BraTumIA
Menze et al., 2015	2015	14			51			4		BRATS
Huber et al., 2015	2015				5	5		4	8	Evaluation of inter-rater variability
Ben Abdallah et al., 2016	2016	9		3b				13		Idem
Porz et al., 2016	2016				19			4		BraTumIA
Kleesiek et al., 2016	2016				15		15	2		Semi-automatic segmentation
Meier et al., 2016d	2016				14	14	14	1	1	BraTumIA (longitudinal)
Bø et al., 2017	2017	23						1		Intra-rater assessment
Zaouche et al., 2018	2018	4						2		Semi-automatic segmentation
Gui et al., 2018	2018			4				2		Quantification of progression
This Study	2018	20	20	20	20	20	20	4	4	Evaluation of inter-rater variability

a Unspecified type of rater.

b Moment after surgery not specified.

c Supervised by expert neuro-radiologist.

d This study has multiple longitudinal moments after postoperative.

e Expert used as ground truth, novices test semi-automated method.

Experts generally have higher agreement than novices, suggesting that expert segmentations are better than those of novices, in particular for non-enhancing tumor segmentations in non-glioblastomas, although novices have similar agreement for enhancing tumor segmentations of glioblastomas on preoperative MRI and at progression. Our results indicate that preoperative tumor segmentation is done reliably by novices and experts. For other applications of tumor segmentation, such as assessment of quality of care, treatment response measurement, and evaluation of progression, segmentations are less reliable and a sensitivity analysis across different raters would be needed. In practice, it is not realistic to obtain consensus segmentations from multiple experts. A promising future strategy could be to use standardized, fully automated tumor segmentation algorithms, which are probably more reproducible than manual segmentation but may be inaccurate as well. To determine the accuracy of segmentations, ground truth histopathological correlation of tumor presence would be required.

45 in total

1. Dynamic imaging response following radiation therapy predicts long-term outcomes for diffuse low-grade gliomas.

Authors: Johan Pallud; Jean-François Llitjos; Frédéric Dhermain; Pascale Varlet; Edouard Dezamis; Bertrand Devaux; Raphaëlle Souillard-Scémama; Nader Sanai; Maria Koziak; Philippe Page; Michel Schlienger; Catherine Daumas-Duport; Jean-François Meder; Catherine Oppenheim; François-Xavier Roux
Journal: Neuro Oncol Date: 2012-03-13 Impact factor: 12.300

2. Statistical evaluation of manual segmentation of a diffuse low-grade glioma MRI dataset.

Authors: Meriem Ben Abdallah; Marie Blonski; Sophie Wantz-Mezieres; Yann Gaudeau; Luc Taillandier; Jean-Marie Moureaux
Journal: Conf Proc IEEE Eng Med Biol Soc Date: 2016-08

3. Interobserver variations in gross tumor volume delineation of brain tumors on computed tomography and impact of magnetic resonance imaging.

Authors: C Weltens; J Menten; M Feron; E Bellon; P Demaerel; F Maes; W Van den Bogaert; E van der Schueren
Journal: Radiother Oncol Date: 2001-07 Impact factor: 6.280

4. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features.

Authors: Spyridon Bakas; Hamed Akbari; Aristeidis Sotiras; Michel Bilello; Martin Rozycki; Justin S Kirby; John B Freymann; Keyvan Farahani; Christos Davatzikos
Journal: Sci Data Date: 2017-09-05 Impact factor: 6.444

5. Target delineation in post-operative radiotherapy of brain gliomas: interobserver variability and impact of image registration of MR(pre-operative) images on treatment planning CT scans.

Authors: Giovanni Mauro Cattaneo; Michele Reni; Giovanna Rizzo; Pietro Castellone; Giovanni Luca Ceresoli; Cesare Cozzarini; Andrés José Maria Ferreri; Paolo Passoni; Riccardo Calandrino
Journal: Radiother Oncol Date: 2005-05 Impact factor: 6.280

6. Probabilistic radiographic atlas of glioblastoma phenotypes.

Authors: B M Ellingson; A Lai; R J Harris; J M Selfridge; W H Yong; K Das; W B Pope; P L Nghiemphu; H V Vinters; L M Liau; P S Mischel; T F Cloughesy
Journal: AJNR Am J Neuroradiol Date: 2012-09-20 Impact factor: 3.825

7. Epidemiology of glial and non-glial brain tumours in Europe.

Authors: Emanuele Crocetti; Annalisa Trama; Charles Stiller; Adele Caldarella; Riccardo Soffietti; Jana Jaal; Damien C Weber; Umberto Ricardi; Jerzy Slowinski; Alba Brandes
Journal: Eur J Cancer Date: 2012-01-07 Impact factor: 9.162

Review 8. Surgical oncology for gliomas: the state of the art.

Authors: Nader Sanai; Mitchel S Berger
Journal: Nat Rev Clin Oncol Date: 2017-11-21 Impact factor: 66.675

9. Tumor growth dynamics in serially-imaged low-grade glioma patients.

Authors: Chloe Gui; Suzanne E Kosteniuk; Jonathan C Lau; Joseph F Megyesi
Journal: J Neurooncol Date: 2018-04-09 Impact factor: 4.130

Review 10. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS).

Authors: Bjoern H Menze; Andras Jakab; Stefan Bauer; Jayashree Kalpathy-Cramer; Keyvan Farahani; Justin Kirby; Yuliya Burren; Nicole Porz; Johannes Slotboom; Roland Wiest; Levente Lanczi; Elizabeth Gerstner; Marc-André Weber; Tal Arbel; Brian B Avants; Nicholas Ayache; Patricia Buendia; D Louis Collins; Nicolas Cordier; Jason J Corso; Antonio Criminisi; Tilak Das; Hervé Delingette; Çağatay Demiralp; Christopher R Durst; Michel Dojat; Senan Doyle; Joana Festa; Florence Forbes; Ezequiel Geremia; Ben Glocker; Polina Golland; Xiaotao Guo; Andac Hamamci; Khan M Iftekharuddin; Raj Jena; Nigel M John; Ender Konukoglu; Danial Lashkari; José Antonió Mariz; Raphael Meier; Sérgio Pereira; Doina Precup; Stephen J Price; Tammy Riklin Raviv; Syed M S Reza; Michael Ryan; Duygu Sarikaya; Lawrence Schwartz; Hoo-Chang Shin; Jamie Shotton; Carlos A Silva; Nuno Sousa; Nagesh K Subbanna; Gabor Szekely; Thomas J Taylor; Owen M Thomas; Nicholas J Tustison; Gozde Unal; Flor Vasseur; Max Wintermark; Dong Hye Ye; Liang Zhao; Binsheng Zhao; Darko Zikic; Marcel Prastawa; Mauricio Reyes; Koen Van Leemput
Journal: IEEE Trans Med Imaging Date: 2014-12-04 Impact factor: 10.048
