
Strategies for tackling the class imbalance problem of oropharyngeal primary tumor segmentation on magnetic resonance imaging.

Roque Rodríguez Outeiral¹, Paula Bos²,³, Hedda J van der Hulst², Abrahim Al-Mamgani¹, Bas Jasperse², Rita Simões¹, Uulke A van der Heide¹.

Abstract

Background and purpose: Contouring oropharyngeal primary tumors in radiotherapy is currently done manually, which is time-consuming. Autocontouring techniques based on deep learning methods are a desirable alternative, but these methods can render suboptimal results when the structure to segment is considerably smaller than the rest of the image. The purpose of this work was to investigate different strategies to tackle the class imbalance problem in this tumor site.
Materials and methods: A cohort of 230 oropharyngeal cancer patients treated between 2010 and 2018 was retrospectively collected. The following magnetic resonance imaging (MRI) sequences were available: T1-weighted, T2-weighted, and 3D T1-weighted after gadolinium injection. Two strategies to tackle the class imbalance problem were studied: training with different loss functions (namely Dice loss, Generalized Dice loss, Focal Tversky loss and Unified Focal loss) and implementing a two-stage approach (i.e. splitting the task into detection and segmentation). Segmentation performance was measured with the Sørensen-Dice coefficient (Dice), 95th percentile Hausdorff distance (HD) and Mean Surface Distance (MSD).
Results: The network trained with the Generalized Dice loss yielded a median Dice of 0.54, median 95th HD of 10.6 mm and median MSD of 2.4 mm, but no significant differences were observed among the different loss functions (p-value > 0.7). The two-stage approach resulted in a median Dice of 0.64, median HD of 8.7 mm and median MSD of 2.1 mm, significantly outperforming the end-to-end 3D U-Net (p-value < 0.05).
Conclusion: No significant differences were observed when training with different loss functions. The two-stage approach outperformed the end-to-end 3D U-Net.


Keywords: Class imbalance; MRI; Convolutional neural network; Oropharyngeal cancer; Segmentation; Two-stage approach

Year: 2022 PMID: 36035088 PMCID: PMC9405079 DOI: 10.1016/j.phro.2022.08.005

Source DB: PubMed Journal: Phys Imaging Radiat Oncol ISSN: 2405-6316

Introduction

Radiotherapy is one of the common treatment options for head and neck cancer patients [1], [2]. One key step of the radiotherapy workflow is tumor contouring. While contouring of organs at risk is increasingly being automated in clinical practice, tumor contouring is still done manually. This is time-consuming and suffers from high interobserver variability [3]. Deep learning methods, particularly Convolutional Neural Networks (CNNs), are the current state of the art for automatic segmentation of medical images. Several review papers have been published on deep learning applied to radiotherapy, and automatic segmentation is often discussed as one of the main applications [4], [5], [6], [7]. For the particular case of head and neck cancer, various works have focused on the automatic segmentation of organs at risk with deep learning [8], some of them achieving clinically acceptable performance and being commercially available [9]. For tumor contouring, the literature is scarcer and such algorithms have not yet been implemented in the clinic. In our previous work [10], we segmented the oropharyngeal primary tumor on magnetic resonance imaging (MRI) and showed that combining multiple anatomical MRI sequences improved the segmentation performance compared to using a single sequence. We also proposed a semi-automatic approach that improved the segmentation performance by splitting the task into manual detection followed by automatic segmentation. To the best of our knowledge, there is only one other work where the authors segmented the oropharyngeal primary tumor on MRI [11]. The authors studied the impact of combining different anatomical (T1-weighted and T2-weighted) and quantitative images (ADC, K^trans and v_e) as input channels to a CNN and showed that combining anatomical sequences significantly improved the performance.
A known issue in the field of deep learning for medical image segmentation is class imbalance, meaning that the structure to be segmented occupies far fewer voxels than the rest of the image. Class imbalance can result in suboptimal solutions because the network is exposed to proportionally less relevant information during the training process. Several works in the field of medical image segmentation have focused on this problem, either by modifying the input data to the network [12], [13] or by defining different loss functions [14], [15], [16]. This problem is even more critical in the case of tumor segmentation, given that tumors tend to be smaller than other structures and are heterogeneous in their location, shape and size. This is also the case for the oropharyngeal primary tumor. Several loss functions have been designed with the aim of tackling class imbalance, such as the Generalized Dice loss [17], the Focal loss [14], the (focal) Tversky loss [15], [18] and the Unified Focal loss [16]. Although the choice of the loss function can be critical for the training of a CNN, comprehensive loss function comparisons for specific tumor sites or anatomies are not commonly performed. Ma et al. [19] showed that the influence of the loss function on performance varies greatly depending on the segmentation task. To the best of our knowledge, this has not yet been studied for the particular case of oropharyngeal cancer segmentation. Other works have implemented two-stage approaches (i.e. detection followed by segmentation) that resulted in more accurate segmentations than their one-stage counterparts [20], [21], [22]. By locating the tumor first, the context around the tumor is reduced, so that the tumor occupies a larger fraction of the input to the segmentation network. Consequently, two-stage approaches are a possible way of tackling class imbalance.
The semi-automatic approach from our previous work [10] consisted of having human observers outline a box around the tumor to provide a first approximation of the tumor location and consequently ease the segmentation task. However, this approach still needed manual intervention. Implementing a two-stage approach also allows us to fully automate it. The aim of this study was to investigate two different strategies for tackling the class imbalance problem for oropharyngeal primary tumor segmentation: training with different loss functions and implementing a fully automatic two-stage approach.

Materials and methods

Data

A cohort of 230 patients treated at our institute between January 2010 and May 2018 was used for this project. The mean age of the patients was 61 years (standard deviation ± 7 years) and 66% of the patients were male. Further details on tumor stage and HPV status can be found in the Supplemental Material (Table S.1). All patients had histologically proven primary oropharyngeal squamous cell carcinoma and received a pre-treatment MRI for primary staging. The institutional review board approved the study (IRBd18047). Informed consent was waived by the institutional review board considering the retrospective design. The cohort was extended from our previous work [10] with a total of 59 new patients.
The scans were acquired on 1.5 T (n = 108) or 3.0 T (n = 122) MRI scanners (Philips Medical Systems, Best, The Netherlands). The imaging protocol included: 2D T1-weighted fast spin-echo, 2D T2-weighted fast spin-echo with fat suppression, and 3D T1-weighted high-resolution isotropic volume excitation after gadolinium injection with fat suppression. Further details on the MRI protocols are given in the Supplemental Material (Table S.2).
The primary tumors were manually contoured in 3D Slicer (version 4.8.0, https://www.slicer.org/) by one observer with 1 year of experience (P.B. or H.H.). Afterwards, they were reviewed and adjusted, if needed, by a radiologist with 7 years of experience (B.J.). All tumor volumes were delineated on the 3D sequence, but the observers were allowed to consult the other sequences.
For the experimental set-up, the data set was split into three subsets: a training set (n = 190), a validation set (n = 20) and a test set (n = 20). The test set was not used for training or hyper-parameter tuning. We stratified the three subsets for tumor volume, subsite, and aspect ratio since these features are likely relevant for segmentation. Subsites were defined as tonsillar tissue, soft palate, base of tongue and posterior wall. The aspect ratio was defined as the ratio between the shortest and the longest axis of the tumor. All images were resampled to a voxel size of 0.8 mm × 0.8 mm × 0.8 mm.
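The aspect ratio used for stratification can be illustrated with a minimal sketch. The text does not say how the tumor axes were measured, so the code below assumes, purely for illustration, that they are the extents of the axis-aligned bounding box of the binary tumor mask. The function name `aspect_ratio` and the toy mask are hypothetical.

```python
import numpy as np

def aspect_ratio(mask, voxel_size_mm=(0.8, 0.8, 0.8)):
    """Shortest / longest extent of a binary mask, in mm.

    Assumption: the "axes" are the sides of the axis-aligned bounding box;
    the study may have measured them differently (e.g. principal axes).
    """
    coords = np.argwhere(mask)  # (n_voxels, 3) indices of tumor voxels
    if coords.size == 0:
        raise ValueError("mask contains no foreground voxels")
    extent_vox = coords.max(axis=0) - coords.min(axis=0) + 1
    extent_mm = extent_vox * np.asarray(voxel_size_mm)
    return float(extent_mm.min() / extent_mm.max())

# Toy example: a 10 x 20 x 5 voxel block, i.e. 8 x 16 x 4 mm at 0.8 mm voxels.
mask = np.zeros((40, 40, 40), dtype=bool)
mask[10:20, 10:30, 10:15] = True
ratio = aspect_ratio(mask)  # 4 mm / 16 mm
print(f"aspect ratio: {ratio:.2f}")
```

With this definition a value close to 1 indicates a roughly compact tumor, while values close to 0 indicate an elongated one.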

Baseline model architecture

The 3D U-Net architecture [23], [24] was used as the basis for our experiments. The Adam optimizer [25] and early stopping were used for training. Dropout and data augmentation were used for regularization. Further details on the training procedure can be found in Table S.4 and in the code, which is publicly available at https://github.com/RoqueRouteiral/oroph_segm_ts.

Training with different loss functions

We trained the 3D U-Net with four different loss functions: Dice loss [26], Generalized Dice loss [17], Focal Tversky loss [18] and Unified Focal loss [16]. For the particular case of the Unified Focal loss, Yeung et al. [16] showed that the choice of the γ hyperparameter can affect the performance. Consequently, we trained four networks with the Unified Focal loss, one for each value of γ in {0.2, 0.4, 0.6, 0.8}. We then compared the segmentation performance of all the networks with each other.
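To make the differences between these losses concrete, the following NumPy sketch writes out three of them for a binary tumor/background problem. It is an illustration of the published formulas, not the training code of this study: a real pipeline would implement them in a differentiable framework, and the values of `alpha`, `beta` and `gamma` below are illustrative assumptions. The exponent convention of the Focal Tversky loss also differs between papers; here the loss is `(1 - TI) ** gamma`. The Unified Focal loss is omitted for brevity.

```python
import numpy as np

EPS = 1e-7  # avoids division by zero

def dice_loss(prob, target):
    """1 - soft Dice overlap between predicted probabilities and a binary target."""
    inter = np.sum(prob * target)
    return 1.0 - 2.0 * inter / (np.sum(prob) + np.sum(target) + EPS)

def generalized_dice_loss(prob, target):
    """Dice over both classes, each weighted by 1 / (class volume)^2."""
    p = np.stack([1.0 - prob, prob])      # channel 0: background, 1: tumor
    g = np.stack([1.0 - target, target])
    axes = tuple(range(1, p.ndim))
    w = 1.0 / (np.sum(g, axis=axes) ** 2 + EPS)
    num = np.sum(w * np.sum(p * g, axis=axes))
    den = np.sum(w * np.sum(p + g, axis=axes))
    return 1.0 - 2.0 * num / (den + EPS)

def focal_tversky_loss(prob, target, alpha=0.7, beta=0.3, gamma=0.75):
    """(1 - Tversky index) ** gamma; alpha weights false negatives, beta false positives."""
    tp = np.sum(prob * target)
    fn = np.sum((1.0 - prob) * target)
    fp = np.sum(prob * (1.0 - target))
    tversky = tp / (tp + alpha * fn + beta * fp + EPS)
    return (1.0 - tversky) ** gamma

# Toy volume: 8 tumor voxels out of 1000 (0.8 % foreground).
target = np.zeros((10, 10, 10))
target[4:6, 4:6, 4:6] = 1.0
empty = np.zeros_like(target)  # a network that predicts "no tumor" everywhere

for name, loss in [("Dice", dice_loss),
                   ("Generalized Dice", generalized_dice_loss),
                   ("Focal Tversky", focal_tversky_loss)]:
    print(f"{name}: perfect={loss(target, target):.4f}, empty={loss(empty, target):.4f}")
```

All three losses are close to 0 for a perfect prediction and close to 1 for the empty prediction, even though the empty prediction is correct for more than 99 % of the voxels. This is the property that makes overlap-based losses attractive under class imbalance.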

Two-stage approach

In our previous work [10], we demonstrated that the segmentation of the oropharyngeal primary tumor was more accurate when the input image was manually cropped with a clipbox around the tumor before being fed to a segmentation network. In this work, we fully automated this two-stage approach (Fig. 1). The first stage consisted of roughly detecting the tumor by automatically selecting a clipbox around it. In the second stage, this clipbox was used to crop the image, which was then used as input to a segmentation network. The loss function chosen for both stages was the Generalized Dice loss. The two networks were trained separately, with the loss backpropagated through each network independently.

Fig. 1

Overview of the two-stage approach.

For the detection stage, a 3D U-Net was trained using the bounding box of the tumor as ground truth. At inference time, the detected clipbox was computed as the bounding box of the network output. For the segmentation stage, the same architecture as in our previous work was used [10]. This segmentation network was trained with only the information contained inside the clipboxes. In every training iteration, the clipboxes were randomly shifted by up to 25 mm in different directions to make the network robust to possible displacements in the detection. At inference time, the input images were cropped by the clipboxes defined by the detection network. As in our previous work, the clipboxes were dilated by 5 mm.
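The clipbox handling described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`bounding_box`, `dilate_box`, `shift_box`, `crop`) are hypothetical, the random shift is drawn independently per axis, and millimetres are converted to voxels by rounding at the 0.8 mm voxel size given in the text.

```python
import numpy as np

VOXEL_MM = 0.8  # isotropic voxel size reported in the text

def bounding_box(mask):
    """Tightest box around a binary mask as (lo, hi) voxel corners, hi exclusive."""
    coords = np.argwhere(mask)
    return coords.min(axis=0), coords.max(axis=0) + 1

def dilate_box(lo, hi, margin_mm, shape):
    """Grow the box by margin_mm on every side, clipped to the image."""
    m = int(round(margin_mm / VOXEL_MM))
    return np.maximum(lo - m, 0), np.minimum(hi + m, np.asarray(shape))

def shift_box(lo, hi, max_shift_mm, shape, rng):
    """Translate the box by a random offset of up to max_shift_mm per axis."""
    m = int(round(max_shift_mm / VOXEL_MM))
    shift = rng.integers(-m, m + 1, size=3)
    # Limit the shift so the box stays fully inside the image.
    shift = np.clip(shift, -lo, np.asarray(shape) - hi)
    return lo + shift, hi + shift

def crop(volume, lo, hi):
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

# Toy "tumor" of 6 x 10 x 6 voxels in a 64^3 image.
tumor = np.zeros((64, 64, 64), dtype=bool)
tumor[30:36, 28:38, 20:26] = True

lo, hi = bounding_box(tumor)
lo_d, hi_d = dilate_box(lo, hi, margin_mm=5.0, shape=tumor.shape)
patch = crop(tumor, lo_d, hi_d)

fraction_full = tumor.mean()   # tumor fraction in the whole image
fraction_crop = patch.mean()   # tumor fraction inside the clipbox
print(f"foreground: {fraction_full:.4f} (full image) vs {fraction_crop:.4f} (clipbox)")

# Training-time augmentation: jitter the clipbox to mimic detection errors.
rng = np.random.default_rng(0)
lo_s, hi_s = shift_box(lo_d, hi_d, max_shift_mm=25.0, shape=tumor.shape, rng=rng)
```

In this toy case the tumor fraction rises from roughly 0.1 % of the full image to roughly 5 % of the cropped patch, which illustrates why cropping reduces class imbalance for the segmentation network.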

Statistics

To confirm that the three subsets were balanced in subsite, volume and aspect ratio, a Kruskal-Wallis test was used for continuous variables (volume and aspect ratio) and a chi-square test for independence for the categorical data (subsite).
Predicted segmentations and the segmentations from the human observers were compared for the patients in the unseen test set. Common segmentation metrics were used: Sørensen–Dice coefficient (Dice), 95th percentile Hausdorff distance (HD) and mean surface distance (MSD). The metrics were implemented using the Python package from DeepMind (https://github.com/deepmind/surface-distance). For the two-stage approach, the detection was evaluated by measuring the absolute mean shift in all 6 directions between the tumor bounding box and the detected clipbox for the patients in the unseen test set. The average shift of the boxes for the observers from our previous work was used for comparison [10].
Differences among the loss function experiments were assessed by the Friedman test, whereas the two-stage approach experiments were assessed by the Wilcoxon signed-rank test. P-values below 0.05 were considered statistically significant. All networks were retrained four times. Reported results are the mean of the results of the four versions of each network. We opted for this approach over N-fold cross-validation to account for the random initialization of the network while ensuring proper stratification in the three sets for all the folds.
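As an illustration of two of the evaluation quantities, the sketch below computes the Dice coefficient between two binary masks and the mean absolute shift over the 6 faces of two boxes. It is a simplified stand-in under stated assumptions: the study used the DeepMind surface-distance package for its metrics, the function names here are hypothetical, and boxes are assumed to be given as (lo, hi) voxel corners on the 0.8 mm grid.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Sørensen-Dice overlap between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def mean_abs_box_shift(box_a, box_b, voxel_mm=0.8):
    """Mean absolute displacement (mm) over the 6 faces of two (lo, hi) boxes."""
    faces_a = np.concatenate([box_a[0], box_a[1]]).astype(float)
    faces_b = np.concatenate([box_b[0], box_b[1]]).astype(float)
    return float(np.mean(np.abs(faces_a - faces_b)) * voxel_mm)

# Toy example: the prediction is the ground truth shifted by 2 voxels along one axis.
truth = np.zeros((32, 32, 32), dtype=bool)
truth[10:20, 10:20, 10:20] = True
pred = np.zeros_like(truth)
pred[12:22, 10:20, 10:20] = True

dice = dice_coefficient(pred, truth)  # 2 * 800 / (1000 + 1000)
box_truth = (np.array([10, 10, 10]), np.array([20, 20, 20]))
box_pred = (np.array([12, 10, 10]), np.array([22, 20, 20]))
shift_mm = mean_abs_box_shift(box_pred, box_truth)  # two faces move by 2 voxels each
print(f"Dice = {dice:.2f}, mean box shift = {shift_mm:.2f} mm")
```

Averaging over all 6 faces means that a displacement along a single axis is diluted by the four faces that did not move, which is worth keeping in mind when reading per-direction shifts next to the overall mean.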

Results

Summary of tumor characteristics

Table S.3 shows the tumor characteristics (location, volume and aspect ratio) of our cohort. No significant differences were found in the distributions of subsite, volume and aspect ratio among the training, validation and test sets.
When comparing the performance of the networks trained with different loss functions, no significant differences were found (p-value > 0.25 for the three metrics). Lower variance in the MSD and Dice can be observed for the network trained with the Generalized Dice loss (Fig. 2). This network achieved a median Dice of 0.54, median 95th HD of 10.6 mm and median MSD of 2.4 mm. No significant differences were observed when training the network with different γ values for the Unified Focal loss (Fig. S.1).

Fig. 2

Segmentation performance of the 3D U-Net trained with different loss functions: Dice Loss (DL), Generalized Dice Loss (GDL), Tversky Loss (TVL) and Unified Focal Loss (UFL).

The mean shift for the detection network was 8.9 mm (Table 1), and no significant differences were found when comparing it to the detection of observer 2 from our previous work (p-value = 0.40). Significant differences were found when comparing the detection of this work to that of observer 1 from our previous work (p-value < 0.001). When separating the mean shift per direction, we observed a mean shift of 10.0 mm in the cranial-caudal direction, 8.4 mm in the medial–lateral direction and 7.7 mm in the dorsal–ventral direction.

Table 1

Detection and segmentation performance of the two stage approach and comparison to results of the previous work [10].

	Detection	Segmentation		
Approach	Avg. shift (mm) [SD]	Dice	HD (mm)	MSD (mm)
This work
3D end-to-end U-Net	–	0.54	10.6	2.4
Two-stage approach	8.7 [8.2]	0.64	8.7	2.1
Previous work [10]
Semi-automatic approach (Obs. 1)	3.0 [3.9]	0.74	4.6	1.2
Semi-automatic approach (Obs. 2)	8.9 [6.9]	0.67	7.2	1.7

The segmentation results of the two-stage approach were significantly better for Dice (p-value = 0.03) and MSD (p-value = 0.02) than the results of the end-to-end 3D U-Net (Table 1). The fully automated two-stage approach yielded a median Dice of 0.64, median HD of 8.7 mm and median MSD of 2.1 mm. One patient was missed in the detection of the two-stage approach for one of the folds, and was thus removed from that fold for the analysis.

Qualitative results

Examples of segmentations obtained by the end-to-end 3D U-Net and the two-stage approach, together with the ground truth segmentation, are shown in Fig. 3. The end-to-end 3D U-Net oversegmented the tumor (Fig. 3a–c), whereas the two-stage approach agreed better with the ground truth. Fig. 3b shows a case where the end-to-end 3D U-Net rendered additional false positive structures on the image.

Fig. 3

Comparison of the oropharyngeal tumor segmentations in three different patients (a, b, c) obtained with the end-to-end 3D U-Net (red) and the two-stage approach (blue), together with the manual delineation (green). The yellow boxes are drawn by the detection network of the two-stage approach. All the images correspond to the 3D sequence.

Discussion

This work investigated two different strategies to tackle the class imbalance problem for the task of oropharyngeal primary tumor segmentation: training with different loss functions and implementing a two-stage approach. Additionally, the proposed two-stage approach fully automated the semi-automatic approach described in our previous work [10].
When training the networks with different loss functions, no significant improvements were observed in the segmentation metrics. Tuning the γ hyperparameter of the Unified Focal loss did not yield significantly better results either. This result is consistent with the work of Ma et al. [19], who concluded that Dice-related losses are often optimal for medical image segmentation tasks. It is also in line with the conclusions of Isensee et al. and their proposed "no new Net" (nnU-Net) [27]. They showed that a tailored-to-task method configuration is more relevant than specific setup choices when designing a deep learning segmentation pipeline.
The two-stage approach achieved significantly better results compared to the conventional end-to-end approach. The high complexity of the task may make the end-to-end training of the network suboptimal, while focusing on two simpler tasks can render better results. In our previous work [10], a semi-automatic approach in which an observer selected a clipbox around the tumor was implemented. When comparing the current detection results to the semi-automatic approach of our previous work, we noted that one of the observers (Obs. 1) selected a tighter box than that of our two-stage approach (although all the tumors were included inside the clipboxes), which resulted in significantly different detection performance. However, we did not observe significant differences with the detection performance of the semi-automatic approach for the other observer (Obs. 2), showing that a fully automatic two-stage approach can be a feasible alternative to a semi-automatic approach. Also, the time spent on delineation in clinical practice should be as short as possible. We reported in our previous work that the time spent on drawing the boxes was lower for observer 2 than for observer 1, making the boxes of observer 2 a more realistic representation of what is expected in the clinic. In the present work, the whole pipeline is automated, which can save time in the clinic. That said, further efforts to improve the detection are of interest, as they could improve the segmentation performance of the two-stage approach.
The literature on automatic segmentation of the oropharyngeal tumor on MRI is scarce and its aims are heterogeneous. Besides our previous work [10], only Wahid et al. [11] have focused on the segmentation of this tumor site on MRI. Their work studied the value of multiparametric MRI for the segmentation performance, both for qualitative and quantitative imaging. Other works focused on automatic segmentation of head and neck cancer in general on multiparametric MRI, rather than on the particular subset of oropharyngeal cancer: Bielak et al. [28] used diffusion weighted imaging while Schouten et al. [29] proposed a multiview CNN architecture. To the best of our knowledge, only our work is focused on tackling the class imbalance problem for head and neck cancer segmentation on MRI, and particularly for the oropharyngeal subsite.
In 2020, the first head and neck tumor segmentation challenge, known as the HECKTOR challenge, was launched [30]. The main subsite of the challenge was the oropharyngeal tumor and the winner of the challenge achieved a mean Dice of 0.76, but the image modalities used were PET/CT. Additionally, Ren et al. [31] compared the use of PET/CT/MRI as different input image combinations for the automatic segmentation of head and neck GTV and observed that, when including PET, the segmentation performance improved. Considering all the above, it is possible that PET is a useful modality for the task of head and neck tumor segmentation. However, the differences in resolution between imaging modalities may be reflected in the detail of the manual ground truth delineations used for training and evaluation. Potentially, this can also explain the difference in performance of the MRI-based task. That said, we argue that the strategies to tackle class imbalance in this work can be useful in the development of autocontouring tools for oropharyngeal cancer.
This study has limitations. Firstly, there is a high interobserver variability for this tumor subsite, especially for tumors of the tonsillar fossa and base of tongue, which are rich in lymphatic tissue, so it is possible that the ground truth delineations used in this work are partially biased. However, one observer corrected the other's delineation, reducing this observer variation. Secondly, validation of our results is still needed with an independent cohort in a multi-center study. Thirdly, the performance could also be improved by making different decisions on the training setup, such as using larger batch sizes or non-downsampled data, but other strategies to mitigate memory limitations would then be needed. Finally, there is a certain variability in the scan protocols. However, variability in the training set can be desirable as it makes the network robust to protocol differences.
In conclusion, the loss functions designed to tackle class imbalance performed comparably. The approach of splitting the problem into localization and segmentation outperformed the end-to-end network, proving to be an effective strategy to mitigate the class imbalance problem in oropharyngeal cancer segmentation.

Data statement

The data used in this study are confidential. The institutional review board approved the study (IRBd18047). Informed consent was waived considering the retrospective design.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

20 in total

1. Fully automated organ segmentation in male pelvic CT images.

Authors: Anjali Balagopal; Samaneh Kazemifar; Dan Nguyen; Mu-Han Lin; Raquibul Hannan; Amir Owrangi; Steve Jiang
Journal: Phys Med Biol Date: 2018-12-14 Impact factor: 3.609

2. Loss odyssey in medical image segmentation.

Authors: Jun Ma; Jianan Chen; Matthew Ng; Rui Huang; Yu Li; Chen Li; Xiaoping Yang; Anne L Martel
Journal: Med Image Anal Date: 2021-03-19 Impact factor: 8.545

Review 3. Deep learning in medical imaging and radiation therapy.

Authors: Berkman Sahiner; Aria Pezeshk; Lubomir M Hadjiiski; Xiaosong Wang; Karen Drukker; Kenny H Cha; Ronald M Summers; Maryellen L Giger
Journal: Med Phys Date: 2018-11-20 Impact factor: 4.071

Review 4. Survey on deep learning for radiotherapy.

Authors: Philippe Meyer; Vincent Noblet; Christophe Mazzara; Alex Lallement
Journal: Comput Biol Med Date: 2018-05-17 Impact factor: 4.589

Review 5. Estimation of an optimal external beam radiotherapy utilization rate for head and neck carcinoma.

Authors: Geoff Delaney; Susannah Jacob; Michael Barton
Journal: Cancer Date: 2005-06-01 Impact factor: 6.860

6. Head and neck tumor segmentation in PET/CT: The HECKTOR challenge.

Authors: Valentin Oreiller; Vincent Andrearczyk; Mario Jreige; Sarah Boughdad; Hesham Elhalawani; Joel Castelli; Martin Vallières; Simeng Zhu; Juanying Xie; Ying Peng; Andrei Iantsen; Mathieu Hatt; Yading Yuan; Jun Ma; Xiaoping Yang; Chinmay Rao; Suraj Pai; Kanchan Ghimire; Xue Feng; Mohamed A Naser; Clifton D Fuller; Fereshteh Yousefirizi; Arman Rahmim; Huai Chen; Lisheng Wang; John O Prior; Adrien Depeursinge
Journal: Med Image Anal Date: 2021-12-25 Impact factor: 8.545

7. Comparing different CT, PET and MRI multi-modality image combinations for deep learning-based head and neck tumor segmentation.

Authors: Jintao Ren; Jesper Grau Eriksen; Jasper Nijkamp; Stine Sofia Korreman
Journal: Acta Oncol Date: 2021-07-15 Impact factor: 4.089

8. Automatic Tumor Segmentation With a Convolutional Neural Network in Multiparametric MRI: Influence of Distortion Correction.

Authors: Lars Bielak; Nicole Wiedenmann; Nils Henrik Nicolay; Thomas Lottner; Johannes Fischer; Hatice Bunea; Anca-Ligia Grosu; Michael Bock
Journal: Tomography Date: 2019-09

9. Evaluation of deep learning-based multiparametric MRI oropharyngeal primary tumor auto-segmentation and investigation of input channel effects: Results from a prospective imaging registry.

Authors: Kareem A Wahid; Sara Ahmed; Renjie He; Lisanne V van Dijk; Jonas Teuwen; Brigid A McDonald; Vivian Salama; Abdallah S R Mohamed; Travis Salzillo; Cem Dede; Nicolette Taku; Stephen Y Lai; Clifton D Fuller; Mohamed A Naser
Journal: Clin Transl Radiat Oncol Date: 2021-10-16

10. Oropharyngeal primary tumor segmentation for radiotherapy planning on magnetic resonance imaging using deep learning.

Authors: Roque Rodríguez Outeiral; Paula Bos; Abrahim Al-Mamgani; Bas Jasperse; Rita Simões; Uulke A van der Heide
Journal: Phys Imaging Radiat Oncol Date: 2021-07-02