Xinhao Yu1,2, Fu Jin2, HuanLi Luo2, Qianqian Lei2, Yongzhong Wu2. 1. College of Bioengineering, Chongqing University, Chongqing, China. 2. Department of Radiation Oncology, Chongqing University Cancer Hospital, Chongqing, China.
Abstract
INTRODUCTION: Radiotherapy is one of the most effective ways to treat lung cancer, and accurately delineating the gross tumor volume is a key step in the radiotherapy process. In current clinical practice, the target area is still delineated manually by radiologists, which is time-consuming and laborious; these problems can be better addressed by deep learning-assisted automatic segmentation methods. METHODS: In this paper, a 3D CNN model named 3D ResSE-Unet is proposed for gross tumor volume segmentation for stage III NSCLC radiotherapy. This model is based on 3D Unet and combines residual connections and a channel attention mechanism. Three-dimensional convolution operations and an encoding-decoding structure are used to mine three-dimensional spatial information of tumors from computed tomography data, and, inspired by ResNet and SE-Net, residual connections and channel attention are used to improve segmentation performance. A total of 214 patients with stage III NSCLC were collected selectively; 148 cases were randomly selected as the training set, 30 cases as the validation set, and 36 cases as the testing set. The segmentation performance of the models was evaluated on the testing set. In addition, the segmentation results of 3D Unets of different depths were analyzed, and the performance of 3D ResSE-Unet was compared with 3D Unet, 3D Res-Unet, and 3D SE-Unet. RESULTS: Compared with other depths, 3D Unet with four downsampling steps is more suitable for our work. Compared with 3D Unet, 3D Res-Unet, and 3D SE-Unet, 3D ResSE-Unet obtains superior results: its Dice similarity coefficient, 95th-percentile Hausdorff distance, and average surface distance reach 0.7367, 21.39 mm, and 4.962 mm, respectively, and the average time for 3D ResSE-Unet to segment one patient is only about 10 s. CONCLUSION: The method proposed in this study provides a new tool for GTV auto-segmentation and may be useful for lung cancer radiotherapy.
Lung carcinoma (LC) is one of the most severe and widespread cancers in the world.
Statistics from the World Health Organization (WHO) in 2020 showed that
there were 815,563 new cases of LC and 714,699 deaths in China. Currently, in
addition to surgery and chemotherapy, radiotherapy (RT) is the most effective
treatment for LC, and compared with other stages, patients with stage III non-small
cell lung cancer (NSCLC) are mainly treated by radiotherapy.

In the radiotherapy workflow for patients with LC, precise delineation of gross tumor
volume (GTV) on computed tomography (CT) images is an essential step. Other tumor
target areas are based on GTV and consider the influence of potential invaded
tissues, positioning errors, and other factors. Inaccurate delineation of GTV will
result in unnecessary damage to normal tissues or undertreatment in the tumor target
area. In clinical practice, GTV is usually manually delineated by radiologists.
However, manual delineation is a time-consuming and laborious process, and the start
of radiotherapy will be delayed as a result.
In addition, manual delineation is a subjective process, and the
radiologist’s experience will have a great influence on the delineation results.
Multiple studies have reported that this process has considerable inter-observer and
intra-observer variability.[4-7]
Thus, it is necessary to develop suitable automatic segmentation methods to relieve
the workload of radiologists in defining the target volume and to improve the
consistency of target delineation.

Deep learning (DL) is a subfield of artificial intelligence and machine learning that has achieved
tremendous success in recent years in various fields in science.[8-10] In medical image
segmentation, DL-based auto-segmentation techniques have been shown to provide
significant improvements over more traditional approaches.[11,12] Convolution neural networks
(CNNs) are the most successful and popular DL architecture applied to image
processing. Many studies have confirmed that CNNs are helpful for tumor
target delineation in radiotherapy for head and neck cancer, breast cancer, and
rectal cancer.[13-20] Some scholars have also conducted research on automatic
segmentation of lung tumor target volume based on CNNs.[21-25] To explore the role of deep
learning-assisted delineation, Bi N et al used a dilated residual
network to delineate the CTV of NSCLC for postoperative radiation therapy;
compared with manual delineation, CNN-assisted delineation achieved better
segmentation accuracy, consistency, and efficiency.
To facilitate the analysis of geometric tumor changes during
radiotherapy, a CNN model named A-net was designed to delineate the GTV of LC with a
DSC of 0.82.
Zhang F et al proposed an automatic segmentation method
based on ResNet and analyzed the role of the DL-assisted method for GTV segmentation
of NSCLC.
To monitor tumor response to therapy, Jiang J et al extended
the full-resolution residual neural network and developed a multiple-resolution
residually connected network for tumor segmentation of NSCLC.
To achieve delineation of GTV for LC stereotactic body radiation therapy,
Cui Y et al proposed CT-based dense V-networks with a DSC of 0.82.
Based on the above research, we reason that the automatic segmentation of GTV
for LC radiotherapy can be achieved through CNNs. However, the above studies have
three issues. First, most of the above studies use 2D CNNs and ignore the
high-dimensional spatial features of tumors.[21-24] When delineating the GTV of
LC, the radiologist needs to refer to adjacent CT slices to determine the trend of
tumor growth. Therefore, it is worth designing a 3D CNN to mine three-dimensional
spatial information from CT images to segment GTV. Second, with the increase of the
network depth, CNNs are prone to the problem of vanishing gradients, and some
studies did not consider this problem.
Third, the contribution of each channel feature in CNNs to the prediction
result is different. The performance of the model can be effectively improved by
using an appropriate attention mechanism. However, this point is ignored in the
above research.[21-25]

In this work, we proposed a 3D CNN named 3D ResSE-Unet to achieve GTV segmentation of
stage III NSCLC on computed tomography (CT) images. The main innovations of this
article are as follows. First, 3D convolution operations were used to mine the
three-dimensional spatial correlation of GTV, and the influence of the depth of the
3D Unet on the segmentation results was explored. Second, we introduced the residual
connection mechanism and channel attention mechanism into the 3D Unet to improve the
robustness of the model. The residual connection was adopted to address the
optimization problem and vanishing gradients. The channel attention mechanism was
used to strengthen the model's representational power by selectively emphasizing
useful features and suppressing useless ones. The modified version of 3D Unet was
used to segment GTV from CT images of 214 stage III NSCLC patients, and compared
with 3D Unet, 3D Res-Unet, and 3D SE-Unet, 3D ResSE-Unet obtained superior
results. Third, to address class imbalance, we designed a mixed loss function based
on the Dice loss and the Focal loss for GTV segmentation. Fourth, batch
normalization (BN) was adopted in the network training process to prevent
overfitting and improve the accuracy of target delineation. Finally, the Dice
similarity coefficient (DSC), 95th-percentile Hausdorff distance (HD95),
and mean surface distance (MSD) were used to evaluate the accuracy of the model's
predictions, and the complexity and segmentation time of the segmentation models
were also compared and analyzed.
Methods
The experimental process of this article mainly includes three steps: data
preprocessing, segmentation model training, and segmentation result evaluation. The
flowchart of the method can be seen in Figure 1.
Figure 1.
Flowchart of the 3D CNN-based segmentation method
Data sets
Data from patients with stage III NSCLC treated from January 2017 to October
2020 in the Department of Radiation Oncology, Chongqing University Cancer
Hospital, were collected selectively. The clinical staging of tumors was
based on the eighth edition of the staging system of the International
Association for the Study of Lung Cancer (IASLC). This work was approved by
the ethics committee of Chongqing University Cancer Hospital (No.
CZLS2021231-A, Date: 13-Sep-2021), and written consent was provided by all
patients to store their medical information in the hospital database. In
addition, all patient details have been de-identified. A total of 214
patient cases were collected: 148 cases were randomly selected as the
training set, 30 cases were used as the validation set, and 36 cases were
used as the testing set. The training set was used to train the segmentation
model and learn the feature distribution of GTV. The validation set was used
to select the best segmentation model, and the segmentation performance of
the models on new data was assessed with the testing set. The general
characteristics of the training, validation, and testing sets are shown in
Table 1.
Table 1.
Characteristics of 214 patients with stage III NSCLC

The patients' data were acquired on a Philips BigBore CT simulator (Philips
Medical Systems, Madison, WI) in helical scan mode (120 kV, 30 mA), with
slice thicknesses of 5 mm or 3 mm. Iodine contrast agents were used for all
patients, and CT images were obtained during free breathing. The planning CT
images and radiotherapy structures of each patient were exported as Digital
Imaging and Communications in Medicine (DICOM) files. Delineation of the GTV
was carried out by a senior lung cancer radiologist with more than 10 years
of work experience and then peer-reviewed by two other experts. In this
study, these GTV contours delineated by radiologists were referred to as the
ground truth. The criteria used by the radiologists to delineate the GTV of
stage III NSCLC were based on the NCCN Clinical Practice Guidelines in
Oncology for Non-Small Cell Lung Cancer, and both the primary gross tumor
volume and the lymph node gross tumor volume were included.
Preprocessing
To make full use of the three-dimensional spatial information of CT images,
the images were processed according to the following steps. As shown in
Figure 2, the GTV contours were extracted from the radiotherapy structure of
each patient using Python, and the CT images and GTV contours of each
patient were converted into 3D matrices using the SimpleITK module. To
maintain consistency across patients, resampling operations were applied to
the image and contour matrices so that each had a slice thickness of 5.0 mm
and a pixel pitch of 1.0 mm. To reduce the computational burden and memory
consumption, input images were randomly cropped into 3D volumes of
160 × 160 × 32 voxels. To make full use of the spatial information of the CT
data, the input data were prepared as overlapping blocks; this overlapping
technique ensures that the segmentation model can utilize as much
information along the third axis as possible. The overlap stride was set to
8 slices for the training data, but this method was not used for the
validation and testing data. An example can be seen in Figure 3. In the end,
1159 blocks of 3D data were obtained for the training set, and 90 blocks of
3D data were obtained for the validation set.
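The overlap-cropping step along the slice axis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is ours, and volumes are assumed to be stored slice-first as (Z, H, W) arrays.

```python
import numpy as np

def crop_overlapping_blocks(volume, depth=32, stride=8):
    """Split a (Z, H, W) CT volume into overlapping blocks of `depth`
    slices along the z-axis, stepping by `stride` (8 for training data)."""
    z = volume.shape[0]
    if z <= depth:
        return [volume]
    starts = list(range(0, z - depth + 1, stride))
    if starts[-1] != z - depth:  # keep a final block flush with the last slice
        starts.append(z - depth)
    return [volume[s:s + depth] for s in starts]

# Toy example: a 64-slice volume yields five overlapping 32-slice blocks.
vol = np.zeros((64, 160, 160), dtype=np.float32)
blocks = crop_overlapping_blocks(vol)
print(len(blocks), blocks[0].shape)  # -> 5 (32, 160, 160)
```

For the validation and testing data the paper instead uses non-overlapping blocks, which corresponds to a stride equal to the block depth.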
Figure 2.
CT images and corresponding labels. A. CT images with GTV (red
contour is the manually delineated GTV). B. Label images for the
image presented in A. C. CT images without GTV. D.
Label images for the image presented in C.
Figure 3.
Examples of data cropping for the training, validation, and testing sets
(numbers show which CT image slices are included in each 3D data block). A.
Example of data cropping for the training set. B. Example of data
cropping for the validation and testing sets.
In addition, considering the difference in CT value distribution between
subjects, the pixel intensity of the CT images was normalized to 0-1 using a
Hounsfield unit (HU) window of [-180, 220]. This window corresponds to the
mediastinal CT window, which the radiation oncologist observes when
delineating the GTV of LC. Finally, given the limited data resources, data
augmentation is an unavoidable choice to obtain better performance on unseen
data. Therefore, random zoom and random rotation were adopted to augment the
training data; this process was implemented using the multi-dimensional
image processing package (scipy.ndimage) in SciPy.
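The intensity normalization described above amounts to clipping to the mediastinal window and rescaling linearly to [0, 1]. A minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def normalize_hu(ct, window=(-180.0, 220.0)):
    """Clip a CT volume to the mediastinal HU window [-180, 220]
    and rescale intensities linearly to the range [0, 1]."""
    lo, hi = window
    ct = np.clip(ct.astype(np.float32), lo, hi)
    return (ct - lo) / (hi - lo)

ct = np.array([-1000.0, -180.0, 20.0, 220.0, 3000.0])  # example HU values
print(normalize_hu(ct))  # -> [0.  0.  0.5 1.  1. ]
```

Values below -180 HU and above 220 HU saturate at 0 and 1, which mirrors what the oncologist sees in the mediastinal window.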
Architecture
In the field of medical image segmentation, U-net
has become one of the most well-known structures. 3D U-net
is an improved version of the basic U-net model and enables 3D
volumetric segmentation using very few annotated examples. More importantly,
the information on adjacent slices of an image can be transmitted through
the network to provide more consistent predictions. The delineation of GTV
is mainly dependent on the patient's anatomical structure and tumor
presentation on the CT images. Thus, we propose to apply the 3D U-net model
as the base model for GTV segmentation, and the influence of depth of 3D
Unet on segmentation performance is analyzed. To further strengthen the
ability to extract features, and inspired by ResNet and SE-Net, the residual
connection and the channel attention mechanism are introduced into 3D Unet.
The effects of these improvements are also compared.

In this paper, a model called 3D ResSE-Unet is proposed for target
segmentation, which is an improved version of 3D Unet. The network diagram
is shown in Figure 4. The network is composed of a contracting path to
capture context and a symmetric expanding path that enables precise
localization. Four max pooling operations are stacked in the contracting
path to reduce the image resolution, expand the receptive field, and explore
more detailed features. In the expanding path, the image resolution is
recovered by upsampling operations, and to localize precisely,
high-resolution features from the contracting path are combined with the
upsampled output. Our network architecture contains seven ResSE blocks, four
max pooling operations, and four upsampling operations. The last layer is a
3 × 3 × 3 convolution with two output channels that produces the predicted
map. The network parameters are summarized in Table 2.
Table 2.
Network parameters

Layer | Operation | Kernel size | Stride | Num. of filters | Input size (Cin × D × H × W) | Output size (Cout × D × H × W)
Double conv1 | (Conv3D + BN + ReLU) × 2 | 3 × 3 × 3 | (1,1,1) | 16 | 1 × 32 × 160 × 160 | 16 × 32 × 160 × 160
Max pool 1 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 16 × 32 × 160 × 160 | 16 × 16 × 80 × 80
ResSEblock1 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 32 | 16 × 16 × 80 × 80 | 32 × 16 × 80 × 80
Max pool 2 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 32 × 16 × 80 × 80 | 32 × 8 × 40 × 40
ResSEblock2 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 64 | 32 × 8 × 40 × 40 | 64 × 8 × 40 × 40
Max pool 3 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 64 × 8 × 40 × 40 | 64 × 4 × 20 × 20
ResSEblock3 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 128 | 64 × 4 × 20 × 20 | 128 × 4 × 20 × 20
Max pool 4 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 128 × 4 × 20 × 20 | 128 × 2 × 10 × 10
ResSEblock4 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 256 | 128 × 2 × 10 × 10 | 256 × 2 × 10 × 10
Trans Conv1 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 256 × 2 × 10 × 10 | 128 × 4 × 20 × 20
ResSEblock5 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 128 | 256 × 4 × 20 × 20 | 128 × 4 × 20 × 20
Trans Conv2 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 128 × 4 × 20 × 20 | 64 × 8 × 40 × 40
ResSEblock6 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 64 | 128 × 8 × 40 × 40 | 64 × 8 × 40 × 40
Trans Conv3 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 64 × 8 × 40 × 40 | 32 × 16 × 80 × 80
ResSEblock7 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 32 | 64 × 16 × 80 × 80 | 32 × 16 × 80 × 80
Trans Conv4 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 32 × 16 × 80 × 80 | 16 × 32 × 160 × 160
Double conv2 | (Conv3D + BN + ReLU) × 2 | 3 × 3 × 3 | (1,1,1) | 16 | 32 × 32 × 160 × 160 | 16 × 32 × 160 × 160
Last conv | Conv3D | 3 × 3 × 3 | (1,1,1) | 2 | 16 × 32 × 160 × 160 | 2 × 32 × 160 × 160

The design of the ResSE block is presented in Figure 4. The following
expression denotes the details of the residual connection:

x_{l+1} = f(F(x_l) + x_l)

where x_l and x_{l+1} correspond to the input of the l-th layer and the
(l+1)-th layer, respectively. F denotes the residual function, which is
composed of several operations, including convolution, batch normalization
(BN), rectified linear unit (ReLU), and the SE block. f denotes the
activation function, and ReLU was used in this work. The residual block adds
F(x_l) to x_l to improve the information flow. This behavior allows the
network to preserve feature maps in deeper neural networks, addressing
vanishing gradients and making networks easier to optimize.

The SE module can selectively strengthen useful features and suppress useless
features by learning to use global information, thereby achieving feature
recalibration. As shown in Figure 4, C, H, and W represent the channel
number, the height, and the width of the feature map, respectively, and r
represents the reduction ratio, set to 2 in this work. This method
implements attention weighting on the channels in three steps. First, global
spatial information is squeezed using global average pooling, generating a
C × 1 × 1 channel descriptor. Second is the excitation operation, in which a
bottleneck with two fully connected (FC) layers around a ReLU unit is
formed: the channel feature number is first compressed to C/r, passed
through a ReLU function to increase non-linearity, then restored to C, and
finally passed through a sigmoid function to obtain the weight of each
channel feature. During this process, important channel features receive
larger weights, and unimportant channel features receive smaller weights.
Finally, each channel feature is multiplied by its corresponding weight to
form the output of the SE block.
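The three SE steps (squeeze, excitation, recalibration) can be illustrated with a small NumPy sketch. This is a framework-free 2D illustration with hypothetical weight matrices, not the paper's PyTorch implementation, which operates on 3D feature maps:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on a feature map x of shape (C, H, W).
    w1: (C//r, C) and w2: (C, C//r) are the two FC layers of the bottleneck."""
    # 1) Squeeze: global average pooling -> one descriptor per channel.
    z = x.mean(axis=(1, 2))                   # shape (C,)
    # 2) Excitation: FC (compress to C/r) -> ReLU -> FC (restore to C) -> sigmoid.
    h = np.maximum(w1 @ z + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # channel weights in (0, 1)
    # 3) Recalibration: scale each channel by its learned weight.
    return x * s[:, None, None]

# Toy check with C = 4 channels and reduction ratio r = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
w2, b2 = rng.standard_normal((4, 2)), np.zeros(4)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)  # -> (4, 8, 8)
```

Because the sigmoid weights lie in (0, 1), each channel is attenuated according to its learned importance while the spatial layout is unchanged.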
Figure 4.
The diagram of the 3D ResSE-Unet structure. A. The architecture of 3D
ResSE-Unet. B. The design of the ResSE block. C. The structure of the SE
block (H represents the height of the input feature, W the width, C the
channel number, and r the reduction ratio, set to 2 in this work).
Loss function
When training a CNN model, choosing an appropriate loss function can improve
network performance. Considering that there is a problem of
foreground-background class imbalance in this task, we designed a mixed loss
function, as defined in Eq. 1:

L_mix = L_Dice + L_Focal    (1)

where L_Dice and L_Focal represent the Dice loss and the Focal loss,
respectively. They are defined as follows:

L_Dice = 1 - 2|X ∩ Y| / (|X| + |Y|)    (2)

where X and Y represent the ground truth and the prediction result. The Dice
loss is suitable for severe class-imbalance tasks; however, in routine
tasks, the Dice loss can destabilize backpropagation and lead to training
difficulty.

L_Focal = -α(1 - p_t)^γ log(p_t)    (3)

where α is the weighting factor that balances the importance of
positive/negative examples and γ is a modulating factor. The Focal loss can
be seen as a variation of binary cross-entropy. By down-weighting the
contribution of easy examples and focusing more on learning hard examples,
it works well for highly imbalanced class scenarios.
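The two terms can be sketched in NumPy as follows. This is an illustration, not the authors' code: the paper does not specify the mixing weights, so an equal-weight sum is assumed, and α = 0.25, γ = 2 are the commonly used Focal-loss defaults.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|X∩Y| / (|X|+|Y|) over probability maps."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(pred, eps, 1.0 - eps)
    p_t = np.where(target == 1, p, 1.0 - p)     # prob. of the true class
    a_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def mixed_loss(pred, target):
    """Equal-weight sum of the two terms (the paper's exact mixing
    weights are not specified; this is an assumption)."""
    return dice_loss(pred, target) + focal_loss(pred, target)

pred = np.array([0.9, 0.8, 0.2, 0.1])    # predicted foreground probabilities
target = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth labels
print(round(mixed_loss(pred, target), 4))
```

Note how the (1 - p_t)^γ factor drives the contribution of confident, easy voxels toward zero, which is what makes the Focal term effective on the heavily background-dominated CT volumes.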
Evaluation
The testing set was used to evaluate the predictive performance of the 3D
ResSE-Unet. The ground-truth volumes were contoured manually by experienced
senior radiation oncologists, and the difference between the auto-delineated
GTV and the ground truth was quantified by the Dice similarity coefficient
(DSC), the 95th-percentile Hausdorff distance (HD95), and the mean surface
distance (MSD).
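The DSC on binary masks can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def dice_coefficient(a, b):
    """DSC = 2|A∩B| / (|A|+|B|) for two binary masks a and b."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Two overlapping 2x2 squares: |A| = 4, |B| = 4, |A∩B| = 2 -> DSC = 0.5
a = np.zeros((4, 4), dtype=np.uint8); a[0:2, 0:2] = 1
b = np.zeros((4, 4), dtype=np.uint8); b[1:3, 0:2] = 1
print(dice_coefficient(a, b))  # -> 0.5
```

HD95 and MSD are surface-based metrics and are usually computed from the distances between the two contour surfaces with an existing medical-imaging library rather than by hand.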
Model training
The proposed models were implemented with the PyTorch framework on a Linux
operating system using the Python application programming interface and
accelerated by an NVIDIA graphics card. To prevent overfitting, a batch
normalization operation was performed after each convolution operation, and
the Kaiming function was used to initialize the network parameters. In the
training stage, the learning rate was set to 0.00015 in the Adam optimizer,
the batch size was set to 2, and the mixed loss function was used as the
training loss. The maximum number of epochs was 90, and the loss value
decreased with the epoch number. After each training epoch, validation was
performed on the validation set, and only the best parameters were saved.
All experiments in this article were performed on an Intel Xeon E5-2650 v4
(2.2 GHz) processor and an NVIDIA Tesla T4 graphics card.
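The training schedule above (90 epochs, best-on-validation checkpointing) reduces to a simple loop. The skeleton below is framework-agnostic: `fit_batch`, `evaluate`, and `state` are hypothetical stand-ins for the real PyTorch forward/backward step, validation pass, and `state_dict()` call.

```python
def train(model, train_data, val_data, epochs=90):
    """Schematic training loop: step through batches, validate after each
    epoch, and keep only the parameters with the best validation score."""
    best_score, best_state = -1.0, None
    for epoch in range(epochs):
        for batch in train_data:          # batches of size 2 in the paper
            model.fit_batch(batch)        # one optimizer step on the mixed loss
        score = model.evaluate(val_data)  # e.g. mean validation DSC
        if score > best_score:            # save only the best parameters
            best_score, best_state = score, model.state()
    return best_score, best_state

class ToyModel:
    """Stand-in for the real network, used only to exercise the loop."""
    def __init__(self): self.steps = 0
    def fit_batch(self, batch): self.steps += 1
    def evaluate(self, val): return min(0.9, 0.5 + 0.01 * self.steps)
    def state(self): return self.steps

best, state = train(ToyModel(), train_data=[[0, 1]] * 5, val_data=None, epochs=3)
print(best)  # -> 0.65
```

In the real setup the optimizer would be Adam with lr = 1.5e-4 and the loss the Dice + Focal mixture described earlier.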
Results
After training the models, CT images from the testing set were fed into the
best-performing model to perform GTV delineation, and the results were
evaluated qualitatively and quantitatively.
Comparison of different depths of 3D Unet
To find a suitable depth of 3D Unet for GTV segmentation, 3D Unets of
different depths were trained: 3D Unet_3B, 3D Unet_4B, and 3D Unet_5B, which
include three, four, and five downsamplings, respectively. The number of
convolution channels in each layer of 3D Unet_3B, from shallow to deep, is
16, 32, 64, 128; for 3D Unet_4B it is 16, 32, 64, 128, 256; and for 3D
Unet_5B it is 16, 32, 64, 128, 256, 512.

Quantitative evaluation results of the three depths of 3D Unet are
summarized in Table 3. As shown, 3D Unet_4B achieved better segmentation
results: its average DSC, HD95, and MSD reached 0.7090, 33.89 mm, and
7.030 mm, respectively, while all three quantitative metrics of 3D Unet_3B
and 3D Unet_5B were inferior to those of 3D Unet_4B.
Table 3.
Comparison of quantitative evaluation metrics of 3D Unet with different
depths.

Method | DSC | HD95 (mm) | MSD (mm)
3D Unet_3B | 0.6979 | 40.72 | 9.392
3D Unet_4B | 0.7090 | 33.89 | 7.030
3D Unet_5B | 0.6936 | 48.94 | 9.427

The partial segmentation results on the testing set are shown in Figure 5.
Intuitively, 3D Unet segmentation results of different depths all have the
problem of false positives. However, compared with 3D Unet_3B and 3D
Unet_5B, there are fewer false positives and false negatives in 3D
Unet_4B.
Figure 5.
Comparison of segmentation results of different depths of 3D Unet.
A-C. 3D Unet_3B segmentation results. D-F. 3D Unet_4B segmentation
results. G-I. 3D Unet_5B segmentation results.
Comparison of 3D ResSE-Unet, 3D Res-Unet, 3D SE-Unet and 3D Unet
To investigate the effectiveness of the proposed segmentation model, 3D
Unet, 3D Res-Unet, 3D SE-Unet, and 3D ResSE-Unet were each trained. Compared
with 3D Unet, 3D Res-Unet introduced the residual connection, 3D SE-Unet
introduced the channel attention mechanism, and 3D ResSE-Unet introduced
both at the same time. For a fair comparison, 3D SE-Unet, 3D Res-Unet, and
3D ResSE-Unet were trained with the same hyperparameters as 3D Unet.

As shown in Table
4, the quantitative evaluation results of the four networks on the testing
set are summarized. Compared with 3D Unet, both the residual connection and
the channel attention mechanism improve the segmentation results. The
residual connection increases the DSC of 3D Unet from 0.7090 to 0.7247 and
reduces the HD95 from 33.89 mm to 21.64 mm and the MSD from 7.030 mm to
5.121 mm. The channel attention mechanism increases the DSC from 0.7090 to
0.7222 and reduces the HD95 from 33.89 mm to 23.46 mm and the MSD from
7.030 mm to 5.487 mm. Moreover, introducing the residual connection and the
channel attention mechanism at the same time achieves the best segmentation
results, with average DSC, HD95, and MSD of 0.7367, 21.39 mm, and 4.962 mm,
respectively.
Table 4.
Quantitative evaluation metrics comparison of different models.

Method | DSC | HD95 (mm) | MSD (mm)
3D Unet | 0.7090 | 33.89 | 7.030
3D SE-Unet | 0.7222 | 23.46 | 5.487
3D Res-Unet | 0.7247 | 21.64 | 5.121
3D ResSE-Unet | 0.7367 | 21.39 | 4.962

Representative comparison results of the four models are shown in Figure 6. As shown,
the segmentation results of 3D Unet suffer from false positives. The
introduction of the residual connection and the channel attention mechanism
alleviates this problem: compared with 3D Unet, false positives are reduced
in the results of 3D Res-Unet, 3D SE-Unet, and 3D ResSE-Unet, and 3D
ResSE-Unet achieves the best results.
Figure 6.
Comparison of segmentation results of 3D ResSE-Unet, 3D Res-Unet, 3D
SE-Unet, and 3D Unet. A1-C1. 3D ResSE-Unet segmentation results.
A2-C2. 3D Res-Unet segmentation results. A3-C3. 3D SE-Unet
segmentation results. A4-C4. 3D Unet segmentation results.

In addition, we also compared the network parameters and average segmentation
time of the four models, as shown in Table 5. The introduction of the
channel attention mechanism hardly increases the number of model parameters
and does not reduce the segmentation efficiency, whereas the introduction of
residual connections increases the number of model parameters and slightly
reduces the segmentation efficiency. Compared with 3D Unet, the parameters
of 3D ResSE-Unet increase from 21.54 MB to 44.66 MB, but the average
segmentation time increases by only about one second.
Table 5.
Comparison of network parameters and average segmentation time.

Method | Parameters (MB) | Average segmentation time per patient (s)
3D Unet | 21.54 | 9.58
3D SE-Unet | 21.75 | 9.60
3D Res-Unet | 44.24 | 10.58
3D ResSE-Unet | 44.66 | 10.61
Discussion
Radiotherapy is one of the main treatments for stage III NSCLC. Accurately
delineating the GTV is essential to achieve precise radiotherapy. Manual
delineation by radiologists is time-consuming and subject to inter- and
intra-observer variability. However, these problems can be mitigated by
automatic segmentation methods based on CNNs. At
present, the research on the automatic delineation of GTV for NSCLC radiotherapy
mainly uses 2D CNNs and ignores spatial features of tumors from CT data. In this
work, we chose 3D Unet as the base model and used two different methods to improve
3D Unet. We designed a model named 3D ResSE-Unet and achieved the automatic
segmentation of GTV of stage III NSCLC radiotherapy. The segmentation results of
different depths of 3D Unet are shown in Table 3 and Figure 5. From the
perspective of the 3D Unet structure, the deeper the network, the more
feature scales can be extracted and the better the segmentation results,
which can explain why 3D Unet_4B outperforms 3D Unet_3B. However, a deeper
network also loses more spatial information through max pooling operations,
which is unsuitable for segmenting small targets. In the training set, some
tumors were very small (the minimum GTV was 3.884 cm³), and the overlap
cropping used in preprocessing means that a 3D training block may contain
only part of the GTV. Therefore, the segmentation result of 3D Unet_5B is
not as good as that of 3D Unet_4B.

Compared with other depths, 3D Unet with four downsamplings is the more suitable
structure for our work, but the segmentation results still have the problem of false
positives. Two methods were adopted to solve this problem in this article. To solve
the problem of vanishing gradients and strengthen the transmission of features, the
residual connection mechanism is introduced into 3D Unet. And the channel attention
mechanism also has been introduced into 3D Unet to strengthen the useful channel
features and suppress the useless ones. As shown in Tables 4 and 5 and
Figure 6, compared with 3D Unet, the introduction of the residual connection
and the channel attention mechanism reduces false positives and improves
segmentation performance, with 3D ResSE-Unet achieving the best results.
Although these two mechanisms slightly reduce the segmentation efficiency,
segmenting one case takes only about 10 s, which still meets the needs of
clinical applications.

The comparison between the approach proposed in this article and three lung tumor
delineation methods developed in previous papers has been summarized in Table 6. Compared with 2D
CNNs,[21,23] the proposed model can obtain the same segmentation accuracy
while using fewer cases. This is due to the overlap cropping technique used in
preprocessing: each case is utilized as fully as possible, and the number of 3D
data blocks available for training is expanded. In addition, the segmentation
model can make full use of the z-axis information of the CT images. With the
same number of cases, our method is therefore more likely to obtain better
segmentation performance. Compared with the research of Cui Y et al,
our segmentation results are only average. The dense connections and V-net used in
their segmentation model provide new ideas for our follow-up research. However, the
influence of each convolution channel feature on the prediction result is ignored in
their study. Their segmentation performance may be further improved by introducing
the channel attention mechanism.
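To make the channel attention idea concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation step combined with a residual connection on a 3D feature map. It is a simplified stand-in for the actual 3D ResSE-Unet building block, which also contains 3D convolutions and normalization; the weight shapes and reduction ratio below are illustrative assumptions, not the trained model's configuration.

```python
import numpy as np

def se_residual_block(x, w1, w2):
    """Channel attention (squeeze-and-excitation) followed by a residual
    connection, applied to a 3D feature map x of shape (C, D, H, W).

    w1: (C, C//r) and w2: (C//r, C) are the two fully connected layers of
    the excitation bottleneck (r is the channel reduction ratio).
    """
    # Squeeze: global average pooling over the three spatial dimensions.
    z = x.mean(axis=(1, 2, 3))                   # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    s = np.maximum(z @ w1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))          # channel weights in (0, 1)
    # Scale each channel by its weight and add the identity (residual) path.
    return x + x * s[:, None, None, None]

# Toy usage: 8 channels, a 4x4x4 spatial grid, reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 4))
w1 = 0.1 * rng.standard_normal((8, 2))
w2 = 0.1 * rng.standard_normal((2, 8))
y = se_residual_block(x, w1, w2)
print(y.shape)  # (8, 4, 4, 4): the output keeps the input shape
```

Because each learned channel weight lies in (0, 1), the excitation path can only rescale channels, while the residual path keeps the identity mapping always available; this combination is what lets the attention block strengthen useful channels without hindering gradient flow.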
Table 6.
Comparison of 3D ResSE-Unet and other methods of tumor delineation for NSCLC
radiotherapy
Although we have achieved automatic segmentation of GTV for stage III NSCLC, our
experiment still has the following limitations. Firstly, only 214 cases of stage III
NSCLC have been collected for our experiment. This number is relatively small and
needs to be further increased. Tumor location, shape, and size vary greatly
among patients. Increasing the number of cases used for training
may further improve the generalization ability and prediction accuracy of the
segmentation model. Secondly, we have only realized the automatic segmentation of
stage III NSCLC. The segmentation effect of this model on GTV of stages I, II, and
IV NSCLC needs further study. Thirdly, compared with other cancers, lung tumors vary
greatly in size, shape, and location. The relationship between these features and
segmentation accuracy has not been further analyzed. Fourth, there is no further
comparison between deep learning-assisted delineation and manual delineation in
terms of efficiency and inter- and intra-observer variability. Fifth, we only
performed a joint assessment of the primary gross tumor volume and the lymph node
gross tumor volume and did not analyze their segmentation results separately. Sixth,
our department did not adopt respiratory motion management until 2018, and in order
to obtain enough cases, we collected patients from 2017 to 2020, so our experiments
were carried out with free breathing.
In the future, we can make some new attempts to achieve better segmentation
performance. Firstly, compared to the residual connection, a more extreme connection
pattern has been developed, which is called the dense connection.
In this pattern, each layer receives the output features of all previous
layers as input and passes its own feature maps to all subsequent layers. The
dense connection can also alleviate the vanishing gradient problem and
encourage feature reuse. In future work, the residual connections may be
replaced with dense connections. Secondly, the channel attention mechanism only
attends to differences between channels and ignores local information within
each channel. The local spatial attention mechanism[31-33] can address this problem by
calculating the feature importance of each pixel in the space domain. Thus,
combining the advantages of the two attention mechanisms to improve the segmentation
effect is the next work that can be studied. Thirdly, our research is only based on
CT images, which can provide high-resolution anatomical details. Currently, PET/CT
and magnetic resonance imaging (MRI) have been widely used in the diagnosis and
treatment of cancer. PET images can provide quantitative metabolic information. MRI
can provide clear soft tissue contrast and help to distinguish the tumor from the
surrounding normal tissues. Integrating multi-modal images can obtain richer tumor
feature information and may improve the accuracy of tumor segmentation. Some
scholars have already carried out research based on multi-modal images.[34-38]
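The dense connection pattern mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the connectivity only (channel-wise concatenation of the block input and all previous layer outputs); the toy 1x1x1 "convolutions" and the growth rate of 2 are illustrative assumptions, not the design of any specific published model.

```python
import numpy as np

def conv1x1(w):
    """Toy 1x1x1 convolution: mixes channels with weights w of shape
    (out_channels, in_channels), followed by ReLU."""
    return lambda f: np.maximum(np.einsum('oc,cdhw->odhw', w, f), 0.0)

def dense_block(x, layers):
    """Dense connectivity on a 3D feature map x of shape (C, D, H, W):
    each layer receives the concatenation of the block input and all
    previous layer outputs, and the block output concatenates everything."""
    features = [x]
    for layer in layers:
        inp = np.concatenate(features, axis=0)   # concat along channels
        features.append(layer(inp))
    return np.concatenate(features, axis=0)

# Toy usage: input with 4 channels, growth rate 2, two layers.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 4, 4))
layers = [conv1x1(rng.standard_normal((2, 4))),   # sees 4 channels
          conv1x1(rng.standard_normal((2, 6)))]   # sees 4 + 2 channels
out = dense_block(x, layers)
print(out.shape)  # (8, 4, 4, 4): 4 input channels + 2 + 2 grown channels
```

Note how the input feature map survives unchanged in the output: every layer has a direct path to the block input and to the loss, which is why dense connectivity alleviates vanishing gradients and encourages feature reuse.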
Conclusion
In this article, a 3D CNN named 3D ResSE-Unet is proposed for GTV segmentation of
stage III NSCLC. This model can fully exploit the three-dimensional spatial
information of tumors and realize accurate and rapid segmentation of the GTV. 3D
ResSE-Unet is based on 3D Unet and combines the advantages of residual connection
and channel attention mechanism. Compared with 3D Unet, 3D ResSE-Unet can
achieve more accurate segmentation and solve the problem of over-segmentation.
This model provides a new tool for the automatic delineation of GTV for lung
cancer radiotherapy, although the current segmentation results still need to be
adjusted manually before clinical application. In the future, the proposed
method may be further refined to improve segmentation accuracy and efficiency
and to help achieve accurate and effective radiotherapy.