Xinhao Yu1,2, Fu Jin2, HuanLi Luo2, Qianqian Lei2, Yongzhong Wu2. 1. College of Bioengineering, Chongqing University, Chongqing, China. 2. Department of Radiation Oncology, Chongqing University Cancer Hospital, Chongqing, China.
Abstract
INTRODUCTION: Radiotherapy is one of the most effective ways to treat lung cancer, and accurately delineating the gross tumor volume is a key step in the radiotherapy process. In current clinical practice, the target area is still delineated manually by radiologists, which is time-consuming and laborious; these problems can be better addressed by deep learning-assisted automatic segmentation methods. METHODS: In this paper, a 3D CNN model named 3D ResSE-Unet is proposed for gross tumor volume segmentation for stage III NSCLC radiotherapy. This model is based on 3D Unet and combines residual connections and a channel attention mechanism. Three-dimensional convolution operations and an encoding-decoding structure are used to mine three-dimensional spatial information of tumors from computed tomography data, and, inspired by ResNet and SE-Net, residual connections and channel attention are used to improve segmentation performance. A total of 214 patients with stage III NSCLC were collected selectively; 148 cases were randomly selected as the training set, 30 cases as the validation set, and 36 cases as the testing set. The segmentation performance of the models was evaluated on the testing set. In addition, the segmentation results of 3D Unets of different depths were analyzed, and the performance of 3D ResSE-Unet was compared with 3D Unet, 3D Res-Unet, and 3D SE-Unet. RESULTS: Compared with other depths, 3D Unet with four downsampling steps is more suitable for our work. Compared with 3D Unet, 3D Res-Unet, and 3D SE-Unet, 3D ResSE-Unet obtains superior results: its Dice similarity coefficient, 95th-percentile Hausdorff distance, and average surface distance reach 0.7367, 21.39 mm, and 4.962 mm, respectively, and the average time for 3D ResSE-Unet to segment one patient is only about 10 s. CONCLUSION: The method proposed in this study provides a new tool for GTV auto-segmentation and may be useful for lung cancer radiotherapy.
Lung carcinoma (LC) is one of the most severe and widespread cancers in the world.
Statistics from the World Health Organization (WHO) in 2020 showed that
there were 815,563 new cases of LC and 714,699 deaths in China. Currently, in
addition to surgery and chemotherapy, radiotherapy (RT) is the most effective
treatment for LC, and compared with other stages, patients with stage III non-small
cell lung cancer (NSCLC) are mainly treated by radiotherapy.

In the radiotherapy workflow for patients with LC, precise delineation of gross tumor
volume (GTV) on computed tomography (CT) images is an essential step. Other tumor
target areas are based on GTV and consider the influence of potential invaded
tissues, positioning errors, and other factors. Inaccurate delineation of GTV will
result in unnecessary damage to normal tissues or undertreatment in the tumor target
area. In clinical practice, GTV is usually manually delineated by radiologists.
However, manual delineation is a time-consuming and laborious process, and the start
of radiotherapy will be delayed as a result.
In addition, manual delineation is a subjective process, and the
radiologist’s experience will have a great influence on the delineation results.
Multiple studies have reported that this process has considerable inter-observer and
intra-observer variability.[4-7]
Thus, it is necessary to develop suitable automatic segmentation methods to relieve
the workload of radiologists in defining the target volume and to improve the
consistency of target delineation.

Deep learning (DL) is a subfield of artificial intelligence and machine learning that has achieved
tremendous success in recent years in various fields in science.[8-10] In medical image
segmentation, DL-based auto-segmentation techniques have been shown to provide
significant improvements over more traditional approaches.[11,12] Convolution neural networks
(CNNs) are the most successful and popular DL architecture applied to image
processing. Many studies have confirmed that CNNs are helpful for tumor
target delineation in radiotherapy for head and neck cancer, breast cancer, and
rectal cancer.[13-20] Some scholars have also conducted research on automatic
segmentation of lung tumor target volume based on CNNs.[21-25] To explore the role of deep
learning-assisted delineation, Bi N et al used a dilated residual
network to delineate the CTV of NSCLC for postoperative radiation therapy;
compared with manual delineation, CNN-assisted delineation achieved better
segmentation accuracy, consistency, and efficiency.
To facilitate the analysis of geometric tumor changes during
radiotherapy, a CNN model named A-net was designed to delineate the GTV of LC with a
DSC of 0.82.
Zhang F et al proposed an automatic segmentation method
based on ResNet and analyzed the role of the DL-assisted method for GTV segmentation
of NSCLC.
To monitor tumor response to therapy, Jiang J et al extended
the full-resolution residual neural network and developed a multiple-resolution
residually connected network for tumor segmentation of NSCLC.
To achieve delineation of GTV for LC stereotactic body radiation therapy,
Cui Y et al proposed CT-based dense V-networks with a DSC of 0.82.
Based on the above research, we reason that the automatic segmentation of GTV
for LC radiotherapy can be achieved through CNNs. However, the above studies have
three issues. First, most of the above studies use 2D CNNs and ignore the
high-dimensional spatial features of tumors.[21-24] When delineating the GTV of
LC, the radiologist needs to refer to adjacent CT slices to determine the trend of
tumor growth. Therefore, it is worth designing a 3D CNN to mine three-dimensional
spatial information from CT images to segment GTV. Second, with the increase of the
network depth, CNNs are prone to the problem of vanishing gradients, and some
studies did not consider this problem.
Third, the contribution of each channel feature in CNNs to the prediction
result is different. The performance of the model can be effectively improved by
using an appropriate attention mechanism. However, this point is ignored in the
above research.[21-25]

In this work, we proposed a 3D CNN named 3D ResSE-Unet to achieve GTV segmentation of
stage III NSCLC on computed tomography (CT) images. The main innovations of this
article are as follows. First, 3D convolution operations were used to mine the
three-dimensional spatial correlation of GTV, and the influence of the depth of the
3D Unet on the segmentation results was explored. Second, we introduced the residual
connection mechanism and channel attention mechanism into the 3D Unet to improve the
robustness of the model. The residual connection was adopted to address the
optimization problem and vanishing gradients. The channel attention mechanism was
used to strengthen the model's representational power by selectively emphasizing
useful features and suppressing useless ones. The modified version of 3D Unet was
used to segment GTV from CT images of 214 stage III NSCLC patients, and compared
with 3D Unet, 3D Res-Unet, and 3D SE-Unet, 3D ResSE-Unet obtained superior
results. Third, to address class imbalance, we designed a mixed loss function based
on the Dice loss and the Focal loss for GTV segmentation. Fourth, batch
normalization (BN) was adopted in the network training process to prevent
overfitting and improve the accuracy of target delineation. Finally, the Dice
similarity coefficient (DSC), 95th-percentile Hausdorff distance (HD95),
and mean surface distance (MSD) were used to evaluate the accuracy of the model's
predictions, and the complexity and segmentation time of the segmentation models
were also compared and analyzed.
Methods
The experimental process of this article mainly includes three steps: data
preprocessing, segmentation model training, and segmentation result evaluation. The
flowchart of the method can be seen in Figure 1.
Figure 1.
Flowchart of the 3D CNN-based segmentation method
Data sets
Data from patients with stage III NSCLC treated from January 2017 to October
2020 in the Department of Radiation Oncology, Chongqing University Cancer
Hospital, were collected selectively. The clinical staging of tumors was
based on the eighth edition of the staging system of the International
Association for the Study of Lung Cancer (IASLC). This work was approved by
the ethics committee of Chongqing University Cancer Hospital (No.
CZLS2021231-A, Date: 13-Sep-2021), and written consent was provided by all
patients to store their medical information in the hospital database. In
addition, all patient details have been de-identified. A total of 214
patient cases were collected: 148 cases were randomly selected as the
training set, 30 cases were used as the validation set, and 36 cases were
used as the testing set. The training set was used to train the segmentation
model and learn the feature distribution of GTV. The validation set was used
to select the best segmentation model, and the segmentation performance of
the models on new data was assessed with the testing set. The general
characteristics of the training, validation, and testing sets are shown in
Table 1.
Table 1.
Characteristics of 214 patients with stage III NSCLC

The patients' data were acquired on a Philips BigBore CT simulator (Philips
Medical Systems, Madison, WI) in helical scan mode (120 kV, 30 mA), with
slice thicknesses of 5 mm or 3 mm. Iodine contrast agents were used for all
patients, and CT images were obtained during free breathing. The planning CT
images and radiotherapy structures of each patient were exported as Digital
Imaging and Communications in Medicine (DICOM) files. Delineation of the GTV
was carried out by a senior lung cancer radiologist with more than 10 years
of work experience and then peer-reviewed by two other experts. In this
study, these GTV contours delineated by radiologists were referred to as the
ground truth. The criteria used by the radiologists to delineate the GTV of
stage III NSCLC were based on the NCCN Clinical Practice Guidelines in
Oncology for Non-Small Cell Lung Cancer, and both the primary gross tumor
volume and the lymph node gross tumor volume were included.
Preprocessing
To make full use of the three-dimensional spatial information of CT images,
the images were processed according to the following steps. As shown in
Figure 2, the GTV contours were extracted from the radiotherapy structure of
each patient using Python, and the CT images and GTV contours of each
patient were converted into 3D matrices using the SimpleITK module. To
maintain consistency across patients, resampling operations were applied to
the image and contour matrices so that each had a slice thickness of 5.0 mm
and a pixel pitch of 1.0 mm. To reduce the computational burden and memory
consumption, input images were randomly cropped into 3D volumes of
160 × 160 × 32 voxels. To make full use of the spatial information of the CT
data, the input data were prepared as overlapping blocks; this overlapping
technique ensures that the segmentation model can utilize as much
information along the third axis as possible. The overlap stride was set to
8 slices for the training data, but this method was not used for the
validation and testing data. An example can be seen in Figure 3. In the end,
1159 blocks of 3D data were obtained for the training set, and 90 blocks of
3D data were obtained for the validation set.
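The overlap-cropping step along the slice axis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is ours, and volumes are assumed to be stored slice-first as (Z, H, W) arrays.

```python
import numpy as np

def crop_overlapping_blocks(volume, depth=32, stride=8):
    """Split a (Z, H, W) CT volume into overlapping blocks of `depth`
    slices along the z-axis, stepping by `stride` (8 for training data)."""
    z = volume.shape[0]
    if z <= depth:
        return [volume]
    starts = list(range(0, z - depth + 1, stride))
    if starts[-1] != z - depth:  # keep a final block flush with the last slice
        starts.append(z - depth)
    return [volume[s:s + depth] for s in starts]

# Toy example: a 64-slice volume yields five overlapping 32-slice blocks.
vol = np.zeros((64, 160, 160), dtype=np.float32)
blocks = crop_overlapping_blocks(vol)
print(len(blocks), blocks[0].shape)  # -> 5 (32, 160, 160)
```

For the validation and testing data the paper instead uses non-overlapping blocks, which corresponds to a stride equal to the block depth.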
Figure 2.
CT images and corresponding labels. A. CT images with GTV (red
contour is the manually delineated GTV). B. Label images for the
image presented in A. C. CT images without GTV. D.
Label images for the image presented in C.
Figure 3.
Examples of data cropping for the training, validation, and testing sets
(numbers show which CT image slices are included in each 3D data block). A.
Example of data cropping for the training set. B. Example of data
cropping for the validation and testing sets.
In addition, considering the difference in CT value distribution between
subjects, the pixel intensity of the CT images was normalized to 0-1 using a
Hounsfield unit (HU) window of [-180, 220]. This window corresponds to the
mediastinal CT window, which the radiation oncologist observes when
delineating the GTV of LC. Finally, given the limited data resources, data
augmentation is an unavoidable choice to obtain better performance on unseen
data. Therefore, random zoom and random rotation were adopted to augment the
training data; this process was implemented using the multi-dimensional
image processing package (scipy.ndimage) in SciPy.
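The intensity normalization described above amounts to clipping to the mediastinal window and rescaling linearly to [0, 1]. A minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def normalize_hu(ct, window=(-180.0, 220.0)):
    """Clip a CT volume to the mediastinal HU window [-180, 220]
    and rescale intensities linearly to the range [0, 1]."""
    lo, hi = window
    ct = np.clip(ct.astype(np.float32), lo, hi)
    return (ct - lo) / (hi - lo)

ct = np.array([-1000.0, -180.0, 20.0, 220.0, 3000.0])  # example HU values
print(normalize_hu(ct))  # -> [0.  0.  0.5 1.  1. ]
```

Values below -180 HU and above 220 HU saturate at 0 and 1, which mirrors what the oncologist sees in the mediastinal window.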
Architecture
In the field of medical image segmentation, U-net
has become one of the most well-known structures. 3D U-net
is an improved version of the basic U-net model and enables 3D
volumetric segmentation using very few annotated examples. More importantly,
the information on adjacent slices of an image can be transmitted through
the network to provide more consistent predictions. The delineation of GTV
is mainly dependent on the patient's anatomical structure and tumor
presentation on the CT images. Thus, we propose to apply the 3D U-net model
as the base model for GTV segmentation, and the influence of depth of 3D
Unet on segmentation performance is analyzed. To further strengthen the
ability to extract features, and inspired by ResNet and SE-Net, the residual
connection and the channel attention mechanism are introduced into 3D Unet.
The effects of these improvements are also compared.

In this paper, a model called 3D ResSE-Unet is proposed for target
segmentation, which is an improved version of 3D Unet. The network diagram
is shown in Figure 4. The network is composed of a contracting path to
capture context and a symmetric expanding path that enables precise
localization. Four max pooling operations are stacked in the contracting
path to reduce the image resolution, expand the receptive field, and explore
more detailed features. In the expanding path, the image resolution is
recovered by upsampling operations, and to localize precisely,
high-resolution features from the contracting path are combined with the
upsampled output. Our network architecture contains seven ResSE blocks, four
max pooling operations, and four upsampling operations. The last layer is a
3 × 3 × 3 convolution with two output channels that produces the predicted
map. The network parameters are summarized in Table 2.
Table 2.
Network parameters

Layer | Operation | Kernel size | Stride | Num. of filters | Input size (Cin × D × H × W) | Output size (Cout × D × H × W)
Double conv1 | (Conv3D + BN + ReLU) × 2 | 3 × 3 × 3 | (1,1,1) | 16 | 1 × 32 × 160 × 160 | 16 × 32 × 160 × 160
Max pool 1 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 16 × 32 × 160 × 160 | 16 × 16 × 80 × 80
ResSEblock1 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 32 | 16 × 16 × 80 × 80 | 32 × 16 × 80 × 80
Max pool 2 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 32 × 16 × 80 × 80 | 32 × 8 × 40 × 40
ResSEblock2 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 64 | 32 × 8 × 40 × 40 | 64 × 8 × 40 × 40
Max pool 3 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 64 × 8 × 40 × 40 | 64 × 4 × 20 × 20
ResSEblock3 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 128 | 64 × 4 × 20 × 20 | 128 × 4 × 20 × 20
Max pool 4 | MaxPool3D | 2 × 2 × 2 | (2,2,2) | - | 128 × 4 × 20 × 20 | 128 × 2 × 10 × 10
ResSEblock4 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 256 | 128 × 2 × 10 × 10 | 256 × 2 × 10 × 10
Trans Conv1 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 256 × 2 × 10 × 10 | 128 × 4 × 20 × 20
ResSEblock5 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 128 | 256 × 4 × 20 × 20 | 128 × 4 × 20 × 20
Trans Conv2 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 128 × 4 × 20 × 20 | 64 × 8 × 40 × 40
ResSEblock6 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 64 | 128 × 8 × 40 × 40 | 64 × 8 × 40 × 40
Trans Conv3 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 64 × 8 × 40 × 40 | 32 × 16 × 80 × 80
ResSEblock7 | Conv3D + BN + ReLU + SE block | 3 × 3 × 3 | (1,1,1) | 32 | 64 × 16 × 80 × 80 | 32 × 16 × 80 × 80
Trans Conv4 | ConvTranspose3D | 2 × 2 × 2 | (2,2,2) | - | 32 × 16 × 80 × 80 | 16 × 32 × 160 × 160
Double conv2 | (Conv3D + BN + ReLU) × 2 | 3 × 3 × 3 | (1,1,1) | 16 | 32 × 32 × 160 × 160 | 16 × 32 × 160 × 160
Last conv | Conv3D | 3 × 3 × 3 | (1,1,1) | 2 | 16 × 32 × 160 × 160 | 2 × 32 × 160 × 160

The design of the ResSE block is presented in Figure 4. The following
expression denotes the details of the residual connection:

x_{l+1} = f(F(x_l) + x_l)

where x_l and x_{l+1} correspond to the input of the l-th layer and the
(l+1)-th layer, respectively. F denotes the residual function, which is
composed of several operations, including convolution, batch normalization
(BN), rectified linear unit (ReLU), and the SE block. f denotes the
activation function, and ReLU was used in this work. The residual block adds
F(x_l) to x_l to improve the information flow. This behavior allows the
network to preserve feature maps in deeper neural networks, addressing
vanishing gradients and making networks easier to optimize.

The SE module can selectively strengthen useful features and suppress useless
features by learning to use global information, thereby achieving feature
recalibration. As shown in Figure 4, C, H, and W represent the channel
number, the height, and the width of the feature map, respectively, and r
represents the reduction ratio, set to 2 in this work. This method
implements attention weighting on the channels in three steps. First, global
spatial information is squeezed using global average pooling, generating a
C × 1 × 1 channel descriptor. Second is the excitation operation, in which a
bottleneck with two fully connected (FC) layers around a ReLU unit is
formed: the channel feature number is first compressed to C/r, passed
through a ReLU function to increase non-linearity, then restored to C, and
finally passed through a sigmoid function to obtain the weight of each
channel feature. During this process, important channel features receive
larger weights, and unimportant channel features receive smaller weights.
Finally, each channel feature is multiplied by its corresponding weight to
form the output of the SE block.
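The three SE steps (squeeze, excitation, recalibration) can be illustrated with a small NumPy sketch. This is a framework-free 2D illustration with hypothetical weight matrices, not the paper's PyTorch implementation, which operates on 3D feature maps:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on a feature map x of shape (C, H, W).
    w1: (C//r, C) and w2: (C, C//r) are the two FC layers of the bottleneck."""
    # 1) Squeeze: global average pooling -> one descriptor per channel.
    z = x.mean(axis=(1, 2))                   # shape (C,)
    # 2) Excitation: FC (compress to C/r) -> ReLU -> FC (restore to C) -> sigmoid.
    h = np.maximum(w1 @ z + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # channel weights in (0, 1)
    # 3) Recalibration: scale each channel by its learned weight.
    return x * s[:, None, None]

# Toy check with C = 4 channels and reduction ratio r = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
w2, b2 = rng.standard_normal((4, 2)), np.zeros(4)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)  # -> (4, 8, 8)
```

Because the sigmoid weights lie in (0, 1), each channel is attenuated according to its learned importance while the spatial layout is unchanged.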
Figure 4.
The diagram of the 3D ResSE-Unet structure. A. The architecture of 3D
ResSE-Unet. B. The design of the ResSE block. C. The structure of the SE
block (H represents the height of the input feature, W the width, C the
channel number, and r the reduction ratio, set to 2 in this work).
Loss function
When training a CNN model, choosing an appropriate loss function can improve
network performance. Considering that there is a problem of
foreground-background class imbalance in this task, we designed a mixed loss
function, as defined in Eq. 1:

L_mix = L_Dice + L_Focal    (1)

where L_Dice and L_Focal represent the Dice loss and the Focal loss,
respectively. They are defined as follows:

L_Dice = 1 - 2|X ∩ Y| / (|X| + |Y|)    (2)

where X and Y represent the ground truth and the prediction result. The Dice
loss is suitable for severe class-imbalance tasks; however, in routine
tasks, the Dice loss can destabilize backpropagation and lead to training
difficulty.

L_Focal = -α(1 - p_t)^γ log(p_t)    (3)

where α is the weighting factor that balances the importance of
positive/negative examples and γ is a modulating factor. The Focal loss can
be seen as a variation of binary cross-entropy. By down-weighting the
contribution of easy examples and focusing more on learning hard examples,
it works well for highly imbalanced class scenarios.
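The two terms can be sketched in NumPy as follows. This is an illustration, not the authors' code: the paper does not specify the mixing weights, so an equal-weight sum is assumed, and α = 0.25, γ = 2 are the commonly used Focal-loss defaults.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|X∩Y| / (|X|+|Y|) over probability maps."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(pred, eps, 1.0 - eps)
    p_t = np.where(target == 1, p, 1.0 - p)     # prob. of the true class
    a_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def mixed_loss(pred, target):
    """Equal-weight sum of the two terms (the paper's exact mixing
    weights are not specified; this is an assumption)."""
    return dice_loss(pred, target) + focal_loss(pred, target)

pred = np.array([0.9, 0.8, 0.2, 0.1])    # predicted foreground probabilities
target = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth labels
print(round(mixed_loss(pred, target), 4))
```

Note how the (1 - p_t)^γ factor drives the contribution of confident, easy voxels toward zero, which is what makes the Focal term effective on the heavily background-dominated CT volumes.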
Evaluation
The testing set was used to evaluate the predictive performance of the 3D
ResSE-Unet. The ground-truth volumes were contoured manually by experienced
senior radiation oncologists, and the difference between the auto-delineated
GTV and the ground truth was quantified by the Dice similarity coefficient
(DSC), the 95th-percentile Hausdorff distance (HD95), and the mean surface
distance (MSD).
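The DSC on binary masks can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def dice_coefficient(a, b):
    """DSC = 2|A∩B| / (|A|+|B|) for two binary masks a and b."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Two overlapping 2x2 squares: |A| = 4, |B| = 4, |A∩B| = 2 -> DSC = 0.5
a = np.zeros((4, 4), dtype=np.uint8); a[0:2, 0:2] = 1
b = np.zeros((4, 4), dtype=np.uint8); b[1:3, 0:2] = 1
print(dice_coefficient(a, b))  # -> 0.5
```

HD95 and MSD are surface-based metrics and are usually computed from the distances between the two contour surfaces with an existing medical-imaging library rather than by hand.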
Model training
The proposed models were implemented with the PyTorch framework on a Linux
operating system using the Python application programming interface and
accelerated by an NVIDIA graphics card. To prevent overfitting, a batch
normalization operation was performed after each convolution operation, and
the Kaiming function was used to initialize the network parameters. In the
training stage, the learning rate was set to 0.00015 in the Adam optimizer,
the batch size was set to 2, and the mixed loss function was used as the
training loss. The maximum number of epochs was 90, and the loss value
decreased with the epoch number. After each training epoch, validation was
performed on the validation set, and only the best parameters were saved.
All experiments in this article were performed on an Intel Xeon E5-2650 v4
(2.2 GHz) processor and an NVIDIA Tesla T4 graphics card.
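The training schedule above (90 epochs, best-on-validation checkpointing) reduces to a simple loop. The skeleton below is framework-agnostic: `fit_batch`, `evaluate`, and `state` are hypothetical stand-ins for the real PyTorch forward/backward step, validation pass, and `state_dict()` call.

```python
def train(model, train_data, val_data, epochs=90):
    """Schematic training loop: step through batches, validate after each
    epoch, and keep only the parameters with the best validation score."""
    best_score, best_state = -1.0, None
    for epoch in range(epochs):
        for batch in train_data:          # batches of size 2 in the paper
            model.fit_batch(batch)        # one optimizer step on the mixed loss
        score = model.evaluate(val_data)  # e.g. mean validation DSC
        if score > best_score:            # save only the best parameters
            best_score, best_state = score, model.state()
    return best_score, best_state

class ToyModel:
    """Stand-in for the real network, used only to exercise the loop."""
    def __init__(self): self.steps = 0
    def fit_batch(self, batch): self.steps += 1
    def evaluate(self, val): return min(0.9, 0.5 + 0.01 * self.steps)
    def state(self): return self.steps

best, state = train(ToyModel(), train_data=[[0, 1]] * 5, val_data=None, epochs=3)
print(best)  # -> 0.65
```

In the real setup the optimizer would be Adam with lr = 1.5e-4 and the loss the Dice + Focal mixture described earlier.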
Results
After training the models, CT images from the testing set were fed into the
best-performing model to perform GTV delineation, and the results were
evaluated qualitatively and quantitatively.
Comparison of different depths of 3D Unet
To find a suitable depth of 3D Unet for GTV segmentation, 3D Unets of
different depths were trained: 3D Unet_3B, 3D Unet_4B, and 3D Unet_5B, which
include three, four, and five downsamplings, respectively. The number of
convolution channels in each layer of 3D Unet_3B, from shallow to deep, is
16, 32, 64, 128; for 3D Unet_4B it is 16, 32, 64, 128, 256; and for 3D
Unet_5B it is 16, 32, 64, 128, 256, 512.

Quantitative evaluation results of the three depths of 3D Unet are
summarized in Table 3. As shown, 3D Unet_4B achieved better segmentation
results: its average DSC, HD95, and MSD reached 0.7090, 33.89 mm, and
7.030 mm, respectively, while all three quantitative metrics of 3D Unet_3B
and 3D Unet_5B were inferior to those of 3D Unet_4B.
Table 3.
Comparison of quantitative evaluation metrics of 3D Unet with different
depths.

Method | DSC | HD95 (mm) | MSD (mm)
3D Unet_3B | 0.6979 | 40.72 | 9.392
3D Unet_4B | 0.7090 | 33.89 | 7.030
3D Unet_5B | 0.6936 | 48.94 | 9.427

The partial segmentation results on the testing set are shown in Figure 5.
Intuitively, 3D Unet segmentation results of different depths all have the
problem of false positives. However, compared with 3D Unet_3B and 3D
Unet_5B, there are fewer false positives and false negatives in 3D
Unet_4B.
Figure 5.
Comparison of segmentation results of different depths of 3D Unet.
A-C. 3D Unet_3B segmentation results. D-F. 3D Unet_4B segmentation
results. G-I. 3D Unet_5B segmentation results.
Comparison of 3D ResSE-Unet, 3D Res-Unet, 3D SE-Unet and 3D Unet
To investigate the effectiveness of the proposed segmentation model, 3D
Unet, 3D Res-Unet, 3D SE-Unet, and 3D ResSE-Unet were each trained. Compared
with 3D Unet, 3D Res-Unet introduced the residual connection, 3D SE-Unet
introduced the channel attention mechanism, and 3D ResSE-Unet introduced
both at the same time. For a fair comparison, 3D SE-Unet, 3D Res-Unet, and
3D ResSE-Unet were trained with the same hyperparameters as 3D Unet.

As shown in Table
4, the quantitative evaluation results of the four networks on the testing
set are summarized. Compared with 3D Unet, both the residual connection and
the channel attention mechanism improve the segmentation results. The
residual connection increases the DSC of 3D Unet from 0.7090 to 0.7247 and
reduces the HD95 from 33.89 mm to 21.64 mm and the MSD from 7.030 mm to
5.121 mm. The channel attention mechanism increases the DSC from 0.7090 to
0.7222 and reduces the HD95 from 33.89 mm to 23.46 mm and the MSD from
7.030 mm to 5.487 mm. Moreover, introducing the residual connection and the
channel attention mechanism at the same time achieves the best segmentation
results, with average DSC, HD95, and MSD of 0.7367, 21.39 mm, and 4.962 mm,
respectively.
Table 4.
Quantitative evaluation metrics comparison of different models.

Method | DSC | HD95 (mm) | MSD (mm)
3D Unet | 0.7090 | 33.89 | 7.030
3D SE-Unet | 0.7222 | 23.46 | 5.487
3D Res-Unet | 0.7247 | 21.64 | 5.121
3D ResSE-Unet | 0.7367 | 21.39 | 4.962

Representative comparison results of the four models are shown in Figure 6. As shown,
the segmentation results of 3D Unet suffer from false positives. The
introduction of the residual connection and the channel attention mechanism
alleviates this problem: compared with 3D Unet, false positives are reduced
in the results of 3D Res-Unet, 3D SE-Unet, and 3D ResSE-Unet, and 3D
ResSE-Unet achieves the best results.
Figure 6.
Comparison of segmentation results of 3D ResSE-Unet, 3D Res-Unet, 3D
SE-Unet, and 3D Unet. A1-C1. 3D ResSE-Unet segmentation results.
A2-C2. 3D Res-Unet segmentation results. A3-C3. 3D SE-Unet
segmentation results. A4-C4. 3D Unet segmentation results.

In addition, we also compared the network parameters and average segmentation
time of the four models, as shown in Table 5. The introduction of the
channel attention mechanism hardly increases the number of model parameters
and does not reduce the segmentation efficiency, whereas the introduction of
residual connections increases the number of model parameters and slightly
reduces the segmentation efficiency. Compared with 3D Unet, the parameters
of 3D ResSE-Unet increase from 21.54 MB to 44.66 MB, but the average
segmentation time increases by only about one second.
Table 5.
Comparison of network parameters and average segmentation time.

Method | Parameters (MB) | Average segmentation time per patient (s)
3D Unet | 21.54 | 9.58
3D SE-Unet | 21.75 | 9.60
3D Res-Unet | 44.24 | 10.58
3D ResSE-Unet | 44.66 | 10.61
Discussion
Radiotherapy is one of the main treatments for stage III NSCLC. Accurately
delineating the GTV is essential to achieve precise radiotherapy. Manual
delineation by radiologists is time-consuming and subject to inter- and
intra-observer variability. However, these problems can be mitigated by
automatic segmentation methods based on CNNs. At
present, the research on the automatic delineation of GTV for NSCLC radiotherapy
mainly uses 2D CNNs and ignores spatial features of tumors from CT data. In this
work, we chose 3D Unet as the base model and used two different methods to improve
3D Unet. We designed a model named 3D ResSE-Unet and achieved the automatic
segmentation of GTV of stage III NSCLC radiotherapy. The segmentation results of
different depths of 3D Unet are shown in Table 3 and Figure 5. From the
perspective of the 3D Unet structure, the deeper the network, the more
feature scales can be extracted and the better the segmentation results,
which can explain why 3D Unet_4B outperforms 3D Unet_3B. However, a deeper
network also loses more spatial information through max pooling operations,
which is unsuitable for segmenting small targets. In the training set, some
tumors were very small (the minimum GTV was 3.884 cm³), and the overlap
cropping used in preprocessing means that a 3D training block may contain
only part of the GTV. Therefore, the segmentation result of 3D Unet_5B is
not as good as that of 3D Unet_4B.

Compared with other depths, 3D Unet with four downsamplings is the more suitable
structure for our work, but the segmentation results still have the problem of false
positives. Two methods were adopted to solve this problem in this article. To solve
the problem of vanishing gradients and strengthen the transmission of features, the
residual connection mechanism is introduced into 3D Unet. And the channel attention
mechanism also has been introduced into 3D Unet to strengthen the useful channel
features and suppress the useless ones. As shown in Tables 4 and 5 and
Figure 6, compared with 3D Unet, the introduction of the residual connection
and the channel attention mechanism reduces false positives and improves
segmentation performance, with 3D ResSE-Unet achieving the best results.
Although these two mechanisms slightly reduce the segmentation efficiency,
segmenting one case takes only about 10 s, which still meets the needs of
clinical applications.

The comparison between the approach proposed in this article and three lung tumor
delineation methods developed in previous papers has been summarized in Table 6. Compared with 2D
CNNs,[21,23] the proposed model can obtain the same segmentation accuracy
while using fewer cases. This is due to the overlap cropping technique used in
preprocessing: each case is utilized as fully as possible, and the number of 3D
data blocks available for training is expanded. In addition, the segmentation
model can make full use of the z-axis information of the CT images. With the
same number of cases, our method is therefore more likely to obtain better
segmentation performance. Compared with the research of Cui Y et al,
our segmentation results are only average. The dense connections and V-net used in
their segmentation model provide new ideas for our follow-up research. However, the
influence of each convolution channel feature on the prediction result is ignored in
their study. Their segmentation performance may be further improved by introducing
the channel attention mechanism.
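To make the channel attention idea concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation step combined with a residual connection on a 3D feature map. It is a simplified stand-in for the actual 3D ResSE-Unet building block, which also contains 3D convolutions and normalization; the weight shapes and reduction ratio below are illustrative assumptions, not the trained model's configuration.

```python
import numpy as np

def se_residual_block(x, w1, w2):
    """Channel attention (squeeze-and-excitation) followed by a residual
    connection, applied to a 3D feature map x of shape (C, D, H, W).

    w1: (C, C//r) and w2: (C//r, C) are the two fully connected layers of
    the excitation bottleneck (r is the channel reduction ratio).
    """
    # Squeeze: global average pooling over the three spatial dimensions.
    z = x.mean(axis=(1, 2, 3))                   # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    s = np.maximum(z @ w1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))          # channel weights in (0, 1)
    # Scale each channel by its weight and add the identity (residual) path.
    return x + x * s[:, None, None, None]

# Toy usage: 8 channels, a 4x4x4 spatial grid, reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 4))
w1 = 0.1 * rng.standard_normal((8, 2))
w2 = 0.1 * rng.standard_normal((2, 8))
y = se_residual_block(x, w1, w2)
print(y.shape)  # (8, 4, 4, 4): the output keeps the input shape
```

Because each learned channel weight lies in (0, 1), the excitation path can only rescale channels, while the residual path keeps the identity mapping always available; this combination is what lets the attention block strengthen useful channels without hindering gradient flow.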
Table 6.
Comparison of 3D ResSE-Unet and other methods of tumor delineation for NSCLC
radiotherapy
Although we have achieved automatic segmentation of GTV for stage III NSCLC, our
experiment still has the following limitations. Firstly, only 214 cases of stage III
NSCLC have been collected for our experiment. This number is relatively small and
needs to be further increased. Tumor location, shape, and size vary greatly
among patients. Increasing the number of cases used for training
may further improve the generalization ability and prediction accuracy of the
segmentation model. Secondly, we have only realized the automatic segmentation of
stage III NSCLC. The segmentation effect of this model on GTV of stages I, II, and
IV NSCLC needs further study. Thirdly, compared with other cancers, lung tumors vary
greatly in size, shape, and location. The relationship between these features and
segmentation accuracy has not been further analyzed. Fourth, there is no further
comparison between deep learning-assisted delineation and manual delineation in
terms of efficiency and inter- and intra-observer variability. Fifth, we only
performed a joint assessment of the primary gross tumor volume and the lymph node
gross tumor volume and did not analyze their segmentation results separately. Sixth,
our department did not adopt respiratory motion management until 2018, and in order
to obtain enough cases, we collected patients from 2017 to 2020, so our experiments
were carried out with free breathing.
In the future, we can make some new attempts to achieve better segmentation
performance. Firstly, compared to the residual connection, a more extreme connection
pattern has been developed, which is called the dense connection.
In this pattern, each layer receives the output features of all previous
layers as input and passes its own feature maps to all subsequent layers. The
dense connection can also alleviate the vanishing gradient problem and
encourage feature reuse. In future work, the residual connections may be
replaced with dense connections. Secondly, the channel attention mechanism only
attends to differences between channels and ignores local information within
each channel. The local spatial attention mechanism[31-33] can address this problem by
calculating the feature importance of each pixel in the space domain. Thus,
combining the advantages of the two attention mechanisms to improve the segmentation
effect is the next work that can be studied. Thirdly, our research is only based on
CT images, which can provide high-resolution anatomical details. Currently, PET/CT
and magnetic resonance imaging (MRI) have been widely used in the diagnosis and
treatment of cancer. PET images can provide quantitative metabolic information. MRI
can provide clear soft tissue contrast and help to distinguish the tumor from the
surrounding normal tissues. Integrating multi-modal images can obtain richer tumor
feature information and may improve the accuracy of tumor segmentation. Some
scholars have already carried out research based on multi-modal images.[34-38]
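The dense connection pattern mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the connectivity only (channel-wise concatenation of the block input and all previous layer outputs); the toy 1x1x1 "convolutions" and the growth rate of 2 are illustrative assumptions, not the design of any specific published model.

```python
import numpy as np

def conv1x1(w):
    """Toy 1x1x1 convolution: mixes channels with weights w of shape
    (out_channels, in_channels), followed by ReLU."""
    return lambda f: np.maximum(np.einsum('oc,cdhw->odhw', w, f), 0.0)

def dense_block(x, layers):
    """Dense connectivity on a 3D feature map x of shape (C, D, H, W):
    each layer receives the concatenation of the block input and all
    previous layer outputs, and the block output concatenates everything."""
    features = [x]
    for layer in layers:
        inp = np.concatenate(features, axis=0)   # concat along channels
        features.append(layer(inp))
    return np.concatenate(features, axis=0)

# Toy usage: input with 4 channels, growth rate 2, two layers.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 4, 4))
layers = [conv1x1(rng.standard_normal((2, 4))),   # sees 4 channels
          conv1x1(rng.standard_normal((2, 6)))]   # sees 4 + 2 channels
out = dense_block(x, layers)
print(out.shape)  # (8, 4, 4, 4): 4 input channels + 2 + 2 grown channels
```

Note how the input feature map survives unchanged in the output: every layer has a direct path to the block input and to the loss, which is why dense connectivity alleviates vanishing gradients and encourages feature reuse.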
Conclusion
In this article, a 3D CNN named 3D ResSE-Unet is proposed for GTV segmentation of
stage III NSCLC. This model can fully exploit the three-dimensional spatial
information of tumors and realize accurate and rapid segmentation of the GTV. 3D
ResSE-Unet is based on 3D Unet and combines the advantages of residual connection
and channel attention mechanism. Compared with 3D Unet, 3D ResSE-Unet can
achieve more accurate segmentation and solve the problem of over-segmentation.
This model provides a new tool for the automatic delineation of GTV for lung
cancer radiotherapy, although the current segmentation results still need to be
adjusted manually before clinical application. In the future, the proposed
method may be further refined to improve segmentation accuracy and efficiency
and to help achieve accurate and effective radiotherapy.