Marcel Früh, Marc Fischer, Andreas Schilling, Sergios Gatidis, Tobias Hepp.
Abstract
Purpose: We introduce and evaluate deep learning methods for weakly supervised segmentation of tumor lesions in whole-body fluorodeoxyglucose positron emission tomography (FDG-PET) based solely on binary global labels ("tumor" versus "no tumor"). Approach: We propose a three-step approach based on (i) a deep learning framework for image classification, (ii) subsequent generation of class activation maps (CAMs) using different CAM methods (CAM, GradCAM, GradCAM++, ScoreCAM), and (iii) final tumor segmentation based on the aforementioned CAMs. A VGG-based classification neural network was trained to distinguish between PET image slices with and without FDG-avid tumor lesions. Subsequently, the CAMs of this network were used to identify the tumor regions within images. The proposed framework was applied to FDG-PET/CT data of 453 oncological patients with available manually generated ground-truth segmentations. Quantitative segmentation performance was assessed for the different CAM approaches and compared with the manual ground-truth segmentation and with supervised segmentation methods. In addition, further biomarkers (metabolic tumor volume, MTV, and total lesion glycolysis, TLG) were extracted from the segmentation masks.
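The CAM step (ii) of the approach can be sketched as a weighted sum of the classifier's final convolutional feature maps, using the final linear-layer weights of the target class. This is a minimal NumPy sketch under illustrative shapes and names, not the authors' implementation:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Weighted sum of feature maps (basic CAM).

    features:   (C, H, W) activations of the last conv layer
    fc_weights: (num_classes, C) weights of the final linear layer
    class_idx:  index of the target class (e.g., "tumor")
    """
    w = fc_weights[class_idx]                          # (C,)
    cam = np.tensordot(w, features, axes=([0], [0]))   # (H, W)
    cam = np.maximum(cam, 0)                           # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                               # normalise to [0, 1]
    return cam

# toy example: 2 channels, 2x2 feature maps
feats = np.array([[[1.0, 0.0], [0.0, 0.0]],
                  [[0.0, 2.0], [0.0, 0.0]]])
w = np.array([[0.5, 0.0], [1.0, 1.0]])  # class 1 = "tumor"
cam = class_activation_map(feats, w, class_idx=1)
```

Gradient-based variants (GradCAM, GradCAM++) and ScoreCAM replace the linear-layer weights with gradient- or score-derived channel weights, but the weighted-sum structure is the same.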
Keywords: computed tomography; deep learning; label efficiency; oncological imaging; positron emission tomography; weakly supervised learning
Year: 2021 PMID: 34660843 PMCID: PMC8510879 DOI: 10.1117/1.JMI.8.5.054003
Source DB: PubMed Journal: J Med Imaging (Bellingham) ISSN: 2329-4302
Fig. 1 Exemplary PET/CT slice with high SUV uptake next to the hilum of the right lung. The right image shows the manually annotated segmentation mask as a red overlay on the PET image.
Fig. 2 Distribution of tumor sizes for slices containing malignant tissue. Slices with small tumors dominate.
Fig. 3 Proposed processing routine. First, a binary tumor classifier is trained in a supervised manner on PET/CT data. Then a class activation map is computed from the classifier. Finally, threshold-based segmentation is performed on the PET images within the region proposed by the CAM.
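The final step of this routine, thresholding restricted to the CAM-proposed region, can be sketched as follows. The CAM cutoff and the relative SUV fraction are illustrative values, not the paper's chosen parameters:

```python
import numpy as np

def segment_within_cam(pet, cam, cam_thresh=0.5, suv_frac=0.4):
    """Threshold-based segmentation restricted to the CAM region.

    pet: 2D SUV image; cam: activation map in [0, 1], same shape.
    cam_thresh and suv_frac are illustrative, not the paper's values.
    """
    region = cam >= cam_thresh                 # CAM region proposal
    if not region.any():
        return np.zeros_like(region)
    suv_cut = suv_frac * pet[region].max()     # relative SUV threshold
    return region & (pet >= suv_cut)

pet = np.array([[1.0, 8.0], [0.5, 4.0]])
cam = np.array([[0.1, 0.9], [0.2, 0.8]])
mask = segment_within_cam(pet, cam)
```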
Fig. 4 PET with ground-truth segmentation, corresponding activation maps from the four CAM methods, extracted segmentations, and the corresponding CT for a sample slice containing a tumor.
Fig. 5 Per-subject Dice scores for the weakly supervised segmentation methods (blue) and the supervised baselines (red).
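The Dice score used for this comparison is the standard overlap measure between predicted and ground-truth binary masks, 2|A∩B| / (|A| + |B|). A minimal sketch (the empty-mask convention is an assumption, not stated in the record):

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * (pred & gt).sum() / denom

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
score = dice_score(pred, gt)
```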
Fig. 6 Comparison between true and estimated MTV. All units in ml.
Intraclass correlation between estimated and ground-truth MTV/TLG.

| | MTV (ml) | | | TLG (g) | | |
|---|---|---|---|---|---|---|
| Method | ICC | 95%-CI | p-value | ICC | 95%-CI | p-value |
| CAM | 0.64 | [0.50, 0.74] | <0.001 | 0.85 | [0.77, 0.90] | <0.001 |
| GradCAM | 0.55 | [0.39, 0.67] | <0.001 | 0.40 | [0.19, 0.57] | <0.001 |
| GradCAM++ | 0.64 | [0.48, 0.73] | <0.001 | 0.79 | [0.66, 0.86] | <0.001 |
| ScoreCAM | 0.64 | [0.50, 0.75] | <0.001 | 0.82 | [0.71, 0.88] | <0.001 |
| Threshold | 0.59 | [0.45, 0.71] | <0.001 | 0.88 | [0.83, 0.92] | <0.001 |
| UNET | 0.94 | [0.91, 0.96] | <0.001 | 0.99 | [0.98, 0.99] | <0.001 |
Fig. 7 Comparison between true and estimated TLG. All units in g.
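The two biomarkers compared above follow standard definitions: MTV is the number of tumor voxels times the voxel volume, and TLG is MTV times the mean SUV inside the mask. A minimal sketch with an illustrative voxel volume (the function name and values are not from the paper):

```python
import numpy as np

def mtv_tlg(suv, mask, voxel_vol_ml):
    """MTV (ml) and TLG (g) from a binary tumor mask.

    MTV = number of tumor voxels x voxel volume;
    TLG = MTV x mean SUV inside the mask (standard definitions).
    """
    m = mask.astype(bool)
    n = int(m.sum())
    mtv = n * voxel_vol_ml
    tlg = mtv * float(suv[m].mean()) if n else 0.0
    return mtv, tlg

suv = np.array([[2.0, 4.0], [0.0, 0.0]])
mask = np.array([[1, 1], [0, 0]])
mtv, tlg = mtv_tlg(suv, mask, voxel_vol_ml=0.5)
```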
CAM Segmentation (algorithm listing; only fragments recovered): 1: Predict class … 3: Upscale … 6: Segmentation mask = …
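The "Upscale" step of the listing brings the coarse CAM (computed at feature-map resolution) up to PET image resolution before thresholding. A nearest-neighbour sketch; the scale factor and interpolation method are illustrative, as the listing does not specify them:

```python
import numpy as np

def upscale_nearest(cam, factor):
    """Nearest-neighbour upscaling of a coarse CAM by an integer factor."""
    # each CAM cell is repeated into a factor x factor block
    return np.kron(cam, np.ones((factor, factor)))

cam = np.array([[0.0, 1.0], [0.5, 0.0]])
big = upscale_nearest(cam, 2)
```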