Literature DB >> 34170972

Harnessing clinical annotations to improve deep learning performance in prostate segmentation.

Karthik V Sarma1, Alex G Raman1,2, Nikhil J Dhinagar1,3, Alan M Priester1, Stephanie Harmon4,5, Thomas Sanford4,6, Sherif Mehralivand4, Baris Turkbey4, Leonard S Marks1, Steven S Raman1, William Speier1, Corey W Arnold1.   

Abstract

PURPOSE: Developing large-scale datasets with research-quality annotations is challenging due to the high cost of refining clinically generated markup into high-precision annotations. We evaluated the direct use of a large dataset with only clinically generated annotations in the development of high-performance segmentation models for small research-quality challenge datasets.
MATERIALS AND METHODS: We used a large retrospective dataset from our institution comprising 1,620 clinically generated segmentations, and two challenge datasets (PROMISE12: 50 patients, ProstateX-2: 99 patients). We trained a 3D U-Net convolutional neural network (CNN) segmentation model using our entire dataset, and used that model as a template to train models on the challenge datasets. We also trained versions of the template model using ablated proportions of our dataset, and evaluated the relative benefit of those templates for the final models. Finally, we trained a version of the template model using an out-of-domain brain cancer dataset, and evaluated the relative benefit of that template for the final models. We used five-fold cross-validation (CV) for all training and evaluation across our entire dataset.
RESULTS: Our model achieves state-of-the-art performance on our large dataset (mean overall Dice 0.916, average Hausdorff distance 0.135 across CV folds). Using this model as a pre-trained template for refinement on two external datasets significantly improved performance (30% and 49% improvement in Dice scores, respectively). Mean overall Dice and mean average Hausdorff distance were 0.912 and 0.15 for the ProstateX-2 dataset, and 0.852 and 0.581 for the PROMISE12 dataset. Using even small quantities of data to train the template improved performance, with significant improvements using 5% or more of the data.
CONCLUSION: We trained a state-of-the-art model using unrefined clinical prostate annotations and found that its use as a template model significantly improved performance in other prostate segmentation tasks, even when trained with only 5% of the original dataset.

Entities:  

Mesh:

Year:  2021        PMID: 34170972      PMCID: PMC8232529          DOI: 10.1371/journal.pone.0253829

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Prostate cancer is the second most frequent cancer diagnosis and the fifth leading cause of cancer death among men worldwide [1]. Prostate segmentation is a component of the routine evaluation of prostate magnetic resonance imaging (MRI), necessary both for surveillance (through volume estimation) and for targeted biopsy (to enable registration with real-time ultrasound). In the segmentation workflow, a clinician (generally a radiologist or urologist) manually reviews the slices of a 3D T2-weighted MR image and produces a contour for each slice. In some workflows, the clinician uses a computer-assistance tool, such as DynaCAD Prostate (Invivo-Philips, Gainesville, Florida) [2], which either first produces an approximate annotation that is then edited by the radiologist, or provides an assisted drawing tool that heuristically supports the designation of a contour. Regardless of workflow, segmentation requires a slice-by-slice analysis, which is time consuming, requires the skills of a specially trained radiologist, and is prone to intra- and inter-reader variability [3].

In addition to the utility of such segmentations for these clinical applications, obtaining a precise segmentation is critical for supporting image analysis research, as incorrectly assigned image regions may impair trained classifier accuracy, particularly for lesion-detection classifiers that rely on input prostate segmentations as a component of the input pathway. Automated prostate segmentation is an active area of research, and substantial published work exists on the development of machine learning models for this purpose. However, these state-of-the-art prostate segmentation algorithms [4-9] are often trained on small research-quality annotated datasets curated specifically for machine learning. Examples include the 100-patient Prostate MR Image Segmentation (PROMISE12) challenge dataset [10] and the 60-patient NCI-ISBI (National Cancer Institute–International Symposium on Biomedical Imaging) Automated Segmentation of Prostate Structures (ASPS13) challenge dataset [11]. Other algorithms have been trained on institutionally developed local datasets that include between 100 and 650 studies [12-15].

Unfortunately, the development of research-quality prostate boundary annotations is challenging. For example, for the PROMISE12 dataset, segmentations were created by an experienced radiologist, verified by a second experienced radiologist, and then re-annotated by a third nonclinical observer, a complex and expensive process. If, however, rough clinical annotations could be used to train a highly accurate segmentation model, these issues could be avoided, and substantially more data could be made available.

In this study, we train a prostate segmentation model using a large clinical prostate MRI dataset and rough clinical annotations created as part of the clinical workflow at our academic medical center. We then explore generalizing that model through refinement with small datasets, and the impact of original dataset size on generalizability. Finally, to confirm that it is the prostate-specific features in our model that improve generalization, rather than general MR features, we explore the relative utility of using our pretrained prostate model as a basis for generalization versus a model pretrained on an MR dataset from brain cancer patients.

Materials and methods

Data

Four retrospective sources of data were used for this project. For training our segmentation model with our clinical data, we used MRI data collected from patients seen at our institution during routine clinical procedures. For examining generalization, we made use of two research-quality prostate MRI challenge datasets. Finally, for determining the relative utility of using our model trained with clinical data as a pre-trained starter, we made use of a brain MRI challenge dataset for comparison. All data was used for this work under the approval of the University of California, Los Angeles (UCLA) institutional review board (IRB# 16–001087). Informed consent was waived with the approval of the IRB for this retrospective study of medical records, based on institutional guidelines, the fact that the study involved no more than minimal risk, the fact that the waiver would not adversely affect the rights and welfare of study patients, and the impracticality of conducting this retrospective analysis, whose results would not change care already delivered to study patients. Data used for this study was de-identified after collection and before analysis.

Primary dataset

Our internal clinical population for this study consists of 1,620 MRI studies collected from 1,111 patients who underwent transrectal ultrasound-MRI fusion biopsy (TRUS biopsy) using the Artemis guided biopsy system (Eigen Systems) between 2010 and 2018 at our institution, using a standardized protocol and a 3T magnet (Trio, Verio, or Skyra; Siemens Healthcare). As part of the protocol, prostate MRIs were contoured in a two-part process. First, the attending radiologist for the case (the attending radiologists for the patients included in this study each had between 10 and 27 years of experience) created a prostate contour using the DynaCAD Prostate image analysis platform as part of the routine clinical workflow. This contour was then used by a technician to re-contour the prostate on the Profuse (Eigen Systems) platform in order to enable use with the Artemis biopsy system, as DynaCAD segmentations cannot be directly imported for use on the Artemis. We retrospectively collected 3D axial turbo spin echo (TSE) T2-weighted images and prostate contour sets from these studies. T2 images were acquired clinically using the sampling perfection with application-optimized contrasts using different flip angle evolution (SPACE, Siemens Healthcare) protocol, with field of view (FOV) 170 x 170 x 90 mm³ and resolution 0.66 x 0.66 x 1.5 mm³. Acquisition parameters are provided in S1 Table. Studies were collected from our institution’s picture archiving and communication system (PACS). Corresponding T2 prostate contours were collected from the Profuse image analysis platform. Imaging data were collected from every available study for each patient seen at our institution during the study period. Studies were excluded from retrieval if the T2 image or contour was missing from PACS or corrupt, or if the image exhibited a protocol deviation, such as a variance in FOV or resolution. A total of 1,620 studies were included from 1,111 patients, and 84 studies were excluded. Of the 1,620 included studies, 29 used an endorectal coil.

External prostate challenge datasets

Two external challenge prostate datasets were used for this study: ProstateX-2 [16] and PROMISE12 [10]. The ProstateX-2 Challenge was a prostate cancer prediction challenge held in 2017. This dataset consists of 99 deidentified cases collected from patients seen at Radboud University Medical Center in the Netherlands. A consistent imaging protocol was used for all cases, which was significantly different from the protocol used for the primary dataset at our institution. A variety of images and clinical variables were provided with each case. For use in our experiments, we retrieved transverse T2-weighted MR images from each case in the dataset. These images were then annotated with a research-quality prostate contour by co-author B.T., an experienced abdominal radiologist. The PROMISE12 Grand Challenge was a prostate segmentation-specific challenge held in 2012. This dataset includes 50 deidentified cases collected from four different centers (Haukeland University Hospital in Norway, Beth Israel Deaconess Medical Center in the United States, University College London in the United Kingdom, and Radboud University Nijmegen Medical Center in the Netherlands). Each institution had unique acquisition protocols, with wide variability in the MR field strength, endorectal coil usage, and image resolution. Each case consists of a transverse T2-weighted MR image and a reference research-quality prostate contour produced by agreement of two expert radiologists (one radiologist at the institution where the image was acquired, and a second radiologist at Radboud University). Detailed acquisition parameters are not available for this dataset, but images were scanned at a variety of field strengths (1.5T or 3T), with or without endorectal coil, and with a variety of acquisition resolutions, pulse sequences, and device manufacturers [10].

Brain cancer dataset

In order to provide a non-prostate comparison, the Brain Tumor Segmentation (BraTS) 2019 [17, 18] challenge dataset was also used for this study. The dataset includes over 300 annotated cases collected from 19 different institutions using a wide variety of protocols. These cases include T2-weighted images of the brain with tumor segmentations. These segmentations were created manually using a multi-step protocol requiring agreement between multiple raters and final approval by an experienced neuroradiologist. Though tumor segmentation is a far more complex segmentation task than organ segmentation, this dataset provided an MRI comparison with a defined 3D segmentation task that could be leveraged as pretraining for prostate MRI segmentation. The BraTS 2019 data originates from a large number of institutions and includes data collected with a variety of acquisition parameters; a specific breakdown of these parameters is not available [19].

Preprocessing

In order to facilitate transportability, we processed images from all three prostate datasets using the same pipeline. Initial preprocessing was done in Python, primarily using the SimpleITK toolkit [20], and included bias field correction [21] and resampling to isotropic voxel size (1 mm x 1 mm x 1 mm) for further processing; these steps were based on preprocessing done in previous work [13, 21–23]. After initial preprocessing, we applied interquartile range (IQR)-based intra-image normalization to address the relative nature of MR image intensity values (both within and between institutions). Each image was normalized to the image-level IQR (calculated from the central 128x128 column of the volume), and values were then clipped between two IQRs below the first quartile and five IQRs above the third quartile in order to eliminate outlying values created by imaging artifacts. The preprocessing pipeline is depicted in Fig 1.
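As a concrete illustration, the sketch below re-implements this pipeline with SimpleITK and NumPy. It is a minimal sketch rather than the authors' code: the N4 settings, the interpolation choice, and the exact normalization formula (subtracting the first quartile and dividing by the IQR) are assumptions not specified in the text.

```python
# Minimal sketch of the described preprocessing pipeline (assumes SimpleITK >= 2.0).
import numpy as np
import SimpleITK as sitk

def preprocess(image: sitk.Image) -> np.ndarray:
    # 1) N4ITK bias field correction (N4 requires a floating-point image).
    image = sitk.Cast(image, sitk.sitkFloat32)
    image = sitk.N4BiasFieldCorrectionImageFilter().Execute(image)

    # 2) Resample to 1 mm isotropic voxels (linear interpolation assumed).
    old_spacing = np.array(image.GetSpacing())
    old_size = np.array(image.GetSize())
    new_spacing = (1.0, 1.0, 1.0)
    new_size = np.round(old_size * old_spacing / np.array(new_spacing)).astype(int).tolist()
    image = sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                          image.GetOrigin(), new_spacing,
                          image.GetDirection(), 0.0, image.GetPixelID())

    # 3) IQR statistics from the central 128x128 column of the volume, then
    #    clipping to [Q1 - 2*IQR, Q3 + 5*IQR] to suppress artifact outliers.
    arr = sitk.GetArrayFromImage(image)                 # (z, y, x)
    _, cy, cx = np.array(arr.shape) // 2
    column = arr[:, cy - 64:cy + 64, cx - 64:cx + 64]
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    arr = np.clip(arr, q1 - 2 * iqr, q3 + 5 * iqr)
    return (arr - q1) / iqr        # scaling by image-level IQR; exact formula assumed
```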

Fig 1. 3D U-Net model diagram and preprocessing steps.

A) Network diagram of the 3D U-Net used for this study. Numbers within the ovals represent the number of feature maps at that layer. Connections represent network operations, such as 3x3x3 3D convolution (“Conv”), 2x2x2 max pooling (“Max Pool”), 3x3x3 3D transposed convolution (“Deconv”), skip feature map concatenation (“Concat”), batch normalization (“BN”), rectified linear unit activation (“ReLU”), and softmax output (“Softmax”). B) Process diagram of preprocessing steps. Once images were imported from the archive (either PACS or challenge download), N4ITK bias field correction was applied. Images were then resampled to 1mm isotropic resolution and IQR normalized. During training, real-time augmentation was applied to each input image to create the training sample for that epoch.
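For readers who want a concrete reference, below is a hedged PyTorch sketch of a 3D U-Net matching the configuration described in the Methods text (four encoder levels, three decoder levels, group normalization with eight groups, ReLU, 2x2x2 max pooling, 3x3x3 transposed convolutions, skip concatenation, and a softmax output). The channel widths (32 to 256) are assumptions; the paper's exact feature-map counts appear only in its figure.

```python
# Hedged sketch of the 3D U-Net configuration; channel widths are assumed.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3x3 convolutions, each followed by GroupNorm(8 groups) and ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(8, out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(8, out_ch), nn.ReLU(inplace=True),
    )

def up_block(in_ch: int, out_ch: int) -> nn.ConvTranspose3d:
    """3x3x3 transposed convolution that exactly doubles each spatial dimension."""
    return nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                              padding=1, output_padding=1)

class UNet3D(nn.Module):
    def __init__(self, in_ch: int = 1, n_classes: int = 2, widths=(32, 64, 128, 256)):
        super().__init__()
        w1, w2, w3, w4 = widths
        self.enc1, self.enc2 = conv_block(in_ch, w1), conv_block(w1, w2)
        self.enc3, self.enc4 = conv_block(w2, w3), conv_block(w3, w4)
        self.pool = nn.MaxPool3d(2)
        self.up3, self.dec3 = up_block(w4, w3), conv_block(2 * w3, w3)
        self.up2, self.dec2 = up_block(w3, w2), conv_block(2 * w2, w2)
        self.up1, self.dec1 = up_block(w2, w1), conv_block(2 * w1, w1)
        self.head = nn.Conv3d(w1, n_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)                         # full resolution
        e2 = self.enc2(self.pool(e1))             # 1/2
        e3 = self.enc3(self.pool(e2))             # 1/4
        e4 = self.enc4(self.pool(e3))             # 1/8 (bottleneck)
        d3 = self.dec3(torch.cat([self.up3(e4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.softmax(self.head(d1), dim=1)   # per-voxel class probabilities
```

Note that the 128x128x136 input described in the Methods is divisible by 8 in every dimension, so the three pooling steps produce exact integer sizes and the skip concatenations align without cropping.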

Augmentation

For all model training in this study, real-time augmentation using the Batchgenerators package was performed [24]. Three augmentation transformations were used: 1) random elastic deformation, 2) random rotation in the range [-π/8, π/8] in the axial plane, and [-π/4, π/4] along the axis, and 3) random mirroring across the y axis. After augmentation, the image was cropped to the central column of the transformed image (i.e. the central 128x128 voxels in the x,y plane).
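The study used the Batchgenerators package for these transformations; the sketch below is an illustrative re-implementation with NumPy/SciPy showing the same three augmentations followed by the center crop. The elastic-deformation magnitudes (alpha, sigma) and the interpretation of the second rotation axis are assumptions, as the text does not specify them.

```python
# Illustrative augmentation sketch; parameter magnitudes are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates, rotate

def augment(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Elastic deformation, random rotation, and random mirroring on a
    (z, y, x) volume, then a crop to the central 128x128 column."""
    # 1) Random elastic deformation: smooth random displacement fields.
    alpha, sigma = 100.0, 10.0                      # assumed magnitudes
    shape = volume.shape
    disp = [gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
            for _ in range(3)]
    coords = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    warped = [c + d for c, d in zip(coords, disp)]
    volume = map_coordinates(volume, warped, order=1, mode="nearest")

    # 2) Random rotation: [-pi/8, pi/8] in the axial (x, y) plane and
    #    [-pi/4, pi/4] about an in-plane axis (interpretation assumed).
    axial = np.degrees(rng.uniform(-np.pi / 8, np.pi / 8))
    tilt = np.degrees(rng.uniform(-np.pi / 4, np.pi / 4))
    volume = rotate(volume, axial, axes=(1, 2), reshape=False, order=1, mode="nearest")
    volume = rotate(volume, tilt, axes=(0, 1), reshape=False, order=1, mode="nearest")

    # 3) Random mirroring across the y axis (50% probability assumed).
    if rng.random() < 0.5:
        volume = volume[:, ::-1, :]

    # 4) Crop to the central 128x128 column in the (x, y) plane.
    cy, cx = volume.shape[1] // 2, volume.shape[2] // 2
    return np.ascontiguousarray(volume[:, cy - 64:cy + 64, cx - 64:cx + 64])
```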

Model, training and evaluation

The base model used for this study was the 3D U-Net [25]. For all experiments, the network was configured with four encoder levels, three decoder levels, a ReLU transfer function and group normalization (using eight groups) following every convolutional layer, and a softmax output layer. The model architecture is depicted in Fig 1. All training and evaluation were done using the PyTorch framework on a DGX-1 (NVIDIA) deep learning appliance. Mixed-precision training using NVIDIA Automatic Mixed Precision (AMP) was used at optimization level O2, consisting of 16-bit model weights and inputs, 32-bit master weights and optimizer parameters, and dynamic loss scaling. Network inputs consisted of the full augmented image volume (with dimension 128x128x136). Training was performed using the Adam optimizer with learning rate 10⁻⁵ and the soft Dice loss function. Each epoch consisted of training on a full dataset comprising one augmented sample generated for every original input sample.

The primary evaluation metric used to compare segmented volumes was the soft Dice coefficient, as denoted in Eq 1, where $S$ is the segmentation of the deep learning model and $\hat{S}$ is the manual segmentation:

$$\mathrm{Dice}(S, \hat{S}) = \frac{2 \sum_i S_i \hat{S}_i}{\sum_i S_i + \sum_i \hat{S}_i} \quad (1)$$

The value of the coefficient can range between 0 (no overlap) and 1 (perfect overlap). The average Hausdorff distance (AHD) was also used as a secondary metric, as denoted in Eq 2, where $X$ is the set of all points within the manual segmentation, $Y$ is the set of all points within the segmentation of the deep learning model, and $d$ is the Euclidean distance:

$$\mathrm{AHD}(X, Y) = \frac{1}{2} \left( \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} d(x, y) + \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} d(x, y) \right) \quad (2)$$

The AHD is a positive real number, and smaller values denote better-matching segmentations. The evaluation metrics were calculated for whole prostate gland segmentation on the entire uncropped volume. In addition, each segmentation mask was split into three subvolumes: the apex subvolume (consisting of the apical 25% of prostate slices), the base subvolume (consisting of the basal 25% of prostate slices), and the midgland subvolume (consisting of the remaining middle 50% of slices); the Dice evaluation metric was calculated for each subvolume. Means and standard deviations across the entire dataset were reported for performance on the whole prostate as well as each of the three subvolumes. These were calculated using the following approach: for each of the five folds, metrics were calculated for each of the images within the fold using the model trained without that fold’s data. Once the metrics were calculated for every study, the mean and standard deviation of each metric across all images (including whole-volume and subvolume metrics) were computed and reported as the evaluation result.
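A minimal sketch of the two metrics follows: the soft Dice coefficient of Eq 1 (negated to serve as the training loss) and the average Hausdorff distance of Eq 2. The tensor layout (batch, z, y, x) and the smoothing epsilon are assumptions of this sketch.

```python
# Sketches of the evaluation quantities in Eqs 1 and 2.
import numpy as np
import torch
from scipy.spatial import cKDTree

def soft_dice(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq 1: 2*sum(S * S_hat) / (sum(S) + sum(S_hat)), averaged over the batch.
    `pred` holds foreground probabilities, `target` the binary reference mask."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return ((2 * inter + eps) / (denom + eps)).mean()

def soft_dice_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return 1.0 - soft_dice(pred, target)    # minimized during training

def average_hausdorff(x_pts: np.ndarray, y_pts: np.ndarray) -> float:
    """Eq 2: mean of the two directed average minimum Euclidean distances
    between point sets X (manual) and Y (model), each of shape (N, 3)."""
    d_xy = cKDTree(y_pts).query(x_pts)[0].mean()   # average over X of min dist to Y
    d_yx = cKDTree(x_pts).query(y_pts)[0].mean()   # average over Y of min dist to X
    return 0.5 * (d_xy + d_yx)
```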

Experiments

Baseline models

To establish baseline performance, models were first trained from scratch separately on the primary dataset, the ProstateX-2 (PX2) data, and the PROMISE12 (P12) data. Training was performed using five-fold CV over each entire dataset (324 images per fold for the primary dataset). The evaluation metrics were then computed using the approach described above.
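The cross-validated evaluation scheme can be summarized in a short sketch: every image is scored by the fold model that excluded it from training, and the per-image metrics are then pooled across the whole dataset. The helper names (train_model, score_image) are placeholders, not the authors' API.

```python
# Sketch of the five-fold CV evaluation; helper functions are placeholders.
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_scores(images, labels, train_model, score_image,
                           n_folds: int = 5, seed: int = 0):
    scores = np.empty(len(images))
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, val_idx in kfold.split(images):
        model = train_model([images[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        for i in val_idx:                     # score held-out images only
            scores[i] = score_image(model, images[i], labels[i])
    # Pooled mean and standard deviation across every image in the dataset.
    return scores.mean(), scores.std()
```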

Generalizability to challenge datasets

To assess the utility of the baseline primary dataset model on the external challenge datasets, two sets of experiments were done for each dataset. First, the model was used to produce segmentation mask predictions for each example in the external datasets, and mean scores were reported for each dataset. Then, the model was refined for each external dataset using the baseline primary dataset model as the pretrained weight initializer. This refining was done using five-fold CV over 100 epochs, and validation soft Dice scores were calculated and reported as in the previous experiments. Results were compared for superiority against the baseline models using a one-tailed paired t-test, with α = 0.001.
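In PyTorch terms, the refinement step amounts to loading the template's weights and continuing training on the external dataset with the same optimizer settings; a hedged sketch is below, reusing UNet3D and soft_dice_loss from the sketches above and assuming a loader that yields (volume, mask) batches.

```python
# Hedged sketch of refining from the pretrained template.
import torch

def refine(template_weights_path: str, train_loader, epochs: int = 100):
    model = UNet3D()
    model.load_state_dict(torch.load(template_weights_path))  # pretrained template
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.train()
    for _ in range(epochs):
        for volume, mask in train_loader:   # one augmented sample per original image
            optimizer.zero_grad()
            pred = model(volume)[:, 1]      # foreground-channel probability
            loss = soft_dice_loss(pred, mask)
            loss.backward()
            optimizer.step()
    return model
```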

Impact of dataset ablation

To assess the impact of the size of the primary dataset on both segmentation performance and generalizability, a series of ablation experiments were conducted. First, a series of models was trained using truncated versions of the primary dataset. In these experiments, the training set for each fold was truncated to a fixed proportion of its original size, from 5% to 80%. The validation set was not truncated to ensure a fair comparison. Five-fold CV was again used over 100 epochs. The resulting models were then evaluated using the soft Dice criterion to determine model performance on the primary dataset. In order to determine the impact of ablation on generalizability, the resulting models were then refined for 100 epochs using the PX2 or P12 datasets (without truncation), and then evaluated as in the previous experiments. These models were compared for superiority against the baseline models using one-tailed paired t-tests, with α = 0.001.
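The truncation itself is simple; a sketch follows, under the assumption of uniform random selection within each training fold (the text does not specify the selection strategy).

```python
# Sketch of training-fold truncation for the ablation experiments.
import numpy as np

def truncate_training_fold(train_idx: np.ndarray, proportion: float,
                           rng: np.random.Generator) -> np.ndarray:
    """Keep `proportion` (e.g. 0.05 to 0.80) of the training indices;
    the validation fold is left untouched for a fair comparison."""
    n_keep = max(1, int(round(len(train_idx) * proportion)))
    return rng.choice(train_idx, size=n_keep, replace=False)
```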

Comparison to BraTS model

In order to assess the relative importance of using the domain-specific primary baseline model as a pretrained weight initializer, a comparison model was trained using the BraTS dataset. The BraTS dataset was chosen for the comparison model because of the similar underlying data (T2-weighted imaging) and the 3D nature of the desired output. The BraTS data was preprocessed before training using the same pipeline as the prostate data, and the same model architecture and training protocol were used. Five-fold CV was performed over 150 epochs. The BraTS model was then used as a pretrained weight initializer for refining PX2 and P12 segmentation models, using the same approach as in the previous experiments. These models were compared for superiority against the baseline models using one-tailed paired t-tests, with α = 0.001; additionally, the refined ablation models were compared against the refined BraTS models for superiority using one-tailed paired t-tests, with α = 0.001.
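All of the superiority comparisons described above follow the same pattern; a sketch using SciPy is below. scipy.stats.ttest_rel returns a two-sided p-value, so the sketch converts it to a one-sided value in the hypothesized direction (whether the authors used this exact conversion is an assumption).

```python
# Sketch of a one-tailed paired t-test on per-image scores at alpha = 0.001.
import numpy as np
from scipy.stats import ttest_rel

def superior(scores_a: np.ndarray, scores_b: np.ndarray, alpha: float = 0.001) -> bool:
    """Test whether model A's paired per-image scores exceed model B's."""
    t_stat, p_two_sided = ttest_rel(scores_a, scores_b)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha
```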

Results

Baseline models

Training results are reported below; all results are reported as mean ± standard deviation in tables and text. The primary baseline model achieved high overall performance, with a mean overall Dice coefficient of 0.909 ± 0.042 and mean AHD of 0.156 ± 0.231. This result is comparable to the top-performing prostate segmentation models found in the literature. Example evaluation segmentations for the baseline model on the primary dataset are shown in Fig 2 and in S2 and S3 Figs. The PX2 and P12 models performed less well, with mean overall Dice coefficients of 0.702 ± 0.083 and 0.568 ± 0.122, and mean AHDs of 0.480 ± 0.555 and 2.155 ± 2.466, respectively. Across all three models, midgland Dice performance was the highest (0.762–0.941), and performance on the base and apex regions was more limited (0.501–0.863). The P12 model was the worst performing across every measure. Performance measures on a per-sample basis for the PX2 and P12 baseline models are shown in Fig 3.

Fig 2. Example UCLA baseline model segmentations.

The orange contour depicts ground truth segmentation and the shaded blue area depicts model segmentation. A) Example apex, midgland, and base slices from a sample in the primary dataset with a high metric on evaluation. The soft Dice coefficient for this sample was 0.928, and the average Hausdorff distance was 0.085. Images of all of the slices for this study are presented in S2 Fig. B) Example apex, midgland, and base slices from a sample in the primary dataset with a low metric on evaluation. The soft Dice coefficient for this sample was 0.738, and the average Hausdorff distance was 0.935. Images of all of the slices for this study are presented in S3 Fig.

Fig 3. Evaluation metrics for PX2 and P12 datasets.

Soft Dice coefficients (A) and average Hausdorff distances (B) for every sample in the ProstateX-2 (PX2, n = 99) and PROMISE12 (P12, n = 50) datasets, after model evaluation for the baseline, BraTS, and refined primary baseline models. Each solid dot represents a single training example. The models trained by refining the BraTS pretrained model or the baseline pretrained model both exhibited improved performance and reduced variance on both evaluation metrics, with the refined primary baseline model exhibiting the highest performance and lowest variance. Detailed statistics are available in the results tables. PX2 = ProstateX-2, P12 = PROMISE12; results reported as mean ± standard deviation across all images. * denotes significantly higher than baseline model, p<0.001.

Generalizability to challenge datasets

Results are reported below. For the PX2 dataset, the primary baseline model exhibited a mean overall Dice coefficient of 0.465 ± 0.291 and AHD of 4.824 ± 5.920 before refining, and a coefficient of 0.912 ± 0.029 and AHD of 0.150 ± 0.192 after refining. For the P12 dataset, the primary baseline model exhibited an overall Dice coefficient of 0.708 ± 0.210 and AHD of 1.953 ± 3.747 before refining, and a Dice of 0.852 ± 0.091 and AHD of 0.581 ± 1.314 after refining. Similar to the previous experiments, Dice performance in the midgland region was higher than that in the base and apex regions for all models. For both datasets, the refined primary baseline model performed significantly better (p < 0.001) than the baseline model trained with only the respective dataset across all measures. Though the unrefined UCLA model performed better on the P12 dataset, after refining, performance was best on the PX2 dataset. Performance measures on a per-sample basis for the PX2 and P12 refined baseline models are shown in Fig 3. Example segmentations before and after refining are shown in S4 and S5 Figs.

Impact of dataset ablation

Results for these experiments are reported below. We found that model performance generally increased as the proportion of data used increased, with the primary model exhibiting an overall mean Dice coefficient of 0.638 at 5% and 0.909 at 100%. Both the PX2 and P12 models exhibited significantly increased performance (p < 0.001) over their baselines at all ablation levels. For all three sets of models, the models trained at the 60% ablation level achieved approximately the eightieth-percentile performance.

Soft Dice coefficients for models trained with ablated dataset.

Soft Dice coefficients for models trained using the ablated primary dataset (“Primary”) or trained using an ablated primary model as weight initializer (“FT”). Significant improvements can be seen in the performance of the fine-tuned models even at 5% of the primary dataset used for training the ablated primary baseline model, with the performance benefits leveling out at 60% of the dataset. PX2 = ProstateX-2, P12 = PROMISE12, FT = fine-tuned; results reported as mean across all images. * denotes significantly higher than baseline model, p<0.001.

Comparison to BraTS model

The results of these experiments are reported below. The final overall soft Dice coefficient of the resulting model on the BraTS segmentation task was 0.591. When refined on the PX2 dataset, the mean overall soft Dice coefficient was 0.834, and the AHD was 0.299 ± 0.465. When refined on the P12 dataset, the mean overall Dice coefficient was 0.704, and the AHD was 1.428 ± 2.603. In both cases, the refined BraTS model significantly outperformed the baseline model across all measures (p<0.001), but was outperformed by the ablation models at 20% and higher (p<0.001). Performance measures on a per-sample basis for the PX2 and P12 refined BraTS models are shown in Fig 3. Example segmentations are shown in S4 and S5 Figs.

Discussion

In this study, we developed a prostate segmentation CNN model using a large clinically generated dataset, and examined the relationship between dataset size and model performance. We further explored the generalizability of the model to external datasets, and the relative contribution of using the model as a pre-trained starter for improving performance when training on limited datasets. We found that the network trained on our institution’s dataset did not perform well initially when used on outside data. However, refining the network on the external data using the initial model as a pre-trained starter yielded significantly superior performance to training using randomly initialized models. On the PX2 dataset, using our institution’s model as a pre-trained starter yielded an increase in mean overall Dice coefficient of 30%, and on the P12 dataset, an increase of 49%. Using a model trained on data (BraTS 2019) completely unrelated to prostate segmentation as a pre-trained starter also yielded improvements over baseline, but was not as effective as using the primary dataset as a starter. As demonstrated in Fig 3, model performance improved progressively from the baseline model, to the model trained using the non-relevant BraTS MR data, and finally to the model trained with the highly relevant UCLA prostate MR data.

The final performance of the models we trained using our pre-trained prostate MR starter was comparable to other results from the literature on both datasets using more complex models [9, 26, 27], highlighting the value of creating a domain-specific starter for this task. For example, the leading model on the P12 leaderboard (submitted on 9/7/2020) has a Dice score of 0.895, which compares favorably with our final overall Dice coefficient of 0.852 [10]. The leading model trained on a large private dataset (trained on 648 studies at the NIH) has a Dice score of 0.915, which compares favorably with our 0.909 [12].

We also found that using truncated versions of our dataset still yielded significant improvements. Even using a model trained on only 15% of the primary dataset as a pre-trained starter yielded improvements over baseline of 18% and 28% on the PX2 and P12 datasets, and the gains from increasing dataset size saturated at approximately 60% of the primary dataset.

These findings are notable in part because our primary dataset consists of rough clinical contours that have not been carefully re-annotated to produce a machine learning-quality dataset, and of images that were not filtered for inclusion of only optimal-quality series. We included in our primary dataset images with quality limitations, images that used endorectal coils, and images from patients who had had prostate treatments that significantly distort the visual appearance of the prostate. Despite these complications, we still found that we were able to train a state-of-the-art model and then use that model to boost the performance of models trained on “gold-standard” data. The performance gained through the use of our model as a pre-trained starter was greater than that obtained using an unrelated pretrained model (as is typical for transfer learning, e.g. from ImageNet [28]), suggesting that our model was able to learn features that were useful starters for the segmentation models trained for the external datasets.

Our work does have some limitations. Because we did not use a machine learning-quality version of our dataset, it is difficult to compare the overall performance results on our data to state-of-the-art models.
In addition, the imperfections in the clinically generated ground truth segmentations we used for our primary dataset likely include incorrectly annotated areas in both the foreground and the background. As a result, some differences between model predictions and the ground truth in the primary dataset are the result of inaccurate labels rather than model error. Because we held the model design constant and simple in order to restrict the dependent variables in our experiments to the datasets used for training and pretraining (and as such used data from all folds in our evaluations, rather than a single held-out fraction), we may have been prevented from realizing performance gains that other works have found through complex model designs or post-processing steps. However, our intent with these choices was to demonstrate that even a simple model with rough clinical contours can provide substantial value when contemplating model development. This finding may have significant implications for future work, in which larger datasets with lower-quality annotations may be combined with smaller datasets with high-quality annotations to maximize the value of available data without requiring the significant expenditure of re-annotation effort. We plan to further explore this hypothesis in future work using more difficult problems, such as prostate cancer segmentation, in order to determine whether this approach may unlock additional potential for medical image analysis. Additionally, because this is a retrospective analysis and does not include the real-time ultrasound used for image fusion, it is not possible for us to evaluate the impact of segmentation quality from different models on registration and biopsy targeting. Future, prospective work should include such an evaluation.

In conclusion, we trained a state-of-the-art model using rough clinical annotations, producing a prostate segmentation model with a mean overall Dice coefficient of 0.909 and an AHD of 0.156. We additionally found that models trained using truncated fractions of our data were effective pre-trained starters for achieving higher-performance models on external prostate segmentation challenge datasets. Our findings suggest a role for the combined use of datasets with low-quality and high-quality annotations in future medical image analysis model development, in order to maximize performance while minimizing annotation effort.

S1 Table. Imaging acquisition parameters for study datasets.

Full acquisition data is not available for the PROMISE12 dataset, and the counts for images acquired at different field strengths and resolutions are not available. (DOCX)

S2 Fig. Full volume example of primary baseline dataset segmentation, high metric.

Orange contour depicts ground truth segmentation. Shaded blue area depicts model segmentation. Slices depicted from apex to base. The soft Dice coefficient for this sample was 0.928, and the average Hausdorff distance was 0.085. (PNG)

S3 Fig. Full volume example of primary baseline dataset segmentation, low metric.

Orange contour depicts ground truth segmentation. Shaded blue area depicts model segmentation. Slices depicted from apex to base. The soft Dice coefficient for this sample was 0.738, and the average Hausdorff distance was 0.935. (PNG)

S4 Fig. Example ProstateX-2 segmentations.

Orange contour depicts ground truth segmentation. Shaded blue area depicts model segmentation. The soft Dice coefficient and average Hausdorff distance metrics were 0.645 and 1.024 for the baseline model, 0.864 and 0.167 for the BraTS model, and 0.932 and 0.079 for the refined primary baseline model. (PNG)

S5 Fig. Example PROMISE12 segmentations.

Orange contour depicts ground truth segmentation. Shaded blue area depicts model segmentation. The soft Dice coefficient and average Hausdorff distance metrics were 0.536 and 2.974 for the baseline model, 0.678 and 0.291 for the BraTS model, and 0.910 and 0.102 for the refined primary baseline model. (PNG)
The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf Thank you for stating in your Funding Statement: KVS acknowledges support from National Cancer Institute grant F30CA210329, National Institute of General Medical Studies grant GM08042, and the UCLA-Caltech Medical Scientist Training Program. CWA acknowledges funding from National Cancer Institute grants R21CA220352 P50CA092131, and an NVIDIA Corporation Academic Hardware Grant. LSM acknowledges funding from National Cancer Institute grants R01CA195505 and R01CA158627. SH acknowledges that this project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E. TS, SM, and BT acknowledge that this project was supported in part by the Intramural Research Program of the NIH. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.  Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement. Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf. Thank you for stating the following in the Competing Interests section: I have read the journal's policy and the authors of this manuscript have the following competing interests: LSM and AMP report a financial interest in Avenda Health outside the submitted work. BT reports IP-related royalties from Philips. The NIH has cooperative research and development agreements with NVIDIA, Philips, Siemens, Xact Robotics, Celsion Corp, Boston Scientific, and research partnerships with Angiodynamics, ArciTrax, and Exact Imaging. No other authors have competing interests to disclose. We note that you received funding from a commercial source: Avenda Health , NVIDIA, Philips, Siemens, Xact Robotics, Celsion Corp, Boston Scientific,  Angiodynamics, ArciTrax, and Exact Imaging Please provide an amended Competing Interests Statement that explicitly states this commercial funder, along with any other relevant declarations relating to employment, consultancy, patents, products in development, marketed products, etc. Within this Competing Interests Statement, please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. 
Please include your amended Competing Interests Statement within your cover letter. We will change the online submission form on your behalf. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The paper addresses an important need in medical image segmentation analysis today, developing accurate segmentation models using unrefined segmentations as inputs. The use of non-perfect inputs can enable more data to be used for training and potentially improve model generalizability. The authors use a large clinical dataset of >1600 prostate MRIs to develop and validate a deep learning model for prostate gland segmentation. The model uses only clinically-produced prostate segmentations as model inputs. The study appears appropriately designed and utilizes the largest training cohort in prostate segmentation to date. I liked the idea to evaluate the model for a non-prostate segmentation task (i.e., brain cancer segmentation). I have several comments. Abstract 1. Materials and method: It could be relevant to mention that 5-fold cross-validation was performed, possibly also mentioning the initial train-validation split in the abstract. 2. Results: Please clarify whether the dice score of 0.91 was obtained on the training or independent testing data. It appears that all of the data was used for model training and there was no independent test set on the UCLA data. Introduction 3. Paragraph 1 sentence 1: I recommend either keeping and using the abbreviation for prostate cancer (PCa) consistently throughout the paper or removing the abbreviation. 4. Paragraph 1 sentence 3: At some institutions, urologists review the segmentations on T2-weighted MRI, whereas other institutions use radiologists or a mix. 5. Paragraph 1 sentence 4: Please change Phillips to Philips. This typo is repeated elsewhere as well. 6. Paragraph 1 sentence 4: If one exists, please provide a citation for DynaCAD Prostate (Philips) segmentation. 7. Paragraph 2 sentence 1: This makes a definitive assumption that inaccurate segmentations are sufficient for clinical tasks. Please provide a citation that supports this argument. Otherwise, I would not state this so definitively. For accurately targeting cancer in the prostate it is not the volume that is most important, but the relationship of the cancer to the edge of the prostate. 8. Paragraph 3: It would enhance the paper if the authors could provide a range in the number of cases that have previously been utilized for training different prostate segmentation models and cite the models; not all prostate segmentation models have been trained on small datasets, e.g.,: “Data Augmentation and Transfer Learning to Improve Generalizability of an Automated Prostate Segmentation Model. Thomas H. Sanford, Ling Zhang, Stephanie A. Harmon, Jonathan Sackett, Dong Yang, Holger Roth, Ziyue Xu, Deepak Kesani, Sherif Mehralivand, Ronaldo H. Baroni, Tristan Barrett, Rossano Girometti, Aytekin Oto, Andrei S. Purysko, Sheng Xu, Peter A. Pinto, Daguang Xu, Bradford J. Wood, Peter L. Choyke, and Baris Turkbey. American Journal of Roentgenology 2020 215:6, 1403-1410” Materials and methods 9. Primary dataset: If possible, please mention or describe whether scans were compliant with PIRADS specifications. 10. Primary dataset: To help readers assess generalizability, please describe the distribution between scans obtained on different scanners (e.g., Philips, GE, and/or Siemens). 11. Primary dataset: Please include whether scans were acquired using endorectal coil or not? Did this effect model performance? 12. 
Model, training and evaluation: Consider calculating the Hausdorff distance in all cases in the internal test set to demonstrate how gland segmentation accuracy may impact the location of the target. 13. Model, training and evaluation: The authors might want to mention that the results were reported as mean +/- standard deviation. 14. Baseline models: What was the initial training-validation split in the 5-fold cross-validation? 15. Please make it clear who performed the expert segmentations used as the gold standard for model training and for calculating dice scores. Results 16. It appears that the Dice of 0.91 was obtained from the 5 fold cross validation rather than from a held out test set in the UCLA data or from testing in the external dataset. Is this correct? If so, why was testing not performed in a held out test set from the UCLA data? 17. Please report the results as mean +/- standard deviation. 18. The authors mention that the model performs better after refinement. Why is the performance so poor prior to refinement (Dice 0.465)? 19. Is the improvement in dice clinically significant (i.e. does it make the cancer target location more accurate)? Discussion 20. There appears to be a discrepancy between the results in the abstract and what can be found in the discussion section. Did using the model trained on the large internal dataset as a pre-trained template increase performance (dice score) by 39% and 49% in the two respective external datasets – or was it 30% and 49%? 21. To help put this into context, I’d recommend citing other prostate segmentation papers that perform worse, equal to, and better than your model. 22. In calculating the dice in the clinical dataset, what was used as the gold standard segmentation? If it just the clinical segmentation, is it possible that the model segmentation is better than the gold standard? Tables and figures: 23. It should be clear to the reader that the results are shown as mean +/- standard deviation. 24. Space permitting, a table that shows relevant MRI characteristics and dataset composition would enhance the paper's value (e.g., it could include the number of cases, scanners, MRI sequence, magnetic field strength, slice thickness, in-plane resolution). 25. There should be figures that demonstrate the performance of the model visually. 26. It would be helpful if the paper provided an overview diagram of the 3D U-Net architecture and possibly also the preprocessing steps in the same diagram. 27. I recommend including 1-2 figures demonstrating representative test cases from the internal and external datasets, showing every slice from base-to-apex, including segmentation outputs before and after refinement. 28. It would be great to see the addition of box plots (or similar plots) that show dice score distributions (range, IQR, median) before and after refinement. 29. It would be interesting to see whether the choice of deep learning model affects the results significantly, more so than the size of the dataset. One idea could be to look at ImageNet or the holistically nested edge detector network (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5565676) Limitations section: 30. I recommend incorporating the following into the limitations. a. The 0.91 dice reported in the abstract is not from a held out test set or the external testing. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. 
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 16 May 2021 Response to Reviewers (point-by-point) Thank you for the very helpful review of our work. Based on your suggestions, we have made numerous improvements to our paper, and have detailed improvements made in response to your comments below: Abstract 1. Materials and method: It could be relevant to mention that 5-fold cross-validation was performed, possibly also mentioning the initial train-validation split in the abstract. We have added information about the cross-validation split and clarified that the evaluation scores are calculated across the full dataset to the section. 2. Results: Please clarify whether the dice score of 0.91 was obtained on the training or independent testing data. It appears that all of the data was used for model training and there was no independent test set on the UCLA data. We have added a parenthetical clarification to the section that the results are across CV folds. Introduction 3. Paragraph 1 sentence 1: I recommend either keeping and using the abbreviation for prostate cancer (PCa) consistently throughout the paper or removing the abbreviation. We have removed this abbreviation from the paper for clarity. 4. Paragraph 1 sentence 3: At some institutions, urologists review the segmentations on T2-weighted MRI, whereas other institutions use radiologists or a mix. We have updated this sentence to more accurately represent the process. 5. Paragraph 1 sentence 4: Please change Phillips to Philips. This typo is repeated elsewhere as well. We have fixed the spelling of “Philips” throughout the paper. 6. Paragraph 1 sentence 4: If one exists, please provide a citation for DynaCAD Prostate (Philips) segmentation. Philips has not published any work on the performance or operation of DynaCAD, however we have cited the product website. 7. Paragraph 2 sentence 1: This makes a definitive assumption that inaccurate segmentations are sufficient for clinical tasks. Please provide a citation that supports this argument. Otherwise, I would not state this so definitively. For accurately targeting cancer in the prostate it is not the volume that is most important, but the relationship of the cancer to the edge of the prostate. We have removed this assertion from the paragraph to put emphasis on the fact that accurate segmentations are critical for downstream image analysis, such as cancer detection. 8. 
Paragraph 3: It would enhance the paper if the authors could provide a range in the number of cases that have previously been utilized for training different prostate segmentation models and cite the models; not all prostate segmentation models have been trained on small datasets, e.g.,: “Data Augmentation and Transfer Learning to Improve Generalizability of an Automated Prostate Segmentation Model. Thomas H. Sanford, Ling Zhang, Stephanie A. Harmon, Jonathan Sackett, Dong Yang, Holger Roth, Ziyue Xu, Deepak Kesani, Sherif Mehralivand, Ronaldo H. Baroni, Tristan Barrett, Rossano Girometti, Aytekin Oto, Andrei S. Purysko, Sheng Xu, Peter A. Pinto, Daguang Xu, Bradford J. Wood, Peter L. Choyke, and Baris Turkbey. American Journal of Roentgenology 2020 215:6, 1403-1410” We have added citations for a sample of previously published prostate segmentation models that use challenge datasets and that use institutional datasets, as well as a range for the sizes of the datasets used. Materials and methods 9. Primary dataset: If possible, please mention or describe whether scans were compliant with PIRADS specifications. Our dataset begins in 2010 which predates PI-RADS. As such, we chose to use the T2 SPACE protocol images for our UCLA dataset, as the parameters remained the same through the collection period. We have added additional information about our acquisition protocol and parameters for the T2 images to the Materials and Methods section and table S1; we have also added information about the challenge datasets to table S1. 10. Primary dataset: To help readers assess generalizability, please describe the distribution between scans obtained on different scanners (e.g., Philips, GE, and/or Siemens). We have clarified in the Materials and Methods section and table S1 that all primary dataset scans were performed on Siemens scanners. 11. Primary dataset: Please include whether scans were acquired using endorectal coil or not? Did this effect model performance? An endorectal coil was used for a small fraction (<2%) of patients. We have noted this information in table S1. Because such a small number of patients had an endorectal coil, there is not enough data to conclusively determine the impact of the coil on model performance. For the patients in the primary dataset who were imaged with an endorectal coil, mean overall Dice coefficient was 0.884 (vs. 0.909 across the whole dataset) and mean AHD was 0.200 (vs. 0.156 across the whole dataset); both values are within one SD of the whole dataset means. 12. Model, training and evaluation: Consider calculating the Hausdorff distance in all cases in the internal test set to demonstrate how gland segmentation accuracy may impact the location of the target. We have added the average Hausdorff distance as an evaluation metric and reported our findings as they relate to model performance improvement throughout the paper, tables, and figures. 13. Model, training and evaluation: The authors might want to mention that the results were reported as mean +/- standard deviation. We have added additional clarity in this section regarding how the evaluation metrics were computed across all images in the dataset. 14. Baseline models: What was the initial training-validation split in the 5-fold cross-validation? We have added additional clarity regarding the training/validation set split to the Baseline models section. 15. Please make it clear who performed the expert segmentations used as the gold standard for model training and for calculating dice scores. 
Results

16. Comment: It appears that the Dice of 0.91 was obtained from the 5-fold cross-validation rather than from a held-out test set in the UCLA data or from testing on the external dataset. Is this correct? If so, why was testing not performed on a held-out test set from the UCLA data?
Response: The mean overall Dice and AHD statistics reported on the primary dataset are the means over the study-level scores for every image included in the evaluation. These scores were calculated using the cross-validation model that was trained without that particular image in its training set (i.e., the image was in the held-out fold for that model). Studies typically require a held-out test set to address the potential bias introduced by model-level optimizations (i.e., adjusting the model design to optimize performance). However, because our study design involved "freezing" the model configuration so that the only dependent variables in our experiments were data-related, there was no risk of bias from model optimization over the course of the experiments and thus no need for a separate held-out test set. We therefore chose to use all of the available data in the evaluation in order to provide a fairer assessment of model performance across a larger set of images. We believe this provides the best estimate of model performance and variance.

17. Comment: Please report the results as mean ± standard deviation.
Response: We have updated the paper throughout to report results in this way.

18. Comment: The authors mention that the model performs better after refinement. Why is the performance so poor prior to refinement (Dice 0.465)?
Response: The PX2 dataset was acquired with homogeneous imaging parameters, while the P12 dataset was acquired with very heterogeneous imaging parameters whose range overlaps more closely with the UCLA dataset. We believe this explains why the unrefined UCLA model performs better on P12 than on PX2, and why, after refining, the PX2 model is superior owing to its internal consistency. Unfortunately, because individual sample acquisition parameters were not provided with the P12 dataset, it is not possible to run an analysis to test this hypothesis.

19. Comment: Is the improvement in Dice clinically significant (i.e., does it make the cancer target location more accurate)?
Response: Unfortunately, it is not possible for us to evaluate the impact of the performance improvements on biopsy targeting due to the retrospective design of our study, which does not include access to the real-time ultrasound used for image fusion. We have added a note about this to the limitations section and highlighted a prospective evaluation of targeting as future work.
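The out-of-fold evaluation scheme described in the response to point 16 — scoring each study with the fold model that never saw it during training — could be organized as in the sketch below. The `train_model` callable and data containers are hypothetical placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_scores(images, masks, train_model, metric, n_splits=5, seed=0):
    """Score every study with the CV model whose held-out fold contains it."""
    scores = np.empty(len(images))
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kfold.split(images):
        # Hypothetical trainer returning a model with a .predict() method.
        model = train_model([images[i] for i in train_idx],
                            [masks[i] for i in train_idx])
        for i in val_idx:
            scores[i] = metric(model.predict(images[i]), masks[i])
    return scores  # report as scores.mean() +/- scores.std() over all studies
```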
Discussion

20. Comment: There appears to be a discrepancy between the results in the abstract and in the discussion section. Did using the model trained on the large internal dataset as a pre-trained template increase the Dice score by 39% and 49% in the two respective external datasets, or was it 30% and 49%?
Response: We have corrected this error (the correct values are 30% and 49% improvements in Dice).

21. Comment: To help put this into context, I'd recommend citing other prostate segmentation papers that perform worse than, equal to, and better than your model.
Response: We have added comparisons and citations to the leading prostate segmentation models on challenge and private datasets to the discussion section to place our results in context. Since our results are comparable to the best published results available, we provided one comparison for each group in the discussion, and added a series of additional citations for models trained on challenge or private datasets to the introduction.

22. Comment: In calculating the Dice on the clinical dataset, what was used as the gold-standard segmentation? If it is just the clinical segmentation, is it possible that the model segmentation is better than the gold standard?
Response: For the primary dataset, we used the clinically generated segmentation. As such, it is possible that some differences between model predictions and the ground truth are the result of inaccurate labels rather than model error. We have added a note about this possibility to the limitations section.

Tables and figures

23. Comment: It should be clear to the reader that the results are shown as mean ± standard deviation.
Response: We have added this note to each table and to the beginning of the results section for clarity.

24. Comment: Space permitting, a table that shows relevant MRI characteristics and dataset composition would enhance the paper's value (e.g., number of cases, scanners, MRI sequence, magnetic field strength, slice thickness, in-plane resolution).
Response: We have added this information as Table S1.

25. Comment: There should be figures that demonstrate the performance of the model visually.
Response: We have added a number of figures (as described below) to help demonstrate our model's performance and results.

26. Comment: It would be helpful if the paper provided an overview diagram of the 3D U-Net architecture, possibly with the preprocessing steps in the same diagram.
Response: We have added such a diagram as Figure 1.

27. Comment: I recommend including 1-2 figures demonstrating representative test cases from the internal and external datasets, showing every slice from base to apex, including segmentation outputs before and after refinement.
Response: We have added two "every slice" figures demonstrating segmentation outputs as Figures S2 and S3, as well as a figure with representative slices as Figure 2. Finally, we have added Figures S4 and S5, which depict representative segmentation slices of the same prostate for the baseline, BraTS, and refined primary baseline models.

28. Comment: It would be great to see box plots (or similar) showing the Dice score distributions (range, IQR, median) before and after refinement.
Response: Based on your suggestion, we evaluated a number of potential plots (including box plots and violin plots) and found that strip plots best presented the information. We have added strip plots of Dice and average Hausdorff scores for the challenge datasets across the baseline and refined models as Figure 3.

29. Comment: It would be interesting to see whether the choice of deep learning model affects the results more than the size of the dataset. One idea could be to look at ImageNet or the holistically nested edge detector network (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5565676).
Response: We agree that exploring the choice of deep learning model would be of great interest. We chose to fix our model parameters for the purposes of this study in order to focus our analysis on the impact of data size and relevance; in future work, we would like to further study the impact of model design on this outcome. This limitation is noted in the limitations section of the paper.
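The strip plots mentioned in point 28 could be produced along the lines below. The per-study scores here are synthetic values drawn to roughly match the means and standard deviations in Table 2, for illustration only; they are not the paper's data, and the plotting code is an assumption rather than the authors' actual figure script.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic per-study Dice scores for illustration (99 PX2, 50 P12 studies).
df = pd.DataFrame({
    "dataset": ["PX2"] * 198 + ["P12"] * 100,
    "model": ["baseline"] * 99 + ["refined"] * 99
             + ["baseline"] * 50 + ["refined"] * 50,
    "dice": np.concatenate([
        rng.normal(0.465, 0.29, 99).clip(0, 1),  # PX2 before refinement
        rng.normal(0.912, 0.03, 99).clip(0, 1),  # PX2 after refinement
        rng.normal(0.708, 0.21, 50).clip(0, 1),  # P12 before refinement
        rng.normal(0.852, 0.09, 50).clip(0, 1),  # P12 after refinement
    ]),
})
sns.stripplot(data=df, x="dataset", y="dice", hue="model", dodge=True)
plt.ylabel("Soft Dice coefficient")
plt.show()
```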
Limitations

30. Comment: I recommend incorporating into the limitations that the 0.91 Dice reported in the abstract is not from a held-out test set or from external testing.
Response: We have updated the limitations section to clarify why we held the model design constant and included data from all folds in the evaluation rather than using a held-out test set. We have also included performance metrics for the challenge datasets in the abstract.

Submitted filename: Response to Reviewers.pdf

14 Jun 2021 — Editorial decision: accept pending technical requirements (PONE-D-21-06200R1)

The manuscript was judged scientifically suitable for publication in PLOS ONE by the Academic Editor, Usman Qamar. Reviewer #1's responses to the standard reviewer questionnaire were: all prior comments have been addressed; the manuscript is technically sound and the data support the conclusions (Yes); the statistical analysis has been performed appropriately and rigorously (Yes);
all data underlying the findings are fully available per the PLOS Data policy (No); and the manuscript is intelligible and written in standard English (Yes). Reviewer #1 further commented: "The authors have done a nice job responding to my comments. I have no additional recommendations apart from also citing a similar, recent publication on prostate gland segmentation using clinically-generated labels: Deep Learning Improves Speed and Accuracy of Prostate Gland Segmentations on Magnetic Resonance Imaging for Targeted Biopsy. J Urol, April 2021."

18 Jun 2021 — Formal acceptance

The manuscript was accepted for publication in PLOS ONE and forwarded to the production department (Academic Editor: Usman Qamar).
Table 1

Evaluation results for baseline models.

Dataset | Dice (Overall) | Dice (Base)   | Dice (Midgland) | Dice (Apex)   | Avg. Hausdorff Distance
Primary | 0.909 ± 0.042  | 0.863 ± 0.095 | 0.941 ± 0.030   | 0.832 ± 0.094 | 0.156 ± 0.231
PX2     | 0.702 ± 0.083  | 0.679 ± 0.117 | 0.849 ± 0.051   | 0.702 ± 0.093 | 0.480 ± 0.555
P12     | 0.568 ± 0.122  | 0.501 ± 0.168 | 0.762 ± 0.087   | 0.561 ± 0.168 | 2.155 ± 2.466

PX2 = ProstateX-2; P12 = PROMISE12. Results reported as mean ± standard deviation across all images.

Table 2

Evaluation results for retargeted models.

Refining? | Dataset | Dice (Overall) | Dice (Base)    | Dice (Midgland) | Dice (Apex)    | Avg. Hausdorff Distance
No        | PX2     | 0.465 ± 0.291  | 0.314 ± 0.314  | 0.517 ± 0.316   | 0.401 ± 0.312  | 4.824 ± 5.920
Yes       | PX2     | 0.912* ± 0.029 | 0.851* ± 0.102 | 0.949* ± 0.024  | 0.849* ± 0.070 | 0.150* ± 0.192
No        | P12     | 0.708 ± 0.210  | 0.475 ± 0.317  | 0.779 ± 0.215   | 0.679 ± 0.221  | 1.953 ± 3.747
Yes       | P12     | 0.852* ± 0.091 | 0.744* ± 0.207 | 0.918* ± 0.046  | 0.777* ± 0.134 | 0.581* ± 1.314

* denotes significantly better than the baseline (unrefined) model — higher Dice, lower average Hausdorff distance — p < 0.001. PX2 = ProstateX-2; P12 = PROMISE12. Results reported as mean ± standard deviation across all images.
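Since the baseline and refined models are scored on the same studies, the per-image comparisons behind the asterisks are paired. The specific statistical test is not stated in this excerpt; a paired Wilcoxon signed-rank test is one common choice and is sketched here as an assumption, with hypothetical example scores.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-study Dice scores on the same images (paired by study).
baseline_dice = np.array([0.42, 0.51, 0.47, 0.38, 0.55, 0.49])
refined_dice = np.array([0.90, 0.93, 0.91, 0.88, 0.94, 0.92])

# Signed-rank test on the paired differences (refined - baseline).
stat, p_value = wilcoxon(refined_dice, baseline_dice)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")
```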

Table 3

Model performance using ablated primary dataset (overall soft Dice coefficients).

Model   | 5%     | 10%    | 15%    | 20%    | 40%    | 60%    | 80%    | 100%
Primary | 0.638  | 0.754  | 0.775  | 0.825  | 0.883  | 0.901  | 0.906  | 0.909
PX2 FT  | 0.740* | 0.814* | 0.829* | 0.861* | 0.899* | 0.907* | 0.909* | 0.912*
P12 FT  | 0.625* | 0.721* | 0.727* | 0.781* | 0.831* | 0.848* | 0.842* | 0.852*

* denotes significantly higher than the baseline model, p < 0.001. PX2 = ProstateX-2; P12 = PROMISE12; FT = fine-tuned. Results reported as mean across all images.
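Table 3's ablation retrains the template on growing fractions of the primary dataset. Whether the paper's subsets were nested (each fraction containing the smaller ones) is not stated in this excerpt; the sketch below draws nested subsets from a single fixed shuffle, which keeps the fractions directly comparable.

```python
import numpy as np

def ablation_subsets(n_studies,
                     fractions=(0.05, 0.10, 0.15, 0.20, 0.40, 0.60, 0.80, 1.00),
                     seed=0):
    """Yield (fraction, index array) pairs of nested training subsets."""
    order = np.random.default_rng(seed).permutation(n_studies)
    for frac in fractions:
        k = max(1, int(round(frac * n_studies)))
        yield frac, order[:k]

# e.g., for the 1,620-study primary dataset:
for frac, idx in ablation_subsets(1620):
    print(f"{frac:.0%}: {len(idx)} studies")
```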

Table 4

Evaluation results for refined BraTS models.

Dataset | Dice (Overall) | Dice (Base)    | Dice (Midgland) | Dice (Apex)    | Avg. Hausdorff Distance
PX2     | 0.834* ± 0.072 | 0.783* ± 0.126 | 0.903* ± 0.065  | 0.775* ± 0.097 | 0.299* ± 0.465
P12     | 0.704* ± 0.137 | 0.614* ± 0.208 | 0.820* ± 0.120  | 0.644* ± 0.186 | 1.428* ± 2.603

* denotes significantly better than the baseline model — higher Dice, lower average Hausdorff distance — p < 0.001. PX2 = ProstateX-2; P12 = PROMISE12. Results reported as mean ± standard deviation across all images.
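Tables 2 and 4 compare refining from the prostate template against refining from an out-of-domain BraTS template. In either case, the transfer step amounts to initializing the challenge-dataset model from the template's weights before continuing training. A minimal PyTorch sketch, assuming a hypothetical `UNet3D` class, weight file, data loader, and loss function (none of which appear in the source):

```python
import torch

model = UNet3D()  # hypothetical 3D U-Net matching the template architecture
model.load_state_dict(torch.load("template_weights.pt"))  # prostate or BraTS template

# Fine-tune all layers on the small challenge dataset (PX2 or P12).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for images, masks in challenge_loader:  # hypothetical DataLoader of (image, mask) batches
    optimizer.zero_grad()
    loss = soft_dice_loss(model(images), masks)  # hypothetical Dice-based loss
    loss.backward()
    optimizer.step()
```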

References (19 in total; 10 shown below)

1.  Generalizing Deep Learning for Medical Image Segmentation to Unseen Domains via Deep Stacked Transformation.

Authors:  Ling Zhang; Xiaosong Wang; Dong Yang; Thomas Sanford; Stephanie Harmon; Baris Turkbey; Bradford J Wood; Holger Roth; Andriy Myronenko; Daguang Xu; Ziyue Xu
Journal:  IEEE Trans Med Imaging       Date:  2020-02-12       Impact factor: 10.048

2.  3D APA-Net: 3D Adversarial Pyramid Anisotropic Convolutional Network for Prostate Segmentation in MR Images.

Authors:  Haozhe Jia; Yong Xia; Yang Song; Donghao Zhang; Heng Huang; Yanning Zhang; Weidong Cai
Journal:  IEEE Trans Med Imaging       Date:  2019-07-11       Impact factor: 10.048

3.  PROSTATEx Challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images.

Authors:  Samuel G Armato; Henkjan Huisman; Karen Drukker; Lubomir Hadjiiski; Justin S Kirby; Nicholas Petrick; George Redmond; Maryellen L Giger; Kenny Cha; Artem Mamonov; Jayashree Kalpathy-Cramer; Keyvan Farahani
Journal:  J Med Imaging (Bellingham)       Date:  2018-11-10

4.  Variability of manual segmentation of the prostate in axial T2-weighted MRI: A multi-reader study.

Authors:  Anton S Becker; Krishna Chaitanya; Khoschy Schawkat; Urs J Muehlematter; Andreas M Hötker; Ender Konukoglu; Olivio F Donati
Journal:  Eur J Radiol       Date:  2019-10-25       Impact factor: 3.528

5.  Fully automated prostate whole gland and central gland segmentation on MRI using holistically nested networks with short connections.

Authors:  Ruida Cheng; Nathan Lay; Holger R Roth; Baris Turkbey; Dakai Jin; William Gandler; Evan S McCreedy; Tom Pohida; Peter Pinto; Peter Choyke; Matthew J McAuliffe; Ronald M Summers
Journal:  J Med Imaging (Bellingham)       Date:  2019-06-05

6.  N4ITK: improved N3 bias correction.

Authors:  Nicholas J Tustison; Brian B Avants; Philip A Cook; Yuanjie Zheng; Alexander Egan; Paul A Yushkevich; James C Gee
Journal:  IEEE Trans Med Imaging       Date:  2010-04-08       Impact factor: 10.048

7.  Deeply supervised 3D fully convolutional networks with group dilated convolution for automatic MRI prostate segmentation.

Authors:  Bo Wang; Yang Lei; Sibo Tian; Tonghe Wang; Yingzi Liu; Pretesh Patel; Ashesh B Jani; Hui Mao; Walter J Curran; Tian Liu; Xiaofeng Yang
Journal:  Med Phys       Date:  2019-02-19       Impact factor: 4.071

8.  Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries.

Authors:  Freddie Bray; Jacques Ferlay; Isabelle Soerjomataram; Rebecca L Siegel; Lindsey A Torre; Ahmedin Jemal
Journal:  CA Cancer J Clin       Date:  2018-09-12       Impact factor: 508.702

9.  Data Augmentation and Transfer Learning to Improve Generalizability of an Automated Prostate Segmentation Model.

Authors:  Thomas H Sanford; Ling Zhang; Stephanie A Harmon; Jonathan Sackett; Dong Yang; Holger Roth; Ziyue Xu; Deepak Kesani; Sherif Mehralivand; Ronaldo H Baroni; Tristan Barrett; Rossano Girometti; Aytekin Oto; Andrei S Purysko; Sheng Xu; Peter A Pinto; Daguang Xu; Bradford J Wood; Peter L Choyke; Baris Turkbey
Journal:  AJR Am J Roentgenol       Date:  2020-10-14       Impact factor: 3.959

10.  Graph-convolutional-network-based interactive prostate segmentation in MR images.

Authors:  Zhiqiang Tian; Xiaojian Li; Yaoyue Zheng; Zhang Chen; Zhong Shi; Lizhi Liu; Baowei Fei
Journal:  Med Phys       Date:  2020-07-13       Impact factor: 4.071
