| Literature DB >> 36240135 |
Aly A Valliani1, Faris F Gulamali1, Young Joon Kwon1, Michael L Martini1, Chiatse Wang2, Douglas Kondziolka3,4, Viola J Chen5, Weichung Wang2,6, Anthony B Costa7, Eric K Oermann3,8.
Abstract
The fundamental challenge in machine learning is ensuring that trained models generalize well to unseen data. We developed a general technique for ameliorating the effect of dataset shift using generative adversarial networks (GANs) on a dataset of 149,298 handwritten digits and a dataset of 868,549 chest radiographs obtained from four academic medical centers. Efficacy was assessed by comparing area under the curve (AUC) pre- and post-adaptation. On the digit recognition task, the baseline CNN achieved an average internal test AUC of 99.87% (95% CI, 99.87-99.87%), which decreased to an average external test AUC of 91.85% (95% CI, 91.82-91.88%), with an average salvage of 35% from baseline upon adaptation. On the lung pathology classification task, the baseline CNN achieved an average internal test AUC of 78.07% (95% CI, 77.97-78.17%) and an average external test AUC of 71.43% (95% CI, 71.32-71.60%), with a salvage of 25% from baseline upon adaptation. Adversarial domain adaptation leads to improved model performance on radiographic data derived from multiple out-of-sample healthcare populations. This work can be applied to other medical imaging domains to help shape the deployment toolkit of machine learning in medicine.
Year: 2022 PMID: 36240135 PMCID: PMC9565422 DOI: 10.1371/journal.pone.0273262
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Machine learning deployment strategies and schematic illustration of the proposed generative adversarial algorithm for domain adaptation.
(A) There are four primary strategies for deploying machine learning models across distinct data domains: 1) train a model on one domain and deploy it across multiple distinct domains, 2) train multiple bespoke models, each optimized for deployment on an individual domain, 3) train and deploy a single global model on all domains, and 4) train a model on one domain and adapt it through technical means so that it performs well on a distinct domain. (B) Generative adversarial networks provide a technical framework for domain adaptation. A generator translates real data from one domain into synthetic data that resembles a different domain, while the discriminator aims to distinguish between the two; this adversarial interplay drives the generator to produce realistic-looking data in the target domain. (C) Schematic of the proposed algorithm. a) Real data from a source domain is translated by the generator to resemble data from a specified target domain while maintaining the underlying semantic qualities of the input image. b) The translated data is reconstructed by the generator back toward the source domain to maintain domain-agnostic image characteristics, with a semantic consistency constraint ensuring that reconstructed images preserve the semantic characteristics of the source data. c) The discriminator aims to distinguish real from synthetic images and to identify the domain of input images, constraining the generator to produce realistic-looking synthetic images from the specified domain. d) A target discriminator is fine-tuned on synthetic images to better identify opacity in the target domain.
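The loss structure of the adaptation scheme in Fig 1C (translate, reconstruct with a consistency constraint, and fool a discriminator) can be sketched in miniature. The following is an illustrative numpy toy, not the paper's implementation: the linear "generator", logistic "discriminator", domain codes, and loss weights are all assumptions standing in for the CNNs actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.eye(4) + 0.1 * rng.normal(size=(4, 4))  # toy "generator" weights

def generator(x, domain_code):
    # Translate image x toward the domain indicated by domain_code.
    return x @ W + 0.05 * domain_code

def discriminator(x):
    # Probability that x is a real image (toy logistic scorer).
    return 1.0 / (1.0 + np.exp(-x.mean()))

x_src = rng.normal(size=(4, 4))      # a source-domain "image"
tgt, src = np.ones(4), -np.ones(4)   # assumed domain codes

x_fake = generator(x_src, tgt)       # (a) source -> target translation
x_rec = generator(x_fake, src)       # (b) translate back to the source domain

cycle_loss = np.abs(x_rec - x_src).mean()         # reconstruction consistency
adv_loss = -np.log(discriminator(x_fake) + 1e-8)  # (c) fool the discriminator
sem_loss = np.abs(x_fake.mean() - x_src.mean())   # crude semantic-consistency proxy
total_loss = adv_loss + 10.0 * cycle_loss + 1.0 * sem_loss
```

In a real training loop the generator weights would be updated to minimize this total, while the discriminator is updated adversarially to maximize its ability to separate real from synthetic images.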
Fig 2. Results on the digits datasets.
(A) Performance of adapted and baseline algorithms as measured by area under the curve (AUC). Error bars denote standard deviations. Dotted lines represent the theoretical ceiling of AUC on the target test set as obtained by a baseline classifier trained on the target training set. Adaptation leads to a generalized increase in AUC across all source-target pairs with an average salvage of 35% of peak performance. (B) Expected relative change in AUC upon adaptation of a source dataset demonstrates a generalized increase in performance across populations. (C) In all cases, adaptation transforms input images (bounded by black boxes) to appear stylistically like those in the specified target domain (bounded by blue boxes) while preserving semantic information of images in the source domain.
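The "salvage" percentages reported here and in the abstract are not given an explicit formula in the text. A minimal sketch, assuming salvage is the fraction of the internal-to-external AUC drop that adaptation recovers:

```python
def salvage_pct(auc_internal, auc_external, auc_adapted):
    """Percentage of the internal-to-external AUC drop recovered by adaptation
    (assumed definition; the source reports salvage without a formula)."""
    drop = auc_internal - auc_external
    recovered = auc_adapted - auc_external
    return 100.0 * recovered / drop

# With the digit-task averages from the abstract, a 35% salvage of the
# 99.87 -> 91.85 drop would correspond to an adapted AUC near 94.66:
print(round(salvage_pct(99.87, 91.85, 94.657), 1))  # -> 35.0
```

The hypothetical adapted AUC of 94.657 is back-computed for illustration only; the per-pair adapted AUCs are shown in the figure itself.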
Fig 3. Results on the chest x-ray datasets.
(A) Performance of adapted and baseline algorithms as measured by area under the curve (AUC). Error bars denote standard deviations. Dotted lines represent the theoretical ceiling of AUC on the target test set, as obtained by a baseline classifier trained on the target training set; adaptation recovers an average of 25% of the baseline performance. (B) Expected relative change in AUC upon adaptation of a source dataset demonstrates a general improvement in performance across populations. The proposed adaptation technique leads to a generalized increase in AUC on average relative to baseline performance. (C) Input images without opacity are bounded by black boxes while those with opacity are bounded by red boxes. Adapted counterparts are bounded by blue boxes.
Fig 4. Results of baseline global models trained on incremental amounts of available data and evaluated on the global test set and dataset-specific test sets, demonstrating a discrepancy between global results and population-specific (domain-specific) results.
Error bars denote standard deviations. (A) Training and testing on an aggregate dataset obscures the fact that the model trained on all of the data differs in digit classification performance across sites by over 20%, arguing against the practical utility of testing on aggregated data. This discrepancy is ameliorated by increasing amounts of data and vanishes at 10% of the total available data. (B) These results are initially mirrored in the chest x-ray cohort, where the global model trained on chest x-rays from all hospital sites and evaluated on the global and dataset-specific test sets shows over 10% variation in performance at 0.1% of the total available data. Notably, this discrepancy between site-specific performances is only mildly alleviated by increasing amounts of data and persists even when the joint model is trained on the entirety of the available dataset.
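The effect described in Fig 4, a pooled metric masking poor performance at an individual site, can be illustrated with a toy example. The site sizes, score distributions, and rank-based AUC below are illustrative assumptions, not the paper's data:

```python
import numpy as np

def auc(y_true, scores):
    # Mann-Whitney AUC: probability a random positive outranks a random negative.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(1)

# Hypothetical site A: many cases, well-separated classes.
yA = np.repeat([0, 1], 400)
sA = np.concatenate([rng.normal(0.0, 1, 400), rng.normal(2.0, 1, 400)])

# Hypothetical site B: few cases, barely separated classes.
yB = np.repeat([0, 1], 40)
sB = np.concatenate([rng.normal(0.0, 1, 40), rng.normal(0.2, 1, 40)])

auc_A = auc(yA, sA)
auc_B = auc(yB, sB)
global_auc = auc(np.concatenate([yA, yB]), np.concatenate([sA, sB]))
print(f"site A: {auc_A:.2f}, site B: {auc_B:.2f}, global: {global_auc:.2f}")
```

Because site A dominates the pooled test set, the global AUC sits close to site A's and hides the much weaker performance at site B, which is the discrepancy the figure quantifies on real data.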