Spectral decoupling for training transferable neural networks in medical imaging.

Joona Pohjonen^1, Carolin Stürenberg^1, Antti Rannikko^1,2, Tuomas Mirtti^1,3, Esa Pitkänen^4,5.

Abstract

Many neural networks for medical imaging generalize poorly to data unseen during training. Such behavior can be caused by overfitting easy-to-learn features while disregarding other potentially informative features. A recent implicit bias mitigation technique called spectral decoupling provably encourages neural networks to learn more features by regularizing the networks' unnormalized prediction scores with an L2 penalty. We show that spectral decoupling increases the networks' robustness to data distribution shifts and prevents overfitting on easy-to-learn features in medical images. To validate our findings, we train networks with and without spectral decoupling to detect prostate cancer on tissue slides and COVID-19 in chest radiographs. Networks trained with spectral decoupling achieve up to 9.5 percentage points higher performance on external datasets. Spectral decoupling alleviates generalization issues associated with neural networks and can be used to complement or replace computationally expensive explicit bias mitigation methods, such as stain normalization in histological images.

Entities: Chemical

Keywords: Algorithms; Artificial intelligence; Medical imaging; Medical tests

Year: 2022 PMID: 35146385 PMCID: PMC8816718 DOI: 10.1016/j.isci.2022.103767

Source DB: PubMed Journal: iScience ISSN: 2589-0042

Introduction

Neural networks have been adapted to many medical imaging tasks with impressive results, often surpassing human counterparts in consistency, speed, and accuracy (Liu et al., 2019). However, these networks are prone to overfitting easy-to-learn or statistically dominant features, while disregarding other potentially informative features. This leads to poor generalization to data generated by different medical centers, reliance on the dominant features, and lack of robustness (Geirhos et al., 2020; Pezeshki et al., 2020). For example, a neural network classifier for skin cancer, approved to be used as a medical device in Europe, had overfit the correlation between surgical margins and malignant melanoma (Winkler et al., 2019). Owing to this, the false positive rate of the network increased by 40 percentage points during external validation. Furthermore, three out of five neural networks for pneumonia detection showed significantly worse performance during external validation (Zech et al., 2018), and recent neural networks for COVID-19 detection rely on confounding factors rather than actual medical pathology (DeGrave et al., 2021). Even small differences in the sharpness of images from two different scanners can degrade the performance of neural networks significantly (see Robustness section).

Although generalization issues need to be solved before any neural networks can be applied in clinical practice, the phenomenon is still poorly understood (van der Laak et al., 2021). This may be because the detection of generalization issues is hard and often requires state-of-the-art methods of explainable AI (DeGrave et al., 2021). An external dataset is one of the few ways of testing generalization performance, although it will uncover generalization issues only when the neural network fails to generalize to that dataset. Even if a neural network achieves high overall accuracy on the external dataset, it may still consistently fail for some subset of samples. Any particular external dataset may also contain the same sources of bias as the training data.

Explicit methods have been proposed to address specific sources of bias, like using augmentation to address staining differences in tissue section slides (Tellez et al., 2019) or normalizing each image with a common standard (de Bel et al., 2019; Janowczyk et al., 2017). The obvious problem with explicit methods is that they only control for selected biases, and more subtle sources of bias, like small differences between patient populations, may go unaddressed. Implicit methods of bias control are required before neural networks can be safely applied to clinical practice.

Learning dominant features at the cost of other potentially informative features, also known as shortcut learning, is a common problem in all neural networks and one of the main reasons behind the generalization issues (Geirhos et al., 2020). Shortcut learning occurs mainly because of gradient starvation, where gradient descent updates the parameters of a neural network in directions capturing only dominant features, thus starving the gradient from other features (des Combes et al., 2018). The gradient descent algorithm finds a local optimum by taking small steps in the direction opposite to the gradient, the direction of steepest descent (Cauchy, 1847). The recently proposed method of spectral decoupling (Pezeshki et al., 2020) provably decouples the learning dynamics leading to gradient starvation when using cross-entropy loss, thus encouraging the network to learn more features. The effect is achieved by simply adding an L2 penalty on the unnormalized prediction scores (logits) of the network.

We evaluate the utility of spectral decoupling as an implicit bias mitigation method in the context of medical imaging. We use simulation experiments to show that spectral decoupling increases networks' robustness to data distribution shifts and can be used to train generalizable networks on datasets with a strong superficial correlation. The findings are then evaluated by training prostate cancer and COVID-19 classifiers, where the networks trained with spectral decoupling achieve significantly higher performance on all evaluation datasets.

Results

In this section, the utility of using spectral decoupling as an implicit bias mitigation method is explored with both simulation and real-world experiments.

Dominant features

To assess the utility of spectral decoupling in situations where the training dataset contains a strong dominant feature, the cutout dataset defined in Simulation datasets is used. Five networks are trained with either spectral decoupling or weight decay on the training set. In addition, five networks are trained on the control dataset with weight decay to provide a reference point for the performance when the dominant feature causes no spurious correlation. The mean and standard deviation (SD) of the accuracy and recall metrics on the test data are reported in Table 1. Accuracy is defined as the fraction of all instances that were correctly identified, and recall as the fraction of positive instances that were correctly identified.
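The two metric definitions above can be written out directly. The following is a minimal sketch, assuming binary labels with 1 as the positive (cancerous) class; the function name and the toy labels are illustrative and are not taken from the study.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Accuracy: fraction of all instances that were correctly identified.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # Recall: fraction of positive instances that were correctly identified.
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_positives / positives if positives else float("nan")
    return accuracy, recall


# Toy labels (hypothetical, not data from the study): 1 = cancerous, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, rec = accuracy_and_recall(y_true, y_pred)
```

In this toy case five of eight predictions are correct and two of four positives are found, which illustrates how recall can drop well below accuracy when errors concentrate on the positive class, the pattern expected if cancerous samples with cutouts are misclassified as benign.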

Table 1

Results of the simulation study with the cutout dataset on dominant features

Name	Accuracy (SD)	Recall (SD)
Weight decay	0.752 (0.019)	0.523 (0.039)
Spectral decoupling	0.837 (0.020)	0.715 (0.046)
Control + weight decay	0.875 (0.009)	0.832 (0.036)

The mean and standard deviation (SD) values are reported for each set of five trained networks.

The use of spectral decoupling increases the accuracy by 8.5 percentage points over weight decay and almost reaches the performance of the network trained on the control dataset. The networks trained without spectral decoupling appear to make false predictions based on the dominant feature, although the class activation maps (Chattopadhay et al., 2018) of the trained neural networks do not significantly differ between weight decay and spectral decoupling. As hyper-parameters were tuned on the test set, the results should be interpreted only as a demonstration that spectral decoupling can offer an important level of control over the features that are learned. The simpler variant of spectral decoupling in Equation 1 did not increase the networks' performance in any way, and Equation 2 produced the reported results only after extensive hyper-parameter tuning. The tuning was sensitive to the selected parameters, and even small changes to the final values significantly reduced the accuracy of the neural network. Similar results were also reported with the real-world example in the original paper (Pezeshki et al., 2020). As extensive hyper-parameter tuning can deter researchers from applying the method, we limit hyper-parameter tuning to a simple grid search over limited search spaces for all other experiments, as described in Spectral decoupling.

Robustness

To assess whether spectral decoupling increases neural networks' robustness to data distribution shifts, five networks are trained with either spectral decoupling or weight decay and evaluated on the robustness dataset described in Simulation datasets. In addition, five networks are trained with weight decay but without UniformAugment to assess how much the augmentation strategy improves robustness. The robustness to data distribution shifts caused by sharpening, blurring, and reducing the intensity of either the hematoxylin or the eosin stain is presented in Figure 1.

Figure 1

Robustness for data distribution shifts from the training data

The lines show the mean accuracy and the shaded regions represent one SD around the mean.

Performance of all networks trained with weight decay and without the augmentation strategy degrades to roughly 50% accuracy. Training the networks again with UniformAugment significantly increases robustness to all data distribution shifts except hematoxylin stain intensity reduction (Figure 1C). When the data distribution shift is included as a possible augmentation (Figure 1A), the increase in accuracy is almost 40 percentage points with the most severe distribution shift. When the data distribution shift is not included as a possible transformation (Figures 1B–D), robustness is more similar with and without augmentation. This result demonstrates the importance of using augmentation as an explicit bias mitigation method.

Although the use of augmentation already increased the accuracy by almost 40 percentage points, spectral decoupling improves the accuracy by a further 4.6 percentage points with the most severe data distribution shift (Figure 1A). The increase in accuracy is more pronounced with blurring, 12.4 percentage points (Figure 1B), and with eosin stain intensity reduction, where networks trained with spectral decoupling achieve 1.2 to 8.5 percentage points higher accuracy with a 0.9 to 0.0 multiplier (Figure 1D). These data distribution shifts are not included as possible transformations in UniformAugment, and thus are not explicitly controlled. With hematoxylin stain intensity reduction, all networks degrade similarly in performance (Figure 1C). These results show that spectral decoupling is able to significantly complement and improve upon augmentation, as well as improve robustness to data distribution shifts that are not explicitly controlled by augmentation.

Prostate cancer detection

To assess whether the results of the simulation experiments translate into improvements in real-world datasets, we train networks with and without spectral decoupling to detect prostate cancer on H&E stained whole slide images of the prostate. These networks are then evaluated on four different datasets described in Prostate dataset. The results are presented in Figure 2. Networks trained with spectral decoupling show higher performance on all evaluation datasets. The difference between weight decay and spectral decoupling becomes more pronounced as we move further away from the training dataset distribution, ending with a 9.5 percentage point increase in accuracy over weight decay on the dataset from a different medical center. The reported performances are not comparable between evaluation datasets, as each dataset has been annotated with a different strategy and thus contains a different amount of label noise.

Figure 2

Neural network performance on evaluation datasets

(A–D) Each consecutive evaluation dataset moves further from the training data distribution. Networks trained with spectral decoupling improve accuracy by 0.35 (A), 1.0 (B), 3.6 (C) and 9.5 (D) percentage points over weight decay. All networks are trained with UniformAugment.

To further explore why networks trained without spectral decoupling fail to generalize to the dataset from Radboud University Medical Center (Figure 2D), the robustness to H&E stain intensities is explored in Figures 3A and 3B. Networks trained with spectral decoupling are less sensitive to reductions in both hematoxylin and eosin stain intensity and, interestingly, networks trained with weight decay actually increase in accuracy when the eosin stain intensity is reduced. This indicates that the difference between spectral decoupling and weight decay performance in Figure 2D may be partly because of differences in the stain intensities between the two medical centers. To explore this possibility, the stain intensities of the external dataset are normalized with the Macenko method (Macenko et al., 2009) to match the training data stain intensities, and the resulting performance increases are reported in Figure 3C. Networks trained with either spectral decoupling or weight decay both benefit from stain normalization. Stain normalization is especially beneficial for networks trained with weight decay, where the mean network accuracy is increased by 7.5 percentage points. Networks trained with spectral decoupling still perform better than networks trained with weight decay coupled with stain normalization. These results demonstrate that spectral decoupling can complement or even replace normalization methods, with negligible computational requirements (Figure 3D).

Figure 3

Spectral decoupling can complement or even replace computationally heavy stain normalization methods

(A and B) Robustness to data distribution shifts, on the external dataset, caused by hematoxylin (A) or eosin (B) stain intensity reduction.

(C) Network accuracy increases when normalizing H&E stain intensities with the Macenko method.

(D) Comparison of the computational requirements between spectral decoupling and the Macenko method. The images-per-second estimate for spectral decoupling is calculated with Equation 1 applied to a matrix of logits, and Macenko stain normalization is performed on resized images.

COVID-19 detection

To assess whether spectral decoupling can help in real-world situations with strong dominant features and spurious correlations, we train five networks with and without spectral decoupling to detect COVID-19 positive patients in chest radiographs. Two different training datasets are used to train the networks, and all networks are evaluated on the same external validation set, described in COVID-19 dataset. We first train neural networks with the BIMCV dataset, which represents an ideal situation where both the positive and negative samples originate from similar sources. Second, we train networks with the combined PadChest and BIMCV dataset. This dataset represents a situation where the network can easily achieve high performance by only learning to detect where a sample originates, as most of the negative samples come from a single medical center.

After training all networks, the predictions from each network are averaged to obtain ensemble predictions for both weight decay and spectral decoupling. ROC curves for ensemble predictions are presented in Figure 4, with bootstrapped 95% confidence intervals (CIs) for each area under the ROC curve (AUROC) value. Networks trained with spectral decoupling achieve significantly higher AUROC values for both the BIMCV and the combined PadChest and BIMCV training datasets (DeLong's test). On the BIMCV dataset, spectral decoupling and weight decay achieve AUROCs of 0.812 (95% CI: 0.802–0.822) and 0.778 (95% CI: 0.767–0.788), respectively. With the combined PadChest and BIMCV dataset, spectral decoupling and weight decay achieve AUROCs of 0.747 (95% CI: 0.736–0.757) and 0.711 (95% CI: 0.700–0.723), respectively.
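The evaluation steps described here (averaging per-network predictions into an ensemble, computing the AUROC, and bootstrapping a 95% CI) can be sketched as follows. This is an illustrative sketch rather than the authors' evaluation code: the number of bootstrap resamples (`n_boot`), the percentile method, and the seed are assumptions, since the source does not preserve these details.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_mean(per_network_scores):
    """Average the predictions of several networks sample by sample."""
    return [sum(s) / len(s) for s in zip(*per_network_scores)]


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # the resample drew a single class, so AUROC is undefined
        stats.append(auroc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical predictions from two networks on six samples.
labels = [0, 0, 1, 1, 0, 1]
network_a = [0.1, 0.4, 0.3, 0.9, 0.2, 0.7]
network_b = [0.2, 0.5, 0.4, 0.7, 0.1, 0.8]
ensemble = ensemble_mean([network_a, network_b])
ci_low, ci_high = bootstrap_ci(labels, ensemble, n_boot=200)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable for datasets of the size used in the study.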

Figure 4

Receiver operating characteristic (ROC) curves for COVID-19 detection

Inset values indicate the areas under the ROC (AUROC) values and bootstrapped 95% CIs. Networks trained with spectral decoupling achieve significantly higher AUROC values compared to networks trained with weight decay.

When training networks with the combined PadChest and BIMCV dataset, AUROC values of networks trained with either method decrease, although the number of training samples is increased over 10-fold. The decrease in AUROC is similar for both methods (0.065 and 0.067). This indicates that spectral decoupling is unable to mitigate bias in the combined dataset. As most of the negative samples originate from a single medical center, shortcut learning seems to happen even though spectral decoupling encourages the network to learn more features. Detecting where a sample originates is especially easy with radiographs because of systematic differences between data repositories and medical centers, which could be exploited by a neural network (DeGrave et al., 2021). Thus, the higher AUROC value of spectral decoupling is more likely because of increased robustness to data distribution shifts than avoidance of shortcut learning.

Discussion

Generalization performance has been identified as the main challenge standing in the way of true clinical adoption of a neural network (van der Laak et al., 2021). Van der Laak et al. (2021) argue that there is a need for public datasets which are truly representative of clinical practice. Although this is indeed important, we argue that training datasets, no matter how large, will never account for all possible variations caused by differences in imaging equipment, sample preparation, and patient populations. Thus, it is crucial to couple extensive multisource datasets with explicit and implicit bias mitigation methods to train neural networks which are robust to unseen variations.

Two explicit methods of bias mitigation have been proposed for medical imaging. Augmentation of the training samples is crucial as it substantially increases robustness to distribution shifts from the training data caused by differences in imaging equipment or sample preparation (Figure 1; Tellez et al., 2019). Thus, it is strongly recommended to use extensive augmentation strategies for training neural networks intended for clinical practice. Normalization of all images to a common standard would substantially reduce the distribution shifts (de Bel et al., 2019; Janowczyk et al., 2017; Swiderska-Chadaj et al., 2020), but comes with a considerable computational cost (Figure 3D). Both methods address important problems and should be complementary to any implicit methods of bias control.

Spectral decoupling is, to our knowledge, the first implicit bias mitigation method for addressing the generalization issues in neural networks. The method is complementary to augmentation, increasing the robustness to distribution shifts already addressed with augmentation (Figure 1A). Above all, spectral decoupling significantly increases the robustness to distribution shifts not addressed by augmentation (Figure 1B) and could be used to replace computationally expensive stain normalization methods (Figure 3C). By encouraging the neural network to learn more features, spectral decoupling can also help in situations where the training dataset contains strong dominant features or spurious correlations (Table 1). This is crucial as the dominant features can also be inherent to the data, such as different cancer types. For example, with prostate cancer, different Gleason grades (Epstein et al., 2016) are often unbalanced in the training set. Owing to gradient starvation (des Combes et al., 2018), the features of the underrepresented Gleason grades may not be learned by the neural network. Balancing the dataset, so that all Gleason grades are represented equally, is not easy or even desired as the grading is based on a continuous range of histological patterns.

In COVID-19 detection, the networks' performance decreased similarly for both weight decay and spectral decoupling (Figure 4) when training the networks on the combined BIMCV and PadChest dataset. Radiographs contain systematic differences between data repositories and medical centers, such as laterality tokens and differences in the radiopacity of the image borders, which could arise from variations in patient position, radiographic projection or image processing (DeGrave et al., 2021). These differences can be easily leveraged by neural networks to detect where a single radiograph originates. We speculate that spectral decoupling was unable to prevent shortcut learning because of the ease of shortcut learning in the combined PadChest and BIMCV dataset. In addition, our results showing the ability to prevent shortcut learning (Table 1) were obtained after considerable hyper-parameter optimization, and no significant differences could be seen in the class activation maps between networks trained with either weight decay or spectral decoupling. Thus, removal of any obvious superficial correlations from the training dataset is crucial as there seems to be a limit to how much spectral decoupling can help with dominating features and spurious correlations.

The advantages of spectral decoupling can be clearly seen when the network is evaluated with out-of-distribution samples (Figures 1, 2, and 4). Neural networks trained with spectral decoupling retain their performance with samples further from the training data distribution, which is exactly what is required from neural networks intended for clinical practice (van der Laak et al., 2021). Although using an external dataset may not reveal all generalization problems, it is clear that without spectral decoupling the neural networks fail to generalize to this particular external dataset from Radboud University Medical Center (Figures 2D and 3). Even in COVID-19 detection, where spectral decoupling seems to fail in preventing shortcut learning, the performance of the network is significantly increased over the state-of-the-art.

Conclusions

Spectral decoupling is the first implicit bias mitigation method for training neural networks to be used across multiple medical centers. The method adds negligible computational cost, is easy to implement, and complements and improves upon explicit bias mitigation methods. Our results support the use of spectral decoupling in all neural networks intended for clinical use.

Limitations of the study

Spectral decoupling is shown, by a simulation experiment, to offer an important level of control over the features that are learned in the ‘dominant features’ section. Despite this, spectral decoupling is unable to prevent shortcut learning as described in the COVID-19 detection section. We speculate this was because of the ease of shortcut learning in the training dataset, as mentioned in the discussion section. It is also possible spectral decoupling achieves significantly higher performance solely because of increased robustness to data distribution shifts and not also through the prevention of shortcut learning.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Joona Pohjonen (joona.pohjonen@helsinki.fi).

Materials availability

This study did not generate new unique reagents.

Method details

Spectral decoupling

In spectral decoupling, the network is regularised by imposing an L2 penalty on the unnormalised outputs of the last layer of the network, or logits $\hat{y}$, which is then added to the cross-entropy loss $\mathcal{L}_{CE}$. This penalty provably (Pezeshki et al., 2020) avoids the conditions leading to gradient starvation in networks trained with cross-entropy loss. Two variants of the penalty are defined as

$$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, y) + \frac{\lambda}{2} \lVert \hat{y} \rVert_2^2 \qquad (1)$$

$$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, y) + \frac{\lambda}{2} \lVert \hat{y} - \gamma \rVert_2^2 \qquad (2)$$

For Equation 1, there is a single tunable hyper-parameter $\lambda$. For Equation 2, hyper-parameters $\lambda$ and $\gamma$ are tuned separately for each class, a total of four hyper-parameters for the binary classification task in our study. PyTorch-style pseudo-code for implementing Equation 1 is presented in a separate figure.

A simple grid search is used to optimize the hyper-parameters in the Robustness, Prostate cancer detection, and COVID-19 detection experiments. Bayesian optimisation is used in the Dominant features experiment. The grid search runs over limited search spaces for $\lambda$ and $\gamma$. Hyper-parameter optimization is done on the validation split, except for Equation 2 in the Dominant features experiment, where we perform optimization directly on the test split. A single value of $\lambda$ is tuned for Equation 1, and separate sets of values for Equation 2 are tuned for the Dominant features and COVID-19 detection experiments.
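To make the two penalty variants concrete, the following is a minimal dependency-free sketch, not the authors' PyTorch pseudo-code. It assumes that the penalty is summed over the logits of each sample and averaged over the batch, and that the per-class hyper-parameters of Equation 2 are selected by each sample's target class; both choices are assumptions, and the function names are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample computed from unnormalised logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def sd_penalty(logits, lam, gamma=0.0):
    """(lam / 2) * ||logits - gamma||_2^2; gamma = 0 gives the Equation 1 penalty."""
    return 0.5 * lam * sum((z - gamma) ** 2 for z in logits)


def sd_loss(batch_logits, targets, lam, gamma=0.0):
    """Mean over the batch of cross-entropy plus the spectral decoupling penalty.

    lam and gamma are either floats (Equation 1 style) or per-class sequences
    indexed by the target class (an assumed reading of Equation 2).
    """
    total = 0.0
    for logits, target in zip(batch_logits, targets):
        lam_t = lam[target] if isinstance(lam, (list, tuple)) else lam
        gamma_t = gamma[target] if isinstance(gamma, (list, tuple)) else gamma
        total += cross_entropy(logits, target) + sd_penalty(logits, lam_t, gamma_t)
    return total / len(batch_logits)


# Hypothetical logits for a batch of two samples in a binary task.
batch = [[2.0, -1.0], [0.5, 1.5]]
targets = [0, 1]
loss_eq1 = sd_loss(batch, targets, lam=0.01)
loss_eq2 = sd_loss(batch, targets, lam=[0.01, 0.02], gamma=[1.0, 0.5])
```

Because the penalty acts on the logits rather than on the weights, it leaves the network architecture untouched and only adds a few arithmetic operations per batch, which is consistent with the negligible computational requirements reported for the method.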

Prostate dataset

A total of 30 prostate cancer patient cases are annotated for classification into cancerous and benign tissue, where the cancerous areas were annotated in consensus by two observers (C.S., T.M.). All patients have undergone radical prostatectomy at the Helsinki University Hospital between 2014 and 2015. Each case contains 14 to 21 tissue section slides of the prostate. Tissue sections have a thickness of 4 μm and were stained with hematoxylin and eosin in a clinical-grade laboratory at the Helsinki University Hospital Diagnostic Center, Department of Pathology. Two different scanners are used to obtain images of the tissue section slides at 20x magnification. Larger macro slides (whole-mount, 2 × 3 inch slides) are scanned with an Axio Scan Z.1 scanner (Zeiss, Oberkochen, Germany), and the normal size slides with a Pannoramic Flash III 250 scanner (3DHistech, Budapest, Hungary). From the 30 patient cases, five are set aside for a test set and four are used as a validation set during training and hyper-parameter tuning. The test set is further divided based on the scanner used to obtain the images. Digital slide images are cut into tiles with 20% overlap, resulting in 4.7 million tiles with 10% containing cancerous tissue.

To test the differences between cohorts from the same medical centre, another set of 60 prostate cancer patient cases are annotated into cancerous and benign tissue by one of six experienced pathologists. All patients have undergone radical prostatectomy at the Helsinki University Hospital between 2019 and 2020. Each case contains 10 to 21 normal and macro tissue section slides of the prostate. Tissue sections have a thickness of 4 μm and are also stained with hematoxylin and eosin in a clinical-grade laboratory at the Helsinki University Hospital Diagnostic Center, Department of Pathology. All slides are scanned with an Axio Scan Z.1 scanner (Zeiss, Oberkochen, Germany). Digital slide images are cut into tiles with 20% overlap, resulting in 13.1 million tiles with 16% containing cancerous tissue.

For external validation, a freely available prostate cancer dataset is used, containing tissue section slides from patients who have undergone a radical prostatectomy at the Radboud University Medical Center between 2006 and 2011 (Bulten et al., 2018, 2019). The dataset contains images annotated by a uropathologist as either cancerous or benign. Images are scanned with a Pannoramic Flash II 250 scanner (3DHistech, Budapest, Hungary) at 20x magnification but later reduced to 10x magnification. These images are cut into tiles with 20% overlap, resulting in 5,655 tiles with 45% containing cancerous tissue. All digital slide images are cut and processed with HistoPrep (Pohjonen, 2021). A summary of the prostate datasets is presented in the Prostate datasets table.
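Cutting an image into overlapping tiles amounts to stepping a window with a stride smaller than the tile width. The sketch below shows one way to compute tile positions for a 20% overlap. It is not HistoPrep's API; the function name, the example image dimensions, and the tile size of 250 pixels are hypothetical, and tiles that would extend past the image border are simply dropped, which is an assumption.

```python
def tile_coordinates(width, height, tile_size, overlap=0.2):
    """Top-left (x, y) corners of square tiles covering an image with fractional overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # With 20% overlap, consecutive tiles are 80% of a tile width apart.
    stride = max(1, round(tile_size * (1 - overlap)))
    xs = range(0, max(width - tile_size, 0) + 1, stride)
    ys = range(0, max(height - tile_size, 0) + 1, stride)
    return [(x, y) for y in ys for x in xs]


# Hypothetical image of 1000 x 500 pixels cut into 250-pixel tiles.
coords = tile_coordinates(width=1000, height=500, tile_size=250, overlap=0.2)
```

With these example values the stride is 200 pixels, giving four tile columns and two tile rows.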

COVID-19 dataset

For COVID-19 detection, we use large open-access repositories of chest radiographs. The COVIDx8 dataset is compiled from five different open-source repositories and contains radiographs from over 15,000 patient cases from at least 51 countries, with over 1,500 COVID-19 positive patient cases (Chowdhury et al., 2020; Cohen et al., 2020; Rahman et al., 2021; Tsai et al., 2021; Wang et al., 2020). The BIMCV dataset (iteration 2) contains, after exclusions, 3,033 positive and 2,743 negative COVID-19 patient cases and 9,171 radiographs, collected from the same medical centres during the same time period (De La Iglesia Vayá et al., 2020). Only PA and upright AP radiographs (Cohen et al., 2020) with windowing information were selected from the BIMCV dataset. The PadChest dataset contains over 67,000 COVID-19 negative patient cases and 114,227 radiographs from a single medical centre in Valencia, Spain (Bustos et al., 2020). Nineteen corrupted images were excluded from the PadChest dataset. The COVIDx8 dataset is reserved as an external dataset, and two training datasets are compiled: one using only the BIMCV dataset and one combining the PadChest and BIMCV datasets. In both training datasets, 5% of the data is set aside for validation.

Simulation datasets

Two simulation experiments are used to more closely investigate the utility of spectral decoupling as an implicit bias mitigation method. For both experiments, the dataset from Helsinki University Hospital described in Prostate dataset is modified in specific ways.

Cutout simulation dataset

A dominant feature present in a real-world dataset could be, for example, a biological marker, a certain cancer type or a scanner artefact. To represent these kinds of features, 16 cutouts are added to the images. An example of the cutout operation is shown in a figure, with a benign sample on the left and the same sample with 16 cutouts added on the right.

For the experiment, 200,000 images are selected for the training set with an equal number of samples with cancerous and benign annotations. For the training set, cutouts are added to 25% of the benign samples and 2.5% of the cancerous samples. This makes the presence of cutouts in the image spuriously correlated with a benign annotation. If the network overfits this correlation, cancerous samples with cutouts may be classified as benign. Thus, for the test set, cutouts are added to all cancerous samples and none of the benign samples. For a control training set, cutouts are added to all images. Networks trained with this dataset provide a reference point for the performance with cutouts but without the spurious correlation.
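The construction of the spurious correlation can be sketched as follows. The cutout size in pixels, the fill value, the random placement, and both function names are assumptions made for illustration; only the number of cutouts (16) and the class-dependent rates (25% and 2.5%) come from the text.

```python
import random


def add_cutouts(image, n_cutouts=16, size=4, rng=None, fill=0):
    """Return a copy of a 2-D image (list of rows) with square patches set to `fill`."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched
    for _ in range(n_cutouts):
        top = rng.randrange(height - size + 1)
        left = rng.randrange(width - size + 1)
        for r in range(top, top + size):
            for c in range(left, left + size):
                out[r][c] = fill
    return out


def assign_cutouts(labels, p_benign=0.25, p_cancer=0.025, seed=0):
    """Decide per training sample whether it receives cutouts (0 = benign, 1 = cancerous)."""
    rng = random.Random(seed)
    return [rng.random() < (p_cancer if y == 1 else p_benign) for y in labels]


# Hypothetical balanced training labels: cutouts become correlated with "benign".
labels = [0] * 10000 + [1] * 10000
flags = assign_cutouts(labels)
benign_rate = sum(flags[:10000]) / 10000
cancer_rate = sum(flags[10000:]) / 10000

# Cutouts applied to a toy 32 x 32 image of ones.
original = [[1] * 32 for _ in range(32)]
patched = add_cutouts(original, rng=random.Random(0))
```

For the test set described above, the same `add_cutouts` call would be applied to every cancerous sample and to no benign sample, so a network relying on the cutouts would misclassify the cancerous samples.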

Robustness simulation dataset

Shifts from the training data distribution are common when evaluating a neural network with datasets from different medical centres. Small changes in the images due to differences in, for example, sample preparation or imaging equipment can cause such shifts. We assess the networks' robustness to these data distribution shifts by applying transformations with increasing magnitudes to the images in the test set. Image sharpness and stain intensity were selected to represent possible dataset shifts caused by differences in the scanner used and in sample preparation, respectively.

The UniformAugment augmentation strategy consists of applying random transformations with a uniformly sampled magnitude to the images before feeding them to the network (LingChen et al., 2020). Sharpening the image is included in the set of possible transformations (Cubuk et al., 2019), meaning that the network sees sharpened images during training. Thus, the data distribution shift caused by sharpening images is explicitly mitigated, which should help the network to predict correct labels for evaluation images with higher sharpness. Blurring the image is not included in the set of possible transformations (Cubuk et al., 2019), meaning that the network will not see randomly blurred images during training. Thus, the data distribution shift caused by blurring the images is not explicitly mitigated, and the use of UniformAugment should not directly help the network with blurry evaluation images. By evaluating the network with increasingly sharpened or blurred images, it is possible to assess whether spectral decoupling can improve upon situations where the data distribution shift is, and is not, explicitly addressed. Additionally, there are large differences in the sharpness values of real-world datasets from different medical centres and scanners (Figure).

Figure. Kernel density estimation of the variance of the images after a Laplace transformation. A higher variance indicates a sharper image. The image is generated from the preprocessing metrics calculated by HistoPrep (Pohjonen, 2021).

Step-wise blurring is achieved by simple averaging with a kernel, where . A sharpened version of the image is created by applying kernel to the original image . Sharpness is then gradually increased by creating a new image with where defines the amount of sharpness increase.

To assess the data distribution shifts caused by differences in sample preparation, the intensities of the haematoxylin and eosin stains are computationally modified. Haematoxylin highlights cell nuclei, and eosin highlights cytoplasm, connective tissue and muscle. The stain intensities depend on multiple steps in the staining process, and thus the final colour distribution of the slide images varies considerably (Tellez et al., 2019). The stain intensity modification is achieved by first separating the haematoxylin and eosin stains with the Macenko method (Macenko et al., 2009). The concentration of each stain can then be reduced by multiplication with a value between 0 and 1 before the stains are combined back together. An example of the method is shown in Figure.

Figure. Separation of the haematoxylin and eosin stains with the Macenko method (Macenko et al., 2009).
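The sharpness measure and the step-wise blurring described above can be illustrated with a minimal NumPy sketch. The exact kernels and the sharpening formula are not recoverable from this text, so everything below is an assumption made for illustration: `box_blur` uses a plain k × k averaging window, `laplacian_variance` uses a 4-neighbour discrete Laplacian (HistoPrep may compute this differently), and `unsharp_sharpen` is a generic unsharp mask standing in for the authors' sharpening step. All three function names are hypothetical.

```python
import numpy as np


def box_blur(image, k):
    """Average every pixel over a k x k window (k odd, edges replicated)."""
    pad = k // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    height, width = image.shape
    out = np.zeros((height, width), dtype=float)
    # Sum the k * k shifted copies of the image, then divide by the window size.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (k * k)


def laplacian_variance(image):
    """Variance of a 4-neighbour discrete Laplacian; higher means sharper."""
    img = image.astype(float)
    lap = (
        img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
        - 4.0 * img[1:-1, 1:-1]
    )
    return lap.var()


def unsharp_sharpen(image, amount):
    """Assumed stand-in for the sharpening step: add back the detail removed by a 3 x 3 blur."""
    img = image.astype(float)
    return img + amount * (img - box_blur(img, 3))


# Simulate an increasing distribution shift on a synthetic greyscale tile.
rng = np.random.default_rng(0)
tile = rng.uniform(0, 255, size=(64, 64))
for k in (1, 3, 5):
    print("blur kernel", k, "sharpness", laplacian_variance(box_blur(tile, k)))
for amount in (0.5, 1.0):
    print("sharpen amount", amount, "sharpness", laplacian_variance(unsharp_sharpen(tile, amount)))
```

Larger averaging windows should lower the Laplacian variance and a larger sharpening amount should raise it, which is the kind of graded shift the robustness experiment applies to the test images.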

Quantification and statistical analysis

An EfficientNet-b0 network (Tan and Le, 2019), with dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016) of 20% and an input size of , is used as a prostate cancer classifier for all experiments. For augmentation, the input images are randomly cropped and flipped, resized, and then transformed with UniformAugment (LingChen et al., 2020), using a maximum of two transformations. Each network is trained for 90 epochs, with a learning rate of and cosine scheduling. Weight decay of 0.0001 is used for networks trained without spectral decoupling. When training neural networks with spectral decoupling, weight decay is disabled.

For COVID-19 detection, we replicate the training regimen of DeGrave et al. (2021), where a DenseNet-121 network (Huang et al., 2018) is pre-trained with the ImageNet dataset and then fine-tuned for 30 epochs as a binary COVID-19 classifier. All hyper-parameters, other than those of spectral decoupling, are set to the values reported in that paper. Training and validation curves for the trained networks are shown in Figure S1. For spectral decoupling, Equation 2 is used for the first simulation experiment on dominant features (Section 2.1) and for COVID-19 detection (Section 2.4). Equation 1 is used for all other experiments (Sections 2.2 and 2.3).

Each experiment is repeated five times and the summary metrics for these runs are reported. All reported performance metrics are balanced between the classes when necessary, and a cut-off value of 0.5 is used to obtain a binary label from the normalised predictions of the network. To compare paired receiver operating characteristic (ROC) curves, we use the one-tailed DeLong's test and report the Z-values and p-values (DeLong et al., 1988). PyTorch (version 1.8) (Paszke et al., 2019) is used for training the neural networks, timm (version 0.1.8) (Wightman, 2019) for building the neural networks and albumentations (version 0.5.1) (Buslaev et al., 2020) for image augmentations.
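As an illustration of the evaluation protocol described above, the following sketch binarises normalised predictions at the 0.5 cut-off and computes a class-balanced accuracy as the mean of sensitivity and specificity. The function name `balanced_accuracy`, the use of a strict `>` comparison at the cut-off, and the choice of balanced accuracy as the example metric are assumptions made for this sketch; the text does not specify how each reported metric is balanced.

```python
import numpy as np


def balanced_accuracy(labels, probabilities, cutoff=0.5):
    """Mean of sensitivity and specificity after thresholding the predictions."""
    labels = np.asarray(labels).astype(bool)
    # Assumption: a prediction strictly above the cut-off counts as positive.
    predicted = np.asarray(probabilities, dtype=float) > cutoff
    sensitivity = (predicted & labels).sum() / labels.sum()
    specificity = (~predicted & ~labels).sum() / (~labels).sum()
    return 0.5 * (sensitivity + specificity)


# Toy example with two positive and four negative samples.
labels = [1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.4, 0.1, 0.2, 0.6, 0.3]
print(balanced_accuracy(labels, probabilities))
```

Averaging the per-class rates keeps the majority class from dominating the score, which matters when cancerous and benign tiles, or positive and negative radiographs, are unevenly represented in a test set.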

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

PESO	Bulten et al. (2018, 2019)	N/A
COVIDx8	Cohen et al. (2020), Wang et al. (2020), Tsai et al. (2021), Rahman et al. (2021), Chowdhury et al. (2020)	N/A
PadChest	Bustos et al. (2020)	N/A
BIMCV+/−	De La Iglesia Vayá et al. (2020)	N/A
Helsinki University Hospital (2014–2015)	This paper (not shared)	N/A
Helsinki University Hospital (2019–2020)	This paper (not shared)	N/A

Software and Algorithms

PyTorch	https://pytorch.org	1.8
Albumentations	https://github.com/albumentations-team/albumentations	0.5.1
PyTorch image models	https://github.com/rwightman/pytorch-image-models	0.1.8
Python	https://www.python.org	3.8

Prostate datasets

Center	Scanner	Slides	Tiles	Train data	Test data
Helsinki University Hospital	Pannoramic Flash III 250	Normal	1.0 million	Dominant features	Prostate cancer detection
	Axio Scan Z.1	Macro	3.7 million	Robustness and Prostate cancer detection	Robustness and Prostate cancer detection
	Axio Scan Z.1	Macro	13.1 million	–	Prostate cancer detection
Radboud University Medical Center	Pannoramic Flash II 250	Both	5,655	–	Prostate cancer detection

14 in total

1. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology.

Authors: David Tellez; Geert Litjens; Péter Bándi; Wouter Bulten; John-Melle Bokhorst; Francesco Ciompi; Jeroen van der Laak
Journal: Med Image Anal Date: 2019-08-21 Impact factor: 8.545

2. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System.

Authors: Jonathan I Epstein; Lars Egevad; Mahul B Amin; Brett Delahunt; John R Srigley; Peter A Humphrey
Journal: Am J Surg Pathol Date: 2016-02 Impact factor: 6.394

3. PadChest: A large chest x-ray image dataset with multi-label annotated reports.

Authors: Aurelia Bustos; Antonio Pertusa; Jose-Maria Salinas; Maria de la Iglesia-Vayá
Journal: Med Image Anal Date: 2020-08-20 Impact factor: 8.545

4. Stain Normalization using Sparse AutoEncoders (StaNoSA): Application to digital pathology.

Authors: Andrew Janowczyk; Ajay Basavanhally; Anant Madabhushi
Journal: Comput Med Imaging Graph Date: 2016-05-16 Impact factor: 4.790

5. Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition.

Authors: Julia K Winkler; Christine Fink; Ferdinand Toberer; Alexander Enk; Teresa Deinlein; Rainer Hofmann-Wellenhof; Luc Thomas; Aimilios Lallas; Andreas Blum; Wilhelm Stolz; Holger A Haenssle
Journal: JAMA Dermatol Date: 2019-10-01 Impact factor: 10.282

6. The RSNA International COVID-19 Open Radiology Database (RICORD).

Authors: Emily B Tsai; Scott Simpson; Matthew P Lungren; Michelle Hershman; Leonid Roshkovan; Errol Colak; Bradley J Erickson; George Shih; Anouk Stein; Jayashree Kalpathy-Cramer; Jody Shen; Mona Hafez; Susan John; Prabhakar Rajiah; Brian P Pogatchnik; John Mongan; Emre Altinmakas; Erik R Ranschaert; Felipe C Kitamura; Laurens Topff; Linda Moy; Jeffrey P Kanne; Carol C Wu
Journal: Radiology Date: 2021-01-05 Impact factor: 11.105

7. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.

Authors: John R Zech; Marcus A Badgeley; Manway Liu; Anthony B Costa; Joseph J Titano; Eric Karl Oermann
Journal: PLoS Med Date: 2018-11-06 Impact factor: 11.069

8. Epithelium segmentation using deep learning in H&E-stained prostate specimens with immunohistochemistry as reference standard.

Authors: Wouter Bulten; Péter Bándi; Jeffrey Hoven; Rob van de Loo; Johannes Lotz; Nick Weiss; Jeroen van der Laak; Bram van Ginneken; Christina Hulsbergen-van de Kaa; Geert Litjens
Journal: Sci Rep Date: 2019-01-29 Impact factor: 4.379

9. Impact of rescanning and normalization on convolutional neural network performance in multi-center, whole-slide classification of prostate cancer.

Authors: Zaneta Swiderska-Chadaj; Thomas de Bel; Lionel Blanchet; Alexi Baidoshvili; Dirk Vossen; Jeroen van der Laak; Geert Litjens
Journal: Sci Rep Date: 2020-09-01 Impact factor: 4.379

10. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.

Authors: Linda Wang; Zhong Qiu Lin; Alexander Wong
Journal: Sci Rep Date: 2020-11-11 Impact factor: 4.379