
Dealing with distribution mismatch in semi-supervised deep learning for COVID-19 detection using chest X-ray images: A novel approach using feature densities.

Saul Calderon-Ramirez1,2, Shengxiang Yang1, David Elizondo1, Armaghan Moemeni3.   

Abstract

In the context of the global coronavirus pandemic, different deep learning solutions for infected subject detection using chest X-ray images have been proposed. However, deep learning models usually need large labelled datasets to be effective. Semi-supervised deep learning is an attractive alternative, where unlabelled data is leveraged to improve the overall model's accuracy. However, in real-world usage settings, an unlabelled dataset might present a different distribution than the labelled dataset (i.e. the labelled dataset was sampled from a target clinic and the unlabelled dataset from a source clinic). This results in a distribution mismatch between the unlabelled and labelled datasets. In this work, we assess the impact of the distribution mismatch between the labelled and the unlabelled datasets, for a semi-supervised model trained with chest X-ray images, for COVID-19 detection. Under strong distribution mismatch conditions, we found an accuracy hit of almost 30%, suggesting that the unlabelled dataset distribution has a strong influence on the behaviour of the model. Therefore, we propose a straightforward approach to diminish the impact of such distribution mismatch. Our proposed method uses a density approximation of the feature space. It is built upon the target dataset to filter out the observations in the source unlabelled dataset that might harm the accuracy of the semi-supervised model. It assumes that a small labelled target dataset is available together with a larger source unlabelled dataset. Our proposed method does not require any model training; it is simple and computationally cheap. We compare our proposed method against two popular state of the art out-of-distribution data detectors, which are also cheap and simple to implement. In our tests, our method yielded accuracy gains of up to 32%, when compared to the previous state of the art methods.
The good results yielded by our method lead us to argue in favour of a more data-centric approach to improving a model's accuracy. Furthermore, the developed method can be used to measure data effectiveness for semi-supervised deep learning model training.
© 2022 Elsevier B.V. All rights reserved.

Keywords:  Chest X-ray; Computer aided diagnosis; Covid-19; Distribution mismatch; MixMatch; Out of distribution detection; Semi-supervised deep learning

Year:  2022        PMID: 35573166      PMCID: PMC9085448          DOI: 10.1016/j.asoc.2022.108983

Source DB:  PubMed          Journal:  Appl Soft Comput        ISSN: 1568-4946            Impact factor:   8.263


Introduction

The COVID-19 disease is caused by the novel SARS-CoV2 coronavirus, discovered in 2019 [1]. The COVID-19 pandemic has caused thousands of human losses around the world, where even the most developed health systems have not been able to cope with the infection peaks [1]. Health practitioners are struggling with the detection and tracking of infected subjects, as the number of patients in need of medical assistance increases. Therefore, accurately detecting patients infected with the SARS-CoV2 virus is a critical task to control the pandemic. Nevertheless, SARS-CoV2 detection methods like the Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) test can be expensive and time consuming. As an alternative and/or complementary method, the usage of medical imaging based approaches can be less expensive while remaining accurate [2], [3]. Moreover, X-ray based imaging diagnosis can be considered cheaper, as the usage of X-ray machines is more widespread when compared to other imaging technologies like computed tomography. This is especially the case in less industrialized countries [4]. However, a limitation of X-ray based diagnosis of COVID-19 is the need for highly trained clinical practitioners like radiologists, who are scarce in less industrialized countries [4]. The implementation of Computer Aided Diagnosis (CAD) systems for COVID-19 diagnosis can be a solution to mitigate the specialized staff shortage. Deep learning based CAD systems have been extensively explored for different medical imaging applications [5], [6], [7]. More specifically, several deep learning architectures for COVID-19 detection have been proposed recently in the literature [8], [9], [10]. These systems have been developed using publicly available X-ray image datasets, with COVID-19 positive [11] and negative cases [12].
Nevertheless, a shortcoming of implementing a deep learning architecture for real-world usage is the need for a large labelled dataset from the specific target clinic or hospital where the system is intended to be used. Labelling images in the medical domain is time-consuming and requires expensive human effort from highly trained clinical practitioners, which makes building an extensive labelled dataset costly. Previous work on COVID-19 detection with deep learning has relied on large and heterogeneous datasets, where around 100–400 COVID-19 positive cases were sampled from the dataset in [11], and larger sets of COVID-19 negative cases were sampled from different sources [13], [14], [15]. Such testing conditions can be considered far from a real-world scenario, where usually only a limited set of labelled observations is available in the target clinic/hospital. Using external datasets for training might harm the overall performance of the model, mainly due to the differences in patient features and imaging protocols, which affect the final data distribution between the test and training data [16]. Another shortcoming of the aforementioned previous work is the population bias between the positive and negative COVID-19 samples. For example, as reported in [17], negative COVID-19 observations in [13] were sampled from pediatric Chinese patients, while positive COVID-19 cases in [11] correspond to adult patients from different countries. This dataset combination has been extensively used for training Convolutional Neural Network (CNN) based models to detect COVID-19, and leads to a deceptive bias in both the test and training data [17]. To deal with limited labelled datasets, different approaches have been implemented in the literature [18]. In the context of COVID-19 detection, data augmentation and transfer learning [19], [20] have been used.
In transfer learning, a source labelled dataset is used to pre-train a model, which is then fine-tuned on the target dataset. However, as discussed in [21], fine-tuning might not be enough to improve the model's accuracy. The distribution mismatch between the source and target datasets, due to different patient populations and image acquisition protocols, is frequently a reason for poor transfer learning performance. Another approach to deal with scarce labelled data is the usage of Semi-supervised Deep Learning (SSDL). SSDL leverages cheaper and more widely available unlabelled data. Semi-supervised learning for COVID-19 detection has been explored in [12], [22] with positive results, using very small labelled datasets. Such previous work suggests that using unlabelled data can increase the model's performance. The authors combined SSDL with common data augmentation and transfer learning approaches. However, to implement deep learning based solutions for extensive real-world usage, testing different model attributes like robustness and predictive uncertainty is crucial for safe usage. An in-depth review of the importance of measuring different model attributes like robustness in medical applications of Artificial Intelligence (AI) can be found in [23]. In a real-world scenario, the use of unlabelled data sampled from different sources (hospitals or clinics) can be considered. However, the usage of unlabelled datasets with different distributions from the labelled test and training target data might harm the accuracy of the model. This leads to the need of analysing model robustness to different data distributions in the unlabelled dataset. Therefore, in this work, we study the impact of different unlabelled data sources on a SSDL model. Specifically, the MixMatch algorithm is used, which previously yielded interesting accuracy gains with very small labelled datasets for COVID-19 detection using X-ray images [12], [22].
Moreover, we propose a simple approach to select and build an unlabelled dataset, aiming to improve the overall SSDL model accuracy.

Problem definition

In this work, we evaluate a setting where the following datasets are available: a small labelled dataset in the target clinic/hospital, i.e., the clinic/hospital where the model is intended to be deployed; and a larger unlabelled dataset from a different source clinic/hospital, with considerably more observations than the labelled dataset. Different deep learning applications in medical imaging face distribution mismatch situations between the different datasets used. This might be the case for SSDL, when using different unlabelled data sources. We argue that quantifying distribution mismatch with respect to the model behaviour is important for medical imaging applications, as different unlabelled data sources might be considered. Moreover, simple dataset transformation procedures to improve model robustness to the data distribution mismatch between the labelled and unlabelled datasets are also important. This helps to narrow the gap between machine learning research and its real-world usage. The first contribution of this work is to explore the impact of the distribution mismatch between the labelled and unlabelled datasets in SSDL in a real-world application: COVID-19 detection using chest X-ray images. We examine different distribution mismatch settings with data from the specific domain only (chest X-ray images), unlike classic testing benchmarks where distribution mismatch is caused by adding images from different domains. We explore the influence of using unlabelled data from different data sources within the same domain, and measure its impact on SSDL. The second contribution consists of two novel methods, built upon the feature space of a generic pre-trained CNN, to score unlabelled data according to its likelihood under the distribution of the labelled data. Such scores are used to filter out possibly harmful unlabelled data, improving the performance of the SSDL model trained with the filtered unlabelled data.
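The scoring-and-filtering pipeline just described can be sketched in a few lines. The snippet below is an illustrative sketch, not the paper's exact implementation: a plain Gaussian kernel density estimate over labelled features stands in for the density approximation, and all function names and parameters are ours.

```python
import numpy as np

def gaussian_kde_scores(labelled_feats, unlabelled_feats, bandwidth=1.0):
    """Score unlabelled observations by their density under the labelled
    dataset's feature distribution (higher score = more in-distribution)."""
    # Pairwise squared Euclidean distances: unlabelled rows vs labelled rows
    d2 = ((unlabelled_feats[:, None, :] - labelled_feats[None, :, :]) ** 2).sum(-1)
    # Average Gaussian kernel response over all labelled observations
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

def filter_unlabelled(labelled_feats, unlabelled_feats, keep_fraction=0.5):
    """Keep the indices of the unlabelled observations that are most likely
    under the labelled feature density, discarding the rest."""
    scores = gaussian_kde_scores(labelled_feats, unlabelled_feats)
    n_keep = int(len(scores) * keep_fraction)
    return np.argsort(scores)[::-1][:n_keep]   # highest-density first
```

In this sketch, `labelled_feats` and `unlabelled_feats` would be feature vectors extracted from a generic pre-trained CNN; the retained indices select the unlabelled subset handed to the SSDL trainer.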

Manuscript organization

This manuscript is organized as follows: Section 2 reviews recent literature on SSDL methods, and more specifically SSDL techniques designed to be robust to unlabelled data with a considerable distribution mismatch with respect to labelled data. In that section we also review Out of Distribution (OOD) detection techniques, as they are closely related to distribution mismatch robustness. Given the research gap described in Section 2, in Section 4 we propose our novel method to increase distribution mismatch robustness in a SSDL setting. We test our proposed method using the state of the art MixMatch algorithm [24]. The datasets used to create the different distribution mismatch settings tested throughout the experiments are described in Section 3. The detailed experimental design is presented in Section 5. An analysis of the yielded results and the initial observations is developed in Section 6, with conclusions and future work addressed in Section 7.

State of the art

Semi-supervised deep learning

SSDL aims to deal with small labelled datasets by leveraging unlabelled data. Supervised deep learning networks often require large labelled datasets. This is partially addressed with the usage of data augmentation and transfer learning [25]. However, the usage of cheaper and more widely available unlabelled data can further lower the need for labelled data. With a formal notation, in SSDL both a labelled dataset D_l and an unlabelled dataset D_u are used. Each labelled observation in D_l is mapped to a label in the set of classes. The unlabelled dataset D_u corresponds to a set of observations with no labels, usually with considerably more observations than D_l. SSDL architectures can be classified as: pre-training based [26], pseudo-label based [27] and regularization based. Within regularization based approaches, consistency loss, graph based and generative based [18] regularization techniques can be distinguished. Detailed surveys regarding SSDL can be found in [28], [29]. Concerning regularization based SSDL, a regularization term leveraging unlabelled data is implemented in the loss function: L(w) = L_l(w) + γ L_u(w), with w the model's weights array, and L_l and L_u the labelled and unlabelled loss terms, respectively. The coefficient γ weighs the influence of the unsupervised regularization. As previously mentioned, a number of regularization based variations can be found in the literature. The main ones include: consistency loss based [16], [30], graph based [31], [32] and generative augmentation based [33], [34]. Consistency based methods make the clustered-data/low-density separation assumption: observations belonging to the same class are clustered together, making the decision manifold lie in very sparse regions [28]. A violation of this assumption might degrade the performance of the semi-supervised method [28]. In pseudo-label training, pseudo-labels are estimated for unlabelled data and used for later model refinement. A straightforward pseudo-label based approach is based on co-training two models [35].
The model is pre-trained with the limited size labelled dataset. Later, the pseudo-labels are estimated for the unlabelled data using two models trained with different views (features) of the data. A voting scheme is implemented for estimating the pseudo-labels. MixMatch [24] combines both pseudo-label and consistency based SSDL, along with heavy data augmentation using the MixUp algorithm [36]. According to [24], MixMatch out-performs, accuracy wise, previous SSDL approaches. Given the recent state of the art performance demonstrated by MixMatch and the good results yielded in [12], [22] for medical imaging applications, we chose it for the solution developed in this work. A detailed description of MixMatch can be found in Section 4. Table 1 quantitatively summarizes the reported accuracy performance of some of the most recent SSDL approaches. The results suggest that MixMatch and similar methods yield the lowest error rates. The reported results used the Street View House Numbers (SVHN) dataset. Based upon the good results of MixMatch compared to other state of the art methods, we selected it to test our proposed data-centric method to improve SSDL robustness to OOD data.
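As a concrete illustration of one MixMatch ingredient, the temperature sharpening applied to the guessed labels of unlabelled observations can be sketched as below. This is a simplified fragment only; the full algorithm of [24] additionally averages predictions over several augmentations and applies MixUp.

```python
import numpy as np

def sharpen(p, T=0.5):
    """MixMatch-style temperature sharpening: raise the class probabilities
    to the power 1/T and renormalize, pushing the guessed label towards a
    one-hot distribution as T decreases."""
    p = np.asarray(p, dtype=float) ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)
```

For example, a guessed distribution of (0.6, 0.4) becomes roughly (0.69, 0.31) with T=0.5, while T=1 leaves it unchanged.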
Table 1

SSDL error rates (the lower the better) reported in the literature for state of the art methods, using the SVHN dataset. The numbers of labels nl = 500, 1000 and 2000 were the most frequently used in the literature.

Model | Category | nl=500 | nl=1000 | nl=2000
Supervised only | Supervised | 22.08 ± 0.73 [37] | 14.46 ± 0.71 [37] | –
Pi Model (Pi-M) | Consistency based SSDL | 6.83 ± 0.66 [30] | 4.82 ± 0.17 [30] | –
Temporal Ensemble Model (TEM) | Consistency based SSDL | 5.12 ± 0.13 [30] | 4.42 ± 0.16 [30], [38] | –
Virtual Adversarial Training with Entropy Minimization (VATM+EM) | Consistency based SSDL | – | 3.86 ± 0.22 [39] | –
Virtual Adversarial Training Model (VATM) | Consistency based SSDL | – | 5.42 ± 0.22 [39] | –
Mean Teacher Model (MTM) | Consistency based SSDL | 4.18 ± 0.5 [30] | 3.95 ± 0.19 [30], [38] | –
Self Supervised network Model (SESEMI) | Consistency based SSDL | 6.5 ± 0.28 [40] | 5.59 ± 0.12 [40] | –
Mutual Exclusivity-Transformation Model (METM) | Consistency based SSDL | 9.62 ± 1.37 [41] | 4.52 ± 0.4 [41] | 3.66 ± 0.14 [41]
Walker Model (WaM) | Consistency based SSDL | 6.25 ± 0.32 [41] | 5.14 ± 0.17 [41] | 4.6 ± 0.21 [41]
Transductive Model (TransM) | Consistency based SSDL | 4.32 ± 0.3 [37] | 3.8 ± 0.27 [37] | 3.35 ± 0.27 [37]
Transductive Model with Mean Teacher (TransM+MTM) | Consistency based SSDL | 4.09 ± 0.42 [37] | 3.09 ± 0.27 [37] | 3.35 ± 0.27 [37]
Memory based Model (MeM) | Consistency based SSDL | – | 4.21 ± 0.12 [42] | –
MixMatch | Consistency and pseudo-label based SSDL | – | 3.5 ± 0.28 | –
ReMixMatch | Consistency and pseudo-label based SSDL | – | 2.65 ± 0.08 | –
FixMatch using Random Augmentation | Consistency and pseudo-label based SSDL | – | 2.28 ± 0.11 | –
FixMatch using CTA Augmentation | Consistency and pseudo-label based SSDL | – | 2.36 ± 0.19 | –
Tri-Net | Pseudo-label based SSDL | – | 3.71 ± 0.14 [27] | –
Speed as a supervisor for SSDL (SaaSM) | Pseudo-label based SSDL | – | 3.82 ± 0.09 [43] | –
Tri-Net with the Pi-M | Pseudo-label based SSDL | – | 3.45 ± 0.1 [27] | –

SSDL robustness to distribution mismatch

The distribution mismatch between the labelled and unlabelled datasets is also referred to as a violation of the Independent and Identically Distributed (IID) assumption. It might have different degrees and causes, listed as follows [44]:

- Prior probability shift: The distribution of the labels in the unlabelled dataset can be different when compared to the labelled dataset. In a CAD system this can be exemplified when the labels of the medical images have different distributions between the two datasets. A specific case would be the label imbalance of the labelled dataset, as discussed in [22].
- Covariate shift: A different distribution of the features in the input observations might be sampled, leading to a distribution mismatch. In a medical imaging application, this can be related to the difference in the frequencies of the observed features between the labelled and unlabelled datasets.
- Concept drift: It refers to different features observed in samples with the same label. In the application at hand in this work, this might happen when patients with different variations of the COVID-19 disease are sampled to build the unlabelled dataset, with the same pathologies (classes) as in the labelled dataset.
- Concept shift: It is associated with a shift in the labels, with the same features. In the aforementioned example, it would refer to labelling a medical image with similar features with a different pathology (a bias caused by the image labellers).
- Unseen classes: The unlabelled dataset contains observations of classes unseen or unrepresented in the labelled dataset. One or more distractor classes are sampled in the unlabelled dataset. Therefore, a mismatch in the number of labels exists, along with a prior probability shift and a feature distribution mismatch. For instance, the labelled dataset might include only the classes viral pneumonia and normal, while the unlabelled dataset might include the classes bacterial pneumonia, viral pneumonia and normal.

In our tested setting, different data sources were used only to gather unlabelled data.
We recreate two of the aforementioned distribution mismatch causes: covariate and prior probability shift. The unlabelled datasets created and tested consist of normal (no pathology, COVID-19 negative) chest X-ray images, from patients of different nationalities. As the labelled dataset includes both classes (COVID-19 positive and COVID-19 negative), a label distribution mismatch also occurs. The tested setting in this work simulates the case where different unlabelled data sources might be available (for instance from different hospitals), at the beginning of a pandemic. Furthermore, a small labelled dataset might be available in the target hospital/clinic. The usage of different unlabelled datasets might potentially cause a violation of the aforementioned clustered-data/low-density separation assumption. Using unlabelled datasets with different distributions when compared to the labelled dataset might create wrong sparse regions and/or less clustered groups of observations belonging to the same class. Therefore, in this work we explore data-oriented approaches to deal with potential violations of the clustered-data/low-density separation assumption. Unlabelled data can be considered significantly cheaper than labelled data. Thus, discarding potentially harmful observations, with the aim to decrease the odds of violating the clustered-data/low-density separation assumption, is viable and worth exploring. In [45], an extensive evaluation of different distribution mismatch settings and their impact on SSDL is developed. The authors concluded that distribution mismatch in SSDL is an important challenge to be addressed. Recently, different approaches for improving SSDL robustness to the distribution mismatch between the labelled and unlabelled datasets have been proposed. In [46], an OOD masking method is proposed, referred to as RealMix. It consists of down-weighting the observations likely to be OOD during semi-supervised training.
The output of a softmax activation function after the raw model output was used as the OOD masking coefficient. A hard thresholding was applied to the unlabelled data, in order to discard OOD data. This works as an observation-wise masking during semi-supervised model training. The authors compared their proposed method with state of the art general-purpose SSDL approaches like MixMatch [24]. The test bed consisted of different unlabelled datasets with a varying degree of distribution mismatch. The contamination source consists of images with different labels and features (completely OOD), corresponding to the unseen classes IID violation cause. Their method proved to improve model robustness against OOD data contamination in the unlabelled dataset, using general purpose datasets such as the Canadian Institute For Advanced Research dataset with 10 classes (CIFAR-10) and SVHN. However, other types of distribution mismatch corruption such as concept drift or covariate shift were not tested. Another approach to deal with distribution mismatch under OOD contamination (different labels and features) can be found in [47]. The proposed method also implements a weighting coefficient, calculated as the softmax output of a model ensemble. It is referred to as Uncertainty Aware Self-Distillation (UASD) by the authors. Similar to RealMix, a hard thresholding of the OOD data was proposed. However, more diverse distribution mismatch scenarios were tested, using different degrees of contamination with unseen classes as the pollution source. In a similar trend, the work in [48] proposes a weighted approach to deal with OOD observations (with different labels and features). The proposed method was named Deep Safe Semi-Supervised Learning (DS3L) by the authors. However, instead of using the softmax output, the observation-wise weight is estimated through an optimization step. The score or weight obtained for each observation is used to weight it in the unlabelled loss term, instead of discarding the data.
We refer to this approach as soft thresholding. Similar to [46], only general purpose datasets were used (CIFAR-10 and the Modified National Institute of Standards and Technology dataset (MNIST), using approximately half of the classes as unseen classes in the unlabelled dataset), with no other variations of distribution mismatch settings. A resembling approach and testing bed to those of [48] can be found in [49], where an optimization based approach to weight each observation is implemented, with a test-bed focused on OOD contaminated unlabelled datasets. To diminish the computational cost of estimating the observation-wise weights for the unlabelled data, a clustering step was implemented. The cluster centroids were used to calculate the weights for all the observations within the cluster. The method is referred to as Robust Semi-Supervised Learning (R-SSL) by the authors. In this work, we analyse the effect of distribution mismatch in SSDL within a real-world application: COVID-19 detection using chest X-ray images. Unlike previous work on SSDL under distribution mismatch, we test a real-world setting in the medical domain, and explore its implications within such context. As previously mentioned, we analyse the impact of a distribution mismatch caused by covariate and prior probability shift. Different unlabelled dataset sources within the same domain and features are used. We aim to evaluate different approaches to weigh how harmful an unlabelled observation could be for SSDL training. We test different OOD detection approaches in this work. After calculating a harm coefficient for each unlabelled observation, different steps can be implemented to use such unlabelled dataset: for example, filtering out the observations with high harm coefficients, selecting an unlabelled dataset based on its estimated benefit for SSDL, or weighting each unlabelled observation during SSDL training.
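The hard versus soft thresholding options described above can be sketched as follows; the helper names and the simple inverse-score weighting are illustrative assumptions of ours, not formulas from the cited works.

```python
import numpy as np

def hard_threshold(ood_scores, threshold):
    """Hard thresholding: discard unlabelled observations whose OOD (harm)
    score exceeds the threshold; return the indices kept for SSDL training."""
    ood_scores = np.asarray(ood_scores, dtype=float)
    return np.where(ood_scores <= threshold)[0]

def soft_weights(ood_scores):
    """Soft thresholding: keep every observation, but down-weight likely-OOD
    ones in the unsupervised loss term (here, a simple inverse-score weight,
    normalized so the most in-distribution observation has weight 1)."""
    ood_scores = np.asarray(ood_scores, dtype=float)
    w = 1.0 / (1.0 + ood_scores)
    return w / w.max()
```

In a hard-thresholding scheme the returned indices select the retained unlabelled subset; in a soft scheme the weights multiply each observation's contribution to the unlabelled loss term.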
Moreover, we focus on a data-oriented approach to identify and/or build a good unlabelled dataset for SSDL. We propose a simple and very inexpensive method to evaluate the distribution mismatch between an unlabelled dataset and a labelled dataset. Such method can be thought of as an OOD scoring approach (a harm coefficient), which leads us to compare our method to recent OOD detectors used in the context of OOD data filtering to improve the accuracy of an SSDL model. Unlike most recent robust SSDL methods, which use output based or optimization based scoring for the unlabelled data, our approach uses the feature space, as seen in very recent OOD detection approaches. This research gap can be inferred from the state of the art summary table for robust SSDL methods, in Table 2.
Table 2

State of the art SSDL methods robust to distribution mismatch. The unseen classes setting is the most tested cause for distribution mismatch. Our proposed method tests covariate and prior probability shift causes for distribution mismatch, and implements a feature space based method for scoring unlabelled data.

Method name | IID violation cause | Thresholding | OOD data filtering approach
RealMix | Unseen classes | Hard | Output based
UASD | Unseen classes | Hard | Output based
DS3L | Unseen classes | Soft | Optimization based
R-SSL | Unseen classes | Soft | Optimization based

OOD data detection

OOD data detection refers to the general problem of detecting observations that are very unlikely given a specific data distribution (usually the training dataset distribution) [50]. The problem of OOD data detection can be thought of as a generalization of the outlier detection problem, as it considers individual and collective outliers [51]. Specific scenarios of OOD data detection can be found in the literature. These include novel data and anomaly detection [52], with several applications like rare event detection [53], [54]. In classical pattern recognition literature, different approaches to anomaly and OOD data detection are grounded in concepts such as density estimation [55], kernel representations [56], prototyping [55] and robust moment estimation [57]. The recent success of deep learning based approaches for image analysis [58] has motivated the development of OOD detection techniques for deep neural networks. OOD detection methods with deep learning architectures can be categorized into methods based upon the Deep Neural Network (DNN)'s output, its input, or its learned feature space. DNN output based methods include the softmax based OOD detector proposed in [59]. In such work, OOD detection is framed as a confidence estimation, taking the model's raw output layer values and passing them through a softmax function. The maximum softmax value is used as confidence. The authors claim that the highest softmax value of OOD observations differs meaningfully from that of in-distribution observations. However, as reported in [60], non calibrated models can be overconfident with OOD data. Therefore, in [60] a calibration methodology is introduced, implementing a temperature coefficient. OOD data detection in neural networks is implemented in [60] using input perturbations meant to maximize the softmax based separability. To this end, a gradient descent optimization is used, resulting in a preprocessed image.
A temperature coefficient is added in the calculation of the softmax output, and is estimated to yield a true positive rate of 95% for in-distribution data detection, using the previously pre-processed images. Another approach for OOD detection based on the model's output is the usage of Monte Carlo Dropout (MCD) based uncertainty estimations. MCD is a popular method for implementing predictive uncertainty estimation [61], [62]. It consists of analysing the distribution of predictions obtained from the same input while adding noise to the model (drop-out in the context of DNNs). This idea has been ported to the OOD detection problem, where observations with high uncertainty are scored with high OOD likelihood [63], [64]. Regarding feature space (a latent space approximation in DNNs) based methods for OOD detection, different approaches can be found in the literature. For example, in [65], the authors implemented the Mahalanobis distance in the latent space, from the dataset distribution to the input observation, assuming a Gaussian distribution of the data. Both the mean and covariance are estimated for the in-distribution dataset. For a new observation, the OOD score is estimated as its Mahalanobis distance to such distribution. The authors also implemented the calibration approach used in [60]. A superior performance of their proposed method in generic OOD detection benchmarks is reported, when compared to the methods in [59], [60]. However, no statistical significance tests of the results were performed. Another feature space based approach can be found in [66], known as deterministic uncertainty quantification. Such approach is also intended for uncertainty estimation, but is also tested as an OOD detection technique. It makes use of a centroid calculation of each category in the feature space, to later quantify the distance of a new observation to each centroid.
Uncertainty quantification is estimated based on the kernel based distance to the category centroids. The approach is compared against an ensemble of deep neural networks (an output based approach for OOD detection). This is done in a simple OOD detection benchmark, where CIFAR-10 is used as the in-distribution dataset and SVHN as the OOD dataset. The authors reported the area under the Receiver Operator Characteristic (ROC) curve of their approach against other OOD methods. Their approach showed the highest area under the ROC curve. However, no statistical analysis of the results was performed. In [67] the authors developed an extensive testing of the influence of distribution mismatch between unlabelled and labelled datasets. Moreover, they also developed an approach to estimate the accuracy hit of such distribution mismatch for a state of the art SSDL method. The proposed method estimates the distribution mismatch in the feature space between the labelled and unlabelled datasets, using what the authors referred to as a Deep Dataset Dissimilarity Measure (DeDiM). Euclidean and Manhattan based DeDiMs were tested and compared against density based DeDiMs. All of them were applied within the feature space, built with an ImageNet pre-trained network. The authors found a significant advantage of the density based distances. In [68], the authors proposed an OOD detector using the feature space as well. The approach fits different parametric distributions in the feature space of the data. The decision to discriminate between OOD and In-Distribution (IOD) data is made based on the estimation of the approximated parametric model. Unfortunately, no comparison with other popular OOD methods was presented. Table 3 describes a summary of the state of the art methods and the benchmarks used to test them by the authors. This summary makes clear how most previous OOD detection methods have focused on the unseen classes distribution mismatch cause.
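As a minimal illustration of a feature space based detector, the Mahalanobis distance score of [65] can be sketched as below, using a single Gaussian for simplicity; the original fits class-conditional Gaussians with a shared covariance and adds an input-perturbation calibration step.

```python
import numpy as np

def mahalanobis_ood_score(train_feats, x):
    """OOD score of observation x as its Mahalanobis distance to a Gaussian
    fitted on the in-distribution feature vectors (larger = more likely OOD)."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov_inv = np.linalg.pinv(cov)      # pseudo-inverse for numerical stability
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))
```

Here `train_feats` would be the feature vectors of the in-distribution (labelled) dataset, and `x` the feature vector of a candidate observation.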
In this work we evaluate the covariate shift cause for a distribution mismatch between the labelled and unlabelled datasets in a real-world application, used by a SSDL method. Additionally, we propose a simple feature based approach to improve SSDL performance under those circumstances, as only a few very recent OOD detection approaches have done.
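For comparison, the output based baseline of [59], optionally with the temperature coefficient of [60], reduces to a few lines. This is a sketch: `logits` stands for the raw output layer values of the model, and the thresholding of the resulting confidence is left to the caller.

```python
import numpy as np

def max_softmax_confidence(logits, T=1.0):
    """Confidence score of [59]: the largest softmax probability of the raw
    model output. Following ODIN [60], the logits may first be divided by a
    temperature T. Low confidence suggests an out-of-distribution input."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                        # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())
```

A higher temperature flattens the softmax distribution, which is one ingredient [60] uses (together with input perturbations) to improve the separability between in-distribution and OOD confidences.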
Table 3

OOD test benchmarks for different techniques. Datasets marked with * were randomly cut in half: one half was used as in-distribution labelled training data and the other half as OOD unlabelled data. The table reveals how arbitrarily different testbeds have been used for benchmarking OOD detection algorithms, using the unseen classes cause of the IID assumption violation. IOD-OOD dataset pairs are indicated by matching numbers.

Method name | IOD data | OOD data | Category
Max. value of Softmax layer [59] | CIFAR-10 (1), CIFAR-100 (2), MNIST (3) | SUN (1,2), Gaussian (1,2), Omniglot (3), notMNIST (3), Uniform noise (3) | Output based
Inhibited Softmax [69] | CIFAR-10 (1), MNIST (2) | SVHN (1), LFW-A (1), notMNIST (2), Omniglot (2) | Output based
ODIN [60] | CIFAR-10 (1), CIFAR-100 (2) | TinyImageNet (1,2), LSUN (1,2), iSUN (1,2), Uniform (1,2), Gaussian (1,2) | Output based
Epistemic Uncertainty Estimation [70] | CIFAR* (1), FashionMNIST* (2), SVHN* (3), MNIST* (4) | CIFAR* (1), FashionMNIST* (2), SVHN* (3), MNIST* (4) | Output based
Mahalanobis Latent Distance [65] | CIFAR-10 (1), CIFAR-100 (2), SVHN (3) | SVHN (1,2), CIFAR-10 (3), TinyImageNet (1,2,3), LSUN (1,2,3) | Feature space based
Deterministic Uncertainty Quantification | CIFAR-10 | SVHN | Feature space based
Deep Residual Flow [68] | CIFAR-10 (1), CIFAR-100 (2), SVHN (3) | CIFAR-10 (3), TinyImageNet (1,2,3), LSUN (1,2,3), SVHN (1,2) | Feature space based

Unsupervised domain adaptation

When using an unlabelled dataset with a distribution very different from that of the labelled dataset, a solution would be to correct or align the feature extractor, trained with labelled or unlabelled data from the source of the unlabelled dataset, to the distribution of the labelled dataset (the target dataset, usually smaller). This is known as Unsupervised Domain Adaptation (UDA). For instance, in [21] the authors proposed a UDA method to align the feature extractor from a source dataset to a specific target dataset, within the context of COVID-19 detection using chest X-ray images. The feature extractor was originally trained with source data. Later, the feature extractor is aligned by using both labelled and unlabelled data from the target dataset. The feature extractor alignment procedure basically consists of an adversarial training step using the aforementioned datasets. As a disadvantage of such a method, the feature extractor needs to be trained with labelled source data (as usual in supervised learning); hence a large number of labels is needed. Also, the feature extractor alignment process can be considered expensive, as an adversarial loss function needs to be optimized.

Datasets

In this work, we explore the sensitivity to distribution mismatch between the labelled and unlabelled datasets of an SSDL COVID-19 detection system using chest X-ray images. Therefore, we use different data sources of chest X-ray images for both COVID-19+ (positive COVID-19) and COVID-19− (no pathology) observations. For COVID-19+ cases we use the open dataset made available by Dr. Cohen in [11]. This dataset is composed of 105 COVID-19+ images at the time of writing this work. The observations were sampled from different journal websites like the Italian Society of Medical and Interventional Radiology and radiopaedia.org, and more recent publications in the field. We used only COVID-19 observations, discarding images related to Middle East Respiratory Syndrome (MERS), Acute Respiratory Distress Syndrome (ARDS) and Severe Acute Respiratory Syndrome (SARS). The images present varying resolutions, from 400 × 400 up to 2500 × 2500 pixels. As for COVID-19− observations, we used four different data sources. Table 4 summarizes the COVID-19− data sources. Fig. 1 shows observations for each one of the data sources used in this work. The datasets were randomly augmented with flips and rotations. No random crops were used, to avoid discarding important regions in the images.
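The flip-and-rotation augmentation policy described above (with cropping deliberately omitted) can be sketched as follows. This is an illustrative NumPy sketch under our own naming; the authors' actual augmentation pipeline is not specified at this level of detail:

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rotate an X-ray image. No cropping is applied,
    so no anatomical regions are discarded (illustrative sketch)."""
    if rng.random() < 0.5:
        image = np.fliplr(image)   # random horizontal flip
    k = rng.integers(0, 4)         # rotate by 0, 90, 180 or 270 degrees
    return np.rot90(image, k)

rng = np.random.default_rng(0)
xray = np.zeros((512, 512))
augmented = augment(xray, rng)     # same shape, same pixel content
```

Because only flips and right-angle rotations are used, every pixel of the original image survives the transformation, which is the stated motivation for avoiding random crops.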
Table 4

COVID-19− observation sources used in this work.

Dataset | CR | Chinese | ChestX-ray8 | Indiana
No. of patients | 105 | 5856 | 65240 | 4000
Patient's age range (years) | 7–86 | children | 0–94 | adults
No. of obs. | 1055 | 236 | 224316 | 8121
Hospital/clinic | Clinica Chavarria | No info. | Stanford Hospital | Indiana Network for Patient Care
Im. resolution | 1907 × 1791 | 1300 × 600 | 1024 × 1024 | 1400 × 1400
Reference | [22] | [13] | [14] | [15]
Fig. 1

Row 1, column 1: a COVID-19+ observation from [11]; row 1, column 2: a COVID-19− observation from the Chinese dataset [13]; row 2, column 1: a ChestX-ray8 COVID-19− image [14]; row 2, column 2: an Indiana dataset COVID-19− sample image [15]. The bottom image corresponds to a sample image from the Costa Rica dataset [22]. As can be seen, images from the Costa Rica dataset include a black frame.

In this first set of experiments, we evaluate the impact of OOD data with different unlabelled data sources and different degrees of contamination. We simulate the following scenario: a small labelled target dataset (with COVID-19+ and COVID-19− observations) is provided, built from a partition of the COVID-19+ observations taken from Dr. Cohen's dataset and the COVID-19− cases of the Indiana chest X-ray dataset, described in Table 4. A larger set of 142 unlabelled observations is also available, to be used in the harm coefficient estimation methods. This can be thought of as the target labelled dataset with limited labels, which is accessible in a real-world application from the clinic/hospital where the model is intended to be deployed. For the unlabelled dataset, we used different partitions of COVID-19− cases from the chest X-ray data sources described in Table 4. This simulates the usage of unlabelled datasets taken from different hospitals/clinics. All the unlabelled observations are COVID-19−, to enforce a prior probability shift (label imbalance). As in our preliminary tests the worst performing unlabelled dataset was the Costa Rican dataset described in Table 4, we used it to create different combinations with the rest of the datasets. All of these are depicted in Table 7. Unlabelled observations were picked from such datasets in different combinations. Using different data sources for the unlabelled dataset helps to assess the impact of a distribution mismatch between the labelled and unlabelled datasets.
Table 7

TB-1.1 results: Accuracy of an Alexnet model trained with MixMatch with different unlabelled datasets. The ChestX-ray8, Costa Rican and Chinese unlabelled datasets include only COVID-19− observations.

Dataset | nl=40 | nl=20
Supervised | 0.785±0.038 | 0.809±0.085
Indiana (with COVID-19+ [11]) | 0.782±0.039 | 0.75±0.06
China | 0.648±0.0247 | 0.659±0.033
Costa Rica | 0.501±0.001 | 0.5±0.001
ChestX-ray8 | 0.72±0.076 | 0.71±0.074
ChestX-ray8 65% - Costa Rica 35% | 0.711±0.083 | 0.66±0.11
ChestX-ray8 35% - Costa Rica 65% | 0.516±0.022 | 0.511±0.016
China 65% - Costa Rica 35% | 0.701±0.055 | 0.688±0.084
China 35% - Costa Rica 65% | 0.53±0.023 | 0.528±0.019
Indiana 65% - Costa Rica 35% | 0.532±0.024 | 0.559±0.059
Indiana 35% - Costa Rica 65% | 0.501±0.001 | 0.503±0.009
As for the test dataset, it consists of another partition of the target dataset, which includes the COVID-19+ dataset, along with another partition of the Indiana chest X-ray dataset (COVID-19−). Both are the same size, yielding a completely balanced test setting. We used a total of 62 observations, drawn from the same target dataset (31 observations per class). The test data comes from the distribution of the labelled data, with no contamination. This simulates the case where the labelled data comes from the target dataset distribution. Both unlabelled and labelled datasets were standardized, given that the authors in [71] found that normalization is important in semi-supervised learning.

Proposed method

SSDL with MixMatch

In this work, we explore the usage of MixMatch as an SSDL method; therefore, we describe it in the following. We selected MixMatch as a baseline method given its good performance compared to other state of the art methods, as described in Table 1. For more details please refer to [24]. As previously mentioned, MixMatch combines both pseudo-label and consistency regularization SSDL. In such a context, a pseudo-label is estimated for each unlabelled observation. It corresponds to the mean model output over a number of differently transformed versions of the input, using transformations such as flips and rotations [24]. Each pseudo-label is sharpened using a temperature parameter [24]. Also, a simple data augmentation approach is implemented by linearly combining unlabelled and labelled observations, through the usage of the MixUp algorithm [36]. The pseudo-labels are used in the MixMatch loss function, which combines supervised and unsupervised loss terms. In this work, the well-known cross-entropy function is used as the supervised loss term. As for the unsupervised loss term, we used the Euclidean distance loss previously implemented in [24]. The Euclidean distance measures the distance between the current model output and its pseudo-label, for the unlabelled observations. This loss term is weighed by the unsupervised learning coefficient. In this work, we used the MixMatch hyper-parameters recommended in [24]. As for the unsupervised coefficient, a value of 200 is used, given our empirical test results.
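The MixMatch building blocks named above (pseudo-label averaging over augmented views, temperature sharpening, and MixUp) can be sketched as follows. This is a minimal NumPy sketch under our own naming, not the authors' implementation; the default values `T=0.5`, `K=2` and `alpha=0.75` are the ones recommended in the original MixMatch paper [24], and `model` is any callable returning class probabilities:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening of an averaged pseudo-label [24]:
    lower T pushes the distribution towards one-hot."""
    p = p ** (1.0 / T)
    return p / p.sum()

def pseudo_label(model, x_unlabelled, augmentations, K=2):
    """Mean model prediction over K augmented views, then sharpened."""
    preds = [model(a(x_unlabelled)) for a in augmentations[:K]]
    return sharpen(np.mean(preds, axis=0))

def mixup(x1, y1, x2, y2, alpha=0.75, rng=None):
    """MixUp [36]: convex combination of two observations and their labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1 - lam)  # keep the mix closer to the first sample, as in MixMatch
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

The sharpened pseudo-labels would then enter the unsupervised (Euclidean distance) loss term, weighed by the unsupervised coefficient.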

Harm coefficient estimation for unlabelled observations

Interesting results were yielded in [67], [72], where the authors found a strong correlation between feature-density based distances and MixMatch's accuracy. Building upon this, we propose to estimate how harmful an individual unlabelled observation might be to MixMatch's accuracy. We refer to this operator as the SSDL harm coefficient. We aim to implement a simple and computationally inexpensive method to filter OOD data in the unlabelled dataset, in order to decrease the distribution mismatch between the labelled and unlabelled datasets. As mentioned in Section 2, using different unlabelled data sources might increase the chance of violating the clustered-data/low-density separation assumption. This is particularly the case given the potential distribution mismatch between the labelled and unlabelled datasets. Therefore, our proposed method aims to discard harmful observations that might create wrong low density regions in the manifold and/or sparser sample clusters for each category. In a real-world scenario for OOD filtering, DNNs are fed with high resolution images, frequently from the same domain (chest X-ray images in our case). This contrasts with the usual settings of the methods discussed in Section 2. As previously discussed, benchmarking in the literature has usually been performed with small resolution images and with relatively easy OOD detection challenges (i.e., distinguishing between CIFAR-10 and MNIST images). We aim to test more realistic distribution mismatch conditions in a medical image analysis application, namely COVID-19 detection using chest X-ray images. In this work, we propose to use the feature density of the labelled dataset to weigh how harmful including a given observation in the unlabelled dataset could be. This is done within the context of training a model with the SSDL algorithm known as MixMatch.
We test two different variations to estimate the harm coefficient. The first consists of a non-parametric estimation of the feature density through a histogram calculation. The second assumes a Gaussian distribution of the feature space, using a Mahalanobis distance. We use a generic feature space built from a pre-trained ImageNet model, to keep the computational cost of the proposed method low. For all the tested configurations, we only use the features of the final convolutional layer: computational resource restrictions in a real-world medical imaging problem make it very expensive to use the features extracted at all the different layers, as done in [65]. The procedure to calculate the harm coefficient with both methods is as follows. For each input observation x_i, using the feature extractor f, we calculate its feature vector f(x_i), whose dimensionality is much lower than that of the input space. For instance, a feature extractor based on the ImageNet pre-trained Wide-ResNet architecture yields a fixed number of features at its final convolutional layer. For architectures such as densenet that yield larger feature arrays in their final convolutional layer, we sub-sampled the features with an average pooling operation to keep the dimensionality comparable. Applying the feature extractor to the labelled dataset yields the labelled feature set.
For the Feature Histograms (FH) method, we perform the following steps. For each dimension j in the feature space, we compute its normalized histogram over the labelled feature set, to approximate the per-dimension density function p_j. Using these approximated feature densities, we estimate the SSDL harm coefficient for an unlabelled observation x_u as follows: calculate its features f(x_u) and, assuming the dimensions are statistically independent, compute the total likelihood as the product over the dimensions, prod_j p_j(f_j(x_u)). To avoid under-flow, we calculate the negative logarithm of the likelihood and use it as the harm coefficient: h(x_u) = -sum_j log p_j(f_j(x_u)).
For the Mahalanobis based filtering, we perform the following steps: calculate the covariance matrix Sigma and the sample mean mu of the labelled feature set; calculate the features f(x_u) for each unlabelled observation; and compute the harm coefficient as the Mahalanobis distance h(x_u) = sqrt((f(x_u) - mu)^T Sigma^{-1} (f(x_u) - mu)). The harm coefficient can be used to discard the observations with the highest values, or to weigh them in case an online semi-supervised per-observation weighting is implemented. In this work, we first test the impact of the distribution mismatch between the labelled target and unlabelled source datasets on the accuracy of the SSDL MixMatch algorithm. Later, we test the impact of the proposed feature based harm coefficient used to eliminate potentially harmful observations from the unlabelled dataset. This was done to assess the accuracy of the model using the filtered unlabelled dataset. This way, we can assess in a controlled setting the impact of the distribution rectification procedure, implemented through a data filtering process. Fig. 2 summarizes both proposed methods.
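The two harm coefficient estimators described above can be sketched as follows, assuming the features have already been extracted with a pre-trained network. Function names, the number of histogram bins and the covariance regularization term are our own illustrative choices:

```python
import numpy as np

def fh_harm(feats_l, feats_u, bins=15):
    """Feature Histograms (FH) harm coefficient: negative log-likelihood of
    each unlabelled feature vector under per-dimension histogram densities
    estimated on the labelled target features (dimensions assumed independent)."""
    n_dims = feats_l.shape[1]
    harm = np.zeros(len(feats_u))
    for j in range(n_dims):
        hist, edges = np.histogram(feats_l[:, j], bins=bins, density=True)
        # map each unlabelled value to its bin (out-of-range values clip to edge bins)
        idx = np.clip(np.digitize(feats_u[:, j], edges) - 1, 0, bins - 1)
        harm += -np.log(hist[idx] + 1e-12)   # accumulate per-dimension -log p
    return harm

def mahalanobis_harm(feats_l, feats_u):
    """Mahalanobis harm coefficient: distance of each unlabelled feature
    vector to the labelled feature mean, under the labelled covariance."""
    mu = feats_l.mean(axis=0)
    cov = np.cov(feats_l, rowvar=False) + 1e-6 * np.eye(feats_l.shape[1])
    inv = np.linalg.inv(cov)
    d = feats_u - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', d, inv, d))
```

In both cases a higher coefficient means the observation lies in a low-density region of the labelled feature space, so it is a candidate for removal from the unlabelled dataset.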
Fig. 2

Summary of the proposed unlabelled data scoring methods for SSDL.


Experiments

Experiment design

Test-bed 1 (TB-1) is designed to assess the effect on MixMatch's accuracy of using different unlabelled datasets together with a target labelled dataset. As error measure we use the accuracy on a balanced test dataset. This test-bed recreates different distribution mismatch conditions between the labelled and unlabelled datasets. The Costa Rican dataset acts as a source of OOD data, as it yielded the lowest accuracy among the empirically tested unlabelled data sources when used as the unlabelled dataset for MixMatch. We combine the remaining data sources with the Costa Rican dataset; this helps enforce different distribution mismatch settings. In Test-bed 1.1 (TB-1.1), the first sub-experiment defined within TB-1, we measure MixMatch's accuracy using a Densenet model, with and without feature extractor fine-tuning. We aim to measure whether there is a significant accuracy gain from fine-tuning the feature extractor during training. Table 5 shows the results of MixMatch training without feature extractor fine-tuning, while Table 6 shows the results with it.
Table 5

TB-1.1 results: Accuracy of a Densenet model trained with MixMatch with different unlabelled datasets. The ChestX-ray8, Costa Rican and Chinese unlabelled datasets include only COVID-19− observations. No fine-tuned feature extractor was used.

Dataset | nl=40 | nl=20
Supervised | 0.851±0.037 | 0.803±0.039
Indiana (with COVID-19+ [11]) | 0.891±0.047 | 0.875±0.04
China | 0.735±0.0621 | 0.722±0.054
Costa Rica | 0.493±0.014 | 0.511±0.029
ChestX-ray8 | 0.825±0.061 | 0.795±0.052
ChestX-ray8 65% - Costa Rica 35% | 0.579±0.115 | 0.582±0.067
ChestX-ray8 35% - Costa Rica 65% | 0.5±0.001 | 0.503±0.009
China 65% - Costa Rica 35% | 0.588±0.066 | 0.559±0.067
China 35% - Costa Rica 65% | 0.498±0.004 | 0.508±0.024
Indiana 65% - Costa Rica 35% | 0.504±0.014 | 0.553±0.062
Indiana 35% - Costa Rica 65% | 0.501±0.004 | 0.5±0.001
Table 6

TB-1.1 results: Accuracy of a Densenet model trained with MixMatch with different unlabelled datasets. The ChestX-ray8, Costa Rican and Chinese unlabelled datasets include only COVID-19− observations. Using the fine-tuned feature extractor.

Dataset | nl=40 | nl=20
Supervised | 0.852±0.045 | 0.795±0.005
Indiana (with COVID-19+ [11]) | 0.892±0.044 | 0.885±0.039
China | 0.733±0.043 | 0.709±0.059
Costa Rica | 0.498±0.004 | 0.501±0.016
ChestX-ray8 | 0.804±0.061 | 0.793±0.044
ChestX-ray8 65% - Costa Rica 35% | 0.598±0.1 | 0.591±0.105
ChestX-ray8 35% - Costa Rica 65% | 0.501±0.004 | 0.488±0.033
China 65% - Costa Rica 35% | 0.593±0.057 | 0.614±0.0926
China 35% - Costa Rica 65% | 0.514±0.055 | 0.496±0.022
Indiana 65% - Costa Rica 35% | 0.516±0.048 | 0.535±0.047
Indiana 35% - Costa Rica 65% | 0.508±0.016 | 0.501±0.011
Additionally, we devised a Test-bed 1.2 (TB-1.2), where the baseline MixMatch accuracy results in Table 5, Table 7 are correlated with the cosine DeDiMs between each unlabelled dataset and the labelled dataset, measured as proposed in [71]. We measure the linear correlation between the model's accuracy and the corresponding labelled-unlabelled dataset distance. For this experiment, we used an Alexnet model's feature extractor, given its low computational cost. We implemented the cosine dataset DeDiM with a batch size of 80 observations, using 10 batches of random samples. The same batches were used to test the different configurations. Similar to the proposed harm coefficient estimation methods, we used a generic ImageNet pre-trained feature extractor to build the feature density estimations, as proposed in [71]. The DeDiM results are linearly correlated using a Pearson coefficient in Table 9. We performed a Wilcoxon test to verify whether there is a statistically significant difference when comparing: feature extractor fine-tuning vs. no feature extractor fine-tuning, the two proposed methods against each of the previous methods (Softmax and MCD based), and the proposed Mahalanobis method vs. the also proposed FH approach.
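Under one plausible reading of the cosine DeDiM (the mean pairwise cosine distance between random feature batches drawn from the two datasets), the measurement and its correlation with accuracy could be sketched as follows; the exact DeDiM definition in [71] may differ, so this is illustrative only:

```python
import numpy as np

def cosine_dedim(feats_a, feats_b, batch=80, n_batches=10, rng=None):
    """Sketch of a cosine DeDiM: mean pairwise cosine distance between
    random feature batches from two datasets, averaged over n_batches."""
    rng = rng or np.random.default_rng(0)
    dists = []
    for _ in range(n_batches):
        a = feats_a[rng.choice(len(feats_a), batch, replace=False)]
        b = feats_b[rng.choice(len(feats_b), batch, replace=False)]
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        dists.append(np.mean(1.0 - a @ b.T))  # mean pairwise cosine distance
    return np.mean(dists), np.std(dists)

# The Pearson coefficient between accuracies and distances (as in Table 9)
# can then be computed with np.corrcoef(accuracies, distances)[0, 1].
```

A larger DeDiM between the labelled and unlabelled feature batches would, per the reported negative Pearson coefficients, predict a lower MixMatch accuracy.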
Table 9

TB-1.2 test results: Pearson coefficient between the accuracy and the calculated divergences.

SSDL model | nl | Pearson coefficient
Alexnet | 20 | −0.798
Alexnet | 40 | −0.75
Densenet | 20 | −0.665
Densenet | 40 | −0.662
Finally, Test-bed 2 (TB-2) aims to assess MixMatch's accuracy when implementing the methods proposed in this work to filter OOD observations, against two popular output based OOD filtering methods: the MCD and Softmax based OOD filters. In this test bed, we measure MixMatch's accuracy over the datasets filtered by the four different methods, testing both Alexnet and Densenet models. We also tested the model with nl = 20 and nl = 40 labels. The results using the proposed feature histograms and Mahalanobis distance for each generated unlabelled data source are depicted in Table 11, Table 13, for the Alexnet and the Densenet models, respectively. To filter possible OOD observations, we eliminated the same percentage of observations as the contamination level with the Costa Rican dataset (i.e., if the Chinese dataset was contaminated with 35% of Costa Rican observations, we eliminated the 35% of observations with the highest harm coefficient, and so on). We leave the problem of defining the right harm coefficient threshold out of this study.
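The two output based baselines and the contamination-matched filtering rule can be sketched as follows. This is a hedged sketch: the function names are our own, and the MCD score is shown here as predictive variance across stochastic forward passes, one common formulation of Monte Carlo Dropout uncertainty:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def max_softmax_score(logits):
    """Softmax baseline [59]: a low maximum class probability suggests OOD."""
    return 1.0 - softmax(logits).max(axis=1)   # higher score = more likely OOD

def mcd_score(stochastic_logits):
    """Monte Carlo Dropout baseline: predictive variance across T stochastic
    forward passes (dropout kept active at inference). Shape: (T, N, C)."""
    probs = np.stack([softmax(l) for l in stochastic_logits])
    return probs.var(axis=0).mean(axis=1)      # higher score = more uncertain

def filter_unlabelled(scores, contamination):
    """Discard the fraction of observations with the highest scores,
    matching the known contamination level (as done in TB-2)."""
    k = int(len(scores) * contamination)
    return np.argsort(scores)[: len(scores) - k]   # indices of kept observations
```

The same `filter_unlabelled` rule applies to the proposed FH and Mahalanobis harm coefficients; only the scoring function changes.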
Table 11

Accuracy of an Alexnet model trained with MixMatch on the filtered datasets, using the harm coefficient with the two proposed feature density based methods: FH and the Mahalanobis based filter. The percentage of discarded observations matches the percentage of Costa Rican observations.

Dataset | Acc. FH (nl=40) | Acc. Maha. (nl=40) | Acc. FH (nl=20) | Acc. Maha. (nl=20)
ChestX-ray8 35% - Costa Rica 65% | 0.709±0.084 | 0.727±0.078 | 0.682±0.09 | 0.685±0.089
ChestX-ray8 65% - Costa Rica 35% | 0.732±0.064 | 0.7612±0.049 | 0.717±0.08 | 0.709±0.09
China 35% - Costa Rica 65% | 0.683±0.065 | 0.708±0.07 | 0.667±0.078 | 0.667±0.09
China 65% - Costa Rica 35% | 0.693±0.044 | 0.695±0.079 | 0.687±0.078 | 0.674±0.072
Indiana 35% - Costa Rica 65% | 0.732±0.052 | 0.711±0.032 | 0.703±0.1 | 0.719±0.09
Indiana 65% - Costa Rica 35% | 0.719±0.058 | 0.748±0.059 | 0.709±0.093 | 0.711±0.09
Table 13

Accuracy of a Densenet model trained with MixMatch on the filtered datasets, using the harm coefficient with the two proposed feature density based methods: FH and the Mahalanobis based filter. The percentage of discarded observations matches the percentage of Costa Rican observations.

Dataset | Acc. FH (nl=40) | Acc. Maha. (nl=40) | Acc. FH (nl=20) | Acc. Maha. (nl=20)
ChestX-ray8 35% - Costa Rica 65% | 0.691±0.10 | 0.769±0.048 | 0.683±0.105 | 0.779±0.025
ChestX-ray8 65% - Costa Rica 35% | 0.717±0.091 | 0.811±0.049 | 0.695±0.1 | 0.783±0.049
China 35% - Costa Rica 65% | 0.794±0.036 | 0.795±0.053 | 0.787±0.048 | 0.769±0.076
China 65% - Costa Rica 35% | 0.788±0.056 | 0.812±0.05 | 0.774±0.053 | 0.798±0.036
Indiana 35% - Costa Rica 65% | 0.758±0.047 | 0.729±0.035 | 0.727±0.0512 | 0.714±0.046
Indiana 65% - Costa Rica 35% | 0.737±0.049 | 0.762±0.055 | 0.703±0.055 | 0.722±0.032
In all test beds, the MixMatch algorithm is tested with Densenet and Alexnet models, using the recommended parameters in [24], along with an unsupervised regularization term coefficient of 200. As for model training, we use the one-cycle policy implemented in the FastAI library, with a weight decay of 0.001. This way we can measure MixMatch's behaviour with models of different depth and architecture. For each configuration, we trained the model over 10 runs of 50 epochs each, using a different random data partition for training and test. Finally, Table 14 shows the average and standard deviation of the execution time, in seconds, of the tested harmful data filters. The data load for these tests was the same as in the previous experiments, and a Densenet backbone was used. The Mahalanobis based method is the fastest, with an execution time of around 65.1 s on average and a standard deviation of 2.3 s (for a typical data load of the test bench). The Mahalanobis method was the fastest with statistical significance, according to our Wilcoxon test, when compared to the rest of the evaluated methods.
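The one-cycle training policy mentioned above can be sketched as a learning-rate schedule function. The specific `max_lr`, warm-up fraction and divisor values below are illustrative defaults in the spirit of the FastAI implementation, not values reported in this work:

```python
import numpy as np

def one_cycle_lr(step, total_steps, max_lr=1e-3, pct_start=0.3,
                 div=25.0, final_div=1e4):
    """Sketch of the one-cycle policy: a cosine ramp from max_lr/div up to
    max_lr over the first pct_start of training, then a cosine anneal down
    to max_lr/final_div."""
    warm = int(total_steps * pct_start)
    if step < warm:
        t = step / max(warm, 1)
        return max_lr / div + (max_lr - max_lr / div) * (1 - np.cos(np.pi * t)) / 2
    t = (step - warm) / max(total_steps - warm, 1)
    return max_lr / final_div + (max_lr - max_lr / final_div) * (1 + np.cos(np.pi * t)) / 2
```

In practice this schedule is applied per optimizer step (e.g. via `torch.optim.lr_scheduler.OneCycleLR` in PyTorch, or FastAI's `fit_one_cycle`), combined here with a weight decay of 0.001 as stated above.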
Table 14

Average and standard deviation of the execution time, in seconds, of the different unlabelled harmful data techniques tested in this work. The execution time of using 10 random data batches was measured.

Harmful data filter | Time (s)
Mahalanobis | 65.1±2.3
Feature Histograms | 269.7±2.7
Softmax | 1246.7±22.2
Monte Carlo Dropout | 1089.6±10.8

Experiment setup

Regarding hardware resources, most of the experiments were run on the DIGITS computer at De Montfort University, equipped with a 12 GB NVIDIA TITAN V GPU, 24 Intel(R) Xeon(R) E5-2620 @ 2.00 GHz CPU cores and 32 GB of RAM. Software-wise, the system ran Ubuntu 18.04 LTS with Python 3.7.0. The PyTorch library (version 1.4.0) was used to develop the algorithms in this work. We also used the FastAI library (version 1.0.61) to develop some sections of this work. The repository with the code used in this work can be found at https://gitlab.com/saul1917/mixmatch_with_ood.

Results analysis

In this section we develop the interpretation of the obtained results. As for the results of TB-1.1, depicted in Table 5, we can see a very strong influence of the unlabelled data source on the accuracy of the SSDL MixMatch algorithm. Training the model with the Indiana dataset, including also COVID-19+ observations, yields the highest accuracy, around 0.89, higher than the supervised model. From there, using ChestX-ray8 as the unlabelled dataset yields an accuracy of 0.825, followed, accuracy wise, by the Chinese dataset. Using the Costa Rican dataset as the unlabelled dataset yields the lowest accuracy, close to 0.493. Contaminating the ChestX-ray8, Chinese and Indiana datasets with the Costa Rican dataset yields a lower accuracy as the degree of contamination increases. As for the impact of fine-tuning the feature extractor, there is no statistically significant difference when comparing the results in Table 5, Table 6. This suggests that using an ImageNet pre-trained feature extractor for harm coefficient estimation is justifiable. Regarding TB-1.2 results, when comparing the accuracy yielded by MixMatch for each tested unlabelled dataset with the calculated inter-dataset cosine DeDiMs in Table 8, we can see an interesting relationship. The Costa Rican dataset and the heavily contaminated data sources present the highest distances. For instance, the Chinese dataset contaminated to a degree of 65% with the Costa Rican dataset presents a distance of 50.93 to the labelled dataset, similar to the inter-dataset distance of the Costa Rican dataset itself of 57.19 (the unlabelled dataset with the highest distance to the labelled one). We can see how both of the aforementioned datasets yield very low MixMatch accuracy. This behaviour is summarized in the Pearson coefficients depicted in Table 9, with a very high linear correlation of around 78% for the tested variations.
The correlation remains high between the semi-supervised Densenet model's behaviour and the dataset distances, even though the latter were computed with a generic ImageNet pre-trained Alexnet model. This suggests that the feature density can bring useful information to decide whether to preserve or discard an unlabelled observation in the unlabelled dataset.
Table 8

TB-1.2 results: Cosine DeDiM distance between the labelled and unlabelled datasets, using 10 different batches of 80 observations. Alexnet was used to keep the computing cost low.

Dataset | d(Sl, Su)
China | 2.06±0.11
Costa Rica | 30.9±0.4
ChestX-ray8 | 1.04±0.27
ChestX-ray8 65% - Costa Rica 35% | 3.95±0.94
ChestX-ray8 35% - Costa Rica 65% | 11.84±0.94
China 65% - Costa Rica 35% | 5.74±0.79
China 35% - Costa Rica 65% | 14.85±0.0
Indiana 65% - Costa Rica 35% | 6.33±0.3
Indiana 35% - Costa Rica 65% | 16.61±0.3
Regarding the results of TB-2, Table 11, Table 13 show the accuracy yielded by MixMatch when filtering the unlabelled datasets with the proposed FH and Mahalanobis methods, for both tested models (Alexnet and Densenet, respectively). For both proposed methods, we can see how filtering potentially harmful observations from the unlabelled dataset increases MixMatch's accuracy significantly, when compared to the baseline accuracies in Table 5, Table 7, for both tested models. For instance, when using the Densenet model, the ChestX-ray8 dataset contaminated with 35% and 65% of the Costa Rica dataset increases its accuracy from 0.579 to 0.78 and from 0.5 to 0.79, respectively, when filtering harmful observations with the Mahalanobis method (both with statistical significance, according to our Wilcoxon tests). This can be seen in Table 5, Table 13. The usage of the FH method also yields an important accuracy gain; in this case, however, it is lower than the gains obtained with the Mahalanobis method. The accuracy of the model trained using the ChestX-ray8 dataset with no contamination is almost restored, as MixMatch originally yielded 0.825.
We have to consider that the filtered dataset is always smaller than the original unlabelled dataset; despite this, the accuracy remains very close. Similarly, for the Alexnet model, the accuracy using an Indiana unlabelled dataset contaminated with 65% of the Costa Rica dataset is close to 50%, according to Table 7. However, after filtering out harmful unlabelled observations, it rises to around 71%, using either the FH or the Mahalanobis method. When comparing the accuracy gain of the feature histograms against the Mahalanobis distance based method, we see a similar behaviour across almost all the tested unlabelled datasets: according to our statistical analysis using the Wilcoxon method, there is no statistically significant difference between the FH and Mahalanobis methods. However, this behaviour is broken for the ChestX-ray8 dataset when using the Densenet model, where the Mahalanobis based method yields statistically significant accuracy gains over the FH approach, as seen in Table 13. This suggests that the feature distribution of the labelled dataset fits a Gaussian distribution well, given the similar and sometimes slightly better results of the Mahalanobis method. The Mahalanobis based method is also faster, as it only needs to compute a covariance matrix, whereas the histogram based approach needs to build a feature histogram; the latter proved significantly slower in our tests, as seen in Table 14. As for the tested MCD and Softmax baseline methods, popular in OOD detection and uncertainty estimation, the results depicted in Table 10, Table 12 for the Alexnet and Densenet models show a very poor performance. The accuracy gains are negligible, and sometimes the accuracy even diminishes, when compared to the baseline results shown in Table 5, Table 7. Therefore, the feature density based methods for filtering potentially harmful unlabelled observations prove to be a significantly better approach.
Accuracy gains of up to 25%, with statistical significance in all the tested settings according to our Wilcoxon tests, were obtained when using the feature density approaches over the tested output based ones. This can be seen when comparing the results of the proposed feature density techniques in Table 11 and Table 13 with those of Table 10 and Table 12, for the alexnet and densenet architectures, respectively.
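As a sketch of the Mahalanobis-based filter discussed above: fit a mean and covariance to the features of the small labelled target dataset, score each unlabelled observation by its Mahalanobis distance to that Gaussian, and discard the most distant fraction. This is a minimal illustration assuming features extracted from a pretrained network are already available as NumPy arrays; the regularisation constant is a hypothetical choice, not taken from the paper:

```python
import numpy as np

def fit_gaussian(features_labelled):
    """Fit mean and inverse covariance on the labelled target features.
    The small ridge term keeps the covariance invertible (illustrative)."""
    mu = features_labelled.mean(axis=0)
    cov = np.cov(features_labelled, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)

def mahalanobis_scores(features_unlabelled, mu, cov_inv):
    """Mahalanobis distance of each unlabelled observation to the
    labelled feature distribution (high = likely harmful)."""
    d = features_unlabelled - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', d, cov_inv, d))

def filter_unlabelled(features_unlabelled, mu, cov_inv, discard_fraction):
    """Keep the unlabelled observations closest to the labelled density,
    discarding the given fraction with the largest distances."""
    scores = mahalanobis_scores(features_unlabelled, mu, cov_inv)
    n_keep = len(scores) - int(round(discard_fraction * len(scores)))
    return np.sort(np.argsort(scores)[:n_keep])
```

Note that, consistent with the timing results, the only fitting cost is one covariance matrix (plus its inverse); no model training is involved.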
Table 10

Accuracy of an Alexnet model trained with MixMatch, with the filtered datasets using the harm coefficient with the two output-based methods: MCD and Softmax. The percentage of discarded observations is the same as the amount of Costa Rican observations.

Dataset | Acc. Softmax (nl=40) | Acc. MCD (nl=40) | Acc. Softmax (nl=20) | Acc. MCD (nl=20)
ChestX-ray8 35% - Costa Rica 65% | 0.532±0.059 | 0.506±0.012 | 0.52±0.038 | 0.5±0.002
ChestX-ray8 65% - Costa Rica 35% | 0.582±0.096 | 0.567±0.067 | 0.579±0.096 | 0.558±0.067
China 35% - Costa Rica 65% | 0.514±0.04 | 0.503±0.009 | 0.525±0.077 | 0.509±0.02
China 65% - Costa Rica 35% | 0.591±0.096 | 0.579±0.076 | 0.585±0.096 | 0.567±0.051
Indiana 35% - Costa Rica 65% | 0.503±0.009 | 0.503±0.006 | 0.506±0.019 | 0.509±0.014
Indiana 65% - Costa Rica 35% | 0.574±0.078 | 0.544±0.032 | 0.551±0.054 | 0.543±0.042
Table 12

Accuracy of a Densenet model trained with MixMatch, with the filtered datasets using the harm coefficient with the two output-based methods: MCD and Softmax. The percentage of discarded observations is the same as the amount of Costa Rican observations.

Dataset | Acc. Softmax (nl=40) | Acc. MCD (nl=40) | Acc. Softmax (nl=20) | Acc. MCD (nl=20)
ChestX-ray8 35% - Costa Rica 65% | 0.5±0.001 | 0.5±0.001 | 0.488±0.025 | 0.529±0.077
ChestX-ray8 65% - Costa Rica 35% | 0.543±0.09 | 0.537±0.11 | 0.543±0.095 | 0.498±0.004
China 35% - Costa Rica 65% | 0.498±0.004 | 0.5±0.001 | 0.49±0.04 | 0.496±0.009
China 65% - Costa Rica 35% | 0.517±0.029 | 0.501±0.004 | 0.5±0.007 | 0.504±0.01
Indiana 35% - Costa Rica 65% | 0.499±0.001 | 0.5±0.001 | 0.48±0.036 | 0.496±0.009
Indiana 65% - Costa Rica 35% | 0.5±0.001 | 0.501±0.008 | 0.497±0.0 | 0.503±0.0173
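The statistical significance claims in this section rely on Wilcoxon signed-rank tests over per-run accuracies. A minimal sketch with SciPy, using made-up accuracy vectors (not the paper's actual runs):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-run accuracies of two filtering methods over the same 10 runs.
acc_mahalanobis = np.array([0.78, 0.79, 0.80, 0.77, 0.78, 0.79, 0.81, 0.80, 0.79, 0.78])
acc_softmax = np.array([0.55, 0.54, 0.56, 0.53, 0.55, 0.54, 0.57, 0.55, 0.54, 0.53])

# Paired, non-parametric test: is the median accuracy difference zero?
stat, p_value = wilcoxon(acc_mahalanobis, acc_softmax)
significant = p_value < 0.05  # compared against the chosen significance level
```

The test is paired because both methods are evaluated on the same runs, and non-parametric because nothing is assumed about the accuracy distribution.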

Conclusions

In this work, we have analysed the impact of the distribution mismatch between the labelled and the unlabelled dataset when training an SSDL model with the MixMatch algorithm. The assessed setting used medical imaging data for COVID-19 detection. Measuring the impact of the distribution mismatch between the unlabelled and labelled datasets for medical imaging applications is still an under-reported problem in the literature. In the first test-bed, we assessed the impact of using different unlabelled data sources, and quantitatively analysed the distribution mismatch between them using DeDiMs as a metric. The high linear correlation between the measured DeDiMs and the MixMatch accuracy suggests a strong influence of the feature distribution mismatch between the labelled and unlabelled datasets. In contexts where a decision must be made about which unlabelled data source to use, from a set of possible unlabelled datasets, the DeDiMs might serve as a quantitative prior. Implementing the tested DeDiMs requires no model training, as a generic ImageNet-pretrained model seems to be good enough to estimate the benefit of using a specific unlabelled dataset, according to our results. Data quality metrics for deep learning models, as argued in [73], [74], are an interesting path to develop further, as they might help to narrow the gap between research and real-world implementation of deep learning systems. For instance, building high quality datasets for training a semi-supervised model, or assessing the safety of using a deep learning model beforehand, can benefit from quantitative data quality measures. We encourage the community to include robust data quality metrics in the deployment of deep learning solutions. To increase the robustness of the SSDL model to the distribution mismatch, we tested different approaches to discard potentially harmful observations from the unlabelled dataset.
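The DeDiMs used here compare the labelled and unlabelled datasets in the feature space of a generic ImageNet-pretrained model. As a toy stand-in for such a dataset dissimilarity (the paper's exact definition is not reproduced here), one can compare summary statistics of two feature batches drawn from the candidate datasets:

```python
import numpy as np

def feature_dissimilarity(feats_a, feats_b):
    """Toy dataset dissimilarity in a shared feature space: Euclidean
    distance between the mean feature vectors of two batches. A simple
    stand-in for the DeDiMs used in the paper; any 2-D arrays of
    (observations x features) work here."""
    return float(np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0)))
```

Under this reading, a larger dissimilarity between the labelled dataset and a candidate unlabelled source would predict a larger accuracy hit, consistent with the correlation reported above.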
The tested setting can be considered closer to real-world conditions, as images within the same domain were used as OOD data contamination sources. This contrasts with frequent OOD detection benchmarks, where images from very different datasets are used as OOD data sources [68]. Our approach is data-oriented, as it modifies the original dataset in an explicit way by removing potentially harmful unlabelled observations. We tested output based OOD filtering techniques against our proposed feature density based approaches. Our proposed methods, based on feature densities built upon an ImageNet-pretrained model, showed a large and statistically significant advantage over previous output based OOD filtering methods. In the context of SSDL, some approaches have relied on weighting each unlabelled observation using the output of the model, as in [46]. According to our results, we argue that using the model’s output might yield over-confident results for filtering or weighting unlabelled observations; this is widely known in the OOD detection literature [75]. Even ensemble based approaches like the tested MCD method are not able to filter harmful unlabelled observations, according to our test results. However, both feature density based approaches demonstrated good performance in detecting harmful unlabelled observations, almost recovering the original accuracy of the uncontaminated datasets. The proposed methods can be deployed to correct and create more effective unlabelled datasets. Moreover, both proposed methods require no deep learning model training, making them cheap and reducing the carbon footprint of their implementation [76]. Research on computationally efficient methods to identify potentially harmful data for deep learning systems remains an interesting future research path.
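A sketch of a feature-histogram (FH) style density score: build per-dimension histograms of the labelled target features and score each unlabelled observation by its log-density under them, with low scores flagging potentially harmful observations. The Laplace smoothing and the handling of out-of-range values are illustrative choices, not necessarily those of the paper:

```python
import numpy as np

def fit_feature_histograms(features_labelled, n_bins=10):
    """Per-dimension normalised histograms (with Laplace smoothing) of the
    labelled target features. Returns one (edges, probs, floor) tuple per
    feature dimension; `floor` is the probability assigned to values that
    fall outside the histogram support."""
    hists = []
    n = features_labelled.shape[0]
    for j in range(features_labelled.shape[1]):
        counts, edges = np.histogram(features_labelled[:, j], bins=n_bins)
        probs = (counts + 1.0) / (n + n_bins)
        hists.append((edges, probs, 1.0 / (n + n_bins)))
    return hists

def histogram_density_score(x, hists):
    """Sum of per-dimension log-densities of one observation; lower means
    farther from the labelled feature density (i.e. more likely harmful)."""
    score = 0.0
    for j, (edges, probs, floor) in enumerate(hists):
        if x[j] < edges[0] or x[j] > edges[-1]:
            p = floor  # outside the support of the labelled features
        else:
            idx = min(np.searchsorted(edges, x[j], side='right') - 1, len(probs) - 1)
            p = probs[idx]
        score += np.log(p)
    return score
```

Building and querying such histograms touches every bin of every feature dimension, which is consistent with this approach being slower than fitting a single covariance matrix.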
Recently, the renowned deep learning researcher Andrew Ng has urged the community to focus on data-centric AI solutions that are able to tackle the main challenges faced by AI systems during their everyday usage [77]. As argued in [78], most of the development effort of AI solutions for real-world usage is invested in data manipulation tasks. Nevertheless, data-oriented operations are often overlooked in the deep learning research community. Likewise, different dataset testing settings (scarcely labelled datasets, datasets with distribution mismatch) are frequently omitted, which often obscures the actual accuracy gain of using a specific methodology. Therefore, we agree with Ng’s call to focus on more data-centric methods and more sophisticated dataset evaluation settings when developing deep learning and AI technology, along with stronger data quality and evaluation standards for data-driven AI systems. In the context of the currently active COVID-19 pandemic, these shortcomings of deep learning based solutions have hindered their path to solving urgent challenges posed by the pandemic. It can be argued that the AI and deep learning community mostly focused on developing model-centric solutions that delivered questionable accuracy gains, often using datasets under unrealistic assumptions (the same distribution for the test and training datasets) and with hidden biases (age and other types of biases have been found in popular datasets used in recent publications) [77]. This has led to a poor, almost null, impact of AI tools in the struggle against the COVID-19 pandemic [79], [80]. The lack of high quality data standards and regulations to attain them (data bias acknowledgement, data standardization and sharing, data quality and robustness metrics, etc.) in the AI research community is an obstacle to developing robust models for daily clinical usage.

CRediT authorship contribution statement

Saul Calderon-Ramirez: Software, Investigation, Data curation, Writing – original draft. Shengxiang Yang: Conceptualization, Methodology, Formal analysis, Supervision. David Elizondo: Conceptualization, Writing – review & editing. Armaghan Moemeni: Investigation, Writing – original draft, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.