Saul Calderon-Ramirez1,2, Diego Murillo-Hernandez3,4, Kevin Rojas-Salazar3,4, David Elizondo5, Shengxiang Yang5, Armaghan Moemeni6, Miguel Molina-Cabello7,8.
Abstract
The implementation of deep learning-based computer-aided diagnosis systems for the classification of mammogram images can help improve the accuracy and reliability of patient diagnosis while reducing its cost. However, training a deep learning model requires a considerable amount of labelled images, which can be expensive to obtain, as they demand time and effort from clinical practitioners. To address this, a number of publicly available datasets have been built with data from different hospitals and clinics, which can be used to pre-train a model. However, using models trained on these datasets for later transfer learning and model fine-tuning with images sampled from a different hospital or clinic might result in lower performance. This is due to the distribution mismatch between the datasets, which include different patient populations and image acquisition protocols. In this work, a real-world scenario is evaluated where a novel target dataset sampled from a private Costa Rican clinic is used, with few labels and heavily imbalanced data. The use of two popular and publicly available datasets (INbreast and CBIS-DDSM) as source data, to train and test the models on the novel target dataset, is evaluated. A common approach to further improve model performance in such a small labelled target dataset setting is data augmentation. However, unlabelled data from the target clinic is often available at a much lower cost. Therefore, semi-supervised deep learning, which leverages both labelled and unlabelled data, can be used in such conditions. In this work, we evaluate the semi-supervised deep learning approach known as MixMatch, to take advantage of unlabelled data from the target dataset, for whole mammogram image classification.
We compare the use of semi-supervised learning on its own, and combined with transfer learning (from a source mammogram dataset) and data augmentation, against regular supervised learning with transfer learning and data augmentation from source datasets. It is shown that the use of semi-supervised deep learning combined with transfer learning and data augmentation can provide a meaningful advantage when labelled observations are scarce. We also found a strong influence of the source dataset, which suggests that a more data-centric approach is needed to tackle the challenge of scarcely labelled data. We used several different metrics (such as the G-mean and the F2-score) to assess the performance gain of using semi-supervised learning when dealing with very imbalanced test datasets, as mammogram datasets often are. Graphical Abstract: Description of the test-bed implemented in this work. Two different source data distributions were used to fine-tune the different models tested in this work. The target dataset is the in-house CR-Chavarria-2020 dataset.
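For orientation, MixMatch builds pseudo-labels for unlabelled images by averaging the model's predictions over several augmentations, sharpening the averaged distribution with a temperature, and then mixing labelled and unlabelled examples via MixUp. The following is a minimal pure-Python sketch of those three operations as described in the original MixMatch paper; the function names, the temperature T = 0.5, and alpha = 0.75 are common defaults used for illustration, not the exact configuration of this work.

```python
import random

def sharpen(probs, T=0.5):
    """Temperature sharpening: raise each class probability to 1/T and
    renormalise, pushing the guessed label distribution towards one-hot."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def guess_label(predictions):
    """Average the model's predictions over K augmentations of the same
    unlabelled image, then sharpen the result."""
    k = len(predictions)
    n_classes = len(predictions[0])
    avg = [sum(pred[c] for pred in predictions) / k for c in range(n_classes)]
    return sharpen(avg)

def mixup(x1, y1, x2, y2, alpha=0.75):
    """MixUp: convex combination of two samples and their labels.
    Taking lam >= 0.5 keeps the mix closer to its first argument."""
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Predictions for two augmentations of one unlabelled mammogram
# (hypothetical benign-vs-malignant probabilities):
guessed = guess_label([[0.7, 0.3], [0.5, 0.5]])
# guessed[0] > guessed[1]: sharpening strengthens the dominant class
```

In the full algorithm, the mixed labelled batch feeds a supervised cross-entropy term, while the mixed unlabelled batch with its guessed labels feeds a consistency term weighted by an unsupervised loss coefficient.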
Keywords: Breast cancer; Data imbalance; Mammogram; Semi-supervised deep learning; Transfer learning
Year: 2022 PMID: 35239108 PMCID: PMC8892413 DOI: 10.1007/s11517-021-02497-6
Source DB: PubMed Journal: Med Biol Eng Comput ISSN: 0140-0118 Impact factor: 3.079
Fig. 1 Diagram of experimental configurations presented in this work
Summary of datasets used in this work
| | INbreast | CBIS-DDSM | Target CR dataset |
|---|---|---|---|
| Origin | Portugal | USA | Costa Rica |
| Year | 2011 | 1997–2016 | 2020 |
| Number of cases | 115 | 1566 | 87 |
| Number of images | 410 | 3103 | 341 |
| Views | CC, MLO | CC, MLO | CC, MLO |
| Image mode | Full-field digital | Digitized screen-film | Full-field digital |
| Categories | BI-RADS, ACR Density | BI-RADS, ACR Density, Verified Pathology | BI-RADS |
| ROI annotations | Yes | Yes | No |
Fig. 9 Examples of benign (top) and malignant (bottom) mammogram images from each dataset
Fig. 2 BI-RADS categories distribution for the target CR dataset
Fig. 3 Binary categories distribution for the target CR dataset
Fig. 4 Craniocaudal (CC) and mediolateral oblique (MLO) views distribution for the complete and binary-labelled target datasets
Fig. 5 Depicted breast distribution for the complete and binary-labelled target datasets
Fig. 6 Age distribution for patients in the target CR dataset
Fig. 7 Age distribution according to BI-RADS categories for patients in the target CR dataset
Fig. 8 Age distribution according to binary categories for patients in the target CR dataset
Fig. 10 Examples of images from the original CR data discarded due to image quality (top) or patients with breast implants (bottom)
Fig. 11 Examples of images with background noise from the CBIS-DDSM dataset, before and after preprocessing
Classification performance for models of configuration S+No-FT, using the VGG-19 architecture
| Metric | INbreast models (mean) | INbreast models (SD) | CBIS-DDSM models (mean) | CBIS-DDSM models (SD) |
|---|---|---|---|---|
| G-Mean | 0.3773 | 0.1043 | 0.3476 | 0.2534 |
| F2-Score | 0.1882 | 0.0625 | 0.1347 | 0.1148 |
| Accuracy | 0.2183 | 0.0602 | 0.7379 | 0.0678 |
| Recall | 0.7667 | 0.2509 | 0.2333 | 0.1876 |
| Specificity | 0.1901 | 0.0558 | 0.7639 | 0.0707 |
| Precision | 0.0470 | 0.0160 | 0.0517 | 0.0467 |
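The G-mean and F2-score reported in these tables are better suited than plain accuracy to the heavily imbalanced target data: the G-mean is the geometric mean of recall and specificity, and the F2-score is the F-beta score with beta = 2, weighting recall more heavily than precision. As a reference, a minimal sketch of computing both from binary confusion-matrix counts (function names are illustrative, not from the paper's code):

```python
import math

def gmean(tp, fp, fn, tn):
    """Geometric mean of recall (sensitivity) and specificity; a
    majority-class predictor scores 0 regardless of class imbalance."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)

def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score; beta = 2 (the F2-score) weights recall twice as
    heavily as precision, penalising missed malignant cases."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with hypothetical counts tp=8, fp=20, fn=2, tn=70, recall is 0.8 but precision only 8/28, so the F2-score stays well above the F0.5-score would, reflecting its emphasis on not missing positives.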
Classification performance for models of configuration SSDL, using the VGG-19 architecture
| Metric | n=20 labels (mean) | n=20 (SD) | n=40 labels (mean) | n=40 (SD) | n=60 labels (mean) | n=60 (SD) |
|---|---|---|---|---|---|---|
| G-Mean | 0.4798 | 0.1936 | 0.5720 | 0.1257 | 0.6413 | 0.0929 |
| F2-Score | 0.2169 | 0.1194 | 0.2683 | 0.1168 | 0.3038 | 0.0889 |
| Accuracy | 0.5786 | 0.2212 | 0.6482 | 0.2172 | 0.6869 | 0.1412 |
| Recall | 0.5167 | 0.2687 | 0.5750 | 0.2648 | 0.6333 | 0.2297 |
| Specificity | 0.5815 | 0.2404 | 0.6518 | 0.2354 | 0.6904 | 0.1544 |
| Precision | 0.1189 | 0.1551 | 0.1079 | 0.0754 | 0.1096 | 0.0491 |
Summary of G-Mean scores for models of configurations SSDL+FT and S+FT, using 20 labelled observations. The corresponding number of trainable parameters for the PyTorch implementation of each architecture is also shown. Bold entries refer to the best result between supervised and SSDL models

| Architecture | INbreast: SSDL | | Supervised | CBIS-DDSM: SSDL | | Supervised | Trainable parameters |
|---|---|---|---|---|---|---|---|
| VGG19_bn | 0.1084 | 0.6682 | 0.0770 | 0.0742 | 0.5163 | 0.2826 | 139.5 million |
| ResNet-152 | 0.1167 | 0.6767 | 0.1021 | 0.1075 | 0.5857 | 0.0598 | 58.1 million |
| EfficientNet-b0 | 0.1081 | 0.6393 | 0.0603 | 0.0753 | 0.5824 | 0.0489 | 4 million |
Results of configurations SSDL+FT and S+FT, using INbreast as source dataset with the VGG-19 architecture. Bold entries refer to the best result between supervised and SSDL models
| Labels | Metric | SSDL | | Supervised |
|---|---|---|---|---|
| 20 | G-Mean | 0.1084 | 0.6682 | 0.0770 |
| | F2-Score | 0.0973 | 0.3133 | 0.0673 |
| | Accuracy | 0.0727 | 0.7014 | 0.0793 |
| | Recall | 0.5917 | 0.1687 | 0.1748 |
| | Specificity* | 0.0755 | 0.7048 | 0.0876 |
| | Precision | 0.0636 | 0.1074 | 0.0335 |
| 40 | G-Mean | 0.0932 | 0.6656 | 0.0877 |
| | F2-Score | 0.0899 | 0.3484 | 0.1112 |
| | Accuracy | 0.0659 | 0.7224 | 0.1590 |
| | Recall | 0.1715 | 0.6417 | 0.2081 |
| | Specificity | 0.0693 | 0.7262 | 0.1721 |
| | Precision | 0.1380 | 0.0373 | 0.1708 |
| 60 | G-Mean | 0.0957 | 0.6604 | 0.0876 |
| | F2-Score | 0.3278 | 0.0958 | 0.1116 |
| | Accuracy | 0.7211 | 0.1169 | 0.1374 |
| | Recall | 0.1318 | 0.6000 | 0.1748 |
| | Specificity | 0.7267 | 0.1230 | 0.1466 |
| | Precision | 0.1226 | 0.0565 | 0.1704 |
* Statistical significance (p-values < 0.05) for average differences between results of SSDL and supervised models
Results of configurations SSDL+FT and S+FT, using CBIS-DDSM as source dataset with the VGG-19 architecture. Bold entries refer to the best result between supervised and SSDL models
| Labels | Metric | SSDL | | Supervised |
|---|---|---|---|---|
| 20 | G-Mean* | 0.0742 | 0.5163 | 0.2826 |
| | F2-Score | 0.0909 | 0.2892 | 0.1797 |
| | Accuracy* | 0.7455 | 0.1115 | 0.0710 |
| | Recall* | 0.1459 | 0.3917 | 0.2292 |
| | Specificity* | 0.7460 | 0.1201 | 0.0709 |
| | Precision | 0.1480 | 0.0551 | 0.1289 |
| 40 | G-Mean* | 0.0909 | 0.5743 | 0.2308 |
| | F2-Score* | 0.1124 | 0.3070 | 0.1597 |
| | Accuracy* | 0.7588 | 0.1041 | 0.0476 |
| | Recall* | 0.1632 | 0.4417 | 0.2189 |
| | Specificity* | 0.7612 | 0.1110 | 0.0453 |
| | Precision | 0.0630 | 0.1458 | 0.0899 |
| 60 | G-Mean | 0.0717 | 0.6466 | 0.1462 |
| | F2-Score | 0.1001 | 0.3436 | 0.1506 |
| | Accuracy* | 0.7197 | 0.1445 | 0.0723 |
| | Recall | 0.1459 | 0.5333 | 0.2297 |
| | Specificity* | 0.7190 | 0.1559 | 0.0779 |
| | Precision | 0.1435 | 0.0623 | 0.0834 |
* Statistical significance (p-values < 0.05) for average differences between results of SSDL and supervised models
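As the tables above suggest, accuracy alone can look deceptively healthy on imbalanced mammogram test sets. A small self-contained illustration (synthetic counts, chosen only for demonstration) shows that a trivial classifier labelling every image as benign still achieves high accuracy, while the G-mean collapses to zero:

```python
import math

# Hypothetical imbalanced test set: 300 benign images, 20 malignant.
# A trivial classifier predicts the majority (benign) class for everything.
tp, fn = 0, 20    # every malignant case is missed
tn, fp = 300, 0   # every benign case is trivially "correct"

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
g_mean = math.sqrt(recall * specificity)

# High accuracy (0.9375) despite zero diagnostic value; G-mean is 0.
print(f"accuracy={accuracy:.4f}, g_mean={g_mean:.4f}")
```

This is why the G-mean and F2-score, rather than accuracy, are the headline metrics when comparing the SSDL and supervised configurations in this work.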