Literature DB >> 36138025

Inflation of test accuracy due to data leakage in deep learning-based classification of OCT images.

Iulian Emil Tampu^1,2, Anders Eklund^3,4,5, Neda Haj-Hosseini^3,4.

Abstract

In the application of deep learning on optical coherence tomography (OCT) data, it is common to train classification networks using 2D images originating from volumetric data. Given the micrometer resolution of OCT systems, consecutive images are often very similar in both visible structures and noise. Thus, an inappropriate data split can result in overlap between the training and testing sets, with a large portion of the literature overlooking this aspect. In this study, the effect of improper dataset splitting on model evaluation is demonstrated for three classification tasks using three OCT open-access datasets extensively used, Kermany's and Srinivasan's ophthalmology datasets, and AIIMS breast tissue dataset. Results show that the classification performance is inflated by 0.07 up to 0.43 in terms of Matthews Correlation Coefficient (accuracy: 5% to 30%) for models tested on datasets with improper splitting, highlighting the considerable effect of dataset handling on model evaluation. This study intends to raise awareness on the importance of dataset splitting given the increased research interest in implementing deep learning on OCT data.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36138025 PMCID： PMC9500039 DOI： 10.1038/s41597-022-01618-6

Source DB: PubMed Journal: Sci Data ISSN： 2052-4463 Impact factor: 8.501

Introduction

The evaluation of deep learning models, and machine learning methods in general, aims at providing an unbiased description of model performance. Given a pool of data suitable for studying a hypothesis (e.g., classification, regression or segmentation), different splits of the data are commonly created for model training, model hyper-parameter tuning (validation set) and model assessment (testing set). This translates to having a part of the data used to fit model parameters and tune model hyperparameters, and another set to assess model generalization on unseen data[1]. Disregarding the choice of having separate validation and testing splits[2], the strategy used to generate the testing set from the original pool of data can have a large impact on the assessment of the model’s final performance. Several studies have investigated the effect of the relative size between training, validation and/or testing sets[1,3] on model performance, as well as how training and validation sets can be used via resampling techniques, such as cross-validation, during model training[1,4]. More importantly, it is widely accepted that no overlap should exist between the samples used for model fitting and hyperparameter tuning, and those belonging to the testing set. If overlap is present, the model performance will be biased and uninformative with respect to the generalization capabilities of the model to new samples. However, although trivial, when implementing data splitting strategies, the overlap between training and testing sets can be easily overlooked. This is specifically more common in those applications where, due to hardware limitations or model design choices, the data from one subject (or acquisition) is used to obtain multiple dimensionally smaller samples used for model training and testing. An example of such a scenario is the slicing of a 3D volume into 2D images. In these cases, the overlap between training and testing sets results from having 2D images from the same subject (or acquisition) belonging to both sets. A proper splitting must therefore be performed on the volume or subject level and not on the slice level. Nowadays, machine learning methods, especially convolutional neural networks (CNNs) and deep learning algorithms, are widely used in research to analyze medical image data. A plethora of publications describe CNN implementations on a variety of both 2D and 3D medical data[5-7]. Reliable evaluation of such methods is paramount since this informs the research community on the models’ performance, allows meaningful comparison between methods, and to a greater extent indicates which research questions might be worth further investigation. In this regard, many medical image analysis challenges were established that aim at providing an unbiased platform for the evaluation and ranking of different methods on a common and standard testing dataset. Despite the many limitations of the medical image analysis challenges, including missing information regarding how the ground truth was obtained, their contribution towards a more transparent and reliable evaluation of deep learning methods for medical image applications is valuable[8]. However, not all implementations can be evaluated through dedicated challenges and unbiased datasets. Thus, when such third-party evaluation platforms are not available, it is the responsibility of the researchers performing the investigation to evaluate their methods thoroughly. As for the case of many of the reviewed medical image analysis challenges[8], one aspect that is sometimes missing or not well described is how the testing dataset is generated from the original pool of data. Moreover, there are examples where the preparation of the testing dataset was described, but its overlap with the training set was not considered[9-13], undermining the reliability of the reported results. Focusing on deep-learning applications for OCT, depending on the acquisition set-up, volumes are usually acquired with micrometer resolution in the x, y and z directions in a restricted field of view, with tissue structures that are alike and affected by similar noise. This results in consecutive slices having a high similarity, in both structure and noise. Figure 1 shows a schematic of an OCT volume along with examples of two consecutive slices from OCT volumes from three open-access datasets[14-16].

Fig. 1

Schematic of an OCT volume with examples of consecutive slices (b-scans) from the three open-access OCT datasets used in this study. In (a) consecutive 2D b-scans rendering a 3D OCT volume are pictured. Here an example from the AIIMS dataset[14] is used for illustrative purposes. In (b) the consecutive b-scans separated by ~18 micrometers are examples of healthy breast tissue (Patient 15, volume 0046, slices 0075 and 0076) from the AIIMS dataset[14]. In (c) images of healthy retina from Srinivasan’s dataset[16] (folder NORMAL5 slices 032 and 033) can be seen. In (d) consecutive images of retina affected by choroidal neovascularization (CNV 81630-33 and 81630-34) from the Kermany’s OCT2017 version 2 dataset[15] are shown. Note that the b-scans from both Kermany’s and Srinivasan’s datasets are given with data augmentation originally applied. Most of the reviewed literature implementing deep learning on OCT data used 2D images from scanned volumes, where two methods were commonly used to split the pool of image data into training and testing sets: per-image or per-volume/subject splitting (Table 1). In the per-volume/subject splitting approach, the random selection of data for the testing set is done on the volumes (or subjects) ensuring that images from one volume (or subject) belong to either one of the training or testing sets. It is important to notice that even a per-volume split might not be enough to avoid overlap between the training and testing set. In fact, if multiple volumes are acquired from the same tissue region, the tissue structures will be very similar among the volumes. In such scenarios, a per-subject split is more appropriate. Overall, assuming that volumes are acquired from different tissue regions, splitting the dataset per-volume or per-subject (here called per-volume/subject) ensures that overlap between training and testing is not present. In the per-image approach, 2D images belonging to the same volume are considered independent: thus, the testing set is created by random selection from the pool of images without considering that images from the same volume (or subject) might end up in both the training and testing sets. Even if this approach clearly results in an overlap between the testing and training sets, several of reviewed studies as well as an earlier version of one of the most downloaded open-access OCT datasets employed a per-image split.

Table 1

Summary of reviewed literature with a focus on dataset split and reported test classification performance.

Ref.	OCT dataset	Data split strategy	Model performance on testing set
[9]	Thyroid, parathyroid, fat and muscle samples	per-image	97.12% accuracy
[35]	Pituitary adenoma	per-image	0.96 AUC
[18]	Ophthalmology[15]* (version 2)	original split	95.55% accuracy
[19]	Ophthalmology[15]* (version 2)	original split	99.1% accuracy
[21]	Ophthalmology[15]* (version 3)	original split	98.7% accuracy
[22]	Ophthalmology[15]* (version 3)	original split	96.6% accuracy
[27]	Ophthalmology[15]* (train version 2, test version 3)	original split	99.6% accuracy
[23]	(1) Ophthalmology[15]* (train version 2, test version 3) (2) Ophthalmology[16]*	(1) original split (2) per-volume/subject	(1) 99.80% accuracy (2) 100% accuracy
[36]	Coronary artery	per-volume/subject	96.05% accuracy
[37]	Kidney^†	per-volume/subject	82.6% accuracy
[38]	High and low grade brain tumors	per-volume/subject	97% accuracy
[39]	Colon**^†	per-volume/subject	88.95% accuracy on 2D images
[40]	Breast tissue	per-volume/subject	91.7% specificity
[20]	Ophthalmology[15]* (version 2)	per-volume/subject	98.46% accuracy
[10]	(1) Ophthalmology[15]* (version 3) (2) Ophthalmology[16]* (3) Breast tissue[14]*	(1) original split (2) per-volume/subject (3) per-image	(1) ~96% accuracy (2) >98.8% accuracy (3) 98.8% accuracy
[24]	(1) Ophthalmology[16]* (2) Ophthalmology[41]* (3) Ophthalmology[42]* (4) Ophthalmology[15]* (unclear version)	(1) per-volume/subject (2) per-volume/subject (3) per-volume/subject (4) original split	(1) 96.66% accuracy (2) 98.97% accuracy (3) 99.74% accuracy (4) 99.78% accuracy
[43]	Dentistry	No description given	98% sensitivity 100% specificity
[44]	Ophthalmology	No description given	99.19% accuracy

Open-access datasets and the ones available upon request are marked by * and **, respectively. The dataset is not open-access if not specified. Datasets obtained from animal model samples are marked by †. The difference in performance between studies using the same datasets results from the different methods implemented.

Summary of reviewed literature with a focus on dataset split and reported test classification performance. (1) original split (2) per-volume/subject (1) 99.80% accuracy (2) 100% accuracy (1) Ophthalmology[15]* (version 3) (2) Ophthalmology[16]* (3) Breast tissue[14]* (1) ~96% accuracy (2) >98.8% accuracy (3) 98.8% accuracy (1) Ophthalmology[16]* (2) Ophthalmology[41]* (3) Ophthalmology[42]* (4) Ophthalmology[15]* (unclear version) (1) per-volume/subject (2) per-volume/subject (3) per-volume/subject (4) original split (1) 96.66% accuracy (2) 98.97% accuracy (3) 99.74% accuracy (4) 99.78% accuracy Open-access datasets and the ones available upon request are marked by * and **, respectively. The dataset is not open-access if not specified. Datasets obtained from animal model samples are marked by †. The difference in performance between studies using the same datasets results from the different methods implemented. The aim of this study is to demonstrate the effect of improper dataset splitting (per-image) on model evaluation using three open-access OCT datasets, AIIMS[14], Srinivasan’s[16] and Kermany’s OCT2017 (version 2 and 3)[15]. These were selected among the other open-access datasets for several reasons: (1) they are examples of OCT medical images belonging to different medical disciplines (ophthalmology and breast oncology) and showing different tissue structures and textures, (2) they are used in literature to evaluate deep learning-based classification of OCT images, with Kermany’s dataset[15] extensively used for developing deep learning methods in ophthalmology (over 14,500 downloads)[17], and (3) the datasets are provided in two different ways, subject-wise for the AIIMS and Srinivasan’s datasets, and already split for both versions of the Kermany’s datasets. The latter is an important aspect to consider since many of the studies using Kermany’s version 2 dataset overlooked the overlap between the training and the testing data.

Results

LightOCT model classification performance on the three datasets and for the different dataset split strategies is summarized in Table 2, with results presented as mean ± standard deviation (m ± std) over the ten-times repeated five-fold cross validation. In addition, Fig. 2 shows the Matthews Correlation Coefficient (MCC) distribution as box plots. For all datasets, the per-image split strategy results in a higher model performance compared to the per-volume/subject strategy. Looking at both the AIIMS and Srinivasan’s datasets, a model trained using a per-image strategy had a higher mean MCC by 0.08 and 0.43, respectively, compared to the one trained on a per-volume/subject split. A similar trend can be seen for both versions of Kermany’s dataset, with an increase in mean MCC by 0.12 and 0.07 for version 2 and version 3, respectively, when shifting from a per-volume/subject to a per-image split strategy. Moreover, results on the original splits for both versions of Kermany’s dataset are higher compared to the corresponding per-volume/subject splits, with a difference in MCC of 0.30 and 0.04 for the version 2 and version 3 datasets, respectively.

Table 2

LightOCT model performance on the AIIMS[14], Srinivasan’s[16] and Kermany’s[15] datasets with training, validation and testing sets split using different strategies.

Dataset	Split strategy	MCC [−1, 1] (m ± std)	AUC [0,1] (m ± std)	F1-score [0,1] (m ± std)	Accuracy [0,1] (m ± std)	Precision [0,1] (m ± std)	Recall [0,1] (m ± std)
AIIMS[14]	per-image	0.958 ± 0.038	1.000 ± 0.000	0.978 ± 0.021	0.978 ± 0.021	1.000 ± 0.000	0.978 ± 0.021
AIIMS[14]	per-volume/subject	0.881 ± 0.102	0.996 ± 0.009	0.934 ± 0.063	0.935 ± 0.060	0.993 ± 0.014	0.935 ± 0.060
Srinivasan[16]	per-image	0.853 ± 0.039	0.985 ± 0.005	0.898 ± 0.030	0.899 ± 0.029	0.973 ± 0.009	0.899 ± 0.029
Srinivasan[16]	per-volume/subject	0.426 ± 0.116	0.817 ± 0.055	0.593 ± 0.088	0.603 ± 0.078	0.702 ± 0.078	0.603 ± 0.078
Kermany[15] version 2	original_v2	0.886	0.993	0.909	0.911	0.983	0.911
	per-image	0.707 ± 0.021	0.953 ± 0.003	0.764 ± 0.022	0.770 ± 0.019	0.886 ± 0.007	0.770 ± 0.019
	per-volume/subject	0.588 ± 0.025	0.890 ± 0.006	0.644 ± 0.033	0.669 ± 0.023	0.769 ± 0.012	0.669 ± 0.023
Kermany[15] version 3	original_v3	0.644	0.964	0.678	0.704	0.916	0.704
	per-image	0.673 ± 0.021	0.950 ± 0.003	0.729 ± 0.022	0.738 ± 0.019	0.886 ± 0.007	0.738 ± 0.019
	per-volume/subject	0.600 ± 0.021	0.911 ± 0.006	0.651 ± 0.028	0.671 ± 0.021	0.795 ± 0.012	0.671 ± 0.021

Performance metrics are reported as mean ± standard deviation (m ± std) over the models trained through ten-times repeated five-fold cross validation and classes, for the per-image and per-volume/subject splits. For the original splits given by Kermany, results are reported for the single given split. AUC: area under the receiver operating characteristic curve, MCC: Matthews Correlation Coefficient.

Fig. 2

Comparison between Matthews Correlation Coefficient for LightOCT model trained on different dataset split strategies. Each box plot summarizes the test MCC for the 50 models trained through a ten-times repeated five-fold cross validation. Results are presented for all the four datasets with the per-image split strategy shown in striped-green and per-volume/subject split strategy in dotted-orange. For Kermany’s datasets, the result of the models trained on the original_v2 and original_v3 splits are shown as full-black circle and full-black cross, respectively. Outliers are shown as diamond (♦).

LightOCT model performance on the AIIMS[14], Srinivasan’s[16] and Kermany’s[15] datasets with training, validation and testing sets split using different strategies. Performance metrics are reported as mean ± standard deviation (m ± std) over the models trained through ten-times repeated five-fold cross validation and classes, for the per-image and per-volume/subject splits. For the original splits given by Kermany, results are reported for the single given split. AUC: area under the receiver operating characteristic curve, MCC: Matthews Correlation Coefficient. Comparison between Matthews Correlation Coefficient for LightOCT model trained on different dataset split strategies. Each box plot summarizes the test MCC for the 50 models trained through a ten-times repeated five-fold cross validation. Results are presented for all the four datasets with the per-image split strategy shown in striped-green and per-volume/subject split strategy in dotted-orange. For Kermany’s datasets, the result of the models trained on the original_v2 and original_v3 splits are shown as full-black circle and full-black cross, respectively. Outliers are shown as diamond (♦). Results on the random label experiments show that, for the original Kermany version 2 dataset, the obtained p-value was 0.071, with the high MCC for random labels indicating a probable data leakage. For the other datasets, the p-values were much larger. We conclude that using random labels during training can potentially be a way to automatically detect data leakage, but that it requires further research.

Discussion

Dataset split should be carefully designed to avoid overlap between training and testing sets. Table 1 summarizes several studies on deep learning applications for OCT data, specifying the described data split strategy. All the works using a per-image split strategy or the original split from both versions of Kermany’s dataset reported accuracies >95%. The results obtained on Kermany’s original_v2 split are in accordance with reported accuracy[18-20], where the differences in performance can be attributed to the deep learning model used and its optimization. However, results on the third version of the dataset were much lower compared to those in literature[10,21,22], even when using the same model architecture, loss and optimizer[10]. Interestingly, Table 1 also shows that studies using multiple datasets and reporting different split strategies for each dataset[10,23,24], report as high accuracy on the per-volume/subject split datasets as the one on the original_v2 or the per-image split datasets. Moreover, the drop in MCC seen when using Kermany’s original_v2 compared to the original_v3 split, highlights the inflation effect that leakage between training and testing data has on model performance, especially when the original_v2 shows to have 92% overlap between training and testing sets while original_v3 has none. In the light of our results, it is reasonable to question whether the high performances reported for the per-volume/subject split datasets reflect the true high performance of the implemented methods, or are examples of inflated accuracy values due to data leakage between training and testing sets. The difference in the effect of data leakage between datasets can be attributed to two reasons: (1) the way the dataset is provided for download and (2) the classification task that the model needs to learn. Firstly, Kermany’s dataset is provided already split in training, testing and validation sets (version 2 of the dataset), with overlap at both subject-ID level and at image level. Thus, using the original_v2 dataset as provided, highly inflates the model performance, with a drop in performance seen when using a per-image split (overlap only on a subject-ID level not on an image level) and an even larger decrease in the case of the appropriate split. AIIMS and Srinivasan’s datasets are provided with data from each subject saved independently, with split being performed in the reviewed studies using a per-image or per-subject/volume strategy (see Table 1). Thus, for these datasets the effect of inflation was measured with respect to the overlap in subject-IDs between the training and testing sets, and not with respect to having the exact same image in both sets. Disregarding the way the datasets are provided, the second reason for which the effect of data leakage was different between the datasets could be attributed to the difficulty of the classification task. Looking at the results for the proper split, the model performance on AIIMS dataset was higher compared to the one for Kermany’s dataset, suggesting that the binary classification between cancer and normal breast tissue was easier compared to the four-class classification for Kermany’s dataset. When improperly splitting the data, the advantage given to the model in classifying the test data by training on samples in the testing set had a smaller impact on AIIMS dataset than for Kermany’s since the classification task was easier. However, it is not trivial what task can be considered easy since this is influenced by a multitude of factors, including data, model architecture, and model optimization. This study is limited in investigating only data leakage originating from the overlap between the training and testing sets. However, there are other sources of data leakage stemming from hyper-parameter optimization, data normalization, and data augmentation. During hyper-parameter optimization, model parameters as well as training strategies should not be tuned based on test performance. Similarly, in the case of data normalization, statistics such as mean and variance used for the normalization should only be computed on the training set, and not the entire dataset. Failing to do so results in biasing method design choices with information obtained from the testing set, thus compromising the evaluation of model generalization to new data. Finally, in leakage due to data augmentation, an image could be augmented multiple times, with its different augmented versions ending up in the training and testing set. This results in a data overlap similar to the one of a per-image split strategy, where images with the same structures and noise properties are in both training and testing sets. Both original splits in the Kermany and Srinivasan’s datasets are provided with images already augmented; however, overlap between training and testing with respect to data augmentation was not investigated in this work. In conclusion, the dataset split strategy that is used can have a substantial impact on the evaluation of deep learning models. In this paper, it is demonstrated that, in OCT image classification applications specifically, a per-image split strategy of the volumetric data adopted by a considerable number of studies, returns over-optimistic results on model performance and an inflation of test performance values. This calls into question the reliability of the assessments and hindering an objective comparison between research outcomes. This problem has also been demonstrated in 3D magnetic resonance (MR) imaging studies[13] and in digital pathology[25], where data leakage between the training and testing sets resulted in over-optimistic classification accuracy (>29% slide level classification accuracy in MR studies and up to 41% higher accuracy in digital pathology). Moreover, greater attention should be paid to the structure of datasets made available to the research community to avoid biasing the evaluation of different methods and undermining the usefulness of open-access datasets. With the increased interest of the research community in the use of deep learning methods for OCT image analysis, this study intends to raise awareness on a trivial but overlooked problem that can spoil research efforts if not addressed correctly.

Methods

Dataset description

AIIMS dataset

The AIIMS dataset is a collection of 18480 2D OCT images of healthy (n = 9450) and cancerous (n = 9030) breast tissue[14]. The images are obtained from volumetric acquisitions and are provided as BMP files of size 245 × 442 pixels organized per-class and per-subject (22 cancer subjects and 23 healthy subjects).

Srinivasan’s dataset

The Srinivasan’s ophthalmology dataset[16] collects a total of 3,231 2D OCT images of age-related macular degeneration (AMD), diabetic macular edema (DME), and normal subjects. For each class, data from 15 subjects is provided in independent folders. OCT images are given as TIFF files with 512 × 496 pixels saved after data augmentation (rotation and horizontal flip).

Kermany’s dataset

The Kermany’s ophthalmology dataset[15,22] is one of the largest open-access ophthalmology datasets[26] and is used by an extensive number of studies (see Table 1). The dataset contains images from 5319 patients (train = 4686, test = 633) for retina affected by choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen as well as from normal retina. The dataset is available in different versions with version 2 and version 3 (latest) both used in literature. The difference between the dataset versions is threefold: (1) the total number of available images and their organization, (2) the number of images in the given testing set and (3) the extent of data overlap between the given training and testing sets. For both dataset versions, images are given as JPEG files of sizes ranging [384 to 1536] × [496 to 512] pixels saved after data augmentation (rotation and horizontal flip). The version 2 of the dataset is provided with splits for training (n = 83484 images), validation (n = 32 images), and testing sets (n = 968 images), with validation and testing sets balanced with respect to the classes. The version 3 of the dataset is given with training (n = 108312) and testing (n = 1000, balanced between the four classes). In this study, the given splits are referred to as original_v2 and original_v3 splits, for version 2 and version 3 of the dataset, respectively. For both versions of the dataset, there is no specification on if the split between sets is performed before or after data augmentation as well as if the split in training and testing sets was performed per-image or per-volume/subject. By performing an automatic check on the original_v2 split (assuming that the naming convention is CLASS_subject-ID_bscan-ID), it was found that 92% of the test images belong to subject-IDs also found in the training set. Moreover, by visually inspecting the given splits it was possible to identify images in the testing set that were similar to the training set (an example of such a case is training image = DRUSEN-8086850-6, testing image = DRUSEN-8086850-1). When performing the same automatic check on the original_v3 split, no overlap between subject-IDs was found. In Table 1 it is specified which version of Kermany’s dataset was used by the different studies. Among these, two studies used a mixture of both datasets[23,27].

Dataset split

For all datasets, a custom split function was implemented to split the dataset per-image or per-volume. In either case, 1000 images from every class were assigned for testing in the case of Kermany’s and the AIIMS datasets. For the Srinivasan’s dataset, 250 images were selected for testing instead, given the smaller number of total images. Example images from AIIMS, Srinivasan’s and Kermany’s version 2 datasets are shown in Fig. 1b–d, respectively.

Model architecture and training strategy

The LightOCT model proposed by Butola et al.[10] was used in this study. LightOCT is a custom, shallow and multi-purpose network for OCT image classification composed of a two-layer CNN encoder, and one fully connected layer with softmax activation as output layer. The first and second convolutional layers have 8 and 32 convolutional filters, respectively. The kernel size of the filters in both layers is set to 5 × 5 and the output of each layer passes through a ReLU activation function[10]. A max-pooling operation is present between the first and the second convolutional layer that cuts the spatial dimension of the output of the first layer in half. The two-dimensional output of the CNN encoder is then flattened to a one dimensional vector, which is fed to the fully connected layer for classification. The number of nodes in the fully connected layer is changed based on the number of classes specified by the classification task[10]. For all of the classification tasks, the model was trained from scratch using stochastic gradient descent with momentum (m = 0.9) with a constant learning rate (lr = 0.0001). For all experiments, the batch size was set to 64 and the model was trained for 250 epochs without early stopping. Note that model architecture and training hyperparameters were not optimized for each dataset since it was out of the scope of this work. The model architecture as well as the training hyperparameters were chosen based on the results of Butola et al.[10] The model and the training routine were implemented in Tensorflow 2.6.2, and training was run on a computer with a 20-core CPU and 4 NVIDIA Tesla V100 GPUs.

Evaluation metrics

Models were trained on the original splits, if available, and on training and testing splits obtained using a per-image and per-volume/subject strategy. A ten-times repeated five-fold cross validation was run for both split strategies to ensure reliability of the presented results[28]. A multi-class confusion matrix was used to evaluate the classification performance of the model with Matthews Correlation Coefficient (MCC) obtained as a derived metric coherent with respect to class imbalance and stable to label randomization[29-31]. Accuracy, precision, recall and F1-score were also derived for each class using the definitions provided by Sokolova et al.[32] to allow comparison with previous studies. Additionally, receiver operator characteristic (ROC) curves were used along with the respective area under the curve (AUC). In an attempt to automatically detect bias due to data overlap, a random label experiment was carried out, where random labels were used for training a classifier and MCC was calculated on the test set (with original labels). To determine if the obtained MCC value was within the expected range, a null distribution was created for each dataset by creating 10000 test labels and prediction sets (also in this case random labels were used to simulate the random distribution) and calculating MCC for each of them. The MCC values obtained from the models trained using random labels on the different data splits (for which some have overlap) were compared to the respective null distribution (built without overlap) to calculate p-values using the one-sample Wilcoxon test (two-tailed).

24 in total

1. Deep-learning-aided forward optical coherence tomography endoscope for percutaneous nephrostomy guidance.

Authors: Chen Wang; Paul Calle; Nu Bao Tran Ton; Zuyuan Zhang; Feng Yan; Anthony M Donaldson; Nathan A Bradley; Zhongxin Yu; Kar-Ming Fung; Chongle Pan; Qinggong Tang
Journal: Biomed Opt Express Date: 2021-03-29 Impact factor: 3.732

Review 2. A survey on deep learning in medical image analysis.

Authors: Geert Litjens; Thijs Kooi; Babak Ehteshami Bejnordi; Arnaud Arindra Adiyoso Setio; Francesco Ciompi; Mohsen Ghafoorian; Jeroen A W M van der Laak; Bram van Ginneken; Clara I Sánchez
Journal: Med Image Anal Date: 2017-07-26 Impact factor: 8.545

3. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation.

Authors: Davide Chicco; Niklas Tötsch; Giuseppe Jurman
Journal: BioData Min Date: 2021-02-04 Impact factor: 2.522

4. Volumetric analysis of breast cancer tissues using machine learning and swept-source optical coherence tomography.

Authors: Ankit Butola; Azeem Ahmad; Vishesh Dubey; Vishal Srivastava; Darakhshan Qaiser; Anurag Srivastava; Paramsivam Senthilkumaran; Dalip Singh Mehta
Journal: Appl Opt Date: 2019-02-10 Impact factor: 1.980

5. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning.

Authors: Daniel S Kermany; Michael Goldbaum; Wenjia Cai; Carolina C S Valentim; Huiying Liang; Sally L Baxter; Alex McKeown; Ge Yang; Xiaokang Wu; Fangbing Yan; Justin Dong; Made K Prasadha; Jacqueline Pei; Magdalene Y L Ting; Jie Zhu; Christina Li; Sierra Hewett; Jason Dong; Ian Ziyar; Alexander Shi; Runze Zhang; Lianghong Zheng; Rui Hou; William Shi; Xin Fu; Yaou Duan; Viet A N Huu; Cindy Wen; Edward D Zhang; Charlotte L Zhang; Oulan Li; Xiaobo Wang; Michael A Singer; Xiaodong Sun; Jie Xu; Ali Tafreshi; M Anthony Lewis; Huimin Xia; Kang Zhang
Journal: Cell Date: 2018-02-22 Impact factor: 41.582

6. Why rankings of biomedical image analysis competitions should be interpreted with care.

Authors: Lena Maier-Hein; Matthias Eisenmann; Annika Reinke; Sinan Onogur; Marko Stankovic; Patrick Scholz; Tal Arbel; Hrvoje Bogunovic; Andrew P Bradley; Aaron Carass; Carolin Feldmann; Alejandro F Frangi; Peter M Full; Bram van Ginneken; Allan Hanbury; Katrin Honauer; Michal Kozubek; Bennett A Landman; Keno März; Oskar Maier; Klaus Maier-Hein; Bjoern H Menze; Henning Müller; Peter F Neher; Wiro Niessen; Nasir Rajpoot; Gregory C Sharp; Korsuk Sirinukunwattana; Stefanie Speidel; Christian Stock; Danail Stoyanov; Abdel Aziz Taha; Fons van der Sommen; Ching-Wei Wang; Marc-André Weber; Guoyan Zheng; Pierre Jannin; Annette Kopp-Schneider
Journal: Nat Commun Date: 2018-12-06 Impact factor: 14.919

7. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning.

Authors: Yun Xu; Royston Goodacre
Journal: J Anal Test Date: 2018-10-29

8. Effect of data leakage in brain MRI classification using 2D convolutional neural networks.

Authors: Ekin Yagis; Selamawet Workalemahu Atnafu; Alba García Seco de Herrera; Chiara Marzi; Riccardo Scheda; Marco Giannelli; Carlo Tessa; Luca Citi; Stefano Diciotti
Journal: Sci Rep Date: 2021-11-19 Impact factor: 4.379

9. Diagnosis of Pituitary Adenoma Biopsies by Ultrahigh Resolution Optical Coherence Tomography Using Neuronal Networks.

Authors: Alexander Micko; Fabian Placzek; Roger Fonollà; Michael Winklehner; Ryan Sentosa; Arno Krause; Greisa Vila; Romana Höftberger; Marco Andreana; Wolfgang Drexler; Rainer A Leitgeb; Angelika Unterhuber; Stefan Wolfsberger
Journal: Front Endocrinol (Lausanne) Date: 2021-10-18 Impact factor: 5.555

10. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.

Authors: Davide Chicco; Giuseppe Jurman
Journal: BMC Genomics Date: 2020-01-02 Impact factor: 3.969