Iulian Emil Tampu, Anders Eklund, Neda Haj-Hosseini.
Abstract
In the application of deep learning to optical coherence tomography (OCT) data, it is common to train classification networks using 2D images originating from volumetric data. Given the micrometer resolution of OCT systems, consecutive images are often very similar in both visible structures and noise. Thus, an inappropriate data split can result in overlap between the training and testing sets, an aspect that a large portion of the literature overlooks. In this study, the effect of improper dataset splitting on model evaluation is demonstrated for three classification tasks using three extensively used open-access OCT datasets: Kermany's and Srinivasan's ophthalmology datasets and the AIIMS breast tissue dataset. Results show that classification performance is inflated by 0.07 up to 0.43 in terms of Matthews Correlation Coefficient (accuracy: 5% to 30%) for models tested on datasets with improper splitting, highlighting the considerable effect of dataset handling on model evaluation. This study intends to raise awareness of the importance of dataset splitting given the increased research interest in implementing deep learning on OCT data.
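The difference between the two split strategies discussed in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the `volume` and `slice` keys, and the function names are assumptions made for illustration only. A per-image split shuffles individual b-scans, so near-identical neighbouring slices from one volume can end up on both sides of the split, whereas a per-volume split assigns whole volumes to one side.

```python
import random
from collections import defaultdict


def per_image_split(images, test_fraction=0.2, seed=0):
    """Shuffle individual b-scans, ignoring the volume they come from."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def per_volume_split(images, test_fraction=0.2, seed=0):
    """Assign every b-scan of a volume to the same side of the split."""
    rng = random.Random(seed)
    by_volume = defaultdict(list)
    for img in images:
        by_volume[img["volume"]].append(img)
    volumes = sorted(by_volume)
    rng.shuffle(volumes)
    n_test = max(1, int(round(len(volumes) * test_fraction)))
    test_volumes = set(volumes[:n_test])
    train = [im for v in volumes if v not in test_volumes for im in by_volume[v]]
    test = [im for v in volumes if v in test_volumes for im in by_volume[v]]
    return train, test


def shared_volumes(train, test):
    """Volumes that contribute images to both the training and testing sets."""
    return {im["volume"] for im in train} & {im["volume"] for im in test}


# Toy data (hypothetical): 10 volumes with 20 consecutive b-scans each.
images = [{"volume": v, "slice": s} for v in range(10) for s in range(20)]

train_i, test_i = per_image_split(images)
train_v, test_v = per_volume_split(images)

print("volumes shared, per-image split: ", len(shared_volumes(train_i, test_i)))
print("volumes shared, per-volume split:", len(shared_volumes(train_v, test_v)))
```

In practice the grouping key would be the subject rather than the volume whenever several volumes come from the same subject.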
Year: 2022 PMID: 36138025 PMCID: PMC9500039 DOI: 10.1038/s41597-022-01618-6
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Schematic of an OCT volume with examples of consecutive slices (b-scans) from the three open-access OCT datasets used in this study. In (a), consecutive 2D b-scans rendering a 3D OCT volume are pictured. Here an example from the AIIMS dataset[14] is used for illustrative purposes. In (b), the consecutive b-scans separated by ~18 micrometers are examples of healthy breast tissue (Patient 15, volume 0046, slices 0075 and 0076) from the AIIMS dataset[14]. In (c), images of healthy retina from Srinivasan’s dataset[16] (folder NORMAL5, slices 032 and 033) can be seen. In (d), consecutive images of retina affected by choroidal neovascularization (CNV 81630-33 and 81630-34) from Kermany’s OCT2017 version 2 dataset[15] are shown. Note that the b-scans from both Kermany’s and Srinivasan’s datasets are shown with the data augmentation that was originally applied.
Summary of reviewed literature with a focus on dataset split and reported test classification performance.
| Ref. | OCT dataset | Data split strategy | Model performance on testing set |
|---|---|---|---|
| [ | Thyroid, parathyroid, fat and muscle samples | | 97.12% accuracy |
| [ | Pituitary adenoma | | 0.96 AUC |
| [ | Ophthalmology[ | | 95.55% accuracy |
| [ | Ophthalmology[ | | 99.1% accuracy |
| [ | Ophthalmology[ | | 98.7% accuracy |
| [ | Ophthalmology[ | | 96.6% accuracy |
| [ | Ophthalmology[ | | 99.6% accuracy |
| [ | (1) Ophthalmology[ | (1) (2) | (1) 99.80% accuracy (2) 100% accuracy |
| [ | Coronary artery | | 96.05% accuracy |
| [ | Kidney† | | 82.6% accuracy |
| [ | High and low grade brain tumors | | 97% accuracy |
| [ | Colon**† | | 88.95% accuracy on 2D images |
| [ | Breast tissue | | 91.7% specificity |
| [ | Ophthalmology[ | | 98.46% accuracy |
| [ | (1) Ophthalmology[ (2) Ophthalmology[ (3) Breast tissue[ | (1) | (1) ~96% accuracy (2) >98.8% accuracy (3) 98.8% accuracy |
| [ | (1) Ophthalmology[ (2) Ophthalmology[ (3) Ophthalmology[ (4) Ophthalmology[ | (1) (2) (3) (4) | (1) 96.66% accuracy (2) 98.97% accuracy (3) 99.74% accuracy (4) 99.78% accuracy |
| [ | Dentistry | No description given | 98% sensitivity 100% specificity |
| [ | Ophthalmology | No description given | 99.19% accuracy |
Open-access datasets and the ones available upon request are marked by * and **, respectively. The dataset is not open-access if not specified. Datasets obtained from animal model samples are marked by †. The difference in performance between studies using the same datasets results from the different methods implemented.
LightOCT model performance on the AIIMS[14], Srinivasan’s[16] and Kermany’s[15] datasets with training, validation and testing sets split using different strategies.
| Dataset | Split strategy | MCC [−1, 1] (m ± std) | AUC [0,1] (m ± std) | F1-score [0,1] (m ± std) | Accuracy [0,1] (m ± std) | Precision [0,1] (m ± std) | Recall [0,1] (m ± std) |
|---|---|---|---|---|---|---|---|
| AIIMS[ | Per-image | 0.958 ± 0.038 | 1.000 ± 0.000 | 0.978 ± 0.021 | 0.978 ± 0.021 | 1.000 ± 0.000 | 0.978 ± 0.021 |
| | Per-volume/subject | 0.881 ± 0.102 | 0.996 ± 0.009 | 0.934 ± 0.063 | 0.935 ± 0.060 | 0.993 ± 0.014 | 0.935 ± 0.060 |
| Srinivasan[ | Per-image | 0.853 ± 0.039 | 0.985 ± 0.005 | 0.898 ± 0.030 | 0.899 ± 0.029 | 0.973 ± 0.009 | 0.899 ± 0.029 |
| | Per-volume/subject | 0.426 ± 0.116 | 0.817 ± 0.055 | 0.593 ± 0.088 | 0.603 ± 0.078 | 0.702 ± 0.078 | 0.603 ± 0.078 |
| Kermany[ | Original | 0.886 | 0.993 | 0.909 | 0.911 | 0.983 | 0.911 |
| | Per-image | 0.707 ± 0.021 | 0.953 ± 0.003 | 0.764 ± 0.022 | 0.770 ± 0.019 | 0.886 ± 0.007 | 0.770 ± 0.019 |
| | Per-volume/subject | 0.588 ± 0.025 | 0.890 ± 0.006 | 0.644 ± 0.033 | 0.669 ± 0.023 | 0.769 ± 0.012 | 0.669 ± 0.023 |
| Kermany[ | Original | 0.644 | 0.964 | 0.678 | 0.704 | 0.916 | 0.704 |
| | Per-image | 0.673 ± 0.021 | 0.950 ± 0.003 | 0.729 ± 0.022 | 0.738 ± 0.019 | 0.886 ± 0.007 | 0.738 ± 0.019 |
| | Per-volume/subject | 0.600 ± 0.021 | 0.911 ± 0.006 | 0.651 ± 0.028 | 0.671 ± 0.021 | 0.795 ± 0.012 | 0.671 ± 0.021 |
For the per-image and per-volume/subject splits, performance metrics are reported as mean ± standard deviation (m ± std) over classes and over the models trained through ten-times repeated five-fold cross-validation. For the original splits given by Kermany, results are reported for the single given split. AUC: area under the receiver operating characteristic curve, MCC: Matthews Correlation Coefficient.
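A ten-times repeated five-fold cross-validation with a per-volume/subject split can be sketched as follows. This is only an illustrative outline under stated assumptions, not the procedure used in the study: `repeated_group_kfold` and the volume identifiers are hypothetical, only train and test groups are produced (no validation set), and class balance across folds is not handled.

```python
import random


def repeated_group_kfold(groups, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_groups, test_groups) tuples.

    Folds are built from group identifiers (volumes or subjects), so all
    images of one group always fall on the same side of a split.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = unique[:]
        rng.shuffle(order)  # a new random partition for every repetition
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold, test_groups in enumerate(folds):
            test_set = set(test_groups)
            train_set = set(order) - test_set
            yield repeat, fold, train_set, test_set


# Hypothetical identifiers for 23 volumes.
volume_ids = [f"vol_{i:02d}" for i in range(23)]
splits = list(repeated_group_kfold(volume_ids))

# 10 repetitions x 5 folds, i.e. one split per trained model.
print(len(splits))
```

Each of the resulting splits would be used to train and test one model, and the metrics would then be summarized as mean ± standard deviation over those models.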
Fig. 2 Comparison of the Matthews Correlation Coefficient for the LightOCT model trained using different dataset split strategies. Each box plot summarizes the test MCC for the 50 models trained through a ten-times repeated five-fold cross-validation. Results are presented for all four datasets, with the per-image split strategy shown in striped green and the per-volume/subject split strategy in dotted orange. For Kermany’s datasets, the results of the models trained on the original_v2 and original_v3 splits are shown as a full black circle and a full black cross, respectively. Outliers are shown as diamonds (♦).
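For reference, the Matthews Correlation Coefficient for a two-class problem can be computed from the confusion-matrix counts as in the sketch below. The function name `binary_mcc` is hypothetical, and the binary form is shown only for illustration: the datasets with more than two classes require a multi-class or per-class formulation, and the source does not state which one was used. Returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math


def binary_mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to 1
    (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: undefined cases (empty row or column) map to 0.
    return numerator / denominator if denominator else 0.0


print(binary_mcc(tp=50, fp=0, tn=50, fn=0))    # perfect classifier
print(binary_mcc(tp=25, fp=25, tn=25, fn=25))  # chance-level classifier
```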