Adam Hanif, İlkay Yıldız, Peng Tian, Beyza Kalkanlı, Deniz Erdoğmuş, Stratis Ioannidis, Jennifer Dy, Jayashree Kalpathy-Cramer, Susan Ostmo, Karyn Jonas, R. V. Paul Chan, Michael F. Chiang, J. Peter Campbell.
Abstract
Purpose: To compare the efficacy and efficiency of training neural networks for medical image classification using comparison labels indicating relative disease severity versus diagnostic class labels, using a retinopathy of prematurity (ROP) image dataset.
Design: Evaluation of diagnostic test or technology.
Participants: Deep learning neural networks trained on expert-labeled wide-angle retinal images obtained from patients undergoing diagnostic ROP examinations as part of the Imaging and Informatics in ROP (i-ROP) cohort study.
Keywords: ANOVA, analysis of variance; AUC, area under the receiver operating characteristic curve; Artificial intelligence; Deep learning; ICROP, International Classification of Retinopathy of Prematurity; Labels; Neural networks; ROP, retinopathy of prematurity; Retinopathy of prematurity; i-ROP, Imaging and Informatics in ROP
Year: 2022 PMID: 36249702 PMCID: PMC9560533 DOI: 10.1016/j.xops.2022.100122
Source DB: PubMed Journal: Ophthalmol Sci ISSN: 2666-9145
Distribution of Plus Disease Severity Classes within Datasets
| Dataset | Normal | Preplus | Plus | Total |
|---|---|---|---|---|
| i-ROP | 54 | 31 | 15 | 100 |
| ICROP | 6 | 10 | 14 | 30 |
| Test dataset | 4577 | 812 | 172 | 5561 |
ICROP = International Classification of Retinopathy of Prematurity; i-ROP = Imaging and Informatics in ROP.
Figure 1. Diagram showing the labeling process. Graders were asked to perform 2 tasks. A, They were given a single image at a time and asked to label the image as plus, preplus, or no plus. B, They were shown a pair of images and asked to choose the image that represented more severe disease.
Figure 2. Schematic diagram illustrating the training, validation, and testing process involved in developing the neural networks applied to 1 of 2 binary classification tasks: normal versus abnormal and plus versus nonplus. RSD = reference standard diagnosis.
Figure 3. Flow diagram showing a simplified depiction of neural network training between class and comparison labels in experiments A and B. Sixty percent of images from either the Imaging and Informatics in ROP (i-ROP) or International Classification of Retinopathy of Prematurity (ICROP) datasets were selected randomly. This selection then was balanced so as to achieve a near-even distribution of images represented by each of the 3 severity classes. In experiment A, the total number of class labels assigned to these images by expert graders then was used to train a neural network. Similarly, all comparison labels associated with the same images in this balanced training set were used to train a neural network for performance comparison. In experiment B, a set of class labels each corresponding to a single image in the balanced test set was used for training a neural network and was compared with a neural network trained on an equivalent number of comparison labels. E = total number of expert graders. ROP = retinopathy of prematurity.
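The core idea behind comparison-label training, as described in Figure 3, is that graders judge which of two images shows more severe disease rather than assigning an absolute class. One standard way to learn from such pairwise judgments is a Bradley-Terry-style logistic model over latent severity scores; the sketch below (with hypothetical toy data, not the paper's actual loss or architecture) shows how a consistent set of comparisons recovers a severity ordering.

```python
import numpy as np

# Hypothetical toy data: 4 images; each pair (i, j) means image i was
# judged MORE severe than image j by a grader.
comparisons = [(3, 0), (3, 1), (2, 0), (2, 1), (3, 2), (1, 0)]
n_images = 4

# Bradley-Terry-style model: each image k gets a scalar severity score
# s_k, and P(i more severe than j) = sigmoid(s_i - s_j). In the paper's
# setting, a CNN would produce s_k from the image; here the scores are
# free parameters fit by gradient descent on the negative log-likelihood.
scores = np.zeros(n_images)
lr = 0.1
for _ in range(500):
    grad = np.zeros(n_images)
    for i, j in comparisons:
        p = 1.0 / (1.0 + np.exp(-(scores[i] - scores[j])))
        # d/ds_i of -log sigmoid(s_i - s_j) is -(1 - p); opposite for s_j
        grad[i] -= 1.0 - p
        grad[j] += 1.0 - p
    scores -= lr * grad

# The learned ordering recovers the implied severity: 3 > 2 > 1 > 0
ranking = np.argsort(-scores)
print(list(ranking))  # → [3, 2, 1, 0]
```

A thresholded severity score can then serve the same binary tasks (normal vs. abnormal, plus vs. nonplus) evaluated in the paper.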
Figure 4. Line graphs showing experiment A neural network performance. A, B, Normal versus abnormal (A) and plus versus nonplus (B) classification tasks from models trained on class or comparison labels corresponding to images within the Imaging and Informatics in ROP (i-ROP) dataset. No statistically significant difference was found between models trained on either label type. C, D, Classification performances from models trained on class or comparison labels corresponding to images within the International Classification of Retinopathy of Prematurity (ICROP) dataset. Training on comparison labels yielded significantly higher areas under the receiver operating characteristic curve (AUCs) than training on class labels (2-way analysis of variance: normal vs. abnormal: F = 30.41; main effect, P = 0.0006; plus vs. nonplus: F = 5.83; main effect, P = 0.04). In the normal versus abnormal task (C), the average AUC from training with comparison labels associated with 3 images was significantly higher than from training with class labels associated with the same number of images (P = 0.008, Welch's t test).
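The AUC reported throughout these figures has a useful probabilistic reading: it equals the probability that a randomly chosen positive (e.g., abnormal) image receives a higher model score than a randomly chosen negative (normal) one, with ties counted as one half (the normalized Mann-Whitney U statistic). A minimal illustration with hypothetical scores:

```python
def auc(pos_scores, neg_scores):
    """AUC via the Mann-Whitney U statistic: fraction of positive-negative
    pairs ranked correctly, with ties counted as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model outputs for a normal-versus-abnormal task
abnormal = [0.9, 0.8, 0.7, 0.4]
normal = [0.6, 0.3, 0.2, 0.1]
print(auc(abnormal, normal))  # → 0.9375 (15 of 16 pairs ranked correctly)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.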
Figure 5. Line graphs showing experiment B neural network performance. A, B, Normal versus abnormal (A) and plus versus nonplus (B) classification tasks from models trained on class or comparison labels within the Imaging and Informatics in ROP (i-ROP) dataset. A, B, Average area under the receiver operating characteristic curve (AUC) from training with 156 comparison labels was significantly higher than that measured from training with class labels (Welch's t test: normal vs. abnormal, P = 0.002; plus vs. nonplus, P = 0.02). Training on comparison labels yielded significantly higher AUCs than training on class labels (2-way analysis of variance [ANOVA]: normal vs. abnormal: F = 12.16; main effect, P = 0.003; plus vs. nonplus: F = 8.77; main effect, P = 0.009). C, D, Classification performances from models trained on class or comparison labels corresponding to images within the International Classification of Retinopathy of Prematurity (ICROP) dataset. Training on comparison labels yielded significantly higher AUCs than training on class labels in both classification tasks (normal vs. abnormal: 2-way ANOVA: F = 13.93; main effect, P = 0.003; plus vs. nonplus: F = 7.14; main effect, P = 0.02). In the normal versus abnormal task (C), the average AUC from training with 204 comparison labels was significantly higher than that measured from training with class labels (P = 0.002, Welch's t test).
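The per-point comparisons in Figures 4 and 5 use Welch's t test, which, unlike Student's t test, does not assume equal variances between the two groups of AUCs. A sketch of the statistic and the Welch-Satterthwaite degrees of freedom, applied to hypothetical AUC samples from repeated training runs (not the paper's actual values):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                         # squared standard error
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical AUCs from 5 training runs per label type
auc_comparison = [0.92, 0.90, 0.93, 0.91, 0.94]
auc_class = [0.85, 0.88, 0.84, 0.87, 0.86]
t, df = welch_t(auc_comparison, auc_class)
print(round(t, 2), round(df, 1))  # → 6.0 8.0
```

The P value is then obtained from the t distribution with (generally fractional) df degrees of freedom; in practice one would use a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` rather than hand-rolling this.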