Jiayi Shen1,2, Casper J P Zhang3, Bangsheng Jiang4,5, Jiebin Chen6, Jian Song7, Zherui Liu6, Zonglin He4,5, Sum Yi Wong4,5, Po-Han Fang4,5, Wai-Kit Ming1,4,8,9.
Abstract
BACKGROUND: Artificial intelligence (AI) has been extensively used in a range of medical fields to promote therapeutic development. The development of diverse AI techniques has also contributed to early detection, disease diagnosis, and referral management. However, concerns about the value of advanced AI in disease diagnosis have been raised by health care professionals, medical service providers, and health policy decision makers.
Keywords: artificial intelligence; deep learning; diagnosis; diagnostic imaging; image interpretation, computer-assisted; patient-centered care
Year: 2019 PMID: 31420959 PMCID: PMC6716335 DOI: 10.2196/10010
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Flow diagram of the study inclusion and exclusion process.
Table 1. Characteristics of included studies.
| Authors (year) | Artificial intelligence technology | Classification/labeling | Data source; sample sizes (total dataset, training set, validation/tuning set, test set) | Training process | Internal validation | Human clinicians (external validation) |
| --- | --- | --- | --- | --- | --- | --- |
| Brinker (2019) | A convolutional neural network (CNN) trained with enhanced techniques on dermoscopic images | All melanomas were verified by histopathological evaluation of biopsies; the nevi were declared benign via expert consensus | International Skin Imaging Collaboration (ISIC) image archive | A ResNet50 CNN model (residual learning) used for the classification of melanomas and atypical nevi (the general fine-tuning pattern is sketched after this table) | Not reported | One hundred forty-five dermatologists from 12 German university hospitals (using 100 images) |
| De Fauw (2018) | A segmentation CNN model using a 3-dimensional U-Net architecture | Referral suggestion: urgent/semiurgent/routine/observation only (gold-standard labels were obtained retrospectively by examining patient clinical records to determine the final diagnosis and optimal referral pathway in light of subsequently obtained information) | Clinical OCT scans (Topcon 3D OCT; Topcon, Japan) | 1) Deep segmentation network, trained with manually segmented OCT scans; 2) resulting tissue segmentation map; 3) deep classification network, trained with tissue maps with confirmed diagnoses and optimal referral decisions; 4) predicted diagnosis probabilities and referral suggestions | Manually segmented and graded by 3 trained ophthalmologists, reviewed and edited by a senior ophthalmologist | |
| Esteva (2017) | Deep CNNs (a GoogLeNet Inception v3 CNN architecture pretrained on the ImageNet dataset) | Biopsy-proven clinical images with 2 critical binary classification tasks, labeled by dermatologists | Eighteen clinician-curated, open-access online repositories and clinical data from Stanford University Medical Center | 1) Classification of skin lesions using a single CNN; 2) trained end-to-end from images directly, using only pixels and disease labels as inputs | Two dermatologists (at both 3-class and 9-class disease partitions) using 9-fold cross-validation | Twenty-one board-certified dermatologists on epidermal and melanocytic lesion classification |
| Han (2018) | A region-based convolutional neural network (R-CNN) | Four classes (onychomycosis, nail dystrophy, onycholysis, and melanonychia) and 6 classes (onychomycosis, nail dystrophy, onycholysis, melanonychia, normal, and others), manually categorized by dermatologists | Four hospitals (Asan Medical Center, Inje University, Hallym University, and Seoul National University) | 1) Extracted clinical photographs automatically cropped by the R-CNN; 2) one dermatologist cropped all of the images from the A2 dataset, and the R-CNN model was trained using information about the crop location; 3) a fine-image selector was trained to exclude unfocused photographs; 4) three dermatologists tagged clinical diagnoses to the nail images generated by the R-CNN, with reference to the existing diagnosis tagged in the original image; 5) an ensemble model computed the output of both the ResNet-152 and VGG-19 systems with feedforward neural networks | Two classes (onychomycosis or not) | 1) Forty-two dermatologists (16 professors, 13 clinicians with more than 10 years of experience in the department of dermatology, and 8 residents) and 57 individuals from the general population (11 general practitioners, 13 medical students, 15 nurses in the dermatology department, and 18 nonmedical persons) in the combined B1+C dataset; 2) the best 5 dermatologists among them in the combined B2+D dataset |
| Kermany (2018) | Deep CNN (also used transfer learning) | Four categories (3 labels): choroidal neovascularization or diabetic macular edema (labeled as "urgent referrals"), drusen (labeled as "routine referrals"), and normal (labeled as "observation") | Optical coherence tomography (OCT) images selected from retrospective cohorts of adult patients from the Shiley Eye Institute of the University of California San Diego, the California Retinal Research Foundation, Medical Center Ophthalmology Associates, the Shanghai First People's Hospital, and Beijing Tongren Eye Center between July 1, 2013, and March 1, 2017 | After 100 epochs (iterations through the entire dataset), training was stopped because of the absence of further improvement in both accuracy and cross-entropy loss | 1000 images randomly selected from the images used for training (limited model) | Six experts with significant clinical experience in an academic ophthalmology center |
| Long (2017) | Deep CNN | Binary classification by an expert panel in terms of opacity area (extensive vs limited), opacity density (dense vs nondense), and opacity location (central vs peripheral) | Childhood Cataract Program of the Chinese Ministry of Health (CCPMOH) | The championship model from the ImageNet Large Scale Visual Recognition Challenge 2014, containing 5 convolutional or down-sampling layers in addition to 3 fully connected layers | K-fold cross-validation (K=5) | Three ophthalmologists with varying expertise (expert, competent, and novice) |
| Nam (2018) | Deep learning–based automatic detection algorithm (DLAD) | Binary classification: normal or nodule chest radiographs (image-level labeling); nodule chest radiographs were obtained from patients with malignant pulmonary nodules proven at pathologic analysis, and normal chest radiographs were identified on the basis of their radiology reports; all chest radiographs were carefully reviewed by thoracic radiologists | Normal and nodule chest radiographs from 3 Korean hospitals (Seoul National University Hospital, Boramae Hospital, and National Cancer Center) and 1 US hospital (University of California San Francisco Medical Center) | DLAD was trained in a semisupervised manner using all of the image-level labels, partially annotated by 13 board-certified radiologists, with 25 layers and 8 residual connections | Radiograph classification and nodule detection performance of DLAD was validated using 1 internal and 4 external datasets in terms of the area under the receiver operating characteristic curve (AUROC) and the figure of merit (FOM) from jackknife alternative free-response ROC (JAFROC) analysis | Eighteen physicians (3 nonradiology physicians, 6 radiology residents, 5 board-certified radiologists, and 4 subspecialty-trained thoracic radiologists) |
| Rajpurkar (2018) | Deep CNN with a 121-layer DenseNet architecture (CheXNeXt) | Binary values (absence/presence) for 14 pathologies (atelectasis, cardiomegaly, consolidation, edema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia, and pneumothorax), obtained using automatic extraction methods on radiology reports | ChestX-ray14 dataset | 1) Multiple networks were trained on the training set to predict the probability that each of the 14 pathologies is present in the image; 2) a subset of those networks, each chosen on the basis of average error on the tuning set, constituted an ensemble that produced predictions by computing the mean over the predictions of the individual networks | Comprehensive comparison of the CheXNeXt algorithm to practicing radiologists across 7 performance metrics (ie, no external validation) | Nine radiologists (6 board-certified radiologists and 3 senior radiology residents from 3 institutions) |
| González-Castro (2017) | Support vector machine (SVM) classifier | Binary classification of the burden of enlarged perivascular spaces (PVS) as low or high | Data from 264 patients at Royal Hallamshire Hospital | Several combinations of the regularization parameter C and the kernel parameter gamma were assessed with all descriptors to find the optimal configuration, using the implementation provided by the libSVM library | Stratified 5-fold cross-validation repeated 10 times | Two observers (an experienced neuroradiologist and a trained image analyst) |
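Most of the deep learning models in this table follow the same transfer-learning pattern: take a CNN pretrained on ImageNet (ResNet50 in Brinker, Inception v3 in Esteva, the transfer-learning model in Kermany), replace its classification head, and fine-tune it on labeled medical images. The following is a minimal sketch of that general pattern only, not the pipeline of any reviewed study; it assumes PyTorch/torchvision is available, and the dataset path and the 2-class task are hypothetical.

```python
# Sketch of fine-tuning an ImageNet-pretrained CNN on a 2-class image dataset.
# Assumptions: PyTorch/torchvision installed; a hypothetical folder layout
# train/<class_name>/*.jpg; none of this is taken from the reviewed studies.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing so inputs match the pretrained weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Load a ResNet-50 pretrained on ImageNet and replace the final fully
# connected layer with a new 2-class head (e.g., melanoma vs nevus).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

A common variant is to freeze the pretrained convolutional layers and train only the new head first, which is often preferred when the medical dataset is small.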
Table 2. Comparison between artificial intelligence and human clinicians.
All indices are reported as AI^a versus human clinicians.

| Authors (year) | Accuracy | AUC^b | Sensitivity | Specificity | Error/weighted error | False positives | Other indices |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Brinker (2019) | N/A^c | Details provided in the article | Sensitivity (at specificity=73.3%): 86.1% versus 86.7% (among 3 resident dermatologists) | Specificity (at sensitivity=89.4%): mean=68.2% (range 47.5%-86.25%) versus mean=64.4% (all 145 dermatologists; range 22.5%-92.5%); specificity (at sensitivity=92.8%): mean=61.1% versus mean=57.7% (among 16 attending dermatologists) | N/A | N/A | N/A |
| De Fauw (2018) | N/A | No comparison | N/A | N/A | N/A | N/A | N/A |
| Esteva (2017) | N/A | AUC of AI was reported but no comparison with human clinicians (details provided in the article) | AI outperformed the average of the dermatologists (details provided in the article) | AI outperformed the average of the dermatologists (details provided in the article) | N/A | N/A | N/A |
| González-Castro (2017) | N/A | AUC (model 1): 0.9265 versus 0.9813 and 0.9074; AUC (model 2): 0.9041 versus 0.8395 and 0.8622; AUC (model 3): 0.9152 versus 0.9411 and 0.8934 | N/A | N/A | N/A | N/A | N/A |
| Han (2018) | N/A | N/A | Youden index (sensitivity + specificity - 1; its computation is sketched after this table): B1+C dataset: >67.62% (trained with the A1 dataset) and >63.03% (trained with the A2 dataset) versus 48.39% (99% CI 29.16%-67.62%; 95% CI 33.76%-63.03%); B2+D dataset: only 1 dermatologist performed better than the ensemble model trained with the A1 dataset, and only once in 3 experiments | N/A | N/A | N/A | N/A |
| Kermany (2018) | 96.6% versus 95.9% (mean; range 92.1%-99.7%) | N/A | 97.8% versus 99.3% (mean; range 98.2%-100%) | 97.4% versus 95.4% (mean; range 82%-99.8%) | 6.6% versus 4.8% (mean; range 0.4%-10.5%) | N/A | N/A |
| Long (2017) | Accuracy (distinguishing patients from healthy individuals): 100% versus 98% (expert), 98% (competent), and 96% (novice) [mean=97.33%] | N/A | N/A | N/A | N/A | N/A | N/A |
| Nam (2018) | N/A | AUROC (radiograph classification): 0.91 versus mean=0.885 (DLAD higher than 16 physicians and significantly higher than 11); JAFROC FOM (nodule detection): 0.885 versus mean=0.794 (DLAD higher than all physicians and significantly higher than 15) | 80.7% versus mean=70.4% | No report of physicians' performance | N/A | 0.3 versus mean=0.25 | N/A |
| Rajpurkar (2018) | Mean proportion correct for all pathologies: 0.828 (SD 0.12) versus 0.675 (SD 0.15; board-certified radiologists) and 0.654 (SD 0.16; residents) | AUC (cardiomegaly): 0.831 versus 0.888 | CheXNeXt versus board-certified radiologists | CheXNeXt versus board-certified radiologists | N/A | N/A | N/A |
^a AI: artificial intelligence.
^b AUC: area under the receiver operating characteristic curve.
^c N/A: not applicable.
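The indices compared in the table above (accuracy, sensitivity, specificity, Youden index, AUC) are all derived from a classifier's scores against ground-truth labels. As a minimal illustration, assuming scikit-learn is available and using made-up labels and scores (not data from any reviewed study):

```python
# Computing the comparison indices from Table 2 for a binary classifier.
# Assumption: scikit-learn installed; y_true/y_score below are illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # ground truth
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])   # model scores
y_pred = (y_score >= 0.5).astype(int)                          # 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)            # true-positive rate
specificity = tn / (tn + fp)            # true-negative rate
youden = sensitivity + specificity - 1  # Youden index, as in Han (2018)
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} Youden={youden:.3f} AUC={auc:.3f}")
```

Operating points such as Brinker's "sensitivity at specificity=73.3%" come from sweeping the decision threshold along the same ROC curve rather than fixing it at 0.5.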
Figure 2. Distribution of bias in the included studies.
Figure 3. Risk of bias in the included studies.