Guy Nir, Davood Karimi, S Larry Goldenberg, Ladan Fazli, Brian F Skinnider, Peyman Tavassoli, Dmitry Turbin, Carlos F Villamil, Gang Wang, Darby J S Thompson, Peter C Black, Septimiu E Salcudean.
Abstract
Importance: Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies.
Year: 2019 PMID: 30848813 PMCID: PMC6484626 DOI: 10.1001/jamanetworkopen.2019.0442
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Figure. Interobserver Variability in Grading of Prostate Cancer in Hematoxylin-Eosin–Stained Tissue Scanned at ×40 Magnification
The contours show each pathologist's detailed annotations on an example tissue core. The majority-vote label derived from those annotations is overlaid on the image, along with the automatic classification result for each patch (a small rectangular subimage) from one of the cross-validation experiments. Gleason grade 3 indicates low-grade cancer; grades 4 and 5 indicate high-grade cancer.
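Patch-based analysis, as in the figure above, tiles each scanned core into small rectangular subimages that are classified independently. The sketch below illustrates the idea with NumPy; the patch size, non-overlapping grid, and function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def tile_into_patches(core_image: np.ndarray, patch_size: int = 256):
    """Tile an H x W x 3 core image into non-overlapping square patches.

    Hypothetical sketch: the paper classifies small rectangular subimages
    ("patches"); the exact patch size and overlap are implementation details.
    """
    h, w = core_image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(core_image[y:y + patch_size, x:x + patch_size])
            coords.append((y, x))  # top-left corner of each patch
    return np.stack(patches), coords

core = np.zeros((1024, 1024, 3), dtype=np.uint8)  # placeholder core image
patches, coords = tile_into_patches(core)
print(patches.shape)  # (16, 256, 256, 3)
```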
Selected Previous Work
| Source | Classification Type | Data | Multiple Experts | Validation Type | Results |
|---|---|---|---|---|---|
| Monaco et al | Cancer vs benign | 20 Patients and 40 slides | No | Slide-based cross-validation | Sensitivity, 0.87; specificity, 0.90 |
| Doyle et al | Cancer vs benign | 58 Patients and 100 slides | No | Slide-based and patient-based | Image-based accuracies, 0.69, 0.70, and 0.69; patient-based accuracies, 0.74, 0.66, and 0.57 |
| Gorelick et al | Cancer vs benign; high-grade vs low-grade | 15 Patients, 50 slides, and 991 patches | No | Patch-based cross-validation | 90% for cancer vs benign and 85% for high-grade vs low-grade |
| Nguyen et al | Benign, Gleason grade 3, and Gleason grade 4 | 29 Patients and 317 patches | No | Patch-based 10-fold cross-validation | 87.3% Accuracy for Gleason grade 3 vs 4 |
| Arvaniti et al | Gleason grades 3, 4, and 5 | 641 + 245 Patients and an unspecified number of patches | Yes (trained on 1 expert, evaluated on 2) | Patient-based (1 partitioning) | 58% Recall on the test set |
| Nir et al | Gleason grades 3, 4, and 5 | 231 Patients, 333 cores, and approximately 16 000 patches | Yes (6 for training and testing) | Patient-based cross-validation (repeated partitioning) | Benign vs cancer: accuracy, 90.2%; sensitivity, 91.3%; specificity, 84.0%. Gleason grade 3 vs grades 4-5: accuracy, 76.6%; sensitivity, 75.9%; specificity, 77.9% |
Results of the Cross-Validation Experiments
| Cross-Validation Method | Classification | Accuracy, Mean (SD), % | Sensitivity, Mean (SD), % | Specificity, Mean (SD), % |
|---|---|---|---|---|
| 20-Fold leave-patients-out | Benign vs cancer | 85.8 (4.3) | 86.3 (4.1) | 85.5 (7.2) |
| | Low-grade vs high-grade | 81.2 (3.7) | 82.4 (5.0) | 82.0 (8.1) |
| 20-Fold leave-cores-out | Benign vs cancer | 86.7 (3.7) | 87.2 (4.0) | 87.7 (5.5) |
| | Low-grade vs high-grade | 83.4 (4.5) | 86.2 (6.4) | 84.2 (4.9) |
| 20-Fold leave-patches-out | Benign vs cancer | 97.8 (1.2) | 98.5 (1.0) | 97.5 (1.2) |
| | Low-grade vs high-grade | 92.2 (4.5) | 93.8 (5.8) | 90.8 (6.0) |
| 2-Fold leave-patches-out | Benign vs cancer | 96.8 (0.0) | 96.8 (0.0) | 97.2 (0.0) |
| | Low-grade vs high-grade | 87.0 (0.0) | 84.1 (0.0) | 94.1 (0.0) |
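The large gap between the leave-patches-out rows and the leave-patients-out rows is consistent with data leakage: patches from the same patient (and core) are strongly correlated, so splitting at the patch level lets near-duplicates of the test data appear in training and inflates the estimates. A minimal sketch of the two splitting strategies, assuming scikit-learn and toy feature vectors (not the authors' code):

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
n_patches = 1000
X = rng.normal(size=(n_patches, 16))               # toy patch feature vectors
y = rng.integers(0, 2, size=n_patches)             # benign (0) vs cancer (1)
patient_id = rng.integers(0, 50, size=n_patches)   # ~20 patches per patient

# Leave-patches-out: KFold ignores patients, so correlated patches from the
# same patient can land in both train and test folds (optimistic estimates).
naive_cv = KFold(n_splits=20, shuffle=True, random_state=0)

# Leave-patients-out: GroupKFold keeps every patch of a patient in one fold.
grouped_cv = GroupKFold(n_splits=20)
for train_idx, test_idx in grouped_cv.split(X, y, groups=patient_id):
    assert set(patient_id[train_idx]).isdisjoint(set(patient_id[test_idx]))
```

GroupKFold guarantees that no patient contributes patches to both sides of a split, which is the property the leave-patients-out protocol relies on.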
Results of the McNemar Test for Comparison of Different Cross-Validation Methods in Terms of Their Accuracy in Classifying Image Patches as Benign or Cancerous
| Cross-Validation Method | 20-Fold Leave-Patients-Out | 20-Fold Leave-Cores-Out | 20-Fold Leave-Patches-Out | 2-Fold Leave-Patches-Out |
|---|---|---|---|---|
| 20-Fold leave-patients-out | NA | NA | NA | NA |
| 20-Fold leave-cores-out | <.001 | NA | NA | NA |
| 20-Fold leave-patches-out | <.001 | <.001 | NA | NA |
| 2-Fold leave-patches-out | <.001 | <.001 | <.001 | NA |
Abbreviation: NA, not applicable. Cell values are P values of the McNemar test comparing the patch-classification accuracies of the row and column methods.
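McNemar's test compares two classifiers evaluated on the same patches using only the discordant pairs (patches that exactly one of the two gets right). The sketch below uses the continuity-corrected chi-squared form, a common variant; the article does not state which form was used, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_p(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """P value of McNemar's test from per-patch correctness indicators.

    correct_a, correct_b: boolean arrays, True where each cross-validation
    variant classified the corresponding patch correctly.
    """
    b = int(np.sum(correct_a & ~correct_b))  # A correct, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B correct
    if b + c == 0:                           # no discordant pairs
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected statistic
    return float(chi2.sf(stat, df=1))        # chi-squared, 1 degree of freedom
```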
Results of the Cross-Expert Experiment
| Source of Label Used as Ground Truth for Training | With Pathologist 1 | With Pathologist 2 | With Pathologist 3 | With Pathologist 4 | With Pathologist 5 | With Pathologist 6 | Overall |
|---|---|---|---|---|---|---|---|
| Pathologist 1 | 0.48 | 0.39 | 0.40 | 0.36 | 0.37 | 0.31 | 0.38 |
| Pathologist 2 | 0.35 | 0.61 | 0.45 | 0.44 | 0.50 | 0.44 | 0.48 |
| Pathologist 3 | 0.40 | 0.58 | 0.64 | 0.61 | 0.58 | 0.52 | 0.58 |
| Pathologist 4 | 0.33 | 0.57 | 0.60 | 0.63 | 0.57 | 0.57 | 0.57 |
| Pathologist 5 | 0.40 | 0.57 | 0.59 | 0.55 | 0.61 | 0.50 | 0.58 |
| Pathologist 6 | 0.38 | 0.52 | 0.51 | 0.53 | 0.46 | 0.47 | 0.50 |
| Majority vote of all pathologists | 0.46 | 0.63 | 0.65 | 0.65 | 0.60 | 0.59 | 0.60 |
Values are the quadratic-weighted κ agreement of the automatic classifier, trained on the labels in each row, with the annotations of the pathologist in each column.
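Quadratic-weighted κ treats the grades as ordinal and penalizes a disagreement by the squared distance between the two labels, so confusing grade 3 with grade 5 costs more than confusing grade 3 with grade 4. A minimal sketch, assuming scikit-learn and toy label arrays (the integer encoding of grades is an illustrative choice):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Ordinal patch labels: 0 = benign, 1 = grade 3, 2 = grade 4, 3 = grade 5.
classifier_labels = np.array([1, 1, 2, 3, 0, 2])   # toy predictions
pathologist_labels = np.array([1, 2, 2, 3, 0, 1])  # toy annotations

# weights="quadratic" gives the quadratic-weighted kappa used in the table.
kappa = cohen_kappa_score(classifier_labels, pathologist_labels,
                          weights="quadratic")
print(f"quadratic-weighted kappa = {kappa:.2f}")
```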