Amirhossein Kiani¹, Bora Uyumazturk¹, Pranav Rajpurkar¹, Alex Wang¹, Rebecca Gao², Erik Jones¹, Yifan Yu¹, Curtis P Langlotz³,⁴, Robyn L Ball³, Thomas J Montine³,⁵, Brock A Martin⁵, Gerald J Berry⁵, Michael G Ozawa⁵, Florette K Hazard⁵, Ryanne A Brown⁵, Simon B Chen⁵, Mona Wood⁵, Libby S Allard⁵, Lourdes Ylagan⁵, Andrew Y Ng¹, Jeanne Shen³,⁵.
Abstract
Artificial intelligence (AI) algorithms continue to rival human performance on a variety of clinical tasks, while their actual impact on human diagnosticians, when incorporated into clinical workflows, remains relatively unexplored. In this study, we developed a deep learning-based assistant to help pathologists differentiate between two subtypes of primary liver cancer, hepatocellular carcinoma and cholangiocarcinoma, on hematoxylin and eosin-stained whole-slide images (WSI), and evaluated its effect on the diagnostic performance of 11 pathologists with varying levels of expertise. Our model achieved accuracies of 0.885 on a validation set of 26 WSI, and 0.842 on an independent test set of 80 WSI. Although use of the assistant did not change the mean accuracy of the 11 pathologists (p = 0.184, OR = 1.281), it significantly improved the accuracy (p = 0.045, OR = 1.499) of a subset of nine pathologists who fell within well-defined experience levels (GI subspecialists, non-GI subspecialists, and trainees). In the assisted state, model accuracy significantly impacted the diagnostic decisions of all 11 pathologists. As expected, when the model's prediction was correct, assistance significantly improved accuracy (p = 0.000, OR = 4.289), whereas when the model's prediction was incorrect, assistance significantly decreased accuracy (p = 0.000, OR = 0.253), with both effects holding across all pathologist experience levels and case difficulty levels. Our results highlight the challenges of translating AI models into the clinical setting, and emphasize the importance of taking into account potential unintended negative consequences of model assistance when designing and testing medical AI-assistance tools.
Keywords: Liver cancer; Machine learning; Pathology
Year: 2020 PMID: 32140566 PMCID: PMC7044422 DOI: 10.1038/s41746-020-0232-8
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 Graphical user interfaces for the experiment.
The pathologists navigated the slide using the ObjectiveView image viewer. After identifying a tumor region of interest (ROI), they used the ‘crop’ tool to save a 500 × 500 μm image patch containing the ROI at ×10 objective magnification (a). (The horizontal scale bar denotes 200 μm.) After uploading the image patch to the diagnostic assistant’s user interface (b), they received a probability for each diagnosis (here, HCC), with an accompanying class activation map to assist with interpretation.
Dataset and patient characteristics.
| Source | Dataset | Class | No. of slides | Median patient ageᵃ | No. of female patientsᵇ |
|---|---|---|---|---|---|
| TCGA | Total | HCC | 35 | 57.0 (17.0) | 11 (31.4) |
| TCGA | Total | CC | 35 | 64.0 (13.0) | 20 (57.1) |
| TCGA | Training | HCC | 10 | 56.5 (13.5) | 1 (10.0) |
| TCGA | Training | CC | 10 | 65.0 (10.75) | 8 (80.0) |
| TCGA | Tuning | HCC | 12 | 63.0 (16.0) | 5 (41.6) |
| TCGA | Tuning | CC | 12 | 71.0 (8.75) | 6 (50.0) |
| TCGA | Validation | HCC | 13 | 55.0 (21.0) | 5 (38.5) |
| TCGA | Validation | CC | 13 | 59.0 (10.0) | 6 (46.1) |
| Stanford | Independent test | HCC | 40 | 64.5 (8.25) | 10 (25.0) |
| Stanford | Independent test | CC | 40 | 63.0 (14.75) | 14 (35.0) |

ᵃ The interquartile range (IQR) is provided in parentheses.
ᵇ The percentage of female patients is provided in parentheses.
Fig. 2 Experimental design.
Each of the 11 pathologists was randomly assigned to either test Order 1 or 2. Each test began with a brief practice block of 4 whole-slide images (WSI; 2 HCC and 2 CC), followed by 8 experiment blocks of 10 WSI each, with Order 1 beginning with assistance and Order 2 beginning without assistance. The same 80 experiment WSI were reviewed in the same sequence during Tests 1 and 2, across both test Orders.
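The block structure in this caption can be sketched as a small schedule builder. The caption states only which condition each Order starts with; the block-by-block alternation of assistance used below is an assumption for illustration, and the names `Block` and `build_schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    index: int                 # 0 = practice block, 1..8 = experiment blocks
    n_slides: int              # number of WSI in the block
    assisted: Optional[bool]   # None for practice (condition not stated in the caption)


def build_schedule(order: int, n_blocks: int = 8, block_size: int = 10,
                   practice_size: int = 4) -> List[Block]:
    """Build one test session for test Order 1 or 2.

    ASSUMPTION: assistance alternates from one experiment block to the next.
    The caption only says Order 1 begins with assistance and Order 2 without.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    starts_assisted = (order == 1)
    schedule = [Block(index=0, n_slides=practice_size, assisted=None)]
    for i in range(n_blocks):
        # Even offsets share the starting condition, odd offsets get the other one.
        assisted = starts_assisted if i % 2 == 0 else not starts_assisted
        schedule.append(Block(index=i + 1, n_slides=block_size, assisted=assisted))
    return schedule


for order in (1, 2):
    blocks = build_schedule(order)
    pattern = "".join("A" if b.assisted else "U" for b in blocks[1:])
    print(f"Order {order}: {pattern}")
```

Under this assumption, each pathologist reads half of the 80 experiment WSI with assistance in a given test, and the two Orders mirror each other block for block.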
Fig. 3 Pathologist diagnostic workflow with assistance.
After reviewing the H&E whole-slide image (a), the pathologist extracts a tumor patch at ×10 objective magnification (b) and uploads it to the cloud-based model, which outputs predicted probabilities for cholangiocarcinoma (CC) and hepatocellular carcinoma (HCC) into the user interface (c), as well as corresponding class activation maps (not shown). These outputs are integrated with the pathologist’s diagnostic impression, to result in a final assisted diagnosis.
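To make the two model outputs in this workflow concrete, here is a minimal NumPy sketch of class probabilities and a class activation map. The source does not describe the network architecture or how its maps were computed; the sketch assumes the standard CAM setup (global average pooling followed by a linear classifier), and the shapes, random inputs, and function names are hypothetical.

```python
import numpy as np


def class_probabilities(features: np.ndarray, class_weights: np.ndarray) -> np.ndarray:
    """Softmax probabilities from final conv features.

    features: (C, H, W) feature maps; class_weights: (num_classes, C).
    ASSUMPTION: global average pooling followed by a bias-free linear layer.
    """
    pooled = features.mean(axis=(1, 2))      # global average pooling -> (C,)
    logits = class_weights @ pooled          # (num_classes,)
    exp = np.exp(logits - logits.max())      # subtract max for numerical stability
    return exp / exp.sum()


def class_activation_map(features: np.ndarray, class_weights: np.ndarray,
                         class_index: int) -> np.ndarray:
    """Weighted sum of feature maps for one class, rescaled to [0, 1]."""
    cam = np.tensordot(class_weights[class_index], features, axes=1)  # (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Hypothetical stand-ins for a trained network's features and weights.
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))             # 8 channels on a 7x7 spatial grid
weights = rng.normal(size=(2, 8))            # row 0 = CC, row 1 = HCC (arbitrary order)

probs = class_probabilities(features, weights)
cam = class_activation_map(features, weights, class_index=int(probs.argmax()))
print(probs, cam.shape)
```

In a viewer, the low-resolution map would typically be upsampled to the patch size and overlaid as a heatmap; that step is omitted here.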
Pathologist unassisted and assisted accuracies, by experience levelᵃ.
| Assistance | GI specialists | Non-GI specialists | Trainees | Pathologists NOC |
|---|---|---|---|---|
| Assisted | 0.963 (0.930, 0.980) (n = …) | 0.871 (0.822, 0.910) (n = …) | 0.896 (0.851, 0.928) (n = …) | 0.931 (0.881, 0.961) (n = …) |
| Unassisted | 0.946 (0.909, 0.968) (n = …) | 0.842 (0.790, 0.882) (n = …) | 0.858 (0.809, 0.897) (n = …) | 0.969 (0.929, 0.987) (n = …) |

ᵃ The average accuracy of each pathologist subgroup is presented, along with the 95% confidence interval (in parentheses) and the number of correct diagnoses made (n).
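For readers who want to see how a 95% confidence interval on an accuracy can be obtained, the following is a minimal sketch using the Wilson score interval. The source does not say which interval method was used, and the counts below (231 correct out of 240 reads) are hypothetical, chosen only because they give an accuracy near 0.963.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts, not taken from the source.
low, high = wilson_interval(correct=231, total=240)
print(f"accuracy = {231 / 240:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With these assumed counts the interval lands close to the one reported for assisted GI specialists, which is suggestive but does not confirm either the method or the underlying counts.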
Fig. 4 Impact of assistance on individual pathologist diagnostic performance.
The average diagnostic accuracy (across the set of 80 experiment WSI) for each pathologist is plotted as follows: gray circle (unassisted) = accuracy of the unassisted pathologist, star (model) = accuracy of the model alone (based on pathologist-selected input patches), purple diamond (assisted) = accuracy of the pathologist with model assistance.
Impact of assistance on diagnostic accuracy under different conditionsᵃ.
| | Assistance (all 11 pathologists) | Assistance (subset of 9 pathologists) | Model correct | Model incorrect |
|---|---|---|---|---|
| OR (95% CI) | 1.281 (0.882, 1.862) | 1.499 (1.007, 2.230) | 4.289 (2.360, 7.794) | 0.253 (0.126, 0.507) |
| p-value | 0.184 | 0.045 | 0.000 | 0.000 |

ᵃ The results of mixed-effects logistic regression analyses evaluating the impact of assistance on diagnostic accuracy are presented as odds ratios (OR) for pathologist diagnostic accuracy, with 95% confidence intervals (95% CI) and p-values from likelihood ratio testing (a two-tailed p ≤ 0.05 was considered statistically significant).
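An odds ratio is easier to read once it is applied to a baseline accuracy. The sketch below does that conversion for the two model-conditional ORs in the table. The 0.90 baseline is a hypothetical value, not a figure from the study, and because the reported ORs come from a mixed-effects model they are conditional on pathologist and case, so the resulting accuracies only illustrate scale.

```python
def apply_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Accuracy implied by multiplying the baseline odds of a correct diagnosis by an OR."""
    if not 0 < baseline_accuracy < 1:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_accuracy / (1 - baseline_accuracy)
    new_odds = odds_ratio * baseline_odds
    return new_odds / (1 + new_odds)


BASELINE = 0.90  # hypothetical unassisted accuracy, for illustration only

for label, odds_ratio in [("model correct", 4.289), ("model incorrect", 0.253)]:
    implied = apply_odds_ratio(BASELINE, odds_ratio)
    print(f"{label}: OR = {odds_ratio} -> implied accuracy {implied:.3f}")
```

The asymmetry is the point of the illustration: near a high baseline, an OR above 1 has little room to raise accuracy, while an OR well below 1 can lower it substantially.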