Kunal Nagpal, Davis Foote, Fraser Tan, Yun Liu, Po-Hsuan Cameron Chen, David F Steiner, Naren Manoj, Niels Olson, Jenny L Smith, Arash Mohtashamian, Brandon Peterson, Mahul B Amin, Andrew J Evans, Joan W Sweet, Carol Cheung, Theodorus van der Kwast, Ankur R Sangoi, Ming Zhou, Robert Allan, Peter A Humphrey, Jason D Hipp, Krishna Gadepalli, Greg S Corrado, Lily H Peng, Martin C Stumpe, Craig H Mermel.
Abstract
Importance: For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice.
Objective: To evaluate the ability of a deep learning system (DLS) to grade diagnostic prostate biopsy specimens.
Design, Setting, and Participants: The DLS was evaluated using 752 deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens obtained from 3 institutions in the United States, including 1 institution not used for DLS development. To obtain the Gleason grade group (GG), each specimen was first reviewed by 2 expert urologic subspecialists from a multi-institutional panel of 6 individuals (years of experience: mean, 25 years; range, 18-34 years). A third subspecialist reviewed discordant cases to arrive at a majority opinion. To reduce diagnostic uncertainty, all subspecialists had access to an immunohistochemical-stained section and 3 histologic sections for every biopsied specimen. Their review was conducted from December 2018 to June 2019.
Main Outcomes and Measures: The frequency of the exact agreement of the DLS with the majority opinion of the subspecialists in categorizing each tumor-containing specimen as 1 of 5 categories: nontumor, GG1, GG2, GG3, or GG4-5. For comparison, the rate of agreement of 19 general pathologists' opinions with the subspecialists' majority opinions was also evaluated.
Year: 2020 PMID: 32701148 PMCID: PMC7378872 DOI: 10.1001/jamaoncol.2020.2485
Source DB: PubMed Journal: JAMA Oncol ISSN: 2374-2437 Impact factor: 31.777
Characteristics of the Validation Sets
| Source or diagnosis | ML1, No. | Tertiary teaching hospital, No. | External validation set (ML2), No. | Total, No. |
|---|---|---|---|---|
| Biopsy specimens from each source | 387 | 52 | 371 | 810 |
| Biopsy specimens excluded due to image quality, poor staining, or artifacts impeding diagnosis | 1 | 6 | 48 | 55 |
| Biopsy specimens excluded due to presence of ungradable variants | 2 | 0 | 1 | 3 |
| Cases included (1 biopsy specimen per case) | 384 | 46 | 322 | 752 |
| Nontumor | 94 | 13 | 147 | 254 |
| Tumor-containing | 290 | 33 | 175 | 498 |
| Grade group 1 | 147 | 24 | 76 | 247 |
| Grade group 2 | 72 | 6 | 44 | 122 |
| Grade group 3 | 46 | 2 | 22 | 70 |
| Grade group 4-5 | 25 | 1 | 33 | 59 |
The validation sets contain prostate core biopsy cases from 3 institutions: a large tertiary teaching hospital and 2 medical laboratories (ML1 and ML2) in the United States. A representative core specimen was selected from each case. Despite overlap in the data source for ML1 and the tertiary teaching hospital between the development and validation data sets, the cases and biopsy specimens did not overlap.
The deep learning system was developed using data from ML1 and the tertiary teaching hospital sources, but not from ML2. Thus ML2 represents an external validation data set.
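The inclusion arithmetic in the table above can be checked with a few lines of Python. The counts are copied from the table; the variable names and the helper `included` are illustrative only and do not come from the study.

```python
# Counts per source, copied from the validation-set table.
# Order of values: (ML1, tertiary teaching hospital, ML2).
collected = (387, 52, 371)
excluded_quality = (1, 6, 48)    # image quality, poor staining, or artifacts
excluded_variants = (2, 0, 1)    # ungradable variants
nontumor = (94, 13, 147)
grade_groups = {
    "GG1": (147, 24, 76),
    "GG2": (72, 6, 44),
    "GG3": (46, 2, 22),
    "GG4-5": (25, 1, 33),
}


def included(i):
    """Cases remaining for source i after both exclusion steps."""
    return collected[i] - excluded_quality[i] - excluded_variants[i]


included_per_source = [included(i) for i in range(3)]
tumor_per_source = [sum(gg[i] for gg in grade_groups.values()) for i in range(3)]

print(included_per_source, sum(included_per_source))
print(tumor_per_source, sum(tumor_per_source))
```

The per-source totals reproduce the 752 included cases and the 498 tumor-containing specimens reported in the table.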
Figure 1. Comparison of Deep Learning System (DLS) and Pathologist Agreement Rates With Subspecialists at Gleason Grading of Tumor-Containing Biopsy Specimens
A, Subspecialists review every biopsy to determine its grade group (GG) (see Box and Methods). Next, those GG determinations are compared with those of the DLS and the general pathologists. B, Agreement rates with subspecialists for the DLS and pathologists across all 498 tumor-containing biopsy specimens. C, Subanalysis considering 175 tumor-containing biopsy specimens from only the external validation set (medical laboratory 2). Because every pathologist reviewed only a subset of the cases, to represent every pathologist equally, the agreement rate shown for the general pathologists is the mean across all general pathologists. For the subanalysis presented in panel C, pathologists who conducted fewer than 20 reviews were excluded to avoid skewing the results (applied to 4 pathologists). Error bars represent 95% CIs.
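The averaging described in the caption (one agreement rate per pathologist, then the mean of those rates, with an optional minimum number of reviews) might be sketched as follows. The function name, the data layout, and the toy reviews are assumptions made for illustration; they are not the study's code or data.

```python
from statistics import mean


def mean_pathologist_agreement(reviews, reference, min_reviews=0):
    """Average per-pathologist agreement rates so each pathologist counts equally.

    reviews: dict mapping pathologist id -> {case id: category}
    reference: dict mapping case id -> subspecialist majority category
    min_reviews: pathologists with fewer reviews than this are left out
    """
    rates = []
    for calls in reviews.values():
        if len(calls) < min_reviews:
            continue
        matches = sum(1 for case, cat in calls.items() if reference[case] == cat)
        rates.append(matches / len(calls))
    return mean(rates)


# Hypothetical toy data: three pathologists reviewing different subsets.
reference = {"c1": "GG1", "c2": "GG2", "c3": "GG3", "c4": "GG1"}
reviews = {
    "p1": {"c1": "GG1", "c2": "GG2", "c3": "GG2", "c4": "GG1"},  # 3 of 4 agree
    "p2": {"c1": "GG1", "c2": "GG3"},                            # 1 of 2 agree
    "p3": {"c4": "GG2"},                                         # 0 of 1 agree
}

all_rate = mean_pathologist_agreement(reviews, reference)
filtered_rate = mean_pathologist_agreement(reviews, reference, min_reviews=2)
print(round(all_rate, 3), filtered_rate)
```

Averaging per-pathologist rates, rather than pooling all reviews, keeps a pathologist with many reviews from dominating the result. A review-count floor such as the 20 used for panel C would be passed as `min_reviews=20`.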
Agreement Rates of the DLS and General Pathologists With the Subspecialists’ Majority Opinion at 3 Clinically Important Decision Cutoffs
| Clinical task, evaluation metric | DLS, % (95% CI) | General pathologist, % (95% CI) |
|---|---|---|
| Nontumor vs tumor determination (n = 752) | | |
| Agreement with subspecialist majority opinion | 94.3 (92.4-95.9) | 94.7 (92.8-96.3) |
| Sensitivity | 95.5 (93.7-96.8) | 92.8 (90.0-95.1) |
| Specificity | 91.7 (88.2-94.6) | 97.0 (95.1-98.6) |
| Grading of tumor-containing biopsy specimens | | |
| Agreement with subspecialist majority opinion for GG1 vs GG2-5 (n = 498) | 86.1 (83.1-89.2) | 80.6 (77.9-83.5) |
| Agreement with subspecialist majority opinion for GG1-2 vs GG3-5 (n = 498) | 92.8 (90.8-94.9) | 86.0 (83.2-88.5) |
Abbreviations: DLS, deep learning system; GG, grade group.
Similar to Figure 1, the agreement rate of the general pathologists represents the mean rate across all general pathologists.
The higher value in the row.
Agreement on 2 Gleason grading thresholds.
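As a rough illustration of how agreement, sensitivity, and specificity at a binary cutoff can be derived from the 5 categories, consider the sketch below. The ordinal encoding, the function names, and the toy labels are assumptions for this example and are not taken from the study.

```python
# Ordinal encoding of the five categories (an assumption made for this sketch).
ORDER = {"nontumor": 0, "GG1": 1, "GG2": 2, "GG3": 3, "GG4-5": 4}


def binarize(category, threshold):
    """True when the category is at or above the threshold category."""
    return ORDER[category] >= ORDER[threshold]


def cutoff_metrics(predicted, reference, threshold):
    """Agreement, sensitivity, and specificity at one binary cutoff."""
    tp = tn = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        p, r = binarize(pred, threshold), binarize(ref, threshold)
        if p and r:
            tp += 1
        elif not p and not r:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return {
        "agreement": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical labels for six specimens.
reference = ["nontumor", "GG1", "GG2", "GG3", "GG4-5", "nontumor"]
predicted = ["nontumor", "GG1", "GG1", "GG3", "GG3", "GG1"]

# Nontumor vs tumor uses every specimen.
tumor = cutoff_metrics(predicted, reference, "GG1")

# Grading cutoffs use only specimens the reference calls tumor-containing.
pairs = [(p, r) for p, r in zip(predicted, reference) if r != "nontumor"]
pred_t, ref_t = zip(*pairs)
gg1_vs_rest = cutoff_metrics(pred_t, ref_t, "GG2")
gg12_vs_rest = cutoff_metrics(pred_t, ref_t, "GG3")
print(tumor, gg1_vs_rest["agreement"], gg12_vs_rest["agreement"])
```

In this sketch a prediction of nontumor on a tumor-containing specimen falls on the low side of each grading cutoff; how such cases were handled in the study is not stated in the text above.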
Mean Absolute Difference in Gleason Pattern Quantitation Relative to Subspecialists
| Gleason pattern | No. | Deep learning system, subspecialist discordance, % (95% CI) | Pathologist, subspecialist discordance, % (95% CI) |
|---|---|---|---|
| 3 (Tumor-containing specimens) | 498 | 9.2 (8.0-10.5) | 14.0 (12.4-15.6) |
| 4 (Tumor-containing specimens) | 498 | 10.0 (8.6-11.2) | 16.3 (14.6-18.1) |
| 5 (Tumor-containing specimens) | 498 | 1.5 (0.9-2.1) | 3.2 (2.2-4.3) |
| 4 (Grade group 2 specimens only) | 122 | 12.0 (10.4-13.6) | 22.0 (19.6-24.6) |
Gleason pattern quantitation reflects the proportion of tumor in each biopsy specimen that is characterized as each Gleason pattern. The mean absolute differences in Gleason pattern quantitation are measured against the mean of subspecialist quantitation results for all tumor-containing biopsy specimens (rows 1-3) or grade group 2 biopsy specimens only (row 4).
Lower absolute differences (higher agreement rate in Gleason pattern quantitation).
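The mean absolute difference described above could be computed along these lines. This is a minimal sketch with hypothetical percentages; the function name and the use of exactly 2 subspecialist estimates per specimen are assumptions, not details from the study.

```python
def mean_absolute_difference(estimates, reference_means):
    """Mean absolute difference, in percentage points, between paired values."""
    if len(estimates) != len(reference_means):
        raise ValueError("inputs must have the same length")
    diffs = [abs(e - r) for e, r in zip(estimates, reference_means)]
    return sum(diffs) / len(diffs)


# Hypothetical Gleason pattern 4 percentages for four specimens.
subspecialist_a = [10, 40, 0, 60]
subspecialist_b = [20, 30, 0, 70]

# Reference value per specimen: the mean of the subspecialist estimates.
reference = [(a + b) / 2 for a, b in zip(subspecialist_a, subspecialist_b)]

model = [20, 35, 5, 50]  # hypothetical DLS estimates for the same specimens
mad = mean_absolute_difference(model, reference)
print(reference, mad)
```

A lower value means the estimates sit closer to the subspecialist mean, which is the sense in which the table treats lower differences as better.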
Figure 2. Illustrative Concept of How Deep Learning System (DLS) Results May Be Presented to a Pathologist
These cases were graded by both the DLS and subspecialists as grade groups 1 (A), 2 (B), and 3 (C). The DLS provides both a glandular-level Gleason pattern categorization and a biopsy-level Gleason score and grade group. Left column represents low-power magnification view of the Gleason pattern categorization; middle column, 10× magnification of the indicated area from the left column; right column, the DLS-generated Gleason score and Gleason pattern quantitation. In the left column, green represents DLS-categorized Gleason pattern 3; yellow, DLS-categorized Gleason pattern 4.