Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer From Biopsy Specimens.

Kunal Nagpal¹, Davis Foote¹, Fraser Tan¹, Yun Liu¹, Po-Hsuan Cameron Chen¹, David F Steiner¹, Naren Manoj¹,², Niels Olson³, Jenny L Smith³, Arash Mohtashamian³, Brandon Peterson³, Mahul B Amin⁴, Andrew J Evans⁵, Joan W Sweet⁵, Carol Cheung⁵, Theodorus van der Kwast⁵, Ankur R Sangoi⁶, Ming Zhou⁷, Robert Allan⁸, Peter A Humphrey⁹, Jason D Hipp¹,¹⁰, Krishna Gadepalli¹, Greg S Corrado¹, Lily H Peng¹, Martin C Stumpe¹,¹¹, Craig H Mermel¹.

Abstract

Importance: For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice.
Objective: To evaluate the ability of a deep learning system (DLS) to grade diagnostic prostate biopsy specimens.
Design, Setting, and Participants: The DLS was evaluated using 752 deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens obtained from 3 institutions in the United States, including 1 institution not used for DLS development. To obtain the Gleason grade group (GG), each specimen was first reviewed by 2 expert urologic subspecialists from a multi-institutional panel of 6 individuals (years of experience: mean, 25 years; range, 18-34 years). A third subspecialist reviewed discordant cases to arrive at a majority opinion. To reduce diagnostic uncertainty, all subspecialists had access to an immunohistochemical-stained section and 3 histologic sections for every biopsied specimen. Their review was conducted from December 2018 to June 2019.
Main Outcomes and Measures: The frequency of the exact agreement of the DLS with the majority opinion of the subspecialists in categorizing each tumor-containing specimen as 1 of 5 categories: nontumor, GG1, GG2, GG3, or GG4-5. For comparison, the rate of agreement of 19 general pathologists' opinions with the subspecialists' majority opinions was also evaluated.
Results: For grading tumor-containing biopsy specimens in the validation set (n = 498), the rate of agreement with subspecialists was significantly higher for the DLS (71.7%; 95% CI, 67.9%-75.3%) than for general pathologists (58.0%; 95% CI, 54.5%-61.4%) (P < .001). In subanalyses of biopsy specimens from an external validation set (n = 322), the Gleason grading performance of the DLS remained similar. For distinguishing nontumor from tumor-containing biopsy specimens (n = 752), the rate of agreement with subspecialists was 94.3% (95% CI, 92.4%-95.9%) for the DLS and similar at 94.7% (95% CI, 92.8%-96.3%) for general pathologists (P = .58).
Conclusions and Relevance: In this study, the DLS showed higher proficiency than general pathologists at Gleason grading prostate needle core biopsy specimens and generalized to an independent institution. Future research is necessary to evaluate the potential utility of using the DLS as a decision support tool in clinical workflows and to improve the quality of prostate cancer grading for therapy decisions.

Year: 2020 PMID: 32701148 PMCID: PMC7378872 DOI: 10.1001/jamaoncol.2020.2485

Source DB: PubMed Journal: JAMA Oncol ISSN: 2374-2437 Impact factor: 31.777

Introduction

Prostate cancer is a leading cause of morbidity and mortality for men.[1] Its treatment is determined based largely on the pathologic evaluation of a prostate biopsy,[2] an imperfect diagnostic tool. The heterogeneous tumor growth patterns observed in a biopsy are characterized by the Gleason grading system in terms of their degree of differentiation (ranging from Gleason pattern 3, representing well-differentiated glands, to Gleason pattern 5, representing poorly differentiated cells). Ultimately, biopsy specimens are categorized into Gleason grade groups (GG) based on the proportions of the Gleason patterns present in a biopsy, with higher GG indicating greater clinical risk. These GGs are inherently subjective by virtue of relying on the visual assessment of cell differentiation and Gleason pattern predominance. Consequently, it is common for different pathologists to assign a different GG to the same biopsy (30%-50% discordances).[3,4,5,6,7,8] In general, pathologists with urologic subspeciality training show higher rates of interobserver agreement than general pathologists,[9] and reviews by experts lead to more accurate risk stratification than reviews by less experienced pathologists.[10,11] Because important treatment decisions rely on assessment of prostate biopsy specimens and there is limited availability of expert subspecialists, the development of an automated system for assessing prostate biopsy specimens with expert-level performance could help improve the clinical utility of the prostate biopsy. We developed a deep learning system (DLS) for reading digitized prostate biopsy specimen sections with the intent of achieving performance comparable to expert subspecialists. We evaluated the rate of model agreement with the majority opinion of several experienced subspecialists and compared this performance to a panel of general pathologists who independently reviewed the same biopsy specimens.

Methods

Data Sets

Deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens were obtained from 4 sources, each with independent tissue processing and staining: 2 independent medical laboratories (ML1 and ML2), a tertiary teaching hospital, and a university hospital. The ML1, tertiary teaching hospital, and university hospital biopsy specimens were used for DLS development, and the ML1, ML2, and tertiary teaching hospital biopsy specimens were used for validation. Biopsy specimens from ML2 served as an external validation data set; these specimens were used for independent validation only and not used for DLS development (Table 1). Additional details are presented in the Slide Preparation and Image Digitization section of the eMethods in the Supplement. Ethics approval for the use of these deidentified slides in this study was granted by the Naval Medical Center San Diego Institutional Review Board, which also waived the need for obtaining informed patient consent because the data were deidentified. No patients received compensation or were offered any incentive for participating in this study.

Table 1.

Characteristics of the Validation Sets

Source or diagnosis	ML1, No.^a	Tertiary teaching hospital, No.^a	External validation set (ML2), No.^a,b	Total, No.
Biopsy specimens from each source	387	52	371	810
Biopsy specimens excluded due to image quality, poor staining, or artifacts impeding diagnosis	1	6	48	55
Biopsy specimens excluded due to presence of ungradable variants	2	0	1	3
Cases included (1 biopsy specimen per case)	384	46	322	752
Nontumor	94	13	147	254
Tumor-containing	290	33	175	498
Grade group
1	147	24	76	247
2	72	6	44	122
3	46	2	22	70
4-5	25	1	33	59

^a The validation sets contain prostate core biopsy cases from 3 institutions: a large tertiary teaching hospital and 2 medical laboratories (ML1 and ML2) in the United States. A representative core specimen was selected from each case. Despite overlap in the data source for ML1 and the tertiary teaching hospital between the development and validation data sets, the cases and biopsy specimens did not overlap.

^b The deep learning system was developed using data from ML1 and the tertiary teaching hospital sources, but not from ML2. Thus ML2 represents an external validation data set.

Each specimen was randomly assigned to either the development or validation sets such that there was no overlap in slides between the development and validation sets. One specimen per case was selected for inclusion in the study, with selection of a tumor-containing specimen where available. Specimens with nongradable prostate cancer variants or with quality issues preventing diagnosis were excluded from the study. Additional details including the splitting of the development set for DLS training and tuning are presented in Table 1 and in eTable 1 in the Supplement.

Pathologic Examination of Prostate Biopsy Specimens

All pathologists participating in this study, including the general pathologists and urologic subspecialists (M.B.A., A.J.E., J.W.S., C.C., T.K., A.R.S., M.Z., R.A., and P.A.H.), were US board-certified or Canadian board-certified, and reviewed the pathology slides for each biopsy in a manner consistent with the International Society of Urological Pathology 2014 and College of American Pathologists guidelines with no time constraint.[12,13] If the specimen did not contain Gleason-gradable adenocarcinoma, it was classified as nontumor. Otherwise, to assign the final GG, the pathologists provided the relative amount of tumor corresponding to each Gleason pattern, specifically, the percentage of each that was considered Gleason pattern 3, 4, or 5. Gleason patterns 1 and 2 are not used in contemporary Gleason grading. The corresponding GG (GG1, GG2, GG3, or GG4-5) was then derived from the relative proportions of the Gleason patterns (Box).[13] Because of their low incidence and often similar treatment implications, GG4 and GG5 were collapsed into a single group.

Box.

1. Identify whether a tumor is present.
2. When a tumor is present, categorize regions of the tumor as 1 of 3 Gleason patterns: 3, 4, or 5.
3. Quantify the relative amounts of each pattern.
4. Sum the top 2 most prevalent patterns to determine the Gleason score. Under certain conditions, a third-most prevalent pattern is also used at this step.
5. Map the Gleason score to a grade group. Both the Gleason score and grade group are part of standard reporting.

The grade group system was designed to facilitate mapping of Gleason scores into discrete prognostic groups.[12]
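The grading steps in the Box can be sketched in Python. This is a minimal illustration and not the study's implementation: the function name `grade_group` is hypothetical, ties between equally prevalent patterns are broken toward the higher pattern as an assumption, and the third-pattern rule mentioned in the Box is deliberately omitted. The score-to-GG mapping follows the standard grade group definitions (score 6 or lower, GG1; 3 + 4, GG2; 4 + 3, GG3; 8 or higher, GG4-5 as collapsed in this study).

```python
def grade_group(pct3, pct4, pct5):
    """Map Gleason pattern percentages to a collapsed grade group.

    Simplified sketch: uses only the 2 most prevalent patterns and
    ignores the third-pattern rule noted in the Box.
    """
    present = [(pct, pattern)
               for pattern, pct in ((3, pct3), (4, pct4), (5, pct5))
               if pct > 0]
    if not present:
        return "nontumor"  # no Gleason-gradable adenocarcinoma
    # Most prevalent first; ties go to the higher pattern (assumption).
    present.sort(reverse=True)
    primary = present[0][1]
    # A single pattern is counted twice, e.g. 3 + 3 = 6.
    secondary = present[1][1] if len(present) > 1 else primary
    score = primary + secondary
    if score <= 6:
        return "GG1"
    if score == 7:
        # Same score, different group: 3 + 4 is GG2, 4 + 3 is GG3.
        return "GG2" if primary == 3 else "GG3"
    return "GG4-5"  # GG4 (score 8) and GG5 (scores 9-10) collapsed
```

For example, `grade_group(70, 30, 0)` yields GG2 (3 + 4 = 7), whereas `grade_group(30, 70, 0)` yields GG3 (4 + 3 = 7), which is why the pattern quantitation matters and not only the score.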

Biopsy Specimen Reviews

Reviews were collected for 2 purposes: first for DLS development (training and tuning) and second for assessment of the DLS performance using a separate validation data set. Biopsy specimen reviews for DLS development are detailed in the eMethods in the Supplement. For DLS validation, 6 urologic subspecialists reviewed the validation set (eFigure 1A in the Supplement). The subspecialists (M.B.A., A.J.E., T.K., M.Z., R.A., and P.A.H.) represented 5 institutions and had 18 to 34 years of clinical experience after residency (mean, 25 years). To reduce potential Gleason pattern ambiguity due to issues such as tangential cuts of the specimen, 3 adjacent sections (levels) of the specimens were provided to the subspecialists. These 3 levels were made available to the pathologists for establishing the reference standard, but not made available to the DLS, which interpreted only the middle section of each specimen. Furthermore, 1 additional section per specimen was stained with the PIN-4 immunohistochemistry cocktail (P504S plus p63 plus high molecular weight cytokeratin) to help the subspecialists identify cancer. For each of the 752 biopsy specimens in the validation set, reviews were performed by 2 of the 6 aforementioned expert subspecialists. A third subspecialist reviewed the specimens when there were discordances between the first 2 subspecialists (176 specimens [23%]). For cases without a majority opinion after 3 independent reviews (13 cases [1.7%]), the median classification was used. We then evaluated the accuracy of the DLS compared with this majority opinion of the subspecialists for each biopsy.
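The adjudication rule described above (2 reviews, a third on discordance, and the median classification when all 3 differ) can be sketched as follows. The names `CATEGORIES` and `reference_standard` are hypothetical, and the ordering of the 5 categories from nontumor to GG4-5 is an assumption used to define the median.

```python
from collections import Counter

# Ordinal scale assumed for taking the median classification.
CATEGORIES = ["nontumor", "GG1", "GG2", "GG3", "GG4-5"]

def reference_standard(reviews):
    """Return the adjudicated category from 2 or 3 subspecialist reviews."""
    if len(reviews) not in (2, 3):
        raise ValueError("expected 2 or 3 reviews")
    label, count = Counter(reviews).most_common(1)[0]
    if count >= 2:
        return label  # concordant pair, or 2-of-3 majority
    if len(reviews) == 2:
        raise ValueError("discordant pair: a third review is required")
    # Three different opinions: take the median on the ordinal scale.
    ranks = sorted(CATEGORIES.index(r) for r in reviews)
    return CATEGORIES[ranks[1]]
```

With 3 distinct opinions such as GG1, GG2, and GG3, the function returns GG2, the middle category.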

Biopsy Specimen Reviews by General Pathologists for Comparison

To measure the rate of agreement between the general pathologists and subspecialists, each biopsy specimen in the validation set was reviewed by several (median, 3; range, 1-6) US board-certified pathologists from the cohort of 19 participating in this study. The median number of biopsy specimens reviewed by each general pathologist was 84 (range, 41-312). To simulate routine clinical workflow, these pathologists had access to 3 sections per specimen, but not the immunohistochemistry-stained section.

Deep Learning System

The DLS operates in 2 stages, mimicking pathologists’ mental workflow by first characterizing individual regions into Gleason patterns, followed by assigning a GG to the entire biopsy specimen (eFigure 1B in the Supplement). To train the first stage of the DLS, we collected detailed region-level annotations from prostatectomy and biopsy specimens, which generated 114 million labeled image patches. The second stage of the DLS was trained using 580 biopsy specimen reviews (eTable 1 in the Supplement). Additional details, such as how the DLS neural network architecture was adapted for Gleason grading via Neural Architecture Search[14] and refined from the system used in prior work[15] as well as hyperparameter tuning[16] using the development set, are available in the Deep Learning System section of the eMethods in the Supplement.

Statistical Analysis

Prostate biopsy specimen interpretation involves first determining the presence or absence of prostate cancer. To evaluate the performance of the DLS for tumor detection, we calculated the DLS agreement rate with the subspecialists’ majority opinion for tumor vs nontumor classification. For comparison, we also computed the agreement rate of the general pathologists with the subspecialists’ majority opinion for tumor vs nontumor classification. To represent each general pathologist equally, we calculated each individual general pathologist’s agreement rate with subspecialists separately and calculated a mean rate across the 19 general pathologists. When a tumor is identified in the specimen, the next step of Gleason grading involves characterizing the Gleason pattern of each tumor region and estimating the proportion of each Gleason pattern present in the specimen (Box). To evaluate the ability of the DLS to quantitate Gleason patterns in the tumors, we computed the mean difference (mean absolute error) between the DLS-provided quantitation results and the mean of the subspecialist quantitation results for each Gleason pattern. For comparison, we also computed the mean absolute error between the general pathologists’ Gleason pattern quantitation results and the mean of the subspecialists’ quantitation results. The final step of Gleason grading involves determining the top 2 most prevalent Gleason patterns in each specimen, which determines the GG (Box). For evaluating the DLS in determining the GG for prostate biopsy specimens, we calculated the exact rate of agreement of the DLS categorization with the majority opinion of the subspecialists in categorizing specimens as nontumor, GG1, GG2, GG3, or GG4-5. For comparison, we also calculated the general pathologists’ rate of agreement with the majority opinion of the subspecialists. 
Similar to the tumor vs nontumor evaluation, we calculated each individual general pathologist’s agreement rate with subspecialists separately and calculated the mean rate across the 19 general pathologists. We additionally performed several subanalyses, which are detailed in the Statistical Analysis section of the eMethods in the Supplement. Finally, we conducted receiver operating characteristic curve analysis at 2 clinically meaningful decision thresholds: GG1 vs GG2-5 (representing the clinical threshold for potential eligibility for active surveillance vs prostatectomy or definitive treatment[17,18]) and GG1-2 vs GG3-5 (because some cases classified as GG2 with a low percentage of Gleason pattern 4 may still be managed with active surveillance[17,18]). Confidence intervals for all evaluation metrics were computed using a bootstrap approach by sampling specimens with replacement, with 1000 iterations. All statistical tests were 2-sided (see Statistical Analysis in the eMethods of the Supplement), and P < .05 was considered statistically significant. No adjustment was made for multiple comparisons. These analyses were performed using Python, version 2.7.6, and the scikit-learn library, version 0.20.0.[19]
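Two of the calculations described above lend themselves to a short sketch: the mean of per-pathologist agreement rates (so that each pathologist counts equally regardless of how many specimens they reviewed) and a bootstrap confidence interval obtained by resampling specimens with replacement over 1000 iterations. The function names are hypothetical, and the percentile method for reading the interval off the resampled rates is an assumption; the text specifies only "a bootstrap approach". The sketch uses the Python standard library only.

```python
import random

def mean_rate_across_pathologists(reviews):
    """reviews maps a pathologist id to a list of booleans
    (True = agrees with the subspecialist majority opinion)."""
    rates = [sum(r) / len(r) for r in reviews.values()]
    return sum(rates) / len(rates)  # each pathologist weighted equally

def bootstrap_ci(agree, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method) for an agreement rate.

    agree holds one 0/1 entry per specimen; specimens are resampled
    with replacement n_iter times.
    """
    rng = random.Random(seed)
    n = len(agree)
    rates = sorted(sum(rng.choices(agree, k=n)) / n for _ in range(n_iter))
    lower = rates[round(alpha / 2 * n_iter)]
    upper = rates[round((1 - alpha / 2) * n_iter) - 1]
    return sum(agree) / n, lower, upper
```

Averaging per pathologist differs from pooling all reviews: a pathologist with 4 reviews at 50% agreement and another with 1 review at 100% agreement give a mean rate of 75%, whereas pooling the 5 reviews would give 60%.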

Results

Evaluation was performed using an independent validation set from 3 institutions (752 biopsy specimens, 1 specimen per case) (Table 1), each reviewed by at least 2 expert subspecialists (3 subspecialists when there was discordance between the first 2). Using these data, we evaluated the performance of the DLS for tumor detection, Gleason pattern quantitation, and GG classification (Figure 1A).

Figure 1.

Comparison of Deep Learning System (DLS) and Pathologist Agreement Rates With Subspecialists at Gleason Grading of Tumor-Containing Biopsy Specimens

A, Subspecialists review every biopsy to determine its grade group (GG) (see Box and Methods). Next, those GG determinations are compared with those of the DLS and the general pathologists. B, Agreement rates with subspecialists for the DLS and pathologists across all 498 tumor-containing biopsy specimens. C, Subanalysis considering 175 tumor-containing biopsy specimens from only the external validation set (medical laboratory 2). Because every pathologist reviewed only a subset of the cases, to represent every pathologist equally, the agreement rate shown for the general pathologists is the mean across all general pathologists. For the subanalysis presented in panel C, pathologists who conducted fewer than 20 reviews were excluded to avoid skewing the results (applied to 4 pathologists). Error bars represent 95% CIs.

Tumor Detection

In distinguishing 752 biopsy specimens containing tumor from those without tumor, the rate of agreement with subspecialists was similar for the DLS and for general pathologists (DLS, 94.3%; 95% CI, 92.4%-95.9% vs pathologists, 94.7%; 95% CI, 92.8%-96.3%; P = .58). The DLS detected tumors more often than general pathologists, at the cost of more false-positives (Table 2). Of the false-positives committed by the DLS, one-third were noted by subspecialists as precancerous: high-grade prostatic intraepithelial neoplasia (HGPIN) or atypical small acinar proliferation (ASAP). The remaining false-positives tended to occur on small artifact-containing tissue regions (median tissue area called as tumor in these cases, 1%).
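The trade-off described above (more tumors detected at the cost of more false-positives) is captured by sensitivity and specificity, which can be computed alongside the overall agreement rate. The sketch below is illustrative only; `tumor_detection_metrics` is a hypothetical name, and the inputs are assumed to contain at least one tumor and one nontumor reference case.

```python
def tumor_detection_metrics(predicted, reference):
    """Both arguments are equal-length lists of booleans (True = tumor).

    reference is the subspecialist majority opinion.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)              # tumor called, tumor present
    tn = sum((not p) and (not r) for p, r in pairs)  # nontumor called, nontumor
    fp = sum(p and (not r) for p, r in pairs)        # false-positive
    fn = sum((not p) and r for p, r in pairs)        # missed tumor
    return {
        "agreement": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

A reader that calls more specimens tumor-containing raises sensitivity and lowers specificity, while the overall agreement rate can stay nearly unchanged.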

Table 2.

Agreement Rates of the DLS and General Pathologists With the Subspecialists’ Majority Opinion at 3 Clinically Important Decision Cutoffs

Clinical task, evaluation metric	DLS, % (95% CI)	General pathologist, % (95% CI)^a
Nontumor vs tumor determination (n = 752)
Agreement with subspecialist majority opinion	94.3 (92.4-95.9)	94.7 (92.8-96.3)^b
Sensitivity	95.5 (93.7-96.8)^b	92.8 (90.0-95.1)
Specificity	91.7 (88.2-94.6)	97.0 (95.1-98.6)^b
Grading of tumor-containing biopsy specimens^c
Agreement with subspecialist majority opinion for GG1 vs GG2-5 (n = 498)	86.1 (83.1-89.2)^b	80.6 (77.9-83.5)
Agreement with subspecialist majority opinion for GG1-2 vs GG3-5 (n = 498)	92.8 (90.8-94.9)^b	86.0 (83.2-88.5)

Abbreviations: DLS, deep learning system; GG, grade group.

^a Similar to Figure 1, the agreement rate of the general pathologists represents the mean rate across all general pathologists.

^b The higher value in the row.

^c Agreement on 2 Gleason grading thresholds.

Gleason Pattern Quantitation

The DLS Gleason pattern quantitation error was lower than that of general pathologists across all patterns (Table 3). In particular, on GG2 slides (n = 122), where small differences in pattern 4 can substantially alter patient prognosis and treatment,[17] the DLS quantitation error rate was substantially lower than that of the general pathologists (DLS, 12.0%; 95% CI, 10.4%-13.6% vs pathologists, 22.0%; 95% CI, 19.6%-24.6%; P < .001).
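The quantitation error reported here is a mean absolute difference, per Gleason pattern, between one reader's percentage and the mean of the subspecialists' percentages for the same specimen. A minimal sketch, with the hypothetical name `mean_absolute_difference` and toy inputs:

```python
def mean_absolute_difference(reader_pcts, subspecialist_pcts):
    """reader_pcts: one percentage per specimen for a given Gleason pattern.
    subspecialist_pcts: for each specimen, the list of percentages that the
    subspecialists assigned to the same pattern.
    """
    diffs = []
    for reader, reviews in zip(reader_pcts, subspecialist_pcts):
        reference = sum(reviews) / len(reviews)  # mean of subspecialists
        diffs.append(abs(reader - reference))
    return sum(diffs) / len(diffs)
```

For 2 specimens where the reader reports 10% and 60% pattern 4 and the subspecialist means are 30% and 50%, the result is (20 + 10) / 2 = 15 percentage points.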

Table 3.

Mean Absolute Difference in Gleason Pattern Quantitation Relative to Subspecialists

Gleason pattern	No.	Subspecialist discordance, % (95% CI): deep learning system	Subspecialist discordance, % (95% CI): pathologist
3 (Tumor-containing specimens)	498	9.2 (8.0-10.5)^b	14.0 (12.4-15.6)
4 (Tumor-containing specimens)	498	10.0 (8.6-11.2)^b	16.3 (14.6-18.1)
5 (Tumor-containing specimens)	498	1.5 (0.9-2.1)^b	3.2 (2.2-4.3)
4 (Grade group 2 specimens only)	122	12.0 (10.4-13.6)^b	22.0 (19.6-24.6)

^a Gleason pattern quantitation reflects the proportion of tumor in each biopsy specimen that is characterized as each Gleason pattern. The mean absolute differences in Gleason pattern quantitation are measured against the mean of subspecialist quantitation results for all tumor-containing biopsy specimens (rows 1-3) or grade group 2 biopsy specimens only (row 4).

^b Lower absolute differences (higher agreement rate in Gleason pattern quantitation).

Grading Tumor-Containing Biopsy Specimens

For Gleason grading of tumor-containing biopsy specimens (n = 498), the rate of DLS agreement with the subspecialists (71.7%; 95% CI, 67.9%-75.3%) was significantly higher than the general pathologist agreement rate with subspecialists (58.0%; 95% CI, 54.5%-61.4%) (P < .001) (Figure 1B). The DLS outperformed 16 of the 19 general pathologists in this comparison (eTable 3 in the Supplement). In a subanalysis of biopsy specimens from the external validation set (ML2, n = 175), the rate of DLS agreement with subspecialists remained significantly higher than the rate of general pathologist agreement with subspecialists (71.4%; 95% CI, 65.7%-77.7% vs 61.2%; 95% CI, 55.7%-67.0%; P = .01) (Figure 1C; eTables 4 and 5 in the Supplement; additional subanalyses and sensitivity analyses are provided in eFigures 5 and 6 and in eTables 7, 9, and 10 in the Supplement). We further examined several clinically important thresholds on 498 tumor-containing cases (Table 2; eFigure 3 in the Supplement). The rate of agreement with subspecialists was higher for the DLS than for the general pathologists at distinguishing GG1 vs GG2-5 cases, a threshold with important implications for active surveillance vs definitive treatment (DLS: 86.1%; 95% CI, 83.1%-89.2% vs general pathologists: 80.6%, 95% CI, 77.9%-83.5%; P < .001). Results were similar for distinguishing GG1-2 vs GG3-5 cases (Table 2). The receiver operating characteristic curve analysis at these GG thresholds is shown in eFigure 3 in the Supplement. The contingency tables comparing GG classification by the DLS and by the general pathologists relative to the subspecialist majority opinion are provided in eTable 2 in the Supplement. Most of the improvement in GG accuracy by the DLS was due to reduced overgrading. On tumor-containing cases, pathologists had a 25.7% frequency of overgrading vs 8.9% overgrading by the DLS. 
By contrast, the DLS was slightly more likely to undergrade tumor-containing cases relative to subspecialists (frequency of undergrading: by pathologists, 14.7% vs by the DLS, 19.6%).
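Overgrading and undergrading frequencies follow from comparing each assigned category with the subspecialist majority opinion on an ordinal scale. The sketch below is illustrative; `ORDER` and `grading_error_rates` are hypothetical names, and placing nontumor below GG1 on the scale is an assumption.

```python
# Ordinal ranks; nontumor ranked below GG1 (assumption).
ORDER = {"nontumor": 0, "GG1": 1, "GG2": 2, "GG3": 3, "GG4-5": 4}

def grading_error_rates(assigned, reference):
    """Fractions of cases graded above, below, or exactly at the reference."""
    n = len(reference)
    pairs = list(zip(assigned, reference))
    over = sum(ORDER[a] > ORDER[r] for a, r in pairs)
    under = sum(ORDER[a] < ORDER[r] for a, r in pairs)
    return {
        "overgrade": over / n,
        "undergrade": under / n,
        "exact": (n - over - under) / n,
    }
```

By construction the 3 fractions sum to 1 for a single reader.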

DLS Grading Examples

Figure 2 and eFigure 2 in the Supplement contain example visualizations of the DLS’s precise, interpretable glandular-level Gleason grading. They illustrate the potential of the DLS to be helpful in assisting pathologists in tumor detection, grading, and Gleason pattern quantitation.

Figure 2.

Illustrative Concept of How Deep Learning System (DLS) Results May Be Presented to a Pathologist

These cases were graded by both the DLS and subspecialists as grade groups 1 (A), 2 (B), and 3 (C). The DLS provides both a glandular-level Gleason pattern categorization and a biopsy-level Gleason score and grade group. Left column represents low-power magnification view of the Gleason pattern categorization; middle column, 10× magnification of the indicated area from the left column; right column, the DLS-generated Gleason score and Gleason pattern quantitation. In the left column, green represents DLS-categorized Gleason pattern 3; yellow, DLS-categorized Gleason pattern 4.

Discussion

We have presented a system for Gleason grading prostate biopsy specimens with a rigorous evaluation involving numerous experienced urologic subspecialists from diverse backgrounds, having a mean of 25 years of experience, with access to several histologic sections and immunohistochemical-stained sections for every specimen. First, the DLS showed similar overall tumor detection rates compared with general pathologists, by catching more cases of tumor than general pathologists at the cost of some false-positives. This trade-off suggests that the DLS could help alert pathologists to tumors that may otherwise be missed[20,21] while relying on pathologist judgment to overrule false-positive categorizations on small tissue regions. Second, the DLS showed better agreement rates with subspecialists than pathologists did for Gleason pattern quantitation, which is an important prognostic signal and independent predictor of biochemical recurrence[22,23] and part of recommended reporting by the College of American Pathologists, International Society of Urological Pathology, World Health Organization, and European Association of Urology guidelines.[12,13,24,25] Third, in summarizing the overall GG for the biopsy specimens (which is derived from the proportions of Gleason patterns present in the specimen and ultimately used in risk stratification with the National Comprehensive Cancer Network guidelines), the DLS showed significantly greater agreement rates with subspecialists than general pathologists did. Finally, the rate of agreement of the DLS with subspecialists on an external validation set remained similar, suggesting DLS robustness to interlaboratory and patient cohort differences. Over the years, prostate cancer treatment has evolved such that the role of conservative management has been recognized in men with low-risk disease. 
In particular, several trials have shown the safety of active surveillance compared with radical prostatectomy or radiation therapy in carefully selected patients with localized prostate cancer.[26,27,28] In this decision-making process, guidelines endorsed by the American Society of Clinical Oncology recommend consideration of both the GG and relative amount of Gleason pattern 4.[17,18] Owing to the recognized interobserver variability in Gleason grading, intradepartmental consults have been recommended to improve consistency and quality of care.[29,30] In this regard, the DLS could function as a valuable decision support tool when deciding between GGs for patients with localized disease, with important downstream implications on treatment. A DLS such as this could therefore create efficiencies for health care systems by improving consistency of grading, reducing the consultation-associated costs and turnaround delays, and potentially decreasing treatment-related morbidity for men with low-risk disease. In particular, the DLS was substantially less likely to overgrade (especially at the clinically important GG1 vs GG2 distinction) while being slightly more likely to undergrade cases than general pathologists, especially at higher GGs (eTable 2 in the Supplement). These findings suggest that DLS assistance could be particularly helpful in accurately identifying low-risk cases that are eligible for more conservative management. The exact implementation and benefit of using such a tool remains to be determined but must be guided by prospective validation studies that examine the influence on diagnostic reporting and patient outcomes. 
The GG plays a pivotal role in patient treatment,[26,27,28] and grading among subspecialists is substantially more concordant than grading among general pathologists, both in our study (eFigure 4 and eTables 3 and 6 in the Supplement) and in the literature.[6,31] However, discordance remains even among subspecialists due to the inherent subjectivity and difficulty of Gleason grading. The subspecialists participating in the present study had at least a decade of urologic pathology experience and access to 3 levels and immunohistochemistry of each biopsy specimen in the validation set. These discordances highlight the need to further improve risk stratification for prostate cancer. One possibility is to develop systems to directly predict clinical risk with more precision than is possible by human graders. Such machine learning models could identify novel histoprognostic signals that are undiscovered or not evident to the human eye,[32,33] and may help stratify patient risk in a manner similar to existing molecular tests.[34,35] Other works have applied deep learning to Gleason grading.[36,37,38,39,40] Ström et al[39] trained and validated a DLS using biopsy specimens graded by the same urologic subspecialist (validation data set sizes: 1631 biopsy specimens from 246 men, and 330 biopsy specimens from 73 men) and additionally compared grading with 23 subspecialists on a smaller set of 87 biopsy specimens. Bulten et al[40] validated a DLS on 550 biopsy specimens from 210 randomly selected patients from the same institution used for development, using consensus grades from 3 experienced subspecialists at 2 institutions, and further compared with 15 pathologists or trainees on a smaller set of 100 biopsy specimens. 
Our study improved on these efforts via substantial subspecialist-reviewed glandular annotations to enable gland-level Gleason grading for assistive visualizations and explainability (Figure 2); via a rigorous review process involving several subspecialists from different institutions as well as 3 specimen levels and immunohistochemistry samples for every case; through the use of a sizable, independent clinical data set for validation; and finally by assessment of Gleason pattern quantitation in addition to Gleason grading of biopsy specimens.

Limitations

This study has limitations. First, we used 1 biopsy specimen per case although each clinical case typically involves 12 to 18 biopsy specimens. Second, this study did not evaluate the correlation of the DLS Gleason grading with clinical outcomes, which would be a less subjective reference than subspecialist review. However, unlike a previous analysis on radical prostatectomies,[15] such an analysis for biopsy specimens would be challenging due to confounding factors such as divergent treatment pathways based on the original diagnosis, tissue sampling variability inherent to small biopsy specimens, other clinical variables, and patient preferences. Third, the effect of rescanning the specimens on model performance will need to be evaluated in future work. Fourth, additional aspects such as nonadenocarcinoma prostate cancer variants or precancerous findings were not evaluated in this study.

Conclusions

To conclude, we have presented a DLS for Gleason grading of prostate biopsy specimens that is highly concordant with subspecialists and that maintained its performance on an external validation set. Future work will need to assess the diagnostic and clinical effect of the use of a DLS for increasing the accuracy and consistency of Gleason grading to improve patient care.

36 in total

1. Interobserver reproducibility of Gleason grading of prostatic carcinoma: general pathologist.

Authors: W C Allsbrook; K A Mangold; M H Johnson; R B Lane; C G Lane; J I Epstein
Journal: Hum Pathol Date: 2001-01 Impact factor: 3.466

2. Intraobserver and interobserver reproducibility of WHO and Gleason histologic grading systems in prostatic adenocarcinomas.

Authors: S O Ozdamar; S Sarikaya; L Yildiz; M K Atilla; B Kandemir; S Yildiz
Journal: Int Urol Nephrol Date: 1996 Impact factor: 2.370

3. The Prostate Cancer Intervention Versus Observation Trial: VA/NCI/AHRQ Cooperative Studies Program #407 (PIVOT): design and baseline results of a randomized controlled trial comparing radical prostatectomy with watchful waiting for men with clinically localized prostate cancer.

Authors: Timothy J Wilt
Journal: J Natl Cancer Inst Monogr Date: 2012-12

Review 4. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System.

Authors: Jonathan I Epstein; Lars Egevad; Mahul B Amin; Brett Delahunt; John R Srigley; Peter A Humphrey
Journal: Am J Surg Pathol Date: 2016-02 Impact factor: 6.394

5. Phase 3 study of adjuvant radiotherapy versus wait and see in pT3 prostate cancer: impact of pathology review on analysis.

Authors: Dirk Bottke; Reinhard Golz; Stephan Störkel; Axel Hinke; Alessandra Siegmann; Lothar Hertle; Kurt Miller; Wolfgang Hinkelbein; Thomas Wiegel
Journal: Eur Urol Date: 2013-03-17 Impact factor: 20.096

6. Active surveillance for the management of localized prostate cancer: Guideline recommendations.

Authors: Chris Morash; Rovena Tey; Chika Agbassi; Laurence Klotz; Tom McGowan; John Srigley; Andrew Evans
Journal: Can Urol Assoc J Date: 2015 May-Jun Impact factor: 1.862

7. Artificial Intelligence-Based Breast Cancer Nodal Metastasis Detection: Insights Into the Black Box for Pathologists.

Authors: Yun Liu; Timo Kohlberger; Mohammad Norouzi; George E Dahl; Jenny L Smith; Arash Mohtashamian; Niels Olson; Lily H Peng; Jason D Hipp; Martin C Stumpe
Journal: Arch Pathol Lab Med Date: 2018-10-08 Impact factor: 5.534

Review 8. The Diagnosis and Treatment of Prostate Cancer: A Review.

Authors: Mark S Litwin; Hung-Jui Tan
Journal: JAMA Date: 2017-06-27 Impact factor: 56.272

9. Deep learning for automatic Gleason pattern classification for grade group determination of prostate biopsies.

Authors: Marit Lucas; Ilaria Jansen; C Dilara Savci-Heijink; Sybren L Meijer; Onno J de Boer; Ton G van Leeuwen; Daniel M de Bruin; Henk A Marquering
Journal: Virchows Arch Date: 2019-05-16 Impact factor: 4.064

10. A 22 Gene-expression Assay, Decipher® (GenomeDx Biosciences) to Predict Five-year Risk of Metastatic Prostate Cancer in Men Treated with Radical Prostatectomy.

Authors: Michael Marrone; Arnold L Potosky; David Penson; Andrew N Freedman
Journal: PLoS Curr Date: 2015-11-17
