Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading.

Ellery Wulczyn¹, Kunal Nagpal¹, Matthew Symonds¹, Melissa Moran¹, Markus Plass², Robert Reihs², Farah Nader², Fraser Tan¹, Yuannan Cai¹, Trissia Brown³, Isabelle Flament-Auvigne³, Mahul B Amin⁴, Martin C Stumpe¹,⁵, Heimo Müller², Peter Regitnig², Andreas Holzinger², Greg S Corrado¹, Lily H Peng¹, Po-Hsuan Cameron Chen¹, David F Steiner¹, Kurt Zatloukal², Yun Liu¹, Craig H Mermel¹.

Abstract

Background: Gleason grading of prostate cancer is an important prognostic factor, but suffers from poor reproducibility, particularly among non-subspecialist pathologists. Although artificial intelligence (A.I.) tools have demonstrated Gleason grading on par with expert pathologists, it remains an open question whether and to what extent A.I. grading translates to better prognostication.
Methods: In this study, we developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5-25 years of follow-up (median: 13, interquartile range 9-17).
Results: Here, we show that the A.I.'s risk scores produced a C-index of 0.84 (95% CI 0.80-0.87) for prostate cancer-specific mortality. Upon discretizing these risk scores into risk groups analogous to pathologist Grade Groups (GG), the A.I. has a C-index of 0.82 (95% CI 0.78-0.85). On the subset of cases with a GG provided in the original pathology report (n = 1517), the A.I.'s C-indices are 0.87 and 0.85 for continuous and discrete grading, respectively, compared to 0.79 (95% CI 0.71-0.86) for GG obtained from the reports. These represent improvements of 0.08 (95% CI 0.01-0.15) and 0.07 (95% CI 0.00-0.14), respectively.
Conclusions: Our results suggest that A.I.-based Gleason grading can lead to effective risk stratification, and warrants further evaluation for improving disease management.

Keywords: Prognostic markers; Prostate cancer

Year: 2021 PMID: 35602201 PMCID: PMC9053226 DOI: 10.1038/s43856-021-00005-3

Source DB: PubMed Journal: Commun Med (Lond) ISSN: 2730-664X

Introduction

Prostate cancer affects one in nine men in their lifetime[1], but disease aggressiveness and prognosis can vary substantially among individuals. The histological growth patterns of the tumor, as characterized by the Gleason grading system, are a major determinant of disease progression and criterion for selection of therapy. Based on the prevalence of these patterns, one of five Grade Groups (GG) is assigned[2]. The GG is among the most important prognostic factors for prostate cancer patients, and is used to help select the treatment plan most appropriate for a patient’s risk of disease progression[3]. The Gleason system is used at distinct points in the clinical management of prostate cancer. For patients undergoing diagnostic biopsies, if tumor is identified, the GG impacts the decision between active surveillance vs. definitive treatment options, such as surgical removal of the prostate or radiation therapy[3]. For patients who subsequently undergo a surgical resection of the prostate (radical prostatectomy), the GG is one key component of decisions regarding adjuvant treatment, such as radiotherapy or hormone therapy[4,5]. In large clinical trials, use of adjuvant therapy following prostatectomy has demonstrated benefits, such as improved progression-free survival for some patients, but can also result in substantial adverse side effects[6-8]. As such, several post-prostatectomy nomograms[9] have been developed, in order to better predict clinical outcomes following the definitive treatment, with the goal of identifying the patients most likely to benefit from adjuvant therapy. Gleason grading of prostatectomy specimens represents a key prognostic element in many of these nomograms, and is a central component of the risk categories defined by the National Comprehensive Cancer Network[5]. Due to the complexity and intrinsic subjectivity of the system, Gleason grading suffers from large discordance rates between pathologists (30–50%)[10-15]. 
However, grades from experts (such as those with several years of experience, primarily practicing urologic pathology, or those with urologic subspecialty training) are more consistent and result in more accurate risk stratification than grades from less experienced pathologists[16-19], suggesting an opportunity to improve the clinical utility of the system by improving grading consistency and accuracy. To this end, several artificial intelligence (A.I.) algorithms for Gleason grading have been developed and validated, using expert-provided Gleason scores[20-23]. However, an evaluation of the prognostic value of these algorithms and a direct comparison to the prognostic value of Gleason grading provided by pathologists has not been conducted. While the GG for biopsies and the GG for prostatectomy specimens both provide important prognostic information[2], retrospective studies evaluating long-term clinical outcomes are more straightforward for prostatectomy cases, given the widely divergent treatment pathways following biopsy alone. Building on prior work[22,24], we first trained an A.I. system to accurately classify and quantitate Gleason patterns on prostatectomy specimens, and further demonstrate that A.I.-based Gleason pattern (GP) quantitations can be used to provide better risk stratification than the GG from the original prostatectomy pathology reports.

Methods

Data

All available slides for archived prostate cancer resection cases between 1995 and 2014 in the Biobank Graz[25,26] at the Medical University of Graz were retrieved, de-identified, and scanned using a Leica Aperio AT2 scanner at 40× magnification (0.25 μm/pixel). The standard protocol for radical prostatectomy submission at the institution was to submit the entire prostate (right and left lobes, additionally divided into ventral and dorsal portions, and serially sectioned apex to base approximately every 3–5 mm). To our knowledge, there was no change in surgical procedure type over the time period studied. Robotic surgery was not used. Gleason patterns (Gleason scores) were extracted from the original pathology reports and translated to their corresponding GG[2]. Tertiary patterns, which were reported in only 22 of the 2807 cases (<1%), were not used in this study. Clinicopathologic variables, such as pathologic TNM staging, were also extracted from the pathology reports. Disease-specific survival (DSS) was inferred from International Classification of Diseases codes obtained from medical death certificates from the Statistik Austria database. Codes considered for prostate cancer-related death were C61 (malignant neoplasm of prostate) and C68 (malignant neoplasm of other and unspecified urinary organs). Institutional Review Board approval for this retrospective study, using anonymized slides and associated pathologic and clinical data, was obtained from the Medical University of Graz (Protocol no. 32-026 ex 19/20). Need for informed consent was waived because the project was performed with anonymized data. Validation set 1 included all available cases from 1995 to 2014 after application of the exclusion criteria (n = 2807; Table 1 and Supplementary Fig. S1). 
Because Gleason scoring at the Medical University of Graz was adopted in routine practice from 2000 onward, validation set 2 included all cases from 2000 onward for which a Gleason score was available (n = 1517; Table 1). Sensitivity analysis for inclusion of Gleason grades prior to the year 2000 (before Gleason scoring became routine at the institution) is presented in Supplementary Table S1. The specific purpose of validation set 2 is to allow for a direct comparison of the prognostic performance of the A.I. with that of the pathologist Gleason Grades.
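The translation from a reported Gleason score to a Grade Group can be sketched as below. This is a minimal illustration of the standard Grade Group definition (Gleason score 6 or lower is GG1, 3+4 is GG2, 4+3 is GG3, 8 is GG4, 9-10 is GG5), not the authors' extraction code. The function name `grade_group` is hypothetical, and restricting inputs to patterns 3-5 is an assumption made here for simplicity.

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map a Gleason score (primary + secondary pattern) to a Grade Group (1-5).

    Assumption of this sketch: only patterns 3, 4, and 5 are accepted,
    and tertiary patterns are ignored (as in the study).
    """
    if not (3 <= primary <= 5 and 3 <= secondary <= 5):
        raise ValueError("this sketch handles Gleason patterns 3, 4, and 5 only")
    total = primary + secondary
    if total == 6:
        return 1                           # 3+3
    if total == 7:
        return 2 if primary == 3 else 3    # 3+4 vs. 4+3
    if total == 8:
        return 4                           # 4+4, 3+5, 5+3
    return 5                               # Gleason score 9 or 10
```

For example, `grade_group(3, 4)` and `grade_group(4, 3)` fall into different groups even though both have a Gleason score of 7, which is the main refinement the Grade Group system introduces.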

Table 1

Cohort characteristics.

Characteristic	Category	Validation set 1	Validation set 2 (subset of set 1)
Number of cases		2807	1517
Number of slides	Total	83,645	47,626
Number of slides	Median per case (interquartile range)	29 (25, 34)	30 (26, 35)
Overall survival (OS)	Median years of follow-up (interquartile range)	13.1 (8.5, 17.2)	11.2 (7.4, 15.2)
Overall survival (OS) (%)	Censored	2150 (77%)	1306 (86%)
Overall survival (OS) (%)	Observed	657 (23%)	211 (14%)
Disease-specific survival (DSS) (%)	Censored	2673 (95%)	1464 (97%)
Disease-specific survival (DSS) (%)	Observed	134 (5%)	53 (3%)
Grade Group (%)	1	611 (22%)	608 (40%)
Grade Group (%)	2	476 (17%)	473 (31%)
Grade Group (%)	3	224 (8%)	224 (15%)
Grade Group (%)	4	128 (5%)	127 (8%)
Grade Group (%)	5	85 (3%)	85 (6%)
Grade Group (%)	Unknown	1283 (46%)	0 (0%)
Pathologic T-stage (%)	T2	1640 (58%)	1113 (73%)
Pathologic T-stage (%)	T3	791 (28%)	366 (24%)
Pathologic T-stage (%)	T4	25 (1%)	6 (<1%)
Pathologic T-stage (%)	Unknown	351 (13%)	32 (2%)
Age at diagnosis (%)	<60	952 (34%)	537 (35%)
Age at diagnosis (%)	60–70	1546 (55%)	817 (54%)
Age at diagnosis (%)	≥70	309 (11%)	163 (11%)
Margin status (%)	Negative	448 (16%)	153 (10%)
Margin status (%)	Positive	242 (9%)	96 (6%)
Margin status (%)	Unknown	2117 (75%)	1268 (84%)
Pathologic N-stage (%)	N0	1395 (50%)	879 (58%)
Pathologic N-stage (%)	N1	77 (3%)	62 (4%)
Pathologic N-stage (%)	N2	13 (<1%)	4 (<1%)
Pathologic N-stage (%)	N3	10 (<1%)	8 (1%)
Pathologic N-stage (%)	Unknown	1312 (47%)	564 (37%)
Received hormone or chemotherapy (%)	Yes	53 (2%)	33 (2%)
Received hormone or chemotherapy (%)	No/unknown	2754 (98%)	1484 (98%)
Received radiation therapy (%)	Yes	277 (10%)	176 (12%)
Received radiation therapy (%)	No/unknown	2530 (90%)	1341 (88%)
Biochemical recurrence (%)	Censored	338 (12%)	228 (15%)
Biochemical recurrence (%)	Observed	95 (3%)	55 (4%)
Biochemical recurrence (%)	No follow-up	2374 (85%)	1234 (81%)

Validation set 1 contains all prostatectomy cases from the Biobank Graz between 1995 and 2014. Validation set 2 was derived by first considering cases in the Gleason grading era at the institution (years 2000–2014; n = 2191), and then further filtering for cases where a Gleason score was recorded and available in the pathology report (n = 1517).

All slides underwent manual review by pathologists (see “Pathologist cohort and QC details” in the Supplementary Methods) to confirm stain type and tissue type. Inclusion/exclusion criteria are described in Supplementary Fig. S1. Briefly, immunohistochemically stained slides were excluded from analysis and only slides containing primarily prostatic tissue were included. Slides containing exclusively prostatic tissue were included in their entirety. Slides with both prostatic tissue and seminal vesicle tissue were included, but processed using a prostatic tissue model meant to provide only prostatic tissue to the Gleason grading model (for more details on its development and performance, see “Prostatic tissue segmentation model” in Supplementary Methods and Supplementary Figs. S1 and S2).

Gleason grading model

We previously developed two A.I. systems: one for Gleason grading prostatectomy specimens[24] based on a classic “inception” neural network architecture, and a second for Gleason grading biopsy specimens based on a customized neural network architecture[22]. For this work, we used the prostatectomy dataset from the first study to train a new model using the customized neural network architecture introduced in the second study. The training dataset contained 112 million pathologist-annotated “image patches” from an independent set of prostatectomy cases from different institutions than the validation data used in this study. Briefly, the system takes as input 512 × 512 pixel image patches (at 10× magnification, 1 μm per pixel) and classifies each patch as one of four categories: nontumor, GP 3, 4, or 5. The hyperparameters used for training this network were determined using a random grid search that optimized for tuning set classification accuracy over 50 potential settings, and are described in Supplementary Table S2 and “Gleason grading model tuning” in the Supplementary Methods.
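To make the patch geometry concrete, the sketch below enumerates the top-left corners of 512 × 512 pixel patches over a rectangular tissue region at 1 μm per pixel. It is illustrative only: the function name `patch_origins`, the rectangular region, and the choice to drop patches that would extend past the region boundary are assumptions made here, not details taken from the study.

```python
from typing import Iterator, Tuple

PATCH_SIZE = 512          # pixels per side (from the text)
MICRONS_PER_PIXEL = 1.0   # 10x magnification (from the text)


def patch_origins(width: int, height: int,
                  stride: int = PATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) pixel of every full patch in a width x height region.

    Assumption of this sketch: patches that would cross the right or
    bottom edge of the region are skipped rather than padded.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    for y in range(0, height - PATCH_SIZE + 1, stride):
        for x in range(0, width - PATCH_SIZE + 1, stride):
            yield (x, y)


# Physical side length of one patch in micrometres.
PATCH_SIDE_UM = PATCH_SIZE * MICRONS_PER_PIXEL
```

With a stride equal to the patch size the patches tile the region without overlap; a smaller stride such as 256 makes neighbouring patches overlap by half in each direction, so each tissue location contributes to several patch-level predictions.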

A.I. risk scores and risk groups

The Gleason grading model was run at stride 256 (at 10× magnification, 1 μm per pixel) on all prostate tissue patches. The classification of each patch as nontumor or GP 3, 4, or 5 was determined via argmax on re-weighted predicted class probabilities[24]. For each case, the percentages of prostate tumor patches belonging to Gleason patterns 3, 4, and 5 were subsequently computed by counting the number of patches categorized as each pattern across all slides for that case. A.I. risk scores were computed by fitting a Cox regression model using these case-level GP percentages as input, and the right-censored outcomes as the events (see workflow diagram in Supplementary Fig. S2). This approach was pursued first (rather than direct mapping of %GPs to GG as done by pathologists) due to the prognostic importance of precise GP quantitation[27], as well as the exhaustive nature of A.I. grading that rarely leads to classifications of GG1 (e.g., 100% GP3) and GG4 (e.g., 100% GP4). Sensitivity analyses evaluating additional ways of obtaining risk groups from %GPs, including direct mapping of %GPs to GG and a temporal-split methodology, demonstrated qualitatively similar results and are presented in Supplementary Table S3. GP 3 percentage was dropped as an input feature to avoid linear dependence between features. Leave-one-case-out cross-validation was used to adjust for optimism, similar to the tenfold cross-validation used in Epstein et al.[2]. A.I. risk groups were derived from the A.I. risk scores by discretizing the A.I. risk scores to match the number and frequency of pathologist GG in validation set 2. Discretization thresholds for both validation sets are provided in Supplementary Table S4.
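Two of the steps above, case-level GP percentages and frequency-matched discretization, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the study's pipeline: the label strings, the function names, and the rank-based assignment (in place of the fixed thresholds reported in Supplementary Table S4) are hypothetical choices made here, and the Cox regression step that produces the risk scores is not shown.

```python
from collections import Counter
from typing import Dict, List, Sequence

TUMOR_LABELS = ("GP3", "GP4", "GP5")  # hypothetical label strings


def gleason_pattern_percentages(patch_labels: Sequence[str]) -> Dict[str, float]:
    """Percentage of tumor patches per Gleason pattern, pooled over all slides of a case."""
    counts = Counter(label for label in patch_labels if label in TUMOR_LABELS)
    n_tumor = sum(counts.values())
    if n_tumor == 0:
        return {label: 0.0 for label in TUMOR_LABELS}
    return {label: 100.0 * counts[label] / n_tumor for label in TUMOR_LABELS}


def discretize_to_match(scores: Sequence[float],
                        group_counts: Sequence[int]) -> List[int]:
    """Assign risk groups 1..K so that group sizes equal group_counts.

    Cases are ranked by risk score; the lowest-scoring cases fill group 1,
    the next ones group 2, and so on. Ties are broken by input order,
    which is an assumption of this sketch.
    """
    if sum(group_counts) != len(scores):
        raise ValueError("group_counts must sum to the number of cases")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    start = 0
    for group, size in enumerate(group_counts, start=1):
        for i in order[start:start + size]:
            groups[i] = group
        start += size
    return groups
```

Because the three percentages always sum to 100 for a case with tumor, any one of them is determined by the other two, which is why one feature (GP 3 in the study) can be dropped before regression without losing information.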

Statistical analysis

Primary and secondary analyses were prespecified and documented prior to evaluation on the validation sets. The primary analysis consisted of the comparison of C-indices for DSS between pathologist GG and the A.I. risk scores (Table 2). The secondary analysis consisted of the comparison between C-indices for pathologist GG and the discretized A.I. risk groups. All other analyses were exploratory.

Table 2

C-index for pathologist and A.I. grading.

	C-index [95% CI]
	Validation set 1 (n = 2807 cases)	Validation set 2 (n = 1517 cases)
(A) Pathologist Grade Groups	N/A^a	0.79 [0.71, 0.86]
(B) A.I. risk score (continuous)	0.84 [0.80, 0.87]	0.87 [0.81, 0.91]
(C) A.I. risk groups (discretized)	0.82 [0.78, 0.85]	0.85 [0.79, 0.90]
(D) Average of (A) and (C)	N/A^a	0.86 [0.80, 0.91]

The A.I. risk score (B) is a continuous risk score from a Cox regression fit on Gleason pattern percentages from the A.I. The A.I. risk group (C) is a discretized version of the A.I. risk score. The discretization was done to match the number and frequency of pathologist Grade Groups in validation set 2. (D) Represents the average of the Pathologist Grade Group and A.I. risk groups. In validation set 2, the C-index for the A.I. risk score was statistically significantly higher than that for the pathologists’ Grade Group (p < 0.05, prespecified analysis). Bold indicates the highest value in each column (dataset).

^a Not available because pathologist Grade Groups were not available for all cases in validation set 1 due to the earlier time period.

The prognostic performance of the pathologist GG, the A.I. risk scores, and the A.I. risk groups was measured using Harrell’s C-index[28], a generalization of the area under the receiver operating characteristic curve for time-censored data. Confidence intervals for the C-indices of the A.I. and the pathologists, and for the differences between them, were computed via bootstrap resampling[29] with 1000 samples. In Kaplan–Meier analysis of the pathologist GG and A.I. risk groups, the multivariate log-rank test was used to test for differences in survival curves across groups. All survival analyses were conducted using the lifelines Python package[30] (version 0.25.4).
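For readers unfamiliar with the metric, the following pure-Python sketch shows one simple form of Harrell's C-index together with a percentile bootstrap confidence interval. It is a didactic approximation rather than the lifelines implementation used in the study: pairs with tied event times are treated as not comparable, the percentile method is an assumption (the text does not state which bootstrap interval was used), and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def c_index(times: Sequence[float], events: Sequence[bool],
            risks: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the higher-risk case fails first.

    A pair (i, j) is comparable when case i had an observed event and a
    strictly shorter follow-up time than case j. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored case cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_ci(times: Sequence[float], events: Sequence[bool],
                 risks: Sequence[float], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the C-index, resampling cases with replacement."""
    c_index(times, events, risks)  # fail early if the data has no comparable pairs
    rng = random.Random(seed)
    n = len(times)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(c_index([times[i] for i in idx],
                                 [events[i] for i in idx],
                                 [risks[i] for i in idx]))
        except ValueError:
            continue  # this resample had no comparable pairs; draw again
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

To obtain an interval for the difference between two graders, the same resampled case indices would be applied to both sets of risk values and the difference in C-index recorded for each resample, so that the comparison stays paired.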
