
Improvement in prediction of prostate cancer prognosis with somatic mutational signatures.

Shengping Zhang¹, Yafei Xu², Xinjie Hui², Fei Yang³, Yueming Hu², Jianlin Shao⁴, Hui Liang¹, Yejun Wang².

Abstract

Prostate cancer is a leading male malignancy worldwide, yet prediction of its prognosis remains quite inaccurate. This study aimed to determine whether the prognosis of prostate cancer is associated with the genetic mutation profile, and to build an accurate prognostic predictor based on genetic signatures. Patients diagnosed with prostate cancer in The Cancer Genome Atlas were used for prognostic stratification, and somatic gene mutation profiles were compared between prognostic groups. The genetic features were then used to train machine-learning models to predict prostate cancer prognosis. No gene showed a significant difference in somatic mutation rate between prognostic groups. In total, 43 atypical genes were screened to build a support vector machine model for predicting prostate cancer prognosis, with an average accuracy of 66% and 64% for 5-fold cross-validation and training-testing evaluation, respectively. When combined with the National Institute for Health and Care Excellence (NICE) features, the model improved further, reaching a 5-fold cross-validation accuracy of ~71%, much better than NICE alone (62%). To our knowledge, this is the first study of the relationship between genome-wide somatic mutations and prostate cancer prognosis, and it yielded an effective prognostic prediction model based on atypical genetic signatures.


Keywords: atypical features; prognosis prediction; prostate cancer; somatic mutation; support vector machine

Year: 2017 PMID: 29158798 PMCID: PMC5665042 DOI: 10.7150/jca.21261

Source DB: PubMed Journal: J Cancer ISSN: 1837-9664 Impact factor: 4.207

Introduction

Prostate cancer (PCa) is one of the most common malignancies in men worldwide, with ~1,000,000 cases diagnosed annually 1. In developed countries, PCa is the second leading cause of cancer-related death among men 2. Both genetic and demographic factors, such as age, family history and race, are closely related to the incidence and progression of PCa 3-4. As understanding of the underlying biology of PCa has broadened, various treatment strategies have been developed, such as radical prostatectomy, hormone deprivation therapy, radiation therapy and chemotherapy. However, the prognosis of PCa is still far from satisfactory, and most tumors relapse within 2 years to the castration-resistant state 5. Currently, over 80% of PCa cases are localised or locally advanced non-metastatic disease, and patients must choose the best treatment regimen from a wide array 6. Risk stratification plays an important role in clinical decision making and treatment selection; at present it rests mainly on a general impression formed from a few clinical parameters, such as PSA concentration, clinical stage, biopsy Gleason score, patient age and number of positive prostate biopsies 7-10. The most widely used stratification system for primary non-metastatic PCa is endorsed by the National Institute for Health and Care Excellence (NICE) guidelines, which use presenting PSA concentration, Gleason grade and clinical T stage to classify PCa patients as low, intermediate or high risk 11. Some new methods have been proposed on the basis of the NICE stratification system and displayed improved prognostic power 6. Despite the success of these risk stratification systems in prognosis prediction, tumors within the same risk group still show remarkably different clinical courses 12-14. 
Therefore, new prognostic prediction tools are still urgently needed to improve the accuracy and sensitivity of PCa classification. Classically, the progression of various cancer types is ascribed to the sequential accumulation of genetic alterations. Somatic mutation signatures have been successfully applied in the development of prognostic prediction tools for various cancers, such as breast cancer, lung cancer and nasopharyngeal carcinoma 15-17. Gene signatures have also been tried in PCa risk stratification 18-21. For example, Irshad et al identified a three-gene panel, comprising FGFR1, PMP22 and CDKN1A, which could accurately predict the outcome of PCa with low Gleason scores 20. Berg et al found that over-expression of ERG was associated with an increased risk of disease progression during active surveillance of PCa patients 22. In another study, a model with a 100-gene signature, which classified PCa patients into five subgroups with distinct genomic alterations and expression profiles, predicted disease with poor prognosis better than traditional predictors based on PSA and Gleason scores 23. These studies demonstrate that genetic variation plays an important role in the classification of PCa and may have considerable potential for clinical prediction. However, the risk stratification systems currently in wide use for PCa are based almost exclusively on routine clinico-pathological parameters and pay no attention to genetic variation. In this research, somatic mutation profiles were compared extensively between PCa cases with different prognoses, using the prostate adenocarcinoma (PRAD) data from The Cancer Genome Atlas (TCGA). No gene was found with a significant difference in somatic mutation rate between groups (False Discovery Rate, FDR < 0.05). 
However, a combined filtering strategy yielded 43 genes, which were used as features for prognostic model development and reached good classification performance. With 5-fold cross-validation, the genetic model based on the 43 features showed an average AUC (area under the Receiver Operating Characteristic, ROC, curve) of ~0.70 and an average accuracy of ~0.66, better than NICE (accuracy: ~0.62). A combined model using both the genetic signatures and NICE reached better average performance (AUC: ~0.75; accuracy: ~0.71). Taken together, the study suggests that somatic mutation signatures can substantially facilitate the prognostic prediction of PCa, either independently or combined with other clinical features.

Materials and Methods

Datasets, stratification and somatic mutation rate comparison

The clinical data for the patients with PCa were downloaded from the TCGA (The Cancer Genome Atlas) website. The somatic mutation data derived from the tumor-normal pair of each PCa case were also downloaded. Mutations causing codon changes, frame-shifts or premature translational termination were retrieved for further analysis. Cases were stratified based on either 'tumor status' or 'biochemical recurrence'. For 'tumor status', the 'with tumor' group included the patients found to have residual or recurrent tumors before death or at last follow-up; the rest were classified into the 'tumor free' group. For 'biochemical recurrence' stratification, two groups were designated 'recurrence' and 'non-recurrence', representing the cases with and without recurrence, respectively. The clinical examination results were also used for NICE risk stratification ('low', 'medium' and 'high' risk) 11. To compare the somatic gene mutation frequency between prognostic groups, a matrix was prepared to record the mutations of all genes for each case, and the number of cases with mutations was then counted for each gene in each group. Both the Chi-square test with Benjamini & Hochberg correction and EBT were used for rate comparisons; the significance level was set at a False Discovery Rate (FDR) < 0.05 for the Chi-square test and a p value < 0.05 for EBT 24, 25.
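The per-gene rate comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary-based input layout and the toy cases are assumptions made for the example, the Chi-square test is the plain Pearson test on a 2x2 table without continuity correction, and EBT is not reproduced.

```python
import math

def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-square p value (1 d.f., no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def compare_mutation_rates(mutations, groups, group_a, group_b):
    """mutations: case -> set of mutated genes; groups: case -> group label.
    Returns gene -> counts of mutated cases per group, raw p and FDR."""
    n_a = sum(1 for g in groups.values() if g == group_a)
    n_b = sum(1 for g in groups.values() if g == group_b)
    genes = sorted(set().union(*mutations.values()))
    rows = []
    for gene in genes:
        mut_a = sum(1 for c, gs in mutations.items()
                    if gene in gs and groups[c] == group_a)
        mut_b = sum(1 for c, gs in mutations.items()
                    if gene in gs and groups[c] == group_b)
        p = chi2_2x2_pvalue(mut_a, n_a - mut_a, mut_b, n_b - mut_b)
        rows.append((gene, mut_a, mut_b, p))
    fdr = benjamini_hochberg([r[3] for r in rows])
    return {gene: {"mut_a": a, "mut_b": b, "p": p, "fdr": q}
            for (gene, a, b, p), q in zip(rows, fdr)}

# Toy input (hypothetical cases, not TCGA data).
toy_mutations = {"c1": {"TP53"}, "c2": {"TP53", "SPOP"},
                 "c3": {"SPOP"}, "c4": set()}
toy_groups = {"c1": "with_tumor", "c2": "with_tumor",
              "c3": "tumor_free", "c4": "tumor_free"}
toy_result = compare_mutation_rates(toy_mutations, toy_groups,
                                    "with_tumor", "tumor_free")
```

A gene would be called significant under the Chi-square arm of the comparison when its `fdr` entry is below 0.05.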

Feature selection

A multi-factor filtering strategy was proposed to select the genetic features for training the prognosis-prediction model. A gene was excluded when any of the following criteria was met: (1) the mutation rates in both groups were lower than 5%; (2) the absolute difference between the mutation rates of the two groups was lower than 5%; (3) the p value of the Chi-square test without FDR correction was higher than 0.1. The TopN and mRMR strategies were also used for feature selection and model comparison 25, 26. For the TopN strategy, the N genes with the smallest p values (EBT) in the mutation rate comparison were selected as features 25. The mRMR software package was downloaded, installed and used for mRMR feature selection 26.
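The three exclusion criteria can be written as a single predicate. The sketch below is an illustration under stated assumptions: the function names and the default-argument names are hypothetical, and the uncorrected Chi-square p value is computed with a plain Pearson test on a 2x2 table, which may differ in detail from the test configuration used in the study.

```python
import math

def chi2_pvalue(mut_a, n_a, mut_b, n_b):
    """Uncorrected Pearson chi-square p value (1 d.f.) comparing the
    mutated/unmutated counts of one gene between two groups."""
    a, b, c, d = mut_a, n_a - mut_a, mut_b, n_b - mut_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2.0))

def passes_filter(mut_a, n_a, mut_b, n_b,
                  min_rate=0.05, min_diff=0.05, max_p=0.1):
    """Return True if the gene survives all three exclusion criteria."""
    rate_a, rate_b = mut_a / n_a, mut_b / n_b
    if rate_a < min_rate and rate_b < min_rate:
        return False  # (1) rare in both groups
    if abs(rate_a - rate_b) < min_diff:
        return False  # (2) rate difference too small
    if chi2_pvalue(mut_a, n_a, mut_b, n_b) > max_p:
        return False  # (3) uncorrected p value too large
    return True

# Hypothetical counts; group sizes 80 and 308 follow the tumor-status split.
kept = passes_filter(12, 80, 15, 308)           # 15.0% vs 4.9%
rare = passes_filter(2, 80, 8, 308)             # both rates below 5%
flat = passes_filter(8, 80, 25, 308)            # difference below 5%
weak = passes_filter(3, 20, 1, 20)              # small groups, p > 0.1
```

Criterion (3) deliberately uses a threshold looser than conventional significance, which is how genes with only subtle rate differences can still enter the feature set.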

Training of Support Vector Machine models

The n selected genes served as genetic features for model training. For each case $P_j$ (j = 1, 2, …, m) belonging to a certain category $C_i$, where i equaled 1 or 0 and m represented the total number of cases in category $C_i$, the genetic features were represented as a binary vector $F_j(g_1, g_2, …, g_n)$, in which $g_k$ (k = 1, 2, …, n) represented the kth genetic feature, taking the value 1 if the corresponding gene was mutated and 0 otherwise. This gave an m*n matrix for category $C_i$. When NICE was used as an additional feature, the matrix was enlarged to m*(n+1), and the NICE feature was represented in binary form in the additional column, with 1 representing 'high' and 0 representing 'low' or 'medium'. The R package 'e1071' was used to train Support Vector Machine (SVM) models on each training dataset (http://cran.r-project.org). For each training-testing experiment, the training dataset was used for both kernel selection and parameter optimization, as described previously 25. Four kernels, 'radial' (Radial Basis Function, RBF), 'linear', 'polynomial' and 'sigmoid', were individually tuned with a 10-fold cross-validation grid search. The performance of the kernels with their optimized parameters was then compared, and the best kernel (with its optimal parameters) was selected for further model training and for prediction on the testing dataset.
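The binary feature representation can be illustrated with a short sketch. The helper names and the three-gene feature list are assumptions for the example (the study used 43 genes and trained the SVM itself in R with 'e1071', which is not reproduced here).

```python
def encode_case(mutated_genes, feature_genes, nice_risk=None):
    """Binary vector for one case: 1 if the feature gene is mutated, else 0.
    If nice_risk is given, append one column: 1 for 'high', 0 otherwise."""
    row = [1 if gene in mutated_genes else 0 for gene in feature_genes]
    if nice_risk is not None:
        row.append(1 if nice_risk == "high" else 0)
    return row

def build_matrix(cases, feature_genes, use_nice=False):
    """cases: list of (mutated_gene_set, nice_risk) tuples.
    Returns an m x n matrix, or m x (n+1) when use_nice is True."""
    return [encode_case(genes, feature_genes, risk if use_nice else None)
            for genes, risk in cases]

# Hypothetical three-gene feature list and two toy cases.
features = ["TP53", "SPOP", "PTEN"]
cases = [({"TP53", "PTEN"}, "high"), ({"SPOP"}, "medium")]
genetic_only = build_matrix(cases, features)
with_nice = build_matrix(cases, features, use_nice=True)
```

Collapsing 'low' and 'medium' into 0 keeps every column binary, so the NICE column sits on the same scale as the mutation indicators.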

Model performance assessment

Both 5-fold cross-validation and a training-testing strategy were used to evaluate model performance. For 5-fold cross-validation, the original feature matrix of each category was randomly split into five parts of identical size. Four parts from each category were combined to serve as a training dataset, while the remaining part of each category was used for testing and performance evaluation. For the training-testing strategy, 2/3 of the original cases in each category were randomly selected for mutation frequency comparison, feature selection and the subsequent feature representation, and served as the training dataset. Matrices were then prepared for the remaining 1/3 of the cases, using the features newly identified from the corresponding training dataset, and used for testing. Relatively balanced measures, namely the Receiver Operating Characteristic (ROC) curve, the area under the ROC curve (AUC) and Accuracy, were used to assess predictive performance. An ROC curve is a plot of Sensitivity versus (1 - Specificity) and is generated by shifting the decision threshold. The AUC summarizes classifier performance across all thresholds. Accuracy was defined as (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively. The performance of the genetic and combined models was recorded as the average of the 5-fold cross-validation or training-testing results, while that of the pure NICE model was represented as the average of 10-fold bootstrapping results. Student's t-tests were performed for performance comparison with a preset significance level of 0.05.
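The evaluation scheme can be sketched in a few lines. This is an illustration only: the function names and the seed are assumptions, the per-category split is approximately rather than exactly equal when a group size is not divisible by five, and the AUC is computed through its rank interpretation (the fraction of positive-negative pairs in which the positive case scores higher, ties counted as half), which equals the area under the ROC curve obtained by shifting the threshold.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split case indices into k folds, shuffling within each category
    so every fold keeps roughly the same class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + FP + TN + FN) for binary labels 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tp + tn) / (tp + fp + tn + fn)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Labels sized like the tumor-status split (80 'with tumor', 308 'tumor free').
labels = [1] * 80 + [0] * 308
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

toy_accuracy = accuracy([1, 1, 0, 0], [1, 0, 0, 0])
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In each cross-validation round, one fold would serve as the test set and the other four would be pooled for training; the reported figures are then averages over the five rounds.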

Results

Prognostic stratification of PCa

The post-operative survival rate is high for PCa patients, so biochemical recurrence or non-recurrence and other indicators have been used as more effective prognostic statistics 23. Two indicators, biochemical recurrence / non-recurrence and tumor status at last follow-up (with tumor / tumor free), were adopted to stratify the TCGA PCa patients (Fig 1a). A significant dependence was observed between the two stratification criteria, with apparent enrichment of 'with tumor' patients in the 'recurrence' group (and, equivalently, of 'tumor free' patients in the 'non-recurrence' group) (Fig 1a; p = 5.8E-12, Chi-square test). NICE has been widely used to guide the prognostic assessment of PCa in clinical practice. The TCGA cases were therefore also evaluated with NICE, and the results were compared with the stratification based on recurrence or tumor status (Fig 1b-c). With each type of stratification, NICE levels showed a significant association with the prognostic groups (Fig 1b-c; p = 3.2E-4 and 2.3E-7 for NICE vs. recurrence status and NICE vs. tumor status, respectively). Taken together, the results suggested that both recurrence status and tumor status could be used as indicators for prognostic stratification of PCa.

Figure 1

Prognostic stratification of TCGA PCa cases. (a) Stratification of PCa cases based on biochemical recurrence status and tumor status at last follow-up, and the relationship between them. (b) Stratification of PCa cases based on biochemical recurrence status and NICE criteria, and the relationship between them. (c) Stratification of PCa cases based on tumor status at last follow-up and NICE criteria, and the relationship between them. Stacked bar diagrams are shown, with percentages summing to 100%. The number of cases in each subgroup is indicated. Chi-square tests were performed, with the p values indicated in the upper right corner.

Classification of PCa prognosis with atypical somatic mutation signatures

A majority of the TCGA PCa cases were also profiled for tumor somatic mutations. To examine whether PCa prognosis is associated with somatic mutation profiles, the cases with somatic mutation data were stratified according to the prognostic indicators (Table 1). Mutation rates were then compared statistically between prognostic groups for each gene. However, with either stratification strategy, no gene showed a significant mutation rate difference between prognostic groups (Table 1; Supplemental file 1-4).

Table 1

Sample size summary and comparison of somatic mutation profiles between PCa prognostic groups

Recurrence Status		Tumor Status
Recurrence #	58	With Tumor #	80
Non_recurrence #	366	Tumor Free #	308
Sign. Genes #	0	Sign. Genes #	0

Note: Rate comparisons were performed with both Chi-square tests with FDR correction and EBT.

To examine further whether atypical somatic mutation profiles were associated with PCa prognosis, and therefore useful for classification with machine-learning strategies, an integrated feature-filtering pipeline was adopted to screen for potentially more meaningful signatures. The prognosis data stratified by tumor status were used for further analysis, since the 'poor prognosis' group ('With Tumor') contained more samples than the corresponding group stratified by recurrence status ('Recurrence') (Table 1). In total, 43 genes with subtle mutation rate differences between prognostic groups were identified with the filtering pipeline (Table 2).

Table 2

The list of 43 genes used for PCa prognosis classification

Signature genes
AHNAK2	FAM47C	MUC2	SACS
ANKRD30A	FAT2	MUC4	SALL1
ANKRD36C	FAT4	MYH11	SCN5A
APOB	FBN3	MYT1L	SPOP
ATP13A5	FLG2	NOD1	SRCAP
BAI3	FRG1B	PCDHA12	TP53
CACNA1A	HSPG2	PIK3CA	TRPM6
CACNA1E	KMT2D	PTEN	USH2A
CDH23	KRTAP4-9	PTH2	ZNF208
CNTNAP5	LPHN3	PTPRC	ZNF91
EPB41L3	MUC16	RYR1

SVM models were trained with the 43 features, each represented by its somatic mutation profile. The SVM models based on the 43 atypical features discriminated PCa prognosis well, achieving an average AUC of 0.696 and accuracy of 0.662 with a 5-fold cross-validation strategy (Fig 2a-b). When the number of features was reduced, model performance declined strikingly (Fig 2a-b). Interestingly, the models also performed much better than those based on the same number of genes with the smallest p values for mutation rate comparison between prognostic groups (topN), or those based on 43 genes with the smallest redundancy filtered with mRMR (Fig 2c).

Figure 2

Prediction of PCa prognosis with models based on genetic features. (a) ROC curves of 5-, 10-, 20-, 30- and 43-gene genetic models (f5, f10, f20, f30 and f43, respectively). The average results of 5-fold cross-validation are shown. (b) AUC and general accuracy of prognosis prediction models with varied feature size. (c) Comparison of AUC and general accuracy of the f43 model and those based on the topN and mRMR feature selection strategies. (d) Performance of models based on 5-fold cross-validation (CV) and 5-fold training-testing (TT).

Training-testing evaluations were also performed to test the effectiveness of the models and of the feature selection strategy. With the number of features varying from 29 to 47, the models reached an average AUC of 0.653, only slightly lower than the 5-fold cross-validation results but better than the neutral, topN or mRMR models (Fig 2d).

Improvement of prognostic prediction of PCa with a combination of NICE and somatic mutation signatures

Currently, NICE is the system most often used for PCa prognosis prediction. NICE was therefore also used to predict the prognosis of the TCGA PCa cases, but it reached an accuracy of only 55.3% and 61.5% for recurrence and tumor status stratification, respectively (Fig 3a). This performance was even worse than that of the genetic models under 5-fold cross-validation or training-testing evaluation.

Figure 3

Prediction of PCa prognosis with models based on the combined NICE and genetic features. (a) The classification performance of NICE on PCa prognosis stratified by tumor status or recurrence. Bootstrapping analysis was performed and the results are presented as mean ± sd. (b) ROC curves of 43-gene genetic models (f43) and models based on the combined NICE and genetic features (f43+NICE). The average results of 5-fold cross-validation are shown. (c) Comparison of the general accuracy of different prognosis prediction models. Student's t-tests were performed, and an asterisk represents p < 0.05.

The NICE stratification result was also treated as an independent feature and combined with the 43 genetic signatures. A new SVM model was built, which showed a clear improvement in performance over both NICE alone and the models based solely on genetic signatures (Fig 3b-c). The 5-fold cross-validation AUC and accuracy reached 0.746 and 0.713, respectively (Fig 3b-c).

Discussion

PCa is an important cancer type, with high morbidity worldwide. The 5-year survival has improved significantly in recent years. However, it remains a major challenge to reduce long-term mortality and recurrence and to increase the proportion of tumor-free survival. NICE has long been used for risk stratification of PCa patients, but its accuracy needs to be improved. With the TCGA PCa data, we also found that NICE stratification correctly predicted the prognosis for only ~60% of the patients (Fig 3a). Somatic mutations have been well characterized in PCa patients 4,21. However, it has remained largely unknown whether the prognosis of PCa is also related to genetic background. One major objective of the current research was to answer this question. Prognostic groups were stratified by either 'tumor status' or 'recurrence'; however, no gene showed a significant difference in somatic mutation rate between groups with different prognoses. It is possible that the prognosis of PCa is not associated with somatic mutation profiles, but other factors could also explain the observations. For example, the prognostic stratification of PCa itself could be further improved with new criteria. The small number of PCa cases and the generally low somatic gene mutation rates in PCa could also have led to dramatically low statistical power 25. Therefore, no solid conclusion can be drawn until more targeted studies are performed with larger numbers of cases and longer survival follow-up. In fact, even with only atypical mutation rate differences between prognostic groups, the models trained in this research could still distinguish the prognosis well, with accuracy even higher than NICE (Fig 3c). The results indirectly suggest that prognosis depends on the genetic mutation profile. In total, 43 atypical features were used in the model predicting PCa prognosis. 
Although many of the genes have been reported to function in different tumor types and processes, a functional clustering analysis showed a significant enrichment of genes participating in calcium ion binding and transport (GO:0015085, p = 2.59e-02; GO:0005509, p = 3.15e-02; PANTHER Overrepresentation Test, http://pantherdb.org/; data not shown). The combination of these genetic features with NICE factors appeared to improve prognosis prediction significantly compared with models based only on genetic features or on NICE (Fig 3c). A tool was also developed to facilitate testing of the new method for PCa prognosis prediction (http://www.szu-bioinf.org/PCpp). The current model has several drawbacks that need to be addressed in the future. First, the model was evaluated only on a single dataset from TCGA, since it is difficult to find another dataset with both full genomic information and clinical data. 5-fold cross-validation and training-testing evaluation were performed to guard against overfitting; however, new independent datasets are still needed for a more accurate evaluation. The number of genetic features was also somewhat large, and new experiments with more cases could help identify a smaller set of more effective features.

Author Contributions

YW, SZ and HL conceived the project. SZ, YX, YH, FY and YW collected the data and performed the data analysis. XH and YW performed statistical analysis. HL provided the clinical support. SZ, YX, JS and YW developed the models. SZ, JS and YW developed the software tools. SZ, YX and YW wrote the manuscript and all the authors revised it. All the authors approved the final version of manuscript.

24 in total

1. Prognostic value of a microRNA signature in nasopharyngeal carcinoma: a microRNA expression analysis.

Authors: Na Liu; Nian-Yong Chen; Rui-Xue Cui; Wen-Fei Li; Yan Li; Rong-Rong Wei; Mei-Yin Zhang; Ying Sun; Bi-Jun Huang; Mo Chen; Qing-Mei He; Ning Jiang; Lei Chen; William C S Cho; Jing-Ping Yun; Jing Zeng; Li-Zhi Liu; Li Li; Ying Guo; Hui-Yun Wang; Jun Ma
Journal: Lancet Oncol Date: 2012-05-03 Impact factor: 41.316

2. ERG protein expression in diagnostic specimens is associated with increased risk of progression during active surveillance for prostate cancer.

Authors: Kasper Drimer Berg; Ben Vainer; Frederik Birkebæk Thomsen; M Andreas Røder; Thomas Alexander Gerds; Birgitte Grønkær Toft; Klaus Brasso; Peter Iversen
Journal: Eur Urol Date: 2014-03-07 Impact factor: 20.096

3. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer.

Authors: M W Kattan; J A Eastham; A M Stapleton; T M Wheeler; P T Scardino
Journal: J Natl Cancer Inst Date: 1998-05-20 Impact factor: 13.506

4. Prediction of prognosis for prostatic adenocarcinoma by combined histological grading and clinical staging.

Authors: D F Gleason; G T Mellinger
Journal: J Urol Date: 1974-01 Impact factor: 7.450

5. Integrative genomic profiling of human prostate cancer.

Authors: Barry S Taylor; Nikolaus Schultz; Haley Hieronymus; Anuradha Gopalan; Yonghong Xiao; Brett S Carver; Vivek K Arora; Poorvi Kaushik; Ethan Cerami; Boris Reva; Yevgeniy Antipin; Nicholas Mitsiades; Thomas Landers; Igor Dolgalev; John E Major; Manda Wilson; Nicholas D Socci; Alex E Lash; Adriana Heguy; James A Eastham; Howard I Scher; Victor E Reuter; Peter T Scardino; Chris Sander; Charles L Sawyers; William L Gerald
Journal: Cancer Cell Date: 2010-06-24 Impact factor: 31.743

6. Classification of prostatic carcinomas.

Authors: D F Gleason
Journal: Cancer Chemother Rep Date: 1966-03

Review 7. Prostate cancer.

Authors: Gerhardt Attard; Chris Parker; Ros A Eeles; Fritz Schröder; Scott A Tomlins; Ian Tannock; Charles G Drake; Johann S de Bono
Journal: Lancet Date: 2015-06-11 Impact factor: 79.321

Review 8. Prostate cancer.

Authors: Jan-Erik Damber; Gunnar Aus
Journal: Lancet Date: 2008-05-17 Impact factor: 79.321

9. Comparison of digital rectal examination and serum prostate specific antigen in the early detection of prostate cancer: results of a multicenter clinical trial of 6,630 men.

Authors: William J Catalona; Jerome P Richie; Frederick R Ahmann; M'Liss A Hudson; Peter T Scardino; Robert C Flanigan; Jean B DeKernion; Timothy L Ratliff; Louis R Kavoussi; Bruce L Dalkin; W Bedford Waters; Michael T MacFarlane; Paula C Southwick
Journal: J Urol Date: 1994-05 Impact factor: 7.450

10. Cancer incidence in the United Kingdom: projections to the year 2030.

Authors: M Mistry; D M Parkin; A S Ahmad; P Sasieni
Journal: Br J Cancer Date: 2011-10-27 Impact factor: 7.640
