Jianfeng Chu1, Ning Li1, Wentao Gai2. 1. Department of Urology, Yantaishan Hospital, Yantai, Shandong 264000, P.R. China. 2. Department of Urology, Yantai Municipal Laiyang Central Hospital, Yantai, Shandong, 265200, P.R. China.
Abstract
Prostate cancer (PCa) is one of the most prevalent cancer types in men. Biochemical recurrence continues to occur in a large proportion of patients after radical prostatectomy. Thus, prognostic biomarkers are required to guide treatment selection. In the present study, RNA-sequencing gene expression data from The Cancer Genome Atlas were used to develop a risk-score staging system based on the expression of eight genes. Cox multivariate regression was used to predict the outcome of patients with PCa. The biochemical recurrence-free survival of patients with low-risk scores was significantly longer compared with that of patients with high-risk scores (P=5×10−7). This result was further validated using another dataset, GSE70769, from the National Center for Biotechnology Information. The prognostic values of the risk score and other clinical information were evaluated for 5-year biochemical recurrence. The risk score predicted the 5-year biochemical recurrence of patients with PCa with an area under the curve value of 0.819. The risk score was significantly associated with primary tumor stage (P<0.01), Gleason score (P<0.01) and lymph node invasion (P<0.05), but was independent of age. Cox multivariate regression revealed that the risk score was an indicator for the prediction of biochemical recurrence. Thus, the risk score is a valuable and robust indicator for predicting the biochemical recurrence of patients with PCa.
Keywords:
biochemical recurrence; gene expression; model; prognosis; prostate cancer
Prostate cancer (PCa) is one of the most prevalent cancer types in men; in 2015, there were 60,300 newly diagnosed cases of PCa in China, resulting in 26,000 mortalities (1). Disease recurrence has been reported in a large proportion of patients following radical prostatectomy (2), and castration-resistant disease typically develops as a result (3,4). Although prognostic and clinical indicators have been implemented, their prognostic value is not fully understood (5). Thus, clinical biomarkers for PCa biochemical recurrence are required. Huang et al (6) used long non-coding RNAs to develop a prediction model for biochemical recurrence; however, the analysis lacked validation datasets.
Over the previous decade, single biomarkers have been identified for the prognosis of PCa (7–9); however, the utilization of these biomarkers requires further investigation owing to the heterogeneity of PCa (10). Multiple gene-based studies of prognostic biomarkers are currently prevalent owing to their robustness in multiple different cancer types (11–17).
In the present study, survival-associated genes were identified by associating gene expression with survival information from The Cancer Genome Atlas (TCGA). Using a random forest-based variable hunting approach, eight genes were selected and a risk score staging system was developed. Patients with high-risk scores had significantly poorer survival rates compared with patients with low-risk scores. This result was further validated using an independent dataset from the National Center for Biotechnology Information (NCBI), GSE70769 (18). Analysis of clinicopathological factors revealed that the risk score was independent of age but was significantly associated with Tumor Node Metastasis (TNM) stage (19), lymph node invasion and Gleason score. Cox multivariate regression and the area under the receiver operating characteristic (ROC) curve for 5-year biochemical recurrence revealed that the risk score was an important indicator for the prediction of biochemical recurrence.
Materials and methods
Data pre-processing
Raw microarray data of the NCBI dataset GSE70769 were downloaded from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo) (20). Subsequent to background correction and normalization using the Robust Multi-Array Averaging (RMA) approach (21), the data were used for further analysis. The probe names were annotated according to the manufacturer's annotation file. For genes matching multiple probes, the average values were calculated and used as the expression values for the corresponding genes. TCGA gene expression data (https://cancergenome.nih.gov/) were downloaded from University of California Santa Cruz Xena and converted to fragments per kilobase of transcript per million mapped reads (FPKM) values. The log2-transformed RNA-Seq by Expectation-Maximization (RSEM) values were retained for model development.
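The probe-averaging step described above can be illustrated with a minimal sketch. It is not the authors' R code: the function name, the probe IDs and the expression values are hypothetical, and only the averaging rule (one mean value per gene and sample) is taken from the text.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average the expression of all probes that map to the same gene.

    probe_values:  dict of probe ID -> list of per-sample expression values
    probe_to_gene: dict of probe ID -> gene symbol (hypothetical annotation)
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_values.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            # Probes without an annotation are skipped (an assumption).
            continue
        grouped[gene].append(values)
    # zip(*rows) walks the samples column by column,
    # so each gene ends up with one mean value per sample.
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


# Toy input: two probes for ACOX1, one for NAGLU, three samples each.
# All numbers are made up for illustration.
probes = {
    "p1": [2.0, 4.0, 6.0],
    "p2": [4.0, 6.0, 8.0],
    "p3": [1.0, 1.5, 2.0],
}
annotation = {"p1": "ACOX1", "p2": "ACOX1", "p3": "NAGLU"}
gene_expr = collapse_probes_to_genes(probes, annotation)
```

With this toy input, the two ACOX1 probes are merged into a single row, whereas the single NAGLU probe passes through unchanged.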
Prediction gene selection, Cox multivariate regression model and validation
Cox univariate regression was performed on TCGA dataset. Genes with relative expression levels associated with biochemical recurrence-free survival (BFS) were retained for a further random forest-based variable hunting approach, performed as previously described (22,23). Following 100 repeats and 100 iterations, genes from the top of the list were selected for further analysis. Finally, eight genes were identified as the most frequently present in the repeats and iterations; thus, these eight genes were selected for model development. Next, multivariate Cox regression was performed using the aforementioned genes to construct a linear risk-score model. In the validation dataset, the coefficients were locked and the risk score for each sample was calculated. The risk score was calculated using the following formula:
\[ \text{Risk score} = \sum_{i=1}^{n} \beta_i x_i \]
where βi indicates the Cox regression coefficient of gene i, xi refers to the relative expression level of gene i and n is the number of genes in the model (n=8).
For the training dataset, the samples were divided into low- and high-risk groups according to the median risk score using R software (v3.0.1; http://cran.r-project.org/doc/FAQ/R-FAQ.html) and packages (24,25).
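The linear risk score and the median split can be sketched as follows. This is an illustration only: the two coefficients, the sample IDs and the expression values are made up and are not the fitted values of the model. Assigning samples that fall exactly on the median to the low-risk group is also an assumption.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over the model genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label each sample 'high' or 'low' relative to the median risk score."""
    cutoff = median(scores.values())
    # Samples exactly at the cut-off are labelled 'low' (an assumption).
    return {sample: ("high" if value > cutoff else "low")
            for sample, value in scores.items()}


# Hypothetical coefficients for two of the eight genes; the magnitudes are invented.
coefficients = {"ASRGL1": 0.5, "ACOX1": -0.25}

# Hypothetical log2 expression values for four samples.
samples = {
    "s1": {"ASRGL1": 4.0, "ACOX1": 2.0},
    "s2": {"ASRGL1": 2.0, "ACOX1": 4.0},
    "s3": {"ASRGL1": 6.0, "ACOX1": 2.0},
    "s4": {"ASRGL1": 1.0, "ACOX1": 8.0},
}

scores = {sid: risk_score(expr, coefficients) for sid, expr in samples.items()}
risk_groups = split_by_median(scores)
```

Because the coefficients are passed in as plain data, the same `risk_score` function can be reused on a validation dataset with the coefficients held fixed, which mirrors the locking of coefficients described in the text.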
Statistical analysis
Background correction and RMA normalization of raw Affymetrix CEL data were performed using the ‘rma’ function in the ‘affy’ package (v1.56.0) (26). The comparison of survival between the high-risk and low-risk groups, univariate regression in the training dataset, multivariate Cox proportional hazards model development and multivariate regression with risk score and other clinical indicators were performed using the R package ‘survival’ (v1.4–8). The ROC curves were plotted and the area under the curve (AUC) was calculated using the R package ‘pROC’ (v1.11.0) (27). All statistical analyses were performed using R software and packages. P<0.05 was considered to indicate a statistically significant difference.
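To make the AUC concrete, the following sketch computes it as the probability that a randomly chosen patient with recurrence has a higher score than a randomly chosen patient without recurrence (the Mann-Whitney formulation). It does not reproduce the ‘pROC’ implementation. The scores and labels are invented, and treating 5-year recurrence as a simple binary label is an assumption, since the handling of patients censored before 5 years is not described.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: list of risk scores
    labels: list of 1 (recurrence within 5 years) or 0 (no recurrence)
    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


# Made-up risk scores and 5-year recurrence labels for six patients.
example_scores = [2.5, 1.5, 0.0, -1.5, 1.0, 2.0]
example_labels = [1, 0, 0, 0, 1, 1]
auc = roc_auc(example_scores, example_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the values reported for the risk score and the clinical variables are compared.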
Results
Identification of survival-associated genes
Univariate Cox regression was performed on TCGA dataset, following filtering of the non-primary PCa tissues, by associating BFS with gene expression. Detailed information on the samples enrolled in TCGA dataset is presented in Table I. Genes significantly associated with BFS (P<0.01) were retained for further analysis. As the identified gene panel was relatively large, a random forest-based variable hunting approach was implemented to retrieve the best combination of biomarkers. Eight genes were selected for further model development (Fig. 1A; Table II). The coefficients of these genes are presented in Fig. 1B. Positive coefficients suggest that the corresponding genes act as oncogenes, whereas negative coefficients suggest tumor suppressor genes.
Table I. The Cancer Genome Atlas sample information.
Variables           Samples, n
Age, years
  <60               138
  >60               170
Tumor stage
  T2                131
  T3-T4             177
Gleason score
  1                 21
  2                 115
  3                 72
  4                 38
  5                 62
T, tumor.
Figure 1.
Genes selected for model development. (A) Frequency of selected genes in random forest variable hunting. (B) Coefficients of genes in the risk score model. CHST1, carbohydrate sulfotransferase 1; ACOX1, acyl-CoA oxidase 1; CTBS, chitobiase; GNPNAT1, glucosamine-phosphate N-acetyltransferase 1; NAGLU, N-acetyl-α-glucosaminidase; LPIN3, lipin 3; ASRGL1, asparaginase like 1; HMGCS2, 3-hydroxy-3-methylglutaryl-CoA synthase 2.
Table II.
Univariate and multivariate Cox regression analysis of candidate genes.
To assess the prognostic value of the risk score model, BFS was compared between the high- and low-risk score groups, using the median risk score as the cut-off. The BFS in the high-risk score group was significantly shorter compared with that in the low-risk score group (P=5×10−7; Fig. 2A). As presented in Fig. 2A, samples with early biochemical recurrence were characterized by high expression of asparaginase like 1 (ASRGL1), lipin 3 and carbohydrate sulfotransferase 1. However, patients without biochemical recurrence presented with high expression of glucosamine-phosphate N-acetyltransferase 1 (GNPNAT1), chitobiase, acyl-CoA oxidase 1 (ACOX1), 3-hydroxy-3-methylglutaryl-CoA synthase 2 (HMGCS2) and N-acetyl-α-glucosaminidase (NAGLU), which was consistent with the coefficients (Figs. 1B and 2B). Disease-free survival time was additionally compared between the high- and low-risk groups, and the result was similar to the BFS pattern: the survival of the high-risk group was notably lower compared with that of the low-risk group (Fig. 2C). The 5-year BFS ROC curve was used to compare the prognostic value of the risk score with that of other clinicopathological observations (Fig. 2D). The AUCs of age, Gleason score, primary tumor stage, lymph node invasion and risk score were 0.597, 0.647, 0.628, 0.578 and 0.819, respectively. In particular, the results indicate that the mortality risk of patients with the highest risk scores was very high. These results indicate that the risk score is better at predicting BFS than the other clinical observations.
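The text does not name the test used to compare the two survival curves. A common choice for this comparison is the two-group log-rank test, which is assumed in the minimal sketch below. The follow-up times, event indicators and group labels are invented and do not come from either dataset.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times:  follow-up time of each patient
    events: 1 if biochemical recurrence was observed, 0 if censored
    groups: 1 for the high-risk group, 0 for the low-risk group
    Returns the chi-square statistic (1 degree of freedom) and its P-value.
    """
    records = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in records if e == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        at_risk = [r for r in records if r[0] >= t]
        n = len(at_risk)                                  # patients still at risk
        n1 = sum(1 for r in at_risk if r[2] == 1)         # of which in group 1
        failed = [r for r in at_risk if r[0] == t and r[1] == 1]
        d = len(failed)                                   # events at time t
        d1 = sum(1 for r in failed if r[2] == 1)          # of which in group 1
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("the test is undefined: no variance between groups")
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up data: four high-risk patients who all recur early,
# four low-risk patients with later events or censoring.
toy_times = [2, 3, 5, 7, 8, 10, 12, 15]
toy_events = [1, 1, 1, 1, 1, 0, 1, 0]
toy_groups = [1, 1, 1, 1, 0, 0, 0, 0]
chi2, p_value = logrank_test(toy_times, toy_events, toy_groups)
```

The statistic compares the number of recurrences observed in the high-risk group with the number expected if both groups shared the same hazard, so a large value indicates that the curves differ.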
Figure 2.
Risk score for prognosis in the training dataset. (A) Biochemical recurrence-free survival rate of high- and low-risk groups. (B) Heat maps of gene expression for each dataset. Blue/red dots in the first panel refer to the low- and high-risk groups, respectively. (C) Disease-free survival rates of high- and low-risk groups. (D) The 5-year survival receiver operating characteristic curve of risk score and other clinical observations and their AUC. *P<0.001, risk score AUC vs. other clinical observations. AUC, area under the curve; T stage, tumor stage; CHST1, carbohydrate sulfotransferase 1; ACOX1, acyl-CoA oxidase 1; CTBS, chitobiase; GNPNAT1, glucosamine-phosphate N-acetyltransferase 1; NAGLU, N-acetyl-α-glucosaminidase; LPIN3, lipin 3; ASRGL1, asparaginase like 1; HMGCS2, 3-hydroxy-3-methylglutaryl-CoA synthase 2.
Validation of risk score performance
The high performance of the risk score may have resulted from overfitting to the training dataset. To test whether overfitting had occurred and to evaluate the robustness of the model, the coefficients were locked and the risk scores were calculated for an independent NCBI dataset (GSE70769). The samples from the independent dataset were additionally divided into high- and low-risk groups, as with the training dataset. The results were similar to the BFS profile of the training dataset. The BFS of patients in the high-risk score group was significantly shorter than that of patients in the low-risk score group (P=0.04; Fig. 3A) and was associated with early biochemical recurrence (Fig. 3B). The expression profile was additionally similar to that of the training dataset (Fig. 3C). These results indicate that the risk score is a robust indicator for PCa prognosis.
Figure 3.
Prognostic value of the risk score on survival in a validation dataset. (A) Biochemical recurrence-free survival rate of the high- and low-risk groups in the GSE70769 dataset. (B) Detailed biochemical recurrence survival information. (C) Candidate gene expression. CHST1, carbohydrate sulfotransferase 1; ACOX1, acyl-CoA oxidase 1; CTBS, chitobiase; GNPNAT1, glucosamine-phosphate N-acetyltransferase 1; NAGLU, N-acetyl-α-glucosaminidase; LPIN3, lipin 3; ASRGL1, asparaginase like 1; HMGCS2, 3-hydroxy-3-methylglutaryl-CoA synthase 2.
Association between risk score and other clinical information
The association between the risk score and clinical information was analyzed. The results indicated that the risk score was significantly associated with primary tumor stage (P<0.05), Gleason score (P<0.01) and lymph node invasion (P<0.01), but not with age (Fig. 4A). Cox multivariate regression was performed using the risk score and the aforementioned clinical observations. The risk score was the only prognostic indicator identified to be significantly associated with biochemical recurrence (P=3×10−5; Fig. 4B). In summary, these results indicate that the risk score is an important clinical indicator of PCa prognosis.
Figure 4.
Clinical information and risk score. (A) The association between risk score and clinicopathological information was evaluated and presented as a box plot. (B) Cox multivariate regression was performed using the risk score and other clinical information. The red dots indicate the hazard ratio, and the red line represents 95% confidence intervals. T, tumor.
Discussion
Despite the low rate of progression, biochemical recurrence and metastasis continue to be observed in a large proportion of patients with PCa (28). Thus, prognostic biomarkers are urgently required. Over the previous decade, single biomarkers have been reported to predict the survival of patients with PCa (3,9,29). However, the single-biomarker approach to cancer prognosis assessment is less robust compared with the more widely reported multiple-biomarker-based models (30–32). Using machine learning and gene expression, the present study developed a Cox multivariate regression-based risk score model. The model was then further evaluated for performance and robustness. The risk score staging system performed well in predicting survival in two datasets from different platforms (RNA-sequencing and microarray).
Among the candidate genes selected, serum NAGLU has been reported to be associated with the clinical indicators and survival of gastrointestinal adenocarcinoma (33), and the expression of another gene, GNPNAT1, has been demonstrated to be associated with the progression of castration-resistant PCa (34) via the phosphatidylinositol 3-kinase/protein kinase B signaling pathway. Proteomics analysis revealed that HMGCS2 expression is altered in PCa, and that the expression of this gene is associated with the survival of patients with squamous cell carcinoma following surgery (35,36). It has additionally been revealed to affect the extracellular signal-regulated kinase/c-Jun signaling pathway in hepatocellular carcinoma (37). In addition, ACOX1 has been reported to be associated with migration and metastasis in the xenografts of colorectal carcinoma (38), and with the mitogen-activated protein kinase signaling pathway in hepatocellular carcinoma (39). A similar function was detected for ASRGL1 in endometrial carcinoma (39), although the underlying mechanism remains unclear. Collectively, these results indicate that the candidate genes used in the model are reliable, thus reinforcing the robustness of the model.
In a previous study, Huang et al (6) used TCGA expression data to predict biochemical recurrence; however, the study lacked a validation dataset. The present study was novel as it developed a robust prediction model for PCa that was validated using another platform. The RNA-sequencing data were presented as log2-transformed FPKM values, whereas the microarray data were presented as log2-transformed intensity values. The risk score was calculated using the relative gene expression level, regardless of its unit. This may explain why the model is functional across different platforms.
However, the present study has limitations. Firstly, it is a retrospective study, and detailed clinical information and long-term follow-up data are unavailable. Thus, bias may have resulted. Secondly, although the robustness of the risk score was validated using another dataset, the clinical utilization of the risk score requires further studies in order to fully confirm its efficiency. The present findings may provide novel insights for predicting the biochemical recurrence of patients with PCa.