Xi Sun1, Zhi-Rui Zhou2, Yan Fang1, Shuning Ding1, Shuangshuang Lu1, Zheng Wang1, Hui Wang1, Xiaosong Chen1, Kunwei Shen1. 1. Department of General Surgery, Comprehensive Breast Health Center, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China. 2. Radiation Oncology Center, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China.
Abstract
BACKGROUND: Breast cancer risk prediction is often based on clinicopathological characteristics despite the high heterogeneity derived from gene expression. Metabolic alteration is a hallmark of cancer, and thus, the integration of a metabolic signature with clinical parameters is necessary to predict disease outcomes in breast cancers. METHODS: Metabolic genes were downloaded from the Gene Set Enrichment Analysis (GSEA) dataset. Genes with statistical significance in the univariate analysis were applied in the least absolute shrinkage and selection operator (LASSO) analysis to build a gene signature in the GSE20685 dataset. Clinicopathological characteristics and risk scores with prognostic significance were incorporated into the nomogram to predict the overall survival (OS) of patients. The Cancer Genome Atlas (TCGA) and GSE866166 datasets were used as the validation datasets. Time-dependent receiver operating characteristic (tROC) curves and calibration plots were used to assess the accuracy and discrimination of the model. RESULTS: A 55-gene metabolic gene signature (MGS) was constructed, and was significantly related to OS both in the discovery (P<0.001) and validation (P<0.001) datasets. The MGS was an independent prognostic factor and could divide patients into high- and low-risk groups regardless of their different prediction analysis of microarray 50 (PAM50) subtypes. Time-dependent ROC curves indicated that the risk scores based on the MGS [area under the ROC curve (AUC): 0.931] were superior to the those based on the American Joint Committee on Cancer (AJCC) stage (AUC: 0.781) and PAM50 (AUC: 0.675). A nomogram based on the AJCC stage and risk score could predict OS, and the calibration curves showed good agreement to the actual outcome, indicating that the nomogram may have practical utility. Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analysis indicated that this MGS was primarily enriched in amino acid pathways. CONCLUSIONS: Our results demonstrated that the MGS was superior to existing risk predictors such as PAM50 and AJCC stage. By combining clinical factors (AJCC stage) and the MGS, a nomogram was constructed and showed good predictive ability for OS in breast cancer. 2021 Annals of Translational Medicine. All rights reserved.
BACKGROUND: Breast cancer risk prediction is often based on clinicopathological characteristics despite the high heterogeneity derived from gene expression. Metabolic alteration is a hallmark of cancer, and thus, the integration of a metabolic signature with clinical parameters is necessary to predict disease outcomes in breast cancers. METHODS: Metabolic genes were downloaded from the Gene Set Enrichment Analysis (GSEA) dataset. Genes with statistical significance in the univariate analysis were applied in the least absolute shrinkage and selection operator (LASSO) analysis to build a gene signature in the GSE20685 dataset. Clinicopathological characteristics and risk scores with prognostic significance were incorporated into the nomogram to predict the overall survival (OS) of patients. The Cancer Genome Atlas (TCGA) and GSE866166 datasets were used as the validation datasets. Time-dependent receiver operating characteristic (tROC) curves and calibration plots were used to assess the accuracy and discrimination of the model. RESULTS: A 55-gene metabolic gene signature (MGS) was constructed, and was significantly related to OS both in the discovery (P<0.001) and validation (P<0.001) datasets. The MGS was an independent prognostic factor and could divide patients into high- and low-risk groups regardless of their different prediction analysis of microarray 50 (PAM50) subtypes. Time-dependent ROC curves indicated that the risk scores based on the MGS [area under the ROC curve (AUC): 0.931] were superior to the those based on the American Joint Committee on Cancer (AJCC) stage (AUC: 0.781) and PAM50 (AUC: 0.675). A nomogram based on the AJCC stage and risk score could predict OS, and the calibration curves showed good agreement to the actual outcome, indicating that the nomogram may have practical utility. Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analysis indicated that this MGS was primarily enriched in amino acid pathways. CONCLUSIONS: Our results demonstrated that the MGS was superior to existing risk predictors such as PAM50 and AJCC stage. By combining clinical factors (AJCC stage) and the MGS, a nomogram was constructed and showed good predictive ability for OS in breast cancer. 2021 Annals of Translational Medicine. All rights reserved.
Entities:
Keywords:
Metabolism; breast cancer; gene signature; survival prediction
Breast cancer is the most common cancer in females, accounting for 30% of new cancer diagnoses (1). It is the second leading cause of cancer death in females worldwide, responsible for nearly 15% of cancer-related fatalities. Despite the dramatic improvement in breast cancer prognosis due to advances in early diagnosis and treatment over the past decades, the high incidence and mortality of breast cancer still poses a significant threat to human health (2).Growing evidence has shown that breast cancer is a heterogeneous disease. Patients with similar clinicopathological characteristics may differ in clinical outcomes because of different gene expression patterns (3). In recent years, the advent of high-throughput platforms has created great opportunities for researchers to explore the distinct molecular aberrations among tumors (4). On the basis of conventional clinicopathological features, individual gene signatures provide complementary information to predict disease prognosis (5,6). Among these gene signatures, a 50-gene classifier called prediction analysis of microarray 50 (PAM50) helps classify breast cancers into five intrinsic subtypes with distinct clinical outcomes: luminal A, luminal B, human epidermal growth factor receptor 2 (HER2)-overexpressed, basal-like, and normal-like tumors (7). However, the PAM50 signature alone cannot explain the discrepancies between the prognoses in patients with the same subtype, possibly because PAM50 primarily includes proliferation-associated genes (8). As a result, gene signatures involving diverse biological processes are needed to more precisely predict patient survival.The relationship between metabolism and cancer is multifaceted and bidirectional. Despite the biological significance of metabolism in cancer, metabolic gene signature (MGS) has not been implemented to predict survival in breast cancer patients. For one, aberrant metabolism is a hallmark of cancer, and is involved in diverse aspects of cancer including tumorigenesis, proliferation, and metastasis (9,10). Energy and biomass (nucleotides, amino acids, and lipids) produced via metabolic processes are necessary to satisfy the excessive proliferation of cancer cells and to adapt to environmental stress (11). Metabolites are essential in coordinating gene expression and nutrient utilization (12). Under these circumstances, metabolic pathways are rewritten to adapt to changes in cancer progression. For another, dysregulation of oncogenes and tumor suppressors could result in impaired regulation of metabolic pathways (13-15), and metabolic phenotype is now used as a treatment target and can provide clues regarding cancer treatment (10).Many metabolic pathways support the survival of cancer (16-19). Based on the complexity and diversity of tumor metabolism, genes from various metabolic pathways were selected to develop an MGS to predict breast cancer prognosis. In this study, metabolic genes with statistical significance in univariate Cox hazard analysis were incorporated into the least absolute shrinkage and selection operator (LASSO) model to calculate the risk score for each patient. Time-dependent receiver operating characteristic (tROC) curve analysis was applied to compare the prediction accuracy of the risk scores and clinicopathological parameters. A nomogram was established based on the risk score and the American Joint Committee on Cancer (AJCC) staging, and calibration plots were constructed in both the discovery and validation datasets.Our study aimed to elucidate the metabolic pathways contributing to cancer and explore the value of MGS in breast cancer risk stratification. We found that MGS could be used as an effective prognostic predictor and provides a rationale for the individual management of breast cancer patients. We present the following article in accordance with the TRIPOD reporting checklist (available at http://dx.doi.org/10.21037/atm-20-4813).
Methods
Datasets and preprocessing
Datasets from the Gene Expression Omnibus (GEO) database and The Cancer Genome Atlas (TCGA) dataset between 2011 and 2017 were reviewed. Datasets were included if they reported overall survival (OS) and clinicopathological parameters. GSE20685 (platform GPL570), comprising 327 patients, was selected as the discovery dataset, from which messenger RNA (mRNA) expression data and related clinical data for breast cancer were downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/). Probes were transformed to corresponding Entrez gene names referring to the annotation files.For the validation dataset, the GSE86166 dataset and related clinical data were downloaded from the GEO dataset. A total of 366 patients were included in this dataset. Another validation dataset was obtained from TCGA database comprising 1,039 patients. Normalized fragments per kilobase of exon model per million reads mapped (FPKM) data and related clinical data were downloaded from TCGA (https://portal.gdc.cancer.gov/). The metabolic signature was downloaded from the Gene Set Enrichment Analysis (GSEA) dataset. Pathways including glucose and lipid metabolism, amino acid metabolism, and nucleotide metabolism were examined in our analysis, and metabolic genes in both datasets were extracted for further analysis. Batches were removed with the R package “sva”. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Construction of the risk score
Univariate Cox regression was conducted to assess the association between gene expression levels and the OS of patients. Genes with a P value <0.05 were included in further analyses. Genes with a hazard ratio (HR) >1 indicated a poor prognosis, while genes with an HR <1 denoted a good prognosis. The LASSO model, which uses a L1-norm to penalize the weight of the model parameters, was then applied to remove genes of high correlation (20,21). A risk score formula was established by including individual normalized gene expression values weighted by their LASSO Cox coefficients as follows: ∑i1 Coefficient (mRNAi)*Expression (mRNAi). The R package “glmnet” in R 3.5.2 was used to conduct the LASSO analysis. The median value of the risk scores in the discovery dataset was used as the cutoff to divide patients into high- and low-risk groups in both the discovery and validation datasets. In the discovery dataset, the cutoff value from the tROC curve was also tested in different subtypes. Survival was compared between the high- and low-risk groups using the Kaplan-Meier (K-M) survival curve with log-rank tests.
Construction of the nomogram
To evaluate whether the risk score could serve as an independent prognostic indicator for breast cancer patients, univariate and multivariate Cox regression was used in both the discovery and validation datasets. There were 327, 716, and 341 patients with clinicopathological parameters in GSE20685, TCGA, and GSE86166 datasets, respectively. Parameters available in all datasets such as age, intrinsic subtype, and the AJCC tumor-node-metastasis (TNM) staging system (AJCC stage) were included in the analysis. The subtypes of patients were calculated using the “genefu” package in R. AJCC stage was statistically significant in univariate analysis and was assessed in further analysis. A nomogram was constructed to visualize the multivariate Cox regression. Variables with a P value <0.05 in the multivariate Cox regression were incorporated into the nomogram using the R package “rms” to predict 3-year OS rates.
Statistical analysis
The predictive accuracy of independent prognostic parameters including AJCC stage, subtypes, and risk score was calculated using the tROC curve. The discriminatory ability of different parameters was evaluated by the area under the ROC curve (AUC). The “survivalROC” package in R software was used to plot the tROC curve and calculate the AUC. The calibration plot was used to visualize the performance of the nomogram in all datasets. Functional annotation analysis, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analysis, was performed on the Database for Annotation, Visualization and Integrated Discovery (DAVID) website (https://david.ncifcrf.gov/). Heatmaps of the LASSO analysis genes were plotted using the “heatmap” R package. A P value <0.05 was considered statistically significant.
Results
Screening of genes related to OS and construction of the MGS
The flowchart of the screening process for prognostic metabolic genes is presented in . First, metabolic genes were extracted from the GSEA database and a total of 863 genes were included. The genes were further validated as metabolic genes via KEGG () and GO () analysis. In the KEGG analysis, the top five pathways were purine metabolism, pyrimidine metabolism, carbon metabolism, inositol phosphate metabolism, and glycerophospholipid metabolism (). In the GO analysis, the top five pathways were oxidation-reduction process, metabolic process, xenobiotic metabolic process, phosphatidylinositol biosynthetic process, and inositol phosphate metabolic process (). The GSE20685 dataset was used as the discovery cohort, while the GSE86166 and TCGA datasets were used as the validation cohorts. In the GSE20685 dataset, a total of 161 metabolic genes were statistically significant (P<0.001) in the univariate Cox regression analysis and were selected as candidate genes for further analysis (Table S1). To reduce the high correlation between the genes, LASSO regression analysis was conducted to identify robust markers (), and 55 genes were finally included to construct the MGS (). The coefficients of the genes are listed in Table S2. As shown in , patients with a high-risk score exhibited significantly worse OS compared to those with a low-risk score in the discovery dataset (P<0.001, ), which was confirmed in TCGA (P<0.001, ) and the GSE86166 (P=0.032, ) validation cohorts.
Figure 1
Flow chart of the process used to select target genes included in the analysis. The GSE20625 dataset was used as a discovery dataset. LASSO regression analysis was used to identify the MGS. A nomogram was constructed using clinical parameters and the MGS. ROC curves and calibration plots were used to assess the accuracy and discrimination of the nomogram. LASSO, least absolute shrinkage and selection operator; ROC, receiver operating characteristic; MGS, metabolic gene signature.
Figure 2
LASSO Cox regression analysis. (A) KEGG analysis of metabolic genes extracted from the GSEA dataset. X-lab is the number of genes; (B) GO analysis of metabolic genes extracted from the GSEA dataset. X-lab is the number of genes; (C) parameter selection in the LASSO model; (D) LASSO coefficient of the genes of prognostic value. KEGG, Kyoto Encyclopedia of Genes and Genomes; GSEA, Gene Set Enrichment Analysis; GO, Gene Ontology; LASSO, least absolute shrinkage and selection operator.
Figure 3
Kaplan-Meier plot of the discovery and validation cohorts based on the metabolic gene signature (MGS). (A) Kaplan-Meier plot of the test cohort; (B) Kaplan-Meier plot of the TCGA validation cohort; (C) Kaplan-Meier plot of the GSE86166 validation dataset. TCGA, The Cancer Genome Atlas.
Flow chart of the process used to select target genes included in the analysis. The GSE20625 dataset was used as a discovery dataset. LASSO regression analysis was used to identify the MGS. A nomogram was constructed using clinical parameters and the MGS. ROC curves and calibration plots were used to assess the accuracy and discrimination of the nomogram. LASSO, least absolute shrinkage and selection operator; ROC, receiver operating characteristic; MGS, metabolic gene signature.LASSO Cox regression analysis. (A) KEGG analysis of metabolic genes extracted from the GSEA dataset. X-lab is the number of genes; (B) GO analysis of metabolic genes extracted from the GSEA dataset. X-lab is the number of genes; (C) parameter selection in the LASSO model; (D) LASSO coefficient of the genes of prognostic value. KEGG, Kyoto Encyclopedia of Genes and Genomes; GSEA, Gene Set Enrichment Analysis; GO, Gene Ontology; LASSO, least absolute shrinkage and selection operator.Kaplan-Meier plot of the discovery and validation cohorts based on the metabolic gene signature (MGS). (A) Kaplan-Meier plot of the test cohort; (B) Kaplan-Meier plot of the TCGA validation cohort; (C) Kaplan-Meier plot of the GSE86166 validation dataset. TCGA, The Cancer Genome Atlas.
Assessment of the MGS
The OS of the low-risk group was significantly higher than that of the high-risk group in the discovery cohort (). The median OS was 9.48 years in the high-risk group, while it was still unreached in the low-risk group. Furthermore, in TCGA dataset, the OS of the high-risk group was also worse than that of the low-risk group (). The median survival time for the high- and low-risk groups was 11.4 and 17.69 years, respectively. In the GSE86166 validation dataset cohort, the median survival time for high- and low-risk groups was 12.6 and 14.4 years, respectively ().
Figure 4
Risk plot of the discovery and validation cohorts. (A) Risk score distribution of patients in the prognostic model in the discovery cohort; (B) relationship between the survival time and risk score rank in the discovery cohort; (C) risk score distribution of patients in the prognostic model in the GSE20685 validation dataset; (D) relationship between the survival time and risk score rank in the GSE20685 validation cohort; (E) risk score distribution of patients in the prognostic model in the GSE86166 validation dataset; (F) relationship between the survival time and risk score rank in the GSE86166 validation cohort.
Risk plot of the discovery and validation cohorts. (A) Risk score distribution of patients in the prognostic model in the discovery cohort; (B) relationship between the survival time and risk score rank in the discovery cohort; (C) risk score distribution of patients in the prognostic model in the GSE20685 validation dataset; (D) relationship between the survival time and risk score rank in the GSE20685 validation cohort; (E) risk score distribution of patients in the prognostic model in the GSE86166 validation dataset; (F) relationship between the survival time and risk score rank in the GSE86166 validation cohort.Detailed information about clinical characteristics is described in . In the discovery cohort, prognosis was correlated with risk score, AJCC stage, and subtype in the univariate analysis (); these factors were included in the multivariate analysis. Risk score [P<0.001, HR =2.275, 95% confidence interval (CI): 1.411, 3.668] and AJCC stage (P<0.01, HR =5.050, 95% CI: 3.910, 6.524) were identified as independent prognostic predictors in the multivariate analysis (). In TCGA cohort, age (P<0.001, HR =1.036, 95% CI: 1.019, 1.054), AJCC stage (P=0.010, HR =1.825, 95% CI: 1.157, 2.877), and risk score (P<0.001, HR =1.247, 95% CI: 1.115, 1.393) were identified as unfavorable prognostic factors and independent prognostic predictors in the multivariate analysis (). In the GSE86188 dataset, risk score (P=0.025, HR =1.229, 95% CI: 1.026, 1.473) and AJCC stage (P=0.001, HR =2.159, 95% CI: 1.363, 3.418) were independent prognostic predictors as identified through multivariate analysis ().
Table 1
Clinicopathological characteristics of breast cancer patients from the training and validation sets
Variables
Training set (n=327), GSE20685 (%)
Validation set (n=716), TCGA (%)
Validation set (n=341), GSE86166 (%)
Age
Median [range]
46 [24–84]
58 [26–89]
–
≤55 years
252 (77.1)
316 (44.1)
–
>55 years
75 (22.9)
400 (55.9)
–
Stage
I + II
216 (66.1)
556 (77.7)
272 (79.8)
III + IV
111 (33.9)
160 (22.3)
69 (20.2)
Subtype
Lumina A
116 (35.5)
125 (17.5)
133 (39.0)
Lumina B
73 (22.3)
405 (56.6)
91 (26.7)
HER2
74 (22.6)
67 (9.4)
55 (16.1)
Basal
45 (13.8)
109 (15.2)
54 (15.8)
Normal
16 (4.9)
10 (1.4)
8 (2.3)
T category
T1
101 (30.9)
185 (25.8)
–
T2
188 (57.5)
428 (59.8)
–
T3
26 (8.0)
81 (11.3)
–
T4
12 (3.7)
22 (3.1)
–
Node status
N0
137 (41.9)
350 (48.9)
–
N1
87 (26.6)
253 (35.3)
–
N2
63 (19.3)
66 (9.2)
–
N3
40 (12.2)
47 (6.6)
–
Metastasis
M0
319 (97.6)
705 (98.5)
–
M1
8 (2.4)
11 (1.5)
–
Figure 5
Univariate and multivariate Cox regression analyses of the discovery and validation cohorts. (A) Univariate Cox regression analysis in the discovery cohort; (B) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in the discovery cohort; (C) univariate Cox regression analysis in TCGA validation cohort; (D) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in TCGA validation cohort; (E) univariate Cox regression analysis in the GSE86166 validation cohort; (F) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in the GSE86166 validation cohort.
Univariate and multivariate Cox regression analyses of the discovery and validation cohorts. (A) Univariate Cox regression analysis in the discovery cohort; (B) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in the discovery cohort; (C) univariate Cox regression analysis in TCGA validation cohort; (D) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in TCGA validation cohort; (E) univariate Cox regression analysis in the GSE86166 validation cohort; (F) significant parameters in the univariate Cox regression analysis were included in the multivariate Cox regression analysis in the GSE86166 validation cohort.To evaluate whether MGS could predict survival regardless of intrinsic subtypes, K-M curves were plotted in different subtypes in the discovery dataset. The outcome of high-risk patients was significantly worse than that of low-risk patients (, P<0.001; , P<0.001; , P=0.007) regardless of their different cutoff values (Figure S1). The median survival time for the low-risk group was unreached in all three subtypes, which indicated that this MGS could stratify patients into different risk groups regardless of intrinsic subtypes.
Figure 6
Kaplan-Meier plot of the risk score in different subtypes. Kaplan-Meier plot of the risk score in the luminal (A), HER2 (B), and basal (C) subtypes in the discovery dataset.
Kaplan-Meier plot of the risk score in different subtypes. Kaplan-Meier plot of the risk score in the luminal (A), HER2 (B), and basal (C) subtypes in the discovery dataset.Indeed, the histological subtype of breast cancer has a notable effect on the prognosis. Thus, we performed subgroup analysis according to the histological subtype and found that there was no significant interaction between the MGS and the histological subtype (interaction P=0.36). In invasive ductal carcinoma (Figure S2A), a total of 462 patients (60.5%) were classified into the high-risk group, and 302 patients (39.5%) were classified into the low-risk group (P=0.002). In invasive lobular carcinoma (Figure S2B), a total of 86 patients (42.6%) were classified into the high-risk group and 116 patients (57.4%) were classified into the low-risk group (P=0.0004). In both invasive ductal carcinoma and invasive lobular carcinoma, patients with a high-risk score exhibited worse OS compared with patients with a low-risk score. This indicates that the MGS was a robust gene signature regardless of the pathological types in breast cancer. To assess whether the performance of the MGS was different in predicting the prognosis of males versus females, we applied our signature to lung cancer from TCGA database. A total of 938 patients with OS were included in this analysis. High-risk patients showed worse prognosis compared to low-risk patients in the entire dataset (P=0.026, Figure S3A). We then stratified patients according to sex and found that MGS could stratify female patients into high- and low-risk groups (P=0.045), while there was no difference between the high- and low-risk groups in males (P=0.23, Figure S3B,C). Thus, we cannot conclude that this method is more advantageous in predicting the prognosis of males compared to females; more studies are needed to evaluate whether there is a difference in the MGS for predicting the prognosis in different sexes.The tROC curves were plotted to compare the prediction accuracy of the risk score, AJCC stage, and subtype on 3-year OS. In the discovery dataset, the AUC of the risk score was equal to 0.931. The AUC values of AJCC stage and subtype were 0.781 and 0.675, respectively (). In TCGA dataset, the AUC of risk score was 0.704, which was higher than that of subtype (AUC: 0.629) and AJCC stage (AUC: 0.601; ). In the GSE86166 dataset, the AUC values of the risk score, AJCC stage, and subtype were 0.676, 0.667, and 0.615, respectively ().
Figure 7
Prediction accuracy based on different factors. (A) Time-dependent ROC curve for OS of the discovery dataset; (B) time-dependent ROC curve for OS of TCGA validation dataset; (C) time-dependent ROC curve for OS of the GSE86166 validation dataset. OS, overall survival; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.
Prediction accuracy based on different factors. (A) Time-dependent ROC curve for OS of the discovery dataset; (B) time-dependent ROC curve for OS of TCGA validation dataset; (C) time-dependent ROC curve for OS of the GSE86166 validation dataset. OS, overall survival; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.
Construction of MGS-based model
In our analysis, statistically significant clinicopathological and genetic factors in both datasets (AJCC stage and risk score) were included to construct a nomogram (). The concordance index (C-index) of the nomogram was 0.91 (95% CI: 0.85–0.98), which indicated that this model had high prediction accuracy. The presentation of the calibration plot for patient survival prediction in the discovery and validation datasets demonstrated that the nomogram-predicted outcome had good agreement with the actual outcome ().
Figure 8
Nomogram for the discovery cohort (verified in the validation cohort). (A) Nomogram to predict 3-year OS in the discovery set; (B) calibration plot for the discovery dataset; (C) calibration plot for TCGA validation dataset; (D) calibration plot for the GSE86166 validation dataset. OS, overall survival; TCGA, The Cancer Genome Atlas.
Nomogram for the discovery cohort (verified in the validation cohort). (A) Nomogram to predict 3-year OS in the discovery set; (B) calibration plot for the discovery dataset; (C) calibration plot for TCGA validation dataset; (D) calibration plot for the GSE86166 validation dataset. OS, overall survival; TCGA, The Cancer Genome Atlas.
KEGG and GO pathway analysis of the MGS
KEGG and GO analyses were performed to further evaluate the function of the genes in the MGS ().In the KEGG analysis, the top three pathways were histidine metabolism, arginine and proline metabolism, and glutathione metabolism (), indicating that our signature was associated with amino acid metabolism. In the GO analysis, the top three pathways were oxidation-reduction process pathway, xenobiotic metabolic process, and one-carbon metabolic process ().
Figure 9
Functional annotation analysis of the metabolic gene signature. (A) KEGG analysis of the metabolic gene signature. X-lab is the number of genes; (B) GO analysis of the metabolic gene signature. X-lab is the number of genes. KEGG, Kyoto Encyclopedia of Genes and Genomes; GO, Gene Ontology.
Functional annotation analysis of the metabolic gene signature. (A) KEGG analysis of the metabolic gene signature. X-lab is the number of genes; (B) GO analysis of the metabolic gene signature. X-lab is the number of genes. KEGG, Kyoto Encyclopedia of Genes and Genomes; GO, Gene Ontology.
Discussion
The high incidence and mortality of breast cancer poses a significant threat to human health. Traditionally, clinicopathological factors have been common and important considerations for survival prediction of breast cancer patients (22). However, patients with the same clinical factors may differ in clinical outcomes due to diverse gene expression patterns (23,24). The conceptual changes in tumor heterogeneity have provided rationale for risk stratification and treatment decision-making based on gene expression profiles (25). In our study, an MGS was constructed and validated in different datasets. It was shown to be superior to the commonly used clinicopathological factors and could better predict the OS of breast cancer patients.Cancer metabolism has experienced renewed interest in the past decade, with increasing evidence implying a vital role of metabolism in cancer tumorigenicity and malignancy (26). The application of a metabolic signature has been reported to predict prognosis in different cancers, such as thyroid and hepatocellular cancers; however, no metabolic signatures have been applied to predict outcomes in breast cancer. Metabolism pathways are complex, and one fundamental feature of cancer metabolism is the ability to obtain nutrients and biomass under nutritional stress (27).To fully understand the impact of metabolism on prognosis, genes covering a variety of metabolic pathways including glucose and lipid metabolism, amino acid metabolism, and nucleotide metabolism were all involved in our analysis. After selection by LASSO analysis, amino acid metabolism pathways including histidine metabolism, arginine and proline metabolism, and glutathione metabolism were found to be largely enriched in our MGS, which indicated that amino acid metabolism played an important role in prognosis. Amino acids are essential nutrients for breast cancer proliferation and are involved in various cancer pathways (28). Histidine was reported to be associated with chemotherapy sensitivity in breast cancer (29), arginine plays an essential role in maintaining mammalian target of rapamycin (mTOR) pathway activation (30,31), while proline, which is catalyzed by two different enzymes (pyrroline-5-carboxylate reductase and proline dehydrogenase), can sustain intracellular nucleotide levels. The inhibition of proline biosynthesis in cancer cells was also found to impair tumorigenic potential and metastasis inhibition (32,33). Glutathione is the most abundant antioxidant in cells; it maintains cellular redox homeostasis and has a bidirectional effect on cancer progression. It is vital to the detoxification of carcinogens; however, excess glutathione promotes tumor progression (34). Incorporating these different metabolic pathways rather than selecting a particular metabolic pathway makes our prediction model more robust.Our study highlighted the role of amino acid metabolism in breast cancer prognosis prediction. The application of an MGS in other types of tumors needs to be further explored. Moreover, it is also well established that there may be marked differences in metabolism between the sexes (35). Thus, the accuracy of an MGS in predicting the OS among patients of different sexes in other tumors needs to be further investigated.In our study, OS was selected as the end point to evaluate clinical outcomes. LASSO regression analysis, an innovative shrinkage and selection method for regression analysis, converted a high-dimensional predictor into a low-dimensional predictor (36,37). The combination of univariate analysis and LASSO Cox regression, which has been used widely in numerous studies, was conducted to screen genes indicating either poor or good prognosis and to construct a robust gene signature (38,39). In our analysis, a 55-gene metabolic signature was developed to predict the OS of breast cancer patients. According to the tROC curves, the MGS showed a better prediction accuracy than did AJCC stage or PAM50.Intrinsic subtypes are related to patient survival and can guide therapeutic strategies. Among patients with different PAM50 subtypes, our MGS could further classify these patients into different risk groups. Patients with low-risk scores had a better prognosis than patients with high-risk scores, indicating that the MGS can further guide individualized therapy regardless of molecular subtypes. Chemotherapy may be prescribed for patients with high-risk MGSs. Moreover, an appropriate target agent may be designed to improve the outcomes of these patients. Nomograms have been widely used to assess prognostic outcomes in cancer patients (40). A nomogram can incorporate statistical predictive models into a single numerical estimate of the probability of long-term OS for an individual patient. In our study, the metabolic signature and AJCC stage were identified as independent prognostic factors and were incorporated into the nomogram to predict the survival of patients. The high C-index of our nomogram indicated good accuracy in predicting the outcome of patients. Furthermore, calibration plots demonstrated that the nomogram had good discrimination in both the validation and discovery cohorts. However, more validation sets are needed to verify our signature.
Conclusions
Our results demonstrate that a risk score based on an MGS is superior to that based on AJCC stage or PAM50 subtypes in predicting the prognosis of breast cancer. This will provide evidence for further risk classification in guiding individualized management of breast cancer patients.The article’s supplementary files as