Jianping Wu1, Sulai Liu2, Xiaoming Chen1, Hongfei Xu1, Yaoping Tang1. 1. Hunan University of Science and Engineering, Yongzhou, China. 2. Department of Hepatobiliary Surgery, Hunan Provincial People's Hospital, The First Affiliated Hospital of Hunan Normal University, Changsha, China.
Abstract
OBJECTIVE: Colorectal cancer (CRC) is the most common cancer worldwide. Patient outcomes following recurrence of CRC are very poor. Therefore, identifying the risk of CRC recurrence at an early stage would improve patient care. Accumulating evidence shows that autophagy plays an active role in tumorigenesis, recurrence, and metastasis. METHODS: We used machine learning algorithms and two regression models, univariable Cox proportion and least absolute shrinkage and selection operator (LASSO), to identify 26 autophagy-related genes (ARGs) related to CRC recurrence. RESULTS: By functional annotation, these ARGs were shown to be enriched in necroptosis and apoptosis pathways. Protein-protein interactions identified SQSTM1, CASP8, HSP80AB1, FADD, and MAPK9 as core genes in CRC autophagy. Of 26 ARGs, BAX and PARP1 were regarded as having the most significant predictive ability of CRC recurrence, with prediction accuracy of 71.1%. CONCLUSION: These results shed light on prediction of CRC recurrence by ARGs. Stratification of patients into recurrence risk groups by testing ARGs would be a valuable tool for early detection of CRC recurrence.
OBJECTIVE:Colorectal cancer (CRC) is the most common cancer worldwide. Patient outcomes following recurrence of CRC are very poor. Therefore, identifying the risk of CRC recurrence at an early stage would improve patient care. Accumulating evidence shows that autophagy plays an active role in tumorigenesis, recurrence, and metastasis. METHODS: We used machine learning algorithms and two regression models, univariable Cox proportion and least absolute shrinkage and selection operator (LASSO), to identify 26 autophagy-related genes (ARGs) related to CRC recurrence. RESULTS: By functional annotation, these ARGs were shown to be enriched in necroptosis and apoptosis pathways. Protein-protein interactions identified SQSTM1, CASP8, HSP80AB1, FADD, and MAPK9 as core genes in CRC autophagy. Of 26 ARGs, BAX and PARP1 were regarded as having the most significant predictive ability of CRC recurrence, with prediction accuracy of 71.1%. CONCLUSION: These results shed light on prediction of CRC recurrence by ARGs. Stratification of patients into recurrence risk groups by testing ARGs would be a valuable tool for early detection of CRC recurrence.
Colorectal cancer (CRC) is one of the most common cancers and the third leading cause of cancer-related deaths worldwide.[1] The 5-year survival rate of CRC varies from 14% to 90%, based on the stage of CRC; patients diagnosed with localized cancers (stage I or II) usually have more favorable outcomes than patients with late-stage CRC.[2] Surgical resection of the tumor is the standard treatment for localized CRC but it cannot prevent recurrence after several years.[3] Because recurrence is preventable, research is needed to better predict the risk of CRC recurrence and stratify CRC patients into subgroups according to recurrence risk. A few studies in which CRC patients benefited from stratification into different adjuvant treatment groups according to molecular biomarkers have demonstrated the clinical utility of, and urgency to identify, molecular biomarkers in CRC.[4-6]High-throughput screening and advanced bioinformatics analysis along with machine learning characterize the clinical utility of biomarkers in CRC. Two commercial assays (Epi proColon, Epigenomics AG, Berlin, Germany; and Cologuard, Exact Sciences, Madison, WI, USA) that are based on DNA methylation levels have been approved by the US Food and Drug Administration for CRC diagnosis and screening. However, most reported biomarkers in CRC are used for early cancer screening and therapeutic stratification.[7-9] There is a need to develop a predictive biomarker for CRC recurrence at all stages.Autophagy is the process of transporting damaged, denatured, or aged proteins and organelles into lysosomes for digestion and degradation. On the one hand, autophagy prevents accumulation of toxic or carcinogenic proteins and organelles and inhibits cell carcinogenesis.[10-13] On the other hand, autophagy can be harmful because the autophagy cells provide more nutrients and promote tumor growth when tumor cells form.[14-17] Because autophagy can regulate tumor formation, spread, metastasis, and energy metabolism, antitumor drugs based on the regulation of autophagy activity have the potential for clinical treatment.[18-20]Depending on the way a lysosome accepts a substance to be degraded, autophagy can be classified into macroautophagy, microautophagy, and chaperone-mediated autophagy. Macroautophagy is the most common form of autophagy and is characterized by the formation of a cup-shaped bilayer membrane structure around the cytoplasmic component and the formation of an autophagosome.[21-24] The outer membrane and enzymatic fusion of autophagosomes form a monolayer membrane structure of autophagosomes, while the contents of the inner membrane and autophagosomes are digested.[21-24] This process is mediated by autophagy-related genes (ARGs).[25-31] These ARGs have been identified as direct or indirect participants in the autophagy process. Therefore, analysis of ARGs provides a comprehensive overview of the changes in autophagy in CRC. A number of studies have shown that these ARGs have important clinical implications for various types of cancer, including glioma, liver cancer, melanoma, and thyroid cancer.[29,30,32,33] In this study, we used two public datasets of CRC that included recurrent cases to explore the clinical utility of ARGs to predict CRC recurrence.Because disease biomarkers demonstrate utility in disease diagnosis, prediction of therapeutic response, and monitoring of residual disease, machine learning has become a useful tool for biomarker discovery.[34-36] More importantly, machine learning is a powerful way to identify phenotypic features compared with conventional analysis of differentially expressed genes in cancers. By using machine learning, we identified two ARGs with potential predictive ability for CRC recurrence; these genes were significantly associated with progression-free survival. Our results highlight the biological function of ARGs in CRC.
Methods
Ethical approval
This was a retrospective study using publicly available datasets; ethical approval was deemed unnecessary.
Data source and processing
Two microarray datasets (GSE64857 and GSE28722) were downloaded from the Gene Expression Omnibus (GEO) database (http://ncbi.blm.nih.gov/geo/). GSE64857 contained 81 samples of CRC patients with follow-up or recurrence information, which we used to identify ARG signatures. GSE28722 contained 129 samples of CRC patients with survival information, which we used to validate the prognostic capacity of ARG signatures in this cohort.[37,38] The annotation files of these microarrays were downloaded from the GEO website.We acquired a list of ARGs from the human autophagy database (HADb; http://autophagy.lu), the first human autophagy database. As a public repository, it provides updated annotations of human genes that are reported to be related to autophagy. Information on 232 ARGs was obtained from this website.After detecting the variance between samples by box-and-whisker plots, we used quantile normalization to remove unexpected variance within samples. After normalization, we conducted a principal component analysis (PCA) to explore genomic differences between samples with the top 1000 variant probes using the package “limma”[39] in R (https://www.r-project.org/).
Cox proportion and LASSO regressions
To generate CRC recurrence-related ARG signatures, we performed two regressions on our datasets: univariable Cox proportion and least absolute shrinkage and selection operator (LASSO). Univariable Cox proportion regression was first conducted to filter out CRC recurrence survival-related ARGs. Once we obtained the shortlist of ARGs from the univariable proportion regression, we then performed the LASSO regression on these genes to test their association with CRC recurrence using the “glmnet” package in R.[40] The LASSO regression analysis identified eligible ARG signatures for the CRC recurrence risk score. We generated the CRC recurrence risk score using the ARG signatures and their corresponding coefficients discovered by the LASSO model. The risk score was calculated as follows: Risk score = ΣnARGi × βi, where n was the numbers of ARG signatures, i was each ARG signature, and β was the corresponding coefficient. After the risk score was determined, we categorized CRC samples from GSE64857 into two groups (predicted recurrence or predicted non-recurrence) according to their risk score. The performance of this risk score was evaluated by receiver-operator characteristic (ROC) curve using “survival receiver‑operator characteristic (ROCR)” package.[41] The area under the curve (AUC) was calculated to assess the predictive accuracy.
Gene pathway analysis
To identify the biological pathways of ARG signatures, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were used for functional enrichment analysis by the R package “clusterProfiler”.[42] The significance of a pathway was determined by a false discovery rate (FDR) < 0.05. The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://string‑db.org/; version 10.0) was used to explore the protein–protein interactions (PPI) of the recurrence-associated ARGs.
Survival analysis
For overall survival (OS) and progression-free survival (PFS) analysis, we used Kaplan–Meier plots to analyze the prognostic capacity of ARG signatures in CRC. We stratified all samples of GSE28722 into three groups (low, intermediate, high) based on their relative expression of ARG signatures. The log-rank test was performed to determine statistical significance, and a p-value < 0.05 was considered significant. All survival analyses were conducted using the R packages “survival” and “survminer”.
Gene set enrichment analysis
Gene set enrichment analysis (GSEA) was performed to investigate the pathways affected by the ARG signatures. We categorized samples of GSE64857 into two groups according to their risk score. GSEA was conducted using GSEA software provided by the Broad Institute (http://software.broadinstitute.org/gsea/index.jsp).[43] The hallmarks of genes were calculated if the signatures were enriched for CRC with a high or low risk score. Hallmark gene signatures with normalized enrichment score > 1 and FDR < 0.05 were considered significant.
Results
Download and cleaning of datasets for ARG signature identification
To generate the datasets for ARG signatures, we downloaded two datasets (GSE64857 and GSE28722) from the public domain (GEO). Dataset GSE64857 contained 81 samples with clinical information; 75 of these had clinical follow-up information, and 6 were excluded because they lacked recurrence information. Before conducting the bioinformatic analysis, we checked the variance within samples by performing quantile normalization for these two datasets. The boxplots of each sample from both datasets demonstrated that normalization removed the unwanted variance within samples (Figure 1). After normalization, we performed PCA to explore the variance of samples in terms of gene expression. Although the PCA plot of the top 1000 variable genes could not differentiate recurrent CRC from non-recurrent CRC, we observed separation within samples when we used the shortlist of ARGs (Figure 2). This result suggested that ARGs were potential biomarkers for CRC recurrence.
Figure 1.
Quantile normalization was used to remove the batch effect of data in two public datasets (GSE64857 and GSE28722) from the Gene Expression Omnibus.
Figure 2.
Overview of biomarker discovery using two public datasets (GSE64857 and GSE28722) from the Gene Expression Omnibus. The two PCA plots on the left show the differences in samples for recurrent and non-recurrent colorectal cancer for the top 1000 variable genes. The right-hand panel shows the PCA plot indicating differences in samples by autophagy-related genes. PCA, principal component analysis.
Quantile normalization was used to remove the batch effect of data in two public datasets (GSE64857 and GSE28722) from the Gene Expression Omnibus.Overview of biomarker discovery using two public datasets (GSE64857 and GSE28722) from the Gene Expression Omnibus. The two PCA plots on the left show the differences in samples for recurrent and non-recurrent colorectal cancer for the top 1000 variable genes. The right-hand panel shows the PCA plot indicating differences in samples by autophagy-related genes. PCA, principal component analysis.
Univariable Cox proportion regression identified ARGs for CRC recurrence
To identify ARGs directly associated with CRC recurrence-free survival, we used univariable Cox proportion regression to filter all 232 ARGs from HADb (http://autophagy.lu). As the dataset GSE28722 used the Rosetta custom human 23K array, we included 283 probes targeting all 232 ARGs in the downstream analysis. Univariable Cox proportion regression showed that 26 ARGs were significantly associated with CRC recurrence-free survival (Table
1). Therefore, we used these 26 significant ARGs for molecular function annotation and LASSO analysis.
To investigate the molecular function of these ARG signatures, we explored the associated enriched pathways in Gene Ontology (GO) terms (biological pathway, cellular component, and molecular function) and KEGG pathways. We used the ClusterProfiler package to identify the GO terms and KEGG pathways that ARG signatures enriched. For biological pathways of GO, the top three terms were “positive regulation of proteolysis,” “extrinsic apoptotic signaling pathway,” and “regulation of apoptotic signaling pathway.” The top three cellular component GO terms were “membrane raft,” “membrane microdomain,” and “membrane region.” The top three molecular function GO terms were “ubiquitin protein ligase binding,” “ubiquitin-like protein ligase binding,” and “heat shock protein binding.” The top three KEGG pathways enriched by the ARG signatures were “necroptosis,” “apoptosis,” and “hepatitis C” (Figure 3). The PPI network analysis demonstrated interactions in these ARG signatures. Of these, SQSTM1, CASP8, HSP80AB1, FADD, and MAPK9 were the interaction center nodes, suggesting that these genes had broader interactions than other ARG signature genes. To further display ARG signature interactions, we performed a PPI analysis of the 26 ARGs using the STRING database. SQSTM1, HSP90AB1, and CASP8 were identified as nodes in the PPI network of ARG signatures.
Figure 3.
The enriched pathways of ARGs in CRC recurrence. The significance of annotated pathways is denoted by color and the number of genes enriched is indicated by the size of dots. (A) Top 10 biological processes enriched by ARGs; (B) top 10 cellular components of ARGs; (C) top 10 molecular functions of ARGs; and (D) top 10 KEGG pathways of ARGs. ARG, autophagy-related gene; CRC, colorectal cancer; GO, Gene Ontology; BP, biological process; CC, cellular component; MF, molecular function; KEGG, Kyoto Encyclopedia of Genes and Genomes.
The enriched pathways of ARGs in CRC recurrence. The significance of annotated pathways is denoted by color and the number of genes enriched is indicated by the size of dots. (A) Top 10 biological processes enriched by ARGs; (B) top 10 cellular components of ARGs; (C) top 10 molecular functions of ARGs; and (D) top 10 KEGG pathways of ARGs. ARG, autophagy-related gene; CRC, colorectal cancer; GO, Gene Ontology; BP, biological process; CC, cellular component; MF, molecular function; KEGG, Kyoto Encyclopedia of Genes and Genomes.
Risk score stratification
To further test prediction of CRC recurrence by ARG signatures, we constructed a risk score prediction model by LASSO regression analysis. The 26 ARG signatures shortlisted from the univariable Cox proportion analysis were subjected to LASSO regression analysis (Table 1). No genes retained a significant statistical association using the optimal lambda score in LASSO regression; therefore, we chose the minimum lambda score for LASSO regression. We identified two genes (BAX and PARP1) that were significantly associated with CRC recurrence (Figure 4) using the minimum lambda score in LASSO regression. Interestingly, the regression coefficients of these two genes were negative, suggesting that they had a protective effect on CRC recurrence. A CRC recurrence risk score was calculated for each patient by multiplying the regression coefficient by the expression value of each gene. The CRC recurrence risk score = (208478_s_at, BAX) × −1.3329 + (211833_s_at, BAX) ×−0.2936 + (208644_at, PARP1) × −2.2957, where 208478_s_at, 211833_s_at, and 208644_at are the ARG gene probes for BAX and PARP1, respectively. To investigate the predictive accuracy of CRC recurrence risk score, we analyzed the ROC curve. The AUC of the CRC recurrence risk score ROC curve was 0.711 (Figure 5), suggesting that a CRC recurrence risk score based on ARG signatures was a potential predictive biomarker.
Figure 4.
The protein–protein interaction network showing the core genes in ARG signatures. Each node represents the gene signature, and the length of the lines indicates the degree of correlation between genes. ARG, autophagy-related gene.
Figure 5.
Diagnostic and prognostic capacity of ARG signatures in CRC. (A) Coefficient of variance plot of LASSO regression. (B) Coefficient plot of LASSO regression results. (C) ROC curves showing the diagnostic capacity of ARG signatures for CRC recurrence; the area under the curve is 0.711. (D) Kaplan–Meier curves of CRC stratified by ARG signature using the log-rank test.
LASSO, least absolute shrinkage and selection operator; ARG, autophagy-related gene; CRC, colorectal cancer; ROC, receiver-operator characteristic; AUC, area under the curve.
The protein–protein interaction network showing the core genes in ARG signatures. Each node represents the gene signature, and the length of the lines indicates the degree of correlation between genes. ARG, autophagy-related gene.Diagnostic and prognostic capacity of ARG signatures in CRC. (A) Coefficient of variance plot of LASSO regression. (B) Coefficient plot of LASSO regression results. (C) ROC curves showing the diagnostic capacity of ARG signatures for CRC recurrence; the area under the curve is 0.711. (D) Kaplan–Meier curves of CRC stratified by ARG signature using the log-rank test.LASSO, least absolute shrinkage and selection operator; ARG, autophagy-related gene; CRC, colorectal cancer; ROC, receiver-operator characteristic; AUC, area under the curve.To evaluate the prognostic ability of the CRC recurrence risk score, we performed a log-rank test on Kaplan–Meier curves of CRC survival. We categorized patients into three groups (high, low, and intermediate) based on BAX and PARP1 gene expression. Patients with high or low expression of both BAX and PARP1 were assigned to the high or low group, respectively, whereas the rest of the patients were stratified into the intermediate group. Although CRC patients did not benefit from high expression of BAX and PARP1 in overall survival, patients with high expression of BAX and PARP1 had a significantly favorable outcome in recurrence-free survival (Figure 6). These results suggested that BAX and PARP1 are potential prognostic biomarkers for CRC recurrence. Taken together, our results showed that BAX and PARP1 from ARG signatures could be utilized for CRC recurrence risk stratification.
Figure 6.
Validation of ARG signatures in an independent cohort. (A) Boxplots showing the differential expression of BAX and PARP1 between recurrent and non-recurrent CRC. (B) ROC curves showing the accuracy of the two ARG signatures.
ARG, autophagy-related gene; CRC, colorectal cancer; ROC, receiver-operator characteristic; AUC, area under the curve.
Validation of ARG signatures in an independent cohort. (A) Boxplots showing the differential expression of BAX and PARP1 between recurrent and non-recurrent CRC. (B) ROC curves showing the accuracy of the two ARG signatures.ARG, autophagy-related gene; CRC, colorectal cancer; ROC, receiver-operator characteristic; AUC, area under the curve.
Hallmarks of ARG signatures
To validate the enriched pathways associated with ARG signatures, we performed GSEA for the prognostic markers BAX and PARP1. GSEA demonstrated six hallmark pathways: “estrogen response early,” “p53 pathway,” “Kras signaling DN,” “apical junction,” “epithelial–mesenchymal transition,” and “UV response DN,” that were significantly associated with CRC patients having high expression of BAX and PARP1 (Figure 7).
Figure 7.
GSEA showing six hallmark pathways (apical junction, UV response DN, p53 pathway, Kras signaling DN, estrogen response early, and epithelial–mesenchymal transition) that were enriched in the high score ARG signature group.
GSEA, gene set enrichment analysis; ARG, autophagy-related gene.
GSEA showing six hallmark pathways (apical junction, UV response DN, p53 pathway, Kras signaling DN, estrogen response early, and epithelial–mesenchymal transition) that were enriched in the high score ARG signature group.GSEA, gene set enrichment analysis; ARG, autophagy-related gene.
Discussion
To accurately predict the prognosis of CRC, many cohorts have been studied to establish gene expression characteristics.[44,45] A meta-analysis has been performed to assess the clinical utility of several published prognostic gene expression profiles in CRC.[46] Although most published signatures show a statistically significant association with prognosis, their accuracy in classifying independent tumor samples into high- and low-risk groups remains limited. Therefore, more robust and accurate gene expression profiles are needed to predict prognosis. Here, we established CRC recurrence gene signatures by applying two supervised machine learning approaches. These recurrence signatures based on ARGs could predict CRC recurrence with moderate accuracy.The role of autophagy in tumors is well known, and the role of autophagy in the development and treatment of CRC has been reported previously.[47,48] However, the clinical significance of ARGs, especially their prognostic effects in CRC, has not been extensively investigated. Moreover, there is no comprehensive analysis of the prognostic significance of ARGs. Therefore, in this study, we first analyzed the expression levels of ARGs and subsequently examined their prognostic value. Finally, we established a model using prognostic ARGs. Given the clinical significance of these prognostic ARGs in CRC, they may provide new directions for clinical CRC treatment if a CRC recurrence score can be calculated for each patient.We evaluated the prognostic value of all 232 ARGs in CRC. By performing a univariate Cox analysis, we identified 26 recurrence-associated ARGs, suggesting that autophagy plays a crucial role in the development of CRC and may affect the prognosis of CRC patients. Because ARGs may have multiple functions in addition to autophagy, we also performed GO and KEGG pathway analysis and showed that most enrichment pathways were autophagy-related pathways. These results revealed that autophagy is a key factor in CRC recurrence.To more accurately assess prognosis of CRC patients, LASSO regression was used to analyze the prognostic significance of ARGs. As a result of LASSO regression, the number of ARGs was reduced to two genes, including three probes (208478_s_at, 211833_s_at, and 208644_at). These ARGs were combined into one prognostic assessment model. From the ROC curve, we found that the prognostic model had an AUC >0.7, indicating moderate efficacy in evaluating the prognosis and recurrence in patients with CRC. This moderate accuracy suggested there might be other confounding factors participating in CRC recurrence. The establishment of these prognostic biomarkers confirms the role of autophagy in the development of CRC and patient prognosis. Although BAX and PARP1 are widely recognized as key mediators in cell death and apoptosis, their roles in autophagy remain unclear for cancer development. Activation of PARP1 promotes expression of other autophagic pathway genes.[49] Increased ARG expression is caused by PARP1-mediated poly ADP-ribosylation that restrains FoxO3a in nuclei. In CRC, BAX is recognized as a negative regulator in autophagy; a few autophagy genes were inversely correlated with BAX.[50]This proof-of-concept study has some limitations. The ARG signatures and recurrence-predicting genes identified in this bioinformatics study need to be validated in external cohorts. Additionally, we did not compare the accuracy of our ARG signatures with other existing biomarkers (genes, DNA methylation, or metabolites). Combined with other protein biomarkers, our ARG signatures might better predict recurrence. A duplex PCR-based assay targeting these two ARGs signatures is under development and will be tested for clinical utility soon.In conclusion, we established an ARG signature that could predict CRC recurrence with moderate accuracy. The risk score based on ARG signature is a potential prognostic biomarker for CRC recurrence, but it requires clinical validation through further studies.
Authors: Richard G Gray; Philip Quirke; Kelly Handley; Margarita Lopatin; Laura Magill; Frederick L Baehner; Claire Beaumont; Kim M Clark-Langone; Carl N Yoshizawa; Mark Lee; Drew Watson; Steven Shak; David J Kerr Journal: J Clin Oncol Date: 2011-11-07 Impact factor: 44.544
Authors: María Marcuello; Veronika Vymetalkova; Rui P L Neves; Saray Duran-Sanchon; Hege Marie Vedeld; Emma Tham; Guus van Dalum; Georg Flügen; Vanesa Garcia-Barberan; Remond Ja Fijneman; Antoni Castells; Pavel Vodicka; Guro E Lind; Nikolas H Stoecklein; Ellen Heitzer; Meritxell Gironella Journal: Mol Aspects Med Date: 2019-06-14
Authors: Pierre Courtiol; Charles Maussion; Françoise Galateau-Sallé; Gilles Wainrib; Thomas Clozel; Matahi Moarii; Elodie Pronier; Samuel Pilcer; Meriem Sefta; Pierre Manceron; Sylvain Toldo; Mikhail Zaslavskiy; Nolwenn Le Stang; Nicolas Girard; Olivier Elemento; Andrew G Nicholson; Jean-Yves Blay Journal: Nat Med Date: 2019-10-07 Impact factor: 53.440
Authors: Andrew G Renehan; Lee Malcomson; Richard Emsley; Simon Gollins; Andrew Maw; Arthur Sun Myint; Paul S Rooney; Shabbir Susnerwala; Anthony Blower; Mark P Saunders; Malcolm S Wilson; Nigel Scott; Sarah T O'Dwyer Journal: Lancet Oncol Date: 2015-12-17 Impact factor: 41.316
Authors: Justyna Gil; David Ramsey; Elzbieta Szmida; Przemyslaw Leszczynski; Pawel Pawlowski; Marek Bebenek; Maria M Sasiadek Journal: Med Oncol Date: 2016-12-29 Impact factor: 3.064
Authors: Conan G Kinsey; Soledad A Camolotto; Amelie M Boespflug; Katrin P Guillen; Mona Foth; Amanda Truong; Sophia S Schuman; Jill E Shea; Michael T Seipp; Jeffrey T Yap; Lance D Burrell; David H Lum; Jonathan R Whisenant; G Weldon Gilcrease; Courtney C Cavalieri; Kaitrin M Rehbein; Stephanie L Cutler; Kajsa E Affolter; Alana L Welm; Bryan E Welm; Courtney L Scaife; Eric L Snyder; Martin McMahon Journal: Nat Med Date: 2019-03-04 Impact factor: 53.440