A five-gene based risk score with high prognostic value in colorectal cancer.

Yida Pan¹, Hongyang Zhang¹, Mingming Zhang², Jie Zhu¹, Jianghong Yu¹,³, Bangting Wang¹, Jigang Qiu⁴, Jun Zhang¹.

Abstract

Colorectal cancer (CRC) is one of the most frequently occurring malignancies worldwide. The outcomes of patients with similar clinical symptoms or at similar pathological stages remain unpredictable. This inherent clinical diversity is most likely due to genetic heterogeneity. The present study aimed to create a predictive tool to evaluate patient survival based on genetic profile. Firstly, three Gene Expression Omnibus (GEO) datasets (GSE9348, GSE44076 and GSE44861) were utilized to identify and validate differentially expressed genes (DEGs) in CRC. The GSE14333 dataset containing survival information was then introduced in order to screen and verify prognosis-associated genes. Of the 66 DEGs, the present study screened out 46 biomarkers closely associated with patient overall survival. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway analysis demonstrated that these genes participated in multiple biological processes that were highly associated with cancer proliferation, drug resistance and metastasis, thus further affecting patient survival. The five most important genes, MET proto-oncogene, receptor tyrosine kinase (MET), carboxypeptidase M (CPM), serine hydroxymethyltransferase 2 (SHMT2), guanylate cyclase activator 2B (GUCA2B) and sodium voltage-gated channel α subunit 9 (SCN9A), were selected by a random survival forests algorithm and were combined into a linear risk score formula by multivariable Cox regression. Finally, the present study tested and verified this risk score within three independent GEO datasets (GSE14333, GSE17536 and GSE29621), and observed that patients with a high risk score had a lower overall survival (P<0.05). Furthermore, this risk score was the most significant predictor in the model compared with other factors, including age and American Joint Committee on Cancer stage, and was able to predict patient survival independently and directly. The findings suggest that this survival-associated DEG-based risk score is a powerful and accurate prognostic tool with promise for implementation in a clinical setting.

Keywords: DEG; colorectal cancer; microarray; overall survival; risk score

Year: 2017 PMID: 29344121 PMCID: PMC5754913 DOI: 10.3892/ol.2017.7097

Source DB: PubMed Journal: Oncol Lett ISSN: 1792-1074 Impact factor: 2.967

Introduction

Colorectal cancer (CRC) is currently one of the most commonly diagnosed cancers worldwide, with an estimated 1.4 million cases and 693,900 deaths occurring in 2012 (1). It is much more prevalent in Europe and Northern America than in developing countries, although incidence in the latter has also risen over the last decade (2). Although many advances have been achieved in the clinical management of CRC, the 5-year survival is only approximately 55% (3). Surgical resection remains the primary means of curative treatment. However, a proportion of patients develop local recurrences and metastases and thus have a poor prognosis after resection. Moreover, the outcomes of patients at a similar clinical or pathological stage remain unpredictable, even when they are treated similarly (4). This inherent clinical diversity is most likely due to the genetic heterogeneity of each patient (5). Therefore, identifying the diversity in the genetic profile of colorectal carcinoma that governs prognosis, together with accurate risk evaluation based on genetic screening, would lead to new and more effective strategies in clinical decision making. Microarray technology allows comprehensive analysis of gene expression profiles in different diseases, as has been demonstrated in a variety of hematological and solid tumors, including lung (6), liver (7), pancreas (8) and breast (9). Biomarkers discovered by microarrays have great potential in the prediction of clinical outcomes and survival, as well as in classification into different subtypes (10–12). However, several reported survival-related biomarkers in CRC did not perform well when their ability was assessed in independent datasets (13–15). Their clinical implementation may also be limited by a lack of reproducibility and/or standardization. This may be related to unoptimized parameters, different technical platforms and small sample sizes.
Therefore, an integrated strategy that combines several specific biomarkers verified across multiple data sources may be feasible for predicting CRC risk and prognosis. In the present study, we identified and verified 66 differentially expressed genes (DEGs) between CRC and normal tissue by bioinformatics analysis with multiple classifiers. Among them, we identified 46 biomarkers that were closely related to patient survival. We examined the function of these genes via Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Finally, using a random survival forests algorithm, we ranked these genes by importance and built a five-gene linear risk score with a multivariable Cox regression model. Our findings suggest that this risk score is a powerful and accurate prognostic tool with promise for implementation in the clinical setting.

Materials and methods

CRC datasets

The training and validation datasets were obtained from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/). GSE9348 (70 CRC and 12 normal samples; platform GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array) was used as the training set for identifying DEGs that distinguish cancerous from non-cancerous samples, and GSE44076 (98 pairs of CRC and adjacent normal tissues; platform GPL13667 [HG-U219] Affymetrix Human Genome U219 Array) and GSE44861 (56 tumors and 55 adjacent normal tissues; platform GPL3921 [HT_HG-U133A] Affymetrix HT Human Genome U133A Array) were used for validation. Three datasets with survival information, generated on the GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, were introduced for calculating the risk score formula. GSE14333 (n=226) served as the training set, and GSE17536 (n=177) and GSE29621 (n=65) served as validation sets.

Data preprocessing

All microarray data preprocessing was performed in R software version 3.1.0 using packages from Bioconductor. Raw microarray data (CEL files) of tumor and normal samples were pre-processed with the RMA algorithm using the affy package (16). Gene expression values were obtained after background adjustment, quantile normalization and summarization of probe values into one expression measure. If multiple probe sets mapped to the same gene, the average of the probe values was taken as the expression value (17). Annotations for the probe arrays were downloaded from the GEO database.
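
The probe-to-gene averaging step described above can be sketched as follows. This is a minimal illustration in Python rather than the R/Bioconductor code used in the study; the function name, the probe IDs and the expression values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes_to_genes(probe_expr, probe_to_gene):
    """Average the expression of all probe sets that map to the same gene.

    probe_expr: dict probe_id -> list of expression values (one per sample)
    probe_to_gene: dict probe_id -> gene symbol (unannotated probes are absent)
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_expr.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:  # skip probes without a gene annotation
            grouped[gene].append(values)
    # zip(*rows) walks sample by sample across all probes of one gene
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}

# Hypothetical probe IDs and log2 expression values, for illustration only
probe_expr = {
    "probe_a": [8.0, 9.0],
    "probe_b": [10.0, 11.0],
    "probe_c": [5.0, 6.0],
    "probe_d": [7.0, 7.0],
}
probe_to_gene = {"probe_a": "MET", "probe_b": "MET", "probe_c": "SHMT2"}
gene_expr = collapse_probes_to_genes(probe_expr, probe_to_gene)
```

Here the two probes mapped to MET are averaged per sample, while the unannotated probe is dropped.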

Functional enrichment analysis

GO and pathway functional enrichment analysis was performed with the online software GENECODIS3 (http://genecodis.cnb.csic.es) to facilitate interpretation of the biological roles of the survival-related DEGs (18). The GO functions of the survival-related DEGs were categorized by biological process, molecular function and cellular component. Pathway enrichment analysis was based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. P-values were obtained by hypergeometric analysis and corrected by the FDR method. Terms with P<0.05 were considered significantly enriched.
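
A hypergeometric enrichment P-value with FDR correction can be sketched as below. This is not the GENECODIS3 implementation: the Benjamini-Hochberg procedure is assumed as the FDR method, and the background size and annotation counts in the example are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts: 5 of 46 input genes carry a term that annotates
# 300 of 20000 background genes
p_term = hypergeom_pvalue(k=5, n=46, K=300, N=20000)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

The two P-value columns in the enrichment table (Hyp and corrected Hyp) correspond conceptually to the raw and adjusted values computed here.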

Statistical analysis

SPSS software (version 20.0; IBM SPSS, Armonk, NY, USA) was used for statistical analysis. Survival analysis was performed by the Kaplan-Meier method, and the Mantel-Cox log-rank test was used to evaluate the statistical significance of the differences. Pearson's chi-square test was used to investigate the difference in live and dead status between patients with different risk scores. Differences were considered statistically significant when P<0.05.
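
For readers unfamiliar with these tests, the Kaplan-Meier estimator and the two-group log-rank test can be sketched in plain Python as follows. This is an illustrative re-implementation, not the SPSS procedure used in the study; the function names and the toy survival times are hypothetical.

```python
from math import erfc, sqrt

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time; events: 1=death, 0=censored."""
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank (Mantel-Cox) test; returns (chi2, p) with 1 d.f."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n_a = sum(1 for u in times_a if u >= t)   # at risk in group A
        n = sum(1 for u in times if u >= t)       # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    return chi2, erfc(sqrt(chi2 / 2))

# Toy follow-up times (months) in which every patient has an observed event
high_risk = [1, 2, 3, 4, 5]
low_risk = [10, 11, 12, 13, 14]
chi2, p = logrank_test(high_risk, [1] * 5, low_risk, [1] * 5)
```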

Results

Identification of DEGs between cancerous and non-cancerous tissues

GSE9348 was used as the training set to identify the DEGs between cancerous and non-cancerous tissues. This dataset included tumors from 70 patients and biopsies from 12 healthy controls. We employed several classifiers, namely Compound Covariate (CC), Diagonal Linear Discriminant Analysis (DLDA), Bayesian CCP (BCCP), Nearest Neighbor (NN), Nearest Centroid (NC) and Support Vector Machines (SVM), to identify specific gene markers. Leave-one-out cross-validation was used to make the result stable and accurate. After processing, we obtained 66 DEGs with high accuracy (classifier error rate <0.1) (data not shown). The expression of the 66 genes clearly separated tumor from non-tumor tissue in the GSE9348 dataset (Fig. 1A). To further confirm the DEGs in cancerous and non-cancerous tissue, the Human Protein Atlas immunohistochemistry database (www.proteinatlas.org) was used to visualize protein expression. We found that downregulated DEGs such as SCN9A, UGP2 and CWH43 were weakly stained or even negative in CRC tissues (Fig. 1B), while upregulated DEGs such as MET, MYC and SHMT2 were strongly stained in tumor regions (Fig. 1C).
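
Leave-one-out cross-validation can be illustrated with one of the classifier types named above. The sketch below uses a simple nearest-centroid rule on made-up two-gene expression profiles; it is an assumption-laden toy, not the software or the gene selection procedure used in the study.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean expression profile is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error_rate(x, y):
    """Leave each sample out once, train on the rest, count misclassifications."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) != y[i]:
            errors += 1
    return errors / len(x)

# Made-up expression values for two genes in three tumors and three normals
x = [[9.0, 2.0], [8.5, 2.5], [9.2, 1.8], [3.0, 7.0], [2.5, 7.5], [3.2, 6.8]]
y = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
error_rate = loocv_error_rate(x, y)
```

An error rate below 0.1 in such a procedure would correspond to the accuracy criterion quoted in the text.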

Figure 1.

DEGs in colorectal cancer. (A) Heatmap of the expression of the 66 DEGs in cancerous and non-cancerous tissue of GSE9348. More detailed information can be obtained by contacting the corresponding author. (B) Immunohistochemistry (IHC) images of SCN9A, UGP2 and CWH43 as downregulated DEGs, retrieved from the Human Protein Atlas database (HPA). (C) IHC results of MET, MYC and SHMT2 as upregulated DEGs from HPA. (D) ROC curves of the three linear classifiers CC, DLDA and SVM in the training set GSE9348. FPR, false positive rate; TPR, true positive rate; AUC, area under curve; DEGs, differentially expressed genes; T, tumor; N, normal.

As CC, DLDA and SVM are linear classifiers, a linear discriminant with weight values can determine the cancerous status of a sample. If the weight value of a gene within a given linear classifier is ω and its expression value in a sample is x, then a sample with Σωx > threshold was defined as cancerous. The thresholds of the classifiers CC, DLDA and SVM were calculated as −43.835, −234.08 and 0.409, respectively. The ROC curves of the three linear classifiers confirmed their high effectiveness (AUC=1) (Fig. 1D). It should be noted that these ROC curves were derived from the training set GSE9348, in which the Σωx discriminant of each of the three linear classifiers was compared with a threshold fitted to GSE9348 itself, so the sensitivity and specificity were very high (Table I upper).
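
A linear discriminant of this kind, together with an AUC calculation, can be sketched as follows. The sketch assumes that a weighted sum above the threshold means "cancerous"; the weights, expression values and threshold are hypothetical, since the per-gene weights are not given in the text.

```python
def linear_score(weights, expression):
    """Weighted sum of expression values (the linear discriminant)."""
    return sum(w * x for w, x in zip(weights, expression))

def classify(weights, expression, threshold):
    """Call a sample 'tumor' when its discriminant exceeds the threshold."""
    return "tumor" if linear_score(weights, expression) > threshold else "normal"

def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical two-gene classifier: one upregulated and one downregulated gene
weights = [1.0, -1.0]
threshold = 2.0
samples = [[9.0, 2.0], [8.0, 3.0], [3.0, 7.0], [4.0, 6.0]]
labels = [1, 1, 0, 0]  # 1 = tumor, 0 = normal
scores = [linear_score(weights, s) for s in samples]
calls = [classify(weights, s, threshold) for s in samples]
auc = roc_auc(scores, labels)
```

When every tumor scores higher than every normal sample, as in this toy data, the AUC equals 1, which is the situation reported for the training set.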

Table I.

Survival related DEGs by univariable cox proportional hazards regression analysis.

Gene	P-value	HR	Gene	P-value	HR
LOC339166	<1e-07	7.748	MYC	7E-07	0.581
SCN9A	<1e-07	0.154	SQRDL	7E-07	0.513
LGI1	<1e-07	0.115	SHMT2	0.000001	0.509
P2RY1	<1e-07	3.592	PDE6A	2.1E-06	2.229
PRPF4	<1e-07	0.245	UGDH	2.3E-06	1.792
GUCA2B	<1e-07	1.688	PTPRH	2.5E-06	1.733
ENOX2	<1e-07	0.193	PPP2R3A	8.4E-06	2.19
NPY	<1e-07	4.787	HSPH1	2.62E-05	1.61
SCGN	<1e-07	2.266	NR5A2	3.16E-05	0.585
TMEM9B	<1e-07	3.445	TRIP13	3.21E-05	0.631
RNASEH2A	<1e-07	0.438	CPM	6.06E-05	0.498
HSD11B2	<1e-07	0.647	DUSP14	0.000183	0.54
DENND2A	<1e-07	0.299	RCL1	0.000274	0.415
ASPA	<1e-07	3.507	ETV4	0.000396	0.672
CA7	<1e-07	2.626	SEMA6D	0.000472	1.9
LPHN3	<1e-07	0.247	HOMER1	0.000475	0.666
ABCG2	<1e-07	1.497	CCND1	0.000522	1.584
GALNT6	<1e-07	0.588	METTL7A	0.000543	2.012
PTGDR	<1e-07	0.336	MET	0.000577	1.528
TST	<1e-07	0.497	CWH43	0.0006	0.699
SMPDL3A	1E-07	0.428	DHRS11	0.000607	0.748
HSD17B11	1E-07	2.087	UGP2	0.000701	1.977
ETFDH	3E-07	0.549	SLC22A18AS	0.000812	0.558

HR, hazard ratio.

Validation of DEGs in independent CRC datasets

To avoid over-fitting and ensure marker stability, two independent CRC datasets, GSE44076 (98 pairs of CRC and adjacent tissues) and GSE44861 (56 tumors and 55 adjacent tissues), were introduced for verification. The classifiers used in GSE9348 worked well in these datasets (Table II), and the sensitivity and specificity of the Σωx discriminant in classifying cancerous samples were also tested and confirmed (Table I middle and lower). The expression of the 66 DEGs derived from GSE9348 showed a similar pattern in GSE44076 and GSE44861 (data not shown). The three linear classifiers (CC, DLDA and SVM) remained reliable when they were applied to GSE44076 and GSE44861. The AUCs of the classifiers CC, DLDA and SVM in GSE44076 were 0.9994, 0.9996 and 0.9994, respectively (Fig. 2A), while in GSE44861 the AUC values were 0.9253, 0.9292 and 0.9318, respectively (Fig. 2B).
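
Sensitivity and specificity in a validation set follow directly from the predicted and true labels. A minimal sketch, with a hypothetical function name and made-up predictions:

```python
def sensitivity_specificity(predicted, actual, positive="tumor"):
    """Return (sensitivity, specificity) for binary class labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if a == positive and p == positive)
    fn = sum(1 for p, a in pairs if a == positive and p != positive)
    tn = sum(1 for p, a in pairs if a != positive and p != positive)
    fp = sum(1 for p, a in pairs if a != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up predictions for six samples, for illustration only
actual = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
predicted = ["tumor", "tumor", "normal", "normal", "normal", "tumor"]
sens, spec = sensitivity_specificity(predicted, actual)
```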

Figure 2.

ROC curves of the linear classifiers CC, DLDA and SVM in the validation sets GSE44076 (A) and GSE44861 (B). FPR, false positive rate; TPR, true positive rate; AUC, area under curve.

Survival analysis of DEGs in CRC and their function annotation

The 66 biomarkers were significantly differentially expressed in CRC; however, whether their expression was correlated with patient survival was unclear. We used GSE14333, which contained 226 samples with survival information among a total of 290 patients, as the training set for survival analysis. By univariable Cox proportional hazards regression analysis and a random permutation test, we obtained 46 genes correlated with patient survival (P<0.001) (Table III).

To elucidate the function of these survival-related DEGs, we conducted GO and KEGG pathway analysis. Many of the genes play an important role in ‘response to drug’, ‘metabolic process’, ‘cell proliferation’ and ‘oxidoreductase activity’, which are highly correlated with drug resistance, altered cancer metabolism, ROS level and proliferation. Several genes, such as MYC and CCND1, also participate in multiple cancer pathways (Table IV).

Table IV.

GO analysis and KEGG pathway analysis of 46 survival related-DEGs (partial data).

Genes	Hyp	Hyp[a]	Annotations
Biological process
5	4.7E-05	0.00408	GO:0042493: Response to drug (BP)
4	0.00079	0.02956	GO:0008152: Metabolic process (BP)
3	0.00804	0.03142	GO:0008283: Cell proliferation (BP)
3	0.00769	0.03249	GO:0007411: Axon guidance (BP)
3	0.01192	0.0359	GO:0008284: Positive regulation of cell proliferation (BP)
3	0.02362	0.04835	GO:0045893: Positive regulation of transcription, DNA-dependent (BP)
Molecular function
13	0.00391	0.02429	GO:0005515: Protein binding (MF)
7	1.6E-06	0.00019	GO:0016491: Oxidoreductase activity (MF)
7	0.01988	0.04888	GO:0000166: Nucleotide binding (MF)
6	0.01607	0.0431	GO:0004872: Receptor activity (MF)
5	0.00871	0.03213	GO:0016787: Hydrolase activity (MF)
4	0.00865	0.03294	GO:0016740: Transferase activity (MF)
4	0.01535	0.04312	GO:0004930: G-protein coupled receptor activity (MF)
Cellular component
15	0.00234	0.03334	GO:0005737: Cytoplasm (CC)
13	0.00169	0.03219	GO:0016020: Membrane (CC)
11	0.00562	0.03205	GO:0005886: Plasma membrane (CC)
9	0.00075	0.02125	GO:0005576: Extracellular region (CC)
7	0.0028	0.02656	GO:0005730: Nucleolus (CC)
6	0.00948	0.04156	GO:0005739: Mitochondrion (CC)
5	0.00357	0.02911	GO:0005615: Extracellular space (CC)
4	0.00072	0.04092	GO:0005743: Mitochondrial inner membrane (CC)
3	0.00236	0.02696	GO:0005759: Mitochondrial matrix (CC)
KEGG pathway
3	0.0089	0.0411	(KEGG) 05200: Pathways in cancer
2	0.00215	0.02152	(KEGG) 05213: Endometrial cancer
2	0.00258	0.02211	(KEGG) 05221: Acute myeloid leukemia
2	0.00304	0.02282	(KEGG) 05210: Colorectal cancer
2	0.00077	0.02304	(KEGG) 00040: Pentose and glucuronate interconversions
2	0.00207	0.02485	(KEGG) 00500: Starch and sucrose metabolism
2	0.00419	0.02514	(KEGG) 05220: Chronic myeloid leukemia
2	0.00386	0.02574	(KEGG) 05218: Melanoma
2	0.00184	0.02755	(KEGG) 00520: Amino sugar and nucleotide sugar metabolism
2	0.00141	0.02818	(KEGG) 05219: Bladder cancer
2	0.00551	0.03004	(KEGG) 05222: Small cell lung cancer
2	0.00063	0.03755	(KEGG) 05216: Thyroid cancer
2	0.01148	0.04919	(KEGG) 04110: Cell cycle
2	0.01238	0.04953	(KEGG) 04360: Axon guidance

Partial data, genes involved ≥3 (GO analysis) or gene involved ≥2 (KEGG pathway analysis). Genes involved in all KEGG pathway above were MYC and CCND1.

[a] Corrected hypergeometric P-value. Hyp, hypergeometric P-value; KEGG, Kyoto Encyclopedia of Genes and Genomes; BP, biological process; MF, molecular function; CC, cellular component.

Construction of risk score formula

To select the most heavily weighted genes, we used the random survival forests algorithm (Ntree=1,000, default parameters of the Hemant Ishwaran algorithm) (Fig. 3A), with the 46 survival-related genes set as variables in the model. We ranked these 46 genes by their importance after processing with the random survival forests algorithm in R software (Fig. 3B). Five genes, namely MET, CPM, SHMT2, GUCA2B and SCN9A, were selected as the most important candidates (relative importance >0.5). Relative importance is the importance value of a given gene normalized to that of MET, which was the most important gene in our random survival forests model (Fig. 3B; detailed normalized data not shown). To investigate whether the five candidates could provide an accurate prediction of survival in CRC patients, the expression data of these genes were fitted into a multivariable Cox regression model as covariates in the training dataset. We obtained the regression coefficient of each gene and then built a risk score formula for each individual as follows:

Figure 3.

Survival related-DEGs ranked by variable importance. (A) Error rate of random survival forests algorithm (Ntree =1,000, default parameters of Hemant Ishwaran algorithm). (B) Variable importance of the 46 survival related-DEGs. DEGs, differentially expressed genes.

Risk score = −0.370 × (expression value of CPM) − 0.122 × (expression value of GUCA2B) + 0.332 × (expression value of MET) + 0.088 × (expression value of SCN9A) + 0.827 × (expression value of SHMT2).

Using the median risk score as the cut-off, we defined patients with a risk score below the median as the low-risk group and those with a risk score above the median as the high-risk group. To assess the reliability of the risk score formula in predicting patient survival, we ranked all patients in the training set GSE14333 and divided them into a high-risk group (n=116) and a low-risk group (n=113) (Fig. 4). Patients in the low-risk group had markedly longer overall survival than those in the high-risk group (P=0.001, Mantel-Cox log-rank test) (Fig. 4A). The distribution of follow-up months by risk score, together with live/dead status, is shown in Fig. 4B. However, the P-value of the Pearson chi-square test was 0.109, indicating no significant difference in final live/dead status between patients with different risk scores; the risk score was therefore more valuable for predicting overall survival (Fig. 4A) than final live/dead status. Moreover, lower expression of SCN9A, CPM and GUCA2B together with higher expression of MET and SHMT2 was relatively homogeneous and stable from patient to patient in the high-risk group (Fig. 4E upper).
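
The risk score formula and the median split can be sketched in a few lines. The coefficients are those stated in the formula; the patient expression values are made up, and assigning a score exactly equal to the median to the low-risk group is an assumption, since the text does not define that case.

```python
from statistics import median

# Coefficients as stated in the risk score formula
COEFFICIENTS = {
    "CPM": -0.370,
    "GUCA2B": -0.122,
    "MET": 0.332,
    "SCN9A": 0.088,
    "SHMT2": 0.827,
}

def risk_score(expression):
    """Linear risk score from a dict of gene symbol -> expression value."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())

def split_by_median(scores):
    """Return (cutoff, groups); ties with the median go to 'low' (assumption)."""
    cutoff = median(scores)
    return cutoff, ["high" if s > cutoff else "low" for s in scores]

# Made-up expression values for three hypothetical patients
patients = [
    {"CPM": 8.0, "GUCA2B": 9.0, "MET": 7.0, "SCN9A": 5.0, "SHMT2": 9.0},
    {"CPM": 6.0, "GUCA2B": 5.0, "MET": 9.0, "SCN9A": 4.0, "SHMT2": 10.5},
    {"CPM": 9.0, "GUCA2B": 10.0, "MET": 6.5, "SCN9A": 5.5, "SHMT2": 8.0},
]
scores = [risk_score(p) for p in patients]
cutoff, groups = split_by_median(scores)
```

For validation cohorts, the text states that the threshold from GSE14333 was reused, which would correspond to passing a fixed `cutoff` rather than recomputing the median.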

Figure 4.

Test and validation of the risk score in independent GEO datasets. (A) Kaplan-Meier survival curves of low- and high-risk patients in the training set GSE14333 (P=0.001, Mantel-Cox log-rank test). (B) Scatter diagram of live and dead outcomes at different risk score values in GSE14333. (C and D) Kaplan-Meier survival curves of low- and high-risk patients in the validation sets GSE17536 (C; P=0.001) and GSE29621 (D; P=0.038). (E) Gene expression distribution of the five most important biomarkers in low- and high-risk patients in GSE14333, GSE17536 and GSE29621. Genes in GSE14333 were SCN9A, CPM, GUCA2B, MET and SHMT2 from top to bottom. Genes in GSE17536 and GSE29621 were SHMT2, MET, CPM, GUCA2B and SCN9A from top to bottom. GEO, Gene Expression Omnibus.

In addition, we performed multivariable and univariable Cox regression analysis to elucidate the relationship between the risk score and other factors such as sex, age at diagnosis and Dukes stage. The risk score was the most significant of all factors [P=0.005 (multivariable) and P=0.001 (univariable)], while age (P=0.016) and adjuvant radiation therapy (P=0.021) were univariable prognostic factors, as reported previously (19,20) (Table V upper). These data suggested that the risk score could predict patient survival directly and independently.

Table V.

Multivariable and univariable model tests of risk score and other factors.

A, GSE14333

	Multivariable model				Univariable model

Variables	HR	95% CI of HR		P-value	HR	95% CI of HR		P-value
Risk score	2.346	1.298	4.241	0.005	2.718	1.523	4.851	0.001
Location	0.965	0.814	1.144	0.683	0.892	0.76	1.047	0.163
Dukes stage	1.18	0.926	1.503	0.18	1.044	0.86	1.266	0.666
Age of diagnosis	1.008	0.994	1.023	0.257	1.105	1.002	1.028	0.02
Sex	0.926	0.683	1.255	0.62	0.877	0.651	1.182	0.39
Adj XRT	0.463	0.218	0.984	0.045	0.433	0.212	0.884	0.021
Adj CTX	0.867	0.568	1.325	0.51	0.847	0.618	1.16	0.3

B, GSE17536

	Multivariable model				Univariable model

Variables	HR	95% CI of HR		P-value	HR	95% CI of HR		P-value

Risk score	2.745	1.204	6.262	0.016	3.283	1.489	7.236	0.003
Age	1.015	0.999	1.031	0.061	1.018	1.003	1.034	0.016
Sex	1.084	0.747	1.572	0.672	0.953	0.666	1.362	0.79
Ethnicity	0.967	0.728	1.284	0.817	0.915	0.685	1.221	0.545
AJCC stage	1.107	0.892	1.373	0.357	1.051	0.861	1.284	0.625
Grade	1.254	0.828	1.898	0.285	1.375	0.924	2.045	0.116

C, GSE29621

	Multivariable model				Univariable model

Variables	HR	95% CI of HR		P-value	HR	95% CI of HR		P-value

Risk score	9.03	1.425	57.223	0.019	2.526	0.481	13.269	2.73E-05
Sex	1.243	0.513	3.014	0.63	1.508	0.649	3.505	0.34
T stage	0.449	0.091	2.209	0.325	1.048	0.438	2.509	0.915
N stage	1.583	0.604	4.143	0.35	2.688	1.526	4.734	0.001
M stage	2.065	0.368	11.592	0.41	4.934	2.188	11.124	1.19E-04
Histology grade	0.849	0.325	2.219	0.738	0.665	0.284	1.558	0.348
AJCC stage	1.965	0.518	7.45	0.321	2.708	1.615	4.542	1.59E-04

HR, hazard ratio; Adj XRT, adjuvant radiation therapy; Adj CTX, adjuvant chemotherapy.
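
In a Cox model the hazard ratio is exp(β), and a Wald 95% confidence interval is exp(β ± 1.96 × SE), so a reported HR should sit at the geometric mean of its interval bounds. The sketch below recovers β, SE and a two-sided Wald P-value from a reported HR and interval. It assumes the intervals in the table above are Wald intervals, which the text does not state; the function name is hypothetical.

```python
from math import erfc, log, sqrt

def wald_from_hr(hr, ci_low, ci_high, z_crit=1.96):
    """Recover (beta, se, two-sided p) from a hazard ratio and its 95% CI.

    Assumes a Wald interval, i.e. one that is symmetric on the log scale.
    """
    beta = log(hr)
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return beta, se, p

# Risk score row of the GSE14333 multivariable model in the table above
beta, se, p = wald_from_hr(2.346, 1.298, 4.241)
geometric_mean = sqrt(1.298 * 4.241)
```

For this row the recovered P-value comes out close to the reported 0.005 and the geometric mean of the bounds is close to the reported HR, so the same check can be used to spot rows where an HR, interval and P-value do not agree with one another.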

Validation of risk score in predicting survival within independent CRC datasets

To further evaluate the clinical value of this risk score, we used two independent CRC datasets with survival information, GSE17536 (n=177) and GSE29621 (n=65). We applied the threshold derived from GSE14333 to classify patients into high-risk and low-risk groups. In both datasets, patients with a high risk score had lower overall survival (P=0.001, GSE17536; P=0.038, GSE29621) (Fig. 4C and D). The five biomarkers of the risk score (MET, CPM, SHMT2, GUCA2B and SCN9A) showed similar stability in GSE17536 and GSE29621 as in GSE14333 (Fig. 4E middle and lower). In addition, multivariable and univariable Cox regression analysis confirmed that the risk score was the most significant factor in GSE17536 [P=0.016 (multivariable) and P=0.003 (univariable)], while the P-values of the other factors were >0.05 except for age, which was significant in the univariable model only (P=0.016) (Table V middle). In GSE29621, the risk score was also the most significant factor [P=0.019 (multivariable) and P=2.73E-05 (univariable)], while N stage, M stage (TNM staging) and AJCC stage were significant in the univariable model only (Table V lower), which is consistent with the established relationship between metastasis, stage and patient outcome (21). These data indicated that the risk score could directly predict patient survival.

Discussion

In the present study, we identified and verified 46 survival-related biomarkers from 66 DEGs in CRC and then built a prognostic risk score that could be translated into the clinical setting. The 46 survival-related biomarkers were mainly located in the cytoplasm, membrane and nucleolus, with only a small portion in mitochondria and other subcellular compartments. GO enrichment showed that these genes were involved in multiple biological processes, such as response to drug, metabolic process, cell proliferation and positive regulation of cell proliferation. These biological processes play a pivotal role in cancer proliferation, drug resistance and metastasis, thus further affecting patient survival (22–24). Genes such as MYC and CCND1 within the CRC pathway in the KEGG annotation also participate in other cancer pathways, such as endometrial cancer and chronic myeloid leukemia (25,26). We then ranked the 46 survival-related genes by the random survival forests algorithm and obtained the five most important biomarkers, namely MET, CPM, SHMT2, GUCA2B and SCN9A. MET has been reported to be gradually upregulated during the development and progression of CRC from normal epithelium to adenoma, colorectal carcinoma and metastases (27,28). Although others have argued that the increase of MET in metastatic CRC is an acquired response to EGFR inhibition rather than a de novo phenomenon (29), its prognostic value has been confirmed by several independent studies (30,31). Moreover, suppressing MET with a specific inhibitor or shRNA has a therapeutic effect in CRC (32,33). CPM has been less studied; only one report revealed that it is a target of miR-146a, which promotes cell migration and invasion in CRC via the CPM/src-FAK pathway (34). It has been suggested that CPM has the potential to be a therapeutic target in cancer (35), but its function requires further investigation. SHMT2 participates in cellular one-carbon metabolism and has been implicated as a critical component for tumor survival.
Its upregulation is correlated with tumor proliferation in several cancers (36,37). Kim et al found that SHMT2 activity limits that of pyruvate kinase (PKM2) and reduces oxygen consumption, thus eliciting a metabolic switch that confers a profound survival advantage to cells in poorly vascularized regions (38). GUCA2B and SCN9A have rarely been studied in cancer, and more light should be shed on their role in CRC. The causes and progression of CRC are complicated and remain to be further elucidated, and we think the remaining genes in Table III may have potential value in better interpreting the carcinogenesis and progression of CRC. Moreover, we established a linear risk score as a survival prediction model based on the above five genes by multivariable Cox regression using highly reliable CRC datasets. This risk score predicted patients at high risk of mortality independently and directly in all validation datasets. Although more prospective studies are necessary to further validate the reliability and robustness of this risk score, our work provides a new method toward clinical applications of gene expression profiling in CRC, especially in future personalized prediction and precision medicine.

38 in total

1. Identification of MAGEA12 as a prognostic outlier gene in gastric cancers.

Authors: J Wu; J Wang; W Shen
Journal: Neoplasma Date: 2017 Impact factor: 2.575

2. Gene expression profiling-derived immunohistochemistry signature with high prognostic value in colorectal carcinoma.

Authors: Wenjun Chang; Xianhua Gao; Yifang Han; Yan Du; Qizhi Liu; Lei Wang; Xiaojie Tan; Qi Zhang; Yan Liu; Yan Zhu; Yongwei Yu; Xinjuan Fan; Hongwei Zhang; Weiping Zhou; Jianping Wang; Chuangang Fu; Guangwen Cao
Journal: Gut Date: 2013-10-30 Impact factor: 23.059

3. Weighted gene co-expression network analysis in identification of endometrial cancer prognosis markers.

Authors: Xiao-Lu Zhu; Zhi-Hong Ai; Juan Wang; Yan-Li Xu; Yin-Cheng Teng
Journal: Asian Pac J Cancer Prev Date: 2012

4. Cancer Statistics, 2017.

Authors: Rebecca L Siegel; Kimberly D Miller; Ahmedin Jemal
Journal: CA Cancer J Clin Date: 2017-01-05 Impact factor: 508.702

5. Improved survival among colon cancer patients with increased differentially expressed pathways.

Authors: Martha L Slattery; Jennifer S Herrick; Lila E Mullany; Jason Gertz; Roger K Wolff
Journal: BMC Med Date: 2015-04-08 Impact factor: 8.775

6. Pathway activation strength is a novel independent prognostic biomarker for cetuximab sensitivity in colorectal cancer patients.

Authors: Qingsong Zhu; Evgeny Izumchenko; Alexander M Aliper; Evgeny Makarev; Keren Paz; Anton A Buzdin; Alex A Zhavoronkov; David Sidransky
Journal: Hum Genome Var Date: 2015-04-02

7. Common risk variants for colorectal cancer: an evaluation of associations with age at cancer onset.

Authors: Nan Song; Aesun Shin; Ji Won Park; Jeongseon Kim; Jae Hwan Oh
Journal: Sci Rep Date: 2017-01-13 Impact factor: 4.379

8. RECQ1 helicase is involved in replication stress survival and drug resistance in multiple myeloma.

Authors: E Viziteu; B Klein; J Basbous; Y-L Lin; C Hirtz; C Gourzones; L Tiers; A Bruyer; L Vincent; C Grandmougin; A Seckinger; H Goldschmidt; A Constantinou; P Pasero; D Hose; J Moreaux
Journal: Leukemia Date: 2017-02-10 Impact factor: 11.528

9. Prognostic role of ERBB2, MET and VEGFA expression in metastatic colorectal cancer patients treated with anti-EGFR antibodies.

Authors: Naoki Takahashi; Satoru Iwasa; Hirokazu Taniguchi; Yusuke Sasaki; Hirokazu Shoji; Yoshitaka Honma; Atsuo Takashima; Natsuko Okita; Ken Kato; Tetsuya Hamaguchi; Yasuhiro Shimada; Yasuhide Yamada
Journal: Br J Cancer Date: 2016-03-22 Impact factor: 7.640

10. MET amplification in metastatic colorectal cancer: an acquired response to EGFR inhibition, not a de novo phenomenon.

Authors: Kanwal Raghav; Van Morris; Chad Tang; Pia Morelli; Hesham M Amin; Ken Chen; Ganiraju C Manyam; Bradley Broom; Michael J Overman; Kenna Shaw; Funda Meric-Bernstam; Dipen Maru; David Menter; Lee M Ellis; Cathy Eng; David Hong; Scott Kopetz
Journal: Oncotarget Date: 2016-08-23
