A 21‑gene Support Vector Machine classifier and a 10‑gene risk score system constructed for patients with gastric cancer.

Hui Jiang¹, Jiming Gu¹, Jun Du¹, Xiaowei Qi¹, Chengjia Qian¹, Bojian Fei¹.

Abstract

Gastric cancer (GC) ranks fifth in terms of incidence and third in terms of tumor mortality worldwide. The present study was designed to construct a Support Vector Machine (SVM) classifier and risk score system for GC. The GSE62254 (training set) and GSE26253 (validation set 2) datasets were downloaded from the Gene Expression Omnibus database. Furthermore, the gene expression profile of GC (validation set 1) was obtained from The Cancer Genome Atlas database. Differentially expressed genes (DEGs) between recurrent and non‑recurrent samples were determined using the limma package. The feature genes were selected using the Caret package, and an SVM classifier was built using the e1071 package. Using the penalized package, the optimal predictive genes for constructing a risk score system were screened. Finally, stratification analysis of clinical factors and pathway enrichment analysis were performed using Gene Set Enrichment Analysis. A total of 239 DEGs were identified in GSE62254, among which 114 DEGs were significantly associated with both recurrence‑free survival and overall survival. Subsequently, 21 feature genes were screened from the 114 DEGs, and an SVM classifier was built. A risk score system for survival prediction was constructed, following the selection of 10 optimal genes, including A‑kinase anchoring protein 12, angiopoietin‑like protein 1, cysteine‑rich sequence 1, myeloid/lymphoid or mixed‑lineage leukemia, translocated to chromosome 11, neuron navigator 3, neurobeachin, nephroblastoma overexpressed, pleiotrophin, tumor suppressor candidate 3 and zinc finger and SCAN domain containing 18. The stratification analysis revealed that pathological stage was an independent prognostic clinical factor in the high‑risk group. Additionally, eight significant pathways were associated with the 10‑gene signature. The SVM classifier and risk score system may be applied for classifying and predicting the prognosis of patients with GC, respectively.

Year: 2019 PMID: 31939629 PMCID: PMC6896370 DOI: 10.3892/mmr.2019.10841

Source DB: PubMed Journal: Mol Med Rep ISSN: 1791-2997 Impact factor: 2.952

Introduction

Gastric cancer (GC) arises in the inner lining of the stomach, and 60% of GC cases are caused by Helicobacter pylori infection (1). Patients with GC usually present with epigastric pain, heartburn, inappetence, nausea, vomiting, weight loss and dysphagia (2). In patients with advanced GC, tumor cells may migrate from the stomach to other tissues and organs, such as the liver, lymph nodes, lung and bone (3). As the disease is often diagnosed late, its prognosis is usually unfavorable, with a 5-year survival rate of <10% worldwide in 2016 (4). Globally, stomach cancer ranks fifth in terms of incidence and third in terms of tumor mortality, affecting 950,000 new patients and resulting in 723,000 cases of mortality in 2012 (5,6). In order to improve the therapies for GC, the molecular mechanisms of GC should be further elucidated. Astrocyte-elevated gene 1 is involved in the progression of GC and predicts the prognosis of patients with GC, and thus its targeted inhibition may be a promising strategy for treating the tumor (7). Decreased mRNA and protein expression levels of liver kinase B1 are detected in patients with GC who have a low survival rate, and are independent prognostic factors of GC (8,9). Nicotinamide adenine dinucleotide phosphate oxidases (NOX) family genes act as possible prognostic indicators in GC, indicating that NOX inhibitors may be useful for the treatment of patients with GC (10). Ataxia telangiectasia mutated (ATM) expression is decreased among patients with GC in Xinjiang, and thus ATM may be a potential marker of prognosis in patients with GC (11). Overexpression of fibulin-1 (FBLN1) inhibits GC cell growth and promotes apoptosis by elevating the expression of cleaved caspase-3; thus, FBLN1 is a tumor suppressor and prognostic factor in patients with GC (12). Despite these findings, the genes implicated in the pathogenesis of GC have not been fully elucidated.
Early diagnosis, reasonable prognostic evaluation, and timely and appropriate intervention are important for improving the outcomes of patients with GC (13). The study of prognostic markers can guide the close monitoring and further treatment of patients at high risk of recurrence and improve their survival rate (14,15). An increasing number of studies have identified prognostic gene signatures and developed prognostic score models for patients with GC (16–26). However, the recurrence-associated prognostic genes in GC have not been comprehensively examined. Since recurrence is experienced in 25–40% of all patients with GC treated with surgical resection (27,28), the identification of recurrence-associated genes is important for survival prediction in these patients. Therefore, using microarray datasets of GC samples downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, differentially expressed genes (DEGs) between recurrent and non-recurrent samples were identified. Subsequently, from the selected DEGs, the present study screened the feature genes associated with the recurrence of GC. This was followed by the construction of a classifier that could accurately identify the recurrence of GC. Combined with the clinical prognostic information, the risk score system was built based on the expression levels of the feature genes.

Materials and methods

Data source and preliminary screening of clinical factors

Using ‘gastric cancer’ and ‘Homo sapiens’ as keywords, microarray data were searched for in the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/). The selected datasets met the following criteria: i) Recurrence information was available; ii) recurrence-free survival (RFS) time information was available; and iii) sample size was ≥200. Finally, GSE62254 (platform, GPL570, Affymetrix Human Genome U133 Plus 2.0 Array; Thermo Fisher Scientific, Inc.) (26,29) and GSE26253 (platform, GPL8432, Illumina HumanRef-8 WG-DASL v3.0; Illumina, Inc.) (30) were selected. GSE62254 contained 300 GC tissue samples, 282 of which had recurrence information, including 125 recurrent samples and 157 non-recurrent samples. The 282 samples were used as the training set of the present study. GSE26253 (n=432) included 177 recurrent and 255 non-recurrent samples, and was used as a validation set (validation set 2). Furthermore, in order to obtain another validation set, gene expression profiles of GC samples were downloaded from The Cancer Genome Atlas (TCGA; https://gdc-portal.nci.nih.gov/; TCGA STAD project) database based on the Illumina HiSeq 2000 RNA Sequencing platform. As a result, 421 GC tissue samples were acquired, 298 of which had corresponding recurrence information, comprising 242 samples without recurrence and 56 samples with recurrence (validation set 1). Using the training dataset (GSE62254), univariate and multivariate Cox regression analyses were conducted to evaluate the association of clinical factors with prognosis, using the survival package (version 2.41-1; http://bioconductor.org/packages/survivalr/) (31) in R (version 3.4.1; http://www.r-project.org/). P<0.05 was set as the threshold for significant association. The pathological stage and recurrence were identified to be independent prognostic clinical factors (Table I; Fig. 1). Therefore, samples of the training set were divided into recurrence and non-recurrence groups for further analysis in the present study.

Table I.

Preliminary screening of independent prognostic clinical factors.

		Univariate Cox			Multivariate Cox

Clinical characteristics	GSE62254 (n=300)	HR	95% CI	P-value	HR	95% CI	P-value
Age (years, mean ± SD)	61.94±11.36	1.009	0.993–1.025	2.71×10⁻¹	–	–	–
Sex (male/female)	199/101	0.869	0.612–1.234	4.33×10⁻¹	–	–	–
MLH1 IHC (positive/negative/-)	234/64/2	2.206	1.326–3.670	1.78×10⁻³	1.533	0.859–2.733	1.48×10⁻¹
EBV ISH (positive/negative/-)	18/257/25	1.037	0.507–2.123	9.20×10⁻¹	–	–	–
Lymphovascular invasion (yes/no/-)	205/73/22	2.642	1.602–4.357	7.67×10⁻⁵	1.659	0.972–2.832	6.34×10⁻²
Pathologic M (M0/M1)	273/27	3.971	2.517–6.266	1.58×10⁻¹⁰	1.609	0.912–2.839	1.01×10⁻¹
Pathologic N (N0/N1/N2/N3)	38/131/80/51	2.052	1.698–2.480	2.03×10⁻¹⁴	1.206	0.851–1.708	2.92×10⁻¹
Pathologic T (T1/T2/T3/T4)	2/186/91/21	1.847	1.469–2.323	8.37×10⁻⁸	1.120	0.809–1.550	4.94×10⁻¹
Pathologic stage (I/II/III/IV/-)	30/96/95/77/2	2.378	1.933–2.925	2.22×10⁻¹⁶	1.660	1.056–2.611	2.81×10⁻²
Lauren classification (diffuse/intestinal/mixed/-)	135/146/17/2	0.828	0.704–0.974	2.19×10⁻²	0.988	0.829–1.177	8.92×10⁻¹
Recurrence (yes/no/-)	125/157/18	16.790	10.14–27.81	2.00×10⁻¹⁶	13.61	7.704–24.041	2.00×10⁻¹⁶
Mortality (dead/alive/-)	135/148/17	–	–	–	–	–	–
Overall survival time (months, mean ± SD)	50.59±31.42	–	–	–	–	–	–

Cox regression analysis was not performed for mortality and overall survival time, as they are dependent variables and not independent variables. HR, hazard ratio; MLH1 IHC, MutL homolog 1 immunohistochemistry; EBV ISH, Epstein-Barr virus in situ hybridization.

Figure 1.

KM survival curves based on pathological stage and recurrence. (A) KM curves according to pathological stage. (B) KM curves based on recurrence. KM, Kaplan-Meier; HR, hazard ratio.

Data normalization

The expression matrices of the three datasets were stacked, and each matrix was scaled based on expression levels. First, each sample vector $v$ was scaled to unit length:

$$\hat{v} = \frac{v}{\lVert v \rVert_2}$$

where $\lVert v \rVert_2$ stands for the 2-norm of the vector. The norm was computed with the sqrt[sum(data^2)] function (32) in R, which corresponds to extracting the square root of the eigenvalue of the matrix $B = A A^{T}$, so that each sample was scaled to a norm of 1. Second, based on the median and median absolute deviation (MAD) of each gene, the gene expression level was centered and normalized using median scaling. Given a feature vector $x = (x_1, \ldots, x_n)$, median scale normalization is defined as:

$$x_i' = \frac{x_i - \operatorname{median}(x)}{\operatorname{MAD}(x)}$$
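The two scaling steps described in this section can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names are hypothetical, and the MAD is assumed to be the raw median absolute deviation without a consistency constant (R's mad() multiplies by 1.4826 by default), since the source does not say which variant was used.

```python
import math
from statistics import median


def unit_scale(sample):
    """Scale one sample vector so that its 2-norm equals 1."""
    norm = math.sqrt(sum(x * x for x in sample))
    if norm == 0:
        raise ValueError("cannot scale an all-zero sample vector")
    return [x / norm for x in sample]


def median_mad_scale(gene_values):
    """Center one gene on its median and divide by its raw MAD.

    Assumption: raw MAD, i.e. median(|x - median(x)|), with no
    consistency constant applied.
    """
    med = median(gene_values)
    mad = median([abs(x - med) for x in gene_values])
    if mad == 0:
        raise ValueError("MAD is zero; the gene has no spread")
    return [(x - med) / mad for x in gene_values]
```

Median/MAD scaling is less sensitive to outlying expression values than mean/SD scaling, which is presumably why a median-based scheme suits data pooled from different platforms.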

Identification of DEGs between recurrence and non-recurrence samples

As aforementioned, the GSE62254 dataset was classified into recurrent and non-recurrent groups. The DEGs between the two groups were analyzed using the limma package (version 3.34.7; http://bioconductor.org/packages/release/bioc/html/limma.html) (33) in R. The cut-off criteria were a false discovery rate (FDR) <0.05 and |log2 fold change (FC)|>0.263. Subsequently, bidirectional hierarchical clustering based on the centered Pearson correlation algorithm was performed on the DEGs using the pheatmap package (version 1.0.8; http://cran.r-project.org/web/packages/pheatmap/index.html) (34) in R.
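To make the DEG cut-off concrete, the following Python sketch applies the two thresholds to per-gene results. It assumes that the FDR is a Benjamini-Hochberg adjusted P-value (the default adjustment in limma's topTable); the source does not state the adjustment method, and the function names are hypothetical. Note that |log2 FC|>0.263 corresponds to a fold change of approximately 1.2 in either direction.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fdr_cutoff=0.05, lfc_cutoff=0.263):
    """Keep genes with FDR < fdr_cutoff and |log2 FC| > lfc_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fc, fdr)
        if q < fdr_cutoff and abs(lfc) > lfc_cutoff
    ]
```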

Construction of the Support Vector Machine (SVM) classifier

Using Cox regression analysis in the survival package (31), the DEGs that were significantly associated with RFS time and overall survival (OS) time were selected from the GSE62254 dataset. P<0.05 was set as the threshold. The DEGs significantly associated with both RFS time and OS time were used for subsequent analysis. The recursive feature elimination algorithm in the caret package (version 6.0–76; http://cran.r-project.org/web/packages/caret) (35) in R was used to identify the optimal combination of feature genes. During the 100-fold cross validation, the gene combination corresponding to the highest accuracy and the smallest Root Mean Square Error (RMSE) was considered as the optimal combination of feature genes. SVM is a supervised classification algorithm that uses the feature values of each sample to evaluate the probability of the sample belonging to a given class (36). Using the SVM algorithm (cross, 100-fold cross validation; kernel, sigmoid) in the e1071 package (version 1.6–8; http://cran.r-project.org/web/packages/e1071) (37) in R, an SVM classifier was built based on the feature gene combination. In GSE62254, GSE26253 and the TCGA dataset, the classification efficiency of the SVM classifier was assessed based on the Concordance index (C-index), Brier score, log-rank P-value of Cox-proportional hazard (Cox-PH) regression and area under the receiver operating characteristic (AUROC) curve. Using the survcomp package (version 1.30.0; http://www.bioconductor.org/packages/release/bioc/html/survcomp.html) (38) in R, the C-index (the fraction of all sample pairs for which the order of survival times was predicted correctly) (39) and the Brier score (a scoring function for measuring the accuracy of probability prediction) (40) were calculated. Using the Kaplan-Meier (KM) curve analysis of the survival package (31), KM curves were drawn for the two groups predicted using the SVM classifier, and the log-rank P-value was calculated. 
Using the pROC package (version 1.12.1; http://cran.r-project.org/web/packages/pROC/index.html) (41) in R, indexes including sensitivity, specificity, positive predictive value and negative predictive value were calculated for the ROC curves.
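Two of the assessment indexes named above can be sketched in a few lines of Python. This is an illustration only, not the survcomp or pROC implementation: the function names are hypothetical, the C-index follows the basic Harrell definition (ties in risk count as 0.5), and those packages may treat ties and censoring differently.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index.

    A pair (i, j) is comparable when sample i had an event and a strictly
    shorter time than sample j. The pair is concordant when the sample
    with the shorter time also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def classification_metrics(actual, predicted):
    """Sensitivity, specificity, PPV and NPV from 0/1 labels (1 = recurrent)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering of survival times.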

Construction of risk score system

Using the Cox-PH model of the penalized package (version 0.9–50; http://bioconductor.org/packages/penalized/) (42) in R, the optimal combination of prognosis-associated genes was further screened from the selected combination of feature genes. The optimized parameter ‘lambda’ in the screening model was determined through 1,000 iterations of cross-validation likelihood (cvl). Combined with the prognostic coefficients of the prognosis-associated DEGs in the optimal combination, a risk score system was constructed based on gene expression levels. The risk score was calculated for each sample using the following formula:

$$\text{Risk score} = \sum \text{Coef}_{\text{DEG}} \times \text{Exp}_{\text{DEG}}$$

where $\text{Coef}_{\text{DEG}}$ and $\text{Exp}_{\text{DEG}}$ represent the regression coefficient and the corresponding gene expression level, respectively. With the median of the risk scores as the demarcation point, the samples in GSE62254 were classified into high- and low-risk groups. Using the KM curve analysis of the survival package (31), the association between the risk score system and prognosis was analyzed. Additionally, the risk score system was further validated in the GSE26253 and TCGA datasets.
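The KM analysis used to compare the risk groups rests on the product-limit estimator, which can be sketched as follows. This is a minimal illustration under the usual convention that samples censored at a given time are still at risk for events at that time; it is not the survival package's implementation, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each sample
    events: 1 if the event (e.g. recurrence) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all samples that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve
```

Computing one such curve per risk group and comparing them with a log-rank test reproduces the kind of comparison described in this section.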

Stratification analysis of clinical factors

Using univariate and multivariate Cox regression analysis in the survival package (31), the independent prognostic clinical factors in GSE62254 were selected. Stratification analysis was then carried out separately within the high- and low-risk samples determined by the risk score system.

Pathway enrichment analysis

According to the risk scores of the samples in GSE62254, the samples were divided into high-risk and low-risk groups. Under FDR<0.05 and |log2 FC|>0.263, the DEGs between the two groups were identified using the limma package (33). Using Gene Set Enrichment Analysis (http://software.broadinstitute.org/gsea/index.jsp) (43), pathway enrichment analysis was conducted for the DEGs, with the screening criterion of nominal P<0.05.

Results

Identification of DEGs

Following data normalization, 239 DEGs between recurrent and non-recurrent samples in the GSE62254 dataset were identified (Fig. 2A). The Kernel density curve of the DEGs revealed that 79.08% (189/239) of the DEGs were upregulated and 20.92% (50/239) of the DEGs were downregulated in recurrent samples (Fig. 2B). A bidirectional hierarchical clustering heatmap, based on the expression levels of the identified DEGs, indicated that the samples clustered into two groups (Fig. 2C).

Figure 2.

Screening results of DEGs. (A) Scatter diagram of the DEGs (red dots represent DEGs; the green horizontal dashed line represents FDR <0.05, and the two green vertical dashed lines represent |log2 (FC)|>0.263). (B) Kernel density curve of DEGs. (C) Bidirectional hierarchical clustering heatmap of the DEGs (pink and green sample bars represent recurrent samples and non-recurrent samples, respectively; red and blue represent upregulation and downregulation, respectively). DEGs, differentially expressed genes; FC, fold change; FDR, false discovery rate.

Construction of SVM classifier

A total of 124 recurrence-associated DEGs and 127 overall survival-associated DEGs were screened in GSE62254. Following comparison of the two sets of DEGs, 114 DEGs were found to be significantly associated with both RFS time and OS time. The 114 DEGs were further screened for feature genes. The combination of 21 genes, which gave the minimum RMSE (0.148) and the maximum accuracy (0.842), was considered as the optimal one. Based on the 21 feature genes, an SVM classifier was built in GSE62254. For GSE62254, GSE26253 and the TCGA datasets, all C-index values were >0.80 and all Brier scores were <0.30 for RFS time and OS time. The classification results of the samples, based on the SVM classifier, are presented in scatter diagrams (Fig. 3). KM survival curves demonstrated that the log-rank P-values for RFS time and OS time in the training and validation sets were all <0.05 (Fig. 4), suggesting significantly different RFS time and OS time between predicted recurrence and non-recurrence samples in the GSE62254 and TCGA datasets, and significantly different RFS time in GSE26253 (the samples in GSE26253 had no OS information). The predicted results of the SVM classifier were consistent with the actual outcomes of patients with GC in these datasets. The AUROC values for the training and validation sets were all >0.8 (Table II; Fig. 4). These results suggested that the SVM classifier based on the 21 feature genes could accurately determine the recurrence type of GC samples.

Figure 3.

Scatter diagrams showing classification results of Support Vector Machine classifier. Scatter diagram of (A) GSE62254, (B) TCGA and (C) GSE26253 datasets. Red triangles and black dots represent recurrent and non-recurrent samples, respectively. TCGA, The Cancer Genome Atlas.

Figure 4.

KM survival curves and AUROC curves based on the Support Vector Machine classifier. (A-a and A-b) KM curves and (A-c) AUROC curve of the GSE62254 dataset. (B-a and B-b) KM curves and (B-c) AUROC curve of the TCGA dataset. (C-a) KM curve and (C-b) AUROC curve of the GSE26253 dataset. For KM curves, red and black curves represent recurrent samples and non-recurrent samples, respectively. KM, Kaplan-Meier; AUROC, area under the receiver operating characteristic; AUC, area under the curve; TCGA, The Cancer Genome Atlas.

Table II.

Assessment indexes for the SVM classifier in the GSE62254, GSE26253 and TCGA datasets.

	RFS/OS			ROC

Datasets	C-index	Brier score	Log rank P-value	AUROC	Sensitivity	Specificity	PPV	NPV
Training set (GSE62254; n=282)	0.966/0.871	0.0108/0.0255	2.00×10⁻¹⁶/2.00×10⁻¹⁶	0.924	0.896	0.929	0.911	0.918
Validation set 1 (TCGA; n=295)	0.929/0.807	0.0272/0.0283	7.33×10⁻¹⁵/3.87×10⁻¹³	0.898	0.844	0.929	0.779	0.871
Validation set 2 (GSE26253; n=432)	0.950	0.0115	2.00×10⁻¹⁶	0.881	0.853	0.914	0.873	0.899

SVM, Support Vector Machine; TCGA, The Cancer Genome Atlas; RFS, recurrence-free survival; OS, overall survival; C-index, Concordance index; ROC, receiver operating characteristic; AUROC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value.

Construction of risk score system

Using the Cox-PH model, the optimal combination of prognostic genes was further screened from the 21 feature genes. When the optimized parameter ‘lambda’ was 2.2604, the cvl value was largest (−757.1749; Fig. 5A). When ‘lambda’=2.2604, 10 optimal genes were obtained [A-kinase anchoring protein 12 (AKAP12), angiopoietin-like protein (ANGPTL) 1, cysteine-rich sequence 1 (CYS1), myeloid/lymphoid or mixed-lineage leukemia, translocated to chromosome 11 (MLLT11), neuron navigator 3 (NAV3), neurobeachin (NBEA), nephroblastoma overexpressed (NOV), pleiotrophin (PTN), tumor suppressor candidate 3 (TUSC3) and zinc finger and SCAN domain containing 18 (ZSCAN18); Fig. 5B; Table III].

Figure 5.

Selection of the optimal gene combination. (A) Curve for selecting the optimized parameter ‘lambda’. The horizontal and vertical axes represent values of ‘lambda’ and cvl, respectively. The crossing of red dashed lines represents the value of ‘lambda’ parameter (2.2604), where cvl takes the maximum value (−757.1749). (B) Coefficient distribution diagram of the 10 optimal genes. AKAP12, A-kinase anchoring protein 12; ANGPTL1, angiopoietin-like protein 1; CYS1, cysteine-rich sequence 1; MLLT11, myeloid/lymphoid or mixed-lineage leukemia, translocated to chromosome 11; NAV3, neuron navigator 3; NBEA, neurobeachin; NOV, nephroblastoma overexpressed; PTN, pleiotrophin; TUSC3, tumor suppressor candidate 3; ZSCAN18, zinc finger and SCAN domain containing 18; cvl, cross-validation likelihood.

Table III.

Top 10 optimal genes selected for building the risk score system.

Gene	Coef	HR (95% CI)	P-value
AKAP12	0.3340	1.559 (1.278–3.112)	2.07×10⁻²
ANGPTL1	−0.5826	0.256 (0.121–0.541)	3.53×10⁻⁴
CYS1	0.1153	1.466 (1.149–3.311)	3.58×10⁻²
MLLT11	0.4899	1.623 (1.537–3.498)	2.16×10⁻²
NAV3	0.4681	2.243 (1.007–4.996)	4.79×10⁻²
NBEA	0.3292	1.706 (1.361–3.379)	1.26×10⁻²
NOV	0.2839	1.317 (1.187–2.525)	4.07×10⁻²
PTN	0.1638	1.563 (1.215–3.418)	2.63×10⁻²
TUSC3	0.0332	1.188 (1.053–1.711)	3.76×10⁻²
ZSCAN18	0.6275	2.308 (1.107–4.812)	2.56×10⁻²

HR, hazard ratio; AKAP12, A-kinase anchoring protein 12; ANGPTL1, angiopoietin-like protein 1; CYS1, cysteine-rich sequence 1; MLLT11, myeloid/lymphoid or mixed-lineage leukemia, translocated to chromosome 11; NAV3, neuron navigator 3; NBEA, neurobeachin; NOV, nephroblastoma overexpressed; PTN, pleiotrophin; TUSC3, tumor suppressor candidate 3; ZSCAN18, zinc finger and SCAN domain containing 18.

Based on the prognostic coefficients of the 10 optimal genes, a risk score system was built and risk scores were calculated using the following formula: Risk score=(0.3340) × ExpAKAP12 + (−0.5826) × ExpANGPTL1 + (0.1153) × ExpCYS1 + (0.4899) × ExpMLLT11 + (0.4681) × ExpNAV3 + (0.3292) × ExpNBEA + (0.2839) × ExpNOV + (0.1638) × ExpPTN + (0.0332) × ExpTUSC3 + (0.6275) × ExpZSCAN18. The samples in GSE62254 were divided into high- and low-risk groups. KM survival curves revealed that the high- and low-risk groups determined by the risk score system had significantly different RFS time in all three datasets (GSE62254, P=1.85×10⁻¹⁰, AUC=0.945; TCGA set, P=4.27×10⁻³, AUC=0.893; GSE26253, P=3.99×10⁻⁴, AUC=0.866; Fig. 6). These results revealed robust prognostic power of the 10-gene risk score.
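The risk score formula above can be written as a short Python sketch using the coefficients reported in Table III. The function names are hypothetical, the expression values are expected to be normalized in the same way as the training data, and assigning samples whose score equals the median to the low-risk group is an assumption, as the source does not specify how ties are handled.

```python
from statistics import median

# Coefficients of the 10 optimal genes, as reported in Table III.
COEFFICIENTS = {
    "AKAP12": 0.3340,
    "ANGPTL1": -0.5826,
    "CYS1": 0.1153,
    "MLLT11": 0.4899,
    "NAV3": 0.4681,
    "NBEA": 0.3292,
    "NOV": 0.2839,
    "PTN": 0.1638,
    "TUSC3": 0.0332,
    "ZSCAN18": 0.6275,
}


def risk_score(expression):
    """Weighted sum of expression levels; expression maps gene -> level."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def assign_risk_groups(scores):
    """Split samples at the median risk score.

    Assumption: scores equal to the median are labeled "low".
    """
    cutoff = median(scores)
    return ["high" if score > cutoff else "low" for score in scores]
```

Because ANGPTL1 is the only gene with a negative coefficient, higher ANGPTL1 expression lowers the score, whereas higher expression of the other nine genes raises it.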

Figure 6.

KM and AUROC curves based on the risk score system. KM curve (left) and AUROC curve (right) of (A) GSE62254, (B) the TCGA dataset and (C) GSE26253. KM, Kaplan-Meier; AUROC, area under the receiver operating characteristic; AUC, area under the curve; TCGA, The Cancer Genome Atlas.

Stratification analysis

Cox regression analysis demonstrated that pathological stage and risk status were independent prognostic clinical factors in GSE62254 (Table IV). Consequently, all samples were stratified into high- and low-risk groups. Furthermore, stratification analysis revealed that pathological stage was an independent prognostic clinical factor in the high-risk group (Table V). In addition, patients at different pathological stages in the high-risk group had significantly different RFS time (P=4.40×10⁻⁹; hazard ratio, 2.455; 95% confidence interval, 1.807–3.335; Fig. 7).

Table IV.

Results of Cox regression analysis for the GSE62254 dataset.

	Univariate Cox			Multivariate Cox

Clinical characteristics	HR	95% CI	P-value	HR	95% CI	P-value
Age (years, mean ± SD)	1.003	0.987–1.02	6.76×10⁻¹	–	–	–
Sex (male/female)	0.967	0.669–1.401	8.61×10⁻¹	–	–	–
MLH1 IHC (positive/negative/-)	2.096	1.241–3.544	4.72×10⁻³	1.023	0.564–1.855	9.39×10⁻¹
EBV ISH (positive/negative/-)	1.044	0.509–2.141	9.07×10⁻¹	–	–	–
Lymphovascular invasion (yes/no/-)	2.409	1.456–3.987	4.15×10⁻⁴	1.552	0.899–2.680	1.15×10⁻¹
Pathologic M (M0/M1/-)	3.839	2.364–6.236	5.01×10⁻⁹	1.293	0.719–2.324	3.91×10⁻¹
Pathologic N (N0/N1/N2/N3)	2.024	1.661–2.465	5.82×10⁻¹³	1.049	0.733–1.503	7.93×10⁻¹
Pathologic T (T1/T2/T3/T4/-)	1.816	1.435–2.298	4.06×10⁻⁷	0.867	0.599–1.252	4.46×10⁻¹
Pathologic stage (I/II/III/IV/-)	2.414	1.939–3.005	2.22×10⁻¹⁶	2.082	1.270–3.415	3.65×10⁻³
Lauren classification (diffuse/intestinal/mixed)	0.874	0.739–1.033	1.14×10⁻¹	–	–	–
Risk status (high/low)	3.322	2.246–4.913	1.85×10⁻¹⁰	2.535	1.656–3.882	1.86×10⁻⁵

HR, hazard ratio; MLH1 IHC, MutL homolog 1 immunohistochemistry; EBV ISH, Epstein-Barr virus in situ hybridization.

Table V.

Results of stratification analysis of clinical factors.

A, Low risk

	Univariate Cox			Multivariate Cox

Clinical characteristics	HR	95% CI	P-value	HR	95% CI	P-value
Age (years, mean ± SD)	1.029	0.992–1.067	1.21×10⁻¹	–	–	–
Sex (male/female)	1.374	0.644–2.933	4.09×10⁻¹	–	–	–
MLH1 IHC (positive/negative/-)	2.59	1.075–6.241	2.77×10⁻²	2.297	0.779–6.775	1.32×10⁻¹
EBV ISH (positive/negative/-)	2.399	0.926–6.218	6.29×10⁻²	–	–	–
Lymphovascular invasion (yes/no/-)	3.796	1.333–10.81	7.19×10⁻³	2.782	0.965–8.022	5.82×10⁻²
Pathologic M (M0/M1/-)	5.649	2.167–14.73	6.48×10⁻⁵	2.256	0.787–6.471	1.30×10⁻¹
Pathologic N (N0/N1/N2/N3)	2.39	1.675–3.41	4.15×10⁻⁷	1.977	0.845–4.623	1.16×10⁻¹
Pathologic T (T1/T2/T3/T4/-)	1.34	0.816–2.2	2.45×10⁻¹	–	–	–
Pathologic stage (I/II/III/IV/-)	2.203	1.537–3.158	6.08×10⁻⁶	0.961	0.397–2.326	9.30×10⁻¹
Lauren classification (diffuse/intestinal/mixed)	0.869	0.628–1.205	4.01×10⁻¹	–	–	–

B, High risk

	Univariate Cox			Multivariate Cox

Clinical characteristics	HR	95% CI	P-value	HR	95% CI	P-value
Age (years, mean ± SD)	1.009	0.991–1.028	3.31×10⁻¹	–	–	–
Sex (male/female)	0.868	0.566–1.33	5.15×10⁻¹	–	–	–
MLH1 IHC (positive/negative/-)	0.727	0.376–1.406	3.42×10⁻¹	–	–	–
EBV ISH (positive/negative/-)	0.539	0.170–1.711	2.87×10⁻¹	–	–	–
Lymphovascular invasion (yes/no/-)	1.787	1.003–3.183	4.58×10⁻²	1.297	0.676–2.487	4.34×10⁻¹
Pathologic M (M0/M1/-)	2.847	1.612–5.027	1.63×10⁻⁴	1.115	0.555–2.239	7.59×10⁻¹
Pathologic N (N0/N1/N2/N3)	1.706	1.332–2.186	1.85×10⁻⁵	0.987	0.666–1.463	9.48×10⁻¹
Pathologic T (T1/T2/T3/T4/-)	1.722	1.262–2.348	5.21×10⁻⁴	0.977	0.630–1.513	9.15×10⁻¹
Pathologic stage (I/II/III/IV/-)	2.455	1.807–3.335	4.40×10⁻⁹	2.245	1.241–4.062	7.48×10⁻³
Lauren classification (diffuse/intestinal/mixed)	1.018	0.841–1.232	8.59×10⁻¹	–	–	–

HR, hazard ratio; MLH1 IHC, MutL homolog 1 immunohistochemistry; EBV ISH, Epstein-Barr virus in situ hybridization.

Figure 7.

KM survival curves for pathological stage. (A) KM curve revealing the association between pathological stage and survival in the low-risk group. (B) KM curve demonstrating the association between pathological stage and recurrence-free survival in the high-risk group. KM, Kaplan-Meier.

Pathway enrichment analysis

Based on the risk score system, the samples in GSE62254 were divided into high- and low-risk groups. A total of 671 DEGs were identified between the two groups, including 656 upregulated genes and 15 downregulated genes. Pathway enrichment analysis revealed that eight significant pathways were enriched for the DEGs (Table VI). According to the nominal P-value, the top three significant pathways were ‘vascular smooth muscle contraction’, ‘regulation of actin cytoskeleton’ and ‘tyrosine metabolism’.

Table VI.

Significant pathways enriched for the differentially expressed genes between high- and low-risk groups.

Pathway	Gene count, n	ES	NES	Nominal P-value
Vascular smooth muscle contraction	9	0.5535	1.7480	8.30×10⁻³
Regulation of actin cytoskeleton	8	0.4851	1.6408	1.45×10⁻²
Tyrosine metabolism	3	0.6451	1.6072	1.72×10⁻²
Metabolism of xenobiotics by cytochrome p450	2	0.6845	1.5654	1.80×10⁻²
Leukocyte transendothelial migration	5	0.5237	1.5113	3.04×10⁻²
Tight junction	9	0.4560	1.4869	3.37×10⁻²
Adherens junction	2	0.6969	1.4027	3.78×10⁻²
Cytokine-cytokine receptor interaction	10	0.3808	1.4334	4.29×10⁻²

ES, enrichment score; NES, normalized enrichment score.

Discussion

In the present study, 239 DEGs (189 upregulated and 50 downregulated) were identified between the recurrent and non-recurrent samples in the GSE62254 dataset. From the 114 DEGs that were significantly associated with both RFS and OS, 21 feature genes were further screened. Subsequently, an SVM classifier was built in GSE62254, which could accurately determine the recurrence type of GC samples. Additionally, the optimal set of 10 prognostic genes (AKAP12, ANGPTL1, CYS1, MLLT11, NAV3, NBEA, NOV, PTN, TUSC3 and ZSCAN18) was obtained, followed by the construction of a risk score system. The stratification analysis demonstrated that pathological stage was an independent prognostic clinical factor in the high-risk group. AKAP12A expression decreases colony formation and causes apoptotic cell death; thus, AKAP12A may be a critical mediator of survival in patients with GC (44). AKAP12 is usually inactivated in patients with GC and several other types of cancer, serves a role in regulating cytokinesis progression and functions as a tumor suppressor (45). The expression of ANGPTL2 is associated with GC progression, and the overexpression of ANGPTL2 at both the invasive margin and tumor center is an independent marker of prognosis in patients with GC (46,47). Elevated expression of cytoplasmic ANGPTL2 has been associated with invasion, metastasis and unfavorable survival in patients with GC, and thus ANGPTL2 may be used as a promising indicator for predicting postoperative recurrence of GC (48). Therefore, AKAP12 and ANGPTL1 may be associated with the outcomes of patients with GC. The oncogenic factor MLLT11 is associated with tumor progression and adverse survival, exhibiting pro-tumorigenic activity in patients with ovarian cancer (49). 
Signal transducer and activator of transcription 3 (STAT3) is involved in tumor formation, development, migration and motility, and MLLT11 overexpression promotes pYSTAT3 expression in invasive carcinoma cells by activating the Src kinase (50). Copy number changes of NAV3 are often detected in adenomas and colorectal cancer (CRC), and NAV3 links colon inflammation with CRC development (51). NOV and cysteine-rich protein 61 (CYR61) are upregulated in GC, and elevated CYR61 levels are responsible for an unfavorable outcome (52). Additionally, increased NOV contributes to cell proliferation and invasion in GC (52). These findings indicate that MLLT11, NAV3 and NOV may also be involved in the development and progression of GC. Increased PTN is significantly associated with poor OS time and RFS time of patients with GC, and may serve as an independent prognostic indicator (53). TUSC3 serves an oncogenic role in CRC, and may affect proliferation, aggression, invasion and metastasis of CRC by mediating the PI3K/Akt, p38 mitogen-activated protein kinase and Wnt/β-catenin signaling pathways (54). Decreased levels of TUSC3 contribute to cell proliferation, invasion and metastasis in pancreatic cancer (PC), which predicts unfavorable outcomes in patients with PC (55,56). Gene expression and promoter methylation of ZSCAN18, cysteine dioxygenase 1 and zinc-finger protein 331 are negatively associated, and these genes have epigenetic similarity and may be potential biomarkers of gastrointestinal cancer (57). Therefore, PTN, TUSC3 and ZSCAN18 may be implicated in the pathogenesis of GC. In order to unveil possible biological functions of the 10 prognostic genes in GC, the present study screened the DEGs between the two risk groups, classified by the 10-gene risk score. 
Pathway enrichment analysis revealed that the resulting DEGs were significantly enriched in several pathways, including ‘vascular smooth muscle contraction’, ‘regulation of actin cytoskeleton’ and ‘tyrosine metabolism’. The ‘vascular smooth muscle contraction’ and ‘regulation of actin cytoskeleton’ pathways serve critical roles in cancer cell migration and invasion (58,59). Tyrosine phosphorylation enhances the Warburg effect and promotes tumor growth (60). Therefore, it can be inferred that the 10 prognostic genes may affect GC prognosis by modulating cancer migration and growth. The present study was a secondary analysis based on 282 samples with recurrence information in the GSE62254 dataset. A study by Cristescu et al (29) used GSE62254 to investigate the molecular alterations in four subtypes of GC using targeted sequencing and genome-wide copy number microarrays. Wang et al (26) determined a six-gene signature (RNA binding protein, mRNA processing factor 2; Hes related family BHLH transcription factor with YRPW motif like; nestin; thiopurine S-methyltransferase; SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily D, member 3; and family with sequence similarity 127, member A), based on GSE62254, as a prognostic biomarker in patients with GC. The six survival-associated genes were selected using a robust likelihood-based survival model from the prognosis-associated genes identified by univariate survival analysis (26). By contrast, the present study focused on recurrence-associated DEGs to identify prognostic genes and acquired a prognostic 10-gene signature. The application of different analysis methods, analysis processes and screening thresholds is another underlying factor of the different results obtained by the two studies. Although complex bioinformatics analyses were conducted for the gene expression profile of GC, the limitations of the present study should not be neglected. 
The primary limitation of the present study was the lack of experiments. In subsequent studies, experiments such as quantitative PCR and western blotting should be performed to validate the findings of the present study. In conclusion, 239 DEGs were identified between the recurrent and non-recurrent samples of GSE62254. Furthermore, the SVM classifier may be applied for distinguishing recurrent from non-recurrent patients with GC. Additionally, the risk score system involving 10 optimal genes may be used for predicting the prognosis of patients with GC.
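As an illustration of how a gene-based risk score of this kind is typically applied, the following minimal Python sketch computes a score as a coefficient-weighted sum of expression values and then splits samples into high- and low-risk groups. It rests on several assumptions that are not taken from the study: the linear form of the score, the median cutoff, the coefficient values and the expression values are all hypothetical placeholders, and only three of the 10 genes (PTN, TUSC3 and NOV) are used for brevity.

```python
from statistics import median

# HYPOTHETICAL coefficients for three of the signature genes.
# These are placeholders for illustration, not values reported by the study.
HYPOTHETICAL_COEFS = {"PTN": 0.5, "TUSC3": -0.25, "NOV": 0.25}

# Toy expression values (made up) for four illustrative samples.
toy_expression = {
    "S1": {"PTN": 2.0, "TUSC3": 4.0, "NOV": 2.0},
    "S2": {"PTN": 6.0, "TUSC3": 2.0, "NOV": 4.0},
    "S3": {"PTN": 4.0, "TUSC3": 4.0, "NOV": 0.0},
    "S4": {"PTN": 8.0, "TUSC3": 0.0, "NOV": 4.0},
}


def risk_score(expression, coefs):
    """Return the coefficient-weighted sum of expression over the signature genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def split_by_median(scores):
    """Label each sample 'high' if its score exceeds the median score, else 'low'.

    The median cutoff is an assumption made for this sketch.
    """
    cutoff = median(scores.values())
    return {sample: ("high" if score > cutoff else "low")
            for sample, score in scores.items()}


scores = {sample: risk_score(expr, HYPOTHETICAL_COEFS)
          for sample, expr in toy_expression.items()}
groups = split_by_median(scores)

for sample in sorted(scores):
    print(sample, scores[sample], groups[sample])
```

In practice, the two groups produced this way would then be compared, for example by survival analysis or by screening DEGs between them as described above.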

52 in total

1. Microvessel count predicts metastasis and prognosis in patients with gastric cancer.

Authors: M Araya; M Terashima; A Takagane; K Abe; S Nishizuka; H Yonezawa; T Irinoda; T Nakaya; K Saito
Journal: J Surg Oncol Date: 1997-08 Impact factor: 3.454

2. Increased expression of pleiotrophin is a prognostic marker for patients with gastric cancer.

Authors: Hanqing Hu; Chaxiang Li; Shouwang Cai; Chengyu Zhu; Yufeng Tian; Jun Zheng; Jun Hu; Cui Chen; Wei Liu
Journal: Hepatogastroenterology Date: 2014 Jul-Aug

Review 3. TUSC3: functional duality of a cancer gene.

Authors: Kateřina Vašíčková; Peter Horak; Petr Vaňhara
Journal: Cell Mol Life Sci Date: 2017-09-19 Impact factor: 9.261

4. Tyrosine phosphorylation of mitochondrial pyruvate dehydrogenase kinase 1 is important for cancer metabolism.

Authors: Taro Hitosugi; Jun Fan; Tae-Wook Chung; Katherine Lythgoe; Xu Wang; Jianxin Xie; Qingyuan Ge; Ting-Lei Gu; Roberto D Polakiewicz; Johannes L Roesel; Georgia Z Chen; Titus J Boggon; Sagar Lonial; Haian Fu; Fadlo R Khuri; Sumin Kang; Jing Chen
Journal: Mol Cell Date: 2011-12-23 Impact factor: 17.970

5. Regulation of the actin cytoskeleton in cancer cell migration and invasion.

Authors: Hideki Yamaguchi; John Condeelis
Journal: Biochim Biophys Acta Date: 2006-07-14

6. Angiopoietin-like Protein 2 as a Predictor of Early Recurrence in Patients After Curative Surgery for Gastric Cancer.

Authors: Tadanobu Shimura; Yuji Toiyama; Koji Tanaka; Susumu Saigusa; Takahito Kitajima; Satoru Kondo; Masato Okigami; Hiromi Yasuda; Masaki Ohi; Toshimitsu Araki; Yasuhiro Inoue; Keiichi Uchida; Yasuhiko Mohri; Masato Kusunoki
Journal: Anticancer Res Date: 2015-09 Impact factor: 2.480

7. A-kinase anchoring protein 12 regulates the completion of cytokinesis.

Authors: Moon-Chang Choi; Yang-Ui Lee; Sung-Hak Kim; Jung-Hyun Park; Hyun-Ah Kim; Do-Youn Oh; Seock-Ah Im; Tae-You Kim; Hyun-Soon Jong; Yung-Jue Bang
Journal: Biochem Biophys Res Commun Date: 2008-06-11 Impact factor: 3.575

8. Identification and validation of a prognostic 9-genes expression signature for gastric cancer.

Authors: Zhiqiang Wang; Gongxing Chen; Qilong Wang; Wei Lu; Meidong Xu
Journal: Oncotarget Date: 2017-05-10

9. Clinicopathological characteristics and predictive markers of early gastric cancer with recurrence.

Authors: Jeong Won Kim; Ilseon Hwang; Mi-Jung Kim; Se Jin Jang
Journal: J Korean Med Sci Date: 2009-11-09 Impact factor: 2.153

10. A gene expression-based risk model reveals prognosis of gastric cancer.

Authors: Xiaorong Deng; Qun Xiao; Feng Liu; Cihua Zheng
Journal: PeerJ Date: 2018-01-02 Impact factor: 2.984

7 in total

1. Anti-inflammatory effects of extracellular vesicles from Morchella on LPS-stimulated RAW264.7 cells via the ROS-mediated p38 MAPK signaling pathway.

Authors: Qi Chen; Chengchuan Che; Shanshan Yang; Pingping Ding; Meiru Si; Ge Yang
Journal: Mol Cell Biochem Date: 2022-07-07 Impact factor: 3.396

2. Classifiers for Predicting Coronary Artery Disease Based on Gene Expression Profiles in Peripheral Blood Mononuclear Cells.

Authors: Jie Liu; Xiaodong Wang; Junhua Lin; Shaohua Li; Guoxiong Deng; Jinru Wei
Journal: Int J Gen Med Date: 2021-09-15

3. Exploring TCGA database for identification of potential prognostic genes in stomach adenocarcinoma.

Authors: Lin Zhou; Wei Huang; He-Fen Yu; Ya-Juan Feng; Xu Teng
Journal: Cancer Cell Int Date: 2020-06-23 Impact factor: 5.722

4. Identification of hub genes and their correlation with immune infiltration in coronary artery disease through bioinformatics and machine learning methods.

Authors: Ke-Ke Huang; Hui-Lei Zheng; Shuo Li; Zhi-Yu Zeng
Journal: J Thorac Dis Date: 2022-07 Impact factor: 3.005

5. Discovering Common miRNA Signatures Underlying Female-Specific Cancers via a Machine Learning Approach Driven by the Cancer Hallmark ERBB.

Authors: Katia Pane; Mario Zanfardino; Anna Maria Grimaldi; Gustavo Baldassarre; Marco Salvatore; Mariarosaria Incoronato; Monica Franzese
Journal: Biomedicines Date: 2022-06-02

6. A 9‑gene expression signature to predict stage development in resectable stomach adenocarcinoma.

Authors: Zining Liu; Hua Liu; Yinkui Wang; Ziyu Li
Journal: BMC Gastroenterol Date: 2022-10-14 Impact factor: 2.847

7. Signature Panel of 11 Methylated mRNAs and 3 Methylated lncRNAs for Prediction of Recurrence-Free Survival in Prostate Cancer Patients.

Authors: Jiarong Cai; Fei Yang; Xuelian Chen; He Huang; Bin Miao
Journal: Pharmgenomics Pers Med Date: 2021-07-12
