Hui Jiang1, Jun Du1, Jiming Gu1, Liugen Jin1, Yong Pu2, Bojian Fei1. 1. Department of Gastrointestinal Surgery, Affiliated Hospital of Jiangnan University, Wuxi, Jiangsu 214062, P.R. China. 2. Department of Pathology, Affiliated Hospital of Jiangnan University, Wuxi, Jiangsu 214062, P.R. China.
Abstract
The aim of the present study was to examine the molecular factors associated with the prognosis of colon cancer. Gene expression datasets were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus databases to screen differentially expressed genes (DEGs) between colon cancer samples and normal samples. Survival‑related genes were selected from the DEGs using the Cox regression method. A co‑expression network of survival‑related genes was then constructed, and functional clusters were extracted from this network. The significantly enriched functions and pathways of the genes in the network were identified. Using Bayesian discriminant analysis, a prognostic prediction system was established to distinguish the positive from negative prognostic samples. The discrimination efficacy of the system was validated in the GSE17538 dataset using Kaplan‑Meier survival analysis. A total of 636 and 1,892 DEGs between the colon cancer samples and normal samples were screened from the TCGA and GSE44861 dataset, respectively. There were 155 survival‑related genes selected. The co‑expression network of survival‑related genes included 138 genes, 534 lines (connections) and five functional clusters, including the signaling pathway, cellular response to cAMP, and immune system process functional clusters. The molecular function, cellular components and biological processes were the significantly enriched functions. The peroxisome proliferator‑activated receptor signaling pathway, Wnt signaling pathway, B cell receptor signaling pathway, and cytokine‑cytokine receptor interactions were the significant pathways. A prognostic prediction system based on a 65‑gene signature was established using this co‑expression network. Its discriminatory effect was validated in the TCGA dataset (P=3.56e‑12) and the GSE17538 dataset (P=1.67e‑6). The 65‑gene signature included kallikrein‑related peptidase 6 (KLK6), collagen type XI α1 (COL11A1), cartilage oligomeric matrix protein, wingless‑type MMTV integration site family member 2 (WNT2) and keratin 6B. In conclusion, a 65‑gene signature was screened in the present study, which showed a prognostic prediction effect in colon adenocarcinoma. KLK6, COL11A1, and WNT2 may be suitable prognostic predictors for colon adenocarcinoma.
The aim of the present study was to examine the molecular factors associated with the prognosis of colon cancer. Gene expression datasets were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus databases to screen differentially expressed genes (DEGs) between colon cancer samples and normal samples. Survival‑related genes were selected from the DEGs using the Cox regression method. A co‑expression network of survival‑related genes was then constructed, and functional clusters were extracted from this network. The significantly enriched functions and pathways of the genes in the network were identified. Using Bayesian discriminant analysis, a prognostic prediction system was established to distinguish the positive from negative prognostic samples. The discrimination efficacy of the system was validated in the GSE17538 dataset using Kaplan‑Meier survival analysis. A total of 636 and 1,892 DEGs between the colon cancer samples and normal samples were screened from the TCGA and GSE44861 dataset, respectively. There were 155 survival‑related genes selected. The co‑expression network of survival‑related genes included 138 genes, 534 lines (connections) and five functional clusters, including the signaling pathway, cellular response to cAMP, and immune system process functional clusters. The molecular function, cellular components and biological processes were the significantly enriched functions. The peroxisome proliferator‑activated receptor signaling pathway, Wnt signaling pathway, B cell receptor signaling pathway, and cytokine‑cytokine receptor interactions were the significant pathways. A prognostic prediction system based on a 65‑gene signature was established using this co‑expression network. Its discriminatory effect was validated in the TCGA dataset (P=3.56e‑12) and the GSE17538 dataset (P=1.67e‑6). The 65‑gene signature included kallikrein‑related peptidase 6 (KLK6), collagen type XI α1 (COL11A1), cartilage oligomeric matrix protein, wingless‑type MMTV integration site family member 2 (WNT2) and keratin 6B. In conclusion, a 65‑gene signature was screened in the present study, which showed a prognostic prediction effect in colon adenocarcinoma. KLK6, COL11A1, and WNT2 may be suitable prognostic predictors for colon adenocarcinoma.
Colon cancer is one of the most common types of cancer worldwide, with a high incidence in the Western world and developed Asian countries (1,2). The global burden of colon cancer is estimated to reach >2,200,000 new cases with 1,100,000 succumbing to mortality by 2030 (3). With a mortality rate of ~608,000 each year, colon cancer is recognized as the fourth most common cancer-associated cause of mortality in the world (4). Colon cancer originates from normal epithelial cells through aberrant crypts and progressive adenoma stages to cancer in situ and then metastasis. The incidence of colon cancer is reported to be closely connected with physical inactivity, smoking, obesity, heavy alcohol use, and high red meat consumption (5).Genetic changes have been reported to be important factors in the occurrence and development of colon cancer. For example, a lack of expression of CDX2 can identify which patients with high-risk stage II colon cancer may benefit from adjuvant chemotherapy (6). The expression of potassium voltage-gated channel subfamily Q member 1 is reported to be a good prognostic biomarker of the recurrence of stage II and III colon cancer (7). Cytochrome b5 reductase 1 predicts a poor prognosis in patients with colon cancer (8). The methylation of axis inhibitor 2 and dickkopf-1, which are Wnt target genes, are robust biomarkers of recurrence in stage II colon cancer (9). Serum microRNA-200c is also a prognostic and metastasis-predictive biomarker in patients with colon cancer (10).The pathological staging of colon cancer fails to accurately predict the prognostic status of patients. For this reason, Marisa et al established a transcriptome-based classification of colon cancer, which improved the current prognosis stratification (11). However, the data collected from clinical records was limited by the acquired samples. The present study attempted to identify the prognosis-associated genes using a comprehensive bioinformatics process. The gene expression datasets downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) database were combined with the corresponding survival status of patients who provided colon samples to construct a prognostic prediction system.
Materials and methods
Microarray data
mRNA expression data of colon adenocarcinoma samples were downloaded from the TCGA database (gdc-portal.nci.nih.gov/) on 12 Dec. 2016, which included 286 tumor samples and 41 normal samples. The dataset was based on the Illumina HiSeq 2000 RNA sequencing platform. The mRNA expression data in the TCGA dataset were annotated using the HUGO Gene Nomenclature Committee database (www.genenames.org/). Those mRNAs with expression levels <5 were removed and the remaining mRNAs were subjected to further analysis.Another microarray dataset, GSE44861, was downloaded from the GEO database (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44861), containing 56 tumor samples and 55 matched normal samples from patients with colon tumors. The Data in the GSE44861 dataset was subject to background correction and normalization using the oligo package (12) in R language.
Differentially expressed genes (DEGs)
The Limma package in R language was used to screen DEGs between tumor samples and normal samples in TCGA and GSE44861 datasets. The false discovery rate (FDR) was calculated using a multi-test package. FDR<0.05 and |log2fold change|>0.585 were the cut-off criteria for DEGs. The common DEGs of the two datasets were subjected to further analysis.
Survival-associated genes
Based on the survival information of the samples in the TCGA dataset (272 tumor samples and 39 samples), survival-associated genes were selected from the above overlapping DEGs using univariate Cox regression analysis of survival (13) with the survival package in R3.1.0 language (bioconductor.org/packages/survivalr/). Log-rank P<0.05 was the threshold for this selection. Kaplan-Meier survival curves of the top six survival-associated genes with the highest P-values were obtained.
Co-expression network of survival-associated genes
The correlation coefficient (r) between the survival-associated genes was calculated using the cor function in R language. The gene pairs with |r|≥0.7 and P<0.05 were selected to construct a co-expression network, which was visualized using Cytoscape 2.8.0 (www.cytoscape.org/) (14).
Functional clusters
Gene ontology (GO) function and Kyoto Encyclopedia of Genes and Genome (KEGG) pathway enrichment were performed for the genes in the co-expression network using Fisher's exact test in the cluster Profiler package in R language (15). Subsequently, to determine how these survival-related genes in the network were regulated in colon cancer, the transcription factors (TFs) significantly correlated with the genes in the network were also searched using Database for Annotation, Visualization and Integrated Discovery (DAVID) software (david.ncifcrf.gov/) (16). The functional clusters in the network were extracted using the GraphWeb tool (biit.cs.ut.ee/graphweb/) (17).
Prognostic prediction system
To establish a prognostic prediction system, the expression data of the 272 samples with survival information in TCGA dataset served as the training datasets. The 272 samples were divided into 'good prognosis' (status alive and survival ≥15 months), and 'bad prognosis' (status deceased or survival <15 months).The gene signature-based prediction system was established and initially used to classify the samples in the TCGA dataset, followed by Kaplan-Meier survival analysis. Subsequently, GSE17538 was downloaded from the GEO database as a validation dataset. This dataset contained 177 colon cancer samples with survival information. The prediction system was tested on the validation set.
Results
DEGs between colon cancer samples and normal samples
A total of 12,370 mRNAs were retrieved from the TCGA dataset. A total of 636 DEGs were screened from the TCGA dataset, and 1,892 DEGs were identified from the GSE44861 dataset. The top 50 DEGs selected from each dataset were subjected to bidirectional hierarchical clustering analysis using the centered Pearson correlation algorithm (19) with the Pheatmap package (20) in R3.1.3 language. The heatmaps of the top 50 DEGs in these two datasets are illustrated in Fig. 1A and B. There were 368 overlapping DEGs from the two datasets.
Figure 1
Heatmaps of the top 50 DEGs. (A) Heatmap of the top 50 DEGs in The Cancer Genome Atlas dataset. (B) Heatmap of the top 50 DEGs in the GSE44861 dataset. Blue bars represent normal samples; yellow bars represent tumor samples. DEGs, differentially expressed genes.
Survival-related genes
A total of 155 genes from these overlapping DEGs were found to be survival-related. Kaplan-Meier survival curves of the top six survival-related genes, including triggering receptor expressed on myeloid cells 1, collagen type XI α1 (COL11A1), C-X-C motif chemokine ligand 3, EYA transcriptional coactivator and phosphatase 2, G protein subunit γ7, and insulin-like growth factor 2 MRNA binding protein 3, are demonstrated in Fig. 2A–F.
Figure 2
Kaplan-Meier survival curves of the top six survival-related genes with highest log-rank P-values. (A) TREM1; (B) COL11A1; (C) CXCL3; (D) EYA2; (E) GNG7; (F) IGF2BP3. P represents the significance in the log-rank test. Red and black curves represent the high risk samples and low risk samples, respectively. The median expression value was used as the cutoff. HR, hazard ratio; TREM1, triggering receptor expressed on myeloid cells 1; COL11A1, collagen type XI α1; CXCL3, C-X-C motif chemokine ligand 3; ETA2, EYA transcriptional coactivator and phosphatase 2; GNG7, G protein subunit 07; IGF2BP3, insulin-like growth factor 2 MRNA binding protein 3.
Co-expression network of survival-related genes
The co-expression network of the survival-related genes comprised 138 genes and 534 lines (connections), as illustrated in Fig. 3. A total of five functional clusters were extracted from the network. Among them, GO function enrichment analysis showed that three functional clusters were significantly closely related to signaling pathway, cellular response to cAMP, and immune system process, separately. The other two functional clusters were unknown.
Figure 3
Co-expression network of survival-related genes. Positive connections between genes are marked in solid lines and negative connections are marked in dashed lines. The functional clusters in the network are shown in different colors with functional annotations.
The genes in the co-network were enriched in 15 functions and five pathways. The significantly enriched functions included molecular function (calcium binding, carbohydrate binding, and cytokine activity), cellular components (plasma membrane, extracellular matrix, proteinaceous extracellular matrix, extracellular region part, and extracellular region) and biological processes (inflammatory response, response to wounding, collagen catabolic process, multicellular organism catabolic process, and multicellular organismal metabolic process). The significantly enriched pathways included the peroxisome proliferator-activated receptor (PPAR) signaling pathway, Wnt signaling pathway, B cell receptor signaling pathway, and cytokine-cytokine receptor interaction. MEIS1BHOXA9, S8, forkhead/winged-helix-box-class-O3 (FOXO3), LIM homeobox 3 (LHX3), CCAAT/enhancer binding protein α (CEBPA), sex determining region Y (SRY), signal transducers and activators of transcription (STAT), basic leucine zipper transcription factor 1 (BACH1) and STAT5B were found to show significant close correlation with the genes in the network. Additionally, COL11A1, wingless-type MMTV integration site family member 2 (WNT2) and kallikrein-related peptidase 6 (KLK6) were three genes involved in these significant terms or TFs.There were 99 'good' prognostic samples and 173 'bad' prognostic samples in the TCGA dataset. Based on these and on the co-expression network, a 65-gene signature-based prognostic prediction system was established. Genes in the signature included KLK6, COL11A1, cartilage oligomeric matrix protein, WNT2 and keratin 6B.The prognostic score distribution of each sample in the TCGA dataset is demonstrated ED in Fig. 4. Scores of the good prognostic samples [−2~0] differed from those of the bad prognostic samples [0~2], indicating the discriminative ability of the prediction system to differentiate good prognostic samples from bad prognostic samples.
Figure 4
Prognostic score distribution of good and poor prognostic samples.
The prediction system was used to divide the patients in the TCGA dataset into the two prognostic groups. The results showed that patients with a good prognosis had significantly longer overall survival rates, compared with those with a bad prognosis (P=3.56e–12; Fig. 5A). The discriminative effect of the 65-gene signature was assessed on the GSE17538 dataset (validation set). As presented in Fig. 5, the good prognostic samples and bad prognostic samples in the GSE17538 dataset had significantly different overall survival rates (P=1.67e–6; Fig. 5B). These results indicated that the prognostic prediction system established by Bayesian discriminant analysis was able to distinguish tumor samples with different survival status.
Figure 5
Kaplan-Meier survival curves of the samples in (A) The Cancer Genome Atlas dataset and (B) GSE17538 dataset. P-values indicate the significance in the log-rank test.
Discussion
A total of 636 and 1,892 DEGs between colon cancer samples and normal samples were screened from the TCGA and GSE44861 dataset, respectively. There were 155 survival-related genes screened. The co-expression network of survival-related genes consisted of 138 genes, 534 lines (connections) and five functional clusters. Three of these clusters were functionally related to signaling pathways, cellular responses to cAMP, and immune system processes. Enrichment analysis for the genes in the co-expression network showed molecular functions, cellular components and biological processes were the significantly enriched functions. The PPAR signaling pathway, Wnt signaling pathway, B cell receptor signaling pathway, and cytokine-cytokine receptor interaction were significant pathways. The TFs MEIS1BHOXA9, S8, FOXO3, LHX3, CEBPA, SRY, STAT, BACH1 and STAT5B showed significant close association with the genes of the co-expression network. COL11A1, WNT2 and KLK6 were three genes significantly closely involved in these significant terms or TFs. A 65-gene signature-based prognostic prediction system was established based on the genes in the co-expression network, which successfully classified patients in the TCGA dataset into those with a good prognosis and bad prognosis, with significantly different prognoses, as evidenced by the results of Kaplan-Meier survival analysis. Its discriminatory effect was confirmed in the GSE17538 dataset. COL11A1, WNT2 and KLK6 were also included in the 65-gene signature.FOXO3, a member of the forkhead family of TFs, is reported to suppress cancer cell proliferation and survival (21). The negative expression of FOXO3 is a risk factor for bad prognosis in bladder cancer (22). FOXO3 is a tumor suppressor gene in colon cancer (23). The downregulated expression of FOXO3 is associated with advanced stage in colon cancer, and the expression of FOXO3 is lower in recurrent stage I/II primary tumors (24).BACH1 is a CNC-bZIP TF involved in heme degradation, redox regulation, cell cycle, apoptotic pathways and subcellular transport processes (25). BACH1 is a regulator of metastasis-associated genesis and it has been reported to be involved in a network affecting the progression of breast cancer metastasis (26). Davudian et al found that the silencing of BACH1 inhibited the migration of HT-29 colon cancer cells through the reduction of metastasis-associated genes (27). The STAT family is involved in various cellular functions, including proliferation, survival and angiogenesis. STAT3 has been suggested to be a potential anticancer agent (28). STAT5 is intricately and completely involved in the oxidative metabolism of cancer cells (29). STAT5 is also involved in G1 cell cycle arrest and the inhibition of tumor cell invasion in humancolon cancer cells (30). The STAT3 to STAT5 expression ratio is reported to be an independent prognostic indicator in colon cancer (31).COL11A1 is a biomarker of metastatic progression and a regulator of the tumor microenvironment (32). COL11A1 promotes tumor progression and predicts poor clinical outcomes in ovarian cancer, with the 5-year overall survival and recurrence-free survival rates being markedly lower in patients with high tissue expression levels of COL11A1, compared with those with low expression levels (33). COL11A1 was found to be expressed in stromal cells of colon cancer samples (34).WNT2, a member of WNT gene family, functions as a regulator for cell specification during cell development. The uperegulation of WNT2 has been observed in colon cancer (35). WNT2 has a tumorigenic effect in gastrointestinal cancer (36,37). Jung et al found that WNT2 enhanced WNT/β-catenin signaling in an autocrine manner, contributing to the tumorigenesis of gastrointestinal cancer (38). It is accepted that ~90% of colon cancer cases originate from constitutive activation of the canonical Wnt signaling pathway (39). The development of colon cancer, and its poor response to radiotherapy and chemotherapy can be exacerbated by apoptosis modulated by the Wnt signaling pathway (40). Inhibition of the Wnt signaling pathway is a promising strategy for targeting chemotherapy-resistant colon cancer cells (41).KLK6 is involved in proterolytic cascades in normal physiology (42). The roles of KLKs in tumorigenic events have been noted previously (43). The increased expression of KLK6 has been shown in various cancer samples, including ovarian cancer, pancreatic cancer, uterine cancer, gastric cancer, colon cancer, skin cancer and urinary bladder cancer (44). Furthermore, the overexpression of KLK6 in colon cancer has been reported to be K-RAS-dependent (45). The expression of KLK6 has also been recognized as a prognostic indicator in several types of cancer, including ovarian cancer (46), pancreatic ductal adenocarcinoma (47) and colon cancer (48).The present study established a 65-gene signature for prognostic prediction in colon adenocarcinoma. KLK6, COL11A1 and WNT2 may be suitable prognostic predictors for colon adenocarcinoma. Verification of these findings in tissue specimens or clinical patients is necessary in further investigations.
Authors: Lina Seiz; Julia Dorn; Matthias Kotzsch; Axel Walch; Nicolai I Grebenchtchikov; Apostolos Gkazepis; Barbara Schmalfeldt; Marion Kiechle; Jane Bayani; Eleftherios P Diamandis; Rupert Langer; Fred C G J Sweep; Manfred Schmitt; Viktor Magdolen Journal: Biol Chem Date: 2012-04 Impact factor: 3.915
Authors: M D Bullock; A Bruce; R Sreekumar; N Curtis; T Cheung; I Reading; J N Primrose; C Ottensmeier; G K Packham; G Thomas; A H Mirnezami Journal: Br J Cancer Date: 2013-07-04 Impact factor: 7.640
Authors: R Kandimalla; J F Linnekamp; S van Hooff; A Castells; X Llor; M Andreu; R Jover; A Goel; J P Medema Journal: Oncogenesis Date: 2017-04-03 Impact factor: 7.485