You-Tao Zhou1, Zi-Kai Lin1, Man-Jie Shi1, Cui-Yan Yang2, Jia-Liang Wei3, Chuan-Feng Ke2. 1. The First Clinical Medical School, Guangzhou Medical University, Guangzhou, China. 2. Department of Gastrointestinal Surgery, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China. 3. Department of General Surgery, the Fifth Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
Abstract
BACKGROUND: It is now recognized that the symptoms of colon cancer differ according to whether the tumor is located on the left or right side of the patient. The results of the present study point to the differences in the tissue and embryonic origins of left- and right-sided colon cancer that cause the variations in molecular typing. The research purpose of this study is to establish a core differential gene scoring model and proved its effect. METHODS: We downloaded transcriptome data and clinical information from The Cancer Genome Atlas (TCGA). A total of 243 patients in stages II and III were grouped according to the colon cancer site. Then we screened for differential transcriptome products. The corresponding differential gene were performing a corresponding protein interaction analysis. We used 12 algorithms in Cytoscape to calculate the hub genes and a total of 37 hub genes were obtained finally. We extracted the first principal component value (PC1) of the hub genes to evaluate the effectiveness of screening. Cox regression analysis was performed for the differential genes. Finally, we performed a prognostic analysis on right-sided colon cancer patients using the BST2 gene, PC1 and relevant clinical information. RESULTS: After screening for differentially expressed genes, 37 hub genes were obtained with appropriate algorithms. PC1 showed differences in hub genes between left- and right-sided colon cancer patients. BST2 and 31 other genes were identified as significant by Cox regression analysis and were significantly mutated in patients with right-sided colon cancer. Finally, we selected the BST2 gene and relevant clinical information as the prognostic factors to build a scoring model. The prediction effect of the model was satisfied. CONCLUSIONS: We constructed a prognostic model based on BST2, PC1, and other relevant clinical information and proved its good effect. 2021 Annals of Translational Medicine. All rights reserved.
BACKGROUND: It is now recognized that the symptoms of colon cancer differ according to whether the tumor is located on the left or right side of the patient. The results of the present study point to the differences in the tissue and embryonic origins of left- and right-sided colon cancer that cause the variations in molecular typing. The research purpose of this study is to establish a core differential gene scoring model and proved its effect. METHODS: We downloaded transcriptome data and clinical information from The Cancer Genome Atlas (TCGA). A total of 243 patients in stages II and III were grouped according to the colon cancer site. Then we screened for differential transcriptome products. The corresponding differential gene were performing a corresponding protein interaction analysis. We used 12 algorithms in Cytoscape to calculate the hub genes and a total of 37 hub genes were obtained finally. We extracted the first principal component value (PC1) of the hub genes to evaluate the effectiveness of screening. Cox regression analysis was performed for the differential genes. Finally, we performed a prognostic analysis on right-sided colon cancer patients using the BST2 gene, PC1 and relevant clinical information. RESULTS: After screening for differentially expressed genes, 37 hub genes were obtained with appropriate algorithms. PC1 showed differences in hub genes between left- and right-sided colon cancer patients. BST2 and 31 other genes were identified as significant by Cox regression analysis and were significantly mutated in patients with right-sided colon cancer. Finally, we selected the BST2 gene and relevant clinical information as the prognostic factors to build a scoring model. The prediction effect of the model was satisfied. CONCLUSIONS: We constructed a prognostic model based on BST2, PC1, and other relevant clinical information and proved its good effect. 2021 Annals of Translational Medicine. All rights reserved.
Entities:
Keywords:
Hub; correlation; principal component analysis (PCA); regression; survival
Colon cancer is the most common intestinal cancer and is now the third most common cancer globally. There are histological, embryological, and clinical differences in left- and right-sided colon cancer. Despite the variety of treatments available for colon cancer, the prognosis is poor when colon cancer metastasizes to the lymph nodes and distant organs (1). Compared with patients with left-sided colon cancer, right-sided colon cancer patients have been shown to have increased lymphovascular invasion and a more advanced tumor stage (2). There is growing consensus that right- and left-sided colon cancer have different molecular features (3). A growing number of differential genes have been discovered with the development of technology and extensive previous research, such as the KRAS, BRAF, and BRAC1 mutations. The molecular difference between left-sided and right-sided colon cancer patients is mainly caused by different tissue and embryo development. However, finding disease-related genes using methods such as sequencing alone is too time-consuming and costly. With the development of computer algorithms, machine learning occupies an important position in cancer prediction and prognosis (4). Wang et al. constructed a hub gene prognosis model by multiplying regression coefficient by expression value based on the combined analysis of transcriptome and methylation data (5). If an algorithm can be constructed to effectively identify the hub genes that differentiate left-sided from right-sided colon cancer, it will be of great significance in predicting the prognosis of these patients. In addition, the identified hub genes are likely to be biomarkers of the disease, and this information will be useful for subsequent experimental verification.Analysis of colon cancer patients based on transcriptome data is currently a feasible approach. Liang et al. constructed a model which predicted the prognosis of right- and left-sided colon cancer patients using The Cancer Genome Atlas (TCGA) public database (6). The molecular differences between left- and right-sided colon cancer patients are essential differences in the expression of key genes. If genetic risk scoring can be used, it will significantly benefit clinical treatment and prognosis. For example, MMC, DMNC, and other algorithms can perform screening for transcriptome data. In conclusion, the use of relevant genetic and clinical information and the construction of a prognosis model can play a crucial role. High-throughput technology and bioinformatics make it possible to build this model (7).To study the hub genes of left- and right-sided colon cancer patients and construct a prognostic scoring model, we used colon cancer patients’ data from the TCGA. Data from 243 colon cancer patients in stages II and III were selected to perform the differential gene analysis. By using MMC, DMNC, and other algorithms, we obtained several hub genes. A principal component analysis (PCA) was used to verify the screening validity. After Cox regression, the BST2 gene and associated clinical information were used to construct a prognostic model. Finally, we evaluated the efficacy of our model in predicting survival. Our findings may provide a valuable contribution to the prognosis and treatment of left- and right-sided colon cancer patients.We present the following article in accordance with the TRIPOD reporting checklist (available at https://dx.doi.org/10.21037/atm-21-6163).
Methods
Data collection and preprocessing
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). We downloaded the transcriptome data and clinical information from 523 patients from the TCGA public database on September 3rd. Of these, 453 people had both transcriptome information and clinical information. Next, 243 patients in stages II and III were divided into two groups by the International Oncology Code. The right-sided colon cancer (RCC) group comprised 154 patients with codes C18.0, C18.2, C18.3, and C18.4. The left-sided colon cancer (LCC) group included 89 patients with codes C18.5, C18.6, and C18.7. The remaining patients were excluded. According to an American epidemiological study, the ratio of left-sided and right-sided colon cancer patients is 1:2 (8). Our groupings were approximately equal to this ratio. Differential gene analysis was complete-case analysis. However, we used multiple imputation to supplement missing value in predictive variables with “mice” R software package.
Statistical analysis
All statistical analyses were performed using R software (version 4.0.3). We used the “limma” R software package to complete the differential analysis (9). The results of the differential screening were displayed in a volcano plot using the “ggplot2” package in R software (10). The preliminary results of the principal components were visualized by the “factoMineR” and “factoextra” R packages (11,12). The correlation analysis and visualization between the hub genes were completed by the “cor” R software function and “ggcorrplot” R software package (13). Univariate and multivariate Cox regressions were conducted by using the “survival” package in R. The overall survival in BST2 low group and high group were compared using the Kaplan-Meier method with a log rank-test. Receiver operating characteristic (ROC) curve analysis and the subsequent calculation of the area under the curve (AUC) were performed using the “pROC” and “survivalROC” packages in R. A P value of less than 0.05 was considered to be statistically significant.
Differential gene analysis and annotation
Differential transcription products with a fold change (FC) >1.5 and a P value <0.05 were considered to be differentially expressed products. We chose FC >1.5 to avoid missing key genes and made the screening more comprehensive. All the products corresponded to the genes to which they belonged. All differential genes were uploaded to the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to perform the functional annotations (14). The results from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment and the Gene Ontology (GO) analysis in DAVID were downloaded (15). After that, we used a bubble chart and bar chart with the “ggplot2” package and Excel software to visualize the KEGG and GO results. Subsequently, the hub genes were imported into the Funrich software program to perform the biological pathway and process analyses (16). The results of the analyses can be obtained directly from Funrich.
Differential protein analysis and network construction
After screening, the transcriptome product identification corresponded to the gene name. The identified genes were uploaded to the STRING database for the protein interaction analysis (17). Data from the protein node interactions were imported into Cytoscape software for calculation (https://cytoscape.org/what_is_cytoscape.html). According to 12 algorithms in cytoHubba which was a Cytoscape plugin, including MMC and DMNC, the first 30 genes of each algorithm were derived (18). Ultimately, 37 hub genes were obtained by de-duplicating. We constructed the protein-protein interaction (PPI) in the STRING database. In addition, we also constructed PPI in Cytoscape according to the nodes and the up-down relationships.
Relevant information analysis and model construction
We performed PCA for 37 hub genes with the “psych” R package (19). The proportion of variance of the first principal component reached 22%. Subsequently, we used the “coxph” function to perform the Cox regression analysis of differential genes, with genes having a P value <0.2 selected for the model building. Genes that reached significance in the Cox regression analysis were uploaded to the cBioPortal database for the mutation and survival analyses (20). We finally selected the BST2 gene as a predictive variable. The predictive variables included BST2, PC1, cancer stage, preoperative pretreatment carcinoembryonic antigen (CEA) levels, lymph node count, and KRAS gene mutation analysis results. We also explored the correlation between BST2 and the relevant clinical information of patients. Several boxplots were selected to show the correlations using the “ggboxlplot” R package. Finally, the probability of survival over several years was predicted, and the efficacy of the prediction was evaluated by a receiver operating characteristic (ROC) curve analysis (21).
Results
Screening of differential gene
The results of the differential gene screening are shown in the volcano map. Blue and red represent downregulation and upregulation of different genes, respectively (). We found 167 upregulated genes and 115 downregulated genes. However, only 251 genes met the criteria of P<0.05 and FC >1.5. Of these 251 genes, 90 genes were downregulated, and 161 genes were upregulated.
Figure 1
Differentially expressed genes in left- and right-sided colon cancer patients based on TCGA database. TCGA, The Cancer Genome Atlas.
Differentially expressed genes in left- and right-sided colon cancer patients based on TCGA database. TCGA, The Cancer Genome Atlas.
Hub gene PPI construction and differential gene related pathway analysis
Differential genes were screened by a variety of algorithms and regressions. The 37 key genes we selected represented most genes. After we uploaded 37 genes in the STRING database, we plotted the corresponding PPI (). The PPI showed us that MUC1 and its surrounding genes are more likely to be central. According to the text, the purple line shows the relationship that has been proved experimentally, and the green line shows the relationship that STRING mines. We also plotted PPI in Cytoscape according to node and regulation (). The larger the node, the greater the number of interactions. Green labels represent upregulation, and red labels represent downregulation (patients with left colon cancer were used as controls).
Figure 2
After the hub genes were screened by the STRING database and Cytoscape software, 37 hub genes were identified. The PPI is a vivid indicator of their relationship. (A) The PPI of 37 genes corresponding to proteins mapped in the STRING database; (B) the PPI plotted in Cytoscape based on interactions and the up-down relationships. PPI, protein-protein interaction.
After the hub genes were screened by the STRING database and Cytoscape software, 37 hub genes were identified. The PPI is a vivid indicator of their relationship. (A) The PPI of 37 genes corresponding to proteins mapped in the STRING database; (B) the PPI plotted in Cytoscape based on interactions and the up-down relationships. PPI, protein-protein interaction.Bar charts were used to show the biological processes in the GO analysis. The analysis revealed that differences in genes were primarily located in immune response, cell-cell signaling, and O-glycan processing (). After we translated the results into a table, we presented the partial results of the KEGG pathway analysis using bubble diagrams. We still found marked differences in the genes involved in the metabolic and chemokine signaling pathways (). This proves that the gene differences we were looking for in left- and right-sided colon cancer were confirmed.
Figure 3
All differential genes were uploaded to DAVID to perform the KEGG pathway enrichment and GO analysis. (A) Part of the biological process for the GO analysis of all differential genes; (B) results of the KEGG enrichment of all differential genes by DAVID. GO, gene ontology; DAVID, Database for Annotation, Visualization, and Integrated Discovery; KEGG, Kyoto Encyclopedia of Genes and Genomes.
All differential genes were uploaded to DAVID to perform the KEGG pathway enrichment and GO analysis. (A) Part of the biological process for the GO analysis of all differential genes; (B) results of the KEGG enrichment of all differential genes by DAVID. GO, gene ontology; DAVID, Database for Annotation, Visualization, and Integrated Discovery; KEGG, Kyoto Encyclopedia of Genes and Genomes.
PC1 extraction of hub gene
We used a principal component scatter plot to reflect the relationship between the observed values and the principal components (). Most points were distributed around the center, and most of the observed values had a strong relationship with the first principal component. The redder the color, the more valuable it was. Moreover, we also explored the 37 hub genes’ contribution to the first and second principal components (). It can be seen that the first ten genes up to NR1I2 made significant contributions to the first and second principal components. We investigated the relationship between the PC1 and the expression values of the first ten hub genes (). We also explored whether PC1 could predict left- and right-sided colon cancer, and the area under the ROC curve (AUC) showed a good result ().
Figure 4
We used the first principal component values of the PCA to verify the screening accuracy. (A) The principal component scatter plot of observed values; (B) the bar diagram of the 37 hub genes contributing to the first and second principal components; (C) the box plot of the first principal component values and expression values of the top ten hub genes; (D) the ROC curve of the predictive effect of the first principal component values on left- and right-sided colon cancer. PCA, principal component analysis; ROC, receiver operating characteristic.
We used the first principal component values of the PCA to verify the screening accuracy. (A) The principal component scatter plot of observed values; (B) the bar diagram of the 37 hub genes contributing to the first and second principal components; (C) the box plot of the first principal component values and expression values of the top ten hub genes; (D) the ROC curve of the predictive effect of the first principal component values on left- and right-sided colon cancer. PCA, principal component analysis; ROC, receiver operating characteristic.
Gene enrichment and correlation analysis of hub gene
To explore whether the 37 hub genes had representative significance, we entered them into the Funrich software for analysis. The biological pathway showed that all 37 hub genes were significantly correlated with CXCR3-mediated signaling events and the chemokine receptors bind chemokines (). This suggested that relevant significant biological pathway may be related to the recognition between cells. The biological process still showed an immune response (). However, in the protein domains, SCY and signal peptide differed between the hub gene set and the differential gene set (). Therefore, the 37 hub genes represented a large proportion of the differential genes. We also created heat maps of the relationships between the 37 genes (), showing positive and negative correlations in different colors. For instance, FCGR3A and CXCL9 showed a high positive correlation.
Figure 5
The biological pathway analysis of 37 hub genes using Funrich software, CXCR3-mediated signaling events and the chemokine receptors bind chemokines had significant statistical significance (A), the biological process of the GO analysis using Funrich software, immune response played an important role (B), the protein domain for the differential gene set and hub gene set, SCY and signal peptide were different between the hub gene set and the differential gene set (C), and the correlation heat maps of the first ten hub genes in the PCA, correlation coefficient close to 1 or −1 would have a stronger correlation (D).
The biological pathway analysis of 37 hub genes using Funrich software, CXCR3-mediated signaling events and the chemokine receptors bind chemokines had significant statistical significance (A), the biological process of the GO analysis using Funrich software, immune response played an important role (B), the protein domain for the differential gene set and hub gene set, SCY and signal peptide were different between the hub gene set and the differential gene set (C), and the correlation heat maps of the first ten hub genes in the PCA, correlation coefficient close to 1 or −1 would have a stronger correlation (D).
Relationship between BST2, PC1, and relevant clinical information
We explored the relationship between BST2, PC1, and relevant clinical information with the use of boxplots. The principal component value and BST2 expression value of the right-sided group were higher than those of the left-sided group (). Additionally, the high BST2 group had a higher PC1 value than the low group (). People with high BST2 expression levels were relatively older (). It is worth noting that the BST2 expression value in people without sigmoid colon cancer was significantly higher than those with sigmoid colon cancer ().
Figure 6
Exploration of the relationship between PC1, BST2, and relevant clinical information. We took the logarithm of the BST2 gene expression value and explored the results using a boxplot. (A,B) The relationship between the PC1, BST2, and the two groups; (C,D) the relationship between PC1, age, and the high and low BST2 groups; (E) the relationship between sigmoid colon cancer and BST2.
Exploration of the relationship between PC1, BST2, and relevant clinical information. We took the logarithm of the BST2 gene expression value and explored the results using a boxplot. (A,B) The relationship between the PC1, BST2, and the two groups; (C,D) the relationship between PC1, age, and the high and low BST2 groups; (E) the relationship between sigmoid colon cancer and BST2.
Construction of prognostic model
In the significant gene group identified by Cox regression, the mutation occurred mainly in patients with right-sided colon cancer (). Therefore, we decided that a prediction of prognosis was more appropriate in patients with right-sided colon cancer. We also plotted survival curves for the first three years between the two groups (). The survival of the Cox significant group was significantly worse than that of the non-significant group. To test whether our model effectively predicted right-sided patient survival, we used an ROC curve for verification. The ROC curve showed good predictions at 1 and 3 years. The AUC at 1 and 3 years was 0.76 and 0.765, respectively (). In general, the prediction results were satisfactory.
Figure 7
The cBioPortal database results of the mutation analysis of the significant genes identified by Cox regression. In the cox significant group, C18.0, C18.2, C18.3, C18.4 right-sided colon cancer patients were obviously more than C18.5, C18.6, C18.7 left-sided colon cancer patients (A), survival curves of patients with significant and non-significant genes according to the Cox regression analysis. The blue curve represented cox significant group and the red curve represented cox non-significant group (B), and ROC curves were also used to assess the predictive power of BST2, PC1, and other variables for right-sided patient survival years. AUC was 0.76 and 0.765, respectively (C,D). ROC, receiver operating characteristic; AUC, area under the curve.
The cBioPortal database results of the mutation analysis of the significant genes identified by Cox regression. In the cox significant group, C18.0, C18.2, C18.3, C18.4 right-sided colon cancer patients were obviously more than C18.5, C18.6, C18.7 left-sided colon cancer patients (A), survival curves of patients with significant and non-significant genes according to the Cox regression analysis. The blue curve represented cox significant group and the red curve represented cox non-significant group (B), and ROC curves were also used to assess the predictive power of BST2, PC1, and other variables for right-sided patient survival years. AUC was 0.76 and 0.765, respectively (C,D). ROC, receiver operating characteristic; AUC, area under the curve.
Discussion
There is now widespread consensus that molecular differences between left- and right-sided colon cancer are the more likely cause of differences in the disease. The mechanisms of difference in left- and right-sided colon cancer depends on their gene pathways. Genes associated with poor prognosis are significantly higher in right-sided colorectal cancer (22). Mukund et al. explored differences in molecular mechanisms between left- and right-sided colon cancer at the methylation and transcription levels (23). Their study provided a good explanation for the molecular differences that cause right- and left-sided colon cancer, but their results were limited and failed to translate to clinical application in diagnosis or prognosis. Chang et al. established a recurrence score (RS) to evaluate the relationship between differential genes and age. They concluded that differential gene expression did not differ by age or stage (II and III) (24). Their results provide strong evidence that molecular differences play an important role. However, our model did show better predictive performance with the addition of stage. After screening for differentially expressed genes in left- and right-sided colon cancer patients, we did not continue to use regression screening as other researchers have. Instead, the protein interaction data of the gene expressions were screened twice by 12 algorithms in Cytoscape to greatly improve our chances of identifying the hub genes. In constructing models, many researchers directly use Cox or LASSO regression analyses (25).We borrowed the methods of other researchers who used principal components to intuitively show the characteristics of the high- and low-risk groups (26). We also used Cox regression to screen the genes that affected the survival of patients. Finally, after 37 hub genes were obtained through 12 algorithms, the first principal component value was used as one of the variables of our model for left- and right-sided colon cancer patients. Each patient received an independent value, which avoided the problem of establishing training sets and validation sets to verify the validity of the value. Our concept was similar to Xu, Xu, and Yin’s idea of quantifying immune cell infiltration (ICI) patterns in individual tumors by using PCA to obtain ICI scores in the colon cancer tumor microenvironment (27). We finally selected the BST2 gene as another predictor because it was both a differential gene and a Cox regression significant gene. Left- and right-sided colon cancer continue to lack unique biomarkers at present, so discovering new biomarkers is undoubtedly helpful in targeting therapies (28). The screening approach we used has the potential to help other researchers find biomarkers that are unique to the treatment of left- and right-sided colon cancer. Pathway analysis revealed that our hub genes were significantly associated with immunity and relevant metabolic pathways. We also found that genes identified as significant in the Cox regression were mainly mutated in right-sided colon cancer. As such, we considered the model was better for predicting right-sided colon cancer, although we used it for both sides. Previous studies have shown that the BST2 gene shows hypomethylation in colon cancer tissues relative to adjacent normal tissues (29). Additionally, relevant clinical information was associated with BST2 and PC1 levels. We used BST2, PC1, pretreatment CEA levels, lymph node counts, and KRAS gene mutation results to build a predictive survival model in right-sided colon cancer patients.In our prediction, patients with left-sided colon cancer did better than those with right-sided colon cancer, consistent with previous studies (30). Our Cox regression analysis showed that most of the significant genes were enriched in right-sided colon cancer compared with left-sided colon cancer (31). This may be related to lower survival rates in patients with right- rather than left-sided colon cancer. Whether the disparate survival rates were related to these Cox significant genes needs further study. We were surprised to find a lower BST2 value in sigmoid colon cancer, which may be a specific question worth exploring in the future. As a differential gene and a Cox regression significant gene, BST2 is likely to be a biomarker of left- and right-sided colon cancer, including sigmoid colon cancer.
Conclusions
In conclusion, we screened for differentially expressed genes in stage II and III left- and right-sided colon cancer patients and constructed a model of differential genes using algorithms and other statistical methods. We also explored the relationship between relevant clinical information and predictive variables and demonstrated the predictive effect of our model. Our study also had limitations, which will be addressed in our future study. We encourage other researchers to further validate the correlation between the model and classical pathways or important molecules. Subsequent experiments should be carried out to verify that BST2 can be used as a biomarker. The reliability of our model will also require prospective cohort or case-control evaluation.The article’s supplementary files as
Authors: Glynn Dennis; Brad T Sherman; Douglas A Hosack; Jun Yang; Wei Gao; H Clifford Lane; Richard A Lempicki Journal: Genome Biol Date: 2003-04-03 Impact factor: 13.583
Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker Journal: Genome Res Date: 2003-11 Impact factor: 9.043
Authors: Rebecca L Siegel; Kimberly D Miller; Ann Goding Sauer; Stacey A Fedewa; Lynn F Butterly; Joseph C Anderson; Andrea Cercek; Robert A Smith; Ahmedin Jemal Journal: CA Cancer J Clin Date: 2020-03-05 Impact factor: 508.702