Literature DB >> 35071405

Identification of key genes associated with esophageal adenocarcinoma based on bioinformatics analysis.

Weifeng Qi¹, Rongyang Li¹, Lin Li¹, Shuhai Li¹, Huiying Zhang¹, Hui Tian¹.

Abstract

BACKGROUND: Esophageal adenocarcinoma (EAC) is an aggressive malignancy and accounts for the majority of cancer-related death worldwide. It is often diagnosed at an advanced stage and entails a poor prognosis for those afflicted. The mechanisms of its pathogenesis and progress remain unclear and require urgent elucidation. This study aimed to identify specific genes and potential pathways associated with the progression and prognosis of EAC using bioinformatics analyses.
METHODS: EAC microarray datasets from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases were analyzed to identify differentially expressed genes (DEGs) using bioinformatics analysis. The DEGs in TCGA were then analyzed to construct a co-expression network by weighted correlation network analysis (WGCNA), and module-clinical trait relationships were analyzed to explore the genes that associated with clinicopathological parameters of EAC. Gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analyses were performed for the cancer-related genes, and a DEG-based protein-protein interaction (PPI) network was used to extract hub genes through Cytoscape plugins. The consensus survival analysis for EAC (OSeac) was performed to identify the prognosis-related genes. The immune infiltration was evaluated by tumor immune estimation resource (TIMER) algorithms, and a risk score prognostic model was established using univariate, multivariate Cox proportional hazards regression, and lasso regression analysis.
RESULTS: Ultimately, 190 cancer-related DEGs were identified, 6 of which were found to play vital roles in the progression of EAC, including ACTA2, BGN, CALD1, COL1A1, COL4A1, and DCN. The risk score prognostic model consisted of 6 other genes that had an important impact on the prognosis of EAC, including CLDN3, EPB41L4A, ESM1, MT1X, PAQR5, and PLAU. The area under the curve of the prognostic model for predicting the survival of patients at 1, 2, and 3 years was 0.707, 0.702, and 0.726, respectively.
CONCLUSIONS: This study identified several genes with the potential to become useful targets for the diagnosis and treatment of EAC. The 6-gene-related risk score prognostic model and nomogram based on these genes may be a reliable tool for predicting the prognosis of patients with EAC. 2021 Annals of Translational Medicine. All rights reserved.

Entities: Chemical

Keywords: Esophageal adenocarcinoma (EAC); bioinformatics analysis; protein-protein interaction (PPI); risk score prognosis model; weighted gene co-expression network analysis (WGCNA)

Year: 2021 PMID： 35071405 PMCID： PMC8743722 DOI： 10.21037/atm-21-4015

Source DB: PubMed Journal: Ann Transl Med ISSN： 2305-5839

Introduction

Esophageal cancer (EC) is an aggressive malignancy and accounts for the majority of cancer-related deaths worldwide (1,2). The disease mainly includes 2 epidemiological and pathologically different subtypes: esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC) (3). ESCC is the most common and the dominant subtype in Asians, while EAC has a higher incidence in Western countries (4). Currently, the 5-year survival rate of EAC is less than 20% unless diagnosed and treated in the early stage (5). Additionally, surgical therapy, which can significantly improve prognosis, is not suitable for patients with advanced-stage cancer (6). Therefore, novel key genes and possible molecular mechanisms associated with the initiation, progression, and prognosis of EAC may provide a more effective approach to the early diagnosis and subsequent clinical decision making for personalized treatment and then improved survival. The occurrence and development of EAC are closely related to alcohol and tobacco addiction, obesity, and gastroesophageal reflux, but the specific carcinogenic mechanism remains unclear (3). Barrett’s esophagus (BE), which involves the specialized small intestinal metaplastic epithelium of the esophagus, is a precursor to EAC (7,8). Previous studies have identified a series of significantly mutated genes in the progress of BE and EAC, such as TP53, CDKN2A, SMAD4, ARID1A, and PIK3CA (9,10). DNA hypermethylation in the promoter regions of genes has also been observed in BE and EAC (8,11). A study conducted by Wu et al. in 2013 identified several progressively altered-expressed microRNAs (miRNAs) of malignant progression in BE and EAC (12), and a recent study using high-throughput sequencing analysis revealed that genes and pathways involved in EAC were associated with DNA replication, cell cycle, and fatty acid degradation signaling pathways (13). Although the research on the genes and molecular mechanisms of EAC has increased in recent years, a comprehensive picture of its genes and regulation is still lacking. With the development of genomic microarrays and high-throughput sequencing technologies, bioinformatics analysis is gradually becoming a prevailing tool for the exploration of cancer-related biomarkers and molecular mechanisms (13,14). Weighted gene co-expression network analysis (WGCNA) is a novel, systematic, bioinformatics method for selecting co-expression modules of related genes and the critical module associated with clinical traits, and may provide a novel means to exploring the potential biomarkers that could be used in the early diagnose and individual treatment of cancer (15). In this study, EAC microarray datasets from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases were analyzed to identify differentially expressed genes (DEGs) of EAC. Protein-protein interaction (PPI) and WGCNA were then applied in the identification of the key genes closely associated with the development of EAC. Moreover, patients’ clinical information and RNA-sequencing of DEGs from TCGA database were used for univariate cox regression analysis, lasso regression analysis, and multivariate cox regression analysis to establish a risk score prognostic model. We present the following article in accordance with the reporting recommendations for tumor MARKer (REMARK) reporting checklist (available at https://dx.doi.org/10.21037/atm-21-4015).

Methods

Data acquisition and preprocessing

The gene expression profiles based on RNA-sequencing and relevant clinical data of EAC patients were downloaded from TCGA (https://www.cancer.gov/tcga) data repository (16). The level of gene expression was measured and standardized by R package “DEseq2” (The R Foundation for Statistical Computing) (17). To increase the robustness of our study, we searched for publicly available studies and samples in the GEO (https://www.ncbi.nlm.nih.gov/geo/) that met the following conditions for analysis: (I) the gene expression data series contained EAC and normal tissue samples; (II) the number of samples in every data series was more than 30; (III) the species was Homo sapiens. Finally, 2 gene expression profiles (GSE13898 and GSE26886) were identified for further analysis (18). The maximum value of expression of the genes was considered as the gene expression level if multiple probes corresponded to the same gene (19). The flow diagram of this study is shown in . The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Figure 1

Flow chart of this study. TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus; DEGs, differentially expressed genes; WGCNA, weighted gene co-expression network analysis; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; PPI, protein-protein interaction.

Identification of DEGs

R package “DEseq2” was used to identify DEGs in TCGA database (17), while R package “limma” was used for GEO datasets (20). Thresholds of |log2FC| >1.0 and an adjusted P value of <0.05 were selected.

Construction of a gene co-expression network (WGCNA)

The “WGCNA” package in R was used to construct a gene co-expression network of DEGs identified from TCGA data, and the analysis was performed according to the package instructions (15). The scale independence and average connectivity analysis of modules was performed by gradient test, and an appropriate power value was selected when the scale independence value was equal to 0.9. A WGCNA algorithm was then used to construct the co-expression network. A network with co-expression weight >2 was considered as candidate network, and genes with high absolute correlations were clustered into the same modules by cutting the clustering tree into branches. Only when the number of genes exceeded 30 was a module defined, and modules with higher correlation were merged (r<0.25). To visualize the results, different colors was assigned to each module.

Identification of cancer-related modules and genes

We used the module eigengene (ME) method to identify modules which were related with the clinical traits of EAC in TCGA data. The ME could represent the gene expression profiles of a module. The module-trait relationships (MTRs) were measured by linking the MEs to the clinical traits, and MTRs were then used to select significant clinical modules for in-depth analysis. Modules with an |MTRs| >0.5 were considered as cancer-related modules. Moreover, we took the intersection of the DEGs in the GSE13898 and GSE26886 datasets and those in the cancer-related modules and defined these as cancer-related genes.

Gene Ontology (GO) and pathway enrichment analysis

GO enrichment analysis identified which GO terms were over or underrepresented within a given set of genes, consisting of molecular function (MF), cellular components (CC), and biological processes (BP) (21); meanwhile, the KEGG database was used to identify functional and metabolic pathways (22). To explore the potential biological themes and pathways of cancer-related genes, we used the “clusterprofiler” package in R to annotate and visualize GO terms and KEGG pathways.

PPI network construction and module analysis

The Search Tool for The Retrieval of Interaction Genes (STRING, Zurich, Switzerland; https://string-db.org/) was used to construct the PPI network (23). Cancer-related genes were mapped to STRING to evaluate the PPI information with a confidence score >0.4 as the cutoff standard, which was then visualized using Cytoscape software (24). The key genes in the PPI were selected using five methods in CytoHubba plug-in, including edge percolated component (EPC), maximal clique centrality (MCC), maximal neighborhood component (MNC), degree (node connect degree) and closeness (node connect closeness). The intersection of top 15 genes identified by five methods was then taken to acquire the key genes in PPI analysis. In addition, Molecular Complex Detection (MCODE) was used to identify the finest clusters of PPI (25).

Survival analysis

The consensus survival analysis for EAC (OSeac) online survival analysis tool (http://bioinfo.henu.edu.cn/EAC/EACList.jsp) was applied to calculate Kaplan-Meier (K-M) survival curves with hazard ratio (HR) and log-rank tests of key genes in the PPI analysis (26).

Immune infiltration analysis

The Tumor Immune Estimation Resource (TIMER; http://timer.cistrome.org/) algorithm is a comprehensive online resource for the systematic analysis of immune infiltrates across various cancer types (27). In this study, we performed TIMER to determine the relationship between key gene expression in EAC and 6 immune infiltrates (B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells).

Construction of a prognostic risk score model

A total of 154 patients with overall survival (OS) data were selected for further survival analysis. The clinical data of the GSE13898 dataset was provided by Professor Wang in Henan University. To give the established prognostic model better generalization ability, we identified the data from TCGA database as a training group (79 samples) and GSE13898 as a test group (75 samples). The training dataset was used to build the prognostic risk score model and validate it using the test dataset. To do this, we first took the intersection of the DEGs in the GSE13898 dataset and the genes in the cancer-related modules of TCGA analysis as candidate genes. Univariate Cox proportional hazards regression analysis was then used to identify key genes significantly associated with prognosis (P<0.05) (28). The collinearity between genes was eliminated through lasso regression analysis (29), and multivariate Cox proportional hazards regression analysis was performed to establish prognostic risk score model (30). The model used risk scores as predictors of prognostic status, with patients categorized into high- or low-risk groups according to the threshold of risk score. K-M survival curves were then plotted to evaluate the prediction effect of the model with log-rank test (P<0.05). The predictive performance of this model at different endpoints (1, 2, or 3 years) was assessed using a time-dependent receiver operating characteristic (ROC) curve (31), and the R packages “survival”, “glmnet”, “survminer”, and “survivalROC” were used in the construction of a prognostic risk score model.

Statistical analysis

R v4.1.1 was used to conduct data preprocessing, DEG screening, WGCNA analysis and functional annotation analysis. CytoHubba and MCODE in Cytoscape v3.7.0 was selected to mine key genes. The details of these bioinformatic analyses have been described in corresponding subsections. The potential diagnostic value of the prognostic risk score model was shown by ROC analysis using R v4.1.1. A P value <0.05 was considered statistically significant.

Results

Characteristics of selected datasets

The gene expression profiles based on RNA-sequencing were obtained from TCGA until April 2021. Studies from the GEO database up to April 2021 were also examined to increase the robustness of the study. Through layers of screening, we identified two datasets (GSE13898 and GSE 26886) in the GEO database that met the inclusion criteria, and the detailed characteristics of the selected datasets are summarized in . The data from TCGA were used to perform WGCNA analysis, and a risk score prognostic model was established using TCGA data and clinical information which was validated with the GSE13898 dataset.

Table 1

Basic information of three datasets

Datasets	Number (tumor)	Number (normal)	Total	Database
TCGA	79	9	88	TCGA
GSE13898*	75	28	103	GEO
GSE26886	21	19	40	GEO

*, clinical information of this dataset was provided by Qiang Wang, a professor from Henan university. TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus.

*, clinical information of this dataset was provided by Qiang Wang, a professor from Henan university. TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus. After the quality of samples in each group was measured, there were 75 EAC samples and 28 normal samples in GSE13898, and 21 EAC samples and 19 normal samples in GSE26886. We then identified 807 and 3,128 significantly upregulated DEGs and 811 and 2,595 downregulated DEGs in GSE13898 and GSE26886, respectively. In TCGA dataset, which contained 79 EAC samples and 9 normal samples, 2,038 upregulated genes and 2,855 downregulated genes were identified. The volcano plots of GEO and TCGA samples are presented in .

Figure 2

Volcano plots of differentially expressed genes from three datasets. The x-axis represents the fold change of gene expression, and the y-axis represents the adjusted P value. The red dots in the plot represent statistically significant up-regulated genes, while the blue dots represent significant down-regulated genes. TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus; FDR, false discovery rate; FC, fold change.

WGCNA of DEGs in TCGA

Clinical and RNA-sequencing data for 79 EAC and 9 normal samples were downloaded from TCGA database, and for module detection, a total of 4,893 DEGs were selected using R package “DEseq2” for further analysis. We first used the average linkage method and Pearson’s correlation coefficient to cluster a dendrogram of samples with clinical traits (), and co-expression analysis was then applied to construct the co-expression network. The connectivity between genes met a scale-free network distribution (scale-free R2=0.9) when the value of soft thresholding power β was set to 8 (). For cluster splitting, the minimum module size was set to 30, and the modules with a higher correlation were merged (r<0.25). Finally, 12 modules were identified through hierarchical clustering (), and a unique color was assigned to each module as an identifier (pink, blue, salmon, green-yellow, black, purple, green, brown, red, magenta, tan, and grey). The number of genes in modules ranged from 34 to 1,864.

Figure 3

Weighted gene co-expression network analysis of selected genes. (A) Clustering sample dendrogram and a trait heatmap. (B) Analysis of network topology for various soft-thresholding powers and the soft-threshold β was set to 8. (C) Hierarchical clustering dendrograms of identified co-expressed genes in modules in EAC. Each colored row represents a color-coded module which contains a group of highly connected genes. A total of 12 modules was identified by merging modules with a higher correlation. EAC, esophageal adenocarcinoma. To explain the gene expression variation, an ME was calculated that represented each module. We used the tissue type (EAC or normal) as the clinical phenotype to select the cancer-related modules for further analysis (). Based on the criteria of |MTR| >0.5, we selected 5 modules as cancer-related modules for in-depth analysis: the blue module (606 genes), purple module (84 genes), brown module (524 genes), red module (178 genes), and magenta module (1,864 genes). In addition, we plotted a scatterplot of gene significance vs. module membership in each of the 5 modules (Figure S1).

Figure 4

Heatmaps of the correlation between module eigengene and clinical traits of EAC. Each row corresponds to a module eigengene, and each column corresponds to a clinical characteristic. Each cell contains the corresponding correlation. EAC, esophageal adenocarcinoma. To identify the cancer-related genes, we took the intersection of the DEGs in the GSE13898 and GSE26886 datasets from the GEO database and the genes in 5 cancer-related modules from TCGA database. Finally, 190 cancer-related genes were identified ().

Figure 5

Venn diagrams of DEGs in the 2 GEO datasets and the genes in 5 cancer-related modules from TCGA database. DEGs, differentially expressed genes; TCGA, The Cancer Genome Atlas; GEO, Gene Expression Omnibus.

Enrichment analysis of cancer-related genes

GO and KEGG enrichment analysis were performed on the cancer-related genes identified in the above analysis. As illustrated in , genes in the GO analysis were most relevant with the BP of extracellular matrix organization, the CC of complex of collagen-containing extracellular matrix, and the MF of extracellular matrix structural constituent. As shown in KEGG analysis, 15 pathways were significantly associated with cancer-related genes, including regulation of actin cytoskeleton, tight junction, protein digestion, and absorption, nuclear factor-kappa B (NF-κB) signaling pathway, and human papillomavirus infection ().

Figure 6

Gene ontology (GO) and pathway enrichment analysis. (A) Biological process analysis. (B) Cellular component analysis. (C) Molecular function analysis. (D) KEGG pathway analysis. GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; BP, biological processes; CC, cellular components; MF, molecular function.

PPI network analysis of cancer-related genes

The PPI network was established by Cytoscape based on the STRING database and consisted of 152 nodes and 366 edges (), with MCODE in Cytoscape being used to perform module analysis. One module was found to be significant (MCODE =7.375) and consisted of 17 nodes and 59 edges ().

Figure 7

The protein-protein interaction (PPI) network analysis and the most significant module. (A) The PPI network of the selected genes. The genes with yellow color belong to the fairly significant modules. (B) The most significant module of the PPI network.

Identification of hub genes associated with EAC

The genes with a score in the top 15 according to all five methods in CytoHubba were identified as hub genes of EAC. Six genes that may play an important role in EAC progression were identified: actin alpha 2 (ACTA2), biglycan (BGN), caldesmon 1 (CALD1), collagen type I alpha 1 chain (COL1A1), collagen type IV alpha 1 chain (COL4A1), and decorin (DCN) ().

Table 2

Hub genes for highly expressed genes ranked by different CytoHubba methods

Category	Rank methods in CytoHubba
Category	MNC	Closeness	Degree	MCC	EPC
1	COL1A1*	COL1A1*	COL1A1*	ACTA2*	COL1A1*
2	TIMP1	PTGS2	TIMP1	CALD1*	TIMP1
3	ACTA2*	TIMP1	PTGS2	MYL9	BGN*
4	BGN*	ACTA2*	ACTA2*	MYH11	DCN*
5	THBS1	THBS1	DCN*	TPM2	COL4A1*
6	MMP1	DCN*	BGN*	MYLK	ACTA2*
7	PTGS2	BGN*	THBS1	TPM1	THBS1
8	DCN*	MMP1	MMP1	COL1A1*	MMP1
9	COL4A1*	CALD1*	CALD1*	BGN*	COL5A2
10	CALD1*	COL4A1*	COL4A1*	DCN*	COL12A1
11	COL5A2	GJA1	COL5A2	COL4A1*	CALD1*
12	COL12A1	MET	MYLK	SERPINH1	COL6A3
13	MYL9	MYH11	COL12A1	COL12A1	FMOD
14	FMOD	FMOD	MET	COL6A3	SERPINH1
15	MYH11	MYL9	MYL9	TAGLN	PTGS2

*, the overlap genes in top 15 by five ranked methods. EPC, edge percolated component; MCC, maximal clique centrality; MNC, maximal neighborhood component; Degree, node connect degree; Closeness, node connect closeness.

Survival analysis and immune infiltration analysis of hub genes

K-M plots made using OSeac demonstrated the prognostic impact of the 6 hub genes identified from PPI analysis, and the results demonstrated that high expression of ACTA2, CALD1, COLA1A, and COL4A1 was associated with poor OS in patients with EAC (P<0.05), as shown in .

Figure 8

Kaplan-Meier curves of 6 hub genes in EAC patients. (A) ACTA2; (B) BGN; (C) CALD1; (D) COL1A1; (E) COL4A1; (F) DCN. HR, hazard ratio; CI, confidence interval; OSeac, the consensus survival analysis for EAC; EAC, esophageal adenocarcinoma; ACTA2, actin alpha 2; BGN, biglycan; CALD1, caldesmon 1; COL1A1, collagen type I alpha 1 chain; COL4A1, collagen type IV alpha 1 chain; DCN, decorin. The TIMER database was used to assess the correlation between the expression of hub genes and immune infiltration (), and a positive correlation between ACTA2 expression and the infiltration of B cells (Cor =0.179; P=1.63e-02), CD4+ T cells (Cor =0.293; P=6.74e-05), macrophage cells (Cor =0.582; P=1.05e-17), and dendritic cells (Cor =0.187; P=1.18e-02) was seen. The expression of BGN was positively related to the infiltration of CD4+ T cells (Cor =0.198; P=7.77e-03), macrophage cells (Cor =0.514; P=1.68e-13), and dendritic cells (Cor =0.238; P=1.27e-03), while CALD1 expression was positively associated with the infiltration of macrophage cells (Cor =0.605; P=2.44e-19) and dendritic cells (Cor =0.228; P=2.08e-03). COL1A1 expression was positively associated with the infiltration of macrophage cells (Cor =0.463, P=5.88e-11) and dendritic cells (Cor =0.152; P=4.23e-02), and COL4A1 expression was positively associated with the infiltration of macrophage cells (Cor =0.366; P=4.45e-07). The expression of DCN was positively associated with the infiltration of B cells (Cor =0.154; P=3.97e-02), CD4+ T cells (Cor =0.225; P=2.48e-03), macrophage cells (Cor =0.647; P=1.03e-22), and dendritic cells (Cor =0.159; P=3.31e-02).

Figure 9

Correlation between 6 hub genes and immune cell infiltration (TIMER). The correlation between the abundance of immune cell and the expression of ACTA2 (A), BGN (B), CALD1 (C), COL1A1 (D), COL4A1 (E), and DCN (F) in EAC. EAC, esophageal adenocarcinoma; ACTA2, actin alpha 2; BGN, biglycan; CALD1, caldesmon 1; COL1A1, collagen type I alpha 1 chain; COL4A1, collagen type IV alpha 1 chain; DCN, decorin. To establish an effective prognostic model for predicting the prognosis of EAC, univariate, multivariate Cox proportional hazards regression analysis and lasso regression analysis were employed to screen the genes. First, we identified 163 genes as candidate genes for this model by taking the intersection of the DEGs in the GSE13898 dataset and the genes in the cancer-related modules of the WGCNA (Figure S2). In the univariate Cox regression analysis, 14 genes significantly associated with prognosis were identified (P<0.05) (Table S1), while in lasso regression, when partial likelihood deviance was the smallest, 11 of the 14 genes had coefficients that were not 0 (Figure S3). Finally, a total of 6 genes were then obtained in multivariate Cox regression analysis to establish a prognostic risk score model: claudin-3 (CLDN3), erythrocyte membrane protein band 4.1 like 4A (EPB41L4A), endothelial cell specific molecule-1 (ESM1), metallothionein 1X (MT1X), progestin and adipoQ receptor family member 5 (PAQR5), and plasminogen activator urokinase (PLAU) (, Table S2). The risk score was calculated using the following formula: risk score = (0.5864 × CLDN3) + (0.5773 × ESM1) + (0.3891 × PLAU) + (−0.4981 × MT1X) + (−0.3769 × EPB41L4A) + (–0.2727 × PAQR5).

Figure 10

Forest plot and survival analysis for the prognostic risk score model based on 6 genes. (A) Forest plot for multivariate Cox regression. 95% confidence interval for the HR value over the box plot with associated P values were presented. (B,C) Survival curve for patients with different risk scores in the training data and test data, respectively. P<0.01. (D-F) ROC curves for the prognostic risk score model representing 1-, 2-, and 3-year predictions in the test data; the values of the areas under the curve are 0.707, 0.702, and 0.726, respectively. HR, hazard ratio; ROC, receiver operator characteristic; AUC, area under the curve. The K-M curves were grouped by defined risk scores (Figure S4), which indicated that the prognosis of the high-risk group was significantly poorer than that of the low-risk group in the training data (), as well as test data (). By predicting the survival of patients at 1, 2, and 3 years, the areas under the ROC curve (AUCs) obtained from the risk-based prediction model in the training data were 0.79, 0.888, and 0.889 (Figure S5), while in the test data, they were 0.707, 0.702, and 0.726 (). We also plotted scatter plots to illustrate the relationship between survival time and risk scores. As the risk scores increased, the duration of survival gradually decreased and the number of patient deaths gradually increased in the both training () and test group (), which demonstrated the definition of “risk score” was effective. In the training data, the expression of CLDN3, ESM1, and PLAU genes were significantly higher in the high-risk group (P<0.05), while the EPB41L4A, MT1X, and PAQR5 genes were significantly highly expressed in the low-risk group (P<0.05), which were consistent with their coefficients in the risk score formula (). The nomogram of this risk score prognostic model is presented in .

Figure 11

Distribution of duration of survival and the nomogram for the risk score model, and the expression of 6 genes in the model. (A,B) Distribution of duration of survival in the training data and test data. The x-axis is arranged in order of patient risk score, and the y-axis represents patient survival time. (C) The expression of 6 prognostic genes, where red represents the high-risk group, and blue represents the low-risk group. All P<0.01. (D) A nomogram for the prognostic risk score model. “Points” is a scoring scale for the 6 genes, respectively, and “total points” is a scale for total score. OS, overall survival; CLDN3, claudin-3; ESM1, endothelial cell specific molecule-1; PLAU, plasminogen activator urokinase; EPB41L4A, erythrocyte membrane protein band 4.1 like 4A; PAQR5, progestin and adipoQ receptor family member 5; MT1X, metallothionein 1X.

Discussion

EAC is a refractory type of cancer with high mortality due to its high metastasis rate, treatment resistance, and poor prognosis (3). Although many studies have been performed in recent years, the early diagnosis, effective treatment, and prognosis for EAC have not been well resolved, and it is essential to develop a better understanding of the molecular mechanisms involved in the occurrence and progression of the disease to explore potential targets for its diagnosis and treatment. In this study, we identified 6 genes as hub genes which may play an important role in the initiation and development of EAC by integrating TCGA and GEO data and combining the WGCNA and PPI network analysis. WGCNA provides module construction and correlation analysis within gene expression data to determine the associations between genes (15), and the PPI network was based on protein networks reported in the known literature and used to explore the keys genes of specific diseases (23). This study has provided strong evidence for a novel method combining WGCNA and PPI for the identification of key genes. Abnormal expression levels of key genes have been found in various human malignant tumors and might become potential targets for the diagnosis and treatment of malignancy (32-35). ACTA2 is a protein-coding gene generally expressed in smooth muscle cells and activated cancer-related fibroblasts; tumor cells may break away from a primary site and invade the surrounding tissue with the help of actin bundles (36). A study by Masszi et al. demonstrated that tumor growth factor beta (TGF-β)-elicited epithelial-mesenchymal transition (EMT) induced the expression of ACTA2, which then increased tumor invasion and worsened the prognosis of patients (37,38). It has recently been reported that the level of ACTA2 is considerably increased in ESCC tumors (39). BGN is a member of the family of small leucine-rich proteoglycans (SLRPs) which is strongly expressed in inflammatory and fibrotic tissue (40-42) and may act as an angiogenic factor by stimulating tumor endothelial cell migration in an autocrine manner through TLR2 and TLR4 (43). Caldesmon (CALD) is an actin- and myosin-binding protein family which is an essential component of the cytoskeleton in smooth muscle and non-muscle cells (44). As a member of this family, CALD1 is involved in the regulation of the endothelial cytoskeleton as well as migration, and p38 MAPK-mediated CALD phosphorylation may be involved in endothelial cytoskeletal remodeling (45). In humans, there are at least 28 different types of collagen proteins encoded by 44 collagen genes. The COL1A1 and COL4A1 are 2 of these genes and are essential in the extra cellular matrix (ECM), which is closely related to the biological behavior of tumor cells (46,47). A recent study has demonstrated that COL1A1 and COL4A1 are associated with several clinical parameters, including TNM staging, lymph node metastasis, and tumor invasion depth in gastric cancer patients (48). COL1A1 has been reported to be highly expressed in EC cells, and upregulation of COL1A1 has been associated with cell proliferation, invasion, and apoptosis (49). DCN, a small stromal proteoglycan, is a member of the SLRP gene family (50), and Bozoky et al. found that its expression is consistently decreased in the tumor microenvironment of various cancers (51). It has also been reported that the expression level of DCN is significantly decreased in EC tumor, suggesting it may serve as a key gene in the progression of EC (52). This information clarifies why the predicted genes, especially ACTA2, BGN, CALD1, and COL4A1 (not previously reported) are highly associated with the development of EAC, and these genes may act as potential biomarkers for its diagnosis and prognosis. It is widely recognized that cyclooxygenase-2 (COX-2) and SRY-box transcription factor 2 (SOX2) are reliable biomarkers for EAC. COX-2 plays important roles in the induction of inflammation and tumorigenesis (53), and neoplastic progression of BE towards EAC is highly related to increased expression of COX-2 (54). Selective COX-2 inhibition downregulates COX-2 and MET proto-oncogene (MET) expression, which are both important molecules involved in EAC progression and dissemination (55). SOX2 is a transcription factor associated with cancer stem cells (CSCs) and embryonic stem cells, and is involved in the formation and differentiation of esophageal epithelium (56). SOX2 expression is lost during transition from BE to EAC, which is related to an increased risk of neoplastic progression (57). In addition, the pattern of p53 and particularly SOX2 protein expression in EAC predicts the response to neoadjuvant chemoradiotherapy (nCRT) (58). Compared with the above 2 biomarkers, those identified through our bioinformatic analysis seem less credible due to the lack of functional experiments. Therefore, further experimental studies to elucidate the expression, molecular mechanism, and prognostic role of the potential biomarkers are required. A risk score prognostic model was established to predict the survival rate of patients with EAC and contained 6 key genes: CLDN3, EPB41L4A, ESM1, MT1X, PAQR5, and PLAU. While CLDN3, ESM1, and PLAU were found to be negative prognostic genes, EPB41L4A, MT1X, and PAQR5 was found to be positive. CLDN3, as a member of the claudin (CLDN) gene family, is generally expressed on the epithelia of multiple tissue and is involved in the formation of intercellular tight junctions (59). Tight junctions, the most apical intercellular junctions, play vital roles in intercellular cell adhesion and the maintenance of tissue osmotic homeostasis. CLDN3 has additionally been revealed to serve as a receptor of Clostridium perfringens enterotoxin (CPE) (60). Moreover, the binding of CPE to CLDN3 has been shown to cause the proximal portion part of the CPE to interact with the cell membrane and form small cell membrane pores, resulting in increased cell membrane permeability, loss of cell osmotic balance, and ultimately cell death (61,62). It has also been reported that the expression level of CLDN3 is significantly increased in EAC tumor tissue, which is consistent with our results, and might be associated with the progression and poor prognosis of EAC (63). In addition, the CLDN3 gene had the largest coefficient (0.5864) in the risk score formula, indicating that it may serve as a very important prognostic biomarker in EAC. ESM1, also called endocan, is an endothelial cell-related proteoglycan (64). Accumulated evidence has demonstrated that tESM1 plays an important role in the regulation of major process, such as cell adhesion, endothelial dysfunction, inflammatory disorders, and tumor progression (65). In addition, ESM1 is preferentially expressed in tumor endothelium, is dramatically overexpressed in many cancers, and has been shown to be directly involved in tumor progression (65,66). Cui et al. recently reported that ESM1 plays a tumor-driving role in EC and has the potential to become a biomarker for diagnosis and prognosis (67). However, the specific role and molecular mechanism of ESM1 in EAC requires further investigation. PLAU is a member of plasminogen activator system and participates in various physiological and pathophysiological processes, such as cell proliferation, adhesion, migration, and other functions through the proteolytic system, intracellular signal transduction, and chemokine activation (68). Remarkably, a recent study performed by Fang et al. has revealed that PLAU could promote progression of ESCC tumor cell and that tumor cell-secreted PLAU could expedite the conversion of fibroblasts to inflammatory cancer-associated fibroblasts, accelerating the proliferation and migration of ESCC cells (69). Our study demonstrated that PLAU may serve as a prognostic biomarker of EAC, which warrants further exploration in future studies. MT1X belongs to the metallothionein gene family (MTs), which encode a series of cysteine-rich proteins (70). MTs are involved in various BP, including metal homeostasis, DNA damage, oxidative stress, angiogenesis, apoptosis, cell differentiation, and carcinogenesis (71). It has been reported that abnormal overexpression of MT1X delayed the G1/S progression of cell cycle and promoted apoptosis by inactivating NF-κB signaling in hepatocellular carcinoma cells in vitro and suppressed tumor growth and lung metastasis in nude mice in vivo (72). We also found that MT1X had a relatively higher coefficient in the risk score formula. Taken together, this suggests MT1X may serve as a potential prognostic biomarker and may inhibit the progression and metastasis of EAC. EPB41L4A belongs to the FERM band (four-point-one, ezrin, radixin, moesin) superfamily, members of which mainly form a group of membrane-associated proteins whose major biological functions are the regulation of cytoskeletal rearrangements, intracellular trafficking, and Wnt/β-catenin signaling (73,74). It has been revealed that the Wnt/β-catenin signaling pathway is prominently involved in intercellular adhesion and carcinogenesis (75). Recent cancer research suggests that a high expression of EPB41L4A is associated with better prognosis in multiple myeloma (MM), which has been hypothesized to result from the expression changes of genes involved in DNA replication (76). It is worth noting that EPB41L4A has not been reported in EAC at present, and further investigation is required to explore its important roles in the progression of this cancer. PAQR5 is a member of the progestin and adipoQ receptor (PAQR) family, which encode functional receptors with a broad range of apparent ligand specificities (77). Until now, the role of in malignancy has not been extensively studied; one article in this area reported that high PAQR5 expression in endometrial cancer may be associated with good prognosis. The results of our study suggest that PAQR5 might have the potential to serve as a tumor suppressor gene of EAC (78). Interestingly, we found that “points” can often show significant improvement as risk scores increase, as shown in the nomogram (), indicating that risk scores may have a greater impact on prognosis compared with clinical information. To the best of our knowledge, the 6-gene signature-related prognostic model and nomogram in our study have not been reported previously, and we believe this model has the potential to be a practical clinical tool for predicting the prognosis of patients with EAC. However, this study has several limitations that should be noted. First, our study was mainly based on the data from the GEO and TCGA databases, in which most patients are White, African, or Latino, and caution must be taken when extending the findings to other ethnic groups. Second, due to the lack of basic experimental verification, the expression, molecular mechanism, and prognostic role of the genes at the protein level warrant further experimental studies. In addition, the mechanical analyses in our study were exclusively descriptive, and further functional experiments are needed to clarify the underlying mechanism of the genes. Finally, the amount of data included in our study is relatively small because of the low incidence of EAC, which may decrease the credibility and generalizability of the results.

Conclusions

This study identified 6 genes with the potential to become useful targets for the diagnosis and treatment of EAC, namely ACTA2, BGN, CALD1, COL1A1, COL4A1, and DCN. A risk score prognostic model based on the CLDN3, EPB41L4A, ESM1, MT1X, PAQR5, and PLAU genes was established to predict the survival rate of patients with EAC. The 6-gene-related risk score prognostic model and the nomogram based on it might be a reliable tool for predicting the prognosis of patients with EAC. The article’s supplementary files as

78 in total

1. Extracellular matrix in multiple sclerosis lesions: Fibrillar collagens, biglycan and decorin are upregulated and associated with infiltrating immune cells.

Authors: Hema Mohan; Markus Krumbholz; Rakhi Sharma; Sylvia Eisele; Andreas Junker; Michael Sixt; Jia Newcombe; Hartmut Wekerle; Reinhard Hohlfeld; Hans Lassmann; Edgar Meinl
Journal: Brain Pathol Date: 2010-03-25 Impact factor: 6.508

2. Overexpression of claudin proteins in esophageal adenocarcinoma and its precursor lesions.

Authors: Elizabeth Montgomery; Adam J Mamelak; Michael Gibson; Anirban Maitra; Salwa Sheikh; Samir S Amr; Stephen Yang; Malcolm Brock; Arlene Forastiere; Shengle Zhang; Kathleen M Murphy; Karin D Berg
Journal: Appl Immunohistochem Mol Morphol Date: 2006-03

3. Metallothionein 1 family profiling identifies MT1X as a tumor suppressor involved in the progression and metastastatic capacity of hepatocellular carcinoma.

Authors: Zhikun Liu; Qianwei Ye; Lingjiao Wu; Feng Gao; Haiyang Xie; Lin Zhou; Shusen Zheng; Xiao Xu
Journal: Mol Carcinog Date: 2018-08-26 Impact factor: 4.784

4. Expression analysis of epb41l4a during Xenopus laevis embryogenesis.

Authors: Yanchun Guo; Kathleen S Christine; Frank Conlon; Susanne Gessert; Michael Kühl
Journal: Dev Genes Evol Date: 2011-05-10 Impact factor: 0.900

5. SOX2 as a novel marker to predict neoplastic progression in Barrett's esophagus.

Authors: Sophie van Olphen; Katharina Biermann; Manon C W Spaander; Florine Kastelein; Ewout W Steyerberg; Hans A Stoop; Marco J Bruno; Leendert H J Looijenga
Journal: Am J Gastroenterol Date: 2015-09-01 Impact factor: 10.864

6. Decreased decorin expression in the tumor microenvironment.

Authors: Benedek Bozoky; Andrii Savchenko; Hayrettin Guven; Fredrik Ponten; George Klein; Laszlo Szekely
Journal: Cancer Med Date: 2014-03-17 Impact factor: 4.452

7. A prognostic nomogram based on LASSO Cox regression in patients with alpha-fetoprotein-negative hepatocellular carcinoma following non-surgical therapy.

Authors: Dongdong Zhou; Xiaoli Liu; Xinhui Wang; Fengna Yan; Peng Wang; Huiwen Yan; Yuyong Jiang; Zhiyun Yang
Journal: BMC Cancer Date: 2021-03-08 Impact factor: 4.430

8. Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity.

Authors: Austin M Dulak; Petar Stojanov; Shouyong Peng; Michael S Lawrence; Cameron Fox; Chip Stewart; Santhoshi Bandla; Yu Imamura; Steven E Schumacher; Erica Shefler; Aaron McKenna; Scott L Carter; Kristian Cibulskis; Andrey Sivachenko; Gordon Saksena; Douglas Voet; Alex H Ramos; Daniel Auclair; Kristin Thompson; Carrie Sougnez; Robert C Onofrio; Candace Guiducci; Rameen Beroukhim; Zhongren Zhou; Lin Lin; Jules Lin; Rishindra Reddy; Andrew Chang; Rodney Landrenau; Arjun Pennathur; Shuji Ogino; James D Luketich; Todd R Golub; Stacey B Gabriel; Eric S Lander; David G Beer; Tony E Godfrey; Gad Getz; Adam J Bass
Journal: Nat Genet Date: 2013-03-24 Impact factor: 38.330

Review 9. Multifaceted Role of the Urokinase-Type Plasminogen Activator (uPA) and Its Receptor (uPAR): Diagnostic, Prognostic, and Therapeutic Applications.

Authors: Niaz Mahmood; Catalin Mihalcioiu; Shafaat A Rabbani
Journal: Front Oncol Date: 2018-02-12 Impact factor: 6.244

10. Clinical prognostic implications of EPB41L4A expression in multiple myeloma.

Authors: Weilong Zhang; Rui Lai; Xue He; Xiaoni Liu; Ye Zhang; Zuozhen Yang; Ping Yang; Jing Wang; Kai Hu; Xiaoliang Yuan; Xiuru Zhang; Weiyou Liu; Hongmei Jing
Journal: J Cancer Date: 2020-01-01 Impact factor: 4.207