Literature DB >> 35958325

Identification of a three-gene expression signature and construction of a prognostic nomogram predicting overall survival in lung adenocarcinoma based on TCGA and GEO databases.

Yuwei Zhou1, Shenhu Gao1, Rong Yang2, Chengli Du1, Yanli Wang3, Yihe Wu1.   

Abstract

Background: Lung adenocarcinoma (LUAD) is the major cause of cancer mortality. Traditional prognostic factors have limited importance after including other parameters. Thus, developing a more credible prognostic model combined with genes and clinical parameters is necessary.
Methods: The messenger RNA (mRNA) expression and clinical information from The Cancer Genome Atlas (TCGA)-LUAD datasets and microarray data from three Gene Expression Omnibus (GEO) databases were obtained. We identified differentially-expressed genes (DEGs) between lung tumor and normal tissues through integrated analysis of the three GEO datasets. Univariate and multivariate Cox regression analyses were conducted to select survival-associated DEGs and to establish a prognostic gene signature which was associated with overall survival (OS). The expression of gene proteins was assessed in 180 LUAD tissue microarrays (TMAs) by immunohistochemistry (IHC). We verified its predictive performance with a Kaplan-Meier (KM) curve, receiver operating characteristic (ROC) curve, and Harrell's concordance index (C-index) and validated it in external GEO databases. Multivariate Cox regression analysis was performed to identify the significant prognostic indicators in LUAD. Furthermore, we established a prognostic nomogram based on TCGA-LUAD dataset.
Results: A three-gene signature was constructed to predict the OS of LUAD patients. The KM analysis, ROC curve, and C-index present a good predictive ability of the gene signature in TCGA dataset [P<0.0001; C-index 0.6375; 95% confidence interval (CI): 0.5632-0.7118; area under the ROC curve (AUC) 0.674] and the external GEO datasets (P=0.05, 0.004, and 0.04, respectively). Univariate and multivariate Cox regression analyses also verified that LUAD patients with low-risk scores had a decreased risk of death compared to those with a high-risk score in TCGA database [hazard ratio (HR) =0.3898; 95% CI: 0.1938-0.7842; P<0.05]. Finally, we constructed a nomogram integrating the gene signature and clinicopathological parameters (P<0.0001; C-index 0.762; 95% CI: 0.714-0.845; AUC 0.8136). Compared with conventional staging, a nomogram can effectively improve prognosis prediction. Conclusions: The nomogram is closely associated to the OS of LUAD patients. This consequence may be beneficial to individualized treatment and clinical decision-making. 2022 Translational Lung Cancer Research. All rights reserved.

Entities:  

Keywords:  Gene Expression Omnibus (GEO); The Cancer Genome Atlas (TCGA); lung adenocarcinoma (LUAD); overall survival (OS)

Year:  2022        PMID: 35958325      PMCID: PMC9359966          DOI: 10.21037/tlcr-22-444

Source DB:  PubMed          Journal:  Transl Lung Cancer Res        ISSN: 2218-6751


Introduction

Lung cancer is the most common cause of cancer-related mortality worldwide. In 2021, it was estimated that there are more than 230,000 new cases of lung cancer and 130,000 deaths each year among the newly diagnosed cancers in the United States (1). Non-small cell lung cancer (NSCLC) is the major form of lung cancer, and lung adenocarcinoma (LUAD) is the most common histologic subtype of NSCLC (2,3). The traditional prognosis and initial treatment of NSCLC depend on clinical parameters, tumor stage, histopathology and tumor markers (4). For patients with early-stage diseases, surgical resection is an effective solution. Patients with locally-advanced disease are candidate to radiotherapy associated with systemic therapy, while metastatic disease is treated with systemic treatments. Traditional prognostic factors have limited importance after including other parameters such as Karnofsky-performance status, dose of systemic therapy or radiotherapy, and weight-loss evolved in multivariate analysis (5). Fortunately, our understanding of the molecular pathogenesis of NSCLC has progressed rapidly. For example, targeted therapy is more effective than standard chemotherapy for mutations of epidermal growth factor receptor (EGFR), rearrangements in anaplastic lymphoma kinase (ALK), oncogene c-ROS1 (ROS1), and other molecular alterations. Additionally, gene chips and public genomic datasets allow researchers to detect and analyze differentially-expressed genes (DEGs) in various cancer types, which may help to identify tumor-associated genes as well as expression signatures with prognostic impact. Determining the gene characteristics of NSCLC tumor tissue may lead us to uncover crucial biological process during LUAD progression or recurrence, which assist in evaluating the prognosis and the possible therapeutic effects (6). An increasing number of literatures have shown that risk models based on multiple types of genes have great potential to predict LUAD prognosis (7-10). Therefore, we sought to establish a credible prognostic model combined with multiple genes via bioinformatics analysis and clinical parameters. In this study, we integrated three LUAD datasets from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) to identify reliable DEGs in LUAD. Furthermore, we performed univariate Cox regression analysis and multivariate Cox proportional hazards regression analysis to identify survival-related DEGs. We constructed a risk score model using gene expression from TCGA-LUAD dataset. Subsequently, we used survival analyses, such as Kaplan-Meier (KM) analysis and a receiver operating characteristic (ROC) curve, to verify the model’s predictive performance. We also studied the molecular mechanisms underlying the gene prediction models. Finally, we established a nomogram combining the novel gene signature with clinicopathological parameters to predict the OS of LUAD patients. The detailed workflow of this study is provided in . We present the following article in accordance with the TRIPOD reporting checklist (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-22-444/rc).
Figure 1

Flowchart displaying the procedure of establishing the gene prognostic model and nomogram of LUAD applied in our study. GEO, Gene Expression Omnibus; DEGs, differentially-expressed genes; FC, fold change; FDR, false discovery rate; TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; KM, Kaplan-Meier; ROC, receiver operating characteristic; GEPIA, Gene Expression Profiling Interactive Analysis; IHC, immunohistochemistry.

Flowchart displaying the procedure of establishing the gene prognostic model and nomogram of LUAD applied in our study. GEO, Gene Expression Omnibus; DEGs, differentially-expressed genes; FC, fold change; FDR, false discovery rate; TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; KM, Kaplan-Meier; ROC, receiver operating characteristic; GEPIA, Gene Expression Profiling Interactive Analysis; IHC, immunohistochemistry.

Methods

Gene expression and clinical data

The LUAD messenger RNA (mRNA) expression data and related clinical information were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). We used the keywords “Lung cancer”, “Lung adenocarcinoma”, and “LUAD” for retrieval. Studies based on “Homo sapiens” described as “Expression profiling by array” were included for the next step of screening. “Cell lines” and “xenografts” were excluded from the research. Three gene expression microarray datasets (GSE32863, GSE75037, and GSE32665) were selected and downloaded to screen DEGs. These datasets met the following requirements: (I) contained human lung tissue samples; (II) contained tumor and normal lung tissue samples; and (III) contained at least 100 samples. Moreover, we selected three datasets (GSE72094, GSE50081, and GSE31210) for subsequent validation of the prognostic gene signature, along with the follow-up information. The probes were matched to gene symbols using the manufacturer-provided annotation file. We used the median ranking value to calculate the expression value if multiple probes matched a single gene. The expression data were normalized and log2-transformed for further analysis. Harmonized RNA sequencing data (HTSeq-counts) and associated clinical information for LUAD were downloaded from TCGA (https://portal.gdc.cancer.gov/; up to November 20, 2020). A total of 551 samples were retrieved. After preliminary screening, 411 samples were selected, including 368 tumor samples and 43 normal tissue samples (available online: https://cdn.amegroups.cn/static/public/tlcr-22-444-1.xlsx). Forty-three cases of normal samples, 74 cases with a follow-up time of ≤30 days, and one sample with metastasis were removed. Thus, 293 cases with relevant tumor tissues and clinical information were ultimately included in the study. The gene expression data of TCGA-LUAD dataset were normalized by variance stabilizing transformation (VST) with R package “Deseq2” (11) for further analysis. The research was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Analysis of DEGs and integrated microarray dataset analysis

We conducted differential gene expression analysis between tumor and normal tissues based on the GEO dataset via the Limma package in R software (12). Significantly up- and downregulated genes were defined as those with |log2fold change (FC)| >1, P<0.05, and a false discovery rate (FDR) <0.05. We drew a volcano plot to display the FC and P values of DEGs between the two comparison groups. The DEGs identified from three GEO datasets were analyzed synthetically via the robust rank aggregation (RRA) method R package “RobustRankAggreg”. P<0.05 was considered statistically significant.

Survival analysis and establishment of the prognostic gene signature

We normalized the gene expression data of TCGA-LUAD dataset via VST with R package “Deseq2”. We then analyzed the association between expression levels of the DEGs identified from the GEO datasets and OS of LUAD patients in TCGA-LUAD dataset using a univariate Cox regression model. DEGs with P<0.01 were considered significant with OS and included for the multivariable Cox regression analysis. Next, we constructed a prognostic gene signature for LUAD patients based on a linear combination of the multivariable Cox regression coefficients (β) multiplied by the mRNA expression levels of candidate prognostic genes. The risk scoring formula was defined as follows (13): Patients in TCGA-LUAD were then divided into low- and high-risk subgroups according to the gene signature’s optimal cut-off value, as determined by X-tile software (14). KM survival analysis, the area under the ROC curve (AUC), and the concordance index (C-index) were used to evaluate the ability of the prognostic gene signature via ‘survival’, ‘timeROC’ package in R software. We also compared the prognostic value of gene signature with the previously defined risk models proposed by Zhang et al. (15-17).

Validation of prognostic genes in signature

We applied Gene Expression Profiling Interactive Analysis (GEPIA) (http://gepia.cancer-pku.cn/), an online web server based on TCGA and the Genotype-Tissue Expression (GTEX) databases (18), to display the mRNA expression levels of signature genes in LUAD and non-tumor lung tissues. The KM plotter (https://kmplot.com/analysis/index.php?p=service), an online tool commonly used to assess the effects of genes on survival based on The European Genome-phenome Archive (EGA), TCGA, and GEO (Affymetrix microarrays only) databases, was used to apply survival analysis for the signature genes.

Tissue array specimens and immunohistochemistry (IHC)

Lung cancer tissue arrays (HLugA180Su07) were purchased from Shanghai Outdo Biotech Co., Ltd. (Shanghai, China) for IHC analysis, including 98 LUAD tissues and 82 para-cancerous normal tissues. The primary antibodies used included antibodies against PLEK2 (ab121131, Abcam, UK), COL1A1 (ab34710, Abcam), and GPX3 (ab104448, Abcam). We deparaffinized the tissue arrays and retrieved antigens. Next, we incubated tissues with the primary antibody at 4 ℃ overnight, washed them with phosphate-buffered saline (PBS), and then incubated them with a biotin-conjugated secondary antibody. After washing, we incubated the sections with horseradish peroxidase (HRP) complex and visualized them using diaminobenzidine (DAB) (Maixin Biotech, Fuzhou, China). The IHC score was independently related to staining intensity and the percentage of positive cells. The staining intensity was divided into four levels: 0 (no staining), 1 (weak staining), 2 (moderate staining), and 3 (strong staining). The proportion of stained positive cells was defined as: 1 (0–25% positive cells), 2 (26–50% positive cells), 3 (51–75% positive cells), and 4 (76–100% positive cells). The staining intensity and the percentage scores were multiplied to obtain the total score. All the IHC results were reviewed by pathologist in First Affiliated Hospital, School of Medicine, Zhejiang University.

Evaluation of the prognostic value of the signature

To evaluate the significant prognostic values of the gene signature, we performed univariate and multivariate Cox regression analyses in TCGA-LUAD dataset on the gene signature and corresponding clinicopathological parameters, including age, gender, tumor status, histological subtype, residual tumor status, the American Joint Committee for Cancer tumor node metastasis (AJCC TNM) stage, T stage, N stage, and M stage. Parameters with P<0.05 in the univariate analysis were further included in the multivariate Cox regression analysis. Moreover, we used the GSE72094, GSE50081, and GSE31210 datasets with complete clinical information for external validation in the same way. The risk scores for each patient were calculated using the same formula, and the optimal cut-off value for each dataset was determined by X-tile software. P<0.05 was considered statistically significant.

Construction and validation of the gene prognostic nomogram

We constructed a composite nomogram based on the gene signatures and clinicopathological information identified from the univariate and multivariate Cox regression analyses discussed above to predict the 1-, 2-, and 3-year overall survival (OS) of LUAD patients in TCGA dataset using the “rms” package of R software. Based on the nomogram’s total points, patients were divided into two groups by the optimal cutoffs determined by X-tile. KM survival analysis, the AUC of the ROC curve, and a calibration curve comparing the predicted and observed OS chances were applied to evaluate the prognostic performance of the nomogram. We also compared the nomogram’s predictive ability with that of the AJCC staging using the C-index and AUC.

Statistical analysis

We performed statistical analysis in R v. 4.0.3 (ISBN 3-900051-07-0; https://www.r-project.org/) and GraphPad Prism v. 8.02 for Windows, GraphPad Software, San Diego, CA, USA (https://www.graphpad.com/). We calculated the regression coefficients and hazard ratio (HR) with 95% confidence interval (CI) using univariate and multivariate Cox regression analyses. All statistical tests were two-sided, P<0.05 was considered as statistically significant.

Results

Identification of DEGs obtained from three GEO datasets

According to the flow chart shown in , we identified a total of 2,628, 968, and 919 DEGs between tumor and normal tissues via Limma package in R software from the GSE75037, GSE32665, and GSE32863 gene expression profiles, respectively. Details of the GEO datasets in this study are displayed in . Of them, 1,157, 444, and 316 genes were upregulated, while 1,471, 524 and 603 genes were downregulated in the GSE75037, GSE32665, and GSE32863 datasets, respectively (). A total of 296 DEGs, composed of 120 upregulated and 176 downregulated genes, were identified after integrated analysis using the RRA method (available online: https://cdn.amegroups.cn/static/public/tlcr-22-444-2.xlsx). The top 20 up- and downregulated DEGs are shown in . Hierarchical clustering analysis indicated various DEG expression patterns between tumor and normal tissues (; Figure S1).
Table 1

Details of the GEO datasets included in this study

DatasetsPlatformSample size (tumor/normal)Application
GSE75037Illumina HumanWG-6 v3.0 expression beadchip166 (83/83)Identification of DEG
GSE32665Illumina human-6 v2.0 expression beadchip179 (87/92)Identification of DEG
GSE32863Illumina HumanWG-6 v3.0 expression beadchip116 (58/58)Identification of DEG
GSE72094Rosetta/Merck Human RSTA Custom Affymetrix 2.0 microarray [HuRSTA_2a520709.CDF]386 (386/0)External validation
GSE50081[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array130 (130/0)External validation
GSE31210[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array204 (204/0)External validation

GEO, Gene Expression Omnibus; DEG, differentially-expressed gene.

Figure 2

Identification of DEGs in LUAD between tumor and normal lung tissues. (A) Volcano plot for DEGs screened from the GEO profiles (GSE75037, GSE32665, and GSE32863). (B) The heatmap of the top 20 upregulated and downregulated DEGs screened by integrated analysis of the GEO datasets. The upregulated DEGs were shown in red while the downregulated DEGs were shown in blue. The value of each column represented the value of log2FC. (C) Representative heatmap of the DEGs after integrated analysis in GSE75037 indicated that the 296 DEGs can effectively distinguish tumors from non-tumor tissues. DEGs, differentially-expressed genes; LUAD, lung adenocarcinoma; GEO, Gene Expression Omnibus; FC, fold change.

GEO, Gene Expression Omnibus; DEG, differentially-expressed gene. Identification of DEGs in LUAD between tumor and normal lung tissues. (A) Volcano plot for DEGs screened from the GEO profiles (GSE75037, GSE32665, and GSE32863). (B) The heatmap of the top 20 upregulated and downregulated DEGs screened by integrated analysis of the GEO datasets. The upregulated DEGs were shown in red while the downregulated DEGs were shown in blue. The value of each column represented the value of log2FC. (C) Representative heatmap of the DEGs after integrated analysis in GSE75037 indicated that the 296 DEGs can effectively distinguish tumors from non-tumor tissues. DEGs, differentially-expressed genes; LUAD, lung adenocarcinoma; GEO, Gene Expression Omnibus; FC, fold change.

Identification of survival-related DEGs and development of the three-gene prognostic signature

We included 293 patients from TCGA-LUAD dataset with a follow-up period of >30 days in the following survival analyses. The patients’ clinical characteristics are listed in . Based on the univariate Cox regression model, 16 DEGs were identified as being significantly associated with OS (P<0.01; ). A prognostic gene signature, including PLEK2, COL1A1, and GPX3, was developed by multivariate Cox analysis (). The downregulated GPX3 with a HR <1 was considered a tumor suppressor, whereas the upregulated COL1A1 and PLEK2 with a HR >1 were regarded as oncogenes. Meanwhile, candidate genes were analyzed using X-tile software to identify the optimal OS cutoff values, and the patients were divided into low- and high-risk groups based on these data. KM survival analysis displayed that three genes were significantly related with patient OS (P<0.05) in TCGA-LUAD dataset (). The risk score was calculated as follows:
Table 2

Clinical characteristics of patients in TCGA-LUAD dataset and the three independent GEO datasets

Clinical featuresTCGA-LUADGSE72094GSE50081GSE31210
Samples sizes293386130204
Age (years), mean ± SD64.86±971.5±1.561.83±1.74548±10
Follow time (days), mean ± SD740.94±627.51,439±1901,510.55±8.5781,201.5±458.5
Survival status, n
   Death831095430
   Survival21027776174
Sex, n
   Male1311686795
   Female16221863109
Stage, n
   I16224693162
   II62653742
   III4756
   IV1514
   Unknown7
Smoking status, n
   Ever2919499
   Never3023105
   Unable to determine6513
Kras status, n
   Wt254
   Mut132
Tp53 status, n
   Wt290
   Mut96
Stk11 status, n
   Wt323
   Mut63

TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; GEO, Gene Expression Omnibus; SD, standard deviation; Wt, wild-type; Mut, mutated.

Table 3

Identification of the gene expression signature by univariate and multivariate Cox regression analyses

No.GeneUnivariate analysis*Multivariate analysis**
HR95% CIPHR95% CIP
1 GPX3 0.73990.6117–0.8950.00190.62380.4789–0.81250.0005
2 PTPRM 1.37381.1002–1.71540.0051
3 ASPM 1.27551.0745–1.5140.0054
4 CENPF 1.25431.0607–1.48330.0081
5 TK1 1.30541.0733–1.58780.0076
6 PRR36 0.79270.6663–0.94310.0088
7 PLEK2 1.38331.1464–1.66910.00071.38761.0767–1.78830.0114
8 SLC2A1 1.25891.0766–1.47210.0039
9 FAM83A 1.21351.0796–1.36390.0012
10 MIF 1.40261.1054–1.77970.0054
11 PRC1 1.33441.0801–1.64860.0075
12 GJB2 1.18231.0597–1.3190.0027
13 HMMR 1.32091.105–1.5790.0022
14 ANLN 1.30821.1129–1.53770.0011
15 KCNK5 0.78360.6821–0.90030.0006
16 COL1A1 1.19871.0522–1.36550.00641.21731.02–1.45280.0293

*, the 16 DEGs were significantly associated with OS (P<0.01) according to univariate Cox regression analysis; **, we then performed multivariate Cox regression analysis on these 16 DEGs to identify the most informative gene set for survival prediction. Finally, the three genes marked in grey in the table were selected for multivariate Cox regression analysis and generation of a prognostic risk model according to their respective regression coefficients. HR, hazard ratio; CI, confidence interval; DEGs, differentially-expressed genes; OS, overall survival.

Figure 3

Development of the gene signatures and performance evaluation in TCGA dataset. (A) Survival analysis of candidate genes in TCGA-LUAD datasets. (B) Distribution of the risk score, survival status of patients, and the mRNA expression heatmap in TCGA dataset. (C) Survival analysis of the gene signature. Patients from TCGA dataset were divided into two groups according to the optimal cut-off values of the risk scores calculated by X-tile. (D) ROC curves of the 1-, 2-, and 3-year OS of the gene signature. TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; mRNA, messenger RNA.

TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; GEO, Gene Expression Omnibus; SD, standard deviation; Wt, wild-type; Mut, mutated. *, the 16 DEGs were significantly associated with OS (P<0.01) according to univariate Cox regression analysis; **, we then performed multivariate Cox regression analysis on these 16 DEGs to identify the most informative gene set for survival prediction. Finally, the three genes marked in grey in the table were selected for multivariate Cox regression analysis and generation of a prognostic risk model according to their respective regression coefficients. HR, hazard ratio; CI, confidence interval; DEGs, differentially-expressed genes; OS, overall survival. Development of the gene signatures and performance evaluation in TCGA dataset. (A) Survival analysis of candidate genes in TCGA-LUAD datasets. (B) Distribution of the risk score, survival status of patients, and the mRNA expression heatmap in TCGA dataset. (C) Survival analysis of the gene signature. Patients from TCGA dataset were divided into two groups according to the optimal cut-off values of the risk scores calculated by X-tile. (D) ROC curves of the 1-, 2-, and 3-year OS of the gene signature. TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; mRNA, messenger RNA. Risk score = (0.1446053 × expression value of COL1A1) + (−0.2426827 × expression value of GPX3) + (0.2697514 × expression value of PLEK2). We calculated the optimal cutoff values for the risk scores using X-tile software. Patients were stratified into two groups (cutoff value =2.03) in TCGA-LUAD dataset. The distribution and survival status of LUAD patients were plotted based on the best cutoff value. The heatmap indicated that PLEK2 and COL1A1 tended to have higher expression in high-risk patients, while GPX3 was more highly expressed in low-risk patients (). The KM survival curves revealed significantly lower OS in the high-risk group compared to the low-risk group (P<0.0001) (). The time-dependent ROC curve and C-index were applied to assess the prognostic values of the three genes’ risk score (). The AUCs for the 1-, 2-, and 3-year OS predictions for the risk scores were 0.674, 0.658, and 0.691, respectively. The C-index of the risk score was 0.6375 (95% CI: 0.5632–0.7118). We also compared the ability of the risk score with three previously established gene signatures. The AUCs of the risk scores was close to those of the gene signatures (Figure S2A-S2C), and the risk score’s C-index was also close to those of the gene signatures (0.6375 vs. 0.6497, 0.6327, and 0.6164). Thus, the three-gene signature performed well at predicting the OS of LUAD patients.

Validation of prognostic genes with GEPIA and KM plotter

We applied GEPIA to validate the expression levels of the three genes. The mRNA expression levels of PLEK2 and COL1A1 were significantly increased in LUAD tumor tissue. In contrast, those of GPX3 were significantly decreased compared to normal tissues (). We performed a survival analysis using the KM plotter and demonstrated that the three gene’s expressions were related to LUAD prognosis. Patients with a high expression of PLEK2 or COL1A1, or a low GPX3 expression, had a worse prognosis (P<0.05) ().
Figure 4

Prognostic genes validation with GEPIA and the KM plotter. (A) COL1A1 and PLEK2 had higher expression levels in the LUAD specimen compared to the normal specimen, while GPX3 had the opposite expression in the GEPIA tumor database. T, tumor tissues (red); N, normal tissues (gray). *, P<0.05. (B) The prognostic information of the three genes in LUAD was demonstrated using the KM plotter. The red curve represents the high expression group, and the black curve represents the low expression group. The representative IHC images and prognostic scores of the genes in LUAD vs. adjacent tissue. (C) Expression of PLEK2, COL1A1, and GPX3 in LUAD tissue and adjacent tissue (DAB/hematoxylin staining; magnification 200×). (D) The expression levels of the three genes were analyzed in 98 LUAD tissues and 80 adjacent tissues. PLEK2 and COL1A1 were more highly expressed in LUAD tissue, while GPX3 exhibited an opposite expression trend (P<0.05). LUAD, lung adenocarcinoma; HR, hazard ratio; IHC, immunohistochemistry; GEPIA, Gene Expression Profiling Interactive Analysis; KM, Kaplan-Meier; DAB, diaminobenzidine.

Prognostic genes validation with GEPIA and the KM plotter. (A) COL1A1 and PLEK2 had higher expression levels in the LUAD specimen compared to the normal specimen, while GPX3 had the opposite expression in the GEPIA tumor database. T, tumor tissues (red); N, normal tissues (gray). *, P<0.05. (B) The prognostic information of the three genes in LUAD was demonstrated using the KM plotter. The red curve represents the high expression group, and the black curve represents the low expression group. The representative IHC images and prognostic scores of the genes in LUAD vs. adjacent tissue. (C) Expression of PLEK2, COL1A1, and GPX3 in LUAD tissue and adjacent tissue (DAB/hematoxylin staining; magnification 200×). (D) The expression levels of the three genes were analyzed in 98 LUAD tissues and 80 adjacent tissues. PLEK2 and COL1A1 were more highly expressed in LUAD tissue, while GPX3 exhibited an opposite expression trend (P<0.05). LUAD, lung adenocarcinoma; HR, hazard ratio; IHC, immunohistochemistry; GEPIA, Gene Expression Profiling Interactive Analysis; KM, Kaplan-Meier; DAB, diaminobenzidine.

Tissue array specimens and IHC

To further validate our findings, we examined the protein expression levels of the three genes by IHC in an independent cohort of LUAD patients (HLugA180Su07). Compared with adjacent tissue, COL1A1 and PLEK2 were mainly expressed in the cytoplasm and displayed higher levels in LUAD tissues than adjacent tissues (). The tumor tissues’ IHC scores were significantly higher than those of adjacent tissues (P<0.0001) (). In contrast, the GPX3 protein levels in tumors were lower than those in adjacent tissues’ (P<0.05) ().

Assessment of prognostic factors in TCGA-LUAD

A total of 161 patients from TCGA-LUAD dataset, whose complete clinical information was provided, including age, gender, tumor status, histological subtype, residual tumor status, AJCC TNM stage, T stage, N stage, and M stage, were included in the analysis (, available online: https://cdn.amegroups.cn/static/public/tlcr-22-444-3.xlsx). We identified the prognostic indicators of OS for lung cancer using univariate and multivariate Cox regression analyses. The univariate analysis indicated that risk score, tumor status, residual tumor, and pathological stages were significantly related with OS of LUAD patients (). Multivariate analysis shown that risk score was independent risk factor for OS (P<0.05; ). After adjusting for tumor status and clinical stage in the multivariate Cox analysis, LUAD patients with a low-risk score had a lower risk of death compared to those with a high-risk score (HR =0.3898; 95% CI: 0.1938–0.7842; P<0.05; ). Notably, after adjusting for the known risk factors for survival, the gene signature demonstrated a robust performance in predicting the OS of LUAD patients.
Table 4

Baseline characteristics of patients in TCGA-LUAD dataset with complete clinical information

Clinical featuresMean ± SD/n
Risk score2.119±0.05303
Age (years)64.56±10.29
Follow time (days)554.5±341.5
Sex
   Male71
   Female90
Tumor status
   Tumor free115
   With tumor46
Histologic diagnosis
   LUAD-NOS102
   LUAD-specified59
Residual tumor
   R0142
   R18
   Rx11
Stage
   I89
   II37
   III28
   IV7
T
   T157
   T284
   T312
   T46
   Tx2
N
   N0101
   N+60
M
   M0112
   M17
   Mx42
Subdivision
   R94
   L67

TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; NOS, not-otherwise-specified; SD, standard deviation.

Table 5

Univariate and multivariate Cox regression analyses for risk scores on the OS of patients with LUAD

ParametersUnivariate analysisMultivariate analysis
HR95% CIPHR95% CIP
Risk group
   High risk111111
   Low risk0.37800.2015–0.70910.00244**0.38980.1938–0.78420.00825**
Age0.9949790.9685–1.0220.7156
Sex
   Male0.81820.448–1.4940.514
   Female111
Tumor status
   Tumor free111111
   With tumor5.04912.587–9.8562.09e-06***5.352.559-11.188.28e-06***
Histologic diagnosis
   LUAD-NOS111
   LUAD-specified0.66880.3488– 1.2820.226
Residual tumor
   R0111111
   R14.44351.6802–11.7520.00265**0.85140.2455–2.9540.7999
   Rx2.32760.7012–7.7260.167551.3290.3082–5.7270.7030
Stage
   I111111
   II2.62971.187–5.8270.01723*1.2590.1988–7.9770.8066
   III3.35671.598–7.0490.00138**1.60.1858–13.790.6686
   IV3.93491.101–14.0660.03506*0.57960.04838–6.9450.6669
T
   T1111111
   T21.4860.7244–3.0470.280091.150.5201–2.5420.7303
   T31.5230.3267–7.0980.592310.63390.1138–3.5320.6029
   T44.9791.5392–16.1050.00736**4.4590.9936–20.010.0509
   Tx6.8890–∞0.99750
N
   N00.34600.1845–0.64880.000938***0.6450.1033–4.0290.6390
   N+111111
M
   M0111
   M12.047080.6187–6.7730.241
   Mx0.917750.4546–1.8530.811
Subdivision
   L111
   R1.095350.6013–1.9950.766

*, P<0.05; **, P<0.01; ***, P<0.001. OS, overall survival; LUAD, lung adenocarcinoma; NOS, not-otherwise-specified; HR, hazard ratio; CI, confidence interval.

TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; NOS, not-otherwise-specified; SD, standard deviation. *, P<0.05; **, P<0.01; ***, P<0.001. OS, overall survival; LUAD, lung adenocarcinoma; NOS, not-otherwise-specified; HR, hazard ratio; CI, confidence interval.

Validation of the prognostic value of the signature using external GEO datasets

We used three external datasets (GSE50081, GSE72094, and GSE31210) to validate the predictive ability of the prognostic signature (). Risk scores were calculated using the same formula for each patient, and patients were divided into high- and low-risk groups using the same method. We found that patients with lower risk scores had a better chance of survival than those with higher risk scores (P=0.0016, 0.0033, and 0.0001, respectively; and Figures S3,S4). The heatmap displayed that PLEK2 and COL1A1 tended to have higher expression in high-risk patients, while GPX3 had higher expression in low-risk patients ().
Figure 5

External validation of the prognostic gene in the GSE31210 dataset. (A) KM survival curves of the gene signature. Patients from the GSE31210 dataset were divided into two groups according to the optimal cut-off values for their risk scores (calculated by X-tile). (B) Distribution of the risk scores, patients’ survival status, and the mRNA expression heatmap in the GSE31210 dataset. (C) ROC curves of the 1-, 2-, and 3-year OS predictions of the gene signature. (D) The prognostic value of the gene signature was evaluated using the multivariate Cox model. *, P<0.05; **, P<0.01. ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; KM, Kaplan-Meier; mRNA, messenger RNA.

External validation of the prognostic gene in the GSE31210 dataset. (A) KM survival curves of the gene signature. Patients from the GSE31210 dataset were divided into two groups according to the optimal cut-off values for their risk scores (calculated by X-tile). (B) Distribution of the risk scores, patients’ survival status, and the mRNA expression heatmap in the GSE31210 dataset. (C) ROC curves of the 1-, 2-, and 3-year OS predictions of the gene signature. (D) The prognostic value of the gene signature was evaluated using the multivariate Cox model. *, P<0.05; **, P<0.01. ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; KM, Kaplan-Meier; mRNA, messenger RNA. We then assessed the prognostic power through ROC analysis and C-index. The ROC analysis for GSE31210 and GSE50081 is shown in and Figures S3,S4. The C-index of the risk score in the GSE31210, GSE50081, and GSE72094 datasets were 0.6322 (95% CI: 0.5416–0.7228), 0.6213 (95% CI: 0.5463–0.6962), and 0.5930 (95% CI: 0.535–0.651), respectively. Moreover, after adjusting for covariates, patients with a low-risk score still had a significantly lower risk of death (P=0.05, 0.004, and 0.04, respectively; , Figures S3,S4). Therefore, external validation showed that the prognostic signature performed well at predicting OS in LUAD patients.

Developing and validating a prognostic nomogram

We used these 161 patients from TCGA dataset to build a prognostic nomogram in order to predict the 1-, 2-, and 3-year OS of LUAD patients (). Risk score, tumor status, and pathological stage were selected to establish the nomogram model. The patients were divided into two risk groups according to the total point of the nomogram. The KM plot effectively discerned that those with higher scores had significantly poorer OS than the low-risk group (P<0.0001) (). The AUCs of the 1-, 2-, and 3-year OS of the nomogram were 0.8136, 0.7281, and 0.8324, respectively (). The AUCs of the 1-, 2-, and 3-year OS of the gene signature were 0.7025, 0.6476, and 0.6941, respectively. Additionally, the AUCs of the 1-, 2-, and 3-year OS of the AJCC stage were 0.7394, 0.6668, and 0.6539, respectively (Figure S5). The C-index of the nomogram was 0.762 (95% CI: 0.714–0.845), while that of the AJCC stage was 0.635 (95% CI: 0.547–0.678) and the signature was 0.646 (95% CI: 0.562–0.730), which suggested that the prognostic nomogram may performed best in predicting OS. These data demonstrated that the nomogram had better predictive ability than the AJCC-stage and gene signature for the 1-, 2-, and 3-year OS. The calibration curve for predicting the 1-, 2-, and 3-year OS demonstrated that the nomogram performed well at predicting OS of LUAD patients (). In the third year, when the predicted OS was >80%, the nomogram may underestimate mortality; however, when the predicted OS was <80%, the nomogram may overestimate mortality.
Figure 6

The performance of the nomogram in predicting the prognosis in TCGA-LUAD dataset. (A) A prognostic nomogram predicting the 1-, 2-, and 3-year OS of LUAD patients. (B) The nomogram’s KM survival curves. Patients from TCGA-LUAD dataset were stratified into two risk groups according to optimal cutoff values of the nomogram (calculated by X-tile software). (C) ROC curves of the 1-, 2-, and 3-year OS predictions of the nomogram. (D) Calibration plot of the 1-, 2-, and 3-year OS of the nomogram. TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; KM, Kaplan-Meier.

The performance of the nomogram in predicting the prognosis in TCGA-LUAD dataset. (A) A prognostic nomogram predicting the 1-, 2-, and 3-year OS of LUAD patients. (B) The nomogram’s KM survival curves. Patients from TCGA-LUAD dataset were stratified into two risk groups according to optimal cutoff values of the nomogram (calculated by X-tile software). (C) ROC curves of the 1-, 2-, and 3-year OS predictions of the nomogram. (D) Calibration plot of the 1-, 2-, and 3-year OS of the nomogram. TCGA, The Cancer Genome Atlas; LUAD, lung adenocarcinoma; ROC, receiver operating characteristic; AUC, area under the ROC curve; OS, overall survival; CI, confidence interval; KM, Kaplan-Meier.

Discussion

The 5-year survival rate of lung cancer patients with localized-stage disease is 59%, while those with advanced stage is only 6% (1). LUAD is the most common subtype of lung cancer, and its mechanisms of pathogenesis and metastasis are diverse and heterogeneous. A single clinical parameter has a poor power of prognostic prediction. Outcomes vary between recovery and recurrence, even in patients with similar clinical and pathological features. Despite the recent advances in the biological characterization of LUAD, the molecular mechanisms behind its carcinogenesis remain elusive. Hence, it is necessary to identify a precise and effective prognostic signature to predict the survival of patients with LUAD. In the current study, we constructed a three-gene prognostic model to stratify LUAD patients into different risk groups for OS. These three genes were all significantly associated with the OS of LUAD patients. Among them, PLEK2 and COL1A1 are tumor-promoting genes, while GPX3 is a tumor suppressor gene. Previous studies have indicated that COL1A1 is an extracellular matrix protein, and its overexpression is vitally related to breast (19), stomach (20), liver (21), and colon (22) tumors. COL1A1 is highly expressed in human breast cancer, and its overexpression promotes breast cancer metastasis (23,24). Additionally, a study has demonstrated that COL1A1 might promote colorectal cancer metastasis by regulating the WNT/planar cell polarity (PCP) pathway (25). Another study has elucidated that COL1A1 was highly expressed in hepatocellular carcinoma (HCC), and its expression was significantly related to HCC disease progression through epithelial-mesenchymal transition (EMT) (21). Studies have also shown that compared to adjacent normal tissues, NSCLC tends to overexpress COL1A1 (26), which is correlated with the expression of hypoxia markers (27). Thus, COL1A1 is a reliable biomarker and recognized therapeutic target for different types of cancer. Previous study has found that increased PLEK2 expression might be a specific prognostic biomarker for poor progression-free survival (PFS) in LUAD patients. At the single-cell level, its expression is significantly positively correlated with LUAD cell invasion, cell cycle abnormalities, DNA damage, and DNA repair irregularities. Promoter hypomethylation may be a potential mechanism leading to its upregulation (28). Transforming growth factor (TGF)-β-mediated EMT plays a crucial role in tumor invasion and metastasis. A study has reported that PLEK2 was significantly up-regulated in NSCLC cells activating TGF-β1, and was negatively correlated with OS in NSCLC. In terms of mechanism, the PLEK2-SHIP2 axis promotes NSCLC EMT and migration via the TGF-β/PI3K/Akt signaling pathway (29). Additionally, researchers have shown that PLEK2 promotes tumorigenesis and metastasis in other types of cancer (30). Therefore, PLEK2 may be a new prognostic marker. The protein encoded by GPX3 belongs to the glutathione peroxidase family, which protects cells from oxidative damage by catalyzing glutathione’s organic hydroperoxides and hydrogen peroxide (H2O2) reduction. The role of GPX3 in cancer has not yet been elucidated. Research indicates that GPX3 has a dichotomous effect in different tumor types, both as a tumor suppressor protein and a pro-survival protein. Some studies have demonstrated that the lack of GPX3 expression in tumor tissues is related to poor prognosis and chemotherapy resistance in patients with LUADs and low-grade gliomas. In recent years, the application of GPX3 as a tumor suppressor in a variety of cancers has received extensive attention (31,32). Other studies have indicated that redox signals mediated by GPX3 can inhibit tumors in lung cancer cell lines by inhibiting the Erk-NF-κB-cyclin B1 signaling pathway, leading to cell cycle arrest in the G2/M phase (33-35). However, GPX3 expression is elevated in other tumor tissues, and its high expression is related to poor prognosis in patients with diseases like gastric cancer and lung squamous cell carcinoma (36). After identifying these three prognostic genes, we constructed and investigated the three-gene prognostic signature for its prognostic value in LUAD patients. Patients in high-risk groups showed significantly poorer prognoses than those in the low-risk group in TCGA-LUAD dataset. As demonstrated by the results, our prognostic signature was superior to or comparable with those reported in three previous studies. The AUC and C-index confirm its predictive value. Additionally, the prognostic model was an independent and significant factor for assessing the patients’ prognoses depending on the univariate and multivariate Cox regression analyses. We also verified the three-gene signature’s prognostic value in external GEO datasets. To improve the prognosis predictive ability of the three-gene prognostic signature, we integrated the prognostic model and conventional clinicopathological parameters, including AJCC-stage and tumor status, to construct a nomogram that can more accurately predict patient prognosis. As a supplement to AJCC staging, the nomogram presented superior predictive ability in terms of the 1-, 2-, and 3-year OS prediction, according to the survival analysis, C-index, and time- dependent ROC analysis, which is conducive to clinical decision-making and personalized treatment. In our nomogram’s calibration chart, a perfect agreement was observed between the predicted and the observed results. Therefore, our prognostic nomogram based on these three genes can help clinicians predict the survival outcome of LUAD patients and provide a reference for treatment guidance rather than a single routine clinical parameter. Based on TCGA-LUAD data combined with the three GEO datasets and external validation, this study provides solid evidence supporting the prognostic value of the gene signature in LUAD patients. Additionally, there are only three candidate genes in our signature, which will be more convenient and maneuverable in future clinical applications. However, this study has certain limitations that should be noted. Firstly, our study was retrospective, the sample size was small, and the patients’ information came from TCGA and GEO databases, which are restricted. Thus, it is necessary to verify the gene signature using a sufficient number of LUAD examples. Additionally, the clinical parameters were not adjusted in the three validated GEO datasets because the related information was unavailable from the GEO database. Finally, the specific mechanisms behind these prognostic genes in the pathogenesis and development of LUAD are not fully understood and need to be examined more thoroughly in the future.

Conclusions

In our study, we identified a three-gene model and a prognostic nomogram combined with gene signature and clinicopathological parameters to predict the OS of LUAD. Our prognostic model was closely associated with the prognosis of LUAD, which may facilitate discovering potential therapeutic targets and clinical decision-making. The article’s supplementary files as
  36 in total

1.  MRTF-A mediates the activation of COL1A1 expression stimulated by multiple signaling pathways in human breast cancer cells.

Authors:  Chao Meng; Yongping He; Zhaoqiang Wei; Yulin Lu; Fu Du; Guofang Ou; Nan Wang; Xue-Gang Luo; Wenjian Ma; Tong-Cun Zhang; Hongpeng He
Journal:  Biomed Pharmacother       Date:  2018-05-25       Impact factor: 6.529

2.  Cancer Statistics, 2021.

Authors:  Rebecca L Siegel; Kimberly D Miller; Hannah E Fuchs; Ahmedin Jemal
Journal:  CA Cancer J Clin       Date:  2021-01-12       Impact factor: 508.702

3.  limma powers differential expression analyses for RNA-sequencing and microarray studies.

Authors:  Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth
Journal:  Nucleic Acids Res       Date:  2015-01-20       Impact factor: 16.971

4.  Clinical significance and inflammatory landscapes of a novel recurrence-associated immune signature in early-stage lung adenocarcinoma.

Authors:  Chaoqi Zhang; Zhen Zhang; Guochao Zhang; Zhihui Zhang; Yuejun Luo; Feng Wang; Sihui Wang; Yun Che; Qingpeng Zeng; Nan Sun; Jie He
Journal:  Cancer Lett       Date:  2020-03-19       Impact factor: 8.679

5.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors:  Michael I Love; Wolfgang Huber; Simon Anders
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

6.  GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses.

Authors:  Zefang Tang; Chenwei Li; Boxi Kang; Ge Gao; Cheng Li; Zemin Zhang
Journal:  Nucleic Acids Res       Date:  2017-07-03       Impact factor: 16.971

7.  COL1A1, PRPF40A, and UCP2 correlate with hypoxia markers in non-small cell lung cancer.

Authors:  Urszula Oleksiewicz; Triantafillos Liloglou; Kalliopi-Maria Tasopoulou; Nikoleta Daskoulidou; John R Gosney; John K Field; George Xinarianos
Journal:  J Cancer Res Clin Oncol       Date:  2017-03-03       Impact factor: 4.553

8.  Molecular Signature of Subtypes of Non-Small-Cell Lung Cancer by Large-Scale Transcriptional Profiling: Identification of Key Modules and Genes by Weighted Gene Co-Expression Network Analysis (WGCNA).

Authors:  Magdalena Niemira; Francois Collin; Anna Szalkowska; Agnieszka Bielska; Karolina Chwialkowska; Joanna Reszec; Jacek Niklinski; Miroslaw Kwasniewski; Adam Kretowski
Journal:  Cancers (Basel)       Date:  2019-12-21       Impact factor: 6.639

9.  m6A-related lncRNAs are potential biomarkers for predicting prognoses and immune responses in patients with LUAD.

Authors:  Feng Xu; Xiaoling Huang; Yangyi Li; Yongsong Chen; Ling Lin
Journal:  Mol Ther Nucleic Acids       Date:  2021-04-09       Impact factor: 8.886

10.  Differential Expression of COL1A1, COL1A2, COL6A3, and SULF1 as Prognostic Biomarkers in Gastric Cancer.

Authors:  Yan Hu; Jingjing Li; Haifeng Luo; Wenli Song; Jiyuan Yang
Journal:  Int J Gen Med       Date:  2021-09-17
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.