Literature DB >> 31440059

Identification of seven-gene signature for prediction of lung squamous cell carcinoma.

Zhe Wang¹, Zhongmiao Wang¹, Xing Niu², Jie Liu³, Zhuning Wang², Lijie Chen⁴, Baoli Qin¹.

Abstract

Background and aim: Lung squamous cell carcinoma (LUSC), is a pathological subtype of lung cancer, accounting for 30% of the lung cancers. A reliable model was constructed, based on the whole gene expression profiles, to predict the prognosis of patients with LUSC.
Methods: The RNA-Seq data of LUSC was downloaded from the TCGA database, and differentially expressed genes (p<0.05, |log2fold change| >1) were screened out. By univariate and multivariate Cox regression analysis, we identified seven prognosis-related genes. Then, we established a risk score staging system to predict the prognosis of patients with LUSC. Compared with other clinical parameters, the risk score was an independent prognostic factor and had a better performance in predicting prognosis. Finally, GSEA analysis was carried out to determine the enrichment pathway significantly. The risk score models were established by Cox proportional hazard regression analysis; the ROC curve was applied to test the performance of risk score model. All the statistical analysis was accomplished by R packages.
Results: In this study, a model was constructed to predict prognosis, which contains seven genes: CSRNP1, CLEC18B, MIR27A, AC130456.4, DEFA6, ARL14EPL, and ZFP42. Based on the model, the risk score of each patient was calculated with LUSC (hazard ratio [HR]=2.673, 95% CI=1.871-3.525). It was found that the risk score can distinguish high-risk and low-risk groups in prognosis of LUSC patients, independently. Furthermore, the model was validated by ROC curves in the testing dataset and the whole dataset. Lastly, by gene set enrichment analysis (GSEA), we showed the main enrichment pathways were DNA damage stimulus, DNA repair, and DNA replication. It was suggested that the risk score may provide a new and reliable method for prognosis prediction.
Conclusion: The results of this study suggested that the risk score based on seven-genes could indicate a promising and independent prognostic biomarker for LUSC patients.

Entities: Chemical Disease Gene Mutation Species

Keywords: Cox regression model; gene set enrichment analysis; lung squamous cell carcinoma; prognosis; risk score

Year: 2019 PMID： 31440059 PMCID： PMC6664418 DOI： 10.2147/OTT.S198998

Source DB: PubMed Journal: Onco Targets Ther ISSN： 1178-6930 Impact factor: 4.147

Introduction

Lung cancer is the most commonly diagnosed cancer (11.6% of the total cases) and the leading cause of cancer death, which made it a big concern for human public health worldwide.1 About 80% of primary lung cancers are non-small cell lung cancer (NSCLC), which can be further classified into adenocarcinoma and squamous cell carcinoma.2 Lung squamous cell carcinoma (LUSC) comprises ∼30% of all lung cancers.3,4 Primary treatments of advanced lung squamous cell carcinoma are surgery, radiotherapy, and chemotherapy. Although molecularly targeted agents and immunotherapy have played a major role in the treatment of lung adenocarcinoma, there have been few approved targeted therapies and effective chemotherapeutic options beyond the first line of therapy for LUSC. However, its 5-year survival rate is still less than 15%.5 Thus, it is important to find effective and reliable predictors for early detection. Previous studies have identified the correlation between aberrant expressed genes and human cancers, and revealed promise of these genes as biomarkers in predicting patients’ survival.6–8 Similarly, the abnormality of RNA has also been reported as having impact on the survival. Overexpressed HUWE1 was found to be associated with poor patient outcome.9 VPS9D1-AS1 serves as a promising biomarker to predict the prognosis of NSCLC.10 These studies demonstrate that changes in single gene expression and in RNA expression are meaningful for predicting the prognosis of lung cancer. However, there is no independent element which can predict the prognosis of lung cancer based on a multi-gene model. In this research, we presented the risk score based on seven genes as a dependent prognostic biomarker for LUSC; the risk score contained all genes information which was included in the model. Additionaly, we found that these genes are mainly enriched in the DNA damage stimulus pathway, DNA repair pathway, and DNA replication pathway. In summary, we identified a new element (risk score) to predict the prognosis of LUSC patients. Compared to the existing studies of predictor, our work offered a new risk model as an independent prognostic biomarker in risk stratification for LUSC patients.

Materials and methods

Data source and preprocessing

Gene expression profiles and the clinical dataset of patients with LUSC were downloaded from the TCGA data portal.11 The gene expression profiles were chosen based on the criteria as follows: median expression in LUSC >0 and in adjacent-normal tissue >0. Then, all the genes were normalized by DESeq2 package. Clinical information, including the total number of patients (n=551), sex, age, TNM stage, T stage, N stage, M stage, survival time, and survival state were included in this research. A total of 551 LUSC patients were randomly divided into a training data set (n=363) and a testing data set (n=188).

Identifcation of prognosis-associated genes

Differential expression genes (DEGs) were chosen according to the criteria of P<0.05, |log2fold change| >1 in the training dataset using the R package edge, and analysed by univariate analysis. Then, 24 genes (p<0.001) were selected to build a Cox proportional hazards regression model.12 Finally, the model was carried out by stepwise regression method. Therefore, seven independent genes were chosen via the Cox regression model (Table 1). Based on this independent prognositc model, the survival risk of each LUSC patient was evaluated as follows:

Table 1

The seven prognosis-associated genes to establish the risk score system

Genes	Coefficient	HR	p-value
CSPNP1	−0.1311	0.8772	0.1348
MIR27A	0.1390	1.1492	0.0794
CLEC18B	−0.1951	0.8228	0.0014
AC130456.4	0.1708	1.1862	0.0343
DEFA6	0.1702	1.1855	0.0033
ARL14EPL	0.1821	1.1997	0.0245
ZFP42	0.1135	1.1202	0.0013

Abbreviation: HR, hazard ratio.

The seven prognosis-associated genes to establish the risk score system Abbreviation: HR, hazard ratio.

K-M survival analysis

The risk score of each LUSC sample was calculated according to the risk score formula, and all the LUSC patients were divided into low-risk and high-risk groups using the median value of the risk score as the cut-off value. Then the K-M survival curves of high-risk and low-risk groups were mapped using survival package. Lastly, the follow-up time of patients was shown in a scatter plot.

Correlation analysis between risk score system and clinical factors

All patients were grouped by age, gender, race, smoking status, and stage. The risk score between the two groups was calculated with the purpose of observing the difference between each feature.

Gene set enrichment analysis

Gene Set Enrichment Analysis (GSEA) was carried out to analyze the differences between two groups: high-risk group and low-risk group. The gene sets were adopted from The Molecular Signatures Database. The phenotype label was high-risk score vs. low-risk score. The number of permutations was set to 1,000, and a false discovery rate (FDR) <0.25 was recognized as statistically significant.

Statistical analysis

We invested DEGs using edgeR package, the principle of which is fitting generalized linear models and using likelihood ratio tests to identify DEGs. Next, we used the Cox regression model to do the multivariable survival analysis. Based on the median value of risk score, we divide the population into high-risk and low-risk groups. To further confirm the model, we used the Kaplan-Meier method to analyze the correlation between risk score and overall survival, and the log-rank test to compare survival curves. We used R software version 3.5.1 and the “survival” package to do the receiver operating characteristic (ROC) curve analysis. Moreover, we conducted 3-fold cross-validation to further confirm the efficacy of the model. The area under the curve (AUC) was also calculated. We compared two groups in different clinical parameters (age, gender, race, smoking status, and stage) using the t-test for continuous variables and χ2 test for categorical variables. With regard to both log-rank test and Cox proportional hazards regression model, the signifcance was set at p <0.05.

Results

Identification of prognostic gene signature

To identify new genetic prognostic markers for LUSC, we screened 24 prognostic related differential expression genes using univariate analysis. These prognostic significant genes were included in the Cox multivariate regression model, and the seven-gene prognosis prediction model (Table 1) was finally determined by stepwise regression. Based on the model, the risk score of each sample was obtained, and, based on the median value of the risk score, LUSC patients were divided into high-risk group and low-risk group. In the training set, we found that the survival curves of the high-low risk group were significantly different, and the low-risk group had a better survival period (Figure 1A). The training population were ranked according to the risk score from low to high, and we showed the population follow-up time and genes heat map by the ranking (Figures 1C–E). Notably, the gene expression trend of CLEC18B is opposite to the others (Figure 1E). To verify the performance of the model, the ROC curve was described and the area under the ROC was 0.765 (Figure 1B). Furthermore, 3-fold cross-validation was carried out to validate the model, and the optimal AUC area was 0.66 (Figure 2).

Figure 1

Figure 2

Outcomes of seven-gene model in the validation cohort. (A) Kaplan-Meier survival curve illustrates the differences between the high-risk group and low-risk group concerning the survival rate in the validation samples. (B) The distribution of follow-up time in the validation samples. (C) The distributions of the risk scores in the validation samples. (D) The distributions of the seven-gene expression profiles of patients in the validation samples.

Outcomes of the seven-gene model in the training cohort. (A) Kaplan-Meier survival curve demonstrates the differences between the high-risk group and low-risk group about survival rate in the training sample. (B) ROC curve shows the performance of the model. (C) The distribution of follow-up time in the training sample. (D) The distributions of the risk scores in the training sample. (E) The distributions of the seven-gene expression profiles of patients in the training sample. Outcomes of seven-gene model in the validation cohort. (A) Kaplan-Meier survival curve illustrates the differences between the high-risk group and low-risk group concerning the survival rate in the validation samples. (B) The distribution of follow-up time in the validation samples. (C) The distributions of the risk scores in the validation samples. (D) The distributions of the seven-gene expression profiles of patients in the validation samples.

Validation of the seven-gene signature in predicting survival using Kaplan–Meier curves

In order to further prove the accuracy of the model and the classification effect of the risk score, we further applied the K-M analysis to display the survival curve of the high-risk group and the low-risk group in the validation dataset and the whole dataset. We found the survival of the two groups is significantly different in each dataset ( p<0.05) (Figures 3A and 4A). This result further demonstrated the effective performance of the model for classifying patients into two groups with different prognoses. Meanwhile, we showed the follow-up time of the validation dataset and the whole dataset by ascending ranking with risk score (Figures 3B and 4B). Consistently, the seven-gene expression heatmap sorted by risk score was also shown in the figures (Figures 3C and D and 4C and D).

Figure 3

Figure 4

The effects of clinical factors and risk score in all samples. The clinical importance of clinical factor and risk score.

Abbreviations: CI, confidence interval; HR, hazard ratio.

Outcomes of seven-gene model in the all samples. (A) Kaplan-Meier survival curve illustrates the differences between the high-risk group and low-risk group about survival rate in all samples. (B) The distribution of follow-up time in the all samples. (C) The distributions of the risk scores in the all samples. (D) The distributions of the seven-gene expression profiles of patients in the all samples. The effects of clinical factors and risk score in all samples. The clinical importance of clinical factor and risk score. Abbreviations: CI, confidence interval; HR, hazard ratio.

The seven-gene signature is an independent prognostic factor of survival

To compare the risk score with conventional clinical factors, we performed univariate and multivariate Cox hazard regression analysis to evaluate the importance of these indictors in the patients’ prognosis, which included risk score, age, sex, and TNM stage. We found that the risk score was an independent prognostic factor (Figure 5) (hazard ratio [HR]=2.673; 95% confidence interval [CI]=1.871~3.525; p<0.05). This result suggested the risk score was a robust indicator in predicting the prognosis of the LUSC cohort.

Figure 5

The association between risk score and clinical factors (age (A), gender, (B) race (C), smoking-pack years (D), stage (E)) in all samples.

The association between risk score and clinical factors (age (A), gender, (B) race (C), smoking-pack years (D), stage (E)) in all samples. Clinical staging is reported relating to prognosis in NSCLC.13 However, when it was analyzed together with the risk score in our research, the result in the model did not show significance. This was because each factor in the model was assigned a weight, and this made the function of the staging non-significant. These results suggest that the risk score is a more robust element in predicting the prognosis in the LUSC cohort. Furthermore, it is noteworthy that the seven-gene signature is competitive for survival prediction compared with clinical parameters. By Cox regression analysis, we showed that the risk score is an independent prognostic factor. However, whether there is a difference in risk score for gender, age, race, stage, and smoking-pack years is unknown. In order to explore this issue, we tested the risk scores in different groups dividing by the parameters (gender, age, race, stage, and smoking-pack years), but there were no difference found (Figure 6).

Figure 6

GSEA revealed that genes with higher risk subgroup were enriched in pathways of (A) DNA damage stimulus, (B) DNA repair, and (C) DNA replication.

Identification of the eight‐gene signature-related biological processes and pathways

Based on the risk scores of the seven-gene expression, the population was divided into a high-risk group and a low-risk group, and there were significant differences in the prognosis between the two groups. To explore the mechanism of the seven-gene model, we examined the whole gene expression profiles by GSEA analysis. The GSEA analysis revealed that the mainly enrichment pathways were DNA damage stimulus pathway, DNA repair pathway, and DNA replication pathway, and the results suggested that the poor prognosis is related to the DNA damage repair pathway (Figure 7).

Figure 7

The dataset was divided into three parts by caret package, and two parts of them were taken as the train dataset and one part as the testing dataset in turn. By 3-fold cross-validation, the area under the curve was 0.6604.

Discussion

The prediction of prognosis is of great significance for treatment choice. A large number of studies have explored prognosis-related biomakers, and gene expression profiles have been found to play an important role in prognosis.14–16 In this research, we obtained a seven-gene prognostic model by Cox regression analysis and calculated the risk score of the LUSC patients. Our results suggested that the risk score is an independent prognostic factor. Furthermore, we explored the mechanism of the seven-gene predicting model by GSEA analysis. These findings suggested that the patient’s prognosis and molecular genetic background are inter-related, and prognostic assessment, including genetic background, will be more accurate. To the best of our knowledge, this is the first time that the seven-gene model was identified, and was found to be a candidate prognostic biomarker in LUSC. Many studies have found that changes in gene expression affect prognosis, and there are reports that combine this analysis to predict prognosis of lung adenocarcinoma, kidney cancer, and pancreatic cancer.17–19 Multi-gene prediction of LUSC has not been reported. In our study; we obtained the prognosis-related seven-gene model by the Cox hazard regression analysis. Some of the seven genes (CSRNP1, CLEC18B, MIR27A, AC130456.4, DEFA6, ARL14EPL, and ZFP42) were reported in relation to tumorigenesis. CSRNP1 has been reported as a hub gene correlating to the prognosis of liver cancer.20 Similarly, it has been reported to be involved in the promotion of carcinogenesis through the eQTL mechanism in renal cancer.21 TGF-β1 could induce CSRNP1 expression, while SMAD3 was activated.22 A meta-analysis reported that the rs13275170 locus of DEFA6 increased 1.3-fold in gastric cancer, suggesting that DEFA6 is associated with tumorgenesis.22 ZFP42 is an oncogene, and a study reported a combined methylation model can accurately predict the occurrence of lung cancer, including SOX17, TAC1, HOXA7, CDO1, HOXA9, and ZFP42, which highly suggests the role of ZFP42 in the development of lung cancer.23 In this seven-gene model, MIR27A, AC130456.4, and ARL14EPL have not been reported in the relationship with tumors. It was interesting to find the increasing expression of CLEC18B has a negative correlation with the risk score in the heatmap. CLEC18B is a member of the CLEC superfamily and expressed in a variety of cancer cells. It was reported as strongly negatively correlated with total survival in GBM, and it was found to promote proliferation, invasion, and migration in GBM tumor cells. The high expression of CLEC18B can activate the activity of Wnt/β-catenin signaling pathway.24 Our study has proven that the genes in the model are related to the DNA damage pathway. In addition, Wnt signaling and DNA damage do have crosstalk.25 Referring to ZFP42, there has been little known about the ZFP42 gene in lung cancer. In summary, the seven genes have not been reported to predict the prognosis of lung cancer individually. We obtained the risk score of the sample by the combined prediction model of the seven genes, and verified the risk score was an independent prognostic factor. In order to understand the mechanism of prognosis prediction of this gene set. We proformed GSEA analysis; GSEA is a computational method to evaluate whether a defined set of genes shows statistically significant, concordant differences between two biological states.26 We found the associated biological signaling pathways were the DNA damage stimulus pathway, DNA repair pathway, and DNA replication pathway. Consistent with previous reports, DNA damage repair genes affect prognosis of tumor patients.27 These results also suggested that poor prognosis may be associated with abnormal DNA damage pathways. However, the risk score as a comprehensive evaluation indicator reducing the genetic dimension can independently predict the LUSC patient’s prognosis. The model built by this study was validated by an internal dataset. However, it also required more clinical data validation to reinforce the conclusion. Taken together, this study suggests that the integrated genetic information included with the clinical evaluation parameters to analyze prognosis will lead to more accurate predictions and provide more accurate treatment options.

Conclusion

Our study proposed a novel seven-gene model to predict the prognosis of patients with LUSC. We also further validated the feasibility of this model. In addition, genes in this model were mainly associated with the DNA damage stimulus, DNA repair, and DNA replication. Therefore, our study provided a reliable way to analyze the prognosis of patients with lung squamous cancer.

25 in total

Review 1. Pathology of lung cancer.

Authors: William D Travis
Journal: Clin Chest Med Date: 2011-12 Impact factor: 2.878

2. DNA Damage Response and Repair Gene Alterations Are Associated with Improved Survival in Patients with Platinum-Treated Advanced Urothelial Carcinoma.

Authors: Min Yuen Teo; Richard M Bambury; Emily C Zabor; Emmet Jordan; Hikmat Al-Ahmadie; Mariel E Boyd; Nancy Bouvier; Stephanie A Mullane; Eugene K Cha; Nitin Roper; Irina Ostrovnaya; David M Hyman; Bernard H Bochner; Maria E Arcila; David B Solit; Michael F Berger; Dean F Bajorin; Joaquim Bellmunt; Gopakumar Iyer; Jonathan E Rosenberg
Journal: Clin Cancer Res Date: 2017-01-30 Impact factor: 12.531

Review 3. The crosstalk between Wnt/β-catenin signaling pathway with DNA damage response and oxidative stress: Implications in cancer therapy.

Authors: Ansar Karimaian; Maryam Majidinia; Hossein Bannazadeh Baghi; Bahman Yousefi
Journal: DNA Repair (Amst) Date: 2017-01-11

4. Early Detection of Lung Cancer Using DNA Promoter Hypermethylation in Plasma and Sputum.

Authors: Alicia Hulbert; Ignacio Jusue-Torres; Alejandro Stark; Chen Chen; Kristen Rodgers; Beverly Lee; Candace Griffin; Andrew Yang; Peng Huang; John Wrangle; Steven A Belinsky; Tza-Huei Wang; Stephen C Yang; Stephen B Baylin; Malcolm V Brock; James G Herman
Journal: Clin Cancer Res Date: 2016-10-11 Impact factor: 12.531

5. FGFR1 amplification in squamous cell carcinoma of the lung.

Authors: Rebecca S Heist; Mari Mino-Kenudson; Lecia V Sequist; Swathi Tammireddy; Laura Morrissey; David C Christiani; Jeffrey A Engelman; A John Iafrate
Journal: J Thorac Oncol Date: 2012-12 Impact factor: 15.609

6. Cancer statistics, 2013.

Authors: Rebecca Siegel; Deepa Naishadham; Ahmedin Jemal
Journal: CA Cancer J Clin Date: 2013-01-17 Impact factor: 508.702

7. GSEA-P: a desktop application for Gene Set Enrichment Analysis.

Authors: Aravind Subramanian; Heidi Kuehn; Joshua Gould; Pablo Tamayo; Jill P Mesirov
Journal: Bioinformatics Date: 2007-07-20 Impact factor: 6.937

8. Innate immunity and non-Hodgkin's lymphoma (NHL) related genes in a nested case-control study for gastric cancer risk.

Authors: Sue K Park; Jae Jeong Yang; Sohee Oh; Lisa Y Cho; Seung Hyun Ma; Aesun Shin; Kwang-Pil Ko; Taesung Park; Keun-Young Yoo; Daehee Kang
Journal: PLoS One Date: 2012-09-21 Impact factor: 3.240

9. Application of the new 8th TNM staging system for non-small cell lung cancer: treated with curative concurrent chemoradiotherapy.

Authors: Hoon Sik Choi; Bae Kwon Jeong; Hojin Jeong; Yun Hee Lee; In Bong Ha; Jin Ho Song; Ki Mun Kang
Journal: Radiat Oncol Date: 2017-07-21 Impact factor: 3.481

10. Comprehensive genomic characterization of squamous cell lung cancers.

Authors:
Journal: Nature Date: 2012-09-09 Impact factor: 49.962

13 in total

1. A m⁶A methyltransferase-mediated immune signature determines prognosis, immune landscape and immunotherapy efficacy in patients with lung adenocarcinoma.

Authors: Mengyuan Lei; Chenghan Luo; Jiayang Zhang; Wenjun Cao; Jian Ge; Min Zhao
Journal: Cell Oncol (Dordr) Date: 2022-08-15 Impact factor: 7.051

2. MiR-4652-5p Targets RND1 to Regulate Cell Adhesion and Promote Lung Squamous Cell Carcinoma Progression.

Authors: Yafu Zhou; Jianhua Yan; Huiguo Chen; Wenwu Zhou; Guohua Xiao; Haoyu Zou; Jinsong Yang
Journal: Appl Biochem Biotechnol Date: 2022-03-25 Impact factor: 2.926

3. Identification and in vitro validation of diagnostic and prognostic biomarkers for lung squamous cell carcinoma.

Authors: Xiaopeng Zhao; Chong Yuan; Xu He; Miao Wang; Haoran Zhang; Jingge Cheng; Hongyan Wang
Journal: J Thorac Dis Date: 2022-04 Impact factor: 2.895

4. Detecting Prognosis Risk Biomarkers for Colon Cancer Through Multi-Omics-Based Prognostic Analysis and Target Regulation Simulation Modeling.

Authors: Zuojing Yin; Xinmiao Yan; Qiming Wang; Zeliang Deng; Kailin Tang; Zhiwei Cao; Tianyi Qiu
Journal: Front Genet Date: 2020-05-26 Impact factor: 4.599

5. Identification and validation of a three-gene signature as a candidate prognostic biomarker for lower grade glioma.

Authors: Kai Xiao; Qing Liu; Gang Peng; Jun Su; Chao-Ying Qin; Xiang-Yu Wang
Journal: PeerJ Date: 2020-01-03 Impact factor: 2.984

6. A prognostic model for overall survival of patients with early-stage non-small cell lung cancer: a multicentre, retrospective study.

Authors: Cheng Lu; Kaustav Bera; Xiangxue Wang; Prateek Prasanna; Jun Xu; Andrew Janowczyk; Niha Beig; Michael Yang; Pingfu Fu; James Lewis; Humberto Choi; Ralph A Schmid; Sabina Berezowska; Kurt Schalper; David Rimm; Vamsidhar Velcheti; Anant Madabhushi
Journal: Lancet Digit Health Date: 2020-10-19

7. Identification of a four-gene panel predicting overall survival for lung adenocarcinoma.

Authors: Chunyu Li; Qizhong Long; Danni Zhang; Jun Li; Xianming Zhang
Journal: BMC Cancer Date: 2020-12-07 Impact factor: 4.430

8. Identification of an Immune Gene Expression Signature for Predicting Lung Squamous Cell Carcinoma Prognosis.

Authors: Yubo Yan; Minghui Zhang; Shanqi Xu; Shidong Xu
Journal: Biomed Res Int Date: 2020-06-27 Impact factor: 3.411

9. Soyasaponin II protects against acute liver failure through diminishing YB-1 phosphorylation and Nlrp3-inflammasome priming in mice.

Authors: Fangzhao Wang; Shenhai Gong; Teng Wang; Lei Li; Haihua Luo; Junhao Wang; Chenyang Huang; Hongwei Zhou; Guiming Chen; Zhanguo Liu; Qifan Zhang; Yong Jiang; Peng Chen
Journal: Theranostics Date: 2020-02-03 Impact factor: 11.556

10. Prognostic Value of a Novel Signature With Nine Hepatitis C Virus-Induced Genes in Hepatic Cancer by Mining GEO and TCGA Databases.

Authors: Jianming Wei; Bo Wang; Xibo Gao; Daqing Sun
Journal: Front Cell Dev Biol Date: 2021-07-16