Xuan Luo1, Jian Guo Xu2, ZhiYuan Wang1, XiaoFang Wang3, QianYing Zhu1, Juan Zhao1, Li Bian1. 1. 36657The First Affiliated Hospital of Kunming Medical University, Kunming, China. 2. Department of Dental Research, The Affiliated Stomatological Hospital of Kunming Medical University, Kunming, China. 3. Department of Pathology, The Second Affiliated Hospital of Kunming Medical University, Kunming, China.
Abstract
OBJECTIVE: Lung adenocarcinoma (LUAD) is a common malignant tumor with a poor prognosis. The present study aimed to screen the key genes involved in LUAD development and prognosis. METHODS: The transcriptome data for 515 LUAD and 347 normal samples were downloaded from The Cancer Genome Atlas and Genotype Tissue Expression databases. The weighted gene co-expression network and differentially expressed genes were used to identify the central regulatory genes for the development of LUAD. Univariate Cox, LASSO, and multivariate Cox regression analyses were utilized to identify prognosis-related genes. RESULTS: The top 10 central regulatory genes of LUAD included IL6, PECAM1, CDH5, VWF, THBS1, CAV1, TEK, HGF, SPP1, and ENG. Genes that have an impact on survival included PECAM1, HGF, SPP1, and ENG. The favorable prognosis genes included KDF1, ZNF691, DNASE2B, and ELAPOR1, while unfavorable prognosis genes included RPL22, ENO1, PCSK9, SNX7, and LCE5A. The areas under the receiver operating characteristic curves of the risk score model in the training and testing datasets were .78 and .758, respectively. CONCLUSION: Bioinformatics methods were used to identify genes involved in the development and prognosis of LUAD, which will provide a basis for further research on the treatment and prognosis of LUAD.
OBJECTIVE: Lung adenocarcinoma (LUAD) is a common malignant tumor with a poor prognosis. The present study aimed to screen the key genes involved in LUAD development and prognosis. METHODS: The transcriptome data for 515 LUAD and 347 normal samples were downloaded from The Cancer Genome Atlas and Genotype Tissue Expression databases. The weighted gene co-expression network and differentially expressed genes were used to identify the central regulatory genes for the development of LUAD. Univariate Cox, LASSO, and multivariate Cox regression analyses were utilized to identify prognosis-related genes. RESULTS: The top 10 central regulatory genes of LUAD included IL6, PECAM1, CDH5, VWF, THBS1, CAV1, TEK, HGF, SPP1, and ENG. Genes that have an impact on survival included PECAM1, HGF, SPP1, and ENG. The favorable prognosis genes included KDF1, ZNF691, DNASE2B, and ELAPOR1, while unfavorable prognosis genes included RPL22, ENO1, PCSK9, SNX7, and LCE5A. The areas under the receiver operating characteristic curves of the risk score model in the training and testing datasets were .78 and .758, respectively. CONCLUSION: Bioinformatics methods were used to identify genes involved in the development and prognosis of LUAD, which will provide a basis for further research on the treatment and prognosis of LUAD.
Genes affecting the development of LUAD included PECAM1, HGF, SPP1, ENG, and The favorable prognosis genes included KDF1, ZNF691, DNASE2B, and ELAPOR1, while the unfavorable prognosis genes included RPL22, ENO1, PCSK9, SNX7, and LCE5A.
How Does Your Research Contribute to the Field?
Bioinformatics methods were used to identify genes involved in the development and prognosis of LUAD, which will provide a basis for further research on its treatment and prognosis.
What Are Your Research’s Implications Towards Theory, Practice, or Policy?
This study provides basic evidence for the diagnosis and treatment of LUAD.
Background
Lung adenocarcinoma (LUAD) is a common malignant tumor with high morbidity and mortality.
Its 5-year survival rate is 4-17%.
LUAD treatment is difficult due to tumor heterogeneity, and the risk of recurrence after treatment is also higher.
With the development of genomics and use of bioinformatics analysis for lung cancer, a large number of molecular markers related to lung cancer development, drug resistance, and prognosis have been discovered, such as epidermal growth factor receptor tyrosine kinase inhibitors (EGFR TKIs) and anaplastic lymphoma kinase (ALK) inhibitors.[4,5] Therefore, it is particularly important to study the biological markers of LUAD development and prognosis. The weighted gene co-expression network analysis (WGCNA) is a better method to screen central regulatory genes for tumorigenesis and development.
WGCNA explores the relationship between genes and phenotypes by weighting the correlation network.
It can analyze signaling networks by converting gene expression data into co-expression modules.
It is also used to screen genes associated with cancer-related modules and signatures. WGCNA finds relevant genes and predicts gene functions by analyzing key genes and identifying potential therapeutic targets and predictive biomarkers.
Wei et al
have adopted WGCNA to analyze LUAD and have identified modules that were highly correlated with LUAD. Yi et al
have identified the occurrence- and prognosis-related genes of LUAD using WGCNA. In recent years, WGCNA has been predominantly used to screen important hub genes against LUAD gene expression data. The present study was based on the transcriptome data and clinical information on LUAD from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases. The development regulatory genes in LUAD were screened using WGCNA as well as differentially expressed genes (DEGs). Univariate Cox, LASSO, and multivariate Cox regression analyses were utilized to screen the key genes for the prognosis of LUAD and to construct a prognostic model in order to ensure further understanding of the developmental and prognostic LUAD markers. Experimental design (Figure 1).
Figure 1.
Experimental design.
Experimental design.
Materials and Methods
Data Acquisition
Transcriptome data (mRNA) for LUAD were downloaded from the UCSC Xena database (https://xenabrowser.net/datapages/) TCGA in Pan-Cancer (PANCAN). Normal sample transcriptome data for LUAD were downloaded from the UCSC Xena database (https://xenabrowser.net/datapages/) GTEx. The clinical data were downloaded from the TCGA. Final cohort included a total of 515 LUAD and 347 normal samples. Ethical approval was not required for the present study because the data were obtained from public databases.
Intersection of Differentially Expressed Genes and Weighted Gene Co-expression Network Analysis
The R package limma was used to screen DEGs. Conditions for screening included log fold change (logFC) of ≥|2| and adjusted P ≤ .05. R package WGCNA was used to analyze genes with TPM≥1. β values and scale-free R2 were adjusted as a soft-threshold index to construct a scale-free co-expression network. WGCNA was used to screen out the relevant LUAD modules. The most relevant LUAD modules and genes intersecting with DEGs were used for the next step.
Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, Protein-Protein Interaction Network, Survival Analyses, and Immunohistochemistry
The R package WGCNA was used to screen-out the relevant LUAD modules. The most relevant LUAD modules, genes intersecting with DEGs, and R package clusterprofiler were used for Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analyses of this part of the gene set. A PPI network for the intersection genes was constructed using the string database, and then the top 10 hub genes were screened using cytohubba in Cytoscape (version 3.7.2). The network PPI pairs with a combined confidence score of ≥.4 were visualized. The GEPIA database (http://gepia.cancer-pku.cn/) survival analysis was performed for hub genes, which were screened for survival impact. The HPA database (https://www.proteinatlas.org/) was used to analyze the level of protein expression for hub genes.
Prognostic Model Construction
The mRNA results were obtained from the transcriptome data, the survival data were analyzed, the missing survival data were deleted, and 487 samples were obtained. The samples were randomly divided into 2 groups, including 243 training and 244 testing samples. R survival was used to perform a univariate Cox regression analysis on all genes in the training set to screen for genes significantly associated with prognosis.
To eliminate the problem of collinearity between genes, LASSO regression in R glmnet and R survival were used. After performing 1000 10-fold cross-validations, the λ value with minimized error was selected as the optimum λ parameter value.
Risk score models were constructed using multivariate Cox regression in R survminer and R survival. The risk score model’s predicted risk score served as a predictor of prognostic status.The risk score for each sample was calculated using a risk score model. The total sample was divided into high- and low-risk groups. Using R survminer and R survival were compared between high- and low-risk groups. To calculate the potency of the risk score model, ROC curves for the training and testing datasets were plotted using R survivalroc ROC. The risk curves were plotted according to survival time, survival status, gene expression quantity, and risk score for each sample. A nomogram was created using R RMS based on the clinical information provided by the TCGA database, including age, gender, stage, and risk score, and excluding missing data.
Results
Differentially Expressed Genes and Weighted Gene Co-expression Network Analysis
Based on the filter conditions of logfc≥|2| and adjusted P ≤ .05, 2518 DEGs out of 19 405 genes were screened. The β-value was set at 12 (scale free R2 ≥ .90) and adjusted as a soft-threshold index to construct a scale-free co-expression network according to the WGCNA. The WGCNA results had a total of 10 modules, including MEred, MEblue, MEgreen, MEmagenta, MEpink, MEyellow, MEblack, MEbrown, MEturquoise, and MEgrey. MEturquoise was the module with the highest LUAD correlation with 3449 genes. It had a total of 526 intersection genes with DEGs (Figure 2).
Figure 2.
Intersection of 2518 differentially expressed genes and 3447 genes inside the MEturquoise module.
Intersection of 2518 differentially expressed genes and 3447 genes inside the MEturquoise module.
Enrichment Analysis
KEGG/GO functional enrichment analysis was performed on 526 genes. Biological process (BP) was mainly enriched in extracellular matrix organization, external structure organization, and external encapsulating structure organization (Figure 3A). Cellular component (CC) was mainly enriched in collagen-containing extracellular matrix, focal adhesion, and cell-substrate junction (Figure 3B). Molecular function (MF) was mainly enriched in glycosaminoglycan binding, extracellular matrix structural constituent, and sulfur compound binding (Figure 3C). KEGG was mainly enriched in the PI3K-Akt signaling pathway (Figure 3D).
Figure 3.
A. Top 10 GO terms of genes related to biological process, B. cellular component, C. molecular function, D. kyoto encyclopedia of genes and genomes.
A. Top 10 GO terms of genes related to biological process, B. cellular component, C. molecular function, D. kyoto encyclopedia of genes and genomes.
PPI, Hub Gene Survival Analysis, and Immunohistochemistry
A PPI network for 526 genes was constructed using a string database. The top 10 hub genes were screened using the cytohubba module in Cytoscape. These included IL6, PECAM1, CDH5, VWF, THBS1, CAV1, TEK, HGF, SPP1, and ENG (Figure 4A). Survival analysis for hub genes identified 4 genes that had an impact on survival. They were PECAM1, HGF, SPP1, and ENG (Figure 4B). Boxplots based on the expression profiles of these 4 genes were constructed (Figure 4C). PECAM1, HGF, and ENG were downregulated and SPP1 was upregulated at the mRNA level. Immunohistochemistry results for PECAM1, HGF, and ENG showed that they were downregulated and SPP1 was upregulated at the protein level (Figure 5).
Figure 4.
A. Top 10 hub genes were screened from 526 genes. B. Top 10 genes were subjected to survival analysis, and 4 genes (PECAM1, HGF, SPP1, and ENG) had impacts on survival with a log-rank of P < .05. C. Expression profiles of PECAM1, HGF, SPP1, and ENG in LUAD; SPP1 was upregulated, while PECAM1, HGF, and ENG were downregulated. *Indicates that the difference between LUAD and normal groups was statistically significant.
Figure 5.
PECAM1, HGF, and ENG were downregulated and SPP1 was upregulated at the protein level.
A. Top 10 hub genes were screened from 526 genes. B. Top 10 genes were subjected to survival analysis, and 4 genes (PECAM1, HGF, SPP1, and ENG) had impacts on survival with a log-rank of P < .05. C. Expression profiles of PECAM1, HGF, SPP1, and ENG in LUAD; SPP1 was upregulated, while PECAM1, HGF, and ENG were downregulated. *Indicates that the difference between LUAD and normal groups was statistically significant.PECAM1, HGF, and ENG were downregulated and SPP1 was upregulated at the protein level.An accurate prognostic model was established and univariate Cox regression analysis was performed on 19 405 genes in 243 training dataset samples, which identified 75 prognostically influential genes. LASSO regression analysis was performed on 75 genes, which identified 17 genes. Multivariate Cox regression analysis of 17 genes filtered out 9 genes (RPL22, ENO1, KDF1, ZNF691, PCSK9, DNASE2B, SNX7, ELAPOR1, and LCE5A) that were used to construct a forest plot (Figure 6A). The favorable prognosis genes included KDF1, ZNF691, DNASE2B, and ELAPOR1, while unfavorable prognosis genes included RPL22, ENO1, PCSK9, SNX7, and LCE5A. The risk score model followed the relationship, where riskScore = .421*RPL22+.417*ENO1-.522*KDF1-.502*ZNF691+.169*PCSK9-.13*DNASE2B+.346*SNX7-.077*ELAPOR1+.179*LCE5A. Risk score analysis for the training and testing datasets divided the cohort into high- and low-risk groups according to the median risk score value. Survival analysis showed that the high risk group had a worse overall survival than the low-risk group (P < .0001; Figure 6B-6C).
Figure 6.
A. KDF1, ZNF691, DNASE2B, and ELAPOR1 had coefficient values of <1 and were favorable prognostic genes. RPL22, ENO1, PCSK9, SNX7, and LCE5A had coefficient values of >1 and were unfavorable prognostic genes. B. The high-risk group had lower overall survival than the low-risk group in the training dataset (P < .0001). C. The high-risk group had a lower overall survival rate than the low-risk group in the testing dataset (P <.0001). Risk model potency calculated by plotting ROC curves with area under the curve values of .78 and .758 for training D and testing E groups, respectively.
A. KDF1, ZNF691, DNASE2B, and ELAPOR1 had coefficient values of <1 and were favorable prognostic genes. RPL22, ENO1, PCSK9, SNX7, and LCE5A had coefficient values of >1 and were unfavorable prognostic genes. B. The high-risk group had lower overall survival than the low-risk group in the training dataset (P < .0001). C. The high-risk group had a lower overall survival rate than the low-risk group in the testing dataset (P <.0001). Risk model potency calculated by plotting ROC curves with area under the curve values of .78 and .758 for training D and testing E groups, respectively.The ROC curves for the training and testing datasets were plotted to evaluate the efficacy of the risk score model. The ROC curves had area under the curve values of .78 and .758 for the training and testing groups, respectively (Figure 6D-6E). Risk curves were plotted according to survival time and risk score (Figure 7A). The number of patients who died continuously increased and the number of those who survived continuously decreased with increasing risk scores (Figure 7B). A heat map of the key genes for prognosis is shown in Figure 7C, which demonstrated the reliability of the risk score model. The nomogram with clinical data from the TCGA database included age, gender, stage, and risk score and was used to predict the 1-, 3-, and 5-year survival values (Figure 8).
Figure 7.
A. High- and low-risk groups were ranked by risk score of each sample. B. As risk score increased, the number of deaths continued to increase in each sample. C. Expression of each gene inside the risk score model in high- and low-risk groups.
Figure 8.
“Points” served as a scoring scale for each factor, while “total points” served as a scale for the total score. “riskScore” represents the risk score. “Stage” represents the 4 stages of LUAD. Based on the total score of each patient, 1-, 3-, and 5-year survival rates were inferred.
A. High- and low-risk groups were ranked by risk score of each sample. B. As risk score increased, the number of deaths continued to increase in each sample. C. Expression of each gene inside the risk score model in high- and low-risk groups.“Points” served as a scoring scale for each factor, while “total points” served as a scale for the total score. “riskScore” represents the risk score. “Stage” represents the 4 stages of LUAD. Based on the total score of each patient, 1-, 3-, and 5-year survival rates were inferred.
Discussion
The present study used DEGs with WGCNA to identify the intersection genes. KEGG/GO enrichment analysis was performed on the intersection genes, which were then used to construct the PPI network. Ten hub genes were screened out and evaluated using the cytohubba module in Cytoscape. The final results screened-out 4 genes that have effects on the development of LUAD. KEGG analysis showed that PECAM-1 participates in cell adhesion molecules, leukocyte transendothelial migration, and malaria. Increased expression of PECAM-1 promotes migration of endothelial lymphocytes.
The present study demonstrated that high expression of PECAM-1 was associated with improved survival, which was consistent with the study by Cao et al
PECAM-1 signaling may be involved in the regulation of VEGF expression
HGF and SPP1 were involved in the PI3K-Akt signaling pathway. The present study demonstrated that high expression of HGF was associated with improved survival, which was consistent with the study by Pan et al
It has also been noted that HGF regulates the expression of SAE2 and circRNA CCDC66, thereby increasing EMT and drug resistance in LADC cells.
Additional studies have demonstrated that HGF activates FAK and downregulates the expression of AIF.
The cDNA sequence for SPP1 contains a 67-bp 5' noncoding region and a 415-bp 3′ noncoding region, as well as a 942-bp coding region encoding a 314-amino-acid protein.
High SPP1 expression predicts poor survival in LUAD,
and SPP1 serves as a biological marker for LUAD prognosis.[22-25] It is also involved in the development of pulmonary fibrosis, as well as affects macrophage secretion by upregulating its expression.
The membrane antigen endoglin, which was first identified in 1985 on the surface of a cell line with acute lymphoblastic leukaemia using hybrid technology to obtain the murine monoclonal antibody mab44g4, was mainly localized in vascular endothelial cells.
In 1988, Gougos et al
have validated the presence of endoglin in a variety of vascular endothelial cells using the monoclonal antibody mab44g4 and have found it to be a homodimer composed of subunits with a molecular weight of 95 kDa. Mutations in ENG may cause hereditary hemorrhagic telangiectasia,[29,30] greater susceptibility to pulmonary artery malformations,
and pulmonary hypertension.Univariate Cox, LASSO, and multivariate Cox regression analyses were used in the present study to construct a risk score model. The model contained 9 genes, including RPL22, ENO1, KDF1, ZNF691, PCSK9, DNASE2B, SNX7, ELAPOR1, and LCE5A. Ribosomal protein L22 (rpl22) is expressed in various cells.
Studies have shown that rpl22 functions as a tumor suppressor by selectively upregulating the expression of the tumor suppressor p53 and inhibiting colony formation of cancer cells.
The rpl22 is downregulated in lung cancer and can interact with casein kinase 2α.[35,36] EN01 is the most differentially expressed protein in humans, and this differential expression depends on the stress or metabolic status of the pathological cells in the tissue.
Additionally, EN01 is involved in many important physiological processes, such as hypoxia resistance,
inflammatory responses, and autoimmune activities.
ENO1 upregulation promotes glycolysis and tumor progression in LUAD
and serves as a potential biological marker in the early occurrence and development of lung cancer.
Chemoresistance influences small cell lung cancer by regulating ENO1 expression.
The KDF1 mutation may cause abnormal ectodermal development
and regulates epidermal differentiation via KDF1-mediated IKKα deubiquitination.
The low expression of ZNF691 in patients with ovarian cancer is associated with poor prognosis.
Proprotein convertase subtilosin 9 (PCSK9) is the ninth member of the proteinase K subfamily of the proprotein convertase family. It is currently confirmed that PCSK9 can affect blood lipids by regulating lipid metabolism. It interacts primarily at the cell surface with epidermal growth factor precursor homology domain A of the LDLR, limiting the LDLR-mediated lipoprotein uptake.
PCSK9 regulates apoptosis of LUAD cells (A549) via the ER stress and mitochondrial signaling pathways.
It may also be a prognostic marker for NSCLC patients.
DNASE2B is an enzyme responsible for nuclear degradation in the mouse lens. However, DNASE2B expression in zebrafish has a distribution pattern that is different from that in mice.
DNASE2B is highly expressed in smokers and may have the potential to improve the prediction of chemosensitivity in gastric cancer patients.[50,51] Sorting nexins (SNXs) are a family of peripheral membrane proteins that direct protein trafficking decisions within the endocytic network.
SNX7 is a member of the SNX family and contains a PX and BAR domains. It regulates amyloid β peptide by inducing the degradation of amyloid precursor protein.
ELAPOR1 is a secretory granule maturation-promoting factor that is lost during paligenosis.The study results showed that as the risk score increased, the patient survival rate gradually decreased, while the prognostic impact of the risk score increased. The area under the ROC curve for the training and testing datasets indicated that the risk score model was valid.
Limitations
The present study did not validate the amount of mRNA expression of PECAM1, HGF, SPP1, and ENG in lung cancer and only adopted the mRNA data from the TCGA database to construct a prognostic model with a single source of data.
Conclusions
Genes affecting the development of LUAD included PECAM1, HGF, SPP1, and ENG. The favorable prognosis genes included KDF1, ZNF691, DNASE2B, and ELAPOR1, while the unfavorable prognosis genes included RPL22, ENO1, PCSK9, SNX7, and LCE5A.
Authors: C Manaspon; S Thaweesapphithak; T Osathanon; K Suphapeetiporn; T Porntaveetus; V Shotelersuk Journal: Br J Dermatol Date: 2019-05-14 Impact factor: 9.302
Authors: Ye Jin Ha; Sang Nam Yoon; Yeo Jin Jeon; Dong Hyung Cho; Seon Ae Roh; Byung Sik Kim; Hee Jin Kim; Seon Young Kim; Yong Sung Kim; Jin Cheon Kim Journal: Anticancer Res Date: 2011-12 Impact factor: 2.480