Literature DB >> 35116665

Systematically integrative analysis identifies diagnostic and prognostic candidates and small-molecule drugs for lung adenocarcinoma.

Qiang Chen1,2, Xiaoyi Wang1, Jing Hu3,4.   

Abstract

BACKGROUND: Lung adenocarcinoma (LUAD) is the most common histological subtype of lung cancer (LC). However, the early-stage diagnostic rate is still low, and the 5-year overall survival (OS) rate remains poor. The present study aimed to identify critical genes as diagnostic and prognostic markers and small-molecule drugs for combating LUAD using a systematic bioinformatics analysis.
METHODS: Five gene expression profiling datasets were systematically integrated and analyzed. First, gene coexpression modules were identified, and differentially expressed genes (DEGs) were screened. Second, the functional changes of these DEGs were systematically investigated. Third, the protein-protein interaction network, high correlation module and key genes were identified. Fourth, prognosis and diagnostic analyses were performed. Fifth, small-molecule drugs were predicted for guiding LUAD therapy.
RESULTS: Finally, 12-gene and 2-gene signatures were identified as diagnostic and prognostic markers. The areas under the curves (AUCs) of two signatures associated with 3-year survival were 0.686 and 0.603, respectively. The AUCs of two signatures were over 95% and 94% in diagnostic model, separately. Eleven small-molecule drugs were found and irinotecan was simultaneously predicted in three drug databases.
CONCLUSIONS: The present study identified some key dysregulated genes involved in LUAD and potential drugs by a comprehensive analysis, which provides novel insights into the pathological mechanism involved in LUAD and may shed light on the diagnosis, prognosis and treatment of LUAD patients. 2021 Translational Cancer Research. All rights reserved.

Entities:  

Keywords:  Lung adenocarcinoma (LUAD); biomarker; diagnosis; prognosis; small-molecule drug

Year:  2021        PMID: 35116665      PMCID: PMC8797894          DOI: 10.21037/tcr-21-526

Source DB:  PubMed          Journal:  Transl Cancer Res        ISSN: 2218-676X            Impact factor:   1.241


Introduction

Lung cancer (LC), one of the most common malignancies, is the leading cause of cancer-related deaths worldwide (1).Especially in recent years, the morbidity and mortality of LC have been increasing year by year, and LC has ranked first among all malignancies for many years in some countries such as China (2019 National Cancer Report from China National Cancer Center). Non-small cell lung cancer (NSCLC) is the most predominant pathological type, constituting more than 80% of LCs (2), of which lung adenocarcinoma (LUAD) is the major histological subtype and accounts for more than 40% of LCs (3,4). Annually, LUAD results in more than 600,000 deaths all over the world (5). Despite recent advances in molecular diagnosis and multimodality therapies, the 5-year overall survival (OS) rate of LUAD patients in all stages is only approximately 15% (3,6). LUAD patients diagnosed at an early stage have a higher 5-year OS rate with 70–90% (7). However, no more than 20% of LUAD patients are diagnosed in a timely manner at an early stage (8), and 35–50% of patients diagnosed and treated at an early stage will relapse after surgical resection (9), which indicates a very poor prognosis. To improve the survival rate of LUAD patients, it is vital to uncover the underlying molecular mechanisms of LUAD and identify potential molecular diagnostic and prognostic biomarkers and/or therapeutic targets to combat LUAD. Currently, the diagnosis and prognosis of LUAD patients are mainly evaluated on the basis of many clinical and pathological features. Due to the high heterogeneity of LUAD, many clinical variables correlating with prognosis bring difficulty in predicting clinical outcomes upon detecting LUAD at an early stage (10). Recently, several molecular factors such as gene mutation and overexpression have been used to guide the clinical care of LUAD patients (3,11). For example, EGFR mutations and ALK fusions have been used as targets of molecular targeted therapy (3,12,13). However, EGFR and ALK alterations were found in only a small fraction of LUAD patients, and the majority of LUAD patients frequently harbored activating mutations such as KRAS, BRAF and ERBB2 (14) and loss-of-function mutations and deletions such as TP53, RB1 and CDKN2A (3,15,16). So far, few targeted molecular therapies have been clinically used for such alterations, and few prognostic biomarkers have been identified to predict clinical outcomes. Therefore, more knowledge of additional genes altered in LUAD is required to further guide the diagnosis, treatment and prognosis of LUAD. Gene expression analysis based on gene expression profiles is an important traditional method of investigating the differences in gene expression under different cell statuses, and a large number of differentially expressed genes (DEGs) associated with LUAD have been identified (17-20), such as AKT1, DDR2 and FGFR1. Some genetic factors including K-ras, FGF22 and LAPTM4B, as biomarkers, have also been investigated to predict the prognosis in LUAD patients (21-23). However, most DEGs reported in different studies vary greatly, and only a few consistent DEGs have been identified. In addition, few identified diagnostic and prognostic markers have been widely accepted for routine clinical use, and widely acceptable consistent key genes as biomarkers urgently require identification. Increasing LUAD-related gene expression data allows us to do the important work of identifying consistent key genes involved in LUAD by systematically integrative analysis. In this study, five LUAD-related gene expression profile datasets from the NCBI Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases were systematically integrated by bioinformatics methods including weighted gene coexpression network analysis (WGCNA), differentially expressed gene analysis (DEGA), functional enrichment analysis, and protein and protein interaction (PPI) network construction. Subsequently, diagnostic and prognostic analyses of identified critical genes were performed to identify diagnostic and prognostic candidates associated with LUAD patients. Last, potential small-molecule drugs related to key genes were identified to guide the treatment of LUAD. We present the following article in accordance with the TRIPOD reporting checklist and the MDAR checklist (available at https://dx.doi.org/10.21037/tcr-21-526).

Methods

The flow chart of systematic bioinformatics analysis in the current study is displayed in .
Figure 1

Flow chart of the bioinformatics analysis in the present study. LUAD, lung adenocarcinoma; TCGA, the cancer genome atlas; GEO, gene expression omnibus; DEGA, differentially expressed gene analysis; DEGs, differentially expressed genes; WGCNA, weighted gene coexpression network analysis; PPI, protein-protein interaction; MCODE, molecular complex detection; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; RP, reaction pathway; DO, disease ontology; LASSO, least absolute shrinkage and selection.

Flow chart of the bioinformatics analysis in the present study. LUAD, lung adenocarcinoma; TCGA, the cancer genome atlas; GEO, gene expression omnibus; DEGA, differentially expressed gene analysis; DEGs, differentially expressed genes; WGCNA, weighted gene coexpression network analysis; PPI, protein-protein interaction; MCODE, molecular complex detection; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; RP, reaction pathway; DO, disease ontology; LASSO, least absolute shrinkage and selection.

Data collection

Five LUAD-related gene expression datasets were reanalyzed by systematic bioinformatics methods in this study. Among these datasets, four microarray datasets were retrieved from the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geoprofiles/), including GSE10072 (19), GSE7670 (24), GSE19804 (20) and GSE102511 (25). The GSE10072, GSE7670 and GSE19804 datasets were generated using the Affymetrix microarray platform. The GSE102511 dataset was produced using the Ion Torrent Proton platform. The GSE10072 and GSE102511 datasets were from American patients with LUAD. The GSE10072 dataset included 58 eligible tumor tissues samples and 49 eligible normal lung tissues samples, and the GSE102511 dataset included 16 eligible tumor tissues samples and 15 eligible normal lung tissues samples. The GSE7670 and GSE19804 datasets were from Taiwanese patients with LUAD and included 27 and 56 eligible paired tissue samples, separately. Given the same microarray chip platform (GPL96), the GSE10072 and GSE7670 datasets were merged into one dataset by reconstructing the gene expression profiles, and the new dataset was named “Dataset A”. The GSE19804 and GSE102511 datasets were named “Dataset B” and “Dataset C”, respectively. A LUAD-related RNA-seq dataset was retrieved from TCGA database portal (2019, https://portal.gdc.cancer.gov/). The inclusion criteria for the RNA-seq data were as follows: (I) histological diagnosis for LUAD; (II) except LUAD, not suffering from other malignancies; and (III) data with complete clinical information. Finally, totals of 535 LUAD tissues samples and 59 non-LUAD normal lung tissues samples were included. The LUAD-related RNA-seq dataset was named “Dataset D”. These studies have been approved by the Institutional Review Board of the relevant participating institutions including the National Taiwan University and Taichung Veterans General Hospital (GSE19804), the Taipei Veterans General Hospital and Taichung Veterans General Hospital (GSE7670), 13 participating hospitals and National Cancer Institutes (GSE10072), the Aichi Cancer Center and Nagasaki University (GSE102511), and the National Cancer Institute of NIH (TCGA RNA-seq data). No additional approval from the ethics committee was required. The present study complies with the requirements of data usage and publishing from NCBI GEO and TCGA databases.

Data preprocessing and DEGA

All microarray data were standardized by a normalized microarray preprocessing procedure using the affy package (version 1.60.0) in Bioconductor project (version 3.9.0, http://www.bioconductor.org/) (26), and RNA-seq data were subjected to normalization using the trimmed mean of M-values (TMM) method based on the edgeR package (version 3.26.3) in Bioconductor project (27). DEG screening of microarray data was performed using the limma package (version 3.32.7) in Bioconductor project (28). The limma package employs the voom method, linear modeling and empirical Bayes moderation to assess DEGs, which yields more robust results, even with fewer microarrays. The edgeR package (version 3.26.3) in Bioconductor project was used to screen the DEGs between LUAD tissue and non-LUAD normal lung tissue samples of TCGA RNA-seq data (27).

WGCNA

Gene coexpression was analyzed using a WGCNA method, and WGCNA was performed using the WGCNA package (version 1.13) (29). First, an adjacency matrix was converted according to the gene expression matrix. Based on the adjacency matrix, the unsupervised coexpression relationship of each gene pair was computed by Pearson correlation coefficients. The soft threshold β was used to strengthen the correlation adjacency matrix, and the parameter β of each dataset was selected according to its scale-free topology criterion. Second, a topological matrix was converted according to the strengthened adjacency matrix, and the correlation of each gene pair was measured using the topological overlap measure (TOM). On the basis of TOM-based dissimilarity (1-TOM), the genes with coherent expression profiles were classified into a gene module using the average linkage hierarchical clustering method. Gene coexpression module was identified from the system cluster tree by a dynamic cutting algorithm. The modules with 75% similarity were merged into one module, and the representative gene in a module was identified as the module eigengene (ME). The correlation between the ME and gene module was defined as the module membership (MM). The gene differential expression was measured using the P value from t-test method between LUAD and normal lung tissues, and gene significance (GS) was computed by log10 transformation of the p value. The average GS was defined as the module significance (MS) of the module, and the MS indicated the correlation between the module and LUAD. A detailed description of the WGCNA method can be obtained from the original article (29).

PPI network construction and highly correlated module identification

The interactive relationships among DEGs encoding proteins were elucidated by constructing a PPI network, and the interactive relationships between gene pairs were retrieved from the online STRING database (version 11.0, https://string-db.org/) (30). Gene pairs with a combined score ≥0.7 were filtered to construct the PPI network, and the PPI network was established and visualized using Cytoscape software (version 3.7.0, http://www.cytoscape.org/) (31). Based on the topological properties of the whole PPI network, the highly correlated modules (subnetwork) were extracted from the whole PPI network using a Molecular COmplex DEtection (MCODE) algorithm. The MCODE analysis was performed using the plugin MCODE (version 1.5.1) in Cytoscape software (32). The threshold parameters were set as degree cut-off =4, node score cut-off =0.6, K-core =4 and max. depth =100.

Essential genes identification

Key genes were identified using seven centrality analyses in the PPI subnetwork (33). Seven centrality methods were Degree Centrality, Subgraph Centrality, Network Centrality, Eigenvector Centrality, Closeness Centrality, Betweenness Centrality and Information Centrality. The plugin CytoNCA (version 2.1.6) was used to perform the Centrality analyses in the Cytoscape software (34). The centrality score of each gene was computed by the centrality analyses, and the genes with higher centrality scores were identified as key genes. The intersecting genes of key genes were identified as the essential genes.

Identification of LUAD-specific prognostic gene signature

The LUAD patients from the TCGA database were used to perform the survival analysis. The Kaplan-Meier (KM) estimate and log-rank (LR) test were used to evaluate the associations between the essential genes and the OS of LUAD patients in the survival package (version 2.43-3, https://CRAN.R-project.org/package=survival). The group cut off was set to 50%, and the LR P value, hazard ratio (HR) and 95% confidence interval (CI) were computed. A P<0.05 indicated the statistical significant of the association between an essential gene and the OS of LUAD patients. A univariate Cox proportional hazards regression model was applied to evaluate the associations between the essential genes and the OS of LUAD patients. The same characteristic parameters obtained via the LR method were computed, and the same significant P value criterion was set. A multivariate Cox hazards regression model with the stepwise method was applied to assess the prognostic value for LUAD patients using the survival package in R project. The essential gene combination in the optimal Cox hazard regression model was used for further analyses. The hazards model was established as follows: where “ExpDEGn” represented the expression level of DEGn and “CoeDEGn” denoted the regression coefficient from the multivariate Cox regression model (35). On the basis of the median of the above risk scores, LUAD patients were divided into the low-risk and high-risk groups. The survival ROC package (version 1.0.3) was used to construct the receiver operating characteristic (ROC) curve, and the ROC was used to measure the risk prediction rate of DEGs between the low- and high-risk groups. LASSO Cox regression model analysis was performed using the glmnet package (version 4.0-2) based on R. The formula used to calculate the risk score is the same as that in the multivariate Cox hazards regression model, and the statistical methods are the same as those in the multivariate Cox hazards regression model.

Diagnostic analysis of prognostic genes

Diagnostic analysis of prognostic genes was performed using the support vector machine (SVM) method. The SVM classification model was constructed using the e1071 package (version 1.7-3) based on R. The radial basis function was applied in the SVM kernels, and the mRNA profiles were classified using 100 independent repetitions of 10-fold cross-validation. The specificity, sensitivity and accuracy were calculated.

Identification of candidate small-molecule drugs

Potential small-molecule drugs of the candidate prognostic genes were searched using three databases including CMap (https://portals.broadinstitute.org/cmap/) (36), L1000FWD (http://amp.pharm.mssm.edu/L1000FWD/) (37) and DGIdb (http://www.dgidb.org/) (38). The intersection of identified small molecules was indicative of potential candidate adjuvant drugs for use in LUAD patients.

Expression validation of essential genes by transcriptome sequencing data

To verify the expression of essential genes between LUAD tissue and normal lung tissue by bioinformatics methods, the transcriptomes of 10 nonsmoking LUAD patients of 35–50 years old (5 male and 5 female LUAD patients in early stages) from Xuanwei City (one of the areas with the highest morbidity and mortality of LC in China) in China were sequenced. All tissues were obtained from The First People’s Hospital of Yunnan Province. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by the Institutional Review Board of The First People’s Hospital of Yunnan Province (No. 2017YY227). Informed consent was taken from all the patients. Transcriptome sequencing data were generated using the Illumina HiSeq2500 platform with the paired-end sequencing method by Beijing Novogene Technology Co., Ltd. DEGs between LUAD tissues and normal lung tissues samples were screened using the DESeq2 (version 1.24.0) package in Bioconductor project (39), and |logFC|>1 and FDR<0.01 were set as the cut-off criteria.

Statistical analysis

We performed data analysis according to the characteristics of each dataset on the basis of R software (version 3.6.3). For microarray datasets, DEGs were screened using the unpaired t-test in the limma package and a |Log2(fold change) (logFC)|>1 and a false discovery rate (FDR) <0.05 (P<0.05) were set as the cut-off criteria. For RNA-seq dataset, the quantile-adjusted conditional maximum likelihood (qCML) method in the edgeR package was used to identify DEGs, and the statistical significant of the difference was set for |logFC|>1.5 and FDR<0.01 on the basis of large-scale samples. The construction of prognostic signature was performed by univariate and multivariate COX regression analyses. Survival analysis was conducted by KM method and LR test, and P<0.05 indicated that the difference was statistically significant. Diagnostic analysis was performed using the SVM method. Gene coexpression analysis was performed using WGCNA, and gene significance was indexed by log10 transformation of the P value of the t-test measuring differential expression between LUAD and normal lung tissue samples. The expressions of key DEGs in prognostic and diagnostic signatures were analyzed using the paired-sample datasets including GSE7670 and GSE19804, and the paired t-test method was used to compare the expression difference between tumor samples and normal samples. A P<0.05 indicated that the difference was statistically significant in two groups.

Results

DEGs identification and functional analysis

To identify DEGs involved in LUAD, DEGA was performed. According to |logFC|>1 and FDR <0.05, totals of 623 (Dataset A, 189 upregulated and 434 downregulated), 1,387 (Dataset B, 424 upregulated and 963 downregulated), 1,343 (Dataset C, 492 upregulated and 851 downregulated) DEGs were identified. According to |logFC|>1.5 and FDR <0.01, 11,450 (8,940 upregulated and 2,510 downregulated) DEGs were identified. Overlapping analysis showed that 235 common DEGs (74 upregulated and 161 downregulated) were identified (available online: https://cdn.amegroups.cn/static/public/tcr-21-526-1.pdf). To better understand the roles of 235 common DEGs in LUAD, functional enrichment analyses including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, Reactome pathways (RP) and Disease Ontology (DO) analyses were performed using the online STRING database (version 11.0) and clusterProfiler package (version 3.12.0) in Bioconductor project (30,40). GO analysis showed that 235 dysregulated genes were separately significantly enriched in 163 upregulated and 537 downregulated biological processes (BPs) (P<0.05), 18 upregulated and 29 downregulated molecular functions (MFs) (P<0.05), and 22 upregulated and 36 downregulated cellular components (CCs) (P<0.05). The top 5 GO terms with the most significant P values are shown in . Among the enriched BPs, upregulated BPs were mainly related to the cell cycle process, and downregulated BPs were mainly associated with the biological regulation (). KEGG pathway analysis showed that 2 upregulated pathways including the cell cycle (P=0.00062) and ECM-receptor interaction (P=0.0101) were significantly enriched (), and no significantly downregulated pathways were enriched. RP analysis showed that 235 dysregulated genes were separately enriched in 1 upregulated and 1 downregulated pathways (), and the 2 RPs were upregulated collagen degradation (P=0.0068) and downregulated hemostasis (P=0.0045), respectively. DO analysis showed that up- and down-regulated DEGs were separately associated with 32 and 9 diseases, and the top 5 DO terms are listed in . All upregulated DO terms were significantly associated with many types of cancers, and the most significantly upregulated DO term was non-small cell lung carcinoma (P=0.000764424) ().
Table 1

Functional terms enriched by common DEGs

Term IDTerm descriptionObservedFDR
Biological process
   Up-regulated
    GO:1903047Mitotic cell cycle process215.01E-12
    GO:0022402Cell cycle process231.11E-10
    GO:0051301Cell division141.91E-06
    GO:0000280Nuclear division112.81E-06
    GO:0007059Chromosome segregation101.51E-05
   Down-regulated
    GO:0051239Regulation of multicellular organismal process671.36E-13
    GO:0009653Anatomical structure morphogenesis511.80E-10
    GO:0050793Regulation of developmental process562.61E-10
    GO:0072359Circulatory system development322.61E-10
    GO:0001944Vasculature development254.38E-10
Molecular function
   Up-regulated
    GO:0005515Protein binding450.00063
    GO:0042802Identical protein binding200.00099
    GO:0042803Protein homodimerization activity130.0016
    GO:0046983Protein dimerization activity160.0021
    GO:0005488Binding610.0034
   Down-regulated
    GO:0005539Glycosaminoglycan binding120.00024
    GO:0005102Signaling receptor binding310.00048
    GO:0005488Binding1260.00048
    GO:0005515Protein binding830.00048
    GO:0017046Peptide hormone binding60.00048
Cellular component
   Up-regulated
    GO:0005576Extracellular region281.14E-05
    GO:0005819Spindle105.09E-05
    GO:0000776Kinetochore60.00092
    GO:0000940Condensed chromosome outer30.001
    GO:0005615Extracellular space150.001
   Down-regulated
    GO:0005576Extracellular region551.28E-09
    GO:0005615Extracellular space317.89E-07
    GO:0005886Plasma membrane749.46E-06
    GO:0009986Cell surface229.46E-06
    GO:0031226Intrinsic component of plasma membrane343.13E-05
KEGG pathway
   Up-regulated
    hsa04110Cell cycle60.00062
    hsa04512ECM-receptor interaction40.0101
Reactome pathway
   Up-regulated
    HSA-1442490Collagen degradation30.0068
   Down-regulated
    HSA-109582Hemostasis150.0045
Disease ontology
   Up-regulated
    DO:3908Non-small cell lung carcinoma150.0007644
    DO:3459Breast carcinoma130.0027271
    DO:0050904Salivary gland carcinoma50.0088499
    DO:8850Salivary gland cancer50.0088499
    DO:0060084Cell type benign neoplasm130.0088499
   Down-regulated
    DO:6000Congestive heart failure140.0134019
    DO:6432Pulmonary hypertension80.0134019
    DO:5844Myocardial infarction150.0134019
    DO:3393Coronary artery disease170.0134019
    DO:1936Atherosclerosis160.0261586

DEG, differentially expressed gene; FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; ECM, extracellular matrix.

DEG, differentially expressed gene; FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; ECM, extracellular matrix.

Interactive relationships among 235 DEGs and essential gene identification

To elucidate the interactive relationships among 235 DEGs, a PPI network was constructed. At a minimum required interaction score = high confidence 0.7, a total of 108 DEGs was filtered into the PPI network, and a PPI network with 108 nodes and 320 edges was established (). Three highly correlated modules were identified in the whole PPI network, and the module with the highest score (score =19.789) included 20 nodes and 188 edges (). The 20 genes had high centrality scores () and were identified as essential genes including UBE2C, ZWINT, MELK, CCNB2, RRM2, CEP55, BIRC5, TTK, KIF20A, CDKN3, TOP2A, NUSAP1, ASPM, CDC20, BUB1B, CCNB1, TYMS, MCM4, CENPF and KIF4A. These genes were mainly associated with many types of cancers including NSCLC and LUAD () and were mainly enriched within pathways related to the cell cycle (). All the genes were highly expressed in LUAD tissue (all P<0.001) (), and every pair of genes showed a strong positive correlation in expression (all P<0.001) ().
Figure 2

Biological analysis based on 235 DEGs. (A) A PPI network with 108 nodes and 320 edges was established. Each red cycle node represents one upregulated gene and each green cycle represents one downregulated gene. Each edge represents an interactive relationship between two genes. Larger nodes represent more links. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (B) Highly correlated module with the highest score was identified by the MCODE algorithm in the whole PPI network (score =19.789). The subnetwork included 20 nodes and 188 edges. All genes in the subnetwork were upregulated in LUAD tissues. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (C) The DO-gene network showed that 20 genes were mainly associated with various cancers. (D) The pathway-gene network showed that 20 genes were mainly related to cell cycle pathways. Each red cycle node represents one gene. Each blue V node represents one GO term (biological process). Each yellow triangle node represents one KEGG pathway, and each green diamond node represents one reaction pathway. (E) Twenty genes were upregulated in LUAD tissues (all P<0.001). (F) Twenty genes had stronger positive correlations in expression. (G) Nineteen genes in the subnetwork were validated to have significantly upregulated expression in LUAD tissues by analyzing transcriptome sequencing data. (H) Nineteen genes had stronger positive correlations in expression in the transcriptome sequencing data. DEGs, differentially expressed genes; PPI, protein-protein interaction; MCODE, molecular complex detection; LUAD, lung adenocarcinoma; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Table 2

Centrality scores of twenty genes in the highly correlated module by eight centrality methods

RankGeneSubgraphEigenvectorInformationLACBetweennessClosenessNetworkDegree
1 UBE2C 7504268.500.22610.4717.790.2221.019.0019
2 ZWINT 7504268.500.22610.4717.790.2221.019.0019
3 MELK 7504268.500.22610.4717.790.2221.019.0019
4 CCNB2 7504268.500.22610.4717.790.2221.019.0019
5 RRM2 7504268.500.22610.4717.790.2221.019.0019
6 CEP55 7504268.500.22610.4717.790.2221.019.0019
7 BIRC5 7504268.500.22610.4717.790.2221.019.0019
8 TTK 7504268.500.22610.4717.790.2221.019.0019
9 KIF20A 7504268.500.22610.4717.790.2221.019.0019
10 CDKN3 7504268.500.22610.4717.790.2221.019.0019
11 TOP2A 7504268.500.22610.4717.790.2221.019.0019
12 NUSAP1 7504268.500.22610.4717.790.2221.019.0019
13 ASPM 7504268.500.22610.4717.790.2221.019.0019
14 CDC20 7504268.500.22610.4717.790.2221.019.0019
15 BUB1B 7504268.500.22610.4717.790.2221.019.0019
16 CCNB1 7504266.500.22610.4717.790.2221.019.0019
17 TYMS 6799982.000.21510.2216.890.1110.9517.8818
18 MCM4 6799982.000.21510.2216.890.1110.9517.8818
19 CENPF 6799982.000.21510.2216.890.1110.9517.8818
20 KIF4A 6799982.000.21510.2216.890.1110.9517.8818

LAC, local average connectivity.

Biological analysis based on 235 DEGs. (A) A PPI network with 108 nodes and 320 edges was established. Each red cycle node represents one upregulated gene and each green cycle represents one downregulated gene. Each edge represents an interactive relationship between two genes. Larger nodes represent more links. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (B) Highly correlated module with the highest score was identified by the MCODE algorithm in the whole PPI network (score =19.789). The subnetwork included 20 nodes and 188 edges. All genes in the subnetwork were upregulated in LUAD tissues. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (C) The DO-gene network showed that 20 genes were mainly associated with various cancers. (D) The pathway-gene network showed that 20 genes were mainly related to cell cycle pathways. Each red cycle node represents one gene. Each blue V node represents one GO term (biological process). Each yellow triangle node represents one KEGG pathway, and each green diamond node represents one reaction pathway. (E) Twenty genes were upregulated in LUAD tissues (all P<0.001). (F) Twenty genes had stronger positive correlations in expression. (G) Nineteen genes in the subnetwork were validated to have significantly upregulated expression in LUAD tissues by analyzing transcriptome sequencing data. (H) Nineteen genes had stronger positive correlations in expression in the transcriptome sequencing data. DEGs, differentially expressed genes; PPI, protein-protein interaction; MCODE, molecular complex detection; LUAD, lung adenocarcinoma; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes. LAC, local average connectivity.

Expression validation of 20 essential genes

To validate the differential expressions of 20 essential genes between LUAD and normal lung tissues, 20 transcriptomes from 10 nonsmoking LUAD patients in an early stage were analyzed. According to |logFC|>1 and FDR<0.01, 2,360 DEGs (1,302 upregulated and 1,058 downregulated) were identified (available online: https://cdn.amegroups.cn/static/public/tcr-21-526-2.pdf). Except TYMS, 19 DEGs were consistent with the results from integrative data, and the expressions of 19 DEGs were visualized using the heatmap shown in . Similarly, there were stronger positive correlations in expression among these genes (all R2>0.55, most P<0.001) (). The RNA-seq data have been deposited in the NCBI Short Read Archive (Accession number: PRJNA561283). Given the high consistency of the identified DEGs, the 20 DEGs were selected for further analyses.

Consensus clustering of 20 essential genes and relationships with distinct clinical outcomes and clinicopathological features

To investigate the relationships between 20 essential genes and distinct clinical outcomes and clinicopathological features, consensus clustering was performed. On the basis of the expression similarity of 20 essential genes, k=2 was selected according to clustering stabilities increasing from k=2 to 9 in the TCGA dataset () and clustered into two subgroups (). The two subgroups were significantly related to the prognosis of LUAD patients, and the cluster2 subgroup had a higher survival rate (P=0.002, ). Clinicopathological features were compared between the two subgroups. The results showed that all these genes were lowly expressed in the cluster2 subgroup, and the cluster2 subgroup was significantly correlated with an earlier N stage, T stage and pathological grade (P<0.05, 0.001 and 0.01, separately), as well as with fewer female and dead patients (P<0.001 and 0.05, separately) (). The results explained why patients had a higher OS in the cluster2 subgroup.
Figure 3

Consensus clustering of 20 essential genes and relationships with distinct clinical outcomes and clinicopathological features. (A,B) Clustering stability increasing curve from k=2 to 9. (C) K=2 was selected, and LUAD patients were divided into two clusters. (D) The survival rates of LUAD patients showed significant differences between the two clusters (P=0.002), and the cluster2 patients had a higher survival rate. (E) Relationships between 20 genes and clinicopathological features are shown using a heatmap. The cluster1 subgroup was significantly correlated with a later N stage, T stage and pathological grade (P<0.05, 0.001 and 0.01, separately) and with more female and dead patients (P<0.001 and 0.05, separately). *, P<0.05; **, P<0.01; ***, P<0.001. LUAD, lung adenocarcinoma.

Consensus clustering of 20 essential genes and relationships with distinct clinical outcomes and clinicopathological features. (A,B) Clustering stability increasing curve from k=2 to 9. (C) K=2 was selected, and LUAD patients were divided into two clusters. (D) The survival rates of LUAD patients showed significant differences between the two clusters (P=0.002), and the cluster2 patients had a higher survival rate. (E) Relationships between 20 genes and clinicopathological features are shown using a heatmap. The cluster1 subgroup was significantly correlated with a later N stage, T stage and pathological grade (P<0.05, 0.001 and 0.01, separately) and with more female and dead patients (P<0.001 and 0.05, separately). *, P<0.05; **, P<0.01; ***, P<0.001. LUAD, lung adenocarcinoma.

Survival analysis and prognostic model of 20 essential genes

To elucidate the relationships between 20 essential genes and the OS of LUAD patients, survival analysis was performed. Univariate Cox regression showed that all 20 essential genes were significantly correlated with the OS of LUAD patients (all P<0.01) and were risky genes with HR >1 (). To better predict the clinical outcomes of LUAD with 20 essential genes, the LASSO Cox regression algorithm was applied and 12 essential genes (ZWINT, MELK, CCNB2, RRM2, TTK, KIF20A, TOP2A, ASPM, CDC20, CCNB1, TYMS and KIF4A) were selected to build the risk signature based on the minimum criteria (). The risk score of each patient was calculated, and LUAD patients were divided into high-risk and low-risk subgroups on the basis of the median risk score. A significant difference in OS was observed between the two subgroups (P=1e-05), and the low-risk group showed a higher survival rate (). The KM estimate and LR test showed that 12 genes were significantly associated with the OS of LUAD patients (all P<0.05, Figure S1).
Figure 4

Prognostic and diagnostic analyses. (A) Univariate Cox regression analysis showed that 20 genes had significant prognostic value for predicting the OS in LUAD patients. (B,C) LASSO Cox analysis revealed that 12 genes had the most powerful features for predicting the OS. (D) A significant difference in OS was observed between the high- and low-risk subgroups (P=1e-05), and the low-risk group showed a higher survival rate. (E) Twelve genes were highly expressed in the high-risk subgroup. The “***” indicates that the statistical P value is less than 0.001 between two groups. (F) Twelve genes were lowly expressed in the alive patient group The “**” and “***” indicate that the statistical P value is less than 0.01 and 0.001, respectively. (G) Twelve genes were associated with the N stage, T stage, pathological stage and survival status between the high- and low-risk groups (P<0.01, 0.01, 0.001 and 0.001, separately). (H) The ROC curve showed that the risk score, N stage, T stage and pathological stage could better predict the three-year OS for LUAD patients. (I) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (J) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (P<0.001 and =0.003). (K) Diagnostic analysis showed that 12 genes had a high specificity and sensitivity for predicting the diagnosis in LUAD patients. **, P<0.01; ***, P<0.001. LASSO, least absolute shrinkage and selection. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic.

Prognostic and diagnostic analyses. (A) Univariate Cox regression analysis showed that 20 genes had significant prognostic value for predicting the OS in LUAD patients. (B,C) LASSO Cox analysis revealed that 12 genes had the most powerful features for predicting the OS. (D) A significant difference in OS was observed between the high- and low-risk subgroups (P=1e-05), and the low-risk group showed a higher survival rate. (E) Twelve genes were highly expressed in the high-risk subgroup. The “***” indicates that the statistical P value is less than 0.001 between two groups. (F) Twelve genes were lowly expressed in the alive patient group The “**” and “***” indicate that the statistical P value is less than 0.01 and 0.001, respectively. (G) Twelve genes were associated with the N stage, T stage, pathological stage and survival status between the high- and low-risk groups (P<0.01, 0.01, 0.001 and 0.001, separately). (H) The ROC curve showed that the risk score, N stage, T stage and pathological stage could better predict the three-year OS for LUAD patients. (I) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (J) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (P<0.001 and =0.003). (K) Diagnostic analysis showed that 12 genes had a high specificity and sensitivity for predicting the diagnosis in LUAD patients. **, P<0.01; ***, P<0.001. LASSO, least absolute shrinkage and selection. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic. The expression levels of 12 genes were analyzed between the high-risk and low-risk groups. The results showed that all the genes were significantly highly expressed in high risk group (all P<0.001, ). The expression levels of 12 genes were also compared between alive and dead patient groups. The result showed that 12 genes were significantly lowly expressed in the alive patient group (all P<0.01, separately, ), which indicates that low expressions of these genes contribute to a low risk of LUAD patients and lengthen survival of LUAD patients. To elucidate the associations between risk scores and clinicopathological features, clinicopathological features were analyzed between the high-risk and low-risk subgroups. We observed significant differences between the two risk subgroups with respect to the N stage (P<0.01), T stage (P<0.01), pathological stage (P<0.001) and survival status (P<0.001) (). The ROC curve showed that the risk score and pathological stage could better predict the three-year OS for LUAD patients (AUC =0.686 and 0.675, separately) (). To determine whether the risk signature is an independent prognostic indicator, univariate and multivariate Cox regression analyses were performed. By univariate Cox regression analysis, the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001) (). Multivariate Cox regression analysis showed that risk score and pathological stage were significantly correlated with the OS of LUAD patients (P<0.001 and =0.003, separately) (). These results indicate that the risk score can independently predict the OS in LUAD patients.

Diagnostic model based on 12 essential genes related to prognosis

To evaluate diagnostic power of 12 essential genes related to the prognosis of LUAD patients, a diagnostic model of 12 genes was constructed using the SVM classification model. The results showed that both the specificity and sensitivity of classification exceed 93% in the three datasets (). The AUCs were over 95% in the three datasets (), which indicates that 12 genes are very effective as diagnostic biomarkers in predicting the diagnosis of LUAD patients.

Coexpression network construction and module analysis

To better understand the gene coexpression relationships in different tissue types, WGCNA was performed. By normalization, 12,410 (Dataset A), 20,217 (Dataset B), 17,702 (Dataset C) and 31,343 (Dataset D) genes were selected to construct the gene coexpression network. On the basis of a scale-free topology criterion R2>0.8, power β=7 (Dataset A, R2=0.84), 18 (Dataset B, R2=0.85), 4 (Dataset C, R2=0.85) and 2 (Dataset D, R2=0.86) were selected as soft thresholds to convert the Pearson correlation matrix into a strengthened adjacency matrix, separately (, Figure S2A). The TOM of each gene pair was calculated, and 20, 17, 35 and 36 coexpression modules were separately identified by average linkage hierarchical clustering according to a TOM-based dissimilarity measure (1-TOM) in four datasets (, Figure S2B). The correlation analysis between ME and LUAD showed that the brown (including 1,875 genes, cor=−0.87 and P=6.20E-50), dark turquoise (including 1,895 genes, cor=−0.83 and P=1.45E-31), yellow (including 910 genes, cor=−0.94 and P=5.37E-15) and blue (including 3,025 genes, cor=−0.88 and P=1.9E-190) modules were separately the most significant modules in the four datasets (, Figure S2C). The correlation analysis of the MMs in these modules showed that these MMs had the most significant correlation with LUAD (Dataset A, cor=0.87 and P<1e-200; Dataset B, cor=0.77 and P<1e-200; Dataset C, cor=0.95 and P<1e-200; Dataset D, cor=0.98 and P<1e-200; and Figure S2D). GSs across modules showed that these modules had the highest GS values (, Figure S2E). The four modules were selected as key coexpression gene modules for further analyses.
Figure 5

Coexpression network analysis by WGCNA. WGCNA of Dataset A was used to visualize the WGCNA results, and other WGCNA results are shown in Figure S2. (A) Network topology for various soft-threshold powers and the testing of the properties of the scale-free network were analyzed. (B) LUAD-specific coexpression modules were analyzed, and 20 modules were identified. Each short vertical line corresponds to one gene. Each branch represents one expression module of highly interconnected groups of genes. Below the dendrogram, each group of genes has been given one color, which indicates its module assignment. Gray suggests that the genes were outside all modules. (C) The associations between modules and LUAD were analyzed, which showed that the brown module was identified as having the most significant association with LUAD (P=6e-50). (D) The associations between brown module membership and LUAD were analyzed, which showed that the genes in the module and LUAD had a stronger association (P<1e-200). (E) The mean significance across modules was analyzed, which showed that the brown module with the highest mean significance and a lower variation was the module with the most significant association with LUAD. WGCNA, weighted gene coexpression network analysis; LUAD, lung adenocarcinoma.

Coexpression network analysis by WGCNA. WGCNA of Dataset A was used to visualize the WGCNA results, and other WGCNA results are shown in Figure S2. (A) Network topology for various soft-threshold powers and the testing of the properties of the scale-free network were analyzed. (B) LUAD-specific coexpression modules were analyzed, and 20 modules were identified. Each short vertical line corresponds to one gene. Each branch represents one expression module of highly interconnected groups of genes. Below the dendrogram, each group of genes has been given one color, which indicates its module assignment. Gray suggests that the genes were outside all modules. (C) The associations between modules and LUAD were analyzed, which showed that the brown module was identified as having the most significant association with LUAD (P=6e-50). (D) The associations between brown module membership and LUAD were analyzed, which showed that the genes in the module and LUAD had a stronger association (P<1e-200). (E) The mean significance across modules was analyzed, which showed that the brown module with the highest mean significance and a lower variation was the module with the most significant association with LUAD. WGCNA, weighted gene coexpression network analysis; LUAD, lung adenocarcinoma.

Key gene identification in the four most significant modules

To identify the key genes among the four most significant modules, DEGA and overlapping analyses were performed in the four modules. The results showed that 392 (Dataset A), 869 (Dataset B), 483 (Dataset C) and 2,270 (Dataset D) DEGs were separately identified (), and 52 common DEGs (5 upregulated and 47 downregulated) were identified (, ). These DEGs were mainly involved in blood vessel development (GO:0001568, P=1.80e-6), vasculature development and regulation (GO:1901342, P=2.17e-6) and tube development (GO:0035295, P=6.03e-6). The expressions of 52 common DEGs are shown using a heatmap in .
Figure 6

DEGs identified in the most significant modules and PPI analysis. (A) In total, 392, 869, 483 and 2,270 DEGs were separately identified in the four most significant modules. (B) 52 consistent DEGs were identified in the four most significant modules. (C) The mRNA expressions of 52 consistent DEGs between LUAD and normal lung tissues were visualized using a heatmap. (D) The interactive relationships between 52 consistent DEGs were analyzed using a PPI network, and with a minimum required interaction score = the high confidence 0.7, a total of 12 (1 up- and 11 down-regulated genes) was filtered into the PPI network. Each node represents one gene, and bigger nodes represents genes with more links. Each red cycle node represents one upregulated gene and each green cycle node represents one downregulated gene. Each edge represents the interactive relationship between two genes. (E) Highly correlated modules were analyzed using the MCODE algorithm in the whole PPI network, and one highly correlated module with 5 nodes and 10 edges was identified. All 5 genes including ADRB2, RAMP2, CALCRL, RAMP3 and VIPR1 in the module were significantly downregulated genes, and each pair of genes had an interactive relationship. (F) All the genes were lowly expressed in LUAD tissue (P<0.001). (G) Correlation analysis showed that there were stronger positive correlations in expression among 5 genes (all R2>0.70). (H) Five genes in the subnetwork were validated to have significantly downregulated expression in LUAD tissue by analyzing transcriptome sequencing data. (I) Five genes had stronger positive correlations in expression in the transcriptome sequencing data (all R2>0.80). DEGs, differentially expressed genes; LUAD, lung adenocarcinoma; PPI, protein-protein interaction; MCODE, molecular complex detection.

Table 3

Common DEGs in the four most significant co-expression modules

Gene symbolDataset ADataset BDataset CDataset D
logFCFDRlogFCFDRlogFCFDRlogFCFDR
Up-regulated
   PYCR11.08342.60E-341.29662.01E-262.78214.05E-073.73879.68E-93
   CEACAM11.12907.28E-181.27447.37E-111.20711.90E-022.09552.63E-30
   ABCC31.36763.89E-221.91651.23E-193.79397.01E-092.56861.13E-37
   GOLM11.88247.93E-362.02768.61E-291.88704.45E-072.67544.03E-65
   HMGB32.14411.80E-361.78983.38E-131.98505.23E-043.52506.97E-48
Down-regulated
   SFTPC3.78468.10E-263.19336.72E-112.79645.01E-044.70779.03E-69
   FCN33.61581.00E-463.44804.02E-203.89869.14E-064.49313.91E-172
   TMEM1003.52931.35E-413.60157.46E-183.90736.85E-094.33146.7E-141
   ABCA82.93767.81E-502.82462.07E-161.37042.67E-052.83181.23E-71
   CDH52.58281.01E-512.23667.80E-242.20762.85E-082.52344.06E-148
   GPM6A2.54673.4E-473.92362.82E-282.55053.51E-084.43389.26E-122
   EDNRB2.54499.55E-502.99114.45E-222.61276.85E-093.47581.75E-177
   GRK52.49352.23E-571.88522.82E-132.08772.73E-082.46362.62E-128
   TEK2.47871.72E-522.62683.02E-242.67992.85E-083.22796.82E-191
   CLDN182.44809.25E-323.28869.02E-142.21001.33E-033.51312.38E-38
   CA42.31059.1E-463.21812.02E-313.52263.35E-084.33142.42E-104
   TGFBR32.27104.24E-422.16431.94E-162.07911.06E-082.55366.25E-89
   LDB22.23292.04E-452.18223.70E-211.90131.88E-092.58738.98E-170
   CAV22.17061.36E-212.10452.90E-121.76101.53E-072.31531.51E-70
   EMP22.06213.99E-471.83999.91E-181.65467.72E-072.70984.00E-179
   VIPR12.03894.11E-371.94172.10E-152.63027.84E-103.20943.43E-127
   RAMP32.01181.71E-382.05341.66E-212.82867.72E-083.19955.78E-185
   GIMAP61.91548.16E-341.98641.40E-171.77761.18E-062.06852.62E-91
   DOCK41.83721.05E-301.49403.12E-201.26945.15E-071.53971.59E-59
   RAMP21.82501.01E-451.76282.60E-182.61982.92E-072.85861.83E-166
   RASIP11.80372.46E-452.00141.30E-282.15612.23E-082.28192.09E-101
   EMCN1.79351.17E-422.50493.43E-171.77847.21E-052.72281.43E-114
   SLC6A41.79091.87E-262.90931.14E-174.59297.01E-096.13709.64E-174
   IL331.75378.30E-292.07291.04E-121.54443.11E-052.07784.94E-41
   ERG1.74765.85E-401.49349.67E-142.09681.26E-052.12483.21E-115
   ADRB21.61502.42E-381.97916.43E-201.46621.3E-043.05411.69E-128
   ANXA31.57541.37E-171.79501.01E-101.49922.27E-041.83782.18E-41
   ITM2A1.54843.02E-321.68883.30E-151.38922.10E-072.01611.33E-67
   TBX31.49514.79E-311.38481.96E-131.97553.53E-052.11081.26E-71
   JAM21.48447.69E-501.95118.55E-201.82707.72E-082.49281.57E-150
   CD931.45641.46E-281.69247.67E-171.84296.54E-072.18425.72E-103
   GHR1.45035.31E-331.88302.24E-151.57695.72E-042.01602.21E-45
   STXBP61.43501.56E-462.63711.02E-192.91722.99E-113.36233.84E-103
   TMEM471.43172.40E-181.43651.25E-121.37151.49E-041.72872.81E-55
   FERMT21.36054.41E-311.29327.35E-081.04525.26E-051.39229.59E-63
   SEMA6A1.32789.00E-211.60104.60E-122.36031.33E-102.54391.49E-95
   SEMA5A1.32001.03E-301.59498.43E-191.93592.09E-052.51486.65E-77
   PTPRB1.31433.05E-362.16487.02E-211.93851.05E-052.58551.33E-119
   CALCRL1.30012.6E-222.03592.50E-092.29401.28E-042.74438.75E-150
   SASH11.27492.18E-411.42064.38E-181.61552.34E-081.83522.13E-102
   S1PR11.26911.09E-402.02061.38E-222.05518.06E-062.82524.29E-189
   LMCD11.25795.45E-291.10331.87E-131.41613.59E-041.72078.72E-84
   SPTBN11.25116.33E-241.65362.06E-091.55446.94E-061.71501.41E-95
   EML11.24831.09E-351.42292.21E-121.39914.63E-061.67252.98E-51
   PODXL1.18691.80E-261.10763.11E-151.25307.96E-051.06403.71E-28
   KL1.11225.76E-302.29872.82E-242.01173.83E-052.30512.19E-41
   PDK41.11173.07E-181.85441.058E-061.86224.63E-042.38712.32E-53

DEG, differentially expressed gene; FC, fold change; FDR, false discovery rate.

DEGs identified in the most significant modules and PPI analysis. (A) In total, 392, 869, 483 and 2,270 DEGs were separately identified in the four most significant modules. (B) 52 consistent DEGs were identified in the four most significant modules. (C) The mRNA expressions of 52 consistent DEGs between LUAD and normal lung tissues were visualized using a heatmap. (D) The interactive relationships between 52 consistent DEGs were analyzed using a PPI network, and with a minimum required interaction score = the high confidence 0.7, a total of 12 (1 up- and 11 down-regulated genes) was filtered into the PPI network. Each node represents one gene, and bigger nodes represents genes with more links. Each red cycle node represents one upregulated gene and each green cycle node represents one downregulated gene. Each edge represents the interactive relationship between two genes. (E) Highly correlated modules were analyzed using the MCODE algorithm in the whole PPI network, and one highly correlated module with 5 nodes and 10 edges was identified. All 5 genes including ADRB2, RAMP2, CALCRL, RAMP3 and VIPR1 in the module were significantly downregulated genes, and each pair of genes had an interactive relationship. (F) All the genes were lowly expressed in LUAD tissue (P<0.001). (G) Correlation analysis showed that there were stronger positive correlations in expression among 5 genes (all R2>0.70). (H) Five genes in the subnetwork were validated to have significantly downregulated expression in LUAD tissue by analyzing transcriptome sequencing data. (I) Five genes had stronger positive correlations in expression in the transcriptome sequencing data (all R2>0.80). DEGs, differentially expressed genes; LUAD, lung adenocarcinoma; PPI, protein-protein interaction; MCODE, molecular complex detection. DEG, differentially expressed gene; FC, fold change; FDR, false discovery rate.

PPI network construction and essential gene identification based on 52 common DEGs

To elucidate the interactive relationships among 52 DEGs, a PPI network was constructed. At a minimum required interaction score = high confidence 0.7, a total of 12 (1 upregulated and 11 downregulated) among 52 DEGs was filtered into the PPI network. A PPI network with 12 nodes and 18 edges was established (). One highly correlated module was identified in the whole PPI network, and the module included 5 nodes and 10 edges (score =5.00) (). Centrality analysis showed that the 5 downregulated genes including ADRB2, RAMP2, CALCRL, VIPR1 and RAMP3 had the same centrality score () and were identified as essential genes involved in LUAD. All the essential genes were lowly expressed in LUAD tissue (all P<0.001) (), and every pair of genes showed a strong positive correlation in expression (all R2>0.75 and P<0.001) (). On the basis of the data of 20 transcriptomes, the expressions of 5 genes were significantly downregulated in LUAD tissue (all P<0.01) () and had strong correlations in expression (all R2>0.80 and P<0.001) ().
Table 4

Centrality scores of five essential genes according to seven centrality methods

RankGeneSubgraphDegreeEigenvectorInformationBetweennessClosenessNetwork
1 ADRB2 11.213940.44723.12501.04.0
2 VIPR1 11.213940.44723.12501.04.0
3 RAMP3 11.213940.44723.12501.04.0
4 RAMP2 11.213940.44723.12501.04.0
5 CALCRL 11.213940.44723.12501.04.0

Prognostic analysis based on 5 essential genes

To elucidate the relationships between 5 essential genes and the OS of LUAD patients, survival analysis was performed using the KM estimate and LR test. According to the P<0.05 cut-off criterion, ADRB2 (P=0.01009) and VIPR1 (P=0.01613) were identified as associated with the OS of LUAD patients (), and ADRB2 (HR =0.68215, 95% CI: 0.5081–0.9143) and VIPR1 (HR =0.70189, 95% CI: 0.5199–0.9369) were protective genes with HR <1 (). Higher expressions of the two genes resulted in a higher OS rate of LUAD patients (). Similar results were found in the univariate Cox regression analysis, and ADRB2 and VIPR1 had significant prognostic value in LUAD patients (). Higher mRNA expressions of ADRB2 (P=0.002, HR =0.854, 95% CI: 0.775–0.943) and VIPR1 (P<0.001, HR =0.831, 95% CI: 0.755–0.914) had lower HRs and resulted in a higher OS rate (). The multivariate Cox regression model with the stepwise method based on the 5 essential genes showed that ADRB2 and VIPR1 had significant prognostic value for the OS of LUAD patients (P=0.000124). A two-gene prognostic model was established, and the risk score of each patient was calculated. On the basis of the median of the risk scores, the LUAD patients were divided into high-risk and low-risk subgroups. The mortality rate of the high-risk subgroup was significantly higher than that of the low-risk group [43.24% (109 in 252 patients) vs. 29.37% (74 in 252 patients), P=0.001637], and the high-risk subgroup had a worse prognosis compared to the low-risk group (P=0.00225, HR =1.57692, 95% CI: 1.1749–2.212, ).
Figure 7

Prognostic and diagnostic analyses. (A) Kaplan-Meier survival curves showed that the expressions of the ADRB2 and VIPR1 genes were associated with the OS of LUAD patients (P=0.01009 and 0.01613, separately), and high expressions of these two genes resulted in a higher OS rate of LUAD patients. (B) Univariate Cox analysis showed that ADRB2 and VIPR1 were significantly associated with the OS of LUAD patients (P=0.002 and <0.001), and were protective genes, with HR <1. (C) Survival analysis showed that the high-risk group had a worse prognosis among LUAD patients (P=0.00225). (D) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the low-risk group (both P<2.2e-16). (E) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the alive group (P=0.001724 and 0.000709, separately). (F) The two genes were associated with the T stage, gender and survival status between high- and low-risk groups (all P<0.05). (G) The ROC curve showed that the AUC of the two-gene prognostic model was 0.603. (H) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (I) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (both P<0.001). (J) Diagnostic analysis showed that the 2-gene signature had high specificity and sensitivity for predicting the diagnosis of LUAD patients. *, P<0.05; **, P<0.01. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic; AUC, area under curve.

Prognostic and diagnostic analyses. (A) Kaplan-Meier survival curves showed that the expressions of the ADRB2 and VIPR1 genes were associated with the OS of LUAD patients (P=0.01009 and 0.01613, separately), and high expressions of these two genes resulted in a higher OS rate of LUAD patients. (B) Univariate Cox analysis showed that ADRB2 and VIPR1 were significantly associated with the OS of LUAD patients (P=0.002 and <0.001), and were protective genes, with HR <1. (C) Survival analysis showed that the high-risk group had a worse prognosis among LUAD patients (P=0.00225). (D) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the low-risk group (both P<2.2e-16). (E) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the alive group (P=0.001724 and 0.000709, separately). (F) The two genes were associated with the T stage, gender and survival status between high- and low-risk groups (all P<0.05). (G) The ROC curve showed that the AUC of the two-gene prognostic model was 0.603. (H) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (I) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (both P<0.001). (J) Diagnostic analysis showed that the 2-gene signature had high specificity and sensitivity for predicting the diagnosis of LUAD patients. *, P<0.05; **, P<0.01. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic; AUC, area under curve. The expression levels of ADRB2 and VIPR1 were analyzed between the high-risk and low-risk groups. The results showed that both ADRB2 and VIPR1 were significantly highly expressed in the low-risk group (both P<2.2e-16, ). Further, the expression levels of ADRB2 and VIPR1 were observed between the alive and dead patient groups. The result showed that both ADRB2 and VIPR1 were significantly highly expressed in the alive patient group (P=0.001724 and 0.000709, separately, ), which indicates that high expressions of ADRB2 and VIPR1 contribute to a low risk for LUAD patients and lengthen survival of LUAD patients. To elucidate the associations between the risk scores and clinicopathological features, clinicopathological features were analyzed between the high- and low-risk subgroups. We observed significant differences between the two risk subgroups with respect to the T stage (P<0.01), gender (P<0.05) and survival status (P<0.05) (). The ROC indicates that the risk score can predict the prognosis of LUAD patients (). To determine whether the risk signature is an independent prognostic indicator, univariate and multivariate Cox regression analyses were performed. According to univariate Cox regression analysis, the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001) (). Multivariate Cox regression analysis showed that risk score and pathological stage were significantly correlated with the OS of LUAD patients (both P<0.001) (). These results indicate that the risk score can independently predict the OS in LUAD patients.

Diagnostic analysis of the 2-gene signature

Through the diagnostic analysis of four datasets, the results showed that both the specificity and sensitivity of classification exceed 93% in the three datasets (), and the AUCs were over 94% (). Among these values, the AUC was 100% in the GSE102511 dataset, which indicates that the 2-gene signature has very high effectiveness as a diagnostic biomarker in predicting the diagnosis of LUAD patients.

Expression analysis of 12-gene and 2-gene signatures in paired sample datasets

In this study, some paired samples were included in GSE7670 and GSE19804. To validate the expression of DEGs in 12-gene and 2-gene signatures, paired samples were selected and analyzed the expression difference between tumor and normal lung tissues using the paired t-test method. The results showed that 14 genes had statistical significant in expression between two groups in two datasets (all P<0.05, and ), which was consistent with the results obtained from the unpaired t-test method.
Figure 8

The gene expression analysis of 12-gene and 2-gene signatures in GSE7670 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

Figure 9

The gene expression analysis of 12-gene and 2-gene signatures in GSE19804 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

The gene expression analysis of 12-gene and 2-gene signatures in GSE7670 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05). The gene expression analysis of 12-gene and 2-gene signatures in GSE19804 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

Identification of small-molecule drugs

To identify potential adjuvant drugs based on 14 prognostic genes (12-gene and 2-gene signatures) to guide the therapy of LUAD patients, small-molecule drugs were screened using three databases including CMap, L1000FWD and DGIdb. Totals of 87, 84 and 315 small molecules were separately identified in three small-molecule databases, and 1 small molecule with the drug name irinotecan was simultaneously found (). By pairwise comparison, 3 (podophyllotoxin, tanespimycin, and irinotecan), 4 (methotrexate, irinotecan, timolol, and hydrocortisone) and 6 (vincristine, teniposide, idarubicin, amsacrine, irinotecan, and etoposide) small molecules were found ().
Figure 10

Small molecule prediction. Eleven small-molecule drugs were predicted, and among these small molecules, irinotecan was found in three small molecule databases including CMap, DGIdb and L1000FWD.

Small molecule prediction. Eleven small-molecule drugs were predicted, and among these small molecules, irinotecan was found in three small molecule databases including CMap, DGIdb and L1000FWD.

Discussion

LUAD, as the most commonly pathological subtype of LC and that accounts for more than 40% of LCs (3,4), is one of the leading causes of LC-related deaths (1). Despite recent advances in detection technologies and treatment methods, the 5-year OS rate of LUAD patients in all stages remains very poor, at less than 20% (3,6). The identification of diagnostic and prognostic biomarkers may contribute to improving the survival rate of LUAD patients. However, due to the high heterogeneity of LUAD, the results reported from different studies vary enormously, and some identified biomarkers are not widely accepted for predicting clinical outcomes. To identify consistently acceptable diagnostic and prognostic biomarkers to improve clinical outcomes, systematic analysis is essential to explore the molecular mechanisms involved in LUAD and identify some key diagnostic and prognostic signatures by integrating LUAD-related gene expression data. In this study, we systematically integrated five LUAD-related gene expression datasets using bioinformatics methods including WGCNA, DEGA, PPI network, and prognostic and diagnostic analyses to identify transcriptome characterization to mine the key genes involved in LUAD. Finally, 12-gene (ZWINT, MELK, CCNB2, RRM2, TTK, KIF20A, TOP2A, ASPM, CDC20, CCNB1, TYMS and KIF4A) and 2-gene (ADRB2 and VIPR1) signatures were identified in LUAD and may serve as diagnostic and prognostic markers of LUAD patients. Furthermore, eleven small-molecule drugs related to the 12-gene and 2-gene signatures were identified, providing novel insights for LUAD therapeutic studies. As can be seen from the “Methods” and “Results” sections, 12-gene and 2-gene signatures were identified by a systematical analysis based on DEGs obtained from an unpaired t-test method. Before discussing the potential applicability of two signatures as diagnostic and prognostic markers, we must eliminate several doubts that readers may have on the processing method: (I) GSE7670 and GSE19804 datasets included many paired samples, why used an unpaired t-test for DEGA? (II) Why were GSE7670 and GSE10072 datasets merged into one dataset? (III) Why the microarray datasets and RNA-seq dataset do not use the same cutoff standard when performing DEGA? For the first question, theoretically it is more suitable to use paired t-test to process paired samples. We tried to use the paired t-test method to screen DEGs on paired sample datasets. However, few DEGs with |logFC|>1 and adjusted P<0.05 were identified using the method. Especially, the adjusted P values were significantly improved in the paired t-test model. We consider that the adjusted P values may be over-corrected in the test model when the sample number is smaller. The main reason may be that the degree of freedom is reduced and the gene expression matrix is re-standardized after samples were paired. It is not rigorous to only rely on logFC value to identify differentially expressed genes. In order to compensate the deficiency, we used the paired t-test method to analyze the expression differences of DGEs in the two signatures based on single gene. The results showed that all DEGs had statistical significance in the expression in GSE7670 and GSE19804 datasets (P<0.05, and ). For the second question, the sample number is more and the statistical reliability of the data is stronger in theory. Due to the same microarray platform, two datasets were merged into one dataset by reconstructing gene expression profile after background was corrected using a normalized microarray preprocessing procedure in the affy package. To a greater extent, the method eliminates some technical errors than the method of simply merging gene expression profiles. This method is expected to improve statistical power to get more reliable results. Judging from the results of this study, the method is feasible. For the third question, RNA-seq dataset included more samples and genes than microarray dataset, which indicates that RNA-seq dataset can have a better “resolution” in identifying DEGs. So, a more stringent cutoff standard was used for RNA-seq dataset. In the 12-gene signature, all the genes were enriched in biological pathways related to the cell cycle (). As we all know, the cell cycle is one of the most critical pathways closely associated with cancers, and an abnormal cell cycle will initiate malignant tumors (41,42). So far, some genes related to the cell cycle have also been confirmed to play important roles in LC and serve as potential prognostic candidates to predict the survival of NSCLC patients, such as ZWINT and CDC20, as well as BUB1B (43,44). The current results were consistent with a previous study (44). ZWINT is an important regulatory protein and plays key roles in chromosome movement and mitotic checkpoints (45,46). Some studies showed that the dysfunction of ZWINT resulted in many types of cancers such as breast and ovarian cancers (45), and the overexpression of ZWINT predicted a poor prognosis (47,48). However, a few studies also showed that ZWINT was a protective gene in hepatocellular carcinoma, and its increased expression contributed to a good prognosis (49,50). In this study, ZWINT was identified as a risky factor in LUAD, and its upregulated expression resulted in a poor prognosis. This result was consistent with previous studies indicating that a high expression of ZWINT was closely related to a poor prognosis of LUAD patients (45). CCNB2, which is a member of the cyclin family, is one of the essential components of the cell cycle regulatory machinery and plays a key role in transforming growth factor beta-mediated cell cycle control. CCNB2 has been reported to associate with many cancers (51), and its overexpression predicted a poor prognosis in NSCLC patients (52). CCNB1 is another member of the cyclin family, and its gene product plays critical role in controlling the G2/M transition phase of the cell cycle by forming the maturation-promoting factor (MPF) with CDC2 (53). The upregulation of CCNB1 in tumor tissue predicted a worse prognosis in hepatocellular carcinoma patients (54), and it may serve as potential therapeutic target (55). Presently, a few studies have shown that CCNB1 plays a role in NSCLC (56), and its polymorphisms were associated with clinical outcomes (57). However, the function of CCNB1 remains little known in LUAD. Our studies demonstrated that CCNB1 plays important role in LUAD, which provides novel insight into pathological mechanism involved in LUAD. CDC20, which is an important regulatory protein in the cell cycle, has been identified to be negatively regulated by p53, and used as a potential cancer therapeutic target (58). Several studies showed that the expression of CDC20 was related to the survival in some cancers such as colorectal cancer and hepatocellular carcinoma, and its overexpression predicted a poor prognosis (54,59). TTK is a protein kinase, and its expression is closely associated with cell proliferation (60). Silencing TTK expression inhibits proliferation and progression (61), and the overexpression of TTK promoted breast cancer cell proliferation and conferred a poorer prognosis (62), which indicates that TTK is a risky gene in cancer. RRM2 encodes one of two nonidentical subunits for ribonucleotide reductase that catalyzes the formation of deoxyribonucleotides from ribonucleotides. Several studies have reported that the gene expression was associated with cancers, and it has been identified as potential prognostic marker in patients with NSCLC or breast cancer (63-65). ASPM is the human ortholog of the Drosophila melanogaster abnormal spindle gene (asp), which plays essential role in normal mitotic spindle function in embryonic neuroblasts (66). Recently, some studies showed that ASPM expression was associated with the survival of cancer patients, and its upregulation resulted in a poor prognosis in breast cancer and pancreatic ductal adenocarcinoma (62,67). KIF4A and KIF20A are two members of the kinesin family that are mainly responsible for movement along the microtubules in the cell, which has been demonstrated to be associated with many diseases including cancers (68-71). The other three genes including MELK, TOP2A and TYMS are enzyme-encoding genes, which separately play roles in BPs related to cell cycle regulation, chromosome status, and DNA replication and repair. So far, some published studies have reported that these genes played roles in the initiation and progression of cancers and have been identified as predictors of the survival in cancer patients. In the 2-gene signature, ADRB2 is a protein coding gene encoding the beta2 adrenergic receptor that belongs to a member of the G protein-coupled receptor (GPCR) superfamily. Many studies have confirmed that several diseases such as asthma (72), obesity (73) and type 2 diabetes (74) were associated with ADRB2. Recently, researchers have found that ADRB2 was significantly correlated with various aspects related to cancer such as prostate cancer (75) and breast cancer (74), which indicates that ADRB2 is related to cell proliferation and apoptosis, tumor growth and metastasis, and angiogenesis. A few studies have so far discovered that ADRB2 activation promoted the proliferation of LC cells (76), and it was identified as a potential independent factor for early-stage NSCLC patients (77). Our results confirmed that the dysregulation of ADRB2 played a key role in LUAD, and high mRNA expression of ADRB2 resulted in a higher OS rate in LUAD patients. Similar to ADRB2, VIPR1 encodes a small neuropeptide that belongs to GPCR. VIPR1 is widely expressed among various tissues and plays key roles in many physiological functions including immune regulation, glycogen metabolism, etc. (78). Previous studies have found that VIPR1 was differentially expressed between cancer and normal tissues in many malignancies (78). For example, VIPR1 was highly expressed in prostate cancer (79), breast cancer (80) and colon cancer (81), which indicates that it functions in malignant tumors. Some studies have also reported that VIPR1 was significantly downregulated in LC (82) and may serve as a molecular target for the diagnosis, prevention and treatment of LC (80). This study further demonstrated that VIPR1 played roles in LUAD, and high mRNA expression of VIPR1 resulted in a higher OS rate in LUAD patients. Interestingly, both ADRB2 and VIPR1 identified in this study are GPCRs that belong to the largest family of cell surface receptors mediating many physiological processes (83). Large numbers of studies have proven that GPCRs play roles in many diseases processes such as human genetic and endocrine diseases (84,85), and GPCRs are often designated as drug targets (86). Despite their broad physiological and pathological functions, as well as acting as favorable sites for drug development, the roles of GPCRs have been underappreciated for a long time in tumor biology (83). With the discovery of more GPCRs associated with cancers, researchers have recognized the importance of the functions of GPCRs in tumor biology and paid more attention to GPCRs in many aspects such as molecular machinery (87-89) and tumor-related drug development (86,90). In this study, we found that two key GPCR genes played roles in LUAD by a comprehensive analysis, which indicates that GPCRs provide important functions in tumor biology. Two identified genes may serve as potential diagnostic and prognostic candidates and pharmacological drug targets in LUAD patients. In addition, several small molecules were identified as potential therapeutic drugs to combat LUAD in this study. Especially, irinotecan (Drugbank accession number: DB00762) was found in three small-molecule prediction databases. Irinotecan is a derivative of camptothecin that inhibits the action of topoisomerase I and can prevent the religation of the DNA strand by binding to the topoisomerase I-DNA complex and causing double-strand DNA breakage and cell death. Irinotecan, as an antineoplastic enzyme inhibitor, is primarily used in the treatment of colorectal cancer. Recently, irinotecan was also approved for treating advanced pancreatic cancer. In LC, the therapeutic efficacy of irinotecan was evaluated by a large number of trials (91-93). In particular, irinotecan has been proven to be effective as a chemotherapeutic drug against small cell lung cancer (SCLC), and SCLC patients receiving maintenance chemotherapy with irinotecan had a longer survival (94). Currently, a few studies have also explored the treatment effect of irinotecan against NSCLC (95).Our results showed that irinotecan might serve as potential therapeutic drug against LUAD. In addition, other small molecules including vincristine, teniposide, etoposide, methotrexate, podophyllotoxin and others were also predicted in this study. Some of these small molecules such as vincristine and teniposide have so far been used for anti-tumor therapy (96,97), and these small molecules may play certain roles in combating LUAD. Despite the findings in terms of clinical implications, some limitations should be noted. First, this study is a retrospective study based on a bioinformatics strategy, so the robustness of the prediction value of the gene signature should be further validated by prospective clinical trials. Second, the functional roles of the gene signature should be further elucidated in LUAD.

Conclusions

Taken together, the present study systematically analyzed gene expression data related to LUAD using comprehensive bioinformatics methods and identified some key genes associated with the diagnosis and prognosis of LUAD patients. In addition, some small molecules were predicted to combat LUAD. These findings provide novel insights into the pathological mechanism involved in LUAD and may serve as potential diagnostic and prognostic markers and therapy targets against LUAD.
  97 in total

1.  Genomic Landscape of Atypical Adenomatous Hyperplasia Reveals Divergent Modes to Lung Adenocarcinoma.

Authors:  Smruthy Sivakumar; F Anthony San Lucas; Tina L McDowell; Wenhua Lang; Li Xu; Junya Fujimoto; Jianjun Zhang; P Andrew Futreal; Junya Fukuoka; Yasushi Yatabe; Steven M Dubinett; Avrum E Spira; Jerry Fowler; Ernest T Hawk; Ignacio I Wistuba; Paul Scheet; Humam Kadara
Journal:  Cancer Res       Date:  2017-09-26       Impact factor: 12.701

2.  CytoNCA: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks.

Authors:  Yu Tang; Min Li; Jianxin Wang; Yi Pan; Fang-Xiang Wu
Journal:  Biosystems       Date:  2014-11-15       Impact factor: 1.973

Review 3.  Cell cycle control and cancer.

Authors:  L H Hartwell; M B Kastan
Journal:  Science       Date:  1994-12-16       Impact factor: 47.728

4.  Prognostic significance of β2-adrenergic receptor expression in non-small cell lung cancer.

Authors:  Tomohiro Yazawa; Kyoichi Kaira; Kimihiro Shimizu; Akira Shimizu; Keita Mori; Toshiteru Nagashima; Yoichi Ohtaki; Tetsunari Oyama; Akira Mogi; Hiroyuki Kuwano
Journal:  Am J Transl Res       Date:  2016-11-15       Impact factor: 4.060

5.  Phase II Trial of Carfilzomib Plus Irinotecan in Patients With Small-cell Lung Cancer Who Have Progressed on Prior Platinum-based Chemotherapy.

Authors:  Susanne M Arnold; Kari Chansky; Maria Q Baggstrom; Michael A Thompson; Rachel E Sanborn; John L Villano; Saiama N Waqar; John Hamm; Markos Leggas; Maurice Willis; Joseph Rosales; John J Crowley
Journal:  Clin Lung Cancer       Date:  2020-01-27       Impact factor: 4.785

6.  ASPM is a major determinant of cerebral cortical size.

Authors:  Jacquelyn Bond; Emma Roberts; Ganesh H Mochida; Daniel J Hampshire; Sheila Scott; Jonathan M Askham; Kelly Springell; Meera Mahadevan; Yanick J Crow; Alexander F Markham; Christopher A Walsh; C Geoffrey Woods
Journal:  Nat Genet       Date:  2002-09-23       Impact factor: 38.330

Review 7.  G protein-coupled receptors disrupted in human genetic disease.

Authors:  Miles D Thompson; Maire E Percy; W McIntyre Burnham; David E C Cole
Journal:  Methods Mol Biol       Date:  2008

8.  Roles of CCNB2 and NKX3-1 in Nasopharyngeal Carcinoma.

Authors:  Di Qian; Weichang Zheng; Cuixia Chen; Guanghuai Jing; Junxuan Huang
Journal:  Cancer Biother Radiopharm       Date:  2020-03-23       Impact factor: 3.099

Review 9.  G protein-coupled receptors as promising cancer targets.

Authors:  Ying Liu; Su An; Richard Ward; Yang Yang; Xiao-Xi Guo; Wei Li; Tian-Rui Xu
Journal:  Cancer Lett       Date:  2016-03-18       Impact factor: 8.679

10.  ZW10 Binding Factor (ZWINT), a Direct Target of Mir-204, Predicts Poor Survival and Promotes Proliferation in Breast Cancer.

Authors:  Guangrong Zhou; Mingyang Shen; Zhengyuan Zhang
Journal:  Med Sci Monit       Date:  2020-04-05
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.