Literature DB >> 35116665

Systematically integrative analysis identifies diagnostic and prognostic candidates and small-molecule drugs for lung adenocarcinoma.

Qiang Chen^1,2, Xiaoyi Wang¹, Jing Hu^3,4.

Abstract

BACKGROUND: Lung adenocarcinoma (LUAD) is the most common histological subtype of lung cancer (LC). However, the early-stage diagnostic rate is still low, and the 5-year overall survival (OS) rate remains poor. The present study aimed to identify critical genes as diagnostic and prognostic markers and small-molecule drugs for combating LUAD using a systematic bioinformatics analysis.
METHODS: Five gene expression profiling datasets were systematically integrated and analyzed. First, gene coexpression modules were identified, and differentially expressed genes (DEGs) were screened. Second, the functional changes of these DEGs were systematically investigated. Third, the protein-protein interaction network, high correlation module and key genes were identified. Fourth, prognosis and diagnostic analyses were performed. Fifth, small-molecule drugs were predicted for guiding LUAD therapy.
RESULTS: Finally, 12-gene and 2-gene signatures were identified as diagnostic and prognostic markers. The areas under the curves (AUCs) of two signatures associated with 3-year survival were 0.686 and 0.603, respectively. The AUCs of two signatures were over 95% and 94% in diagnostic model, separately. Eleven small-molecule drugs were found and irinotecan was simultaneously predicted in three drug databases.
CONCLUSIONS: The present study identified some key dysregulated genes involved in LUAD and potential drugs by a comprehensive analysis, which provides novel insights into the pathological mechanism involved in LUAD and may shed light on the diagnosis, prognosis and treatment of LUAD patients. 2021 Translational Cancer Research. All rights reserved.

Entities: Chemical

Keywords: Lung adenocarcinoma (LUAD); biomarker; diagnosis; prognosis; small-molecule drug

Year: 2021 PMID： 35116665 PMCID： PMC8797894 DOI： 10.21037/tcr-21-526

Source DB: PubMed Journal: Transl Cancer Res ISSN： 2218-676X Impact factor: 1.241

Introduction

Lung cancer (LC), one of the most common malignancies, is the leading cause of cancer-related deaths worldwide (1).Especially in recent years, the morbidity and mortality of LC have been increasing year by year, and LC has ranked first among all malignancies for many years in some countries such as China (2019 National Cancer Report from China National Cancer Center). Non-small cell lung cancer (NSCLC) is the most predominant pathological type, constituting more than 80% of LCs (2), of which lung adenocarcinoma (LUAD) is the major histological subtype and accounts for more than 40% of LCs (3,4). Annually, LUAD results in more than 600,000 deaths all over the world (5). Despite recent advances in molecular diagnosis and multimodality therapies, the 5-year overall survival (OS) rate of LUAD patients in all stages is only approximately 15% (3,6). LUAD patients diagnosed at an early stage have a higher 5-year OS rate with 70–90% (7). However, no more than 20% of LUAD patients are diagnosed in a timely manner at an early stage (8), and 35–50% of patients diagnosed and treated at an early stage will relapse after surgical resection (9), which indicates a very poor prognosis. To improve the survival rate of LUAD patients, it is vital to uncover the underlying molecular mechanisms of LUAD and identify potential molecular diagnostic and prognostic biomarkers and/or therapeutic targets to combat LUAD. Currently, the diagnosis and prognosis of LUAD patients are mainly evaluated on the basis of many clinical and pathological features. Due to the high heterogeneity of LUAD, many clinical variables correlating with prognosis bring difficulty in predicting clinical outcomes upon detecting LUAD at an early stage (10). Recently, several molecular factors such as gene mutation and overexpression have been used to guide the clinical care of LUAD patients (3,11). For example, EGFR mutations and ALK fusions have been used as targets of molecular targeted therapy (3,12,13). However, EGFR and ALK alterations were found in only a small fraction of LUAD patients, and the majority of LUAD patients frequently harbored activating mutations such as KRAS, BRAF and ERBB2 (14) and loss-of-function mutations and deletions such as TP53, RB1 and CDKN2A (3,15,16). So far, few targeted molecular therapies have been clinically used for such alterations, and few prognostic biomarkers have been identified to predict clinical outcomes. Therefore, more knowledge of additional genes altered in LUAD is required to further guide the diagnosis, treatment and prognosis of LUAD. Gene expression analysis based on gene expression profiles is an important traditional method of investigating the differences in gene expression under different cell statuses, and a large number of differentially expressed genes (DEGs) associated with LUAD have been identified (17-20), such as AKT1, DDR2 and FGFR1. Some genetic factors including K-ras, FGF22 and LAPTM4B, as biomarkers, have also been investigated to predict the prognosis in LUAD patients (21-23). However, most DEGs reported in different studies vary greatly, and only a few consistent DEGs have been identified. In addition, few identified diagnostic and prognostic markers have been widely accepted for routine clinical use, and widely acceptable consistent key genes as biomarkers urgently require identification. Increasing LUAD-related gene expression data allows us to do the important work of identifying consistent key genes involved in LUAD by systematically integrative analysis. In this study, five LUAD-related gene expression profile datasets from the NCBI Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases were systematically integrated by bioinformatics methods including weighted gene coexpression network analysis (WGCNA), differentially expressed gene analysis (DEGA), functional enrichment analysis, and protein and protein interaction (PPI) network construction. Subsequently, diagnostic and prognostic analyses of identified critical genes were performed to identify diagnostic and prognostic candidates associated with LUAD patients. Last, potential small-molecule drugs related to key genes were identified to guide the treatment of LUAD. We present the following article in accordance with the TRIPOD reporting checklist and the MDAR checklist (available at https://dx.doi.org/10.21037/tcr-21-526).

Methods

The flow chart of systematic bioinformatics analysis in the current study is displayed in .

Figure 1

Flow chart of the bioinformatics analysis in the present study. LUAD, lung adenocarcinoma; TCGA, the cancer genome atlas; GEO, gene expression omnibus; DEGA, differentially expressed gene analysis; DEGs, differentially expressed genes; WGCNA, weighted gene coexpression network analysis; PPI, protein-protein interaction; MCODE, molecular complex detection; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; RP, reaction pathway; DO, disease ontology; LASSO, least absolute shrinkage and selection.

Data collection

Five LUAD-related gene expression datasets were reanalyzed by systematic bioinformatics methods in this study. Among these datasets, four microarray datasets were retrieved from the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geoprofiles/), including GSE10072 (19), GSE7670 (24), GSE19804 (20) and GSE102511 (25). The GSE10072, GSE7670 and GSE19804 datasets were generated using the Affymetrix microarray platform. The GSE102511 dataset was produced using the Ion Torrent Proton platform. The GSE10072 and GSE102511 datasets were from American patients with LUAD. The GSE10072 dataset included 58 eligible tumor tissues samples and 49 eligible normal lung tissues samples, and the GSE102511 dataset included 16 eligible tumor tissues samples and 15 eligible normal lung tissues samples. The GSE7670 and GSE19804 datasets were from Taiwanese patients with LUAD and included 27 and 56 eligible paired tissue samples, separately. Given the same microarray chip platform (GPL96), the GSE10072 and GSE7670 datasets were merged into one dataset by reconstructing the gene expression profiles, and the new dataset was named “Dataset A”. The GSE19804 and GSE102511 datasets were named “Dataset B” and “Dataset C”, respectively. A LUAD-related RNA-seq dataset was retrieved from TCGA database portal (2019, https://portal.gdc.cancer.gov/). The inclusion criteria for the RNA-seq data were as follows: (I) histological diagnosis for LUAD; (II) except LUAD, not suffering from other malignancies; and (III) data with complete clinical information. Finally, totals of 535 LUAD tissues samples and 59 non-LUAD normal lung tissues samples were included. The LUAD-related RNA-seq dataset was named “Dataset D”. These studies have been approved by the Institutional Review Board of the relevant participating institutions including the National Taiwan University and Taichung Veterans General Hospital (GSE19804), the Taipei Veterans General Hospital and Taichung Veterans General Hospital (GSE7670), 13 participating hospitals and National Cancer Institutes (GSE10072), the Aichi Cancer Center and Nagasaki University (GSE102511), and the National Cancer Institute of NIH (TCGA RNA-seq data). No additional approval from the ethics committee was required. The present study complies with the requirements of data usage and publishing from NCBI GEO and TCGA databases.

Data preprocessing and DEGA

All microarray data were standardized by a normalized microarray preprocessing procedure using the affy package (version 1.60.0) in Bioconductor project (version 3.9.0, http://www.bioconductor.org/) (26), and RNA-seq data were subjected to normalization using the trimmed mean of M-values (TMM) method based on the edgeR package (version 3.26.3) in Bioconductor project (27). DEG screening of microarray data was performed using the limma package (version 3.32.7) in Bioconductor project (28). The limma package employs the voom method, linear modeling and empirical Bayes moderation to assess DEGs, which yields more robust results, even with fewer microarrays. The edgeR package (version 3.26.3) in Bioconductor project was used to screen the DEGs between LUAD tissue and non-LUAD normal lung tissue samples of TCGA RNA-seq data (27).

WGCNA

Gene coexpression was analyzed using a WGCNA method, and WGCNA was performed using the WGCNA package (version 1.13) (29). First, an adjacency matrix was converted according to the gene expression matrix. Based on the adjacency matrix, the unsupervised coexpression relationship of each gene pair was computed by Pearson correlation coefficients. The soft threshold β was used to strengthen the correlation adjacency matrix, and the parameter β of each dataset was selected according to its scale-free topology criterion. Second, a topological matrix was converted according to the strengthened adjacency matrix, and the correlation of each gene pair was measured using the topological overlap measure (TOM). On the basis of TOM-based dissimilarity (1-TOM), the genes with coherent expression profiles were classified into a gene module using the average linkage hierarchical clustering method. Gene coexpression module was identified from the system cluster tree by a dynamic cutting algorithm. The modules with 75% similarity were merged into one module, and the representative gene in a module was identified as the module eigengene (ME). The correlation between the ME and gene module was defined as the module membership (MM). The gene differential expression was measured using the P value from t-test method between LUAD and normal lung tissues, and gene significance (GS) was computed by log10 transformation of the p value. The average GS was defined as the module significance (MS) of the module, and the MS indicated the correlation between the module and LUAD. A detailed description of the WGCNA method can be obtained from the original article (29).

PPI network construction and highly correlated module identification

The interactive relationships among DEGs encoding proteins were elucidated by constructing a PPI network, and the interactive relationships between gene pairs were retrieved from the online STRING database (version 11.0, https://string-db.org/) (30). Gene pairs with a combined score ≥0.7 were filtered to construct the PPI network, and the PPI network was established and visualized using Cytoscape software (version 3.7.0, http://www.cytoscape.org/) (31). Based on the topological properties of the whole PPI network, the highly correlated modules (subnetwork) were extracted from the whole PPI network using a Molecular COmplex DEtection (MCODE) algorithm. The MCODE analysis was performed using the plugin MCODE (version 1.5.1) in Cytoscape software (32). The threshold parameters were set as degree cut-off =4, node score cut-off =0.6, K-core =4 and max. depth =100.

Essential genes identification

Key genes were identified using seven centrality analyses in the PPI subnetwork (33). Seven centrality methods were Degree Centrality, Subgraph Centrality, Network Centrality, Eigenvector Centrality, Closeness Centrality, Betweenness Centrality and Information Centrality. The plugin CytoNCA (version 2.1.6) was used to perform the Centrality analyses in the Cytoscape software (34). The centrality score of each gene was computed by the centrality analyses, and the genes with higher centrality scores were identified as key genes. The intersecting genes of key genes were identified as the essential genes.

Identification of LUAD-specific prognostic gene signature

The LUAD patients from the TCGA database were used to perform the survival analysis. The Kaplan-Meier (KM) estimate and log-rank (LR) test were used to evaluate the associations between the essential genes and the OS of LUAD patients in the survival package (version 2.43-3, https://CRAN.R-project.org/package=survival). The group cut off was set to 50%, and the LR P value, hazard ratio (HR) and 95% confidence interval (CI) were computed. A P<0.05 indicated the statistical significant of the association between an essential gene and the OS of LUAD patients. A univariate Cox proportional hazards regression model was applied to evaluate the associations between the essential genes and the OS of LUAD patients. The same characteristic parameters obtained via the LR method were computed, and the same significant P value criterion was set. A multivariate Cox hazards regression model with the stepwise method was applied to assess the prognostic value for LUAD patients using the survival package in R project. The essential gene combination in the optimal Cox hazard regression model was used for further analyses. The hazards model was established as follows: where “ExpDEGn” represented the expression level of DEGn and “CoeDEGn” denoted the regression coefficient from the multivariate Cox regression model (35). On the basis of the median of the above risk scores, LUAD patients were divided into the low-risk and high-risk groups. The survival ROC package (version 1.0.3) was used to construct the receiver operating characteristic (ROC) curve, and the ROC was used to measure the risk prediction rate of DEGs between the low- and high-risk groups. LASSO Cox regression model analysis was performed using the glmnet package (version 4.0-2) based on R. The formula used to calculate the risk score is the same as that in the multivariate Cox hazards regression model, and the statistical methods are the same as those in the multivariate Cox hazards regression model.

Diagnostic analysis of prognostic genes

Diagnostic analysis of prognostic genes was performed using the support vector machine (SVM) method. The SVM classification model was constructed using the e1071 package (version 1.7-3) based on R. The radial basis function was applied in the SVM kernels, and the mRNA profiles were classified using 100 independent repetitions of 10-fold cross-validation. The specificity, sensitivity and accuracy were calculated.

Identification of candidate small-molecule drugs

Potential small-molecule drugs of the candidate prognostic genes were searched using three databases including CMap (https://portals.broadinstitute.org/cmap/) (36), L1000FWD (http://amp.pharm.mssm.edu/L1000FWD/) (37) and DGIdb (http://www.dgidb.org/) (38). The intersection of identified small molecules was indicative of potential candidate adjuvant drugs for use in LUAD patients.

Expression validation of essential genes by transcriptome sequencing data

To verify the expression of essential genes between LUAD tissue and normal lung tissue by bioinformatics methods, the transcriptomes of 10 nonsmoking LUAD patients of 35–50 years old (5 male and 5 female LUAD patients in early stages) from Xuanwei City (one of the areas with the highest morbidity and mortality of LC in China) in China were sequenced. All tissues were obtained from The First People’s Hospital of Yunnan Province. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by the Institutional Review Board of The First People’s Hospital of Yunnan Province (No. 2017YY227). Informed consent was taken from all the patients. Transcriptome sequencing data were generated using the Illumina HiSeq2500 platform with the paired-end sequencing method by Beijing Novogene Technology Co., Ltd. DEGs between LUAD tissues and normal lung tissues samples were screened using the DESeq2 (version 1.24.0) package in Bioconductor project (39), and |logFC|>1 and FDR<0.01 were set as the cut-off criteria.

Statistical analysis

We performed data analysis according to the characteristics of each dataset on the basis of R software (version 3.6.3). For microarray datasets, DEGs were screened using the unpaired t-test in the limma package and a |Log2(fold change) (logFC)|>1 and a false discovery rate (FDR) <0.05 (P<0.05) were set as the cut-off criteria. For RNA-seq dataset, the quantile-adjusted conditional maximum likelihood (qCML) method in the edgeR package was used to identify DEGs, and the statistical significant of the difference was set for |logFC|>1.5 and FDR<0.01 on the basis of large-scale samples. The construction of prognostic signature was performed by univariate and multivariate COX regression analyses. Survival analysis was conducted by KM method and LR test, and P<0.05 indicated that the difference was statistically significant. Diagnostic analysis was performed using the SVM method. Gene coexpression analysis was performed using WGCNA, and gene significance was indexed by log10 transformation of the P value of the t-test measuring differential expression between LUAD and normal lung tissue samples. The expressions of key DEGs in prognostic and diagnostic signatures were analyzed using the paired-sample datasets including GSE7670 and GSE19804, and the paired t-test method was used to compare the expression difference between tumor samples and normal samples. A P<0.05 indicated that the difference was statistically significant in two groups.

Results

DEGs identification and functional analysis

To identify DEGs involved in LUAD, DEGA was performed. According to |logFC|>1 and FDR <0.05, totals of 623 (Dataset A, 189 upregulated and 434 downregulated), 1,387 (Dataset B, 424 upregulated and 963 downregulated), 1,343 (Dataset C, 492 upregulated and 851 downregulated) DEGs were identified. According to |logFC|>1.5 and FDR <0.01, 11,450 (8,940 upregulated and 2,510 downregulated) DEGs were identified. Overlapping analysis showed that 235 common DEGs (74 upregulated and 161 downregulated) were identified (available online: https://cdn.amegroups.cn/static/public/tcr-21-526-1.pdf). To better understand the roles of 235 common DEGs in LUAD, functional enrichment analyses including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, Reactome pathways (RP) and Disease Ontology (DO) analyses were performed using the online STRING database (version 11.0) and clusterProfiler package (version 3.12.0) in Bioconductor project (30,40). GO analysis showed that 235 dysregulated genes were separately significantly enriched in 163 upregulated and 537 downregulated biological processes (BPs) (P<0.05), 18 upregulated and 29 downregulated molecular functions (MFs) (P<0.05), and 22 upregulated and 36 downregulated cellular components (CCs) (P<0.05). The top 5 GO terms with the most significant P values are shown in . Among the enriched BPs, upregulated BPs were mainly related to the cell cycle process, and downregulated BPs were mainly associated with the biological regulation (). KEGG pathway analysis showed that 2 upregulated pathways including the cell cycle (P=0.00062) and ECM-receptor interaction (P=0.0101) were significantly enriched (), and no significantly downregulated pathways were enriched. RP analysis showed that 235 dysregulated genes were separately enriched in 1 upregulated and 1 downregulated pathways (), and the 2 RPs were upregulated collagen degradation (P=0.0068) and downregulated hemostasis (P=0.0045), respectively. DO analysis showed that up- and down-regulated DEGs were separately associated with 32 and 9 diseases, and the top 5 DO terms are listed in . All upregulated DO terms were significantly associated with many types of cancers, and the most significantly upregulated DO term was non-small cell lung carcinoma (P=0.000764424) ().

Table 1

Functional terms enriched by common DEGs

Term ID	Term description	Observed	FDR
Biological process
Up-regulated
GO:1903047	Mitotic cell cycle process	21	5.01E-12
GO:0022402	Cell cycle process	23	1.11E-10
GO:0051301	Cell division	14	1.91E-06
GO:0000280	Nuclear division	11	2.81E-06
GO:0007059	Chromosome segregation	10	1.51E-05
Down-regulated
GO:0051239	Regulation of multicellular organismal process	67	1.36E-13
GO:0009653	Anatomical structure morphogenesis	51	1.80E-10
GO:0050793	Regulation of developmental process	56	2.61E-10
GO:0072359	Circulatory system development	32	2.61E-10
GO:0001944	Vasculature development	25	4.38E-10
Molecular function
Up-regulated
GO:0005515	Protein binding	45	0.00063
GO:0042802	Identical protein binding	20	0.00099
GO:0042803	Protein homodimerization activity	13	0.0016
GO:0046983	Protein dimerization activity	16	0.0021
GO:0005488	Binding	61	0.0034
Down-regulated
GO:0005539	Glycosaminoglycan binding	12	0.00024
GO:0005102	Signaling receptor binding	31	0.00048
GO:0005488	Binding	126	0.00048
GO:0005515	Protein binding	83	0.00048
GO:0017046	Peptide hormone binding	6	0.00048
Cellular component
Up-regulated
GO:0005576	Extracellular region	28	1.14E-05
GO:0005819	Spindle	10	5.09E-05
GO:0000776	Kinetochore	6	0.00092
GO:0000940	Condensed chromosome outer	3	0.001
GO:0005615	Extracellular space	15	0.001
Down-regulated
GO:0005576	Extracellular region	55	1.28E-09
GO:0005615	Extracellular space	31	7.89E-07
GO:0005886	Plasma membrane	74	9.46E-06
GO:0009986	Cell surface	22	9.46E-06
GO:0031226	Intrinsic component of plasma membrane	34	3.13E-05
KEGG pathway
Up-regulated
hsa04110	Cell cycle	6	0.00062
hsa04512	ECM-receptor interaction	4	0.0101
Reactome pathway
Up-regulated
HSA-1442490	Collagen degradation	3	0.0068
Down-regulated
HSA-109582	Hemostasis	15	0.0045
Disease ontology
Up-regulated
DO:3908	Non-small cell lung carcinoma	15	0.0007644
DO:3459	Breast carcinoma	13	0.0027271
DO:0050904	Salivary gland carcinoma	5	0.0088499
DO:8850	Salivary gland cancer	5	0.0088499
DO:0060084	Cell type benign neoplasm	13	0.0088499
Down-regulated
DO:6000	Congestive heart failure	14	0.0134019
DO:6432	Pulmonary hypertension	8	0.0134019
DO:5844	Myocardial infarction	15	0.0134019
DO:3393	Coronary artery disease	17	0.0134019
DO:1936	Atherosclerosis	16	0.0261586

DEG, differentially expressed gene; FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; ECM, extracellular matrix.

Interactive relationships among 235 DEGs and essential gene identification

To elucidate the interactive relationships among 235 DEGs, a PPI network was constructed. At a minimum required interaction score = high confidence 0.7, a total of 108 DEGs was filtered into the PPI network, and a PPI network with 108 nodes and 320 edges was established (). Three highly correlated modules were identified in the whole PPI network, and the module with the highest score (score =19.789) included 20 nodes and 188 edges (). The 20 genes had high centrality scores () and were identified as essential genes including UBE2C, ZWINT, MELK, CCNB2, RRM2, CEP55, BIRC5, TTK, KIF20A, CDKN3, TOP2A, NUSAP1, ASPM, CDC20, BUB1B, CCNB1, TYMS, MCM4, CENPF and KIF4A. These genes were mainly associated with many types of cancers including NSCLC and LUAD () and were mainly enriched within pathways related to the cell cycle (). All the genes were highly expressed in LUAD tissue (all P<0.001) (), and every pair of genes showed a strong positive correlation in expression (all P<0.001) ().

Figure 2

Biological analysis based on 235 DEGs. (A) A PPI network with 108 nodes and 320 edges was established. Each red cycle node represents one upregulated gene and each green cycle represents one downregulated gene. Each edge represents an interactive relationship between two genes. Larger nodes represent more links. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (B) Highly correlated module with the highest score was identified by the MCODE algorithm in the whole PPI network (score =19.789). The subnetwork included 20 nodes and 188 edges. All genes in the subnetwork were upregulated in LUAD tissues. Thicker edges represent higher coexpression scores among genes, and deeper-colored edges represent higher combined scores (yellow to blue to pink). (C) The DO-gene network showed that 20 genes were mainly associated with various cancers. (D) The pathway-gene network showed that 20 genes were mainly related to cell cycle pathways. Each red cycle node represents one gene. Each blue V node represents one GO term (biological process). Each yellow triangle node represents one KEGG pathway, and each green diamond node represents one reaction pathway. (E) Twenty genes were upregulated in LUAD tissues (all P<0.001). (F) Twenty genes had stronger positive correlations in expression. (G) Nineteen genes in the subnetwork were validated to have significantly upregulated expression in LUAD tissues by analyzing transcriptome sequencing data. (H) Nineteen genes had stronger positive correlations in expression in the transcriptome sequencing data. DEGs, differentially expressed genes; PPI, protein-protein interaction; MCODE, molecular complex detection; LUAD, lung adenocarcinoma; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Table 2

Centrality scores of twenty genes in the highly correlated module by eight centrality methods

Rank	Gene	Subgraph	Eigenvector	Information	LAC	Betweenness	Closeness	Network	Degree
1	UBE2C	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
2	ZWINT	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
3	MELK	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
4	CCNB2	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
5	RRM2	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
6	CEP55	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
7	BIRC5	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
8	TTK	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
9	KIF20A	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
10	CDKN3	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
11	TOP2A	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
12	NUSAP1	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
13	ASPM	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
14	CDC20	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
15	BUB1B	7504268.50	0.226	10.47	17.79	0.222	1.0	19.00	19
16	CCNB1	7504266.50	0.226	10.47	17.79	0.222	1.0	19.00	19
17	TYMS	6799982.00	0.215	10.22	16.89	0.111	0.95	17.88	18
18	MCM4	6799982.00	0.215	10.22	16.89	0.111	0.95	17.88	18
19	CENPF	6799982.00	0.215	10.22	16.89	0.111	0.95	17.88	18
20	KIF4A	6799982.00	0.215	10.22	16.89	0.111	0.95	17.88	18

LAC, local average connectivity.

Expression validation of 20 essential genes

To validate the differential expressions of 20 essential genes between LUAD and normal lung tissues, 20 transcriptomes from 10 nonsmoking LUAD patients in an early stage were analyzed. According to |logFC|>1 and FDR<0.01, 2,360 DEGs (1,302 upregulated and 1,058 downregulated) were identified (available online: https://cdn.amegroups.cn/static/public/tcr-21-526-2.pdf). Except TYMS, 19 DEGs were consistent with the results from integrative data, and the expressions of 19 DEGs were visualized using the heatmap shown in . Similarly, there were stronger positive correlations in expression among these genes (all R2>0.55, most P<0.001) (). The RNA-seq data have been deposited in the NCBI Short Read Archive (Accession number: PRJNA561283). Given the high consistency of the identified DEGs, the 20 DEGs were selected for further analyses.

Consensus clustering of 20 essential genes and relationships with distinct clinical outcomes and clinicopathological features

To investigate the relationships between 20 essential genes and distinct clinical outcomes and clinicopathological features, consensus clustering was performed. On the basis of the expression similarity of 20 essential genes, k=2 was selected according to clustering stabilities increasing from k=2 to 9 in the TCGA dataset () and clustered into two subgroups (). The two subgroups were significantly related to the prognosis of LUAD patients, and the cluster2 subgroup had a higher survival rate (P=0.002, ). Clinicopathological features were compared between the two subgroups. The results showed that all these genes were lowly expressed in the cluster2 subgroup, and the cluster2 subgroup was significantly correlated with an earlier N stage, T stage and pathological grade (P<0.05, 0.001 and 0.01, separately), as well as with fewer female and dead patients (P<0.001 and 0.05, separately) (). The results explained why patients had a higher OS in the cluster2 subgroup.

Figure 3

Consensus clustering of 20 essential genes and relationships with distinct clinical outcomes and clinicopathological features. (A,B) Clustering stability increasing curve from k=2 to 9. (C) K=2 was selected, and LUAD patients were divided into two clusters. (D) The survival rates of LUAD patients showed significant differences between the two clusters (P=0.002), and the cluster2 patients had a higher survival rate. (E) Relationships between 20 genes and clinicopathological features are shown using a heatmap. The cluster1 subgroup was significantly correlated with a later N stage, T stage and pathological grade (P<0.05, 0.001 and 0.01, separately) and with more female and dead patients (P<0.001 and 0.05, separately). *, P<0.05; **, P<0.01; ***, P<0.001. LUAD, lung adenocarcinoma.

Survival analysis and prognostic model of 20 essential genes

To elucidate the relationships between 20 essential genes and the OS of LUAD patients, survival analysis was performed. Univariate Cox regression showed that all 20 essential genes were significantly correlated with the OS of LUAD patients (all P<0.01) and were risky genes with HR >1 (). To better predict the clinical outcomes of LUAD with 20 essential genes, the LASSO Cox regression algorithm was applied and 12 essential genes (ZWINT, MELK, CCNB2, RRM2, TTK, KIF20A, TOP2A, ASPM, CDC20, CCNB1, TYMS and KIF4A) were selected to build the risk signature based on the minimum criteria (). The risk score of each patient was calculated, and LUAD patients were divided into high-risk and low-risk subgroups on the basis of the median risk score. A significant difference in OS was observed between the two subgroups (P=1e-05), and the low-risk group showed a higher survival rate (). The KM estimate and LR test showed that 12 genes were significantly associated with the OS of LUAD patients (all P<0.05, Figure S1).

Figure 4

Prognostic and diagnostic analyses. (A) Univariate Cox regression analysis showed that 20 genes had significant prognostic value for predicting the OS in LUAD patients. (B,C) LASSO Cox analysis revealed that 12 genes had the most powerful features for predicting the OS. (D) A significant difference in OS was observed between the high- and low-risk subgroups (P=1e-05), and the low-risk group showed a higher survival rate. (E) Twelve genes were highly expressed in the high-risk subgroup. The “***” indicates that the statistical P value is less than 0.001 between two groups. (F) Twelve genes were lowly expressed in the alive patient group The “**” and “***” indicate that the statistical P value is less than 0.01 and 0.001, respectively. (G) Twelve genes were associated with the N stage, T stage, pathological stage and survival status between the high- and low-risk groups (P<0.01, 0.01, 0.001 and 0.001, separately). (H) The ROC curve showed that the risk score, N stage, T stage and pathological stage could better predict the three-year OS for LUAD patients. (I) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (J) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (P<0.001 and =0.003). (K) Diagnostic analysis showed that 12 genes had a high specificity and sensitivity for predicting the diagnosis in LUAD patients. **, P<0.01; ***, P<0.001. LASSO, least absolute shrinkage and selection. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic. The expression levels of 12 genes were analyzed between the high-risk and low-risk groups. The results showed that all the genes were significantly highly expressed in high risk group (all P<0.001, ). The expression levels of 12 genes were also compared between alive and dead patient groups. The result showed that 12 genes were significantly lowly expressed in the alive patient group (all P<0.01, separately, ), which indicates that low expressions of these genes contribute to a low risk of LUAD patients and lengthen survival of LUAD patients. To elucidate the associations between risk scores and clinicopathological features, clinicopathological features were analyzed between the high-risk and low-risk subgroups. We observed significant differences between the two risk subgroups with respect to the N stage (P<0.01), T stage (P<0.01), pathological stage (P<0.001) and survival status (P<0.001) (). The ROC curve showed that the risk score and pathological stage could better predict the three-year OS for LUAD patients (AUC =0.686 and 0.675, separately) (). To determine whether the risk signature is an independent prognostic indicator, univariate and multivariate Cox regression analyses were performed. By univariate Cox regression analysis, the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001) (). Multivariate Cox regression analysis showed that risk score and pathological stage were significantly correlated with the OS of LUAD patients (P<0.001 and =0.003, separately) (). These results indicate that the risk score can independently predict the OS in LUAD patients.

Diagnostic model based on 12 essential genes related to prognosis

To evaluate diagnostic power of 12 essential genes related to the prognosis of LUAD patients, a diagnostic model of 12 genes was constructed using the SVM classification model. The results showed that both the specificity and sensitivity of classification exceed 93% in the three datasets (). The AUCs were over 95% in the three datasets (), which indicates that 12 genes are very effective as diagnostic biomarkers in predicting the diagnosis of LUAD patients.

Coexpression network construction and module analysis

To better understand the gene coexpression relationships in different tissue types, WGCNA was performed. By normalization, 12,410 (Dataset A), 20,217 (Dataset B), 17,702 (Dataset C) and 31,343 (Dataset D) genes were selected to construct the gene coexpression network. On the basis of a scale-free topology criterion R2>0.8, power β=7 (Dataset A, R2=0.84), 18 (Dataset B, R2=0.85), 4 (Dataset C, R2=0.85) and 2 (Dataset D, R2=0.86) were selected as soft thresholds to convert the Pearson correlation matrix into a strengthened adjacency matrix, separately (, Figure S2A). The TOM of each gene pair was calculated, and 20, 17, 35 and 36 coexpression modules were separately identified by average linkage hierarchical clustering according to a TOM-based dissimilarity measure (1-TOM) in four datasets (, Figure S2B). The correlation analysis between ME and LUAD showed that the brown (including 1,875 genes, cor=−0.87 and P=6.20E-50), dark turquoise (including 1,895 genes, cor=−0.83 and P=1.45E-31), yellow (including 910 genes, cor=−0.94 and P=5.37E-15) and blue (including 3,025 genes, cor=−0.88 and P=1.9E-190) modules were separately the most significant modules in the four datasets (, Figure S2C). The correlation analysis of the MMs in these modules showed that these MMs had the most significant correlation with LUAD (Dataset A, cor=0.87 and P<1e-200; Dataset B, cor=0.77 and P<1e-200; Dataset C, cor=0.95 and P<1e-200; Dataset D, cor=0.98 and P<1e-200; and Figure S2D). GSs across modules showed that these modules had the highest GS values (, Figure S2E). The four modules were selected as key coexpression gene modules for further analyses.

Figure 5

Coexpression network analysis by WGCNA. WGCNA of Dataset A was used to visualize the WGCNA results, and other WGCNA results are shown in Figure S2. (A) Network topology for various soft-threshold powers and the testing of the properties of the scale-free network were analyzed. (B) LUAD-specific coexpression modules were analyzed, and 20 modules were identified. Each short vertical line corresponds to one gene. Each branch represents one expression module of highly interconnected groups of genes. Below the dendrogram, each group of genes has been given one color, which indicates its module assignment. Gray suggests that the genes were outside all modules. (C) The associations between modules and LUAD were analyzed, which showed that the brown module was identified as having the most significant association with LUAD (P=6e-50). (D) The associations between brown module membership and LUAD were analyzed, which showed that the genes in the module and LUAD had a stronger association (P<1e-200). (E) The mean significance across modules was analyzed, which showed that the brown module with the highest mean significance and a lower variation was the module with the most significant association with LUAD. WGCNA, weighted gene coexpression network analysis; LUAD, lung adenocarcinoma.

Key gene identification in the four most significant modules

To identify the key genes among the four most significant modules, DEGA and overlapping analyses were performed in the four modules. The results showed that 392 (Dataset A), 869 (Dataset B), 483 (Dataset C) and 2,270 (Dataset D) DEGs were separately identified (), and 52 common DEGs (5 upregulated and 47 downregulated) were identified (, ). These DEGs were mainly involved in blood vessel development (GO:0001568, P=1.80e-6), vasculature development and regulation (GO:1901342, P=2.17e-6) and tube development (GO:0035295, P=6.03e-6). The expressions of 52 common DEGs are shown using a heatmap in .

Figure 6

DEGs identified in the most significant modules and PPI analysis. (A) In total, 392, 869, 483 and 2,270 DEGs were separately identified in the four most significant modules. (B) 52 consistent DEGs were identified in the four most significant modules. (C) The mRNA expressions of 52 consistent DEGs between LUAD and normal lung tissues were visualized using a heatmap. (D) The interactive relationships between 52 consistent DEGs were analyzed using a PPI network, and with a minimum required interaction score = the high confidence 0.7, a total of 12 (1 up- and 11 down-regulated genes) was filtered into the PPI network. Each node represents one gene, and bigger nodes represents genes with more links. Each red cycle node represents one upregulated gene and each green cycle node represents one downregulated gene. Each edge represents the interactive relationship between two genes. (E) Highly correlated modules were analyzed using the MCODE algorithm in the whole PPI network, and one highly correlated module with 5 nodes and 10 edges was identified. All 5 genes including ADRB2, RAMP2, CALCRL, RAMP3 and VIPR1 in the module were significantly downregulated genes, and each pair of genes had an interactive relationship. (F) All the genes were lowly expressed in LUAD tissue (P<0.001). (G) Correlation analysis showed that there were stronger positive correlations in expression among 5 genes (all R2>0.70). (H) Five genes in the subnetwork were validated to have significantly downregulated expression in LUAD tissue by analyzing transcriptome sequencing data. (I) Five genes had stronger positive correlations in expression in the transcriptome sequencing data (all R2>0.80). DEGs, differentially expressed genes; LUAD, lung adenocarcinoma; PPI, protein-protein interaction; MCODE, molecular complex detection.

Table 3

Common DEGs in the four most significant co-expression modules

Gene symbol	Dataset A		Dataset B		Dataset C		Dataset D
Gene symbol	logFC	FDR	logFC	FDR	logFC	FDR	logFC	FDR
Up-regulated
PYCR1	1.0834	2.60E-34	1.2966	2.01E-26	2.7821	4.05E-07	3.7387	9.68E-93
CEACAM1	1.1290	7.28E-18	1.2744	7.37E-11	1.2071	1.90E-02	2.0955	2.63E-30
ABCC3	1.3676	3.89E-22	1.9165	1.23E-19	3.7939	7.01E-09	2.5686	1.13E-37
GOLM1	1.8824	7.93E-36	2.0276	8.61E-29	1.8870	4.45E-07	2.6754	4.03E-65
HMGB3	2.1441	1.80E-36	1.7898	3.38E-13	1.9850	5.23E-04	3.5250	6.97E-48
Down-regulated
SFTPC	3.7846	8.10E-26	3.1933	6.72E-11	2.7964	5.01E-04	4.7077	9.03E-69
FCN3	3.6158	1.00E-46	3.4480	4.02E-20	3.8986	9.14E-06	4.4931	3.91E-172
TMEM100	3.5293	1.35E-41	3.6015	7.46E-18	3.9073	6.85E-09	4.3314	6.7E-141
ABCA8	2.9376	7.81E-50	2.8246	2.07E-16	1.3704	2.67E-05	2.8318	1.23E-71
CDH5	2.5828	1.01E-51	2.2366	7.80E-24	2.2076	2.85E-08	2.5234	4.06E-148
GPM6A	2.5467	3.4E-47	3.9236	2.82E-28	2.5505	3.51E-08	4.4338	9.26E-122
EDNRB	2.5449	9.55E-50	2.9911	4.45E-22	2.6127	6.85E-09	3.4758	1.75E-177
GRK5	2.4935	2.23E-57	1.8852	2.82E-13	2.0877	2.73E-08	2.4636	2.62E-128
TEK	2.4787	1.72E-52	2.6268	3.02E-24	2.6799	2.85E-08	3.2279	6.82E-191
CLDN18	2.4480	9.25E-32	3.2886	9.02E-14	2.2100	1.33E-03	3.5131	2.38E-38
CA4	2.3105	9.1E-46	3.2181	2.02E-31	3.5226	3.35E-08	4.3314	2.42E-104
TGFBR3	2.2710	4.24E-42	2.1643	1.94E-16	2.0791	1.06E-08	2.5536	6.25E-89
LDB2	2.2329	2.04E-45	2.1822	3.70E-21	1.9013	1.88E-09	2.5873	8.98E-170
CAV2	2.1706	1.36E-21	2.1045	2.90E-12	1.7610	1.53E-07	2.3153	1.51E-70
EMP2	2.0621	3.99E-47	1.8399	9.91E-18	1.6546	7.72E-07	2.7098	4.00E-179
VIPR1	2.0389	4.11E-37	1.9417	2.10E-15	2.6302	7.84E-10	3.2094	3.43E-127
RAMP3	2.0118	1.71E-38	2.0534	1.66E-21	2.8286	7.72E-08	3.1995	5.78E-185
GIMAP6	1.9154	8.16E-34	1.9864	1.40E-17	1.7776	1.18E-06	2.0685	2.62E-91
DOCK4	1.8372	1.05E-30	1.4940	3.12E-20	1.2694	5.15E-07	1.5397	1.59E-59
RAMP2	1.8250	1.01E-45	1.7628	2.60E-18	2.6198	2.92E-07	2.8586	1.83E-166
RASIP1	1.8037	2.46E-45	2.0014	1.30E-28	2.1561	2.23E-08	2.2819	2.09E-101
EMCN	1.7935	1.17E-42	2.5049	3.43E-17	1.7784	7.21E-05	2.7228	1.43E-114
SLC6A4	1.7909	1.87E-26	2.9093	1.14E-17	4.5929	7.01E-09	6.1370	9.64E-174
IL33	1.7537	8.30E-29	2.0729	1.04E-12	1.5444	3.11E-05	2.0778	4.94E-41
ERG	1.7476	5.85E-40	1.4934	9.67E-14	2.0968	1.26E-05	2.1248	3.21E-115
ADRB2	1.6150	2.42E-38	1.9791	6.43E-20	1.4662	1.3E-04	3.0541	1.69E-128
ANXA3	1.5754	1.37E-17	1.7950	1.01E-10	1.4992	2.27E-04	1.8378	2.18E-41
ITM2A	1.5484	3.02E-32	1.6888	3.30E-15	1.3892	2.10E-07	2.0161	1.33E-67
TBX3	1.4951	4.79E-31	1.3848	1.96E-13	1.9755	3.53E-05	2.1108	1.26E-71
JAM2	1.4844	7.69E-50	1.9511	8.55E-20	1.8270	7.72E-08	2.4928	1.57E-150
CD93	1.4564	1.46E-28	1.6924	7.67E-17	1.8429	6.54E-07	2.1842	5.72E-103
GHR	1.4503	5.31E-33	1.8830	2.24E-15	1.5769	5.72E-04	2.0160	2.21E-45
STXBP6	1.4350	1.56E-46	2.6371	1.02E-19	2.9172	2.99E-11	3.3623	3.84E-103
TMEM47	1.4317	2.40E-18	1.4365	1.25E-12	1.3715	1.49E-04	1.7287	2.81E-55
FERMT2	1.3605	4.41E-31	1.2932	7.35E-08	1.0452	5.26E-05	1.3922	9.59E-63
SEMA6A	1.3278	9.00E-21	1.6010	4.60E-12	2.3603	1.33E-10	2.5439	1.49E-95
SEMA5A	1.3200	1.03E-30	1.5949	8.43E-19	1.9359	2.09E-05	2.5148	6.65E-77
PTPRB	1.3143	3.05E-36	2.1648	7.02E-21	1.9385	1.05E-05	2.5855	1.33E-119
CALCRL	1.3001	2.6E-22	2.0359	2.50E-09	2.2940	1.28E-04	2.7443	8.75E-150
SASH1	1.2749	2.18E-41	1.4206	4.38E-18	1.6155	2.34E-08	1.8352	2.13E-102
S1PR1	1.2691	1.09E-40	2.0206	1.38E-22	2.0551	8.06E-06	2.8252	4.29E-189
LMCD1	1.2579	5.45E-29	1.1033	1.87E-13	1.4161	3.59E-04	1.7207	8.72E-84
SPTBN1	1.2511	6.33E-24	1.6536	2.06E-09	1.5544	6.94E-06	1.7150	1.41E-95
EML1	1.2483	1.09E-35	1.4229	2.21E-12	1.3991	4.63E-06	1.6725	2.98E-51
PODXL	1.1869	1.80E-26	1.1076	3.11E-15	1.2530	7.96E-05	1.0640	3.71E-28
KL	1.1122	5.76E-30	2.2987	2.82E-24	2.0117	3.83E-05	2.3051	2.19E-41
PDK4	1.1117	3.07E-18	1.8544	1.058E-06	1.8622	4.63E-04	2.3871	2.32E-53

DEG, differentially expressed gene; FC, fold change; FDR, false discovery rate.

PPI network construction and essential gene identification based on 52 common DEGs

To elucidate the interactive relationships among 52 DEGs, a PPI network was constructed. At a minimum required interaction score = high confidence 0.7, a total of 12 (1 upregulated and 11 downregulated) among 52 DEGs was filtered into the PPI network. A PPI network with 12 nodes and 18 edges was established (). One highly correlated module was identified in the whole PPI network, and the module included 5 nodes and 10 edges (score =5.00) (). Centrality analysis showed that the 5 downregulated genes including ADRB2, RAMP2, CALCRL, VIPR1 and RAMP3 had the same centrality score () and were identified as essential genes involved in LUAD. All the essential genes were lowly expressed in LUAD tissue (all P<0.001) (), and every pair of genes showed a strong positive correlation in expression (all R2>0.75 and P<0.001) (). On the basis of the data of 20 transcriptomes, the expressions of 5 genes were significantly downregulated in LUAD tissue (all P<0.01) () and had strong correlations in expression (all R2>0.80 and P<0.001) ().

Table 4

Centrality scores of five essential genes according to seven centrality methods

Rank	Gene	Subgraph	Degree	Eigenvector	Information	Closeness	Network
1	ADRB2	11.2139	4	0.4472	3.125	1.0	4.0
2	VIPR1	11.2139	4	0.4472	3.125	1.0	4.0
3	RAMP3	11.2139	4	0.4472	3.125	1.0	4.0
4	RAMP2	11.2139	4	0.4472	3.125	1.0	4.0
5	CALCRL	11.2139	4	0.4472	3.125	1.0	4.0

Prognostic analysis based on 5 essential genes

To elucidate the relationships between 5 essential genes and the OS of LUAD patients, survival analysis was performed using the KM estimate and LR test. According to the P<0.05 cut-off criterion, ADRB2 (P=0.01009) and VIPR1 (P=0.01613) were identified as associated with the OS of LUAD patients (), and ADRB2 (HR =0.68215, 95% CI: 0.5081–0.9143) and VIPR1 (HR =0.70189, 95% CI: 0.5199–0.9369) were protective genes with HR <1 (). Higher expressions of the two genes resulted in a higher OS rate of LUAD patients (). Similar results were found in the univariate Cox regression analysis, and ADRB2 and VIPR1 had significant prognostic value in LUAD patients (). Higher mRNA expressions of ADRB2 (P=0.002, HR =0.854, 95% CI: 0.775–0.943) and VIPR1 (P<0.001, HR =0.831, 95% CI: 0.755–0.914) had lower HRs and resulted in a higher OS rate (). The multivariate Cox regression model with the stepwise method based on the 5 essential genes showed that ADRB2 and VIPR1 had significant prognostic value for the OS of LUAD patients (P=0.000124). A two-gene prognostic model was established, and the risk score of each patient was calculated. On the basis of the median of the risk scores, the LUAD patients were divided into high-risk and low-risk subgroups. The mortality rate of the high-risk subgroup was significantly higher than that of the low-risk group [43.24% (109 in 252 patients) vs. 29.37% (74 in 252 patients), P=0.001637], and the high-risk subgroup had a worse prognosis compared to the low-risk group (P=0.00225, HR =1.57692, 95% CI: 1.1749–2.212, ).

Figure 7

Prognostic and diagnostic analyses. (A) Kaplan-Meier survival curves showed that the expressions of the ADRB2 and VIPR1 genes were associated with the OS of LUAD patients (P=0.01009 and 0.01613, separately), and high expressions of these two genes resulted in a higher OS rate of LUAD patients. (B) Univariate Cox analysis showed that ADRB2 and VIPR1 were significantly associated with the OS of LUAD patients (P=0.002 and <0.001), and were protective genes, with HR <1. (C) Survival analysis showed that the high-risk group had a worse prognosis among LUAD patients (P=0.00225). (D) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the low-risk group (both P<2.2e-16). (E) The expression analysis showed that the ADRB2 and VIPR1 genes were significantly highly expressed in the alive group (P=0.001724 and 0.000709, separately). (F) The two genes were associated with the T stage, gender and survival status between high- and low-risk groups (all P<0.05). (G) The ROC curve showed that the AUC of the two-gene prognostic model was 0.603. (H) Univariate Cox regression analysis showed that the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001). (I) Multivariate Cox regression analysis showed that the risk score and pathological stage were significantly correlated with the OS of LUAD patients (both P<0.001). (J) Diagnostic analysis showed that the 2-gene signature had high specificity and sensitivity for predicting the diagnosis of LUAD patients. *, P<0.05; **, P<0.01. LUAD, lung adenocarcinoma; OS, overall survival; ROC, receiver operating characteristic; AUC, area under curve. The expression levels of ADRB2 and VIPR1 were analyzed between the high-risk and low-risk groups. The results showed that both ADRB2 and VIPR1 were significantly highly expressed in the low-risk group (both P<2.2e-16, ). Further, the expression levels of ADRB2 and VIPR1 were observed between the alive and dead patient groups. The result showed that both ADRB2 and VIPR1 were significantly highly expressed in the alive patient group (P=0.001724 and 0.000709, separately, ), which indicates that high expressions of ADRB2 and VIPR1 contribute to a low risk for LUAD patients and lengthen survival of LUAD patients. To elucidate the associations between the risk scores and clinicopathological features, clinicopathological features were analyzed between the high- and low-risk subgroups. We observed significant differences between the two risk subgroups with respect to the T stage (P<0.01), gender (P<0.05) and survival status (P<0.05) (). The ROC indicates that the risk score can predict the prognosis of LUAD patients (). To determine whether the risk signature is an independent prognostic indicator, univariate and multivariate Cox regression analyses were performed. According to univariate Cox regression analysis, the risk score, N stage, T stage and pathological stage were significantly correlated with the OS of LUAD patients (all P<0.001) (). Multivariate Cox regression analysis showed that risk score and pathological stage were significantly correlated with the OS of LUAD patients (both P<0.001) (). These results indicate that the risk score can independently predict the OS in LUAD patients.

Diagnostic analysis of the 2-gene signature

Through the diagnostic analysis of four datasets, the results showed that both the specificity and sensitivity of classification exceed 93% in the three datasets (), and the AUCs were over 94% (). Among these values, the AUC was 100% in the GSE102511 dataset, which indicates that the 2-gene signature has very high effectiveness as a diagnostic biomarker in predicting the diagnosis of LUAD patients.

Expression analysis of 12-gene and 2-gene signatures in paired sample datasets

In this study, some paired samples were included in GSE7670 and GSE19804. To validate the expression of DEGs in 12-gene and 2-gene signatures, paired samples were selected and analyzed the expression difference between tumor and normal lung tissues using the paired t-test method. The results showed that 14 genes had statistical significant in expression between two groups in two datasets (all P<0.05, and ), which was consistent with the results obtained from the unpaired t-test method.

Figure 8

The gene expression analysis of 12-gene and 2-gene signatures in GSE7670 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

Figure 9

The gene expression analysis of 12-gene and 2-gene signatures in GSE19804 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

The gene expression analysis of 12-gene and 2-gene signatures in GSE7670 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05). The gene expression analysis of 12-gene and 2-gene signatures in GSE19804 dataset. All genes had statistical significant in expression between tumor and normal lung tissues (P<0.05).

Identification of small-molecule drugs

To identify potential adjuvant drugs based on 14 prognostic genes (12-gene and 2-gene signatures) to guide the therapy of LUAD patients, small-molecule drugs were screened using three databases including CMap, L1000FWD and DGIdb. Totals of 87, 84 and 315 small molecules were separately identified in three small-molecule databases, and 1 small molecule with the drug name irinotecan was simultaneously found (). By pairwise comparison, 3 (podophyllotoxin, tanespimycin, and irinotecan), 4 (methotrexate, irinotecan, timolol, and hydrocortisone) and 6 (vincristine, teniposide, idarubicin, amsacrine, irinotecan, and etoposide) small molecules were found ().

Figure 10

Small molecule prediction. Eleven small-molecule drugs were predicted, and among these small molecules, irinotecan was found in three small molecule databases including CMap, DGIdb and L1000FWD.

Discussion

LUAD, as the most commonly pathological subtype of LC and that accounts for more than 40% of LCs (3,4), is one of the leading causes of LC-related deaths (1). Despite recent advances in detection technologies and treatment methods, the 5-year OS rate of LUAD patients in all stages remains very poor, at less than 20% (3,6). The identification of diagnostic and prognostic biomarkers may contribute to improving the survival rate of LUAD patients. However, due to the high heterogeneity of LUAD, the results reported from different studies vary enormously, and some identified biomarkers are not widely accepted for predicting clinical outcomes. To identify consistently acceptable diagnostic and prognostic biomarkers to improve clinical outcomes, systematic analysis is essential to explore the molecular mechanisms involved in LUAD and identify some key diagnostic and prognostic signatures by integrating LUAD-related gene expression data. In this study, we systematically integrated five LUAD-related gene expression datasets using bioinformatics methods including WGCNA, DEGA, PPI network, and prognostic and diagnostic analyses to identify transcriptome characterization to mine the key genes involved in LUAD. Finally, 12-gene (ZWINT, MELK, CCNB2, RRM2, TTK, KIF20A, TOP2A, ASPM, CDC20, CCNB1, TYMS and KIF4A) and 2-gene (ADRB2 and VIPR1) signatures were identified in LUAD and may serve as diagnostic and prognostic markers of LUAD patients. Furthermore, eleven small-molecule drugs related to the 12-gene and 2-gene signatures were identified, providing novel insights for LUAD therapeutic studies. As can be seen from the “Methods” and “Results” sections, 12-gene and 2-gene signatures were identified by a systematical analysis based on DEGs obtained from an unpaired t-test method. Before discussing the potential applicability of two signatures as diagnostic and prognostic markers, we must eliminate several doubts that readers may have on the processing method: (I) GSE7670 and GSE19804 datasets included many paired samples, why used an unpaired t-test for DEGA? (II) Why were GSE7670 and GSE10072 datasets merged into one dataset? (III) Why the microarray datasets and RNA-seq dataset do not use the same cutoff standard when performing DEGA? For the first question, theoretically it is more suitable to use paired t-test to process paired samples. We tried to use the paired t-test method to screen DEGs on paired sample datasets. However, few DEGs with |logFC|>1 and adjusted P<0.05 were identified using the method. Especially, the adjusted P values were significantly improved in the paired t-test model. We consider that the adjusted P values may be over-corrected in the test model when the sample number is smaller. The main reason may be that the degree of freedom is reduced and the gene expression matrix is re-standardized after samples were paired. It is not rigorous to only rely on logFC value to identify differentially expressed genes. In order to compensate the deficiency, we used the paired t-test method to analyze the expression differences of DGEs in the two signatures based on single gene. The results showed that all DEGs had statistical significance in the expression in GSE7670 and GSE19804 datasets (P<0.05, and ). For the second question, the sample number is more and the statistical reliability of the data is stronger in theory. Due to the same microarray platform, two datasets were merged into one dataset by reconstructing gene expression profile after background was corrected using a normalized microarray preprocessing procedure in the affy package. To a greater extent, the method eliminates some technical errors than the method of simply merging gene expression profiles. This method is expected to improve statistical power to get more reliable results. Judging from the results of this study, the method is feasible. For the third question, RNA-seq dataset included more samples and genes than microarray dataset, which indicates that RNA-seq dataset can have a better “resolution” in identifying DEGs. So, a more stringent cutoff standard was used for RNA-seq dataset. In the 12-gene signature, all the genes were enriched in biological pathways related to the cell cycle (). As we all know, the cell cycle is one of the most critical pathways closely associated with cancers, and an abnormal cell cycle will initiate malignant tumors (41,42). So far, some genes related to the cell cycle have also been confirmed to play important roles in LC and serve as potential prognostic candidates to predict the survival of NSCLC patients, such as ZWINT and CDC20, as well as BUB1B (43,44). The current results were consistent with a previous study (44). ZWINT is an important regulatory protein and plays key roles in chromosome movement and mitotic checkpoints (45,46). Some studies showed that the dysfunction of ZWINT resulted in many types of cancers such as breast and ovarian cancers (45), and the overexpression of ZWINT predicted a poor prognosis (47,48). However, a few studies also showed that ZWINT was a protective gene in hepatocellular carcinoma, and its increased expression contributed to a good prognosis (49,50). In this study, ZWINT was identified as a risky factor in LUAD, and its upregulated expression resulted in a poor prognosis. This result was consistent with previous studies indicating that a high expression of ZWINT was closely related to a poor prognosis of LUAD patients (45). CCNB2, which is a member of the cyclin family, is one of the essential components of the cell cycle regulatory machinery and plays a key role in transforming growth factor beta-mediated cell cycle control. CCNB2 has been reported to associate with many cancers (51), and its overexpression predicted a poor prognosis in NSCLC patients (52). CCNB1 is another member of the cyclin family, and its gene product plays critical role in controlling the G2/M transition phase of the cell cycle by forming the maturation-promoting factor (MPF) with CDC2 (53). The upregulation of CCNB1 in tumor tissue predicted a worse prognosis in hepatocellular carcinoma patients (54), and it may serve as potential therapeutic target (55). Presently, a few studies have shown that CCNB1 plays a role in NSCLC (56), and its polymorphisms were associated with clinical outcomes (57). However, the function of CCNB1 remains little known in LUAD. Our studies demonstrated that CCNB1 plays important role in LUAD, which provides novel insight into pathological mechanism involved in LUAD. CDC20, which is an important regulatory protein in the cell cycle, has been identified to be negatively regulated by p53, and used as a potential cancer therapeutic target (58). Several studies showed that the expression of CDC20 was related to the survival in some cancers such as colorectal cancer and hepatocellular carcinoma, and its overexpression predicted a poor prognosis (54,59). TTK is a protein kinase, and its expression is closely associated with cell proliferation (60). Silencing TTK expression inhibits proliferation and progression (61), and the overexpression of TTK promoted breast cancer cell proliferation and conferred a poorer prognosis (62), which indicates that TTK is a risky gene in cancer. RRM2 encodes one of two nonidentical subunits for ribonucleotide reductase that catalyzes the formation of deoxyribonucleotides from ribonucleotides. Several studies have reported that the gene expression was associated with cancers, and it has been identified as potential prognostic marker in patients with NSCLC or breast cancer (63-65). ASPM is the human ortholog of the Drosophila melanogaster abnormal spindle gene (asp), which plays essential role in normal mitotic spindle function in embryonic neuroblasts (66). Recently, some studies showed that ASPM expression was associated with the survival of cancer patients, and its upregulation resulted in a poor prognosis in breast cancer and pancreatic ductal adenocarcinoma (62,67). KIF4A and KIF20A are two members of the kinesin family that are mainly responsible for movement along the microtubules in the cell, which has been demonstrated to be associated with many diseases including cancers (68-71). The other three genes including MELK, TOP2A and TYMS are enzyme-encoding genes, which separately play roles in BPs related to cell cycle regulation, chromosome status, and DNA replication and repair. So far, some published studies have reported that these genes played roles in the initiation and progression of cancers and have been identified as predictors of the survival in cancer patients. In the 2-gene signature, ADRB2 is a protein coding gene encoding the beta2 adrenergic receptor that belongs to a member of the G protein-coupled receptor (GPCR) superfamily. Many studies have confirmed that several diseases such as asthma (72), obesity (73) and type 2 diabetes (74) were associated with ADRB2. Recently, researchers have found that ADRB2 was significantly correlated with various aspects related to cancer such as prostate cancer (75) and breast cancer (74), which indicates that ADRB2 is related to cell proliferation and apoptosis, tumor growth and metastasis, and angiogenesis. A few studies have so far discovered that ADRB2 activation promoted the proliferation of LC cells (76), and it was identified as a potential independent factor for early-stage NSCLC patients (77). Our results confirmed that the dysregulation of ADRB2 played a key role in LUAD, and high mRNA expression of ADRB2 resulted in a higher OS rate in LUAD patients. Similar to ADRB2, VIPR1 encodes a small neuropeptide that belongs to GPCR. VIPR1 is widely expressed among various tissues and plays key roles in many physiological functions including immune regulation, glycogen metabolism, etc. (78). Previous studies have found that VIPR1 was differentially expressed between cancer and normal tissues in many malignancies (78). For example, VIPR1 was highly expressed in prostate cancer (79), breast cancer (80) and colon cancer (81), which indicates that it functions in malignant tumors. Some studies have also reported that VIPR1 was significantly downregulated in LC (82) and may serve as a molecular target for the diagnosis, prevention and treatment of LC (80). This study further demonstrated that VIPR1 played roles in LUAD, and high mRNA expression of VIPR1 resulted in a higher OS rate in LUAD patients. Interestingly, both ADRB2 and VIPR1 identified in this study are GPCRs that belong to the largest family of cell surface receptors mediating many physiological processes (83). Large numbers of studies have proven that GPCRs play roles in many diseases processes such as human genetic and endocrine diseases (84,85), and GPCRs are often designated as drug targets (86). Despite their broad physiological and pathological functions, as well as acting as favorable sites for drug development, the roles of GPCRs have been underappreciated for a long time in tumor biology (83). With the discovery of more GPCRs associated with cancers, researchers have recognized the importance of the functions of GPCRs in tumor biology and paid more attention to GPCRs in many aspects such as molecular machinery (87-89) and tumor-related drug development (86,90). In this study, we found that two key GPCR genes played roles in LUAD by a comprehensive analysis, which indicates that GPCRs provide important functions in tumor biology. Two identified genes may serve as potential diagnostic and prognostic candidates and pharmacological drug targets in LUAD patients. In addition, several small molecules were identified as potential therapeutic drugs to combat LUAD in this study. Especially, irinotecan (Drugbank accession number: DB00762) was found in three small-molecule prediction databases. Irinotecan is a derivative of camptothecin that inhibits the action of topoisomerase I and can prevent the religation of the DNA strand by binding to the topoisomerase I-DNA complex and causing double-strand DNA breakage and cell death. Irinotecan, as an antineoplastic enzyme inhibitor, is primarily used in the treatment of colorectal cancer. Recently, irinotecan was also approved for treating advanced pancreatic cancer. In LC, the therapeutic efficacy of irinotecan was evaluated by a large number of trials (91-93). In particular, irinotecan has been proven to be effective as a chemotherapeutic drug against small cell lung cancer (SCLC), and SCLC patients receiving maintenance chemotherapy with irinotecan had a longer survival (94). Currently, a few studies have also explored the treatment effect of irinotecan against NSCLC (95).Our results showed that irinotecan might serve as potential therapeutic drug against LUAD. In addition, other small molecules including vincristine, teniposide, etoposide, methotrexate, podophyllotoxin and others were also predicted in this study. Some of these small molecules such as vincristine and teniposide have so far been used for anti-tumor therapy (96,97), and these small molecules may play certain roles in combating LUAD. Despite the findings in terms of clinical implications, some limitations should be noted. First, this study is a retrospective study based on a bioinformatics strategy, so the robustness of the prediction value of the gene signature should be further validated by prospective clinical trials. Second, the functional roles of the gene signature should be further elucidated in LUAD.

Conclusions

Taken together, the present study systematically analyzed gene expression data related to LUAD using comprehensive bioinformatics methods and identified some key genes associated with the diagnosis and prognosis of LUAD patients. In addition, some small molecules were predicted to combat LUAD. These findings provide novel insights into the pathological mechanism involved in LUAD and may serve as potential diagnostic and prognostic markers and therapy targets against LUAD.

97 in total

1. Genomic Landscape of Atypical Adenomatous Hyperplasia Reveals Divergent Modes to Lung Adenocarcinoma.

Authors: Smruthy Sivakumar; F Anthony San Lucas; Tina L McDowell; Wenhua Lang; Li Xu; Junya Fujimoto; Jianjun Zhang; P Andrew Futreal; Junya Fukuoka; Yasushi Yatabe; Steven M Dubinett; Avrum E Spira; Jerry Fowler; Ernest T Hawk; Ignacio I Wistuba; Paul Scheet; Humam Kadara
Journal: Cancer Res Date: 2017-09-26 Impact factor: 12.701

2. CytoNCA: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks.

Authors: Yu Tang; Min Li; Jianxin Wang; Yi Pan; Fang-Xiang Wu
Journal: Biosystems Date: 2014-11-15 Impact factor: 1.973

Review 3. Cell cycle control and cancer.

Authors: L H Hartwell; M B Kastan
Journal: Science Date: 1994-12-16 Impact factor: 47.728

4. Prognostic significance of β2-adrenergic receptor expression in non-small cell lung cancer.

Authors: Tomohiro Yazawa; Kyoichi Kaira; Kimihiro Shimizu; Akira Shimizu; Keita Mori; Toshiteru Nagashima; Yoichi Ohtaki; Tetsunari Oyama; Akira Mogi; Hiroyuki Kuwano
Journal: Am J Transl Res Date: 2016-11-15 Impact factor: 4.060

5. Phase II Trial of Carfilzomib Plus Irinotecan in Patients With Small-cell Lung Cancer Who Have Progressed on Prior Platinum-based Chemotherapy.

Authors: Susanne M Arnold; Kari Chansky; Maria Q Baggstrom; Michael A Thompson; Rachel E Sanborn; John L Villano; Saiama N Waqar; John Hamm; Markos Leggas; Maurice Willis; Joseph Rosales; John J Crowley
Journal: Clin Lung Cancer Date: 2020-01-27 Impact factor: 4.785

6. ASPM is a major determinant of cerebral cortical size.

Authors: Jacquelyn Bond; Emma Roberts; Ganesh H Mochida; Daniel J Hampshire; Sheila Scott; Jonathan M Askham; Kelly Springell; Meera Mahadevan; Yanick J Crow; Alexander F Markham; Christopher A Walsh; C Geoffrey Woods
Journal: Nat Genet Date: 2002-09-23 Impact factor: 38.330

10. ZW10 Binding Factor (ZWINT), a Direct Target of Mir-204, Predicts Poor Survival and Promotes Proliferation in Breast Cancer.

Authors: Guangrong Zhou; Mingyang Shen; Zhengyuan Zhang
Journal: Med Sci Monit Date: 2020-04-05