| Literature DB >> 31886214 |
Hui Xie1,2, Conghua Xie1,2.
Abstract
Background and Goals. To identify a multigene signature model for prognosis of non-small-cell lung cancer (NSCLC) patients, we first found 2146 consensus differentially expressed genes (DEGs) in NSCLC overlapped in Gene Expression Omnibus (GEO) and TCGA lung adenocarcinoma (LUAD) datasets using integrated analysis. We constructed a weighted gene coexpression network (WGCN) using the consensus DEGs and identified the module significantly associated with pathological M stage and consisted of 61 genes. After univariate Cox regression analysis and subsequent stepwise model selection by the Akaike information criterion (AIC) and multivariate Cox hazard model analysis, an mRNA signature model which calculated prognostic score was generated: prognostic score = (-0.2491 × EXPRRAGB) + (-0.0679 × EXPRSPH9) + (-0.2317 × EXPRPS6KL1) + (-0.1035 × EXPRXFP1) + 0.1571 × EXPRRM2 + 0.1104 × EXPRTL1, where EXP is the fragments per kilobase million (FPKM) value of the mRNA included in the model. The prognostic model separated NSCLC patients in the TCGA-LUAD dataset into the low- and high-risk score groups with a median prognostic score of 0.972. Higher scores predicted higher risk. The area under ROC curve (AUC) was 0.994 or 0.776 in predicting the 1- to 10-year survival of NSCLC patients. The prognostic performance of this prognostic model was validated by an independent GSE11969 dataset of NSCLC adenocarcinoma with AUC values between 0.822 and 0.755 in predicting 1- to 10-year survival of NSCLC. These results suggested that the six-gene signature functioned as an independent biomarker to predict the overall survival of NSCLC adenocarcinoma.Entities:
Mesh:
Substances:
Year: 2019 PMID: 31886214 PMCID: PMC6925693 DOI: 10.1155/2019/4250613
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Characteristics of the public microarray datasets used in this study.
| Study | Species/array platform | Samples | Number of samples | Set |
|---|---|---|---|---|
| GSE19188 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array(GPL570) | NSCLC vs. control tissue | NSCLC = 91 (LUAD 45, LUSC 46), control tissue = 65, total = 156 | Training set |
| GSE30219 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array(GPL570) | NSCLC vs. control tissue | NSCLC = 293, control tissue = 14, total = 307 | Training set |
| GSE10072 | [HG-U133A] Affymetrix Human Genome U133A Array (GPL96) | NSCLC vs. control tissue | NSCLC = 58 (LUAD 58), control tissue = 49, total = 107 | Training set |
| GSE7670 | [HG-U133A] Affymetrix Human Genome U133A Array (GPL96) | NSCLC vs. control tissue | NSCLC = 28 (LUAD 28), control tissue = 30, total = 58 | Training set |
| GSE2514 | [MG_U74Av2] Affymetrix Murine Genome U74A Version 2 Array (GPL81) | NSCLC vs. control tissue | NSCLC = 20, control tissue = 19, total = 39 | Training set |
| GSE32863 | Illumina HumanWG-6 v3.0 expression beadchip (GPL6884) | NSCLC vs. control tissue | NSCLC = 58 (LUAD 58), control tissue = 58, total = 116 | Training set |
| GSE21933 | Phalanx Human OneArray (GPL6254) | NSCLC vs. control tissue | NSCLC = 21, control tissue = 21, total = 42 | Training set |
| GSE40275 | Human Exon 1.0 ST Array (GPL15974) | NSCLC vs. control tissue | NSCLC = 41, control tissue = 43, total = 84 | Training set |
| GSE12472 | Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version) (GPL1708) | NSCLC vs. control tissue | NSCLC = 35 (LUSC 35), control tissue = 28, total = 63 | Training set |
| GSE80796 | [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] (GPL6244) | NSCLC vs. control tissue | NSCLC = 309, control tissue = 196, total = 505 | Training set |
| GSE8500 | Human 3.0 A1 (GPL3991) | NSCLC vs. control tissue | NSCLC = 40, control tissue = 8, total = 48 | Training set |
| GSE85841 | Agilent-067406 Human CBC lncRNA + mRNA microarray V4.0 (GPL20115) | NSCLC vs. control tissue | NSCLC = 8, control tissue = 8, total = 16 | Training set |
| GSE19027 | [HG-U133A] Affymetrix Human Genome U133A Array (GPL96) | NSCLC vs. control tissue | NSCLC = 21, control tissue = 39, total = 60 | Training set |
| TCGA-LUAD | Human Illumina HiSeq 2000 | LUAD vs. control tissue | LUAD = 517, control tissue = 59, total = 576 | Test set |
| GSE11969 | Agilent Homo sapiens 21.6 K custom array (GPL7015) | Overall survival | LUAD = 94, Overall survival information = 94 | Validation set |
GSE, Gene Expression Omnibus accession number; TCGA, the Cancer Genome Atlas; NSCLC, non-small-cell lung cancer; LUAD, lung adenocarcinomal; LUSC, lung squamous cell carcinoma.
Figure 1Workflow of the integrated analysis and WGCNA analysis of the NSCLC datasets. Figure 1 was reproduced from Sun et al. [16].
Quality control of the training datasets.
| No. | Datasets | IQC | EQC | CQCg | CQCp | AQCg | AQCp | Rank |
|---|---|---|---|---|---|---|---|---|
| 1 | GSE19188 | 8.56 | 2 | 307.65 | 160.1 | 0.79 | 54.72 | 2.33 |
| 2 | GSE30219 | 4.43 | 2 | 307.65 | 182.57 | 0.76 | 71.49 | 2.58 |
| 3 | GSE10072 | 9.76 | 2 | 307.65 | 178.14 | 0.14 | 49.61 | 3.67 |
| 4 | GSE7670 | 6.48 | 1.7 | 307.65 | 133.26 | 0.2 | 43.06 | 4.75 |
| 5 | GSE2514 | 3.66 | 1.53 | 307.65 | 111.14 | 0.4 | 22.46 | 5.67 |
| 6 | GSE32863 | 5.73 | 1.7 | 307.65 | 63.92 | 0.08 | 32.71 | 5.75 |
| 7 | GSE21933 | 4.43 | 1 | 307.65 | 52.67 | 0.37 | 23.99 | 6.08 |
| 8 | GSE40275 | 3.96 | 1.53 | 8.68 | 34.36 | 0.29 | 21.19 | 7.50 |
| 9 | GSE12472 | 0.24 | 1.53 | 7.06 | 44.83 | 0.32 | 12.49 | 8.33 |
| 10 | GSE80796 | 2.08 | 1.05 | 0.27 | 2.32 | 0.28 | 1.38 | 10.00 |
| 11 | GSE8500 | 3.51 | 0.4 | 0.38 | 4.02 | 0.26 | 0.06 | 10.67 |
| 12 | GSE85841 | 0.04 | 0.62 | 0.71 | 0.31 | 0.04 | 0.26 | 11.67 |
| 13 | GSE19027 | 1.35 | 0.39 | 0 | 7.83 | 0.03 | 0.24 | 12.00 |
NSCLC, non-smal-cell lung carcinoma; GSE, GEO dataset; IQC, internal quality control indexes; EQC, external quality control indexes; CQCg and CQCp, consistency of differential expression quality control indexes for genes and pathways; AQCg and AQCp, accuracy quality control indexes for genes and pathways. P value not significant after Bonferroni correction.
Figure 2Meta-analysis of differentially expressed genes involved in NSCLC by combining P values. (a) Principal component analysis (PCA) biplot of quality control measures in thirteen NSCLC studies. (b) The number of differentially expressed genes plotted as a function of false discovery rate (FDR) in the analysis of four different datasets and four different meta-analysis algorithms (maxP, minP, roP, and adaptively weighted statistic). Figure 2 was reproduced from Sun et al. [16].
Figure 3Identification of consensus DEGs in the training and the test datasets of NSCLC patients. (a) Heat map and two-way hierarchical clustering based on 7076 DEGs that were differentially expressed between NSCLC and ANT samples of the training set. ANT (green label) and NSCLC (red label) samples fell into separate clusters. (b) The 3592 DEGs NSCLC (red label) vs. ANT (blue label) of the TCGA-LUAD test set. Each column represents a sample, and each row represents the mRNA level. The color scale represents the raw Z-score ranging from blue (low expression) to red (high expression). Dendrograms beside each heat map correspond to the hierarchical clustering of the 3592 DEGs by the expression level. (c) A Venn diagram showing the overlap of DEGs detected in the training and test sets. Figure 3 was reproduced from Sun et al. [16].
Figure 4Determination of parameter β of the adjacency function in the weighted gene correlation network analysis (WGCNA) algorithm. (a) Analysis of the scale-free fit index for various soft thresholding powers β. (b) Analysis of the mean connectivity for various soft thresholding powers. (c) Histogram of connectivity distribution when β = 19. (d) Checking the scale-free topology when β = 19. Figure 4 was reproduced from Sun et al. [16].
Figure 5Network construction of the weighted coexpressed genes and their associations with clinical traits. (a) Hierarchical clustering tree of the TCGA-LUAD samples based on the DEGs. Dendrogram tips are labeled with the TCGA-LUAD unique name and experiment identifier. In the hierarchical dendrogram, lower branches correspond to higher coexpression (height = Euclidean distance). Identical colors in the ten bands below the dendrogram depict the TCGA-LUAD clinical traits. (b) Heat map view of topological overlap of coexpressed genes in different modules. The heat map was generated from the topological overlap values between genes. The genes were grouped into modules labeled by a color code, which are given under the gene dendrogram on both sides. The topological overlap was high among genes of same module. (c) Module-trait relationships for gender, histological type, lymphatic invasion, tumor status, treatment condition, and pathologic stage. Numbers shown represent Pearson correlations between the modules and traits. P values are in parentheses. Numbers on the color bar refer to the strength of the correlation in the table (red = 1, blue = −1). T, extent of the tumor; N, extent of spread to the lymph nodes; M, presence of metastasis. Figure 5 was reproduced from Sun et al. [16].
Univariate Cox regression analysis of lightcyan module genes and overall survival.
| Genes | Overall survival | ||
|---|---|---|---|
| HR | CI (95% CI) |
| |
| RRM2 | 1.291 | 1.152–1.448 | <0.001 |
| RPS6KL1 | 0.798 | 0.709–0.897 | <0.001 |
| RTL1 | 1.132 | 1.054–1.215 | 0.001 |
| RXFP1 | 0.864 | 0.787–0.949 | 0.002 |
| RRM1 | 1.438 | 1.121–1.845 | 0.004 |
| RTCD1 | 1.489 | 1.100–2.014 | 0.010 |
| RRAGB | 0.706 | 0.537–0.929 | 0.013 |
| RSPH10B2 | 0.909 | 0.842–0.981 | 0.014 |
| RRM2B | 0.784 | 0.641–0.958 | 0.017 |
| RSPH9 | 0.901 | 0.826–0.983 | 0.019 |
| RXFP2 | 0.734 | 0.562–0.960 | 0.024 |
| RUNX1 | 0.798 | 0.645–0.988 | 0.038 |
HR, hazard ratio; CI, confidence interval. Association of the 61 genes of the lightcyan module with survival was analyzed using univariate Cox regression analysis. Presented in the table were those which showed significant association (P < 0.05).
Multivariate Cox regression analysis of lightcyan module genes and overall survival.
| Genes |
| HR | selogHR |
| CI (95% Cl) |
|
|---|---|---|---|---|---|---|
| RRAGB | −0.2491 | 0.7795 | 0.1360 | −1.83 | 0.5971–1.0176 | 0.06700 |
| RSPH9 | −0.0679 | 0.9344 | 0.0462 | −1.47 | 0.8535–1.0230 | 0.14201 |
| RPS6KL1 | −0.2317 | 0.7932 | 0.0608 | −3.81 | 0.7042–0.8935 | 0.00014# |
| RTL1 | 0.1104 | 1.1167 | 0.0380 | 2.91 | 1.0367–1.2030 | 0.00364# |
| RXFP1 | −0.1035 | 0.9016 | 0.0497 | −2.08 | 0.8180–0.9939 | 0.03720# |
| RRM2 | 0.1571 | 1.1701 | 0.0626 | 2.51 | 1.0350–1.3229 | 0.01209# |
The listed genes were selected by the Akaike information criterion (AIC) model from the significant genes after the univariate Cox regression analysis (Table 3). Multivariate Cox regression analysis of association of the listed genes with survival was performed to reveal the independent predictor for survival and generate a prognostic risk score model. β, regression coefficient; HR, hazard ratio; CI, confidence interval; #P < 0.05.
Figure 6The prognostic performance of the six-gene signature in the TCGA-LUAD test cohort. (a) Risk score analysis of the six-gene signature of NSCLC. Risk score of gene signature (top); duration of cases (middle); low and high score groups for the six genes (bottom). (b) Survival analysis of the high-risk group and the low-risk group using Kaplan–Meier curves. (c) The prognostic efficiency of the six-gene signature for survival time. ROC curves of the six-gene signature for predicting 1- to 10-year survival were analyzed. (d–i) The independent prognostic efficiency of individual mRNA in the six-gene signature of the test set. (d) RRM2; (e) RPS6KL1; (f) RTL1; (g) RXFP1; (h) RRAGB; (i) RSPH9. Horizontal axis, overall survival time. Vertical axis, overall survival.
Figure 7The prognostic performance of the six-gene signature in the GSE11969 validation cohort. (a) Risk score analysis of the six-gene signature of NSCLC. Risk score of gene signature (top); duration of cases (middle); low and high score groups for the six genes (bottom). (b) Survival analysis of the high-risk group and the low-risk group using Kaplan–Meier curves. (c) The prognostic efficiency of the six-gene signature for survival time. ROC curves of the six-gene signature for predicting 1- to 10-year survival were analyzed. (d–i) The independent prognostic efficiency of individual mRNA in the six-gene signature in the validation set. (d) RRM2; (e) RPS6KL1; (f) RTL1; (g) RXFP1; (h) RRAGB; (i) RSPH9. Horizontal axis, overall survival time. Vertical axis, overall survival.
Figure 8The relative levels of RRAGB, RSPH9, RPS6KL1, RXFP1, RRM2, and RTL1 in NSCLC adenocarcinoma. (a) Compared with the normal control, RRAGB, RSPH9, and RXFP1 mRNA levels were significantly decreased and RPS6KL1, RTL1, and RRM2 mRNA levels were significantly increased in NSCLC of TCGA-LUAD. (b) The expression levels of RRAGB, RSPH9, RPS6KL1, and RXFP1 were significantly lower and RTL1 and RRM2 mRNA levels were significantly higher in the high-risk group than those in the low-risk group of TCGA-LUAD. (c) The expression levels of RRAGB, RSPH9, RPS6KL1, RXFP1, and RTL1 were significantly lower and RRM2 mRNA levels was significantly higher in the high-risk group than those in the low-risk group in NSCLC adenocarcinoma of GSE11969.