
Independent Validation of Early-Stage Non-Small Cell Lung Cancer Prognostic Scores Incorporating Epigenetic and Transcriptional Biomarkers With Gene-Gene Interactions and Main Effects.

Ruyang Zhang¹, Chao Chen², Xuesi Dong³, Sipeng Shen⁴, Linjing Lai², Jieyu He², Dongfang You⁵, Lijuan Lin⁵, Ying Zhu², Hui Huang², Jiajin Chen², Liangmin Wei², Xin Chen², Yi Li⁶, Yichen Guo⁷, Weiwei Duan⁸, Liya Liu⁹, Li Su¹⁰, Andrea Shafer¹¹, Thomas Fleischer¹², Maria Moksnes Bjaanæs¹², Anna Karlsson¹³, Maria Planck¹³, Rui Wang¹⁴, Johan Staaf¹³, Åslaug Helland¹⁵, Manel Esteller¹⁶, Yongyue Wei⁴, Feng Chen¹⁷, David C Christiani¹⁸.

Abstract

BACKGROUND: DNA methylation and gene expression are promising biomarkers of various cancers, including non-small cell lung cancer (NSCLC). Besides the main effects of biomarkers, the progression of complex diseases is also influenced by gene-gene (G×G) interactions. RESEARCH QUESTION: Would screening the functional capacity of biomarkers on the basis of main effects or interactions, using multiomics data, improve the accuracy of cancer prognosis? STUDY DESIGN AND METHODS: Biomarker screening and model validation were used to construct and validate a prognostic prediction model. NSCLC prognosis-associated biomarkers were identified on the basis of either their main effects or interactions with two types of omics data. A prognostic score incorporating epigenetic and transcriptional biomarkers, as well as clinical information, was independently validated.
RESULTS: Twenty-six pairs of biomarkers with G×G interactions and two biomarkers with main effects were significantly associated with NSCLC survival. Compared with a model using clinical information only, the accuracy of the epigenetic and transcriptional biomarker-based prognostic model, measured by area under the receiver operating characteristic curve (AUC), increased by 35.38% (95% CI, 27.09%-42.17%; P = 5.10 × 10-17) and 34.85% (95% CI, 26.33%-41.87%; P = 2.52 × 10-18) for 3- and 5-year survival, respectively, which exhibited a superior predictive ability for NSCLC survival (AUC3 year, 0.88 [95% CI, 0.83-0.93]; and AUC5 year, 0.89 [95% CI, 0.83-0.93]) in an independent Cancer Genome Atlas population. G×G interactions contributed a 65.2% and 91.3% increase in prediction accuracy for 3- and 5-year survival, respectively.
INTERPRETATION: The integration of epigenetic and transcriptional biomarkers with main effects and G×G interactions significantly improves the accuracy of prognostic prediction of early-stage NSCLC survival.


Keywords: early stage; interaction; multiomics; non-small cell lung cancer; prognostic score


Substances：
Biomarkers, Tumor

Year: 2020 PMID： 32113923 PMCID： PMC7417380 DOI： 10.1016/j.chest.2020.01.048

Source DB: PubMed Journal: Chest ISSN： 0012-3692 Impact factor: 9.410

Lung cancer is a leading cause of cancer-related death worldwide and was estimated to cause 1.76 million deaths in 2018. The 5-year survival rate among patients with lung cancer remains relatively low, ranging from 4% to 17% depending on clinical characteristics. Compared with patients diagnosed with late-stage disease, early-stage patients often have a considerably more favorable prognosis. However, significant heterogeneity in clinical prognosis is observed among patients with early-stage non-small cell lung cancer (NSCLC) who have similar clinical characteristics, which indicates the importance of understanding molecular mechanisms. Identifying molecular changes in oncogenes and/or tumor suppressor genes that are associated with NSCLC survival is helpful for developing targeted therapies to prolong patients’ survival time.
DNA methylation is a heritable, reversible, epigenetic modification that affects the spatial conformation of DNA and regulates gene expression. It is a molecular biomarker and may be a therapeutic target for the treatment of cancer. In addition, gene-gene (G×G) interactions have long been recognized to regulate the progression of complex diseases, including NSCLC. The development of cancer may be related to interactions between several key genes.
Lung cancer prognosis-associated biomarkers have been proposed on the basis of omics data, including DNA methylation, gene expression, microRNA, and long noncoding RNA. However, most studies are limited to a single type of omics data, which results in less accurate prognostic models. For example, our previous integrative omics study of the BTG2 gene showed that this gene could slightly improve the prediction accuracy of early-stage NSCLC survival. A large-scale integrative analysis of multiomics data could instead identify genes with either important main effects or G×G interactions, on which a more accurate prognostic prediction model of NSCLC can be built.
Specifically, we used a two-stage study design and performed an integrative analysis of pan-cancer-related genes to identify prognostic biomarkers with either a main effect or G×G interactions using epigenome and transcriptome data from multiple study centers. We then built a prognostic prediction model for early-stage NSCLC by incorporating both selected epigenetic and transcriptional biomarkers.

Methods

Only patients with early-stage (stage I or II) lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) were included in our study. DNA methylation data were harmonized from five international study centers, including Harvard, Spain, Norway, Sweden, and the Cancer Genome Atlas (TCGA). Gene expression data were composed of four datasets from the Gene Expression Omnibus (GEO) and TCGA.

Harvard

The Harvard cohort consisted of patients seen at Massachusetts General Hospital (MGH), and histologically confirmed as having primary NSCLC, recruited since 1992. We profiled 151 early-stage patients from this cohort. A lung pathologist at MGH evaluated each specimen for the amount (tumor cellularity, > 70%) and quality of tumor cells. The specimens were classified histologically according to World Health Organization criteria. The institutional review boards at the Harvard T. H. Chan School of Public Health and MGH approved the study. All patients provided written informed consent.

Spain

The Spanish cohort included 226 patients with early-stage NSCLC recruited from eight subcenters between 1991 and 2009. Patients provided written consent and tumors were surgically collected. This study was approved by the Bellvitge Biomedical Research Institute institutional review boards.

Norway

The Norwegian cohort consisted of 133 patients with early-stage NSCLC from Oslo University Hospital, recruited between 2006 and 2011. The project was developed with the approval of the Oslo University Institutional Review Board and regional ethics committee (S-05307). All patients provided informed consent. Tumor tissues were snap frozen in liquid nitrogen and stored at –80°C until DNA isolation.

Sweden

Tumor DNA was collected from 103 patients with early-stage NSCLC, including 80 patients with LUAD and 23 patients with LUSC, at the Skåne University Hospital in Lund, Sweden. The study was developed under the approval of the Regional Ethical Review Board in Lund, Sweden (Registration nos. 2004/762 and 2008/702).

TCGA

A total of 332 LUAD and 285 LUSC with full DNA methylation, survival time, and covariates data were included. Level 1 HumanMethylation450 DNA methylation data from patients with early-stage NSCLC were downloaded on October 1, 2015.

GEO

Transcriptome information from 425 patients with early-stage NSCLC was profiled using the Affymetrix Human Genome U133A Plus 2.0 Array (e-Table 1). Only data from patients with available survival time, clinical stage, and tumor tissue expression values were analyzed.

Quality Control for DNA Methylation Data

DNA methylation was assessed with Illumina Infinium HumanMethylation450 BeadChips (Illumina Inc.). Raw image data were imported into GenomeStudio Methylation Module V1.8 (Illumina Inc.) to calculate methylation signals and to perform normalization, background subtraction, and quality control (QC). Probes were excluded if they met any of the following criteria: (1) failed detection (P > .05) in ≥ 5% of samples; (2) coefficient of variation < 5%; (3) all samples were methylated or all were unmethylated; (4) common single-nucleotide polymorphisms located in the probe sequence or in the 10-bp flanking regions; (5) cross-reactive probes; or (6) data did not pass QC in all centers. Samples with > 5% undetectable probes were excluded. Methylation signals were further processed by quantile normalization (betaqn function in R package minfi) and type I and II probe correction (BMIQ function in R package lumi). They were then adjusted for batch effects (ComBat function in R package sva), following the pipeline that performed best in a comparative study. Details of the QC process are described in e-Figure 1.
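The two numerically specified probe criteria above, (1) and (2), can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline (which used GenomeStudio and R packages). The function name `keep_probe`, its parameters, and the use of the sample standard deviation for the coefficient of variation are assumptions made for the example.

```python
from statistics import mean, stdev


def keep_probe(beta_values, detection_pvalues,
               p_cutoff=0.05, max_fail_fraction=0.05, min_cv=0.05):
    """Return True if a probe passes criteria (1) and (2) as read from the text.

    beta_values: methylation values of one probe across samples (hypothetical input)
    detection_pvalues: detection P values of the same probe across samples
    """
    n = len(beta_values)
    # Criterion (1): detection failed (P > p_cutoff) in >= 5% of samples.
    n_failed = sum(p > p_cutoff for p in detection_pvalues)
    if n_failed / n >= max_fail_fraction:
        return False
    # Criterion (2): coefficient of variation (SD / mean) below 5%.
    # Assumption: sample SD; the source does not say which estimator was used.
    m = mean(beta_values)
    if m == 0 or stdev(beta_values) / m < min_cv:
        return False
    return True
```

Criteria (3) to (6) depend on annotation files and cross-center QC results that the text does not specify, so they are left out of the sketch.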

Quality Control for Gene Expression Data

The TCGA workgroup completed the mRNA sequencing data processing and QC. Raw counts were normalized using RNA-sequencing by expectation maximization. Level 3 gene quantification data were downloaded from the TCGA data portal and were further checked for quality. Gene probes were excluded if the missing rate was > 80%, and the batch effect was corrected with ComBat. The expression value of each gene was transformed to a log2 scale and standardized before association analysis. DNA methylation and gene expression of 719 pan-cancer-related genes were then used for subsequent association analysis. Gene symbols for the 719 pan-cancer-related genes were obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC). After QC, 12,806 CpG probes remained for association analysis. CpG probes from five genes (BTG2, KDM, EGLN2, LRRC3B, and SIPA1L3) reported in our previous study were also included.

Statistical Analysis

The flow of analysis is depicted in Figure 1. Epigenetic and transcriptional analyses were performed simultaneously, and a discovery phase and validation phase were used to identify NSCLC prognostic biomarkers. In each procedure, we conducted analysis of both the main effects and gene-gene interactions among biomarkers. Patients having DNA methylation data from Harvard, Spain, Norway, and Sweden, as well as patients having gene expression data from GEO, were assigned to the discovery phase for epigenetic analysis and transcriptional analysis, respectively. Patients having two types of omics data from TCGA were assigned to the validation phase.

Figure 1

Flow chart of study design and statistical analyses. In the epigenetic analysis, patients with lung adenocarcinoma and lung squamous cell carcinoma from the Harvard, Spain, Norway, and Sweden cohorts were used in the discovery phase for screening, whereas data from the Cancer Genome Atlas (TCGA) were used for validation. In transcriptional analysis, gene expression data from Gene Expression Omnibus and TCGA were used in the discovery phase and the validation phase, respectively. Both main effect and G×G interaction analyses were performed. G×G = gene by gene; NSCLC = non-small cell lung cancer.

For the main effect analysis, we used sure independence screening (SIS) and LASSO Cox penalized regression to screen biomarkers with main effects that were relevant to survival, using the R package SIS. SIS LASSO is a two-stage procedure. At the first stage, SIS selects the biomarkers with the strongest marginal associations with survival. At the second stage, LASSO performs variable selection and parameter estimation simultaneously among the biomarkers selected at the first stage. During the LASSO procedure, tuning parameter selection was based on the Bayesian information criterion. To capture biomarkers that might be missed at the first stage, we repeatedly applied the SIS LASSO algorithm to the remaining unselected biomarkers until no new biomarkers could be recruited. This iterative procedure is termed iterative SIS (ISIS) LASSO. To account for the biologic heterogeneity between LUAD and LUSC, we used a histology-stratified multivariate Cox proportional hazards model to test these biomarkers, using the R package survival. The stratified model adjusted for the differences between LUAD and LUSC in baseline hazards. The other covariates adjusted in the model were age, sex, study center, clinical stage, and smoking status.
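For reference, the histology-stratified Cox model described above can be written in its standard form. The symbols are chosen here for illustration and do not appear in the source: $x$ is the biomarker value, $\mathbf{z}$ the covariate vector (age, sex, study center, clinical stage, smoking status), and $h_{0s}$ the baseline hazard of histology stratum $s$.

```latex
h_s(t \mid x, \mathbf{z}) = h_{0s}(t)\,\exp\!\left(\beta x + \boldsymbol{\gamma}^{\top}\mathbf{z}\right),
\qquad s \in \{\text{LUAD}, \text{LUSC}\}
```

Under this form each histology has its own baseline hazard while the coefficients are shared across strata, and the hazard ratio per unit increase in the biomarker is $\exp(\beta)$.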
For the G×G interaction analysis, a histology-stratified multivariate Cox proportional hazards model adjusted for the aforementioned covariates was applied to identify biomarkers with G×G interactions. The P value thresholds for multiple testing were established by the Bonferroni method, which sets the significance level to .05 divided by the number of tests, so that the overall type I error is controlled at the .05 level. In our study, the significance level of G×G interaction analysis of epigenetic and transcriptional biomarkers was defined as 6.10 × 10–10 = 0.05/(12,806 × 12,805/2) and 1.94 × 10–7 = 0.05/(719 × 718/2), respectively. Significant biomarkers observed in the discovery phase were further tested in the validation phase and were retained if the P value was ≤ .05 and the direction of the effect was consistent across the two phases. We also tested the proportional hazards assumption for each significant biomarker. The hazard ratio (HR) and 95% CI were expressed per 1% increment in DNA methylation or gene expression. Sensitivity analysis was performed to confirm the robustness of these significant biomarkers: patients were excluded if their DNA methylation (logit2 transformed) or expression (log2 transformed) values fell outside the range of mean ± 3 × SD.
For the identified biomarkers, we applied a forward stepwise regression strategy to build a multibiomarker Cox proportional hazards model in the discovery phase, which was then validated in TCGA samples. In the forward stepwise regression, a likelihood ratio test was applied to each main effect or G×G interaction, with thresholds of Pentry ≤ .05 and Pelimination > .05. Sensitivity analysis was also performed using two different thresholds: .10 and .15. Epigenetic and transcriptional scores were calculated as a weighted linear combination of individual DNA methylation and gene expression values, with weights derived from the Cox model.
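The two Bonferroni thresholds quoted for the G×G interaction analysis follow from the number of unordered pairs among 12,806 CpG probes and among 719 genes. A short check of that arithmetic is shown here; the function name `pairwise_bonferroni_threshold` is hypothetical.

```python
from math import comb


def pairwise_bonferroni_threshold(n_features, alpha=0.05):
    """Bonferroni significance level when every unordered pair of features is tested once."""
    n_tests = comb(n_features, 2)  # n * (n - 1) / 2 pairwise tests
    return alpha / n_tests


epigenetic_threshold = pairwise_bonferroni_threshold(12806)
transcriptional_threshold = pairwise_bonferroni_threshold(719)

print(f"{epigenetic_threshold:.2e}")       # 6.10e-10
print(f"{transcriptional_threshold:.2e}")  # 1.94e-07
```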
Integrative scores were synthesized from the epigenetic and transcriptional scores. Finally, the prognostic score was defined as the linear combination of clinical information and integrative score (see e-Appendix 1). Kaplan-Meier survival curves adjusted for the covariates were drawn to represent the survival difference among patients with different scores. We predicted 3- and 5-year overall survival of patients, using the nearest neighbor method for time-to-event data. The accuracy of the prediction is presented using a receiver operating characteristic (ROC) curve and was measured by area under the ROC curve (AUC), computed by the R package survivalROC. The prediction accuracy was confirmed with an independent TCGA population in the validation phase. The 95% CI and P value of the AUC improvement were calculated on the basis of 1,000 bootstrap resamples. Stratification analysis of prognostic scores was carried out within subgroups stratified by age, sex, smoking status, clinical stage, and histology. The concordance index (Cindex), an average accuracy of predictive survival across follow-up years that ranges from 0.5 to 1.0, and its 95% CI were calculated to estimate the predictive performance. A nomogram was generated with R package rms to facilitate application of our model.
We assessed the potential functions of the identified genes at the protein level, using the limited public resources available. First, we evaluated the association between protein expression and gene expression, using the reverse-phase protein array from the TCGA database. Second, we performed differential expression analysis between tumor and normal tissues, and further investigated the main effects of genes and G×G interactions between genes on LUAD survival, using the Clinical Proteomic Tumor Analysis Consortium (CPTAC) database.
Differential protein expression analysis was performed with the R package limma, which fits a linear model to estimate fold changes and SEs prior to empirical Bayes smoothing. Finally, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was carried out with Metascape. Gene network analysis was conducted with GeneMANIA, a plugin of the Cytoscape application. The critical hubs, which are highly connected to other nodes in a module, were defined as the nodes with the highest connectivity degrees. P values were two-sided. All statistical analyses were performed with R version 3.5.1 (R Foundation), unless otherwise specified.
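The biomarker scores described in this section are weighted linear combinations of biomarker values with Cox-derived weights. The sketch below shows one plausible way to compute such a score. The exact formula is given in e-Appendix 1, which is not reproduced here, so the treatment of G×G terms as products of the two biomarker values is an assumption. All probe names and weights are made up for illustration.

```python
def biomarker_score(values, main_weights, pair_weights):
    """Weighted linear combination of main-effect terms and pairwise product terms.

    values: biomarker name -> measured value for one patient
    main_weights: biomarker name -> coefficient (assumed to come from a Cox model)
    pair_weights: (name_a, name_b) -> coefficient of the product term (assumption)
    """
    score = sum(w * values[name] for name, w in main_weights.items())
    score += sum(w * values[a] * values[b] for (a, b), w in pair_weights.items())
    return score


# Hypothetical patient and hypothetical coefficients.
patient = {"probe_a": 2.0, "probe_b": 3.0, "probe_c": 0.5}
score = biomarker_score(
    patient,
    main_weights={"probe_a": 0.1},
    pair_weights={("probe_b", "probe_c"): -0.2},
)
```

A higher score corresponds to a higher predicted hazard when the weights are log hazard ratios, which is how the scores are used in the Results.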

Results

After QC, 1,230 (Ndiscovery = 613 and Nvalidation = 617) patients with 12,806 CpG probes and 719 gene probes were included in the association analysis. The demographic and clinical information are described in e-Tables 2, 3. For the main effect analysis of DNA methylation and gene expression, 23 CpG probes (e-Tables 4-6) and 13 gene probes (e-Tables 7, 8) were selected by ISIS LASSO, respectively. However, only cg19286631 was significantly associated with survival in both phases (HRdiscovery = 1.03 [95% CI, 1.01-1.05], P = 1.43 × 10–2; HRvalidation = 1.03 [95% CI, 1.01-1.06], P = 1.13 × 10–3) and remained significant in sensitivity analysis. Also, only one gene probe located in the NDRG1 gene remained significant in the validation phase (HRdiscovery = 1.41 [95% CI, 1.05-1.89], P = 2.16 × 10–2; HRvalidation = 1.12 [95% CI, 1.01-1.42], P = 4.33 × 10–2) and sensitivity analysis. For the G×G interaction analysis, we observed 2,495 and 40 G×G interactions from epigenetic and transcriptional analysis, respectively, in the discovery phase. Finally, 149 and 2 G×G interactions were retained in the validation phase that were also significant in the sensitivity analysis (e-Tables 9-13). By forward stepwise regression analysis in the discovery phase, we observed one CpG probe with a main effect and 25 pairs of CpG probes with G×G interactions in the multibiomarker model (e-Table 14), which was used to calculate the epigenetic score (e-Table 15) (HRdiscovery = 2.71 [95% CI, 2.41-3.05]; P = 1.15 × 10–61). One gene probe with a main effect and one pair of gene probes with a G×G interaction were retained in the multibiomarker model and used to calculate the transcriptional score (HRdiscovery = 2.44 [95% CI, 1.78-3.35]; P = 2.79 × 10–8). 
The associations between survival and each of these scores were independently confirmed in the validation phase when adjusted for covariates (epigenetic score: HRvalidation = 2.72 [95% CI, 2.31-3.20], P = 6.06 × 10–33; transcriptional score: HRvalidation = 2.64 [95% CI, 1.73-4.04], P = 7.51 × 10–6; integrative score: HRvalidation = 2.72 [95% CI, 2.32-3.18], P = 5.68 × 10–35; prognostic score: HRvalidation = 2.72 [95% CI, 2.34-3.17], P = 5.04 × 10–38). To evaluate the discriminative ability of these scores, samples in the validation phase were categorized into low-, medium-, and high-score groups based on the tertiles of epigenetic, transcriptional, integrative, and prognostic scores, respectively. Compared with the epigenetic low-score group, the medium- and high-score groups had 4.39- and 21.24-fold mortality risk, respectively (HRMedium vs Low = 4.39 [95% CI, 2.42-7.99], P = 1.22 × 10–6; HRHigh vs Low = 21.24 [95% CI, 11.23-40.17], P = 5.67 × 10–21) (Fig 2A). Patients with a high transcriptional score had significantly worse survival (HRMedium vs Low = 1.46 [95% CI, 0.92-2.33], P = 1.04 × 10–1; HRHigh vs Low = 2.26 [95% CI, 1.41-3.60], P = 6.52 × 10–4) (Fig 2B). The significant survival difference was enhanced among patients with different integrative scores (HRMedium vs Low = 4.32 [95% CI, 2.39-7.83], P =1.33 × 10–6; HRHigh vs Low = 24.32 [95% CI, 12.71-46.56], P = 5.76 × 10–22) (Fig 2C). Moreover, when combined with clinical information, including age, sex, study center, clinical stage, and smoking status, the prognostic score significantly discriminated NSCLC survival (HRMedium vs Low = 7.32 [95% CI, 3.50-15.33], P = 1.29 × 10–7; HRHigh vs Low = 28.85 [95% CI, 13.13-63.43], P = 5.83 × 10–17) (Fig 2D). The discriminative ability of the prognostic score is further illustrated by categorizing patients on the basis of the quintile level of the score. 
Figure 2E shows a monotonic ordering: patients in higher-quintile groups had lower 3- and 5-year survival rates, as well as shorter median survival time. This indicates that patients with higher mortality risks can be detected by using our score system (HRLevel 5 vs 1 = 66.09 [95% CI, 25.13-173.80], P = 1.98 × 10–17; HRLevel 4 vs 1 = 21.02 [95% CI, 8.13-54.31], P = 3.24 × 10–10; HRLevel 3 vs 1 = 9.13 [95% CI, 3.51-23.78], P = 5.93 × 10–6; HRLevel 2 vs 1 = 4.40 [95% CI, 1.68-11.53], P = 2.53 × 10–3) (Fig 2F). The performance of the prognostic score was further confirmed in the analysis stratified by covariates (Fig 3).
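The tertile grouping used for the low-, medium-, and high-score comparisons can be illustrated as follows. This is a sketch with a hypothetical function name. The quantile estimator and the handling of scores that fall exactly on a cutoff are assumptions, since the source does not specify them.

```python
from statistics import quantiles


def tertile_groups(scores):
    """Label each score as low, medium, or high using the sample tertiles as cutoffs."""
    low_cut, high_cut = quantiles(scores, n=3)  # two cut points split the data in three
    labels = []
    for s in scores:
        if s <= low_cut:  # assumption: a score equal to a cutoff goes to the lower group
            labels.append("low")
        elif s <= high_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels
```

Passing `n=5` to `quantiles` and extending the labels in the same way would give the quintile levels used in Figure 2E and 2F.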

Figure 2

Estimated survival curves for patients grouped by various biomarker-based scores. A, Epigenetic score of DNA methylation. B, Transcriptional score of gene expression. C, Integrative score of DNA methylation and gene expression. D, Prognostic score of DNA methylation, gene expression, and clinical information. Patients were categorized into low-, medium-, and high-score groups by using the tertiles of each score as the cutoffs. E, Discriminative ability of the prognostic score. Results of 3- and 5-year survival rate, median survival time, and hazard ratio (HR) were compared across five groups, defined by using the quintiles of the prognostic score as the cutoffs. F, HR and P values were derived from the Cox proportional hazards model for patients with different quintile levels of the prognostic score. HRH vs L = HRHigh vs Low; HRM vs L = HRMedium vs Low.

Figure 3

Forest plots of results from stratification analysis of prognostic score. HR with 95% CI of the prognostic score on non-small cell lung cancer survival in various subgroups is stratified by clinical characteristics. LUAD = lung adenocarcinoma; LUSC = lung squamous cell carcinoma. See Figure 2 legend for expansion of other abbreviation.

We then independently validated the predictive ability of these biomarkers. The model with only clinical information, as aforementioned, had very limited prediction ability (AUC3 year = 0.65, AUC5 year = 0.66). However, by adding biomarkers with either main effects or G×G interactions, the AUCs significantly increased by 35.38% (95% CI, 27.09%-44.17%; P = 5.10 × 10–17) and 34.85% (95% CI, 26.33%-41.87%; P = 2.52 × 10–18) for 3- and 5-year survival, respectively, and exhibited a superior predictive ability for NSCLC survival (AUC3 year = 0.88 [95% CI, 0.83-0.93]; AUC5 year = 0.89 [95% CI, 0.83-0.93]) (Fig 4). G×G interactions contributed an additional 65.2% for 3-year and 91.3% for the 5-year prediction accuracy increase.
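The reported AUC increases are consistent with a relative increase over the clinical-only model, computed from the rounded AUCs given above. This reading is an inference from the numbers, not a formula stated in the text, and the function name is hypothetical.

```python
def relative_increase_percent(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


# Clinical-only AUC vs AUC of the full prognostic model, as reported in the text.
print(round(relative_increase_percent(0.65, 0.88), 2))  # 35.38 (3-year survival)
print(round(relative_increase_percent(0.66, 0.89), 2))  # 34.85 (5-year survival)
```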

Figure 4

Receiver operating characteristic curves for various predictive models using the clinical information (C), the main and interaction effects of DNA methylation (M), and gene expression (E). A, Three-year survival prediction. B, Five-year survival prediction. The AUC increase (%) was evaluated by comparing the model with that with only the clinical information. P values and 95% CIs were calculated by using 1,000 bootstrap samples. AUC = area under the receiver operating characteristic curve; ROC = receiver operating characteristic. See Figure 1 legend for expansion of other abbreviations.

In the sensitivity analysis, we reanalyzed the stepwise regression using two different thresholds (P = .10 and .15) and found that the majority of the selected biomarkers were the same as those in the original regression model (e-Table 16). We then recalculated these scores, retested their associations with NSCLC survival, and obtained similar results (e-Table 17). Meanwhile, the AUCs of our prognostic model using different thresholds were comparable: 0.88 (threshold .05) vs 0.85 (.10) vs 0.86 (.15) for 3-year survival; 0.89 (.05) vs 0.83 (.10) vs 0.86 (.15) for 5-year survival (e-Figs 2 and 3). Moreover, we found that the effects of these four scores did not differ significantly between patients with LUAD and patients with LUSC (PEpigenetic score = .6572; PTranscriptional score = .1823; PIntegrative score = .5532; PPrognostic score = .9653) (e-Table 18). Our prognostic model retained similar prediction ability in both the LUAD (AUC3 year = 0.91, AUC5 year = 0.89, and Cindex = 0.82; 95% CI, 0.76-0.87) and LUSC (AUC3 year = 0.85, AUC5 year = 0.87, and Cindex = 0.82; 95% CI, 0.76-0.88) populations, indicating the usefulness of the selected biomarkers and their interactions in predicting the outcomes for patients with LUAD and patients with LUSC (e-Fig 4).
To facilitate application of our prognostic prediction model, we combined clinical information and scores of biomarkers and developed a nomogram, which estimated a patient’s 3- or 5-year survival well (e-Fig 5). The Cindex of the prognostic score indicated acceptable prediction accuracy (Cindex = 0.82; 95% CI, 0.78-0.86) in an independent TCGA population. The calibration plots also showed good agreement between observed and predicted survival time (e-Fig 6). In protein analysis, three of the four genes mapped in TCGA had significant correlation between gene expression and protein expression (e-Table 19). Most (77%) of the 47 genes mapped in CPTAC were significantly differentially expressed between tumor and normal tissue (e-Fig 7). In addition, one gene with a main effect and four pairs of genes with G×G interactions had a significant effect on LUAD survival (e-Table 20). Among 49 genes identified in epigenetic analysis, five genes (FOXP1, AFF3, BCL6, MAPK1, and STAT3) were identified as hub genes with the highest connectivity degrees, greater than 25 (Fig 5). These 49 genes were enriched in cancer-related pathways including the non-small cell lung cancer pathway (e-Table 21).

Figure 5

Gene network and gene enrichment analysis of 49 genes to which 25 pairs of CpG probes with interaction and one CpG probe with main effect are mapped. A, The gene network plot constructed by GeneMANIA. Central nodes with boldface outline represent hub genes, and the size represents the connectivity degree of each node. B, Barplot of gene pathways enriched with significant genes, and colored by P values. C, The pathway network plot of these pathways enriched with significant genes. Significant pathways with a similarity > 0.3 are connected by edges. Each node represents an enriched term and is colored by its cluster identification. The size of the node represents the number of genes in the pathway. The edge represents potential biologic relationships between two pathways. GO = Gene Ontology.

Discussion

An accurate prognostic predictive model may aid physicians in making clinical decisions or guiding adjuvant therapy, especially for vulnerable patients with high mortality risk. Although subject and tumor characteristics have been commonly used as valid predictors, increasing evidence indicates that molecular biomarkers may provide early warning signals. This is because tumor cells may metastasize even when the tumor size is undetectable (< 0.01 cm3) and aberrations of biomarkers occur. Thus, there is added value when a prognostic predictive model incorporates both genetic and nongenetic factors, whose effects can be captured using approaches that are both biologically stable and technically reproducible. We conducted a two-stage integrative study of DNA methylation and gene expression data from multiple centers to propose a prognostic scoring method incorporating transomics biomarkers with main effects and G×G interactions. The prognostic score, which was validated in an independent population, effectively discriminates survival outcomes for patients with early-stage NSCLC and significantly improves prediction accuracy for their prognosis.
G×G interactions are of interest because they provide important clues regarding the biologic mechanisms of complex diseases. Previous studies suggested that identification of G×G interactions would improve the predictive accuracy of statistical models. However, interactions might not dramatically improve prediction if their effects are weak or there are few significant interactions, although they might still optimize statistical modeling. Besides prediction, G×G interactions could increase the power to detect associations and then be leveraged for the identification of new biomarkers. Our results showed that biomarkers with G×G interactions significantly and predominantly improved the prognostic prediction accuracy of early-stage NSCLC, which might be due to increased power.
To evaluate the prediction accuracy of our model, we conducted a literature search to compare our study with others. The details of the literature screening process are summarized in e-Figure 8. The prediction accuracy of our model compares favorably (e-Table 22): the study with the best previously reported AUC (0.80) had a very small sample size, and the study with the largest sample size lacked independent validation and had unsatisfactory prediction capacity (Cindex = 0.64). Our study has a relatively large sample size and provides a satisfactory prediction model that performed well in an independent population, as measured by both AUC (AUC3 year = 0.88 and AUC5 year = 0.89) and Cindex (0.82).
Among the genes identified in transcriptional analysis, NDRG1 and RHOA have been reported to be associated with lung cancer. In this study, among 49 genes identified in epigenetic analysis, five (AFF3, MAPK1, STAT3, FOXP1, and BCL6) were identified as hub genes in the gene network. AFF3 is associated with NSCLC prognosis. MAPK1 promotes NSCLC cell survival and is a therapeutic target for NSCLC chemotherapeutic resistance. STAT3, one of the three major downstream pathways activated by EGFR phosphorylation, is persistently activated in 22% to 65% of NSCLC [36-38]. It is a strong predictor of poor NSCLC prognosis and is related to cisplatin resistance in NSCLC cells [39-41]. FOXP1 is an independent factor for predicting poor NSCLC prognosis. BCL6 could inhibit cell apoptosis in lung cancer and plays a role in sustaining NSCLC genomic instability. In enrichment pathway analysis, the 49 genes were significantly enriched in pathways or processes that are cancer related. Notably, the identified genes were also enriched in the KEGG non-small cell lung cancer pathway (hsa05223). The hub genes MAPK1 and STAT3 in the network were also involved in this pathway. The results suggest that, pending functional confirmation, the identified CpG probes are potential epigenetic targets for NSCLC chemotherapy.
Our study has several strengths. (1) Most studies focus only on the main effects of biomarkers, ignoring G×G interactions, which account for missing heritability of complex diseases such as NSCLC. Also, most studies test prognostic biomarkers in a single type of omics data.10, 11, 12, 13 By taking advantage of epigenomic and transcriptomic data and considering both G×G interactions and main effects, we built transomics prognostic scores, which could improve prognostic value. (2) To identify reliable prognostic biomarkers for predicting overall survival in early-stage NSCLC, we used stringent statistical criteria. In the main effect analysis, candidate biomarkers with effect sizes larger than a data-driven threshold in ISIS LASSO had to reach statistical significance to remain in the Cox regression model. For the G×G interaction analysis, we applied the most conservative Bonferroni correction to control for false positives. In addition, significant biomarkers observed in the discovery phase had to be validated in an independent population. One consequence, however, was that only a few biomarkers were identified from the gene expression data because of its limited sample size, so these biomarkers contributed only a small proportion of the improvement in model accuracy. (3) We used ISIS LASSO and stepwise regression to screen biomarkers with main effects and interactions, respectively, and built multibiomarker models. The coefficients used as weights to define the scores were derived from these multibiomarker models rather than from single-biomarker models. Single-biomarker models might yield biased estimates of effect sizes, whereas multibiomarker models are better suited to clarifying complex associations and could improve prediction accuracy. (4) The prediction accuracy of our prognostic model was robust to different selection thresholds in stepwise regression and to stratification by histologic type. (5) The genes we identified were enriched in the non-small cell lung cancer pathway, and most of the hub genes have been reported to be associated with NSCLC, indicating the reliability of our prognostic biomarkers.
We also acknowledge some limitations of our study. (1) We focused only on pan-cancer genes, whereas most dysregulated genes represent the consequences rather than the causes of the neoplastic process. Moreover, few statistical methods or computing resources are powerful enough to complete G×G interaction analysis of time-to-event data on a genome-wide scale within weeks. We therefore exhaustively tested all pairs of pan-cancer-related genes. (2) Limited clinical information was available for several cohorts that were initiated decades ago. However, in our study, a few easily accessible clinical predictors and dozens of biomarkers achieved considerable accuracy, which indicates potential for real-world application. (3) The event rate in the TCGA population is relatively low (23%), which considerably reduced the statistical power. However, the conservative two-stage strategy supports the robustness of our findings. (4) Our prognostic prediction model predicts survival outcome and accurately distinguishes subgroups of patients with high mortality risk, which provides a potential opportunity to deliver personalized medicine and interventions tailored to each individual’s level of risk. However, it requires information on 54 biomarkers, which might restrict its clinical translatability without testing of specimens. Nevertheless, the history of cancer omics testing has taught us that, as technology improves and costs fall, the trend is toward more convenient and comprehensive approaches to quickly capture biomarker information. In the coming years, advances in technology will facilitate our model’s usefulness through a customized biochip, enabling widespread clinical application and maximizing the benefit to patients. (5) Further studies with large-scale populations and extension to other ethnicities are warranted to confirm the results of our association study and to verify the underlying biologic mechanisms of the genes and their interactions. Results of protein analysis in public resources and of gene network and enrichment analyses might provide insight into the functional mechanisms.
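Two of the statistical ideas discussed above can be sketched in a few lines: the Bonferroni threshold for an exhaustive pairwise interaction scan, and a prognostic score defined as a coefficient-weighted sum. This is a hedged illustration, not the study's implementation. The function names, the default significance level, and the use of a simple product term for each interaction are assumptions, and any biomarker names or weights passed in are hypothetical.

```python
from math import comb


def bonferroni_threshold(n_features, alpha=0.05):
    """Per-test P value cutoff when every unordered pair is tested once.

    n_features: number of candidate biomarkers entering the pairwise scan
    alpha:      family-wise error rate to control (assumed default)
    """
    n_tests = comb(n_features, 2)  # number of unordered pairs
    return alpha / n_tests


def prognostic_score(values, main_weights, interaction_weights):
    """Weighted sum of main effects and pairwise product terms.

    values:              {biomarker: measured level} for one patient
    main_weights:        {biomarker: coefficient from a multibiomarker model}
    interaction_weights: {(biomarker_a, biomarker_b): coefficient}
    The product form of the interaction term is an assumption.
    """
    score = sum(w * values[name] for name, w in main_weights.items())
    score += sum(w * values[a] * values[b]
                 for (a, b), w in interaction_weights.items())
    return score
```

Because the number of pairs grows quadratically with the number of candidates, the Bonferroni cutoff becomes very small for large scans, which is consistent with the small number of biomarkers that survived screening.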

Conclusion

The prognostic score incorporating transomics biomarkers with both main effects and G×G interactions significantly improves prognostic prediction accuracy for early-stage NSCLC survival.

47 in total

1. Polygenes, risk prediction, and targeted prevention of breast cancer.

Authors: Paul D P Pharoah; Antonis C Antoniou; Douglas F Easton; Bruce A J Ponder
Journal: N Engl J Med Date: 2008-06-26 Impact factor: 91.245

Review 2. The relative utilities of genome-wide, gene panel, and individual gene sequencing in clinical practice.

Authors: Frank C Kuo; Brenton G Mar; R Coleman Lindsley; Neal I Lindeman
Journal: Blood Date: 2017-06-09 Impact factor: 22.113

3. limma powers differential expression analyses for RNA-sequencing and microarray studies.

Authors: Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth
Journal: Nucleic Acids Res Date: 2015-01-20 Impact factor: 16.971

4. Overexpression of constitutive signal transducer and activator of transcription 3 mRNA in cisplatin-resistant human non-small cell lung cancer cells.

Authors: Kenji Ikuta; Kazu Takemura; Masaru Kihara; Masuhiro Nishimura; Nobuhiko Ueda; Shinsaku Naito; Eibai Lee; Eiji Shimizu; Aiko Yamauchi
Journal: Oncol Rep Date: 2005-02 Impact factor: 3.906

Review 5. Lung cancer: current therapies and new targeted treatments.

Authors: Fred R Hirsch; Giorgio V Scagliotti; James L Mulshine; Regina Kwon; Walter J Curran; Yi-Long Wu; Luis Paz-Ares
Journal: Lancet Date: 2016-08-27 Impact factor: 79.321

6. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study.

Authors: Kerby Shedden; Jeremy M G Taylor; Steven A Enkemann; Ming-Sound Tsao; Timothy J Yeatman; William L Gerald; Steven Eschrich; Igor Jurisica; Thomas J Giordano; David E Misek; Andrew C Chang; Chang Qi Zhu; Daniel Strumpf; Samir Hanash; Frances A Shepherd; Keyue Ding; Lesley Seymour; Katsuhiko Naoki; Nathan Pennell; Barbara Weir; Roel Verhaak; Christine Ladd-Acosta; Todd Golub; Michael Gruidl; Anupama Sharma; Janos Szoke; Maureen Zakowski; Valerie Rusch; Mark Kris; Agnes Viale; Noriko Motoi; William Travis; Barbara Conley; Venkatraman E Seshan; Matthew Meyerson; Rork Kuick; Kevin K Dobbin; Tracy Lively; James W Jacobson; David G Beer
Journal: Nat Med Date: 2008-07-20 Impact factor: 53.440

7. Quantitative evidence for early metastatic seeding in colorectal cancer.

Authors: Zheng Hu; Jie Ding; Zhicheng Ma; Ruping Sun; Jose A Seoane; J Scott Shaffer; Carlos J Suarez; Anna S Berghoff; Chiara Cremolini; Alfredo Falcone; Fotios Loupakis; Peter Birner; Matthias Preusser; Heinz-Josef Lenz; Christina Curtis
Journal: Nat Genet Date: 2019-06-17 Impact factor: 38.330

8. An evaluation of analysis pipelines for DNA methylation profiling using the Illumina HumanMethylation450 BeadChip platform.

Authors: Francesco Marabita; Malin Almgren; Maléne E Lindholm; Sabrina Ruhrmann; Fredrik Fagerström-Billai; Maja Jagodic; Carl J Sundberg; Tomas J Ekström; Andrew E Teschendorff; Jesper Tegnér; David Gomez-Cabrero
Journal: Epigenetics Date: 2013-02-19 Impact factor: 4.528

9. ERK1/2 is activated in non-small-cell lung cancer and associated with advanced tumours.

Authors: S Vicent; J M López-Picazo; G Toledo; M D Lozano; W Torre; C Garcia-Corchón; C Quero; J-C Soria; S Martín-Algarra; R G Manzano; L M Montuenga
Journal: Br J Cancer Date: 2004-03-08 Impact factor: 7.640

10. Genome-wide DNA methylation analyses in lung adenocarcinomas: Association with EGFR, KRAS and TP53 mutation status, gene expression and prognosis.

Authors: Maria Moksnes Bjaanæs; Thomas Fleischer; Ann Rita Halvorsen; Antoine Daunay; Florence Busato; Steinar Solberg; Lars Jørgensen; Elin Kure; Hege Edvardsen; Anne-Lise Børresen-Dale; Odd Terje Brustugun; Jörg Tost; Vessela Kristensen; Åslaug Helland
Journal: Mol Oncol Date: 2015-11-06 Impact factor: 6.603
