Zhi Huang1,2,3, Xiaohui Zhan2,4, Shunian Xiang4,5, Travis S Johnson2,6, Bryan Helm2, Christina Y Yu2,6, Jie Zhang5, Paul Salama3, Maher Rizkalla3, Zhi Han2,7, Kun Huang2,3,7.
Abstract
Improved cancer prognosis is a central goal of precision health medicine. Though many models can predict differential survival from data, there is a strong need for sophisticated algorithms that can aggregate and filter relevant predictors from increasingly complex data inputs. In turn, these models should provide deeper insight into which types of data are most relevant to improving prognosis. Deep Learning-based neural networks offer a potential solution to both problems because they are highly flexible and account for data complexity in a non-linear fashion. In this study, we implement Deep Learning-based networks to determine how gene expression data predict survival in a Cox regression setting in breast cancer. We accomplish this through an algorithm called SALMON (Survival Analysis Learning with Multi-Omics Neural Networks), which aggregates and simplifies gene expression data and cancer biomarkers to enable prognosis prediction. The results revealed improved performance when more omics data were used in model construction. Rather than using raw gene expression values as model inputs, we use eigengene modules obtained from gene co-expression network analysis. The corresponding high-impact co-expression modules and other omics data are identified by a feature selection technique and then examined through enrichment analysis of their biological functions, which raises the interpretation of input features from the gene level to the co-expression module level. Our study shows the feasibility of discovering breast cancer-related co-expression modules and sketches a blueprint for future work on Deep Learning-based survival analysis. SALMON source code is available at https://github.com/huangzhii/SALMON/.
Keywords: breast cancer; co-expression analysis; cox regression; deep Learning; multi-omics; neural networks; survival prognosis
Year: 2019 PMID: 30906311 PMCID: PMC6419526 DOI: 10.3389/fgene.2019.00166
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. SALMON (Survival Analysis Learning with Multi-Omics Neural Networks) architecture with the implementation of Cox proportional hazards regression networks. Co-expression modules (eigengene matrices) are the inputs to SALMON. The number and dimensions of the hidden layers can also be fine-tuned (not included in this paper). The outputs are hazard ratios, which can be interpreted as the relative risks of patients.
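The Cox proportional hazards objective that such a network is trained on can be sketched as follows. This is a minimal illustration and not the SALMON implementation: it assumes the network emits one log hazard ratio per patient, and the function name `cox_neg_log_partial_likelihood` and its arguments are hypothetical.

```python
import math

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative Cox partial log-likelihood over observed events.

    risk_scores: one predicted log hazard ratio per patient (assumed network output)
    times:       observed survival or censoring time per patient
    events:      1 if the event (death) was observed, 0 if censored
    """
    n_events = sum(events)
    if n_events == 0:
        return 0.0  # no observed events, so nothing contributes to the likelihood
    total = 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # censored patients only appear inside risk sets
        # Risk set: every patient still under observation at time t_i.
        risk_set = [risk_scores[j] for j in range(len(times)) if times[j] >= times[i]]
        # Log-sum-exp computed stably by factoring out the maximum.
        m = max(risk_set)
        log_sum = m + math.log(sum(math.exp(r - m) for r in risk_set))
        total += risk_scores[i] - log_sum
    return -total / n_events
```

The loss is lower when patients who die earlier receive higher scores, which is the ordering a hazard-ratio output is meant to capture.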
Demographical and clinical characteristics of 583 female breast invasive carcinoma (BRCA) patients.
| 13,132 | 57 | 530 | 12 | 31.70 | 0.00–216.59 | 57 | 26–90 | 76.16% | 67.41% |
mRNA and miRNA stand for mRNA-seq data and miRNA-seq data. OS stands for overall survival. ER and PR status were derived from immunohistochemistry (IHC). All clinical information was collected from cBioPortal.
Figure 2. (A) Performance of SALMON with integrated multi-omics data in terms of concordance index. (B) Performance comparison between SALMON and the modified Cox-nnet, DeepSurv, GLMNET, and RSF in terms of concordance index, with all omics data used for learning. (C–E) Kaplan-Meier plots of survival prognosis. Hazard ratios were derived from all five testing sets. The log-rank test was used to obtain the p-value between the low-risk and high-risk groups, which were dichotomized by the median hazard ratio. Omics data used for training and testing: (C) mRNA-seq data (mRNA); (D) miRNA-seq data (miRNA); (E) integration of mRNA, miRNA, CNB, TMB, and demographical and clinical (diagnosis age, ER status, PR status) data. Results for all other multi-omics combinations are in Figure S1.
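The two quantities used in this figure, the concordance index and the median split into risk groups, can be sketched in a few lines. This is an illustrative version of Harrell's concordance index written for clarity rather than speed; the function names are hypothetical and the code is not taken from the SALMON repository.

```python
from statistics import median

def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked in the right order.

    A pair (i, j) is comparable when patient i had an observed event and
    patient j was still under observation after that time (times[j] > times[i]).
    The pair is concordant when the earlier death has the higher risk score;
    tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def dichotomize_by_median(risk_scores):
    """Label each patient 'high' or 'low' risk relative to the median score."""
    cutoff = median(risk_scores)
    return ["high" if r > cutoff else "low" for r in risk_scores]
```

A concordance index of 0.5 corresponds to random ranking and 1.0 to a perfect ranking; the two labels returned by `dichotomize_by_median` are the groups that a log-rank test would then compare.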
Performance comparison between different combinations of multi-omics data by pairwise paired t-test on the concordance indices from 5-fold cross-validation.
| Set 1 \ Set 2 | (ii) t | (ii) P | (iii) t | (iii) P | (iv) t | (iv) P | (v) t | (v) P | (vi) t | (vi) P |
| (i) | −0.784 | 0.477 | −0.676 | 0.536 | −0.832 | 0.452 | −2.928 | 0.043* | −3.315 | 0.030* |
| (ii) | – | – | 0.406 | 0.705 | −0.487 | 0.652 | −0.092 | 0.931 | −0.652 | 0.550 |
| (iii) | – | – | – | – | 0.247 | 0.817 | −5.804 | 0.004* | −2.710 | 0.054 |
| (iv) | – | – | – | – | – | – | −4.168 | 0.014* | −3.603 | 0.023* |
| (v) | – | – | – | – | – | – | – | – | −1.529 | 0.201 |
Note that a negative t-statistic indicates that set 1 performed worse than set 2. Multi-omics datasets used as inputs: (i) mRNA-seq data (mRNA) (57 features); (ii) miRNA-seq data (miRNA) (12 features); (iii) integration of mRNA and miRNA (69 features); (iv) integration of mRNA, miRNA, copy number burden (CNB), and tumor mutation burden (TMB) (71 features); (v) integration of mRNA, miRNA, and demographical and clinical (diagnosis age, ER status, PR status) data (72 features); (vi) integration of mRNA, miRNA, CNB, TMB, and demographical and clinical (diagnosis age, ER status, PR status) data (74 features).
t denotes the pairwise paired Student's t-test statistic; P denotes the corresponding p-value. P-values < 0.05 are considered significant and are marked with *.
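As a worked illustration of the paired t-test used in this table, the statistic can be computed from the per-fold concordance indices of two input sets. The sketch below uses only the Python standard library; the function name is hypothetical, and any fold values passed to it in an example are made-up placeholders rather than results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_set1, scores_set2):
    """Paired Student's t statistic and degrees of freedom.

    scores_set1, scores_set2: per-fold concordance indices obtained on the
    same cross-validation folds. A negative t means set 1 scored lower.
    """
    if len(scores_set1) != len(scores_set2):
        raise ValueError("both sets must cover the same folds")
    diffs = [a - b for a, b in zip(scores_set1, scores_set2)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    sd = stdev(diffs)  # sample standard deviation of the fold-wise differences
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With five folds there are 4 degrees of freedom, so a two-sided test at the 0.05 level requires |t| > 2.776. This agrees with the table: t = −2.928 is marked significant (P = 0.043), while t = −2.710 is not (P = 0.054).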
Top features that reduced the concordance index, including two demographical and clinical features and five mRNA-seq co-expression modules (eigengene matrices as inputs to SALMON).
| Rank | Feature | Decrease of concordance index | Description |
| 1 | Diagnosis age | −0.1257 | Age |
| 2 | PR status | −0.0343 | Progesterone receptor status |
| 3 | Module 13 | −0.0150 | Genes MST1, CPT1B. CD8+, CD4+, Breast bulk tissue. |
| 4 | Module 47 | −0.0071 | Genes MAP3K7, CCNC. Cytoband chr6q14-q16 and chr6q21. |
| 5 | Module 5 | −0.0059 | Genes DDR2, FLNA, TCF4. Associated with extracellular matrix (ECM), cell adhesion, and cell migration. |
| 6 | Module 36 | −0.0053 | Gene SNW1. Cytoband chr14q23-q24 and chr14q31-q32. |
| 7 | Module 51 | −0.0047 | Genes TCP1, HDAC2. Cytoband chr6q14-q15 and chr6q21-q26. |
Figure 3. Feature importance evaluated by the decrease in concordance index, sorted by median value. Boxplots in green: 57 mRNA co-expression module features (ID from 1 to 57); boxplots in red: 12 miRNA co-expression module features (ID from 58 to 69); boxplots in turquoise: copy number burden (CNB) and tumor mutation burden (TMB) features (ID from 70 to 71); boxplots in pink: demographical and clinical features (ID from 72 to 74).
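One way to score features by the change in concordance index is sketched below. The ablation rule is an assumption made for illustration: each feature is neutralized by replacing its column with the column mean, and the caption does not state how SALMON neutralizes a feature. The names `feature_importance` and `predict` are hypothetical, and `predict` stands in for any trained model that maps a feature row to a risk score.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C over comparable pairs; tied risk scores count as half."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def feature_importance(predict, X, times, events):
    """Change in concordance index after neutralizing each feature in turn.

    predict: hypothetical trained model, maps one feature row to a risk score
    X:       list of feature rows (one list of floats per patient)
    Returns one value per feature; more negative means more important.
    """
    baseline = concordance_index(times, events, [predict(row) for row in X])
    changes = []
    for k in range(len(X[0])):
        # Assumed ablation: overwrite feature k with its mean across patients.
        col_mean = sum(row[k] for row in X) / len(X)
        ablated = [list(row[:k]) + [col_mean] + list(row[k + 1:]) for row in X]
        c_index = concordance_index(times, events, [predict(r) for r in ablated])
        changes.append(c_index - baseline)
    return changes
```

Reporting the ablated score minus the baseline keeps the sign convention of the feature table, where the most informative features have the most negative values.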
Figure 4. Performance of the SALMON algorithm stratified by three age groups (26–50, 51–70, and 71–90), with all omics data integrated (mRNA, miRNA, CNB, TMB, diagnosis age, ER status, PR status).
Top features that reduced the concordance indices.
| 1 | – | – | – | |||
| 2 | Module 1 | 0 | Module 13 | −0.0221 | – | |
| 3 | Module 2 | 0 | Module 4 | −0.0185 | – | |
| 4 | Module 3 | 0 | Module 5 | −0.0150 | – | |
| 5 | Module 4 | 0 | Diagnosis age | −0.0150 | – | |
Experiments were performed separately for three age groups (26–50, 51–70, and 71–90), with all omics data integrated (mRNA, miRNA, CNB, TMB, diagnosis age, ER status, PR status). Detailed feature rankings are in .