André Veríssimo, Arlindo Limede Oliveira, Marie-France Sagot, Susana Vinga.
Abstract
BACKGROUND: Modeling oncological survival data has become a major challenge: with the growing amount of molecular information now available, the number of features greatly exceeds the number of observations. One way to cope with this dimensionality problem is to add constraints to the cost function being optimized. LASSO and other sparsity-inducing methods have already been applied successfully in this way. Although they lead to more interpretable models, these methods still do not take full advantage of the relations between features, especially when these can be represented as graphs. We propose DEGREECOX, a method that applies network-based regularizers to infer Cox proportional hazards models when the features are genes and the outcome is patient survival. In particular, we propose using network centrality measures to constrain the model in terms of significant genes.
Keywords: Cox proportional models; Network metrics; Regularization
Year: 2016 PMID: 28105908 PMCID: PMC5249012 DOI: 10.1186/s12859-016-1310-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
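The abstract's idea of constraining a Cox model with a network centrality measure can be sketched as a penalized objective. The snippet below is only an illustration under stated assumptions: the record does not give the exact form of the DEGREECOX regularizer, so the penalty used here (squared coefficients weighted by the inverse of each gene's weighted degree, so that hub genes are shrunk less) is an assumed form, and all function names are hypothetical. The partial likelihood assumes no tied event times.

```python
import math


def weighted_degree(adjacency):
    """Weighted degree of each node: the sum of its row in a square adjacency matrix."""
    return [sum(row) for row in adjacency]


def neg_partial_log_likelihood(beta, X, time, event):
    """Negative Cox partial log-likelihood (assumes no tied event times)."""
    # Linear predictor (prognostic index) for every patient
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(time, event)):
        if d_i:
            # Risk set: patients still under observation at time t_i
            risk = sum(math.exp(eta[j]) for j, t_j in enumerate(time) if t_j >= t_i)
            total -= eta[i] - math.log(risk)
    return total


def degree_penalty(beta, degrees, eps=1e-8):
    """ASSUMED penalty form: sum of beta_j^2 / degree_j (hubs are penalized less)."""
    return sum(b * b / (d + eps) for b, d in zip(beta, degrees))


def objective(beta, X, time, event, degrees, lam):
    """Penalized objective to minimize: negative partial log-likelihood + lam * penalty."""
    return neg_partial_log_likelihood(beta, X, time, event) + lam * degree_penalty(beta, degrees)
```

With all coefficients at zero the penalty vanishes and the objective reduces to the sum of the log risk-set sizes over observed deaths, which is a convenient sanity check before handing `objective` to a numerical optimizer.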
Fig. 1 Centrality measures
Fig. 2 DEGREECOX network regularizer
Fig. 3 Comparison of the fraction of top-ranked genes calculated for starting networks for the centrality metrics analyzed: (a) weighted degree, (b) betweenness and (c) closeness. Sub-network properties obtained by removing edges from the starting network
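The three centrality metrics named in the Fig. 3 caption can be illustrated on a toy gene network. This is a minimal sketch using only the standard library, not the authors' implementation: the graph is a hypothetical dict of dicts with edge weights, weighted degree uses those weights, and, as a simplifying assumption, closeness and betweenness are computed on hop counts (edge weights are ignored for path lengths).

```python
from collections import deque


def weighted_degree(graph):
    """Sum of the weights of the edges incident to each node."""
    return {u: sum(nbrs.values()) for u, nbrs in graph.items()}


def closeness(graph):
    """(number of reachable nodes) / (sum of hop distances); 0 for isolated nodes."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {u: 0.0 for u in graph}
    for s in graph:
        order = []                       # nodes in order of non-decreasing distance
        preds = {u: [] for u in graph}   # predecessors on shortest paths from s
        sigma = {u: 0 for u in graph}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in graph}
        for w in reversed(order):        # accumulate dependencies back towards s
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints
    return {u: b / 2 for u, b in score.items()}


# Hypothetical three-gene path network: A - B - C
toy = {"A": {"B": 0.5}, "B": {"A": 0.5, "C": 2.0}, "C": {"B": 2.0}}
```

In this toy network the middle gene `B` has the largest value under all three metrics, which is the kind of ranking the figure compares across networks.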
Deviance and C-index results for models chosen by 5-fold cross-validation and tested on all 3 datasets (including the 2 that were hidden from the training phase). The LASSO and RIDGE methods do not use network information, so their values for GCN and GFM are the same; they are shown for both networks only when they are better than DEGREECOX and NET-COX
|
| Bonome | TCGA | Tothill | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
| Bonome | TCGA | Tothill | Bonome | TCGA | Tothill | Bonome | TCGA | Tothill | ||||||||||
|
| GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | |
| RMSE |
|
|
| 1.3189 | 1.1538 | 1.2139 | 1.1027 |
| 1.3619 | 0.9201 | 0.8043 |
| 1.1083 | 1.6326 | 1.2975 | 1.3749 | 1.1679 |
| 0.7013 |
|
| 0.8131 | 0.8353 | 1.1438 |
| 1.0992 |
| 1.3514 |
| 0.8361 | 0.8508 | 1.1003 |
|
|
|
|
| 0.7363 | 0.7606 | |
|
| 0.7807 |
|
| 1.3755 |
| 1.1769 | 1.5649 | 1.3252 |
| ||||||||||
|
| 0.7887 | 1.4619 | 1.2586 | 1.7419 | 0.8105 | 1.3019 | 1.9595 | 1.4208 | 0.5444 | ||||||||||
| C-Index |
|
| 0.9401 | 0.6020 | 0.6037 | 0.6455 | 0.6494 | 0.6444 | 0.6427 | 0.8476 | 0.9089 |
| 0.6695 | 0.6011 | 0.6088 | 0.6100 | 0.6215 |
| 0.9519 |
|
| 0.9260 | 0.9202 | 0.6079 | 0.6054 | 0.6483 | 0.6506 | 0.6416 | 0.6439 | 0.8918 | 0.8892 | 0.6633 |
|
|
|
|
| 0.9389 | 0.9352 | |
|
|
|
|
|
|
| 0.6579 | 0.6000 | 0.5926 |
| ||||||||||
|
| 0.9309 | 0.5615 | 0.6124 | 0.6405 | 0.9043 | 0.6399 | 0.5075 | 0.5728 | 0.9784 | ||||||||||
Values in bold represent the best performing method for the dataset/network combination (per RMSE and C-Index)
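For readers unfamiliar with the C-index reported in the table, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration, not the code used to produce the table: the function name is hypothetical, pairs with tied survival times are skipped, and tied risk scores count as half-concordant.

```python
def concordance_index(time, event, risk):
    """Harrell's C: fraction of comparable pairs in which the patient who
    died earlier was assigned the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is only comparable if the earlier time is an observed death
        for j in range(n):
            if time[j] > time[i]:  # patient j outlived patient i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk.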
P-values for log-rank test results for models chosen by 5-fold cross-validation and tested on all 3 datasets (including the 2 that were hidden from the training phase). The log-rank test assesses the separation of patients into two categories, high and low risk, based on the expression dataset, using the top and bottom 40 % PI groups and the top and bottom 50 % PI groups. The LASSO and RIDGE methods do not use network information, so their values for GCN and GFM are the same; they are shown for both networks only when they are better than DEGREECOX and NET-COX. The p-values when the model is tested on the same dataset used in training are always 0 and are omitted from the table
|
| Bonome | TCGA | Tothill | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
| TCGA | Tothill | Bonome | Tothill | Bonome | TCGA | |||||||
|
| GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | GCN | GFM | |
| 40 |
| 2.084 | 2.124 | 0.0013 | 4.390 | 2.046 | 3.990 |
| 5.822 |
|
| 8.347 | 3.125 |
|
| 1.082 | 2.791 | 7.726 |
| 2.815 | 1.185 | 4.241 |
| 1.696 | 2.545 |
|
| |
|
|
|
| 4.233 | 1.765 | 0.0016 | 1.864 | |||||||
|
| 0.0364 | 0.0048 |
| 0.0036 | 0.5630 | 0.0033 | |||||||
| 50 |
| 3.332 | 5.284 | 0.0076 | 0.0084 | 4.394 | 0.0090 |
|
| 0.0045 |
| 5.264 | 7.183 |
|
| 2.169 | 5.086 | 0.0170 | 0.0179 | 0.0036 | 0.0015 | 1.247 | 3.126 |
| 8.138 |
|
| |
|
|
|
|
| 0.0029 | 0.0050 | 3.499 | |||||||
|
| 0.0720 | 0.0048 | 0.0022 | 0.0193 | 0.6464 | 0.0050 | |||||||
Values in bold represent the best method for the dataset/network combination (per 40 % and 50 % separation)
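The high/low risk comparison described in the caption above can be sketched as follows: patients are split by their prognostic index (PI), and a two-group log-rank test is applied to the two extremes. This is a standard-library illustration under assumptions, not the authors' code: the function names are hypothetical, the PI is taken to be any per-patient risk score, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def split_by_pi(pi, fraction):
    """Indices of the patients with the lowest and the highest `fraction` of PI values."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    k = int(len(pi) * fraction)
    return order[:k], order[len(pi) - k:]


def logrank_test(time, event, low, high):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    # Each member is (time, event flag, group), with group 1 = high risk
    members = [(time[i], event[i], 0) for i in low]
    members += [(time[i], event[i], 1) for i in high]
    observed = expected = variance = 0.0
    for t in sorted({tm for tm, ev, _ in members if ev}):
        at_risk = [m for m in members if m[0] >= t]
        n = len(at_risk)
        n1 = sum(1 for m in at_risk if m[2] == 1)
        d = sum(1 for m in at_risk if m[0] == t and m[1])
        d1 = sum(1 for m in at_risk if m[0] == t and m[1] and m[2] == 1)
        observed += d1
        expected += d * n1 / n           # deaths expected in group 1 under the null
        if n > 1:                        # hypergeometric variance term
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the groups cannot be compared")
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Calling `split_by_pi(pi, 0.4)` gives the top and bottom 40 % groups, and `split_by_pi(pi, 0.5)` the 50 % split; either pair of index lists can be passed directly to `logrank_test`.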
Fig. 4 Residuals when models are trained with the correlation network and TCGA dataset and tested with the Bonome dataset
Fig. 5 Kaplan-Meier curves for high vs. low risk groups with the model learnt from the TCGA dataset and tested on Bonome (a and b) and Tothill (c and d). When a death event occurs for an individual, the cumulative survival decreases
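The step-down behaviour described in the Fig. 5 caption follows directly from the Kaplan-Meier estimator, which multiplies the running survival probability by (1 - deaths / at risk) at each death time. A minimal sketch, with a hypothetical function name and toy data that are not taken from the study:

```python
def kaplan_meier(time, event):
    """Kaplan-Meier estimate: list of (death time, cumulative survival) pairs."""
    curve = []
    survival = 1.0
    # The curve only changes at times where at least one death was observed
    for t in sorted({tm for tm, ev in zip(time, event) if ev}):
        at_risk = sum(1 for tm in time if tm >= t)
        deaths = sum(1 for tm, ev in zip(time, event) if tm == t and ev)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy cohort: event = 1 marks a death, event = 0 a censored observation
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored patients do not lower the curve themselves; they only shrink the risk set used at later death times.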