Borong Shao1,2, Maria Moksnes Bjaanæs3,4,5, Åslaug Helland3,4,5, Christof Schütte1,2, Tim Conrad1,2.
Abstract
Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility has been criticized. The performance of feature selection algorithms has been shown to depend on the datasets, underlying networks, and evaluation metrics. One cause is the curse of dimensionality, which makes it hard to select features that generalize well to independent data. Even the integration of biological networks does not mitigate this issue, because the networks are large and many of their components are not relevant to the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensionality creates even greater challenges. We proposed a phenotype relevant network-based feature selection (PRNFS) framework and demonstrated its advantages in lung cancer prognosis prediction. We constructed cancer prognosis relevant networks based on epithelial mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. With less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved remarkable prediction performance (average AUC values >0.8), highly significant sample stratifications, and meaningful biological interpretations. In addition to finding EMT signatures at different omics data levels, we combined these single-omics signatures into multi-omics signatures, which significantly improved sample stratification. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets, and significant sample stratifications were obtained.
Year: 2019 PMID: 30703089 PMCID: PMC6354965 DOI: 10.1371/journal.pone.0204186
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Frequently used molecular/gene interaction networks in network-based feature selection studies.
The table lists basic information about each network, together with exemplary studies that used it. For the STRING database, only edges with confidence scores ≥ 0.9 were considered. When a database covers many species, only Homo sapiens was considered. In the 4th and 5th columns, Data means that the network size depends on the dimensions of the data, App means that it depends on the application, and GO means that it depends on the gene ontology terms.
| Molecular interactions | Database | Version | Number of edges | Number of nodes | Studies |
|---|---|---|---|---|---|
| protein-protein | STRING | V10.5 | 547621 | 19578 | [ |
| protein-protein | HPRD | Release 9 | 41327 | 30047 | [ |
| biological pathways | KEGG | Release 84.0 | App | App | [ |
| biological pathways | Pathway Commons | V7 | 1912848 | 14863 | [ |
| miRNA-gene | miRTarBase | V7.0 | 502651 | 16822 | [ |
| transcription factor—target | TRANSFAC | V7.0 public | 1504 | 1648 | [ |
| gene co-expression | None | None | Data | Data | [ |
| gene functional linkage | Multiple | None | App | App | [ |
| gene ontology | Gene ontology | None | GO | GO | [ |
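The STRING confidence filter mentioned in the caption can be sketched as follows. STRING distributes its combined scores as integers on a 0 to 1000 scale, so a confidence of 0.9 corresponds to a score of 900. The edge list, the protein names, and the function `filter_edges` are made up for illustration and are not taken from the study.

```python
# Hypothetical edge list: (protein_a, protein_b, combined_score on a 0-1000 scale).
edges = [
    ("P1", "P2", 950),
    ("P1", "P3", 420),
    ("P2", "P4", 900),
    ("P3", "P4", 899),
]

def filter_edges(edges, min_confidence=0.9):
    """Keep edges whose combined score is at least min_confidence (0-1 scale)."""
    cutoff = round(min_confidence * 1000)  # convert to the integer score scale
    return [(a, b) for a, b, score in edges if score >= cutoff]

high_conf = filter_edges(edges)
# Nodes that remain after filtering determine the size of the usable network.
nodes = {protein for edge in high_conf for protein in edge}
```

Filtering before feature selection shrinks both the edge and the node counts, which is why the network sizes in the table depend on the chosen threshold.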
Overview of the 10 feature selection algorithms.
We listed below their main methodologies, whether network information was integrated, the algorithmic output and reference.
| Algorithm | Methodology | Network | Output | Reference |
|---|---|---|---|---|
| t-test | t-statistic | No | feature ranking | [ |
| Lasso | regularized regression | No | coefficients | [ |
| NetLasso | network-based regularization | Yes | coefficients | [ |
| addDA2 | subnetwork scoring and searching | Yes | subnetworks | [ |
| NetRank | feature importance on network | Yes | feature ranking | [ |
| stSVM | random walk on network | Yes | feature ranking | [ |
| Cox | Cox PH model | No | feature ranking | [ |
| RegCox | regularized Cox PH model | No | coefficients | [ |
| MSS | random sampling | No | feature ranking | [ |
| Survnet | subnetwork scoring and searching | Yes | subnetworks | [ |
The description of datasets.
This table shows the sample sizes for labeled data (thresholds <700 and >1400 days) and censored data.
| Data level | Good prognosis (labeled) | Poor prognosis (labeled) | Total (labeled) | All stages (censored) |
|---|---|---|---|---|
| GE | 84 | 99 | 183 | 497 |
| DM | 74 | 93 | 167 | 447 |
| CNA | 73 | 76 | 149 | 503 |
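A minimal sketch of how two survival thresholds can turn follow-up data into binary labels. The caption gives only the thresholds (700 and 1400 days); the direction of the labels and the handling of censored patients below are assumptions, and `label_prognosis` is a hypothetical helper rather than the authors' code.

```python
def label_prognosis(days, died, poor_below=700, good_above=1400):
    """Return 'poor', 'good', or None (unlabeled) for one patient.

    Assumptions (not stated in the source):
    - a death before `poor_below` days counts as poor prognosis;
    - follow-up beyond `good_above` days counts as good prognosis;
    - every other patient stays unlabeled.
    """
    if died and days < poor_below:
        return "poor"
    if days > good_above:
        return "good"
    return None

# Made-up patients: (follow-up in days, whether death was observed).
patients = [(350, True), (1600, False), (900, True), (500, False), (1500, True)]
labels = [label_prognosis(days, died) for days, died in patients]
```

Patients who fall between the thresholds, or who are censored early, receive no label, which is consistent with the labeled totals in the table being much smaller than the censored totals.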
Fig 1. The AUC, AUPR, and accuracies of EMT features versus random features using DM data with the core EMT network.
A Gaussian kernel was used to estimate the density functions from the results of 30 repetitions of 10-fold cross-validation. Within each cross-validation fold, EMT features and random features were tested on the same training and validation samples. Each row of the figure corresponds to one feature selection algorithm; the last row corresponds to using all EMT features. The p-values of paired t-tests are given in each sub-figure.
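Because EMT and random features are scored on the same folds, the comparison is paired. A minimal sketch of the paired t-statistic, using invented per-fold AUC values; `paired_t_statistic` is a hypothetical helper, and the p-value lookup (Student's t distribution with the returned degrees of freedom) is left out.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Invented scores for four folds; real runs would have 30 x 10 = 300 pairs.
emt_auc = [0.70, 0.68, 0.72, 0.66]
random_auc = [0.60, 0.62, 0.61, 0.63]
t_value, dof = paired_t_statistic(emt_auc, random_auc)
```

A positive `t_value` indicates that the first set of scores is higher on average across the shared folds.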
The prediction performance of EMT signatures on three data levels.
The table gives the average AUC values of EMT signatures on three data levels with each EMT network. The results from comparative groups 2, 3, 4, and 5 are given in the third row and the last three rows.
| Data level | Gene expression | | | DNA methylation | | | CNA | | |
|---|---|---|---|---|---|---|---|---|---|
| EMT network size | 74 | 123 | 455 | 74 | 123 | 455 | 70 | 117 | 445 |
| EMT | 0.662 | 0.728 | 0.691 | 0.698 | 0.679 | 0.671 | 0.616 | 0.645 | 0.608 |
| t-test | 0.658 | 0.709 | 0.677 | 0.688 | 0.675 | 0.669 | 0.616 | 0.626 | 0.621 |
| Lasso | 0.616 | 0.703 | 0.620 | 0.697 | 0.666 | 0.667 | 0.615 | 0.619 | 0.617 |
| NetLasso | 0.659 | 0.718 | 0.686 | 0.700 | 0.678 | 0.677 | 0.619 | 0.635 | 0.621 |
| addDA2 | 0.650 | 0.675 | 0.651 | 0.699 | 0.661 | 0.702 | 0.597 | 0.626 | 0.616 |
| NetRank | 0.656 | 0.691 | 0.668 | 0.695 | 0.685 | 0.693 | 0.615 | 0.619 | 0.610 |
| stSVM | 0.651 | 0.693 | 0.639 | 0.669 | 0.668 | 0.687 | 0.608 | 0.617 | 0.616 |
| Cox | 0.673 | 0.705 | 0.712 | 0.703 | 0.707 | 0.696 | 0.620 | 0.664 | 0.675 |
| RegCox | 0.648 | 0.698 | 0.729 | 0.696 | 0.717 | 0.666 | 0.645 | 0.669 | 0.653 |
| MSS | 0.662 | 0.694 | 0.659 | 0.674 | 0.654 | 0.640 | 0.608 | 0.627 | 0.625 |
| Survnet | 0.646 | 0.661 | 0.679 | 0.702 | 0.688 | 0.680 | 0.626 | 0.693 | 0.682 |
| All | 0.648 | | | 0.652 | | | 0.612 | | |
| All + Lasso | 0.643 | | | 0.691 | | | 0.607 | | |
| EMT hallmark | 0.675 | | | 0.627 | | | 0.617 | | |
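For reference, the AUC values in the table can be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch of that pairwise definition, with invented labels and scores; which prognosis class is treated as positive is an assumption, and `auc` is a hypothetical helper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = positive class, 0 = negative class.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1]
fold_auc = auc(y_true, y_score)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Averaging `fold_auc` over all cross-validation folds gives the kind of summary value reported per algorithm and network.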
The average odds ratios of EMT signatures on three data levels.
The table gives the average odds ratios of EMT signatures on three data levels with each EMT network. The results from comparative groups 2, 3, 4, and 5 are given in the third row and the last three rows.
| Data level | Gene expression | | | DNA methylation | | | CNA | | |
|---|---|---|---|---|---|---|---|---|---|
| EMT network size | 74 | 123 | 455 | 74 | 123 | 455 | 70 | 117 | 445 |
| EMT | 3.162 | 7.021 | 5.563 | 4.27 | 4.263 | 4.223 | 1.779 | 2.766 | 1.257 |
| t-test | 3.974 | 6.112 | 8.872 | 4.767 | 6.044 | 8.31 | 3.086 | 3.663 | 4.745 |
| Lasso | 5.094 | 5.6 | 10.793 | 4.482 | 8.213 | 10.427 | 4.367 | 4.494 | 5.852 |
| NetLasso | 3.742 | 6.418 | 9.042 | 4.607 | 4.772 | 9.198 | 2.291 | 3.211 | 2.975 |
| addDA2 | 5.304 | 4.714 | 10.716 | 5.432 | 6.736 | 15.647 | 2.639 | 5.201 | 7.297 |
| NetRank | 3.881 | 4.75 | 6.776 | 5.302 | 5.621 | 7.662 | 2.731 | 3.715 | 2.79 |
| stSVM | 4.118 | 6.238 | 2.448 | 3.682 | 4.22 | 5.253 | 2.173 | 1.63 | 1.721 |
| Cox | 3.685 | 5.684 | 7.666 | 5.083 | 5.499 | 4.821 | 1.398 | 3.562 | 4.897 |
| RegCox | 3.944 | 5.924 | 8.558 | 5.225 | 6.576 | 4.625 | 3.525 | 4.165 | 3.712 |
| MSS | 3.098 | 6.005 | 4.284 | 4.516 | 3.187 | 2.76 | 1.421 | 2.515 | 1.698 |
| Survnet | 2.798 | 4.159 | 4.798 | 5.455 | 3.694 | 4.544 | 3.25 | 5.394 | 5.457 |
| All | 4.316 | | | 2.221 | | | 1.233 | | |
| All + Lasso | 3.537 | | | 5.385 | | | 2.451 | | |
| EMT hallmark | 5.103 | | | 1.322 | | | 1.227 | | |
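A minimal sketch of an odds ratio computed from a 2x2 table of predicted versus observed prognosis. The captions do not spell out how the odds ratios were derived, so the confusion-matrix formulation and the 0.5 continuity correction for empty cells are assumptions, and `odds_ratio` is a hypothetical helper.

```python
def odds_ratio(tp, fp, fn, tn, correction=0.5):
    """Odds ratio (tp * tn) / (fp * fn) of a 2x2 table.

    If any cell is zero, `correction` is added to every cell so that the
    ratio stays finite (an assumed convention, not taken from the source).
    """
    cells = [tp, fp, fn, tn]
    if 0 in cells:
        cells = [c + correction for c in cells]
    tp, fp, fn, tn = cells
    return (tp * tn) / (fp * fn)

# Invented counts for one cross-validation fold.
fold_or = odds_ratio(tp=8, fp=2, fn=3, tn=7)  # (8 * 7) / (2 * 3)
```

An odds ratio of 1 means the predicted class carries no information about the observed class; larger values indicate stronger separation.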
Fig 2. The comparison of prediction performance between FSFs and individually selected features for different feature selection algorithms.
The boxplot is based on the results of 30 repetitions of stratified 10-fold cross-validation.
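The evaluation protocol, 30 repetitions of stratified 10-fold cross-validation, can be sketched with the standard library alone. The splitter below is an illustrative implementation, not the authors' code; the class counts reuse the labeled GE sample sizes (84 good, 99 poor), and the seed is arbitrary.

```python
import random
from collections import defaultdict

def repeated_stratified_folds(labels, n_splits=10, n_repeats=30, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0  # rotate the starting fold so remainders spread evenly
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[(pos + offset) % n_splits].append(idx)
            offset += len(shuffled)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test

labels = ["good"] * 84 + ["poor"] * 99
splits = list(repeated_stratified_folds(labels))  # 30 x 10 train/test splits
```

Dealing each class into the folds separately keeps the good-to-poor ratio of every test fold close to that of the full dataset, which matters when the classes are not perfectly balanced.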
The results of log-rank tests on stratified sample clusters using single- and multi-omics EMT signatures on all-stage samples.
The k-means algorithm was used to cluster the samples into 3 groups. We highlighted all p-values lower than 1e-03.
| Algorithm | GE | DM | CNA | GE+DM | GE+CNA | DM+CNA | GE+DM+CNA |
|---|---|---|---|---|---|---|---|
| t-test | 1.12e-01 | 3.62e-03 | 1.90e-03 | ||||
| Lasso | 5.56e-02 | 5.54e-01 | 8.71e-01 | ||||
| NetLasso | 2.71e-01 | 6.45e-01 | 7.58e-02 | 2.96e-02 | 1.53e-02 | 4.83e-01 | 4.34e-01 |
| addDA2 | |||||||
| NetRank | 1.79e-01 | 2.19e-02 | 7.31e-02 | ||||
| stSVM | 4.14e-02 | 3.39e-01 | 8.91e-01 | 8.86e-01 | 6.37e-02 | 5.85e-01 | 6.83e-01 |
| Cox | 2.91e-03 | ||||||
| RegCox | 8.52e-03 | 2.67e-01 | |||||
| MSS | 1.48e-03 | 5.95e-01 | 2.78e-01 | 2.59e-01 | 1.63e-03 | ||
| Survnet | 6.25e-03 | 5.19e-03 | 1.54e-03 | ||||
| Ensemble | 4.72e-02 | 6.05e-03 | 1.01e-01 | ||||
| allemt | 1.39e-02 | 4.30e-01 | 1.07e-01 | 7.64e-01 | 1.45e-02 | 2.07e-01 | 5.34e-01 |
| EMT hallmark | 1.62e-01 | 9.47e-01 | 3.09e-01 | 7.59e-01 | 5.82e-02 | 8.96e-01 | 7.58e-01 |
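A minimal sketch of the log-rank test for three sample clusters, as used for the p-values in the table. The cluster assignment is taken as given here (the k-means step is not shown), the survival data are invented, and `logrank_three_groups` is a hypothetical helper. With three groups the statistic is chi-square distributed with 2 degrees of freedom, for which the upper tail probability is exp(-x/2).

```python
import math

def logrank_three_groups(times, events, groups):
    """Log-rank test for groups coded 0, 1, 2. Returns (statistic, p_value)."""
    u = [0.0, 0.0]            # observed minus expected events, groups 0 and 1
    v11 = v22 = v12 = 0.0     # covariance entries for groups 0 and 1
    for t in sorted({ti for ti, ei in zip(times, events) if ei}):
        at_risk = [0, 0, 0]
        deaths = [0, 0, 0]
        for ti, ei, gi in zip(times, events, groups):
            if ti >= t:
                at_risk[gi] += 1
                if ei and ti == t:
                    deaths[gi] += 1
        n, d = sum(at_risk), sum(deaths)
        if n < 2:
            continue  # a single sample at risk carries no comparative information
        factor = d * (n - d) / (n - 1)
        p0, p1 = at_risk[0] / n, at_risk[1] / n
        u[0] += deaths[0] - d * p0
        u[1] += deaths[1] - d * p1
        v11 += factor * p0 * (1 - p0)
        v22 += factor * p1 * (1 - p1)
        v12 -= factor * p0 * p1
    det = v11 * v22 - v12 ** 2
    if det <= 0:
        raise ValueError("degenerate data: covariance matrix is singular")
    stat = (v22 * u[0] ** 2 - 2 * v12 * u[0] * u[1] + v11 * u[1] ** 2) / det
    return stat, math.exp(-stat / 2)

# Invented data: three clusters with clearly separated survival times (days).
times = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]
events = [True] * 15            # True = death observed, False = censored
groups = [0] * 5 + [1] * 5 + [2] * 5
stat, p_value = logrank_three_groups(times, events, groups)
```

Censored samples contribute to the at-risk counts until their last follow-up time but never count as events, which is what allows the test to use the all-stage censored data.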
The results of log-rank tests on stratified sample clusters using single- and multi-omics EMT signatures on early-stage samples.
The k-means algorithm was used to cluster the samples into 3 groups. We highlighted all p-values lower than 1e-02.
| Algorithm | GE | DM | CNA | GE+DM | GE+CNA | DM+CNA | GE+DM+CNA |
|---|---|---|---|---|---|---|---|
| t-test | 9.38e-02 | 1.93e-01 | 1.55e-01 | 2.47e-01 | |||
| Lasso | 1.01e-01 | 2.35e-01 | |||||
| NetLasso | 8.56e-01 | 2.46e-01 | 2.29e-01 | 1.01e-01 | 9.69e-01 | 9.31e-01 | |
| addDA2 | 3.52e-02 | ||||||
| NetRank | 5.39e-01 | ||||||
| stSVM | 3.17e-02 | 2.99e-01 | 2.86e-01 | 4.00e-01 | 2.16e-02 | 1.32e-01 | 8.57e-01 |
| Cox | 2.15e-01 | 2.42e-02 | 1.85e-02 | 3.30e-02 | 1.10e-02 | ||
| RegCox | 3.36e-01 | 2.33e-02 | 3.90e-02 | ||||
| MSS | 6.51e-02 | 6.51e-01 | 2.43e-01 | 2.34e-02 | 3.10e-02 | 9.91e-01 | 5.91e-02 |
| Survnet | 3.05e-01 | 6.24e-02 | 6.78e-02 | 8.66e-02 | 2.27e-02 | ||
| Ensemble | 1.54e-01 | 4.54e-02 | |||||
| allemt | 2.59e-01 | 9.45e-01 | 6.31e-01 | 1.38e-02 | 4.16e-01 | 9.73e-01 |
Fig 3. EMT single-omics signatures can stratify test samples into significantly different prognostic groups.
The signature was selected by the addDA2 algorithm using DM data.
Fig 4. EMT multi-omics signatures can stratify test samples into significantly different prognostic groups even when the corresponding single-omics signatures cannot.
The signature combines the GE and DM single-omics signatures selected by the t-test.