Mehrad Mahmoudian, Mikko S Venäläinen, Riku Klén, Laura L Elo.
Abstract
MOTIVATION: The emergence of datasets with tens of thousands of features, such as high-throughput omics biomedical data, highlights the importance of reducing the feature space to a distilled subset that truly captures the signal, which helps research and industry find more effective biomarkers for the question at hand. A good feature set also facilitates building robust predictive models with improved interpretability and convergence of the applied method due to the smaller feature space.
Year: 2021 PMID: 34270690 PMCID: PMC8665768 DOI: 10.1093/bioinformatics/btab501
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Internal steps of the SIVS method. (A) The general schema of the SIVS method. (B) Frequency with which each feature had a nonzero coefficient in the ‘iterative model building’ step. (C) Distribution of the nonzero coefficients each feature received in the ‘iterative model building’ step. Features are sorted by the median of their nonzero coefficients, from high to low. (D) The main plot of the SIVS method, presenting an overview of the ‘RFE’ step. This plot is composed of three main elements: the bar chart that shows the VIMP, the box plots that show the distribution of AUROC after removal of each feature, and the two vertical dashed lines marking the two suggested strictness
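The bookkeeping behind panels B and C can be illustrated with a minimal Python sketch. It assumes each run of the ‘iterative model building’ step yields a mapping from feature name to fitted coefficient; the function name `summarise_coefficients`, the input format and the toy values are hypothetical and are not taken from the SIVS package.

```python
from statistics import median


def summarise_coefficients(runs):
    """Summarise per-feature coefficients across repeated model fits.

    runs: list of dicts mapping feature name -> coefficient (assumed input
    format; a zero coefficient means the feature was not selected in that run).
    Returns (frequency, ordering): how often each feature had a nonzero
    coefficient, and the features sorted by the median of their nonzero
    coefficients from high to low.
    """
    nonzero = {}
    for coefs in runs:
        for feature, value in coefs.items():
            if value != 0:
                nonzero.setdefault(feature, []).append(value)
    frequency = {feature: len(values) for feature, values in nonzero.items()}
    ordering = sorted(nonzero, key=lambda f: median(nonzero[f]), reverse=True)
    return frequency, ordering


# Toy coefficients from three hypothetical runs.
runs = [
    {"geneA": 0.8, "geneB": 0.0, "geneC": 0.1},
    {"geneA": 0.6, "geneB": 0.3, "geneC": 0.0},
    {"geneA": 0.7, "geneB": 0.5, "geneC": 0.0},
]
frequency, ordering = summarise_coefficients(runs)
print(frequency)
print(ordering)
```

Here `frequency` corresponds to the counts plotted in panel B and `ordering` to the feature order used in panel C.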
Data that has been used in this study
| Disease | Response value | Data type (platform) | Accession ID |
|---|---|---|---|
| Breast cancer | Relapse-free survival | Microarray (GPL96) | GSE2034, GSE7390 |
| Lung cancer | Subtype classification | RNA-seq | TCGA_LUAD, TCGA_LUSC |
| Cardiovascular | Occurrence of cardiovascular outcome | Clinical | SPRINT, ACCORD-BP |
| Arcene | Cancer versus healthy | Mass-spectrometry | ARCENE |
Note: Four types of data were used to evaluate the method introduced in this article. This table lists the data types and what was used as the response value for each.
Comparison of run types and the consistency of their models
| Run-type | | glmnet | | | | SIVS + glmnet | | | | Boruta + glmnet | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | Detail | Breast cancer | Lung cancer | Cardiovascular | Arcene | Breast cancer | Lung cancer | Cardiovascular | Arcene | Breast cancer | Lung cancer | Cardiovascular | Arcene |
| Number of selected features | Maximum | 107 | 158 | 15 | 59 | 41 | 45 | 13 | 10 | 7 | 35 | 11 | 7 |
| | Median | 76 | 114 | 15 | 43 | 41 | 43 | 13 | 10 | 7 | 34 | 11 | 6 |
| | Mean | 79.12 | 114.08 | 14.74 | 41.88 | 41 | 42.42 | 13 | 10 | 7 | 33.62 | 11 | 6.44 |
| | Minimum | 59 | 76 | 14 | 28 | 41 | 41 | 13 | 10 | 7 | 32 | 11 | 6 |
| | Standard deviation | 9.6685 | 15.9524 | 0.4408 | 7.9089 | 0 | 1.165 | 0 | 0 | 0 | 0.8261 | 0 | 0.4989 |
| | Intersect | 54 | 58 | 14 | 26 | 41 | 41 | 13 | 10 | 7 | 30 | 11 | 6 |
| | Union | 112 | 177 | 16 | 60 | 41 | 45 | 13 | 10 | 7 | 35 | 11 | 7 |
| AUROC [validation] | Maximum | 64.05% | 99.36% | 69.58% | 75.53% | 61.16% | 99.37% | 69.43% | 72.73% | 56.96% | 99.28% | 69.38% | 69.48% |
| | Median | 62.51% | 99.19% | 69.51% | 74.84% | 60.88% | 99.30% | 69.37% | 71.39% | 56.93% | 99.16% | 69.31% | 69.32% |
| | Mean | 62.01% | 99.17% | 69.52% | 74.74% | 60.93% | 99.30% | 69.37% | 71.58% | 56.93% | 99.16% | 69.32% | 69.24% |
| | Minimum | 59.42% | 98.83% | 69.45% | 74.19% | 60.74% | 99.25% | 69.37% | 70.74% | 56.89% | 98.99% | 69.31% | 68.59% |
| | Standard deviation | 0.0108 | 0.0009 | 0.0003 | 0.0041 | 0.001 | 0.0002 | 0.0001 | 0.0053 | 0.0002 | 0.0005 | 0.0002 | 0.0023 |
Note: For each data type used in this article and for each method, 100 models were built and tested using 100 different cross-validation seeds. This table presents the consistency of each method in terms of the number of selected features and AUROC.
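The consistency metrics reported for the number of selected features (maximum, median, mean, minimum, standard deviation, intersect and union) can be computed from the per-seed feature sets. The following Python sketch shows one way to do so; the function name `consistency`, the toy feature sets and the use of the sample standard deviation are assumptions, not details stated in the article.

```python
from statistics import mean, median, stdev


def consistency(feature_sets):
    """Summarise how stable feature selection is across repeated runs.

    feature_sets: list of sets, one per cross-validation seed, each holding
    the names of the features selected in that run.
    """
    sizes = [len(selected) for selected in feature_sets]
    return {
        "maximum": max(sizes),
        "median": median(sizes),
        "mean": mean(sizes),
        "minimum": min(sizes),
        # Sample standard deviation (n - 1 denominator); this is an assumption.
        "sd": stdev(sizes),
        # Features selected in every run.
        "intersect": len(set.intersection(*feature_sets)),
        # Features selected in at least one run.
        "union": len(set.union(*feature_sets)),
    }


# Toy example with three hypothetical runs.
runs = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f2", "f4", "f5"}]
summary = consistency(runs)
print(summary)
```

When the intersect equals the union, every seed selected exactly the same features, which is the most stable outcome this summary can show.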
Fig. 2. Side-by-side comparison of glmnet and SIVS. (A) The number of features that were used in each of the 100 glmnet models built using SIVS features (SIVS + glmnet), Boruta features (Boruta + glmnet) and plain glmnet. For each dataset, all three types of runs were performed 100 times with 100 different cross-validation seeds to assess the stability of the outcomes. (B and C) Performance of these models on the test sets. The plots in panel B illustrate that there is no significant difference in performance between the models built using features selected by SIVS and the models built without feature selection, even though the models built using SIVS use far fewer features, as illustrated in panel A. The plots in panel C show the same data points as panel B, zoomed in to show the performance robustness of models built using SIVS-selected features compared to glmnet and Boruta + glmnet. (D) Venn diagrams depicting the overlap of the selected features via their intersection (∩) and union (∪), showing that the feature space suggested by SIVS is always a subset of the standard glmnet feature space, and that the SIVS feature space is typically so robust that the intersect and union are the same set.
Fig. 3. Significance of SIVS feature reduction on the final model. The AUROCs of the glmnet models built using the full feature space and of those built using only the SIVS-suggested features were tested in a pair-wise fashion: models built using the same cross-validation seed were compared using the DeLong method with a two-sided alternative hypothesis (DeLong).
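For readers unfamiliar with the comparison named in this caption, the following is a minimal pure-Python sketch of the standard DeLong test for two correlated AUROCs evaluated on the same samples, with a two-sided p-value. It is not the implementation used in the study, which the caption does not name; the function names and the toy labels and scores are hypothetical.

```python
import math


def _psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 on ties, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos, neg):
    """Return the AUROC and the DeLong structural components of one model."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(pos), v10, v01


def _cov(a, b, mean_a, mean_b):
    """Sample covariance of two equally long sequences with known means."""
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test; returns (auc_a, auc_b, z, p_value).

    labels: 1 for positive and 0 for negative samples.
    scores_a, scores_b: predictions of the two models on the same samples.
    """
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    m, n = len(pos), len(neg)
    auc_a, v10_a, v01_a = _components([scores_a[i] for i in pos],
                                      [scores_a[i] for i in neg])
    auc_b, v10_b, v01_b = _components([scores_b[i] for i in pos],
                                      [scores_b[i] for i in neg])
    # Entries of the covariance matrix of (auc_a, auc_b).
    s_aa = _cov(v10_a, v10_a, auc_a, auc_a) / m + _cov(v01_a, v01_a, auc_a, auc_a) / n
    s_bb = _cov(v10_b, v10_b, auc_b, auc_b) / m + _cov(v01_b, v01_b, auc_b, auc_b) / n
    s_ab = _cov(v10_a, v10_b, auc_a, auc_b) / m + _cov(v01_a, v01_b, auc_a, auc_b) / n
    variance = s_aa + s_bb - 2 * s_ab
    if variance <= 0:
        raise ValueError("zero variance of the AUROC difference; test undefined")
    z = (auc_a - auc_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return auc_a, auc_b, z, p_value


# Toy data: eight samples scored by two hypothetical models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_b = [0.6, 0.9, 0.4, 0.5, 0.7, 0.3, 0.8, 0.2]
auc_a, auc_b, z, p = delong_test(labels, scores_a, scores_b)
print(auc_a, auc_b, z, p)
```

In a setting like the one described, such a test would be applied once per cross-validation seed, pairing the two models that share that seed.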