Xin Guan¹,², George Runger¹, Li Liu³,⁴,⁵.
Abstract
BACKGROUND: In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate the knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information is encoded as a single score and the algorithms are optimized for biological knowledge of a specific type. However, in practice, domain knowledge from diverse resources can provide complementary information. But no current methods can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed the Know-GRRF (know-guided regularized random forest) method that enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection.Entities:
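The core idea of combining multi-domain priors into a single relevance score, which then softens or sharpens the per-feature regularization penalty, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight vector `w`, the min-max scaling, and the GRRF-style penalty form `(1 - delta) + delta * s_i` are assumptions for demonstration.

```python
import numpy as np

def relevance_scores(A, w):
    """Combine M domain-knowledge measures per feature into one score.

    A : (P, M) prior matrix (one row per predictor, one column per domain).
    w : (M,) linear-model weights (tuned in practice; illustrative here).
    Scores are min-max scaled to [0, 1] so they can weight penalties.
    """
    s = A @ w
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def penalty_coefficients(scores, delta=0.9):
    """GRRF-style per-feature coefficient: higher relevance means the
    feature's information gain is penalized less during tree growth."""
    return (1.0 - delta) + delta * scores

# toy example: 5 predictors scored by 2 knowledge domains
A = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.4], [0.0, 0.1], [1.0, 0.9]])
s = relevance_scores(A, w=np.array([0.6, 0.4]))
lam = penalty_coefficients(s, delta=0.9)
print(lam)  # largest coefficient for the most biologically relevant feature
```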
Keywords: Biomarker discovery; Domain knowledge; Feature selection; Regularized random forest
Year: 2020 PMID: 32164534 PMCID: PMC7068914 DOI: 10.1186/s12859-020-3344-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic representation of the Know-GRRF method. (a) The data structure. The feature matrix X contains the observed values of P predictors for N samples. The prior matrix A contains functional measures of each predictor from M domains. These functional measures are combined in a linear model to derive a score representing the biological relevance of each predictor. The vector Y contains the observed response values of the samples. (b) The feature selection component. Non-leaf nodes are marked with their splitting features and colored by the corresponding biological relevance. Know-GRRF starts with an empty feature set F. In tree 1, three features (X3, X5 and X9) are sequentially added to F based on information gains weighted by biological relevance. In tree 2, because X5 and X9 are already members of F, they are selected based on information gain only; because X7 is not a member of F, it is selected based on information gain weighted by biological relevance. (c) The stability selection component. Know-GRRF first optimizes the tuning parameters on the complete dataset, then uses bootstrapped samples to select features. After T iterations, features selected more often than a user-defined frequency cutoff c are aggregated into the final feature set. Alternatively, Know-GRRF can use stepwise selection to derive the final feature set.
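The stability-selection loop in panel (c) can be sketched generically: run a feature selector on T bootstrap samples and keep features chosen in more than a fraction c of the runs. The sketch below abstracts the selector into a callable (`select_fn`), which in Know-GRRF would be one regularized random-forest run; the function name and interface are illustrative assumptions.

```python
import numpy as np

def stability_select(select_fn, X, y, T=100, c=0.5, rng=None):
    """Aggregate features chosen across T bootstrap iterations.

    select_fn(Xb, yb) -> set of selected feature indices for one run
    (e.g. one Know-GRRF fit). Features kept in more than a fraction c
    of the T runs constitute the final feature set.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(T):
        idx = rng.integers(0, n, size=n)      # bootstrap sample with replacement
        for j in select_fn(X[idx], y[idx]):
            counts[j] += 1
    return set(np.flatnonzero(counts / T > c))

# toy check: a selector that always keeps features 0 and 1
X, y = np.zeros((20, 5)), np.zeros(20)
final = stability_select(lambda Xb, yb: {0, 1}, X, y, T=20, c=0.5)
print(final)
```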
True Relationship in Simulated Scenarios
| Scenario | Relationships |
|---|---|
| 1. Linear | |
| 2. Higher order | |
| 3. Interaction | |
Superscripts indicate the indices of feature X.
Methods Comparison in Two-Class Classification Tasks
| Metric | Lasso | RRF | Know-GRRF (Prior 1, Freq>50%) | Know-GRRF (Prior 1, Stepwise) | Know-GRRF (Prior 2, Freq>50%) | Know-GRRF (Prior 2, Stepwise) | Know-GRRF (Both priors, Freq>50%) | Know-GRRF (Both priors, Stepwise) |
|---|---|---|---|---|---|---|---|---|
| Scenario 1 | | | | | | | | |
| JI | 0.26 | 0.18 | 0.40 | 0.40 | 0.50 | 0.33 | 0.80a | 0.80a |
| TPR | 0.90 | 0.30 | 0.40 | 0.40 | 0.50 | 0.40 | 0.80 | 0.80 |
| FPR | 0.27 | 0.08 | 0 | 0 | 0 | 0.03 | 0 | 0 |
| FN | 10 | 1, 2, 3, 4, 7, 9, 10 | 4, 6, 7, 8, 9, 10 | 4, 6, 7, 8, 9, 10 | 1, 2, 3, 4, 5 | 1, 2, 3, 4, 5, 10 | 2, 4 | 4, 10 |
| Scenario 2 | | | | | | | | |
| JI | 0.38 | 0.19 | 0.47 | 0.55 | 0.50 | 0.30 | 0.60 | 0.80a |
| TPR | 0.80 | 0.40 | 0.70 | 0.60 | 0.50 | 0.30 | 0.60 | 0.80 |
| FPR | 0.12 | 0.12 | 0.06 | 0.01 | 0 | 0 | 0 | 0 |
| FN | 3, 9 | 1, 2, 4, 7, 8, 9 | 7, 8, 9 | 4, 7, 8, 9 | 1, 2, 3, 4, 5 | 1, 2, 3, 4, 7, 8, 9 | 1, 3, 5, 9 | 1, 9 |
| Scenario 3 | | | | | | | | |
| JI | 0.31 | 0.27 | 0.40 | 0.27 | 0.50 | 0.30 | 0.50 | 0.90a |
| TPR | 1.00 | 0.40 | 0.40 | 0.30 | 0.50 | 0.30 | 0.50 | 0.90 |
| FPR | 0.24 | 0.06 | 0 | 0.01 | 0 | 0 | 0 | 0 |
| FN | – | 1, 3, 5, 6, 7, 8 | 1, 6, 7, 8, 9, 10 | 1, 3, 6, 7, 8, 9, 10 | 1, 2, 3, 4, 5 | 1, 2, 3, 4, 7, 8, 9 | 1, 3, 4, 5, 7 | 3 |
a Indicates the best JI value in each scenario
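The JI (Jaccard index), TPR and FPR entries above compare a selected feature set against the true feature set (features 1–10 in the simulations). A minimal computation, assuming both sets are given as index sets and `p` is the total number of candidate features:

```python
def selection_metrics(selected, true, p):
    """Jaccard index, true-positive rate and false-positive rate of a
    selected feature set versus the true set, out of p candidate features."""
    selected, true = set(selected), set(true)
    tp = len(selected & true)            # correctly selected features
    fp = len(selected - true)            # spurious selections
    ji = tp / len(selected | true) if selected | true else 1.0
    tpr = tp / len(true)
    fpr = fp / (p - len(true))
    return ji, tpr, fpr

# e.g. 8 of 10 true features recovered, no false positives, p = 100
print(selection_metrics({1, 2, 3, 5, 6, 7, 8, 9}, set(range(1, 11)), 100))
```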
Methods Comparison in Regression Tasks
| Metric | Lasso | RRF | Know-GRRF (Prior 1, Freq>50%) | Know-GRRF (Prior 1, Stepwise) | Know-GRRF (Prior 2, Freq>50%) | Know-GRRF (Prior 2, Stepwise) | Know-GRRF (Prior 3, Freq>50%) | Know-GRRF (Prior 3, Stepwise) |
|---|---|---|---|---|---|---|---|---|
| Scenario 1 | | | | | | | | |
| JI | 0.91a | 0.07 | 0.28 | 0.54 | 0.33 | 0.54 | 0.56 | 0.48 |
| TPR | 1.00 | 0.20 | 0.70 | 0.70 | 0.70 | 0.70 | 1.00 | 1.00 |
| FPR | 0.01 | 0.22 | 0.17 | 0.03 | 0.12 | 0.03 | 0.09 | 0.12 |
| FN | – | 2, 3, 4, 5, 6, 9, 10 | 6, 7, 10 | 6, 7, 10 | 2, 3, 4 | 2, 3, 4 | – | – |
| Scenario 2 | | | | | | | | |
| JI | 0.09 | 0.11 | 0.25 | 0.28 | 0.26 | 0.47 | 0.63a | 0.53 |
| TPR | 0.10 | 0.30 | 0.70 | 0.50 | 0.60 | 0.80 | 1.00 | 1.00 |
| FPR | 0.01 | 0.19 | 0.20 | 0.09 | 0.14 | 0.08 | 0.07 | 0.10 |
| FN | 2, 3, 4, 5, 6, 7, 8, 9, 10 | 1, 2, 3, 4, 7, 8, 9 | 6, 8, 9 | 4, 6, 7, 8, 9 | 1, 2, 3, 4 | 3, 4 | – | – |
| Scenario 3 | | | | | | | | |
| JI | 0.67 | 0.19 | 0.29 | 0.57 | 0.26 | 0.47 | 0.56 | 0.67a |
| TPR | 1.00 | 0.50 | 0.70 | 0.80 | 0.60 | 0.80 | 1.00 | 1.00 |
| FPR | 0.06 | 0.18 | 0.16 | 0.04 | 0.14 | 0.08 | 0.09 | 0.06 |
| FN | – | 1, 3, 5, 7, 8 | 7, 8, 10 | 7, 8 | 1, 2, 3, 5 | 1, 3 | – | – |
a Indicates the best JI value in each scenario
Genes selected by different approaches
| Method | Parameters | Number of Selected Genes | Selected Genes |
|---|---|---|---|
| Lasso | | 15 | |
| RRF | | 13 | |
| Know-GRRF | Cancer gene prior | 169 | Omitted due to space limits. |
| Know-GRRF | Consv. prior | 9 | |
| Know-GRRF | VI prior | 6 | |
| Know-GRRF | All priors | 7 | |
Fig. 2 ROC curves of random-forest models using genes selected by different approaches. AUROC values are displayed. Data from 41 testing samples were used to construct the curves.