Samir Rachid Zaim1,2,3, Colleen Kenost1,3, Joanne Berghout1,3, Wesley Chiu1,3, Liam Wilson1,3, Hao Helen Zhang4,5,6, Yves A Lussier7,8,9,10,11,12.
BACKGROUND: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, it poses major statistical challenges in biomarker detection tasks as traditional statistical techniques are underpowered in high dimension. Second and third order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, their ability to rank features, and their robustness to the "P > > N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.Entities:
Year: 2020 PMID: 32859146 PMCID: PMC7456085 DOI: 10.1186/s12859-020-03718-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1 Random forest feature selection methods and their permutation requirements
| Method | Permute | Brief description |
|---|---|---|
| EFS | No | Calculates a global score for each feature using 8 different metrics to measure importance and selects features whose score exceeds the median global score |
| AUC-RF | No | Iteratively trains a random forest and removes predictors in a stepwise fashion to maximize the increase in AUC |
| RFE, dRFE | No | Iteratively trains a random forest (RF) model and drops uninformative features based on a user-defined criterion |
| RF-ACE | No | Creates phony variables called "Artificial Contrasts with Ensembles" and compares how often these sham variables are used over the real ones |
| R2VIM | No | Calculates variable importance (VI), divides by the minimum VI to create relative VI, and chooses important features based on a pre-selected cutoff |
| VarSelRF, geneSrF | No | Iteratively removes the worst 20% (or a user-specified percentage) of all features, retrains the RF, and selects the smallest feature set within one set of best models |
| Vita | Yes | P-values are calculated from an empirical null distribution of non-positive importance scores, which accelerates null distribution estimation |
| Perm | Yes | Permutes outcomes (Y) and determines importance based on which features retain a larger importance than under the permuted outcome |
| PIMP | Yes | Permutes the outcome and determines features' priority based on increases in mutual information or Gini errors |
| VSURF | No | Two-step FS algorithm: 1) uses predictor permutations to identify features robust to noise, and 2) refines the model by step-forward inclusion of features until error convergence |
| Boruta | No | Creates phony predictors by permuting the values of the shadow variables, runs an RF to compute features' Z-scores, eliminates features whose Z-scores fall below a threshold, and repeats until convergence |
Omitting permutations generally reduces computing time substantially. P-values provide an explicit ranking of features, which enables objective feature thresholding
Table 2 BinomialRF improves the memory requirements
| Features dimension (P) | Interaction order (k) | binomialRF memory | Memory requirements for interactions |
|---|---|---|---|
| 10 | 2 | N × 10 | N × 55 |
| 10 | 3 | N × 10 | N × 175 |
| 100 | 2 | N × 100 | N × 5050 |
| 100 | 3 | N × 100 | N × 166,750 |
| 1000 | 2 | N × 1000 | N × 500,500 |
| 1000 | 3 | N × 1000 | N × 166,667,500 |
The improvement is of several orders of magnitude for 2-way and 3-way interactions compared with the other methods of Table 1. One advantage of the binomialRF algorithm is that it screens for sets of gene interactions in a memory-efficient manner: it requires only a constant-sized N × P matrix, whereas the current state of the art requires the predictor matrix to grow combinatorially to screen for interactions. Memory efficiency is defined as the ratio of the interaction memory requirement to binomialRF's constant requirement, and the interaction memory requirement is the number of columns needed to map all interactions up to order k, i.e., N × Σ_{j=1..k} C(P, j), where C(P, j) is the binomial coefficient
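The column counts in the "memory requirements for interactions" column above are simply the number of feature subsets of size up to k. A minimal sketch that reproduces them (the function name is illustrative, not part of the binomialRF package):

```python
from math import comb

def interaction_columns(p: int, k: int) -> int:
    """Number of columns needed to encode all main effects and
    interactions up to order k for p features: sum of C(p, j), j = 1..k."""
    return sum(comb(p, j) for j in range(1, k + 1))

# Reproduce the table's column counts:
for p, k in [(10, 2), (10, 3), (100, 2), (100, 3), (1000, 2), (1000, 3)]:
    print(p, k, interaction_columns(p, k))  # e.g. (10, 2) -> 55
```

By contrast, binomialRF keeps the predictor matrix at N × p regardless of k, which is where the orders-of-magnitude memory savings come from.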
Fig. 1 BinomialRF shows substantially improved computational time. Simulation runtimes are measured in seconds and plotted in powers of ten to show the difference in magnitude of computation time. The simulation scenarios are detailed in Section 2.1, where the length of the coefficient vector β varies from 10 to 100 and 1000 features. All simulations were conducted on a 2017 MacBook Pro with a 3.1 GHz Intel Core i5 and 16 GB of RAM. In every simulation, binomialRF was the fastest
Table 3 Simulation results of biomarkers
| Model | Precision | Recall | Test error | Model size |
|---|---|---|---|---|
| 3A. Results: 100–2000 features | ||||
| AUCRF | 0.54 (0.25) | 0.74 (0.26) | 0.27 (0.1) | 8.74 (0.13) |
| binomialRF | 0.37 (0.36) | 0.33 (0.13) | 81.72 (0.08) | |
| Boruta | 0.89 (0.15) | 0.41 (0.37) | 0.32 (0.13) | 63.38 (0.1) |
| EFS | 0.83 (0.16) | 0.69 (0.27) | 8.66 (0.13) | |
| Perm | 0.33 (0.33) | 0.30 (0.09) | 59.42 (0.1) | |
| PIMPᵃ | 0.18 (0.36) | 0.00 (0.01) | 0.35 (0.1) | |
| RFE | 0.49 (0.35) | 0.61 (0.23) | 0.3 (0.08) | 250.29 (0.09) |
| VarSelRF | 0.67 (0.24) | 0.65 (0.29) | 0.27 (0.1) | 12.31 (0.12) |
| Vita | 0.46 (0.28) | 0.66 (0.29) | 0.28 (0.1) | 35.44 (0.1) |
| VSURF | 0.86 (0.15) | 0.44 (0.36) | 0.31 (0.12) | 40.95 (0.1) |
| 3B. Results: 10,000 features | ||||
| AUCRF | 0.17 (0.05) | 0.33 (0.05) | 215.68 (0.01) | |
| binomialRF | 0.51 (0.12) | 0.14 (0.12) | 28.6 (0.03) | |
| Boruta | 0.03 (0.18) | 0.47 (0.01) | ||
| Perm | 0.02 (0) | 0.46 (0.03) | 4958.26 (0.03) | |
| RFE | 0.03 (0) | 0.66 (0) | 0.44 (0.04) | 1950.11 (0.02) |
| Vita | 0.03 (0) | 0.52 (0) | 0.45 (0.05) | 1954.32 (0.02) |
The binomialRF and the algorithms in Table 1 were tested across a range of simulation scenarios (Table 6). Mean (standard deviation) results are shown, ranked by decreasing F1-score. In 3A, results for all techniques are shown for up to 2000 features. In 3B, results are shown for a limited simulation scenario with 10,000 features and 100 seeded genes. Only a subset of methods is presented in 3B, as the remainder either could not process 10,000 features (i.e., they induced memory errors) or introduced rate-limiting computational challenges (see Fig. 1). Across both tables, Boruta and binomialRF attain the highest precision, while Perm attains the highest recall. More studies are required in high-dimensional scenarios to better understand each technique's behavior. Top accuracies are bolded
ᵃ Across many runs, the PIMP algorithm produced no gene predictions despite being run with its default parameters, resulting in these low precision and recall values. We varied the parameters without additional success, so we flag these results as warranting further investigation
Table 6 Parameter settings for the simulation study
| Parameter | Values |
|---|---|
| Genome size (P) | 100, 500, 1000, 2000, 10,000 |
| Genes seeded | 5, 25, 50, 100 |
| Number of trees (V) | 500, 1000 |
Table 4 UCI ML Madelon dataset validation
| Model | Model size | Run time | Precision | Recall |
|---|---|---|---|---|
| VarSelRF | 23 (13) | 129 (21) | ||
| VSURF | 3.5 (1.4) | 321 (267) | ||
| binomialRF | 17.1 (3.9) | 0.55 (0.02) | 0.55 (0.01) | |
| Vita | 13 (5.68) | 1007 (1220) | 0.55 (0.02) | 0.55 (0.02) |
| Boruta | 2 (2) | 139 (45) | 0.54 (0.03) | |
| Perm | 240 (13) | 269 (329) | 0.56 (0.08) | 0.54 (0.01) |
| AUCRF | 31 (30) | 33 (7.5) | 0.55 (0.04) | 0.54 (0.02) |
| RFE | 81 (4.2) | 20 (1.4) | 0.54 (0.06) | 0.54 (0.01) |
| EFS | 20 (8.3) | 2617 (2126) | 0.53 (0.02) | 0.54 (0.02) |
| PIMP | 1.7 (1.3) | 482 (128) | 0.50 (0.04) | 0.50 (0.01) |
The algorithms in Table 1 were tested and compared using the Madelon benchmark dataset from UCI (described in Methods). Mean (standard deviation) results are shown and ranked according to decreasing harmonic mean of precision and recall of variables. Top accuracies are bolded
Table 5 TCGA dataset validation
| Model | Time | Test error | Model size |
|---|---|---|---|
| 5A. Breast cancer | |||
| binomialRF | 83 (11) | 0 (0) | 27 (4) |
| RFE | 100 (13) | 0 (0) | 692 (23) |
| Perm | 112 (16) | 0 (0) | 1092 (39) |
| Vita | 493 (88) | 0 (0) | 19,933 (10) |
| Boruta | 1667 (617) | 0 (0) | 92 (3) |
| 5B. Kidney cancer | |||
| binomialRF | 51 (10) | 0 (0) | 48 (3) |
| RFE | 67 (10) | 0 (0) | 592 (55) |
| Perm | 73 (12) | 0 (0) | 867 (55) |
| Vita | 315 (72) | 0 (0) | 19,760 (41) |
| Boruta | 987 (363) | 0 (0) | 24 (2) |
The algorithms in Table 1 were tested and compared on the TCGA breast and kidney cancer datasets; the mean (with standard deviation in parentheses) is reported. Half of the methods were excluded because they encountered computation or memory limitations when running on the TCGA datasets
Fig. 2 Biomarker accuracies of the TCGA validation study. The TCGA validation study was conducted using breast and kidney cancer datasets accessed via the R package TCGA2STAT. The matched-sample datasets were used to determine whether binomialRF could produce an accurate classifier from main effects and interactions. Left, the two binomialRF classifiers (51 identified gene main effects; 39 identified gene-gene interactions) were as accurate as the original black-box RF model built with all ~ 20,000 genes. Right, the two binomialRF classifiers (16 identified gene main effects; 11 identified gene-gene interactions) were as accurate as the original black-box RF model built with all ~ 20,000 genes
Fig. 3 Statistical interactions prioritized by binomialRF in TCGA cancers recapitulate known cancer driver genes. The statistical-interaction gene networks (top) show the pairwise biomarker interactions identified by the binomialRF algorithm for the breast (left) and kidney (right) cancer datasets. Key features participate in multiple interactions (super-interactors; e.g., SPRY2, COL10A1). Feature names (gene products) reported in the literature as associated with cancer pathophysiology are shown in black; those also documented as cancer driver genes in COSMIC are shown in green (Methods); the remainder are grey. Two exemplar statistical interactions (one per dataset) are circled, and the log expression of their gene products and of their ratios is shown in the bottom panels. The separation of the distributions between tumor (green) and normal (orange) cases indicates a potential interaction between these two genes across the cohorts
Fig. 4 The binomialRF feature selection algorithm. The binomialRF algorithm is a feature selection technique for random forests (RF) that treats each tree as a stochastic binomial process and determines whether a feature is selected as the optimal splitting variable more often than expected by random chance, using a top-bottom sampling-without-replacement scheme. The main-effects algorithm identifies whether the optimal splitting variables at the root of each tree are selected at random or whether certain features are selected with significantly higher frequency. The interaction-screening extension is detailed in Section 3. Legend: Tz = zth tree in the random forest; Xj = feature j; Fj = the observed frequency of selecting Xj; Pr = probability; P = number (#) of features; V = # of trees in the RF; m = user parameter to limit P; g = index of the product
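The main-effects test in Fig. 4 can be sketched as a binomial tail probability: under the null, each of the V root splits picks any of the P features uniformly at random (probability 1/P), and a feature is flagged when its observed selection frequency is improbably high. The sketch below uses a plain binomial; the paper's actual test additionally adjusts for the correlation between trees induced by shared bootstrap samples, which this illustration omits, and the function name is not the package API.

```python
from math import comb

def binomial_tail_pvalue(f_obs: int, v_trees: int, p_features: int) -> float:
    """P(F >= f_obs) when each of v_trees root splits independently picks
    one of p_features uniformly at random (success probability 1/p).
    Plain-binomial sketch; binomialRF uses a correlated binomial."""
    p0 = 1.0 / p_features
    return sum(comb(v_trees, f) * p0**f * (1 - p0)**(v_trees - f)
               for f in range(f_obs, v_trees + 1))

# e.g. a feature chosen as the root split in 30 of 500 trees with 100
# features: the null expectation is only 500/100 = 5 selections, so the
# tail probability is vanishingly small and the feature is flagged.
pval = binomial_tail_pvalue(30, 500, 100)
```

Because every feature gets an explicit p-value, features can be thresholded objectively (e.g., after multiple-testing correction) rather than by an arbitrary importance cutoff.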
Fig. 5 Decision tree and node variables. In this binary-split decision tree, X1 is the optimal splitting feature at the root of the tree, and X1 → X2 → X3 is the optimal splitting sequence, indicating a potential X1 ⊗ X2 ⊗ X3 3-way interaction, where the symbol " ⊗ " denotes an interaction
Fig. 6 Calculating RF features' interactions. a 2-way interactions. To extend the binomialRF algorithm to 2-way interaction selection, we define a test statistic that reflects the frequency Fij of the pair Xi ⊗ Xj occurring in the random forest. In particular, the probability of an interaction term occurring by random chance is recalculated and normalized by a factor of one half. b K-way interactions, K = 4. Here, we illustrate the tree-traversal process used to identify all 4-way interactions, with each color denoting a possible interaction path. The legend on the right shows how each interaction path yields a set of 4-way feature interactions. In general, for any user-desired K, the k.binomialRF algorithm traverses the tree via dynamic tree programming to identify all possible paths from the K-terminal nodes to the root, where K-terminal nodes are all nodes K steps away from the root node
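The traversal illustrated in Fig. 6b, collecting the splitting features along every path from the root down to depth K, can be sketched as a simple recursive walk. `Node` and `k_way_paths` are hypothetical names for illustration, not the binomialRF package's data structures:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    feature: int                      # index of the splitting feature
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def k_way_paths(root: Node, k: int) -> List[Tuple[int, ...]]:
    """Collect the splitting features along each root-to-depth-k path;
    every path is one candidate k-way interaction (cf. Fig. 6b)."""
    out: List[Tuple[int, ...]] = []

    def walk(node: Optional[Node], path: Tuple[int, ...]) -> None:
        if node is None:
            return
        path = path + (node.feature,)
        if len(path) == k:            # reached a k-terminal node
            out.append(path)
            return
        walk(node.left, path)
        walk(node.right, path)

    walk(root, ())
    return out

# Fig. 5's tree: X1 at the root splitting into X2 and X3 yields the
# candidate 2-way interactions (X1, X2) and (X1, X3).
tree = Node(1, Node(2), Node(3))
pairs = k_way_paths(tree, 2)          # [(1, 2), (1, 3)]
```

Tallying these candidate paths across all trees in the forest gives the observed interaction frequencies that the k.binomialRF test statistic is built on.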
TCGA validation study datasets
| Description | Breast cancer | Kidney cancer |
|---|---|---|
| Matched samples | 97 tumor, 97 normal samples | 65 tumor, 65 normal samples |