Michail Tsagris, Ioannis Tsamardinos.
Abstract
Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as R packages, and those offer few options. The R package MXM offers a variety of feature selection algorithms and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc.; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time-to-event data the user can choose among Cox, Weibull, log-logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions, that is, many sets of statistically equivalent features (plainly speaking, two features carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high-volume data, that is, data that cannot be loaded into R directly (on a machine with 16GB of RAM, for example, R cannot directly load a 16GB data set; by using the proper package, we load the data and then perform feature selection). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM's algorithms using real high-dimensional data from various applications.
Keywords: Feature selection; R package; algorithms; computational efficiency
Year: 2018 PMID: 31656581 PMCID: PMC6792475 DOI: 10.12688/f1000research.16216.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
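The notion of statistically equivalent features mentioned in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not MXM code and not the test that SES applies: it uses simulated Gaussian data and takes the R² of a simple linear regression as a stand-in for "the conclusions". All names (`r_squared`, `x1`, `x2`, `x3`) are hypothetical.

```python
import random
from statistics import mean


def r_squared(x, y):
    """Coefficient of determination of the simple linear regression y ~ x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 500

x1 = [rng.gauss(0, 1) for _ in range(n)]
# x2 is a noisy copy of x1, so it carries almost the same information.
x2 = [a + rng.gauss(0, 0.05) for a in x1]
# x3 is unrelated noise, included as a contrast.
x3 = [rng.gauss(0, 1) for _ in range(n)]
# The target depends on x1 only.
y = [2 * a + rng.gauss(0, 1) for a in x1]

r2 = {name: r_squared(x, y) for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]}
for name, value in r2.items():
    print(f"{name}: R^2 = {value:.3f}")
```

Under these assumptions, `x1` and `x2` explain nearly the same share of the variance of `y`, so substituting one for the other leaves the fitted relationship essentially unchanged, whereas `x3` explains almost none of it.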
Table 1. Frequency of CRAN and Bioconductor FS related packages in terms of the target variable they accept.
Percentages (out of 184 packages) are given in parentheses.
| Target type | Binary | Nominal | Continuous | Counts | Survival | Case-control | Ordinal | Multivariate |
|---|---|---|---|---|---|---|---|---|
| Frequency (%) | 107 (58.15%) | 31 (16.85%) | 120 (65.22%) | 29 (15.76%) | 27 (14.67%) | 3 (1.63%) | 3 (1.63%) | 11 (5.97%) |
Figure 1. Frequency of FS related R packages handling different types of target variables.
The horizontal axis shows the number of types (in any combination) of target variables from Table 1. For example, there are 95 R packages that can handle only 1 type (any type) of target variable and 41 packages that can handle any 2 types of target variables, while MXM is the only one that handles all of them.
Cross-tabulation of the FS packages in R based on the target variable.
There are 108 packages that handle binary target variables, 59 packages that offer algorithms for both binary and continuous target variables, and only one package that handles both ordinal and nominal target variables, etc.
| | Binary | Nominal | Continuous | Counts | Survival | Case-control | Ordinal | Multivariate |
|---|---|---|---|---|---|---|---|---|
| Binary | 108 | | | | | | | |
| Nominal | | 32 | | | | | | |
| Continuous | 59 | 13 | 120 | | | | | |
| Counts | 28 | 3 | 28 | 29 | | | | |
| Survival | 18 | 5 | 17 | 7 | 27 | | | |
| Case-control | 1 | 1 | 1 | 1 | 1 | 3 | | |
| Ordinal | 4 | 1 | 2 | 2 | 1 | 1 | 4 | |
| Multivariate | 4 | 3 | 8 | 4 | 3 | 1 | 1 | 11 |
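A cross-tabulation of this kind can be derived from a list of the target types each package supports. The sketch below shows one way to do so. The package names and their supported types are invented toy data, not taken from the survey, and `cross_tabulate` is a hypothetical helper.

```python
from itertools import combinations_with_replacement

# Hypothetical toy data: package names and supported target types are
# invented for illustration only.
packages = {
    "pkgA": {"Binary", "Continuous"},
    "pkgB": {"Binary", "Continuous", "Survival"},
    "pkgC": {"Nominal", "Ordinal"},
    "pkgD": {"Continuous"},
}

TYPES = ["Binary", "Nominal", "Continuous", "Survival", "Ordinal"]


def cross_tabulate(packages, types):
    """Count, for every pair of target types, the packages supporting both.

    Diagonal entries (a type paired with itself) count the packages that
    support that type at all.
    """
    table = {}
    for a, b in combinations_with_replacement(types, 2):
        table[(a, b)] = sum(
            1 for supported in packages.values()
            if a in supported and b in supported
        )
    return table


table = cross_tabulate(packages, TYPES)

# Print the lower triangle, mirroring the layout of the table above.
for i, row in enumerate(TYPES):
    cells = [table[(col, row)] for col in TYPES[: i + 1]]
    print(f"{row:<11}" + " ".join(f"{c:>3}" for c in cells))
```

Each diagonal cell is the number of packages supporting a given type, and each off-diagonal cell is the number supporting both the row and the column type, which is how the entries in the table above are read.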
Frequency of other types of regression models for FS treated by R packages on CRAN and Bioconductor.
Percentages are given in parentheses.
| Regression models | Robust | GLMM | GEE | Functional |
|---|---|---|---|---|
| Frequency (%) | 4 (2.19%) | 8 (4.37%) | 2 (1.09%) | 2 (1.09%) |
A brief overview of the types of target variables and regression models in MXM.
| Target variable type | Regression model or test |
|---|---|
| Continuous and percentages | Linear, MM and quantile regression, Pearson and Spearman |
| Multivariate continuous | Multivariate linear regression |
| Compositional data | Multivariate linear regression |
| (Strictly) positive valued | Gaussian and Gamma regression with a log-link |
| Percentages with or without zeros | Beta regression and quasi binomial regression |
| Counts | Poisson, quasi Poisson, negative binomial and zero inflated Poisson |
| Binary | Logistic regression, quasi binomial regression and |
| Nominal | Multinomial regression and |
| Ordinal | Ordinal regression |
| Number of successes out of trials | Binomial regression |
| Time-to-event | Cox, Weibull and exponential regression |
| Matched case-control | Conditional logistic regression |
| Left censored | Tobit regression |
| Repeated/clustered, longitudinal | Generalised linear mixed models (GLMM) and Generalised |
Algorithm suggestion according to combinations of sample size (n) and number of features (p), multiple solutions and high volume data.
| Algorithm | Cases | ||||||
|---|---|---|---|---|---|---|---|
| n small &
| n small &
| n big &
| n big &
| High volume
| Multiple
| Authors
| |
| BSR | ✓ | ||||||
| FBED | ✓ | ✓ | ✓ | ✓ | |||
| FSR | ✓ | ||||||
| gOMP | ✓ | ✓ | ✓ | ✓ | |||
| IAMB | ✓ | ✓ | |||||
| MMMB | ✓ | ✓ | ✓ | ✓ | |||
| MMPC | ✓ | ✓ | ✓ | ✓ | |||
| SES | ✓ | ✓ | ✓ | ✓ | ✓ | ||
An overview of the main FS algorithms in MXM.
| R Function | Algorithm |
|---|---|
| MMPC | Max-Min Parents and Children (MMPC) |
| SES | Statistically Equivalent Signatures (SES) |
| mmmb | Max-Min Markov Blanket (MMMB) |
| fs.reg | Forward selection (FSR) |
| bs.reg | Backward selection (BSR) |
| iamb | Incremental Association Markov Blanket (IAMB) |
| fbed.reg | Forward-Backward with Early Dropping (FBED) |
| gomp | Generalized Orthogonal Matching Pursuit (gOMP) |
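To give a flavour of the greedy strategy behind one entry in the table above, the following is a minimal sketch of plain (non-generalized) orthogonal matching pursuit for a linear model. It is written in Python for illustration and is not MXM's `gomp` implementation. The function name `omp_select`, the `max_features` stopping rule and the simulated data are all assumptions, and every feature column is assumed to be non-constant.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def omp_select(columns, target, max_features):
    """Greedy orthogonal matching pursuit for a linear model.

    columns: list of non-constant feature columns (each a list of floats).
    Returns the indices of the selected columns, in selection order.
    """
    residual = center(target)
    # Centered, unit-norm copies so that scores are comparable across features.
    atoms = []
    for col in columns:
        c = center(col)
        norm = dot(c, c) ** 0.5
        atoms.append([a / norm for a in c])

    selected, basis = [], []  # basis: orthonormal vectors spanning the selection
    for _ in range(min(max_features, len(columns))):
        # Pick the unselected feature most correlated with the current residual.
        scores = {j: abs(dot(atoms[j], residual))
                  for j in range(len(atoms)) if j not in selected}
        best = max(scores, key=scores.get)
        # Orthogonalise the new feature against the basis (Gram-Schmidt).
        q = atoms[best][:]
        for b in basis:
            coef = dot(q, b)
            q = [a - coef * c for a, c in zip(q, b)]
        norm = dot(q, q) ** 0.5
        if norm < 1e-12:  # numerically dependent on the selected features
            break
        q = [a / norm for a in q]
        # Remove the part of the residual explained by the new direction.
        coef = dot(residual, q)
        residual = [r - coef * a for r, a in zip(residual, q)]
        selected.append(best)
        basis.append(q)
    return selected


# Simulated data: 10 features, of which only columns 2 and 7 drive the target.
rng = random.Random(0)
n, p = 200, 10
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * X[2][i] - 2 * X[7][i] + rng.gauss(0, 0.5) for i in range(n)]

chosen = omp_select(X, y, max_features=2)
print(chosen)
```

Because the residual is kept orthogonal to every feature already selected, each step scores the remaining features only on what they add beyond the current selection. A fixed feature budget is used here purely for simplicity.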