Olga Krakovska, Gregory Christie, Andrew Sixsmith, Martin Ester, Sylvain Moreno.
Abstract
Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods that assume linear dependencies between variables. Such assumptions, however, do not hold in many cases, where the data are linked by non-linear dependencies. This calls for analytic methods that are more accurate in identifying potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets in which the relationships between the features and dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.
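To illustrate why a linearity assumption can miss a real dependency, the sketch below compares the Pearson correlation with a binned correlation ratio (η²) on a purely quadratic relationship. It is only an illustration: the correlation ratio is not one of the methods evaluated in the study, and the function names, the grid of x values, and the bin count are assumptions chosen for this example.

```python
def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation: measures linear association only."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_ratio(xs, ys, n_bins=10):
    """Share of the variance of y explained by which equal-width bin x falls in."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp the right edge
        bins[idx].append(y)
    grand = mean(ys)
    between = sum(len(b) * (mean(b) - grand) ** 2 for b in bins if b)
    total = sum((y - grand) ** 2 for y in ys)
    return between / total

# Hypothetical data: y depends on x deterministically, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
eta2 = correlation_ratio(xs, ys)
print(f"Pearson r = {r:.3f}, correlation ratio = {eta2:.3f}")
```

On this symmetric example the Pearson correlation is essentially zero even though y is fully determined by x, while the binned statistic is high. A screening step based only on linear association would discard such a variable.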
Year: 2019 PMID: 30897097 PMCID: PMC6428288 DOI: 10.1371/journal.pone.0213584
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
F1 scores for synthetic data feature selection when all variables are discrete.
| Method | F1 score, noise = .5 | | | | | | F1 score, noise = 1 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 200 | 300 | 500 | 1000 | 2500 | 5000 | 200 | 300 | 500 | 1000 | 2500 | 5000 |
| HS | .78 | .79 | .81 | .82 | .82 | .81 | .76 | .8 | .82 | .81 | .81 | .81 |
| HF | .83 | .89 | .91 | .92 | .91 | .91 | .73 | .84 | .91 | .91 | .91 | .91 |
| DC | .80 | .81 | .82 | .83 | .82 | .82 | .76 | .8 | .83 | .82 | .82 | .82 |
| OLS | .09 | .12 | .18 | .3 | .52 | .58 | .08 | .11 | .16 | .26 | .46 | .56 |
| BLS | .15 | .17 | .21 | .31 | .51 | .58 | .13 | .15 | .18 | .28 | .45 | .57 |
| BLS | .14 | .16 | .2 | .31 | .51 | .58 | .13 | .15 | .17 | .28 | .46 | .56 |
| BLS | .08 | .1 | .12 | .21 | .45 | .63 | .06 | .08 | .1 | .17 | .36 | .56 |
| FLS | .14 | .17 | .21 | .31 | .51 | .58 | .12 | .15 | .17 | .27 | .45 | .57 |
| FLS | .14 | .16 | .2 | .31 | .51 | .58 | .13 | .14 | .17 | .28 | .46 | .56 |
| FLS | .09 | .11 | .12 | .21 | .45 | .63 | .07 | .08 | .1 | .17 | .36 | .56 |
| LASSO | .18 | .18 | .19 | .21 | .22 | .22 | .17 | .18 | .19 | .20 | .21 | .21 |
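The excerpt does not spell out how the F1 scores were computed. As a minimal sketch, assuming the standard definition (the harmonic mean of precision and recall over the selected versus truly relevant features), the score for one selection run could be obtained as follows. The function name and the feature labels are hypothetical.

```python
def f1_score(selected, relevant):
    """F1 of a feature selection: harmonic mean of precision and recall."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    if true_pos == 0:
        return 0.0  # no relevant feature recovered
    precision = true_pos / len(selected)   # share of selected features that are relevant
    recall = true_pos / len(relevant)      # share of relevant features that were selected
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: four truly relevant features, three selected, two of them correct.
relevant = {"x1", "x2", "x3", "x4"}
selected = {"x1", "x2", "x7"}
score = f1_score(selected, relevant)
print(round(score, 2))
```

Under this definition, a method that selects few features but mostly wrong ones, or many features including most of the irrelevant ones, both end up with a low score.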
F1 scores for synthetic data feature selection when some of the variables are continuous.
| Method | F1 score, noise = .5 | | | | | | F1 score, noise = 1 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 200 | 300 | 500 | 1000 | 2500 | 5000 | 200 | 300 | 500 | 1000 | 2500 | 5000 |
| HS | .80 | .81 | .81 | .81 | .82 | .81 | .8 | .81 | .82 | .81 | .82 | .81 |
| HF | .85 | .89 | .89 | .90 | .90 | .90 | .78 | .85 | .90 | .89 | .90 | .90 |
| DC | .81 | .82 | .81 | .82 | .82 | .82 | .79 | .81 | .82 | .81 | .82 | .81 |
| OLS | .08 | .13 | .17 | .30 | .46 | .55 | .09 | .11 | .17 | .26 | .43 | .51 |
| BLS | .13 | .16 | .20 | .31 | .46 | .55 | .13 | .15 | .18 | .28 | .42 | .51 |
| BLS | .14 | .16 | .20 | .31 | .46 | .54 | .14 | .15 | .18 | .28 | .42 | .51 |
| BLS | .07 | .09 | .11 | .23 | .43 | .58 | .06 | .08 | .11 | .18 | .37 | .49 |
| FLS | .13 | .16 | .2 | .31 | .46 | .55 | .13 | .15 | .18 | .28 | .42 | .51 |
| FLS | .13 | .15 | .2 | .31 | .46 | .54 | .14 | .15 | .18 | .28 | .42 | .51 |
| FLS | .08 | .09 | .11 | .23 | .43 | .58 | .07 | .09 | .11 | .18 | .37 | .49 |
| LASSO | .17 | .19 | .19 | .21 | .21 | .22 | .17 | .17 | .19 | .20 | .21 | .21 |
Fig 1. False discovery rate and sensitivity of linear and non-linear methods, with all discrete variables.
(A) Added noise is equal to .5. (B) Added noise is equal to 1.
Fig 2. False discovery rate and sensitivity of linear and non-linear methods, with continuous variables.
(A) Added noise is equal to .5. (B) Added noise is equal to 1.
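The two quantities plotted in these figures can be sketched as below, assuming the usual definitions: the false discovery rate is the share of selected features that are not truly relevant, and sensitivity is the share of truly relevant features that were selected. The function name and the example feature labels are hypothetical, and the figures themselves may have been produced differently.

```python
def fdr_and_sensitivity(selected, relevant):
    """Return (false discovery rate, sensitivity) for one feature selection."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)    # relevant and selected
    false_pos = len(selected - relevant)   # selected but irrelevant
    false_neg = len(relevant - selected)   # relevant but missed
    n_selected = true_pos + false_pos
    n_relevant = true_pos + false_neg
    fdr = false_pos / n_selected if n_selected else 0.0
    sensitivity = true_pos / n_relevant if n_relevant else 0.0
    return fdr, sensitivity

# Hypothetical run: five relevant features, four selected, three of them correct.
relevant = {"a", "b", "c", "d", "e"}
selected = {"a", "b", "c", "z"}
fdr, sensitivity = fdr_and_sensitivity(selected, relevant)
print(fdr, sensitivity)
```

Reporting both values separates two failure modes: a high false discovery rate means many spurious features were kept, while a low sensitivity means many relevant ones were dropped.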
Comparative performance of non-linear (HS, HF, DC) and linear (BLS C, FLS C, BLS R2, FLS R2, BLS BIC, FLS BIC) methods for variable selection, for the full set (Panel A) and two subsets (Panels B and C).
Shown are p-values for associations between health and variables from the Wisconsin Longitudinal Study (WLS)(2).
A. Full set (
| Method | Alcohol | education | tobacco | tobacco1 | tobacco2 | health |
|---|---|---|---|---|---|---|
| HS | < .01 | < .01 | < .01 | < .01 | < .01 | < .01 |
| HF | .13 | .04 | .06 | < .01 | < .01 | < .01 |
| DC | < .01 | < .01 | < .01 | < .01 | < .01 | < .01 |
| OLS | .02 | .39 | .35 | .44 | .83 | < .01 |
| BLS | .02 | 1 | 1 | .01 | 1 | < .01 |
| BLS | .03 | 1 | 1 | < .01 | 1 | < .01 |
| BLS | 1 | 1 | 1 | < .01 | 1 | < .01 |
| FLS | .03 | 1 | 1 | 1 | < .01 | < .01 |
| FLS | .03 | 1 | 1 | 1 | < .01 | < .01 |
| FLS | 1 | 1 | 1 | 1 | < .01 | < .01 |
| LASSO | .34 | .54 | .48 | .22 | .8 | .16 |
B. First subset (
| Method | Alcohol | education | tobacco | tobacco1 | tobacco2 | health |
|---|---|---|---|---|---|---|
| HS | < .01 | < .01 | < .01 | < .01 | < .01 | < .01 |
| HF | .13 | .04 | .06 | < .01 | < .01 | < .01 |
| DC | < .01 | < .01 | < .01 | < .01 | < .01 | < .01 |
| OLS | .02 | < .01 | .56 | .92 | .36 | < .01 |
| BLS | .02 | < .01 | 1 | 1 | < .01 | < .01 |
| BLS | .02 | < .01 | 1 | 1 | < .01 | < .01 |
| BLS | 1 | < .01 | 1 | 1 | < .01 | < .01 |
| FLS | .02 | < .011 | 1 | 1 | < .01 | < .01 |
| FLS | .02 | < .011 | 1 | 1 | < .01 | < .01 |
| FLS | 1 | < .01 | 1 | 1 | < .01 | < .01 |
| LASSO | .65 | .75 | .94 | .91 | .09 | < .01 |
C. Second subset
| Method | Alcohol | education | tobacco | tobacco1 | tobacco2 | |
|---|---|---|---|---|---|---|
| HS | < .01 | < .01 | < .01 | < .01 | < .01 | |
| HF | .13 | .04 | .06 | < .01 | < .01 | |
| DC | < .01 | < .01 | < .01 | < .01 | < .01 | |
| OLS | .057 | .06 | .10 | .58 | .74 | |
| BLS | .053 | .06 | .08 | .02 | 1 | |
| BLS | .053 | .06 | .08 | .02 | 1 | |
| BLS | .04 | 1 | 1 | 1 | 1 | |
| FLS | .06 | .07 | .12 | 1 | .03 | |
| FLS | .06 | .07 | .12 | 1 | .03 | |
| FLS | 1 | 1 | 1 | 1 | .03 | |
| LASSO | .62 | .58 | .53 | .39 | .64 | |
P-values for variables having a significant impact on health in the Health and Retirement Study.
| Method | alcohol | education | Tobacco | physical activity |
|---|---|---|---|---|
| HS | .04 | .09 | .14 | < .01 |
| HF | .42 | .04 | 1 | .36 |
| DC | .33 | .02 | .03 | .07 |
| OLS | .83 | .02 | .48 | .67 |
| BLS | 1 | < .01 | 1 | 1 |
| BLS | 1 | < .01 | 1 | 1 |
| BLS | 1 | 1 | 1 | 1 |
| FLS | 1 | < .01 | 1 | 1 |
| FLS | .26 | < .01 | 1 | 1 |
| FLS | 1 | 1 | 1 | 1 |
| LASSO | .53 | < .01 | .2 | .23 |