Susana Perez-Alvarez, Guadalupe Gómez, Christian Brander.
Abstract
Large datasets with an extensive number of covariates are now generated in many different settings, for instance, in detailed genetic studies of outbred human populations or in complex analyses of immune responses to different infections. To inform clinical interventions or vaccine design, variable selection methods that identify the variables with the best predictive performance for a specific outcome are crucial. However, testing all potential subsets of variables is not feasible, and alternatives to existing methods are needed. Here, we describe a new method to handle such complex datasets, referred to as FARMS, that combines forward and all subsets regression for model selection. We apply FARMS to a host genetic and immunological dataset of over 800 HIV-infected individuals from Lima (Peru) and Durban (South Africa) who were tested for antiviral immune responses. This dataset includes more than 500 explanatory variables: around 400 variables with information on HIV immune reactivity and around 100 individual genetic characteristics. We have implemented FARMS in the R statistical language and show that it is fast and outperforms other commonly used comparable approaches, thus providing a new tool for the thorough analysis of complex datasets without the need for massive computational infrastructure.
Year: 2015 PMID: 26273608 PMCID: PMC4529908 DOI: 10.1155/2015/319797
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
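The abstract's point that testing all potential subsets is not feasible can be made concrete with a little arithmetic: p covariates give 2^p candidate models. The sketch below is only an illustration; the covariate counts (73 HLA alleles, 406 OLP variables) are the global totals reported in the data description table, and the throughput of one million model fits per second is a hypothetical figure chosen for the example, not a measurement from the study.

```python
# Hypothetical throughput, chosen only for illustration (not from the source).
FITS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600


def n_subsets(p):
    """Number of distinct subsets (candidate models) of p covariates."""
    return 2 ** p


def years_to_enumerate(p, fits_per_second=FITS_PER_SECOND):
    """Years needed to fit every subset at the assumed throughput."""
    return n_subsets(p) / fits_per_second / SECONDS_PER_YEAR


# Covariate counts: a small toy case, then the global totals from the table.
for label, p in [("toy example", 10), ("HLA", 73), ("OLP", 406)]:
    print(f"{label} (p={p}): {n_subsets(p):.3e} models, "
          f"{years_to_enumerate(p):.3e} years")
```

Restricting the exhaustive search to a small working set of variables, as FARMS does at each iteration, keeps the exponent small enough for the search to finish.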
Data description. Summary of cohort characteristics, pooled (Global) or split by city (Lima and Durban), showing the distribution of viral load and of its log10 transformation. Below, the number of HLA alleles and OLP present in the database and the frequency of the top 5 parameters for each category.
| | Viral load | | | Log10 (Viral load) | | |
|---|---|---|---|---|---|---|
| | Global | Lima | Durban | Global | Lima | Durban |
| Median | 37800 | 37240 | 37900 | 4.577 | 4.571 | 4.579 |
| IQR | (8715; 131500) | (13310; 109000) | (7075; 138500) | (3.94; 5.119) | (4.124; 5.038) | (3.85; 5.141) |

| | HLA | | | OLP | | |
|---|---|---|---|---|---|---|
| | Global | Lima | Durban | Global | Lima | Durban |
| Total number | 73 | 62 | 66 | 406 | 391 | 371 |
| 1st | C∗07 | A∗02 | B∗15 | 76 | 76 | 78 |
| 2nd | A∗02 | B∗35 | C∗07 | 78 | 84 | 76 |
| 3rd | B∗15 | C∗04 | A∗30 | 84 | 81 | 84 |
| 4th | A∗30 | C∗07 | C∗06 | 25 | 85 | 25 |
| 5th | C∗04 | B∗39 | A∗68 | 41 | 78 | 41 |
Figure 1. Illustration of the FARMS algorithm. On a dataset of 10 covariates, variable “10” is forced into every model. The starting model includes 3 variables (in addition to the forced-in one), and 2 more variables are added in each iteration.
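The iteration described in the Figure 1 caption can be sketched in code. This is one plausible reading of the caption, not the authors' R implementation: forward selection proposes a working set of variables, an all-subsets search picks the best model within that set, and the loop stops when the criterion no longer improves. The function names (`farms_sketch`, `forward_pick`, `best_subset`), the stopping rule, the use of BIC as the only criterion, and the simulated data are all assumptions made for illustration. The least-squares helper assumes the covariates are not collinear.

```python
import itertools
import math
import random


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    k = len(cols)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(cols[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [x - f * z for x, z in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b * c[t] for b, c in zip(beta, cols)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def bic(data, y, subset):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    n = len(y)
    k = len(subset) + 1  # covariates plus intercept
    return n * math.log(rss([data[v] for v in subset], y) / n) + k * math.log(n)


def forward_pick(data, y, current, candidates, how_many):
    """Greedily add up to how_many candidates, each the one lowering BIC most."""
    current = list(current)
    pool = [v for v in candidates if v not in current]
    for _ in range(min(how_many, len(pool))):
        best = min(pool, key=lambda v: bic(data, y, current + [v]))
        current.append(best)
        pool.remove(best)
    return current


def best_subset(data, y, forced, free):
    """All-subsets search over free, always keeping the forced variables."""
    options = (list(forced) + list(c)
               for r in range(len(free) + 1)
               for c in itertools.combinations(free, r))
    return min(options, key=lambda s: bic(data, y, s))


def farms_sketch(data, y, forced, p_start, p_add):
    """Alternate forward proposals and all-subsets searches until BIC stalls."""
    candidates = [v for v in data if v not in forced]
    model, score, size = list(forced), bic(data, y, forced), p_start
    while True:
        working = forward_pick(data, y, model, candidates, size)
        free = [v for v in working if v not in forced]
        proposal = best_subset(data, y, forced, free)
        new_score = bic(data, y, proposal)
        if new_score >= score:  # assumed stopping rule: no improvement
            return model
        model, score, size = proposal, new_score, p_add


# Simulated data mirroring Figure 1: 10 covariates, "x10" forced in.
rng = random.Random(0)
n = 60
data = {f"x{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(1, 11)}
y = [2 * data["x1"][t] - 3 * data["x2"][t] + data["x10"][t] + rng.gauss(0, 0.5)
     for t in range(n)]
selected = farms_sketch(data, y, forced=["x10"], p_start=3, p_add=2)
print(sorted(selected))
```

Because each exhaustive search covers only the current model plus a few newly proposed variables, the number of models fitted per iteration stays small even when the full dataset has hundreds of covariates.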
Figure 2. Illustration of FARMS input and output. (a) The FARMS function requires the data as an R data frame containing all the variables (response and explanatory). (b) Calling the FARMS function within R requires at least the response variable and the explanatory variables. In this illustration, which follows the example of Figure 1, we also specify the number of variables to add in each iteration, the number of variables in the starting model, the columns containing the forced-in variables, and the name of the output file. (c) By default, the FARMS function returns an R object containing information for each iteration (two iterations in this illustration) and the dataset as processed by the algorithm. (d) Optionally, the FARMS function can produce a text file containing the same information as the R object, with one line of text added per iteration, which also helps to monitor the execution of the algorithm and to track the evolution of the models until the final model is obtained.
Figure 3. Evolution of the computing time (in seconds) needed to reach the final model according to the FARMS parameters: starting number of covariates (P_Start) and number of covariates added in each step (P_Add). The first two rows of panels refer to the HLA-only model; rows 3 and 4 correspond to the OLP-only model.
Comparison of results when basing variable selection on the FARMS algorithm or on common strategies. The number of covariates included in the final model excludes the “forced in” covariates. The FARMS parameters used in this case were 8 covariates added per iteration and 10 starting covariates, with BIC as the selection criterion for both the best subset and the best model. (All runs were executed on an Intel Xeon X5680 machine with 6 CPU cores and 95 GB of RAM under SUSE Linux 11.0.)
| | | Time (seconds)* | Number of vars.** | BIC | AIC | R² | Adj. R² |
|---|---|---|---|---|---|---|---|
| HLA | FARMS | | | | | | |
| | All subsets | >1 month | — | — | — | — | — |
| | Forward selection¹ | 3.84 (3.13; 3.52) | 17 | 2259.4 | 2168.86 | 14.69% | 12.98% |
| | Forward stepwise¹ | 4.32 (3.51; 3.90) | 17 | 2259.4 | 2168.86 | 14.69% | 12.98% |
| | Forward selection² | 2.01 (1.62; 1.88) | 10 | 2235.45 | 2178.27 | 12.36% | 11.33% |
| | Forward stepwise² | 2.35 (1.89; 2.18) | 9 | 2235.44 | 2183.03 | 11.67% | 10.74% |
| | Forward selection³ | 1.99 (1.61; 1.88) | 10 | 2235.45 | 2178.27 | 12.36% | 11.33% |
| | Forward stepwise³ | 2.38 (1.92; 2.22) | 9 | 2235.44 | 2183.03 | 11.67% | 10.74% |
| | Backward stepwise³ | 13 s | 10 | 2236.52 | 2174.58 | 12.93% | 11.81% |
| OLP | FARMS | | | | | | |
| | All subsets | >1 month | — | — | — | — | — |
| | Forward selection¹ | 393.2 (324.40; 336.60) | 79 | 2396.53 | 2010.56 | 38.40% | 32.22% |
| | Forward stepwise¹ | 545.9 (451.70; 469.70) | 83 | 2415.46 | 2010.53 | 38.97% | 32.51% |
| | Forward selection² | 401.7 (329.50; 343.40) | 80 | 2403.3 | 2012.56 | 38.40% | 32.13% |
| | Forward stepwise² | 462.4 (382.50; 401.50) | 76 | 2385.18 | 2013.51 | 37.76% | 31.77% |
| | Forward selection³ | 38.09 (31.34; 33.11) | 12 | 2224.82 | 2158.11 | 14.77% | 13.57% |
| | Forward stepwise³ | 38.31 (31.34; 33.43) | 12 | 2224.82 | 2158.11 | 14.77% | 13.57% |
| | Backward stepwise³ | >12 hours | 23 | 2232.63 | 2108.63 | 21.68% | 19.45% |
¹ Selection by AIC and base model with intercept only.
² Selection by AIC and base model with forced-in variables.
³ Selection by BIC and base model with forced-in variables.
* Time obtained after 100 executions for each scenario.
** Including forced-in variables.
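For reference, the selection criteria named in the table and its footnotes have the following textbook definitions. The source does not state which exact variant or software routine produced the reported values, so these formulas are given only as the standard forms, with $\hat{L}$ the maximized likelihood, $k$ the number of estimated parameters, $n$ the number of observations, and $p$ the number of covariates in the model.

```latex
\begin{align}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n, \\
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}.
\end{align}
```

Since $\ln n > 2$ whenever $n \geq 8$, BIC penalizes each additional parameter more heavily than AIC, which is in line with the smaller OLP models obtained under BIC selection (12 variables) than under AIC selection (76 to 83 variables).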
Figure 4. Comparison of computational time between FARMS, forward selection, and forward stepwise regression algorithms for the two possible scenarios, HLA-only and OLP-only. (1: selection by AIC and base model with intercept only; 2: selection by AIC and base model with forced-in variables; 3: selection by BIC and base model with forced-in variables.)