Jung-Yi Joyce Lin, Liangyuan Hu, Chuyue Huang, Ji Jiayi, Steven Lawrence, Usha Govindarajulu.
Abstract
BACKGROUND: Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets.
Keywords: Missing at random; Multiply imputed datasets; Tree-based methods; Variable importance
Year: 2022 PMID: 35508974 PMCID: PMC9066834 DOI: 10.1186/s12874-022-01608-7
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
Simulation results for each variable selection approach performed on the fully observed data and among incomplete data. For bootstrap imputation based methods on incomplete data, we show results corresponding to different threshold values of π. The optimal value of π leading to the highest F1 score is π=0.1 for BI-BART and π=0.3 for BI-XGB. The sample size is n=1000. The number of useful predictors is 10 and the number of noise predictors is 40. Two missingness proportions were considered: 40% missingness in Y and 60% overall missingness; 20% missingness in Y and 40% overall missingness. The performance measures were computed across 250 data replications
| Method | AUC | Precision | Recall | F1 | Type I error |
|---|---|---|---|---|---|
| Fully observed data | | | | | |
| BART | 0.92 (0.88, 0.96) | 1.00 | 0.87 | 0.93 | 0.00 |
| XGBoost | 0.88 (0.84, 0.92) | 0.93 | 0.81 | 0.86 | 0.02 |
| Incomplete data: 40% missingness in Y, 60% overall | | | | | |
| RR-BART | 0.82 (0.78, 0.86) | 0.87 | 0.80 | 0.83 | 0.01 |
| BI-BART (π=0.1) | 0.83 (0.78, 0.88) | 0.87 | 0.82 | 0.85 | 0.03 |
| BI-BART (π=0.2) | 0.81 (0.76, 0.86) | 0.97 | 0.71 | 0.82 | 0.01 |
| BI-BART (π=0.3) | 0.77 (0.72, 0.82) | 0.99 | 0.63 | 0.77 | 0.00 |
| BI-BART (π=0.4) | 0.73 (0.68, 0.78) | 1.00 | 0.55 | 0.71 | 0.00 |
| BI-BART (π=0.5) | 0.67 (0.62, 0.72) | 1.00 | 0.48 | 0.65 | 0.00 |
| BI-BART (π=0.6) | 0.59 (0.54, 0.64) | 1.00 | 0.40 | 0.57 | 0.00 |
| BI-BART (π=0.7) | 0.49 (0.44, 0.54) | 1.00 | 0.31 | 0.47 | 0.00 |
| BI-BART (π=0.8) | 0.41 (0.36, 0.46) | 1.00 | 0.23 | 0.38 | 0.00 |
| BI-BART (π=0.9) | 0.33 (0.28, 0.38) | 1.00 | 0.14 | 0.25 | 0.00 |
| BI-BART (π=1.0) | 0.25 (0.20, 0.30) | 1.00 | 0.07 | 0.13 | 0.00 |
| BI-XGB (π=0.1) | 0.58 (0.53, 0.63) | 0.40 | 0.88 | 0.55 | 0.36 |
| BI-XGB (π=0.2) | 0.74 (0.69, 0.79) | 0.66 | 0.85 | 0.75 | 0.15 |
| BI-XGB (π=0.3) | 0.82 (0.77, 0.87) | 0.83 | 0.83 | 0.83 | 0.03 |
| BI-XGB (π=0.4) | 0.76 (0.71, 0.81) | 0.96 | 0.72 | 0.82 | 0.01 |
| BI-XGB (π=0.5) | 0.70 (0.65, 0.75) | 0.99 | 0.63 | 0.77 | 0.00 |
| BI-XGB (π=0.6) | 0.64 (0.59, 0.69) | 1.00 | 0.54 | 0.70 | 0.00 |
| BI-XGB (π=0.7) | 0.59 (0.54, 0.64) | 1.00 | 0.41 | 0.58 | 0.00 |
| BI-XGB (π=0.8) | 0.47 (0.42, 0.52) | 1.00 | 0.29 | 0.44 | 0.00 |
| BI-XGB (π=0.9) | 0.35 (0.30, 0.40) | 1.00 | 0.17 | 0.29 | 0.00 |
| BI-XGB (π=1.0) | 0.28 (0.23, 0.33) | 1.00 | 0.04 | 0.09 | 0.00 |
| MIA-BART (Impute missing Y) | 0.75 (0.71, 0.79) | 0.80 | 0.75 | 0.77 | 0.04 |
| MIA-BART (Exclude missing Y) | 0.72 (0.66, 0.78) | 0.78 | 0.70 | 0.74 | 0.05 |
| MIA-XGB (Impute missing Y) | 0.74 (0.70, 0.78) | 0.81 | 0.73 | 0.77 | 0.04 |
| MIA-XGB (Exclude missing Y) | 0.71 (0.65, 0.77) | 0.75 | 0.71 | 0.73 | 0.08 |
| BART Complete cases | 0.70 (0.63, 0.77) | 0.90 | 0.60 | 0.72 | 0.03 |
| XGBoost Complete cases | 0.73 (0.66, 0.80) | 0.90 | 0.68 | 0.77 | 0.04 |
| Incomplete data: 20% missingness in Y, 40% overall | | | | | |
| RR-BART | 0.86 (0.82, 0.90) | 0.91 | 0.84 | 0.87 | 0.02 |
| BI-BART (π=0.1) | 0.87 (0.83, 0.91) | 0.91 | 0.87 | 0.89 | 0.01 |
| BI-BART (π=0.2) | 0.85 (0.80, 0.90) | 0.99 | 0.76 | 0.85 | 0.02 |
| BI-BART (π=0.3) | 0.82 (0.77, 0.87) | 1.00 | 0.68 | 0.82 | 0.01 |
| BI-BART (π=0.4) | 0.77 (0.72, 0.82) | 1.00 | 0.59 | 0.76 | 0.01 |
| BI-BART (π=0.5) | 0.72 (0.67, 0.77) | 1.00 | 0.54 | 0.70 | 0.00 |
| BI-BART (π=0.6) | 0.63 (0.58, 0.68) | 1.00 | 0.46 | 0.61 | 0.01 |
| BI-BART (π=0.7) | 0.53 (0.48, 0.58) | 1.00 | 0.36 | 0.51 | 0.00 |
| BI-BART (π=0.8) | 0.44 (0.39, 0.49) | 1.00 | 0.28 | 0.42 | 0.00 |
| BI-BART (π=0.9) | 0.38 (0.33, 0.43) | 1.00 | 0.18 | 0.30 | 0.00 |
| BI-BART (π=1.0) | 0.29 (0.24, 0.34) | 1.00 | 0.12 | 0.18 | 0.00 |
| BI-XGB (π=0.1) | 0.60 (0.55, 0.65) | 0.44 | 0.90 | 0.58 | 0.30 |
| BI-XGB (π=0.2) | 0.76 (0.71, 0.81) | 0.69 | 0.87 | 0.77 | 0.11 |
| BI-XGB (π=0.3) | 0.84 (0.79, 0.89) | 0.86 | 0.85 | 0.85 | 0.02 |
| BI-XGB (π=0.4) | 0.78 (0.73, 0.83) | 0.99 | 0.75 | 0.84 | 0.01 |
| BI-XGB (π=0.5) | 0.73 (0.68, 0.78) | 1.00 | 0.65 | 0.79 | 0.00 |
| BI-XGB (π=0.6) | 0.67 (0.62, 0.72) | 1.00 | 0.57 | 0.73 | 0.00 |
| BI-XGB (π=0.7) | 0.62 (0.57, 0.67) | 1.00 | 0.44 | 0.61 | 0.00 |
| BI-XGB (π=0.8) | 0.49 (0.44, 0.54) | 1.00 | 0.32 | 0.47 | 0.00 |
| BI-XGB (π=0.9) | 0.38 (0.32, 0.42) | 1.00 | 0.20 | 0.33 | 0.00 |
| BI-XGB (π=1.0) | 0.31 (0.25, 0.35) | 1.00 | 0.08 | 0.14 | 0.00 |
| MIA-BART (Impute missing Y) | 0.78 (0.74, 0.82) | 0.83 | 0.77 | 0.79 | 0.03 |
| MIA-BART (Exclude missing Y) | 0.76 (0.70, 0.82) | 0.81 | 0.74 | 0.78 | 0.04 |
| MIA-XGB (Impute missing Y) | 0.77 (0.73, 0.82) | 0.83 | 0.76 | 0.79 | 0.03 |
| MIA-XGB (Exclude missing Y) | 0.73 (0.67, 0.79) | 0.78 | 0.73 | 0.75 | 0.06 |
| BART Complete cases | 0.73 (0.67, 0.79) | 0.92 | 0.64 | 0.75 | 0.03 |
| XGBoost Complete cases | 0.76 (0.70, 0.82) | 0.93 | 0.71 | 0.80 | 0.03 |
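
The selection metrics reported in the table above can be illustrated with a minimal Python sketch. It rests on two assumptions that this excerpt does not state explicitly: a predictor is kept when it is selected in at least a fraction π of the bootstrap-imputed datasets, and the type I error is the fraction of noise predictors that end up selected. The function names and the example counts are hypothetical.

```python
import math


def select_by_threshold(selection_counts, n_datasets, pi):
    """Keep predictor j if it was selected in at least a fraction pi of the
    bootstrap-imputed datasets (assumed selection rule)."""
    return {j for j, c in enumerate(selection_counts) if c / n_datasets >= pi}


def selection_metrics(selected, useful, n_predictors):
    """Precision, recall, F1 and type I error of a selected set, given the
    indices of the truly useful predictors."""
    selected, useful = set(selected), set(useful)
    noise = set(range(n_predictors)) - useful
    tp = len(selected & useful)   # useful predictors that were selected
    fp = len(selected & noise)    # noise predictors that were selected
    precision = tp / len(selected) if selected else math.nan
    recall = tp / len(useful) if useful else math.nan
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    # Undefined (NaN) when there are no noise predictors at all.
    type1_error = fp / len(noise) if noise else math.nan
    return {"precision": precision, "recall": recall,
            "f1": f1, "type1_error": type1_error}


# Hypothetical counts: 10 useful predictors (indices 0-9), 40 noise predictors,
# 100 bootstrap-imputed datasets.
counts = [100] * 8 + [20, 5] + [12] * 2 + [0] * 38
for pi in (0.1, 0.3):
    chosen = select_by_threshold(counts, n_datasets=100, pi=pi)
    print(pi, selection_metrics(chosen, useful=range(10), n_predictors=50))
```

Under this rule, raising π can only shrink the selected set, which is consistent with the pattern in the table: precision rises and recall falls as the threshold increases.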
Fig. 1 The mean cross-validated AUC, averaged across 250 data replications, for each of three methods: RR-BART, BI-BART and BI-XGB. The mean AUC for bootstrap imputation based methods BI-BART and BI-XGB varies by the threshold value of π. missForest was used for imputation. The sample size n=1000. The proportion of missingness is 40% in the outcome Y and is 60% overall.
Fig. 2 Power of each of three methods, RR-BART, BI-BART and BI-XGB, for selecting each of 10 useful predictors across 250 data replications. missForest was used for imputation. The sample size n=1000. The proportion of missingness is 40% in the outcome Y and is 60% overall.
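
As a rough illustration of the per-predictor power shown in Fig. 2, the sketch below assumes that power means the proportion of data replications in which a given useful predictor was selected. That definition is an assumption, as are the function name and the toy replications.

```python
def selection_power(selected_sets, predictors):
    """For each predictor, the fraction of replications in which it was
    selected (assumed definition of power)."""
    n_rep = len(selected_sets)
    return {j: sum(j in s for s in selected_sets) / n_rep for j in predictors}


# Hypothetical example: selected sets from 4 replications, 3 useful predictors.
replications = [{0, 1}, {0}, {0, 1, 2}, {0, 2}]
power = selection_power(replications, predictors=range(3))
print(power)
```

In the simulation described in the caption, `selected_sets` would hold one selected set per replication (250 in total) and `predictors` would index the 10 useful predictors.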
Simulation results for the setting in which there are 10 useful predictors and no noise variables. For bootstrap imputation methods on incomplete data, we show results corresponding to the best threshold values of π based on F1. The sample size n=1000. Two missingness proportions were considered: 40% missingness in Y and 60% overall missingness; 20% missingness in Y and 40% overall missingness. The performance measures were computed across 250 data replications
| Method | AUC | Precision | Recall | F1 | Type I error |
|---|---|---|---|---|---|
| Fully observed data | | | | | |
| BART | 0.74 (0.68, 0.80) | 1.00 | 0.62 | 0.70 | NA |
| XGBoost | 0.75 (0.69, 0.81) | 1.00 | 0.61 | 0.69 | NA |
| Incomplete data: 40% missingness in Y, 60% overall | | | | | |
| RR-BART | 0.73 (0.67, 0.79) | 1.00 | 0.36 | 0.48 | NA |
| RR-BART (all selected) | 0.97 (0.95, 0.99) | 1.00 | 1.00 | 1.00 | NA |
| BI-BART | 0.73 (0.67, 0.79) | 1.00 | 0.38 | 0.50 | NA |
| BI-XGB | 0.79 (0.73, 0.85) | 1.00 | 0.54 | 0.64 | NA |
| BART Complete cases | 0.50 (0.43, 0.57) | 1.00 | 0.15 | 0.35 | NA |
| XGBoost Complete cases | 0.55 (0.48, 0.62) | 1.00 | 0.18 | 0.38 | NA |
| MIA-BART (Impute missing Y) | 0.66 (0.60, 0.72) | 1.00 | 0.31 | 0.42 | NA |
| MIA-BART (Exclude missing Y) | 0.63 (0.56, 0.70) | 1.00 | 0.25 | 0.40 | NA |
| MIA-XGB (Impute missing Y) | 0.73 (0.67, 0.79) | 1.00 | 0.50 | 0.59 | NA |
| MIA-XGB (Exclude missing Y) | 0.70 (0.62, 0.77) | 1.00 | 0.46 | 0.55 | NA |
| Incomplete data: 20% missingness in Y, 40% overall | | | | | |
| RR-BART | 0.77 (0.72, 0.82) | 1.00 | 0.51 | 0.67 | NA |
| RR-BART (all selected) | 0.98 (0.96, 0.99) | 1.00 | 1.00 | 1.00 | NA |
| BI-BART | 0.75 (0.70, 0.80) | 1.00 | 0.50 | 0.69 | NA |
| BI-XGB | 0.80 (0.75, 0.85) | 1.00 | 0.52 | 0.70 | NA |
| MIA-BART (Impute missing Y) | 0.70 (0.65, 0.75) | 1.00 | 0.46 | 0.60 | NA |
| MIA-BART (Exclude missing Y) | 0.67 (0.61, 0.73) | 1.00 | 0.43 | 0.57 | NA |
| MIA-XGB (Impute missing Y) | 0.70 (0.65, 0.75) | 1.00 | 0.46 | 0.64 | NA |
| MIA-XGB (Exclude missing Y) | 0.67 (0.61, 0.73) | 1.00 | 0.42 | 0.61 | NA |
| BART Complete cases | 0.54 (0.49, 0.59) | 1.00 | 0.16 | 0.39 | NA |
| XGBoost Complete cases | 0.56 (0.51, 0.61) | 1.00 | 0.20 | 0.40 | NA |
Variable selection results by each of 3 methods, with the best imputation method and threshold value of π for BI-BART and BI-XGB suggested in simulations. BART was used with missForest and π=0.1, XGB with missForest and π=0.3. Definitions of the variable names appear in Web Table 1. RR-BART and BI-BART both selected 17 variables, and BI-XGB selected 16 variables
| Variables | RR-BART | BI-BART | BI-XGB |
|---|---|---|---|
| TRIGRES | Yes | Yes | Yes |
| SYSBP | Yes | Yes | Yes |
| LPA | Yes | Yes | Yes |
| DIABP | Yes | Yes | Yes |
| WAIST | Yes | Yes | Yes |
| INSULIN | Yes | No | Yes |
| LUCRES | Yes | No | No |
| APOARES | Yes | Yes | Yes |
| EDUCATION | Yes | No | Yes |
| BP | Yes | Yes | No |
| WHRATIO | Yes | Yes | Yes |
| BMI | Yes | Yes | Yes |
| RACE | Yes | Yes | No |
| TPA | Yes | Yes | Yes |
| DTTLIN | Yes | Yes | No |
| SHBG | Yes | Yes | Yes |
| RESTLES | Yes | Yes | No |
| GLUCOSE | No | Yes | Yes |
| T | No | Yes | No |
| E2AVE | No | No | Yes |
| PAI1 | No | No | Yes |
Fig. 3 Validation plot of predicted probabilities of 3-year incidence of metabolic syndrome among middle-aged women using the SWAN data. The risk prediction models were the BART and XGBoost models with predictors selected via RR-BART, BI-BART and BI-XGB. missForest was used for imputation.