A M Panken, M W Heymans.
Abstract
BACKGROUND: When prognostic models are developed after multiple imputation, variable selection is advised to be applied to the pooled model. The aim of this study is to evaluate, using a simulation study and a practical data example, the performance of four different pooling methods for variable selection in multiply imputed datasets. These methods are D1, D2, D3 and the recently extended Median-P-Rule (MPR) for categorical, dichotomous, and continuous variables in logistic regression models.
Keywords: Logistic regression; Median-p-rule; Multiple imputation; Pooling selection methods; Variable selection
Year: 2022 PMID: 35927610 PMCID: PMC9351113 DOI: 10.1186/s12874-022-01693-8
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
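The Median-P-Rule named in the abstract can be illustrated with a minimal sketch. It assumes that each imputed dataset already yields one p-value per variable (for a categorical variable, one overall p-value for its set of dummies, which is an assumption here) and that the pooled p-value is simply their median. The function names, the p-values, and the single backward-selection step are hypothetical and are not taken from the paper's software.

```python
from statistics import median


def median_p_rule(p_values_per_imputation):
    """Pool one variable's p-values across imputed datasets by taking their median."""
    if not p_values_per_imputation:
        raise ValueError("need at least one p-value")
    return median(p_values_per_imputation)


def backward_step(pooled_p, p_out):
    """Return the variable to drop in one backward-selection step, or None.

    The variable with the largest pooled p-value is dropped if that
    p-value exceeds the exclusion threshold p_out.
    """
    worst = max(pooled_p, key=pooled_p.get)
    return worst if pooled_p[worst] > p_out else None


# Hypothetical p-values for three variables in M = 5 imputed datasets.
p_by_variable = {
    "Cont4": [0.03, 0.06, 0.04, 0.08, 0.05],
    "Noise": [0.40, 0.55, 0.62, 0.35, 0.48],
    "Cat1": [0.01, 0.02, 0.015, 0.03, 0.02],
}

pooled = {name: median_p_rule(ps) for name, ps in p_by_variable.items()}
dropped = backward_step(pooled, p_out=0.05)
```

In a full backward selection, the model would be refitted in every imputed dataset after each drop and the step repeated until no pooled p-value exceeds P-out.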
Means and variances used for the simulated dataset
| Variable | Varname | Coefficients | Mean | Variance | Standard Deviation (SD) | Distribution |
|---|---|---|---|---|---|---|
| Cat1 | Xcat1 | 0.5 / 1.5 / 1.5a | 2 | 0.9 | 0.95 | Normal |
| Cat2 | Xcat2 | 1.5 / 1.5 / 1.5a | 3 | 1 | 1 | Normal |
| Dich1 | X1 | -0.5 | 0.8 | 0.2 | 0.45 | Normal |
| Dich2 | X2 | -1 | 0.7 | 0.3 | 0.55 | Normal |
| Cont1 | X3 | 0.5 | 7 | 2 | 1.41 | Normal |
| Noise | X4 | 0 | 7 | 2 | 1.41 | Normal |
| Cont2 | X5 | -0.1 | 40 | 90 | 9.5 | Normal |
| Cont3 | X6 | -0.1 | 26 | 15 | 3.9 | Normal |
| Cont4 | X7 | -0.1 | 34 | 23 | 4.8 | Normal |
Cat1 Categorical variable 1, Cat2 Categorical variable 2, Dich1 Dichotomous variable 1, Dich2 Dichotomous variable 2
Cont1 Continuous variable 1, Cont2 Continuous variable 2, Cont3 Continuous variable 3, Cont4 Continuous variable 4, Noise Noise variable
a Coefficients belonging to the dummy variables of the categorical variable
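As a rough illustration of how the tabulated means, variances, and coefficients could drive a simulation, the sketch below draws only the continuous predictors and generates a binary outcome from a logistic model. It makes several assumptions that the table does not state: the predictors are drawn independently (no correlation structure), the categorical and dichotomous variables are left out, and the intercept is a hypothetical value chosen so that the linear predictor is zero at the predictor means.

```python
import math
import random

# (mean, variance, coefficient) copied from the table; continuous predictors only.
PREDICTORS = {
    "Cont1": (7.0, 2.0, 0.5),
    "Noise": (7.0, 2.0, 0.0),
    "Cont2": (40.0, 90.0, -0.1),
    "Cont3": (26.0, 15.0, -0.1),
    "Cont4": (34.0, 23.0, -0.1),
}

# Hypothetical intercept (not in the table): makes the linear predictor 0 at the means.
INTERCEPT = 6.5


def simulate(n, seed=0):
    """Draw n observations with independent normal predictors and a logistic outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {
            name: rng.gauss(mean, math.sqrt(variance))
            for name, (mean, variance, _) in PREDICTORS.items()
        }
        linear_predictor = INTERCEPT + sum(
            PREDICTORS[name][2] * row[name] for name in PREDICTORS
        )
        probability = 1.0 / (1.0 + math.exp(-linear_predictor))
        row["y"] = 1 if rng.random() < probability else 0
        rows.append(row)
    return rows


data = simulate(2000)
```

The noise variable has a coefficient of 0, so it never contributes to the outcome and any selection of it is a false positive.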
Fig. 1 The selection frequency of the variables using different P-out values compared to the complete dataset
Selection frequency (%) of variables after backward selection in multiply imputed datasets using four different pooling methods and in the complete dataset
| Variable | D1 | D2 | D3 | MPR | Comp* | |
|---|---|---|---|---|---|---|
| Noise | 12.2 | 12.2 | 12.4 | 11.4# | 11.4 | |
| Cont4 | 63.8 | 64.6 | 64.6 | 67# | 79.6 | |
| Cat1 | 63.8 | 66.6 | 67.2 | 84.2# | 87 | |
| Cat2 | 76.6 | 83 | 84.8 | 92.8# | 95.4 | |
| Dich1 | 35.8 | 35.8 | 36.2 | 37.6# | 38.2 | |
| Noise | 11.2# | 11.4 | 11.2# | 11.8 | 10.2 | |
| Cont4 | 49.8 | 51.6 | 51.6 | 52.8# | 65.2 | |
| Cat1 | 51.8 | 51.8 | 52.4 | 73.6# | 74.4 | |
| Cat2 | 76.8 | 79.6 | 82 | 86.2# | 86.8 | |
| Dich1 | 32.4 | 33 | 32.8 | 34.6# | 34.6 | |
| Noise | 6# | 6.6 | 6.2 | 6.2 | 5.2 | |
| Cont4 | 5 | 50.6 | 50.6 | 52.8# | 67.2 | |
| Cat1 | 53.2 | 54.6 | 54.2 | 74.6# | 75.2 | |
| Cat2 | 65 | 68.8 | 73 | 88# | 86.2 | |
| Dich1 | 27.2 | 27.2 | 27.4# | 29 | 27.8 | |
| Noise | 6.4 | 6.8 | 6.2# | 6.2# | 4.8 | |
| Cont4 | 39.4 | 39.2 | 40.2 | 41.4# | 49.2 | |
| Cat1 | 38.2 | 38.8 | 38.2 | 61# | 53.2 | |
| Cat2 | 65 | 65.4 | 70 | 78.6# | 74.6 | |
| Dich1 | 22.8# | 23.6 | 23.2 | 24.6 | 22.2 | |
| Noise | 12 | 11.6# | 11.6# | 11.6# | 10 | |
| Cont4 | 94.6 | 94.8# | 94.8# | 94.8# | 99 | |
| Cat1 | 96.2 | 98.8 | 98.6 | 99.4# | 100 | |
| Cat2 | 99.2 | 100# | 100# | 100# | 100 | |
| Dich1 | 64.8 | 65# | 65# | 65# | 69.2 | |
| Noise | 10.4 | 10# | 10# | 10# | 10 | |
| Cont4 | 82# | 82# | 82.2# | 82# | 91.2 | |
| Cat1 | 89.6 | 93.6 | 93.2 | 97.8# | 98.8 | |
| Cat2 | 98.4 | 99.8 | 99.8 | 100# | 100 | |
| Dich1 | 57.4 | 58 | 58 | 58.4# | 61.2 | |
| Noise | 6.2 | 6 | 5.8# | 6.2 | 5.2 | |
| Cont4 | 92 | 91.8 | 91.8 | 92.2# | 97.8 | |
| Cat1 | 92.6 | 96.2 | 95.4 | 99# | 99.8 | |
| Cat2 | 97.2 | 99.8 | 99.8 | 100# | 100 | |
| Dich1 | 52 | 53 | 53 | 53.2# | 58.8 | |
| Noise | 6.4 | 6.8 | 6.2# | 6.2# | 4.8 | |
| Cont4 | 39.4 | 39.2 | 40.2 | 41.4# | 49.2 | |
| Cat1 | 38.2 | 38.8 | 38.2 | 61# | 53.2 | |
| Cat2 | 65 | 65.4 | 70# | 78.6# | 74.6 | |
| Dich1 | 22.8# | 23.6 | 23.2 | 24.6 | 22.2 |
N Number of observations, corr Correlation, P-out P-value for excluding a variable from the prognostic model, Noise Noise variable, Cont4 Continuous variable 4, Cat1 Categorical variable 1, Cat2 Categorical variable 2, Dich1 Dichotomous variable 1, D1 D1 method, D2 D2 method, D3 D3 method, MPR Median-P-Rule, Comp Analyses in the complete dataset (reference values for the pooling methods)
The selection frequencies of variables in the complete dataset act as the reference standard: * = reference values for comparing the pooling methods with the complete data; # = value that is closest to the reference value
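The # marker rule in the legend (the value closest to the complete-data reference) can be expressed as a small helper. This is a sketch: the function name and the tie tolerance are illustrative choices, and the two example rows are copied from the table.

```python
def closest_methods(values, reference, tol=1e-9):
    """Return the pooling methods whose value lies closest to the reference.

    Ties are all returned, which mirrors rows where several methods carry '#'.
    """
    distances = {method: abs(value - reference) for method, value in values.items()}
    best = min(distances.values())
    return [method for method, d in distances.items() if d - best <= tol]


# Cat1 row from the first block of the table (reference 87).
cat1 = {"D1": 63.8, "D2": 66.6, "D3": 67.2, "MPR": 84.2}
cat1_closest = closest_methods(cat1, 87.0)

# Noise row with tied values (reference 10).
noise = {"D1": 12.0, "D2": 11.6, "D3": 11.6, "MPR": 11.6}
noise_closest = closest_methods(noise, 10.0)
```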
P-values of the pooled variables after log-transformation and calculation of the median
| Variable | Pooling Method D1 | Pooling Method D2 | Pooling Method D3 | Pooling Method MPR | Complete Dataset* | |
|---|---|---|---|---|---|---|
| Noise | -1.309554 | -1.3096763# | -1.229329 | -1.9106885 | -1.32294506 | |
| Cont4 | -2.012191# | -1.8803549 | -1.643659 | -2.8330669 | -2.27920822 | |
| Cat1 | -1.930734 | -1.8166761 | -1.840571 | -2.6829073# | -2.36633468 | |
| Cat2 | -2.150953 | -1.9567753 | -2.043423 | -3.0906292# | -2.95750666 | |
| Dich1 | -1.755433# | -1.746318 | -1.696342 | -1.8969296 | -1.75720332 | |
| Noise | -1.439446 | -1.3603044# | -1.285094 | -1.7425542 | -1.338895 | |
| Cont4 | -1.83904# | -1.7223828 | -1.630539 | -2.5040535 | -1.95430019 | |
| Cat1 | -1.773127# | -1.6700756 | -1.705565 | -2.1858457 | -1.92590193 | |
| Cat2 | -2.182335 | -2.0396509 | -2.282796 | -2.6709277# | -2.54638103 | |
| Dich1 | -1.610444 | -1.5757474 | -1.597061 | -1.8110715# | -1.72596463 | |
| Noise | -1.873396# | -1.7555967 | -1.59476 | -2.5543024 | -1.90936438 | |
| Cont4 | -2.262126# | -2.0474303 | -1.870487 | -3.1615994 | -2.53263423 | |
| Cat1 | -2.143072 | -1.9870785 | -2.058752 | -2.88706# | -2.54509389 | |
| Cat2 | -2.434034 | -2.1368796 | -2.237235 | -3.1390634# | -3.0642774 | |
| Dich1 | -1.984843# | -1.9703728 | -1.923421 | -2.166719 | -2.002823 | |
| Noise | -1.683931# | -1.6379195 | -1.607315 | -2.3293738 | -1.86372761 | |
| Cont4 | -2.161964# | -2.0390992 | -1.900202 | -2.8912924 | -2.374376 | |
| Cat1 | -2.078314 | -1.920269 | -2.029932 | -2.4134016# | -2.29490206 | |
| Cat2 | -2.465616 | -2.2333715 | -2.489852 | -2.8188999# | -2.71046546 | |
| Dich1 | -1.95608 | -1.8382816 | -1.854454 | -2.0588553# | -2.09515079 | |
| Noise | -1.260527 | -1.3075# | -1.240598 | -1.639111 | -1.38193069 | |
| Cont4 | -2.936592 | -2.7997339 | -2.435983 | -4.1197571# | -4.13946275 | |
| Cat1 | -3.194703 | -3.6122543 | -3.598083 | -4.8961963# | -5.85087529 | |
| Cat2 | -3.713544 | -4.3224707 | -4.399027 | -5.60206# | -7.19565311 | |
| Dich1 | -1.951632# | -1.9122966 | -1.833822 | -2.0774095 | -1.93383386 | |
| Noise | -1.348488# | -1.4031749 | -1.401532 | -1.9028052 | -1.2418635 | |
| Cont4 | -2.418733 | -2.3622861 | -2.145894 | -3.3509581# | -2.95032364 | |
| Cat1 | -2.729985 | -2.8352247 | -2.758217 | -4.0065819# | -4.15967193 | |
| Cat2 | -4.064997 | -4.3178549 | -4.49222 | -5.1426675# | -5.75557227 | |
| Dich1 | -1.816346 | -1.7937226# | -1.764417 | -1.9024322 | -1.7891513 | |
| Noise | -1.443512 | -1.4313625 | -1.388748 | -1.8756295# | -1.66712132 | |
| Cont4 | -3.027566 | -2.859925 | -2.478627 | -4.2321024# | -4.1460952 | |
| Cat1 | -3.320076 | -3.6443612 | -3.662341 | -4.9232755# | -5.84393201 | |
| Cat2 | -3.729321 | -4.3001623 | -4.416825 | -5.60206# | -7.11509176 | |
| Dich1 | -2.16806 | -2.1131203# | -2.052601 | -2.3234185 | -2.13648909 | |
| Noise | -1.683931# | -1.6379195 | -1.607315 | -2.3293738 | -1.86372761 | |
| Cont4 | -2.161964# | -2.0390992 | -1.900202 | -2.8912924 | -2.374376 | |
| Cat1 | -2.078314 | -1.920269 | -2.029932 | -2.4134016# | -2.29490206 | |
| Cat2 | -2.465616 | -2.2333715 | -2.489852 | -2.8188999# | -2.71046546 | |
| Dich1 | -1.95608 | -1.8382816 | -1.854454 | -2.0588553# | -2.09515079 |
N Number of observations, Corr Correlation, P-out P-value for excluding a variable from the prognostic model, Noise Noise variable, Cont4 Continuous variable 4, Cat1 Categorical variable 1, Cat2 Categorical variable 2, Dich1 Dichotomous variable 1, D1 D1 method, D2 D2 method, D3 D3 method, MPR Median-P-Rule pooling method, Complete Dataset Analyses in the complete dataset (reference values for the pooling methods); * = reference values for comparing the pooling methods with the complete data; # = value that is closest to the reference value
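The summary in this table (log-transform the P-values, then take the median) can be sketched as follows. The base of the logarithm is an assumption: base 10 is used here because the tabulated value -5.60206 equals log10(2.5e-6). The replicate p-values are hypothetical.

```python
import math
from statistics import median


def median_log10_p(p_values):
    """Log10-transform each p-value, then return the median of the transformed values."""
    return median(math.log10(p) for p in p_values)


# Hypothetical p-values from five simulation replicates for one variable.
replicate_p = [0.012, 0.047, 0.051, 0.003, 0.20]
summary = median_log10_p(replicate_p)

# Back-transforming a tabulated value to the p-value scale (assuming base 10).
back_transformed = 10 ** -5.60206
```

More negative values correspond to smaller P-values, so a method whose entry lies close to the complete-data entry reproduces the strength of evidence seen without missing data.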
Fig. 2 Percentage agreement between the P-values of the selected variables by the different pooling methods and the complete dataset.
Comparing selected prognostic models to the developed models in the complete dataset
| First 10 unique models | D1(n) | D1(%) | D2(n) | D2(%) | D3(n) | D3(%) | MPR(n) | MPR(%) | Comp (n) |
|---|---|---|---|---|---|---|---|---|---|
| M1 (footnote a) | |||||||||
| M1a | 425 | 88.0 | 402 | 83.2 | 410 | 84.9 | 441 | 91.3# | 483 |
| M1b | 329 | 73.4 | 276 | 61.6 | 310 | 69.2 | 359 | 80.1# | 448 |
| M1c | 178 | 49.7 | 146 | 40.8 | 194 | 54.2 | 270 | 75.4# | 358 |
| M1d | 110 | 38.1 | 107 | 37.0 | 111 | 38.4 | 205 | 70.9# | 289 |
| M2 (footnote b) | |||||||||
| M2a | 361 | 80.8 | 336 | 75.2 | 358 | 80.1 | 391 | 87.5# | 447 |
| M2b | 253 | 67.3 | 231 | 61.4 | 250 | 66.5 | 280 | 74.5# | 376 |
| M2c | 105 | 43.9 | 107 | 44.8 | 108 | 45.2 | 169 | 70.7# | 239 |
| M2d | 93 | 52.8 | 94 | 53.4 | 93 | 52.8 | 109 | 61.9# | 176 |
| M3 (footnote c) | |||||||||
| M3a | 491 | 98.2 | 489 | 97.8 | 491 | 98.2 | 492 | 98.4# | 500 |
| M3b | 472 | 94.4 | 472 | 94.4 | 474 | 94.8 | 475 | 95.0# | 500 |
| M3c | 452 | 90.4 | 445 | 89.0 | 455 | 91.0 | 456 | 91.2# | 500 |
| M3d | 434 | 86.8 | 412 | 83.4 | 432 | 87.4 | 441 | 89.3# | 494 |
| M4 (footnote d) | |||||||||
| M4a | 481 | 96.2 | 481 | 96.2 | 481 | 96.2 | 483 | 96.6# | 500 |
| M4b | 439 | 88.5 | 465 | 93.8 | 469 | 94.6 | 470 | 94.8# | 496 |
| M4c | 401 | 83.5 | 380 | 79.2 | 401 | 83.5 | 416 | 86.7# | 480 |
| M4d | 93 | 52.8 | 94 | 53.4 | 77 | 43.8 | 112 | 63.6# | 176 |
a: M1 = model with n = 200, correlation degree 0.2; row suffix a = P-out ≤ 0.5; b = P-out ≤ 0.3; c = P-out ≤ 0.1; d = P-out ≤ 0.05
b: M2 = model with n = 200, correlation degree 0.6; row suffix a = P-out ≤ 0.5; b = P-out ≤ 0.3; c = P-out ≤ 0.1; d = P-out ≤ 0.05
c: M3 = model with n = 500, correlation degree 0.2; row suffix a = P-out ≤ 0.5; b = P-out ≤ 0.3; c = P-out ≤ 0.1; d = P-out ≤ 0.05
d: M4 = model with n = 500, correlation degree 0.6; row suffix a = P-out ≤ 0.5; b = P-out ≤ 0.3; c = P-out ≤ 0.1; d = P-out ≤ 0.05
n Number of observations, P-out P-value for excluding a variable from the model. For each pooling method (D1, D2, D3, MPR), the (n) column gives the number and the (%) column the percentage of developed prognostic models that are similar to the models in the complete dataset. Comp (n) Number of the first ten unique models selected in the backward selection (BWS) procedure; # = highest number of similar unique prognostic models compared to the models from the complete dataset
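The percentage columns appear to be the matching-model count divided by the complete-data count, Comp (n). The sketch below recomputes row M1a under that assumption; the function name is illustrative.

```python
def agreement_percentage(n_matching, n_reference):
    """Percentage of models matching the complete-data models, to one decimal."""
    if n_reference <= 0:
        raise ValueError("reference count must be positive")
    return round(100.0 * n_matching / n_reference, 1)


# Counts from row M1a; the complete-data count Comp (n) is 483.
m1a_counts = {"D1": 425, "D2": 402, "D3": 410, "MPR": 441}
m1a_percentages = {
    method: agreement_percentage(count, 483) for method, count in m1a_counts.items()
}

# The method that would carry the '#' marker in this row.
best_method = max(m1a_percentages, key=m1a_percentages.get)
```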
Fig. 3 Model selection frequencies of the first ten unique prognostic models from the four pooling methods, quantifying how likely these models were selected compared to the models from the complete dataset
Variables selected in the NHANES dataset by the four pooling methods
| M = 5, P-out < 0.05 | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Variables | D1 | D2 | D3 | MPR | D1 | D2 | D3 | MPR | D1 | D2 | D3 | MPR | D1 | D2 | D3 | MPR | D1 | D2 | D3 | MPR |
| Age | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |||
| BMI | X | X | X | X | X | X | X | X | X | X | ||||||||||
| Pulse | ||||||||||||||||||||
| BPSysAve | ||||||||||||||||||||
| BPDiaAve | ||||||||||||||||||||
| TotChol | ||||||||||||||||||||
| Gender (Dich) | ||||||||||||||||||||
| Diabetes (Dich) | ||||||||||||||||||||
| Race (Cat) | ||||||||||||||||||||
| Education (Cat) | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| Depressed (Cat) | ||||||||||||||||||||
| LittleInterest (Cat) | X | X | X | X | ||||||||||||||||
M Number of imputations, P-out P-value for excluding a variable from the prognostic model, N Number of observations, D1 D1-method, D2 D2-method, D3 D3-method, MPR Median P-Rule