Olga Morozova, Olga Levina, Anneli Uusküla, Robert Heimer.
Abstract
BACKGROUND: Automatic stepwise subset selection methods in linear regression often perform poorly, both in variable selection and in the estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet stepwise algorithms remain the dominant method in medical and epidemiological research.
Year: 2015 PMID: 26319135 PMCID: PMC4553217 DOI: 10.1186/s12874-015-0066-2
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Characteristics of study participants; bivariate and full multivariate linear regressions<sup>a</sup> (N = 811)
| Independent variables | n (%)<sup>b</sup> | Bivariate OLS regression: Beta | Bivariate OLS regression: 95 % CI | Full multivariate OLS regression<sup>c</sup>: Beta | Full multivariate OLS regression<sup>c</sup>: 95 % CI |
|---|---|---|---|---|---|
| I. SOCIO-DEMOGRAPHIC CHARACTERISTICS | |||||
| Sex: | |||||
| Male | 631 (77.8) | Ref | - | Ref | - |
| Female | 180 (22.2) | 1.72 | (−0.62 ; 4.06) | −3.25 | (−5.61 ; −0.88) |
| Age (median = 32 y.o.): | |||||
| Less than 32 y.o. | 397 (49.0) | Ref | - | Ref | - |
| 32 y.o. or older | 414 (51.0) | −5.21 | (−7.27 ; −3.14) | −4.81 | (−6.69 ; −2.93) |
| Education: | |||||
| Primary or basic | 62 (7.6) | Ref | - | Ref | - |
| Secondary, vocational or at least some higher | 749 (92.4) | 7.37 | (2.89 ; 11.84) | 3.54 | (−0.61 ; 7.68) |
| Main source of income: | |||||
| Legal source | 677 (83.5) | Ref | - | Ref | - |
| Illegal source | 134 (16.5) | −3.27 | (−5.78 ; −0.76) | −0.35 | (−2.75 ; 2.05) |
| Level of income: | |||||
| Coping well | 245 (30.2) | Ref | - | Ref | - |
| Coping is difficult (or very difficult) | 566 (69.8) | −8.99 | (−11.18 ; −6.80) | −3.86 | (−6.00 ; −1.73) |
| Living arrangements: | |||||
| Someone else’s house | 274 (33.8) | Ref | - | Ref | - |
| Owned or rented place | 512 (63.1) | −3.24 | (−5.35 ; −1.13) | 1.73 | (−0.33 ; 3.79) |
| Shelter/no fixed place | 25 (3.1) | −5.00 | (−10.80 ; 0.81) | 0.97 | (−4.74 ; 6.69) |
| Marital status:<sup>d</sup> | |||||
| Not married | 554 (68.4) | Ref | - | Ref | - |
| Married | 256 (31.6) | −0.21 | (−2.55 ; 2.13) | −0.96 | (−3.40 ; 1.49) |
| II. ALCOHOL AND DRUG USE | |||||
| Alcohol abuse using CAGE scale: | |||||
| CAGE = 0–1 | 274 (33.9) | Ref | - | Ref | - |
| CAGE = 2–4 | 534 (66.1) | −10.10 | (−12.20 ; −8.00) | −1.76 | (−4.00 ; 0.47) |
| Age of first drug use (cannabis excluded; median = 16 y.o.): | |||||
| 17 y.o. or older | 280 (34.5) | Ref | - | Ref | - |
| 16 y.o. or younger | 531 (65.5) | −6.64 | (−8.70 ; −4.58) | −3.38 | (−5.47 ; −1.30) |
| Main drug of use: | |||||
| (Meth)-amphetamines | 27 (3.3) | Ref | - | Ref | - |
| Methadone/Fentanyl | 221 (27.3) | −6.92 | (−11.80 ; −2.04) | 1.43 | (−3.51 ; 6.37) |
| Heroin | 563 (69.4) | −14.74 | (−19.39 ; −10.08) | 0.87 | (−4.01 ; 5.74) |
| Poly-drug use in the last 4 weeks:<sup>e</sup> | |||||
| Injected 1 class of drugs | 697 (85.9) | Ref | - | Ref | - |
| Injected 2 or more classes of drugs | 114 (14.1) | −6.15 | (−9.08 ; −3.22) | −3.62 | (−6.38 ; −0.87) |
| Frequency of injecting drugs (days during the last 4 weeks; median = 20): | |||||
| 19 days or less | 337 (41.6) | Ref | - | Ref | - |
| 20 days or more | 474 (58.4) | −8.37 | (−10.49 ; −6.25) | −1.39 | (−3.88 ; 1.09) |
| Frequency of injecting drugs (times per day; median = 1): | |||||
| One | 437 (54.0) | Ref | - | Ref | - |
| Two or more | 372 (46.0) | −7.69 | (−9.79 ; −5.60) | −0.40 | (−2.61 ; 1.80) |
| Used non-sterile injecting equipment at least once in the last 4 weeks: | |||||
| No (or don’t know) | 344 (42.4) | Ref | - | Ref | - |
| Yes | 467 (57.6) | −10.54 | (−12.59 ; −8.50) | −2.05 | (−4.41 ; 0.30) |
| Ever used non-sterile injecting equipment: | |||||
| No | 79 (9.7) | Ref | - | Ref | - |
| Yes | 732 (90.3) | −10.94 | (−14.08 ; −7.81) | −0.55 | (−4.13 ; 3.04) |
| Getting sterile injecting equipment (any unused syringes in last 4 weeks): | |||||
| No | 46 (5.7) | Ref | - | Ref | - |
| Yes | 765 (94.3) | 9.12 | (4.77 ; 13.48) | −0.97 | (−5.14 ; 3.20) |
| Ever overdosed: | |||||
| No | 284 (35.0) | Ref | - | Ref | - |
| Yes | 527 (65.0) | −6.58 | (−8.75 ; −4.42) | 0.10 | (−2.12 ; 2.32) |
| III. MENTAL HEALTH | |||||
| Mental health problems score: | |||||
| Lower score on mental health problems | 427 (52.7) | Ref | - | Ref | - |
| Higher score on mental health problems | 384 (47.3) | −9.27 | (−11.28 ; −7.26) | −2.61 | (−4.85 ; −0.38) |
| IV. SEXUAL RISK | |||||
| Sexually active in the last 6 months: | |||||
| No | 188 (23.2) | Ref | - | Ref | - |
| Yes | 623 (76.8) | 1.75 | (−0.73 ; 4.23) | 0.91 | (−1.69 ; 3.52) |
| Involved in sexual work in the last 6 months: | |||||
| No | 757 (93.3) | Ref | - | Ref | - |
| Yes | 54 (6.7) | −0.16 | (−3.35 ; 3.03) | 2.80 | (−1.05 ; 6.66) |
| Paid for sex in the last 6 months: | |||||
| No | 748 (92.2) | Ref | - | Ref | - |
| Yes | 63 (7.8) | 7.80 | (4.65 ; 10.95) | 2.25 | (−1.11 ; 5.60) |
| Condom use during the last sexual intercourse: | |||||
| Yes | 378 (46.6) | Ref | - | N/A | - |
| No | 238 (29.3) | −0.79 | (−3.29 ; 1.72) | ||
| Don’t know | 195 (24.0) | −2.44 | (−5.11 ; 0.24) | ||
| HIV and Hepatitis C status of primary sexual partner: | |||||
| HIV and HCV negative or unknown | 155 (19.1) | Ref | - | Ref | - |
| Known to be HIV or HCV positive | 218 (26.9) | −11.09 | (−14.14 ; −8.04) | −1.98 | (−5.05 ; 1.09) |
| No primary partner in the last 6 months | 438 (54.0) | −8.23 | (−11.00 ; −5.47) | −2.02 | (−5.04 ; 1.00) |
| V. INFECTIOUS DISEASES HISTORY AND STATUS | |||||
| Ever been tested for HIV: | |||||
| No (or don’t know) | 52 (6.4) | Ref | - | Ref | - |
| Yes | 759 (93.6) | −1.95 | (−6.04 ; 2.13) | 3.63 | (−0.58 ; 7.84) |
| HIV status awareness: | |||||
| Result of the most recent HIV test is negative, unknown or never tested | 428 (52.8) | Ref | - | N/A | - |
| Result of the most recent HIV test is positive | 383 (47.2) | −9.71 | (−11.74 ; −7.68) | ||
| HIV status (based on study testing): | |||||
| Negative | 359 (44.3) | Ref | - | Ref | - |
| Positive | 452 (55.7) | −9.20 | (−11.26 ; −7.15) | −2.71 | (−6.55 ; 1.13) |
| Receiving regular HIV care: | |||||
| HIV-negative or unaware | 428 (52.8) | Ref | - | Ref | - |
| HIV+; receives regular HIV care | 125 (15.4) | −0.87 | (−3.41 ; 1.67) | 1.46 | (−3.05 ; 5.98) |
| HIV+; does not receive regular HIV care | 258 (31.8) | −13.77 | (−15.95 ; −11.59) | −4.32 | (−8.36 ; −0.28) |
| Tuberculosis history awareness: | |||||
| No (or don’t know) | 757 (93.3) | Ref | - | Ref | - |
| Yes | 54 (6.7) | −7.46 | (−11.15 ; −3.76) | −3.32 | (−6.43 ; −0.22) |
| Hepatitis C history awareness: | |||||
| No | 126 (15.5) | Ref | - | N/A | - |
| Yes | 685 (84.5) | −10.13 | (−13.02 ; −7.23) | ||
| Treatment of Hepatitis C: | |||||
| Never diagnosed with HCV | 126 (15.5) | Ref | - | Ref | - |
| HCV+, never been offered treatment | 591 (72.9) | −11.09 | (−14.00 ; −8.17) | −4.56 | (−7.62 ; −1.50) |
| HCV+, was offered treatment, but did not receive it | 50 (6.2) | −7.45 | (−12.58 ; −2.32) | −7.37 | (−12.75 ; −1.99) |
| HCV+, was offered treatment and received it | 44 (5.4) | −0.27 | (−4.53 ; 3.98) | −2.02 | (−6.80 ; 2.76) |
| Hepatitis B history awareness: | |||||
| No | 401 (49.4) | Ref | - | Ref | - |
| Yes | 410 (50.6) | −9.31 | (−11.29 ; −7.34) | −0.05 | (−2.36 ; 2.25) |
| Ever vaccinated against Hepatitis B: | |||||
| No (or don’t know) | 525 (64.7) | Ref | - | Ref | - |
| Yes (at least one dose) | 286 (35.3) | 7.29 | (5.07 ; 9.50) | 1.63 | (−0.74 ; 4.00) |
| VI. CONTACT WITH TREATMENT SERVICES AND PRISON | |||||
| History of incarceration: | |||||
| No | 537 (66.2) | Ref | - | Ref | - |
| Yes | 274 (33.8) | −2.41 | (−4.64 ; −0.18) | −1.49 | (−3.40 ; 0.42) |
| Having basic medical insurance: | |||||
| No | 156 (19.3) | Ref | - | Ref | - |
| Yes | 654 (80.7) | 2.39 | (−0.40 ; 5.17) | 1.46 | (−1.06 ; 3.98) |
| Receiving any healthcare services in the last 12 months: | |||||
| Received | 546 (67.3) | Ref | - | Ref | - |
| Not received | 265 (32.7) | −3.00 | (−5.32 ; −0.68) | −1.46 | (−3.56 ; 0.64) |
| Ever received drug abuse treatment: | |||||
| No | 229 (28.2) | Ref | - | N/A | - |
| Yes | 582 (71.8) | −5.79 | (−8.15 ; −3.43) | ||
| Receiving detoxification services in the last 6 months: | |||||
| Did not need detox services | 646 (79.7) | Ref | - | Ref | - |
| Needed, but did not receive detox | 99 (12.2) | −10.74 | (−14.12 ; −7.35) | −5.39 | (−9.00 ; −1.78) |
| Needed and received detox | 66 (8.1) | −4.00 | (−8.04 ; 0.04) | −4.82 | (−7.96 ; −1.67) |
| Ever had difficulties obtaining drug abuse treatment: | |||||
| Never received treatment (or don’t know) | 233 (28.7) | Ref | - | Ref | - |
| Had no difficulties | 482 (59.4) | −4.44 | (−6.89 ; −1.99) | 1.26 | (−0.93 ; 3.45) |
| Had difficulties | 96 (11.8) | −10.56 | (−13.92 ; −7.20) | −0.74 | (−4.14 ; 2.66) |
| Had difficulties obtaining medical care because of drug use: | |||||
| No (or don’t know) | 764 (94.2) | Ref | - | Ref | - |
| Yes | 47 (5.8) | −5.10 | (−9.48 ; −0.73) | −4.22 | (−8.21 ; −0.22) |
| VII. STIGMA, DISCLOSURE AND POLICE HARASSMENT | |||||
| Ever experienced police confiscate syringes: | |||||
| No | 599 (73.9) | Ref | - | Ref | - |
| Yes | 212 (26.1) | −6.00 | (−8.41 ; −3.59) | −0.48 | (−3.02 ; 2.06) |
| PWID status disclosure to family/friends:<sup>f</sup> | |||||
| Rather disclosed | 420 (51.8) | Ref | - | Ref | - |
| Rather did not disclose | 391 (48.2) | −7.06 | (−9.10 ; −5.01) | −0.65 | (−2.78 ; 1.48) |
| PWID status disclosure to a healthcare provider:<sup>g</sup> | |||||
| Rather disclosed | 278 (34.3) | Ref | - | Ref | - |
| Rather did not disclose | 533 (65.7) | 3.83 | (1.67 ; 5.99) | −0.16 | (−2.51 ; 2.19) |
| Internalized PWID stigma:<sup>h</sup> | |||||
| Low | 417 (51.4) | Ref | - | Ref | - |
| High | 394 (48.6) | −9.14 | (−11.15 ; −7.13) | −3.68 | (−5.92 ; −1.44) |
| PWID stigma consciousness:<sup>h</sup> | |||||
| Low | 343 (42.3) | Ref | - | Ref | - |
| High | 468 (57.7) | −1.68 | (−3.81 ; 0.45) | 1.52 | (−0.52 ; 3.56) |
95 % CI: 95 % confidence interval; HRQoL: health-related quality of life; OLS: ordinary least squares; PWID: people who inject drugs; Ref: reference category; VAS: visual analogue scale
<sup>a</sup> The dependent variable is the EuroQoL 5D VAS measure of HRQoL
<sup>b</sup> Numbers may not sum to the total because of missing values, and percentages may not sum to 100 because of rounding
<sup>c</sup> Adjusted R<sup>2</sup> = 0.37. Four variables (Condom use during the last sexual intercourse, HIV status awareness, Hepatitis C history awareness, Ever received drug abuse treatment) were excluded from the multivariate regression because of complete collinearity with other variables in the model
<sup>d</sup> Married = legally married or living as married; Not married = widowed, divorced or never married
<sup>e</sup> The following classes of drugs are included: opiates, amphetamines, and cocaine
<sup>f</sup> Based on five questions, each measured on a 5-point Likert scale. Individual item scores were summed and dichotomized at the median
<sup>g</sup> Based on one question measured on a 5-point Likert scale and dichotomized at the median
<sup>h</sup> Both the internalized stigma scale and the stigma consciousness scale are six-item questionnaires measured on a 5-point Likert scale. Individual item scores were summed and dichotomized at the median
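The scale construction in the footnotes and the bivariate column of the table can be illustrated with a minimal Python sketch. It sums Likert items, splits the totals at the median, and computes the bivariate OLS coefficient for the resulting 0/1 covariate, which equals the difference in group means. The data, the function names, the tie rule (totals at or above the median count as "high") and the normal-approximation interval are assumptions for illustration, not details taken from the paper.

```python
from math import sqrt
from statistics import mean, median


def dichotomize_by_median(item_scores):
    """Sum each respondent's Likert items, then split the totals at the median.

    Returns (totals, high_flags). A total at or above the median is coded 1
    (assumed tie rule); the group coded 0 plays the role of 'Ref'.
    """
    totals = [sum(items) for items in item_scores]
    cut = median(totals)
    return totals, [1 if t >= cut else 0 for t in totals]


def bivariate_ols_binary(y, x):
    """OLS of y on a single 0/1 covariate: Beta is the difference in group means."""
    y1 = [yi for yi, xi in zip(y, x) if xi == 1]
    y0 = [yi for yi, xi in zip(y, x) if xi == 0]
    m1, m0 = mean(y1), mean(y0)
    beta = m1 - m0
    # Residual variance with n - 2 degrees of freedom (intercept plus one slope).
    rss = sum((v - m1) ** 2 for v in y1) + sum((v - m0) ** 2 for v in y0)
    s2 = rss / (len(y) - 2)
    se = sqrt(s2 * (1 / len(y1) + 1 / len(y0)))
    z = 1.96  # normal approximation to the 95 % interval (assumption)
    return beta, (beta - z * se, beta + z * se)


# Hypothetical data: six respondents, six 5-point items each, and a VAS outcome.
items = [
    [1, 1, 2, 1, 1, 1],
    [2, 2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4, 4],
    [5, 5, 4, 5, 5, 5],
    [2, 1, 2, 2, 1, 2],
]
vas = [80, 75, 60, 50, 40, 70]

totals, high = dichotomize_by_median(items)
beta, ci = bivariate_ols_binary(vas, high)
print(high, beta, ci)
```

Dichotomizing keeps the table readable, with one coefficient per contrast against a reference category, at the cost of discarding information within each half of the scale.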
Fig. 1. Bootstrap frequency of covariate selection in the final model using stepwise algorithms. The dependent variable is the EuroQoL 5D visual analogue scale measure of health-related quality of life. Panels a, b and c show results of backward elimination regression using AIC, BIC and the Likelihood Ratio Test (LRT, p = 0.05), respectively. Panels d, e and f show results of forward selection regression using AIC, BIC and LRT (p = 0.05), respectively. Black bars represent variables selected in the final model, and light grey bars represent variables excluded from it. The solid line and the number next to it mark the minimum frequency among variables included in the final model; the dashed line and the number next to it mark the maximum frequency among variables excluded from the final subset. The dotted line marks a frequency of 0.9, and the number next to it shows the percentage of variables in the final model with inclusion frequency over 0.9 (out of the number of variables selected in the final model). Variable names are described in Additional file 2
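The quantities plotted in Fig. 1 can be sketched in a few lines of Python. The sketch resamples the data with replacement, reruns a selection procedure on each resample, and records how often each covariate is kept; a second helper computes the three numbers annotated on each panel. The selector used here (`toy_select`, a mean-gap rule) is a deliberately simple stand-in and is not a stepwise algorithm; the data, names and thresholds are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_inclusion_frequency(rows, select, n_boot=200, seed=0):
    """Share of bootstrap resamples in which `select` keeps each variable.

    rows   : list of observations in whatever form `select` expects
    select : callable(rows) -> set of selected variable names
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # same size, with replacement
        for name in select(sample):
            counts[name] = counts.get(name, 0) + 1
    return {name: c / n_boot for name, c in counts.items()}


def summarize(freq, all_vars, final_model, cutoff=0.9):
    """Minimum frequency among included variables, maximum among excluded ones,
    and the percentage of included variables with frequency above the cutoff."""
    included = [freq.get(v, 0.0) for v in all_vars if v in final_model]
    excluded = [freq.get(v, 0.0) for v in all_vars if v not in final_model]
    return {
        "min_included": min(included),
        "max_excluded": max(excluded),
        "pct_included_over_cutoff": 100 * sum(f > cutoff for f in included) / len(included),
    }


def toy_select(rows, variables=("x1", "x2"), min_gap=5.0):
    """Stand-in selector: keep a 0/1 covariate if the outcome means differ by more than min_gap."""
    kept = set()
    for v in variables:
        g1 = [r["y"] for r in rows if r[v] == 1]
        g0 = [r["y"] for r in rows if r[v] == 0]
        if g1 and g0 and abs(mean(g1) - mean(g0)) > min_gap:
            kept.add(v)
    return kept


# Hypothetical data: x1 shifts the outcome by 20 points, x2 has no effect.
rows = [{"y": 50 + 20 * (i % 2) + i % 5, "x1": i % 2, "x2": (i // 2) % 2} for i in range(40)]

final_model = toy_select(rows)  # selection on the original sample
freq = bootstrap_inclusion_frequency(rows, toy_select, n_boot=200, seed=1)
print(final_model, freq, summarize(freq, ["x1", "x2"], final_model))
```

Because `select` is passed in as a function, the same loop could wrap backward elimination, forward selection or a penalized fit without changes.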
Fig. 2. Bootstrap frequency of covariate selection in the final model using penalized regression. The dependent variable is the EuroQoL 5D visual analogue scale measure of health-related quality of life. Panel a shows results of the lasso at λ<sub>min</sub> and panel b at λ<sub>1se</sub>; panels c and d show the adaptive lasso at λ<sub>min</sub> (c) and λ<sub>1se</sub> (d); panels e and f show the adaptive elastic net at λ<sub>min</sub> (e) and λ<sub>1se</sub> (f). Black bars represent variables selected in the final model, and light grey bars represent variables excluded from it. The solid line and the number next to it mark the minimum frequency among variables included in the final model; the dashed line and the number next to it mark the maximum frequency among variables excluded from the final subset. The dotted line marks a frequency of 0.9, and the number next to it shows the percentage of variables in the final model with inclusion frequency over 0.9 (out of the number of variables selected in the final model). Variable names are described in Additional file 2
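The two tuning values named in Fig. 2 can be sketched as follows, assuming the usual cross-validation convention: λmin minimizes the mean cross-validated error, and λ1se is the largest λ whose mean error lies within one standard error of that minimum. The record does not state which software or exact rule the authors used, so the function name and the numbers below are hypothetical.

```python
def lambda_min_and_1se(lambdas, cv_mean, cv_se):
    """Pick lambda_min and lambda_1se from a cross-validation curve.

    lambdas : candidate penalty values
    cv_mean : mean cross-validated error for each lambda
    cv_se   : standard error of that mean for each lambda
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest penalty (sparsest model) whose error is still within one SE of the best.
    lam_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lam_1se


# Hypothetical cross-validation curve.
lambdas = [1.0, 0.5, 0.25, 0.1, 0.05]
cv_mean = [120.0, 104.0, 101.0, 100.0, 102.0]
cv_se = [5.0, 4.0, 4.0, 4.0, 4.0]

lam_min, lam_1se = lambda_min_and_1se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)  # 0.1 0.5
```

The larger penalty at λ1se yields a sparser model, which is consistent with the better selection stability the summary table reports for the lasso at λ1se.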
Fig. 4. Summary of the resulting linear regression models obtained with different subset selection methods. The dependent variable is the EuroQoL 5D visual analogue scale measure of health-related quality of life. 95 % CI: 95 % confidence/credible interval; Full MV: full multivariate regression; HRQoL: health-related quality of life. Variable names are described in Additional file 2
Fig. 3. Bayesian model averaging: posterior inclusion probabilities of independent variables in linear regression. The dependent variable is the EuroQoL 5D visual analogue scale measure of health-related quality of life. Panel a shows covariate posterior inclusion probabilities (PIP) based on aggregate information from the sampling chain, with the posterior model distribution based on MCMC frequencies. Panel b shows covariate PIPs based on the 100 best models from the sampling chain, with posterior model distributions based on exact marginal likelihoods. The dashed line marks the subset selection PIP threshold of 0.5 (median probability model). Variable names are described in Additional file 2
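How a PIP and the 0.5 threshold in Fig. 3 relate to the model posterior can be shown with a short Python sketch. A covariate's PIP is the total posterior weight of the models that contain it, and the median probability model keeps covariates whose PIP is at least 0.5. The weights may be MCMC visit counts or renormalized marginal-likelihood weights; the three models, their weights and the variable names below are hypothetical.

```python
def posterior_inclusion_probabilities(models):
    """models: list of (set_of_variables, weight) pairs.

    Weights are normalized to sum to 1, so raw MCMC visit counts or
    unnormalized posterior model probabilities both work.
    """
    total = sum(w for _, w in models)
    pip = {}
    for variables, w in models:
        for v in variables:
            pip[v] = pip.get(v, 0.0) + w / total
    return pip


def median_probability_model(pip, threshold=0.5):
    """Variables whose posterior inclusion probability reaches the threshold."""
    return sorted(v for v, p in pip.items() if p >= threshold)


# Hypothetical posterior over three candidate models.
models = [
    ({"age", "income"}, 0.5),
    ({"age"}, 0.3),
    ({"income", "sex"}, 0.2),
]

pip = posterior_inclusion_probabilities(models)
print(pip)
print(median_probability_model(pip))  # ['age', 'income']
```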
Summary of methods performance
| Method | Stability of model selection | Incorporating model uncertainty | Computational efficiency (running time)<sup>a</sup> |
|---|---|---|---|
| I. STEPWISE REGRESSION METHODS | | | |
| Backward elimination (AIC) | Moderate | Stepwise methods do not incorporate model uncertainty in the estimation of regression coefficients and standard errors. | Model selection: 5.4 s; estimation of SE with bootstrap<sup>b</sup>: 30.9 s |
| Backward elimination (BIC) | Very poor | As above | Model selection: 5.6 s; estimation of SE with bootstrap<sup>b</sup>: 15.0 s |
| Backward elimination (LRT) | Moderate | As above | Model selection: 5.1 s; estimation of SE with bootstrap<sup>b</sup>: 19.2 s |
| Forward selection (AIC) | Moderate | As above | Model selection: 2.8 s; estimation of SE with bootstrap<sup>b</sup>: 28.5 s |
| Forward selection (BIC) | Very poor | As above | Model selection: 1.9 s; estimation of SE with bootstrap<sup>b</sup>: 13.8 s |
| Forward selection (LRT) | Moderate | As above | Model selection: 3.1 s; estimation of SE with bootstrap<sup>b</sup>: 19.8 s |
| II. PENALIZED REGRESSION METHODS | | | |
| Lasso | Poor (λ<sub>min</sub>); good (λ<sub>1se</sub>) | Model uncertainty is partially incorporated into the estimation and inference procedure via the λ tuning step and the estimation of standard errors using bootstrap. | Lasso algorithm: 0.02 s; 10-fold CV: 0.5 s; estimation of SE with bootstrap<sup>b</sup>: 394.0 s |
| Adaptive lasso | Good (λ<sub>min</sub>); good (λ<sub>1se</sub>) | As above | Estimation of weights (ridge regression): 1.6 s; adaptive lasso algorithm: 0.02 s; 10-fold CV: 0.5 s; estimation of SE with bootstrap<sup>b</sup>: 411.2 s |
| Adaptive elastic net | Good (λ<sub>min</sub>); good (λ<sub>1se</sub>) | As above | Estimation of weights (ridge regression): 1.6 s; estimation of λ for L2 penalty (elastic net): 1.2 s; adaptive elastic net algorithm: 0.2 s; 10-fold CV: 1.4 s; estimation of SE with bootstrap<sup>b</sup>: 3,265.3 s |
| III. BAYESIAN MODEL AVERAGING | | | |
| Bayesian model averaging (using MCMC to search model space) | PIPs of regression covariates inform model selection. Bootstrap gave selection frequencies that were almost identical to PIPs (data not shown). | Model uncertainty is properly incorporated into the estimation of regression coefficients and their standard deviations (provided that the MCMC chain converged and the algorithm managed to search the entire model space). | 250.8 s (1,000,000 iterations, chain converged) |
AIC: Akaike Information Criterion; BIC: Bayesian Information Criterion; CV: cross-validation; LRT: Likelihood Ratio Test; MCMC: Markov Chain Monte Carlo; PIP: posterior inclusion probability; SE: standard error
<sup>a</sup> The analysis was run on a 1.7 GHz Intel(R) Core(TM) i5 processor with 4.00 GB of DDR3 memory
<sup>b</sup> All bootstrap estimates of standard errors used 2,000 iterations
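The bootstrap standard errors timed in the table can be sketched generically in Python: re-estimate the quantity of interest on each resample and take the standard deviation of the estimates. The record does not describe the authors' exact resampling scheme, so this is a plain nonparametric case bootstrap shown on a hypothetical simple-regression slope; the function names and data are assumptions.

```python
import random
from statistics import mean, stdev


def bootstrap_se(rows, estimator, n_boot=2000, seed=0):
    """Standard deviation of `estimator` across bootstrap resamples of `rows`."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # same size, with replacement
        estimates.append(estimator(sample))
    return stdev(estimates)


def ols_slope(rows):
    """Slope of a simple linear regression of y on x; rows are (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return sxy / sxx


# Hypothetical data: y is roughly 2 * x plus a small deterministic perturbation.
rows = [(float(i), 2.0 * i + (2 * i) % 5 - 2) for i in range(30)]

print(ols_slope(rows))
print(bootstrap_se(rows, ols_slope, n_boot=2000, seed=1))
```

With a selection method, the estimator would rerun selection and refitting on every resample, which is one plausible reason the penalized methods, with cross-validation repeated inside each of 2,000 resamples, show the longest bootstrap times in the table.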