
Two Paradoxes in Linear Regression Analysis.

Ge Feng¹, Jing Peng², Dongke Tu³, Julia Z Zheng⁴, Changyong Feng²,⁵.

Abstract

Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection.


Keywords: Forward selection; backward elimination; multiple regression; univariate regression

Year: 2016 PMID: 28638214 PMCID: PMC5434296 DOI: 10.11919/j.issn.1002-0829.216084

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN: 1002-0829

1. Introduction

Linear regression is the most widely used statistical model in data analysis. The wide availability and ease of use of statistical software packages such as SAS, SPSS and R make linear regression accessible to people without any formal statistical training. Although wise use of statistical methods such as linear regression helps us, even novices, develop a better understanding of data and guides our decisions, it also causes confusion in the interpretation of results and leads to paradoxical findings. For example, we are often asked by our biomedical collaborators questions like “When I run the univariate regression of Y on the predictor, the p-value is very small. However, if I add some other predictors in the model, it is not significant anymore. Why?” The same problem also occurs in logistic regression for binary outcomes, log-linear regression for count data, and Cox proportional hazards regression for survival data. A simple answer to this question is that the univariate and multiple regression models make different assumptions. However, this answer is not very meaningful for non-statisticians. This is discussed in Section 2.

In many medical studies, regression analysis involves a large number of independent variables, or predictors. Model selection is required to find the predictors that are significantly associated with an outcome, or dependent variable, of interest. Here is how the model selection was done in a recent paper published in JAMA Surgery: “The administrative database was then evaluated by means of univariate and multivariate logistic regression. First we identified variables that were associated (P < .20) with readmission, the dependent variable. These potential confounders were then entered in multivariate stepwise (backward elimination) logistic regression, with readmission as the dependent variable. A logistic regression model was constructed to identify patient factors associated with readmission.”

This forward selection procedure, used as the first step to weed out “non-significant” predictors, has become almost the gold standard for variable selection and has been used in many papers published in top medical journals. The key idea of this method is first to run a univariate regression on each predictor. If the p-value is less than some pre-specified level, for example 0.1, then the predictor is used in the multiple regression. Otherwise, the predictor is assumed to have no significant effect on the outcome. This method seems quite logical and intuitively meaningful. Indeed, it has been used and is still being used by the biomedical and other research communities. Is this a valid procedure?

In this paper we use linear regression analysis to show two paradoxes in regression analysis. In Section 2 we use some very basic theory to show how the univariate regression and multiple regression make different assumptions on the models. We use examples and simulation studies to show two paradoxes in regression analysis in Section 3. Section 4 briefly discusses the transitivity of correlation. Our results clearly invalidate the model selection procedure widely used in biomedical research.
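The screening step described above can be sketched in a few lines. This is a minimal illustration of the procedure under discussion, not a recommended method. The function names `univariate_p_value` and `univariate_screen`, the toy variables `signal` and `noise`, and the use of a normal approximation in place of the exact t-test are all assumptions made for this sketch.

```python
import math
import numpy as np

def univariate_p_value(x, y):
    """Two-sided p-value for the slope of y on x.

    Uses a large-sample normal approximation to the usual t-test
    (an assumption of this sketch, chosen to avoid extra dependencies).
    """
    n = len(y)
    xc = x - x.mean()
    yc = y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    resid = yc - slope * xc
    se = math.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return math.erfc(abs(slope / se) / math.sqrt(2.0))

def univariate_screen(predictors, y, level=0.1):
    """Return the names of predictors whose univariate p-value is below `level`."""
    return [name for name, x in predictors.items()
            if univariate_p_value(x, y) < level]

# Hypothetical toy data: `signal` drives the outcome, `noise` does not.
rng = np.random.default_rng(4)
n = 500
signal = rng.standard_normal(n)
noise = rng.standard_normal(n)
outcome = 2.0 * signal + rng.standard_normal(n)

kept = univariate_screen({"signal": signal, "noise": noise}, outcome, level=0.1)
print("predictors passed on to the multiple regression:", kept)
```

Only the predictors in `kept` would enter the multiple regression. Each predictor is judged in isolation, so correlation among the predictors plays no part in the decision.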

2. Basic theory

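Let (Y, X1, ..., Xp) be a random vector, where X1, ..., Xp are called the covariates (independent variables) and Y is called the outcome (dependent variable). The regression of Y on (X1, ..., Xp) is the conditional expectation of Y given (X1, ..., Xp), denoted by E[Y|X1, ..., Xp], which is a measurable function of (X1, ..., Xp). Denote this function by g(X1, ..., Xp). Without knowing the joint distribution of (X1, ..., Xp, Y), the form of g(X1, ..., Xp) is in general unknown. In statistical analysis, we usually assume some mathematically tractable form of g(X1, ..., Xp). For example, linear regression analysis assumes that

$$E[Y \mid X_1, \ldots, X_p] = \beta_0 + \sum_{k=1}^{p} \beta_k X_k. \tag{1}$$

In logistic regression analysis with a 0-1 outcome, we assume that

$$E[Y \mid X_1, \ldots, X_p] = \frac{\exp\left(\beta_0 + \sum_{k=1}^{p} \beta_k X_k\right)}{1 + \exp\left(\beta_0 + \sum_{k=1}^{p} \beta_k X_k\right)}.$$

In this paper we assume the outcome Y is continuous. Let

$$\varepsilon = Y - E[Y \mid X_1, \ldots, X_p].$$

It is obvious that E[ε|X1, ..., Xp] = 0. We consider a stronger form of the linear regression model and assume that, given X1, ..., Xp, the variance of ε is a constant σ² that does not depend on (X1, ..., Xp). This assumption is also used in most of the statistical literature on linear models. We further assume that Xk, k = 1, ..., p, have finite second moments. From (1) we have

$$E[Y \mid X_1] = \beta_0 + \sum_{k=1}^{p} \beta_k E[X_k \mid X_1].$$

Let Zk = E[Xk|X1], k = 1, ..., p. (It is clear that Z1 = X1.) Then the regression of Y on X1 is

$$E[Y \mid X_1] = \beta_0 + \sum_{k=1}^{p} \beta_k Z_k, \tag{2}$$

which still has a linear form. Let η = Y − E[Y|X1]. Then

$$Y = \beta_0 + \sum_{k=1}^{p} \beta_k Z_k + \eta. \tag{3}$$

Although (3) has the same form as (1), they are fundamentally different in the error terms. Note that E[η|X1] = 0 and Cov(Zk, η) = 0, k = 1, ..., p. However, the conditional variance of η given X1 is

$$\mathrm{Var}(\eta \mid X_1) = \sigma^2 + \mathrm{Var}\left(\sum_{k=2}^{p} \beta_k X_k \,\Big|\, X_1\right).$$

Therefore, the conditional variance of η given X1 is in general no longer a constant. This violates the fundamental assumption used in the linear regression model.

The univariate linear regression of Y on X1 assumes the following form of the model:

$$Y = \gamma_0 + \gamma_1 X_1 + \zeta, \tag{4}$$

where E[ζ|X1] = 0 is assumed. From (3) we know that this assumption generally does not hold.

Suppose (Yi, Xi1, ..., Xip), i = 1, ..., n, is a random sample from (1). Let

$$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$

Let γ̂1 be the least squares estimate of the slope in the univariate regression of Yi on Xi1 in (4). Then

$$\hat{\gamma}_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2},$$

and as n → ∞

$$\hat{\gamma}_1 \to \frac{\mathrm{Cov}(X_1, Y)}{\mathrm{Var}(X_1)} = \beta_1 + \sum_{k=2}^{p} \beta_k \frac{\mathrm{Cov}(X_1, X_k)}{\mathrm{Var}(X_1)}. \tag{5}$$

Let β̂1 be the least squares estimator of β1 in (1). It is well known that E[β̂1] = β1 and β̂1 → β1. Hence the estimates from the univariate regression and the multiple regression usually converge to different limits. In the special case that X1 and the other covariates are uncorrelated, the limits are the same.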
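The different limits of the two estimators can be checked numerically. The sketch below assumes a hypothetical two-covariate model with X2 = ρ·X1 + W, where W is independent standard normal noise; the coefficient values, ρ, and the function name `univariate_vs_multiple` are illustrative choices, not taken from the paper. It compares the univariate slope with β1 + β2·Cov(X1, X2)/Var(X1), and the multiple regression slope with β1.

```python
import numpy as np

def univariate_vs_multiple(n=200_000, beta=(1.0, 2.0, 3.0), rho=0.5, seed=0):
    """Return (univariate slope, multiple-regression slope, predicted univariate limit) for X1."""
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + rng.standard_normal(n)   # Cov(X1, X2) = rho, Var(X1) = 1
    y = b0 + b1 * x1 + b2 * x2 + rng.standard_normal(n)

    # Univariate least squares slope: sample Cov(X1, Y) / sample Var(X1).
    c = np.cov(x1, y)
    gamma1_hat = c[0, 1] / c[0, 0]

    # Multiple regression of Y on (1, X1, X2).
    design = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta1_hat = coef[1]

    # Large-sample limit of the univariate slope: beta1 + beta2 * Cov(X1, X2) / Var(X1).
    limit = b1 + b2 * rho
    return gamma1_hat, beta1_hat, limit

gamma1_hat, beta1_hat, limit = univariate_vs_multiple()
print(f"univariate: {gamma1_hat:.3f}  multiple: {beta1_hat:.3f}  predicted univariate limit: {limit:.3f}")
```

Setting `rho=0` makes X1 and X2 uncorrelated, in which case the two slopes should agree up to sampling error.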

3. Two paradoxes in linear regression analysis

In this section we show why the estimates of the coefficient of a covariate in the univariate regression and in the multiple regression do not match. More specifically, we show that in some cases the estimate from the univariate regression is significant, but the result from the multiple regression is not. On the other hand, in some cases the result is significant in the multiple regression but not in the univariate regression. Suppose (1) is the true multiple regression model. The univariate regression uses model (4) by assuming that E[ζ|X1] = 0. This assumption is generally wrong unless E[Xk|X1] is a constant (k = 2, ..., p). Hence, with a correct multiple regression model, the estimate of the univariate analysis is based on a wrong model. This is the reason why the results from univariate regression and multiple regression do not match. Furthermore, result (5) shows that there is no clear interpretation of the estimate in the univariate analysis. We discuss two paradoxes related to univariate and multiple regressions through both theoretical derivations and simulation studies.

3.1 Significant covariate effect in multiple regression but not in univariate regression

Let X2, X3, X4 and ε be independent random variables with standard normal distributions. Consider the following model:

$$Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \varepsilon, \tag{6}$$

where αk ≠ 0, k = 0, 1, 2, 3, and

$$X_1 = \beta_1 X_2 + \beta_2 X_4,$$

where β1β2 ≠ 0. Then

$$\mathrm{Cov}(X_1, Y) = \alpha_1(\beta_1^2 + \beta_2^2) + \alpha_2 \beta_1,$$

which is 0 if and only if

$$\alpha_1 = -\frac{\alpha_2 \beta_1}{\beta_1^2 + \beta_2^2}. \tag{7}$$

From (5) we know that if (7) is true, the least squares estimator γ̂1 of the coefficient in the univariate regression of Y on X1 will not be significant, even though X1 is necessary in specifying model (6).

Example 1. Let α1 = -3/5, α2 = 3, α3 = 4, β1 = 1, β2 = 2 in (6). The true model is

$$Y = \alpha_0 - \tfrac{3}{5} X_1 + 3 X_2 + 4 X_3 + \varepsilon. \tag{8}$$

Table 1 shows the simulation results for the estimates and standard deviations of the coefficient of X1 in both the univariate and multiple regressions after 10,000 replications. For a wide range of sample sizes, the least squares estimator of the coefficient of X1 in the multiple regression is very close to the true value, and its standard deviation decreases markedly as the sample size grows. However, the estimate of the coefficient in the univariate analysis is very close to 0 in all cases.

Table 1.

Estimate of the regression coefficient of X1

n	Multiple regression		Univariate regression
n	Estimate	SD	Estimate	SD
30	-0.6010	0.0988	-0.0005	0.4225
50	-0.6003	0.0748	-0.0016	0.3194
100	-0.6003	0.0514	-0.0009	0.2226
200	-0.6002	0.0357	0.0002	0.1585
500	-0.6005	0.0226	-0.0005	0.0965
1,000	-0.6000	0.0160	-0.0002	0.0691
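The pattern in Table 1 can be reproduced qualitatively with one large simulated sample rather than 10,000 replications. This sketch assumes that X1 is constructed as X1 = β1·X2 + β2·X4 with β1 = 1 and β2 = 2, the construction under which the Example 1 coefficients give Cov(X1, Y) = 0. The intercept `alpha0 = 1.0` is a placeholder, since its value is not stated, and it does not affect the slopes. The function name `example1_slopes` is hypothetical.

```python
import numpy as np

def example1_slopes(n=200_000, alpha0=1.0, seed=1):
    """Return (univariate slope of Y on X1, coefficient of X1 in the multiple regression)."""
    rng = np.random.default_rng(seed)
    x2, x3, x4, eps = rng.standard_normal((4, n))
    x1 = 1.0 * x2 + 2.0 * x4                  # assumed construction, beta1 = 1, beta2 = 2
    y = alpha0 - 0.6 * x1 + 3.0 * x2 + 4.0 * x3 + eps

    # Univariate regression of Y on X1.
    c = np.cov(x1, y)
    uni = c[0, 1] / c[0, 0]

    # Multiple regression of Y on (1, X1, X2, X3).
    design = np.column_stack([np.ones(n), x1, x2, x3])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return uni, coef[1]

uni_x1, multi_x1 = example1_slopes()
print(f"univariate slope of X1: {uni_x1:.4f}")
print(f"multiple-regression coefficient of X1: {multi_x1:.4f}")
```

With these settings the univariate slope should sit near 0 and the multiple regression coefficient near -0.6, which is the pattern Table 1 reports.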

According to the practice in medical publications, X1 will not enter the multiple regression. Table 2 shows the least squares estimates of the coefficients of X2 and X3 after X1 is removed from (8). It is easy to see that the estimate of the coefficient of X2 is dramatically biased in the multiple regression once X1 has been removed on the basis of the univariate analysis.

Table 2.

Estimates of the regression coefficients of X2 and X3 with X1 being removed

n	Coefficient of X₂ (α₂ = 3)		Coefficient of X₃ (α₃ = 4)
n	Estimate	SD	Estimate	SD
30	2.4074	0.3030	4.0028	0.3047
50	2.3990	0.2281	4.0014	0.2302
100	2.4020	0.1611	3.9992	0.1581
200	2.3999	0.1111	4.0019	0.1126
500	2.4002	0.0703	4.0005	0.0705
1,000	2.4002	0.0498	3.9993	0.0492
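The value near 2.4 in Table 2 follows from a short calculation. The derivation below assumes the construction X1 = β1·X2 + β2·X4 with X2, X3, X4 and ε independent standard normal, and uses the Example 1 values α1 = -3/5, α2 = 3, α3 = 4, β1 = 1, β2 = 2.

```latex
% Substitute X_1 = \beta_1 X_2 + \beta_2 X_4 into the true model.
\begin{align*}
Y &= \alpha_0 + \alpha_1(\beta_1 X_2 + \beta_2 X_4) + \alpha_2 X_2 + \alpha_3 X_3 + \varepsilon \\
  &= \alpha_0 + (\alpha_2 + \alpha_1 \beta_1) X_2 + \alpha_3 X_3 + (\alpha_1 \beta_2 X_4 + \varepsilon).
\end{align*}
% The new error term \alpha_1 \beta_2 X_4 + \varepsilon is independent of (X_2, X_3),
% so the regression of Y on X_2 and X_3 alone estimates
\[
\alpha_2 + \alpha_1 \beta_1 = 3 - \tfrac{3}{5} = 2.4
\quad \text{for } X_2,
\qquad
\alpha_3 = 4 \quad \text{for } X_3,
\]
% with error variance
\[
\alpha_1^2 \beta_2^2 + 1 = \tfrac{9}{25} \cdot 4 + 1 = 2.44 .
\]
```

The coefficient of X2 absorbs the part of X1 that is explained by X2, while X3 is unaffected because it is independent of X1.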

3.2 Significant covariate effect in univariate regression but not in multiple regression

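Suppose X1, X2, X3 and ε are independent standard normal random variables, and X4 = β1X1 + β2X2, where β1β2 ≠ 0. Consider the following true model:

$$Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_3 + \varepsilon. \tag{9}$$

If (9) is expanded to include X4 and the expanded model still satisfies the conditions of the linear regression, then the regression equation becomes

$$E[Y \mid X_1, X_3, X_4] = \delta_0 + \delta_1 X_1 + \delta_2 X_3 + \delta_3 X_4. \tag{10}$$

From (9) and (10) we have

$$\alpha_0 + \alpha_1 X_1 + \alpha_2 X_3 = \delta_0 + \delta_1 X_1 + \delta_2 X_3 + \delta_3(\beta_1 X_1 + \beta_2 X_2),$$

or

$$(\delta_0 - \alpha_0) + (\delta_1 + \delta_3 \beta_1 - \alpha_1) X_1 + \delta_3 \beta_2 X_2 + (\delta_2 - \alpha_2) X_3 = 0.$$

Since β2 ≠ 0, we should have δ3 = 0, which means that X4 has no role in the multiple regression. Let γ̂ be the least squares estimate of the coefficient in the univariate linear regression of Y on X4. Then, as n → ∞,

$$\hat{\gamma} \to \frac{\mathrm{Cov}(X_4, Y)}{\mathrm{Var}(X_4)} = \frac{\alpha_1 \beta_1}{\beta_1^2 + \beta_2^2}.$$

Hence if α1 ≠ 0, when the sample size is large enough, the result from the univariate regression is significant but the result from the multiple regression, where δ3 = 0, is not.

Example 2. Let α0 = 0, α1 = 1, α2 = 2 in (9) and β1 = β2 = 1. Table 3 shows the least squares estimates of the coefficient of X4 in both the univariate and multiple linear regressions after 10,000 replications. For all sample sizes, the univariate regression shows that X4 has a very significant effect on Y. However, in the multiple regression, the effect is not significant.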

Table 3.

Estimate of the regression coefficient of X4

n	Univariate regression		Multiple regression
n	Estimate	SD	Estimate	SD
30	1.0024	0.4723	0.0038	0.2014
50	0.9975	0.3564	-0.0008	0.1496
100	0.9995	0.2469	-0.0015	0.1032
200	0.9982	0.1733	0.0005	0.0723
500	0.9999	0.1101	0.0005	0.0452
1,000	0.9995	0.0776	0.0004	0.0318
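The second paradox can be sketched in the same way. This is a minimal illustration, assuming the true model has the form Y = α0 + α1·X1 + α2·X3 + ε with X4 = β1·X1 + β2·X2, and that the multiple regression includes X1, X3 and X4. The default coefficient values and the function name `example2_slopes` are illustrative arguments, and the printed numbers are not claimed to reproduce Table 3.

```python
import numpy as np

def example2_slopes(n=200_000, alpha=(0.0, 1.0, 2.0), beta=(1.0, 1.0), seed=2):
    """Return (univariate slope of Y on X4, coefficient of X4 in the multiple
    regression on X1, X3, X4, large-sample limit of the univariate slope)."""
    rng = np.random.default_rng(seed)
    a0, a1, a2 = alpha
    b1, b2 = beta
    x1, x2, x3, eps = rng.standard_normal((4, n))
    x4 = b1 * x1 + b2 * x2
    y = a0 + a1 * x1 + a2 * x3 + eps          # assumed form of the true model

    # Univariate regression of Y on X4.
    c = np.cov(x4, y)
    uni = c[0, 1] / c[0, 0]

    # Multiple regression of Y on (1, X1, X3, X4).
    design = np.column_stack([np.ones(n), x1, x3, x4])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Cov(X4, Y) / Var(X4) under the assumptions above.
    limit = a1 * b1 / (b1 ** 2 + b2 ** 2)
    return uni, coef[3], limit

uni_x4, multi_x4, uni_limit = example2_slopes()
print(f"univariate slope of X4: {uni_x4:.4f} (limit {uni_limit:.4f})")
print(f"multiple-regression coefficient of X4: {multi_x4:.4f}")
```

The univariate slope stays away from 0 whenever α1·β1 ≠ 0, while the coefficient of X4 in the multiple regression shrinks toward 0 as n grows.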

4. Transitivity of correlation

Another issue around regression analysis is the transitivity of correlation in the interpretation of results. For example, some people may argue: “Since factor A is highly correlated with outcome Y, and factor A and factor B are highly correlated, then B should be correlated with Y.” It seems very intuitive and reasonable that correlation is transitive. Unfortunately, this is not true. Here is a theoretical example. Suppose X and Z are independent standard normal random variables and Y = X + Z. It is clear that the correlations between X and Y and between Y and Z are both 0.707. However, the correlation between X and Z is 0. In our Example 2, the correlations between X4 and X1 and between X1 and Y are 0.707 and 0.408, respectively. However, we showed in Section 3.2 that X4 has no role in the multiple regression if X1 and X3 are in the model, although X4 is not a linear combination of X1 and X3.
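The theoretical example with Y = X + Z is easy to verify numerically. The sketch below draws one large sample (the sample size and seed are arbitrary choices) and computes the three pairwise sample correlations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.standard_normal(n)
z = rng.standard_normal(n)
y = x + z

# Rows of the stacked array are treated as variables by np.corrcoef.
corr = np.corrcoef(np.vstack([x, y, z]))
r_xy, r_yz, r_xz = corr[0, 1], corr[1, 2], corr[0, 2]

print(f"corr(X, Y) = {r_xy:.3f}")
print(f"corr(Y, Z) = {r_yz:.3f}")
print(f"corr(X, Z) = {r_xz:.3f}")
```

The first two correlations should be close to 1/√2 ≈ 0.707 and the third close to 0, so strong correlation of X with Y and of Y with Z implies nothing about X and Z.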

5. Discussion

Regression analysis in medical research usually involves many predictors (independent variables). Model selection is needed to pick the covariates that have a significant effect on the outcome. A widely used method in medical publications is first to screen the covariates through univariate analysis. If a covariate is not significant in the univariate regression analysis, it will not enter the multiple regression analysis. The underlying assumption of this method is that a covariate is significant in the multiple regression only if it is significant in the univariate regression analysis. Our results indicate that this assumption is wrong. A covariate may be very significant in the univariate regression but have no role in the multiple regression (see Example 2 in Section 3). On the other hand, a covariate may be a necessary part of a multiple regression and yet not be correlated with the outcome (see Example 1 in Section 3). The initial univariate screening method totally ignores the correlation among covariates. There is no theoretical work to support this method. Our simulation results clearly show that the multiple regression results after the univariate screening may be dramatically biased and misleading. The biomedical community should stop using this procedure in its research and publications.

21 in total

1. Effector memory T cells, early metastasis, and survival in colorectal cancer.

Authors: Franck Pagès; Anne Berger; Matthieu Camus; Fatima Sanchez-Cabo; Anne Costes; Robert Molidor; Bernhard Mlecnik; Amos Kirilovsky; Malin Nilsson; Diane Damotte; Tchao Meatchi; Patrick Bruneval; Paul-Henri Cugnenc; Zlatko Trajanoski; Wolf-Herman Fridman; Jérôme Galon
Journal: N Engl J Med Date: 2005-12-22 Impact factor: 91.245

2. A multivariate analysis of dermatology missed appointment predictors.

Authors: Patrick R Cronin; Leah DeCoste; Alexa Boer Kimball
Journal: JAMA Dermatol Date: 2013-12 Impact factor: 10.282

3. A Clinical Prediction Model to Assess Risk for Chemotherapy-Related Hospitalization in Patients Initiating Palliative Chemotherapy.

Authors: Gabriel A Brooks; Ankit J Kansagra; Sowmya R Rao; James I Weitzman; Erica A Linden; Joseph O Jacobson
Journal: JAMA Oncol Date: 2015-07 Impact factor: 31.777

4. Antecedents of cerebral palsy. Multivariate analysis of risk.

Authors: K B Nelson; J H Ellenberg
Journal: N Engl J Med Date: 1986-07-10 Impact factor: 91.245

5. Evaluation of the Association Between Preoperative Clinical Factors and Long-term Weight Loss After Roux-en-Y Gastric Bypass.

Authors: G Craig Wood; Peter N Benotti; Clare J Lee; Tooraj Mirshahi; Christopher D Still; Glenn S Gerhard; Michelle R Lent
Journal: JAMA Surg Date: 2016-11-01 Impact factor: 14.766

6. Association of Admission Laboratory Values and the Timing of Endoscopic Retrograde Cholangiopancreatography With Clinical Outcomes in Acute Cholangitis.

Authors: Alexander C Schwed; Monica M Boggs; Xuan-Binh D Pham; Drew M Watanabe; Michael C Bermudez; Amy H Kaji; Dennis Y Kim; David S Plurad; Darin J Saltzman; Christian de Virgilio
Journal: JAMA Surg Date: 2016-11-01 Impact factor: 14.766

7. An international prognostic index for patients with chronic lymphocytic leukaemia (CLL-IPI): a meta-analysis of individual patient data.

Authors:
Journal: Lancet Oncol Date: 2016-05-13 Impact factor: 41.316

8. Association between sustained virological response and all-cause mortality among patients with chronic hepatitis C and advanced hepatic fibrosis.

Authors: Adriaan J van der Meer; Bart J Veldt; Jordan J Feld; Heiner Wedemeyer; Jean-François Dufour; Frank Lammert; Andres Duarte-Rojo; E Jenny Heathcote; Michael P Manns; Lorenz Kuske; Stefan Zeuzem; W Peter Hofmann; Robert J de Knegt; Bettina E Hansen; Harry L A Janssen
Journal: JAMA Date: 2012-12-26 Impact factor: 56.272

9. Socioeconomic inequalities in depression: a meta-analysis.

Authors: V Lorant; D Deliège; W Eaton; A Robert; P Philippot; M Ansseau
Journal: Am J Epidemiol Date: 2003-01-15 Impact factor: 4.897

10. Transcatheter or Surgical Aortic-Valve Replacement in Intermediate-Risk Patients.

Authors: Martin B Leon; Craig R Smith; Michael J Mack; Raj R Makkar; Lars G Svensson; Susheel K Kodali; Vinod H Thourani; E Murat Tuzcu; D Craig Miller; Howard C Herrmann; Darshan Doshi; David J Cohen; Augusto D Pichard; Samir Kapadia; Todd Dewey; Vasilis Babaliaros; Wilson Y Szeto; Mathew R Williams; Dean Kereiakes; Alan Zajarias; Kevin L Greason; Brian K Whisenant; Robert W Hodson; Jeffrey W Moses; Alfredo Trento; David L Brown; William F Fearon; Philippe Pibarot; Rebecca T Hahn; Wael A Jaber; William N Anderson; Maria C Alu; John G Webb
Journal: N Engl J Med Date: 2016-04-02 Impact factor: 91.245