
Rank regression: an alternative regression approach for data with outliers.

Tian Chen¹, Wan Tang¹, Ying Lu², Xin Tu¹.

Abstract

Linear regression models are widely used in mental health and related health services research. However, classic linear regression analysis assumes that the data are normally distributed, an assumption that is not met by the data obtained in many studies. One method of dealing with this problem is to use semi-parametric models, which do not require that the data be normally distributed. However, semi-parametric models are quite sensitive to outlying observations, so the generated estimates are unreliable when study data include outliers. In this situation, some researchers trim the extreme values prior to conducting the analysis, but the ad-hoc rules used for data trimming are based on subjective criteria, so different methods of adjustment can yield different results. Rank regression provides a more objective approach to dealing with non-normal data that include outliers. This paper uses simulated and real data to illustrate this useful regression approach for dealing with outliers and compares its results with those generated using classical regression models and semi-parametric regression models.


Keywords: linear regression; non-normal distribution; normal distribution; rank regression; semi-parametric regression models; sexual health

Year: 2014 PMID: 25903082 PMCID: PMC4248265 DOI: 10.11919/j.issn.1002-0829.214148

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN: 1002-0829

Introduction

Regression is widely used in mental health research and related services research to model relationships involving health and service utilization outcomes and clinical and socio-demographic factors. Regression models measure changes in the dependent variable in response to changes in a set of independent variables of interest. Linear regression focuses on continuous dependent variables, while other regression models such as logistic and log-linear regression consider noncontinuous dependent variables such as binary and count outcomes. The dependent variable is often called the response, while the independent variables are frequently referred to as the explanatory variables, predictors, or covariates. Linear regression is arguably the most popular regression model in practice, because of the ubiquity of continuous outcomes and because it is relatively easy to understand the modeled relationship and interpret the model estimates. Fitting such models is convenient because all major software packages (R, SAS, SPSS and Stata) provide both the model estimates and the diagnostics of the model fit. However, the wide popularity and routine use of linear regression also create some problems. Many researchers apply the model without first checking the assumption of normally distributed data that underlies the validity of the model estimates. The classic normal-based linear regression imposes strong constraints on data, and its estimates are also quite sensitive to departures from the assumed mathematical model. Without careful checking of the model assumptions, estimates generated by linear regression models may be difficult to interpret and conclusions drawn from such estimates may be misleading.

Different approaches to deal with non-normal study data in regression analyses

Classic linear regression assumes a normally distributed response, $y_i$, and models the mean of this response variable as a function of a set of independent variables, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, as follows:

$$y_i = x_i^T \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n \qquad (1)$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the vector of parameters, $n$ is the sample size, $\varepsilon_i$ denotes the error term, $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$, and $\varepsilon_i \sim N(0, \sigma^2)$ means that $\varepsilon_i$ follows a normal distribution with mean 0 and variance $\sigma^2$. The well-shaped bell curve of the normal distribution is often at odds with the distribution of data arising in real studies, because of its symmetric shape and extremely thin tails (exponential decay).

Over the years, various methods have been developed to overcome the limitations of the classic linear model. These methods can be grouped into 3 major categories. One approach is to use mathematical distributions that more closely resemble the data distribution in the study.[1] For example, by positing a t-distribution for the error $\varepsilon_i$, the resulting linear model can accommodate data distributions with thicker tails. This is possible because the t-distribution has an additional degrees-of-freedom parameter that controls the thickness of the tails. However, like the normal distribution, the t-distribution is also symmetric. To model skewed data distributions, a popular approach is to use the chi-square distribution. Although this parametric alternative broadens the scope of data distributions that can be accommodated, it is still quite limited because mathematical distributions always have more regular shapes than those arising in practice.

A second popular alternative is to use semi-parametric or distribution-free models.[2] Under this approach, no mathematical model is assumed for the data distribution (the non-parametric part) and the relationship between $y_i$ and $x_i$ is represented by the mean of $y_i$ after adjustment for $x_i$ (the parametric component).
The latter parametric component is implied by the specification of the classic linear regression in (1) and is given by:

$$E(y_i \mid x_i) = x_i^T \beta, \quad 1 \le i \le n \qquad (2)$$

where $E(y_i \mid x_i)$ denotes mathematical expectation. For those unfamiliar with mathematical expectation, the above expression simply means that the population-level average of the response $y_i$ is a linear function of $x_i$. This linear relationship is also implicit in the normal-based linear regression in (1). Thus, the semi-parametric linear model in (2) only requires a linear relationship between the response and the set of explanatory variables, thereby offering valid inference for a wide class of data distributions.

Although it significantly improves the utility of linear regression, the semi-parametric model still has limited applications. A major problem is that, like the classic model, it continues to model the mean of the response. Like the sample mean of a variable, model estimates from this approach can be quite biased when there are extremely large or small values, or outliers, in the response. Various approaches have been developed to address this important issue of outliers. A common approach in psychosocial research is to trim outliers using ad-hoc rules, for example, limiting the values of all observations to within 3 times the interquartile range when estimating the mean of an outcome (i.e., a 'trimmed' mean).[3] However, these ad-hoc methods induce artifacts because of their dependence on the specific rules used, and the use of different rules can result in different outcomes. Another approach to limiting the influence of outliers is to employ rank tests. The Mann-Whitney-Wilcoxon rank sum test is widely used to compare two groups in such situations.
Within the setting of regression analysis, rank regression is a popular approach for dealing with outliers.[4],[5] Like the Mann-Whitney-Wilcoxon rank sum test, rank regression does not use the observed responses $y_i$ directly but, rather, uses information about the ranking of these observations, thereby yielding estimates that are much less sensitive to outliers.
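To make the idea of using ranks rather than raw responses concrete, the following Python sketch covers only the special case of a single covariate. It assumes Wilcoxon scores, for which the rank-based fit is equivalent to minimizing the sum over all pairs of |(y_i - y_j) - b(x_i - x_j)|; in one dimension that minimum is attained at a weighted median of the pairwise slopes, with weights |x_i - x_j|. The function names are illustrative, and this is not the implementation used for the analyses in this paper.

```python
def wilcoxon_slope(x, y):
    """Rank-based (Wilcoxon) slope estimate for a single covariate.

    Assumption: Wilcoxon scores, so the estimate is a weighted median of
    the pairwise slopes (y_j - y_i) / (x_j - x_i) with weights |x_j - x_i|.
    """
    pairs = []
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[j] - x[i]
            if dx != 0:  # pairs with equal x carry no information on the slope
                pairs.append(((y[j] - y[i]) / dx, abs(dx)))
    pairs.sort()
    half = sum(w for _, w in pairs) / 2
    cumulative = 0.0
    for slope, weight in pairs:
        cumulative += weight
        if cumulative >= half:  # lower weighted median
            return slope


def least_squares_slope(x, y):
    """Ordinary least-squares slope, shown for comparison."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Toy data on the line y = 1 + x, with the last response replaced by an outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1e5]
print("least squares slope:", least_squares_slope(x, y))
print("rank-based slope:   ", wilcoxon_slope(x, y))
```

With a binary covariate, the same calculation reduces to the median of all pairwise differences between the two groups, which is the shift estimate usually reported alongside the Mann-Whitney-Wilcoxon test.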

Simulation studies to compare different approaches

The data were simulated from a study with one binary and one continuous covariate. To show differences across the different methods, we selected a large sample size (n=500) to reduce the effect of sampling variability on model estimates. We simulated the data and fitted the different models to the generated data in R. All simulations were performed with a Monte Carlo sample size M=1000 and a type I error α=0.05. We simulated $y_i$ from the following linear model:

$$y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2 = 1/2), \quad x_{i1} \sim N(0, 0.2), \quad x_{i2} \sim \text{Bernoulli}(0.5), \quad 1 \le i \le n$$

with $\beta_0 = \beta_1 = \beta_2 = 1$. To create a non-normally distributed error $\varepsilon_i$, we replaced the normal distribution with a t-distribution, $t(0, 1/2, 3)$, with mean 0, variance $\sigma^2 = 1/2$, and 3 degrees of freedom.

To create outlying observations, we first ordered the simulated values (either from the normal distribution or from the t-distribution) from the smallest to the largest, denoted by $y_{(1)} < y_{(2)} < \cdots < y_{(500)}$. We then ordered 50 values simulated from a uniform distribution, $u_{(1)} < u_{(2)} < \cdots < u_{(50)}$, and added these values to the 50 largest values of $y$, i.e., $y_{(451)} < y_{(452)} < \cdots < y_{(500)}$, to form a set of outlying observations: $z_{(451)} = y_{(451)} + u_{(1)}$, $z_{(452)} = y_{(452)} + u_{(2)}$, ..., $z_{(500)} = y_{(500)} + u_{(50)}$. To assess the robustness of the different methods, we replaced $y_{(451)} < y_{(452)} < \cdots < y_{(500)}$ in the original sample with the values $z_{(451)} < z_{(452)} < \cdots < z_{(500)}$, and fit the models to the resulting observations: $y_{(1)} < y_{(2)} < \cdots < y_{(450)} < z_{(451)} < z_{(452)} < \cdots < z_{(500)}$.

Table 1 shows the estimates of $\beta_1$ and $\beta_2$, the corresponding standard errors, and the type I error rates from fitting the three methods to data simulated from the normal-distributed error $N(0, 1/2)$, based on 1000 Monte Carlo simulations both with and without outliers. (The intercept $\beta_0$ is not estimated by the rank regression and so this estimate is missing in the table.)
In the table, values in the column titled 'mean' are the averaged estimates of each parameter over 1000 Monte Carlo replications; the 'asymptotic standard error' is the model-based standard error; the 'empirical standard error' is the standard deviation of the 1000 estimates of each parameter; and the 'type I error' is the proportion of replications in which the null hypothesis - that the estimated parameter is equal to the true parameter - is rejected. For example, the empirical type I error rate for $\beta_1$ in the data set without outliers is the proportion of replications that reject the null $H_0: \beta_1 = 1$.

If a model performs well, (a) the averaged value of the estimates of each parameter (in the 'mean' column) should be close to the true value of the respective parameter; (b) the magnitude of the asymptotic standard error should be close to that of the empirical standard error; and (c) the empirical type I error rate should be close to the nominal value 0.05. As shown in Table 1, in the absence of outliers, all three methods performed well, with the averaged estimates all nearly identical to the true value 1, the asymptotic standard errors all close to their empirical counterparts, and the type I error rates all close to the nominal level α=0.05. Further, all three methods yielded nearly identical standard errors, indicating that there is practically no loss of power from using the two robust alternatives instead of the classic linear model for the simulated normal data. However, results are very different in the presence of outliers. As shown in Table 1, both the classic and semi-parametric models yielded extremely large estimates that are un-interpretable, impossibly large standard errors, and type I errors close to 1.
In contrast, the rank regression model generated estimates of both $\beta_1$ and $\beta_2$ that were close to the true value 1, reasonable asymptotic and empirical standard errors that were equal to each other, and type I errors that, though elevated, were close to the nominal 0.05 level.

Table 2 shows the results of a similar simulation when the data were simulated from t-distributed errors instead of normal-distributed errors. In the absence of outliers, the mean estimates and type I errors of the two parameters were acceptable for all three models; however, the empirical standard error was much larger than the asymptotic standard error for the classic and semi-parametric models, while these two types of standard error were similar in magnitude for the rank regression model. In the presence of outliers, as was the case in the normal-error simulation, the estimates generated by the classic and semi-parametric models were un-interpretable while those generated by the rank regression model were acceptable. Thus, for data with t-distributed errors, the rank regression model performs better than the classic linear and semi-parametric models both in the absence and in the presence of outliers.
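The contamination step and the summary columns of the simulation tables can be sketched in Python as follows. This is an illustration under stated assumptions rather than the simulation code used for the paper: the range of the uniform distribution is not given in the text, so the demo uses an arbitrary placeholder range; the 'asymptotic' column is taken to be the average of the model-based standard errors; and the test of the null is assumed to be a two-sided Wald test with critical value 1.96. All function names are hypothetical.

```python
import math
import random


def contaminate(y, u):
    """Add the ordered values of u to the len(u) largest responses in y.

    The result keeps the original order of y, so each response stays
    paired with its covariates.
    """
    k = len(u)
    order = sorted(range(len(y)), key=lambda i: y[i])  # indices, smallest y first
    out = list(y)
    for idx, shift in zip(order[-k:], sorted(u)):
        out[idx] = y[idx] + shift
    return out


def summarize(estimates, std_errors, true_value, z_crit=1.96):
    """Monte Carlo summaries for one parameter over M replications.

    Assumption: the null (estimate equals the true value) is rejected
    when |estimate - true_value| / standard error exceeds z_crit.
    """
    m = len(estimates)
    mean_est = sum(estimates) / m
    empirical_se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (m - 1))
    rejections = sum(
        abs(e - true_value) / s > z_crit for e, s in zip(estimates, std_errors)
    )
    return {
        "mean": mean_est,
        "asymptotic_se": sum(std_errors) / m,
        "empirical_se": empirical_se,
        "type_I_error": rejections / m,
    }


# Demo: contaminate the 50 largest of 500 simulated errors.
random.seed(1)
y = [random.gauss(0.0, math.sqrt(0.5)) for _ in range(500)]
u = [random.uniform(1e3, 1e4) for _ in range(50)]  # placeholder range (assumption)
y_out = contaminate(y, u)
print("responses changed:", sum(a != b for a, b in zip(y, y_out)))
```

Applying `summarize` to the M estimates and standard errors returned by each fitting method, once for the clean data and once for the contaminated data, yields one row of a table laid out like Table 1.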

A real-life example

To illustrate the three approaches to dealing with outliers, we use data from a recent randomized controlled study[6] that evaluated the efficacy of a sexual risk-reduction intervention program targeting teenage girls in low-income urban settings who are at elevated risk for HIV, sexually transmitted infections, and unintended pregnancies. The study recruited sexually active urban adolescent girls aged 15 to 19 and randomized them to a sexual risk reduction intervention or to a structurally-equivalent health promotion control group. Assessments and behavioral data were collected at baseline and at 3, 6 and 12 months post-baseline. The primary interest of the study was to compare the frequency of unprotected vaginal sex between the two treatment conditions.

A difficult problem with the study data was the extremely large values reported by some subjects for their sexual activities. For example, five subjects reported over 100 episodes of unprotected vaginal sex over the past 3 months at the 6-month follow-up. If linear regression is applied directly to this outcome, estimates will be severely biased and become un-interpretable. Alternative models need to be considered when analyzing the data. The linear regression for the different methods is specified as follows:

$$y_i = \beta_0 + x_i \beta_1 + \varepsilon_i, \quad 1 \le i \le n \qquad (3)$$

where $y_i$ is the number of episodes of unprotected vaginal sex, $x_i$ is the binary indicator for the treatment condition (1 for the intervention and 0 for the control group), and $\varepsilon_i$ is the model error. The model error $\varepsilon_i$ follows the normal distribution for the classic linear regression, while its distribution is unspecified for the semi-parametric and rank regression methods. To highlight the differences between the models, we removed zero observations (i.e., individuals who reported no episodes of unprotected sex in the prior three months) and fit all three models (classic linear, semi-parametric, and rank regression) to the remaining data.
In addition, we recomputed the estimates for the classic linear model and the semi-parametric model after trimming the observed responses to decrease the influence of outliers. We trimmed the observed number of episodes of unprotected vaginal sex in the prior three months at 3 times the interquartile range; the 25th, 50th and 75th percentiles were 2, 4, and 10 episodes, respectively, so the interquartile range was 8 (10 - 2) and any observations below -20 (4 - 3 × 8) or above +28 (4 + 3 × 8) were considered outliers. There were no observations below -20, so no lower-level trimming was necessary, but all observations above 28 were trimmed to 28.

Table 3 shows the resulting estimates of $\beta_1$ for the treatment condition in the linear model (3) and the corresponding asymptotic standard errors and p-values using the different models. As was the case in the simulation study with outliers, the huge values of the estimates and standard errors from the classic linear and semi-parametric models clearly show that the estimates are profoundly affected by the outliers and, thus, are un-interpretable. In comparison, the classic and semi-parametric methods yielded more reasonable estimates when applied to the trimmed observations. However, results using the trimmed data were still quite different from those generated by the rank regression model; the estimates from the two models that used trimmed data were more than 50% larger in magnitude than the estimate from the rank regression method, and their standard errors were more than double that from the rank regression analysis. Results from the simulation study suggest that rank regression is quite robust against outliers and, unlike models that use trimmed data, is not vulnerable to change when different trimming criteria are employed.
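The arithmetic behind the trimming rule can be checked with a short Python sketch. The quartiles are those reported above; the function names and the example responses are made up for illustration and are not study data.

```python
def trimming_bounds(q1, median, q3, k=3):
    """Lower and upper limits at the median -/+ k times the interquartile range."""
    iqr = q3 - q1
    return median - k * iqr, median + k * iqr


def trim(values, lower, upper):
    """Pull any value outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


# Quartiles reported in the text: 2, 4 and 10 episodes.
lower, upper = trimming_bounds(2, 4, 10)
print(lower, upper)

# Hypothetical responses, for illustration only.
print(trim([1, 4, 28, 30, 120], lower, upper))
```

Changing `k`, or centring the limits on the quartiles instead of the median, gives different limits and therefore different trimmed data, which is the dependence on ad-hoc rules that rank regression avoids.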

Software for alternative linear regression models

Most major software such as R and SAS has the capability of fitting the semi-parametric linear regression model. In R, there are several packages available for fitting generalized estimating equations (GEE). Although GEE is an extension of the semi-parametric method to longitudinal data, we may still use these packages for fitting the semi-parametric model to cross-sectional data by introducing an 'ID' variable that has a unique value for each observation. For example, if the gee package is installed, then one may apply the following code to fit the semi-parametric linear regression model:

    library(gee)
    id <- seq_along(y)
    gee(y ~ x, id = id)

where y is the outcome and x is the covariate matrix. Similarly, SAS also offers procedures for fitting GEE, which can be used to provide estimates for semi-parametric linear regression models. For example, by adding an ID variable to the SAS data set, we may apply the procedure GENMOD to fit the semi-parametric model (the CLASS statement is needed because the SUBJECT= variable must be declared as a classification variable):

    PROC GENMOD DATA = data;
      CLASS id;
      MODEL y = x1 x2;
      REPEATED SUBJECT = id;
    RUN;

At the time of writing, SAS does not have the capability to fit the rank regression. For our simulated and real study examples, packages in R were used to fit this robust alternative model. To fit this regression model, first download the R functions from the website http://www.stat.wmich.edu/mckean/HMC/Rcode/AppendixB/ww.r and load them into the R session (for example, with source()). Then use the following command in R to obtain estimates from fitting the rank regression:

    wwest(x, y, bij = "WIL")

where y is the outcome and x is the covariate matrix.

Note that while SAS is a commercial software package, R is free to download, install, and run. In addition, software for newer statistical methods is generally first available in R. However, unlike SAS, R has no designated technical support, so users generally rely on peer support, web postings, and books to resolve issues concerning applications of specific packages and general data management problems.
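Treating every observation as its own cluster, as described above, means that for a linear model the GEE point estimates coincide with the least-squares estimates; what changes is the standard error, which is computed from a 'sandwich' formula that does not rely on normal errors. The Python sketch below illustrates this for a single covariate. It is a minimal illustration and not the code of the gee package or of PROC GENMOD: the function name is hypothetical and the sandwich variance is written without any small-sample correction, so packaged software may report slightly different values.

```python
import math


def slope_with_two_standard_errors(x, y):
    """Least-squares slope with model-based and sandwich standard errors.

    Assumptions: one covariate plus an intercept, independent observations,
    and no small-sample correction in the sandwich variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * yi for d, yi in zip(dx, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Model-based variance: a common error variance is assumed.
    model_var = sum(e * e for e in resid) / (n - 2) / sxx
    # Sandwich variance: each squared residual is weighted by its own leverage.
    robust_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    return slope, math.sqrt(model_var), math.sqrt(robust_var)


slope, model_se, robust_se = slope_with_two_standard_errors(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 6.0]
)
print(slope, model_se, robust_se)
```

This is consistent with the pattern in the tables, where the classic linear and semi-parametric rows share the same estimates and differ, if at all, only in their standard errors.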

Discussion

Classic linear regression has a number of weaknesses that limit its application to real study data. We discussed two robust alternatives, the semi-parametric model and the rank regression model. Although the former yields more valid estimates than the classic linear model, it breaks down when there are extremely large (or small) observations in the response (i.e., the dependent variable). In the presence of such outliers, the rank regression model provides much more robust estimates. Unlike ad-hoc methods such as trimming outliers based on 3 times the interquartile range, rank regression generates the same estimates regardless of the actual values of the response as long as the rankings of the observations remain the same. This formal approach not only removes any subjective element from the estimates, but also makes it easier to compare results of different analyses based on the same study data and to compare results between different studies. Further, the rank regression model is also capable of addressing outliers in the independent variables, although this tutorial only discussed outliers in the response variable. Currently, rank regression is only available in selected software packages such as R; we included sample R code for fitting this robust regression model in this report to facilitate its use by readers. As this approach becomes more popular, it is likely that other major software packages such as SAS will have similar offerings. Unlike the classic and semi-parametric linear regression models, rank regression is only available for fitting cross-sectional data. This is, in part, due to the complexity of computing estimates and asymptotic standard errors.
However, as longitudinal studies become the norm rather than the exception in modern clinical research, it will become increasingly important to develop software that can extend this robust model to longitudinal research data and, thus, help investigators more effectively deal with imperfections in real study data.

Table 1.

Estimates (mean), asymptotic and empirical standard errors, and empirical type I error rates from fitting the classic linear, semi-parametric, and rank regression models to data simulated from normal-distributed errors

Models	β₁ mean	β₁ asymptotic standard error	β₁ empirical standard error	β₁ type I error	β₂ mean	β₂ asymptotic standard error	β₂ empirical standard error	β₂ type I error
Absence of outliers
classic linear	1.00	0.16	0.16	0.06	1.00	0.06	0.06	0.04
semi-parametric	1.00	0.16	0.17	0.05	1.00	0.07	0.06	0.04
rank regression	1.00	0.16	0.16	0.07	1.00	0.06	0.06	0.04
Presence of outliers
classic linear	>10⁵	>10⁴	>10⁴	0.09	>10⁵	>10⁴	>10⁴	1
semi-parametric	>10⁵	>10⁴	>10⁴	0.09	>10⁵	>10⁴	>10⁴	1
rank regression	1.11	0.18	0.18	0.09	1.06	0.07	0.07	0.11

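Table 2.

Estimates (mean), asymptotic and empirical standard errors, and empirical type I error rates from fitting the classic linear, semi-parametric, and rank regression models to data simulated from t-distributed errors

Models	β₁ mean	β₁ asymptotic standard error	β₁ empirical standard error	β₁ type I error	β₂ mean	β₂ asymptotic standard error	β₂ empirical standard error	β₂ type I error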
Absence of outliers
classic linear	0.98	0.16	0.35	0.05	1.00	0.07	0.11	0.05
semi-parametric	0.98	0.16	0.35	0.05	1.00	0.06	0.11	0.05
rank regression	1.00	0.12	0.11	0.05	1.00	0.05	0.05	0.06
Presence of outliers
classic linear	>10⁴	>10⁴	>10⁴	0.25	>10⁴	>10⁴	>10⁴	0.80
semi-parametric	>10⁴	>10⁴	>10⁴	0.25	>10⁴	>10⁴	>10⁴	0.80
rank regression	1.05	0.30	0.29	0.06	0.99	0.31	0.30	0.07

Table 3.

Estimates, standard errors, and p-values from fitting the classic linear, semi-parametric, rank regression, classic linear with trimmed outliers, and semi-parametric with trimmed outliers models to the risk-reduction intervention study

Models	β₁ estimate	standard error	p-value
classic linear	-6707.0	6667.7	0.315
semi-parametric	-6707.0	6667.7	0.315
rank regression	-0.4286	0.4630	0.355
classic linear with trimmed outliers	-0.6738	0.9818	0.493
semi-parametric with trimmed outliers	-0.6738	0.9775	0.491

2 in total

1. Reducing sexual risk behavior in adolescent girls: results from a randomized controlled trial.

Authors: Dianne Morrison-Beedy; Sheryl H Jones; Yinglin Xia; Xin Tu; Hugh F Crean; Michael P Carey
Journal: J Adolesc Health Date: 2012-08-28 Impact factor: 5.012

2. Hypertension, blood pressure, and heart rate variability: the Atherosclerosis Risk in Communities (ARIC) study.

Authors: Emily B Schroeder; Duanping Liao; Lloyd E Chambless; Ronald J Prineas; Gregory W Evans; Gerardo Heiss
Journal: Hypertension Date: 2003-10-27 Impact factor: 10.190
