
On the linear in probability model for binary data.

Abstract

The analysis of binary response data commonly uses models linear in the logistic transform of probabilities. This paper considers some of the advantages and disadvantages of simple least-squares estimates based on a linear representation of the probabilities themselves, this in particular sometimes allowing a more direct empirical interpretation of underlying parameters. A sociological study is used in illustration.


Keywords: interpretation of parameters; logistic model; missing values; model sensitivity

Year: 2019 PMID: 31218050 PMCID: PMC6549984 DOI: 10.1098/rsos.190067

Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963

Introduction

The interpretation of data in the form of binary outcomes arises in many areas of science, from the primary physical and biological sciences and their application through to more directly applied areas and the social sciences. Two distinct themes in the analysis of binary data go back at least to the beginning of the twentieth century, with the contrast between Karl Pearson who, in his biserial correlation coefficient, treated a pair of possibly related binary variables as derived from an unobserved bivariate normally distributed variable, and Yule who worked directly with observed proportions of outcomes. When the hypothesized latent variables have a tangible interpretation, as in quantal bioassays, the former approach is preferable, but in the present paper we consider only situations in which observed proportions of outcomes are represented directly and relations concerning them interpreted. Suppose that for n independent individuals, we observe a realization of a binary outcome variable $Y_i$ $(1 \le i \le n)$ taking values 1 or −1, and that for individual i there is a $p \times 1$ vector $x_i$ of explanatory variables. A widely used representation is the linear logistic form in which $\log\{\mathrm{pr}(Y_i = 1)/\mathrm{pr}(Y_i = -1)\}$ is assumed to depend linearly on $x_i$. This leads to a simple interpretation of regression coefficients as ratios of effects when the binary responses are concentrated at one of the two levels, but otherwise the interpretation is less direct. For a discussion from a sociological perspective of the difficulties of interpreting logistic coefficients, see [1] and, for a wide-ranging review, see [2]. The linear in probability model to be considered in the present paper specifies the probabilities as linear functions of the explanatory variables, that is

$$\mathrm{pr}(Y_i = y) = \tfrac{1}{2}(1 + y\,\beta^{T} x_i) \qquad (1.1)$$

for $y = -1, 1$ and with $x_i$ typically including a constant term, so that $E(Y_i) = \beta^{T} x_i$. There are implicit restrictions on the parameter space, namely that for all data $x_i$, $|\beta^{T} x_i| \le 1$.
If both the linear in probability and linear logistic models give adequate fit, the former has the advantage that the linear regression coefficients have a clearer operational interpretation in terms of numbers of individuals potentially influenced by a unit change of an explanatory variable. Emphasis sometimes lies on testing the significance of individual effects and comparison of their relative magnitudes. For this, the exponential family form of the linear logistic model [3,4] brings substantial simplification and other advantages. Furthermore, the logistic dependence has the potential to apply over a wide range of future conditions excluded by the positivity constraints on the linear form. The discussion highlights a context in which maximum-likelihood estimation is very sensitive to aberrant observations, whereas ordinary least squares is insensitive yet typically achieves high efficiency. A limiting case which sharply illustrates these distinctions concerns the comparison of data $(Y_1, Y_2)$ formed from counts of events from two Poisson processes of rates, say, $\rho_1$ and $\rho_1\psi$ or $\rho_1$ and $\rho_1 + \theta$ for the multiplicative and additive representations, respectively. That is, $Y_2$ represents either a multiplication of the baseline rate by a constant or the addition of a separate signal. The former model falls within the exponential family of distributions and leads to an analysis based on a $2 \times 2$ contingency table. The second calls for a different analysis based on large-sample maximum-likelihood theory. For a further discussion concerning a similar model for Poisson variables, see [5].

Inferential aspects

Second-moment theory

We now consider properties of the linear in probability model based only on first and second moments. First, we define the least-squares estimate of β by projecting the vector $Y = (Y_1, \ldots, Y_n)^{T}$ orthogonally onto the space spanned by the columns of x, thus giving

$$\hat\beta = (x^{T}x)^{-1}x^{T}Y.$$

In the present context, x is the $n \times p$ matrix whose ith row is $x_i^{T}$. The estimate is unbiased but does not have second-moment optimality unless β = 0, because the components of Y in general do not have equal variance. Nor is the covariance matrix of the estimates given by the standard formulae unless β is small. In fact

$$\mathrm{cov}(\hat\beta) = (x^{T}x)^{-1}x^{T}(I - \Delta)x(x^{T}x)^{-1}, \qquad (2.1)$$

where $\Delta = \mathrm{diag}\{(\beta^{T}x_i)^2\}$, because $\mathrm{var}(Y_i) = 1 - (\beta^{T}x_i)^2$. One simple and often satisfactory estimate of the covariance matrix of $\hat\beta$ is to replace Δ by $\hat\Delta$, in which β is replaced by $\hat\beta$. A more elaborate second moment approach is to replace $\hat\beta$ by a weighted least-squares estimate in which $\mathrm{var}(Y_i)$ is estimated as $1 - (\hat\beta^{T}x_i)^2$. Since $1 - (\hat\beta^{T}x_i)^2$ is not bounded away from zero, weighted least squares is inappropriate as a general method. The calculation of approximate confidence intervals and significance tests may be based on the asymptotic normality of $\hat\beta$.
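The least-squares estimate and the plug-in version of its covariance matrix can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `lpm_least_squares`, the simulated design and the parameter value are hypothetical, and truncating the estimated variances at zero for out-of-range fitted values is a choice made here for the sketch.

```python
import numpy as np

def lpm_least_squares(x, y):
    """OLS fit of the linear in probability model with y coded as -1/+1.

    Returns (beta_hat, cov_hat), where cov_hat plugs beta_hat into
    (x'x)^{-1} x'(I - Delta) x (x'x)^{-1}, Delta = diag{(beta'x_i)^2}.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(x.T @ x)
    beta_hat = xtx_inv @ x.T @ y
    fitted = x @ beta_hat
    # var(Y_i) = 1 - (beta'x_i)^2. Assumption of this sketch: truncate at
    # zero so that out-of-range fitted values do not give negative variances.
    variances = np.clip(1.0 - fitted**2, 0.0, None)
    middle = x.T @ (variances[:, None] * x)
    cov_hat = xtx_inv @ middle @ xtx_inv
    return beta_hat, cov_hat

# Simulated illustration (hypothetical data, not the data of the paper)
rng = np.random.default_rng(0)
n = 2000
x = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta = np.array([0.1, 0.5])
prob_plus = (1.0 + x @ beta) / 2.0          # pr(Y_i = 1) under the model
y = np.where(rng.uniform(size=n) < prob_plus, 1.0, -1.0)

beta_hat, cov_hat = lpm_least_squares(x, y)
print(beta_hat, np.sqrt(np.diag(cov_hat)))
```

Because the estimated variances never exceed one, the standard errors from this formula are no larger than those from the standard least-squares formula with unit variance.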

Maximum-likelihood estimation

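The log likelihood corresponding to (1.1) is, apart from a constant,

$$\ell(\beta) = \sum_{i=1}^{n} \log(1 + Y_i\beta^{T}x_i),$$

provided that for all i, $|\beta^{T}x_i| \le 1$. We return to the relevance of this condition later. A stationary value of the log likelihood occurs where

$$\sum_{i=1}^{n} \frac{Y_i x_i}{1 + Y_i\beta^{T}x_i} = 0.$$

If $1/(1 + a)$ is expanded as $1 - a$ and higher terms neglected, that is the regression assumed small, the least-squares estimate is recovered, because $Y_i^2 = 1$. There is a strong argument for using ordinary least squares rather than maximum likelihood in this context despite sufficiency of the maximum-likelihood estimate $\hat\beta_{\mathrm{ML}}$ under model (1.1). In the present context, the two estimators are virtually equivalent in terms of their efficiency, while maximum likelihood suffers extreme fragility, as explained below. There is the following expansion of the second derivative of ℓ(β), valid for small $|\beta^{T}x_i|$,

$$\nabla\nabla^{T}\ell(\beta) = -\sum_{i=1}^{n} \frac{x_i x_i^{T}}{(1 + Y_i\beta^{T}x_i)^2} = -\sum_{i=1}^{n} x_i x_i^{T}\{1 - 2Y_i\beta^{T}x_i + 3(\beta^{T}x_i)^2 - \cdots\}.$$

Here $\nabla\nabla^{T}$ denotes the matrix of second partial derivatives with respect to β. On taking expectations, an approximation to the asymptotic variance of the maximum-likelihood estimator is obtained as $\{x^{T}(I + \Delta)x\}^{-1}$. For comparison to (2.1), it is more convenient to work with $\{x^{T}(I - \Delta)^{-1}x\}^{-1}$, which is a lower bound for $\{x^{T}(I + \Delta)x\}^{-1}$. Using the geometric series expansion $(I - \Delta)^{-1} = I + \Delta + \Delta^2 + \cdots = I + \tilde\Delta$, say, and the formula

$$(A + UBV)^{-1} = A^{-1} - A^{-1}U(B^{-1} + VA^{-1}U)^{-1}VA^{-1}, \qquad (2.3)$$

we write, with $A = x^{T}x$, $B = I$ and $U = V^{T} = x^{T}\tilde\Delta^{1/2}$ in (2.3) and $H = x(x^{T}x)^{-1}x^{T}$,

$$\{x^{T}(I - \Delta)^{-1}x\}^{-1} = (x^{T}x)^{-1} - (x^{T}x)^{-1}x^{T}\tilde\Delta^{1/2}(I + \tilde\Delta^{1/2}H\tilde\Delta^{1/2})^{-1}\tilde\Delta^{1/2}x(x^{T}x)^{-1}.$$

Because $(I + \tilde\Delta^{1/2}H\tilde\Delta^{1/2})^{-1} \preceq I$, where the notation $A \preceq B$ means that A − B is a negative semi-definite matrix, the inflation in variance from using $\hat\beta$ rather than $\hat\beta_{\mathrm{ML}}$ is

$$\mathrm{cov}(\hat\beta) - \{x^{T}(I - \Delta)^{-1}x\}^{-1} \preceq (x^{T}x)^{-1}x^{T}(\tilde\Delta - \Delta)x(x^{T}x)^{-1}.$$

Write $\delta_i = \beta^{T}x_i$. From the geometric series, we deduce that

$$\tilde\Delta - \Delta = \Delta^2(I - \Delta)^{-1} = \mathrm{diag}\{\delta_i^4/(1 - \delta_i^2)\}.$$

Thus

$$\mathrm{cov}(\hat\beta) - \{x^{T}(I - \Delta)^{-1}x\}^{-1} \preceq (x^{T}x)^{-1}x^{T}\,\mathrm{diag}\{\delta_i^4/(1 - \delta_i^2)\}\,x(x^{T}x)^{-1},$$

showing that the loss in efficiency is typically very small. On the other hand, from the perspective of formal likelihood theory even one individual out of range, in the sense that $|\beta^{T}x_i| > 1$, would refute the parameter value in question. That is, maximum likelihood is extremely sensitive in the present context to observations measured with error or drawn from a model even slightly different from that postulated. Ordinary least squares is by contrast relatively unaffected by such anomalies.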
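The claim that least squares loses little efficiency can be checked numerically for a given design. The sketch below compares the covariance matrix (2.1) with the inverse Fisher information, whose exact form under the model is $\{x^{T}(I - \Delta)^{-1}x\}^{-1}$. It also evaluates a fourth-order bound on the inflation, assumed here to take the form $(x^{T}x)^{-1}x^{T}\mathrm{diag}\{\delta_i^4/(1 - \delta_i^2)\}x(x^{T}x)^{-1}$ with $\delta_i = \beta^{T}x_i$. The design, the parameter value and the helper `sandwich` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta = np.array([0.1, 0.5])      # hypothetical parameter value
delta = x @ beta                  # delta_i = beta'x_i, here |delta_i| <= 0.6

def sandwich(weights):
    """(x'x)^{-1} x' diag(weights) x (x'x)^{-1} for the design x above."""
    xtx_inv = np.linalg.inv(x.T @ x)
    return xtx_inv @ (x.T @ (weights[:, None] * x)) @ xtx_inv

# Covariance of the least-squares estimate, equation (2.1)
var_ols = sandwich(1.0 - delta**2)
# Inverse Fisher information {x'(I - Delta)^{-1} x}^{-1}
var_ml = np.linalg.inv(x.T @ (x / (1.0 - delta**2)[:, None]))
# Assumed fourth-order bound on var_ols - var_ml
bound = sandwich(delta**4 / (1.0 - delta**2))

# Variance ratios for each coefficient; values near 1 mean little is lost
ratio = np.diag(var_ols) / np.diag(var_ml)
print(ratio)
```

With fitted probabilities kept well inside the unit interval, as here, the printed ratios stay close to one, which is the sense in which the two estimators are virtually equivalent.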

Interpretation of analysis

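The interpretation of the regression coefficients in the linear in probability model is similar to that in a normal theory linear regression model. Let x* and x** be two different vectors of covariate information, differing by 1 unit in variable j and otherwise the same. The number of positive outcomes is $S = \sum_i Z_i$, where $Z_i = (Y_i + 1)/2$. Therefore, the hypothetical change in E(S) for a hypothetical replacement of m individuals who differ by one unit in the jth component but are otherwise the same is

$$\tfrac{1}{2}m\,\beta^{T}(x^{**} - x^{*}) = \tfrac{1}{2}m\beta_j.$$

If there are binary covariates, it is natural to code them as {−1, 1}, in which case division by two is not needed because a unit change in the level corresponds to a numerical difference of two units. If, upon fitting the linear in probability model, it is found that the number of least-squares fitted values outside [−1, 1] is appreciably larger than could be attributed to chance under the linear in probability model, some doubt would be cast upon the plausibility of the model. The expected number out of range, assuming that the linear in probability model is valid for all observations, is $\lambda = \sum_i p_i$ where, by the asymptotic normality of $\hat\beta$,

$$p_i = \mathrm{pr}(|\hat\beta^{T}x_i| > 1) \simeq 1 - \Phi\left\{\frac{1 - \beta^{T}x_i}{(x_i^{T}\Sigma x_i)^{1/2}}\right\} + \Phi\left\{\frac{-1 - \beta^{T}x_i}{(x_i^{T}\Sigma x_i)^{1/2}}\right\},$$

with Σ the covariance matrix of $\hat\beta$ and Φ the standard normal distribution function. Thus, a predicted number of out of range values is an estimate of λ, obtained by replacing β and Σ by estimates in the expression for each $p_i$. A crude lower bound on the variance of the sum, R, of out of range values is λ, obtained by incorrectly assuming that R is approximately Poisson distributed for large n. The variance of R is larger than λ due to dependence between the summands, induced by $\hat\beta$. In particular,

$$\mathrm{var}(R) = \sum_i p_i(1 - p_i) + \sum_{i \ne j}(p_{ij} - p_i p_j),$$

where $p_{ij}$ is the probability that the fitted values for individuals i and j are both out of range. Write

$$Z_i = \frac{(\hat\beta - \beta)^{T}x_i}{(x_i^{T}\Sigma x_i)^{1/2}},$$

so that $Z_i$ and $Z_j$ are bivariate normally distributed of zero means, unit variances and correlation coefficient

$$\rho_{ij} = \frac{x_i^{T}\Sigma x_j}{\{(x_i^{T}\Sigma x_i)(x_j^{T}\Sigma x_j)\}^{1/2}}.$$

Then $p_{ij}$ is the sum of the quadrant probabilities,

$$p_{ij} = \mathrm{pr}(Z_i > z_i^{+}, Z_j > z_j^{+}) + \mathrm{pr}(Z_i > z_i^{+}, Z_j < z_j^{-}) + \mathrm{pr}(Z_i < z_i^{-}, Z_j > z_j^{+}) + \mathrm{pr}(Z_i < z_i^{-}, Z_j < z_j^{-}),$$

where $z_i^{\pm} = (\pm 1 - \beta^{T}x_i)/(x_i^{T}\Sigma x_i)^{1/2}$. While there is no closed-form expression for these, close approximations are obtained by replacing the conditional expectations of the functions of interest by the corresponding functions of the conditional expectations, with approximation error established by Taylor series expansion. For generic thresholds $z_i$, $z_j$ and correlation ρ,

$$\mathrm{pr}(Z_i > z_i, Z_j > z_j) \simeq \{1 - \Phi(z_i)\}\left[1 - \Phi\left\{\frac{z_j - \rho\,\mu(z_i)}{(1 - \rho^2)^{1/2}}\right\}\right], \qquad \mu(z_i) = E(Z_i \mid Z_i > z_i) = \frac{\phi(z_i)}{1 - \Phi(z_i)},$$

where φ is the standard normal density. Depending on the signs of $z_i$, $z_j$ and ρ, the approximation so obtained might be improved by interchanging the roles of $z_i$ and $z_j$ on the right-hand side of the above display. For a further discussion, see [6].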
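The predicted number of out of range fitted values, and the kind of approximation described for a quadrant probability, can be sketched as follows. The sketch assumes that each fitted value is approximately normal with mean $\beta^{T}x_i$ and standard deviation $(x_i^{T}\Sigma x_i)^{1/2}$, and that the quadrant probability is approximated by replacing $Z_i$ with its conditional mean given $Z_i > z_i$. The function names and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def out_of_range_prob(fitted_mean, fitted_sd):
    """Normal approximation to pr(|fitted value| > 1)."""
    upper = (1.0 - fitted_mean) / fitted_sd
    lower = (-1.0 - fitted_mean) / fitted_sd
    return 1.0 - STD_NORMAL.cdf(upper) + STD_NORMAL.cdf(lower)

def expected_out_of_range(fitted_means, fitted_sds):
    """Plug-in estimate of lambda, the sum of the individual probabilities."""
    return sum(out_of_range_prob(m, s) for m, s in zip(fitted_means, fitted_sds))

def upper_quadrant_approx(z_i, z_j, rho):
    """Approximate pr(Z_i > z_i, Z_j > z_j) for standard bivariate normal
    variables with correlation rho, replacing Z_i by E(Z_i | Z_i > z_i).
    Not guarded against thresholds so far in the tail that 1 - Phi(z_i)
    underflows to zero.
    """
    tail = 1.0 - STD_NORMAL.cdf(z_i)
    cond_mean = STD_NORMAL.pdf(z_i) / tail
    arg = (z_j - rho * cond_mean) / sqrt(1.0 - rho**2)
    return tail * (1.0 - STD_NORMAL.cdf(arg))

# Hypothetical fitted values and their standard errors, for illustration only
means = [0.2, 0.95, 1.0, -1.1]
sds = [0.05, 0.05, 0.05, 0.05]
print(expected_out_of_range(means, sds))
print(upper_quadrant_approx(0.0, 0.0, 0.5))
```

For zero thresholds and correlation 0.5 the exact quadrant probability is 1/3, which gives a simple check on the quality of the approximation.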

Socio-economic inequalities in educational attainment

We use US data from the National Longitudinal Survey of Youth (1979), a nationally representative longitudinal study of people aged 14–22. Our binary outcome, coded as {−1, 1}, specifies whether the individual enrolled in a 4-year-degree-granting institution for at least 1 year. There are five potential explanatory variables. Ability is measured as the respondent’s score on the Armed Forces Qualifying Test, administered to all respondents in the 1981 wave of the survey. Family income in childhood is measured as the log of total net family income in 1979. All respondents identified themselves as male or female but race was measured via interviewer observation, and we here limit our sample to those respondents who were classified as black or non-black and non-Hispanic. Finally, we include an indicator of whether respondents were living with at least one parent at the time of the first survey. As is common with extensive observational data, some observations on explanatory variables are missing, as shown in table 1. Because we are concerned with the dependence of outcome on explanatory variables, individuals with missing outcome are treated as uninformative about that dependence. A sensitivity analysis examined how the regression coefficients of interest changed when rather extreme assignments were made to the three explanatory variables with missing values, treating binary variables as all at one or other extreme and continuous variables as at their upper and lower quartiles. The levels used were 68.33 and 17.28 for the Armed Forces Qualifying Test score and 10.00 and 8.79 for the logarithm of family income when the individual was in childhood. Estimates from the eight patterns of missingness are in table 2. While there is some dependence on the missing values, that dependence is very minor and without qualitative impact on the conclusions of the analysis.
If a larger number of explanatory variables have missing values, the sensitivity analysis should be based on a suitable fraction of the two-level factorial system of potential missing values, allowing estimation of main effects from missingness [7, §12.2].
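The sensitivity analysis over all low and high assignments can be sketched as a loop over the two-level factorial of fill-in values. This is an illustration under stated assumptions rather than the authors' code: the function `sensitivity_fits`, the use of `np.nan` to mark missing entries and the simulated data are hypothetical, while the low and high levels for the two continuous columns are those quoted in the text.

```python
from itertools import product
import numpy as np

def sensitivity_fits(x, y, levels):
    """Refit least squares under every low/high assignment to missing entries.

    x      : (n, p) array with np.nan marking missing entries
    y      : outcomes coded -1/+1
    levels : dict mapping column index -> (low, high) fill-in values
    Returns a dict from patterns such as "LH" to coefficient vectors.
    """
    x = np.asarray(x, dtype=float)
    cols = sorted(levels)
    fits = {}
    for pattern in product("LH", repeat=len(cols)):
        filled = x.copy()
        for col, choice in zip(cols, pattern):
            low, high = levels[col]
            missing = np.isnan(filled[:, col])
            filled[missing, col] = low if choice == "L" else high
        beta_hat, *_ = np.linalg.lstsq(filled, y, rcond=None)
        fits["".join(pattern)] = beta_hat
    return fits

# Simulated illustration (hypothetical data): intercept, a test score
# and log income, with about 10% of each continuous column missing
rng = np.random.default_rng(2)
n = 200
x = np.column_stack([np.ones(n), rng.uniform(0, 100, n), rng.normal(9.5, 1.0, n)])
y = rng.choice([-1.0, 1.0], size=n)
x[rng.uniform(size=n) < 0.1, 1] = np.nan
x[rng.uniform(size=n) < 0.1, 2] = np.nan

levels = {1: (17.28, 68.33), 2: (8.79, 10.00)}
fits = sensitivity_fits(x, y, levels)
estimates = np.array(list(fits.values()))
max_abs_diff = estimates.max(axis=0) - estimates.min(axis=0)
print(max_abs_diff)
```

With k variables subject to missingness the loop visits 2^k patterns, which is why a fraction of the factorial becomes attractive when k is large.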

Table 1.

Summary of data.

covariate	description	sample range	per cent missing
x₁	gender	{1 = male, −1 = female}	0
x₂	AFQT score	percentage (0–100)	4.3
x₃	log income	continuous (3.00–11.23)	51.2
x₄	race	{1 = black, −1 = non-black/non-Hispanic}	0
x₅	lives with parent	{1 = yes, −1 = no}	5.1

Table 2.

Sensitivity analysis of least squares estimates and their estimated standard errors from replacing all missing values of x by high (H) and low (L) levels. The estimated standard errors are obtained by replacing Δ by $\hat\Delta$ in equation (2.1). The sample size is 9043.

level assigned to missing values			least squares estimates of regression coefficients (estimated standard errors)
x₂	x₃	x₅	$\hat\beta_0$	$\hat\beta_1$	$\hat\beta_2$	$\hat\beta_3$	$\hat\beta_4$	$\hat\beta_5$	number out of range	predicted number out of range
L	H	H	−1.51 (0.13)	−0.061 (0.0092)	0.0201 (0.00031)	0.064 (0.011)	0.224 (0.011)	−0.034 (0.011)	394	396
L	H	L	−1.51 (0.13)	−0.062 (0.0092)	0.0202 (0.00031)	0.063 (0.014)	0.223 (0.011)	−0.021 (0.010)	383	388
L	L	H	−1.32 (0.12)	−0.060 (0.0092)	0.0202 (0.00031)	0.048 (0.014)	0.222 (0.011)	−0.038 (0.011)	384	391
L	L	L	−1.31 (0.12)	−0.061 (0.0092)	0.0203 (0.00031)	0.046 (0.014)	0.221 (0.011)	−0.025 (0.011)	377	384
H	H	H	−1.57 (0.13)	−0.065 (0.0093)	0.0198 (0.00033)	0.068 (0.014)	0.225 (0.011)	−0.028 (0.011)	444	441
H	H	L	−1.57 (0.13)	−0.067 (0.0094)	0.0198 (0.00032)	0.066 (0.014)	0.223 (0.011)	−0.011 (0.010)	434	436
H	L	H	−1.45 (0.13)	−0.065 (0.0094)	0.0198 (0.00033)	0.059 (0.014)	0.224 (0.011)	−0.034 (0.012)	451	450
H	L	L	−1.44 (0.13)	−0.066 (0.0094)	0.0199 (0.00033)	0.056 (0.014)	0.222 (0.011)	−0.017 (0.011)	453	443
max absolute difference			0.23	0.0061	0.00050	0.022	0.0040	0.026

The sensitivity analysis used here may be contrasted with procedures of multiple imputation based on the untestable assumption that observations are missing at random. An informal preliminary analysis involved tests for interactions and inspection of interaction plots. None was strongly suggested. Table 2 reports least squares estimates of regression coefficients and their estimated standard errors from a model with main effects for the five explanatory variables. The suggestion is that hypothetically increasing the number of males and correspondingly reducing the number of females in the population by m units, say, would correspond to a decrease of 6–7% of m in the expected number of individuals receiving higher education, all other things equal. The coefficient of the race variable is similarly interpreted, the suggestion being that in a hypothetical population, demographically equivalent to the one under study except for having m more black children than white children, the expected number of individuals experiencing the positive outcome would be higher by 22–23% of m. It is suggested, all other things being equal, that a 1% increase in family income, i.e. an increase of 0.01 in log family income, would correspond to a 0.02–0.03% increase in the expected number of positive outcomes and that an increase of one percentage point in ability, to the extent that it can be measured by the Armed Forces Qualifying Test score, would correspond to a 1% increase. An absolute change at the bottom of the income scale has a relatively greater effect than the same absolute change at the top.
Finally, accounting for other factors, individuals living with someone other than one of their parents are perhaps slightly more likely to experience the positive outcome, although the evidence for this is rather weak. In the above interpretation of the estimated coefficients on the continuous variables, division by 2 is needed, as described in §2.3. Division by 2 is not needed for the three binary explanatory variables because they are coded as {−1, 1}. The last two columns of table 2 show the actual and predicted number of least squares fitted values that are outside [−1, 1]. The individuals whose fitted values are out of range are almost all at the two edges of the sample space for the Armed Forces Qualifying Test score. While the numerical values of the coefficient estimates from a linear logistic model are not comparable to those from a linear in probability model, the ratios of these coefficients are remarkably similar. The code for verifying this statement and the analysis of §3 is available as outlined in the data accessibility statement.
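The arithmetic behind these interpretations can be made explicit. The sketch below assumes the rule stated earlier: a change of m β_j/2 in the expected number of positive outcomes per unit change of a covariate, with no division by 2 for binary covariates coded {−1, 1}. The function name `expected_change` and the choice m = 1000 are hypothetical, and the coefficients are taken from the first row of table 2.

```python
def expected_change(beta_j, m, binary_pm1=False):
    """Change in the expected number of positive outcomes, E(S).

    beta_j     : coefficient of covariate j in the linear in probability model
    m          : number of individuals hypothetically replaced
    binary_pm1 : True if covariate j is binary and coded {-1, 1}, so that
                 switching level moves x_j by two units
    """
    unit_shift = 2.0 if binary_pm1 else 1.0
    return m * beta_j * unit_shift / 2.0

m = 1000
# Gender coefficient: replacing m females by m males
print(expected_change(-0.061, m, binary_pm1=True))
# Race coefficient: replacing m non-black by m black individuals
print(expected_change(0.224, m, binary_pm1=True))
# Test score coefficient: one-point increase for m individuals
print(expected_change(0.0201, m))
# Log income coefficient: increase of 0.01 in log income for m individuals
print(0.01 * expected_change(0.064, m))
```

The four printed values correspond, as fractions of m, to the roughly 6%, 22%, 1% and 0.03% changes quoted in the text.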

Discussion

As with other statistical methods, care is needed especially when relatively complex data are involved. In the present context, a reasonable approach for general use is to base the analysis on $\hat\beta$ with the improved estimate of its covariance matrix, given by (2.1). Examination of model adequacy should include a check of the number of fitted values outside [−1, 1]. Do such values form a rationally identifiable subgroup to be analysed separately? Does their omission or exclusion materially affect the conclusions? Does the number of anomalous observations suggest major change to the whole analysis? A large number of anomalous observations may suggest that a model linear on the logit scale would be more appropriate. From the perspective of formal likelihood theory, even one individual out of range would refute the parameter value in question in the linear in probability model. Thus, the paper illustrates an empirical context in which the formal optimality of maximum-likelihood estimates is achieved only at the cost of extreme fragility. A formally slightly less efficient method is much to be preferred.
