
Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: Correction terms and prediction in the original scale.

Ricardo Ramírez-Aldana¹, Lizbeth Naranjo².

Abstract

Random intercept models are linear mixed models (LMM) that include error and intercept random effects. Sometimes heteroscedasticity is included and the response variable is transformed into a logarithmic scale, while inference is required in the original scale; thus, the response variable has a log-normal distribution. Hence, correction terms should be included to predict the response in the original scale. These terms multiply the exponentiated predicted response variable, which on its own underestimates the real values. We derive the correction terms, and present simulations and real data on the income of elderly people to show the importance of using them to obtain more accurate predictions. Generalizations for any LMM are also presented.


Year: 2021 PMID: 33852635 PMCID: PMC8046211 DOI: 10.1371/journal.pone.0249910

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

In economics and other scientific areas such as medicine, geology, and genetics, it is common to study linear models with a dependent variable defined in a logarithmic scale, for instance in studies related to income [1], health insurance [2], medical expenditures [3], health care utilization and earnings [4], or sediment discharge [5]. In the logarithmic scale, the variable can have an associated normal distribution, whereas in the original scale this is not true. In other words, the dependent variables correspond to random variables with a log-normal distribution, a skewed distribution associated with variables taking only positive values, which has been extensively used in analyses of real data corresponding to stock prices, income (without higher-income individuals), time from infection to first symptoms, distribution of particles, number of words per sentence, age at marriage, size of living tissue, etc. There are instances in which heteroscedasticity can be resolved by working in the logarithmic scale; however, sometimes the issue persists even after the transformation, for instance when the variability is not proportional to the squared conditional mean response given the values of the explanatory variables. Additionally, there are data in which observations are nested, for instance, when observations belong to the same spatial cluster. In this case, independence between observations is not satisfied, since the values within a cluster are correlated, and a random intercept model is preferred.
A random intercept model (RIM) in a logarithmic scale is a special type of linear mixed model (LMM) [6, 7], in which:

$$\log(Y_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \gamma_i + \epsilon_{ij}, \qquad (1)$$

where $i = 1, \ldots, m$, $j = 1, \ldots, n_i$, $m$ is the number of clusters, $n_i$ is the number of observations in the ith cluster, $\log(Y_{ij})$ is the response associated with the jth observation in the ith cluster, and $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ is a vector of dimension p associated with the jth observation in the ith cluster, corresponding to the p fixed effects given in the regression parameter vector $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$. The variable $\gamma_i$ represents an intercept random effect associated with cluster i, which allows modeling the relationship among the observations within each cluster; it has a normal distribution, $\gamma_i \sim N(0, \sigma_\gamma^2)$, and the $\gamma_i$, for $i = 1, \ldots, m$, are independent and identically distributed (i.i.d.). The random error is $\epsilon_{ij}$ and, since heteroscedasticity is assumed in (1), the errors are independent with $\epsilon_{ij} \sim N(0, \sigma_{ij}^2)$, where $\sigma_{ij}^2$ is set by the common variance $\sigma^2$ and a known number $w_{ij}$ that allows different variability between observations and clusters. The terms $w_{ij}$ are assumed known; for instance, they could be obtained using the unit size or the ELL method [8, 9]. Under this method, two linear models are fitted: a first model (beta), which is the corresponding marginal model, and a second one (alpha), which is a model associated with transformed residuals obtained from the beta model (residuals obtained after removing the effects associated with the random effects); then, an approximation of the terms can be used in the random intercept model. Finally, the random effects $\gamma_i$ and the errors $\epsilon_{ij}$ are independent.

In many cases, it is necessary to return to the original scale of Y. Traditionally, this is done by simply applying an exponential function to the predicted values obtained from the model. However, this approach does not consider that the random terms involved in the model are transformed as well, and predictions are underestimated. In some cases, a generalized linear mixed model (GLMM) [10] with an associated distribution according to the data type could be used (e.g., gamma, Poisson, etc.).
However, sometimes it is preferable to use the normal distribution in the logarithmic scale, either when we know the dependent variable has a log-normal distribution (as far as we know, this distribution is not included in programs that fit GLMMs, and transforming the dependent variable in a normal GLMM would be similar to what we are doing), or when other processes depend on such a normal RIM. For instance, in small area estimation there are methods based on the RIM, e.g. the empirical best predictor (EBP) method [11], in which parameters are estimated from a RIM using the sample information; after that, the conditional distribution of the out-of-sample data given the sample data can be derived from the normality assumption; the predicted values, simulations, and Monte Carlo approximations are used to estimate poverty measures at a small area level (for elements in or outside of the sample); and finally, a parametric bootstrap mean squared error (MSE) estimator is obtained based on the same RIM. Additionally, not all possible distributions are implemented in GLMM software, so a logarithmic transformation must be used in an LMM. In this sense, [12] recently proposed a model for data showing skewness at the log scale based on an extension of a distribution called the generalized beta of the second kind, which can be seen as a random effects model designed for skewed response variables that extends the usual log-normal nested error model; they also found empirical best predictors for poverty measures in small areas.

Statistical models to correct the logarithmic transformation in linear regression models have been proposed by different authors, e.g. [5, 13, 14]; other authors have used Bayesian methods to deal with it, e.g. [15]. Some extensions to address heteroscedasticity in linear regression models with a logarithmic scale have also been proposed, e.g. [2, 16].
Other authors have compared the logarithmic transformation in linear models with other types of models in different applications, e.g. [4, 17–21]. Moreover, others have studied the Box-Cox transformation in linear models, e.g. [3, 22–24], of which the logarithmic transformation is a particular case. Finally, a Box-Cox transformation in LMM has been studied by [23].

In this paper we derive the correction terms that should be used to obtain more accurate predictions in a RIM with heteroscedasticity when predictions are desired in the original scale. In economics, for instance, this allows a more accurate prediction of income or improved predictions of measures depending on it, such as poverty measures. These correction terms multiply the exponentiated predicted values obtained from the RIM; computing the latter values without correction is the usual procedure. These terms are important since they yield more precise predictions, with a smaller MSE. Since the RIM contains two random terms, the random effect and the error term, two correction terms are obtained. When these correction terms are not included, underestimation occurs. Similar terms have been obtained for linear regression models but, as far as the authors know, they have not been derived for the RIM. These results are relevant not only when a RIM is fitted to some data, but also when a methodology is based on such models, for instance in small area estimation.

The motivating example considers a sample of elderly individuals over 60 years old in Mexico, for whom some household and socio-demographic measures and income are known. The information is available by state $i = 1, \ldots, m$, with $m = 32$; the number of individuals in state i is $n_i$, $Y_{ij}$ is the income of individual $j = 1, \ldots, n_i$ in state i, and there are a total of $n = \sum_{i=1}^{m} n_i$ observations. Assume we want to estimate the expected income for these individuals, $E[Y_{ij}]$, according to the available explanatory variables.
This process could be useful to estimate income for another set of similar individuals for whom income is not available but the other variables are, for imputation, or in a simulation process, as in small area estimation, which depends on simulating income for out-of-sample individuals. In the framework of linear models, there are several options for this estimation; in some of them, the distributional assumptions are better satisfied in a logarithmic scale, using log(Y) as response, but the estimations are required in the original scale. A first option is to estimate the expected income, without a common random effect for state, simply using a linear regression in a logarithmic scale and using a correction term to estimate in the original scale. A second option is to fit a linear model on the log-transformed scale with random effects for state and exponentiate the predicted values to estimate the income. A third option is to associate a gamma distribution with the income, a commonly used distribution for positive skewed data such as income or costs [25], and fit a generalized linear mixed model including random effects for state. The fourth option, which we propose, is to apply correction terms to the second option. We show here that this improves the precision of the estimated values. We apply the correction terms to both the real data set and simulated data to evaluate which of the described options including random effects performs best; for the simulations we vary the number of clusters, the number of observations per cluster, and the parameters associated with the variance of the random effects and error terms, and we consider models under different distributions.

This paper is organized as follows. In the second section, we briefly present the correction term used in a linear regression model in a logarithmic scale, including the so-called smearing estimate.
In the third section, we derive correction terms for a RIM with heteroscedasticity, and the corresponding correction terms for a RIM with homoscedasticity are obtained as a particular case. In the fourth section, we present simulations from a log-normal distribution to show that the MSE is minimized when the correction terms are used, and simulations from a gamma distribution to show that this also holds in certain cases, comparing the estimations from our method with those derived from a generalized linear mixed model. Additionally, the real data corresponding to the income of elderly people are analyzed to show the use of the correction terms and to compare predictions obtained under the different options. In the fifth section, we propose a generalization for LMM and for transformations other than the logarithm. Finally, the conclusion is presented in the last section, and some of the linear algebra used for the calculations is presented as Supplementary Material.

Correction term associated with a linear regression model in a logarithmic scale

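In our motivating example, assume that we estimate income in a logarithmic scale without considering a random effect for state. In this case, we are in the framework of a linear regression and the second index j is unnecessary; thus, for simplicity, we drop it in this section. A linear regression model in a logarithmic scale, also called a log-normal linear model, is defined as:

$$\log(Y_i) = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \qquad (2)$$

where $Y_i$ is the response variable for the ith observation, $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$ is a vector of the p explanatory variables for the ith observation, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$ is a vector of dimension p of regression parameters, and $u_i$ is an error term, where $u_i \sim N(0, \sigma^2)$ i.i.d., for $i = 1, \ldots, n$. In matrix notation, $\log(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathbf{Y} = (Y_1, \ldots, Y_n)'$, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$, and $\mathbf{u} = (u_1, \ldots, u_n)'$. From (2), the expected value of the response in the logarithmic scale is $E[\log(Y_i)] = \mathbf{x}_i'\boldsymbol{\beta}$. However, since $Y_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})\exp(u_i)$ in the original scale, the expected value is $E[Y_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})\,E[\exp(u_i)]$. Omitting subindex i, we have that:

$$E[\exp(u)] = \int_{-\infty}^{\infty} \exp(u)\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{u^2}{2\sigma^2}\right)du = \exp\left(\frac{\sigma^2}{2}\right)\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(u-\sigma^2)^2}{2\sigma^2}\right)du.$$

Noticing that the integral in the last equality is equal to one, we obtain

$$E[\exp(u)] = \exp\left(\frac{\sigma^2}{2}\right), \qquad (3)$$

and so $E[Y_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta} + \sigma^2/2)$. Therefore, the estimator of the predicted response is given by $\hat{Y}_i = \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}} + \hat{\sigma}^2/2)$. Using the same reasoning and considering heteroscedasticity in (2), which allows different variability among subjects, i.e. $u_i \sim N(0, \sigma_i^2)$, we get $E[Y_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta} + \sigma_i^2/2)$. The last term can be estimated by replacing $\boldsymbol{\beta}$ with the least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\log(\mathbf{Y})$, and $\sigma^2$ with the biased (maximum likelihood, ML) or unbiased estimator, $\hat{\sigma}^2 = \mathrm{RSS}/n$ or $\hat{\sigma}^2 = \mathrm{RSS}/(n-p)$ respectively, where $\mathrm{RSS} = \sum_{i=1}^{n}(\log(Y_i) - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2$ is the residual sum of squares. Observe that $\hat{Y}_i$ is the estimator of the predicted response associated with a log-normal distribution.

A modified non-parametric estimator can also be associated with model (2) by using the smearing estimate [3]. We assume $u_i \sim F$ i.i.d., $i = 1, \ldots, n$, where $E[u_i] = 0$ and $\mathrm{Var}(u_i) = \sigma^2$. Since F is not completely known, the empirical distribution

$$\hat{F}_n(e) = \frac{1}{n}\sum_{i=1}^{n} I(\hat{u}_i \le e)$$

is used, where, from (2), $\hat{u}_i = \log(Y_i) - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$ is the estimated value of $u_i$, and the indicator function $I(\hat{u}_i \le e)$ is equal to 1 if $\hat{u}_i \le e$ and 0 otherwise. Assuming $Y_0$ corresponds to an observed response with associated explanatory variable values $\mathbf{x}_0$, the predicted response is:

$$E[Y_0] = \int \exp(\mathbf{x}_0'\boldsymbol{\beta} + u)\,dF(u).$$

Furthermore, substituting the regression parameter by the estimates $\hat{\boldsymbol{\beta}}$ and F by $\hat{F}_n$, the estimated predicted response is given by:

$$\hat{Y}_0 = \int \exp(\mathbf{x}_0'\hat{\boldsymbol{\beta}} + u)\,d\hat{F}_n(u) = \exp(\mathbf{x}_0'\hat{\boldsymbol{\beta}})\,\frac{1}{n}\sum_{i=1}^{n}\exp(\hat{u}_i).$$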
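The effect of the two back-transformations described in this section can be checked numerically. The following is a minimal sketch, not the authors' code: the values of `sigma2` and `xb` are hypothetical, and the simulated errors stand in for the fitted residuals that would be used in practice.

```python
import math
import random


def lognormal_factor(sigma2):
    """Parametric correction exp(sigma^2 / 2), the mean of exp(u) for u ~ N(0, sigma^2)."""
    return math.exp(sigma2 / 2.0)


def smearing_factor(residuals):
    """Smearing factor: average of the exponentiated log-scale residuals."""
    return sum(math.exp(r) for r in residuals) / len(residuals)


rng = random.Random(1)
sigma2 = 0.4   # hypothetical error variance on the log scale
xb = 1.1       # hypothetical value of the linear predictor x'beta

# Simulated errors; in an application these would be the residuals of the fitted model.
errors = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(100_000)]
y = [math.exp(xb + u) for u in errors]

mean_y = sum(y) / len(y)
naive = math.exp(xb)                           # exponentiated prediction, no correction
corrected = naive * lognormal_factor(sigma2)   # log-normal correction
smeared = naive * smearing_factor(errors)      # smearing correction

print(f"mean of Y: {mean_y:.3f}  naive: {naive:.3f}  "
      f"corrected: {corrected:.3f}  smeared: {smeared:.3f}")
```

In this sketch the naive value falls below the simulated mean of Y, while both corrected values are close to it; the smearing factor does not use the normality assumption.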

Correction terms in a RIM with heteroscedasticity and a logarithmic scale

In this section we derive the correction terms for a RIM with heteroscedasticity in a logarithmic scale. In terms of our motivating example, the process corresponds to estimating the expected income by fitting a model in a logarithmic scale, adding a common random effect for state, and using correction terms that allow more precise estimations in the original scale. First, a preliminary estimator is introduced. Second, an estimator based on the random effect best linear predictor is presented. Third, an estimator based on a conditional expectation is proposed. Finally, a correction term based on the smearing estimate is given.

A preliminary estimator

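The RIM given in (1) is equivalent to $Y_{ij} = \exp(\mathbf{x}_{ij}'\boldsymbol{\beta})\exp(\gamma_i)\exp(\epsilon_{ij})$; then, by using independence between $\gamma_i$ and $\epsilon_{ij}$, the expectation of the response in the original scale is:

$$E[Y_{ij}] = \exp(\mathbf{x}_{ij}'\boldsymbol{\beta})\,E[\exp(\gamma_i)]\,E[\exp(\epsilon_{ij})]. \qquad (4)$$

As in (3), the expectations of the exponentials of $\gamma_i$ and $\epsilon_{ij}$ are $\exp(\sigma_\gamma^2/2)$ and $\exp(\sigma_{ij}^2/2)$, and, as a consequence, $E[Y_{ij}] = \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \sigma_\gamma^2/2 + \sigma_{ij}^2/2)$. Therefore, using the corresponding estimators,

$$\hat{Y}_{ij} = \exp\left(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \frac{\hat{\sigma}_\gamma^2}{2} + \frac{\hat{\sigma}_{ij}^2}{2}\right), \qquad (5)$$

where $\hat{\sigma}_{ij}^2$ and $\hat{\sigma}_\gamma^2$ are variance estimators corresponding to the error and random effects terms, respectively, and $\hat{\boldsymbol{\beta}}$ is the fixed effects estimator, obtained by using ML or restricted ML (REML) methods.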
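As a quick numerical check of the marginal expectation used for the preliminary estimator, the sketch below compares it with a Monte Carlo average. All numeric values (`xb`, `var_gamma`, `var_eps`) are hypothetical and chosen only for illustration.

```python
import math
import random


def preliminary_prediction(xb, var_gamma, var_eps):
    """exp(x'beta + var_gamma / 2 + var_eps / 2): marginal mean of Y in the original scale."""
    return math.exp(xb + var_gamma / 2.0 + var_eps / 2.0)


rng = random.Random(7)
xb, var_gamma, var_eps = 1.1, 0.2, 0.4   # hypothetical values

# Draw independent random effects and errors, then return to the original scale.
draws = [
    math.exp(xb
             + rng.gauss(0.0, math.sqrt(var_gamma))
             + rng.gauss(0.0, math.sqrt(var_eps)))
    for _ in range(200_000)
]
mc_mean = sum(draws) / len(draws)

print(f"Monte Carlo mean: {mc_mean:.3f}")
print(f"exp(x'beta):      {math.exp(xb):.3f}")
print(f"with both terms:  {preliminary_prediction(xb, var_gamma, var_eps):.3f}")
```

Note that this is a marginal quantity: it averages over the random effect instead of using the value predicted for a given cluster.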

An estimator based on the random effect best linear predictor

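The predicted values in (5) do not consider that the random effect $\gamma_i$ can be estimated through the best linear predictor, thus having a predictor $\hat{\gamma}_i$ for each ith cluster, for $i = 1, \ldots, m$. The vector of estimated random effects corresponds to $\hat{\boldsymbol{\gamma}} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_m)'$, whose ith entry for the RIM in (1) is

$$\hat{\gamma}_i = \frac{\sum_{j=1}^{n_i}\left(\log(Y_{ij}) - \mathbf{x}_{ij}'\hat{\boldsymbol{\beta}}\right)/\hat{\sigma}_{ij}^2}{1/\hat{\sigma}_\gamma^2 + \sum_{j=1}^{n_i} 1/\hat{\sigma}_{ij}^2}. \qquad (6)$$

Similarly, to obtain a better predictor associated with $Y_{ij}$, it is more adequate to use $E[\exp(\gamma_i)|\log(\mathbf{Y})]$, instead of $E[\exp(\gamma_i)]$, in (4). Hence, the predictor is:

$$\exp(\mathbf{x}_{ij}'\boldsymbol{\beta})\,E[\exp(\gamma_i)|\log(\mathbf{Y})]\,E[\exp(\epsilon_{ij})]. \qquad (7)$$

A first approach to estimate $E[\exp(\gamma_i)|\log(\mathbf{Y})]$ could be simply to use $\exp(\hat{\gamma}_i)$, so the estimator of (7) would be

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}^2}{2}\right). \qquad (8)$$

Note that $\exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)$ is the predicted value corresponding to $\log(Y_{ij})$, exponentiated to return to the original scale (naive estimator). This term is multiplied by a term associated with the error. Assuming heteroscedasticity, $E[\exp(\epsilon_{ij})] = \exp(\sigma_{ij}^2/2)$ and the estimator of (7) would be:

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}_{ij}^2}{2}\right). \qquad (9)$$

Note that, according to the Jensen inequality, $\exp(E[\gamma_i|\log(\mathbf{Y})]) \le E[\exp(\gamma_i)|\log(\mathbf{Y})]$; thus, $\exp(\hat{\gamma}_i)$ underestimates $E[\exp(\gamma_i)|\log(\mathbf{Y})]$. Hence, a better prediction can be derived by directly obtaining $E[\exp(\gamma_i)|\log(\mathbf{Y})]$.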
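As a small worked illustration of the size of this Jensen gap, suppose (hypothetical values, not taken from the paper) that the conditional distribution of the random effect given the data is normal with mean 0.1 and variance 0.05. Using the mean of a log-normal variable:

```latex
\exp\{E[\gamma_i \mid \log(\mathbf{Y})]\} = \exp(0.1) \approx 1.105,
\qquad
E[\exp(\gamma_i) \mid \log(\mathbf{Y})] = \exp(0.1 + 0.05/2) = \exp(0.125) \approx 1.133 .
```

Under these assumed values, plugging in the predicted random effect falls short by the factor exp(0.025) ≈ 1.025, that is, by about 2.5%.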

An estimator based on E[exp(γ)|log(Y)]

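In this subsection, we obtain a better predictor by directly computing the conditional expectation $E[\exp(\gamma_i)|\log(\mathbf{Y})]$. For this purpose, we first derive the conditional distribution of the random effect $\gamma_i$ given the transformed response for the sample, $\log(\mathbf{Y}) = (\log(\mathbf{Y}_1)', \ldots, \log(\mathbf{Y}_m)')'$, which is a vector of dimension n, where $n = \sum_{i=1}^{m} n_i$ is the sample size, with $\log(\mathbf{Y}_i) = (\log(Y_{i1}), \ldots, \log(Y_{in_i}))'$ for $i = 1, \ldots, m$. The random effect has a univariate distribution $\gamma_i \sim N(0, \sigma_\gamma^2)$, whereas $\log(\mathbf{Y})$ has a multivariate distribution $\log(\mathbf{Y}) \sim N_n(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$, where $\mathbf{X}$ is the design matrix of dimension n × p of fixed effects associated with the response, $\boldsymbol{\beta}$ is a vector of dimension p of regression parameters, and $\mathbf{V}$ is the variance and covariance matrix $\mathrm{Var}[\log(\mathbf{Y})]$ with dimension n × n. The expected value of this conditional distribution corresponds to the predictor given in (6), whereas, using properties concerning the distribution of conditioned multivariate normal random variables, it can be shown (see Proposition 1 in Supplementary Material) that the variance associated with the conditional distribution corresponds to

$$\mathrm{Var}[\gamma_i|\log(\mathbf{Y})] = \left(\frac{1}{\sigma_\gamma^2} + \sum_{j=1}^{n_i}\frac{1}{\sigma_{ij}^2}\right)^{-1}.$$

Thus, $\gamma_i|\log(\mathbf{Y}) \sim N(\hat{\gamma}_i, \mathrm{Var}[\gamma_i|\log(\mathbf{Y})])$. Using the result given in (3), corresponding to the expected value associated with a log-normal random variable,

$$E[\exp(\gamma_i)|\log(\mathbf{Y})] = \exp\left(\hat{\gamma}_i + \frac{1}{2}\left(\frac{1}{\sigma_\gamma^2} + \sum_{j=1}^{n_i}\frac{1}{\sigma_{ij}^2}\right)^{-1}\right). \qquad (10)$$

As a consequence, the predictor of the response in the original scale is estimated considering heteroscedasticity and a predictor of $E[\exp(\gamma_i)|\log(\mathbf{Y})]$, for each $i = 1, \ldots, m$, and corresponds to

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}_{ij}^2}{2}\right)\exp\left(\frac{1}{2}\left(\frac{1}{\hat{\sigma}_\gamma^2} + \sum_{k=1}^{n_i}\frac{1}{\hat{\sigma}_{ik}^2}\right)^{-1}\right). \qquad (11)$$

From (11) and assuming $\sigma_{ij}^2 = \sigma^2$ (or $w_{ij} = 1$), which is a model with homoscedasticity in the error term, $\mathrm{Var}[\gamma_i|\log(\mathbf{Y})] = \sigma_\gamma^2\sigma^2/(\sigma^2 + n_i\sigma_\gamma^2)$ and the predictor corresponds to

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}^2}{2}\right)\exp\left(\frac{\hat{\sigma}_\gamma^2\hat{\sigma}^2}{2(\hat{\sigma}^2 + n_i\hat{\sigma}_\gamma^2)}\right). \qquad (12)$$

Observe how the predicted values given in (11) or (12) include the term $\exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)$, which is the naive estimator associated with $Y_{ij}$. This value is corrected according to two factors, one corresponding to the error and another to the random effect. In contrast, the predictor in (9) only considers the factor associated with the error term. Under heteroscedasticity, the predictor given in (9) underestimates the real value, since the term $E[\exp(\gamma_i)|\log(\mathbf{Y})]$, $i = 1, \ldots, m$, is not used. However, it can be easier to calculate since the sum is not included. Once $E[\exp(\gamma_i)|\log(\mathbf{Y})]$ is calculated, the predictor is given in (11). As far as we know, this expected value had not been obtained before. Observe that all estimators given in (9), (11), and (12) assume that a normal distribution is associated with the transformed data.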
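The calculation for a single cluster can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the fixed effects and variance components are treated as already estimated, `err_vars` holds one error variance per observation of the cluster, and the function names and numeric inputs are hypothetical.

```python
import math


def conditional_gamma(resid, err_vars, var_gamma):
    """Mean and variance of the random intercept given the cluster's log-scale data.

    resid:    log(Y_ij) - x_ij'beta for the observations of one cluster
    err_vars: error variance of each of those observations
    """
    precision = 1.0 / var_gamma + sum(1.0 / v for v in err_vars)
    mean = sum(r / v for r, v in zip(resid, err_vars)) / precision
    return mean, 1.0 / precision


def predict_original_scale(xb, resid, err_vars, var_gamma):
    """Per observation: (naive, error-corrected, error- and random-effect-corrected)."""
    g_hat, g_var = conditional_gamma(resid, err_vars, var_gamma)
    predictions = []
    for xb_j, v in zip(xb, err_vars):
        naive = math.exp(xb_j + g_hat)               # exponentiated log-scale prediction
        err_corrected = naive * math.exp(v / 2.0)    # adds the error-term factor
        full = err_corrected * math.exp(g_var / 2.0) # adds the random-effect factor
        predictions.append((naive, err_corrected, full))
    return predictions


# Hypothetical cluster with three observations.
xb = [1.0, 1.2, 0.9]          # fixed-effect part x_ij'beta
resid = [0.3, 0.1, 0.4]       # log(Y_ij) - x_ij'beta
err_vars = [0.2, 0.1, 0.4]    # heteroscedastic error variances
var_gamma = 0.2

for naive, err_corrected, full in predict_original_scale(xb, resid, err_vars, var_gamma):
    print(f"naive={naive:.3f}  error-corrected={err_corrected:.3f}  fully corrected={full:.3f}")
```

With equal error variances the conditional variance returned by `conditional_gamma` reduces to the homoscedastic expression, which gives a simple way to check the function.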

A correction term based on the smearing estimate

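We saw in the second section that a smearing estimator [3] is a nonparametric statistic used to estimate the expected response on the untransformed scale after fitting a linear model on the transformed scale, thus being useful when the normality assumption is not satisfied. In this subsection, we use this type of estimator to obtain correction terms for the RIM. One variant of the estimator in a model considering homoscedasticity, Eq (12), is obtained by using a smearing estimate for the error term, with residuals $\hat{\epsilon}_{ij} = \log(Y_{ij}) - \mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} - \hat{\gamma}_i$:

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}_\gamma^2\hat{\sigma}^2}{2(\hat{\sigma}^2 + n_i\hat{\sigma}_\gamma^2)}\right)\frac{1}{n}\sum_{k=1}^{m}\sum_{l=1}^{n_k}\exp(\hat{\epsilon}_{kl}). \qquad (13)$$

Another variant, considering a different variance $\hat{\sigma}_i^2$ in each ith cluster, and the corresponding smearing estimate, is:

$$\hat{Y}_{ij} = \exp(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \hat{\gamma}_i)\exp\left(\frac{\hat{\sigma}_\gamma^2\hat{\sigma}_i^2}{2(\hat{\sigma}_i^2 + n_i\hat{\sigma}_\gamma^2)}\right)\frac{1}{n_i}\sum_{l=1}^{n_i}\exp(\hat{\epsilon}_{il}). \qquad (14)$$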

Experimental results

In this section, the proposed correction terms for a RIM with heteroscedasticity in a logarithmic scale are applied to simulation-based scenarios and to a real dataset on the income of elderly people.

Simulation-based experiment

A simulation-based experiment is conducted to analyse the correction terms proposed in this paper. The goal of this simulation experiment is to demonstrate that the implementation of the proposed approach works properly and, therefore, that the real values are adequately recovered by the estimated ones. We generated one hundred datasets for each of several scenarios; the generated covariates and general structure are as follows. A set of m clusters having $n_i$ observations each, for $i = 1, \ldots, m$, is simulated. For balanced designs, $m \in \{50, 100\}$ and $n_i \in \{10, 20\}$ ∀i. For unbalanced designs there are two scenarios, one with $n_i \in \{11, 12, \ldots, 50\}$ and m = 40, and another with $n_i \in \{11, 12, \ldots, 90\}$ and m = 80. The variables $x_{ijl}$ are randomly generated from a uniform distribution U(0, 1), for $l = 1, \ldots, p$ and for each of the n observations, where p = 3 and $n = \sum_{i=1}^{m} n_i$. In order to include an intercept term, $x_{ij1} = 1$. These values are the entries in the design matrix $\mathbf{X}$ of dimension n × p. The regression parameter vector is $\boldsymbol{\beta} = (0.8, 1.3, -0.7)'$. The intercept random effects $\gamma_i$, for $i = 1, \ldots, m$, are generated from a normal distribution with mean 0 and variance $\sigma_\gamma^2$, $\sigma_\gamma^2 \in \{0.2, 0.4\}$. In order to include heteroscedasticity, fixed values were proposed for the weights $w_{ij}$. They have been deterministically assigned as $w_{ij} = (i + 1)/10 + j/1000$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. The error terms vector $\boldsymbol{\epsilon}$ is generated from a multivariate normal distribution $N_n(\mathbf{0}, \mathbf{R})$, where $\mathbf{R} = \mathrm{diag}(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_m)$, $\boldsymbol{\Sigma}_i$ is the diagonal matrix of the error variances $\sigma_{ij}^2$ of cluster i, and $\sigma^2 \in \{0.2, 0.4\}$. Note that, by using properties of the multivariate normal distribution, it is also possible to generate $\boldsymbol{\epsilon}$ in the following way: first simulate $\boldsymbol{\epsilon}^*$ from a multivariate standard normal distribution $N_n(\mathbf{0}, \mathbf{I})$, or equivalently generate each $\epsilon_{ij}^*$ from a univariate standard normal distribution N(0, 1); then set $\boldsymbol{\epsilon} = \mathbf{R}^{1/2}\boldsymbol{\epsilon}^*$, where $\mathbf{R}^{1/2}$ is such that $\mathbf{R} = \mathbf{R}^{1/2}\mathbf{R}^{1/2}$, or equivalently $\epsilon_{ij} = \sigma_{ij}\epsilon_{ij}^*$. Finally, the response variable in the logarithmic scale is obtained from (1), that is, by substituting the simulated values in $\log(Y_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \gamma_i + \epsilon_{ij}$; thus, the response variable is obtained as $Y_{ij} = \exp(\log(Y_{ij}))$.

Fig 1 shows one simulated dataset. The plots show that the response variable log(Y) has a linear relation with X (left panel). In contrast, as sometimes occurs in practice, a logarithmic transformation is needed on Y to get a linear relationship with the explanatory variables (right panel).
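The data-generating steps for one balanced scenario can be sketched as below. This is a hedged sketch, not the authors' simulation code: in particular, it assumes that the error variance of an observation is the common variance divided by its weight, which is an assumption about how the weights enter the model; the function name `simulate_rim` and the seed are hypothetical.

```python
import math
import random


def simulate_rim(m, n_per_cluster, var_eps, var_gamma,
                 beta=(0.8, 1.3, -0.7), seed=0):
    """Simulate one dataset from a log-scale random intercept model.

    ASSUMPTION: the error variance of observation (i, j) is var_eps / w_ij.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(1, m + 1):
        gamma_i = rng.gauss(0.0, math.sqrt(var_gamma))   # cluster random intercept
        for j in range(1, n_per_cluster[i - 1] + 1):
            x = (1.0, rng.random(), rng.random())        # intercept plus two U(0, 1) covariates
            w = (i + 1) / 10 + j / 1000                  # deterministic weight
            eps = math.sqrt(var_eps / w) * rng.gauss(0.0, 1.0)
            log_y = sum(b * xv for b, xv in zip(beta, x)) + gamma_i + eps
            rows.append({"cluster": i, "x": x, "w": w,
                         "log_y": log_y, "y": math.exp(log_y)})
    return rows


# One balanced scenario: m = 50 clusters with 10 observations each.
data = simulate_rim(m=50, n_per_cluster=[10] * 50, var_eps=0.2, var_gamma=0.2)
print(len(data), "observations; first row:", data[0])
```

An unbalanced scenario can be obtained by passing, for example, `n_per_cluster=list(range(11, 51))` with `m=40`.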

Fig 1

Simulated data.

Fig 2 shows the simulated response variables log(Y) and Y, compared with their estimated responses. The red squares represent the naive estimates without correction terms in (8), and the blue triangles represent the estimated values obtained by using the correction terms in (11). Note that, in general, the estimates from the naive estimator are lower than the estimates obtained by using the proposed correction terms in (11), showing that the naive estimator underestimates the real values.

Fig 2

Simulated response vs. estimated response.

Multiple data sets were generated according to the specifications provided in the above paragraphs, and the model's performance was analyzed by using the mean squared error (MSE), given by

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(Y_{ij} - \hat{Y}_{ij}\right)^2.$$

Table 1 shows the means and standard deviations (sd) associated with the MSE for the one hundred datasets simulated for each scenario, defined according to different values of m, $n_i$, $\sigma^2$, and $\sigma_\gamma^2$. The MSE is computed by using different estimates: first by using the naive estimator of (8) (column MSE_naive), and then by using the correction terms of (5), (9), (11), and (13) (columns MSE₍₅₎, MSE₍₉₎, MSE₍₁₁₎, and MSE₍₁₃₎, respectively). Finally, a GLMM with a gamma distribution and a logarithmic link is fitted (column MSE_Gamma), this being an alternative for modeling positive skewed variables that avoids fitting a transformed response in a LMM.
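The MSE criterion is straightforward to compute once observed and predicted values in the original scale are available. A minimal sketch, with a hypothetical function name and toy numbers:

```python
def mse(observed, predicted):
    """Mean squared error between observed and predicted responses (original scale)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


# Toy values: squared errors are 0, 1 and 4, so the MSE is 5/3.
print(mse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))
```

In the simulation study this quantity would be computed once per dataset and per estimator, and then summarized by its mean and standard deviation over the one hundred datasets.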

Table 1

Summary of the MSE for different values of m, n_i, σ², and σ_γ² for data simulated from a RIM in a logarithmic scale.

m	n_i	σ²	σ_γ²	MSE_naive mean (sd)	MSE₍₅₎ mean (sd)	MSE₍₉₎ mean (sd)	MSE₍₁₁₎ mean (sd)	MSE₍₁₃₎ mean (sd)	MSE_Gamma mean (sd)
Balanced design
50	10	0.2	0.2	4.4 (2.4)	8.2 (3.3)	4.0 (2.2)	4.0 (2.2)	4.2 (2.3)	4.6 (2.4)
50	10	0.2	0.4	6.4 (4.2)	16.1 (6.6)	5.8 (3.6)	5.8 (3.6)	6.1 (4.0)	6.6 (4.2)
50	10	0.4	0.2	26.2 (81.7)	29.5 (80.0)	24.0 (79.0)	23.9 (78.8)	25.3 (81.1)	25.0 (79.2)
50	10	0.4	0.4	171.4 (1233.8)	181.8 (1221.8)	166.2 (1221.5)	165.9 (1220.1)	169.3 (1228.7)	160.7 (1163.1)
50	20	0.2	0.2	5.4 (5.8)	9.3 (6.1)	5.0 (5.4)	5.0 (5.4)	5.2 (5.7)	5.3 (5.6)
50	20	0.2	0.4	8.9 (21.2)	21.2 (36.1)	8.1 (17.9)	8.1 (17.9)	8.6 (20.3)	8.8 (22.1)
50	20	0.4	0.2	19.7 (18.4)	23.3 (18.0)	17.5 (16.2)	17.5 (16.1)	18.9 (17.9)	18.1 (16.9)
50	20	0.4	0.4	34.1 (61.3)	45.0 (61.9)	31.1 (57.4)	31.1 (57.2)	33.0 (60.6)	31.5 (57.2)
100	10	0.2	0.2	2.9 (2.3)	6.7 (2.7)	2.7 (2.1)	2.7 (2.1)	2.8 (2.2)	3.0 (2.3)
100	10	0.2	0.4	4.0 (5.5)	14.0 (7.3)	3.7 (4.8)	3.7 (4.7)	3.9 (5.4)	4.1 (5.4)
100	10	0.4	0.2	19.0 (67.6)	22.6 (68.1)	17.7 (64.3)	17.6 (64.0)	18.6 (67.1)	18.2 (64.5)
100	10	0.4	0.4	11.9 (10.3)	22.0 (11.9)	10.6 (9.1)	10.6 (9.0)	11.4 (10.0)	11.1 (9.4)
100	20	0.2	0.2	2.9 (1.6)	6.5 (2.0)	2.7 (1.4)	2.7 (1.4)	2.9 (1.5)	2.9 (1.5)
100	20	0.2	0.4	3.7 (2.6)	13.4 (4.6)	3.4 (2.3)	3.4 (2.3)	3.6 (2.6)	3.6 (2.5)
100	20	0.4	0.2	10.5 (13.2)	14.0 (13.0)	9.6 (12.8)	9.6 (12.8)	10.3 (13.1)	9.8 (12.7)
100	20	0.4	0.4	27.2 (92.0)	37.0 (91.6)	25.3 (90.1)	25.3 (90.0)	26.7 (91.7)	25.4 (87.3)
Unbalanced design
40	{11,…,50}	0.2	0.2	3.1 (1.6)	6.7 (2.3)	2.9 (1.4)	2.9 (1.4)	3.0 (1.6)	3.1 (1.6)
40	{11,…,50}	0.2	0.4	4.9 (3.1)	15.2 (7.3)	4.6 (2.8)	4.6 (2.8)	4.8 (3.1)	4.9 (3.0)
40	{11,…,50}	0.4	0.2	14.7 (20.2)	18.2 (19.3)	13.5 (18.6)	13.4 (18.5)	14.3 (20.0)	13.9 (19.1)
40	{11,…,50}	0.4	0.4	15.8 (15.3)	25.7 (15.6)	14.3 (14.2)	14.3 (14.1)	15.2 (15.1)	14.7 (14.4)
80	{11,…,90}	0.2	0.2	1.4 (0.7)	5.0 (1.4)	1.4 (0.6)	1.4 (0.6)	1.4 (0.7)	1.4 (0.7)
80	{11,…,90}	0.2	0.4	2.2 (1.2)	12.7 (6.2)	2.1 (1.0)	2.1 (1.0)	2.2 (1.2)	2.2 (1.2)
80	{11,…,90}	0.4	0.2	7.9 (37.5)	11.6 (37.5)	7.6 (36.8)	7.6 (36.8)	7.8 (37.4)	7.6 (36.2)
80	{11,…,90}	0.4	0.4	8.9 (15.1)	18.9 (16.1)	8.2 (14.3)	8.2 (14.3)	8.7 (15.0)	8.4 (14.2)

The means and standard deviations (sd) of the MSE are computed by using the estimates $\hat{Y}_{ij}$ given by the naive estimator, the correction terms of (5), (9), (11), and (13), and a GLMM with gamma distribution and logarithmic link, respectively.

From these simulation scenarios it is shown that, assuming heteroscedasticity, the best estimations, with the lowest MSE mean and sd, are in general those obtained by using the correction terms given in (11). Standard deviations are always larger in column MSE(. Moreover, in just one case the means and sd's in column MSE_Gamma are lower than the others.

Another type of dataset was generated from a GLMM with a gamma distribution associated with the response Y and a logarithmic link. The values of the parameters are similar to the ones used in the previous simulation experiment concerning the RIM in a logarithmic scale, having analogous balanced and unbalanced designs with the same values of m, $n_i$, $\mathbf{x}_{ij}$, p, $\boldsymbol{\beta}$, and $\sigma_\gamma^2$, and including the heteroscedasticity terms $w_{ij}$. The response variable of the GLMM with gamma distribution and logarithmic link is thus generated from a gamma distribution, where the probability density function of Y ∼ Gamma(shape = a, scale = s) is given by $f(y) = y^{a-1}e^{-y/s}/(\Gamma(a)s^{a})$, y > 0, a > 0, s > 0. The shape parameter a depends on α, which was chosen as α ∈ {1, 1.5, 5}. The purpose of simulating data based on a GLMM with gamma distribution and logarithmic link was to see how our approach works even when the true distribution associated with the data is not Gaussian. Moreover, these simulations are based on a model extensively used for positive skewed distributions, this model being an alternative to fitting a LMM on the transformed response. In fact, for some particular values assigned to the shape and scale parameters, the distribution associated with the data was similar to that observed for the LMM in the logarithmic scale.

Table 2 shows the means and standard deviations (sd) associated with the MSE for the one hundred datasets simulated for each scenario, each one defined according to different values of m, $n_i$, α, and $\sigma_\gamma^2$, and assuming heteroscedasticity. From these scenarios, it is shown that when the shape-related parameter α is much bigger than 1, the best estimations, those having the lowest MSE mean and sd, are in general those obtained by using the correction terms given in (11). Hence, in this case, the estimations obtained by using the RIM in a logarithmic scale and the correction terms are good, even better than those obtained using a GLMM with a gamma distribution and logarithmic link. However, when α is close to 1 the estimations obtained by using the RIM in a logarithmic scale are worse, which makes sense, since a gamma distribution with parameter α = 1 is an exponential distribution, which differs completely from a log-normal distribution.

Table 2

Summary of the MSE for different values of m, n_i, α, and σ_γ² for data simulated from a GLMM with gamma distribution and logarithmic link.

m	n_i	α	σ_γ²	MSE_naive mean (sd)	MSE₍₅₎ mean (sd)	MSE₍₉₎ mean (sd)	MSE₍₁₁₎ mean (sd)	MSE₍₁₃₎ mean (sd)	MSE_Gamma mean (sd)
Balanced design
50	10	1	0.2	13.9 (11.0)	558.2 (869.0)	74.6 (74.4)	90.1 (93.2)	12.9 (10.7)	12.0 (10.0)
50	10	1	0.4	23.9 (29.2)	577.4 (895.6)	51.4 (57.3)	64.0 (75.6)	23.7 (28.7)	19.8 (23.8)
50	10	1.5	0.2	9.6 (6.0)	29.8 (11.3)	10.7 (5.1)	11.3 (5.3)	8.9 (5.7)	8.5 (5.2)
50	10	1.5	0.4	14.0 (9.6)	40.2 (19.1)	13.1 (7.9)	13.7 (8.2)	13.0 (8.8)	12.0 (7.9)
50	10	5	0.2	2.8 (1.3)	6.2 (1.8)	2.6 (1.1)	2.6 (1.1)	2.7 (1.2)	3.0 (1.4)
50	10	5	0.4	4.5 (3.4)	14.3 (7.9)	4.1 (2.9)	4.1 (2.9)	4.3 (3.2)	4.6 (3.4)
50	20	1	0.2	15.2 (8.5)	311.7 (251.9)	21.9 (12.7)	24.0 (14.6)	16.2 (8.5)	12.9 (6.7)
50	20	1	0.4	26.1 (45.9)	366.9 (187.1)	29.5 (35.3)	31.8 (36.1)	29.5 (45.5)	21.8 (36.5)
50	20	1.5	0.2	9.0 (2.4)	24.0 (6.6)	8.6 (2.0)	8.7 (2.0)	8.6 (2.3)	7.9 (2.0)
50	20	1.5	0.4	13.9 (7.3)	38.4 (14.3)	12.5 (6.0)	12.6 (6.0)	13.7 (6.6)	12.0 (6.0)
50	20	5	0.2	2.7 (1.0)	6.1 (1.6)	2.5 (0.9)	2.5 (0.9)	2.6 (1.0)	2.7 (1.0)
50	20	5	0.4	4.3 (2.6)	14.7 (8.1)	3.9 (2.2)	3.9 (2.1)	4.1 (2.4)	4.2 (2.6)
100	10	1	0.2	8.5 (7.8)	89.5 (54.6)	16.1 (8.8)	18.1 (10.1)	8.1 (7.8)	7.5 (7.1)
100	10	1	0.4	12.2 (7.0)	110.9 (52.5)	15.4 (9.1)	17.3 (11.5)	11.9 (6.8)	10.5 (6.0)
100	10	1.5	0.2	5.6 (2.7)	14.9 (4.1)	5.7 (2.2)	5.9 (2.2)	5.3 (2.6)	5.1 (2.4)
100	10	1.5	0.4	7.8 (3.9)	24.4 (6.9)	7.3 (3.3)	7.5 (3.3)	7.5 (3.7)	7.0 (3.4)
100	10	5	0.2	1.6 (0.5)	5.2 (1.0)	1.5 (0.5)	1.5 (0.4)	1.6 (0.5)	1.7 (0.6)
100	10	5	0.4	2.4 (1.3)	12.6 (4.7)	2.2 (1.1)	2.2 (1.0)	2.3 (1.2)	2.5 (1.4)
100	20	1	0.2	8.6 (3.2)	69.7 (22.0)	10.0 (4.0)	10.5 (4.4)	8.5 (3.3)	7.5 (2.7)
100	20	1	0.4	15.3 (13.5)	86.0 (28.6)	14.1 (11.5)	14.3 (11.5)	15.8 (13.9)	13.1 (11.4)
100	20	1.5	0.2	5.7 (1.8)	13.5 (2.2)	5.2 (1.4)	5.3 (1.4)	5.5 (1.7)	5.0 (1.5)
100	20	1.5	0.4	10.3 (10.8)	25.1 (9.2)	9.0 (8.9)	9.1 (8.9)	10.1 (10.6)	9.0 (9.1)
100	20	5	0.2	1.6 (0.6)	5.0 (1.0)	1.5 (0.5)	1.5 (0.5)	1.6 (0.6)	1.6 (0.6)
100	20	5	0.4	2.5 (1.3)	11.8 (4.5)	2.3 (1.1)	2.3 (1.1)	2.4 (1.2)	2.4 (1.2)
Unbalanced design
40	{11,…,50}	1	0.2	12.6 (10.7)	75.8 (33.8)	18.1 (10.9)	19.8 (11.8)	11.9 (10.7)	11.1 (9.7)
40	{11,…,50}	1	0.4	17.9 (7.5)	134.6 (288.9)	19.4 (9.6)	21.1 (12.0)	35.0 (177.4)	15.2 (6.0)
40	{11,…,50}	1.5	0.2	7.7 (2.0)	15.0 (3.8)	7.4 (1.6)	7.6 (1.7)	7.2 (1.9)	6.9 (1.7)
40	{11,…,50}	1.5	0.4	11.5 (6.6)	26.6 (14.1)	10.5 (5.6)	10.6 (5.6)	10.8 (6.0)	10.2 (5.8)
40	{11,…,50}	5	0.2	2.2 (0.6)	5.7 (2.4)	2.1 (0.5)	2.1 (0.5)	2.1 (0.6)	2.2 (0.6)
40	{11,…,50}	5	0.4	3.2 (1.1)	12.0 (5.6)	3.0 (1.1)	3.0 (1.1)	3.1 (1.1)	3.2 (1.1)
80	{11,…,90}	1	0.2	5.9 (1.4)	18.5 (4.4)	6.1 (1.4)	6.3 (1.5)	5.7 (1.4)	5.4 (1.2)
80	{11,…,90}	1	0.4	8.0 (2.1)	28.3 (8.8)	7.8 (2.1)	7.9 (2.2)	7.8 (2.0)	7.3 (1.9)
80	{11,…,90}	1.5	0.2	3.7 (0.7)	8.0 (1.6)	3.5 (0.6)	3.6 (0.6)	3.6 (0.7)	3.4 (0.7)
80	{11,…,90}	1.5	0.4	6.0 (2.1)	17.1 (6.3)	5.5 (1.7)	5.5 (1.7)	5.7 (2.0)	5.5 (1.8)
80	{11,…,90}	5	0.2	1.1 (0.2)	4.5 (1.3)	1.0 (0.2)	1.0 (0.2)	1.1 (0.2)	1.1 (0.2)
80	{11,…,90}	5	0.4	1.6 (0.5)	11.3 (5.3)	1.5 (0.4)	1.5 (0.4)	1.6 (0.4)	1.6 (0.5)

The means and standard deviations (sd) of the MSE are computed by using the estimates $\hat{Y}_{ij}$ given by the naive estimator, the correction terms of (5), (9), (11), and (13), and a GLMM with gamma distribution and logarithmic link, respectively.

Income for elderly people data application

Returning to our motivation example, we performed analyses based on the National Household Income and Expenditure Survey (Encuesta Nacional de Ingresos y Gastos de los Hogares, ENIGH) 2016 [26], a biennial study to examine income and its distribution in Mexico. Elderly people were considered (60 or more years old). Quaterly total income, that is the income considering all possible sources of income, was obtained for each person as a response variable. Household and sociodemographic information was considered as well. To avoid presence of outliers, only people with an income between 2,000 and 40,000 Mexican pesos were considered. Hence, a total of n = 18, 512 participants were included in the analyses. As already mentioned in the Introduction section, a logarithmic scale was used for the response variable. To help deciding which variables to use as explanatory, we first fitted linear regression models. According to the obtained results, some variables were modified (categories collapsed) or generated using information from other questions. The final linear model in which we are based upon has a coefficient of determination of 0.35. The sociodemographic explanatory variables included in the RIM are: sex, indigeneous (1 = Yes, 2 = No), knowing how to read and write a note (1 = Yes, 2 = No), level of education (0 = None to 9 = Ph.D.), marital status (0 = Without a partner, 1 = With a partner), having a health service provider (1 = Yes, 2 = No), work (1 = Looking for a job, 2 = Retired, 3 = Domestic chores, 4 = Other situation, 5 = Can not work, 6 = Working), disability (0 = Without, 1 = With), and contribution to social security in all their lives (1 = Yes, 2 = No). 
At the household level, the explanatory variables are: number of rooms, presence of a toilet (wc) (1 = Yes, 2 = No), number of light bulbs, household ownership (1 = Rented, 2 = Borrowed, 3 = Owner but paying it, 4 = Owner, 5 = Intestate, 6 = Another situation), number of residents, type of location of the household (0 = Rural, 1 = Urban; a location is considered urban when it has 2,500 or more residents), socioeconomic stratum (1 = Low, 2 = Low medium, 3 = High medium, 4 = High), and flooring material (1 = Ground, 2 = Cement, 3 = Wood, mosaic, or another floor covering). Since individuals are nested in each of the 32 states, a random intercept for state was included, each state having between 400 and 1000 observations. The fixed-effects estimates for the RIM with homoscedastic errors are shown in Table 3. The estimated standard deviation of the random effect is approximately 0.08, and that of the error term is approximately 0.6. A likelihood ratio test comparing the RIM with a model without the random effect (i.e., with the variance of the random intercept equal to zero) gave a p-value of less than 0.05 (this value is even smaller when divided by two, a calculation that must be made because the null hypothesis places the parameter on the boundary of the parameter space). Hence, a random effect is necessary, and a linear regression model without random effects, which we defined in the Introduction section as a first option for these data, should not be fitted.
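
The halving of the p-value mentioned above can be illustrated with a short sketch. It assumes the usual boundary result for testing a single variance component, in which the null distribution of the likelihood ratio statistic is an equal mixture of a point mass at zero and a chi-square distribution with one degree of freedom. The function names and the log-likelihood values are hypothetical and are not taken from the analysis.

```python
import math


def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    # X = Z^2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x)).
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lrt_pvalues(loglik_null, loglik_alt):
    """Return (unadjusted p-value, boundary-adjusted p-value).

    Assumption: one variance component is tested against zero, so the
    reference distribution is 0.5 * chi2(0) + 0.5 * chi2(1).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    if stat == 0.0:
        return 1.0, 1.0
    unadjusted = chi2_sf_1df(stat)
    return unadjusted, 0.5 * unadjusted


# Hypothetical log-likelihoods, for illustration only.
p_unadjusted, p_adjusted = boundary_lrt_pvalues(loglik_null=-17050.0,
                                                loglik_alt=-17040.0)
print(p_unadjusted, p_adjusted)
```

Because the adjusted p-value is half of the unadjusted one whenever the statistic is positive, a test that already rejects at the 0.05 level without the adjustment also rejects with it.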

Table 3

Parameter estimations for the RIM associated with income in a logarithmic scale for elderly people data in 2016.

Variable	Value	Std. Error	DF	t-value	p-value
Intercept	9.128	0.043	18261	211.533	<0.001
Sociodemographic variables
Woman	-0.176	0.011	18261	-15.476	<0.001
Not indigenous	0.042	0.010	18261	4.002	<0.001
Not knowing how to write/read	-0.064	0.017	18261	-3.881	<0.001
Level of education: Preschool	-0.069	0.106	18261	-0.654	0.513
Level of education: Elementary	0.047	0.016	18261	2.924	0.004
Level of education: Junior high	0.171	0.021	18261	8.002	<0.001
Level of education: High school	0.347	0.031	18261	11.088	<0.001
Level of education: Teacher’s school	0.760	0.047	18261	16.087	<0.001
Level of education: Technician	0.222	0.028	18261	7.837	<0.001
Level of education: Bachelor’s degree	0.443	0.028	18261	15.580	<0.001
Level of education: Master’s degree	0.625	0.079	18261	7.949	<0.001
Level of education: Ph.D.	0.685	0.181	18261	3.793	<0.001
With a partner	-0.057	0.010	18261	-5.682	<0.001
No health service provider	-0.160	0.011	18261	-14.329	<0.001
Work: Looking for a job	-0.465	0.052	18261	-8.989	<0.001
Work: Retired	-0.120	0.013	18261	-9.005	<0.001
Work: Domestic chores	-0.415	0.014	18261	-30.620	<0.001
Work: Other situation	-0.452	0.023	18261	-19.800	<0.001
Work: Can not work	-0.407	0.024	18261	-17.014	<0.001
With disability	-0.075	0.010	18261	-7.430	<0.001
No contribution social security	-0.201	0.012	18261	-17.043	<0.001
Household level variables
Number of rooms	0.023	0.004	18261	6.166	<0.001
No wc	-0.106	0.029	18261	-3.578	<0.001
Total number of light bulbs	0.014	0.001	18261	10.972	<0.001
Ownership: Borrowed	-0.130	0.026	18261	-4.922	<0.001
Ownership: Owner but paying	-0.089	0.034	18261	-2.611	0.009
Ownership: Owner	-0.092	0.023	18261	-4.023	<0.001
Ownership: Intestate	-0.136	0.039	18261	-3.525	<0.001
Ownership: Another situation	-0.115	0.059	18261	-1.961	0.050
Number of residents	-0.014	0.002	18261	-6.271	<0.001
Urban	-0.005	0.012	18261	-0.402	0.688
Stratum: Low-medium	0.096	0.014	18261	6.810	<0.001
Stratum: High-medium	0.074	0.020	18261	3.693	<0.001
Stratum: High	0.144	0.029	18261	5.034	<0.001
Floor: Cement	0.085	0.026	18261	3.222	0.001
Floor: Wood, mosaic or other	0.183	0.028	18261	6.497	<0.001

Fig 3 shows the histogram and qq-plot of the residuals. They indicate that the normality assumption is satisfied; the same holds when the qq-plot of the random effects is examined.

Fig 3

Residuals.

Left: Histogram of the residuals. Middle: qq-plot of the residuals. Right: qq-plot of the residuals of the random effects.

Fig 4 shows the fitted values for the RIM associated with income in a logarithmic scale for the elderly people data in 2016. The red squares represent the naive estimates without correction terms, and the blue triangles represent the values estimated using the correction terms in (12), a particular case of (11), which, according to the simulation results, give the best estimates (lowest MSE). Note that the estimates derived through the naive estimator are in general lower than those derived through the proposed correction terms in (12), showing that the naive estimator underestimates the data. In terms of the options discussed in the Introduction section, the naive estimates and those including the correction terms correspond to the second and fourth options, respectively. When the naive estimates are compared with the true values, the square root of the mean squared error is 7027.784, whereas with the correction factor given in (12) it is 6829.003, which is an improvement. As for the third option discussed in the Introduction, we fitted a GLMM using a gamma distribution for the response variable, a logarithmic link function, and both penalised quasi-likelihood (PQL) and Laplace approximation methods; we checked that the normality assumption for the estimated random effects is satisfied. We obtained values for the square root of the mean squared error of 6973.41 and 6979.769 under the PQL and Laplace methods, respectively. Hence, in this example the estimates under the correction terms are even more precise than those obtained using a GLMM. 
We also fitted models under several heteroscedasticity schemes, for instance using the ELL method, the cluster size, or the squared residuals, but only the first of these (the ELL method) gave a lower mean squared error than the homoscedastic scheme; however, under that scheme the normality assumption was not satisfied.
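
The direction of the comparison above (naive predictions lower than corrected ones, and a larger root mean squared error for the naive predictor) can be reproduced in a toy setting. The following is a minimal sketch, not the analysis of the survey data: it assumes a single log-normal population with known parameters, and the values `mu = 9.0` and `sigma = 0.6` are illustrative choices that only loosely echo the fitted intercept and error standard deviation.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def compare_predictors(n=20000, mu=9.0, sigma=0.6, seed=1):
    """Compare exp(mu) with exp(mu + sigma^2 / 2) as predictors of Y = exp(mu + eps)."""
    rng = random.Random(seed)
    y = [math.exp(mu + rng.gauss(0.0, sigma)) for _ in range(n)]
    naive = math.exp(mu)                          # back-transformed mean of log(Y)
    corrected = math.exp(mu + sigma ** 2 / 2.0)   # mean of a log-normal variable
    return {
        "mean_y": sum(y) / n,
        "naive": naive,
        "corrected": corrected,
        "rmse_naive": rmse(y, [naive] * n),
        "rmse_corrected": rmse(y, [corrected] * n),
    }


result = compare_predictors()
for key, value in result.items():
    print(key, round(value, 1))
```

In this setting the naive predictor is the median of the log-normal distribution rather than its mean, which is why it falls below the sample mean of the simulated responses.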

Fig 4

Fitted values for the RIM associated with income in a logarithmic scale for elderly people data in 2016.

Red squares: naive estimates. Blue triangles: estimates obtained using the correction terms in (12).

Mimic simulation example

To validate our proposed correction terms for a RIM including heteroscedasticity in a logarithmic scale, we ran a simulation experiment based on 100 data sets of size 2000. The simulated data approximately mimic the motivating data on the income of elderly people, assuming two types of weights associated with heteroscedasticity: the cluster size and one of the explanatory variables, as is sometimes found in real data. Details are given in the S2 Text. Our simulation strategy generated the means and standard deviations of the MSE for each of the correction terms, considering the two types of weights and varying the values associated with the variance of the random effects and error terms; see Table 4. The estimates with the lowest MSE were those obtained using the correction terms associated with Eqs (9) and (11). See details and a table including more values in the Supplementary Material.

Table 4

Summary of the MSE for the mimic simulation example.

Weights	σ	σ_γ	MSE_naive mean (sd)	MSE₍₅₎ mean (sd)	MSE₍₉₎ mean (sd)	MSE₍₁₁₎ mean (sd)	MSE₍₁₃₎ mean (sd)
(1)	0.15	0.079	183.0 (5.9)	797.6 (104.2)	183.0 (5.9)	183.0 (5.9)	183.0 (5.9)
(1)	0.15	0.32	201.7 (16.2)	3383.0 (607.6)	201.7 (16.2)	201.7 (16.2)	201.7 (16.2)
(1)	0.15	1.28	753.6 (350.0)	35308.6 (19790.8)	753.8 (350.0)	753.8 (350.0)	753.7 (350.0)
(1)	0.594	0.079	725.3 (22.4)	1053.2 (84.1)	724.6 (22.3)	724.6 (22.3)	724.7 (22.3)
(1)	0.594	0.32	796.7 (67.6)	3432.5 (696.9)	796.0 (67.5)	796.0 (67.5)	796.0 (67.5)
(1)	0.594	1.28	2994.2 (2264.1)	34012.2 (27261.6)	2991.4 (2262.3)	2991.3 (2262.3)	2991.9 (2262.5)
(1)	1.2	0.079	1495.4 (49.6)	1680.7 (68.5)	1490.5 (49.1)	1490.4 (49.1)	1490.7 (49.1)
(1)	1.2	0.32	1634.4 (107.4)	3713.1 (592.2)	1629.5 (106.4)	1629.5 (106.4)	1629.9 (106.5)
(1)	1.2	1.28	6852.0 (5789.4)	38593.9 (36263.5)	6840.1 (5851.4)	6840.2 (5852.4)	6844.1 (5869.6)
(2)	0.15	0.079	540.1 (12.1)	928.9 (88.2)	539.9 (12.1)	539.9 (12.1)	539.9 (12.1)
(2)	0.15	0.32	592.8 (38.2)	3306.8 (562.7)	592.5 (38.2)	592.5 (38.2)	592.6 (38.2)
(2)	0.15	1.28	2290.2 (1369.8)	35322.8 (24328.5)	2288.1 (1367.9)	2288.1 (1367.9)	2289.4 (1369.4)
(2)	0.594	0.079	2264.4 (81.6)	2387.4 (86.0)	2245.5 (78.5)	2245.5 (78.5)	2251.4 (80.5)
(2)	0.594	0.32	2506.9 (206.4)	4311.0 (683.5)	2484.7 (200.4)	2484.7 (200.3)	2491.3 (203.9)
(2)	0.594	1.28	8994.3 (4847.5)	34063.3 (21229.0)	8909.1 (4786.8)	8908.9 (4786.4)	8934.1 (4792.5)
(2)	1.2	0.079	5546.3 (647.1)	5472.5 (622.3)	5412.8 (629.8)	5412.8 (629.7)	5453.8 (648.7)
(2)	1.2	0.32	6289.4 (1296.7)	7195.0 (1295.5)	6122.4 (1257.0)	6122.4 (1256.9)	6182.5 (1301.5)
(2)	1.2	1.28	23639.7 (17407.2)	47081.2 (37670.3)	23141.8 (17392.3)	23143.1 (17395.0)	23264.0 (17244.1)

(1) Size of each cluster. (2) Total number of light bulbs.
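
The structure of such a simulation (replicated data sets, then the mean and standard deviation of the MSE per estimator) can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the S2 Text: the errors are homoscedastic, the clusters are balanced, and the true parameters are treated as known instead of being estimated by fitting a model. All numerical values and function names are hypothetical. Under these assumptions the predictor of the random intercept is the shrunken cluster mean residual, and its conditional variance is σ_γ²σ²/(nσ_γ² + σ²).

```python
import math
import random
import statistics


def one_replicate(rng, m=40, n=50, beta0=9.0, beta1=0.1, sigma=0.6, sigma_g=0.3):
    """Simulate one data set; return (MSE naive, MSE corrected) in the original scale."""
    denom = n * sigma_g ** 2 + sigma ** 2
    shrink = n * sigma_g ** 2 / denom             # weight of the cluster mean residual
    cond_var = sigma_g ** 2 * sigma ** 2 / denom  # Var(gamma_i | log Y_i)
    # Two correction terms: one for the random effect, one for the error.
    factor = math.exp(cond_var / 2.0 + sigma ** 2 / 2.0)
    sq_naive = sq_corrected = 0.0
    for _ in range(m):
        gamma = rng.gauss(0.0, sigma_g)
        fixed = [beta0 + beta1 * rng.uniform(0.0, 10.0) for _ in range(n)]
        log_y = [f + gamma + rng.gauss(0.0, sigma) for f in fixed]
        gamma_hat = shrink * sum(l - f for l, f in zip(log_y, fixed)) / n
        for f, l in zip(fixed, log_y):
            y = math.exp(l)
            naive = math.exp(f + gamma_hat)
            sq_naive += (y - naive) ** 2
            sq_corrected += (y - naive * factor) ** 2
    return sq_naive / (m * n), sq_corrected / (m * n)


def summarize(replicates=30, seed=2021):
    """Mean and sd of the MSE over replicated data sets, per estimator."""
    rng = random.Random(seed)
    results = [one_replicate(rng) for _ in range(replicates)]
    naive = [r[0] for r in results]
    corrected = [r[1] for r in results]
    return {
        "naive": (statistics.mean(naive), statistics.stdev(naive)),
        "corrected": (statistics.mean(corrected), statistics.stdev(corrected)),
    }


summary = summarize()
print(summary)
```

With heteroscedastic weights, the constant `sigma ** 2` would be replaced by an observation-specific variance, and the shrinkage and conditional variance would change accordingly.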

Generalization to linear mixed models and with functions different from the logarithm

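In this section, we generalize the correction terms to any LMM and to transformations other than the logarithm. We have seen that the estimators based on the conditional expectation of the random effects given the transformed response perform better; thus, we present only this type of estimator for a LMM, obtaining a closed formula.

A LMM includes q random effects; for instance, we can have random effects associated with some or all of the fixed effects. In matrix form, a LMM corresponds to

$$\log(Y) = X\beta + U\gamma + \epsilon,$$

where $X$ is the design matrix associated with the fixed effects, of dimension $n \times p$, and $\beta$ is the corresponding vector of parameters of dimension $p$. On the other hand, $\gamma = (\gamma_1, \ldots, \gamma_m)$ is a vector of dimension $mq$ of random effects, where $\gamma_i$ is a vector of dimension $q$ containing all random effects associated with cluster $i$, with distribution $\gamma \sim N(0, G)$, where $G$ is a block diagonal matrix of dimension $mq \times mq$, $G = \mathrm{diag}(D, D, \ldots, D)$, and $D$ is the variance and covariance matrix of dimension $q \times q$ of the random effects, which is assumed to be the same for all clusters. This term is multiplied by the matrix $U$, a block diagonal matrix of dimension $n \times mq$ given by $U = \mathrm{diag}(U_1, U_2, \ldots, U_m)$, with $U_i$ of dimension $n_i \times q$. The vector of errors has distribution $\epsilon \sim N(0, R)$, where $R$ is a block diagonal matrix of dimension $n \times n$ given by $R = \mathrm{diag}(\Sigma_1, \Sigma_2, \ldots, \Sigma_m)$, with $\Sigma_i$ a diagonal matrix of dimension $n_i \times n_i$ containing the error variances of the observations in cluster $i$. The error terms and random effects are assumed independent.

Considering an individual $j$ in cluster $i$, $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$, the expression analogous to (1) for a LMM is

$$\log(Y_{ij}) = x_{ij}^{\top}\beta + \mathbf{u}_{ij}^{\top}\gamma_i + \epsilon_{ij}, \tag{14}$$

where $x_{ij}^{\top}$ is the corresponding row of $X$ and $\mathbf{u}_{ij}^{\top}$ is the $j$th row of matrix $U_i$. From the joint distribution of the random effects and the transformed response $\log(Y_i)$, we obtain (see Proposition 2 in S1 File) that the variance and covariance matrix associated with cluster $i$, for $i = 1, \ldots, m$, is

$$\mathrm{Var}(\gamma_i \mid \log(Y_i)) = D - D U_i^{\top}\left(U_i D U_i^{\top} + \Sigma_i\right)^{-1} U_i D, \tag{15}$$

and $E[\gamma_i \mid \log(Y_i)] = \hat{\gamma}_i$, where $\hat{\gamma}_i$ is the best linear predictor of $\gamma_i$, $\hat{\gamma}_i = D U_i^{\top}\left(U_i D U_i^{\top} + \Sigma_i\right)^{-1}\left(\log(Y_i) - X_i\beta\right)$, with $X_i$ the rows of $X$ corresponding to cluster $i$.

Consequently, using the expected value of a log-normal distribution:

$$E\left[\exp(\mathbf{u}_{ij}^{\top}\gamma_i) \mid \log(Y_i)\right] = \exp\left(\mathbf{u}_{ij}^{\top}\hat{\gamma}_i + \tfrac{1}{2}\,\mathbf{u}_{ij}^{\top}\,\mathrm{Var}(\gamma_i \mid \log(Y_i))\,\mathbf{u}_{ij}\right). \tag{16}$$

Thus, to estimate $E[Y_{ij}]$ in cluster $i$, $i = 1, \ldots, m$, for an individual $j$, $j = 1, \ldots, n_i$, where $Y_{ij}$ is modeled as in (14), we use the estimator for the random error $\epsilon_{ij}$, multiplied by the expected value of the random effects conditional on the response calculated in (16), and the constant part $\exp(x_{ij}^{\top}\beta)$. The estimator corresponds to

$$\hat{Y}_{ij} = \exp\left(x_{ij}^{\top}\hat{\beta}\right)\,\exp\left(\mathbf{u}_{ij}^{\top}\hat{\gamma}_i + \tfrac{1}{2}\,\mathbf{u}_{ij}^{\top}\,\widehat{\mathrm{Var}}(\gamma_i \mid \log(Y_i))\,\mathbf{u}_{ij}\right)\,\exp\left(\tfrac{1}{2}\hat{\sigma}_{ij}^{2}\right), \tag{17}$$

where $\hat{\sigma}_{ij}^{2}$ denotes the estimated variance of $\epsilon_{ij}$, that is, the $j$th diagonal entry of $\hat{\Sigma}_i$. In (17), all terms are known once the estimated variance and covariance terms in $D$ and $\Sigma_i$, both part of $\mathrm{Var}(\gamma_i \mid \log(Y_i))$, are substituted. These terms and $\hat{\beta}$ are obtained after fitting the model.

For instance, consider a model including random effects associated with the intercept and a variable $u$. For each cluster $i = 1, \ldots, m$, $\gamma_i = (\gamma_{0i}, \gamma_{1i})'$, with $\gamma_{0i}$ and $\gamma_{1i}$ scalars corresponding to the random effects for the intercept and variable $u$, respectively. The values of variable $u$ in cluster $i$ can be arranged in vector form as $u_i = (u_{i1}, \ldots, u_{in_i})'$; thus $U_i$ is a matrix of dimension $n_i \times 2$ such that $U_i = (\mathbf{1}_{n_i}, u_i)$, where the vector of ones $\mathbf{1}_{n_i}$ corresponds to the intercept. Finally,

$$D = \begin{pmatrix} \sigma_{\gamma_0}^{2} & \sigma_{\gamma_0\gamma_1} \\ \sigma_{\gamma_0\gamma_1} & \sigma_{\gamma_1}^{2} \end{pmatrix}, \tag{18}$$

where $\sigma_{\gamma_0}^{2}$ and $\sigma_{\gamma_1}^{2}$ are the variances of the random effects for the intercept and variable $u$, respectively, and $\sigma_{\gamma_0\gamma_1}$ is the corresponding covariance. The explicit form of (15) in this case is easily derived by substituting $U_i = (\mathbf{1}_{n_i}, u_i)$ and $D$ given in (18). The result can be substituted in expression (17) using estimates of the variance and covariance parameters, values obtained after fitting the LMM in any statistical software.

We could consider a transformation more general than the logarithm, for instance a Box-Cox transformation $g$, whose inverse follows a power-normal distribution. Each observation $Y_{ij}$, for $j = 1, \ldots, n_i$ and $i = 1, \ldots, m$, associated with a LMM under a Box-Cox transformation with parameter $\lambda$, $g(Y_{ij})$, satisfies $g(Y_{ij}) \sim N(\mu_{ij}, \tau_{ij}^{2})$ with $\mu_{ij} = x_{ij}^{\top}\beta$ and $\tau_{ij}^{2} = \mathbf{u}_{ij}^{\top} D\,\mathbf{u}_{ij} + \sigma_{ij}^{2}$. The expected value $E[X]$ of a power-normal distribution, in this case $E[Y_{ij}]$, is calculated in [27] (Lemma 1). After substituting the estimated parameters, this expression gives one class of corrected predictions in the original scale, namely the one that does not condition the random effects on the sample. For instance, for $\lambda = 0$, the expected value given in [27] is $\exp(\mu_{ij} + \tau_{ij}^{2}/2)$, corresponding to Eq (5) when only one random effect is used.

For an invertible function $g(\cdot)$, and considering estimators based on conditioning on the sample, as in (11) for a RIM or (17) for any LMM, a simulation can be used. Assuming that in the transformed scale all normality assumptions are satisfied, we can apply results similar to those obtained for a LMM with a logarithmic transformation, and

$$\gamma_i \mid g(Y_i) \sim N\left(\hat{\gamma}_i, \mathrm{Var}[\gamma_i \mid g(Y_i)]\right), \tag{19}$$

where $\mathrm{Var}[\gamma_i \mid g(Y_i)]$ corresponds to (15). The expected value of the response in the original scale in cluster $i$ for an individual $j$ can be approximated by simulation, generating a set of random vectors $z_l$, for $l = 1, \ldots, L$, according to the distribution given in (19), and obtaining

$$\hat{Y}_{ij} = \frac{1}{L}\sum_{l=1}^{L} g^{-1}\left(x_{ij}^{\top}\hat{\beta} + \mathbf{u}_{ij}^{\top} z_l + \epsilon_{ij}\right),$$

using an estimate of $\epsilon_{ij}$ or $E[\epsilon_{ij}]$ in place of $\epsilon_{ij}$; the expected value could be obtained by simulating the distribution of $\epsilon_{ij}$.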
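
The calculations of this section can be sketched numerically for a model with a random intercept and a random slope. The sketch below makes several assumptions that should be read as such: the conditional moments are computed with the standard Gaussian identity Var(γ_i | log(Y_i)) = (D⁻¹ + U_i′Σ_i⁻¹U_i)⁻¹, which is algebraically equivalent to D − DU_i′(U_iDU_i′ + Σ_i)⁻¹U_iD for invertible D and Σ_i; the error term is drawn at random inside the Monte Carlo average, which is only one possible reading of the simulation step; and all data values and function names are hypothetical.

```python
import math
import random


def conditional_gamma(u_rows, resid, err_var, D):
    """Mean and 2x2 covariance of gamma_i given the transformed responses.

    u_rows: rows (1, u_ij) of U_i; resid: transformed response minus fixed part;
    err_var: diagonal of Sigma_i; D: 2x2 covariance matrix of the random effects.
    """
    det_d = D[0][0] * D[1][1] - D[0][1] * D[1][0]
    a00, a01, a11 = D[1][1] / det_d, -D[0][1] / det_d, D[0][0] / det_d  # D^{-1}
    b0 = b1 = 0.0
    for (u0, u1), r, s2 in zip(u_rows, resid, err_var):
        a00 += u0 * u0 / s2
        a01 += u0 * u1 / s2
        a11 += u1 * u1 / s2
        b0 += u0 * r / s2
        b1 += u1 * r / s2
    det_a = a00 * a11 - a01 * a01
    V = [[a11 / det_a, -a01 / det_a], [-a01 / det_a, a00 / det_a]]
    g_hat = [V[0][0] * b0 + V[0][1] * b1, V[1][0] * b0 + V[1][1] * b1]
    return g_hat, V


def closed_form_log(xb, u, g_hat, V, s2):
    """Corrected prediction for the logarithm: constant, random-effect and error terms."""
    lin = xb + u[0] * g_hat[0] + u[1] * g_hat[1]
    quad = (u[0] * (V[0][0] * u[0] + V[0][1] * u[1])
            + u[1] * (V[1][0] * u[0] + V[1][1] * u[1]))
    return math.exp(lin + quad / 2.0 + s2 / 2.0)


def inv_boxcox(z, lam):
    """Inverse Box-Cox transformation; lam = 0 is the logarithm."""
    if lam == 0:
        return math.exp(z)
    base = lam * z + 1.0
    return base ** (1.0 / lam) if base > 0 else 0.0


def monte_carlo(xb, u, g_hat, V, s2, lam=0.0, draws=100_000, seed=7):
    """Average of g^{-1} over draws of gamma_i (conditional) and of the error."""
    rng = random.Random(seed)
    l00 = math.sqrt(V[0][0])                 # Cholesky factor of the 2x2 matrix V
    l10 = V[1][0] / l00
    l11 = math.sqrt(V[1][1] - l10 * l10)
    sd = math.sqrt(s2)
    total = 0.0
    for _ in range(draws):
        e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z0 = g_hat[0] + l00 * e0
        z1 = g_hat[1] + l10 * e0 + l11 * e1
        total += inv_boxcox(xb + u[0] * z0 + u[1] * z1 + rng.gauss(0.0, sd), lam)
    return total / draws


# Hypothetical cluster with five observations.
u_rows = [(1.0, t) for t in (0.5, 1.0, 1.5, 2.0, 2.5)]
resid = [0.20, 0.10, 0.30, 0.25, 0.40]
err_var = [0.36] * 5
D = [[0.04, 0.005], [0.005, 0.01]]
g_hat, V = conditional_gamma(u_rows, resid, err_var, D)
closed = closed_form_log(9.0, u_rows[-1], g_hat, V, err_var[-1])
mc_log = monte_carlo(9.0, u_rows[-1], g_hat, V, err_var[-1], lam=0.0)
mc_boxcox = monte_carlo(9.0, u_rows[-1], g_hat, V, err_var[-1], lam=0.3)
print(closed, mc_log, mc_boxcox)
```

For the logarithm, the Monte Carlo average should agree with the closed formula up to simulation error, which gives a simple check before the same routine is applied with another value of λ, for which no closed formula is used here.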

Conclusion

The correction terms we proposed for a RIM with or without heteroscedasticity and with the response in a logarithmic scale enable more precise predictions. This is useful since responses in a logarithmic scale are commonly used, especially in financial and poverty analyses; with our procedure, we can obtain more precise predictions of an economic measure in a population, or better simulations of the distribution of the response, or of an associated measure, for a new population (by simulating the error term and random effects and using the values of the explanatory variables). As the simulations assuming log-normal distributions and the real data show, the best predictions, with the lowest MSE, are those including two correction terms, one for the errors and another for the random effects. These correction terms are easy to calculate and implement without the need for special software.

Even though a GLMM allows a distribution different from the normal, it is sometimes desirable simply to work in a logarithmic scale when normality under this transformation is properly satisfied; in other words, when a log-normal distribution adequately fits the data. Besides, through simulations with gamma distributions, which are commonly used to model income or similar variables, we showed that the predictions using the two correction terms are more precise than those obtained through a GLMM with a gamma distribution, as long as the shape parameter α of the gamma distribution is not close to one. Even when this parameter is one, corresponding to an exponential distribution, as the number of clusters and the number of observations in each cluster increase, the estimates obtained using the correction terms are close to those obtained with the GLMM and a gamma distribution (those using the smearing estimate being in general better, especially for lower values of the variance of the random effect, and vice versa), and better than those obtained through another correction method or without correction terms. On the other hand, in other types of analyses, as in some small area estimation techniques, it is desirable to preserve a normal distribution, since fitting a RIM is just a first step in a set of processes, all assuming normality; hence, assuming another distribution would change the complete technique, and, without the correction, the estimated poverty measures or any measure associated with a small area might be incorrect.

The weights we considered for heteroscedasticity had a specific form; however, a more general form can be used by making the corresponding substitution in all formulas. If the variance structure is estimated using a function, for instance an exponential variance structure, we estimate the LMM including this structure. Thus, the parameters of the structure are estimated together with the fixed and random effects parameters. Any inference should be performed taking care that the degrees of freedom are corrected or that appropriate corrections are applied, particularly for small sample sizes [28] and non-linear covariance structures [29]. For the predictions in the original scale, the terms can be calculated using the estimated parameters of the variance structure and then using our formulas. Any further inference should be made with care, considering that the variance structure was estimated. In fact, assuming any correlation structure for the errors within each cluster, i.e., assuming that the matrix Σ_i is not necessarily diagonal (although the covariance structure across clusters is still assumed block diagonal), for instance when time is involved, Eqs (15) and (16) still hold, and formula (17) might be used modifying the third term accordingly, though care should be taken if any inference is required.

We also generalized the procedure to any LMM, the RIM being a particular case, and outlined the process that could be followed when a function different from the logarithm is used, though it seems that approximations are needed in this general case. Future work could continue with transformations other than the logarithm to see whether better predictions with closed formulas can be obtained. An exact variance estimator of the predicted values is also desirable, though it seems, from some preliminary calculations, that a closed formula cannot be obtained; however, a better approximation than one using only simulations might be possible. We are working on the implementation of the correction terms in two-part models and their variants, for instance for health expenditure data with a concentration at zero because some people do not spend money, to see whether our correction terms yield better predictions, as some preliminary analyses have shown.

Proposition 1 and Proposition 2 concerning the calculations to obtain Var(γ_i|log(Y_i)) for a RIM and a LMM, respectively, and details about the mimic simulation example.

(PDF)

R code associated with the non-simulated data analyzed in the manuscript.

(R)

Source R code that allows replicating the analysis for the mimic simulation example.

(R)

Data that allow replicating the analysis for the mimic simulation example.

(CSV)

Instructions that allow replicating the analysis for the mimic simulation example.

(TXT) Click here for additional data file. (TXT) Click here for additional data file. 10 Nov 2020 PONE-D-20-31434 Random intercept model including heteroscedasticity in a logarithmic scale: correction terms and prediction in the original scale PLOS ONE Dear Dr. Ramírez-Aldana, Thank you for submitting your manuscript to PLOS ONE. One of the referees have recommended that you make amendments to your article. We therefore invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Dec 25 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. 
For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Ivan Kryven Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service. Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services. If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free. 
Upon resubmission, please provide the following: The name of the colleague or the details of the professional service that edited your manuscript A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file) A clean copy of the edited manuscript (uploaded as the new *manuscript* file) 3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. 
participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This paper proposes and evaluates correction methods for analysis using log-transformed data to fit a linear mixed model and using the fit to obtain predictions on the original scale. (1) The correction follows from the property of the log-normal distribution. If mu and var are the mean and variance on the log-scale than exp(mu + var/2) is the mean on the original scale. This is well-known, so the claim made in the paper that the correction is proposed "for the first time" is probably overstating the case. (2) That being said, the authors have a particular application. My feeling is that the whole story line should be more focused on the application case. Most importantly, I am missing a clear statement what exactly is to be estimated in the context of the example. Without that context, it is cumbersome to follow all the equations. Not because of the equations, but because it is unclear what we are estimating! (3) While I think it is important to focus more clearly on the example, I also think the scope needs to be broadened a bit for the benefit of readers using mixed models but not necessarily the one considered by the authors. 
The corrections are pretty straight-forward and applicable with other mixed models. All that's required is to work out the relevant marginal variance to use in the correction. (4) Model (2) only has a single subscript "i", whereas equation (4) has two subscripts "ij". It is not clear what these subscripts refer to and why they are changing in this way. Again, providing the specific context of your example may help here. (5) In line 169, I am not sure why there is a need to switch to matrix notation. What does this really help? I am also not sure expressions like in eq. (12) are correct, where you are mixing matrix and scalar expressions. (6) Line 219: What is the rationale underlying the smearing estimator, and what is it's advantage over the plug-in estimator based on what I said in (1)? Some explanation in the paper would be useful. (7) Line 267: It is important to tell readers how you simulated the data and why. (8) Line 274: Why did you even consider fitting this GLMM? It's not the model used to simulate the data, is it? Conversely, if it is assumed the GLMM generated the data, why use any other approach for analysis? Clearly, a rationale for using these alternate models is missing. (9) Line 281: On a similar note, it is not clear to me why data were generated from a GLMM in a second simulation. I think you need a clear justification and story line for this choice. (10) The manuscript file did not contain and of the diagnostic residual plots, so I was not able to convince myself that the log-normality assumption was satisfied. How do you actually check this when the data is heteroscedastic, as is the case under your model? (11) And now the obvious question: what to do if you are not so lucky that the log-transformation works well? You mention Box-Cox transformation in the discussion. What if I need to use x^0.3, say? How does your approach generalize? 
If you can cover that more general case, and also tell people what to do in case their LMM looks different than yours, you can vastly broaden the scope and impact of your paper. As it stands, it's a rather narrow case study. Reviewer #2: Authors have proposed correction terms to predict the response in the original scale, which are easy to calculate and implement, for a random intercept model (RIM) with or without heteroscedasticity with response in a logarithmic scale. Different estimators of the predicted response are given (some of them already present in the literature) In addition, simulations and a real dataset are presented in the paper to show the importance of using the correction terms to obtain more accurate predictions. I have no comment whatsoever. Overall the manuscript is well written and organized. I enjoyed reading the paper and we can say that this paper contributes to the growing literature on statistical modeling. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. 
To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 16 Dec 2020 Dear Dr. Ramírez-Aldana, Thank you for submitting your manuscript to PLOS ONE. One of the referees have recommended that you make amendments to your article. We therefore invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Dec 25 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. 
Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,
Ivan Kryven
Academic Editor
PLOS ONE

Response: Thank you very much for the revision. We have modified the paper attending to the comments and suggestions raised by the two reviewers. A point-by-point response is provided in an additional file included as part of our submitted revision. The modifications have been marked in red in the new version of the paper.

Reviewer #1: This paper proposes and evaluates correction methods for analysis using log-transformed data to fit a linear mixed model and using the fit to obtain predictions on the original scale.

(1) The correction follows from the property of the log-normal distribution. If mu and var are the mean and variance on the log-scale, then exp(mu + var/2) is the mean on the original scale. This is well-known, so the claim made in the paper that the correction is proposed "for the first time" is probably overstating the case.

(2) That being said, the authors have a particular application. My feeling is that the whole story line should be more focused on the application case. Most importantly, I am missing a clear statement of what exactly is to be estimated in the context of the example. Without that context, it is cumbersome to follow all the equations. Not because of the equations, but because it is unclear what we are estimating!

(3) While I think it is important to focus more clearly on the example, I also think the scope needs to be broadened a bit for the benefit of readers using mixed models but not necessarily the one considered by the authors. The corrections are pretty straightforward and applicable with other mixed models. All that's required is to work out the relevant marginal variance to use in the correction.
(4) Model (2) only has a single subscript "i", whereas equation (4) has two subscripts "ij". It is not clear what these subscripts refer to and why they are changing in this way. Again, providing the specific context of your example may help here.

(5) In line 169, I am not sure why there is a need to switch to matrix notation. What does this really help? I am also not sure expressions like in eq. (12) are correct, where you are mixing matrix and scalar expressions.

(6) Line 219: What is the rationale underlying the smearing estimator, and what is its advantage over the plug-in estimator based on what I said in (1)? Some explanation in the paper would be useful.

(7) Line 267: It is important to tell readers how you simulated the data and why.

(8) Line 274: Why did you even consider fitting this GLMM? It's not the model used to simulate the data, is it? Conversely, if it is assumed the GLMM generated the data, why use any other approach for analysis? Clearly, a rationale for using these alternate models is missing.

(9) Line 281: On a similar note, it is not clear to me why data were generated from a GLMM in a second simulation. I think you need a clear justification and story line for this choice.

(10) The manuscript file did not contain any of the diagnostic residual plots, so I was not able to convince myself that the log-normality assumption was satisfied. How do you actually check this when the data is heteroscedastic, as is the case under your model?

(11) And now the obvious question: what to do if you are not so lucky that the log-transformation works well? You mention Box-Cox transformation in the discussion. What if I need to use x^0.3, say? How does your approach generalize? If you can cover that more general case, and also tell people what to do in case their LMM looks different than yours, you can vastly broaden the scope and impact of your paper. As it stands, it's a rather narrow case study.

Response: Thank you very much for the revision.
We really appreciate your time in reviewing the paper. We have modified the paper attending to the comments and suggestions raised by you. A point-by-point response is provided in an additional file included as part of our submitted revision. The modifications have been marked in red in the new version of the paper.

Reviewer #2: Authors have proposed correction terms to predict the response in the original scale, which are easy to calculate and implement, for a random intercept model (RIM) with or without heteroscedasticity with response in a logarithmic scale. Different estimators of the predicted response are given (some of them already present in the literature). In addition, simulations and a real dataset are presented in the paper to show the importance of using the correction terms to obtain more accurate predictions. I have no comment whatsoever. Overall the manuscript is well written and organized. I enjoyed reading the paper and we can say that this paper contributes to the growing literature on statistical modeling.

Response: Thank you very much for the revision. We really appreciate your time in reviewing the paper.

Submitted filename: Response to Reviewers_08122020.pdf

23 Dec 2020

PONE-D-20-31434R1

Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: correction terms and prediction in the original scale

PLOS ONE

Dear Dr. Ramírez-Aldana,

Thank you for revising your manuscript. As you can see, Reviewer 1 still has several comments, and we therefore invite you to prepare a new version of the manuscript addressing the Reviewer's feedback. Following the Reviewer's advice, please condense the description of the key steps of the approach and include a more elaborate discussion of examples and extensions.

Please submit your revised manuscript by Feb 06 2021 11:59PM.
Kind regards,
Ivan Kryven
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: (No Response) ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: (No Response) ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: (No Response) ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revision is a substantial improvement because the example now plays a more central role throughout, making it easier to assess the relevance of the proposed method.

(1) The proposed method itself is rather straightforward. If we log-transform data and fit a linear mixed model assuming normality, then the analysis implies that the untransformed data have a log-normal distribution. The naive back-transformation of means from the log-scale to the original scale does not provide unbiased estimators of the expected value on the original scale. This is all well known, and it is also known that the mean on the original scale is simply exp(mu)*exp(sigma^2/2). All the authors are, therefore, proposing is that we should replace the naive estimator exp(mu_hat) with the improved estimator exp(mu_hat)*exp(sigma^2_hat/2). This works straightforwardly for a linear mixed model; we just need to work out what the marginal variance is in each case. It is not clear why the authors are using so much matrix algebra to make this very simple point. The matrix algebra is clearly unnecessary, even if generalizing this to other models. Thus, I see a lot of scope for simplification. Certainly, the matrix algebra does not make things easier in any way as asserted in L202. How could things be possibly any easier than in the scalar multiplication shown in eq. (9)? And that's all it takes!
(2) In the same vein, the authors' proposal to use simulation in case other transformations than the logarithm are used is fine, though not new either. Moreover, this can be said in a single sentence. The three pages the authors are devoting to this can be shortened substantially, if not removed altogether. Instead, they could focus on analytical results that are available for the power-normal distribution: Freeman, J., and R. Modarres. 2006. Inverse Box–Cox: The powernormal distribution. Stat. Probab. Lett. 76:764–772. doi:10.1016/j.spl.2005.10.036

(3) The authors make a big point about heteroscedasticity, but the form of heteroscedasticity is very restrictive (L145): they assume that the variance is sigma_1^2/w, where w are known weights. This is, of course, the most favourable case, but most of the time such weights are unknown. This begs two questions: (i) what to do if heteroscedasticity is of a different form and (ii) what are the weights in the central real example? They mention in L32 what the weights could be, but what about their example?

(4) As regards the simulation, it would be very useful to explain how the parameter setting is modelled on the real example. As it stands, it seems like the settings studied fall from the skies, which is not very convincing. In particular, coming back to the weights, what is the rationale for the simulation of the weights in L283-284? What do the weights represent in the real-life application the simulation is hopefully representing here?

Further remarks:

L141: Something wrong here with the representation of the normal as N(sigma^2, sigma^2)

L154: y => and

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

4 Feb 2021

Response: Thank you very much for the revision. We have modified the paper and supplementary material attending to the comments and suggestions raised by Reviewer 1. A point-by-point response was uploaded. The modifications have been marked in red color in the new version of the paper. We have condensed the description of the steps of our approach by providing details in the Supplementary Material. We have also included more discussion concerning the example and extensions, condensing all these extensions. Additionally, a mimic example based on the real data set with its R code, data, and instructions to replicate it is now included as part of the manuscript.
It is important to note that since the last revision, we have added funding for this research, and some lines have to be included in the manuscript concerning this matter. We have added them next to the Acknowledgment section.

Reviewer #1: The revision is a substantial improvement because the example now plays a more central role throughout, making it easier to assess the relevance of the proposed method.

Thank you very much for the revision. We really appreciate your time in reviewing the paper. We have modified the paper attending to the comments and suggestions raised by you. A point-by-point response was uploaded. The modifications have been marked in red color in the new version of the paper.

Submitted filename: Response to Reviewers_27012021.pdf

9 Feb 2021

PONE-D-20-31434R2

Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: correction terms and prediction in the original scale

PLOS ONE

Dear Dr. Ramírez-Aldana,

Thank you for resubmitting your manuscript to PLOS ONE. Reviewer 1 has noticed that one of their comments was not understood as intended, please see below. We therefore invite you to respond to this comment by incorporating the necessary changes to the manuscript.

Please submit your revised manuscript by Mar 26 2021 11:59PM.

Kind regards,
Ivan Kryven
Academic Editor
PLOS ONE

Reviewers' comments:

6. Review Comments to the Author

Reviewer #1: The authors acknowledge that in case of heteroscedasticity the weights are typically unknown and hence need to be estimated.
They further mention two options in R to estimate such functions and note that these involve parameters. It may be added that common forms of heteroscedasticity may also involve correlation, as is the case, e.g., with repeated measures data. The authors further response is based on this assertion: "As far as we know, once these parameters are estimated, the weights could be assumed as known, by simply calculating these values for each individual using the estimated parameters." This is true as far as the point estimates for fixed and random effects are concerned, and only when residual errors are independently distributed, but it is incorrect as far as the inference is concerned. Adjustments are needed both regarding the standard errors and the degrees of freedom. See the papers by Kenward and Roger (1997 and 2009 in Biometrics and CSDA). I would like the authors to re-consider this point, and I am happy to take another look once that's done.

25 Mar 2021

Answer: Thank you very much for the revision. We have modified the paper attending to the comment raised by Reviewer 1. A point-by-point response is provided as an additional file. The modifications have been marked in red color in this new version of the paper.

Reviewer #1: The authors acknowledge that in case of heteroscedasticity the weights are typically unknown and hence need to be estimated. They further mention two options in R to estimate such functions and note that these involve parameters. It may be added that common forms of heteroscedasticity may also involve correlation, as is the case, e.g., with repeated measures data. The authors further response is based on this assertion: "As far as we know, once these parameters are estimated, the weights could be assumed as known, by simply calculating these values for each individual using the estimated parameters." This is true as far as the point estimates for fixed and random effects are concerned, and only when residual errors are independently distributed, but it is incorrect as far as the inference is concerned. Adjustments are needed both regarding the standard errors and the degrees of freedom. See the papers by Kenward and Roger (1997 and 2009 in Biometrics and CSDA). I would like the authors to re-consider this point, and I am happy to take another look once that's done.

Answer: Thank you very much for the revision. We really appreciate your time in reviewing the paper. We have modified the paper attending to the comment raised by you. A full response is provided as an additional file. The modifications have been marked in red color in the new version of the paper.

Submitted filename: Response to Reviewers_02032021.pdf

29 Mar 2021

Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: correction terms and prediction in the original scale

PONE-D-20-31434R3

Dear Dr. Ramírez-Aldana,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance.
Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Ivan Kryven
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

31 Mar 2021

PONE-D-20-31434R3

Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: correction terms and prediction in the original scale

Dear Dr. Ramírez-Aldana:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Ivan Kryven
Academic Editor
PLOS ONE
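
Note on the correction discussed in the review history above: Reviewer #1 points out that, for a log-normal response, the mean on the original scale is exp(mu)*exp(sigma^2/2), where sigma^2 is the marginal variance on the log scale, so the naive back-transformation exp(mu) is too small. This can be checked with a short simulation. What follows is a minimal sketch in Python using only the standard library; it is not the authors' R code. It assumes a random intercept model on the log scale whose parameters are known rather than estimated. The function names (lognormal_mean, simulate_random_intercept), the argument names (var_intercept, var_error, weight) and the numerical values are illustrative assumptions. The marginal variance is taken as var_intercept + var_error / weight, following the sigma_1^2/w form of heteroscedasticity mentioned by the reviewer.

```python
import math
import random


def lognormal_mean(mu, var_intercept, var_error, weight=1.0):
    """Mean of Y when log(Y) = mu + b + e,
    with b ~ N(0, var_intercept) and e ~ N(0, var_error / weight)."""
    # Marginal variance on the log scale (assumed model, see the note above).
    marginal_var = var_intercept + var_error / weight
    return math.exp(mu) * math.exp(marginal_var / 2.0)


def simulate_random_intercept(mu, var_intercept, var_error,
                              n_clusters, n_per_cluster, seed=0):
    """Draw original-scale responses from a random intercept model
    specified on the log scale (homoscedastic errors, weight = 1)."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_clusters):
        # Cluster-level random intercept shared by all members of the cluster.
        b = rng.gauss(0.0, math.sqrt(var_intercept))
        for _ in range(n_per_cluster):
            # Observation-level error.
            e = rng.gauss(0.0, math.sqrt(var_error))
            responses.append(math.exp(mu + b + e))
    return responses


# Illustrative values only; in practice these would be estimates from a fitted model.
mu, var_intercept, var_error = 1.0, 0.3, 0.5

y = simulate_random_intercept(mu, var_intercept, var_error,
                              n_clusters=2000, n_per_cluster=20)
empirical_mean = sum(y) / len(y)
naive = math.exp(mu)                                       # plain back-transformation
corrected = lognormal_mean(mu, var_intercept, var_error)   # with correction term

print(f"empirical mean on the original scale: {empirical_mean:.3f}")
print(f"naive exp(mu):                        {naive:.3f}")
print(f"corrected exp(mu)*exp(var/2):         {corrected:.3f}")
```

With these settings the naive value exp(mu) falls well below the simulated mean, while the corrected value is close to it. When the variance components are estimated rather than known, the corrected predictor also carries their estimation error, which this sketch ignores.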


1. Analysis of hospital costs by morbidity group for patients with severe mental illness.

Authors: Vicent Caballer-Tarazona; Antonio Zúñiga-Lagares; Francisco Reyes-Santias
Journal: Ann Med Date: 2022-12 Impact factor: 4.709