GIVE statistic for goodness of fit in instrumental variables models with application to COVID data.

Abstract

Since the COVID-19 outbreak, scientists have been interested in whether the Bacillus Calmette-Guerin (BCG) vaccine has any impact on COVID-19 mortality. The question becomes more relevant because a large population in the world may have latent tuberculosis infection (LTBI), in which a person does not have active tuberculosis but has persistent immune responses stimulated by Mycobacterium tuberculosis antigens; this means that both LTBI and BCG may generate immunity against COVID-19. In order to understand the relationship between LTBI and COVID-19 mortality, this article proposes a measure of goodness of fit, viz., the Goodness of Instrumental Variable Estimates (GIVE) statistic, for a model obtained by Instrumental Variables estimation. The GIVE statistic helps in finding the appropriate choice of instruments, which provides a better fitted model. The large sample properties of the GIVE statistic are investigated. The COVID-19 data are analysed using the GIVE statistic, and simulation studies, along with an analysis of the well-known Card data, are conducted to show the usefulness of the GIVE statistic.

Year: 2022 PMID: 35676510 PMCID: PMC9176169 DOI: 10.1038/s41598-022-13240-y

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996

Introduction

Since the COVID-19 outbreak, there has been significant interest in whether the well-known tuberculosis (TB) vaccine Bacillus Calmette-Guerin (BCG) and COVID-19 mortality are related. Specifically, scientists want to know whether BCG vaccination reduces mortality from COVID-19. As of now, there is no scientific evidence of a positive impact of BCG vaccination on mortality due to COVID-19 (see, e.g.,[1]). However, most of the studies did not consider the fact that many people in the world may have latent TB infection (LTBI), that is, they may not have evidence of clinically active TB but have Mycobacterium tuberculosis antigens. Moreover, LTBI creates lifelong immunity and hence may provide immunological protection against COVID-19. This article addresses the association between LTBI and the reduction of mortality due to COVID-19, and in the course of this study, Instrumental Variables (IV) estimation has played an important role in modelling COVID-19 data to draw such conclusions, see[2]. However, the conclusions drawn in that article are restricted to the estimation of parameters, and there is no mention or evidence of goodness of fit, i.e., how well the model fits the COVID-19 data. This article proposes and studies the goodness of fit of models obtained by IV estimation, which has not yet received much attention in the literature. We now discuss in detail how IV can be a powerful tool for the aforesaid research problem related to the COVID-19 outbreak. Note that one may formulate a regression equation with the mortality due to COVID-19 as the response variable and a few related covariates, which is thoroughly discussed in “Analysis: COVID-19 Data Set” Section.
However, in such regression models, the covariates are likely to be correlated with the regression errors, which makes the regression model unreliable. To overcome this problem, one may adopt the IV method, since the IVs are highly correlated with the covariates but uncorrelated with the error random variables, at least in the limiting sense. An excellent exposition of IV estimation is presented in[3,4], Chap. 6. Besides, correlation between the errors and the explanatory variables is observed in measurement error models, and[5] has discussed the role of measurement errors in seemingly unrelated regression equation (SURE) models in a Bayesian framework using Markov Chain Monte Carlo and mean field variational Bayes. Apart from the COVID-19 outbreak, IV estimation in multiple linear regression models has in general played a vital role in biostatistics. Biostatisticians use it because IV can tame the confounding bias without directly observing all explanatory variables, and[6] has presented a tutorial on IV estimation in biostatistics. IV estimation has not only been used but also extended in many directions. For example,[7] used IV estimation in competing risk data,[8] extended it to censored survival outcome,[9] applied IV techniques to the proportional hazards model,[10] presented a test for checking concordance between instrumental variable effects,[8] presented a method of instrumental variable estimation when the outcome is rare, and many others (see also[11-13]). As indicated earlier in the context of modelling the mortality rate due to the COVID-19 outbreak, IV estimation in the multiple linear regression model requires a set of IVs which are highly correlated with the explanatory variables and uncorrelated with the disturbance term, at least in the limit.
Various sets of IVs can be obtained for the same set of data, which gives rise to different estimators of the regression coefficient vector. This, in turn, produces different fitted models for the same data. A more complicated situation arises in multiple linear regression, where there is more than one explanatory variable. In such a case, the experimenter has various options to choose the IVs for the explanatory variables, e.g., the same IVs for all the explanatory variables or different IVs for different explanatory variables. The issue is then how to check which of the models is better fitted to the given data or, equivalently, which choice of instrumental variables gives a better fit in a statistical sense. On many occasions, researchers and applied workers provide various arguments to justify their choice of IVs, but such justifications are based on their experience and not on any quantitative outcome of statistical measures. It is important to note that the goodness of fit in the classical multiple linear regression model, based on OLSE, is defined for judging the goodness of fit of estimation, denoted as , whereas for the purpose of prediction, it is denoted as . We emphasize that the goodness of fit for judging the estimation and the goodness of fit for judging the prediction of the model are different. Strictly speaking, this article considers the aspect of goodness of fit of estimation only, while[14] mentioned that it is problematic to develop the goodness of fit based on residuals and, for that reason, considered the prediction errors to develop the goodness of fit. We here develop the goodness of fit based on residuals, see also[15,16] in this regard. Overall, this article proceeds in the following direction.
The coefficient of determination, popularly known as R-square ($R^2$), is used to measure the goodness of fit in the classical multiple linear regression model and cannot be used or directly extended to the case of IV models because the explanatory variable becomes stochastic and correlated with the disturbance term. To overcome this problem, we here propose a new goodness of fit statistic, viz., the Goodness of Instrumental Variable Estimates (GIVE) statistic, which can be used to measure the goodness of fit in the IV models and to determine the optimal choice of IVs for given data. We consider the development of the GIVE statistic based on IV estimation for a general situation in which the error random variables are non-spherical and their covariance matrix is unknown. Statistical properties such as the asymptotic distribution of the GIVE statistic are also derived. Moreover, the performance of the GIVE statistic in finite samples is investigated with Monte-Carlo simulation experiments in the context of the multiple linear regression model and the measurement error model. Furthermore, the usefulness of the GIVE statistic is established on the COVID-19 data addressing the association between LTBI and the reduction of mortality due to COVID-19, as discussed in the first two paragraphs of this section. Finally, “Analysis: Card Data for Schooling” Section briefly studies the use of the GIVE statistic on the real data set from[17], who presented an analysis of the causal link between schooling and earnings in education outcomes using instrumental variable estimation.

Instrumental variable (IV) estimation

We first describe the model and IV estimation in the setup of a classical multiple linear regression model, see Rao et al. (2008, Chapter 4)[18]. Consider the multiple linear regression model with the study variable y linearly related to the p explanatory variables with an intercept term in X as $y = X\beta + u$, where y is the vector of observations on the study variable, X is the matrix of n observations on each of the p explanatory variables and an intercept term, $\beta$ is the vector of regression coefficients associated with the explanatory variables, and u is the vector of nonspherical disturbances with $E(u) = 0$ and unknown covariance matrix, i.e., $E(uu') = \sigma^2\Omega$, where $\Omega$ is an unknown positive definite matrix, and $\sigma^2$ is a constant. Note that in the usual linear regression model, it is assumed that X and u are uncorrelated but we here consider a situation where X and u are uncorrelated in limit, in the sense that as $n \to \infty$, where denotes the convergence in probability. Moreover, we assume the presence of an intercept term in the model, which is needed for the validity of the coefficient of determination in the classical multiple linear regression model and may be needed for the validity of the proposed goodness of fit statistic studied in “Asymptotic Distribution of GIVE statistic” Section. Next, suppose that a set of p instrumental variables (denoted by ) is available, and the observations are arranged in a matrix Z with an intercept term in Z such that they are correlated with X in limit and uncorrelated with u in limit. Assume that , where , and are non-singular positive definite (denoted by ) matrices of constants, and I is an () identity matrix. Consider now the set up of a linear regression model in which X is regressed on the set of instruments Z for a given sample size n in the first stage as in (3), and we assume and for the random error term , where and are the same as the earlier associated with .
In this context, it should be mentioned that and can be different in principle but here they are considered to be the same only for notational and algebraic simplicity. The feasible generalized least squares estimate of is obtained from (3) by using generalized regression of X on Z and replacing $\Omega$ by $\hat{\Omega}$, where $\hat{\Omega}$ is a consistent estimator of unknown $\Omega$. In the next stage, using the predicted value of X, we have the model (4). Application of generalized least squares on (4) yields the two stage feasible generalized least squares (2SFGLS) estimator of $\beta$ given in (5). Now, Theorem 1 asserts the consistency of the 2SFGLS estimator for the unknown parameter $\beta$. Suppose that $\hat{\Omega}$ is a consistent estimator of $\Omega$, and the following assumptions are required for the consistency of the estimator :

Theorem 1

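Let $\hat{\Omega}$ be a consistent estimator of $\Omega$ under Euclidean norm. Then, under (B1) and (B2), the 2SFGLS estimator of $\beta$ converges in probability to $\beta$ as $n \to \infty$.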

Proof

See the supplementary file.
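The two-stage logic described above can be sketched for the simplest case. The following is a minimal illustration, assuming one explanatory variable, one instrument, an intercept, and an identity covariance matrix, so that both stages reduce to ordinary least squares rather than feasible generalized least squares. The data-generating process, the helper names `ols_simple` and `two_stage_ls`, and all parameter values are hypothetical and are not taken from the article.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def two_stage_ls(z, x, y):
    """Stage 1: regress x on z. Stage 2: regress y on the stage-1 fitted values."""
    a, g = ols_simple(z, x)
    x_hat = [a + g * zi for zi in z]
    return ols_simple(x_hat, y)

# Hypothetical data: a common shock c enters both x and u, so x and u are correlated.
rng = random.Random(0)
n = 5000
beta0, beta1 = 1.0, 2.0
z = [rng.gauss(0, 1) for _ in range(n)]   # instrument, independent of u
c = [rng.gauss(0, 1) for _ in range(n)]   # common shock
x = [zi + ci + rng.gauss(0, 1) for zi, ci in zip(z, c)]
u = [ci + rng.gauss(0, 1) for ci in c]
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

_, slope_ols = ols_simple(x, y)       # affected by the correlation between x and u
_, slope_iv = two_stage_ls(z, x, y)   # two-stage estimate using the instrument

print("OLS slope:", round(slope_ols, 3))
print("Two-stage slope:", round(slope_iv, 3))
```

In this set-up the ordinary least squares slope drifts away from the true slope of 2 because of the shared shock, while the two-stage estimate stays close to it. With a non-identity covariance matrix, each least-squares step would be replaced by its generalized counterpart using a consistent estimator of the covariance matrix.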

GIVE statistic for goodness of fit in IV model

Note that the coefficient of determination, popularly known as $R^2$, is the ratio of the sum of squares due to regression (or the fitted model) to the total sum of squares obtained in the context of analysis of variance in the multiple linear regression model, see[18], Chap. 3. It measures the proportion of variability explained by the fitted model with respect to the total variability in the data, based on the ordinary least squares estimator (OLSE). The coefficient of determination fails to judge the goodness of fit in the IV models because the total sum of squares can no longer be partitioned into two orthogonal components, viz., the sum of squares due to regression and the sum of squares due to error. Moreover, the IV estimators do not possess properties such as being the best linear unbiased estimator, as the OLSE does, and hence replacing the OLSE by the IV estimator in the definition of the goodness of fit statistic will not necessarily yield a good and reliable outcome. We here attempt to provide a new statistic for quantitatively measuring the goodness of fit in the IV models, termed the GIVE statistic. The square of the population multiple correlation coefficient between y and fixed explanatory variable, say , is given by , where , where is a positive definite finite matrix, , and is a vector of all elements unity. Note that when the model is best fitted, then , and we have . On the other hand, if the model is worst fitted, then all the $\beta$’s will be zero, indicating that none of the explanatory variables is effective, and consequently, . Any other value of lying between zero and one will accordingly measure the goodness of the fitted model in terms of the multiple correlation coefficient. Now, recall the model (4), viz., , where $\beta$ is estimated by (5), and we develop the goodness of fit statistic.
The total sum of squares for model (4) is given in (9). It may be observed that in the classical multiple linear regression model, the total sum of squares is partitioned into two orthogonal components: the sum of squares due to regression and the sum of squares due to error. Observing the expression in (3.2), the first two terms can be considered as jointly constituting the sum of squares due to regression, whereas the remainder can be considered as the sum of squares due to errors. If we replace the unknown $\beta$ and u in the first two terms of (9) by their estimates, then a measure of goodness of fit can be constructed by measuring the ratio of the sum of squares due to regression to the total sum of squares under IV models, as given in (10). The statistic (10) can be used to measure the goodness of fit in the IV model having nonspherical disturbances with unknown covariance matrix and is termed the Goodness of Instrumental Variable Estimates (GIVE) statistic in IV models. Theorem 2 states the consistency of the GIVE statistic for its population counterpart. The interpretation of the population counterpart is the same as in (8).
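The loss of the orthogonal partition can be checked numerically. The sketch below is only an illustration of that point and is not the decomposition (9) used to build the GIVE statistic: it assumes a single explanatory variable with one instrument, uses the simple IV slope cov(z, y) / cov(z, x), and measures the gap between the total sum of squares and the sum of the regression and error sums of squares. The simulated data and the helper names `centered_sum` and `partition_gap` are hypothetical.

```python
import random

def centered_sum(a, b):
    """Sum of cross-products of a and b about their means."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

def partition_gap(x, y, slope):
    """Return (SST, SST - (SSR + SSE)) for the line through the means with the given slope."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, sst - (ssr + sse)

# Hypothetical data with x correlated with the error u through a common shock c.
rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
c = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ci + rng.gauss(0, 1) for zi, ci in zip(z, c)]
y = [1.0 + 2.0 * xi + ci + rng.gauss(0, 1) for xi, ci in zip(x, c)]

slope_ols = centered_sum(x, y) / centered_sum(x, x)
slope_iv = centered_sum(z, y) / centered_sum(z, x)

sst, gap_ols = partition_gap(x, y, slope_ols)
_, gap_iv = partition_gap(x, y, slope_iv)

print("Gap under OLS:", gap_ols)  # zero up to rounding error
print("Gap under IV:", gap_iv)    # a non-negligible cross-product term remains
```

Under least squares the residuals are orthogonal to the fitted values, so the gap vanishes. Under the IV slope the cross-product term equals 2 b (S_xy - b S_xx), with b the IV slope, which is zero only when the IV and least-squares slopes coincide.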

Theorem 2

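Let $\hat{\Omega}$ be a consistent estimator of $\Omega$ under Euclidean norm. Then, under (B1) and (B2), the GIVE statistic converges in probability to its population counterpart as $n \to \infty$.

Proof

See the supplementary material.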

Remark

In case $\Omega$ is known, then $\hat{\Omega}$ can be replaced by known $\Omega$. For example, substituting $\Omega = I$ will give the classical case. Overall, the GIVE statistic is a consistent estimator of its population counterpart, where unknown $\Omega$ is estimated by its consistent estimator $\hat{\Omega}$. The interpretations of the GIVE statistic are similar to the interpretations of the coefficient of determination. A value of 1 indicates that the model is best fitted. On the other hand, if all estimated regression coefficients are close to zero or, say, exactly zero, which indicates that none of the regression coefficients is significant, i.e., the model is worst fitted, then the GIVE statistic is 0. Besides these two values, any other value of the GIVE statistic lying between 0 and 1 will indicate the degree of goodness of fit provided by the fitted model for given explanatory variables and sample size. For example, if , it then would indicate that of the variation in the response values is being explained by the fitted IV model based on the choice of IVs and the fitted IV model is nearly good.

Asymptotic distribution of GIVE statistic

We here derive the asymptotic distribution for the case when $\Omega$ is unknown, and the asymptotic distributions for the special cases, viz., known $\Omega$ and $\Omega = I$, are stated in Corollary 1 and Lemma 1 (see the supplementary file), respectively. We now assume the following conditions for the sake of technicalities:
(A1) The parameter space of is compact.
(A2) X is a bounded random variable.
(A3) Z is a bounded random variable.
(A4) Z and u are independent random variables.
(A5) Z and are independent random variables.
(A6) The correlation between and u is non-zero.
(A7) Let and , where . Moreover, suppose that for all i and j.
Before stating the theorem, we also need to introduce a few notations. Let us denote , which is an arbitrary d-dimensional vector, and

Theorem 3

Under conditions (A1–A7), converges weakly to a normal distribution with mean and variance .

Proof

See the supplementary material.

Corollary 1

Under conditions (A1–A6), converges weakly to a normal distribution with mean and variance .

Proof of Corollary 1

The proof follows using the same arguments as the proof of Lemma 1 in the supplementary file.

Monte-Carlo simulation study

We conducted a Monte-Carlo simulation experiment to study the performance of the GIVE statistic in finite samples. To understand the behaviour of the GIVE statistic under different choices of instrumental variables, we consider Wald’s and Durbin’s choices of instrumental variables (see[18], Chap. 4, pp. 208–209, for more details). The Wald instrument technique divides the observations on the explanatory variable into two groups based on their median value and chooses the IV as +1 and −1 for the two groups, and the Durbin instrument technique uses the ranks of the observations on the explanatory variable as instruments. Here our modest aim is to understand the performance of the proposed goodness of fit statistic. Hence, to keep the exposition simple, we consider a model with homoskedastic error structure with identity covariance matrix. The same procedure can be extended to the case of a non-identity covariance matrix, whether known or unknown. In the case of an unknown non-identity covariance matrix, we also need to choose a suitable consistent estimator of the covariance matrix based on a sample of data. We conducted the simulation experiments using R software under two cases. The first case is the general set up of multiple linear regression models. In the second case, our objective is to demonstrate the application of the GIVE statistic in a prominent model where IV is extensively used; keeping this in mind, we consider measurement error models and simulate the values of the proposed GIVE statistic.
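The two instrument constructions described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the experiments, and the function names `wald_instrument` and `durbin_instrument` are hypothetical. Two details are assumptions not stated in the text: observations above the median receive +1 (those at or below it receive −1), and tied observations are ranked in their order of appearance.

```python
import statistics

def wald_instrument(x):
    """Wald-type instrument: +1 above the median, -1 otherwise (sign convention assumed)."""
    med = statistics.median(x)
    return [1 if xi > med else -1 for xi in x]

def durbin_instrument(x):
    """Durbin-type instrument: rank of each observation, 1 for the smallest."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# Small hypothetical sample of an explanatory variable.
x = [2.3, -0.7, 1.1, 4.0, 0.2, 3.5]
z_wald = wald_instrument(x)
z_durbin = durbin_instrument(x)

print(z_wald)    # [1, -1, -1, 1, -1, 1]
print(z_durbin)  # [4, 1, 3, 6, 2, 5]
```

The Wald instrument keeps only the side of the median on which an observation falls, whereas the Durbin instrument keeps its full ordering, which is the sense in which ranks carry more information than a two-group indicator.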

Application: Multiple linear regression model

Here our simulation set up is as follows. For a given sample size and 200, the random errors are generated following with and 5 for the model (3) and (4) with and 9 with high values of the population multiple correlation between the study variable and all the independent variables, and $\Omega$ is known. The observations on X are generated from a normal distribution, and the corresponding IVs are found to construct Z using the Wald instrument technique and the Durbin instrument technique. The GIVE statistics based on the Wald and Durbin instruments are computed, together with their empirical relative bias and empirical relative mean squared error. A comparison of the GIVE statistic with the traditional $R^2$ is made to see what happens to the GIVE statistic when important or unimportant explanatory variables are added, when the intercept term is absent, etc. It may be recalled that the classical coefficient of determination has the property that it increases as the number of explanatory variables in the model increases. The traditional $R^2$ is defined only when there is an intercept term in the model. The detailed results, along with the values obtained in different tables, are presented in the supplementary material for the sake of space. The conclusions drawn from those results are mentioned here. We observe that as the value of the variance increases, the values of both GIVE statistics decrease. All the results indicate that the Durbin instrument yields a better fitted model than the Wald instrument. Such an outcome is intuitively correct because the Durbin instrument uses more information, in terms of the ranks of the observations, whereas the Wald instrument uses the information in the data as indicator variables only. Hence, it can be concluded that if both the Wald and Durbin instruments are used for X, then the proposed GIVE statistic is capable of judging the goodness of fit, which is affected primarily by the choice of IVs, and thus of deciding on the appropriateness of the choice of IVs.
We have used the same choice of IVs for all the variables, but extending it to a case where different explanatory variables are replaced by different IVs is not difficult. The resulting GIVE statistic will reflect the goodness of fit appropriately. Moreover, the empirical relative bias (RB) and empirical relative mean squared error (RM) of are smaller than those of in all the settings of the simulation set up. The RB and RM of both GIVE statistics increase as the error variance increases for a given sample size. As the sample size increases, the RB and RM of both GIVE statistics decrease for a given error variance. This behaviour resembles that of the traditional $R^2$. Both GIVE statistics are found to have a tendency to increase as the number of explanatory variables increases. This empirically confirms that the values of the GIVE statistic increase when explanatory variables are added in the model. Next, when relevant explanatory variables are added, both and increase. The magnitude of increment is less than the magnitude of increment of and . This indicates the capability of the GIVE statistic in diagnosing whether relevant or unimportant explanatory variables are added in the model. Furthermore, we find that the values of the GIVE statistic increase when the intercept term is absent in the model in comparison to the values when the intercept term is present. There is no indication that the GIVE statistic fails to work in a model without an intercept. This is in contrast to the traditional $R^2$, which is defined only in a model with an intercept term. The RBs and RMs of both GIVE statistics are lowered when the intercept term is removed from the model, and the statistic gives slightly higher values than when the intercept term is present. Besides, the GIVE statistic works well only when the error variance is small. It is intuitively expected that as the error variance increases, the model fit should worsen. The GIVE statistic captures this better than the traditional $R^2$.
The difference in the values of the two GIVE statistics for the same set of explanatory variables can be interpreted as the difference arising due to the choice of IVs.
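The two summary measures used above can be written down compactly. The exact definitions are in the supplementary material, which is not reproduced here, so the sketch below assumes the common conventions: relative bias is the mean deviation from the true value divided by the true value, and relative mean squared error is the mean squared deviation divided by the square of the true value. The function names and the replicate values are hypothetical.

```python
def relative_bias(estimates, truth):
    """Assumed definition: (mean of estimates - truth) / truth."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - truth) / truth

def relative_mse(estimates, truth):
    """Assumed definition: mean squared deviation from truth / truth squared."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return mse / truth ** 2

# Hypothetical replicate values of a goodness of fit statistic and its population value.
estimates = [0.78, 0.75, 0.81, 0.74]
truth = 0.80

rb = relative_bias(estimates, truth)
rm = relative_mse(estimates, truth)

print("Relative bias:", rb)
print("Relative MSE:", rm)
```

A negative relative bias, as in this toy example, means that the statistic underestimates its population value on average.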

Application: Measurement error model

We first briefly describe the set up of measurement error models. More details can be found in[18-20]. Note that the symbols and notations used in this subsection are limited to this “Application: Measurement Error Model” section only. The reason for choosing the measurement error model for the application of the GIVE statistic is that the explanatory variable and the random errors become correlated when the data are contaminated with measurement errors. Besides, among the estimation methods used in measurement error models, the IV method is a popular way to obtain consistent estimators of the regression coefficients. Note that a basic common assumption of any statistical analysis is that all the observations are correctly observed. However, in many practical situations, they cannot be correctly observed for various reasons; in fact, they are observed with some measurement error in them. The difference between the observed and true values of the variable is termed the measurement error. We here consider the structural form of the multiple measurement error model, where the true explanatory variables are stochastic with the same mean. Let denote the vector of observations on the true values of the study variable and , ; be the matrix of the n observations on each of the p explanatory variables, linked as in (11), where is a vector of regression coefficients associated with the p explanatory variables, is the intercept term, and is a vector of elements unity. The true values and T are not observable due to the presence of measurement errors, but they are observed as y and X, which are a vector and a matrix, respectively, given as in (12) and (13), where is a vector of measurement errors in , , and ; is the matrix of measurement errors involved in T. We assume that , are independent and identically distributed random variables following , and are also independent and identically distributed following . Further, and are also assumed to be statistically independent of each other.
The data are generated from a population with high multiple correlation, and and 200 are chosen. The measurement errors and are generated following and , respectively, with , and 5 for the model (11)–(13) when and 9. The observations on T are generated from a normal distribution in every replication, and the corresponding IVs are found to construct Z using two approaches: the Wald instrument technique and the Durbin instrument technique. The regression coefficient vector is estimated using the Wald and Durbin instruments with measurement-error-ridden data on X and y to further compute the GIVE statistic. The results are summarized in the supplementary material. It is clear that the measurement errors in the data affect the values of both GIVE statistics. As the value of the variance increases, the values of both GIVE statistics decrease. The rate of such decrease also depends upon the sample size. This again confirms that the proposed GIVE statistic can satisfactorily measure the goodness of fit in the measurement error models. The performance of the GIVE statistic depends on a combination of n and the measurement error variance. Overall, it can be concluded that if there are several available choices of IVs, e.g., the Wald and Durbin instruments in the present case, then the proposed GIVE statistic helps in judging the appropriate choice of IVs and gives an idea about the goodness of the fitted model. Next, in terms of bias, the results empirically indicate that both GIVE statistics are negatively biased. The magnitude of relative bias of is smaller than the magnitude of relative bias of and in all the settings of the simulation set up. If the sample size is taken to be substantially large, then as the sample size increases, the values of the GIVE statistic will converge better towards its population counterpart. Moreover, we observe that the values of both GIVE statistics have a tendency to increase as the number of explanatory variables increases, just like the traditional $R^2$ in the classical multiple linear regression model without measurement errors.
When unimportant explanatory variables are added in the model, both GIVE statistics increase, but the amount of increment is less than the increment that happened in and . This confirms the capability of both GIVE statistics in diagnosing whether relevant or unimportant explanatory variables are added in the model. Hence, the proposed GIVE statistic works well in the measurement error model. We expect that the GIVE statistic will also work well in other models.
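The reason measurement error calls for IV estimation in the first place can be seen in a small simulation. The sketch below assumes a single explanatory variable, normal true values and normal measurement errors, all mutually independent. It shows only the well-known attenuation of the least-squares slope, which shrinks towards zero by the ratio var(T) / (var(T) + var(measurement error in X)). The parameter values and names are hypothetical, and no instrument or GIVE statistic is computed.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
n = 20000
beta0, beta1 = 1.0, 2.0
var_t, var_delta, var_eps = 1.0, 1.0, 0.25  # hypothetical variances

t = [rng.gauss(0, var_t ** 0.5) for _ in range(n)]       # true explanatory variable
y_true = [beta0 + beta1 * ti for ti in t]                # true study variable
x = [ti + rng.gauss(0, var_delta ** 0.5) for ti in t]    # observed with error
y = [yi + rng.gauss(0, var_eps ** 0.5) for yi in y_true] # observed with error

slope_observed = ols_slope(x, y)
reliability = var_t / (var_t + var_delta)

print("Slope from error-ridden data:", round(slope_observed, 3))
print("True slope times reliability ratio:", beta1 * reliability)
```

Because the observed explanatory variable contains its own measurement error, it is correlated with the composite error of the observed regression, and the least-squares slope settles near the true slope multiplied by the reliability ratio instead of the true slope.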

Analysis: COVID-19 data set

In “Introduction” Section, it has already been mentioned that there is likely to be an association between the well-known tuberculosis (TB) vaccine Bacillus Calmette-Guerin (BCG) and COVID-19 mortality. However, scientists have not yet found any data-analytic evidence of such an association (see, e.g.,[1]). One possible reason is that a large proportion of people in the world may have latent TB infection (LTBI), i.e., they may not have clinical evidence of active TB but have Mycobacterium tuberculosis antigens. Moreover, as said before, LTBI may provide immunological protection against COVID-19. We here try to address the association between LTBI and the reduction of mortality due to COVID-19 using the IV technique based on the GIVE statistic. The data set consists of the mortality rate due to COVID-19 of 104 countries from January 01, 2020 to May 30, 2020 (see https://www.worldometers.info/coronavirus) along with region (see https://www.who.int/countries/), bcgindex (see http://www.bcgatlas.org/), pop65 (see https://data.worldbank.org/indicator/SP.POP.65) and lntb10 (see http://www.bcgatlas.org/). The whole data set is also submitted to the journal’s repository. We would like to mention that most of the countries with low income levels (annual per capita income less than 825 USD) reported zero deaths attributed to COVID-19 till May 30, 2020; for this reason, 104 specific countries are selected for data analysis, and the data with incomplete observations are not considered. As indicated in “Introduction” Section, we here formulate a regression equation with the mortality due to COVID-19 as the response variable and, motivated by the study of[2], we consider the IVs for the following three covariates:
bcgindex: The number of years a country has included the BCG vaccine in its national immunization program. 1: All individuals received mandatory vaccinations; 0: BCG neither previously nor currently mandatory; values between 0 and 1: BCG previously mandatory but now discontinued. See Data Appendix for details of the construction of the BCG index. The definition is taken from[2].
region: The World Health Organization (WHO) regional classification was used here. 1: African Region; 2: South-East Asia Region; 3: East-Mediterranean Sea Region; 4: Western-Pacific Asia Region; 5: Region of America; 6: European Region. The definition is taken from[2].
pop65: The ratio of the population over 65 years of age. The definition is taken from[2].
Another covariate is lntb10, which is defined as the number of TB infections per 100,000 people on a logarithmic scale, i.e., this variable approximately represents LTBI. As[2] did, we here test the hypothesis that LTBI is associated with reduced COVID-19 mortality and present the data analysis using IV estimation for the modelling of COVID-19 data based on the aforesaid four covariates, where bcgindex, region and pop65 are considered as IVs, and lntb10 is considered as the original explanatory variable. We extend it further and use two types of IVs to demonstrate how the GIVE statistic can help in deciding the fit of the model and the choice of IVs for a better fit. Here the IVs following Wald’s and Durbin’s methods are generated for all the explanatory variables in both the models. The following values of the GIVE statistic are obtained: For the multiple linear regression model with “bcgindex” and “region” as IVs, “lntb10” as the original variable, and the mortality due to COVID-19 as the response variable, we obtain the values of the GIVE statistic as and . Moreover, using the type IV, the estimate of the coefficient associated with “lntb10” is , and using the type IV, the estimate of the coefficient associated with “lntb10” is .
The negative sign of the coefficient of “lntb10” and the relatively large values of and indicate that LTBI may really protect against mortality due to COVID-19. Next, for the multiple linear regression model with “bcgindex”, “region” and “pop65” as IVs, “lntb10” as the original variable, and the mortality due to COVID-19 as the response variable, we obtain the values of the GIVE statistic as and . Moreover, using the type IV, the estimate of the coefficient associated with “lntb10” is , and using the type IV, the estimate of the coefficient associated with “lntb10” is . Again, the negative sign of the coefficient of “lntb10” and the larger values of and indicate that LTBI may strongly protect against COVID-19 mortality for the population aged more than 65. Overall, this indicates that the multiple linear regression fits the COVID-19 data well, and that LTBI has an impact on the reduction of mortality due to COVID-19, in particular for elderly people who are 65 years old or more. Moreover, since the variable bcgindex has been used as an instrumental variable, it is highly correlated with lntb10. Thus, whatever conclusions we can draw for lntb10 will also hold for bcgindex with a very high probability. Therefore, the BCG vaccine is expected to affect COVID-19 mortality in the same way as LTBI. However, we would like to caution that this conclusion is based only on statistical evidence obtained from a particular data set.

Analysis: card data for schooling

We present another example to demonstrate the application of the proposed GIVE statistic on a real data set from[17], who analysed the causal link between schooling and earnings using instrumental variable estimation. That analysis affirms that marginal returns to education among children of less-educated parents are higher than the rates of return estimated by conventional methods. The data set consists of 35 variables with 3010 observations, and it is available in the “ivpack” package in R software. Two variables capturing variation in college proximity can be used as IVs: nearc2 and nearc4, which indicate whether the individual grew up near a two-year college and near a four-year college, respectively. We consider the model with six explanatory variables and fit a multiple instrumental variable model using both the instruments.

We use this data set to provide several illustrations of the choice of instruments and the corresponding goodness of fit. The two variables “nearc2” and “nearc4” can be used as IVs in place of “educ”. Additionally, Wald’s and Bartlett’s instruments can be used to generate IVs for the variable “educ”. We then address the question of which of the four available choices of IV acts as the better IV for fitting the IV regression model. We measure the goodness of fit by computing the GIVE statistic for each choice, which gives 0.1868201 (nearc4) < 0.1868739 (nearc2) < 0.2307384 (Wald) < 0.2816641 (Bartlett). This inequality indicates that the models fitted with Wald’s and Bartlett’s instruments have higher values of the GIVE statistic than those fitted with “nearc2” or “nearc4”. Hence, one can conclude that, among the four available choices of IV, Bartlett’s IV provides the best fit, followed by Wald’s IV, in comparison to “nearc2” or “nearc4”. All these studies illustrate the application of the proposed GIVE statistic in the presence of IVs.
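The comparison of candidate instruments described above can be illustrated with a minimal sketch on synthetic data. It is not the Card data and not the authors' code: the variable names `ability`, `near`, `educ` and `lwage`, the data-generating process and the single-regressor IV formula are all assumptions made for illustration. Bartlett's instrument is taken here to be the classical three-group construction (-1 for the lowest third, 0 for the middle third, +1 for the highest third). The sketch compares slope estimates only and does not compute the GIVE statistic.

```python
import random
import statistics


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    abar, bbar = statistics.fmean(a), statistics.fmean(b)
    return sum((u - abar) * (v - bbar) for u, v in zip(a, b)) / (len(a) - 1)


def iv_slope(x, y, z):
    """Simple IV slope for one regressor; z = x reproduces the OLS slope."""
    return cov(z, y) / cov(z, x)


def bartlett_instrument(x):
    """Three-group instrument: -1 lowest third, 0 middle, +1 highest third."""
    n = len(x)
    k = n // 3
    order = sorted(range(n), key=lambda i: x[i])
    z = [0.0] * n
    for i in order[:k]:
        z[i] = -1.0
    for i in order[n - k:]:
        z[i] = 1.0
    return z


# Hypothetical data: unobserved ability raises both schooling and wages,
# while college proximity shifts schooling only. True return = 0.1.
rng = random.Random(42)
n = 5000
ability = [rng.gauss(0, 1) for _ in range(n)]
near = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
educ = [12 + 1.0 * z + 1.0 * a + rng.gauss(0, 1)
        for z, a in zip(near, ability)]
lwage = [5 + 0.1 * e + 0.3 * a + rng.gauss(0, 0.2)
         for e, a in zip(educ, ability)]

b_ols = iv_slope(educ, lwage, educ)
b_near = iv_slope(educ, lwage, near)
b_bartlett = iv_slope(educ, lwage, bartlett_instrument(educ))
print("OLS slope:             ", round(b_ols, 3))
print("IV slope (proximity):  ", round(b_near, 3))
print("IV slope (Bartlett):   ", round(b_bartlett, 3))
```

In this synthetic setting the proximity indicator is independent of the unobserved ability, whereas the Bartlett grouping is built from the regressor itself, so how well each instrument performs depends on assumptions that a goodness-of-fit measure alone does not check.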

Recommendations: GIVE statistic and concluding remarks

Based on the results in the “Application: Multiple Linear Regression Model”, “Application: Measurement Error Model” and “Analysis: COVID-19 Data Set” sections, the following recommendations are made. The proposed GIVE statistic is capable of measuring the goodness of fit of models in which the regression coefficients are estimated by the IV estimation method. The proposed GIVE statistic works well in models with as well as without an intercept term. The performance of the GIVE statistic depends upon the choice of IV, the sample size and the variance of the random errors. The proposed GIVE statistic is capable of judging the appropriate choice of IVs in order to obtain a better fitted model. The proposed GIVE statistic is also capable of judging variable selection, i.e., whether the explanatory variables added to the model are relevant or not. The GIVE statistic can be used in any biological, biostatistical or econometric model where IV estimation is used. The use of the GIVE statistic is not recommended when either the variance of the random errors (or the variance of the measurement error) is large or the sample size is too small. The GIVE statistic performs better when the variance of the random errors (or the variance of the measurement error) is not too large and/or the sample size is reasonably large, but its performance also depends upon the choice of IVs. The GIVE statistic can be used in any real data application.

We now end this section with a discussion on a robust analogue of the proposed measure. In this context, we would like to point out that the GIVE statistic depends on the choice of instruments in the instrumental variable estimator of the unknown parameter (see (3.3)), and consequently, it is unlikely to be robust against outliers or a large variance of the errors, as the instrumental variable estimator is generally not robust against outliers. To make it robust against outliers, one possibility is to use a robust estimator of the unknown parameter, such as the least absolute deviation estimator (see, e.g.,[21]) or the least median of squares regression estimator (see, e.g.,[22]), and to modify the definition of the GIVE statistic based on the robust analogue of the instrumental variable estimator. It may be noted that, to the best of our knowledge, such estimation methods based on the use of instrumental variables have not yet been developed in the literature. Moreover, there is a trade-off between efficiency and robustness, and consequently, robust estimators are generally not efficient estimators. In order to have better efficiency, we propose our measure based on the instrumental variable estimator, setting aside the issue of robustness. The study of a modified GIVE statistic based on a robust analogue of the instrumental variable estimator will be of interest for future research.

IV	nearc2	nearc4	Wald	Bartlett
GIVE statistic	0.1868739	0.1868201	0.2307384	0.2816641
