
Non-Quadratic Distances in Model Assessment.

Abstract

One natural way to measure model adequacy is by using statistical distances as loss functions. A related fundamental question is how to construct loss functions that are scientifically and statistically meaningful. In this paper, we investigate non-quadratic distances and their role in assessing the adequacy of a model and/or ability to perform model selection. We first present the definition of a statistical distance and its associated properties. Three popular distances, total variation, the mixture index of fit and the Kullback-Leibler distance, are studied in detail, with the aim of understanding their properties and potential interpretations that can offer insight into their performance as measures of model misspecification. A small simulation study exemplifies the performance of these measures and their application to different scientific fields is briefly discussed.


Keywords: Kullback-Leibler distance; divergence measure; mixture index of fit; model assessment; non-quadratic distance; statistical distance; total variation

Year: 2018 PMID: 33265554 PMCID: PMC7512982 DOI: 10.3390/e20060464

Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524

1. Introduction

Model assessment, that is assessing the adequacy of a model and/or ability to perform model selection, is one of the fundamental components of statistical analyses. For example, in the model adequacy problem one usually begins with a fixed model and interest centers on measuring the model misspecification cost. A natural way to create a framework within which we can assess model misspecification is by using statistical distances as loss functions. These constructs measure the distance between the unknown distribution that generated the data and an estimate from the data model. By identifying statistical distances as loss functions, we can begin to understand the role distances play in model fitting and selection, as they become measures of the overall cost of model misspecification. This strategy will allow us to investigate the construction of a loss function as the maximum error in a list of model fit questions. Therefore, our fundamental question is the following. How can one design a loss function that is scientifically and statistically meaningful? We would like to be able to attach a specific scientific meaning to the numerical values of the loss, so that a value of the distance equal to 4, for example, has an explicit interpretation in terms of our statistical goals. When we select between models, we would like to measure the quality of the approximation via the model’s ability to provide answers to important scientific questions. This presupposes that the meaning of “best fitting model” should depend on the “statistical questions” being asked of the model. Lindsay [1] discusses a distance-based framework for assessing model adequacy. A fundamental tenet of the framework for model adequacy put forward by Lindsay [1] is that it is possible and reasonable to carry out a model-based scientific inquiry without believing that the model is true, and without assuming that the truth is included in the model. 
All of this, of course, assumes that we have a way to measure the quality of the approximation to the “truth” that is offered by the model. This point of view never assumes the correctness of the model. Of course, it is rather presumptuous to label any distribution as the truth, as any basic modeling assumption generated by the sampling scheme that provided the data is never exactly true. An example of a basic modeling assumption might be “$X_1, X_2, \ldots, X_n$ are independent, identically distributed from an unknown distribution $\tau$”. This, as any other statistical assumption, is subject to question even in the most idealized of data collection frameworks. However, we believe that well designed experiments can generate data that are similar to data from idealized models; therefore we operate as if the basic assumption is true. This means that we assume that there is a true distribution $\tau$ that generates the data, which is “knowable” if we can collect an infinite amount of data. Furthermore, we note that the basic modeling assumption will be the global framework for assessment of all more restrictive assumptions about the data generation mechanism. In a sense, it is the “nonparametric” extension of the more restrictive models that might be considered. We let $\mathcal{P}$ be the class of all distributions consistent with the basic assumptions. Hence $\tau \in \mathcal{P}$, and sets $\mathcal{H} \subset \mathcal{P}$ are called models. We assume that $\tau \notin \mathcal{H}$; hence, there is a permanent model misspecification error. Statistical distances will then provide a measure for the model misspecification error. One natural way to measure model adequacy is to define a loss function $\rho(\tau, M)$ that describes the loss incurred when the model element M is used instead of the true distribution $\tau$. Such a loss function should, in principle, indicate, in an inferential sense, how far apart the two distributions $\tau$, M are. In the next section, we offer a formal definition of the concept of a statistical distance.
If the statistical questions of interest can be expressed as a list of functionals $g(M)$ of the model M that we wish to be uniformly close to the same functionals $g(\tau)$ of the true distribution, then we can turn the set of model fit questions into a distance via

$$\rho(\tau, M) = \sup_{g} |g(\tau) - g(M)|,$$

where the supremum is taken over the class of functionals of interest. Using the supremum of the individual errors is one way of assessing overall error, and this measure has the nice feature that its value gives a bound on all individual errors. The statistical questions of interest may be global, such as: is the normal model correct in every aspect? Or we may be interested in answers on a few key characteristics, such as the mean.

Lindsay et al. [2] introduced a class of statistical distances, called quadratic distances, and studied their use in the context of goodness-of-fit testing. Furthermore, Markatou et al. [3] discuss extensively the chi-squared distance, a special case of quadratic distance, and its role in robustness. In this paper, we study non-quadratic distances and their role in model assessment.

The paper is organized as follows. Section 2 presents the definition of a statistical distance and its associated properties. Section 3, Section 4 and Section 5 discuss in detail three popular distances, total variation, the mixture index of fit and the Kullback-Leibler distance, with the aim of understanding their role in model assessment problems. The likelihood distance is also briefly discussed in Section 5. Section 6 illustrates computation and applications of total variation, mixture index of fit and Kullback-Leibler distances. Finally, Section 7 presents discussion and conclusions pertaining to the use of total variation and mixture index of fit distances.
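The idea of turning a list of model fit questions into a single distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is taken to be the largest absolute discrepancy $|g(\tau) - g(M)|$ over the listed functionals, and the function names (`mean`, `prob_at_most`, `sup_distance`) and the two small probability mass functions are hypothetical choices, not taken from the paper.

```python
# Illustrative sketch: a distance defined as the worst error over a short
# list of "model fit questions" (functionals of a discrete distribution).
# All names and numbers here are hypothetical.

def mean(p):
    """Mean of a pmf given as {value: probability}."""
    return sum(t * pt for t, pt in p.items())

def prob_at_most(c):
    """Return the functional p -> P(X <= c) for a fixed cutoff c."""
    return lambda p: sum(pt for t, pt in p.items() if t <= c)

def sup_distance(tau, m, functionals):
    """Largest absolute discrepancy |g(tau) - g(m)| over the listed functionals."""
    return max(abs(g(tau) - g(m)) for g in functionals)

tau = {0: 0.5, 1: 0.3, 2: 0.2}   # stand-in for the true pmf
m = {0: 0.4, 1: 0.4, 2: 0.2}     # stand-in for the model pmf
questions = [mean, prob_at_most(0), prob_at_most(1)]
rho = sup_distance(tau, m, questions)
```

By construction, `rho` bounds the error of every individual question in the list, which is the feature the text highlights.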

2. Statistical Distances and Their Properties

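If we adopt the usual convention that loss functions are nonnegative in their arguments, zero if the correct model is used, and larger when the two distributions are less similar, then the loss $\rho(\tau, M)$ can also be viewed as a distance between $\tau$, M. In fact, we will always assume that for any two distributions F, G

$$\rho(F, G) \geq 0 \quad \text{with} \quad \rho(F, F) = 0.$$

If this holds, we will say that $\rho$ is a statistical distance. Unlike the requirements for a metric, we do not require symmetry. In fact, there is no reason that the loss should be symmetric, as the roles of $\tau$, M are different. We also do not require $\rho$ to be nonzero when the arguments differ. This zero property will allow us to specify that two distributions are equivalent as far as our statistical purposes are concerned by giving them zero distance.

Furthermore, it is important to note that if $\tau$ is in $\mathcal{H}$ and $\tau = M_{\theta_0}$, and $M \in \mathcal{H}$, say $M = M_{\theta}$, then the distance $\rho(\tau, M)$ induces a loss function on the parameter space via

$$L_{\rho}(\theta_0, \theta) = \rho(M_{\theta_0}, M_{\theta}).$$

Therefore, if $\tau$ is in the model, the losses defined by $\rho$ are parametric losses.

We begin within the discrete distribution framework. Let $\mathcal{T} = \{0, 1, 2, \ldots, T\}$, where T is possibly infinite, be a discrete sample space. On this sample space we define a true probability density $\tau(t)$, as well as a family of densities $\mathcal{M} = \{m_{\theta}(t) : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Assume we have independent and identically distributed random variables $X_1, X_2, \ldots, X_n$ producing the realizations $x_1, x_2, \ldots, x_n$ from $\tau(\cdot)$. We record the data as $d(t) = n(t)/n$, where $n(t)$ is the number of observations in the sample with value equal to t. We note here that we use the word “density” in a generic fashion that incorporates both probability mass functions and probability density functions.

A rather formal definition of the concept of statistical distance is as follows.

Definition (Markatou et al. [3]). Let $\tau$, m be two probability density functions. Then $\rho(\tau, m)$ is a statistical distance between the corresponding probability distributions if $\rho(\tau, m) \geq 0$, with equality if $\tau$ and m are the same for all statistical purposes.

We would require $\rho(\tau, m)$ to indicate the worst mistake that we can make if we use m instead of $\tau$. The precise meaning of this statement is obvious in the case of total variation that we discuss in detail in Section 3 of the paper. We would also like our statistical distances to be convex in their arguments.

Definition. Let τ, m be a pair of probability density functions, with m being represented as $m = \alpha m_1 + (1 - \alpha) m_2$, where $0 \leq \alpha \leq 1$ and $m_1$, $m_2$ are probability density functions. We say that $\rho(\tau, m)$ is convex in the right argument if

$$\rho(\tau, \alpha m_1 + (1 - \alpha) m_2) \leq \alpha \rho(\tau, m_1) + (1 - \alpha) \rho(\tau, m_2).$$

Definition. Let τ, m be a pair of probability density functions, and assume $\tau = \gamma \tau_1 + (1 - \gamma) \tau_2$, where $0 \leq \gamma \leq 1$ and $\tau_1$, $\tau_2$ are probability density functions. We say that $\rho(\tau, m)$ is convex in the left argument if

$$\rho(\gamma \tau_1 + (1 - \gamma) \tau_2, m) \leq \gamma \rho(\tau_1, m) + (1 - \gamma) \rho(\tau_2, m).$$

Lindsay et al. [2] define and study quadratic distances as measures of goodness of fit, a form of model assessment. In the next sections, we study non-quadratic distances and their role in the problem of model assessment. We begin with the total variation distance.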
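Two points from this section lend themselves to a short numerical illustration: recording the data as proportions (the number of observations equal to t divided by the sample size), and the fact that a statistical distance may be zero for two distributions that differ. The sketch below is a hypothetical example, not the authors' code; `empirical_pmf` and `mean_distance` are names introduced here, and the mean-based loss is chosen only because it makes the zero property easy to see.

```python
from collections import Counter

def empirical_pmf(sample):
    """Proportions d(t) = n(t) / n for each observed value t."""
    n = len(sample)
    return {t: count / n for t, count in Counter(sample).items()}

def mean_distance(p, q):
    """|mean(p) - mean(q)|: nonnegative, and zero whenever the means agree."""
    def mean(f):
        return sum(t * ft for t, ft in f.items())
    return abs(mean(p) - mean(q))

d = empirical_pmf([0, 1, 1, 2, 2, 2, 3, 3])   # n(2) = 3, n = 8, so d(2) = 0.375

f = {0: 0.5, 2: 0.5}   # mean 1
g = {1: 1.0}           # mean 1, but a different distribution
same_for_this_purpose = mean_distance(f, g) == 0.0
```

Here `f` and `g` are different distributions, yet a loss that only asks about the mean treats them as equivalent, which is the sense in which the zero property identifies distributions that agree for the statistical purpose at hand.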

3. Total Variation

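In this section, we study the properties of the total variation distance. We offer a loss function interpretation of this distance and discuss sensitivity issues associated with its use. We will begin with the case of discrete probability measures and then move to the case of continuous probability measures. The results presented here are novel and are useful in selecting the distances to be used in any given problem. The total variation distance is defined as follows.

Definition. Let τ, m be two probability distributions. We define the total variation distance between the probability mass functions τ, m to be

$$V(\tau, m) = \frac{1}{2} \sum_{t} |\tau(t) - m(t)|.$$

This measure is also known as the $L_1$-distance (without the factor $1/2$) or index of dissimilarity.

The total variation distance takes values in the interval $[0, 1]$.

Proof. By definition $V(\tau, m) \geq 0$ with equality if and only if $\tau(t) = m(t)$, $\forall t$. Moreover, $|\tau(t) - m(t)| \leq |\tau(t)| + |m(t)|$. But $\tau$, m are probability mass functions (or densities), therefore $|\tau(t) - m(t)| \leq \tau(t) + m(t)$ and hence

$$\frac{1}{2} \sum_{t} |\tau(t) - m(t)| \leq \frac{1}{2} \Big( \sum_{t} \tau(t) + \sum_{t} m(t) \Big)$$

or, equivalently, $V(\tau, m) \leq \frac{1}{2}(1 + 1) = 1$. Therefore $0 \leq V(\tau, m) \leq 1$. ☐

Proposition. The total variation distance is a metric.

Proof. By definition, the total variation distance is non-negative. Moreover, it is symmetric because $V(\tau, m) = V(m, \tau)$, and it satisfies the triangle inequality since, for any probability mass function g,

$$V(\tau, m) = \frac{1}{2} \sum_{t} |\tau(t) - g(t) + g(t) - m(t)| \leq \frac{1}{2} \sum_{t} |\tau(t) - g(t)| + \frac{1}{2} \sum_{t} |g(t) - m(t)| = V(\tau, g) + V(g, m).$$

Thus, it is a metric. ☐

The following proposition states that the total variation distance is convex in both left and right arguments.

Proposition. Let τ, m be a pair of densities with τ represented as $\tau = \alpha \tau_1 + (1 - \alpha) \tau_2$, $0 \leq \alpha \leq 1$. Then

$$V(\alpha \tau_1 + (1 - \alpha) \tau_2, m) \leq \alpha V(\tau_1, m) + (1 - \alpha) V(\tau_2, m).$$

Moreover, if m is represented as $m = \gamma m_1 + (1 - \gamma) m_2$, $0 \leq \gamma \leq 1$, then

$$V(\tau, \gamma m_1 + (1 - \gamma) m_2) \leq \gamma V(\tau, m_1) + (1 - \gamma) V(\tau, m_2).$$

Proof. It is a straightforward application of the definition of the total variation distance. ☐

The total variation measure has major implications for prediction probabilities. A statistically useful interpretation of the total variation distance is that it can be thought of as the worst error we can commit in probability when we use the model m instead of the truth $\tau$. The maximum value of this error equals 1 and it occurs when $\tau$, m are mutually singular. Denote by $P_{\tau}(A)$ the probability of a set A under the measure $\tau$ and by $P_{m}(A)$ the probability of a set A under the measure m.

Proposition. Let τ, m be two probability mass functions. Then

$$V(\tau, m) = \sup_{A \subset \mathcal{B}} |P_{\tau}(A) - P_{m}(A)|,$$

where A is a subset of the Borel set $\mathcal{B}$.

Proof. Define the sets $B_1 = \{t : \tau(t) > m(t)\}$, $B_2 = \{t : \tau(t) < m(t)\}$, $B_3 = \{t : \tau(t) = m(t)\}$. Notice that

$$P_{\tau}(B_1) + P_{\tau}(B_2) + P_{\tau}(B_3) = P_{m}(B_1) + P_{m}(B_2) + P_{m}(B_3) = 1.$$

Because on the set $B_3$ the two probability mass functions are equal, $P_{\tau}(B_3) = P_{m}(B_3)$, and hence

$$P_{\tau}(B_1) - P_{m}(B_1) = P_{m}(B_2) - P_{\tau}(B_2).$$

Note that, because of the nature of the sets $B_1$ and $B_2$, both terms in the last expression are positive. Therefore

$$V(\tau, m) = \frac{1}{2} \Big[ \sum_{t \in B_1} (\tau(t) - m(t)) + \sum_{t \in B_2} (m(t) - \tau(t)) \Big] = P_{\tau}(B_1) - P_{m}(B_1).$$

Furthermore

$$\sup_{A} |P_{\tau}(A) - P_{m}(A)| = \max \Big\{ \sup_{A} [P_{\tau}(A) - P_{m}(A)], \ \sup_{A} [P_{m}(A) - P_{\tau}(A)] \Big\}.$$

But $\sup_{A} [P_{\tau}(A) - P_{m}(A)] = P_{\tau}(B_1) - P_{m}(B_1)$ and $\sup_{A} [P_{m}(A) - P_{\tau}(A)] = P_{m}(B_2) - P_{\tau}(B_2) = P_{\tau}(B_1) - P_{m}(B_1)$. Therefore $\sup_{A} |P_{\tau}(A) - P_{m}(A)| = P_{\tau}(B_1) - P_{m}(B_1)$ and hence $V(\tau, m) = \sup_{A} |P_{\tau}(A) - P_{m}(A)|$. ☐

The model misspecification measure $\inf_{m \in \mathcal{M}} V(\tau, m)$ has the “minimax” expression

$$\inf_{m \in \mathcal{M}} V(\tau, m) = \inf_{m \in \mathcal{M}} \sup_{A} |P_{\tau}(A) - P_{m}(A)|.$$

This indicates the sense in which the measure assesses the overall risk of using m instead of τ, then chooses m that minimizes the aforementioned risk.

We now offer a testing interpretation of the total variation distance. We establish that the total variation distance can be obtained as a solution to a suitably defined optimization problem. It is obtained as that test function which maximizes the difference between the power and level of a suitably defined test problem. A randomized test function for testing a statistical hypothesis $H_0$ versus an alternative $H_1$ is a function $\phi$ taking values in $[0, 1]$, where $\phi(x)$ is the probability of rejecting $H_0$ when x is observed.

Proposition. Let $H_0$: the data come from m, versus $H_1$: the data come from τ, where τ, m are probability mass functions, and let $\phi$ be a test function. Then

$$V(\tau, m) = \max_{\phi} \{ E_{\tau} \phi(X) - E_{m} \phi(X) \}.$$

Proof. We have

$$E_{\tau} \phi(X) - E_{m} \phi(X) = \sum_{x} \phi(x) [\tau(x) - m(x)].$$

Then the maximum over $0 \leq \phi \leq 1$ is attained at $\phi(x) = 1$ if $\tau(x) > m(x)$ and $\phi(x) = 0$ otherwise. So the maximum equals $\sum_{x : \tau(x) > m(x)} [\tau(x) - m(x)] = V(\tau, m)$. ☐

An advantage of the total variation distance is that it is not sensitive to small changes in the density. That is, if $\tau(t)$ is replaced by $\tau(t) + e(t)$, where $\sum_{t} e(t) = 0$ and $\sum_{t} |e(t)|$ is small, then

$$V(\tau + e, m) = \frac{1}{2} \sum_{t} |\tau(t) + e(t) - m(t)| \leq V(\tau, m) + \frac{1}{2} \sum_{t} |e(t)|.$$

Therefore, when the changes in the density are small, $V(\tau + e, m) \approx V(\tau, m)$.

When describing a population, it is natural to describe it via the proportion of individuals in various subgroups. Having small $V(\tau, m)$ would ensure uniform accuracy for all such descriptions. On the other hand, populations are also described in terms of a variety of other variables, such as means. Having the total variation measure small does not imply that means are close on the scale of standard deviation.

The total variation distance is not differentiable in the arguments.

We now study the total variation distance in continuous probability models.

Definition. The total variation distance between two probability density functions τ, m is defined as

$$V(\tau, m) = \frac{1}{2} \int |\tau(x) - m(x)| \, dx.$$

The total variation distance has the same interpretation as in the discrete probability model case. That is,

$$V(\tau, m) = \sup_{A} |P_{\tau}(A) - P_{m}(A)|.$$

One of the important issues in the construction of distances in continuous spaces is the issue of invariance, because the behavior of distance measures under transformations of the data is of interest. Suppose we take a monotone transformation of the observed variable X and use the corresponding model distribution; how does this transformation affect the distance between X and the model? Invariance seems to be desirable from an inferential point of view, but difficult to achieve without forcing one of the distributions to be continuous and appealing to the probability integral transform for a common scale. In multivariate continuous spaces, the problem of transformation invariance is even more difficult, as there is no longer a natural probability integral transformation to bring data and model on a common scale.

Proposition. Let $Y = a(X)$ be a one-to-one transformation of the random variable X, and let $\tau_Y$, $m_Y$ denote the densities of Y induced by $\tau_X$, $m_X$. Then $V(\tau_Y, m_Y) = V(\tau_X, m_X)$.

Proof. Write

$$V(\tau_Y, m_Y) = \frac{1}{2} \int |\tau_Y(y) - m_Y(y)| \, dy = \frac{1}{2} \int |\tau_X(b(y)) - m_X(b(y))| \, |b'(y)| \, dy,$$

where $b(y)$ is the inverse transformation. Next, we do a change of variable in the integral. Set $x = b(y)$, from where we obtain $y = a(x)$ and $dy = a'(x) \, dx$; the prime denotes derivative with respect to the corresponding argument. Then

$$V(\tau_Y, m_Y) = \frac{1}{2} \int |\tau_X(x) - m_X(x)| \, |b'(a(x))| \, |a'(x)| \, dx.$$

But $b(a(x)) = x$, hence $b'(a(x)) \, a'(x) = 1$. Now since $a$ is a one-to-one transformation, $a(x)$ is either increasing or decreasing on different segments of $\mathbb{R}$. Thus

$$V(\tau_Y, m_Y) = \frac{1}{2} \int |\tau_X(x) - m_X(x)| \, dx = V(\tau_X, m_X),$$

where $y = a(x)$. ☐

A fundamental problem with the total variation distance is that it cannot be used to compute the distance between a discrete distribution and a continuous distribution, because the total variation distance between a continuous measure and a discrete measure is always the maximum possible, that is 1. This inability of the total variation distance to discriminate between discrete and continuous measures can be interpreted as asking “too many questions” at once, without any prioritization. This limits its use despite its invariance characteristics.

We now discuss the relationship between the total variation distance and Fisher information. Denote by $m_{\theta}^{(n)}$ the joint density of n independent and identically distributed random variables. Then we have the following proposition.

Proposition. The total variation distance is locally equivalent to the Fisher information number, that is

$$\frac{1}{\sqrt{n}} \, \frac{V(m_{\theta}^{(n)}, m_{\theta_0}^{(n)})}{|\theta - \theta_0|} \longrightarrow \sqrt{\frac{I(\theta_0)}{2\pi}}, \quad \text{as } \theta \to \theta_0 \text{ and then } n \to \infty,$$

where $I(\theta_0)$ is the Fisher information of a single observation at $\theta_0$.

Proof. By definition

$$V(m_{\theta}^{(n)}, m_{\theta_0}^{(n)}) = \frac{1}{2} \int |m_{\theta}^{(n)}(x) - m_{\theta_0}^{(n)}(x)| \, dx.$$

Now, expand $m_{\theta}^{(n)}$ using Taylor series in the neighborhood of $\theta_0$ to obtain

$$m_{\theta}^{(n)}(x) \approx m_{\theta_0}^{(n)}(x) + (\theta - \theta_0) \, [m_{\theta_0}^{(n)}(x)]',$$

where the prime denotes derivative with respect to the parameter $\theta$. Further, write $[m_{\theta_0}^{(n)}(x)]' = m_{\theta_0}^{(n)}(x) \sum_{i=1}^{n} u_{\theta_0}(x_i)$ to obtain

$$\frac{V(m_{\theta}^{(n)}, m_{\theta_0}^{(n)})}{|\theta - \theta_0|} \longrightarrow \frac{1}{2} E_{\theta_0} \Big| \sum_{i=1}^{n} u_{\theta_0}(X_i) \Big|, \quad \text{as } \theta \to \theta_0,$$

where $u_{\theta}(x) = \frac{\partial}{\partial \theta} \log m_{\theta}(x)$. Therefore, assuming that $n^{-1/2} \sum_{i=1}^{n} u_{\theta_0}(X_i)$ converges to a normal random variable in absolute mean, then

$$\frac{1}{\sqrt{n}} \, \frac{V(m_{\theta}^{(n)}, m_{\theta_0}^{(n)})}{|\theta - \theta_0|} \longrightarrow \frac{1}{2} \sqrt{I(\theta_0)} \, E|Z| = \sqrt{\frac{I(\theta_0)}{2\pi}},$$

because $n^{-1/2} \sum_{i=1}^{n} u_{\theta_0}(X_i)$ converges in distribution to $\sqrt{I(\theta_0)} \, Z$, and $E|Z| = \sqrt{2/\pi}$ when $Z \sim N(0, 1)$. ☐

The total variation is a non-quadratic distance. It is however related to a quadratic distance, the Hellinger distance, defined as

$$H^2(\tau, m) = \frac{1}{2} \sum_{t} \Big( \sqrt{\tau(t)} - \sqrt{m(t)} \Big)^2,$$

by the following inequality.

Proposition. Let τ, m be two probability mass functions. Then

$$H^2(\tau, m) \leq V(\tau, m) \leq \big\{ H^2(\tau, m) \, [2 - H^2(\tau, m)] \big\}^{1/2}.$$

Proof. Straightforward using the definitions of the distances involved and the Cauchy-Schwarz inequality. Hölder’s inequality provides $\sum_{t} \sqrt{\tau(t) m(t)} \leq 1$. ☐

Note that $2 H^2(\tau, m) = \sum_{t} (\sqrt{\tau(t)} - \sqrt{m(t)})^2$; the square root of this quantity, that is $\{\sum_{t} (\sqrt{\tau(t)} - \sqrt{m(t)})^2\}^{1/2}$, is known as Matusita’s distance [6,7]. Further, define the affinity between two probability densities by

$$\mathrm{Aff}(\tau, m) = \sum_{t} \sqrt{\tau(t)} \sqrt{m(t)}.$$

Then, it is easy to prove that

$$\sum_{t} \Big( \sqrt{\tau(t)} - \sqrt{m(t)} \Big)^2 = 2 \, [1 - \mathrm{Aff}(\tau, m)] \leq \sum_{t} |\tau(t) - m(t)| = 2 V(\tau, m).$$

The above inequality indicates the relationship between total variation and Matusita’s distance.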
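The main properties of the total variation distance discussed in this section can be checked numerically on a small example. The sketch below assumes the usual definitions, $V(\tau, m) = \frac{1}{2} \sum_t |\tau(t) - m(t)|$ and $H^2(\tau, m) = \frac{1}{2} \sum_t (\sqrt{\tau(t)} - \sqrt{m(t)})^2$; the function names and the two probability mass functions are hypothetical, and the brute-force search over subsets is only feasible for tiny sample spaces.

```python
from itertools import combinations
from math import sqrt

def total_variation(tau, m):
    """V(tau, m) = 0.5 * sum_t |tau(t) - m(t)| for pmfs stored as dicts."""
    support = set(tau) | set(m)
    return 0.5 * sum(abs(tau.get(t, 0.0) - m.get(t, 0.0)) for t in support)

def worst_set_error(tau, m):
    """Brute-force sup over all subsets A of |P_tau(A) - P_m(A)|."""
    support = sorted(set(tau) | set(m))
    best = 0.0
    for k in range(len(support) + 1):
        for subset in combinations(support, k):
            p_tau = sum(tau.get(t, 0.0) for t in subset)
            p_m = sum(m.get(t, 0.0) for t in subset)
            best = max(best, abs(p_tau - p_m))
    return best

def hellinger_sq(tau, m):
    """H^2(tau, m) = 0.5 * sum_t (sqrt(tau(t)) - sqrt(m(t)))^2."""
    support = set(tau) | set(m)
    return 0.5 * sum(
        (sqrt(tau.get(t, 0.0)) - sqrt(m.get(t, 0.0))) ** 2 for t in support
    )

tau = {0: 0.5, 1: 0.3, 2: 0.2}
m = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}

v = total_variation(tau, m)          # worst error in probability
v_sets = worst_set_error(tau, m)     # agrees with v up to rounding
h2 = hellinger_sq(tau, m)
within_bounds = h2 <= v <= sqrt(h2 * (2.0 - h2))
```

In this example the worst-case set is $\{0\}$, where τ puts more mass than m, in line with the proof of the supremum representation.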

4. Mixture Index of Fit

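Rudas, Clogg, and Lindsay [8] proposed a new index of fit approach to evaluate goodness of fit in the analysis of contingency tables, based on the mixture model framework. The approach focuses attention on the discrepancy between the model and the data, and allows comparisons across studies. Suppose $\mathcal{M}$ is the baseline model. The family of models proposed for evaluating goodness of fit is a two-point mixture model given by

$$\mathcal{M}_{\pi} = \{ \tau : \tau(t) = (1 - \pi) m_{\theta}(t) + \pi e(t), \ m_{\theta}(t) \in \mathcal{M}, \ e(t) \text{ arbitrary}, \ \theta \in \Theta \}.$$

Here $\pi$ denotes the mixing proportion, which is interpreted as the proportion of the population outside the model $\mathcal{M}$. In the robustness literature the mixing proportion corresponds to the contamination proportion, as explained below. In the contingency table framework $m_{\theta}(t)$, $e(t)$ describe the tables of probabilities for each latent class. The family of models $\mathcal{M}_{\pi}$ defines a class of nested models as $\pi$ varies from zero to one. Thus, if the model $\mathcal{M}$ does not fit the data well, then by increasing $\pi$, the model $\mathcal{M}_{\pi}$ will be an adequate fit for $\pi$ sufficiently large.

We can motivate the index of fit by thinking of the population as being composed of two classes with proportions $1 - \pi$ and $\pi$ respectively. The first class is perfectly described by $\mathcal{M}$, whereas the second class contains the “outliers”. The index of fit can then be interpreted as the fraction of the population intrinsically outside $\mathcal{M}$, that is, the proportion of outliers in the sample.

We note here that these ideas can be extended beyond the contingency table framework. In our setting, the probability distribution describing the true data generating mechanism may be written as $\tau(t) = (1 - \pi) m_{\theta}(t) + \pi e(t)$, where $0 \leq \pi \leq 1$ and $e(t)$ is arbitrary. This representation of $\tau(t)$ is not unique, in that we can construct another representation $\tau(t) = (1 - \pi - \delta) m_{\theta}(t) + (\pi + \delta) e^{*}(t)$. However, there always exists the smallest unique $\pi$ such that there exists a representation of $\tau(t)$ that puts the maximum proportion in one of the population classes. Next, we define formally the mixture index of fit.

Definition (Rudas, Clogg, and Lindsay [8]). The mixture index of fit $\pi^{*}$ is defined by

$$\pi^{*}(\tau) = \inf \{ \pi : \tau(t) = (1 - \pi) m_{\theta}(t) + \pi e(t), \ m_{\theta}(t) \in \mathcal{M}, \ e(t) \text{ arbitrary} \}.$$

Notice that $\pi^{*}(\tau)$ is a distance. This is because if we set $\pi^{*}(\tau, m_{\theta}) = \inf \{ \pi : \tau(t) = (1 - \pi) m_{\theta}(t) + \pi e(t), \ e(t) \text{ arbitrary} \}$ for a fixed $m_{\theta}(t)$, we have $\pi^{*}(\tau, m_{\theta}) \geq 0$ and $\pi^{*}(\tau, m_{\theta}) = 0$ if $\tau = m_{\theta}$.

Definition. Define the statistical distance $\pi^{*}(\tau, m)$ between τ and a fixed density m by

$$\pi^{*}(\tau, m) = \inf \{ \pi : \tau(t) = (1 - \pi) m(t) + \pi e(t), \ e(t) \text{ arbitrary} \}.$$

Note that, to be able to present Proposition 8 below, we have turned arbitrary discrete distributions into vectors. As an example, if the sample space is $\mathcal{T} = \{0, 1, \ldots, T\}$, a discrete distribution τ is written as the vector $(\tau(0), \tau(1), \ldots, \tau(T))^{\top}$.

Proposition 8. The set of vectors τ satisfying $\pi^{*}(\tau, m) \leq \pi_0$ is a simplex with extremal points $(1 - \pi_0) m + \pi_0 \delta_i$, where $\delta_i$ is the vector that takes the value 1 at the $(i+1)$th position and the value 0 everywhere else.

Proof. Given τ with $\pi^{*}(\tau, m) \leq \pi_0$, there exists a representation of $\tau = (1 - \pi_0) m + \pi_0 e$. Write any arbitrary discrete distribution e as follows:

$$e = e_0 \delta_0 + e_1 \delta_1 + \cdots + e_T \delta_T,$$

where $\sum_{i} e_i = 1$ and $\delta_i$ takes the value 1 at the $(i+1)$th position and the value 0 everywhere else. Then

$$\tau = \sum_{i} e_i \, [(1 - \pi_0) m + \pi_0 \delta_i],$$

which belongs to a simplex. ☐

Proposition 9. We have

$$\pi^{*}(\tau, m) = \sup_{t} \Big[ 1 - \frac{\tau(t)}{m(t)} \Big] = 1 - \inf_{t} \frac{\tau(t)}{m(t)}.$$

Proof. Define $\lambda = \inf_{t} \tau(t)/m(t)$ and $\pi^{*} = 1 - \lambda$. Then $\tau(t) - (1 - \pi^{*}) m(t) \geq 0$, with equality at some t. Let now the error term be

$$e^{*}(t) = \frac{\tau(t) - (1 - \pi^{*}) m(t)}{\pi^{*}}.$$

Then $\tau(t) = (1 - \pi^{*}) m(t) + \pi^{*} e^{*}(t)$, and $\pi^{*}$ cannot be made smaller without making $e^{*}(t)$ negative at a point $t_0$. This concludes the proof. ☐

We have $\pi^{*}(\tau, m) = 1$ if there exists $t_0$ such that $\tau(t_0) = 0$ and $m(t_0) > 0$.

Proof. By Proposition 9, $1 - \tau(t)/m(t) \leq 1$ for every t, but it equals 1 at $t_0$. ☐

One of the advantages of the mixture index of fit is that it has an intuitive interpretation that does not depend upon the specific nature of the model being assessed. Liu and Lindsay [9] extended the results of Rudas et al. [8] to the Kullback-Leibler distance. Computational aspects of the mixture index of fit are discussed in Xi and Lindsay [4] as well as in Dayton [10] and Ispány and Verdes [11].

Finally, a new interpretation of the mixture index of fit was presented by Ispány and Verdes [11]. Let $\mathcal{P}$ be the set of probability measures and $\mathcal{H} \subset \mathcal{P}$. If d is a distance measure on $\mathcal{P}$ and $N(\mathcal{H}, \pi) = \{ Q : Q = (1 - \pi) M + \pi R, \ M \in \mathcal{H}, \ R \in \mathcal{P} \}$, then $\pi^{*} = \pi^{*}(P, \mathcal{H})$ is the least non-negative solution of the equation $d(P, N(\mathcal{H}, \pi)) = 0$ in $\pi$.

Next, we offer some interpretations associated with the mixture index of fit. The statistical interpretations made with this measure are attractive, as any statement based on the model applies to at least $1 - \pi^{*}$ of the population involved. However, while the “outlier” model seems interpretable and attractive, the distance itself is not very robust. In other words, small changes in the probability mass function do not necessarily mean small changes in distance. This is because if $m(t) = \varepsilon$, then a change of $\varepsilon$ in $\tau(t)$ from $\varepsilon$ to 0 causes $\pi^{*}(\tau, m)$ to go to 1. Moreover, assume that our framework is that of continuous probability measures, and that our model is a normal density. If $\tau$ is a lighter tailed distribution than our normal model m, then $\inf_{x} \tau(x)/m(x) = 0$ and therefore $\pi^{*}(\tau, m) = 1$. That is, light tailed densities are interpreted as outliers. Therefore, the mixture index of fit measures error from the model in a “one-sided” way. This is in contrast to total variation, which measures the size of “holes” as well as the “outliers” by allowing the distributional errors to be neutral.

In what follows, we show that if we can find a mixture representation for the true distribution, then this implies a small total variation distance between the true probability mass function $\tau$ and the assumed model m. Specifically, we have the following.

Proposition. Let π* be the mixture index of fit. If $\tau(t) = (1 - \pi^{*}) m_{\theta}(t) + \pi^{*} e(t)$, then $V(\tau, m_{\theta}) \leq \pi^{*}$.

Proof. Write

$$V(\tau, m_{\theta}) = \frac{1}{2} \sum_{t} |(1 - \pi^{*}) m_{\theta}(t) + \pi^{*} e(t) - m_{\theta}(t)| = \pi^{*} \, \frac{1}{2} \sum_{t} |e(t) - m_{\theta}(t)| = \pi^{*} V(e, m_{\theta}),$$

with $\pi^{*} \in [0, 1]$. This is because there always exists the smallest unique $\pi^{*}$ such that $\tau$ can be represented as a mixture model. Thus, since $V(e, m_{\theta}) \leq 1$, the above relationship can be written as $V(\tau, m_{\theta}) \leq \pi^{*}$. ☐

There is a mixture representation that connects total variation with the mixture index of fit. This is presented below.

Proposition. Denote by

$$W(\tau, m) = \inf \{ \pi : (1 - \pi) \tau(t) + \pi e_1(t) = (1 - \pi) m(t) + \pi e_2(t) \}.$$

Then

$$W(\tau, m) = \frac{V(\tau, m)}{1 + V(\tau, m)}.$$

Proof. Fix $\tau$; for any given m let $(e_1, e_2, \pi)$ be a solution to the equation

$$(1 - \pi) \tau(t) + \pi e_1(t) = (1 - \pi) m(t) + \pi e_2(t). \quad (1)$$

Let $q_1(t) = [\tau(t) - m(t)]^{+} / V(\tau, m)$ and $q_2(t) = [m(t) - \tau(t)]^{+} / V(\tau, m)$, and note that since $\sum_{t} [\tau(t) - m(t)] = 0$ then $\sum_{t} [\tau(t) - m(t)]^{+} = \sum_{t} [m(t) - \tau(t)]^{+} = V(\tau, m)$, so that $q_1$, $q_2$ are probability mass functions with disjoint supports. Rewrite now Equation (1) as follows:

$$e_2(t) - e_1(t) = c \, [q_1(t) - q_2(t)],$$

where $c = (1 - \pi) V(\tau, m) / \pi$ and $\pi > 0$. Thus, ignoring the constraints, every pair $(e_1, e_2)$ satisfying the equation above also satisfies $e_1(t) = c \, q_2(t) + r(t)$ and $e_2(t) = c \, q_1(t) + r(t)$ for some function $r(t)$. Moreover, such a pair must have $r(t) \geq 0$ in order for the constraints $e_1(t) \geq 0$, $e_2(t) \geq 0$ to be satisfied. Hence, varying $r$ gives a class of solutions. To determine $\pi$, sum $e_1(t)$ over t to obtain $1 = c + \sum_{t} r(t)$, so that $c \leq 1$, and the maximum value $c = 1$ is obtained when $r(t) = 0$ for all t. Therefore the smallest $\pi$ satisfies $(1 - \pi) V(\tau, m) = \pi$, and so $W(\tau, m) = V(\tau, m) / [1 + V(\tau, m)]$. ☐

Therefore, for small $V(\tau, m)$ the mixture index of fit and the total variation distance are nearly equal.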
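The behaviour of the mixture index of fit against a fixed model can be illustrated numerically. The sketch below rests on assumptions that are stated here rather than taken verbatim from the text: it uses the closed form $\pi^{*}(\tau, m) = 1 - \min_t \tau(t)/m(t)$ for a fixed m, the total variation distance $V = \frac{1}{2}\sum_t |\tau(t) - m(t)|$, and the value $V/(1+V)$ for the smallest π in the two-sided representation $(1-\pi)\tau + \pi e_1 = (1-\pi)m + \pi e_2$. All function names and numbers are hypothetical.

```python
def total_variation(tau, m):
    """V(tau, m) = 0.5 * sum_t |tau(t) - m(t)|."""
    support = set(tau) | set(m)
    return 0.5 * sum(abs(tau.get(t, 0.0) - m.get(t, 0.0)) for t in support)

def mixture_index(tau, m):
    """pi*(tau, m) = 1 - min over {t: m(t) > 0} of tau(t) / m(t)."""
    ratios = [tau.get(t, 0.0) / mt for t, mt in m.items() if mt > 0.0]
    return max(0.0, 1.0 - min(ratios))

def residual_component(tau, m):
    """e*(t) = (tau(t) - (1 - pi*) m(t)) / pi*, the 'outlier' class."""
    pi_star = mixture_index(tau, m)
    if pi_star == 0.0:
        raise ValueError("tau equals m; there is no residual component")
    support = set(tau) | set(m)
    return {
        t: (tau.get(t, 0.0) - (1.0 - pi_star) * m.get(t, 0.0)) / pi_star
        for t in support
    }

def two_sided_index(tau, m):
    """Smallest pi in the two-sided mixture representation: V / (1 + V)."""
    v = total_variation(tau, m)
    return v / (1.0 + v)

tau = {0: 0.5, 1: 0.25, 2: 0.25}
m = {0: 0.25, 1: 0.5, 2: 0.25}

pi_star = mixture_index(tau, m)          # 0.5: half the population is "outside" m
residual = residual_component(tau, m)    # a valid pmf with a zero at t = 1
v = total_variation(tau, m)              # 0.25, no larger than pi_star

# Lack of robustness: removing a tiny amount of mass from one cell of tau
# pushes the index all the way to 1.
fragile = mixture_index({0: 1.0}, {0: 0.99, 1: 0.01})
```

The last line mirrors the remark in the text: a cell where the model has small positive mass and the truth has none is enough to drive the index to its maximum.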

5. Kullback-Leibler Distance

The Kullback-Leibler distance [12] is extensively used in statistics and in particular in model selection. The celebrated AIC model selection criterion [13] is based on this distance. In this section, we present the Kullback-Leibler distance and some of its properties with particular emphasis on interpretations.

Definition. The Kullback-Leibler distance between two densities τ, m is defined as

$$K^2(\tau, m) = \sum_{t} m(t) \, [\log m(t) - \log \tau(t)]$$

or

$$K^2(\tau, m) = \int m(x) \, [\log m(x) - \log \tau(x)] \, dx.$$

Proposition. The Kullback-Leibler distance is nonnegative, that is $K^2(\tau, m) \geq 0$, with equality if and only if $\tau(t) = m(t)$ for all t.

Proof. Write

$$K^2(\tau, m) = \sum_{t} \tau(t) \Big[ \frac{m(t)}{\tau(t)} \log \frac{m(t)}{\tau(t)} - \frac{m(t)}{\tau(t)} + 1 \Big].$$

Set $r = m(t)/\tau(t) \geq 0$; then $r \log r - r + 1$ is a convex, non-negative function that equals 0 at $r = 1$. Therefore $K^2(\tau, m) \geq 0$. ☐

Definition. We define the likelihood distance between two densities τ, m as

$$\lambda^2(\tau, m) = \sum_{t} \tau(t) \, [\log \tau(t) - \log m(t)].$$

The intuition behind the above expression of the likelihood distance comes from the fact that the log-likelihood, in the case of discrete random variables taking finitely many values (groups), can be written, after appropriate algebraic manipulations, in the above form.

Alternatively, we can write the likelihood distance as

$$\lambda^2(\tau, m) = \sum_{t} m(t) \Big[ \frac{\tau(t)}{m(t)} \log \frac{\tau(t)}{m(t)} - \frac{\tau(t)}{m(t)} + 1 \Big]$$

and use this relationship to obtain insight into connections of the likelihood distance with the chi-squared measures studied by Markatou et al. [3]. Specifically, if we write Pearson’s chi-squared statistic as

$$P^2(\tau, m) = \sum_{t} \frac{[\tau(t) - m(t)]^2}{m(t)} = \sum_{t} m(t) \Big[ \frac{\tau(t)}{m(t)} - 1 \Big]^2,$$

then from the functional relationship $r \log r - r + 1 \leq (r - 1)^2$, with $r = \tau(t)/m(t)$, we obtain that $\lambda^2 \leq P^2$. However, it is also clear from the right tails of the functions that there is no way to bound $\lambda^2$ below by a multiple of $P^2$. Hence, these measures are not equivalent in the same way that Hellinger distance and symmetric chi-squared are (see Lemma 4, Markatou et al. [3]). In particular, knowing that $\lambda^2$ is small is no guarantee that all Pearson z-statistics are uniformly small.

On the other hand, one can show by the same mechanism that $S^2 \leq k \lambda^2$, where k is a constant and $S^2$ is the symmetric chi-squared distance given as

$$S^2(\tau, m) = \sum_{t} \frac{[\tau(t) - m(t)]^2}{[\tau(t) + m(t)]/2}.$$

It is therefore true that small likelihood distance $\lambda^2$ implies small z-statistics with blended variance estimators. However, the reverse is not true because the right tail in r for $S^2$ is of magnitude r, as opposed to $r \log r$ for the likelihood distance.

These comparisons provide some feeling for the statistical interpretation of the likelihood distance. Its meaning as a measure of model misspecification is unclear. Furthermore, our impression is that likelihood, like Pearson’s chi-squared, is too sensitive to outliers and gross errors in the data.

Despite Kullback-Leibler’s theoretical and computational advantages, a point of inconvenience in the context of model selection is the lack of symmetry. One can show that reversing the roles of the arguments in the Kullback-Leibler divergence can yield substantially different results. The sum of the Kullback-Leibler distance and the likelihood distance produces the symmetric Kullback-Leibler distance or J divergence. This measure is symmetric in the arguments, and when used as a model selection measure it is expected to be more sensitive than each of the individual components.
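The lack of symmetry, and the relation between the likelihood distance and Pearson's chi-squared, can be seen on a two-point example. The sketch below makes an assumption about argument order that should be checked against the paper's displayed formulas: the Kullback-Leibler distance is taken as $\sum_t m(t) \log[m(t)/\tau(t)]$ and the likelihood distance as $\sum_t \tau(t) \log[\tau(t)/m(t)]$. The helper `directed_divergence` and the example distributions are hypothetical.

```python
from math import log

def directed_divergence(p, q):
    """sum_t p(t) * log(p(t) / q(t)); infinite if q misses mass that p has."""
    total = 0.0
    for t, pt in p.items():
        if pt == 0.0:
            continue
        qt = q.get(t, 0.0)
        if qt == 0.0:
            return float("inf")
        total += pt * log(pt / qt)
    return total

def kullback_leibler(tau, m):
    """Assumed convention: weights taken from the model m."""
    return directed_divergence(m, tau)

def likelihood_distance(tau, m):
    """Assumed convention: weights taken from the truth tau."""
    return directed_divergence(tau, m)

def j_divergence(tau, m):
    """Symmetric sum of the two directed distances."""
    return kullback_leibler(tau, m) + likelihood_distance(tau, m)

def pearson_chi2(tau, m):
    """sum_t (tau(t) - m(t))^2 / m(t), for m(t) > 0 on the support."""
    return sum((tau.get(t, 0.0) - mt) ** 2 / mt for t, mt in m.items())

tau = {0: 0.5, 1: 0.5}
m = {0: 0.25, 1: 0.75}

k2 = kullback_leibler(tau, m)       # differs from lam2: the measure is asymmetric
lam2 = likelihood_distance(tau, m)
j = j_divergence(tau, m)            # unchanged if tau and m are swapped
p2 = pearson_chi2(tau, m)           # lam2 <= p2, as in the text
```

Swapping the two arguments exchanges `k2` and `lam2` but leaves `j` unchanged, which is the symmetry the J divergence is designed to provide.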

6. Computation and Applications of Total Variation, Mixture Index of Fit and Kullback-Leibler Distances

The distances discussed in this paper are used in a number of important applications. Euán et al. [14] use the total variation to detect changes in wave spectra, while Alvarez-Esteban et al. [15] cluster time series data on the basis of the total variation distance. The mixture index of fit has found a number of applications in the area of social sciences. Rudas et al. [8] provided examples of the application of π* to two-way contingency tables. Applications involving differential item functioning and latent class analysis were presented in Rudas and Zwick [16] and Dayton [17] respectively. Formann [18] applied it in regression models involving continuous variables. Finally, Revuelta [19] applied the π* goodness-of-fit statistic to finite mixture item response models that were developed mainly in connection with Rasch models [20,21].

The Kullback-Leibler (KL) distance [12] is fundamental in information theory and its applications. In statistics, the celebrated Akaike Information Criterion (AIC) [13,22], widely used in model selection, is based on the Kullback-Leibler distance. There are numerous additional applications of the KL distance in fields such as fluid mechanics, neuroscience, and machine learning. In economics, Smith, Naik, and Tsai [23] use KL distance to simultaneously select the number of states and variables associated with Markov-switching regression models that are used in marketing and other business applications. KL distance is also used in diagnostic testing for ruling in or ruling out disease [24,25], as well as in a variety of other fields [26].

Table 1 presents the software, written in R, that can be used to compute the aforementioned distances. Additionally, Zhang and Dayton [27] present a SAS program to compute the two-point mixture index of fit for the two-class latent class analysis models with dichotomous variables. There are a number of different algorithms that can be used to compute the mixture index of fit for contingency tables. Rudas et al. [8] propose to use a standard EM algorithm, while Xi and Lindsay [4] use sequential quadratic programming and discuss technical details and numerical issues related to applying nonlinear programming techniques to estimate π*. Dayton [10] discusses explicitly the practical advantages associated with the use of nonlinear programming as well as the limitations, while Pan and Dayton [28] study a variety of additional issues associated with computing π*. Additional algorithms associated with the computation of π* can be found in Verdes [29] and Ispány and Verdes [11].

Table 1

Computer packages for calculating total variation, mixture index of fit, and Kullback-Leibler distances.

Information	Total Variation	Kullback-Leibler	Mixture Index of Fit
R package	distrEx	bioDist	pistar
R function	TotalVarDist	KLD.matrix	pistar.uv
Dimension	Univariate	Univariate	Univariate
Website	https://cran.r-project.org/web/packages/distrEx/	http://bioconductor.org/packages/release/bioc/html/bioDist.html	https://rdrr.io/github/jmedzihorsky/pistar/man/

We now describe a simulation study that aims to illustrate the performance of the total variation, Kullback-Leibler, and mixture index of fit as model selection measures. Data are generated from either an asymmetric contamination model $(1 - \varepsilon) N(0, 1) + \varepsilon N(\mu, \sigma^2)$, or from a symmetric contamination model $(1 - \varepsilon) N(0, 1) + \varepsilon N(0, \sigma^2)$, where $\varepsilon$ is the percentage of contamination. Specifically, we generate 500 Monte Carlo samples of sample sizes 200, 1000, and 5000 as follows. If the sample has size n and the percentage of contamination is $\varepsilon$, then $n\varepsilon$ of the sample is generated from model $N(\mu, \sigma^2)$ or $N(0, \sigma^2)$ and the remaining $n(1 - \varepsilon)$ from a $N(0, 1)$ model. We use $\mu = 1, 5, 10$ and $\sigma^2 = 1$ in the $N(\mu, \sigma^2)$ model and $\sigma^2 = 4, 9, 16$ in the $N(0, \sigma^2)$ model. The total variation distance was computed between the simulated data and the $N(0, 1)$ model. The Kullback-Leibler distance was calculated between the data generated from the aforementioned contamination models and a random sample of the same size n from $N(0, 1)$. When computing the mixture index of fit, we specified the component distribution as a normal distribution with initial mean 0 and variance 1. All simulations were carried out on a laptop computer with an Intel Core i7 processor and a 64-bit Windows 7 operating system. The R packages used are presented in Table 1.

Table 2 and Table 3 present means and standard deviations of the total variation and Kullback-Leibler distances as a function of the contamination model and the sample size. To compute the total variation distance we use the R function “TotalVarDist” of the R package “distrEx”. It smooths the empirical distribution of the provided data using a normal kernel and computes the distance between the smoothed empirical distribution and the provided continuous distribution (in our case this distribution is $N(0, 1)$). We note here that the package “distrEx” provides an alternative option to compute the total variation, which relies on discretizing the continuous distribution and then computes the distance between the discretized continuous distribution and the data.
We think that smoothing the data to obtain an empirical estimator of the density and then calculating its distance from the continuous density is a more natural way to handle the difference in scale between the discrete data and the continuous model. Lindsay [1] and Markatou et al. [3] discuss this phenomenon and call it discretization robustness. The Kullback-Leibler distance was computed using the function “KLD.matrix” of the R package “bioDist”.
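The sampling scheme of the simulation can be sketched in Python using only the standard library. This is not the authors' R code and does not reproduce the kernel smoothing performed by “TotalVarDist”; as a plainly different stand-in technique, it bins both the sample and the N(0, 1) model on a fixed grid and computes the total variation between the binned distributions. The function names, the grid, the seed, and the reading of the second parameter of each contaminating normal as a variance are all assumptions made for illustration.

```python
import random
from math import erf, sqrt

def contaminated_sample(n, eps, mu, sigma, rng):
    """About n*(1-eps) draws from N(0, 1) plus n*eps draws from N(mu, sigma^2).

    sigma is a standard deviation, so a contaminating variance of 4 means sigma=2.
    """
    n_contam = round(n * eps)
    clean = [rng.gauss(0.0, 1.0) for _ in range(n - n_contam)]
    contam = [rng.gauss(mu, sigma) for _ in range(n_contam)]
    return clean + contam

def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binned_tv_to_std_normal(sample, lo=-4.0, hi=4.0, bins=16):
    """Crude TV between the binned sample and the binned N(0, 1) model.

    The two tails outside [lo, hi) are treated as separate cells.
    """
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * (bins + 2)
    for x in sample:
        if x < lo:
            counts[0] += 1
        elif x >= hi:
            counts[-1] += 1
        else:
            counts[1 + min(int((x - lo) / width), bins - 1)] += 1
    model = [std_normal_cdf(lo)]
    model += [std_normal_cdf(b) - std_normal_cdf(a) for a, b in zip(edges, edges[1:])]
    model.append(1.0 - std_normal_cdf(hi))
    n = len(sample)
    return 0.5 * sum(abs(c / n - p) for c, p in zip(counts, model))

rng = random.Random(2018)  # arbitrary seed, for reproducibility only
low = contaminated_sample(1000, 0.01, 5.0, 1.0, rng)
high = contaminated_sample(1000, 0.50, 5.0, 1.0, rng)
tv_low = binned_tv_to_std_normal(low)
tv_high = binned_tv_to_std_normal(high)
```

Because binning and kernel smoothing handle the gap between discrete data and a continuous model differently, the values from this sketch are not expected to match Table 2 numerically; only the qualitative pattern (distance growing with the contamination percentage) carries over.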

Table 2

Means and standard deviations (SD) of the total variation (TV) and Kullback-Leibler (KLD) distances. Data are generated from the model $(1 - \varepsilon) N(0, 1) + \varepsilon N(\mu, 1)$ with $\mu = 1, 5, 10$. The sample size n is 200, 1000, 5000. The number of Monte Carlo replications is 500.

Contaminating Model	Percentage of Contamination (ε)	Summary	TV (n=200)	KLD (n=200)	TV (n=1000)	KLD (n=1000)	TV (n=5000)	KLD (n=5000)
N(1, 1)	0.01	Mean	0.144	0.224	0.065	0.048	0.029	0.008
	0.01	SD	0.017	0.244	0.007	0.051	0.004	0.009
	0.05	Mean	0.146	0.255	0.069	0.065	0.034	0.017
	0.05	SD	0.017	0.267	0.009	0.059	0.004	0.015
	0.1	Mean	0.149	0.323	0.076	0.088	0.047	0.026
	0.1	SD	0.017	0.343	0.009	0.073	0.005	0.018
	0.2	Mean	0.162	0.482	0.097	0.147	0.081	0.059
	0.2	SD	0.020	0.462	0.011	0.123	0.006	0.030
	0.3	Mean	0.181	0.616	0.128	0.215	0.117	0.102
	0.3	SD	0.022	0.528	0.013	0.150	0.007	0.044
	0.4	Mean	0.201	0.733	0.162	0.293	0.155	0.153
	0.4	SD	0.024	0.616	0.014	0.176	0.007	0.058
	0.5	Mean	0.232	0.937	0.198	0.392	0.192	0.207
	0.5	SD	0.026	0.735	0.014	0.203	0.007	0.067
N(5, 1)	0.01	Mean	0.149	0.577	0.070	0.338	0.034	0.231
	0.01	SD	0.017	0.373	0.008	0.131	0.004	0.063
	0.05	Mean	0.167	1.416	0.092	1.041	0.060	0.838
	0.05	SD	0.020	0.499	0.009	0.248	0.004	0.138
	0.1	Mean	0.196	2.392	0.126	2.002	0.103	1.731
	0.1	SD	0.020	0.609	0.010	0.335	0.004	0.219
	0.2	Mean	0.259	4.841	0.210	4.404	0.199	3.947
	0.2	SD	0.023	0.941	0.012	0.512	0.006	0.383
	0.3	Mean	0.336	7.924	0.302	7.305	0.297	6.652
	0.3	SD	0.028	1.182	0.014	0.730	0.007	0.569
	0.4	Mean	0.419	11.317	0.398	10.655	0.396	9.843
	0.4	SD	0.031	1.388	0.016	0.863	0.006	0.792
	0.5	Mean	0.506	15.045	0.495	14.443	0.494	13.573
	0.5	SD	0.035	1.768	0.016	1.027	0.007	0.999
N(10, 1)	0.01	Mean	0.149	0.352	0.070	0.129	0.034	0.082
	0.01	SD	0.017	0.275	0.008	0.071	0.004	0.024
	0.05	Mean	0.169	0.862	0.094	0.713	0.061	0.705
	0.05	SD	0.018	0.408	0.009	0.178	0.004	0.093
	0.1	Mean	0.197	1.898	0.128	1.850	0.105	1.854
	0.1	SD	0.020	0.593	0.010	0.261	0.004	0.132
	0.2	Mean	0.259	4.685	0.211	4.640	0.202	4.638
	0.2	SD	0.026	0.968	0.013	0.423	0.006	0.253
	0.3	Mean	0.340	8.393	0.305	8.055	0.300	7.909
	0.3	SD	0.029	1.391	0.014	0.631	0.007	0.388
	0.4	Mean	0.420	12.209	0.402	11.846	0.401	11.653
	0.4	SD	0.031	1.433	0.014	0.657	0.007	0.448
	0.5	Mean	0.515	16.544	0.503	16.041	0.501	15.841
	0.5	SD	0.032	1.499	0.016	0.730	0.007	0.432

Table 3

Contaminating Model	Percentage of Contamination (ε)	Summary	TV (n=200)	KLD (n=200)	TV (n=1000)	KLD (n=1000)	TV (n=5000)	KLD (n=5000)
N(0, 4)	0.01	Mean	0.145	0.263	0.066	0.068	0.030	0.021
	0.01	SD	0.017	0.250	0.008	0.058	0.003	0.014
	0.05	Mean	0.147	0.497	0.069	0.204	0.034	0.079
	0.05	SD	0.017	0.391	0.008	0.130	0.004	0.036
	0.1	Mean	0.154	0.778	0.076	0.368	0.044	0.181
	0.1	SD	0.018	0.527	0.008	0.168	0.004	0.062
	0.2	Mean	0.166	1.275	0.094	0.712	0.071	0.426
	0.2	SD	0.020	0.639	0.010	0.255	0.005	0.108
	0.3	Mean	0.182	1.797	0.118	1.067	0.101	0.671
	0.3	SD	0.021	0.738	0.012	0.324	0.006	0.158
	0.4	Mean	0.201	2.320	0.144	1.407	0.133	0.924
	0.4	SD	0.021	0.875	0.012	0.403	0.006	0.198
	0.5	Mean	0.220	2.766	0.173	1.755	0.164	1.164
	0.5	SD	0.025	0.932	0.013	0.450	0.006	0.219
N(0, 9)	0.01	Mean	0.146	0.369	0.067	0.122	0.031	0.046
	0.01	SD	0.018	0.348	0.007	0.089	0.003	0.022
	0.05	Mean	0.154	0.839	0.074	0.490	0.040	0.321
	0.05	SD	0.017	0.477	0.008	0.187	0.004	0.081
	0.1	Mean	0.164	1.414	0.087	0.945	0.058	0.661
	0.1	SD	0.018	0.602	0.009	0.256	0.005	0.120
	0.2	Mean	0.189	2.529	0.120	1.748	0.101	1.300
	0.2	SD	0.021	0.801	0.011	0.366	0.005	0.188
	0.3	Mean	0.216	3.529	0.161	2.526	0.149	1.954
	0.3	SD	0.023	0.957	0.012	0.466	0.006	0.276
	0.4	Mean	0.252	4.608	0.205	3.444	0.196	2.660
	0.4	SD	0.026	1.071	0.014	0.549	0.006	0.339
	0.5	Mean	0.286	5.630	0.250	4.289	0.244	3.423
	0.5	SD	0.026	1.123	0.014	0.657	0.007	0.406
N(0, 16)	0.01	Mean	0.146	0.429	0.067	0.166	0.031	0.078
	0.01	SD	0.016	0.374	0.007	0.100	0.003	0.032
	0.05	Mean	0.156	1.073	0.078	0.716	0.044	0.511
	0.05	SD	0.017	0.514	0.008	0.203	0.004	0.088
	0.1	Mean	0.169	1.774	0.094	1.281	0.066	0.981
	0.1	SD	0.019	0.606	0.008	0.277	0.005	0.142
	0.2	Mean	0.200	3.160	0.137	2.383	0.120	1.927
	0.2	SD	0.021	0.800	0.011	0.408	0.005	0.218
	0.3	Mean	0.239	4.471	0.187	3.532	0.177	2.937
	0.3	SD	0.025	1.045	0.013	0.485	0.006	0.278
	0.4	Mean	0.280	5.812	0.242	4.822	0.235	4.044
	0.4	SD	0.026	1.125	0.014	0.589	0.007	0.355
	0.5	Mean	0.331	7.537	0.298	6.145	0.293	5.218
	0.5	SD	0.029	1.274	0.015	0.693	0.007	0.433

We observe from the results of Table 2 and Table 3 that, for small percentages of contamination, the total variation distance is small and generally smaller than the Kullback-Leibler distance, for both the asymmetric and the symmetric contamination models, and that it has a considerably smaller standard deviation. This behavior of the total variation distance relative to the Kullback-Leibler distance manifests itself across all sample sizes used. Table 4 presents the mixture index of fit computed using the R function “pistar.uv” from the R package “pistar” (https://rdrr.io/github/jmedzihorsky/pistar/man/; accessed on 5 June 2018). Since the fundamental assumption in the definition of the mixture index of fit is that the population to which the index is applied is heterogeneous and expressed via the two-point model, we only used the asymmetric contamination model, for various choices of the contaminating distribution.

Table 4

Means and standard deviations (SD) for the mixture index of fit. Data are generated from an asymmetric contamination model, with the contaminating component given in the column headings and sample sizes n of 1000 and 5000. The number of Monte Carlo replications is 500.

Percentage of Contamination ε	Summary	N(1,1), n=1000	N(1,1), n=5000	N(5,1), n=1000	N(5,1), n=5000	N(10,1), n=1000	N(10,1), n=5000
0.1	Mean	0.180	0.160	0.223	0.213	0.837	0.934
0.1	SD	0.045	0.044	0.041	0.040	0.279	0.198
0.2	Mean	0.184	0.172	0.288	0.287	0.433	0.521
0.2	SD	0.044	0.042	0.036	0.036	0.144	0.240
0.3	Mean	0.189	0.179	0.344	0.346	0.314	0.317
0.3	SD	0.047	0.039	0.028	0.024	0.016	0.012
0.4	Mean	0.194	0.186	0.436	0.436	0.410	0.413
0.4	SD	0.044	0.034	0.026	0.021	0.017	0.011
0.5	Mean	0.194	0.185	0.529	0.533	0.511	0.512
0.5	SD	0.047	0.035	0.024	0.020	0.017	0.010

We observe that the mixture index of fit generally estimates the mixing proportion ε well. However, when the second population is N(1, 1) (see Table 4), the bias associated with estimating the mixing (or contamination) proportion can be large. This is expected because the N(1, 1) population is very close to the uncontaminated component, creating essentially a unimodal sample. As the means of the two normal components become more separated, the mixture index of fit provides better estimates of the mixing quantity, that is, of the percentage of observations that need to be removed so that the model provides a good fit to the remaining data points.
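The interpretation of the mixture index of fit as the smallest fraction of the population that must be removed for the model to fit can be made concrete in a discrete setting. The sketch below assumes a fully specified model (no parameters are estimated) and known cell probabilities rather than sample data, so it is not a substitute for “pistar.uv”; the function name and the example probabilities are hypothetical. Under these assumptions, writing p = (1 − π) m + π e with e a valid distribution requires (1 − π) m_i ≤ p_i in every cell, which gives π* = 1 − min_i p_i / m_i.

```python
def mixture_index_of_fit(p, m):
    """Two-point mixture index of fit for discrete distribution p and a
    fully specified model m (both given as lists of cell probabilities).

    Returns the smallest pi such that p = (1 - pi) * m + pi * e for some
    probability vector e, i.e. 1 - min_i p_i / m_i over cells with m_i > 0.
    """
    ratios = [p_i / m_i for p_i, m_i in zip(p, m) if m_i > 0]
    return max(0.0, 1.0 - min(ratios))

def contaminate(m, e, eps):
    """Cell probabilities of the two-point mixture (1 - eps) * m + eps * e."""
    return [(1.0 - eps) * m_i + eps * e_i for m_i, e_i in zip(m, e)]

model = [0.5, 0.3, 0.2]
eps = 0.25

# Well-separated contamination: e puts no mass in the first model cell.
p_separated = contaminate(model, [0.0, 0.2, 0.8], eps)
pi_separated = mixture_index_of_fit(p_separated, model)

# Overlapping contamination: e has mass in every model cell.
p_overlap = contaminate(model, [0.3, 0.3, 0.4], eps)
pi_overlap = mixture_index_of_fit(p_overlap, model)

print(pi_separated, pi_overlap)
```

In the first case the index recovers the contamination proportion 0.25. In the second case it is about 0.1, because part of the contaminating component can be absorbed by the model, so fewer observations need to be removed to obtain a perfect fit.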

7. Discussion and Conclusions

Divergence measures are widely used in scientific work, and popular examples of these measures include the Kullback-Leibler divergence, the Bregman divergence [30], the power divergence family of Cressie and Read [31], the density power divergence family [32] and many others. Two relatively recent books that discuss various families of divergences are Pardo [33] and Basu et al. [34]. In this paper we discuss specific divergences that do not belong to the family of quadratic divergences, and examine their role in assessing model adequacy.

The total variation distance might be preferable as it seems closest to a robust measure, in that if the two probability measures differ only on a set of small probability, such as a few outliers, then the distance must be small. This was clearly exemplified in Table 2 and Table 3 of Section 6. Outliers influence chi-squared measures more. For example, Pearson’s chi-squared distance can be made dramatically larger by increasing the amount of data in a cell with small model probability. In fact, if there is data in a cell with model probability zero, the distance is infinite. Note that if data occur in a cell whose probability under the model is zero, this indicates that the model may not be true. Still, even in this case, we might wish to use the model on the premise that it provides a good approximation.

There is a pressing need for the further development of well-tested software for computing the mixture index of fit. This measure is intuitive and has found many applications in the social sciences. Reiczigel et al. [35] discuss bias-corrected point estimates of π*, as well as a bootstrap test and new confidence limits, in the context of contingency tables. Well-developed and tested software will further promote the dissemination and use of this method. The ideas behind the mixture index of fit were extended to testing general model adequacy problems by Liu and Lindsay [9].
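The contrast drawn above between total variation and Pearson’s chi-squared distance can be illustrated with a small numerical sketch. It assumes the usual discrete forms, total variation as half the sum of absolute differences in cell probabilities and Pearson’s chi-squared distance as the sum of (d_i − m_i)² / m_i over cells; the function names and the cell probabilities are hypothetical.

```python
import math

def total_variation(d, m):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(d_i - m_i) for d_i, m_i in zip(d, m))

def pearson_chi_squared(d, m):
    """Pearson's chi-squared distance between data proportions d and model m.

    Returns infinity when data fall in a cell with model probability zero.
    """
    total = 0.0
    for d_i, m_i in zip(d, m):
        if m_i == 0.0:
            if d_i > 0.0:
                return math.inf
            continue  # empty cell under both data and model contributes nothing
        total += (d_i - m_i) ** 2 / m_i
    return total

# The model gives the third cell a small probability; the data put ten
# times as much mass there (think of a few outliers).
model = [0.499, 0.499, 0.002]
data = [0.49, 0.49, 0.02]

tv = total_variation(data, model)
chi2 = pearson_chi_squared(data, model)
print(f"TV = {tv:.4f}, Pearson chi-squared = {chi2:.4f}")

# Data in a cell with model probability zero: TV stays small, Pearson is infinite.
tv_zero = total_variation([0.5, 0.49, 0.01], [0.5, 0.5, 0.0])
chi2_zero = pearson_chi_squared([0.5, 0.49, 0.01], [0.5, 0.5, 0.0])
```

In this example the total variation distance equals the small amount of displaced mass, 0.018, while almost all of the Pearson distance comes from the single low-probability cell.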
Recent work by Ghosh and Basu [36] presents a systematic procedure of generating new divergences. Ghosh and Basu [36], building upon the work of Liu and Lindsay [9], generate new divergences through suitable model adequacy tests using existing divergences. Additionally, Dimova et al. [37] use the quadratic divergences introduced in Lindsay et al. [2] and construct a model selection criterion from which we can obtain AIC and BIC as special cases. In this paper, we discuss non-quadratic distances that are used in many scientific fields where the problem of assessing the fitted models is of importance. In particular, our interest centered around the properties and potential interpretations of these distances, as we think this offers insight into their performance as measures of model misspecification. One important aspect for the dissemination and use of these distances is the existence of well-tested software that facilitates computation. This is an area where further development is required.
