
On the Impossibility of Learning the Missing Mass.

Elchanan Mossel, Mesrob I. Ohannessian.

Abstract

This paper shows that one cannot learn the probability of rare events without imposing further structural assumptions. The event of interest is that of obtaining an outcome outside the coverage of an i.i.d. sample from a discrete distribution. The probability of this event is referred to as the "missing mass". The impossibility result can then be stated as: the missing mass is not distribution-free learnable in relative error. The proof is semi-constructive and relies on a coupling argument using a dithered geometric distribution. Via a reduction, this impossibility also extends to both discrete and continuous tail estimation. These results formalize the folklore that in order to predict rare events without restrictive modeling, one necessarily needs distributions with "heavy tails".


Keywords:  Good–Turing; heavy tails; light tails; missing mass; no free lunch; rare events

Year:  2019        PMID: 33266744      PMCID: PMC7514132          DOI: 10.3390/e21010028

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Given data consisting of n i.i.d. observations X_1, …, X_n from an unknown distribution p over the positive integers ℕ, we traditionally compute the empirical distribution p̂_n(j) = (1/n) Σ_{i=1}^n 1{X_i = j}. To estimate the probability of an event A ⊂ ℕ, we could use p̂_n(A) = Σ_{j∈A} p̂_n(j). This works well for abundantly represented events, but not as well for rare events. An unequivocally rare event is the set of symbols that are missing from the data, ℳ_n = {j ∈ ℕ : p̂_n(j) = 0}. The probability of this (random) event is denoted by the missing mass: M_n = p(ℳ_n) = Σ_j p(j) 1{p̂_n(j) = 0}. The question we strive to answer in this paper is: “Can we learn the missing mass when p is an arbitrary distribution on ℕ?” Definition 1 phrases this precisely in the learning framework: an estimator, i.e., a sequence of functions M̂_n of the data, learns the missing mass in relative error with respect to a family 𝒫 of distributions if the relative error |M̂_n/M_n − 1| converges to zero in probability for every p in 𝒫. The learning is said to be distribution-free if 𝒫 consists of all distributions on ℕ. In this framework, our question becomes whether we can distribution-free learn the missing mass in relative error. It is obvious that the empirical estimator gives us the trivial answer of 0, and cannot learn the missing mass. A popular alternative is the Good–Turing estimator of the missing mass, which is the fraction of singletons in the data: G_n = N_{1,n}/n, where N_{1,n} is the number of symbols appearing exactly once. The Good–Turing estimator has many interpretations. Its original derivation by Good [1] uses an empirical-Bayes perspective. It can also be thought of as a leave-one-out cross-validation estimator, where each held-out observation contributes to the missing set if and only if it appears exactly once in the data. Fundamentally, G_n derives its form and its various properties from the simple fact that E[G_n] = E[M_{n−1}]. A study of G_n in the learning framework was first undertaken by McAllester and Schapire [2] and continued later by McAllester and Ortiz [3]. Some further refinement and insight was also given later by Berend and Kontorovich [4]. These works focused on additive error.
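As a concrete illustration (not from the paper), the following sketch compares the true missing mass M_n with the Good–Turing estimate G_n = N_1/n on a truncated power law; the support size, exponent, and sample size are arbitrary choices:

```python
import random
from collections import Counter

# Simulate a sample from a truncated power law and compare the true missing
# mass with the Good-Turing estimate (illustrative parameters throughout).
random.seed(0)
support = list(range(1, 10001))
weights = [j ** -1.5 for j in support]
Z = sum(weights)
p = [w / Z for w in weights]

n = 2000
sample = random.choices(support, weights=p, k=n)
counts = Counter(sample)

M_n = sum(pj for j, pj in zip(support, p) if j not in counts)  # true missing mass
N_1 = sum(1 for c in counts.values() if c == 1)                # number of singletons
G_n = N_1 / n                                                  # Good-Turing estimate

print(f"M_n = {M_n:.4f}, G_n = {G_n:.4f}")
```

Under such a heavy tail, the two quantities track each other; the same experiment with a geometric source is where Good–Turing breaks down, as discussed below.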
Ohannessian and Dahleh [5] shifted the attention to relative error, establishing the learning property of the Good–Turing estimator with respect to the family of heavy-tailed (roughly power-law) distributions, e.g., p(j) ∝ j^{−α} with α > 1. This work also showed that Good–Turing fails to learn the missing mass for geometric distributions, and therefore does not achieve distribution-free learning. More recently, Ben-Hamou et al. [6] provided a comprehensive and tight set of concentration inequalities, which can be interpreted in the current framework, and which further demonstrate that Good–Turing can learn with respect to certain heavier-than-geometric light tails (see the definition in Section 4.3 and Remark 4.3 in that paper), in addition to power laws. These results leave open an important question: does there exist some other estimator that can learn the missing mass in relative error for any distribution p? Our contributions are as follows. We prove that there is no such estimator, thus providing the first such “no free lunch” theorem for learning about rare events. The first insight to glean from this impossibility result is that one is justified in using further structural assumptions. Furthermore, the proof relies on an implicit construction that uses a dithered geometric distribution. In doing so, it shows that the failure of the Good–Turing estimator for light-tailed distributions is not a weakness of the procedure, but rather is due to a fundamental barrier. Conversely, the success of Good–Turing for heavier-than-geometric tails and power laws shows its universality, in some restricted sense. In particular, in concrete support of folklore (e.g., [7]), we can state that for estimating probabilities of rare events, heavy tails are both necessary and sufficient. We extend this result to continuous tail estimation. On a positive note, we show that upon restricting to parametric light-tailed families, learning may be possible.
In particular, we show that for the geometric family the natural plug-in estimator learns the missing mass in relative error. As an ancillary result, we prove an instance-by-instance convergence rate, which can be interpreted as a weak sample complexity. For this, we establish some sharp concentration results for the gaps in geometric distributions, which may be of independent interest. The paper is organized as follows. In Section 2, we present our main result, with a detailed exposition of the proof. In Section 3, we discuss weak versus strong learnability, give an immediate extension to continuous tail estimation, show that parametric light-tailed learning is possible, comment further on the Good–Turing estimator, and concisely place this result in the context of a chief motivating application, computational linguistics. Lastly, we conclude in Section 4 with a summary and open questions.
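The plug-in idea for the geometric family can be sketched as follows; the method-of-moments fit and the truncation handling are illustrative assumptions, not the paper's exact construction:

```python
import math
import random

# Fit the geometric parameter by method of moments, then evaluate the fitted
# model's mass on the unseen symbols (illustrative sketch).
rng = random.Random(1)
q = 0.3
n = 5000
# inverse-CDF sampling for p_q(j) = q (1 - q)^(j - 1), j = 1, 2, ...
sample = [1 + int(math.log(1 - rng.random()) / math.log(1 - q)) for _ in range(n)]

q_hat = n / sum(sample)  # the mean of Geometric(q) is 1/q
seen = set(sample)
J = max(seen)

def model_mass_on_missing(r):
    # mass that Geometric(r) puts on the gaps below J, plus the tail beyond J
    gaps = sum(r * (1 - r) ** (j - 1) for j in range(1, J + 1) if j not in seen)
    return gaps + (1 - r) ** J

M_hat = model_mass_on_missing(q_hat)  # plug-in estimate
M_true = model_mass_on_missing(q)     # true missing mass under Geometric(q)
print(f"M_hat = {M_hat:.4g}, M_true = {M_true:.4g}")
```

Because the parameter estimate converges, the plug-in value tracks the true missing mass in relative error, which is exactly what is impossible to achieve without such a parametric restriction.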

2. Main Result

Our main result is stated as follows. Theorem 1. There exists a positive constant c such that, for any estimator M̂_n, there is a distribution p on ℕ under which, for infinitely many n, the relative error |M̂_n/M_n − 1| exceeds c with probability greater than c. In particular, it follows that it is impossible to perform distribution-free learning of the missing mass in relative error. Our proof below implies the statement of the theorem with an explicit choice of c. The rest of this section is dedicated to its detailed proof.

2.1. Proof Outline

Consider the family of dithered geometric distributions: starting from a geometric distribution, the mass of each pair of outcomes beyond a threshold symbol m is divided between the two symbols of the pair, with a fraction of the mass placed on one and the complementary fraction on the other. The individual distributions in this family differ only by which of each pair of symbols gets which fraction, as recorded by a sign sequence (ε_k); the family is defined precisely in Equation (2). The intuition of the proof of Theorem 1 is that within such light-tailed families, two distributions may have very similar samples, and thus similar estimated values, yet significantly different true values of the missing mass. This follows the general methodology of many statistical lower bounds. We now state the outline of the proof. We choose a suitable subsequence (n_k) and fix the dithering parameters; the value of c is made explicit in the proof, and depends only on these choices. We proceed by induction. For the base case, we show that the first sign ε_1 can be chosen such that Inequality (3) holds at n_1 for all distributions in the family sharing this sign. Then, at every step k: we start with signs ε_1, …, ε_{k−1} such that Inequality (3) holds at n_1, …, n_{k−1} for all distributions sharing these signs; we then show that for at least one of the two choices +1 or −1, Inequality (3) additionally holds at n_k; we select ε_k to be the corresponding sign (*). This induction produces an infinite sign sequence (ε_k), and the desired distribution in Theorem 1 can be chosen as the dithered geometric distribution with these signs, since it is readily seen to satisfy the claim at each n_k, by construction.
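A toy rendering of the dithering idea may help; the paper's exact parameterization of the family in Equation (2) differs, and the threshold, split fraction, and pairing below are illustrative assumptions:

```python
# Toy dithered geometric: below a threshold m the pmf is plain geometric;
# beyond m, consecutive symbols are paired, and each pair's combined geometric
# mass is split sigma / (1 - sigma) according to a sign chosen for that pair.

def dithered_geometric_pmf(q, m, sigma, signs, size):
    base = [q * (1 - q) ** (j - 1) for j in range(1, size + 1)]
    pmf = base[:m]
    k = m
    for s in signs:
        if k + 1 >= size:
            break
        pair = base[k] + base[k + 1]
        lo, hi = (1 - sigma) * pair, sigma * pair
        pmf += [hi, lo] if s > 0 else [lo, hi]
        k += 2
    pmf += base[k:]  # untouched remainder of the truncation
    return pmf

plus = dithered_geometric_pmf(q=0.5, m=4, sigma=0.8, signs=[+1, -1], size=12)
minus = dithered_geometric_pmf(q=0.5, m=4, sigma=0.8, signs=[-1, -1], size=12)
print(sum(plus), plus[4:6], minus[4:6])
```

Flipping one sign changes only how the corresponding pair's mass is split, leaving the total mass and all other symbols untouched; this is what makes the two distributions hard to tell apart from samples.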

2.2. Proof Details

We skip the proof of the base case, since it is mostly identical to that of the induction step. Therefore, in what follows we assume that ε_1, …, ε_{k−1} satisfies the inductive hypothesis (H), and we would like to prove that the selection in (*) can always be done. Let us denote the two candidate sign choices at step k by +1 and −1, and let us refer to the signs beyond step k as the trailing parameters. What we show in the remainder of the proof is that, with two arbitrary sets of trailing parameters, we cannot have two simultaneous violations of Inequality (3) at n_k (for both the +1 and the −1 choice). That is, we cannot have both violations in (4). This is stated in Lemma 3, in the last portion of this section. To see why this is sufficient to show that the selection in (*) can be done, consider first the case where Inequality (3) at n_k is upheld for both +1 and −1 with any two sets of trailing parameters. In this case we can arbitrarily choose ε_k to be either +1 or −1, since the induction step is satisfied. We can therefore focus on the case in which this fails. That is, for either +1 or −1, a choice of trailing parameters can be made such that Inequality (3) at n_k is not satisfied, and therefore one of the two cases in (4) holds [say, for example, for +1]. Fix the corresponding trailing parameters. Then, for any choice of the other set of trailing parameters, Lemma 3 precludes a violation of Inequality (3) at n_k by the other sign [in this example, −1]. Therefore this other sign can be selected for ε_k. By using a coupling device and restricting ourselves to a pivotal event, we formalize the aforementioned intuition that the estimator may not distinguish between two separated missing-mass values, and deduce that both statements in (4) cannot hold simultaneously. A coupling between two distributions p and p′ is a joint distribution on pairs whose marginals are p and p′ (Definition 3).

2.2.1. Coupling

Couplings are useful because probabilities of events on each side may be evaluated on the joint probability space, while forcing events of interest to occur in an orchestrated fashion. Going back to our induction step and the two specific sign choices with arbitrary trailing parameters, we perform the coupling q defined in Equation (5). It is easy to verify that q is a coupling between the two dithered distributions, as in Definition 3. Now let us observe the consequences of this choice. If (X, X′) is generated according to q, then whenever either coordinate falls in the region where the two distributions agree, both values are identical. Whenever either coordinate falls in the dithered region, so does the other; there, the two values are either conditionally independent or tied through the conditional probabilities given in Equation (5), depending on the pair in question. Now consider coupled data generated as i.i.d. samples from q. It follows that, marginally, the X-sequence is i.i.d. from the first distribution, and the X′-sequence is i.i.d. from the second. Any event B that is exclusively X-measurable, or exclusively X′-measurable, has the same probability under the coupled measure as under the respective marginal. In what follows we work only with coupled data, and use P as a shorthand for the coupled measure.
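The coupling device can be illustrated with the standard maximal-coupling construction; the paper's coupling in Equation (5) is tailored to the dithered pair, but the principle of forcing the two samples to agree as often as the overlap allows is the same:

```python
import random

# Maximal coupling: draw (X, Y) with X ~ p and Y ~ q while forcing X = Y
# with probability sum_i min(p_i, q_i) (illustrative generic construction).

def maximal_coupling(p, q, rng):
    overlap = [min(a, b) for a, b in zip(p, q)]
    w = sum(overlap)
    if rng.random() < w:
        i = rng.choices(range(len(p)), weights=overlap)[0]
        return i, i  # agreement: both coordinates identical
    # disagreement: sample each coordinate from its residual, independently
    rp = [a - o for a, o in zip(p, overlap)]
    rq = [b - o for b, o in zip(q, overlap)]
    return (rng.choices(range(len(p)), weights=rp)[0],
            rng.choices(range(len(q)), weights=rq)[0])

rng = random.Random(2)
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
draws = [maximal_coupling(p, q, rng) for _ in range(20000)]
agree = sum(x == y for x, y in draws) / len(draws)
print(agree)  # close to the overlap, here 0.5
```

Each marginal is preserved by construction, so any event depending on one coordinate alone has the same probability under the coupled measure as under its own distribution.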

2.2.2. Pivotal Event

The event we would like to work under is that of the coupled samples being identical, while exactly covering an initial range of symbols. Let us call this the pivotal event, and denote it by E, as defined in Equation (6). The reason E interests us is that it encapsulates the aforementioned intuition: under E, any estimator cannot distinguish the coupled samples, yet the two missing masses are markedly different (Lemma 1). The confusion of any estimator is simply due to the fact that under E, the coupling forces all samples to be identical, X_i = X′_i for all i. Thus the two estimates agree, since estimators depend only on the samples and not on the probabilities. The missing masses, on the other hand, do depend on both the samples and the probabilities, and thus they differ. The event E makes the set of missing symbols simply the tail beyond the covered range, so we can compute the missing masses exactly, by the usual geometric sum. The claim follows. □ We now show that E always has probability bounded away from zero (Lemma 2). Please note that Equation (6) overspecifies the event: forcing the exact coverage of the initial range alone is sufficient, since this implies in turn that the coupled samples are identical. Recalling the coupling of Equation (5), this is immediate for the symbols below the dithering threshold, and follows for the rest since the dithered region is excluded by this event. We can then lower bound the probability of E by dividing exact coverage into two requirements: the localization of all samples in the initial range, and the representation of each symbol of that range by at least one sample. From the coupling in Equation (5) and the structure of the dithered family in Equation (2), the probability of each symbol in the range can be read off explicitly, and the localization probability is then computed from the corresponding geometric sums. Meanwhile, the complement of the representation event is the event that at least one symbol of the range does not appear. Conditionally on localization, the occurrence probabilities of these symbols are simply renormalized by the probability of the range, and a union bound controls the probability that any of them is missing. Therefore, using our choices of the subsequence and of the dithering parameters, we can bound this worst-case probability by a positive constant, and it follows as claimed that the pivotal event always has probability bounded away from zero. □
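A quick Monte Carlo sanity check of a pivotal-type event, run under a plain (undithered) geometric for simplicity; the parameter choices are arbitrary and the event is a simplification of Equation (6):

```python
import math
import random

# Estimate the probability that n i.i.d. Geometric(q) draws all land in
# {1, ..., m} AND every symbol of that range appears at least once.

def exact_coverage_freq(q, m, n, trials, rng):
    hits = 0
    for _ in range(trials):
        draws = {1 + int(math.log(1 - rng.random()) / math.log(1 - q))
                 for _ in range(n)}
        # a set of size m with maximum <= m is exactly {1, ..., m}
        if max(draws) <= m and len(draws) == m:
            hits += 1
    return hits / trials

rng = random.Random(7)
freq = exact_coverage_freq(q=0.5, m=3, n=12, trials=20000, rng=rng)
print(freq)
```

The point of Lemma 2 is precisely that, along the chosen subsequence, such an event keeps probability bounded away from zero, so the estimator's confusion under it cannot be dismissed as negligible.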

2.2.3. Induction Step

We now combine all the elements presented thus far to complete the proof of Theorem 1, by establishing the following claim (Lemma 3), which we showed at the beginning of the detailed proof section to be sufficient for the validity of the induction step. In particular, we restate Equation (4) under the coupling of Equation (5), with arbitrary trailing parameters, and with the accuracy level chosen so that the two missing-mass values are separated in the sense of Lemma 2. Recall the pivotal event E, and assume, for the sake of contradiction, that both probability bounds in (4) hold. Please note that if the first bound holds, it means that the estimator is close to the missing mass of the first distribution with high probability, and similarly, if the second holds, it is close to that of the second. By making our hypothesis, we are asserting that these closeness events have high probability under both distributions, and that thus the estimator is effectively close to the true value of the missing mass in both cases. Yet, we know that this would be violated under the pivotal event, which occurs with positive probability. We now formalize this contradiction. By Lemma 2, the pivotal event has probability strictly exceeding what the two failure probabilities combined allow, where the inequality is strict by our choice of constants. On the other hand, recall that by Lemma 1, under E the estimates agree while the missing masses are separated, as in Equations (7) and (8). By combining these, we can now see that if the separation is large enough, which is satisfied by our choice of constants, then whenever one closeness event occurs the other must fail, and conversely. For example, say the estimator is close to the first missing mass; then it must be far from the second, implying that Equation (8) is not satisfied. The end result is that, under the event E, the two closeness events cannot occur at the same time, and thus their failure probabilities cannot both be small. This contradicts the bound in (9), and establishes the lemma. □

3. Discussions

3.1. Weak Versus Strong Distribution-Free Learning

Arguably, a more common notion of learning is a strong version of Definition 1, where the sample complexity is a function of the distribution class rather than of the instance. Formally (Definition 4): we say that an estimator learns the missing mass in relative error strongly with respect to a family 𝒫 if the convergence of the relative error holds uniformly over 𝒫. The learning is said to be strongly distribution-free if 𝒫 consists of all distributions on ℕ. The distinction here is similar to that between uniform and pointwise convergence. Clearly, the existence of a strong learner implies the existence of a weak learner. Conversely, since we have shown that there is no weakly distribution-free learner, there is also no strongly distribution-free learner. However, the ability to choose a different distribution at every sample size n makes it very easy to show this corollary directly. For example, we can consider two distributions p and q that each place all but a vanishing fraction of their mass (of order smaller than 1/n) on the same first symbol, so that with overwhelming probability both produce a length-n sequence consisting entirely of this symbol. Any estimator would thus need to predict the same missing mass with high probability under both. However, the remaining symbols can be given probabilities differing by a factor of 100 between the two models, so any estimator would be misguided for at least one of the two cases. The relevance of the current contribution is rooted in the plausible yet misguided optimism that although we may not do well in such a worst-case paradigm, there is more hope if we first fix the instance and then study asymptotics. Our “no free lunch” theorem indeed shows the more subtle fact that there are always bad instances for every estimator, and thus even such weak learning is fundamentally impossible. Such a contrast between weak and strong learning has also been appreciated in the classical learning literature, notably in the work of Antos and Lugosi [8]. The notions there are framed in the negative, which is why the weak/strong terminology is reversed.
A traditional minimax lower bound in that context states that for any sequence of concept learners, at each n we can find a distribution for which the expected cumulative classification error is lower bounded by the complexity of the concept class. Analogously, not being able to learn strongly, as in Definition 4, means that for any estimator, at each n we can find a distribution for which the relative error stays away from zero. By demanding a strong performance from a learner/estimator, we are able to give only a weak guarantee. In particular, it could be too loose for a fixed distribution that does not vary with n. Antos and Lugosi [8] contribute lower bounds that hold infinitely often for any sequence of concept learners, but for a distribution choice that is adversarial in advance, fixed for all n. Analogously, not being able to learn weakly, as in Definition 1, means that for any estimator there exists a distribution, fixed for all n, for which the relative error stays away from zero for infinitely many n. The lower bounds of Antos and Lugosi [8] can then be tighter, which is why they call their results strong minimax lower bounds. In the context of the present paper, of course, the lower bounds correspond to the impossibility result, which is thus stronger since the adversarial distribution does not even vary with n.

3.2. Generalization to Continuous Tails

A problem closely related to learning the missing mass is that of estimating the tail of a probability distribution. In the simplest setting, the data consists of X_1, …, X_n that are i.i.d. samples from a continuous distribution on the real line. Let F be the cumulative distribution function. The task in question is that of estimating the tail probability 1 − F(max_{i≤n} X_i), that is, the probability that a new sample exceeds the maximum of all samples seen in the data. One can immediately see the similarity with the missing mass problem, as both problems concern estimating probabilities of underrepresented events. We can use essentially the same learning framework given by Definition 1, and prove a completely parallel impossibility result. Theorem 2. It is impossible to perform distribution-free learning of the tail probability in relative error. Proof (Sketch). The discrete version of this theorem is a trivial extension of Theorem 1, since in the proof of the latter the pivotal event forced the missing mass to be a tail probability. The potential strengthening of Theorem 2 comes from insisting on a continuous F. The same techniques may be adapted to this case, for example by dithering an exponential distribution: a base exponential density is divided into intervals, and mass is moved between pairs of adjacent intervals by scaling the density, in the same way the dithering modifies the geometric distribution. The adversarial distribution for a given estimator can then be chosen from this family. In order not to repeat the same arguments, however, we instead prove this result via a reduction: we show that discrete tail estimation can be reduced to continuous tail estimation, and since the former is impossible, so is the latter. The details can be found in Appendix A. □ Theorem 2 gives a concrete justification of why it is important to make regularity assumptions when extrapolating distribution tails. This is of course the common practice of extreme value theory; see, for example, [9]. Some impossibility results concerning the even more challenging problem of estimating the density of the maximum were already known [10], but to the best of our knowledge this is the first result asserting impossibility for tail probability estimation as well.
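For intuition about the quantity being estimated, the tail probability 1 − F(max_i X_i) can be evaluated exactly for an exponential distribution (an illustrative simulation, not the paper's construction):

```python
import math
import random

# After n i.i.d. Exp(1) samples, the target of tail estimation is
# 1 - F(max) = exp(-max), which we can compute since F is known here.
rng = random.Random(3)
n = 1000
xs = [-math.log(1 - rng.random()) for _ in range(n)]  # Exp(1) via inverse CDF
tail = math.exp(-max(xs))                             # 1 - F(max_i X_i)
print(f"n = {n}, tail probability = {tail:.2e}")      # typically of order 1/n
```

An estimator only sees the samples, not F; Theorem 2 says no estimator can pin this quantity down to within a constant relative factor for every continuous F.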

3.3. Learning in Various Families

Ben-Hamou et al. [6] (Corollary 5.3) give a very clean characterization of a sufficient learnable family, which encompasses the one covered by Ohannessian and Dahleh [5]. Denote the expected numbers of single-occurrence and double-occurrence symbols by E[N_{1,n}] and E[N_{2,n}], and consider the family of distributions for which E[N_{1,n}] grows without bound while the ratio E[N_{2,n}]/E[N_{1,n}] (i.e., its lim sup) remains bounded. Theorem 3. The Good–Turing estimator learns the missing mass in relative error with respect to this family. The proof relies on power moment concentration inequalities (such as Chebyshev’s). The growth of the expected number of singletons embodies the heavy-tailed nature, since it says that rare events occur often. The condition that the ratio remains bounded is a smoothness condition, since it roughly captures the variance of the number of singletons (see [6], Proposition 3.3). For us, this is instructive because one can readily verify that the condition of Theorem 3 fails for geometric (and dithered geometric) distributions. We can thus see that in some sense Good–Turing captures a maximal family of learnable distributions; in particular, we now know that the complement of this family is not learnable. Considering how sparse the dithered geometric family is, the failure of any estimator to learn the missing mass with respect to it may seem discouraging. (Please note that Theorem 1 holds even if the estimator is aware that this is the class it is paired with.) However, if we restrict ourselves to smooth parametric families within light tails, then the outlook can be brighter. We illustrate this with the case of the geometric family. Theorem 4. Let 𝒢 be the family of geometric distributions on ℕ, and let M̂_n be the natural plug-in estimator that fits the geometric parameter and evaluates the fitted model’s mass on the unseen symbols. Then M̂_n learns the missing mass in relative error with respect to 𝒢. Proof (Sketch). The proof consists of pushing forward the convergence of the parameter estimate to that of the entire distribution using continuity arguments, and then specializing to the missing mass. The details can be found in Appendix B. □
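The "rare events occur often" property can be seen empirically: the singleton count N_1 grows without bound under a power law but stays O(1) under a geometric. A minimal simulation (parameter choices arbitrary, not Corollary 5.3 itself):

```python
import math
import random
from collections import Counter

# Compare singleton counts N_1 for a geometric source versus a truncated
# power-law source, across increasing sample sizes.

def singletons(sample):
    return sum(1 for c in Counter(sample).values() if c == 1)

rng = random.Random(11)
support = list(range(1, 200001))
pl_weights = [j ** -2.0 for j in support]  # truncated power law j^{-2}

results = []
for n in (1000, 10000, 100000):
    geo = [1 + int(math.log(1 - rng.random()) / math.log(0.5)) for _ in range(n)]
    pl = rng.choices(support, weights=pl_weights, k=n)
    results.append((n, singletons(geo), singletons(pl)))

for n, g, h in results:
    print(f"n = {n:6d}: geometric N_1 = {g:3d}, power-law N_1 = {h}")
```

Since Good–Turing estimates the missing mass by N_1/n, a bounded N_1 under the geometric leaves it with essentially no signal, in line with the failure discussed above.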

3.4. Learning the Missing Mass in Additive Error and Learning Other Related Quantities

As mentioned in the introduction, a good part of the work on learning the missing mass focused on additive error [2,3,4]. Recently, minimax lower bounds for the additive error were given in [11] and [12]. Note, however, that relative error bounds cannot be deduced from these (nor in any other way, given the impossibility established here). A problem related to learning the missing mass in relative error is that of learning a distribution in KL-divergence loss. This averages all log-relative errors (missing or otherwise). The averaging scales the log-relative errors by the rare probabilities and attenuates the kind of gaps discussed in our present context. One thus hopes for more optimistic results. Indeed, the Good–Turing estimator was recently shown to be adaptive/competitive for distribution learning in KL-divergence [13]. A similar result in the context of distribution learning in total variation was given in [14]. In the language of Section 3.1, being competitive can be understood as an intermediate characterization between weak and strong learning. Lastly, one could be interested in learning other properties of distributions that are intimately related to the rare component. Entropy and support size are two of these. In [15], a traditional minimax bound was established for these quantities, which was then further distilled in [16]. Another related problem is predicting the growth of the support as more observations are made. This was characterized very precisely in [17], where one can find further pointers on this very old problem. Some of these results may give the impression that nothing further can be gained from structural assumptions. However, rates can generally be refined whenever such structure exists. See, for example, [18] for refined competitive rates in distribution estimation, and [19] for similar results in the predictive/compression setting. These results use tail characterizations similar to those in extreme value theory [10].

3.5. N-Gram Models and Bayesian Perspectives

One of the prominent applications of estimating the missing mass has been to computational linguistics. In that context, it is known as smoothing, and is used to estimate N-gram transition probabilities. The importance of accurately estimating the missing mass, in particular in a relative-error sense, comes from the fact that N-gram models are used to score test sentences via log-likelihoods. Test sentences often contain transitions that are never seen in the training corpus, and thus, in order for the inferred log-likelihoods to accurately track the true log-likelihood, these rare transitions need to be assigned meaningful values, ideally as close to the truth as possible. As such, various forms of smoothing, including Good–Turing estimation, have become an essential ingredient of many practical algorithms, such as the popular method proposed by Kneser and Ney [20]. In the context of N-gram learning, a separate Bayesian perspective was also proposed. Among the earliest to introduce this were [21], using a Dirichlet prior. This was shown to not be very effective, and we now understand that this is because (1) the Dirichlet process produces light tails, while language is often heavy-tailed, and (2) even if the model were appropriate, rare probabilities are hard to learn for large light-tailed families. The natural progression of these Bayesian models led to the use of the two-parameter Poisson–Dirichlet prior [22], which was suggested initially by [23]. Despite employing sophisticated inference techniques, the missing mass estimator that resulted from these models closely followed the Good–Turing estimator (for a sharp analysis of this correspondence, see Falahatgar et al. [18]). In light of the present work, this is not surprising, since the two-parameter Poisson–Dirichlet process almost surely produces heavy-tailed distributions, and any two algorithms that learn the missing mass are bound to have the same qualitative behavior.

4. Summary

In this paper, we have considered the problem of learning the missing mass, which is the probability of all unseen symbols in an i.i.d. draw from an unknown discrete distribution. We have phrased this in the probabilistic framework of learning. Our main contribution was to show that it is not possible to learn the missing mass in a completely distribution-free fashion. In other words, no single estimator can do well for all distributions. We have given a detailed account of the proof, emphasizing the intuition of how failure can occur in large light-tailed families. We have also placed this work in a greater context, through some discussions and extensions of the impossibility result to continuous tail probability estimation, and by showing that smaller, parametric, light-tailed families may be learnable. An initial impetus for this paper and its core message is that assuming further structure can be necessary in order to learn rare events. Further structure, of course, is nothing more than a form of regularization. This is a familiar notion to the computational learning community, but for a long time the Good–Turing estimator enjoyed favorable analysis that focused on additive error, and evaded this kind of treatment. The essential ill-posedness of the problem was uncovered by studying relative error. But lower bounds cannot be deduced from the failure of particular algorithms. Our result thus completes the story, and we can now shift our attention to studying the landscape that is revealed. The most basic set of open problems concerns establishing families that allow learning of the missing mass. We have seen in this paper some such families, including the heavy-tailed family learnable by the Good–Turing estimator, and simple smooth parametric families, learnable using plug-in estimators. How do we characterize such families more generally? 
The next layer of questions concerns establishing convergence rates, i.e., strong learnability, via both lower and upper bounds. The fact that a family of distributions allows learning does not mean that such rates can be established. This is because any estimator may be faced with arbitrarily slow convergence, as the distribution varies within the family. In other words, we may be faced with a lack of uniformity. How do we control the convergence rate? Lastly, when learning is not possible, we may want to establish how gracefully an estimator can be made to fail. Understanding these limitations and accounting for them can be critical to the proper handling of data-scarce learning problems.

1.  Optimal prediction of the number of unseen species.

Authors:  Alon Orlitsky; Ananda Theertha Suresh; Yihong Wu
Journal:  Proc Natl Acad Sci U S A       Date:  2016-11-08       Impact factor: 11.205

