
The Parzen Window method: In terms of two vectors and one matrix.

Hamse Y Mussa, John B O Mitchell, Avid M Afzal

Abstract

Pattern classification methods assign an object to one of several predefined classes/categories based on features extracted from observed attributes of the object (pattern). When the L discriminatory features of a pattern can be accurately determined, the pattern classification problem presents no difficulty. However, precise identification of the relevant features for a classification algorithm (classifier) to be able to categorize real-world patterns without errors is generally infeasible. In this case, the pattern classification problem is often cast as devising a classifier that minimizes the misclassification rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the estimated posterior class probability is maximum. More often than not, the form of the posterior class probabilities is unknown. The so-called Parzen Window approach is widely employed to estimate class-conditional (class-specific) probability densities for a given pattern. These probability densities can then be utilized to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can become computationally impractical when the size of the training dataset is in the tens of thousands and the feature dimension L is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem still remains outstanding and unresolved. In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.

Keywords:  Kernel functions; Parzen Window; Probability density function

Year: 2015    PMID: 26435560    PMCID: PMC4534349    DOI: 10.1016/j.patrec.2015.06.002

Source DB: PubMed    Journal: Pattern Recognit Lett    ISSN: 0167-8655    Impact factor: 3.756


Introduction

In mathematical pattern recognition, the problem of pattern classification entails assigning an object – based on a number of specific features of the object – to one of a finite set of predefined classes/categories ω_j, where j = 1, 2,…, J, with J being the number of classes/categories of interest. Typically the object (or simply the pattern) is represented by an L-dimensional vector x whose elements x_l are values assumed to contain the appropriate information about the specific pattern features utilized to accurately classify the pattern represented by x. When the L discriminatory features for a pattern can be determined accurately, the pattern classification problem presents no difficulty: it reduces to a simple look-up table scheme. However, identifying the relevant features to classify realistic patterns without classification errors is generally impossible. Thus, the pattern classification problem is often cast as the task of finding a classifier that minimizes the misclassification rate [1]. One popular way of achieving this objective is to treat both the pattern vector x and the class label ω_j as random variables. In this case, the posterior class probabilities p(ω_j|x) for a given pattern x are computed; then pattern x is assigned to the class for which the p(ω_j|x) value is maximum [1-6]. (In the last step it is being assumed that all misclassification errors are equally bad [1,3,4].) However, in practice, the form of the function p(ω_j|x) is unknown; instead, N prototype patterns x_i and their corresponding correct class labels y_i – assumed to constitute a representative dataset of the joint probability density function p(ω, x) for ω and x – are usually available. It is from these prototype patterns x_i and their corresponding class labels y_i that one tries to estimate p(ω_j|x).
According to basic probability rules [1-7],

p(ω_j|x) = p(x|ω_j) p(ω_j) / p(x).    (Eq. 1)

These rules allow one to modularize the estimation problem and estimate p(ω_j|x) (and, of course, p(x)) in terms of p(x|ω_j) and p(ω_j):

p(ω_j|x) = p(x|ω_j) p(ω_j) / Σ_{k=1}^J p(x|ω_k) p(ω_k),    (Eq. 2)

whereby we may have a better chance of being able to estimate p(x|ω_j) and p(ω_j) from the training dataset D = {(x_i, y_i)}_{i=1}^N than estimating p(ω_j|x) directly from D. In the denominator, p(x) = Σ_{k=1}^J p(x|ω_k) p(ω_k). In the Bayesian statistics framework, p(ω_j) is referred to as the class prior probability, which is the probability that a member of class ω_j will occur. The function p(x|ω_j) is called the class-conditional probability density function, i.e. the probability density of observing pattern x given that x is a member of class ω_j. The denominator term on the right-hand side of Eq. 2 is often called the “evidence” or “marginal likelihood”. For the purpose of this paper we can afford to simply view this term as a normalization factor. If there is evidence that the number of prototype patterns per class is an indication of the importance of that class, then a sensible approximation of p(ω_j) can be

p(ω_j) ≈ (1/N) Σ_{i=1}^N δ_{ij},    (Eq. 3)

where δ_{ij} = 1 if the ith prototype x_i belongs to class ω_j and δ_{ij} = 0 otherwise; and N is as described before. Nonetheless, p(ω_j) is typically assumed to be uniform, i.e., p(ω_j) = 1/J, where J is as defined before. Estimating p(x|ω_j) from D is not straightforward [1-5]. In the last half-century, a plethora of methods have been proposed for estimating p(x|ω_j) based on the so-called training set. There are ample excellent reviews and textbooks on this topic; for example, the two books – one by Hand [4] and the other by Murphy [8] – give adequate and accessible descriptions of the bulk of these approaches devised in recent (and not so recent) years. In this paper we are concerned with one particular approach that is widely thought to be apropos to the task of estimating p(x|ω_j) from a representative training dataset: the so-called Parzen Window method [1,2,4,9,10], also known as the Parzen estimator or the Potential function technique [10], to name but a few.
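The modularization above can be sketched in a few lines of code. The following is a minimal illustration only: the two 1-D Gaussian class-conditional densities, their parameters and the uniform prior are invented for the example, not taken from the paper.

```python
import math

def posteriors(x, class_densities, priors):
    """Posterior class probabilities p(w_j|x) via Bayes' rule:
    p(w_j|x) = p(x|w_j) p(w_j) / sum_k p(x|w_k) p(w_k)."""
    joint = [p(x) * prior for p, prior in zip(class_densities, priors)]
    evidence = sum(joint)                 # the normalizing "evidence" term
    return [v / evidence for v in joint]

def classify(x, class_densities, priors):
    """Assign x to the class whose posterior is maximum."""
    post = posteriors(x, class_densities, priors)
    return max(range(len(post)), key=post.__getitem__)

def gauss(mu, sigma):
    """1-D Gaussian density, standing in for a known p(x|w_j)."""
    return lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) \
                     / (sigma * math.sqrt(2 * math.pi))

densities = [gauss(0.0, 1.0), gauss(3.0, 1.0)]  # hypothetical p(x|w_0), p(x|w_1)
priors = [0.5, 0.5]                             # uniform prior: p(w_j) = 1/J
label = classify(0.2, densities, priors)        # 0.2 lies nearer mu = 0
```

With uniform priors the rule reduces to picking the class with the largest class-conditional density, which is why estimating p(x|ω_j) well – the subject of this paper – is the crux of the classifier.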
In the preceding discussion and in the rest of the paper, the terms “class”, “label”, “class label” and “category” are employed interchangeably. For notational simplicity we use x, x_i and ω_j both as random variables and as their realizations. Furthermore we follow (in line with the current trend in machine learning and statistics) the convenient – but not necessarily correct – practice of using the term “density” both for a discrete random variable’s probability function and for the probability function of a continuous random variable. An implicit assumption made throughout the paper is that all spaces, matrices, vectors, functions and variables (discrete or not), etc., are real.

Literature review

The Parzen Window approach can approximate p(x|ω_j) by a simple formula [1,2,10]:

p̂(x|ω_j) = (1/N_j) Σ_{i=1}^{N_j} K(x, x_i; λ),    (Eq. 4)

where the x_i denote the prototype patterns belonging to class ω_j and N_j is as defined before. K(x, x_i; λ) – commonly known as the kernel function – is a two-variable function with specific properties, which are abundantly covered in the statistical pattern recognition literature [1,2,9,10]. At any rate, it might be helpful to think of K(x, x_i; λ) as a measure of similarity returning how similar patterns x and x_i are, λ being a tunable (smoothing) parameter. In other words, the kernel function peaks at x = x_i and decays away elsewhere; the λ parameter, inter alia, has an important role in determining the rate of the decay. Eq. 4 indicates that p̂(x|ω_j) is formed from the superposition of kernel function K(x, x_i; λ) values at the given prototype patterns x_i for class ω_j. The Parzen Window method is powerful in the sense that, with enough representative data points (prototypes/references), its estimate of the class-conditional probability density converges to p(x|ω_j) (see Ref. [1], Chapter 4). Although Eq. 4 is conceptually simple and capable of providing a good estimate of p(x|ω_j), it can suffer computationally from the requirement that all the prototypes/references x_i for class ω_j must be retained in main-memory to compute the estimate of p(x|ω_j). Furthermore, considerable CPU-time may be required each time this method is used to estimate p(x|ω_j) to classify a novel pattern. The fact that the dimension L of the reference pattern vectors x_i can easily be in the hundreds (or more) may exacerbate the main-memory and CPU-time requirements. Over the years, various schemes have been developed to address the computational drawback of this otherwise elegant and powerful method. For example, one of these schemes entails – see Ref. [1] (Chapter 4), Ref. [4] (Chapter 2) and Ref. [10] (Chapter 6) for detailed technical and practical discussions – expressing the kernel function as a finite series expansion,

K(x, x_i; λ) ≈ Σ_{m=1}^M φ_m(x) φ_m(x_i),    (Eq. 6)

with the φ_m being appropriate basis functions (not necessarily polynomials) defined in the feature space in which the pattern vectors x and x_i reside. From Eqs. 4 and 6, we have

p̂(x|ω_j) = Σ_{m=1}^M b_m^(j) φ_m(x),    (Eq. 7)

where b_m^(j) = (1/N_j) Σ_{i=1}^N φ_m(x_i) δ_{ij}, with δ_{ij} = 1 if pattern x_i belongs to class ω_j and δ_{ij} = 0 otherwise. This scheme certainly removes the reference patterns’ storage problem. However, it can create a computational problem of its own, in particular when both M and L are large, which is often the case in real-world classification problems. Computing M basis functions of L variables to classify a new pattern x is not a trivial computational task [1-4,10-12]. Another approach – albeit a particular case of the scheme above – is that proposed by Specht [13]. It was based on a Taylor series expansion of ρ(x, x_i; λ) (see Eq. 8), such that an rth-order polynomial in L variables was required, so that the number of terms to be estimated and stored in approximating p̂(x|ω_j) grows rapidly with both r and L [10,13,14]. For short but “insightful” descriptions of the relationship between an appropriate value of r and the smoothing parameter λ, see Ref. [1] (Chapter 4) and Ref. [14] (Chapter 4). In principle, Specht’s scheme has the strong appeal of simplicity, provided the number of terms required in the Taylor series can be held to a practical limit. Unfortunately, both r and L can be large in current realistic classification problems [1,4,10]. Despite these (and many other) efforts, to the best of our knowledge, the computational bottleneck that the Parzen Window method encounters when N and L are large remains an unresolved issue. Thus, the motivation for this paper is to introduce yet another scheme that might be able to circumvent the aforementioned computational bottleneck problem, while retaining the estimation power and conceptual simplicity of the Parzen Window method.
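To make the storage and CPU-time complaint concrete, the following sketch implements the naive estimator of Eq. 4 directly. The isotropic Gaussian kernel and the toy prototypes are illustrative stand-ins (the paper's own kernel choice comes later); note that every call must loop over all stored prototypes.

```python
import math

def parzen_estimate(x, prototypes, kernel):
    """Naive Parzen Window estimate of p(x|w_j) (Eq. 4): the average of
    kernel values between x and every prototype of class w_j.  All N_j
    prototypes stay in main-memory, and each call costs O(N_j * L)."""
    return sum(kernel(x, xi) for xi in prototypes) / len(prototypes)

def gaussian_kernel(lam):
    """Isotropic Gaussian kernel with smoothing parameter lam."""
    def k(x, xi):
        L = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        return math.exp(-d2 / (2.0 * lam ** 2)) / ((lam * math.sqrt(2.0 * math.pi)) ** L)
    return k

protos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy prototypes of one class
k = gaussian_kernel(1.0)
p_near = parzen_estimate([0.0, 0.0], protos, k) # high: x sits among the prototypes
p_far = parzen_estimate([5.0, 5.0], protos, k)  # low: x is far from all of them
```

With tens of thousands of prototypes and L in the hundreds, the inner loop above is exactly the bottleneck the text describes.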
This work is confined to Parzen Window based approaches in which the kernel function is – or can be expressed – in the form

K(x, x_i; λ) = A f(x; λ) f(x_i; λ) e^{ρ(x, x_i; λ)},    (Eq. 8)

where A > 0; f(x; λ) and f(x_i; λ) are any real functions defined in the feature space; ρ(x, x_i; λ) is a polynomial in x and x_i; and λ is as defined before. The kernel functions that are or can be written in the form above are ubiquitous nowadays in data analysis [4,6,8]. They have most often been successfully applied to discrete data; for this reason we decided to confine attention to the discrete case. For illustrative purposes, we focus on binary data, i.e., x_l = 0 or 1 denoting absence or presence of the lth feature in the pattern vector, respectively. That is to say, both x (test pattern vector) and x_i (reference/prototype pattern vector) reside in a binary feature space {0, 1}^L. The extension of the proposed scheme to continuous data – i.e., x ∈ R^L – is straightforward. One final, but important, remark is that Specht’s approach and our proposed scheme are arguably similar in spirit. However there are crucial differences: unlike Specht’s formulation, our scheme does not estimate a combinatorially large number of polynomial terms, it does not retain such terms in main-memory, nor does the variable r feature in the final form of our algorithm – instead, in our case, only two L-dimensional vectors and one L-by-L matrix need to be retained in main-memory. The two vectors and the matrix can notably be highly sparse. We will briefly expound on this assertion shortly.
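For intuition about Eq. 8, here is a quick numeric check that a squared-exponential kernel e^{−γ(x−x_i)^T(x−x_i)} has the required shape A · f(x; λ) · f(x_i; λ) · e^{ρ(x, x_i; λ)}, with A = 1, f(u; λ) = e^{−γ u^T u} and the bilinear (hence polynomial) exponent ρ(x, x_i; λ) = 2γ x^T x_i. The vectors and the value of γ below are arbitrary examples, not values from the paper.

```python
import math

gamma = 0.7                      # arbitrary positive smoothing-related constant
x = [1.0, 0.0, 1.0]              # arbitrary test pattern
xi = [0.0, 1.0, 1.0]             # arbitrary prototype pattern

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

diff = [a - b for a, b in zip(x, xi)]
lhs = math.exp(-gamma * dot(diff, diff))       # kernel on the squared distance

f = lambda u: math.exp(-gamma * dot(u, u))     # the factor f(.; lambda) of Eq. 8
rho = 2.0 * gamma * dot(x, xi)                 # polynomial exponent rho(x, xi; lambda)
rhs = 1.0 * f(x) * f(xi) * math.exp(rho)       # A * f(x) * f(xi) * exp(rho), A = 1
```

The two sides agree because (x − x_i)^T(x − x_i) = x^T x − 2 x^T x_i + x_i^T x_i, the same expansion the paper exploits in the next section.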

Proposed method and discrete Parzen Window approach

As a concrete example, we use the widely utilized kernel function (albeit in cheminformatics [15,16], and references therein) introduced by Aitchison and Aitken (AA-kernel) [17,18]:

K(x, x_i; λ) = λ^{L − d(x, x_i)} (1 − λ)^{d(x, x_i)},    (Eq. 9)

where 0.5 < λ < 1 and d(x, x_i) denotes the number of components in which x and x_i disagree. This dissimilarity measure d(x, x_i) can be conveniently expressed as [4]

d(x, x_i) = (x − x_i)^T (x − x_i).    (Eq. 10)

In passing, the AA-kernel is basically a discrete analogue of an isotropic Gaussian kernel [17,18]. From Eqs. 9 and 10, and the fact that λ^{L−d} (1 − λ)^d = λ^L [(1 − λ)/λ]^d, we have

K(x, x_i; λ) = λ^L e^{−γ (x − x_i)^T (x − x_i)},    (Eq. 11)

where γ = ln[λ/(1 − λ)] > 0. The term (x − x_i)^T (x − x_i) can be written as

(x − x_i)^T (x − x_i) = x^T x − 2 x^T x_i + x_i^T x_i.    (Eq. 12)

Inserting Eq. 12 into Eq. 11 yields

K(x, x_i; λ) = λ^L e^{−γ x^T x} e^{−γ x_i^T x_i} e^{2γ x^T x_i}    (Eq. 13)

(cf. Eq. 8). From Eqs. 13 and 4, we have

p̂(x|ω_j) = (λ^L e^{−γ x^T x} / N_j) Σ_{i=1}^{N_j} e^{−γ x_i^T x_i} e^{2γ x^T x_i}.    (Eq. 14)

Now we come to the main contribution of this paper: removing the requirement for retaining all the reference/prototype patterns for class ω_j in main-memory in order to estimate p̂(x|ω_j) and classify a new pattern x. However, first we simplify the notation by defining the quantities a, z and z′ – all computable once from the prototype patterns of class ω_j – which will be used consistently throughout, where N_j refers to the number of patterns in the training dataset that belong to class ω_j, and x_i and γ are as described before. In our new notation, Eq. 14 takes the compact form of Eq. 15. The main contribution of the paper is formulating Eq. 15 in terms of γ, a, z, z′ and an L-by-L matrix Q, which will be defined shortly. The task of this formulation basically amounts to expressing the sum over the prototype patterns in Eq. 14 in terms of z, z′ and Q. In doing this, we hope to ameliorate the computational drawback of the Parzen Window method based on kernel functions of the form given in Eq. 8.
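The algebra of Eqs. 9-13 is easy to verify numerically. In the sketch below the binary vectors and the value of λ are arbitrary examples; the check confirms both the quadratic rewriting of the disagreement count d(x, x_i) (Eq. 10) and the exponential form of the AA-kernel (Eq. 11).

```python
import math

lam = 0.8                         # smoothing parameter, 0.5 < lam < 1
x = [1, 0, 1, 1, 0]               # binary test pattern
xi = [1, 1, 0, 1, 0]              # binary prototype pattern
L = len(x)

# d(x, xi): number of disagreeing components, directly and via Eq. 10
d_count = sum(1 for a, b in zip(x, xi) if a != b)
d_quad = sum((a - b) ** 2 for a, b in zip(x, xi))   # (x - xi)^T (x - xi)

# AA-kernel (Eq. 9) versus its exponential rewriting (Eq. 11)
gamma = math.log(lam / (1.0 - lam))
k_aa = lam ** (L - d_count) * (1.0 - lam) ** d_count
k_exp = lam ** L * math.exp(-gamma * d_count)
```

Because 0.5 < λ < 1 implies γ > 0, the kernel decays as the disagreement count grows, mirroring an isotropic Gaussian.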

Results

Expanding the exponential kernel factor in Eq. 15 brings in polynomial terms (x^T x_i)^r of order r. When r = 1 the task is trivial: the corresponding sum over the prototype patterns reduces to an inner product between x and the vector z, where z is as defined in Section 3. However, when r > 1, the task can be taxing. To this end, we make use of a simple – but useful – proposition (Proposition 1, whose proof is provided in Appendices A and B) to demonstrate that the corresponding sum for r > 1 can be written as a quadratic form x^T Q x (see Eq. 19 in Appendix C), where Q is just an L-by-L matrix built once from the prototype patterns (see Eq. 19). Inserting Eqs. 16 and 17 into Eq. 15 results in the final expression, Eq. 18, for p̂(x|ω_j) in terms of a, z, z′ and Q, with γ = ln[λ/(1 − λ)] and 0.5 < λ < 1. Evidently, Eq. 18 illustrates that it is not necessary to retain all reference/prototype patterns for a given class in main-memory to compute the value of p̂(x|ω_j) for a test pattern x; instead, all that is required is an L-by-L matrix Q and two L-sized vectors (z and z′), which are computed once and then retained in main-memory. This was the objective we set out to achieve in this paper. One final, but important, remark is that z, z′ and Q can be highly sparse in real-world applications when x ∈ {0, 1}^L, in particular if the value of L is large. The fundamental reason for this sparsity is that in a high-dimensional reference pattern vector x_i many components may be zero – i.e., many of the features are very likely to be absent in x_i. In passing, if we are dealing with continuous data, i.e. x ∈ R^L, the vectors z, z′ and the matrix Q could be dense. Nonetheless, storing Q can still be computationally cheaper than retaining the N_j reference patterns x_i per class in main-memory – provided that L < N_j. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.
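Although the paper's exact Eqs. 15-19 are developed in its appendices, the flavour of the reduction can be illustrated with two elementary identities that let a linear and a quadratic sum over prototypes collapse onto one vector and one matrix. The per-prototype weights a_i below are arbitrary positive numbers standing in for the prototype-dependent factors of Section 3, and the data are toy values; this is a sketch of the idea, not the paper's final formulas.

```python
protos = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]   # toy binary prototypes of one class
a = [0.5, 0.25, 1.0]                         # illustrative per-prototype weights
L = len(protos[0])

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# One-off precomputation: after this the prototypes are no longer needed.
z = [sum(ai * xi[l] for ai, xi in zip(a, protos)) for l in range(L)]  # z = sum_i a_i x_i
Q = [[sum(ai * xi[l] * xi[m] for ai, xi in zip(a, protos))
      for m in range(L)] for l in range(L)]                           # Q = sum_i a_i x_i x_i^T

x = [1, 1, 0]                                # new test pattern

# Naive sums that loop over every stored prototype ...
lin_naive = sum(ai * dot(x, xi) for ai, xi in zip(a, protos))
quad_naive = sum(ai * dot(x, xi) ** 2 for ai, xi in zip(a, protos))

# ... equal the prototype-free forms x^T z and x^T Q x.
lin_fast = dot(x, z)
quad_fast = sum(x[l] * Q[l][m] * x[m] for l in range(L) for m in range(L))
```

For high-dimensional binary prototypes with few active features, z and Q inherit the zeros of the data, which is the sparsity the text points out.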

Conclusion

The Parzen Window method is a powerful tool for estimating class-conditional probability density functions. However, it can suffer from a severe computational bottleneck when the training dataset is large. Over the years, attempts have been made to rectify this computational drawback of the method, but to the best of our knowledge the issue has remained unsolved. In this paper we have proposed a novel scheme which, we hope, contributes to alleviating the computational bottleneck from which the Parzen Window algorithm suffers when the training dataset is large.
  4 in total

1.  Probability density function learning by unsupervised neurons.

Authors:  S Fiori
Journal:  Int J Neural Syst       Date:  2001-10       Impact factor: 5.866

2.  Prediction of biological activity for high-throughput screening using binary kernel discrimination.

Authors:  G Harper; J Bradshaw; J C Gittins; D V Green; A R Leach
Journal:  J Chem Inf Comput Sci       Date:  2001 Sep-Oct

3.  Nonsymmetric PDF estimation by artificial neurons: application to statistical characterization of reinforced composites.

Authors:  S Fiori
Journal:  IEEE Trans Neural Netw       Date:  2003

4.  Classifying molecules using a sparse probabilistic kernel binary classifier.

Authors:  Robert Lowe; Hamse Y Mussa; John B O Mitchell; Robert C Glen
Journal:  J Chem Inf Model       Date:  2011-07-08       Impact factor: 4.956

  1 in total

1.  A note on utilising binary features as ligand descriptors.

Authors:  Hamse Y Mussa; John B O Mitchell; Robert C Glen
Journal:  J Cheminform       Date:  2015-12-01       Impact factor: 5.514

