
Gaussian bandwidth selection for manifold learning and classification.

Ofir Lindenbaum1, Moshe Salhov2, Arie Yeredor1, Amir Averbuch2.   

Abstract

Kernel methods play a critical role in many machine learning algorithms. They are useful in manifold learning, classification, clustering and other data analysis tasks. Setting the kernel's scale parameter, also referred to as the kernel's bandwidth, highly affects the performance of the task in hand. We propose to set a scale parameter that is tailored to one of two types of tasks: classification and manifold learning. For manifold learning, we seek a scale which is best at capturing the manifold's intrinsic dimension. For classification, we propose three methods for estimating the scale, which optimize the classification results in different senses. The proposed frameworks are simulated on artificial and on real datasets. The results show a high correlation between optimal classification rates and the estimated scales. Finally, we demonstrate the approach on a seismic event classification task.
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020.


Keywords:  Classification; Diffusion maps; Dimensionality reduction; Kernel methods

Year:  2020        PMID: 32837252      PMCID: PMC7330274          DOI: 10.1007/s10618-020-00692-x

Source DB:  PubMed          Journal:  Data Min Knowl Discov        ISSN: 1384-5810            Impact factor:   3.670


Introduction

Dimensionality reduction is an essential step in numerous machine learning tasks. Methods such as Principal Component Analysis (PCA) (Jolliffe 2002), Multidimensional Scaling (MDS) (Kruskal 1977), Isomap (Tenenbaum et al. 2000) and Local Linear Embedding (Roweis and Saul 2000) aim to extract essential information from high-dimensional data points based on their pairwise connectivities. Graph-based kernel methods such as Laplacian Eigenmaps (Belkin and Niyogi 2001) and Diffusion Maps (DM) (Coifman and Lafon 2006) construct a positive semi-definite kernel based on the multidimensional data points to recover the underlying structure of the data. Such methods have been proven effective for tasks such as clustering (Luo 2011), classification (Lindenbaum et al. 2015), manifold learning (Lin et al. 2006) and many more. Kernel methods rely on computing a distance function (usually Euclidean) between all pairs of data points and applying a data-dependent kernel function. This kernel should encode the inherent relations between the high-dimensional data points. An example of a kernel that encapsulates the Euclidean distance takes the form

K(x_i, x_j) = exp( -||x_i - x_j||^2 / (2ε) ),     (1.1)

where ||·|| denotes the Euclidean norm. As shown, for example, in Roweis and Saul (2000), Coifman and Lafon (2006), spectral analysis of such a kernel provides an efficient representation of the lower (d-) dimensional data (where d ≪ D) embedded in the ambient space. Devising the kernel to work successfully in such contexts requires expert knowledge for setting two parameters, namely the scale ε (Eq. 1.1) and the inferred dimension d of the low-dimensional space. In this paper we focus on setting the scale parameter ε, sometimes also called the kernel bandwidth. The scale parameter is related to the statistics and to the geometry of the data points. The Euclidean distance, which is often used for learning the geometry of the data, is meaningful only locally when applied to high-dimensional data points.
Therefore, a proper choice of ε should preserve local connectivities and neglect large distances. If ε is too large, there is almost no preference for local connections and the kernel method essentially reduces to PCA (Lindenbaum et al. 2015). If, on the other hand, ε is too small, the matrix K (Eq. 1.1) has many small off-diagonal elements, which indicates poor connectivity within the data. Several studies have proposed approaches for setting ε. Lafon et al. (2006) suggest a method which enforces connectivity among most data points; the method is simple, but sensitive to noise and to outliers. Singer et al. (2009) use the sum of the kernel elements to find a range of valid scales; this method provides a good starting value, but is not fully automated. Zelnik-Manor and Perona (2004) set an adaptive scale for each point, which is applicable for spectral clustering but might deform the geometry; as a result, there is no guarantee that the rescaled kernel has real eigenvectors and eigenvalues. Others simply use the squared standard deviation (mean squared Euclidean deviation from the mean) of the data as ε; this, again, is very sensitive to noise and to outliers. Kernel methods are also used for Support Vector Machines (SVMs, Scholkopf and Smola 2001), where the goal is to find a feature space that best separates given classes. Methods such as Gaspar et al. (2012) and Staelin (2003) use cross-validation to find the scale parameter which achieves peak classification results on a given training set. Campbell et al. (1999) suggest an iterative approach that updates the scale until reaching maximal separation between classes. Chapelle et al. (2002) relate the scale parameter to the feature selection problem by using a different scale for each feature; their framework applies gradient descent to a designated error function to find the optimal scales.
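The two failure modes described above (ε too large vs. too small) can be checked numerically by building the Gaussian kernel of Eq. (1.1) at extreme scales. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def gaussian_kernel(X, eps):
    """Gaussian affinity matrix, Eq. (1.1): K_ij = exp(-||x_i - x_j||^2 / (2*eps))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * eps))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# eps far too small: K is numerically the identity (poor connectivity);
# eps far too large: K is numerically all-ones (no locality, PCA-like behavior).
K_small = gaussian_kernel(X, 1e-6)
K_large = gaussian_kernel(X, 1e6)
```

A useful scale lies between these extremes, where K has a pronounced band of intermediate off-diagonal values.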
These methods work well for classification, but require actually re-classifying the points in order to test each scale. In this paper we propose methods for estimating the scale which do not require repeated application of a classifier to the data. Since the value of ε defines the connectivity of the resulting kernel matrix (Eq. 1.1), it is clearly crucial for the performance of kernel-based methods. Nonetheless, the performance of such methods depends on the training data and on the optimization problem at hand. Thus, in principle, we cannot define an 'optimal' scale parameter value independently of the data. We therefore focus on developing tools to estimate a scale parameter based on a given training set. We found that there are almost no simple methods focusing on finding element-wise (rather than one global) scaling parameters dedicated to manifold learning, nor methods that try to maximize classification performance without directly applying a classifier. For these reasons we propose new methodologies for setting ε, dedicated either to manifold learning or to classification. For the manifold learning task, we start by estimating the manifold's intrinsic dimension. Then, we introduce a vector of scale parameters, such that each feature (coordinate) of the data is rescaled by its own value. We propose a greedy algorithm to find the scale parameters which best capture the estimated intrinsic dimension. This approach is analyzed and simulated to demonstrate its advantage. For the classification task, we propose three methods for finding a scale parameter. In the first, extending Lindenbaum et al. (2016), we seek a scale which provides the maximal separation between the classes in the extracted low-dimensional space. The second is based on the eigengap of the kernel, and is justified by the analysis of a perturbed kernel. The third method sets the scale which maximizes the within-class transition probability.
This last approach does not require computing an eigendecomposition. Additionally, we provide new theoretical justifications for the eigengap-based method, as well as new simulations to support all methods. Interestingly, we also show empirically that all three methods converge to a similar scale parameter ε. The structure of the paper is as follows: preliminaries are given in Sect. 2. Section 3 presents and analyzes two frameworks for setting the scale parameter: the first is dedicated to a manifold learning task, while the second fits a classification task. Section 4 presents experimental results. Finally, in Sect. 5 we demonstrate the applicability of the proposed methods for the task of learning seismic parameters from raw seismic signals.

Preliminaries

We begin by providing a brief description of two methods used in this study: A kernel-based method for dimensionality reduction called Diffusion Maps (Coifman and Lafon 2006); and Dimensionality from Angle and Norm Concentration (DANCo, Ceruti et al. 2014), which estimates the intrinsic dimension of a manifold based on the ambient high-dimensional data. In the following, vectors and matrices are denoted by bold letters, and their components are denoted by the respective plain letters, indexed using subscripts or parentheses.

Diffusion maps (DM)

DM (Coifman and Lafon 2006) is a nonlinear dimensionality reduction framework that extracts the intrinsic geometry from a high-dimensional dataset. This framework is based on the construction of a stochastic matrix from the graph of the data; the eigendecomposition of the stochastic matrix provides an efficient representation of the data. We chose to use DM in our analysis as it provides an intuitive interpretation based on its Markovian construction. Nonetheless, the methods in this manuscript could also be adapted to Laplacian Eigenmaps (Belkin and Niyogi 2001) and to other kernel methods. Given a high-dimensional dataset X = {x_1, ..., x_N} ⊂ R^D, the DM framework consists of the following steps:

1. A kernel function is chosen, so as to compute a matrix K with elements K_{i,j}, satisfying the following properties: (i) symmetry: K_{i,j} = K_{j,i}; (ii) positive semi-definiteness: v^T K v ≥ 0 for all v ∈ R^N; and (iii) non-negativity: K_{i,j} ≥ 0. These properties guarantee that K has real-valued eigenvectors and non-negative real-valued eigenvalues. In this study, we focus on the common choice of a Gaussian kernel (see Eq. 1.1),

   K_{i,j} = exp( -||x_i - x_j||^2 / (2ε) ),     (2.1)

   as the affinity measure between two multidimensional data vectors x_i and x_j, where ε is the scale parameter of the Gaussian kernel. Obviously, choosing the kernel function entails the selection of an appropriate scale ε, which determines the degree of connectivity expressed by the kernel.

2. By normalizing the rows of K, the row-stochastic matrix

   P = D^{-1} K     (2.2)

   is computed, where D is a diagonal matrix with D_{i,i} = Σ_j K_{i,j}. P can be interpreted as the matrix of transition probabilities of a (fictitious) Markov chain on X, such that [P^t]_{i,j} (where t is an integer power) describes the implied probability of transition from point x_i to point x_j in t steps.

3. Spectral decomposition is applied to P, yielding a set of N eigenvalues 1 = λ_1 ≥ λ_2 ≥ ... ≥ λ_N (in descending order) and associated normalized eigenvectors ψ_1, ..., ψ_N satisfying P ψ_n = λ_n ψ_n.

4. A new representation for the dataset X is defined by

   Ψ_t(x_i) = [ λ_2^t ψ_2(i), λ_3^t ψ_3(i), ..., λ_N^t ψ_N(i) ]^T,     (2.3)

   where ψ_n(i) denotes the i-th element of ψ_n. Note that λ_1 and ψ_1 were excluded, as the constant eigenvector ψ_1 does not carry information about the data.

The main idea behind this representation is that the Euclidean distance between two data points in the new representation is equal to the weighted distance between the conditional transition probabilities p(x_i, ·) and p(x_j, ·), where these are the i-th and j-th rows of P^t. This diffusion distance is defined by

D_t^2(x_i, x_j) = Σ_k ( [P^t]_{i,k} - [P^t]_{j,k} )^2 / W_{k,k},     (2.4)

where W is a diagonal matrix with W_{i,i} = D_{i,i} / Σ_j D_{j,j}. This equality is proved in Coifman and Lafon (2006). A low-dimensional mapping is defined by truncating Ψ_t after d coordinates, such that

Ψ_t^{(d)}(x_i) = [ λ_2^t ψ_2(i), ..., λ_{d+1}^t ψ_{d+1}(i) ]^T,     (2.5)

where d ≪ D.
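The four steps above can be sketched compactly in numpy. This is a minimal illustration, not the paper's implementation; for large N one would compute only a few leading eigenpairs of a sparse kernel instead of a full dense eigendecomposition:

```python
import numpy as np

def diffusion_maps(X, eps, d, t=1):
    """Sketch of the DM construction: Gaussian kernel (Eq. 2.1),
    row-stochastic P (Eq. 2.2), truncated spectral embedding (Eq. 2.5)."""
    # Step 1: Gaussian kernel
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * eps))
    # Step 2: row-stochastic matrix P = D^{-1} K
    P = K / K.sum(axis=1, keepdims=True)
    # Step 3: spectral decomposition; P is similar to a symmetric matrix,
    # so its eigenvalues are real up to round-off
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Step 4: drop the trivial pair (lambda_1 = 1, constant psi_1)
    return (vals[1:d + 1] ** t) * vecs[:, 1:d + 1]
```

Usage: `diffusion_maps(X, eps, d=2)` returns an N x 2 embedding of the rows of `X`.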

Intrinsic dimension estimation

Given a high-dimensional dataset X ⊂ R^D, which describes an ambient space containing a manifold M on which the data points lie, the intrinsic dimension d is the minimum number of parameters needed to represent the manifold.

Definition 2.2.1

Let M be a manifold. The intrinsic dimension d of the manifold is a positive integer determined by how many independent "coordinates" are needed to describe M. Using a parametrization to describe the manifold, the intrinsic dimension is the smallest integer d such that there exists a smooth map f covering all data points on the manifold, x_i = f(ξ_i), ξ_i ∈ R^d.

Methods proposed by Fukunaga and Olsen (1971) or by Verveer and Duin (1995) use local or global PCA to estimate the intrinsic dimension d: the dimension is set as the number of eigenvalues greater than some threshold. Others, such as Trunk (1976) or Pettis et al. (1979), use k-nearest-neighbor (KNN) distances to find a subspace around each point and, based on some statistical assumption, estimate d. A survey of different approaches is provided in Camastra (2003). In this study we use the Dimensionality from Angle and Norm Concentration algorithm (DANCo, by Ceruti et al. (2014)), which we observed to be the most robust approach in our experiments, to estimate d. DANCo jointly uses the normalized distances and mutual angles to extract a robust estimate of d. This is done by finding the dimension that minimizes the Kullback-Leibler divergence between the estimated probability distribution functions (pdf-s) of artificially-generated data and of the observed data. A full description of DANCo is presented in the "Appendix" of this manuscript. In Sect. 3.2, we propose a framework which exploits the resulting estimate of d for choosing the scale parameter ε.
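A full DANCo implementation is beyond the scope of a short example, but the simpler PCA-based idea mentioned above (Fukunaga and Olsen 1971) can be sketched in a few lines. This is a crude stand-in, assuming a global linear structure; the function name and threshold are ours:

```python
import numpy as np

def pca_intrinsic_dim(X, threshold=0.05):
    """Global-PCA dimension estimate: count covariance eigenvalues that
    exceed a fraction of the largest one. Far less robust than DANCo,
    which also uses angle concentration and KL-divergence matching."""
    Xc = X - X.mean(axis=0)
    evals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]  # descending
    return int(np.sum(evals / evals[0] > threshold))
```

For data lying on a linear 2-dimensional subspace of R^5, this returns 2; for curved manifolds, local PCA or DANCo is needed.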

Setting the scale parameter

DM, as described in Sect. 2, is an efficient method for dimensionality reduction. The method is almost completely automated and does not require tuning many hyper-parameters. Nonetheless, its performance is highly dependent on a proper choice of ε (Eq. 1.1), which, along with the decaying property of the Gaussian affinity kernel, defines the affinity between all points in X. If the ambient dimension D is high, the Euclidean distance becomes meaningless once it takes large values. Thus, a proper choice of ε should preserve local connectivities and neglect large distances. We argue that there is no single 'optimal' way of setting the scale parameter; rather, one should choose the scale based on the data and on the task at hand. In the following subsection we describe several existing methods for setting ε. In Sect. 3.2, we propose a novel algorithm for setting ε in the context of manifold learning. Finally, in Sect. 3.3, we present three methods for setting ε in the context of classification tasks, so as to optimize the classification performance (in certain senses) in the low-dimensional space. Our goal is to optimize the scale prior to the application of the classifier.

Existing methods

Several studies propose methods for setting the scale parameter ε. Some choose ε as the empirical squared standard deviation (mean squared Euclidean deviation from the empirical mean) of the data. This approach is reasonable when the data is sampled from a uniform distribution. A max-min measure is suggested in Lafon et al. (2006), where the scale is set to

ε = C · max_j [ min_{i, i ≠ j} ||x_i - x_j||^2 ],     (3.1)

with C ∈ [2, 3]. This approach attempts to set a small scale to maintain local connectivities.

Another scheme (Singer et al. 2009) aims to find a range of valid values for ε. The idea is to compute the sum of the elements of the kernel from Eq. (2.1),

S(ε) = Σ_{i,j} K_{i,j}(ε),

at various values of ε, and then search for the range of values which gives rise to a well-pronounced Gaussian bell shape, i.e., where log S(ε) grows linearly in log ε. The scheme in Singer et al. (2009) is implemented using Algorithm 3.1. Note that log S(ε) has two asymptotes, log N and log N^2, since when ε → 0 the kernel K (Eq. 2.1) approaches the identity matrix, whereas for ε → ∞, K approaches an all-ones matrix. We denote by ε_S the minimal value within the linear range (defined in Algorithm 3.1); this value is used in the simulations presented in Sect. 4.

A dynamic scale is proposed in Zelnik-Manor and Perona (2004), suggesting to calculate a local scale σ_i for each data point x_i. The scale is chosen using the distance from the r-th nearest neighbor of the point x_i. Explicitly, the calculation for each point is

σ_i = ||x_i - x_{r(i)}||,     (3.2)

where x_{r(i)} is the r-th nearest (Euclidean) neighbor of the point x_i. The value of the kernel for the pair of points x_i, x_j is

K_{i,j} = exp( -||x_i - x_j||^2 / (σ_i σ_j) ).     (3.3)

This dynamic scale guarantees that at least half of the points are connected to r neighbors.

All the methods mentioned above treat ε as a scalar. Thus, when the data is sampled from various types of sensors, these methods may be dominated by the features (vector elements) with the highest energy (or variance). In such cases, each feature in a data vector may require a different scale. In order to rescale the feature vectors, a diagonal positive-definite (PD) scaling matrix A is introduced, and each rescaled vector is set as x̃_i = A x_i. The kernel matrix is rewritten as

K_{i,j} = exp( -||A (x_i - x_j)||^2 / 2 ).     (3.4)

A standard way to set the scaling elements is to use the empirical standard deviation σ̂_j of the respective feature and then set

A_{j,j} = 1 / σ̂_j.
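The three scalar heuristics above are each a few lines of numpy. The sketch below follows the reconstructed formulas; function names and the default r are ours:

```python
import numpy as np

def _sq_dists(X):
    return np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

def maxmin_scale(X, C=2.0):
    """Max-min heuristic (Lafon et al. 2006): C times the largest
    squared nearest-neighbor distance, C in [2, 3]."""
    sq = _sq_dists(X)
    np.fill_diagonal(sq, np.inf)       # exclude self-distances
    return C * np.max(np.min(sq, axis=1))

def local_scales(X, r=7):
    """Zelnik-Manor & Perona (2004): sigma_i = distance from x_i to its
    r-th nearest neighbor (Eq. 3.2)."""
    d = np.sqrt(_sq_dists(X))
    return np.sort(d, axis=1)[:, r]    # column 0 is the point itself

def log_kernel_sum(X, eps_grid):
    """Singer et al. (2009): log S(eps) over a grid of scales. The
    mid-range where the curve is linear in log(eps) marks valid scales,
    between the asymptotes log(N) (eps -> 0) and log(N^2) (eps -> inf)."""
    sq = _sq_dists(X)
    return np.array([np.log(np.exp(-sq / (2.0 * e)).sum()) for e in eps_grid])
```

Plotting `log_kernel_sum` against `np.log(eps_grid)` reproduces the S-shaped curve described in the text.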

Setting for manifold learning

In this subsection we propose a framework for setting the scale parameter when the dataset X has some low-dimensional manifold structure M. We start by revisiting an analysis from (Coifman et al. 2008; Hein and Audibert 2005), which relates the scale parameter to the intrinsic dimension d (Definition 2.2.1) of the manifold. In Coifman et al. (2008) a range of valid values is suggested for ε; here we expand the results from (Coifman et al. 2008; Hein and Audibert 2005) by introducing a diagonal PD scaling matrix A (as used in Eq. 3.4). This diagonal matrix enables a feature selection procedure which emphasizes the latent structure of the manifold.

Let K and A be the kernel matrix and the diagonal PD scaling matrix (resp.) from Eq. (3.4). Taking the double sum of all elements of K, we get

S = Σ_{i=1}^N Σ_{j=1}^N K_{i,j}.     (3.6)

By assuming that the data points in X are independently uniformly distributed over the manifold M, this sum can be approximated using the mean value theorem as

S ≈ (N^2 / V^2) ∫_M ∫_M exp( -||A (x - y)||^2 / 2 ) dx dy,     (3.7)

where V is the (weighted) volume of the d-dimensional manifold M, with dx, dy being infinitesimal parallelograms on the manifold, carrying the dependence on A (see Moser 1965 for a more detailed discussion). When the scale is sufficiently small, the integrand in the internal integral in Eq. (3.7) takes non-negligible values only when y is close to the hyperplane tangent to the manifold at x. Thus, the integration over y within a small patch around each x can be approximated by integration in R^d, so that

S ≈ (N^2 / V^2) ∫_M ∫_{R^d} exp( -||u||^2 / (2ε) ) du dx,     (3.8)

where d is the intrinsic dimension of M and u is a d-dimensional vector of coordinates on the tangent plane. The integral in Eq. (3.8) has a closed-form solution (the internal integral yields (2πε)^{d/2}, so that the outer integral yields V (2πε)^{d/2}), and we therefore obtain the relation

S ≈ (N^2 / V) (2πε)^{d/2},     (3.9)

where the dependence on A is absorbed, for shorthand, into the weighted volume V. The key observation behind our suggested selection of A and ε is that, due to the approximation in Eq. (3.9), an "implied" intrinsic dimension can be obtained for different selections of A and ε, as follows. Taking the log of Eq. (3.9) we have

log S ≈ log(N^2 / V) + (d/2) log(2πε).     (3.10)

Differentiating w.r.t. log ε, we obtain

∂ log S / ∂ log ε ≈ d/2,     (3.11)

leading to the "implied" dimension

d̂(A, ε) = 2 · ∂ log S(A, ε) / ∂ log ε.     (3.12)

We propose to choose the scaling so as to minimize the difference between the estimated dimension d̂_DANCo (see Sect. 2.2) and the implied dimension d̂(A, ε). We therefore set A (and ε) by solving the following optimization problem:

(A*, ε*) = argmin_{A, ε} | d̂(A, ε) - d̂_DANCo |.     (3.13)

We note that this minimization problem has one degree of freedom, which can be resolved, e.g., by arbitrarily fixing the first scaling element (see Algorithm 3.2 below). When working in a sufficiently small ambient dimension D, the minimization can be solved using an exhaustive search (e.g., on some pre-defined grid of scaling values). However, for large D an exhaustive search becomes infeasible in practice, so we propose a greedy algorithm, outlined below as Algorithm 3.2, for computing both the scaling matrix A and the scale parameter ε. The proposed algorithm operates by iteratively constructing the normalized dataset row by row. To this end, the algorithm is first initialized by normalizing the first rows (coordinates) using their empirical standard deviations (note that the estimated (or known) intrinsic dimension is either provided as an input or estimated using DANCo, Ceruti et al. (2014)). Then, in the ℓ-th iteration, only the scaling factor A_{ℓ,ℓ} for the next (ℓ-th) row of X and a new overall scale ε are found, using a (two-dimensional) exhaustive search. The resulting A_{ℓ,ℓ} is then applied to the ℓ-th row, and the entire data block is normalized by ε before the next iteration. The computational complexity of this algorithm, with k hypotheses for each of A_{ℓ,ℓ} and ε in each iteration, is O(k^2 D N^2), since O(N^2) operations are required in the computation of a single scaling hypothesis, and this is required for each coordinate ℓ. Due to the greedy nature of the algorithm, its performance depends on the order of the D features. We further propose to reorder the features using a soft feature-selection procedure. The studies in Cohen et al. (2002), Lu et al. (2007), Song et al. (2010) suggest an unsupervised feature selection procedure based on PCA.
The idea is to use the features which are most correlated with the top principal components. We propose an algorithm for reordering the features based on their correlation with the leading coordinates of the DM embedding. The algorithm (Algorithm 3.3 below) is called Correlation Based Feature Permutation (CBFP), and uses the correlation between the D features and the embedding coordinates. This correlation value provides a natural measure for the influence of each feature on the extracted embedding.

Some remarks on the uniform distribution and finite sample assumptions: the derivation in Eq. (3.7) is based on the assumption that the distribution of the points over M is uniform. If the density is not uniform, consider a measure with some probability density; the integration in Eqs. (3.7) and (3.8) should then be with respect to that measure, which changes the result in Eq. (3.9) and the algorithm that follows. In Hein and Audibert (2005), the authors show that, under certain assumptions on the density, the integral in Eq. (3.9) can still be used to approximate the intrinsic dimension d. The approximations in Eqs. (3.7) and (3.9) are unbiased for proper choices of N and ε. As shown in Hein and Audibert (2005), the bias error in Eq. (3.9) is proportional to a term involving ε and the curvature of the manifold; this means that ε should be sufficiently small for this term to vanish. However, due to the finite-sample approximation of the integral, ε should not be too small either, since the variance term grows as ε decreases. This quantity could be used to validate the scaling parameter estimated by Algorithm 3.2. In practice, a proper range of scales can be validated by evaluating the slope of log S(ε) (see Eq. 3.9): if the slope of log S(ε) as a function of log ε appears linear, this indicates that the approximation holds and that the bias and variance errors are negligible. In Sect. 4.1 below, we evaluate the performance of Algorithms 3.2 and 3.3 when applied to synthetic data embedded in artificial manifolds.
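The implied-dimension estimate at the heart of this framework (the slope of log S versus log ε, times two) is easy to sketch for a scalar scale; the full Algorithm 3.2 adds the greedy per-feature search over the scaling matrix. Function name and finite-difference step are ours:

```python
import numpy as np

def implied_dimension(X, eps, h=0.05):
    """Implied intrinsic dimension: 2 * d(log S)/d(log eps), with S(eps)
    the kernel sum of Eq. (3.6), estimated by a central difference of
    half-width h in log(eps)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logS = lambda e: np.log(np.exp(-sq / (2.0 * e)).sum())
    # slope over a log-eps interval of width 2h, times 2
    return (logS(eps * np.exp(h)) - logS(eps * np.exp(-h))) / h
```

In the valid range of scales the estimate stays near the true d; outside it, the estimate collapses toward 0 (both asymptotes of log S are flat), which is exactly the linearity check described above.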

Setting for classification

Classification algorithms use a metric space and an induced distance to compute the category of unlabeled data points. Dimensionality reduction is effective for capturing the essential intrinsic geometry of the data while neglecting undesired information (such as noise); therefore, dimensionality reduction can drastically improve classification results (Lindenbaum et al. 2015). As previously mentioned, ε plays a crucial role in the performance of kernel methods for dimensionality reduction. Various methods have been proposed for finding a scale that would potentially optimize classification performance. Studies such as by Gaspar et al. (2012) and by Staelin (2003) use a cross-validation procedure and select the scale that maximizes the performance on the validation data. Chapelle et al. (2002) apply gradient descent to a classification-error function to find an 'optimal' scale. Although these methods share our goal, they require performing classification on a validation set for selecting the scale parameter. To the best of our knowledge, the only method that estimates the scale without using a validation set was proposed by Campbell et al. (1999), where, for binary classification, it was suggested to use the scale that maximizes the margin between the support vectors. The authors show empirically that their suggested value correlates with peak classification performance on a validation set.

In this subsection we focus on DM for dimensionality reduction and demonstrate the influence of the scale parameter on classification performance in the low-dimensional space. The contribution in this section is threefold, as detailed below. We develop tools to estimate a scale parameter based on a given training set. The training set, denoted X, consists of L classes, denoted C_1, ..., C_L. In this study we focus on the balanced setting, in which the number of samples in each class is N_C, so that the total number of data points is N = L · N_C. We use a scalar scaling factor ε; however, the analysis provided in this subsection could be expanded to a vector scaling (namely, to the use of a diagonal PD scaling matrix) in a straightforward way (Fig. 1).
Fig. 1

A schematic of the three proposed methods for estimating ε, dedicated to classification. Given an input we use a range of scale hypotheses. a The first two dimensions of high-dimensional Gaussian classes. b The probabilistic approach: the scale is estimated based on modified transition matrices. c The geometric approach: the scale is estimated based on the geometry of the embedding. d The spectral approach: the scale is estimated based on a generalized eigengap computed for each scale within the hypothesis range

We use the Davis-Kahan theorem to analyze a perturbed version of ideally separated classes. This allows us to optimize the choice of ε based merely on the eigengap of the perturbed kernel. Based on our study in Lindenbaum et al. (2015), we present an intuitive geometric metric to evaluate the separation in a multi-class setting; we show empirically that the scale which maximizes the ratio between the class separation and the average class spread also optimizes classification performance. Finally, to reduce the computational complexity involved in the spectral decomposition of the affinity kernel, we present a heuristic that allows estimating the scale parameter based on the stochastic version of the affinity kernel.

The geometric approach

The following approach for setting ε is based on the geometry of the extracted embedding. The idea is to extract low-dimensional representations for a range of candidate values of ε, and then choose the ε which maximizes the between-class to within-class variance ratio; in other words, choose a scale such that the classes are dense and far apart from each other in the resulting low-dimensional embedding space. This is done by maximizing the ratio between the inter-class variance and the sum of intra-class variances. We have explored a similar approach for an audio-based classification task in Lindenbaum et al. (2015). The geometric approach is implemented using the following steps:

1. Compute DM-based embeddings (Eq. 2.5) for various candidate values of ε.
2. Denote by μ_ℓ the center of mass of class C_ℓ, and by μ the center of mass of all the data points; all μ_ℓ and μ are computed in the low-dimensional DM-based embedding from step 1.
3. For each class C_ℓ, compute the average squared distance (in the embedding space) of the class's data points from its center of mass μ_ℓ.
4. Compute the same measure for all data points around the global center of mass μ.
5. Define ρ(ε) (Eq. 3.18) as the ratio between the global spread and the sum of the within-class spreads, and choose the ε which maximizes ρ(ε).

The idea is that ρ(ε) (Eq. 3.18) inherits the inner structure of the classes and neglects the mutual structure. In Sect. 4.2, we describe experiments that empirically evaluate the influence of ε on the performance of classification algorithms. We note, however, that this approach requires an eigendecomposition computation for each candidate ε; thus, its computational complexity is of the order of O(d N^2) per candidate (d being the number of required eigenvectors).
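The separation ratio in the steps above can be sketched as follows; this is our reconstruction of the ratio, and the exact normalization of the paper's Eq. (3.18) may differ:

```python
import numpy as np

def separation_ratio(emb, labels):
    """rho: spread of the data around the global center, divided by the
    summed within-class spreads, all computed in the embedding space.
    Maximize this over candidate scales."""
    mu = emb.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        pts = emb[labels == c]
        mc = pts.mean(axis=0)
        between += np.sum((mc - mu) ** 2)                      # class-center spread
        within += np.mean(np.sum((pts - mc) ** 2, axis=1))     # class spread
    return between / within
```

One would call this once per candidate ε, on the embedding `diffusion_maps`-style code produces, and keep the argmax.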

The spectral approach

In this subsection, we analyze the relation between the spectral properties of the kernel and its corresponding low-dimensional representation. We start the analysis by constructing an ideal training set with well-separated classes. Then, we add a small perturbation to the training set and compute the perturbed affinity matrix. Based on the spectral properties of the perturbed kernel, we suggest a scale which captures the essential information for class separation.

The ideal case: We begin the discussion by considering an ideal classification setting, in which the classes are assumed to be well separated in the ambient space (a similar setting for spectral clustering was described in Ng et al. (2002)). The separation is formulated using the following definitions. The Euclidean gap is defined as

δ = min_{x_i ∈ C_k, x_j ∈ C_ℓ, k ≠ ℓ} ||x_i - x_j||^2;     (3.19)

this is the Euclidean distance between the two closest data points belonging to two different classes. The maximal class width is defined as

γ = max_ℓ max_{x_i, x_j ∈ C_ℓ} ||x_i - x_j||^2;     (3.20)

this is the maximal Euclidean distance between two data points belonging to the same class. We assume that ε is chosen such that γ and ε are small relative to the gap δ, so that the classes are well separated. Using this assumption and the decaying property of the Gaussian kernel, the matrix K (Eq. 2.1) converges to the following block form

K = diag( K^(1), K^(2), ..., K^(L) ),     (3.21)

where K^(ℓ) is the affinity block of class C_ℓ. For the ideal case, we further assume that the elements of K^(ℓ), ℓ = 1, ..., L, are non-zero, because the within-class distances are small relative to ε and the classes are connected.

Proposition 3.3.1

Assume that K takes the ideal block form of Eq. (3.21). Then the corresponding stochastic matrix P has an eigenvalue λ = 1 with multiplicity L. Furthermore, the first L - 1 coordinates of the DM mapping (Eq. 2.3) are piecewise constant. The explicit form of the first such eigenvector is

u^(1) = [1, ..., 1, 0, ..., 0]^T,

with ones at the N_C coordinates corresponding to C_1 and zeros elsewhere. The eigenvectors u^(2), ..., u^(L) have the same structure, but cyclically shifted to the right by multiples of N_C bins.

Proof

Recall that P = D^{-1} K is row-stochastic. Due to the special block structure of K (Eq. 3.21), each block P^(ℓ), ℓ = 1, ..., L, of P is itself row-stochastic. Thus,

P^(ℓ) · 1 = 1,

where 1 denotes an all-ones vector of the appropriate length. Each eigenvector u^(ℓ) consists of a block of 1-s at the row indices that correspond to K^(ℓ) (Eq. 3.21), padded with zeros; u^(ℓ) is a right eigenvector (P u^(ℓ) = u^(ℓ)) with eigenvalue λ = 1. We thus have an eigenvalue 1 with multiplicity L and piecewise-constant eigenvectors, denoted u^(1), ..., u^(L). Each data point x_i ∈ C_ℓ corresponds to a row within the respective sub-matrix P^(ℓ). Therefore, using these eigenvectors as the low-dimensional representation of X, all the data points from within a class are mapped to a single point in the embedding space.

Corollary 3.3.1

Using the first L eigenvectors of P (Eq. 3.21) as a representation for X, such that x_i is mapped to the i-th row of [u^(1), ..., u^(L)], the within-class distances in the embedding vanish while distinct classes are mapped to distinct points (cf. the distances defined in Eqs. (3.19) and (3.20), respectively). Indeed, based on the representation in Proposition 3.3.1 along with Eq. (3.19), any two points from different classes are mapped to embedding vectors at a fixed positive distance from each other; in a similar manner, by Eq. (3.20), any two points from the same class are mapped to the same embedding vector. Corollary 3.3.1 implies that we can compute an efficient representation for the classes; we denote this representation by Ψ̄.

The perturbed case: In real datasets, we cannot expect the off-block-diagonal elements of the affinity matrix to be zero. The data points from different classes in real datasets are not completely disconnected, and we can assume they are weakly connected. This low connectivity implies that the off-block-diagonal values of the affinity matrix are non-zero. We analyze this more realistic scenario by assuming that the observed kernel K̃ is a perturbed version of the "ideal" block form of K. Perturbation theory addresses the question of how a small change in a matrix relates to a change in its eigenvalues and eigenvectors. In the perturbed case, the off-block-diagonal terms are non-zero and the obtained (perturbed) matrix takes the form

K̃ = K + W,     (3.24)

where W is assumed to be a small symmetric perturbation. The analysis of the ideal case has provided an efficient representation for classification tasks, as described in Proposition 3.3.1. We propose to choose the scale parameter ε such that the representation extracted from the perturbed kernel K̃ (Eq. 3.24) is similar to the representation extracted from the ideal kernel K (Eq. 3.21). For this purpose we use the following theorem.

Theorem 3.1

(Davis-Kahan, see Stewart 1990) Let H and H̃ be Hermitian matrices of the same dimensions, and let H̃ be a perturbed version of H. Set an interval S, denote the eigenvalues within S of H and H̃ as λ_S(H) and λ_S(H̃), with corresponding sets of eigenvectors V and Ṽ, respectively. Define δ_S as

δ_S = min { |λ - s| : λ an eigenvalue of H outside S, s ∈ S }.     (3.26)

Then the distance between the eigenspaces satisfies

|| sin Θ(V, Ṽ) ||_F ≤ || H̃ - H ||_F / δ_S,

where Θ(V, Ṽ) is a diagonal matrix with the principal angles between the subspaces on its diagonal, and ||·||_F denotes the Frobenius norm. In other words, the theorem states that the eigenspace spanned by the perturbed kernel is similar, to some extent, to the eigenspace spanned by the ideal kernel; the distance between these eigenspaces is bounded by the norm of the perturbation divided by δ_S. Theorem 3.2 provides a measure which helps to minimize the distance between the ideal representation (Proposition 3.3.1) and the realistic (perturbed) representation.

Theorem 3.2

The distance between the DM representations based on the ideal matrix and the perturbed matrix, respectively, is bounded in terms of the perturbation matrix defined in Eq. (3.24) and a diagonal matrix whose elements are the row sums. Based on Eq. (3.24), we are now ready to use Theorem 3.1. Assume that the eigenvalues of the ideal and of the perturbed matrix are ordered in descending order, and denote the leading eigenvectors of each. Based on the analysis of the "ideal" matrix, we know that its leading eigenvalues are equal to 1. Noting that the row-stochastic matrix is algebraically similar to its symmetric normalization, they have the same eigenvalues, implying that the leading eigenvalues of the symmetric counterpart equal 1 as well. Using the definition of the eigengap from Eq. (3.26) and setting the interval accordingly, the Davis-Kahan Theorem 3.1 asserts that the distance between the leading eigenspaces is bounded by the norm of the perturbation divided by the eigengap. Finally, the eigendecomposition of the row-stochastic matrix can be written in terms of its symmetric counterpart, so the right eigenvectors of the two are related by a diagonal scaling. Using the same argument for the perturbed matrix, and choosing the eigenspaces spanned by the leading eigenvectors, we conclude that decreasing the perturbation term also decreases the distance between the representations.
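The Davis-Kahan bound can be sanity-checked numerically. The sketch below is a minimal illustration, not the paper's construction: the block sizes, the perturbation magnitude, and all variable names are assumptions. An "ideal" two-class block-diagonal kernel has eigengap 1, so a small symmetric perturbation should move its leading eigenspace only slightly.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)

# "Ideal" block-diagonal affinity: two disconnected classes of 20 points.
# Each block is rank-1 with eigenvalue 1, so the top-2 eigengap is 1.
A = np.zeros((40, 40))
A[:20, :20] = 1.0 / 20
A[20:, 20:] = 1.0 / 20

# Small symmetric perturbation modelling weak between-class links.
E = 1e-3 * rng.standard_normal((40, 40))
E = (E + E.T) / 2
A_tilde = A + E

# Leading 2-dimensional eigenspaces of the ideal and perturbed kernels.
_, V = np.linalg.eigh(A)        # eigh returns ascending eigenvalues
_, Vt = np.linalg.eigh(A_tilde)
U, Ut = V[:, -2:], Vt[:, -2:]

# ||sin Theta||_F is small when ||E||_F is small relative to the eigengap.
dist = np.linalg.norm(np.sin(subspace_angles(U, Ut)))
print(dist)  # small, well below ||E||_F / gap
```

Here `scipy.linalg.subspace_angles` returns the principal angles between the two subspaces, matching the Theta matrix in the theorem statement.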

Assumption 1

The perturbation matrix (Eq. 3.24) changes only slightly over the relevant range of scale values. Explanation: for two data points belonging to different classes, the between-class distances are large; the decaying property of the Gaussian kernel therefore provides a range of scale values for which the perturbation matrix is indeed small. In Sect. 4.2 below we evaluate this assumption using a mixture of Gaussians.

Corollary 3.3.2

Given classes under the perturbation assumption and Assumption 1, the generalized eigengap Ge is defined as the gap between consecutive eigenvalues at the index equal to the number of classes. The scale parameter which maximizes Ge provides the best class separation using an embedding whose number of coordinates equals the number of classes. This approach requires computing an eigendecomposition for each candidate scale value; its computational complexity is therefore on the order of a full eigendecomposition per candidate.
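A scan over candidate scales using the generalized eigengap can be sketched as follows. This is a minimal illustration under assumptions: the exact Gaussian-kernel form, the scale grid, and the toy two-blob dataset are not taken from the paper.

```python
import numpy as np

def generalized_eigengap(X, eps, k):
    """Eigengap lambda_k - lambda_{k+1} of the row-stochastic Gaussian
    kernel at scale eps, where k is the number of classes."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (2 * eps ** 2))          # assumed kernel form
    P = K / K.sum(axis=1, keepdims=True)      # row-stochastic matrix
    lam = np.sort(np.linalg.eigvals(P).real)[::-1]
    return lam[k - 1] - lam[k]

# Two well-separated Gaussian blobs; scan a log-spaced grid of scales.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
scales = np.logspace(-1, 1, 15)
gaps = [generalized_eigengap(X, e, k=2) for e in scales]
best = scales[int(np.argmax(gaps))]
```

The eigendecomposition inside the loop is what makes this approach the most expensive of the proposed criteria, as noted in the text.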

The probabilistic approach

We introduce here notations from graph theory to compute a measure of class separation based on the stochastic matrix (Eq. 2.2). Based on the values of this matrix, a Cut (Shi and Malik 2000) is defined for any two subsets of points; given the classes, we define the Classification Cut accordingly. In clustering problems, a partition is sought such that a normalized version of the cut is minimized (Dhillon et al. 2004; Ding et al. 2005). We use this intuition for a more relaxed classification problem. We first define a Generalized Cut using a modified matrix in which the diagonal is removed; based on this matrix (which carries the dependence on the scale parameter), the Generalized Cut is then defined over the classes. The idea is to remove the probability of "staying" at a specific node from the within-class transition probability. We then search for the scale which maximizes this measure. By the stochastic model, the implied probability of transition between two points is given by the corresponding matrix entry; therefore, by maximizing the measure, the sum of within-class transition probabilities is maximized. Based on the definition of the diffusion distance (Eq. 2.4), this implies that the within-class diffusion distances would be small, followed by small within-class Euclidean distances in the DM space. The heuristic approach entailed in Eq. (3.37) thus provides yet another criterion for setting a scale parameter which captures the geometry of the given classes. This approach does not require computing an eigendecomposition for each candidate scale value; its computational complexity is therefore lower.
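The generalized-cut criterion above can be sketched without any eigendecomposition. This is an illustrative implementation under assumptions: the kernel form, scale grid, and dataset are stand-ins, not the paper's exact setup.

```python
import numpy as np

def within_class_mass(X, labels, eps):
    """Sum of within-class transition probabilities of the row-stochastic
    Gaussian kernel, with the 'stay at the same node' probability removed."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (2 * eps ** 2))          # assumed kernel form
    P = K / K.sum(axis=1, keepdims=True)      # row-stochastic matrix
    np.fill_diagonal(P, 0.0)                  # drop self-transitions
    same = labels[:, None] == labels[None, :]
    return P[same].sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
y = np.repeat([0, 1], 30)
scales = np.logspace(-1, 1, 15)
score = [within_class_mass(X, y, e) for e in scales]
eps_star = scales[int(np.argmax(score))]
```

Each candidate scale costs only the pairwise-kernel computation, which is why this criterion is the cheapest of the three, as stated above.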

Experimental results

In this section we provide some experimental results, showing and comparing the different approaches (outlined in the previous section) in their respective contexts. We begin by demonstrating our proposed manifold-based scaling in Sect. 4.1, and then demonstrate the classification-based scaling approaches in Sect. 4.2.

Manifold learning

In this subsection we evaluate the performance of the proposed manifold-based approach by embedding a low-dimensional manifold which lies in a high-dimensional space. We consider two datasets: A synthetic set and a set based on an image taken from the MNIST database.

A synthetic dataset

The first experiment is constructed by projecting a 3-dimensional synthetic manifold into a high-dimensional space and then concatenating it with Gaussian noise. Data generation is done according to the following steps. First, a 3-dimensional Swiss roll is constructed by a parametric function whose parameters are drawn from uniform distributions over fixed intervals. Next, we project the Swiss roll into a high-dimensional space by multiplying the data by a random matrix whose elements are drawn from a zero-mean Gaussian distribution. Finally, we augment the projected Swiss roll with a vector of Gaussian noise, where each component of the noise is an independent zero-mean Gaussian variable. To evaluate the proposed framework, we apply Algorithm 3.1 followed by Algorithm 3.2, and extract a low-dimensional embedding (Fig. 2).
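The generation steps above can be sketched as follows. All sizes, intervals, and variances here are illustrative assumptions; the paper's exact parameter values are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 1000, 10  # number of points and ambient dimension (assumed)

# Step 1: 3-D Swiss roll (t*cos t, h, t*sin t), t and h drawn uniformly.
t = rng.uniform(3 * np.pi / 2, 9 * np.pi / 2, N)
h = rng.uniform(0, 20, N)
Y = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

# Step 2: project into D dimensions with a random Gaussian matrix.
R = rng.normal(0, 1.0, (3, D))
X = Y @ R

# Step 3: augment with independent Gaussian noise coordinates
# (the number of noise coordinates here is an assumption).
noise = rng.normal(0, 1.0, (N, 5))
X_noisy = np.hstack([X, noise])
```

The noise coordinates carry no information about the manifold, which is exactly what a coordinate-wise (vectorized) scaling should learn to suppress.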
Fig. 2

Left: the "clean" Swiss roll (Eq. 4.1). Right: 3 coordinates of the projected Swiss roll (Eq. 4.2). Both figures are colored by the value of the underlying parameter (Eq. 4.1)

Different high-dimensional datasets were generated using various values of the noise and projection parameters. DM is applied to each dataset using: (1) the standard deviation normalization, as defined in Eq. (3.5); (2) the scale described in Algorithm 3.1 and in Singer et al. (2009); (3) the MaxMin scale, as defined in Eq. (3.1) and in Lafon et al. (2006); (4) the KNN-based scaling (Zelnik-Manor and Perona 2004); and (5) the proposed scale parameters obtained using Algorithm 3.2. The extracted embedding is compared to the embedding extracted from the clean Swiss roll defined in Eq. (4.1). Each embedding is computed using an eigendecomposition; therefore, the embedding's coordinates may coincide only up to scaling and rotation. To overcome this ambiguity, we search for the rotation matrix and translation that minimize the mismatch error, defined as the sum of squared distances between values of the clean mapping and the "aligned" mapping. We repeat the experiment 40 times and compute the empirical mean square error (MSE) in the embedding space. An example of the extracted embeddings for all the different methods is presented in Fig. 3, followed by the MSE in Fig. 4. It is evident that Algorithm 3.2 is able to extract a more precise embedding than the alternative scaling schemes. The strength of Algorithm 3.2 is that it emphasizes the coordinates which are essential for the embedding and neglects the coordinates which were contaminated by noise.
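The alignment step above can be sketched with an orthogonal Procrustes fit. This is a minimal illustration; the function and variable names are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def aligned_mse(psi_ref, psi):
    """MSE between two embeddings after removing the translation and
    rotation ambiguity inherent to eigendecomposition-based embeddings."""
    a = psi_ref - psi_ref.mean(axis=0)   # centering removes translation
    b = psi - psi.mean(axis=0)
    R, _ = orthogonal_procrustes(b, a)   # rotation mapping b onto a
    return np.mean(np.sum((b @ R - a) ** 2, axis=1))

# Sanity check: a rotated, shifted copy should align back perfectly.
rng = np.random.default_rng(7)
ref = rng.normal(size=(100, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = ref @ rot + np.array([2.0, -1.0])
print(aligned_mse(ref, moved))  # essentially zero
```

`scipy.linalg.orthogonal_procrustes` finds the orthogonal matrix minimizing the Frobenius mismatch, which matches the rotation-plus-translation search described in the text.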
Fig. 3

Extracted DM-based embedding of the "noisy" Swiss roll using different methods for choosing the scale parameter. Top left: the standard deviation scaling, computed by Eq. (3.5). Top right: the scaling described in Algorithm 3.1 and Singer et al. (2009). Mid left: the MaxMin scaling, defined by Eq. (3.1). Mid right: KNN-based scaling (Zelnik-Manor and Perona 2004). Bottom left: the proposed scaling, described in Algorithm 3.2. Bottom right: the scaling of Algorithm 3.1 and Singer et al. (2009) applied to the clean Swiss roll Y defined by Eq. (4.1)

Fig. 4

The mean square error of the extracted embedding. A comparison between the proposed normalization and alternative methods which are detailed in Sect. 3

MNIST manifold

In the following experiment, we create an artificial low-dimensional manifold by rotating an image of a handwritten digit. First, we rotate the handwritten digit '6' from the MNIST dataset by angles that are uniformly sampled. Next, we add random zero-mean Gaussian noise independently to each pixel. Examples of the original and noisy versions of the handwritten '6' are shown in Fig. 5. Note that the values of the original image are in the range [0, 1]. In order to capture the circular structure of the manifold, we apply DM to the rotated images. An example of the expected circular structure extracted by DM is depicted in Fig. 5.
Fig. 5

Top left: an example of a clean handwritten digit '6'. Top right: a noisy example of the digit '6', where i.i.d. Gaussian noise drawn from N(0, 0.5) is added to each pixel. Bottom left: extracted DM-based embedding from 320 rotated images of the "clean" handwritten digit. Bottom right: extracted DM-based embedding from 320 rotated images of the "noisy" handwritten digit

We apply the different scaling schemes to the noisy images and extract a 2-dimensional DM-based embedding. For the vectorized scaling schemes (the proposed approach and the standard deviation approach) we apply the scalings to the top 50 principal components. This reduces the computational complexity, as the dimension of the feature space is reduced from 784 to 50. In this experiment we further compare against two additional local scaling schemes: the first (Vasiloglou et al. 2006), which we refer to as Harmonic, and the second, presented in Taseska et al. (2019), which we refer to as LocalDM. To evaluate the performance of the different scaling schemes, we propose the following metric, which compares the extracted embedding to a perfect circle. Given a 2-dimensional representation, we use a polar transformation to evaluate the implied radius at each point; the squared radius is the sum of the squared embedding coordinates. Next, we normalize the squared radii by their empirical mean. Finally, we compute the empirical variance of the normalized radii. A scatter plot of this normalized radius variance vs. the variance of the additive noise is presented in Fig. 6. As evident in this figure, up to a certain noise variance, the proposed scaling scheme suppresses the noise and captures the correct circular structure of the data. At some noise level our method breaks; the standard deviation approach and Singer's approach appear to break at a similar noise level. An explanation for this phenomenon could be that at low SNRs all these methods start to "amplify" the noise rather than the signal.
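The normalized radius variance (NRV) defined above can be sketched in a few lines. The function name is ours; the metric follows the description in the text.

```python
import numpy as np

def normalized_radius_variance(psi):
    """Variance of the squared radii of a 2-D embedding after normalizing
    by their empirical mean; equals 0 for a perfect circle."""
    r2 = (psi ** 2).sum(axis=1)       # squared radius per point
    return np.var(r2 / r2.mean())     # mean-normalized empirical variance

# A perfect circle of 320 points yields an NRV of (numerically) zero.
theta = np.linspace(0, 2 * np.pi, 320, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
nrv = normalized_radius_variance(circle)
```

Any deviation from circularity, e.g. an elliptical embedding, inflates the NRV, which is what makes it a useful quality score for the rotated-digit manifold.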
Fig. 6

The normalized radius variance (NRV) of the extracted embedding from the noisy rotated digit manifold. A comparison between the proposed normalization and alternative methods which are detailed in Sect. 3


Classification

In this subsection we provide empirical support for the theoretical analysis of Sect. 3.3. We evaluate the influence of the scale parameter on the classification results using several datasets: a mixture of Gaussians, artificial classes lying on a manifold, handwritten digits, chest X-ray images and seismic recordings. We focus on evaluating how the proposed measures (Eqs. (3.37), (3.18) and (3.31), resp.) correlate with the quality of the classification.

Classification of a Gaussian mixture

In the following experiment we focus on a simple classification test using a mixture of Gaussians. We generate two classes based on the following steps. First, two vectors are drawn from a Gaussian distribution; these vectors are the centers of mass of the two generated classes. Then, data points are drawn for each class from a Gaussian distribution around its center. The first experiment evaluates the spectral approach (Sect. 3.3.2); we therefore set the parameters such that the class variance is smaller than the variance of the centers of mass, and apply DM over a range of scale parameters. In Fig. 8 (left), we present the first extracted diffusion coordinate for various scale values. It is evident that the separation between the classes is highly influenced by the scale. A comparison between the measures of Eqs. (3.37) and (3.18) and the generalized eigengap Ge is presented in Fig. 8 (right). This comparison provides evidence of the high correlation between the measure of Eq. (3.37), the measure of Eq. (3.18) and the generalized eigengap (Eq. 3.31) (Fig. 7).
Fig. 8

Left: the first eigenvector computed for various values of the scale parameter. Right: a comparison between the proposed measures and Ge

Fig. 7

Left: an example of the Gaussian distributed data points. Right: a 2-dimensional mapping of the data points

To evaluate the validity of Assumption 1, we calculate the Frobenius norm of the perturbation matrix for various values of the scale parameter. The results, with the approximated scale annotated, are presented in Fig. 9. Indeed, as evident from Fig. 9, the norm of the perturbation is nearly constant over a small range of scale values around the estimate.
Fig. 9

The Frobenius norm of the perturbation matrix . The annotated point is the approximated scale


Classes based on an artificial physical process

For the non-ideal case, we generate classes using a non-linear function, designed to model an unknown underlying nonlinear physical process governed by a small number of parameters. Consequently, the classification task is essentially expected to provide an estimate of these hidden parameters. An example of such a problem is studied, e.g., in Lindenbaum et al. (2015), where a musical key is estimated by applying a classifier to a low-dimensional representation extracted from the raw audio signals. In the following we describe how we generate classes from a spiral structure, where the noise terms are drawn independently from a zero-mean Gaussian distribution. Two examples of the spiral-based classes are shown in Fig. 10; for both examples, we use the same settings with different values of the gap parameter G.
Fig. 10

Two examples of the generated three-dimensional spiral that are based on Eq. (4.6) using classes with data points within each class. The gaps are set to be left and right, respectively

The classes are generated as follows. Set the number of classes and a gap parameter G. Each class consists of data points drawn from a uniformly dense distribution within a line segment, where the class length is set relative to the gap. Denote the set of all points from all classes, and project each point into the ambient space using a spiral-like function (Eq. 4.6). To evaluate the advantage of the proposed scale parameters (Eqs. (3.18) and (3.37), resp.) for classification tasks, we calculate the proposed ratios for various scale values, and then evaluate the resulting classification (which is based on the low-dimensional embedding). Examples of embeddings of the two spirals from Fig. 10 are shown in Fig. 11. This demonstrates the effect of the scale on the quality of separation.
Fig. 11

A 2-dimensional mapping extracted from both spirals presented in Fig. 10

We apply classification in the low-dimensional space using KNN. The KNN classifier is evaluated using leave-one-out cross validation. The results are shown in Fig. 12, where it is evident that the classification results are highly influenced by the scale parameter. Furthermore, peak classification results occur at a scale value corresponding to the maximal values of the two proposed measures. The third measure did not indicate the peak classification scale; however, its computational complexity is lower than that of the other two.
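The leave-one-out KNN evaluation can be sketched as follows. The 2-D "embedding" here is a synthetic stand-in for the DM coordinates, and the class geometry is an assumption; only the evaluation protocol (KNN with leave-one-out cross validation) follows the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
# Stand-in 2-D embedding: two well-separated classes of 40 points each.
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(3, 0.5, (40, 2))])
y = np.repeat([0, 1], 40)

# KNN evaluated with leave-one-out cross validation, as in the paper.
knn = KNeighborsClassifier(n_neighbors=2)
acc = cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
print(acc)
```

In the paper's pipeline this accuracy would be recomputed for the embedding obtained at each candidate scale, producing curves like those in Fig. 12.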
Fig. 12

Accuracy of classification on the spiral artificial dataset for different values of the gap parameter G. The data is generated based on Eq. (4.6). The proposed and existing scale estimates are annotated on the plots


Classification of handwritten digits

In the following experiment, we use a dataset from the UCI machine learning repository (Lichman 2013). The dataset consists of 2000 data points describing 200 instances of each digit from 0 to 9, extracted from a collection of Dutch utility maps. The dataset provides multiple features of different dimensions. We use a concatenation of the Zernike moments (ZER), morphological features (MOR), profile correlations (FAC) and the Karhunen-Loève coefficients (KAR) as our feature space. We compute the proposed ratios for various scale values and estimate the optimal scale based on Eqs. (3.18) and (3.37). We evaluate the extracted embedding using 20-fold cross validation (one fold left out as a test set). The classification is done by applying KNN in the d-dimensional embedding. In Fig. 13, we present the classification results and the proposed optimal scales for classification. Our proposed scale concurs with the scale that provides the maximal classification rate.
Fig. 13

Accuracy of classification on the multiple features dataset. KNN is applied in a d-dimensional diffusion-based representation. The proposed and existing scale estimates are annotated on the plots


Classification of COVID-19 using chest X-ray images

In the next evaluation, we focus on classifying individuals infected by COVID-19. In certain individuals, COVID-19 may cause symptoms of pneumonia, which can be further diagnosed using chest X-ray images (Abbas et al. 2020). Convolutional neural networks (CNNs) have demonstrated promising results in classifying COVID-19 based on X-ray (Sethy and Behera 2020; Wang and Wong 2020) or CT (Shuai et al. 2020; Song et al. 2020) chest images. Here, we focus on the task of classifying COVID-19 patients based on frontal chest X-ray images. The data is collected from the Kaggle database,1 from which we have used 112 images split equally between two classes: healthy and COVID-19 infected. The feature space is defined by resizing all the chest X-rays to a common pixel size. This feature space is still sensitive to translations and scales, which are inherent to the X-ray modality. This sensitivity could be partly mitigated by defining a translation-invariant feature space, e.g. by using CNNs; however, this is beyond the scope of this study. To evaluate the proposed schemes, we compute the proposed ratios for various scale values and estimate optimal scales based on Eqs. (3.18) and (3.37). For each scale value, we extract a 2-dimensional embedding and perform classification using KNN and SVM classifiers. Accuracy is averaged using 10-fold cross-validation. Figure 14 demonstrates that the proposed scales are good candidates for selecting a scale that yields high classification performance.
Fig. 14

Classification accuracy vs. the scale value on the COVID-19 chest X-ray dataset. The proposed and existing scale estimates are annotated on the plots. Left panel: classification using KNN (K=2). Right panel: classification using a Support Vector Machine (SVM)


Practical guidelines

Our numerical simulations demonstrate that none of the proposed scaling schemes consistently outperforms the others. Nonetheless, across all of the evaluated datasets, the best performance was obtained within the range bounded by the proposed scale estimates. The probabilistic approach is based on the transition matrix and requires computing all pairwise affinities for each candidate scale. This complexity could be further reduced by computing a k-sparse approximation of the kernel; specifically, methods such as the k-sparse graph (Wang et al. 2013) can be computed at a substantially lower cost. The probabilistic approach is based on a heuristic, and we consider it to be the least accurate of the proposed schemes. Both the spectral approach and the geometric approach require a spectral decomposition. Assuming that both methods use the same number of coordinates (i.e., the embedding dimension equals the number of classes), they have the same complexity. The spectral approach is derived by analyzing a perturbed version of a kernel constructed from well-separated classes. Empirically, this analysis seems to hold in certain cases; however, if there is no clear spectral gap, we do not recommend using this method. Finally, the geometric approach is based purely on the extracted representation; therefore, we consider it the most reliable method for estimating a scale parameter appropriate for classification.
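The k-sparse kernel approximation mentioned above can be sketched with a nearest-neighbor graph. This is an illustrative construction, not the method of Wang et al. (2013); the kernel form, the neighbor count, and all names are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def sparse_gaussian_kernel(X, eps, k=10):
    """k-sparse Gaussian affinity: each point keeps only its k nearest
    neighbours, avoiding the dense n x n kernel when scanning many
    candidate scales."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nbrs.kneighbors(X)                  # (n, k) arrays
    w = np.exp(-dist ** 2 / (2 * eps ** 2))
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    K = csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))
    return (K + K.T) / 2                            # symmetrise

X = np.random.default_rng(8).normal(size=(200, 5))
K = sparse_gaussian_kernel(X, eps=1.0)
```

Because the neighbor search is done once, re-evaluating the affinities over a grid of scales only touches the stored k-nearest distances rather than all pairwise ones.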

Application: learning seismic parameters

In this section we demonstrate the capabilities of the proposed approach for extracting meaningful parameters from raw seismic recordings. Extracting reliable seismic parameters is a challenging task. Such parameters could help discriminate earthquakes from explosions; moreover, they can enable automatic monitoring of nuclear experiments. Traditional methods such as Rodgers et al. (1997) and Blandford (1982) use signal processing to analyze seismic recordings. More recent methods, such as Kortström et al. (2016), Ruano et al. (2014), Ohrnberger (2001), Beyreuther et al. (2012), Hammer et al. (2013), Del Pezzo et al. (2003) and Tiira (1996), use machine learning to construct classifiers for a variety of seismic events. Here, we extend our results from Rabin et al. (2016a) and Lindenbaum et al. (2018), in which we demonstrated the strength of DM for extracting seismic parameters. Our proposed method performs a vectorized scaling for manifold learning; thus, if the data lies on a manifold, our scaling combined with DM will extract the manifold from high-dimensional seismic recordings. Moreover, it provides a natural feature selection procedure, so if some features are corrupt, the proposed scaling may reduce their influence. As a test case, we use a dataset from Rabin et al. (2016a, b) and Lindenbaum et al. (2018), which was recorded in Israel and Jordan between 2005 and 2015. All recordings were collected at the HRFI (Harif) station, located in the south of Israel. The station collects three signals: north (N), east (E) and vertical (Z). Each signal is sampled by a broadband seismometer at 40 Hz.

Feature extraction

Seismic events usually generate two waves: primary waves (P) and secondary waves (S). The primary wave arrives directly from the source of the event to the recorder, while the secondary wave is a shear wave and thus arrives at some time delay. The two waves pass through different materials and thus have different spectral properties. This motivates the use of a time-frequency representation as the feature vector for each seismic event. The time-frequency representation used in this study is the Sonogram (Joswig 1990), which offers computational simplicity while retaining sufficient spectral resolution for the task at hand. The Sonogram is essentially a spectrogram, renormalized and rearranged in a logarithmic manner. Given a seismic signal, the Sonogram is extracted using the steps below, applied to each of the channels separately. This results in three feature sets, one each for the east (E), north (N) and vertical (Z) channels. Examples of seismic recordings of an explosion and of an earthquake are presented in Fig. 15a, b, and examples of the corresponding Sonograms in Fig. 15c, d.
Fig. 15

Top: Example of a raw signal recorded from a an explosion and b earthquake. Bottom: The Sonogram matrix extracted from c an explosion and d earthquake

The Sonogram extraction steps are as follows. First, compute a discrete-time short-term Fourier transform using a window function of fixed length with a fixed shift between consecutive windows; we use overlapping windows and compute the transform for frequency bins whose values are spread uniformly on a logarithmic scale. Next, normalize the energy by the number of frequency bins. Finally, reshape the time-frequency representation into a vector by concatenating its columns; the resulting vector is the Sonogram representation of the signal.
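The steps above can be sketched as follows. The window length, overlap, band count and sampling rate are illustrative assumptions (only the 40 Hz rate comes from the text), and log-spaced aggregation of a standard spectrogram stands in for the exact Sonogram construction.

```python
import numpy as np
from scipy.signal import spectrogram

def sonogram(x, fs=40.0, nperseg=256, n_bands=8):
    """Sonogram sketch: a spectrogram aggregated into logarithmically
    spaced frequency bands, energy-normalized, and flattened into a
    single feature vector by concatenating columns."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nperseg,
                          noverlap=nperseg // 2)
    # Logarithmically spaced band edges over the positive frequencies.
    edges = np.logspace(np.log10(f[1]), np.log10(f[-1]), n_bands + 1)
    bands = np.stack([S[(f >= lo) & (f < hi)].sum(axis=0)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    bands /= bands.sum() + 1e-12          # normalize total energy
    return bands.ravel()                  # concatenate columns

x = np.random.default_rng(5).standard_normal(4000)
v = sonogram(x)
```

Applying this to each of the E, N and Z channels yields the three per-channel feature sets described above.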

Seismic manifold learning

To evaluate the proposed scaling for manifold learning, we use a subset of the seismic recordings containing 352 quarry blasts. The explosions occurred at 4 known quarries surrounding the recording station HRFI. Our studies in Rabin et al. (2016a) and Lindenbaum et al. (2018) demonstrated that most of the variability of quarry blasts stems from the source location of each quarry; therefore, we assume that the 352 blasts lie on some low-dimensional manifold, whose parameters should correlate with location parameters. Our approach for setting the scale parameter provides a natural feature selection procedure. To evaluate the capabilities of this procedure, we "destroy" the information in some of the features by applying a deformation function to one channel out of the three seismometer recordings. We define the input for Algorithm 3.2 by applying an element-wise deformation function to the chosen channel. In the first test case, a fixed element-wise deformation function is used. Then, we apply DM with the various scaling schemes and examine the extracted representations. In Fig. 16, the two leading DM coordinates obtained with the different scaling methods are presented.
Fig. 16

The two leading DM coordinates of the 352 quarry blasts, colored by source quarry cluster. Scaling method based on: a the standard deviation of the data; b Singer's approach (Singer et al. 2009, detailed in Algorithm 3.1); c the MaxMin method (Eq. 3.1); d the proposed scaling for manifold learning (detailed in Algorithm 3.2)

The quarry cluster separation is clearly evident in Fig. 16d. To further evaluate how well the low-dimensional representation correlates with the source location, we use a list of source locations. The explosion locations were provided to us based on manual calculations performed by an analyst, who considered the phase differences between the signals' arrival times at different stations; we note that this estimation is accurate up to a few kilometers. A map of the location estimates, colored by source quarry, is presented in Fig. 17a. We then apply Canonical Correlation Analysis (CCA) to find the most correlated representation. The transformed representations are presented in Fig. 17b, c. The two correlation coefficients between their coordinates are 0.88 and 0.72.
Fig. 17

a A map with source locations of 352 explosions. Points are colored by quarry cluster. b A CCA based representation of the latitude and longitude of the explosions. c A CCA based representation of the two leading DM coordinates extracted based on the proposed scaling (appear in Fig. 16d)

In the second test case, we use additive Gaussian noise to "degrade" the signal, applying a deformation function that adds noise drawn from a zero-mean Gaussian distribution. We estimate the scaling based on the proposed and alternative methods. Then, we apply CCA to the two leading DM coordinates and the estimated source locations. The top correlation coefficients for various noise variances are presented in Fig. 18. Both the MaxMin method and Singer's scheme (Singer et al. 2009) seem to break at the same noise level. The standard deviation approach is robust to the noise level because it essentially performs whitening of the data; however, this also obscures some of the information content when the noise power is low. The proposed approach seems to outperform all alternative schemes in this test case.
Fig. 18

Highest correlation coefficient between the DM representation extracted using various scaling schemes. The x-axis corresponds to the variance of the additive Gaussian noise


Classification of seismic events

Automatic classification of seismic events is useful as it may reduce false alarms on the one hand, and enable monitoring of nuclear events on the other. To evaluate the proposed scaling for classification of seismic events, we use a set of 46 earthquakes and 62 explosions, all recorded in Israel. A low-dimensional mapping is extracted using DM with various scale values, and binary classification is applied using KNN and Support Vector Machine (SVM) classifiers in a leave-one-out fashion. The classification accuracy for each scale value is presented in Fig. 19, with the estimated scale values annotated. It is evident that for classification the estimated values are indeed close to the optimal values, although they do not fully coincide. Nevertheless, they all achieve high classification accuracy.
Fig. 19

Classification accuracy vs. the scale value. The proposed and existing scale estimates are annotated on the plots. Left panel: classification using KNN (K = 2). Right panel: classification using a Support Vector Machine (SVM)


Conclusions

The scaling parameter of the widely used Gaussian kernel is often crucial for machine learning algorithms. As happens in many tasks in the field, there does not seem to be one global scheme that is optimal for all applications. For this reason, we propose two new frameworks for setting a kernel’s scale parameter tailored for two specific tasks. The first approach is useful when the high-dimensional data points lie on some lower dimensional manifold. By exploiting the properties of the Gaussian kernel, we extract a vectorized scaling factor that provides a natural feature selection procedure. Theoretical justification and simulations on artificial data demonstrate the strength of the scheme over alternatives. The second approach could improve the performance of a wide range of kernel based classifiers. The capabilities of the proposed methods are demonstrated using artificial and real datasets. Finally, we present an application for the proposed approach that helps learn meaningful seismic parameters in an automated manner. In the future, we intend to generalize the approach for the multi-view setting recently studied in Lindenbaum et al. (2020), Salhov et al. (2019), Lederman and Talmon (2014).