
Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification.

Illia Horenko

Abstract

Entropic outlier sparsification (EOS) is proposed as a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS dwells on the derived analytic solution of the (weighted) expected loss minimization problem subject to Shannon entropy regularization. The identified closed-form solution is proven to impose additional costs that depend linearly on statistics size and are independent of data dimension. The obtained analytic results also explain why mixtures of spherically symmetric Gaussians (used heuristically in many popular data analysis algorithms) represent an optimal and least-biased choice for the nonparametric probability distributions when working with squared Euclidean distances. The performance of EOS is compared to a range of commonly used tools on synthetic problems and on partially mislabeled supervised classification problems from biomedicine. Applying EOS for coinference of data anomalies during learning is shown to allow reaching an accuracy of [Formula: see text] when predicting patient mortality after heart failure, statistically significantly outperforming the predictive performance of common learning tools for the same data.
Copyright © 2022 the Author(s). Published by PNAS.


Keywords:  entropy; mislabeling; outlier detection; regularization; sparsification

Year:  2022        PMID: 35197293      PMCID: PMC8917346          DOI: 10.1073/pnas.2119659119

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Detection of data anomalies, outliers, and mislabeling is a long-standing problem in statistics, machine learning (ML), and artificial intelligence (1–4). Let X = {x_1, …, x_T} be a fixed dataset (where the data instances x_t are possibly augmented with labels), let θ be a set of ML model parameters, and let g(x, θ) be a scalar-valued loss function measuring the misfit of the data instance x. Then, a wide class of learning methods and anomaly detection algorithms can be formulated as numerical procedures for the minimization of the functional

[1]  min over (θ, w) of Σ_{t=1}^T w(t) g(x_t, θ),

where w(t) is the outlyingness weight, taking values close to zero if the data point x_t is an anomaly (1, 5–7). If w and θ are both unknown, then the above problem [1] for the simultaneous estimation of model parameters and loss weights becomes ill posed. Common approaches deal with this ill-posedness by imposing additional parametric assumptions on w, for example, based on parametric thresholding of one-dimensional linear projections in Stahel–Donoho estimators, or by deploying other parametric tools (like χ²-distribution quantiles to determine outliers of a D-dimensional normal distribution) (1, 5, 7, 8). An appealing idea would be to make this ill-posed problem well posed in a nonparametric way, by regularizing it with one of the common regularization approaches. For example, applying l1 regularization could result in a sparsification of w, zeroing out the outlying data points from the estimation (9). However, applying l1 and other sparsification methods results in a polynomial cost scaling for the numerical solution of the resulting optimization problems and would limit the solution of [1] to relatively small problems (10).
The key message of this brief report is in showing that the simultaneous well-posed detection of anomalies and learning of the parameters θ in [1] can be achieved computationally very efficiently by means of the minimization of the expected loss from the right-hand side of [1], performed simultaneously with a regularized maximization of the Shannon entropy of the loss weight distribution w:

[2]  min over (θ, w) of Σ_{t=1}^T w(t) g(x_t, θ) + α Σ_{t=1}^T w(t) log w(t), subject to w(t) ≥ 0 and Σ_{t=1}^T w(t) = 1.

The following summarizes the properties of this problem's solutions: for any fixed α > 0 and θ such that all losses g(x_t, θ) are finite, the constrained minimization problem [2] admits a unique closed-form solution

[3]  w(t) = exp(−g(x_t, θ)/α) / Σ_{s=1}^T exp(−g(x_s, θ)/α).

The proof is provided in the SI Appendix. It is straightforward to validate that the numerical cost of computing [3] scales linearly in the statistics size T and is independent of the data dimension D, in contrast to common regularization techniques that require polynomial cost scaling in the data dimension D (10). If the loss function is a squared Euclidean distance (as in least-squares methods), then, according to the above result, the unique probability distributions w minimizing [2] are from the α-parametric family of spherically symmetric Gaussians, with the dimension-wise variance σ² = α/2. This result provides an interesting insight into density-based methods, for example, t-stochastic neighbor embedding (t-SNE) (11), one of the most popular nonlinear dimension reduction approaches in biomedicine (with over 20,000 citations according to Google Scholar). This method searches for optimal low-dimensional approximations of high-dimensional densities defined in a heuristic way as mixtures of spherically symmetric Gaussians,

[4]  w(t) = exp(−‖x_{t1} − x_{t2}‖²/(2σ²)) / Σ_{(s1,s2)} exp(−‖x_{s1} − x_{s2}‖²/(2σ²)),

with a multiindex t = (t1, t2).
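The closed-form weight update [3] is just a softmax of the negative losses scaled by α. A minimal Python sketch, with illustrative names of my own (the authors' reference implementation is the MATLAB code at the linked repository):

```python
import numpy as np

def eos_weights(losses, alpha):
    """Closed-form minimizer [3] of the entropy-regularized expected loss [2]:
    w(t) proportional to exp(-g(x_t, theta) / alpha), normalized over the simplex."""
    g = np.asarray(losses, dtype=float)
    # Subtracting the minimum loss leaves the normalized weights unchanged
    # but avoids numerical underflow for large losses.
    z = np.exp(-(g - g.min()) / alpha)
    return z / z.sum()
```

Computing these weights touches each of the T losses exactly once, which is the linear-in-T, dimension-independent cost noted above; a data point with a large loss (an outlier) receives an exponentially small weight, and a larger α pushes w toward the maximum-entropy uniform distribution.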
According to the above result, this heuristic, which builds the computational foundation of t-SNE, is actually equivalent to the optimal nonparametric density estimate [3], in the sense that it simultaneously minimizes the expectation of the pairwise squared Euclidean distances between the data points (when considering the loss function g(x_t) = ‖x_{t1} − x_{t2}‖² with the multiindex t = (t1, t2) in [2]) and maximizes the entropy of w (i.e., provides the least-biased estimation), and it is obtained with an explicitly computable closed-form expression. Furthermore, solution [3] also provides a recipe for computing such t-SNE density estimates in cases with non-Euclidean loss functions g. It is straightforward to verify that the simultaneous learning of the parameters θ and the probability densities w can be performed with the monotonically convergent entropic outlier sparsification (EOS) algorithm for the solution of optimization problem [2]: for a given dataset X, loss function g, and sparsification parameter α, randomly choose an initial w; then, while the change of [2] exceeds a tolerance tol, alternate a θ-step (θ ← solution of [2] for fixed w) with a w-step (w ← evaluation of [3] for fixed θ). Eq. 4 establishes a relation between the Gaussian variance parameter σ² and the entropic sparsification parameter α in [3], indicating a possibility of inferring the optimal sparsification parameter for the given data. For example, the optimal σ² in [4], and hence the optimal sparsification parameter value α = 2σ², can be obtained by maximizing the log-likelihood of the distribution [4] with respect to the parameter α. In the practical examples of EOS below, we will follow a simpler grid search approach for selecting the optimal sparsification parameter α, deploying the same multiple cross-validation procedure that is commonly used for determining metaparameter values in AI and ML: on a predefined grid of α values, we will select those values that show the best overall model performance on validation data that were not used in the model training. Fig.
1 summarizes numerical experiments comparing EOS to common data anomaly detection and learning tools on randomly generated synthetic datasets (representing multivariate normal distributions with asymmetrically positioned, uniformly distributed outliers; Fig. 1) and on three biomedical datasets with various proportions of randomly mislabeled data instances in the training sets (Fig. 1). All of the compared algorithms are provided with the same information and run on the same hardware and software; 50 cross-validations were performed in every experiment to obtain the visualized 95% CIs. In the numerical experiments with synthetic data (Fig. 1), the EOS algorithm is deployed with g being the negative point-wise multivariate Gaussian log-likelihood, that is, g(x; μ, Σ) = −log N(x; μ, Σ), where μ and Σ are the Gaussian mean and covariance, respectively. Iterative estimation of the weighted mean and covariance in the θ-step of the EOS algorithm is performed using the analytical estimates of the weighted Gaussian mean and covariance, and the convergence tolerance tol is set to a small fixed value. Total computational costs and statistical precisions (the latter measured as the number of correctly identified points not belonging to the Gaussian distribution divided by the total number of identified outliers) are compared for various problem dimensions, statistics sizes, and outlier proportions. EOS was compared to all of the outlier detection methods available in the MathWorks Statistics and Machine Learning Toolbox. Precision is chosen as the measure of performance here since it is more robust than other common measures when the datasets are not balanced, for example, when the number of instances in one class (outliers) is much smaller than in the other class (nonoutliers). These results show that EOS allows a marked and robust improvement of outlier detection precision in all of the considered comparison cases. Data and MATLAB code are provided at https://github.com/horenkoi/EOS.
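A compact Python sketch of this alternating EOS iteration for the Gaussian setting follows. This is a hedged reconstruction of the scheme described above, not the authors' MATLAB code; the uniform initialization, the small covariance jitter, and all names are my own assumptions:

```python
import numpy as np

def eos_weights(losses, alpha):
    """Closed-form w-step [3]: softmax of negative losses scaled by alpha."""
    z = np.exp(-(losses - losses.min()) / alpha)
    return z / z.sum()

def eos_gaussian(X, alpha, tol=1e-10, max_iter=100):
    """Alternate the theta-step (analytic weighted Gaussian mean/covariance)
    with the w-step (closed-form entropic weights) until the expected loss
    stops changing by more than tol."""
    T, D = X.shape
    w = np.full(T, 1.0 / T)          # maximum-entropy initial weights
    prev = np.inf
    for _ in range(max_iter):
        # theta-step: analytic weighted Gaussian estimates (weights sum to 1)
        mu = w @ X
        Xc = X - mu
        Sigma = (w[:, None] * Xc).T @ Xc + 1e-9 * np.eye(D)  # jitter for stability
        # pointwise negative Gaussian log-likelihood as the loss g
        inv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        quad = np.einsum('td,de,te->t', Xc, inv, Xc)
        g = 0.5 * (quad + logdet + D * np.log(2.0 * np.pi))
        # w-step: closed-form solution [3]
        w = eos_weights(g, alpha)
        cur = w @ g                  # current expected loss
        if abs(prev - cur) < tol:
            break
        prev = cur
    return mu, Sigma, w
```

On data with a gross outlier, the outlier's loss grows with each refinement of (μ, Σ), so its weight collapses toward zero and the mean and covariance estimates converge to those of the inliers.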
Fig. 1.

Comparison of EOS algorithm for the solution of optimization problem [2] to common methods of data anomaly detection (A–F) and supervised classifier learning (G–I) on synthetic and real data examples from refs. 12–14.

Next, real labeled datasets from biomedicine are considered, including two popular datasets, the University of Wisconsin Database for Breast Cancer diagnostics data (12) and the clinical dataset for predicting mortality after heart failure (13), as well as a single-cell messenger RNA gene expression dataset from longevity research (14) (Fig. 1). The main focus here is on comparing the robustness of learning methods to outliers and mislabeled data instances in the training set, for common binary classifiers and for EOS equipped with the loss function g from the scalable probabilistic approximation (SPA) classifier algorithm (15, 16). SPA is selected since it shows the highest robustness to mislabeling for all of the considered datasets (Fig. 1). As can be seen from Fig. 1, EOS with g from SPA (EOS+SPA, red dashed lines) allows a statistically significant improvement of prediction performance (measured with the common performance measure area under the curve [AUC]) for all of the tested mislabeling proportions p and for all of the considered biomedical examples. As was shown recently, coinference of data mislabelings can significantly improve the predictive performance of supervised classifiers (17). Application of the EOS algorithm with the model loss function from SPA (EOS+SPA) allows achieving an AUC of 0.96 and an accuracy of () when predicting patients' mortality after heart failure from clinical patients' data, statistically significantly outperforming common learning tools that do not deploy outlier coinference (Fig. 1). EOS and the entropic sparsification [2] can also be applied to other types of learning problems, for example, to feature selection and novelty detection problems.
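The cross-validated grid search for the sparsification parameter α described earlier can be sketched generically as follows; here `fit` and `score` are hypothetical placeholders for the user's model-fitting routine and performance measure (e.g., AUC on validation data), and `splits` is a list of (train, validation) pairs:

```python
import numpy as np

def select_alpha(alphas, splits, fit, score):
    """Grid search over alpha: for each candidate, train on every split's
    training part, average the validation scores, and return the alpha
    with the best mean validation performance."""
    mean_scores = [
        np.mean([score(fit(train, a), val) for train, val in splits])
        for a in alphas
    ]
    return alphas[int(np.argmax(mean_scores))]
```

This mirrors the standard metaparameter selection used throughout AI and ML: the chosen α is the one that generalizes best to data not used in training, rather than the one with the best training fit.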
  5 in total

1.  On a Scalable Entropic Breaching of the Overfitting Barrier for Small Data Problems in Machine Learning.

Authors:  Illia Horenko
Journal:  Neural Comput       Date:  2020-06-10       Impact factor: 2.026

2.  Translational Regulation of Non-autonomous Mitochondrial Stress Response Promotes Longevity.

Authors:  Jianfeng Lan; Jarod A Rollins; Xiao Zang; Di Wu; Lina Zou; Zi Wang; Chang Ye; Zixing Wu; Pankaj Kapahi; Aric N Rogers; Di Chen
Journal:  Cell Rep       Date:  2019-07-23       Impact factor: 9.423

3.  Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone.

Authors:  Davide Chicco; Giuseppe Jurman
Journal:  BMC Med Inform Decis Mak       Date:  2020-02-03       Impact factor: 2.796

4.  Co-Inference of Data Mislabelings Reveals Improved Models in Genomics and Breast Cancer Diagnostics.

Authors:  Susanne Gerber; Lukas Pospisil; Stanislav Sys; Charlotte Hewel; Ali Torkamani; Illia Horenko
Journal:  Front Artif Intell       Date:  2022-01-05

5.  Low-cost scalable discretization, prediction, and feature selection for complex systems.

Authors:  S Gerber; L Pospisil; M Navandar; I Horenko
Journal:  Sci Adv       Date:  2020-01-29       Impact factor: 14.136

