Sung-Cheol Kim1, Adith S Arun1, Mehmet Eren Ahsen2, Robert Vogel1, Gustavo Stolovitzky3. 1. IBM Research, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598. 2. Gies College of Business, University of Illinois at Urbana-Champaign, Urbana, IL 61820. 3. IBM Research, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598; gustavo.stolo@gmail.com.
Abstract
Binary classification is one of the central problems in machine-learning research and, as such, investigations of its general statistical properties are of interest. We studied the ranking statistics of items in binary classification problems and observed that there is a formal and surprising relationship between the probability of a sample belonging to one of the two classes and the Fermi-Dirac distribution determining the probability that a fermion occupies a given single-particle quantum state in a physical system of noninteracting fermions. Using this equivalence, it is possible to compute a calibrated probabilistic output for binary classifiers. We show that the area under the receiver operating characteristic curve (AUC) in a classification problem is related to the temperature of an equivalent physical system. In a similar manner, the optimal decision threshold between the two classes is associated with the chemical potential of an equivalent physical system. Using our framework, we also derive a closed-form expression to calculate the variance for the AUC of a classifier. Finally, we introduce FiDEL (Fermi-Dirac-based ensemble learning), an ensemble learning algorithm that uses the calibrated nature of the classifier's output probability to combine possibly very different classifiers.
Binary classification is the task of predicting the binary categorical label of each item in a set of items that belong to one of two categories (1). Typically, this prediction is made using a function, known as a classifier, which learns from examples taken from a training dataset containing items of both classes of interest. This classifier is subsequently used to predict the labels of previously unseen items contained in a new dataset.

Binary classification has a remarkably broad range of applications in fields such as biomedicine (2), economics (3), finance (4), astronomy (5), advertisement (6), and manufacturing (7). Problems addressed in these areas with greater or lesser success include predicting the antibacterial activity of molecules (8), diagnosing breast cancer from mammography studies (9), detecting skin cancer from dermoscopy images (10), predicting Alzheimer's disease onset from linguistic markers (11), classifying hand gestures from wearable device signals (12), identifying lunar craters from images (13), deciding whether a judge should have defendants wait for trial under bail at home or in jail (14), choosing whether to approve a loan to a client (15), predicting corporate financial distress (16), and determining whether a given semiconductor manufacturing process will lead to a faulty product (17).

This great diversity of applications has spurred a considerable amount of work devoted to the development of classification methods. Despite substantial theoretical progress that has led to increased predictive power, there is concern that methods optimized under narrow theoretical contexts may not generalize well (18) and that the emphasis of research on prediction models should perhaps shift to other issues, such as model interpretation and independent validation (19).
Accordingly, in this paper we address four generic problems arising in any classification task: 1) We develop a calibrated probabilistic interpretation of the output of a classification pipeline, independent of the classification method used; 2) we show how to use this probabilistic interpretation to optimally choose a threshold that separates the predicted classes; 3) we introduce an analytical way to compute the confidence interval of the most popular classification performance metric (the area under the receiver operating characteristic curve [AUC]), which uses only the available information rather than ad hoc hypotheses about the classifier; and 4) we address the issue of performance generalization by developing an ensemble approach that, rather than relying on the generalization ability of any individual method, leverages the ability of many methods to compensate for each other's deficiencies and achieve a performance that is often better than that of the best method in the ensemble. To achieve these objectives we advance an unexpected equivalence between the probabilistic description of fermions in quantum statistical mechanics and the probability of correct classification of items in typical classification problems. The validity of the probabilistic aspects of fermionic systems under general conditions renders the equivalent results in the world of classification quite robust and independent of any individual classification method.

In a binary classification problem, the two classes to be predicted are usually denoted the negative and the positive class. While this distinction is arbitrary, the positive class is usually chosen as the class that is more costly to misclassify. For example, in cancer screening, failing to detect patients with cancer is more costly than failing to detect disease-free subjects, and therefore cancer patients are usually assigned to the positive class. We use this convention in this paper.
A binary classifier typically assigns a score to a given item that can be a proxy for the confidence assigned by the classifier that the item belongs to the positive class or for the cost of misclassification of that sample. In general, the probability density of these scores will depend on the class and is known as the class-conditioned score density, which is not known a priori and is different for each classifier. Some authors have proposed fitting Gaussian distributions to the scores resulting from specific classifiers (20), but better results have been obtained by allowing more flexibility in fitting class-conditioned densities (21). The scores output by different classifiers can be binarized using a decision threshold in such a way that samples with scores above/below that threshold are assigned to the positive/negative class. The optimal decision threshold will depend on the class-conditioned score density and is usually chosen empirically or learned during training.

The class-conditioned density can be used to compute the posterior probability of the class given the score assigned to an item. Having a well-calibrated probability that a given item belongs to the positive class can be very useful, for example, when the result of a classifier must be merged with other classifiers in an ensemble (22, 23) or when the probability of the class assignment of a classifier needs to be combined with other probabilistic elements into a complex decision. However, most classifiers produce a score with unknown probability density from which a posterior class probability cannot be recovered. Some authors have proposed to train the posterior probability simultaneously with the classifier (24), and others have proposed to train the parameters of a logistic function that depends on the score of a previously trained classifier (25). In these methods, as the probability is trained on the training set, there is some risk of overfitting the probabilistic output.
The alternative of keeping a holdout set and using cross-validation is relatively successful (25). These strategies are feasible only if there is a sufficient amount of labeled data for the problem at hand. In cases where the amount of labeled data is not enough to train both the classifier and the posterior probability, other methods are desirable.

One of the most popular metrics to measure the performance of a binary classifier is the AUC (26). To calculate the AUC of a classifier, it is necessary to rank the items using the scores assigned to each item by the classifier. Thus, the AUC of a classifier is invariant with respect to any monotonic transformation of the scores. It follows that what matters in calculating the AUC is the ranking of an item relative to other items rather than the actual score assigned to it. When the number of items in the test set increases, the AUC of a classifier asymptotically approaches the probability that the classifier assigns a randomly chosen positive sample a higher score than a randomly chosen negative sample (27). The AUC is a threshold-independent way of measuring the performance of a classifier. To predict whether items are positive or negative we need to choose a decision threshold and assign items with scores above this threshold to the positive class and items below it to the negative class. In these cases the balanced accuracy, defined as the average of the sensitivity and specificity, is a popular metric for model evaluation.

Assuming that a classifier is better than random, ranking the $N$ classified items from higher to lower scores will lead to positive samples predominantly having low ranks (1, 2, ...) and negative samples tending to have high ranks (..., N − 2, N − 1, N). Therefore, we can ask: What is the probability that the item at rank $r$ is in the positive class?
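Before moving on, the asymptotic pairwise interpretation of the AUC mentioned above is easy to verify numerically. The sketch below assumes Gaussian class-conditioned score densities with unit variance and a separation of 1.81 (an illustrative choice giving an AUC of about 0.9, not a setting taken from the text):

```python
# Monte Carlo check that the AUC approaches the probability that a random
# positive item outscores a random negative item. Gaussian score densities
# with a separation of 1.81 (AUC ~ 0.9) are an illustrative assumption.
from math import erf

import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(1.81, 1.0, 5000)   # positive-class scores
neg = rng.normal(0.0, 1.0, 5000)    # negative-class scores

# Fraction of (positive, negative) pairs in which the positive item wins
auc_mc = (pos[:, None] > neg[None, :]).mean()

# Closed form for two unit-variance Gaussians separated by d: Phi(d / sqrt(2)),
# written here via the error function: Phi(d / sqrt(2)) = (1 + erf(d / 2)) / 2
auc_theory = 0.5 * (1.0 + erf(1.81 / 2.0))
print(auc_mc, auc_theory)
```

Both numbers come out near 0.9, which is the regime used in the simulations of Fig. 1.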
In this paper, we show that this probability can be mapped to the probability that a fermion (a quantum particle of half-integer spin such as an electron) occupies a given single-particle quantum state in a physical system of independent fermions (28, 29). This probability is known as the Fermi–Dirac (FD) distribution in quantum statistical physics and is used in fields such as atomic physics (30), solid-state physics (31) (e.g., the transport properties of electrons in metals), and astrophysics (32) (e.g., the physics of white dwarf stars).

We explore the application of FD statistics to machine learning in the context of binary classification problems. Using FD statistics, we show that the optimal rank threshold below which items are more likely to be positive and above which items are more likely to be negative is the same threshold at which the balanced accuracy of the classifier is maximal, and that it is related to the chemical potential in the FD distribution. We also use the FD distribution to derive a closed-form expression for the variance of the AUC of a classifier that is independent of the distribution of scores assigned by the classifier. This variance is necessary to assign confidence intervals to the AUC and to estimate sample size in power analysis. Finally, we introduce FiDEL (Fermi–Dirac-based ensemble learning), an ensemble learning algorithm based on the FD distribution that uses the calibrated probability assigned to different base classifiers to combine them into a new ensemble classifier. FiDEL uses only the AUC of the base classifiers and the fraction of positive examples in the problem, both of which can be estimated from the training set.
The Fermi–Dirac Distribution in Binary Classification
The FD distribution describes the probability that a fermion occupies a single-particle quantum state in a fermionic system, e.g., the probability that an electron occupies a certain atomic level in an atom. Fermions obey the Pauli exclusion principle: if $n_i$ represents the number of fermions in quantum state $i$, then $n_i$ can only be 1 or 0. The probability that quantum state $i$ is occupied is then equal to the average occupation number $\langle n_i\rangle$. If the fermionic system is in thermodynamic equilibrium with a thermal bath, then the probability that quantum state $i$, assumed to have an energy $\epsilon_i$, is occupied follows the FD distribution
$$\langle n_i\rangle = \frac{1}{e^{(\epsilon_i-\mu)/k_B T}+1},$$
where $k_B$ is the Boltzmann constant, $T$ is the absolute temperature of the thermal bath, and $\mu$ is a temperature-dependent chemical potential. The FD distribution can be derived by maximizing the entropy of the system in the microcanonical ensemble of statistical physics under the constraints that the number of fermions $N_f$ and the total energy $E$ of the system are known (33):
$$\sum_{i=1}^{M}\langle n_i\rangle = N_f, \qquad \sum_{i=1}^{M}\epsilon_i\langle n_i\rangle = E,$$
where $M$ is the number of quantum states available to the fermions. We assume that $M$ is finite, which is a good approximation in many physical systems in which states above some energy are essentially never occupied. For the purposes of this paper, quantum states refer to single-fermion quantum states, and the fermions in our system are noninteracting.

We next present a conceptual parallel between certain statistical properties of binary classification problems and FD statistics. Let us define the ensemble of test sets to be the set (ensemble) of datasets with $N$ items of which exactly $N_1$ are in the positive class. Fig. 1 depicts test sets in the ensemble as Venn diagrams with $N$ items, each characterized by a feature vector and its class. Let us consider a classifier that assigns a score to each item in one of the test sets, as shown in Fig. 1.
This score is typically a measure of the confidence assigned by the classifier that the item belongs to the positive class and can then be used to rank samples from the most likely to belong to the positive class (rank $r=1$) to the least likely (rank $r=N$). The class $c_r$ of the item at rank $r$ can be either 1 (positive) or 0 (negative). The binary nature of the classification of each item suggests that the class of an item at a given rank can be mapped to the binary occupation number of a quantum state in the fermionic system. In this mapping, the ranks in the classification problem are the equivalent of the quantum states in the FD problem; class 1 items act as fermions and obey an exclusion principle, in that only one item of class 1 can be ranked at any given rank. The classifier can place either a positive- or a negative-class item at rank $r$ (Fig. 1). However, for each realization of the test set in the ensemble, the constraint $\sum_{r=1}^{N} c_r = N_1$ must hold. Let us call $\langle c_r\rangle$ the average class of items that the classifier ranked at rank $r$ over all possible test sets in the ensemble. Given that the previous constraint holds true for each realization, it will also be true on average in the ensemble; that is,
$$\sum_{r=1}^{N}\langle c_r\rangle = N_1.$$

We next discuss the mapping to the classification problem of the energy level of the $i$th quantum state. To do this we note that in any given realization of a test set in the ensemble, the average rank of positive-class samples can be expressed as $\bar r_+ = \frac{1}{N_1}\sum_{r=1}^{N} r\,c_r$. Calling $\langle \bar r_+\rangle$ the average rank of class 1 items over all possible test sets in the ensemble, and using that $\langle \bar r_+\rangle = (N_1+1)/2 + N_2(1-\mathrm{AUC})$, where AUC denotes the average AUC of the classifier over the ensemble, we find that
$$\sum_{r=1}^{N} r\,\langle c_r\rangle = N_1\left[\frac{N_1+1}{2}+N_2(1-\mathrm{AUC})\right].$$
Comparing these two constraints with their fermionic counterparts, we can postulate a formal mapping of the quantities from the fermionic system to the classification problem: quantum state $i$ maps to rank $r$, occupation number $n_i$ to class $c_r$, energy $\epsilon_i$ to rank $r$, the number of fermions $N_f$ to the number of positives $N_1$, and the number of states $M$ to the number of items $N$. Given that $c_r$ takes only the values 0 and 1, its ensemble average $\langle c_r\rangle$ is equal to the probability $P(c=1\mid r)$ that an item is in the positive class given that it was ranked at rank $r$.
Under these conditions, and from the fact that the FD distribution follows from the second principle of thermodynamics along with Jaynes' insight (34) that the maximum-entropy principle in statistical mechanics is nothing but the maximization of the uncertainty about our unknowns, we conclude that the maximum-entropy rank-conditioned class probability in the classification problem is given by the FD distribution with the appropriately mapped quantities:
$$P(c=1\mid r)=\frac{1}{e^{\beta(r-\mu)}+1},$$
where $\beta$ (the equivalent of the inverse temperature) and $\mu$ (the equivalent of the chemical potential) are chosen to satisfy the two constraints above from the known $N_1$ and AUC of the classifier. Fig. 1B shows the result of plotting $P(c=1\mid r)$ (dots) for an ensemble with $N_1 = N_2 = 50$ and an AUC of 0.9, together with the fitted FD distribution (red dashed line), which follows the empirically simulated distribution remarkably well.
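As a concrete illustration, the two constraints can be solved numerically for $\beta$ and $\mu$. The sketch below uses the Fig. 1 setting ($N=100$, $N_1=50$, AUC = 0.9) and the mean-rank identity $\langle\bar r_+\rangle = (N_1+1)/2 + N_2(1-\mathrm{AUC})$; `scipy.optimize.fsolve` is an implementation choice, not the authors' method:

```python
# Fit the Fermi-Dirac form of the rank-conditioned class probability,
#   P(c = 1 | r) = 1 / (1 + exp(beta * (r - mu))),
# from the two constraints in the text: the expected number of positives
# and the expected mean rank of the positive class.
import numpy as np
from scipy.optimize import fsolve

N, N1, auc = 100, 50, 0.9   # the setting of Fig. 1
N2 = N - N1
r = np.arange(1, N + 1)

def fd(r, beta, mu):
    return 1.0 / (1.0 + np.exp(np.clip(beta * (r - mu), -50, 50)))

def constraints(params):
    beta, mu = params
    f = fd(r, beta, mu)
    return [f.sum() - N1,                                           # sum_r P(1|r) = N1
            (r * f).sum() - N1 * ((N1 + 1) / 2 + N2 * (1 - auc))]   # mean-rank constraint

beta, mu = fsolve(constraints, x0=[0.1, N / 2])
print(beta, mu)   # mu ~ 50.5 by the N1 = N2 symmetry; beta near the fitted 0.0759
```

The fitted $\beta$ comes out close to the value 0.0759 quoted in the Fig. 1 caption, even though here it is obtained from the constraints rather than from a curve fit.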
Fig. 1.
(A) Test sets are sampled from an ensemble. Each test set consists of $N$ items of which exactly $N_1$ are in the positive class. Applying the classifier to a test set endows each item with a score. The items are then ranked in decreasing order of scores. If the item ranked at rank $r$ has class $c$, then we assign label $c$ to rank $r$, and we keep a tally of the number of times rank $r$ was assigned label 1 across the test sets. The rank-conditioned positive class probability is the frequency with which items in the positive class were ranked at rank $r$ in the test sets. (B) Comparison between the Fermi–Dirac distribution and the rank-conditioned positive class probability for simulated data. An ensemble of test sets was simulated. The class-conditioned score density of the classifier was simulated with Gaussian density functions for the negative and positive classes, corresponding to a classifier with an AUC of 0.9. Each test set had 50 items from the positive class and 50 items from the negative class ($N=100$). For each of the test sets, the items were processed according to A. The resulting frequency of positive labels for each rank is plotted and compared with the fitted FD distribution, with parameters $\beta = 0.0759$, $\mu = 50$. The Pearson correlation between the FD distribution and the rank-conditioned positive class probability is 0.99.
To recap, the FD distribution for a physical system follows from the second principle of thermodynamics (maximum entropy) under the constraints that the energy and the number of fermions of the system are known. Because of the mapping between the binary classification problem and the fermionic system, we can think of the FD distribution as the maximum-entropy estimate of the rank-conditioned class probability with the appropriately mapped constraints.
The rank-conditioned class probability can also be derived directly from these constraints and the maximum-entropy principle without invoking a mapping between the classification problem and the fermionic system. However, we believe that this mapping provides a fruitful analogy for interpreting the parameters $\beta$ and $\mu$, as we will see in the next section. It should be clear from this discussion that we are not claiming that the rank-conditioned class probability is the FD distribution. Rather, we claim that the FD distribution is the distribution that makes the fewest assumptions: it maximizes our uncertainty about the information we do not have while taking into account the information encoded in the aforementioned constraints. If more information were available, for example, if we knew the class-conditioned score density of the classifier, then a more precise distribution could be derived. However, the FD distribution provides an excellent approximation to the posterior probability of binary classifiers, as shown in the following sections, where we introduce multiple applications of this approach.
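The quality of this approximation can be checked by repeating the simulation protocol of Fig. 1: build many test sets, tally how often each rank holds a positive item, and fit the FD form to the resulting frequencies. Ensemble size, seed, and Gaussian score parameters below are illustrative choices:

```python
# Empirical check of the mapping (cf. Fig. 1): simulate an ensemble of test
# sets with Gaussian class-conditioned score densities (AUC ~ 0.9), tally the
# frequency of positive items at each rank, and fit the Fermi-Dirac form.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n1 = n2 = 50
n = n1 + n2
n_sets = 2000
counts = np.zeros(n)

for _ in range(n_sets):
    scores = np.concatenate([rng.normal(1.81, 1.0, n1),   # positive class
                             rng.normal(0.0, 1.0, n2)])   # negative class
    labels = np.concatenate([np.ones(n1), np.zeros(n2)])
    order = np.argsort(-scores)          # rank 1 = highest score
    counts += labels[order]

freq = counts / n_sets                   # empirical P(c = 1 | rank)
ranks = np.arange(1, n + 1)

def fd(r, beta, mu):
    return 1.0 / (1.0 + np.exp(np.clip(beta * (r - mu), -50, 50)))

(beta, mu), _ = curve_fit(fd, ranks, freq, p0=[0.1, n / 2],
                          bounds=([0.0, 0.0], [2.0, float(n)]))
corr, _ = pearsonr(fd(ranks, beta, mu), freq)
print(beta, mu, corr)                    # correlation near 0.99, as in Fig. 1
```

The fitted curve tracks the empirical frequencies closely, mirroring the agreement reported in the Fig. 1 caption.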
The Temperature and Chemical Potential in Binary Classification
Next, we discuss the interpretation of the temperature and the chemical potential in the context of binary classification. In a fermionic system, as the temperature approaches 0, all fermions occupy the quantum states with the lowest possible energies allowed by the exclusion principle, filling all states up to the value of the chemical potential at $T=0$, a quantity known as the Fermi energy. At the other temperature extreme, when $T\to\infty$, all quantum states are equally probable and the average occupation number is $N_f/M$. In the classification problem, the parameter $\beta$ is mapped to the inverse temperature $1/k_B T$ of the physical system. As $\beta\to\infty$ the FD distribution becomes a step function, equal to 1 for ranks less than or equal to $\mu$ and 0 otherwise. This corresponds to a perfect classifier with an AUC of 1; note that, in this case, the chemical potential is equal to $N_1$. When $\beta$ decreases (i.e., the temperature increases), the probability that an item at rank $r$ is of class 1 becomes a smooth logistic function of $r$, which reflects an imperfect classification with an AUC between 0.5 and 1. For $\beta = 0$ (i.e., infinite temperature), $P(c=1\mid r)=N_1/N$ independently of $r$, which corresponds to a random classifier. The above discussion suggests that the temperature in a fermionic system maps to classification errors. At finite temperature there is no clear-cut energy threshold below which energy states are occupied by fermions and above which the states are unoccupied. In the classification problem, this means that we do not have a clear-cut threshold rank below which we will find only class 1 items and above which we will find only class 0 items. We show later that at finite temperature the optimal threshold in the classification problem is related to the chemical potential in the physical system.

The parameters $\beta$ and $\mu$ are computed from the two constraints introduced earlier and in general will depend on $N$, $N_1$, and the AUC.
However, for sufficiently large $N$, these parameters can be rescaled as $\beta' = N\beta$ and $\mu' = \mu/N$, and the rescaled parameters can be computed numerically from knowledge of the AUC and $\rho$ only, where $\rho = N_1/N$ is the fraction of class 1 items, often called the prevalence. Fig. 2 shows the dependence of $\beta'$ (Fig. 2A) and $\mu'$ (Fig. 2B) as functions of the AUC and $\rho$. While a general analytical expression for $\beta'$ and $\mu'$ in terms of the AUC and $\rho$ does not exist, the particle-number constraint in the continuum limit can be solved to express $\mu'$ as a function of $\beta'$ and $\rho$:
$$\mu' = \frac{1}{\beta'}\ln\frac{e^{\beta'(1-\rho)}-e^{\beta'}}{1-e^{\beta'(1-\rho)}}.$$
It is also possible to find explicit expressions for special cases. For weak classifiers, i.e., when $\mathrm{AUC}\to 1/2$, one can show that
$$\beta' \approx 12\left(\mathrm{AUC}-\tfrac{1}{2}\right).$$
Another approximate expression can be found in the limit of perfect classifiers ($\mathrm{AUC}\to 1$),
$$\beta' \approx \frac{\pi}{\sqrt{6\,\rho(1-\rho)(1-\mathrm{AUC})}}.$$
Finally, if $\rho = 1/2$, then $\mu' = 1/2$ for all values of $\beta'$. Beyond the special cases discussed above, there are symmetries in the dependence of $\beta'$ and $\mu'$ as functions of the AUC and $\rho$ that must hold for all AUC and $\rho$.
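The rescaling can be checked numerically under the constraints defined earlier. The sketch below assumes the rescaling $\beta' = N\beta$, $\mu' = \mu/N$ and solves the constraints for several values of $N$ at fixed (AUC, $\rho$); `fsolve` and the specific parameter values are illustrative choices:

```python
# For large N the Fermi-Dirac parameters rescale: beta' = N * beta and
# mu' = mu / N depend (approximately) only on the AUC and the prevalence
# rho = N1 / N. Solve the two constraints for several N and compare.
import numpy as np
from scipy.optimize import fsolve

def fit_fd(N, rho, auc):
    N1 = int(round(rho * N))
    N2 = N - N1
    r = np.arange(1, N + 1)
    def eqs(p):
        beta, mu = p
        f = 1.0 / (1.0 + np.exp(np.clip(beta * (r - mu), -50, 50)))
        return [f.sum() - N1,
                (r * f).sum() - N1 * ((N1 + 1) / 2 + N2 * (1 - auc))]
    beta, mu = fsolve(eqs, x0=[10.0 / N, rho * N])
    return beta * N, mu / N              # rescaled coefficients

for N in (200, 400, 800):
    bp, mp = fit_fd(N, rho=0.3, auc=0.8)
    print(N, bp, mp)                     # roughly constant across N
```

The printed rescaled coefficients vary only at the level of finite-size corrections, consistent with the claim that $\beta'$ and $\mu'$ are functions of the AUC and $\rho$ alone.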
Fig. 2.
The rescaled coefficients $\beta'$ and $\mu'$ of the Fermi–Dirac distribution are determined from the values of the AUC and the prevalence $\rho$. (A) Dependence of $\beta'$ on the AUC and the prevalence $\rho$. (B) Dependence of $\mu'$ on the AUC and the prevalence $\rho$. Here, $\beta'$ and $\mu'$ were calculated as discussed in the text.
Choosing Thresholds in Binary Classification
To assign class labels to each sample in a test set we must choose a decision threshold. In practical applications, this threshold is typically learned from a training set. But if we know the rank-conditioned class probability, it is possible to relate the rank threshold below/above which items are assigned to the positive/negative class to the parameters $\beta$ and $\mu$ of the corresponding FD distribution.

This threshold can be chosen as the rank at which the class-conditioned rank probability of an item being at rank $r$ is the same for the positive and negative classes. We define the log-likelihood ratio as
$$\Lambda(r) = \ln\frac{P(r\mid c=1)}{P(r\mid c=0)} = \ln\frac{P(c=1\mid r)\,(1-\rho)}{\left[1-P(c=1\mid r)\right]\rho},$$
where in the second equality we applied Bayes' theorem to express the class-conditioned rank probability in terms of the posterior rank-conditioned class probability and used that $P(c=0\mid r)=1-P(c=1\mid r)$. Using the FD distribution as the rank-conditioned class probability, we find that
$$\Lambda(r) = \beta(\mu-r) + \ln\frac{1-\rho}{\rho}.$$
Hence, the optimal rank threshold $r^*$ can be computed as the rank that makes $\Lambda(r^*)=0$:
$$r^* = \mu + \frac{1}{\beta}\ln\frac{1-\rho}{\rho}.$$
From the previous section, this means that the only information needed to determine the optimal threshold is the AUC and the prevalence $\rho$, which can be learned from the training set.

It is also possible to find a threshold that strikes a compromise between the sensitivity and the specificity of a classifier. For a ranked list, a popular way to do this is to find the rank that maximizes the balanced accuracy, defined as the average of the true positive rate (TPR) and the specificity, $1-\mathrm{FPR}$ (where FPR denotes the false positive rate). For a given instance of the test set, these metrics can be expressed as
$$\mathrm{TPR}(t) = \frac{1}{N_1}\sum_{r=1}^{t} c_r, \qquad \mathrm{FPR}(t) = \frac{1}{N_2}\sum_{r=1}^{t}(1-c_r).$$
Taking the average of the previous equations in the ensemble, we can express the average balanced accuracy in terms of the posterior class distribution as
$$\langle \mathrm{BA}(t)\rangle = \frac{1}{2}\left[\frac{1}{N_1}\sum_{r=1}^{t}P(c=1\mid r) + 1 - \frac{1}{N_2}\sum_{r=1}^{t}\left(1-P(c=1\mid r)\right)\right].$$
The next step in finding the optimal threshold is to choose the argument $t$ that maximizes $\langle \mathrm{BA}(t)\rangle$.
We approximate this step by treating $t$ as a continuous variable and finding the value of $t$ that makes the derivative of $\langle \mathrm{BA}(t)\rangle$ zero; that is,
$$\frac{d\langle \mathrm{BA}(t)\rangle}{dt} = \frac{1}{2}\left[\frac{P(c=1\mid t)}{N_1} - \frac{1-P(c=1\mid t)}{N_2}\right] = 0.$$
Assuming $N\gg 1$ so that the discrete sums can be approximated by integrals, this condition yields $P(c=1\mid t^*)=\rho$, which for the FD distribution gives
$$t^* = \mu + \frac{1}{\beta}\ln\frac{1-\rho}{\rho}.$$
To ascertain that the resulting $t^*$ is a maximum, we need to verify that the second derivative of $\langle \mathrm{BA}(t)\rangle$ is negative at $t^*$. Using the FD expression for $P(c=1\mid t)$, we find that the second derivative is proportional to $-\beta\,P(c=1\mid t^*)\left[1-P(c=1\mid t^*)\right]$, which is always negative when the classifier is better than random, i.e., $\mathrm{AUC}>1/2$, as $\beta$ is positive in those cases (Fig. 2). Interestingly, this yields the same result as the earlier calculation requiring that the log-likelihood ratio $\Lambda(r^*)=0$. In other words, the threshold that makes the log ratio of class-conditioned rank distributions zero is also the one that maximizes the balanced accuracy.

To exemplify and verify the calculations described in this section, we use simulation experiments based on classifiers with Gaussian class-conditioned score densities. The simulations consist of 45 realizations of test sets, one for each combination of nine prevalence values and five AUC values. Fig. 3A, in which each point represents a different combination of prevalence and AUC, shows that the threshold calculated from the closed-form FD expression, based on the requirement that $\Lambda(r^*)=0$, is an excellent approximation of the empirical threshold computed by finding the rank that maximizes the balanced accuracy by scanning through all possible thresholds for each realization of the test set. The empirical threshold that maximizes the balanced accuracy and the estimate using the FD expression are very highly correlated, and the probability that such a high correlation (or larger) between the two quantities is due to chance is negligible.
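A comparison of this kind can be sketched directly. The closed form used below, $r^* = \mu + (1/\beta)\ln[(1-\rho)/\rho]$, is the log-likelihood-ratio threshold discussed above; the prevalence, AUC, ensemble size, and Gaussian score densities are illustrative choices:

```python
# Compare the empirical balanced-accuracy threshold (scan over all rank
# cutoffs, averaged over an ensemble of test sets) with the closed-form
# Fermi-Dirac threshold r* = mu + ln((1 - rho) / rho) / beta.
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(2)
N1, N2 = 30, 70
N, rho = N1 + N2, 30 / 100
delta = 1.4657                                   # Gaussian shift giving AUC ~ 0.85

best_rank, aucs = [], []
for _ in range(500):
    s = np.concatenate([rng.normal(delta, 1, N1), rng.normal(0, 1, N2)])
    y = np.concatenate([np.ones(N1), np.zeros(N2)])[np.argsort(-s)]
    tpr = np.cumsum(y) / N1                      # TPR at each rank cutoff
    fpr = np.cumsum(1 - y) / N2                  # FPR at each rank cutoff
    best_rank.append(np.argmax(0.5 * (tpr + 1 - fpr)) + 1)
    mean_pos_rank = (np.flatnonzero(y) + 1).mean()
    aucs.append(1 - (mean_pos_rank - (N1 + 1) / 2) / N2)   # AUC from mean rank

auc = np.mean(aucs)
r = np.arange(1, N + 1)
def eqs(p):
    beta, mu = p
    f = 1 / (1 + np.exp(np.clip(beta * (r - mu), -50, 50)))
    return [f.sum() - N1, (r * f).sum() - N1 * ((N1 + 1) / 2 + N2 * (1 - auc))]
beta, mu = fsolve(eqs, x0=[0.1, N1])
r_star = mu + np.log((1 - rho) / rho) / beta
print(np.mean(best_rank), r_star)                # the two thresholds should be close
```

As in Fig. 3A, the two thresholds agree closely but not exactly, since the FD form is a maximum-entropy approximation to the true (here Gaussian) rank statistics.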
Fig. 3.
(A) Correlation between two different methods for determining the optimal threshold for segregating the positive and negative classes: the traditional method of scanning over all possible rank thresholds to empirically determine the rank that maximizes the balanced accuracy, and the proposed closed-form method based on the FD distribution. Nine different prevalence values ranging from 0.1 to 0.9 and five different AUC values ranging from 0.55 to 0.95 were tested. (B) Correlation between two different methods for determining the SD of the AUC: the DeLong method and the FD-based method, under the same conditions as in A.
The Variance of AUC Estimates
The AUC is perhaps the most popular metric to evaluate the performance of binary classifiers. While it would be desirable to know the distribution of the AUC of a given classifier over all possible test sets with similar characteristics (such as the ensemble of test sets we discussed earlier), what we usually compute in most applications is an estimator of its mean value in a specific dataset. This estimate of the AUC carries an error that results from inevitable sample-to-sample variation and finite sample sizes. Therefore, any complete reporting of the AUC should also provide a confidence interval that contains the true but unknown AUC with some probability, typically 95%. To compute this confidence interval it is necessary to estimate the variance of the AUC distribution. As discussed in refs. 27 and 35, the mean and variance of the AUC distribution for a classifier are given by
$$\langle \widehat{\mathrm{AUC}}\rangle = A,$$
$$\mathrm{Var}\big(\widehat{\mathrm{AUC}}\big) = \frac{A(1-A) + (N_1-1)\left(P_{XXY}-A^2\right) + (N_2-1)\left(P_{XYY}-A^2\right)}{N_1 N_2},$$
where $P_{XXY}$ is the probability that the classifier assigns higher scores to two randomly and independently sampled positive items than to a randomly sampled negative item, and $P_{XYY}$ is the probability that the classifier assigns lower scores to two randomly and independently sampled negative items than to a randomly sampled positive item. We also denote by $A$ the probability that the classifier assigns a higher score to a randomly sampled positive item than to a randomly sampled negative item.

These equations can be derived formally, but for completeness we provide some intuition for them here. The AUC of a classifier measures the area under the receiver operating characteristic (ROC) curve traced by the points (FPR, TPR) for a given test set, where the parameter $s$ is the classification threshold discussed in the previous section and ranges from the maximum to the minimum possible scores outputted by the classifier.
When s is the classification threshold, all the items with scores larger than s are considered positives, so the TPR is the fraction of the positive items with scores larger than s and the FPR is the fraction of the negative items with scores larger than s. When we compute the area under the ROC curve using a rectangular integration rule, each time the parameter s crosses the score of a positive example, the TPR gains 1/N+ units whereas the FPR does not change. This corresponds to a vertical change in the ROC curve and therefore there is no gain in AUC. When s crosses the value y_j of a negative item, the ROC curve goes from point (FPR, TPR) to point (FPR + 1/N−, TPR). The AUC results from adding the areas of rectangles (one per negative item with score y_j) whose height is equal to the fraction TPR(y_j) of positive examples with score larger than y_j and whose base is equal to 1/N−, the x-axis change in the ROC curve that takes place when the parameter s goes from one negative item to the next. Interestingly, the calculation sketched above is the exact same calculation that we would perform to estimate the frequency with which we find positive items with score larger than that of a negative example in the same test set: Given a negative item to which the classifier assigned a score y_j, the frequency of positive examples with scores greater than y_j coincides with TPR(y_j); the probability that a positive has a score greater than a negative is the sum of these frequencies weighted by the probability of choosing that negative sample, which in a given test set is 1/N−. We have just justified that in a given test set and for a given classifier, the AUC can be computed as

AUC = [1/(N+ N−)] Σ_i Σ_j H(x_i − y_j),

where x_i and y_j are the scores assigned by the classifier to the ith positive example and the jth negative example, respectively, and H is the Heaviside function that takes the value of 1 for positive arguments and 0 for negative arguments. This expression is a known relation between the AUC of a classifier in a given test set and the Mann–Whitney statistics (36). Taking the expected value on both sides, and noting that the expected value of H(x − y) for randomly and independently sampled positive and negative examples with scores x and y is the probability that a positive example has a score larger than a negative example, we get ⟨AUC⟩ = P(x > y) = A, which proves the expression for the mean. Next, we sketch the derivation of the variance, which will allow us to elucidate the origin of the quantities P_xxy and P_xyy. To compute the variance of the AUC we use the fact that Var(AUC) = ⟨AUC²⟩ − ⟨AUC⟩², which requires squaring the expression for the AUC above and taking its expected value in the ensemble. This operation leads to four nested sums (two over the positive examples, indexed by i and i', and two over the negative examples, indexed by j and j') of averages of the form ⟨H(x_i − y_j) H(x_i' − y_j')⟩. To deal with repeated indexes in these nested sums we consider the following four cases: 1) Case i ≠ i' and j ≠ j' leads to terms ⟨H(x_i − y_j) H(x_i' − y_j')⟩, all of which are equal to A², given that H(x_i − y_j) and H(x_i' − y_j') are independent (because x_i, x_i', y_j, and y_j' are) and each has mean A. 2) Case i = i' and j ≠ j' leads to terms ⟨H(x_i − y_j) H(x_i − y_j')⟩, which equal the probability earlier denoted by P_xyy that the score of a randomly sampled positive item (x_i) is larger than the scores of two independently and randomly sampled negative items (y_j and y_j'). 3) Case i ≠ i' and j = j' leads to terms ⟨H(x_i − y_j) H(x_i' − y_j)⟩, which equal the probability earlier denoted by P_xxy that the scores of two independently and randomly sampled positive items (x_i and x_i') are both larger than the score of a randomly sampled negative item (y_j). 4) Case i = i' and j = j' leads to terms ⟨H(x_i − y_j)²⟩, which, given that H² = H, equal the probability that the score of a randomly sampled positive item is larger than the score of a randomly sampled negative item, shown before to be equal to A. Assembling all these cases to compute ⟨AUC²⟩ − ⟨AUC⟩², we recover the variance formula above. Using that formula
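The pairwise rank-sum computation above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the function name is ours, and ties, which the Heaviside formulation leaves undefined, are counted as 1/2 following the usual Mann–Whitney convention.

```python
import numpy as np

def auc_mann_whitney(pos_scores, neg_scores):
    """Empirical AUC as the fraction of (positive, negative) pairs in which
    the positive item receives the higher score; ties count as 1/2."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Heaviside comparison over all N+ x N- pairs
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)
```

The double sum over pairs is vectorized as an N+ × N− comparison matrix, which mirrors the rectangular-integration argument in the text.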
requires knowledge of the quantities P_xxy and P_xyy, which depend on the generally unknown class-conditioned score densities. One could proceed by assuming some functional form for these densities. For example, if the class-conditioned score densities are assumed to be exponential, it can be shown (27) that P_xxy = A/(2 − A) and P_xyy = 2A²/(1 + A). However, assuming a distribution just because it yields analytical expressions may lead to inaccurate results, e.g., producing confidence intervals that are too loose. In practical applications the variance of the AUC is often computed using a method first proposed by DeLong et al. (37), which consists of rearranging the terms of the variance formula into an empirical estimator of the AUC variance that has proved to be a reliable option for its computation (38, 39). Note that A, P_xxy, and P_xyy depend only on the relative order of positive and negative samples. As such, they can be written in terms of the class-conditioned rank probabilities, which in turn can be expressed using the FD distribution for the rank-conditioned class probabilities. Indeed, in cases where we do not know the true class-conditioned score or rank distribution, the variance formula requires only A, P_xxy, and P_xyy together with the most parsimonious (maximum-entropy) distribution for the rank-conditioned class probability. As was shown earlier, this leads to the FD distribution. Let us first express A in terms of ranks. The probability that the score of a negative item is smaller than the score of a positive item translates into the probability that the negative item has a higher rank than the positive item. If a positive item is at rank r, the probability that a negative item has a rank higher than r is Σ_{r' > r} P(r'|−). As the positive item can be at any rank r, to compute A we need to add the previous sum over all the possible ranks where the positive item can be, weighted by the probability P(r|+) that there is a positive item at rank r. Using that P(r|+) = P(+|r)/N+ and P(r|−) = [1 − P(+|r)]/N−, A can be expressed as

A = Σ_r [P(+|r)/N+] Σ_{r' > r} [1 − P(+|r')]/N−,

where P(+|r) is given by the FD distribution.
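The closed-form variance can be evaluated numerically by plugging empirical pairwise frequencies into the formula. The sketch below is our own illustration under stated assumptions: the symbol names (A, P_xxy, P_xyy) follow the text, but the plug-in estimators of the pairwise probabilities are a simple choice of ours, not the exact DeLong rearrangement.

```python
import numpy as np

def auc_variance(pos, neg):
    """Evaluate Var(AUC) = [A(1-A) + (N+-1)(P_xxy - A^2)
    + (N--1)(P_xyy - A^2)] / (N+ N-) with plug-in pairwise estimates."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    npos, nneg = pos.size, neg.size
    gt = (pos[:, None] > neg[None, :]).astype(float)  # pairwise Heaviside matrix
    a = gt.mean()                                     # A: P(positive beats negative)
    # P_xxy: two independent positives both beat one random negative
    p_xxy = (gt.mean(axis=0) ** 2).mean()
    # P_xyy: one random positive beats two independent negatives
    p_xyy = (gt.mean(axis=1) ** 2).mean()
    var = (a * (1 - a)
           + (npos - 1) * (p_xxy - a ** 2)
           + (nneg - 1) * (p_xyy - a ** 2)) / (npos * nneg)
    return a, var
```

For a perfectly separating classifier all three probabilities equal 1 and the variance vanishes, as expected.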
Recall that β and μ were selected using constraints based on the number of positive samples and the AUC. These constraints are different from the rank-space expression for A above, which may therefore appear to overdetermine the parameters. Interestingly this is not the case: that expression holds for any rank-conditioned class probability that verifies those two constraints, and it is therefore valid for the FD distribution whose parameters were fitted using those very same conditions. Following arguments similar to the ones used to deduce the expression for A, we can find analogous rank-space expressions for P_xxy and P_xyy. We compare the SD estimated using DeLong et al.'s (37) method and using the variance formula (with P_xxy and P_xyy computed using the FD-based rank-space expressions) in Fig. 3, for the same simulations as those used earlier in that figure. The two ways of computing the SD yield almost identical values, with a correlation coefficient close to 1. The minor deviations between the two computations observed in Fig. 3 correspond to cases where the prevalence takes values close to 0 or 1. In these situations, the FD distribution fit to the constraints is not as good as in cases where the prevalence takes intermediate values.
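Fitting the FD parameters from the two constraints can be sketched numerically. The following is a minimal illustration, not the paper's implementation: we solve for (β, μ) by nested bisection so that the total occupation equals N+ and the implied AUC, computed through the standard rank form of the Mann–Whitney statistic, matches the observed AUC; function names, bisection ranges, and iteration counts are our assumptions.

```python
import numpy as np

def fit_fd(n, n_pos, auc):
    """Fit FD parameters (beta, mu) on ranks r = 1..n so that
    sum_r P(+|r) = n_pos and the implied AUC equals `auc` (rank 1 = best)."""
    r = np.arange(1, n + 1)
    n_neg = n - n_pos

    def occupation(beta, mu):
        # Fermi-Dirac rank-conditioned class probability P(+|r)
        return 1.0 / (np.exp(np.clip(beta * (r - mu), -500, 500)) + 1.0)

    def mu_for(beta):
        # choose mu so total occupation matches the number of positives
        lo, hi = -50.0 / beta - n, 50.0 / beta + n
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if occupation(beta, mid).sum() < n_pos:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def auc_of(beta):
        # AUC from the mean positive rank (rank form of Mann-Whitney)
        p = occupation(beta, mu_for(beta))
        rbar_pos = (r * p).sum() / p.sum()
        return 1.0 + (n_pos + 1) / (2.0 * n_neg) - rbar_pos / n_neg

    lo, hi = 1e-6, 50.0          # AUC grows monotonically with beta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if auc_of(mid) < auc:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return beta, mu_for(beta)
```

Bisection is adequate here because both constraints are monotone: the occupation sum increases with μ at fixed β, and the implied AUC increases with β once μ is pinned by the prevalence constraint.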
Using the FD Distribution for Ensemble Classification
Ensemble learning for classification is the endeavor of combining multiple base classifiers to construct an ensemble classifier that generalizes better than any of its constituents. In this section, we present FiDEL, an ensemble learning method that uses the FD distribution to model the rank-conditioned class probabilities of different base classifiers. We assume that we have M classifiers in the ensemble, denoted C_1, ..., C_M. Let r_m(i) denote the rank assigned to item i by classifier C_m, and let P(r_1(i), ..., r_M(i) | class) denote the joint probability of this rank assignment given the class of item i. Following refs. 22 and 23, we assume that the base classifiers' rank assignments for a given item are conditionally independent given the class. This strong assumption means that different classifiers rank the same item independently of each other, whether the item is in the positive or the negative class. Under this assumption, the joint class-conditional distribution of rank predictions can be written as the product of the individual class-conditioned rank probabilities. We use the log-likelihood ratio

L(i) = log [ P(r_1(i), ..., r_M(i) | +) / P(r_1(i), ..., r_M(i) | −) ]

to estimate the degree to which the evidence given by the ranks assigned by the M classifiers to item i supports the conclusion that item i is in the positive versus the negative class. Using the assumption of conditional independence, the log-likelihood ratio can be rewritten as a sum over classifiers of log [P_m(r_m(i)|+)/P_m(r_m(i)|−)], where P_m(r|class) is the probability of rank r given the class for classifier C_m. Replacing each P_m by the FD distribution, this sum, up to an additive constant that is the same for all items, becomes a score s(i) that the FiDEL ensemble classifier assigns to item i and that can be used to rank items and compute the AUC. Items that get larger and positive scores will be more likely to belong to the positive class. Conversely, items with more negative scores will be more likely to belong to the negative class.
The log-likelihood ratio suggests that 0 is the natural threshold that separates items in the positive and negative classes, and therefore the predicted label for the FiDEL ensemble is H(s(i)), where H is the Heaviside step function. Note that the contribution to s(i) of classifier C_m is the difference between the optimal threshold μ_m and the rank the classifier assigns to the item being classified, weighted by the parameter β_m. As previously discussed, β_m can be interpreted as the inverse of the temperature of an equivalent physical system; a higher temperature corresponds to more classification errors and thus lower accuracy. (See Fig. 2, where it can be seen that β increases monotonically with the AUC of a classifier.) Therefore, weighting each classifier's contribution to s(i) by β_m is easily interpreted: Methods with higher error map to higher temperatures, that is, to lower values of β_m, and thus receive a lower weight in the final score. The score of the ensemble classifier, s(i), also depends on the rank assigned by each classifier relative to its threshold μ_m: items ranked farther from the threshold by classifier C_m contribute more to s(i). The derivation of the FiDEL ensemble is based on the strong assumption that base classifier predictions are class-conditionally independent. To determine the extent to which class-conditional dependence influences the performance of the FiDEL ensemble, we developed a model that simulates the situation in which all pairs of classifiers in the ensemble have a conditional rank correlation, given both the positive and the negative class, equal to a parameter that we varied between 0 (uncorrelated case) and 0.6. In practice, however, there are different degrees of correlation between different pairs of classifiers, as we will see below.
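The threshold-weighted score just described can be sketched compactly; this is our own illustration, assuming the per-classifier contribution β_m(μ_m − r_m(i)) with the class-prior constant dropped (it shifts all items equally and does not affect the ranking). A negated average-rank score for the wisdom-of-crowds baseline discussed below is included for comparison.

```python
import numpy as np

def fidel_score(ranks, betas, mus):
    """FiDEL ensemble score: classifier m contributes beta_m * (mu_m - r_m(i)),
    i.e., its rank deviation from the optimal threshold weighted by the
    inverse temperature. `ranks` is an (M classifiers x N items) array."""
    ranks = np.asarray(ranks, float)
    b = np.asarray(betas, float)[:, None]
    m = np.asarray(mus, float)[:, None]
    return (b * (m - ranks)).sum(axis=0)

def fidel_label(ranks, betas, mus):
    """Predicted labels, thresholding the log-likelihood-ratio score at 0."""
    return (fidel_score(ranks, betas, mus) > 0).astype(int)

def woc_score(ranks):
    """Wisdom-of-crowds baseline: negated average rank across classifiers."""
    return -np.asarray(ranks, float).mean(axis=0)
```

A more accurate classifier (larger β) pulls the ensemble score toward its own ranking, while a near-random one (β near 0) contributes almost nothing.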
For different values of the class-conditioned rank correlation we compared the performance of FiDEL with that of the best classifier in the ensemble and with a baseline ensemble model that we call the wisdom of crowds (WoC) ensemble (40). The WoC ensemble is a classifier whose score for a given item is computed as the average of the ranks assigned by the base classifiers to that item. The results of these simulations show that the performance of FiDEL is robust to mild violations of the assumption of class-conditional independence and that FiDEL's performance exceeds that of the best individual classifier up to a class-conditioned correlation of 0.4. Furthermore, FiDEL is better than the WoC ensemble for all correlation values tested. To exemplify its use and assess its performance in practical classification tasks, we applied FiDEL to two problems proposed on the Kaggle crowd-sourcing platform: the West Nile Virus (WNV) Prediction challenge (41) and the Springleaf Marketing Response (SMR) challenge (42). We chose these challenges because they are binary classification problems with vastly different positive-class prevalence (0.24 for the SMR challenge and far lower for the WNV challenge), with large datasets, and because the data are easily accessible through the Kaggle website. We used 23 general-purpose and widely used methods as base classifiers, of which 21 were used on the WNV data and 20 on the SMR data (we intended to also use 21 classifiers on the SMR data, but one of the chosen classifiers failed to run). In both problems, the data were randomly partitioned into 22 equal-size subgroups, each of which maintained the class proportions of the overall dataset.
Of these 22 groups, 21 were used for training and validation and the remaining one was used as the test set. As discussed above, a high degree of class-conditioned correlation can considerably degrade FiDEL's performance. We studied the class-conditioned correlation under two different training strategies. In the "disjoint partition" strategy, we trained each of the classifiers in its own partition, in such a way that no two classifiers were trained on the same data. In the "same partition" strategy, all classifiers were trained on the same partition. The class-conditioned correlation, averaged over the two classes, was computed for each pair of classifiers from their predictions in the test set in both the WNV and SMR datasets. The average class-conditioned correlations over all pairs of classifiers in the WNV data for the same partition and the disjoint partition strategies were 0.66 and 0.54, respectively. For the SMR data the corresponding correlations were 0.44 and 0.32. These results suggest that the best strategy for using FiDEL, from the perspective of minimizing the class-conditioned correlation, is the disjoint partition strategy, which we use next. After the classifiers were trained on their respective training partitions, the AUC of each classifier and the prevalence were computed on the remaining training partitions (that is, excluding that classifier's own partition and the test set). The AUC and prevalence were then used to fit the parameters β and μ of the FD distribution for each classifier; the resulting FD distribution with the fitted β and μ is an excellent approximation to the empirically computed rank-conditioned class probability. We then applied the learned FiDEL ensemble method to the test data (Fig. 4), comparing the performance of FiDEL to that of the WoC ensemble method and the best individual classifier. We randomly chose subsets of classifiers among the 21 (WNV) or 20 (SMR) classifiers used in the respective datasets to compute FiDEL scores. Fig. 4 A and B show the results of running FiDEL on the WNV data, whereas Fig. 4 C and D show the results for the SMR data. Fig. 4 A and C show results for 200 randomly chosen classifier sets at each of three ensemble sizes, the largest containing 7 classifiers. For each point corresponding to one such combination of classifiers, the x coordinate is the AUC of the best classifier and the y coordinate is the FiDEL AUC for that combination. In the vast majority of classifier choices for the ensemble the points lie above the identity line (dashed line), indicating that the FiDEL method outperforms the best among the base classifiers, even for the smallest ensembles and more so for the larger ones. Fig. 4 B and D show the average, over 200 combinations of randomly chosen classifiers, of the AUCs of the best individual classifier in the ensemble (gray dashed line), the WoC ensemble (black dashed line), and the FiDEL method (solid blue line). Error bars represent the SEM over the 200 combinations. FiDEL clearly and robustly outperforms both the WoC ensemble and the best individual classifier of the ensemble.
Fig. 4.
Performance of FiDEL on two Kaggle binary classification challenges: the WNV Prediction challenge (A and B) and the SMR challenge (C and D). The two datasets differ markedly in prevalence and size. (A and C) Comparison between the AUCs obtained using FiDEL by combining algorithms randomly chosen among 21 (WNV) or 20 (SMR) possible algorithms (y axis) and the AUC of the best among the algorithms used in the combination (x axis), for three ensemble sizes, the largest being 7 algorithms. Each point corresponds to one of 200 combinations of randomly chosen classifiers. (B and D) The average and SEM of the AUCs over 200 combinations of algorithms chosen randomly among the 21 (WNV) or 20 (SMR) methods for FiDEL (blue solid line), WoC (black dashed line), and the best individual algorithm among those combined (gray dashed line) for the WNV Prediction (B) and SMR (D) challenges. FiDEL was trained using the AUC and prevalence values of the base classifiers estimated on a validation set carved from the training set. The AUCs reported here correspond to evaluations of the base classifiers, FiDEL, and WoC on the test set, which is a partition independent of the training set. Different partition choices and base classifier combinations may produce slightly different results.
Other Uses of the Class-Conditioned Rank Probability
In the previous sections, we argued that the FD distribution provides an explicit expression for the rank-conditioned class probability and its counterpart, the class-conditioned rank probability. In this section, we provide a few simple results that follow from expressing some performance metrics directly in terms of the class-conditioned rank probability. For example, in a previous section we demonstrated that the threshold for segregating the positive and the negative classes that zeroes the log-likelihood ratio is also the threshold that maximizes the balanced accuracy, and we did so by expressing the balanced accuracy in terms of the class-conditioned rank probabilities of the classifier. Using the same notation as before, let us denote by c(r) the true class of the item that a classifier placed at rank r. Given that c(r) can take only the values 1 and 0, its mean value in the ensemble is equal to P(+|r). The false positive rate (FPR), the true positive rate (TPR) (also known as recall), the precision (Prec), and the balanced accuracy (bac), at a given rank r used as the threshold between predicted positive and negative classes, can all be computed as follows:

FPR(r) = (1/N−) Σ_{r'≤r} [1 − c(r')], TPR(r) = (1/N+) Σ_{r'≤r} c(r'), Prec(r) = (1/r) Σ_{r'≤r} c(r'), and bac(r) = [TPR(r) + 1 − FPR(r)]/2.

Their averages in the ensemble are obtained by replacing c(r') with its ensemble mean P(+|r'). Using these expressions, a number of interesting relations can be derived. To start with, we can derive an expression for the AUC:

AUC = 1/2 + (⟨r⟩− − ⟨r⟩+)/N,

where ⟨r⟩+ and ⟨r⟩− are the mean ranks of the positive and the negative class. This relation is not new, as it can be obtained from the relation of the AUC to the Wilcoxon–Mann–Whitney U statistics, but it can also be derived using the rank-conditioned class probability. It is interesting because it clearly shows that the AUC depends only on class-conditioned average ranks and not on other subtleties of the distribution of ranks. A second interesting expression is

AUC = 2⟨bac⟩_r − 1/2,

where ⟨·⟩_r denotes the average over all thresholds r. This relates the AUC and the average balanced accuracy over all thresholds. As the maximum of bac(r) is at least as large as its average, it implies that max_r bac(r) ≥ AUC/2 + 1/4. A final interesting relation pertains to the area under the precision-recall curve (AUPRC):

AUPRC ≈ (1/(2ρ)) ⟨Prec²⟩_r + ρ/2,

where ρ is the prevalence and the approximation holds for N+ ≫ 1. It is interesting that the AUPRC is related to the average square of the precision over all thresholds.
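These rank-space metrics, and the relation between the AUC and the average balanced accuracy, are easy to check empirically. The sketch below is our own illustration; function names are ours, and the class sequence c is given in rank order with rank 1 the top-scored item.

```python
import numpy as np

def rank_metrics(c):
    """Metrics at each rank threshold r from the class sequence c, where
    c[r-1] = 1 if the item placed at rank r is positive (rank 1 = top)."""
    c = np.asarray(c, int)
    r = np.arange(1, c.size + 1)
    n_pos = c.sum()
    n_neg = c.size - n_pos
    tp = np.cumsum(c)              # positives among the top r items
    fp = r - tp                    # negatives among the top r items
    tpr = tp / n_pos
    fpr = fp / n_neg
    prec = tp / r
    bac = 0.5 * (tpr + 1.0 - fpr)
    return fpr, tpr, prec, bac

def auc_from_mean_ranks(c):
    """AUC from class-conditioned mean ranks only: 1/2 + (<r>_neg - <r>_pos)/N."""
    c = np.asarray(c, int)
    r = np.arange(1, c.size + 1)
    return 0.5 + (r[c == 0].mean() - r[c == 1].mean()) / c.size
```

On any class sequence, twice the threshold-averaged balanced accuracy minus 1/2 reproduces the mean-rank AUC, which is the second relation above.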
Conclusion
The problem of binary classification is a fundamental task in machine learning. It has spurred the development of a wealth of ingenious algorithms including k-nearest neighbors, support vector machines, random forests, and deep learning to name but a few. Each of these algorithms outputs scores whose value depends on the intricacies of the algorithm and can be properly interpreted only in the narrow context in which the algorithm was used. However, when we try to combine algorithms with other elements of evidence to decide the class of an item, it would be desirable that the output of the algorithm be the probability that the item belongs to each class. Most algorithms do not have a way to compute well-calibrated class-conditioned score densities. Some methods, however, explicitly model the posterior probability of their classification, for example using logistic regression or Platt scaling methods (43), performing a logistic transformation of chosen features in the former or of a classifier score in the latter, into an output probability. While such transformations make intuitive sense and work well for some applications, they are heuristic methodologies. Our approach is different from the abovementioned methods on two counts: On the one hand, our logistic transformation transforms the ranks (not features or scores) assigned by a classifier to items in a test set into a probability; on the other hand, the logistic transformation is not postulated as an ad hoc transformation but results from the maximum-entropy principle and as such is the least-biased distribution given the information at hand. In other words, ours is the most parsimonious calibrated class distribution, and, in the absence of additional information, should be preferred to other methods.In this paper, we address the problem of endowing any binary classifier with a probabilistic output using statistical physics considerations. 
We map the problem of estimating the probability that a classifier places a positive-class item at a given rank to the problem of computing the occupation number of a fermion in a given quantum state in a fermionic physical system with a finite number of single-fermion quantum states. This mapping leads to the identification of the rank-conditioned class probability of a classifier as the FD distribution describing an ensemble of fermionic systems. The FD distribution depends on two parameters of the physical system: the temperature and the chemical potential. We showed that the interpretation of these parameters in a fermionic system can be useful in understanding their role in the classification problem: In physics the temperature is a manifestation of how disordered a system can be, whereas in a classification problem the temperature measures how far a classifier is from the perfect classifier. A temperature of 0 implies a perfect classifier and a temperature of infinity results in a random classifier. The chemical potential is the energy at which a single-particle state is occupied with probability 1/2, and therefore in the classification system it is related to the rank threshold that separates predicted positive and negative classes. Having a precise functional form for the rank-conditioned class probability allowed us to calculate the optimal threshold to separate predicted positive and negative classes. It also permitted the calculation of the SD of the AUC necessary to estimate confidence intervals and perform power analyses. By way of estimating the class probabilities in rank space, our formalism provides a calibrated class probability that can be used to combine classifiers. This allowed us to propose the ensemble learning algorithm that we call FiDEL.
We also showed that expressing performance metrics in terms of rank-conditioned class probabilities is a useful tool for formal derivations: for example, the derivation that the threshold that best separates predicted classes using the likelihood-ratio method is also the threshold that maximizes the balanced accuracy of a classification. Many of the ideas presented in this paper are of a theoretical nature. However, we can envision practical applications of our theory that can be implemented relatively easily. As an example, suppose that we have a dataset such as the one used in ref. 9, consisting of a collection of screening mammograms from women whose breast cancer status after the screening examination is known to be positive or negative. Assume that we divide this set into two partitions: a training set with, e.g., 50% of the data and a validation set with the remaining 50%. After training our classifier on the training set, we compute the AUC and the prevalence in the validation set, from which we derive the FD parameters β and μ. When a woman goes to the radiologist for her next breast cancer screening examination, our classifier processes the mammogram, yielding a score, from which we find the rank of the new mammogram in the context of the other scores in the validation set. We then use the FD distribution with the parameters obtained from the validation set to compute the probability that this woman has cancer according to the classifier. This calibrated probability, given her mammogram and the score outputted by the given classifier in the context of a validation set, can be used by radiologists as a decision aid to decide whether a woman must be recalled for further studies after screening. Given that the FD distribution is the maximum-entropy distribution, this probability is the most unbiased estimate given the data at hand.
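The screening workflow just described fits in a few lines once β and μ have been fitted on the validation set. This is a sketch under our own conventions: the function name is hypothetical and we assume higher scores rank first (rank 1 = most suspicious).

```python
import numpy as np

def calibrated_probability(new_score, val_scores, beta, mu):
    """Calibrated positive-class probability for a new item: rank its score
    among the validation-set scores (rank 1 = highest score), then read off
    the FD distribution fitted on that validation set."""
    val_scores = np.asarray(val_scores, float)
    rank = 1 + (val_scores > new_score).sum()
    return 1.0 / (np.exp(np.clip(beta * (rank - mu), -500.0, 500.0)) + 1.0)
```

A score near the top of the validation set maps to a low rank and hence a probability close to 1; a score near the bottom maps to a probability close to 0.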
Similar strategies can be envisioned in other application domains where a validation set and a preferred classifier are available.It is important to highlight limitations of our approach. To start with, we need to be clear that the FD distribution is not, in general, the exact rank-conditioned class probability of a classifier in a test set. It is, however, the probability distribution that is maximally noncommittal about the aspects of the problem we have no information on, but does include the information that we have about our problem, namely the classifier AUC, the fraction of positive examples, and the total number of elements in the test set. If we had more information about the distribution of scores, or if we had the area under the precision-recall curve, for example, then we could improve the rank-conditioned class probability beyond the FD distribution. A second consideration is that the FD distribution represents the probability that items at a given rank are in the positive class in an ensemble of specific characteristics (the ensemble of test sets). However, in typical applications we have just one test set and we find the FD parameters from one instance of the ensemble. This means that we have a single AUC estimate from one test set and use that point estimate as an estimator for the average AUC. Furthermore, in typical applications, we do not know the labels in the test set, and therefore we cannot compute the AUC in the test set. In these cases, we need to use the AUC as well as the fraction of positive examples from a validation set as we did in the WNV and SMR classification problems presented in this paper. Finally, the ensemble learning algorithm we proposed was derived under the assumption that the base classifiers are class-conditionally independent, which is a strong assumption that holds only approximately. 
However, we showed that the FiDEL ensemble outperforms the best of the base classifiers even when there is moderate class-conditioned correlation among the base classifiers, up to an average correlation of 0.4 to 0.5. We also showed that training base classifiers in disjoint partitions of a dataset, or in completely different datasets, as in federated learning, reduces the dependence between classifiers. We are exploring possible modifications to FiDEL that take into account the correlation between base classifiers in an ensemble. Despite these limitations, we believe that the FD distribution is a useful tool to model the rank-conditioned class probability of a classifier. By transforming scores into ranks, the FD distribution provides a calibrated probabilistic output for binary classifiers.
Materials and Methods
Datasets.
We used datasets for binary classification problems from two Kaggle competitions: the WNV Prediction challenge and the SMR challenge. The WNV competition, which took place in 2015, challenged participants to predict the presence or absence of West Nile virus across the city of Chicago based on tests performed on mosquitoes caught in traps. The data provided to make those predictions included record identification (id), date, address, mosquito species, trap id, and number of mosquitoes. Participants were also given weather data concurrent with the mosquito testing period (2007 to 2014) and the date and location of chemical spraying conducted by the city during 2011 and 2013. The SMR competition ran in 2015 and challenged participants to predict whether or not customers would respond to a marketing mail offer sent to them. Each row corresponds to one customer with 1,934 anonymized features composed of a mix of continuous and categorical variables. Further details are available in the respective challenge descriptions.
Classifiers.
A total of 23 classifiers were used across the two competitions, of which 21 were applied to the WNV dataset and 20 to the SMR dataset.
Statistical Analysis and Visualization.
Statistical analysis and visualization were performed using R (http://www.R-project.org). Source code can be found at https://github.com/sungcheolkim78/FiDEL.
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Nature.
Schaffter T, et al. (2020) JAMA Netw Open.