Literature DB >> 22905111

A comparison of MCC and CEN error measures in multi-class prediction.

Giuseppe Jurman, Samantha Riccadonna, Cesare Furlanello.

Abstract

We show that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient. Analytical results are provided for the limit cases of general no-information (n-face dice rolling) and for the binary classification. Computational evidence supports the claim in the general case.

Entities:  

Mesh:

Year:  2012        PMID: 22905111      PMCID: PMC3414515          DOI: 10.1371/journal.pone.0041882

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Comparing classifiers' performance is one of the most critical tasks in machine learning. Comparison can be carried out either by means of statistical tests [1], [2] or by adopting a performance measure as an indicator to derive similarities and differences, in particular as a function of the number of classes, class imbalance, and behaviour on randomized labels [3]. The definition of performance measures in the context of multiclass classification is still an open research topic, as recently reviewed [4], [5]. One challenging aspect is the extension of such measures from binary to multiclass tasks [6]. Graphical comparison approaches have been introduced [7], but a generic analytic treatment of the problem is still unavailable.

One relevant case study regards the attempt to extend the Area Under the Curve (AUC) measure, which is one of the most widely used measures for binary classifiers but has no automatic extension to the multiclass case. The AUC is associated to the Receiver Operating Characteristic (ROC) curve [8], [9], and thus proposed formulations were based on a multiclass ROC approximation [10]–[13]. A second class of extensions is defined by the Volume Under the Surface (VUS) approach, which is obtained by considering the generalized ROC as a surface whose volume has to be computed by exact integration or polynomial approximation [14]–[16]. As a baseline, the average of the AUCs on the pairwise binary problems derived from the multiclass problem has also been proposed [17].

Other measures are more naturally extended, such as the accuracy (ACC, i.e. the fraction of correctly predicted samples), the Global Performance Index [18], [19], and the Matthews Correlation Coefficient (MCC). We will focus our attention on this last function [20], which in the binary case is also known as the φ-coefficient, i.e., the square root of the average χ² statistic computed on the observed samples of the contingency table of the classification problem. For binary tasks, MCC has attracted the attention of the machine learning community as a method that summarizes the confusion matrix into a single value [21]. Its use as a reference performance measure on unbalanced data sets is now common in other fields such as bioinformatics. Remarkably, MCC was chosen as the accuracy index in the US FDA-led initiative MAQC-II for comparing about 13 000 different models, with the aim of reaching a consensus on the best practices for development and validation of predictive models based on microarray gene expression and genotyping data [22]. A generalization of MCC to the multiclass case was defined in [23] and has also been used for comparing network topologies [24], [25].

A second family of measures that have a natural definition for multiclass confusion matrices are the functions derived from the concept of (information) entropy, first introduced in [26]. In the classification framework, measures in the entropy family range from the simpler confusion matrix entropy [27] to more complex functions such as the Transmitter Information [28] and the Relative Classifier Information (RCI) [29]. Wei and colleagues recently introduced a novel multiclass measure under the name of Confusion Entropy (CEN) [30], [31]. They compared CEN to both RCI and accuracy, obtaining better discriminative power and precision in terms of two statistical indicators called degree of consistency and degree of discriminancy [32]. In our study, we investigate the intriguing similarity existing between CEN and MCC.
In particular, we experimentally show that the two measures are strongly correlated, and that their relation is globally monotone and locally almost linear. Moreover, we provide a brief outline of the mathematical links between CEN and MCC with detailed examples in limit cases. Discriminancy and consistency ratios are discussed as comparative factors, together with functions of the number of classes, class imbalance, and behaviour on randomized labels.

Methods

Given a classification problem on $N$ samples and $n$ classes, define the two functions $t$ and $p$ indicating for each sample $x$ its true class $t(x)$ and its predicted class $p(x)$, respectively. The corresponding confusion matrix is the $n\times n$ square matrix $C$ whose $(i,j)$-th entry is the number of elements of true class $i$ that have been assigned to class $j$ by the classifier:

$$C_{ij} = \bigl|\{x : t(x)=i \text{ and } p(x)=j\}\bigr|.$$

The most natural performance measure is the accuracy, defined as the ratio of the correctly classified samples over all the samples:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} C_{ii}}{\sum_{i,j=1}^{n} C_{ij}}.$$
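
A minimal sketch (assuming NumPy; our own illustration, not the authors' code) of the two objects just defined, the confusion matrix C and the accuracy ACC, computed from lists of true and predicted class labels:

```python
import numpy as np

def confusion_matrix(true, pred, n_classes):
    """C[i, j] = number of samples of true class i predicted as class j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, pred):
        C[t, p] += 1
    return C

def accuracy(C):
    """Fraction of correctly classified samples: trace(C) / sum(C)."""
    return np.trace(C) / C.sum()

# Toy 3-class example (hypothetical labels).
true = [0, 0, 1, 1, 2, 2, 2]
pred = [0, 1, 1, 1, 2, 0, 2]
C = confusion_matrix(true, pred, 3)
print(C)
print(accuracy(C))   # 5 of 7 samples on the diagonal -> ~0.714
```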

Confusion Entropy (CEN)

In information theory, the entropy associated to a random variable $X$ is the expected value of its self-information:

$$H(X) = -\sum_{i} p(x_i)\,\log p(x_i),$$

where $p$ is the probability mass function of $X$, with the convention $0\log 0 = 0$, motivated by the limit $\lim_{q\to 0^+} q\log q = 0$. The Confusion Entropy measure CEN for a confusion matrix $C$ is defined in [30] as:

$$\mathrm{CEN} = \sum_{j=1}^{n} P_j\,\mathrm{CEN}_j, \qquad \mathrm{CEN}_j = -\sum_{\substack{k=1\\ k\neq j}}^{n}\Bigl(P^{j}_{j,k}\log_{2(n-1)}P^{j}_{j,k} + P^{j}_{k,j}\log_{2(n-1)}P^{j}_{k,j}\Bigr),$$

where $P_j$, $P^{j}_{j,k}$, $P^{j}_{k,j}$ are defined as follows: $P_j$ is the confusion probability of class $j$:

$$P_j = \frac{\sum_{k=1}^{n}\bigl(C_{j,k}+C_{k,j}\bigr)}{2\sum_{k,l=1}^{n}C_{k,l}},$$

$P^{j}_{j,k}$ is the probability of classifying the samples of class $j$ to class $k$ ($k\neq j$) subject to class $j$:

$$P^{j}_{j,k} = \frac{C_{j,k}}{\sum_{l=1}^{n}\bigl(C_{j,l}+C_{l,j}\bigr)},$$

and $P^{j}_{k,j}$ is the probability of classifying the samples of class $k$ to class $j$ ($k\neq j$) subject to class $j$:

$$P^{j}_{k,j} = \frac{C_{k,j}}{\sum_{l=1}^{n}\bigl(C_{j,l}+C_{l,j}\bigr)}.$$

For $n>2$, this measure ranges between $0$ (perfect classification) and $1$, the latter value being reached for the complete misclassification case in which the diagonal is zero and all off-diagonal entries are equal, while in the binary case CEN can be greater than 1, as shown below.
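
The following Python sketch implements CEN exactly as defined above (an illustration, not the authors' code); the two checks reproduce the limit cases just mentioned, 0 for perfect classification and 1 for uniform complete misclassification:

```python
import numpy as np

def cen(C):
    """Confusion Entropy of a confusion matrix C, following the definitions above."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    total = C.sum()
    base = 2.0 * (n - 1)                 # logarithms taken in base 2(n-1)

    def logb(x):
        return np.log(x) / np.log(base) if x > 0 else 0.0   # convention 0*log(0) = 0

    value = 0.0
    for j in range(n):
        row_col = C[j, :].sum() + C[:, j].sum()
        if row_col == 0:                 # class j absent from both truth and predictions
            continue
        P_j = row_col / (2.0 * total)    # confusion probability of class j
        cen_j = 0.0
        for k in range(n):
            if k == j:
                continue
            p_jk = C[j, k] / row_col     # P^j_{j,k}
            p_kj = C[k, j] / row_col     # P^j_{k,j}
            cen_j -= p_jk * logb(p_jk) + p_kj * logb(p_kj)
        value += P_j * cen_j
    return value

print(cen(np.diag([5, 5, 5, 5])))             # perfect classification -> 0
print(cen(np.ones((4, 4)) - np.eye(4)))       # uniform misclassification -> 1 (up to rounding)
```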

Matthews Correlation Coefficient (MCC)

The definition of the MCC in the multiclass case was originally reported in [23]. We recall here the main concepts. Let $X$ and $Y$ be two $N\times n$ matrices where $X_{s,k}=1$ if the sample $s$ is predicted to be of class $k$ ($1\le k\le n$) and $0$ otherwise, and $Y_{s,k}=1$ if sample $s$ belongs to class $k$ ($1\le k\le n$) and $0$ otherwise. Using Kronecker's delta function, the definition becomes:

$$X_{s,k}=\delta_{p(s),k}, \qquad Y_{s,k}=\delta_{t(s),k}.$$

Note that the confusion matrix can be recovered as $C_{ij}=\sum_{s=1}^{N} Y_{s,i}\,X_{s,j}$. The covariance function between $X$ and $Y$ can be written as follows:

$$\mathrm{cov}(X,Y)=\frac{1}{n}\sum_{k=1}^{n}\mathrm{cov}(X_k,Y_k)=\frac{1}{n}\sum_{k=1}^{n}\sum_{s=1}^{N}\bigl(X_{s,k}-\bar{X}_k\bigr)\bigl(Y_{s,k}-\bar{Y}_k\bigr),$$

where $X_k$ and $Y_k$ are the $k$-th columns of the two matrices and $\bar{X}_k$ and $\bar{Y}_k$ are the means of the columns, defined respectively as $\bar{X}_k=\frac{1}{N}\sum_{s}X_{s,k}$ and $\bar{Y}_k=\frac{1}{N}\sum_{s}Y_{s,k}$. Finally the Matthews Correlation Coefficient MCC can be written as:

$$\mathrm{MCC}=\frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{cov}(X,X)\,\mathrm{cov}(Y,Y)}}.$$

MCC lives in the range $[-1,1]$, where $1$ corresponds to perfect classification. The value $-1$ is asymptotically reached in the extreme misclassification case of a confusion matrix with all zeros but in two symmetric entries $C_{ij}$, $C_{ji}$. MCC is equal to $0$ when $C$ is all zeros but for one column (all samples have been classified to be of a class $k$), or when all entries are equal.
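
A compact sketch of the multiclass MCC computed directly from the confusion matrix, using the equivalent form in terms of trace, row sums and column sums (see [23]); again an illustration under the same assumptions as the previous snippets:

```python
import numpy as np

def mcc_multiclass(C):
    """Multiclass Matthews Correlation Coefficient of a confusion matrix C."""
    C = np.asarray(C, dtype=float)
    s = C.sum()                  # number of samples
    c = np.trace(C)              # correctly classified samples
    t = C.sum(axis=1)            # t_k: samples whose true class is k (row sums)
    p = C.sum(axis=0)            # p_k: samples predicted as class k (column sums)
    num = c * s - np.dot(t, p)
    den = np.sqrt((s * s - np.dot(p, p)) * (s * s - np.dot(t, t)))
    return num / den if den > 0 else 0.0   # conventionally 0 when a factor vanishes

print(mcc_multiclass(np.diag([3, 4, 5])))                    # perfect classification -> 1
print(mcc_multiclass([[2, 0, 0], [3, 0, 0], [4, 0, 0]]))     # one-column matrix -> 0
```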

Relationships between CEN and MCC

As discussed before, CEN and MCC live in different ranges, whose extreme values are reached in different ways. In Box 1 of Fig. 1, numerical examples are shown for different situations: (a) complete classification, (b) complete misclassification, (c) all samples classified as belonging to one class, (d) misclassification in a very unbalanced situation.
Figure 1

Examples of CEN and MCC for different confusion matrices.

It is worth noting that CEN is more discriminant than MCC in specific situations, although this property is not always welcome. For instance, in Fig. 1, Box 1(c), where all samples are assigned to a single class, MCC is zero while CEN takes a strictly positive value. Furthermore, as shown in Box 2, for the constant matrix $K_a$ (all entries equal to $a$) we have $\mathrm{MCC}(K_a)=0$ for each $a$, regardless of the number of classes $n$, while it is easy to show that $\mathrm{CEN}(K_a)=\frac{n-1}{n}\log_{2(n-1)}2n$, i.e., CEN is a function of $n$. Note that both measures are invariant for scalar multiplication of the whole confusion matrix, so we always set $a=1$ in Box 2.

For small sample sizes, we can show that CEN has higher discriminant power than MCC, i.e., different confusion matrices can have the same MCC and different CEN. This can be quantitatively assessed by using the degree of discriminancy criterion [32]: for two measures $f$ and $g$ on a domain $\Psi$, let $P=\bigl|\{(a,b)\in\Psi\times\Psi : f(a)>f(b),\ g(a)=g(b)\}\bigr|$ and $Q=\bigl|\{(a,b)\in\Psi\times\Psi : g(a)>g(b),\ f(a)=f(b)\}\bigr|$; then the degree of discriminancy of $f$ over $g$ is $P/Q$. For instance, as in [30], we consider a 3-class case with a few samples per class: we evaluate all the possible confusion matrices ranging from the perfect classification case to the complete misclassification case. In this case the degree of discriminancy of CEN over MCC is about 6. Similar results hold for all the 12 small sample size cases on three classes listed in Tab. 6 of [30], ranging from 9 to 19 samples.

We proceed now to show an intriguing relationship between MCC and CEN. First consider the confusion matrix $D_{a,b}$ of dimension $n$ where $(D_{a,b})_{ij}=b$ for $i\neq j$ and $(D_{a,b})_{ii}=a$, i.e., all entries have value $b$ but in the diagonal, whose values are all $a$, for $a,b$ two integers. In this case,

$$\mathrm{MCC}(D_{a,b})=\frac{a-b}{a+(n-1)b} \qquad\text{and}\qquad \mathrm{CEN}(D_{a,b})=\frac{(n-1)\,b}{a+(n-1)b}\,\log_{2(n-1)}\frac{2\bigl(a+(n-1)b\bigr)}{b},$$

and thus

$$\mathrm{CEN}(D_{a,b})=\frac{n-1}{n}\,\bigl(1-\mathrm{MCC}(D_{a,b})\bigr)\,\log_{2(n-1)}\frac{2n}{1-\mathrm{MCC}(D_{a,b})}.$$

This identity can be relaxed to the following generalization, which slightly underestimates CEN:

$$\mathrm{CEN}(C)\approx\frac{n-1}{n}\,\bigl(1-\mathrm{MCC}(C)\bigr)\,\log_{2(n-1)}\frac{2n}{1-\mathrm{MCC}(C)},\qquad(3)$$

where both sides are zero when $\mathrm{MCC}=1$, i.e., for perfect classification. For simplicity's sake, we call “transformed MCC” (tMCC) the right member of Eq. 3.

A numerical simulation shows that the tMCC approximation in Eq. 3 holds in a more general and practical setting (Fig. 2). In the simulation, 200 000 confusion matrices (dimension range: 3 to 30) were generated. For each class, the number of correctly classified elements (i.e., the corresponding diagonal element) was uniformly randomly chosen between 1 and 1000; the off-diagonal entries were then generated as random integers between 1 and an upper bound controlled by a parameter extracted from a uniform distribution, corresponding to small to moderate misclassification. For such data, the Pearson correlation between tMCC and CEN is about 0.994.
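
A rough Python re-creation of this experiment (the sampling parameters are approximated where the text does not fully specify them, and far fewer matrices are drawn to keep the run short); it relies on the cen() and mcc_multiclass() sketches given above:

```python
import numpy as np
# assumes cen() and mcc_multiclass() defined as in the previous sketches

def tmcc(C):
    """Transformed MCC: right-hand side of Eq. 3."""
    n = C.shape[0]
    m = mcc_multiclass(C)
    if m >= 1.0:
        return 0.0
    return (n - 1) / n * (1 - m) * np.log(2 * n / (1 - m)) / np.log(2 * (n - 1))

rng = np.random.default_rng(0)
cens, tmccs = [], []
for _ in range(2000):                         # 200 000 matrices in the paper
    n = int(rng.integers(3, 31))              # dimension between 3 and 30
    diag = rng.integers(1, 1001, size=n)      # correctly classified samples per class
    frac = rng.uniform(0.01, 0.2)             # assumed misclassification level
    hi = max(2, int(frac * diag.mean()) + 1)
    C = rng.integers(1, hi, size=(n, n)).astype(float)
    np.fill_diagonal(C, diag)
    cens.append(cen(C))
    tmccs.append(tmcc(C))

print(np.corrcoef(cens, tmccs)[0, 1])         # should be close to 1
```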
Figure 2

Dotplots of CEN versus MCC (a) and CEN versus tMCC (b) for 200 000 random confusion matrices of different dimensions.

In order to compare the two measures we also consider the degree of consistency indicator [32]: for two measures $f$ and $g$ on a domain $\Psi$, let $R=\bigl|\{(a,b)\in\Psi\times\Psi : f(a)>f(b),\ g(a)>g(b)\}\bigr|$ and $S=\bigl|\{(a,b)\in\Psi\times\Psi : f(a)>f(b),\ g(a)<g(b)\}\bigr|$; then the degree of consistency of $f$ and $g$ is $R/(R+S)$. On the given data the degree of consistency is close to one, while the degree of discriminancy is undefined since no ties occur. In summary, the relation between tMCC and CEN is close to linear on this data, with an average ratio of 1.000508 (95% bootstrap Student confidence interval).
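
A brute-force sketch of the two comparison indicators used here, the degree of consistency and the degree of discriminancy of [32], for two measures evaluated on the same set of confusion matrices (illustrative only, and quadratic in the number of matrices):

```python
from itertools import combinations

def consistency_discriminancy(f_vals, g_vals, tol=1e-12):
    """Degree of consistency R/(R+S) and degree of discriminancy P/Q of f over g."""
    R = S = P = Q = 0
    for i, j in combinations(range(len(f_vals)), 2):
        df = f_vals[i] - f_vals[j]
        dg = g_vals[i] - g_vals[j]
        if abs(df) > tol and abs(dg) > tol:
            if df * dg > 0:
                R += 1        # both measures order the pair the same way
            else:
                S += 1        # the two measures disagree on the ordering
        elif abs(df) > tol:
            P += 1            # f discriminates the pair, g ties
        elif abs(dg) > tol:
            Q += 1            # g discriminates the pair, f ties
    consistency = R / (R + S) if R + S else float('nan')
    discriminancy = P / Q if Q else float('inf')
    return consistency, discriminancy

# Tiny hypothetical example.
print(consistency_discriminancy([0.2, 0.2, 0.5, 0.9], [0.1, 0.3, 0.3, 0.8]))   # (1.0, 1.0)
```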

Comparison on a one-parameter family of matrices

The behaviour of the Confusion Entropy is instead rather different from MCC and ACC on the family of matrices where all entries are equal but for a single non-diagonal one. Because of the multiplicative invariance, all entries can be set to one but for the leftmost lower corner:

$$C_{ij}=1 \text{ for } (i,j)\neq(n,1),\qquad C_{n,1}=a,$$

for $a$ a positive integer. As shown in Fig. 1, Box 3, when $a$ grows bigger, more and more samples are misclassified, i.e., the accuracy $\mathrm{ACC}=\frac{n}{n^2+a-1}$ decreases to zero for increasing $a$. The MCC measure of this confusion matrix is

$$\mathrm{MCC}=\frac{1-a}{(n-1)\bigl(n^2-2+2a\bigr)},$$

which is a function monotonically decreasing for increasing values of $a$, with limit $-\frac{1}{2(n-1)}$ for $a\to\infty$. On the other hand, the Confusion Entropy for the same family of matrices is still a decreasing function of increasing $a$, but asymptotically moving towards zero, i.e., to the minimal entropy case. In Box 3 of Fig. 1 we present three numerical examples for increasing values of $a$.
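
A quick numerical check of this family (using the illustrative helper functions sketched above), showing ACC tending to 0, MCC tending to -1/(2(n-1)) and CEN tending to 0 as a grows:

```python
import numpy as np
# assumes cen() and mcc_multiclass() defined as in the previous sketches

n = 4
for a in (1, 10, 100, 10_000):
    C = np.ones((n, n))
    C[n - 1, 0] = a                     # all ones except the lower-left corner
    acc = np.trace(C) / C.sum()
    print(a, round(acc, 4), round(mcc_multiclass(C), 4), round(cen(C), 4))

print(-1 / (2 * (n - 1)))               # asymptotic MCC for n = 4: -1/6
```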

The dice rolling case

Another pathological case is dice-rolling classification on unbalanced classes: because of the multiplicative invariance of the measures, we can assume that the confusion matrix for this case has all entries equal to one but for the last row, whose entries are all $a$, for $a$ a positive integer. In this case, the Confusion Entropy is a decreasing function for growing $a$, whose limit for $a\to\infty$ is

$$\frac{n-1}{2n}\,\log_{2(n-1)}(n+1).$$

As a function of $n$, this limit is an increasing function asymptotically growing towards $\frac{1}{2}$. It is easy to see that $\mathrm{MCC}=0$ for every $a$ in this case. More in general, while $\mathrm{MCC}=0$ in all those cases where random classification (i.e., no learning) happens, this uniqueness is lost in the case of CEN, due to its greater discriminant power: there is no unique CEN value associated to the spectrum of random classification problems.
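
The same check for the dice-rolling family (again relying on the sketches above): MCC stays exactly at 0 for every a, while CEN decreases towards the limit stated above:

```python
import numpy as np
# assumes cen() and mcc_multiclass() defined as in the previous sketches

n = 4
limit = (n - 1) / (2 * n) * np.log(n + 1) / np.log(2 * (n - 1))
for a in (1, 5, 50, 5_000):
    C = np.ones((n, n))
    C[n - 1, :] = a                      # last row (true class n) over-represented
    print(a, round(mcc_multiclass(C), 6), round(cen(C), 4))

print(round(limit, 4))                   # asymptotic CEN for n = 4
```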

The binary case

In the two-class case (P: positives, N: negatives), the confusion matrix is

$$C=\begin{pmatrix}\mathrm{TP} & \mathrm{FN}\\ \mathrm{FP} & \mathrm{TN}\end{pmatrix},$$

where T and F stand for true and false, respectively. The Matthews Correlation Coefficient has the familiar definition [20], [21]:

$$\mathrm{MCC}=\frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}.$$

The Confusion Entropy for the binary case is obtained by specializing the general definition above, with logarithms taken in base 2. Note that in the case $\mathrm{TP}=\mathrm{TN}$ and $\mathrm{FP}=\mathrm{FN}$ we have

$$\mathrm{CEN}=\frac{\mathrm{FP}}{\mathrm{TP}+\mathrm{FP}}\,\log_2\frac{2\,(\mathrm{TP}+\mathrm{FP})}{\mathrm{FP}},$$

and thus $\mathrm{CEN}>1$ whenever the ratio $\mathrm{TP}/\mathrm{FP}$ is smaller than 1 (and $\mathrm{TP}>0$). In other words, confusion matrices with more misclassified than correctly classified samples have $\mathrm{CEN}\geq 1$; the bound $\mathrm{CEN}=1$ is attained for $\mathrm{TP}=\mathrm{TN}=0$, the case of total misclassification. This suggests that CEN should not be used as a classifier performance measure in the binary case. A numerical example is provided in Fig. 1, Box 4, while a plot of CEN and MCC curves for different ratios between misclassified and correctly classified samples is shown in Fig. 3.
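
A short sketch of the binary case (symmetric matrices with TP = TN and FP = FN), showing that CEN can exceed 1 while MCC behaves as expected; as before, cen() is the illustrative implementation given earlier:

```python
import numpy as np
# assumes cen() defined as in the previous sketches

def binary_mcc(tp, fn, fp, tn):
    """Two-class MCC in its familiar form."""
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0

for tp, fp in ((9, 1), (5, 5), (3, 7), (1, 9)):
    C = [[tp, fp], [fp, tp]]             # TP = TN, FN = FP
    print(tp, fp, round(binary_mcc(tp, fp, fp, tp), 3), round(cen(C), 3))
# e.g. TP = 3, FP = 7 gives MCC = -0.4 but CEN ≈ 1.06 > 1
```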
Figure 3

Lines describing CEN and MCC of a confusion matrix for an increasing ratio between misclassified and correctly classified samples.

Gray vertical lines correspond to the examples provided in Fig. 1, Box 4.

Indeed, differently from the multiclass case, CEN and MCC are poorly correlated for two classes. We computed MCC and CEN for all the 4 598 125 possible confusion matrices for a binary classification task on $s$ samples ($s\leq 100$). Results are displayed in Fig. 4 for $s=5,10,50,75$ and for the cumulative plot with all $s\leq 100$. In this last case, the two metrics show only a weak (absolute) Pearson correlation.
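
A sketch of this exhaustive binary experiment: enumerate every 2x2 confusion matrix with at most s_max samples and correlate MCC with CEN (s_max = 100 gives the 4 598 125 matrices mentioned above; a smaller bound is used here to keep the loop short):

```python
import numpy as np
# assumes cen() and binary_mcc() defined as in the previous sketches

s_max = 20
mccs, cens = [], []
for s in range(1, s_max + 1):
    for tp in range(s + 1):
        for fn in range(s - tp + 1):
            for fp in range(s - tp - fn + 1):
                tn = s - tp - fn - fp
                C = np.array([[tp, fn], [fp, tn]], dtype=float)
                mccs.append(binary_mcc(tp, fn, fp, tn))
                cens.append(cen(C))

print(len(mccs))
print(abs(np.corrcoef(mccs, cens)[0, 1]))     # noticeably lower than in the multiclass simulation
```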
Figure 4

Scatter plots of CEN versus MCC for all the confusion matrices of binary classification tasks with s = 5, 10, 50, 75 samples and for the cumulative set of all 4 598 125 matrices with s ≤ 100.

Results and Discussion

We compared the Matthews Correlation Coefficient (MCC) and the Confusion Entropy (CEN) as performance measures of a classifier in multiclass problems. We have shown, both analytically and empirically, that they have a consistent behaviour in practical cases. However, each of them is better tailored to deal with different situations, and some care should be taken in the presence of limit cases. Both MCC and CEN improve over accuracy (ACC), by far the simplest and most widespread measure in the scientific literature. The problem with ACC is that it copes poorly with unbalanced classes and cannot distinguish among different misclassification distributions. CEN has been recently proposed to provide a high level of discrimination even between very similar confusion matrices. However, we show that this feature is not always desirable, as in the case of random dice rolling, for which MCC = 0, but a range of different values is found for CEN. This case is of practical interest because class labels are often randomized as a sanity check in complex classification studies, e.g., in medical diagnosis tasks such as cancer subtyping [33] or image classification problems (e.g., handwritten ZIP code identification or image scene classification) [34]. Our analysis also shows that CEN cannot be reliably used in the binary case, as its definition attributes high entropy even in regimes of high accuracy, and it can even take values larger than one.

In the most general case, MCC is a good compromise among discriminancy, consistency, and coherent behaviour with a varying number of classes, unbalanced datasets, and randomization. Given the strong linear relation between CEN and a logarithmic function of MCC, the two measures are exchangeable in the majority of practical cases. Furthermore, the behaviour of MCC remains consistent between binary and multiclass settings. Our analysis does not regard threshold classifiers; whenever a ROC curve can be drawn, generalized versions of the Area Under the Curve or other similar measures represent a more immediate choice [35]. This given, for confusion matrix analysis, our results indicate that MCC remains an optimal off-the-shelf tool in practical tasks, while refined measures such as CEN should be reserved for specific topics where high discrimination is crucial.
References:  9 in total

1.  Assessing the accuracy of prediction algorithms for classification: an overview.

Authors:  P Baldi; S Brunak; Y Chauvin; C A Andersen; H Nielsen
Journal:  Bioinformatics       Date:  2000-05       Impact factor: 6.937

2.  Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Authors:  B W Matthews
Journal:  Biochim Biophys Acta       Date:  1975-10-20

3.  Comparing two K-category assignments by a K-category correlation coefficient.

Authors:  J Gorodkin
Journal:  Comput Biol Chem       Date:  2004-12       Impact factor: 2.877

4.  Evaluating diagnostic tests: The area under the ROC curve and the balance of errors.

Authors:  David J Hand
Journal:  Stat Med       Date:  2010-06-30       Impact factor: 2.373

5.  Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis.

Authors:  Thomas C W Landgrebe; Robert P W Duin
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2008-05       Impact factor: 6.226

6.  The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors:  Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal:  Nat Biotechnol       Date:  2010-07-30       Impact factor: 54.908

7.  The meaning and use of the area under a receiver operating characteristic (ROC) curve.

Authors:  J A Hanley; B J McNeil
Journal:  Radiology       Date:  1982-04       Impact factor: 11.105

8.  A fast and efficient gene-network reconstruction method from multiple over-expression experiments.

Authors:  Dejan Stokić; Rudolf Hanel; Stefan Thurner
Journal:  BMC Bioinformatics       Date:  2009-08-17       Impact factor: 3.169

9.  Repeated observation of breast tumor subtypes in independent gene expression data sets.

Authors:  Therese Sorlie; Robert Tibshirani; Joel Parker; Trevor Hastie; J S Marron; Andrew Nobel; Shibing Deng; Hilde Johnsen; Robert Pesich; Stephanie Geisler; Janos Demeter; Charles M Perou; Per E Lønning; Patrick O Brown; Anne-Lise Børresen-Dale; David Botstein
Journal:  Proc Natl Acad Sci U S A       Date:  2003-06-26       Impact factor: 12.779

  54 in total

1.  Sensitivity and specificity of substrate mapping: an in silico framework for the evaluation of electroanatomical substrate mapping strategies.

Authors:  Joshua J E Blauer; Darrell Swenson; Koji Higuchi; Gernot Plank; Ravi Ranjan; Nassir Marrouche; Rob S Macleod
Journal:  J Cardiovasc Electrophysiol       Date:  2014-05-30

2.  The influence of alignment-free sequence representations on the semi-supervised classification of class C G protein-coupled receptors: semi-supervised classification of class C GPCRs.

Authors:  Raúl Cruz-Barbosa; Alfredo Vellido; Jesús Giraldo
Journal:  Med Biol Eng Comput       Date:  2014-11-04       Impact factor: 2.602

3.  BRCA1- and BRCA2-specific in silico tools for variant interpretation in the CAGI 5 ENIGMA challenge.

Authors:  Natàlia Padilla; Alejandro Moles-Fernández; Casandra Riera; Gemma Montalban; Selen Özkan; Lars Ootes; Sandra Bonache; Orland Díez; Sara Gutiérrez-Enríquez; Xavier de la Cruz
Journal:  Hum Mutat       Date:  2019-07-03       Impact factor: 4.878

4.  HiTSelect: a comprehensive tool for high-complexity-pooled screen analysis.

Authors:  Aaron A Diaz; Han Qin; Miguel Ramalho-Santos; Jun S Song
Journal:  Nucleic Acids Res       Date:  2014-11-26       Impact factor: 16.971

5.  Detecting Drinking Episodes in Young Adults Using Smartphone-based Sensors.

Authors:  Sangwon Bae; Denzil Ferreira; Brian Suffoletto; Juan C Puyana; Ryan Kurtz; Tammy Chung; Anind K Dey
Journal:  Proc ACM Interact Mob Wearable Ubiquitous Technol       Date:  2017-06-30

6.  Preference-Driven Classification Measure.

Authors:  Jan Kozak; Barbara Probierz; Krzysztof Kania; Przemysław Juszczuk
Journal:  Entropy (Basel)       Date:  2022-04-10       Impact factor: 2.738

7.  A Comparison Between Single- and Multi-Scale Approaches for Classification of Histopathology Images.

Authors:  Marina D'Amato; Przemysław Szostak; Benjamin Torben-Nielsen
Journal:  Front Public Health       Date:  2022-07-04

8.  A mountable toilet system for personalized health monitoring via the analysis of excreta.

Authors:  Seung-Min Park; Daeyoun D Won; Brian J Lee; Diego Escobedo; Andre Esteva; Amin Aalipour; T Jessie Ge; Jung Ha Kim; Susie Suh; Elliot H Choi; Alexander X Lozano; Chengyang Yao; Sunil Bodapati; Friso B Achterberg; Jeesu Kim; Hwan Park; Youngjae Choi; Woo Jin Kim; Jung Ho Yu; Alexander M Bhatt; Jong Kyun Lee; Ryan Spitler; Shan X Wang; Sanjiv S Gambhir
Journal:  Nat Biomed Eng       Date:  2020-04-06       Impact factor: 25.671

9.  Large expert-curated database for benchmarking document similarity detection in biomedical literature search.

Authors:  Peter Brown; Yaoqi Zhou
Journal:  Database (Oxford)       Date:  2019-01-01       Impact factor: 3.451

10.  Identification of Cell Markers and Their Expression Patterns in Skin Based on Single-Cell RNA-Sequencing Profiles.

Authors:  Xianchao Zhou; Shijian Ding; Deling Wang; Lei Chen; Kaiyan Feng; Tao Huang; Zhandong Li; Yudong Cai
Journal:  Life (Basel)       Date:  2022-04-07
