The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
Davide Chicco, Giuseppe Jurman.
Abstract
BACKGROUND: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics adopted in binary classification tasks. However, these statistical measures can dangerously show overoptimistic, inflated results, especially on imbalanced datasets.
Keywords: Accuracy; Binary classification; Biostatistics; Confusion matrices; Dataset imbalance; F1 score; Genomics; Machine learning; Matthews correlation coefficient
Year: 2020 PMID: 31898477 PMCID: PMC6941312 DOI: 10.1186/s12864-019-6413-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
The standard confusion matrix M
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positives (TP) | False negatives (FN) |
| Actual negative | False positives (FP) | True negatives (TN) |
True positives (TP) and true negatives (TN) are the correct predictions, while false negatives (FN) and false positives (FP) are the incorrect predictions
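As a minimal illustration of how the four cells of M are obtained from a set of predictions (the 0/1 label encoding and the function name are illustrative assumptions, not from the paper):

```python
def confusion_matrix(actual, predicted):
    """Count (TP, FN, TN, FP) for binary labels, with 1 = positive and 0 = negative."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp, fn, tn, fp

# Three actual positives (two caught), two actual negatives (one caught)
actual    = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 1)
```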
Classwise performance measures
| Measure | Formula | Measure | Formula |
|---|---|---|---|
| Sensitivity, recall, true positive rate | TP / (TP + FN) | Specificity, true negative rate | TN / (TN + FP) |
| Positive predictive value, precision | TP / (TP + FP) | Negative predictive value | TN / (TN + FN) |
| False positive rate, fallout | FP / (FP + TN) | False discovery rate | FP / (FP + TP) |
TP: true positives. TN: true negatives. FP: false positives. FN: false negatives
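These classwise rates follow mechanically from the four confusion-matrix cells; a minimal sketch (the dictionary keys are illustrative labels, not the paper's notation):

```python
def classwise_rates(tp, fn, tn, fp):
    """Standard classwise performance measures derived from the confusion matrix."""
    return {
        "sensitivity (TPR, recall)":       tp / (tp + fn),
        "specificity (TNR)":               tn / (tn + fp),
        "precision (PPV)":                 tp / (tp + fp),
        "negative predictive value (NPV)": tn / (tn + fn),
        "false positive rate (fallout)":   fp / (fp + tn),
        "false discovery rate (FDR)":      fp / (fp + tp),
    }

# Use case A1 from the recap below: TP=90, FN=1, TN=0, FP=9
rates = classwise_rates(tp=90, fn=1, tn=0, fp=9)
print(round(rates["sensitivity (TPR, recall)"], 2))  # 0.99 -- almost all positives caught
print(round(rates["specificity (TNR)"], 2))          # 0.0  -- every negative misclassified
```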
Correlation between MCC, accuracy, and F1 score values
| N | PCC (MCC, F1 score) | PCC (MCC, accuracy) | PCC (accuracy, F1 score) |
|---|---|---|---|
| 10 | 0.742162 | 0.869778 | 0.744323 |
| 25 | 0.757044 | 0.893572 | 0.760708 |
| 50 | 0.766501 | 0.907654 | 0.769752 |
| 75 | 0.769883 | 0.912530 | 0.772917 |
| 100 | 0.771571 | 0.914926 | 0.774495 |
| 200 | 0.774060 | 0.918401 | 0.776830 |
| 300 | 0.774870 | 0.919515 | 0.777595 |
| 400 | 0.775270 | 0.920063 | 0.777976 |
| 500 | 0.775509 | 0.920388 | 0.778201 |
| 1 000 | 0.775982 | 0.921030 | 0.778652 |
Pearson correlation coefficient (PCC) between MCC, accuracy, and F1 score, computed on all possible confusion matrices with a given number of samples N
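The experiment behind this table can be sketched by enumerating every confusion matrix (TP, FN, TN, FP) summing to N, scoring each, and correlating the metric vectors. How the paper treats matrices for which MCC or F1 is undefined is not stated here, so this sketch simply skips them; the resulting PCC values are therefore only indicative, not a reproduction of the table.

```python
from itertools import product
from math import sqrt

def metrics(tp, fn, tn, fp):
    """Accuracy, F1, and MCC for one confusion matrix, or None if MCC/F1 is undefined."""
    n = tp + fn + tn + fp
    f1_den = 2 * tp + fp + fn
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if f1_den == 0 or mcc_den == 0:
        return None  # degenerate matrix: skipped (an assumption of this sketch)
    return (tp + tn) / n, 2 * tp / f1_den, (tp * tn - fp * fn) / mcc_den

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

N = 10
rows = []
for tp, fn, tn in product(range(N + 1), repeat=3):
    fp = N - tp - fn - tn
    if fp < 0:
        continue
    m = metrics(tp, fn, tn, fp)
    if m is not None:
        rows.append(m)

accs, f1s, mccs = zip(*rows)
print(round(pearson(mccs, accs), 3))  # strongly positive, as in the table
```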
Fig. 1 Relationship between MCC and F1 score. Scatterplot of all the 21 084 251 possible confusion matrices for a dataset with 500 samples on the MCC/F1 plane. In red, the (−0.04, 0.95) point corresponding to use case A1
Fig. 2 Use case A1 — Positively imbalanced dataset. a Barplot representing accuracy, F1, and normalized Matthews correlation coefficient (normMCC = (MCC + 1) / 2), all in the [0, 1] interval, where 0 is the worst possible score and 1 is the best possible score, applied to the Use case A1 positively imbalanced dataset. b Pie chart representing the amounts of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP). c Pie chart representing the dataset balance, as the amounts of positive data instances and negative data instances
Recap of the six use cases' results
| Use case | Pos | Neg | TP | FN | TN | FP | Accuracy [0, 1] | F1 score [0, 1] | MCC [−1, +1] | Figure | Informative response |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 Positively imbalanced dataset | 91 | 9 | 90 | 1 | 0 | 9 | 0.90 | 0.95 | **−0.03** | Figure 2 | MCC |
| A2 Positively imbalanced dataset | 75 | 25 | 5 | 70 | 19 | 6 | **0.24** | **0.12** | **−0.24** | Suppl. Additional file | accuracy, F1, MCC |
| B1 Balanced dataset | 50 | 50 | 47 | 3 | 5 | 45 | **0.52** | 0.66 | **+0.07** | Suppl. Additional file | accuracy, MCC |
| B2 Balanced dataset | 50 | 50 | 10 | 40 | 46 | 4 | **0.56** | **0.31** | **+0.17** | Suppl. Additional file | accuracy, F1, MCC |
| C1 Negatively imbalanced dataset | 10 | 90 | 9 | 1 | 1 | 89 | **0.10** | **0.17** | **−0.19** | Suppl. Additional file | accuracy, F1, MCC |
| C2 Negatively imbalanced dataset | 11 | 89 | 2 | 9 | 88 | 1 | 0.90 | **0.29** | **+0.31** | Suppl. Additional file | F1, MCC |
For Use case A1, MCC is the only statistical rate that truthfully informs the readership about the poor performance of the classifier. For Use case B1, MCC and accuracy inform about the poor performance of the classifier in the prediction of negative data instances, while for Use cases A2, B2, and C1, all three rates (accuracy, F1, and MCC) show this information. For Use case C2, MCC and F1 recognize the weak performance of the algorithm in predicting one of the two original dataset classes. pos: number of positives. neg: number of negatives. TP: true positives. FN: false negatives. TN: true negatives. FP: false positives. Informative response: list of confusion matrix rates able to reflect the poor performance of the classifier in the prediction task. We highlighted in bold the informative response of each use case
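Each score in the recap follows directly from the corresponding confusion matrix. A minimal sketch for use case A1 (TP=90, FN=1, TN=0, FP=9), also showing the normMCC rescaling used in Fig. 2:

```python
from math import sqrt

def scores(tp, fn, tn, fp):
    """Accuracy, F1 score, and MCC for one confusion matrix (MCC denominator assumed nonzero)."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, f1, mcc

acc, f1, mcc = scores(tp=90, fn=1, tn=0, fp=9)
print(round(acc, 2), round(f1, 2), round(mcc, 2))  # 0.9 0.95 -0.03
print(round((mcc + 1) / 2, 2))                     # normMCC = 0.48, clearly below mid-scale
```

Accuracy and F1 look excellent because they ignore the 9 misclassified negatives, while MCC, which requires good performance on both classes, stays near zero.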
Colon cancer prediction rankings
| Classifier | MCC | F1 score | Accuracy | TP rate | TN rate |
|---|---|---|---|---|---|
| **MCC ranking:** | | | | | |
| Gradient boosting | **+0.55** | 0.81 | 0.78 | 0.85 | 0.69 |
| Decision tree | **+0.53** | 0.82 | 0.77 | 0.88 | 0.58 |
| | **+0.48** | 0.87 | 0.80 | 0.92 | 0.52 |
| Linear SVM | **+0.41** | 0.82 | 0.76 | 0.86 | 0.53 |
| Radial SVM | **+0.29** | 0.75 | 0.67 | 0.86 | 0.40 |
| **F1 score ranking:** | | | | | |
| | +0.48 | **0.87** | 0.80 | 0.92 | 0.52 |
| Linear SVM | +0.41 | **0.82** | 0.76 | 0.86 | 0.53 |
| Decision tree | +0.53 | **0.82** | 0.77 | 0.88 | 0.58 |
| Gradient boosting | +0.55 | **0.81** | 0.78 | 0.85 | 0.69 |
| Radial SVM | +0.29 | **0.75** | 0.67 | 0.86 | 0.40 |
| **Accuracy ranking:** | | | | | |
| | +0.48 | 0.87 | **0.80** | 0.92 | 0.52 |
| Gradient boosting | +0.55 | 0.81 | **0.78** | 0.85 | 0.69 |
| Decision tree | +0.53 | 0.82 | **0.77** | 0.88 | 0.58 |
| Linear SVM | +0.41 | 0.82 | **0.76** | 0.86 | 0.53 |
| Radial SVM | +0.29 | 0.75 | **0.67** | 0.86 | 0.40 |
Prediction results on the colon cancer gene expression dataset, based on MCC, F1 score, and accuracy. Linear SVM: support vector machine with linear kernel. MCC: worst value −1 and best value +1. F1 score, accuracy, TP rate, and TN rate: worst value 0 and best value 1. To avoid additional complexity and keep this table simple to read, we preferred to exclude the standard deviation of each result metric. We highlighted in bold the ranking column of each rate
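The three rankings above are the same rows sorted by three different keys, which is why the same classifier can place first under MCC yet fourth under F1. A sketch using the four named classifiers' reported scores (the row whose classifier name was lost in extraction is omitted; the dictionary layout is illustrative):

```python
# Reported scores from the colon cancer table (named classifiers only)
results = {
    "Gradient boosting": {"MCC": 0.55, "F1": 0.81, "Accuracy": 0.78},
    "Decision tree":     {"MCC": 0.53, "F1": 0.82, "Accuracy": 0.77},
    "Linear SVM":        {"MCC": 0.41, "F1": 0.82, "Accuracy": 0.76},
    "Radial SVM":        {"MCC": 0.29, "F1": 0.75, "Accuracy": 0.67},
}

def ranking(metric):
    """Classifiers sorted best-first by the chosen metric."""
    return sorted(results, key=lambda clf: results[clf][metric], reverse=True)

print(ranking("MCC"))       # Gradient boosting leads
print(ranking("F1"))        # Decision tree / Linear SVM tie at 0.82; gradient boosting drops
print(ranking("Accuracy"))
```

The disagreement between the lists is the paper's point: metric choice alone reshuffles which classifier "wins".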