| Literature DB >> 33145661 |
Alberto Casagrande, Francesco Fabris, Rossano Girometti.
Abstract
Agreement measures are useful tools both to compare different evaluations of the same diagnostic outcomes and to validate new rating systems or devices. Cohen's kappa (κ) is certainly the most popular agreement method between two raters, and has proved its effectiveness over the last sixty years. Despite this, the method suffers from some alleged issues, which have been highlighted since the 1970s; moreover, its value depends strongly on the prevalence of the disease in the considered sample. This work introduces a new agreement index, the informational agreement (IA), which seems to avoid some of Cohen's kappa's flaws and separates the contribution of prevalence from the nucleus of agreement. These goals are achieved by modelling agreement, in both the dichotomous and the multivalue ordered-categorical case, as the information shared between the two raters through the virtual diagnostic channel connecting them: the more information exchanged between the raters, the higher their agreement. To test the fairness and effectiveness of the method, IA has been applied to some cases known to be problematic for κ, to a machine learning context, and to a clinical scenario comparing ultrasound (US) and the automated breast volume scanner (ABVS) in the setting of breast cancer imaging.

Graphical Abstract: To evaluate the agreement between the two raters [Formula: see text] and [Formula: see text], we create an agreement channel, based on Shannon information theory, that directly connects the random variables X and Y expressing the raters' outcomes. They are the terminals of the chain X ⇔ diagnostic test performed by [Formula: see text] ⇔ patient condition [Formula: see text] ⇔ diagnostic test performed by [Formula: see text] ⇔ Y.
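The channel view of agreement can be made concrete with a small sketch: given a joint contingency table of the raters' outcomes X and Y, the mutual information I(X;Y) measures how much information the two raters share. This is only an illustration of the underlying quantity; the exact normalisation that turns it into the IA index is defined in the paper and is not reproduced here.

```python
import math

def mutual_information(table):
    """I(X;Y) in bits from a contingency table (rows: Y, columns: X)."""
    n = sum(sum(row) for row in table)
    px = [sum(table[i][j] for i in range(len(table))) / n
          for j in range(len(table[0]))]
    py = [sum(row) / n for row in table]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count:
                pxy = count / n
                mi += pxy * math.log2(pxy / (px[j] * py[i]))
    return mi

# Perfect agreement on a balanced binary sample shares one full bit,
# while statistically independent ratings share none.
print(mutual_information([[50, 0], [0, 50]]))    # 1.0
print(mutual_information([[25, 25], [25, 25]]))  # 0.0
```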
Keywords: Cohen’s kappa statistic; Diagnostic agreement; Information measures; Inter-reader agreement; Multivalue ordered-categorical ratings
Year: 2020 PMID: 33145661 PMCID: PMC7679268 DOI: 10.1007/s11517-020-02261-2
Source DB: PubMed Journal: Med Biol Eng Comput ISSN: 0140-0118 Impact factor: 2.602
Fig. 1 The agreement channel directly connects the random variables X and Y, which are the terminals of the chain X ⇔ diagnostic test performed by the first rater ⇔ patient condition ⇔ diagnostic test performed by the second rater ⇔ Y
The scenarios examined to compare IA and κ (X labels the columns, Y the rows of each 2×2 table)

| (a) Scenario 1 | | (b) Scenario 2 | |
|---|---|---|---|
| 3600 | 2595 | 9901 | 64 |
| 65 | 3740 | 2 | 33 |
| (c) Scenario 3 | | (d) Scenario 4 | |
| 9900 | 86 | 21 | 5 |
| 1 | 13 | 3 | 21 |
| (e) Scenario 5 | | (f) Scenario 6 | |
| 40 | 5 | 40 | 2 |
| 3 | 2 | 3 | 5 |
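Cohen's κ for each 2×2 scenario can be reproduced directly from the counts. Comparing, for instance, Scenarios 2 and 3, which both show raw agreement above 99%, illustrates how strongly the marginal prevalence drives κ; the values below are our own recomputation from the tables, not figures quoted from the paper.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table (rows: Y, columns: X)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(row[i] * col[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)

scenario2 = [[9901, 64], [2, 33]]   # raw agreement 99.34%
scenario3 = [[9900, 86], [1, 13]]   # raw agreement 99.13%
print(round(cohens_kappa(scenario2), 2))  # 0.5
print(round(cohens_kappa(scenario3), 2))  # 0.23
```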
Raw agreement data between ultrasound (US) and automated breast volume scanner (ABVS) in assessing breast cancer findings according to all BI-RADS classes (BI-RADS class k abbreviated to k in the table for brevity)

| | US 1 | US 2 | US 3 | US 4 | US 5 | Total |
|---|---|---|---|---|---|---|
| ABVS 1 | 51 | 4 | 0 | 1 | 1 | 57 |
| ABVS 2 | 3 | 78 | 1 | 0 | 0 | 82 |
| ABVS 3 | 0 | 0 | 13 | 4 | 0 | 17 |
| ABVS 4 | 0 | 1 | 1 | 16 | 7 | 25 |
| ABVS 5 | 0 | 0 | 0 | 0 | 5 | 5 |
| Total | 54 | 83 | 15 | 21 | 13 | 186 |
Cohen’s kappa and IA between US and ABVS in dichotomised BI-RADS classes (negative: BI-RADS 1–2; positive: BI-RADS 3–5)

| | US 1–2 | US 3–5 | Total |
|---|---|---|---|
| ABVS 1–2 | 136 | 3 | 139 |
| ABVS 3–5 | 1 | 46 | 47 |
| Total | 137 | 49 | 186 |
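The dichotomised US/ABVS counts are enough to recompute Cohen's κ by hand; the sketch below does so (the resulting value is our own recomputation from the table, not a figure quoted from the paper).

```python
def cohens_kappa_2x2(a, b, c, d):
    """Cohen's kappa from the four cells of a 2x2 agreement table."""
    n = a + b + c + d
    p_obs = (a + d) / n                                      # observed agreement
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# ABVS rows, US columns: 136 concordant negatives, 46 concordant positives.
kappa = cohens_kappa_2x2(136, 3, 1, 46)
print(round(kappa, 3))  # 0.944
```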
Two agreement matrices relating the classifications performed by the pairs kNN-naïve Bayes (KB) (Table 5a) and random forest-SGD (FS) (Table 5b) on the Tic-Tac-Toe data set (DS4). The first rows/columns of these matrices count the correctly classified entries (C), while the misclassified ones (W) are collected in the second rows/columns
| | | Naïve Bayes | | | | SGD | |
|---|---|---|---|---|---|---|---|
| | | C | W | | | C | W |
| kNN | C | 547 | 134 | Random forest | C | 903 | 6 |
| | W | 120 | 157 | | W | 39 | 10 |
| (a) The agreement matrix for kNN-naïve Bayes (KB) | | | | (b) The agreement matrix for random forest-SGD (FS) | | | |
Fig. 2 Choosing the best threshold in dichotomising multivalue ordered-categorical ratings. The maximum agreement is obtained in correspondence with the standard dichotomisation 1-2/3-4-5 for both κ and IA
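The threshold search of Fig. 2 can be sketched as follows, reusing the raw 5×5 US/ABVS counts reported earlier: each cut point t splits the ordered classes into {1..t} versus {t+1..5}, the table is collapsed accordingly, and κ is evaluated at every cut (the same scan could be run with IA in place of κ). The κ values are our own recomputation, not figures quoted from the paper.

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa from the four cells of a 2x2 agreement table."""
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Raw agreement counts: rows are ABVS classes 1..5, columns US classes 1..5.
M = [[51, 4, 0, 1, 1],
     [3, 78, 1, 0, 0],
     [0, 0, 13, 4, 0],
     [0, 1, 1, 16, 7],
     [0, 0, 0, 0, 5]]

def dichotomise(M, t):
    """Collapse classes 1..t vs t+1..5 into a 2x2 table (a, b, c, d)."""
    a = sum(M[i][j] for i in range(t) for j in range(t))
    b = sum(M[i][j] for i in range(t) for j in range(t, 5))
    c = sum(M[i][j] for i in range(t, 5) for j in range(t))
    d = sum(M[i][j] for i in range(t, 5) for j in range(t, 5))
    return a, b, c, d

kappas = {t: kappa_2x2(*dichotomise(M, t)) for t in range(1, 5)}
best = max(kappas, key=kappas.get)
print(best, round(kappas[best], 3))  # 2 0.944  (the 1-2/3-4-5 split)
```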
A comparison between IA and κ on a Machine Learning domain
| | FK | FS | FB | KS | KB | SB | Pearson r | Spearman r |
|---|---|---|---|---|---|---|---|---|
| DS0 | 0.28 (5) | 0.42 (2) | 0.30 (3) | 0.28 (4) | 0.56 (1) | 0.19 (6) | 0.98 | 0.77 |
| | 0.46 (3) | 0.63 (2) | 0.43 (5) | 0.44 (4) | 0.72 (1) | 0.31 (6) | | |
| DS1 | 0.54 (2) | 0.33 (6) | 0.71 (1) | 0.36 (5) | 0.53 (3) | 0.39 (4) | 0.92 | 0.83 |
| | 0.73 (2) | 0.56 (5) | 0.77 (1) | 0.58 (4) | 0.63 (3) | 0.54 (6) | | |
| DS2 | 0.63 (1) | 0.52 (2) | 0.22 (4) | 0.47 (3) | 0.17 (5) | 0.14 (6) | 0.98 | 0.94 |
| | 0.79 (1) | 0.64 (2) | 0.37 (4) | 0.57 (3) | 0.30 (6) | 0.35 (5) | | |
| DS3 | 0.14 (4) | 0.33 (1) | 0.28 (2) | 0.08 (6) | 0.10 (5) | 0.19 (3) | 0.93 | 0.94 |
| | 0.21 (5) | 0.51 (1) | 0.41 (2) | 0.20 (6) | 0.28 (4) | 0.41 (3) | | |
| DS4 | 0.11 (3) | 0.25 (1) | 0.18 (2) | 0.04 (6) | 0.11 (4) | 0.04 (5) | 0.61 | 0.60 |
| | 0.15 (4) | 0.29 (2) | 0.17 (3) | 0.08 (5) | 0.36 (1) | 0.03 (6) | | |
| DS5 | 0.05 (6) | 0.28 (3) | 0.43 (1) | 0.06 (5) | 0.06 (4) | 0.34 (2) | 0.99 | 0.77 |
| | 0.23 (4) | 0.55 (3) | 0.67 (1) | 0.21 (5) | 0.21 (6) | 0.62 (2) | | |
Six data sets from the UCI Machine Learning Repository [19] were considered: the Congressional Voting Records Data Set (DS0) [39], the Breast Cancer Wisconsin (Diagnostic) Data Set (DS1) [52], the Iris Data Set (DS2) [21], the Spambase Data Set (DS3) [27], the Tic-Tac-Toe Endgame Data Set (DS4) [3], and the Heart Disease Data Set (DS5) [28]. Each data set was used to train random forest, k-nearest neighbours (kNN), stochastic gradient descent (SGD), and naïve Bayes models. The pairs of models random forest-kNN (FK), random forest-SGD (FS), random forest-naïve Bayes (FB), kNN-SGD (KS), kNN-naïve Bayes (KB), and SGD-naïve Bayes (SB) were then compared according to their correct classifications of the data set entries, and their IA and κ were evaluated. Finally, the Spearman rank correlation coefficient (r) [48] between the sequences of IA and κ values was computed. All reported values were rounded to the second decimal digit. The numbers inside round parentheses represent the rank of the associated value among those on the same row
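Since the per-pair ranks are reported in parentheses, the Spearman coefficient for a data set can be recomputed with the standard no-ties formula r = 1 − 6Σd²/(n(n²−1)); for DS0, the ranks of the two rows give back the reported 0.77. A minimal sketch:

```python
def spearman_from_ranks(ranks_x, ranks_y):
    """Spearman rank correlation (no ties) from two rank sequences."""
    n = len(ranks_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Ranks for DS0 over (FK, FS, FB, KS, KB, SB), read off the table above.
row1 = [5, 2, 3, 4, 1, 6]
row2 = [3, 2, 5, 4, 1, 6]
print(round(spearman_from_ranks(row1, row2), 2))  # 0.77
```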
Fig. 3 A scatter plot of the IA-κ values for the pairs of models random forest-kNN (FK), random forest-SGD (FS), random forest-naïve Bayes (FB), kNN-SGD (KS), kNN-naïve Bayes (KB), and SGD-naïve Bayes (SB) trained on the Tic-Tac-Toe data set. It is easy to see that the black points are strongly correlated, while the red point, corresponding to KB, lies far from any reasonable model fitting the others