| Literature DB >> 36038846 |
Xinshu Zhao1, Guangchao Charles Feng2, Song Harris Ao2, Piper Liping Liu2.
Abstract
BACKGROUND: Interrater reliability, aka intercoder reliability, is defined as true agreement between raters, aka coders, without chance agreement. It is used across many disciplines, including medical and health research, to measure the quality of ratings, coding, diagnoses, or other observations and judgements. While numerous indices of interrater reliability are available, experts disagree on which ones are legitimate or more appropriate. Almost all agree that percent agreement (ao), the oldest and simplest index, is also the most flawed, because it fails to estimate and remove chance agreement, which is produced by raters' random rating. The experts, however, disagree on which chance estimators are legitimate or better. The experts also disagree on which of the three factors, rating category, distribution skew, or task difficulty, an index should rely on to estimate chance agreement, or which factors the known indices in fact rely on. The most popular chance-adjusted indices, according to a functionalist view of mathematical statistics, assume that all raters conduct intentional and maximum random rating, while typical raters conduct involuntary and reluctant random rating. The mismatches between the assumed and the actual rater behaviors cause the indices to rely on mistaken factors to estimate chance agreement, leading to the numerous paradoxes, abnormalities, and other misbehaviors of the indices identified by prior studies.
Keywords: Cohen’s kappa; Intercoder reliability; Interrater reliability; Krippendorff’s alpha; Reconstructed experiment
Year: 2022 PMID: 36038846 PMCID: PMC9426226 DOI: 10.1186/s12874-022-01707-5
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
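The chance-adjusted indices named in the abstract all apply the same correction, (ao − ac) / (1 − ac), and differ only in how they estimate the chance agreement ac. A minimal two-rater sketch of four of them (our illustrative code and function names, not the authors'; Krippendorff's α, which also handles multiple raters and missing data, is omitted for brevity):

```python
from collections import Counter

def percent_agreement(r1, r2):
    """ao: proportion of items on which the two raters agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def bennett_s(r1, r2, n_categories):
    """Bennett et al.'s S: chance agreement fixed at 1/C (category-based)."""
    ao = percent_agreement(r1, r2)
    ac = 1.0 / n_categories
    return (ao - ac) / (1 - ac)

def scott_pi(r1, r2):
    """Scott's pi: chance agreement from pooled marginal proportions (skew-based)."""
    ao = percent_agreement(r1, r2)
    pooled = Counter(r1) + Counter(r2)
    total = len(r1) + len(r2)
    ac = sum((n / total) ** 2 for n in pooled.values())
    return (ao - ac) / (1 - ac)

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance agreement from each rater's own marginals."""
    ao = percent_agreement(r1, r2)
    n = len(r1)
    c1, c2 = Counter(r1), Counter(r2)
    ac = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (ao - ac) / (1 - ac)
```

On the same ratings the indices can disagree noticeably, which is the paper's point: each estimator leans on a different factor (category count vs. marginal skew) to guess the chance component.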
A category (C) by difficulty (df) by skew (sk) reconstructed experiment^a
| Difference in pixels (px) | Difficulty (df) | sk 50&50, C=2 | C=4 | C=6 | C=8 | sk 25&75 / 75&25, C=2 | C=4 | C=6 | C=8 | sk 1&99 / 99&1, C=2 | C=4 | C=6 | C=8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | =1.000 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 2 | ≈0.8571 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 3 | ≈0.7143 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 4 | ≈0.5714 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 5 | ≈0.4286 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 6 | ≈0.2857 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 7 | ≈0.1429 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 8 | =0.0000 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
^a Main cell entries are the number of reconstructed rating sessions (subjects) in each experimental condition (cell)
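The design implied by the table (4 categories × 3 skew groups × 8 difficulty levels, 4 sessions per cell) can be reproduced directly; a sketch of our reconstruction, not the authors' code:

```python
from itertools import product

# Factor levels as read from the table (our reconstruction).
categories = [2, 4, 6, 8]
skew_groups = ["50&50", "25&75/75&25", "1&99/99&1"]
# Difficulty drops from 1.000 (px diff = 1) to 0.000 (px diff = 8) in steps of 1/7.
difficulties = [(8 - px) / 7 for px in range(1, 9)]

# Full factorial crossing, 4 rating sessions in every cell.
cells = list(product(categories, skew_groups, difficulties))
sessions_per_cell = 4
total_sessions = len(cells) * sessions_per_cell  # matches Nc = 384 in the tables below
```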
Concepts and variables
| | Author or Origin | Reliability (True Agreement) | Chance Agreement |
|---|---|---|---|
| | Generic for any index | ri | ac |
| Dependent Variables (Index Estimation) | %-Agreement (unknown author) | ao | aoac |
| | Bennett et al. (1954) | S | Sac |
| | Perreault & Leigh (1989) | Ir | Irac |
| | Gwet (2002, 2008, 2010, 2012) | AC1 | ACac |
| | Scott (1955) | π | πac |
| | Cohen (1960) | κ | κac |
| | Krippendorff (1970, 1980) | α | αac |
| Empirical Observation (Primary Indicator) | | ori observed interrater reliability | oac observed chance agreement |
| Empirical Observation (Secondary Indicator, used in calculation) | | oar observed right agreement; ao observed agreement | oae observed erroneous agreement; do observed disagreement |

| | Denotation | Concept |
|---|---|---|
| Independent Variables | C | Category |
| | sk | Distribution Skew |
| | df or es | Difficulty or Easiness |
| Other Concepts | em | error of means (mean estimation minus mean target) |
| | me | mean of errors (mean of differences between estimation and target) |
| | sdm | standard deviation of an observed target of estimation (oac, ori) |
| | dr2 | directional r squared |
| | Nc | No. of rating sessions |
| | Nd | No. of rating decisions within a session |
Effects of estimation targets, category, skew & difficulty on observed or estimated chance agreement and reliability (dr2)
| | | | A. | B. | C. | D. | E. | F. | G. | H. |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1 | Right: Source or Author | Observation | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Effects on Interrater Reliability Observations & Estimates | 2 | Right: Observed / Estimated Interrater Reliability as Dependent Variables; Down: Independent Variables | ori | ao | S | Ir | AC1 | π | κ | α |
| | 3 | Observed Reliability (ori) | 1.00*** | .841*** | .691*** | .599*** | .721*** | .312*** | .312*** | .312*** |
| | 4 | Category (C) | .003 | −.002 | .175*** | .185*** | .123*** | .001 | .001 | .001 |
| | 5 | Distribution Skew (sk) | .000 | .000 | .000 | −.000 | .003 | −.293*** | −.292*** | −.293*** |
| | 6 | Difficulty (df) | −.774*** | −.778*** | −.566*** | −.434*** | −.554*** | −.389*** | −.389*** | −.389*** |
| Effects on Chance Agreement Observations & Estimates | 7 | Right: Observed / Estimated Chance Agreement as Dependent Variables; Down: Independent Variables | oac | aoac = 0^a | Sac | Irac | ACac | πac | κac | αac |
| | 8 | Observed Chance Agreement (oac) | 1.00*** | – | .021** | .021** | .075*** | −.151*** | −.152*** | −.151*** |
| | 9 | Category (C) | −.019** | – | −.863*** | −.863*** | −.661*** | −.013* | −.014* | −.013* |
| | 10 | Distribution Skew (sk) | −.001 | – | .000 | .000 | −.039*** | .437*** | .434*** | .437*** |
| | 11 | Difficulty (df) | .585*** | – | .000 | .000 | .009 | −.123*** | −.125*** | −.123*** |
| N | 12 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 13 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Main cell entries are directional r squared (dr2), which are r squared with the directional sign of r, dr2 = r•|r|
*: p<.05; **: p<.01; ***: p<.001
^a As aoac, the chance estimate of ao, is a constant, its correlations (dr2) with other variables cannot be calculated
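The dr2 statistic defined in the note above (r squared carrying the directional sign of r) is straightforward to compute; a minimal sketch with an illustrative function name:

```python
def directional_r_squared(r):
    """dr2 = r * |r|: the squared correlation keeping the sign of r,
    so negative effects remain visibly negative in the tables."""
    return r * abs(r)
```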
Mean of errors (me) / distance between index estimations and targets of estimation
| | | | A. | B. | C. | D. | E. | F. | G. |
|---|---|---|---|---|---|---|---|---|---|
| | 1 | Author or Source | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Interrater Reliability | 2 | Interrater Reliability Estimator | ao | S | Ir | AC1 | π | κ | α |
| | 3 | me(ri) = mean(\|ri − ori\|) (0 ≤ me ≤ 1) | .130*** | .096*** | .180*** | .093*** | .327*** | .324*** | .323*** |
| | 4 | Standard Deviation of me(ri) | .145 | .099 | .148 | .104 | .221 | .220 | .220 |
| | 5 | 95% confidence interval of me(ri) | .115 ~ .144 | .086 ~ .106 | .164 ~ .194 | .082 ~ .103 | .304 ~ .349 | .302 ~ .346 | .301 ~ .345 |
| Chance Agreement | 6 | Chance Agreement Estimator | aoac | Sac | Irac | ACac | πac | κac | αac |
| | 7 | me(ac) = mean(\|ac − oac\|) (0 ≤ me ≤ 1) | .130*** | .182*** | .182*** | .130*** | .450*** | .448*** | .448*** |
| | 8 | Standard Deviation of me(ac) | .145 | .141 | .141 | .127 | .201 | .201 | .202 |
| | 9 | 95% confidence interval of me(ac) | .115 ~ .144 | .168 ~ .196 | .168 ~ .196 | .117 ~ .143 | .429 ~ .470 | .428 ~ .469 | .427 ~ .468 |
| N | 10 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 11 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
*: p<.05, **: p<.01, ***: p<.001
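The me statistic in the table above is the average absolute distance between an index's estimates and the observed targets across sessions; a sketch (our naming):

```python
def mean_of_errors(estimates, observed):
    """me = mean(|estimate - observed target|): average absolute distance
    between an index's estimates and the observed values, 0 <= me <= 1."""
    return sum(abs(e, ) if False else abs(e - o) for e, o in zip(estimates, observed)) / len(estimates)
```

Because each term is an absolute value, over- and underestimates both count toward me, unlike the signed em statistic reported in the next table.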
Means and error of means (em): index estimations against observations
| | | | A. | B. | C. | D. | E. | F. | G. | H. |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1 | Right: Author or Source | Observed Agreement | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Interrater Reliability | 2 | Observed or Estimated Reliability (denotation) | ori | ao | S | Ir | AC1 | π | κ | α |
| | 3 | Observed / Estimated Interrater Reliability | .555 | .685 | .556 | .726 | .600 | .237 | .240 | .241 |
| | 4 | Standard Deviation | .248 | .122 | .203 | .173 | .192 | .249 | .247 | .248 |
| | 5 | Range (minimum ~ maximum) | −.20 ~ .90 | .42 ~ .92 | −.10 ~ .856 | .0 ~ .925 | −.045 ~ .912 | −.177 ~ .778 | −.173 ~ .778 | −.17 ~ .779 |
| | 6 | em(ri) = mean(ri) − mean(ori) (−1 ≤ em ≤ 1) | .000 | .130*** | .001 | .171*** | .044*** | −.318*** | −.315*** | −.314*** |
| | 7 | 95% confidence interval | .00 ~ .00 | .115 ~ .144 | −.013 ~ .015 | .155 ~ .186 | .031 ~ .058 | −.341 ~ −.295 | −.338 ~ −.292 | −.338 ~ −.291 |
| Chance Agreement | 8 | Chance Agreement (denotation) | oac | aoac | Sac | Irac | ACac | πac | κac | αac |
| | 9 | Observed or Estimated Chance Agreement | .130 | .000 | .260 | .260 | .173 | .575 | .573 | .572 |
| | 10 | Standard Deviation | .145 | .000 | .146 | .146 | .148 | .109 | .109 | .110 |
| | 11 | Range (minimum ~ maximum) | .0 ~ .72 | .0 ~ .0 | .125 ~ .50 | .125 ~ .50 | .022 ~ .50 | .448 ~ .905 | .447 ~ .905 | .445 ~ .905 |
| | 12 | em(ac) = mean(ac) − mean(oac) (−1 ≤ em ≤ 1) | .000 | −.130*** | .131*** | .131*** | .044*** | .445*** | .443*** | .443*** |
| | 13 | 95% confidence interval | .00 ~ .00 | −.144 ~ −.115 | .111 ~ .15 | .111 ~ .15 | .026 ~ .061 | .423 ~ .466 | .422 ~ .465 | .421 ~ .464 |
| N | 14 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 15 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
*: p<.05, **: p<.01, ***: p<.001
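Unlike me, the em statistic in the table above subtracts the means before comparing, so opposite-signed errors can cancel and em carries the direction of an index's bias; a sketch (our naming):

```python
def error_of_means(estimates, observed):
    """em = mean(estimates) - mean(observed): signed bias of an index,
    -1 <= em <= 1. Overestimates and underestimates can cancel, so
    |em| is always <= me for the same data."""
    return sum(estimates) / len(estimates) - sum(observed) / len(observed)
```

For example, estimates of .8 and .6 against observed values of .7 and .8 give em = −.05 while me = .15: the signed errors partly cancel.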
Effects of category, skew, and difficulty on observed chance agreement, reliability, and index estimations (average scores)
| A. | B. | C. | D. | E. | F. | G. | H. | I. | J. | K. | L. | M. | N | O | P | Q | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reliability Observation or Estimation | Chance Agreement Observation or Estimation | ||||||||||||||||||
| 1 | Author / Source | Observed | %-Agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff | Observed | %-Agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff | | |
| 2 | Estimator: | ori | ao | S | Ir | AC1 | π | κ | α | oac | aoac | Sac | Irac | ACac | πac | κac | αac | Nc | |
| 3 | Ground 0 | .685 | .370 | .608 | .371 | .369 | .370 | .373 | 0 | .500 | .500 | .499 | .501 | .500 | .498 | 32 | | |
| 4 | Category (C) | 2 | .701 | .402 | .584 | .470 | .230 | .232 | .234 | 0 | .500 | .500 | .401 | .598 | .597 | .596 | 96 | ||
| 5 | 4 | .678 | .571 | .747 | .621 | .226 | .230 | .230 | 0 | .250 | .250 | .142 | .573 | .571 | .571 | 96 | |||
| 6 | 6 | .676 | .612 | .777 | .644 | .239 | .241 | .242 | 0 | .167 | .167 | .087 | .562 | .561 | .561 | 96 | |||
| 7 | 8 | .686 | .641 | .796 | .664 | .254 | .257 | .257 | 0 | .125 | .125 | .062 | .564 | .563 | .562 | 96 | |||
| 8 | Skew (sk) | .50 | .688 | .560 | .732 | .592 | .370 | .372 | .374 | 0 | .260 | .260 | .203 | .501 | .500 | .498 | 128 | ||
| 9 | .75 | .678 | .547 | .722 | .588 | .302 | .304 | .305 | 0 | .260 | .260 | .186 | .545 | .543 | .543 | 128 | |||
| 10 | .99 | .690 | .561 | .723 | .619 | .040 | .044 | .045 | 0 | .260 | .260 | .132 | .678 | .676 | .676 | 128 | |||
| 11 | Difficulty (df) | .000 | .844 | .782 | .884 | .810 | .482 | .484 | .485 | 0 | .260 | .260 | .152 | .630 | .629 | .628 | 48 | ||
| 12 | .143 | .805 | .728 | .852 | .761 | .404 | .406 | .407 | 0 | .260 | .260 | .158 | .616 | .615 | .615 | 48 | |||
| 13 | .286 | .757 | .659 | .808 | .697 | .341 | .343 | .344 | 0 | .260 | .260 | .164 | .599 | .598 | .600 | 48 | |||
| 14 | .429 | .721 | .600 | .765 | .643 | .273 | .275 | .277 | 0 | .260 | .260 | .169 | .591 | .589 | .588 | 48 | |||
| 15 | .571 | .659 | .518 | .706 | .563 | .196 | .199 | .200 | 0 | .260 | .260 | .180 | .565 | .563 | .563 | 48 | |||
| 16 | .714 | .606 | .444 | .647 | .495 | .117 | .121 | .121 | 0 | .260 | .260 | .182 | .548 | .546 | .546 | 48 | |||
| 17 | .857 | .567 | .387 | .591 | .440 | .068 | .071 | .072 | 0 | .260 | .260 | .189 | .534 | .533 | .532 | 48 | |||
| 18 | 1.00 | .523 | .332 | .552 | .389 | .018 | .022 | .022 | 0 | .260 | .260 | .194 | .514 | .512 | .511 | 48 | |||
| 19 | Mean | .685 | .556 | .726 | .600 | .237 | .240 | .241 | 0 | .260 | .260 | .173 | .575 | .573 | .572 | 384 | |||
| 20 | Nd | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
Fig. 1 A sample screen seen by some raters (for category = 6, difficulty = 1)
Fig. 2 Accuracies of Interrater Reliability Indices. Notes: 1. Solid red bars are dr2 between estimated and observed chance agreement. 2. Dotted blue bars are dr2 between estimated and observed interrater reliability. 3. Primary benchmark: dr2 > 0.8. 4. Data source: Lines 3 & 8, Table 3