| Literature DB >> 36038846 |
Xinshu Zhao1, Guangchao Charles Feng2, Song Harris Ao2, Piper Liping Liu2.
Abstract
BACKGROUND: Interrater reliability, aka intercoder reliability, is defined as true agreement between raters, aka coders, without chance agreement. It is used across many disciplines, including medical and health research, to measure the quality of ratings, coding, diagnoses, or other observations and judgements. While numerous indices of interrater reliability are available, experts disagree on which ones are legitimate or more appropriate. Almost all agree that percent agreement (ao), the oldest and simplest index, is also the most flawed, because it fails to estimate and remove chance agreement, which is produced by raters' random rating. The experts, however, disagree on which chance estimators are legitimate or better. The experts also disagree on which of the three factors, rating category, distribution skew, or task difficulty, an index should rely on to estimate chance agreement, or which factors the known indices in fact rely on. The most popular chance-adjusted indices, according to a functionalist view of mathematical statistics, assume that all raters conduct intentional and maximum random rating, while typical raters conduct involuntary and reluctant random rating. The mismatches between the assumed and the actual rater behaviors cause the indices to rely on mistaken factors to estimate chance agreement, leading to the numerous paradoxes, abnormalities, and other misbehaviors of the indices identified by prior studies.
Keywords: Cohen’s kappa; Intercoder reliability; Interrater reliability; Krippendorff’s alpha; Reconstructed experiment
Year: 2022 PMID: 36038846 PMCID: PMC9426226 DOI: 10.1186/s12874-022-01707-5
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
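The chance-adjusted indices named in the abstract all apply the same correction, (ao − ac) / (1 − ac), and differ only in how they estimate the chance agreement ac. A minimal two-rater sketch of four of them (our illustrative code and function names, not the authors'; Krippendorff's α, which also handles multiple raters and missing data, is omitted for brevity):

```python
from collections import Counter

def percent_agreement(r1, r2):
    """ao: proportion of items on which the two raters agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def bennett_s(r1, r2, n_categories):
    """Bennett et al.'s S: chance agreement fixed at 1/C (category-based)."""
    ao = percent_agreement(r1, r2)
    ac = 1.0 / n_categories
    return (ao - ac) / (1 - ac)

def scott_pi(r1, r2):
    """Scott's pi: chance agreement from pooled marginal proportions (skew-based)."""
    ao = percent_agreement(r1, r2)
    pooled = Counter(r1) + Counter(r2)
    total = len(r1) + len(r2)
    ac = sum((n / total) ** 2 for n in pooled.values())
    return (ao - ac) / (1 - ac)

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance agreement from each rater's own marginals."""
    ao = percent_agreement(r1, r2)
    n = len(r1)
    c1, c2 = Counter(r1), Counter(r2)
    ac = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (ao - ac) / (1 - ac)
```

On the same ratings the indices can disagree noticeably, which is the paper's point: each estimator leans on a different factor (category count vs. marginal skew) to guess the chance component.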
A category (C) by difficulty (df) by skew (sk) reconstructed experiment^a
| Difference in pixels (px) | Difficulty (df) | sk 50&50, C=2 | C=4 | C=6 | C=8 | sk 25&75 / 75&25, C=2 | C=4 | C=6 | C=8 | sk 1&99 / 99&1, C=2 | C=4 | C=6 | C=8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | =1.000 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 2 | ≈0.8571 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 3 | ≈0.7143 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 4 | ≈0.5714 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 5 | ≈0.4286 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 6 | ≈0.2857 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 7 | ≈0.1429 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 8 | =0.0000 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
^a Main cell entries are the number of reconstructed rating sessions (subjects) in each experimental condition (cell)
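The design implied by the table (4 categories × 3 skew groups × 8 difficulty levels, 4 sessions per cell) can be reproduced directly; a sketch of our reconstruction, not the authors' code:

```python
from itertools import product

# Factor levels as read from the table (our reconstruction).
categories = [2, 4, 6, 8]
skew_groups = ["50&50", "25&75/75&25", "1&99/99&1"]
# Difficulty drops from 1.000 (px diff = 1) to 0.000 (px diff = 8) in steps of 1/7.
difficulties = [(8 - px) / 7 for px in range(1, 9)]

# Full factorial crossing, 4 rating sessions in every cell.
cells = list(product(categories, skew_groups, difficulties))
sessions_per_cell = 4
total_sessions = len(cells) * sessions_per_cell  # matches Nc = 384 in the tables below
```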
Concepts and variables
| | Author or Origin | Reliability (True Agreement) | Chance Agreement |
|---|---|---|---|
| | Generic for any index | ri | ac |
| Dependent Variables (Index Estimation) | %-Agreement (unknown author) | ao | aoac |
| | Bennett et al. (1954) | S | Sac |
| | Perreault & Leigh (1989) | Ir | Irac |
| | Gwet (2002, 2008, 2010, 2012) | AC1 | ACac |
| | Scott (1955) | π | πac |
| | Cohen (1960) | κ | κac |
| | Krippendorff (1970, 1980) | α | αac |
| Empirical Observation (Primary Indicator) | | ori observed interrater reliability | oac observed chance agreement |
| Empirical Observation (Secondary Indicator, used in calculation) | | oar observed right agreement; ao observed agreement | oae observed erroneous agreement; do observed disagreement |

| | Denotation | Concept |
|---|---|---|
| Independent Variables | C | Category |
| | sk | Distribution Skew |
| | df or es | Difficulty or Easiness |
| Other Concepts | em | error of means (mean estimation minus mean target) |
| | me | mean of errors (mean of differences between estimation and target) |
| | sdm | standard deviation of an observed target of estimation (oac, ori) |
| | dr2 | directional r squared |
| | Nc | No. of rating sessions |
| | Nd | No. of rating decisions within a session |
Effects of estimation targets, category, skew & difficulty on observed or estimated chance agreement and reliability (dr2)
| | | | A. | B. | C. | D. | E. | F. | G. | H. |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1 | Right: Source or Author | Observation | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Effects on Interrater Reliability Observations & Estimates | 2 | Right: Observed / Estimated Interrater Reliability as Dependent Variables; Down: Independent Variables | ori | ao | S | Ir | AC1 | π | κ | α |
| | 3 | Observed Reliability (ori) | 1.00*** | .841*** | .691*** | .599*** | .721*** | .312*** | .312*** | .312*** |
| | 4 | Category (C) | .003 | −.002 | .175*** | .185*** | .123*** | .001 | .001 | .001 |
| | 5 | Distribution Skew (sk) | .000 | .000 | .000 | −.000 | .003 | −.293*** | −.292*** | −.293*** |
| | 6 | Difficulty (df) | −.774*** | −.778*** | −.566*** | −.434*** | −.554*** | −.389*** | −.389*** | −.389*** |
| Effects on Chance Agreement Observations & Estimates | 7 | Right: Observed / Estimated Chance Agreement as Dependent Variables; Down: Independent Variables | oac | aoac = 0^a | Sac | Irac | ACac | πac | κac | αac |
| | 8 | Observed Chance Agreement (oac) | 1.00*** | – | .021** | .021** | .075*** | −.151*** | −.152*** | −.151*** |
| | 9 | Category (C) | −.019** | – | −.863*** | −.863*** | −.661*** | −.013* | −.014* | −.013* |
| | 10 | Distribution Skew (sk) | −.001 | – | .000 | .000 | −.039*** | .437*** | .434*** | .437*** |
| | 11 | Difficulty (df) | .585*** | – | .000 | .000 | .009 | −.123*** | −.125*** | −.123*** |
| N | 12 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 13 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Main cell entries are directional r squared (dr2), which are r squared with the directional sign of r, dr2 = r•|r|
*: p<.05; **: p<.01; ***: p<.001
^a As aoac, the chance estimate of ao, is a constant, its correlations (dr2) with other variables cannot be calculated
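The dr2 statistic defined in the note above (r squared carrying the directional sign of r) is straightforward to compute; a minimal sketch with an illustrative function name:

```python
def directional_r_squared(r):
    """dr2 = r * |r|: the squared correlation keeping the sign of r,
    so negative effects remain visibly negative in the tables."""
    return r * abs(r)
```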
Mean of errors (me) / distance between index estimations and targets of estimation
| | | | A. | B. | C. | D. | E. | F. | G. |
|---|---|---|---|---|---|---|---|---|---|
| | 1 | Author or Source | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Interrater Reliability | 2 | Interrater Reliability Estimator | ao | S | Ir | AC1 | π | κ | α |
| | 3 | me(ri) = mean(\|ri − ori\|) (0 ≤ me ≤ 1) | .130*** | .096*** | .180*** | .093*** | .327*** | .324*** | .323*** |
| | 4 | Standard Deviation of me(ri) | .145 | .099 | .148 | .104 | .221 | .220 | .220 |
| | 5 | 95% confidence interval of me(ri) | .115 ~ .144 | .086 ~ .106 | .164 ~ .194 | .082 ~ .103 | .304 ~ .349 | .302 ~ .346 | .301 ~ .345 |
| Chance Agreement | 6 | Chance Agreement Estimator | aoac | Sac | Irac | ACac | πac | κac | αac |
| | 7 | me(ac) = mean(\|ac − oac\|) (0 ≤ me ≤ 1) | .130*** | .182*** | .182*** | .130*** | .450*** | .448*** | .448*** |
| | 8 | Standard Deviation of me(ac) | .145 | .141 | .141 | .127 | .201 | .201 | .202 |
| | 9 | 95% confidence interval of me(ac) | .115 ~ .144 | .168 ~ .196 | .168 ~ .196 | .117 ~ .143 | .429 ~ .470 | .428 ~ .469 | .427 ~ .468 |
| N | 10 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 11 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
*: p<.05, **: p<.01, ***: p<.001
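The me statistic in the table above is the average absolute distance between an index's estimates and the observed targets across sessions; a sketch (our naming):

```python
def mean_of_errors(estimates, observed):
    """me = mean(|estimate - observed target|): average absolute distance
    between an index's estimates and the observed values, 0 <= me <= 1."""
    return sum(abs(e, ) if False else abs(e - o) for e, o in zip(estimates, observed)) / len(estimates)
```

Because each term is an absolute value, over- and underestimates both count toward me, unlike the signed em statistic reported in the next table.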
Means and error of means (em): index estimations against observations
| | | | A. | B. | C. | D. | E. | F. | G. | H. |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1 | Right: Author or Source | Observed Agreement | %-agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff |
| Interrater Reliability | 2 | Observed or Estimated Reliability (denotation) | ori | ao | S | Ir | AC1 | π | κ | α |
| | 3 | Observed / Estimated Interrater Reliability | .555 | .685 | .556 | .726 | .600 | .237 | .240 | .241 |
| | 4 | Standard Deviation | .248 | .122 | .203 | .173 | .192 | .249 | .247 | .248 |
| | 5 | Range (minimum ~ maximum) | −.20 ~ .90 | .42 ~ .92 | −.10 ~ .856 | .0 ~ .925 | −.045 ~ .912 | −.177 ~ .778 | −.173 ~ .778 | −.17 ~ .779 |
| | 6 | em(ri) = mean(ri) − mean(ori) (−1 ≤ em ≤ 1) | .000 | .130*** | .001 | .171*** | .044*** | −.318*** | −.315*** | −.314*** |
| | 7 | 95% confidence interval | .00 ~ .00 | .115 ~ .144 | −.013 ~ .015 | .155 ~ .186 | .031 ~ .058 | −.341 ~ −.295 | −.338 ~ −.292 | −.338 ~ −.291 |
| Chance Agreement | 8 | Chance Agreement (denotation) | oac | aoac | Sac | Irac | ACac | πac | κac | αac |
| | 9 | Observed or Estimated Chance Agreement | .130 | .000 | .260 | .260 | .173 | .575 | .573 | .572 |
| | 10 | Standard Deviation | .145 | .000 | .146 | .146 | .148 | .109 | .109 | .110 |
| | 11 | Range (minimum ~ maximum) | .0 ~ .72 | .0 ~ .0 | .125 ~ .50 | .125 ~ .50 | .022 ~ .50 | .448 ~ .905 | .447 ~ .905 | .445 ~ .905 |
| | 12 | em(ac) = mean(ac) − mean(oac) (−1 ≤ em ≤ 1) | .000 | −.130*** | .131*** | .131*** | .044*** | .445*** | .443*** | .443*** |
| | 13 | 95% confidence interval | .00 ~ .00 | −.144 ~ −.115 | .111 ~ .15 | .111 ~ .15 | .026 ~ .061 | .423 ~ .466 | .422 ~ .465 | .421 ~ .464 |
| N | 14 | Nc (number of rating sessions) | 384 | 384 | 384 | 384 | 384 | 384 | 384 | 384 |
| | 15 | Nd (number of items within each session) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
*: p<.05, **: p<.01, ***: p<.001
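Unlike me, the em statistic in the table above subtracts the means before comparing, so opposite-signed errors can cancel and em carries the direction of an index's bias; a sketch (our naming):

```python
def error_of_means(estimates, observed):
    """em = mean(estimates) - mean(observed): signed bias of an index,
    -1 <= em <= 1. Overestimates and underestimates can cancel, so
    |em| is always <= me for the same data."""
    return sum(estimates) / len(estimates) - sum(observed) / len(observed)
```

For example, estimates of .8 and .6 against observed values of .7 and .8 give em = −.05 while me = .15: the signed errors partly cancel.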
Effects of category, skew, and difficulty on observed chance agreement, reliability, and index estimations (average scores)
| A. | B. | C. | D. | E. | F. | G. | H. | I. | J. | K. | L. | M. | N | O | P | Q | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reliability Observation or Estimation | Chance Agreement Observation or Estimation | ||||||||||||||||||
| 1 | Author / Source | Observed | %-Agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff | Observed | %-Agreement | Bennett et al. | Perreault & Leigh | Gwet | Scott | Cohen | Krippendorff | | |
| 2 | Estimator: | ori | ao | S | Ir | AC1 | π | κ | α | oac | aoac | Sac | Irac | ACac | πac | κac | αac | Nc | |
| 3 | Ground 0 | .685 | .370 | .608 | .371 | .369 | .370 | .373 | 0 | .500 | .500 | .499 | .501 | .500 | .498 | 32 | | |
| 4 | Category (C) | 2 | .701 | .402 | .584 | .470 | .230 | .232 | .234 | 0 | .500 | .500 | .401 | .598 | .597 | .596 | 96 | ||
| 5 | 4 | .678 | .571 | .747 | .621 | .226 | .230 | .230 | 0 | .250 | .250 | .142 | .573 | .571 | .571 | 96 | |||
| 6 | 6 | .676 | .612 | .777 | .644 | .239 | .241 | .242 | 0 | .167 | .167 | .087 | .562 | .561 | .561 | 96 | |||
| 7 | 8 | .686 | .641 | .796 | .664 | .254 | .257 | .257 | 0 | .125 | .125 | .062 | .564 | .563 | .562 | 96 | |||
| 8 | Skew (sk) | .50 | .688 | .560 | .732 | .592 | .370 | .372 | .374 | 0 | .260 | .260 | .203 | .501 | .500 | .498 | 128 | ||
| 9 | .75 | .678 | .547 | .722 | .588 | .302 | .304 | .305 | 0 | .260 | .260 | .186 | .545 | .543 | .543 | 128 | |||
| 10 | .99 | .690 | .561 | .723 | .619 | .040 | .044 | .045 | 0 | .260 | .260 | .132 | .678 | .676 | .676 | 128 | |||
| 11 | Difficulty (df) | .000 | .844 | .782 | .884 | .810 | .482 | .484 | .485 | 0 | .260 | .260 | .152 | .630 | .629 | .628 | 48 | ||
| 12 | .143 | .805 | .728 | .852 | .761 | .404 | .406 | .407 | 0 | .260 | .260 | .158 | .616 | .615 | .615 | 48 | |||
| 13 | .286 | .757 | .659 | .808 | .697 | .341 | .343 | .344 | 0 | .260 | .260 | .164 | .599 | .598 | .600 | 48 | |||
| 14 | .429 | .721 | .600 | .765 | .643 | .273 | .275 | .277 | 0 | .260 | .260 | .169 | .591 | .589 | .588 | 48 | |||
| 15 | .571 | .659 | .518 | .706 | .563 | .196 | .199 | .200 | 0 | .260 | .260 | .180 | .565 | .563 | .563 | 48 | |||
| 16 | .714 | .606 | .444 | .647 | .495 | .117 | .121 | .121 | 0 | .260 | .260 | .182 | .548 | .546 | .546 | 48 | |||
| 17 | .857 | .567 | .387 | .591 | .440 | .068 | .071 | .072 | 0 | .260 | .260 | .189 | .534 | .533 | .532 | 48 | |||
| 18 | 1.00 | .523 | .332 | .552 | .389 | .018 | .022 | .022 | 0 | .260 | .260 | .194 | .514 | .512 | .511 | 48 | |||
| 19 | Mean | .685 | .556 | .726 | .600 | .237 | .240 | .241 | 0 | .260 | .260 | .173 | .575 | .573 | .572 | 384 | |||
| 20 | Nd | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
Fig. 1 A sample screen seen by some raters (for category = 6, difficulty = 1)
Fig. 2 Accuracies of Interrater Reliability Indices. Notes: 1. Solid red bars are dr2 between estimated and observed chance agreement. 2. Dotted blue bars are dr2 between estimated and observed interrater reliability. 3. Primary benchmark: dr2 > 0.8. 4. Data source: Lines 3 & 8, Table 3