Kappa and Beyond: Is There Agreement?

Year: 2020 PMID: 32435572 PMCID: PMC7222679 DOI: 10.1177/2192568220911648

Source DB: PubMed Journal: Global Spine J ISSN: 2192-5682

Introduction

It is often of interest to assess agreement between 2 or more observers when the observation of interest is categorical. Observations of clinical interest can include diagnoses, assessment of risk factors (exposures), and outcomes. The statistical tool most widely used to determine agreement is the kappa statistic. The kappa statistic is a chance-corrected measure; that is, it seeks to assess agreement beyond that which occurs by chance. However, it is well known that kappa has some limitations that can lead to paradoxical results (low kappa even in the presence of strong observer agreement) in certain circumstances.[1,2] In this article, we will review kappa and its limitations, and introduce an alternative measure of agreement, the Agreement Coefficient 1 (AC1) given by Gwet.[3]

The Kappa Statistic

Cohen in 1960 proposed the kappa statistic in the context of 2 observers.[4] It was later extended by Fleiss to include multiple observers.[5] For illustration purposes, we will look at the simpler case of 2 observers, acknowledging that the principles are the same for multiple observers. Let us imagine that an investigator wants to develop a spine classification system based in part on the presence or absence of a finding on magnetic resonance imaging (MRI). The investigator gives 200 MRIs to 2 spine surgeons to assess independently whether the finding is present or absent in each image. The results are provided in Table 1.

Table 1.

A 2 × 2 Table of Results in a Hypothetical Example Comparing 2 Different Assessors.

		Assessor 1
		Present	Absent	Total
Assessor 2	Present	(a) 130	(b) 56	(g1) 186
	Absent	(c) 9	(d) 5	(g2) 14
	Total	(f1) 139	(f2) 61	(N) 200

The kappa statistic is given by the formula

κ = (Po − Pe)/(1 − Pe)

where Po = observed agreement, (a + d)/N, and Pe = agreement expected by chance, ((g1 * f1) + (g2 * f2))/N².

In our example,

Po = (130 + 5)/200 = 0.675
Pe = ((186 * 139) + (14 * 61))/200² = 0.668
κ = (0.675 − 0.668)/(1 − 0.668) = 0.022

Kappa values range from −1 to 1, though they usually fall between 0 and 1. One represents perfect agreement, indicating that the raters agree in their classification of every case. Zero indicates agreement no better than that expected by chance. A negative kappa would indicate agreement worse than that expected by chance. A common scale of interpretation for the kappa statistic is given by Altman[6] (Table 2).
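The kappa arithmetic for Table 1 can be checked with a few lines of Python. This is a minimal sketch, not code from the article; the function name `cohen_kappa_2x2` is hypothetical, and the cell labels follow Table 1 (a, b on the "Present" row of Assessor 2; c, d on the "Absent" row).

```python
def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table laid out as in Table 1 (illustrative helper)."""
    n = a + b + c + d
    g1, g2 = a + b, c + d          # row totals (Assessor 2: present, absent)
    f1, f2 = a + c, b + d          # column totals (Assessor 1: present, absent)
    po = (a + d) / n               # observed agreement
    pe = (g1 * f1 + g2 * f2) / n ** 2   # agreement expected by chance
    kappa = (po - pe) / (1 - pe)
    return po, pe, kappa


po, pe, kappa = cohen_kappa_2x2(130, 56, 9, 5)
print(f"Po = {po:.3f}, Pe = {pe:.3f}, kappa = {kappa:.3f}")
# Po = 0.675, Pe = 0.668, kappa = 0.022
```

The printed values agree with the worked example to 3 decimal places.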

Table 2.

A Commonly Used Scale of Interpretation for Kappa Statistic.

Kappa	Agreement
≤0.20	Poor
0.21-0.40	Fair
0.41-0.60	Moderate
0.61-0.80	Good
0.81-1.00	Very good

Using the scale shown in Table 2, the ability of our 2 surgeons to agree on the presence or absence of a specific MRI finding is “poor.”

The Limitations of Kappa

While the kappa statistic above is quite low, one may notice that the absolute percentage of observer agreement (Po) is quite high (68%). How can the observers agree nearly 70% of the time yet have such a low kappa? The answer lies in the distribution of the marginal table totals, on which the magnitude of chance agreement (Pe), and subsequently kappa, depends. The factors that characterize the distribution are referred to as prevalence and bias (see Box 1). Prevalence is the probability with which an observer will classify an object as present or absent. It is related to the balance in Table 1. Bias is the frequency at which raters choose a particular category, present or absent. This is related to the symmetry in Table 1. In our example, Table 1 is symmetrically imbalanced, with a high prevalence of each assessor categorizing the MRI finding as present. This raises Pe (0.668) almost to the level of Po (0.675) and so lowers kappa, creating what appears to be a “paradox”: high percent agreement and low kappa.

Box 1.

Definitions Used to Describe the Limitations of the Kappa Statistic.

Symmetrical: The distribution across g1 and g2 is the same as f1 and f2
Asymmetrical: The distribution across g1 and g2 is in the opposite direction to f1 and f2
Balanced: The proportion of the total number of objects in g1 and f1 is equal to 0.5
Imbalanced: The proportion of the total number of objects in g1 and f1 is not equal to 0.5
Prevalence: Probability with which a rater will classify an object into a category; this is related to the balance in Table 1
Bias: Frequency at which raters choose a particular category; this is related to the symmetry in Table 1
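The effect of imbalance can be illustrated by holding the observed agreement fixed and changing only the marginal totals. The sketch below compares Table 1 with a second table that is purely hypothetical (it does not appear in the article): it has the same Po of 0.675 but roughly balanced marginals. The helper name `kappa_from_table` is also an assumption made for illustration.

```python
def kappa_from_table(a, b, c, d):
    """Return (Po, Pe, kappa) for a 2 x 2 table with cells a, b, c, d."""
    n = a + b + c + d
    po = (a + d) / n
    # Chance agreement from the marginal totals: (g1*f1 + g2*f2) / N^2
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return po, pe, (po - pe) / (1 - pe)


tables = {
    "Table 1 (imbalanced)": (130, 56, 9, 5),
    "Hypothetical (balanced)": (68, 32, 33, 67),  # invented for comparison only
}

results = {name: kappa_from_table(*cells) for name, cells in tables.items()}
for name, (po, pe, kappa) in results.items():
    print(f"{name}: Po = {po:.3f}, Pe = {pe:.3f}, kappa = {kappa:.3f}")
```

Both tables have 135 agreements out of 200, yet the balanced table has a chance agreement of 0.5 and a kappa of about 0.35, while the imbalanced Table 1 has a kappa of about 0.02.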

An Alternative Statistic of Agreement

Given that kappa is affected by skewed distributions of categories (the prevalence problem) and by the degree to which observers disagree (the bias problem), Gwet[3] in 2008 proposed an alternative agreement statistic that adjusts for chance agreement in a different way. He defined a new agreement coefficient (AC1) between 2 (or multiple) observers as the conditional probability that 2 randomly selected observers will agree, given that no agreement will occur by chance. In this way, the AC1 resists the so-called “paradox” of kappa. The AC1 statistic has the same formula as kappa except that it calculates the agreement expected by chance as follows:

Pe = agreement expected by chance, 2q * (1 − q), where q = (g1 + f1)/(2 * N)

In our example, Po remains the same, but Pe takes on a different value, with the following results:

Po = (130 + 5)/200 = 0.675
q = (186 + 139)/(2 * 200) = 0.813
Pe = 2(0.813) * (1 − 0.813) = 0.305
AC1 = (0.675 − 0.305)/(1 − 0.305) = 0.532

How shall we interpret this coefficient? Remember from Table 2 that many use a common benchmark scale to determine if the agreement is poor, fair, moderate, good, or very good. The advantage is that this scale is straightforward. However, this simple method can lead to misleading conclusions for the following reasons. First, the calculated kappa is specific to the pool of subjects used in the study and will change if different subjects are used. Second, the magnitude of the kappa coefficient depends on several factors such as sample size, the number of categories, and the distribution of subjects among the categories. For example, a kappa value of 0.54 based on 200 subjects sends a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects.
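The AC1 arithmetic worked through above for Table 1 can be reproduced as follows. This is a minimal sketch for the 2-observer, 2-category case only; the function name `gwet_ac1_2x2` is hypothetical and the code is not taken from the article or from any statistical package.

```python
def gwet_ac1_2x2(a, b, c, d):
    """Gwet's AC1 for a 2 x 2 table laid out as in Table 1 (illustrative helper)."""
    n = a + b + c + d
    g1 = a + b                     # Assessor 2 "present" total
    f1 = a + c                     # Assessor 1 "present" total
    po = (a + d) / n               # observed agreement, same as for kappa
    q = (g1 + f1) / (2 * n)        # average proportion rated "present"
    pe = 2 * q * (1 - q)           # AC1 chance agreement
    ac1 = (po - pe) / (1 - pe)
    return po, q, pe, ac1


po, q, pe, ac1 = gwet_ac1_2x2(130, 56, 9, 5)
print(f"Po = {po:.3f}, q = {q:.3f}, Pe = {pe:.3f}, AC1 = {ac1:.3f}")
```

Carrying full precision through the calculation gives an AC1 of about 0.533; the 0.532 in the worked example results from using the rounded intermediate values 0.813 and 0.305.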
A more standardized method that overcomes many of the problems cited above was proposed by Gwet.[7] Using the agreement coefficient and its standard error, one can calculate the probability that the agreement coefficient falls into each range in Table 2. Starting with the highest range, 0.81 to 1.0 (very good), and moving down to the poorest range (≤0.2), one calculates the cumulative probability that the coefficient falls into that range or a higher one. The first range at which the cumulative probability crosses a chosen threshold (say, 95%) is taken as the range to which the estimate most likely belongs. The results from our example show that kappa’s cumulative probability does not reach 95% until the bottom range (poor), whereas the AC1 crosses the threshold in the 0.41 to 0.60 range (moderate; Table 3). Fortunately, Gwet’s agreement coefficient and his method for benchmarking are included in several statistical packages.

Table 3.

Benchmark Range When the Level of the Cumulative Probability of the Agreement Coefficient Reaches 0.95 (95%).

Benchmark Range	Description	Cumulative Probability
		Kappa	AC₁
0.81 to 1.00	Very good	0.00	0.00
0.61 to 0.80	Good	0.06	0.14
0.41 to 0.60	Moderate	0.09	0.98
0.21 to 0.40	Fair	0.11	1.00
≤0.20	Poor	1.00	1.00
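The cumulative-probability idea behind Table 3 can be sketched under some loudly stated assumptions: the estimate is treated as normally distributed around the computed coefficient, the benchmark ranges are treated as continuous with cut points at 0.8, 0.6, 0.4, and 0.2, and the standard error passed in below (0.063) is a hypothetical value chosen only for illustration, since the article does not report one. Statistical packages may apply further adjustments, so this is not a reimplementation of any of them.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# (label, lower cut point); continuous cut points are an assumption of this sketch
BENCHMARKS = [
    ("Very good", 0.8),
    ("Good", 0.6),
    ("Moderate", 0.4),
    ("Fair", 0.2),
    ("Poor", None),   # bottom range: cumulative probability is 1 by construction
]


def cumulative_benchmark(coeff, se, threshold=0.95):
    """Walk the ranges from best to worst; return all rows and the first range
    whose cumulative probability reaches the threshold."""
    rows, chosen = [], None
    for label, lower in BENCHMARKS:
        if lower is None:
            cum = 1.0
        else:
            # P(coefficient >= lower) under Normal(coeff, se)
            cum = 1.0 - normal_cdf((lower - coeff) / se)
        rows.append((label, cum))
        if chosen is None and cum >= threshold:
            chosen = label
    return rows, chosen


# 0.532 is the AC1 from the worked example; se=0.063 is hypothetical
rows, chosen = cumulative_benchmark(0.532, 0.063)
for label, cum in rows:
    print(f"{label:<10} {cum:.2f}")
print("Benchmark range:", chosen)
```

Under these assumptions, the cumulative probability first exceeds 0.95 in the moderate range, which is the same range Table 3 reports for AC1.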

Kappa remains the most frequently used statistic for assessing agreement between 2 or more observers when the observation of interest is categorical. The kappa statistic is a chance-corrected measure. However, kappa has limitations that relate to the distribution of the marginal table totals on which the chance correction depends. An alternative agreement statistic is Gwet’s AC1, which seeks to minimize these limitations. A standardized method of benchmarking, which calculates the cumulative probability of an agreement coefficient falling within a benchmark range, provides a more consistent way of interpreting the agreement statistic.

4 in total

1. Computing inter-rater reliability and its variance in the presence of high agreement.

Authors: Kilem Li Gwet
Journal: Br J Math Stat Psychol Date: 2008-05 Impact factor: 3.380

2. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit.

Authors: J Cohen
Journal: Psychol Bull Date: 1968-10 Impact factor: 17.737

3. High agreement but low kappa: II. Resolving the paradoxes.

Authors: D V Cicchetti; A R Feinstein
Journal: J Clin Epidemiol Date: 1990 Impact factor: 6.437

4. High agreement but low kappa: I. The problems of two paradoxes.

Authors: A R Feinstein; D V Cicchetti
Journal: J Clin Epidemiol Date: 1990 Impact factor: 6.437
