The degree of inter-rater agreement is usually assessed through κ‐type coefficients, and the extent of agreement is then characterized by comparing the value of the adopted coefficient against a benchmark scale. Through two motivating examples, we illustrate the different behavior of some κ‐type coefficients caused by an asymmetric distribution of marginal frequencies over categories. An extensive Monte Carlo simulation study has been conducted in order to investigate the robustness of four κ‐type coefficients for nominal and ordinal classifications, and of an inferential benchmarking procedure that, unlike straightforward benchmarking, does not neglect the influence of the experimental conditions. Robustness has been investigated for several scenarios differing in sample size, rating scale dimension, number of raters, frequency distribution of rater classifications, and pattern of agreement across raters. Simulation results reveal a more pronounced paradoxical behavior of Fleiss' kappa and Conger's kappa with ordinal rather than nominal classifications; the robustness of the coefficients improves with increasing sample size and number of raters for both nominal and ordinal classifications, whereas robustness improves with rating scale dimension only for nominal classifications. By identifying the scenarios (ie, minimum sample size, number of raters, rating scale dimension) with acceptable robustness, this study provides guidelines for the design of robust agreement studies.
Agreement studies are of critical importance in medicine, clinical epidemiology, diagnostic imaging, and similar research contexts, since they provide information about the repeatability and reproducibility of human measurement systems, that is, physicians, clinicians, and radiologists evaluating patients' diseases using dichotomous, nominal, or ordinal rating scales. Traditional measurement system analysis (MSA) procedures estimate the performance of a measurement system as its ability to provide true (ie, accurate) and consistent (ie, precise) results.
Generally speaking, accuracy is the closeness between repeated measurements and the true value, although ISO 5725
uses the term accuracy to cover both systematic bias (ie, trueness) and random error (ie, precision). By definition, subjective evaluations lack a reference value for assessing their trueness, and thus the classical definition of accuracy cannot be directly operationalized for a human measurement system: subjective evaluations can be related only to consistency and assessed as the degree of agreement between repeated evaluations. From a conceptual standpoint, agreement measures the "closeness" between ratings and can be intended as a broader term encompassing both accuracy and consistency: if all the ratings can be assumed to come from the same underlying distribution, then agreement assesses precision around the mean of the ratings. The agreement observed within a rater and among independent raters provides, respectively, measures of rater repeatability and rater reproducibility; the more raters agree on the evaluations they provide, the more confident we can be that they are precise and that their evaluations are reproducible, exchangeable,
and thus trustworthy. A number of theoretical and methodological approaches have been proposed over the years in different disciplines for the assessment of rater repeatability and/or reproducibility; these approaches can be grouped into two main families: the index‐based approach and the model‐based approach. The former quantifies the rater agreement level in a single number and does not provide insight into the structure and nature of agreement differences;
the latter overcomes this criticism and models the ratings provided by each rater to each subject, focusing on the association structure between repeated evaluations.
Even though the model‐based approach gives more information than the single estimate provided by the index‐based approach, the latter is the easiest to implement and thus the most widely applied, especially by practitioners. This article focuses on the index‐based approach, relating the precision of categorical subjective evaluations to the concept of agreement. The easiest way of measuring agreement between ratings is to calculate the overall percentage of agreement; nevertheless, this measure does not take into account the agreement that would be expected by chance alone.
A reasonable alternative is to adopt an agreement coefficient belonging to the wide family of κ‐type coefficients, which correct the probability of observed agreement by the probability of agreement expected by chance, resulting in a relative agreement measure. Specifically, κ‐type coefficients compare a real measurement system (ie, rater) against a hypothetical chance measurement system, which is thus used as a reference for correcting the proportion of observed agreement. The chance agreement term estimates the agreement that would be obtained if the subjects had been evaluated completely at random. The pioneering κ‐type coefficients are Scott's π
and Cohen's kappa,
proposed in 1955 and 1960 for the simplest case of two raters and then extended, respectively, by Fleiss
and Conger
to the case of multiple raters. The coefficients differ in how the chance measurement system is conceived. Specifically, the Cohen and Conger coefficients assume that the probabilities of the chance measurement system of classifying an item into each agreement category are equal to the probabilities characterizing the raters; according to Scott and Fleiss, instead, they are given by the overall classification probabilities, so that no assumption about the equality of marginal frequencies across the replicated evaluations is required. Despite their popularity, all the above coefficients are known for being strongly dependent on trait prevalence and bias in the subject population, which affect the observed marginal frequency distribution of the ratings over classification categories and thus the calculation of the chance agreement term. Specifically, it has been shown that, for a fixed observed agreement component, "symmetrical unbalanced marginal frequencies produce lower values of κ than asymmetrical unbalanced marginal frequencies";
this means that the coefficients are not robust to changes in the frequency distribution of subjects across rating categories, and it is unclear what they are truly measuring.
These criticisms, first observed by Kraemer in 1979,
are widely known as the prevalence and bias paradoxes, as termed by Feinstein and Cicchetti.
The debate about the uses and misuses of Scott's π
and Cohen's kappa
has been extensive and persistent in the specialized literature (eg, Brennan and Prediger,
Feinstein and Cicchetti,
Cicchetti and Feinstein,
Byrt et al,
Gwet,
Warrens,
Erdmann et al,
just to name a few), especially for the simplest case of two raters and dichotomous (or at least nominal) data. For example, within the purview of contingency tables, Cicchetti and Feinstein
identified the conditions that lead to paradoxical behavior, showing via practical examples the dependence of Cohen's kappa on trait prevalence; Guggenmoos‐Holzmann
explored the dependence of Cohen's kappa on trait prevalence with respect to seven validity parameters (ie, the sensitivity and specificity of each classification procedure, the associations between the procedures both in the presence and in the absence of the target trait, and the trait prevalence) and discussed, through two examples, its interpretation as a measure of consistency in extreme populations with maximum true prevalence (ie, trait prevalence = 1). The problematic dependence of Cohen's kappa and Scott's π on the differences in raters' marginal frequencies has also been shown by Gwet,
who conducted a sensitivity analysis to investigate how the agreement assessed via some κ‐type coefficients changes with the variation of trait prevalence and raters' classification probabilities. Similarly, Erdmann et al
conducted a simulation study to determine the standard error of some κ‐type coefficients for dichotomous tests depending on trait prevalence, specificity, and sensitivity with different sample sizes. For the case of contingency tables, the origin of the paradoxical behavior of Cohen's kappa and Scott's π has been explored by Gwet,
and a formal proof of the paradoxical behavior associated with Cohen's kappa has been provided by Warrens. In order to overcome the criticisms related to κ‐type coefficients, researchers have suggested formulating the agreement expected by chance as uniform across categories (ie, the coefficient commonly known as uniform kappa, although proposed by several authors such as Bennett et al,
Janson and Vegelius,
Brennan and Prediger,
and Byrt et al
) or to approximate the propensity of random ratings by the proportion of observed to maximum evaluation variance, so as to consistently yield reliable results (ie, the AC1 agreement coefficient by Gwet
). Another alternative approach for handling the paradoxical behavior in the case of two raters and nominal categories has been suggested by Nelson and Pepe,
who presented a graphical method for assessing inter‐rater agreement. By contrast, little effort has been devoted to κ‐type coefficients for inter‐rater agreement with more than two raters. Gwet
introduced the multiple‐rater variant of the AC1 coefficient and suggested new variance estimators for the multiple‐rater generalized statistics, whose validity, demonstrated via a Monte Carlo simulation study, does not depend upon the hypothesis of independence between raters. Falotico and Quatto
discussed the paradoxical behavior of Fleiss' kappa and of its asymptotic confidence interval, and suggested the adoption of permutation and bootstrap techniques to avoid the former and the latter, respectively. Marasini et al,
instead, extended the uniform chance agreement to the case of multiple raters for ordinal rating categories. To the best of our knowledge, the effect of changes in marginal frequencies over categories on inter‐rater agreement indexes has been most thoroughly investigated by Quarfoot and Levine,
who defined coefficient robustness as the ability of a coefficient to give roughly the same result for a fixed level of agreement across raters, irrespective of the frequency distribution of ratings across categories. Specifically, the Monte Carlo simulation study conducted by Quarfoot and Levine
aimed at exploring the robustness of 5 inter‐rater agreement indexes with respect to six different frequency distributions of ratings and as many patterns of rater agreement, considering a large sample of 100 subjects classified by 8 raters, for a total of 1440 investigated scenarios. A main limitation of the Quarfoot and Levine study is that it examined neither the more critical scenarios of small sample sizes and small groups of raters, nor the influence of the type of rating scale, nor the robustness of the lower confidence bound commonly adopted for a proper characterization of the extent of inter‐rater agreement. Since situations in which the number of subjects belonging to one of the rating categories far exceeds that of the others are very common in clinical contexts, an inter‐rater agreement coefficient robust to paradoxical behavior due to prevalence or bias becomes of the utmost importance. In this framework, this article aims to identify the scenarios under which κ‐type coefficients are not susceptible to paradoxical behavior, by investigating the effects of sample size, number of involved raters, and type of rating scale on the robustness of κ‐type coefficients, and by discussing their practical implications for the final characterization of the extent of rater agreement. The investigation concerns four κ‐type coefficients for inter‐rater agreement with nominal data, together with their weighted versions for ordinal data. The investigated coefficients are the two well‐cited Fleiss' kappa coefficient
and Conger's kappa
as well as the uniform kappa
and Gwet's agreement coefficients
(AC1 for nominal and AC2 for ordinal data), proposed as paradox‐resistant κ‐type coefficients. It is important to note that each of these indexes uses a different approach to correct for chance agreement, so that they could respond differently to the paradoxes. In accordance with the study aims, the algorithm proposed by Quarfoot and Levine has been extended in order to investigate the robustness of the κ‐type coefficients and of the asymptotic lower confidence bound under several scenarios differing in sample size, rating scale dimension, number of raters, frequency distribution of ratings, and pattern of agreement across raters. The remainder of this article is organized as follows: In Section 2, the κ‐type coefficients and the inferential benchmarking procedure are introduced. In Section 3, the implications of the paradoxical behavior of the κ‐type coefficients and the usefulness of the inferential characterization procedure are illustrated and discussed through two motivating examples. In Section 4, the simulation algorithm for the analysis of the paradoxical behavior is described and the main results are discussed. Finally, conclusions are summarized in Section 5.
MEASURING AGREEMENT FOR NOMINAL AND ORDINAL CLASSIFICATIONS
Let a set of n subjects randomly selected from the population of subjects be classified by r raters on a categorical scale of dimension K; let X_ij be the random variable denoting the category to which subject i is assigned by rater j, and let x_ij denote its realization (x_ij ∈ {1, …, K}). The random variables are stochastically independent, their distribution depends on the true classification, and they are completely determined by the model parameters, with π_j = (π_j1, …, π_jK) denoting the marginal distribution for the generic rater j, where π_jk is the probability that rater j classifies a subject into category k. The agreement among raters can be defined by an arbitrary choice along a continuum ranging from agreement among all possible pairs of raters (ie, pairwise agreement, the least restrictive definition of agreement) to agreement among all the raters (ie, r‐wise agreement, the most restrictive definition of agreement). Because of its practical interpretation, attention is hereafter restricted to pairwise agreement, according to which the probability that a pair of randomly selected raters, referred to as j and j′, agree on the classification of an arbitrary subject into category k is considered; for a generic subject, the probability of agreement is obtained by summing over the K categories. At sample level, the probability of agreement is replaced by an unbiased estimator (see De Mast and Van Wieringen
for demonstration) given by the average proportion of observed agreement among all pairs of raters and formulated by Fleiss
as follows:
p̂_a = [2 / (n r (r − 1))] Σ_i Σ_{j<j′} I(x_ij = x_ij′),

where I(·) is the indicator function. Since some inter‐rater agreement is expected by chance alone, a positive value of observed agreement does not automatically provide information about rater consistency; for this reason, several authors
proposed the κ‐type agreement coefficients, which are relative measures of agreement obtained by rescaling the observed agreement by the agreement expected by chance alone.
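To make the pairwise observed agreement concrete, the following sketch computes it from a subjects‐by‐raters matrix. The function name and toy data are illustrative assumptions, not taken from the study:

```python
import numpy as np

def observed_agreement(ratings):
    """Average proportion of observed pairwise agreement (Fleiss' p_a).

    ratings: (n, r) integer array; ratings[i, j] is the category
    assigned to subject i by rater j.
    """
    n, r = ratings.shape
    agree = 0
    for x in ratings:
        # Count agreeing rater pairs (j < j') for this subject.
        agree += sum(x[j] == x[jp] for j in range(r) for jp in range(j + 1, r))
    return 2.0 * agree / (n * r * (r - 1))

# Toy data: 4 subjects, 3 raters, K = 2 categories.
R = np.array([[1, 1, 1],
              [1, 1, 2],
              [2, 2, 2],
              [1, 2, 2]])
print(observed_agreement(R))  # 8 agreeing pairs out of 12 -> 0.666...
```

Equivalently, the same quantity can be obtained from the counts r_ik of raters assigning subject i to category k, which avoids the explicit loop over rater pairs.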
κ‐Type coefficients for nominal classifications
The κ‐type agreement coefficients are formulated as follows:

κ = (p_a − p_a|c) / (1 − p_a|c),

where p_a|c is the probability of agreement expected by chance. At sample level, the estimator of κ is given by:

κ̂ = (p̂_a − p̂_a|c) / (1 − p̂_a|c).
In order to formulate the proportion of agreement expected by chance p̂_a|c, it is necessary to define how a chance measurement system is conceived. Different notions of a chance measurement system are advocated in the literature, leading to as many κ‐type coefficients; several well‐known alternative coefficients are recalled in the following. According to Bennett et al,
a chance measurement system classifies subjects following the uniform model. Thus, the probability that two raters agree by chance can be estimated as p̂_a|c = 1/K,
and the obtained K_Unif coefficient for inter‐rater agreement is just a linear transformation of the observed proportion of agreement p̂_a. Fleiss,
instead, defined the proportion of agreement expected by chance under the assumption of homogeneous and thus exchangeable raters. Indeed, K_F is based on a one‐way ANOVA setting where each subject is classified by a different set of raters randomly selected from a population, so that the variation due to the raters cannot be separated from the error variation. Assuming that the probability of classifying a subject into category k is given by the marginal distribution of the classifications provided by all raters (ie, π_k), the probability that two raters agree by chance is:

p_a|c = Σ_k π_k²,

where π_k can be estimated by the marginal frequencies given by:

π̂_k = [1/(n r)] Σ_i r_ik,

with r_ik the number of raters classifying subject i into category k.
Although Fleiss proposed the K_F statistic as a generalization of Cohen's kappa to the case of multiple raters, it reduces to Scott's π for two raters,
and it coincides with Cohen's kappa if and only if the column marginals are all equal. The generalization of Cohen's kappa to the case of r different raters, commonly referred to as Conger's kappa (ie, K_C), has been proposed first by Conger
and later by Davies and Fleiss,
Schouten,
and O'Connell and Dobson.
K_C is based on a two‐way ANOVA setting where all subjects are classified by the same set of raters, who are included as a systematic source of disagreement.
According to Conger, a rater providing random classifications is conceived of as one that classifies subjects randomly but with a distribution equal to the marginal distribution of her/his own classifications, so that the probability that two raters agree by chance can be estimated as:

p_a|c = [2/(r(r − 1))] Σ_{j<j′} Σ_k π_jk π_j′k.

At sample level, the pairwise agreement coefficient estimates the expected agreement as the mean proportion of chance agreement between all pairs of raters, that is, by averaging all Cohen's kappa pairwise chance agreement estimates. Since averaging all pairwise chance agreement components becomes time‐consuming when r is large, an alternative, more efficient method with direct calculation is recommended. Letting p_jk be the proportion of subjects classified into category k by rater j, p̂_a|c can be expressed as follows:

p̂_a|c = [1/(r(r − 1))] Σ_k [(Σ_j p_jk)² − Σ_j p_jk²].

The AC1 agreement coefficient proposed by Gwet
formulates the agreement expected by chance as the probability of the simultaneous occurrence of one rater providing random ratings and two raters agreeing. The probability of random rating is approximated by a normalized measure of randomness, defined as the ratio of the observed variance, still formulated as in Equation (11), to the variance expected under the assumption of totally random ratings. Under this assumption, the probability that two raters agree by chance can be estimated as follows:

p̂_a|c = [1/(K − 1)] Σ_k π̂_k (1 − π̂_k).

The complete sample formulations of the κ‐type coefficients for nominal classifications under comparison are reported in Table A1 in Appendix A.
TABLE A1
Formulation of the introduced inter‐rater κ‐type coefficients for nominal classifications
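The four chance‐correction rules described above can be compared side by side. The sketch below is a simplified illustration (the function name and toy data are assumptions, not the paper's reference implementation) that computes the point estimates of the uniform kappa, Fleiss' kappa, Conger's kappa, and Gwet's AC1 from a subjects‐by‐raters matrix:

```python
import numpy as np

def nominal_kappas(ratings, K):
    """Uniform kappa, Fleiss' kappa, Conger's kappa, and Gwet's AC1.

    ratings: (n, r) integer array with categories coded 1..K.
    """
    n, r = ratings.shape
    # r_ik: number of raters assigning subject i to category k.
    counts = np.stack([(ratings == k).sum(axis=1) for k in range(1, K + 1)], axis=1)
    pa = (counts * (counts - 1)).sum() / (n * r * (r - 1))  # observed agreement
    pi = counts.sum(axis=0) / (n * r)                       # overall marginals
    # p_jk: proportion of subjects rater j assigns to category k, shape (r, K).
    pjk = np.stack([(ratings == k).mean(axis=0) for k in range(1, K + 1)], axis=1)
    pc = {
        "KUnif": 1.0 / K,                                   # uniform chance model
        "KF": (pi ** 2).sum(),                              # Fleiss' chance term
        # Conger: mean chance agreement over all ordered rater pairs.
        "KC": ((pjk.sum(axis=0) ** 2 - (pjk ** 2).sum(axis=0)).sum()) / (r * (r - 1)),
        "AC1": (pi * (1 - pi)).sum() / (K - 1),             # Gwet's chance term
    }
    return {name: (pa - p) / (1 - p) for name, p in pc.items()}

# Toy data: 4 subjects, 3 raters, K = 2.
R = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]])
est = nominal_kappas(R, K=2)
print({k: round(v, 3) for k, v in est.items()})
```

On this balanced toy example (π̂_1 = π̂_2 = 0.5), KUnif, KF, and AC1 coincide at 1/3, while KC differs because each rater's individual marginals are unbalanced.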
When raters classify subjects on a K‐point ordinal scale, some disagreements are more serious than others, that is, disagreements on two distant categories are more relevant than disagreements on neighboring categories; it is therefore necessary to a priori assign different weights, denoted as w_kl, to each pair of ratings (k, l) (with k, l = 1, …, K). The weighted κ‐type coefficient for ordinal classifications is thus formulated as:

κ_w = (p_aw − p_aw|c) / (1 − p_aw|c),

where p_aw, the weighted version of the proportion of observed agreement, replaces the indicator function in the observed agreement with the weights w_kl. The agreement weighting scheme is a non‐increasing function of |k − l|: w_kl = 1 for k = l and w_kl < 1 for k ≠ l. The weights can be arbitrarily defined; however, the linear,
w_kl = 1 − |k − l|/(K − 1), and quadratic,
w_kl = 1 − (k − l)²/(K − 1)², weights are the most commonly used weighting schemes for κ‐type coefficients. Although Fleiss and Cohen
and Schuster
showed that the κ‐type coefficients with quadratic weights are equivalent to the intraclass correlation coefficient, Brenner and Kliebsch
showed that the use of linear weights instead of quadratic weights leads to a statistic less sensitive to the number of rating categories. Thus, the linear weights are suggested and adopted here. It is worth pointing out that the unweighted coefficients are special cases of the corresponding weighted versions, obtained with weights equal to either 0 or 1: w_kl = 1 if k = l and w_kl = 0 elsewhere. Assuming a uniform model for the chance measurement system, Marasini et al
formulated the s∗ statistic by defining the weighted proportion of agreement expected by chance as:

p_aw|c = W / K²,

W being the sum of all weights w_kl: W = Σ_k Σ_l w_kl. The weighted version of the proportion of chance agreement defined by Fleiss for ordinal classifications is:

p_aw|c = Σ_k Σ_l w_kl π̂_k π̂_l,

π̂_k and π̂_l being the estimates of the probabilities of classifying a subject into categories k and l, respectively. In Conger's kappa, the weighted proportion of agreement expected by chance for ordinal classifications is obtained by weighting the chance agreement between every pair of raters:

p̂_aw|c = [1/(r(r − 1))] Σ_k Σ_l w_kl Σ_{j≠j′} p_jk p_j′l.

Gwet
proposed AC2, a weighted version of AC1, obtained by formulating the agreement expected by chance as follows:

p̂_aw|c = [W / (K(K − 1))] Σ_k π̂_k (1 − π̂_k).

The complete sample formulations of the κ‐type coefficients for ordinal classifications under comparison are reported in Table A2 in Appendix A.
TABLE A2
Formulation of the introduced inter‐rater κ‐type coefficients for ordinal classifications
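As a sketch of the weighting machinery (helper names are illustrative; the subtraction of the r self‐pairs relies on w_kk = 1), the linear weight matrix and the weighted proportion of observed agreement can be computed as:

```python
import numpy as np

def linear_weights(K):
    """Linear agreement weights w_kl = 1 - |k - l| / (K - 1)."""
    idx = np.arange(K)
    return 1.0 - np.abs(idx[:, None] - idx[None, :]) / (K - 1)

def weighted_observed_agreement(ratings, K, W):
    """Weighted pairwise observed agreement p_aw for ordinal ratings.

    ratings: (n, r) integer array with categories coded 1..K;
    W: (K, K) weight matrix with ones on the diagonal.
    """
    n, r = ratings.shape
    counts = np.stack([(ratings == k).sum(axis=1) for k in range(1, K + 1)],
                      axis=1).astype(float)
    # Sum of w_kl over all ordered rater pairs per subject, minus the
    # r self-pairs (each contributes w_kk = 1).
    per_subject = np.einsum('ik,kl,il->i', counts, W, counts) - r
    return per_subject.sum() / (n * r * (r - 1))

# Toy check: 2 subjects, 2 raters, K = 3 with linear weights.
R = np.array([[1, 2],
              [3, 3]])
W = linear_weights(3)
print(weighted_observed_agreement(R, 3, W))  # (0.5 + 1.0) / 2 = 0.75
```

With the identity matrix in place of W, the function reduces to the unweighted observed agreement, mirroring the special‐case remark above.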
Characterization of the extent of rater agreement via lower confidence bound
All κ‐type coefficients range from −1 to 1: when the observed proportion of agreement equals chance agreement, the coefficient is null; when the observed agreement is greater than chance agreement, the coefficient is positive; vice versa, a negative value can be interpreted as disagreement. Several benchmark scales have been proposed, mainly in the social and medical sciences, for interpreting the extent of agreement.
The best known scales are those proposed by Landis and Koch,
Altman,
and Shrout.
The first consists of six ranges of values corresponding to as many categories of agreement: Poor, Slight, Fair, Moderate, Substantial, and Almost Perfect agreement for coefficient values ranging between −1 and 0, 0 and 0.2, 0.21 and 0.4, 0.41 and 0.6, 0.61 and 0.8, and 0.81 and 1.0, respectively. This scale was then simplified by Altman, who collapsed the first two ranges of values into one agreement category, and later by Shrout, who deleted the category of negative values and moved the threshold value of Slight agreement from 0.2 to 0.1. Useful information about the true extent of agreement is provided by the lower confidence bound for κ.
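For straightforward benchmarking against the Landis and Koch scale just described, a simple lookup might look like the following (the function name is illustrative, and the handling of values exactly at the cut points is a convention choice):

```python
def landis_koch_category(kappa):
    """Map a coefficient value to Landis and Koch's agreement category."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"

# Inferential benchmarking would apply the same lookup to the lower
# confidence bound rather than to the point estimate.
print(landis_koch_category(0.497))  # Moderate
print(landis_koch_category(0.687))  # Substantial
```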
The asymptotic normal approximation for κ‐type coefficients
has been used to construct symmetric confidence intervals of the form:

κ̂ ± z_{1−α/2} √(V̂(κ̂)),

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. A consistent large‐sample variance estimator for κ̂ has been provided by Gwet
as a variant of the widely used variance estimate proposed by Fleiss et al.
Indeed, the latter has been derived under the assumption of no agreement among raters, making it suitable only for testing the hypothesis of no agreement; if this assumption is not satisfied, the variance estimate becomes irrelevant, and it should be avoided for quantifying the precision of κ̂ as well as for building confidence intervals. The variance estimator proposed by Gwet
and hereafter adopted is given by:
V̂(κ̂) = [(1 − f)/n] · [1/(n − 1)] Σ_i (κ̂_i − κ̂)²,

where f = n/N is the sampling fraction of subjects from a target population of size N and κ̂_i is the agreement estimated at the subject level. The formulations of κ̂_i for all the κ‐type coefficients under study can be found in Appendix B. The asymptotic confidence interval, whose accuracy depends on the asymptotic normality of the coefficient and of its variance, is by definition generally applicable only for large sample sizes. Under non‐asymptotic conditions, alternative approaches such as bootstrap confidence intervals may be used (eg, References 8 and 56‐58). Among the available methods to build bootstrap confidence intervals, the percentile bootstrap is the simplest and the most popular one. The lower and upper bounds of the two‐sided (1 − α)100% percentile bootstrap confidence interval are, respectively, the α/2 and 1 − α/2 percentiles of the cumulative distribution function of the bootstrap replications of the κ‐type coefficient. On the other hand, the bias‐corrected and accelerated bootstrap (BCa) confidence interval is recommended for severely non‐normal data,
since it adjusts for any bias and lack of symmetry of the bootstrap distribution through the acceleration parameter and the bias‐correction parameter; the lower and upper bounds of the two‐sided (1 − α)100% BCa confidence interval are defined accordingly. Although the BCa confidence interval entails a higher computational complexity, its coverage error is generally smaller than that of the other bootstrap intervals, although it can be erratic for small n.
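A percentile bootstrap interval can be sketched as follows, resampling subjects with replacement. The coefficient function, the number of replications B, and the seed are illustrative choices, not the paper's settings; the observed agreement stands in for any κ‐type estimator:

```python
import numpy as np

def observed_agreement(ratings):
    """Unweighted pairwise observed agreement, used as a stand-in for
    any kappa-type coefficient estimator."""
    n, r = ratings.shape
    counts = np.stack([(ratings == k).sum(axis=1)
                       for k in np.unique(ratings)], axis=1)
    return (counts * (counts - 1)).sum() / (n * r * (r - 1))

def percentile_bootstrap_ci(ratings, coef_fn, alpha=0.05, B=2000, seed=123):
    """Two-sided (1 - alpha) percentile bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    n = ratings.shape[0]
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample subjects (rows)
        boot[b] = coef_fn(ratings[idx])
    return tuple(np.quantile(boot, [alpha / 2, 1 - alpha / 2]))

# Toy data: 8 subjects, 3 raters.
R = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2],
              [2, 2, 2], [1, 1, 1], [1, 2, 1], [2, 2, 1]])
lo, hi = percentile_bootstrap_ci(R, observed_agreement)
print(lo, hi)
```

The lower bound `lo` is the quantity whose robustness is examined in the benchmarking discussion; the BCa variant would additionally shift and rescale the chosen quantiles.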
TWO MOTIVATING EXAMPLES
Two real agreement studies are hereafter presented in order to show the implications of the paradoxical behavior of κ‐type coefficients. To assess the degree of inter‐rater agreement, the evaluations simultaneously provided by the raters have been arranged in an n × K table, where the generic cell r_ik contains the number of raters who classify subject i into category k. The observed agreement is computed according to either Equation (6) for nominal data or Equation (16) for ordinal data, and corrected by the agreement expected by chance adopting the analyzed κ‐type coefficients, whose formulations are summarized in Tables A1 and A2 in Appendix A; in the case of ordinal data, the linear weighting scheme has been adopted. Moreover, for a proper characterization of the extent of rater agreement, the lower bound of the asymptotic confidence interval of each coefficient has been built according to Equation (22).
Data sets
The first data set is based on the data originally provided by Sandifer et al
and also discussed by Fleiss.
In the study of Sandifer et al, between 6 and 10 psychiatrists from a pool of 43 psychiatrists were selected to diagnose each patient. As done by Fleiss,
we dropped diagnoses so as to have a constant number of 6 assignments per patient. Specifically, the analyzed data set contains the diagnoses of psychiatrists who were requested to classify patients into one of the following nominal diagnostic categories: (1) depression, (2) personality disorder, (3) schizophrenia, (4) neurosis, (5) other. The second data set was originally published by Holmquist et al
and then analyzed also in the studies of Landis and Koch,
Agresti,
Becker and Agresti,
and Saraçbaşi;
it is one of the most common data sets in agreement studies in which the paradoxical behavior of the κ‐type coefficients is observed. Specifically, the study involved independent pathologists who classified images/slides with the aim of investigating the variability in the classification of carcinoma in situ of the uterine cervix. Based on the dimension and type of the lesions, the physicians had to classify the presence of a carcinoma in situ adopting an ordinal scale with the following grades: (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous carcinoma with early stromal invasion, (5) invasive carcinoma. According to the study design, Fleiss' kappa can be correctly adopted only for the first data set, containing the diagnoses of 30 patients made by different groups of 6 psychiatrists each, whereas Conger's kappa is suitable for the second data set, where all slides/images have been independently classified by the same set of pathologists.
Study results
The results of the two studies are reported in Table 1 and represented in Figure 1 against the agreement categories of Landis and Koch's benchmark scale. The observed agreement is equal to 0.555 in the psychiatric diagnosis study and 0.855 in the cervix carcinoma study.
TABLE 1
Point estimate, two‐sided 95% asymptotic confidence interval, and expected agreement term of each κ‐type coefficient for agreement in psychiatric diagnoses and carcinoma classifications
Psychiatric diagnosis study (p_a = 0.555):

κ̂ coefficient | Point estimate | 95% CI         | p_a|c term
KUnif          | 0.444          | [0.336, 0.552] | 0.200
KF             | 0.430          | [0.324, 0.536] | 0.220
AC1            | 0.448          | [0.339, 0.557] | 0.195

Cervix carcinoma study (p_aw = 0.855):

κ̂ coefficient | Point estimate | 95% CI         | p_aw|c term
s∗             | 0.639          | [0.598, 0.680] | 0.600
KCw            | 0.497          | [0.432, 0.562] | 0.713
AC2            | 0.687          | [0.647, 0.727] | 0.687
FIGURE 1
Point estimate and two‐sided 95% asymptotic confidence interval of each κ‐type coefficient, plotted against Landis and Koch's benchmark scale
In the psychiatric diagnosis study, the degree of inter‐rater agreement appears Moderate with every κ‐type coefficient; KUnif, KF, and AC1 and their p_a|c terms are quite similar to one another and agree to within a few hundredths. Moreover, there is evidence for rejecting the null hypothesis of Slight inter‐rater agreement and accepting the tested hypothesis of at least Fair agreement, since the lower bounds of their confidence intervals belong to the region ranging from 0.2 to 0.4. In the cervix carcinoma study, achieving more than 85% observed agreement might at first sight be impressive, but this finding must be tempered by the fact that the expected agreement under a uniform distribution could be as high as 60% with a 5‐point ordinal scale. Because of the differences among the p_aw|c terms (see Table 1), the estimated degree of inter‐rater agreement differs across coefficients: it is classified as Moderate using KCw and as Substantial using s∗ and AC2. Moreover, there is evidence for rejecting the null hypothesis of no more than Fair inter‐rater agreement and accepting the tested hypothesis of at least Moderate inter‐rater agreement for both KCw and s∗, whereas for AC2 there is evidence for rejecting the null hypothesis of no more than Moderate inter‐rater agreement and accepting the tested hypothesis of Substantial inter‐rater agreement. It is interesting to highlight how the differences in the skewness of the marginal distributions (see Figure 2) and in sample size make the two agreement studies differ, respectively, in the similarity across the estimated agreement coefficients and in the width of the parametric confidence intervals.
These differences highlight the importance of investigating the robustness of agreement coefficients against changes in marginal distributions.
FIGURE 2
Marginal distributions of ratings over categories for psychiatric diagnoses and carcinoma classifications
ROBUSTNESS STUDY VIA MONTE CARLO SIMULATION
An extensive Monte Carlo simulation study has been conducted in order to investigate the robustness of κ‐type coefficients to changes in the frequency distribution of ratings over categories for a fixed level of agreement. The ratings provided by the raters on the same set of subjects adopting a categorical scale are simulated considering that a reference rater rates the subjects according to their distribution over classification categories (ie, frequency distribution, FD), and each of the other raters agrees with the reference rater according to a given pattern of agreement (ie, agreement distribution, AD). Specifically, the FD mimics the type of subjects the raters are exposed to, as filtered through the rating instrument in question, whereas the AD mimics the agreement between each rater and the reference rater by modeling the rating probabilities of the raters conditioned on the FD. The simulation study has been designed as a multi‐factor experimental design with five multi‐level factors: rating scale dimension K, sample size n, number of raters r, FD, and AD. The factor K has 9 levels of classification categories; the factor n has 3 levels of sample size; the factor r has 3 levels of number of raters; the factors FD and AD have 6 and 2 levels, respectively. The FDs are all special cases of the beta‐binomial distribution with different values of the shape parameters (a, b); the main characteristics and the patterns of all FDs are summarized in Table 2. AD 1 is a binomial distribution scaled on the rating categories and centered on the ratings provided by the reference rater; AD 2 is a uniform distribution, which represents the case in which all classification categories have an equal chance of occurring.
TABLE 2
Parameters and pattern of each FD
Name    Parameters (a, b)    Pattern
FD 1    (0.25, 0.25)         Extremes
FD 2    (1, 1)               Uniform
FD 3    (2, 2)               Central
FD 4    (50, 50)             Binomial
FD 5    (25, 50)             Skewed
FD 6    (5, 50)              Very skewed
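Since each FD is a beta-binomial over the K categories, the six patterns of Table 2 can be reproduced with a short sketch. The Python snippet below is illustrative (the paper used Mathematica; the helper names are ours): it first draws p from a Beta(a, b) and then a Binomial(K−1, p) count, yielding a category in {0, …, K−1}.

```python
import random

def beta_binomial_sample(K, a, b, rng):
    """Draw one category in {0, ..., K-1}: first p ~ Beta(a, b),
    then a Binomial(K-1, p) count, ie, one beta-binomial draw."""
    p = rng.betavariate(a, b)
    return sum(1 for _ in range(K - 1) if rng.random() < p)

# Shape parameters (a, b) of the six frequency distributions, as in Table 2.
FDS = {"FD 1": (0.25, 0.25), "FD 2": (1, 1), "FD 3": (2, 2),
       "FD 4": (50, 50), "FD 5": (25, 50), "FD 6": (5, 50)}

rng = random.Random(42)
K, n = 5, 1000
for name, (a, b) in FDS.items():
    counts = [0] * K
    for _ in range(n):
        counts[beta_binomial_sample(K, a, b, rng)] += 1
    print(name, counts)  # FD 1 piles up at the extremes, FD 6 at the low end
```

Running the sketch shows the qualitative patterns named in Table 2: a U-shaped distribution for (0.25, 0.25), a flat one for (1, 1), a central peak for (2, 2), a binomial-like peak for (50, 50), and increasingly skewed shapes for (25, 50) and (5, 50).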
For each combination of rating scale dimension, sample size, number of raters, FD, and AD, Monte Carlo data sets have been generated for both nominal and ordinal classifications, and the degree of inter‐rater agreement has been assessed with all the κ‐type coefficients under study. The robustness of the asymptotic lower confidence bound has been investigated for the sample sizes satisfying the asymptotic condition of normality; for each such scenario, the lower bound of the two‐sided 95% asymptotic confidence interval has been built according to Equation (22).
Quarfoot and Levine assessed the robustness of agreement coefficients by looking at the range over the coefficient mean values obtained from different FDs with a fixed AD; this approach is not recommended under non‐asymptotic conditions because of the lower representativeness of the mean for the distribution of the κ‐type coefficients. The approach suggested here is to assess robustness by looking at the mean range over the coefficient values obtained from different FDs with a fixed AD.
The adopted simulation procedure works as follows:
1. sample the ratings of the reference rater under each FD;
2. sample the ratings of each of the other raters under each FD‐AD pair;
3. compute the κ‐type coefficients for nominal (see Table A1) and ordinal (see Table A2) classifications obtained under each FD‐AD pair;
4. build the lower confidence bound of each κ‐type coefficient for each FD‐AD pair through Equation (22);
5. repeat steps 1 to 4 S times;
6. for each FD‐AD pair and factor combination, compute the mean agreement value of each κ‐type coefficient over the Monte Carlo data sets;
7. for each AD and Monte Carlo replication, compute the range of agreement over the coefficients estimated on data simulated from the 6 different FDs with a fixed AD;
8. for each AD, compute the mean range of agreement.
Similarly, for each AD, the mean range of agreement is computed over the lower confidence bounds obtained from the 6 different FDs with a fixed AD. The simulation algorithm has been implemented using Mathematica (Version 11.0, Wolfram Research, Inc., Champaign, IL, USA).
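The procedure above can be sketched end-to-end. The Python snippet below is a deliberately simplified stand-in, not the study's implementation: it uses only two raters, a single toy coefficient (a uniform-chance S rather than the four study coefficients), and a crude AD in which the second rater repeats the reference rating with a fixed probability; its only purpose is to illustrate the difference between the range over mean values and the mean range over values.

```python
import random

def simulate_ratings(n, K, a, b, p_agree, rng):
    """Reference rater draws each rating from a beta-binomial FD; a second
    rater repeats it with probability p_agree and otherwise picks a category
    uniformly (a crude stand-in for an agreement distribution)."""
    r1, r2 = [], []
    for _ in range(n):
        p = rng.betavariate(a, b)
        c = sum(1 for _ in range(K - 1) if rng.random() < p)
        r1.append(c)
        r2.append(c if rng.random() < p_agree else rng.randrange(K))
    return r1, r2

def s_coefficient(r1, r2, K):
    """Uniform-chance (Brennan-Prediger-style) agreement coefficient."""
    po = sum(a == b for a, b in zip(r1, r2)) / len(r1)
    pe = 1.0 / K
    return (po - pe) / (1.0 - pe)

def robustness_summary(fds, n=30, K=5, S=500, p_agree=0.7, seed=1):
    """Return (range over the FD-wise mean values, mean per-replication range)."""
    rng = random.Random(seed)
    per_fd = {name: [] for name in fds}
    ranges = []
    for _ in range(S):
        vals = []
        for name, (a, b) in fds.items():
            r1, r2 = simulate_ratings(n, K, a, b, p_agree, rng)
            v = s_coefficient(r1, r2, K)
            per_fd[name].append(v)
            vals.append(v)
        ranges.append(max(vals) - min(vals))   # step 7: range across FDs
    means = [sum(v) / S for v in per_fd.values()]
    return max(means) - min(means), sum(ranges) / S   # step 8 summaries

fds = {"FD 2": (1, 1), "FD 6": (5, 50)}
delta, mean_range = robustness_summary(fds)
print(delta, mean_range)
```

By construction the range of the FD-wise means never exceeds the mean of the per-replication ranges, which is why the former can only overstate coefficient robustness relative to the latter in small samples.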
Simulation results
Simulation results revealed a negligible effect of the number of raters and of the sample size on the mean agreement value. For illustrative purposes, the mean and standard deviation of all κ‐type coefficients are reported in Tables 1 through 4 of the Supplementary Materials. The obtained results show that the mean agreement value changes with the number of categories; vice versa, the number of raters and the sample size only slightly affect the mean agreement value, which increases by about 3% with increasing number of raters and decreasing sample size.
Specifically, under AD 1 the simulation results reveal a slightly higher dependency of Fleiss' and Conger's kappa on trait prevalence, since their mean values decrease to the lower adjacent category with FD 4, FD 5, and FD 6 (ie, unbalanced marginal distributions), whereas the other coefficients generally assume values belonging to the same agreement category for all FDs (see Table 1 in the Supplementary Materials). The sensitivity of KFw and KCw to changes in FDs is much more evident, and the mean agreement obtained with the very skewed frequency distribution FD 6 is two categories lower than that obtained with balanced FDs; vice versa, for s∗ and AC2 the mean agreement is about the same whatever the FD (see Table 2 in the Supplementary Materials). The above results obtained for ordinal data are in line with those discussed in Quarfoot and Levine
in terms of average agreement. Under AD 2, instead, all the mean agreement values are approximately 0 (see Tables 3 and 4 in the Supplementary Materials).
For two extreme scenarios (differing in the number of raters and subjects), Figures 3 and 4 report the distributions of the range over the coefficients estimated from different FDs with a fixed AD, and Tables 3 and 4 compare the mean range over the coefficient values (AR‾) against the range over the coefficient mean values (hereafter, Δκ‾). The range distributions for s∗ and AC2 are clustered around a lower mean, and thus these coefficients can be assumed to be more robust to changes in FDs than KFw and KCw. The comparative analysis between Δκ‾ and AR‾ reveals that they are comparable under asymptotic conditions (see Table 4), but for small sample sizes Δκ‾ always overestimates the coefficient robustness (see Table 3).
FIGURE 3
Distribution of the range over the coefficient values obtained from different FDs with AD 1, when raters evaluate subjects on an ordinal scale
FIGURE 4
Distribution of the range over the coefficient values obtained from different FDs with AD 1, when raters evaluate subjects on an ordinal scale
TABLE 3
Range over the mean κ‐type coefficient values (Δκ‾) and mean range (AR‾) obtained from different FDs with AD 1, when raters evaluate subjects on an ordinal scale
            K=3    K=4    K=5    K=6    K=7    K=8    K=9    K=10   K=11
s∗   Δκ‾   0.066  0.098  0.111  0.121  0.125  0.124  0.133  0.127  0.125
     AR‾   0.462  0.387  0.337  0.309  0.289  0.275  0.264  0.250  0.239
KFw  Δκ‾   0.406  0.434  0.452  0.447  0.449  0.468  0.473  0.463  0.468
     AR‾   0.714  0.660  0.636  0.624  0.616  0.628  0.630  0.623  0.622
KCw  Δκ‾   0.389  0.417  0.435  0.431  0.435  0.453  0.456  0.447  0.451
     AR‾   0.691  0.638  0.614  0.602  0.597  0.607  0.609  0.602  0.600
AC2  Δκ‾   0.139  0.157  0.152  0.154  0.150  0.144  0.142  0.134  0.130
     AR‾   0.452  0.376  0.329  0.301  0.282  0.265  0.251  0.236  0.227
TABLE 4
Range over the mean κ‐type coefficient values (Δκ‾) and mean range (AR‾) obtained from different FDs with AD 1, when raters evaluate subjects on an ordinal scale
            K=3    K=4    K=5    K=6    K=7    K=8    K=9    K=10   K=11
s∗   Δκ‾   0.094  0.137  0.158  0.167  0.178  0.175  0.174  0.173  0.172
     AR‾   0.119  0.153  0.171  0.179  0.182  0.183  0.182  0.180  0.178
KFw  Δκ‾   0.304  0.374  0.413  0.439  0.442  0.469  0.476  0.482  0.489
     AR‾   0.305  0.374  0.414  0.441  0.445  0.472  0.482  0.491  0.499
KCw  Δκ‾   0.304  0.374  0.413  0.439  0.442  0.469  0.476  0.482  0.489
     AR‾   0.305  0.374  0.414  0.440  0.445  0.472  0.482  0.491  0.499
AC2  Δκ‾   0.214  0.241  0.244  0.240  0.237  0.223  0.215  0.206  0.200
     AR‾   0.223  0.245  0.247  0.242  0.240  0.225  0.216  0.208  0.202
Simulation results obtained for the lower confidence bounds are almost the same as those obtained for the coefficient estimates, revealing that the trait prevalence affects the sample variance in a similar manner; moreover, since little difference is observed across the investigated conditions, the mean range of agreement is represented in Figures 5 and 6 for the unweighted (ie, nominal classifications) and the weighted (ie, ordinal classifications) coefficients, respectively.
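For ordinal classifications the weighted coefficients credit near-miss ratings through a weight matrix; with linear weights, the expected weighted agreement under a uniform chance model is the sum of the weights divided by K², which reproduces the 60% figure quoted for a 5-point scale in the motivating example. The Python sketch below is illustrative (helper names are ours, and only the uniform-chance weighted coefficient is shown, not the four study coefficients):

```python
import numpy as np

def linear_weights(K):
    """Linear agreement weights for a K-point ordinal scale:
    w[i, j] = 1 - |i - j| / (K - 1), so full credit on the diagonal
    and partial credit for near-miss ratings."""
    idx = np.arange(K)
    return 1.0 - np.abs(idx[:, None] - idx[None, :]) / (K - 1)

def weighted_uniform_s(table, w=None):
    """Weighted uniform-chance coefficient: both the observed and the
    expected agreement terms are weighted; under the uniform chance model
    the expected term reduces to sum(w) / K**2."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    K = p.shape[0]
    if w is None:
        w = linear_weights(K)
    po = float((w * p).sum())
    pe = float(w.sum()) / K**2
    return (po - pe) / (1.0 - pe)

# On a 5-point scale the uniform expected weighted agreement is 15/25 = 60%.
print(linear_weights(5).sum() / 25)   # 0.6
```

The same skeleton applies to marginal-based weighted coefficients, where the expected term is computed from the raters' weighted marginal distributions instead of the uniform model.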
FIGURE 5
Mean range of agreement obtained for the four unweighted agreement coefficient estimates under AD 1 and AD 2
FIGURE 6
Mean range of agreement obtained for the four weighted agreement coefficient estimates under AD 1 and AD 2
The patterns in Figure 5 show that, under AD 2, AR‾ is no more than 0.2 for all four unweighted coefficients in almost all scenarios, so they can be recognized as robust to changes in FDs. Indeed, a reasonable AR‾ value ranges between 0 and 0.2, since it allows the extent of agreement to be characterized into the same category, or at most into two adjacent categories, whatever the frequency distribution of ratings over the classification categories. Under AD 1, instead, the robustness worsens while remaining quite comparable across the agreement coefficients, which fail to be robust only with the smallest samples; with increasing sample size and number of involved raters, the coefficients exhibit a less paradoxical behavior.
Vice versa, with ordinal classifications (see Figure 6), the patterns of AR‾ differ across coefficients, and the similar behavior of KFw and KCw on the one hand and of s∗ and AC2 on the other allows them to be distinguished into two groups, with a much higher robustness to changes in FDs for the latter. Specifically, under AD 2, AR‾ is less than 0.2, so the coefficients can be assumed robust; under AD 1, instead, s∗ and AC2 are robust in most scenarios, whereas for KFw and KCw AR‾ exceeds the value 0.4, so their adoption is not recommended because of their strong sensitivity to changes in the frequency distribution.
It is worth pointing out that, although κ‐type coefficients are affected by the number of categories (as revealed by the simulation results reported in Tables 1 through 4 in the Supplementary Materials), it is not reasonable to assume that the changes in robustness over the rating scale dimension are exclusively due to the changes in the underlying κ‐type coefficients.
Indeed, if the κ‐type coefficients changed in the same way with increasing number of classification categories for all the 6 FDs under study, AR‾ would be the same whatever the scale dimension. Actually, the mean range of agreement accounts only for the coefficient variation across FDs and does not depend on the estimated agreement values; that is, the same range could be obtained with both a low and a high degree of agreement. Moreover, increasing the number of classification categories makes KFw and KCw decrease with balanced FDs and increase with unbalanced FDs, so that AR‾ increases, which means a worsening of the coefficient robustness.
The simulation results can be read more interestingly in light of their practical implications for the characterization of the extent of agreement via a benchmark scale. Indeed, for a given AD, differences across FDs can make the extent of rater agreement span several interpretation categories, depending on the adopted benchmark scale. For example, adopting the Landis and Koch benchmark scale, a small mean range implies that the extent of agreement spans at most two adjacent categories (eg, it may belong to Fair or range from Fair to Moderate), whereas a large mean range can make the extent of agreement span categories up to four steps apart (eg, from Slight to Almost Perfect).
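The benchmarking step itself can be illustrated compactly. In the Python sketch below (helper names are ours; the standard error is taken as given rather than computed from Equation (22)), straightforward benchmarking classifies the point estimate, while inferential benchmarking classifies the lower bound of the two-sided 95% asymptotic confidence interval, so that sampling uncertainty is not ignored:

```python
# Landis and Koch benchmark scale (lower bound of each category).
BENCHMARKS = [(0.0, "Slight"), (0.2, "Fair"), (0.4, "Moderate"),
              (0.6, "Substantial"), (0.8, "Almost Perfect")]

def benchmark(value):
    """Straightforward benchmarking: map an agreement value onto the scale."""
    if value < 0:
        return "Poor"
    label = BENCHMARKS[0][1]
    for lower, name in BENCHMARKS:
        if value >= lower:
            label = name
    return label

def inferential_benchmark(kappa_hat, se, z=1.96):
    """Inferential benchmarking: classify the lower bound of the two-sided
    95% asymptotic confidence interval instead of the point estimate."""
    lower_bound = kappa_hat - z * se
    return lower_bound, benchmark(lower_bound)

print(benchmark(0.65))                 # the point estimate reads Substantial
lb, label = inferential_benchmark(0.65, 0.08)
print(round(lb, 4), label)             # the lower bound falls in Moderate
```

With a hypothetical estimate of 0.65 and standard error 0.08, the two procedures disagree by one category, which is precisely the situation in which neglecting the experimental conditions overstates the extent of agreement.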
CONCLUSIONS
This article investigates, via an extensive Monte Carlo simulation study, the robustness of κ‐type coefficients for assessing inter‐rater agreement with both nominal and ordinal classifications.
Simulation results show that the robustness of κ‐type coefficients increases as the sample size and the number of involved raters do, and they reveal the paradoxical behavior of all κ‐type coefficients with few raters and small sample sizes for both nominal and ordinal classifications. The investigation of robustness under several experimental conditions, never explored before to the best of our knowledge, sheds light on the different behavior of inter‐rater agreement coefficients with nominal and ordinal classifications, showing the higher paradoxical behavior of the weighted variants of Fleiss' and Conger's kappa. Indeed, with nominal classifications the robustness is comparable across κ‐type coefficients, although the values of Fleiss' and Conger's kappa slightly decrease with symmetrical unbalanced marginal distributions; vice versa, with ordinal classifications s∗ and AC2 are more robust than KFw and KCw, which are strongly influenced by the frequency distributions over categories. The obtained simulation results allow the identification of the scenarios where the degree of agreement is about the same whatever the FD: many raters classifying a moderate set of subjects, or fewer than 5 raters classifying a larger set of subjects. For such scenarios, the effect of the FDs being negligible, there is no doubt about the adoption of κ‐type coefficients as robust measures of rater agreement.
The variation of the uniform‐chance coefficient values, for nominal and ordinal data respectively, reflects the changes in the observed proportion of agreement, since these coefficients are just linear transformations of the observed agreement.
It should not come as a surprise that both agreement terms change with the FDs, because different combinations of AD and FD affect the distribution of ratings over the classification categories (ie, the category assigned to the subjects by the raters), to which both the observed and the expected agreement are closely related. However, such variations are negligible, producing AR‾ values of less than 0.2, except in the scenarios with the fewest raters.
Fleiss' and Conger's kappa coefficients are more affected by paradoxical behavior because of the chance measurement system model they rely on. Firstly, they strongly depend on the true subject classification and confound measurement precision with accuracy and/or other properties of the subject population. Secondly, there is a strongly nonlinear relationship between the observed agreement and the coefficient value, so that small variations in the observed agreement can result in dramatic changes in the final degree of agreement. The strong sensitivity of the linearly weighted Fleiss' and Conger's kappa coefficients to changes in the distribution of classifications over categories can make their standard errors so large as to render them practically useless.
It is also worth noting that the choice between Fleiss' and Conger's kappa should be based on the way the raters are selected for the agreement study, since adopting either while ignoring the study design can lead to incorrect conclusions: the misuse of one in the two‐way ANOVA setting is likely to result in an underestimation of the agreement level, giving on average smaller values, whereas the misuse of the other in one‐way ANOVA settings is likely to overestimate the agreement level.
The inter‐rater agreement coefficients use, as reference for the observed agreement, different chance measurement systems, each of which can be more or less suitable for a given context. Fleiss' and Conger's kappa represent two alternatives for the cases in which the chance measurement system cannot follow the uniform model. But, as revealed by the simulation results, KFw and KCw are not recommended in the case of unbalanced marginal distributions with ordinal classifications because of their sensitivity to trait prevalence; in such circumstances, the adoption of s∗ and AC2 is strongly recommended.
Data S1 Supplementary material