Literature DB >> 30132375

Asymptotic variability of (multilevel) multirater kappa coefficients.

Sophie Vanbelle

Abstract

Agreement studies are of paramount importance in various scientific domains. When several observers classify objects on categorical scales, agreement can be quantified through multirater kappa coefficients. In most statistical packages, the standard error of these coefficients is only available under the null hypothesis that the coefficient is equal to zero, preventing the construction of confidence intervals in the general case. The aim of this paper is threefold. First, simple analytic formulae for the standard error of multirater kappa coefficients will be given in the general case. Second, these formulae will be extended to the case of multilevel data structures. The formulae are based on simple matrix algebra and are implemented in the R package "multiagree". Third, guidelines on the choice between the different multirater kappa coefficients will be provided.


Keywords:  Conger kappa; Fleiss’s kappa; hierarchical; nested; pairwise agreement; rater


Year:  2018        PMID: 30132375      PMCID: PMC6745615          DOI: 10.1177/0962280218794733

Source DB:  PubMed          Journal:  Stat Methods Med Res        ISSN: 0962-2802            Impact factor:   3.021


1 Introduction

Reliability and agreement studies are of paramount importance in medical and behavioral sciences. They provide information about the amount of error inherent to any diagnosis, score or measurement. Using unreliable measurement instruments and procedures can lead to incorrect conclusions from scientific studies and unreproducible research, while disagreement between physicians can lead, in clinical decision making, to different treatments for the patient. Reliability is classically defined as the ratio between the true score variance and the total variance and is quantified through different versions of the intraclass correlation coefficient (ICC), depending on the study design.[1] When several observers rate subjects, ICCs for consistency are obtained if the systematic shifts between the observers are ignored, while ICCs for agreement are obtained if they are taken into account. In parallel to the ICCs, scaled agreement coefficients[2-4] were developed outside the classical test theory and were found to be closely related to ICCs for agreement. While it is easy to define the agreement between two observers on a categorical scale for a given object (they agree or they don’t agree), this is not the case when agreement is sought among several observers (R > 2). In this latter case, the agreement can be defined by an arbitrary choice along a continuum ranging from agreement between a pair of observers to agreement among all the R observers, i.e. a concordant classification between g observers (2 ≤ g ≤ R). Conger[5] formalised this framework by defining the g-wise agreement coefficients, including the least restrictive (pairwise, g = 2) and the most restrictive (R-wise) definition of agreement. In practice, g is often equal to 2 or to the majority of the observers. Mielke and Berry[6] prefer the R-wise definition because it takes all interactions between the R observers into account.
Despite this appealing property, attention is restricted to pairwise agreement coefficients (g = 2) in this paper because of their practical interpretation. The two pairwise agreement coefficients considered in this paper pertain to the kappa coefficient family and were shown to be asymptotically equivalent to ICCs for agreement when the scale is binary. The first agreement coefficient is commonly named Fleiss kappa. It was developed by Fleiss[7] and was shown to be asymptotically equivalent to the ICC for agreement based on a one-way ANOVA design.[8] In a one-way setting, each object is rated by a different set of observers, randomly selected from a population. Therefore, the variation due to the observers cannot be separated from the error variation and only the ICC for agreement can be determined.[1] The second coefficient is the pairwise kappa coefficient developed by Conger[5] and equivalently by Davies and Fleiss,[9] Schouten[10] and O’Connell and Dobson.[11] This second coefficient will be referred to as ‘Conger kappa’ to differentiate it from ‘Fleiss kappa’. When all objects are classified on a binary scale by the same set of observers randomly selected from a population, Conger kappa is asymptotically equivalent to the ICC for agreement under a two-way ANOVA setting including the observers as a systematic source of disagreement.[9,12] Fleiss kappa coefficient is popular, as attested by the more than 4000 citations of the original paper in Google Scholar, compared to the 350 citations of Conger’s paper and 300 citations of Davies and Fleiss’ paper. The following three issues were identified with the use of multirater kappa coefficients in the literature. First, Fleiss kappa is used independently of the design of the study. The misuse of Fleiss kappa in the two-way ANOVA setting is likely to result in an underestimation of the agreement level,[13] as Fleiss kappa coefficient gives on average smaller values than Conger kappa.
In the same way, the misuse of Conger kappa in one-way ANOVA settings is likely to overestimate the agreement level. It is therefore important to use the appropriate multirater kappa coefficient, based on the study design and the corresponding ANOVA model. Second, the main statistical packages (e.g. the R package ‘irr’, STATA, the SAS macro MAGREE, the SPSS extension STATS_FLEISS_KAPPA) only provide the standard error of Fleiss kappa under the hypothesis that it equals zero, despite the existence of a formula for the general case derived by Schouten.[10] Worse, with the exception of the R package ‘magree’, Conger kappa coefficient, when available (e.g. R package ‘irr’, STATA), is reported without standard error, although an asymptotic formula based on the delta method was also provided by Schouten[14] and O’Connell and Dobson.[11] Finally, there is a need to define multirater kappa coefficients and to provide statistical inference in the presence of multilevel data. Multilevel data are commonly encountered in medical and behavioural sciences, where measures are often obtained on persons nested within organisations (e.g. patients in health care centers), on different parts of the body or by repeated measurements over time. For example, in the study motivating this paper, seven groups of four medical observers with different experience levels were asked to assess the presence of crackles and wheezes (yes/no) in the lung sounds of 20 subjects. The lung sounds were recorded with a stethoscope at three locations on each side of the thorax, leading to six observations per subject. The aim of the study was to evaluate the level of agreement within each group of observers. Specific statistical techniques are needed to account for the dependency between objects of the same cluster. It has been shown in various contexts that ignoring the hierarchical structure of the data can lead to incorrect conclusions (e.g. Hox[15]).
Therefore, Barlow et al.[16] and Oden,[17] among others, proposed stratified agreement coefficients, defined as weighted averages of the agreement coefficients obtained in each cluster. These coefficients, however, are not asymptotically equivalent to ICCs and have a less straightforward interpretation than the coefficients considered here. The aim of this paper is therefore threefold. First, the formulae of the standard error derived with the delta method by Schouten[10,14] and O’Connell and Dobson[11] for Fleiss and Conger kappa will be presented in a unified framework using simple notations. Second, these formulae will be extended to the case of multilevel data structures, based on recent work.[18-20] Third, the paper will emphasise the appropriate use and interpretation of Fleiss and Conger kappa depending on the study design. The standard error formulae derived by the delta method are based on simple matrix algebra, are easy to program and are implemented in the R package ‘multiagree’ available on GitHub. As an alternative, the clustered bootstrap method will also be considered and the statistical performance of the two methods will be compared using simulations. In Section 2, the two multirater kappa coefficients, Fleiss and Conger kappa, are reviewed and the general formula for their standard error derived by the delta method is given. These definitions are generalised to multilevel data in Section 3. The standard errors of the multilevel multirater kappa coefficients are derived using the delta method and the clustered bootstrap method in Section 4. The statistical properties of the delta and bootstrap methods are then studied using simulations in Section 5. The methods are illustrated on psychological and medical data in Section 6. Finally, the results are discussed in Section 7.

2 Definition of the classical pairwise agreement coefficients

Suppose that a sample of N objects is classified by several observers on a K-categorical scale. Two situations can be distinguished and lead to different agreement coefficients: (1) each object i (i = 1, …, N) is rated by a different random sample of R observers and (2) the same R observers rate all objects. Fleiss kappa coefficient is an appropriate agreement measure in the first case and Conger kappa coefficient in the second case. Let the random variable X_irj be equal to 1 when observer r classifies object i in category j (j = 1, …, K) and let x_irj denote its realisation. Finally, let n_ij = Σ_r x_irj be the number of observers classifying object i in category j. When each object i is rated by a different random sample of observers, only the n_ij are available. The two pairwise kappa coefficients, Fleiss and Conger kappas, denoted respectively by κ1 and κ2, are estimated by

κ̂_l = (P_o − P_e(l)) / (1 − P_e(l)),  l = 1, 2.  (1)

The proportion P_o is the observed agreement. It is defined as the mean proportion of agreement between all possible pairs of observers and, for both Fleiss and Conger kappa coefficients, is given by

P_o = (1/N) Σ_i Σ_j n_ij (n_ij − 1) / [R(R − 1)].  (2)

The proportion P_e(l) (l = 1, 2) is the agreement expected under the assumption of statistical independence between any two observers. Its expression differs for the two multirater kappa coefficients κ1 and κ2, as explained in the following sections.

2.1 Fleiss kappa coefficient

The expected agreement was defined by Fleiss[7] under a one-way ANOVA setting, i.e. when the R observers are not the same for all objects, as

P_e(1) = Σ_j p̄_j²,  (3)

where p̄_j = Σ_i n_ij / (NR) is the overall proportion of objects classified in category j (j = 1, …, K). When the scale is binary, κ̂1 is asymptotically equivalent to the ICC for agreement corresponding to a one-way random-effect ANOVA model including the observers as a source of variation in the denominator.[8] The difference with the ICC lies in the definition of the between-objects mean sum of squares (BMS), which is divided by the number of objects N instead of N − 1. The agreement coefficient can be expressed in terms of variance components[21] and reduces to the intraclass kappa coefficient[22] when K = 2. The asymptotic sampling variance of κ̂1 was derived by Schouten[10]; it involves the observed agreement corresponding to each object i, defined in equation (2), and the corresponding expected agreement for object i. Under the null hypothesis that κ1 = 0 and with an equal number of observers per object, the formula reduces to the formula derived by Fleiss[23] that is available in statistical software.
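As a concrete illustration of the estimator just described, the following sketch computes Fleiss kappa from an N × K matrix of counts n_ij. It is an independent Python translation of the observed- and expected-agreement formulae above (the author's own implementation is the R package ‘multiagree’); the function name `fleiss_kappa` and the data layout are ours.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss kappa from an N x K matrix where counts[i, j] is the
    number of observers classifying object i in category j.
    Assumes every object is rated by the same number of observers R."""
    n = np.asarray(counts, dtype=float)
    R = n.sum(axis=1)                       # observers per object
    # Observed agreement: mean pairwise agreement per object
    p_o = ((n * (n - 1)).sum(axis=1) / (R * (R - 1))).mean()
    # Expected agreement under independence (one-way ANOVA setting):
    # sum of squared overall classification proportions
    p_bar = n.sum(axis=0) / n.sum()
    p_e = (p_bar ** 2).sum()
    return (p_o - p_e) / (1 - p_e)
```

With perfect agreement (every observer choosing the same category for each object) the function returns 1, and it is negative when the observed agreement falls below chance.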

2.2 Conger kappa coefficient

The expected agreement is defined as the mean proportion of expected agreement between all pairs of observers[9] and can be expressed as

P_e(2) = [2 / (R(R − 1))] Σ_{r1<r2} Σ_j p_r1j p_r2j,

where p_rj = Σ_i x_irj / N is the proportion of objects classified in category j by observer r (j = 1, …, K). For binary scales, Davies and Fleiss[9] have shown that κ̂2 is asymptotically (N > 15) equivalent to the ICC for agreement corresponding to a two-way random-effect ANOVA model[8] including the observers as a source of variation. Conger kappa can also be expressed in terms of variance components; the difference with the ICC lies in the denominator, where a term involving the between-observers mean sum of squares (JMS) and the mean residual sum of squares (EMS) enters differently. The agreement coefficient reduces to Cohen’s kappa coefficient when R = 2. Davies and Fleiss[9] gave the formula of the large-sample variance in the binary case under the null hypothesis that the agreement coefficient is equal to zero and proposed a FORTRAN program for scales with more than two categories. Schouten[14] and O’Connell and Dobson,[11] however, derived a formula for the general case of nominal scales using the delta method, which is available in the R package ‘magree’.
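Analogously, Conger kappa can be computed directly from the raw classifications. The sketch below assumes an N × R matrix of category labels produced by the same R observers throughout (two-way design); the function name `conger_kappa` is ours and is not part of ‘multiagree’ or ‘magree’.

```python
import numpy as np

def conger_kappa(y):
    """Conger kappa from an N x R matrix where y[i, r] is the category
    (coded 0..K-1) assigned to object i by observer r."""
    y = np.asarray(y)
    N, R = y.shape
    K = int(y.max()) + 1
    # n[i, j]: number of observers classifying object i in category j
    n = np.stack([(y == j).sum(axis=1) for j in range(K)], axis=1).astype(float)
    p_o = ((n * (n - 1)).sum(axis=1) / (R * (R - 1))).mean()
    # Per-observer marginal proportions p[r, j], then the expected
    # agreement as the mean over all observer pairs of sum_j p[r1,j]*p[r2,j]
    p = np.array([[(y[:, r] == j).mean() for j in range(K)] for r in range(R)])
    g = p @ p.T
    p_e = (g.sum() - np.trace(g)) / (R * (R - 1))
    return (p_o - p_e) / (1 - p_e)
```

For R = 2 this reduces to Cohen's kappa, consistent with the statement above.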

3 Definition of multilevel multirater pairwise kappa coefficients

Multilevel multirater pairwise kappa coefficients will be defined similarly to the case of two observers.[18-20] Suppose that the population of objects possesses a 2-level hierarchical structure, in the sense that there are C clusters containing n_c objects each (c = 1, …, C). If R observers rate each object, P = R(R − 1)/2 pairs of observers can be formed; these pairs are denoted by the superscript p = (r1, r2), where r1 and r2 correspond to the two observers of pair p. Let X_cirj equal 1 if object i from cluster c is classified in category j by observer r and let x_cirj be its realisation. Note that under a one-way design, only the counts n_cij = Σ_r x_cirj are in general available. In order to define an overall kappa coefficient, two assumptions are made. First, the objects are assumed to be homogeneous within each cluster, in the sense that the probability of being classified in category j by observer r1 and in category k by observer r2 of pair p is the same for all objects of cluster c. This implies that the probability of being classified in category j by observer r is the same for all objects in cluster c. Second, it is assumed that there is no sub-population of objects, so that these probabilities are also identical across clusters. Let w_c = n_c/N denote the relative sample size of the cth cluster. The multilevel observed agreement is defined as the average observed proportion of agreement over all possible pairs of observers, weighted by the relative cluster sizes; the multilevel expected agreement is defined analogously under the independence assumption, using the overall classification proportions in the one-way ANOVA setting and the per-observer marginal proportions in the two-way ANOVA setting. The multilevel counterparts of Fleiss kappa coefficient and Conger kappa coefficient are obtained by using these multilevel expressions of the observed and expected agreement in equation (1).
They reduce to Fleiss and Conger kappa coefficients when the hierarchical level of the data is ignored.
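Under the two assumptions above, the multilevel coefficients only change how the observed and expected agreement are pooled. A hedged sketch for the multilevel Fleiss kappa, assuming an equal number of observers per object and using the relative cluster sizes w_c = n_c/N as weights (the function name and data layout are ours, not the ‘multiagree’ API):

```python
import numpy as np

def multilevel_fleiss_kappa(cluster_counts):
    """Multilevel Fleiss kappa from a list of per-cluster count matrices,
    where cluster_counts[c][i, j] is the number of observers classifying
    object i of cluster c in category j."""
    mats = [np.asarray(m, dtype=float) for m in cluster_counts]
    N = sum(m.shape[0] for m in mats)          # total number of objects
    # Observed agreement: size-weighted average of per-cluster agreement
    p_o = 0.0
    for m in mats:
        R = m.sum(axis=1)
        po_c = ((m * (m - 1)).sum(axis=1) / (R * (R - 1))).mean()
        p_o += (m.shape[0] / N) * po_c         # weight w_c = n_c / N
    # Expected agreement from the pooled marginal proportions
    # (no sub-population assumption: identical marginals across clusters)
    p_bar = sum(m.sum(axis=0) for m in mats) / sum(m.sum() for m in mats)
    p_e = (p_bar ** 2).sum()
    return (p_o - p_e) / (1 - p_e)
```

With a single cluster this coincides with the classical Fleiss kappa of Section 2.1.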

4 Sampling variability

4.1 Delta method

For the one-way ANOVA setting, we consider the vector of classification proportions per cluster; for the two-way ANOVA setting, the vector collecting, for each cluster c and observer r, the marginal classification proportions, together with the observed agreement between the observers r1 and r2 of each pair p for cluster c. Similarly to Yang and Zhou[18,19] and to Vanbelle,[20] it can be shown that, under mild regularity conditions, these vectors are asymptotically normally distributed, with a variance–covariance matrix whose elements are estimated in Appendix 1 following the technique of Obuchowski.[24] The delta method is then applied to successive functions of these vectors: the aim is first to derive the asymptotic variance–covariance matrix of the vector of observed and expected agreement, after which a last application of the delta method leads to the asymptotic variance of the multilevel Fleiss and Conger kappa coefficients.

4.1.1 Multilevel Fleiss kappa

When the objects are not all classified by the same set of observers, the vector of observed and expected agreement is a function of the vector of classification proportions fulfilling the conditions of the multivariate delta method. Its asymptotic variance–covariance matrix is therefore obtained by pre- and post-multiplying the variance–covariance matrix of the classification proportions by the corresponding Jacobian matrix, which has null elements except for element (1, 1), equal to 1, and the elements corresponding to the derivatives of the expected agreement with respect to the overall classification proportions.

4.1.2 Multilevel Conger kappa

When the objects are all classified by the same set of observers, the expected agreement is the average of the expected agreement over all pairs of observers; the agreement expected under the independence assumption of the two observers of pair p is a bilinear function of their marginal classification proportions. The vector collecting the observed and expected agreement of the P pairs is a function of the vector of marginal classification proportions fulfilling the conditions of the multivariate delta method, and its asymptotic variance–covariance matrix is obtained by pre- and post-multiplying the variance–covariance matrix of the marginal proportions by the corresponding Jacobian matrix. In the same way, the overall observed and expected agreement are the averages of the observed and expected agreement over all pairs of observers, again fulfilling the conditions of the multivariate delta method; the asymptotic variance–covariance matrix of the vector of overall observed and expected agreement follows from a further application of the delta method, with a Jacobian matrix averaging over the pairs.

4.1.3 Multilevel Fleiss and Conger kappa

Finally, the multilevel multirater kappa coefficients are functions of the vectors of observed and expected agreement fulfilling the conditions of the multivariate delta method, and their variance is obtained by a last application of the delta method. When there is only one object per cluster (n_c = 1), the variance given by equation (9) for the multilevel multirater Fleiss and Conger kappa coefficients, multiplied by a correction factor, reduces to equation (5) for Fleiss kappa coefficient and to equation (8) for Conger kappa coefficient, respectively. When there are only two observers, the formula reduces to the formula derived by Yang and Zhou.[18]

4.2 The clustered bootstrap method

The clustered bootstrap method was applied by Kang et al.[25] to derive the standard error of Cohen’s kappa coefficient in the presence of multilevel data and by Vanbelle[20] to derive the variance–covariance matrix when comparing several kappa coefficients. The clustered bootstrap consists of three steps: (1) draw a random sample with replacement of size C from the cluster indexes; (2) for each sampled cluster index, take all observations belonging to that cluster (if the cluster sizes differ, the size of a bootstrap sample can differ from the original sample size N); (3) repeat steps 1 and 2 to generate a total of B independent bootstrap samples. Depending on the study design, the multilevel Fleiss or Conger kappa coefficient (l = 1, 2) is then determined on each bootstrap sample. The bootstrap estimate of the agreement coefficient κ_l is defined as the mean of the B bootstrap values, with the empirical variance of these values as variance estimate.[25] Alternatively, percentiles of the bootstrap distribution can be used to construct confidence intervals.
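The three steps above can be sketched as follows. The statistic passed in would typically be one of the kappa estimators; `clustered_bootstrap` and its argument names are our own, and this mirrors, but is not, the ‘multiagree’ implementation.

```python
import numpy as np

def clustered_bootstrap(clusters, stat, B=2000, seed=0):
    """Clustered bootstrap for an agreement statistic.
    clusters: list of per-cluster data arrays; stat: function mapping the
    pooled (resampled) data to a scalar, e.g. a kappa coefficient."""
    rng = np.random.default_rng(seed)
    C = len(clusters)
    boot = []
    for _ in range(B):
        idx = rng.integers(0, C, size=C)    # step 1: resample cluster indexes
        sample = np.concatenate([clusters[i] for i in idx])  # step 2: whole clusters
        boot.append(stat(sample))           # recompute the coefficient
    boot = np.asarray(boot)                 # step 3: B bootstrap replicates
    est, se = boot.mean(), boot.std(ddof=1)
    ci = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
    return est, se, ci
```

The function returns the bootstrap mean, its standard error and the percentile 95% confidence interval.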

5 Simulations

To study the behavior of the type I error rate (α), multilevel-dependent binary variables with fixed marginal distributions and fixed dependency between pairs of variables were simulated following the algorithm of Emrich and Piedmonte.[26] Data were simulated under a two-way ANOVA setting, making Conger kappa coefficient the appropriate agreement measure. That is, we supposed that R observers each classified C clusters of n objects each. For each cluster, a vector of correlated binary random variables was generated using the R package ‘mvtbinaryEP’ version 1.0.1. Note that the behavior of Fleiss and Conger kappa coefficients is very similar since they only differ in the definition of the expected agreement; the two measures coincide when the marginal probability distributions of the observers are exactly the same. The assessment on a binary scale of C = 25, 50 and 100 clusters of n = 1, 2, 5 or 10 objects each by R = 2, 5 or 10 observers was simulated. For each cluster, the association structure between the assessments made by the observers can be characterised by two n × n matrices. The first matrix represents the intra-cluster association structure: the diagonal elements are equal to 1 (same observer, same object) and the off-diagonal elements (same observer, different objects), representing the association strength between members of the same cluster, were fixed to 0.1, 0.3, 0.5 and 0.7. The second matrix gives the inter-observer agreement structure: the diagonal elements, representing the inter-observer agreement level, were fixed to 0, 0.2, 0.4, 0.6 and 0.8, while the off-diagonal elements, representing the association between the classifications of two different objects by two different observers, were randomly chosen among the values allowed by the algorithm, given the Fréchet bounds. This represents a total of 180 schemes for each number of clusters C.
To allow a wide range of possible agreement values, all observers were assumed to have a uniform marginal probability distribution. This implies that κ1 and κ2 reduce to the correlation coefficient for the binary case, namely the φ coefficient.[2] For each simulation scheme, the mean squared error, the mean standard error and the coverage probability, defined as the proportion of times the 95% confidence interval covers the theoretical agreement value, were recorded. For the clustered bootstrap method, the coverage was determined both for the 95% confidence interval based on the mean and standard error and for the interval based on percentiles. The clustered bootstrap method used B = 5000 bootstrap samples. A total of 1000 simulations were performed for each parameter configuration; the 95% confidence interval for the nominal coverage level is therefore [0.936; 0.963].
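The interval for the nominal coverage level quoted above presumably follows from the binomial standard error of an empirical proportion; a quick sketch, assuming only S = 1000 replicates and a 0.95 nominal level, reproduces it up to rounding:

```python
import math

# Normal-approximation 95% interval for the empirical coverage of a
# nominal 95% CI, estimated from S = 1000 independent simulations.
S, p = 1000, 0.95
half = 1.96 * math.sqrt(p * (1 - p) / S)   # binomial half-width
lo, hi = p - half, p + half
print(f"[{lo:.3f}; {hi:.3f}]")
```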

5.1 Simulation results for n = 1 (no multilevel data)

The coverage levels obtained for Conger kappa coefficients when there is no multilevel structure are presented in Figure 1 for observers with uniform marginal probability distribution using the delta and the percentile-based clustered bootstrap method. The complete results are given as supplemental material.
Figure 1.

Simulations. Coverage for Conger’s kappa coefficient against the number of clusters obtained with the delta method (black) and the percentile-based bootstrap method (gray) in the presence of 2 (dotted), 5 (dashed) and 10 (plain) observers with uniform marginal probability distribution. The number of objects per cluster is equal to 1.

The coverage levels obtained with the delta and the percentile-bootstrap methods are very similar, except for high agreement values, where the percentile-bootstrap method performs better: the percentile-bootstrap confidence intervals (CIs) are left-skewed in that case and provide better coverage levels. An important finding is that the coverage is too low when the sample size is small (C = 25) and the kappa coefficient is small. This situation worsens when the number of observers increases.

5.2 Simulations results for n = 2, 5, 10 (multilevel data)

The results obtained with the delta and the clustered bootstrap methods were very similar and stable across the different numbers of objects per cluster. Therefore, only the results obtained with the delta method for five objects per cluster (n = 5) are presented in Figure 2. The complete results can be found in the supplemental material.
Figure 2.

Simulations. Coverage for Conger’s kappa coefficient according to the delta method in the presence of 2 (dotted), 5 (dashed) and 10 (plain) observers with uniform marginal probability distribution, 25 (up), 50 (middle) and 100 (bottom) clusters and five objects per cluster.

As seen in Figure 2, the coverage level becomes closer to the nominal level as the value of Conger coefficient increases, as the intra-cluster association level decreases and as the number of observers decreases. The coverage level was generally within the 95% confidence interval for kappa values above 0.4 and numbers of clusters larger than 50. Here too, the percentile clustered bootstrap method provides better coverage levels for high agreement values when the number of clusters is small (see Supplemental material, C = 25).

6 Examples

6.1 Psychiatric diagnosis

This section focuses on the data analysed in the original paper of Fleiss.[7] These data do not present a multilevel structure. A total of six psychiatrists were unsystematically selected from a pool of 43 psychiatrists to give a psychiatric diagnosis to each subject. The set of observers can therefore differ from subject to subject, making Fleiss kappa coefficient the appropriate agreement measure. A total of 30 subjects were classified as suffering mainly from (1) depression, (2) personality disorder, (3) schizophrenia, (4) neurosis or (5) other psychiatric disorder. The probability of being classified in these categories was 0.144, 0.144, 0.167, 0.306 and 0.239, respectively (see Table 1).
Table 1.

Fleiss example.

                                    Delta method                    Bootstrap method
Category   p_j    P_o    P_e    κ1 (SE)        95% CI         κ1 (SE)        95% CI
1        0.144  0.813  0.753   0.245 (0.109)  0.031–0.459    0.232 (0.108)  0.020–0.443
2        0.144  0.813  0.753   0.245 (0.115)  0.020–0.470    0.231 (0.098)  0.040–0.422
3        0.167  0.867  0.722   0.520 (0.100)  0.324–0.716    0.511 (0.078)  0.358–0.664
4        0.306  0.776  0.576   0.471 (0.084)  0.307–0.635    0.459 (0.076)  0.310–0.608
5        0.239  0.842  0.636   0.566 (0.115)  0.341–0.791    0.550 (0.128)  0.298–0.801
Overall     –   0.556  0.220   0.430 (0.054)  0.324–0.536    0.418 (0.055)  0.309–0.526

Note: Summary of the statistics to compute Fleiss kappa for each category separately and overall.

Fleiss’ conclusion was that agreement was better than chance for all categories. While 0 is indeed not included in the confidence interval for any of the five categories, the lower confidence bound is close to 0 for categories 1 and 2 (see Table 1). The observed agreement P_o varies between 0.78 and 0.87, meaning that, when isolating one category, pairs of observers agree on average on 78–87% of the patients. However, when considering the five diagnostic categories together, this percentage drops to 56%. This suggests that agreement, when one category is isolated, occurs mainly not on the isolated category itself but on the collapsed category gathering the other four diagnoses. Focusing on the interpretation of the confidence interval for Fleiss kappa coefficient (0.33–0.53), we can be 95% confident that the actual proportion of disagreement is between (1 − 0.53) × 100 = 47% and (1 − 0.33) × 100 = 67% lower than the proportion of disagreement expected under the independence assumption of the observers. Both the observed agreement and Fleiss kappa coefficient therefore indicate a non-negligible variability in psychiatric diagnosis within groups of observers.

6.2 Tromsø study (multilevel)

Lung auscultation is routinely used in daily clinical practice by health professionals. While newer chest imaging methods such as MRI, CT scans and portable ultrasound are now available, the stethoscope remains advantageous in terms of cost, availability, patient care and the training of health professionals. Lung auscultation has proven helpful in the diagnosis of several lung- and heart-related conditions as part of the routine physical examination. However, there is a lack of information about how the presence of wheezes or crackles relates to common heart and lung diseases and about the prognostic value these findings might have. The Tromsø study is a population-based study designed to evaluate abnormal auscultation findings against a wide range of clinical and epidemiological endpoints. Because of the subjective nature of evaluating sounds, the inter-observer agreement among medical professionals in classifying lung sounds was studied before the implementation of the Tromsø study.[27] Seven groups of four observers were asked to assess the presence of crackles and wheezes in the lung sounds of 20 subjects: general practitioners (GPs) from The Netherlands (NLD), Wales (WAL), Russia (RUS) and Norway (NOR), pulmonologists working at the University Hospital of North Norway (PLN), sixth-year medical students (STU) at the Faculty of Health Sciences in Tromsø and an international group of experts (researchers) in the field of lung sounds (EXP). Lung sounds were recorded at six different locations, three on each side of the thorax (anterior thorax (A), upper posterior thorax (U) and lower posterior thorax (L)), leading to a multilevel data structure. A more detailed description of the study can be found in Aviles et al.[27] In this section, we focus on the detection of crackles. Since the same observers classified all the sounds obtained at the six body locations, the multilevel Conger kappa coefficient is adopted.
There are two prerequisites to the definition of agreement at the patient level: (1) the absence of patient sub-populations in terms of crackles detection and (2) the homogeneity of crackles detection within patients, that is, the probability of detecting crackles should be the same for the three thorax locations. Among the 20 subjects, 13 were recruited in a rehabilitation center and 7 in the office environment of the researchers. Although differences in the probability of detecting crackles are expected between these two groups of subjects, Conger kappa coefficient will be computed overall because of the limited sample size. The probability of detecting crackles is given for the seven groups of observers and the three thorax locations in Table 2.
Table 2.

Tromsø example.

Body location    EXP       NOR       RUS      WAL       NLD       PLN       STU
U               0.13      0.11      0.23     0.083     0.12      0.22      0.22
L               0.29      0.38      0.37     0.19      0.16      0.34      0.36
A               0.016     0.048     0.30     0.029     0.046     0.12      0.22
P-value       <0.0001   <0.0001    0.031   <0.0001    0.0015   <0.0001    0.0093

Note: Probability of detecting crackles according to the location (anterior thorax (A), upper posterior thorax (U) and lower posterior thorax (L)). The probabilities are compared among locations using a multilevel probit regression.

Since the probability of detecting crackles differs between the three locations (see Table 2), the average proportion of agreement between pairs of observers is reported for each group of observers and each thorax location separately, in addition to the overall agreement (see Table 3).
Table 3.

Tromsø example.

              U                   L                   A                  All
Group   P_o   κ2 (SE)       P_o   κ2 (SE)       P_o   κ2 (SE)       P_o   κ2 (SE)
EXP    0.88   0.65 (0.13)  0.78   0.52 (0.08)  0.91   0.04 (0.06)  0.86   0.56 (0.08)
NOR    0.92   0.75 (0.12)  0.78   0.55 (0.10)  0.85   0.10 (0.06)  0.85   0.58 (0.08)
RUS    0.72   0.25 (0.08)  0.64   0.26 (0.07)  0.59   0.06 (0.07)  0.65   0.20 (0.05)
WAL    0.86   0.48 (0.17)  0.88   0.71 (0.10)  0.86   0.01 (0.05)  0.87   0.53 (0.09)
NLD    0.85   0.54 (0.13)  0.86   0.61 (0.12)  0.85   0.07 (0.06)  0.86   0.49 (0.10)
PLN    0.80   0.50 (0.14)  0.76   0.49 (0.12)  0.73   0.05 (0.07)  0.76   0.40 (0.09)
STU    0.78   0.43 (0.15)  0.79   0.56 (0.11)  0.63   0.02 (0.05)  0.74   0.37 (0.08)

Note: Proportion of agreement (P) and Conger’s kappa coefficient (standard error) for each group of observers reported overall (All) and at each thorax location (anterior thorax (A), upper posterior thorax (U) and lower posterior thorax (L)).

When looking at the individual thorax locations, it can be seen in Table 3 that, on average, pairs of general practitioners (NOR, WAL, NLD) agree on the classification of more than 78% of the sounds, independently of the thorax location. The exception is the Russian GPs (RUS), where pairs agree on only 59% to 72% of the sounds. This lower agreement level might partially be explained by confusion over the English nomenclature around the term crackles (see Aviles et al.[27] for more details). The experts agree on average on 78–91% of the sounds, the pulmonologists on 73–80% and the students on 63–79%, depending on the location of the auscultation. These agreement proportions translate to relatively low Conger kappa coefficients, especially at the anterior thorax location. This can be explained by the low probabilities of detecting crackles (see Table 2) combined with the small sample size: the misclassification of a single sound already represents a disagreement on 5% of the sounds. The overall and per-location multirater agreement levels within groups of GPs were considered satisfactory by the researchers for using lung auscultation in the Tromsø study.
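The quantities reported in Table 3 can be computed directly from the raw ratings. As a minimal illustration only (not the paper's 'multiagree' implementation, and without the delta-method standard errors), the average pairwise proportion of agreement Po and Conger's kappa for a subjects-by-raters matrix of nominal ratings can be sketched in Python; the function name and data layout are assumptions:

```python
import numpy as np

def conger_kappa(ratings):
    """Average pairwise proportion of agreement Po and Conger's kappa.

    ratings: (N subjects x R raters) array of nominal category labels,
    all subjects rated by the same R raters (two-way design).
    """
    ratings = np.asarray(ratings)
    n, r = ratings.shape
    cats = np.unique(ratings)
    # counts[i, k] = number of raters classifying subject i in category k
    counts = np.stack([(ratings == c).sum(axis=1) for c in cats], axis=1)
    # observed pairwise agreement per subject, averaged over subjects
    po = (counts * (counts - 1)).sum(axis=1).mean() / (r * (r - 1))
    # rater-specific marginal classification probabilities p[j, k]
    p = np.stack([(ratings == c).mean(axis=0) for c in cats], axis=1)
    # chance agreement averaged over all ordered pairs of distinct raters
    pe = (p.sum(axis=0) ** 2 - (p ** 2).sum(axis=0)).sum() / (r * (r - 1))
    return po, (po - pe) / (1 - pe)
```

For two perfectly agreeing raters over both categories, the sketch returns Po = 1 and kappa = 1; with the Tromsø data, such a function would be applied per group of observers and per thorax location.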

7 Discussion

In this paper, the asymptotic formula for the standard error of the Fleiss and Conger kappa coefficients, obtained with the delta method, was presented in a unified framework. The formula was extended to account for multilevel data structures. It only involves simple matrix calculations and can easily be implemented in practice. An R package, 'multiagree', was developed by the author and is available on GitHub. Code to reproduce the results is available as Supporting Information on the journal's web page.

The scope of this paper was limited to the Fleiss and Conger kappa coefficients for two reasons. First, they cover two study designs frequently encountered in practice. Second, both are asymptotically equivalent to ICCs for agreement. The Fleiss kappa coefficient was developed as an agreement measure under a one-way ANOVA model, i.e. when the objects are rated by different sets of observers. The Conger kappa coefficient, on the other hand, was developed as an agreement measure under a two-way ANOVA model, i.e. when all objects are rated by the same set of observers. The choice between these two agreement coefficients should therefore be based primarily on the study design.

Two assumptions were made to ensure the existence of an overall multirater multilevel kappa coefficient: the homogeneity of the members of a cluster and the existence of a common kappa coefficient across the clusters. When there is evidence that these assumptions do not hold, as discussed by Yang and Zhou,[18] a separate multirater multilevel kappa coefficient should be computed for each sub-population identified. In the same way, if sub-groups of observers are identified, it is better to compute agreement separately within the different groups.[14]

The multilevel delta method, although asymptotic, showed coverage levels similar to those of the clustered bootstrap method. With more than two observers, good statistical performance of the delta method was observed for a moderate number of clusters (e.g. C = 50) and multilevel kappa coefficients higher than 0.4, regardless of the cluster size. For two observers, good statistical properties were already observed for small sample sizes (C = 25). When the sample size is small, confidence intervals based on the percentile clustered bootstrap method provide better coverage levels for high kappa coefficients ().

One extension of the methods presented in this paper is also implemented in the R package 'multiagree': the results in this paper were combined with the results in Vanbelle[20] to allow the comparison of several (multilevel) multirater agreement coefficients. A further extension could be the inclusion of agreement weights when computing multilevel multirater agreement coefficients. To summarise, this paper provides two simple methods to compute the standard error of multirater kappa coefficients that perform well when the number of clusters is moderate (C = 50). Only the percentile clustered bootstrap method provided satisfactory coverage levels when the number of clusters was small (C = 25) and the agreement was high (). The delta and the clustered bootstrap methods should therefore be used with caution when the number of clusters is small.

Supplemental material (Supplemental Figures, Supplemental material 1 and Supplemental material 2) for 'Asymptotic variability of (multilevel) multirater kappa coefficients' by Sophie Vanbelle is available as additional data files in Statistical Methods in Medical Research.
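The percentile clustered bootstrap discussed above resamples whole clusters with replacement and recomputes the agreement coefficient on each resample, so that the within-cluster dependence is preserved. A minimal sketch follows; the function name and the generic `stat` interface are assumptions for illustration, not the 'multiagree' API:

```python
import numpy as np

def cluster_bootstrap_ci(clusters, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile clustered bootstrap confidence interval.

    clusters: list of per-cluster rating arrays (rows = subjects).
    stat: function mapping a stacked ratings array to a scalar,
          e.g. a multirater kappa coefficient.
    """
    rng = np.random.default_rng(seed)
    n_clusters = len(clusters)
    reps = []
    for _ in range(n_boot):
        # resample whole clusters with replacement, keeping subjects intact
        idx = rng.integers(0, n_clusters, size=n_clusters)
        sample = np.concatenate([clusters[i] for i in idx], axis=0)
        reps.append(stat(sample))
    lo, hi = np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

In practice `stat` would be the Fleiss or Conger kappa coefficient computed on the stacked resampled data; with few clusters (C = 25), the percentile interval from this scheme was the only one with satisfactory coverage in the simulations.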
References (13 in total)

1.  A simple method for the analysis of clustered binary data.

Authors:  J N Rao; A J Scott
Journal:  Biometrics       Date:  1992-06       Impact factor: 2.571

2.  A comparison of methods for calculating a stratified kappa.

Authors:  W Barlow; M Y Lai; S P Azen
Journal:  Stat Med       Date:  1991-09       Impact factor: 2.373

3.  Estimating kappa from binocular data.

Authors:  N L Oden
Journal:  Stat Med       Date:  1991-08       Impact factor: 2.373

4.  Intraclass correlations: uses in assessing rater reliability. (Review)

Authors:  P E Shrout; J L Fleiss
Journal:  Psychol Bull       Date:  1979-03       Impact factor: 17.737

5.  Resampling probability values for weighted kappa with multiple raters.

Authors:  Paul W Mielke; Kenneth J Berry; Janis E Johnston
Journal:  Psychol Rep       Date:  2008-04

6.  On the comparison of correlated proportions for clustered data.

Authors:  N A Obuchowski
Journal:  Stat Med       Date:  1998-07-15       Impact factor: 2.373

7.  Comparing correlated kappas by resampling: is one level of agreement significantly different from another?

Authors:  D P McKenzie; A J Mackinnon; N Péladeau; P Onghena; P C Bruce; D M Clarke; S Harrigan; P D McGorry
Journal:  J Psychiatr Res       Date:  1996 Nov-Dec       Impact factor: 4.791

8.  Kappa statistic for clustered matched-pair data.

Authors:  Zhao Yang; Ming Zhou
Journal:  Stat Med       Date:  2014-02-16       Impact factor: 2.373

9.  Kappa statistic for clustered dichotomous responses from physicians and patients.

Authors:  Chaeryon Kang; Bahjat Qaqish; Jane Monaco; Stacey L Sheridan; Jianwen Cai
Journal:  Stat Med       Date:  2013-03-27       Impact factor: 2.373

10.  International perception of lung sounds: a comparison of classification across some European borders.

Authors:  Juan Carlos Aviles-Solis; Sophie Vanbelle; Peder A Halvorsen; Nick Francis; Jochen W L Cals; Elena A Andreeva; Alda Marques; Päivi Piirilä; Hans Pasterkamp; Hasse Melbye
Journal:  BMJ Open Respir Res       Date:  2017-12-18
