
Comparing dependent kappa coefficients obtained on multilevel data.

Sophie Vanbelle

Abstract

Reliability and agreement are two notions of paramount importance in medical and behavioral sciences. They provide information about the quality of the measurements. When the scale is categorical, reliability and agreement can be quantified through different kappa coefficients. The present paper provides two simple alternatives to more advanced modeling techniques, which are not always adequate in case of a very limited number of subjects, when comparing several dependent kappa coefficients obtained on multilevel data. This situation frequently arises in medical sciences, where multilevel data are common. Dependent kappa coefficients can result from the assessment of the same individuals at various occasions or when each member of a group is compared to an expert, for example. The method is based on simple matrix calculations and is available in the R package "multiagree". Moreover, the statistical properties of the proposed method are studied using simulations. Although this paper focuses on kappa coefficients, the method easily extends to other statistical measures.
© 2017 The Author. Biometrical Journal Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Keywords:  Clustered bootstrap; Delta method; Hierarchical; Intraclass; Rater

Year:  2017        PMID: 28464322      PMCID: PMC5600130          DOI: 10.1002/bimj.201600093

Source DB:  PubMed          Journal:  Biom J        ISSN: 0323-3847            Impact factor:   2.207


Introduction

Reliability and agreement studies are of paramount importance in behavioral, social, biological, and health sciences. They both provide information about the quality of the measurements (Kottner et al., 2011). When observers classify items on a categorical scale, reliability refers to the ability of the scale to differentiate between the items despite the presence of measurement error, while agreement refers to the degree of closeness between two assessments made on the same items. Good reliability is an essential property of a measurement scale, especially when assessing the correlation with other variables, because of the well-known attenuation effect (the presence of measurement error tends to weaken correlations between variables). In addition to good reliability, good agreement is sometimes also imperative, as in clinical decision making, where observers should provide exactly the same scores in order to make the same decision for the patient. Agreement is also involved in the assessment of criterion validity, where the degree of agreement between a measurement instrument under scrutiny and a reference method, which is often itself subject to measurement error, is studied.

This paper is motivated by the third part of an exploratory study investigating the influence of different factors on inter- and intraobserver agreement levels in the evaluation of oropharyngeal dysphagia severity (Pilz et al., 2016). Oropharyngeal dysphagia is characterized by difficulties in swallowing. In addition to a deterioration of quality of life, it can have severe consequences such as malnutrition, dehydration, aspiration pneumonia, and sudden death. Fiberoptic endoscopic evaluation of swallowing (FEES) is nowadays the first-choice method to evaluate the severity of oropharyngeal dysphagia. It permits the anatomical assessment of the pharyngeal and laryngeal structures and provides a comprehensive evaluation of the pharyngeal stage of swallowing. FEES comprises five criteria for the visual evaluation and interpretation of swallowing images: (1) valleculae pooling (No, …, …), (2) pyriform pooling (No, …, …), (3) number of piecemeal deglutitions (1, 2, 3, 4, or 5 or more), (4) posterior spill (No, Yes), and (5) penetration/aspiration (No, …, …). Despite the increasing popularity of the FEES assessment, there is no standardization of the measurement criteria. Crucially, the interpretation of swallowing images is based on visual judgment and is thus subjective. It might be influenced by factors like the observer's experience or the bolus consistency. The FEES study therefore aimed to investigate the influence of different factors on inter- and intraobserver agreement levels.

In the third FEES study part, two observers (medical students who received special training) independently assessed 40 swallowing images obtained on 20 patients, who consecutively swallowed 5 cc of a thin liquid and 5 cc of a thick liquid. The swallowing images were assessed in a random order by the observers, blinded to any medical information on the patient. The exercise was repeated after two weeks under the same conditions to determine the intraobserver agreement level of each observer. Then, the two observers reviewed the medical images during two consensus meetings, planned two weeks apart. During the consensus meetings, the two students reviewed the images together and determined a score in consensus.
The aim was to compare the individual intraobserver agreement level of the two students to the intraobserver agreement level obtained by the students in consensus. The FEES study possesses two particularities. First, the structure of the data is multilevel, that is, items are nested within clusters. Here, two swallows (one with thin and one with thick liquid) are nested within each patient. Multilevel data are common in medical and behavioral sciences, where measures are often obtained on persons nested in organizations (e.g., patients in health care centers), on different body parts, or by repeated measurements over time. Ignoring the multilevel structure of the data can lead to incorrect conclusions (see e.g. Hox, 2002). Secondly, the same patients were evaluated by the same observers under two experimental conditions (individually and in consensus). This introduces a dependency between the agreement coefficients to be compared, a dependency that also needs to be taken into account.

When items (subjects/objects) are evaluated by observers on a categorical scale, reliability, as classically defined, can be measured through the intraclass kappa coefficient for binary scales (Kraemer, 1979). For ordinal scales, Cohen (1968), Fleiss and Cohen (1973), and Schuster (2004) showed that the quadratic weighted kappa coefficient is asymptotically equivalent to an intraclass correlation coefficient. For nominal scales, however, reliability has to be assessed separately for each category with the intraclass kappa coefficient (Kraemer, 1979). On the other hand, agreement can be measured through Cohen's kappa coefficient for nominal scales (Cohen, 1960) and through the linear weighted kappa coefficient for ordinal scales (Cohen, 1968; Cicchetti and Allison, 1971; Vanbelle, 2016). Kappa coefficients are relative agreement coefficients. They have the particularity of involving the marginal probability distribution of the observers, that is, the probability for an observer to classify items in the different categories of the scale (Warrens, 2010, 2014). Through this relationship, kappa coefficients depend on the prevalence of the trait under study, which limits the possibility of comparing them across studies with different prevalences. Several authors (Thompson and Walter, 1988; Feinstein and Cicchetti, 1990; Cicchetti and Feinstein, 1990; Byrt et al., 1993; de Vet et al., 2006) proposed the use of absolute agreement measures (e.g., the proportion of items classified in the same category by the two observers) to avoid this dependency. These absolute coefficients are, however, not sensitive to a scale's inability to distinguish between items in a population with low prevalence, and kappa coefficients are therefore to be preferred (Rogot and Goldberg, 1966; Vach, 2005; Kraemer et al., 2004; Vanbelle, 2016).

While the statistical analysis of multilevel data has become very popular in the last decades, little attention has been paid to the evaluation of agreement in the presence of multilevel data. This could be explained by the fact that it is common practice to summarize the information at the highest level of the hierarchy (e.g., the patient in the FEES study) following rules established by the researchers and then compute agreement based on the summary measures. For example, the FEES score could be defined at the patient level as the average or the maximum score obtained for the thin and the thick swallow.
By doing so, information is lost on possible disagreements at the lowest level of the hierarchy, and this can result in biased estimates of agreement levels. Moreover, it is not possible to predict the relationship between the agreement values obtained at different hierarchical levels (Vanbelle et al., 2012). Kappa coefficients were nevertheless extended over the years to account for particular study designs. In particular, population-averaged (Thomson, 2001; Williamson and Manatunga, 1997; Williamson et al., 2000; Gonin et al., 2000) and unit-specific models (Gajewski et al., 2007; Vanbelle et al., 2012; Vanbelle and Lesaffre, 2016) were developed to account for a multilevel data structure and for the presence of categorical and continuous predictors. While these modeling techniques represent considerable progress, they require adequate model specifications, expert programming skills, and a reasonable sample size (Carey et al., 1993). The latter is not achieved with the 20 patients of the FEES study. Recently, Yang and Zhou (2014, 2015) developed a marginal approach, based on the delta method and involving only simple matrix calculations, to adjust the standard error of kappa coefficients in the presence of multilevel data. Their derivations are, however, limited to the estimation of a single kappa coefficient and do not permit the comparison of several dependent kappa coefficients.

Dependent kappa coefficients can occur in many ways. For example, two observers may assess the same individuals at various occasions or under different experimental conditions, as in the FEES study. Alternatively, each member of a group of observers may be compared to an expert in assessing the same items on a categorical scale; in this latter case, the agreement coefficient is used as a criterion validity measure. In the present paper, we therefore develop a method to compare several dependent kappa coefficients obtained on multilevel data. This provides a new, practical, and simple alternative to the more advanced statistical techniques. The method is based on Hotelling's T² statistic, previously used to compare dependent kappa coefficients (Vanbelle and Albert, 2008). This paper improves on the earlier method by extending it to multilevel data structures and by using two different ways to estimate the variance-covariance matrix of the kappa coefficients: the delta method and the clustered bootstrap method (Field and Welsh, 2007).

The kappa coefficients are introduced in Section 2. In Section 3, the kappa coefficients are generalized to multilevel structures (Yang and Zhou, 2014, 2015). The method to compare several kappa coefficients is provided in Section 4, using the delta method and the clustered bootstrap method. The statistical properties of the new method are studied in Section 5 for a binary and a 3-ordinal scale. In Section 6, the otorhinolaryngological data are analyzed. Finally, the method is discussed in Section 7.

Definition of the kappa coefficients

Kappa coefficients were initially defined in terms of a computation procedure rather than in terms of population parameters (see e.g. Kraemer, 1979). Vanbelle (2016) recently provided a definition in terms of population parameters, making the interpretation of the most common kappa forms straightforward. This definition will be adopted here.

Consider a population of items (subjects or objects) and two fixed observers. In the FEES study, the items are swallowing images and the two observers are medical students. Let the random variable $X_{rk}$ represent the classification of item $k$ by observer $r$, that is, $X_{rk} = i$ if observer $r$ ($r = 1, 2$) classifies a randomly selected item $k$ of the population in category $i$ ($i = 1, \ldots, I$). Further consider the random variable $D_k = F(X_{1k}, X_{2k})$ representing the disagreement between the two observers on the classification of item $k$. When the scale is binary or nominal, the function $F(X_{1k}, X_{2k}) = 1 - I(X_{1k}, X_{2k})$ is usually used, where $I(x, y)$ equals 1 when $x = y$ and 0 otherwise. The random variable $D_k$ then equals 1 if a disagreement occurs and 0 otherwise. When the scale is ordinal, functions of the form $F(X_{1k}, X_{2k}) = |X_{1k} - X_{2k}|^a$ ($a > 0$) are usually used; in practice, $a = 1$ or $a = 2$ is most common. The random variable $D_k$ then gives the distance (number of categories) separating the classifications made by the two observers when $a = 1$; this number is squared when $a = 2$. Kappa coefficients are defined by the formula

\[ \kappa = 1 - \frac{E(D_k)}{E_{\mathrm{ind}}(D_k)}, \qquad (1) \]

where $E(D_k)$ is the expectation of $D_k$ over the population of items and $E_{\mathrm{ind}}(D_k)$ is the same expectation assuming statistical independence of the ratings made by the two observers. When the function $F(X_{1k}, X_{2k}) = 1 - I(X_{1k}, X_{2k})$ is used, Cohen's kappa coefficient is obtained: it compares the expected probability of disagreement to the same probability under statistical independence of the ratings. Using $F(X_{1k}, X_{2k}) = |X_{1k} - X_{2k}|^a$ leads to the linear weighted kappa coefficient when $a = 1$ and to the quadratic kappa coefficient when $a = 2$. The linear (quadratic) weighted kappa coefficient compares the expected (squared) number of categories separating the classifications made by the two observers to the same number under statistical independence of the ratings. Kappa coefficients are therefore relative agreement measures, depending on the marginal probability distributions of the observers ($\pi_{i\cdot}$ and $\pi_{\cdot j}$, defined below) through the denominator of Eq. (1). Kappa coefficients vary between −1 and 1. The value 1 is reached when there is perfect agreement between the two observers, while a value of 0 means that the agreement is equal to what is expected under statistical independence of the ratings.

The quantity $E(D_k)$ can be expressed in terms of the joint classification probabilities of the two observers using agreement weights $w_{ij}$ or disagreement weights $1 - w_{ij}$. Suppose that the joint probabilities are the same for all items in the population, that is, $P(X_{1k} = i, X_{2k} = j) = \pi_{ij}$ for all $k$. This implies that the marginal probability distribution for observer 1 is given by $\pi_{i\cdot} = \sum_j \pi_{ij}$ and for observer 2 by $\pi_{\cdot j} = \sum_i \pi_{ij}$. Then, the kappa coefficient can be expressed as

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

with $p_o = \sum_i \sum_j w_{ij}\, \pi_{ij}$ and $p_e = \sum_i \sum_j w_{ij}\, \pi_{i\cdot}\, \pi_{\cdot j}$. The corresponding disagreement quantities are obtained by replacing $w_{ij}$ by $1 - w_{ij}$ in $p_o$ and $p_e$, respectively. The agreement weights $w_{ij} = 1$ if $i = j$ and 0 otherwise were introduced by Cohen (1960). Cicchetti and Allison (1971) introduced the linear agreement weights $w_{ij} = 1 - |i - j|/(I - 1)$ and Cohen (1968) the quadratic agreement weights $w_{ij} = 1 - (i - j)^2/(I - 1)^2$.

For binary scales, under the assumption of equal marginal probability distributions ($\pi_{i\cdot} = \pi_{\cdot i}$), Cohen's kappa coefficient is called the intraclass kappa coefficient and is a reliability measure (Kraemer, 1979). That is, the intraclass kappa coefficient is the ratio of the variance of the “true” scores to that of the observed scores, where the “true” score is the mean over independent replications of the measure (Kraemer et al., 2004).
The quadratic kappa coefficient was also shown to be asymptotically equivalent to an intraclass correlation coefficient (Cohen, 1968; Fleiss and Cohen, 1973; Schuster, 2004).
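To make these definitions concrete, the following minimal R sketch computes Cohen's kappa and the weighted kappa coefficients from a joint classification table of proportions, using the agreement weights recalled above. The function name and the example table are purely illustrative and are not taken from the paper or from the multiagree package.

```r
# Sketch: weighted kappa from an I x I joint classification table of proportions.
# Identity weights give Cohen's kappa; linear and quadratic weights give the
# linear and quadratic weighted kappa coefficients.
weighted_kappa <- function(pi_joint, type = c("identity", "linear", "quadratic")) {
  type <- match.arg(type)
  I <- nrow(pi_joint)
  dist <- abs(outer(seq_len(I), seq_len(I), "-"))   # |i - j| for every cell
  w <- switch(type,
              identity  = (dist == 0) * 1,
              linear    = 1 - dist / (I - 1),
              quadratic = 1 - dist^2 / (I - 1)^2)
  p_row <- rowSums(pi_joint)                        # marginal distribution of observer 1
  p_col <- colSums(pi_joint)                        # marginal distribution of observer 2
  p_o <- sum(w * pi_joint)                          # observed (weighted) agreement
  p_e <- sum(w * outer(p_row, p_col))               # agreement expected under independence
  (p_o - p_e) / (1 - p_e)
}

# Illustrative 3-category ordinal example
tab <- matrix(c(20, 5, 1,
                 4, 15, 6,
                 1, 4, 14), nrow = 3, byrow = TRUE)
weighted_kappa(tab / sum(tab), "identity")   # Cohen's kappa
weighted_kappa(tab / sum(tab), "linear")     # linear weighted kappa
```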

Definition of multilevel kappa coefficients

Suppose now that the population possesses a 2-level hierarchical structure, in the sense that observations are made on items (level 1 of the hierarchy) nested in $K$ clusters (level 2 of the hierarchy), $k = 1, \ldots, K$. In the FEES example, the clusters are the patients and there are two swallows with a different liquid consistency nested in each patient. In order to define an overall kappa coefficient over the population of items, Yang and Zhou (2014) make two assumptions. First, they assume that the members of a cluster are homogeneous, in the sense that each member of cluster $k$ has the same probability $\pi_{ijk}$ of being classified in category $i$ by observer 1 and in category $j$ by observer 2. This implies that the members of cluster $k$ also have the same probability of being classified in category $i$ by rater 1 ($\pi_{i\cdot k} = \sum_j \pi_{ijk}$) and in category $j$ by rater 2 ($\pi_{\cdot jk} = \sum_i \pi_{ijk}$). In the FEES study, this means that the oropharyngeal dysphagia severity scores should not depend on the liquid consistency. Secondly, Yang and Zhou (2014) assume the homogeneity of the pairwise classification probabilities among the $K$ clusters, that is, $\pi_{ijk} = \pi_{ij}$ for all $k$, and therefore of the marginal classification probabilities ($\pi_{i\cdot k} = \pi_{i\cdot}$ and $\pi_{\cdot jk} = \pi_{\cdot j}$). In the FEES study, this means that all patients should possess the same probability of being classified in the different severity categories, that is, that there is no patient subpopulation in terms of dysphagia severity.

Let $c_k = n_k/N$ denote the relative sample size of the $k$-th cluster, where $n_k$ is the number of items in cluster $k$ and $N = \sum_k n_k$. In the FEES study, $c_k = 2/40 = 0.05$ since there are two swallows per patient. The marginal probability distribution of an observer is the weighted average of the marginal probability distributions at the cluster level, that is, $\pi_{i\cdot} = \sum_k c_k \pi_{i\cdot k}$ and $\pi_{\cdot j} = \sum_k c_k \pi_{\cdot jk}$. In the same way, the pairwise classification probabilities over the population of clusters are given by $\pi_{ij} = \sum_k c_k \pi_{ijk}$. The weighted kappa coefficient for multilevel data is then defined as (Yang and Zhou, 2014)

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where $p_o = \sum_i \sum_j w_{ij}\, \pi_{ij}$ is the agreement and $p_e = \sum_i \sum_j w_{ij}\, \pi_{i\cdot}\, \pi_{\cdot j}$ the agreement expected under the assumption of statistical independence of the ratings. Note that $p_o$ can be rewritten as

\[ p_o = \sum_{k=1}^{K} c_k \sum_i \sum_j w_{ij}\, \pi_{ijk}. \]

This implies that the agreement is a weighted average of the agreement obtained at the cluster level. The weights $w_{ij}$ in the above equations are the same as those defined in Section 2, leading to the multilevel counterparts of Cohen's, the linear, and the quadratic kappa coefficients. Yang and Zhou (2014) showed that the weighted kappa coefficient obtained with the quadratic weights can be interpreted as an intraclass correlation coefficient and is a reliability measure. Using the linear weights, the weighted kappa coefficient is a relative agreement measure comparing the mean distance between the classifications of the two observers to the mean distance under the independence assumption of the ratings (Vanbelle, 2016). The family of kappa coefficients as defined by Yang and Zhou (2014) corresponds to the classical kappa coefficients when the hierarchical level of the data is ignored. A sample estimate of the kappa coefficients is obtained by replacing the probabilities $\pi_{ijk}$, $\pi_{i\cdot k}$, and $\pi_{\cdot jk}$ by their corresponding sample proportions $p_{ijk}$, $p_{i\cdot k}$, and $p_{\cdot jk}$.
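A minimal R sketch of how the multilevel estimate might be computed from long-format data (one row per item, with the item's cluster and the two ratings), reusing the weighted_kappa() helper sketched in Section 2. The data layout and the function name are illustrative assumptions, not the implementation of the multiagree package.

```r
# Sketch: multilevel weighted kappa following the cluster-weighted construction above.
# 'df' is assumed to have columns cluster, rater1, rater2 with categories coded 1..I.
multilevel_kappa <- function(df, I, type = "linear") {
  N <- nrow(df)
  pi_joint <- matrix(0, I, I)
  for (k in unique(df$cluster)) {
    sub <- df[df$cluster == k, ]
    c_k <- nrow(sub) / N                                        # relative cluster size c_k = n_k / N
    p_k <- table(factor(sub$rater1, levels = 1:I),
                 factor(sub$rater2, levels = 1:I)) / nrow(sub)  # within-cluster joint proportions
    pi_joint <- pi_joint + c_k * p_k                            # weighted average over clusters
  }
  weighted_kappa(pi_joint, type)
}
```

As noted above, this point estimate coincides with the classical one obtained when the hierarchy is ignored; the multilevel structure only affects the variance-covariance matrix derived in the next sections.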

Comparison of several dependent kappa coefficients

Hotelling's T² test

Hotelling's T² test will be used (Vanbelle and Albert, 2008) to compare several dependent multilevel kappa coefficients as defined in Section 3. Suppose that $L$ dependent kappa coefficients $\kappa_l$ ($l = 1, \ldots, L$) obtained on multilevel data have to be compared, that is, we wish to test the null hypothesis $H_0: \mathbf{C}\boldsymbol{\kappa} = \mathbf{0}$ versus $H_1: \mathbf{C}\boldsymbol{\kappa} \neq \mathbf{0}$, where $\boldsymbol{\kappa} = (\kappa_1, \ldots, \kappa_L)^T$ and $\mathbf{C}$ is an $(L-1) \times L$ patterned matrix obtained by merging the identity matrix and a vector of −1. For example, in the FEES study, we are interested in comparing three kappa coefficients: one kappa coefficient obtained between the measurements made two weeks apart by each medical student individually (namely, κ1 and κ2) and one kappa coefficient obtained between the two consensus meetings of the two students (namely, κ3). This yields

\[ \mathbf{C} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}. \]

The test statistic

\[ T^2 = (\mathbf{C}\hat{\boldsymbol{\kappa}})^T (\mathbf{C}\mathbf{S}\mathbf{C}^T)^{-1} (\mathbf{C}\hat{\boldsymbol{\kappa}}), \qquad (3) \]

where $\hat{\boldsymbol{\kappa}}$ and $\mathbf{S}$ are, respectively, the vector of estimates of $\boldsymbol{\kappa}$ and their estimated variance-covariance matrix, is distributed as Hotelling's T² under two assumptions. The first is the existence of a common kappa coefficient across the clusters; this assumption is already made by Yang and Zhou (2014). The second assumption is multivariate normality of the vector of kappa coefficients $\hat{\boldsymbol{\kappa}}$. The null hypothesis is rejected at the α-level if

\[ T^2 > \frac{(L-1)(K-1)}{K-L+1}\, F_{\alpha;\, L-1,\, K-L+1}, \qquad (4) \]

where $F_{\alpha;\, L-1,\, K-L+1}$ is the upper α-percentile of the F distribution on $L-1$ and $K-L+1$ degrees of freedom. Note that, since $K$ is large in general, the critical value in Eq. (4) can be approximated by $\chi^2_{1-\alpha;\, L-1}$, the $(1-\alpha)$-th percentile of the chi-square distribution on $L-1$ degrees of freedom. If $\mathbf{c}_l$ denotes the $l$-th row of matrix $\mathbf{C}$, multiple comparisons can be made by using simultaneous confidence intervals for the contrasts $\mathbf{c}_l \boldsymbol{\kappa}$, namely

\[ \mathbf{c}_l \hat{\boldsymbol{\kappa}} \pm \sqrt{\frac{(L-1)(K-1)}{K-L+1}\, F_{\alpha;\, L-1,\, K-L+1}\; \mathbf{c}_l \mathbf{S} \mathbf{c}_l^T}. \]

Note that other forms of the matrix $\mathbf{C}$ can be envisaged, depending on the individual contrasts of interest.
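As a rough illustration of the test, the following R sketch computes the T² statistic and an F-based p-value consistent with Eq. (4), given a vector of kappa estimates, their estimated variance-covariance matrix S (obtained with the delta method or the clustered bootstrap described next), and the number of clusters K. The function name and the numerical values are illustrative only.

```r
# Sketch: Hotelling T2 test of H0: C kappa = 0 for L dependent kappa coefficients.
hotelling_kappa <- function(kappa_hat, S, K) {
  L <- length(kappa_hat)
  C <- cbind(diag(L - 1), rep(-1, L - 1))          # contrasts kappa_l - kappa_L
  d <- C %*% kappa_hat
  T2 <- drop(t(d) %*% solve(C %*% S %*% t(C)) %*% d)
  scale <- (L - 1) * (K - 1) / (K - L + 1)         # scaling linking T2 to the F distribution
  p_value <- pf(T2 / scale, L - 1, K - L + 1, lower.tail = FALSE)
  list(T2 = T2, p_value = p_value)
}

# FEES-like example: three dependent kappa coefficients, K = 20 patients
# (all numbers purely illustrative)
kappa_hat <- c(0.79, 0.94, 0.75)
S <- matrix(c(0.012, 0.004, 0.005,
              0.004, 0.004, 0.003,
              0.005, 0.003, 0.014), nrow = 3)
hotelling_kappa(kappa_hat, S, K = 20)
```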

The delta method

Yang and Zhou (2014, 2015) determined the asymptotic variance of a single multilevel kappa coefficient with the delta method. In this section, we apply the delta method twice to derive the asymptotic variance-covariance matrix $\mathbf{S}$ involved in Eq. (3). For notational convenience, the asymptotic variance-covariance matrix is derived for the comparison of two kappa coefficients; the extension to more than two kappa coefficients is straightforward since the covariance is defined on pairs of variables.

Asymptotic variance-covariance of the observed and expected agreements

Let $p_{rstu,k}$ be the proportion of items from cluster $k$ classified in category $r$ by observer 1, $s$ by observer 2, $t$ by observer 3, and $u$ by observer 4, and suppose that we are interested in comparing the agreement coefficient obtained between observers 1 and 2 to the agreement coefficient obtained between observers 3 and 4. Let $\mathbf{p}_{1k}$, $\mathbf{p}_{2k}$, $\mathbf{p}_{3k}$, and $\mathbf{p}_{4k}$ be the vectors with the marginal classification proportions relative to cluster $k$ for observers 1, 2, 3, and 4, respectively. Let $\mathbf{p}_{12,k}$ and $\mathbf{p}_{34,k}$ denote the proportions in the joint classification table relative to observers 1 and 2 and to observers 3 and 4, respectively. The observed agreements between observers 1 and 2 and between observers 3 and 4 are respectively estimated by

\[ \hat{p}_{o,12} = \sum_{i=1}^{I}\sum_{j=1}^{I} w_{ij}\, \hat{p}_{ij,12} \quad \text{and} \quad \hat{p}_{o,34} = \sum_{i=1}^{I}\sum_{j=1}^{I} w_{ij}\, \hat{p}_{ij,34}, \]

where $w_{ij}$ are agreement weights ($0 \le w_{ij} \le 1$). Define the vector $\hat{\boldsymbol{\theta}}$ stacking the observed agreements and the marginal classification proportions of the four observers. Similarly to Yang and Zhou (2014), it can be shown that, under mild regularity conditions, $\hat{\boldsymbol{\theta}}$ is asymptotically normally distributed with variance-covariance matrix $\mathrm{var}(\hat{\boldsymbol{\theta}})$. The elements of $\mathrm{var}(\hat{\boldsymbol{\theta}})$ are estimated in Appendix 1, following the technique of Obuchowski (1998). To determine the two kappa coefficients to be compared, the expected agreement is also required for the two pairs of observers (namely, $p_{e,12}$ and $p_{e,34}$). In matrix notation, the expected agreements between observers 1 and 2 and between observers 3 and 4 are given by

\[ p_{e,12} = \mathbf{p}_1^T \boldsymbol{\Lambda}\, \mathbf{p}_2 \quad \text{and} \quad p_{e,34} = \mathbf{p}_3^T \boldsymbol{\Lambda}\, \mathbf{p}_4, \]

where $\boldsymbol{\Lambda}$ is the $I \times I$ matrix with the agreement weights $w_{ij}$ as elements. The vector $(\hat{p}_{o,12}, \hat{p}_{e,12}, \hat{p}_{o,34}, \hat{p}_{e,34})^T$ is a function $f$ of the vector $\hat{\boldsymbol{\theta}}$ fulfilling the conditions of the multivariate delta method. Its asymptotic variance-covariance matrix is therefore given by

\[ \boldsymbol{\Sigma} = \mathbf{J}\, \mathrm{var}(\hat{\boldsymbol{\theta}})\, \mathbf{J}^T, \]

where $\mathbf{J}$ is the Jacobian matrix of $f(\cdot)$ with respect to $\hat{\boldsymbol{\theta}}$.

Asymptotic variance-covariance of the kappa coefficients

In the same way, the vector of kappa coefficients $\hat{\boldsymbol{\kappa}} = (\hat{\kappa}_{12}, \hat{\kappa}_{34})^T$ is a function $g$ of the vector $(\hat{p}_{o,12}, \hat{p}_{e,12}, \hat{p}_{o,34}, \hat{p}_{e,34})^T$ fulfilling the conditions of the multivariate delta method. The variance-covariance matrix of $\hat{\boldsymbol{\kappa}}$ is, by application of the multivariate delta method, given by

\[ \mathbf{S} = \mathbf{J}_g\, \boldsymbol{\Sigma}\, \mathbf{J}_g^T, \qquad (5) \]

with $\mathbf{J}_g$ the Jacobian matrix of $g(\cdot)$. The elements of $\mathbf{J}_g$ are also given in Appendix 1. When there is only one unit per cluster ($n_k = 1$ for all $k$), the variance-covariance matrix given by Eq. (5) reduces to the classical variance-covariance matrix multiplied by a correction factor.
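The second application of the delta method can be illustrated numerically. Given an estimated 4 × 4 covariance matrix of the observed and expected agreements, the covariance matrix of the two kappa coefficients follows from the Jacobian of κ = (p_o − p_e)/(1 − p_e). The short R sketch below assumes the element ordering (p_o1, p_e1, p_o2, p_e2); the function name is illustrative, and the input covariance matrix is the one derived in Appendix 1 of the paper.

```r
# Sketch: delta-method propagation from (p_o1, p_e1, p_o2, p_e2) to (kappa1, kappa2).
# 'Sigma_p' is the 4 x 4 estimated covariance matrix of the observed and expected
# agreements, in the order (p_o1, p_e1, p_o2, p_e2).
kappa_vcov <- function(p_o, p_e, Sigma_p) {
  J <- matrix(0, nrow = 2, ncol = 4)
  for (l in 1:2) {
    J[l, 2 * l - 1] <- 1 / (1 - p_e[l])               # d kappa_l / d p_o,l
    J[l, 2 * l]     <- (p_o[l] - 1) / (1 - p_e[l])^2  # d kappa_l / d p_e,l
  }
  J %*% Sigma_p %*% t(J)                              # 2 x 2 covariance matrix of the kappas
}
```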

The clustered bootstrap method

Kang et al. (2013) determined the asymptotic variance of a single Cohen's kappa coefficient using the clustered bootstrap. We use this technique to determine the asymptotic variance-covariance matrix $\mathbf{S}$ of the $L$ multilevel kappa coefficients involved in Eq. (3). The clustered bootstrap consists of three steps:

1. Draw a random sample with replacement of size $K$ from the cluster indexes.
2. For each drawn cluster, select all observations belonging to the cluster. If the cluster sizes are different, the sample size of the bootstrap sample may differ from the original sample size $N$.
3. Repeat steps 1 and 2 to generate a total of $B$ independent bootstrap samples.

For each bootstrap sample $b$ ($b = 1, \ldots, B$), the $L$ multilevel kappa coefficients to be compared are determined, $\hat{\boldsymbol{\kappa}}^{(b)} = (\hat{\kappa}_1^{(b)}, \ldots, \hat{\kappa}_L^{(b)})^T$. The bootstrap estimate of the vector of kappa coefficients is then defined by Kang et al. (2013) as

\[ \bar{\boldsymbol{\kappa}}^{*} = \frac{1}{B} \sum_{b=1}^{B} \hat{\boldsymbol{\kappa}}^{(b)}. \]

The elements of $\mathbf{S}$ can then be determined as

\[ S_{lm} = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\kappa}_l^{(b)} - \bar{\kappa}_l^{*}\right)\left(\hat{\kappa}_m^{(b)} - \bar{\kappa}_m^{*}\right), \qquad l, m = 1, \ldots, L. \]

The vector $\bar{\boldsymbol{\kappa}}^{*}$ and the matrix $\mathbf{S}$ are then used in Eq. (3).
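A minimal R sketch of the three bootstrap steps and of the resulting covariance matrix is given below, assuming long-format data with a cluster column and a user-supplied function returning the L kappa coefficients of interest (for instance, one built from the multilevel_kappa() sketch of Section 3). All names are illustrative.

```r
# Sketch: clustered bootstrap covariance matrix S of L dependent kappa coefficients.
clustered_bootstrap_vcov <- function(df, kappa_fun, B = 1000) {
  clusters <- unique(df$cluster)
  K <- length(clusters)
  boot <- replicate(B, {
    drawn <- sample(clusters, K, replace = TRUE)            # step 1: resample cluster indexes
    pieces <- lapply(seq_along(drawn), function(i) {
      rows <- df[df$cluster == drawn[i], , drop = FALSE]    # step 2: keep all items of the cluster
      rows$cluster <- i                                     # relabel so repeated clusters stay distinct
      rows
    })
    kappa_fun(do.call(rbind, pieces))                       # L kappa coefficients on the bootstrap sample
  })
  boot <- t(boot)                                           # B x L matrix of bootstrap replicates
  list(kappa_boot = colMeans(boot),                         # bootstrap estimate of the kappa vector
       S = cov(boot))                                       # bootstrap variance-covariance matrix
}
```

The returned vector and matrix can then be passed to the T² test, for example to the hotelling_kappa() sketch of Section 4.1.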

Simulations

To study the behavior of the type I error rate (α), we simulated multilevel dependent categorical variables with fixed marginal probability distributions and fixed kappa coefficients between pairs of variables. This was done according to the convex combination algorithm introduced by Lee (1997) and implemented in R by Ibrahim and Suliadi (2011). The algorithm originally considered the coefficient of uncertainty U, Goodman and Kruskal's τ, and Goodman and Kruskal's γ-coefficient as association measures between pairs of categorical variables. In the present case, these association measures are replaced by Cohen's kappa coefficient when the scale is binary and by the linear weighted kappa coefficient when the scale is 3-ordinal. Data were simulated for three observers assessing K = 20, 30, and 100 clusters, each containing 2 or 4 items. The kappa coefficient obtained between observers 1 and 2 (namely, κ1) was then compared to the kappa coefficient obtained between observers 1 and 3 (namely, κ2). As agreement values, similarly to correlation values, are restricted by the marginal probability distributions of the observers, the three observers were assumed to have the same marginal probability distribution, to allow the simulation of interobserver agreement levels between 0 and 1. Even so, it was not possible to generate data for all the planned simulation patterns. Uniform (0.5, 0.5) and nonuniform (0.7, 0.3) marginal probability distributions were considered for binary scales, while only the uniform (1/3, 1/3, 1/3) marginal probability distribution was considered in the 3-ordinal case.

The association structure between the ratings can be divided in three parts: the intracluster association (different items classified by the same observer), the interobserver agreement levels (the same items classified by different observers), and the interobserver association (different items classified by different observers). The association structure was expressed in terms of kappa coefficients in the convex combination algorithm of Lee (1997), instead of the coefficient of uncertainty U originally used. The same homogeneous intracluster association structure was considered for each observer. The association strength between members of a cluster, expressed as a kappa coefficient, ranged from 0 (no association) to 0.7 (strong association within clusters). The interobserver agreement levels for the three pairs of observers were fixed to κ = 0, 0.2, 0.4, 0.6, and 0.8, and the interobserver association levels were fixed to values allowed by the algorithm.

For each simulation scheme, the mean squared error, the mean standard error, the mean correlation between the two agreement coefficients of interest, and the type I error rate, defined as the proportion of simulations in which the Hotelling T² test rejects the null hypothesis of equal kappa coefficients, were recorded. This was done using the multilevel delta method and the clustered bootstrap method for the new multilevel approach, and when ignoring the multilevel structure of the data. The clustered bootstrap method was based on B bootstrap samples per simulated data set. Note that the sample estimate of the kappa coefficients is the same whether the multilevel data structure is taken into account or not. A total of 500 simulations were performed for each parameter configuration; the 95% confidence interval for the type I error rate is therefore [0.031; 0.069]. The results of the simulations are reported in Fig. 1 under the scenario that the three observers classify items on a binary scale with a uniform marginal probability distribution. Results are only displayed for one of the three kappa coefficients and for the delta method, because the other results were very similar. The complete results are given in the supporting web material.
Figure 1

(A) Mean squared error and (B) mean standard error of κ1, (C) mean correlation between κ1 and κ2, and (D) type I error rate for the comparison of two dependent multilevel kappa coefficients (κ1 and κ2) obtained on a binary scale when the observers' marginal probability distribution is uniform, for the two cluster sizes considered (left and right panels). The results obtained by the delta method ignoring the hierarchical structure (dashed lines) and by the multilevel delta method (plain lines) are reported for the three numbers of clusters considered (black, middle gray, and light gray). Results are depicted for different interobserver agreement values.

As seen in Fig. 1A, the mean squared error of the kappa estimates is relatively small (less than 0.040 for 20 clusters, 0.035 for 30 clusters, and 0.010 for 100 clusters) and generally increases with the value of the intracluster kappa coefficient. When the hierarchical data structure is ignored (dashed lines), the standard error of the kappa coefficients (Fig. 1B) and the correlation between pairs of kappa coefficients (Fig. 1C) do not vary with the intracluster kappa coefficient. This was to be expected, since all items are considered to be independent of each other in that case. When the multilevel structure is accounted for (plain lines), the standard error increases with the intracluster kappa coefficient. The increase in standard error is roughly equal to the design effect (see e.g. Hox, 2002), that is, $\sqrt{1 + (\bar{n} - 1)\kappa_c}$, where $\kappa_c$ denotes the intracluster kappa coefficient and $\bar{n}$ the cluster size. This reflects the fact that, when the intracluster kappa coefficient increases, the items of the same cluster become more alike. This decreases the amount of information contained in the data and therefore increases the uncertainty, which is quantified by the standard error. According to the formula given in Appendix 1, the correlation between the kappa coefficients also varies with the value of the intracluster kappa coefficient.

The difference in the behavior of the standard error and the correlation between the two types of analysis resulted in different behaviors of the type I error rates (Fig. 1D). When the hierarchical structure is ignored, the type I error rate increases dramatically beyond the 95% confidence interval for intracluster kappa coefficients larger than 0.3. The type I error rate obtained with the multilevel method is closer to the nominal level with a large number of clusters (K = 100) than with a small number (K = 20), although type I error rates already lie within the 95% confidence interval in most cases. In general, the type I error rate is furthest from the nominal level with the multilevel approach for large interobserver agreement values and moderate cluster sizes. The test also shows somewhat conservative type I error rates for a small number of clusters combined with a small cluster size.

One assumption underlying Hotelling's T² test is the multivariate normality of the vector of kappa coefficients. This assumption could be problematic for high agreement values and small sample sizes. Indeed, since kappa coefficients are bounded in the interval [−1, 1], the sampling distribution of the kappa coefficients becomes left skewed when approaching the boundaries. To illustrate the effect of this skewness on the sampling distribution of the T² statistic, the density of the 500 T² statistics obtained for κ = 0.8 with an intracluster kappa coefficient equal to 0.5 is depicted in Fig. 2 for a binary scale under the uniform marginal distribution of the observers.
Some deviations from the theoretical distribution are noted, explaining the behavior of the type I error rate.
Figure 2

Theoretical (plain line) and observed (dashed line) sampling distribution of the T² statistic when comparing two kappa coefficients equal to 0.8 obtained on a binary scale with uniform observers' marginal distribution, when the intracluster kappa coefficient equals 0.5. The left and right panels correspond to the two cluster sizes considered. The number of clusters is 20 (upper panels), 30 (middle panels), and 100 (lower panels).


Application

The aim of the third FEES study part (Pilz et al., 2016) is to compare the individual intraobserver agreement level of the two students to the intraobserver agreement level obtained by the students in consensus. Since the FEES criteria are ordinal, the multilevel linear weighted kappa coefficient is used as agreement measure. Three dependent linear weighted kappa coefficients (observer 1, observer 2, consensus) obtained on multilevel data (two swallows per patient) therefore have to be compared. The criterion “posterior spill” is not analyzed because all observations except two were classified in the category “No”. The two prerequisites for the definition of a kappa coefficient at the patient level are (1) the absence of patient subpopulations in terms of dysphagia severity and (2) the homogeneity of the dysphagia severity within patients, that is, the probability of being classified in the different FEES severity categories should not depend on the liquid consistency. There was no evidence against the first assumption in the first two study parts (see Pilz et al., 2016). To test the adequacy of the second assumption, the proportions of patients classified in the different FEES severity categories are given in Table 1 according to the liquid consistency. The effect of the consistency on the marginal probability distributions was tested through an ordinal multilevel probit regression. As can be seen in Table 1, a separate kappa coefficient should be computed per liquid consistency for the valleculae pooling and the penetration/aspiration criteria, because there was evidence against the homogeneity assumption.
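One way to carry out such a homogeneity test in R is a cumulative (ordinal) probit model with a random patient effect, for instance with the ordinal package. The sketch below only illustrates this type of model; the data frame and column names are assumptions, and this is not necessarily the software used for the original analysis.

```r
# Sketch: testing whether the severity distribution depends on liquid consistency
# with a multilevel ordinal probit regression (random intercept per patient).
library(ordinal)

# 'fees' is assumed to contain one row per swallow with columns:
#   patient     - patient identifier
#   consistency - factor with levels "thin" and "thick"
#   severity    - FEES severity score coded as an ordered factor
fit <- clmm(severity ~ consistency + (1 | patient), data = fees, link = "probit")
summary(fit)   # the test on 'consistency' gives a p-value analogous to the one in Table 1
```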
Table 1

FEES study (third part). Proportion of patients classified in the different FEES severity categories according to the liquid consistency (N=20). Test of the homogeneity of the dysphagia severity within patient

Parameter a)   Liquid consistency   Category 1   Category 2   Category 3   Category 4   Category 5   p-value b)
VP             Thin                 0.41         0.55         0.04                                    <0.0001
               Thick                0.20         0.43         0.37
PP             Thin                 0.57         0.41         0.02                                    0.41
               Thick                0.65         0.29         0.06
PD             Thin                 0.30         0.22         0.18         0.15         0.15          0.36
               Thick                0.18         0.44         0.18         0.12         0.09
PA             Thin                 0.35         0.60         0.05                                    <0.0001
               Thick                0.62         0.34         0.04

a)VP, valleculae pooling; PP, pyriform pooling; PD, piecemeal deglutition; PA, penetration/aspiration.

b) p‐value obtained by ordinal multilevel probit regression.

The intraobserver agreement levels obtained by the students individually and during the consensus meetings are given in Table 2. Because of the small number of observations, an overall linear weighted kappa coefficient was computed for the valleculae pooling and the penetration/aspiration criteria, despite the heterogeneity of the dysphagia severity scoring for the thin and thick liquid consistencies (see Table 2). The agreement coefficients were compared using the delta method and the clustered bootstrap method with 5000 iterations. The multilevel structure of the data was taken into account when the agreement was computed at the patient level. Note that the program took less than 1 s for the multilevel delta method and about 16 s for the clustered bootstrap method on a regular PC (Intel Core II, 2 GB).
Table 2

FEES study. Intraobserver agreement level (linear weighted kappa coefficient and standard errors obtained with the multilevel delta method and the clustered bootstrap method) for the 4 FEES variables. The p‐value refers to the comparison of the three multilevel dependent kappa coefficients

Delta method
Parameter a)   K b)   N c)   Liquid consistency d)   Observer 1     Observer 2     Consensus      p-value
VP             14     20     All                     0.79 (0.11)    0.94 (0.061)   0.75 (0.12)    0.25
               14     14     Thin                    0.69 (0.21)    1.00 (NA)      0.66 (0.18)    NA
               6      6      Thick                   0.57 (0.39)    0.57 (0.39)    0.79 (0.21)    NA
PP             19     29     All                     0.56 (0.19)    0.76 (0.11)    1.00 (NA)      NA
PD             20     35     All                     0.93 (0.037)   0.78 (0.081)   0.94 (0.034)   0.11
PA             18     26     All                     0.62 (0.14)    0.79 (0.098)   0.88 (0.071)   0.25
               13     13     Thin                    0.84 (0.15)    0.48 (0.23)    0.80 (0.12)    0.28
               13     13     Thick                   0.35 (0.28)    1.00 (NA)      1.00 (NA)      NA

Clustered bootstrap method
Parameter      K      N      Liquid consistency      Observer 1     Observer 2     Consensus      p-value
VP             20     40     All                     0.74 (0.12)    0.94 (0.053)   0.84 (0.074)   0.13
               20     20     Thin                    0.55 (0.23)    1.00 (NA)      0.71 (0.15)    NA
               20     20     Thick                   0.72 (0.29)    0.82 (0.18)    0.92 (0.080)   0.62
PP             20     40     All                     0.54 (0.18)    0.75 (0.12)    0.90 (0.075)   0.13
PD             20     40     All                     0.94 (0.036)   0.76 (0.088)   0.95 (0.030)   0.10
PA             20     40     All                     0.64 (0.13)    0.81 (0.088)   0.93 (0.049)   0.14
               20     20     Thin                    0.84 (0.15)    0.58 (0.21)    0.85 (0.092)   0.41
               20     20     Thick                   0.40 (0.25)    1.00 (NA)      1.00 (NA)      NA

a)VP, valleculae pooling; PP, pyriform pooling; PD, piecemeal deglutition; PA, penetration/aspiration.

b)K is the number of patients.

c)N is the total number of observations.

d)A separate kappa coefficient was computed for each liquid consistency when dysphagia scores were different for thin and thick liquids.

The results from the multilevel delta and the clustered bootstrap methods differ, mainly because they are based on different numbers of observations: by definition, the delta method is based on a complete case analysis, while the clustered bootstrap method uses available cases. When considering only the complete cases, the parameter estimates and p-values obtained with the clustered bootstrap method are closer to those obtained with the delta method (the p-values are then 0.24 for VP, NA for PP, 0.11 for PD, and 0.24 for PA). All the agreement coefficients were positive, with the minimum and maximum agreement values both obtained for pyriform pooling (0.56 and 1.00). Observer 1 showed the largest variability in agreement values (range: 0.56–0.93). There was no evidence of a difference in the intraobserver agreement levels obtained individually and in consensus. This suggests that consensus ratings might offer an alternative to independent rating of FEES exams. However, changes in the scoring of the FEES criteria between the individual and the consensus ratings were observed (data not shown, Pilz et al., 2016). Therefore, the validity of the FEES criteria for individual and consensus ratings also needs to be studied in order to better compare the two rating processes. It is worth noting that the conclusions should be interpreted with great care because of the small sample size of this exploratory study.

The 5000 differences between the pairs of kappa coefficients generated by the clustered bootstrap method are depicted in Fig. 3 with their 95% confidence ellipses. As expected from the results in Table 2, the point (0, 0) lies in the confidence ellipse for the four FEES variables. Note that the bootstrap estimates can show unexpected patterns (e.g., almost empty regions within the 95% confidence ellipse) because the marginal probability distribution of the observers limits the possible values of the kappa coefficients. Such patterns, if present, would directly challenge the multivariate normality assumption of the vector of kappa coefficients. This is, however, not the case here: the bootstrap estimates are evenly distributed in the 95% confidence ellipse for the four FEES variables.
Figure 3

Differences between the kappa coefficients obtained by the observers individually and in consensus with the clustered bootstrap method (95% confidence ellipse). The square represents the bootstrap estimate and the triangle the origin point (0,0).


Discussion

A simple method based on Hotelling's T² statistic was developed in this paper to compare dependent kappa coefficients obtained on multilevel data, a frequent situation in medical research. This method can easily be implemented in practice because it relies on simple matrix calculations. An R package, “multiagree”, was developed by the author and is available on GitHub. The code to reproduce the results presented in this paper and to install the package is available as Supporting Information on the journal's web page (http://onlinelibrary.wiley.com/doi/bimj.201600093/suppinfo). In addition to the methods presented in this paper, the package also covers the case of several observers, independent kappa coefficients, and kappa coefficients obtained on independent observations. The method of Fleiss (1981) (cf. Appendix 2) can be used to compare independent kappa coefficients (or other measures) by using standard errors derived with the multilevel delta or the clustered bootstrap method. The package can be used for all multilevel studies where two or more kappa coefficients have to be compared. In contrast, modeling techniques require more specific programming skills and a new program has to be written for each specific study. Nevertheless, their use is highly recommended in the presence of several covariates or of continuous covariates.

Two assumptions were made by Yang and Zhou (2014) to ensure the existence of an overall kappa coefficient: the homogeneity of the members of a cluster and the existence of a common kappa coefficient across the clusters. When there is evidence that these assumptions do not hold, as discussed by Yang and Zhou (2014), a separate kappa coefficient should be computed for each subpopulation identified. A third assumption was necessary to ensure that the sampling distribution of the T² statistic is an F-distribution, namely the multivariate normality of the vector of kappa coefficients. When the sample size is large, a normal sampling distribution of the kappa coefficients is ensured by the central limit theorem. However, normality could be problematic for small sample sizes (K = 20) and large kappa values, as discussed in the simulation section. This was, however, not the case in the FEES study with only 20 patients involved (see Fig. 3). The use of nonparametric alternatives to the T² statistic to compare dependent kappa coefficients is a topic for future research.

Accounting for the hierarchical structure of the data is strongly advised, even for small numbers of clusters and small cluster sizes, as shown in the simulations. Ignoring the hierarchical structure of the data can dramatically increase the type I error rate for intracluster kappa values above 0.3. These results are consistent with those of Yang and Zhou (2014), where a good performance of the multilevel delta method was observed for a small number of clusters and moderate cluster sizes. These conclusions should, however, be taken with caution because of the limited simulation schemes considered in this paper. The multilevel delta method, although asymptotic, showed coverage levels similar to those of the clustered bootstrap method. However, in the presence of missing data, the delta and the clustered bootstrap methods can lead to different conclusions because the delta method, by definition, is based on a complete case analysis while the clustered bootstrap method is based on an available case analysis. If data are not missing completely at random, both analyses may give biased estimates and invalid inference; likelihood-based methods could then be preferred. When the amount of missing data is large, using the multilevel delta method can reduce the sample size drastically, as for the valleculae pooling criterion in the FEES study, and lead to an inefficient analysis. The clustered bootstrap method is less affected.

Another advantage of the clustered bootstrap method over the multilevel delta method is its simplicity. It easily extends to other measures (e.g., agreement between several observers, or price delay (Bae et al., 2012) in finance), while specific mathematical derivations are required to compute the variance-covariance matrix with the delta method for each new statistical measure considered. To summarize, this paper provides a simple method to compare dependent agreement measures obtained on multilevel data; the method performs well even when the number of clusters is small (K = 20). It should, however, be used with care when both the number of clusters and the number of observations per cluster are small. The method can easily be extended to other measures if the clustered bootstrap method is used to compute the variance-covariance matrix. However, modeling techniques are highly recommended in the presence of several or continuous covariates. Likewise, the use of likelihood-based techniques might be preferable when the amount of missing data is large.

Conflict of interest

The author has declared no conflict of interest.
References (21 in total)

1.  Estimating equations for kappa statistics.

Authors:  J R Thompson
Journal:  Stat Med       Date:  2001-10-15       Impact factor: 2.373

2.  A simple method for the analysis of clustered binary data.

Authors:  J N Rao; A J Scott
Journal:  Biometrics       Date:  1992-06       Impact factor: 2.571

3.  On the comparison of correlated proportions for clustered data.

Authors:  N A Obuchowski
Journal:  Stat Med       Date:  1998-07-15       Impact factor: 2.373

4.  High agreement but low kappa: II. Resolving the paradoxes.

Authors:  D V Cicchetti; A R Feinstein
Journal:  J Clin Epidemiol       Date:  1990       Impact factor: 6.437

5.  Generating correlated discrete ordinal data using R and SAS IML.

Authors:  Noor Akma Ibrahim; Suliadi Suliadi
Journal:  Comput Methods Programs Biomed       Date:  2011-07-20       Impact factor: 5.428

6.  Assessing interrater agreement from dependent data.

Authors:  J M Williamson; A K Manatunga
Journal:  Biometrics       Date:  1997-06       Impact factor: 2.571

7.  A reappraisal of the kappa coefficient.

Authors:  W D Thompson; S D Walter
Journal:  J Clin Epidemiol       Date:  1988       Impact factor: 6.437

8.  Kappa statistic for clustered matched-pair data.

Authors:  Zhao Yang; Ming Zhou
Journal:  Stat Med       Date:  2014-02-16       Impact factor: 2.373

9.  A proposed index for measuring agreement in test-retest studies.

Authors:  E Rogot; I D Goldberg
Journal:  J Chronic Dis       Date:  1966-09

10.  Kappa statistic for clustered dichotomous responses from physicians and patients.

Authors:  Chaeryon Kang; Bahjat Qaqish; Jane Monaco; Stacey L Sheridan; Jianwen Cai
Journal:  Stat Med       Date:  2013-03-27       Impact factor: 2.373

