
Homogeneity score test of AC1 statistics and estimation of common AC1 in multiple or stratified inter-rater agreement studies.

Chikara Honda, Tetsuji Ohyama

Abstract

BACKGROUND: Cohen's κ coefficient is often used as an index to measure the agreement of inter-rater determinations. However, κ varies greatly depending on the marginal distribution of the target population and overestimates the probability of agreement occurring by chance. To overcome these limitations, an alternative and more stable agreement coefficient was proposed, referred to as Gwet's AC1. When it is desired to combine results from multiple agreement studies, such as in a meta-analysis, or to perform a stratified analysis with subject covariates that affect agreement, it is of interest to compare several agreement coefficients and present a common agreement index. A homogeneity test of κ was developed previously; however, there are no reports on homogeneity tests for AC1 or on an estimator of the common AC1. In this article, a homogeneity score test for AC1 is therefore derived for the case of two raters with binary outcomes from K independent strata, and its performance is investigated. Estimation of the common AC1 across strata and its confidence intervals are also discussed.
METHODS: Two homogeneity tests are provided: a score test and a goodness-of-fit test. The confidence intervals are derived by the simple asymptotic, Fisher's Z transformation and profile variance methods. Monte Carlo simulation studies were conducted to examine the validity of the proposed methods. An example using clinical data is also provided.
RESULTS: Type I error rates of the proposed score test were close to the nominal level in simulations with small and moderate sample sizes. The confidence intervals based on Fisher's Z transformation and the profile variance method provided coverage levels close to nominal over a wide range of parameter combinations.
CONCLUSIONS: The method proposed in this study is considered to be useful for summarizing evaluations of consistency performed in multiple or stratified inter-rater agreement studies, for meta-analysis of reports from multiple groups and for stratified analysis.

Keywords:  Common AC1; Consistency evaluation; Gwet’s AC1; Homogeneity test; Inter-rater agreement; Stratified study

Year:  2020        PMID: 32020851      PMCID: PMC7001312          DOI: 10.1186/s12874-019-0887-5

Source DB:  PubMed          Journal:  BMC Med Res Methodol        ISSN: 1471-2288            Impact factor:   4.615


Background

To evaluate the reliability when two raters classify objects as either positive (+) or negative (−), Cohen’s κ [1] and the intra-class version of κ, which is identical to Scott’s π [2], have often been used. Let p be the agreement probability, and p1 and p2 the probabilities of classification as (+) by raters 1 and 2, respectively. Then Cohen’s κ (κC) and Scott’s π (κS) are defined as follows:

κC = (p − pC)/(1 − pC),  κS = (p − pS)/(1 − pS),

where pC = p1p2 + (1 − p1)(1 − p2), pS = p+² + (1 − p+)² and p+ = (p1 + p2)/2. The pC and pS are the probabilities of agreement expected by chance for Cohen’s κ and Scott’s π, respectively. The pC assumes that the probabilities of positive classification differ between the two raters, while pS assumes that these two probabilities are the same. Landis and Koch provided benchmarks of the strength of consistency as follows: values ≤0 as poor, 0.00 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as substantial and 0.81 to 1.00 as almost perfect agreement [3]. Although the authors acknowledged the arbitrary nature of their benchmarks, they recommended their scale as a useful guideline for practitioners. Many extensions have been made to Cohen’s κ, including those for agreement in the cases of ordinal data [4], multiple raters [5-9], comparisons of correlated κ’s [10-13] and stratified data [14, 15]. However, as Feinstein and Cicchetti showed, Cohen’s κ depends strongly on the marginal distributions and therefore behaves paradoxically [16]. This behavior can be explained by the bias effect and the prevalence effect, on which various discussions have been undertaken [16-18]. A number of alternative measures of agreement have also been proposed, such as Holley and Guilford’s G [19], Aickin’s α [20], Andres and Marzo’s delta [21], Marasini’s s* [22, 23] and Gwet’s AC1 [24] and AC2 [25]. Gwet showed that AC1 has better statistical properties (bias and variance) than Cohen’s κ, Scott’s π and the G-index under a limited set of simulations for two raters with binary outcomes [24].
Shankar and Bangdiwala compared Cohen’s κ, Scott’s π, Prevalence Adjusted Bias Adjusted Kappa (PABAK) [26], AC1 and the B-statistic [27], which is not a kappa-type chance-corrected measure, in the case of two raters and binary outcomes, and showed that AC1 has better properties than the other kappa-type measures [28]. In addition, AC1 has been utilized in the field of medical research over the past decade [29-35]. Therefore, in this study we limit our discussion to AC1 in the case of two raters with binary outcomes. First, a brief review of the concept of Gwet’s AC1 is provided. Consider the situation in which two raters independently classify a randomly extracted subject as positive (+) or negative (−). Gwet defined two events: G = {the two raters agree} and R = {at least one rater performs random rating}. The probability of agreement expected by chance is then pγ = P(G ⋂ R) = P(G| R)P(R). A random rating would lead to the classification of an individual into each category with the same probability, and for binary outcomes it follows that P(G| R) = 1/2. The probability P(R) cannot be obtained from data. Therefore, Gwet proposed approximating it with a normalized measure of randomness Ψ = 4π+(1 − π+), where π+ is the probability that a randomly chosen rater classifies a randomly chosen subject into the + category. Thus, the approximated probability of chance agreement is pγ = (1/2)Ψ = 2π+(1 − π+), and AC1 is defined as

γ = (p − pγ)/(1 − pγ),

where p is the probability of agreement. Although the probability of chance agreement is only approximated by pγ, Gwet showed that the bias of γ, the difference between γ and the true inter-rater reliability, is equal to or less than that of Cohen’s κ, Scott’s π and the G-index under some assumptions in the case of two raters with binary outcomes.
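To make the definitions above concrete, the following sketch (Python here, since the article's own R listings are confined to its additional files) computes Cohen's κ, Scott's π and Gwet's AC1 from the four cell counts of a 2 × 2 rating table; the counts used at the end are hypothetical.

```python
def agreement_coefficients(a, b, c, d):
    """Cohen's kappa, Scott's pi and Gwet's AC1 for two raters, binary ratings.

    a, b, c, d are the counts of (+,+), (+,-), (-,+), (-,-) rating pairs.
    """
    n = a + b + c + d
    p_a = (a + d) / n        # observed agreement probability
    p1 = (a + b) / n         # P(+) for rater 1
    p2 = (a + c) / n         # P(+) for rater 2
    p_plus = (p1 + p2) / 2   # average positive-rating probability

    pe_cohen = p1 * p2 + (1 - p1) * (1 - p2)    # chance agreement, Cohen
    pe_scott = p_plus ** 2 + (1 - p_plus) ** 2  # chance agreement, Scott
    pe_ac1 = 2 * p_plus * (1 - p_plus)          # chance agreement, AC1

    kappa = (p_a - pe_cohen) / (1 - pe_cohen)
    pi = (p_a - pe_scott) / (1 - pe_scott)
    ac1 = (p_a - pe_ac1) / (1 - pe_ac1)
    return kappa, pi, ac1

# Hypothetical table: 40 (+,+), 9 (+,-), 6 (-,+), 45 (-,-) pairs
kappa, pi, ac1 = agreement_coefficients(40, 9, 6, 45)
```

For this table all three coefficients come out close to 0.700, because the marginals are nearly balanced; they diverge as π+ approaches 0 or 1, which is exactly the paradoxical behavior discussed above.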
Gwet also provided an estimator of γ and its variance for multiple raters and multiple categories based on the randomization approach, which requires the subjects to be selected randomly in such a way that all possible subject samples have exactly the same chance of being selected. However, it is advantageous to employ a model-based approach when, for example, the evaluation of the effect of subject covariates on agreement is of interest. Therefore, in the case of two raters with binary outcomes, Ohyama [36] assumed an underlying probability that a subject is rated as (+) together with marginal homogeneity of the two raters, and then constructed the likelihood. The maximum likelihood estimator of γ, which is shown to be identical to the estimator given by Gwet, was derived. Likelihood-based confidence intervals for AC1, inclusion of subject covariates, hypothesis testing and sample size determination were also discussed [36]. In this article, we discuss stratified analyses as another approach to adjusting for the effect of subject covariates on agreement. For example, a clinical assessment of whether a patient has a particular disease symptom may be influenced by the overall severity of the disease; in such a case, we consider stratification based on the severity of the disease. Another example is a multicenter inter-rater agreement study, in which the classifications of subjects are conducted independently in each center. These situations yield several independent agreement statistics, and the main purpose of the analysis is then to test whether the degree of inter-rater agreement can be regarded as homogeneous across strata, such as centers or disease severities. For κ, Fleiss pioneered a χ2-type homogeneity test based on large-sample variances [37], and further studies by Donner, Eliasziw and Klar [14], Nam [15, 38] and Wilding, Consiglio and Shan [39] developed homogeneity tests of κ across covariate levels.
However, there are no reports on homogeneity tests for AC1 or on an estimator of the common AC1. Therefore, in this article, we derive a homogeneity score test for AC1 across K independent strata and investigate its performance. Estimation of the common AC1 across strata and its confidence intervals are also discussed. Finally, an example application of our approach to clinical trial data is provided.

Methods

Homogeneity tests

Score test

Consider K independent strata involving nk subjects for k = 1, …, K. In each stratum, two raters independently classify subjects as either positive (+) or negative (−). Let Xkij = 1 if subject i (= 1, …, nk) in the k-th stratum is classified as “+” by rater j (= 1, 2), and Xkij = 0 otherwise. Suppose that, conditionally on subject i, P(Xkij = 1 | i) = uki, with E(uki) = πk. The γk of the k-th stratum is then expressed as follows [36]:

γk = (pk − 2πk(1 − πk))/(1 − 2πk(1 − πk)),

where pk is the probability that the two raters agree in the k-th stratum. Let the number of observed pairs in the three categories of the k-th stratum be x1k, x2k and x3k and their corresponding probabilities be P1k(γk), P2k(γk) and P3k(γk). The data of the k-th stratum are then given as shown in Table 1.
Table 1

Data layout

Category | Ratings | Frequency | Probability
1 | (+, +) | x1k | P1k(γk)
2 | (+, −) or (−, +) | x2k | P2k(γk)
3 | (−, −) | x3k | P3k(γk)
Total |  | nk | 1
The log-likelihood function is given by

l(γ, π) = Σk lk(γk, πk),

where γ = (γ1, …, γK)′, π = (π1, …, πK)′, lk(γk, πk) = x1k log P1k(γk) + x2k log P2k(γk) + x3k log P3k(γk), with

P1k(γk) = πk − (1 − γk)Ak/2,  (6)
P2k(γk) = (1 − γk)Ak,  (7)
P3k(γk) = 1 − πk − (1 − γk)Ak/2,  (8)

and Ak = 1 − 2πk(1 − πk). The maximum likelihood estimators of γk and πk are then given by γ̂k = 1 − x2k/(nkÂk) and π̂k = (2x1k + x2k)/(2nk), respectively, where Âk = 1 − 2π̂k(1 − π̂k). The first and second derivatives of the log-likelihood function and the Fisher information matrix are given in the Appendix. The aim of this study is to test the homogeneity of the agreement coefficients among the K strata, and thus the null hypothesis to test is H0 : γk = γ0 (k = 1, 2, …, K). The score test statistic T for this null hypothesis is derived in the Appendix; the quantities in T are evaluated by substituting the maximum likelihood estimators γ̂0 and π̂k obtained under the null hypothesis, and T is asymptotically distributed as χ2 with K − 1 degrees of freedom. The homogeneity hypothesis is rejected at level α when T is at least the 100 × (1 − α) percentile point of the χ2 distribution with K − 1 degrees of freedom. Note that, since 0 ≤ P1k(γk), P2k(γk), P3k(γk) ≤ 1 and P1k(γk) + P2k(γk) + P3k(γk) = 1, substituting (6), (7) and (8) into these constraints yields the admissible range of γk with respect to πk [36]:

max{1 − 2πk/Ak, 1 − 2(1 − πk)/Ak} ≤ γk ≤ 1.  (12)

When obtaining the maximum likelihood estimators γ̂0 and π̂k under the null hypothesis by numerical calculation, initial values need to be set to satisfy this condition.
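The per-stratum maximum likelihood estimators have the closed forms π̂k = (2x1k + x2k)/(2nk) and γ̂k = 1 − x2k/(nkÂk), identical to the plug-in AC1 estimate [36]. A small Python sketch (variable names are mine) illustrates them:

```python
def stratum_mle(x1, x2, x3):
    """MLEs of pi_k and gamma_k from one stratum's trinomial counts.

    x1, x2, x3 count the (+,+), discordant and (-,-) pairs, respectively.
    """
    n = x1 + x2 + x3
    pi_hat = (2 * x1 + x2) / (2 * n)       # marginal positive-rating probability
    a_hat = 1 - 2 * pi_hat * (1 - pi_hat)  # A_k = 1 - 2 pi (1 - pi)
    gamma_hat = 1 - x2 / (n * a_hat)       # identical to the plug-in AC1 estimate
    return pi_hat, gamma_hat

# C3 stratum of the clinical example in Table 6: x = (1, 9, 65)
pi_hat, gamma_hat = stratum_mle(1, 9, 65)
```

The identity γ̂ = 1 − x2/(nÂ) follows from (p̂ − p̂γ)/(1 − p̂γ) with observed agreement p̂ = 1 − x2/n and chance agreement p̂γ = 1 − Â.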

Goodness-of-fit test

Donner, Eliasziw and Klar proposed a goodness-of-fit approach for testing the homogeneity of kappa statistics in the case of two raters with binary outcomes [40]. This procedure can also be applied to AC1 statistics. Given that the frequencies x1k, x2k, x3k (k = 1, …, K) in Table 1 follow a multinomial distribution conditional on nk, the estimated probabilities under H0 are P̂hk = Phk(γ̂0), obtained by replacing πk by π̂k and γk by γ̂0 in Phk(γk) (h = 1, 2, 3; k = 1, …, K). The goodness-of-fit statistic is then formed by comparing the observed frequencies xhk with the expected frequencies nkP̂hk; under H0, it follows an approximate χ2 distribution with K − 1 degrees of freedom. The homogeneity hypothesis is rejected at level α when the statistic is at least the 100 × (1 − α) percentile point of the χ2 distribution with K − 1 degrees of freedom.
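A Pearson-type version of this statistic can be sketched as follows; the specific Pearson form below is an assumption on my part, chosen to be consistent with Donner, Eliasziw and Klar's goodness-of-fit approach, and the γ0 and πk values passed in would in practice be the constrained MLEs:

```python
def expected_probs(gamma, pi):
    """(P1, P2, P3) of Eqs. (6)-(8) for a given gamma and pi."""
    a = 1 - 2 * pi * (1 - pi)
    p2 = (1 - gamma) * a
    return pi - p2 / 2, p2, 1 - pi - p2 / 2

def gof_statistic(strata, gamma0, pis):
    """Pearson goodness-of-fit statistic summed over K strata.

    strata: list of (x1, x2, x3) counts; pis: estimated pi_k per stratum;
    gamma0: common AC1 estimate under H0.
    """
    chi2 = 0.0
    for (x1, x2, x3), pi in zip(strata, pis):
        n = x1 + x2 + x3
        for x, p in zip((x1, x2, x3), expected_probs(gamma0, pi)):
            chi2 += (x - n * p) ** 2 / (n * p)
    return chi2

# Sanity check: counts that exactly match the fitted probabilities give 0
stat = gof_statistic([(30, 20, 30)], gamma0=0.5, pis=[0.5])
```

With γ0 = 0.5 and π = 0.5, Eqs. (6)-(8) give probabilities (0.375, 0.25, 0.375), so the counts (30, 20, 30) out of n = 80 fit exactly and the statistic is zero.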

Estimation of common AC1

If the assumption of homogeneity is reasonable, the estimate of γ0 can be used as an appropriate summary measure of reliability. The maximum likelihood estimators γ̂0 and π̂k are obtained by maximizing the log-likelihood function under the constraint γk = γ0 (k = 1, …, K). Since an analytical solution cannot be obtained, numerical iterative calculation is used. The variance of γ̂0 can be expressed as in Eq. (14) (see Appendix), where the quantities Bk, Ck and Dk are evaluated at γk = γ0. A simple 100 × (1 − α)% confidence interval using the asymptotic normality of γ̂0 is

γ̂0 ± Zα/2 √Var(γ̂0),

where Zα/2 is the upper α/2 quantile of the standard normal distribution and the variance estimate is obtained by substituting γ̂0 and π̂k into (14). Hereafter, this method is referred to as the simple asymptotic (SA) method. Since Eq. (14) depends on γ0, the SA method may not have the correct coverage rate; the normality of the sampling distribution of γ̂0 may be improved using Fisher’s Z transformation. This method is referred to below as the Fisher’s Z transformation (FZ) method (see Appendix). As an alternative, we employ the profile variance approach, which has been shown to perform well for the intra-class κ with binary outcome data [41-43] and also for AC1 in the case of two raters with binary outcomes [36]. The confidence interval based on the profile variance is obtained by solving, with respect to γ0, the inequality (γ̂0 − γ0)² ≤ Z²α/2 Var(γ0), where Var(γ0) is given by substituting the constrained estimate of π (given γ0) into (15). Hereafter, this method is referred to as the profile variance (PV) method (see Appendix).
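The article obtains γ̂0 by numerical iteration, with the algorithm and variance formulas (14)-(15) in its Appendix. As a rough, self-contained sketch of the estimation idea only (the profiling-by-golden-section strategy is my own implementation choice, not the authors'), one can profile each πk out of the log-likelihood with a one-dimensional search and then maximize over the common γ0:

```python
import math

def golden_max(f, lo, hi, tol=1e-7):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    invphi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if f(c) >= f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

def loglik_stratum(x, gamma, pi):
    """Trinomial log-likelihood of one stratum under Eqs. (6)-(8)."""
    a = 1 - 2 * pi * (1 - pi)
    p2 = (1 - gamma) * a
    probs = (pi - p2 / 2, p2, 1 - pi - p2 / 2)
    if min(probs) <= 0:          # outside the admissible range (12)
        return -math.inf
    return sum(xi * math.log(p) for xi, p in zip(x, probs))

def common_ac1(strata):
    """Common AC1 estimate: profile out each pi_k, then maximize over gamma0."""
    def profile_loglik(g0):
        total = 0.0
        for x in strata:
            pi_hat = golden_max(lambda p, x=x: loglik_stratum(x, g0, p),
                                1e-6, 1 - 1e-6)
            total += loglik_stratum(x, g0, pi_hat)
        return total
    return golden_max(profile_loglik, -0.999, 1 - 1e-6)

# Two identical strata whose counts match (6)-(8) exactly at gamma = 0.5, pi = 0.5
g0 = common_ac1([(30, 20, 30), (30, 20, 30)])
```

For these two identical strata the profiled likelihood peaks at the stratum value, so g0 comes back at about 0.5, as it should when homogeneity holds exactly.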

Numerical evaluations

We conducted Monte Carlo simulations to investigate the performance of the proposed homogeneity tests and to evaluate the estimate of the common AC1 and its confidence intervals under the following conditions: the number of strata is K = 2 or 3, and random observations are generated from trinomial distributions according to the probabilities (6), (7) and (8) for given values of γk and πk. Balanced and unbalanced cases were considered for the values of πk and nk. The values of γk and πk were set within the theoretical range of Eq. (12) derived in the preceding section. Ten thousand iterations were carried out for each parameter combination. When πk is close to 0 or 1 and nk is small, the generated data may include zero cells, in which case the quantities Bk, Ck, Dk and R cannot be estimated. Thus, when zero cells were generated, we adopted the approach of adding 0.5 to the frequency of each of the four rater-pair combinations (+,+), (+,−), (−,+), (−,−). This simple method was discussed by Agresti [44] and was adopted in a previous study [39].
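The data-generation step can be sketched as below (Python here; the article's own simulation code is the R listings in Additional files 2-4). Because the two discordant rater-pair cells are pooled into x2, the 0.5 continuity adjustment adds 0.5 to x1 and x3 but 1.0 to x2.

```python
import random

def simulate_stratum(n, gamma, pi, rng=random):
    """Draw one stratum's (x1, x2, x3) from the trinomial model (6)-(8)."""
    a = 1 - 2 * pi * (1 - pi)
    p2 = (1 - gamma) * a
    probs = [pi - p2 / 2, p2, 1 - pi - p2 / 2]
    counts = [0, 0, 0]
    for cat in rng.choices((0, 1, 2), weights=probs, k=n):
        counts[cat] += 1
    return tuple(counts)

def adjust_zero_cells(x):
    """Add 0.5 per rater-pair cell when any trinomial cell is zero."""
    if min(x) == 0:
        # (+,+) and (-,-) each gain 0.5; the two discordant cells add to 1.0
        return (x[0] + 0.5, x[1] + 1.0, x[2] + 0.5)
    return x

x = simulate_stratum(50, gamma=0.3, pi=0.5)
```

A full type I error run would repeat this 10,000 times per parameter combination and apply the score statistic to each replicate.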

Results

Empirical type I error rate for the homogeneity test

The type I error rates of the homogeneity tests were examined at a significance level of 0.05. The sample size was set at nk = 20, 50, 80 for the balanced settings and (n1, n2, n3) = (20, 50, 80) for the unbalanced setting. The error rate obtained by the score test is denoted SCORE and that obtained by the goodness-of-fit test is denoted GOF. Table 2 summarizes the results for K = 2.
Table 2

Empirical type I error rates of homogeneity tests for γ1 = γ2 = γ0 based on 10,000 simulations (K = 2 balanced sample size)

Balanced π conditions

n1 = n2 | γ0 | π1 = π2 | SCORE | GOF
20 | 0.1 | 0.5 | 0.045 | 0.067
20 | 0.3 | 0.5 | 0.046 | 0.067
20 | 0.5 | 0.5 | 0.048 | 0.062
20 | 0.7 | 0.5 | 0.033 | 0.041
20 | 0.9 | 0.5 | 0.002 | 0.003
20 | 0.1 | 0.35 | 0.052 | 0.121
20 | 0.3 | 0.35 | 0.054 | 0.126
20 | 0.5 | 0.35 | 0.052 | 0.103
20 | 0.7 | 0.35 | 0.039 | 0.064
20 | 0.9 | 0.35 | 0.004 | 0.006
20 | 0.7 | 0.2 | 0.047 | 0.132
20 | 0.9 | 0.2 | 0.008 | 0.029
50 | 0.1 | 0.5 | 0.050 | 0.058
50 | 0.3 | 0.5 | 0.047 | 0.054
50 | 0.5 | 0.5 | 0.050 | 0.054
50 | 0.7 | 0.5 | 0.051 | 0.053
50 | 0.9 | 0.5 | 0.026 | 0.027
50 | 0.1 | 0.35 | 0.051 | 0.172
50 | 0.3 | 0.35 | 0.051 | 0.126
50 | 0.5 | 0.35 | 0.052 | 0.092
50 | 0.7 | 0.35 | 0.052 | 0.072
50 | 0.9 | 0.35 | 0.028 | 0.033
50 | 0.7 | 0.2 | 0.053 | 0.162
50 | 0.9 | 0.2 | 0.037 | 0.061
80 | 0.1 | 0.5 | 0.047 | 0.052
80 | 0.3 | 0.5 | 0.047 | 0.051
80 | 0.5 | 0.5 | 0.054 | 0.057
80 | 0.7 | 0.5 | 0.050 | 0.052
80 | 0.9 | 0.5 | 0.037 | 0.039
80 | 0.1 | 0.35 | 0.052 | 0.173
80 | 0.3 | 0.35 | 0.054 | 0.123
80 | 0.5 | 0.35 | 0.053 | 0.089
80 | 0.7 | 0.35 | 0.051 | 0.069
80 | 0.9 | 0.35 | 0.044 | 0.051
80 | 0.7 | 0.2 | 0.052 | 0.152
80 | 0.9 | 0.2 | 0.051 | 0.073

Unbalanced π conditions

n1 = n2 | γ0 | π1 | π2 | SCORE | GOF
20 | 0.1 | 0.5 | 0.35 | 0.049 | 0.097
20 | 0.3 | 0.5 | 0.35 | 0.049 | 0.096
20 | 0.5 | 0.5 | 0.35 | 0.050 | 0.083
20 | 0.7 | 0.5 | 0.35 | 0.037 | 0.049
20 | 0.9 | 0.5 | 0.35 | 0.003 | 0.005
20 | 0.1 | 0.65 | 0.35 | 0.050 | 0.120
20 | 0.3 | 0.65 | 0.35 | 0.051 | 0.120
20 | 0.5 | 0.65 | 0.35 | 0.051 | 0.101
20 | 0.7 | 0.65 | 0.35 | 0.039 | 0.065
20 | 0.9 | 0.65 | 0.35 | 0.004 | 0.007
20 | 0.7 | 0.5 | 0.2 | 0.038 | 0.090
20 | 0.9 | 0.5 | 0.2 | 0.005 | 0.013
50 | 0.1 | 0.5 | 0.35 | 0.048 | 0.117
50 | 0.3 | 0.5 | 0.35 | 0.051 | 0.087
50 | 0.5 | 0.5 | 0.35 | 0.049 | 0.072
50 | 0.7 | 0.5 | 0.35 | 0.050 | 0.060
50 | 0.9 | 0.5 | 0.35 | 0.024 | 0.027
50 | 0.1 | 0.65 | 0.35 | 0.051 | 0.168
50 | 0.3 | 0.65 | 0.35 | 0.049 | 0.117
50 | 0.5 | 0.65 | 0.35 | 0.051 | 0.092
50 | 0.7 | 0.65 | 0.35 | 0.052 | 0.071
50 | 0.9 | 0.65 | 0.35 | 0.028 | 0.033
50 | 0.7 | 0.5 | 0.2 | 0.051 | 0.104
50 | 0.9 | 0.5 | 0.2 | 0.032 | 0.042
80 | 0.1 | 0.5 | 0.35 | 0.051 | 0.120
80 | 0.3 | 0.5 | 0.35 | 0.053 | 0.094
80 | 0.5 | 0.5 | 0.35 | 0.051 | 0.072
80 | 0.7 | 0.5 | 0.35 | 0.051 | 0.061
80 | 0.9 | 0.5 | 0.35 | 0.047 | 0.050
80 | 0.1 | 0.65 | 0.35 | 0.052 | 0.172
80 | 0.3 | 0.65 | 0.35 | 0.054 | 0.124
80 | 0.5 | 0.65 | 0.35 | 0.055 | 0.090
80 | 0.7 | 0.65 | 0.35 | 0.051 | 0.069
80 | 0.9 | 0.65 | 0.35 | 0.045 | 0.052
80 | 0.7 | 0.5 | 0.2 | 0.050 | 0.103
80 | 0.9 | 0.5 | 0.2 | 0.048 | 0.059
Overall, the proposed score test did not show any notable type I error inflation, but it was very conservative when the sample size was small and γ0 was close to 1. In the case of nk = 20, when γ0 = 0.1, 0.3 or 0.5, the type I error rates of SCORE were maintained at the nominal level of 0.05 regardless of whether π was balanced or unbalanced, but when γ0 = 0.7 or 0.9 the rates were somewhat conservative; when γ0 = 0.9 in particular, the test was markedly conservative, with rates below 0.01. In the case of nk = 50, the type I error rates were maintained at the nominal level of 0.05 except when γ0 = 0.9. Finally, in the case of nk = 80, the type I error rates were almost maintained at the nominal level. In contrast, the type I error rate of GOF tended to be larger than that of SCORE and in many cases was not maintained at the nominal level. The results obtained for K = 3 are shown in Table S1 and Table S2 in Additional file 1. Additional file 2 provides the simulation code for the empirical type I error rates in the R language.

Empirical power of the homogeneity test

The empirical power of the score test was investigated only for the case of K = 2, by setting γ1 = 0.1, 0.3, 0.5 and γ2 − γ1 = 0.3, 0.4. The values of π and nk were set as in the type I error simulation. The results are shown in Table 3. The power tended to increase with γ1 under fixed values of π and γ2 − γ1.
Table 3

Empirical power of homogeneity tests based on 10,000 simulations (K = 2 balanced sample size)

Balanced π conditions

n1 = n2 | γ1 | γ2 | π1 = π2 | SCORE | GOF
20 | 0.1 | 0.5 | 0.5 | 0.243 | 0.290
20 | 0.3 | 0.6 | 0.5 | 0.173 | 0.202
20 | 0.3 | 0.7 | 0.5 | 0.294 | 0.323
20 | 0.5 | 0.8 | 0.5 | 0.212 | 0.232
20 | 0.5 | 0.9 | 0.5 | 0.372 | 0.396
20 | 0.1 | 0.5 | 0.35 | 0.245 | 0.357
20 | 0.3 | 0.6 | 0.35 | 0.185 | 0.279
20 | 0.3 | 0.7 | 0.35 | 0.313 | 0.408
20 | 0.5 | 0.8 | 0.35 | 0.227 | 0.295
20 | 0.5 | 0.9 | 0.35 | 0.396 | 0.483
20 | 0.5 | 0.8 | 0.2 | 0.270 | 0.411
20 | 0.5 | 0.9 | 0.2 | 0.482 | 0.616
50 | 0.1 | 0.5 | 0.5 | 0.538 | 0.562
50 | 0.3 | 0.6 | 0.5 | 0.377 | 0.396
50 | 0.3 | 0.7 | 0.5 | 0.635 | 0.651
50 | 0.5 | 0.8 | 0.5 | 0.512 | 0.525
50 | 0.5 | 0.9 | 0.5 | 0.835 | 0.841
50 | 0.1 | 0.5 | 0.35 | 0.509 | 0.652
50 | 0.3 | 0.6 | 0.35 | 0.379 | 0.485
50 | 0.3 | 0.7 | 0.35 | 0.633 | 0.718
50 | 0.5 | 0.8 | 0.35 | 0.518 | 0.585
50 | 0.5 | 0.9 | 0.35 | 0.844 | 0.877
50 | 0.5 | 0.8 | 0.2 | 0.552 | 0.711
50 | 0.5 | 0.9 | 0.2 | 0.878 | 0.915
80 | 0.1 | 0.5 | 0.5 | 0.757 | 0.768
80 | 0.3 | 0.6 | 0.5 | 0.568 | 0.578
80 | 0.3 | 0.7 | 0.5 | 0.841 | 0.847
80 | 0.5 | 0.8 | 0.5 | 0.716 | 0.722
80 | 0.5 | 0.9 | 0.5 | 0.967 | 0.968
80 | 0.1 | 0.5 | 0.35 | 0.707 | 0.819
80 | 0.3 | 0.6 | 0.35 | 0.538 | 0.645
80 | 0.3 | 0.7 | 0.35 | 0.826 | 0.879
80 | 0.5 | 0.8 | 0.35 | 0.717 | 0.767
80 | 0.5 | 0.9 | 0.35 | 0.963 | 0.973
80 | 0.5 | 0.8 | 0.2 | 0.746 | 0.872
80 | 0.5 | 0.9 | 0.2 | 0.976 | 0.982

Unbalanced π conditions

n1 = n2 | γ1 | γ2 | π1 | π2 | SCORE | GOF
20 | 0.1 | 0.5 | 0.5 | 0.35 | 0.234 | 0.304
20 | 0.3 | 0.6 | 0.5 | 0.35 | 0.168 | 0.224
20 | 0.3 | 0.7 | 0.5 | 0.35 | 0.293 | 0.351
20 | 0.5 | 0.8 | 0.5 | 0.35 | 0.216 | 0.258
20 | 0.5 | 0.9 | 0.5 | 0.35 | 0.389 | 0.430
20 | 0.1 | 0.5 | 0.65 | 0.35 | 0.243 | 0.355
20 | 0.3 | 0.6 | 0.65 | 0.35 | 0.171 | 0.266
20 | 0.3 | 0.7 | 0.65 | 0.35 | 0.296 | 0.398
20 | 0.5 | 0.8 | 0.65 | 0.35 | 0.221 | 0.291
20 | 0.5 | 0.9 | 0.65 | 0.35 | 0.394 | 0.475
20 | 0.5 | 0.8 | 0.5 | 0.2 | 0.230 | 0.278
20 | 0.5 | 0.9 | 0.5 | 0.2 | 0.435 | 0.473
50 | 0.1 | 0.5 | 0.5 | 0.35 | 0.525 | 0.595
50 | 0.3 | 0.6 | 0.5 | 0.35 | 0.369 | 0.428
50 | 0.3 | 0.7 | 0.5 | 0.35 | 0.630 | 0.673
50 | 0.5 | 0.8 | 0.5 | 0.35 | 0.517 | 0.548
50 | 0.5 | 0.9 | 0.5 | 0.35 | 0.843 | 0.855
50 | 0.1 | 0.5 | 0.65 | 0.35 | 0.503 | 0.648
50 | 0.3 | 0.6 | 0.65 | 0.35 | 0.364 | 0.470
50 | 0.3 | 0.7 | 0.65 | 0.35 | 0.618 | 0.711
50 | 0.5 | 0.8 | 0.65 | 0.35 | 0.516 | 0.581
50 | 0.5 | 0.9 | 0.65 | 0.35 | 0.838 | 0.874
50 | 0.5 | 0.8 | 0.5 | 0.2 | 0.531 | 0.598
50 | 0.5 | 0.9 | 0.5 | 0.2 | 0.861 | 0.863
80 | 0.1 | 0.5 | 0.5 | 0.35 | 0.732 | 0.786
80 | 0.3 | 0.6 | 0.5 | 0.35 | 0.545 | 0.596
80 | 0.3 | 0.7 | 0.5 | 0.35 | 0.833 | 0.858
80 | 0.5 | 0.8 | 0.5 | 0.35 | 0.717 | 0.741
80 | 0.5 | 0.9 | 0.5 | 0.35 | 0.966 | 0.970
80 | 0.1 | 0.5 | 0.65 | 0.35 | 0.707 | 0.816
80 | 0.3 | 0.6 | 0.65 | 0.35 | 0.541 | 0.644
80 | 0.3 | 0.7 | 0.65 | 0.35 | 0.822 | 0.884
80 | 0.5 | 0.8 | 0.65 | 0.35 | 0.715 | 0.764
80 | 0.5 | 0.9 | 0.65 | 0.35 | 0.965 | 0.974
80 | 0.5 | 0.8 | 0.5 | 0.2 | 0.734 | 0.787
80 | 0.5 | 0.9 | 0.5 | 0.2 | 0.974 | 0.970
The empirical power of the GOF test was also examined under the same simulation conditions as the score test, and the results are also shown in Table 3. However, since the GOF test showed substantial type I error inflation (Table 2), it is invalid as a test and its power values should not be compared directly. Additional file 2 provides the simulation code for the empirical power in the R language.

Bias and mean square error for common AC1

We evaluated the bias and mean square error (MSE) of the maximum likelihood estimator γ̂0 for the common AC1. Balanced and unbalanced conditions for π and the balanced condition for n were considered. The results are shown in Table 4. The bias of γ̂0 tended to become smaller as γ0 increased, and the estimator was almost unbiased overall. As expected, the bias and MSE tended to decrease as the sample size increased.
Table 4

Bias and mean square error of the maximum likelihood estimator for the common AC1 based on 10,000 simulations (K = 2 balanced sample size)

Balanced π conditions

n1 = n2 | γ0 | π1 = π2 | Bias | MSE
20 | 0.1 | 0.5 | 0.026 | 0.025
20 | 0.3 | 0.5 | 0.023 | 0.023
20 | 0.5 | 0.5 | 0.017 | 0.018
20 | 0.7 | 0.5 | 0.009 | 0.012
20 | 0.9 | 0.5 | −0.011 | 0.003
20 | 0.1 | 0.35 | 0.009 | 0.029
20 | 0.3 | 0.35 | 0.007 | 0.025
20 | 0.5 | 0.35 | 0.007 | 0.019
20 | 0.7 | 0.35 | 0.004 | 0.012
20 | 0.9 | 0.35 | −0.011 | 0.003
20 | 0.7 | 0.2 | −0.007 | 0.012
20 | 0.9 | 0.2 | −0.010 | 0.003
80 | 0.1 | 0.5 | 0.007 | 0.006
80 | 0.3 | 0.5 | 0.006 | 0.006
80 | 0.5 | 0.5 | 0.005 | 0.005
80 | 0.7 | 0.5 | 0.003 | 0.003
80 | 0.9 | 0.5 | 0.002 | 0.001
80 | 0.1 | 0.35 | 0.004 | 0.007
80 | 0.3 | 0.35 | 0.002 | 0.006
80 | 0.5 | 0.35 | 0.001 | 0.005
80 | 0.7 | 0.35 | 0.001 | 0.003
80 | 0.9 | 0.35 | 0.000 | 0.001
80 | 0.7 | 0.2 | −0.001 | 0.003
80 | 0.9 | 0.2 | −0.001 | 0.001

Unbalanced π conditions

n1 = n2 | γ0 | π1 | π2 | Bias | MSE
20 | 0.1 | 0.5 | 0.35 | 0.019 | 0.027
20 | 0.3 | 0.5 | 0.35 | 0.018 | 0.024
20 | 0.5 | 0.5 | 0.35 | 0.013 | 0.019
20 | 0.7 | 0.5 | 0.35 | 0.007 | 0.012
20 | 0.9 | 0.5 | 0.35 | −0.011 | 0.003
20 | 0.1 | 0.65 | 0.35 | 0.010 | 0.028
20 | 0.3 | 0.65 | 0.35 | 0.011 | 0.025
20 | 0.5 | 0.65 | 0.35 | 0.008 | 0.019
20 | 0.7 | 0.65 | 0.35 | 0.004 | 0.012
20 | 0.9 | 0.65 | 0.35 | −0.011 | 0.003
20 | 0.7 | 0.5 | 0.2 | 0.001 | 0.012
20 | 0.9 | 0.5 | 0.2 | −0.010 | 0.003
80 | 0.1 | 0.5 | 0.35 | 0.006 | 0.007
80 | 0.3 | 0.5 | 0.35 | 0.005 | 0.006
80 | 0.5 | 0.5 | 0.35 | 0.004 | 0.005
80 | 0.7 | 0.5 | 0.35 | 0.003 | 0.003
80 | 0.9 | 0.5 | 0.35 | 0.001 | 0.001
80 | 0.1 | 0.65 | 0.35 | 0.003 | 0.008
80 | 0.3 | 0.65 | 0.35 | 0.001 | 0.006
80 | 0.5 | 0.65 | 0.35 | 0.002 | 0.005
80 | 0.7 | 0.65 | 0.35 | 0.001 | 0.003
80 | 0.9 | 0.65 | 0.35 | 0.000 | 0.001
80 | 0.7 | 0.5 | 0.2 | 0.001 | 0.003
80 | 0.9 | 0.5 | 0.2 | 0.001 | 0.001
Additional file 3 provides the simulation code for the bias and mean square error of the common AC1 in the R language.

Confidence intervals for common AC1

We conducted a simulation study to evaluate the performance of the three confidence intervals presented in the previous section. The coverage rates of the 95% confidence interval were examined. Balanced and unbalanced conditions for π and the balanced condition for n were considered. The results are shown in Table 5. The coverage rate of the SA method was generally lower than 0.95 under many conditions, with the exception of values close to 0.99 in the case of n1 = n2 = 20 and γ0 = 0.9. The FZ and PV methods greatly improved the coverage rates, bringing them close to the nominal level; the coverage rate of the PV method was closer to the nominal level than that of the FZ method in most cases under the conditions examined. The coverage rates of each method were also evaluated for K = 3 and for unbalanced n conditions, and both the FZ and PV methods achieved coverage rates near 0.95 (results not shown).
Table 5

Coverage rates of common γ 95% confidence intervals of the three proposed methods based on 10,000 simulations

Balanced π conditions

n1 = n2 | γ0 | π1 = π2 | SA | FZ | PV
20 | 0.1 | 0.5 | 0.939 | 0.959 | 0.958
20 | 0.3 | 0.5 | 0.936 | 0.958 | 0.950
20 | 0.5 | 0.5 | 0.931 | 0.962 | 0.962
20 | 0.7 | 0.5 | 0.924 | 0.976 | 0.963
20 | 0.9 | 0.5 | 0.998 | 0.963 | 0.955
20 | 0.1 | 0.35 | 0.935 | 0.953 | 0.956
20 | 0.3 | 0.35 | 0.934 | 0.955 | 0.953
20 | 0.5 | 0.35 | 0.926 | 0.956 | 0.955
20 | 0.7 | 0.35 | 0.918 | 0.965 | 0.958
20 | 0.9 | 0.35 | 0.996 | 0.967 | 0.947
20 | 0.7 | 0.2 | 0.929 | 0.959 | 0.950
20 | 0.9 | 0.2 | 0.970 | 0.966 | 0.911
50 | 0.1 | 0.5 | 0.949 | 0.953 | 0.952
50 | 0.3 | 0.5 | 0.945 | 0.955 | 0.954
50 | 0.5 | 0.5 | 0.945 | 0.954 | 0.953
50 | 0.7 | 0.5 | 0.936 | 0.961 | 0.953
50 | 0.9 | 0.5 | 0.920 | 0.971 | 0.971
50 | 0.1 | 0.35 | 0.942 | 0.948 | 0.952
50 | 0.3 | 0.35 | 0.942 | 0.950 | 0.950
50 | 0.5 | 0.35 | 0.940 | 0.951 | 0.951
50 | 0.7 | 0.35 | 0.937 | 0.954 | 0.952
50 | 0.9 | 0.35 | 0.926 | 0.966 | 0.960
50 | 0.7 | 0.2 | 0.938 | 0.949 | 0.948
50 | 0.9 | 0.2 | 0.927 | 0.965 | 0.954
80 | 0.1 | 0.5 | 0.945 | 0.952 | 0.952
80 | 0.3 | 0.5 | 0.947 | 0.952 | 0.952
80 | 0.5 | 0.5 | 0.943 | 0.953 | 0.952
80 | 0.7 | 0.5 | 0.944 | 0.950 | 0.949
80 | 0.9 | 0.5 | 0.901 | 0.962 | 0.956
80 | 0.1 | 0.35 | 0.946 | 0.951 | 0.951
80 | 0.3 | 0.35 | 0.944 | 0.949 | 0.949
80 | 0.5 | 0.35 | 0.944 | 0.950 | 0.950
80 | 0.7 | 0.35 | 0.941 | 0.950 | 0.949
80 | 0.9 | 0.35 | 0.931 | 0.960 | 0.953
80 | 0.7 | 0.2 | 0.944 | 0.950 | 0.949
80 | 0.9 | 0.2 | 0.929 | 0.956 | 0.951

Unbalanced π conditions

n1 = n2 | γ0 | π1 | π2 | SA | FZ | PV
20 | 0.1 | 0.5 | 0.35 | 0.939 | 0.958 | 0.958
20 | 0.3 | 0.5 | 0.35 | 0.933 | 0.958 | 0.955
20 | 0.5 | 0.5 | 0.35 | 0.927 | 0.959 | 0.960
20 | 0.7 | 0.5 | 0.35 | 0.917 | 0.971 | 0.961
20 | 0.9 | 0.5 | 0.35 | 0.997 | 0.966 | 0.955
20 | 0.1 | 0.65 | 0.35 | 0.938 | 0.955 | 0.957
20 | 0.3 | 0.65 | 0.35 | 0.934 | 0.955 | 0.956
20 | 0.5 | 0.65 | 0.35 | 0.927 | 0.959 | 0.955
20 | 0.7 | 0.65 | 0.35 | 0.917 | 0.967 | 0.960
20 | 0.9 | 0.65 | 0.35 | 0.996 | 0.969 | 0.950
20 | 0.7 | 0.5 | 0.2 | 0.931 | 0.965 | 0.956
20 | 0.9 | 0.5 | 0.2 | 0.993 | 0.970 | 0.943
50 | 0.1 | 0.5 | 0.35 | 0.946 | 0.952 | 0.953
50 | 0.3 | 0.5 | 0.35 | 0.943 | 0.952 | 0.951
50 | 0.5 | 0.5 | 0.35 | 0.941 | 0.952 | 0.952
50 | 0.7 | 0.5 | 0.35 | 0.936 | 0.956 | 0.953
50 | 0.9 | 0.5 | 0.35 | 0.923 | 0.968 | 0.965
50 | 0.1 | 0.65 | 0.35 | 0.946 | 0.954 | 0.954
50 | 0.3 | 0.65 | 0.35 | 0.943 | 0.953 | 0.952
50 | 0.5 | 0.65 | 0.35 | 0.941 | 0.954 | 0.953
50 | 0.7 | 0.65 | 0.35 | 0.936 | 0.956 | 0.953
50 | 0.9 | 0.65 | 0.35 | 0.925 | 0.968 | 0.960
50 | 0.7 | 0.5 | 0.2 | 0.937 | 0.954 | 0.952
50 | 0.9 | 0.5 | 0.2 | 0.928 | 0.969 | 0.960
80 | 0.1 | 0.5 | 0.35 | 0.946 | 0.951 | 0.951
80 | 0.3 | 0.5 | 0.35 | 0.946 | 0.952 | 0.952
80 | 0.5 | 0.5 | 0.35 | 0.942 | 0.950 | 0.950
80 | 0.7 | 0.5 | 0.35 | 0.943 | 0.955 | 0.953
80 | 0.9 | 0.5 | 0.35 | 0.922 | 0.963 | 0.957
80 | 0.1 | 0.65 | 0.35 | 0.943 | 0.947 | 0.946
80 | 0.3 | 0.65 | 0.35 | 0.945 | 0.950 | 0.950
80 | 0.5 | 0.65 | 0.35 | 0.944 | 0.953 | 0.952
80 | 0.7 | 0.65 | 0.35 | 0.939 | 0.953 | 0.952
80 | 0.9 | 0.65 | 0.35 | 0.927 | 0.963 | 0.954
80 | 0.7 | 0.5 | 0.2 | 0.942 | 0.955 | 0.954
80 | 0.9 | 0.5 | 0.2 | 0.927 | 0.961 | 0.955

SA, FZ, and PV refer to 95% confidence intervals for common AC1 using the simple asymptotic, Fisher’s Z transformation, and profile variance methods, respectively

Additional file 4 provides the simulation code for the confidence intervals of the common AC1 in the R language.

An example

As an example, we used data from a randomized clinical trial called the Silicone Study, which was conducted to investigate the effectiveness of silicone fluids versus gases in the management of proliferative vitreoretinopathy (PVR) by vitrectomy [45]. The PVR classification, determined at the baseline visit, grades the severity of the disease as a continuum of increasing pathology: C3, D1, D2 or D3. The presence or absence of superior nasal retinal breaks was evaluated clinically by the operating ophthalmic surgeon and photographically by an independent fundus photograph reading center [46]. The data and results are summarized in Table 6. For reference, the results of the homogeneity score test proposed by Nam for the intra-class κ are also provided [15]. The probabilities of agreement in each stratum ranged from 0.800 to 0.880 and did not differ greatly. However, the values of κ ranged from 0.117 to 0.520 and differed greatly across strata; this might be due to the prevalence effect caused by the small values of π. In contrast, the values of γ ranged from 0.723 to 0.861 and did not differ greatly among strata.
Table 6

Agreement between ophthalmologist and reading center classifying superior nasal retinal breaks stratified by PVR grade

PVR grade | C3 | D1 | D2 | D3
Both (x1) | 1 | 6 | 5 | 3
One (x2) | 9 | 8 | 11 | 9
Neither (x3) | 65 | 46 | 54 | 33
Total (n) | 75 | 60 | 70 | 45
π | 0.073 | 0.167 | 0.150 | 0.167
pa | 0.880 | 0.867 | 0.843 | 0.800
κ (MLE) | 0.117 | 0.520 | 0.384 | 0.280
AC1 (MLE) | 0.861 | 0.815 | 0.789 | 0.723
The proposed homogeneity score statistic was 2.060 (p-value = 0.560), and the homogeneity hypothesis was not rejected. The estimate of the common AC1 was 0.808 and its 95% confidence intervals were 0.743–0.873 (SA method), 0.732–0.864 (FZ method) and 0.730–0.862 (PV method). The score statistic for testing the homogeneity of the κ’s [15] was 2.700 (p-value = 0.440) and the common κ was 0.352. Additional file 5 provides the code for the clinical data example in the R language. To investigate the sensitivity of the indices to π, we hypothetically considered more balanced and less balanced π under fixed p and n in each stratum. The generated data sets and analysis results are summarized in Table S3 in Additional file 1. κ was sensitive to changes in the value of π, whereas AC1 was much less so, and the common AC1 was not affected as much as the common κ even when the π balance was lost.
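The stratum-level values in Table 6 can be reproduced from the counts alone; the short check below (my own re-computation, in Python) recovers the π, pa, intraclass κ and AC1 rows.

```python
def stratum_summary(x1, x2, x3):
    """pi, agreement probability, intraclass kappa and AC1 for one stratum.

    x1, x2, x3 are the 'Both', 'One' and 'Neither' counts of Table 6.
    """
    n = x1 + x2 + x3
    pi = (2 * x1 + x2) / (2 * n)        # marginal probability of a + rating
    pa = (x1 + x3) / n                  # observed agreement probability
    pe_kappa = pi ** 2 + (1 - pi) ** 2  # chance agreement, intraclass kappa
    pe_ac1 = 2 * pi * (1 - pi)          # chance agreement, AC1
    kappa = (pa - pe_kappa) / (1 - pe_kappa)
    ac1 = (pa - pe_ac1) / (1 - pe_ac1)
    return pi, pa, kappa, ac1

# C3 stratum: 1 'Both', 9 'One', 65 'Neither'
pi, pa, kappa, ac1 = stratum_summary(1, 9, 65)
```

For C3 this gives π ≈ 0.073, pa = 0.880, κ ≈ 0.117 and AC1 ≈ 0.861, matching Table 6; the other strata reproduce similarly, which makes the contrast between the unstable κ column and the stable AC1 column easy to verify.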

Discussion

It is well known that Cohen’s κ depends strongly on the marginal distributions, and Gwet proposed alternative and more stable measures of agreement: AC1 for nominal data and its extension AC2 for ordinal data [24, 25]. A number of alternative measures have also been proposed, such as Holley and Guilford’s G [19], Aickin’s α [20], Andres and Marzo’s delta [21] and Marasini’s s* [22, 23]. Gwet [24] and Shankar and Bangdiwala [28] compared several measures and showed that AC1 has better properties than other kappa-type measures. In addition, AC1 has been utilized in the field of medical research over the past decade [29-35]. However, statistical inference procedures for AC1 have not been discussed sufficiently. Therefore, Ohyama expressed AC1 using population parameters to develop a likelihood-based inference procedure and constructed confidence intervals for AC1 based on profile variances and likelihood ratios; inclusion of subject covariates, hypothesis testing and sample size estimation were also presented [36]. In the present study, the case of stratified data was discussed as an extension of Ohyama [36] for two raters with binary outcomes. Tests were derived for the homogeneity of AC1 across K independent strata, and inference for the common AC1 was discussed. In the numerical evaluation of type I error, both tests were conservative when the sample size was small and γ0 was 0.9, but the conservativeness was relaxed when the sample size was as large as 80. In the other simulation settings, the score test performed well while GOF sometimes could not achieve the nominal level. Therefore, we recommend using the score test for testing the homogeneity of AC1 among K strata. Note that, when zero cells are observed, the homogeneity score test statistic cannot be calculated. In such cases in our simulation study, we simply added 0.5 to the data, which had no serious effect on the performance of the proposed score test in our simulation settings.
If the homogeneity assumption is reasonable, it may be desirable to provide an estimate of the common AC1 as a summary measure of reliability. In the present study, we proposed an estimator of the common AC1, constructed its confidence intervals by the SA, FZ and PV methods, and evaluated the performance of each numerically. The bias and MSE tended to decrease as the sample size increased, and both were nearly 0 when nk = 80. The PV method provides coverage levels close to nominal in most situations, while the SA method tends to undercover and the FZ method tends to overcover in some situations. Therefore, we recommend the PV method for calculating confidence intervals. As in the PVR example, AC1 in each stratum is less affected by the prevalence or marginal probability than κ is. This suggests that the proposed homogeneity test and the general framework of common AC1 estimation are also essentially more stable than those for κ. There were some limitations in this study. First, as described above, the proposed score test was very conservative when γ0 = 0.9 and the sample size was small; an exact approach might be an alternative in such cases. Second, the cases considered were limited to two raters with binary outcomes in each stratum. In the evaluation of medical data, however, it is often the case that multiple raters classify subjects into nominal or ordered categories. Our proposed method may be extended to the case of multiple raters with binary outcomes using the likelihood function for multiple raters. In the case of two raters with nominal outcomes, Agresti [47] proposed a quasi-symmetry model with kappa as a parameter, and this technique may be extended to AC1 in the case of stratified data. Finally, continuous covariates need to be categorized adequately to apply the proposed approach.
The regression model proposed by Ohyama [36] can be used to assess the effect of continuous covariates on AC1, but it is limited to the case of two raters with binary data. Nelson and Edwards [48] and Nelson, Mitani and Edwards [49] proposed methods for constructing a measure of agreement using generalized linear mixed-effects models, introducing continuous latent variables that represent the subject’s true disease status and flexibly incorporating rater and subject covariates. These approaches might be applicable to AC1 and AC2.

Conclusion

The method proposed in this study is considered to be useful for summarizing evaluations of consistency performed in multiple or stratified inter-rater agreement studies. In addition, the proposed method can be applied not only to medical or epidemiological research but also to assessment of the degree of consistency of characteristics, such as biometrics, psychological measurements, and data in the behavioral sciences.

Supplementary information

Additional file 1. Supplementary tables.
Additional file 2. R code for type I errors and power.
Additional file 3. R code for bias and MSE.
Additional file 4. R code for coverage rates.
Additional file 5. R code for clinical data examples.