Abstract
Applications of interrater agreement (IRA) statistics for Likert scales are plentiful in research and practice. IRA may be implicated in job analysis, performance appraisal, panel interviews, and any other approach to gathering systematic observations. Any rating system involving subject-matter experts can also benefit from IRA as a measure of consensus. Further, IRA is fundamental to aggregation in multilevel research, which is becoming increasingly common in order to address nesting. Although several technical descriptions of a few specific IRA statistics exist, this paper aims to provide a tractable orientation to common IRA indices to support application. The introductory overview is written with the intent of facilitating contrasts among IRA statistics by critically reviewing equations, interpretations, strengths, and weaknesses. Statistics considered include rwg, [Formula: see text], r'wg, rwg(p), average deviation (AD), awg, standard deviation (Swg), and the coefficient of variation (CVwg). Equations support quick calculation and contrasting of different agreement indices. The article also includes a "quick reference" table and three figures to help readers identify how IRA statistics differ and how interpretations of IRA will depend strongly on the statistic employed. A brief consideration of recommended practices involving statistical and practical cutoff standards is presented, and conclusions are offered in light of the current literature.
Keywords: data aggregation; interrater agreement; multilevel methods; reliability; rwg; within-group agreement
Year: 2017 PMID: 28553257 PMCID: PMC5427087 DOI: 10.3389/fpsyg.2017.00777
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Summary of interrater agreement statistics for Likert-type response scales.
| 1 – ( |
A value of 1.0 indicates complete agreement. A value of 0 indicates agreement equal to the null distribution (i.e., one index of completely random responding). Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., |
Commonly used in the literature and generally known to researchers and reviewers. Likely the most researched agreement statistic. Linear function facilitates interpretation. |
Uniform distribution may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter). May not be directly comparable (i.e., equivalent) across different means of group ratings, number of raters, or sample sizes. It is not uncommon for values to exceed +1.0 or fall below 0. These inadmissible values might not be the result of sampling error. Resetting the values to 0 may therefore be inappropriate and result in loss of information (Brown and Hauenstein, | |
|
A value of 1.0 indicates complete agreement. A value of 0 indicates agreement equal to the null distribution. Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., |
Commonly used in the literature and generally known to researchers and reviewers. Likely the most researched agreement statistic. |
Same as May not be directly comparable (i.e., equivalent) across different means of group ratings or the number of raters. It is upwardly influenced by the number of discrete Likert scale response options. Values in between 1.0 and 0 are difficult to interpret because the function is non-linear. | ||
| 1 – ( |
If using σeu2, the interpretation is the same as If using σ Values below 0 (using σeu2) and below 0.5 (using σ |
Presents a compelling alternative to the uniform null distribution (σeu2) by positing the theoretical maximum dissensus (σ Circumvents problems of inadmissible values by allowing for meaningful interpretations when |
May not be directly comparable (i.e., equivalent) across different means of group ratings. Maximum dissensus may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter). May be positively correlated with group mean extremity. | |
|
Same as |
Same as With increasing items the function remains linear, unlike |
Same as | ||
| 1 – ( |
Less attenuated than is Interpretation is otherwise similar to |
Less attenuated than is |
Shares many of the same limitations as does Application has been rare in the literature and, accordingly, researchers and reviewers may be unaware of the underlying logic. | |
|
Identify subgroups, calculate each subgroup's agreement score, check homogeneity of variances and, if supported, substitute sample-weighted average group variance (denoted Homogeneity of variances can be tested using Fisher's |
Has same interpretation as does previous |
Allows for consideration of theoretically meaningful subgroups. Addresses limitation of inadmissible values that can be problematic for |
Has many of the same interpretational problems as do previous Can be difficult to generate theoretical predictions Assumes homogeneity of subgroup variances. If homogeneity assumptions cannot be supported, separate | |
| ∑(| |
Indexes the average distance of judges' ratings from the group's scale mean. Considerable justification for practical cutoff criteria has been offered, but the proposed cutoffs are not without assumptions (see Section Standards for Agreement). |
Interpretation is not complicated by changes (e.g., non-linearity) in the number of Likert categories (bearing in mind greater deviations are expected given category increases). Circumvents problems associated with choosing an appropriate null distribution. |
May be negatively correlated with group mean extremity. Does not permit explicit modeling of random responding (i.e., has no null distribution term). AD values are highly dependent on the number of scale categories employed. This makes it very difficult to compare AD values of scales differing in length. | |
| ∑ |
Shares interpretations of |
Same advantages as Takes the average of each |
Same limitations of | |
| 1 – [(2 * |
A value of +1.0 indicates perfect agreement, given the group mean. A value of 0 indicates the observed variance is 50% of the maximum variance, given the group mean. A value of −1.0 indicates maximum disagreement given the group mean. Will equal single-item Will equal single and multi-item |
Controls for the extremeness of the group mean by not relying on a single specification of the null distribution. Uses the unbiased, sample variance to calculate observed and theoretical random variance terms, whereas the Circumvents problems of inadmissible values. Will not be affected by sample size because it employs matched variances. |
Requires at least Is not interpretable at face value beyond certain extreme group means. That is, the minimum mean with interpretable | |
| ∑ |
Shares interpretations of |
Same advantages as Takes the average of each |
Same limitations as | |
| {[∑ ( |
The root of the average squared judge deviation from the mean. |
Provides a straightforward and direct index of agreement. |
Will be scale dependent such that a greater number of response options will tend to produce greater Does not permit explicit modeling of random responding (i.e., has no null distribution term). | |
| ∑ |
Shares interpretations of |
Same advantages as Takes the average of each |
Same limitations as | |
|
Rescales the standard deviation by taking into account the mean. Large values suggest large variance relative to the mean (and scale). |
Samples with larger means may be expected to have greater standard deviations than samples with smaller means. The |
It is difficult to decide what constitutes high and low consensus based on The assumption of a non-negative ratio scale may not always be tenable. The Does not permit explicit modeling of random responding (i.e., has no null distribution term). | ||
| ∑ |
Shares interpretations of |
Same advantages as |
Same disadvantages as |
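The equation cells of the summary table did not survive extraction, so the following is only a minimal Python sketch of how several of the tabled indices are commonly computed. It assumes the uniform-null variance (A² − 1)/12 for A response options and the multi-item formula attributed to James et al.; the function names are hypothetical, and the use of the sample variance (n − 1 denominator) throughout is an assumption rather than something the table confirms.

```python
from statistics import mean, variance


def rwg_uniform(ratings, n_options):
    """Single-item r_wg under a uniform null: 1 - S^2 / sigma_eu^2."""
    # Assumed uniform-null variance for a scale with n_options categories.
    sigma_eu2 = (n_options ** 2 - 1) / 12.0
    return 1.0 - variance(ratings) / sigma_eu2


def rwg_j_uniform(item_ratings, n_options):
    """Multi-item r_wg(J); item_ratings is a list of per-item rating lists."""
    j = len(item_ratings)
    sigma_eu2 = (n_options ** 2 - 1) / 12.0
    # Ratio of the mean observed item variance to the null variance.
    ratio = mean(variance(item) for item in item_ratings) / sigma_eu2
    return (j * (1.0 - ratio)) / (j * (1.0 - ratio) + ratio)


def average_deviation(ratings):
    """AD about the mean: average absolute distance from the group mean."""
    m = mean(ratings)
    return sum(abs(x - m) for x in ratings) / len(ratings)


def sd_wg(ratings):
    """Within-group standard deviation (sample variance assumed)."""
    return variance(ratings) ** 0.5


def cv_wg(ratings):
    """Coefficient of variation: standard deviation rescaled by the mean."""
    return sd_wg(ratings) / mean(ratings)


if __name__ == "__main__":
    group = [4, 4, 5, 4, 3]  # five judges rating one item on a 5-point scale
    print(rwg_uniform(group, 5))       # 0.75
    print(average_deviation(group))
    print(sd_wg(group), cv_wg(group))
```

Note that `rwg_uniform` and `rwg_j_uniform` deliberately return out-of-range values unchanged, since the table cautions that resetting them to 0 may discard information.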
Figure 1. Single and multiple-item .
Figure 2. Sample .
Figure 3.
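The awg row of the summary table describes an index that equals +1.0 at perfect agreement, 0 when the observed variance is 50% of the maximum variance possible given the group mean, and −1.0 at maximum disagreement. A minimal Python sketch consistent with that description follows. It assumes the Brown and Hauenstein form, in which the maximum variance at mean M on a scale running from L to H with K raters is [(H + L)M − M² − HL] · K/(K − 1); the function name and the handling of the zero-variance edge case are assumptions of this sketch, not part of the source.

```python
from statistics import mean, variance


def awg(ratings, low, high):
    """Single-item a_wg: 1 - 2 * S^2 / (maximum variance given the mean)."""
    k = len(ratings)
    if k < 2:
        raise ValueError("a_wg needs at least two raters")
    s2 = variance(ratings)  # unbiased sample variance (n - 1 denominator)
    if s2 == 0:
        # Assumption: identical ratings are treated as perfect agreement,
        # which also avoids 0/0 when the mean sits on a scale endpoint.
        return 1.0
    m = mean(ratings)
    # Assumed maximum possible variance for this mean, scale range, and K.
    max_var = ((high + low) * m - m ** 2 - high * low) * (k / (k - 1))
    return 1.0 - (2.0 * s2) / max_var


if __name__ == "__main__":
    print(awg([3, 3, 3, 3], 1, 5))     # identical ratings
    print(awg([1, 1, 5, 5], 1, 5))     # judges split across the endpoints
    print(awg([4, 4, 5, 4, 3], 1, 5))  # moderate spread around a mean of 4
```

Because the denominator depends on the observed mean, the same observed variance yields different awg values at different group means, which is the sense in which the table says the index controls for mean extremity.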