Ben Kelcey, Dan McGinn, Heather Hill.
Abstract
An important assumption underlying meaningful comparisons of scores in rater-mediated assessments is that measurement is commensurate across raters. When raters differentially apply the standards established by an instrument, scores from different raters are on fundamentally different scales and no longer preserve a common meaning and basis for comparison. In this study, we developed a method to accommodate measurement noninvariance across raters when measurements are cross-classified within two distinct hierarchical units. We conceptualized random item effects cross-classified graded response models and used random discrimination and threshold effects to test, calibrate, and account for measurement noninvariance among raters. By leveraging empirical estimates of rater-specific deviations in the discrimination and threshold parameters, the proposed method allows us to identify noninvariant items and empirically estimate and directly adjust for this noninvariance within a cross-classified framework. Within the context of teaching evaluations, the results of a case study suggested substantial noninvariance across raters and that establishing an approximately invariant scale through random item effects improves model fit and predictive validity.
Keywords: measurement equivalence; measurement invariance; multilevel item response models; random item effects; teaching
Year: 2014 PMID: 25566145 PMCID: PMC4274900 DOI: 10.3389/fpsyg.2014.01469
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
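The model described in the abstract is a graded response model in which each rater's discrimination and threshold parameters may deviate from the overall item parameters through random item effects. A minimal sketch of the resulting category probabilities, using the overall RICH estimates reported in the parameter table below and purely illustrative rater deviations (`dev_a` and `dev_b` are assumed values, not estimates from the paper):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under a graded response model.

    theta : latent score of the teacher/observation
    a     : item discrimination
    b     : ordered thresholds (length K-1 for K score categories)
    """
    b = np.asarray(b, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # P(Y >= k) for k = 1..K-1
    upper = np.concatenate(([1.0], cum))
    lower = np.concatenate((cum, [0.0]))
    return upper - lower  # P(Y = k) for k = 0..K-1

# Overall RICH estimates from the parameter table; the rater-specific
# deviations are illustrative draws of the random item effects, not
# values from the paper.
a0, b0 = 1.14, np.array([0.61, 2.57])
dev_a, dev_b = 0.15, np.array([-0.20, 0.10])
p_rater = grm_category_probs(theta=0.5, a=a0 + dev_a, b=b0 + dev_b)
```

Because the deviations enter the discrimination and thresholds directly, two raters scoring the same lesson can imply different category probabilities — exactly the noninvariance the random item effects are meant to absorb.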
Discrimination and threshold parameters.

| Parameter | Single: Est | SD | ML: Est | SD | CC: Est | SD | RIE-CC: Est | SD | σ²: Est | Low | High |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RICH | 1.14 | 0.04 | 1.08 | 0.04 | 0.99 | 0.05 | 1.05 | 0.07 | 0.10 | 0.05 | 0.20 |
| WWS | 1.39 | 0.07 | 1.18 | 0.06 | 1.15 | 0.05 | 1.46 | 0.11 | 0.19 | 0.08 | 0.44 |
| CWCM | 0.79 | 0.06 | 0.78 | 0.07 | 0.76 | 0.06 | 0.74 | 0.09 | 0.08 | 0.02 | 0.21 |
| SPMMR | 1.33 | 0.06 | 1.23 | 0.05 | 1.17 | 0.06 | 1.16 | 0.07 | 0.11 | 0.05 | 0.23 |
| RICH(1) | 0.61 | 0.03 | 0.72 | 0.07 | 0.74 | 0.12 | 0.56 | 0.12 | 0.12 | 0.06 | 0.24 |
| RICH(2) | 2.57 | 0.06 | 2.88 | 0.08 | 2.93 | 0.14 | 2.71 | 0.13 | | | |
| WWS(1) | 0.53 | 0.04 | 0.57 | 0.04 | 0.64 | 0.13 | 0.48 | 0.15 | 0.07 | 0.01 | 0.24 |
| WWS(2) | 2.75 | 0.10 | 2.80 | 0.07 | 2.94 | 0.14 | 3.12 | 0.20 | | | |
| CWCM(1) | −1.98 | 0.06 | −2.24 | 0.12 | −2.25 | 0.15 | −2.39 | 0.15 | 0.09 | 0.02 | 0.25 |
| SPMMR(1) | 0.83 | 0.04 | 0.94 | 0.08 | 1.03 | 0.16 | 0.83 | 0.13 | 0.25 | 0.13 | 0.49 |
| SPMMR(2) | 2.78 | 0.09 | 3.06 | 0.12 | 3.24 | 0.19 | 2.97 | 0.15 | | | |
| Observations | 1.00 | — | 1.00 | — | 1.00 | — | 1.00 | — | | | |
| Teachers | — | — | 0.34 | 0.05 | 0.40 | 0.06 | 0.32 | 0.06 | | | |
| Raters | — | — | — | — | 0.28 | 0.09 | 0.26 | 0.09 | | | |

Rows without a parenthetical index are discriminations; item(k) rows are the kth thresholds; the Observations, Teachers, and Raters rows report variance components at each level. Est, estimate; SD, standard deviation; σ², the item-specific random effect variance across raters; Low and High, the lower and upper bounds of the 95% posterior interval, respectively. Single, single-level graded response model; ML, multilevel graded response model; CC, cross-classified graded response model; RIE-CC, random item effects cross-classified graded response model.
Figure 1. Item characteristic curve for a single item across different raters.
Test of measurement invariance for item parameters.

| Parameter | σ²: Est | Low | High | BF (σ² < 0.001) | BF (σ² < 0.01) | BF (σ² < 0.1) |
| --- | --- | --- | --- | --- | --- | --- |
| Thresholds | | | | | | |
| RICH | 0.12 | 0.06 | 0.24 | 0.000 | 0.000 | 0.576 |
| WWS | 0.07 | 0.01 | 0.24 | 0.623 | 0.668 | 1.009 |
| CWCM | 0.09 | 0.02 | 0.25 | 0.053 | 0.276 | 0.884 |
| SPMMR | 0.25 | 0.13 | 0.49 | 0.000 | 0.000 | 0.009 |
| Discriminations | | | | | | |
| RICH | 0.10 | 0.05 | 0.20 | 0.000 | 0.000 | 0.858 |
| WWS | 0.19 | 0.08 | 0.44 | 0.000 | 0.000 | 0.353 |
| CWCM | 0.08 | 0.02 | 0.21 | 0.172 | 0.256 | 0.986 |
| SPMMR | 0.11 | 0.05 | 0.23 | 0.000 | 0.001 | 0.783 |

BF, Bayes factor for the hypothesis that the respective variance is less than 0.001, 0.01, or 0.1; Low and High, the lower and upper bounds of the 95% posterior interval, respectively.
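One common way to obtain Bayes factors like those above from MCMC output is to compare posterior and prior odds that the variance falls in the constrained region (an encompassing-prior style calculation; the paper's exact computation is not shown in this excerpt, so treat this as a generic sketch using simulated stand-in draws):

```python
import numpy as np

def bf_variance_below(post_draws, prior_draws, cut):
    """Bayes factor for H: sigma^2 < cut versus its complement,
    estimated as the ratio of posterior odds to prior odds."""
    post_p = np.mean(post_draws < cut)
    prior_p = np.mean(prior_draws < cut)
    post_odds = post_p / max(1.0 - post_p, 1e-12)
    prior_odds = prior_p / max(1.0 - prior_p, 1e-12)
    return post_odds / max(prior_odds, 1e-12)

# Stand-in draws: a posterior concentrated near zero and a diffuse prior.
rng = np.random.default_rng(2)
sigma2_post = rng.gamma(shape=0.5, scale=0.01, size=10_000)
sigma2_prior = rng.gamma(shape=0.5, scale=1.0, size=10_000)
bf_small = bf_variance_below(sigma2_post, sigma2_prior, cut=0.01)  # > 1 here
```

A BF well above 1 supports the constrained (near-invariant) hypothesis, while values near 0 — as for most parameters in the table — indicate the variance is credibly larger than the cutoff.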
Posterior predictive checks for item fit (95% posterior intervals).

| Category | Observed | Single: Low | High | ML: Low | High | CC: Low | High | RIE-CC: Low | High |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RICH0 | 0.656 | 0.649 | 0.663 | 0.648 | 0.692 | 0.637 | 0.709 | 0.618 | 0.69 |
| RICH1+ | 0.344 | 0.337 | 0.351 | 0.308 | 0.352 | 0.291 | 0.363 | 0.31 | 0.382 |
| RICH2 | 0.044 | 0.042 | 0.047 | 0.031 | 0.043 | 0.027 | 0.042 | 0.034 | 0.052 |
| WWS0 | 0.622 | 0.613 | 0.629 | 0.604 | 0.65 | 0.597 | 0.676 | 0.573 | 0.656 |
| WWS1+ | 0.378 | 0.371 | 0.387 | 0.35 | 0.396 | 0.324 | 0.403 | 0.344 | 0.427 |
| WWS2 | 0.053 | 0.05 | 0.057 | 0.043 | 0.058 | 0.038 | 0.059 | 0.045 | 0.071 |
| CWCM0 | 0.060 | 0.058 | 0.062 | 0.042 | 0.052 | 0.041 | 0.061 | 0.035 | 0.05 |
| CWCM1 | 0.940 | 0.938 | 0.942 | 0.948 | 0.958 | 0.939 | 0.959 | 0.95 | 0.965 |
| SPMMR0 | 0.691 | 0.683 | 0.698 | 0.678 | 0.723 | 0.676 | 0.749 | 0.668 | 0.738 |
| SPMMR1+ | 0.309 | 0.302 | 0.317 | 0.277 | 0.322 | 0.251 | 0.324 | 0.262 | 0.332 |
| SPMMR2 | 0.047 | 0.044 | 0.05 | 0.034 | 0.048 | 0.027 | 0.044 | 0.03 | 0.047 |

Item–category labels give the proportion of observations scored at (or, for "1+", at or above) that level; Observed, the sample proportion; Low and High, the bounds of the 95% posterior predictive interval under each model.
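Each interval in the table above is a posterior predictive check: the observed proportion of scores in a category is compared against the spread of that proportion across datasets replicated from the fitted model. A generic sketch with simulated stand-in data (the replication step here draws from fixed probabilities rather than a fitted model):

```python
import numpy as np

def ppc_category_proportion(observed, replicated, category):
    """Observed proportion in a score category versus the 95% interval
    of that proportion across posterior predictive replications."""
    obs = np.mean(np.asarray(observed) == category)
    rep = np.mean(np.asarray(replicated) == category, axis=1)
    low, high = np.percentile(rep, [2.5, 97.5])
    return obs, low, high

# Stand-in data: 500 observations scored 0-2 and 400 replicated datasets.
rng = np.random.default_rng(0)
probs = [0.65, 0.30, 0.05]
observed = rng.choice([0, 1, 2], size=500, p=probs)
replicated = rng.choice([0, 1, 2], size=(400, 500), p=probs)
obs0, low0, high0 = ppc_category_proportion(observed, replicated, category=0)
```

An observed proportion falling outside its replicated interval flags item misfit for that category.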
Correlation among observation scores from different methods.

|  | RIE-CC | CC | ML | Single | Averages |
| --- | --- | --- | --- | --- | --- |
| RIE-CC | 1.00 | 0.93 | 0.91 | 0.90 | 0.89 |
| CC | 0.93 | 1.00 | 0.92 | 0.91 | 0.92 |
| ML | 0.91 | 0.92 | 1.00 | 0.96 | 0.95 |
| Single | 0.90 | 0.91 | 0.96 | 1.00 | 0.99 |
| Averages | 0.89 | 0.92 | 0.95 | 0.99 | 1.00 |

RIE-CC, random item effects cross-classified graded response model; CC, cross-classified graded response model; ML, multilevel graded response model; Single, single-level graded response model.
Discrepant classification rates among methods.

|  | RIE-CC | CC | ML | Single | Averages |
| --- | --- | --- | --- | --- | --- |
| RIE-CC | 0.00 | 0.23 | 0.32 | 0.37 | 0.33 |
| CC | 0.23 | 0.00 | 0.26 | 0.30 | 0.32 |
| ML | 0.32 | 0.26 | 0.00 | 0.24 | 0.23 |
| Single | 0.37 | 0.30 | 0.24 | 0.00 | 0.09 |
| Averages | 0.33 | 0.32 | 0.23 | 0.09 | 0.00 |

RIE-CC, random item effects cross-classified graded response model; CC, cross-classified graded response model; ML, multilevel graded response model; Single, single-level graded response model.
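A discrepant classification rate of this kind can be computed by binning teachers under each scoring method and counting disagreements. A sketch using quartile bins (an illustrative choice; the paper's classification rule is not given in this excerpt):

```python
import numpy as np

def discrepant_rate(scores_a, scores_b, n_bins=4):
    """Share of units placed in different performance bins by two methods."""
    cuts = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior quantile levels
    bins_a = np.digitize(scores_a, np.quantile(scores_a, cuts))
    bins_b = np.digitize(scores_b, np.quantile(scores_b, cuts))
    return float(np.mean(bins_a != bins_b))

# Stand-in scores from two methods that measure the same construct noisily.
rng = np.random.default_rng(3)
truth = rng.normal(size=200)
method_a = truth + rng.normal(scale=0.3, size=200)
method_b = truth + rng.normal(scale=0.3, size=200)
rate = discrepant_rate(method_a, method_b)
```

Even highly correlated score vectors can disagree on a nontrivial share of classifications, which is why the rates in the table are informative beyond the correlation matrix above.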
Correlation between observation scores and value-added scores.

| Method | Est | Low | High |
| --- | --- | --- | --- |
| Averages | 0.11 | −0.05 | 0.27 |
| Single | 0.12 | −0.04 | 0.28 |
| ML | 0.15 | −0.01 | 0.31 |
| CC | 0.14 | −0.02 | 0.30 |
| RIE-CC* | 0.17 | 0.01 | 0.33 |

*Interval excludes zero.
Low and High, the lower and upper bounds of the 95% interval. RIE-CC, random item effects cross-classified graded response model; CC, cross-classified graded response model; ML, multilevel graded response model; Single, single-level graded response model.
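For intuition about how point estimates and intervals of this size arise, here is a Pearson correlation with a 95% bootstrap percentile interval; this is a generic frequentist interval computed on simulated data, whereas the table above reports model-based intervals:

```python
import numpy as np

def corr_with_ci(x, y, n_boot=4000, seed=0):
    """Pearson correlation with a 95% bootstrap percentile interval."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
    low, high = np.percentile(boot, [2.5, 97.5])
    return r, low, high

# Simulated scores with a weak true association, roughly mimicking the
# magnitudes in the table (values here are illustrative, not the data).
rng = np.random.default_rng(4)
value_added = rng.normal(size=300)
observation_score = 0.17 * value_added + rng.normal(size=300)
r, low, high = corr_with_ci(observation_score, value_added)
```

With weak associations like these, whether an interval excludes zero can hinge on small gains in precision — which is the predictive-validity advantage the RIE-CC model shows in the table.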