Letty Koopman, Bonne J H Zijlstra, Mark de Rooij, L Andries van der Ark.
Abstract
Two-level Mokken scale analysis is a generalization of Mokken scale analysis for multi-rater data. The bias of estimated scalability coefficients for two-level Mokken scale analysis, the bias of their estimated standard errors, and the coverage of the confidence intervals were investigated under various testing conditions. The estimated scalability coefficients were unbiased in all tested conditions. For estimating standard errors, the delta method and the cluster bootstrap were compared. The cluster bootstrap structurally underestimated the standard errors of the scalability coefficients, resulting in low coverage values. Except for unequal numbers of raters across subjects and small sets of items, the delta method standard error estimates had negligible bias and good coverage. Post hoc simulations showed that the cluster bootstrap does not correctly reproduce the sampling distribution of the scalability coefficients, and an adapted procedure was suggested. In addition, the delta method standard errors can be slightly improved if the harmonic mean rather than the arithmetic mean is used for unequal numbers of raters per subject.
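The cluster bootstrap mentioned in the abstract resamples whole subjects, each together with all of its raters. The sketch below illustrates only that resampling scheme on a generic statistic. The function names, the toy ratings, and the use of a grand mean as the statistic are assumptions made for illustration and are not taken from the article, which reports that this procedure underestimates the standard errors of the scalability coefficients.

```python
import random
import statistics


def cluster_bootstrap_se(clusters, statistic, n_boot=500, seed=1):
    """Bootstrap SE of `statistic`, resampling whole clusters with replacement.

    `clusters` is a list of subjects; each subject is a list of rater scores.
    (Hypothetical helper, not the authors' implementation.)
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Draw as many subjects as in the data, keeping each subject's raters intact.
        resample = [rng.choice(clusters) for _ in clusters]
        estimates.append(statistic(resample))
    # The SD of the bootstrap estimates serves as the standard error estimate.
    return statistics.stdev(estimates)


def grand_mean(clusters):
    """Mean of all scores, pooled over subjects and raters."""
    scores = [x for subject in clusters for x in subject]
    return sum(scores) / len(scores)


# Hypothetical ratings: four subjects, five raters each.
ratings = [[2, 2, 2, 1, 1], [0, 0, 1, 1, 2], [1, 2, 1, 1, 0], [2, 1, 2, 2, 2]]
se = cluster_bootstrap_se(ratings, grand_mean)
print(f"cluster bootstrap SE of the grand mean: {se:.3f}")
```

Because only subjects are resampled, variation among raters within a subject is never redrawn. That is the main difference from a two-stage scheme, which would also resample raters within each drawn subject.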
Keywords: Mokken scale analysis; cluster bootstrap; delta method; rater effects; standard errors; two-level scalability coefficients
Year: 2019 PMID: 32341607 PMCID: PMC7174805 DOI: 10.1177/0146621619843821
Source DB: PubMed Journal: Appl Psychol Meas ISSN: 0146-6216
Population Values of the Two-Level Scalability Coefficients $H^W$, $H^B$, and $H^{BW}$ and the SD of the Sampling Distribution for the Four Conditions of the Rater Effect in the Main Design.
| Coefficient | 0.25: Value | 0.25: SD | 0.50: Value | 0.50: SD | 0.75: Value | 0.75: SD | 1.00: Value | 1.00: SD |
|---|---|---|---|---|---|---|---|---|
| $H^W$ | .437 | .037 | .418 | .034 | .435 | .029 | .479 | .025 |
| $H^B$ | .415 | .038 | .316 | .038 | .214 | .036 | .126 | .032 |
| $H^{BW}$ | .948 | .010 | .756 | .036 | .483 | .057 | .262 | .058 |
Rater Effect and Rating Variance Values for the Number of Answer Categories Specialized Design.
| Number of answer categories | Rating variance | Rater effect at level 0.25 | Rater effect at level 0.50 | Rater effect at level 0.75 | Rater effect at level 1.00 |
|---|---|---|---|---|---|
| 2 | .3 | 0.18 | 0.27 | 0.35 | 0.45 |
| 3 | .4 | 0.20 | 0.33 | 0.48 | 0.65 |
| 5 | .5 | 0.25 | 0.50 | 0.75 | 1.00 |
| 6 | .5 | 0.30 | 0.70 | 1.00 | 1.20 |
Note. The level (0.25 to 1.00) is the rater-effect level from the main design.
Population Values of $H^W$, $H^B$, and $H^{BW}$ for the Specialized Designs Item Discrimination, Item-Step Location, and Rating Variance, at a Fixed Rater Effect.
| Coefficient | Discrimination: 0.5 | Discrimination: 1 | Discrimination: 1.5 | Discrimination: Varied | Location: 1.5 | Location: 3 | Location: 4.5 | Location: Equal | Variance: 0.25 | Variance: 0.50 | Variance: 0.75 | Variance: Varied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $H^W$ | .185 | .418 | .569 | .381 | .377 | .418 | .439 | .400 | .464 | .418 | .357 | .384 |
| $H^B$ | .125 | .316 | .439 | .284 | .327 | .316 | .270 | .252 | .343 | .316 | .269 | .270 |
| $H^{BW}$ | .675 | .756 | .772 | .747 | .866 | .756 | .616 | .630 | .738 | .756 | .752 | .704 |
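Bias of Estimated Coefficients (H) and of the Estimated Standard Errors (SE).
| Rater effect | Bias(H): $H^W$ | Bias(H): $H^B$ | Bias(H): $H^{BW}$ | Bias(SE), delta method: $H^W$ | Bias(SE), delta method: $H^B$ | Bias(SE), delta method: $H^{BW}$ | Bias(SE), cluster bootstrap: $H^W$ | Bias(SE), cluster bootstrap: $H^B$ | Bias(SE), cluster bootstrap: $H^{BW}$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.25 | −.000 | −.001 | −.002 | .002 | .002 | | | | −.002 |
| 0.50 | −.001 | −.002 | −.007 | .002 | .001 | .004 | | | |
| 0.75 | .001 | −.002 | −.009 | .003 | .002 | .004 | | | |
| 1.00 | .001 | −.003 | −.008 | .003 | .003 | | | | |
Note. Bias that exceeds the boundary of .044 and .004 for the coefficients and the standard errors, respectively, is printed in boldface.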
Figure 1. Plot of the coverage of the 95% confidence interval of the two-level scalability coefficients for different levels of rater effect and the two standard error estimation methods.
Note. Error bars represent the 95% Agresti–Coull confidence interval.
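The error bars in the coverage plots are Agresti–Coull intervals around an observed coverage proportion. As a minimal sketch, the standard Agresti–Coull formula can be written as follows. The function name and the example counts (93 covering intervals out of 100 replications) are hypothetical and are not values from the article.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull(successes, n, level=0.95):
    """Agresti-Coull interval for a binomial proportion (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n_adj = n + z ** 2                        # adjusted number of trials
    p_adj = (successes + z ** 2 / 2) / n_adj  # adjusted proportion
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1], since the adjusted interval can overshoot the boundaries.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical simulation outcome: 93 of 100 intervals covered the true value.
lower, upper = agresti_coull(successes=93, n=100)
covers_nominal = lower <= 0.95 <= upper
print(f"[{lower:.3f}, {upper:.3f}] contains .95: {covers_nominal}")
```

A coverage estimate is then flagged as deviating from the nominal level when .95 falls outside this interval, which matches the boldface rule used in the table notes.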
Bias of the Delta Method Standard Errors (SE) for the Two-Level Scalability Coefficients $H^W$, $H^B$, and $H^{BW}$ for the Specialized Designs of Number of Raters and Number of Items.
| Number of raters | $H^W$ | $H^B$ | $H^{BW}$ | Number of items | $H^W$ | $H^B$ | $H^{BW}$ |
|---|---|---|---|---|---|---|---|
| 2 | .002 | .002 | | 2 | .002 | | −.003 |
| 5 | .002 | .001 | .004 | 3 | .001 | −.004 | .000 |
| 30 | .000 | .000 | .001 | 4 | .002 | −.001 | .003 |
| 4-6 | .004 | | | 6 | .001 | .001 | |
| 3-7 | | | | 10 | .002 | .001 | .004 |
| 5-30 | | | | 20 | .002 | .002 | .003 |
Note. Bias that exceeds the boundary of .004 is printed in boldface.
Figure 2. Coverage plots for the two-level scalability coefficients for different numbers of raters and items, respectively.
Note. Error bars represent the 95% Agresti–Coull confidence interval.
Post Hoc Results of the Bias of the Estimated Standard Errors (SE) and Coverage for the Two-Stage and Cluster Bootstrap and the Delta Method, the Arithmetic and Harmonic Mean of the Number of Raters per Subject, and Item-Pairs With Two, Four, and 10 Items, for $H^W$, $H^B$, and $H^{BW}$, in a Main Design Condition.
| Condition | Mean | Bias(SE): $H^W$ | Bias(SE): $H^B$ | Bias(SE): $H^{BW}$ | Coverage: $H^W$ | Coverage: $H^B$ | Coverage: $H^{BW}$ |
|---|---|---|---|---|---|---|---|
| Method | | | | | | | |
| Two-stage bootstrap | | −.003 | −.004 | | | | |
| Cluster bootstrap | | | | | | | |
| Delta method | | .002 | .001 | .004 | .955 | .950 | |
| Number of raters | | | | | | | |
| 4-6 | A | .004 | | | | | |
| 4-6 | H | .003 | .003 | | | | |
| 3-7 | A | | | | | | |
| 3-7 | H | | | | | | |
| 5-30 | A | | | | | | |
| 5-30 | H | | | | | | |
| Number of items | | | | | | | |
| 2 | | .002 | | −.003 | .944 | | .941 |
| 4 | | .002 | −.001 | | .945 | .938 | |
| 10 | | .002 | .003 | | .950 | .953 | |
Note. Bias that exceeds the boundary of .004 and coverages where .95 is outside the Agresti–Coull interval are printed in boldface. The two-stage bootstrap results are based on 100 replications. The results are averaged across all item-pairs. A = arithmetic mean and H = harmonic mean of the number of raters per subject.
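The note contrasts the arithmetic and the harmonic mean of the number of raters per subject. The short sketch below shows how far the two can diverge when group sizes are very unequal. The rater counts are made up for illustration, loosely echoing the 5-30 condition, and the sketch does not show how the article's delta method uses the mean.

```python
from statistics import harmonic_mean, mean

# Hypothetical numbers of raters for four subjects (assumed values).
raters_per_subject = [5, 30, 5, 30]

arithmetic = mean(raters_per_subject)          # (5 + 30 + 5 + 30) / 4
harmonic = harmonic_mean(raters_per_subject)   # 4 / (1/5 + 1/30 + 1/5 + 1/30)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

The harmonic mean is pulled toward the smaller groups, so it never exceeds the arithmetic mean, and the two coincide only when every subject has the same number of raters.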
Two Small Constructed Multi-Rater Data Examples, One With a Large Rater Effect and One With a Small Rater Effect.
| Data set | Subject | Item | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 |
| 1 | 1 | 2 | 1 | 2 | 2 | 0 | 1 |
| 1 | 2 | 1 | 0 | 0 | 1 | 1 | 2 |
| 1 | 2 | 2 | 0 | 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1 |
| 2 | 1 | 2 | 1 | 2 | 2 | 1 | 1 |
| 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
| 2 | 2 | 2 | 0 | 1 | 0 | 1 | 0 |

| Data set | 95% CI for $H^W$ | 95% CI for $H^B$ | 95% CI for $H^{BW}$ |
|---|---|---|---|
| 1 | [0.343, 1.181] | [−0.231, 0.565] | [−0.288, 0.726] |
| 2 | [0.349, 1.175] | [0.435, 0.970] | [0.441, 1.402] |
Note. 95% CI is the 95% Wald-based confidence interval. Both data sets have two subjects, each rated by a unique set of five raters on two three-category items.
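The two constructed data sets are small enough to work through by hand. The sketch below computes the within-rater, between-rater, and ratio coefficients for a single item pair using a covariance-ratio form: each covariance is divided by the maximum covariance attainable given the item marginals. This formulation, the function name, the labels $H^W$, $H^B$, and $H^{BW}$ for the three columns, and the assignment of scores to subjects and raters are assumptions based on the table as read here. The sketch also assumes an equal number of raters per subject. Since Wald intervals are symmetric, the midpoints of the reported 95% CIs serve as a check on the point estimates.

```python
from itertools import permutations


def two_level_h(item_i, item_j):
    """Return (H_W, H_B, H_BW) for one item pair (hypothetical helper).

    item_i, item_j: lists of subjects, each a list of rater scores, where
    position r in both lists refers to the same rater of that subject.
    """
    flat_i = [x for subject in item_i for x in subject]
    flat_j = [x for subject in item_j for x in subject]
    n = len(flat_i)
    mean_product = (sum(flat_i) / n) * (sum(flat_j) / n)

    # Within-rater covariance: both items scored by the same rater.
    cov_w = sum(a * b for a, b in zip(flat_i, flat_j)) / n - mean_product
    # Maximum covariance given the marginals: pair the sorted scores.
    cov_max = sum(a * b for a, b in zip(sorted(flat_i), sorted(flat_j))) / n - mean_product

    # Between-rater covariance: item i from rater r, item j from another
    # rater p of the same subject, over all ordered pairs with r != p.
    cross, pairs = 0, 0
    for scores_i, scores_j in zip(item_i, item_j):
        for r, p in permutations(range(len(scores_i)), 2):
            cross += scores_i[r] * scores_j[p]
            pairs += 1
    cov_b = cross / pairs - mean_product

    h_w, h_b = cov_w / cov_max, cov_b / cov_max
    return h_w, h_b, h_b / h_w


# Scores as read from the table: two subjects, five raters each.
data_set_1 = ([[2, 2, 2, 1, 1], [0, 0, 1, 1, 2]],   # item 1
              [[1, 2, 2, 0, 1], [0, 1, 0, 1, 1]])   # item 2
data_set_2 = ([[2, 2, 2, 2, 1], [0, 0, 1, 1, 1]],
              [[1, 2, 2, 1, 1], [0, 1, 0, 1, 0]])

# Reported 95% Wald CIs, in the assumed order H_W, H_B, H_BW.
ci_1 = [(0.343, 1.181), (-0.231, 0.565), (-0.288, 0.726)]
ci_2 = [(0.349, 1.175), (0.435, 0.970), (0.441, 1.402)]

est_1 = two_level_h(*data_set_1)   # approximately (0.762, 0.167, 0.219)
est_2 = two_level_h(*data_set_2)   # approximately (0.762, 0.702, 0.922)

for name, est, cis in (("Data set 1", est_1, ci_1), ("Data set 2", est_2, ci_2)):
    midpoints = [(lo + hi) / 2 for lo, hi in cis]
    print(name, [round(e, 3) for e in est], "CI midpoints:", [round(m, 3) for m in midpoints])
```

Under these assumptions the computed coefficients agree with the CI midpoints to about three decimals. Both data sets share the same within-rater coefficient, while the between-rater coefficient is much lower in data set 1, which is consistent with data set 1 being the example with the large rater effect.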