Masaki Uto, Maomi Ueno.
Abstract
In assessment contexts such as entrance examinations, educational assessments, and personnel appraisal, performance assessment by raters has attracted much attention as a way to measure examinees' higher-order abilities. A persistent difficulty, however, is that the accuracy of ability measurement depends strongly on rater and task characteristics. To address this shortcoming, various item response theory (IRT) models that incorporate rater and task characteristic parameters have been proposed. Because these models differ in their rater and task parameters, the features of each model are difficult to understand. This study therefore presents empirical comparisons of such IRT models. Specifically, after reviewing and summarizing the features of existing models, we compare their performance through simulation and actual data experiments.
Keywords: Information science; Psychology
Year: 2018 PMID: 29761162 PMCID: PMC5948474 DOI: 10.1016/j.heliyon.2018.e00622
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. Item response curves of the generalized partial credit model for five categories.
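The curves in such a figure can be reproduced from the standard textbook form of the generalized partial credit model. The sketch below is a minimal illustration under that assumption; the function name `gpcm_probs` and the parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def gpcm_probs(theta, alpha, betas):
    """Category probabilities under the textbook GPCM.

    theta : examinee ability
    alpha : task discrimination
    betas : step difficulties for moving into categories 2..K (length K-1)
    """
    # Each step contributes alpha * (theta - beta_m); the first category has exponent 0.
    steps = alpha * (theta - np.asarray(betas, dtype=float))
    exponents = np.concatenate(([0.0], np.cumsum(steps)))
    exponents -= exponents.max()  # subtract the max for numerical stability
    p = np.exp(exponents)
    return p / p.sum()

# Hypothetical five-category task with symmetric step difficulties.
example_betas = [-1.5, -0.5, 0.5, 1.5]
probs_mid = gpcm_probs(0.0, 1.0, example_betas)
probs_low = gpcm_probs(-4.0, 1.0, example_betas)
probs_high = gpcm_probs(4.0, 1.0, example_betas)
```

Evaluating `gpcm_probs` over a grid of `theta` values and plotting each of the five components gives one response curve per category.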
Task and rater characteristics in each model, and the number of parameters.
| Model | Task characteristics | Rater characteristics | Number of parameters |
|---|---|---|---|
| MFRM | Difficulty | Severity | |
| Patz1999 | Discrimination; difficulty for each category | Severity for each task | |
| Ueno2008 | Discrimination; difficulty | Severity; range restriction | 2 |
| Uto2016 | Discrimination; difficulty for each category | Severity; consistency | |
| HRM | Difficulty for each category | Severity; consistency | |
Figure 2. Item response curves of Patz1999 for two raters with different rating severity.
Figure 3. Item response curves of Ueno2008 for two raters with different range restriction characteristics.
Figure 4. Item response curves of Uto2016 for two raters with different rating consistency.
RMSE for rater and task parameters calculated in the simulation experiment.
| MFRM | .054 (.048) | .070 (.069) | .069 (.056) | .096 (.091) | .103 (.082) |
| Patz1999 | .106 (.094) | .118 (.109) | .107 (.095) | .161 (.137) | .178 (.154) |
| Ueno2008 | .108 (.089) | .073 (.074) | .119 (.102) | .161 (.130) | .189 (.189) |
| Uto2016 | .088 (.091) | .078 (.081) | .105 (.091) | .130 (.110) | .127 (.114) |
| HRM | .252 (.283) | .335 (.493) | .477 (.467) | .349 (.331) | .223 (.252) |
RMSE for ability calculated in the simulation experiment.
| MFRM | .148 (.112) | .158 (.125) | .205 (.162) | .226 (.170) | .137 (.095) |
| Patz1999 | .152 (.114) | .153 (.122) | .182 (.143) | .190 (.157) | .175 (.110) |
| Ueno2008 | .166 (.130) | .150 (.116) | .211 (.161) | .214 (.151) | .151 (.115) |
| Uto2016 | .159 (.129) | .155 (.117) | .177 (.125) | .193 (.147) | .145 (.107) |
| HRM | .371 (.299) | .302 (.239) | .379 (.290) | .385 (.295) | .403 (.316) |
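For readers unfamiliar with the metric, a root mean square error between true and estimated values can be computed as in the sketch below. It is a generic illustration only: the function name `rmse` and the sample values are hypothetical, and how the paper aggregated errors across parameters and replications is not shown in this record.

```python
import numpy as np

def rmse(true_values, estimates):
    """Root mean square error between true and estimated values."""
    t = np.asarray(true_values, dtype=float)
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - t) ** 2)))

# Hypothetical abilities for four examinees and their estimates.
true_theta = [-1.0, 0.0, 0.5, 1.2]
est_theta = [-0.8, 0.1, 0.3, 1.5]
error = rmse(true_theta, est_theta)
```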
Transformation rules corresponding to assessment settings in which some rater and task characteristics are assumed to be present.
| Setting | Assumed characteristic | Transformation procedure |
|---|---|---|
| (A) | Raters with low consistency exist | For 60% of raters |
| (B) | Low discrimination tasks exist | For 60% of tasks |
| (C) | Raters with strong range restriction exist | Two categories |
| (D) | Difficulty to obtain each category differs among tasks | Two categories |
| (E) | Rater severity differs among tasks | We first selected |
| (F) | All the above characteristics exist | All the above transformation rules are applied simultaneously. |
Performance of models in various assessment settings.
| Setting | Model | AIC | WAIC | BIC | ML | RMSE( |
|---|---|---|---|---|---|---|
| (A) | MFRM | 4.50(.45) | 4.20(.36) | 3.90(.09) | 4.90(.09) | .478(.048) |
| | Patz1999 | 2.10(.09) | 2.10(.09) | 2.10(.09) | 2.10(.09) | .404(.042) |
| | Ueno2008 | 2.89(.10) | 2.89(.10) | 2.89(.10) | 2.89(.10) | .394(.036) |
| | Uto2016 | | | | | |
| | HRM | 4.30(.21) | 4.60(.24) | 4.90(.09) | 3.90(.09) | .478(.068) |
| (B) | MFRM | 4.80(.16) | 4.70(.21) | 3.90(.09) | 4.90(.09) | .548(.058) |
| | Patz1999 | | | | | |
| | Ueno2008 | 3.00(.00) | 3.00(.00) | 3.00(.00) | 3.00(.00) | .392(.047) |
| | Uto2016 | 2.00(.00) | 2.00(.00) | 2.00(.00) | 2.00(.00) | .373(.026) |
| | HRM | 4.00(.20) | 4.10(.29) | 4.90(.09) | 3.90(.09) | .635(.115) |
| (C) | MFRM | 4.00(.00) | 4.00(.00) | 4.00(.00) | 4.30(.21) | .318(.069) |
| | Patz1999 | 2.60(.24) | 2.60(.24) | 2.60(.24) | 2.60(.24) | .258(.035) |
| | Ueno2008 | | | | | |
| | Uto2016 | 2.40(.24) | 2.40(.24) | 2.40(.24) | 2.40(.24) | .255(.035) |
| | HRM | 5.00(.00) | 5.00(.00) | 5.00(.00) | 4.70(.21) | .385(.047) |
| (D) | MFRM | 4.00(.00) | 4.00(.00) | 4.00(.00) | 4.40(.24) | .318(.057) |
| Patz1999 | 1.60(.24) | .259(.026) | ||||
| | Ueno2008 | 3.00(.00) | 3.00(.00) | 3.00(.00) | 3.00(.00) | .286(.028) |
| | Uto2016 | | | | | |
| | HRM | 5.00(.00) | 5.00(.00) | 5.00(.00) | 4.60(.24) | .408(.054) |
| (E) | MFRM | 4.40(.24) | 4.60(.24) | 4.00(.20) | 4.90(.09) | .419(.065) |
| | Patz1999 | | | | | |
| | Ueno2008 | 2.89(.10) | 2.89(.10) | 2.89(.10) | 2.89(.10) | .343(.055) |
| | Uto2016 | 2.10(.09) | 2.10(.09) | 2.10(.09) | 2.10(.09) | .350(.050) |
| | HRM | 4.40(.44) | 4.20(.36) | 4.80(.16) | 3.90(.09) | .711(.162) |
| (F) | MFRM | 4.90(.09) | 4.90(.09) | 4.80(.16) | 4.90(.09) | .735(.051) |
| | Patz1999 | | | | | |
| | Ueno2008 | 3.00(.00) | 3.00(.00) | 3.00(.00) | 3.00(.00) | .708(.066) |
| | Uto2016 | 2.00(.00) | 2.00(.00) | 2.00(.00) | 2.00(.00) | .691(.102) |
| | HRM | 3.90(.09) | 3.90(.09) | 4.00(.20) | 3.90(.09) | .876(.062) |
Descriptive statistics for the report assessment data.
| | Avg. | I-R Cor | Appearance rate for each category (%) | | | | | Average scores of raters for each task | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 |
| Rater 1 | 1.820 | 0.781 | 9.8 | 32.9 | 30.1 | 19.6 | 7.7 | 1.852 | 1.933 | 1.704 | 1.483 | 2.133 |
| Rater 2 | 1.962 | 0.785 | 6.3 | 30.8 | 33.6 | 19.6 | 9.8 | 1.741 | 2.033 | 1.778 | 2.103 | 2.100 |
| Rater 3 | 2.268 | 0.651 | 2.0 | 10.1 | 51.5 | 34.3 | 2.0 | 2.375 | 2.167 | 2.321 | 2.167 | 2.667 |
| Rater 4 | 2.507 | 0.652 | 0.0 | 3.5 | 49.3 | 39.6 | 7.6 | 2.296 | 2.467 | 2.464 | 2.586 | 2.733 |
| Rater 5 | 2.705 | 0.739 | 0.7 | 7.4 | 35.8 | 31.8 | 24.3 | 2.533 | 2.633 | 2.897 | 2.759 | 2.767 |
| Task 1 | 2.128 | 0.533 | 7.6 | 18.5 | 38.7 | 23.5 | 11.8 | |||||
| Task 2 | 2.247 | 0.750 | 5.3 | 13.3 | 44.0 | 26.0 | 11.3 | |||||
| Task 3 | 2.180 | 0.414 | 2.2 | 20.9 | 38.8 | 26.6 | 11.5 | |||||
| Task 4 | 2.160 | 0.651 | 4.1 | 19.2 | 36.3 | 31.5 | 8.9 | |||||
| Task 5 | 2.428 | 0.669 | 0.0 | 14.6 | 38.2 | 35.8 | 11.4 | |||||
Descriptive statistics for the peer assessment data.
| | Avg. | I-R Cor | Appearance rate for each category (%) | | | | | Average scores of raters for each task | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 |
| Rater 1 | 2.392 | 0.590 | 2.5 | 19.2 | 28.3 | 36.7 | 13.3 | 1.933 | 2.400 | 2.533 | 2.700 |
| Rater 2 | 2.325 | 0.673 | 10.8 | 13.3 | 24.2 | 35.8 | 15.8 | 1.900 | 2.433 | 2.467 | 2.500 |
| Rater 3 | 1.842 | 0.631 | 8.3 | 27.5 | 40.8 | 18.3 | 5.0 | 1.800 | 1.800 | 1.900 | 1.867 |
| Rater 4 | 2.367 | 0.491 | 0.8 | 15.8 | 32.5 | 47.5 | 3.3 | 2.000 | 2.433 | 2.533 | 2.500 |
| Rater 5 | 2.492 | 0.408 | 0.0 | 13.3 | 38.3 | 34.2 | 14.2 | 2.300 | 2.500 | 2.567 | 2.600 |
| Rater 6 | 2.333 | 0.406 | 0.8 | 20.0 | 33.3 | 36.7 | 9.2 | 2.367 | 2.400 | 2.133 | 2.433 |
| Rater 7 | 1.258 | 0.500 | 31.7 | 27.5 | 29.2 | 6.7 | 5.0 | 1.433 | 0.900 | 1.333 | 1.367 |
| Rater 8 | 1.992 | 0.568 | 0.8 | 16.7 | 65.8 | 15.8 | 0.8 | 1.967 | 1.867 | 1.900 | 2.233 |
| Rater 9 | 1.450 | 0.451 | 7.5 | 50.8 | 30.8 | 10.8 | 0.0 | 1.733 | 1.533 | 1.000 | 1.533 |
| Rater 10 | 2.625 | 0.733 | 6.7 | 13.3 | 21.7 | 27.5 | 30.8 | 2.400 | 2.567 | 2.700 | 2.833 |
| Rater 11 | 2.517 | 0.525 | 0.0 | 9.2 | 40.0 | 40.8 | 10.0 | 2.800 | 2.367 | 2.300 | 2.600 |
| Rater 12 | 2.392 | 0.470 | 0.0 | 12.5 | 42.5 | 38.3 | 6.7 | 2.300 | 2.333 | 2.367 | 2.567 |
| Rater 13 | 1.525 | 0.522 | 15.0 | 38.3 | 30.8 | 10.8 | 5.0 | 1.833 | 1.567 | 1.300 | 1.400 |
| Rater 14 | 1.908 | 0.380 | 3.3 | 34.2 | 35.8 | 21.7 | 5.0 | 1.767 | 2.133 | 1.733 | 2.000 |
| Rater 15 | 2.383 | 0.546 | 0.0 | 7.5 | 50.8 | 37.5 | 4.2 | 2.200 | 2.300 | 2.467 | 2.567 |
| Rater 16 | 2.575 | 0.533 | 4.2 | 1.7 | 29.2 | 62.5 | 2.5 | 2.200 | 2.633 | 2.767 | 2.700 |
| Rater 17 | 2.683 | 0.493 | 0.0 | 5.0 | 35.8 | 45.0 | 14.2 | 2.467 | 2.900 | 2.467 | 2.900 |
| Rater 18 | 2.108 | 0.626 | 1.7 | 21.7 | 44.2 | 29.2 | 3.3 | 2.233 | 2.000 | 2.067 | 2.133 |
| Rater 19 | 1.683 | 0.461 | 0.0 | 32.5 | 66.7 | 0.8 | 0.0 | 1.767 | 1.567 | 1.733 | 1.667 |
| Rater 20 | 1.717 | 0.540 | 5.8 | 33.3 | 44.2 | 16.7 | 0.0 | 1.633 | 1.533 | 1.567 | 2.133 |
| Rater 21 | 2.225 | 0.676 | 6.7 | 24.2 | 28.3 | 21.7 | 19.2 | 2.067 | 2.100 | 2.267 | 2.467 |
| Rater 22 | 1.883 | 0.538 | 0.8 | 29.2 | 51.7 | 17.5 | 0.8 | 1.700 | 1.800 | 1.900 | 2.133 |
| Rater 23 | 2.150 | 0.197 | 0.8 | 7.5 | 68.3 | 22.5 | 0.8 | 2.067 | 2.233 | 2.033 | 2.267 |
| Rater 24 | 2.008 | 0.247 | 7.5 | 25.0 | 36.7 | 20.8 | 10.0 | 1.867 | 1.867 | 2.167 | 2.133 |
| Rater 25 | 2.600 | 0.650 | 6.7 | 15.8 | 20.8 | 24.2 | 32.5 | 2.067 | 2.700 | 2.533 | 3.100 |
| Rater 26 | 1.533 | 0.481 | 20.8 | 34.2 | 22.5 | 15.8 | 6.7 | 2.233 | 1.267 | 1.433 | 1.200 |
| Rater 27 | 2.592 | 0.663 | 4.2 | 15.0 | 16.7 | 45.8 | 18.3 | 2.500 | 2.667 | 2.500 | 2.700 |
| Rater 28 | 2.875 | 0.334 | 0.8 | 3.3 | 17.5 | 64.2 | 14.2 | 2.900 | 2.867 | 2.767 | 2.967 |
| Rater 29 | 2.142 | 0.644 | 2.5 | 21.7 | 41.7 | 27.5 | 6.7 | 2.100 | 2.033 | 2.000 | 2.433 |
| Rater 30 | 2.500 | 0.706 | 1.7 | 25.0 | 15.8 | 36.7 | 20.8 | 1.933 | 2.567 | 2.833 | 2.667 |
| Task 1 | 2.082 | 0.474 | 6.8 | 23.2 | 34.2 | 26.6 | 9.2 | ||||
| Task 2 | 2.142 | 0.538 | 5.8 | 21.0 | 35.9 | 27.9 | 9.4 | ||||
| Task 3 | 2.142 | 0.535 | 4.6 | 20.9 | 37.9 | 29.1 | 7.6 | ||||
| Task 4 | 2.310 | 0.587 | 3.2 | 16.8 | 36.7 | 32.4 | 10.9 | ||||
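The "Avg." column appears to be consistent with the category appearance rates if categories 1 to 5 are scored 0 to 4. That scoring is an assumption inferred from the numbers, not something stated in this record. The sketch below checks it for Rater 7 of the peer assessment data; the function name `mean_score_from_rates` is hypothetical.

```python
def mean_score_from_rates(rates_percent, scores=(0, 1, 2, 3, 4)):
    """Average score implied by category appearance rates.

    Assumes categories 1..5 carry scores 0..4 (an inference, not a stated fact).
    Dividing by the sum of the rates absorbs rounding in the published percentages.
    """
    total = sum(rates_percent)
    return sum(s * r for s, r in zip(scores, rates_percent)) / total

# Appearance rates (%) for Rater 7 in the peer assessment table.
rater7_rates = [31.7, 27.5, 29.2, 6.7, 5.0]
rater7_mean = mean_score_from_rates(rater7_rates)
```

Under this assumption the implied mean agrees with the reported average of 1.258 to within the rounding of the percentages.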
Information criterion values calculated from actual data.
| Data | Model | AIC | WAIC | BIC | ML |
|---|---|---|---|---|---|
| Report assessment data | MFRM | −809.186 | −803.968 | −838.611 | −786.042 |
| | Patz1999 | −826.134 | −815.524 | −875.176 | −787.831 |
| | Ueno2008 | | | | |
| | Uto2016 | −807.605 | −797.879 | −851.743 | −771.613 |
| | HRM | −1050.488 | −1445.299 | −1197.613 | −868.446 |
| Peer assessment data | MFRM | −4650.06 | −4646.46 | −4696.3 | −4615.25 |
| | Patz1999 | −4662.97 | −4646.08 | −4776.47 | −4575.41 |
| | Ueno2008 | −4541.02 | −4504.17 | −4651.02 | −4445.21 |
| | Uto2016 | | | | |
| | HRM | −4683.719 | −7035.085 | −4842.054 | −4498.075 |
Ability measurement error calculated from actual data.
| | Report assessment data | | | Peer assessment data | | |
|---|---|---|---|---|---|---|
| | RMSE | MAE | SD | RMSE | MAE | SD |
| MFRM | 0.337 | 0.254 | 0.221 | 0.334 | 0.258 | 0.212 |
| Patz1999 | 0.382 | 0.319 | 0.211 | 0.360 | 0.285 | 0.219 |
| Ueno2008 | 0.181 | 0.316 | 0.229 | 0.217 | ||
| Uto2016 | 0.253 | 0.187 | 0.171 | 0.146 | ||
| HRM | 0.422 | 0.321 | 0.274 | 0.453 | 0.330 | 0.311 |