Abstract
Performance assessments, in which human raters assess examinee performance in practical tasks, have attracted much attention in various assessment contexts involving measurement of higher-order abilities. However, a persistent difficulty is that the accuracy of ability measurement depends strongly on rater and task characteristics such as rater severity and task difficulty. To resolve this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising results of multiple performance tests administered to different examinees, test linking is needed to unify the scale for model parameters estimated from individual test results. In test linking, test administrators generally need to design multiple tests such that raters and tasks partially overlap. The accuracy of linking under this design relies heavily on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high accuracy in test linking remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common rater and task designs. Concretely, we conduct simulation experiments that examine linking accuracy based on an MFRM while varying the numbers of common raters and tasks, together with various factors that may affect linking accuracy.
Keywords: Educational measurement; IRT linking; Item response theory; Many-facet Rasch models; Performance assessment; Rater effects; Test design
Year: 2020 PMID: 33169286 PMCID: PMC8367909 DOI: 10.3758/s13428-020-01498-x
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
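The abstract refers to many-facet Rasch models with rater and task parameters but does not spell out the model. As a rough illustration only, the sketch below computes rating-category probabilities under one common rating-scale formulation of an MFRM, in which the log-odds of adjacent categories equal examinee ability minus task difficulty minus rater severity minus a step parameter. The function name, argument names, and the exact parameterization are assumptions and may differ from the model used in the paper.

```python
import numpy as np

def mfrm_category_probs(theta, beta_task, beta_rater, steps):
    """Category probabilities under a rating-scale many-facet Rasch model.

    theta      : examinee ability
    beta_task  : task difficulty
    beta_rater : rater severity
    steps      : step parameters d_1..d_K, with d_1 conventionally fixed to 0
    (All names are illustrative assumptions, not taken from the paper.)
    """
    steps = np.asarray(steps, dtype=float)
    # Cumulative sum over m = 1..k of (theta - beta_task - beta_rater - d_m).
    logits = np.cumsum(theta - beta_task - beta_rater - steps)
    logits -= logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# A more severe rater shifts probability mass toward lower categories.
lenient = mfrm_category_probs(0.0, 0.0, -1.0, [0.0, 0.0, 0.0])
severe = mfrm_category_probs(0.0, 0.0, 1.0, [0.0, 0.0, 0.0])
```

Under this formulation, rater severity and task difficulty enter additively on the same scale as ability, which is why parameters estimated from separate tests need linking before they can be compared.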
Example of rater-pair design
| | Task 1 | | | | Task 2 | | | | Task 3 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rater | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
| Examinee 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
| Examinee 2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
| Examinee 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
| Examinee 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
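The table above shows each examinee receiving six ratings, which is consistent with two of the four raters scoring each of the three tasks. The sketch below builds a design of that shape using a simple rotating assignment. The rotation rule and the function name are hypothetical and chosen only for illustration; the specific rater pairs in the original table are not recoverable from this excerpt.

```python
import numpy as np

def rater_pair_design(n_examinees, n_tasks, n_raters):
    """Boolean array [examinee, task, rater]; True means that rating is collected.

    Each examinee-task cell is assigned exactly two raters using a
    hypothetical rotation (not the assignment rule from the paper).
    Requires n_raters >= 2.
    """
    design = np.zeros((n_examinees, n_tasks, n_raters), dtype=bool)
    for j in range(n_examinees):
        for i in range(n_tasks):
            first = (j + i) % n_raters        # rotate the starting rater
            second = (first + 1) % n_raters   # pair it with the next rater
            design[j, i, first] = True
            design[j, i, second] = True
    return design

design = rater_pair_design(n_examinees=4, n_tasks=3, n_raters=4)
```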
Fig. 1 Linking design using common raters and common tasks
Parameter distributions for the new test
| Distribution 1 | ||||
| Distribution 2 | ||||
| Distribution 3 | ||||
| Distribution 4 |
Experimental results for different parameter distributions
| Distribution 1 | |||||
| 1 | .1538(.1476) | ||||
| 2 | .1483(.1423) | ||||
| 3 | .1574(.1461) | ||||
| 4 | .1420(.1360) | ||||
| 5 | .1469(.1458) | ||||
| Distribution 2 | |||||
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| Distribution 3 | |||||
| 1 | .1679(.1476) | .1464(.1435) | |||
| 2 | .1596(.1423) | .1406(.1340) | |||
| 3 | .1544(.1461) | ||||
| 4 | .1462(.1360) | ||||
| 5 | |||||
| Distribution 4 | |||||
| 1 | .1605(.1476) | .1513(.1435) | .1491(.1433) | ||
| 2 | .1473(.1423) | .1435(.1340) | |||
| 3 | .1531(.1461) | ||||
| 4 | .1470(.1360) | ||||
| 5 | .1501(.1458) | ||||
Experimental results for different numbers of examinees, tasks, and raters
| J = 50, I = 5, R = 5 | |||||
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| J = 100, I = 5, R = 5 | |||||
| 1 | .3025(.2942) | ||||
| 2 | .2693(.2685) | ||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| J = 100, I = 10, R = 5 | |||||
| 1 | .2187(.2066) | .1995(.1966) | .2039(.1908) | .1995(.1911) | .1985(.1938) |
| 2 | .2048(.2026) | ||||
| 3 | .2065(.1986) | ||||
| 4 | |||||
| 5 | |||||
| J = 100, I = 5, R = 10 | |||||
| 1 | .2212(.2099) | ||||
| 2 | .2198(.2007) | ||||
| 3 | .2142(.2040) | ||||
| 4 | .1955(.1945) | ||||
| 5 | |||||
Experimental results for different rates of missing data
| R = 5, | |||||
| 1 | .3616(.3082) | .3180(.2990) | |||
| 2 | .3458(.3048) | .3123(.2981) | |||
| 3 | .3291(.3088) | .3064(.2911) | |||
| 4 | .3317(.3109) | .3032(.2856) | |||
| 5 | .3189(.2966) | .2998(.2927) | |||
| R = 10, | |||||
| 1 | .3187(.2510) | .2943(.2592) | .2795(.2431) | .2722(.2386) | .2733(.2511) |
| 2 | .2792(.2519) | .2610(.2368) | .2545(.2503) | ||
| 3 | .2777(.2584) | .2434(.2319) | |||
| 4 | .2869(.2507) | .2471(.2463) | |||
| 5 | .2803(.2462) | .2537(.2345) | |||
| R = 10, | |||||
| 1 | .3795(.3128) | .3278(.2941) | .3399(.2897) | .3260(.2842) | .3187(.2998) |
| 2 | .3459(.3084) | .3127(.3036) | .3081(.2863) | ||
| 3 | .3541(.2898) | .3091(.2901) | .2992(.2968) | ||
| 4 | .3420(.3033) | .3141(.2985) | |||
| 5 | .3488(.3002) | .3074(.2968) | |||
Experimental results for large-scale settings
| J = 1000, I = 5, R = 20, | |||||
| 1 | .5310(.3841) | .5220(.3920) | .5263(.4061) | .5242(.3872) | .5177(.4076) |
| 2 | .5078(.3883) | .4906(.4007) | .4764(.3847) | .4814(.3873) | .4800(.4074) |
| 3 | .4929(.3993) | .4641(.3930) | .4480(.3919) | .4587(.3997) | .4525(.3928) |
| 4 | .4835(.4023) | .4525(.3910) | .4314(.3858) | .4340(.3956) | .4352(.4141) |
| 5 | .4751(.4070) | .4335(.4020) | .4432(.3905) | .4168(.4065) | .4201(.3980) |
| 6 | .4505(.3965) | .4347(.3962) | .4172(.3956) | .4195(.3996) | .4088(.3979) |
| 7 | .4526(.4071) | .4279(.4053) | .4109(.3962) | .4172(.3854) | |
| 8 | .4612(.3960) | .4130(.3974) | .4133(.3972) | .4024(.3960) | |
| 9 | .4599(.4153) | .4274(.3996) | |||
| 10 | .4402(.3935) | .4250(.3894) | |||
| J = 1000, I = 5, R = 20, | |||||
| 1 | .4184(.2883) | .3958(.2885) | .4042(.2871) | .3959(.2862) | .3917(.2823) |
| 2 | .3804(.2921) | .3563(.2956) | .3535(.2848) | .3539(.2960) | .3412(.2979) |
| 3 | .3509(.2952) | .3317(.3033) | .3312(.2830) | .3197(.2971) | .3264(.2824) |
| 4 | .3457(.2922) | .3159(.2889) | .3118(.2983) | .3029(.2881) | .3030(.2929) |
| 5 | .3454(.2904) | .3181(.3102) | .3004(.2856) | .2987(.2959) | .3015(.2918) |
| 6 | .3296(.2929) | .3064(.2937) | .2970(.2905) | .2943(.2914) | .2968(.2928) |
| 7 | .3236(.2929) | .2977(.2951) | |||
| 8 | .3224(.2930) | .2966(.2928) | |||
| 9 | .3206(.2886) | ||||
| 10 | .3179(.3003) | ||||
Experimental results for different parameter distributions when characteristics of some common raters and tasks are changed
| Distribution 1 | |||||
| 1 | .1629(.1476) | .1511(.1435) | .1460(.1433) | ||
| 2 | .1543(.1423) | .1491(.1340) | .1523(.1399) | ||
| 3 | .1468(.1461) | ||||
| 4 | .1495(.1360) | ||||
| 5 | .1508(.1458) | ||||
| Distribution 2 | |||||
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| Distribution 3 | |||||
| 1 | .1620(.1476) | .1510(.1435) | .1464(.1433) | .1438(.1426) | .1445(.1421) |
| 2 | .1593(.1423) | .1434(.1340) | .1457(.1399) | ||
| 3 | .1489(.1461) | ||||
| 4 | .1590(.1360) | ||||
| 5 | |||||
| Distribution 4 | |||||
| 1 | .1759(.1476) | .1519(.1435) | .1446(.1433) | .1507(.1426) | |
| 2 | .1521(.1423) | .1483(.1340) | .1459(.1399) | ||
| 3 | .1660(.1461) | .1436(.1373) | .1393(.1358) | ||
| 4 | .1597(.1360) | .1423(.1404) | |||
| 5 | .1464(.1458) | ||||
Experimental results for different parameter distributions when the absolute value of the average bias is used to calculate linking accuracy criteria instead of the RMSE
| Distribution 1 | |||||
| 1 | .1023(.0999) | ||||
| 2 | .0945(.0693) | ||||
| 3 | .1044(.0879) | ||||
| 4 | .0817(.0676) | ||||
| 5 | |||||
| Distribution 2 | |||||
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| Distribution 3 | |||||
| 1 | .1108(.0999) | .0829(.0815) | |||
| 2 | .0996(.0693) | .0702(.0685) | |||
| 3 | .0931(.0879) | ||||
| 4 | .0791(.0676) | ||||
| 5 | |||||
| Distribution 4 | |||||
| 1 | .1091(.0999) | .0835(.0815) | .0838(.0829) | ||
| 2 | .0879(.0693) | .0725(.0685) | |||
| 3 | .0898(.0879) | ||||
| 4 | .0836(.0676) | ||||
| 5 | |||||
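The last table replaces the RMSE with the absolute value of the average bias as the linking accuracy criterion. The following sketch shows how the two measures differ for a vector of parameter estimates compared against known true values, as would be available in a simulation. The function names are assumptions, and the excerpt does not state which parameters the paper aggregates over or how it averages across replications.

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean squared error between estimated and true parameter values."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def abs_average_bias(estimates, truth):
    """Absolute value of the mean signed error (average bias)."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(abs(np.mean(diff)))

# Errors of opposite sign cancel in the average bias but not in the RMSE.
truth = [0.0, 0.0, 0.0, 0.0]
estimates = [0.5, -0.5, 0.5, -0.5]
example_rmse = rmse(estimates, truth)
example_bias = abs_average_bias(estimates, truth)
```

Because signed errors can cancel, the absolute average bias isolates a systematic shift in the estimated scale, whereas the RMSE also reflects random estimation error.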