Abstract
Variation in examiner stringency is a recognised problem in many standardised summative assessments of performance such as the OSCE. The stated strength of the OSCE is that such error might largely balance out over the exam as a whole. This study uses linear mixed models to estimate the impact of different factors (examiner, station, candidate and exam) on station-level total domain score and, separately, on a single global grade. The exam data is from 442 separate administrations of an 18-station OSCE for international medical graduates who want to work in the National Health Service in the UK. We find that variation due to examiner is approximately twice as large for domain scores as it is for grades (16% vs. 8%), with smaller residual variance in the former (67% vs. 76%). Combined estimates of exam-level (relative) reliability across all data are 0.75 and 0.69 for domain scores and grades respectively. The correlation between two separate estimates of stringency for individual examiners (one for grades and one for domain scores) is relatively high (r = 0.76), implying that examiners are generally quite consistent in their stringency between these two assessments of performance. Cluster analysis indicates that examiners fall into two broad groups characterised as hawks or doves on both measures. At the exam level, correcting for examiner stringency produces systematically lower cut-scores under borderline regression standard setting than using the raw marks. In turn, such a correction would produce higher pass rates, although meaningful direct comparisons are challenging to make. As in other studies, this work shows that OSCEs and other standardised performance assessments are subject to substantial variation in examiner stringency, and require sufficient domain sampling to ensure quality of pass/fail decision-making is at least adequate.
More, perhaps qualitative, work is needed to understand better how examiners might score similarly (or differently) between the awarding of station-level domain scores and global grades. The issue of the potential systematic bias of borderline regression evidenced for the first time here, with sources of error producing cut-scores higher than they should be, also needs more investigation.
Keywords: Borderline regression; Examiner stringency; OSCE; Standard setting
Year: 2022 PMID: 35230590 PMCID: PMC9117341 DOI: 10.1007/s10459-022-10096-9
Source DB: PubMed Journal: Adv Health Sci Educ Theory Pract ISSN: 1382-4996 Impact factor: 3.629
Descriptive statistics for the key facets of the PLAB2 exam
| Facet | Number of unique levels (i.e. values) in data | Typical occurrence in data: median (lower, upper quartile) | Typical occurrence in data: mean | Comment |
|---|---|---|---|---|
| Candidates | 17,604 | 18 (18,18) | 17.8 | Typically candidates are assessed at 18 stations in PLAB2. Occasionally, a station might be removed from the examination due to poor psychometric performance. |
| Examiners | 862 | 6 (3,13) | 11.1 | Typically examiners are present in six PLAB2 exams in the dataset |
| Stations | 390 | 17 (8, 29) | 20.2 | Typically stations are administered in 17 exams in this dataset |
| Exams | 442 | 1 (1, 1) | 1.0 | The data is from 442 separate PLAB2 exams |
| Observations | 313,593 | Not applicable | Not applicable | There are 313,593 rows of data—one for each candidate/station interaction. |
Fig. 1 Histogram of station scores (n = 313,593)
Fig. 2 Histogram of global grades (n = 313,593)
Variance in station-level scores and grades from separate linear mixed models (n = 313,593)
| Facet | Model for domain scores: variance | Model for domain scores: % of total variance | Model for global grades: variance | Model for global grades: % of total variance |
|---|---|---|---|---|
| Candidate | 0.685 | 11.4% | 0.091 | 9.5% |
| Station | 0.347 | 5.7% | 0.060 | 6.3% |
| Examiner | 0.958 | 15.9% | 0.075 | 7.9% |
| Exam | 0.030 | 0.5% | 0.004 | 0.4% |
| Residual | 4.013 | 66.5% | 0.726 | 75.9% |
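The percentage columns in the variance table can be checked directly from the variance columns. The following is a minimal sketch, assuming each percentage is simply a facet's variance divided by the sum of the five variances for that outcome; the dictionary and function names are illustrative and not taken from the study.

```python
# Variance components copied from the table above.
VARIANCE_COMPONENTS = {
    "domain_scores": {"candidate": 0.685, "station": 0.347, "examiner": 0.958,
                      "exam": 0.030, "residual": 4.013},
    "global_grades": {"candidate": 0.091, "station": 0.060, "examiner": 0.075,
                      "exam": 0.004, "residual": 0.726},
}


def variance_shares(components):
    """Return each facet's variance as a percentage of the summed variance."""
    total = sum(components.values())
    return {facet: 100.0 * value / total for facet, value in components.items()}


if __name__ == "__main__":
    for outcome, components in VARIANCE_COMPONENTS.items():
        shares = variance_shares(components)
        print(outcome, {facet: round(pct, 1) for facet, pct in shares.items()})
```

Because the inputs are already rounded to three decimals, the recomputed shares can differ from the published percentages by about 0.1 percentage points. The examiner share for domain scores comes out at roughly twice the share for global grades, in line with the abstract.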
Outline of main modelling approach for domain scores
Domain score modelled by random effects of candidate, examiner, station and exam:
DOMAIN_SCORE ~ 1 + (1 | CANDIDATE) + (1 | EXAMINER) + (1 | STATION) + (1 | EXAM)
The notation (1 | FACET) indicates that FACET is treated as a random effect.
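To make the structure of this model concrete, the sketch below simulates one hypothetical exam from it: every score is a grand mean plus independent normal effects for candidate, station, examiner and exam, plus residual noise. It is an illustration rather than the authors' code. The variance components are copied from the variance table, while the grand mean of 7.24, the one-examiner-per-station layout, the candidate count and the function name are all assumptions made for the example.

```python
import random

# Variance components for domain scores, copied from the variance table.
VARIANCES = {"candidate": 0.685, "station": 0.347, "examiner": 0.958,
             "exam": 0.030, "residual": 4.013}
GRAND_MEAN = 7.24  # assumed intercept (the mean of the reported facet estimates)


def simulate_exam(n_candidates=30, n_stations=18, seed=0):
    """Simulate (candidate, station, score) rows for one hypothetical exam.

    Assumes one examiner per station for the whole exam. Scores are
    continuous and are not clipped to the 12 point scale.
    """
    rng = random.Random(seed)
    sd = {facet: var ** 0.5 for facet, var in VARIANCES.items()}
    exam_effect = rng.gauss(0.0, sd["exam"])
    station_effects = [rng.gauss(0.0, sd["station"]) for _ in range(n_stations)]
    examiner_effects = [rng.gauss(0.0, sd["examiner"]) for _ in range(n_stations)]
    rows = []
    for candidate in range(n_candidates):
        candidate_effect = rng.gauss(0.0, sd["candidate"])
        for station in range(n_stations):
            score = (GRAND_MEAN + candidate_effect + exam_effect
                     + station_effects[station] + examiner_effects[station]
                     + rng.gauss(0.0, sd["residual"]))
            rows.append((candidate, station, score))
    return rows
```

Fitting the formula above to data of this shape would aim to recover the five variance components from the observed scores alone.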
Overall reliability/SEM estimates for an 18 station PLAB2 OSCE
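| Statistic | Domain scores (12 point scale) | Global grades (3 point scale) |
|---|---|---|
| Reliability (relative) | 0.754 | 0.692 |
| SEM (relative), with percentage of scale | 0.472 (3.93%) | 0.201 (6.69%) |
| Reliability (absolute) | 0.678 | 0.635 |
| SEM (absolute), with percentage of scale | 0.571 (4.76%) | 0.228 (7.60%) |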
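The figures in this table appear to follow from the variance components reported earlier. The sketch below rests on assumed formulas that the text does not state: the first reliability/SEM pair treats only the residual, averaged over 18 stations, as error, and the second pair also counts station and examiner variance (averaged over stations) plus exam variance (not averaged, since a candidate sits a single exam). Under these assumptions the published values are reproduced to within about 0.003, which is consistent with rounding of the inputs.

```python
import math


def exam_level_stats(var, n_stations=18):
    """Exam-level reliability and SEM from station-level variance components.

    The error definitions here are assumptions chosen because they match the
    tabulated values; they are not quoted from the study.
    """
    relative_error = var["residual"] / n_stations
    absolute_error = ((var["station"] + var["examiner"] + var["residual"])
                      / n_stations + var["exam"])
    return {
        "relative_reliability": var["candidate"] / (var["candidate"] + relative_error),
        "relative_sem": math.sqrt(relative_error),
        "absolute_reliability": var["candidate"] / (var["candidate"] + absolute_error),
        "absolute_sem": math.sqrt(absolute_error),
    }


DOMAIN = {"candidate": 0.685, "station": 0.347, "examiner": 0.958,
          "exam": 0.030, "residual": 4.013}
GRADES = {"candidate": 0.091, "station": 0.060, "examiner": 0.075,
          "exam": 0.004, "residual": 0.726}

if __name__ == "__main__":
    for name, components in (("domain scores", DOMAIN), ("global grades", GRADES)):
        stats = exam_level_stats(components)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

Dividing each SEM by the scale length (12 for domain scores, 3 for global grades) gives the percentages shown in brackets in the table.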
Correlation between observed and modelled values across all candidate/station interactions
| Pearson correlation coefficient | Observed domain score | Observed global grade | Modelled domain score |
|---|---|---|---|
| Observed global grade | 0.85 | | |
| Modelled domain score | 0.60 | 0.45 | |
| Modelled global grade | 0.53 | 0.52 | 0.86 |
Interpreting stringency estimates
| Facet | Interpretation of individual model estimate |
|---|---|
| Candidate | The expected outcome for the candidate in a typical station, with a typical examiner and typical exam. This is therefore a single measure of candidate ‘ability’ having taken account of all other facets, so it can be thought of as an estimate of the ‘fair’ score for the candidate. |
| Station | The expected outcome at this station for a typical candidate examined by a typical examiner in a typical exam. This is an estimate of station difficulty having taken account of all other facets, with easier stations having higher values. |
| Examiner | The expected outcome awarded by the examiner who assesses a typical candidate at a typical station in a typical exam. This is an estimate of examiner stringency having taken account of all other facets, with more hawkish examiners having lower values. |
| Exam | The expected outcome for the exam, assuming a typical set of candidates, stations and examiners. This is a measure of exam difficulty having taken account of all other facets, with easier exams having higher values. |
Summary statistics for estimates of stringency for each facet (station-level)
| Facet | Domain scores: mean (SD) | Domain scores: lower quartile, median, upper quartile | Global grades: mean (SD) | Global grades: lower quartile, median, upper quartile |
|---|---|---|---|---|
| Candidate (n = 17,604) | 7.24 (0.71) | 6.77, 7.25, 7.70 | 1.66 (0.25) | 1.50, 1.68, 1.83 |
| Station (n = 390) | 7.24 (0.58) | 6.91, 7.29, 7.67 | 1.66 (0.24) | 1.52, 1.69, 1.84 |
| Examiner (n = 862) | 7.24 (0.96) | 6.54, 7.22, 7.87 | 1.66 (0.26) | 1.50, 1.67, 1.84 |
| Exam (n = 442) | 7.24 (0.13) | 7.15, 7.23, 7.33 | 1.66 (0.05) | 1.63, 1.66, 1.69 |
Fig. 3 Scatter graph of the two estimates of examiner stringency with cluster allocation (n = 862)
Fig. 4 Scatter graph of exam-level observed and modelled exam passing score (percentage, n = 442)
A hypothetical comparison between observed and modelled pass/fail decision
| Overall candidate decisions in PLAB2 | Fail modelled | Pass modelled | Total |
|---|---|---|---|
| Fail observed | 512 | 1,750 | 2,262 |
| Pass observed | 0 | 15,342 | 15,342 |
| Total | 512 | 17,092 | 17,604 |
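The pass rates implied by this cross-tabulation can be worked out from the four cell counts. This is a small sketch, assuming the rows are the observed decisions (fail, then pass) and the columns the modelled decisions (fail, then pass); that reading is an inference from the table totals and the abstract, and the function name is illustrative.

```python
# Cell counts copied from the table above. The assignment of rows to observed
# decisions and columns to modelled decisions is an assumption.
FAIL_OBSERVED_FAIL_MODELLED = 512
FAIL_OBSERVED_PASS_MODELLED = 1750
PASS_OBSERVED_FAIL_MODELLED = 0
PASS_OBSERVED_PASS_MODELLED = 15342


def pass_rates(ff, fp, pf, pp):
    """Return (observed pass rate, modelled pass rate, share of changed decisions)."""
    total = ff + fp + pf + pp
    observed = (pf + pp) / total
    modelled = (fp + pp) / total
    changed = (fp + pf) / total
    return observed, modelled, changed


if __name__ == "__main__":
    observed, modelled, changed = pass_rates(
        FAIL_OBSERVED_FAIL_MODELLED, FAIL_OBSERVED_PASS_MODELLED,
        PASS_OBSERVED_FAIL_MODELLED, PASS_OBSERVED_PASS_MODELLED)
    print(f"observed pass rate: {observed:.1%}")
    print(f"modelled pass rate: {modelled:.1%}")
    print(f"decisions that change: {changed:.1%}")
```

On this reading, about 87% of candidates pass on the observed marks and about 97% would pass on the modelled marks, with every changed decision moving from fail to pass.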
Summary statistics for the linear mixed model residuals (station-level)
| Outcome | Mean | Median | SD | Skew | Lower quartile | Upper quartile |
|---|---|---|---|---|---|---|
| Domain scores (12 point scale) | 0 | −0.01 | 1.96 | −0.04 | −1.30 | 1.32 |
| Global grades (3 point scale) | 0 | 0.10 | 0.83 | −0.30 | −0.58 | 0.60 |
Mean station BRM intercepts, slopes and predicted values (n = 7,877)
| Grade | Description | Predicted y-value: observed | Predicted y-value: modelled | Difference = observed − modelled | Difference as % of 12 point scale |
|---|---|---|---|---|---|
| 0 | Fail | 3.55 | 2.51 | 1.04 | 8.6 |
| 1 | Borderline | 5.61 | 5.24 | 0.37 | 3.1 |
| 2 | Satisfactory | 7.68 | 7.97 | −0.29 | −2.4 |
| 3 | Good | 9.74 | 10.70 | −0.96 | −8.0 |
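Borderline regression sets a station cut-score by regressing domain scores on global grades and reading off the predicted score at the borderline grade (grade 1 here). The sketch below shows that arithmetic with a plain least-squares fit. In a real station the line would be fitted to individual candidates' grade and score pairs; here it is refitted to the four mean predicted values from the table, purely to recover the implied intercepts and slopes. Reading the first predicted column as observed and the second as modelled is an assumption based on the abstract, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def brm_cut_score(grades, scores, borderline_grade=1):
    """Predicted score at the borderline grade from the fitted line."""
    intercept, slope = fit_line(grades, scores)
    return intercept + slope * borderline_grade


# Mean predicted values copied from the table above (assumed column order).
GRADE_VALUES = [0, 1, 2, 3]
OBSERVED = [3.55, 5.61, 7.68, 9.74]
MODELLED = [2.51, 5.24, 7.97, 10.70]

if __name__ == "__main__":
    for label, values in (("observed", OBSERVED), ("modelled", MODELLED)):
        intercept, slope = fit_line(GRADE_VALUES, values)
        cut = brm_cut_score(GRADE_VALUES, values)
        print(f"{label}: intercept={intercept:.2f} slope={slope:.2f} cut-score={cut:.2f}")
```

On these values the modelled line is steeper and starts lower than the observed one, so the two lines cross between grades 1 and 2 and the modelled cut-score at the borderline grade is the lower of the two.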