
The critical role of direct observation in entrustment decisions.

Matthew Sibbald1, Muqtasid Mansoor2, Michael Tsang2, Sarah Blissett3, Geoffrey Norman1.   

Abstract

BACKGROUND: Entrustment decisions may be retrospective (based on past experiences with a trainee) or real-time (based on direct observation). We investigated judgments of entrustment based on assessor prior knowledge of candidates and based on systematic direct observation, conducted in an objective structured clinical exam (OSCE).
METHODS: Sixteen faculty examiners provided 287 retrospective and real-time entrustment ratings of 16 cardiology trainees during OSCE stations in 2019 and 2020. Reliability and validity of these ratings were assessed using correlations across stations (reliability), differences across postgraduate years (construct validity), correlation with a standardized in-training exam (ITE) (criterion validity), and reclassification of entrustment (consequential validity).
RESULTS: Both retrospective and real-time assessments were highly reliable (all intra-class correlations >0.86). Both increased with each year of postgraduate training. Real-time entrustment ratings were significantly correlated with standardized ITE scores; retrospective ratings were not. Real-time ratings explained 37% (2019) and 46% (2020) of variance in examination scores vs. 21% (2019) and 7% (2020) for retrospective ratings. Direct observation resulted in a different level of entrustment compared with retrospective ratings in 44% of cases (p < 0.001).
CONCLUSIONS: Ratings based on direct observation made unique contributions to entrustment decisions.
© 2021 Sibbald, Mansoor, Tsang, Blissett, Norman; licensee Synergies Partners.

Year:  2021        PMID: 34804284      PMCID: PMC8603883          DOI: 10.36834/cmej.72040

Source DB:  PubMed          Journal:  Can Med Educ J        ISSN: 1923-1202


Introduction

While the theoretical underpinning of competency-based medical education (CBME) emphasizes the role of direct observation in entrustment decisions,[1],[2] direct observation is not mandatory.[3]–[5] Direct observation is second nature in procedurally oriented specialties.[6] However, specialties focused on medical decision-making are less amenable to observation-based entrustment, prompting calls to nuance the universal application of a direct-observation approach.[7],[8] Understanding the extent to which direct observation informs entrustment decisions in cognitive tasks would advance the science of CBME; however, this topic remains underexplored in the published literature.

The potential value of direct observation in entrustment decisions is made apparent by contrasting two studies in emergency medicine. One used standardized assessment of observable tasks[9] and showed a strong gradient in ratings over each postgraduate training year, whereas the second collected ratings after each shift[10] and showed minimal gradient within each postgraduate year. The contrast between these two studies in their ability to identify growth provides some evidence of the importance of real-time direct observation in entrustment decision making.

Despite the emphasis on, and potential value of, direct observation in CBME contexts, it is underutilized in cognitive specialities. In busy, competency-based residency programs, faculty frequently provide retrospective entrustment ratings of uncomplicated delegated acts that they did not directly observe.[7],[10] Instead, the assessment derives from some kind of implicit mental averaging of the supervisor's observations of resident performance on the specific task over time. While such an approach, averaging over multiple observations, might be considered more reliable and valid than a single standardized observation, it depends on the supervisor's ability to recall and summarize.
Such summative judgments are vulnerable to biases such as "primacy" and "recency" effects.[11] A central question for supervisors in cognitive specialities is "how often does direct observation of these delegated acts lead to entrustment ratings similar to those the supervisor would provide from informal contact with the resident?" In this study, we compared judgments of entrustment on specific stations of an objective structured clinical exam (OSCE) focused on cognitive tasks under three conditions: 1) expected level of performance, where the assessor was asked to rate the typical performance of a resident at a given level on the station; 2) retrospective, where the assessor estimated how the resident would perform based on their prior observations of the resident; and 3) real-time, based on direct observation.

Methods

We conducted a prospective study comparing retrospective and real-time ratings to expected level of performance ratings in two sequential years of a residency program OSCE. The approach used the standard psychometric criteria of reliability and validity.[12] Reliability was determined across all stations in the OSCE. Construct validity was assessed by examining differences with years of training. Criterion validity was assessed by comparison with a standard in-training written examination. Consequential validity was assessed by examining change in entrustment decisions resulting from observed assessment.

Setting

Postgraduate cardiology trainees in postgraduate years (PGY) four through six participated in a formative 4-hour OSCE at a single center. The OSCE was blueprinted from the objectives of training for cardiology residency programs (Table 1). In February 2019, 10 residents participated in 12 stations with 12 different faculty examiners. In February 2020, 13 residents participated in 13 stations with 13 different faculty examiners. Seven residents and nine faculty examiners participated in both 2019 and 2020 examinations.
Table 1

OSCE Stations in 2019 and 2020

Station | 2019 OSCE | 2020 OSCE
1 | Acute coronary disease | Acute coronary disease
2 | Chronic coronary disease | Chronic coronary disease
3 | Valvular heart disease | Valvular heart disease
4 | Cardiac physical exam | Congenital heart disease: follow up visit of repaired tetralogy
5 | Hypertension | Heart failure and cardiomyopathies
6 | Pulmonary vascular disease | Hypertension related to aortic coarctation
7 | Pericardial disease | Pulmonary vascular disease
8 | Vascular medicine | Pericardial disease
9 | Acute cardiac care | Vascular medicine
10 | Electrophysiology | Acute cardiac care
11 | Pregnancy in patients with cardiovascular disease | Electrophysiology
12 | Congenital heart disease | Pregnancy in patients with cardiovascular disease
13 | N/A | Cardiac physical exam on high fidelity simulator
Each station was constructed to mimic a clinical encounter, with the examiner playing the role of the patient. Residents were required to take a history, interpret physical exam data, interpret investigations (e.g., bloodwork, electrocardiogram, chest radiograph, echocardiography, angiograms), and integrate these data into a management plan communicated to the patient. No procedural skills were tested. One faculty member was assigned to each station, based on content expertise. All faculty members had known all trainees for an average of 1.7 years and had worked with them in at least one clinical context in the preceding 10 months.

Entrustment ratings

Before the OSCE, faculty members were asked to review the station to which they were assigned and decide the level of supervision they would provide for each resident: (1) based only on the time the resident had spent in the training program, i.e., postgraduate year (expected level of performance), and (2) based on prior experience with the resident (retrospective). During the OSCE, faculty members recorded the level of supervision they felt appropriate after observing the trainee complete the station (real-time). All three types of entrustment ratings used the same entrustment scale, based on prior scoring systems:[4],[13]

1. Not yet developed
2. Competent to manage with proactive or direct supervision (i.e., needs to talk through it)
3. Competent to manage with reactive or on-demand supervision (i.e., needs prompting for some management components)
4. Competent to manage without supervision (i.e., can provide definitive short- and long-term management for all aspects of the problem without prompting)
5. Ready to teach this (i.e., sophisticated understanding of the problem, its possible clinical variations, and their impact on management)

Scores of four or higher are required for documentation of competence, whereas scores of three or lower imply some further development is required.

Standardized testing

Each October, all residents completed an international six-hour standardized in-training examination (ITE) constructed by the American College of Cardiology. The examination was separated in time from the OSCE. The ITE contained approximately 150 items blueprinted from the objectives of training for cardiology residency programs. Resident scores are reported as percentiles.

Analysis

Psychometric analyses were conducted separately for 2019 and 2020, and the two analyses were treated as replications. Although some residents were in both cohorts, this was not accounted for in the analysis.

Descriptive statistics: We calculated means and standard deviations separately by exam year (2019, 2020), postgraduate year (PGY 4, 5, 6), and scoring method (expected, retrospective, real-time). Histograms were constructed for average OSCE score and each type of entrustment rating by PGY, for 2019 and 2020.

Reliability

Test reliability for each rating type was computed across all 12 (2019) or 13 (2020) stations. Because each OSCE station had different content and raters, the reliability estimates incorporate variance related to both content and raters. We performed a repeated-measures ANOVA on individual station scores and then calculated the G coefficient for the mean score across stations. In this analysis, reliability was assessed relative to other residents at the same level, as resident scores were nested within the PGY variable.

Validity: We considered three pieces of evidence that entrustment ratings reflect trainees' abilities to practice safely and independently:[12] the ability of the rating type to distinguish among residents at each level of training (construct validity), comparison to an external standard (criterion validity), and recategorization of decisions through direct observation (consequential validity).
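The reliability computation above can be sketched in code. Treating the design as residents crossed with stations (a single random facet), the G coefficient for the mean score across stations reduces to Cronbach's alpha; the ratings below are invented for illustration and are not the study data:

```python
import numpy as np

def g_coefficient(scores):
    """G coefficient (Cronbach's alpha) for the mean score across stations.

    scores: 2-D array, rows = residents, columns = stations.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of stations
    station_vars = scores.var(axis=0, ddof=1).sum()   # sum of per-station variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of residents' total scores
    return k / (k - 1) * (1 - station_vars / total_var)

# Hypothetical entrustment ratings: 4 residents x 3 stations
ratings = np.array([
    [2, 2, 3],
    [3, 3, 3],
    [4, 4, 4],
    [5, 4, 5],
])
print(round(g_coefficient(ratings), 3))
```

In the study's nested design, resident variance is estimated within PGY levels; this sketch ignores that nesting for simplicity.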

Relation to PGY

The previous analysis of variance, by rating type and PGY level, was also used to test for differences among means for both the 2019 and 2020 OSCEs. Because every assessor was aware of the level of each resident, this was a weak test of validity.

Relation to In-Training Exam (ITE)

As an objective standard of performance, the ITE multiple-choice test can be criticized for not comprehensively assessing important domains such as communication skills. Despite this limitation, multiple-choice testing has been shown to be superior to OSCEs in predicting malpractice,[14] peer-review problems 10 years after graduation,[15] and 30-day mortality in the coronary care unit.[16] We computed simple correlations between average OSCE score and ITE score for 2019 and 2020, ignoring postgraduate year. The additional variance accounted for by retrospective and real-time ratings was then calculated by computing R2 for each Pearson correlation coefficient and taking the difference between it and the R2 for the expected rating.
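The additional-variance calculation is a difference of squared Pearson correlations. A minimal sketch, where all names and numbers are hypothetical rather than the study data:

```python
import numpy as np

# Hypothetical per-resident scores (illustrative only)
ite       = np.array([35.0, 48.0, 52.0, 61.0, 70.0, 74.0])  # ITE percentile
expected  = np.array([2.0,  2.0,  3.0,  3.0,  4.0,  4.0])   # expected-for-level rating
real_time = np.array([2.4,  3.0,  3.1,  3.8,  4.2,  4.5])   # mean real-time rating

def r_squared(x, y):
    # Squared Pearson correlation = proportion of variance explained
    return np.corrcoef(x, y)[0, 1] ** 2

# Variance in ITE scores explained beyond the expected-for-level rating
additional = r_squared(real_time, ite) - r_squared(expected, ite)
print(f"additional variance explained: {additional:.3f}")
```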

Recategorization of entrustment decisions through direct observation

We examined consequential validity by examining how frequently real-time judgements resulted in recategorization of retrospective entrustment decisions, both on the 5-point ordinal scale typical of most entrustment measurements[4],[13] and across the binary threshold typically used for summative decision making. Net recategorization by observation was calculated separately for expected and retrospective ratings by cross-tabulating each rating type (as columns) against observed real-time ratings (as rows) and using chi-square testing to determine the significance of recategorization. Net recategorization was defined as 1 minus the proportion of trainees categorized equivalently by the retrospective and observed real-time ratings.
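For the binary threshold, this analysis amounts to a 2×2 cross-tabulation of retrospective against observed classifications, a chi-square test of independence, and one minus the agreement proportion. A sketch with invented counts (not the study data):

```python
import numpy as np

# Rows: observed (real-time) classification; columns: retrospective classification.
# Counts are hypothetical, for illustration only.
table = np.array([[30,  5],    # observed: entrusted
                  [10, 25]])   # observed: not entrusted

# Pearson chi-square test of independence
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
n = table.sum()
expected = row @ col / n                         # expected counts under independence
chi2 = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# Net recategorization = 1 - agreement proportion (diagonal cells agree)
net_recategorization = 1 - np.trace(table) / n

print(f"chi2 = {chi2:.2f}, dof = {dof}")
print(f"net recategorization = {net_recategorization:.0%}")
```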

Critical value for significance

In the setting of multiple statistical tests, we applied a Bonferroni correction to maintain a type 1 error rate of 0.05, resulting in a statistical threshold of p < 0.0025 for significance. Ethical approval was obtained from the Hamilton Integrated Ethics Review Board (protocol #7567).
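The corrected threshold is simply the family-wise error rate divided by the number of planned tests; the reported 0.0025 implies 20 comparisons, which is an inference here since the exact count is not stated in the text:

```python
family_wise_alpha = 0.05
n_tests = 20                      # inferred from 0.05 / 0.0025; not stated explicitly
per_test_threshold = family_wise_alpha / n_tests   # Bonferroni-corrected threshold
print(per_test_threshold)
```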

Results

Descriptive statistics

Expected entrustment ratings (i.e., based on training time), retrospective entrustment ratings (i.e., based on previous experience with the trainee), and real-time entrustment ratings (based on observed performance in the OSCE station) all varied by trainee level (Figure 1). Interestingly, real-time entrustment ratings had trainees from each level in each category, whereas expected and retrospective ratings did not (Figure 1). Reliability coefficients for both retrospective and real-time ratings were large (all intra-class correlations > 0.855), as shown in Table 2.
Table 2

Entrustment ratings and test reliability in 2019 and 2020 objective structured clinical exam (OSCE)

2019 OSCE
Rating type | PGY4 | PGY5 | PGY6 | p across PGY | Mean (all trainees) | Reliability (ICC)
Expected for level | 2.50 ± 0.66 | 3.42 ± 0.65 | 4.33 ± 0.63 | <0.001 | 3.42 ± 0.96 | 0.982
Retrospective | 3.00 ± 0.70 | 3.71 ± 0.77 | 4.22 ± 0.68 | <0.001 | 3.66 ± 0.86 | 0.924
Real-time | 3.33 ± 0.86 | 3.44 ± 1.01 | 4.06 ± 0.67 | <0.001 | 3.59 ± 0.92 | 0.855

2020 OSCE
Rating type | PGY4 | PGY5 | PGY6 | p across PGY | Mean (all trainees) | Reliability (ICC)
Expected for level | 2.00 ± 0.40 | 3.08 ± 0.48 | 4.08 ± 0.48 | <0.001 | 3.05 ± 0.96 | 0.998
Retrospective | 2.52 ± 0.90 | 3.31 ± 0.77 | 3.57 ± 0.81 | <0.001 | 3.13 ± 0.94 | 0.932
Real-time | 2.62 ± 0.88 | 3.38 ± 0.89 | 3.62 ± 0.97 | <0.001 | 3.16 ± 1.01 | 0.887

Ratings are out of 5. A score of 4 or more represents the ability to perform the station independently.


Validity

Each type of entrustment rating increased with PGY in each OSCE year, as shown in Table 2 (all p < 0.001). The relations between scores derived from the three methods and the ITE are shown in Table 3. Only the real-time ratings were significantly correlated with the ITE, and these ratings accounted for substantially more variance in the examination than retrospective ratings. Real-time observation resulted in both increased and decreased entrustment compared with expected and retrospective judgments (Figure 2). Observation reclassified 38% of expected entrustment ratings (χ2 = 73, df = 16, p < 0.00001) and 44% of retrospective entrustment ratings (χ2 = 102, df = 16, p < 0.00001). When entrustment ratings were reclassified in a binary system (scores of four or greater considered competent), observation reclassified 33% of expected entrustment ratings (χ2 = 31.5, df = 1, p = 0.0001) and 29% of retrospective ratings (χ2 = 49.1, df = 1, p = 0.0001).
Table 3

Correlations with in-training exam scores and variance explained by expected for level, retrospective and real time entrustment ratings

2019
Entrustment rating type | Correlation | % variance | Additional % variance**
Expected for level | 0.253 | 0.064 | ---
Retrospective | 0.525 | 0.275 | 0.212
Real time | 0.658* | 0.432 | 0.368

2020
Entrustment rating type | Correlation | % variance | Additional % variance**
Expected for level | 0.023 | 0.000 | ---
Retrospective | 0.258 | 0.066 | 0.066
Real time | 0.678* | 0.459 | 0.459

* Statistically significant correlation with ITE score. ** Variance explained beyond that of the expected-for-level rating.

Discussion

The data presented in this study indicate that ratings derived from real-time observation in a standardized setting contributed unique information to the assessment of individual residents. Direct observation frequently resulted in net reclassification of entrustment ratings; one in three ratings was reclassified across the threshold of entrustment typically used for summative decision making. Expected ratings explained only 0-6% of the variance in ITE scores, and retrospective judgments explained an additional 7-21% of examination performance; real-time ratings, however, explained an additional 37-46% of the variance. Only real-time entrustment ratings correlated significantly with standardized testing.

These findings provide validity evidence for standardized direct observation in entrustment ratings, even in predominantly cognitive tasks. While this study involved a small number of trainees in a single discipline, multiple tasks were assessed in a rigorous format, with replication of findings across two years. While this does not guarantee generalization to other disciplines or contexts, it is consistent with the inferences drawn between prior studies suggesting greater discriminatory ability of ratings based on direct observation.[9],[10]

Practically, greater reliance on direct observation could substantially affect the opportunities and supervision provided to trainees. Currently, supervisors often rely on informal observation to form entrustment judgments and allow trainees to engage in workplace activities, sometimes without direct supervision. Faculty in this study were asked to make a similar judgment by considering a specific situation and assigning a level of supervision based on their prior impressions of the resident, and then directly observed the resident. Interestingly, over 40% of the time faculty changed their minds after observing the trainee.
Based on this finding, relying on retrospective judgments of entrustment will result in a substantial percentage of trainees being given more independence (and some less) in the workplace than if the degree of supervision were determined by direct observation. Relying only on retrospective judgements potentially denies some junior trainees an opportunity for independent learning and places some senior trainees in situations of inadequate supervision. The high frequency of reclassification, particularly the reclassification of senior trainees to lower levels of entrustment, strengthens the validity argument for the use of direct-observation entrustment ratings in a competency-based residency framework. It also calls into question the risky practice of presumptively entrusting residents and then documenting entrustment when no complications arose from a delegated act; this is indirect evidence at best. Further, the substantial reclassification of trainees based on direct observation highlights potential validity risks in assigning entrustment ratings by integrating prior experiences, as is sometimes done in residency in-training assessments. Such ratings will frequently differ from ratings based on observation, as documented in this study.

There are several important limitations of the study. It involved a small number of participants from a single centre and discipline. However, it had a large number of assessors, spanned multiple content domains, captured reasonable variation in competency across years of training, and was adequately powered to draw robust conclusions. Further, all critical conclusions were replicated across the two cohorts of trainees. The choice of criterion measure, the ITE, is proximate and empirically defensible, but leaves unanswered the relation between measures in the educational setting and longer-term outcomes.
In that regard, all entrustment ratings in this study were based on an OSCE setting without relevant patient outcomes. This has significant downsides. First, the retrospective decision of supervisors in the OSCE setting may be prone to recall bias compared with an assessment done at the end of a rotation. Second, the supervisors acted as patients in the scenarios, which required trainees and supervisors to suspend disbelief around the simulation, potentially reducing the authenticity of the interaction. However, the advantage of this setup is that it frees supervisors from the urge to prompt or step in, simplifying the entrustment decision process. While this might theoretically lead to overestimating entrustment, a greater percentage of ratings obtained through direct observation were reclassified to a lower rather than a higher level of entrustment. While all faculty examiners had worked with all trainees, the degree of clinical exposure likely varied. Whether retrospective entrustment ratings align more closely with observed ratings when faculty supervise trainees for longer periods in the clinical environment is unknown.

Conclusions

In summary, direct observation adds to the validity of entrustment ratings. Even among senior residents performing cognitive tasks, direct observation affects faculty impressions.
References (15 in total)

1.  Validity: on meaningful interpretation of assessment data.

Authors:  Susan M Downing
Journal:  Med Educ       Date:  2003-09       Impact factor: 6.251

2.  Cognitive, social and environmental sources of bias in clinical performance ratings. (Review)

Authors:  Reed G Williams; Debra A Klamen; William C McGaghie
Journal:  Teach Learn Med       Date:  2003       Impact factor: 2.414

3.  Competency-based medical education: theory to practice.

Authors:  Jason R Frank; Linda S Snell; Olle Ten Cate; Eric S Holmboe; Carol Carraccio; Susan R Swing; Peter Harris; Nicholas J Glasgow; Craig Campbell; Deepak Dath; Ronald M Harden; William Iobst; Donlin M Long; Rani Mungroo; Denyse L Richardson; Jonathan Sherbino; Ivan Silver; Sarah Taber; Martin Talbot; Kenneth A Harris
Journal:  Med Teach       Date:  2010       Impact factor: 3.650

4.  Competency-based postgraduate training: can we bridge the gap between theory and clinical practice?

Authors:  Olle ten Cate; Fedde Scheele
Journal:  Acad Med       Date:  2007-06       Impact factor: 6.893

5.  The McMaster Modular Assessment Program (McMAP): A Theoretically Grounded Work-Based Assessment System for an Emergency Medicine Residency Program.

Authors:  Teresa Chan; Jonathan Sherbino
Journal:  Acad Med       Date:  2015-07       Impact factor: 6.893

6.  Entrustment Checkpoint: Clinical Supervisors' Perceptions of the Emergency Department Oral Case Presentation.

Authors:  Jeffrey M Landreville; Warren J Cheung; Alexandra Hamelin; Jason R Frank
Journal:  Teach Learn Med       Date:  2019-02-01       Impact factor: 2.414

7.  The Ottawa Surgical Competency Operating Room Evaluation (O-SCORE): a tool to assess surgical competence.

Authors:  Wade T Gofton; Nancy L Dudek; Timothy J Wood; Fady Balaa; Stanley J Hamstra
Journal:  Acad Med       Date:  2012-10       Impact factor: 6.893

8.  Association between licensure examination scores and practice in primary care.

Authors:  Robyn Tamblyn; Michal Abrahamowicz; W Dale Dauphinee; James A Hanley; John Norcini; Nadyne Girard; Paul Grand'Maison; Carlos Brailovsky
Journal:  JAMA       Date:  2002-12-18       Impact factor: 56.272

9.  When do supervising physicians decide to entrust residents with unsupervised tasks?

Authors:  Anneke Sterkenburg; Paul Barach; Cor Kalkman; Mathieu Gielen; Olle ten Cate
Journal:  Acad Med       Date:  2010-09       Impact factor: 6.893

10.  Certifying examination performance and patient outcomes following acute myocardial infarction.

Authors:  John J Norcini; Rebecca S Lipner; Harry R Kimball
Journal:  Med Educ       Date:  2002-09       Impact factor: 6.251

