Daniel P Walsh, Michael J Chen, Lauren K Buhl, Sara E Neves, John D Mitchell.
Abstract
High-quality feedback on resident clinical performance is pivotal to growth and development; a reliable means of assessing faculty feedback is therefore necessary. A feedback assessment instrument would also allow interventions to improve faculty feedback to be appropriately targeted. We piloted an assessment of the interrater reliability of a seven-item feedback rating instrument among faculty educators trained via a three-workshop frame-of-reference training regimen. The instrument's items assessed for the presence or absence of six feedback traits (actionable, behavior focused, detailed, negative feedback, professionalism / communication, and specific) and rated the overall utility of feedback for devising a resident performance improvement plan on an ordinal scale from 1 to 5. Participants completed three cycles, each consisting of a one-hour workshop in which an instructor led a review of the rating instrument using deidentified feedback comments, followed by participants independently rating a set of 20 deidentified feedback comments, and a study-team review of the interrater reliability for each rating category to guide future workshops. Comments came from four different anesthesia residency programs in the United States; each set of feedback comments was balanced with respect to utility scores to promote participants' ability to discriminate between high- and low-utility comments. On the third and final independent rating exercise, participants achieved moderate or greater interrater reliability on all seven rating categories, using Gwet's first-order agreement coefficient (AC1) for the six feedback traits and intraclass correlation for the utility score. This illustrates that trained, expert educators using this instrument can make reliable assessments of faculty-provided feedback. With further validity evidence, this rating instrument has the potential to help programs reliably assess both the quality and utility of their feedback, as well as the impact of any educational interventions designed to improve feedback.
Keywords: education; feedback; interrater reliability
Year: 2022 PMID: 35677580 PMCID: PMC9168931 DOI: 10.1177/23821205221093205
Source DB: PubMed Journal: J Med Educ Curric Dev ISSN: 2382-1205
Definitions for the six binary feedback traits and utility, with emblematic examples.

| Term | Definition | Example |
|---|---|---|
| Actionable | Identifies areas for residents to work on improving. | “Take advantage of this rotation to become more familiar with fiberoptic use.” |
| Behavior Focused | Notes something done by the resident as modifiable or changeable. Raters differentiated a behavior from a characteristic, or personality attribute, of the resident. | Behavior: “The resident placed arterial lines with enthusiasm.” |
| Detailed | Provides ample information describing observed cases or actions that occurred, but not necessarily how a resident performed. | “We were assigned to three transcatheter aortic valve replacement cases today, one of which was a valve-in-valve procedure.” |
| Negative Feedback | Notes areas the resident could improve on; does not necessarily have to be hurtful or personal. | “The resident had difficulty identifying areas of bronchial anatomy during fiberoptic intubation.” |
| Professionalism / Communication | Notes an exceptional level of planning, preparation, and/or communication, or a lack of such. | “The resident actively engaged with nursing and surgery to confirm fluid management strategies, and ensured all supplies were organized to facilitate rapid sequence induction.” |
| Specific | Provides information related to the resident's actions. | “The resident was able to easily switch to a ‘through and through’ approach after their first arterial line placement attempt was unsuccessful.” |
| Utility | Assessment of whether feedback can help devise a performance improvement plan for the resident. | High-utility example: see note below. |
Edited from the table in “Using Machine Learning to Evaluate Attending Feedback on Resident Performance” by Neves SE, Chen MJ, et al., Anesth Analg.
The examples provided for Actionable, Behavior Focused, Detailed, Negative Feedback, Professionalism / Communication, and Specific are synthetic, created to exemplify statements containing the respective feedback traits. The example for Utility is a genuine, deidentified feedback comment left by faculty on a resident's performance that achieved the maximum utility score (5 out of 5); it was edited to use gender-neutral pronouns.
The Actionable, Behavior Focused, Detailed, Negative Feedback, Professionalism / Communication, and Specific rating categories were treated as binary items, with raters noting whether they believed a given comment was emblematic of the respective category (true) or not (false).
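Since the trait ratings reduce to a comments-by-raters matrix of true/false values, Gwet's AC1 can be computed directly from its definition. Below is a minimal sketch in R (the language the authors report using for their calculations); the function name, example data, and assumed layout (one row per comment, one column per rater) are illustrative and are not the authors' code.

```r
# Gwet's first-order agreement coefficient (AC1) for categorical ratings.
# Assumed layout: rows = feedback comments, columns = raters,
# entries = TRUE/FALSE for "trait present". Illustrative sketch only.
gwet_ac1 <- function(ratings) {
  ratings <- as.matrix(ratings)
  n <- nrow(ratings)   # number of comments
  r <- ncol(ratings)   # number of raters
  cats <- sort(unique(as.vector(ratings)))
  q <- length(cats)    # number of categories (2 for a binary trait)
  # counts[i, k]: number of raters placing comment i in category k
  counts <- sapply(cats, function(k) rowSums(ratings == k))
  # Observed agreement: mean pairwise agreement across comments
  pa <- mean(rowSums(counts * (counts - 1)) / (r * (r - 1)))
  # Gwet's chance agreement, from the mean category prevalences
  pik <- colMeans(counts / r)
  pe <- sum(pik * (1 - pik)) / (q - 1)
  (pa - pe) / (1 - pe)
}

# Hypothetical example: three raters, five comments
ratings <- cbind(r1 = c(TRUE, TRUE,  FALSE, TRUE,  FALSE),
                 r2 = c(TRUE, TRUE,  FALSE, FALSE, FALSE),
                 r3 = c(TRUE, FALSE, FALSE, TRUE,  FALSE))
gwet_ac1(ratings)
```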
Figure 1. Distribution of feedback examples with respect to original raters’ utility scores.
Figure 2. Flowchart of feedback rating workshops and rating exercises.
Table of interrater reliability and percent agreement for feedback trait ratings.

| Trait | Set B: Gwet's AC1 | Set B: % Agreement | Set C: Gwet's AC1 | Set C: % Agreement | Set D: Gwet's AC1 | Set D: % Agreement |
|---|---|---|---|---|---|---|
| Actionable | 0.68 | 83% | 0.56 | 73% | | 93% |
| Behavior Focused | | 87% | 0.80 | 83% | 0.80 | 83% |
| Detailed | 0.54 | 77% | | 90% | 0.74 | 87% |
| Negative Feedback | 0.68 | 83% | 0.47 | 73% | 0.54 | 77% |
| ProfComm | 0.29 | 63% | 0.55 | 77% | | 87% |
| Specific | 0.20 | 60% | 0.35 | 67% | 0.67 | 83% |
Set A was used to introduce the rating instrument and train participants on its use, and therefore was not graded.
Gwet’s AC1: Gwet’s first-order agreement coefficient, calculated using R.[5,7] Values are interpreted per Landis & Koch’s benchmarks for kappa strength of agreement: below 0.00 poor, 0.00 to 0.20 slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect.
Percent Agreement represents the percent of comments on which all three participants unanimously agreed on whether the feedback trait was present or not.
ProfComm: Abbreviation for Professionalism / Communication trait.
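The unanimous-agreement percentages and the Landis & Koch labels can be reproduced with a few lines of R under the same assumed comments-by-raters layout; the function names here are again illustrative.

```r
# Percent of comments on which all raters agreed unanimously on a trait
percent_agreement <- function(ratings) {
  100 * mean(apply(as.matrix(ratings), 1, function(x) length(unique(x)) == 1))
}

# Map an AC1 value onto Landis & Koch's strength-of-agreement bands.
# Note: with right-closed intervals, a value of exactly 0.00 falls in
# "poor" here, whereas the note above assigns it to "slight".
landis_koch <- function(ac1) {
  cut(ac1,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1.00),
      labels = c("poor", "slight", "fair", "moderate",
                 "substantial", "almost perfect"))
}

landis_koch(0.68)  # "substantial"
```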
Table of intraclass correlation, agreement rates, and mean scores for utility score ratings.

| Metric | Set B | Set C | Set D |
|---|---|---|---|
| ICC | 0.95 | 0.75 | 0.90 |
| Adjacent Agreement | 95% | 55% | 95% |
| Original Mean Utility Score | 2.55 ± 1.36 | 3.30 ± 1.30 | 3.00 ± 1.45 |
| Mean Utility Score, Participant 1 | 2.80 ± 1.15 | 2.95 ± 0.83 | 2.65 ± 0.93 |
| Mean Utility Score, Participant 2 | 2.40 ± 1.31 | 2.80 ± 1.20 | 3.00 ± 1.12 |
| Mean Utility Score, Participant 3 | 2.55 ± 1.32 | 3.75 ± 0.97 | 3.10 ± 0.91 |
| Mean Utility Score, All Participants | 2.58 ± 1.25 | 3.17 ± 1.08 | 2.92 ± 1.00 |
Set A was used to introduce the rating instrument and train participants on its use, and therefore was not graded. Its mean utility score from the original raters was 3.15 with a standard deviation of 1.57.
ICC: Intraclass correlation (two-way random effects, consistency, multiple raters/measurements), calculated using R.[5,12] Interpretations for ICC values are taken from Koo & Li’s guidelines, where values below 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and above 0.90 indicate poor, moderate, good, and excellent reliability, respectively. ICC calculations took into account only the independent ratings of participants.
Set C’s ICC was measured to be 0.745, which rounds to 0.75. As Koo & Li’s guidelines do not note how to handle scenarios where ICC is on a threshold value, we classified set C’s ICC as “moderate-good” to represent the tiers the value straddles.
Set D’s ICC was measured to be 0.904, which rounds to 0.90. As Koo & Li’s guidelines do not note how to handle scenarios where ICC is on a threshold value, we classified set D’s ICC as “good-excellent” to represent the tiers the value straddles.
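A base-R sketch of the ICC variant described above (two-way model, consistency, average of k raters, i.e. McGraw & Wong's ICC(C,k)) is shown below. The layout assumption and function name are illustrative; in practice the irr package's icc() with model = "twoway", type = "consistency", and unit = "average" yields the same estimate along with a confidence interval.

```r
# Two-way, consistency, average-measures ICC -- McGraw & Wong's ICC(C,k).
# Assumed layout: rows = feedback comments, columns = raters,
# entries = 1-5 utility scores. Illustrative sketch only.
icc_c_k <- function(ratings) {
  x <- as.matrix(ratings)
  n <- nrow(x); k <- ncol(x)
  grand <- mean(x)
  rm <- rowMeans(x); cm <- colMeans(x)
  # Between-comment mean square from the two-way layout
  msr <- k * sum((rm - grand)^2) / (n - 1)
  # Residual mean square after removing comment and rater effects
  mse <- sum((x - outer(rm, cm, "+") + grand)^2) / ((n - 1) * (k - 1))
  (msr - mse) / msr
}
```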
Adjacent Agreement represents the percent of comments on which all three participants’ utility score ratings spanned a range of one or less (e.g., scores of 3/4/3, 2/2/2, and 5/5/4). As with ICC, this took into account only the independent ratings of participants.
Original Mean Utility Score refers to the mean of all utility score ratings provided by the original raters’ group consensus for a given set.
Mean utility scores are reported as the mean of all ratings within a given feedback set, followed by the corresponding standard deviation.
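The adjacent-agreement and mean ± SD summaries above are likewise straightforward to reproduce; a short sketch under the same assumed layout:

```r
# Share of comments whose ratings span at most one point (e.g. 3/4/3 counts,
# 2/4/3 does not), reported as a percentage
adjacent_agreement <- function(ratings) {
  100 * mean(apply(as.matrix(ratings), 1, function(x) diff(range(x)) <= 1))
}

# Mean utility score with standard deviation across all ratings in a set
mean_sd <- function(ratings) {
  v <- as.vector(as.matrix(ratings))
  sprintf("%.2f ± %.2f", mean(v), sd(v))
}
```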