
The construct and criterion validity of the multi-source feedback process to assess physician performance: a meta-analysis.

Ahmed Al Ansari¹, Tyrone Donnon², Khalid Al Khalifa¹, Abdulla Darwish³, Claudio Violato⁴.

Abstract

BACKGROUND: The purpose of this study was to conduct a meta-analysis on the construct and criterion validity of multi-source feedback (MSF) to assess physicians and surgeons in practice.
METHODS: In this study, we followed the guidelines for the reporting of observational studies included in a meta-analysis. In addition to PubMed and MEDLINE databases, the CINAHL, EMBASE, and PsycINFO databases were searched from January 1975 to November 2012. All articles listed in the references of the MSF studies were reviewed to ensure that all relevant publications were identified. All 35 articles were independently coded by two authors (AA, TD), and any discrepancies (eg, effect size calculations) were reviewed by the other authors (KA, AD, CV).
RESULTS: Physician/surgeon performance measures from 35 studies were identified. A random-effects model of weighted mean effect size differences (d) resulted in: construct validity coefficients for the MSF system on physician/surgeon performance across different levels in practice ranged from d=0.14 (95% confidence interval [CI] 0.40-0.69) to d=1.78 (95% CI 1.20-2.30); construct validity coefficients for the MSF on physician/surgeon performance on two different occasions ranged from d=0.23 (95% CI 0.13-0.33) to d=0.90 (95% CI 0.74-1.10); concurrent validity coefficients for the MSF based on differences in assessor group ratings ranged from d=0.50 (95% CI 0.47-0.52) to d=0.57 (95% CI 0.55-0.60); and predictive validity coefficients for the MSF on physician/surgeon performance across different standardized measures ranged from d=1.28 (95% CI 1.16-1.41) to d=1.43 (95% CI 0.87-2.00).
CONCLUSION: The construct and criterion validity of the MSF system is supported by small to large effect size differences based on the MSF process and physician/surgeon performance across different clinical and nonclinical domain measures.


Keywords: clinical performance; construct validity; criterion validity; meta-analysis; multi-source feedback system

Year: 2014 PMID: 24600300 PMCID: PMC3942110 DOI: 10.2147/AMEP.S57236

Source DB: PubMed Journal: Adv Med Educ Pract ISSN: 1179-7258

Introduction

One of the most widely recognized methods used to evaluate physicians and surgeons in practice is multi-source feedback (MSF), also referred to as a 360-degree assessment, where different assessor groups (eg, peers, patients, coworkers) rate doctors’ clinical and nonclinical performance.1 Use of MSF has been shown to be a unique form of evaluation that provides more valuable information than any single feedback source.1 MSF has gained widespread acceptance for both formative and summative assessment of professionals, and is seen as a trigger for reflecting on where changes in practice are required.2,3 Certain characteristics of health professionals have been assessed using MSF, including their professionalism, communication, interpersonal relationships, and clinical and procedural skills competence.4 One of the main benefits of MSF is that it provides physicians and surgeons with information about their clinical practice that may help them in improving and monitoring their performance.5 The number of published studies on the use of MSF to assess health professionals in clinical practice has increased substantially. In a recent systematic review studying the impact of workplace-based assessment of doctors’ education and performance, Miller and Archer6 reported evidence of support for use of MSF in that it has the potential to lead to improvement in clinical performance. Risucci et al7 demonstrated concurrent validity for MSF in surgical residents by showing a medium effect size correlation coefficient between MSF scores and American Board of Surgery In-Training Examination (ABSITE) scores. When using MSF with residents at different levels in their program, Archer et al8 showed modest increases in the performance of year 4 in comparison with year 2 trainees, thereby demonstrating the construct validity of this approach to assessment. 
Violato et al9 compared changes in physician performance from time 1 to time 2 (a 5-year interval) using total scores given by medical colleagues and coworkers on the MSF questionnaire and demonstrated a significant improvement in performance over time. Although MSF has been used in a variety of contexts, the research focus varies among measures across years in programs, differences between assessor groups, and comparisons with other assessment methods, so the validity of MSF needs to be investigated further. The main purpose of this study was to conduct a meta-analysis by identifying all published empirical data on the use of MSF to assess physicians’ clinical and nonclinical performance. We conducted a meta-analysis on the construct and criterion (predictive or concurrent) validity of the MSF system as a function of the summary effect sizes, their 95% confidence intervals (CIs), and interpretation of the magnitude of these coefficients.

Materials and methods

Selection of studies

In the present study, we followed the guidelines for reporting of observational studies included in a meta-analysis.10 In addition to PubMed and MEDLINE, the CINAHL, EMBASE, and PsycINFO databases were searched from January 1975 to November 2012. We also manually searched the reference lists for further relevant studies. The following terms were used in the search: “multi-source feedback”, “360-degree evaluation”, and “assessment of medical professionalism”. Studies were included if: they used at least one MSF instrument (eg, self, colleague, coworker, and/or patient) to assess physician/surgeon performance in practice; they described the MSF instrument or its design; they described factors measured by the MSF instrument; they provided evidence of construct-related and/or criterion-related validity (predictive/concurrent); and they were published in an English-language, peer-reviewed journal. The main reason for restricting the search to refereed journals was to ensure that only studies of high quality were included in the meta-analysis. We excluded studies if they used nonmedical health professionals, did not provide a description or breakdown of what the MSF instrument was measuring, did not provide empirical data on MSF results, reported data on feasibility and/or reliability only, and/or focused on performance changes after receiving MSF feedback.

Data extraction

The initial search yielded 1,137 papers, as shown in Figure 1. Of these, 623 papers were excluded based on the title, 292 were excluded based on a review of the abstract, 97 were removed as they were duplicates, and a further 90 were eliminated after a review of the full-text versions. Finally, we agreed on a total of 35 papers to be included for meta-analysis. A coding protocol was developed that included each study’s title, author(s) name(s), year of publication, source of publication, study design (ie, construct or criterion validity study), physician/surgeon specialty (eg, general practice, pediatrics), and types of raters (ie, self, medical colleague, consultants, patients, and coworkers). All 35 articles were independently coded by two authors (AA and TD) and any discrepancies (eg, effect size calculations) were reviewed by a third author (KA, AD, or CV). Based on iterative reviews and discussions between the five coders, we were able to achieve 100% agreement on all coded data.
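The screening counts reported above can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable and function names are illustrative and do not come from the study.

```python
# Screening counts as reported in the Data extraction paragraph.
initial_records = 1137
excluded = {
    "title": 623,
    "abstract": 292,
    "duplicate": 97,
    "full_text": 90,
}


def remaining_after_screening(initial, exclusions):
    """Return the number of records left after all exclusions."""
    return initial - sum(exclusions.values())


included = remaining_after_screening(initial_records, excluded)
print(included)  # 35
```

The result matches the 35 papers retained for the meta-analysis.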

Figure 1

Selection of studies for the meta-analysis.

Statistical analysis

The statistical analysis of all effect size calculations was done using the Comprehensive Meta-Analysis software program (version 1.0.23, Biostat Inc, Englewood, NJ, USA). Most of the studies reported mean differences (Cohen’s d) between MSF scores as effect size measures. However, there were some studies that reported the Pearson’s product-moment correlation coefficient (r). For these studies, and in order to preserve consistency in the data that were reported, r was converted to Cohen’s d using the following formula: d = 2r/√(1 − r²).11 We selected MSF domains or subscale measures as the variables of interest and either contrasted these scores between assessor groups (eg, different personnel ratings, in-training year, or postgraduate year of practice) or with other measures of clinical performance competencies (eg, ABSITE or Objective Structured Clinical Examination [OSCE]). To combine results from studies that used different research designs (eg, different physician year in practice) or different personnel ratings (eg, medical colleagues, coworkers, patients) and methods of analysis between assessor groups (eg, MSF in comparison with ABSITE, as well as an objective structured practical examination [OSPE]), we used a random-effects model for the unweighted and weighted effect sizes. The fixed-effects model assumes that the summary effect size differences are the same from study to study (eg, use of MSF with different questionnaires). In contrast, the random-effects model calculation reflects a more conservative estimate of the between-study variance of the participants’ performance measures.12 In this meta-analysis, residents in different years of rotation and the attending physicians/surgeons were treated equally in that they represent treating physicians at different stages of their year of practice.
Therefore, we are evaluating the performance of these ‘physicians/surgeons’ who had a more or less similar trajectory in achieving clinical competency as a function of their performance by using the multi-source feedback system. To assess the heterogeneity of effect sizes, forest plots were produced and Cochran Q tests were conducted. Absence of a significant P-value for Q may indicate low power within studies rather than actual consistency or homogeneity across the studies included in the meta-analysis. In addition, the distribution of the studies in the forest plots was an important visual indicator of the consistency between studies. Interpretation of the magnitude of the effect size for both mean differences and correlations is based on Cohen’s13 suggestions, ie, d=0.20–0.49 is “small”, d=0.50–0.79 is “medium”, and d≥0.80 is considered to be a “large” effect size difference.
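As an illustration of the conversion and interpretation rules described in this section, the following minimal sketch converts a correlation r to Cohen’s d and labels the result using the quoted cut-offs. The function names are hypothetical, and two choices are assumptions rather than anything stated in the text: the label is applied to the absolute value of d, and values below 0.20 are called “below small”.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a Pearson correlation r to Cohen's d via d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


def magnitude_label(d: float) -> str:
    """Label |d| using the cut-offs quoted in the text (after Cohen)."""
    size = abs(d)  # assumption: the sign only indicates direction
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "below small"  # hypothetical label; the text does not name this range


# Worked example: r = 0.6 gives d = 1.2 / 0.8 = 1.5, a "large" effect.
d = r_to_d(0.6)
label = magnitude_label(d)
```

Because the denominator shrinks as r approaches 1, moderate correlations already map to large values of d under this formula.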

Results

The characteristics of the 35 studies included in the meta-analysis were based on four groups (Table 1) that reported contrasts between different physician years in practice (group A), differences between physician performance levels on two occasions (group B), rating differences between self, medical colleague, coworker, and patients (group C), and comparisons between MSF and other measures of performance (group D). The reported MSF domain measure (ie, items 1 through 5) and the corresponding unweighted effect sizes based on either the contrast or comparison variables are presented in Table 1. Different approaches to testing the validity of MSF were demonstrated by studies included in this meta-analysis. In groups A and B, we investigated the construct validity of the domains’ measures of MSF by showing that physicians at different levels of experience or on two separate occasions tend to obtain higher clinical performance scores. In groups C and D, the criterion validity of MSF is compared with other similar assessments of clinical performance or different raters as either a concurrent or predictive validity measure.

Table 1

Characteristics of MSF studies with construct and criterion (concurrent/predictive) validity effect size measures

Study source	Group	Contrast†	MSF domain*	Effect size difference (d_UWM^‡)
Archer et al20; sample size, 112 pediatrics (20 specialist registrars, 92 senior house officers); total forms = 921	A	SPRS (MC)/SHO (MC)	2 and 5	1.22
Brinkman et al19; sample size, 36 pediatric residents (16 with feedback and 16 with no feedback); total forms = 1,263	A	Feedback (MC)/No-feedback (MC)	1, 2, and 3	1.8
Massagli and Carline21; sample size, 56 rehabilitation residents (nine PGY2, nine PGY3, nine PGY4); total forms = 930	A	PGY2/PGY3; PGY2/PGY4; PGY3/PGY4	1, 2, 4, and 5; 1, 2, 4, and 5; 1, 2, 4, and 5	0.05; 0.17; 0.23
Archer et al8; sample size, 553 multiple specialties residents (219 Foundation year 1, 334 Foundation year 2); total forms = 5,544	A	Foundation year 1 (MC)/Foundation year 2 (MC)	2 and 5	0.34
Archer et al15; sample size, 577 pediatric (343 SPRS year 2, 201 SPRS year 4, 10 pediatricians in years 1, 3, 5, 6); total forms = 4,770	A	SPRS year 2 (MC)/SPRS year 4 (MC)	2 and 5	0.29
Wood et al18; sample size, 67 obstetrics and gynecology residents; total forms = 578	B	ObGyn time 1/ObGyn time 2	4 and 5	2.41
Lockyer et al22; sample size, 250 family physicians; total forms = 500	B	Phys time 1/Phys time 2 (Self)	1, 2, 3, and 4	0.46
Brinkman et al19; sample size, 36 pediatric residents; total forms = 1,263	B	Nurse time 1 (CW)/Nurse time 2 (CW); (Parents) time 1/(Parents) time 2	1 and 2; 1 and 2	1.31; 2.00
Violato et al9; sample size, 250 family physicians; total forms = 20,500	B	Phys time 1/Phys time 2 (MC); Phys time 1/Phys time 2 (CW); Phys time 1/Phys time 2 (Patients)	1, 2, and 5; 1 and 3; 1, 3, and 4	0.66; 0.22; 0.01
Risucci et al7; sample size, 32 surgical residents; total forms = 1,024	C	Self/Peer (MC); Self/Supervisors (MC); Peer (MC)/Supervisors (MC)	1, 2, and 5; 1, 2, and 5; 1, 2, and 5	0.56; 0.21; 0.25
Wenrich et al41; sample size, 318 internal medicine physicians; total forms = 1,877	C	Nurse (CW)/Phys (MC) medical knowledge; Nurse (CW)/Phys (MC) humanistic	2 and 5; 2 and 5	0.51; −0.46
Lelliott et al42; sample size, 347 psychiatrists; total forms = 11,426	C	Self/MC; Patients/MC	2, 3, and 5; 2, 3, and 5	0.47; 0.85
Violato et al43; sample size, 28 family physicians; total forms = 170	C	Self/MC; Self/Patients; Self/CW	1, 2, 4, and 5; 1, 2, 4, and 5; 1, 2, 3, and 5	0.58; 0.95; 0.77
Hall et al3; sample size, 295 multiple specialties physicians; total forms = 11,665	C	Self/Patients; Self/MC; Self/Consultant (MC); Self/Referring physicians (MC); Self/CW; Consultant (MC)/MC; Consultant (MC)/CW	1, 2, 3, 4, and 5; 1, 2, and 5; 1, 2, and 5; 1, 2, and 5; 1, 2, 3, and 5; 1, 2, and 5; 1, 2, 3, and 5	1.30; 0.37; 0.80; 1.18; 0.76; 0.46; 0.18
Thomas et al44; sample size, 16 internal medicine residents; total forms = 177	C	MC (Intern)/MC; MC (Intern)/CW; MC/CW	2 and 5; 2 and 5; 2 and 5	0.41; 1.06; 0.65
Lipner et al45; sample size, 356 internal medicine physicians; total forms = 12,460	C	MC/Patients	1, 2, and 3	2.60
Violato et al5; sample size, 252 surgeons; total forms = 7,237	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1, 2, 3, and 5; 1, 2, 3, and 5; 1, 2, 3, 4, and 5; 1, 2, 3, and 5; 1, 2, 3, 4, and 5; 3, 4, and 5	0.62; 0.61; 0.58; 0.00; 0.00; 0.00
Wood et al27; sample size, 7 radiology residents; total forms = 57	C	Patients/MC; Patients/CW; MC/CW	1 and 3; 1 and 3; 1 and 3	0.98; 1.31; 0.04
Joshi et al46; sample size, 8 obstetrics/gynecology residents; total forms = 512	C	MC/CW; MC/Patients; CW/Patients	3 and 5; 3 and 5; 3 and 5	1.34; 0.43; 0.97
Lockyer et al47; sample size, 197 anesthesiology physicians; total forms = 5,957	C	MC/Patients	1, 2, and 3	0.06
Violato et al48; sample size, 100 pediatric physicians; total forms = 3,963	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1, 2, and 3; 1, 2, 3, and 5; 1, 2, 3, and 4; 1, 2, 3, and 5; 1, 2, 3, and 4; 1, 3, 4, and 5	0.04; 0.18; 0.07; 0.97; 0.79; 0.26
Violato et al32; sample size, 101 psychiatry physicians; total forms = 4,069	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1, 2, and 4; 1, 2, 3, 4, and 5; 1, 2, 3, and 4; 1, 2, 3, and 5; 1, 2, 3, and 4; 1, 3, 4, and 5	0.83; 1.52; 1.13; 0.68; 0.28; 0.40
Archer et al8; sample size, 553 multiple specialties residents; total forms = 5,544	C	(Consultant) MC/(Resident) MC	2 and 5	0.37
Pollock et al14; sample size, 6 plastic surgery residents; total forms = 240	C	CW/MC	1, 2, 3, 4, and 5	0.87
Davies et al40; sample size, 92 histopathology residents; total forms = 1,012	C	Consultant (MC)/CW	2 and 4	0.98
Campbell et al33; sample size, 291 multiple specialties physicians; total forms = 18,023	C	Patients/MC	1, 2, 3, and 5	0.19
Meng et al34; sample size, 15 anesthesiology residents; total forms = 429	C	Nurse (CW)/Secretaries (CW); Nurse (CW)/Nurse aids (CW); Nurse (CW)/Technicians (CW); Secretaries (CW)/Nurse aids (CW); Secretaries (CW)/Technicians (CW); Nurse aids (CW)/Technicians (CW)	1, 3, and 5; 1, 3, and 5; 1, 3, and 5; 1, 3, and 5; 1, 3, and 5; 1, 3, and 5	0.16; 0.64; 0.65; 0.16; 0.46; 0.00
Lockyer et al35; sample size, 101 pathologists/laboratory physicians; total forms = 808	C	Self/MC; Self/Referring physicians (MC); Self/CW; MC/Referring physicians (MC); MC/CW; Referring physicians (MC)/CW	1, 2, and 5; 1, 2, 4, and 5; 1, 2, 3, and 5; 1, 2, 4, and 5; 1, 2, 3, and 5; 1, 2, 3, and 4	0.22; 0.58; 0.18; 0.38; 0.03; 0.40
Lockyer et al36; sample size, 187 emergency medicine physicians; total forms = 6,889	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1, 2, and 4; 1, 2, 4, and 5; 1, 2, 3, 4, and 5; 1, 2, 4, and 5; 1, 2, 3, 4, and 5; 1, 2, 3, and 5	0.78; 0.93; 1.13; 0.43; 0.63; 0.17
Archer et al15; sample size, 577 pediatric residents; total forms = 4,770	C	Consultant (MC)/Resident (MC)	2 and 5	0.64
Chandler et al16; sample size, 66 pediatrics residents; total forms = 823	C	Self/Attending (MC); Self/CW; Self/Patients; Attending (MC)/CW; Attending (MC)/Patients; CW/Patients	3 and 5; 3 and 5; 3 and 5; 3 and 5; 3 and 5; 3 and 5	0.87; 1.10; 0.08; 0.26; 0.30; 0.45
Campbell et al17; sample size, 179 family physicians; total forms = 10,895	C	Patients/MC	1, 2, 3, and 5	0.02
Archer and McAvoy37; sample size, 68 different specialties physicians; total forms = 2,365	C	Patients/MC; Assessor nominated by physicians/assessors nominated by referring body	2 and 5; 2 and 5	1.90; 1.91
Overeem et al38; sample size, 146 multiple specialties physicians; total forms = 3,648	C	MC/Patients; MC/CW; CW/Patients	1, 2, 3, 4, and 5; 1, 2, 3, and 4; 1, 2, 3, and 5	0.44; 0.75; 0.45
Lockyer et al39; sample size, 216 surgeons; total forms = 9,072	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1, 2, 3, and 4; 1, 2, and 3; 1, 2, 3, 4, and 5; 1, 2, 3, and 4; 1, 2, 3, 4, and 5; 3, 4, and 5	1.11; 0.86; 1.00; 0.44; 0.30; 0.21
Qu et al23; sample size, 258 multiple specialties residents; total forms = 4,128	C	Self/Attending (MC); Self/MC; Self/CW; Self/Patients; Self/Office staff (CW); Attending (MC)/MC; Attending (MC)/CW; Attending (MC)/Patients; Attending (MC)/Office staff (CW); Patients/Office staff (CW); Patients/MC; Patients/CW	1 and 3; 1 and 3; 1 and 3; 1, 2, 3, 4, and 5; 1 and 3; 1 and 3; 1 and 3; 1, 2, 3, 4, and 5; 1 and 3; 1, 2, 3, 4, and 5; 1, 2, 3, 4, and 5; 1, 2, 3, 4, and 5	0.30; 0.13; −0.55; 0.19; 1.78; 0.08; 0.82; 0.38; 2.31; 1.87; 0.37; 0.42
Lockyer et al49; sample size, 37 general practice physicians; total forms = 1,130	C	Self/MC; Self/CW; Self/Patients; MC/CW; MC/Patients; CW/Patients	1 and 2; 1, 2, and 3; 1, 2, 3, and 4; 1, 2, and 3; 1, 2, 3, and 4; 1, 3, and 4	0.22; 0.05; 0.04; 0.22; 0.21; 0.00
Risucci et al7; sample size, 32 surgical residents; total forms = 1,024	D	MSF/ABSITE	1, 2, and 5	1.45
Wood et al27; sample size, 7 radiology residents; total forms = 57	D	MSF (PT)/global examination; MSF (MC)/global examination; MSF (CW)/global examination	1 and 3; 1 and 3; 1 and 3	1.96; 1.02; 1.60
Davies et al40; sample size, 92 histopathology residents; total forms = 1,012	D	MSF (PATH-SPRAT)/OSPE	2 and 3	1.09
Yang et al24; sample size, 245 multiple specialties residents; total forms = 1,053	D	MSF/small scale OSCE; MSF/small scale OSCE + DOPS	1, 2, and 3; 1, 2, and 3	0.79; 2.07

Notes:

A, construct validity (physicians at different year levels); B, construct validity (physicians’ performance on MSF on two occasions separated by time); C, concurrent validity (differences in personnel ratings); D, predictive validity (comparing MSF with standardized measures).

MSF domains consist of the following: 1= professionalism, covering psychosocial skills, psychosocial management, humanistic qualities, compassion, attitude, professional development, teaching, professional responsibilities, and professional management; 2= clinical competence, covering clinical care, good medical practice, patient care, safe practice, clinical performance, knowledge, critical thinking, diagnosis, and management of complex problems; 3= communication, covering communication with staff and interpersonal communication skills; 4= management, covering reporting, self-management, administrative skills, office personnel, access to doctor, practice process, physical office, and physical space; and 5= interpersonal relationships, covering relationships with patients, colleagues, family members, collegiality, collaboration, patient education, information provision, and patient interaction. Two of the authors (AA, TD) agreed on the names of the five main domains and on the items included. d‡ refers to the unweighted mean effect size difference as defined by Cohen’s d.

Abbreviations: CW, coworkers; MC, medical colleagues; MSF, multi-source feedback; PGY, postgraduate year; SPRS, specialist registrar; Phys, family physician; ObGyn, obstetrics and gynecology; ABSITE, American Board of Surgery In-Training Examination; PATH-SPRAT, Pathology-Sheffield Peer Review Assessment Tool; OSPE, Objective Structured Practical Examination; OSCE, Objective Structured Clinical Examination; DOPS, Direct Observation of Procedural Skills; SHO, senior house officer; PT, patients.

The sample sizes of the studies ranged from six plastic surgery residents14 to 577 pediatric residents15 who had been assessed using MSF with as few as 1.2 patients and 2.6 medical colleagues16 and as many as 47.3 patients completing forms per individual.17 Questionnaire items used as part of MSF ranged from as few as four items18 to as many as 60 items14 per questionnaire. Information on specific demographic characteristics, such as participants’ sex or age, was not reported, but level of training and years of practice as a physician were typically identified. In each study, the unweighted mean effect size difference (Cohen’s d) was provided or calculated based on the MSF domain measures as a contrasting variable (eg, years spent as a physician in practice) or with a comparison measure (eg, OSPE).
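Where a study reports group means and standard deviations rather than an effect size, Cohen’s d can be calculated from them. The sketch below uses the common pooled-standard-deviation form; the text does not state which variant was applied to each study, so this is an assumption, and the function name and input values are hypothetical.

```python
import math


def cohens_d(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


# Hypothetical ratings on a 5-point scale from two assessor groups.
d = cohens_d(mean_1=4.5, mean_2=4.0, sd_1=0.5, sd_2=0.5, n_1=20, n_2=20)
```

With equal standard deviations of 0.5, a half-point difference in mean ratings corresponds to d = 1.0.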

Construct validity of MSF system

Of the 35 studies that reported data on physician/surgeon performance, 31 (88%) demonstrated results in support of the construct validity of the MSF system. As shown in Table 2, we combined five of the studies (group A) to show that for each of the five MSF domains the effect size differences in performance between a year of practice (eg, change in performance as a function of post-graduate year 1 to year 2, Senior House Officer to Specialist Registrar)8,15,19–21 ranged from d=0.14 (95% CI 0.40–0.69) for manager skills to d=1.78 (95% CI 1.20–2.30) for communication skills.

Table 2

Random effects model (Cohen’s d) of the MSF domains with different physician years (group A)/different physician performance on two occasions (group B)

MSF domain measure	Studies included (number of outcomes)	Sample size	MSF with different physician years*	Studies included (number of outcomes)	Sample size	Difference between physicians’ performance on two occasions**
Professional	2 (4)	126	0.56 (0.39–1.59)	3 (6)	1,054	0.65 (0.30–1.00)
Clinical competence	5 (7)	1,335	0.62 (0.25–1.00)	3 (4)	554	0.99 (0.53–1.45)
Communication	1 (1)	72	1.78 (1.22–2.34)	2 (3)	750	0.23 (0.02–0.48)
Manager	1 (3)	54	0.14 (0.40–0.69)	3 (3)	567	0.92 (0.01–1.84)
Interpersonal relationships	4 (6)	1,263	0.42 (0.16–0.67)	2 (2)	317	1.50 (0.19–3.22)

Notes:

*Effect sizes combined for physicians in different year levels (different PGY level, eg, year 1, year 2, senior house officer, specialist registrar);8,15,19–21

**effect sizes combined for physicians’ performance on two occasions separated by time (eg, 5 years, 7 months, 7 years).9,18,19,22

Abbreviations: MSF, multi-source feedback; PGY, post graduate year.

When differences between physician/surgeon performance were investigated on two different occasions, we found four studies (group B) that showed differences in clinical performance across the five domain scores of MSF. In particular, Brinkman et al19 compared ratings for 36 pediatric residents on two occasions with regard to the professionalism and communication skills domains, and their results showed that there were consistently large effect size differences between time 1 and time 2. The ratings on these MSF items ranged from d=1.31 for the professionalism domain to d=2.00 for the communication skills domain. Correspondingly, Lockyer et al22 found a range of MSF scores that varied from d=0.01 for physicians over a 5-year period on the professionalism, communication skills, and management domains for self-rating assessment to d=0.66 with the same physicians over the professionalism, communication skills, and interpersonal relationship domains as rated by medical colleagues. Violato et al9 reported a small effect size of d=0.46 when the performance of 250 family physicians was compared after a 5-year interval between MSF assessments.

Criterion (predictive/concurrent) validity of the MSF system

In group C, we combined the outcomes in 21 (60%) studies that investigated the differences in MSF scores provided by different raters (eg, residents, self, medical colleague, coworker, patients) across the five domains identified. Effect size differences in performance between the different raters (eg, comparison of patients with self assessment, medical colleagues to coworkers) ranged from d=0.50 (95% CI 0.47–0.52) for interpersonal relationships to d=0.57 (95% CI 0.55–0.60) for both professionalism and clinical competence. Most of the studies in group C showed that physicians consistently rated themselves lower than did other assessor groups. However, in a study of 258 residents within different specialties reported by Qu et al, residents on self-assessments rated themselves higher than did other raters.23 As shown in the forest plot (Figure 2), the combined random-effects size calculation for the professionalism domain was “medium” (d=0.66, 95% CI 0.44–0.69).

Figure 2

Random and fixed effects model forest plots for the MSF “personnel rating differences” for professional measures.

Notes: *The effect size values are taken from the raw data reported for the outcomes in studies group C. The Cochran Q-test for heterogeneity shows significant overall heterogeneity between studies.

Abbreviations: MSF, multi-source feedback; Pt, patients; MC, medical colleagues; Const, consultant; CW, co-workers; RefPhys, referring physicians; Nu, nursing; NuA, nursing aid; Sec, secretary; Tech, technicians; Officstaff, office staff; Attend, attending.

In group D (Table 3), of the 35 studies included in the meta-analysis, four reported data on physician/surgeon performance on MSF in comparison with other criterion measures (eg, OSPE, OSCE). The mean effect size differences were found to be “medium” to “large” across the domains identified on MSF. Effect size differences in performance between domain scores and other examination measurement scores ranged from d=1.28 (95% CI 1.15–1.41) for clinical competence to d=1.43 (95% CI 0.87–2.00) for interpersonal relationships. Yang et al24 found a range of MSF scores that varied from d=0.79 for residents on the domains of professionalism, clinical competence, and communication skills to d=2.07 with the same physicians on the same domains when their MSF scores were compared with other clinical performance measures such as the OSCE.

Table 3

Random effects model (Cohen’s d) of the MSF domains with personnel ratings/academic performance (groups C and D)

MSF domain measure	Studies included (number of outcomes)	Sample size	Personnel rating differences*	Studies included (number of outcomes)	Sample size	MSF with different global measurement**
Professional	19 (82)	12,415	0.56 (0.44–0.67)	3 (6)	543	1.42 (0.72–2.12)
Clinical competence	24 (75)	12,720	0.60 (0.49–0.72)	3 (4)	614	1.34 (0.65–2.05)
Communication	20 (76)	11,280	0.56 (0.42–0.67)	3 (6)	603	1.35 (0.71–1.99)
Manager	13 (38)	6,089	0.60 (0.45–0.74)	–	–	–
Interpersonal relationships	23 (74)	11,660	0.54 (0.44–0.64)	1 (1)	32	1.43 (0.87–2.00)

Notes:

*Effect sizes combined between differences in personnel ratings (eg, resident versus faculty, specialist versus consultant);3,5,7,8,14–17,23,27,32–39,41–49

**effect sizes combined between MSF with standardized measures (eg, global ratings, OSPE).7,24,27,40

Abbreviations: MSF, multi-source feedback; OSPE, Objective Structured Practical Examination.

Although the Cochran Q test shows significant heterogeneity between the studies included in the four groups, a subgroup analysis to determine the potential differences as a result of moderator variables such as physician/surgeon sex or age was limited by the data reported across the primary studies included in the meta-analysis. Nevertheless, the studies were weighted by their respective sample sizes, and the random-effects model analysis (with wider 95% CIs) provides a more conservative estimate of the combined effect sizes, as illustrated by a forest plot (Figure 2).
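To illustrate why a random-effects analysis yields wider confidence intervals when studies are heterogeneous, the following sketch pools effect sizes with the DerSimonian-Laird estimator and reports the Cochran Q statistic. This is an assumption for illustration only: the analysis itself was run in the Comprehensive Meta-Analysis program, and the text does not state which estimator that software applied. The function name and the input values are hypothetical, not data from the included studies.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of effect sizes.

    Returns (pooled_d, ci_low, ci_high, q, tau2), where q is the Cochran Q
    statistic and tau2 is the estimated between-study variance.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sw
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))  # Cochran Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)  # truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    sw_re = sum(w_re)
    pooled = sum(wi * di for wi, di in zip(w_re, effects)) / sw_re
    se = math.sqrt(1.0 / sw_re)
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, q, tau2


# Hypothetical inputs: three effect sizes with equal sampling variances.
pooled, ci_low, ci_high, q, tau2 = random_effects_pool(
    effects=[0.2, 0.5, 0.8],
    variances=[0.04, 0.04, 0.04],
)
```

In this example Q is 4.5 on 2 degrees of freedom and the between-study variance is estimated at 0.05, so the random-effects standard error (about 0.17) exceeds the fixed-effect standard error (about 0.12). That widening is what makes the random-effects estimate the more conservative one.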

Discussion

In this meta-analysis, the MSF demonstrates evidence of construct validity when used with physicians and surgeons across the years of a residency program or a number of years of practice. Physician/surgeon performance on the MSF domains across a single year of practice showed “small” to “large” effect size differences, with effect sizes ranging from d=0.14 (95% CI 0.40–0.69) in the manager skills domain to d=1.78 (95% CI 1.20–2.30) in the communication skills domain. The effect size differences between physician/surgeon performance on two occasions (time 1/time 2) ranged from d=0.23 (95% CI 0.13–0.33) for the communication skills domain to d=0.90 (95% CI 0.74–1.10) for the interpersonal relationship domain measure. The differences in rating for physician/surgeon performance on MSF between different assessor groups (self-assessments, medical colleagues, consultants, patients, and coworkers) showed “medium” effect size differences that ranged from d=0.50 (95% CI 0.47–0.52) for the interpersonal relationship domain to d=0.57 (95% CI 0.55–0.60) for the professionalism and clinical competence domains. In particular, these results were supported by the findings from other assessment methods such as the mini-clinical evaluation exercise (mini-CEX). Ratings with different raters in the mini-CEX have shown that, in comparison with faculty evaluator ratings, residents tend to be more lenient and score trainees higher on in-training evaluation checklists.25,26 In our study of the MSF, we found that physicians and surgeons consistently rated themselves lower than did other assessor groups.23 In addition, patients and coworkers typically rated physicians/surgeons more leniently than did other raters, such as medical colleagues or consultants. The MSF showed evidence of criterion-related validity when compared with other performance examination measures (eg, global examination, OSPE, OSCE).
We found a “large” correlation coefficient, with combined effect sizes ranging from d=1.28 (95% CI 1.15–1.41) for the communication skills domain to d=1.43 (95% CI 0.87–2.00) for the interpersonal relationship domain. The construct-related and criterion-related validity of MSF was supported by the findings outlined within the studies included in one or more of the four group comparisons. As illustrated in the forest plots for the professionalism domain in group C, not all of the reported differences between personnel ratings were found to be statistically significant. When combined with the outcomes from 19 different studies, however, we found that there was a significant combined random-effects size of d=0.65 (95% CI, 0.44–0.69). In general, the findings of this meta-analysis show “medium” combined effect sizes for the construct-related and criterion-related validity of the five main MSF domains identified. Although different questionnaires and different numbers of items were used in MSF across different specialties, they were found to consistently measure similar domains of physician/surgeon performance.15 This feedback process, using multiple questionnaires with different types of raters, provides a more comprehensive evaluation of clinical practice than can typically be provided by one or a few sources.1

Strengths and weaknesses of the study

There are limitations to this meta-analysis. Because we were interested in determining the construct-related and criterion-related validity of MSF as a method for physician/surgeon evaluation, consistency in the use of the evaluation tool varied from a research design perspective. In addition, there was variability in the performance domains measured and in the number of items used to measure each domain depending on the MSF instrument used (ie, ranging from four items to 60 items), the raters used (ie, self, patients, medical colleague, coworker), and whether or not the MSF was being compared with other clinical skill measures (eg, OSCE). To overcome this limitation, the more conservative random-effects size analysis was performed to accommodate the heterogeneity between the studies, as indicated by the significant values obtained using the Cochran Q test. Nevertheless, we were unable to undertake subsequent subgroup analyses to determine where there may have been between-study differences because the relevant data (eg, sex, age of participant) were rarely reported. Although some of the studies had small sample sizes, such as six14 and seven participants,27 this was in part compensated for by the 40 and eight raters, respectively, who completed the questionnaire on each of the participants in these studies. To achieve some control over the quality of the studies that were included in this meta-analysis, only papers that had been published in refereed journals were selected.
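To illustrate how a random-effects analysis accommodates heterogeneity flagged by the Cochran Q test, the following is a minimal sketch. It assumes the DerSimonian-Laird estimator of between-study variance and a normal-approximation 95% confidence interval; the text above does not state which estimator was used, and the function name `random_effects_pool` and the input values are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study effect sizes with a DerSimonian-Laird random-effects model.

    effects   -- per-study effect sizes (eg, Cohen's d)
    variances -- per-study sampling variances of those effect sizes
    Returns the pooled effect, its 95% CI, Cochran's Q, its degrees of
    freedom, and the estimated between-study variance tau^2.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))

    # DerSimonian-Laird moment estimate of tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)

    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "q": q,
        "df": k - 1,
        "tau2": tau2,
    }


# Hypothetical effect sizes and variances, for illustration only.
result = random_effects_pool([0.30, 0.55, 0.90, 0.62], [0.02, 0.01, 0.05, 0.03])
print(result)
```

When Q exceeds its degrees of freedom, tau^2 is positive, the study weights become more equal, and the confidence interval widens, which is why the random-effects estimate is the more conservative choice.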

Implications for clinicians and policymakers

Certain characteristics of health professionals, such as clinical skills, personal communication, and client management, combined with improved performance can be assessed using MSF.8 MSF is a unique form of assessment that has been shown to have both construct-related and criterion-related validity in assessing a multitude of clinical and nonclinical performance domains. In addition, MSF has been shown to enhance changes in clinical performance,15 communication skills,7 professionalism,7 teamwork,28 productivity,29 and the building of trusting relationships with patients.30 Consequently, MSF has been adopted and used extensively as a method for assessment of a variety of domains identified in medical education programs and by licensing bodies in the UK, Canada, Europe, and other countries. Although MSF has gained widespread acceptance, the literature has raised a number of concerns about its implementation and its validity. Therefore, evidence supporting the validity of the process and of the instruments used to date is crucial to enable policymakers to decide whether to implement MSF within their own programs or organizations.

Conclusion and future research

Although MSF appears to be adequate for assessment of a variety of nontechnical skills, this approach is limited by the ability of peers or medical colleagues to assess aspects of clinical skills competence that reflect physicians’/surgeons’ knowledge and non-cognitive behavior. In particular, as part of the process of assessing clinical performance, other methods such as procedures-based assessment or the OSCE should be used in conjunction with the peer MSF questionnaire to ensure accurate assessment of these specific skills. We are faced with the challenge of ensuring that use of MSF for assessment of physicians and surgeons in practice is reliable and valid. As shown above, MSF has proved to be a useful method for assessing the clinical and nonclinical skills of physicians/surgeons in practice, with clear evidence of construct and criterion-related validity. Although MSF is considered to be a useful assessment method, it should not be the only measure used to assess physicians and surgeons in practice. Other reliable and valid methods should be used in conjunction with MSF, in particular to assess procedural skills performance and to overcome the limitation of using a single measure.

Future research should replicate and extend some of the empirical findings, especially the evidence for criterion-related validity. Criterion-related validity studies looking at correlations between direct observations of behavior or performance and MSF scores are required to add further evidence of validity. Future research on the various MSF instruments available may well include confirmatory factor analysis, which provides stronger construct validity evidence than the principal component factor analyses conducted currently.31 In addition, MSF assessments are entirely questionnaire-based and rely on the judgment of and inference by the assessors and respondents, which are subject to a variety of biases and heuristics. Therefore, generalizability theory should be used in future studies to determine potential sources of measurement error that can arise from the use of different assessors and specialties, as well as from the characteristics of the respondents themselves.
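As an illustration of what such a generalizability analysis involves, the following is a minimal sketch of a one-facet G study. It assumes a fully crossed design in which every rater scores every physician with no missing ratings, which is a simplification, since MSF raters are usually nested within physicians. The function name `g_study` and the scores are hypothetical.

```python
def g_study(scores):
    """Estimate variance components for a crossed physician x rater design.

    scores[p][r] is the rating given to physician p by rater r.
    Returns the variance attributable to physicians, raters, and residual
    error, plus the relative generalizability coefficient for the mean
    over the n_r raters used.
    """
    n_p = len(scores)
    n_r = len(scores[0])
    if n_p < 2 or n_r < 2 or any(len(row) != n_r for row in scores):
        raise ValueError("need a complete table with >= 2 physicians and raters")

    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    # Sums of squares for the two-way layout without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions, truncated at zero.
    var_res = max(0.0, ms_res)
    var_p = max(0.0, (ms_p - ms_res) / n_r)
    var_r = max(0.0, (ms_r - ms_res) / n_p)

    denom = var_p + var_res / n_r
    g_coef = var_p / denom if denom > 0 else float("nan")
    return {"var_p": var_p, "var_r": var_r, "var_res": var_res, "g": g_coef}


# Hypothetical ratings: 4 physicians (rows) scored by 3 raters (columns).
ratings = [
    [4.0, 4.5, 3.5],
    [3.0, 3.5, 3.0],
    [4.5, 5.0, 4.0],
    [2.5, 3.5, 2.0],
]
print(g_study(ratings))
```

A large rater component relative to the physician component would indicate that differences between assessors, rather than differences between physicians, drive much of the score variation.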

42 in total

1. Changing physicians' practices: the effect of individual feedback.

Authors: H Fidler; J M Lockyer; J Toews; C Violato
Journal: Acad Med Date: 1999-06 Impact factor: 6.893

2. Assessment of physician performance in Alberta: the physician achievement review.

Authors: W Hall; C Violato; R Lewkonia; J Lockyer; H Fidler; J Toews; P Jennett; M Donoff; D Moores
Journal: CMAJ Date: 1999-07-13 Impact factor: 8.262

3. Evaluating professionalism and interpersonal and communication skills: implementing a 360-degree evaluation instrument in an anesthesiology residency program.

Authors: Li Meng; David G Metro; Rita M Patel
Journal: J Grad Med Educ Date: 2009-12

4. Specialty-specific multi-source feedback: assuring validity, informing training.

Authors: Helena Davies; Julian Archer; Adrian Bateman; Sandra Dewar; Jim Crossley; Janet Grant; Lesley Southgate
Journal: Med Educ Date: 2008-10 Impact factor: 6.251

5. Meta-analysis in clinical trials.

Authors: R DerSimonian; N Laird
Journal: Control Clin Trials Date: 1986-09

6. Ratings of surgical residents by self, supervisors and peers.

Authors: D A Risucci; A J Tortolani; R J Ward
Journal: Surg Gynecol Obstet Date: 1989-12

7. A study of a multi-source feedback system for international medical graduates holding defined licences.

Authors: Jocelyn Lockyer; David Blackmore; Herta Fidler; Rod Crutcher; Brian Salte; Karen Shaw; Bryan Ward; Norman Wolfish
Journal: Med Educ Date: 2006-04 Impact factor: 6.251

8. The assessment of emergency physicians by a regulatory authority.

Authors: Jocelyn M Lockyer; Claudio Violato; Herta Fidler
Journal: Acad Emerg Med Date: 2006-11-10 Impact factor: 3.451

9. Questionnaires for 360-degree assessment of consultant psychiatrists: development and psychometric properties.

Authors: Paul Lelliott; Richard Williams; Alex Mears; Manoharan Andiappan; Helen Owen; Paul Reading; Nick Coyle; Stephen Hunter
Journal: Br J Psychiatry Date: 2008-08 Impact factor: 9.319

10. Patient, faculty, and self-assessment of radiology resident performance: a 360-degree method of measuring professionalism and interpersonal/communication skills.

Authors: Jonathan Wood; Jannette Collins; Elizabeth S Burnside; Mark A Albanese; Pamela A Propeck; Frederick Kelcz; Jeannette M Spilde; Lisa M Schmaltz
Journal: Acad Radiol Date: 2004-08 Impact factor: 3.173

7 in total

1. Perceived Communication Skills Among Tertiary Care Physicians.

Authors: Ahmad S Alzahrani; Abdullah Alqahtani; Sayed Abdulkader; Motaz A Alluhabi; Rashed Alqabbas
Journal: Med Sci Educ Date: 2019-06-25

2. The Evaluation of Physicians' Communication Skills From Multiple Perspectives.

Authors: Jenni Burt; Gary Abel; Marc N Elliott; Natasha Elmore; Jennifer Newbould; Antoinette Davey; Nadia Llanwarne; Inocencio Maramba; Charlotte Paddison; John Campbell; Martin Roland
Journal: Ann Fam Med Date: 2018-07 Impact factor: 5.166

Review 3. Using Peer Feedback to Promote Clinical Excellence in Hospital Medicine.

Authors: Molly A Rosenthal; Bradley A Sharpe; Lawrence A Haber
Journal: J Gen Intern Med Date: 2020-09-21 Impact factor: 5.128

4. Nurses' evaluation of physicians' non-clinical performance in emergency departments: advantages, disadvantages and lessons learned.

Authors: Mohamad Alameddine; Afif Mufarrij; Miriam Saliba; Yara Mourad; Rima Jabbour; Eveline Hitti
Journal: BMC Health Serv Res Date: 2015-02-27 Impact factor: 2.655

5. Reliability of the interprofessional collaborator assessment rubric (ICAR) in multi source feedback (MSF) with post-graduate medical residents.

Authors: Mark F Hayward; Vernon Curran; Bryan Curtis; Henry Schulz; Sean Murphy
Journal: BMC Med Educ Date: 2014-12-31 Impact factor: 2.463

6. Lessons from surgery and anaesthesia: evaluation of non-technical skills in interventional radiology.

Authors: Chun L Pang; Salil B Patel; Nicola Pilkington
Journal: JRSM Open Date: 2015-11-03

7. Survey of physician attitudes to using multisource feedback for competence assessment in Alberta.

Authors: Nigel Ashworth; Nicole Allison Kain; Ed Jess; Karen Mazurek
Journal: BMJ Open Date: 2020-07-19 Impact factor: 2.692
