A Multifaceted Organizational Physician Assessment Program: Validity Evidence and Implications for the Use of Performance Data.

Andrea N Leep Hunderfund¹, Yoon Soo Park², Frederic W Hafferty³, Kelly M Nowicki⁴, Steven I Altchuler⁵, Darcy A Reed⁶.

Abstract

OBJECTIVE: To provide validity evidence for a multifaceted organizational program for assessing physician performance and evaluate the practical and psychometric consequences of 2 approaches to scoring (mean vs top box scores).
PARTICIPANTS AND METHODS: Participants included physicians with a predominantly outpatient practice in general internal medicine (n=95), neurology (n=99), and psychiatry (n=39) at Mayo Clinic from January 1, 2013, through December 31, 2014. Study measures included hire year, patient complaint and compliment rates, note-signing timeliness, cost per episode of care, and Likert-scaled surveys from patients, learners, and colleagues (scored using mean ratings and top box percentages).
RESULTS: Physicians had a mean ± SD of 0.32±1.78 complaints and 0.12±0.76 compliments per 100 outpatient visits. Most notes were signed on time (mean ± SD, 96%±6.6%). Mean ± SD cost was 0.56±0.59 SDs above the institutional average. Mean ± SD scores were 3.77±0.25 on 4-point and 4.06±0.31 to 4.94±0.08 on 5-point Likert-scaled surveys. Mean ± SD top box scores ranged from 18.6%±16.8% to 90.7%±10.5%. Learner survey scores were positively associated with patient survey scores (r=0.26; P=.003) and negatively associated with years in practice (r=-0.20; P=.02).
CONCLUSION: This study provides validity evidence for 7 assessments commonly used by medical centers to measure physician performance and reports that top box scores amplify differences among high-performing physicians. These findings inform the most appropriate uses of physician performance data and provide practical guidance to organizations seeking to implement similar assessment programs or use existing performance data in more meaningful ways.

Keywords: FPPE, focused professional practice evaluation; GIM, general internal medicine; MSF, multisource feedback

Year: 2017 PMID: 30225409 PMCID: PMC6135024 DOI: 10.1016/j.mayocpiqo.2017.05.005

Source DB: PubMed Journal: Mayo Clin Proc Innov Qual Outcomes ISSN: 2542-4548

As a self-regulating profession, medicine is accountable for ensuring that physicians are competent in performing their clinical roles and responsibilities,1, 2 and health care organizations play an important role in this process.3, 4 Organizations collect physician performance data for many reasons (eg, ensuring physician competency, supporting health care choices by consumers, improving care quality, or satisfying regulatory or accreditation requirements) and can use performance data in various ways. For example, scores can be used to ensure that minimal performance expectations are met3, 6 or to drive continuous improvement.7, 8, 9 Failure to meet performance expectations can lead directly to punitive consequences or can trigger additional investigations to determine whether a concern exists.10, 11, 12, 13 Likewise, scores can be used primarily as formative feedback14, 15 or for higher-stakes decisions (eg, promotion, employment, salary, privileging, and public transparency).9, 10, 16, 17, 18, 19 This panoply of purposes complicates the collection, distribution, analysis, and interpretation of physician performance data. Without a rigorous examination of the validity of their physician assessment programs, organizations risk using physician performance data in ways that are inappropriate or potentially detrimental.20, 21, 22 Furthermore, the validity of commonly used physician performance measures may not be sufficient to support all intended purposes.

The use of physician performance data is further complicated by different approaches to scoring. For example, scores based on Likert-type ratings of performance can be reported as means (as often done for learner, multisource, or peer feedback surveys1, 23) or as the percentage of optimal ratings, also known as top box scores (as often done for patient satisfaction surveys24, 25, 26). The way in which scores are calculated affects their validity (eg, mean scores better represent the distribution of ratings, while top box scores may be more readily understood),27, 28, 29 yet this issue has not been extensively examined in the context of a multifaceted organizational physician performance assessment program.

For these reasons, we sought to (1) provide validity evidence for 7 different types of assessments commonly used to measure physician performance and (2) examine the practical and psychometric consequences of the 2 aforementioned approaches to scoring (mean vs top box scores).

Participants and Methods

This study was a retrospective analysis of deidentified physician clinical performance data collected via routine institutional practices and was considered exempt by the Mayo Clinic Institutional Review Board.

Study Participants and Setting

Study participants included all physicians with a predominantly outpatient practice in general internal medicine (GIM; n=95), neurology (n=99), and psychiatry (n=39) at Mayo Clinic in Rochester, Minnesota, from January 1, 2013, through December 31, 2014. Physicians within the 3 included specialties collectively completed more than 300,000 outpatient visits during the study time frame.

Measures

Physician performance measures included the following:
Unsolicited patient complaints and compliments related to physician care, reported as the number of complaints or compliments per 100 outpatient visits.
Percentage of notes that were signed on time according to institutional policy (eg, clinical notes must be signed within 30 days).
Mean internal cost per episode of care (ie, cost to the institution of providing tests and consults within a discrete period), reported as a z score relative to the institutional mean. Internal costs reflect utilization (eg, physicians who order more or more costly tests and consultations have higher internal costs) and are unrelated to prices or charges to patients/insurers. Internal costs are attributed to the physician with the highest evaluation and management billing code on the first day of a patient's evaluation. An episode of care comprises the subsequent days over which tests and consultations are performed.
Patient satisfaction survey provided by Avatar International LLC (9 items rated using a 5-point Likert scale ranging from 1 = strongly disagree to 5 = strongly agree, 0 = not applicable).
Learner feedback surveys, ie, evaluation forms completed by residents and fellows (subsets of items from a total pool of 22 items rated using a 5-point Likert scale: 1 = needs improvement, 2-4 = average, 5 = top 10%, 0 = not applicable; free-text comments required for ratings of 1 or 5).
Multisource feedback (MSF) surveys for GIM (7 items rated using a 5-point Likert scale: 1 = needs improvement, 2-4 = meets expectations, 5 = exceeds expectations, 0 = not applicable; free-text comments required for ratings of 1 or 5) and psychiatry (5 items rated using a 4-point Likert scale ranging from 1 = strongly disagree to 4 = strongly agree).
Peer feedback survey for neurology (6 items rated using a 5-point Likert scale: 1 = never, 2 = rarely, 3 = occasionally, 4 = frequently, 5 = always, 0 = not applicable).

These data were collected for a variety of internal, accreditation, certification, and regulatory reasons, as is typical of physician performance data.32, 33, 34 Scores were not linked to physician reimbursement or published publicly. The GIM and psychiatry MSF surveys were completed by self-selected physicians, allied health professionals, and nonphysician coworkers. Neurology peer feedback surveys were completed by assigned physician and nurse practitioner colleagues. During the study time frame, specialties aimed to collect MSF surveys every 2 to 3 years (GIM), yearly (psychiatry), or twice yearly (neurology). Because previous studies have reported that care quality may decline with increasing years in practice,1, 35, 36, 37, 38 we also collected the hire year of each participant. To protect anonymity, hire year was reported as a categorical variable (with no fewer than 5 individuals per category), and other demographic data (age, sex, and academic rank) were not linked to performance measures.
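As a rough illustration of the two derived measures described above (event rates per 100 outpatient visits and cost expressed as a z score), the following Python sketch shows one way such values could be computed. The function names and input numbers are hypothetical, and dividing by a reference SD is an assumption about how a z score "relative to the institutional mean" would be formed; the source does not describe the institutional calculation.

```python
def rate_per_100_visits(event_count: int, visit_count: int) -> float:
    """Events (complaints or compliments) per 100 outpatient visits."""
    if visit_count <= 0:
        raise ValueError("visit_count must be positive")
    return 100.0 * event_count / visit_count


def z_score(value: float, reference_mean: float, reference_sd: float) -> float:
    """Distance of a value from a reference mean, in reference SD units.

    Assumption: the reference mean and SD describe the whole institution.
    """
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (value - reference_mean) / reference_sd


# Hypothetical physician: 4 complaints over 1250 outpatient visits.
complaint_rate = rate_per_100_visits(4, 1250)

# Hypothetical cost figures (arbitrary currency units).
cost_z = z_score(value=1180.0, reference_mean=1000.0, reference_sd=300.0)

print(f"complaints per 100 visits: {complaint_rate:.2f}")
print(f"cost z score: {cost_z:.2f}")
```

A positive z score in this sketch means the physician's mean cost per episode is above the reference mean, which matches how the cost measure is reported in the Results.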

Data Collection

An individual external to the study team compiled physician performance data from institutional databases and replaced identifiers with random subject IDs to allow data linkage across assessments. Only numeric ratings of performance were included due to the potential for written comments to contain identifying information. After all identifiers were replaced, the key linking identifiers with random subject IDs was destroyed. Only completely and permanently deidentified data were shared with the study team.

Score Calculation

For Likert-scaled assessments, we calculated mean scores and top box scores. To determine mean scores, we first calculated mean ratings for individual survey items, then averaged these item means across all items in each instrument. For top box scores, we calculated the percentage of ratings that were the highest possible rating on the scale. For all measures, separate scores were calculated for 2013 and 2014. To summarize overall performance, we averaged the 2013 and 2014 scores.
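The two scoring approaches can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the data layout (a dictionary mapping each survey item to a list of ratings), the item names, and the ratings are hypothetical, and excluding ratings of 0 (not applicable) is an assumption that the source does not state explicitly.

```python
from statistics import mean

NOT_APPLICABLE = 0  # assumption: 0 = not applicable is excluded from scoring


def mean_score(ratings_by_item: dict) -> float:
    """Mean of the per-item mean ratings (item level first, then instrument)."""
    item_means = []
    for ratings in ratings_by_item.values():
        valid = [r for r in ratings if r != NOT_APPLICABLE]
        if valid:
            item_means.append(mean(valid))
    return mean(item_means)


def top_box_score(ratings_by_item: dict, top_rating: int = 5) -> float:
    """Percentage of all valid ratings equal to the highest possible rating."""
    valid = [r for ratings in ratings_by_item.values()
             for r in ratings if r != NOT_APPLICABLE]
    return 100.0 * sum(r == top_rating for r in valid) / len(valid)


# Hypothetical ratings for two physicians on a 2-item, 5-point survey.
physician_a = {"listens": [5, 5, 4, 5], "explains": [5, 4, 4, 5]}
physician_b = {"listens": [5, 5, 5, 5], "explains": [5, 5, 4, 5]}

for name, ratings in (("A", physician_a), ("B", physician_b)):
    print(name, mean_score(ratings), top_box_score(ratings))

# Overall performance: average of the two yearly scores (hypothetical values).
overall = mean([4.625, 4.875])
```

In this made-up example the two physicians differ by 0.25 points in mean score (4.625 vs 4.875) but by 25 percentage points in top box score (62.5% vs 87.5%), which illustrates how top box scoring can widen apparent differences among highly rated physicians.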

Standard Setting

The Joint Commission requires health care organizations to periodically monitor physician performance via ongoing professional practice evaluation.10, 11, 12, 13 Organizations set performance thresholds, and failure to meet these thresholds triggers more detailed performance assessments using direct observation, medical record audits, etc (focused professional practice evaluation [FPPE]).10, 11, 12, 13 The rate at which FPPE is triggered is an important consideration for leaders, who must determine whether institutional resources are sufficient to accommodate the number of physicians requiring more detailed assessments. To determine theoretical trigger rates for each assessment, we set normative cutoff scores at 1 and 2 SD from the mean, as recommended by others.13, 39, 40 Specifically, cutoff scores were set 1 and 2 SD above the mean for patient complaints, below the mean for timeliness of note signing and feedback surveys, and above and below the mean for cost per episode of care.
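The normative standard-setting step can be sketched as follows. This is an illustrative Python example under stated assumptions: the scores are hypothetical, the sample SD is used (the source does not say whether a sample or population SD was applied), and the `direction` argument is a made-up convenience for choosing which side of the mean triggers review.

```python
from statistics import mean, stdev


def cutoffs(scores, k, direction):
    """Return (low, high) cutoffs set k SDs from the mean.

    direction: "below", "above", or "both" (hypothetical parameter).
    A side that does not apply is returned as None.
    """
    m, s = mean(scores), stdev(scores)  # assumption: sample SD
    low = m - k * s if direction in ("below", "both") else None
    high = m + k * s if direction in ("above", "both") else None
    return low, high


def trigger_rate(scores, k, direction):
    """Number and percentage of physicians whose score crosses a cutoff."""
    low, high = cutoffs(scores, k, direction)
    flagged = [x for x in scores
               if (low is not None and x < low)
               or (high is not None and x > high)]
    return len(flagged), 100.0 * len(flagged) / len(scores)


# Hypothetical percentages of notes signed on time for 10 physicians.
timeliness = [96, 98, 99, 100, 97, 95, 99, 100, 98, 70]

print(cutoffs(timeliness, 1, "below"))
print(trigger_rate(timeliness, 1, "below"))
print(trigger_rate(timeliness, 2, "below"))
```

Following the study design, `"above"` would apply to patient complaints, `"below"` to timeliness of note signing and feedback surveys, and `"both"` to cost per episode of care.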

Outcomes

As recommended in the Standards for Educational and Psychological Testing by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, we sought validity evidence from (1) content, as measured by adequacy of content sampling (using the Value Compass,14, 41 which conceptualizes physician performance as clinical processes, clinical outcomes, patient satisfaction, and costs); (2) response process, as measured by score distributions and means for each assessment and scoring method, number of physicians assessed per year, and number of raters (and ratings) per physician per year; (3) internal structure, as measured by internal consistency reliability (consistency in measurement among survey items, reflecting the degree to which items measure a single construct) and item discrimination indices; (4) relations to other variables, as measured by associations among scores generated by the various assessments and associations between scores and hire year; and (5) consequences of testing, as measured by FPPE trigger rates.

Statistical Analyses

Descriptive summary statistics are reported as mean ± SD or frequency (percentage). Internal consistency reliability was estimated using Cronbach α, with reliability coefficients of 0.80 or greater considered sufficient for moderate- to high-stakes summative assessment. Item discrimination indices were calculated using item-rest correlations. Associations among scores and between scores and hire year were measured using Pearson correlations. All tests were 2-sided, and P<.05 was considered statistically significant. Analyses were performed by one of us (Y.S.P.) using Stata 14 software (StataCorp LLC).
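For readers less familiar with these statistics, the sketch below shows the standard formulas for Cronbach α and item-rest correlations in plain Python. It is not the authors' analysis, which was run in Stata 14; the response matrix is hypothetical, and the code assumes a complete matrix (every respondent rates every item), which will not hold for instruments such as the learner evaluations that use subsets of items.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(responses):
    """Cronbach alpha; rows are respondents, columns are survey items."""
    k = len(responses[0])
    items = list(zip(*responses))           # one tuple of ratings per item
    totals = [sum(row) for row in responses]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def item_rest_correlations(responses):
    """Correlate each item with the sum of all other items."""
    items = list(zip(*responses))
    totals = [sum(row) for row in responses]
    return [pearson(item, [t - v for t, v in zip(totals, item)])
            for item in items]


# Hypothetical 5 respondents x 3 items on a 5-point scale.
responses = [
    [5, 5, 5],
    [4, 4, 5],
    [3, 4, 4],
    [5, 4, 5],
    [2, 3, 3],
]

alpha = cronbach_alpha(responses)
discrimination = item_rest_correlations(responses)
print(round(alpha, 3), [round(r, 2) for r in discrimination])
```

Averaging the values in `discrimination` gives a mean item discrimination index of the kind reported for each instrument in the Results.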

Results

Demographic characteristics of study participants are shown in Table 1. The mean ± SD age of participants was 50.1±11.4 years. Sixty-five percent of participants (151 of 233) were men, and 35% (82 of 233) were women.

Table 1

Demographic Characteristics of the 233 Study Participants^a

Characteristic	Physicians^b
Age (y), mean ± SD^c,d	50.1±11.4
Male sex (No. [%])^d	151 (65)
Specialty (No. [%])
General internal medicine	95 (41)
Neurology	99 (42)
Psychiatry	39 (17)
Academic rank (No. [%])^c
Professor	47 (20)
Associate professor	30 (13)
Assistant professor	114 (49)
Instructor	17 (7)
No rank	25 (11)
Hire year (No. [%])
2010-2014	52 (22)
2005-2009	38 (16)
2000-2004	39 (17)
1995-1999	37 (16)
1990-1994	28 (12)
1985-1989	13 (5)
1980-1984	12 (5)
Before 1980	14 (6)

^a Percentages may not sum to 100% due to rounding.

^b Participating physicians were those identified by their department or division chair as having a predominantly outpatient clinical practice.

^c Age and academic rank as of January 1, 2014 (the midpoint of the 2-year study time frame).

^d Age and sex data were not linked to physician performance data to protect the anonymity of study participants.

Validity Evidence From Content

Five assessments measured clinical processes either directly (timeliness of note signing) or indirectly (physician performance in clinical settings as rated by learners and colleagues) (Table 2). Three assessments measured patient satisfaction, and 1 measured costs. None measured clinical outcomes.

Table 2

Physician Clinical Performance Assessments: Corresponding Content Domains and Scores^a

Assessment (physicians, No.)^b	Scale	Content domain^c	Mean scores^d		Top box scores^e
			Potential scores	Observed scores, mean ± SD	Potential scores	Observed scores, mean ± SD
Patient complaint rate^f (n=226)	Complaints per 100 outpatient visits (No.)	Patient satisfaction	0.00+	0.32±1.78	NA	NA
Patient compliment rate^f (n=226)	Compliments per 100 outpatient visits (No.)	Patient satisfaction	0.00+	0.12±0.76	NA	NA
Timeliness of note signing (n=231)	Clinical notes signed on time (%)	Clinical processes	0-100	96.0±6.6	NA	NA
Mean internal cost per episode of care^g (n=210)	z Score relative to the institutional mean	Costs	−3 to +3 SD^h	0.56±0.59	NA	NA
Patient satisfaction survey (n=201)	5-Point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree)	Patient satisfaction	1.00-5.00	4.73±0.27	0%-100%	85.8 (11.0)
Learner evaluations (n=141)	5-Point Likert scale ranging from 1 (needs improvement) to 5 (top 10%)^i	Clinical processes	1.00-5.00	4.06±0.31	0%-100%	18.6 (16.8)
MSF, internal medicine (n=10)	5-Point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree)^i	Clinical processes	1.00-5.00	4.41±0.49	0%-100%	45.0 (25.7)
Peer feedback, neurology (n=94)	5-Point Likert scale ranging from 1 (never) to 5 (always)	Clinical processes	1.00-5.00	4.94±0.08	0%-100%	90.7 (10.5)
MSF, psychiatry (n=36)	4-Point Likert scale ranging from 1 (strongly disagree) to 4 (strongly agree)	Clinical processes	1.00-4.00	3.77±0.25	0%-100%	81.5 (21.9)^j

^a MSF = multisource feedback; NA = not applicable.

^b Of 233 physicians (95 internists, 99 neurologists, and 39 psychiatrists); assessment data are from 2013 and 2014 except for general internal medicine MSF data, which were collected only during 2014.

^c Using the Value Compass as a conceptual framework.

^d For Likert-scaled assessments, means were calculated first at the level of individual survey items, then across all items on a given instrument. For all measures, separate scores were calculated for 2013 and 2014, then averaged to summarize overall performance.

^e For Likert-scaled assessments, scores represent the percentage of optimal ratings (ie, the highest possible Likert scale rating) across all items for a given physician over the course of a year; separate scores were calculated for 2013 and 2014, then averaged to summarize overall performance.

^f Unsolicited complaints and compliments related to physician care.

^g Cost represents the internal costs of providing care to a patient, reflects utilization (eg, physicians who order more [or more costly] tests and consultations have higher internal cost per episode of care), and is unrelated to prices or charges to patients/insurers. Internal costs are attributed to the physician with the highest evaluation and management billing code on the first day of a patient's evaluation, and the subsequent days or weeks over which tests and consultations are performed are considered an episode of care.

^h Captures greater than 99% of normally distributed data.

^i Entry of free-text comments was required for ratings of 1 or 5.

^j Data are from 2014 only (n=30) because psychiatry MSF data from 2013 were stored in a way that precluded calculation of top box scores.

Validity Evidence From Response Process

The mean ± SD rates of complaints and compliments per physician were 0.32±1.78 and 0.12±0.76 per 100 outpatient visits, respectively (Table 2). A high percentage of notes were signed on time (mean ± SD, 96.0%±6.6%), and the mean ± SD internal cost per episode of care was 0.56±0.59 SD above the institutional mean. As shown in Table 2, mean scores were quite high for patient, learner, multisource, and peer feedback surveys and were skewed toward favorable ratings irrespective of the rating scale used. Top box scores showed more variation. The percentage of optimal ratings was less than 50% for learner and GIM MSF surveys (which required free-text comments to select the highest rating) and greater than 80% for patient, psychiatry multisource, and neurology peer feedback surveys (which did not have this requirement). Assessments supported by institutional resources (patient complaints and compliments, timeliness of note signing, cost per episode of care, patient satisfaction surveys, and learner evaluations) were used to assess more physicians per year than assessments developed and deployed in individual divisions or departments (multisource and peer feedback surveys) (Table 3). Patient satisfaction surveys had a mean ± SD of 36±18 raters per physician per year; the other survey-based assessments averaged 7 or fewer raters per physician per year.

Table 3

Physician Clinical Performance Assessments: Response Process and Internal Structure Validity Evidence^a,b

Assessment	Items (No.)	Response process (No.), mean ± SD			Internal structure
		Physicians assessed per year^c	Raters per physician per year	Ratings per physician per year	Cronbach α	Item discrimination index, mean ± SD^d
Patient complaint rate	NA	217 (1)	NA	NA	NA	NA
Patient compliment rate	NA	217 (1)	NA	NA	NA	NA
Timeliness of note signing	NA	225 (1)	NA	NA	NA	NA
Mean internal cost per episode of care	NA	205 (3)	NA	NA	NA	NA
Patient satisfaction survey	9	191 (2)	36 (18)	314 (156)	0.97	0.88 (0.04)
Learner evaluations	22^e	115 (21)	6 (2)	126 (47)	0.96	0.74 (0.12)
MSF (general internal medicine)	7	11^f	4^f	27^f	0.89	0.88 (0.08)
Peer feedback (neurology)	6	92 (1)	7 (0)	58 (19)	0.83	0.78 (0.06)
MSF (psychiatry)	5	26 (8)	6 (2)	17 (13)	0.96	0.73 (0.11)

^a MSF = multisource feedback; NA = not applicable.

^b Assessment data are from 2013 and 2014 except for general internal medicine MSF data, which were collected only during 2014.

^c Of 233 eligible physicians (although the number of physicians eligible for assessment by learner evaluations was likely <233 because not all physicians interact with residents and fellows). Specialties aimed to collect multisource or peer feedback for each physician every 2 to 3 years (general internal medicine, 95 physicians), every year (psychiatry, 39 physicians), or twice per year (neurology, 99 physicians).

^d Item discrimination indices were calculated at the item level using item-rest correlation coefficients, then averaged across all items within a given assessment to generate a mean item discrimination index.

^e Total pool of items; individual learner evaluation forms contained subsets of items.

^f No standard deviation because only 2014 data were available.

Validity Evidence From Internal Structure

Cronbach α values for patient, learner, multisource, and peer feedback tools were all 0.83 or greater (Table 3). Mean item discrimination indices ranged from 0.73 to 0.88, indicating that items were very effective at discriminating between high and low levels of performance.

Validity Evidence From Relations to Other Variables

Physicians who received higher mean scores on learner evaluations tended to also receive higher scores on patient satisfaction surveys (r=0.26; P=.003), whereas neurologists with higher mean internal costs per episode of care tended to receive lower scores on the neurology peer feedback survey (r=−0.27; P=.008) (Table 4). The latter finding remained significant when top box scores were used (Supplemental Table 1, available online at http://www.mcpiqojournal.org). There were no other significant correlations.

Table 4

Physician Clinical Performance Assessments: Correlation Matrix (Using Mean Scores)^a,b

	Patient complaint rate	Patient compliment rate	Timeliness of note signing	Mean internal cost per episode of care	Patient satisfaction survey	Learner evaluations	MSF (GIM)	Peer feedback (neurology)	MSF (psychiatry)
Patient complaint rate	1.00
Patient compliment rate	−0.01 (.91)	1.00
Timeliness of note signing	0.06 (.38)	0.05 (.50)	1.00
Mean internal cost per episode of care	−0.01 (.90)	0.02 (.75)	0.05 (.47)	1.00
Patient satisfaction survey	−0.02 (.74)	0.04 (.56)	−0.12 (.09)	0.02 (.81)	1.00
Learner evaluations	−0.10 (.25)	−0.04 (.60)	−0.09 (.29)	−0.16 (.07)	0.26 (.003)	1.00
MSF (GIM)	−0.34 (.32)	NA^c	−0.50 (.11)	−0.22 (.51)	0.42 (.19)	−0.93 (.24)	1.00
Peer feedback (neurology)	0.08 (.46)	0.12 (.27)	−0.08 (.43)	−0.27 (.008)	0.12 (.25)	0.10 (.35)	NA	1.00
MSF (psychiatry)	−0.03 (.86)	NA^c	0.18 (.29)	0.11 (.53)	0.10 (.59)	−0.07 (.71)	NA	NA	1.00

^a GIM = general internal medicine; MSF = multisource feedback; NA = not applicable.

^b Data are given as correlation coefficients (P values); mean scores were calculated first at the item level, then across all items within a given instrument.

^c Insufficient variability precluded calculation of a correlation coefficient.

Physicians with more years in practice at our organization tended to receive lower mean scores on learner evaluations (r=−0.20; P=.02). Otherwise, there were no significant associations between scores and hire year, irrespective of whether mean or top box scores were used (Supplemental Table 2, available online at http://www.mcpiqojournal.org).

Validity Evidence From Consequences

Table 5 shows trigger rates resulting from the normative cutoff scores. The trigger rate was highest for internal cost per episode of care and lowest for patient complaints. Trigger rates for top box scores were higher than for mean scores when cutoff scores were set at 1 SD from the mean.
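As a point of reference for reading these trigger rates, the Python sketch below computes the percentage of scores that would fall beyond a cutoff if scores were normally distributed. This benchmark is an illustration added here, not an analysis from the study; the function name and the `two_sided` parameter are hypothetical, and the survey-based scores in this study were skewed, so observed rates need not match it.

```python
from statistics import NormalDist


def expected_trigger_rate(k: float, two_sided: bool = False) -> float:
    """Percentage of a normal distribution lying more than k SDs from the
    mean, on one side (default) or on both sides."""
    tail = 1.0 - NormalDist().cdf(k)
    return 100.0 * tail * (2 if two_sided else 1)


for k in (1, 2):
    one = expected_trigger_rate(k)
    both = expected_trigger_rate(k, two_sided=True)
    print(f"{k} SD: one-sided {one:.1f}%, two-sided {both:.1f}%")
```

Under normality, roughly 15.9% (one-sided) or 31.7% (two-sided) of scores lie beyond 1 SD, and roughly 2.3% or 4.6% lie beyond 2 SD. Cost per episode of care was the only measure with cutoffs on both sides of the mean, and its trigger rates in Table 5 (30% and 6%) are close to the two-sided values. The much lower rate for patient complaints at 1 SD (2%) may reflect a strongly skewed distribution, in which a few extreme values inflate the SD.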

Table 5

Physician Clinical Performance Assessments: Consequences of Measurement^a,b

Assessment	Physicians (No.)	Threshold
		1 SD from the mean^c				2 SD from the mean^c
		Mean scores^d		Top box scores^e		Mean scores^d		Top box scores^e
		Cutoff score	Trigger rate (No. [%])	Cutoff score	Trigger rate (No. [%])	Cutoff score	Trigger rate (No. [%])	Cutoff score	Trigger rate (No. [%])
Patient complaint rate	226	>2.1	4 (2)	NA	NA	>3.9	3 (1)	NA	NA
Timeliness of note signing	231	<89.5%	24 (10)	NA	NA	<82.9%	10 (4)	NA	NA
Mean internal cost per episode of care	210	<−0.02 or >1.15	63 (30)	NA	NA	<−0.61 or >1.74	13 (6)	NA	NA
Patient satisfaction survey	201	<4.46	15 (7)	<74.8%	18 (9)	<4.19	4 (2)	<63.8%	0
Learner evaluations	141	<3.75	13 (9)	<1.8%	20 (15)	<3.43	4 (3)	<1%^f	0
MSF (internal medicine)	10	<3.92	1 (10)	<19.3%	2 (18)	<3.43	1 (10)	<1%^f	0
Peer feedback (neurology)	94	<4.86	13 (14)	<80.3%	13 (14)	<4.78	3 (3)	<69.8%	4 (4)
MSF (psychiatry)	36	<3.52	6 (17)	<59.6%^g	11 (31)^g	<3.26	1 (3)	<37.7%^g	8 (22)^g

^a MSF = multisource feedback; NA = not applicable.

^b Assessment data are from 2013 and 2014 except for general internal medicine MSF data, which were collected only during 2014; cutoff scores were not applied to patient compliments.

^c Hypothetical cutoff scores set at 1 or 2 SD above the mean for patient complaints; 1 or 2 SD below the mean for timeliness of note signing, patient satisfaction survey, learner evaluations, and multisource or peer feedback surveys; or 1 or 2 SD above and below the mean for mean internal costs per episode of care.

^d For Likert-scaled assessments, means were calculated first at the level of individual survey items, then across all items on a given instrument. For all measures, separate scores were calculated for 2013 and 2014, then averaged to summarize overall performance.

^e For Likert-scaled assessments, scores represent the percentage of optimal ratings (ie, the highest possible Likert scale rating) across all items for a given physician over the course of a year; separate scores were calculated for 2013 and 2014, then averaged to summarize overall performance.

^f A cutoff score of less than 1% was used when 2 SD below the mean was a negative value.

^g Data are from 2014 only (psychiatry MSF data from 2013 were stored in a way that precluded calculation of top box scores).

Discussion

This study provides validity evidence for 7 different assessments commonly used by medical centers to determine whether physician performance is meeting professional standards. It is also the first, to our knowledge, to analyze the effects of different approaches to scoring (mean vs top box scores). A careful examination of validity evidence and scoring procedures is of practical importance to organizational leaders because it provides guidance regarding the most appropriate uses of physician performance data. It can also inform the efforts of those seeking to develop an organizational physician assessment program or use existing performance data in more meaningful ways.

An ideal physician assessment program would be capable of adequately measuring physician clinical performance. In keeping with previous studies, we analyzed the content validity of our assessment program using the Value Compass, which conceptualizes physician performance as clinical processes, clinical outcomes, patient satisfaction, and costs. In doing so, we identified only 1 direct measure of clinical processes and no measures of actual clinical outcomes. Although clinical outcome data often exist at the practice group and organization levels, they are difficult to attribute to individual physicians in a team-based care environment.15, 43, 44, 45 This complicates the interpretation of cost data (which are best interpreted in conjunction with measures of care quality46, 47, 48) and highlights the challenge of procuring outcome data that can be ascribed to individual physicians.

Multisource feedback is more readily available than clinical outcome data, but it can be logistically challenging to obtain feedback from a sufficient number of raters. We found that patient satisfaction surveys averaged 36 raters per physician per year, which met the recommended minimum of 30 to 50 patient raters.43, 49 However, the other survey-based assessments averaged 7 or fewer raters per physician per year, which failed to meet the recommended minimum of 8 to 12 raters for learner, multisource, and peer feedback. This may reflect rater fatigue, disquietude over assessing colleagues, or inattention to survey invitations. Institutional support seems to play an important role, as efforts to obtain feedback were more successful when they were supported institutionally than when they were developed and deployed in individual divisions and departments. This is consistent with previous studies demonstrating the feasibility of MSF, particularly when it is collected via a national process52, 53 or required for licensure.

The survey-based assessments had excellent internal consistency reliability and desirable psychometric properties. However, mean scores tended to be quite high, with little variation based on hire year. Other studies of Likert-scaled physician assessments completed by patients,1, 4, 54 learners, coworkers,1, 4 and peers1, 4, 51 have had similar findings. This may reflect inflated ratings of performance (eg, due to reluctance on the part of raters to assign low scores). Alternatively, it could indicate that practicing physicians are generally at the top of the learning curve with respect to performance, as might be expected given their career stage. Taken together, these findings suggest that survey-based assessments may be able to identify physicians who fail to meet accepted performance standards. However, skew toward higher ratings makes it difficult to use scores for more aspirational purposes (eg, continuous improvement to increasingly higher levels of professional excellence) because this would require instruments capable of discriminating among high-performing individuals.

We found that scores on Likert-scaled assessments were more discriminating when they were reported as top box scores than when they were reported as means. Greater discrimination among physicians may be advantageous if the intended purpose of measurement is to inspire continuous improvement. However, amplifying differences among high performers also risks engendering demoralization, disregard for performance data, or attempts to “game” the assessment system5, 22, 33, 55, 56, 57, 58 and may result in higher FPPE trigger rates. This is especially problematic for scores based on small sample sizes. Thus, the method used to calculate scores should be selected carefully in light of potential consequences of testing, and organizations receiving top box scores (eg, from patient satisfaction survey vendors) should be mindful of these considerations when interpreting and distributing performance data.

Interestingly, physicians who were rated more highly by residents and fellows also tended to be rated more favorably by patients. This suggests that these assessments measure a similar construct (eg, interpersonal and communication skills, as suggested by others),50, 51, 60, 61, 62 whereas the other assessments generally provide distinct perspectives on physician performance. These findings support the value of a multifaceted physician assessment program and underscore the importance of combining multiple approaches when attempting to measure something as complex as physician performance.5, 63, 64, 65 It may be useful, for example, to compile various sources of performance data into a dashboard, portfolio, or report card rather than distributing and reviewing it in a piecemeal manner.

Previous studies have found that physician scores on knowledge tests and various performance assessments decrease with increasing years in practice.1, 35, 36, 37, 38 This finding provides some rationale for monitoring physician performance over time. However, we observed little score variation by hire year, which may reflect the known mitigating effects of a practice setting that allows for frequent interactions with colleagues.66, 67 The one exception was learner feedback scores, which declined with increasing years in practice. This could be due to erosion of teaching skills (or an increasing number of competing priorities) among physicians over time. Alternatively, learners may prefer faculty who are closer to them in career stage. Further studies are needed to better understand this association.

This study has several limitations. First, we analyzed physician performance data from 3 specialties at 1 organization with a salary-based physician reimbursement model. However, other medical centers collect similar data,4, 18, 68 which supports the generalizability of these findings. Second, written comments from patients, learners, or colleagues may provide a rich source of feedback,54, 64, 69, 70 but we only examined validity evidence for numeric data. Third, we were not able to analyze associations between scores and age, sex, or academic rank, given concern for preserving anonymity. Fourth, previous studies have used legal or disciplinary action, adherence to standards of care (based on analyses of billing, medical record, or administrative data), medical record audits, or specialty board recertification examination failure rates to identify underperforming physicians, but these data were not available for analysis. Finally, we used a normative approach to standard setting, which detects deviations from the average performance of a high-performing group. However, other standard setting approaches exist5, 6, 42 and may be preferred depending on the intended use of scores.
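As a rough illustration of the normative approach to standard setting mentioned above, the sketch below derives a cutoff from the group mean and SD and flags physicians who fall beyond it. It is a minimal example under stated assumptions: the function names and scores are hypothetical, and the use of the sample SD is an assumption, since the text does not say which SD formula was applied.

```python
from statistics import mean, stdev


def normative_cutoff(scores, k, direction):
    """Cutoff k SDs above or below the group mean (sample SD is an assumption)."""
    center, spread = mean(scores), stdev(scores)
    if direction == "above":
        return center + k * spread
    if direction == "below":
        return center - k * spread
    raise ValueError("direction must be 'above' or 'below'")


def flag_physicians(scores_by_physician, k=2, direction="below"):
    """Return the physicians whose scores fall beyond the normative cutoff."""
    cutoff = normative_cutoff(list(scores_by_physician.values()), k, direction)
    if direction == "above":
        return sorted(p for p, s in scores_by_physician.items() if s > cutoff)
    return sorted(p for p, s in scores_by_physician.items() if s < cutoff)


# Hypothetical mean survey scores on a 5-point scale (illustration only).
survey_scores = {
    "A": 4.8, "B": 4.7, "C": 4.9, "D": 4.8, "E": 4.75,
    "F": 4.85, "G": 4.8, "H": 4.7, "I": 4.9, "J": 3.5,
}

# Lower survey scores are worse, so flag physicians below the cutoff.
print(flag_physicians(survey_scores, k=2, direction="below"))
```

Because the cutoff moves with the group's own mean and SD, a tightly clustered, high-performing group can flag a physician whose absolute score is still high. This is one reason criterion-based alternatives may be preferred for some uses of the scores.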

Conclusion

Health care organizations face the formidable task of implementing physician assessment programs capable of simultaneously advancing institutional goals, meeting regulatory and accreditation requirements, and providing meaningful feedback to physicians. These findings suggest that individual physician performance data are most appropriately used in combination to detect deviations from expected standards, which can then be further investigated (eg, using FPPE) to determine whether a true concern exists. Although MSF is more readily available than clinical outcome data, obtaining a sufficient number of raters per physician can be challenging without institutional support. Top box scores are more discriminating than mean scores. However, amplifying differences among high performers may have unintended consequences and increase FPPE trigger rates.

56 in total

1. Assessment of physician performance in Alberta: the physician achievement review.

Authors: W Hall; C Violato; R Lewkonia; J Lockyer; H Fidler; J Toews; P Jennett; M Donoff; D Moores
Journal: CMAJ Date: 1999-07-13 Impact factor: 8.262

2. A comparison of clinical teaching evaluations by resident and peer physicians.

Authors: Thomas J Beckman; Mark C Lee; Jayawant N Mandrekar
Journal: Med Teach Date: 2004-06 Impact factor: 3.650

3. Programmatic assessment and Kane's validity perspective.

Authors: Lambert W T Schuwirth; Cees P M van der Vleuten
Journal: Med Educ Date: 2012-01 Impact factor: 6.251

Review 4. Systematic review: the relationship between clinical experience and quality of health care.

Authors: Niteesh K Choudhry; Robert H Fletcher; Stephen B Soumerai
Journal: Ann Intern Med Date: 2005-02-15 Impact factor: 25.391

5. User perceptions of multi-source feedback tools for junior doctors.

Authors: Bryan Burford; Jan Illing; Charlotte Kergon; Gill Morrow; Moira Livingston
Journal: Med Educ Date: 2010-01-05 Impact factor: 6.251

6. Analysis & commentary: A road map for improving the performance of performance measures.

Authors: Peter J Pronovost; Richard Lilford
Journal: Health Aff (Millwood) Date: 2011-04 Impact factor: 6.301

7. An analysis of the knowledge base of practicing internists as measured by the 1980 recertification examination.

Authors: J J Norcini; R S Lipner; J A Benson; G D Webster
Journal: Ann Intern Med Date: 1985-03 Impact factor: 25.391

8. Patient satisfaction: how do qualitative comments relate to quantitative scores on a satisfaction survey?

Authors: Nicole R Santuzzi; Melanie S Brodnik; Laurie Rinehart-Thompson; Maryanna Klatt
Journal: Qual Manag Health Care Date: 2009 Jan-Mar Impact factor: 0.926

9. American Board of Medical Specialties Maintenance of Certification: theory and evidence regarding the current framework.

Authors: Richard E Hawkins; Rebecca S Lipner; Hazen P Ham; Robin Wagner; Eric S Holmboe
Journal: J Contin Educ Health Prof Date: 2013 Impact factor: 1.355


