Rachel Yudkowsky, Yoon Soo Park, Janet Riddle, Catherine Palladino, Georges Bordage

Dr. Yudkowsky is associate professor, Department of Medical Education, and director, Dr. Allan L. and Mary L. Graham Clinical Performance Center, University of Illinois at Chicago, Chicago, Illinois. Dr. Park is assistant professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois. Dr. Riddle is assistant professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois. Ms. Palladino is a student, University of Illinois College of Pharmacy; at the time of the study she was a graduate research assistant, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois. Dr. Bordage is professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois.
Abstract
PURPOSE: High-quality checklists are essential to performance test score validity. Prior research found that physical exam checklists of items that clinically discriminated between competing diagnoses provided more generalizable scores than all-encompassing thoroughness checklists. The purpose of this study was to compare validity evidence for clinically discriminating versus thoroughness checklists, hypothesizing that evidence would favor the former.

METHOD: Faculty at four Chicago-area medical schools developed six standardized patient (SP) cases with checklists of about 20 items ("thoroughness [long] checklists"). Four clinicians identified a subset of items that clinically discriminated between competing diagnoses of each case ("clinically discriminating [short] checklists"). Cases were administered to 155 University of Illinois at Chicago fourth-year medical students during their 2011 Clinical Skills Examination (CSE). Validity evidence was compared for CSE scores based on thoroughness versus clinically discriminating checklist items.

RESULTS: Validity evidence favoring clinically discriminating checklists included response process: greater SP checklist accuracy (kappa = 0.75 for long and 0.84 for short checklists, P < .05); internal structure: better item discrimination (0.28 long, 0.42 short, P < .001); internal consistency reliability (0.80 long, 0.92 short); standard error of measurement (z score 8.87 long, 8.05 short); and generalizability (G = 0.504 long, 0.533 short). There were no significant differences overall in relevance ratings, item difficulty, or cut scores of long versus short checklist items.

CONCLUSIONS: Limiting checklist items to those affecting diagnostic decisions resulted in better accuracy and psychometric indices. Thoroughness items performed without thinking do not reflect clinical reasoning ability and contribute construct-irrelevant variance to scores.
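The SP checklist accuracy figures above are chance-corrected agreement statistics (Cohen's kappa) between SP-recorded scores and a reference scoring. As a minimal illustration of how such an agreement coefficient is computed, the sketch below implements Cohen's kappa for two raters over binary checklist items; the ratings shown are hypothetical examples, not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical checklist scores (1 = item performed, 0 = not performed)
sp_scores   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
gold_scores = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(sp_scores, gold_scores)  # 0.6 for these ratings
```

Values of 0.75 and 0.84, as reported for the long and short checklists, would conventionally be read as substantial to almost-perfect agreement.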