BACKGROUND: Good interrater reliability is essential to minimize error variance and improve study power. Reasons why raters differ in scoring the same patient include information variance (different information obtained because of asking different questions), observation variance (the same information is obtained, but raters differ in what they notice and remember), interpretation variance (differences in the significance attached to what is observed), criterion variance (different criteria used to score items), and subject variance (true differences in the subject). We videotaped and transcribed 30 pairs of interviews to examine the most common sources of rater unreliability. METHOD: Thirty patients with depression were independently interviewed by 2 different raters on the same day. Raters provided rationales for their scoring, and independent assessors reviewed the rationales, the interview transcripts, and the videotapes to code the main reason for each discrepancy. One third of the interviews were conducted by raters who had not administered the Hamilton Depression Rating Scale before; one third, by raters who were experienced but not calibrated; and one third, by experienced and calibrated raters. RESULTS: Experienced and calibrated raters had the highest interrater reliability (intraclass correlation coefficient [ICC] = 0.93), followed by inexperienced raters (ICC = 0.77) and experienced but uncalibrated raters (ICC = 0.55). The most common reason for disagreement was interpretation variance (39%), followed by information variance (30%), criterion variance (27%), and observation variance (4%). Experienced and calibrated raters had significantly less criterion variance than the other cohorts (P = 0.001). CONCLUSIONS: Reasons for disagreement varied by level of experience and calibration.
Experienced and uncalibrated raters should focus on establishing common conventions, whereas experienced and calibrated raters should focus on fine-tuning judgment calls at different symptom thresholds. Calibration training seems to improve reliability over experience alone; experienced raters without cohort calibration had lower reliability than inexperienced raters.
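The abstract reports reliability as an intraclass correlation but does not state which ICC form was used. As an illustration only (not the authors' analysis), the sketch below computes a two-way random-effects, absolute-agreement ICC(2,1) from an n-subjects × k-raters score matrix, which is a common choice for paired-rater designs like this one:

```python
import numpy as np

def icc2_1(ratings):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    ratings: array-like of shape (n_subjects, k_raters),
             e.g. one Hamilton total score per rater per patient.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means

    # Partition the total sum of squares into subjects, raters, residual.
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)             # between-subjects mean square
    ms_c = ss_cols / (k - 1)             # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))  # residual mean square

    # Shrout & Fleiss ICC(2,1): absolute agreement, single rater.
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Perfect agreement between two raters yields ICC = 1.0; a constant
# offset between raters (a systematic scoring convention difference,
# i.e. criterion variance) lowers an absolute-agreement ICC.
print(icc2_1([[10, 10], [20, 20], [30, 30]]))  # 1.0
print(icc2_1([[10, 12], [20, 22], [30, 32]]))  # < 1.0
```

Because ICC(2,1) measures absolute agreement, it penalizes systematic between-rater offsets, which is exactly the kind of criterion variance the calibrated cohort reduced; a consistency-type ICC(3,1) would ignore such offsets.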