OBJECTIVE: Any attempt to generalize the performance of a subjective diagnostic method should take into account sampling variation in both cases and readers. Most current measures of test performance, especially indices of reliability, address only case variation and are therefore not suitable for generalizing results across the population of readers. We studied the effect of reader variation on two measures of multireader reliability: pairwise agreement and Fleiss' kappa. STUDY DESIGN AND SETTING: We used a normal hierarchical model with a latent trait (signal) variable to simulate a binary decision-making task performed by different numbers of readers on an infinite sample of cases. RESULTS: Both measures, especially Fleiss' kappa, have a large sampling variance when estimated from a small number of readers, casting doubt on their accuracy given the number of readers typically used in current reliability studies. CONCLUSION: Most current agreement studies are likely limited by the number of readers and unlikely to produce reliable estimates of reader agreement.
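As an illustration only (this is a minimal sketch, not the authors' published model or code), the following Python snippet shows one way such a hierarchical simulation could be set up: each case carries a latent signal, each reader draws a personal decision threshold, and both pairwise agreement and Fleiss' kappa are recomputed across repeated small reader panels. All parameter names and numeric settings (reader_sd, noise_sd, the 10,000-case approximation of an "infinite" case sample) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratings(n_readers, n_cases, reader_sd=0.5, noise_sd=1.0):
    """Hypothetical hierarchical model: each case has a latent signal,
    each reader a random threshold; a reader calls a case positive when
    the noisy signal exceeds that threshold. Parameters are assumed."""
    signal = rng.normal(0.0, 1.0, size=(n_cases, 1))              # latent trait per case
    thresholds = rng.normal(0.0, reader_sd, size=(1, n_readers))  # between-reader variation
    noise = rng.normal(0.0, noise_sd, size=(n_cases, n_readers))  # within-reader noise
    return (signal + noise > thresholds).astype(int)              # binary decisions

def fleiss_kappa(ratings):
    """Fleiss' kappa and mean pairwise agreement for a (cases x readers)
    binary rating matrix, using the standard Fleiss formulas."""
    n_cases, n_readers = ratings.shape
    pos = ratings.sum(axis=1)                                     # positive calls per case
    counts = np.stack([n_readers - pos, pos], axis=1)             # per-case category counts
    p_i = (np.sum(counts**2, axis=1) - n_readers) / (n_readers * (n_readers - 1))
    p_bar = p_i.mean()                                            # observed pairwise agreement
    p_j = counts.sum(axis=0) / (n_cases * n_readers)              # overall category proportions
    p_e = np.sum(p_j**2)                                          # chance agreement
    return (p_bar - p_e) / (1 - p_e), p_bar

# Spread of both indices across 200 independently drawn reader panels:
# with many cases, this spread isolates the reader-sampling variance.
for n_readers in (2, 5, 20):
    stats = [fleiss_kappa(simulate_ratings(n_readers, 10_000)) for _ in range(200)]
    kappas, agreements = zip(*stats)
    print(f"{n_readers:2d} readers: kappa SD={np.std(kappas):.3f}, "
          f"agreement SD={np.std(agreements):.3f}")
```

Because the case sample is large, the replication-to-replication spread in kappa and agreement reflects reader sampling alone, which is the component of variance the abstract argues is neglected by case-only reliability indices.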