RATIONALE AND OBJECTIVES: Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.

MATERIALS AND METHODS: Using digitized films from 314 screening mammography exams (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists who independently assessed whether each examination should be recalled, and the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty in the recalled images.

RESULTS: Agreement among the three expert pairs for recall/no recall was higher for cancer cases (mean 74.3 ± 6.5) than for noncancers (mean 62.6 ± 7.1). Complete agreement on recall, lesion location, finding type, and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.

CONCLUSION: Variability in expert interpretation is notable. A minimum of three independent experts combined with a consensus should be used for establishing any gold standard interpretation for test sets, especially for noncancer cases.
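As an illustration of the agreement measures reported above, the following minimal Python sketch computes percent agreement for each of the three reader pairs, the mean and standard deviation across pairs, and the share of cases on which all three readers agree. It covers only the binary recall/no-recall decision; the abstract's figures for two-of-three and three-of-three agreement also take lesion location into account, which is not modeled here. The function names, reader labels, and ratings are hypothetical toy values, not study data, and the study's actual computation may have differed.

```python
from itertools import combinations
from statistics import mean, stdev


def pairwise_agreement(ratings):
    """Percent agreement on recall/no recall for every pair of readers.

    ratings maps a reader label to a list of booleans (True = recall),
    with all lists in the same case order.
    """
    result = {}
    for a, b in combinations(sorted(ratings), 2):
        matches = sum(x == y for x, y in zip(ratings[a], ratings[b]))
        result[(a, b)] = 100.0 * matches / len(ratings[a])
    return result


def unanimous_rate(ratings):
    """Percent of cases on which every reader made the same decision."""
    cases = list(zip(*ratings.values()))
    unanimous = sum(len(set(case)) == 1 for case in cases)
    return 100.0 * unanimous / len(cases)


# Hypothetical toy ratings for five cases (NOT data from the study).
toy_ratings = {
    "A": [True, True, False, True, False],
    "B": [True, True, False, True, True],
    "C": [True, False, False, False, True],
}

pair_pct = pairwise_agreement(toy_ratings)
pair_mean = mean(pair_pct.values())
pair_sd = stdev(pair_pct.values())  # sample SD across the three pairs
all_three = unanimous_rate(toy_ratings)

for pair, pct in pair_pct.items():
    print(pair, f"{pct:.1f}%")
print(f"mean of pairs: {pair_mean:.1f} ± {pair_sd:.1f}")
print(f"all three agree: {all_three:.1f}%")
```

With a binary decision and three readers, at least two readers always agree, so a two-of-three measure only becomes informative once a further attribute such as lesion location is required to match as well.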