Interobserver concordance in the BI-RADS classification of breast ultrasound exams.

Maria Julia G Calas, Renan M V R Almeida, Bianca Gutfilen, Wagner C A Pereira

Year: 2012 PMID: 22358246 PMCID: PMC3275126 DOI: 10.6061/clinics/2012(02)16

Source DB: PubMed Journal: Clinics (Sao Paulo) ISSN: 1807-5932 Impact factor: 2.365

INTRODUCTION

Breast ultrasound is an important complement to the clinical/mammographic investigation of breast lesions. This operator-dependent method entails real-time image detection and analysis and requires extensive training and experience in identifying lesions and differentiating between benign and malignant ones (1-4). Lesion contour and shape are considered the main features for differentiating benign from malignant lesions, the former with high sensitivity and the latter with high specificity. Many authors believe that combined ultrasound methods may yield greater accuracy (5-10). However, using morphological characteristics for lesion differentiation demands a high rate of interobserver agreement, an issue that has been extensively examined for mammography but has received less attention for ultrasound. Interobserver agreement is thus a matter of strong concern in clinical radiological practice (3-11). To better characterize interobserver agreement in breast ultrasound, this study examined a group of 14 breast imagers who applied the Breast Imaging Reporting and Data System (BI-RADS) ultrasound classification to 40 breast lesions. The study was exclusively concerned with agreement among the observers in lesion categorization according to the BI-RADS lexicon. The accuracy of the observers was not directly assessed through comparison with the final lesion histology.

MATERIALS AND METHODS

This study used 40 B-mode echographic images of lesions obtained from 40 patients who were examined at a private institution and who subsequently underwent surgery as indicated by their referring physicians. The study was approved by an Institutional Ethics Committee, and all of the patients provided written informed consent. All of the examinations were performed by one radiologist using Logiq 5 ultrasound equipment (GE Medical Systems, Inc., Milwaukee, WI, USA) with a 12 MHz transducer. Orthogonal short- and long-axis images were recorded for each patient according to the American College of Radiology (ACR) standards (12). The image evaluation criteria were based on the six BI-RADS-US categories (12): incomplete (0), negative (1), benign (2), probably benign (3), suspicious (4), and highly suggestive of malignancy (5). The surgical histopathological data were also obtained.

Fourteen breast-imaging radiologists participated in the study. They worked in different institutions but had similar volumes of ultrasound examinations, mammography exams, and biopsy procedures and had 4 to 23 years of breast radiology experience (<5 years, n = 2; 5–10 years, n = 8; >10 years, n = 4). The retrospective review was performed on hard copies of the digitized sonographic images. Each observer received a compact disc with images from the 40 lesions and a form with the ultrasound morphological criteria and the BI-RADS classification. They were instructed to classify the lesions according to this system and were given 30 days to return the material. Although no specific training was provided before the study, all of the readers had been using the BI-RADS lexicon since 2005. The observers had no access to clinical or histopathological information from the patients, and all of them complied with the instructions provided. To ensure patient anonymity, the names were removed from the images and materials, and each patient was identified by a code.
The observers' BI-RADS analyses were classified according to their level of concordance: a) total, the same BI-RADS category was assigned by all observers; b) partial, different categories were assigned, but all of them fell within the negative (2 or 3) or the positive (4 or 5) group, so that the biopsy recommendations were the same; and c) disagreement, different categories were assigned (by at least one observer) and produced recommendations for different management plans (biopsy or follow-up). We considered category 0 (incomplete) to be in partial agreement with both the negative (2 or 3) and the positive (4 or 5) categories, in the sense that the patient would have to undergo further studies to define the final classification. The proportions of discordant classifications according to the experience time categories (<5 years, 5–10 years, and >10 years) were analyzed using Chi-square tests. The modified Fleiss' kappa index was used to analyze concordance because the data were grouped into six categories (13). The index values were interpreted as follows: poor (κ<0), slight (κ = 0.0–0.20), fair (κ = 0.21–0.40), moderate (κ = 0.41–0.60), substantial (κ = 0.61–0.80), and almost perfect (κ = 0.81–1.00). The analyses were stratified by the BI-RADS categorization and by the data grouping described above. The statistical analysis was performed using the R Project for Statistical Computing software (14).
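As an illustration of the agreement index described above, the following is a minimal Python sketch of the standard Fleiss' kappa for multiple observers, together with the interpretation bands quoted in the text. It is not the authors' R code and does not implement the "modified" variant they cite, so it should not be expected to reproduce the published values exactly. The function names, the `GROUP` mapping, and the choice to keep category 0 as its own group in the grouped analysis are assumptions made for this example. The four demonstration rows are copied from Table 1 and are only a subset of the data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Standard Fleiss' kappa.

    ratings: list of lesions, each a list with one category per observer
    (every lesion must be rated by the same number of observers).
    """
    n_lesions = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    observed = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of observer pairs that agree on this lesion.
        observed += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_observed = observed / n_lesions
    # Agreement expected by chance from the overall category proportions.
    p_chance = sum((t / (n_lesions * n_raters)) ** 2 for t in category_totals.values())
    # Undefined (division by zero) if a single category is used for every rating.
    return (p_observed - p_chance) / (1 - p_chance)


def interpret(kappa):
    """Map kappa to the bands quoted in the text (upper bounds treated as inclusive)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Assumed grouping for the second analysis: 2/3 -> negative, 4/5 -> positive.
# Category 0 is kept as its own group here; the text does not say how it was
# handled in the grouped kappa. Category 1 does not occur in Table 1.
GROUP = {0: "incomplete", 2: "negative", 3: "negative", 4: "positive", 5: "positive"}


def group_ratings(ratings):
    return [[GROUP[r] for r in row] for row in ratings]


# Four rows copied from Table 1, for illustration only (not the full data set).
subset = [
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # fibroadenoma, complete agreement
    [4, 4, 4, 5, 5, 4, 4, 5, 5, 5, 5, 5, 4, 4],  # carcinoma, partial agreement
    [4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3],  # hematoma, disagreement
    [2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 4, 2, 3],  # cyst, disagreement
]
for name, data in (("original", subset), ("grouped", group_ratings(subset))):
    k = fleiss_kappa(data)
    print(f"{name}: kappa = {k:.3f} ({interpret(k)})")
```

Because grouping merges neighboring categories, disagreements such as 4 versus 5 no longer count against the observers, which is why a grouped analysis can yield a higher kappa than the six-category one.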

RESULTS

The average age of the 40 subjects was 50.7 years, ranging from 16 to 88 years. The lesion sizes ranged from 6 to 27.0 mm (mean diameter, 15.4 mm); 19 were benign, and 21 were malignant. The concordance analysis identified three cases of total agreement, 13 cases of partial agreement, and 24 cases of disagreement (Table 1).

Table 1

The histopathological results and BI-RADS classifications for the individual observers. The results are grouped according to the level of observer concordance.

Concordance level, histology, and case number	BI-RADS category assigned by each observer
	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Complete
Fibroadenoma
1	3	3	3	3	3	3	3	3	3	3	3	3	3	3
2	3	3	3	3	3	3	3	3	3	3	3	3	3	3
Carcinoma	4	4	4	4	4	4	4	4	4	4	4	4	4	4
Partial
Fibroadenoma
1	3	3	3	3	3	3	3	3	3	3	3	2	3	3
2	3	3	3	0	3	3	3	3	3	3	3	3	3	3
Carcinoma
1	4	4	4	5	5	4	4	5	5	5	5	5	4	4
2	5	5	5	5	5	4	5	4	4	5	5	5	5	4
3	4	4	5	4	5	0	4	5	4	4	4	4	4	4
4	4	5	5	5	4	5	5	5	5	4	5	5	5	4
5	4	4	5	4	4	4	5	5	4	4	5	5	4	4
6	4	5	5	5	5	5	4	5	4	4	5	5	4	4
7	4	5	5	5	4	5	5	5	4	5	5	5	5	4
8	4	5	4	4	4	4	4	4	4	4	4	5	4	4
9	3	3	3	3	3	3	3	3	3	3	3	3	3	2
10	4	4	4	4	4	4	5	5	5	4	5	5	4	5
11	5	5	5	5	4	4	5	5	4	4	5	5	5	4
Discordance
Fibroadenoma
1	4	3	3	2	3	3	3	4	2	3	3	4	3	3
2	3	4	3	3	3	3	4	4	3	3	4	4	3	3
3	3	4	3	3	3	3	3	3	3	3	3	3	2	4
4	4	2	2	3	4	4	4	3	3	3	3	2	4	3
5	2	2	2	2	2	2	2	2	2	2	2	3	2	5
6	3	3	3	3	3	3	3	3	3	3	3	3	3	4
7	3	3	3	3	3	3	4	3	3	3	3	3	3	4
8	4	4	3	3	2	4	3	4	3	3	3	4	3	4
9	3	3	3	0	4	3	3	3	3	3	3	3	4	3
Hematoma	4	4	4	4	3	4	3	3	4	4	4	4	3	3
Cyst
1	3	2	3	2	2	2	2	2	0	2	3	2	2	4
2	3	2	3	3	3	3	4	4	3	3	3	3	3	4
3	2	2	2	3	2	2	3	2	2	3	2	4	2	3
4	2	2	2	3	2	4	3	2	2	2	3	3	2	2
5	3	4	4	3	3	3	4	2	3	3	3	3	4	4
Carcinoma
1	4	5	4	4	4	4	4	5	3	4	4	4	4	3
2	3	4	4	3	3	3	4	3	3	4	4	3	4	3
3	4	5	4	5	4	5	4	5	0	4	5	5	4	3
4	4	4	4	4	4	3	4	5	3	3	5	5	4	2
5	4	4	5	5	4	3	5	5	4	4	4	5	4	3
6	4	4	4	3	4	3	4	5	3	3	4	4	4	4
7	4	5	5	5	4	5	5	5	5	4	5	5	5	3
8	4	5	5	5	5	5	5	5	4	4	5	5	4	3
9	4	4	4	2	2	4	5	5	5	4	4	5	4	5
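The three concordance levels used to group Table 1 can be expressed as a small rule. The sketch below is one possible reading of the definitions given in the Materials and Methods, not the authors' code: the function name and the set names are hypothetical, and category 0 is simply ignored when checking whether the remaining ratings lead to the same management recommendation. Category 1 does not appear in Table 1 and would be treated as a disagreement by this sketch.

```python
NEGATIVE = {2, 3}  # follow-up recommended
POSITIVE = {4, 5}  # biopsy recommended


def concordance_level(ratings):
    """Return 'total', 'partial', or 'disagreement' for one lesion."""
    if len(set(ratings)) == 1:
        return "total"
    # Category 0 (incomplete) defers the decision, so it is compatible
    # with either management group.
    decided = {r for r in ratings if r != 0}
    if decided <= NEGATIVE or decided <= POSITIVE:
        return "partial"
    return "disagreement"


# Rows copied from Table 1.
examples = {
    "fibroadenoma (complete)": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "fibroadenoma with one BI-RADS 0": [3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "carcinoma rated 4 or 5": [4, 4, 4, 5, 5, 4, 4, 5, 5, 5, 5, 5, 4, 4],
    "hematoma": [4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3],
}
for label, row in examples.items():
    print(label, "->", concordance_level(row))
# fibroadenoma (complete) -> total
# fibroadenoma with one BI-RADS 0 -> partial
# carcinoma rated 4 or 5 -> partial
# hematoma -> disagreement
```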

In the three cases of total agreement among all the reviewers, two fibroadenomas were classified as BI-RADS 3, and one carcinoma was classified as BI-RADS 4 (Table 1). In the 13 cases of partial agreement, 10 carcinomas were assessed as BI-RADS 4 or 5, with a recommendation for tissue sampling. In one of these cases, one observer classified the carcinoma as BI-RADS 0; this classification was considered to be partially concordant because this category demands further studies. One additional carcinoma in the partial agreement group was incorrectly classified as benign (BI-RADS 2 or 3) by all observers, and two fibroadenomas were classified as BI-RADS 2 or 3, except that one observer assigned BI-RADS 0 to one of them, which was considered to be partial agreement (Table 1). In the 24 cases of disagreement (Table 2), the histopathological analyses confirmed that 15 lesions were benign and nine were malignant. In five of the 15 benign cases, only one observer disagreed and 13 agreed. A single observer disagreed in three of the nine malignant cases. However, nine observers disagreed on a benign hematoma case (longest lesion axis = 18.2 mm), and eight observers disagreed on a carcinoma case (longest lesion axis = 26.9 mm). The proportions of discordant classifications did not differ significantly among the experience time categories (11%, 12%, and 15% for <5, 5–10, and >10 years, respectively; p = 0.62).

Table 2

The number of interobserver disagreements on cases, according to lesion type (histology) and years of experience, for the 14 observers using the BI-RADS classification system for breast ultrasonography.

Case type	Discordant observers by experience (years)			Lesion size (mm)	Concordant observers (n)
	<5	5-10	>10
Cyst
1	1	-	-	24.9	13
2	-	1	-	14.2	13
3	-	3	2	25.4	09
4	-	-	3	13.3	11
Fibrocystic Alteration	-	-	1	16.5	13
Hematoma	2	6	1	18.2	05
Fibroadenomas
1	1	1	1	6.4	11
2	2	1	2	24.0	09
3	-	1	1	13.6	12
4	-	4	1	22.1	09
5	-	1	-	11.4	13
6	-	1	-	10.1	13
7	-	2	-	19.2	12
8	1	3	2	13.9	08
9	-	2	-	8.2	12
Carcinoma
1	-	1	1	17.9	12
2	1	4	3	26.9	06
3	-	1	-	22.5	13
4	1	-	3	15.6	10
5	-	1	-	7.9	13
6	-	1	1	12.4	12
7	-	1	1	11.1	12
8	-	3	1	19.5	10
9	-	1	-	14.7	13
Total
Benign	7	26	14
Malignant	2	13	10
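The comparison of discordance across experience groups can be retraced from the totals of Table 2. The sketch below is a reconstruction under stated assumptions, not the authors' analysis: it assumes that every observer rated all 40 lesions (so the denominators are 2 × 40, 8 × 40, and 4 × 40 readings), that each reading counts as an independent observation, and that a Pearson Chi-square test on the resulting 3 × 2 table was used. The variable and function names are hypothetical. Under these assumptions the proportions round to the reported 11%, 12%, and 15%, and the p-value rounds to the reported 0.62.

```python
import math

# Discordant classifications per experience group (Table 2 totals: benign + malignant).
discordant = {"<5": 7 + 2, "5-10": 26 + 13, ">10": 14 + 10}
# Observers per experience group (Materials and Methods).
observers = {"<5": 2, "5-10": 8, ">10": 4}
N_LESIONS = 40  # assumption: every observer rated every lesion

readings = {group: n * N_LESIONS for group, n in observers.items()}
proportions = {group: discordant[group] / readings[group] for group in discordant}


def chi_square_statistic(discordant, readings):
    """Pearson Chi-square statistic for a groups x (discordant, concordant) table."""
    total_discordant = sum(discordant.values())
    total_readings = sum(readings.values())
    statistic = 0.0
    for group in discordant:
        cells = (
            (discordant[group], total_discordant),
            (readings[group] - discordant[group], total_readings - total_discordant),
        )
        for observed, column_total in cells:
            expected = readings[group] * column_total / total_readings
            statistic += (observed - expected) ** 2 / expected
    return statistic


chi2 = chi_square_statistic(discordant, readings)
# With three groups and two outcomes there are 2 degrees of freedom, for which
# the Chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)

for group, share in proportions.items():
    print(f"{group} years: {share:.1%} discordant")
print(f"chi2 = {chi2:.3f}, p = {p_value:.2f}")
```

The closed-form p-value applies only to the 2-degrees-of-freedom case shown here; a general test would need the Chi-square distribution for other table sizes.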

The kappa value for the original BI-RADS categories was 0.389 (fair agreement). This value was 0.612 when the categories were grouped as previously described, indicating substantial agreement.

DISCUSSION

Most inter- and intra-observer BI-RADS concordance studies have examined mammography because BI-RADS has been used for mammography since 1993. Recent studies of interobserver agreement in BI-RADS ultrasound assessments have yielded kappa values ranging from 0.28 to 0.83, indicating that the assessment of the morphological lesion characteristics is subjective (6-9,11).

One limitation of this study is the small number of cases (40) compared with other studies, which have included 55 to 267 cases (6-9,11). These cases did not constitute a random sample from the relevant female population. However, the cases were not selected according to pathological characteristics; therefore, no direct selection bias was apparent. No previously published study has used 14 observers, although one used 10 radiologists (with only 10 patients) (8). Additionally, most interobserver studies have used static image diagnosis (6-9,11). The exceptions include Berg et al. (8) and Bosch et al. (9), both of which were real-time analyses. A second limitation of this study is the retrospective analysis of photographic records rather than real-time examination, the latter being closer to the real clinical situation. However, no images were rejected by the observers. Another possible limitation is that we did not examine the possible correlations between clinical information and mammographic findings. The importance of these correlations may be seen in Skaane et al. (21), who concluded that the knowledge of previous mammography results is important for properly using BI-RADS in ultrasound. They measured kappa indices of 0.58 (range 0.52–0.66) for mammography, 0.48 (range 0.37–0.61) for ultrasound, and 0.71 (range 0.63–0.79) for both methods combined.

Berg et al. (8) measured a kappa value of 0.52 for the BI-RADS categorizations of 11 radiologists, which was comparable to the results of mammography agreement studies. After grouping BI-RADS categories 1, 2, and 4A together and categories 4B, 4C, and 5 together, Berg et al. (8) obtained a kappa value of 0.56. When the categories were dichotomized as BI-RADS 1, 2, 3 vs. 4A, 4B, 4C, 5, the kappa value was 0.48. These results differ from our current finding of an increase in the kappa value (from 0.389 to 0.612) after category grouping. Using BI-RADS for mammography and ultrasound, Lazarus et al. (23) identified a high concordance for highly suspicious lesions (κ = 0.56, BI-RADS 5). Similar results were obtained in our study: 11 of the 16 (complete or partial) agreement cases were classified as BI-RADS 4 or 5.

Baker and Soo (3) analyzed 152 photographic records from 86 hospitals; in 23 cases (15.1% of the records), they noted a disagreement in interpretation. These disagreements were defined as classification differences that resulted in treatment changes, similar to the definition of disagreement used in this study. Their discrepancies included four false-negative cases, 14 false-positive cases, three cases that were described as cysts but were found to be solid masses at biopsy, and two cases of differences between the sonographic and mammographic findings. We identified 24 cases of disagreement in our study, of which 15 were benign and nine were malignant lesions. Eight of the 14 observers in our study classified a case of medullary carcinoma as benign. This result is similar to one reported in Rahbar et al. (19), in which the observers agreed on the criteria leading to a benign classification, even in one case of medullary carcinoma. These misclassifications are understandable because this type of carcinoma is characterized by a partially circumscribed contour and a discrete posterior acoustic enhancement that can be confused with a complicated cyst.

Shimamoto et al. (15) evaluated 54 lesions (30 benign and 24 malignant) and reported accuracies ranging from 53.7% to 61.1% in the junior observers group and from 64.8% to 72.2% in the senior group. The authors suggested that agreement was more dependent on case difficulty than on observer experience. Although our study was not designed to evaluate the accuracy of the individual observers or to correlate that accuracy with variables such as experience and lesion size, the lack of significant differences among our experience categories points in the same direction. It is important to note that the observers in our study have used the BI-RADS lexicon since 2005, and this familiarity may explain why experience was not statistically significant. Perhaps experience would have been more of an influence if the study had included lesion detection.

Del Frate et al. (24) found that interobserver variation depended on the size of the lesion, with a better concordance (κ = 0.71–0.83) for lesions >7 mm. In Abdullah et al. (25), five breast radiologists retrospectively evaluated 267 breast masses (113 benign and 154 malignant) using the BI-RADS lexicon. The reviewers had no access to any other patient data. The interobserver BI-RADS agreement was assessed with the Aickin revised κ statistic and varied considerably (κ = 0.30). This result is similar to the value (κ = 0.28) reported by Lazarus et al. (23) and is slightly below the values reported in this study; however, it is lower than the value (κ = 0.53) reported by Lee et al. (11). This inconsistency is probably related to the subdivision of BI-RADS category 4 (i.e., 4A, 4B, 4C), which reduces the frequency of agreement. This consideration was also discussed by Lee et al. (11), who noted a low percentage of 4B responses (4.8% vs. 19.4% in Abdullah et al. (25)).

The most recent study, published by Lai et al. (26), used a methodology similar to ours. It evaluated 30 surgically resected breast lesions and used 12 observers with different amounts of experience in breast ultrasound with BI-RADS. For experienced observers, the kappa values of categories 3, 4 and 5 were 0.72, 0.28 and 0.60, respectively. The authors concluded that diagnostic agreement decreases as the breast imaging experience of the radiologist decreases. Our study did not find experience to be directly related to agreement. This difference is perhaps explained by the most-experienced group of professionals in Lai et al. having more than three years of experience, while the least-experienced professionals in our study had less than five years.

Several studies have proposed using diagnostic methodologies based on image parameter estimation to improve the consistency of image interpretation. These techniques aim to quantify the morphological characteristics of tumors, such as shape and texture, and to use the results for differentiating between benignancy and malignancy. These complex procedures, including computer-aided diagnosis (CAD) systems that may reduce discrepancies between observers and thus improve ultrasound accuracy, continue to be investigated (27-30).

In practice, BI-RADS categorization is defined by a combination of the mammographic and sonographic features, but this was not the case in our study, in which only the sonographic images were available to the observers. Although the sample used here included only 40 lesions, this study identified critical issues that deserve attention and further inquiry. Our kappa value for the BI-RADS classification (0.389, fair) indicates the need for standardization. Our results also indicate the need for a more meticulous version of BI-RADS, for real-time quantitative lesion analysis to reduce observer variation, and for improvement in the accuracy of ultrasound examinations.

27 in total

1. Breast US: assessment of technical quality and image interpretation.

Authors: Jay A Baker; Mary Scott Soo
Journal: Radiology Date: 2002-04 Impact factor: 11.105

2. Sonography of solid breast lesions: observer variability of lesion description and assessment.

Authors: J A Baker; P J Kornguth; M S Soo; R Walsh; P Mengoni
Journal: AJR Am J Roentgenol Date: 1999-06 Impact factor: 3.959

3. BI-RADS for sonography: positive and negative predictive values of sonographic features.

Authors: Andrea S Hong; Eric L Rosen; Mary S Soo; Jay A Baker
Journal: AJR Am J Roentgenol Date: 2005-04 Impact factor: 3.959

4. Observer variability of Breast Imaging Reporting and Data System (BI-RADS) for breast ultrasound.

Authors: Hye-Jeong Lee; Eun-Kyung Kim; Min Jung Kim; Ji Hyun Youk; Ji Young Lee; Dae Ryong Kang; Ki Keun Oh
Journal: Eur J Radiol Date: 2007-05-24 Impact factor: 3.528

5. Intraobserver interpretation of breast ultrasonography following the BI-RADS classification.

Authors: M J G Calas; R M V R Almeida; B Gutfilen; W C A Pereira
Journal: Eur J Radiol Date: 2009-05-06 Impact factor: 3.528

6. Sonographic criteria for differentiation of benign and malignant solid breast lesions: size is of value.

Authors: C Del Frate; A Bestagno; R Cerniato; F Soldano; M Isola; F Puglisi; M Bazzocchi
Journal: Radiol Med Date: 2006-08-11 Impact factor: 3.469

7. Interreader variability and predictive value of US descriptions of solid breast masses: pilot study.

Authors: P H Arger; C M Sehgal; E F Conant; J Zuckerman; S E Rowling; J A Patton
Journal: Acad Radiol Date: 2001-04 Impact factor: 3.173

8. Complexity curve and grey level co-occurrence matrix in the texture evaluation of breast tumor on ultrasound images.

Authors: André Victor Alvarenga; Wagner C A Pereira; Antonio Fernando C Infantosi; Carolina M Azevedo
Journal: Med Phys Date: 2007-02 Impact factor: 4.071

9. Analysis of sonographic features in the differentiation of fibroadenoma and invasive ductal carcinoma.

Authors: P Skaane; K Engedal
Journal: AJR Am J Roentgenol Date: 1998-01 Impact factor: 3.959

10. Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer.

Authors: Wendie A Berg; Jeffrey D Blume; Jean B Cormack; Ellen B Mendelson; Daniel Lehrer; Marcela Böhm-Vélez; Etta D Pisano; Roberta A Jong; W Phil Evans; Marilyn J Morton; Mary C Mahoney; Linda Hovanessian Larsen; Richard G Barr; Dione M Farria; Helga S Marques; Karan Boparai
Journal: JAMA Date: 2008-05-14 Impact factor: 56.272

