| Literature DB >> 34561440 |
Yiqiu Shen1, Farah E Shamout2, Jamie R Oliver3, Jan Witowski3, Kawshik Kannan4, Jungkyu Park5, Nan Wu1, Connor Huddleston3, Stacey Wolfson3, Alexandra Millet3, Robin Ehrenpreis3, Divya Awal3, Cathy Tyma3, Naziya Samreen3, Yiming Gao3, Chloe Chhor3, Stacey Gandhi3, Cindy Lee3, Sheila Kumari-Subaiya3, Cindy Leonard3, Reyhan Mohammed3, Christopher Moczulski3, Jaime Altabet3, James Babb3, Alana Lewin3, Beatriu Reig3, Linda Moy3,5, Laura Heacock3, Krzysztof J Geras6,7,8.
Abstract
Though consistently shown to detect mammographically occult cancers, breast ultrasound has been noted to have high false-positive rates. In this work, we present an AI system that achieves radiologist-level accuracy in identifying breast cancer in ultrasound images. Developed on 288,767 exams, consisting of 5,442,907 B-mode and Color Doppler images, the AI achieves an area under the receiver operating characteristic curve (AUROC) of 0.976 on a test set consisting of 44,755 exams. In a retrospective reader study, the AI achieves a higher AUROC than the average of ten board-certified breast radiologists (AUROC: 0.962 AI, 0.924 ± 0.02 radiologists). With the help of the AI, radiologists decrease their false positive rates by 37.3% and reduce requested biopsies by 27.8%, while maintaining the same level of sensitivity. This highlights the potential of AI in improving the accuracy, consistency, and efficiency of breast ultrasound diagnosis.
Year: 2021 PMID: 34561440 PMCID: PMC8463596 DOI: 10.1038/s41467-021-26023-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1Overview of the system’s pipeline.
a US images were pre-processed to extract the breast laterality (i.e., left or right breast) and to include only the part of the image which shows the breast (cropping out the image periphery, which typically contains textual metadata about the patient and the US acquisition technique). b For each breast, we assigned a cancer label using the recorded pathology reports for the respective patient within −30 to 120 days from the time of the US examination. We applied additional filtering on the internal test set to ensure that cancers in positive exams are visible in the US images and negative exams have at least one cancer-negative follow-up (see Methods section `Additional filtering of the test'). c The AI system processes all US images acquired from one breast to compute probabilistic predictions for the presence of malignant lesions. The AI system also generates saliency maps that indicate the informative regions in each image. d We evaluated the system on an internal test set (AUROC: 0.976, 95% CI: 0.972, 0.980, n = 79,156 breasts) and an external test set (AUROC: 0.927, 95% CI: 0.907, 0.959, n = 780 images). e In a reader study consisting of 663 exams (n = 1024 breasts), we showed that the AI system can improve the specificity and positive predictive value (PPV) for 10 attending radiologists while maintaining the same level of sensitivity and negative predictive value (NPV).
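The pathology-matching window described in panel b (a malignant report between −30 and +120 days of the exam) can be sketched as a simple date comparison. The function below is a hypothetical illustration of that labeling rule, not the authors' code; the function name and parameters are assumptions.

```python
from datetime import date, timedelta

def assign_cancer_label(exam_date, malignant_pathology_dates,
                        days_before=30, days_after=120):
    """Label a breast as cancer-positive if any malignant pathology
    report for the patient falls within -30 to +120 days of the
    ultrasound exam date (the window used in the paper)."""
    window_start = exam_date - timedelta(days=days_before)
    window_end = exam_date + timedelta(days=days_after)
    return any(window_start <= d <= window_end
               for d in malignant_pathology_dates)
```

A report 60 days after the exam yields a positive label; one seven months earlier does not.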
Statistics of the overall NYU Breast Ultrasound Dataset, internal test set, and reader study set. This dataset was collected from NYU Langone Health over an eight-year period. Exam-level BI-RADS were issued by radiologists based on patients’ breast US exams. Breast densities were determined using existing screening and diagnostic mammography reports. Patients who were not matched with any mammograms were assigned “unknown” for breast density. Abbreviations: N, number; SD, standard deviation.
| Characteristic, unit | Overall | Internal test set | Reader study |
|---|---|---|---|
| Patients, N | 143,203 | 25,003 | 644 |
| Age, mean years (SD) | 53.7 (13.7) | 55.5 (12.7) | 52.8 (14.0) |
| < 40 years old, N (%) | 18,218 (12.7) | 1857 (7.4) | 90 (14.0) |
| 40–49 years old, N (%) | 33,955 (23.7) | 5811 (23.2) | 175 (27.2) |
| 50–59 years old, N (%) | 34,942 (24.4) | 6567 (26.3) | 146 (22.7) |
| 60–69 years old, N (%) | 26,671 (18.6) | 5198 (20.8) | 104 (16.1) |
| ≥ 70 years old, N (%) | 17,703 (12.4) | 3359 (13.4) | 81 (12.6) |
| Exams, N | 288,767 | 44,755 | 663 |
| Images, N | 5,442,907 | 858,636 | 13,582 |
| Average no. of images per exam | 18 | 19 | 20 |
| Exams associated with biopsy, N (%) | 28,914 (10.0) | 8337 (18.6) | 587 (88.5) |
| Breasts, N | 510,271 | 79,156 | 1024 |
| Breasts with benign findings, N | 26,843 | 7879 | 567 |
| Breasts with malignant findings, N | 5593 | 1324 | 73 |
| Exam-level BI-RADS | |||
| BI-RADS 0, N (%) | 14,078 (4.9) | 1092 (2.4) | 80 (12.1) |
| BI-RADS 1, N (%) | 86,347 (29.9) | 12,374 (27.6) | 56 (8.4) |
| BI-RADS 2, N (%) | 136,322 (47.2) | 21,675 (48.4) | 80 (12.1) |
| BI-RADS 3, N (%) | 27,711 (9.6) | 3586 (8.0) | 25 (3.8) |
| BI-RADS 4, N (%) | 22,133 (7.7) | 5578 (12.5) | 391 (59.0) |
| BI-RADS 5, N (%) | 1348 (0.5) | 338 (0.8) | 22 (3.3) |
| BI-RADS 6, N (%) | 518 (0.2) | 69 (0.2) | 3 (0.5) |
| Unknown BI-RADS, N (%) | 310 (0.1) | 43 (0.1) | 6 (0.9) |
| Exam-level mammographic density | |||
| A (breasts are almost entirely fatty), N (%) | 5384 (1.9) | 695 (1.6) | 13 (2.0) |
| B (scattered areas of fibroglandular density), N (%) | 69,948 (24.2) | 11,048 (24.7) | 143 (21.6) |
| C (breasts are heterogeneously dense), N (%) | 165,855 (57.4) | 26,509 (59.2) | 376 (56.7) |
| D (breasts are extremely dense), N (%) | 31,829 (11.0) | 5189 (11.6) | 76 (11.5) |
| Unknown density, N (%) | 15,751 (5.5) | 1314 (2.9) | 55 (8.3) |
AI performance on the internal test set across different sub-populations. We report the AUROC of the AI system with 95% confidence intervals on the internal test set. The biopsied population includes only exams where at least one biopsy was recommended. We stratified exams based on patient age, mammographic breast density, and the manufacturer of the US devices. Mammographic breast density was categorized based on the BI-RADS standards [69].
| Population | AUROC (95% CI) | No. of breasts | No. of cancers |
|---|---|---|---|
| Overall population | 0.976 (0.972, 0.980) | 79,078 | 1248 |
| Biopsied population | 0.940 (0.934, 0.947) | 12,973 | 1248 |
| Age | |||
| < 40 yrs old | 0.969 (0.955, 0.982) | 5176 | 72 |
| 40 − 49 yrs old | 0.970 (0.955, 0.986) | 19,677 | 160 |
| 50 − 59 yrs old | 0.981 (0.975, 0.986) | 24,142 | 292 |
| 60 − 69 yrs old | 0.980 (0.973, 0.985) | 19,039 | 326 |
| ≥70 yrs old | 0.969 (0.958, 0.981) | 11,044 | 398 |
| Breast density | |||
| Entirely fatty | 0.964 (0.942, 0.983) | 1157 | 54 |
| Scattered fibroglandular densities | 0.975 (0.961, 0.982) | 19,199 | 441 |
| Heterogeneously dense | 0.979 (0.974, 0.981) | 47,255 | 610 |
| Extremely dense | 0.964 (0.932, 0.973) | 9398 | 90 |
| Unknown | 0.970 (0.955, 0.983) | 2069 | 53 |
| Manufacturer | |||
| GE | 0.984 (0.968, 0.993) | 5708 | 47 |
| Medison | 0.990 (0.974, 0.996) | 2673 | 13 |
| Philips | 0.977 (0.970, 0.982) | 28,943 | 412 |
| Siemens | 0.974 (0.968, 0.980) | 37,572 | 699 |
| Toshiba | 0.986 (0.978, 0.992) | 4180 | 77 |
| Other | — | 2 | 0 |
Fig. 2Reader study results.
The performance of the AI system on the reader study population (n = 1024 breasts) using ROC curve (a) and precision-recall curve (b). The AI achieved 0.962 (95% CI: 0.943, 0.979) AUROC and 0.752 (95% CI: 0.675, 0.849) AUPRC. Each data point represents a single reader and the triangles correspond to the average reader performance. The inset shows a magnification of the gray shaded region.
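The AUROC reported throughout can be computed without plotting the curve, via the rank (Mann–Whitney U) statistic: it equals the probability that a randomly chosen positive breast receives a higher malignancy score than a randomly chosen negative one. A minimal sketch (illustrative only; in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
def auroc(labels, scores):
    """AUROC via the rank statistic: the fraction of positive/negative
    pairs in which the positive case outscores the negative case,
    with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect classifier scores 1.0; a classifier that assigns every case the same score gets 0.5, the chance level.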
Fig. 3Qualitative analysis of saliency maps.
In each of the six cases (a–f) from the reader study, we visualized the sagittal and transverse views of the lesion (left) and the AI’s saliency maps indicating the predicted locations of benign (middle) and malignant (right) findings (see Methods section `Deep neural network architecture'). Exams a–c display lesions that were ultimately biopsied and found to be malignant. All readers and the AI system correctly classified exams a–b as suspicious for malignancy. However, the majority of readers (7/10) and the AI system incorrectly classified case c as benign. Cases d–f display lesions that were biopsied and found to be benign. The majority of readers incorrectly classified exams d (9/10), e (10/10), and f (10/10) as suspicious for malignancy and recommended the lesions undergo biopsy. In contrast, the AI system classified exam d as malignant, but correctly identified exams e–f as being benign.
Fig. 4Performance of readers, AI, and hybrid models.
We reported the observed values (measure of center) and 95% confidence intervals (error bars) of AUROC (a), AUPRC (b), specificity (c), biopsy rate (d), and PPV (e) of ten radiologists (R1-R10), AI, and the hybrid models on the reader study set (n = 1024 breasts) The predictions of each hybrid model are weighted averages of each reader’s BI-RADS scores and the AI’s probablistic predictions (see Methods section `Hybrid model'). We dichotomized each hybrid model’s probabilistic predictions to match the sensitivity of its respective reader. We dichotomized the AI’s predictions to match the average radiologists' sensitivity. The collaboration between AI and readers improves readers' AUROC, AUPRC, specificity, and PPV, while reducing biopsy rate. We estimated the 95% confidence intervals by 1000 iterations of the bootstrap method.
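The two steps described in the caption — blending a reader's BI-RADS score with the AI's probability, then dichotomizing at a sensitivity-matched threshold — can be sketched as below. The equal weighting, the linear mapping of BI-RADS 0–6 onto [0, 1], and both function names are illustrative assumptions, not the paper's actual parameterization.

```python
import math

def hybrid_score(birads, ai_prob, reader_weight=0.5):
    """Blend a reader's BI-RADS score (0-6, mapped linearly onto [0, 1])
    with the AI's probabilistic prediction. The 0.5 weight and the
    linear mapping are assumptions for illustration."""
    return reader_weight * (birads / 6.0) + (1 - reader_weight) * ai_prob

def threshold_matching_sensitivity(labels, scores, target_sensitivity):
    """Return the decision threshold at which dichotomized scores
    reach at least the target sensitivity on the positive cases."""
    pos_scores = sorted((s for y, s in zip(labels, scores) if y == 1),
                        reverse=True)
    k = max(1, math.ceil(target_sensitivity * len(pos_scores)))
    return pos_scores[k - 1]
```

Dichotomizing each hybrid model at the threshold that reproduces its reader's sensitivity is what lets specificity, PPV, and biopsy rate be compared fairly across readers, AI, and hybrids.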