Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms
Thomas Schaffter1, Diana S M Buist2, Christoph I Lee3, Yaroslav Nikulin4, Dezso Ribli5, Yuanfang Guan6, William Lotter7, Zequn Jie8, Hao Du9, Sijia Wang10, Jiashi Feng11, Mengling Feng12, Hyo-Eun Kim13, Francisco Albiol14, Alberto Albiol15, Stephen Morrell16, Zbigniew Wojna17, Mehmet Eren Ahsen18, Umar Asif19, Antonio Jimeno Yepes19, Shivanthan Yohanandan19, Simona Rabinovici-Cohen20, Darvin Yi21, Bruce Hoff1, Thomas Yu1, Elias Chaibub Neto1, Daniel L Rubin22, Peter Lindholm23, Laurie R Margolies24, Russell Bailey McBride25, Joseph H Rothstein26, Weiva Sieh27, Rami Ben-Ari20, Stefan Harrer19, Andrew Trister28, Stephen Friend1, Thea Norman29, Berkman Sahiner30, Fredrik Strand31,32, Justin Guinney1, Gustavo Stolovitzky33, Lester Mackey34, Joyce Cahoon35, Li Shen36, Jae Ho Sohn37, Hari Trivedi38, Yiqiu Shen39, Ljubomir Buturovic40, Jose Costa Pereira41, Jaime S Cardoso41, Eduardo Castro41, Karl Trygve Kalleberg42, Obioma Pelka43,44, Imane Nedjar45, Krzysztof J Geras46, Felix Nensa44, Ethan Goan47, Sven Koitka43,46, Luis Caballero14, David D Cox48, Pavitra Krishnaswamy49, Gaurav Pandey26,50, Christoph M Friedrich43, Dimitri Perrin47, Clinton Fookes47, Bibo Shi51, Gerard Cardoso Negrie52, Michael Kawczynski53, Kyunghyun Cho39, Can Son Khoo54, Joseph Y Lo55, A Gregory Sorensen7, Hwejin Jung56.
Abstract
Importance: Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. Objective: To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms. Design, Setting, and Participants: In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016. Main Outcomes and Measures: Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that was translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve, and algorithm specificity was compared with radiologists' specificity with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated.
Year: 2020 PMID: 32119094 PMCID: PMC7052735 DOI: 10.1001/jamanetworkopen.2020.0265
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
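As a concrete reading of the abstract's main outcome measures, the following is a minimal Python sketch, not the challenge's scoring code, of computing AUC and the specificity reached when the decision threshold is pinned to the radiologists' sensitivity (85.9% in the US cohort). The toy data and all names are illustrative.

```python
# A minimal sketch, assuming per-examination cancer scores; not the
# challenge's actual evaluation pipeline.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def specificity_at_sensitivity(y_true, y_score, target_sensitivity=0.859):
    """Specificity at the first ROC point whose sensitivity >= target."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = np.searchsorted(tpr, target_sensitivity, side="left")  # tpr is non-decreasing
    return 1.0 - fpr[idx]

# Toy labels (1 = cancer diagnosed within 12 mo) and scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = 0.5 * y_true + 0.8 * rng.random(1000)

print("AUC:", roc_auc_score(y_true, y_score))
print("Specificity at 85.9% sensitivity:",
      specificity_at_sensitivity(y_true, y_score))
```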
Figure 1. Training and Evaluation of Algorithms During the Digital Mammography DREAM Challenge
Kaiser Permanente Washington (KPW) and Karolinska Institute data were not directly available to challenge participants; they were stored behind a firewall in a secure cloud (gray box). To access the data, participants submitted models to be run behind the firewall, in the graphics processing unit (GPU)–accelerated cloud (IBM). A, Training and evaluation of models submitted by teams during the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenge. B, A subset of the 8 best-performing models on the KPW evaluation dataset were combined into the Challenge Ensemble Method (CEM), trained using the KPW training set and evaluated on the KPW and Karolinska Institute evaluation datasets. C, We developed a final ensemble method incorporating radiologists' interpretations, called CEM+radiologist (CEM+R). AUC indicates area under the curve; pAUC, partial area under the curve.
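The legend describes two ensembling stages. Below is a hedged sketch of one plausible implementation (the paper's actual CEM construction may differ): stage 1 rank-averages the top models' scores so that differently calibrated outputs become commensurable, and stage 2 stacks the ensemble score with the radiologist's binary recall in a logistic-regression combiner. The function names `cem_score` and `fit_cem_plus_r` are illustrative, not from the challenge code.

```python
# A sketch of two-stage ensembling in the spirit of CEM / CEM+R;
# assumptions, not the paper's method.
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

def cem_score(model_scores):
    """Rank-average scores from K models; model_scores has shape (n_exams, K)."""
    ranks = np.column_stack([rankdata(col) for col in model_scores.T])
    return ranks.mean(axis=1) / len(model_scores)  # normalize to (0, 1]

def fit_cem_plus_r(cem, radiologist_recall, y):
    """CEM+R combiner: ensemble score + binary recall -> cancer probability."""
    X = np.column_stack([cem, radiologist_recall])
    return LogisticRegression().fit(X, y)

# Toy usage: 8 models scoring 1000 examinations.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = rng.random((1000, 8)) + 0.3 * y[:, None]
recall = rng.integers(0, 2, size=1000)
cem_plus_r = fit_cem_plus_r(cem_score(scores), recall, y)
```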
Table. Composition of the Data Sets From Kaiser Permanente Washington and Karolinska Institute
| Characteristic | Kaiser Permanente Washington, Training | Kaiser Permanente Washington, Evaluation | Karolinska Institute, Evaluation |
|---|---|---|---|
| Screening examinations, No. | 100 974 | 43 257 | 166 578 |
| Women, No. | 59 923 | 25 657 | 68 008 |
| Women diagnosed with breast cancer within 12 mo of mammogram, No. (%) | 669 (1.1) | 283 (1.1) | 780 (1.1) |
| Women without a breast cancer diagnosis within 12 mo of mammogram, No. (%) | 59 254 (98.9) | 25 374 (98.9) | 67 228 (98.9) |
| Invasive breast cancers, No. (%) | 495 (74.0) | 202 (71.4) | 681 (87.3) |
| Ductal carcinoma in situ, No. (%) | 174 (26.0) | 81 (28.6) | 99 (12.7) |
| Age, mean (SD), y | 58.4 (9.7) | 58.4 (9.7) | 53.3 (9.4) |
| BMI, mean (SD) | 28.2 (6.9) | 28.1 (6.8) | NA |
| Women with ≥1 prior mammogram, No. (%) | 27 165 (45.3) | 11 651 (45.4) | 50 358 (74.2) |
| Time since last mammogram, mode 1, mean (SD), mo | 12.8 (1.7) | 12.9 (1.7) | 18.9 (0.9) |
| Time since last mammogram, mode 2, mean (SD), mo | 24.2 (2.1) | 24.2 (2.1) | 24.9 (1.1) |
Abbreviations: BMI, body mass index (calculated as weight in kilograms divided by height in meters squared); NA, not applicable.
Subchallenge 1 provided access only to the digital mammogram images from the most recent screening examination.
Subchallenge 2 provided access to all screening images for the most recent screening examination and, when available, previous screening examinations.
Figure 2. Performance of the Algorithms Submitted at the End of the Competitive Phase
Individual algorithm performance at the end of the competitive phase on Kaiser Permanente Washington (KPW) and Karolinska Institute data. A, Area under the curve (AUC) and specificity computed at KPW radiologists' sensitivity of 85.9% for the 31 methods submitted to the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenge and evaluated on the KPW evaluation set. B, The performance of methods is not significantly higher when clinical, demographic, and longitudinal data are provided. C-D, The AUC and specificity (computed at the Breast Cancer Surveillance Consortium's sensitivity of 86.9%) of methods evaluated on the KPW evaluation set generalize to the Karolinska Institute data.
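Panel B's claim (no significant gain from adding clinical, demographic, and longitudinal data) invites a paired comparison of two score sets on the same examinations. The following is a generic paired-bootstrap recipe for a confidence interval on the AUC difference; it is a sketch under stated assumptions, not the challenge's statistical procedure.

```python
# Paired bootstrap over examinations: a 95% CI on the AUC difference between
# two score sets (e.g., images alone vs. images plus clinical data).
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_diff(y, score_a, score_b, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample exams with replacement
        if y[idx].min() == y[idx].max():
            continue  # single-class resample; AUC undefined
        diffs.append(roc_auc_score(y[idx], score_b[idx])
                     - roc_auc_score(y[idx], score_a[idx]))
    return np.percentile(diffs, [2.5, 97.5])  # CI straddling 0 => no clear gain
```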
Figure 3. Receiver Operating Characteristic Curves of the Best Individual CEM and CEM+R Methods
Receiver operating characteristic curves of the best individual method (orange), challenge ensemble method (CEM) (light blue), and challenge ensemble method + radiologist (CEM+R) method (dark blue) in Kaiser Permamente Washington (KPW) (A) and Karolinska Institute (KI) (B-C) data sets for single radiologist and consensus. The black cross reports the sensitivity and specificity achieved by the radiologist(s) in the corresponding cohort. AUC indicates area under the curve; FPR, false-positive rate; TPR, true-positive rate.
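A Figure 3-style panel can be sketched with standard tooling; in the snippet below the score vectors and the radiologists' operating point are synthetic placeholders, not the paper's values.

```python
# Illustrative ROC panel with a radiologist operating point drawn as a cross.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
methods = {  # placeholder score vectors for the three curves
    "Best method": 0.6 * y + rng.random(2000),
    "CEM": 0.8 * y + rng.random(2000),
    "CEM+R": 1.0 * y + rng.random(2000),
}

fig, ax = plt.subplots()
for label, score in methods.items():
    fpr, tpr, _ = roc_curve(y, score)
    ax.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y, score):.2f})")
# Radiologists' operating point: FPR = 1 - specificity, TPR = sensitivity
# (placeholder values, not the cohort's).
ax.plot(1 - 0.90, 0.86, "k+", markersize=12, label="Radiologist(s)")
ax.set_xlabel("FPR (1 - specificity)")
ax.set_ylabel("TPR (sensitivity)")
ax.legend(loc="lower right")
plt.show()
```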
Figure 4. Comparison of the Specificity of Radiologist(s) and CEM+R on Kaiser Permanente Washington (KPW) and Karolinska Institute (KI) Data
Comparison of the specificity of radiologist(s) and the challenge ensemble method + radiologist (CEM+R) under different clinical/demographic conditions on KPW and KI data. For each condition, we report the CI of the specificity of the radiologist(s) (blue) and CEM+R (orange) computed at the sensitivity of the radiologists. The better-performing approach can be identified when the 2 CIs do not overlap. DCIS indicates ductal carcinoma in situ.
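The CI-overlap comparison in Figure 4 reduces, per stratum, to interval estimates of specificity at matched sensitivity. The sketch below bootstraps such an interval from binary recall decisions; it is a generic recipe under stated assumptions, not the paper's analysis code, and `bootstrap_specificity_ci` is an illustrative name.

```python
# Per-stratum 95% CI for specificity from binary recall decisions
# (1 = recalled); non-overlapping intervals for the radiologists and CEM+R
# identify the better-performing approach.
import numpy as np

def bootstrap_specificity_ci(y, recalled, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    negatives = np.flatnonzero(y == 0)  # women without cancer within 12 mo
    spec = [
        1.0 - recalled[rng.choice(negatives, size=len(negatives))].mean()
        for _ in range(n_boot)
    ]
    return np.percentile(spec, [2.5, 97.5])
```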