Pathways to breast cancer screening artificial intelligence algorithm validation.

Christoph I Lee¹, Nehmat Houssami², Joann G Elmore³, Diana S M Buist⁴.

Abstract

As more artificial intelligence (AI)-enhanced mammography screening tools enter the clinical market, greater focus will be placed on external validation in diverse patient populations. In this viewpoint, we outline lessons learned from prior efforts in this field, describe the need to validate algorithms on newer screening technologies and in diverse patient populations, and conclude by discussing the need for a framework for continuous monitoring and recalibration of these AI tools. Sufficient validation and continuous monitoring of emerging AI tools for breast cancer screening will require greater stakeholder engagement and the creation of shared policies and guidelines.

Keywords: Artificial intelligence; Breast cancer; Mammography; Population health; Reproducibility; Screening; Transparency; Validation

Year: 2019 PMID: 31540699 PMCID: PMC7061055 DOI: 10.1016/j.breast.2019.09.005

Source DB: PubMed Journal: Breast ISSN: 0960-9776 Impact factor: 4.380

Introduction

Promising reports of artificial intelligence (AI) algorithms from reader studies involving limited imaging case sets indicate that they may improve mammography screening accuracy beyond radiologist interpretation alone [[1], [2], [3]]. Several of these AI tools have garnered medical device regulatory approval within multiple countries, including in the U.S. from the Food and Drug Administration (FDA) [4]. With regulatory approval, these commercial products can now be marketed for clinical use directly to stakeholders including radiologists and physician groups. However, rigorous validation in large, diverse patient populations that were not involved in the original AI algorithm development is required before clinical translation. Moreover, key stakeholders, including major payers, providers, and women undergoing routine screening, need convincing evidence that these new tools can reliably improve screening performance beyond current practice standards. Given the “black box” nature of AI algorithms, there are a number of unique challenges in the process of algorithm validation and stakeholder acceptance. There is a myriad of technical, social, political, and ethical issues regarding AI algorithm validation, as well as multiple stakeholder viewpoints, beyond the scope of a single article. Thus, in this viewpoint, we focus on some of the major pressing issues in validating algorithms from the perspective of AI developers, organizations with imaging data, and regulatory agencies.

Learning from the past

In the U.S., the bar for FDA medical device approval remains low, with small reader studies showing non-inferiority to existing performance being sufficient for regulatory clearance [5]. Case sets used in FDA approval reader studies usually number in the hundreds of exams and are enriched with positive cases, so the resulting performance measures do not reliably apply to routine screening at the population level, where positive cases are far less common [6]. Once this initial regulatory bar is met, however, an AI software device can be marketed directly to radiology groups and health systems without any demonstration of reproducibility in multiple populations and settings. There are serious consequences of adopting new technologies in medicine without supporting validation and reproducibility [7]. In mammography screening, we have encountered these consequences with traditional computer-aided detection (CAD) software. CAD was rapidly adopted around the turn of the century in combination with digital mammography (which was concurrently replacing screen-film mammography as a primary screening modality) without robust observational studies or randomized trials to suggest improved screening performance at the population level [8]. Instead, based on small reader studies used to gain FDA approval and heavy lobbying from vendors to obtain reimbursement, CAD was widely adopted into clinical practice [8]. Unfortunately, through observational studies performed over the next decade, CAD was eventually found to increase false-positives and benign biopsies without increasing cancer detection rate [[9], [10], [11]]. The result was a substantially increased cost to healthcare systems and women undergoing screening, without realization of the promised benefit [10,11].
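The point about enriched case sets can be illustrated with a short calculation. The sketch below is not taken from the article or any cited study: the sensitivity, specificity, and prevalence values are hypothetical, chosen only to show how the same operating point yields a very different positive predictive value (PPV) when cancers are rare.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive calls that are true cancers (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical operating point, NOT drawn from any cited study.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Assumed prevalences: half of cases positive in an enriched reader study,
# versus 5 cancers per 1,000 exams in routine screening (illustrative only).
enriched = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.50)
screening = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.005)

print(f"PPV in enriched case set:    {enriched:.3f}")
print(f"PPV at screening prevalence: {screening:.3f}")
```

Under these assumed values, most positive calls in the enriched set are true cancers, while most positive calls at screening prevalence are false positives, even though sensitivity and specificity are unchanged.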

Validation on newer screening technologies

One of the first steps towards clinical translation for promising AI algorithms trained and tested on existing mammography datasets will be adaptation to and validation using frequently evolving screening technologies. To date, nearly all of the published reader studies demonstrating improved screening accuracy with AI have used 2D digital mammography or screen-film mammography [6]. The largest mammography AI study to date, the Digital Mammography DREAM Challenge, provided 2D mammography images and associated clinical data representing >640,000 images from >86,000 women to DREAM Challenge participants for training and validation of their deep learning algorithms for automated mammography interpretation [12]. While digital mammography is currently the most widely used imaging modality for breast cancer screening and further AI algorithm development is needed prior to dissemination and clinical implementation, screening imaging technologies themselves are rapidly changing. Digital breast tomosynthesis (DBT, 3D mammography) is quickly usurping the role of digital mammography as the first-line screening modality of choice in many settings. This is in part due to population-based studies suggesting higher cancer detection rate and possibly lower interval cancer rate with DBT compared to digital mammography [13,14]. With the majority of U.S. facilities and many European population-based programs evaluating or transitioning to DBT screening, a strong argument can be made that current AI algorithms need to be effectively scaled from 2D to 3D volumetric data. While the assumption is that 2D to 3D algorithm scaling will be straightforward, this is not guaranteed. Thus, in order to remain relevant for clinical application, emerging AI algorithms will need to be trained and validated on large DBT imaging datasets and be able to adapt to further advancements in primary screening imaging modalities in the future.

Defining sufficient algorithm validation

After a promising AI algorithm has been trained and tested on a large modern imaging dataset and has gained regulatory approval, external validation in diverse patient populations is needed to demonstrate generalizability and clinical effectiveness. There is increasing concern that AI models have structural biases based on the imaging exams and populations included in initial training and testing, with calls for more distributive justice in the initial model design, evaluation, and deployment [15]. AI developers will need to demonstrate improved screening performance in large diverse populations, including women of minority race/ethnicity and differing breast cancer risk factors. External validation should be performed in population-based screening programs and also in many different clinical settings (e.g., double reader and single reader environments). In countries without centralized screening programs, such as the U.S., algorithms need to be validated in large health systems and in different geographic regions in order to ensure that there is no unintended bias against specific subpopulations, especially traditionally vulnerable populations. It is also uncertain whether retrospective validation (the predominant approach used in studies of AI for breast cancer detection thus far [6]) is sufficient or if prospective randomized and/or pragmatic trials are needed to convince key stakeholders that AI-driven mammography screening (with or without radiologist involvement) is more accurate than traditional radiologist interpretation alone. In other words, the actual threshold required to validly claim external validation of a promising AI algorithm remains up for debate. Benchmarks for the adequate size of validation datasets, the adequate diversity of validation populations, and the degree of improvement in accuracy that AI-based screening must show over human interpretation alone have yet to be established.
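One way to make the subpopulation concern concrete is to report accuracy separately for each subgroup rather than as a single pooled figure. The following is a minimal sketch under stated assumptions: the record layout, the subgroup labels, and the function name `stratified_performance` are all hypothetical, and the data are synthetic toy values rather than results from any study.

```python
from collections import defaultdict


def stratified_performance(records):
    """Compute sensitivity and specificity separately for each subgroup.

    Each record is (subgroup, has_cancer, ai_positive), where has_cancer is
    the ground truth from follow-up and ai_positive is the algorithm's call.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for subgroup, has_cancer, ai_positive in records:
        if has_cancer:
            key = "tp" if ai_positive else "fn"
        else:
            key = "fp" if ai_positive else "tn"
        counts[subgroup][key] += 1

    results = {}
    for subgroup, c in counts.items():
        cancers = c["tp"] + c["fn"]
        non_cancers = c["tn"] + c["fp"]
        results[subgroup] = {
            # None marks a metric that cannot be computed for this subgroup.
            "sensitivity": c["tp"] / cancers if cancers else None,
            "specificity": c["tn"] / non_cancers if non_cancers else None,
            "n": cancers + non_cancers,
        }
    return results


# Synthetic toy records (hypothetical subgroups, not real patient data).
records = [
    ("group_a", True, True),
    ("group_a", True, True),
    ("group_a", False, False),
    ("group_a", False, True),
    ("group_b", True, True),
    ("group_b", True, False),
    ("group_b", False, False),
    ("group_b", False, False),
]

performance = stratified_performance(records)
for subgroup, metrics in sorted(performance.items()):
    print(subgroup, metrics)
```

With realistic sample sizes, each subgroup estimate would also need a confidence interval, since small subgroups can show large differences by chance alone.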

Access to population-based data

Gaining access to validation datasets, even retrospectively, is currently fraught with differing priorities among imaging stakeholders and unequal access among AI developers. To be useful, mammography images representing populations served by regional screening programs and high quality registries need to be linked to eventual cancer outcomes to establish the ground truth. Thus, access to useful imaging datasets requires access to not only the images themselves but also complete cancer history (e.g., prior lumpectomy for breast cancer) and follow-up data on all women to define eventual cancer outcome status (e.g., clinical records, biopsy results). The result is the need for complex data use agreements, especially with intellectual property of eventual AI algorithms at stake. From the perspective of imaging data owners (health systems, radiology groups, and women), the privacy of biomedical data remains a major concern. Institutions may be reluctant to release millions of imaging exams for private developer use [16]. One potential solution is a shift from a traditional model of transferring data directly to data modelers to an alternative “model to data” paradigm where the flow of data is reversed [17]. This paradigm was used successfully in the Digital Mammography DREAM Challenge, where participants submitted containerized AI models to the Challenge organizer to train and validate the submitted models on untouched imaging data behind a firewall. While this approach offers greater protection for health information, an important disadvantage is that AI data scientists have limited access to images, which could impede their ability to optimize their algorithms.

Finally, since the vast majority of players in this arena are looking to develop and commercialize their AI tool, there is fierce competition in gaining access to limited numbers of data partners willing to broker cooperative agreements with AI developers, especially for imaging data that includes exams from vulnerable populations. With intellectual property at stake, major industry players with greater resources and the ability to pay for use of imaging data have a distinct advantage over smaller start-ups, making current access to larger, diverse validation datasets inequitable. The end result is that not all AI developers will have access to validation datasets across diverse populations, potentially further exacerbating screening disparities by rendering eventual clinical algorithms less effective among already vulnerable patient populations.

Continuous improvement and monitoring

New AI tools for mammography have an advantage over traditional CAD in that they can learn continuously. Ideally, AI mammography algorithms would not go through validation just once, but would undergo continuous refinement and validation over time in order to not repeat the missteps experienced with traditional CAD. However, the fluid nature of AI algorithms makes them inherently non-transparent: without explicit information from AI developers, the factors driving changes in algorithm performance are difficult for end users, or those developing benchmarks, to decipher and monitor. These developers will likely need direct access to an institution's radiology information system to make such a continuous feedback loop possible, leading to data security concerns with potential exposure of personal health information. Thus, developers, medical organizations, industry partners, and government agencies will have to work together to create new processes and guidelines. In the U.S., the FDA is working with stakeholders to draft a new regulatory process for AI devices spanning from premarket approval to post-market surveillance [18]. Previously, FDA clearance required that CAD software be “locked” prior to marketing, with any changes to the algorithm requiring another FDA premarket review. As newer AI models have the ability to continuously learn and adapt to more available data with each new exam, the FDA proposes an adaptive total product lifecycle regulatory approach where manufacturers would be expected to monitor the AI algorithm clinical performance and incorporate a risk management approach after an initial premarket review [19]. Algorithm change components to be monitored and reported include data management changes (new training and testing data), re-training of machine learning architecture and parameters, changes in pre-determined assessment metrics, and software update procedures.

This type of medical AI device oversight will require adoption of standardized application programming interfaces (APIs) across diverse medical organization and government data networks. The real-world data requested by regulatory bodies such as the FDA for post-marketing surveillance of AI will also require large population-based registries that can help with continuous validation in the post-marketing setting. Moreover, large academic and private health systems will have to become willing partners in a new era of continuous monitoring by truly becoming learning health care environments where AI-based imaging interpretation can continuously evolve and change. This latter enterprise will be challenging given a current environment of vendor-specific and proprietary data management tools for medical imaging without the ability for cross-communication. Yet, in the post-marketing period, regulators and manufacturers will have to work collaboratively to demonstrate that improved overall screening accuracy is maintained across different populations over time.
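As a rough illustration of what post-market performance monitoring could involve, the sketch below compares sensitivity in each reporting period against a premarket baseline and flags periods that fall short. This is an assumption-laden toy, not the FDA's proposed process: the function name `flag_performance_drift`, the baseline, the tolerance, the period labels, and the counts are all hypothetical.

```python
def flag_performance_drift(baseline, periods, tolerance):
    """Return reporting periods whose sensitivity fell below the baseline.

    periods: list of (label, true_positives, cancers), where cancers is the
    number of confirmed cancers in that period from registry follow-up.
    A period is flagged when baseline - sensitivity exceeds tolerance.
    """
    flagged = []
    for label, true_positives, cancers in periods:
        if cancers == 0:
            continue  # sensitivity is undefined without confirmed cancers
        sensitivity = true_positives / cancers
        if baseline - sensitivity > tolerance:
            flagged.append((label, sensitivity))
    return flagged


# Hypothetical quarterly counts after an algorithm update (illustrative only).
periods = [
    ("Q1", 45, 50),
    ("Q2", 44, 50),
    ("Q3", 38, 50),
]

alerts = flag_performance_drift(baseline=0.90, periods=periods, tolerance=0.05)
for label, sensitivity in alerts:
    print(f"{label}: sensitivity {sensitivity:.2f} is below the baseline tolerance")
```

A fixed tolerance is the simplest possible rule. With counts this small, a real surveillance program would more plausibly rely on confidence intervals or statistical process control, and would track specificity and recall rate alongside sensitivity.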

Summary

With multiple AI algorithms for mammography screening entering the clinical market and frequently evolving imaging technologies, external validation will be needed before and after clinical adoption. Medical organizations, AI developers, researchers, and government agencies must work together to help make evolving population-based imaging datasets representing diverse populations available for external validation in order to ensure clinical effectiveness and generalizability. As AI algorithms and screening modalities continue to undergo modifications in the post-marketing period, better standards are needed for continuous monitoring with greater transparency from AI algorithm developers. Moreover, better integration of biomedical informatics and data systems is needed to incorporate improvements in real time and to avoid the missteps experienced with static traditional CAD software. These major paradigm shifts in validation and monitoring will be necessary before “black box” algorithms for breast cancer screening are trusted by payers, health providers, and women alike. Without investment in novel validation pathways, generalized adoption of AI-enhanced breast cancer screening is unlikely to be successful.

Conflicts of interest

The authors declare no conflicts of interest directly related to this work. CL receives grant funding from GE Healthcare unrelated to this work and personal fees from the American College of Radiology for his role as Deputy Editor of the Journal of the American College of Radiology. All other authors declare no other conflicts of interest.

16 in total

1. Digital breast tomosynthesis and the challenges of implementing an emerging breast cancer screening technology into clinical practice.

Authors: Christoph I Lee; Constance D Lehman
Journal: J Am Coll Radiol Date: 2013-12 Impact factor: 5.532

2. A Road Map for Translational Research on Artificial Intelligence in Medical Imaging: From the 2018 National Institutes of Health/RSNA/ACR/The Academy Workshop.

Authors: Bibb Allen; Steven E Seltzer; Curtis P Langlotz; Keith P Dreyer; Ronald M Summers; Nicholas Petrick; Danica Marinac-Dabic; Marisa Cruz; Tarik K Alkasab; Robert J Hanisch; Wendy J Nilsen; Judy Burleson; Kevin Lyman; Krishna Kandarpa
Journal: J Am Coll Radiol Date: 2019-05-28 Impact factor: 5.532

3. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists.

Authors: Alejandro Rodriguez-Ruiz; Kristina Lång; Albert Gubern-Merida; Mireille Broeders; Gisella Gennaro; Paola Clauser; Thomas H Helbich; Margarita Chevalier; Tao Tan; Thomas Mertelmeier; Matthew G Wallis; Ingvar Andersson; Sophia Zackrisson; Ritse M Mann; Ioannis Sechopoulos
Journal: J Natl Cancer Inst Date: 2019-09-01 Impact factor: 13.506

4. Computer-aided detection in mammography: downstream effect on diagnostic testing, ductal carcinoma in situ treatment, and costs.

Authors: Joshua J Fenton; Christoph I Lee; Guibo Xing; Laura-Mae Baldwin; Joann G Elmore
Journal: JAMA Intern Med Date: 2014-12 Impact factor: 21.873

5. Influence of computer-aided detection on performance of screening mammography.

Authors: Joshua J Fenton; Stephen H Taplin; Patricia A Carney; Linn Abraham; Edward A Sickles; Carl D'Orsi; Eric A Berns; Gary Cutter; R Edward Hendrick; William E Barlow; Joann G Elmore
Journal: N Engl J Med Date: 2007-04-05 Impact factor: 91.245

6. Detection of Breast Cancer with Mammography: Effect of an Artificial Intelligence Support System.

Authors: Alejandro Rodríguez-Ruiz; Elizabeth Krupinski; Jan-Jurre Mordang; Kathy Schilling; Sylvia H Heywang-Köbrunner; Ioannis Sechopoulos; Ritse M Mann
Journal: Radiology Date: 2018-11-20 Impact factor: 11.105

7. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection.

Authors: Constance D Lehman; Robert D Wellman; Diana S M Buist; Karla Kerlikowske; Anna N A Tosteson; Diana L Miglioretti
Journal: JAMA Intern Med Date: 2015-11 Impact factor: 21.873

8. Digital Mammography versus Digital Mammography Plus Tomosynthesis for Breast Cancer Screening: The Reggio Emilia Tomosynthesis Randomized Trial.

Authors: Pierpaolo Pattacini; Andrea Nitrosi; Paolo Giorgi Rossi; Valentina Iotti; Vladimiro Ginocchi; Sara Ravaioli; Rita Vacondio; Luca Braglia; Silvio Cavuto; Cinzia Campari
Journal: Radiology Date: 2018-06-05 Impact factor: 11.105

9. Will Machine Learning Tip the Balance in Breast Cancer Screening?

Authors: Andrew D Trister; Diana S M Buist; Christoph I Lee
Journal: JAMA Oncol Date: 2017-11-01 Impact factor: 31.777

10. Reproducible Research Practices and Transparency across the Biomedical Literature.

Authors: Shareen A Iqbal; Joshua D Wallach; Muin J Khoury; Sheri D Schully; John P A Ioannidis
Journal: PLoS Biol Date: 2016-01-04 Impact factor: 8.029
