Machine Learning for Health: Algorithm Auditing & Quality Control.

Luis Oala1, Andrew G Murchison2, Pradeep Balachandran3, Shruti Choudhary4, Jana Fehr5, Alixandro Werneck Leite6, Peter G Goldschmidt7, Christian Johner8, Elora D M Schörverth9, Rose Nakasi10, Martin Meyer11, Federico Cabitza12, Pat Baird13, Carolin Prabhu14, Eva Weicken9, Xiaoxuan Liu15, Markus Wenzel9, Steffen Vogler16, Darlington Akogo17, Shada Alsalamah18,19, Emre Kazim20, Adriano Koshiyama20, Sven Piechottka21, Sheena Macpherson22, Ian Shadforth22, Regina Geierhofer23, Christian Matek24, Joachim Krois25, Bruno Sanguinetti26, Matthew Arentz27, Pavol Bielik28, Saul Calderon-Ramirez29, Auss Abbood30, Nicolas Langer31, Stefan Haufe32, Ferath Kherif33, Sameer Pujari19, Wojciech Samek9, Thomas Wiegand9.   

Abstract

Developers proposing new machine learning for health (ML4H) tools often pledge to match or even surpass the performance of existing tools, yet the reality is usually more complicated. Reliable deployment of ML4H to the real world is challenging, as examples from diabetic retinopathy or Covid-19 screening show. We envision an integrated framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial, we give a summary of ongoing work towards that vision and announce a call for participation to the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal to advance the practice of ML4H auditing.
© 2021. The Author(s).

Keywords:  Algorithm; Artificial intelligence; Auditing; Health; Machine learning; Quality control

Year: 2021   PMID: 34729675   PMCID: PMC8562935   DOI: 10.1007/s10916-021-01783-y

Source DB: PubMed   Journal: J Med Syst   ISSN: 0148-5598   Impact factor: 4.920


Introduction

Machine learning (ML) technology promises to automate, speed up or improve medical processes. A large number of institutions and companies are ambitiously working on fulfilling this promise, spanning tasks such as medical image classification [1], segmentation [2] or reconstruction [3], protein structure prediction [4] and electrocardiography interpretation [5], among others. However, the deployment of machine learning for health (ML4H) tools into real-world applications has been slow because existing approval processes [6] may not account for the particular failure modes and risks that accompany ML technology [7-11]. Certain changes to image data that may not change the decision of a human expert can completely alter the output of an image classification [12] or regression [13, 14] model. Model performance estimates are often not valid for the varying input distributions that can occur during real-world deployment [15-17]. The decision heuristics a model learns can differ from the heuristics we may expect a human to use [1, 18-20], and model predictions may come with ill-calibrated statements of confidence [21-23] or no estimate of uncertainty at all [24]. Developers proposing new ML4H technologies sometimes promise to match or even surpass the performance of existing methods [25], yet the reality is often more complicated. Classical ML performance evaluation does not automatically translate to clinical utility, as examples from large diabetic retinopathy projects [26] or Covid-19 diagnosis illustrate [27]. The reliable and integrated management of these risks remains an open scientific and practical hurdle. In order to overcome this hurdle, we envision a framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial we give a brief summary of ongoing work towards that vision from our open collective of collaborators.
Many of the considerations presented here originate from a consensus-finding effort by the International Telecommunication Union (ITU) and World Health Organization (WHO) which started in 2018 as the Focus Group on Artificial Intelligence for Health (FG-AI4H) [28]. We are convinced that success on this path heavily depends on practical feedback. Auditing processes that are developed on paper have to be put to the test to ensure that they translate to utility in the actual auditing practice [29]. That is why we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal (see the Call for Participation for more details). The special issue will provide a platform for the submission, discussion and publication of audit methods and reports. The resulting compendium is intended to be a useful resource for users, developers, vendors and auditors of ML4H systems to manage and mitigate their particular risks.

ML4H Algorithm Auditing & Quality Control

From a bird’s eye view, many ML tools share a set of core components comprising data, an ML model and its outputs, as visualized in Fig. 1A. The typical ML product life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring (see Fig. 1B). Feedback loops between stages, for example from product validation back to development, are commonplace.

Fig. 1 Process overview. A: Most ML tools share a set of core components comprising data, an ML model and its outputs. B: The typical ML life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring. C: An ML4H audit is carried out with respect to a dynamic set of technical, clinical and regulatory considerations that depend on the concrete ML technology and the intended use of the tool.

An audit entails a detailed assessment of an ML4H tool at one or more of the ML life cycle steps. It can be carried out to anticipate, monitor, or retrospectively review operations of the tool [30, 31]. The audit output should consist of a comprehensive standardized report that can be used by different stakeholders to efficiently communicate the tool’s strengths and limitations [29]. We envision a process by which an independent body, for example one appointed by a government, carries out the audit using the methods and tools outlined below. The same methods and tools can also be used by manufacturers and researchers themselves to carry out internal quality control [32]. In either scenario, the assessment is carried out with respect to a dynamic set of technical, clinical and regulatory considerations (see Fig. 1C) that depend on the concrete ML technology and the intended use of the tool. Audit teams should thus comprise expertise in all these dimensions and have to be able to synthesize related requirements across disciplines. In the following, we list a selection of considerations for all three of these auditing dimensions, tools that can be used to aid the auditing process, as well as the role that so-called trial audits can play in advancing ML4H quality control.

Auditing Dimensions

The technical validation of an ML4H tool comprises the application of data and ML model quality assessment methods to detect possible failure modes in the model’s behavior. These include model-oriented metrics, such as predictive performance, robustness [33, 34], interpretability [1, 35], disparity [36] or uncertainty [13, 24, 37], but also data-oriented metrics related to sample size determination [38], sparseness [39], bias [40], distribution mismatch [41, 42] and label quality [7]. A lack of rigorous statistical analysis of the model metrics is a common pitfall in both research and industry, so such analysis plays an important role during technical validation [43]. FG-AI4H has formulated a standardized quality assessment framework based on existing good practices [44-46] and provides practical guidance and examples for performing technical validation audits on three ML4H tools [29].
Clinical Evaluation comprises an “ongoing procedure to collect, appraise and analyse clinical data pertaining to a medical device and to analyse whether there is sufficient clinical evidence to confirm compliance with relevant essential requirements for safety and performance when using the device according to the manufacturer’s instructions for use” [47]. The EQUATOR network, including STARD-AI [48], CONSORT-AI [49] and SPIRIT-AI [50], as well as different scientific journals and associations [51-54], have developed guidelines for the design, implementation, reporting and evaluation of AI interventions in various study designs. Key concerns are whether the ML4H tool delivers utility in clinical pathways, how cost-effective the clinician-tool interaction is [55] and whether it provides the desired benefits for the intended users [56]. To demonstrate reliable performance, it is important to look beyond common machine learning performance statistics such as accuracy and to evaluate in addition whether the ML4H tool is suited to the clinical setting in which it will be used; for example, whether the training and test data represent patient populations that are similar to the intended use population [7, 57] and whether the output translates to medically meaningful parameters [58].
Regulatory Assessment comprises the systematic evaluation of ML4H tools with respect to the applicable regulatory requirements found in laws (MDR [59], IVDR [60], 21 CFR [61], among others), in international standards (such as IEC 62304 [62], IEC 62366-1 [63] and ISO 14971 [64]), in guidelines by regulatory bodies (for example FDA [65], IMDRF [66]) or in guidelines and drafts by other organizations (for example AAMI [67] or European Commission [68]). Such guidance is of practical concern for stakeholders in the ML4H ecosystem, including manufacturers (e.g. product managers, developers and data scientists, quality and regulatory affairs managers) and regulatory bodies (authorities, notified bodies). The FG-AI4H has identified and critically reviewed general yet fundamental regulatory considerations related to ML4H. These regulatory considerations have been converted into specific and verifiable requirements and subsequently published as a comprehensive assessment checklist entitled “Good practices for health applications of machine learning: Considerations for manufacturers and regulators” [45], which covers the entire life cycle outlined in Fig. 1B at a higher resolution. It marks the checklist items which should be given high priority when time is limited, an important practical constraint for real-world audits. Examples and comments give further guidance to users.
New regulatory developments, such as predetermined change control plans [69], imply faster software update cycles and potentially more frequent audits. Hence, good tooling can become an important means to make effective as well as efficient audits possible.
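To make the point about statistical analysis of model metrics concrete, the following is a minimal sketch of a percentile bootstrap confidence interval for accuracy. It is only an illustration under stated assumptions: the function name `bootstrap_accuracy_ci`, its parameters and the choice of the percentile bootstrap are ours and are not taken from the FG-AI4H framework or the cited references.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for accuracy.

    Hypothetical helper: uses a percentile bootstrap over test cases,
    which assumes the cases are independent draws from the target population.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed so that an audit can be reproduced
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    point = sum(correct) / n

    # Resample the per-case correctness indicators with replacement.
    resampled = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )

    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, resampled[lower_idx], resampled[upper_idx]


if __name__ == "__main__":
    # Toy data: 100 cases, 80 of them predicted correctly.
    labels = [1] * 80 + [0] * 20
    predictions = [1] * 100
    estimate, lower, upper = bootstrap_accuracy_ci(labels, predictions)
    print(f"accuracy = {estimate:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate shows how much of an apparent performance difference could be explained by the size of the test set alone.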

Auditing Tools

The auditing process can be supported by appropriate tools to make it more targeted and time-efficient. These tools include process and requirements descriptions, as mentioned above [44, 45, 56], which help to manage dynamic workflows that may vary by use case and ML technology. They also include reporting templates that present the audit results in a standardized way for communication between different stakeholders [29, 70]. In addition, the nature of ML4H tools, as primarily software that interacts with data, lends itself to the application of test automation and simulations for the purpose of auditing. This requires software tools which can handle custom evaluation scripts, the flexible processing of different ML4H model formats and data modalities, as well as security protocols that protect intellectual property and sensitive patient information [71]. We are working with open source frameworks such as EvalAI [72] and MLflow [73] to develop solutions for automated auditing, federated auditing in remote teams and automated report creation. Our first demo platform is available via http://health.aiaudit.org/ and hosted on ITU provisioned infrastructure. While quantitative performance measures can already be provided, it is essential to also offer qualitative measures. This is realized by requiring the users to fill out a standardized questionnaire [74]. Quantitative and qualitative performance results are then provided to the users as a comprehensive and standardized report card [70].
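The combination of a custom evaluation script, a questionnaire and a standardized report card can be sketched in a few lines of plain Python. This is a hypothetical illustration only: it does not use or reproduce the EvalAI or MLflow APIs or the format of the demo platform, and the names `ReportCard`, `binary_metrics`, `build_report_card` and the questionnaire item keys are assumptions made for this example.

```python
import json
from dataclasses import dataclass, field, asdict


def binary_metrics(y_true, y_pred, positive=1):
    """Quantitative part: accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # None marks a metric that is undefined on this test set.
        return numerator / denominator if denominator else None

    return {
        "n": len(pairs),
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


@dataclass
class ReportCard:
    """Hypothetical report card holding both kinds of results."""
    tool_name: str
    quantitative: dict = field(default_factory=dict)
    qualitative: dict = field(default_factory=dict)
    unanswered_items: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def build_report_card(tool_name, y_true, y_pred, answers, required_items):
    """Merge metric results with questionnaire answers and flag missing answers."""
    missing = [item for item in required_items if not answers.get(item)]
    return ReportCard(tool_name, binary_metrics(y_true, y_pred), dict(answers), missing)


if __name__ == "__main__":
    card = build_report_card(
        tool_name="demo-tool",  # illustrative name
        y_true=[1, 1, 0, 0],
        y_pred=[1, 0, 0, 0],
        answers={"intended_use": "triage support"},  # illustrative item
        required_items=["intended_use", "training_data_source"],
    )
    print(card.to_json())
```

Keeping the report as structured data rather than free text is one way to let the same audit result be rendered for different stakeholders.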

Trial Audits

We are convinced that success on the path towards a framework for algorithm auditing and quality control depends heavily on practical feedback. The development and refinement of auditing processes should routinely be accompanied by trial audits. In trial audits, draft processes and standards are applied to ML4H tools. The purpose of such an exercise is to ensure that auditing processes developed on paper translate to utility in actual auditing practice [29]. In order to facilitate the implementation of trial audits, we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal. We welcome contributions pertaining to methods, tools, reports or open challenges in ML4H auditing.

Outlook

The materials summarized above bear testimony to the initial progress that has been made towards the creation of frameworks for ML4H algorithm auditing and quality control. Nevertheless, new challenges emerge as we collectively pull at the complex fabric that ML4H systems are. From the perspective of technical validation, the identification of factors which bias or deteriorate algorithmic performance is often constrained by the absence of relevant metadata. For example, the measurement device types (and related acquisition parameters) used to produce the validation inputs should be available in order to validate whether the model performance is robust under device type changes. This problem can be alleviated by identifying and routinely recording this information during data acquisition.
For clinical evaluation, future considerations include extending and refining the specific requirements for how the clinical effectiveness of a tool should be monitored on an ongoing basis after the algorithm has been implemented [59]. This also requires agreement on clear and clinically useful procedures to obtain ground truth annotations. It might be necessary to adapt the ML algorithm to the target population if demographics or clinical characteristics differ from the training setting or if medical guidelines for diagnostics or treatment have changed [75]. Therefore, in order for these insights to be effective, it is imperative that auditors exhibit a solid understanding of the training data, ML algorithm, independent test data and evaluation metrics specific to the intended use.
A challenge for regulatory assessment is that standardization organizations, notified bodies and manufacturers need to efficiently formulate and parse applicable regulatory requirements for each individual ML4H tool. Comprehensive assessment checklists [45, 51] can help with that task. However, more support is needed in terms of workflow management and assisting tools if we consider the limited time and budgets which professional auditors have at their disposal. Future regulatory checklists should allow for interactive selection of use-case specific sub-checklists, automated audit report creation, the issuing of standard minimum test cases, as well as accompanying glossaries and education materials for auditors. We also have to ensure that protocols are in place which translate the audit insights to actual improvements in the ML4H tool. Managing the risks presented by the exciting advances of AI in healthcare is a formidable undertaking, but with collaborative pooling of expertise and resources we believe we can rise to the task.
Electronic supplementary material: Supplementary file 1 (PDF 241 KB)
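The technical validation point made in the Outlook, that device metadata must be recorded before robustness under device type changes can be checked, can be illustrated with a short sketch. The record layout (`device_type`, `label`, `prediction`) and the function names are assumptions chosen for this example, not a format prescribed by the audit framework.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="device_type"):
    """Accuracy per metadata group; records without the key fall into 'unknown'."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for record in records:
        group = record.get(group_key) or "unknown"
        counts[group][0] += int(record["label"] == record["prediction"])
        counts[group][1] += 1
    return {group: correct / total for group, (correct, total) in counts.items()}


def largest_gap(per_group):
    """Difference between the best and worst group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


if __name__ == "__main__":
    # Toy validation records; device names are purely illustrative.
    validation_records = [
        {"device_type": "scanner_a", "label": 1, "prediction": 1},
        {"device_type": "scanner_a", "label": 0, "prediction": 0},
        {"device_type": "scanner_b", "label": 1, "prediction": 0},
        {"device_type": "scanner_b", "label": 1, "prediction": 1},
        {"label": 0, "prediction": 0},  # metadata was not recorded
    ]
    per_device = accuracy_by_group(validation_records)
    print(per_device)
    print("largest gap:", largest_gap(per_device))
```

A large "unknown" group is itself an audit finding: it marks the cases for which the robustness question cannot be answered because the acquisition metadata was never recorded.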
