Equity in essence: a call for operationalising fairness in machine learning for healthcare.

Judy Wawira Gichoya^1,2, Liam G McCoy^3, Leo Anthony Celi^4,5,6, Marzyeh Ghassemi^7,8,9

Keywords: BMJ health informatics

Year: 2021 PMID: 33910923 PMCID: PMC8733939 DOI: 10.1136/bmjhci-2020-100289

Source DB: PubMed Journal: BMJ Health Care Inform ISSN: 2632-1009

Introduction

Machine learning for healthcare (MLHC) is at the juncture of leaping from the pages of journals and conference proceedings to clinical implementation at the bedside. Succeeding in this endeavour requires the synthesis of insights from both the machine learning and healthcare domains, in order to ensure that the unique characteristics of MLHC are leveraged to maximise benefits and minimise risks. An important part of this effort is establishing and formalising processes and procedures for characterising these tools and assessing their performance. Meaningful progress in this direction can be found in recently developed guidelines for the development of MLHC models,1 guidelines for the design and reporting of MLHC clinical trials,2 3 and protocols for the regulatory assessment of MLHC tools.4 5 But while such guidelines and protocols engage extensively with relevant technical considerations, engagement with issues of fairness, bias and unintended disparate impact is lacking. Such issues have taken on a place of prominence in the broader ML community,6–9 with recent work highlighting issues such as racial disparities in the accuracy of facial recognition and gender classification software,6 10 gender bias in the output of natural language processing models11 12 and racial bias in algorithms for bail and criminal sentencing.13 MLHC is not immune to these concerns, as seen in disparate outcomes from algorithms for allocating healthcare resources,14 15 bias in language models developed on clinical notes16 and melanoma detection models developed primarily on images of light-coloured skin.17 In this paper, we examine the inclusion of fairness in recent guidelines for MLHC model reporting, clinical trials and regulatory approval. We highlight opportunities to ensure that fairness is made fundamental to MLHC, and examine how this can be operationalised in the MLHC context.

Fairness as an afterthought?

Model development and trial reporting guidelines

Several recent documents have attempted, with varying degrees of practical implication, to enumerate guiding principles for MLHC. Broadly, these documents do an excellent job of highlighting artificial intelligence (AI)-specific technical and operational concerns, such as how to handle human-AI interaction, or how to account for model performance errors. Yet as outlined in table 1, references to fairness are either conspicuously absent, made merely in passing, or relegated to supplemental discussion.

Table 1

Fairness in recently released and upcoming guidelines

Guideline	How is fairness included?
Reporting guidelines
Development and Reporting of Prediction Models: Guidance for Authors From Editors of Respiratory, Sleep, and Critical Care Journals1	Discussion of the risk of unfairness is included in the supplementary material (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7161722/bin/ccm-48-0623-s001.docx) but not the main document.
TRIPOD-ML (Announcement Statement Only)18	No explicit mention.
STARD-AI (Announcement Statement Only)19	No explicit mention.
Checklist for Artificial Intelligence in Medical Imaging31	Bias discussed, but not clearly in the context of fairness with respect to differential performance or impact between patient groups.
Clinical Trial Guidelines
CONSORT-AI Extension3	Fairness is brought up in the discussion section but not included explicitly in any of the guideline checklist points.
SPIRIT-AI Extension2	No explicit mention.

CONSORT-AI, Consolidated Standards of Reporting Trials–Artificial Intelligence; SPIRIT-AI, Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence; STARD-AI, Standards for Reporting of Diagnostic Accuracy Studies–Artificial Intelligence; TRIPOD-ML, Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis–Machine Learning.

Notable examples are the recent Standard Protocol Items: Recommendations for Interventional Trials-AI (SPIRIT-AI)2 and Consolidated Standards of Reporting Trials-AI (CONSORT-AI)3 extensions, which expand prominent guidelines for the design and reporting of clinical trials to include concerns relevant to AI. While the latter states in its discussion that ‘investigators should also be encouraged to explore differences in performance and error rates across population subgroups’,3 the concept is not formally included in the guideline itself. Similarly, the announcement papers for the upcoming Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis-ML (TRIPOD-ML)18 and Standards for Reporting of Diagnostic Accuracy Studies AI Extension (STARD-AI)19 guidelines for model reporting do not allude to these issues (though we wait in anticipation for their potential inclusion in the final versions of these guidelines). While recently published guidelines from the editors of respiratory, sleep and critical care medicine journals engage with the concept in an exemplary fashion, the depth of their discussion is relegated to a supplementary segment of the paper.1

Regulatory guidance

Broadly, the engagement of prominent regulatory bodies with MLHC remains at a preliminary stage, and engagement with fairness tends to be either minimal or vague. The Food and Drug Administration in the USA has made significant strides towards modernisation of its frameworks for the approval and regulation of software-based medical interventions, including MLHC tools.5 Their documents engage broadly with technical concerns, and criteria for effective clinical evaluation, but entirely lack discussion of fairness or the relationship between these tools and the broader health equity context.20 The Canadian Agency for Drugs and Technologies in Health has explicitly highlighted the need for fairness and bias to be considered, but further elaboration is lacking.21 The work of the European Union on this topic remains at a broad stage.4 While their documents do make reference to principles of ‘diversity, non-discrimination and fairness’, they do so in a very broad manner without any clearly operationalised specifics.22 23 The engagement of the UK with MLHC is relatively advanced, with several prominent reports engaging with the topic,24–26 and an explicit ‘Code of Conduct for Data-Driven Healthcare Technology’27 from the Department of Health and Social Care that highlights the need for fairness. However, the specifics of this regulatory approach are still being decided, and no clear guidance has yet been put forth to clarify these principles in practice.28 MLHC as a whole would benefit from increased clarity and force in regulatory guidance from these major agencies.29

Operationalising fairness in MLHC practice

If fairness is an afterthought in the design and reporting of MLHC papers and trials, as well as regulatory processes, it is likely to remain an afterthought in the development and implementation of MLHC tools. If MLHC is going to prove effective for, and be trusted by, a diverse range of patients, fairness cannot be a post-hoc consideration. Nor is it sufficient for fairness to be a vague abstraction of academic importance but ineffectual consequence. The present moment affords a tremendous opportunity to define MLHC such that fairness is integral, and to ensure that this commitment is reflected in model reporting guidelines, clinical trial guidelines and regulatory approaches. However, moving from vague commitments of fairness to practical and effective guidance is far from a trivial task. As work in the machine learning community has demonstrated, fairness has multiple definitions which can occasionally be incompatible,7 and bias can arise from a complex range of sources.30 Operationalisation of fairness must be context-specific, and must embed the relevant values of a field. We call for concerted effort from the MLHC community, and in particular the groups responsible for the development and propagation of guidelines, to affirm a commitment to fairness in an explicit and operationalised fashion. Similarly, we call on the various regulatory agencies to establish clear minimum standards for AI fairness. In box 1, we highlight a non-exhaustive series of recommendations that are likely to be beneficial as the MLHC community engages in this endeavour.

Box 1

Engage members of the public and in particular members of marginalised communities in the process of determining acceptable fairness standards.
Collect necessary data on vulnerable protected groups in order to perform audits of model function (eg, on race, gender).
Analyse and report model performance for different intersectional subpopulations at risk of unfair outcomes.
Establish target thresholds and maximum disparities for model function between groups.
Be transparent regarding the specific definitions of fairness that are used in the evaluation of a machine learning for healthcare (MLHC) model.
Explicitly evaluate for disparate treatment and disparate impact in MLHC clinical trials.
Commit to postmarketing surveillance to assess the ongoing real-world impact of MLHC models.
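As an illustration of what the subgroup reporting and maximum-disparity recommendations might look like in practice, the following is a minimal sketch in Python. It is not a method taken from any of the guidelines discussed here. The function names (`subgroup_rates`, `max_disparity`, `flag_disparities`), the choice of metrics and any numerical limits passed in are assumptions made for illustration only; appropriate metrics and thresholds would need to be agreed for each clinical context.

```python
from collections import defaultdict
from math import isnan, nan


def subgroup_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate and false positive rate.

    y_true and y_pred hold binary labels (0 or 1). Each entry of `groups`
    is any hashable key; a tuple such as (race, gender) gives an
    intersectional breakdown. All names here are hypothetical.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" records whether the prediction was correct,
        # "p"/"n" records whether the prediction was positive.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[g] = {
            "n": n,
            "selection_rate": (c["tp"] + c["fp"]) / n,
            "tpr": c["tp"] / positives if positives else nan,
            "fpr": c["fp"] / negatives if negatives else nan,
        }
    return rates


def max_disparity(rates, metric):
    """Largest between-group gap for one metric, ignoring undefined values."""
    values = [r[metric] for r in rates.values() if not isnan(r[metric])]
    return max(values) - min(values) if values else nan


def flag_disparities(rates, max_gaps):
    """Return the metrics whose gap exceeds the pre-agreed maximum.

    `max_gaps` maps a metric name to an illustrative limit; the limits
    themselves are assumptions, not values proposed in the text.
    """
    gaps = {m: max_disparity(rates, m) for m in max_gaps}
    return {m: gap for m, gap in gaps.items() if gap > max_gaps[m]}
```

Reporting the gap in selection rate alongside the gaps in true and false positive rates makes explicit which definition of fairness is being assessed, since these quantities correspond to different definitions and need not agree. Small subgroups will give unstable estimates, so the group sizes returned under `n` should be reported with the rates.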

Conclusion

Values are embedded throughout the MLHC pipeline, from the design of models, to the execution and reporting of trials, to the regulatory approval process. Guidelines hold significant power in defining what is worthy of emphasis. While fairness is essential to the impact and consequences of MLHC tools, the concept is often conspicuously absent or ineffectually vague in emerging guidelines. The field of MLHC has the opportunity at this juncture to render fairness integral to the identity of the field. We call on the MLHC community to commit to the project of operationalising fairness, and to emphasise fairness as a requirement in practice.

13 in total

1. Reporting of artificial intelligence prediction models.

Authors: Gary S Collins; Karel G M Moons
Journal: Lancet Date: 2019-04-20 Impact factor: 79.321

2. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers.

Authors: John Mongan; Linda Moy; Charles E Kahn
Journal: Radiol Artif Intell Date: 2020-03-25

Review 3. The European artificial intelligence strategy: implications and challenges for digital health.

Authors: I Glenn Cohen; Theodoros Evgeniou; Sara Gerke; Timo Minssen
Journal: Lancet Digit Health Date: 2020-06-23

4. Semantics derived automatically from language corpora contain human-like biases.

Authors: Aylin Caliskan; Joanna J Bryson; Arvind Narayanan
Journal: Science Date: 2017-04-14 Impact factor: 47.728

Review 5. Machine Learning and Health Care Disparities in Dermatology.

Authors: Adewole S Adamson; Avery Smith
Journal: JAMA Dermatol Date: 2018-11-01 Impact factor: 10.282

6. Addressing health disparities in the Food and Drug Administration's artificial intelligence and machine learning regulatory framework.

Authors: Kadija Ferryman
Journal: J Am Med Inform Assoc Date: 2020-12-09 Impact factor: 4.497

7. Assessing risk, automating racism.

Authors: Ruha Benjamin
Journal: Science Date: 2019-10-25 Impact factor: 47.728

8. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group.

Authors: Viknesh Sounderajah; Hutan Ashrafian; Ravi Aggarwal; Jeffrey De Fauw; Alastair K Denniston; Felix Greaves; Alan Karthikesalingam; Dominic King; Xiaoxuan Liu; Sheraz R Markar; Matthew D F McInnes; Trishan Panch; Jonathan Pearson-Stuttard; Daniel S W Ting; Robert M Golub; David Moher; Patrick M Bossuyt; Ara Darzi
Journal: Nat Med Date: 2020-06 Impact factor: 53.440

9. Development and Reporting of Prediction Models: Guidance for Authors From Editors of Respiratory, Sleep, and Critical Care Journals.

Authors: Daniel E Leisman; Michael O Harhay; David J Lederer; Michael Abramson; Alex A Adjei; Jan Bakker; Zuhair K Ballas; Esther Barreiro; Scott C Bell; Rinaldo Bellomo; Jonathan A Bernstein; Richard D Branson; Vito Brusasco; James D Chalmers; Sudhansu Chokroverty; Giuseppe Citerio; Nancy A Collop; Colin R Cooke; James D Crapo; Gavin Donaldson; Dominic A Fitzgerald; Emma Grainger; Lauren Hale; Felix J Herth; Patrick M Kochanek; Guy Marks; J Randall Moorman; David E Ost; Michael Schatz; Aziz Sheikh; Alan R Smyth; Iain Stewart; Paul W Stewart; Erik R Swenson; Ronald Szymusiak; Jean-Louis Teboul; Jean-Louis Vincent; Jadwiga A Wedzicha; David M Maslove
Journal: Crit Care Med Date: 2020-05 Impact factor: 7.598

Review 10. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension.

Authors: Xiaoxuan Liu; Samantha Cruz Rivera; David Moher; Melanie J Calvert; Alastair K Denniston
Journal: Nat Med Date: 2020-09-09 Impact factor: 87.241
