
The myth of generalisability in clinical research and machine learning in health care.

Joseph Futoma, Morgan Simons, Trishan Panch, Finale Doshi-Velez, Leo Anthony Celi.

Abstract

An emphasis on overly broad notions of generalisability as it pertains to applications of machine learning in health care can overlook situations in which machine learning might provide clinical utility. We believe that this narrow focus on generalisability should be replaced with wider considerations for the ultimate goal of building machine learning systems that are useful at the bedside.
© 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.


Year:  2020        PMID: 32864600      PMCID: PMC7444947          DOI: 10.1016/S2589-7500(20)30186-2

Source DB:  PubMed          Journal:  Lancet Digit Health        ISSN: 2589-7500


Introduction

Dr Lee, an esteemed intensivist from the USA, is rounding in an intensive care unit (ICU). He is asked by a team member who is taking care of patients with COVID-19 whether they can use their hospital's new machine learning model for predicting mortality to triage patients and optimise the use of scarce resources, such as ventilators. He is about to say yes, but stops himself. Do the findings of the preprints and fast-tracked published articles that this model is based on apply to his patient population? Problems with the increase in hastily written articles notwithstanding, are the conclusions of research based on patients with COVID-19 in China and Italy from several months ago still valid in his ICU today, given the differences in practice patterns and rapidly changing guidelines and protocols?

The answers to these questions strongly depend on context. For a substantial number of individuals who die in the ICU, mortality is the result of a decision to discontinue treatment. Many factors affect the decision to discontinue invasive interventions, including whether the expected outcome is aligned with the patient's preferences. Therefore, a machine learning model that predicts hospital mortality is largely identifying the patients for whom treatment is most likely to be withdrawn, and is effectively learning a collection of rules to predict this outcome. As a thoughtful clinician, Dr Lee realises that he should consider the broader context in which the ICU's mortality model was developed.

Unlike Dr Lee, current machine learning systems typically cannot identify differences in context, let alone adapt to them. However, the collections of rules derived by machine learning systems might still be effective in a specific context. In this Viewpoint, we argue that an overemphasis on overly broad notions of generalisability overlooks the situations in which machine learning systems have the greatest ability to deliver clinical utility.

In pursuit of generalisability

Generalisability is not a binary concept, and does not have a universally agreed definition. According to one common hierarchy, a set of rules from a machine learning system or a clinician might be applicable: internally, applying only in the narrow context in which it was developed; temporally, applying prospectively at the centre in which it was developed; or externally, applying both at new centres and in new time periods. Other hierarchies construct even more detailed levels of generalisability. A system that achieves the highest possible level of generalisability is desirable.

Many medical journals mandate that articles on machine learning applications show results on external cohorts.6, 7, 8 This request is only natural: such journals often have diverse readerships, and research articles with widespread relevance to many readers are more likely to be read and circulated, increasing the visibility of the journal and its publishers. Similarly, vendors, such as electronic health record companies, prize generalisability in their applications. These companies frequently sell generic black-box machine learning systems that purport to apply universally across many hospitals. Broad applicability is to their financial advantage: it allows development costs for machine learning to be amortised and, they hope, eliminates the need for solutions tailored to each hospital. In some cases, such broad geographical generalisability might be feasible (eg, in medical imaging applications such as diagnosing diabetic retinopathy). However, even these areas are not immune to generalisability issues, and few prospective studies or randomised trials exist.

More often, universality is a myth. As users of these machine learning systems can attest, the demand for universal rules (generalisability) often results in systems that sacrifice strong performance at a single site for mediocre or poor performance at many sites.13, 14, 15 The inherent trade-off that clinicians and researchers alike encounter is between improving system performance locally and having systems that generalise.

Although we have already explored the story of Dr Lee, let us consider a more general hypothetical scenario. Consider a machine learning system built by tertiary care hospital A to help clinicians identify patients who are at high risk of a hospital-acquired, highly contagious diarrhoeal infection. A prospective study at hospital A found the system effective at helping infection-control practitioners prevent outbreaks, and the system was put into general use. Hearing of this success, its partner rural community hospital, hospital B, decided to adopt the system. Unfortunately, performance at hospital B was poor. Investigation revealed that hospital A had an antibiotic stewardship programme, but hospital B did not. The machine learning system, implicitly trained for a context in which a particular antibiotic policy exists, was unusable at hospital B. Does this mean it should not be used at hospital A? Of course not: the machine learning system had already shown temporal generalisability at hospital A, providing tailored predictions that ensured cases did not go unnoticed. Geographical generalisation to hospital B is not necessary for clinicians at hospital A to use the system to improve patient care. Rather, the desire for geographical generalisability is a proxy for validity: a machine learning system that truly applied universally would, tautologically, also be valid wherever it was used.
In situations with strong signals and little local variability across sites, this strategy might make sense. However, in many scenarios there are too many practice patterns and other local idiosyncrasies that make learning a broadly applicable model effectively impossible. Instead, machine learning systems in these settings can be viewed as an aspirational form of evidence-based medicine: local data used to create local inferences for local patients and clinicians.16, 17, 18, 19 Issues of generalisability are not unique to machine learning; they are a dominant concern for clinical guidelines, where the results of randomised controlled trials, the gold standard for evidence generation, might not generalise beyond the trial settings.20, 21, 22, 23 If hospitals want to have useful machine learning systems at the bedside, the broader research community needs to stop focusing solely on generalisability and consider the ultimate goal: will this system be useful in this specific case?
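
To make the trade-off between local performance and geographical generalisability concrete, the following is a minimal sketch, assuming a hypothetical pandas DataFrame of admissions with a site label, an admission year, a binary mortality outcome, and a few illustrative features. The column names, the 2019 cutoff, and the scikit-learn model choice are all assumptions for illustration and are not part of the original Viewpoint. The sketch contrasts temporal validation at the development site (hospital A) with geographical validation at a partner site (hospital B).

```python
# Minimal sketch (hypothetical data): contrast temporal validation at the
# development site with geographical validation at a partner site.
# Assumes a pandas DataFrame `df` with illustrative columns "site" ("A" or "B"),
# "admit_year", a binary "mortality" outcome, and a few numeric features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["age", "lactate", "creatinine", "ventilated"]  # illustrative only


def fit_local_model(train: pd.DataFrame):
    """Fit a simple mortality model on one hospital's historical admissions."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(train[FEATURES], train["mortality"])


def evaluate(model, cohort: pd.DataFrame) -> float:
    """Discrimination (AUROC) of the model on a held-out cohort."""
    risk = model.predict_proba(cohort[FEATURES])[:, 1]
    return roc_auc_score(cohort["mortality"], risk)


def validation_report(df: pd.DataFrame) -> dict:
    # Develop the model at hospital A on earlier admissions (cutoff is arbitrary).
    development = df[(df["site"] == "A") & (df["admit_year"] < 2019)]
    model = fit_local_model(development)
    return {
        # Temporal validation: later patients at the same site.
        "hospital A, prospective": evaluate(
            model, df[(df["site"] == "A") & (df["admit_year"] >= 2019)]
        ),
        # Geographical validation: a site with different practice patterns.
        "hospital B": evaluate(model, df[df["site"] == "B"]),
    }
```

A model can score well on the first, prospective evaluation and poorly on the second without being useless; as argued above, the local result is the one that matters for clinicians at hospital A.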

Beyond generalisability

To create machine learning systems that are clinically useful, the emphasis should shift from demanding geographical generalisability to understanding how, when, and why a machine learning system works. This knowledge will help medical professionals use the system correctly, not only across institutions but also within an institution as patients and practices change. For instance, if hospital A stopped its antibiotic stewardship programme, its clinicians should know to update their machine learning system. Although the precise level of generalisability required for a real-world application will depend on the context, any system intended to be integrated into a clinical environment will need to be at least temporally generalisable, ensuring that it performs well prospectively.

We further suggest several important questions to ask when assessing the overall validity of a machine learning system for a particular context. When the machine learning system is right, is it right for the right reasons, or is it relying on anticausal mechanisms due to unobserved confounders (as in the example of patients with pneumonia who have asthma and lower mortality rates than patients without asthma because they receive more intensive care)? How do the characteristics of the cohort used to develop the machine learning system compare with those of typical patients at the institution where it will be used? Does the system rely on variables known to be collected differently at different centres?

More broadly, all machine learning systems must be closely monitored to make sure that their performance does not degrade with time as patient demographics and practice patterns inevitably shift.25, 26 Furthermore, techniques from continual learning offer enormous potential to create more advanced machine learning systems that continuously update based on new data. In theory, such systems could address many of the pitfalls of generalisability (panel). However, these self-updating algorithms pose enormous regulatory challenges, as outlined in a recent white paper by the US Food and Drug Administration, and there are still many technical and cultural barriers to integrating them into real-world systems.

Panel

Changes in practice pattern over time
Improved patient outcomes through adoption of low-tidal-volume ventilation in the intensive care unit (ICU) will affect the performance of models that were developed when higher tidal volumes were standard. Leucodepletion of blood for transfusion became standard of care in most countries; models relating blood transfusion to outcomes require recalibration if they were validated before the practice change.

Differences in practice between health systems
Mortality predictions for patients admitted to the ICU with COVID-19 are highly sensitive to criteria for ICU admission, which in turn vary across hospitals depending on ICU demand and capacity.

Patient demographic variation
Models to predict the risk of hospitalisation from COVID-19 that are trained on data from Italy, where there is a high proportion of older individuals in the population, will not do well in countries with a different age distribution (eg, low-income and middle-income countries, which typically have younger populations).

Patient genotypic and phenotypic variation
Model performance is linked to the composition of the training cohort with regard to disease genotypes or phenotypes, or both. These models will not translate well to populations in which the genotypic or phenotypic make-up is different. Some phenotypes of sepsis and acute respiratory distress syndrome, for example, might be over-represented or under-represented in different settings.

Hardware and software variation for data capture
Bedside monitors with different sampling rates for the continuous capture of physiological signals will have different susceptibilities to artifacts and will affect models that take time-series data as an input. Computer-vision models for automated interpretation of CT scans are sensitive to the machines used to obtain the images.

Variation in other determinants of health and disease (eg, environmental, social, political, and cultural)
A model developed in the USA to predict neurological outcomes of premature babies will not do well in a low-income country because of differences in resource availability. The relationship of patient and disease factors with clinical events, such as hospital-acquired infection, will change when a health-care system is strained (eg, during a pandemic).

This path to validation will probably require more work than simply evaluating a machine learning system on multiple datasets and then claiming universal external generalisability, as vendors might wish to do. However, models thoroughly vetted through these proposed standards have the promise of doing better for the context at hand. We also emphasise that the methodological process of developing a high-quality machine learning system might itself be generalisable: the lessons hospital A learns about how to prepare data and then train, test, and monitor its machine learning system can be used by hospital B to do the same with its own data. As we gain a deeper understanding of what patterns different machine learning systems rely on, we can also determine more accurately to what extent systems trained in one context might work in another (eg, perhaps hospital A's machine learning system would also work well at hospital C, a neighbouring tertiary care institution that follows very similar practices). Where possible, multicentre datasets might better capture heterogeneity across sites during model development, potentially leading to more generalisable models. Multicentre data from the relevant target populations are also the only way to validate whether a model truly generalises to a new institution.30, 31

Finally, we note that there are some circumstances in which broad generalisability is desirable. For example, if we are interested in using machine learning systems to understand the underpinnings of disease (eg, a study to identify biomarkers that predict which patients with COVID-19 will develop cytokine storm), then the machine learning system's output should not be influenced by practice-specific variables, such as the specific technology used to take measurements. However, clinicians do constantly adjust their behaviours and practices depending on the unique characteristics of patients, the availability of resources, and local practice norms. If we are to build accurate and actionable machine learning systems, we should not ignore the fact that practice-specific information is often highly predictive.32, 33, 34
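
As one way to operationalise the monitoring described above, the following sketch computes discrimination and a crude calibration gap over successive time windows of a deployed model's scored patients, flagging windows that suggest drift. The column names, the 90-day window, and the alert thresholds are hypothetical; a real deployment would also track subgroups and feature distributions.

```python
# Minimal sketch (hypothetical data): rolling performance monitoring for a
# deployed risk model. Assumes a DataFrame `scored` with columns "date"
# (datetime), "outcome" (0/1), and "predicted_risk" (model output in [0, 1]).
import pandas as pd
from sklearn.metrics import roc_auc_score


def monitor(scored: pd.DataFrame, window: str = "90D",
            auroc_floor: float = 0.70, calibration_tolerance: float = 0.05) -> pd.DataFrame:
    """Track discrimination and a crude calibration gap per time window."""
    scored = scored.sort_values("date").set_index("date")
    rows = []
    for window_start, chunk in scored.groupby(pd.Grouper(freq=window)):
        if chunk["outcome"].nunique() < 2:  # AUROC undefined without both classes
            continue
        auroc = roc_auc_score(chunk["outcome"], chunk["predicted_risk"])
        calibration_gap = chunk["predicted_risk"].mean() - chunk["outcome"].mean()
        rows.append({
            "window_start": window_start,
            "n_patients": len(chunk),
            "auroc": auroc,
            "calibration_gap": calibration_gap,
            # Flag windows that suggest drift and warrant review or recalibration.
            "alert": auroc < auroc_floor or abs(calibration_gap) > calibration_tolerance,
        })
    return pd.DataFrame(rows)
```

A sustained alert after an event such as the withdrawal of an antibiotic stewardship programme, a pandemic surge, or a change in monitoring hardware would be the cue to retrain, recalibrate, or retire the model rather than to assume it still generalises to the current context.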

Conclusion

Machine learning systems are not like thermometers, reliably measuring temperature via universal rules of physics; nor are they like trained clinicians, gracefully adapting to new circumstances. Rather, these systems should be viewed as sets of rules that were trained to operate in certain contexts and that rely on certain assumptions, and they might work seamlessly at one centre but fail altogether somewhere else. We hope this Viewpoint will help reframe the narrow focus on generalisability and will encourage future researchers, developers, and reviewers to be explicit about the appropriate level of generalisability for their setting. We believe that a renewed focus on broader questions about characterising when, how, and why machine learning systems have clinical utility will help ensure that these systems work as intended for both clinicians and patients.
References (29 in total)

Review 1.  Systematic reviews in health care: Assessing the quality of controlled clinical trials.

Authors:  P Jüni; D G Altman; M Egger
Journal:  BMJ       Date:  2001-07-07

2.  External validity of randomised controlled trials: "to whom do the results of this trial apply?".

Authors:  Peter M Rothwell
Journal:  Lancet       Date:  2005 Jan 1-7       Impact factor: 79.321

Review 3.  Bias in clinical intervention research.

Authors:  Lise Lotte Gluud
Journal:  Am J Epidemiol       Date:  2006-01-27       Impact factor: 4.897

Review 4.  Continual lifelong learning with neural networks: A review.

Authors:  German I Parisi; Ronald Kemker; Jose L Part; Christopher Kanan; Stefan Wermter
Journal:  Neural Netw       Date:  2019-02-06

5.  Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers-From the Radiology Editorial Board.

Authors:  David A Bluemke; Linda Moy; Miriam A Bredella; Birgit B Ertl-Wagner; Kathryn J Fowler; Vicky J Goh; Elkan F Halpern; Christopher P Hess; Mark L Schiebler; Clifford R Weiss
Journal:  Radiology       Date:  2019-12-31       Impact factor: 11.105

6.  Assessing the generalizability of prognostic information.

Authors:  A C Justice; K E Covinsky; J A Berlin
Journal:  Ann Intern Med       Date:  1999-03-16       Impact factor: 25.391

7.  Early warning scores for detecting deterioration in adult hospital patients: systematic review and critical appraisal of methodology.

Authors:  Stephen Gerry; Timothy Bonnici; Jacqueline Birks; Shona Kirtley; Pradeep S Virdee; Peter J Watkinson; Gary S Collins
Journal:  BMJ       Date:  2020-05-20

Review 8.  "Yes, but will it work for my patients?" Driving clinically relevant research with benchmark datasets.

Authors:  Trishan Panch; Tom J Pollard; Heather Mattie; Emily Lindemer; Pearse A Keane; Leo Anthony Celi
Journal:  NPJ Digit Med       Date:  2020-06-19

9.  A New Insight Into Missing Data in Intensive Care Unit Patient Profiles: Observational Study.

Authors:  Anis Sharafoddini; Joel A Dubin; David M Maslove; Joon Lee
Journal:  JMIR Med Inform       Date:  2019-01-08

10.  Biases in electronic health record data due to processes within the healthcare system: retrospective observational study.

Authors:  Denis Agniel; Isaac S Kohane; Griffin M Weber
Journal:  BMJ       Date:  2018-04-30
Cited by (48 in total)

Review 1.  Shifting machine learning for healthcare from development to deployment and from models to data.

Authors:  Angela Zhang; Lei Xing; James Zou; Joseph C Wu
Journal:  Nat Biomed Eng       Date:  2022-07-04       Impact factor: 25.671

2.  Assessing the robustness of artificial intelligence powered planning tools in radiotherapy clinical settings-a phantom simulation approach.

Authors:  Martin Hito; Wentao Wang; Hunter Stephens; Yibo Xie; Ruilin Li; Fang-Fang Yin; Yaorong Ge; Q Jackie Wu; Qiuwen Wu; Yang Sheng
Journal:  Quant Imaging Med Surg       Date:  2021-12

3.  Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias.

Authors:  Jack Gallifant; Joe Zhang; Maria Del Pilar Arias Lopez; Tingting Zhu; Luigi Camporota; Leo A Celi; Federico Formenti
Journal:  Br J Anaesth       Date:  2021-11-09       Impact factor: 9.166

4.  On the explainability of hospitalization prediction on a large COVID-19 patient dataset.

Authors:  Ivan Girardi; Panagiotis Vagenas; Dario Arcos-Díaz; Lydia Bessaï; Alexander Büsser; Ludovico Furlan; Raffaello Furlan; Mauro Gatti; Andrea Giovannini; Ellen Hoeven; Chiara Marchiori
Journal:  AMIA Annu Symp Proc       Date:  2022-02-21

5.  Learning Predictive and Interpretable Timeseries Summaries from ICU Data.

Authors:  Nari Johnson; Sonali Parbhoo; Andrew S Ross; Finale Doshi-Velez
Journal:  AMIA Annu Symp Proc       Date:  2022-02-21

6.  The leap to ordinal: Detailed functional prognosis after traumatic brain injury with a flexible modelling approach.

Authors:  Shubhayu Bhattacharyay; Ioan Milosevic; Lindsay Wilson; David K Menon; Robert D Stevens; Ewout W Steyerberg; David W Nelson; Ari Ercole
Journal:  PLoS One       Date:  2022-07-05       Impact factor: 3.752

7.  Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine.

Authors:  Lin Lawrence Guo; Stephen R Pfohl; Jason Fries; Jose Posada; Scott Lanyon Fleming; Catherine Aftandilian; Nigam Shah; Lillian Sung
Journal:  Appl Clin Inform       Date:  2021-09-01       Impact factor: 2.762

8.  Analysis of Stroke Detection during the COVID-19 Pandemic Using Natural Language Processing of Radiology Reports.

Authors:  M D Li; M Lang; F Deng; K Chang; K Buch; S Rincon; W A Mehan; T M Leslie-Mazwi; J Kalpathy-Cramer
Journal:  AJNR Am J Neuroradiol       Date:  2020-12-17       Impact factor: 3.825

9.  Generalisability through local validation: overcoming barriers due to data disparity in healthcare.

Authors:  William Greig Mitchell; Edward Christopher Dee; Leo Anthony Celi
Journal:  BMC Ophthalmol       Date:  2021-05-21       Impact factor: 2.209

10.  Ethical Applications of Artificial Intelligence: Evidence From Health Research on Veterans.

Authors:  Christos Makridis; Seth Hurley; Gil Alterovitz; Mary Klote
Journal:  JMIR Med Inform       Date:  2021-06-02
