Problems in the deployment of machine-learned models in health care.

Joseph Paul Cohen¹, Tianshi Cao², Joseph D Viviano², Chin-Wei Huang², Michael Fralick², Marzyeh Ghassemi², Muhammad Mamdani², Russell Greiner², Yoshua Bengio².

Abstract

Year: 2021 PMID: 34462316 PMCID: PMC8443295 DOI: 10.1503/cmaj.202066

Source DB: PubMed Journal: CMAJ ISSN: 0820-3946 Impact factor: 8.262

Decision-support systems or clinical prediction tools based on machine learning (including the special case of deep learning) are similar to clinical support tools developed using classical statistical models and, as such, have similar limitations. If a machine-learned model is trained using data that do not match the data it will encounter when deployed, its performance may be lower than expected. When training, machine learning algorithms take the “path of least resistance,” leading them to learn features from the data that are spuriously correlated with target outputs instead of the correct features; this can impair the effective generalization of the resulting learned model. Avoiding errors related to these problems involves careful evaluation of machine-learned models using new data from the performance distribution, including data samples that are expected to “trick” the model, such as those with different population demographics, difficult conditions or bad-quality inputs.

In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical decision-making.1 Both decision-support systems and clinical prediction tools developed using machine learning (including the special case of deep learning) are similar to clinical support tools developed using classical statistical models and, as such, have similar limitations.2,3 A model that makes incorrect predictions can lead its users to make errors they otherwise would not have made when caring for patients, and therefore it is important to understand how these models can fail.4 We discuss these limitations, focusing on 2 issues in particular (out-of-distribution, or out-of-sample, generalization and incorrect feature attribution), to underscore the need to consider potential caveats when using machine-learned solutions.

What are the features of machine-learned models?

Herein we use the term “machine-learned model” to refer to a model that has been created by running a supervised machine learning algorithm on a labelled data set. Each machine-learned model is trained on a specific data set, known as its training distribution. Training data are typically drawn from specific ranges of demographics, country, hospital, device, protocol and so on. Machine-learned models are not dynamic unless they are explicitly designed to be, meaning that they do not change as they are used. Typically, a machine-learned model is deterministic, having learned a fixed set of weights (i.e., coefficients or parameters) that do not change as the model is run; that is, for any specific input, it will return the same prediction every time. Although “adaptive systems” have been developed that can “learn” while being deployed by incorporating new data, such systems may give a different prediction for the same input, and their safety and oversight are still unclear.5

We refer to the data that a machine-learned model will encounter when it is deployed for use as the model’s performance distribution. If a machine-learned model’s training distribution does not match its performance distribution, then the performance of the model may be lower than expected,6,7 a challenge that is commonly referred to as out-of-distribution generalization (discussed in detail below). Another challenge arises if the training data contain features that are spuriously correlated with the outcomes the tool is being designed to predict, as this may cause a machine-learned model to make predictions from the “wrong” features (also discussed below). A model’s creator should seek a training data distribution that matches the performance distribution as closely as possible, and clinicians who use the tool should be aware of the exact limitations of the model’s training distribution and potential shortcomings.

What are some potential problems of machine-learned models?

Out-of-distribution generalization

Newly graduated physicians are typically most comfortable managing patients who exhibit conditions they encountered during their residency training, but they are also able to manage patients with conditions they have not previously seen because they can use theoretical knowledge to recognize patterns of illness. In contrast, machine-learned methods are limited by the data provided during the training and development phase. Furthermore, machine-learned models do not typically know their own limits unless components are included to help the model detect when data it encounters are out of distribution (for example, a component may be built in that prevents a human chest radiograph diagnostic system from processing a photo of a cat and diagnosing pneumonia8 — see strategies listed below). Three categories of out-of-distribution data,9 summarized in Figure 1, include the following:

Figure 1:

This figure shows 3 categories of out-of-distribution data, all in the context of training a machine-learned algorithm to read adult chest radiographs (see image C iii). A) Images that are unrelated to the task. B) Images that are incorrectly acquired. C) Images that are not encountered owing to a selection bias in the training distribution (e.g., images with lung cancer lesions and pacemakers were not included in the training set and therefore were unseen during training). C) (iii) Training data that are subject to a selection bias.

Data that are unrelated to the task, such as obviously wrong images from a different domain (for example, magnetic resonance images presented to a machine-learned model that was trained on radiograph images) and less obviously wrong images, such as a wrist radiograph image processed using a model trained with chest radiographs.

Incorrectly prepared data; for example, blurry chest radiograph images, those with poor contrast or incorrect view of the anatomy, images presented in an incorrect file format or improperly processed, and images arising from an incorrect data acquisition protocol.

Data not included in the training data owing to a selection bias; for example, images showing a disease not present in the training data or those arising from a population demographic not similar to that of the training data set.

A machine-learned model will perform suboptimally or deliver unexpected results on out-of-distribution data. Many strategies have been developed to detect out-of-distribution data and prevent them from being processed. A typical approach is for a model to compute the degree to which a data sample matches the model’s training distribution, which may be presented as a score. If the score is below a certain threshold, then the model can decide not to process a data sample.
One way for the model to do this — in the case of image interpretation — is for the model to attempt to reconstruct the image and compare the reconstruction to the original by some measure of similarity, such as the absolute pixel difference.8,10 Typically, a model will do a poor job of reconstructing an image it did not encounter in training. If the reconstructed image is scored as similar enough to be judged “correct,” the model can proceed to process that image; if not, processing will not occur. However, in order to build and evaluate such out-of-distribution detection systems, known out-of-distribution examples must be used; so, even strategies to prevent errors have limits.
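The reconstruction-based check described above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the method of the cited work: a linear PCA reconstruction (computed with NumPy's SVD) stands in for a learned image model, the "images" are random vectors, and the function names and the 99th-percentile cut-off are hypothetical choices.

```python
import numpy as np

def fit_pca(train, n_components):
    """Fit a linear reconstruction model (PCA) on flattened training images."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centred training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(x, mean, components):
    """Mean absolute pixel difference between each image and its reconstruction."""
    centred = x - mean
    recon = centred @ components.T @ components + mean
    return np.abs(x - recon).mean(axis=1)

rng = np.random.default_rng(0)
n_pixels, n_latent = 64, 4

# Synthetic stand-in for the training distribution (an assumption): images
# lie near a 4-dimensional subspace, plus a small amount of noise.
basis = rng.normal(size=(n_latent, n_pixels))

def sample_in_distribution(n):
    latent = rng.normal(size=(n, n_latent))
    return latent @ basis + 0.05 * rng.normal(size=(n, n_pixels))

train = sample_in_distribution(500)
mean, components = fit_pca(train, n_latent)

# Hypothetical rule: refuse any input whose reconstruction error exceeds
# the 99th percentile of the errors seen on the training data.
threshold = np.percentile(reconstruction_error(train, mean, components), 99)

in_dist = sample_in_distribution(100)
out_dist = 2.0 * rng.normal(size=(100, n_pixels))  # unrelated to the subspace

accept_in = reconstruction_error(in_dist, mean, components) <= threshold
accept_out = reconstruction_error(out_dist, mean, components) <= threshold
print(f"accepted in-distribution fraction: {accept_in.mean():.2f}")
print(f"accepted out-of-distribution fraction: {accept_out.mean():.2f}")
```

The cut-off here is set from training data alone; as the text notes, judging whether such a detector works in practice still requires known out-of-distribution examples.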

Incorrect feature attribution

Machine-learned models typically use only the minimally complicated set of features required to reliably discriminate between the target outputs in their training data set. That is, the model takes a “path of least resistance” during its learning,11–13 finding features that are highly predictive of the target output, which helps to make it accurate. However, a learning model may also find some distractor feature in the data that is spuriously correlated with the target output14 and, once this happens, the model may stop looking for new true discriminative features even if they exist.15 For example, in a model learning to read chest radiographs, distractor features may be the hospital, image acquisition parameters, radiograph view (e.g., anteroposterior v. anteroposterior supine), and artifacts such as presence of a pacemaker or endotracheal tube. If clinical protocols or image processing change over time, this can lead to patterns in the training data that can be detected by the model and serve as a distractor.16 Or if images from multiple hospitals are grouped together and the rate of a disease varies among hospitals, a model may learn to detect the hospital using subtle visual cues and may then base its predictions on the hospital associated with the image rather than data in the image itself. This can lead to a model appearing more accurate than it actually is if the evaluation data contain the same artifacts (e.g., the same hospital-specific distribution), but the same model could fail dramatically if the performance data do not exhibit these artifacts. Furthermore, patient demographics (e.g., age or sex) can be inferred from aspects of the training data and may be used by a learning model to predict outcome prevalence (that is, prior probability) in the training sample if better true features related to the outcome of interest are less obvious in the data. 
Medical data sets are often relatively small, which may increase the likelihood of spuriously correlated features. Research into altering the ways models learn to avoid this problem is ongoing.11,17 However, using a large, diverse data set for training a machine-learned model will help to avoid the effect of distractors. Other solutions include unsupervised learning and transfer learning,18 processes that use data that are unlabelled or labelled for another task to train models, to avoid detection of spurious features that are specific to a particular data set. These methods typically enable the use of much more data and have a better chance of learning features that will be general enough and useful for the intended task.18

In cases where pathology-specific features are simply not predictive enough for some images, the learning model may be forced to guess and predict the prevalence of a disease or outcome in the training distribution. The machine-learned model will appear to work when applied to data in which the disease or outcome prevalence is the same as in the training data; it may give the “right” answer. However, when applied to a different population with a different outcome prevalence, the model will likely predict incorrectly19,20 and lead to harm. It is therefore important that model developers and users verify that the machine-learned model appropriately detects features that are truly associated with the prediction or outcome of interest, using a feature attribution method such as the “image gradient” method21 or creating a counterfactual input showing what would change the classifier’s prediction22 during development and when deployed.

Related to this point, another concern is that some models may simply learn to copy the actions taken by the clinicians when the data were generated. For example, if a model is trained to predict the need for blood transfusions based on historical data about transfusions, it may not have anything informative to predict from and instead will learn to replicate existing practices. A model will learn “bad habits” unless the data set used to develop it is corrected. One approach to overcome this problem would be to have expert reviewers label the data set with the true outcomes of interest (e.g., appropriate v. inappropriate blood transfusions), although this may be resource intensive and experts may not always agree on labels. It would be even better to use only labels that are objective and do not depend on human experts.
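The distractor-feature failure discussed in this section can be reproduced on synthetic data. The sketch below is illustrative only: the "hospital" flag, its 95% agreement with the label, the weak "true signal," the plain gradient-descent logistic regression and all function names are assumptions, not details from the cited studies. For a linear model, the gradient of the output logit with respect to the input is simply the weight vector, so inspecting the weights is used here as a crude stand-in for a gradient-based attribution method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, hospital_agreement):
    """Binary label, a weakly predictive true signal, and a hospital flag
    that matches the label with probability `hospital_agreement`."""
    y = rng.integers(0, 2, size=n)
    true_signal = y + rng.normal(scale=2.0, size=n)       # weak but genuine
    agrees = rng.random(n) < hospital_agreement
    hospital = np.where(agrees, y, 1 - y).astype(float)   # distractor
    return np.column_stack([true_signal, hospital]), y

def train_logistic(x, y, lr=0.5, steps=3000):
    """Batch gradient descent on the logistic loss."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(x, y, w, b):
    return np.mean(((x @ w + b) > 0) == (y == 1))

# Training data in which the hospital flag is spuriously predictive.
x_train, y_train = make_data(4000, hospital_agreement=0.95)
w, b = train_logistic(x_train, y_train)

# Evaluation data with the same artifact, and with the artifact removed.
x_same, y_same = make_data(4000, hospital_agreement=0.95)
x_shift, y_shift = make_data(4000, hospital_agreement=0.50)

acc_same = accuracy(x_same, y_same, w, b)
acc_shift = accuracy(x_shift, y_shift, w, b)
print("weights [true_signal, hospital]:", np.round(w, 2))
print(f"accuracy with the same hospital mix: {acc_same:.2f}")
print(f"accuracy with a shifted hospital mix: {acc_shift:.2f}")
```

In this toy setting the learned weight on the hospital flag dominates, the model looks accurate on evaluation data that share the artifact, and its accuracy falls to roughly chance once the flag no longer tracks the label. Comparing raw weights is only meaningful here because the two features are on similar scales.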

What can mitigate these problems?

Avoiding errors related to the issues discussed above involves careful evaluation of machine-learned models23 using new data from the performance distribution, including samples that are expected to expose model failures, such as those with different population demographics, difficult conditions, poor-quality images, or errors. A potentially useful approach is to create simulated test distributions by balancing data based on attributes unrelated to the target task to observe differences in performance of a model according to factors such as demographic minority class24 or geographic region.25 If a model learned to focus on a spurious feature such as age, deploying it on data in which the population is composed of a single age group, although balanced in terms of the target variable the model was trained to predict, would lead to poor performance. Results of such tests of a model’s performance should be transparently presented to illustrate its limitations in use.26 A related article discusses evaluation of machine-learned models in some depth.27
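A simpler, related check is to report performance separately for each subgroup rather than as a single pooled figure. The helper below is a hypothetical sketch in plain Python; the function name, the record layout and the toy records are assumptions made for illustration and do not come from the cited evaluations.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: accuracy} for records of (group, true_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Toy records (invented): (age band, true label, model prediction).
records = [
    ("adult", 1, 1), ("adult", 0, 0), ("adult", 1, 1), ("adult", 0, 0),
    ("child", 1, 0), ("child", 0, 0), ("child", 1, 0), ("child", 0, 1),
]

per_group = accuracy_by_group(records)
pooled = sum(truth == pred for _, truth, pred in records) / len(records)
print(per_group)  # {'adult': 1.0, 'child': 0.25}
print(pooled)     # 0.625
```

In this toy example the pooled accuracy of 0.625 hides the fact that the model is right for every adult record and for only one in four child records, which is the kind of gap a stratified report is meant to expose.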

Conclusion

It is important to understand and tackle these problems of machine-learned models before deployment so that large investments do not end in failure, which could be costly or catastrophic. IBM’s “Watson for Oncology” program28 was suspended after an investment of $62 million, allegedly owing to problematic clinical recommendations that resulted in poor acceptance by clinicians. Google’s machine-learned initiative to detect diabetic retinopathy29 struggled when it encountered “real-world” images in clinics in Thailand that were of lower quality than those in its training set, causing considerable frustration to both patients and staff. Anticipating and mitigating the challenges outlined herein will be key to avoiding such costly failures.

9 in total

Review 1. Artificial Intelligence for Medical Image Analysis: A Guide for Authors and Reviewers.

Authors: Joseph R England; Phillip M Cheng
Journal: AJR Am J Roentgenol Date: 2018-12-17 Impact factor: 3.959

2. CheXclusion: Fairness gaps in deep chest X-ray classifiers.

Authors: Laleh Seyyed-Kalantari; Guanxiong Liu; Matthew McDermott; Irene Y Chen; Marzyeh Ghassemi
Journal: Pac Symp Biocomput Date: 2021

3. How to Read Articles That Use Machine Learning: Users' Guides to the Medical Literature.

Authors: Yun Liu; Po-Hsuan Cameron Chen; Jonathan Krause; Lily Peng
Journal: JAMA Date: 2019-11-12 Impact factor: 56.272

4. Human-computer collaboration for skin cancer recognition.

Authors: Philipp Tschandl; Christoph Rinner; Zoe Apalla; Giuseppe Argenziano; Noel Codella; Allan Halpern; Monika Janda; Aimilios Lallas; Caterina Longo; Josep Malvehy; John Paoli; Susana Puig; Cliff Rosendahl; H Peter Soyer; Iris Zalaudek; Harald Kittler
Journal: Nat Med Date: 2020-06-22 Impact factor: 53.440

5. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data.

Authors: Milena A Gianfrancesco; Suzanne Tamang; Jinoos Yazdany; Gabriela Schmajuk
Journal: JAMA Intern Med Date: 2018-11-01 Impact factor: 21.873

6. ADHD-200 Global Competition: diagnosing ADHD using personal characteristic data can outperform resting state fMRI measurements.

Authors: Matthew R G Brown; Gagan S Sidhu; Russell Greiner; Nasimeh Asgarian; Meysam Bastani; Peter H Silverstone; Andrew J Greenshaw; Serdar M Dursun
Journal: Front Syst Neurosci Date: 2012-09-28

7. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.

Authors: John R Zech; Marcus A Badgeley; Manway Liu; Anthony B Costa; Joseph J Titano; Eric Karl Oermann
Journal: PLoS Med Date: 2018-11-06 Impact factor: 11.069

8. Deep learning predicts hip fracture using confounding patient and healthcare variables.

Authors: Marcus A Badgeley; John R Zech; Luke Oakden-Rayner; Benjamin S Glicksberg; Manway Liu; William Gale; Michael V McConnell; Bethany Percha; Thomas M Snyder; Joel T Dudley
Journal: NPJ Digit Med Date: 2019-04-30

9. Evaluation of machine learning solutions in medicine.

Authors: Tony Antoniou; Muhammad Mamdani
Journal: CMAJ Date: 2021-08-30 Impact factor: 8.262
