Jean Feng, Rachael V Phillips, Ivana Malenica, Andrew Bishara, Alan E Hubbard, Leo A Celi, Romain Pirracchio.
Abstract
Machine learning (ML) and artificial intelligence (AI) algorithms have the potential to derive insights from clinical data and improve patient outcomes. However, these highly complex systems are sensitive to changes in the environment and liable to performance decay. Even after their successful integration into clinical practice, ML/AI algorithms should be continuously monitored and updated to ensure their long-term safety and effectiveness. To bring AI into maturity in clinical care, we advocate for the creation of hospital units responsible for quality assurance and improvement of these algorithms, which we refer to as "AI-QI" units. We discuss how tools that have long been used in hospital quality assurance and quality improvement can be adapted to monitor static ML algorithms. On the other hand, procedures for continual model updating are still nascent. We highlight key considerations when choosing between existing methods and opportunities for methodological innovation.
Year: 2022 PMID: 35641814 PMCID: PMC9156743 DOI: 10.1038/s41746-022-00611-y
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1AI-QI is a collaborative effort.
To ensure the continued safety and effectiveness of AI-based algorithms deployed in the hospital, institutions will need streamlined processes for monitoring model performance continuously, communicating the latest performance metrics to end-users, and revising the model or even suspending its use when substantial decay in performance is observed. Given its cross-cutting nature, AI-QI requires close collaboration between clinicians, hospital administrators, information technology (IT) professionals, model developers, biostatisticians, and regulatory agencies.
Fig. 2Cause-and-effect diagram for a drop in performance of an AI-based early warning system for Acute Hypotension Episodes (AHEs).
Each branch represents a category of potential causes. The effect is defined as model performance, which is measured by the area under the receiver operating characteristic curve (AUC).
Methods from statistical process control (SPC) and their application to monitoring ML algorithms.
| Method(s) | What the method(s) detect and assumptions | Example uses |
|---|---|---|
| CUSUM, EWMA | Detects a shift in the mean of a single variable, given the shift size. Assumes the pre-shift mean and variance are known. Extensions can monitor changes in the variance. | • Monitoring changes in individual input variables<br>• Monitoring changes in real-valued performance metrics (e.g. monitoring the prediction error) |
| MCUSUM, MEWMA, Hotelling's T² | Monitor changes in the relationship between multiple variables. | • Monitoring changes in the relationship between input variables |
| Generalized likelihood ratio test (GLRT), online change point detection | Detects whether and when a change occurred in a data distribution. Can be applied if characteristics of the pre- and/or post-shift distributions are unknown. GLRT methods typically make parametric assumptions; online change point detection methods have both parametric and nonparametric variants. | • Detecting distributional shifts in individual or multiple input variables<br>• Detecting shifts in the conditional distribution of the outcome<br>• Determining whether parametric model recalibration/revision is needed |
| Generalized fluctuation monitoring | Monitors changes in the residuals or gradient. | • Detecting when the average gradient of the training loss for a differentiable ML algorithm (e.g. a neural network) differs from zero |
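As an illustration of the multivariate methods in the table, Hotelling's T² flags observations whose joint value is unusual even when each variable is individually within range, i.e. a change in the relationship between input variables. The sketch below (hypothetical MAP/HR numbers, not from the paper) estimates the in-control mean and covariance from baseline data and scores new observations:

```python
import numpy as np

def hotelling_t2(x, mean, cov_inv):
    """Hotelling's T^2 statistic for a single multivariate observation.

    Large values indicate the observation is far from the in-control
    distribution, accounting for correlation between the variables.
    mean and cov_inv are estimated from in-control (baseline) data.
    """
    d = np.asarray(x, dtype=float) - mean
    return float(d @ cov_inv @ d)

# Hypothetical in-control data: baseline MAP and HR, positively correlated.
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
baseline = rng.multivariate_normal([85.0, 75.0], cov, size=500)
mean = baseline.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(baseline, rowvar=False))

# A point whose coordinates are each unremarkable but whose combination
# violates the usual MAP-HR correlation scores high...
t2_joint_shift = hotelling_t2([84.0, 76.5], mean, cov_inv)
# ...while a point consistent with the correlation scores low.
t2_in_control = hotelling_t2([85.8, 75.8], mean, cov_inv)
```

In practice the statistic would be compared against a control limit (e.g. a chi-squared quantile) at each time step, analogous to the univariate charts.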
Fig. 3Continual monitoring of a hypothetical AI algorithm for forecasting mean arterial pressure (MAP).
Consider a hypothetical MAP prediction algorithm that predicts a patient's risk of developing an acute hypotensive episode from two input variables: baseline MAP and heart rate (HR). The top two rows monitor changes in the two input variables using the CUSUM procedure, where the dark line is the chart statistic and the light lines are the control limits. The third row aims to detect changes in the conditional relationship between the outcome and the input variables by monitoring the residuals using the CUSUM procedure. An alarm is fired when a chart statistic exceeds its control limits.
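The chart statistic in the figure can be sketched in a few lines. Below is a minimal, illustrative two-sided tabular CUSUM (not the paper's implementation), assuming the pre-shift mean and variance of the monitored quantity are known; the reference value k = 0.5 and control limit h = 5 (in standard-deviation units) are common textbook defaults:

```python
import numpy as np

def cusum(values, target_mean, target_sd, k=0.5, h=5.0):
    """Two-sided tabular CUSUM for a shift in the mean of one variable.

    Assumes the pre-shift mean and standard deviation are known.
    Returns the index of the first alarm, or None if no alarm fires.
    """
    s_hi = s_lo = 0.0
    for t, x in enumerate(values):
        z = (x - target_mean) / target_sd   # standardize the observation
        s_hi = max(0.0, s_hi + z - k)       # accumulates upward drift
        s_lo = max(0.0, s_lo - z - k)       # accumulates downward drift
        if s_hi > h or s_lo > h:
            return t                        # chart statistic crossed the limit
    return None

# Hypothetical residual stream: in control for 100 steps, then the mean shifts.
rng = np.random.default_rng(0)
residuals = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 50)])
alarm_at = cusum(residuals, target_mean=0.0, target_sd=1.0)
# alarm typically fires within a few steps of the shift at t = 100
```

The same routine could monitor each input variable (rows 1-2 of the figure) or the residuals (row 3); only the monitored stream changes.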
Model updating procedures described in this paper. The performance guarantees for these methods require the data stream to be IID with respect to the target population. Note that, in general, online learning methods may provide only weak performance guarantees or none at all.
| Method(s) | Update frequency | Complexity of model update | Performance guarantees |
|---|---|---|---|
| One-time model recalibration (e.g. Platt scaling, isotonic regression, temperature scaling) | Low | Low | Strong |
| One-time model revision | Low | Medium | Strong |
| One-time model refitting | Low | High | Strong |
| Online hypothesis testing for approving proposed modifications | Medium | High | Strong |
| Online parametric model recalibration/revision | High | Low/Medium | Medium |
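To make the first row concrete: Platt scaling is a one-time recalibration in which the underlying model is frozen and only a two-parameter logistic mapping from its raw score to a probability is refit on held-out data. A minimal sketch follows, using a plain gradient-descent fit on simulated data (illustrative only, not the paper's procedure; real deployments would use an established library implementation):

```python
import numpy as np

def platt_scale(scores, labels, n_iter=5000, lr=0.5):
    """Fit Platt scaling p = sigmoid(a * s + b) by gradient descent on log loss.

    The base model is left untouched; only the scalar slope a and
    intercept b mapping raw score s to a calibrated probability are fit.
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - y) * s)   # gradient of log loss w.r.t. a
        b -= lr * np.mean(p - y)         # gradient of log loss w.r.t. b
    return a, b

# Simulated drift: the frozen model's raw scores are systematically too
# high for the new population, so the fitted intercept should roughly
# undo the +1.5 shift.
rng = np.random.default_rng(2)
true_logit = rng.normal(0.0, 1.0, 2000)
labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
raw_scores = true_logit + 1.5
a, b = platt_scale(raw_scores, labels)
```

Because only two parameters are refit, the update is low-complexity and its guarantees on held-out data are comparatively strong, which is why the table rates one-time recalibration "Low" complexity and "Strong" guarantees.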