Sharon E Davis, Colin G Walsh, Michael E Matheny.
Abstract
As the implementation of artificial intelligence (AI)-enabled tools is realized across diverse clinical environments, there is a growing understanding of the need for ongoing monitoring and updating of prediction models. Dataset shift (temporal changes in clinical practice, patient populations, and information systems) is now well documented as a source of deteriorating model accuracy and a challenge to the sustainability of AI-enabled tools in clinical care. While best practices are well established for training and validating new models, there has been limited work developing best practices for prospective validation and model maintenance. In this paper, we highlight the need for updating clinical prediction models and discuss open questions regarding this critical aspect of the AI modeling lifecycle in three focus areas: model maintenance policies, performance monitoring perspectives, and model updating strategies. With the increasing adoption of AI-enabled tools, the need for such best practices must be addressed and incorporated into new and existing implementations. This commentary aims to encourage conversation and motivate additional research across clinical and data science stakeholders.
Keywords: artificial intelligence; dataset shift; machine learning; model updating; risk model surveillance
Year: 2022 PMID: 36120717 PMCID: PMC9478183 DOI: 10.3389/fdgth.2022.958284
Source DB: PubMed Journal: Front Digit Health ISSN: 2673-253X
Prediction models evaluated for temporal validation of real-time scores generated within a production electronic medical record system.
| Details | LACE+ | VSAIL |
|---|---|---|
| Outcome | 30-day readmission | 30-day suicidal ideation or attempt |
| Intended use | Quality benchmarking using predicted risk of readmission calculated at discharge | Clinical decision support delivered at arrival for inpatient and outpatient encounters |
| Development setting | Patients from multiple hospitals in Ontario, Canada | VUMC patient population |
| Modeling approach | Logistic regression | Random forest |
| Evaluation period | January 2018 through March 2022 | December 2019 through January 2022 |
VUMC, Vanderbilt University Medical Center.
Figure 1. Temporal performance at Vanderbilt University Medical Center of the (A) LACE+ readmission model in terms of mean calibration (O:E); and (B) VSAIL suicidality model in terms of number needed to screen (NNS).
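The two metrics named in the figure caption can be illustrated with a minimal sketch. It assumes that mean calibration (O:E) is the observed event count divided by the sum of predicted risks, and that NNS is the number of flagged patients per true case (the reciprocal of positive predictive value) at a chosen threshold. The function names, arguments, and threshold are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def observed_to_expected(outcomes: Sequence[int],
                         predicted_risks: Sequence[float]) -> float:
    """Mean calibration: observed events divided by expected events.

    Expected events are taken as the sum of predicted risks (assumption).
    Values near 1 suggest good mean calibration; values above 1 suggest
    the model underpredicts risk, values below 1 that it overpredicts.
    """
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("sum of predicted risks must be positive")
    return sum(outcomes) / expected


def number_needed_to_screen(outcomes: Sequence[int],
                            predicted_risks: Sequence[float],
                            threshold: float) -> float:
    """Flagged patients per true case at the threshold (assumed: 1 / PPV)."""
    flagged = [y for y, p in zip(outcomes, predicted_risks) if p >= threshold]
    true_positives = sum(flagged)
    if true_positives == 0:
        return float("inf")  # no true cases among flagged patients
    return len(flagged) / true_positives
```

Computing both metrics over successive time windows (for example, monthly) would give curves of the kind the caption describes, although the windowing used for the figure is not stated here.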
Overview of gaps in best practices for model maintenance.
| Domain | Gaps/Needs |
|---|---|
| Maintenance policies | |
| How should model ownership impact local control over maintenance? | Policies establishing updating expectations of proprietary models; Clarity and fairness of local updating opportunities of proprietary models; Prototypes for establishing collaborative updating of multi-system owned models |
| How do we ensure comparable performance across demographic groups is sustained during the maintenance phase? | Guidance on whether and when changes in model fairness warrant pausing AI-enabled tools; Methods for addressing performance fairness drift when model performance deteriorates differentially across subpopulations |
| How do we communicate model changes to end users and promote acceptance? | Design of effective communication strategies for warning end users of model performance drift and informing users when updated models are implemented; Guidance on aligning messaging with end-user AI literacy |
| Performance monitoring | |
| At what level should model performance be monitored and maintained? | Guidance on aligning monitoring and maintenance with use case needs; Recommendations for handling monitoring in smaller health systems, including determining minimum sample size and methods for collaborative monitoring; Policies supporting collaborative model maintenance in low data resource settings; Guidance on managing interim periods of local performance drift between releases of proprietary models that cannot be locally updated |
| What aspects of performance should be monitored? | Generalization recommendations on frequency and sample sizes for measuring performance across a variety of metrics; Customizable and expandable tools to monitor a matrix of metrics; Guidelines for aligning metrics of interest with use case needs |
| How do we define meaningful changes in performance? | Framework for selecting drift detection methods; Guidance on establishing clinically acceptable ranges of performance and defining clinically relevant decision boundaries; Methods for tailoring drift detection algorithms to detect a clinically important change |
| Are there other aspects of AI models that we should monitor, in addition to performance? | Approaches to systematically surveil external features that may impact model inputs and for monitoring input data distributions; Guidance on when to update in response to changes in model inputs if performance remains stable; Systems for disseminating information on changes anticipated to affect common AI models |
| Model updating | |
| What updating approaches should be considered? | Approaches to optimizing update method selection based on performance characteristics most relevant to use case needs; Expanded suite of testing procedures options for more updating methods and increased computational efficiency; Guidance on defining acceptable performance and methods to determine which updating methods, if any, restore acceptable performance |
| Should clinically meaningful or statistically significant changes in performance guide updating practice? | Guidance on whether to update models when statistically significant improvement is possible but updating would not provide a clinically meaningful improvement; Methods for comparing updating options that incorporate tests for both statistical and clinical significance; Recommendations for decision-making in cases where available updating methods do not restore performance to acceptable levels |
| How do we handle biased outcome feedback after model implementation? | Recommendations for assessing feedback from effective AI-enabled interventions; Methods for model development, validation, and updating that are robust to confounding by intervention |
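
Several of the monitoring gaps listed above (acceptable performance ranges, minimum sample sizes, drift detection) can be made concrete with a small sketch. It assumes performance is tracked as O:E per time period and flags any period whose value leaves a pre-specified range. The function name, the bounds `lower` and `upper`, and `min_n` are hypothetical placeholders, not recommended values; the table itself identifies guidance on choosing them as an open need.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def flag_calibration_drift(records: Iterable[Tuple[str, int, float]],
                           lower: float = 0.8,
                           upper: float = 1.2,
                           min_n: int = 100) -> Dict[str, str]:
    """Label each period 'ok', 'drift', or 'insufficient data'.

    records: (period, outcome, predicted_risk) tuples, e.g. ("2022-01", 1, 0.3).
    lower/upper: assumed acceptable O:E range (illustrative only).
    min_n: assumed minimum number of scored encounters per period.
    """
    by_period = defaultdict(list)
    for period, outcome, risk in records:
        by_period[period].append((outcome, risk))

    report = {}
    for period in sorted(by_period):
        rows = by_period[period]
        expected = sum(risk for _, risk in rows)
        if len(rows) < min_n or expected <= 0:
            # Too few observations to estimate O:E for this period
            report[period] = "insufficient data"
            continue
        oe = sum(outcome for outcome, _ in rows) / expected
        report[period] = "ok" if lower <= oe <= upper else "drift"
    return report
```

A fixed range is only one possible design; it does not test whether a change is statistically significant, and formal drift detection algorithms may suit some use cases better.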