Bohua Wan, Brian Caffo, S. Swaroop Vedula.
Abstract
To be useful, clinical prediction models (CPMs) must be generalizable to patients in new settings. Evaluating the generalizability of CPMs helps identify spurious relationships in data, provides insights on when they fail, and thus improves the explainability of the CPMs. There are discontinuities in concepts related to generalizability of CPMs between the clinical research and machine learning domains. Specifically, conventional statistical reasons offered to explain poor generalizability, such as inadequate model development for the purposes of generalizability, differences in coding of predictors and outcome between development and external datasets, measurement error, inability to measure some predictors, and missing data, all have differing and often complementary treatments in the two domains. Much of the current machine learning literature on generalizability of CPMs is framed in terms of dataset shift, of which several types have been described. However, little research exists to synthesize concepts in the two domains. Bridging this conceptual discontinuity in the context of CPMs can facilitate systematic development of CPMs and evaluation of their sensitivity to factors that affect generalizability. We survey generalizability and dataset shift in CPMs from both the clinical research and machine learning perspectives, and describe a unifying framework to analyze generalizability of CPMs and to explain their sensitivity to factors affecting it. Our framework leads to a set of signaling statements that can be used to characterize differences between datasets in terms of factors that affect generalizability of the CPMs.
Keywords: clinical prediction models; dataset shift; diagnosis; explainability; external validity; generalizability; prognosis
Year: 2022 PMID: 35573904 PMCID: PMC9100692 DOI: 10.3389/frai.2022.872720
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. Selection diagrams for dataset shifts. Solid circles denote observable variables, hollow circles denote unobservable variables, and rectangles denote selection variables.
Figure 2. A framework to unify concepts related to generalizability of clinical prediction models. *This criterion is satisfied when there are no missing data in the development and external datasets. When there are missing data, it is satisfied when there is no difference in assumptions about the missingness between the datasets (e.g., missing completely at random in both datasets), or when there is no difference between the processes that introduced missingness in each dataset.
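The missingness criterion above can be screened for in practice by comparing per-predictor missingness rates between the development and external datasets. A minimal sketch (function names and the tolerance `tol` are illustrative, not from the paper), assuming each dataset is a NumPy array with NaN marking missing entries:

```python
import numpy as np

def missingness_profile(X):
    """Fraction of missing (NaN) entries per column of a 2-D array."""
    return np.isnan(X).mean(axis=0)

def missingness_differs(X_dev, X_ext, tol=0.05):
    """Signal a potential difference in completeness between datasets:
    True if any predictor's missingness rate differs by more than `tol`."""
    gap = np.abs(missingness_profile(X_dev) - missingness_profile(X_ext))
    return bool(np.any(gap > tol))

# Toy example: predictor 1 is far more often missing in the external dataset.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(1000, 3))
X_ext = rng.normal(size=(1000, 3))
X_ext[rng.random(1000) < 0.3, 1] = np.nan  # ~30% missing in column 1
print(missingness_differs(X_dev, X_ext))
```

Note that comparing rates cannot establish the missingness mechanism (e.g., MCAR vs. MAR); it only flags that the completeness of the observed sub-distributions differs, which is the signal the framework asks about.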
Hyper-parameters used in the simulation experiment.
| Component | Setting | Hyper-parameter | Value |
|---|---|---|---|
| Covariate distribution | Mixture of Gaussians | # of components | 10 |
| | | Mean | (−5, 5)† |
| | | Variance | (0, 1)† |
| Posterior process | Random linear model | a | (−2, 2)† |
| | | b | (−2, 2)† |
| | Random polynomial model | Degree | 5 |
| | | Root range | (−5, 5)† |
| Models | Linear regression model | No hyper-parameters | |
| | Polynomial regression model | Degree | 5 |
| | Multi-layer perceptron | Hidden layers | [100, 100] |
| | | Activation function | ReLU |
| | | Optimizer | Adam |
| | | Maximum iterations for training | 1000 |
Every variable is subject to noise sampled from N(0, 1).
† The value is sampled uniformly from the range. ReLU, Rectified Linear Unit.
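The table above can be read as a data-generating recipe. A minimal sketch of one simulation run under that reading (function names are ours; this covers the mixture-of-Gaussians covariates, the random linear posterior, and the simplest model family from the table):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_covariates(n, n_components=10, mean_range=(-5, 5), var_range=(0, 1)):
    """Draw n covariates from a mixture of Gaussians whose component
    means and variances are sampled uniformly from the given ranges."""
    means = rng.uniform(*mean_range, size=n_components)
    stds = np.sqrt(rng.uniform(*var_range, size=n_components))
    comps = rng.integers(n_components, size=n)  # equal mixing weights assumed
    return rng.normal(means[comps], stds[comps])

def random_linear_posterior(x, coef_range=(-2, 2)):
    """Random linear model with slope a and intercept b sampled uniformly
    from coef_range, plus N(0, 1) noise on the outcome."""
    a, b = rng.uniform(*coef_range, size=2)
    return a * x + b + rng.normal(size=x.shape)

x = sample_covariates(1000)
y = random_linear_posterior(x)
# Fit the simplest model family from the table: linear regression.
a_hat, b_hat = np.polyfit(x, y, deg=1)
print(a_hat, b_hat)
```

The random polynomial posterior and the polynomial/MLP model families from the table would slot in analogously (e.g., `np.polyfit` with `deg=5`, or scikit-learn's `MLPRegressor` with `hidden_layer_sizes=(100, 100)`, ReLU activation, and the Adam solver).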
Figure 3. Simulation to illustrate model performance in external datasets with no dataset shifts. 1. The expectation of the estimate of algorithm performance in a test dataset is the mean of a distribution of estimates obtained by evaluating the algorithm on multiple test datasets. In other words, a difference in the magnitude of the error between the test and development datasets does not, by itself, indicate better or worse algorithm performance. 95% confidence intervals of the estimate in a test dataset, which indicate the width of the true distribution of estimates, are necessary. 2. Test datasets of sufficient sample size are necessary to minimize bias in the estimate of algorithm performance; the required size depends on model complexity.
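The first point in the caption can be illustrated by evaluating one fixed fitted model on many independently drawn test sets and summarizing the spread of the error estimates. A sketch under simplified assumptions (a known linear data-generating process with N(0, 1) noise; the 95% interval here uses simple percentiles of the observed estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 1.5, -0.5  # fixed, hypothetical data-generating linear model

def draw_dataset(n):
    x = rng.normal(size=n)
    y = a_true * x + b_true + rng.normal(size=n)  # N(0, 1) outcome noise
    return x, y

# Fit once on a development dataset; the model then stays fixed.
x_dev, y_dev = draw_dataset(500)
a_hat, b_hat = np.polyfit(x_dev, y_dev, deg=1)

# Evaluate on many test datasets to see the distribution of error estimates.
mses = []
for _ in range(200):
    x_te, y_te = draw_dataset(200)
    mses.append(np.mean((y_te - (a_hat * x_te + b_hat)) ** 2))

lo, hi = np.percentile(mses, [2.5, 97.5])
print(f"mean MSE {np.mean(mses):.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```

Any single test-set MSE falls somewhere in this interval even with no dataset shift, which is why a lone point estimate that differs from the development error is not by itself evidence of poor generalizability.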
Signaling statements to characterize datasets for generalizability of clinical prediction models.
| Difference between datasets | Related dataset shift | Signaling statement |
|---|---|---|
| The sub-distributions that are available to be sampled are different | Source component shift | • The process for selecting patients from the target population into the source population for the datasets relied upon the same inclusion and exclusion criteria. |
| The observed sub-distributions are different | Sample selection bias; source component shift | • The process for selecting patients into the datasets (sampling weights) resulted in different proportions of sub-populations from a similar source population. |
| The errors or biases affecting the measurement of predictors and/or outcomes are different | Domain shift | • The processes introducing error into measurement of the predictors in each dataset are identical. |
| The completeness of the observed sub-distributions is different | Source component shift | • Each predictor and outcome in the datasets is either a complete observation (i.e., no missingness) or it is incomplete with missingness completely at random. |
All three basic types of dataset shift (covariate shift, prior probability shift, and concept shift) may result from each of the ways of introducing dataset shifts shown in this table.
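One concrete way to probe the first two rows of the table is to test whether a covariate's distribution differs between the development and external datasets, for example with a two-sample Kolmogorov-Smirnov statistic per predictor. A self-contained sketch (the statistic is implemented directly rather than taken from a library; thresholds would need a proper significance calculation in practice):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs, evaluated at all sample points."""
    a, b = np.sort(a), np.sort(b)
    all_vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_vals, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(1)
dev = rng.normal(0, 1, size=2000)        # development-dataset covariate
ext_same = rng.normal(0, 1, size=2000)   # external dataset, same distribution
ext_shift = rng.normal(1, 1, size=2000)  # external dataset, shifted mean

print(ks_statistic(dev, ext_same))   # small gap: no evidence of shift
print(ks_statistic(dev, ext_shift))  # large gap: marginal covariate shift
```

Such a per-predictor marginal test can flag covariate shift but cannot, on its own, distinguish the mechanisms in the table (e.g., source component shift vs. sample selection bias); the signaling statements are needed to reason about how the difference arose.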
1. Causal graph or causal diagram (Pearl, …)
2. Transportability (Pearl and Bareinboim, …)
3. Trivial transportability (Pearl and Bareinboim, …)
4. Selection diagram (Pearl and Bareinboim, …). Selection variables are denoted by solid rectangles. The selection diagram is a succinct and unifying way to identify causal associations that can transport from the development dataset to the external datasets, and to denote distributional and structural differences between the development dataset and external datasets.
5. D-separation (Pearl, …)
6. S-admissibility (Pearl and Bareinboim, …)