Jenna M Reps1,2, Patrick Ryan3,2, P R Rijnbeek3,4.
Abstract
OBJECTIVE: The internal validation of prediction models aims to quantify the generalisability of a model. We aim to determine the impact, if any, that the choice of development and internal validation design has on the internal performance bias and model generalisability in big data (n~500 000).
Keywords: health informatics; preventive medicine; statistics & research methods
Year: 2021 PMID: 34952871 PMCID: PMC8710861 DOI: 10.1136/bmjopen-2021-050146
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1. Possible development and internal validation design strategies for big data. The options include whether to use a test set (hold out some data from development that is used to fairly assess performance) and whether to use cross-validation (where the data are partitioned, and each partition is iteratively held out while the rest of the data are used to develop the model).
Outcomes predicted in this study and the logic used to define the outcome in the data
| Outcome | Phenotype | Event count in development data | Event count in MDCR data (N~160 956) | Event count in MDCD data (N~539 813) |
| Open angle glaucoma | A first-time condition record of open-angle glaucoma with at least one condition record of open-angle glaucoma from a provider with ophthalmology, optometry or optician specialty within 1–365 days. | 174 | 510 | 102 |
| Acute liver injury | A first-time condition record of acute liver injury during an emergency room visit or inpatient visit. No acute liver injury exclusions from 1 year prior to 60 days after. | 184 | 67 | 352 |
| Ventricular arrhythmia and sudden cardiac death | A first-time condition record of ventricular arrhythmia and sudden cardiac death during an emergency room visit or inpatient visit being the primary cause of the visit. | 297 | 642 | 1188 |
| Ischaemic stroke | A first-time condition record of ischaemic stroke during an inpatient visit | 380 | 1153 | 674 |
| Acute myocardial infarction | A first-time condition record of acute myocardial infarction during an emergency room visit or inpatient visit being the primary cause of the visit. | 491 | 1080 | 1042 |
| Gastrointestinal haemorrhage | A first-time condition record of gastrointestinal haemorrhage during an emergency room visit or inpatient visit being the primary cause of the visit. | 509 | 963 | 1037 |
| Delirium | A first-time condition record of delirium during an emergency room visit or inpatient visit | 985 | 1298 | 1842 |
| Seizure | A first-time condition record of seizure during an emergency room visit or inpatient visit | 1494 | 935 | 4314 |
| Decreased libido | A first-time condition record of decreased libido | 1661 | 130 | 926 |
| Alopecia | A first-time condition record of alopecia | 2577 | 748 | 2674 |
| Hyponatraemia | A first-time condition record of hyponatraemia or a first-time measurement of serum sodium between 1 and 136 millimole/L | 2628 | 4276 | 6035 |
| Fracture | A first-time condition record of fracture | 2722 | 4071 | 4692 |
| Vertigo | A first-time condition record of vertigo | 3046 | 2086 | 2791 |
| Tinnitus | A first-time condition record of tinnitus | 3120 | 1824 | 3186 |
| Hypotension | A first-time condition record of hypotension | 4170 | 6399 | 10 738 |
| Hypothyroidism | A condition record of hypothyroidism with another condition record of hypothyroidism within 90 days | 6117 | 3853 | 6064 |
| Suicide and suicidal ideation | A first-time condition record of suicide and suicidal ideation or a first-time observation of suicide and suicidal ideation | 10 221 | 993 | 24 972 |
| Constipation | A first-time condition record of constipation | 10 672 | 7569 | 23 463 |
| Diarrhoea | A first-time condition record of diarrhoea | 14 875 | 7226 | 24 941 |
| Nausea | A first-time condition record of nausea | 19 754 | 7824 | 38 344 |
| Insomnia | A first-time condition record of insomnia | 20 806 | 6846 | 32 118 |
MDCD, Multi-state Medicaid Database; MDCR, Medicare Supplemental Database.
The different designs compared in this study.
| Design | CV | Test set? | Hyperparameter selection | Model development | Internal validation |
| No test/validation set | 0 | No | Using all data | Using all data | Using all data |
| Test/validation set | 0 | Yes | Using 10% validation data | Using 80% training data | Using 10% test data |
| Threefold CV | 3 | No | Using threefold CV on all data | Using all data | Using threefold CV on all data |
| Threefold CV with test set | 3 | Yes | Using threefold CV on 80% training data | Using 80% training data | Using 20% test data |
| Fivefold CV | 5 | No | Using fivefold CV on all data | Using all data | Using fivefold CV on all data |
| Fivefold CV with test set | 5 | Yes | Using fivefold CV on 80% training data | Using 80% training data | Using 20% test data |
| Ten-fold CV | 10 | No | Using 10-fold CV on all data | Using all data | Using 10-fold CV on all data |
| Ten-fold CV with test set | 10 | Yes | Using 10-fold CV on 80% training data | Using 80% training data | Using 20% test data |
CV, cross-validation.
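The designs in the table above can be sketched as index partitions. The following is a minimal illustration using only the Python standard library, not the study's actual implementation; the function names `k_fold_pairs` and `design_partitions`, the dictionary keys and the `seed` parameter are hypothetical and chosen here for illustration only.

```python
import random


def k_fold_pairs(indices, k):
    """Return k (train, held_out) index pairs; each index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, held_out))
    return pairs


def design_partitions(n, folds=0, test_set=False, seed=0):
    """Partition n patient indices according to one design from the table.

    folds=0 means no cross-validation (CV).
    - test set, no CV: 80% training / 10% validation / 10% test
    - test set, CV:    80% training (CV runs inside it) / 20% test
    - no test set:     all data are used for development; performance is
                       estimated on the same data or through CV on all data
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: a simple random split
    validation, test = None, None
    if test_set and folds == 0:
        cut_a, cut_b = n * 8 // 10, n * 9 // 10
        develop = indices[:cut_a]
        validation = indices[cut_a:cut_b]
        test = indices[cut_b:]
    elif test_set:
        cut = n * 8 // 10
        develop, test = indices[:cut], indices[cut:]
    else:
        develop = indices
    # CV folds are always built on the development data only.
    cv_folds = k_fold_pairs(develop, folds) if folds else []
    return {
        "develop": develop,
        "validation": validation,
        "test": test,
        "cv_folds": cv_folds,
    }
```

For example, `design_partitions(500_000, folds=3, test_set=True)` would correspond to the "Threefold CV with test set" row: hyperparameters are selected by threefold CV within the 80% training data, and the 20% test set is touched only for internal validation.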
The characteristics of the study populations
| | Development data (N~500 000) | MDCR data (N~160 956) | MDCD data (N~539 813) |
| Mean Age in years (SD) | 40 (15) | 75 (7.8) | 34 (16.6) |
| Male gender % | 31 | 32 | 27.1 |
| Mean days prior observation (SD) | 1474 (1205) | 1585 (1192) | 1244 (885) |
| Condition recorded in prior year (% of patients) | | | |
| Neoplastic disease | 21.1 | 45.7 | 13.4 |
| Pain | 60.1 | 74.4 | 72.8 |
| Anxiety | 41.3 | 28.6 | 50.8 |
| Respiratory tract infection | 15.9 | 12.0 | 22.2 |
| Dementia | 0.0 | 0.9 | 0.1 |
| Obesity | 10.5 | 10.6 | 17.9 |
| Diabetes mellitus | 8.9 | 27.0 | 13.5 |
| Hypertensive disorder | 24.7 | 69.0 | 29.4 |
| Heart disease | 9.2 | 46.5 | 14.0 |
| Hyperlipidaemia | 23.3 | 56.3 | 19.8 |
MDCD, Multi-state Medicaid Database; MDCR, Medicare Supplemental Database.
Figure 2. The AUROC/AUPRC/E-statistic performance estimates for five repetitions per design per prediction task. The columns represent the prediction task, with the number representing the number of patients with the outcome during the time at-risk. For example, the first column corresponds to a prediction task where 174 patients had the outcome, whereas the last column corresponds to a prediction task where 20 806 patients had the outcome. The rows correspond to whether CV was used by the design (top row does not use CV) or the number of folds (3, 5 or 10). The internal validation performances of the designs that used a test set are coloured in red, and those not using a test set are blue (dots with vertical lines indicating the 95% confidence interval). The external validation performances for a model are the light grey pointers (MDCD) and black crosses (MDCR) that have the same x-coordinate and fall within the same row/column. AUPRC, area under the precision recall curve; AUROC, area under the receiver operating characteristic curve; CV, cross-validation; MDCD, Multi-state Medicaid Database; MDCR, Medicare Supplemental Database.
Figure 3. Box plots showing the internal performance estimate minus the external performance estimate per design and external database. The left side shows the AUROC differences, the centre shows the AUPRC differences, and the right side shows the E-statistic differences. For the AUROC, values near 0 indicate that the internal validation AUROC estimates were accurate, as the external validation AUROCs were similar. For the AUROC and AUPRC, values less than 0 indicate that performance was better externally, and values greater than 0 indicate that performance was worse externally. For the E-statistic, values less than 0 indicate worse calibration when the models were externally validated. AUPRC, area under the precision recall curve; AUROC, area under the receiver operating characteristic curve; CV, cross-validation; MDCD, Multi-state Medicaid Database; MDCR, Medicare Supplemental Database.
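The quantity plotted in Figure 3 is a simple difference, which the following sketch illustrates for the AUROC. It is an assumption-laden illustration rather than the study's code: `auroc` and `internal_minus_external` are hypothetical names, and the AUROC is computed with a plain pairwise comparison that is only practical for small examples.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This pairwise form is quadratic in the number of patients, so it is meant
    for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def internal_minus_external(internal, external):
    """Difference plotted in Figure 3: internal estimate minus external estimate.

    For AUROC and AUPRC, a positive value means performance was worse
    externally. For the E-statistic, a negative value means calibration
    was worse externally.
    """
    return internal - external
```

With these helpers, the AUROC gap for one model would be `internal_minus_external(auroc(y_int, p_int), auroc(y_ext, p_ext))`, where the four arguments are the labels and predicted risks from the internal and external validation data.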