Fatemeh Rahimian, Gholamreza Salimi-Khorshidi, Amir H Payberah, Jenny Tran, Roberto Ayala Solares, Francesca Raimondi, Milad Nazarzadeh, Dexter Canoy, Kazem Rahimi.
Abstract
BACKGROUND: Emergency admissions are a major source of healthcare spending. We aimed to derive, validate, and compare conventional and machine learning models for prediction of the first emergency admission. Machine learning methods are capable of capturing complex interactions that are likely to be present when predicting less specific outcomes, such as this one.
Year: 2018 PMID: 30458006 PMCID: PMC6245681 DOI: 10.1371/journal.pmed.1002695
Source DB: PubMed Journal: PLoS Med ISSN: 1549-1277 Impact factor: 11.069
Predictors considered, and how they are represented in CPRD and in our models.
| Predictor | Representation in CPRD | Representation in models | In QA | In QA+ | In T | |
|---|---|---|---|---|---|---|
| Age | Year of birth | Computed based on mid-year of year of birth | * | * | * | |
| Sex | Binary variable (male/female) | Binary variable | * | * | * | |
| Ethnicity | Ethnicity (categorical value) | Categorical variable | * | * | * | |
| Socioeconomic status | Index of Multiple Deprivation | Numeric variable on a scale of 1 to 5 | * | * | * | |
| BMI | Weight measurement recorded repeatedly in various clinic visits | BMI based on height and most recent recorded weight | * | * | * | |
| Smoking status | Current tobacco use, in terms of number of cigars/cigarettes per day (recorded repeatedly) | Categorical variable for latest status: non-smoker, ex-smoker, light smoker (less than 10 cigarettes/day), moderate smoker (10–20 cigarettes/day), heavy smoker (more than 20 cigarettes/day), smoker (amount not recorded) | * | * | * | |
| Alcohol intake | Current alcohol consumption, in terms of units of alcohol per day (recorded repeatedly) | Categorical variable for latest status: non-drinker, ex-drinker, trivial (less than 1 unit/week), light (1–2 units/week), moderate (3–6 units/week), heavy (7–9 units/week), very heavy (more than 9 units/week), drinker (amount not recorded) | * | * | * | |
| Family history of chronic disease | Binary variable (yes/no) | Binary variable (yes/no) | * | * | * | |
| Strategic health authority (region) | Categorical variable | Categorical variable | * | * | * | |
| Marital status | Categorical variable | Categorical variable | * | * | ||
| Previous emergency admissions | Read Code and date of event | Number of occurrences during last year | * | * | * | |
| Time since last occurrence (in days) | * | |||||
| Prior GP visits (consultations) | Read Code and date of event | Number of occurrences during last year | * | * | ||
| Time since last occurrence (in days) | * | |||||
| Total duration spent in GP visits (minutes) | * | |||||
| Diabetes, atrial fibrillation, cardiovascular disease, congestive cardiac failure, venous thromboembolism, cancer, asthma or COPD, epilepsy, falls, manic depression or schizophrenia, chronic renal disease, chronic liver disease or pancreatitis, valvular heart disease, treated hypertension, rheumatoid arthritis or SLE, depression (QOF definition) | Read Code and date of entry | One separate binary variable for each disease, 16 variables in total | * | * | ||
| Time since first diagnosis (in days)—1 separate variable for each disease, 16 variables in total | * | |||||
| Arthritis, connective tissue disease, hemiplegia, HIV/AIDS, hyperlipidaemia, learning disability, obesity, osteoporosis, peripheral arterial disease, peptic ulcer disease, substance abuse | Read Code and date of entry | One separate binary variable for each disease, 11 variables in total | * | |||
| Time since first diagnosis (in days)—1 separate variable for each disease, 11 variables in total | * | |||||
| Systolic blood pressure, haemoglobin, cholesterol/HDL, liver function test (γ-GT, aspartate aminotransferase, or bilirubin), platelets, ESR | Numeric value for result and date of measurement | Binary (yes/no) variable for if recorded—1 variable per test | * | * | ||
| Numeric variable for most recent result—1 variable per test | * | * | * | |||
| Binary variable for abnormal result—1 variable per test | * | * | * | |||
| Time since the latest result (in days)—1 variable per test | * | |||||
| Statin, NSAID, anticoagulant, corticosteroid, antidepressant, antipsychotic | Date of prescription if applicable | Binary (yes/no) variable for if prescription exists | * | * | * |
γ-GT, γ-glutamyl transferase; COPD, chronic obstructive pulmonary disease; CPRD, Clinical Practice Research Datalink; ESR, erythrocyte sedimentation rate; GP, general practice; HDL, high-density lipoprotein; NSAID, non-steroidal anti-inflammatory drug; QOF, Quality and Outcomes Framework; SLE, systemic lupus erythematosus.
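The smoking categories listed in the table can be illustrated with a small mapping function. This is a minimal sketch, not the authors' code: the function name `smoking_category` and the input encoding (`"never"`, `"ex"`, `"current"`) are assumptions made for illustration, while the thresholds (fewer than 10, 10 to 20, more than 20 cigarettes/day) come from the table.

```python
from typing import Optional


def smoking_category(status: str, cigarettes_per_day: Optional[float] = None) -> str:
    """Map a latest smoking record to the categorical levels in the table.

    `status` uses a hypothetical encoding: "never", "ex", or "current".
    `cigarettes_per_day` may be None when the amount was not recorded.
    """
    if status == "never":
        return "non-smoker"
    if status == "ex":
        return "ex-smoker"
    if status != "current":
        raise ValueError(f"unknown smoking status: {status!r}")
    if cigarettes_per_day is None:
        return "smoker (amount not recorded)"
    if cigarettes_per_day < 10:
        return "light smoker"
    if cigarettes_per_day <= 20:
        return "moderate smoker"
    return "heavy smoker"
```

For example, `smoking_category("current", 15)` gives `"moderate smoker"`. The alcohol categories could be handled the same way with the weekly-unit thresholds from the table.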
Baseline characteristics of derivation and validation cohorts.
| Predictor | Derivation cohort | Validation cohort | |
|---|---|---|---|
| Female | 1,937,265 (51.66) | 454,424 (51.21) | |
| Male | 1,812,667 (48.34) | 432,941 (48.79) | |
| Age, mean (SD) | 51.0 (19.8) | 53.1 (19.9) | |
| Missing | 2,960,949 (78.96) | 735,198 (82.85) | |
| Single | 481,753 (12.85) | 44,941 (5.06) | |
| Married/stable relationship | 481,753 (12.85) | 94,000 (10.60) | |
| Separated/widowed | 64,441 (1.72) | 13,226 (1.49) | |
| IMD, mean (SD) | 2.8 (1.4) | 3.3 (1.4) | |
| 646,360 (17.24) | 196,800 (22.18) | ||
| Missing, n (%) | 1,094,892 (29.20) | 242,324 (27.31) | |
| Mean (SD) | 26.1 (5.6) | 26.4 (5.8) | |
| North East | 0 | 89,004 (10.03) | |
| North West | 0 | 613,460 (69.13) | |
| Yorkshire and the Humber | 0 | 184,901 (20.84) | |
| East Midlands | 150,831 (4.02) | 0 | |
| West Midlands | 518,586 (13.83) | 0 | |
| East of England | 540,346 (14.41) | 0 | |
| South West | 558,036 (14.88) | 0 | |
| South Central | 572,791 (15.27) | 0 | |
| London | 817,870 (21.81) | 0 | |
| South East Coast | 591,472 (15.77) | 0 | |
| Missing | 2,625,523 (70.02) | 536,806 (60.49) | |
| White | 1,039,476 (27.72) | 339,466 (38.26) | |
| Indian | 16,740 (0.45) | 1,571 (0.18) | |
| Pakistani | 6,153 (0.16) | 2,395 (0.27) | |
| Bangladeshi | 1,958 (0.05) | 355 (0.04) | |
| Other Asian | 7,466 (0.20) | 711 (0.08) | |
| Caribbean | 9,786 (0.26) | 541 (0.06) | |
| Black African | 15,499 (0.41) | 1,234 (0.14) | |
| Chinese | 3,493 (0.09) | 810 (0.09) | |
| Other | 23,838 (0.64) | 3,476 (0.39) | |
| Missing | 680,838 (18.16) | 143,032 (16.12) | |
| Non-smoker | 1,678,287 (44.76) | 379,395 (42.76) | |
| Ex-smoker | 392,806 (10.48) | 90,539 (10.20) | |
| Light smoker (<10 cigarettes/day) | 286,113 (7.63) | 69,704 (7.86) | |
| Moderate smoker (10–20 cigarettes/day) | 345,639 (9.22) | 102,533 (11.55) | |
| Heavy smoker (>20 cigarettes/day) | 250,567 (6.68) | 80,713 (9.10) | |
| Smoker, amount not recorded | 115,367 (3.08) | 21,321 (2.40) | |
| Missing | 1,089,383 (29.05) | 235,862 (26.58) | |
| Non-drinker | 360,048 (9.60) | 82,905 (9.34) | |
| Ex-drinker | 24,802 (0.66) | 8,339 (0.94) | |
| Trivial (<1 unit/week) | 249,020 (6.64) | 46,001 (5.18) | |
| Light (1–2 units/week) | 447,954 (11.95) | 99,739 (11.24) | |
| Moderate (3–6 units/week) | 418,400 (11.16) | 101,713 (11.46) | |
| Heavy (7–9 units/week) | 161,290 (4.30) | 40,942 (4.61) | |
| Very heavy (>9 units/week) | 627,261 (16.73) | 194,999 (21.98) | |
| Drinker, amount not recorded | 371,774 (9.91) | 76,865 (8.66) | |
| No emergency admission, n (%) | 3,583,848 (95.57) | 834,693 (94.06) | |
| 1 emergency admission, n (%) | 120,614 (3.22) | 36,046 (4.06) | |
| 2 emergency admissions, n (%) | 30,111 (0.80) | 10,546 (1.19) | |
| 3+ emergency admissions, n (%) | 15,359 (0.41) | 6,080 (0.69) | |
| Mean number of days since last admission (SD) | 170.7 (103.7) | 169.6 (105.8) | |
| Mean number of consultations (SD) | 21.7 (24.4) | 24.4 (25.9) | |
| Mean consultation duration (SD) | 124.9 (227.5) | 163.0 (374.3) | |
| Mean number of days since last consultation (SD) | 300.8 (83.4) | 307.4 (78.7) | |
| Systolic blood pressure: missing, n (%) | 443,729 (11.83) | 98,150 (11.06) | |
| Systolic blood pressure, mean (SD) | 127.9 (18.4) | 128.6 (19.2) | |
| Cholesterol/HDL: missing, n (%) | 2,781,874 (74.18) | 593,814 (66.92) | |
| Cholesterol/HDL, mean (SD) | 3.8 (1.6) | 3.8 (1.8) | |
| Haemoglobin: missing, n (%) | 2,012,077 (53.66) | 434,005 (48.91) | |
| Haemoglobin < 110 g/l, n (%) | 84,396 (2.25) | 23,178 (2.61) | |
| Platelets: missing, n (%) | 2,056,437 (54.84) | 449,522 (50.66) | |
| Platelets > 480 × 10⁹/l, n (%) | 21,305 (0.57) | 5,900 (0.66) | |
| Liver function test: missing, n (%) | 2,285,715 (60.95) | 489,673 (55.18) | |
| Abnormal liver function test, n (%) | 23,217 (0.62) | 9,328 (1.05) | |
| ESR: missing, n (%) | 2,908,165 (77.55) | 683,599 (77.04) | |
| Abnormal ESR, n (%) | 96,436 (2.57) | 21,828 (2.46) | |
| Diabetes | 326,672 (8.71) | 83,309 (9.39) | |
| Atrial fibrillation | 122,627 (3.27) | 49,647 (5.59) | |
| Cardiovascular disease | 379,071 (10.11) | 104,215 (11.74) | |
| Congestive cardiac failure | 140,439 (3.75) | 53,742 (6.06) | |
| Venous thromboembolism | 99,083 (2.64) | 36,791 (4.15) | |
| Cancer | 143,923 (3.84) | 36,677 (4.13) | |
| Asthma or COPD | 753,223 (20.09) | 162,853 (18.35) | |
| Epilepsy | 103,800 (2.77) | 7,690 (0.87) | |
| Falls | 354,748 (9.46) | 86,801 (9.78) | |
| Manic depression or schizophrenia | 33,716 (0.90) | 0 (0.00) | |
| Chronic renal disease | 272,292 (7.26) | 72,221 (8.14) | |
| Chronic liver disease or pancreatitis | 68,726 (1.83) | 0 (0.00) | |
| Valvular heart disease | 49,274 (1.31) | 0 (0.00) | |
| Treated hypertension | 892,430 (23.8) | 193,826 (21.84) | |
| Rheumatoid arthritis or SLE | 58,658 (1.56) | 0 (0.00) | |
| Depression (QOF definition) | 862,357 (23.0) | 173,965 (19.6) | |
| Arthritis | 524,936 (14.0) | 161,050 (18.15) | |
| Connective tissue disease | 32,850 (0.88) | 7,079 (0.80) | |
| Hemiplegia | 7,097 (0.19) | 2,553 (0.29) | |
| HIV/AIDS | 29,701 (0.79) | 7,176 (0.81) | |
| Hyperlipidaemia | 216,304 (5.77) | 66,238 (7.46) | |
| Learning disability | 18,574 (0.50) | 5,069 (0.57) | |
| Obesity | 231,123 (6.16) | 66,210 (7.46) | |
| Osteoporosis | 66,877 (1.78) | 20,056 (2.26) | |
| Peripheral arterial disease | 56,828 (1.52) | 20,761 (2.34) | |
| Peptic ulcer disease | 62,122 (1.66) | 24,151 (2.72) | |
| Substance abuse | 54,517 (1.45) | 19,673 (2.22) | |
| Statin | 552,982 (14.75) | 164,814 (18.57) | |
| NSAID | 1,505,161 (40.14) | 423,637 (47.74) | |
| Anticoagulant | 122,803 (3.27) | 34,285 (3.86) | |
| Corticosteroid | 809,336 (21.58) | 214,067 (24.12) | |
| Antidepressant | 649,131 (17.31) | 210,259 (23.69) | |
| Antipsychotic | 114,487 (3.05) | 40,060 (4.51) | |
COPD, chronic obstructive pulmonary disease; ESR, erythrocyte sedimentation rate; HDL, high-density lipoprotein; IMD, Index of Multiple Deprivation; NSAID, non-steroidal anti-inflammatory drug; QOF, Quality and Outcomes Framework; SLE, systemic lupus erythematosus.
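The count and percentage pairs in the baseline table can be sanity-checked against the cohort sizes implied by the sex rows. The following is a minimal sketch of that arithmetic; the helper name `percent` is an assumption for illustration, and the cohort totals are taken as the sum of the female and male counts reported above.

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to 2 decimal places."""
    return round(100 * count / total, 2)


# Cohort sizes implied by the Female and Male rows of the table.
derivation_total = 1_937_265 + 1_812_667
validation_total = 454_424 + 432_941

# Recompute the reported percentages for the Female row.
female_derivation_pct = percent(1_937_265, derivation_total)
female_validation_pct = percent(454_424, validation_total)
```

A recomputed percentage that disagrees with the printed one by more than rounding error is a hint that a digit or thousands separator was damaged in the count.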
Cross-validated model discrimination for different predictor sets and modelling techniques: Derivation cohort.
| Predictor set | CPH AUC | CPH 95% CI | RF AUC | RF 95% CI | GBC AUC | GBC 95% CI |
|---|---|---|---|---|---|---|
| | 0.741 | 0.739, 0.743 | 0.754 | 0.752, 0.756 | 0.777 | 0.775, 0.779 |
| | 0.739 | 0.738, 0.740 | 0.755 | 0.754, 0.756 | 0.779 | 0.777, 0.781 |
| | 0.740 | 0.739, 0.741 | 0.752 | 0.751, 0.753 | 0.779 | 0.777, 0.781 |
| | 0.751 | 0.750, 0.753 | 0.822 | 0.818, 0.826 | 0.834 | 0.833, 0.835 |
| | 0.805 | 0.804, 0.806 | 0.825 | 0.824, 0.826 | 0.848 | 0.847, 0.849 |
For any given set of predictors, GBC outperforms the other 2 models. Similarly, for any given model, T predictors show the best predictive power.
AUC, area under the receiver operating characteristic curve; CPH, Cox proportional hazards; GBC, gradient boosting classifier; RF, random forest.
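For readers unfamiliar with the discrimination metric, AUC can be computed directly as the probability that a randomly chosen admitted patient receives a higher predicted risk than a randomly chosen non-admitted patient. The sketch below is a plain pairwise implementation for illustration only; the function name `roc_auc` is an assumption, and it does not reproduce how the study computed its cross-validated estimates or confidence intervals.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive-negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in cohort size, so for cohorts of this scale a rank-based implementation from a statistics library would be the practical choice.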
Fig 1. Cross-validated model calibration for different predictor sets and modelling techniques.
(a) QA variables; (b) QA+ variables; (c) T variables. The x-axis shows the predicted probability of emergency admission, while the y-axis shows the fraction of actual admissions for each predicted probability. The shaded areas depict the standard deviation across different folds in a 5-fold cross-validation. CPH, Cox proportional hazards; GBC, gradient boosting classifier; RF, random forest.
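The calibration plot described in this caption pairs a predicted probability (x-axis) with the observed fraction of admissions (y-axis). A minimal sketch of how such points could be derived is shown below, assuming equal-width probability bins; the function name `calibration_points` and the binning scheme are illustrative assumptions, not the authors' stated procedure.

```python
from typing import List, Sequence, Tuple


def calibration_points(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed event fraction) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    events = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        prob_sums[b] += p
        events[b] += y
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], events[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


points = calibration_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_bins=2)
```

A well-calibrated model yields points close to the diagonal, where the mean predicted probability in each bin matches the observed fraction.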
Externally validated model discrimination for different predictor sets and modelling techniques: Validation cohort.
| Predictor set | CPH | RF | GBC |
|---|---|---|---|
| QA | 0.736 | 0.736 | 0.796 |
| QA+ | 0.743 | 0.799 | 0.810 |
| T | 0.788 | 0.810 | 0.826 |
Predictor set T and GBC modelling consistently perform better than their counterparts. The results conform to the pattern observed in internal cross-validation.
CPH, Cox proportional hazards; GBC, gradient boosting classifier; RF, random forest.
Fig 2. Externally validated model calibration for different predictor sets and modelling techniques.
(a) QA variables; (b) QA+ variables; (c) T variables. The x-axis shows the predicted probability of emergency admission, while the y-axis shows the fraction of actual admissions for each predicted probability. CPH, Cox proportional hazards; GBC, gradient boosting classifier; RF, random forest.
Fig 3. Model discrimination for different follow-up periods (from 12 to 60 months after baseline).
Colours differentiate the 3 modelling techniques (GBC, RF, and CPH), whereas line styles indicate the predictor sets (QA, QA+, and T). AUC, area under the receiver operating characteristic curve; CPH, Cox proportional hazards; GBC, gradient boosting classifier; RF, random forest.