Piotr Dworzynski, Martin Aasbrenn, Klaus Rostgaard, Mads Melbye, Thomas Alexander Gerds, Henrik Hjalgrim, Tune H Pers.
Abstract
Identification of individuals at risk of developing disease comorbidities represents an important task in tackling the growing personal and societal burdens associated with chronic diseases. We employed machine learning techniques to investigate to what extent data from longitudinal, nationwide Danish health registers can be used to predict individuals at high risk of developing type 2 diabetes (T2D) comorbidities. Leveraging logistic regression, random forest and gradient boosting models and register data spanning hospitalizations, drug prescriptions and contacts with primary care contractors from >200,000 individuals newly diagnosed with T2D, we predicted five-year risk of heart failure (HF), myocardial infarction (MI), stroke (ST), cardiovascular disease (CVD) and chronic kidney disease (CKD). For HF, MI, CVD, and CKD, register-based models outperformed a reference model leveraging canonical individual characteristics by achieving area under the receiver operating characteristic curve improvements of 0.06, 0.03, 0.04, and 0.07, respectively. The top 1,000 patients predicted to be at highest risk exhibited observed incidence ratios exceeding 4.99, 3.52, 1.97 and 4.71, respectively. In summary, prediction of T2D comorbidities utilizing Danish registers led to consistent albeit modest performance improvements over reference models, suggesting that register data could be leveraged to systematically identify individuals at risk of developing disease comorbidities.
Year: 2020 PMID: 32019971 PMCID: PMC7000818 DOI: 10.1038/s41598-020-58601-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Overview of Danish national health registers employed in this study. *ICD-8 used between 1977 and 1995, ICD-10 between 1995 and 2016. ICD-10 stands for 10th revision of International Statistical Classification of Diseases and Related Health Problems; DN, Danish National; NCSP, Nordic Medico-Statistical Committee Classification of Surgical Procedures; ATC, Anatomical Therapeutic Chemical Classification; #, number of.
| Register | Data | Int. classification | Dates | Total # records |
|---|---|---|---|---|
| DN Patient Register | Hospital diagnoses | ICD-10 | 1995 → 2016* | 179 million |
| DN Patient Register | Surgical procedures | NCSP | 1996 → 2016 | 26.9 million |
| DN Patient Register | Treatment procedures | — | 1999 → 2016 | 103.4 million |
| DN Patient Register | Diagnostic procedures | — | 1999 → 2016 | 62.5 million |
| DN Prescription Register | Drug prescriptions | ATC | 1994 → 2016 | 930.4 million |
| DN Health Service Register | Claims data | — | 1990 → 2016 | 2,147 million |
| DN Medical Birth Register | Pregnancies & prenatal care | — | 1973 → 2016 | 2.7 million |
Overview of study population characteristics among all newly diagnosed type 2 diabetics (T2D population) and five comorbidity populations (individuals undiagnosed with a comorbidity at the time of their first T2D diagnosis). T2D, type 2 diabetes; RFV, register feature vector; #, number of.
| | T2D population | Heart failure | Myocardial infarction | Stroke | Cardiovascular disease | Chronic kidney disease |
|---|---|---|---|---|---|---|
| # Individuals | 203,517 | 190,819 | 190,987 | 191,095 | 120,114 | 200,646 |
| # Cases | — | 8,940 (4.69%) | 6,485 (3.40%) | 7,922 (4.15%) | 33,057 (27.52%) | 5,617 (2.80%) |
| # Non-cases | — | 181,879 (95.31%) | 184,502 (96.60%) | 183,173 (95.85%) | 87,057 (72.48%) | 195,029 (97.20%) |
| % Women | 47.03 | 47.52 | 48.14 | 47.38 | 49.77 | 47.18 |
| Median age at T2D diagnosis | 61.44 | 60.67 | 60.95 | 60.80 | 57.25 | 61.32 |
| Median # days until outcome | — | 1631.60 | 1537.80 | 1641.60 | 1414.00 | 2065.30 |
| # Features in RFV | — | 6,155 | 6,155 | 6,155 | 6,155 | 6,155 |
Figure 1. The colored arrows represent each individual’s accumulated history of register events (determinants, used as predictive features). t represents the date of prediction (in this case the day of the first T2D diagnosis represented by blue circles). t depicts the buffer period of 30 days used to exclude individuals who were diagnosed with the comorbidity shortly before t. t represents the five-year prediction horizon. Individuals for whom the comorbidity occurred before the prediction horizon were removed. Individuals with their first comorbidity diagnosis occurring within the prediction horizon are referred to as cases and all others are referred to as non-cases. For each individual, register features representing medical events which occurred before the time of prediction are aggregated into counts (table below) and used as the prediction model’s predictor variables (features) to determine the likelihoods of being a case or a non-case. ML, machine learning.
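The labeling and feature aggregation described in Figure 1 can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names, the event codes, the approximation of five years as 5 × 365 days, and the reading of the buffer as an exclusion window ending 30 days after the prediction date are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta

BUFFER_DAYS = 30          # buffer period named in the caption
HORIZON_DAYS = 5 * 365    # five-year horizon, approximated in days (assumption)


def label_individual(t2d_date, comorbidity_date):
    """Return 'excluded', 'case', or 'non-case' for one individual.

    comorbidity_date is the first comorbidity diagnosis, or None if never diagnosed.
    """
    if comorbidity_date is not None:
        # Assumption: a diagnosis before the end of the buffer removes the individual.
        if comorbidity_date < t2d_date + timedelta(days=BUFFER_DAYS):
            return "excluded"
        # First diagnosis inside the prediction horizon makes the individual a case.
        if comorbidity_date <= t2d_date + timedelta(days=HORIZON_DAYS):
            return "case"
    return "non-case"


def count_features(events, t2d_date):
    """Count register events per code, using only events before the prediction date."""
    return Counter(code for event_date, code in events if event_date < t2d_date)


# Hypothetical individual with two prescriptions before and one diagnosis after T2D.
t0 = date(2010, 1, 1)
events = [
    (date(2008, 5, 1), "ATC:C10AA01"),
    (date(2009, 3, 1), "ATC:C10AA01"),
    (date(2011, 1, 1), "ICD10:I50"),
]
print(label_individual(t0, date(2011, 1, 1)))
print(count_features(events, t0))
```

Counting only events dated strictly before the prediction date keeps information from the outcome window out of the predictors.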
Figure 2. Data split and model tuning. (a) Data were split so that the training set constituted the first 70% of the data (time-wise, according to the time of the first T2D diagnosis). The test and validation sets were divided by balanced random sampling from the remainder (the time-wise latter part) of the dataset. (b) For each model type, the best parameter set was chosen by three-fold cross-validation on the training data. (c) For each model type, the best model was obtained by re-training the model on the entirety of the training set using the best parameter set. Performance of each best model was evaluated on the test set during development of this work. (d) Performance of each best model was evaluated on the validation set for the purpose of reporting the results. ML, machine learning.
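The split in panel (a) can be illustrated with a short sketch. It assumes that "balanced" means two equal-sized random halves of the later 30% (the caption could also mean sampling balanced on outcome); the function name, record layout, and seed are hypothetical.

```python
import random
from datetime import date


def temporal_split(records, train_fraction=0.7, seed=0):
    """Split (diagnosis_date, individual_id) records into train, test, validation.

    The earliest train_fraction of records (by date of first T2D diagnosis) form
    the training set; the remainder is shuffled and halved (assumption).
    """
    ordered = sorted(records, key=lambda r: r[0])
    cut = round(len(ordered) * train_fraction)
    train, remainder = ordered[:cut], ordered[cut:]
    rng = random.Random(seed)
    rng.shuffle(remainder)
    half = len(remainder) // 2
    return train, remainder[:half], remainder[half:]


# Hypothetical cohort: 100 individuals diagnosed on consecutive days.
records = [(date.fromordinal(730000 + i), i) for i in range(100)]
train, test, validation = temporal_split(records)
print(len(train), len(test), len(validation))
```

Because the training set is strictly earlier than the held-out sets, evaluation resembles applying a model to patients diagnosed after it was built.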
AUROC measures for each prediction model’s best parameterization. We applied a reference- and three register-based models on fifteen years of health register data comprising hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care contractors to predict five-year risk for five T2D comorbidities. For each comorbidity, prediction was performed on a T2D population free of that comorbidity at the date of prediction (date of individuals’ first T2D diagnosis). The reference model was a logistic ridge regression based on canonical features: age, sex, country or region of birth and date of first T2D diagnosis as well as their interactions, while the register-based models were logistic ridge regression, random forest and gradient boosting based on the canonical features as well as hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care extracted from Danish health registers. Incidences are proportions of cases within comorbidities’ sub-population at the end of the prediction horizon. Value ranges in brackets represent 95% confidence intervals based on bootstrap sampling. For heart failure, myocardial infarction, cardiovascular disease and chronic kidney disease the gradient boosting model outperformed the reference models. AUROC, area under receiver operating characteristic curve.
| Model | AUROC | Δ vs. RLR | Δ vs. LR | Δ vs. RF |
|---|---|---|---|---|
| Heart failure (incidence: 0.04) | | | | |
| Reference, logistic regression (RLR) | 0.74 (0.72–0.75) | | | |
| Logistic regression (LR) | 0.77 (0.76–0.79) | 0.04 (0.02–0.05) | | |
| Random forest (RF) | 0.77 (0.75–0.78) | 0.03 (−0.01) | −0.01 (−0.02–0.01) | |
| Gradient boosting (GB) | 0.80 (0.78–0.81) | 0.06 (0.05–0.07) | 0.02 (0.01–0.03) | 0.03 (0.02–0.04) |
| Myocardial infarction | | | | |
| Reference, logistic regression (RLR) | 0.68 (0.65–0.70) | | | |
| Logistic regression (LR) | 0.70 (0.68–0.73) | 0.03 (0.01–0.04) | | |
| Random forest (RF) | 0.67 (0.64–0.69) | −0.01 (−0.03–0.01) | −0.04 (−0.06–−0.02) | |
| Gradient boosting (GB) | 0.71 (0.69–0.73) | 0.03 (0.02–0.05) | 0.01 (0.00–0.02) | 0.04 (0.03–0.06) |
| Stroke | | | | |
| Reference, logistic regression (RLR) | 0.71 (0.69–0.73) | | | |
| Logistic regression (LR) | 0.72 (0.70–0.74) | 0.01 (0.00–0.01) | | |
| Random forest (RF) | 0.69 (0.67–0.71) | −0.02 (−0.04–−0.01) | −0.03 (−0.04–−0.01) | |
| Gradient boosting (GB) | 0.72 (0.70–0.74) | 0.01 (0.00–0.02) | 0.01 (0.00–0.02) | 0.03 (0.02–0.05) |
| Cardiovascular disease | | | | |
| Reference, logistic regression (RLR) | 0.66 (0.64–0.67) | | | |
| Logistic regression (LR) | 0.68 (0.67–0.69) | 0.02 (0.02–0.03) | | |
| Random forest (RF) | 0.68 (0.67–0.69) | 0.02 (0.02–0.03) | 0.00 (0.00–0.01) | |
| Gradient boosting (GB) | 0.69 (0.68–0.70) | 0.04 (0.03–0.05) | 0.02 (0.01–0.02) | 0.01 (0.01–0.02) |
| Chronic kidney disease | | | | |
| Reference, logistic regression (RLR) | 0.71 (0.69–0.73) | | | |
| Logistic regression (LR) | 0.74 (0.72–0.76) | 0.04 (0.02–0.05) | | |
| Random forest (RF) | 0.74 (0.72–0.76) | 0.03 (0.01–0.05) | 0.00 (−0.02–0.01) | |
| Gradient boosting (GB) | 0.77 (0.76–0.79) | 0.07 (0.05–0.08) | 0.03 (0.02–0.04) | 0.04 (0.02–0.05) |
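For readers who want to see how an AUROC difference with a bootstrap confidence interval of the kind reported above might be computed, a minimal sketch follows. The source states only that intervals are based on bootstrap sampling; the percentile method, the paired resampling of individuals, the number of resamples, and all names here are assumptions.

```python
import random


def auroc(labels, scores):
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_delta_ci(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Percentile 95% interval for AUROC(a) - AUROC(b), resampling individuals."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # a resample needs both cases and non-cases
            a = auroc(y, [scores_a[i] for i in idx])
            b = auroc(y, [scores_b[i] for i in idx])
            deltas.append(a - b)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


# Hypothetical toy data: model A ranks every case above every non-case.
labels = [0, 0, 1, 1, 0, 1, 0, 0]
scores_a = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.2, 0.4]
scores_b = [0.5, 0.2, 0.4, 0.9, 0.6, 0.3, 0.1, 0.4]
print(auroc(labels, scores_a), auroc(labels, scores_b))
print(bootstrap_delta_ci(labels, scores_a, scores_b, n_boot=200))
```

The pairwise AUROC above is quadratic in sample size, so a rank-based implementation would be preferable at register scale.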
Figure 3. All individuals were ranked according to their predicted risk (in increasing order) by the best gradient boosting (blue) and best reference (orange) models and binned into percentiles. Plotted are the observed five-year comorbidity incidences for individuals in each percentile. Left y-axis: incidence, defined as the observed proportion of individuals who developed the given comorbidity within the five-year prediction horizon. Right y-axis: the incidence risk ratio, defined as the ratio of a percentile’s observed five-year comorbidity incidence to that of the whole population. Gray horizontal line: population five-year comorbidity incidence.
Figure 4. For each comorbidity, individuals were ranked according to their predicted risk by the gradient boosting (blue) and reference (orange) models. For a given number of individuals predicted to have the highest risk, the risk ratio was calculated as the comorbidity incidence among those individuals divided by the comorbidity incidence in the entire study population. 95% confidence intervals (shaded areas) were obtained through bootstrap sampling.
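The risk ratio defined in Figure 4 reduces to a few lines of arithmetic. The sketch below uses hypothetical toy data and a hypothetical function name.

```python
def top_n_risk_ratio(labels, scores, n):
    """Incidence among the n highest-scored individuals over population incidence."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_incidence = sum(y for _, y in ranked[:n]) / n
    population_incidence = sum(labels) / len(labels)
    return top_incidence / population_incidence


# Hypothetical cohort of ten individuals, three of whom are cases (label 1).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(top_n_risk_ratio(labels, scores, 2))
```

Here both of the two highest-scored individuals are cases, so the top-2 incidence is 1.0 against a population incidence of 0.3. Taking n equal to the cohort size always yields a ratio of 1.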
Figure 5. Top 50 most predictive register feature vector features from the best gradient boosting model, ranked by their importance and colour-coded according to type (x-axis). Feature importance is a normalized estimate of the relative contribution of a feature to the model prediction (y-axis). Drug prescription features had the highest overall cumulative importance, followed by the canonical features and hospital diagnoses. Age, the interaction between age and sex, and the date of first T2D diagnosis were the three most important features for all comorbidities.