Mathieu Ravaut, Vinyas Harish, Hamed Sadeghi, Kin Kwan Leung, Maksims Volkovs, Kathy Kornas, Tristan Watson, Tomi Poutanen, Laura C Rosella.
Abstract
Importance: Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.
Objective: To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data.
Design, Setting, and Participants: This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.
Exposures: A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.
Main Outcomes and Measures: Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.
Year: 2021 PMID: 34032855 PMCID: PMC8150694 DOI: 10.1001/jamanetworkopen.2021.11315
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Cohort Description
| Variable | Training (January 2013 to December 2014), total, No. (%) | Training, positives, No. (%) | Validation (January to December 2015), total, No. (%) | Validation, positives, No. (%) | Test (January to December 2016), total, No. (%) | Test, positives, No. (%) |
|---|---|---|---|---|---|---|
| Full cohort | ||||||
| Unique patients, No. | 1 657 395 | 23 979 | 243 442 | 1874 | 236 506 | 1967 |
| Instances, No. | 12 900 257 | 23 979 | 959 276 | 1874 | 927 230 | 1967 |
| Sex | ||||||
| Male | 6 233 595 (48.3) | 12 249 (51.1) | 459 715 (47.9) | 971 (51.8) | 440 433 (47.5) | 999 (50.8) |
| Female | 6 666 662 (51.7) | 11 730 (48.9) | 499 561 (52.1) | 903 (48.2) | 486 797 (52.5) | 968 (49.2) |
| Age group, y | ||||||
| <10 | 1 616 100 (12.5) | 205 (0.9) | 102 462 (10.7) | 14 (0.7) | 88 668 (9.6) | 8 (0.4) |
| 10-19 | 1 954 979 (15.2) | 358 (1.5) | 142 442 (14.8) | 32 (1.7) | 136 183 (14.7) | 32 (1.6) |
| 20-29 | 1 939 960 (15.0) | 696 (4.0) | 148 168 (15.4) | 75 (4.0) | 144 396 (15.6) | 79 (4.0) |
| 30-39 | 1 882 470 (14.6) | 2624 (10.9) | 140 953 (14.7) | 220 (11.7) | 135 758 (14.6) | 203 (10.3) |
| 40-49 | 2 108 830 (16.3) | 5374 (22.4) | 155 409 (16.2) | 423 (22.6) | 149 244 (16.1) | 437 (22.2) |
| 50-59 | 1 657 299 (12.8) | 6353 (26.5) | 130 529 (13.6) | 486 (25.9) | 130 880 (14.1) | 524 (26.7) |
| 60-69 | 987 254 (7.7) | 4701 (19.6) | 80 069 (8.3) | 364 (19.4) | 82 448 (8.9) | 423 (21.5) |
| 70-79 | 510 517 (4.0) | 2438 (10.2) | 39 803 (4.1) | 182 (9.7) | 40 475 (4.4) | 181 (9.2) |
| 80-89 | 222 638 (1.7) | 902 (3.8) | 17 637 (1.8) | 72 (3.8) | 17 239 (1.9) | 74 (3.8) |
| 90-100 | 19 840 (0.2) | 53 (0.2) | 1761 (0.2) | 6 (0.3) | 1924 (0.2) | 6 (0.3) |
| Immigration status | ||||||
| Immigrant | 1 537 571 (11.9) | 4293 (17.9) | 122 532 (12.8) | 338 (18.0) | 122 607 (13.2) | 384 (19.5) |
| Long-term resident | 11 362 686 (88.1) | 19 686 (82.1) | 836 744 (87.2) | 1536 (82.0) | 804 623 (86.8) | 1583 (80.5) |
| Race/ethnicity marginalization score, quintile | ||||||
| 1st | 1 958 853 (15.2) | 3690 (15.4) | 144 694 (15.1) | 275 (14.7) | 136 943 (14.8) | 303 (15.4) |
| 2nd | 2 083 902 (16.2) | 3604 (15.0) | 153 306 (16.0) | 274 (14.6) | 147 340 (15.9) | 250 (12.7) |
| 3rd | 2 279 478 (17.7) | 3711 (15.5) | 167 552 (17.5) | 304 (16.2) | 162 545 (17.5) | 318 (16.2) |
| 4th | 2 698 267 (20.9) | 4441 (18.5) | 201 623 (21.0) | 355 (18.9) | 194 554 (21.0) | 366 (18.6) |
| 5th | 3 710 695 (28.8) | 8126 (33.9) | 279 566 (29.1) | 642 (34.3) | 273 841 (29.5) | 703 (35.8) |
| Deprivation marginalization score, quintile | ||||||
| 1st | 3 041 507 (23.6) | 4339 (18.1) | 227 873 (23.8) | 366 (19.5) | 220 439 (23.8) | 358 (18.2) |
| 2nd | 2 566 726 (19.9) | 4569 (19.1) | 190 232 (19.8) | 333 (17.8) | 185 106 (20.0) | 383 (19.5) |
| 3rd | 2 442 622 (18.9) | 4572 (19.1) | 182 185 (19.0) | 359 (19.2) | 173 694 (18.7) | 372 (18.9) |
| 4th | 2 288 370 (17.7) | 4714 (19.7) | 170 096 (17.7) | 394 (21.0) | 164 405 (17.7) | 420 (21.4) |
| 5th | 2 391 970 (18.5) | 5378 (22.4) | 176 355 (18.4) | 398 (21.2) | 171 579 (18.5) | 407 (20.7) |
The table gives the number of patients, the number of instances, and the associated number of positive data points for the training, validation, and test sets. The numbers of positive patients and positive instances match exactly because a patient can be diagnosed with diabetes only once. The table also gives the distribution of each set by sex, age, and immigration status.
Race/ethnicity and deprivation marginalization scores quantify the degree of marginalization within each dissemination area according to ethnic concentration and material deprivation. A dissemination area typically encompasses a few hundred inhabitants. These 2 scores are quintiles ranging from 1 to 5 based on each patient's history from the 2004-2008 period, where 5 represents the highest degree of marginalization.
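As a reading aid, the short Python sketch below shows how the percentages in the table relate to the raw counts, using the training-set sex rows as an example. The counts are copied from the table; the helper name `percent` and the variable names are illustrative assumptions, not part of the study.

```python
# Counts copied from the Cohort Description table (training set).
TRAIN_INSTANCES = 12_900_257
TRAIN_POSITIVES = 23_979
MALE_INSTANCES = 6_233_595
MALE_POSITIVES = 12_249


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of male instances among all instances and among positives.
male_share_all = percent(MALE_INSTANCES, TRAIN_INSTANCES)
male_share_pos = percent(MALE_POSITIVES, TRAIN_POSITIVES)

# Positives are rare relative to instances: well under 1% of training rows.
positive_rate = 100 * TRAIN_POSITIVES / TRAIN_INSTANCES

print(male_share_all, male_share_pos, f"{positive_rate:.2f}%")
```

The two male shares reproduce the 48.3 and 51.1 shown in the table, and the positive rate illustrates how strongly imbalanced the instance-level data are.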
Figure 1. Diabetes Onset Prediction Performance
A, Calibration is assessed visually with a calibration curve composed of 20 population bins of equal size. B, Precision and recall curves are displayed. The left y-axis corresponds to precision, and the right y-axis to recall. The area under the receiver operating characteristic curve on the test set is 80.26%.
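The two quantities named in this caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `calibration_bins` and `roc_auc` and the toy scores are hypothetical, and the example uses 4 bins instead of 20 only because the toy data set is tiny.

```python
from typing import List, Sequence, Tuple


def calibration_bins(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 20
) -> List[Tuple[float, float]]:
    """Sort by score, split into n_bins groups of (near-)equal size, and
    return (mean predicted risk, observed positive rate) for each group."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    points = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # can happen when n < n_bins
            continue
        mean_pred = sum(scores[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        points.append((mean_pred, observed))
    return points


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half. Quadratic time: fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
bins = calibration_bins(scores, labels, n_bins=4)
auc = roc_auc(scores, labels)
print(bins, auc)
```

A well-calibrated model yields points close to the diagonal, that is, a mean predicted risk close to the observed rate in every bin.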
Figure 2. Diabetes Onset Calibration Across Population Groups
The model is evaluated on specific subsets of the population: sex (2 categories), age (10 bins of 10 years), immigration status (2 categories), race/ethnicity marginalization score (5 quintiles), material deprivation marginalization score (5 quintiles), and number of events in the observation window (5 categories). We display the incidence rate (left y-axis, dark blue bars), average model prediction (right y-axis, light blue bars), and number of positive cases within each subset. The size of each subset can be read on the x-axis. Note that incidence rates can vary dramatically between subsets, especially for age, making comparisons between subsets challenging.
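The per-subset comparison described in this caption amounts to grouping instances by a categorical variable and comparing the observed incidence rate with the average model prediction. The sketch below is one way this could be computed; the function name `subgroup_calibration`, the age labels, and all numbers are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def subgroup_calibration(
    groups: Sequence[Hashable], scores: Sequence[float], labels: Sequence[int]
) -> Dict[Hashable, Tuple[float, float, int, int]]:
    """Return {group: (incidence rate, mean prediction, positives, size)}."""
    acc = defaultdict(lambda: [0, 0.0, 0])  # size, sum of scores, positives
    for g, s, y in zip(groups, scores, labels):
        acc[g][0] += 1
        acc[g][1] += s
        acc[g][2] += y
    return {
        g: (pos / n, total / n, pos, n) for g, (n, total, pos) in acc.items()
    }


# Toy data, made up for illustration only.
groups = ["<40", "<40", "<40", "<40", "40+", "40+", "40+", "40+"]
scores = [0.01, 0.02, 0.01, 0.04, 0.2, 0.3, 0.1, 0.4]
labels = [0, 0, 0, 0, 0, 1, 0, 1]
result = subgroup_calibration(groups, scores, labels)
for group, (incidence, mean_pred, positives, size) in result.items():
    print(group, incidence, round(mean_pred, 3), positives, size)
```

Within a subset, a large gap between incidence and mean prediction would suggest that the model is miscalibrated for that group.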
Model Prediction Risk Levels
| Bin | Age, mean | Women, % | Immigrants, % | Time in Canada, y | Ethnicity marginalization score | Deprivation marginalization score | HbA1c, mean |
|---|---|---|---|---|---|---|---|
| Model prediction | |||||||
| Top 1% | 58.3 | 59.6 | 38.8 | 17.3 | 4.22 | 3.63 | 5.84 |
| Next 5% | 59.4 | 42.3 | 26.5 | 18.4 | 3.85 | 3.45 | 5.81 |
| Next 15% | 58.3 | 40.8 | 16.5 | 19.4 | 3.44 | 3.15 | 5.73 |
| Bottom 79% | 31.8 | 55.3 | 11.4 | 19.7 | 3.38 | 2.87 | 5.53 |
| Label | |||||||
| Positive | 53.7 | 49.2 | 19.5 | 19.1 | 3.54 | 3.15 | 5.92 |
| Negative | 37.4 | 52.5 | 13.2 | 19.6 | 3.42 | 2.95 | 5.63 |
Abbreviation: HbA1c, hemoglobin A1c.
In the first setup, we rank patients by model output in decreasing order, then bin them into 4 categories: top 1%, next 5% (between top 1% and top 6%), next 15% (between top 6% and top 21%), and the remaining 79%. For each bin, we display statistics pertaining to general demographic factors (mean age, fraction of women, fraction of immigrants, and time in Canada for immigrants) and socioeconomic factors (race/ethnicity and deprivation marginalization scores of the neighborhood), as well as the mean HbA1c. Means are computed across nonmissing values from patients within each bin. For instance, time in Canada is computed only for the immigrants in each model output bin because the value is missing for long-term residents. The second setup evaluates the same variables, splitting patients by label (positive or negative) instead.
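The binning and the nonmissing means described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `assign_risk_bins`, `mean_nonmissing`, and `summarize` are hypothetical, missing values are assumed to be encoded as `None`, and the toy scores are made up.

```python
from typing import Dict, List, Optional, Sequence

# Cumulative upper bounds, in percent of the ranked population.
BINS = [("Top 1%", 1), ("Next 5%", 6), ("Next 15%", 21), ("Bottom 79%", 100)]


def assign_risk_bins(scores: Sequence[float]) -> List[str]:
    """Rank by decreasing score; return each patient's bin, in input order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    names = [""] * n
    for rank, i in enumerate(order):
        for name, upper_pct in BINS:
            # Integer comparison avoids floating-point edge cases at cutoffs.
            if (rank + 1) * 100 <= upper_pct * n:
                names[i] = name
                break
    return names


def mean_nonmissing(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean over values that are not None; None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def summarize(
    scores: Sequence[float], variable: Sequence[Optional[float]]
) -> Dict[str, Optional[float]]:
    """Per-bin mean of one variable, ignoring missing values."""
    grouped: Dict[str, list] = {name: [] for name, _ in BINS}
    for name, value in zip(assign_risk_bins(scores), variable):
        grouped[name].append(value)
    return {name: mean_nonmissing(vals) for name, vals in grouped.items()}


# Toy data: 100 patients; the variable is missing for every odd index,
# mimicking "time in Canada" being undefined for long-term residents.
scores = [i / 100 for i in range(100)]
time_in_canada = [None if i % 2 else float(i) for i in range(100)]
names = assign_risk_bins(scores)
summary = summarize(scores, time_in_canada)
print(summary)
```

The label-based setup follows the same pattern, with the grouping key replaced by the positive or negative label.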
Figure 3. Estimation of the Total Cost of the Cohort Predicted to Develop Diabetes in Ontario From 2009 to 2016
Diabetes cost per year (A) and per population percentile (B) are displayed. The 5% most at-risk patients account for 26% of the total cost. USD indicates US dollars.
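A cost-concentration figure such as the one quoted in this caption can be computed by ranking patients by predicted risk and summing the cost of the top fraction. The sketch below illustrates only that calculation; the function name `cost_share_of_top` and the toy costs are assumptions and do not reproduce the study's costing algorithm or its data.

```python
from typing import Sequence


def cost_share_of_top(
    scores: Sequence[float], costs: Sequence[float], top_fraction: float
) -> float:
    """Fraction of total cost incurred by the top_fraction highest-risk
    patients, where risk is the model score."""
    n = len(scores)
    k = max(1, round(n * top_fraction))  # at least one patient in the top group
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    total = sum(costs)
    return sum(costs[i] for i in order[:k]) / total


# Toy data, made up for illustration: 20 patients, the riskiest costs most.
scores = [1.0 - i / 20 for i in range(20)]
costs = [400.0] + [100.0] * 19
share = cost_share_of_top(scores, costs, top_fraction=0.05)
print(f"Top 5% of patients account for {share:.1%} of total cost")
```

Evaluating the same function over a range of `top_fraction` values gives a cost-by-percentile curve.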