Stephen F Weng, Luis Vaz, Nadeem Qureshi, Joe Kai.
Abstract
BACKGROUND: Prognostic modelling using standard methods is well-established, particularly for predicting risk of single diseases. Machine-learning may offer potential to explore outcomes of even greater complexity, such as premature death. This study aimed to develop novel prediction algorithms using machine-learning, in addition to standard survival modelling, to predict premature all-cause mortality.
Year: 2019 PMID: 30917171 PMCID: PMC6436798 DOI: 10.1371/journal.pone.0214365
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Selected baseline characteristics of the study population aged 40–69 years, recruited between 2006 and 2010, stratified by mortality during follow-up.
Categorical variables are reported as numbers (percentages); continuous variables are reported as means (standard deviations).
| Characteristics | Alive (n = 488,207) | Died (n = 14,418) | p-value |
|---|---|---|---|
| Female (%) | 267,792 (54.8) | 5,667 (39.3) | — |
| Male (%) | 220,415 (45.1) | 8,751 (60.7) | < 0.001 |
| Age (years) | 56.4 (8.1) | 61.3 (6.6) | < 0.001 |
| White (%) | 458,948 (94.0) | 13,860 (96.1) | — |
| South Asian (%) | 9,712 (2.0) | 170 (1.2) | < 0.001 |
| East Asian (%) | 1,552 (0.3) | 22 (0.2) | 0.003 |
| Black (%) | 7,951 (1.6) | 114 (0.8) | < 0.001 |
| Other/mixed (%) | 7,371 (1.5) | 147 (1.0) | < 0.001 |
| Unknown (%) | 2,673 (0.6) | 105 (0.7) | 0.008 |
| None (%) | 80,930 (16.6) | 4,361 (30.3) | — |
| College/University (%) | 157,904 (32.3) | 3,298 (22.9) | < 0.001 |
| A/AS Levels (%) | 54,076 (11.1) | 1,257 (8.7) | < 0.001 |
| O Levels/GCSEs (%) | 102,535 (21.0) | 2,683 (18.6) | < 0.001 |
| CSEs (%) | 26,370 (5.4) | 523 (3.6) | < 0.001 |
| NVQ/HND/HNC (%) | 31,671 (6.5) | 1,065 (7.4) | < 0.001 |
| Other Professional Qualifications (%) | 25,009 (5.1) | 801 (5.5) | < 0.001 |
| Unknown (%) | 9,712 (2.0) | 430 (3.0) | < 0.001 |
| Non-smoker (%) | 438,137 (89.7) | 11,502 (79.8) | — |
| Current smoker (%) | 50,070 (10.3) | 2,916 (20.2) | < 0.001 |
| No (%) | 466,947 (95.6) | 12,721 (88.2) | — |
| Yes (%) | 21,260 (4.4) | 1,697 (11.8) | < 0.001 |
| No (%) | 446,616 (91.5) | 11,181 (77.6) | — |
| Yes (%) | 41,591 (8.5) | 3,237 (22.4) | < 0.001 |
| No (%) | 469,221 (96.1) | 12,666 (87.9) | — |
| Yes (%) | 18,986 (3.9) | 1,752 (12.1) | < 0.001 |
| No (%) | 481,876 (98.7) | 13,798 (95.7) | — |
| Yes (%) | 6,331 (1.3) | 620 (4.3) | < 0.001 |
| No (%) | 482,355 (98.8) | 13,761 (95.4) | — |
| Yes (%) | 5,852 (1.2) | 657 (4.6) | < 0.001 |
| Cigarettes per day | 1.5 (5.0) | 3.5 (8.0) | < 0.001 |
| Waist circumference (cm) | 90.2 (13.4) | 95.1 (14.7) | < 0.001 |
| Height (cm) | 168.4 (9.3) | 169.3 (9.1) | < 0.001 |
| Weight (kg) | 77.9 (15.9) | 80.9 (17.7) | < 0.001 |
| Body fat percentage (%) | 31.4 (8.5) | 30.9 (8.5) | < 0.001 |
| BMI (kg/m²) | 27.4 (4.8) | 28.1 (5.4) | < 0.001 |
| Systolic blood pressure (mmHg) | 139.5 (19.1) | 143.6 (20.3) | < 0.001 |
| Diastolic blood pressure (mmHg) | 82.2 (10.4) | 82.2 (10.9) | 0.998 |
| MET-min per week | 1915.1 (2856.2) | 1704.5 (2840.4) | < 0.001 |
| FEV1 (L) | 2.8 (0.8) | 2.6 (0.8) | < 0.001 |
| Townsend deprivation index | -1.3 (3.1) | -0.6 (3.4) | < 0.001 |
* Missing values: Weight: 0.55% missing; Height: 0.50% missing; BMI: 0.62% missing; Waist circumference: 0.43% missing; Body fat percentage: 2.08% missing; Diastolic blood pressure: 6.92% missing; Systolic blood pressure: 6.93% missing; FEV1: 9.71% missing; Cigarettes per day: 3.35% missing; Townsend index: 0.13% missing
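
The p-values for the categorical rows can be sanity-checked from the counts alone. The source does not say which test produced them, so the sketch below assumes a Pearson chi-square test without continuity correction on a 2×2 table, applied to the sex-by-mortality counts in the table above. The function name `chi_square_2x2` is illustrative, not taken from the study.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Table layout (an assumption of this sketch):
                 alive  died
        group 1    a      b
        group 2    c      d
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts copied from the table: female (alive, died), male (alive, died).
chi2, p_value = chi_square_2x2(267_792, 5_667, 220_415, 8_751)
print(f"chi-square = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

With these counts the statistic is far above the 1-df critical value, which is consistent with the reported p < 0.001.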
Adjusted hazard ratios from final multivariable Cox regression model predicting 10-year mortality in the training cohort (n = 376,971).
| Predictor Variables | Hazard Ratio | P-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Female | Ref | — | — | — |
| Male | 2.17 | < 0.001 | 2.08 | 2.27 |
| | 44.00 | < 0.001 | 36.35 | 53.27 |
| None | Ref | — | — | — |
| College/University | 0.75 | < 0.001 | 0.71 | 0.80 |
| A/AS Levels | 0.83 | < 0.001 | 0.77 | 0.89 |
| O Levels/GCSEs | 0.81 | < 0.001 | 0.76 | 0.85 |
| CSEs | 0.88 | 0.020 | 0.79 | 0.98 |
| NVQ/HND/HNC | 0.80 | < 0.001 | 0.74 | 0.87 |
| Other professional qualifications | 0.79 | < 0.001 | 0.73 | 0.87 |
| Unknown | 0.97 | 0.590 | 0.86 | 1.09 |
| White | Ref | — | — | — |
| South Asian | 0.59 | < 0.001 | 0.49 | 0.70 |
| East Asian | 0.67 | 0.110 | 0.41 | 1.09 |
| Black | 0.63 | < 0.001 | 0.51 | 0.78 |
| Other/Mixed | 0.81 | < 0.030 | 0.66 | 0.98 |
| Unknown | 1.00 | 0.990 | 0.78 | 1.27 |
| No | Ref | — | — | — |
| Yes | 2.58 | < 0.001 | 2.47 | 2.71 |
| No | Ref | — | — | — |
| Yes | 1.58 | < 0.001 | 1.49 | 1.68 |
| No | Ref | — | — | — |
| Yes | 1.72 | < 0.001 | 1.62 | 1.83 |
| No | Ref | — | — | — |
| Yes | 1.87 | < 0.001 | 1.71 | 2.05 |
| Non-smoker | Ref | — | — | — |
| Current smoker | 2.01 | < 0.001 | 1.91 | 2.11 |
| | 0.73 | < 0.001 | 0.60 | 0.89 |
| | 1.14 | 0.200 | 0.94 | 1.38 |
| | 1.13 | < 0.001 | 1.10 | 1.17 |
| | 1.12 | 0.070 | 0.99 | 1.27 |
| | 0.94 | < 0.001 | 0.93 | 0.95 |
| | 0.53 | < 0.001 | 0.51 | 0.56 |
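
A hazard ratio and its confidence interval are enough to recover the underlying Cox coefficient and an approximate p-value. The sketch below assumes the reported intervals are Wald intervals, symmetric on the log scale, which the source does not confirm. The helper name `wald_from_hr` is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def wald_from_hr(hr, lower, upper):
    """Recover the coefficient, standard error, z and two-sided p from an HR and 95% CI.

    Assumes a Wald interval: exp(beta +/- Z_95 * se).
    """
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return beta, se, z, p_value

# CSEs row from the table above: HR 0.88, 95% CI 0.79 to 0.98.
beta, se, z, p_value = wald_from_hr(0.88, 0.79, 0.98)
print(f"beta = {beta:.3f}, se = {se:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```

For the CSEs row this gives a p-value of approximately 0.02, in line with the tabulated value. Because the inputs are rounded to two decimals, the recovered figures are only approximate.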
Top 15 risk factor variables for predicting mortality, listed in descending order of “importance” for each algorithm, derived from the training cohort of 376,971 patients.
| Cox model (a) | Random Forest (b) | Deep Learning (c) |
|---|---|---|
| Age | BMI | Smoking |
| Prior diagnosis of cancer | FEV1 | Age |
| Gender | Waist circumference | Prior diagnosis of cancer |
| Smoking | Diastolic blood pressure | Alcohol consumption |
| Prior diagnosis of COPD | Systolic blood pressure | Digoxin prescribed |
| FEV1 | Age | Gender |
| Prior diagnosis of T2DM | Body fat percentage | Warfarin prescribed |
| Prior diagnosis of CHD | Smoking | Townsend deprivation index |
| Diastolic blood pressure | Prior diagnosis of cancer | Residential air pollution |
| BMI | Gender | Prior diagnosis of CHD |
| Systolic blood pressure | Skin tone | Statins prescribed |
| Townsend deprivation index | Education | Prior diagnosis of COPD |
| Ethnicity | Prior diagnosis of T2DM | Job exposure to hazardous materials |
| MET-min per week | Vegetable consumption | Education |
| Education | Fruit consumption | FEV1 |
a Cox model: ranking determined by strongest to weakest Cox regression coefficients.
b Random Forest: ranking determined by largest to smallest mean decrease in accuracy.
c Deep Learning: ranking determined by largest to smallest scaled importance derived from network weights.
orange = top risk factor in all three algorithms; blue = top risk factor in two algorithms; green = top risk factor in one algorithm
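
The "mean decrease in accuracy" ranking can be illustrated with a simplified permutation scheme: shuffle one variable at a time and record how much accuracy falls. This is a minimal sketch on synthetic data, not the authors' implementation. Random forest software typically computes this per tree on out-of-bag samples, whereas the version below permutes columns of a single evaluation set. All names (`permutation_importance`, `toy_predict`) and the toy data are assumptions made for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(x0 > 0.5) for x0, _ in rows]

def toy_predict(row):
    # Stand-in for a fitted model; it only looks at feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(toy_predict, rows, labels, n_features=2)
ranking = sorted(range(2), key=lambda j: importances[j], reverse=True)
print(importances, ranking)
```

Sorting variables by these mean drops, largest first, yields an ordering of the same kind as the Random Forest column above.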
Binary classification accuracy comparing each algorithm for predicting “high” risk of mortality in the test cohort (n = 125,657).
| Algorithm | Optimal Threshold | Correctly Classified Death | Correctly Classified Alive | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Deep Learning | > 2% | 2,343/3,608 | 92,978/122,049 | 64.9% | 76.2% |
| Random Forest | > 5% | 2,300/3,608 | 94,603/122,049 | 63.7% | 77.5% |
| Adjusted Cox Model | > 6% | 2,197/3,608 | 92,832/122,049 | 60.9% | 76.1% |
| Age/Gender Cox Model | > 8.4% | 1,728/3,608 | 93,661/122,049 | 47.9% | 76.7% |
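
Sensitivity and specificity follow directly from the classification counts: sensitivity is the share of deaths flagged as high risk, and specificity is the share of survivors not flagged. The sketch below recomputes both for the first three rows of the table. The dictionary layout and function name are illustrative choices, not part of the study.

```python
# Counts copied from the table above:
# (correctly classified deaths, total deaths, correctly classified alive, total alive)
COUNTS = {
    "Deep Learning": (2343, 3608, 92978, 122049),
    "Random Forest": (2300, 3608, 94603, 122049),
    "Adjusted Cox Model": (2197, 3608, 92832, 122049),
}

def sensitivity_specificity(true_pos, deaths, true_neg, alive):
    """Return (sensitivity %, specificity %) rounded to one decimal place."""
    sensitivity = 100 * true_pos / deaths
    specificity = 100 * true_neg / alive
    return round(sensitivity, 1), round(specificity, 1)

for name, counts in COUNTS.items():
    sens, spec = sensitivity_specificity(*counts)
    print(f"{name}: sensitivity {sens}%, specificity {spec}%")
```

The thresholds in the table (for example, > 2% for Deep Learning) decide which predicted risks count as "high"; the counts used here are the result of applying those thresholds in the test cohort.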