| Literature DB >> 33792549 |
Linh Tran, Lianhua Chi, Alessio Bonti, Mohamed Abdelrazek, Yi-Ping Phoebe Chen.
Abstract
BACKGROUND: Cardiovascular disease (CVD) is Australia's greatest health problem: it kills more people than any other disease and incurs enormous costs for the health care system. In this study, we present a benchmark comparison of various artificial intelligence (AI) architectures for predicting the mortality rate of patients with CVD using structured medical claims data. Compared with other models in the clinical literature, ours are more efficient because they use fewer features, and this study could help health professionals accurately choose AI models to predict mortality among patients with CVD using only claims data available before a clinic visit.
Keywords: cardiovascular; deep learning; imbalanced data; machine learning; medical claims data; mortality
Year: 2021 PMID: 33792549 PMCID: PMC8050753 DOI: 10.2196/25000
Source DB: PubMed Journal: JMIR Med Inform
Prevalence of cardiovascular disease by age group and sex, 2017-2018.
| Age group (years) | Men, n^a | Women, n^a | Total, n^a | Men, % (95% CI)^b | Women, % (95% CI)^b | Total, % (95% CI)^b |
| 18-44 | 31,400 | 56,600 | 88,000 | 0.7 (0.3-1.1) | 1.2 (0.7-1.8) | 1.0 (0.7-1.3) |
| 45-54 | 50,600 | 42,300 | 92,900 | 3.3 (2.4-4.2) | 2.6 (1.7-3.5) | 3.0 (2.4-3.6) |
| 55-64 | 136,700 | 114,700 | 251,500 | 10.0 (7.6-12.4) | 7.9 (6.0-9.9) | 8.9 (7.4-10.5) |
| 65-74 | 208,900 | 135,600 | 344,500 | 19.8 (17.2-22.4) | 12.2 (10.0-14.4) | 15.9 (14.3-17.5) |
| 75+ | 213,200 | 160,100 | 373,300 | 32.1 (27.1-37.0) | 20.3 (17.5-23.1) | 25.7 (23.1-28.2) |
| Persons (number/age-standardized rate^c) | 640,800 | 509,300 | 1,150,200 | 6.5 (5.9-7.0) | 4.8 (4.3-5.3) | 5.6 (5.2-5.9) |
^aDue to rounding, discrepancies may occur between sums of the component items and totals.
^bCI: confidence interval, the range of values within which the true value is expected to lie with the stated (here 95%) probability.
^cAge-standardized to the 2001 Australian Standard Population (Source: AIHW analysis of ABS 2019).
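The age-standardized rates in the table's last row come from direct standardization: each age group's crude rate is weighted by that group's share of a fixed standard population, so populations with different age structures become comparable. A minimal sketch of the calculation, using the table's total prevalence rates but hypothetical standard-population shares (the actual 2001 Australian Standard Population weights are not reproduced here):

```python
# Direct age standardization: weight each age group's crude rate by
# that group's share of a fixed standard population.
age_groups = ["18-44", "45-54", "55-64", "65-74", "75+"]
crude_rates = [0.010, 0.030, 0.089, 0.159, 0.257]  # total prevalence from the table
# Hypothetical standard-population shares (NOT the 2001 Australian
# Standard Population weights, which are not reproduced here).
std_shares = [0.45, 0.15, 0.15, 0.15, 0.10]

age_standardized = sum(r * w for r, w in zip(crude_rates, std_shares))
print(f"Age-standardized prevalence: {age_standardized:.1%}")
```

With these illustrative shares the weighted sum comes out near 7%; the published 5.6% reflects the real standard-population weights.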
Figure 1. Decision tree.
Figure 2. Random forests.
Figure 3. Extra trees.
Figure 4. Gradient boosting trees workflow.
Figure 5. Artificial neural network architecture. ReLU: Rectified Linear Unit.
Hyperparameters for grid search.
| Algorithm | Parameter | Search space | Optimal |
| Logistic regression | penalty | ('l1', 'l2', 'none') | l2 |
| | C | (0.01, 0.1, 1.0) | 1.0 |
| | tol | (0.0001, 0.001, 0.01) | 0.0001 |
| | solver | ('lbfgs', 'liblinear', 'sag', 'saga') | lbfgs |
| | multi_class | ('auto', 'ovr', 'multinomial') | auto |
| Random forest | n_estimators | (5, 10, 50, 100, 150) | 100 |
| | max_depth | (1, 2, 3, 5, None) | None |
| | max_features | ('auto', 'sqrt') | auto |
| | min_samples_split | (2, 5, 10) | 2 |
| | min_samples_leaf | (1, 2, 4) | 1 |
| Extra trees | n_estimators | (5, 10, 50, 100, 150) | 100 |
| | max_depth | (1, 2, 3, 5, None) | None |
| | max_features | ('auto', 'sqrt') | auto |
| | min_samples_split | (2, 5, 10) | 2 |
| | min_samples_leaf | (1, 2, 4) | 1 |
| Gradient boosting trees | loss | ('deviance', 'exponential') | deviance |
| | n_estimators | (5, 10, 50, 100, 150) | 100 |
| | max_depth | (1, 2, 3, 5) | 3 |
| | learning_rate | (0.001, 0.01, 0.1) | 0.1 |
| | criterion | ('friedman_mse', 'mse', 'mae') | friedman_mse |
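The parameter names above match scikit-learn's estimators, so a search like this one can be run with GridSearchCV. A minimal sketch, assuming scikit-learn and a synthetic imbalanced stand-in dataset (the claims data is not public); only solver-compatible penalty values from the logistic regression grid are included:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the claims data (hypothetical).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Subset of the paper's logistic regression grid; the 'liblinear'
# solver supports both l1 and l2 penalties.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0],
    "tol": [0.0001, 0.001, 0.01],
}
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```

Scoring by ROC AUC rather than accuracy is the safer choice here because the mortality outcome is heavily imbalanced.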
Performance metrics of machine learning models without the Synthetic Minority Oversampling Technique.
| Algorithms | Accuracy (%) | Area under the receiver operating characteristic curve (%) | Precision (%) | Recall (%) | Brier loss |
| Logistic regression | 97.8 | 96.4 | 98.5^a | 93.4 | 0.016 |
| Random forest | 98.5^b | 97.7 | 98.1 | 96.1 | 0.012^c |
| Extra trees | 97.9 | 96.8 | 98.1 | 94.2 | 0.016 |
| Gradient boosting trees | 98.4 | 97.8^d | 97.5 | 96.5^e | 0.012^c |
| Artificial neural network | 97.1 | 95.3 | 96.6 | 91.8 | 0.024 |
^aThe highest precision.
^bThe highest accuracy.
^cThe lowest Brier loss.
^dThe highest area under the receiver operating characteristic curve.
^eThe highest recall.
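All five reported metrics reduce to simple formulas over the confusion matrix and the predicted probabilities. A self-contained sketch with illustrative inputs (not the paper's predictions):

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, AUROC, precision, recall and Brier loss for a binary task."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Brier loss: mean squared difference between probability and outcome.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
    # AUROC via the Mann-Whitney U statistic: the probability that a
    # random positive case is ranked above a random negative case.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auroc = wins / (len(pos) * len(neg))
    return accuracy, auroc, precision, recall, brier

# Illustrative example: two positives, two negatives.
print(binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.3, 0.1]))
```

Note that the Brier loss is computed from probabilities, not thresholded labels, which is why it complements the other four metrics as a calibration measure.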
Training time of machine learning models without Synthetic Minority Oversampling Technique.
| Algorithms | Training time (seconds) |
| Logistic regression | 6.6^a |
| Random forest | 106.8 |
| Extra trees | 46.8 |
| Gradient boosting trees | 186.0 |
| Artificial neural network | 1277.4 |
^aThe shortest training time.
Figure 6. Confusion matrix of logistic regression.
Figure 10. Confusion matrix of artificial neural network.
Figure 11. Calibration curve of random forest without Synthetic Minority Oversampling Technique.
Figure 20. Calibration curve of artificial neural network with Synthetic Minority Oversampling Technique.
Performance metrics of machine learning models with the Synthetic Minority Oversampling Technique.
| Algorithms | Accuracy (%) | Area under the receiver operating characteristic curve (%) | Precision (%) | Recall (%) | Brier loss |
| Logistic regression | 98.2 | 97.4 | 97.3^a | 95.9 | 0.015 |
| Random forest | 98.4^b | 98.0^c | 96.8 | 97.3 | 0.012^d |
| Extra trees | 98.1 | 97.4 | 97.1 | 95.8 | 0.016 |
| Gradient boosting trees | 98.1 | 97.9 | 95.2 | 97.7^e | 0.014 |
| Artificial neural network | 96.7 | 96.2 | 93.0 | 95.1 | 0.026 |
^aThe highest precision.
^bThe highest accuracy.
^cThe highest area under the receiver operating characteristic curve.
^dThe lowest Brier loss.
^eThe highest recall.
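The Synthetic Minority Oversampling Technique (SMOTE) balances the classes by synthesizing new minority samples along line segments between each minority sample and one of its k nearest minority neighbours. A minimal NumPy sketch of that interpolation step (a study like this one would typically use a library implementation such as imbalanced-learn's SMOTE rather than this illustration):

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic samples from minority-class rows X_min
    by interpolating toward one of each row's k nearest neighbours."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per row
    base = rng.integers(0, len(X_min), n_new)    # sample to grow from
    neigh = nn[base, rng.integers(0, k, n_new)]  # neighbour to move toward
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because each synthetic point lies on a segment between two real minority samples, SMOTE raises recall on the rare class without simply duplicating rows, which matches the recall gains seen in the table above.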
Figure 8. Confusion matrix of extra trees.
Training time of machine learning models with the Synthetic Minority Oversampling Technique.
| Algorithms | Training time (seconds) |
| Logistic regression | 292.9^a |
| Random forest | 497.9 |
| Extra trees | 347.5 |
| Gradient boosting trees | 648.1 |
| Artificial neural network | 5480.3 |
^aThe shortest training time.