Surya Krishnamurthy, Kapeleshh Ks, Erik Dovgan, Mitja Luštrek, Barbara Gradišek Piletič, Kathiravan Srinivasan, Yu-Chuan Jack Li, Anton Gradišek, Shabbir Syed-Abdul.
Abstract
Chronic kidney disease (CKD) places a heavy burden on the healthcare system because of the increasing number of patients, the high risk of progression to end-stage renal disease, and the poor prognosis in terms of morbidity and mortality. The aim of this study is to develop a machine-learning model that uses comorbidity and medication data from Taiwan's National Health Insurance Research Database to forecast the occurrence of CKD 6 or 12 months before its onset, and hence its prevalence in the population. A total of 18,000 people with CKD and 72,000 people without a CKD diagnosis were selected using propensity score matching. Their demographic, medication, and comorbidity data from their respective two-year observation periods were used to build a predictive model. Among the approaches investigated, the convolutional neural network (CNN) model performed best, with a test-set AUROC of 0.957 and 0.954 for the 6-month and 12-month predictions, respectively. The most prominent predictors in the tree-based models were identified, including diabetes mellitus, age, gout, and medications such as sulfonamides and angiotensins. The proposed model could be a useful tool for policymakers in predicting trends of CKD in the population. The models can support close monitoring of people at risk, early detection of CKD, better allocation of resources, and patient-centric management.
Keywords: chronic kidney disease; deep learning; electronic health records; machine learning
Year: 2021 PMID: 34067129 PMCID: PMC8151834 DOI: 10.3390/healthcare9050546
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Figure 1. Chronology of time periods and events.
Figure 2. Distribution of age and sex in the CKD cohort.
Figure 3. Data processing pipeline.
Characteristics of the dataset.
| Predictors | ||
|---|---|---|
| Demographics | ||
| Age (mean ± std) | Numerical | CKD: 65 ± 15.01, |
| Gender | Categorical | Males in CKD: 57.4%, |
| Diagnosis and procedures | ||
| ICD-9 based frequencies | Numerical | Frequency of visits with a diagnosis for each of 965 ICD codes |
| Medication | ||
| ATC-based frequencies | Numerical | Frequency of prescriptions for each of 537 ATC codes |
| Outcome | | |
| CKD | Categorical | 20% diagnosed with CKD |
Figure 4. Example of temporal, temporal-quarterly, and aggregated data prepared from the raw data.
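The relationship between the three data layouts named in this caption can be illustrated with a minimal sketch. It assumes that the temporal data hold one count per month over the two-year observation period, that the temporal-quarterly data sum each consecutive three-month block, and that the aggregated data sum the whole period. The record does not spell out the exact construction, so the helper names `to_quarterly` and `to_aggregated` and the sample counts are hypothetical.

```python
def to_quarterly(monthly_counts):
    """Sum consecutive 3-month blocks (assumed definition of 'temporal-quarterly')."""
    return [sum(monthly_counts[i:i + 3]) for i in range(0, len(monthly_counts), 3)]


def to_aggregated(monthly_counts):
    """Collapse the whole observation period into one count (assumed 'aggregated' form)."""
    return sum(monthly_counts)


# Hypothetical monthly visit counts for a single diagnosis code of one person
# over a 24-month observation period (illustrative values, not study data).
monthly = [0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 3,
           0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1]

quarterly = to_quarterly(monthly)    # 8 values, one per quarter
aggregated = to_aggregated(monthly)  # a single value for the whole period

print(quarterly)
print(aggregated)
```

Under these assumptions, each coarser layout discards timing information that the finer one keeps: the quarterly vector still shows when visits cluster, while the aggregated count does not.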
Figure 5. The CNN architecture used in this study.
Figure 6. ROC curves of representative models.
Confusion matrices of the CNN model, showing the number of instances in each class, with the fraction of predicted instances in parentheses.

| | Predicted non-CKD | Predicted CKD |
|---|---|---|
| Actual non-CKD | 12,841 | 1,359 |
| Actual CKD | 423 | 3,169 |

| | Predicted non-CKD | Predicted CKD |
|---|---|---|
| Actual non-CKD | 13,035 | 1,288 |
| Actual CKD | 411 | 3,186 |
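The metrics reported in the performance tables can all be derived from a 2×2 confusion matrix. The sketch below shows the standard formulas applied to the two matrices above. It assumes that rows are actual classes, columns are predicted classes, and the first row and column correspond to the non-CKD (negative) class. That reading is an inference from the class balance, not something the extracted tables state. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    recall = tp / (tp + fn)       # sensitivity: share of actual positives found
    specificity = tn / (tn + fp)  # share of actual negatives correctly rejected
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the two matrices above, under the assumed layout
# (top-left = true negatives, bottom-right = true positives).
first = binary_metrics(tn=12841, fp=1359, fn=423, tp=3169)
second = binary_metrics(tn=13035, fp=1288, fn=411, tp=3186)

for name, metrics in (("first matrix", first), ("second matrix", second)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Because non-CKD cases outnumber CKD cases roughly four to one, precision sits well below specificity even when both error rates are modest, which is why F1 and precision are more informative here than accuracy alone.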
Performance metrics for 6-month data.
| Dataset Type | Algorithm | Accuracy | F1 | Precision | Recall or Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|---|---|
| Temporal-monthly | CNN | | | | | | 0.957 |
| | BLSTM | 0.87 | 0.735 | 0.624 | 0.893 | 0.865 | 0.939 |
| Aggregated | LightGBM | 0.751 | 0.525 | 0.426 | 0.685 | 0.767 | 0.799 |
| | Logistic regression | 0.736 | 0.503 | 0.405 | 0.664 | 0.754 | 0.761 |
| | Random forest | 0.725 | 0.488 | 0.390 | 0.652 | 0.743 | 0.762 |
| | Decision tree | 0.732 | 0.483 | 0.395 | 0.622 | 0.76 | 0.745 |
| Temporal-quarterly | CNN | 0.814 | 0.620 | 0.525 | 0.757 | 0.828 | 0.876 |
| | BLSTM | 0.801 | 0.602 | 0.503 | 0.749 | 0.814 | 0.860 |
| | LightGBM | 0.801 | 0.588 | 0.505 | 0.704 | 0.826 | 0.841 |
| | Logistic regression | 0.710 | 0.475 | 0.373 | 0.653 | 0.724 | 0.748 |
| | Random forest | 0.737 | 0.501 | 0.405 | 0.654 | 0.758 | 0.773 |
| | Decision tree | 0.750 | 0.471 | 0.41 | 0.555 | 0.799 | 0.731 |
Performance metrics for 12-month data.
| Dataset Type | Algorithm | Accuracy | F1 | Precision | Recall or Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|---|---|
| Temporal-monthly | CNN | | | | | | 0.954 |
| | BLSTM | 0.865 | 0.731 | 0.61 | 0.903 | 0.856 | 0.936 |
| Aggregated | LightGBM | 0.759 | 0.524 | 0.437 | 0.654 | 0.786 | 0.789 |
| | Logistic regression | 0.722 | 0.491 | 0.39 | 0.66 | 0.738 | 0.766 |
| | Random forest | 0.74 | 0.487 | 0.406 | 0.608 | 0.774 | 0.756 |
| | Decision tree | 0.735 | 0.48 | 0.399 | 0.604 | 0.77 | 0.736 |
| Temporal-quarterly | CNN | 0.802 | 0.610 | 0.507 | 0.765 | 0.812 | 0.867 |
| | BLSTM | 0.779 | 0.585 | 0.471 | 0.773 | 0.781 | 0.855 |
| | LightGBM | 0.786 | 0.575 | 0.48 | 0.71 | 0.803 | 0.834 |
| | Logistic regression | 0.742 | 0.486 | 0.406 | 0.605 | 0.776 | 0.747 |
| | Random forest | 0.741 | 0.495 | 0.41 | 0.627 | 0.77 | 0.758 |
| | Decision tree | 0.746 | 0.465 | 0.404 | 0.548 | 0.797 | 0.721 |
Figure 7. Performance of the CNN model across different sizes of (a) training data and (b) feature set.
Figure 8. Feature importance for the LightGBM models for men and women for 6 and 12 months. The color of the x-axis labels indicates the type of feature: comorbidities are red, medications are blue, and age is black.