Ji Hwan Park, Han Eol Cho, Jong Hun Kim, Melanie M Wall, Yaakov Stern, Hyunsun Lim, Shinjae Yoo, Hyoung Seop Kim, Jiook Cha.
Abstract
A nationwide population-based cohort provides a new opportunity to build automated risk prediction models based on individuals' history of health and healthcare, beyond existing risk prediction models. We tested whether machine learning models can predict future incidence of Alzheimer's disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data for adults older than 65 years (N = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: "definite AD", with diagnostic codes and dementia medication (n = 614), and "probable AD", with diagnosis only (n = 2026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in 1, 2, 3, and 4 subsequent years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the "definite AD" and "probable AD" outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year prediction, 0.677 and 0.644; in 4-year prediction, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data in AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or early detection in clinical settings.
Keywords: Alzheimer's disease; Predictive markers
Year: 2020 PMID: 32258428 PMCID: PMC7099065 DOI: 10.1038/s41746-020-0256-0
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 CONSORT diagram.
Individuals with or without incident AD were drawn from the Korean National Health Insurance Service-National Sample Cohort.
Sample characteristics.
| | Definite AD | Probable AD | Non-AD |
|---|---|---|---|
| Number | 614 | 2026 | 38,710 |
| Age | 80.7 (80.2–81.1) | 79.2 (79.0–79.5) | 74.5 (74.4–74.5) |
| Sex (male: female) | 229 (44.6%): 285 (55.4%) | 733 (36.2%): 1293 (63.8%) | 18,200 (47.0%): 20,510 (53.0%) |
| Income levelᵃ | 6.00 (5.73–6.27) | 5.90 (5.87–5.93) | 6.02 (5.87–6.17) |
Based on the 0-year prediction model; the range indicates minimum and maximum.
ᵃ 10 levels based on the subject's monthly salary.
Fig. 2 Performance of machine learning models in predicting incident AD.
Receiver operating characteristic (ROC) plots are shown for 0-, 1-, 2-, 3-, and 4-subsequent-year prediction. Incident AD was defined based on ICD-10 AD codes and anti-dementia medication for AD ("Definite AD"), or based on AD codes only ("Probable AD"). For each prediction year, the best-performing model was selected for plotting.
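As a reading aid for the ROC plots and the AUC values reported here, the following is a minimal sketch of how an ROC curve and its AUC can be computed from predicted risk scores. It is not the authors' code: the function names `roc_auc` and `roc_points` and the toy labels and scores are assumptions made for illustration only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs in which the case
    receives the higher score; tied pairs count as one half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def roc_points(labels, scores):
    """(false positive rate, sensitivity) at every distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical, not from the study): 1 = incident AD, 0 = non-AD.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

toy_auc = roc_auc(toy_labels, toy_scores)      # 8 of 9 pairs ranked correctly
toy_curve = roc_points(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values should be read.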
Performance of AD predictive models trained on NHIS-NSC by using balanced samples.
| Sample | Subsequent years of incidence predictedᵇ | Classifier | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| Definite AD (AD/non-AD 614/614) | 0 year | LR | 0.76 | 0.794 | 0.726 | 0.793 |
| | | SVM | 0.763 | 0.817 | 0.715 | 0.811 |
| | | RF | 0.823 | 0.898ᵃ | 0.795 | 0.852 |
| | 1 year | LR | 0.677 | 0.693 | 0.65 | 0.704 |
| | | SVM | 0.678 | 0.705 | 0.699 | 0.656 |
| | | RF | 0.713 | 0.775ᵃ | 0.686 | 0.74 |
| | 2 year | LR | 0.652 | 0.684 | 0.639 | 0.666 |
| | | SVM | 0.663 | 0.687 | 0.572 | 0.753 |
| | | RF | 0.675 | 0.730ᵃ | 0.608 | 0.742 |
| | 3 year | LR | 0.623 | 0.645 | 0.562 | 0.684 |
| | | SVM | 0.607 | 0.635 | 0.58 | 0.633 |
| | | RF | 0.632 | 0.677ᵃ | 0.572 | 0.693 |
| | 4 year | LR | 0.627 | 0.661 | 0.509 | 0.745 |
| | | SVM | 0.646 | 0.685 | 0.538 | 0.754 |
| | | RF | 0.663 | 0.725ᵃ | 0.621 | 0.705 |
| Probable AD (AD/non-AD 2026/2026) | 0 year | LR | 0.736 | 0.783 | 0.689 | 0.783 |
| | | SVM | 0.734 | 0.794 | 0.652 | 0.816 |
| | | RF | 0.788 | 0.850ᵃ | 0.723 | 0.853 |
| | 1 year | LR | 0.663 | 0.697 | 0.634 | 0.692 |
| | | SVM | 0.661 | 0.691 | 0.592 | 0.729 |
| | | RF | 0.688 | 0.759ᵃ | 0.609 | 0.767 |
| | 2 year | LR | 0.643 | 0.672 | 0.633 | 0.654 |
| | | SVM | 0.645 | 0.68 | 0.58 | 0.709 |
| | | RF | 0.638 | 0.693ᵃ | 0.564 | 0.713 |
| | 3 year | LR | 0.61 | 0.635 | 0.557 | 0.663 |
| | | SVM | 0.597 | 0.644ᵃ | 0.427 | 0.767 |
| | | RF | 0.581 | 0.609 | 0.505 | 0.657 |
| | 4 year | LR | 0.611 | 0.644 | 0.516 | 0.707 |
| | | SVM | 0.601 | 0.641 | 0.465 | 0.738 |
| | | RF | 0.641 | 0.683ᵃ | 0.603 | 0.679 |
AD: Alzheimer's dementia; LR: logistic regression; SVM: support vector machine; RF: random forest.
ᵃ Best-performing models based on AUC.
ᵇ Subsequent years of incidence predicted = year of incidence − last year of health data (e.g., 3 year = incidence in 2013 − health data used in the prediction up to 2010; a 3-year future prediction).
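The accuracy, sensitivity, and specificity columns are linked by simple arithmetic: accuracy is the class-size-weighted average of sensitivity and specificity, so in a 1:1 balanced sample it reduces to their plain mean. The sketch below checks this against a few rows copied from the table (Definite AD, 614/614). The function name `accuracy_from_rates` and the rounding tolerance are assumptions for illustration, not part of the study.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by sensitivity and specificity for given class sizes."""
    correct = sensitivity * n_pos + specificity * n_neg
    return correct / (n_pos + n_neg)


# (horizon, classifier, reported accuracy, sensitivity, specificity),
# copied from the Definite AD rows of the balanced-sample table.
reported_rows = [
    ("0 year", "LR", 0.760, 0.726, 0.793),
    ("0 year", "SVM", 0.763, 0.715, 0.811),
    ("0 year", "RF", 0.823, 0.795, 0.852),
    ("1 year", "LR", 0.677, 0.650, 0.704),
    ("1 year", "SVM", 0.678, 0.699, 0.656),
    ("1 year", "RF", 0.713, 0.686, 0.740),
]

TOLERANCE = 0.0015  # assumed allowance for three-decimal rounding in the table

checks = []
for horizon, clf, acc, sens, spec in reported_rows:
    implied = accuracy_from_rates(sens, spec, n_pos=614, n_neg=614)
    checks.append((horizon, clf, acc, round(implied, 4), abs(implied - acc) < TOLERANCE))

for row in checks:
    print(row)

# Hypothetical illustration only: if the same sensitivity and specificity held
# in an unbalanced cohort (614 AD vs 38,710 non-AD, counts from the sample
# characteristics table), accuracy would be driven almost entirely by specificity.
unbalanced_acc = accuracy_from_rates(0.795, 0.852, n_pos=614, n_neg=38710)
```

This is one reason sensitivity, specificity, and AUC are more informative than accuracy alone when cases are rare.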
Top ten features and weights from logistic regression (0-year prediction).
| Type of data | Name | Coefficient (b) | 95% CI | Odds ratio | p value |
|---|---|---|---|---|---|
| Health checkup | Hemoglobin (g/dL) | −0.902 | −0.903/−0.901 | 0.405 | <0.001 |
| Demography | Age | 0.689 | 0.687/0.690 | 1.991 | <0.001 |
| Health checkup | Urine proteinᵃ | 0.303 | 0.300/0.306 | 1.353 | <0.001 |
| Medication | Zotepine (antipsychotic drug) | 0.303 | 0.280/0.325 | 1.353 | <0.001 |
| Medication | Nicametate Citrate (vasodilator) | −0.297 | −0.298/−0.295 | 0.743 | <0.001 |
| Disease code | Other degenerative disorders of nervous system in diseases classified elsewhere | −0.292 | −0.309/−0.274 | 0.746 | <0.001 |
| Disease code | Disorders of external ear in diseases classified elsewhere | −0.274 | −0.328/−0.220 | 0.760 | <0.001 |
| Medication | Tolfenamic acid 200 mg (pain killer) | −0.266 | −0.279/−0.254 | 0.766 | <0.001 |
| Disease code | Adult respiratory distress syndrome | −0.259 | −0.282/−0.236 | 0.771 | <0.001 |
| Medication | Eperisone Hydrochloride (antispasmodic drug) | 0.255 | 0.237/0.272 | 1.290 | <0.001 |
ᵃ Urine protein was detected by urine dipstick test (1: negative (−), 2: weak positive (±), 3: positive (1+), 4: positive (2+), 5: positive (3+), 6: positive (4+)).
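In the feature table, the odds ratio column is the exponential of the logistic regression weight in the third column. The sketch below recomputes it from the listed weights and compares it with the reported values. The variable names and the tolerance are assumptions for illustration, and small differences are expected because the table shows weights rounded to three decimals.

```python
import math

# (feature name, logistic regression weight, reported odds ratio), copied from the table.
features = [
    ("Hemoglobin (g/dL)", -0.902, 0.405),
    ("Age", 0.689, 1.991),
    ("Urine protein", 0.303, 1.353),
    ("Zotepine", 0.303, 1.353),
    ("Nicametate Citrate", -0.297, 0.743),
    ("Eperisone Hydrochloride", 0.255, 1.290),
]

TOLERANCE = 0.002  # assumed allowance for rounding of the published weights

recomputed = []
for name, weight, reported_or in features:
    odds_ratio = math.exp(weight)  # odds ratio = e^weight
    recomputed.append((name, round(odds_ratio, 3), reported_or,
                       abs(odds_ratio - reported_or) < TOLERANCE))

for row in recomputed:
    print(row)
```

A weight below zero gives an odds ratio below 1 (lower odds of incident AD as the feature increases) and a weight above zero gives an odds ratio above 1. How each feature was scaled before fitting is not stated here, so the per-unit interpretation depends on that scaling.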