Andreas Philipp Hassler1,2, Ernestina Menasalvas2, Francisco José García-García3, Leocadio Rodríguez-Mañas4, Andreas Holzinger5,6.
Abstract
BACKGROUND: Increasing life expectancy results in more elderly people struggling with age-related diseases and functional conditions. This poses major challenges for establishing new approaches to maintaining health at an advanced age. An important aspect of age-related deterioration of the general patient condition is frailty. The frailty syndrome is associated with a high risk of falls, hospitalization, disability, and ultimately increased mortality. Predictive data mining enables the discovery of potential risk factors and can serve as a clinical decision support system that provides the medical doctor with information on the probable clinical patient outcome. This enables the professional to react promptly and to avert likely adverse events in advance.
Keywords: Data mining; Data preprocessing; Frailty syndrome; Health data analytics; Machine learning; Missing value imputation; Predictive modeling; Risk factor discovery
Year: 2019 PMID: 30777059 PMCID: PMC6483150 DOI: 10.1186/s12911-019-0747-6
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Overview of features with more than 5% missing values
| Feature | Percentage of missing data | Reason for missingness | Imputation possible |
|---|---|---|---|
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| Earlier alcohol consumption | 19.20 | MAR | Yes |
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| | | MNAR (follow-up question) | No |
| D-dimer | 17.72 | MAR | Yes |
| High-sensitivity C-reactive protein (hs-CRP) [mg/L] | 14.98 | MAR | Yes |
| Number of IADL abilities | 6.33 | MAR | Yes |
| Total MMSE score | 15.82 | MAR | Yes |
| Total GDS | 9.49 | MAR | Yes |
| Depression | 9.49 | related to | No |
| Insulin [U/mL] | 11.60 | MAR | Yes |
| HDL | 9.07 | MAR | Yes |
| LDL | 9.07 | MAR | Yes |
| | | MAR | Yes |
| | | MAR | Yes |
| Mobility scale question 5 | 8.44 | MNAR (follow-up question) | No |
| Mobility scale question 6 | 8.44 | MNAR (follow-up question) | No |
| Mobility scale question 8 | 14.35 | MNAR (follow-up question) | No |
| Mobility scale question 9 | 13.92 | MNAR (follow-up question) | No |
| Mobility scale question 11 | 7.81 | MNAR (follow-up question) | No |
| Mobility scale question 12 | 7.59 | MNAR (follow-up question) | No |
| Mobility scale question 14 | 25.95 | MNAR (follow-up question) | No |
| Mobility scale question 15 | 26.16 | MNAR (follow-up question) | No |
| MMSE temporal domain 1 | 17.93 | MAR | Yes |
| MMSE temporal domain 2 | 18.78 | MAR | Yes |
| MMSE temporal domain 3 | 18.14 | MAR | Yes |
| MMSE temporal domain 4 | 22.57 | MAR | Yes |
| MMSE temporal domain 5 | 12.87 | MAR | Yes |
| MMSE spatial domain 1 | 13.08 | MAR | Yes |
| MMSE spatial domain 2 | 13.29 | MAR | Yes |
| MMSE spatial domain 3 | 13.29 | MAR | Yes |
| MMSE spatial domain 4 | 13.29 | MAR | Yes |
| MMSE spatial domain 5 | 13.29 | MAR | Yes |
| MMSE remembering 1 | 18.99 | MAR | Yes |
| MMSE remembering 2 | 19.41 | MAR | Yes |
| | | MAR | Yes |
| | | MAR | Yes |
| MMSE object naming | 13.92 | MAR | Yes |
| MMSE repeat phrase | 13.08 | MAR | Yes |
| MMSE left right | 13.50 | MAR | Yes |
| MMSE following written order | 13.29 | MAR | Yes |
| MMSE write sentence | 13.92 | MAR | Yes |
| MMSE copying design | 13.50 | MAR | Yes |
| Cognitive impairment | 17.09 | MAR | Yes |
| Individual income | 8.44 | MAR | Yes |
| Household income | 13.29 | MAR | Yes |
| Number of persons in the family | 18.78 | MAR | Yes |
| Insulin like growth factor 1 (IGF1) [ng/mL] | 27.00 | MAR | Yes |
| | | MNAR (follow-up question) | No |
| Overall income | 13.71 | MAR | Yes |
Features where more than one third of the values are missing are presented in bold
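The table's core distinction is between MAR features (imputable from other variables) and MNAR follow-up questions (missing precisely because the lead question made them inapplicable, so not imputable). A minimal sketch of how such an overview could be generated — feature names and the MNAR list here are illustrative, not the study's exact variables:

```python
def missingness_overview(data, mnar_features, threshold=5.0):
    """Summarise features with more than `threshold` percent missing values.

    `data` maps feature name -> list of values (None = missing). Features on
    the MNAR list (follow-up questions) are flagged as not imputable; the
    rest are treated as MAR and imputable, mirroring the table above."""
    rows = []
    for feat, values in data.items():
        pct = 100.0 * sum(v is None for v in values) / len(values)
        if pct <= threshold:
            continue  # only features above the 5% cut-off are reported
        mnar = feat in mnar_features
        rows.append((feat, round(pct, 2),
                     "MNAR (follow-up question)" if mnar else "MAR",
                     "No" if mnar else "Yes"))
    return rows

# Toy data with hypothetical feature names (not from the study)
data = {
    "hdl": [50.0] * 91 + [None] * 9,          # 9% missing, MAR
    "mobility_q5": [1.0] * 92 + [None] * 8,   # 8% missing, follow-up question
    "age": [70.0] * 100,                      # below the 5% threshold
}
overview = missingness_overview(data, mnar_features={"mobility_q5"})
# overview -> [("hdl", 9.0, "MAR", "Yes"),
#              ("mobility_q5", 8.0, "MNAR (follow-up question)", "No")]
```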
Fig. 1 Imputation Process. This figure illustrates how the imputation and the feature ranking process are connected. First, the imputation models are built using features that show a minimum correlation (here 7%) with the feature to be imputed. The 5 imputed data sets obtained this way are then used for the feature selection process. Once the selected features are known, the imputation is re-done, this time using as predictors the selected features in addition to the correlated ones.
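The first step of the process in Fig. 1 — choosing predictors for each imputation model by a 7% minimum correlation — can be sketched as follows. The Pearson implementation and feature names are illustrative; the imputation model itself (fit on these predictors) is omitted:

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def imputation_predictors(complete_cases, target, min_corr=0.07):
    """Select features whose absolute correlation with `target` meets the
    paper's 7% threshold; these feed the imputation model for `target`."""
    return [feat for feat, values in complete_cases.items()
            if feat != target
            and abs(pearson(values, complete_cases[target])) >= min_corr]

# Toy complete cases with hypothetical features
cases = {
    "height": [160, 165, 170, 175, 180],
    "weight": [55, 60, 68, 72, 80],   # strongly correlated with height
    "noise":  [1, -1, 1, -1, 1],      # uncorrelated with height
}
preds = imputation_predictors(cases, "height")
# preds -> ["weight"]
```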
Fig. 2 Feature Selection Process. This figure shows the overall feature selection process. First, the Boruta algorithm is applied to each imputed data set. Then the 5 selected feature sets are compared, and features that appear in 3 or more of them are chosen for the final feature set.
Fig. 3 Feature Selection Results (Boruta). This image shows the attributes and the importance measure by which they were selected (green) or rejected (red). The decision was made by comparing each attribute's importance measure to randomly permuted copies of itself, the so-called shadow attributes [54]. Features that could be neither selected nor rejected were marked as tentative (yellow).
Fig. 4 Modeling and Evaluation Procedure. This image shows the general modeling and evaluation procedure. First, models are built using the 5 imputed data sets. Second, the models are evaluated in a cross-validation setup. The resulting performance measures (e.g. accuracy, sensitivity, specificity) can then be compared.
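The evaluation loop behind Fig. 4 and the results tables below can be sketched generically: split the data into k folds, train on k-1, test on the held-out fold, and report each metric as mean ± standard deviation. The classifier here is a deliberately trivial threshold model, a placeholder for the paper's methods (Naive Bayes, CART, random forest, etc.):

```python
import statistics

def kfold_metrics(X, y, fit, predict, k=10):
    """Evaluate a binary frail (1) / non-frail (0) classifier with k-fold
    cross-validation, returning (mean, sd) of accuracy, sensitivity and
    specificity as reported in the results tables."""
    n = len(y)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved folds
    acc, sens, spec = [], [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        truth = [y[i] for i in test_idx]
        tp = sum(p == t == 1 for p, t in zip(preds, truth))
        tn = sum(p == t == 0 for p, t in zip(preds, truth))
        pos = sum(truth)
        neg = len(truth) - pos
        acc.append((tp + tn) / len(truth))
        sens.append(tp / pos if pos else 1.0)
        spec.append(tn / neg if neg else 1.0)
    return {name: (statistics.fmean(v), statistics.stdev(v))
            for name, v in [("accuracy", acc),
                            ("sensitivity", sens),
                            ("specificity", spec)]}

# Placeholder model: learn the mean of one feature, predict frail above it
fit = lambda X, y: statistics.fmean(x[0] for x in X)
predict = lambda thr, x: int(x[0] > thr)
X = [[float(i)] for i in range(100)]
y = [int(i >= 50) for i in range(100)]
results = kfold_metrics(X, y, fit, predict)
```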
Obtained final selection of features using the Boruta algorithm and a voting system (presence of the feature in at least 3 out of the 5 sets)
| Description | Type |
|---|---|
| Height (cm) | Numeric |
| Presence of cognitive impairment | Binary |
| Presence of depression | Binary |
| Mobility Scale follow-up question (tiredness when going out) | Binary |
| Mobility Scale question (stair-climbing ability) | Binary |
| Mobility Scale follow-up question (tiredness when walking outside) | Binary |
| Mobility Scale question (walking outside ability) | Binary |
| MMSE follow-up question (remembering objects ability) | Categorical |
| Total GDS | Binary |
| Age in years | Numeric |
| ADL question (difficulty washing) | Categorical |
| Number of ADL abilities | Numeric |
| Number of IADL abilities | Numeric |
| IADL question (difficulty using telephone) | Categorical |
| IADL question (difficulty shopping) | Categorical |
| IADL question (difficulty cooking) | Categorical |
| IADL question (difficulty doing light housework) | Categorical |
| IADL question (difficulty doing heavy housework) | Categorical |
| IADL question (difficulty using public transportation) | Categorical |
| Total MMSE score | Numeric |
| Sum of mobility score main features (em1, em2, em3, em4, em5) | Numeric |
| Number of drugs (drug intake) | Numeric |
| Alkaline phosphatase [U/L] | Numeric |
| Presence of polypharmacy | Binary |
| Self-reported health status | Categorical |
| Self-reported health status compared to people the same age | Categorical |
| Capacity of dealing with problems | Categorical |
| Capacity of dealing with tasks | Categorical |
| GDS question (dropped activity of interests) | Binary |
| GDS question (boredom) | Binary |
| Presence of joint inflammation (more than 4 weeks in a row) | Categorical |
10-fold cross-validation results for the binary classification models for each imputed data set, working with the two classes non-frail and frail
| Prediction method | Accuracy | AUC | Sensitivity | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|---|
| Imputation 1 | ||||||
| Naive Bayes | 73.20 ± 5.97% | 0.756 ± 0.052 | 0.656 ± 0.102 | 0.749 ± 0.067 | ||
| CART | 72.77 ± 5.20% | 0.710 ± 0.061 | 0.782 ± 0.108 | 0.639 ± 0.168 | 0.789 ± 0.065 | 0.778 ± 0.049 |
| Bagging CART | 75.51 ± 7.16% | 0.731 ± 0.070 | 0.830 ± 0.086 | 0.633 ± 0.084 | 0.786 ± 0.048 | 0.806 ± 0.060 |
| C5.0 | 0.752 ± 0.086 | 0.644 ± 0.164 | 0.804 ± 0.075 | |||
| Random forest | 77.64 ± 5.62% | 0.755 ± 0.053 | 0.844 ± 0.089 | 0.667 ± 0.087 | 0.806 ± 0.041 | 0.823 ± 0.050 |
| Support vector machines (RBF) | 77.64 ± 6.55% | 0.824 ± 0.09 | 0.700 ± 0.099 | 0.819 ± 0.053 | 0.819 ± 0.057 | |
| Linear discriminant analysis | 75.11 ± 5.34% | 0.739 ± 0.042 | 0.789 ± 0.096 | 0.689 ± 0.047 | 0.805 ± 0.023 | 0.795 ± 0.055 |
| Imputation 2 | ||||||
| Naive Bayes | 72.78 ± 6.47% | 0.750 ± 0.059 | 0.656 ± 0.109 | 0.745 ± 0.072 | ||
| CART | 70.89 ± 5.94% | 0.699 ± 0.057 | 0.741 ± 0.098 | 0.656 ± 0.104 | 0.781 ± 0.047 | 0.757 ± 0.058 |
| Bagging CART | 75.11 ± 6.59% | 0.729 ± 0.072 | 0.820 ± 0.089 | 0.639 ± 0.134 | 0.792 ± 0.066 | 0.802 ± 0.054 |
| C5.0 | 77.39 ± 7.35% | 0.745 ± 0.093 | 0.622 ± 0.192 | 0.797 ± 0.082 | ||
| Random forest | 77.01 ± 6.65% | 0.752 ± 0.064 | 0.827 ± 0.101 | 0.678 ± 0.101 | 0.809 ± 0.052 | 0.815 ± 0.060 |
| Support vector machines (RBF) | 0.827 ± 0.085 | 0.694 ± 0.102 | 0.816 ± 0.057 | 0.820 ± 0.060 | ||
| Linear discriminant analysis | 76.14 ± 5.15% | 0.752 ± 0.046 | 0.792 ± 0.081 | 0.711 ± 0.057 | 0.817 ± 0.032 | 0.803 ± 0.050 |
| Imputation 3 | ||||||
| Naive Bayes | 73.41 ± 5.64% | 0.757 ± 0.057 | 0.664 ± 0.083 | 0.755 ± 0.056 | ||
| CART | 73.21 ± 5.75% | 0.728 ± 0.07 | 0.746 ± 0.064 | 0.709 ± 0.14 | 0.815 ± 0.067 | 0.776 ± 0.045 |
| Bagging CART | 78.28 ± 3.92% | 0.764 ± 0.057 | 0.688 ± 0.148 | 0.823 ± 0.062 | 0.828 ± 0.026 | |
| C5.0 | 74.06 ± 7.12% | 0.709 ± 0.089 | 0.837 ± 0.057 | 0.581 ± 0.181 | 0.774 ± 0.073 | 0.802 ± 0.048 |
| Random forest | 77.62 ± 6.65% | 0.762 ± 0.076 | 0.820 ± 0.068 | 0.704 ± 0.134 | 0.824 ± 0.068 | 0.820 ± 0.052 |
| Support vector machines (RBF) | 0.838 ± 0.049 | 0.720 ± 0.09 | 0.833 ± 0.048 | |||
| Linear discriminant analysis | 78.47 ± 4.77% | 0.773 ± 0.051 | 0.821 ± 0.059 | 0.726 ± 0.085 | 0.833 ± 0.045 | 0.825 ± 0.040 |
| Imputation 4 | ||||||
| Naive Bayes | 72.78 ± 5.89% | 0.750 ± 0.061 | 0.657 ± 0.083 | 0.749 ± 0.057 | ||
| CART | 71.26 ± 5.83% | 0.697 ± 0.053 | 0.762 ± 0.095 | 0.631 ± 0.083 | 0.774 ± 0.043 | 0.765 ± 0.058 |
| Bagging CART | 76.38 ± 5.77% | 0.747 ± 0.069 | 0.817 ± 0.076 | 0.676 ± 0.147 | 0.812 ± 0.065 | 0.811 ± 0.046 |
| C5.0 | 74.25 ± 7.13% | 0.712 ± 0.085 | 0.587 ± 0.157 | 0.774 ± 0.07 | 0.803 ± 0.052 | |
| Random forest | 76.99 ± 5.90% | 0.755 ± 0.069 | 0.817 ± 0.069 | 0.693 ± 0.136 | 0.819 ± 0.067 | 0.815 ± 0.046 |
| Support vector machines (RBF) | 0.771 ± 0.057 | 0.827 ± 0.053 | 0.714 ± 0.092 | 0.829 ± 0.049 | ||
| Linear discriminant analysis | 78.06 ± 5.39% | 0.807 ± 0.061 | 0.737 ± 0.091 | 0.837 ± 0.049 | 0.820 ± 0.045 | |
| Imputation 5 | ||||||
| Naive Bayes | 73.41 ± 5.45% | 0.756 ± 0.053 | 0.664 ± 0.088 | 0.754 ± 0.057 | ||
| CART | 71.67 ± 7.79% | 0.702 ± 0.087 | 0.762 ± 0.100 | 0.642 ± 0.166 | 0.786 ± 0.089 | 0.769 ± 0.066 |
| Bagging CART | 76.79 ± 4.69% | 0.749 ± 0.053 | 0.827 ± 0.071 | 0.671 ± 0.115 | 0.809 ± 0.049 | 0.815 ± 0.039 |
| C5.0 | 75.31 ± 4.08% | 0.726 ± 0.055 | 0.615 ± 0.138 | 0.787 ± 0.055 | 0.808 ± 0.030 | |
| Random forest | 78.03 ± 5.10% | 0.764 ± 0.060 | 0.830 ± 0.073 | 0.698 ± 0.129 | 0.824 ± 0.061 | 0.824 ± 0.041 |
| Support vector machines (RBF) | 0.827 ± 0.055 | 0.714 ± 0.092 | 0.828 ± 0.049 | |||
| Linear discriminant analysis | 77.62 ± 5.35% | 0.769 ± 0.058 | 0.800 ± 0.063 | 0.737 ± 0.102 | 0.836 ± 0.054 | 0.816 ± 0.045 |
The highest obtained value for each performance category for each imputed data set is marked in bold