Davide Chicco, Giuseppe Jurman.
Abstract
BACKGROUND: Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly manifest as myocardial infarctions and heart failures. Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistics analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning, in particular, can predict patients' survival from their data and can identify the most important features among those included in their medical records.
Keywords: Biomedical informatics; Biostatistics; Cardiovascular heart diseases; Data mining; Ejection fraction; Feature ranking; Feature selection; Heart failure; Machine learning; Medical records; Serum creatinine
Year: 2020 PMID: 32013925 PMCID: PMC6998201 DOI: 10.1186/s12911-020-1023-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Survival prediction results including the follow-up time – mean of 100 executions
| Method | F1 score | Accuracy | TP rate | TN rate | PR AUC | ROC AUC |
|---|---|---|---|---|---|---|
| Logistic regression (EF, SC, & FU) | 0.719* | 0.838* | 0.785* | 0.860* | 0.617* | 0.822* |
| Logistic regression (all features) | 0.714 | 0.833 | 0.780 | 0.856 | 0.612 | 0.818 |
Top row: logistic regression using only ejection fraction (EF), serum creatinine (SC), and follow-up month (FU). Bottom row: logistic regression using all features. MCC: Matthews correlation coefficient. TP rate: true positive rate (sensitivity, recall). TN rate: true negative rate (specificity). Confusion matrix threshold for MCC, F1 score, accuracy, TP rate, TN rate: τ=0.5. PR: precision-recall curve. ROC: receiver operating characteristic curve. AUC: area under the curve. MCC: worst value = –1 and best value = +1. F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC: worst value = 0 and best value = 1. MCC, F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC formulas: Additional file 1 (“Binary statistical rates” section). We marked with * the top results for each score.
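The rates listed in this caption can all be derived from one confusion matrix at the threshold τ=0.5. The following is a minimal sketch using the standard textbook definitions, not the authors' code: the function name `binary_rates`, the `>=` comparison at the threshold, and the toy inputs are assumptions made here for illustration.

```python
import math

def binary_rates(y_true, y_prob, tau=0.5):
    """Confusion-matrix rates for binary labels and predicted probabilities.

    Assumption: a probability >= tau counts as a positive prediction
    (the source only states the threshold value, not the comparison).
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= tau else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn) if (tp + fn) else 0.0,   # sensitivity, recall
        "tn_rate": tn / (tn + fp) if (tn + fp) else 0.0,   # specificity
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        # MCC ranges from -1 (worst) to +1 (best); 0.0 is returned when undefined
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }

# Toy example (hypothetical values, not taken from the dataset)
toy_truth = [1, 1, 0, 0, 0, 1]
toy_probabilities = [0.9, 0.4, 0.2, 0.6, 0.1, 0.8]
rates = binary_rates(toy_truth, toy_probabilities)
```

Because MCC uses all four cells of the confusion matrix, it can stay low even when accuracy looks acceptable on an imbalanced sample such as this one (96 dead versus 203 survived patients).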
Meanings, measurement units, and intervals of each feature of the dataset
| Feature | Explanation | Measurement | Range |
|---|---|---|---|
| Age | Age of the patient | Years | [40,..., 95] |
| Anaemia | Decrease of red blood cells or hemoglobin | Boolean | 0, 1 |
| High blood pressure | If a patient has hypertension | Boolean | 0, 1 |
| Creatinine phosphokinase (CPK) | Level of the CPK enzyme in the blood | mcg/L | [23,..., 7861] |
| Diabetes | If the patient has diabetes | Boolean | 0, 1 |
| Ejection fraction | Percentage of blood leaving the heart at each contraction | Percentage | [14,..., 80] |
| Sex | Woman or man | Binary | 0, 1 |
| Platelets | Platelets in the blood | kiloplatelets/mL | [25.01,..., 850.00] |
| Serum creatinine | Level of creatinine in the blood | mg/dL | [0.50,..., 9.40] |
| Serum sodium | Level of sodium in the blood | mEq/L | [114,..., 148] |
| Smoking | If the patient smokes | Boolean | 0, 1 |
| Time | Follow-up period | Days | [4,..., 285] |
| (target) death event | If the patient died during the follow-up period | Boolean | 0, 1 |
mcg/L: micrograms per liter. mL: milliliter. mEq/L: milliequivalents per liter.
Statistical quantitative description of the category features
| Category feature | Full sample # | Full sample % | Dead patients # | Dead patients % | Survived patients # | Survived patients % |
|---|---|---|---|---|---|---|
| Anaemia (0: false) | 170 | 56.86 | 50 | 52.08 | 120 | 59.11 |
| Anaemia (1: true) | 129 | 43.14 | 46 | 47.92 | 83 | 40.89 |
| High blood pressure (0: false) | 194 | 64.88 | 57 | 59.38 | 137 | 67.49 |
| High blood pressure (1: true) | 105 | 35.12 | 39 | 40.62 | 66 | 32.51 |
| Diabetes (0: false) | 174 | 58.19 | 56 | 58.33 | 118 | 58.13 |
| Diabetes (1: true) | 125 | 41.81 | 40 | 41.67 | 85 | 41.87 |
| Sex (0: woman) | 105 | 35.12 | 34 | 35.42 | 71 | 34.98 |
| Sex (1: man) | 194 | 64.88 | 62 | 64.58 | 132 | 65.02 |
| Smoking (0: false) | 203 | 67.89 | 66 | 68.75 | 137 | 67.49 |
| Smoking (1: true) | 96 | 32.11 | 30 | 31.25 | 66 | 32.51 |
#: number of patients. %: percentage of patients. Full sample: 299 individuals. Dead patients: 96 individuals. Survived patients: 203 individuals.
Statistical quantitative description of the numeric features
| Numeric feature | Full sample median | Full sample mean | Full sample σ | Dead patients median | Dead patients mean | Dead patients σ | Survived patients median | Survived patients mean | Survived patients σ |
|---|---|---|---|---|---|---|---|---|---|
| Age | 60.00 | 60.83 | 11.89 | 65.00 | 65.22 | 13.21 | 60.00 | 58.76 | 10.64 |
| Creatinine phosphokinase | 250.00 | 581.80 | 970.29 | 259.00 | 670.20 | 1316.58 | 245.00 | 540.10 | 753.80 |
| Ejection fraction | 38.00 | 38.08 | 11.83 | 30.00 | 33.47 | 12.53 | 38.00 | 40.27 | 10.86 |
| Platelets | 262.00 | 263.36 | 97.80 | 258.50 | 256.38 | 98.53 | 263.00 | 266.66 | 97.53 |
| Serum creatinine | 1.10 | 1.39 | 1.03 | 1.30 | 1.84 | 1.47 | 1.00 | 1.19 | 0.65 |
| Serum sodium | 137.00 | 136.60 | 4.41 | 135.50 | 135.40 | 5.00 | 137.00 | 137.20 | 3.98 |
| Time | 115.00 | 130.30 | 77.61 | 44.50 | 70.89 | 62.38 | 172.00 | 158.30 | 67.74 |
Full sample: 299 individuals. Dead patients: 96 individuals. Survived patients: 203 individuals. σ: standard deviation
Fig. 2 Aggregated results of the feature rankings. Borda list of the 700 rankings obtained applying seven ranking methods on 100 instances of 70% training subsets of D. We ranked the Borda list by importance, quantitatively expressed as the Borda count score, corresponding to the average position across all 700 lists. The lower the score, the higher the average rank of the feature in the 700 lists and thus the more important the feature. We highlight the top two features with red circles.
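The Borda count score described in this caption can be written compactly. In this sketch, r_i(f) is notation introduced here (it does not appear in the source) for the position of feature f in the i-th of the 700 ranked lists, and B(f) is the resulting score:

```latex
B(f) = \frac{1}{700} \sum_{i=1}^{700} r_i(f)
```

A feature ranked first in every list would have B(f) = 1, the lowest and therefore best possible score.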
Survival prediction results on all clinical features – mean of 100 executions
| Method | F1 score | Accuracy | TP rate | TN rate | PR AUC | ROC AUC |
|---|---|---|---|---|---|---|
| Random forests | 0.547 | 0.740* | 0.491 | 0.864 | 0.657 | 0.800* |
| Decision tree | 0.554* | 0.737 | 0.532* | 0.831 | 0.506 | 0.681 |
| Gradient boosting | 0.527 | 0.738 | 0.477 | 0.860 | 0.594 | 0.754 |
| Linear regression | 0.475 | 0.730 | 0.394 | 0.892 | 0.495 | 0.643 |
| One rule | 0.465 | 0.729 | 0.383 | 0.892 | 0.482 | 0.637 |
| Artificial neural network | 0.483 | 0.680 | 0.428 | 0.815 | 0.750* | 0.559 |
| Naïve Bayes | 0.364 | 0.696 | 0.279 | 0.898 | 0.437 | 0.589 |
| SVM radial | 0.182 | 0.690 | 0.122 | 0.967 | 0.587 | 0.749 |
| SVM linear | 0.115 | 0.684 | 0.072 | 0.981* | 0.594 | 0.754 |
| k-nearest neighbors | 0.148 | 0.624 | 0.121 | 0.866 | 0.323 | 0.493 |
MCC: Matthews correlation coefficient. TP rate: true positive rate (sensitivity, recall). TN rate: true negative rate (specificity). Confusion matrix threshold for MCC, F1 score, accuracy, TP rate, TN rate: τ=0.5. PR: precision-recall curve. ROC: receiver operating characteristic curve. AUC: area under the curve. MCC: worst value = –1 and best value = +1. F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC: worst value = 0 and best value = 1. MCC, F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC formulas: Additional file 1 (“Binary statistical rates” section). Gradient boosting: eXtreme Gradient Boosting (XGBoost). SVM radial: Support Vector Machine with radial Gaussian kernel. SVM linear: Support Vector Machine with linear kernel. Our hyper-parameter grid search optimization for k-Nearest Neighbors selected k=3 most often (10 runs out of 100). Our hyper-parameter grid search optimization for the Support Vector Machine with radial Gaussian kernel selected C=10 most often (56 runs out of 100). Our hyper-parameter grid search optimization for the Support Vector Machine with linear kernel selected C=0.1 most often (50 runs out of 100). Our hyper-parameter grid search optimization for the Artificial Neural Network selected 1 hidden layer and 100 hidden units most often (74 runs out of 100). We marked with * the top results for each score.
Mann–Whitney U test
| Rank | Feature | p-value |
|---|---|---|
| 1 | Serum creatinine | 0 |
| 2 | Ejection fraction | 0.000001 |
| 3 | Age | 0.000167 |
| 4 | Serum sodium | 0.000293 |
| 5 | High blood pressure | 0.171016 |
| 6 | Anaemia | 0.252970 |
| 7 | Platelets | 0.425559 |
| 8 | Creatinine phosphokinase | 0.684040 |
| 9 | Smoking | 0.828190 |
| 10 | Sex | 0.941292 |
| 11 | Diabetes | 0.973913 |
Results of the univariate application of the Mann–Whitney U test between each feature and the target feature death event
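As an illustration of the univariate test named in this caption, the sketch below compares the values of one feature between two groups (for example, dead versus survived patients). It is a pure-Python approximation written for this page, not the authors' implementation: it assumes a two-sided test with the normal approximation and no tie or continuity correction, and the names `average_ranks` and `mann_whitney_u` are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the current group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U statistic, approximate two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    # Normal approximation of U under the null hypothesis (assumption:
    # no tie correction, which slightly overstates the variance with ties)
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return min(u_a, u_b), p_value

# Toy values (hypothetical, not from the dataset)
u_statistic, p_value = mann_whitney_u([1.0, 1.1, 0.9], [1.9, 2.7, 1.8])
```

Running the function once per feature and sorting the features by ascending p-value gives a ranking of the same kind as the table above.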
Pearson correlation coefficients (PCC) and Shapiro–Wilk tests
| Pearson rank | Feature | abs(PCC) | Shapiro–Wilk rank | Feature | p-value |
|---|---|---|---|---|---|
| 1 | Serum creatinine | 0.294 | 1 | Creatinine phosphokinase | 7.05×10^-28 |
| 2 | Ejection fraction | 0.269 | 2 | Serum creatinine | 5.39×10^-27 |
| 3 | Age | 0.254 | 3 | Smoking | 4.58×10^-26 |
| 4 | Serum sodium | 0.195 | 4 | Death event | 4.58×10^-26 |
| 5 | High blood pressure | 0.079 | 5 | Sex | 1.17×10^-25 |
| 6 | Anaemia | 0.066 | 6 | High blood pressure | 1.17×10^-25 |
| 7 | Creatinine phosphokinase | 0.063 | 7 | Diabetes | 5.12×10^-25 |
| 8 | Platelets | 0.049 | 8 | Anaemia | 6.21×10^-25 |
| 9 | Smoking | 0.013 | 9 | Platelets | 2.89×10^-12 |
| 10 | Sex | 0.004 | 10 | Serum sodium | 9.21×10^-10 |
| 11 | Diabetes | 0.002 | 11 | Ejection fraction | 7.22×10^-9 |
| | | | 12 | Age | 5.34×10^-5 |
Results of the univariate application of the Pearson correlation coefficient between each feature and the target feature death event, absolute value (left), and the univariate application of the Shapiro–Wilk test on each feature (right)
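The left-hand ranking orders features by the absolute value of their Pearson correlation with the target. A minimal sketch of that procedure follows, assuming each feature is a plain list of numbers aligned with a 0/1 target list. The function names and the toy values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def rank_by_abs_pcc(features, target):
    """Sort features by decreasing abs(PCC) with the target."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data (hypothetical values chosen only to illustrate the ranking)
toy_features = {
    "serum_creatinine": [1.0, 1.9, 1.1, 2.7, 0.9, 1.3],
    "ejection_fraction": [40, 25, 38, 20, 60, 35],
    "platelets": [265, 263, 250, 280, 270, 255],
}
toy_death_event = [0, 1, 0, 1, 0, 0]
ranking = rank_by_abs_pcc(toy_features, toy_death_event)
```

Taking the absolute value matters here because a strong negative correlation, such as a low ejection fraction going with death, is as informative as a strong positive one.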
Chi squared test
| Rank | Feature | p-value |
|---|---|---|
| 1 | Ejection fraction | 0.000500 |
| 2 | Serum creatinine | 0.000500 |
| 3 | Serum sodium | 0.003998 |
| 4 | Age | 0.005997 |
| 5 | High blood pressure | 0.181909 |
| 6 | Anaemia | 0.260370 |
| 7 | Creatinine phosphokinase | 0.377811 |
| 8 | Platelets | 0.637681 |
| 9 | Smoking | 0.889555 |
| 10 | Sex | 1 |
| 11 | Diabetes | 1 |
Results of the application of the chi squared test between each feature and the target feature death event
Fig. 1 Random Forests feature selection. Random Forests feature selection through accuracy reduction (a). Random Forests feature selection through Gini impurity (b).
Random Forests feature selection aggregate ranking
| Final rank | Feature | Accuracy decrease | Accuracy decrease rank | Gini impurity | Gini impurity rank |
|---|---|---|---|---|---|
| 1 | Serum creatinine | 3.78×10^-2 | 1 | 11.84 | 1 |
| 2 | Ejection fraction | 3.43×10^-2 | 2 | 10.71 | 2 |
| 3 | Age | 1.53×10^-2 | 3 | 8.58 | 3 |
| 4 | Creatinine phosphokinase | 7.27×10^-4 | 6 | 7.26 | 4 |
| 4 | Serum sodium | 7.20×10^-3 | 4 | 6.49 | 6 |
| 6 | Sex | 1.64×10^-3 | 5 | 1.12 | 8 |
| 6 | Platelets | 2.47×10^-4 | 8 | 6.80 | 5 |
| 8 | High blood pressure | −1.68×10^-3 | 11 | 1.13 | 7 |
| 8 | Smoking | 3.68×10^-4 | 7 | 0.95 | 11 |
| 10 | Anaemia | −5.91×10^-4 | 10 | 1.06 | 9 |
| 10 | Diabetes | −1.41×10^-4 | 9 | 1.02 | 10 |
We merged the two rankings by position, using Borda's method [103].
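The merge can be checked directly against the table. The sketch below sums each feature's two positions and lets tied features share the best rank of their group. Both the summation and the tie rule are assumptions made here, although under them the result matches the "Final rank" column. The function name `borda_merge` is hypothetical.

```python
# Positions copied from the "Accuracy decrease rank" and
# "Gini impurity rank" columns of the table above.
accuracy_rank = {
    "Serum creatinine": 1, "Ejection fraction": 2, "Age": 3,
    "Creatinine phosphokinase": 6, "Serum sodium": 4, "Sex": 5,
    "Platelets": 8, "High blood pressure": 11, "Smoking": 7,
    "Anaemia": 10, "Diabetes": 9,
}
gini_rank = {
    "Serum creatinine": 1, "Ejection fraction": 2, "Age": 3,
    "Creatinine phosphokinase": 4, "Serum sodium": 6, "Sex": 8,
    "Platelets": 5, "High blood pressure": 7, "Smoking": 11,
    "Anaemia": 9, "Diabetes": 10,
}

def borda_merge(*rankings):
    """Merge rankings by summed position; ties share the lowest rank."""
    features = list(rankings[0])
    score = {f: sum(r[f] for r in rankings) for f in features}
    ordered = sorted(features, key=lambda f: score[f])
    final_rank = {}
    for position, feature in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and score[feature] == score[previous]:
            final_rank[feature] = final_rank[previous]  # tie: reuse the rank
        else:
            final_rank[feature] = position
    return final_rank

final_rank = borda_merge(accuracy_rank, gini_rank)
```

For example, creatinine phosphokinase (6 + 4) and serum sodium (4 + 6) both sum to 10, which is why they share final rank 4 and no feature receives rank 5.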
Survival prediction results on serum creatinine and ejection fraction – mean of 100 executions
| Method | F1 score | Accuracy | TP rate | TN rate | PR AUC | ROC AUC |
|---|---|---|---|---|---|---|
| Random forests | 0.754* | 0.585* | 0.541 | 0.855* | 0.541 | 0.698 |
| Gradient boosting | 0.750 | 0.585* | 0.550* | 0.845 | 0.673* | 0.792* |
| SVM radial | 0.720 | 0.543 | 0.519 | 0.816 | 0.494 | 0.667 |
MCC: Matthews correlation coefficient. TP rate: true positive rate (sensitivity, recall). TN rate: true negative rate (specificity). Confusion matrix threshold for MCC, F1 score, accuracy, TP rate, TN rate: τ=0.5. PR: precision-recall curve. ROC: receiver operating characteristic curve. AUC: area under the curve. MCC: worst value = –1 and best value = +1. F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC: worst value = 0 and best value = 1. MCC, F1 score, accuracy, TP rate, TN rate, PR AUC, ROC AUC formulas: Additional file 1 (“Binary statistical rates” section). Gradient boosting: eXtreme Gradient Boosting (XGBoost). SVM radial: Support Vector Machine with radial Gaussian kernel. We marked with * the top results for each score.
Fig. 3 Scatterplot of serum creatinine versus ejection fraction. Serum creatinine (x axis) range: [0.50, 9.40] mg/dL. Ejection fraction (y axis) range: [14, 80]%. We manually drew a black straight line to highlight the discrimination between alive and dead patients.
Fig. 4 Barplot of the survival percentage for each follow-up month. Follow-up time (x axis) range: [0, 9] months. Survival percentage (y axis) range: [11.43, 100]%. For each month, we report here the percentage of survived patients. For month 0 (less than 30 days), for example, there were 11.43% survived patients and 88.57% deceased patients.
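The per-month percentages in this barplot can be obtained by grouping patients by follow-up month. A minimal sketch follows, assuming a month is 30 days (as the "less than 30 days" remark suggests) and that each record is a `(follow_up_days, died)` pair. The function name and the toy records are hypothetical.

```python
def survival_percentage_by_month(records, days_per_month=30):
    """Map follow-up month -> percentage of patients who survived.

    records: iterable of (follow_up_days, died) pairs, with died in {0, 1}.
    Assumption: month index = follow_up_days // days_per_month.
    """
    counts = {}  # month -> (total patients, survivors)
    for follow_up_days, died in records:
        month = follow_up_days // days_per_month
        total, survivors = counts.get(month, (0, 0))
        counts[month] = (total + 1, survivors + (0 if died else 1))
    return {month: 100.0 * survivors / total
            for month, (total, survivors) in sorted(counts.items())}

# Toy records (hypothetical, not the study data)
toy_records = [(4, 1), (10, 1), (29, 0), (35, 0), (60, 1)]
percentages = survival_percentage_by_month(toy_records)
```

The deceased percentage for a month is simply 100 minus the value returned, which is how the 11.43% and 88.57% figures for month 0 relate to each other.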
Stratified logistic regression feature ranking
| Rank | Clinical feature | Importance |
|---|---|---|
| 1 | Ejection fraction | 4.13938106 |
| 2 | Serum creatinine | 3.69917184 |
| 3 | Age | 2.61938095 |
| 4 | Creatinine phosphokinase | 1.88929235 |
| 5 | Sex | 1.32038950 |
| 6 | Platelets | 1.06270364 |
| 7 | High blood pressure | 0.79478093 |
| 8 | Anaemia | 0.77547306 |
| 9 | Smoking | 0.65828165 |
| 10 | Diabetes | 0.60355319 |
| 11 | Serum sodium | 0.54241360 |
Results of the feature ranking obtained by the stratified logistic regression. Importance: coefficient of the trained logistic regression model, average of 100 executions.