Lorenzo Fassina, Alessandro Faragli, Francesco Paolo Lo Muzio, Sebastian Kelle, Carlo Campana, Burkert Pieske, Frank Edelmann, Alessio Alogna.
Abstract
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients is a major target of clinical data science. However, achieving large sample sizes is sometimes a challenge because patient recruitment is difficult and follow-up times are long, which aggravates the problem of missing data. Population-enhancing algorithms are therefore crucial to overcome the issue of narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset). The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset in a statistically legitimate way, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to correctness in predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method enhanced the HF dataset cardinality (711 patients before dataset preprocessing) circa 10 times, and circa 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality are an issue.
Keywords: data science; heart failure; missing data; narrow dataset cardinality; random shuffle
Year: 2020 PMID: 33330661 PMCID: PMC7714902 DOI: 10.3389/fcvm.2020.599923
Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X
Figure 1. Simplified representation of the original dataset along with its variants. (A) The simplified original dataset showing four patients (P = patient), each analyzed with three features (F = feature), displayed with different symbols and colors, and grouped into two classes highlighted with the colored boxes. (B) Representation of the “repeated-measure” variant to expand the cardinality of the original dataset. (C) Same as B, but for our proposed “shuffle” variant.
Figure 2. Comparison of the simplified original dataset with its enhancements. (A) Plot of two original numerical features for two classes (the 1st and the 3rd of 61 classes). (B) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random generation of values inside each feature according to a fitted repeated-measures model. (C) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random exchange/shuffle of values inside each feature (each feature is independently shuffled in a random and intraclass manner).
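The intraclass shuffle described in the caption of Figure 2C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `shuffle_enhance`, the `copies` and `seed` parameters, and the list-of-rows data layout are assumptions made for illustration. For each replica, the values of every feature are permuted independently among the patients of the same class, and the replica is appended to the original rows.

```python
import random
from collections import defaultdict


def shuffle_enhance(rows, labels, copies=1, seed=0):
    """Return the original rows plus `copies` intraclass-shuffled replicas.

    rows   -- list of patients, each a list of feature values (hypothetical layout)
    labels -- class label of each patient, same length as rows
    """
    rng = random.Random(seed)

    # Group patient indices by class so that values never cross class borders.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    out_rows = [list(r) for r in rows]
    out_labels = list(labels)
    n_features = len(rows[0])

    for _ in range(copies):
        replica = [list(r) for r in rows]
        for idxs in by_class.values():
            # Each feature is shuffled independently of the other features.
            for f in range(n_features):
                values = [rows[i][f] for i in idxs]
                rng.shuffle(values)
                for i, v in zip(idxs, values):
                    replica[i][f] = v
        out_rows.extend(replica)
        out_labels.extend(labels)

    return out_rows, out_labels


# Four patients, three features, two classes (toy values, as in Figure 1A).
rows = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
labels = ["a", "a", "b", "b"]
big_rows, big_labels = shuffle_enhance(rows, labels, copies=1, seed=42)
```

With `copies=1` the cardinality doubles, matching the 2× case of Figure 2C. Within each class, every feature of the replica contains exactly the values of the original, only reassigned among patients.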
Machine learning with 10-fold cross-validation to calculate the classification accuracy (%).
| Fine tree | 86.2 | 100 | 100 | 100 |
| Fine KNN | 93.2 | 100 | 100 | 100 |
| Weighted KNN | 86.0 | 100 | 100 | 100 |
| Linear SVM | 75.3 | 100 | 100 | 100 |
The names of the classification methods (fine tree, fine KNN, weighted KNN, linear SVM) refer to the preset tools inside the “Model Type” section of the MATLAB® Classification Learner application (all default settings were unchanged).
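As a rough illustration of how a 10-fold cross-validated accuracy (%) like the one in the table above is obtained, the following Python sketch splits the data into folds, trains on nine of them, and scores the held-out one. It does not reproduce the MATLAB® Classification Learner presets: the 1-nearest-neighbor classifier and the names `k_fold_indices`, `predict_1nn`, and `cv_accuracy` are stand-ins chosen for illustration.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def predict_1nn(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def cv_accuracy(X, y, k=10, seed=0):
    """Percentage of samples classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct += sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in fold)
    return 100.0 * correct / len(X)


# Toy data: two well-separated classes of ten samples each.
X = [[0.01 * i, 0.0] for i in range(10)] + [[10 + 0.01 * i, 10.0] for i in range(10)]
y = [0] * 10 + [1] * 10
accuracy = cv_accuracy(X, y, k=10, seed=0)
```

Every sample is used for testing exactly once, so the reported accuracy is an average over all samples rather than over a single train/test split.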
Regression with 10-fold cross-validation, endpoint = composite, to calculate the regression RMSE (root mean square error).
| Fine tree | 0.093 | 0 | 0 | 0 |
| Linear | 2.7 × 10⁻¹⁶ | 3.2 × 10⁻¹⁶ | 1.7 × 10⁻¹⁵ | 2.5 × 10⁻¹⁶ |
| Linear SVM | 0.108 | 0.066 | 0.065 | 0.065 |
The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the “Model Type” section of the MATLAB® Regression Learner application (all default settings were unchanged).
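For reference, the RMSE reported in the regression tables is conventionally defined as below. The symbols are assumptions, since the source does not write out the formula: $n$ is the number of held-out samples, $y_i$ the observed endpoint value, and $\hat{y}_i$ the value predicted by the model.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

An RMSE of 0 therefore means that every prediction matches the observed value exactly.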
Regression with 10-fold cross-validation, endpoint = all-cause hospitalization, to calculate the regression RMSE (root mean square error).
| Fine tree | 0.003 | 0 | 0 | 0 |
| Linear | 1.9 × 10⁻¹⁶ | 2.5 × 10⁻¹⁶ | 6.8 × 10⁻¹⁶ | 5.5 × 10⁻¹⁶ |
| Linear SVM | 0.146 | 0.065 | 0.065 | 0.065 |
The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the “Model Type” section of the MATLAB® Regression Learner application (all default settings were unchanged).