Erica Tavazzi¹, Sebastian Daberdaku¹, Rosario Vasta², Andrea Calvo², Adriano Chiò², Barbara Di Camillo³.
Abstract
BACKGROUND: Clinical registers constitute an invaluable resource for data-driven medical decision making. Accurate machine learning and data mining approaches applied to these data can lead to faster diagnosis, tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data. In this work, we propose an imputation algorithm based on a mutual information-weighted k-nearest neighbours approach, able to handle the simultaneous presence of missing information in different types of variables. We developed and validated the method on a clinical register comprising the information collected over subsequent screening visits of a cohort of patients affected by amyotrophic lateral sclerosis.
Keywords: Amyotrophic lateral sclerosis; Clinical datasets; Imputation; K-nearest neighbours; Missing data; Mutual information; Naïve Bayes
Year: 2020 PMID: 32819346 PMCID: PMC7439551 DOI: 10.1186/s12911-020-01166-2
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
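The abstract describes a mutual information-weighted k-nearest neighbours imputation. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes all features are categorical, that missing values are encoded as `None`, and that neighbours vote with weight `1 / (1 + distance)`. The function names `mutual_information` and `impute_value` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences,
    using only the positions where both values are observed (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def impute_value(rows, i, target, k=3):
    """Impute rows[i][target] with an MI-weighted k-NN vote.
    Assumption of this sketch: every feature is categorical."""
    target_col = [r[target] for r in rows]
    # Each predictor is weighted by its mutual information with the target.
    weights = {
        f: mutual_information([r[f] for r in rows], target_col)
        for f in range(len(rows[0]))
        if f != target
    }
    scored = []
    for j, cand in enumerate(rows):
        if j == i or cand[target] is None:
            continue
        # Compare only the features observed in both rows.
        shared = [f for f in weights
                  if rows[i][f] is not None and cand[f] is not None]
        total_w = sum(weights[f] for f in shared)
        if total_w == 0:
            continue
        # Weighted fraction of mismatching features, in [0, 1].
        dist = sum(weights[f] * (rows[i][f] != cand[f]) for f in shared) / total_w
        scored.append((dist, j))
    scored.sort()
    votes = Counter()
    for dist, j in scored[:k]:
        votes[rows[j][target]] += 1.0 / (1.0 + dist)  # closer rows count more
    return votes.most_common(1)[0][0] if votes else None


rows = [
    ["a", "x", "yes"],
    ["a", "y", "yes"],
    ["b", "x", "no"],
    ["b", "y", "no"],
    ["b", "y", None],  # value to impute
]
imputed = impute_value(rows, 4, 2, k=3)
```

In this toy data the first feature fully determines the target while the second carries no information, so the second receives zero weight and does not influence the neighbour search. Continuous features would additionally need a discretisation step for the mutual information estimate and a weighted mean instead of a vote; both are omitted here.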
Dataset. For each feature, the type, either static (S) or dynamic (D), is given. For the continuous and ordinal features, the percentage of native missing values (% NA) and the 25th, 50th and 75th percentiles (reported under IQR) are given; for the categorical features, the levels and the corresponding percentages of instances are given; for the NIV and PEG variables, we report the total number of patients who were administered these interventions.
| Continuous features | | | |
|---|---|---|---|
| Feature | Type | % NA | IQR |
| BMI premorbid [kg/m²] | S | 2.08 | 23/25/28 |
| BMI diagnosis [kg/m²] | S | 0.91 | 22/24/27 |
| FVC diagnosis [%] | S | 4.12 | 83/98/108 |
| age at onset [years] | S | 0 | 56/64/70 |
| diagnostic delay [months] | S | 0 | 5/9/14 |
| onset delta [months] | S | 0 | -18/-11/-6 |

| Ordinal features | | | |
|---|---|---|---|
| Feature | Type | % NA | IQR |
| ALSFRS-R 1 | D | 0 | 2/3/4 |
| ALSFRS-R 2 | D | 0 | 3/4/4 |
| ALSFRS-R 3 | D | 0 | 2/3/4 |
| ALSFRS-R 4 | D | 0 | 2/3/4 |
| ALSFRS-R 5 | D | 0 | 1/2/3 |
| ALSFRS-R 6 | D | 0 | 1/2/3 |
| ALSFRS-R 7 | D | 0 | 1/3/3 |
| ALSFRS-R 8 | D | 0 | 2/2/3 |
| ALSFRS-R 9 | D | 0 | 0/1/3 |
| ALSFRS-R 10 | D | 0 | 3/4/4 |
| ALSFRS-R 11 | D | 0 | 3/4/4 |
| ALSFRS-R 12 | D | 0 | 4/4/4 |

| Categorical features | | | |
|---|---|---|---|
| Feature | Type | Levels | % |
| sex | S | Female | 47.6 |
| | | Male | 52.4 |
| | | NA | 0 |
| familiality | S | No | 91.4 |
| | | Yes | 8.1 |
| | | NA | 0.5 |
| genetics | S | C9orf72 | 7.1 |
| | | FUS | 0.3 |
| | | SOD1 | 1.4 |
| | | TARDBP | 1.6 |
| | | wild type | 83.6 |
| | | NA | 6.0 |
| FTD | S | No | 53.0 |
| | | Yes | 13.0 |
| | | NA | 34.0 |
| onset site | S | Bulbar | 34.4 |
| | | Limb | 65.6 |
| | | NA | 0 |
| NIV | D | No | 59.6 |
| | | Yes | 40.4 |
| | | NA | 0 |
| PEG | D | No | 31.9 |
| | | Yes | 25.0 |
| | | NA | 43.1 |
Fig. 1 Sample construction for imputation and survival classification. a Sample construction for each patient with missing data to be imputed. b Candidate sample construction procedure. In this example, subject i has n=4 visits in the first three months of screening (one in the first month, two in the second and one in the third), while candidate j has 3 visits in this interval (one visit per month). Since candidate j's second-month visit matches both of subject i's second-month visits, its dynamic feature values are repeated twice in the resulting feature vector (sample). c Survival classification sample construction for each patient
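The candidate sample construction in panel b can be illustrated with a short sketch. The caption only states that a candidate visit matching several subject visits has its dynamic values repeated; the matching rule used here (each subject visit is paired with the candidate visit nearest in time) is an assumption, as are the function name `build_candidate_sample` and the representation of visit times in months.

```python
def build_candidate_sample(subject_times, cand_times, cand_dynamic, cand_static):
    """Build a candidate feature vector aligned to the subject's visits.

    subject_times: visit times of the subject (months from first screening)
    cand_times:    visit times of the candidate
    cand_dynamic:  one list of dynamic feature values per candidate visit
    cand_static:   static feature values of the candidate
    """
    sample = list(cand_static)
    for t in subject_times:
        # Assumed rule: pair the subject visit with the candidate visit
        # that is closest in time.
        nearest = min(range(len(cand_times)),
                      key=lambda idx: abs(cand_times[idx] - t))
        # A candidate visit matched more than once is simply repeated.
        sample.extend(cand_dynamic[nearest])
    return sample


# Subject i: one visit in month 1, two in month 2, one in month 3.
subject_times = [0.5, 1.2, 1.8, 2.5]
# Candidate j: one visit per month, with a single dynamic feature per visit.
cand_times = [0.5, 1.5, 2.5]
cand_dynamic = [[4], [3], [2]]
sample = build_candidate_sample(subject_times, cand_times, cand_dynamic, ["M"])
```

The resulting vector has the same length as the subject's own sample, which is what allows a distance between the two to be computed feature by feature.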
Fig. 2 Algorithm workflow of the wk-NN MI imputation method
nRMSD scores for the continuous features in the training set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| BMI premorbid | 0.1012 | 0.0960 | 0.1323 | 0.1634 | 0.1286 | 0.1617 | |
| BMI diagnosis | 0.1560 | 0.1069 | 0.1476 | 0.1750 | 0.1457 | 0.1687 | |
| FVC diagnosis | 0.2466 | 0.2463 | 0.2534 | 0.1970 | 0.1876 | 0.1953 | |
| age at onset | 0.2355 | 0.2362 | 0.2393 | 0.1855 | 0.1748 | 0.1820 | |
| diagnostic delay | 0.1150 | 0.1218 | 0.1316 | 0.1484 | 0.1282 | 0.1495 | |
| onset delta | 0.1362 | 0.1362 | 0.1665 | 0.1848 | 0.1584 | 0.1778 | |
| Average | 0.1651 | 0.1572 | 0.1784 | 0.1757 | 0.1539 | 0.1725 | |
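For reference, a normalised root-mean-square deviation can be computed as sketched below. The normalisation constant is not given in this record; dividing by the feature's range is a common convention and is assumed here. The function name `nrmsd` and the example values are illustrative only.

```python
import math


def nrmsd(true_vals, imputed_vals, feature_range):
    """Root-mean-square deviation between true and imputed values,
    divided by the feature's range (assumed normalisation)."""
    if len(true_vals) != len(imputed_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    if feature_range <= 0:
        raise ValueError("feature_range must be positive")
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / len(true_vals)
    return math.sqrt(mse) / feature_range


# Three artificially removed BMI values and their imputations (made-up numbers).
score = nrmsd([20.0, 25.0, 30.0], [22.0, 25.0, 28.0], feature_range=10.0)
```

Lower values indicate imputations closer to the held-out true values, and the normalisation makes scores comparable across features measured on different scales.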
nRMSD scores for the ordinal features in the training set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| ALSFRS-R 1 | 0.1959 | 0.1540 | 0.1788 | 0.2454 | 0.1529 | 0.2390 | |
| ALSFRS-R 2 | 0.1644 | 0.1433 | 0.1684 | 0.1904 | 0.1394 | 0.1907 | |
| ALSFRS-R 3 | 0.1768 | 0.1387 | 0.1679 | 0.2175 | 0.1331 | 0.2130 | |
| ALSFRS-R 4 | 0.2173 | 0.1916 | 0.2145 | 0.2516 | 0.1606 | 0.2455 | |
| ALSFRS-R 5 | 0.2183 | 0.1863 | 0.2179 | 0.2812 | 0.1763 | 0.2727 | |
| ALSFRS-R 6 | 0.2064 | 0.2015 | 0.2113 | 0.2864 | 0.1849 | 0.2773 | |
| ALSFRS-R 7 | 0.1953 | 0.1696 | 0.1833 | 0.2645 | 0.1544 | 0.2550 | |
| ALSFRS-R 8 | 0.2021 | 0.1488 | 0.1651 | 0.2460 | 0.1470 | 0.2377 | |
| ALSFRS-R 9 | 0.2655 | 0.2405 | 0.2268 | 0.3744 | 0.2222 | 0.3657 | |
| ALSFRS-R 10 | 0.1060 | 0.1093 | 0.1565 | 0.2523 | 0.1668 | 0.2475 | |
| ALSFRS-R 11 | 0.0854 | 0.0982 | 0.1340 | 0.2446 | 0.1585 | 0.2403 | |
| ALSFRS-R 12 | 0.0682 | 0.0434 | 0.0485 | 0.0933 | 0.0637 | 0.0908 | |
| Average | 0.1751 | 0.1521 | 0.1728 | 0.2457 | 0.1550 | 0.2396 | |
PFC scores for the categorical features in the training set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| sex | 0.4859 | 0.4416 | 0.4463 | 0.5160 | 0.3974 | 0.4831 | |
| familiality | 0.1646 | 0.1268 | 0.1372 | 0.0842 | 0.0823 | 0.0842 | |
| genetics | 0.3310 | 0.1781 | 0.1751 | 0.0956 | 0.0895 | 0.0956 | |
| FTD | 0.3295 | 0.2642 | 0.3565 | 0.2060 | 0.2003 | 0.1960 | |
| onset site | 0.2957 | 0.1516 | 0.1403 | 0.3672 | 0.1017 | 0.3484 | |
| NIV | 0.1111 | 0.0556 | 0.0537 | 0.0518 | 0.0480 | 0.0518 | |
| PEG | 0.0948 | 0.0150 | 0.0208 | ||||
| Average | 0.2589 | 0.1761 | 0.1900 | 0.1897 | 0.1323 | 0.1809 | |
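The proportion of falsely classified (PFC) values used for the categorical features is the share of imputed entries that differ from the held-out true category. A minimal sketch, with a hypothetical function name `pfc` and made-up example values:

```python
def pfc(true_vals, imputed_vals):
    """Proportion of imputed categorical values that differ from the truth."""
    if len(true_vals) != len(imputed_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(true_vals, imputed_vals))
    return wrong / len(true_vals)


# One of four imputed onset-site values is wrong.
score = pfc(["Bulbar", "Limb", "Limb", "Bulbar"],
            ["Bulbar", "Limb", "Bulbar", "Bulbar"])
```

A PFC of 0 means every imputed category was correct, while a value near 0.5 on a balanced binary feature such as sex is no better than guessing.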
Fig. 3 Normalised absolute error distributions obtained with MICE and wk-NN MI (with k=20) on the continuous features of the training set
Fig. 4 Normalised absolute error distributions obtained with MICE and wk-NN MI (with k=20) on the ordinal features of the training set
Fig. 5 Proportion of falsely classified (PFC) values obtained with MICE and wk-NN MI (with k=20) on the categorical features of the training set
nRMSD scores for the continuous features in the test set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| BMI premorbid | 0.1302 | 0.1353 | 0.1787 | 0.2047 | 0.1692 | 0.2034 | |
| BMI diagnosis | 0.1459 | 0.1227 | 0.1653 | 0.1968 | 0.1665 | 0.2033 | |
| FVC diagnosis | 0.2481 | 0.2401 | 0.2584 | 0.2036 | 0.1821 | 0.1980 | |
| age at onset | 0.2799 | 0.2650 | 0.2781 | 0.2024 | 0.1847 | 0.2061 | |
| diagnostic delay | 0.1286 | 0.1228 | 0.1350 | 0.1481 | 0.1209 | 0.1422 | |
| onset delta | 0.1489 | 0.1529 | 0.1910 | 0.1785 | 0.1512 | 0.1686 | |
| Average | 0.1803 | 0.1731 | 0.2011 | 0.1890 | 0.1624 | 0.1869 | |
nRMSD scores for the ordinal features in the test set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| ALSFRS-R 1 | 0.3148 | 0.1852 | 0.1852 | 0.2467 | 0.1609 | 0.2528 | |
| ALSFRS-R 2 | 0.2680 | 0.1852 | 0.2122 | 0.2197 | 0.1527 | 0.2049 | |
| ALSFRS-R 3 | 0.2663 | 0.1673 | 0.1504 | 0.2443 | 0.1504 | 0.2265 | |
| ALSFRS-R 4 | 0.2832 | 0.1913 | 0.1852 | 0.2770 | 0.1813 | 0.2762 | |
| ALSFRS-R 5 | 0.3012 | 0.1741 | 0.2060 | 0.3039 | 0.1714 | 0.2873 | |
| ALSFRS-R 6 | 0.3035 | 0.1768 | 0.1973 | 0.3141 | 0.1701 | 0.2996 | |
| ALSFRS-R 7 | 0.2873 | 0.1687 | 0.1800 | 0.2762 | 0.1550 | 0.2787 | |
| ALSFRS-R 8 | 0.2910 | 0.1550 | 0.1550 | 0.2645 | 0.1519 | 0.2514 | |
| ALSFRS-R 9 | 0.3189 | 0.2192 | 0.2774 | 0.3709 | 0.2491 | 0.3549 | |
| ALSFRS-R 10 | 0.1845 | 0.0903 | 0.1481 | 0.2410 | 0.1416 | 0.2462 | |
| ALSFRS-R 11 | 0.1938 | 0.0941 | 0.1408 | 0.2316 | 0.1340 | 0.2415 | |
| ALSFRS-R 12 | 0.1728 | 0.0506 | 0.1013 | 0.0551 | 0.0990 | 0.0529 | |
| Average | 0.2654 | 0.1542 | 0.1740 | 0.2576 | 0.1561 | 0.2516 | |
PFC scores for the categorical features in the test set. The best performances are highlighted in bold
| Features | Imputation methods | ||||||
|---|---|---|---|---|---|---|---|
| Amelia II | MICE | missForest | k-RN | wk-NN | k-RN | wk-NN MI | |
| sex | 0.4440 | 0.4813 | 0.4366 | 0.5560 | 0.4366 | 0.4440 | |
| familiality | 0.2724 | 0.0970 | 0.1381 | 0.0821 | |||
| genetics | 0.3166 | 0.2124 | 0.1776 | 0.1506 | 0.1506 | 0.1506 | |
| FTD | 0.4749 | 0.3575 | 0.3911 | 0.2626 | 0.2235 | 0.2346 | |
| onset site | 0.2910 | 0.1418 | 0.1343 | 0.4552 | 0.0896 | 0.4664 | |
| NIV | 0.0485 | 0.0299 | 0.0634 | 0.0410 | 0.0149 | 0.0410 | |
| PEG | 0.0101 | 0.0352 | |||||
| Average | 0.2646 | 0.1900 | 0.1966 | 0.2122 | 0.1456 | 0.1986 | |
Fig. 6 Normalised absolute error distributions obtained with MICE and wk-NN MI (with k=20) on the continuous features of the test set
Fig. 7 Normalised absolute error distributions obtained with MICE and wk-NN MI (with k=20) on the ordinal features of the test set
Fig. 8 Proportion of falsely classified (PFC) values obtained with MICE and wk-NN MI (with k=20) on the categorical features of the test set
Fig. 9 Precision-Recall and ROC plots of the naïve Bayes classifiers. The plots show that the imputation of the training set with the proposed method improves the classification performance of a naïve Bayes classifier