Yi-Hui Zhou, Ehsan Saghapour.
Abstract
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mostly require scripting skills and are implemented across various packages with differing syntax. Thus, the implementation of a full suite of methods is generally out of reach for all but experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is built in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementation of machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered with EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
Keywords: decision trees; electronic health records; gradient boosting; imputation; prediction
Year: 2021 PMID: 34276792 PMCID: PMC8283820 DOI: 10.3389/fgene.2021.691274
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
The Boston data contain information for predicting house prices; the spam data contain attributes for determining whether e-mails are spam; the letter data contain character image features for identifying letters of the alphabet; and the breast cancer data contain numerical features of cell images for tumor diagnosis.
| Dataset | Samples | Features | Feature type |
| --- | --- | --- | --- |
| Boston | 506 | 13 | Both |
| Spam | 4,601 | 57 | Continuous |
| Letter | 20,000 | 16 | Categorical |
| Breast cancer | 569 | 30 | Continuous |
Pseudocode of the ImputeEHR algorithm.
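| Require: $X$ is an $n \times p$ matrix; stopping criterion $\gamma$ |
| 1. Make initial guess using mean or median imputation for missing values; |
| 2. $k \leftarrow$ vector of sorted indices of columns in $X$ w.r.t. increasing amount of missing values; |
| 3. while not $\gamma$ do |
| 4. $X^{imp}_{old} \leftarrow$ store previously imputed matrix; |
| 5. for $s$ in $k$ do |
| 6. Fit a LightGBM or XGBoost model: $y^{(s)}_{obs} \sim x^{(s)}_{obs}$; |
| 7. Predict $y^{(s)}_{mis}$ using $x^{(s)}_{mis}$; |
| 8. $X^{imp}_{new} \leftarrow$ update imputed matrix using predicted $y^{(s)}_{mis}$; |
| 9. end for |
| 10. Update $\gamma$; |
| 11. end while |
| 12. return the imputed matrix $X^{imp}$ |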
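The pseudocode above starts from a mean or median guess, visits columns in order of increasing missingness, fits a LightGBM or XGBoost model per column, and updates a stopping criterion γ. A minimal Python sketch of that kind of loop follows. It is an illustration under stated assumptions, not the authors' implementation: the names `iterative_impute` and `least_squares_fit_predict` are hypothetical, an ordinary least-squares fit stands in for the gradient-boosted learner so the block needs only NumPy, and the stopping rule (stop when the relative change between successive imputed matrices stops decreasing) is borrowed from MissForest rather than taken from the source.

```python
import numpy as np


def least_squares_fit_predict(X_train, y_train, X_test):
    """Stand-in learner (assumption): ordinary least squares with an intercept.

    The source fits LightGBM or XGBoost here; any callable with this
    (X_train, y_train, X_test) -> predictions signature can be swapped in.
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def iterative_impute(X, fit_predict=least_squares_fit_predict, max_iter=10):
    """Iterative column-wise imputation of a numeric matrix (NaN = missing)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_imp = X.copy()

    # Step 1: initial guess, here the mean of the observed values per column.
    col_means = np.nanmean(X, axis=0)
    X_imp[missing] = np.take(col_means, np.nonzero(missing)[1])

    # Step 2: columns with missing values, sorted by increasing missing count.
    n_missing = missing.sum(axis=0)
    order = [s for s in np.argsort(n_missing, kind="stable") if n_missing[s] > 0]

    prev_change = np.inf
    for _ in range(max_iter):
        X_old = X_imp.copy()  # keep the previously imputed matrix
        for s in order:
            obs, mis = ~missing[:, s], missing[:, s]
            others = np.delete(X_imp, s, axis=1)  # all columns except s
            # Fit on rows where column s is observed, predict the missing rows.
            X_imp[mis, s] = fit_predict(others[obs], X_imp[obs, s], others[mis])
        # Assumed stopping rule (MissForest-style): relative squared change.
        change = np.sum((X_imp - X_old) ** 2) / np.sum(X_imp ** 2)
        if change >= prev_change:
            X_imp = X_old  # the last update did not improve; roll back
            break
        prev_change = change
    return X_imp
```

To get closer to the setting described in the pseudocode, `fit_predict` could wrap a gradient-boosted regressor such as `lightgbm.LGBMRegressor` or `xgboost.XGBRegressor`, both of which expose scikit-learn style `fit` and `predict` methods. Categorical columns would need a classifier and a different change measure, which this sketch does not cover.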
Figure 1. Running time of ImputeEHR1 (blue), MissForest (orange), and ImputeEHR2 (gray) for each dataset.
Figure 2. Our pipeline for MIMIC-III data imputation and prediction.
Figure 3. (Left) Receiver operating characteristic (ROC) curve comparison between our pipeline and the method of Sharafoddini et al. (2019) for mortality prediction on the MIMIC-III data. (Right) Precision-recall curve comparison.
Figure 4. Illustration of the web app for visualization.
Figure 5. Visualization of patterns in the imputed dataset. The user can choose the number of clusters and the dimension reduction method.
Figure 6. Visualization of the important features selected by the four methods.
Figure 7. Pipeline of the predictive model.