Runmin Wei, Jingye Wang, Mingming Su, Erik Jia, Shaoqiu Chen, Tianlu Chen, Yan Ni.
Abstract
Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied to handle them, but the choice of method can significantly affect subsequent data analyses. Missing values typically fall into three types: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis was used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed best for MCAR/MAR and that QRILC was the favored method for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for applying missing value imputation in metabolomics (https://metabolomics.cc.hawaii.edu/software/MetImp/).
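As an illustration of how imputation accuracy can be scored, the following minimal sketch applies two of the simple methods named in the abstract (half minimum and mean) to a single variable and computes an NRMSE. It is not the authors' implementation. The function names, the toy values, and the exact NRMSE form used here (square root of the mean squared error divided by the variance of the true values, taken over the whole column) are assumptions made for the example.

```python
import math
from statistics import mean, pvariance


def half_minimum_impute(column):
    """Replace missing entries (None) with half of the smallest observed value."""
    observed = [v for v in column if v is not None]
    fill = min(observed) / 2.0
    return [fill if v is None else v for v in column]


def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def nrmse(true_values, imputed_values):
    """Assumed NRMSE form: sqrt(mean squared error / variance of the true values)."""
    mse = mean((t - i) ** 2 for t, i in zip(true_values, imputed_values))
    return math.sqrt(mse / pvariance(true_values))


# Toy, made-up intensities for one metabolite; the lowest value is removed
# to mimic a left-censored (MNAR-like) missing value.
complete = [1.0, 4.0, 6.0, 9.0]
with_missing = [None, 4.0, 6.0, 9.0]

hm_error = nrmse(complete, half_minimum_impute(with_missing))
mean_error = nrmse(complete, mean_impute(with_missing))
print(f"HM NRMSE:   {hm_error:.3f}")
print(f"mean NRMSE: {mean_error:.3f}")
```

In this toy case the missing value sits below every observed value, so the half-minimum fill lands closer to the truth than the mean fill. That reflects only how the example was built and says nothing about the study's results.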
Year: 2018 PMID: 29330539 PMCID: PMC5766532 DOI: 10.1038/s41598-017-19120-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Evaluation of different imputation methods for MCAR/MAR. (a,b) NRMSE on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
Figure 2. Evaluation of different imputation methods for MNAR. (a,b) SOR on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of missing variables from complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
Figure 3. Comparisons between QRILC and HM for MNAR. (a-c) PCA score plots for complete data, QRILC-imputed data, and HM-imputed data on the top 2 PCs. (d–k) Violin plots of eight missing variables.
Figure 4. An imputation strategy for metabolomics studies.
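The NRMSE-based sum of ranks (SOR) mentioned in the abstract and in Figure 2 can be sketched as follows. This is one plausible reading and not the authors' code: it assumes that the methods are ranked by NRMSE within each missing variable (rank 1 for the lowest error) and that each method's ranks are then summed across variables. The function name, the placeholder method labels, and the numbers are hypothetical.

```python
def sum_of_ranks(nrmse_by_method):
    """Sum per-variable NRMSE ranks for each method (lower total is better).

    nrmse_by_method maps a method name to a list of NRMSE values, one per
    variable, with the variables in the same order for every method.
    Ties are broken by dictionary order in this simplified sketch.
    """
    methods = list(nrmse_by_method)
    n_variables = len(nrmse_by_method[methods[0]])
    totals = {m: 0 for m in methods}
    for j in range(n_variables):
        # Rank methods on variable j: smallest NRMSE gets rank 1.
        ordered = sorted(methods, key=lambda m: nrmse_by_method[m][j])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return totals


# Made-up NRMSE values for three placeholder methods on three variables.
toy_scores = {
    "method_a": [0.2, 0.5, 0.1],
    "method_b": [0.4, 0.3, 0.6],
    "method_c": [0.9, 0.8, 0.7],
}
sor = sum_of_ranks(toy_scores)
print(sor)
```

Ranking within each variable before summing keeps a single variable with a large NRMSE from dominating the comparison, which is one reason a rank-based summary can be preferable to pooling raw errors.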