Nishith Kumar, Md Aminul Hoque, Masahiro Sugimoto.
Abstract
Mass spectrometry is a sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a large, high-dimensional matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which arise from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, the conventional techniques address only the missing value problem and do not handle outliers, so outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the proposed method against conventional and recently developed missing value imputation techniques using both artificially generated data and experimentally measured data, in the absence and presence of different rates of outliers. Results on both the artificial data and the real metabolomics data indicate that the proposed kernel weight-based technique outperforms the existing alternatives. For user convenience, an R package implementing the proposed technique is available at https://github.com/NishithPaul/tWLSA.
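The abstract does not spell out the kernel weight function, so the following is only a generic illustration of why kernel weighting helps when outliers are present. It is not the tWLSA algorithm. The Gaussian kernel, the median/MAD scaling, and the function names `gaussian_kernel_weights` and `kernel_weighted_mean` are all assumptions made for this sketch.

```python
import numpy as np

def gaussian_kernel_weights(x):
    """Hypothetical weights: close to 1 near the median, close to 0 for extreme values."""
    x = np.asarray(x, dtype=float)
    center = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - center))
    # 1.4826 makes the MAD comparable to a standard deviation for normal data
    scale = 1.4826 * mad if mad > 0 else 1.0
    z = (x - center) / scale
    return np.exp(-0.5 * z ** 2)

def kernel_weighted_mean(x):
    """Weighted mean of the observed (non-NaN) values using the kernel weights."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    w = gaussian_kernel_weights(x)[observed]
    return float(np.sum(w * x[observed]) / np.sum(w))

# One metabolite with a missing cell (NaN) and one gross outlier (50.0)
values = np.array([9.8, 10.1, 10.0, np.nan, 9.9, 10.2, 50.0])
plain = float(np.nanmean(values))       # pulled towards the outlier
robust = kernel_weighted_mean(values)   # stays near the bulk of the data
print(f"plain mean = {plain:.2f}, kernel-weighted mean = {robust:.2f}")
```

If the missing cell were filled with the plain mean, the outlier would inflate the imputed value. The kernel-weighted estimate gives the outlier a weight of essentially zero and stays near 10.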
Year: 2021 PMID: 34045614 PMCID: PMC8159923 DOI: 10.1038/s41598-021-90654-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Performance of different missing value imputation techniques, measured by average MSE, for data without class labels.
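An MSE-based comparison of this kind is commonly computed by hiding known cells, imputing them, and scoring the imputed values against the hidden truth. The sketch below assumes that protocol and uses column-mean imputation as a stand-in imputer. The helper names, the simulated data, and the 5% masking rate are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of the cells set to NaN, plus the mask."""
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def impute_column_mean(X_missing):
    """Baseline imputer: replace each NaN with the mean of its column (metabolite)."""
    col_means = np.nanmean(X_missing, axis=0)
    X_imputed = X_missing.copy()
    rows, cols = np.where(np.isnan(X_missing))
    X_imputed[rows, cols] = col_means[cols]
    return X_imputed

def imputation_mse(X_true, X_imputed, mask):
    """Mean squared error computed only over the cells that were masked."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=1.0, size=(40, 25))   # samples x metabolites
X_missing, mask = mask_at_random(X, rate=0.05, rng=rng)
X_imputed = impute_column_mean(X_missing)
mse = imputation_mse(X, X_imputed, mask)
print(f"masked cells: {int(mask.sum())}, MSE: {mse:.3f}")
```

Repeating the masking step with different seeds and averaging the resulting values would give an average MSE for each imputation method.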
Figure 2. Performance of different missing value imputation techniques, shown as receiver operating characteristic curves of DE calculation for the two-class dataset with 5% missing values in the absence and presence of outliers.
Average misclassification error rate (MER) and area under the receiver operating characteristic curve (AUC) of DE calculation for two-class simulated data with 5% missing values and different rates of outliers.
| Methods | Without outliers MER (AUC) | 3% outliers MER (AUC) | 5% outliers MER (AUC) | 7% outliers MER (AUC) | 10% outliers MER (AUC) |
|---|---|---|---|---|---|
| RF | 4.23 (0.956) | 19.76 (0.797) | 27.41 (0.720) | 31.39 (0.679) | 35.52 (0.638) |
| PPCA | 3.77 (0.964) | 18.59 (0.811) | 26.03 (0.737) | 30.06 (0.698) | 34.48 (0.652) |
| kNN | 4.83 (0.952) | 20.41 (0.794) | 27.85 (0.719) | 31.83 (0.678) | 35.93 (0.637) |
| BPCA | 3.45 (0.967) | 21.31 (0.788) | 28.65 (0.703) | 32.34 (0.675) | 36.45 (0.641) |
| EM-PCA | 3.38 (0.969) | 18.76 (0.829) | 26.08 (0.733) | 30.09 (0.707) | 36.03 (0.649) |
| Zero | 16.57 (0.829) | 28.88 (0.709) | 34.15 (0.651) | 37.13 (0.623) | 39.95 (0.598) |
| Mean | 5.23 (0.949) | 21.29 (0.789) | 28.62 (0.719) | 32.37 (0.672) | 36.41 (0.642) |
| Median | 5.12 (0.951) | 20.99 (0.791) | 28.36 (0.706) | 32.18 (0.676) | 36.25 (0.643) |
| Minimum | 12.44 (0.867) | 26.45 (0.728) | 32.26 (0.673) | 35.52 (0.640) | 38.75 (0.604) |
| rmiMAE | 2.94 (0.971) | 4.21 (0.958) | 4.77 (0.951) | 4.98 (0.948) | 5.13 (0.964) |
| Proposed | | | | | |
Bold indicates the lowest MER and the highest AUC in each column.
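The two metrics reported in these tables can be computed as in the following minimal sketch. It assumes that MER is expressed as a percentage of misclassified items and that AUC is the usual rank-based (Mann-Whitney) estimate. The function names and the toy labels and scores are hypothetical.

```python
import numpy as np

def misclassification_error_rate(y_true, y_pred):
    """Percentage of items whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * float(np.mean(y_true != y_pred))

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    # Compare every positive score with every negative score; ties count as half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (scores >= 0.5).astype(int)     # hypothetical decision threshold

mer = misclassification_error_rate(y_true, y_pred)
auc_value = auc(y_true, scores)
print(mer, auc_value)                    # prints 25.0 0.75
```

A lower MER and a higher AUC both indicate that the imputed data preserve the class structure better.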
Figure 3. Performance of different missing value imputation techniques, shown as receiver operating characteristic curves of sample classification for the two-class dataset with 5% missing values in the presence of outliers.
Average misclassification error rate (MER) and area under the receiver operating characteristic curve (AUC) for two-class simulated data with 5% missing values and different rates of outliers.
| Methods | 3% Outliers MER (AUC) | 5% Outliers MER (AUC) | 7% Outliers MER (AUC) | 10% Outliers MER (AUC) |
|---|---|---|---|---|
| RF | 4.10 (0.9516) | 8.67 (0.9143) | 16.17 (0.8395) | 17 (0.8332) |
| PPCA | 5.67 (0.9408) | 7.37 (0.9276) | 15.53 (0.8454) | 19.37 (0.8048) |
| kNN | 5.53 (0.9412) | 9.27 (0.9054) | 16.57 (0.83885) | 19.6 (0.8042) |
| BPCA | 5.77 (0.9391) | 8.50 (0.9149) | 16.27 (0.84025) | 21.2 (0.7882) |
| EM-PCA | 5.03 (0.9460) | 7.43 (0.9270) | 15.33 (0.8475) | 15.77 (0.8385) |
| Zero | 7.73 (0.9224) | 12.70 (0.8759) | 19 (0.8068) | 18.9 (0.8114) |
| Mean | 5.93 (0.9371) | 9.30 (0.9059) | 16.5 (0.8358) | 21.37 (0.7858) |
| Median | 6.17 (0.9353) | 9.67 (0.9021) | 16.43 (0.8366) | 19.87 (0.8021) |
| Minimum | 8.67 (0.9088) | 13.70 (0.8675) | 18.87 (0.8088) | 19.27 (0.8073) |
| rmiMAE | 1.69 (0.9831) | 1.81 (0.9819) | 2.36 (0.9764) | 2.82 (0.9718) |
| Proposed | | | | |
Bold indicates the lowest MER and the highest AUC in each column.
Average misclassification error rate (MER) and area under the receiver operating characteristic curve (AUC) for three-class simulated data with 5% missing values and different rates of outliers.
| Methods | 3% Outliers MER (AUC) | 5% Outliers MER (AUC) | 7% Outliers MER (AUC) | 10% Outliers MER (AUC) |
|---|---|---|---|---|
| RF | 5.60 (0.9661) | 10.60 (0.9122) | 12.97 (0.8631) | 21.63 (0.7687) |
| PPCA | 4.33 (0.9718) | 7.73 (0.9375) | 11.03 (0.8797) | 21.47 (0.7734) |
| kNN | 3.67 (0.9783) | 9.40 (0.9251) | 13.80 (0.8634) | 18.60 (0.7785) |
| BPCA | 4.00 (0.9735) | 10.17 (0.9131) | 13.37 (0.8688) | 21.77 (0.7562) |
| EM-PCA | 5.20 (0.9645) | 8.93 (0.9302) | 11.07 (0.8799) | 21.67 (0.7674) |
| Zero | 4.67 (0.9696) | 8.67 (0.9172) | 15.80 (0.8546) | 15.20 (0.8054) |
| Mean | 4.13 (0.9724) | 10.97 (0.8995) | 13.93 (0.8620) | 18.23 (0.7924) |
| Median | 4.20 (0.9721) | 10.03 (0.9152) | 13.50 (0.8673) | 17.93 (0.7948) |
| Minimum | 5.23 (0.9626) | 8.30 (0.9193) | 12.13 (0.8913) | 14.33 (0.8214) |
| rmiMAE | 1.82 (0.9816) | 2.21 (0.9783) | 2.57 (0.9721) | 3.29 (0.9672) |
| Proposed | | | | |
Bold indicates the lowest MER and the highest AUC in each column.
Figure 4. Performance of different missing value imputation techniques, measured by MSE at different rates of missing values, for (a) the Human Cachexia dataset and (b) the treated dataset.
Figure 5. Procedure for calculating performance measures on the real datasets on the basis of sample classification.
Average misclassification error rate (MER) and area under the receiver operating characteristic curve (AUC) of sample classification for the two-class real dataset (hepatocellular carcinoma) with 26.52% missing values and artificially introduced outliers at different rates.
| Methods | Without outliers MER (AUC) | 3% outliers MER (AUC) | 5% outliers MER (AUC) | 7% outliers MER (AUC) | 10% outliers MER (AUC) |
|---|---|---|---|---|---|
| RF | 10.67 (0.8903) | 13.61 (0.8642) | 20.16 (0.8001) | 21.18 (0.7883) | 22.61 (0.7751) |
| PPCA | 15.22 (0.8495) | 16.45 (0.8365) | 26.44 (0.7364) | 26.56 (0.7355) | 26.67 (0.7323) |
| kNN | 13.53 (0.8795) | 13.94 (0.8676) | 25.56 (0.7484) | 25.97 (0.7402) | 26.33 (0.7375) |
| BPCA | 14.61 (0.8571) | 16.28 (0.8342) | 21.67 (0.7882) | 23.38 (0.7679) | 25.45 (0.7452) |
| EM-PCA | 11.44 (0.8894) | 11.58 (0.8813) | 23.72 (0.7665) | 23.70 (0.7648) | 23.69 (0.7618) |
| Zero | 11.55 (0.8874) | 12.39 (0.8747) | 15.58 (0.8433) | 17.51 (0.8246) | 19.44 (0.8026) |
| Mean | 10.56 (0.8914) | 13.61 (0.8644) | 20.73 (0.7947) | 22.69 (0.7723) | 24.89 (0.7502) |
| Median | 9.94 (0.9068) | 12.38 (0.8783) | 17.61 (0.8276) | 20.17 (0.7975) | 23.54 (0.7637) |
| Minimum | 10.56 (0.8937) | 11.24 (0.8894) | 13.57 (0.8646) | 15.67 (0.8427) | 17.44 (0.8265) |
| rmiMAE | 0.78 (0.9922) | 1.84 (0.9815) | 2.36 (0.9773) | 2.97 (0.9701) | 3.65 (0.9644) |
| Proposed | 0.00 (1.00) | | | | |
Bold indicates the lowest MER and the highest AUC in each column.
Average misclassification error rate (MER) and area under the receiver operating characteristic curve (AUC) of sample classification for the three-class real dataset (MDA-MB-231) with 15.81% missing values and artificially introduced outliers at different rates.
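This record does not say how the outliers were introduced into the real data. One plausible way to contaminate a data matrix at the 3%, 5%, 7%, and 10% rates is sketched below. The shift of ten column standard deviations, the random sign, and the name `add_outliers` are assumptions made for illustration.

```python
import numpy as np

def add_outliers(X, rate, rng, shift=10.0):
    """Return a copy of X in which a fraction `rate` of cells is shifted far from the bulk.

    Hypothetical mechanism: each selected cell is moved by `shift` column
    standard deviations in a random direction. Also returns the flat indices
    of the contaminated cells.
    """
    X_out = np.array(X, dtype=float, copy=True)
    n_out = int(round(rate * X_out.size))
    # Sample distinct cells so that each one is contaminated at most once
    flat_idx = rng.choice(X_out.size, size=n_out, replace=False)
    rows, cols = np.unravel_index(flat_idx, X_out.shape)
    col_sd = np.nanstd(X_out, axis=0)
    signs = rng.choice([-1.0, 1.0], size=n_out)
    X_out[rows, cols] += signs * shift * col_sd[cols]
    return X_out, flat_idx

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=1.0, size=(50, 20))   # samples x metabolites

counts = {}
for rate in (0.03, 0.05, 0.07, 0.10):
    X_contaminated, idx = add_outliers(X, rate, rng)
    counts[rate] = int((X_contaminated != X).sum())
print(counts)
```

Each contaminated matrix would then go through imputation and classification to fill one outlier-rate column of such a table.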
| Methods | Without outliers MER (AUC) | 3% outliers MER (AUC) | 5% outliers MER (AUC) | 7% outliers MER (AUC) | 10% outliers MER (AUC) |
|---|---|---|---|---|---|
| RF | 4.23 (0.9635) | 21.57 (0.7996) | 23.53 (0.7771) | 31.96 (0.6951) | 34.70 (0.6777) |
| PPCA | 4.33 (0.9629) | 18.27 (0.8288) | 25.13 (0.7564) | 21.83 (0.7938) | 30.63 (0.7238) |
| kNN | 3.43 (0.9759) | 18.63 (0.8296) | 21.80 (0.7940) | 24.56 (0.7646) | 43.56 (0.6672) |
| BPCA | 8.07 (0.9267) | 18.77 (0.8207) | 22.06 (0.7836) | 26.43 (0.7475) | 33.70 (0.6858) |
| EM-PCA | 4.46 (0.9619) | 19.76 (0.8101) | 19.63 (0.8161) | 21.66 (0.7938) | 34.93 (0.6741) |
| Zero | 3.45 (0.9755) | 12.20 (0.8843) | 25.86 (0.7554) | 27.40 (0.7347) | 38.53 (0.6374) |
| Mean | 3.73 (0.9728) | 11.73 (0.8858) | 25.30 (0.7597) | 26.2 (0.7452) | 35.90 (0.6668) |
| Median | 3.67 (0.9732) | 11.36 (0.8861) | 22.73 (0.7874) | 23.16 (0.7753) | 34.90 (0.6748) |
| Minimum | 3.43 (0.9758) | 13.10 (0.8774) | 24.64 (0.7647) | 27.16 (0.7457) | 37.10 (0.6457) |
| rmiMAE | 1.45 (0.9854) | 2.47 (0.9757) | 3.04 (0.9753) | 3.56 (0.9668) | 4.01 (0.9611) |
| Proposed | | | | | |
Bold indicates the lowest MER and the highest AUC in each column.