Qian Li, Kate Fisher, Wenjun Meng, Bin Fang, Eric Welsh, Eric B Haura, John M Koomen, Steven A Eschrich, Brooke L Fridley, Y Ann Chen.
Abstract
MOTIVATION: Missingness is inherent to label-free mass spectrometry, so a computational approach to recovering missing values in metabolomics and proteomics datasets is important. Most existing methods are designed under one particular assumption: values are either missing at random or missing because they fall below the detection limit. If the missing pattern deviates from the assumed one, imputation may lead to biased results. Hence, we investigate the missing patterns in label-free mass spectrometry data and develop an omnibus approach, GMSimpute, that allows effective imputation under different missing patterns.
Year: 2020 PMID: 31199438 PMCID: PMC6956786 DOI: 10.1093/bioinformatics/btz488
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Missing pattern in MS proteomics technical replicates. Each panel shows the log abundance of ‘non-missing’ and ‘missing’ pY, pS or pT per pair of technical replicates as violin and box plots. On the x-axis, C1 and C2 represent two biological control samples, and D1 and D2 represent two biological samples treated with dasatinib.
Fig. 2. Pearson correlation on simulated abundance. The mean Pearson correlation between the true and imputed values is presented for each scenario at different sample sizes. For each level of missing percentage, scenarios are ordered from left to right by increasing proportion of abundance-independent missingness (AIM).
Fig. 3. Normalized root mean square error (NRMSE) on simulated abundance. The mean NRMSE between the true and imputed values is shown across scenarios. For each level of missing percentage, scenarios are ordered from left to right by increasing proportion of AIM.
Fig. 4. Pearson correlation on TCGA metabolomics studies. The mean Pearson correlation between the true and imputed values in each TCGA study is presented across scenarios.
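The two evaluation metrics named in Figs 2 and 3 can be illustrated with a minimal Python sketch. It assumes that NRMSE is the root mean square error divided by the standard deviation of the true values, which is one common convention; the exact normalization used in the paper is not stated in this record. The function names `pearson` and `nrmse` and the toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def nrmse(true, imputed):
    """RMSE divided by the population SD of the true values (assumed convention)."""
    n = len(true)
    mse = sum((t - i) ** 2 for t, i in zip(true, imputed)) / n
    mean_t = sum(true) / n
    var_t = sum((t - mean_t) ** 2 for t in true) / n
    return math.sqrt(mse / var_t)

# Toy example: true abundances at the masked positions and their imputed values.
true_vals = [10.0, 12.0, 14.0, 16.0]
imputed_vals = [11.0, 12.0, 13.0, 17.0]

r = pearson(true_vals, imputed_vals)
err = nrmse(true_vals, imputed_vals)
```

Both metrics would be computed only over the entries that were masked and then imputed, so that observed values do not inflate the agreement.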
Fig. 5. Ratio of LFC between the imputed and complete abundance matrices on TCGA metabolomics data. Ratio >1: LFC enlarged and no change in upregulation; 0
Pearson correlation of log10 adjusted P-values from differential analysis between the complete and imputed data for TCGA metabolomics studies, with known metabolites only
| Missing (AIM, ADM) | TS-LASSO | Random Forest | KNN-TN (K=5) | Compound minimum |
|---|---|---|---|---|
| TCGA breast cancer ( | ||||
| 3%, 12% | 0.858 | 0.780 | 0.796 | 0.861 |
| 6%, 9% | 0.903 | 0.796 | 0.836 | 0.850 |
| 7.5%, 7.5% | 0.921 | 0.813 | 0.851 | 0.876 |
| 9%, 6% | 0.927 | 0.829 | 0.878 | 0.812 |
| 12%, 3% | 0.959 | 0.899 | 0.918 | 0.790 |
| TCGA ccRCC ( | ||||
| 3%, 12% | 0.917 | 0.908 | 0.914 | 0.913 |
| 6%, 9% | 0.903 | 0.886 | 0.893 | 0.914 |
| 7.5%, 7.5% | 0.930 | 0.895 | 0.919 | 0.846 |
| 9%, 6% | 0.955 | 0.936 | 0.946 | 0.861 |
| 12%, 3% | 0.982 | 0.967 | 0.973 | 0.819 |
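Each row of the table pairs an AIM percentage with an ADM percentage. As a rough illustration of how such a mixed missing pattern could be imposed on complete data, the Python sketch below masks a flat list of abundances. It rests on two assumptions that are not confirmed by this record: ADM (read here as abundance-dependent missingness) is mimicked by removing the lowest values, as under a detection limit, and AIM is mimicked by removing values uniformly at random from the remainder. The function name `mask_mixed` is hypothetical, and the paper's own simulation scheme may differ.

```python
import random

def mask_mixed(values, aim_frac, adm_frac, seed=0):
    """Return (masked, adm_idx, aim_idx); masked holds None at removed positions.

    Assumed ADM: the lowest adm_frac of all entries are removed.
    Assumed AIM: aim_frac of all entries are removed at random from the rest.
    """
    rng = random.Random(seed)
    n = len(values)
    n_adm = round(n * adm_frac)
    n_aim = round(n * aim_frac)
    # Indices sorted by abundance; the lowest ones become ADM entries.
    order = sorted(range(n), key=lambda i: values[i])
    adm_idx = set(order[:n_adm])
    # AIM entries are drawn uniformly from what is still observed.
    remaining = [i for i in range(n) if i not in adm_idx]
    aim_idx = set(rng.sample(remaining, n_aim))
    masked = [None if (i in adm_idx or i in aim_idx) else v
              for i, v in enumerate(values)]
    return masked, adm_idx, aim_idx

# Toy data: 200 abundances, masked with 3% AIM and 12% ADM.
values = [float(i) for i in range(200)]
masked, adm_idx, aim_idx = mask_mixed(values, aim_frac=0.03, adm_frac=0.12)
```

For an abundance matrix, the same idea would apply after flattening it or when applied per compound; which of these matches the study design is not specified here.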