Minjie Shen, Yi-Tan Chang, Chiung-Ting Wu, Sarah J Parker, Georgia Saylor, Yizhi Wang, Guoqiang Yu, Jennifer E Van Eyk, Robert Clarke, David M Herrington, Yue Wang.
Abstract
Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because the mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook on future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization (FRMF), a low-rank global matrix factorization framework capable of integrating local similarity derived from additional data types. We also explore a biologically inspired latent variable modeling strategy, convex analysis of mixtures (CAM), for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values rather than on authentic missing values. Nevertheless, we show that FRMF provides a novel way to incorporate external and local information, and that the exploratory implementation of CAM presents a biologically plausible new approach.
Year: 2022 PMID: 35058491 PMCID: PMC8776850 DOI: 10.1038/s41598-022-04938-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Comparative assessment of eight representative missing value imputation methods, divided into three categories.
Figure 2. Two-phased workflow of realistic simulation-based assessment on missing value imputation methods.
Summary of real proteomics datasets used in this work (DIA-MS).
| Batch | Sample size | Number of proteins | Total missing rate | Setting #1 proteins (% of total) | Setting #2 proteins (% of total) |
|---|---|---|---|---|---|
| Batch A | 98 | 2107 | 24.67% | 751 (35.64%) | 1935 (91.84%) |
| Batch B | 55 | 2604 | 29.63% | 819 (31.45%) | 2324 (89.25%) |
| Batch C | 47 | 2590 | 25.52% | 976 (37.68%) | 2325 (89.77%) |
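
The percentages in the last two columns appear to be each setting's protein count expressed as a share of the batch's total number of proteins. A quick check in Python, with the counts copied from the table above (the helper name `pct` is ours, introduced only for this check):

```python
# Rows copied from the dataset summary table:
# (batch, total proteins, setting #1 proteins, setting #2 proteins)
rows = [
    ("Batch A", 2107, 751, 1935),
    ("Batch B", 2604, 819, 2324),
    ("Batch C", 2590, 976, 2325),
]

def pct(part, total):
    """Share of `part` in `total`, in percent, rounded to two decimals."""
    return round(100.0 * part / total, 2)

# Map each batch to its (setting #1 share, setting #2 share).
shares = {name: (pct(s1, total), pct(s2, total)) for name, total, s1, s2 in rows}

for name, (p1, p2) in shares.items():
    print(f"{name}: setting #1 = {p1:.2f}%, setting #2 = {p2:.2f}%")
```

The recomputed shares agree with the bracketed values in the table to two decimal places.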
Figure 3. (a) Imputation performance of the eight methods on the simulation data of setting #1, with an assumed MNAR missing mechanism and varying total missing rates. (b) Imputation performance of the eight methods on the simulation data of setting #1, with an assumed MCAR missing mechanism and varying total missing rates. (c) Imputation performance of the eight methods on the simulation data of setting #2, focusing on the authentic missing mechanism and varying masked rates.
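
To make the masking-based evaluation concrete, the following is a minimal sketch rather than the paper's pipeline. It assumes synthetic intensities, models MNAR as per-protein left-censoring, reads HalfMin as "half of the protein's minimum observed value" and Mean as "the protein's observed mean", and defines NRMSE as the RMSE on masked entries divided by the standard deviation of their true values. All of these are our assumptions, and the paper's exact definitions may differ.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic stand-in for a complete proteins x samples intensity matrix (not real data):
# each protein (row) has its own abundance level, with log-normal spread across samples.
N_PROTEINS, N_SAMPLES, RATE = 200, 30, 0.3
X = []
for _ in range(N_PROTEINS):
    mu = random.gauss(10.0, 0.5)
    X.append([math.exp(random.gauss(mu, 0.5)) for _ in range(N_SAMPLES)])

def mcar_mask(X, rate):
    """MCAR: every entry is dropped with the same probability, regardless of its value."""
    return [[random.random() < rate for _ in row] for row in X]

def mnar_mask(X, rate):
    """MNAR (assumed left-censoring): the lowest `rate` fraction of each protein's values is dropped."""
    mask = []
    for row in X:
        k = int(round(rate * len(row)))
        cutoff = sorted(row)[k]  # exactly k values lie strictly below this one
        mask.append([v < cutoff for v in row])
    return mask

def impute(X, mask, how):
    """Fill masked entries per protein with half the observed minimum or the observed mean."""
    out = []
    for row, mrow in zip(X, mask):
        observed = [v for v, m in zip(row, mrow) if not m]
        fill = 0.5 * min(observed) if how == "halfmin" else statistics.fmean(observed)
        out.append([fill if m else v for v, m in zip(row, mrow)])
    return out

def nrmse(X, X_hat, mask):
    """RMSE on masked entries, normalized by the standard deviation of their true values."""
    truth = [v for row, mrow in zip(X, mask) for v, m in zip(row, mrow) if m]
    est = [v for row, mrow in zip(X_hat, mask) for v, m in zip(row, mrow) if m]
    mse = statistics.fmean([(t - e) ** 2 for t, e in zip(truth, est)])
    return math.sqrt(mse / statistics.pvariance(truth))

masks = {"MCAR": mcar_mask(X, RATE), "MNAR": mnar_mask(X, RATE)}
scores = {
    name: {how: nrmse(X, impute(X, mask, how), mask) for how in ("halfmin", "mean")}
    for name, mask in masks.items()
}
for name, by_method in scores.items():
    print(name, {how: round(v, 3) for how, v in by_method.items()})
```

On this toy data the mean fill should score better under MCAR and the half-minimum fill better under MNAR, because left-censored values sit below everything that was observed. This only illustrates why the assumed mechanism matters; it says nothing about the real datasets.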
Summary of the relative performance among the imputation methods being evaluated.
Overall summary (Batch A, Batch B, Batch C):

| Setting | Mechanism | Best (NRMSE) | Best (RMSE) | Best (SOR) | Worst (NRMSE) | Worst (RMSE) | Worst (SOR) |
|---|---|---|---|---|---|---|---|
| Setting #1 | MCAR | NIPALS SVT KNN PW | SVT NIPALS KNN PW | NIPALS | HalfMin | HalfMin | HalfMin |
| Setting #1 | MIX (90% MCAR + 10% MNAR) | SVT NIPALS KNN PW | SVT NIPALS KNN PW | SVT NIPALS KNN PW | HalfMin | HalfMin | HalfMin |
| Setting #1 | MNAR | SVT HalfMin | HalfMin | SVT HalfMin | Mean | Mean | Mean |
| Setting #2 | MAR + MNAR | SVT KNN SW NIPALS | SVT | NIPALS SVT KNN SW | HalfMin | HalfMin | HalfMin |
Figure 4. (a) Imputation performance of the FRMF variants on the simulation data of setting #2, with varying masked rates. (b) Imputation performance of CAM variants on the simulation data of setting #2, with varying masked rates, in comparison to that of SVT and NIPALS. The imputation accuracy is evaluated in the original intensity space (before log-transformation).
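
For orientation, a low-rank factorization that also uses local similarity from another data type can be written in the generic form below. This is a sketch of the general family, not the paper's exact FRMF objective: the observed index set $\Omega$, the factor rows $u_i$ and $v_j$, the external similarity weights $w_{ii'}$, the tuning constants $\lambda$ and $\gamma$, and the squared-difference form of the similarity penalty are all assumptions introduced here.

$$
\min_{U,V}\; \sum_{(i,j)\in\Omega} \left(x_{ij} - u_i^{\top} v_j\right)^2
\;+\; \lambda \left(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\right)
\;+\; \gamma \sum_{i,i'} w_{ii'} \,\lVert u_i - u_{i'} \rVert_2^2
$$

The first term fits only the observed entries, the second keeps the factors small, and the third pulls together the factors of proteins that the external data deems similar. A missing entry $(i,j)\notin\Omega$ would then be imputed as $u_i^{\top} v_j$.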
Figure 5. Workflow of the CAM-based imputation method with two variant algorithms.
Figure 6. CAM principles for latent variable modelling and deconvolution. (a) Mixed expression profile of latent process mixtures. (b) Illustration of the mixing operation in scatter space, which produces a compressed and rotated scatter simplex whose vertices host marker genes and correspond to the mixing proportions. (c) Mathematical description of the expression profile of latent process mixtures.
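
The geometric idea in panel (b) can be illustrated with a toy linear mixture. The sketch below assumes the standard linear mixing model X = A S with non-negative entries. The matrices `A` and `S` and the helper names `mix` and `to_simplex` are hypothetical values and names chosen for illustration, not taken from the paper.

```python
# Hypothetical mixing proportions: 3 samples (rows) x 2 latent processes (columns).
A = [[0.8, 0.2],
     [0.5, 0.5],
     [0.1, 0.9]]

# Hypothetical latent expression profiles: 2 processes x 4 genes.
# Gene 0 is a marker of process 0, gene 1 a marker of process 1; genes 2 and 3 are shared.
S = [[9.0, 0.0, 4.0, 1.0],
     [0.0, 6.0, 4.0, 3.0]]

def mix(A, S):
    """Linear mixing X = A S: each sample is a proportion-weighted sum of latent profiles."""
    n_genes = len(S[0])
    return [[sum(a_k * s_k[g] for a_k, s_k in zip(a_row, S)) for g in range(n_genes)]
            for a_row in A]

def to_simplex(vec):
    """Scale a non-negative vector to unit sum, placing it on the standard simplex."""
    total = sum(vec)
    return [v / total for v in vec]

X = mix(A, S)  # samples x genes

# Each gene becomes one point in scatter space: its expression across samples, scaled to unit sum.
genes = [to_simplex([X[j][g] for j in range(len(A))]) for g in range(len(S[0]))]

# The simplex vertices are the (scaled) columns of the mixing matrix.
vertices = [to_simplex([A[j][k] for j in range(len(A))]) for k in range(len(A[0]))]

for g, point in enumerate(genes):
    print(f"gene {g}:", [round(v, 4) for v in point])
for k, vertex in enumerate(vertices):
    print(f"vertex {k}:", [round(v, 4) for v in vertex])
```

In this toy case the two marker genes land exactly on the two vertices, while the shared genes fall on the segment between them. That is the property that lets mixing proportions be read off from marker genes.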