Zeeshan Hamid, Kip D Zimmerman, Hector Guillen-Ahlers, Cun Li, Peter Nathanielsz, Laura A Cox, Michael Olivier.
Abstract
BACKGROUND: Reliable and effective label-free quantification (LFQ) analyses depend not only on the method of data acquisition in the mass spectrometer, but also on the downstream data processing, including software tools, query database, data normalization and imputation. In non-human primates (NHP), LFQ is challenging because the query databases for NHP are limited: the genomes of these species are not comprehensively annotated. This invariably results in limited discovery of proteins and associated post-translational modifications (PTMs) and a higher fraction of missing data points. Identifying fewer proteins and PTMs because of database limitations can obscure important and meaningful biological information. Missing data, in turn, limits downstream analyses (e.g., multivariate analyses), decreases statistical power, biases statistical inference, and makes biological interpretation of the data more challenging. In this study we attempted to address both issues: first, we used the MetaMorpheus proteomics search engine to counter the limits of NHP query databases and maximize the discovery of proteins and associated PTMs, and second, we evaluated different imputation methods for accurate data inference. We used a generic approach for missing data imputation analysis without distinguishing the potential source of missing data (either non-assigned m/z or missing values across runs).
Keywords: Missing value imputation; Non-human primates (NHP); Posttranslational modifications (PTMs)
Year: 2022 PMID: 35804317 PMCID: PMC9264528 DOI: 10.1186/s12864-022-08723-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fig. 1. Overview of the study design. (A) Enhanced discovery of PTMs using high-resolution mass spectrometry and search database augmentation in MetaMorpheus. (B) Graphical representation of the imputation workflow (missing data introduction and comparative evaluation of single and multiple imputation).
Fig. 2. Distribution of peptides with or without modification (numbers in bars represent the number of peptides observed in each category). The lower panel shows the number of PTM-specific peptides detected in each subclass of the three categories (Biological, Metal and Artifact).
Amino acid specificity of detected PTMs in our study
| Amino Acid Specificity | PTMs identified in our study |
|---|---|
| Cysteine (C) | Nitrosylation, Ammonia loss, Carbamyl |
| Aspartate (D) | Carboxylation, Calcium, Fe [II], Sodium, Fe [III], Zinc, Potassium, Magnesium, Cu [I] |
| Glutamate (E) | Carboxylation, Calcium, Fe [II], Sodium, Fe [III], Zinc, Potassium, Magnesium, Cu [I], Water Loss |
| Lysine (K) | Methylation, Formylation, Hydroxylation, Trimethylation, Acetylation, Glutarylation, Succinylation, Butyrylation, Crotonylation, Malonylation, Dimethylation, Hydroxybutyrylation, Carboxylation, Pyridoxal phosphate, Carbamyl |
| Methionine (M) | Oxidation, Carbamyl |
| Asparagine (N) | Hydroxylation, Deamidation, Ammonia loss |
| Proline (P) | Hydroxylation |
| Glutamine (Q) | Deamidation |
| Arginine (R) | Citrullination, Methylation, Dimethylation, Carbamyl |
| Serine (S) | Phosphorylation, ADP-ribosylation, HexNAc |
| Threonine (T) | Phosphorylation, HexNAc |
| N-Terminal (X) | Acetylation, Carbamyl |
| Tyrosine (Y) | Sulfonation, Nitrosylation, Phosphorylation |
Fig. 3. Percentage of correctly imputed values (imputed values with percent bias < 5%) for 12 single imputation methods at 4 levels of overall missingness. Each point represents a single iteration of added missingness and imputation evaluation. (A) < 5% protein missingness. (B) 5–10% protein missingness. (C) 10–20% protein missingness. (D) 20–30% protein missingness.
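The evaluation loop summarized in this caption (mask known values, impute them, count how many imputed values fall within 5% of the truth) can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes percent bias is defined as |imputed − true| / |true| × 100, that missingness is introduced completely at random, and it uses mean imputation only as a stand-in for the 12 methods compared in the study. All function names are hypothetical.

```python
import random


def add_missingness(values, fraction, rng):
    """Mask a random `fraction` of entries with None (assumed: missing completely at random)."""
    n_mask = round(len(values) * fraction)
    masked_idx = set(rng.sample(range(len(values)), n_mask))
    masked = [None if i in masked_idx else v for i, v in enumerate(values)]
    return masked, sorted(masked_idx)


def mean_impute(values):
    """Stand-in imputation: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def percent_bias(true, imputed):
    """Assumed definition: absolute relative error in percent (true value must be non-zero)."""
    return abs(imputed - true) / abs(true) * 100.0


def fraction_correct(true_values, imputed_values, masked_idx, threshold=5.0):
    """Share of masked positions whose imputed value has percent bias below `threshold`."""
    hits = sum(
        percent_bias(true_values[i], imputed_values[i]) < threshold
        for i in masked_idx
    )
    return hits / len(masked_idx)


# One iteration on made-up log-intensities for a single protein.
rng = random.Random(0)
truth = [20.1, 19.8, 20.4, 21.0, 19.5, 20.2, 20.0, 19.9, 20.6, 20.3]
masked, idx = add_missingness(truth, 0.2, rng)
imputed = mean_impute(masked)
print(fraction_correct(truth, imputed, idx))
```

Repeating the last four lines with different seeds and missingness fractions would give one point per iteration, analogous to the points plotted in each panel.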
Fig. 4. Differences between true and imputed regression estimates (A) and between association p-values (B) for 6 imputation methods. Evaluations were done on 10 and 40 proteins randomly selected from the complete data (296 proteins with no missing values). Global missingness of 10 to 40% was introduced for each individual protein list. Regressions were computed with all imputed data regardless of their distance from the truth.
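The comparison of regression estimates can be illustrated with a small sketch. The source does not state which variables were regressed, so this assumes a simple linear regression of a hypothetical outcome `y` on one protein's abundance and compares only the slope. The p-value comparison in panel B is omitted, and the function names are illustrative.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def estimate_difference(x_true, x_imputed, y):
    """Imputed-data slope minus complete-data slope; 0 means imputation left the estimate unchanged."""
    return ols_slope(x_imputed, y) - ols_slope(x_true, y)


# Made-up data: the third abundance value was masked and imputed as 2.5.
x_true = [1.0, 2.0, 3.0, 4.0]
x_imputed = [1.0, 2.0, 2.5, 4.0]
y = [2.1, 3.9, 6.2, 8.0]
print(estimate_difference(x_true, x_imputed, y))
```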
Fig. 5. Differences between true Cronbach’s alpha and imputed Cronbach’s alpha for 6 imputation methods (*** Bonferroni p < 0.001; * Bonferroni p < 0.05). Evaluations were done on 10, 20 or 30 proteins (left axis) randomly selected from the complete data (296 proteins with no missing values). Global missingness (top axis) of 10 and 20% was introduced for each individual protein list.
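For reference, Cronbach's alpha for k items is k/(k − 1) × (1 − Σ item variances / variance of the per-sample totals). The sketch below computes it for a protein list and takes the difference between complete and imputed data. Treating each protein as an "item" and each sample as an observation is an assumption about how the statistic was applied here, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for `items`: k lists (assumed: one per protein), each with n sample values."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Per-sample totals across all items.
    totals = [sum(sample) for sample in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up complete data and a version where one value was imputed.
complete = [[1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.1, 4.2], [0.8, 2.2, 2.9, 3.9]]
imputed = [[1.0, 2.0, 3.0, 4.0], [1.2, 2.6, 3.1, 4.2], [0.8, 2.2, 2.9, 3.9]]
print(cronbach_alpha(complete) - cronbach_alpha(imputed))
```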
Fig. 6. In-depth evaluation of random forest imputation for the full protein dataset (proteins present in at least 25% of samples, corresponding to 1252 proteins) (*** Bonferroni p < 0.001; * Bonferroni p < 0.05).
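The presence filter mentioned in this caption (keep proteins quantified in at least 25% of samples) can be sketched as below. The data layout, a dictionary mapping protein identifiers to per-sample values with `None` for missing, is an assumption made for illustration, as are the identifiers and the function name.

```python
def filter_by_presence(intensities, min_fraction=0.25):
    """Keep proteins whose share of non-missing samples is at least `min_fraction`."""
    kept = {}
    for protein, values in intensities.items():
        present = sum(v is not None for v in values)
        if present / len(values) >= min_fraction:
            kept[protein] = values
    return kept


# Made-up example with four samples per protein.
data = {
    "P1": [21.3, None, None, None],   # present in 25% of samples: kept
    "P2": [None, None, None, None],   # never observed: dropped
    "P3": [20.1, 19.7, None, None],   # present in 50% of samples: kept
}
print(sorted(filter_by_presence(data)))
```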