Riccardo Di Guida1, Jasper Engel2, J William Allwood3, Ralf J M Weber3, Martin R Jones3, Ulf Sommer2, Mark R Viant4, Warwick B Dunn5.
Abstract
INTRODUCTION: The generic metabolomics data processing workflow is constructed as a serial set of processes including peak picking, quality assurance, normalisation, missing value imputation, transformation and scaling. The combination of these processes should present the experimental data in an appropriate structure so as to identify the biological changes in a valid and robust manner.
Keywords: Glog transformation; KNN; Metabolomics; PQN normalisation; Random forest; UHPLC-MS
Year: 2016 PMID: 27123000 PMCID: PMC4831991 DOI: 10.1007/s11306-016-1030-9
Source DB: PubMed Journal: Metabolomics ISSN: 1573-3882 Impact factor: 4.290
Summary of the percentage of missing values present in four datasets and the correlation of missing values observed with m/z, retention time and response
| Dataset | Mouse serum | Placental tissue | Human urine | Mammalian cell extract |
|---|---|---|---|---|
| Metabolite features before filtering | 4435 | 3412 | 3823 | 2008 |
| Missing values before filtering (%) | 15.0 | 10.2 | 14.0 | 8.7 |
| Metabolite features after filtering | 2996 | 2622 | 2684 | 1598 |
| Missing values after filtering (%) | 4.5 | 2.8 | 5.0 | 3.7 |
| Pearson coefficient (missing values vs. mean abundance) | −0.05 | −0.12 | −0.05 | −0.08 |
| Pearson coefficient (missing values vs. m/z) | 0.07 | −0.02 | 0.30 | 0.35 |
| Pearson coefficient (missing values vs. retention time) | 0.02 | −0.11 | 0.03 | −0.07 |
Filtering was performed as defined in Sect. 2.1.1.3
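The correlations in the table above relate how often a feature is missing to a property of that feature. A minimal sketch of that calculation for mean abundance is given here, assuming missing values are stored as `None` and that the correlation is computed per feature across samples; the toy matrix and all function names are hypothetical and are not taken from the article.

```python
import math

def missing_fraction(column):
    """Fraction of entries in one feature column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def mean_observed(column):
    """Mean of the observed (non-missing) values in one feature column."""
    observed = [v for v in column if v is not None]
    return sum(observed) / len(observed)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy data: each inner list is one feature across four samples.
features = [
    [10.0, None, 12.0, None],
    [50.0, 55.0, None, 52.0],
    [90.0, 95.0, 92.0, 91.0],
]
fractions = [missing_fraction(f) for f in features]
abundances = [mean_observed(f) for f in features]
r = pearson(fractions, abundances)
```

In this toy example the low-abundance features are missing more often, so `r` is strongly negative. The coefficients reported in the table are much closer to zero.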
Normalised root mean squared errors (NRMSE) for four datasets for comparison of six different missing value imputation methods
| MVI method | Mouse serum | Placental tissue | Human urine | Mammalian cell extract |
|---|---|---|---|---|
| Small value replacement | 9.99 | 3.64 | 7.66 | 5.53 |
| Mean | 1.82 | 0.66 | 1.47 | 1.01 |
| Median | 1.60 | 0.68 | 1.49 | 1.01 |
| K-nearest neighbours | 1.29 | 0.58 | 1.44 | 0.54 |
| Bayesian principal components analysis | 1.30 | 0.62 | 1.49 | 1.12 |
| Random forest | 0.75 | 0.45 | 1.16 | 0.37 |
An NRMSE close to zero indicates that the imputation algorithm predicted the missing values accurately
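For orientation, one common way to compute an NRMSE is sketched here. It assumes the normalisation is by the variance of the true values (the definition used by some random forest imputation tools); the record does not state which definition the authors applied, so the formula and the function name `nrmse` are assumptions.

```python
import math

def nrmse(true_values, imputed_values):
    """Root mean squared error divided by the std. dev. of the true values.

    Assumed definition: sqrt(mean((true - imputed)^2) / var(true)).
    """
    n = len(true_values)
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    return math.sqrt(mse / var_true)

# Hypothetical values that were removed on purpose and then imputed.
true_vals = [2.0, 4.0, 6.0, 8.0]
perfect = nrmse(true_vals, true_vals)        # exact recovery
mean_imputed = nrmse(true_vals, [5.0] * 4)   # every value replaced by the mean
```

Under this definition exact recovery gives 0, and replacing every value with the mean of the true values gives 1, which offers a rough reference point when reading the table.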
Summary of the number of the 96 modified metabolite features defined as statistically significant (q < 0.05) and the number of metabolite features falsely reported as statistically significant (q < 0.05) for all of the different data processing methods applied
| Normalisation | Missing value imputation | Transformation | Scaling | True positive results | False positive results |
|---|---|---|---|---|---|
| PQN/SUM | RF | All | All | 81 | 0 |
| PQN | None | All | All | 81 | 0 |
| PQN | MN/MD | All | All | 80 | 0 |
| SUM | None | All | All | 80 | 0 |
| SUM | MN/MD/SV | All | All | 79 | 0 |
| PQN | SV | All | All | 77 | 0 |
| SUM | BPCA | glog | Range/Autoscaling/VAST | 81 | 1 |
| SUM | KNN | All | All | 81 | 3 |
| SUM | BPCA | All | None/Pareto | 81 | 3 |
| PQN | KNN | All | All | 82 | 5 |
| PQN | BPCA | All | All | 82 | 6 |
"All" indicates that every method gave the same result. The closer the number of true positive results is to 96, the better the data processing has performed
PQN probabilistic quotient normalization, RF random forest, MN mean, MD median, SV small value, KNN k-nearest neighbour, BPCA Bayesian principal components analysis, glog generalised log
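The true and false positive counts above depend on a q < 0.05 threshold. The sketch below shows how such counts could be tallied, assuming q-values are obtained with the Benjamini-Hochberg procedure; the record does not name the correction method, and the p-values, feature indices and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q

def count_hits(p_values, modified, alpha=0.05):
    """Count significant features that were modified (TP) or not (FP)."""
    q = benjamini_hochberg(p_values)
    significant = {i for i, v in enumerate(q) if v < alpha}
    return len(significant & modified), len(significant - modified)

# Hypothetical p-values for five features; features 0, 1 and 2 were modified.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.60]
modified = {0, 1, 2}
tp, fp = count_hits(p_vals, modified)
```

Here two of the three modified features pass the threshold and no unmodified feature does, which mirrors how a row such as "81 true positives, 0 false positives" would arise.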
Summary of the top ten permutations according to p-value achieved for PC1 scores values
| Normalisation | MVI | Transformation | Scaling | Variance (PC1; %) | Variance (PC2; %) | P-value (PC1) |
|---|---|---|---|---|---|---|
| SUM | RF | glog | None | 42.7 | 39.3 | 4.82E−12 |
| SUM | RF | nlog | None | 44.7 | 28.6 | 1.66E−08 |
| PQN | RF | glog | None | 43.9 | 28.7 | 1.66E−08 |
| PQN | RF | IHS | None | 44.7 | 28.6 | 1.83E−07 |
| PQN | RF | nlog | None | 44.7 | 28.6 | 1.83E−07 |
| SUM | RF | glog | Pareto | 41.5 | 31.2 | 0.01601 |
| SUM | RF | nlog | Pareto | 43.2 | 30.4 | 0.02768 |
| PQN | RF | nlog | Pareto | 42.1 | 31.2 | 0.02934 |
| PQN | RF | IHS | Pareto | 42.1 | 31.2 | 0.02934 |
| PQN | RF | glog | Pareto | 41.7 | 31.5 | 0.03482 |
A greater combined percentage variance for PC1 and PC2 and lower p-values for PC1 and PC2 indicate that the data processing has performed better
PQN probabilistic quotient normalization, RF random forest, glog generalised log, nlog normal log, IHS inverse hyperbolic sine
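The normalisation and transformation abbreviations above can be illustrated with a short sketch. It assumes PQN uses the feature-wise median across samples as the reference profile, and that glog takes the form log(x + sqrt(x² + λ)) with a tuning parameter λ; published variants differ in log base and scaling, and the record does not say which form or λ the authors used. All function names are hypothetical.

```python
import math
from statistics import median

def sum_normalise(sample):
    """SUM: divide every feature by the total response of the sample."""
    total = sum(sample)
    return [v / total for v in sample]

def pqn_normalise(samples):
    """PQN: divide each sample by its median quotient against a reference."""
    n_features = len(samples[0])
    # Assumed reference profile: median of each feature across all samples.
    reference = [median(s[j] for s in samples) for j in range(n_features)]
    normalised = []
    for s in samples:
        quotients = [s[j] / reference[j] for j in range(n_features)
                     if reference[j] > 0]
        factor = median(quotients)
        normalised.append([v / factor for v in s])
    return normalised

def glog(x, lam=1.0):
    """Generalised log (assumed form); lam is a hypothetical tuning value."""
    return math.log(x + math.sqrt(x * x + lam))

def ihs(x):
    """Inverse hyperbolic sine transformation."""
    return math.asinh(x)

# Hypothetical samples that differ only by a dilution factor.
samples = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]
pqn = pqn_normalise(samples)
```

In this toy case PQN maps all three samples onto the same profile, since they differ only by dilution. With `lam=1.0` and the natural log, this form of glog coincides with IHS, which may help explain why the two transformations give near-identical rows in the table.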
Fig. 1 Examples of PCA and PLS-DA scores plots for acceptable and not acceptable data processing methods. a PCA scores plot for data processed applying RF missing value imputation, SUM normalisation, glog transformation and no scaling, which is defined as an acceptable method; 100 % variance accounted for in PC1 and 2, PC1 p = 4.8E−12. b PCA scores plot for data processed applying small value missing value imputation, SUM normalisation, glog transformation and no scaling, which is defined as not an acceptable method; 25.7 % variance accounted for in PC1 and 2, PC1 p = 1.8E−7. c PLS-DA scores plot for data processed applying KNN missing value imputation, PQN normalisation, glog transformation and no scaling, which is defined as an acceptable method; R2 = 0.61, Q2 = 0.46. d PLS-DA scores plot for data processed applying small value missing value imputation, SUM normalisation, glog transformation and no scaling, which is defined as not an acceptable method; R2 = 0.42, Q2 = 0.31. Red circles = Class A; black crosses = Class B; green triangles = QC sample (Color figure online)
Summary of the top 8 data processing methods according to the PLS-DA R2 values
| Normalisation | MVI | Transformation | Scaling | R2 | Q2 | R2 − Q2 |
|---|---|---|---|---|---|---|
| PQN | BPCA | nlog | Pareto | 0.63 | 0.47 | 0.16 |
| PQN | KNN | glog | None | 0.61 | 0.46 | 0.15 |
| PQN | BPCA | nlog | None | 0.59 | 0.44 | 0.15 |
| SUM | KNN | glog | None | 0.59 | 0.51 | 0.08 |
| SUM | BPCA | nlog | None | 0.57 | 0.40 | 0.16 |
| PQN | BPCA | IHS | None | 0.56 | 0.38 | 0.18 |
| PQN | BPCA | glog | rg | 0.56 | 0.55 | 0.01 |
| SUM | KNN | nlog | Auto | 0.55 | 0.40 | 0.16 |
PQN probabilistic quotient normalization, KNN k-nearest neighbour, BPCA Bayesian principal components analysis, glog generalised log, nlog normal log, IHS inverse hyperbolic sine, rg range scaling, Auto autoscaling
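As a reading aid for the R2, Q2 and R2 − Q2 columns, the sketch below uses the usual convention that both are one minus a residual sum of squares over the total sum of squares, with R2 computed from fitted values and Q2 from cross-validated predictions. The exact cross-validation scheme used by the authors is not given in this record, and the class labels and predictions here are hypothetical.

```python
def explained_fraction(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - rss / tss

# Hypothetical class labels (0 = Class A, 1 = Class B) and model outputs.
y = [0.0, 0.0, 1.0, 1.0]
fitted = [0.1, 0.2, 0.8, 0.9]            # predictions on the training data
cross_validated = [0.3, 0.4, 0.6, 0.5]   # predictions for held-out samples

r2 = explained_fraction(y, fitted)
q2 = explained_fraction(y, cross_validated)
gap = r2 - q2  # a large gap suggests the model fits better than it predicts
```

A small R2 − Q2 difference, as in the PQN/BPCA/glog/range-scaling row, suggests the model predicts held-out samples almost as well as it fits the training data.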