Qingxia Yang, Jiajun Hong, Yi Li, Weiwei Xue, Song Li, Hui Yang, Feng Zhu.
Abstract
Unwanted experimental/biological variation and technical error are frequently encountered in current metabolomics, which requires the employment of normalization methods for removing undesired data fluctuations. To ensure the 'thorough' removal of unwanted variations, the collective consideration of multiple criteria ('intragroup variation', 'marker stability' and 'classification capability') is essential. However, due to the limited number of available normalization methods, it is extremely challenging to discover an appropriate one that meets all these criteria. Herein, a novel approach was proposed to discover the normalization strategies that are consistently well performing (CWP) under all criteria. Based on various benchmarks, all normalization methods popular in current metabolomics were 'first' discovered to be non-CWP. 'Then', 21 new strategies that combined a 'sample'-based method with a 'metabolite'-based one were found to be CWP. 'Finally', a variety of currently available methods (such as cubic splines, range scaling, level scaling, EigenMS, cyclic loess and mean) were identified to be CWP when combined with other normalization methods. In conclusion, this study not only discovered several strategies that performed consistently well under all criteria, but also proposed a novel approach that could ensure the identification of CWP strategies for future biological problems.
Keywords: area under the curve; bioinformatics; consistency score; metabolomics; normalization
Year: 2020 PMID: 31776543 PMCID: PMC7711263 DOI: 10.1093/bib/bbz137
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Classification of each studied method based on the description of previous publications. The key descriptions for defining methods’ classification were underlined and in italic. Abbreviation (abbr.) was assigned and used to indicate each method in the whole manuscript
| Method | Description |
|---|---|
| A. Sample-based normalization methods | |
| Contrast | Using nonlinear curve fitting to normalize all studied samples |
| Cubic splines | A nonlinear baseline method aiming at |
| Cyclic loess | Normalizing |
| EigenMS | Preserving the original differences and removing the bias |
| Linear baseline | Normalizing all studied samples |
| Mean | During the normalization, |
| Median | Normalizing the studied samples by assuming that |
| MSTUS | Dividing the intensity |
| PQN | Dividing by the median quotient for each intensity |
| Quantile | Achieving |
| Total sum | Assigning an appropriate weight to each sample to minimize differences among all samples |
| B. Metabolite-based normalization methods | |
| Auto scaling | Scaling the metabolite to unit variance, and using |
| Level scaling | Scaling certain metabolite relative to |
| Pareto scaling | Using |
| Range scaling | Using |
| Vast scaling | Stabilizing variables using |
| C. Sample- and metabolite-based normalization methods | |
| VSN | This normalization method both |
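The combined strategies discussed in the abstract chain a 'sample'-based method with a 'metabolite'-based one. The following is a minimal sketch of one such chain, PQN followed by Pareto scaling, written with the Python standard library only. The formulas are the commonly cited textbook forms and are an assumption here, since the method descriptions in the table are incomplete; the function names are hypothetical, the usual total-area pre-normalization step of PQN is omitted, and this is not the authors' implementation.

```python
from math import sqrt
from statistics import mean, median, stdev


def pqn_normalize(samples):
    """Sample-based step: probabilistic quotient normalization (assumed textbook form).

    samples: list of rows, one row of intensities per sample.
    """
    n_metabolites = len(samples[0])
    # Reference profile: per-metabolite median across all samples.
    reference = [median(row[j] for row in samples) for j in range(n_metabolites)]
    normalized = []
    for row in samples:
        # Quotient of each intensity against the reference, skipping zero references.
        quotients = [row[j] / reference[j] for j in range(n_metabolites) if reference[j] != 0]
        factor = median(quotients)  # the "median quotient" for this sample
        normalized.append([value / factor for value in row])
    return normalized


def pareto_scale(samples):
    """Metabolite-based step: mean-center each metabolite, divide by sqrt of its SD."""
    columns = list(zip(*samples))
    scaled_columns = []
    for column in columns:
        center = mean(column)
        # Guard (assumption): a constant metabolite is only centered, not scaled.
        spread = sqrt(stdev(column)) or 1.0
        scaled_columns.append([(value - center) / spread for value in column])
    # Transpose back to one row per sample.
    return [list(row) for row in zip(*scaled_columns)]


def combined_strategy(samples):
    """Sequential combination: 'sample'-based first, then 'metabolite'-based."""
    return pareto_scale(pqn_normalize(samples))
```

Swapping the order of the two calls in `combined_strategy` gives the other kind of sequential combination ('metabolite'-based first, then 'sample'-based).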
Five benchmark datasets analyzed in this study. Each dataset was collected as representatives of the diverse analytical platforms (LC–MS of positive and negative modes, GC–MS, NMR and DIMS)
| Dataset | Reference | Platform | Description |
|---|---|---|---|
| LC–MS Positive Mode | Anal Chim Acta 2012;743:90–100 | LC–MS positive | 1586 metabolites from 60 HCC patients and 129 CIR controls |
| LC–MS Negative Mode | Anal Chim Acta 2012;743:90–100 | LC–MS negative | 940 metabolites from 59 HCC patients and 126 CIR controls |
| GC–MS | Anal Chem 2009;81:7974–80 | GC–MS | 46 metabolites from mixtures of different concentrations (15 versus 15) |
| NMR Spectroscopy | Metabolomics 2014;10:950–7 | NMR | 51 metabolites from 27 fasted and 26 carbohydrate prefed pigs |
| Direct Infusion MS | Sci Data 2014;1:140012 | DIMS | 48 metabolites pertaining to 66 cow and 68 sheep samples |
Figure 1. The relationship among the performances of all studied normalization strategies, identified by hierarchical clustering of the quantitative metrics across all five benchmarks representing different analytical platforms. The metrics analyzed for the criteria Ca, Cb and Cc were PMAD, AUC and CS, respectively. The leaves of the hierarchical tree gave the names of the studied strategies. The background colors of single methods, sequential combinations of ‘sample’-based and ‘metabolite’-based methods and sequential combinations of ‘metabolite’-based and ‘sample’-based methods were white, light blue and light orange, respectively. The methods with superior (≤0.3), good (>0.3 and <0.7) and poor (>0.7) PMAD were colored dark orange, light orange and gray, respectively. The methods with superior (>0.9), good (>0.7 and ≤0.9) and poor (≤0.7) AUC values were colored dark green, light green and gray, respectively. If the AUC values of a combined strategy and of any single method in that combination equaled 1 (perfect classification), a white round dot was used to highlight that strategy. The methods ranked in the top one-third, bottom one-third and remaining one-third by their CS values were indicated in dark blue, gray and light blue, respectively. If a combined strategy performed better than both single methods within the combination, a triangle was used to highlight that strategy.
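The color coding in the Figure 1 caption amounts to a set of threshold rules, which can be sketched as below. The function names are hypothetical. Two points are assumptions rather than statements from the caption: a PMAD of exactly 0.7 is not assigned to any class in the caption and is treated as poor here, and a higher CS is taken to rank higher, with any remainder after splitting into thirds placed in the middle group.

```python
def pmad_category(pmad):
    """Classify a PMAD value using the caption's cut-offs (lower is better)."""
    if pmad <= 0.3:
        return "superior"
    if pmad < 0.7:
        return "good"
    # Assumption: the caption leaves exactly 0.7 unassigned; it is grouped with poor.
    return "poor"


def auc_category(auc):
    """Classify an AUC value using the caption's cut-offs (higher is better)."""
    if auc > 0.9:
        return "superior"
    if auc > 0.7:
        return "good"
    return "poor"


def cs_tertiles(cs_by_strategy):
    """Split strategies into top, middle and bottom thirds by CS value.

    Assumption: higher CS ranks higher; leftovers go to the middle group.
    """
    ordered = sorted(cs_by_strategy, key=cs_by_strategy.get, reverse=True)
    cut = len(ordered) // 3
    labels = {}
    for position, name in enumerate(ordered):
        if position < cut:
            labels[name] = "top"
        elif position >= len(ordered) - cut:
            labels[name] = "bottom"
        else:
            labels[name] = "middle"
    return labels
```

For example, `auc_category(0.9)` falls in the good class, because the superior class requires a value strictly above 0.9.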
The normalization performance under each criterion (PMAD, AUC and CS) assessed by the ranks of five representative datasets and the clustering partitions (α, β and γ) illustrated in Figure 1. There were three method types: (A) 17 ‘sample/metabolite’-based methods, (B) 21 combined strategies CWP under all three criteria and (C) 28 methods consistently poor-performing under all three criteria. Median and SD represented the median value and the SD of the ranks of five representative datasets, respectively
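The median and SD of ranks described in this caption can be computed along the following lines. This is a minimal sketch with hypothetical function names and input layout (one dictionary of metric values per dataset). The tie handling (ties broken by input order) and the use of the sample SD are assumptions, since the caption does not specify either.

```python
from statistics import median, stdev


def rank_strategies(scores, higher_is_better=True):
    """Rank strategies within one dataset; rank 1 is the best.

    Assumption: ties are broken by input order.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}


def summarize_ranks(per_dataset_scores, higher_is_better=True):
    """Return {strategy: (median rank, SD of ranks)} across datasets.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    ranks = {}
    for scores in per_dataset_scores:
        for name, rank in rank_strategies(scores, higher_is_better).items():
            ranks.setdefault(name, []).append(rank)
    return {name: (median(values), stdev(values)) for name, values in ranks.items()}
```

For a metric where smaller values are better, such as PMAD, the same functions would be called with `higher_is_better=False`.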
Figure 2. Venn diagrams of the numbers of single methods in Partitions α (A) and α&β (B) of Figure 1, and of combined strategies in Partitions α (C) and α&β (D) of Figure 1, under all criteria. Identification of the WP strategies under two of the three criteria, based on the orders of the studied strategies ranked by any two criteria in Figure 1: (E) AUC and PMAD; (F) CS and AUC; (G) PMAD and CS.
Figure 3. The performances of the 21 newly identified CWP strategies. (A) Quantitative illustrations of the assessment results using PMAD (light orange bar), AUC (light green bar) and CS (light blue bar). (B) The percentage of each method’s appearances over the total number of its possible combinations. The bars colored in light blue and light orange indicated the sequential combination of ‘sample’-based and ‘metabolite’-based methods and the sequential combination of ‘metabolite’-based and ‘sample’-based methods, respectively.
Figure 4. The performances of the 28 ‘badly performing’ strategies. (A) Quantitative illustrations of the assessment results using PMAD (light orange bar), AUC (light green bar) and CS (light blue bar). (B) The percentage of each method’s appearances over the total number of its possible combinations. The bars colored in light blue and light orange indicated the sequential combination of ‘sample’-based and ‘metabolite’-based methods and the sequential combination of ‘metabolite’-based and ‘sample’-based methods, respectively. The blue dashed line indicated the single methods performing badly under all criteria.