Leslie Myint, Andre Kleensang, Liang Zhao, Thomas Hartung, Kasper D Hansen.
Abstract
As mass spectrometry-based metabolomics becomes more widely used in biomedical research, it is important to revisit existing data analysis paradigms. Existing data preprocessing efforts have largely focused on methods that first extract features separately from each sample and then attempt to group features across samples to facilitate comparisons. We show that this preprocessing approach leads to unnecessary variability in peak quantifications that adversely impacts downstream analysis. We present a new method, bakedpi, for preprocessing both centroid and profile mode metabolomics data. It detects peaks using an intensity-weighted bivariate kernel density estimate computed on a pooling of all samples. This new method reduces the unnecessary quantification variability and increases power in downstream differential analysis.
Year: 2017 PMID: 28221771 PMCID: PMC5362739 DOI: 10.1021/acs.analchem.6b04719
Source DB: PubMed Journal: Anal Chem ISSN: 0003-2700 Impact factor: 6.986
Figure 1. Problems with sample-specific processing in XCMS and MZmine2. Peak detection and bounding for a single peak in the MTBLS2_rep1 data set. (a) The m/z-RT space surrounding this peak for a single sample; color is used to depict intensity (red is high). (b) Overlaid extracted ion chromatograms from all 8 samples in the experiment. Different colors denote different samples. (c) The peak bounds for all samples for XCMS (blue), MZmine2 (purple) and bakedpi (orange; all samples have same bounds). This experiment compares two groups of samples indicated with different color shades. (d) XCMS peak quantification vs peak width. (e) Like part d but for MZmine. (f) Distribution of peak quantifications, based on the peak bounds in part c. Substantial heterogeneity in the sample-specific bounds leads to excess variability in the quantifications; this is addressed by using the same RT bound for all samples.
Figure 2. Weighted bivariate kernel density estimation. We depict a selected rectangle in m/z-RT space for (a) one sample and (b) the pooled metasample. m/z values with higher intensity are shown in red, those with lower intensity in blue. (c) The weighted bivariate density estimate.
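The idea behind Figure 2 can be illustrated with a minimal sketch of an intensity-weighted bivariate kernel density estimate on pooled data points. This is not the bakedpi implementation: the Gaussian product kernel, the bandwidths (`bw_mz`, `bw_rt`), the grid, and the simulated data are all assumptions chosen for illustration, and the function name `weighted_kde_2d` is hypothetical.

```python
import numpy as np

def weighted_kde_2d(mz, rt, intensity, mz_grid, rt_grid, bw_mz, bw_rt):
    """Intensity-weighted bivariate KDE on a grid (Gaussian product kernel).

    Assumption: a separate fixed bandwidth per axis; the source does not
    state which kernel or bandwidths bakedpi uses.
    """
    mz, rt, intensity = (np.asarray(a, dtype=float) for a in (mz, rt, intensity))
    mz_grid = np.asarray(mz_grid, dtype=float)
    rt_grid = np.asarray(rt_grid, dtype=float)
    weights = intensity / intensity.sum()  # weights sum to 1

    def gauss(grid, points, bw):
        # Kernel matrix of shape (len(grid), len(points))
        z = (grid[:, None] - points[None, :]) / bw
        return np.exp(-0.5 * z**2) / (bw * np.sqrt(2.0 * np.pi))

    k_mz = gauss(mz_grid, mz, bw_mz)
    k_rt = gauss(rt_grid, rt, bw_rt)
    # density[i, j] = sum_n w_n * K(mz_grid[i] - mz_n) * K(rt_grid[j] - rt_n)
    return (k_mz * weights) @ k_rt.T

# Hypothetical pooled "metasample": one intense peak plus weak background.
rng = np.random.default_rng(0)
n = 200
sig_mz = rng.normal(200.10, 0.002, n)
sig_rt = rng.normal(50.0, 1.5, n)
sig_int = rng.uniform(500.0, 1500.0, n)
bg_mz = rng.uniform(200.0, 200.2, n)
bg_rt = rng.uniform(30.0, 70.0, n)
bg_int = rng.uniform(1.0, 10.0, n)

mz_grid = np.linspace(200.0, 200.2, 101)
rt_grid = np.linspace(30.0, 70.0, 81)
density = weighted_kde_2d(
    np.concatenate([sig_mz, bg_mz]),
    np.concatenate([sig_rt, bg_rt]),
    np.concatenate([sig_int, bg_int]),
    mz_grid, rt_grid, bw_mz=0.005, bw_rt=2.0,
)

# Location of the density maximum, a crude stand-in for peak detection.
i, j = np.unravel_index(np.argmax(density), density.shape)
est_mz, est_rt = mz_grid[i], rt_grid[j]
# Riemann-sum check that the estimate integrates to roughly 1 over the grid.
mass = density.sum() * (mz_grid[1] - mz_grid[0]) * (rt_grid[1] - rt_grid[0])
```

Because the weights are proportional to intensity, low-intensity background points contribute little to the estimate, so the density surface is dominated by regions where many samples record strong signal.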
Characteristics of Evaluation Datasets
| name (source) | MS instrument | column | no. samples (group 1, group 2) |
|---|---|---|---|
| ASD_hirisk (C) | QTOF | HPLC-HILIC | 20, 20 |
| timecourse_4 h (C) | QTOF | HPLC-HILIC | 6, 6 |
| timecourse_24 h (C) | QTOF | HPLC-HILIC | 6, 6 |
| MTBLS2_rep1 (M) | QTOF | UPLC-reverse phase | 4, 4 |
| MTBLS2_rep2 (M) | QTOF | UPLC-reverse phase | 4, 4 |
| CAMERA_pos (M) | QTOF | UPLC-reverse phase | 3, 3 |
| CAMERA_neg (M) | QTOF | UPLC-reverse phase | 3, 3 |
| MTBLS103 (M) | QTOF | UPLC-HILIC | 14, 12 |
| MTBLS213 (M) | QTOF | UPLC-reverse phase | 6, 6 |
| MTBLS126 (M) | Orbitrap | HPLC-HILIC | 3, 3 |

Source abbreviations: C = CAAT, M = Metabolights.
Figure 3. Variability comparison of peak quantifications. (a) For peaks that are detected both by bakedpi and XCMS, the distribution of the differences in residual standard deviation for all data sets are shown as violin plots. Each violin is a mirrored density plot; the median is indicated by a horizontal red line. (b) Like part a but for MZmine. For all data sets, the majority of peaks detected by both methods have quantifications that are less variable when quantified with bakedpi.
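As a rough illustration of the quantity compared in Figure 3, the sketch below computes a residual standard deviation for one peak under two quantification methods. It assumes the residuals are taken around the group means of a two-group design, with degrees of freedom equal to the number of samples minus the number of groups; the source does not spell out the model, so treat this definition, the function name `residual_sd`, and the numbers as hypothetical.

```python
import numpy as np

def residual_sd(values, groups):
    """Residual SD of one peak's quantifications around its group means.

    Assumption: a one-way (group means) model; dof = n_samples - n_groups.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    # Sum of squared deviations from each group's own mean.
    ss = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in labels
    )
    return np.sqrt(ss / (values.size - labels.size))

# Hypothetical quantifications of the same peak (4 vs 4 samples).
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
method_a = [10.0, 10.2, 9.8, 10.0, 12.0, 12.2, 11.8, 12.0]   # tighter
method_b = [10.0, 11.0, 9.0, 10.0, 12.0, 13.0, 11.0, 12.0]   # noisier

sd_a = residual_sd(method_a, groups)
sd_b = residual_sd(method_b, groups)
diff = sd_a - sd_b  # negative when method A is less variable
```

Both hypothetical methods recover the same group means here, so the difference in residual standard deviation isolates the within-group variability of the quantifications.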
Figure 4. Comparison of differential analysis quality and type I error control in the timecourse_4 h data set. (a) Distribution of p-values for peaks detected by both bakedpi and XCMS. (b) Like part a but for MZmine. (c) Median error rate over null permutations as a function of the nominal error rate.