Gunnar Libiseller, Michaela Dvorzak, Ulrike Kleb, Edgar Gander, Tobias Eisenberg, Frank Madeo, Steffen Neumann, Gert Trausinger, Frank Sinner, Thomas Pieber, Christoph Magnes.
Abstract
BACKGROUND: Untargeted metabolomics generates a huge amount of data. Software packages for automated data processing are crucial to successfully process these data. A variety of such software packages exist, but the outcome of data processing strongly depends on algorithm parameter settings. If they are not carefully chosen, suboptimal parameter settings can easily lead to biased results. Therefore, parameter settings also require optimization. Several parameter optimization approaches have already been proposed, but a software package for parameter optimization which is free of intricate experimental labeling steps, fast and widely applicable is still missing.
Year: 2015 PMID: 25888443 PMCID: PMC4404568 DOI: 10.1186/s12859-015-0562-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Workflow for the optimization of XCMS parameters. A pooled sample is measured sequentially within the studies. The LC-MS data of the pooled sample are then used for optimization. The DoEs are created by using Box-Behnken designs. The individual experiments of the design are calculated in parallel. Peaks are classified as reliable peaks (RP) when they are part of an isotopologue. These RPs serve as the basis for the calculation of the Peak Picking Score (PPS). Two additional scores are introduced for retention time correction and grouping. To improve the quality of retention time correction, the relative retention time deviations within the peak groups are minimized, which leads to the Retention time Correction Score (RCS). So-called ‘reliable groups’ and ‘non-reliable groups’ are defined to assess grouping. The ratio of the squared number of ‘reliable groups’ to the number of ‘non-reliable groups’, the Grouping Score (GS), is maximized within the optimization process. The resulting scores are evaluated by using response-surface models. The combination of parameters that yields the best score is used as the new center for the next DoE. The optimization process continues as long as the respective scores are increasing.
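The Grouping Score and the stopping rule described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the IPO implementation: `grouping_score`, `optimize` and `toy_design` are hypothetical names, the handling of a zero denominator is an assumption, and `toy_design` simply tries the center and its two neighbours instead of fitting a response-surface model to a Box-Behnken design.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def grouping_score(reliable_groups: int, non_reliable_groups: int) -> float:
    """Squared number of reliable groups divided by the number of non-reliable groups."""
    if non_reliable_groups == 0:
        # Assumption: the caption does not cover an empty denominator.
        return float("inf")
    return reliable_groups ** 2 / non_reliable_groups


def optimize(
    center: Params,
    run_design: Callable[[Params], Tuple[Params, float]],
    max_rounds: int = 50,
) -> Tuple[Params, float]:
    """Re-center on the best setting of each design until the score stops rising."""
    best_params, best_score = center, float("-inf")
    for _ in range(max_rounds):
        candidate, score = run_design(best_params)
        if score <= best_score:  # no improvement over the previous round: stop
            break
        best_params, best_score = candidate, score
    return best_params, best_score


def toy_design(center: Params) -> Tuple[Params, float]:
    """Hypothetical stand-in for one DoE: score the center and its neighbours."""

    def score(x: float) -> float:
        return -(x - 5.0) ** 2  # toy response surface with its maximum at x = 5

    candidates = [center["x"] + step for step in (-1.0, 0.0, 1.0)]
    best_x = max(candidates, key=score)
    return {"x": best_x}, score(best_x)


best_params, best_score = optimize({"x": 0.0}, toy_design)
```

Starting from `x = 0`, the loop moves the center one step per round and stops once a round no longer improves the score, which mirrors the "continue as long as the scores are increasing" rule in the caption.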
XCMS methods and their respective parameters optimized by IPO
| XCMS method | Optimized parameters |
|---|---|
| xcmsSet(method = ‘centWave’) | min peakwidth, max peakwidth, ppm, mzdiff |
| xcmsSet(method = ‘matchedFilter’) | fwhm, step, steps, snthresh, mzdiff |
| retcor(method = ‘obiwarp’) | profStep, gapInit, gapExtend |
| group(method = ‘density’) | bw, mzwid, minfrac |
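To make the design-of-experiments step concrete, the sketch below builds a Box-Behnken design in coded levels (-1, 0, +1) and maps it onto parameter ranges. It is an illustration under stated assumptions: `box_behnken` and `decode` are hypothetical helpers, the pairwise construction coincides with the textbook design only for three to five factors, and the ranges for `bw`, `mzwid` and `minfrac` are made-up example values, not the ranges used in the paper.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple


def box_behnken(n_factors: int, n_center: int = 1) -> List[Tuple[int, ...]]:
    """Coded Box-Behnken runs: every factor pair at (+/-1, +/-1), all others at 0."""
    if not 3 <= n_factors <= 5:
        raise ValueError("pairwise construction is only standard for 3 to 5 factors")
    runs: List[Tuple[int, ...]] = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            row = [0] * n_factors
            row[i], row[j] = a, b
            runs.append(tuple(row))
    # Center point(s): every factor at its middle level.
    runs.extend([tuple([0] * n_factors)] * n_center)
    return runs


def decode(run: Tuple[int, ...], ranges: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map coded levels onto real parameter values (-1 = low, 0 = middle, +1 = high)."""
    settings = {}
    for level, (name, (low, high)) in zip(run, ranges.items()):
        middle = (low + high) / 2
        half_width = (high - low) / 2
        settings[name] = middle + level * half_width
    return settings


# Hypothetical ranges for the three density-grouping parameters (illustration only).
ranges = {"bw": (22.0, 38.0), "mzwid": (0.015, 0.035), "minfrac": (0.3, 0.7)}
design = box_behnken(len(ranges))
experiments = [decode(run, ranges) for run in design]
```

For three factors this yields 12 edge runs plus one center point. Each decoded experiment is independent of the others, which is what allows the individual runs of a design to be calculated in parallel.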
Results of the example data sets
| | Metabolite fingerprinting | Lipidomics | Central carbon metabolism |
|---|---|---|---|
| pooled sample injections | | | |
| training set: | 12 | 4 | 6 |
| test set: | 11 | 4 | 6 |
| DoEs peakpicking | 4 | 3 | 2 |
| DoEs retcor + grouping | 5 | 5 | 4 |
| time for peakpicking optimization | 3.8 h | 1.5 h | 0.9 h |
| time for retcor + grouping optimization | 0.8 h | 0.7 h | 0.6 h |
| overall time | 4.6 h | 2.2 h | 1.5 h |

| | Metabolite fingerprinting, default | Metabolite fingerprinting, optimized | Lipidomics, default | Lipidomics, optimized | Central carbon metabolism, default | Central carbon metabolism, optimized |
|---|---|---|---|---|---|---|
| #peaks | ||||||
| training set: | 55,845 | 57,075 | 33,298 | 31,710 | 24,247 | 24,230 |
| test set: | 65,851 | 53,205 | 34,415 | 32,397 | 27,539 | 25,609 |
| #RPa | ||||||
| training set: | 6,999 | 8,433 | 12,606 | 14,367 | 2,710 | 3,351 |
| test set: | 7,587 | 7,903 | 12,999 | 14,594 | 1,582 | 1,869 |
| #LIPb | ||||||
| training set: | 15,497 | 11,645 | 15,245 | 17,284 | 11,327 | 11,490 |
| test set: | 11,163 | 10,855 | 15,643 | 17,680 | 12,646 | 10,962 |
| PPSc | ||||||
| training set: | 1,214 | 1,565 | 8,802 | 14,308 | 568 | 881 |
| test set: | 1,053 | 1,475 | 9,001 | 14,472 | 168 | 238 |
| RCSd | ||||||
| training set: | 12.3 | 144.8 | 67.8 | 575.4 | 92.8 | 311.8 |
| test set: | 9.4 | 142.4 | 37.6 | 580.4 | 48.1 | 206.7 |
| #reliable groups | ||||||
| training set: | 536 | 990 | 3,669 | 5,343 | 1,504 | 2,424 |
| test set: | 314 | 759 | 1,564 | 5,639 | 793 | 1,855 |
| #non-reliable groups | ||||||
| training set: | 2,636 | 82 | 3,605 | 151 | 1,217 | 101 |
| test set: | 2,740 | 70 | 3,248 | 110 | 1,150 | 69 |
| GSe | ||||||
| training set: | 109 | 11,952 | 3,734 | 189,057 | 1,859 | 58,176 |
| test set: | 36 | 8,230 | 753 | 289,076 | 547 | 49,870 |
a) RP: reliable peaks; b) LIP: low intensity peaks; c) PPS: peak picking score; d) RCS: retention time correction score; e) GS: grouping score
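The PPS values in the table can be related to the peak counts with a short check. The formula used here, PPS = RP² / (#peaks − #LIP), is not stated in this section; it is an assumption that happens to reproduce the tabulated values after rounding. The function name `peak_picking_score` is hypothetical, and the numbers are copied from the first data set's default and optimized columns.

```python
def peak_picking_score(reliable_peaks: int, all_peaks: int, low_intensity_peaks: int) -> float:
    """Assumed PPS: squared reliable peaks over peaks that are not low-intensity peaks."""
    remaining = all_peaks - low_intensity_peaks
    if remaining <= 0:
        raise ValueError("need more peaks than low intensity peaks")
    return reliable_peaks ** 2 / remaining


# (#RP, #peaks, #LIP, PPS reported in the table) for the first data set.
rows = {
    "training set, default": (6999, 55845, 15497, 1214),
    "training set, optimized": (8433, 57075, 11645, 1565),
    "test set, default": (7587, 65851, 11163, 1053),
    "test set, optimized": (7903, 53205, 10855, 1475),
}

# Recompute the score for each row and compare it with the reported value.
recomputed = {
    label: round(peak_picking_score(rp, peaks, lip))
    for label, (rp, peaks, lip, _reported) in rows.items()
}
matches = {label: recomputed[label] == rows[label][3] for label in rows}
```

Under this assumed formula a parameter set is rewarded for finding more reliable peaks and penalized for every additional peak that is neither reliable nor low in intensity, which is why the optimized settings can score higher even when the total peak count drops.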
Figure 2. Selected chromatograms showing the different peak types at well-defined masses obtained from the different data sets. Chromatograms derive from a) the metabolite fingerprinting data set; b) the lipidomics data set; c) the central carbon metabolism data set. Peaks obtained with default parameters are shown in the left chromatograms and peaks obtained with optimized parameters in the right chromatograms. The peak area integrated by XCMS is colored red. The m/z range for each chromatogram was chosen from the minimum and maximum m/z values of the particular peak. The chromatograms in a) demonstrate that the default peak width parameters were too small for the broad peaks, b) shows an example where the mass range used in the default settings was too wide, and c) illustrates peaks where the default peak width parameters were too wide.
Peak width parameter settings and resulting peak width statistics of the training sets
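| | Metabolite fingerprinting, default | Metabolite fingerprinting, optimized | Lipidomics, default | Lipidomics, optimized | Central carbon metabolism, default | Central carbon metabolism, optimized |
|---|---|---|---|---|---|---|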
| ‘peakwidth’ parameter [sec] | 20-50 | 32.2-95 | 20-50 | 29.6-80 | 20-50 | 10-35 |
| mean peak width [sec] | 44.2 | 57.9 | 44.6 | 58.4 | 27.3 | 15.6 |
| median peak width [sec] | 40.6 | 52.2 | 41.8 | 54.5 | 24.4 | 12.6 |
| modal peak width [sec] | 38.9 | 51.3 | 41.4 | 56.8 | 10.3 | 5.8 |