Literature DB >> 24688694

Spectral deconvolution for gas chromatography mass spectrometry-based metabolomics: current status and future perspectives.

Abstract

Mass spectrometry coupled to gas chromatography (GC-MS) has been widely applied in the field of metabolomics. Success of this application has benefited greatly from computational workflows that process the complex raw mass spectrometry data and extract the qualitative and quantitative information of metabolites. Among the computational algorithms within a workflow, deconvolution is critical since it reconstructs a pure mass spectrum for each component that the mass spectrometer observes. Based on the pure spectrum, the corresponding component can be eventually identified and quantified. Deconvolution is challenging due to the existence of co-elution. In this review, we focus on progress that has been made in the development of deconvolution algorithms and provide thoughts on future developments that will expand the application of GC-MS in metabolomics.

Entities: Chemical Disease Gene Species

Keywords: Metabolomics; bioinformatics; mass spectrometry

Year: 2013 PMID： 24688694 PMCID： PMC3962095 DOI： 10.5936/csbj.201301013

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Metabolomics is the comprehensive qualitative and/or quantitative study of a metabolome (the set of metabolites synthesized by an organism, tissue, or cells) and as such provides measurements essential for systems biology approaches for the study of health and disease. It has benefited greatly from advances in analytical technologies including nuclear magnetic resonance (NMR) and mass spectrometry (MS) coupled to separation techniques. The advantages of NMR are the minimal requirements for sample preparation and the non-discriminating and non-destructive nature of the technique. However, the low sensitivity of NMR makes it difficult to detect low-abundance metabolites that could constitute key discoveries of new biomarkers or biological mechanisms. Mass spectrometry-based metabolomics offers high selectivity and sensitivity and, more importantly, the potential to identify metabolites. Combining MS with separation techniques reduces the complexity of the mass spectra due to metabolite separation in the time dimension and provides additional information about the physical and chemical properties of the metabolites [1, 2]. Due to these advantages, the MS-based metabolomics approach is being widely used in food and nutrition research [3], plant science [4], marine science [5], environmental science [6], drug development and toxicology studies [7, 8], and many others. Obtaining a full coverage of the metabolome generally requires multiple separation approaches since metabolites are heterogeneous, low molecular-weight components (less than 1,500 Da) that are characterized by a wide variation in physical and chemical properties (e.g., polarity, volatility, and solubility) [9]. Three types of separation techniques are commonly used in MS-based metabolomics: liquid chromatography (LC), gas chromatography (GC), and capillary electrophoresis (CE). GC combined with electron ionization (EI) mass spectrometry allows for the identification and quantification of volatile and thermally stable compounds of low polarity from sources such as biological tissues and foods [10]. The fragmentation of metabolites during EI are highly characteristic of the chemical structure, allowing these mass spectra to be used for identification of compounds from mass spectral libraries [11]. Due to its high reproducibility, chromatographic peak resolution, and the existence of libraries of mass spectra, GC-EI-MS is regarded as the gold standard for metabolomics research [12]. However, GC-MS is incompatible with nonvolatile and thermally labile compounds. As a result, derivatization methods have been developed to make these metabolites less polar, more volatile and/or thermally stable so that they can be analyzed on GC-MS. Complementary to GC-MS, LC-MS enables identification and quantification of high polarity compounds including organic acids, fatty acids, amino acids, and steroids. LC-MS-based metabolomics is generally performed using soft ionization techniques such as electrospray ionization (ESI) and atmospheric pressure chemical ionization (APCI), which do not cause fragmentation of the molecular ions and thus allow for the determination of elemental compositions. In addition, LC-MS/MS can be used to identify metabolites [10, 13]. The third separation technique, CE, is suited for the separation of polar and charged compounds, as compounds are separated on the basis of their electrophoretic mobility. However, the repeatability of migration time in CE is very poor compared to that of the retention time in GC and LC. This factor is especially critical in metabolomics because a high accuracy of metabolite peak alignment must be achieved prior to downstream data analyses. Large variations in migration time make alignment very challenging [14]. MS-based metabolomics experiments using these separation techniques can be conducted using three approaches: 1) non-targeted metabolic profiling where the identity and relative quantity of as many metabolites as possible are obtained; 2) targeted profiling where the absolute quantity of a pre-selected smaller set of metabolites, typically related by chemical or biological similarity, are obtained using internal standards and reference compounds, and 3) metabolic fingerprinting where a global snapshot of the metabolism is acquired and compared without performing quantification and chemical identification. Biologically interesting components can then be subjected to targeted profiling for identification and quantification. These aforementioned analytical platforms (LC-, GC, and CE-MS) and metabolomic approaches (profiling and fingerprinting) have been applied widely in metabolomics research. A number of excellent review articles have been published summarizing this work [3, 15–24]. Carrying out a metabolomics study generally involves a sequence of steps: experimental design, sample collection and preparation, analysis of samples on analytical platforms, and data handling. Among these steps, the last step relies heavily on bioinformatics. Because mass spectrometry-based metabolomics studies generally produce large and complex datasets, the bioinformatics involved is nontrivial and requires specialized computational algorithms and software tools. For metabolic profiling where identification and quantitation (semi-quantitation for non-targeted and absolute quantitation for targeted profiling) are performed, bioinformatics includes three sequential steps: (1) data processing converts mass spectral data into tables of known or unknown metabolites with their identity and quantity, (2) data analysis identifies interesting metabolites or metabolic patterns through statistical analyses, clustering, and classification, and (3) data interpretation places the metabolomics data in the context of metabolic pathways and integrates metabolomics data with data from other omics platforms. Combination of the three steps is essential for knowledge discovery from the metabolomics experiments. Among the three data handling steps, processing of raw mass spectral data is critical because any inaccuracy in this stage will propagate to the subsequent two steps. For a GC-MS based metabolomics study, data processing is even more critical due to co-elution of two or more compounds and the in-source fragmentation of molecular ions caused by the hard EI ionization. Co-elution and in-source fragmentation cause the resulting raw mass spectra to consist of mass peaks from all of the co-eluting metabolites. In order to identify and extract the quantitative information of the corresponding metabolites, the spectrum for each single metabolite has to be constructed based on the composite spectra. This spectrum construction step is called deconvolution in GC-MS data processing. Considering the importance of GC-MS for analyzing compounds commonly observed in fruits, vegetables, nutritional and medicinal plants, and human biofluids after food digestion, we would focus on the deconvolution aspect of GC-MS data processing in this review. We will review the progress that has been made so far, and provide our thoughts on future developments that will enable further progress in metabolomics research. For the other aspects of data processing and the subsequent data analysis and interpretation, we refer readers to the research article by Stein el al. [25], review articles by Wishart [26, 27] and by Halket et. al. [10].

Deconvolution

For GC-MS data, deconvolution is the process of computationally separating co-eluting components and creating a pure spectrum for each component. Specifically, for each observed EIC that results from two or more components, deconvolution calculates the contribution of each component to the EIC. Figure 1 depicts the necessity of deconvolution.

Figure 1

Comparison of constructed mass spectra and subsequent metabolite identification results with and without accurate deconvolution of shared peaks from two co-eluting compounds, uridine (Left) and n-Eicosanoic acid (Right). (A) Raw EICs of selected masses. Mass 43, 73, and 117 marked with red circles are shared by both compounds. Mass 217 is unique to uridine while mass 132 is unique to n-Eicosanoic acid. (B1-2) Constructed mass spectra of uridine (B1) and n-Eicosanoic acid (B2) after deconvolution using ADAP 1.0. The shared masses 43, 73, and 117 are only included either in the spectrum for n-Eicosanoic acid or in uridine. Their matching scores are 810 and 881, respectively. (C1-2) Constructed mass spectra after deconvolution that decomposes shared peaks. Each of the shared masses, 43, 73, and 117, is included in the spectra for both n-Eicosanoic acid and uridine. Their matching scores are 909 and 948, respectively. (D1-2) Reference spectra from an in-house library. This Figure is Figure 1 in the original article [33]. Reprinted with permission from the American Chemical Society. Deconvolution had evolved based on the work of a number of researchers [28, 29] and was popularized with the publication of the AMDIS (Automated Mass Spectrometry Deconvolution and Identification System) algorithm [30] and subsequent development of the software tool [31]. The principle behind AMDIS also formed the basis for subsequent developments of other deconvolution algorithms including MetaboliteDetector [32] and ADAP-GC [33]. These three software tools are freely available. Commercial software tools have also been developed that include ChromaTOF [34] and AnalyerPro [35]. As far as we know, the technical details of the latter two software tools have not been released. Next we will examine the deconvolution algorithms implemented in AMDIS, MetaboliteDetector, and ADAP-GC.

Spectrum deconvolution by AMDIS

The overall deconvolution process in AMDIS consists of four sequential steps: 1) noise analysis, 2) component perception, 3) model shape determination (The original paper included this step as part of step 2 [30]. For convenience of comparison with other algorithms in this review, we describe this step separately), and 4) spectrum deconvolution. The first step extracts the noise characteristics for a GC-MS data file by calculating the noise factor to be used for representing signal magnitude in noise units. Noise factor is conceptually defined as Briefly, each EIC and the total ion chromatogram (TIC) are divided into segments of a certain number of scans (13 scans were used in the original publication [30]). For each segment that has no zero abundance values, a mean abundance is computed and the number of times that this mean value is crossed is counted. If the number of crossings is greater than one-half of the number of scans (i.e., 7 scans when the segment is 13 scans wide), this segment is accepted and the median deviation from the mean abundance for that segment is found (Figure 2). This deviation is the average random deviation within this segment. It is then divided by the square root of the mean abundance for that segment to obtain a segment-specific N value as defined in Eq. 1. The median of all of the segment-specific N values is taken as the characteristic N value for the entire GC-MS data file. The square root of a signal multiplied by N is the magnitude of this signal in ‘noise units’.

Figure 2

Illustration of the determination of the noise factor (N f) from 13- scan ion chromatogram segments. The upper chromatogram is rejected because it has fewer than seven “crossings” of the mean. The lower ion chromatogram crosses the mean eight times, so provides a sample noise factor. The median distance from the mean (seventh closest to the mean) is used to generate a sample noise factor N f. The final N f for the analysis is taken as the median of all sample values. This Figure is Figure 1 in the original article [30]. Reprinted with permission from Elsevier. The second step, component perception, perceives individual chromatographic components. The rationale behind component perception is that a component exists when a sufficient magnitude of ions maximize together. It is achieved through a process as illustrated in Figure 3 where the deconvolution window is established first. When determining a deconvolution window, AMDIS “sequentially examines scans starting at the scan of maximization and proceeds in the forward and reverse directions up to a pre-set maximum number of scans (12 is the default). If a signal abundance is encountered that is more than five noise units greater than the smallest abundance between that scan and the starting scan (with noise units measured for the smallest abundance), then it is presumed that another component has been found and the window length is set to the preceding scan. Also, if the intensity falls below 5% of the maximum intensity, the window is fixed at that scan.”

Figure 3

Four steps for determining whether an ion chromatogram peak is large enough to be used for peak perception. (1) A scan window is set using minima on each side of the peak; (2) a tentative baseline is drawn between the lowest points on each side (readjusted if a point between these end points falls below the line); (3) a least-squares line is drawn using the lowest one-half of points as measured from the baseline in step 2; (4) signal height between the maximum and least squares line is computed. Peaks must have heights larger than four noise units (N f v ) for use in peak perception (A is the absolute abundance at the peak maximum). This Figure is Figure 2 in the original article [30]. Reprinted with permission from Elsevier. The third step determines the model peaks to be used in the next step for deconvolution. The model shape for each perceived component is taken as the sum of the individual ion chromatograms that maximize together and whose sharpness values are within 75% of the maximum value for this component. The sharpness value between the maximum abundance, A , and an abundance value located n scans from the maximum, A is defined as: The maximum sharpness values on each side of the maximum scan are averaged as the sharpness value of this ion chromatogram. The last step, deconvolution, extracts ‘purified’ spectra from individual ion chromatograms for each component using the model shapes and the least-squares method. Briefly, each ion chromatogram is fit to the model profiles allowing a linear baseline: This four-step process of deconvolution that AMDIS uses forms the basis of deconvolution in MetaboliteDetector [32] and ADAP-GC 2.0 [33]. Even though the deconvolution principle underlying these three algorithms are similar, the algorithms differ in details that we describe next.

Spectrum deconvolution by MetaboliteDetector

Deconvolution in MetaboliteDetector differs from AMDIS in component perception and model peak determination. For component perception, it detects the beginning and ending of chromatographic peaks by calculating the first derivative of the smoothed intensity values (Figure 4):

Figure 4

Single ion chromatographic peak detection. The peak borders are determined based on the first derivative of the intensity values. The red line represents the intensity values vs. retention time. The blue line depicts the first derivative of the intensity values. If the values of the first derivative cross the peak threshold, a peak begin or end is set (dotted lines). This Figure is Figure 2 in the original article [32]. Reprinted with permission from American Chemical Society. A resulting peak is counted as valid if three criteria are met: the peak must consist of more than three values, the height above the baseline in signal-to-noise units of the maximum peak value must exceed a predefined threshold, and the quality of the peak shape must be in a certain range. The quality of a peak shape, named discrepancy index, is defined based on the assumption that all values of the first derivative of an ideal single peak must be positive from the peak beginning to the peak maximum and negative from the peak maximum to the peak ending. The absolute values of the first derivatives that agree with this assumption are summed as ideal slopes and the absolute values of the first derivatives that disagree with this assumption are summed as nonideal slopes. The discrepancy index q of a peak shape is formally defined as the ratio of the nonideal to ideal slopes: Reasonable values of q are in the range between 0% and 10%. To determine the model peak shape for each perceived component, MetaboliteDetector sorts all single ion peaks of a compound having q values below 10% by their sharpness values. The top 25% of the peaks in terms of the sharpness value are summed to form the model peak for this compound.

Spectrum deconvolution by ADAP-GC 2.0

Deconvolution in ADAP-GC 2.0 differs from AMDIS and MetaboliteDetector in terms of component perception, noise analysis, and model peak determination (Figure 5). Specifically, ADAP defined the concept of chromatographic peak features (CPFs).

Figure 5

Illustration of the data analysis workflow of ADAP-GC 2.0 using two co-eluting compounds. (A) Detection of CPFs from TIC and determination of deconvolution windows (indicated by blue vertical lines). Two representative CPFs are displayed: one simple CPF marked by a blue solid circle at the apex and one composite CPF marked by red and green solid circles at the apexes. (B-F) Deconvolution of the EIC CPFs that have given rise to this composite TIC CPF. (B) Raw EICs of 46 good candidates. (C) The constructed mirror images of the 46 good candidates. (D) Determination of the number of components and corresponding model CPFs for each component using hierarchical clustering. The red dashed line indicates the empirical cutoff for determining the number of clusters. (E) The composite CPFs of masses 73, 147, 245, and 273 (solid line) were decomposed into simple CPFs (dashed line). (F) Two mass spectra were constructed and identified. This Figure is Figure 3 in the original article [33]. Reprinted with permission from the American Chemical Society. A CPF is the elution profile of a minimum number of components that makes the elution profile complete, with ‘complete’ meaning that the elution profile lasts from the beginning to the end of the elution of the component(s). A CPF that results from a single component is defined as a simple CPF, and a CPF that results from two or more components is defined as a composite CPF. A simple CPF has only one local maximum, and a composite CPF could have one, two, or more local maxima. ADAP-GC 2.0 detects a CPF by determining the beginning, ending, and apex time of each local peak based on local maximum (for peak apex) and minimum (for beginning and ending). To determine if a peak is a simple CPF or part of a composite CPF, the algorithm calculates the ratio of intensity values at the boundaries to the intensity value at the peak apex. If one of the ratios is higher than a configurable threshold, this peak is considered part of a composite CPF. All of the neighboring incomplete peaks are then merged to form a composite CPF. Subsequently, ADAP-GC 2.0 determines the deconvolution windows based on both the TIC and EIC. Basically, the beginning and ending of a TIC CPF delimit a window and any EIC CPF whose peak apex falls in the window will participate in the window-specific deconvolution. For each deconvolution window, ADAP determines the number of components and the model peak shape for each component. Unlike the model peak in AMDIS and MetaboliteDetector, ADAP-GC 2.0 defines a model CPF as the elution profile of a compound when it elutes from a chromatography system alone and its concentration is within the linear dynamic range of the mass spectrometer. As such, a model peak can result from the elution of a single component only. To determine the model CPF, ADAP first calculates five quality metrics of each CPF. These five metrics are sharpness, signal-to-noise ratio (SNR), peak intensity, Gaussian similarity, and the mass. The sharpness value of a CPF is calculated as Where N is the total number of time points for a CPF, p is the time index of the apex, and I is the abundance value at time index i. The SNR is estimated based on the high- and low-frequency signal components of the CPF, which is calculated using the continuous wavelet transform. For details about other metrics, please refer to the original article. Following the calculation of the five metrics, ADAP selects those CPFs with high values of sharpness, SNR, and Gaussian similarity. Each of the filter-passing CPFs is assigned a total quality score calculated as Those CPFs whose total score pass the following threshold are considered good candidates of model CPFs: Finally, all of the good candidates for model CPFs participate in a hierarchical clustering process for ADAP to determine the most likely number of components in the current deconvolution window and the corresponding model peak for each component.

Conceptual comparison of deconvolution algorithms in AMDIS, MetaboliteDetector, and ADAP-GC

Each of the aforementioned computational steps plays an important role in determining whether or not metabolic changes can be detected in metabolomics studies. Noise analysis: Mass spectra and chromatograms that are obtained from the spectra are inherently noisy. All of the three algorithms we described above incorporated noise analysis. AMDIS and MetaboliteDetector calculate the noise factor of a GC-MS data file and then convert the signal magnitude in noise units in some of the subsequent calculations. ADAP-GC 2.0 uses a completely different approach by directly computing the signal-to-noise ratio of each CPF and uses it as a filter to prevent noisy CPFs from being selected as model CPFs. Without direct comparison between these two approaches, it is challenging to state which one leads to better performance. Determination of deconvolution windows: AMDIS's approach often causes deconvolution windows to be narrower than optimal for quantitation, as pointed out in the original article [30]. This happens in two scenarios where a deconvolution window contains only part of the entire chromatographic peaks. One is when a chromatographic peak is very intense, the deconvolution window is fixed at 5% of the maximum intensity, and the remaining part of the peaks that is below 5% of the maximum intensity is outside of the deconvolution window. If neighboring co-eluting components are in very low concentrations, their abundance will be much lower than that of the dominating peak and most of their chromatograms will be outside of the deconvolution window. We have observed many examples of this case in our own work. The other scenario is when the algorithm determines that a co-eluting component is encountered because the abundance value is more than five noise units greater than the smallest abundance between that scan and the maximization scan. Since the window is set to the preceding scan, part of the peak will be left out of the deconvolution window as well. In order to resolve this issue, ADAP-GC 2.0 tries to detect composite CPFs and ensure that their deconvolution windows contain the entire CPF. Conceptually, this should increase the accuracy of the quantitative information that ADAP extracts about metabolites from the data. In MeteboliteDetector, how deconvolution windows are determined was not explicitly described. Regardless of the specific approach, it must utilize the information about the beginning and ending of chromatographic peaks. MetaboliteDetector determines the beginnings and endings using first derivatives as in Eq. 4. For details, please refer to the original article. This approach tends to be very sensitive to noise because the derivative operation amplifies noise [36]. Even though chromatograms are smoothed prior to this step, fluctuations can still exist along a chromatographic peak and cause the algorithm to decide that a beginning or ending of the chromatographic peak has been encountered. Consequently, part of the peak will be left out of the deconvolution window and the quantitative information extracted from the data about the corresponding component will be reduced. Determination of model peaks: The central goal of deconvolution is to decompose a composite CPF into the weighted summation of the model peaks and then form the spectrum of a single component based on the weights in Eqn. 3. Specifically, the resulting c * M (nmax) in Eqn. 3, where nmax denotes the scan with the maximum model peak abundance, will be used as the abundance for the corresponding m/z in the extracted spectrum. Conceptually, it should be preferred that each model peak corresponds to one single component only. However, since both AMDIS and MetaboliteDetector use a summation of peaks as the model peak and one or more of the constituent peaks could consist of signals from multiple co-eluting components, the likelihood that the final model peak also consists of signals from multiple co-eluting components is very high. Consequently, the magnitude of the mass peaks in the extracted spectrum will be inaccurate, which could ultimately cause both false positive and false negative metabolite identifications. Moreover, the quantitative information for corresponding components will be inaccurate as well since it is calculated based on the extracted spectrum. ADAP, however, selects only the purest peak for model peaks by using five peak quality metrics. On the other hand, ADAP has a weakness too in determining model peaks. It favors model peaks that are symmetric and resemble a Gaussian curve. In reality, fronting and tailing do occur and cause asymmetric peak shapes even when a compound elutes alone from the chromatography system. This issue can conceptually be alleviated or eliminated by decreasing the weighting factor that is assigned to the Gaussian similarity in Eq. 7. Of course, testing is needed to check if the overall performance of the algorithm is affected as a result of this change. Based on the above description, we can see that it will be worthwhile to carry out a detailed comparison about the performance of the critical computational steps and come up with the best overall deconvolution strategy. Even though the aforementioned three algorithms have not been directly compared, a comparison between AMDIS and two commercial software packages, ChromaTOF [37] and AnalyzerPro [35], has been performed by Lu et. al [12]. The article concluded that none of these approaches provided a comprehensive solution meeting the specific needs of metabolomics [12]. Specifically, 1) AnalyzerPro tends to produce a great number of false negatives, 2) ChromaTOF and AMDIS tend to produce multiple peak assignments that clearly correspond to a single chromatographic peak and chemical entity, and 3) all of them are still fairly slow for the flood of data from high-throughput metabolomics studies. Since details about the deconvolution algorithms in ChromaTOF and AnalyzerPro are unknown, we are unaware of the possible factors that could have caused their respective issues. For AMDIS, it was originally developed for automated identification of chemical weapons and related compounds [30]. Therefore, there is a strong emphasis in AMDIS on low false negative rates for metabolite identification. Since identification is achieved by matching the ‘purified’ spectra for single components that are obtained from deconvolution against libraries of reference mass spectra, each step of the deconvolution process needs to ensure that as few components as possible are missed. As a result, AMDIS is best suited for analyzing simple mixtures consisting of a small number of compounds. When analyzing complex mixtures, time-consuming manual checking of the results is necessary in order to reject false positive identifications [32]. Appropriate configuration of processing parameters can help alleviate the challenge to find an acceptable balance between false positive and false negative identifications. However, without prior knowledge about the sample composition in un-targeted metabolic profiling experiments, researchers usually do not know what parameter setting is appropriate for their datasets. This issue exists in other algorithms as well.

Summary and Outlook

The past decade has witnessed tremendous progress on metabolomics bioinformatics research. However, the progress has not been as fast as that on the instrumentation side as mass spectrometry coupled to chromatography is becoming increasingly sensitive and their operations are becoming increasingly high-throughput. Next, we provide our thoughts on new bioinformatics capabilities for GC-MS data processing that are needed for metabolomics to progress further. More efficient and reliable deconvolution algorithms need to be developed for identification and quantification of metabolites due to the aforementioned limitations of existing deconvolution algorithms. Newly developed algorithms need to be implemented into user-friendly and high-throughput software tools. These software tools should be equipped with visualization capabilities that will allow metabolomics researchers to visually examine intermediate and final results for verifying the correctness of significant metabolites detected in the data analysis and data interpretation stages. Development of these algorithms and software tools will benefit greatly from an interdisciplinary team of researchers and software engineers with engineering, chemistry, math, and computer science background. Computational algorithms and software tools are also needed for identifying unknown metabolites. This is because, with unprecedented sensitivity, dynamic range, and throughput, both GC-MS and LC-MS analytical platforms can now detect many metabolites in a short time frame that could not be observed before. However, the power of the detection methods has outstripped our ability to process the data. Often, the number of unidentified metabolites in a sample is 2-3-fold more than the number of identified metabolites even after extensive data processing. This is because even the largest and most comprehensive of currently available libraries contain only a small portion of the endogenous metabolites found in biological samples [38-41]. Identifying these unknown metabolites is currently a major bottleneck in metabolomics [42, 43]. This issue is even more acute in studies investigating plant products because diverse plant species are estimated to produce more than 200,000 metabolites of enormous biochemical diversity of which only 10,000 chemical structures are known [4, 11]. Even the extensively studied model plant Arabidopsis thaliana has a large number of enzymes whose substrates and products remain unknown [44]. Meanwhile, commercially available standard reagents, especially those of secondary metabolites produced by plants, are very limited in number. Given the fundamental importance of biochemicals to agriculture, nutrition, and health, the potential benefit of identifying even half of these unknown metabolites is astronomical. Therefore, a pressing need in the metabolomics community is the development of effective methods for prioritizing, studying, and ultimately identifying uncharacterized metabolites [40]. To address this need, novel bioinformatics capabilities to process raw mass spectral data and to create libraries of unidentified spectra must be developed [10, 11, 26, 38, 45, 46]. For GC-MS-based metabolomics, some progress has been made in the effort to use and catalogue unidentified GC-MS spectra in metabolomics studies [4, 37, 38, 46–64]. Among these efforts, most rely on nonsystematic, low-throughput, manually assisted curation of unidentified peaks [38], and only a few use more advanced approaches with minimal manual annotation. These latter efforts include the work on developing the volatile compound mass spectral database named vocBinBase [63], developing a spectral library of unknown compounds for urine [46], and identifying conserved metabolites [38]. These three projects used spectra extracted from the raw mass spectral data by AMDIS or ChromaTOF (LECO Corp., Michigan, USA). In constructing the vocBinBase database, unknown spectra are filtered and only the filter-passing spectra are imported into the database. Unfortunately, since each spectrum in the database is obtained from only a single sample, it is very likely that it is not the best representation of the corresponding compound. Missing, inaccurate, or false peaks in these spectra will cause future false positive and false negative identifications, especially as the database grows and different compounds with similar spectra are incorporated into the library. A better approach is to construct a consensus spectrum from many measurements of the same compound. Consensus spectra are indeed used in some of these developments [38]. However, inaccuracies in these spectra still need to be corrected based on analyses of extracted spectra from multiple samples. This step is very important because inaccuracies are very likely to occur due to factors such as co-elution, low signal-to-noise ratios in some regions of the chromatograms, and inappropriate parameter settings in the process of chromatographic peak picking and deconvolution [12]. This spectra correction step will significantly increase our confidence in the consensus spectra. In addition, it will improve the accuracy of the quantitative information about the corresponding unknown metabolites, which is critical to comparative metabolomics In addition to the above approaches, the widely used GOLM GC-MS mass spectral database incorporated mass spectra of unidentified metabolites. However, similar to the vocBinBase database, no derivation of consensus spectra or determination of the best representative spectra was implemented [65]. For LC-MS-based metabolomics, the creation of multiple databases of LC-MS/MS mass spectra has made significant contributions to the field, including METLIN [13] and MassBank [66]. However, there has been no report of systematically creating spectral libraries of unknown compounds. This lack of capabilities to handle unidentified GC-MS and LC-MS spectra reveals that metabolomics bioinformatics is lagging behind technological advances in analytical instrumentation, and this is hindering investigators from taking advantage of data that the instruments already capture.

56 in total

Review 1. Analytical metabolomics: nutritional opportunities for personalized health.

Authors: Elizabeth M S McNiven; J Bruce German; Carolyn M Slupsky
Journal: J Nutr Biochem Date: 2011-11 Impact factor: 6.048

Review 2. Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS.

Authors: John M Halket; Daniel Waterman; Anna M Przyborowska; Raj K P Patel; Paul D Fraser; Peter M Bramley
Journal: J Exp Bot Date: 2004-12-23 Impact factor: 6.992

3. Exploring the temperature-stress metabolome of Arabidopsis.

Authors: Fatma Kaplan; Joachim Kopka; Dale W Haskell; Wei Zhao; K Cameron Schiller; Nicole Gatzke; Dong Yul Sung; Charles L Guy
Journal: Plant Physiol Date: 2004-11-19 Impact factor: 8.340

Review 4. Metabolomics for plant stress response.

Authors: Vladimir Shulaev; Diego Cortes; Gad Miller; Ron Mittler
Journal: Physiol Plant Date: 2008-02 Impact factor: 4.500

5. Challenges and developments in tandem mass spectrometry based clinical metabolomics.

Authors: Uta Ceglarek; Alexander Leichtle; Mathias Brügel; Linda Kortz; Romy Brauer; Kristin Bresler; Joachim Thiery; Georg Martin Fiedler
Journal: Mol Cell Endocrinol Date: 2008-10-19 Impact factor: 4.102

Review 6. Nutrigenomics in human intervention studies: current status, lessons learned and future perspectives.

Authors: Jonas Wittwer; Isabel Rubio-Aliaga; Birgit Hoeft; Igor Bendik; Peter Weber; Hannelore Daniel
Journal: Mol Nutr Food Res Date: 2011-01-31 Impact factor: 5.914

Review 7. Harnessing metabolomics for nutrition research.

Authors: Michael A Zulyniak; David M Mutch
Journal: Curr Pharm Biotechnol Date: 2011-07 Impact factor: 2.837

8. Comprehensive analysis of metabolites in Corynebacterium glutamicum by gas chromatography/mass spectrometry.

Authors: Sergey Strelkov; Mirko von Elstermann; Dietmar Schomburg
Journal: Biol Chem Date: 2004-09 Impact factor: 3.915

Review 9. Monitoring healthy metabolic trajectories with nutritional metabonomics.

Authors: Sebastiano Collino; François-Pierre J Martin; Sunil Kochhar; Serge Rezzi
Journal: Nutrients Date: 2009-09-04 Impact factor: 5.717

10. The critical evaluation of a comprehensive mass spectral library.

Authors: P Ausloos; C L Clifton; S G Lias; A I Mikaya; S E Stein; D V Tchekhovskoi; O D Sparkman; V Zaikin; D Zhu
Journal: J Am Soc Mass Spectrom Date: 1999-04 Impact factor: 3.262

14 in total

1. MetHoS: a platform for large-scale processing, storage and analysis of metabolomics data.

Authors: Konstantinos Tzanakis; Tim W Nattkemper; Karsten Niehaus; Stefan P Albaum
Journal: BMC Bioinformatics Date: 2022-07-08 Impact factor: 3.307

2. The importance of GC-MS date processing and analysis strategies suitable for plant and environmental metabolomics : with references to Changes in the abundance of sugars and sugar-like compounds in tall fescue (Festuca arundinacea) due to growth in naphthalene-treated sand.

Authors: Kai P Law; Ting-Li Han
Journal: Environ Sci Pollut Res Int Date: 2016-04-06 Impact factor: 4.223

3. Targeting unique biological signals on the fly to improve MS/MS coverage and identification efficiency in metabolomics.

Authors: Kevin Cho; Michaela Schwaiger-Haber; Fuad J Naser; Ethan Stancliffe; Miriam Sindelar; Gary J Patti
Journal: Anal Chim Acta Date: 2021-01-12 Impact factor: 6.558

Review 4. Analytical methods in untargeted metabolomics: state of the art in 2015.

Authors: Arnald Alonso; Sara Marsal; Antonio Julià
Journal: Front Bioeng Biotechnol Date: 2015-03-05

5. Metabolite profiling of Dioscorea (yam) species reveals underutilised biodiversity and renewable sources for high-value compounds.

Authors: Elliott J Price; Paul Wilkin; Viswambharan Sarasan; Paul D Fraser
Journal: Sci Rep Date: 2016-07-07 Impact factor: 4.379

6. Dereplication of Natural Products Using GC-TOF Mass Spectrometry: Improved Metabolite Identification by Spectral Deconvolution Ratio Analysis.

Authors: Fausto Carnevale Neto; Alan C Pilon; Denise M Selegato; Rafael T Freire; Haiwei Gu; Daniel Raftery; Norberto P Lopes; Ian Castro-Gamboa
Journal: Front Mol Biosci Date: 2016-09-30

7. Optimizing 2D gas chromatography mass spectrometry for robust tissue, serum and urine metabolite profiling.

Authors: Zhanru Yu; Honglei Huang; Alexander Reim; Philip D Charles; Alan Northage; Dianne Jackson; Ian Parry; Benedikt M Kessler
Journal: Talanta Date: 2017-01-07 Impact factor: 6.057

8. Recommendations for Improving Identification and Quantification in Non-Targeted, GC-MS-Based Metabolomic Profiling of Human Plasma.

Authors: Hanghang Wang; Michael J Muehlbauer; Sara K O'Neal; Christopher B Newgard; Elizabeth R Hauser; James R Bain; Svati H Shah
Journal: Metabolites Date: 2017-08-25

Review 9. Metabolomics and Its Application to Acute Lung Diseases.

Authors: Kathleen A Stringer; Ryan T McKay; Alla Karnovsky; Bernadette Quémerais; Paige Lacy
Journal: Front Immunol Date: 2016-02-29 Impact factor: 7.561

10. Specter: linear deconvolution for targeted analysis of data-independent acquisition mass spectrometry proteomics.

Authors: Ryan Peckner; Samuel A Myers; Alvaro Sebastian Vaca Jacome; Jarrett D Egertson; Jennifer G Abelin; Michael J MacCoss; Steven A Carr; Jacob D Jaffe
Journal: Nat Methods Date: 2018-04-02 Impact factor: 28.547