Bioinformatic tools are required
to carry out essential functions such as statistical analyses and
database functionalities. Now, they are also needed for one of the
most difficult tasks, helping researchers decide which metabolites
are the most biologically meaningful. This can be achieved through
aiding the identification process, reducing feature redundancy, putting
forward better candidates for tandem mass spectrometry (MS/MS), speeding
up or automating the workflow, deconvolving the feature list through
meta-analysis or multigroup analysis, or using stable isotopes and
pathway mapping. This review thus focuses on the most recent and innovative
bioinformatic advancements for identifying metabolites.
A primary
objective of metabolomics beyond biomarker discovery is to identify
the most meaningful metabolites that correlate with disease pathogenesis
or other perturbations of metabolism. Metabolites play important roles
in biological pathways; their flux or differential regulation (dysregulation)
can reveal novel insights into disease and environmental influences.
Therefore, one of the most important goals of metabolomic analysis
has been to assign metabolite identity so they can be used for further
statistical and informed pathway analysis.[1,2] Over
the past few years, technologies for analyzing metabolites by untargeted
or targeted metabolomics have undergone extensive improvements. Strides
to establish the most efficient protocols for experimental design,
sample extraction techniques, and data acquisition have paid off, providing
robust, complex data sets.[3−9] As more is being required of these data sets, such as assigning identity
and biological meaning to the features, bioinformatics is the area
of metabolomics that is currently undergoing the most needed growth.
It is often the case that metabolomic analysis results in a list
of metabolites with low specificity for the disease or stimulus being
studied (Figure 1). Some of these metabolites,
such as acylcarnitines[10−13] and fatty acids,[14−17] seem to be dysregulated in a variety of diseases. They may be more indicative of a perturbed systemic cause (appetite,
physical activity, diurnal rhythm changes, etc.), sample contamination,
or instrumental/bioinformatic noise, rather than a specific biomarker
of disease. An example of this can be seen in the analysis of urinary
biomarkers of ionizing radiation, where dicarboxylic acids were downregulated
in the rat after radiation exposure. It was proven that this observation
was actually caused by a decreased appetite after radiation exposure
perturbing the β-oxidation pathway and not from radiation-induced
cellular changes.[18,19] Furthermore, dicarboxylic acids
can leach out from plastics during the extraction process, further
adding to the ambiguity of their role in ionizing radiation.[20]
Figure 1
Biomarkers that have high vs low disease specificity.
As well as identifying the correct
source of the biomarkers, it is also important to identify their physiological
role and how to utilize them as therapeutic targets. This must
start with the identification of the metabolite, which is shaped
by filtering thresholds set by the user and is therefore intrinsically biased.
These thresholds include those for fold change and p-value, which are highly dependent on the experiment; in vitro experiments
would exhibit lower variation between biological replicates than in
vivo. The ease of identifying the metabolite is also determined by
its concentration in the sample and previous annotation in metabolite
databases. Filtering thresholds for metabolite intensity that are
set too high may omit important biologically meaningful metabolites
rather than noise. Furthermore, a metabolite that is novel or not
curated in a database may not be taken into consideration based on
the chemical knowledge of the researcher and what they deem as meaningful.
In order to transform the complex list of identified metabolites
into markers of disease, or assign what role they play, bioinformatic
tools can aid in identifying the potential pathways that the metabolite
may belong to. It is then that the researcher can use this knowledge
surrounding the biology of the metabolite to probe the mechanism of
the disease. Untargeted metabolomics has already been used in such
a manner to find the source of neuropathic pain.[21] N,N-Dimethylsphingosine
was dysregulated in a rat model of neuropathic pain; furthermore, when
administered to control rats, it induced mechanical hypersensitivity. This
metabolite implicated the sphingomyelin-ceramide pathway as a potential
therapeutic target. Antimetabolite inhibitors of enzymes in this pathway
were tested and were able to ameliorate neuropathic pain (unpublished
data). This study holds promise for other metabolomic studies to maximize
the potential information contained within the data for finding therapeutics
of disease rather than only providing lists of dysregulated metabolites.
Streamlining Data Acquisition: The First Step
Data Alignment
One of the most important developments for untargeted liquid chromatography/mass
spectrometry (LC/MS)-based metabolomics was nonlinear retention time
(RT) alignment with XCMS using endogenous metabolites.[22] This realignment for untargeted analysis is
important to match peaks representing the same analytes from different
samples for comparative analysis; these peaks naturally drift between
sample runs due to sample build up on the columns and physical changes
to the column from the mobile phase, both of which change the nature
of the sample-stationary phase interaction. An often-used technique
involved spiking internal standards into samples prior to acquisition,
but this was based on the assumption that RT deviations were linear.[23] Thus, XCMS was particularly significant, as there
had previously been no options for carrying out nonlinear realignment for
untargeted LC/MS-based metabolomics (Figure 2). Since XCMS, a number of notable alignment algorithms
have been developed, including MZmine2.[24,25] New developments in
columns and LC systems have reduced RT drift considerably, producing
more reliable data.
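The difference between a linear and a nonlinear RT correction can be illustrated with a minimal Python sketch. It is not the XCMS algorithm: it assumes a set of anchor peaks (hypothetical values) matched between a sample and a reference run, and uses simple piecewise-linear interpolation of the RT deviation so that the correction can vary along the gradient.

```python
from bisect import bisect_right


def build_rt_correction(sample_anchor_rts, reference_anchor_rts):
    """Return a function that maps a sample RT (s) onto the reference RT scale.

    The deviation (reference - sample) is interpolated piecewise-linearly
    between anchor peaks, so early and late eluting peaks can shift differently.
    """
    pairs = sorted(zip(sample_anchor_rts, reference_anchor_rts))
    xs = [s for s, _ in pairs]
    deviations = [r - s for s, r in pairs]

    def correct(rt):
        # Outside the anchor range, hold the nearest anchor's deviation constant.
        if rt <= xs[0]:
            return rt + deviations[0]
        if rt >= xs[-1]:
            return rt + deviations[-1]
        i = bisect_right(xs, rt)
        frac = (rt - xs[i - 1]) / (xs[i] - xs[i - 1])
        return rt + deviations[i - 1] + frac * (deviations[i] - deviations[i - 1])

    return correct


# Hypothetical anchors (seconds): drift of 2 s early, 10 s mid-run, 5 s late.
correct = build_rt_correction([60.0, 300.0, 600.0], [62.0, 310.0, 605.0])
corrected = [correct(rt) for rt in (30.0, 180.0, 450.0, 700.0)]
```

A single linear offset or slope could not reproduce a drift that grows and then shrinks over the run, which is the situation a nonlinear realignment is meant to handle.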
Figure 2
Nonlinear retention time alignment by XCMS allows for
untargeted metabolomic analysis.
Automated Metabolomics
The feature identification process
is the most time-consuming and complex part of the metabolomic workflow.
After annotation of peaks and statistical analyses, MS/MS data need
to be acquired for the feature of interest. These data are then compared
to metabolite databases and commercial standards for definitive identification
and validation, a process that can take weeks to carry out but can
be potentially shortened through integration of metabolite profiling
and identification into a single autonomous mass spectrometry method.
During the typical metabolomic workflow, MS1 data are
acquired for each of the samples and a feature list table is produced
displaying the results of the statistical analysis. The investigator
will search through the table and pick out the features of interest
that need to be identified, then manually search metabolite databases
for putative identification. To confirm the identification, further
mass spectrometry setup and run time are needed for the subsequent
MS/MS analysis. XCMS Online already speeds up this process through
its integration with METLIN,[26] the world’s
largest metabolite database, which enables each feature, when possible,
to have putative metabolite identification based on its mass-to-charge
ratio (m/z) match. In addition, METLIN
has automated MS/MS matching to help confirm identity. RAMClust also
has a semiautomated workflow in which indiscriminant MS/MS (idMS/MS)
data are used for automatic database searching.[27] Both the XCMS Online and RAMClust programs aid in the identification
process but still require informed manual interpretation.
A simple but effective way to shorten this acquisition-to-identification
process from weeks to hours is to simultaneously acquire MS1 and MS/MS data over the duration of an LC/MS run, using an autonomous
untargeted metabolomic workflow[26,28] (Figure 3). During this workflow MS1 data are preprocessed
by XCMS; features are extracted, realigned for RT correction, and
undergo statistical analysis. The MS/MS data are acquired automatically
using data dependent acquisition (DDA); the MS1 data are
scanned for precursor ions selected by predefined parameters. Automatic
spectral matching of the MS/MS data to databases containing MS/MS
spectra, such as METLIN, the Human Metabolome Database (HMDB),[29] and MassBank,[30] aids
in putative identification. This autonomous approach has been optimized
to achieve the correct balance of acquired MS1 and MS/MS
data. A high scan speed can allow for greater MS/MS spectral coverage,
but a scan speed that is too high can affect the quality of the spectra
due to the lack of data points for each extracted ion chromatogram
(EIC). In both cases adequate time is required to obtain high-quality
spectra; this constraint is mitigated by using quadrupole time-of-flight (QTOF) instruments
with fast scan rates and high sensitivity. The autonomous workflow
has many benefits for untargeted metabolomics; it saves weeks of mass
spectrometry time as well as inspection time for manually picking
out significant peaks. It also saves sample, as repeat injections are
not needed. However, its drawback is a reliance on metabolite databases,
which are not yet fully comprehensive. METLIN has the largest number
of MS/MS spectra (>12 000 metabolites with high-resolution
MS/MS data) and is growing, which will improve the likelihood of successful
matches with this workflow.
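As a rough illustration of automatic spectral matching, the Python sketch below scores an experimental MS/MS spectrum against library spectra with a cosine similarity. This is a generic scoring scheme chosen for illustration, not the scoring used by METLIN, HMDB, or MassBank; the spectra, the candidate names, and the 0.01 Da matching tolerance are all hypothetical.

```python
import math


def cosine_score(query, reference, mz_tol=0.01):
    """Cosine similarity between two centroided spectra.

    Each spectrum is a list of (m/z, intensity) tuples. Every query peak is
    paired with the closest unused reference peak within mz_tol (in Da).
    """
    used = set()
    dot = 0.0
    for mz_q, int_q in query:
        best_index, best_diff = None, mz_tol
        for j, (mz_r, _) in enumerate(reference):
            diff = abs(mz_q - mz_r)
            if j not in used and diff <= best_diff:
                best_index, best_diff = j, diff
        if best_index is not None:
            used.add(best_index)
            dot += int_q * reference[best_index][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_r = math.sqrt(sum(i * i for _, i in reference))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)


# Hypothetical spectra for illustration only.
query = [(84.045, 100.0), (102.055, 40.0), (130.050, 10.0)]
library = {
    "candidate A": [(84.044, 95.0), (102.056, 45.0), (130.051, 12.0)],
    "candidate B": [(56.050, 100.0), (74.024, 30.0)],
}
ranked = sorted(library, key=lambda name: cosine_score(query, library[name]),
                reverse=True)
```

A high score only supports a putative identification; as noted above, a match still depends on the compound being present in the database at all.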
Figure 3
Overview of autonomous metabolomics. MS1 and MS/MS spectra acquisition, relative comparative analysis,
and metabolite identification are carried out simultaneously.
XCMS has been utilized in another
semiautomated acquisition and processing workflow along with RT prediction
to aid in metabolite identification.[31] This
workflow performs feature detection, RT correction, gap filling, feature
annotation, in silico fragmentation, and spectral matching to databases.[31] A nearline (nearly online)
DDA and MS/MS processing step using MetShot (an R package) is also
incorporated; MS/MS experiments are automatically generated from a
ranked list of interesting precursor features within the same analysis,
using defined filters so that only relevant
spectra are acquired.[32] The filters include sorting
and prioritizing features by p-value or fold change,
selecting features related to quasi-molecular ions, and removing
ions whose intensity would be too low for MS/MS analysis. It also separates
features from coeluting and cofragmenting compounds and places them
into separate MS/MS files. To further aid in metabolite identification,
the spectra can be compared to mass spectral databases.
In silico
prediction tools for fragmentation such as MetFrag can further aid
putative identification to help overcome the incompleteness of the
databases.[33] More recently, MetFusion has
been developed, which combines MassBank, METLIN, or the Golm Metabolome
Database[34] libraries with MetFrag to improve
the rank of the correct candidate.[35] For
each candidate metabolite returned by MetFrag, RT prediction further
limits the number of potential candidates based on physicochemical
properties (lipophilicity), which may aid in metabolite identification.[36−39] Automated workflows that incorporate data upload, processing, identification,
and pathway analyses will thus markedly improve the efficiency of
the metabolomic workflow.
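The kind of precursor filtering described above (prioritizing by p-value or fold change and discarding ions too weak for MS/MS) can be sketched in a few lines of Python. This is an illustration of the idea only and does not reproduce the MetShot R package; the function name, the feature fields, and every threshold are assumptions.

```python
import math


def select_precursors(features, max_p=0.01, min_abs_log2_fc=1.0,
                      min_intensity=1e4, top_n=10):
    """Return a ranked list of features worth targeting for MS/MS.

    Each feature is a dict with 'p_value', 'fold_change' (> 0), and
    'intensity'. Features are kept only if they pass all three thresholds,
    then ranked by ascending p-value.
    """
    kept = [
        f for f in features
        if f["p_value"] <= max_p
        and abs(math.log2(f["fold_change"])) >= min_abs_log2_fc
        and f["intensity"] >= min_intensity
    ]
    kept.sort(key=lambda f: f["p_value"])
    return kept[:top_n]


# Hypothetical feature table.
features = [
    {"id": "F1", "p_value": 4e-4, "fold_change": 3.2, "intensity": 5e5},
    {"id": "F2", "p_value": 0.2, "fold_change": 1.1, "intensity": 9e5},
    {"id": "F3", "p_value": 1e-3, "fold_change": 0.25, "intensity": 2e3},
    {"id": "F4", "p_value": 1e-5, "fold_change": 0.2, "intensity": 8e4},
]
selected_ids = [f["id"] for f in select_precursors(features)]
```

Here F2 fails the statistical thresholds and F3 is significant but too weak for fragmentation, so only F4 and F1 are queued, in that order. The thresholds are user choices and carry the same bias discussed earlier for fold change and p-value cutoffs.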
Data Streaming For Cloud-Based Metabolomics
To overcome the challenges involved in uploading metabolomic data
files to servers, a streaming approach was developed.[40] Data upload can sometimes take more than 1 day due to limitations
set by the speed of data transfer over the Internet. The software
developed allows the acquired data to be uploaded from the instrument
computer workstation directly to XCMS Online where they are converted
and processed. This concept of data streaming reduces the mean wait
time for complete data processing from 20 h to fewer than 3 h (Figure 4). The upload speed is dependent on data size and
the Internet connection, but there is a marked improvement in the
efficiency of the untargeted metabolomic workflow, as the data upload
occurs in parallel with the data acquisition. In addition, simultaneous
MS/MS data are only acquired for the features of interest based on
statistical thresholds and matches to metabolite databases from the
processed files already uploaded and analyzed by XCMS Online.
Figure 4
XCMS-based
data streaming workflow (top left) allows data upload and processing
after each LC/MS run is performed, dramatically reducing the processing
time after the data are acquired for the final sample (top right).
A thousand XCMS Online data sets were examined for their average processing
time without streaming. For low-resolution data (∼1.4 GB) and
high-resolution data (∼14.0 GB) over 10 and 20 h was required
after the final LC/MS analysis was performed, respectively. Streaming
allowed a 7-fold decrease in average processing time after data acquisition.
Reprinted from ref (40). Copyright 2014 Nature America, Inc.
Feature Analysis
Mass Spectral Annotations
In order
to accurately identify metabolites from LC/MS data, it is first essential to elucidate which
features are isotopologue ions, adduct ions ([M + Na]+,
[M + Cl]−), multiply charged ions or fragment ions
([M + H – H2O]+). These all arise from
the ionization process.[41,42] Through correct annotation,
the complexity of the data sets can be reduced and features of biological
interest identified. LC/MS data generated from untargeted metabolomics
experiments are typically processed using freely available bioinformatics
software such as XCMS,[22,43] OpenMS,[44] apLCMS,[45] xMSanalyzer,[46] mzMatch,[47] or MZmine.[24] These platforms create, in different manners,
grouped feature lists across multiple samples and classes of samples
in which only a percentage of the ions are unique; a feature is defined
as a two-dimensional bounded signal, a chromatographic peak (RT),
and a mass spectral peak (m/z).[48]
Recently introduced annotation software
uses clustering principles to deconvolve the data, on the basis that isotopic
or adduct ions will coelute, improving confidence in assigning identity.
The software uses within-sample, sample-to-sample, or Bayesian “priors” correlation to find clusters. A Bioconductor
package in R, called Collection of Algorithms for Metabolite pRofile
Annotation (CAMERA) was recently designed to be used in conjunction
with XCMS.[48] It facilitates the annotation
of isotopic peaks, adducts, and fragment ions in peak lists using
RT-based grouping, peak shape integration/analysis, and intensity
correlation. CAMERA can reduce the number of redundant features in
the processed LC/MS data by approximately 50%, decreasing the number
of false identifications and unnecessary MS/MS experiments that would
potentially be carried out.[49,50] One of the disadvantages
of CAMERA is that it is biased toward the most abundant features,[27] but very low-abundance peaks would in any case
have a low likelihood of being identified due to the concentration
constraints of MS/MS analysis. Another spectral matching-based annotation
software, RAMClust,[27] works on the basis
that two features resulting from the same metabolite will have similar
RTs and abundances across different samples within a sample set. A
similarity function allows the generation of spectra through grouping
features from a single metabolite into a single cluster. Another aspect
of RAMClust is the feature finding parameter, which carries idMS/MS,
enabling manual interpretation and automatic database searching of
feature clusters.
Daly et al.[51] use
a Bayesian “priors” approach called
MetAssign. They point out that the level of confidence in annotations
varies across metabolites and data sets. To improve confidence in
peak annotation, probabilistic estimates of the presence/absence of
metabolites are provided based on integration of information from
multiple peaks. MetAssign is also based on the principle that several
peaks of an isotopic series at the same RT should provide a higher
confidence in a putative identification than a single noisy peak.
When validated on an experimental data set, this software performed
better than CAMERA and mzMatch[47] (another
annotation software) for peak annotation. It also provided a measure
of confidence for its putative annotations. Fernandez-Albert et al.[52] showed that peak aggregation (clustering) improved
the statistical power of LC/MS data when using analysis of variance
(ANOVA) to select features and multivariate methods such as partial
least-squares discriminant analysis (PLS-DA)[53] and support vector machines (SVM).[54] They
used four peak aggregation methods to exploit and resolve the
high collinearity between variables. Two of the clustering methods, “Non-Negative
Matrix Factorization (NMF) Reduction” and “Principal
Component Analysis (PCA) Decomposition” significantly improved
the detection of significant features. The recent surge of papers
on improving peak annotation shows the importance of reducing redundant
features; integrating a confidence level for the annotations may
aid in more efficient identification.
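The core idea behind annotating redundant features, that coeluting ions separated by a known mass difference probably derive from one metabolite, can be sketched as follows. This Python example is a simplification and is not the CAMERA, RAMClust, or MetAssign algorithm (it ignores peak shape and intensity correlation); the tolerances are assumptions, and the feature list is an illustrative set of m/z values calculated for protonated glutamic acid plus one unrelated feature.

```python
# Mass differences (Da) between related ions of the same metabolite.
KNOWN_DELTAS = {
    "13C isotopologue": 1.003355,
    "[M+Na]+ vs [M+H]+": 21.981944,
    "[M+H]+ vs [M+H-H2O]+": 18.010565,
}


def annotate_pairs(features, rt_tol=2.0, ppm_tol=10.0):
    """Return (lighter_id, heavier_id, relationship) for coeluting features.

    features: list of (id, m/z, RT in seconds). Two features are related if
    their RTs agree within rt_tol and their m/z difference matches a known
    delta within ppm_tol (evaluated at the heavier ion's m/z).
    """
    annotations = []
    ordered = sorted(features, key=lambda f: f[1])
    for i, (id_a, mz_a, rt_a) in enumerate(ordered):
        for id_b, mz_b, rt_b in ordered[i + 1:]:
            if abs(rt_a - rt_b) > rt_tol:
                continue
            delta = mz_b - mz_a
            for label, expected in KNOWN_DELTAS.items():
                if abs(delta - expected) <= mz_b * ppm_tol * 1e-6:
                    annotations.append((id_a, id_b, label))
    return annotations


# Illustrative features; f5 has an adduct-like m/z but elutes much later.
features = [
    ("f1", 148.0604, 95.2),
    ("f2", 149.0638, 95.3),
    ("f3", 170.0424, 95.1),
    ("f4", 130.0499, 95.2),
    ("f5", 170.0424, 240.0),
]
annotations = annotate_pairs(features)
```

In this toy case four of the five features collapse onto a single putative metabolite, while f5 is left unannotated because it does not coelute, which is how annotation reduces the number of candidates sent for MS/MS.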
Credentialing
Even though these tools can aid in annotating multiple features for
a single metabolite and thus reduce the redundancy of many of the
metabolite features, a new tool introduced by Mahieu et al.[55] aids in identifying only those features that are of biological
origin. The idea is based on the principle that untargeted metabolomic
analysis results in thousands of biological and artifactual features.
Omitting these artifacts that can arise from contaminants (from sample
extraction or carryover from previous experiments), chemical/background
noise detected by the MS, and bioinformatic noise (misannotation during
processing) will allow those more important biological features to
be assessed. This process of distinguishing features of biological
origin from artifactual ones is called “credentialing”.
To assess the efficacy of the credentialing algorithm, Escherichia
coli (E. coli) was grown in media containing
natural-abundance 12C glucose or media containing U–13C glucose. The cultures were mixed together in defined ratios
and analyzed by untargeted LC/MS metabolomics. The credentialing algorithm
identified and credentialed features based on isotope-intensity ratios;
intensities from coeluting isotopologue pairs were compared to the
values expected from the culture volume ratios. Signals of biological
origin could thus be distinguished from artifactual ones. This tool
allowed the authors to optimize the bioinformatic analysis, reducing
overall noise features by 15% and increasing biological feature detection
by 20%. Another advantage of this tool, like the aforementioned tools
that annotate unwanted feature peaks, is that the list of features
for MS/MS is dramatically reduced; when credentialing was applied
to the E. coli data set it reduced the number of
candidates from 23 567 to 2 912. Even if all these metabolites
cannot be correctly identified, knowing that the ones targeted for
analysis are of biological origin effectively improves the metabolomic
workflow, and moves toward finding those that are meaningful. Similarly,
others have used stable isotopes for peak annotation, but these methods do not provide
enough specificity to remove all spurious peaks.[56−59] Unlike these methods, in credentialing the 13C and 12C samples are run together to reduce RT
variation, and the absolute mass differences of U–13C and U–12C metabolites are filtered rather than
using predicted molecular formulas. Therefore, the credentialing approach
limits the amount of noise and enhances the annotation of biologically
relevant peaks, whereas the other workflows are better for improving
formula annotation, which would be useful for identification, and have
a lower false discovery rate.
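The two checks that underpin credentialing, a mass difference consistent with a whole number of carbons and an intensity ratio that follows the mixing ratio, can be sketched in Python. This is a simplified illustration and not the published credentialing algorithm; the function names, the tolerances, and the use of two mixtures at 1:1 and 1:2 (12C:13C) are assumptions.

```python
C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C (Da)


def carbon_count(mz_light, mz_heavy, charge=1, tol=0.005):
    """Return the carbon number implied by a U-12C/U-13C pair, or None."""
    mass_diff = (mz_heavy - mz_light) * charge
    n = round(mass_diff / C13_C12_DELTA)
    if n >= 1 and abs(mass_diff - n * C13_C12_DELTA) <= tol:
        return n
    return None


def is_credentialed(pair_mix_a, pair_mix_b, ratio_a, ratio_b, rel_tol=0.3):
    """Check that light/heavy intensity ratios track the mixing ratios.

    pair_mix_a and pair_mix_b are (light_intensity, heavy_intensity) for the
    same coeluting pair in two mixtures with expected 12C:13C ratios ratio_a
    and ratio_b. Artifacts are not expected to follow the mixing ratio.
    """
    for (light, heavy), expected in ((pair_mix_a, ratio_a), (pair_mix_b, ratio_b)):
        if heavy <= 0:
            return False
        if abs(light / heavy - expected) > rel_tol * expected:
            return False
    return True


# Illustrative pair: a five-carbon metabolite and its U-13C5 isotopologue.
n_carbons = carbon_count(148.0604, 153.0772)
biological = is_credentialed((1.0e6, 0.95e6), (1.0e6, 1.9e6), 1.0, 0.5)
# A hypothetical contaminant pair whose ratio ignores the mixing ratio.
artifact = is_credentialed((1.0e6, 1.0e6), (1.0e6, 1.0e6), 1.0, 0.5)
```

Requiring the ratio to hold in more than one mixture is what separates signals that scale with the labeled culture from background ions that merely happen to have the right mass spacing.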
Calculating Mass Measurement Errors
Metabolite identification can also be problematic
in high throughput or large-scale LC/MS runs. During these long run
times the mass accuracy suffers and the number of incorrectly assigned
or redundant peaks dramatically increases. Mass accuracy is crucial
for matching experimental accurate masses to those found in databases;
an increase of 10 parts per million (ppm) in the mass accuracy window results in
a 10-fold increase in database hits.[60] The
major factor in maintaining a high accuracy window of less than 5
ppm is the intensity of the ion signal.[61−64] This can be demonstrated when
measuring the mass error of the lock mass signal; its two isotopic
peaks, which are at lower concentrations, often have a larger ppm error than
the parent ion. Conversely, when a sample is too concentrated and peaks
are saturated, the mass accuracy can also suffer. When the mass accuracy
shifts, the mass error window needs to be widened to perhaps 10–22
ppm, which greatly increases the false positive rate. Methods to correct
the mass accuracy while data is being acquired include using a reference
mix of known ions which the mass spectrometer uses to calibrate during
the run. Ions that occur naturally in most of the experimental runs
can also be used. There are also prediction models that have been
designed to estimate the mass measurement errors. These models do
not change the acquired data but aid in reducing the false positive
rate. One such model by Shahaf et al.[60] uses XCMS and CAMERA to process the data and annotate peaks. Mass
measurement errors are estimated using an annotated library of reference
metabolites which are obtained from multiple runs on the same MS instrument,
containing information on peaks of related features such as isotopes
and adducts. The data are grouped into bins that cover the available
mass and intensity range, and a prediction model is applied to the data
which takes into account the one-sided confidence levels of all the
mass measurement errors. The model predicts the error for an ion peak’s
mass-intensity pair within a range and is specific for the instrument
on which it was carried out, making it a reusable model for the next
experiment. A reduction in the false positive rate of 21% was seen on a small
data set, and the approach could be very effective for larger high-throughput
studies, given that the prediction models are optimized specifically
for an instrument.
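The effect of widening the mass error window can be made concrete with a short Python sketch. The ppm error is the relative deviation of the measured m/z from the theoretical m/z multiplied by 10^6. The small database below is hypothetical (two of its entries are invented neighbors placed near a calculated [M+H]+ value), and this is not the prediction model of Shahaf et al.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass measurement error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def candidates_within(measured_mz, database, ppm_window):
    """Return names of database entries within +/- ppm_window of measured_mz."""
    return [
        name for name, mz in database
        if abs(ppm_error(measured_mz, mz)) <= ppm_window
    ]


# Hypothetical database: one calculated value and two invented neighbors.
database = [
    ("glutamic acid [M+H]+", 148.060435),
    ("hypothetical compound A", 148.0615),
    ("hypothetical compound B", 148.0635),
]
measured = 148.0604

hits_5ppm = candidates_within(measured, database, 5.0)
hits_10ppm = candidates_within(measured, database, 10.0)
hits_22ppm = candidates_within(measured, database, 22.0)
```

With a 5 ppm window only one candidate is returned, whereas the 10 and 22 ppm windows admit progressively more, which is why loss of mass accuracy translates directly into more false positives.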
Statistical Analysis: Visualization and Deconvolution
Multivariate and univariate statistical analyses can further aid
in finding meaningful metabolites from the hundreds of filtered features
postannotation. However, choosing the correct test is challenging
for those without a background in bioinformatics. Recently, journals
have become more stringent in making sure that articles submitted
for publication use the correct statistical tests. Indeed Nature requires a statistical checklist to ensure articles
have statistical adequacy. Gowda et al.[65] highlight the appropriate use of different statistical tests, such
as when to use parametric vs nonparametric tests and paired vs unpaired
tests for metabolomic analysis. Paired tests can be especially useful
in human studies where there is high interindividual variability,
comparing differences between two measurements (e.g., metabolic response
before and after drug treatment) for each subject across the sample
set.
Developments in visualizing results and filtering features
have facilitated characterization and structural identification in
untargeted metabolomics. MetaboAnalyst[66] and XCMS Online both provide comprehensive statistical analysis
tools which include univariate, multivariate, high-dimensional feature
selection, clustering, and supervised classification analysis. Most
recently the interactive cloud plot, interactive PCA, and interactive
heat map were introduced to XCMS Online.[65,67] These interactive statistical read-outs let the user customize the
display and choose the most valuable features. The cloud plot allows
for instant visualization of each feature processed from the untargeted
data.[67] The interactive version of this
plot can be manipulated to remove isotopes, change thresholds and
ion-intensity ranges, and zoom in on areas of feature overlap.[65] By clicking on a feature it is possible to instantly
obtain information about the feature: p-value, q-value, m/z, RT, intensity,
and peak group. One of its strengths is that coeluting features and
the hydrophilicity/hydrophobicity can be easily observed. An example
of an interactive cloud plot can be seen in Figure 5. PCA and heatmaps have been used widely in metabolomic research;
however, the interactive versions of these tools allow for instant
modification by user definitions of criteria and visualization of
underlying information (Figure 6).
Figure 5
Interactive
multigroup cloud plot. Metabolite features whose level varies significantly
(p < 0.01) across wild-type and mutant bacteria
are projected on the cloud plot depending on their RT (x-axis) and m/z (y-axis). Each metabolite feature is represented by a bubble. Statistical
significance (p-value) is represented by the bubble’s
color intensity. The size of the bubble denotes feature intensity.
When the user scrolls the mouse over a bubble, feature assignments
are displayed in a pop-up window (m/z, RT, p-value, fold change). When a bubble is selected
by a “mouse click”, the EIC, Box–Whisker plot,
Posthoc, and METLIN hits appear on the main panel. Each bubble is
linked to the METLIN database to provide putative identifications
based on accurate m/z. The variation
pattern of glutamic acid (m/z 146.0468,
MS/MS METLIN match) across different mutants is shown by a box–whisker
plot. Reprinted from ref (65). Copyright 2014 American Chemical Society.
Figure 6
Interactive heatmap with metabolomic data visualization.
Each row represents a metabolite feature, and each column represents
a sample. The Z-scale of each feature is plotted
on the red-green color scale. When a feature annotation tile (m/z, RT, or p-value) is
selected, its Box–Whisker plot, EIC (extracted ion chromatogram),
MS spectrum, and putative METLIN matches appear.
One of the most useful tools for finding meaningful metabolites
is meta-analyses and multigroup comparisons. With these analyses it
is possible to observe shared metabolic patterns across multiple experiments
and metabolite variation patterns across multiple data groups. Meta-analysis
can prioritize interesting features by integrating data from multiple
studies and help identify shared homologous patterns of metabolic
variation across the results of multiple different experiments. Multigroup
analysis on the other hand is an extension of the two-group/pairwise
analysis that allows for the comparison of multiple classes and identifies
features whose variation pattern is statistically significant across
them. This type of analysis is particularly useful for a time-course
experiment. An example of multigroup analysis can be seen in Figure 5 where the metabolic response to stress was analyzed
across different types of bacteria.
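Returning to the choice between paired and unpaired tests raised at the start of this section, the following Python sketch shows why pairing matters when interindividual variability is high. It computes only the test statistics (a paired t statistic and Welch's unpaired t statistic) using the standard library; the intensities are invented, and a nonparametric alternative may be more appropriate for real data.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic of the per-subject differences (after - before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def welch_t(group_a, group_b):
    """Welch's t statistic treating the two groups as independent."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se


# Hypothetical metabolite intensities for five subjects before and after
# treatment: baselines differ widely, but every subject increases slightly.
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [12.0, 23.0, 33.0, 44.0, 55.0]

t_paired = paired_t(before, after)
t_unpaired = welch_t(before, after)
```

Because each subject acts as its own control, the paired statistic is far larger than the unpaired one for the same data: the consistent within-subject increase is no longer swamped by the spread between subjects.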
Targeted Validation
While not an aspect of the bioinformatic solution per se, it is worth
noting that all untargeted metabolomic analyses require further validation
to remove false positives and provide an additional level of confidence
in the follow-up biological experiments. To accomplish this, typically
quantitative targeted analysis by triple quadrupole mass spectrometry
(QqQ-MS) is performed using multiple reaction monitoring (MRM).
Because false positives can occur during untargeted analysis, carrying
out targeted QqQ-MS with standards for each metabolite can provide
assurance that an accurate fold change and p-value
are being reported.
Pathway Analysis: Putting Metabolomic Data
into a Biological Context
On the quest to find meaningful
metabolites in metabolomic data, the ability to relate the identified
metabolites to the biological question at hand is imperative. Pathway/network
tools can aid in elucidating the roles the metabolites play in multiple
pathways. However, it is often the case that metabolomic analysis
can result in a number of metabolites that are not related to each
other, i.e., they are not in the same pathways or only thin coverage
is provided for one particular pathway; this can be due to some metabolites
being in flux, while others may be at a steady concentration and would
not be differentially regulated. Furthermore, some metabolites cannot
be mapped to any pathways. Therefore, it is far from trivial to visualize
metabolites with respect to their presence and interactions in biochemical
networks.
There have been a number of recently developed programs
which are designed to help with biological pathway and network analyses.
As part of the aforementioned streaming approach, simultaneous MS/MS
data are acquired for features of interest based on statistical thresholds
and matches to metabolite databases from files already uploaded and
analyzed by XCMS Online. When two or more putatively assigned metabolites
are observed in the same pathway, the MS1 data are mined
for other metabolites in that pathway and targeted for MS/MS analysis,
essentially aiding pathway and network analysis. Thus, streaming allows
for biology-dependent data acquisition (BDDA) to take place. BDDA
differs from DDA as MS/MS is not triggered on the basis of ion intensity.[68] The BDDA concept was validated in the analysis
of tumor samples, where four metabolites were found belonging to the
same pathway (Figure 7).
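The pathway-driven selection step of BDDA can be sketched in Python as follows. The pathway membership table is a toy example rather than a real database export, the function name is hypothetical, and the rule shown (two or more putative hits in one pathway trigger targeting of the remaining members) is a simplified reading of the description above.

```python
def bdda_targets(putative_hits, pathways, min_hits=2):
    """Return, per pathway, the members still to be targeted for MS/MS.

    putative_hits: set of putatively assigned metabolite names.
    pathways: dict mapping pathway name to a set of member metabolites.
    A pathway qualifies once at least min_hits of its members are observed.
    """
    targets = {}
    for pathway, members in pathways.items():
        hits = members & putative_hits
        if len(hits) >= min_hits:
            targets[pathway] = sorted(members - hits)
    return targets


# Toy pathway table for illustration only.
pathways = {
    "TCA cycle": {"citrate", "isocitrate", "succinate", "fumarate", "malate"},
    "glycolysis": {"glucose", "pyruvate", "lactate"},
}
putative_hits = {"citrate", "malate", "lactate"}

targets = bdda_targets(putative_hits, pathways)
```

Only the pathway with two putative hits qualifies here, so its three unobserved members would be mined from the MS1 data and queued for MS/MS, regardless of their ion intensity.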
Figure 7
Biology-dependent data
acquisition relies on statistics generated after each sample run for
mass spectrometry data acquisition decision making. The representative
example shows a decreasing P value for a feature
of interest over the time-course of data streaming. When the P value for the features reaches 0.001, MS/MS is performed.
A two-tailed Wilcoxon signed-rank test was used to calculate the statistical
significance for n = 28. Box and whisker plots display
the full range of variation (whiskers, median with minimum–maximum;
boxes, interquartile range). Reprinted from ref (40). Copyright 2014 Nature
America, Inc.
Another notable program
is mummichog, which can aid in finding biological
pathway activity.[69] Instead of identifying
the metabolites before carrying out pathway/network analysis, mummichog predicts biological activity using the m/z values of both the statistically dysregulated
and unchanged features. Genome-scale human metabolic networks are
then mined for enriched pathways. These metabolic networks include
KEGG,[70] Recon1,[71] and the Edinburgh human metabolic network.[72] Because one m/z value can correspond to
several different metabolites, mummichog can assign individual metabolites
incorrectly; however, the program is designed
to find networks rather than individual metabolites. Indeed, one of
the major benefits of this program is that one can bypass the initial
laborious metabolite identification process and move directly onto
investigating whole pathways that are disrupted and can be targeted
in future studies. It also eliminates user bias, where features are
traditionally picked for identification based on interest and biochemical
relevance.[50] The success of this approach
will likely grow as metabolic models improve and as metabolomic
data become integrated into genome-scale metabolic models.

Another approach aids visualization of metabolites in related pathways,
through the creation of network “MetaMapp” graphs in
Cytoscape.[73,74] These graphs integrate biochemical
pathway and chemical relationships using the KEGG reactant pair database,[70] Tanimoto chemical similarity, and NIST mass spectral similarity
scores.[75] Differential expression of metabolite
nodes is superimposed onto the network graphs to help interpret
their involvement in metabolic networks. To overcome the issue of
incomplete mapping, including
metabolites that lack any putative identification or network mapping
in current reaction databases, metabolites are associated with one another
based on their chemical similarity, since such compounds should in
theory be metabolized to each other. Although this can reduce
overall biochemical clarity, it still allows visualization
of their involvement in pathways, which would otherwise not be possible.
Other advantages of this program are that it can be used on any
type of acquired metabolomic data (MS or nuclear magnetic resonance
spectroscopy), allowing for integration between multiple analytical
platforms. Multiple species can be mapped at the same time because it is
not constrained by genomic information, and the maps can be automatically
updated as additional information regarding the identification
of the metabolites is gathered.

Another notable method for network
analysis is Metabolite Set Enrichment Analysis (MSEA), part of the
MetaboAnalyst program.[66,76,77] MSEA is based on gene set enrichment analysis (GSEA) and is used
to investigate the enrichment of predefined groups of functionally
related metabolites rather than individual metabolites. This program
was developed through the creation of one encompassing metabolite
library using HMDB, PubChem,[78] ChEBI,[79] KEGG, BiGG,[80] METLIN,
BioCyc,[81] Reactome,[82] and Wikipedia, and importantly includes reference concentrations
for many metabolites. The main limitation of MSEA is that it is biased
toward mammalian studies; metabolomic experiments using other
species will therefore require separate metabolite sets. However, MSEA can automatically
direct the investigator to biologically important pathways and remove the need for
manual searching of pathway databases. When a list of compound names
is entered into MetaboAnalyst, Over Representation Analysis (ORA)
is performed and evaluates whether a metabolite set is represented
more than expected by chance within the given compound list. Figure 8 shows an example of the output from an MSEA analysis
of metabolites that were dysregulated in rats after
exposure to ionizing radiation. A number of metabolic pathways were
dysregulated; specifically, two metabolites, taurine and isethionic
acid, were mapped onto the taurine metabolism pathway.[19] MeltDB, like MetaboAnalyst, is a software platform
that processes raw data and carries out peak picking, statistical analysis,
and visualization. It was first introduced in 2008 but has recently
undergone a number of improvements, including pathway mapping via ProMeTra.[83,84] ProMeTra allows visualization and mapping of metabolite, transcript,
and protein ratios to metabolic pathway maps defined by the user.
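
The over-representation analysis described above rests on a hypergeometric test, as noted in the caption of Figure 8. The following is a minimal worked sketch of that calculation, not MetaboAnalyst's own code; the function name `ora_p_value` and all of the counts are illustrative assumptions.

```python
from math import comb


def ora_p_value(library_size, set_size, list_size, overlap):
    """Hypergeometric upper-tail probability P(X >= overlap).

    library_size: metabolites in the reference library (N)
    set_size:     members of the metabolite set or pathway (K)
    list_size:    compounds in the submitted list (n)
    overlap:      submitted compounds that belong to the set (k)
    """
    upper = min(set_size, list_size)
    total = comb(library_size, list_size)
    tail = sum(
        comb(set_size, i) * comb(library_size - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: a 6-member pathway, a 1000-compound library,
# and a list of 20 dysregulated compounds of which 2 fall in the pathway.
p = ora_p_value(library_size=1000, set_size=6, list_size=20, overlap=2)
print(f"{p:.4f}")
```

A small p value here indicates that the pathway contributes more members to the compound list than would be expected by chance; in practice a correction for testing many metabolite sets would also be applied.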
Figure 8
An example
of metabolite set enrichment analysis (MSEA) using MetaboAnalyst.[66] (A) Enrichment of metabolic pathways after hypergeometric
test to evaluate whether urinary metabolites upregulated in rats after
radiation exposure are represented more than expected by chance within
a compound list. (B) Taurine (Tau) and isethionic acid (IseThio) were
found to be involved in the same pathway, taurine metabolism. Other
metabolites in the taurine pathway were not changed: taurocyamine (TauCyam),
sulfoacetaldehyde (SulfoAcet), hypotaurine (HypTau), and taurocholic
acid (TauCho).
The new bioinformatic
tools developed for pathway analysis use a number of different methodologies
to achieve the same end goal, and each has its own advantages.
For example, some, such as mummichog, do not require the features to be identified
first and instead find metabolic pathways
based on m/z. Others, such as MetaMapp, require feature
identification but can fill in gaps for metabolites
not found in any pathways.[74] Still others use
network analysis as part of their workflow, as in the
autonomous/BDDA approach and the MetaboAnalyst software
package.[66]
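
The reason that m/z-based pathway tools cannot resolve individual metabolites can be seen in a short sketch of m/z matching. This is not mummichog's code: the library, the compound names, the 10 ppm tolerance, and the restriction to the [M+H]+ adduct are all assumptions made for illustration.

```python
# Toy library of neutral monoisotopic masses in Da. Compound names are
# placeholders; compound_A and compound_B stand for two isomers.
LIBRARY = {
    "compound_A": 180.0634,
    "compound_B": 180.0634,
    "compound_C": 180.0423,
    "compound_D": 146.0691,
}

PROTON_MASS = 1.007276  # mass of a proton in Da, used for [M+H]+


def candidates_for_mz(mz, library, ppm=10.0):
    """Return library entries whose [M+H]+ m/z lies within `ppm` of `mz`."""
    matches = []
    for name, neutral_mass in library.items():
        theoretical = neutral_mass + PROTON_MASS
        error_ppm = abs(mz - theoretical) / theoretical * 1e6
        if error_ppm <= ppm:
            matches.append(name)
    return sorted(matches)


hits = candidates_for_mz(181.0707, LIBRARY)
print(hits)
```

A single feature returns both isomers, so the feature alone cannot say which one is present. A pathway-level method tolerates this ambiguity because it scores sets of candidates across a network rather than committing to one assignment per feature.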
The Past and the Future:
Stable Isotopes in Metabolic Analysis
Stable isotopes have
been used in targeted MS-based metabolomics as internal standards
to validate the identity of a metabolite and also for elucidating
and tracing the involvement of metabolites in pathways for the past
60 years.[85−94] Stable isotope-labeled compounds have the same physicochemical properties as
natural abundance metabolites but contain at least one atom that is
one mass unit heavier, so the labeled compound can be distinguished from
its natural abundance counterpart. A known metabolite such as glutamine,
for example, can be introduced into an in vitro or in vivo experiment
with at least one of its atoms labeled with a stable isotope, such
as [1-13C] or [5-13C]. The incorporation of
the 13C-labeled atom(s) into other metabolites can show
the potential impact of the labeled metabolite in pathways and on
the biology of the system being studied. This is exemplified in a
study where labeled 13C was transferred from [5-13C]glutamine to acetyl coenzyme A for lipid biosynthesis during hypoxia.
The conversion of glutamine was traced by reductive flux, catalyzed
by isocitrate dehydrogenase (IDH).[95] This
study revealed a novel biological role for glutamine in lipogenesis
during hypoxia. A follow-up study using a quantitative flux model
with [U–13C] glutamine and glucose showed that fatty
acid labeling from glutamine does occur, but simultaneously with oxidative
flux, and not by net reductive IDH flux.[96]

A logical progression from stable isotope targeted metabolomics
is to follow the full conversion of the labeled isotopes
in an untargeted, unbiased manner, allowing observation of the metabolic
fate of the metabolites. Metabolites such as glucose, glutamine,
or acetyl coenzyme A, which are involved in multiple pathways,
can be used as surrogate markers and infused into the experiment to
assess widespread metabolic flux within a perturbed system (influenced
by a disease, environmental factor, genetic change, microbial influence,
or xenobiotic use). However, infusing these precursors will also result
in precursor-related metabolic perturbations. Furthermore, it is somewhat
problematic to introduce labeled precursors at biologically relevant
concentrations. In rodent models most of these metabolites will be
completely metabolized within a few minutes and undetectable in biofluids
by LC/MS. Designing an experiment that shows the direct involvement
of a metabolite in the outcome of disease pathogenesis may be
more useful. For example, a metabolite shown through conventional untargeted
metabolomics to be downregulated in a disease could be administered
in stable isotope-labeled form and may reverse the disease phenotype. This would
allow the metabolic fate to be revealed, which may not have been seen
by targeted analysis (due to only a small number of metabolites being
targeted) or conventional untargeted analysis (due to the complexity
of the data set). The simplest experiment would use a fully
labeled metabolite, in which all the 13C atoms would be transferred
to its products and cofactors. More complicated experiments arise
when the precursor metabolite is partially labeled, since the metabolite
may split and fragments carrying only unlabeled atoms would not yield a labeled isotopomer.
Glucose, for example, splits into glyceraldehyde 3-phosphate and dihydroxyacetone
phosphate during glycolysis; these metabolites are converted into
pyruvate or used for lipid biosynthesis, respectively. Partial labeling
of glucose would allow the observation of how these atoms are distributed
during glycolysis. Another issue is synchronizing metabolism. According
to Bar-Joseph et al.,[97] cells need to be
arrested so that they start at the same phase, and even then they
may lose their synchronization after a while. Therefore, determining
what phase the cells are in will help decide what time point is best
for sampling; this is largely impractical for in vivo experiments
but useful for in vitro validation studies.

Similar to conventional
untargeted metabolomics, which ultimately provides a list of dysregulated
features, stable isotope untargeted metabolomics would provide the
same list of metabolites but containing a metabolically derived stable
isotope perturbed to the same extent (fold change and statistical
significance) as the natural abundance metabolite. This would allow
direct comparison of the paired isotopologues. Many metabolomic
researchers would readily adopt this concept, but it has
been technologically difficult to implement:
algorithms capable of detecting metabolite isotopologues are needed,
and databases with MS/MS fragmentation data are required for identifying
the isotopologues and isotopomers. Huang et al.[98] have made great strides toward providing the bioinformatic
tools to make this possible. They have extended the capabilities of
the XCMS platform to identify isotopologue groups that correspond to
isotopically labeled compounds. The aptly named X13CMS
program can thus be used to track the metabolism of isotopically labeled
substrates in an untargeted manner revealing valuable insights into
metabolic mechanisms. Another major advancement in stable isotope
untargeted metabolomics has been the recent development of a database
containing thousands of metabolite isotopes. This isotope-based database,
isoMETLIN, will allow users to find all possible isotopologues derived
from METLIN and obtain MS/MS fragmentation data on isotopomers[99] (Figure 9).
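
The idea of an isotopologue group can be made concrete with a minimal sketch. It is not the X13CMS algorithm, which must also compare labeled with unlabeled samples. The function names, the 10 ppm tolerance, and the peak list are hypothetical, and all peaks are assumed to co-elute. The only physical constant used is the 13C to 12C mass difference of 1.003355 Da.

```python
C13_SHIFT = 1.003355  # mass difference between 13C and 12C, in Da


def isotopologue_mzs(base_mz, n_carbons, charge=1):
    """Expected m/z of the M+0 ... M+n 13C isotopologues of one ion."""
    return [base_mz + k * C13_SHIFT / charge for k in range(n_carbons + 1)]


def group_isotopologues(base_mz, n_carbons, observed_mzs, ppm=10.0):
    """Map each 13C count k to the first observed peak matching M+k."""
    group = {}
    for k, expected in enumerate(isotopologue_mzs(base_mz, n_carbons)):
        for mz in observed_mzs:
            if abs(mz - expected) / expected * 1e6 <= ppm:
                group[k] = mz
                break
    return group


# Hypothetical co-eluting peaks for a five-carbon ion at m/z 147.0764.
observed = [147.0764, 148.0798, 152.0932, 150.5000]
group = group_isotopologues(147.0764, n_carbons=5, observed_mzs=observed)
print(group)
```

Here the unlabeled ion, the singly labeled ion, and the fully labeled ion are grouped, while the unrelated peak at m/z 150.5 is ignored. Which labeled positions give rise to each isotopologue (the isotopomers) cannot be determined from MS1 masses alone, which is where MS/MS fragmentation data such as those in isoMETLIN become necessary.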
Figure 9
Stable isotope
untargeted and targeted metabolomics.
Conclusions
Recent developments in bioinformatic tools have
enhanced the untargeted metabolomic workflow, enabling researchers
to identify metabolite features from LC/MS data and assign their biological
roles by identifying their involvement in chemical pathways. Experiments
carried out on high-resolution mass spectrometers result in thousands
of dysregulated features. One of the biggest obstacles has been deconvolution
and identification of these features, the latter requiring highly
biased manual interpretation by the researcher. Automated multistep
workflows have eased this process by incorporating functions
that remove redundant features and enhance the efficiency and efficacy
of metabolite identification. The novel use of stable isotopes for
untargeted metabolomics and feature annotation has further enhanced
the ability of the investigator to recover biologically relevant metabolites.
Another major advancement has been in the development of metabolite
databases and in silico fragmentation tools to help identify these
metabolites and establish their function in biological pathways. Indeed, the area
of growth for bioinformatics in metabolomic research will be in finding
the role of these metabolites, rather than creating lists of biomarkers
without mechanistic implications. This depends in part on the
further curation of metabolite MS/MS fragmentation data in metabolite
databases as well as the development of network mapping tools.