Ethan King1, Richard Overstreet2, Julia Nguyen1, Danielle Ciesielski3. 1. Computing and Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States. 2. Signature Science and Technology Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States. 3. Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States.
Abstract
Tandem mass spectrometry (MS/MS) is a primary tool for the identification of small molecules and metabolites where resultant spectra are most commonly identified by matching them with spectra in MS/MS reference libraries. The high degree of variability in MS/MS spectrum acquisition techniques and parameters creates a significant challenge for building standardized reference libraries. Here we present a method to improve the usefulness of existing MS/MS libraries by augmenting available experimental spectra data sets with statistically interpolated spectra at unreported collision energies. We find that highly accurate spectral approximations can be interpolated from as few as three experimental spectra and that the interpolated spectra will be consistent with true spectra gathered from the same instrument as the experimental spectra. Supplementing existing spectral databases with interpolated spectra yields consistent improvements to identification accuracy on a range of instruments and precursor types. Applying this method yields significant improvements (∼10% more spectra correctly identified) on large data sets (2000-10 000 spectra), indicating this is a quick yet adept tool for improving spectral matching in situations where available reference libraries are not yet sufficient. We also find improvements of matching spectra across instrument types (between an Agilent Q-TOF and an Orbitrap Elite), at high collision energies (50-90 eV), and with smaller data sets available through MassBank.
Mass spectrometry (MS) is a gold-standard
compound identification
method used in many fields such as food safety, wastewater/environmental
analysis, clinical and forensic toxicology, metabolic profiling, lipid
and peptide analysis, and many more.[1−13] The technique came to prominence with the use of standardized hard
ionization methods (fixing the electron ionization source at 70 eV)
coupled with gas chromatography (GC-MS), facilitating the development
of extensive GC-MS spectral libraries and associated look-up techniques
that quickly match experimental spectra to known reference spectra.
Because GC-MS is limited in its ability to analyze small molecules,
metabolites, and nonvolatile substances, researchers developed alternate
MS methods to analyze these compounds. A common alternative is to
use liquid chromatography sample preparation, soften the ionization
method, and connect a series of mass spectrometers in tandem to refine
how compounds are ionized. By coupling two or more mass analyzers,
analysts can first ionize the molecules to separate the ions by their
mass-to-charge (m/z) ratio and then
identify ions having a particular m/z ratio to split into smaller fragment ions. This level of detail
allows analysts to elucidate molecule structure from molecular weight
discovery and fragmentation behavior, enhancing the ability to identify
unknown unknowns.[14−16] Electrospray ionization-liquid chromatography-tandem
mass spectrometry (ESI-LC-MS/MS) has come to the forefront as a specialized
MS method that deserves a place next to GC-MS as highly rigorous,
specific, and sensitive compound identification methods. As a soft
ionization technique, ESI-LC reduces sample preparation demands, allowing
analysts to study nonvolatile and larger substances and making it
invaluable for the study of small molecules and metabolites that fragment
beyond recognition under hard ionization conditions.

As MS/MS
technology becomes ubiquitous, researchers are calling
for standardization of techniques and data.[17−19] With so much
variability in MS/MS acquisition techniques, building the spectral
reference libraries and rigorously validated workflows that were essential
to the mainstreaming of GC-MS is a massive challenge for the MS community.
Most laboratories use the high-quality, value-enhanced commercial
National Institute of Standards and Technology (NIST) and Wiley libraries,[20] which have greatly increased their collection
of MS/MS spectra over the years. Value is added to these libraries
because they are curated by experienced mass spectrometrists who manually
inspect and correct spectra, remove noise and artifacts, add structures
and CAS numbers, build consensus spectra, add peak annotations, and
perform interlibrary comparisons.[21,22] NIST’s
spectra, in particular, are generated on a number of different instruments
and at many different collision energies, the latter being one of
the most important factors in a resulting MS/MS spectrum. Realistically,
however, NIST and Wiley will never be able to keep up with the modular
nature of MS/MS workflows or the quantity and diversity of molecules
of interest to researchers.

Researchers are using a number of
approaches to generate more diverse
spectral libraries. One approach is to gather all the available spectra
into a communal database, like MassBank,[23] but the quality assurance (QA) task for such an endeavor is monumental,
and appropriate QA standards are under debate.[18,19] A popular approach is to generate spectra in silico for both known
and suspected molecules using quantum-chemical properties and/or machine-learning
techniques.[24−36] This is a highly active area of research with varying degrees of
success in both quality of the prediction and length of time required
to generate a quality prediction, but it rarely accounts for the differences
between spectra produced by different instruments. Also, for disciplines
requiring rigorous validation for legal purposes, in silico libraries
may not be considered valid for a confirmatory analysis. A limiting
but pervasive approach is for laboratories to curate and maintain
their own internal libraries that match their own validated workflows.
While the community in general may be disappointed at the loss of
communal resources, this final option is quite popular and allows
laboratories to maintain libraries that perform best with the spectra
they generate.

To improve the utility of available MS/MS spectral
databases (whose
contents are invaluable due to the diversity of instrumentation, collision
energies, other instrument parameters, and molecules of interest), we
developed a first-of-its-kind tool that takes existing spectra, applies
principal component analysis (PCA) to fill in gaps in the libraries,
and allows researchers to do a spectral comparison and compound identification
with high accuracy. Analyzing high-resolution mass spectrometry (HRMS)
data available in NIST20,[37] we show that
filling in unavailable collision energies with PCA-interpolated spectra
results in a ∼10% improvement in compound identification over
comparing with just the existing database. Our method reduces the
need for libraries to cover a broader range of collision energies
with the costly and time-consuming collection of spectra. Additionally,
unlike many in silico library tools, this interpolation method generates
spectra quickly and accounts for the spectral differences caused by
the particular instruments and settings used to generate the spectra.
This tool can be broadly applied as a quick and simple way to improve
accuracy when performing spectral matching for compound identification.
Methods
Spectral Interpolation
Because we start with fine-grain
HRMS data, we bin the peaks of the spectra to increase our ability
to identify key spectral features. This peak-binning process naturally
creates a vector representation for spectra, so we can leverage linear
algebra tools and construct a method to interpolate spectra across
collision energies. We first lay out the notation for this process
and then discuss details of the approach. A high-level overview is
illustrated in Figure 1.
Figure 1
(a) Three experimental spectra are collected from a consistent
mass spectrometry workflow at collision energies spanning the energies
of interest. (b) The experimental spectra are binned to the desired
level of detail, forming vector-formatted spectra that are suited
for mathematical analysis. (c) The vector-formatted spectra are joined
into a matrix, and the SVD of the matrix is computed. The resulting
SVD provides a set of basis vectors $b_k$ and the weight coefficients $c_k$ at the
known collision energies. The weight coefficients for the desired
spectra are interpolated from the known weight coefficients. (d) Finally,
for a desired collision energy e, the interpolated
weights are applied as a linear combination with the basis vectors
to determine the anticipated spectrum at the unknown collision energy.
Let a spectrum $s$ with a set of peaks $P$ be given by the set

$$s = \{(m_p, i_p) : p \in P\}, \tag{1}$$

where $i_p$ is the measured intensity of a peak at m/z value $m_p$.
To apply principal component analysis (PCA) to a set of spectra, they
must be represented in the same vector space. To conform a set of
spectra, we first choose a $Q_{\max}$ value
such that all relevant m/z values
are in the interval $[0, Q_{\max}]$ and partition
the interval into $N$ uniform bins $\{[q_n^{\min}, q_{n+1}^{\min})\}_{n=1}^{N}$. For our purposes, $Q_{\max}$ is determined for each trial independently by the highest nonzero m/z value in the set of spectra being analyzed,
and the value of $N$ is set to bin the peaks to the
nearest integer. This coarse binning is chosen to allow many trials
to be run quickly to validate the process, though the mathematical
details still apply for the much finer detail required for a real-world
analysis. We then represent a spectrum $s$ as a vector $v \in \mathbb{R}^N$ where, at each index $n$, the vector value $v_n$ becomes either the intensity of the highest peak in that section
of the partition or zero if there are no peaks in that bin. This modified
binning method is designed to ignore noise around prominent peaks.
If a moderate height peak is surrounded by many very small peaks,
the common method of binning by summing the peaks may allow the peak
to appear more prominent. This binning method preserves peak prominence
and reduces noise. Mathematically, this is expressed as follows.

$$v_n = \begin{cases} \max\{\, i : (m, i) \in s,\ m \in [q_n^{\min}, q_{n+1}^{\min}) \,\} & \text{if bin } n \text{ contains at least one peak,} \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

To capture how the spectra for a given
molecule progress as collision
energy changes, we seek an optimal representation of the set of spectra
using singular value decomposition (SVD). Commonly in PCA, SVD is
used to identify a set of component vectors that is smaller than the
full data set but still represents all of the data well enough to
make strong predictions. However, because spectral prediction is incredibly
nuanced, we generate a full set of basis vectors to retain as much
spectral detail as possible. In this analysis, a basis is a minimal
set of vectors required to be able to recreate any spectrum in the
data set through a linear combination of the basis vectors (in linear
algebra terms, the basis spans the original data set). The basis vectors
also have the properties of being linearly independent and orthogonal.
These properties make PCA a powerful tool but also make the basis
vectors purely statistical artifacts, no longer representative of
actual spectra. To represent a given set of $J$ known,
vectorized spectra $V = \{v_j\}_{j=1}^{J}$ taken at collision energies $\{e_j\}_{j=1}^{J}$, we use SVD to construct an orthonormal basis $\{b_k\}_{k=1}^{K}$ of the span of $V$, where $K$ is the dimension of the span. Because of the complexity
of the HRMS data, $K$ will be equal to $J$ in most cases for this method.

Now because $\{b_k\}$ is a basis, there exists a set of coefficients $\{c_{j,k}\}$ such
that each vectorized spectrum $v_j \in V$ can be written as a linear combination
of basis vectors.

$$v_j = \sum_{k=1}^{K} c_{j,k}\, b_k \tag{3}$$

That is, each spectrum can be represented
as a weighted sum of the basis vectors $b_k$, where the coefficient $c_{j,k}$ can be understood as the contribution
of the vector $b_k$ to the
spectrum $v_j$. In this
view, the changes in spectra across collision energies can be described
by the changes to the contributions (coefficients) of each basis vector.
For example, a given basis vector may have a small contribution at
low collision energy but a large contribution at higher collision
energies. It is important here to note that the coefficients $c_{j,k}$ can be
positive or negative because the basis vectors do not necessarily
correspond to any physical phenomenon (e.g., fragment structure/stability);
they are statistical in nature.

Finally, to generate the interpolations
for all missing collision
energies, we need to build functions that map how the contributions
for each basis vector change as a function of the collision energy.
These functions are represented by the dashed lines in Figure 2. Ideally we would use a function
that takes in a scalar collision energy $e$ and outputs
the corresponding continuous, HRMS spectrum $g(e)$ for a given molecule. While we cannot determine the true
function $g$, we can construct an approximation $\hat{g}$
from $[e_{\min}, e_{\max}]$ to $\mathbb{R}^N$ that outputs an $N$-dimensional
vectorized spectrum in the span of the basis $\{b_k\}$ of the form

$$\hat{g}(e) = \sum_{k=1}^{K} f_k(e)\, b_k, \tag{4}$$

where we initially define

$$f_k(e_j) = c_{j,k} \tag{5}$$

for all $j \in \{1, \ldots, J\}$ and $k \in \{1, \ldots, K\}$, so that, with the coefficients $c_{j,k}$ satisfying (3), the approximation reproduces each of our $J$ known vectorized spectral representations exactly, $\hat{g}(e_j) = v_j$.
We then estimate the values of $f_k$ at all other $e \in [e_{\min}, e_{\max}]$ by linear interpolation,
where, for $e_j \le e \le e_{j+1}$ with the known energies ordered so that $e_1 < \cdots < e_J$, we have the following.

$$f_k(e) = c_{j,k} + \frac{e - e_j}{e_{j+1} - e_j}\,\bigl(c_{j+1,k} - c_{j,k}\bigr) \tag{6}$$

By this definition, the vectorized spectrum
approximations of the form (4) may include a negative
intensity value. To make sensible spectrum estimates, all negative
values in $\hat{g}(e)$ are set to
zero.
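As an illustration only, the binning, decomposition, and interpolation steps described in this section can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names (`bin_spectrum`, `fit_basis`, `interpolate_spectrum`), the relative tolerance `tol` used to discard numerically zero singular values, and the choice to place a peak at exactly $Q_{\max}$ in the last bin are all hypothetical.

```python
import numpy as np

def bin_spectrum(peaks, q_max, n_bins):
    """Max-intensity binning: each bin keeps its tallest peak (0 if empty)."""
    v = np.zeros(n_bins)
    width = q_max / n_bins  # uniform bin width on [0, q_max]
    for mz, intensity in peaks:
        n = min(int(mz // width), n_bins - 1)  # assumption: m/z == q_max goes in the last bin
        v[n] = max(v[n], intensity)
    return v

def fit_basis(V, tol=1e-10):
    """Return (C, B) with V = C @ B, where the rows of B are orthonormal.

    V has shape (J, N): one binned spectrum per row.
    """
    U, S, Vt = np.linalg.svd(V, full_matrices=False)
    keep = S > tol * S.max()  # drop numerically zero directions, so K = rank(V)
    return U[:, keep] * S[keep], Vt[keep]

def interpolate_spectrum(e, energies, C, B):
    """Approximate the binned spectrum at collision energy e (interpolation only)."""
    energies = np.asarray(energies, dtype=float)
    if not energies.min() <= e <= energies.max():
        raise ValueError("e lies outside [e_min, e_max]; extrapolation is undefined")
    order = np.argsort(energies)  # np.interp needs increasing x values
    f = np.array([np.interp(e, energies[order], C[order, k])
                  for k in range(C.shape[1])])
    return np.clip(f @ B, 0.0, None)  # negative intensities are set to zero
```

In this sketch, a call such as `interpolate_spectrum(15.0, [10, 20, 40], C, B)` would return an estimate of the binned spectrum at 15 eV from spectra known at 10, 20, and 40 eV, while a request at 50 eV raises an error rather than extrapolating.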
Figure 2
(top) The left-hand plots titled $b_k$ show the basis vectors generated from the capsaicin
spectra shown in the bottom row of the figure. The right-hand plots
titled $f_k(e)$ give the coefficients that reconstruct the spectrum at a given
electronvolt value, $e$, with a linear interpolation
(dashed line) plotted across collision energies. For each basis vector,
these interpolations are the functions $f_k$. Across the range of collision energies, the contribution
of $b_1$ remains relatively constant, as
it contains the prominent base peak across all electronvolts. At 10
eV, there are small influences from basis vector $b_3$, and the negative value of $f_2$ decreases peak intensity at the positive value in basis vector $b_2$ while increasing the intensity of the peaks
shown as negative. At 20 eV, the contributions of basis vector $b_3$ switch sign, and the coefficient for basis
vector $b_2$ begins to increase. Finally
at 40 eV, the influence of basis vector $b_2$ increases, corresponding with the appearance of strong peaks with
lower m/z values than the base peak
observed at 10 and 20 eV. This represents the fragmentation that occurs
between 20 and 40 eV. (bottom) Normalized MS/MS Agilent Q-TOF spectra
for capsaicin from the NIST20 database at collision energies of 10,
20, and 40 eV are shown in the bottom row.
Figure 2 (top) shows
the three basis vectors $b_k$ for a set of three capsaicin spectra (Figure 2 (bottom)) along with the known contribution
values $c_{j,k}$ (the dots on the right-hand plots) and how they change as
a function of collision energy, $f_k$. The basis vector $b_1$ represents
a peak prominent across all collision energies, and the associated
coefficients $c_{j,1}$ and interpolation $f_1$ remain close to constant. In contrast, $b_2$ represents peaks that are more prominent in
the highest collision energy spectra, estimated as linear increases
to the contribution of $b_2$ in the interpolation $f_2$. Head-to-tail comparisons of interpolated
spectra against experimental spectra from NIST20 are shown in Figure 3. While this method
shows strong results within the range $[e_{\min}, e_{\max}]$, it is important to note that,
as an interpolation, it cannot extrapolate to spectrum estimates at collision
energies outside the range $[e_{\min}, e_{\max}]$, as the approximations
are not meaningfully defined for such values.
Figure 3
Sample interpolation
predicted (ITP) Q-TOF capsaicin spectra compared
to the known spectra available in NIST20 at collision energies of
15 and 30 eV. Note that the methods used to generate predictions preclude
accurate predictions outside the range of provided collision energies.
These spectra were generated with samples at 10, 20, and 40 eV, so
ITP spectra can only be generated for collision energies between 10
and 40 eV.
Spectra Curation
We assess the accuracy of our interpolation
methodology for molecular identification using spectra from the HRMS
and APCI libraries available in NIST20.[37] The instruments used to generate these spectra are the Agilent 6530
Q-TOF, the Thermo Finnigan Velos Orbitrap, Thermo Finnigan Elite Orbitrap,
and the Orbitrap Fusion Lumos. Spectra at multiple collision energies
for each molecule are only available for high-energy collision dissociation
(HCD) and quadrupole time-of-flight (Q-TOF) measurements; therefore,
an analysis of our method is limited to these. We further restrict
our analysis to small molecules with molecular weight in the range
of 100–500 Da and stratify by precursor type to focus on molecules
that do not respond well to standard hard ionization methods. We selected
the positive ion and negative ion mode precursors with the most available
data, which are [M + H]+ and [M-H]−,
respectively. In addition we test the [M + Na]+ precursor
across all instruments. We include this variety of precursor types
to ensure our method works for different analytical workflows. Table 1 gives the number
of molecules available in each subset. Throughout we will refer to
the instrument precursor pairs by the acronyms given in the table.
Table 1
This Table Gives the Number of Molecules
with Spectra for Each Precursor and Instrument Pair, along with the
Acronym That Each Pair Is Referred to in This Text
Precursor | Agilent Q-TOF | Elite Orbitrap | Velos Orbitrap | Orbitrap Fusion Lumos
[M + H]+ | QHP: 2,603 | EHP: 15,081 | VHP: 558 | LHP: 6,191
[M-H]− | QHN: 246 | EHN: 8,932 | VHN: 13 | LHN: 2,066
[M + Na]+ | QNa: 173 | ENa: 4,087 | VNa: 107 | LNa: 927
For all Q-TOF spectra and HCD spectra
from the Thermo Finnigan
Velos and Elite Orbitraps, NIST20 reports the collision energy in
electronvolts. For the Orbitrap Fusion Lumos, however, only approximately
one-third of the spectra have an electronvolt value listed alongside
their normalized collision energy (NCE). When an instrument implements
NCE, the collision energy is dynamically adjusted based on the expected
precursor weight so that small ions are not ejected from the trap
and large ions can be sufficiently fragmented.

In Figure 4 (top),
we inspect the relationship between reported electronvolt and NCE
values from the Orbitrap Fusion Lumos. The reported electronvolts
are listed on the x-axis, and the reported NCE values
are on the y-axis. By adding a color gradient based
on the precursor m/z of each molecule,
it becomes apparent that NCE has a generally linear relationship with
estimated precursor m/z and electronvolts
that can be used to approximate the electronvolt value when it is
not explicitly reported. The spectra represented in Figure 4 (top), when grouped by reported
NCE and precursor m/z, display a
preexisting variation in the reported electronvolt with a standard
deviation of ∼0.2539 eV across all molecules. We treat this
variation as negligible for our purposes and assign electronvolt values
to spectra without explicit measurement based on the following algorithm.
First, we collect all Orbitrap Fusion Lumos spectra that report an electronvolt value and whose precursor m/z lies within one unit of the precursor m/z of the spectrum in question. Next, we compute a linear regression on this subset of
spectra to predict the linear trend for that precursor m/z. Finally, we assign the spectrum an electronvolt
approximation based on its listed NCE value and this localized regression.
The average difference in computed versus reported electronvolt means
for a given NCE and precursor m/z was ∼0.1210 eV, indicating close agreement between our estimates and the reported values. Figure 4 (bottom) shows the
same data from Figure 4 (top) with the interpolated values added to the plot. Here we can
see that many of the spectra whose precise electronvolt values were
unknown will be excluded from this work due to our focus on small
molecules and low collision energies.
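The localized regression described above can be sketched as follows. This is an illustrative sketch only: the function name `estimate_ev`, its argument names, the inclusive `window` of one m/z unit, and the choice to return `None` when fewer than two distinct NCE values are available are assumptions made for the example and are not taken from the original workflow.

```python
import numpy as np

def estimate_ev(nce, precursor_mz, ref_mz, ref_nce, ref_ev, window=1.0):
    """Estimate a missing eV value from NCE using nearby reference spectra.

    ref_mz, ref_nce, ref_ev describe spectra that report both NCE and eV.
    Only references whose precursor m/z is within `window` are used.
    """
    ref_mz = np.asarray(ref_mz, dtype=float)
    ref_nce = np.asarray(ref_nce, dtype=float)
    ref_ev = np.asarray(ref_ev, dtype=float)

    mask = np.abs(ref_mz - precursor_mz) <= window
    if np.unique(ref_nce[mask]).size < 2:
        return None  # assumption: a line cannot be fit to fewer than two NCE values

    # Degree-1 least-squares fit of eV against NCE for this precursor m/z window.
    slope, intercept = np.polyfit(ref_nce[mask], ref_ev[mask], 1)
    return slope * nce + intercept
```

Fitting one line per precursor m/z window, rather than a single global line, mirrors the observation that the NCE-to-electronvolt trend depends on precursor m/z.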
Figure 4
(top) Comparison of reported electronvolt
readings against NCE
for the Orbitrap Fusion Lumos, color graded by precursor m/z. A strong linear trend suggests the potential
to apply linear regression to approximate missing electronvolt values.
(bottom) All Orbitrap Fusion Lumos scores are shown with interpolated
electronvolt values where none were reported. Variation in the reported
data set leads to imperfect alignment, though a strong correlation
is still obvious. The subset of data used for most of this analysis
is restricted to spectra collected at or below 40 eV, including only
the bottom left-hand corner of these images.
Interpreting Added Value from Interpolation-Predicted Spectra
To measure accuracy gained by using our interpolation-predicted
(ITP) spectra method, we employ two experiments of similar format.
In each case, we subset spectra from NIST20 to represent a limited
number of available database spectra. The remaining spectra serve
as unknown test spectra that can be searched against the subset database.

To evaluate the quality of our ITP spectra, we compare only the
test spectrum, the ITP spectrum, and the database spectra used to
generate the ITP spectrum. For a test spectrum at collision energy teV, we construct an ITP spectrum at collision
energy teV and compute the cosine similarity
between the test spectrum and ITP spectrum, denoted by ITPsim. We also compute the cosine similarity between the test spectrum
and the closest matching database spectrum, denoted by DBsim. That is, if there are three database spectra used to generate an
ITP spectrum, the cosine similarity is computed between the test spectrum
and each spectrum used to generate the ITP spectrum. The highest cosine
similarity score is retained as DBsim. Finally, the gain
in cosine similarity when the ITP spectra are included in the subset
database can be measured as the difference between these scores, Δsim = ITPsim – DBsim.

To
determine the benefit added by including ITP spectra in a limited
database search, we modify the previous experiment slightly. We first
assume both the precursor molecular weight and collision energy are
known for each test spectrum. To identify candidate molecules for
our subset database, we screen the NIST20 database for all spectra
matching the instrument, precursor type, and target collision energies
used to generate the ITP spectra. We then identify all molecules in
the candidate list within 10 Da of the test spectrum’s precursor
weight. For a test spectrum at collision energy teV, we use the spectra in our subset database to construct
ITP spectra at a set of collision energies for each candidate molecule. Finally, we identify
the highest cosine similarity score between our test spectrum and
(1) all known spectra identified as appropriate for our subset database
(DBsim) and also (2) the subset database spectra combined
with the ITP spectra—but not replacing any known subset spectra—denoted
maxITPsim. To assess performance of the ITP enhanced database,
we report the difference between these values.

For each trial, we also report the
percentage of spectra for which
the highest cosine similarity score matches the molecule of the test
spectrum when matching to the ITP spectra enhanced database (IM) and
the percentage of spectra likewise correctly identified when matching
to just subset database spectra (DBM). We also record the difference
in electronvolts between the closest matching interpolated spectrum
and the test spectrum, ΔeV. That is, for the test
spectrum with collision energy teV and
the closest matching ITP spectrum with collision energy IeV, we compute the difference between teV and IeV. Finally, we record the percentage of spectra
that were correctly identified with only the subset database spectra
but misidentified when using the ITP spectra and denote this value
as IF.
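The similarity measures used in these experiments can be sketched as below. This is a minimal sketch assuming spectra have already been binned into equal-length vectors; the function names (`cosine_similarity`, `delta_sim`, `best_match`), the convention of returning 0 for an all-zero spectrum, and the `(molecule_id, vector)` library layout are hypothetical choices made for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two binned spectra (0.0 if either is all zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def delta_sim(test, itp, database):
    """ITPsim - DBsim: gain from the ITP spectrum over the best database spectrum."""
    itp_sim = cosine_similarity(test, itp)
    db_sim = max(cosine_similarity(test, d) for d in database)
    return itp_sim - db_sim

def best_match(test, library):
    """Return the molecule id whose library spectrum best matches the test spectrum.

    library is a list of (molecule_id, vector) pairs; it may hold database
    spectra only, or database spectra together with ITP spectra.
    """
    return max(library, key=lambda entry: cosine_similarity(test, entry[1]))[0]
```

Under these assumptions, running `best_match` once against the subset database alone and once against the database augmented with ITP spectra, and comparing each result with the true molecule, gives the per-spectrum outcomes that are aggregated into DBM, IM, and IF.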
Results and Discussion
Adds Benefit When Few Spectra Are Available
We first
determine how many known spectra are required to create an accurate
ITP spectrum by testing ITP spectra generated by as few as two spectra
and as many as eight. To generate a subset database containing n known spectra per molecule, we select n uniformly spaced electronvolt values between 10 and 45 as database
collision energies. This range was chosen to represent reasonable
energies for small molecules. Then we construct the database from
the closest spectra within (45 – 10)/(2n) eV of
each of the uniformly spaced energies.

For this trial, we consider
the Agilent Q-TOF and the Elite Orbitrap spectra with either a [M
+ H]+ or [M-H]− precursor to ensure the
behavior is similar on the Q-TOF and Orbitrap instruments. Data sets
are referenced by the acronyms as given in Table 1. Within each instrument/precursor pair,
we screen for molecules with at least 10 spectra in the 10–45
eV range. The number of unique molecules satisfying the criteria for
each of QHP, QHN, EHP, and EHN is 991, 101, 1112, and 460, respectively.
The data sets for the Velos Orbitrap and Orbitrap Fusion Lumos are
too small to provide sufficient test data under these requirements,
so they are omitted.

As shown in Figure 5, the average cosine similarity between the
ITP and test spectra
quickly approaches 1 as more spectra are added to the database. However,
the ITP spectra achieve high accuracy with as few as three spectra
in the database, reporting an average cosine similarity over 0.95.
Furthermore, with just three known spectra at a range of electronvolts,
generating an ITP spectrum to target the test electronvolt in a specific
reference library gives a stronger match on average, across all molecules,
than the closest known spectrum. These results are shown in Figure 6, where a positive
difference (Δsim) indicates a stronger match with
the ITP spectrum and a negative score indicates a stronger match to
a spectrum in the database.
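The subset-database construction described at the start of this subsection (n uniformly spaced target energies between 10 and 45 eV, each matched to the closest available spectrum within (45 – 10)/(2n) eV) can be sketched as follows. The function name `select_database_energies` is hypothetical, and two details are assumptions of this sketch rather than statements about the original procedure: the target energies include both endpoints, and the tolerance is treated as inclusive.

```python
import numpy as np

def select_database_energies(available_ev, n, lo=10.0, hi=45.0):
    """Pick indices of spectra to serve as the n-spectrum subset database.

    available_ev lists the collision energies at which a molecule has spectra.
    """
    available = np.asarray(available_ev, dtype=float)
    targets = np.linspace(lo, hi, n)   # assumption: endpoints lo and hi are included
    tol = (hi - lo) / (2 * n)          # matching tolerance in eV
    chosen = []
    for t in targets:
        idx = int(np.argmin(np.abs(available - t)))  # closest available energy
        if abs(available[idx] - t) <= tol and idx not in chosen:
            chosen.append(idx)
    return chosen
```

A molecule whose returned list is shorter than n has no spectrum close enough to at least one target energy; how such molecules are handled is not specified here and is left to the caller.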
Figure 5
Average cosine similarity between ITP spectra
and true spectra
(ITPsim) as more known, database spectra are used to generate
the interpolation. Results are shown for NIST20 spectra on a Q-TOF
and Elite Orbitrap with [M + H]+ precursor (QHP and EHP,
respectively) as well as with an [M-H]− precursor
(QHN, EHN). As more database spectra are available, ITP spectra quickly
approach true spectra.
Figure 6
Average difference in cosine similarity of ITP spectra
and database
spectra to test spectra (Δsim) as more spectra are
available for interpolation. Results are shown for NIST20 spectra
on a Q-TOF and Elite Orbitrap with [M + H]+ precursor (QHP
and EHP, respectively) as well as with an [M-H]− precursor (QHN, EHN). A positive Δsim indicates
the interpolated spectra are closer matches than the known spectra
used to generate the interpolations, while a negative value indicates
the reverse.
With the exception of the QHP case, including only
spectra at 10
and 45 eV is insufficient to produce ITP spectra that match test spectra
better than the available database on average. By including just one
intermediate spectrum, however, the ITP spectra offer a consistent
advantage. Because the ITP spectra are mathematically determined,
the interpolations can only include peaks that are present in the
spectra provided to the model. This means the interpolations cannot
contain spurious peaks but do require enough initial data—at
least three spectra—to ensure all expected peaks are found.
For molecule classes where known, anomalous fragmentation occurs,
researchers must be deliberate when selecting the representative collision
energies.
Figure 6 plots the average Δsim value for the tested cases, showing
that our method offers the most advantage in cases where limited data
are available. The ITP spectra yield diminishing returns when more
spectra are known for a molecule in the electronvolt range of interest.
This likely happens because the increased availability of known spectra
evenly dispersed through the range of electronvolts increases the
chances of a known spectrum being closer in electronvolts to the entries
in the library. While the gains of this method are more modest when
more data are available, including the ITP spectra still increases
matching scores over the database spectra on average.
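The point that an interpolation can only contain peaks already present in the supplied spectra can be illustrated with a deliberately simple stand-in. The sketch below uses per-peak linear interpolation over collision energy (`numpy.interp`); this is an assumption made for illustration and is not claimed to be the statistical model used in this work. The function name, the dictionary representation, and the m/z and intensity values are all hypothetical.

```python
import numpy as np

def interpolate_spectrum(energies, spectra, target_energy):
    """Estimate a spectrum at target_energy from known spectra.

    energies: collision energies (eV) in increasing order.
    spectra:  one dict per energy, mapping m/z -> intensity.
    Uses simple per-peak linear interpolation (illustrative stand-in only).
    """
    # Only m/z values seen in at least one known spectrum can appear in the output.
    mz_values = sorted(set().union(*[s.keys() for s in spectra]))
    result = {}
    for mz in mz_values:
        # A peak absent from a known spectrum is treated as zero intensity there.
        intensities = [s.get(mz, 0.0) for s in spectra]
        value = float(np.interp(target_energy, energies, intensities))
        if value > 0:
            result[mz] = value
    return result

# Made-up spectra for one molecule at 10, 20, and 40 eV
known = [
    {200.0: 100.0, 150.0: 5.0},
    {200.0: 60.0, 150.0: 40.0},
    {150.0: 80.0, 100.0: 20.0},
]
estimate = interpolate_spectrum([10, 20, 40], known, 30)
print(estimate)
```

With only the 10 and 40 eV spectra supplied, the intermediate behavior of each peak would be reduced to a straight line between two points, which is one way to see why an intermediate spectrum matters.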
Performs Consistently across Instruments and Precursor Types
On the basis of the results of the previous section, we now test
how much value is added to a subset database of three spectra by augmenting
it with ITP spectra for all nine instrument and precursor types outlined
in Table 1. Using the
NIST20 libraries, we construct subset databases that contain only
spectra at ∼10, 20, and 40 eV by identifying the closest spectrum
for each available molecule within 5 eV of each target energy. All
other spectra serve as unknown test spectra that can be searched for
in the subset database. The number of molecules with spectra within
the target electronvolt values for each instrument/precursor pair
is reported in Table 2.
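The subset construction described above can be sketched as follows. This is a hedged illustration under stated assumptions: the 5 eV window is treated as inclusive, a molecule is skipped when any target energy has no spectrum in range, and each spectrum is used for at most one target. The function name and data layout are hypothetical.

```python
def build_subset(spectra_by_energy, targets=(10, 20, 40), tolerance=5.0):
    """Split one molecule's spectra into a three-spectrum subset and test spectra.

    spectra_by_energy: dict mapping collision energy (eV) -> spectrum.
    Returns (subset, test), or None when a target has no spectrum in range.
    """
    subset = {}
    for target in targets:
        candidates = [
            ev for ev in spectra_by_energy
            if abs(ev - target) <= tolerance and ev not in subset
        ]
        if not candidates:
            return None  # assumed: molecule excluded from this benchmark
        closest = min(candidates, key=lambda ev: abs(ev - target))
        subset[closest] = spectra_by_energy[closest]
    # Everything not chosen for the subset is held out as a test spectrum.
    test = {ev: s for ev, s in spectra_by_energy.items() if ev not in subset}
    return subset, test

# Placeholder spectra keyed by collision energy
molecule = {8: "s8", 15: "s15", 22: "s22", 30: "s30", 41: "s41"}
subset, test = build_subset(molecule)
print(sorted(subset), sorted(test))
```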
Table 2
Summary of ITP Spectra Enhanced Database Performance across Instrument and Precursor Types: The Number of Molecules Available in NIST20 for the Given Instrument/Precursor Type Pairing (# mol.), the Average Highest Cosine Similarity between a Test Spectrum and an ITP Spectra Enhanced Database (maxITPsim), the Average Highest Cosine Similarity between a Test Spectrum and the Subset Database (maxΔsim), the Percent of Test Spectra Accurately Identified by Their Best Match Either by Directly Matching against the Database (DBM) or Matching against ITP Spectra (IM), the Average Distance between the Test Spectrum’s Electronvolt and the Closest Match’s Electronvolt (ΔeV), and the Percent of Spectra Correctly Identified without Interpolation That Were Incorrectly Identified with Interpolation (IF)
| | # mol. | maxITPsim | maxΔsim | IM | DBM | ΔeV | IF |
|---|---|---|---|---|---|---|---|
| QHP | 1995 | 0.97 | 0.029 | 95.3% | 88.1% | 0.84 | 0.6% |
| QHN | 128 | 0.97 | 0.021 | 96.5% | 94.3% | 1.4 | 0.1% |
| QNa | 30 | 0.97 | 0.023 | 99.5% | 95.7% | 1.3 | 0.5% |
| EHP | 10 059 | 0.96 | 0.026 | 87.3% | 76.0% | 0.65 | 1% |
| EHN | 4105 | 0.98 | 0.035 | 87.8% | 78.3% | 1.1 | 1.5% |
| ENa | 589 | 0.93 | 0.020 | 100% | 100% | 1.3 | 0% |
| VHP | 394 | 0.95 | 0.028 | 93.1% | 85.6% | 0.92 | 0.4% |
| VHN | 6 | 0.97 | 0.021 | 100% | 98.2% | 1.4 | 0% |
| VNa | 22 | 0.87 | −0.019 | 100% | 100% | 1.3 | 0% |
| LHP | 3935 | 0.95 | 0.026 | 87.6% | 78.3% | 0.56 | 0.8% |
| LHN | 893 | 0.95 | 0.018 | 93.8% | 88.5% | 1.4 | 0.5% |
| LNa | 135 | 0.96 | 0.021 | 99.1% | 97.7% | 1.1 | 0% |
For each instrument/precursor pair, we test identification of the
held-out test spectra against the associated database after augmenting
it with interpolated spectra. For each instrument/precursor pair, Table 2 also reports the
percentage of spectra for which the highest cosine similarity score
matches the identity of the test spectrum when matching to the ITP
spectra enhanced database (IM) and the percentage of spectra likewise
correctly identified when matching to just subset database spectra
(DBM).
For all cases, the percent of spectra correctly identified improved
when using interpolation estimates. The ΔeV and ITPsim results indicate that the interpolation is providing improvements
in line with our expectation: test spectra at a collision energy t_eV find a better match with an ITP spectrum
within 1 or 2 eV of t_eV on average in
all cases. Note that interpolation does not strictly improve identification: a
small proportion of spectra is correctly identified when matching to just
the subset database but is misidentified when ITP spectra are added.
These instances are far outweighed by those in which identification
is improved, but they should be kept in mind by researchers attempting
to use interpolation with molecules that are known to have complicated
spectra across collision energies.
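The three identification percentages discussed above (IM, DBM, and IF) can be computed from best-match labels as in the sketch below. It assumes that IF is expressed as a percentage of all test spectra, which is one reading of the table captions. The function name and the toy labels are hypothetical.

```python
def identification_summary(true_ids, db_best_ids, itp_best_ids):
    """Return IM, DBM, and IF as percentages of all test spectra.

    true_ids:     true molecule identity of each test spectrum.
    db_best_ids:  identity of the best match in the subset database alone.
    itp_best_ids: identity of the best match in the ITP-augmented database.
    """
    n = len(true_ids)
    if n == 0:
        raise ValueError("no test spectra")
    db_ok = [t == d for t, d in zip(true_ids, db_best_ids)]
    itp_ok = [t == i for t, i in zip(true_ids, itp_best_ids)]
    # IF: correct without interpolation but wrong once ITP spectra are added.
    lost = sum(1 for d, i in zip(db_ok, itp_ok) if d and not i)
    return {
        "IM": 100.0 * sum(itp_ok) / n,
        "DBM": 100.0 * sum(db_ok) / n,
        "IF": 100.0 * lost / n,
    }

# Toy labels for four test spectra
summary = identification_summary(
    true_ids=["A", "B", "C", "D"],
    db_best_ids=["A", "B", "X", "D"],
    itp_best_ids=["A", "B", "C", "Y"],
)
print(summary)
```

In this toy case interpolation rescues one spectrum and loses another, so IM equals DBM even though IF is nonzero. This is why IF is worth reporting alongside the two accuracy figures.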
Is Robust to Similarity Metrics
Our interpolation method
is robust to alternative similarity metrics. Here we report our findings
using a novel spectral entropy metric that has been reported to outperform 42 alternative
similarity metrics, including cosine similarity, for MS/MS library matching.[38] We test the capacity for spectral interpolation
to improve identification with entropy similarity using the four instrument/precursor
pairs with the most data (QHP, EHP, EHN, LHP). Results are
reported in Table 3.
Table 3
Summary of Entropy Similarity Metric Analysis: IM Indicates the Percent of Test Spectra That Were Accurately Identified by the Highest Cosine Similarity Match in ITP Augmented Databases, DBM Indicates the Percent of Spectra Accurately Identified Using Just the Subset Database, and IF Indicates the Percent of Spectra That Become Misidentified When ITP Spectra Are Added to a Subset Database
| | IM | DBM | IF |
|---|---|---|---|
| QHP | 97.2% | 95.0% | 0.5% |
| EHP | 95.7% | 91.2% | 0.6% |
| EHN | 94.6% | 89.6% | 0.8% |
| LHP | 96.4% | 92.9% | 0.5% |
Entropy similarity increases identification accuracy across the
board, and supplementing a subset database with interpolation leads
to further improvement in all cases. For the EHP case, combining the
entropy similarity metric with interpolation raises identification
accuracy by roughly 20 percentage points over cosine similarity matching
against the subset database alone (95.7% versus 76.0%), demonstrating the
pronounced effect the choice of methodology can have on performance.
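For readers unfamiliar with entropy-based matching, the sketch below shows one commonly stated unweighted form of spectral entropy similarity: one minus the entropy gained by mixing the two normalized spectra, scaled by ln 4. It should be read as an assumption about the general idea rather than a reproduction of the cited metric, which to our understanding also involves intensity weighting and peak matching within an m/z tolerance. Inputs are assumed to be non-empty intensity vectors on a shared m/z grid, and the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(intensities):
    """Shannon entropy (natural log) of a spectrum's normalized intensities."""
    p = np.asarray(intensities, dtype=float)
    p = p[p > 0]
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def entropy_similarity(a, b):
    """Unweighted entropy similarity of two spectra on a shared m/z grid.

    Both inputs must have positive total intensity.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    merged = a + b  # 1:1 mixture of the two normalized spectra
    gain = 2.0 * shannon_entropy(merged) - shannon_entropy(a) - shannon_entropy(b)
    return float(1.0 - gain / np.log(4.0))

print(entropy_similarity([1, 2, 3], [1, 2, 3]))  # identical spectra
print(entropy_similarity([1, 0], [0, 1]))        # no shared peaks
```

Identical spectra add no entropy when mixed, and spectra with no shared peaks add exactly ln 4, so the score runs from 1 down to 0.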
Strengthens Spectra Matching across Instrument Types
There is a set of molecules in NIST20 with [M + H]+ precursors
and spectra on both the Agilent Q-TOF and Elite Orbitrap. We test
the use of ITP spectra for molecule identification across instruments
using this overlapping data set of 892 molecules. Results using both
cosine similarity and entropy similarity for matching are shown in Table 4.
Table 4
Summary of Cross-Instrument Analysis: Percent of Test Spectra on the Test Instrument (Test Inst.) Accurately Identified Using the Reference Spectra from Another Instrument (DB Inst.) with the Given Comparison Metric Either Directly Matching against the Database (DM) or Matching against Interpolated Spectra (IM). The Column ΔeV Reports the Average Difference between the Electronvolt of the Test Spectrum and the Electronvolt of the Closest Matching Interpolated Database Spectrum
| Test Inst. | DB Inst. | Metric | IM | DM | ΔeV |
|---|---|---|---|---|---|
| QHP | EHP | cosine | 89% | 85% | 4.7 |
| EHP | QHP | cosine | 84% | 82% | −3.4 |
| QHP | EHP | entropy | 93% | 92% | 4.8 |
| EHP | QHP | entropy | 91% | 90% | −4.0 |
For both QHP and EHP, the use of ITP spectra improves cross-database
identification, though the improvement is less than that observed
in self-comparisons. For instance, when the entropy similarity metric
is used, only a 1% improvement is seen in accuracy. This suggests
there may be changes in the relationships between peaks and collision
energies across instruments that limit the accuracy of interpolation
approximations from one instrument to another.
An interesting result is also seen in the average ΔeV value for
the cross-instrument comparisons. There is a consistent
shift with both similarity metrics in the collision energies of the
closest matching spectra that suggests the Agilent Q-TOF spectra are
most similar to the Elite Orbitrap spectra at ∼4 eV higher
collision energies. A similar result was found experimentally in a
systematic comparison of Q-TOF and Orbitrap HCD MS/MS spectra at varying
collision energies for peptide MS/MS spectra.[39] Our analysis suggests such a shift may hold more generally across
a diverse set of chemical classes.
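A signed average offset of the kind reported as ΔeV can be computed as sketched below. The sign convention (matched energy minus test energy) is an assumption made for this illustration and may not be the convention used in Table 4. The function name and the energies are hypothetical.

```python
import numpy as np

def mean_ev_offset(test_energies, matched_energies):
    """Average signed offset (eV) between each test spectrum's collision energy
    and the collision energy of its closest-matching interpolated spectrum.

    Assumed convention: matched minus test.
    """
    test = np.asarray(test_energies, dtype=float)
    matched = np.asarray(matched_energies, dtype=float)
    if test.shape != matched.shape:
        raise ValueError("one matched energy is required per test spectrum")
    return float(np.mean(matched - test))

# Made-up energies: each test spectrum matches best a few eV higher
offset = mean_ev_offset([10, 20, 30], [15, 24, 35])
print(round(offset, 2))
```

Keeping the sign, rather than averaging absolute distances, is what makes a systematic shift between instruments visible.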
Works at High Collision Energies
While most of the
NIST20 spectra data lie in the range of 10–40 eV, there are
a number of spectra taken at higher electronvolts on the Orbitrap
instruments. We test our method with the same procedure at higher
collision energies by constructing databases with three spectra at
∼50, 70, and 90 eV on the instrument/precursor pairs EHP, EHN,
and LHP. The data sets at these energies are limited due to our
focus on small molecules (100–500 Da) and are generally
more representative of the heavier end of that range. Recall that
the LHP trials use approximated electronvolt values as described
in the Methods section. Results for the identification
of test spectra using cosine similarity are shown in Table 5.
Table 5
Summary of High Collision Energy Analysis on Orbitrap Instruments: The Number of Molecules Available in NIST20 for the Given Instrument/Precursor Type Pairing (# mol.), the Average Highest Cosine Similarity between a Test Spectrum and an ITP Spectra Enhanced Database (maxITPsim), the Average Highest Cosine Similarity between a Test Spectrum and the Subset Database (maxΔsim), the Percent of Test Spectra Accurately Identified by Their Best Match Either by Directly Matching against the Database (DBM) or Matching against ITP Spectra (IM), the Average Distance between the Test Spectrum’s Electronvolt and the Closest Match’s Electronvolt (ΔeV), and the Percent of Spectra Correctly Identified without Interpolation That Were Incorrectly Identified with Interpolation (IF)
| | # mol. | maxITPsim | maxΔsim | ΔeV | IM | DBM | IF |
|---|---|---|---|---|---|---|---|
| EHP | 3348 | 0.99 | 0.047 | 0.77 | 96.1% | 84.2% | 0.3% |
| EHN | 1452 | 0.99 | 0.033 | 0.76 | 98.3% | 95.2% | 0.4% |
| LHP | 3935 | 0.96 | 0.038 | 0.28 | 96.4% | 92.8% | 0.5% |
Again, the use of interpolation improves identification accuracy
in all cases, demonstrating that this method can be effective at a
range of collision energies. In fact, both the average interpolation
accuracy and identification performance are higher with the higher
collision energy databases.
Works with Different Workflows
To test the robustness
of our method, we also sourced spectra from MassBank. In particular,
Dr. Nikolaos Thomaidis at the University of Athens has submitted a
large number of ESI-LC-QTOF spectra from a Bruker maXis Impact using
an [M + H]+ precursor.[40] For
549 molecules with molecular weight between 100 and 500 Da, spectra
at collision energies of 10, 20, 30, 40, and 50 eV are available. We construct
a reduced database with three spectra for each molecule at 10, 30,
and 50 eV and test identification of the 20 and 40 eV spectra using
interpolation. Results for spectra identification using cosine similarity
are in Table 6.
Table 6
Summary of Analysis Using Spectra from MassBank: The Number of Molecules Available (# mol.), the Average Highest Cosine Similarity between a Test Spectrum and an ITP Spectra Enhanced Database (maxITPsim), the Average Highest Cosine Similarity between a Test Spectrum and the Subset Database (maxΔsim), the Percent of Test Spectra Accurately Identified by Their Best Match Either by Directly Matching against the Database (DBM) or Matching against ITP Spectra (IM), the Average Distance between the Test Spectrum’s Electronvolt and the Closest Match’s Electronvolt (ΔeV), and the Percent of Spectra Correctly Identified without Interpolation That Were Incorrectly Identified with Interpolation (IF)
| # mol. | maxITPsim | maxΔsim | ΔeV | IM | DBM | IF |
|---|---|---|---|---|---|---|
| 549 | 0.94 | 0.039 | −0.12 | 91.2% | 89.5% | 2.6% |
Interpolation improves the identification accuracy
on the MassBank
data, though the improvement is less than what we see in the NIST20
Q-TOF data, and the IF percent is also higher. The reduced improvement
may be due to the increased distance between available energies in
the database, making the interpolations less accurate. This hypothesis
is supported by the lower average maxITPsim value.
Conclusion
Spectral interpolation provides a quick,
robust method to improve
small-molecule identification with MS/MS reference matching from limited
data sets. We found interpolation to consistently improve the percent
of spectra correctly identified across instrument and precursor types
with only three database spectra per molecule. The method offers the
most benefit for instances where only a few spectra are available
with diminishing returns as more database spectra are added. Augmenting
databases using spectral interpolation offers a transparent method
for improving identification where inspection of both spectra estimates
and predicted relationships between peaks and collision energy is
straightforward as shown in Figure . The methodology is agnostic to the choice of comparison
metric and can be used in any workflow where spectra at multiple collision
energies are available.
Data and Software Availability
The Supporting Information
for this article contains a set of Python functions that can be applied
to mass spectrometry data to generate the interpolations described.
A sample set of three Q-TOF spectra of capsaicin is included. Our
benchmarking work was performed as described using data from NIST20
and can be replicated by licensed users. The methods described here
can be applied to any available mass spectrometry data set.