Sebastian Böcker, Kai Dührkop.
Abstract
BACKGROUND: Untargeted metabolomics commonly uses liquid chromatography mass spectrometry to measure abundances of metabolites; subsequent tandem mass spectrometry is used to derive information about individual compounds. One of the bottlenecks in this experimental setup is the interpretation of fragmentation spectra to accurately and efficiently identify compounds. Fragmentation trees have become a powerful tool for the interpretation of tandem mass spectrometry data of small molecules. These trees are determined from the data using combinatorial optimization, and aim at explaining the experimental data via fragmentation cascades. Fragmentation tree computation does not require spectral or structural databases. To obtain biochemically meaningful trees, one needs an elaborate optimization function (scoring).
Keywords: Computational methods; Fragmentation trees; Mass spectrometry; Metabolites; Natural products
Year: 2016 PMID: 26839597 PMCID: PMC4736045 DOI: 10.1186/s13321-016-0116-8
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Number of molecular formulas that match the mass of some precursor peak in the Agilent and GNPS datasets, using the maximum of 10 ppm and 2 mDa as allowed mass deviation. Note the logarithmic scale of the y-axis. SIRIUS 3 restricts the set of candidate molecular formulas solely by the non-negative ring double bond equivalent (RDBE) rule (green), see (3). More restrictive filtering such as the Seven Golden Rules [20] (orange) further reduces the number of molecular formulas to be considered; nevertheless, multiple explanations remain for most precursor ions. We find that 1.6 % of the compounds in our datasets violate the Seven Golden Rules. We also report the number of molecular formulas found in PubChem for the above-mentioned mass accuracy
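The candidate counts in Fig. 1 come from decomposing a mass under a tolerance and an RDBE filter. The following is a minimal sketch of that idea, not the SIRIUS 3 implementation. It assumes neutral masses, the alphabet CHNO only, a brute-force search, and the standard valence formula RDBE = C - H/2 + N/2 + 1; equation (3) is not reproduced in this record, so that formula is an assumption. The function names are hypothetical.

```python
from itertools import product

# Monoisotopic masses in Da (standard values for the most abundant isotopes).
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def allowed_deviation(mass, ppm=10.0, min_abs=0.002):
    """Larger of the relative (ppm) and the absolute (Da) tolerance."""
    return max(mass * ppm * 1e-6, min_abs)

def rdbe(c, h, n, o):
    """Ring double bond equivalent of a CHNO formula (standard valences)."""
    return c - h / 2 + n / 2 + 1

def candidate_formulas(target_mass):
    """List (c, h, n, o) tuples within tolerance that have RDBE >= 0."""
    tol = allowed_deviation(target_mass)
    limits = [int((target_mass + tol) / MASS[el]) for el in ("C", "N", "O")]
    hits = []
    for c, n, o in product(*(range(m + 1) for m in limits)):
        rest = target_mass - c * MASS["C"] - n * MASS["N"] - o * MASS["O"]
        h = round(rest / MASS["H"])  # at this tolerance only one H count can fit
        if h < 0 or abs(rest - h * MASS["H"]) > tol:
            continue
        if rdbe(c, h, n, o) >= 0:
            hits.append((c, h, n, o))
    return hits

# Neutral monoisotopic mass of glucose, C6H12O6, used as a test mass.
glucose = 6 * MASS["C"] + 12 * MASS["H"] + 6 * MASS["O"]
candidates = candidate_formulas(glucose)
print(len(candidates), (6, 12, 0, 6) in candidates)
```

Below 200 Da the 2 mDa floor is the larger of the two tolerances, which is why the caption takes the maximum of both.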
Fig. 2 Example of a fragmentation tree. Left: the molecular structure of Nateglinide. Right: the measured MS/MS spectrum of Nateglinide from the GNPS dataset. Middle: the FT computed from the MS/MS spectrum. Each node is labeled with the molecular formula of the corresponding ion, and each edge is labeled with the molecular formula of the corresponding loss. For nodes, we also report m/z and relative intensity of the corresponding peak. We stress that the FT is computed without any knowledge of the molecular structure and without using any database, but solely from the MS/MS spectrum
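The node and edge labels described in this caption can be sketched as a small data structure: each node stores a molecular formula, and each edge label is the element-wise difference between parent and child. The toy tree below is hypothetical and is not the Nateglinide tree from the figure; `loss_formula`, `nodes` and `parent_of` are illustrative names.

```python
from collections import Counter

def loss_formula(parent, child):
    """Edge label: element-wise difference between parent and child formula."""
    diff = Counter(parent)
    diff.subtract(child)
    if any(count < 0 for count in diff.values()):
        raise ValueError("child is not a sub-formula of parent")
    return {el: count for el, count in diff.items() if count > 0}

# Hypothetical toy tree (not the tree from Fig. 2): child node -> parent node.
nodes = {
    "root": {"C": 6, "H": 12, "O": 6},
    "a": {"C": 6, "H": 10, "O": 5},
    "b": {"C": 5, "H": 10, "O": 4},
}
parent_of = {"a": "root", "b": "a"}

edge_losses = {
    (parent, child): loss_formula(nodes[parent], nodes[child])
    for child, parent in parent_of.items()
}
print(edge_losses)
```

In this toy tree the two edges carry the losses H2O and CO, both of which appear among the common losses tabulated further down.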
Priors for implausible losses
| Probability | Loss type and molecular formulas |
|---|---|
| | Implausible losses: |
| | Neutral losses with negative ring double bond equivalent (RDBE) |
| 0.1 | Nitrogen-only losses, carbon-only losses: for example, |
| 1 | All other neutral losses |
| 0.9 | Common radical losses: |
| | All other radical losses |
For an edge (u, v) with loss l, let the prior for l be chosen according to this table. Expert knowledge and evaluation of FTs from SIRIUS2 resulted in the implausible losses listed here [41]. These losses should only very rarely (if ever) occur in an FT, so we manually select reduced priors
Fig. 3 Analysis workflow. After importing the tandem mass spectra of a compound, all molecular formulas within the mass accuracy of the parent peak are generated (3). Each of these candidates is then scored (4–7) and, finally, candidates are sorted with respect to this score (8). To score a candidate molecular formula, we compute the fragmentation graph with the candidate formula being the root (4); score the edges of the graph using Bayesian statistics (5); find the best-scoring FT in this graph using combinatorial optimization (6); finally, we use hypothesis-driven recalibration to find a best match between theoretical and observed peak masses (7), recalibrate, and repeat steps (4–6) for this candidate formula. In our evaluation, we compare the output list with the true answer (9)
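The control flow of steps (4) to (8) can be sketched as a loop over candidate formulas. This is only a skeleton under stated assumptions: the four callables `build_graph`, `score_edges`, `best_tree` and `recalibrate` are hypothetical placeholders for the steps named in the caption, and the toy stand-ins at the bottom do no real chemistry.

```python
def rank_candidates(spectrum, candidates, build_graph, score_edges,
                    best_tree, recalibrate):
    """Score every candidate formula and sort best first (steps 4 to 8)."""
    ranked = []
    for formula in candidates:
        # Steps 4-6 on the spectrum as measured.
        tree, score = best_tree(score_edges(build_graph(spectrum, formula)))
        # Step 7: recalibrate against this hypothesis, then repeat steps 4-6.
        recalibrated = recalibrate(spectrum, tree)
        tree, score = best_tree(score_edges(build_graph(recalibrated, formula)))
        ranked.append((score, formula, tree))
    ranked.sort(key=lambda entry: entry[0], reverse=True)  # step 8
    return ranked

# Toy stand-ins so the sketch runs; scores and formulas are made up.
toy_scores = {"C6H12O6": 3.0, "C7H16O5": 1.5}
result = rank_candidates(
    spectrum=[(163.06, 1.0)],
    candidates=list(toy_scores),
    build_graph=lambda spec, formula: {"root": formula, "peaks": spec},
    score_edges=lambda graph: graph,
    best_tree=lambda graph: (graph, toy_scores[graph["root"]]),
    recalibrate=lambda spec, tree: spec,
)
print([formula for _, formula, _ in result])
```

Passing the steps in as callables keeps the loop independent of how the tree is actually optimized, for example by integer linear programming or dynamic programming.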
Fig. 11 Left: normalized histogram of the mass error distribution, for the GNPS dataset. Right: normalized histogram of the noise peak intensity distribution and fitted Pareto distribution (dashed line), for the GNPS dataset
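A Pareto fit like the one drawn in the right panel can be sketched with the textbook maximum likelihood estimator for the shape parameter. How the distribution was actually fitted is not stated in this record, so the estimator, the fixed `x_min`, and the sample intensities below are assumptions for illustration.

```python
import math

def fit_pareto_shape(intensities, x_min):
    """Maximum likelihood estimate of the Pareto shape for a fixed x_min."""
    tail = [x for x in intensities if x >= x_min]
    return len(tail) / sum(math.log(x / x_min) for x in tail)

def pareto_pdf(x, x_min, alpha):
    """Pareto density; zero below x_min."""
    if x < x_min:
        return 0.0
    return alpha * x_min ** alpha / x ** (alpha + 1)

# Hypothetical relative noise intensities, for illustration only.
noise = [0.002, 0.003, 0.0025, 0.01, 0.004]
alpha = fit_pareto_shape(noise, x_min=0.002)
print(alpha, pareto_pdf(0.005, 0.002, alpha))
```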
Fig. 4 Performance evaluation, percentage of instances (y-axis) where the correct molecular formula is present in the top k (x-axis). Left: performance evaluation for different methods on both datasets. Methods are “SIRIUS 3” (the method presented here), “SIRIUS2-ILP” (scores from [41, 42] solved by integer linear programming), “SIRIUS2-DP” (scores from [41, 42] solved by dynamic programming), and “PubChem search” (searching PubChem for the closest precursor mass). Right: performance of SIRIUS 3 for the two compound batches (CHNOPS as solid line, “contains FClBrI” as dashed line) and the two datasets (GNPS green, Agilent blue)
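The quantity on the y-axis can be computed from the rank of the correct molecular formula in each output list. A minimal sketch, assuming 1-based ranks and `None` for instances where the correct formula is missing from the list; the ranks below are made up.

```python
def top_k_rate(ranks, k):
    """Percentage of instances whose correct formula has rank <= k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)

# Hypothetical ranks of the correct formula for five compounds;
# None means the correct formula was not in the output list.
ranks = [1, 1, 2, 5, None]
curve = {k: top_k_rate(ranks, k) for k in (1, 2, 5)}
print(curve)  # {1: 40.0, 2: 60.0, 5: 80.0}
```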
Fig. 5 Left: identification rates of all methods in dependence on the mass of the compound, compare to Fig. 4. Restricting SIRIUS 3 to molecular formulas from PubChem is included for comparison. Right: histogram for masses of all compounds in the two datasets, bin width 50 Da
Fig. 6 Identification rates of SIRIUS 3, SIRIUS2-ILP and SIRIUS2-DP depending on the number of candidate molecular formulas: that is, the number of decompositions of the precursor mass that have non-negative RDBE, see (3). Searching PubChem by precursor mass, and restricting SIRIUS 3 to molecular formulas from PubChem are included for comparison
Fig. 7 Performance evaluation of SIRIUS 3 when adding isotope information, percentage of instances (y-axis) where the correct answer is present in the top k (x-axis). Isotope pattern filtering efficiency 5 % (solid), 10 % (dashed), and 20 % (dotted). Batch CHNOPS (left) and “contains FClBrI” (right), datasets GNPS (green) and Agilent (blue)
Fig. 8 Left: histogram of running times of all instances (compounds) in the two datasets. Right: cumulative distribution of running times
Performance comparison of SIRIUS 3 with MOLGEN-MS/MS using 60 compounds from [49], uncalibrated spectra
| | MOLGEN-MS/MS with isotopes, 10 ppm | MOLGEN-MS/MS with isotopes, 5 ppm | SIRIUS2-DP with isotopes, 10 ppm | SIRIUS2-DP with isotopes, 5 ppm | SIRIUS 3 without isotopes, 10 ppm | SIRIUS 3 without isotopes, 5 ppm | SIRIUS 3 with isotopes, 10 ppm | SIRIUS 3 with isotopes, 5 ppm |
|---|---|---|---|---|---|---|---|---|
| Top 1 | 36 | 34 | 34 | 35 | 49 | 45 | 55 | 56 |
| Top 2 | 44 | 47 | 50 | 46 | 51 | 51 | 58 | 58 |
| Top 5 | 54 | 55 | 52 | 48 | 58 | | 60 | 60 |
| Average rank | 2.55 | 2.30 | 1.57 | 1.63 | 1.58 | 1.5 | 1.17 | 1.15 |
| Worst rank | 23 | 20 | 11 | 15 | 10 | | 5 | 5 |
All tools are run with mass accuracy parameter 5 and 10 ppm. Best entries in italics. Results for MOLGEN-MS/MS and SIRIUS2-DP taken from [49]. In that evaluation, SIRIUS2-DP crashed 7/5 times for 10/5 ppm mass accuracy, and did not consider the correct molecular formula of the compound for 0/6 compounds
Fig. 9 Similarity search performance plots for chemical similarity. Methods “SIRIUS 3” and “SIRIUS2-DP” compare trees via tree alignments [42]. Method “peak counting” uses direct spectral comparison. Method “MACCS” uses fingerprints computed from the structure of the compound. Left: similarity search results using leave-one-out evaluation on both datasets. Right: similarity search across databases: compounds from GNPS are searched in Agilent, and vice versa
Fig. 10 Left: histogram of compounds from KEGG that show a particular ratio between hetero atoms (except oxygen) and carbon atoms (green); histogram of all decompositions of compound masses from KEGG over the alphabet CHNOPS with mass accuracy 10 ppm (red). We observe that compounds from KEGG [56] have relatively small ratios, whereas this ratio can get arbitrarily large for the decompositions that, in most cases, do not correspond to true molecules. Normalized density of the prior (dashed). Right: histogram of the corrected RDBE values from (4) (green); histogram of all decompositions (red); normalized density of the prior (dashed)
Priors for common losses l
| Mol. formula | Mass | Loss name | Known | Intensity GNPS, total | Intensity GNPS, expected | Intensity Agilent, total | Intensity Agilent, expected | Common loss prior |
|---|---|---|---|---|---|---|---|---|
| H (a) | 1.0078 | Hydrogen radical | | 110 | 0.00 | 77 | 0.00 | a |
| H2 | 2.0157 | Hydrogen | A, B | 1799 | 0.00 | 890 | 0.00 | a |
| CH2 | 14.0157 | Methylene | | 33 | 17.47 | 71 | 37.35 | 1.92 |
| CH3 | 15.0235 | Methyl | A | 3231 | 46.48 | 1481 | 21.31 | 69.53 |
| CH4 | 16.0313 | Methane | A, B, C | 2011 | 75.23 | 929 | 34.76 | 26.73 |
| H3N | 17.0265 | Ammonia | A, B, C | 1409 | 62.73 | 1481 | 65.92 | 22.47 |
| H2O | 18.0106 | Water | A, B, C | 5548 | 85.53 | 4014 | 61.88 | 64.87 |
| HF | 20.0062 | Hydrogen fluoride | | 266 | 13.43 | 365 | 18.36 | 19.88 |
| C2H2 | 26.0157 | Ethine | B, C | 2434 | 133.98 | 2324 | 127.90 | 18.17 |
| CHN | 27.0109 | Hydrogen cyanide | | 1117 | 139.90 | 1078 | 134.94 | 7.99 |
| CO | 27.9949 | Carbon monoxide | B, C | 4232 | 177.14 | 2614 | 109.45 | 23.89 |
| C2H4 | 28.0313 | Ethene | A, B, C | 483 | 87.19 | 1108 | 199.82 | 5.55 |
| CH3N | 29.0265 | Methyleneimine | B | 347 | 158.43 | 305 | 139.34 | 2.19 |
| S | 31.9721 | Sulfur | B, C | 79 | 38.60 | 179 | 87.07 | 2.06 |
| CH4O | 32.0262 | Methyl esters | | 202 | 127.42 | 341 | 214.18 | 1.59 |
| Cl | 34.9689 | Chlorine | | 296 | 45.18 | 394 | 60.27 | 6.55 |
| HCl | 35.9767 | Hydrogen chloride | | 462 | 45.88 | 613 | 60.95 | 10.07 |
| C2H2O | 42.0106 | Ketene | B, C | 811 | 246.67 | 584 | 177.75 | 3.29 |
| C3H6 | 42.0470 | Propene | | 207 | 101.85 | 656 | 322.40 | 2.03 |
| C2H5N | 43.0422 | Aminoethylene | | 332 | 177.18 | 454 | 242.22 | 1.88 |
| CO2 | 43.9898 | Carbon dioxide | B, C | 281 | 199.41 | 215 | 153.06 | 1.41 |
| Br | 78.9183 | Bromine | | 20 | 0.91 | 95 | 4.23 | 22.51 |
| HBr | 79.9262 | Hydrogen bromide | | 9 | 0.63 | 65 | 4.38 | 14.98 |
| HO3P | 79.9663 | Metaphosphoric acid | B, C | 3 | 0.78 | 25 | 6.55 | 3.93 |
| HO2PS | 95.9435 | Phosphenothioic acid | | 0 | 0.11 | 26 | 4.60 | 5.65 |
| I | 126.9045 | Iodine | | 29 | 0.25 | 60 | 0.52 | 116.53 |
| HI | 127.9123 | Hydrogen iodide | | 11 | 0.15 | 45 | 0.61 | 74.61 |
| CIO | 154.8994 | Iodomethanone | | 0 | 0.04 | 3 | 0.32 | 10.28 |
| | 223.0303 | | | 20 | 1.12 | 5 | 0.30 | 18.54 |
| C12H8ClNS | 233.0066 | 2-Chlorophenothiazine | | 1 | 0.06 | 25 | 0.83 | 30.72 |
| I2 | 253.8089 | Iodine | | 0 | 0.00 | 10 | 0.03 | 357.31 |
| | 256.0170 | | | 3 | 0.12 | 9 | 0.40 | 24.93 |
Entry “mass” is the exact theoretical mass of the loss. Entry “known” indicates whether the loss was included in the expert-curated common loss lists in A [34], B [41], or C [42]. Entry “total” indicates the (rounded) frequency of the loss in the trees computed from the dataset, weighted by the maximum peak intensity of the two peaks that are responsible for this loss. Entry “expected” is the weighted frequency we would expect from the loss mass prior, and the last column is the common loss prior after correcting for the loss mass prior
a Losses H and H2 can be interpreted as artifacts of the loss mass prior
b , and are artifacts, stemming from either their high mass or the small number of chlorine-containing compounds in the datasets
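The tabulated common loss priors appear to be consistent with dividing the observed weighted frequency by the expected one, pooled over both datasets. This relation is inferred from the numbers in the table and is not stated as a formula in this record, so the sketch below is an assumption; `common_loss_prior` is a hypothetical name.

```python
def common_loss_prior(total_gnps, expected_gnps, total_agilent, expected_agilent):
    """Observed over expected weighted frequency, pooled over both datasets."""
    return (total_gnps + total_agilent) / (expected_gnps + expected_agilent)

# Values copied from the table rows for water and carbon monoxide.
water = common_loss_prior(5548, 85.53, 4014, 61.88)
carbon_monoxide = common_loss_prior(4232, 177.14, 2614, 109.45)
print(round(water, 2), round(carbon_monoxide, 2))
```

Because the “total” entries are rounded, some rows agree with the ratio only approximately; methylene, for instance, gives about 1.90 against a tabulated 1.92.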
Priors for common fragments f
| Molecular formula, protonated | Molecular formula, neutral | Ion mass | Total intensity, GNPS | Total intensity, Agilent | Total count, GNPS | Total count, Agilent | Common fragment prior |
|---|---|---|---|---|---|---|---|
| C3H6N+ | C3H5N | 56.0495 | 0.00 | 93.63 | 0 | 392 | 2.40 |
| C3H8N+ | C3H7N | 58.0651 | 0.67 | 100.53 | 4 | 323 | 2.59 |
| | C5H4 | 65.0386 | 0.00 | 83.35 | 0 | 530 | 2.14 |
| | | 70.0651 | 7.38 | 56.00 | 8 | 313 | 1.62 |
| | | 72.0808 | 0.00 | 72.92 | 0 | 179 | 1.87 |
| | | 77.0386 | 1.00 | 139.35 | 3 | 720 | 3.60 |
| | | 79.0542 | 0.52 | 69.85 | 3 | 514 | 1.80 |
| | | 86.0964 | 0.92 | 71.08 | 5 | 128 | 1.85 |
| | | 91.0542 | 60.61 | 252.97 | 300 | 720 | 8.04 |
| | | 92.0495 | 3.92 | 76.12 | 31 | 185 | 2.05 |
| | | 97.0648 | 10.95 | 58.00 | 37 | 86 | 1.77 |
| | | 98.0964 | 8.73 | 74.93 | 66 | 139 | 2.14 |
| | | 103.0542 | 64.49 | 34.61 | 562 | 241 | 2.54 |
| | | 105.0335 | 50.43 | 47.64 | 178 | 100 | 2.51 |
| | | 105.0699 | 108.25 | 104.51 | 580 | 352 | 5.45 |
| | | 107.0491 | 62.41 | 53.68 | 320 | 187 | 2.98 |
| | | 107.0855 | 35.23 | 29.89 | 171 | 120 | 1.67 |
| | | 108.0444 | 23.05 | 48.66 | 64 | 76 | 1.84 |
| | | 109.0648 | 37.15 | 53.40 | 161 | 107 | 2.32 |
| | | 115.0542 | 73.68 | 43.35 | 618 | 262 | 3.00 |
| | | 117.0699 | 61.67 | 46.66 | 371 | 206 | 2.78 |
| | | 119.0855 | 51.94 | 40.94 | 265 | 190 | 2.38 |
| | | 121.0648 | 125.09 | 72.05 | 394 | 201 | 5.05 |
| | | 128.0621 | 51.79 | 13.98 | 305 | 98 | 1.69 |
| | | 129.0699 | 60.60 | 24.99 | 425 | 163 | 2.19 |
| | | 130.0651 | 75.54 | 31.39 | 343 | 159 | 2.74 |
| | | 131.0855 | 61.12 | 34.37 | 277 | 144 | 2.45 |
| | | 132.0808 | 58.68 | 21.70 | 216 | 99 | 2.06 |
| | | 135.0441 | 40.37 | 22.03 | 176 | 53 | 1.60 |
| | | 135.0804 | 42.95 | 32.87 | 221 | 121 | 1.94 |
| | | 143.0855 | 54.61 | 23.88 | 288 | 118 | 2.01 |
| | | 144.0808 | 61.59 | 20.67 | 220 | 99 | 2.11 |
| | | 145.1012 | 57.60 | 28.15 | 219 | 110 | 2.20 |
| | | 146.0600 | 62.09 | 7.40 | 242 | 52 | 1.78 |
| | | 147.0804 | 67.16 | 33.97 | 247 | 107 | 2.59 |
| | | 159.0804 | 47.36 | 17.07 | 230 | 84 | 1.65 |
| | | 160.0757 | 58.53 | 18.10 | 221 | 40 | 1.96 |
| | | 165.0699 | 54.17 | 36.32 | 255 | 163 | 2.32 |
| | | 167.0855 | 28.57 | 36.85 | 123 | 65 | 1.68 |
| | | 171.0804 | 44.97 | 28.37 | 164 | 62 | 1.88 |
Entry “ion mass” is the exact theoretical mass of the protonated fragment. Entries “GNPS/Agilent” indicate the total sum of the peak intensities and the total peak count of the fragment in the two datasets. Note that a particular fragment can be very common, yet have a relatively small sum of peak intensities, because the fragment's peaks are consistently of small intensity
Fig. 12 Loss mass distribution, after the final round of parameter estimation. Frequencies of the losses are weighted by the intensity of their peaks. The frequencies of the identified common losses have been decreased to the value of the log-normal distribution. Left: normalized histogram for bin width 17 Da (green). Right: kernel density estimation (green). Maximum likelihood estimate of the log-normal distribution drawn in both plots (black, dashed)
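The dashed curve is a maximum likelihood fit of a log-normal distribution to intensity-weighted loss masses. A minimal sketch of such a weighted fit follows, using the standard closed-form estimators (weighted mean and standard deviation of the log masses). The sample masses and weights are hypothetical, and the iterative correction for common losses mentioned in the caption is not modelled.

```python
import math

def fit_lognormal(masses, weights=None):
    """Weighted maximum likelihood estimate (mu, sigma) of a log-normal."""
    if weights is None:
        weights = [1.0] * len(masses)
    total = sum(weights)
    logs = [math.log(m) for m in masses]
    mu = sum(w * x for w, x in zip(weights, logs)) / total
    var = sum(w * (x - mu) ** 2 for w, x in zip(weights, logs)) / total
    return mu, math.sqrt(var)

def lognormal_pdf(x, mu, sigma):
    """Density of the log-normal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

# Hypothetical loss masses (Da) with peak intensities as weights.
loss_masses = [18.0106, 27.9949, 42.0106, 79.9663]
intensities = [0.9, 0.5, 0.3, 0.1]
mu, sigma = fit_lognormal(loss_masses, intensities)
print(mu, sigma, lognormal_pdf(30.0, mu, sigma))
```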