Christoph Ruttkies, Emma L Schymanski, Sebastian Wolf, Juliane Hollender, Steffen Neumann.
Abstract
BACKGROUND: The in silico fragmenter MetFrag, launched in 2010, was one of the first approaches combining compound database searching and fragmentation prediction for small molecule identification from tandem mass spectrometry data. Since then many new approaches have evolved, as has MetFrag itself. This article details the latest developments to MetFrag and its use in small molecule identification since the original publication.
Keywords: Compound identification; High resolution mass spectrometry; In silico fragmentation; Metabolomics; Structure elucidation
Year: 2016 PMID: 26834843 PMCID: PMC4732001 DOI: 10.1186/s13321-016-0115-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Comparison of in silico fragmentation results for 473 Eawag Orbitrap spectra (formula search)
| | MetFrag2010 | MetFrag2.2 | MetFrag2.2 | CFM-ID | MetFrag2.2 + CFM-ID |
|---|---|---|---|---|---|
| Database | ChemSpider | ChemSpider | PubChem | PubChem | PubChem |
| Pessimistic ranks | | | | | |
| Median rank | 8 | 4 | 12 | 11 | 8 |
| Mean rank | 74 | 38 | 141 | 127 | 85 |
| Mean RRP | 0.859 | 0.894 | 0.880 | 0.881 | 0.901 |
| Top 1 ranks | 73 (15 %) | 105 (22 %) | 30 (6 %) | 43 (9 %) | 62 (13 %) |
| Top 5 ranks | 202 | 267 | 145 | 170 | 202 |
| Top 10 ranks | 258 | 320 | 226 | 232 | 276 |
| Expected top ranks | | | | | |
| Top 1 ranks | 90 (19 %) | 124 (26 %) | 43 (9 %) | 57 (12 %) | 70 (15 %) |
| Top 5 ranks | 218 | 280 | 163 | 193 | 213 |
| Top 10 ranks | 274 | 329 | 245 | 261 | 288 |
MetFrag2010 and MetFrag2.2 were compared with the same ChemSpider candidate sets; MetFrag2.2 and CFM-ID with the same PubChem candidate sets. Far right: Best top 1 pessimistic ranks obtained by combining MetFrag2.2 and CFM-ID 2.0 with the weights and . The expected ranks, which partially account for equally scored candidates as calculated in [16], are shown in the lower part of the table
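The rank statistics reported in the table can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this record: higher scores are better, the pessimistic rank places the correct candidate last within its group of tied scores, the expected rank uses the middle of that tie group (the exact definition is the one in [16]), and the relative ranking position is taken as RRP = 0.5 * (1 - (BC - WC) / (TC - 1)), with BC better-scored, WC worse-scored and TC total candidates. The function name `rank_metrics` is hypothetical.

```python
from typing import Sequence


def rank_metrics(scores: Sequence[float], correct_index: int) -> dict:
    """Rank statistics of the correct candidate (higher score = better).

    Assumed conventions (not taken from the source):
    - pessimistic rank: ties are resolved against the correct candidate
    - expected rank: middle position of the tie group
    - RRP: 0.5 * (1 - (BC - WC) / (TC - 1)), so 1.0 is the best value
    """
    target = scores[correct_index]
    better = sum(1 for s in scores if s > target)   # BC
    worse = sum(1 for s in scores if s < target)    # WC
    total = len(scores)                             # TC
    tied = total - better - worse                   # includes the correct candidate

    pessimistic = better + tied
    expected = better + (tied + 1) / 2
    # With a single candidate the ratio is undefined; treat it as a perfect result.
    rrp = 1.0 if total == 1 else 0.5 * (1 - (better - worse) / (total - 1))
    return {"pessimistic": pessimistic, "expected": expected, "rrp": rrp}


if __name__ == "__main__":
    # The correct candidate (index 2) shares its score with two other candidates.
    print(rank_metrics([0.9, 0.7, 0.7, 0.7, 0.2], correct_index=2))
```

Under these conventions a candidate tied with many others is penalised heavily by the pessimistic rank but only moderately by the expected rank, which is consistent with the expected Top 1 counts in the table being higher than the pessimistic ones.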
PubChem and ChemSpider results (number of pessimistic top 1 ranks) for 473 Eawag Orbitrap spectra
| Weight term | Score term | Weights | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| | | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| | | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
The weights indicate where the score term was included (1) or excluded (0) from the candidate ranking. For PubChem ; for ChemSpider only. See text for explanations
PubChem and ChemSpider results for 473 Eawag Orbitrap spectra with formula retrieval, including in silico fragmentation, RT and reference information as shown, with the given for the highest number of Top 1 ranks
| | MetFrag2.2 | MetFrag2.2 | MetFrag2.2 | MetFrag2.2 + CFM-ID |
|---|---|---|---|---|
| Database | ChemSpider | PubChem | PubChem | PubChem |
| RT/log P | CDK XlogP | CDK XlogP | XLOGP3 | CDK XlogP |
| | 0.49 | 0.57 | 0.50 | 0.33 |
| | 0.19 | 0.02 | 0.16 | 0.03 |
| | 0.32 | 0.41 | 0.34 | 0.35 |
| | – | – | – | 0.29 |
| Median rank | 1 | 1 | 1 | 1 |
| Mean rank | 6.5 | 35 | 41 | 18 |
| Mean RRP | 0.990 | 0.977 | 0.977 | 0.978 |
| Top 1 ranks | 420 (89 %) | 336 (71 %) | 336 (71 %) | 343 (73 %) |
| Top 5 ranks | 447 | 396 | 398 | 411 |
| Top 10 ranks | 454 | 422 | 414 | 429 |
For PubChem ; for ChemSpider only. See text for explanations. Far right: combining CFM-ID results to incorporate complementary fragmentation information
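How weighted score terms can be merged into one ranking score is sketched below in Python. This is an illustration under assumptions, not MetFrag's implementation: each score term is assumed to be scaled by its maximum over the candidate set before weighting, and the term names (`frag`, `rt`, `refs`), the function `combine_scores` and the weights in the example are hypothetical and are not the values from the table.

```python
from typing import Dict


def combine_scores(
    candidates: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted sum of score terms, each scaled to [0, 1] by its maximum.

    candidates maps a candidate id to its raw score terms;
    weights maps a term name to its weight (assumed to sum to 1).
    """
    # Largest raw value per term, used for scaling (assumed normalisation).
    maxima = {
        term: max(scores[term] for scores in candidates.values())
        for term in weights
    }
    combined = {}
    for name, scores in candidates.items():
        combined[name] = sum(
            w * (scores[term] / maxima[term] if maxima[term] > 0 else 0.0)
            for term, w in weights.items()
        )
    return combined


if __name__ == "__main__":
    # Illustrative values only.
    candidates = {
        "A": {"frag": 100.0, "rt": 0.2, "refs": 10.0},
        "B": {"frag": 80.0, "rt": 0.4, "refs": 0.0},
    }
    weights = {"frag": 0.5, "rt": 0.2, "refs": 0.3}
    final = combine_scores(candidates, weights)
    for name in sorted(final, key=final.get, reverse=True):
        print(name, round(final[name], 3))
```

Setting a weight to 0 removes that term from the ranking, which corresponds to the included (1) and excluded (0) settings used in the weight tables.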
Fig. 1. Top 1 ranks with PubChem (XlogP3) on the Orbitrap XL Dataset. The results were obtained with MetFrag formula query and the inclusion of references and retention time. The reference score was calculated with the number of patents (PNP) and PubMed references (PPC). The larger dots show the best result (336 Top 1 ranks), 75th percentile (320), median (312), 25th percentile (249) and worst result (61). For the best result, the weights were and
Fig. 2. Top 1 ranks with ChemSpider on the Orbitrap XL Dataset. The results were obtained with MetFrag formula query and the inclusion of references and retention time. The reference score was calculated with the ChemSpider reference count (CRC). The larger dots show the best result (420), 75th percentile (399), median (388), 25th percentile (311) and worst result (104). The weights for the best result were and
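The two figures summarise how the number of Top 1 ranks varies with the choice of weights. One way such a scan could be set up is sketched below in Python, assuming that the three weights sum to 1, that they are sampled on a regular grid, and that a query counts as a Top 1 hit only if the correct candidate has the single highest combined score (the pessimistic convention). The grid resolution, the helper names and the toy data are hypothetical and do not reproduce the authors' procedure or results.

```python
from typing import Iterator, List, Sequence, Tuple

Weights = Tuple[float, float, float]
# One query: per-candidate score triples (already normalised) and the
# index of the correct candidate.
Query = Tuple[List[Sequence[float]], int]


def simplex_grid(steps: int) -> Iterator[Weights]:
    """Yield all weight triples k/steps that sum to 1."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            yield (i / steps, j / steps, (steps - i - j) / steps)


def count_top1(queries: List[Query], weights: Weights) -> int:
    """Number of queries whose correct candidate is the unique best."""
    hits = 0
    for candidates, correct in queries:
        totals = [sum(w * s for w, s in zip(weights, c)) for c in candidates]
        best = max(totals)
        # Pessimistic: a tie for first place does not count as a hit.
        if totals[correct] == best and totals.count(best) == 1:
            hits += 1
    return hits


def best_weights(queries: List[Query], steps: int = 10) -> Tuple[int, Weights]:
    """Return (Top 1 count, weights) for the best grid point found."""
    return max(
        ((count_top1(queries, w), w) for w in simplex_grid(steps)),
        key=lambda pair: pair[0],
    )


if __name__ == "__main__":
    # Toy data: two queries with two candidates each.
    queries = [
        ([(1.0, 0.2, 0.0), (0.8, 1.0, 1.0)], 0),
        ([(0.9, 1.0, 1.0), (1.0, 0.1, 0.0)], 0),
    ]
    print(best_weights(queries, steps=10))
```

Collecting `count_top1` for every grid point, rather than only the maximum, gives the distribution from which the best result, percentiles and worst result quoted in the captions can be read off.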
Results (Top 1, 5 and 10 ranks) using PubChem formula queries on three additional datasets
| Weight term | Score term | Weights | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| | | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| | | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
The weights indicate where ranking parameters were included (1) or excluded (0) from the candidate ranking. Retention time score calculation was performed using the XLOGP3 values of PubChem. . See text for explanations
Best Top 1 rank results on three additional datasets using PubChem formula queries including in silico fragmentation, RT and reference information as shown, with the given
| Dataset | UFZ (n = 225) | EQex (n = 289) | EQexPlus (n = 310) |
|---|---|---|---|
| Method | MetFrag2.2 | MetFrag2.2 | MetFrag2.2 |
| | 0.40 | 0.38 | 0.61 |
| | 0.23 | 0.27 | 0.11 |
| | 0.37 | 0.35 | 0.28 |
| Median rank | 1 | 1 | 1 |
| Mean rank | 58.0 | 14.6 | 46.2 |
| Mean RRP | 0.972 | 0.981 | 0.976 |
| Top 1 ranks | 165 (73 %) | 236 (82 %) | 196 (63 %) |
| Top 5 ranks | 188 | 261 | 233 |
| Top 10 ranks | 191 | 268 | 247 |
Retention time score calculation was performed using the XLOGP3 values of PubChem. . See text for explanations
Top MetFrag2.2 candidates for unknown at m/z 199.0428 with different settings
| CSID | 6386 | 69438 | 6388 |
|---|---|---|---|
| Original results (134 candidates) | | | |
| Rank (n = 134) | 1 | 6 | 90 |
| #Peaks explained | 5 | 5 | 5 |
| CDK log | 1.44/0.167 | 1.50/0.161 | 2.02/0.107 |
| Substructure interpretation | | | |
| Included | S(=O)(=O)O | S(=O)(=O)O | CCc1ccc(cc1)S(=O)(=O)O |
| Excluded | – | S(=O)(=O)OC | – |
| Comment | No ethyl loss in MS/MS | Disproven via standard | Present in suspect list |
Structures overlaid with the included substructure were generated with AMBIT [57]. See text for details
Summary of MetFrag2.2 results for terbutylazine and four isobars
| Name | Terbutylazine | Propazine | Secbutylazine | Triethazine |
|
|---|---|---|---|---|---|
| CSID | 20848 | 4768 | 22172 | 15157 | 4954587 |
|
|
|
|
|
| |
|
| 0.958 | 0.765 | 0.997 | 0.653 |
|
| #Peaks explained | 11/15 | 10/15 | 12/15 | 8/15 |
|
|
|
| 204 | 56 | 45 | 4 |
| ChemAxon log | 1.65 | 2.75 | 2.28 | 1.11 | 2.31 |
|
| 0.159 |
| 0.223 | 0.103 | 0.225 |
| ChemAxon log | 1.63 | 2.75 | 2.19 | 0.97 | 2.23 |
|
| 0.249 | 0.247 |
| 0.192 | 0.266 |
| Suspect hit |
|
|
|
| 0 |
| Substructure hits |
| 0 |
| 1 |
|
| Matches | NC(C)(C)C | – | NC(C)CC | N[CH | NCCCC |
| N[CH | N[CH | N[CH | |||
|
|
| 3.43 | 3.69 | 2.53 | 2.52 |
|
|
| 3.41 | 3.85 | 2.87 | 2.68 |
| Comment | Correct substance | No longer in use | Can co-elute with 20848 |
The log P and log D values predicted from the retention time were 3.17 and 2.18, respectively, using a training set of 810 substances calculated externally with ChemAxon and added to MetFrag2.2 via the UserLogP option. Included substructure SMARTS were N[CH][CH], NCCCC, NC(C)CC, NC(C)(C)C
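Predicting log P from retention time with a training set can be illustrated by a simple least-squares line, as in the Python sketch below. This is an assumption-laden illustration rather than the procedure used for the table: the linear model, the Gaussian closeness score and its `sigma` parameter, the function names and the toy training pairs are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rt_score(candidate_logp: float, predicted_logp: float, sigma: float) -> float:
    """Gaussian density of the log P difference (assumed scoring form)."""
    z = (candidate_logp - predicted_logp) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


if __name__ == "__main__":
    # Toy training pairs (retention time in min, log P); not real data.
    rts = [2.0, 4.0, 6.0, 8.0]
    logps = [0.5, 1.4, 2.6, 3.5]
    slope, intercept = fit_line(rts, logps)

    predicted = slope * 7.0 + intercept  # unknown eluting at 7.0 min
    for candidate_logp in (1.0, 3.0):
        print(candidate_logp, rt_score(candidate_logp, predicted, sigma=1.0))
```

Candidates whose calculated log P lies close to the value predicted from the retention time receive a higher score, so the retention time acts as a soft filter rather than a hard cut-off.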
Name synonym assigned for space reasons. Values in italics indicate the best result per category. Structures overlaid with the included substructure were generated with AMBIT [57]. See text for details and weights