Michael A Stravs, Kai Dührkop, Sebastian Böcker, Nicola Zamboni.
Abstract
Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder-decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS2) spectra. In an evaluation with 3,863 MS2 spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS2 dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.
Year: 2022 PMID: 35637304 PMCID: PMC9262714 DOI: 10.1038/s41592-022-01486-3
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 47.990
Fig. 1 Conceptual overview of MSNovelist.
Using the existing SIRIUS and CSI:FingerID approach, a molecular fingerprint and a molecular formula were predicted. These data were used as input to an encoder–decoder RNN model with LSTM architecture to predict a SMILES sequence. Finally, candidate structures were ranked by modified Platt score, that is, according to the match to the predicted molecular fingerprint.
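The ranking step described above, ordering candidates by how well their fingerprint matches the predicted one, can be illustrated with a small sketch. The formula for the modified Platt score is not given in this record, so the code below substitutes a plain independent-bit log-likelihood as a stand-in. The function names, the example probabilities and the candidate names are all hypothetical.

```python
import math


def log_likelihood_score(predicted_probs, candidate_bits, eps=1e-6):
    """Stand-in score (NOT the modified Platt score): sum of log-probabilities
    that each candidate bit agrees with the predicted bit probability."""
    score = 0.0
    for p, bit in zip(predicted_probs, candidate_bits):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        score += math.log(p if bit else 1.0 - p)
    return score


def rank_candidates(predicted_probs, candidates):
    """Return candidate names sorted from best to worst score."""
    scored = {
        name: log_likelihood_score(predicted_probs, bits)
        for name, bits in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical predicted fingerprint (per-bit probabilities) and candidates.
predicted = [0.9, 0.8, 0.1, 0.2]
candidates = {
    "cand_a": [1, 1, 0, 0],
    "cand_b": [1, 0, 0, 1],
    "cand_c": [0, 0, 1, 1],
}
ranking = rank_candidates(predicted, candidates)
print(ranking)
```

In this toy input, the candidate whose bits agree with every confident prediction ranks first. Any score that increases with fingerprint agreement could be swapped in for the stand-in.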
Fig. 2 Validation of MSNovelist with GNPS dataset.
a, Rank of correct structure in results for MSNovelist (blue), and naïve generation (orange), with ranking by modified Platt score (solid line) or by RNN score (dashed line), and comparison to database search (CSI:FingerID on PubChem; green) for the GNPS dataset (n = 3,863). b, Rank of correct structure in results for MSNovelist and naïve generation, with ranking by modified Platt score or ordered by model probability, and comparison to database search for GNPS-OK dataset (n = 1,507). c, Tanimoto similarity of best incorrect candidate to correct structure for MSNovelist, naïve generation, database search, best candidate from training set and random candidate from training set. d, Modified Platt score of top candidates, for MSNovelist, naïve generation, database search, best candidate from training set (light blue) and random candidate from training set (red). e, Three randomly chosen examples of incorrect predictions (top candidate) from GNPS dataset. Structures 1a, 2a and 3a represent de novo prediction; structures 1b, 2b and 3b represent a correct result. Red marks sites predicted incorrectly by the model (or the entire molecule if the prediction was completely wrong), and blue marks the corresponding correct alternative.
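Panel c reports Tanimoto similarity between structures. As a reminder of what that measure computes on binary fingerprints, here is a minimal sketch that represents each fingerprint as the set of its on-bit indices. The fingerprint type used in the paper is not stated in this record, the example bit sets are made up, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints,
    each given as an iterable of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch; both fingerprints empty
    return len(a & b) / len(union)


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})
print(similarity)  # 2 / 5 = 0.4
```

A value of 1.0 means identical fingerprints, and a value near 0 means almost no shared bits.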
Fig. 3 De novo annotation of bryophyte metabolites.
a, Scores of best MSNovelist candidates versus best database scores for 232 spectra; the solid line represents a 1:1, and the dashed line represents ; labels indicate spectrum ID. b, MS2 spectrum of feature 377. c, Proposed spectrum interpretation for structure 377a (MSNovelist) and 377b (database).
| | C | c | [C−] | N | n | [nH] | ... | ( | ) | ‘H’ | (tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | (elements) |
| N | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | |
| … | | | | | | | | | | | |
| H | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | (implicit H) |
| () | 0 | 0 | 0 | 0 | 0 | 0 | ... | −1 | 1 | 0 | (parentheses) |
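The matrix above assigns each SMILES token a contribution to a set of counters (element counts, implicit hydrogens, parentheses). How the model uses it is not stated in this record. One plausible reading, sketched below as an assumption, is that summing the token columns along a sequence gives running counts. Only the rows and columns shown in the table are included, the elided ones are left out, and the values are copied as printed. The names `TOKENS`, `COUNTERS`, `MATRIX` and `running_counts` are hypothetical.

```python
# Columns of the table (elided "..." tokens omitted); "[C-]" uses an ASCII hyphen.
TOKENS = ["C", "c", "[C-]", "N", "n", "[nH]", "(", ")", "H"]
# Rows of the table (elided rows omitted).
COUNTERS = ["C", "N", "H", "()"]
# Values copied from the table as printed, one row per counter.
MATRIX = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],   # C
    [0, 0, 0, 1, 1, 1, 0, 0, 0],   # N
    [0, 0, 0, 0, 0, 0, 0, 0, 1],   # H (implicit H token)
    [0, 0, 0, 0, 0, 0, -1, 1, 0],  # () parentheses
]
TOKEN_INDEX = {tok: i for i, tok in enumerate(TOKENS)}


def running_counts(token_sequence):
    """Return the counter totals after each token in the sequence."""
    totals = [0] * len(COUNTERS)
    history = []
    for tok in token_sequence:
        col = TOKEN_INDEX[tok]
        totals = [t + row[col] for t, row in zip(totals, MATRIX)]
        history.append(dict(zip(COUNTERS, totals)))
    return history


history = running_counts(["C", "(", "N", ")", "c"])
print(history[-1])
```

With the values as printed, the parenthesis counter returns to zero once every opened branch is closed, and the element counters can be compared against a target molecular formula at any point in the sequence.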