Jinzhe Zhang1,2, Kei Terayama2,3,4,5, Masato Sumita2,6, Kazuki Yoshizoe2, Kengo Ito5,7, Jun Kikuchi5,7,8, Koji Tsuda1,2,9.
Abstract
Nuclear magnetic resonance (NMR) spectroscopy is an effective tool for identifying molecules in a sample. Although many previously observed NMR spectra are accumulated in public databases, they cover only a tiny fraction of the chemical space, and molecule identification is typically accomplished manually based on expert knowledge. Herein, we propose NMR-TS, a machine-learning-based Python library, to automatically identify a molecule from its NMR spectrum. NMR-TS discovers candidate molecules whose NMR spectra match the target spectrum by using deep learning and density functional theory (DFT)-computed spectra. As a proof-of-concept, we identify prototypical metabolites from their computed spectra. After an average of 5451 DFT runs for each spectrum, six of the nine molecules are identified correctly, and proximal molecules are obtained in the other cases. This encouraging result implies that de novo molecule generation can contribute to the fully automated identification of chemical structures. NMR-TS is available at https://github.com/tsudalab/NMR-TS.
Keywords: 404 Materials informatics / Genomics; NMR; deep learning; density functional theory; molecule generation
Year: 2020 PMID: 32939179 PMCID: PMC7476483 DOI: 10.1080/14686996.2020.1793382
Source DB: PubMed Journal: Sci Technol Adv Mater ISSN: 1468-6996 Impact factor: 8.090
Figure 1. Concept of this study and molecular generator scheme. NMR-TS tries to identify an unknown molecule from its NMR spectrum (target NMR spectrum) by designing molecules with NMR spectra as similar as possible to the target NMR spectrum. The NMR spectrum of a generated molecule is simulated by quantum chemical calculation. The Wasserstein distance is used to quantify the proximity between the NMR spectra of the target and generated molecules.
Figure 2. Examples of using the Wasserstein score (WS) to quantify the difference between the target NMR spectrum and the NMR spectra of molecules generated as SMILES strings. A WS closer to 1 indicates higher similarity between the spectra. In this example, the spectrum of Cc1cc(C)on1 is most similar to the target spectrum.
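The captions above describe scoring a candidate by the Wasserstein distance between its spectrum and the target, with a score near 1 meaning a close match. The following is a minimal sketch of that idea, not the NMR-TS implementation. It assumes both spectra are sampled as non-negative intensities on the same sorted chemical-shift grid, and it assumes the mapping WS = 1 / (1 + distance), which is only one plausible way to obtain a score that reaches 1 for identical spectra. The function names `wasserstein_1d` and `wasserstein_score` are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(shifts, p, q):
    """1-D Wasserstein-1 distance between two spectra on a shared, sorted grid.

    shifts: chemical-shift positions (ascending)
    p, q:   non-negative intensities at those positions
    """
    if not (len(shifts) == len(p) == len(q)):
        raise ValueError("shifts, p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each spectrum needs positive total intensity")

    # Normalise to probability distributions, then take cumulative sums.
    cdf_p = list(accumulate(x / total_p for x in p))
    cdf_q = list(accumulate(x / total_q for x in q))

    # In one dimension, W1 equals the integral of |CDF_p - CDF_q|.
    # Both CDFs are step functions that are constant between grid points.
    distance = 0.0
    for i in range(len(shifts) - 1):
        distance += abs(cdf_p[i] - cdf_q[i]) * (shifts[i + 1] - shifts[i])
    return distance


def wasserstein_score(shifts, p, q):
    """Assumed score mapping: 1 for identical spectra, approaching 0 as they diverge."""
    return 1.0 / (1.0 + wasserstein_1d(shifts, p, q))


if __name__ == "__main__":
    grid = [0.0, 1.0, 2.0, 3.0]
    target = [1.0, 0.0, 0.0, 0.0]      # single peak at 0 ppm
    candidate = [0.0, 0.0, 1.0, 0.0]   # single peak at 2 ppm
    print(wasserstein_1d(grid, target, candidate))
    print(wasserstein_score(grid, target, candidate))
```

Unlike a pointwise comparison, this distance grows smoothly with how far a peak has moved, so a candidate whose peaks are slightly shifted still scores better than one whose peaks are far away.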
Figure 3. Nine test molecules with their chemical structural formulas and SMILES representations.
Table 1. Correct answer rate and average Wasserstein score (WS) for each trie size.
| Method | Target molecules found | Ave. of best WSs |
|---|---|---|
| NMR-TS (Trie size = 0) | (1/9) | 0.564 |
| NMR-TS (Trie size = 1) | (4/9) | 0.778 |
| NMR-TS (Trie size = 100) | (4/9) | 0.850 |
| NMR-TS (Trie size = 1000) | (4/9) | 0.837 |
| NMR-TS (Trie size = 9800) | (5/9) | 0.892 |
| Database search (baseline) | (0/9) | 0.740 |
Figure 4. Test molecules, baseline molecules, and best candidate molecules generated by NMR-TS. The corresponding Wasserstein score (WS) is shown for each baseline and candidate molecule. For test molecules I, III–VI, and VIII, NMR-TS gave the correct structures. For test molecules II, VII, and IX, NMR-TS failed to find the correct structures.
Figure 5. NMR-TS search results for target spectra of test molecules I–IX showing the best Wasserstein score (WS) as a function of time with different trie sizes. See Table 1 for the details of the different parameter sets.
Figure 6. (a) Evolution of the average Wasserstein score (WS) of the best candidates for the nine test molecules over time with different trie sizes. When the trie size is 0, ChemTS starts with a root node without any expansion. When the trie size is 1, 100, 1000, or 9800, a WS is computed between each spectrum in the database and the target spectrum, and the top 1, 100, 1000, or 9800 molecules in this ranking, respectively, are fed into the trie. (b) Evolution of the total number of candidates with scores better than the database baseline for all test molecules over time. (c) Comparison of the best candidate scores from the database search and NMR-TS. C = 1, trie size = 9800.
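The trie seeding described in panel (a) can be illustrated with a short sketch: rank database molecules by their WS against the target, keep the top k, and insert their SMILES strings into a prefix tree. This is an illustration under stated assumptions, not the NMR-TS or ChemTS code. It splits SMILES one character at a time, whereas a real generator would more likely use chemically meaningful tokens, and the names `top_k_by_score`, `build_trie`, and `END` are hypothetical.

```python
END = "\n"  # assumed terminal marker; not a valid SMILES character


def top_k_by_score(scored, k):
    """Return the k SMILES strings with the highest score, best first.

    scored: dict mapping SMILES string -> Wasserstein score against the target
    """
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [smiles for smiles, _ in ranked[:k]]


def build_trie(smiles_list):
    """Build a nested-dict prefix tree from SMILES strings (one character per level)."""
    root = {}
    for smiles in smiles_list:
        node = root
        for symbol in smiles:
            node = node.setdefault(symbol, {})
        node[END] = {}  # mark the end of a complete molecule
    return root


if __name__ == "__main__":
    # Illustrative scores only; these are not values from the study.
    database_scores = {"CCO": 0.9, "CCN": 0.8, "c1ccccc1": 0.3}
    seeds = top_k_by_score(database_scores, 2)
    print(seeds)
    print(build_trie(seeds))
    # With k = 0 the trie is just an empty root, matching the
    # "root node without any expansion" case in the caption.
    print(build_trie(top_k_by_score(database_scores, 0)))
```

Seeding the tree this way gives the search a set of partially explored branches that already resemble the target, which is consistent with the higher average scores reported for larger trie sizes.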