Zhaorui Huang, Michael S Chen, Cristian P Woroch, Thomas E Markland, Matthew W Kanan.
Abstract
Methods to automate structure elucidation that can be applied broadly across chemical structure space have the potential to greatly accelerate chemical discovery. NMR spectroscopy is the most widely used and arguably the most powerful method for elucidating structures of organic molecules. Here we introduce a machine learning (ML) framework that provides a quantitative probabilistic ranking of the most likely structural connectivity of an unknown compound when given routine, experimental one-dimensional 1H and/or 13C NMR spectra. In particular, our ML-based algorithm takes input NMR spectra and (i) predicts the presence of specific substructures out of hundreds of substructures it has learned to identify; (ii) annotates the spectrum to label peaks with predicted substructures; and (iii) uses the substructures to construct candidate constitutional isomers and assign to them a probabilistic ranking. Using experimental spectra and molecular formulae for molecules containing up to 10 non-hydrogen atoms, the correct constitutional isomer was the highest-ranking prediction made by our model in 67.4% of the cases and one of the top-ten predictions in 95.8% of the cases. This advance will aid in solving the structure of unknown compounds, and thus further the development of automated structure elucidation tools that could enable the creation of fully autonomous reaction discovery platforms. This journal is © The Royal Society of Chemistry.
Year: 2021 PMID: 34976353 PMCID: PMC8635205 DOI: 10.1039/d1sc04105c
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Fig. 1 Overview of the automated structure prediction framework. The inputs are the full 1H NMR spectrum, 13C NMR peaks, and the molecular formula. The outputs are the substructure probability profile, substructure-annotated NMR spectra, and a ranked list of predicted molecular structures. A test set example with experimentally collected spectra is shown with the actual outputs of the model.
Fig. 2 Distribution of true/false positives and true/false negatives as a function of the probability predicted by our substructure prediction model for the test set.
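The kind of tally described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not the authors' procedure: the 0.5 decision threshold, the ten equal-width probability bins, the function name `outcome_histogram`, and the toy data are all assumptions introduced here.

```python
from collections import Counter

def outcome_histogram(probs, labels, threshold=0.5, n_bins=10):
    """Count TP/FP/TN/FN outcomes per predicted-probability bin.

    threshold and n_bins are illustrative assumptions, not values
    taken from the source.
    """
    hist = [Counter() for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = p >= threshold
        if predicted and y:
            outcome = "TP"
        elif predicted and not y:
            outcome = "FP"
        elif not predicted and y:
            outcome = "FN"
        else:
            outcome = "TN"
        # Clamp the index so that p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        hist[b][outcome] += 1
    return hist

# Toy data (hypothetical): predicted probabilities and true labels
# for one substructure across seven spectra.
probs = [0.05, 0.12, 0.48, 0.55, 0.91, 0.97, 1.0]
labels = [0, 0, 1, 0, 1, 1, 1]
hist = outcome_histogram(probs, labels)
for i, counts in enumerate(hist):
    if counts:
        print(f"bin {i}: {dict(counts)}")
```

A well-separated model would concentrate true negatives in the lowest bins and true positives in the highest, with errors clustered near the threshold.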
Substructure and molecular structure prediction results for the validation and the test sets with different input NMR data

| Dataset | Inputs | Substructure prediction: micro-average | Substructure prediction: PRC-AUC score | Molecular structure prediction: top-1 acc. (%) | Molecular structure prediction: top-10 acc. (%) | Molecular structure prediction: mean reciprocal rank |
|---|---|---|---|---|---|---|
| Validation | 13C | 0.718 | 0.850 | 61.7 | 90.7 | 0.726 |
| Validation | 1H | 0.747 | 0.871 | 63.6 | 84.1 | 0.710 |
| Validation | 1H, 13C | 0.869 | 0.953 | 85.5 | 96.3 | 0.890 |
| Test | 13C | 0.672 | 0.792 | 38.9 | 87.4 | 0.541 |
| Test | 1H | 0.720 | 0.823 | 47.4 | 85.3 | 0.604 |
| Test | 1H, 13C | 0.803 | 0.904 | 67.4 | 95.8 | 0.777 |
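The ranking metrics reported in this table (top-1 accuracy, top-10 accuracy, mean reciprocal rank) follow from the rank at which the correct structure appears in each prediction list. The sketch below uses standard definitions; the function names and toy ranks are hypothetical, and counting a structure that never appears in the list as a reciprocal rank of 0 is an assumption, since the source does not say how such cases are handled.

```python
def top_k_accuracy(ranks, k):
    """Percentage of cases whose correct structure is ranked within the top k.

    A rank of None means the correct structure was not in the list.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all cases; missing structures contribute 0 (assumption)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy data (hypothetical): 1-based rank of the true structure for eight molecules.
ranks = [1, 1, 2, 5, None, 1, 12, 3]

top1 = top_k_accuracy(ranks, 1)
top10 = top_k_accuracy(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
print(f"top-1: {top1:.1f}%  top-10: {top10:.1f}%  MRR: {mrr:.3f}")
```

Mean reciprocal rank rewards near misses more than top-1 accuracy does, which is why the two can diverge: a model that often places the true structure second scores 0.5 per case on reciprocal rank but 0 on top-1 accuracy.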
Performance of the substructure prediction model for selected substructures in test set

| Entry | Substructure | SMARTS string | Accuracy | | PRC-AUC score | Number in set |
|---|---|---|---|---|---|---|
| 1 | | [CX4H3] | 0.947 | 0.950 | 0.993 | 52 |
| 2 | | [CX4H3][CX4H0] | 1.000 | 1.000 | 1.000 | 9 |
| 3 | | [CX4H3][CX4H1] | 0.979 | 0.900 | 0.955 | 10 |
| 4 | | [CX4H3][CX3H0] | 0.979 | 0.917 | 0.992 | 11 |
| 5 | | [CX4H3][OX2H0] | 0.979 | 0.500 | 0.711 | 3 |
| 6 | | [CX3]( | 0.916 | 0.907 | 0.993 | 47 |
| 7 | | [CX3]( | 0.968 | 0.968 | 0.998 | 46 |
| 8 | | O | 0.968 | 0.914 | 0.952 | 19 |
| 9 | | [cH] | 1.000 | 1.000 | 1.000 | 32 |
| 10 | | [cH][cH] | 0.958 | 0.929 | 0.994 | 27 |
| 11 | | [CX4H2][CX4H2] | 0.926 | 0.877 | 0.956 | 29 |
| 12 | | [#6H1] | 0.895 | 0.923 | 0.984 | 64 |
| 13 | | [OX2H1] | 0.947 | 0.959 | 0.993 | 61 |
| 14 | | [#7X3H2] | 0.905 | 0.816 | 0.862 | 23 |
| 15 | | [#7X3H1] | 0.779 | 0.222 | 0.490 | 19 |
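Per-substructure scores of the kind tabulated above can be computed from predicted probabilities and true labels. The source does not state how the PRC-AUC score was calculated; the sketch below substitutes average precision, one common estimator of the area under the precision-recall curve, and assumes a 0.5 threshold for accuracy. Function names and data are hypothetical.

```python
def average_precision(probs, labels):
    """Estimate the area under the precision-recall curve as average precision.

    Predictions are sorted by descending probability; precision is
    accumulated at each rank where a true positive is recovered.
    Tied probabilities are not treated specially in this sketch.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(probs, labels), key=lambda t: t[0], reverse=True)
    tp = 0
    total = 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y:
            tp += 1
            total += tp / i  # precision at this recall step
    return total / n_pos

def accuracy(probs, labels, threshold=0.5):
    """Fraction of correct presence/absence calls (threshold is an assumption)."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

# Toy data (hypothetical): one substructure scored on five spectra.
probs = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
ap = average_precision(probs, labels)
acc = accuracy(probs, labels)
print(f"accuracy: {acc:.3f}  average precision: {ap:.3f}")
```

Accuracy can stay high for a rare substructure even when the model seldom detects it, because true negatives dominate the count; a precision-recall measure is less forgiving, which is consistent with entries such as 5 and 15 pairing high accuracy with lower PRC-AUC scores.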
Selected test set examples and their top ranked molecular predictions. The true structure is highlighted in green under predicted structures
Fig. 3 Annotated spectra generated by the substructure prediction model for two examples in the test set. The top predicted substructure is shown for each highlighted peak.