Philipp Seidl, Philipp Renz, Natalia Dyubankova, Paulo Neves, Jonas Verhoeven, Jörg K Wegner, Marwin Segler, Sepp Hochreiter, Günter Klambauer.
Abstract
Finding synthesis routes for molecules of interest is essential in the discovery of new drugs and materials. To find such routes, computer-assisted synthesis planning (CASP) methods are employed, which rely on a single-step model of chemical reactivity. In this study, we introduce a template-based single-step retrosynthesis model based on Modern Hopfield Networks, which learn an encoding of both molecules and reaction templates in order to predict the relevance of templates for a given molecule. The template representation allows generalization across different reactions and significantly improves the performance of template relevance prediction, especially for templates with few or zero training examples. With inference speed up to orders of magnitude faster than baseline methods, we improve or match the state-of-the-art performance for top-k exact match accuracy for k ≥ 3 in the retrosynthesis benchmark USPTO-50k. Code to reproduce the results is available at github.com/ml-jku/mhn-react.
Year: 2022 PMID: 35034452 PMCID: PMC9092346 DOI: 10.1021/acs.jcim.1c01065
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 6.162
Figure 1. Simplified depiction of our approach. Standard approaches only encode the molecule and predict a fixed set of templates. In our modern Hopfield network (MHN)-based approach, the templates are also encoded and transformed to stored patterns via the template encoder. The Hopfield layer learns to associate the encoded input molecule, the state pattern ξ, with the memory of encoded templates, the stored patterns X. Multiple Hopfield layers can operate in parallel or can be stacked using different encoders.
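The association step described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard modern Hopfield update, in which the state pattern ξ is scored against each stored pattern by a dot product, scaled by an inverse temperature β, and normalized with a softmax. The function names, the toy vectors, and the value of `beta` are hypothetical; in the actual model both ξ and X would come from learned encoders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_template_relevance(xi, X, beta=1.0):
    """Associate a state pattern with stored patterns (illustrative sketch).

    xi   : list of floats, the encoded molecule (state pattern)
    X    : list of lists, one encoded template (stored pattern) per row
    beta : inverse temperature, an assumed hyperparameter
    Returns (p, xi_new): softmax weights over templates and the retrieved pattern.
    """
    # Dot-product similarity between the state pattern and each stored pattern.
    scores = [beta * sum(a * b for a, b in zip(xi, x)) for x in X]
    p = softmax(scores)
    # Retrieved pattern: weighted average of the stored patterns.
    xi_new = [sum(p[k] * X[k][d] for k in range(len(X))) for d in range(len(xi))]
    return p, xi_new

# Toy example with made-up 2-dimensional encodings (hypothetical values).
xi = [1.0, 0.0]
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p, xi_new = hopfield_template_relevance(xi, X, beta=2.0)

# Rank templates by their association weight, most relevant first.
ranking = sorted(range(len(X)), key=lambda k: p[k], reverse=True)
```

Because the templates enter only through their encodings X, a template that was rare or absent during training can still receive a high weight if its encoding lies close to the encoded molecule, which is the generalization effect the abstract describes.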
Template Top-k Accuracy (%) of Different Method Variants on USPTO-sm and USPTO-lg*
“Model” indicates how the templates were ranked. “Filter” specifies if and how templates were excluded from the ranking via FPF or an applicability check (App). “Pre-train” indicates whether a model was pre-trained on the applicability task. Error bars represent confidence intervals on binomial proportions. The gray rows indicate methods specifically proposed here or in prior work.
Width of 95% confidence interval <1.3%.
Width of 95% confidence interval <0.4%.
Note that the applicability filter violates modeling constraints from the section entitled “Single-Step Retrosynthesis”.
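The confidence intervals on binomial proportions mentioned in the table notes can be computed in several ways. The notes do not say which method was used, so the sketch below assumes the Wilson score interval purely for illustration; the function name and the toy counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes : number of correct predictions
    n         : number of test samples
    z         : normal quantile (1.96 corresponds to roughly 95% coverage)
    Returns (lower, upper) as fractions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Toy counts (hypothetical): the interval narrows as the test set grows.
small_lo, small_hi = wilson_interval(50, 100)
large_lo, large_hi = wilson_interval(5000, 10000)
```

The width of such an interval depends mainly on the number of test samples, which is consistent with the two different width bounds being quoted for two datasets of different size.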
Figure 2. Histogram showing the fraction of samples for different template frequencies. The leftmost red bar indicates that over 40% of chemical reactions in USPTO-lg have a unique reaction template. The majority of reaction templates are rare.
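A histogram like the one in Figure 2 can be derived from a list holding one template identifier per reaction. The sketch below is an assumption about how such a tally might be computed, not the authors' code; the function name and the toy template labels are hypothetical, and the binning used in the paper may differ.

```python
from collections import Counter

def fraction_of_samples_by_template_frequency(templates):
    """Map each template frequency to the fraction of samples having it.

    templates : list with the template identifier of each reaction sample
    Returns a dict {frequency: fraction of samples whose template occurs
    exactly that many times}.
    """
    template_counts = Counter(templates)  # template -> number of samples
    samples_per_frequency = Counter()
    for count in template_counts.values():
        # A template seen `count` times contributes `count` samples to that bin.
        samples_per_frequency[count] += count
    total = len(templates)
    return {freq: n / total for freq, n in sorted(samples_per_frequency.items())}

# Toy data (hypothetical labels): t2, t3, and t5 are unique templates.
toy_templates = ["t1", "t1", "t1", "t2", "t3", "t4", "t4", "t5"]
fractions = fraction_of_samples_by_template_frequency(toy_templates)
# fractions[1] is the share of samples with a unique template.
```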
Figure 3. Top-100 accuracy for different template popularity on the USPTO-sm/USPTO-lg datasets. The gray bars represent the proportion of samples in the test set. Error bars represent 95% confidence intervals on binomial proportions. Our method performs especially well on samples with reaction templates with few training examples.
Reactant Top-k Accuracy (%) on USPTO-50k Retrosynthesis*
Data taken from refs (11, 19, 20, 24, 29, 31, and 66−72). Bold values indicate values within 0.1 of the maximum value, green denotes a value within 1 percentage point of the maximum value, and yellow denotes a value within 3 percentage points of the maximum value. Error bars represent standard deviations across five reruns. Category (“Cat.”) indicates whether a method is template-based (tb) or template-free (tf). Methods in the upper part have been (re-)implemented in this work.
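Top-k exact match accuracy, the metric reported in this table, counts a test sample as correct when the ground-truth reactant set appears among the k highest-ranked predictions. A minimal sketch follows, assuming predictions and ground truth are already canonicalized strings so that exact string comparison is meaningful; the function name and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Top-k exact match accuracy in percent.

    ranked_predictions : list of lists, predicted reactant sets per sample,
                         ordered from most to least likely
    ground_truth       : list with the true reactant set of each sample
    k                  : number of top-ranked predictions to consider
    """
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("inputs must have the same length")
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return 100.0 * hits / len(ground_truth)

# Toy example with placeholder labels instead of real reactant strings.
predictions = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
truth = ["A", "Z", "M"]
top1 = top_k_accuracy(predictions, truth, k=1)
top3 = top_k_accuracy(predictions, truth, k=3)
```

Accuracy can only stay the same or increase as k grows, which is why a method may trail at k = 1 yet match or lead at k ≥ 3.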
Figure 4. Reactant top-k accuracy versus inference speed for different values of k. Upper left is better. For Transformer/GLN, the points represent different beam sizes. For MHN/NeuralSym, the points reflect different numbers of generated reactant sets, namely, {1, 3, 5, 10, 20, 50}. In the case of the Transformer, the points depict the beam sizes {1, 3, 5, 10, 20, 50, 75, 100}, from left to right.