
MegaSyn: Integrating Generative Molecular Design, Automated Analog Designer, and Synthetic Viability Prediction.

Fabio Urbina1, Christopher T Lowden2, J Christopher Culberson2, Sean Ekins1.   

Abstract

Generative machine learning models have become widely adopted in drug discovery and other fields to produce new molecules and explore molecular space, with the goal of discovering novel compounds with optimized properties. These generative models are frequently combined with transfer learning or scoring of physicochemical properties to steer generative design, yet they are often incapable of addressing a wide variety of potential problems and tend to converge into similar molecular space when combined with a scoring function for the desired properties. In addition, the generated compounds may not be synthetically feasible, limiting their usefulness in real-world scenarios. Here, we introduce MegaSyn, a suite of automated tools comprising three components: a new hill-climb algorithm that makes use of SMILES-based recurrent neural network (RNN) generative models, analog generation software, and retrosynthetic analysis coupled with fragment analysis to score molecules for their synthetic feasibility. We show that by deconstructing the targeted molecules and focusing on substructures, combined with an ensemble of generative models, MegaSyn generally performs well on the specific tasks of generating new scaffolds as well as targeted analogs that are likely to be synthesizable and druglike. We describe the development, benchmarking, and testing of this suite of tools and propose, through multiple test cases, how these RNN-based tools might be used to optimize molecules or prioritize promising lead compounds.
© 2022 The Authors. Published by American Chemical Society.

Year:  2022        PMID: 35694522      PMCID: PMC9178760          DOI: 10.1021/acsomega.2c01404

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

We (as well as many other research groups and companies) have regularly used machine learning models to propose molecules for testing and then validated them in vitro with vendor-available molecules as a first step.[1−3] However, to optimize the bioactivity of any hit molecules obtained, or to maintain activity with improved absorption, distribution, metabolism, excretion, and toxicity (ADME/tox) properties, vendor-available compounds may not be sufficient. The most desirable chemical modifications for analogs are rarely available, and thus ways to generate and explore novel molecules are required. In recent years, generative models have become commonly used as part of the design–make–test cycle[4] to produce molecules de novo,[5,6] and this field has been reviewed by many others.[7−9] These generative models have come from several different architectures [e.g., recurrent neural networks (RNNs),[6] variational autoencoders (VAEs),[10] and generative adversarial networks (GANs)[11]] and have been shown to generate valid, novel molecules in the same chemical space as their training sets, with desirable physicochemical properties.[12−15] Molecular representation varies among such generative models, including SMILES and, more recently, molecule trees and SELFIES, both of which have enjoyed success in producing 100% valid molecule strings.[16,17] We chose the SMILES representation as our basis, as it has seen widespread success and is favored for its simplicity and ease of use as a molecular representation.[18] However, many of these generative models have enjoyed limited success in real-world drug discovery projects due to their narrow range of capabilities and a lack of defined work pipelines for distinct generative tasks. One issue is that the focus of drug discovery projects may be varied, and a single generative design process would likely not work for all scenarios.
For instance, in one project, a lead molecule scaffold may require an iterative design to find the most suitable analog, and thus, the generative model employed should only enumerate on the core structure. Conversely, in another case given a set of known active and inactive compounds against a target, the project may wish to discover entirely new scaffolds that do not exist in “patent space” yet have similar desired molecular properties to the known active compounds. While most generative models can utilize the desired physicochemical properties in the training of the generative models, in practice, the goals are often not achievable using generic, out-of-the-box generative models. This lowers the practical utility of generative models, as currently proposed. To control the “closeness” of the generated compounds to a molecule of interest, the Tanimoto similarity score[19] is often included in the training. Even though the diversity of molecules generated is considerable, generative models retrained on the same parameters often end up in similar local minima of chemistry property space, reducing their usefulness past an initial run.[20] To address these and other limitations, we have now created MegaSyn, a suite of algorithms, which takes a similar approach to weak learner ensemble methods such as Random Forests.[21] MegaSyn modifies the approach of generative models in two ways. First, instead of training one generative model, MegaSyn trains many “weaker” generative models, starting from a generic model trained on a larger library, such as the druglike library (ChEMBL),[22] and iteratively produces generative models that are continuously “focused” down onto target molecule(s) and physicochemical properties of interest. To improve diversity, random branch models are produced from each of these focused generative model nodes until multiple generative models have been created that explore many local minima. 
While we chose ChEMBL as a database of druglike molecules, the generic model could be trained on other sets of molecules such as NPASS for natural products.[23] Second, for any proposed target molecule(s) of interest, MegaSyn first deconstructs target molecules into several substructures along with modifications, which we find improves the ability to discover analogs of interest. As we will illustrate, depending on the desired outcome (i.e., completely new drug scaffolds or enumeration on a common structural core), MegaSyn allows flexibility and balance in the exploration of chemical property space versus focused generative capabilities by traversing the “tree” of generative models based on the desired outcome. Generating novel molecules is important, but it is also key to evaluate proposed structures for synthesizability and suggest synthetic pathways for the synthesis of the compounds.[8] The technologies involved in proposing, evaluating, planning, and assessing the synthetic feasibility of compound syntheses have been available within the cheminformatics industry for decades,[24−28] but their practical implementation has only recently approached the state of the art (reviewed recently).[29] For example, the earliest efforts in synthesis planning, reaction prediction, and synthetic feasibility assessment developed rule-based approaches such as LHASA, CAMEO, and CAESA (as reviewed in ref (30)).
These software packages require collaboration between the chemist and machine to get the best out of their relatively limited functionality.[30] In recent years, we have seen considerable development in computer-aided synthesis planning, with the collection of tens of thousands of manually curated reaction transformation rules to yield millions of chemical reactions as a network in Chematica,[31] which can be used to select the most cost-effective or chemically diverse synthetic pathways.[32] While the manual collection of such rules is not scalable, there has also been a shift to machine learning approaches. One has used deep neural networks trained on 3.5 million reactions from the Reaxys database with extended connectivity fingerprints (ECFP4).[33] Another approach used 15,000 reactions from the USPTO augmented by a set of over 5 million reactions with nonrecorded products to train a neural network.[25] Others have developed proof-of-concept tools that they suggest are not ready for practical use, such as CompRet, which enumerates a chemical reaction network based on a depth-first proof number search, enumerating all synthetic routes and then recommending synthetic routes using simple scoring functions.[34] A template-free self-corrected retrosynthesis predictor was built using a transformer neural network architecture, which improved on prior accuracy rates using the USPTO-50K set.[35] Scientists at Pfizer have also demonstrated that a transformer-based retrosynthesis model generated with public USPTO training data could predict over 147,000 reactions from Pfizer electronic notebooks with a top-1 accuracy of 69%, and this number increases when their own data are used in training.[36] A more recent use of the transformer architecture using transfer learning for retrosynthesis prediction with literature data demonstrated a top-1 accuracy of up to 60.7%.[37] Various methods have also been used to predict synthetic accessibility such as using the probability of existence
of substructures for the compound in question along with the number of symmetry atoms, graph complexity, and number of chiral centers.[38] Clearly, these examples are an abbreviated selection and more recent open-source software for retrosynthetic planning includes AiZynthFinder,[39] LillyMol,[40] ASKCOS,[25] and RENATE.[41] Comparison of such methods is very limited and has only recently been reported,[42] as such efforts require the synthesis of compounds using proposed routes obtained with each method.[39] We now describe a generative model that is flexible to address the needs of many drug discovery projects, as well as prototype Pipeline Pilot protocols for automated lead expansion, filtration of analogs, and selection of a representative set that is user-accessible. Because the molecules are generated in an automated fashion, some of the molecules may be difficult or impractical to pursue from a synthetic chemistry perspective. Thus, we also created an automated tool to predict the relative difficulty of synthesis for targeted analog molecules utilizing automated retrosynthetic analysis coupled with a fragment analysis to score molecules for their synthetic feasibility. We have evaluated these tools using a set of FDA-approved drugs as well as a recently published set of natural products.[43] We also provide several test cases that recapitulate a recently described known analog of ibogaine[44] and analogs of lapatinib with improved predicted properties[45] using MegaSyn. These examples show that MegaSyn can virtually generate synthetically feasible compounds with the desired druglike properties.

Experimental Section

To aid in outlining the current study and the various components, we have developed an overview (Figure 1).
Figure 1

Overview diagram of this study and mapping to figures and supplemental figures to orient the reader.


Activity Models for MegaSyn

All activity models consisted of naïve Bayes models trained using the scikit-learn package in Python. Data sets were acquired for each target (e.g., HER1, HER2) from ChEMBL target activities, except for the blood–brain-barrier (BBB) data set, which was acquired from Wang et al.[46] All activities were binarized according to an activity threshold, with 1 indicating active and 0 indicating inactive (Table S1). As input to each model, ECFP6 fingerprints were calculated for each molecule and folded down to a 1024-bit feature vector. Each model was trained and calibrated using isotonic regression with 3-fold cross-validation, from which all statistics were generated. The calibrated prediction scores for each activity model were used as input into the composite score (defined below) when used for training a MegaSyn model.
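The pipeline above (1024-bit folded fingerprints, a naïve Bayes classifier calibrated with isotonic regression under 3-fold cross-validation) can be sketched with scikit-learn. The random fingerprint matrix below is a stand-in for real RDKit-derived ECFP6 fingerprints and ChEMBL activity labels:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV

# Synthetic stand-ins: 300 "molecules" as 1024-bit fingerprints with
# binarized activity labels (real inputs would come from RDKit/ChEMBL).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 1024))
y = rng.integers(0, 2, size=300)

# Naive Bayes model calibrated with isotonic regression, 3-fold CV.
model = CalibratedClassifierCV(BernoulliNB(), method="isotonic", cv=3)
model.fit(X, y)

# Calibrated probabilities, suitable as inputs to a composite score.
scores = model.predict_proba(X)[:, 1]
```

The calibrated probabilities fall in [0, 1], which is what allows them to feed directly into the bounded composite score defined later.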

Evaluation of Variational Autoencoder, Generative Adversarial Networks, and Recurrent Neural Networks

As input, molecules are represented as tokenized SMILES strings. Briefly, each SMILES is tokenized, and each token is represented in a vocabulary (e.g., “c”, “[nH]”, “1”, “=”). Each token in the vocabulary has a corresponding numerical representation (e.g., all “c” tokens are represented by 1, all “=” tokens by 2, etc.). SMILES are encoded by their integer vocabulary representation and padded to the longest sequence length with zeroes, which were masked during training. Beyond this, several differences exist between models during training. Variational autoencoder (VAE): the variational autoencoder utilizes an encoder–decoder architecture to map chemical space into a latent vector.[10] The encoder is composed of three LSTM layers of 512 units each followed by a linear layer of 64 units (the latent space). Our decoder is composed of three LSTM layers of 512 units each with a dropout of 0.2 between all layers. We used KL divergence as our loss term with an Adam optimizer (learning rate = 0.0001),[47] patience = 10, 200 epochs, and a batch size of 64. Generative adversarial networks (GANs): we implemented a latentGAN[11] architecture for our generative GAN model. A Wasserstein GAN[48] with gradient penalty was utilized for the GAN model. The heteroencoder was composed of three LSTM layers of 512 units each with a final linear layer of 64 units (the latent space), while the decoder was composed of three LSTM layers of 512 units each followed by a linear layer with softmax activation to return the probability of each character in the vocabulary. The autoencoder was trained for 100 epochs with a batch size = 128 and an Adam optimizer with a learning rate = 0.0001 using teacher forcing. The discriminator of the GAN was formed by three linear layers of 256 hidden units each with ReLU activation[49] between each layer (except for the last layer).
The generator consists of five linear layers of 512 hidden units each with batch normalization (momentum = 0.9) and leaky ReLU activation between each layer. The autoencoder was pretrained using the ChEMBL data set followed by training of the full GAN model. Recurrent neural networks (RNNs):[6] each LSTM-based model is composed of an embedding layer, three LSTM layers (512 hidden units each), followed by a linear layer with softmax activation sized to the vocabulary generated from the training data.
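The SMILES tokenization and integer-encoding scheme described above can be sketched in a few lines of Python. The regular expression and example strings below are simplifying assumptions for illustration, not the exact vocabulary used by the authors:

```python
import re

# Multi-character tokens (bracket atoms, two-letter elements) must be
# matched before single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def build_vocab(smiles_list):
    # Reserve 0 for padding, which is masked during training.
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}

def encode(smiles_list, vocab):
    # Integer-encode each SMILES, then pad to the longest sequence.
    encoded = [[vocab[t] for t in tokenize(s)] for s in smiles_list]
    max_len = max(map(len, encoded))
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

smiles = ["c1cc[nH]c1", "C=O"]
vocab = build_vocab(smiles)
padded = encode(smiles, vocab)
```

All three architectures (VAE, GAN, RNN) consume sequences prepared this way; only what happens after the embedding differs.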

MegaSyn Design

MegaSyn was implemented using PyTorch 1.7.1. Each LSTM-based model in the ensemble uses the architecture described above unless noted otherwise below. Each model is composed of an embedding layer, three LSTM layers (512 hidden units each), followed by a linear layer with softmax activation sized to the vocabulary generated from the training data. As input for all models, molecules are represented as tokenized SMILES strings. MegaSyn is composed of three distinct model types: the initial pretrained model, a set of primed models, and finally a set of exploratory models.

Initial Model

The initial model is trained on ChEMBL 28’s ∼2 million compounds.[22] Briefly, a batch of 60,000 random SMILES from ChEMBL 28 was taken for each training epoch. The loss function for a sequence of encoded SMILES is the negative log-likelihood. The model uses an Adam optimizer with a learning rate = 0.002. Teacher forcing is used to expedite training of the generative model.
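The per-sequence negative log-likelihood loss described above can be illustrated with a small numeric sketch; the softmax probability matrix here is synthetic:

```python
import numpy as np

def sequence_nll(probs, targets):
    """probs: (seq_len, vocab_size) softmax outputs at each step;
    targets: observed next-token indices. Under teacher forcing, the
    ground-truth token is fed in at each step regardless of the model's
    own prediction, and the loss is the summed negative log-probability
    assigned to the observed tokens."""
    step_probs = probs[np.arange(len(targets)), targets]
    return -np.sum(np.log(step_probs))

# Toy 3-step sequence over a 3-token vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
targets = np.array([0, 1, 2])
loss = sequence_nll(probs, targets)  # -(log 0.7 + log 0.8 + log 0.6)
```

Minimizing this quantity over batches of encoded SMILES is exactly maximum likelihood estimation of the next-token distribution.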

Primed Models

For each set of primed models, the initial model is trained for n epochs, with a new agent model saved every two epochs. The target molecule(s) of interest are broken down into substructures based on RECAP rules. Simplified carbon-only versions of these substructures and the original molecule are also generated. The initial model is trained on this set of structures and substructures alone, using the same parameters as the initial model (described above) with teacher forcing. Every i epochs, the model is saved, until a set of n primed models has been created. These primed models are also scored to retain only druglike molecules, evaluated using a QED score.[50] The nature of the hill-climb maximum likelihood estimation (MLE) allows only correctly produced structures to influence the model, thereby restricting the molecules generated to a druglike small-molecule space. We find that 16 total epochs with a model saved after every two epochs represent the gradient from general to diverse reasonably well for a number of target molecules, and thus all models were trained to 16 total epochs unless otherwise specified.

Exploration Models

For each primed model, de novo molecules are generated. The generated molecules are then ranked based on a composite score computed from any number of criteria. For each criterion i in the composite score (e.g., predicted target activity or druglikeness (QED) score),[50] the composite score is defined as

composite score = Σ_i log(x_i / y_i)

where x_i is the ith score for the molecule and y_i is the ith desired score. Each score is assumed to be bounded between 0 and 1, with 1 being the desired property (in the case of an inverse property, the score is taken as 1 – y). This score usually includes QED, model prediction scores against the target (target model), and any other desired scores, e.g., the average Tanimoto distance from a desired library. This design forces all scores closer to 1 to always be desirable, and thus contributions from scores greater than the “target” score are always positive, and contributions from scores below the target score are always negative. We bound the scoring range to [0,1] as the majority of the individual scores, such as prediction scores, QED, and Tanimoto distance, are already in the range [0,1]. We chose the sum-of-logs to penalize molecules for not meeting or exceeding all criteria. This prevents, for example, a nontoxic molecule with a high QED score but no predicted activity of interest from dominating the generative space. The scoring function is general: if a molecular property score between 0 and 1 can be assigned to a compound, it can be included in the final composite score, broadening the range of tasks to which the generative model can be applied. The top 10% of ranked compounds are kept and fed back into the model for training using NLL and teacher forcing, a training concept called MLE. A new set of molecules is generated after training, and the cycle continues for the specified number of epochs.
Importantly, the top 10% of compounds are kept from one epoch to the next; only if a newly generated compound has a score higher than one in the current top 10% list does it replace one in the set. This hill-climb MLE constrains the model to only be trained on molecules generated from the druglike space of the ChEMBL library in this case, restricting the model to generate only molecules with the desirable physicochemical properties. Eventually, the model will find a substructure minimum and is then capable of generating analogs of this specific substructure. Often, based on the initial seed molecules of the very first iteration, the model will converge to one local minimum. At least four models are trained and generated from each primed model node to obtain models that focus on different substructures of the original target molecule. The top-scoring 10% of compounds found over the training loop for each model are kept. Care must be taken when generating molecules “far” from the initial training data that drives the models. It is assumed that chemical structure similarity should correlate with the uncertainty of the model: that the “closer” a structure of a molecule is to the training set, the more likely the model is to be correct. 
Often, an applicability domain (AD) score is applied, based on a distance metric, to reflect this.[51−53] A distinction can be made, however, between the distance from the training set and the distance from the decision boundary of a model, which are two distinct measures: a molecule may be structurally far from the training set, yet if it is far from the decision boundary of the model, it may still be accurately predicted.[42] Although the distance from the training data can track reliably with correct model predictions, class probabilities as produced by machine learning models were stronger predictors of misclassification.[42] Therefore, weighting the prediction score heavily in our generative model likely represents a more reliable measure of applicability domain for generated molecules.
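A minimal sketch of the sum-of-logs composite score and the elitist top-scoring retention used in the hill-climb MLE loop; the criterion values, molecule names, and set size below are illustrative:

```python
import math

def composite_score(scores, targets):
    # Sum-of-logs over criteria: a term is positive when the molecule's
    # score x exceeds the desired score y and negative when it falls
    # short, so a molecule must meet or exceed every criterion to rank well.
    return sum(math.log(x / y) for x, y in zip(scores, targets))

def update_top_set(kept, candidates, k):
    # Elitist retention: a newly generated molecule enters the top set
    # only if it outscores a current member; the set size stays at k.
    pool = sorted(kept + candidates, key=lambda m: m[1], reverse=True)
    return pool[:k]

# Illustrative criteria: QED and a predicted activity score, both
# targeted at 0.8.
score = composite_score([0.9, 0.85], [0.8, 0.8])
top = update_top_set([("a", 1.0), ("b", 0.2)], [("c", 0.5)], 2)
```

Because every criterion enters through a log ratio, a single badly failing criterion drags the whole sum negative, which is the penalization behavior described above.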

Automated Analog Generation

Lead Expansion/Enumeration

We have developed a Pipeline Pilot (Biovia, San Diego, version 19.1.0.1964)[54] protocol for automated lead expansion, filtration of analogs, and selection of a representative set. For lead expansion, we encoded several different medicinal chemistry strategies to generate potential analogs. Included in these strategies are classical bioisosteric replacement and similarity “bioisosteres” (for which Pipeline Pilot components already exist).[55−60] The classical bioisosteres include the replacement of several common functional groups with sterically similar functional groups believed to have similar physicochemical effects in a biological environment. Similarity bioisosteres locate fragments within molecules and replace them with similar fragments based on a user-specified similarity measure (e.g., FCFP_6 and PHFC_2). Another strategy involves the enumeration of heteroatomic regioisomers: heteroatoms are identified and relocated to every possible position within the molecule.[61] Finally, several molecular transformations (37 aromatic/phenyl replacement, 2 conformational restriction/expansion, 92 Topliss, 8 Magic methyl) have been encoded to identify modification sites on molecules and automate the enumeration of analogs using common medicinal chemistry approaches.[62−64] These approaches include Topliss, Magic Methyl, conformational restriction/relaxation, and ring expansion/contraction. The user can select or deselect the different transformation categories as desired. The transformations were carried out using the “Perform Reaction from Tag” or “Perform Reaction on Each Molecule” components with the “IfMultipleReactionsPossible” parameter set to “Perform Each Reaction”. An organic filter was applied to remove any transformation products that contain inorganic molecules (under the assumption that these would not be of interest to medicinal chemists).
Retrosynthetic analysis (described below) may be selected or deselected. Input is an SD file. The “Perform Reactions on Each Molecule” component is used for retrosynthetic analysis in a (run to completion) subprotocol with the IfMultipleReactionsPossible parameter set to “Perform All Reactions”. Each successful round of retrosynthesis is saved in a Pipeline Pilot Cache using a “Cache Writer” component with default parameter settings.

Tagging and Scoring

Using these techniques, tens to thousands of analogs are generated for a typical lead molecule, depending on its complexity. These molecules are then examined for any undesirable functional groups such as reactive functional groups and toxicophores.[65] Molecules with any of these features are tagged (and can be removed later as desired). The molecules are then scored for synthetic feasibility, using a newly developed algorithm (see below). The molecules are then clustered using FCFP_4 fingerprints, so that a diverse set can be selected if desired. The canonical tautomer is generated for each molecule, and duplicate molecules are removed.

Selection

After the analogs are enumerated, tagged, and scored, the resulting analogs are displayed in a graphical and tabular format. Categorical and numeric charts, such as pie charts and histograms, are then generated along with a tabular output in Pipeline Pilot. The charts and tabular output are linked together such that the user can select subsets of molecules and export them readily.

Automated Retrosynthetic Analysis and Synthetic Feasibility Prediction

Three primary methodologies were used to evaluate synthetic feasibility. The first method involves the fragmentation of known (synthesized) molecules and the relative presence of those fragments in targets. The second method couples automated retrosynthesis with the first method. In addition to using automated retrosynthesis to rate synthetic feasibility, a separate application for retrosynthetic analysis was created. Finally, a weighting mechanism was added to penalize molecular elements that are undesirable from a synthetic perspective.

Fragmentation

To create the fragments used in the first method, two molecular sources were used. These were eMolecules,[66] consisting of 26,400,125 molecules (at the time of download), and ChEMBL version 24, consisting of 1,820,035 molecules.[67] Separately, these sources were subjected to fragmentation in Pipeline Pilot using the Generate Fragments component. Specifically, ring assemblies (contiguous ring systems), bridge assemblies (contiguous ring systems that share two or more bonds), chains (contiguous atoms not in rings), and Bemis–Murcko assemblies were generated.[68] Canonical SMILES were generated for each fragment. Fragments containing fewer than two atoms were filtered out. Fragments that occurred more than 10 times over the entire molecule set were retained, along with their frequency (occurrence count). Each unique fragment and its frequency were saved in comma-separated files for each source.
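The fragment-table construction described above (drop fragments with fewer than two atoms, keep fragments occurring more than 10 times, record their frequency) can be sketched as follows. The fragmentation itself is assumed to be done externally (e.g., by Pipeline Pilot), so fragments arrive here as precomputed (canonical SMILES, atom count) pairs:

```python
from collections import Counter

def build_fragment_table(fragment_lists, min_atoms=2, min_count=10):
    """fragment_lists: one list of (canonical_smiles, atom_count) pairs
    per molecule. Returns {fragment: frequency} for fragments with at
    least `min_atoms` atoms seen more than `min_count` times overall."""
    counts = Counter()
    for frags in fragment_lists:
        for smi, n_atoms in frags:
            if n_atoms >= min_atoms:
                counts[smi] += 1
    return {smi: c for smi, c in counts.items() if c > min_count}
```

Usage on synthetic data: a benzene ring seen 11 times survives, a rare two-atom chain and all one-atom fragments do not.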

Fragmentation Scoring

Molecules evaluated for synthetic feasibility are fragmented in the same way as the source sets. A baseline score is created from the ratio of the incoming molecule’s fragments that are found in the source fragment set:[69]

baseline score = (number of the molecule’s fragments present in the source set) / (total number of fragments in the molecule)

The score is then weighted using an algorithm that takes into account the size of each fragment (number of atoms) and its frequency of occurrence in the source set.
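The baseline fragment-ratio score can be sketched directly from its definition; the subsequent size- and frequency-based weighting is omitted here because its exact form is not specified:

```python
def baseline_score(molecule_fragments, fragment_table):
    """Fraction of the molecule's fragments that occur in the reference
    table of fragments from known (synthesized) molecules."""
    if not molecule_fragments:
        return 0.0
    known = sum(1 for f in molecule_fragments if f in fragment_table)
    return known / len(molecule_fragments)
```

A molecule whose fragments all appear in the known-synthesized set scores 1.0; one built entirely from unseen fragments scores 0.0.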

Retrosynthetic Analysis

This was carried out by applying a set of transformations that apply known reactions in reverse.[40] Our solution has two primary sources of these transformations. The primary source is a set of reactions extracted from patents by a group at Eli Lilly (Lilly).[40] The secondary source is a set of reactions detailed by a group at AstraZeneca (AZ).[70] All of the reactions were reversed so that they could be applied that way. In the case of the Lilly reactions,[40] a set of 1,929,251 reactions in a format similar to SMIRKS was culled to a set of 8,040 reactions simply by looking at the number of characters in the text for each reaction. The idea here was that smaller reactions were more likely to represent the core or substructures of reactants and products and therefore would be applicable to a larger number of molecules. The reactions were then reversed by swapping the products with the reactants and converted from the SMIRKS format to RXN format in Pipeline Pilot. Approximately 10,000 druglike molecules were tested by running each of the 8040 reactions on them. Of the 8040 reactions, 2632 unique reactions were used at least once. This set of reactions was used as the final Lilly reaction set. A much smaller set of ∼45 common reactions was derived from the AZ group.[70] These reactions were hand-written SMIRKS that represented common transformations used in organic synthesis. These SMIRKS were reversed by hand. Some were removed due to their promiscuous nature when applied in reverse (e.g., carbon–carbon bond formation reactions). Once the core set of retrosynthetic reactions was selected and curated, the retrosynthetic analysis tool was developed and subjected to numerous rounds of testing (using experienced medicinal chemists) and enhancement, where various rules were imposed to encourage better outcomes. It was arbitrarily determined that up to five rounds of retrosynthetic reactions should be applied to each molecule.
In the first round, each unique set of reaction products is retained. The “size” of each product molecule was determined by the number of non-hydrogen atoms. Most retrosynthetic reactions produce more than one product. For each set of products created by an individual reaction, the largest product (the selected product) is retained. In rounds 2–5, an additional restraint is imposed: only the five smallest of the selected products advance to the next round. In rounds 4 and 5, another restraint is imposed: the selected products must be smaller than the smallest selected product of all other rounds to advance to the next round or to be reported. Results are reported for each round that is executed, with all precursor molecules from each round.
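The round-by-round pruning rules above can be sketched as a hypothetical helper; the product names and sizes are illustrative, and real inputs would be the product sets of the reversed reactions:

```python
def select_round_products(product_sets, round_number, smallest_so_far=None, keep=5):
    """product_sets: one list of (smiles, n_heavy_atoms) products per
    applied retro-reaction. Keep the largest product of each reaction;
    from round 2 on, only the `keep` smallest selected products advance;
    in rounds 4-5 a product must also be smaller than the smallest
    product selected in any earlier round (`smallest_so_far`)."""
    selected = [max(ps, key=lambda p: p[1]) for ps in product_sets if ps]
    selected.sort(key=lambda p: p[1])  # smallest first
    if round_number >= 2:
        selected = selected[:keep]
    if round_number >= 4 and smallest_so_far is not None:
        selected = [p for p in selected if p[1] < smallest_so_far]
    return selected
```

For two reactions yielding products of sizes {10, 3} and {7, 2}, round 1 selects the per-reaction maxima (10 and 7); a round-4 pass with a prior-best size of 8 keeps only the size-7 precursor.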

Fragmentation and Retro Combined Scoring

The retrosynthetic analysis tool was combined with the fragmentation score to enhance the synthetic feasibility score. For the enhanced scoring, the selected product from each of the last three executed rounds (if at least three rounds were executed) is scored using the fragmentation scoring system. The highest score is then selected as the consensus score.

Weighting Mechanisms

After reviewing the results (with our own experienced synthetic chemists), it was clear that weighting mechanisms needed to be added for certain features that are difficult to synthesize. The presence of one or more absolute chiral centers is one example of a penalizing feature. The presence of one or more spiro atoms is another. For each of these elements present in the molecule, the score is reduced by a set relative ratio.

Software Testing

A set of the 25 best-selling small-molecule drugs was selected as an example of well-known molecules to test the automated retrosynthetic analysis software (Supporting Information, Table S2 and Figure S4). A set of 346 natural products (Canvass) was used for comparison with a library of 201 FDA-approved drugs.[43]

Visualization of FDA-Approved Drugs and Natural Products

The molecular property space of FDA-approved drugs and the Canvass data set[43] were compared using a t-distributed stochastic neighbor embedding (t-SNE) plot (see the methods below).

Data Analysis

To determine whether the FDA-approved drugs were considered more synthetically feasible than the Canvass natural product library, bootstrap hypothesis testing was performed on the two data sets.[71] Briefly, both data sets (the FDA library and Canvass) are combined into one data set. Two data sets of sizes n and m (the sizes of the FDA library and the Canvass library, respectively) are randomly sampled from the combined data set, and their means and standard deviations are calculated. A p-value is calculated by determining the likelihood of the observed mean difference arising from the distribution of bootstrapped sample means.
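A sketch of the bootstrap test described above, assuming resampling with replacement from the pooled data (the sample values below are synthetic stand-ins for the synthetic feasibility scores):

```python
import numpy as np

def bootstrap_pvalue(a, b, n_boot=2000, seed=0):
    """Two-sided bootstrap test of the difference in means under the
    null hypothesis that both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    # Resample two groups of the original sizes from the pooled data
    # to build the null distribution of mean differences.
    diffs = np.array([
        rng.choice(pooled, size=a.size).mean() - rng.choice(pooled, size=b.size).mean()
        for _ in range(n_boot)
    ])
    # Fraction of null differences at least as extreme as observed.
    return float(np.mean(np.abs(diffs) >= abs(observed)))
```

With two clearly separated synthetic samples, the test returns a p-value near zero, as expected.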

t-SNE Plot Generation

All t-SNE plots were generated using the scikit-learn package in python with default parameters (number of components = 2, perplexity = 30.0, early exaggeration = 12, learning rate = 200, number of iterations = 1000, number of iterations without progress = 300, minimum gradient norm = 1e-07, metric = Euclidean).
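Equivalently, with scikit-learn (the random descriptor matrix below is a stand-in for real molecular fingerprints or property vectors of the FDA and Canvass sets):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in descriptor matrix: 100 "molecules" x 64 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))

# n_components and perplexity mirror the defaults listed above; the
# remaining parameters are left at their scikit-learn defaults.
embedding = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
```

The resulting (n_samples, 2) array is what gets plotted to compare the two chemical property spaces.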

Results

Evaluation of Different Generative Approaches

First, we evaluated several different generative model architectures (to compare with the published benchmark resources MOSES[72] and GuacaMol[73]) that had been introduced in the literature in recent years: recurrent neural networks (RNNs),[6] generative adversarial networks (GANs),[11] and variational autoencoders (VAEs).[13] To assess the capabilities of each architecture, we used a number of metrics proposed in the literature, including validity: whether the compounds generated are theoretically realistic molecules; uniqueness: the fraction of generated molecules that are unique; novelty: the fraction of generated molecules not in the training set; and finally, the Fréchet ChemNet distance (FCD),[74] a measure of how close the distribution of generated molecules is to the molecules in the training set. As comparing architectures is difficult given the ability of different hyperparameter tuning to alter results, we chose hyperparameters based on their initial implementations. We then trained each architecture (RNN, VAE, and GAN) on 1.2 million ChEMBL compounds filtered to between 10 and 50 heavy atoms. We employed early stopping to reduce the time needed to train each model. Finally, we generated 100,000 compounds per architecture. We found that all three architectures performed similarly and were all capable of generating valid, unique, and novel compounds with a good FCD score (Figure 2).[72,73] These scores were comparable to those reported in other benchmarking studies (Figure 2) and suggested that the choice of generative model architecture for MegaSyn was not a significant factor in improving generative model capabilities.
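Uniqueness and novelty, as defined above, are simple set computations; validity requires a SMILES parser (e.g., RDKit) and FCD a trained ChemNet model, so only the first two are sketched here with illustrative strings:

```python
def uniqueness(generated):
    # Fraction of generated SMILES that are distinct.
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    # Fraction of the unique generated SMILES absent from the training set.
    unique = set(generated)
    return sum(1 for s in unique if s not in training_set) / len(unique)
```

For example, four generated strings with one duplicate give a uniqueness of 0.75, and if one of the three unique strings appears in the training set, the novelty is 2/3.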
Figure 2

Comparison of different model architectures for generative models using our models (CPI) in comparison to values reported from two other published benchmark resources (MOSES[72] and GuacaMol).[73]


MegaSyn Design

At its core, MegaSyn uses long short-term memory (LSTM)-based generative models to learn the proper structure of SMILES strings.[6] As input, molecules are represented as tokenized SMILES strings. MegaSyn is composed of three distinct model types: the initial pretrained model, a set of primed models, and finally a set of exploratory models (Figure 3).
Figure 3

MegaSyn architecture. First, an initial model is trained on a drug database (i.e., ChEMBL). Next, a set of primed models are generated by training on a target compound(s). Finally, exploratory models are generated from each primed model node, completing a set of generative models that range from general, druglike molecules to analogs of the target compound(s).

The initial model is trained on ChEMBL 28's ∼2 million compounds.[22,67] The purpose of training this model is to teach it how to create druglike molecules. Once trained, the initial model "knows" how to put together druglike molecules and can be queried to generate compounds that fall within ChEMBL's chemical space. This represents the prior knowledge of chemistry: valid chemical structures, and how they are put together atom by atom, are learned in this initial model. This considerable quantity of chemical information is then transfer-learned into the subsequent models. The initial model takes the largest amount of time to train; however, once trained, it can be reused for many projects as the prior model, and the overall training time of MegaSyn is small in comparison to fully retraining a typical generative model from scratch on the entire ChEMBL database. After the initial model is trained, a set of "primed" models are trained (Figure 3). The initial model is first presented with the molecule(s) of interest. Critically, we included a form of target-structure analysis by preprocessing each targeted molecule of interest into a set of substructures. The molecule(s) of interest are broken down into substructures based on RECAP rules using RDKit's RecapDecompose module.[75] Next, simplified carbon-only versions of these substructures and of the original molecule are also generated. We found that breaking down the molecule(s) of interest into substructures allowed a different substructure set to be considered by the primed models, improving the analog exploration space around the molecule(s) of interest.
Model 1 is trained on this list of structures and substructures for several epochs using teacher forcing. Every i epochs, the model is saved until a set of n primed models has been created. Each of these primed models ranges from generic exploration of chemical space (early primed models) to focused enumeration of the target molecule(s) (late primed models). The number of epochs the model is trained for is critical: if too few epochs are trained, the primed models explore a very wide chemical space around the target molecule; if too many epochs are trained, the model learns to focus only on the specific structure and substructures of the target(s) of interest. We find that 16 total epochs, with a model saved after every 2 epochs, represent the gradient from general to specific reasonably well for several target molecules, although we note that this may depend on the training "distance" from ChEMBL to similar target molecules; because only a few targets are trained on at a time, primed models can be generated quickly.
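The priming schedule above can be sketched as a checkpointing loop; `train_one_epoch` and `snapshot` are hypothetical placeholders for the actual LSTM training and model-saving steps.

```python
# Sketch of the primed-model schedule: fine-tune for 16 epochs on the target
# (sub)structures and checkpoint every 2 epochs, yielding 8 primed models
# ranging from general (early) to target-focused (late). `train_one_epoch`
# and `snapshot` are hypothetical placeholders.
def build_primed_models(model, structures, total_epochs=16, save_every=2,
                        train_one_epoch=None, snapshot=None):
    primed = []
    for epoch in range(1, total_epochs + 1):
        train_one_epoch(model, structures)   # one teacher-forced epoch
        if epoch % save_every == 0:
            primed.append(snapshot(model))   # save a primed-model node
    return primed

# Toy run counting epochs instead of training a real network:
log = []
primed = build_primed_models(
    model={}, structures=[],
    train_one_epoch=lambda m, s: log.append("epoch"),
    snapshot=lambda m: len(log),
)
print(len(primed), primed)  # 8 [2, 4, 6, 8, 10, 12, 14, 16]
```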

Exploration-Ensemble Models

Primed models represent nodes along a single branch from a general druglike library (ChEMBL) to specific analogs of a molecule of interest. To explore more diverse chemical space around each of these nodes, a final set of ensemble models is branched off from each primed model node. For each primed model, de novo molecules are generated (∼2000–10,000 appear sufficient to cover a broad chemical space). The generated molecules are then ranked based on a composite score from a number of criteria, usually including QED,[50] activity against the target (target model), and any other desired scores. Notably, any property that can be assigned a score can be included in the final composite score, which provides flexibility in the tasks to which the generative model can be applied. Each objective can be weighted according to its importance on a scale from 0 to 1, with 1 being extremely important and 0 representing no importance. After the generated set of compounds is scored, the top 10% of ranked compounds are kept and the model is trained on these top compounds, a training concept called hill-climb MLE.[76] A new set of molecules is generated after training, and the cycle continues. Importantly, the top 10% of compounds are kept from one epoch to the next; a newly generated compound replaces a member of the set only if its score is higher. Eventually, the model will find a substructure minimum and is then capable of generating analogs of this specific substructure. Often, based on the initial seed molecules of the very first iteration, the model will converge to one local minimum. At least four models are trained and generated from each primed model node to obtain models that focus on different substructures of the original target molecule. The top-scoring 10% of compounds found over the entire training loop for each model are kept.
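The hill-climb MLE loop with persistent top-10% retention can be sketched as follows; `generate`, `score`, and `train_on` are hypothetical placeholders for the actual sampling, composite scoring, and fine-tuning steps, and the pool size is scaled down for illustration.

```python
# Sketch of the hill-climb MLE loop: each round, generate molecules, score
# them, merge into a persistent top pool (a new molecule only displaces a
# pool member if it scores higher), then fine-tune on the pool.
import heapq
import itertools

def hill_climb(generate, score, train_on, rounds=3, per_round=20, keep=2):
    pool = []  # min-heap of (score, molecule); lowest score evicted first
    for _ in range(rounds):
        for mol in generate(per_round):
            item = (score(mol), mol)
            if len(pool) < keep:
                heapq.heappush(pool, item)
            elif item > pool[0]:            # better than the worst kept
                heapq.heapreplace(pool, item)
        train_on([mol for _, mol in pool])  # fine-tune on the retained set
    return sorted(pool, reverse=True)

# Toy run: "molecules" are integers and the score is the value itself.
counter = itertools.count()
top = hill_climb(
    generate=lambda n: [next(counter) for _ in range(n)],
    score=lambda m: m,
    train_on=lambda mols: None,
)
print(top)  # [(59, 59), (58, 58)]
```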
The collection of models is indexed to give the user flexibility over which regions of chemical space to explore. Instead of sampling from a single generative model, MegaSyn randomly samples in parallel from a collection of t total models (initial model + (i/n) × 4). It should be noted that training multiple models from the initial model takes a limited amount of time, requiring only ∼6 h on a single Nvidia GeForce GTX 1080 Ti GPU to generate 32 models, the number of models generated per MegaSyn case study in this paper. The desired "focus" of the model collection can be driven by a generative specificity parameter, which weights the chance of each model being sampled, biasing generation either toward molecules closer to the training target(s) or away from the targets to generate novel compounds.
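A minimal sketch of specificity-weighted sampling, assuming the models are ordered from general to specific; the linear interpolation between uniform and fully focused weights is an assumed functional form, since the paper does not specify how the parameter maps to sampling probabilities.

```python
# Sketch of the generative specificity parameter. Models are assumed ordered
# from general (initial model, early primed nodes) to specific (late primed
# nodes); the linear interpolation below is an assumed functional form.
import random

def model_weights(n_models, specificity):
    # specificity = 0 -> sample all models uniformly;
    # specificity = 1 -> sample only the most target-focused model.
    uniform = [1.0 / n_models] * n_models
    focused = [0.0] * (n_models - 1) + [1.0]
    return [(1 - specificity) * u + specificity * f
            for u, f in zip(uniform, focused)]

def sample_model(models, specificity, rng=random):
    weights = model_weights(len(models), specificity)
    return rng.choices(models, weights=weights, k=1)[0]

print(model_weights(4, 0.5))  # [0.125, 0.125, 0.125, 0.625]
model = sample_model(["initial", "primed_1", "primed_2", "primed_3"],
                     specificity=0.9, rng=random.Random(0))
```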

Evaluation of De Novo Molecules Generated from MegaSyn

Case Study 1: Lapatinib Analogs

We evaluated the capability of MegaSyn to generate valid, novel molecules with desired properties through several case studies. As an example of our generative approach, we chose to optimize lapatinib, an orally active drug for breast cancer and other tumors (Figure 4A). Lapatinib inhibits the EGFR (HER1) and HER2 kinases and is commonly used in combination therapy for HER2-positive breast cancer.[77] Lapatinib, however, is relatively poor at crossing the blood–brain barrier (BBB), with highly variable metastasis uptake, and is not detected in normal brain tissue.[78] We used MegaSyn to design analogs that simultaneously optimize HER1 and HER2 activities and an improved ability to cross the BBB. All activity models were built using naïve Bayes (Table S1; see methods). Both the HER1 and HER2 data sets were reasonably well balanced (∼42% and ∼40% actives, respectively). The HER1 and HER2 models had ROC values of 0.80 and 0.86 and F1 scores of 0.69 and 0.77, respectively (Table S1). As inputs to the scoring function, we considered a QED score > 0.6, similarity to lapatinib or lapatinib fragments (Tanimoto similarity > 0.6), and prediction scores from machine learning models we constructed for crossing the BBB, HER1 inhibition, HER2 inhibition, and finally an hERG model (using HEK293 cell data only) to ensure that the molecules avoid this ion channel. We ran MegaSyn for 16 total epochs, saving a primed model node every two epochs and generating four exploratory models per primed model node, for a total of 32 RNN-based models. A total of 10,000 molecules were generated from each of the 32 RNN-based models.
Figure 4

Case study 1. (A) Structure of lapatinib, the target molecule of interest. (B) Number of the top 2000 MegaSyn compounds that fall within the applicability domain of the HER1 and HER2 models.


MegaSyn Explores Diverse Chemical Space

t-SNE plots of the top 200,000 scored molecules show that MegaSyn explores a rich chemical space around lapatinib (Figure 5A), spanning Tanimoto similarity scores from 0.1 to 0.97. To determine whether the generated molecules were within the applicability domain of our models, we applied a modified version of the approach of Aniceto et al.[79] (Figure 4B). First, we trained an ensemble random forest classifier using scikit-learn's RandomForestClassifier (number of trees = 500, class_weight = balanced). Using the ensemble, we calculated the bias (prediction score − true binary value) and the standard deviation (STD) of the ensemble predictions. Using the formula bias × (1 − STD) to define a weight, we then calculated the average weighted Tanimoto similarity of every compound to the remaining training data set. We set a threshold corresponding to the 75th percentile of all weights calculated in the training set. Next, for each MegaSyn-generated compound, we took the maximum Tanimoto score to the training set and weighted it by that nearest neighbor's bias and STD. We rejected a compound from the applicability domain (AD) if its weighted Tanimoto did not surpass the threshold. We found that 37.6 and 44.1% of the top 2000 scored compounds were in the applicability domain for HER1 and HER2, respectively. To compare with other generative approaches, we also used a single LSTM-based generative model pretrained on ChEMBL with the exact same loss function and multiparameter optimization score to drive generation. Both MegaSyn and the single LSTM model were trained for the same total number of iterations, with each MegaSyn ensemble submodel trained for iterations/N − p × 2, where N is the number of ensemble submodels and p is the number of primed models. We then sampled 5000 compounds from the LSTM-based generative model to compare against MegaSyn, from which we also sampled 5000 compounds. We additionally sampled 5000 compounds from the individual ensemble models to explore submodel heterogeneity.
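A plain-Python sketch of this applicability-domain weighting, using toy numbers rather than real fingerprints; the nearest-rank percentile implementation is an assumption, as the paper does not specify how the 75th percentile was computed.

```python
# Sketch of the applicability-domain check (adapted from Aniceto et al.):
# each training compound gets weight = bias * (1 - STD) from the RF
# ensemble; the AD threshold is the 75th percentile of weighted training
# similarities; a generated compound is kept only if its nearest-neighbor
# Tanimoto, weighted by that neighbor's bias and STD, clears the threshold.
def ensemble_weight(bias, std):
    return bias * (1 - std)

def percentile_75(values):
    # simple nearest-rank percentile (the exact method is an assumption)
    ordered = sorted(values)
    return ordered[max(0, int(0.75 * len(ordered)) - 1)]

def in_domain(nn_tanimoto, nn_bias, nn_std, threshold):
    return nn_tanimoto * ensemble_weight(nn_bias, nn_std) > threshold

# Toy weighted training-set similarities (not real fingerprint data):
threshold = percentile_75([0.1, 0.2, 0.3, 0.4])
print(threshold)                                          # 0.3
print(in_domain(0.9, nn_bias=0.8, nn_std=0.1, threshold=threshold))  # True
```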
MegaSyn had significantly higher multiparameter optimization scores, suggesting that it can find a better composite score maximum (Figure 6A). While we draw analogies to ensemble prediction models, our results suggest that the individual models in the MegaSyn ensemble are not necessarily weaker generative models; instead, they often discover different local minima due to the probabilistic nature of the generative approach combined with the hill-climb MLE scoring feedback loop. Each submodel generally found a distinct region of chemical space to focus on and did not converge onto the same spatial regions (Figure 6B). Thus, while the ensemble models generally generate lower-scoring molecules than the single LSTM due to less training, by chance some ensemble models rapidly converge on an optimal region of chemical space. This suggests that the ensemble approach is better able to avoid becoming trapped in local minima by exploring a larger chemical space with multiple, weaker models than any single trained model can. When we narrow down to the top 2000 scored generated molecules, molecular diversity is still common, further suggesting that MegaSyn is not simply enumerating a common core structure but exploring diverse options to meet the criteria used in the scoring function (Figure 5B). In contrast to the Tanimoto similarity score, the region of the t-SNE plot with the highest multioptimization score is distinct from the location of lapatinib, suggesting that MegaSyn is potentially capable of finding novel chemical space with better molecular properties than lapatinib (Figure 7B). While the majority of the top 2000 compounds are predicted to cross the BBB (Figure 7A), there is a clear structure–activity relationship with HER1 activity and especially HER2 activity, which shows higher selectivity among the top compounds (Figure 7C,D).
We evaluated the atomic contributions to model predictions for lapatinib and two of the top-scoring generated compounds (Figure S1). While the BBB model suggests that the smaller generated compounds have no distinct atom-specific prediction differences (Figure S1), the HER1 model suggests that the core atomic contribution to predicted activity is retained, with the addition of a new strong atomic contributor (the carbon atom highlighted in the first top-generated molecule under HER1) (Figure S1). For HER2, however, the strongest atomic contributor in lapatinib is not retained in the top-scoring generated compounds; instead, novel atomic contributors are highlighted, suggesting that the optimization of the generated molecules can "find" distinct properties that keep the generated molecules active against the target (Figure S1). We next evaluated the synthetic feasibility of the top-scoring compounds using our newly built retrosynthetic analysis tool.
Figure 5

t-SNE plots of structural diversity of MegaSyn-generated compounds. (A) t-SNE plot based on ECFP6 for 200,000 top-scoring generated molecules colored by Tanimoto similarity to lapatinib. (B) t-SNE plot based on ECFP6 for 2000 top-scoring generated molecules colored by Tanimoto similarity to lapatinib. The blue dot represents lapatinib.

Figure 6

Comparison of MegaSyn and individual ensemble models (designated “E 1”, “E 2”, etc.) vs a single LSTM model multioptimization score using the same ChEMBL pretrained model for setup. (A) Boxplot showing the multiparameter optimization score for the generated compounds. (B) A t-SNE plot showing the structurally distinct generated chemical space from each submodel in the MegaSyn ensemble (designated E 1, E 2, etc.).

Figure 7

t-SNE plots based on ECFP6 of the top 2,000 scoring compounds generated by MegaSyn colored by (A) the predicted ability to cross the BBB, (B) multiobjective optimization score, (C) predicted HER1 inhibition, or (D) predicted HER2 inhibition.


Case Studies for Retrosynthetic Analysis

Before scoring the retrosynthetic feasibility of MegaSyn-generated compounds, we first evaluated test cases to illustrate the utility of the retrosynthetic analysis tool. As an example application, Sorensen et al. recently described a three-step synthesis for the antiviral drug tilorone.[80] Our software suggests several approaches to synthesize tilorone (Figure S2). Another molecule tested in this way was the kinase inhibitor axitinib,[81] for which the retrosynthetic analysis results were compared with a known synthesis route (Figure S3). We expanded this analysis into a larger evaluation of the top 25 selling small-molecule drugs (Table S2), which yielded a similar number of alternative synthetic routes for these drugs (Figure S4). Fifteen of the 25 were "retro-synthesized" completely to commercially available reactants (eMolecules was checked for commercial availability). Two of the drugs required only one step, four required two steps, eight required three steps, and one required five steps to break down into commercially available reactants. In many cases, the retrosynthesis went further than required to reach commercially available reactants (Table S2).

Synthetic Feasibility Prediction

An example of using tilorone for synthetic feasibility prediction is shown in Figure S5, which illustrates the analysis of the results as a whole and the scoring. In addition, we compared the synthetic feasibility consensus scores of an FDA-approved drug library versus 346 natural products in the Canvass data set[43] (Figure 8). This analysis shows a good separation of drugs from natural products using this score. We therefore used reference points of a synthetic feasibility score of >60 to indicate synthetic feasibility and a score of >90 to indicate a compound that is more easily synthesizable. The FDA and Canvass data sets were statistically significantly different (p = 0.00318), suggesting that the synthetic feasibility tool is easily capable of discerning difficult-to-synthesize molecules (e.g., natural products) from generally simpler molecules like drugs. Visualization of the chemical space of the approved drugs and the natural products further demonstrates that they cover different areas of chemical property space, with drugs generally concentrated in the center of the plot and natural products on the periphery (Figure S6).
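These reference points (a consensus score above 60 indicating synthetic feasibility and above 90 indicating easy synthesizability, per the Figure 8 caption) can be expressed as a simple labeling function; the category names are ours, paraphrasing the caption.

```python
# Sketch of the synthetic feasibility reference points: > 60 is treated as
# synthetically feasible and > 90 as easily synthesizable. Category names
# paraphrase the figure caption.
def feasibility_label(score):
    if score > 90:
        return "easily synthesizable"
    if score > 60:
        return "synthetically feasible"
    return "difficult to synthesize"

print(feasibility_label(95))  # easily synthesizable
print(feasibility_label(72))  # synthetically feasible
print(feasibility_label(40))  # difficult to synthesize
```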
Figure 8

Boxplot comparing the consensus synthetic feasibility score for an FDA-approved library versus 346 natural products in the Canvass data set and 200 of the top-scoring MegaSyn-generated lapatinib analog compounds. 195/200 MegaSyn compounds had a score >60 and 46/200 compounds had a score >90, indicating that the compounds were synthetically feasible and easily synthesizable, respectively.


Synthetic Feasibility of MegaSyn-Generated Compounds

Having validated our synthetic feasibility tool as described above, we used the consensus model to score the top 200 MegaSyn-generated lapatinib analogs ranked by the MPO score (Figure 8). Most compounds (97.5%) were scored as synthetically feasible, with nearly a quarter (23%) considered easily synthesizable (Figure 8). This suggests that MegaSyn can generate valid, druglike, readily synthesizable compounds with the desired predicted physicochemical and bioactivity properties.

Case Study 2: Ibogaine Analogs

As a second, more challenging case study, we chose to attempt to improve upon a natural product, ibogaine. Ibogaine is a natural product derived from Tabernanthe iboga (Figure 9A). Recent research has shown that psychedelics such as ibogaine may have therapeutic potential as antiaddictive agents. However, ibogaine has several undesirable properties, including inhibition of the hERG channel and induction of a psychedelic experience. In a recent publication, Cameron et al. proposed, synthesized, and tested new ibogaine analogs with the following targeted properties in mind: no inhibition of the hERG channel, maintained specificity for 5-HT2A (thought to be necessary for the therapeutic action), and no induction of a psychedelic experience.[44] Ultimately, the authors discovered tabernanthalog, an ibogaine derivative with these desired properties.[44]
Figure 9

MegaSyn generation of new molecules based on ibogaine. (A) The structures of ibogaine and tabernanthalog. (B) t-SNE plots of the top 2000 generated molecules based on ECFP6 fingerprints colored by Tanimoto similarity. (C) Structures of three randomly sampled molecules from the top 200 compounds. (D) Histogram of the AlogP of the top 50 generated compounds. The AlogP of ibogaine is indicated by the red dashed line.

We used this paper as a test case and challenged MegaSyn to find tabernanthalog, using the following criteria: activity against 5-HT2A; inactivity against hERG, 5-HT1A, 5-HT1F, and 5-HT2C; similarity to ibogaine and its substructures (Tanimoto > 0.6); and a lower cLogP than ibogaine. We ran MegaSyn for 16 total epochs, saving a primed model node every two epochs and generating four exploratory models per primed model node, for a total of 32 LSTM-based models. We built machine learning models against 5-HT2A, hERG, 5-HT1A, 5-HT1F, and 5-HT2C to include in the multiobjective scoring function that drives MegaSyn. All activity models were built using naïve Bayes (Table S1; see methods). Most of the models were well balanced and had precision values from 0.68 to 1, F1 scores from 0.6 to 0.96, and ROC values > 0.83. We then generated 100,000 compounds and took the 50 compounds with the highest multiobjective scores. Tabernanthalog was among these top 50 highest-scoring compounds. In addition, MegaSyn captured a wide variety of other related structures, including scaffolds dissimilar to ibogaine (Figure 9B,C). Most of the top 50 compounds had a lower AlogP than ibogaine, suggesting that MegaSyn can find molecules with improved predicted druglike properties (Figure 9D).
In addition, the top 10 generated compounds had an MPO score (where MPO here represents a measure of BBB penetration)[82] comparable to or better than that of tabernanthalog, and all had a higher MPO score than ibogaine, suggesting that several of the novel MegaSyn-generated compounds have a higher probability of crossing the BBB (Table S3).
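Selecting the 50 highest multiobjective-scoring compounds from the 100,000 generated ones, as done for this case study, is a standard top-k selection; `composite_score` is a hypothetical stand-in, with integers standing in for molecules.

```python
# Sketch of top-k selection by composite score: rank all generated molecules
# by the multiobjective score and keep the top 50. `composite_score` is a
# hypothetical placeholder; integers stand in for molecules.
import heapq

def top_k(molecules, composite_score, k=50):
    return heapq.nlargest(k, molecules, key=composite_score)

# Toy run, scaled down to the top 3 of 100,000 integer "molecules":
best = top_k(range(100_000), composite_score=lambda m: m, k=3)
print(best)  # [99999, 99998, 99997]
```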

Automated Analog Generation

In addition to the de novo design of molecules with MegaSyn, we have also developed an easy-to-use web interface, built with Pipeline Pilot, for running an automated analog generation protocol that can be used for lead expansion. We encoded several different medicinal chemistry strategies to generate potential analogs. A file of the molecules for which to generate analogs is uploaded, and the output consists of a pie chart summarizing the makeup of the analogs and bar charts of their properties (Figure S7). The charts and tabular output are linked together so that the user can select subsets of molecules and readily export them. This tool can also be used with the retrosynthetic analysis described earlier to score likely synthetic feasibility.

Discussion

The goal of this study was to generate a complementary suite of accessible tools for generative molecular design, computer-assisted synthesis, retrosynthesis, and synthetic viability to propose new analogs or additional molecules as the next steps after the identification of a potential hit. We aimed to make use of existing data and algorithms wherever possible to deliver this additional functionality and provide meaningful synthesis suggestions for each molecule. We have now described these user-accessible methods for automated lead expansion, filtration of analogs, and selection of a representative set of molecules. This collection of capabilities can also be combined with other software or machine learning tools to score proposed analogs with models of interest. Over the past few years, new discoveries in the field of de novo drug design have renewed interest in generating new molecules using machine learning.[6−9] RNNs have been used to generate libraries for HTS, hit-to-lead optimization, and fragment-based hit discovery.[15,83−87] A feature of these generative models is the ability to optimize multiple parameters, such as physicochemical properties or biological activity. While these new approaches are promising, a critical gap is that little experimental validation has been performed in the aforementioned studies by synthesizing compounds and testing them for activity; only a few groups have validated their approach by making and testing compounds[88] or by finding structurally similar compounds from vendors.[89] Default or "vanilla" generative models, while capable of generating novel compounds, often do not end up in the desired chemical space.
In our conversations with numerous drug discovery experts at various companies, the major complaints regarding generative models are that they either enumerate on the same initial target molecules, essentially rediscovering what medicinal chemists have already proposed (suggesting that the model is too focused), or end up far away from "realistic" drug designs, proposing molecules well outside the realm of synthesizability. We suggest that a single model is not sufficient to cover all of the possible tasks requested of a generative model, so we have attempted to circumvent these issues by creating a large enumeration of models, from the very general (little information is considered about the desired molecular space) to the specific (models that generate only analogs of the desired target molecules).

MegaSyn Initial Model Choice

The current MegaSyn models are all initially based on a single model pretrained on the ChEMBL database. This serves two purposes: first, the model has already learned how to compose correct molecular structures from SMILES strings; second, ChEMBL has the additional benefit of being composed almost entirely of druglike molecules. This works to the advantage of MegaSyn because of its unique training strategy using the hill-climb MLE algorithm. The use of hill-climb MLE means that only molecules the generative model is itself capable of generating can be used for training, creating a feedback loop in which only druglike molecules are generated and trained on, and preventing molecules with undesirable properties from being generated. This is further reinforced through the use of a QED score to prevent molecules from straying too far into non-druglike space. The choice of database for training the initial model is therefore critical to the success of a generative model, and other databases could be explored to change the desired outcome.

Composite Score Function

The core driver of MegaSyn is the composition of the composite score function, which often includes a score for druglikeness (such as QED) and similarity to the target molecule(s) (i.e., Tanimoto similarity) in addition to the primary activity scoring models for potential drug targets. The accuracy and choice of scores are therefore also critical to the success of MegaSyn. The number of possible scores to include is unbounded, the only requirement being that a molecule can be scored and ranked numerically. For example, machine learning models of toxicity (hERG, drug-induced liver toxicity, CYPs, etc.) can be combined with on-target (e.g., a 5-HT receptor) and off-target models (other 5-HT receptors or other targets to avoid) to create a composite of dozens of scoring functions. We included a weighting value from 0 to 1, which allows flexibility in score inclusion; instead of only the most important score functions, several "nice-to-have" scores may also be introduced with a lower weight than the more critical score functions.
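A minimal sketch of such a composite score, assuming a weighted sum as the aggregation (the paper specifies the 0–1 importance weights but not the exact aggregation); all scoring functions and property names below are hypothetical.

```python
# Sketch of the weighted composite score. The aggregation (a weighted sum)
# is an assumption; the paper specifies only that each objective carries an
# importance weight between 0 and 1. Scoring functions and property names
# are hypothetical.
def composite_score(molecule, objectives):
    """objectives: list of (scoring_function, weight) pairs, weights in [0, 1]."""
    return sum(weight * score(molecule) for score, weight in objectives)

objectives = [
    (lambda m: m["qed"], 1.0),             # druglikeness: critical
    (lambda m: m["target_activity"], 1.0), # on-target activity: critical
    (lambda m: 1 - m["herg_risk"], 0.5),   # hERG avoidance: "nice-to-have"
]
mol = {"qed": 0.5, "target_activity": 1.0, "herg_risk": 0.0}
print(composite_score(mol, objectives))  # 0.5 + 1.0 + 0.5*1.0 = 2.0
```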

Case Study Results: Pros and Cons

In the absence of prospective validation of generative approaches, case studies are a promising alternative for exploring the possible applications and limitations of generative de novo design software, as demonstrated herein. We illustrated that MegaSyn, even when faced with a natural product (ibogaine), can discover the same molecular analog (tabernanthalog) as proposed by medicinal chemists (using "traditional" medicinal chemistry approaches to design), suggesting that it is capable of supplementing medicinal chemistry exploration. In addition, several of the top-scoring compounds in this case study had molecular scaffolds distinct from ibogaine, highlighting that ibogaine is not considered a "druglike" molecule. Further, the proposed top-scoring compounds for lapatinib, while "similar" to lapatinib, possessed improved predicted molecular properties, which was the intent. The downside of such case studies is that the interpretation of success is only as good as the accuracy of the composite scoring function. While we can judge generated molecules as being reasonable from a chemistry point of view, it remains to be seen whether the other top-scoring compounds are in fact (1) active, (2) nontoxic, and (3) selective without making and testing them. These are critical points that have yet to be fully investigated for any generative model proposed to date (to the best of our knowledge), and we do not know whether the bias of using machine learning models to drive generative models also affects the probability that the top-scoring generated compounds are truly active, nontoxic, or selective. We would argue, however, that these same machine learning models could be used to direct drug discovery projects regardless of the origin of the proposed molecules, suggesting that generative models may provide a promising route to finding new molecules to test, especially when combined with retrosynthetic analysis.

Retrosynthetic Analysis and Analog Designer

While some of the tools described herein are likely less sophisticated than the approaches described earlier for computer-assisted synthesis,[30−32] retrosynthesis,[25,33−37] and synthetic viability (e.g., AutoGrow 3,[90] chemical stability,[91] and others[38] that eliminate invalid options), they can be readily implemented in Pipeline Pilot, a widely used, commercially available product. Similarly, this approach and software could be readily reimplemented in open-source tools such as KNIME[54,92] or scripted in Python or other languages. In conclusion, we have demonstrated that MegaSyn can propose synthesizable analogs of molecules based on the integration of various software components (open source and commercial). We have also demonstrated that we can recapitulate synthetic approaches for approved drugs in our case studies and that our synthetic feasibility score can reliably differentiate approved drugs, which are likely to be more synthetically feasible, from more complex natural products. While these efforts represent essentially retrospective evaluations of the software developed, this is in line with what has been demonstrated for several of the more sophisticated tools described earlier. The next step is using the MegaSyn suite of tools to propose analogs, define how to make them, and rank their synthetic feasibility before ultimately selecting molecules to synthesize and test in vitro. We are currently applying MegaSyn in just this way to various internal and collaborative research projects.
The development of tools like MegaSyn should also consider the potential for dual use of this technology, as we recently reported with the application of MegaSyn to propose VX, several of its analogs, and its precursors.[93] As MegaSyn is a commercial product, we have control over who has access to it, such that we can implement restrictions or an API for any forward-facing models of a sensitive nature, as has been done elsewhere for other machine learning models such as GPT-3.[94]
References (67 in total)

1.  Molecular transformations as a way of finding and exploiting consistent local QSAR.

Authors:  Robert P Sheridan; Peter Hunt; J Chris Culberson
Journal:  J Chem Inf Model       Date:  2006 Jan-Feb       Impact factor: 4.956

2.  GuacaMol: Benchmarking Models for de Novo Molecular Design.

Authors:  Nathan Brown; Marco Fiscato; Marwin H S Segler; Alain C Vaucher
Journal:  J Chem Inf Model       Date:  2019-03-19       Impact factor: 4.956

3.  Data-driven Chemical Reaction Prediction and Retrosynthesis.

Authors:  Vishnu H Nair; Philippe Schwaller; Teodoro Laino
Journal:  Chimia (Aarau)       Date:  2019-12-18       Impact factor: 1.509

4.  [Review] The Use of Conformational Restriction in Medicinal Chemistry.

Authors:  Pedro de Sena M Pinheiro; Daniel A Rodrigues; Rodolfo do Couto Maia; Sreekanth Thota; Carlos A M Fraga
Journal:  Curr Top Med Chem       Date:  2019       Impact factor: 3.295

5.  Deep learning enables rapid identification of potent DDR1 kinase inhibitors.

Authors:  Alex Zhavoronkov; Yan A Ivanenkov; Alex Aliper; Mark S Veselov; Vladimir A Aladinskiy; Anastasiya V Aladinskaya; Victor A Terentiev; Daniil A Polykovskiy; Maksim D Kuznetsov; Arip Asadulaev; Yury Volkov; Artem Zholus; Rim R Shayakhmetov; Alexander Zhebrak; Lidiya I Minaeva; Bogdan A Zagribelnyy; Lennart H Lee; Richard Soll; David Madge; Li Xing; Tao Guo; Alán Aspuru-Guzik
Journal:  Nat Biotechnol       Date:  2019-09-02       Impact factor: 54.908

6.  Application of Generative Autoencoder in De Novo Molecular Design.

Authors:  Thomas Blaschke; Marcus Olivecrona; Ola Engkvist; Jürgen Bajorath; Hongming Chen
Journal:  Mol Inform       Date:  2017-12-13       Impact factor: 3.353

7.  Efficient multi-objective molecular optimization in a continuous latent space.

Authors:  Robin Winter; Floriane Montanari; Andreas Steffen; Hans Briem; Frank Noé; Djork-Arné Clevert
Journal:  Chem Sci       Date:  2019-07-08       Impact factor: 9.825

8.  A non-hallucinogenic psychedelic analogue with therapeutic potential.

Authors:  Lindsay P Cameron; Robert J Tombari; Ju Lu; Alexander J Pell; Zefan Q Hurley; Yann Ehinger; Maxemiliano V Vargas; Matthew N McCarroll; Jack C Taylor; Douglas Myers-Turnbull; Taohui Liu; Bianca Yaghoobi; Lauren J Laskowski; Emilie I Anderson; Guoliang Zhang; Jayashri Viswanathan; Brandon M Brown; Michelle Tjia; Lee E Dunlap; Zachary T Rabow; Oliver Fiehn; Heike Wulff; John D McCorvy; Pamela J Lein; David Kokel; Dorit Ron; Jamie Peters; Yi Zuo; David E Olson
Journal:  Nature       Date:  2020-12-09       Impact factor: 49.962

