
iQSPR in XenonPy: A Bayesian Molecular Design Algorithm.

Stephen Wu1,2, Guillaume Lambard3, Chang Liu1, Hironao Yamada1,4, Ryo Yoshida1,2,3.   

Abstract

iQSPR is an inverse molecular design algorithm based on Bayesian inference that was developed in our previous study. Here, the algorithm is integrated in Python as a new module called iQSPR-X in the all-in-one materials informatics platform XenonPy. Our new software provides a flexible, easy-to-use, and extensible platform for users to build customized molecular design algorithms using pre-set modules and a pre-trained model library in XenonPy. In this paper, we describe key features of iQSPR-X and provide guidance on its use, illustrated by an application to a polymer design that targets a specific range of bandgap and dielectric constant.
© 2019 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA.


Keywords:  Bayesian inference; machine learning; molecular design; open source; polymer


Year:  2019        PMID: 31841276      PMCID: PMC7050509          DOI: 10.1002/minf.201900107

Source DB:  PubMed          Journal:  Mol Inform        ISSN: 1868-1743            Impact factor:   3.353


Introduction

Inverse molecular design is the process of computationally creating new chemical structures that exhibit desired properties, and this approach has been one of the most important research subjects in materials science. For decades, scientists have searched for efficient methods of discovering novel materials for a wide variety of industrial and engineering applications. Conventional approaches have often relied on expert knowledge to investigate new structures by trial and error, starting from known materials and considering a relatively small sub‐region in the whole search space. Although the chemical space of small organic molecules consists of approximately $10^{60}$ candidates,1 the total number of currently known compounds is at most on the order of $10^{8}$.2 Hence, most of the chemical space remains unexplored, and the concept of computer‐aided molecular design (CAMD) has emerged to accelerate this extremely slow discovery process.3 An early attempt by Joback and Stephanopoulos framed CAMD as an optimization problem with rule‐based molecule enumeration.4 In many subsequent works, materials properties were optimized within a search space that was pre‐constrained to a small subspace built from expert‐selected molecular fragments or chemical rules, using heuristic optimization algorithms such as genetic algorithms5, 6 and Monte Carlo based stochastic optimization.7 For example, Miyao et al.8, 9 used a set of chemically favorable fragments and designed templates of specific molecular graphs that were combined with some mixture models for property predictions to generate a desired class of candidate molecules. Although these methods were a major step forward in the history of CAMD, they still suffered from a lack of capability to handle the large and highly diverse discrete spaces of candidate molecules. In recent years, a new family of CAMD algorithms has emerged, inspired by the great success of modern machine learning (ML) methods. 
In particular, to broaden the search space, ML methods that use probabilistic language models based on deep neural networks (DNNs) have proliferated since 2017.10 In these methods, a language model is trained on a given set of existing molecules, the chemical structures of which are translated into a set of strings according to the simplified molecular‐input line‐entry system (SMILES) chemical language.11 Models trained to recognize chemically realistic structures are then used to refine chemical strings in the molecular design calculation. Promising examples have included various types of variational autoencoders,12, 13, 14, 15 generative adversarial networks,16 recurrent neural networks,17, 18 and so on. These methods have been able to produce diverse chemical structures; however, they often require large training datasets to obtain a DNN‐based generator that can produce chemically realistic molecules with grammatically valid SMILES. Datasets this large are unavailable in many applications. Furthermore, many of these methods generate chemically or grammatically invalid representations of molecules at relatively high rates, unless their hyperparameters are carefully tuned.19, 20, 21 Some previous works considered simpler generative models to avoid the need to train the model on a large dataset. Yoshikawa et al.22 exploited a grammatical evolution method with parallel computation to generate a diverse set of candidate molecules conditional on arbitrarily given design targets. Ikebata et al.23 combined a simple probabilistic language model based on an n‐gram representation of SMILES sequences with a Bayesian inference framework to sequentially modify a population of molecules into promising candidate molecules that would exhibit desired properties. For a more complete review of the above methods, see the detailed overview of recent developments in inverse molecular design by Schwalbe‐Koda and Gómez‐Bombarelli.24 
In this paper, we introduce iQSPR‐X, a flexible software module that implements the Bayesian molecular design algorithm iQSPR, which was developed in our previous work.23 The algorithm was implemented in XenonPy, a Python package with an integrated platform of materials informatics.25 In contrast to the original iQSPR algorithm developed in R, the new version allows users to exploit various features of XenonPy as described below. The basic computational workflow consists of a two‐step iteration: (1) current chemical structures are modified to new ones using a generator and (2) candidate molecules that show promise for desired properties are selected using an evaluator, which is a set of ML models for predicting material properties. The generator and the evaluator can be pre‐trained separately with given training instances. Users can either train new models from scratch or reuse relevant pre‐trained models from a model library in XenonPy, which covers a broad array of material properties for small molecules and polymers. In addition, when the available data on the structure‐property relationship for a target task are limited, directly obtaining a reliable prediction model is difficult. However, an ML technique called transfer learning can be used to extract knowledge relevant to the target task from a large set of pre‐trained models to help train new models more efficiently.26 Successful application of the iQSPR method, in conjunction with transfer learning to overcome limited polymeric properties data, was demonstrated in our previous study, which achieved the discovery of new polymers with high thermal conductivity.27 A set of tutorials distributed as Jupyter notebooks is available at the website of XenonPy,25 and these include detailed explanations and sample code for building customized generators and evaluators, performing the inverse design calculations, and using some of the convenient modules in XenonPy. 
In this paper, we highlight some key features of iQSPR‐X and describe its application to the task of designing polymers using data from Polymer Genome (PG).28, 29

Computational Methods

Bayesian Molecular Design

The primary task of the Bayesian molecular design is to draw a set of samples from the posterior distribution $P(S \mid Y \in U)$, which represents the conditional probability of observing a chemical structure $S$ given that its material properties $Y = \{Y_i \mid i = 1, \ldots, m\}$ lie in a target region $U$. In the iQSPR‐X implementation, $S$ is encoded as a SMILES string; i.e., $S = s_1 s_2 \ldots s_g$, where each $s_i$ is any valid character in SMILES. For example, phenol (C6H6O) can be represented by the SMILES string “C1=CC=C(C=C1)O”, where C and O denote the carbon and oxygen atoms, respectively; “=” denotes a double bond; the two “1” digits denote the opening and closing of the ring structure; and the parentheses denote the beginning and ending of the branching component. According to Bayes′ theorem, a posterior distribution is proportional to the product of a likelihood function and a prior distribution:

$$P(S \mid Y \in U) \propto P(Y \in U \mid S)\, P(S),$$

where $P(Y \in U \mid S)$ represents the likelihood function that evaluates the goodness‐of‐fit of $S$ with respect to the given property requirement $Y \in U$, and $P(S)$ represents the prior probability that $S$ belongs to a predefined search space of SMILES strings. Thus, $P(S)$ will deliver a small or even zero probability when presented with an unfavorable or chemically unrealistic structure, thereby acting as a filter for such out‐of‐scope or invalid structures. In iQSPR‐X, a sequential Monte Carlo algorithm proposed by Ikebata et al.23 is implemented. This algorithm is somewhat similar to a genetic algorithm. With a given set of initial samples $\mathcal{S}_0 = \{S_i^0 \mid i = 1, \ldots, N\}$ of size $N$, the pre‐trained prior is used as a generator to propose a new set of samples $\mathcal{S}_0'$. A fitness score is then assigned to each sample in $\mathcal{S}_0'$ using the likelihood, which is the evaluator in iQSPR‐X. By resampling $N$ samples from $\mathcal{S}_0'$ in proportion to the fitness scores, a refined set $\mathcal{S}_1$ is obtained and once again modified by the generator. This cycle is repeated $T$ times to obtain a final sample set $\mathcal{S}_T$. 
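The propose, score, and resample cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iQSPR‐X implementation: the function names (`smc_design`, `toy_propose`, `toy_log_likelihood`) are hypothetical, the "molecules" are four‐character strings over C and O, and the toy likelihood simply rewards oxygen characters.

```python
import math
import random


def smc_design(initial, propose, log_likelihood, n_steps, rng):
    """Toy sequential Monte Carlo loop: propose, score, resample."""
    samples = list(initial)
    n = len(samples)
    for _ in range(n_steps):
        # (1) Generator: modify every current sample into a new proposal.
        proposals = [propose(s, rng) for s in samples]
        # (2) Evaluator: fitness score from the log-likelihood of each proposal.
        log_w = [log_likelihood(s) for s in proposals]
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        # (3) Resample N proposals in proportion to their fitness scores.
        samples = rng.choices(proposals, weights=weights, k=n)
    return samples


# Hypothetical stand-ins for the generator and the evaluator.
def toy_propose(s, rng):
    """Replace one randomly chosen character with 'C' or 'O'."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice("CO") + s[i + 1:]


def toy_log_likelihood(s):
    """Reward strings that contain many 'O' characters."""
    return 2.0 * s.count("O")


rng = random.Random(0)
final = smc_design(["CCCC"] * 50, toy_propose, toy_log_likelihood,
                   n_steps=20, rng=rng)
```

Resampling with replacement is what lets high‐scoring proposals occupy several slots of the next population, which is the behaviour the text compares to a genetic algorithm.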
There are three important building blocks in this algorithm: the generator (prior), the evaluator (likelihood), and the descriptor $\phi(S)$. When building models for the evaluator, we encode a chemical structure into a descriptor vector $\phi(S)$ using, for example, a molecular fingerprinting algorithm. Using training instances $\{(Y_k, S_k) \mid k = 1, \ldots, N\}$ on the structure‐property relationships, we then derive a model that describes the materials properties as a function of the descriptor $\phi(S)$, defining $Y = f(\phi(S))$ with the trained model $f$. Although iQSPR‐X allows users to plug in customized functions for each building block, we also provide some commonly used functions internally, and these can be directly called from the package. For the descriptor, all available fingerprint types in RDKit30 and the Python descriptor package Mordred31 are available by default. Users can alternatively use a set of features extracted from pre‐trained neural networks in the XenonPy model library, as described in the next section. For the evaluator, a Gaussian likelihood is given as a choice with any user‐defined model $\mu(\phi(S))$ and the standard deviation $\sigma(\phi(S))$, which represents the uncertainty of predicted properties:

$$P(Y \in U \mid S) = \int_U \prod_{i=1}^{m} \mathcal{N}\big(Y_i \mid \mu_i(\phi(S)),\, \sigma_i^2(\phi(S))\big)\, dY,$$

where $U$ is the target region in the $m$‐dimensional space, and $\mu_i(\phi(S))$ and $\sigma_i(\phi(S))$ are the mean and standard deviation for the $i$th property, respectively, obtained from ML models with input $\phi(S)$. For the generator, the extended n‐gram model developed by Ikebata et al.23 can be used by training it with any chemical structures given in SMILES. The model takes the form $P(S) = P(s_1) \prod_{i=2}^{g} P(s_i \mid s_{i-1}, \ldots, s_1)$. Figure 1 summarizes the computational workflow of iQSPR‐X.
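When the target region is a box (an interval per property) and the properties are modelled as independent Gaussians, the likelihood reduces to a product of differences of normal cumulative distribution functions. The sketch below illustrates that special case only; the function names and the numerical values of the predicted means and standard deviations are assumptions chosen for illustration, and the thresholds reuse the target region ϵ>4.5 and E>5 eV from the design example in this paper.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def log_likelihood_box(mu, sigma, lower, upper):
    """log P(Y in U | S) for a box U = prod_i [lower_i, upper_i],
    assuming independent Gaussian predictions for each property."""
    log_p = 0.0
    for m, s, lo, hi in zip(mu, sigma, lower, upper):
        p = normal_cdf(hi, m, s) - normal_cdf(lo, m, s)
        log_p += math.log(max(p, 1e-300))  # guard against log(0)
    return log_p


# Hypothetical prediction that sits exactly on both thresholds, so each
# property has probability 0.5 of falling inside the target region.
value = log_likelihood_box(
    mu=[4.5, 5.0],
    sigma=[0.2, 0.3],
    lower=[4.5, 5.0],
    upper=[math.inf, math.inf],
)
```

Working with log‐likelihoods keeps the product over properties numerically stable when individual probabilities become very small.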
Figure 1

Computational workflow in iQSPR‐X with three main building blocks that users can flexibly construct: the generator, the evaluator, and the converter that translates an input chemical structure into a descriptor vector.


Generator: Extended n‐gram Model

The role of the generator is to propose new candidate molecules modified from a set of initial molecules. We implemented the extended n‐gram model as an internally available function in iQSPR‐X. This model consists of two components: (1) a table that records the probability of observing a subsequent character given a substring and (2) a function that modifies a given SMILES string based on the stored n‐gram probability table. The table can be trained by supplying a set of SMILES strings sampled from the desired search space. The maximum length of a substring to be considered and stored in the table is controlled by the “order” parameter. In the extended n‐gram model, SMILES strings are internally tokenized into a list of characters. For example, “=O” and “%10” are considered as one character, and a terminal character is automatically added at the end of each string. When proposing a new candidate molecule, the modifier function deletes a random number of characters from the end of the SMILES string, and then elongates the shortened string based on the n‐gram table. Because the representation of a molecule in SMILES is not unique, a reordering of the SMILES string is probabilistically performed to avoid constantly modifying the same part of the chemical structure. In short, the most important parameters in modelling the generator include the probability required to trigger reordering, the range of the number of letters to be deleted, and the order parameter controlling the maximum length of a substring in training and sampling the n‐gram model. Users can adjust these parameters based on the expected molecule size in the targeted search space. Although SMILES is a powerful representation of chemical structures, as exemplified by its ability to handle chirality using the “@” symbol, the non‐uniqueness of SMILES representations may lead to subtle effects in certain usages. 
For example, the aromatic ring in phenol can be represented as “C1=CC=CC=C1” or “c1ccccc1.” We recommend that users not mix different representations of the same molecular structure when training the extended n‐gram model.
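The two components of the generator, a probability table and a modifier that deletes and regrows the tail of a string, can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the tokenizer is a small regular expression, the table backs off to shorter contexts when a substring was never seen, and the ring and branch bookkeeping and the SMILES reordering of the extended n‐gram model are omitted, so the output is not guaranteed to be a valid SMILES string.

```python
import random
import re
from collections import Counter, defaultdict

END = "<end>"  # terminal token appended to every training string
# Simplified tokenizer (an assumption): bracket atoms, two-digit ring
# closures, two-letter halogens, and "=O" are single tokens.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|Cl|Br|=O|.")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles) + [END]


def train_ngram(smiles_list, order):
    """Count next-token frequencies for contexts of 0 to order-1 tokens."""
    table = defaultdict(Counter)
    for smi in smiles_list:
        toks = tokenize(smi)
        for i, tok in enumerate(toks):
            for n in range(order):
                if n > i:
                    break
                table[tuple(toks[i - n:i])][tok] += 1
    return table


def sample_next(table, prefix, order, rng):
    """Sample the next token, backing off from the longest known context."""
    for n in range(min(order - 1, len(prefix)), -1, -1):
        counts = table.get(tuple(prefix[len(prefix) - n:]))
        if counts:
            toks = list(counts)
            return rng.choices(toks, weights=[counts[t] for t in toks], k=1)[0]
    return END


def modify(smiles, table, order, rng, max_delete=3, max_len=40):
    """Delete 1..max_delete tokens from the end, then regrow the string."""
    toks = tokenize(smiles)[:-1]  # drop the terminal token
    if not toks:
        return smiles
    k = rng.randint(1, min(max_delete, len(toks)))
    toks = toks[:len(toks) - k]
    while len(toks) < max_len:
        nxt = sample_next(table, toks, order, rng)
        if nxt == END:
            break
        toks.append(nxt)
    return "".join(toks)


table = train_ngram(["CCO", "CCN", "CC(=O)O"], order=3)
new_smiles = modify("CCO", table, order=3, rng=random.Random(1), max_delete=1)
```

In this sketch `order` and `max_delete` play the roles of the order parameter and the deletion range discussed above; a larger order preserves longer substructures at the cost of a sparser table.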

Evaluator: Likelihood Function

The role of the evaluator is to provide a fitness score for a candidate molecule to estimate how likely the candidate possesses the desired properties. iQSPR‐X allows users to write their own evaluator, which receives a list of molecules, converts them to a set of descriptors using a pre‐set descriptor conversion function, and returns a list of corresponding log‐likelihood values. A Gaussian likelihood function can also be used if users select a desired descriptor and provide an ML model that returns the mean and standard deviation for a given set of descriptors as input.

Pre‐trained Neural Descriptors in XenonPy

One of the most distinctive features of our software is the availability in XenonPy of a comprehensive set of pre‐trained neural features for use as the descriptor ϕ(S). The sampling efficiency of iQSPR‐X is highly influenced by the reliability of the evaluator that predicts the material properties for any given chemical structure. Building such models from scratch is often time‐consuming and requires a large set of training data, which is not available in many applications. XenonPy currently provides 140,000 pre‐trained neural networks for the prediction of physical, chemical, electronic, thermodynamic, and mechanical properties of small organic molecules, polymers, and inorganic crystalline materials, with models for 15, 18, and 12 properties of these material types, respectively. The models are distributed as MXNet32 (R) and/or PyTorch33 (Python) model objects. The distributed API (application programming interface) allows users to query the XenonPy.MDL database. Users can directly use a retrieved model relevant to the target task, if available, or can re‐train a pre‐trained model on the target task using a transfer learning technique as described below. Transfer learning has significant potential to overcome the problem of limited materials property data, as demonstrated in our previous study,26 for various materials science tasks. Other studies have also shown promising applications of transfer learning in materials informatics.18, 34, 35, 36, 37, 38, 39, 40, 41 In this study, we applied a specific type of transfer learning using pre‐trained neural networks. For a target property, a neural network pre‐trained on proxy properties is available in the library, where the source datasets are sufficiently large. If the two properties are physically or chemically interrelated, the pre‐trained models can be expected to autonomously acquire common features relevant to the proxy properties. 
The features learned by solving the related tasks are partially transferable to the descriptor ϕ(S) in a model constructed for the target task. In general, earlier or shallower layers in a neural network tend to acquire general features to form the basis of the material descriptions, and only the last one or two layers identify specific features for the prediction of a source property. In iQSPR‐X, we freeze the shallower layers for use as a feature extractor. A subnetwork ϕ(S) of such a pre‐trained model can be reused in the supervised learning of the target property. To simplify the implementation of the repetitious tasks of neural descriptor extraction, XenonPy provides users with an internal function to extract values from any hidden layer in a pre‐trained neural network. With its large library of pre‐trained models and wide range of built‐in descriptors, XenonPy provides a strong foundation for flexibly arranging the necessary building blocks of the iQSPR algorithm.

Results and Discussion

Data

We used data from PG to illustrate the use of iQSPR‐X based on an example motivated from a previous study on polymer design.28 PG is an open database for polymeric properties that currently contains 854 polymers composed of nine types of atoms (H, C, O, N, S, F, Cl, Br, and I) with experimental data for three material properties (glass transition temperature, density, and solubility parameter) and computational data from density functional theory (DFT) for four material properties (bandgap (E), refractive index, dielectric constant (ϵ), and atomization energy). Using a subset of the data (4‐block polymers composed of CH2, NH, CO, C6H4, C4H2S, CS, and O), Mannodi‐Kanakkithodi et al.28 designed 6‐ to 12‐block polymers with high ϵ for insulator applications using ML models and a genetic algorithm. They were specifically interested in polymers with higher ϵ and E, and this goal was adopted in our example. The given data of the chemical structures S and their materials properties were used to train the generator and the evaluator. Here, we considered S to be the SMILES strings of the repeating polymer units. The connection points, i. e., the head and tail of a monomer, were denoted as “*”. In PG, the lowest‐energy crystal structures of the polymers were used for the DFT calculation. For each polymer, E was computed using a hybrid Heyd‐Scuseria‐Ernzerhof (HSE06) electronic exchange‐correlation functional, and ϵ, which is the sum of the electronic and ionic dielectric constants, was computed using density functional perturbation theory (DFPT). Mannodi‐Kanakkithodi et al.28 have detailed this computational procedure. As shown in Figure 2a, we observed an inverse relation between ϵ and E. Polymers containing thiophene (C4H2S) tended to reach high ϵ, but generally had low E. In contrast, polymers containing fluorine (F) atoms tended to reach high E, but generally had low ϵ. 
However, in contrast to the enrichment offered by either C4H2S or F atoms, polymers exhibiting high ϵ and high E tended to be composed of CH2, NH, CO, C6H4 and O.28 The design objective was to solve this nontrivial trade‐off problem.
Figure 2

Summary of observed data in PG. (a) Joint distribution of ϵ and E. Red dots denote all polymers containing F atoms, green dots denote those having C4H2S as fragments, and blue dots denote all other polymers. (b) Histogram of the lengths of SMILES strings in PG.


Training the Generator

In this study, we considered two ways to train the extended n‐gram model. First, we used all 854 polymers in PG as a training set, which covered a wide variety of polymers. Second, we focused only on specific types of chemical structures that shared some common features, taken from other data sources. In practice, users may often be interested in designing a specific class of molecules. Here, we explored F‐containing polymers with high ϵ and E. In particular, we focused on a training set containing the fragment “C(F)(F)N,” which was taken from PubChem.2, 42 Because of the extremely high diversity of chemical structures in PubChem, we used multiple steps to extract training molecules: (1) we screened molecules in PubChem continuously until 5,000 molecules having the desired fragment were found (screening a total of over 36,940,000 molecules); (2) we reduced the number of molecules to 3,860 that consisted only of C, O, N, F, and/or S atoms; (3) we finally extracted 2,485 molecules by filtering out those that had more than six F atoms or included more than one molecule in a single SMILES string (SMILES string with “.”). The final training set was formed by the union of these selected PubChem molecules and the set of PG polymers. The order parameter controlling the length of a substring for training and sampling the n‐gram tables was set to be 20 for both cases, after examining the distribution of the SMILES lengths of the molecules in PG (see Figure 2b). In the construction of a training set, duplicates of each SMILES string were generated by performing random reordering of the string for at most 15 times. This step is important to avoid the occurrence of unseen substring patterns during the generation of new molecules, considering that we set the reorder probability to be 0.5 during the molecular generation. Figure 3 illustrates how the two different generators modified molecules step‐by‐step starting from the same initial chemical structures. 
The generator trained on the PubChem molecules showed a stronger tendency to include F‐containing fragments during the modification process.
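The screening steps used to build the focused training set can be approximated with plain string operations. The following is a rough sketch under stated assumptions rather than the procedure actually used: the fragment test is a substring match on the SMILES text instead of a substructure search, atoms are identified with a simple regular expression, and the function names are hypothetical.

```python
import re

ALLOWED = {"C", "O", "N", "F", "S"}
# Bracket atoms, two-letter halogens, then one-letter organic-subset atoms.
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
BRACKET_SYMBOL_RE = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def atom_symbols(smiles):
    """Return the element symbol of every heavy atom found in the string."""
    symbols = []
    for tok in ATOM_RE.findall(smiles):
        if tok.startswith("["):
            m = BRACKET_SYMBOL_RE.match(tok)
            symbols.append(m.group(1).capitalize() if m else "?")
        else:
            symbols.append(tok.capitalize())
    return symbols


def keep(smiles, fragment="C(F)(F)N", max_f=6):
    """Apply the filters: single molecule, contains the fragment (as a
    substring), only C/O/N/F/S atoms, and at most max_f fluorine atoms."""
    if "." in smiles:
        return False
    if fragment not in smiles:
        return False
    symbols = atom_symbols(smiles)
    if not set(symbols) <= ALLOWED:
        return False
    return symbols.count("F") <= max_f


candidates = [
    "CC(F)(F)NC",                # passes every filter
    "CC(F)(F)NC.Cl",             # more than one molecule in the string
    "ClCC(F)(F)N",               # contains a disallowed element
    "FC(F)(F)C(F)(F)C(F)(F)N",   # seven F atoms
]
selected = [smi for smi in candidates if keep(smi)]
```

A substring test misses equivalent writings of the same fragment, so a cheminformatics toolkit with substructure matching would be the more reliable choice in practice.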
Figure 3

Modification of molecules using extended n‐gram models trained with different datasets. The same five chemical structures in PG were successively modified five times according to generators that were trained on 854 polymers from PG (top) and with the 854 polymers from PG and 2,485 F‐containing molecules from PubChem (bottom).


Training the Evaluator

We conducted a series of experiments to obtain a model for the Gaussian likelihood function. As a descriptor, we used pre‐trained neural network models in XenonPy that were trained with 10 different types of fingerprints available in RDKit: atom pair and topological torsion fingerprints, Morgan fingerprints (both feature‐based and not feature‐based), basic fingerprints in RDKit, and five more that were obtained by adding the MACCS keys to the five listed fingerprints. With each of the 10 fingerprints, 100 randomly constructed neural networks, each having six fully connected layers, were trained with either the ϵ or E datasets; the number of epochs was 2,000 and the dropout rate was 0.1. The dataset was randomly separated into training and validation sets at a ratio of 8 : 2. Figure 4 shows a comparison of the validation mean absolute errors (MAEs) across the different fingerprint descriptors. The atom pair fingerprints with the MACCS keys showed consistently high performance on both ϵ and E, and were therefore selected for use in the Bayesian molecular design.
Figure 4

Box‐plots of the MAEs across different fingerprint descriptors evaluated on the validation datasets of either ϵ or E. APFP denotes the atom pair fingerprints, ECFP denotes the non‐feature‐based radius‐3 Morgan fingerprints, FCFP denotes the feature‐based radius‐3 Morgan fingerprints, TTFP denotes the topological torsion fingerprints, RDKit denotes the basic fingerprints in RDKit, and +M denotes the addition of the MACCS keys.

The default model in the Gaussian likelihood function is set to be a Bayesian linear regression model. Users can directly train the model with their input data using a one‐line Python script. In this paper, we considered three more approaches to constructing models that return μ and σ: (1) bagging to calculate a bootstrap variance σ for any deterministic model, (2) random forests combined with a jackknife method,43 and (3) Bayesian linear models with neural descriptors extracted from pre‐trained models.26 In our example, we tested these different methods to select the best prediction model. Five‐fold cross validation (CV) was performed on both ϵ and E and the model with the best prediction performance was selected. For the bagging approach, the gradient boosting method in scikit‐learn44 was used and the training data in each fold of the 5‐fold CV were further divided into 10 non‐overlapping bags. Each bag produced a gradient boosting regressor under the default setting in scikit‐learn. The mean and standard deviation of the predicted values from the 10 trained models were taken as μ and σ, respectively. For the random forest approach, the forestci package was used along with the random forest method in scikit‐learn to calculate μ and σ. The number of trees was set to be 500 and the “max_features” option was selected to be “sqrt.” For the Bayesian linear regression with neural descriptors, we began by selecting pre‐trained models from the model library in XenonPy for each of the two target properties. 
The 100 pre‐trained neural networks of ϵ and E were modified such that the last hidden layers were connected to Bayesian linear regressors, and the prediction performances of the models were then evaluated by the 10‐fold CV applied to the training data within each fold of the 5‐fold CV. Each of the models of ϵ and E that achieved the overall lowest MAE was selected, and their last hidden layers were concatenated to form a new neural descriptor. This descriptor was used to replace the originally selected descriptor in the default Gaussian likelihood function. Finally, this evaluator was trained with the full training data within each fold of the five‐fold CV. Figure 5 shows the performance of each model on the five‐fold CV for the ϵ and E datasets. The bagging approach with the gradient boosting model achieved the best overall performance and was therefore selected for the inverse design calculation.
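The bagging approach to obtaining μ and σ can be sketched with the standard library alone. This is an illustration under stated assumptions: a one‐dimensional least‐squares line stands in for the gradient boosting regressor, the data are synthetic, the function names are hypothetical, and σ is taken as the sample standard deviation across the bag models.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line y = a + b*x, a stand-in for any deterministic
    regressor (the paper uses gradient boosting instead)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_bags(xs, ys, n_bags, rng):
    """Split the data into n_bags non-overlapping bags, one model per bag."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    bags = [idx[i::n_bags] for i in range(n_bags)]  # disjoint index sets
    return [fit_line([xs[i] for i in bag], [ys[i] for i in bag])
            for bag in bags]


def predict_mu_sigma(models, x):
    """Mean and standard deviation of the per-bag predictions at x."""
    preds = [m(x) for m in models]
    return statistics.fmean(preds), statistics.stdev(preds)


# Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative only).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

models = fit_bags(xs, ys, n_bags=10, rng=rng)
mu, sigma = predict_mu_sigma(models, 5.0)
```

The spread across bag models reflects only the variability due to the training subsets, which is one reason a small constant may be added to the variance in practice.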
Figure 5

Prediction performance of different models on the five‐fold CV for the ϵ and E datasets. GB denotes bagging with gradient boosting, RF denotes random forests with jackknife‐based uncertainty quantification, and NN denotes pre‐trained neural networks with their last hidden layers connected to Bayesian linear regressors.


Design Results

In this example, we set the target property region to be ϵ>4.5 and E>5 eV. Three rounds of iQSPR‐X were executed using different setups to compare the effect of various components of the algorithm. The first run used the generator trained with molecules in PG, and in the inverse design calculation, 100 initial samples were randomly selected from the 854 molecules in PG. The second run used the same generator, but the 100 initial samples were randomly selected from a subset of the molecules in PG that had a relatively low ϵ or E (ϵ<4 or E<4.5 eV). The third run used the same initial samples as in the first run, but the generator was trained with the PG and PubChem molecules, as detailed in the previous section. Other components of iQSPR‐X were set to be the same for all three runs. The n‐gram order parameter was set to be 20, with the range of the number of letters to be deleted set to be 1–10. The reordering probability was set to be 0.5. The descriptor was selected to be a concatenation of the atom pair fingerprints and the MACCS keys in RDKit. The evaluator was selected to be the Gaussian likelihood with 10 “bags” of gradient boosting models trained on 10‐fold CV of the full PG datasets for ϵ and E. The mean function μ was given by the mean of the predictions from the 10 models. For practical purposes, the variance function $\sigma^2$ was composed of the bootstrap variance plus a small pre‐set constant (0.04 for ϵ and 0.09 for E). To avoid trapping in a local region of the search space, an annealing schedule was applied: the likelihood scores were raised to a sequence of powers from 0 to 1, which corresponds to a sequential transformation of a distribution from a uniform distribution to the actual posterior distribution. From empirical evidence, a slow cooling schedule is recommended. 
We started with 20 steps of powers linearly increasing from 0 to 0.2, 10 additional steps linearly increasing from 0.2 to 0.4, and another 10 steps linearly increasing from 0.4 to 1. Finally, we performed another 60 steps with the power fixed at 1, and these samples were recorded as candidate molecules. Movies S1, S2, and S3 in the supplementary materials demonstrate how the candidate molecules proposed in each step of the sequential Monte Carlo approach the target region. In the first run, one of the initial samples was observed to reach the target region, and a number of the samples continued to explore structures similar to that molecule, whereas other samples pursued alternative possibilities. In the end, the best proposed candidate molecules converged to molecules similar to those found in the previous study28 (see Figure 6).
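The annealing schedule described above can be written out explicitly. The sketch below is an illustration with one stated assumption: each linear ramp excludes its starting value and includes its end value, so that boundary powers are not repeated; the text does not specify the endpoint convention. The function names are hypothetical.

```python
def linear_ramp(start, stop, n):
    """n evenly spaced values that end at stop and exclude start."""
    return [start + (stop - start) * (k + 1) / n for k in range(n)]


def annealing_schedule():
    """Powers applied to the likelihood at each of the 100 steps."""
    betas = linear_ramp(0.0, 0.2, 20)    # 20 steps from 0 to 0.2
    betas += linear_ramp(0.2, 0.4, 10)   # 10 steps from 0.2 to 0.4
    betas += linear_ramp(0.4, 1.0, 10)   # 10 steps from 0.4 to 1
    betas += [1.0] * 60                  # 60 recorded steps at power 1
    return betas


def tempered_log_weights(log_likelihoods, beta):
    """Raising a likelihood to the power beta scales its logarithm by beta."""
    return [beta * ll for ll in log_likelihoods]


betas = annealing_schedule()
early = tempered_log_weights([-10.0, -1.0], betas[0])
late = tempered_log_weights([-10.0, -1.0], betas[-1])
```

At small powers the tempered weights are nearly equal, so resampling keeps a diverse population; at power 1 the full likelihood contrast between candidates is restored.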
Figure 6

Comparison of the best candidate molecules from a previous study28 and the top 25 candidate molecules generated from iQSPR‐X. The optimal combinations of 8 to 12 building blocks that were proposed in the previous study are shown as a comparison.

In contrast, with a significantly different set of initial samples, the second run struggled to converge to candidate molecules similar to the first run. Instead, it became trapped at molecules with complex ring structures (see Figure 7). For an intractable trade‐off problem such as this example, a small finite set of samples cannot support full exploration of the search space. Increasing the number of samples is an intuitive yet computationally intensive solution. An alternative is to adjust parameters in the iQSPR‐X algorithm, such as the initial samples, the n‐gram order, the number of letters to be deleted, and so on.
Figure 7

Comparison of the top 25 candidate molecules generated from iQSPR‐X with initial samples randomly drawn from polymers in PG with ϵ<4 or E<4.5 eV and the top 25 candidate molecules generated using an extended n‐gram model trained with samples from both PG and PubChem.

iQSPR‐X can also be used to search intensively for a specific molecular subspace. In the third run, the generator showed a clear tendency to attach F‐containing fragments to chemical structures after being trained with thousands more samples from PubChem. As a result, we observed the frequent appearance of molecules with relatively higher E during the sequential Monte Carlo iterations. The best candidate molecules were composed of the F‐containing fragments with different combinations of the CH2, NH, and CO blocks (see Figure 7).

Conclusions

iQSPR‐X is an ML engine for generating a target‐specific molecular library. XenonPy provides an all‐in‐one ML‐based materials design platform in which descriptor calculations, property prediction models for high‐throughput screening, molecular library generators, and inverse design algorithms are all available as independent modules. Users can either use these modules as pre‐existing functions in XenonPy or build their own to suit their needs, in conjunction with other major ML and materials informatics Python packages. Moreover, transfer learning adds further capability and convenience to the ML models. Because the iQSPR algorithm is implemented within the XenonPy platform, users can draw on a wide range of pre‐existing functions and models that greatly simplify the process of setting up the Bayesian inverse design algorithm. Detailed tutorials for each component are available on the XenonPy website.25

In this paper, we demonstrated some basic functionalities of iQSPR‐X by applying it to the task of designing polymers exhibiting high ϵ and E. We showed how changes to the setup of iQSPR‐X, such as the initial sample sets and the generator, can influence the outcome of the computational workflow. One of our runs identified chemical structures that were similar to the best candidate molecules proposed in the original study. Furthermore, we demonstrated that by including a focused set of molecules in the training of the generator, we were able to guide the algorithm to search a particular subspace of the large molecular space.

Although users can quickly start the inverse design process using the default functions and setups, the true potential of the algorithm can be realized by building customized modules for a variety of tasks in materials science. The XenonPy project aims to gather contributions from users in diverse fields of materials and data science. Contributors are welcome to share and implement their own code in XenonPy as off‐the‐shelf modules.

Supplementary Materials

Movie S1‐PG_basic.avi: This video shows the evolution of the material property values of the proposed candidate molecules in each step of the XenonPy‐iQSPR iterations for the first run of our example. Blue dots denote the original data from PG, and red dots denote the proposed candidate molecules, with the radius of the dots proportional to the sum of the predicted variances of ϵ and E. Beta refers to the power value used in the annealing schedule.

Movie S2‐PG_lowVal.avi: This video is the same type as Movie S1, for the second run of our example.

Movie S3‐PG_Pubchem.avi: This video is the same type as Movie S1, for the third run of our example.

Conflict of Interest

None declared.