
Machine-learning approach to the design of OSDAs for zeolite beta.

Frits Daeyaert¹, Fengdan Ye², Michael W Deem^3,2.

Abstract

We report a machine-learning strategy for the design of organic structure directing agents (OSDAs) for zeolite beta. We use machine learning to replace a computationally expensive molecular dynamics evaluation of the stabilization energy of the OSDA inside zeolite beta with a neural network prediction. We train the neural network on 4,781 candidate OSDAs, spanning a range of stabilization energies. We find that the stabilization energies predicted by the neural network are highly correlated with the molecular dynamics computations. We further find that the evolutionary design algorithm samples the space of chemically feasible OSDAs thoroughly. In total, we find 469 OSDAs with verified stabilization energies below −17 kJ/(mol Si), comparable to or better than known OSDAs for zeolite beta, and greatly expanding our previous list of 152 such predicted OSDAs. We expect that these OSDAs will lead to syntheses of zeolite beta.


Keywords: OSDA; machine learning; neural network; zeolite beta

Year: 2019 PMID: 30733290 PMCID: PMC6397530 DOI: 10.1073/pnas.1818763116

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

Zeolites are crystalline nanoporous aluminosilicate minerals that have wide use in absorption, separation, and catalysis (1). Presently, a total of 245 zeolite structures, both natural and man-made and differing in structure and pore size, have been identified (2). Zeolite beta is a large 3D 12-ring channel system (3), and it is one of the 17 zeolites of commercial interest (4). Its industrial uses include the alkylation of benzene (5) and the separation of organics from water (6). Synthetic zeolites such as zeolite beta are synthesized by hydrothermal synthesis from suitable amorphous aluminosilicate precursors (7). To direct the synthesis toward a particular zeolite structure, organic bases that act as templates, termed organic structure directing agents (OSDAs), are added to the reaction medium (8, 9). While template-free syntheses of zeolite beta have been reported (10), the main synthetic route uses tetraethyl amine as the OSDA (11). Syntheses of zeolite beta with roughly 50–100 other OSDAs have been reported. Zeolite beta consists of three polymorphs: polymorph A (BEA), polymorph B (BEB), and polymorph C (BEC) (3). At present, no synthetic route to pure BEA has been obtained. Existing formulations of zeolite beta lead to an intergrown hybrid structure of BEA and BEB (3). Uniformly structured zeolites can lead to smaller, cleaner, and more efficient catalytic processes (12). Moreover, the BEA polymorph is chiral, and an enantiomerically enriched form of pure BEA would be of great interest for enantiospecific catalysis and separation (13). Ongoing research in our group is therefore directed toward the design of suitable OSDAs leading to both pure and enantiomerically enriched BEA. Selectivity toward a given zeolite is promoted by a structure directing agent and depends to a large degree on favorable nonbonding interactions governed by packing in the zeolite framework (14).
In the past, we have successfully built upon this observation to use structure-based molecular design to obtain OSDAs for several zeolites (15–17), including a chiral OSDA leading to an enantiomerically enriched zeolite STW (18). The methods we have applied in these efforts include algorithms both for de novo design (19, 20) and virtual combinatorial chemistry (21), as well as virtual screening of selected sets of available compounds. At the heart of these algorithms is a computational procedure to predict the suitability of a molecule to serve as an OSDA for a given zeolite (19). The scoring function calculates a series of molecular properties of increasing computational complexity, with the least computationally intensive properties being used as filters (22). The most computationally intensive calculation consists of a molecular dynamics evaluation of the stabilization energy of a putative OSDA in the target zeolite and requires on the order of 3 h of CPU time when the target is BEA. A de novo design or virtual combinatorial chemistry experiment typically requires on the order of 200,000 calls of the scoring function, of which around 10% reach the stage of the molecular dynamics run. In view of our efforts to design OSDAs for zeolite BEA, it is of great interest to us to speed up the evaluation of this scoring function. In our research so far we have performed a large number of calculations, and in this paper we describe our efforts to effectively tap this database of information using a data-driven approach. Machine-learning (ML) algorithms that synthesize existing data to produce predictive models are seeing a revival in molecular and materials science thanks to the growing availability of massive amounts of data (23). Examples include algorithms for quantum chemistry (24), retrosynthetic chemistry (25), and de novo design (26, 27). Once properly trained, an ML algorithm is very fast to produce an output from new input.
Therefore, given the large number of predicted stabilization energies of putative BEA OSDAs that we have collected thus far, we have trained an ML algorithm to build a quantitative structure–property relationship to accurately and efficiently predict OSDA-BEA stabilization energies. That is, we trained neural networks to predict OSDA stabilization energies based on their molecular structures. We have used 3D-MoRSE (Molecule Representation of Structures based on Electron diffraction) descriptors for the OSDA molecules (28, 29). These descriptors are input to the neural network as described in Methods. We used this ML approach to replace the molecular dynamics evaluation of the stabilization energy with a trained neural network. We further used this approach to produce putative BEA OSDAs.
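To make the descriptor concrete, the following is a minimal sketch of a 3D-MoRSE-style intensity calculation, using the standard form I(s) = Σ_{i<j} A_i A_j sin(s r_ij)/(s r_ij), where r_ij is the distance between atoms i and j. The function name `morse_intensities`, the unit atomic weights A_i = 1, and the toy geometry are assumptions for illustration; the atomic property weighting actually used for the OSDAs is not stated in this text.

```python
import math
from itertools import combinations

def morse_intensities(coords, s_max, delta_s, weights=None):
    """Return 3D-MoRSE-style intensities I(s) for s = 0, delta_s, ..., s_max.

    coords  : list of (x, y, z) atomic positions in angstroms
    weights : optional per-atom property values A_i (assumed 1.0 when omitted)
    """
    n = len(coords)
    if weights is None:
        weights = [1.0] * n
    # Precompute A_i * A_j and the distance r_ij for each unordered atom pair.
    pairs = [(weights[i] * weights[j], math.dist(coords[i], coords[j]))
             for i, j in combinations(range(n), 2)]
    n_steps = int(round(s_max / delta_s))
    intensities = []
    for k in range(n_steps + 1):
        s = k * delta_s
        total = 0.0
        for a, r in pairs:
            x = s * r
            # sin(x)/x tends to 1 as x -> 0, so handle s = 0 explicitly.
            total += a * (math.sin(x) / x if x > 1e-12 else 1.0)
        intensities.append(total)
    return intensities

# Toy three-atom geometry (hypothetical coordinates, not an OSDA from the paper).
toy = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
descriptor = morse_intensities(toy, s_max=8.0, delta_s=0.5)
print(len(descriptor), descriptor[0])
```

Each entry of `descriptor` corresponds to one value of s, so the length of the list is the number of input nodes the network would need for that choice of maximum scattering parameter and step size.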

Results and Discussion

Training the Models.

We used ML to relate the 3D structure of BEA OSDAs to their stabilization energies. A neural network was trained on descriptors of the molecular structure of the OSDAs to predict stabilization energies. These descriptors encode the 3D molecular structure by sampling a calculated diffraction pattern. Each value of the scattering parameter, s, produces one intensity, I, which is one descriptor. Note that these descriptors are the input to the neural network, not the output. To determine the best-performing set of hyperparameters for the neural network, we tested networks with various values of the maximum scattering parameter (s_max), its step size in Fourier space (Δs), and the number of hidden nodes (h) in the network. Values for s_max were 8, 16, 24, and 32 Å⁻¹, and values for Δs were 0.125, 0.250, 0.500, and 1.000 Å⁻¹. For each combination of these settings, we randomly chose 80% of the molecules as a training/test set and set 20% apart for validation. The training/test sets were used to train neural networks with an increasing number of hidden nodes. The total number of weights in the model depends on the number of input nodes and the number of hidden nodes, the former being determined by s_max and Δs. The highest number of hidden nodes was either 10 or the largest number for which the total number of weights was less than the number of molecules in the training set (29), whichever was smaller. The best model was chosen as the one for which the mean root-mean-square error (RMSE) on the test set, mean RMSE_test, was the lowest. This criterion is adopted to avoid overfitting to the training set, as discussed in the supplementary material.
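The relationship between the hyperparameters and the model size can be sketched as follows. The counting rules are assumptions: intensities are taken at s = 0, Δs, ..., s_max, and each hidden node and the single output node carry one bias weight. These assumptions reproduce the "Number of intensities" and "Total number of weights" columns of Table 1, but they are inferred from those numbers rather than stated in the text. The function names and the training-set size in the example are hypothetical.

```python
def n_intensities(s_max, delta_s):
    """Number of sampled intensities for s = 0, delta_s, ..., s_max."""
    return int(round(s_max / delta_s)) + 1

def n_weights(n_inputs, n_hidden, n_outputs=1):
    """Weights of a one-hidden-layer network, counting one bias per node."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

def max_hidden_nodes(n_inputs, n_training, cap=10):
    """Largest h <= cap whose weight count stays below the training-set size."""
    h = 0
    while h < cap and n_weights(n_inputs, h + 1) < n_training:
        h += 1
    return h

# Model 1a in Table 1: s_max = 24, delta_s = 0.500, h = 5.
inputs_1a = n_intensities(24, 0.500)
print(inputs_1a, n_weights(inputs_1a, 5))

# Hypothetical training-set size of 500 molecules with 97 input intensities.
print(max_hidden_nodes(97, 500))
```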
We trained four variations of the network: model 1, a network weighting all MD energies equally; model 2, a network weighting the MD energies below −15 kJ/(mol Si) more heavily; model 3, a network without weighting trained on only the set of charged OSDAs; and model 4, a network without weighting in which the output layer uses a linear activation function. A sigmoid activation function was used on the output layer in models 1–3. The full results of the exploration of the hyperparameter space are listed in the supplementary material. The top two sets of hyperparameters for each model, i.e., those for which the mean RMSE on the test set, mean RMSE_test, was smallest, are summarized in Table 1. For these best models, the RMSE for the OSDAs in the validation set (RMSE_validation) was calculated and is listed in column 10 of Table 1. We also calculated the RMSE for the total set of OSDAs in the training plus test set (RMSE_training+test), listed in column 9 of Table 1. The validation set did not influence the hyperparameter selection procedure. The fact that RMSE_training+test and RMSE_validation are very similar indicates that the neural networks are well trained and not overfit to the training and test sets. For reference, the tetraethyl amine OSDA has a stabilization energy of −10 kJ/(mol Si) in zeolite beta A (30). Fig. 1 shows the scatter plots of the MD-calculated and ML-predicted stabilization energies for the OSDAs in the validation set for each of the eight models.
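One way to emphasize the low-energy region, as in model 2, is a weighted squared-error cost. The sketch below is illustrative only: the weight factor of 2.0 and the normalization by the sum of weights are assumptions, since the weighting scheme actually used is not given in this text.

```python
def weighted_mse(predicted, target, threshold=-15.0, low_energy_weight=2.0):
    """Mean squared error with extra weight on samples whose MD energy is below threshold.

    low_energy_weight is a hypothetical value chosen for illustration.
    """
    weights = [low_energy_weight if t < threshold else 1.0 for t in target]
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, predicted, target))
    return total / sum(weights)

md_energies = [-17.0, -10.0]   # kJ/(mol Si), toy values
ml_energies = [-15.0, -10.0]

plain = weighted_mse(ml_energies, md_energies, low_energy_weight=1.0)
weighted = weighted_mse(ml_energies, md_energies)
print(plain, weighted)
```

With the extra weight, the error on the −17 kJ/(mol Si) sample contributes more to the cost, which pushes training toward better accuracy in the region where favorable OSDAs are expected.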

Table 1.

Top two sets of hyperparameters selected from models 1–4

Model	s_max	Δs	Number of intensities	h	Total number of weights	Mean RMSE_training (SD)	Mean RMSE_test (SD)	RMSE_training+test	RMSE_validation
1a	24	0.500	49	5	256	1.52 (0.03)	1.79 (0.07)	1.45	1.41
1b	8	0.500	17	8	153	1.59 (0.02)	1.75 (0.06)	1.52	1.47
2a	24	0.500	49	4	205	1.66 (0.04)	1.83 (0.08)	1.50	1.65
2b	8	0.500	17	8	153	1.68 (0.02)	1.84 (0.07)	1.59	1.59
3a	8	0.500	17	2	39	1.61 (0.07)	1.68 (0.14)	1.50	1.64
3b	32	0.500	65	1	68	1.55 (0.04)	1.75 (0.13)	1.55	1.68
4a	32	0.500	65	5	336	1.90 (0.05)	1.92 (0.07)	1.87	1.87
4b	24	0.250	97	2	199	1.91 (0.05)	1.95 (0.09)	1.88	1.89

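Mean RMSE_training and mean RMSE_test are averages over individual neural networks; the values in parentheses are the corresponding SDs. RMSE_training+test and RMSE_validation are computed from energies predicted by an ensemble of 30 networks.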

Fig. 1.

Scatter plots of MD- versus ML-predicted stabilization energies for the OSDAs in the validation set for the eight models (A–H). Models 1a and 1b were trained on all compounds without weighting. Models 2a and 2b were trained on all compounds with weighting. Compared with models 1a and 1b, models 2a and 2b give better predictions for OSDAs with MD-calculated energy below −15 kJ/(mol Si). Models 3a and 3b were trained on charged compounds only, without weighting. No charged OSDAs have an MD-calculated energy below −17.5 kJ/(mol Si), which limited the ability of the neural network to find favorable OSDAs. Models 4a and 4b used a linear activation function in the output node.

Overall, the neural networks were successful at predicting energies of the OSDAs in the validation set. Models 1a and 1b have the lowest RMS error for the validation set, RMSE_validation. By introducing weighting in the cost function, models 2a and 2b improve the prediction in the low-energy region below −15 kJ/(mol Si), the region in which OSDAs for BEA are expected to be effective. While this increases the RMS error for the validation sets, a modest increase in predictability in this region can be observed in Fig. 1. Models 3a and 3b performed equally well (Fig. 1). However, no charged OSDAs had MD-calculated energies below −17.5 kJ/(mol Si), which limited the ability of the neural network to find favorable OSDAs. With sigmoid activation, the predicted energies are always contained in the range of the energies from the training and test set. While this keeps the neural network from erroneously extrapolating to molecules not in this range, it slightly distorts the computed versus predicted relations in Fig. 1.
This distortion is reduced with linear activation, models 4a and 4b, as shown in Fig. 1. While the RMSE for model 4 is slightly higher than in the other models, the difference between mean RMSE_training and mean RMSE_test is significantly lower, indicating less overfitting to the training set for this model. Table 1 compares predictions from single neural networks (mean RMSE_training and mean RMSE_test) with predictions from averages of 30 neural networks (RMSE_training+test and RMSE_validation). The RMSE values of the complete training plus test sets and the validation sets are generally lower than those of the training or test sets. This illustrates the capability of ensemble fitting to improve the models (31): The RMSEs of the training and test sets are averages taken over multiple single models, while the RMSEs of the complete training plus test sets and the validation sets are computed from energies predicted by an ensemble of 30 models. However, we also observed that even a single neural network is able to capture most of the predictability of the 30-neural-network ensemble. This result indicates that the predictions of the neural networks are stable to convergence issues and choice of training set.
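The effect of ensemble averaging on the RMSE can be illustrated with synthetic data. In this sketch the 30 "networks" are stand-ins, namely the true value plus independent Gaussian noise, which is an assumption that exaggerates the benefit relative to real networks whose errors are correlated. What carries over is the inequality itself: the RMSE of the averaged prediction can never exceed the average of the individual RMSEs.

```python
import math
import random

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

rng = random.Random(0)
# Synthetic "MD" energies in kJ/(mol Si); not data from the paper.
target = [rng.uniform(-20.0, 0.0) for _ in range(200)]
# Stand-ins for 30 trained networks: truth plus independent noise.
members = [[t + rng.gauss(0.0, 1.5) for t in target] for _ in range(30)]

# Average of the single-model RMSEs (analogous to the mean RMSE columns).
mean_single_rmse = sum(rmse(m, target) for m in members) / len(members)

# RMSE of the ensemble-averaged prediction (analogous to the ensemble columns).
ensemble_pred = [sum(col) / len(col) for col in zip(*members)]
ensemble_rmse = rmse(ensemble_pred, target)

print(round(mean_single_rmse, 3), round(ensemble_rmse, 3))
```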

In Silico Materials Design.

We used all eight models in Table 1 in a de novo evolutionary design algorithm program. For each model in Table 1, a total of 1,000,000 trial molecules were generated by the program and scored using the score vector. Fig. 2 shows the top five predicted OSDAs for model 1b. In addition, the synthesis pathway for the top-scoring molecule is shown in Fig. 2. Table 2 lists, for each run, the best OSDA found, with its ML-predicted and MD-calculated stabilization energy, the number of compounds with an ML-predicted stabilization energy below −15 kJ/(mol Si), the total number of molecules for which the stabilization energy was actually predicted, and the total number of unique molecules generated. The total number of unique molecules generated during a run is lower than 1,000,000, because molecules may appear, disappear, and then reappear in the population during the course of the genetic algorithm (22). In each run, a large number (∼1,000) of molecules were predicted to have stabilization energies below the threshold of −15 kJ/(mol Si), column 3 in Table 2. The ML and MD energies of the best-scoring molecules obtained with models 1b, 2a, 2b, and 4b are within 1 kJ/(mol Si) of one another; the difference is around 2 kJ/(mol Si) for models 1a and 4a. The best-scoring molecules found with models 3a and 3b are identical. Their ML-predicted binding energies are slightly different because the two models have different hyperparameters. The MD-calculated energies differ because of the stochastic nature of the MD procedure. The gaps between the ML and MD energies in models 3a and 3b are larger than for the other models, reflecting the lower prediction precision for these models (see Table 4). The ML method vastly accelerates the energy calculation process. An ML prediction of the stabilization energy of a putative OSDA requires about 28 s of CPU time, whereas an MD energy calculation requires 160 min of CPU time on average.
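The timing figures above translate into a rough cost comparison. The per-evaluation times (28 s for ML, 160 min for MD) are taken from the text, as is the typical workload from the Introduction of about 200,000 scoring calls with roughly 10% reaching the energy stage. Treating every energy-stage call as costing exactly the average time is a simplifying assumption.

```python
ML_SECONDS_PER_CALL = 28     # from the text
MD_MINUTES_PER_CALL = 160    # from the text (average)

# Speedup of one ML prediction relative to one MD evaluation.
speedup = MD_MINUTES_PER_CALL * 60 / ML_SECONDS_PER_CALL

# Typical design experiment described in the Introduction.
scoring_calls = 200_000
energy_stage_calls = scoring_calls * 10 // 100   # about 10% reach the energy stage

md_cpu_hours = energy_stage_calls * MD_MINUTES_PER_CALL / 60
ml_cpu_hours = energy_stage_calls * ML_SECONDS_PER_CALL / 3600

print(f"speedup per evaluation: about {speedup:.0f}x")
print(f"MD: about {md_cpu_hours:.0f} CPU hours, ML: about {ml_cpu_hours:.0f} CPU hours")
```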

Fig. 2.

Results for OSDA design using model 1b. (A) The top five molecules produced. The molecule scores in this figure are the ML-determined binding energies in kJ/(mol Si). (B) Proposed synthesis route to the first molecule in the output shown in A. The outcome of the synthesis route is listed together with the acronym of the reaction used (ALKYLATENP), as well as the structures and catalog names of the proposed reagents.

Table 2.

Best OSDA found with its ML-predicted and MD-calculated stabilization energy, number of compounds with an ML-predicted stabilization energy below −15 kJ/(mol Si), the total number of molecules for which the stabilization energy was predicted, and the total number of unique molecules generated in each run

Table 4.

The number of compounds with ML-predicted energies below −15 kJ/(mol Si), the number of compounds with ML-predicted energies between −15 and −14 kJ/(mol Si) and among which the number of compounds with MD-calculated energies below −17 kJ/(mol Si), the number of TP, and the prediction precision for the eight in silico materials design runs

Model	E_ML ≤ −15	−15 < E_ML ≤ −14 (E_MD ≤ −17)	TP (precision)*
1a	1,058 (1,054)†	839 (32, 3.8%)	812 (76.7%)
1b	1,179 (1,177)	625 (6, 0.9%)	865 (73.4%)
2a	836 (832)	696 (33, 4.7%)	690 (82.5%)
2b	910 (908)	550 (14, 2.5%)	672 (73.8%)
3a	1,857 (1,840)	915 (60, 6.6%)	727 (39.1%)
3b	1,280 (1,280)	1,204 (104, 8.6%)	660 (51.6%)
4a	712 (695)	827 (34, 4.1%)	538 (75.6%)
4b	599 (599)	805 (57, 7.1%)	484 (80.8%)

*In parentheses is the prediction precision, defined as TP/(number with E_ML ≤ −15) ≡ TP/(TP + FP), where FP is false positive and TP is true positive.

†In parentheses is the number of MD energies, as some MD evaluations failed.
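The precision values in Table 4 follow directly from the counts, as the short check below illustrates. The data are copied from the table; the dictionary layout and variable names are choices made for this sketch. The last lines pool column 3 over all runs as one simple estimate of the fraction of compounds just above the −15 kJ/(mol Si) cutoff that MD nevertheless places below −17 kJ/(mol Si).

```python
# model: (number with E_ML <= -15, TP, number with -15 < E_ML <= -14, of which E_MD <= -17)
table4 = {
    "1a": (1058, 812, 839, 32),
    "1b": (1179, 865, 625, 6),
    "2a": (836, 690, 696, 33),
    "2b": (910, 672, 550, 14),
    "3a": (1857, 727, 915, 60),
    "3b": (1280, 660, 1204, 104),
    "4a": (712, 538, 827, 34),
    "4b": (599, 484, 805, 57),
}

# Precision = TP / (TP + FP) = TP / (number predicted at or below -15).
precision = {m: round(100 * tp / n_pred, 1) for m, (n_pred, tp, _, _) in table4.items()}
print(precision)

# Pooled fraction of near-threshold compounds that MD finds to be favorable.
missed = sum(row[3] for row in table4.values())
near_threshold = sum(row[2] for row in table4.values())
pooled_fraction = missed / near_threshold
print(f"{missed} of {near_threshold} = {100 * pooled_fraction:.1f}%")
```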

Table 3 shows the cross-section of the putative OSDAs generated in different runs with ML-predicted stabilization energies E ≤ −15 kJ/(mol Si). It also lists the number of molecules generated in each run that were also present in the training set. There is considerable overlap between the different runs, which means that the runs have explored overlapping regions in molecular space. As can be seen in the last column of this table (column 10), even some molecules of the training and test sets have been rediscovered. In total, 3,062 highly scoring putative OSDAs have been discovered through our in silico materials design approach. Generally, the goal is to generate as many unique, favorable OSDAs as possible. False positives are not a major concern, since we can easily screen the 3,062 OSDAs with subsequent MD calculation. False negatives are much harder to identify, as it is computationally infeasible to calculate the MD energies for all OSDAs generated by the eight runs.

Table 3.

Cross-section of the putative OSDAs generated in different runs with ML-predicted stabilization energies E ≤ −15 kJ/(mol Si)

Run	1a	1b	2a	2b	3a	3b	4a	4b	In training set
1a	1,058	749	630	560	477	452	497	453	13
1b		1,179	585	691	445	446	402	384	10
2a			836	565	386	374	419	435	11
2b				910	320	312	339	328	7
3a					1,857	1,051	322	254	21
3b						1,280	354	311	12
4a							712	386	17
4b								599	11
Total unique molecules: 3,062

Column 10 lists the number of molecules generated in one run that are present in the training or validation set.
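The degree of overlap between runs can be put on a common scale with the Jaccard index, |A ∩ B| / |A ∪ B|, computed from the counts in Table 3. Using the Jaccard index here is a choice made for this illustration and is not a measure reported in the text; only a few of the run pairs are included.

```python
# Diagonal of Table 3: number of molecules with E_ML <= -15 kJ/(mol Si) per run.
run_size = {"1a": 1058, "1b": 1179, "2a": 836, "2b": 910,
            "3a": 1857, "3b": 1280, "4a": 712, "4b": 599}

# A few off-diagonal entries of Table 3: molecules shared by two runs.
shared = {("1a", "1b"): 749, ("3a", "3b"): 1051,
          ("1a", "4b"): 453, ("3a", "4b"): 254}

def jaccard(run_a, run_b):
    """Shared molecules divided by the size of the union of the two runs."""
    both = shared[(run_a, run_b)]
    return both / (run_size[run_a] + run_size[run_b] - both)

for pair in shared:
    print(pair, round(jaccard(*pair), 3))
```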


Verification.

Ultimately, we validated the neural-network-based materials design framework by comparing the ML-predicted stabilization energies of the designed OSDAs with MD-calculated energies. The goal is to test whether the training, test, and validation sets cover a limited part of the possible chemical space. If so, a neural network trained and validated on such datasets may not necessarily generalize beyond the space on which it was trained. Two possible causes of this problem are an insufficiency of the 3D-MoRSE descriptors to generally describe the 3D molecular structure of the OSDAs, and insufficient complexity of the neural network structure to capture deeper features of the OSDAs’ 3D structure. An in silico materials design run may explore a different chemical space than the training set, in which case the neural networks may perform poorly and predict inaccurate energies. This concern was tested by comparing the MD energies and ML-predicted energies for the OSDAs. All molecules generated by in silico materials design and predicted by the ML methods to have a stabilization energy below −14 kJ/(mol Si) were subjected to MD calculation of their stabilization energy for verification. Although it is impractical to calculate the energy by MD for all OSDAs, such a calculation on the limited number of predicted OSDAs with stabilization energy below −14 kJ/(mol Si) can give a good estimate of the false negatives. Table 4 lists this measure of false negatives in column 3, with the total number of compounds with ML-predicted energies between −15 and −14 kJ/(mol Si), and the number of compounds among these with MD-calculated energies below −17 kJ/(mol Si). Table 4 also lists the number of compounds with ML-predicted energies below −15 kJ/(mol Si), the number of true positives (TPs), and the prediction precision.
Dataset S1 shows all compounds with MD energies below −17 kJ/(mol Si) based upon screening all compounds with predicted ML energies below −14 kJ/(mol Si). In total, there are 469 compounds. This expands upon the 152 compounds with stabilization energy below −17 kJ/(mol Si) that were in our training list of 4,781 compounds. From Table 4 we can see that the false-negative proportion is roughly 5%. The prediction precision is nearly 80% for most models, but only about 40 to 50% for models 3a and 3b. Among the false positives are some that lie in a different region of chemical space. For the run with model 1b, for example, we noted that four high-scoring molecules were considerably larger in volume than other molecules in the same and the other runs. They are depicted in the supplementary material, together with their molecular volume and the ML-predicted and MD-verified stabilization energies. To further investigate the issue of exploring chemical space, we applied a principal component analysis (PCA) to the 3D-MoRSE intensities of all molecules generated in run 1b with a predicted stabilization energy in BEA lower than −15 kJ/(mol Si). In the scatter plot of the first and second principal components of these molecules, the points corresponding to the four “large” molecules are clearly outliers.
The scatter plot of the predicted ML stabilization energy versus the molecular volume shows that the minimal predicted stabilization energy of a molecule follows an approximately parabolic curve with the molecular volume, and the four false-positive hits clearly fall outside this distribution. A representation of the molecular space explored by the eight in silico runs is presented in the supplementary material. To construct this representation, the 2D Tanimoto fingerprints of the 3,062 unique molecules were generated, and from these a Euclidean distance matrix was computed. This distance matrix was used to calculate the principal coordinates of each of the 3,062 molecules, and the first two principal coordinates were plotted. The fractions of the variance covered by these two coordinates are 0.20 and 0.10, respectively. Considerable structure is present in this plot, and this can be analyzed in a cursory way by picking representative points and examining the corresponding molecular structures. The two large clusters separated by the first principal coordinate distinguish molecules containing aromatic 6-cycle (a through e) and charged pyrazole (f and g) functionalities on the one hand, and charged imidazole functionalities (h through l) on the other hand. Within the two large clusters, smaller subclusters can be discerned that correspond to different molecular scaffolds. While the specific clustering depends on the choice of the 2D descriptors used for the principal coordinate analysis, this result shows that the in silico material design program produces a variety of molecular scaffolds. The individual subspaces searched by the eight runs are illustrated in eight subplots, in which the molecules generated in each run are represented as green and red dots, and the blue dots correspond to molecules generated in the runs other than the indicated run.

Conclusions

We have used a data set of 4,781 putative zeolite BEA OSDAs for which the stabilization energies in BEA have been obtained through computationally intensive MD calculation to train ML models for predicting stabilization energies using a neural network. Through exploration of the hyperparameter space we have trained and validated eight models, taking care to strictly separate training and testing sets on one hand, and validation sets on the other hand (32). The molecules generated by the in silico material design fall within the domain of applicability of the ML algorithm. In total we have found 3,062 distinct putative OSDAs for zeolite beta, 469 of which are predicted to be exceptionally stable. We have shown that this protocol enables an effective and computationally tractable search for novel OSDAs.

Methods

Neural Network.

The structure of the neural network is shown in the supplementary material. There was one hidden layer between the input and output layers. The input layer consists of the structural descriptors I of each OSDA obtained through the 3D-MoRSE code (29). The output layer predicts stabilization energies. Sigmoid activation was adopted in the hidden layer. The output layer adopts either sigmoid activation or linear activation depending on the model. The samples of OSDAs for training, testing, and validating the neural network consist of 4,781 putative BEA OSDAs that we have obtained in our search for OSDAs for pure BEA and chiral BEA zeolite in the past five years. In this search, our procedure consisted of first designing putative small, achiral “monomer” OSDAs and then finding suitable chiral linkers to dimerize these (18). We here use these monomers for training a neural network. To obtain good-scoring monomer OSDAs for BEA we have used three strategies: de novo design, virtual screening, and virtual combinatorial chemistry. A de novo design algorithm (19, 21) was used to generate many putative BEA OSDAs. Analogs of the highest-scoring hits were selected from the available building block databases in eMolecules (https://reaxys.emolecules.com/) and Chemspace (https://chem-space.com/). Finally, we extended this set by generating alkylated derivatives. In this way, we have obtained 4,781 putative BEA OSDAs with predicted stabilization energies between −20 and 0 kJ/(mol Si). These OSDAs consist of a set of 3,875 uncharged molecules and a set of 906 molecules that contain one or several charged N atoms. The materials design approach is a de novo design program that searches and generates synthesizable molecules with desirable properties.
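A forward pass through a network of this shape can be sketched as follows. The weights here are random placeholders rather than trained values, the 17-input, 8-hidden-node layout mirrors model 1b in Table 1, and the linear mapping of the sigmoid output onto the −20 to 0 kJ/(mol Si) range is an assumption made to illustrate why a sigmoid output cannot leave the range of the training energies.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(intensities, w_hidden, b_hidden, w_out, b_out, linear_output=False):
    """One hidden layer with sigmoid activation; sigmoid or linear output node."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, intensities)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return z if linear_output else sigmoid(z)

def to_energy(y, e_min=-20.0, e_max=0.0):
    """Assumed linear rescaling of a (0, 1) sigmoid output to kJ/(mol Si)."""
    return e_min + y * (e_max - e_min)

rng = random.Random(1)
n_inputs, n_hidden = 17, 8   # layout of model 1b in Table 1
# Random placeholder weights; a real model would obtain these by training.
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = rng.uniform(-0.5, 0.5)

x = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]   # stand-in descriptor vector
y = predict(x, w_hidden, b_hidden, w_out, b_out)
energy = to_energy(y)
total_weights = sum(len(row) for row in w_hidden) + len(b_hidden) + len(w_out) + 1
print(round(energy, 2), total_weights)
```

Passing `linear_output=True` returns the unbounded pre-activation value instead, which corresponds to the linear output node of model 4.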
Through a genetic algorithm, this method can search the entire chemical space defined by a list of predefined well-documented organic chemistry reactions and a user-supplied database of commercially available reagents. The output is a set of molecules that score well on the scoring function, together with their synthesis routes. The score function used for the design of BEA OSDAs is summarized in the supplementary material. First, it was verified that the molecule to be scored was amenable to molecular mechanics minimization with the force field used. Then the total number of rotatable bonds, the largest number of consecutive sp3–sp3 rotatable bonds, the presence of atoms other than C, N, or H, the presence of triply bonded C, and the ratio of C atoms to charged N atoms were calculated. These properties can all be deduced from the molecular topology and are computationally trivial to obtain. If all of these fell within their respective thresholds, a locally optimal conformation of the molecule was calculated and the molecular volume was obtained. If this fell within its threshold, a conformational search was performed to obtain the global minimal energy conformation of the molecule using GACS. This conformation was used either as a starting point for the MD procedure to obtain the stabilization energy in the zeolite structure, or to calculate the 3D-MoRSE score to be input into the neural network. Here we chose the latter. The set of reactions used to synthesize virtual molecules presently consists of 100 organic chemistry reactions. The database of reagents we used contains 39,500 commercially available chemicals. To start the run, we randomly selected reactions, reagents, and tree depths to generate the initial population of molecules. Here, tree depth was defined as the number of reactions that take place to form one solution molecule. This depth was usually constrained between 3 and 5. The population size was fixed, and every generated molecule was scored.
It was possible that some molecules did not pass the scoring filters and therefore did not have the molecular volume or stabilization energy calculated. The population was evolved by applying these reactions and a genetic algorithm search for improved predicted stabilization energies. Supplementary materials and methods, figures, and tables accompany this article. Detailed materials and methods and a discussion of overfitting are available, as well as an .sdf file containing the 469 OSDAs with stabilization energies computed by MD to be below −17 kJ/(mol Si).
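The staged scoring described in this section, with cheap filters first and the expensive energy prediction last, can be sketched as below. All thresholds, the `Candidate` fields, and the `fake_predictor` function are hypothetical placeholders, since the actual limits and data structures are not given in this text; the force-field check and the conformational search with GACS are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    name: str
    rotatable_bonds: int
    max_consecutive_sp3_bonds: int
    only_c_n_h: bool
    has_triple_bonded_c: bool
    c_per_charged_n: Optional[float]   # None for an uncharged molecule
    volume: float                      # cubic angstroms

# Hypothetical thresholds, chosen only to make the example run.
MAX_ROTATABLE = 6
MAX_CONSECUTIVE_SP3 = 3
C_PER_N_RANGE = (4.0, 20.0)
VOLUME_RANGE = (100.0, 400.0)

def passes_topology_filters(c: Candidate) -> bool:
    """Cheap checks that need only the molecular topology."""
    if c.rotatable_bonds > MAX_ROTATABLE:
        return False
    if c.max_consecutive_sp3_bonds > MAX_CONSECUTIVE_SP3:
        return False
    if not c.only_c_n_h or c.has_triple_bonded_c:
        return False
    if c.c_per_charged_n is not None:
        if not C_PER_N_RANGE[0] <= c.c_per_charged_n <= C_PER_N_RANGE[1]:
            return False
    return True

def score(c: Candidate, predict_energy: Callable[[Candidate], float]) -> Optional[float]:
    """Return a predicted energy, or None if the molecule is filtered out early."""
    if not passes_topology_filters(c):
        return None
    if not VOLUME_RANGE[0] <= c.volume <= VOLUME_RANGE[1]:
        return None
    return predict_energy(c)   # only survivors reach the costly step

calls = []
def fake_predictor(c: Candidate) -> float:
    calls.append(c.name)       # record which molecules reach the energy stage
    return -15.0               # placeholder energy in kJ/(mol Si)

candidates = [
    Candidate("ok", 2, 1, True, False, 10.0, 250.0),
    Candidate("floppy", 12, 6, True, False, 10.0, 250.0),
    Candidate("bulky", 2, 1, True, False, None, 900.0),
]
results = {c.name: score(c, fake_predictor) for c in candidates}
print(results, calls)
```

Ordering the checks from cheapest to most expensive means the energy predictor is invoked only for molecules that survive every earlier stage, which is the point of using the inexpensive properties as filters.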

