Hisaki Ikebata, Kenta Hongo, Tetsu Isomura, Ryo Maezono, Ryo Yoshida.
Abstract
Computational molecular design aims to identify promising hypothetical molecules that have a predefined set of desired properties. We address the problem of accelerating materials discovery with state-of-the-art machine learning techniques. The method involves two types of prediction: forward and backward. The objective of the forward prediction is to build a set of machine learning models that predict various properties of a given molecule. Inverting the trained forward models through Bayes' law, we derive a posterior distribution for the backward prediction, conditioned on a desired property requirement. Molecules that exhibit the desired properties are then created computationally by exploring high-probability regions of the posterior with a sequential Monte Carlo technique. One major difficulty in the computational creation of molecules is excluding chemically unfavorable structures. To circumvent this issue, we derive a chemical language model that acquires commonly occurring patterns of chemical fragments through natural language processing of the ASCII strings of existing compounds, which follow the SMILES chemical language notation. In the backward prediction, the trained language model is used to refine chemical strings so that the properties of the resulting structures fall within the desired property region while chemically unfavorable structures are removed. The method is demonstrated through the design of small organic molecules with property requirements on the HOMO-LUMO gap and internal energy. The R package iqspr is available at the CRAN repository.
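The Bayesian inversion described in the abstract can be written compactly. The notation is an assumption introduced here for illustration, since the abstract does not fix any symbols: $S$ denotes a chemical string, $Y$ its property vector, and $U$ the desired property region.

```latex
p(S \mid Y \in U) \;\propto\; p(Y \in U \mid S)\, p(S)
```

Under this reading, the likelihood $p(Y \in U \mid S)$ comes from the trained forward models, and the prior $p(S)$ comes from the chemical language model, which assigns low probability to chemically unfavorable strings.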
Keywords: Bayesian analysis; Inverse-QSPR; Molecular design; Natural language processing; SMILES; Small organic molecules
Year: 2017 PMID: 28281211 PMCID: PMC5393296 DOI: 10.1007/s10822-016-0008-z
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 3.686
Fig. 1 Outline of the Bayesian molecular design method
Correspondence table between the formal and modified rules of SMILES
| Type | Original | Modified |
|---|---|---|
| Start of a ring closure | n | & |
| End of a ring closure | n (same as the start) | & |
| Bond followed by atom | =A (double), #A (triple) | =A or #A form a single character |
| Terminal character of a molecule | N/A | $ |
| String in a square bracket | [abcde] | [abcde] form a single character |
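Three of the modified rules in the table (bracketed strings as one character, a bond fused with the atom that follows it, and the terminal `$`) can be sketched as a small tokenizer. This is an illustrative Python sketch, not the iqspr implementation. The function name `tokenize` and the regular expression are assumptions, only the two-letter atoms Cl and Br are handled, and ring-closure digits are left unchanged rather than rewritten to `&`.

```python
import re

# Alternatives are tried left to right, so fused tokens win over single symbols.
TOKEN_RE = re.compile(
    r"[=#]?\[[^\]]+\]"          # bracketed string, optionally fused with = or #
    r"|[=#]?(?:Cl|Br|[A-Za-z])" # atom, optionally fused with = or #
    r"|%\d{2}"                  # two-digit ring-closure label
    r"|\d"                      # single-digit ring-closure label
    r"|[()/\\.\-:~=#]"          # branches and any remaining bond symbols
)


def tokenize(smiles, terminal="$"):
    """Split a SMILES string into tokens and append the terminal character."""
    tokens = TOKEN_RE.findall(smiles)
    # Refuse input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens + [terminal]


if __name__ == "__main__":
    print(tokenize("CC(=O)O"))
    print(tokenize("C[NH3+]"))
```

Treating `=O` or `[NH3+]` as one token means a language model over these tokens never has to predict a bond symbol or an opening bracket without also committing to what follows it.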
Fig. 2 Illustration of the substring selector with three examples. In the contraction operation, a substring inside of the outermost closed parentheses (green) is reduced to the character in its first position (red). The extraction operation is to remove the rest (black) of the last () characters from the reduced string. The corresponding graphs are shown on the right where the atoms in the boxes indicate the last characters in the inputs of (left)
MAEs of the QSPR models with the eight different fingerprint descriptors for the internal energy and the HOMO-LUMO gap
| Fingerprint | Energy (kcal/mol) | HOMO-LUMO gap (eV) | Runtime (s) |
|---|---|---|---|
| 1 | 32.6 | 0.53 | 0.50 |
| 2 | 30.4 | 0.54 | 0.41 |
| 3 | 29.3 | 1.37 | 2.57 |
| 4 | 28.3 | 1.66 | 0.36 |
| 5 | 22.1 | 0.55 | 5.32 |
| 6 | 46.8 | 0.84 | 0.39 |
| 1,2,4 | 23.5 | 0.54 | 1.61 |
| 1,2,4,5 | 18.9 | 0.50 | 7.71 |
The six fingerprints in the rcdk package (bottom) and their combinations were tested. The last column denotes the average runtime for the QSPR score (likelihood) calculation per 100 molecules. The runtimes were measured on an Intel Xeon 2.0 GHz processor with 128 GB memory using the iqspr package
1. ‘standard’: paths of a default length (1024 bits)
2. ‘extended’: the ‘standard’ fingerprint is modified such that ring and atomic properties are taken into account (1024 bits)
3. ‘maccs’: MDL MACCS keys (166 bits)
4. ‘circular’: ECFP6 fingerprint (1024 bits)
5. ‘pubchem’: PubChem fingerprint (881 bits)
6. ‘graph’: ‘standard’ is modified by taking into account connectivity (1024 bits)
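The error metric in the table above is the mean absolute error (MAE). The Python sketch below shows how it is computed. It also shows one plausible reading of a combined descriptor such as "1,2,4", namely concatenation of the individual bit vectors; that reading is an assumption, as are both function names, and the numbers in the usage lines are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def combine_fingerprints(*bit_vectors):
    """Concatenate several fingerprint bit vectors into one descriptor.

    Assumption: a combined descriptor is the plain concatenation of its parts.
    """
    combined = []
    for bits in bit_vectors:
        combined.extend(bits)
    return combined


if __name__ == "__main__":
    # Made-up values, not results from the paper.
    print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
    print(combine_fingerprints([1, 0], [0, 1, 1]))
```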
Fig. 3 a Perplexity scores (left) and valid grammar rates (1 − the syntax error rate) (right) for 1000 SMILES strings generated from trained chemical language models. The conventional n-gram and the extended language models were trained with the back-off (BO) and Kneser-Ney (KN) algorithms. The error bars represent the standard deviations across the 10 experiments corresponding to different training sets. b Examples of molecules generated from the trained chemical language model with (top). The bottom row displays the most similar PubChem compounds that had the Tanimoto coefficient 0.9 on the PubChem fingerprint
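The two quantities named in this caption, perplexity and the Tanimoto coefficient, have standard definitions that the Python sketch below implements. The function names and input representations are assumptions: a list of per-token probabilities assigned by a language model, and fingerprints given as sets of on-bit indices.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence: exp of the mean negative log probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(union)


if __name__ == "__main__":
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Lower perplexity means the model finds the strings less surprising, and a Tanimoto coefficient close to 1 means two fingerprints share almost all of their on-bits.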
Parameters and experimental conditions for the Bayesian molecular design analysis
| Process | Description | Parameter |
|---|---|---|
| Forward prediction | Number of training data | |
| | Fingerprint descriptor | 1, 2, 4 |
| | The normal prior | |
| | The Gamma prior | |
| Chemical language model | Number of training data | 50,000 |
| | Markov-order | |
| | Estimation algorithm | Back-off method |
| Backward prediction | Size of population | |
| | Number of iterations | |
| | Reordering probability | |
| | Binomial probability | |
| | Trial number | |
| | Cooling schedule | |
| | Threshold on ESS | |
| | Initial structures | Phenol c1ccccc1O |
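The "Threshold on ESS" row refers to the effective sample size, which sequential Monte Carlo methods commonly use to decide when to resample the population. The Python sketch below shows that generic rule and is not the iqspr implementation. The function names are hypothetical, multinomial resampling is a choice made for this example, and the threshold is taken as an absolute count (the paper's convention may differ).

```python
import random


def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return total * total / sum(w * w for w in weights)


def resample_if_degenerate(particles, weights, ess_threshold, rng=None):
    """Resample in proportion to weight when the ESS drops below the threshold."""
    rng = rng or random.Random()
    if effective_sample_size(weights) >= ess_threshold:
        return list(particles), list(weights)
    chosen = rng.choices(particles, weights=weights, k=len(particles))
    # After resampling, every particle carries equal weight again.
    return chosen, [1.0] * len(particles)


if __name__ == "__main__":
    population = ["c1ccccc1O", "CCO", "CC(=O)O", "C#N"]
    print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))
    print(resample_if_degenerate(population, [0.97, 0.01, 0.01, 0.01], 2.0))
```

The ESS equals the population size when all weights are equal and falls toward 1 as a single particle dominates, so a threshold on it triggers resampling only when the weights have degenerated.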
Fig. 4 a Snapshots of structure alteration during the early phase of the inverse-QSPR calculation () with the desired property region set to , or . The initial molecule (phenol) is shown at the top. The created molecules shown are those ranked in the top four by likelihood score at each t. Supplementary Movies 1–3 visualize the whole process of structure modification over . b Property refinements resulting from the backward prediction at . Results for the three property regions, , and , are displayed together and color-coded in red, green and blue, respectively. The shaded rectangles indicate the target regions. The dots indicate the HOMO-LUMO gaps and internal energies of the designed molecules as predicted by the QSPR models. For each and t, the 10 non-redundant molecules with the highest likelihoods are shown. c Properties of 50 molecules selected from the overall backward prediction process for (red), (green), and (blue). The HOMO-LUMO gap and internal energy were calculated by the trained QSPR models (left) and by DFT calculation (right). The gray dots indicate the training data points. For each , the 50 non-redundant molecules that achieved the highest likelihoods are shown. d Newly created molecules in the predefined property regions. The bottom row of each pair shows significantly similar PubChem compounds that had the Tanimoto index