
Application of Machine Learning in Developing Quantitative Structure-Property Relationship for Electronic Properties of Polyaromatic Compounds.

Tuan H Nguyen1, Lam H Nguyen1, Thanh N Truong2.   

Abstract

The degree of π orbital overlap (DPO) model has been demonstrated to be an excellent quantitative structure-property relationship (QSPR) that can map two-dimensional structural information of polycyclic aromatic hydrocarbons (PAHs) and thienoacenes to their electronic properties, namely, band gaps, electron affinities, and ionization potentials. However, the model suffers from significant limitations that narrow its applications because parameter optimization and descriptor formulation rely on inefficient manual procedures. In this work, we developed a machine learning (ML)-based method for efficiently optimizing DPO parameters and proposed a truncated DPO descriptor, which is simple enough to be extracted automatically from simplified molecular-input line-entry system (SMILES) strings of PAHs and thienoacenes. Compared with the results from our previous studies, the ML-based methodology can optimize DPO parameters with four times fewer data while achieving the same level of accuracy, predicting the mentioned electronic properties to within 0.1 eV. The truncated DPO model also has similar accuracy to the full DPO model. Consequently, the ML-based DPO approach coupled with the truncated DPO model enables new possibilities for developing automatic pipelines for high-throughput screening and for investigating new QSPRs for new chemical classes.
© 2022 The Authors. Published by American Chemical Society.


Year:  2022        PMID: 35811887      PMCID: PMC9261278          DOI: 10.1021/acsomega.2c02650

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

Applications of machine learning techniques to research in chemical and related fields have been gaining attention recently. Some novel applications include generating drug candidates,[1] investigating chemical phenomena,[2] and assisting theoretical calculation.[3−8] One of the most prominent tasks for applying machine learning is the prediction of physical or chemical properties. In this direction, quantitative structure–activity relationship (QSAR) specializes in the prediction of the biological activity of compounds from their structural information.[9−12] Similarly, materials scientists employ a related technique called quantitative structure–property relationship (QSPR) for predicting various properties of materials from their 2D or 3D structural data.[13−17] In most QSPR and QSAR methodologies, prediction tasks depend heavily on a set of descriptors,[9,13] which serve as numerical representations of the structural information of molecules. Typically, these descriptors are numerical objects obtained by transforming raw molecular data with some predefined procedure. These descriptors can be as simple as a list of the molecule’s composition[5,6] or as complex as matrices,[18−20] fingerprints,[15,16,21] etc.[8,14,22−26] Recently, representation learning[27] has been introduced to harness deep learning concepts for actively refining the process of molecular representation extraction through learning. For instance, neural fingerprints[28] are modeled after the extended connectivity fingerprints (ECFPs).[21] The ECFPs are designed solely to represent each molecular fragment as uniquely as possible. Neural fingerprints, on the other hand, replace some operations in ECFP generation with learnable modules that actively configure themselves during training to optimally represent molecular fragments for a certain task. As such, fingerprints of fragments are similar if the learned model can recognize the similarity between them. 
Hence, the representation has more expressiveness and interpretability. This approach was recently featured in multiple publications and yielded excellent results.[29−32]

The degree of π orbital overlap (DPO) model is a QSPR model for predicting electronic properties of polyaromatic hydrocarbon (PAH) and thienoacene compounds.[33,34] The DPO model relies on a representation or descriptor called the degree of π orbital overlap (DPO). Based on a quantum mechanical physical model, this descriptor represents a PAH or thienoacene molecule by a polynomial of six nonzero parameters, each associating the contribution of a topological trait of the compound to its final electronic properties. The model has proven accurate: it predicts electronic properties, namely, electron affinities, ionization potentials, and band gaps, to within 0.1 eV.[33,34] The original parameter optimization scheme relies on recognizing in advance that each parameter is associated with a structural feature, which suggests a specific data set for optimizing that parameter by trial and error. This optimization procedure has several drawbacks. First, it requires a large number of data points to calibrate the model, yet the final set of parameters may not be globally optimal. Second, the application of the DPO model to other chemical classes is not trivial. Finally, to date, the DPO value of each molecule has been determined manually, and thus, the model is unrealistic to use for high-throughput screening of massive databases.

In this work, we implement and assess the performance of a learning procedure for developing a DPO model for a class of molecules. The goal of this new procedure is to provide an automated pipeline for parameter optimization in order to remove the drawbacks of our previous works mentioned above. 
This is motivated by recognizing that DPO descriptors can be treated as a learnable representation, which can be done by applying various learning principles and techniques such as empirical risk minimization, gradient descent, and backpropagation. Furthermore, toward the goal of creating an automatic pipeline for the DPO model, we devised a truncated DPO model. The analysis of results in the optimization of the DPO descriptors reveals that several terms in the DPO descriptor can be neglected with little consequence, thus suggesting a truncated DPO model. With the truncated DPO model, we demonstrate an automated pipeline that uses simplified molecular-input line-entry system (SMILES) strings as input for molecular representations that can be employed for high-throughput screening. Assessments for the accuracy and applicability of the truncated DPO model in predicting the electronic properties of PAHs and thienoacenes are then provided.

Methods

DPO Models and Descriptors

DPO Models

A DPO model is a physics-based QSPR model for predicting electronic properties of polyaromatic compounds such as PAHs and thienoacenes. It uses a simple particle-in-a-2D-box quantum mechanical model to connect the structure of a compound to physical properties related to its energy levels, such as the electron affinity (EA), ionization potential (IP), and highest occupied molecular orbital (HOMO)–lowest unoccupied molecular orbital (LUMO) gap. The input structural information is the 2D structure of the polyaromatic compound. The procedure can be described briefly as follows. First, from the compound’s structure, a polynomial of six preoptimized nonzero parameters is evaluated to obtain the structure’s DPO value. This step is done manually by following a set of structural descriptive rules. The DPO value is subsequently used in QSPR linear equations to obtain the predicted properties. Three distinct linear equations of the form $y = w x + w_b$, in which $x$ is the DPO value, $y$ is the property, and $w$ and $w_b$ are weights, correspond to the three electronic properties mentioned above. In this work, the term DPO descriptor refers to the polynomial, as it reflects the topological structure of the polyaromatic compound. The parameters that appear in these DPO descriptors are called DPO parameters.
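The final step of the model, turning a DPO value into a property, is a one-line linear map. The sketch below illustrates it for the HOMO–LUMO gap, assuming the slope and intercept reported for the ML-based full DPO model in Table 1 (−0.77 and 4.99); the function name `predict_property` and the DPO value of 2.0 are hypothetical and chosen only for illustration.

```python
def predict_property(dpo_value, slope, intercept):
    """Evaluate the QSPR linear equation y = w * x + w_b for one DPO value."""
    return slope * dpo_value + intercept


# Weights taken from the full-DPO row of Table 1 (band gap, in eV).
GAP_SLOPE = -0.77
GAP_INTERCEPT = 4.99

# A made-up DPO value; real values come from the DPO polynomial of a molecule.
example_gap = predict_property(2.0, GAP_SLOPE, GAP_INTERCEPT)
print(f"predicted gap: {example_gap:.2f} eV")
```

The negative slope reflects that a larger DPO value (more extended π overlap) maps to a smaller predicted gap.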

Rules for Determining DPO Descriptors

A brief explanation of the DPO rules is presented here to introduce some of their important features. Complete descriptions of the DPO rules with examples are given in our previous works.[33,34] In this study, we explain how the DPO descriptor captures topological structural information and how the truncated DPO descriptor arises. The goal of the DPO rules is to assign each fused bond in the polyaromatic compound a component in the DPO polynomial suggested by the use of an effective 2D particle-in-a-box quantum mechanical model. Fused bonds of the same segment are assigned similarly depending on the topological position of the segment in the molecule. The topological position is defined with respect to the reference segment, which can be uniquely determined via a set of rules. Thus, the reference segment needs to be determined first and is generally the longest segment. With reference to this segment, the topological position of any segment is specified by its orientation and its overlayer order. In terms of orientation, a segment in a PAH can be parallel to the reference segment or form an angle of 120° or 60° with it. For these three kinds of segments, the parameters a, b, and c, respectively, are used in assigning values to the fused bonds. The overlayer order of a segment represents its orthogonal distance to the reference segment. As such, the farther the segment is above or below the reference segment, the larger the order is. The sum of values assigned to fused bonds of such segments is scaled by a factor of d raised to the power of the overlayer order k. In summary, supposing o and k are respectively the relative orientation and the overlayer order of a segment, the equation below is the contribution of a given segment to the overall DPO value and describes how fused bonds in that segment are assigned values. Figure 1 illustrates examples of assigning DPO values to fused bonds using the above equation. 
For compound C1, the reference segment is marked with an arrow across. Its fused bonds are assigned as the first case of the equation with k = 0. The segment that forms with the reference segment an angle of 120° is assigned to the second case with k = 0. Finally, the uppermost segment that is parallel with the reference segment is two orders above the reference; therefore, the first case of the equation is used with k = 2. For compound C2, consider the upper rightmost segment that forms with the reference segment (marked with an arrow) an angle of 120°. Since it stems from the parallel segment that is one order above the reference segment, the second case of the equation is used with k = 1 assigned to this segment. For similar cases where there are segments that form an angle of 60° with the reference one, refer to compounds C3 and C4.
Figure 1

Examples of assigning DPO values to different fused bonds. The overlayer order k and orientation o are also given for some segments.

Lastly, the parameters a* and d* are used whenever thiophene rings are in the segment. The DPO value of a structure is the sum of the DPO values of all its segments, that is, of the terms assigned to all of its fused bonds.

Truncated DPO Descriptors

In the DPO model, the contribution of each overlayer segment is scaled by a power of d. Previous studies showed that the d parameter is between 0.2 and 0.3. This suggests that segments that are far from the reference segment can be assumed to have a negligible contribution and thus can be dropped from the total DPO descriptor. The truncated DPO model presented here omits the contributions of several types of overlayer segments. It inherits most of the original DPO assignment rules, with two new simplifications. First, if multiple segments share the same longest length, there is no need to go through an elaborate process to determine a unique reference segment. Instead, the final DPO value is the average of the DPO values obtained by treating each of these segments in turn as the reference segment. Figure 2 illustrates this rule.
Figure 2

Illustrations of computing truncated DPO values for the case of multiple longest segments. Molecule A has the two longest segments shown as arrows in A1 and A2. The truncated DPO of A is the average of DPO values when A1 and A2 are chosen as the reference segment, i.e., $\mathrm{DPO}(A) = [\mathrm{DPO}(A1) + \mathrm{DPO}(A2)]/2$.

Second, only two types of overlayer segments are considered. One is the adjacent segments, which have one ring in common with the reference segment. The other is the segments that have one ring in common with the adjacent segments. Contributions from other overlayer segments are ignored. For instance, Figure 3 shows two different structures that have the same truncated DPO value.
Figure 3

Illustration on the assignment of DPO values to each fused bond; terms in gray are neglected in the truncated DPO model.
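The two truncation rules can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes segments have already been identified and that a hypothetical mapping `shares_ring` lists, for each segment, the segments that have one ring in common with it. The first function keeps only the reference segment, its adjacent segments, and the segments adjacent to those; the second averages the DPO values obtained for each choice of longest segment.

```python
from collections import deque


def segments_kept(shares_ring, reference, max_hops=2):
    """Breadth-first search that keeps segments within `max_hops` ring-sharing
    steps of the reference segment (1 = adjacent, 2 = adjacent to adjacent)."""
    hops = {reference: 0}
    queue = deque([reference])
    while queue:
        segment = queue.popleft()
        if hops[segment] == max_hops:
            continue  # do not expand beyond the truncation limit
        for neighbour in shares_ring.get(segment, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[segment] + 1
                queue.append(neighbour)
    return hops


def average_over_references(candidate_values):
    """Average the DPO values computed with each longest segment as reference."""
    if not candidate_values:
        raise ValueError("at least one candidate DPO value is required")
    return sum(candidate_values) / len(candidate_values)


# Hypothetical chain of segments: ref - s1 - s2 - s3
toy_segments = {"ref": ["s1"], "s1": ["ref", "s2"], "s2": ["s1", "s3"], "s3": ["s2"]}
print(segments_kept(toy_segments, "ref"))
print(average_over_references([3.2, 3.6]))
```

In the toy chain, segment `s3` is three steps away from the reference and is therefore dropped, which mirrors the terms shown in gray in Figure 3.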


Generating Truncated DPO from SMILES Strings

In order to employ machine learning techniques and further expand the applicability of the DPO model to different chemical classes, we need a means to extract structural information from polyaromatic molecules represented by simplified molecular-input line-entry system (SMILES) strings.[35] A SMILES string provides information about the atoms and bonds of a molecule that can be used to create a graph data structure. From there, groups of atoms that form rings can be recognized with a graph traversal algorithm. Rings that lie in a straight line can then be recognized and grouped into segments. Subsequently, the longest segments, along with their adjacent segments and topology, can be determined for the truncated DPO. The process is schematically illustrated in Figure 4.
Figure 4

Flow chart illustrating the process of extracting truncated DPO descriptors from SMILES strings.
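One step of this pipeline, deciding whether three fused rings lie in a straight line, can be sketched as follows. The sketch assumes that ring detection has already been done and that each six-membered ring is given as a list of atom indices in ring order; the function names and the atom numbering are hypothetical, and five-membered thiophene rings would need a separate rule. Two neighbours of a middle ring are collinear when the bonds they share with it sit on opposite edges of the hexagon.

```python
def shared_edge(ring, neighbour):
    """Index k of the bond ring[k]-ring[k+1] that also belongs to `neighbour`,
    or None if the two rings are not fused."""
    members = set(neighbour)
    n = len(ring)
    for k in range(n):
        if ring[k] in members and ring[(k + 1) % n] in members:
            return k
    return None


def in_straight_line(middle, left, right):
    """True if `left` and `right` are fused to opposite edges of the
    six-membered ring `middle`, i.e., the three rings form a linear segment."""
    if len(middle) != 6:
        raise ValueError("this sketch only handles six-membered rings")
    i, j = shared_edge(middle, left), shared_edge(middle, right)
    if i is None or j is None:
        return False
    return (i - j) % 6 == 3


# Hypothetical atom numbering for two three-ring PAHs.
ring_a = [0, 1, 2, 3, 4, 5]
ring_b = [5, 4, 6, 7, 8, 9]               # fused to ring_a through bond 4-5
linear_c = [7, 8, 10, 11, 12, 13]         # anthracene-like: opposite edge of ring_b
angular_c = [6, 7, 10, 11, 12, 13]        # phenanthrene-like: edge one bond away

print(in_straight_line(ring_b, ring_a, linear_c))
print(in_straight_line(ring_b, ring_a, angular_c))
```

Chaining this test along fused rings groups them into segments, after which the longest segment and its neighbours can be looked up.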


Optimization of the DPO Model

In this work, instead of manually optimizing the six DPO parameters for PAHs and thienoacenes one at a time as in our previous works, a machine learning technique is developed. The optimization scheme is based on the empirical risk minimization (ERM) principle.[36] The empirical risk is the error of the model’s predictions over all training samples and is assessed by a loss function, here the mean square error (MSE), the usual choice for regression models. According to the ERM paradigm, the optimal parameters of the model are obtained by minimizing the empirical risk, which can be done with the gradient descent algorithm. The first-order derivative of the loss function with respect to the parameters is obtained by working backward from the loss function to the parameters using the chain rule; hence, this technique of finding derivatives is named backpropagation.[37]

We adopt the convention that the number inside the square brackets in the superscript of a quantity denotes the time step (iteration), while the subscript $i$ denotes the index of a molecule. The symbol $p$ collectively denotes the six nonzero parameters a, b, c, d, a*, and d*. We denote by $x$ a numerical DPO value and by $g$ its polynomial. The optimization procedure is given step by step below.

Step 1. Set $t = 0$ and initialize the six DPO parameters $p^{[0]} = (a^{[0]}, b^{[0]}, c^{[0]}, d^{[0]}, a^{*[0]}, d^{*[0]})$ to zero.

Step 2. Derive analytically the gradient of the DPO polynomial $g_i$ with respect to the parameters $p$ for each molecule $i$ in the training set.

Step 3. Compute the DPO value $x_i^{[t]}$ and its derivative with respect to $p$, $\nabla_p x_i^{[t]}$, numerically using the set of parameters in the current time step.

Step 4. Apply the least-squares algorithm to determine $w^{[t]}$ and $w_b^{[t]}$ from the DPO values $x_i^{[t]}$ and the true values $y_i$ of the modeled property.

Step 5. Compute the prediction $\hat{y}_i^{[t]}$ by plugging $x_i^{[t]}$ into the linear equation, $\hat{y}_i^{[t]} = w^{[t]} x_i^{[t]} + w_b^{[t]}$.

Step 6. Compute the mean square error (MSE) loss function for all $N$ molecules in the training set, $L^{[t]} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i^{[t]} - y_i)^2$.

Step 7. Compute the gradient of the loss function with respect to each parameter by working backward using the chain rule, $\nabla_p L^{[t]} = \frac{2}{N}\sum_{i=1}^{N} (\hat{y}_i^{[t]} - y_i)\, w^{[t]}\, \nabla_p x_i^{[t]}$.

Step 8. Update the parameters according to $p^{[t+1]} = p^{[t]} - \alpha \nabla_p L^{[t]}$, where $\alpha$ is the adjustable learning rate, which is 0.1 by default.

Step 9. Increment $t$ to $t + 1$ and repeat steps 3 through 6 to obtain a new loss value with the new parameters. If the predefined stopping conditions are not satisfied, continue with steps 7 through 9; otherwise, stop and return the model with the optimized parameters.

Many conditions can be set in step 9 to mitigate training difficulties and make the training process more automatic. For instance, to make the algorithm practical, a maximum number of cycles and a convergence threshold were introduced. The algorithm is succinctly illustrated in the form of a flow chart in Figure 5.
Figure 5

Flow chart illustrating the ML-based DPO parameter optimization procedure.

The DPO model described above is called the machine learning (ML)-based DPO model. It is implemented in the Python programming language and built with several libraries. NumPy[38] is employed to carry out vector computations rapidly. pandas[39] is employed for working with data sets. scikit-learn[40] is used to deploy linear regression models. SymPy[41] is used for symbolic differentiation and rapid evaluation of symbolic expressions.
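The optimization loop described in this section can be sketched on a toy problem. The code below is a dependency-free illustration under stated assumptions, not the authors' GDDPO.py: each hypothetical molecule has a two-parameter DPO polynomial $x = n_0 - n_1 a + n_2 d$ stored as the triple `(n0, n1, n2)`, the reference values `toy_gaps` are synthetic, and the names `fit`, `dpo`, and `dpo_grad` are made up. The real model has six parameters and obtains gradients symbolically.

```python
def dpo(mol, p):
    """Evaluate the toy DPO polynomial x = n0 - n1*a + n2*d."""
    n0, n1, n2 = mol
    return n0 - n1 * p["a"] + n2 * p["d"]


def dpo_grad(mol):
    """Analytic gradient of the toy polynomial with respect to a and d."""
    _, n1, n2 = mol
    return {"a": -n1, "d": n2}


def least_squares(xs, ys):
    """Closed-form fit of y = w*x + w_b (step 4)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def fit(mols, ys, lr=0.05, max_steps=500, tol=1e-5):
    p = {"a": 0.0, "d": 0.0}                      # step 1: start from zeros
    losses = []
    for _ in range(max_steps):
        xs = [dpo(m, p) for m in mols]            # step 3
        w, w_b = least_squares(xs, ys)            # step 4
        residuals = [w * x + w_b - y for x, y in zip(xs, ys)]   # step 5
        loss = sum(r * r for r in residuals) / len(ys)          # step 6
        converged = bool(losses) and abs(losses[-1] - loss) < tol
        losses.append(loss)
        if converged:                             # step 9: stopping condition
            break
        grads = {                                 # step 7: chain rule
            name: sum(2 * r * w * dpo_grad(m)[name]
                      for r, m in zip(residuals, mols)) / len(ys)
            for name in p
        }
        for name in p:                            # step 8: gradient descent update
            p[name] -= lr * grads[name]
    return p, (w, w_b), losses


# Synthetic data generated with a = 0.1, d = 0.3, w = -0.8, w_b = 5.0.
toy_mols = [(2, 1, 0), (3, 3, 0), (3, 1, 1), (4, 3, 2), (4, 6, 0), (5, 6, 2)]
toy_gaps = [3.48, 2.84, 2.44, 1.56, 2.28, 1.00]

params, weights, losses = fit(toy_mols, toy_gaps)
print(params, weights, len(losses), losses[0], losses[-1])
```

Because the weights are refitted by least squares at every step, the loss gradient with respect to the weights is zero at that point, so only the direct dependence of the loss on the DPO parameters enters the chain rule in step 7.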

Data

This work mainly concerns the electronic properties of polyaromatic compounds, namely, PAHs and thienoacenes. The PAH data are reused from our previous work on the DPO model for PAHs.[34] The data sets consist of a training set containing PAH molecules of 3–6 rings and a test set consisting of PAH molecules of 7–8 rings. Similarly, two pairs of training and testing data for two classes of thienoacenes, containing either one or two thiophene rings, are taken from our previous work on the DPO model for thienoacene molecules.[33] Cumulatively, there are 248 data points, of which 132 are training instances and the remaining 116 are for testing. The current methodology enables us to investigate the convergence and stability of the parameter optimization procedure within the machine learning framework as functions of the training data size. To make the optimization procedure more robust, for each run, all 248 data points are freshly split into 132 data points for training and 116 data points for testing in a random and stratified manner. This is done by first binning all data points according to their bandgap values. The bin width is chosen to be 0.5 eV, and the range of bandgap values starts at 1.5 eV and ends at 5.0 eV. Data points are then drawn from each bin to assemble the training set. The number of data points sampled from each bin is the size of the training set scaled by the ratio of the bin size to the total number of data points. Additional data points may be sampled randomly regardless of bins to reach the specified sample size. This sampling method is referred to as stratified sampling throughout this work.
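The stratified sampling described above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the function name `stratified_split`, the rounding down of per-bin counts, and the synthetic band gaps are assumptions; the bin width of 0.5 eV and the lower edge of 1.5 eV follow the text.

```python
import random
from collections import defaultdict


def stratified_split(gaps, n_train, low=1.5, width=0.5, seed=0):
    """Split indices of `gaps` into train/test lists, sampling each band-gap
    bin in proportion to its size and topping up at random if needed."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for index, gap in enumerate(gaps):
        bins[int((gap - low) // width)].append(index)

    train = []
    for members in bins.values():
        # Training-set size scaled by (bin size / total number of points).
        quota = int(n_train * len(members) / len(gaps))
        train.extend(rng.sample(members, quota))

    # Rounding down can leave the training set short; fill it regardless of bins.
    chosen = set(train)
    leftovers = [i for i in range(len(gaps)) if i not in chosen]
    train.extend(rng.sample(leftovers, n_train - len(train)))

    chosen = set(train)
    test = [i for i in range(len(gaps)) if i not in chosen]
    return sorted(train), test


# Synthetic band gaps spread between 1.5 and about 5.0 eV.
toy_gaps_ev = [1.5 + 0.035 * i for i in range(100)]
train_idx, test_idx = stratified_split(toy_gaps_ev, n_train=53)
print(len(train_idx), len(test_idx))
```

Passing a different `seed` for each run gives the freshly drawn splits used when results are averaged over repeated experiments.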

Results and Discussion

Convergence Rate of the ML-Based DPO Models

To examine the convergence rate of the ML-based DPO models, we fitted the DPO models against the bandgap property of samples in a full-size training set with various learning rate values. The mean square error (MSE) of the model for the HOMO–LUMO gap is monitored at each time step. The model is considered converged if the difference between its MSE before and after an update is less than $10^{-5}$ eV$^2$. The maximum number of time steps is 150. Plots of the root-mean-square deviation (RMSD, the square root of the MSE) of models for the bandgap property as a function of the number of time steps are shown in Figure 6. Models trained with learning rate values of 0.01, 0.05, and 1.5 fail to reach convergence within the allotted number of time steps. With learning rates of 0.01 and 0.05, the model’s error descends too slowly to reach the converged error of 0.1 eV within the allotted time steps. On the other hand, the error at a learning rate of 1.5 oscillates between 0.20 and 0.25 eV and does not converge. Models trained at learning rate values of 0.50, 1.00, and 1.25 take 43, 24, and 126 time steps to converge, respectively, while models at a learning rate of 0.10 barely converge. Note that at a learning rate of 1.25, the error exhibits some oscillation and does not decrease monotonically as observed for learning rates between 0.1 and 1.0. These results suggest that the learning rate should lie between 0.10 and 1.0. For this study, a learning rate of 1.0 was used.
Figure 6

Plots of RMSD for the band gap of ML-based DPO models trained at different learning rates versus the number of time steps.


Training and Testing ML-Based DPO Models

Here, we present the results of the ML-based full and truncated DPO models trained with full-size (132 data points) training sets and their accuracy assessed with the full-size test set (116 data points). In order to compare them with our previous works,[33,34] DPO parameters are fitted against the same HOMO–LUMO gap values from density functional theory (DFT) calculations. These parameters are then used to obtain QSPR linear equations for predicting band gaps, EA, and IP properties. The converged values for the DPO parameters and the parameters of the linear equation fitted against HOMO–LUMO gap values are presented in Table 1 along with those from our previous studies. The machine learning-optimized parameters are rather close to those obtained using the manual optimization process. The converged parameters for the full and truncated DPO models are also quite close to each other.
Table 1

Values of DPO Parameters and Linear Regression Weights for the ML-Based Full and Truncated DPO Models and Published Values

                     DPO parameters                              parameters of the linear equation
                     a      b       c      d      a*     d*      w_b     w
full DPO             0.08   −0.12   0.32   0.30   0.40   0.16    4.99    −0.77
truncated DPO        0.07   −0.13   0.36   0.28   0.41   0.16    4.99    −0.76
refs (33) and (34)   0.05   −0.25   0.33   0.33   0.50   0.15    4.68    −0.65

Figures 7 and 8 show the linear correlations between the optimized DPO values and the quantum mechanical DFT-calculated values of the HOMO–LUMO gap, EA, and IP. Excellent linear correlations between the values of structural DPO and the physical electronic properties further confirm the physics behind the QSPR model. Figure 9 plots correlations between predictions from both ML-based full and truncated DPO models and the DFT-calculated values of electronic properties of molecules in the test set. Moreover, we repeated this experiment 20 times, each with freshly generated training data sets, and then averaged the optimized parameters and the root-mean-square deviation (RMSD) values. For both the full and truncated DPO models, the average and standard deviation of the RMSD over all 20 runs are 0.10 ± 0.01, 0.07 ± 0.01, and 0.06 ± 0.00 eV for the HOMO–LUMO gap, EA, and IP, respectively. These errors are of the same magnitude as those in our previous studies and are within the accuracy range of the DFT level of theory that was used to generate the data.
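The reported "average ± standard deviation" of the RMSD over repeated runs can be computed as in the short sketch below. The function names and the three example RMSD values are hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not state which convention was used.

```python
import math
import statistics


def rmsd(predicted, reference):
    """Root-mean-square deviation between predicted and reference values."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_runs(run_rmsds):
    """Mean and sample standard deviation of per-run RMSD values."""
    return statistics.mean(run_rmsds), statistics.stdev(run_rmsds)


# Made-up per-run RMSDs (eV) standing in for the 20 repeated experiments.
mean_rmsd, std_rmsd = summarize_runs([0.09, 0.10, 0.11])
print(f"{mean_rmsd:.2f} ± {std_rmsd:.2f} eV")
```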
Figure 7

Plots of linear correlations between the optimized full DPO values and the DFT-calculated properties of (A) HOMO–LUMO gap, (B) electron affinity, and (C) ionization potential of molecules from the training set.

Figure 8

Plots of linear correlations between the optimized truncated DPO values and the DFT-calculated properties of (A) HOMO–LUMO gap, (B) electron affinity, and (C) ionization potential of molecules from the training set.

Figure 9

Plots of QSPR-predicted values versus DFT-calculated electronic properties of the HOMO–LUMO gap, electron affinity, and ionization potential, respectively from the top to bottom for compounds in the test set. Plots (A)–(C) (left side) are from the full DPO model, and plots (D)–(F) (right side) are from the truncated DPO model.

Two general problems that machine learning models often encounter are underfitting, which is identified by a low training accuracy, and overfitting, which is recognized by a high training accuracy but low test accuracy. The results presented in Figure 9 and the average results of multiple test runs indicate that the ML-based DPO models suffer from neither overfitting nor underfitting. The former is no surprise since the DPO model is based on a quantum mechanical model that connects structural variables to their physical properties.[33,34] The latter problem is entirely overcome thanks to the use of the stratified data splitting approach. The above results justify the present truncation to simplify the calculation of the total DPO value for a given molecule. Furthermore, the ML-based approach provides results as accurate as the previous methodology in deriving QSPR relationships but is more robust and can be automated.

Next, we investigated the stability of the ML-based models, in particular the robustness of the ML-based methodology in optimizing the DPO model, by examining the effect of the training set size on the accuracy of the models. 
To this end, using the stratified data splitting approach, different training sets with sizes of 13, 26, 40, 53, 66, 79, 106, and 132 data points were constructed. The performance of these models in predicting values of the HOMO–LUMO gap, EA, and IP was then assessed using the full-size test set of 116 data points. As above, for each training data size, 20 experiments were done with different random data splits (data ensembles from 132 data points), and the results were then averaged. The plots of average RMSDs for both the truncated and full DPO models as a function of training set size are presented in Figure 10. Figure 11 presents the plots of the variation of the DPO parameters versus the size of the training set.
Figure 10

Plots of RMSDs and their standard deviations (as error bars) of the full and truncated DPO models of data from the test set versus the sizes of the training set. (A) HOMO–LUMO gap, (B) electron affinity, and (C) ionization potential.

Figure 11

Plots of DPO parameter values versus the training set size for (A) full DPO model and (B) truncated DPO model. The stars are values from our previous studies.[33,34]

Figure 10 shows that the RMSDs for both the full and the truncated DPO models decrease rapidly as the training set size increases. In particular, both ML-based DPO models converge to an RMSD value between 0.10 and 0.11 eV for the band gap and between 0.06 and 0.07 eV for both the EA and IP by about 53 training data points, which is less than half of the full-size training set. Even with a smaller data set of 26 training samples, both models already achieve a satisfactory RMSD of around 0.12–0.13 eV for the band gap and 0.07–0.08 eV for the EA and IP. Similarly, as shown in Figure 11, both models’ parameters nearly converge to their respective optimal values with only 26 training points. The result indicates that the machine learning methodology presented in this study is much more robust and efficient in optimizing the DPO parameters than our previous manual approach. Furthermore, the fact that nearly identical results were achieved for both the full and truncated DPO models justifies the truncation approximations used and the applicability of the truncated DPO model in automated pipelines for high-throughput applications. In addition, the success of the truncated DPO model also indicates that for electronic properties of PAHs or thienoacenes, only the longest segment and its nearest neighbor segments are important. Contributions from other segments are negligible.

How to Use the ML-Based DPO Models

To encourage applications of the ML-based DPO model presented here, we provide a set of tools that users can easily modify to apply to other physical properties of aromatic hydrocarbons or to other classes of molecules. In particular, code for the ML-based DPO model can be found in the GDDPO.py script. Data splitting, model training, and testing can be done in a couple of lines using the all-in-one Python object Trainer from the trainer.py script. The data set consists of either manually formulated DPO polynomials or SMILES strings representing molecules, together with the “true” values of physical properties such as the band gap, EA, or IP. Data and codes presented in this work are available via GitHub at https://github.com/Tuan-H-Nguyen/Machine-Learning-Degree-Pi-Orbital. It is also worth pointing out that models that have a similar structure to the DPO model but use different rules for mapping compounds to polynomial descriptors can also be used with this script. Users can customize the script for different descriptors by providing a list of the symbols that denote the parameters to either the model or the Trainer object. In this case, the object serves as an optimizer that seeks optimal values of those designated parameters using a given set of data.

Conclusions

In this study, we present a machine learning approach to optimize the QSPR-DPO model for mapping structural information of PAHs and thienoacenes to their electronic properties. Furthermore, to expand the possibility of employing the DPO model in other chemical applications, a truncated DPO model that can be easily implemented in an automated pipeline is proposed. While the former provides the original model with a deep learning-inspired learning mechanism, the latter introduces some new rules that facilitate the extraction of DPO descriptors from linear molecular SMILES strings. Systematic assessments of the performance and accuracy of both the ML-based methodology and the truncated DPO model in comparison with previous works were done using the same data set of 248 PAHs and thienoacenes. The results indicate that the ML-based methodology can optimize DPO parameters to a reasonable accuracy of around 0.12 eV in the band gap with as few as 26 data points, which is much smaller than the 132 data points used in our previous works. The methodology rapidly converges as the number of training samples increases. The ML-based methodology also shows the same level of accuracy as our previous works in generating QSPR relationships for predicting the band gap, ionization potential, and electron affinity, to within 0.1 eV and within the accuracy of the quantum chemistry method used to generate the data. Comparison between results from the full and truncated DPO models shows that the truncated DPO model can achieve the same level of accuracy as the full model. This confirms the validity of the truncated DPO model. 
Consequently, the ML-based methodology combined with the truncated DPO model enables an automated DPO model-based pipeline that takes in SMILES strings and returns predictions on electronic properties to be implemented, thus expanding the ease of use and applicability of the DPO model to different chemical classes as well as its applicability in high-throughput screening for materials design.

1.  Extended-connectivity fingerprints.

Authors:  David Rogers; Mathew Hahn
Journal:  J Chem Inf Model       Date:  2010-05-24       Impact factor: 4.956

2.  Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies.

Authors:  Katja Hansen; Grégoire Montavon; Franziska Biegler; Siamac Fazli; Matthias Rupp; Matthias Scheffler; O Anatole von Lilienfeld; Alexandre Tkatchenko; Klaus-Robert Müller
Journal:  J Chem Theory Comput       Date:  2013-07-30       Impact factor: 6.006

3.  Support vector machine regression (LS-SVM)--an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data?

Authors:  Roman M Balabin; Ekaterina I Lomakina
Journal:  Phys Chem Chem Phys       Date:  2011-05-19       Impact factor: 3.676

4.  Prediction of Intramolecular Reorganization Energy Using Machine Learning.

Authors:  Sule Atahan-Evrenk; F Betul Atalay
Journal:  J Phys Chem A       Date:  2019-06-17       Impact factor: 2.781

5.  Systematic Comparison and Comprehensive Evaluation of 80 Amino Acid Descriptors in Peptide QSAR Modeling.

Authors:  Peng Zhou; Qian Liu; Ting Wu; Qingqing Miao; Shuyong Shang; Heyi Wang; Zheng Chen; Shaozhou Wang; Heyan Wang
Journal:  J Chem Inf Model       Date:  2021-03-12       Impact factor: 4.956

6.  Cartesian message passing neural networks for directional properties: Fast and transferable atomic multipoles.

Authors:  Zachary L Glick; Alexios Koutsoukas; Daniel L Cheney; C David Sherrill
Journal:  J Chem Phys       Date:  2021-06-14       Impact factor: 3.488

7.  Δ-machine learning for potential energy surfaces: A PIP approach to bring a DFT-based PES to CCSD(T) level of theory.

Authors:  Apurba Nandi; Chen Qu; Paul L Houston; Riccardo Conte; Joel M Bowman
Journal:  J Chem Phys       Date:  2021-02-07       Impact factor: 3.488

8.  Molecular graph convolutions: moving beyond fingerprints.

Authors:  Steven Kearnes; Kevin McCloskey; Marc Berndl; Vijay Pande; Patrick Riley
Journal:  J Comput Aided Mol Des       Date:  2016-08-24       Impact factor: 3.686

9.  Quantitative Structure-Property Relationships for the Electronic Properties of Polycyclic Aromatic Hydrocarbons.

Authors:  Lam H Nguyen; Thanh N Truong
Journal:  ACS Omega       Date:  2018-08-09

10.  Quantum Mechanical-Based Quantitative Structure-Property Relationships for Electronic Properties of Two Large Classes of Organic Semiconductor Materials: Polycyclic Aromatic Hydrocarbons and Thienoacenes.

Authors:  Lam H Nguyen; Tuan H Nguyen; Thanh N Truong
Journal:  ACS Omega       Date:  2019-04-24