Tuan H Nguyen1, Lam H Nguyen1, Thanh N Truong2. 1. Institute for Computational Science and Technology, Ho Chi Minh City 700000, Vietnam. 2. Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States.
Abstract
The degree of π orbital overlap (DPO) model has been demonstrated to be an excellent quantitative structure-property relationship (QSPR) that can map two-dimensional structural information of polycyclic aromatic hydrocarbons (PAHs) and thienoacenes to their electronic properties, namely, band gaps, electron affinities, and ionization potentials. However, the model suffers from significant limitations that narrow its applications due to inefficient manual procedures in parameter optimization and descriptor formulation. In this work, we developed a machine learning (ML)-based method for efficiently optimizing DPO parameters and proposed a truncated DPO descriptor, which is simple enough to be extracted automatically from simplified molecular-input line-entry system (SMILES) strings of PAHs and thienoacenes. Compared with the results from our previous studies, the ML-based methodology can optimize DPO parameters with four times fewer data while achieving the same level of accuracy in predictions of the mentioned electronic properties, to within 0.1 eV. The truncated DPO model also has similar accuracy to the full DPO model. Consequently, the ML-based DPO approach coupled with the truncated DPO model enables new possibilities for developing automatic pipelines for high-throughput screening and investigating new QSPR for new chemical classes.
Applications of machine
learning techniques to research in chemical
and related fields have been gaining attention recently. Some novel
applications include generating drug candidates,[1] investigating chemical phenomena,[2] and assisting theoretical calculation.[3−8] One of the most prominent tasks for applying machine learning is
physical or chemical property predictions. In this direction, quantitative
structure–activity relationship (QSAR) specializes in the prediction
of the biological activity of compounds from their structural information.[9−12] Likewise, materials scientists employ an analogous technique called
quantitative structure–property relationship (QSPR) for predicting
various properties of materials from their 2D or 3D structural data.[13−17] In most QSPR and QSAR methodologies, predicting tasks heavily depend
on a set of descriptors,[9,13] which serve as numerical
representations of structural information of molecules. Typically,
these descriptors are numerical objects obtained by transforming raw
molecular data by some predefined procedure. These descriptors can
be as simple as a list of a molecule’s composition[5,6] or as complex as matrices,[18−20] fingerprints,[15,16,21] etc.[8,14,22−26]

Recently, representation learning[27] has
been introduced to harness deep learning concepts for actively refining
the process of molecular representation extraction through learning.
For instance, neural fingerprints[28] are
modeled after the extended connectivity fingerprints (ECFPs).[21] The ECFPs are designed to solely represent each
molecular fragment as uniquely as possible. On the other hand, neural
fingerprints replace some operations in ECFP generation with learnable
modules that actively configure themselves during training to optimally
represent molecular fragments for a certain task. As such, fingerprints
of fragments are similar if the learned model can recognize the similarity
between them. Hence, the representation has more expressiveness and
interpretability. This approach was recently featured in multiple
publications and yielded excellent results.[29−32]

The degree of π orbital
overlap (DPO) model is a quantitative
structure–property relationship (QSPR) model for predicting
electronic properties of polyaromatic hydrocarbon (PAH) and thienoacene
compounds.[33,34] The DPO model relies on a representation
or descriptor called the degree of π orbital overlap (DPO).
Based on a quantum mechanical physical model, this descriptor represents
a PAH or thienoacene molecule by a polynomial of six nonzero parameters,
each associating the contribution of a topological trait of a polyaromatic
hydrocarbon (PAH) compound to its final electronic properties. The
model has been shown to predict
electronic properties, namely, electron affinities, ionization potentials,
and band gaps, to within 0.1 eV.[33,34] The original parameter
optimization procedure relies on the prior recognition that each
parameter is associated with a structural feature, which suggests a specific
data set for optimizing it by trial and error. The described optimization
procedure gives rise to several setbacks. First, it requires a large
number of data points to calibrate the model, yet the final set of
parameters may not be optimal globally. Second, the application of
the DPO model to other chemical classes is not trivial. Finally, to
date, determination of the DPO value of each molecule is done manually,
and thus, it is unrealistic to use for high-throughput screening of
massive databases.

In this work, we implement and assess the
performance of a learning
procedure for developing a DPO model for a class of molecules. The
goal of this new procedure is to provide an automated pipeline for
parameter optimization in order to remove the setbacks of
our previous works mentioned above. This is motivated by recognizing
that DPO descriptors can be treated as a learnable representation,
which can be done by applying various learning principles and techniques
such as empirical risk minimization, gradient descent, and backpropagation.
Furthermore, toward the goal of creating an automatic pipeline for
the DPO model, we devised a truncated DPO model. The analysis of results
in the optimization of the DPO descriptors reveals that several terms
in the DPO descriptor can be neglected with little consequence, thus
suggesting a truncated DPO model. With the truncated DPO model, we
demonstrate an automated pipeline that uses simplified molecular-input
line-entry system (SMILES) strings as input for molecular representations
that can be employed for high-throughput screening. Assessments for
the accuracy and applicability of the truncated DPO model in predicting
the electronic properties of PAHs and thienoacenes are then provided.
Methods
DPO Models and Descriptors
DPO Models
A DPO model is a physics-based
QSPR model for predicting electronic properties of polyaromatic compounds
such as PAHs and thienoacenes. It is based on a simple
particle-in-a-2D-box quantum mechanical model that connects a compound’s structural information
to physical properties related to its energy levels, such as its
electron affinity (EA), ionization potential (IP), and highest occupied
molecular orbital (HOMO)–lowest unoccupied molecular orbital
(LUMO) gap. Technically, input structural information can be represented
by 2D structures of polyaromatic compounds. The procedure of the model
can be described briefly as follows. First, from a polyaromatic compound’s
structure, a polynomial of six preoptimized nonzero parameters is
evaluated to obtain the structure’s DPO value. This step is
done manually by following a set of structural descriptive rules.
The DPO value is subsequently used in QSPR linear equations to obtain
the predicted properties. Three distinct linear equations of the form $y = w_1 x + w_0$, in which $x$ is the DPO value, $y$ is the property, and $w_1$ and $w_0$ are weights, correspond to the three modeled electronic properties
mentioned above. In this work, the term DPO descriptor refers
to the polynomial, as it reflects the topological structure
of the polyaromatic compound. The parameters associated with these
DPO descriptors are called DPO parameters.
Rules
for Determining DPO Descriptors
A brief explanation of the
DPO rules is presented here to introduce
some important features of the rules. Complete descriptions of the
DPO rules with examples are given in our previous works.[33,34] In this study, we explain how the DPO descriptor captures topological
structural information and how the truncated DPO descriptor arises.

The goal of the DPO rule is to assign each fused bond in the polyaromatic
compound a component in the DPO polynomial suggested by the use of
an effective 2D particle-in-a-box quantum mechanical model. Fused
bonds of the same segment are assigned similarly depending on the
topological position of the segment in the molecule. The topological
position can be defined with respect to the reference segment, which
can be uniquely determined via a set of rules. Thus, the reference
segment needs to be determined first and is generally the longest
segment. With reference to this segment, the topological position
of any segment is specified by the orientation and the overlayer order.
In terms of orientations of different segments in a PAH, a segment
can be parallel, forming an angle of 120°, or forming an angle
of 60° to the reference segment. For each of those segments,
the parameters a, b, and c are used in assigning value to its fused bonds. The overlayer
order of a segment represents its distance to the reference segment
orthogonally. As such, the farther the segment is above or below the
reference segment, the larger the order is. The sum of values assigned
to fused bonds of such segments is scaled by a factor of d raised to the power of the overlayer order k. In
summary, supposing o and k are respectively
the relative orientation and the overlayer order of a segment, the
equation below is the contribution of a given segment to the overall
DPO value and describes how fused bonds in that segment are assigned
value.

Figure 1 illustrates
examples for assigning DPO values to fused bonds using the above equation.
For compound C1, the reference segment is marked with an arrow across.
Its fused bonds are assigned as the first case of the equation with k = 0. The segment that forms with the reference segment
an angle of 120° is assigned to the second case with k = 0. Finally, the uppermost segment that is parallel with
the reference segment is two orders above the reference; therefore,
the first case of the equation is used with k = 2.
For compound C2, consider the upper rightmost segment that forms with
the reference segment (marked with an arrow) an angle of 120°.
Since it stems from the parallel segment that is one order above the
reference segment, the second case of the equation is used with k = 1 assigned to this segment. For similar cases where
there are segments that form an angle of 60° with the reference
one, refer to compounds C3 and C4.
Figure 1
Examples of assigning DPO values to different
fused bonds. The
overlayer order k and orientation o are also given for some segments.
Lastly, the parameters a* and d* are used whenever thiophene rings are in the segment. The DPO value
of a structure is the sum of DPO values of all segments with the assigned
terms to all of its fused bonds.
Truncated
DPO Descriptors
In the
DPO model, the contribution of each overlayer segment is scaled by d raised to the power of its overlayer order k. Previous studies showed that the d parameter is between 0.2 and 0.3. This suggests that segments
that are far from the reference segment can be assumed to have a negligible
contribution and thus can be dropped from the total DPO descriptor.
The truncated DPO model presented here omits the contributions of
several types of overlayer segments.

The truncated DPO inherits
most of the original DPO assignment rules, with two new simplifications.
First, if multiple segments have the same longest length, then there
is no need to go through an elaborate process to determine the unique
reference segment. Instead, the final DPO value is the average of
all DPO values where each one of such segments is treated as the reference
segment. Figure 2 illustrates
this rule.
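The averaging rule can be sketched in a few lines of Python. This is only an illustration: the function `truncated_dpo`, the callable `dpo_with_reference`, and the numerical values are hypothetical placeholders and are not taken from the published code.

```python
from statistics import mean

def truncated_dpo(dpo_with_reference, longest_segments):
    """Average the DPO values obtained by treating each of the
    equally long segments as the reference segment in turn."""
    return mean(dpo_with_reference(seg) for seg in longest_segments)

# Hypothetical DPO values of molecule A for its two candidate references.
toy_values = {"A1": 2.0, "A2": 3.0}
dpo_A = truncated_dpo(toy_values.get, ["A1", "A2"])
```

With a single longest segment the function simply returns the DPO value for that reference, so the same code path covers both cases.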
Figure 2
Illustrations of computing truncated DPO values for the case of
multiple longest segments. Molecule A has the two longest segments
shown as arrows in A1 and A2. The truncated DPO of A is the average
of DPO values when A1 and A2 are chosen as the reference segment,
i.e., $\mathrm{DPO}(A) = [\mathrm{DPO}(A1) + \mathrm{DPO}(A2)]/2$.
Second, only two types of overlayer
segments are considered. One
is the adjacent segments that have one ring in common with the reference
segment. The other is segments that have one ring in common with the
adjacent segments. Contributions from other overlayer segments are
ignored. For instance, Figure 3 shows two different structures that have the same value of
truncated DPO.
Figure 3
Illustration on the assignment of DPO values to each fused
bond;
terms in gray are neglected in the truncated DPO model.
Generating Truncated DPO from SMILES Strings
In order to employ machine learning techniques and further expand
the applicability of the DPO model to different chemical classes,
we need to provide a means to extract structural information from
polyaromatic molecules represented by the simplified molecular-input
line-entry system (SMILES).[35] A SMILES
string provides information regarding atoms and bonds of a molecule
that can be used to create graph data structures. From there, recognizing
groups of atoms that form rings can be done by employing a graph traversal
algorithm. Rings that are in a straight line can be easily recognized
and grouped into segments. Subsequently, the longest segments along
with their adjacent segments and topology can be determined easily
for the truncated DPO. The process is schematically illustrated in Figure 4.
Figure 4
Flow chart illustrating
the process of extracting truncated DPO
descriptors from SMILES strings.
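The step of grouping rings into straight-line segments can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that ring perception on the SMILES-derived graph has already produced each six-membered ring as a tuple of atom indices in cyclic order, and the function name `fused_bond_segments` and the atom numbering are hypothetical. Two fused bonds on the same hexagon lie on one straight line exactly when they are opposite each other, so linear segments are the connected groups of fused bonds linked this way.

```python
from collections import defaultdict

def fused_bond_segments(rings):
    """Group fused bonds into linear segments.

    rings: list of 6-tuples of atom indices in cyclic order (hexagons only).
    Returns a list of segments, each a sorted list of fused bonds.
    """
    # Record every ring (and bond position in that ring) containing each bond.
    occurrences = defaultdict(list)
    for r, ring in enumerate(rings):
        assert len(ring) == 6, "this sketch handles six-membered rings only"
        for i in range(6):
            bond = tuple(sorted((ring[i], ring[(i + 1) % 6])))
            occurrences[bond].append((r, i))
    # A fused bond is shared by exactly two rings.
    fused = {b: occ for b, occ in occurrences.items() if len(occ) == 2}
    positions = defaultdict(dict)  # ring index -> {bond position: bond}
    for bond, occ in fused.items():
        for r, i in occ:
            positions[r][i] = bond
    # Link fused bonds that sit on opposite sides of the same ring.
    linked = defaultdict(set)
    for pos in positions.values():
        for i, bond in pos.items():
            opposite = (i + 3) % 6
            if opposite in pos:
                linked[bond].add(pos[opposite])
    # Connected components of linked fused bonds are the linear segments.
    segments, seen = [], set()
    for start in sorted(fused):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            bond = stack.pop()
            component.append(bond)
            for other in linked[bond] - seen:
                seen.add(other)
                stack.append(other)
        segments.append(sorted(component))
    return segments

# Hypothetical atom numbering for two three-ring PAHs.
anthracene = [(0, 1, 2, 3, 4, 5), (3, 6, 7, 8, 9, 4), (7, 10, 11, 12, 13, 8)]
phenanthrene = [(0, 1, 2, 3, 4, 5), (3, 6, 7, 8, 9, 4), (6, 10, 11, 12, 13, 7)]
linear = fused_bond_segments(anthracene)
angular = fused_bond_segments(phenanthrene)
```

The longest segment is then the component with the most fused bonds; ties would be handled by the averaging rule of the truncated DPO.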
Optimization of the DPO Model
In
this work, instead of manually optimizing six DPO parameters for PAHs
and thienoacenes one at a time as in our previous works, a machine
learning technique is developed. The optimization scheme is based
on the empirical risk minimization (ERM) principle.[36] The empirical risk refers to the error of the model’s
predictions for all training samples and is assessed by the so-called
loss function, which is the mean square error (MSE) for regression
models. According to the ERM paradigm, the set of optimal parameters
of the model is obtained by minimizing the empirical
risk, which can be done with the gradient descent algorithm. The first-order
derivative of the loss function with respect to the parameters is obtained
by working backward from the loss function to the parameters using
the chain rule. Hence, this technique of finding derivatives is named
backpropagation.[37] The optimization procedure
is given step by step below.

In step 1, set $t = 0$. Initialize the six DPO parameters $p^{[0]} = (a^{[0]}, b^{[0]}, c^{[0]}, d^{[0]}, a^{*[0]}, d^{*[0]})$ to zero. We adopt the convention that the number inside the square bracket in the superscript of a quantity denotes the time step or iteration, while the subscript $i$ denotes the index of a molecule. Also, $p$ is used to collectively denote the six nonzero parameters $a$, $b$, $c$, $d$, $a^*$, and $d^*$.

In step 2, derive analytically the gradient of the DPO polynomial $g_i$ with respect to the parameters $p$ for each molecule $i$ in the training set.

In step 3, compute the DPO value $x_i^{[t]}$ and its derivative with respect to $p$, $\nabla_p x_i^{[t]}$, numerically using the set of parameters at the current time step. Note that $x_i$ denotes a numerical DPO value, while $g_i$ denotes its polynomial.

In step 4, apply the least-squares algorithm to determine $w_0^{[t]}$ and $w_1^{[t]}$ from the DPO values $x_i^{[t]}$ and the true values $y_i$ of the modeled property.

In step 5, compute the prediction $\hat{y}_i^{[t]}$ by plugging $x_i^{[t]}$ into the linear equation:

$$\hat{y}_i^{[t]} = w_1^{[t]} x_i^{[t]} + w_0^{[t]}$$

In step 6, compute the mean square error (MSE) loss function for all $N$ molecules in the training set:

$$L^{[t]} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i^{[t]} - y_i \right)^2$$

In step 7, compute the gradient of the loss function with respect to each parameter by working backward using the chain rule:

$$\frac{\partial L^{[t]}}{\partial p} = \frac{2}{N} \sum_{i=1}^{N} \left( \hat{y}_i^{[t]} - y_i \right) w_1^{[t]} \frac{\partial x_i^{[t]}}{\partial p}$$

In step 8, update the parameters according to

$$p^{[t+1]} = p^{[t]} - \alpha \frac{\partial L^{[t]}}{\partial p}$$

where $\alpha$ is the adjustable learning rate, which is 0.1 by default.

In step 9, increment $t = t + 1$ and repeat steps 3 through 6 to obtain a new loss value with the new parameters. If the predefined conditions are not satisfied, continue with steps 7 through 9; otherwise, stop and return the model with the optimized parameters.

Many conditions can be set in step 9 to mitigate training difficulties
and make the training process more automatic. For instance, to make
the algorithm practical, a maximum number of cycles and a convergence
threshold were introduced. The algorithm is succinctly illustrated
in the form of a flow chart in Figure 5.
Figure 5
Flow chart illustrating the ML-based DPO parameter
optimization
procedure.
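The nine steps can be sketched in plain Python as below. This is a minimal, dependency-free illustration rather than the published implementation (which uses Sympy, Numpy, and Sklearn): the polynomial encoding, the function names, the two-parameter toy polynomials, and the synthetic targets are all assumptions made for the example. Because the weights of the linear equation are refit by least squares at every step, the loss is stationary with respect to them, so the chain rule only needs the path through the DPO value.

```python
def evaluate(poly, p):
    """Value of a polynomial encoded as [(coefficient, {param: exponent}), ...]."""
    total = 0.0
    for coef, powers in poly:
        term = coef
        for name, exp in powers.items():
            term *= p[name] ** exp
        total += term
    return total

def gradient(poly, p):
    """Analytic partial derivatives of the polynomial for each parameter."""
    grad = dict.fromkeys(p, 0.0)
    for coef, powers in poly:
        for name, exp in powers.items():
            term = coef * exp * p[name] ** (exp - 1)
            for other, other_exp in powers.items():
                if other != name:
                    term *= p[other] ** other_exp
            grad[name] += term
    return grad

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w1 * x + w0; returns (w0, w1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return my - w1 * mx, w1

def train(polys, ys, names, lr=0.1, max_steps=150, tol=1e-5):
    p = dict.fromkeys(names, 0.0)                           # step 1
    history = []
    for _ in range(max_steps):
        xs = [evaluate(g, p) for g in polys]                # step 3
        w0, w1 = fit_line(xs, ys)                           # step 4
        res = [w1 * x + w0 - y for x, y in zip(xs, ys)]     # step 5
        history.append(sum(r * r for r in res) / len(ys))   # step 6
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break                                           # step 9: converged
        grads = [gradient(g, p) for g in polys]             # steps 2 and 3
        for name in p:                                      # steps 7 and 8
            dloss = 2.0 / len(ys) * sum(
                r * w1 * g[name] for r, g in zip(res, grads))
            p[name] -= lr * dloss
    return p, (w0, w1), history

# Hypothetical toy polynomials in two parameters, with synthetic targets.
polys = [
    [(1, {})],
    [(2, {}), (-1, {"a": 1})],
    [(3, {}), (-3, {"a": 1})],
    [(1, {}), (1, {"b": 1})],
    [(2, {}), (-1, {"a": 1}), (1, {"b": 1})],
    [(1, {}), (2, {"b": 1})],
]
true_p = {"a": 0.1, "b": -0.2}
ys = [5.0 - 0.7 * evaluate(g, true_p) for g in polys]
params, weights, history = train(polys, ys, ["a", "b"], lr=0.3, max_steps=500)
```

The stopping test mirrors the two conditions mentioned in the text: a cap on the number of cycles (`max_steps`) and a threshold on the change in MSE between consecutive steps (`tol`).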
The DPO model described above
is called the machine learning (ML)-based
DPO model. This model is implemented in the Python programming language.
The model is built with various libraries. Numpy[38] is employed to rapidly carry out vector computation. Pandas[39] is employed for working with data sets. Sklearn[40] is used to rapidly deploy linear regression
models. Sympy[41] is used for symbolic differentiation
and rapid evaluation of symbolic expressions.
Data
This work mainly concerns the
electronic properties of polyaromatic compounds, namely, PAHs and
thienoacenes. The PAH data is reused from our previous work on the
DPO model for PAH.[34] The data sets consist
of a training set containing PAH molecules of 3–6 rings and
a test set consisting of PAH molecules of 7–8 rings. Similarly,
taken from our previous work on the DPO model for thienoacene molecules[33] are two pairs of training and testing data for
two classes of thienoacenes containing either one or two thiophene
rings. Cumulatively, there are 248 data points, of which 132
are training instances and the remaining 116 are for testing.

The current methodology enables us to investigate the convergence
and stability of the parameter optimization procedure within a machine
learning framework as functions of the training set size. To make
the optimization procedure more robust, for each run, all 248 data
points are freshly split into 132 data points for training and 116
data points for testing in a random and stratified manner. This is
done by first binning all data points according to their bandgap values.
The bin width is chosen to be 0.5 eV. The range of bandgap values
starts from 1.5 eV and ends at 5.0 eV. Data points are then drawn
from each bin to assemble the training set. The number of data points
to be sampled from each bin is the size of the training set scaled
by the ratio of the bin size and the total number of data points.
Additional data points may be sampled randomly regardless of bins
to meet the specific sampling size. This sampling method is referred
to as stratified sampling throughout this work.
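A minimal sketch of this stratified sampling is given below. The function name `stratified_split`, its arguments, and the randomly generated band gaps are assumptions for illustration; only the bin width (0.5 eV), the range (1.5 to 5.0 eV), the proportional quota per bin, and the random top-up follow the description above.

```python
import random

def stratified_split(gaps, train_size, lo=1.5, hi=5.0, width=0.5, seed=None):
    """Split sample indices into training and test sets, stratified by band gap."""
    rng = random.Random(seed)
    n_bins = int(round((hi - lo) / width))
    bins = [[] for _ in range(n_bins)]
    for idx, gap in enumerate(gaps):
        b = int((gap - lo) // width)
        bins[min(max(b, 0), n_bins - 1)].append(idx)  # clamp edge values
    train = []
    for members in bins:
        # Quota: training-set size scaled by this bin's share of the data.
        quota = int(train_size * len(members) / len(gaps))
        train.extend(rng.sample(members, quota))
    # Top up at random, regardless of bins, to reach the requested size.
    chosen = set(train)
    leftover = [i for i in range(len(gaps)) if i not in chosen]
    train.extend(rng.sample(leftover, train_size - len(train)))
    chosen = set(train)
    test = [i for i in range(len(gaps)) if i not in chosen]
    return sorted(train), test

# Synthetic band gaps standing in for the 248 data points.
data_rng = random.Random(0)
gaps = [data_rng.uniform(1.5, 5.0) for _ in range(248)]
train_idx, test_idx = stratified_split(gaps, 132, seed=1)
```

Calling the function with a different `seed` for each run reproduces the idea of a fresh split per experiment.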
Results and Discussion
Convergence Rate of the
ML-Based DPO Models
To examine the convergence rate of the
ML-based DPO models, we
fitted the DPO models against the bandgap property of samples in a
full-size training set with various learning rate values. The mean
square error (MSE) of the model for the HOMO–LUMO gap is monitored
at each time step. The model is considered converged if the difference
between its MSE before and after an update is less than $10^{-5}$ eV$^2$. The maximum number of time steps is 150.

Plots
of the root-mean-square deviation (RMSD, square root of MSE) of models
for the bandgap property as a function of the number of time steps
are shown in Figure 6. Models trained with learning rate values of 0.01, 0.05, and 1.5
fail to reach convergence within the allotted number of time steps.
With learning rates of 0.01 and 0.05, the model’s error descends
too slowly, and thus, it is unable to reach the converged error of
0.1 eV within 160 time steps. On the other hand, the error at a learning
rate of 1.5 oscillates between 0.20 and 0.25 eV and is unable to converge.
Models trained at learning rate values of 0.50, 1.00, and 1.25 take
43, 24, and 126 time steps to converge, respectively, while the
model trained at a learning rate of 0.10 barely converges. Note that
at a learning rate of 1.25, the error exhibits some oscillation and
does not monotonically decrease as observed for those between 0.1
and 1.0. These results suggest that the learning rate values should
range between 0.10 and 1.0. For this study, a learning rate of 1.0
was used.
Figure 6
Plots of RMSD for the band gap of ML-based DPO models trained at
different learning rates versus the number of time steps.
Training and Testing ML-Based DPO Models
Here, we present the results of the ML-based full and truncated
DPO models trained with full-size (132 data points) training sets
and their accuracy assessed with the full-size test set (116 data
points). In order to compare them with our previous works,[33,34] DPO parameters are fitted against the same density functional theory
(DFT)-calculated HOMO–LUMO gap values. These parameters are
then used to obtain QSPR linear equations for predicting band gaps,
EA, and IP properties.

The converged values for DPO parameters
and parameters of the linear equation fitted against HOMO–LUMO
gap values are presented in Table 1 along with those from our previous studies. The machine
learning-based optimized parameters are rather close to those obtained
using the manual optimizing process mentioned. The converged parameters
for both the full and truncated DPO models are also quite close together.
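Table 1
Values of DPO Parameters and Linear Regression Weights for the ML-Based Full and Truncated DPO Models and Published Values

| model | a | b | c | d | a* | d* | w0 (intercept) | w1 (slope) |
|---|---|---|---|---|---|---|---|---|
| full DPO | 0.08 | –0.12 | 0.32 | 0.30 | 0.40 | 0.16 | 4.99 | –0.77 |
| truncated DPO | 0.07 | –0.13 | 0.36 | 0.28 | 0.41 | 0.16 | 4.99 | –0.76 |
| refs (33) and (34) | 0.05 | –0.25 | 0.33 | 0.33 | 0.50 | 0.15 | 4.68 | –0.65 |

Columns a through d* are the DPO parameters; w0 and w1 are the parameters of the linear equation.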
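As a worked example of how the weights in Table 1 are used, the snippet below evaluates the linear QSPR equation for the HOMO–LUMO gap with the ML-based full DPO weights. It assumes that the first weight column is the intercept and the second the slope, and the DPO value of 2.0 is a hypothetical input rather than one computed for a real molecule; under these assumptions the prediction is 4.99 − 0.77 × 2.0 = 3.45 eV.

```python
# Weights for the HOMO-LUMO gap from Table 1 (ML-based full DPO model),
# assuming the first weight is the intercept and the second the slope.
W0, W1 = 4.99, -0.77

def predict_gap(dpo_value):
    """Band gap (eV) predicted by the linear equation y = w1 * x + w0."""
    return W1 * dpo_value + W0

gap = predict_gap(2.0)  # 2.0 is a hypothetical DPO value
```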
Figures 7 and 8 show the linear
correlations between the optimized
DPO values and the quantum mechanical DFT-calculated values of the
HOMO–LUMO gap, EA, and IP. Excellent linear correlations between
the values of structural DPO and the physical electronic properties
further confirm the physics behind the QSPR model. Figure 9 plots correlations between
predictions from both ML-based full and truncated DPO models and the
DFT-calculated values of electronic properties of molecules in the
test set. Moreover, we repeated this experiment 20 times, each with a
freshly generated training set, and averaged the
optimized parameters and the root-mean-square deviation (RMSD) values over the runs.
For both the full and truncated DPO models, the average
and standard deviation of the RMSD over all 20
runs are 0.10 ± 0.01, 0.07 ± 0.01, and 0.06 ± 0.00
eV for the HOMO–LUMO gap, EA, and IP, respectively. These errors are
of the same magnitude as those in our previous studies and
are within the accuracy range of the DFT level of theory, which was
used to generate the data.
Figure 7
Plots of linear correlations between the optimized
full DPO values
and the DFT-calculated properties of (A) HOMO–LUMO gap, (B)
electron affinity, and (C) ionization potential of molecules from
the training set.
Figure 8
Plots of linear correlations
between the optimized truncated DPO
values and the DFT-calculated properties of (A) HOMO–LUMO gap,
(B) electron affinity, and (C) ionization potential of molecules from
the training set.
Figure 9
Plots of QSPR-predicted
values versus DFT-calculated electronic
properties of the HOMO–LUMO gap, electron affinity, and ionization
potential, respectively from the top to bottom for compounds in the
test set. Plots (A)–(C) (left side) are from the full DPO model,
and plots (D)–(F) (right side) are from the truncated DPO model.
Two general problems in machine learning that models
often encounter
are underfitting, which is identified by the low training accuracy,
and overfitting, which is recognized by a high training accuracy but
low test accuracy. The results presented in Figure 9 and the average results of multiple test
runs indicate that the ML-based DPO models do not suffer from either
the overfitting or underfitting case. The former is no surprise since
the DPO model has been based on a quantum mechanical model that connects
structural variables to their physical properties.[33,34] The latter problem is entirely overcome thanks to the use of the
stratified data splitting approach.

The above results justify
the present truncation to simplify the
calculation of the total DPO value for a given molecule. Furthermore,
the ML-based approach provides equally accurate results to the previous
methodology in deriving QSPR relationships but is more robust and
can be automated.

Next, we investigated the stability of the
ML-based models, particularly
the robustness of the ML-based methodology in optimizing the DPO model;
namely, we examined the effect of the training set size on the accuracy of the
models. To this end, using the stratified data splitting
approach, different training sets with sizes of 13, 26, 40, 53, 66,
79, 106, and 132 data points were constructed. The performances of
those models for the task of predicting values of the HOMO–LUMO
gap, EA, and IP were then assessed using the full-size test set of
116 data points. Similar to the above, for each training data size,
20 experiments were done with different random data splits (data ensembles
from 132 data points), and the results were then averaged. The plots
of average RMSDs for both the truncated and full DPO models as a function
of training set sizes are presented in Figure 10. Figure 11 presents the plots of the variation of DPO parameters
versus the sizes of the training set.
Figure 10
Plots of RMSDs and their
standard deviations (as error bars) of
the full and truncated DPO models of data from the test set versus
the sizes of the training set. (A) HOMO–LUMO gap, (B) electron
affinity, and (C) ionization potential.
Figure 11
Plots
of DPO parameter values versus the training set size for
(A) full DPO model and (B) truncated DPO model. The stars are values
from our previous studies.[33,34]
Figure 10 shows
that the RMSDs for both the full and the truncated DPO models decrease
rapidly as the training set size increases. In particular, both ML-based
DPO models converge to an RMSD value between 0.10 and 0.11 eV for
the band gap and 0.06 and 0.07 eV for both the EA and IP by about
53 training data points, which is less than half of the full-size
training set. Even with a smaller data set of 26 training samples,
both models already achieve a satisfactory RMSD of around 0.12–0.13
eV for the band gap and 0.07–0.08 eV for the EA and IP. Similarly,
as shown in Figure 11, both models’ parameters nearly converge to their respective
optimal values with only 26 training points.

The result indicates
that the machine learning methodology presented
in this study is much more robust and efficient in optimizing the
DPO parameters as compared to our previous manual approach. Furthermore,
the fact that nearly identical results were achieved for both the
full and truncated DPO models justifies the truncation approximations
used and the applicability of the truncated DPO model in automated
pipelines for high-throughput applications. In addition, the success
of the truncated DPO model also indicates that for electronic properties
of PAH or thienoacene, only the longest segment and its nearest neighbor
segments are important. Contributions from other segments are negligible.
How to Use the ML-Based DPO Models
To encourage
applications of the ML-based DPO model presented here,
we provided a set of tools that users can easily modify to apply to
other physical properties of aromatic hydrocarbons or other classes
of molecules. In particular, code for the ML-based DPO model can be
found in the GDDPO.py script. Data splitting, model training, and
testing can be done in a couple of lines using the all-in-one Python
Object Trainer from the trainer.py script. The data set consists of
either manually formulated DPO polynomials or SMILES strings representing
molecules and their “true” values of physical properties
such as the band gap, EA, or IP. Data and codes presented in this
work are available via GitHub at https://github.com/Tuan-H-Nguyen/Machine-Learning-Degree-Pi-Orbital.

In addition, it is also worth pointing out that models that
have a similar structure to the DPO model but with different rules
for mapping compounds to polynomial descriptors can also be used with
this script. Note that users can customize for different descriptors,
which requires a list of symbols used to denote parameters to be provided
to either the model or the Trainer object. In this case, the object
serves as an optimizer for seeking optimal values of those designated
parameters using a given set of data.
Conclusions
In this study, we present a machine learning approach to optimize
the QSPR-DPO model for mapping structural information of PAHs and
thienoacenes to their electronic properties. Furthermore, to expand
the possibility of employing the DPO model to other chemical applications,
a truncated DPO model that can be easily implemented in an automated
pipeline is proposed. While the former provides the original model
with a deep learning-inspired learning mechanism, the latter introduces
some new rules that facilitate the extraction of DPO descriptors from
linear molecular SMILES strings. Systematic assessments on the performance
and accuracy of both the ML-based methodology and the truncated DPO
model in comparison with previous works were done using the same data
set of 248 PAHs and thienoacenes.

The results indicate that
the ML-based methodology can optimize
DPO parameters to a reasonable accuracy of around 0.12 eV in the band
gap with as little as 26 data points, which is much smaller than the
132 data points used in our previous works. The methodology rapidly
converges as the number of training samples increases. The ML-based
methodology also shows the same level of accuracy in generating QSPR
relationships for predicting the band gap, ionization potential, and
electron affinity electronic properties to within 0.1 eV as compared
to our previous works and within the accuracy of the quantum chemistry
method used to generate the data. Comparison between results from
the full and truncated DPO models shows that the truncated DPO model
can achieve the same level of accuracy as the full model. This confirms
the validity of the truncated DPO model. Consequently, the ML-based
methodology combined with the truncated DPO model makes it possible to implement an automated
DPO model-based pipeline that takes in SMILES strings and returns
predictions of electronic properties, thus expanding
the ease of use and applicability of the DPO model to different chemical
classes as well as its applicability in high-throughput screening
for materials design.