Kai Leong Goh1, Atsushi Goto1, Yunpeng Lu1. 1. School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore.
Abstract
Recently, the Ramprasad group reported a quantitative structure-property relationship (QSPR) model for predicting the Egap values of 4209 polymers, which yielded a test set R2 score of 0.90 and a test set root-mean-square error (RMSE) score of 0.44 at a train/test split ratio of 80/20. In this paper, we present a new QSPR model named LGB-Stack, which performs a two-level stacked generalization using the light gradient boosting machine. At level 1, multiple weak models are trained, and at level 2, they are combined into a strong final model. Four molecular fingerprints were generated from the simplified molecular input line entry system notations of the polymers. They were trimmed using recursive feature elimination and used as the initial input features for training the weak models. The output predictions of the weak models were used as the new input features for training the final model, which completes the LGB-Stack model training process. Our results show that the best test set R2 and RMSE scores of LGB-Stack at the train/test split ratio of 80/20 were 0.92 and 0.41, respectively. The accuracy scores further improved to 0.94 and 0.34, respectively, when a train/test split ratio of 95/5 was used.
Introduction
Bandgap (Egap) is a highly important electrical property, which plays a crucial role in the rational design of functional materials.[1] Egap is defined as the energy difference between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO).[2] Mobile electrons are essential for electrical conductivity, and a smaller Egap allows electrons to transit more easily between the HOMO and LUMO.[3] Figure 1 illustrates the differences in Egap among conductors, semiconductors, and insulators. Materials with Egap approaching or equal to 0 eV are conductors, those with Egap between 1.5 and 3.0 eV are semiconductors, and materials with Egap above 4.0 eV are insulators.[4,5]
Figure 1
Different Egap of conductors, semiconductors, and insulators.
In recent decades, polymers have become increasingly significant in materials science,[6,7] and their applications have grown exponentially.[8−10] Polymers can be either insulative or conductive. Examples
of polymeric insulators are polyethylene and polypropylene, which
are widely used in electrical insulations, packaging, household items,
and automotive parts. They are highly valued for their chemical resistance,
rigidity, stiffness, and thermal stability.[11] Examples of polymeric conductors and semiconductors are polyacetylene
and its derivatives, which are used for modeling mechanisms of electrical
conduction for conjugated organic polymers. Conductive polymers are
appealing for simultaneously exhibiting the electrical and optical
properties of metals or semiconductors and retaining the mechanical
properties and processing advantages of regular polymers.[12−16]
Machine Learning and Polymers
Machine learning has grown significantly over recent years, and its ability to process and learn from massive amounts of data has been demonstrated across various fields.[17,18] The standard approach for discovering new materials involves potentially dangerous laboratory experiments and lengthy computations performed for one molecule at a time. Furthermore, the chemical space of synthetic materials is still far from being fully covered, and many new materials remain to be discovered. Hence, there is growing interest in using machine learning to map suitable representations of materials to their physical properties with known experimental data to facilitate discovery, which can save substantial time and cost.[19,20]
One important machine learning technique is supervised learning, which learns from the inputs and outputs of a training dataset in order to make predictions on a test dataset.[21] In computational chemistry, supervised learning manifests itself
in the form of the quantitative structure–property relationship
(QSPR) modeling. A QSPR model quantifies and relates the determining
factors for a particular measured property with molecular features
of a given system of chemical compounds. QSPR is essentially a mathematical
model that connects experimental property values with a set of features
derived from the molecular structures.[22] As a result, machine learning has been applied to the design of many polymers and the prediction of their properties.[23] As an example of polymer design using machine learning, Wu et al. (2019) trained a molecular design algorithm that recognizes the relationship between thermal conductivity and other target properties to identify thousands of hypothetical polymers, three of which had thermal conductivities comparable to those of state-of-the-art non-composite thermoplastics.[19] As an example of polymer property prediction, the Ramprasad group developed an online platform named Polymer Genome, which hosts their own machine learning models for rapid and accurate predictions of polymer properties.[24] Those models are trained on a carefully curated database of polymers with properties obtained from first-principles computations and experimental measurements.[25]
Aims and Objectives
Over the years, many studies have been conducted on the prediction of Egap values of various materials using machine learning methods.[26−28] Recently, the Ramprasad group (Kamal et al., 2021) reported a highly accurate QSPR model for predicting the Egap of polymers using the Gaussian process algorithm.[29] In this paper, this model is referred to as the Ramprasad model, and its results are set as the benchmark. Building on this foundation, our aim was to develop an alternative QSPR model that surpasses the accuracy of the Ramprasad model. Our results were compared directly with the benchmark as presented in the benchmark paper, without any reimplementation of the Ramprasad model.
Methodology
Dataset
The dataset in this project contains 4209 polymers and is the same as that used in the benchmark paper.[29] The authors shared the dataset publicly through the KHAZANA data repository.[30] Each data point includes an Egap value; according to the authors, the range of values covers the expected range for polymeric materials, and there is reportedly a decent level of diversity in terms of the property range and chemistry of the polymers.[29] The distribution of Egap values of the 4209 polymers is shown in Figure 2. Each polymer is also represented by a simplified molecular input line entry system (SMILES) string, in which atoms are represented by the chemical symbols of the elements; single bonds and bonded hydrogens are implied without the use of any symbols; double bonds and triple bonds are represented by “=” and “#”, respectively; branching is denoted by a substring in parentheses; cyclic structures are represented by an enclosure within two chemical symbols tagged with the same number; and aromatic structures are written like cyclic structures except that the chemical symbols are in lowercase.[31] For example, in the SMILES string c1ccccc1 (benzene), the lowercase letters denote aromatic carbons and the paired “1” digits close the ring.
Figure 2
Distribution of Egap values of the 4209 polymers.
Featurization
In order for the ML
algorithms to process the polymer structures, the polymers had to
be represented in numerical or categorical formats, which are machine-readable.
This process is known as featurization. In this project, molecular
fingerprints were calculated based on the molecular objects of the
polymers. Molecular fingerprints are high-dimensional vectors populated
with bits or integers, which are derived from the transformations
of the corresponding molecular graphs.[32,33] Most of the
pre-processing and the featurization of polymer molecular structures
were achieved using the RDKit cheminformatics software.[34]
First, the SMILES strings were converted
into 2D molecular objects. Second, four 2D molecular fingerprints
were calculated, namely, Avalon fingerprint (A-FP) and Avalon count
fingerprint (AC-FP),[35,36] layered fingerprint (L-FP), and RDKit fingerprint (R-FP).[37,38] A-FP and R-FP
are two different types of bit vector substructure fingerprints based
on hashing molecular subgraphs, which are influenced by the types
of atoms and bonds present in the molecule. AC-FP is the count vector
version of A-FP. L-FP is a variant of R-FP that uses a set of pre-defined
generic substructure patterns. A-FP, L-FP, and R-FP are bit vector
fingerprints that record the presence of a structural feature. The
calculated features in a bit vector fingerprint are a series of binary
bits, indicating the absence and presence of substructures within
the molecule. On the other hand, a count vector fingerprint records the number of times the same structural feature appears. The calculated
features in a count vector fingerprint are a collection of integers,
indicating the frequency of the presence of substructures within the
molecule. These four were chosen because they were reported to yield
reliable accuracy.[33,35−38]
The number of features
in these fingerprints can be customized,
and the most common settings in RDKit are in multiples
of 256 bits (i.e., 512, 1024, 2048, and 4096).[34] In the benchmark paper, recursive feature elimination (RFE)
was performed, and the number of features was reduced from 800 to
600. RFE is a backward feature elimination algorithm that relies on
feature importance ranking.[39] RFE was reported
to be a very popular and highly effective algorithm for feature reduction
in numerous studies, which was employed in the pipelines of many highly
accurate prediction models.[40−44] In this project, the four fingerprints were customized to 4096 bits
each to capture as many relevant features as possible and were labeled
group (A) fingerprints. These 4096-bit fingerprints were subsequently
reduced to 256 and 512 bits using RFE with the help of the scikit-learn package[45] and were
labeled groups (B) and (C) fingerprints, respectively.
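To make the featurization step concrete, the following is a minimal sketch of how the four fingerprints and the RFE trimming could be produced with RDKit, scikit-learn, and LightGBM. It is an illustration under stated assumptions rather than the exact pipeline used here; the variables smiles_list and egap_values are hypothetical stand-ins for the dataset columns.

```python
# Minimal featurization sketch: the four 2D fingerprints plus RFE trimming.
# An illustration, not the exact pipeline; `smiles_list` and `egap_values`
# are hypothetical stand-ins for the dataset columns.
import numpy as np
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Avalon import pyAvalonTools
from sklearn.feature_selection import RFE
from lightgbm import LGBMRegressor

N_BITS = 4096  # group (A) fingerprint length

def fingerprints_for(smiles):
    """Compute the four 2D fingerprints for one polymer SMILES string."""
    mol = Chem.MolFromSmiles(smiles)  # 2D molecular object
    return {
        "A-FP": pyAvalonTools.GetAvalonFP(mol, nBits=N_BITS),        # Avalon bits
        "AC-FP": pyAvalonTools.GetAvalonCountFP(mol, nBits=N_BITS),  # Avalon counts
        "L-FP": Chem.LayeredFingerprint(mol, fpSize=N_BITS),         # layered bits
        "R-FP": Chem.RDKFingerprint(mol, fpSize=N_BITS),             # RDKit bits
    }

def to_matrix(fps):
    """Stack a list of RDKit fingerprint objects into a dense feature matrix."""
    rows = []
    for fp in fps:
        arr = np.zeros((N_BITS,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

# Trim one 4096-bit fingerprint down to 512 features (group (C)) with RFE,
# using LightGBM feature importances to rank and eliminate features.
X = to_matrix([fingerprints_for(s)["R-FP"] for s in smiles_list])
selector = RFE(LGBMRegressor(), n_features_to_select=512, step=64)
X_trimmed = selector.fit_transform(X, egap_values)
```

The step parameter, which controls how many features are dropped per elimination round, is a tunable choice here, not a value reported in the paper.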
Model Training
Two ensemble algorithms
were involved in the model training process. Ensemble algorithms construct
multiple weak models with relatively low accuracies and poorer generalizations
and subsequently combine their individual strengths to create a single
stronger model with higher accuracy and much better generalization.[46] The first ensemble algorithm was LightGBM (light gradient boosting machine), which is a type of gradient boosting
machine (GBM). GBM works by consecutively fitting new decision trees
to provide a more accurate estimate of the outputs.[47] A general outline of a GBM is illustrated in Figure 3. LightGBM was chosen because it was reported to have extremely high scalability
and fast computation, outperforming most GBMs.[48] It was also reported that LightGBM can
achieve high prediction accuracies.[49,50] The second
ensemble algorithm was stacked generalization. At the first level,
multiple weak models are trained by fitting a few base learners on
the original training data. These weak models will each compute a
set of output predictions, which are later concatenated to form a
new set of input features. At the second level, a final model is trained
by fitting a final estimator on the new input features, which will
yield the final predictions.[51] Here, a
stacked generalization model was developed using LightGBM as both the base learner and the final estimator. This model was
named LGB-Stack, and its architecture
can be visualized using Figure 4.
Figure 3
General outline of a gradient boosting machine.
Figure 4
Outline of LGB-Stack, a stacked generalization
model.
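As a concrete illustration of the two-level scheme outlined in Figure 4, the following minimal sketch trains one LightGBM weak model per fingerprint at level 1 and a LightGBM final estimator on their concatenated output predictions at level 2. This is a simplified reading of the description above, not the released code; fingerprint_sets and y are assumed to be the trimmed fingerprint matrices and the Egap values.

```python
# Minimal two-level stacking sketch in the spirit of LGB-Stack; a
# simplified illustration, not the authors' released code.
# `fingerprint_sets` is a hypothetical list of four trimmed fingerprint
# matrices (one per fingerprint type) and `y` holds the Egap values (eV).
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

def lgb_stack(fingerprint_sets, y, seed=0, test_size=0.2):
    """Train four level-1 weak models and one level-2 final model."""
    weak_models, preds_train, preds_test = [], [], []
    y_train = y_test = None
    for X in fingerprint_sets:
        # Identical random_state keeps the splits aligned across fingerprints,
        # so every weak model sees the same polymers in train and test.
        X_tr, X_te, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LGBMRegressor(random_state=seed).fit(X_tr, y_train)
        weak_models.append(model)
        preds_train.append(model.predict(X_tr))
        preds_test.append(model.predict(X_te))
    # Level 2: the weak models' output predictions become the new input features.
    Z_train, Z_test = np.column_stack(preds_train), np.column_stack(preds_test)
    final = LGBMRegressor(random_state=seed).fit(Z_train, y_train)
    return weak_models, final, final.predict(Z_test), y_test
```

Sharing one random state across both levels mirrors the consistency requirement described below, where every model within a training instance uses the same seed.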
For the model training, the dataset was split into
the train and
test sets. In the preliminary evaluation at LGB-Stack level 1, the group (A) fingerprints were compared against the groups
(B) and (C) fingerprints. The train/test split ratio was set to 80/20
because the authors of the benchmark paper deemed that this ratio
produced the best results.[29] As for the
final evaluation at LGB-Stack level 2, the same range
of train/test split ratios as the benchmark was investigated, which
were 10/90, 20/80, 30/70, 40/60, 50/50, 60/40, 70/30, 80/20, and 90/10.[29] During the model training, there is a possibility
of overfitting, whereby a model fits the train data accurately but predicts poorly on the test data.[52] In the benchmark paper, k-fold cross-validation
(CV) was used for the minimization of overfitting.[29] However, it was reported that Monte Carlo CV (MCCV) produces
more accurate results than k-fold CV.[53,54] Hence, MCCV was the choice of CV in this paper. Following the benchmark,
each model had 50 pseudo-random instances of CV.[29] Using the scikit-learn package, the train/test
splits were performed and the trained models were evaluated for their
accuracies.[45] To ensure consistency, the
random state within one full LGB-Stack training instance
was set to the same number throughout the training process, from 0 to 49. This means that the four weak models at level 1 shared the same random state as the final model at level 2.
The scikit-learn package was also used for computing
the scoring metrics of model accuracy.[45] The scoring metrics chosen were R2 (coefficient
of determination) and root-mean-square error (RMSE). R2 explains the amount of variance accounted for in the
relationship between two variables, with values between 0 and 1. When R2 = 1, the model accounts for all the variance.
When R2 = 0, no variance is accounted
for. Hence, the performance of a model improves as its R2 approaches 1.[55] On the other
hand, RMSE is an indicator of the fit between the predictions and
the actual values. The RMSE scores range between 0 and ∞ and follow the units of measurement of the property being predicted.
A model with a smaller RMSE score has a greater accuracy.[56]
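The evaluation procedure described above can be sketched as follows, reusing the hypothetical lgb_stack function, fingerprint_sets, and y from the previous sketch. It illustrates the 50 pseudo-random MCCV instances (random states 0 to 49) and the two scoring metrics; it is not the exact evaluation script.

```python
# Sketch of the 50-instance Monte Carlo cross-validation and scoring;
# an illustration assuming `lgb_stack`, `fingerprint_sets`, and `y`
# from the previous sketch (hypothetical names).
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

test_r2, test_rmse = [], []
for seed in range(50):  # random states 0 to 49, one per CV instance
    _, _, y_pred, y_test = lgb_stack(fingerprint_sets, y, seed=seed)
    test_r2.append(r2_score(y_test, y_pred))
    test_rmse.append(mean_squared_error(y_test, y_pred) ** 0.5)  # RMSE in eV

print(f"best test R2:   {max(test_r2):.2f}")
print(f"best test RMSE: {min(test_rmse):.2f} eV")
print(f"mean test RMSE: {np.mean(test_rmse):.2f} eV")
```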
Results and Discussion
LGB-Stack Level 1
With reference to “level 1” in the outline of LGB-Stack in Figure 4, the weak models trained at the first level of LGB-Stack were evaluated. Table 1 shows the best accuracy scores from the preliminary weak-model training at a train/test split ratio of 80/20. Comparisons were made between the
Ramprasad model and those trained on the (A), (B), and (C) fingerprints.
More detailed tables containing the best and mean accuracy scores
for the 50 runs for each of the weak models can be found in part A
of the Supporting Information, where the
values are expressed to four decimal places.
Table 1
Best Accuracy Scores for the Weak Models Compared with the Ramprasad Model at Train/Test Split Ratio 80/20

model          R2 (train)   R2 (test)   RMSE train (eV)   RMSE test (eV)   number of bits
Ramprasad[29]  0.96         0.90        0.28              0.44             600
A-FP (A)       0.94         0.90        0.35              0.46             4096
AC-FP (A)      0.95         0.90        0.33              0.45             4096
L-FP (A)       0.96         0.91        0.29              0.44             4096
R-FP (A)       0.97         0.90        0.27              0.45             4096
A-FP (B)       0.94         0.90        0.36              0.45             256
AC-FP (B)      0.95         0.90        0.33              0.45             256
L-FP (C)       0.96         0.91        0.30              0.42             512
R-FP (C)       0.96         0.91        0.28              0.43             512
Table 1 shows that the accuracy scores for the weak models trained on the
fingerprints in groups (B) and (C) are similar to their counterparts
in group (A). These results suggest that there was no serious overfitting
when using the 4096-bit group (A) fingerprints, given the similarity
of the accuracy scores and the similarity in the differences between
the train and test accuracy scores. Moreover, the results also show that the majority of the 4096-bit features are redundant and can be removed to shorten the computation time. This also means that RFE is effective at selecting which of the 4096 bits are important to retain.
In this preliminary evaluation, it was found that the A-FP and
that the A-FP and
AC-FP of group (B) performed slightly better than those of group (C).
Conversely, for L-FP and R-FP, it was group (C) that surpassed
their counterparts in group (B). Hence, A-FP and AC-FP of group (B),
together with L-FP and R-FP of group (C), were selected for the full LGB-Stack model training due to their relatively better
accuracy scores.
LGB-Stack Level 2
With reference to “level 2” in the outline of LGB-Stack in Figure 4, the final model trained at the second level of LGB-Stack was evaluated. The output predictions of the four chosen weak models from Table 1 (A-FP and AC-FP of group (B), and L-FP and R-FP of group (C)) were concatenated to form a new set of input variables. Figure 5a shows the learning curve for LGB-Stack, which plots the mean train and test RMSE scores against the percentage share of train data in the data split, similar to that in the benchmark paper. Figure 5b–d shows the scatter plots of predicted Egap values against the actual Egap values for LGB-Stack. Table 2 shows the comparison of accuracy scores between the
Ramprasad model and LGB-Stack. The full set of best
and mean accuracy scores for LGB-Stack can be found
in part B of the Supporting Information, where the values are expressed to four decimal places.
Figure 5
Graph (a) is
the learning curve of LGB-Stack based
on RMSE; graphs (b–d) are scatter plots of the predicted Egap values against the actual Egap values for LGB-Stack. The train/test split ratios here are 80/20, 90/10, and 95/5, respectively.
Table 2
Best Accuracy Scores for LGB-Stack Trained on Trimmed Fingerprints Compared with the Ramprasad Model

model          R2 (train)   R2 (test)   RMSE train (eV)   RMSE test (eV)   train/test split ratio
Ramprasad[29]  0.96         0.90        0.28              0.44             80/20
LGB-Stack      0.98         0.92        0.22              0.41             80/20
LGB-Stack      0.97         0.92        0.23              0.40             90/10
LGB-Stack      0.97         0.94        0.24              0.34             95/5
In the learning curve of the benchmark paper, the
Ramprasad model
was reported to saturate at the 80/20 train/test split, which indicated
that the inherent data in the dataset are sufficiently representative
of polymers in the chemical space defined by the authors.[29] However, in the case of LGB-Stack, Figure 5a shows that the train and test RMSE scores continue to converge slightly even at the split ratio of 90/10, which indicates that this dataset might be even more representative of the polymers than the authors previously thought. This prompted us to go one step further and train the model at a train/test ratio of 95/5.
Table 2 shows that LGB-Stack achieved better accuracy scores for both the train
and test set at split ratio 80/20. The accuracy scores further improved
at split ratios 90/10 and 95/5. These results were expected based
on the trend observed in Figure 5a. The three scatter plots in Figure 5b–d share similar scattering patterns for
both train and test sets, which suggests that there was no serious
overfitting. Most importantly, the scatter pattern became increasingly
more compact as the train/test split ratio went from 80/20 to 95/5.
This agrees with the trend observed in Figure 5a. The results suggest that the four weak
models are accurate and diverse enough for their strengths to be combined
to obtain a good final model, which is important for a successful
stacked generalization.[46,51] Therefore, it can be
concluded that LGB-Stack is indeed a better QSPR
model for the prediction of the Egap values
of polymers than the Ramprasad model, and the primary aim of this
paper has been achieved.
Conclusions
In summary, the objective
defined in this paper has been successfully
achieved. A two-level QSPR model called LGB-Stack with a very high accuracy was developed. Four 4096-bit 2D molecular
fingerprints (A-FP, AC-FP, L-FP, and R-FP) were calculated and trimmed
using RFE. At level 1 of LGB-Stack, LightGBM was trained on the four molecular fingerprints, which resulted in
four weak models with four sets of outputs. At level 2 of LGB-Stack, LightGBM was trained on the
four sets of outputs to obtain the final output of LGB-Stack. The final results show that LGB-Stack has surpassed
the benchmark model.
In the future, we hope to increase the
number of data points in
the dataset that we use, so as to allow the machine learning algorithm
to perform better supervised learning at split ratios higher than
80/20, such as 90/10 and 95/5. We also wish to explore other similar physical properties, which might open up the possibility of transfer learning, in which a model trained on one physical property is used to make predictions on related physical properties.