
QSPR Studies on the Octane Number of Toluene Primary Reference Fuel Based on the Electrotopological State Index.

Long Jiao¹, Huanhuan Liu¹, Le Qu^2,3, Zhiwei Xue⁴, Yuan Wang¹, Yanzhao Wang¹, Bin Lei¹, Yunlei Zang¹, Rui Xu¹, Zhen Zhang¹, Hua Li¹, Omar Abdulaziz Ahmed Alyemeni¹.

Abstract

Quantitative structure-property relationship (QSPR) models for predicting the octane number (ON) of toluene primary reference fuel (TPRF; blends of n-heptane, isooctane, and toluene) were investigated. The electrotopological state (E-state) index of the TPRF components was computed and weight-summed to generate the quantitative descriptor of the TPRF samples. The partial least squares (PLS) technique was used to build the regression model between the ON and the weight-summed E-state index of the investigated samples. QSPR models for the research octane number (RON) and motor octane number (MON) of TPRF were built. The prediction performance of the obtained PLS models was assessed by external test set validation and leave-one-out cross-validation. The validation results demonstrate that the proposed PLS models are feasible for predicting the ON, both RON and MON, of TPRF. In addition, several other QSPR models for the ON of TPRF were developed by employing the stepwise regression and Scheffé polynomials methods, and the prediction performance of these models was compared with that of the PLS models. The comparison shows that the proposed PLS models are slightly better than the multiple linear regression models and Scheffé models. The combination of the E-state index and PLS is thus an easy-to-use and promising method for studying and forecasting the ON of TPRF.


Year: 2020 PMID: 32149214 PMCID: PMC7057327 DOI: 10.1021/acsomega.9b03139

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Octane number (ON), also known as octane rating, is a standard index widely used to measure the resistance of an engine or fuel to knock. The higher the ON, the more compression a fuel can withstand before detonating (igniting).[1−3] In other words, the higher the ON, the more compression is required to ignite the fuel. Fuels with different ONs must be used in different engines. In broad terms, fuels with a higher ON are used in high-performance internal-combustion engines that demand higher compression ratios. Using a fuel with a lower ON than an engine requires may lead to engine knocking. With the increasing demand for more efficient vehicles and the continuously increasing global consumption of high-quality anti-knock fuels, studies on the determination and control of the ON of fuels have attracted much attention in recent decades. There are two standard empirical measurements for ON, known as research octane number (RON) and motor octane number (MON). The two numbers quantify the anti-knock quality of fuels under two different working conditions.[4−9] Since commercial fuels are mixtures of numerous compounds whose composition varies with crude oil origin and refinery process, understanding the chemical properties and behaviors of fuels remains a complicated task.
To solve this problem, researchers usually employ mixtures (surrogates) to mimic and study the properties and behaviors of real fuels under certain conditions.[9−18] Presently, blends of n-heptane, isooctane, and toluene (methylbenzene in some works), known as toluene primary reference fuel (TPRF), are widely applied to mimic the behavior of gasoline in many studies.[13−18] ASTM D2699 (for RON) and ASTM D2700 (for MON) provide the standard methods for determining the ON of fuels, which are rigorous and complicated experiments conducted with ON testers following the procedure settings in the two standards.[6−8] Besides the standard methods, several electrochemical, chromatographic, and spectroscopic methods have been proposed[19−21] for the determination of ON. Although these methods are simpler than the ASTM standard methods, experimentally determining the ON of fuels is, by and large, inefficient and sometimes dangerous.[5,9,22−24] It is still necessary to set up more reliable, easy-to-use, and cost-effective methodologies for determining the ON of fuels. Therefore, the quantitative structure–property relationship (QSPR) technique, which is convenient and cost-effective for predicting the properties of chemical substances, has gained much attention in fuel research, in an effort to reduce the testing burden. A range of QSPR models that correlate the ON of the component compounds of fossil fuels, including paraffins, naphthenes, and olefins, with basic physical and chemical parameters have been reported.[24−27] QSPR studies on blending systems of fuels, such as primary reference fuels and TPRF, have gained much attention, whereas QSPR models based on the molecular structures of fuel components are rarely reported.
The electrotopological state (E-state) index is an atomic-level chemometric topological index in which both the electronic characteristics and the topological environment of each non-hydrogen atom in a molecule are encoded together.[28−33] Because the E-state index encodes structural information about the valence electronic structure of an atom as well as its position in the topological network of a molecule, it is effective at recognizing atoms or molecular fragments that have a significant influence on a molecular property. Moreover, the E-state index is easy to calculate and present, and it encodes molecular structures concisely and conveniently. At present, the E-state index has shown comprehensive and promising usage in QSPR studies. This paper, therefore, investigated and proposed QSPR models associating the ON of TPRF with the component composition and molecular structure of TPRF. The E-state index was utilized as the structural descriptor of n-heptane, isooctane, and toluene and therefore serves as the independent variable of the regression model. Partial least squares (PLS) regression was used to develop the calibration model between the E-state index and the ON of TPRF. The proposed models can be used to forecast the ON of TPRF, requiring only the structure and the mole composition of each component in TPRF.

Experimental Section

Data Sets and Software

A total of 47 TPRF samples were studied, and their chemical composition is listed in Table 1. The observed RON values of 47 samples and the observed MON values of 45 samples were taken from refs 5 and 6 and are shown in Table 1.

Table 1

Composition, RON, and MON of the Investigated Samples

no.	mole fraction (%)b			RON			MON
	A	B	C	Obs.	Pred.	RE (%)	Obs.	Pred.	RE (%)
1	1.5	5.3	93.2	112.8	112.30	–0.44	106.0	101.31	–4.42
2	1.6	12.4	86.0	108.5	111.48	2.75	101.0	101.36	0.36
3a	3.0	4.0	93.0	111.8	110.80	–0.89	104.4	100.23	–3.99
4	4.5	2.6	92.9	109.5	109.24	–0.24	102.4	98.35	–3.96
5	4.8	14.1	81.1	107.6	107.32	–0.26	96.6	97.99	1.44
6	6.0	1.3	92.7	108.0	107.68	–0.30	101.0	96.83	–4.13
7	6.2	8.2	85.6	105.4	106.61	1.15	97.5	96.63	–0.89
8a	8.4	22.3	69.4	101.6	102.13	0.52	91.2	93.83	2.88
9	8.7	31.0	60.3	99.8	100.55	0.75	90.9	93.18	2.51
10	8.7	10.5	80.8	103.3	103.34	0.04	92.6	94.05	1.57
11	9.3	5.5	85.2	103.1	103.35	0.24	94.0	93.53	–0.50
12	9.8	56.5	33.7	95.2	95.83	0.66	90.5	90.83	0.36
13a	10.0	65.0	25.0	93.7	94.35	0.69	90.3	90.21	–0.10
14	10.9	38.8	50.3	96.7	96.93	0.24	88.7	90.52	2.05
15	12.3	2.7	84.9	101.0	100.12	–0.87	90.5	90.47	–0.03
16	12.3	34.0	53.7	96.3	95.94	–0.37	88.3	89.20	1.02
17	12.6	7.0	80.4	99.8	99.29	–0.51	88.7	90.11	1.59
18a	13.0	27.0	60.0	96.3	96.11	–0.20	87.3	88.73	1.64
19	13.3	47.2	39.4	92.8	92.93	0.14	86.9	87.51	0.70
20	13.5	12.0	74.5	98.0	97.58	–0.43	87.4	88.94	1.76
21	13.7	42.8	43.5	93.0	93.15	0.16	86.7	87.34	0.74
22	15.0	35.0	50.0	93.0	92.70	–0.32	85.8	86.30	0.58
23a	16.0	56.5	27.5	89.1	88.60	–0.56	85.6	84.29	–1.53
24	16.5	3.5	80.0	96.9	95.20	–1.75	85.2	86.16	1.13
25	16.6	14.7	68.7	95.0	93.60	–1.47	83.7	85.57	2.23
26	16.9	59.9	23.3	87.0	87.20	0.23	84.0	83.15	–1.01
27	17.0	63.0	20.0	86.6	86.55	–0.06	84.2	82.74	–1.73
28a	17.0	69.0	14.0	85.7	85.74	0.05	84.6	82.69	–2.26
29	17.3	23.0	59.7	92.1	91.69	–0.45	82.9	84.44	1.86
30	18.8	66.7	14.4	84.5	83.79	–0.84	82.0	80.60	–1.71
31	21.6	28.7	49.7	86.2	85.96	–0.28	79.6	79.64	0.05
32	24.7	7.3	68.0	85.3	85.33	0.04	75.2	77.46	3.01
33a	25.7	15.2	59.1	83.8	83.09	–0.85	76.2	75.93	–0.35
34	26.3	34.9	38.8	79.0	79.72	0.91	74.0	74.45	0.61
35	30.6	27.2	42.2	76.2	75.78	–0.55	70.9	70.24	–0.93
36	31.4	41.6	27.0	73.6	72.86	–1.01	70.0	68.71	–1.84
37	32.0	18.9	49.1	75.1	75.33	0.31	68.0	69.23	1.81
38a	34.0	7.5	58.5	75.5	74.58	–1.22	68.0	67.56	–0.65
39	36.9	49.0	14.1	66.0	65.50	–0.76	64.4	62.51	–2.93
40	38.8	22.9	38.2	66.1	66.90	1.21	61.0	61.85	1.39
41	42.2	9.3	48.5	63.7	64.99	2.03	58.0	59.01	1.74
42	46.2	27.3	26.5	58.0	57.79	–0.36	53.9	53.90	0
43a	51.0	11.3	37.7	53.2	54.47	2.39	48.0	49.57	3.27
44	54.2	32.0	13.8	48.0	47.93	–0.15	46.7	45.00	–3.64
45	60.5	13.4	26.1	42.0	43.46	3.48
46	63.8	14.2	22.0	39.0	39.38	0.97	37.0	35.59	–3.81
47	70.8	15.7	13.6	32.0	30.84	–3.63

a: Samples in Subsets R2 and M2.

b: A, B, and C denote n-heptane, isooctane, and toluene, respectively.

All computations were done with programs coded in Matlab (version R2014a) on a computer with an i7-7700 processor and 8.00 GB RAM.

Construction, Validation, and Evaluation of the Calibration Model

Partial Least Squares

PLS is a widely used multivariate calibration technique that mathematically relates the response variables (dependent variables), Y, to the explanatory variables (independent variables), X. The general idea behind PLS is to find the multidimensional direction in the space of the X matrix that explains the direction of maximum multidimensional variance in the space of the Y matrix. In PLS, the decomposition of the explanatory variables, $X = TP^{T}$, and the decomposition of the response variables, $Y = UQ^{T}$, are performed simultaneously. Herein, T, U, P, and Q denote the matrices of X-scores (also known as latent variables, LVs), Y-scores, X-loadings, and Y-loadings, respectively. The basic strategy of PLS is that Y should be taken into account when constructing T so as to maximize the covariance between T and U. The LVs are orthogonal to each other, and their number is at most equal to the rank of X. Hence, PLS outputs the matrix of regression coefficients B as well as T, U, P, and Q and obtains the regression model $Y = UQ^{T} = TBQ^{T} = XPBQ^{T}$ by performing regression between T and U. Because PLS considers both the outer relations (the X and Y matrices individually) and the inner relation (correlating the X and Y matrices) in the model construction, it can lead to a maximum correlation between T and U. Due to this basic strategy, PLS is sometimes called “projection to latent structures”. PLS is a good alternative to classical multivariate calibration techniques, such as multiple linear regression (MLR) and principal component regression (PCR).[34−36] The prediction performance of PLS is usually better than that of MLR and PCR.
The advantages of PLS include: (a) it can overcome the multicollinearity of the original data, (b) it can keep the useful information in the calibration variables while excluding the redundant information, (c) it is distribution-free, and (d) it can be applied to small data sets even if the number of samples is less than the number of variables.
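The decomposition described above can be illustrated with a minimal sketch. It assumes a single response variable (the PLS1 case) and a NIPALS-style deflation; the paper's own Matlab programs are not shown, so the function names `pls1_fit` and `pls1_predict` and the toy data are hypothetical and serve only as an illustration. The toy X matrix deliberately contains collinear columns, and two latent variables are enough to reproduce the response.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def column(matrix, j):
    """Return column j of a list-of-rows matrix."""
    return [row[j] for row in matrix]


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model by successive deflation (illustrative)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(column(X, j)) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X space with maximum covariance with y.
        w = [dot(column(E, j), f) for j in range(m)]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                  # X-scores (one LV)
        tt = dot(t, t)
        p = [dot(column(E, j), t) / tt for j in range(m)]  # X-loadings
        c = dot(f, t) / tt                              # inner regression coefficient
        # Deflate X and y so that the next LV is orthogonal to this one.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - c * t[i] for i in range(n)]
        components.append((w, p, c))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response of one sample from a fitted model."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, c in components:
        t = dot(e, w)
        y_hat += c * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


# Toy data (assumption): columns 3 and 4 are exact copies/multiples of 1 and 2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [[a, b, a, 2.0 * b] for a, b in zip(x1, x2)]
y = [3.0 * a - 2.0 * b + 5.0 for a, b in zip(x1, x2)]

model = pls1_fit(X, y, n_components=2)
fitted = [pls1_predict(model, row) for row in X]
new_prediction = pls1_predict(model, [7.0, 3.0, 7.0, 6.0])
```

In this sketch the collinear columns cause no numerical trouble, which is the property listed as advantage (a); an ordinary least-squares fit on the same four columns would require inverting a singular matrix.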

Validation Approaches

Leave-one-out cross-validation (LOO-CV) is commonly utilized to estimate the forecasting performance and robustness of a multivariable regression model.[37] In practice, a regression task always relies on a limited number of available samples. The idea of LOO-CV is to utilize all the available samples to assess the regression model. Its advantage over random sub-sampling techniques is that each sample is used for validation exactly once. When performing LOO-CV with a data set containing n samples, the validation procedure consists of n cycles. Each cycle i (i = 1, ..., n) comprises three steps: (1) taking sample i out as the temporary “test set”, which does not participate in the establishment of the regression model; (2) building a regression model with the remaining (n – 1) samples; and (3) validating the obtained model with sample i by computing its prediction error. After all n samples have been tested in turn, the prediction error for the whole data set is computed.

As a traditional validation algorithm, external test set validation[38,39] has been widely used to estimate the forecast accuracy of calibration models. This algorithm assigns all the available samples of the working data set to a calibration set (or training set) and a test set. Usually, the samples in the two subsets are randomly selected from the whole working data set. The calibration set is used to build the regression model. The test set is designed to give an independent assessment of the forecast ability of the model. Hence, the test set cannot participate in the model development and is thus fully independent of the calibration set.

Any validation method carries the risk of overestimating the predictive performance of a regression model. Hence, external test set validation and LOO-CV are often used together in order to reduce this risk. When both are performed jointly on a data set, the whole data set is usually first divided into a calibration set and a test set in order to carry out the external test set validation. After that, all the samples of the calibration set are predicted in turn so as to complete the LOO-CV.[40−42]
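The LOO-CV cycle can be sketched in a few lines. The sketch below is illustrative only: it uses a one-variable least-squares line as a stand-in model rather than the PLS model of this work, and the names `fit_line`, `predict_line`, and `loo_cv` as well as the toy data are assumptions.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (stand-in model, not the paper's PLS)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def loo_cv(xs, ys, fit, predict):
    """Leave each sample out once, refit on the rest, and predict the left-out one."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]      # step 1: sample i is the temporary test set
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)      # step 2: build the model on n - 1 samples
        predictions.append(predict(model, xs[i]))  # step 3: predict sample i
    # After all n cycles: prediction error over the whole data set.
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys))
    return predictions, rmse


# Toy data (assumption), roughly linear with small noise.
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [1.0, 3.1, 4.9, 7.2, 9.0]
toy_predictions, toy_rmse = loo_cv(toy_x, toy_y, fit_line, predict_line)
```

Passing `fit` and `predict` as arguments keeps the validation loop independent of the model, so the same loop could wrap a PLS, MLR, or Scheffé fit.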

Statistics and Applicability Domain of the Calibration Model

Root mean squared relative error (RMSRE), root mean squared error (RMSE), and concordance correlation coefficient (CCC)[43−46] were employed collectively to assess the forecasting performance of the obtained models. Equations 1 and 2 show the definitions of RMSRE and RMSE

$$\mathrm{RMSRE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} RE_i^{2}} \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} AE_i^{2}} \tag{2}$$

where $RE_i$ denotes the prediction relative error of the ith sample, $AE_i$ denotes the absolute error of the ith sample, and n is the number of samples. The calculation of CCC is shown in eq 3

$$\mathrm{CCC} = \frac{2\sum_{i=1}^{n_{EXT}} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y})^{2} + \sum_{i=1}^{n_{EXT}} (\hat{y}_i - \bar{\hat{y}})^{2} + n_{EXT}(\bar{y} - \bar{\hat{y}})^{2}} \tag{3}$$

In eq 3, $y_i$ and $\hat{y}_i$ represent the observed and predicted ON value of each sample, respectively; $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the observed and predicted ON, respectively; and $n_{EXT}$ denotes the number of samples in the external test set. PRESS denotes the predictive error sum of squares, and TSS represents the total sum of squares for the whole data set. The advantage of CCC is that it is more stable and precautionary than $r_m^{2}$, $Q^{2}$, and the Golbraikh–Tropsha method. When different validation measures are discordant or in conflict for assessing QSPR models, CCC helps in deciding whether or not to accept the developed models.

Applicability domain (AD) is a theoretical region that is applied for estimating the uncertainty of a calibration model on the basis of the data of the training set.[47] The Williams plot is one of the most popular distance-based approaches to display the AD of a calibration model.[48] In the Williams plot, the leverage values of the whole data set are calculated according to eq 4

$$H = X(X^{T}X)^{-1}X^{T} \tag{4}$$

where X is the descriptor matrix and $X^{T}$ is its transpose. The leverage value of the ith sample, $h_i$, is the ith diagonal element of H. A leverage value greater than the critical value h* indicates that the sample is likely to differ from the other samples and to be located outside the optimum sample space. The h* value is calculated according to eq 5

$$h^{*} = \frac{3(p + 1)}{n} \tag{5}$$

where p is the number of descriptors in the QSPR model (namely, the number of independent variables in the regression model), and n is the number of samples utilized in building the model.
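As a worked illustration of these statistics, the sketch below evaluates RMSRE, RMSE, and CCC for the nine RON test-set samples marked in Table 1, and the leverage threshold h* for several descriptor counts. The function names are hypothetical, relative errors are assumed to be expressed in percent (as in Table 1), and the CCC expression is the standard concordance form assumed here because the original equation is not reproduced in this text. Because the table values are rounded, small deviations from the reported statistics are to be expected.

```python
import math

# Observed and predicted RON of samples 3, 8, 13, 18, 23, 28, 33, 38, 43 (Table 1).
ron_obs = [111.8, 101.6, 93.7, 96.3, 89.1, 85.7, 83.8, 75.5, 53.2]
ron_pred = [110.80, 102.13, 94.35, 96.11, 88.60, 85.74, 83.09, 74.58, 54.47]


def rmsre(obs, pred):
    """Root mean squared relative error, with relative errors in percent."""
    rel = [100.0 * (p - o) / o for o, p in zip(obs, pred)]
    return math.sqrt(sum(r * r for r in rel) / len(rel))


def rmse(obs, pred):
    """Root mean squared (absolute) error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def ccc(obs, pred):
    """Concordance correlation coefficient between observed and predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cross = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    return 2.0 * cross / (ss_o + ss_p + n * (mean_o - mean_p) ** 2)


def leverage_threshold(p, n):
    """Critical leverage h* = 3(p + 1)/n for p descriptors and n calibration samples."""
    return 3.0 * (p + 1) / n


test_rmsre = rmsre(ron_obs, ron_pred)
test_rmse = rmse(ron_obs, ron_pred)
test_ccc = ccc(ron_obs, ron_pred)
h_star_six_descriptors = leverage_threshold(6, 38)
```

With six descriptors and 38 calibration samples, `leverage_threshold` gives 21/38, which matches the h* = 0.5526 quoted for the RON PLS model; two, ten, and three independent variables give the thresholds quoted for the MLR and the two Scheffé models.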

E-State Index

The detailed theory of the E-state index has been described in several articles;[28−31] this section therefore presents only a brief outline of this topological index. When calculating the E-state index of a molecule, the topological graph (also known as a geometric graph) of the molecule is generated and used. In the molecular topological graph, each non-hydrogen atom is regarded as a point and each chemical bond is considered an edge. Each atom of the molecule is encoded by the terms of the E-state index. The E-state index considers and quantifies both the intrinsic valence electronic state of an atom and the electronic influence of all other atoms in the molecule on that atom, which is modulated by the topological character of the molecule. Each term of the E-state index for an atom consists of an intrinsic value for the atom (in its valence state) and a value for its perturbation by the other atoms of the molecule. Equation 6 defines the intrinsic value for a non-hydrogen atom

$$I = \frac{(2/N)^{2}\,\delta^{V} + 1}{\delta} \tag{6}$$

In eq 6, N indicates the principal quantum number of the valence shell of the atom. The items $\delta^{V}$ and $\delta$ are the connectivity values. Equations 7 and 8 show the calculations for $\delta$ and $\delta^{V}$, respectively

$$\delta = \sigma - h \tag{7}$$

where $\sigma$ denotes the number of electrons in σ orbitals of the atom and h denotes the number of hydrogen atoms bonded to the atom, and

$$\delta^{V} = Z^{V} - h = \sigma + \pi + n - h \tag{8}$$

where $Z^{V}$, $\pi$, and n denote the number of valence electrons, the number of electrons in π orbitals, and the number of electrons in lone pairs, respectively. Equation 9 shows the calculation of the E-state index (S) for an atom i in the molecule

$$S_i = I_i + \sum_{j=1,\, j \neq i}^{m} \frac{I_i - I_j}{r_{ij}^{2}} \tag{9}$$

where $r_{ij}$ denotes the shortest topological distance between the ith and jth atoms in the topological graph (counted as 1 for two adjacent non-hydrogen atoms) plus 1, and m represents the number of non-hydrogen atoms in the molecule. Then, S is summed up for each type of atom. All the terms of S comprise the complete E-state index.
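The calculation can be followed step by step with a short sketch. It assumes the Kier–Hall form of the intrinsic value and of the perturbation term summarized above, encodes each molecule as a hydrogen-suppressed graph, and uses hypothetical names (`intrinsic`, `estate`, and the hand-written adjacency lists). For n-heptane and toluene it should give atom-type sums close to the values tabulated for these compounds in this work.

```python
from collections import deque


def intrinsic(delta_v, delta, n_quantum=2):
    """Intrinsic state I = ((2/N)^2 * delta_v + 1) / delta; N = 2 for carbon."""
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta


def graph_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every other atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def estate(adjacency, deltas):
    """Atom-level E-state values; deltas holds (delta_v, delta) per atom."""
    I = [intrinsic(dv, d) for dv, d in deltas]
    values = []
    for i in range(len(adjacency)):
        dist = graph_distances(adjacency, i)
        # r_ij = bond count + 1, so adjacent atoms have r_ij = 2.
        perturbation = sum((I[i] - I[j]) / (dist[j] + 1) ** 2
                           for j in range(len(adjacency)) if j != i)
        values.append(I[i] + perturbation)
    return values


# n-Heptane: a chain of seven carbons (CH3, five CH2, CH3).
heptane_adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
heptane_deltas = [(1, 1)] + [(2, 2)] * 5 + [(1, 1)]
heptane = estate(heptane_adj, heptane_deltas)
heptane_ss_ch3 = heptane[0] + heptane[6]
heptane_sss_ch2 = sum(heptane[1:6])

# Toluene: atom 0 is CH3, atom 1 the substituted ring carbon, atoms 2-6 aromatic CH.
toluene_adj = [[1], [0, 2, 6], [1, 3], [2, 4], [3, 5], [4, 6], [5, 1]]
toluene_deltas = [(1, 1), (4, 3)] + [(3, 2)] * 5
toluene = estate(toluene_adj, toluene_deltas)
toluene_ss_ch3 = toluene[0]
toluene_saas_c = toluene[1]
toluene_saa_ch = sum(toluene[2:])
```

Because the perturbation terms cancel pairwise, the sum of all atom-level values equals the sum of the intrinsic values, which offers a quick sanity check on any hand-built graph.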

Results and Discussion

Description of the TPRF Samples

The E-state index values of n-heptane, isooctane, and toluene, which were computed according to the approach described in the E-State Index section, are listed in Table 2. As shown in Table 2, the values of the E-state index differentiate the three compounds. Building a QSPR model related to the three compounds on the basis of the E-state index should therefore be reasonable.

Table 2

E-State Index of n-Heptane, Isooctane, and Toluene

compound	SaaCH	SsCH₃	SaasC	SssCH₂	SsssCH	SssssC
n-heptane	0	4.4914	0	7.0086	0	0
isooctane	0	11.3924	0	1.3264	0.8426	0.5220
toluene	10.2616	2.0833	1.3218	0	0	0

Nevertheless, directly employing the E-state index of the three compounds as independent variables to build the QSPR model of these samples is unreasonable, because this index cannot describe the content of each component in the samples. It is always important to reduce the risk of chance correlation by building QSPR models with adequate chemical information. The weight-summed E-state index of the three components was therefore proposed and calculated to describe the investigated samples. The weight-summed E-state index was computed according to eq 10

$$S = c_1 S_1 + c_2 S_2 + c_3 S_3 \tag{10}$$

where S denotes the terms of the weight-summed E-state index; $S_1$, $S_2$, and $S_3$ indicate the E-state index of the three compounds (components) in a TPRF sample; and $c_1$, $c_2$, and $c_3$ denote the mole fractions of the three components in a TPRF sample. The weight-summed E-state index S of the studied TPRF samples is listed in Table S1. Since S quantitatively describes the chemical composition and structure of each sample to some extent, using it as the descriptor of TPRF to build the QSPR model of these samples should be practicable. The quantitative relationship between ON, including RON and MON, and the weight-summed E-state index of the investigated samples was studied.
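The weight-summing step can be sketched as follows, using the E-state values of Table 2 and the composition of sample 1 in Table 1. The mole fractions are assumed here to enter as fractions (percent divided by 100); since Table S1 is not reproduced in this text, the scaling and the names `ESTATE_TERMS` and `weight_summed_estate` are assumptions.

```python
# E-state terms of the pure components (Table 2), in the order
# SaaCH, SsCH3, SaasC, SssCH2, SsssCH, SssssC.
TERM_NAMES = ["SaaCH", "SsCH3", "SaasC", "SssCH2", "SsssCH", "SssssC"]
ESTATE_TERMS = {
    "n-heptane": [0.0, 4.4914, 0.0, 7.0086, 0.0, 0.0],
    "isooctane": [0.0, 11.3924, 0.0, 1.3264, 0.8426, 0.5220],
    "toluene": [10.2616, 2.0833, 1.3218, 0.0, 0.0, 0.0],
}


def weight_summed_estate(heptane_pct, isooctane_pct, toluene_pct):
    """Mole-fraction-weighted sum of the component E-state terms."""
    fractions = {
        "n-heptane": heptane_pct / 100.0,   # assumption: fractions, not percent
        "isooctane": isooctane_pct / 100.0,
        "toluene": toluene_pct / 100.0,
    }
    summed = [0.0] * len(TERM_NAMES)
    for compound, fraction in fractions.items():
        for k, value in enumerate(ESTATE_TERMS[compound]):
            summed[k] += fraction * value
    return dict(zip(TERM_NAMES, summed))


# Sample 1 of Table 1: 1.5% n-heptane, 5.3% isooctane, 93.2% toluene.
sample_1 = weight_summed_estate(1.5, 5.3, 93.2)
```

Each sample is thus described by six numbers that depend linearly on its three mole fractions, which is the reason strong inter-correlation among the six terms is to be expected.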

RON Model

Development and Validation of the PLS Model

PLS was used to model the quantitative relationship between the E-state index and RON of the investigated samples. The weight-summed E-state index, S, was used as the independent variable and the RON value was used as the dependent variable to build the regression model. Two validation techniques, external test set validation and LOO-CV, were carried out to assess the forecasting capacity of the developed model. The 47 TPRF samples were randomly assigned to two subsets: Subset R1, which comprises 38 samples (listed in Table 1), and Subset R2, which includes 9 samples (listed in Table 1 and marked by a).

First, the external test set validation was carried out. Subset R1 was used as the calibration set to build the regression model. The correlation coefficients between the six terms of the weight-summed E-state index were considered first. Table 3 lists the calculated correlation coefficients. From the correlation coefficients shown in Table 3, it can be concluded that there is severe inter-correlation in the data set of the weight-summed E-state index. For instance, the correlation coefficient between SaaCH and SaasC reaches 1, which means a completely linear correlation. Moreover, the correlation coefficients also indicate a high degree of linear correlation between SsCH3 and SsssCH and between SsCH3 and SssssC. Thus, PLS was employed to build the regression model in order to overcome the inter-correlation among the E-state index terms and exclude the redundant information in the E-state index. According to the RMSECV shown in Figure S1 and Table S2, and the percentage variance captured by the PLS model listed in Table S3, the first two LVs, LV1 and LV2, should be included in the regression model. A two-latent-variable PLS model was then built, and the RON values of the 9 samples in Subset R2 were predicted with the developed model. The result is shown in Table 1. The plot of predicted RON versus observed RON is shown in Figure 1a. For the 9 samples, the RMSRE, RMSE, and CCC of prediction are, respectively, 1.04, 0.74, and 0.9989. The correlation coefficient (R) between the predicted and observed RON is 0.9991, and the linear relationship between the predicted and observed RON is: RONpred = 0.9800 × RONobs + 1.6620.

Secondly, LOO-CV was conducted by predicting the RON values of the 38 samples of Subset R1 in turn, and the result is shown in Table 1. The plot of predicted RON versus observed RON is shown in Figure 1a. For the 38 samples, the RMSRE of prediction is 1.18 and the RMSE of prediction is 0.84. The correlation coefficient between the predicted and observed RON is 0.9992, and the linear relationship between the predicted and observed RON is: RONpred = 0.9987 × RONobs + 0.1086. As shown in Figure 1a and Table 1, the predicted RON is in good agreement with the observed RON.

Table 3

Correlation Coefficients Between the Terms of the Weight-Summed E-State Index of the 47 TPRF Samples

	SaaCH	SsCH₃	SaasC	SssCH₂	SsssCH	SssssC
SaaCH	1.0000	–0.8731	1.0000	–0.7868	–0.7442	–0.7442
SsCH₃	–0.8731	1.0000	–0.8731	0.3860	0.9754	0.9754
SaasC	1.0000	–0.8731	1.0000	–0.7868	–0.7442	–0.7442
SssCH₂	–0.7868	0.3860	–0.7868	1.0000	0.1733	0.1733
SsssCH	–0.7442	0.9754	–0.7442	0.1733	1.0000	1.0000
SssssC	–0.7442	0.9754	–0.7442	0.1733	1.0000	1.0000
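The unit correlations in Table 3 follow directly from how the descriptor is built: SaaCH and SaasC are both nonzero only for toluene, so both are proportional to the toluene mole fraction. The sketch below checks this on five samples taken from Table 1 (samples 1, 8, 23, 38, and 44); the helper name `pearson` and the choice of samples are illustrative assumptions, and the coefficients obtained for this small subset will differ from the tabulated ones.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# (n-heptane, isooctane, toluene) in mole percent, from Table 1.
compositions = [
    (1.5, 5.3, 93.2),
    (8.4, 22.3, 69.4),
    (16.0, 56.5, 27.5),
    (34.0, 7.5, 58.5),
    (54.2, 32.0, 13.8),
]

# Weight-summed terms built from the Table 2 values of the pure components.
saa_ch = [c / 100.0 * 10.2616 for _, _, c in compositions]   # toluene only
saas_c = [c / 100.0 * 1.3218 for _, _, c in compositions]    # toluene only
sss_ch2 = [a / 100.0 * 7.0086 + b / 100.0 * 1.3264 for a, b, _ in compositions]

r_aromatic_pair = pearson(saa_ch, saas_c)    # proportional columns
r_aromatic_ch2 = pearson(saa_ch, sss_ch2)    # correlated, but not perfectly
```

The same argument explains the unit correlation between SsssCH and SssssC, which are both contributed by isooctane alone.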

Figure 1

Predicted RON versus observed RON of: (a) PLS model, (b) MLR model, (c) Scheffé model1, (d) Scheffé model2. “▲” indicates the samples of Subset R1; “▼” indicates the samples of Subset R2.

The Williams plot of this model is displayed in Figure 2a. This plot reveals that none of the 47 samples has a leverage value exceeding the threshold value h* = 0.5526, which indicates that building and testing the model with these samples is reasonable and reliable. Figure 2a also reveals that all the samples except sample 2 are located within the threshold range of the standard residual (−3, 3), so sample 2 is a potential response outlier in this model. Nevertheless, the prediction RE of this sample is merely 2.75%, which usually means a satisfactory prediction accuracy for a QSPR model. The developed PLS model based on this data set should therefore be acceptable.

Figure 2

Williams plot of: (a) PLS model (h* = 0.5526), (b) MLR model (h* = 0.2368), (c) Scheffé model1 (h* = 0.8684), (d) Scheffé model2 (h* = 0.3158). “▲” indicates the samples of Subset R1; “▼” indicates the samples of Subset R2.

Generally speaking, the results of the two validations demonstrate that the weight-summed E-state index quantitatively relates to the RON value of these samples. It is reasonable to employ the weight-summed E-state index as a descriptor to establish the QSPR model for the RON of these samples. The PLS method is able to model the relationship between the weight-summed E-state index and RON of these samples. The QSPR model developed by PLS can be applied for predicting and studying the RON of TPRF.

Comparison with Other Methods

Three models, one MLR model and two Scheffé models, for predicting the RON of the investigated samples were built. The MLR model correlates RON and the weight-summed E-state index of these samples by using stepwise regression. Scheffé model1 and Scheffé model2 were built with the Scheffé polynomials method proposed in ref 9. The mole fractions of n-heptane, isooctane, and toluene, denoted as x1, x2, and x3, respectively, were used as the independent variables of the Scheffé models. The prediction capacity of these models was also assessed by external test set validation and LOO-CV. In the external test set validation, Subsets R1 and R2 were used as the calibration set and the test set, respectively. In the LOO-CV, the RON values of Subset R1 were predicted in turn. The predicted and observed RONs are shown in Figure 1 and Table S4. The Williams plots of these models are displayed in Figure 2.

When building the MLR model, the “Use probability of F value” parameter of the stepwise regression was set to 0.05 for entry and 0.10 for removal. The result of the stepwise regression suggests that two terms of the weight-summed E-state index, SsCH3 and SssCH2, should be included in the model as independent variables. The prediction result of the external test set validation is: RMSRE = 1.04, RMSE = 0.74, CCC = 0.9989, and RONpred = 0.9793 × RONobs + 1.7107 (R = 0.9991). The result of LOO-CV is: RMSRE = 1.20, RMSE = 0.84, and RONpred = 0.9988 × RONobs + 0.1051 (R = 0.9991).

Scheffé model1 includes 10 terms; that is, the regression model has 10 independent variables: x1, x2, x3, x1x2, x1x3, x2x3, x1x2(x1 – x2), x1x3(x1 – x3), x2x3(x2 – x3), and x1x2x3. The prediction result of the external test set validation is: RMSRE = 0.80, RMSE = 0.69, CCC = 0.9991, and RONpred = 0.9983 × RONobs – 0.2812 (R = 0.9994). For LOO-CV, RMSRE = 0.83, RMSE = 0.75, and RONpred = 0.9972 × RONobs + 0.2435 (R = 0.9994).

Scheffé model2 employed three first-order terms, x1, x2, and x3, as the independent variables. The prediction result of the external test set validation is: RMSRE = 1.04, RMSE = 0.74, CCC = 0.9989, and RONpred = 0.9804 × RONobs + 1.6402 (R = 0.9990). For LOO-CV, RMSRE = 1.25, RMSE = 0.89, and RONpred = 0.9991 × RONobs + 0.0738 (R = 0.9991).

As shown in Figures 1 and 2 and Table S4, the PLS model is slightly better than the MLR model in prediction performance. Scheffé model1 is better than the other three models. However, Scheffé model1 employs many more independent variables, which may reduce the robustness of the model. The prediction performance of Scheffé model2 is lower than that of the other models.
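The two sets of Scheffé regressors can be generated from a composition as sketched below. The function names are hypothetical, and the mole fractions are assumed to be expressed as fractions summing to one (whether the original fits used fractions or percent is not stated in this text). The sketch only builds the independent variables; the regression coefficients of the fitted models are not given here and are not reproduced.

```python
def scheffe_cubic_terms(x1, x2, x3):
    """The ten independent variables listed for Scheffé model1."""
    return [
        x1, x2, x3,                     # first-order terms
        x1 * x2, x1 * x3, x2 * x3,      # binary interaction terms
        x1 * x2 * (x1 - x2),            # asymmetric binary terms
        x1 * x3 * (x1 - x3),
        x2 * x3 * (x2 - x3),
        x1 * x2 * x3,                   # ternary interaction term
    ]


def scheffe_linear_terms(x1, x2, x3):
    """The three first-order independent variables of Scheffé model2."""
    return [x1, x2, x3]


# Sample 22 of Table 1: 15.0% n-heptane, 35.0% isooctane, 50.0% toluene.
x1, x2, x3 = 0.15, 0.35, 0.50
model1_row = scheffe_cubic_terms(x1, x2, x3)
model2_row = scheffe_linear_terms(x1, x2, x3)

# Leverage thresholds 3(p + 1)/n for 38 calibration samples.
h_star_model1 = 3.0 * (len(model1_row) + 1) / 38
h_star_model2 = 3.0 * (len(model2_row) + 1) / 38
```

With 10 regressors and 38 calibration samples, the threshold 3(p + 1)/n is close to 0.87, which illustrates how few samples per fitted term are available to Scheffé model1 compared with the other models.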

MON Model

Development and Validation of the PLS Model

Herein, the PLS method was used to model the quantitative relationship between the E-state index and MON of the investigated samples. Similarly, the weight-summed E-state index S was used as the independent variable and the MON value was used as the dependent variable to build the regression model. External test set validation and LOO-CV were carried out to assess the prediction performance of the developed model. The 45 TPRF samples (samples 1–44 and sample 46 listed in Table 1) were assigned to two subsets: Subset M1, which includes 36 samples (listed in Table 1), and Subset M2, which comprises 9 samples (listed in Table 1 and marked by a).

In the external test set validation, a two-latent-variable PLS model was built by using Subset M1 as the calibration set. The correlation coefficients shown in Table 4 indicate that there is still severe inter-correlation in the data set of the weight-summed E-state index. Figure S2 and Tables S2 and S5 demonstrate that LV1 and LV2 should be contained in the regression model. The MON values of Subset M2 were predicted with this model. The result is shown in Table 1 and Figure 3a. The RMSRE, RMSE, and CCC of this prediction are 2.25, 1.96, and 0.9913, respectively. The relationship between the predicted and observed MON is: MONpred = 0.9433 × MONobs + 4.3503, with R = 0.9928.

In LOO-CV, the MON values of the 36 samples of Subset M1 were predicted in turn. The prediction results are shown in Table 1 and Figure 3a. The RMSRE and RMSE of prediction are 2.05 and 1.72. The relationship between the predicted and observed MON is: MONpred = 0.9907 × MONobs + 0.7120, with R = 0.9940. As shown by the external test set validation and LOO-CV, the predicted MON is in good agreement with the observed MON.

Figure 4a is the Williams plot of this model. This plot indicates that the 45 samples are all located under the threshold leverage value of h* = 0.5833 and within the threshold range of the standard residual (−3, 3). This indicates that there are no response outliers in this model and that the model building and validation are reasonable.

Table 4

Correlation Coefficients Between the Terms of the Weight-Summed E-State Index of the 45 TPRF Samples

	SaaCH	SsCH₃	SaasC	SssCH₂	SsssCH	SssssC
SaaCH	1.0000	–0.9064	1.0000	–0.7834	–0.8127	–0.8127
SsCH₃	–0.9064	1.0000	–0.9064	0.4476	0.9828	0.9828
SaasC	1.0000	–0.9064	1.0000	–0.7834	–0.8127	–0.8127
SssCH₂	–0.7834	0.4476	–0.7834	1.0000	0.2747	0.2747
SsssCH	–0.8127	0.9828	–0.8127	0.2747	1.0000	1.0000
SssssC	–0.8127	0.9828	–0.8127	0.2747	1.0000	1.0000

Figure 3

Predicted MON versus observed MON of: (a) PLS model, (b) MLR model, (c) Scheffé model1, (d) Scheffé model2. “▲” indicates the samples of Subset M1; “▼” indicates the samples of Subset M2.

Figure 4

Williams plot of: (a) PLS model (h* = 0.5833), (b) MLR model (h* = 0.2500), (c) Scheffé model1 (h* = 0.9167), (d) Scheffé model2 (h* = 0.3333). “▲” indicates the samples of Subset M1; “▼” indicates the samples of Subset M2.

On the whole, the two validations demonstrate that the weight-summed E-state index is quantitatively related to the MON value of these samples. It is reasonable to use the weight-summed E-state index as the descriptor to establish the QSPR model for the MON of these samples. Moreover, the PLS method is able to model the relationship between the weight-summed E-state index and MON of these samples. The QSPR model herein is available and promising for forecasting the MON of TPRF.

Comparison with Other Methods

The MLR model, Scheffé model1, and Scheffé model2 for predicting the MON of the investigated samples were established. The MLR model is built with a stepwise regression method (setting “Use probability of F value” to 0.05 for entry and 0.10 for removal) by utilizing the weight-summed E-state index of independent variables. Scheffé model1, which includes 10 independent variables—x1, x2, x3, x1x2, x1x3, x2x3, x1x2(x1 – x2), x1x3(x1 – x3), x2x3(x2 – x3), and x1x2x3—and Scheffé model2, which uses x1, x2 and x3 as independent variables, were developed with the Scheffé polynomials method. Herein, x1, x2, and x3 still, respectively, indicate the mole fractions of n-heptane, isooctane, and toluene. These models were also assessed by the external test set validation and LOO-CV. In the external test set validation, Subset M1 and Subset M2 were used, respectively, as the calibration set and test set. The MON value of Subset M1 was predicted in LOO-CV. The predicted and observed MONs are shown in Figure and Table S6. The Williams plots of these models are displayed in Figure . The independent variables of the MLR model are SsCH3 and SssCH2. The prediction result of the external test set validation is: RMSRE = 2.24, RMSE = 1.95, CCC = 0.9914, and MONpred = 0.9430 × MONobs + 4.3584 (R = 0.9929). For LOO-CV, RMSRE = 2.06, RMSE = 1.7,2 and MONpred = 0.9908 × MONobs + 0.7113 (R = 0.9940). The external test set validation result of Scheffé model1 is: RMSRE = 1.95, RMSE = 1.35, CCC = 0.9960, and MONpred = 0.9929 × MONobs – 0.0947 (R = 0.9970). The LOO-CV result of Scheffé model1 is: RMSRE = 2.56, RMSE = 1.18, and MONpred = 0.9862 × MONobs + 1.1654 (R = 0.9972). The external test set validation result of Scheffé model2 is: RMSRE = 2.28, RMSE = 1.98, CCC = 0.9911, and MONpred = 0.9440 × MONobs + 4.3286 (R = 0.9925). The LOO-CV result of Scheffé model2 is: RMSRE = 2.07, RMSE = 1.74, and MONpred = 0.9903 × MONobs + 0.7338 (R = 0.9938). 
As shown in Figures , 4 and Table S6, there are no significant differences between the PLS model and the MLR model in prediction performance. According to the external test set validation, Scheffé model1 might appear better than the other three models. However, as shown in Table S6, the prediction error of Scheffé model1 for sample 46 is obviously larger than the errors given by the other three models for this sample and than the errors for the other samples. The standardized residual shown in Figure also indicates that this sample is a response outlier. That is, Scheffé model1 might not be robust, and building the QSPR model with this method may not be the most reliable choice. In addition, the prediction performance of Scheffé model2 is slightly lower than that of the PLS model. The comparison results demonstrate that employing the E-state index as the descriptor and building the regression model with PLS is better than the Scheffé polynomials method for developing the MON model of TPRF.
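The statistics used in this comparison (RMSE, RMSRE, and CCC) can be computed from paired observed and predicted values. The Python sketch below makes two assumptions: RMSRE is taken as the root-mean-square relative error expressed in percent, and CCC is taken as the concordance correlation coefficient in its sum-of-squares form. The numbers at the end are made-up illustrative values, not data from Table S6.

```python
import math
from typing import Sequence


def _check(obs: Sequence[float], pred: Sequence[float]) -> None:
    if len(obs) == 0 or len(obs) != len(pred):
        raise ValueError("obs and pred must be non-empty and of equal length")


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root-mean-square error."""
    _check(obs, pred)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def rmsre_percent(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root-mean-square relative error in percent (assumed definition)."""
    _check(obs, pred)
    rel_sq = sum(((p - o) / o) ** 2 for o, p in zip(obs, pred))
    return 100.0 * math.sqrt(rel_sq / len(obs))


def ccc(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Concordance correlation coefficient between obs and pred."""
    _check(obs, pred)
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    s_oo = sum((o - mean_o) ** 2 for o in obs)
    s_pp = sum((p - mean_p) ** 2 for p in pred)
    return 2.0 * s_op / (s_oo + s_pp + n * (mean_o - mean_p) ** 2)


# Made-up illustrative values (not from the paper).
observed = [80.0, 90.0, 100.0]
predicted = [81.0, 89.0, 100.0]
print(rmse(observed, predicted))
print(rmsre_percent(observed, predicted))
print(ccc(observed, predicted))
```

Unlike the correlation coefficient R, CCC also penalizes systematic offsets between predicted and observed values, so it equals 1 only when the points fall on the line of perfect agreement.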

Conclusions

The proposed method is feasible and easy to use for correlating the ON of TPRF with the structure of the TPRF components. In this method, the weight-summed E-state index is used as the descriptor of TPRF, and PLS regression is used as the calibration method to establish the QSPR models of TPRF. The results of the external test set validation and LOO-CV indicate that the two obtained QSPR models are practicable for predicting the RON and MON of these samples. This demonstrates that there is a quantitative relationship between the chemical composition, represented by the structure and mole fraction of the components, and the ON of TPRF. In addition, the comparison of the developed PLS, MLR, and Scheffé models shows that the PLS models are slightly better than the MLR and Scheffé models in prediction performance. Overall, the combination of the weight-summed E-state index and PLS regression should be a practicable and promising method for building QSPR models of the ON of multi-component fuels, such as TPRF.
