Sparse modeling of chemical bonding in binary compounds.

Yosuke Kanda¹, Hitoshi Fujii², Tamio Oguchi¹,².

Abstract

A sparse model for quantifying the energy difference between zinc-blende and rock-salt crystal structures in octet elemental and binary materials is constructed by using the linearly independent descriptor-generation method and exhaustive search, following the previous work by Ghiringhelli et al. [Phys Rev Lett. 2015;114:105503]. The simplest model obtained includes only the atomic radii of the constituent atoms, and its physical meaning is interpreted in relation to van Arkel-Ketelaar's triangle for classifying chemical bonding in binary compounds.

Keywords: 404 Materials informatics / Genomics; Sparse modeling; binary compounds; chemical bonding; machine learning

Year: 2019 PMID: 32082439 PMCID: PMC7006824 DOI: 10.1080/14686996.2019.1697858

Source DB: PubMed Journal: Sci Technol Adv Mater ISSN: 1468-6996 Impact factor: 8.090

Introduction

Recently, data-intensive scientific discovery and design, widely called materials informatics (MI), have attracted great attention as a means of accelerating research and development in materials science. The major aims of MI are the exploration of new materials with desired properties, the optimization of existing materials for particular performances, and the understanding of underlying physical mechanisms for further development. Generally, if one demands high predictability from a model constructed by machine-learning techniques, complicated non-linear methods such as kernel ridge regression [1], neural networks [2], and random forests [3] are appropriate, though their interpretation becomes troublesome because of the non-linearity involved in the modeling. On the other hand, simple modeling such as linear regression with interpretable descriptors is suitable for extracting intuitive understanding from materials data, at the sacrifice of some predictability. Sparse modeling [4] is a statistical learning technique that realizes such a simple model by the selection and reduction of the assumed descriptors. A pioneering application of sparse modeling to materials properties was reported by Ghiringhelli et al. [5]. Total energy differences between zinc-blende and rock-salt crystal structures obtained by density-functional-theory (DFT) calculations for 82 elemental and binary semiconductors were modeled with the least absolute shrinkage and selection operator (LASSO) [6] and exhaustive search techniques within linear regression. They succeeded in constructing simple models with a small number of descriptors at relatively high predictability. The key to their success lies in the construction of the descriptors. They first assumed several basic descriptors such as ionization potential, electron affinity, and some DFT atomic data for the constituent atoms and then combined them by multiplication, division, and functionalization to obtain higher-order descriptors numbering in the thousands. The LASSO technique was used to reduce the number of descriptors to tens by statistical procedures and error evaluations. Finally, an exhaustive search was used to extract the most important descriptors among them for a given number of descriptors. Nevertheless, the obtained model is still far from being physically intuitive because it involves complicated functions of several basic descriptors.
In this study, we aim to construct a simpler and more interpretable model for the same problem that Ghiringhelli and coworkers attacked. Our idea is twofold: the first is the symmetrization of basic descriptors with respect to the permutation of constituent elements, and the second is the high-order operation of basic descriptors without using complicated functions such as the exponential. Regression trials with a single basic descriptor are also carried out. During the high-order descriptor operations, collinearity problems (including multicollinearity and near multicollinearity) often take place because of strong dependency between the higher-order descriptors generated as products of descriptors. The linearly independent descriptor generation (LIDG) method recently proposed by us [7] is adopted to remove those collinearity problems when they occur. Our models are found to be as simple as the previous models without utilizing complicated descriptors, and are able to quantitatively classify the chemical bonding in binary compound systems.

Methods

Target variables

The target variables prepared by Ghiringhelli et al. [5] are used for the construction of models in this study, namely, the total energy differences between rock-salt (RS) and zinc-blende (ZB) type structures calculated for 82 octet elemental and binary compounds AB with main-group elements,

$$\Delta E = E_{\mathrm{RS}} - E_{\mathrm{ZB}}. \quad (1)$$

The data used for the present regression are listed in Appendix A. To confirm the precision of the target data, total energies of the 82 systems with ZB and RS structures are recalculated by using the all-electron full-potential linearized augmented plane-wave method implemented in our HiLAPW code [8]; the root-mean-square errors are 7 meV/atom in the total-energy difference and 0.009 Å in the equilibrium lattice constant.

Descriptors

Ghiringhelli et al. [5] distinguished the constituent elements $A$ and $B$ according to the size of electronegativity. However, permutation of $A$ and $B$ leads to no physical change in the system at the equiatomic condition, and the models constructed should be symmetric under the permutation. In the present study, we generate descriptors as follows:

(1) Prepare basic descriptors $x_A$ and $x_B$ for constituent atoms $A$ and $B$ on the basis of our intuition.
(2) Symmetrize them with respect to permutation and add their inverses; the results are called first-order descriptors.
(3) Generate high-order descriptors by multiplication of the first-order descriptors.
(4) Remove multicollinearity and near multicollinearity by erasing the irrelevant descriptors.
(5) Iterate the high-order descriptor generation and the collinearity reduction, if needed.

Concerning the basic descriptors, easily obtainable physical quantities appeal to our intuition and help to construct interpretable models. Atomic radius, ionization potential, electron affinity, and electronegativity are adopted in this study and tabulated in Appendix A. As for the symmetrization and inversion, we consider the following operations:

$$x_A + x_B, \qquad |x_A - x_B|, \qquad \frac{1}{x_A + x_B}.$$

The high-order descriptors are generated by multiplication of the first-order descriptors. From $n$ first-order descriptors, $\binom{n+m-1}{m}$ $m$th-order descriptors can be constructed. As mentioned above, every time high-order descriptors are generated, multicollinearity and near multicollinearity are removed by the LIDG method [7]. Here, multicollinearity is a linear dependency between descriptor vectors. Such a linear dependency often occurs when higher-order descriptors are generated. The existence of a linear dependency means that $X\mathbf{c} = \mathbf{0}$ has non-trivial solutions $\mathbf{c} \neq \mathbf{0}$, where $X = (\mathbf{x}_1, \ldots, \mathbf{x}_p)$ is the descriptor matrix (design matrix) with the descriptor vectors $\mathbf{x}_i$ as its columns and $p$ is the number of descriptors. Thus, to find multicollinearity, all non-trivial solutions of $X\mathbf{c} = \mathbf{0}$ should be found. Fortunately, the non-trivial solutions can be easily found by computing the reduced row echelon form (RREF) [9] of $X$. In the LIDG method, $X$ is made linearly independent by appropriately removing the detected descriptors involved in a multicollinearity relationship. Since a constant term is already included in the regression model, constant terms additionally arising from multiplication are tacitly removed.
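The generation steps above can be sketched in code. The following is a minimal illustration and not the authors' LIDG implementation: each descriptor is stored as a tuple of integer exponents over the symmetrized sums and differences (a representation assumed here for convenience), so that products which collapse to a constant or merely repeat an existing descriptor are caught by exact comparison. The function names `first_order` and `descriptor_space` are hypothetical. Near multicollinearity and general linear dependencies, which LIDG detects numerically through the RREF, are outside this sketch.

```python
from itertools import combinations_with_replacement


def first_order(props):
    """Exponent vectors of the symmetrized first-order descriptors.

    Each property x contributes two base quantities, s_x = x_A + x_B and
    d_x = |x_A - x_B|. A descriptor is a tuple of integer exponents over
    (s_1, d_1, s_2, d_2, ...), so 1/s_x is simply exponent -1 on s_x.
    """
    n = 2 * len(props)

    def unit(index, power):
        vec = [0] * n
        vec[index] = power
        return tuple(vec)

    sums = [unit(2 * k, 1) for k in range(len(props))]
    diffs = [unit(2 * k + 1, 1) for k in range(len(props))]
    inverses = [unit(2 * k, -1) for k in range(len(props))]
    return sums + diffs + inverses


def descriptor_space(props, max_order):
    """Distinct, non-constant products of up to max_order first-order descriptors."""
    base = first_order(props)
    seen = {}  # dict keeps insertion order and removes exact duplicates
    for order in range(1, max_order + 1):
        for combo in combinations_with_replacement(base, order):
            # Multiplying descriptors adds their exponent vectors.
            vec = tuple(sum(column) for column in zip(*combo))
            if any(vec):  # an all-zero vector is a constant term: drop it
                seen.setdefault(vec, None)
    return list(seen)


ds1 = descriptor_space(["r", "IP", "EA", "EN"], max_order=2)
ds2 = descriptor_space(["r"], max_order=4)
print(len(ds1), len(ds2))
```

With four basic properties up to second order this enumeration yields 86 descriptors, and with a single property up to fourth order it yields 24, which agrees with the sizes quoted for the descriptor spaces in this study.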

Model selection

In sparse modeling [4], the best model, that is, the one with the highest predictivity, is usually selected by a cross-validation procedure [10,11]. For this purpose, we employ the leave-one-out scheme in this study, where 81 sets of data (target and descriptors) are used for the construction of the model and the remaining one set is used for estimating the predictivity error [12], which is defined in terms of the true, predicted, and averaged target values. The measure of predictivity $Q^2$ for model selection by cross-validation is then calculated as the average over all $N$ leave-one-out splits, where $N$ is the total number of data sets (82 in this case). To obtain models as simple as possible, the exhaustive search method [13] for a given number of descriptors is employed.
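The selection procedure can be illustrated with a short sketch. It assumes the common leave-one-out definition $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$, which stands in for the predictivity measure of Ref. [12] and may differ from it in detail. The function names `loo_q2` and `exhaustive_search` are hypothetical, and the data are synthetic toy values, not the 82 compounds of this study.

```python
from itertools import combinations

import numpy as np


def loo_q2(X, y):
    """Leave-one-out Q^2 = 1 - PRESS/TSS for OLS with an intercept.

    ASSUMPTION: this common definition is used for illustration only.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # prepend the constant term
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i  # hold out sample i
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss


def exhaustive_search(X, y, m):
    """Score every m-column subset of X and return (best Q^2, columns)."""
    best_q2, best_cols = -np.inf, None
    for cols in combinations(range(X.shape[1]), m):
        q2 = loo_q2(X[:, list(cols)], y)
        if q2 > best_q2:
            best_q2, best_cols = q2, cols
    return best_q2, best_cols


# Synthetic, noise-free toy data: y depends only on columns 1 and 4.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 6))
y_toy = 2.0 * X_toy[:, 1] - 3.0 * X_toy[:, 4] + 0.5
best_q2, best_cols = exhaustive_search(X_toy, y_toy, 2)
print(best_cols, round(best_q2, 3))
```

On this toy problem the search recovers the two columns that generate the target. The explicit refit per held-out sample keeps the sketch easy to read; a production version would likely use the hat-matrix shortcut for leave-one-out residuals.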

Results

Using the procedures described in the previous section with the four kinds of basic descriptors, 86 descriptors are generated up to the second order; this set is called descriptor space 1 (DS1) and is listed in Appendix B. Figure 1 shows $Q^2$ for the best models found by the exhaustive search as a function of the number of descriptors $M$ in DS1. That is, when $M$ descriptors are used, linear regressions (ordinary least-squares method) are performed for all the combinations of $M$ descriptors out of $p$, $Q^2$ of Equation (4) is calculated for each, and the maximum value of $Q^2$ is plotted. Here, $p$ is the total number of descriptors in DS1. Detailed results of model selection are summarized in Appendix C.
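To give a feeling for the cost of the exhaustive search, the number of regressions for a given model size is the binomial coefficient of the descriptor-space size and the number of descriptors. The short sketch below simply evaluates it for a space of 86 descriptors; the helper name `n_regressions` is hypothetical.

```python
from math import comb


def n_regressions(p, m):
    """Number of m-descriptor models that can be drawn from p descriptors."""
    return comb(p, m)


for m in range(1, 6):
    print(m, n_regressions(86, m))
```

The count grows from 86 single-descriptor models to roughly 35 million five-descriptor models, which is why the search is carried out only for small numbers of descriptors.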

Figure 1.

Measure of predictivity $Q^2$ for the best models with descriptor space 1 (DS1), 2 (DS2), and 3 (DS3) as a function of the number of descriptors obtained by the exhaustive search.

It is seen in Figure 1 that as the number of descriptors is increased, the predictivity with DS1 also increases at first and then almost saturates. Therefore, the model with $M = 3$ is appropriately simple with relatively high predictability. This model, called Model 1, is given as

$$\Delta E = 0.59\,|EN_A - EN_B| - 1.95\,\frac{|EN_A - EN_B|}{r_A + r_B} + \frac{6.15}{(r_A + r_B)^2} - 0.75, \quad (5)$$

where $EN$ and $r$ are electronegativity and atomic radius, respectively. The regression performance of Model 1 is shown in Figure 2. It is quite interesting that only the electronegativity and atomic radius are included in Model 1 with a simple form, but its physical meaning is not readily understandable.

Figure 2.

Regression performance of Model 1. (a) predicted and DFT data. (b) predicted and DFT data for each semiconductor. ID corresponds to that in Table A1.

Table A1.

Target data for regression taken from Ghiringhelli et al. [5]. ΔE is the total energy difference (in eV/atom) between zinc-blende and rock-salt structures of the 82 elemental and binary systems AB.

ID	A	B	ΔE	ID	A	B	ΔE	ID	A	B	ΔE
0	Li	F	−0.059	28	Si	Si	0.275	56	Rb	I	−0.169
1	Li	Cl	−0.038	29	Si	Ge	0.264	57	Sr	O	−0.221
2	Li	Br	−0.033	30	Si	Sn	0.136	58	Sr	S	−0.369
3	Li	I	−0.022	31	K	F	−0.146	59	Sr	Se	−0.375
4	Be	O	0.430	32	K	Cl	−0.165	60	Sr	Te	−0.381
5	Be	S	0.506	33	K	Br	−0.166	61	Ag	F	−0.156
6	Be	Se	0.495	34	K	I	−0.168	62	Ag	Cl	−0.044
7	Be	Te	0.466	35	Ca	O	−0.266	63	Ag	Br	−0.030
8	B	N	1.713	36	Ca	S	−0.369	64	Ag	I	0.037
9	B	P	1.020	37	Ca	Se	−0.361	65	Cd	O	−0.087
10	B	As	0.879	38	Ca	Te	−0.350	66	Cd	S	0.070
11	B	Sb	0.581	39	Cu	F	−0.019	67	Cd	Se	0.083
12	C	C	2.638	40	Cu	Cl	0.156	68	Cd	Te	0.113
13	C	Si	0.668	41	Cu	Br	0.152	69	In	N	0.150
14	C	Ge	0.808	42	Cu	I	0.203	70	In	P	0.170
15	C	Sn	0.450	43	Zn	O	0.102	71	In	As	0.122
16	Na	F	−0.146	44	Zn	S	0.275	72	In	Sb	0.080
17	Na	Cl	−0.133	45	Zn	Se	0.259	73	Sn	Sn	0.016
18	Na	Br	−0.127	46	Zn	Te	0.241	74	Cs	F	−0.112
19	Na	I	−0.115	47	Ga	N	0.433	75	Cs	Cl	−0.152
20	Mg	O	−0.178	48	Ga	P	0.341	76	Cs	Br	−0.158
21	Mg	S	−0.087	49	Ga	As	0.271	77	Cs	I	−0.165
22	Mg	Se	−0.055	50	Ga	Sb	0.158	78	Ba	O	−0.095
23	Mg	Te	−0.005	51	Ge	Ge	0.202	79	Ba	S	−0.326
24	Al	N	0.072	52	Ge	Sn	0.087	80	Ba	Se	−0.350
25	Al	P	0.219	53	Rb	F	−0.136	81	Ba	Te	−0.381
26	Al	As	0.212	54	Rb	Cl	−0.161
27	Al	Sb	0.150	55	Rb	Br	−0.164

Electronegativity and atomic radius are known to be empirically correlated through an inverse relation (Equation (6)) [14,15], though they are not so highly collinear as to be flagged by our near-collinearity criteria. Note that Pearson’s correlation coefficient is . Therefore, as a trial, only atomic radius or only electronegativity is used as the basic descriptor to generate 24 high-order descriptors up to fourth order for sparse modeling; the resulting sets are called descriptor space 2 (DS2) and 3 (DS3), respectively, and are listed in Appendix B. As a result, it is found that DS2 gives much better $Q^2$ than DS3. For example, $Q^2$ for $M = 5$ in DS2 and DS3 is 0.892 and 0.714, respectively. In Figure 1, $Q^2$ with DS2 is almost constant at larger $M$, and the model with $M = 2$ might be a good one from the viewpoints of predictivity and interpretable sparse modeling. This model, called Model 2, is expressed as

$$\Delta E = \frac{6.87}{(r_A + r_B)^3} - 5.02\,\frac{|r_A - r_B|}{(r_A + r_B)^3} - 0.18, \quad (7)$$

and its regression performance is given in Figure 3.

Figure 3.

Regression performance of Model 2. (a) predicted and DFT data. (b) predicted and DFT data for each semiconductor. ID corresponds to that in Table A1.

It should be emphasized that Model 2 is a really simple model including atomic-radius descriptors only, at high predictivity ($Q^2 = 0.866$). The regression performance of the present models (Model 1 and Model 2) and the previous ones (Model A, Model B, and Model C) is summarized in Table 1 in terms of the coefficient of determination $R^2$ [16], the measure of predictivity $Q^2$ (Equation 4) [12], the Akaike information criterion AIC [17,18], the mean absolute error MAE, and the maximum absolute error MaxAE. Model A, Model B, and Model C of the previous work in Table 1 are the best models with one, two, and three descriptors, respectively, selected from the left of the following descriptor list:

Table 1.

	Present		Previous work [5]
Criterion	Model 1	Model 2	Model A	Model B	Model C
M	3	2	1	2	3
R^2	0.913	0.876	0.883	0.929	0.957
Q^2	0.902	0.866	0.867	0.918	0.946
AIC	−92.4	−65.0	−72.0	−110.6	−149.4
MAE (eV)	0.102	0.118	0.121	0.097	0.071
MaxAE (eV)	0.457	0.460	0.400	0.349	0.301

Regression performance of models obtained in the present and previous works. $M$, $R^2$, $Q^2$, AIC, MAE, and MaxAE are the number of descriptors, coefficient of determination [16], measure of predictivity (Equation 4) [12], Akaike information criterion [17,18], mean absolute error, and maximum absolute error, respectively. Models in the previous work are given in the text. Note that the values of ionization potential, electron affinity, and electronegativity used in the present study are slightly different from those in the previous work [5]. Because of that, MAE and MaxAE do not perfectly coincide with those listed previously.

Discussion

Let us consider the possible consequences of Model 2, which is the simplest among the models constructed in the preceding section. In the case of elemental materials ($r_A = r_B = r$), $\Delta E$ becomes positive for $r < 1.68$ Å, preferring the zinc-blende (properly, diamond) structure. Actually, neither elemental materials nor compounds of equal atomic radii with radii greater than 1.68 Å are included in the present octet compounds. For compounds with largely different atomic radii, the rock-salt structure, with higher coordination than zinc blende, is realized. From Equation (7), the borderline between ZB and RS structures, namely $\Delta E = 0$, is given as $|r_A - r_B| = [6.87 - 0.18\,(r_A + r_B)^3]/5.02$, providing a quantitative guideline to classify ZB and RS structures in the present systems. The borderline and the structural classification will be discussed further below in relation to van Arkel-Ketelaar’s triangle of chemical bonding. Approximately, Model 2 shown in Equation (7) tells us that the energy difference between ZB and RS structures scales linearly with the absolute difference in the atomic radii of the constituent atoms ($|r_A - r_B|$) and is inversely proportional to the cell volume ($\propto (r_A + r_B)^3$). In the octet compounds, the cohesion mechanism is dominated by covalent bonds with additive ionic electrostatic interactions. Covalent bonds originate from the formation of bonding and antibonding states between neighboring orbitals, and their strength is roughly proportional to the size of the corresponding hopping integrals. According to the scaling rules in tight-binding theory [19-21], the hopping integral between neighboring orbitals decays as an inverse power of the interatomic distance. Therefore, it is reasonable to see the inverse proportionality to the cell volume in the energy difference. Chemical trends in the stable structure directly derived from Model 2 are listed in Table 2.
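As a worked illustration of these consequences, the sketch below evaluates the two-descriptor atomic-radius model with the rounded coefficients listed for the best $M = 2$ model in Table C2 and the atomic radii of Table A2. The function names are hypothetical, and because the coefficients are rounded to two decimals the numbers are indicative only.

```python
def model2_delta_e(r_a, r_b):
    """Model 2 energy difference (eV/atom) from atomic radii in angstrom."""
    s = r_a + r_b       # sum of radii
    d = abs(r_a - r_b)  # absolute difference of radii
    return 6.87 / s**3 - 5.02 * d / s**3 - 0.18


def elemental_threshold():
    """Radius below which an elemental solid (r_A = r_B) has Delta E > 0."""
    return 0.5 * (6.87 / 0.18) ** (1.0 / 3.0)


def borderline_difference(s):
    """|r_A - r_B| on the Delta E = 0 borderline for a given sum s."""
    return (6.87 - 0.18 * s**3) / 5.02


# Atomic radii (angstrom) copied from Table A2 for a few elements.
RADII = {"C": 0.67, "B": 0.87, "N": 0.56, "Ga": 1.36, "As": 1.14,
         "Na": 1.90, "Cl": 0.79, "Cs": 2.98, "I": 1.15}

for a, b in [("C", "C"), ("B", "N"), ("Ga", "As"), ("Na", "Cl"), ("Cs", "I")]:
    de = model2_delta_e(RADII[a], RADII[b])
    print(f"{a}{b}: {de:+.3f} eV/atom ->", "zinc blende" if de > 0 else "rock salt")

print(f"elemental threshold: {elemental_threshold():.2f} angstrom")
```

For these five systems the sign of the model value agrees with the sign of the DFT target in Table A1, and the elemental threshold reproduces the 1.68 Å quoted in the text. The model is not exact for every compound, as the maximum absolute error in Table 1 indicates.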

Table 2.

Relations between atomic radius and stable structure derived from Model 2 (Equation 7). ΔE is defined in Equation 1.

Atomic radius	ΔE	Stable structure
|rA−rB|: large	<0	Rock salt
rA+rB: small and |rA−rB|: small	>0	Zinc blende

Empirically, electronegativity is well known to be related to chemical bonding in compounds [22] and has an inverse relation to the atomic radius, as shown in Equation (6). With this relation, the trends with respect to the atomic radius listed in Table 2 can be converted to trends with respect to electronegativity given in Table 3. This result is consistent with our knowledge of the stable structure in semiconductors, namely that covalent (ionic) compounds tend to possess the zinc-blende (rock-salt) crystal structure [23]. Nevertheless, it is quite interesting that the energy difference can be modeled quantitatively better with atomic radius than with electronegativity, as mentioned in the preceding section.

Table 3.

Relations between electronegativity, stable structure, and chemical bond, derived from Table 2 and Equation 6.

Electronegativity	ΔE	Stable structure	Chemical bond
|ENA−ENB|: large	<0	Rock salt	Ionic
ENA+ENB: large and |ENA−ENB|: small	>0	Zinc blende	Covalent

van Arkel-Ketelaar’s triangle is a map for displaying the chemical bonding of compounds [24-26]. In its latest version [27,28], metallic, ionic, and covalent bonding are represented in a two-dimensional (2D) map whose axes are the mean and difference of the electronegativities of the constituent atoms. Following van Arkel-Ketelaar’s triangle, the total energy difference given by Equation (7) is plotted in a 2D map of the sum and difference of atomic radii, as shown in Figure 4. Figure 4 precisely reproduces the stable crystal structure, either zinc-blende or rock-salt, and, via the relation between structure and chemical bonding, the bonding character, either covalent or ionic. Note that the models constructed by regression include no information about chemical bonding characteristics beyond the training data. As a matter of fact, Model 2 cannot represent metallic systems, which may correspond to the empty region in the present triangle shown in Figure 4.

Figure 4.

Total energy difference map in a triangle of the sum and difference of atomic radii of the constituent atoms given by Model 2 (Equation 7). Red-colored (blue-colored) dots form an area where the zinc-blende (rock-salt) structure is stable and covalent (ionic) bonding is realized. An area with no dots corresponds to the region where training data are not included, possibly indicating a metallic bonding region.

Conclusions

A simple model quantifying the energy difference between zinc-blende and rock-salt structures in octet elemental and binary semiconductors is obtained with only the atomic radii of the constituent atoms, leading to a 2D map of chemical bonding represented in terms of the sum and difference of atomic radii. It is found that our descriptor-generation method, including symmetrization for permutation, multiplication to higher order, and removal of collinearity problems, is crucial for constructing such a sparse model, in addition to the exhaustive search. That is, since we use only symmetrized descriptors as initial descriptors, it is guaranteed that the model obtained is correct at least in terms of symmetry. In addition, since inappropriate descriptors that do not satisfy the symmetry are excluded, the number of descriptor candidates is reduced. These are the two effects of descriptor symmetrization. On the other hand, the model obtained in the previous study does not satisfy the symmetry with respect to the permutation of the $A$ and $B$ elements. Therefore, no matter how high its prediction accuracy is, it can be said to be a physically inappropriate model, at least in terms of symmetry. We would also like to mention the effect of removing multicollinearity. For example, if there is a multicollinearity relation among three descriptors $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ in the descriptor matrix, the estimation and prediction accuracies do not change regardless of which one of $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ is deleted from the descriptor matrix. Therefore, it cannot be decided from statistics whether $\mathbf{x}_1$, $\mathbf{x}_2$, or $\mathbf{x}_3$ should be removed. In our LIDG method, however, since the multicollinearity has been detected prior to regression, one can introduce the simplicity of descriptors into the descriptor selection process and retain the two descriptors with the simpler forms among $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$. Therefore, the obtained model is the simplest among the models that give the same prediction accuracy. This is the advantage of the LIDG method in the detection and removal of multicollinearity.
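The point about multicollinearity can be checked numerically. The sketch below is an illustration under stated assumptions rather than the LIDG code: it uses a hypothetical exact relation x1 = x2 + x3 among three synthetic descriptors, finds the dependent column from the pivot columns of the reduced row echelon form, and confirms that the fit quality is the same whichever of the three columns is dropped. The helper names `rref_pivots` and `r_squared` and the tolerance are choices made here.

```python
import numpy as np


def rref_pivots(X, tol=1e-10):
    """Pivot columns of the RREF of X (Gauss-Jordan with partial pivoting)."""
    A = np.array(X, dtype=float)
    rows, cols = A.shape
    pivots, r = [], 0
    for c in range(cols):
        if r == rows:
            break
        p = r + np.argmax(np.abs(A[r:, c]))  # largest entry at or below row r
        if abs(A[p, c]) < tol:
            continue  # no pivot here: column depends on earlier ones
        A[[r, p]] = A[[p, r]]  # swap rows
        A[r] /= A[r, c]
        for i in range(rows):
            if i != r:
                A[i] -= A[i, c] * A[r]
        pivots.append(c)
        r += 1
    return pivots


def r_squared(X, y):
    """Coefficient of determination of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


# Synthetic descriptors with an exact dependency (hypothetical example).
rng = np.random.default_rng(1)
x2, x3 = rng.normal(size=(2, 20))
x1 = x2 + x3
X_toy = np.column_stack([x1, x2, x3])
y_toy = x1 + 0.5 * x3 + 0.1 * rng.normal(size=20)

pivots = rref_pivots(X_toy)
scores = [r_squared(np.delete(X_toy, j, axis=1), y_toy) for j in range(3)]
print(pivots, scores)
```

Only two pivot columns are found, and the three reduced models give the same coefficient of determination up to rounding. Statistics alone therefore cannot pick which column to drop, so a simplicity criterion has to make that choice.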

Table A2.

Basic descriptors. $r$, IP, EA, and EN are atomic radius, ionization potential, electron affinity, and electronegativity, respectively. $r_s$, $r_p$, and $r_d$ are the radii at maximum probability amplitude of the $s$, $p$, and $d$ orbitals, respectively.

Atom	r^a (Å)	IP^b (eV)	EA^c (eV)	EN^d	r_s^e (Å)	r_p^e (Å)	r_d^e (Å)
Li	1.67	5.392	0.6180	0.98	1.652	1.995	6.930
Be	1.12	9.322	−0.5000	1.57	1.078	1.211	2.877
B	0.87	8.298	0.2770	2.04	0.805	0.826	1.946
C	0.67	11.260	1.2629	2.55	0.644	0.630	1.631
N	0.56	14.534	−0.0700	3.04	0.539	0.511	1.540
O	0.48	13.618	1.4611	3.44	0.462	0.427	2.219
F	0.42	17.422	3.3990	3.98	0.406	0.371	1.428
Na	1.90	5.139	0.5479	0.93	1.715	2.597	6.566
Mg	1.45	7.646	−0.4000	1.31	1.330	1.897	3.171
Al	1.18	5.986	0.4410	1.61	1.092	1.393	1.939
Si	1.11	8.151	1.3850	1.90	0.938	1.134	1.890
P	0.98	10.486	0.7465	2.19	0.826	0.966	1.771
S	0.88	10.360	2.0771	2.58	0.742	0.847	2.366
Cl	0.79	12.967	3.6170	3.16	0.679	0.756	1.666
K	2.43	4.341	0.5015	0.82	2.128	2.443	1.785
Ca	1.94	6.113	−0.3000	1.00	1.757	2.324	0.679
Cu	1.45	7.726	1.2280	1.90	1.197	1.680	2.576
Zn	1.42	9.394	−0.6000	1.65	1.099	1.547	2.254
Ga	1.36	5.999	0.3000	1.81	0.994	1.330	2.163
Ge	1.25	7.899	1.2000	2.01	0.917	1.162	2.373
As	1.14	9.810	0.8100	2.18	0.847	1.043	2.023
Se	1.03	9.752	2.0207	2.55	0.798	0.952	2.177
Br	0.94	11.814	3.3650	2.96	0.749	0.882	1.869
Rb	2.65	4.177	0.4859	0.82	2.240	3.199	1.960
Sr	2.19	5.695	−0.3000	0.95	1.911	2.548	1.204
Ag	1.65	7.576	1.3020	1.93	1.316	1.883	2.968
Cd	1.61	8.993	−0.7000	1.69	1.232	1.736	2.604
In	1.56	5.786	0.3000	1.78	1.134	1.498	3.108
Sn	1.45	7.344	1.2000	1.96	1.057	1.344	2.030
Sb	1.33	8.641	1.0700	2.05	1.001	1.232	2.065
Te	1.23	9.009	1.9708	2.10	0.945	1.141	1.827
I	1.15	10.451	3.0591	2.66	0.896	1.071	1.722
Cs	2.98	3.894	0.4716	0.79	2.464	3.164	1.974
Ba	2.53	5.212	−0.3000	0.89	2.149	2.632	1.351

^a Ref. [14]; ^b Ref. [29]; ^c Ref. [30]; ^d Ref. [15]; ^e Ref. [5].

Table B1.

Descriptor space 1 (DS1): 86 descriptors up to second order of atomic radius, ionization potential, electron affinity, and electronegativity.

Order	Descriptor
1	rA+rB, IPA+IPB, EAA+EAB, ENA+ENB
	|rA−rB|, |IPA−IPB|, |EAA−EAB|, |ENA−ENB|
	1/(rA+rB), 1/(IPA+IPB), 1/(EAA+EAB), 1/(ENA+ENB)
2	(rA+rB)^2, (rA+rB)(IPA+IPB), (rA+rB)(EAA+EAB), (rA+rB)(ENA+ENB)
	(rA+rB)|rA−rB|, (rA+rB)|IPA−IPB|, (rA+rB)|EAA−EAB|, (rA+rB)|ENA−ENB|
	(rA+rB)/(IPA+IPB), (rA+rB)/(EAA+EAB), (rA+rB)/(ENA+ENB)
	(IPA+IPB)^2, (IPA+IPB)(EAA+EAB), (IPA+IPB)(ENA+ENB)
	(IPA+IPB)|rA−rB|, (IPA+IPB)|IPA−IPB|, (IPA+IPB)|EAA−EAB|, (IPA+IPB)|ENA−ENB|
	(IPA+IPB)/(rA+rB), (IPA+IPB)/(EAA+EAB), (IPA+IPB)/(ENA+ENB)
	(EAA+EAB)^2, (EAA+EAB)(ENA+ENB)
	(EAA+EAB)|rA−rB|, (EAA+EAB)|IPA−IPB|, (EAA+EAB)|EAA−EAB|, (EAA+EAB)|ENA−ENB|
	(EAA+EAB)/(rA+rB), (EAA+EAB)/(IPA+IPB), (EAA+EAB)/(ENA+ENB)
	(ENA+ENB)^2
	(ENA+ENB)|rA−rB|, (ENA+ENB)|IPA−IPB|, (ENA+ENB)|EAA−EAB|, (ENA+ENB)|ENA−ENB|
	(ENA+ENB)/(rA+rB), (ENA+ENB)/(IPA+IPB), (ENA+ENB)/(EAA+EAB)
	|rA−rB|^2, |rA−rB||IPA−IPB|, |rA−rB||EAA−EAB|, |rA−rB||ENA−ENB|
	|rA−rB|/(rA+rB), |rA−rB|/(IPA+IPB), |rA−rB|/(EAA+EAB), |rA−rB|/(ENA+ENB)
	|IPA−IPB|^2, |IPA−IPB||EAA−EAB|, |IPA−IPB||ENA−ENB|
	|IPA−IPB|/(rA+rB), |IPA−IPB|/(IPA+IPB), |IPA−IPB|/(EAA+EAB), |IPA−IPB|/(ENA+ENB)
	|EAA−EAB|^2, |EAA−EAB||ENA−ENB|
	|EAA−EAB|/(rA+rB), |EAA−EAB|/(IPA+IPB), |EAA−EAB|/(EAA+EAB), |EAA−EAB|/(ENA+ENB)
	|ENA−ENB|^2
	|ENA−ENB|/(rA+rB), |ENA−ENB|/(IPA+IPB), |ENA−ENB|/(EAA+EAB), |ENA−ENB|/(ENA+ENB)
	1/(rA+rB)^2, 1/[(rA+rB)(IPA+IPB)], 1/[(rA+rB)(EAA+EAB)], 1/[(rA+rB)(ENA+ENB)]
	1/(IPA+IPB)^2, 1/[(IPA+IPB)(EAA+EAB)], 1/[(IPA+IPB)(ENA+ENB)]
	1/(EAA+EAB)^2, 1/[(EAA+EAB)(ENA+ENB)]
	1/(ENA+ENB)^2

Table B2.

Descriptor space 2 (DS2): 24 descriptors up to fourth order of atomic radius.

Order	Descriptors
1	rA+rB, |rA−rB|, 1/(rA+rB)
2	(rA+rB)^2, (rA+rB)|rA−rB|, |rA−rB|^2, |rA−rB|/(rA+rB)
	1/(rA+rB)^2
3	(rA+rB)^3, (rA+rB)^2|rA−rB|, (rA+rB)|rA−rB|^2
	|rA−rB|^3, |rA−rB|^2/(rA+rB), |rA−rB|/(rA+rB)^2, 1/(rA+rB)^3
4	(rA+rB)^4, (rA+rB)^3|rA−rB|, (rA+rB)^2|rA−rB|^2
	(rA+rB)|rA−rB|^3, |rA−rB|^4, |rA−rB|^3/(rA+rB), |rA−rB|^2/(rA+rB)^2
	|rA−rB|/(rA+rB)^3, 1/(rA+rB)^4

Table B3.

Descriptor space 3 (DS3): 24 descriptors up to fourth order of electronegativity.

Order	Descriptors
1	ENA+ENB, |ENA−ENB|, 1/(ENA+ENB)
2	(ENA+ENB)^2, (ENA+ENB)|ENA−ENB|, |ENA−ENB|^2
	|ENA−ENB|/(ENA+ENB), 1/(ENA+ENB)^2
3	(ENA+ENB)^3, (ENA+ENB)^2|ENA−ENB|
	(ENA+ENB)|ENA−ENB|^2, |ENA−ENB|^3
	|ENA−ENB|^2/(ENA+ENB), |ENA−ENB|/(ENA+ENB)^2, 1/(ENA+ENB)^3
4	(ENA+ENB)^4, (ENA+ENB)^3|ENA−ENB|
	(ENA+ENB)^2|ENA−ENB|^2, (ENA+ENB)|ENA−ENB|^3
	|ENA−ENB|^4, |ENA−ENB|^3/(ENA+ENB), |ENA−ENB|^2/(ENA+ENB)^2, |ENA−ENB|/(ENA+ENB)^3
	1/(ENA+ENB)^4

Table C1.

Results of model selection in descriptor space 1 (DS1) by exhaustive search. Models are ranked by $Q^2$ score and only the top 3 are listed.

M	Ranking	R^2	Q^2	ΔE
1	1	0.658	0.595	4.12/(rA+rB)^2 − 0.60
	2	0.592	0.522	3.56/(rA+rB) − 1.33
	3	0.546	0.481	86.63/[(rA+rB)(IPA+IPB)] − 1.82
2	1	0.823	0.782	−0.53|ENA−ENB|/(rA+rB) + 4.25/(rA+rB)^2 0.36
	2	0.771	0.728	−0.17(IPA+IPB)/(rA+rB) + 9.11/(rA+rB)^2 − 0.20
	3	0.767	0.716	−0.11|IPA−IPB|/(rA+rB) + 4.42/(rA+rB)^2 − 0.46
3	1	0.913	0.902	0.59|ENA−ENB| − 1.95|ENA−ENB|/(rA+rB) + 6.15/(rA+rB)^2 − 0.75
	2	0.911	0.899	0.12|EAA−EAB||ENA−ENB| − 1.26|ENA−ENB|/(rA+rB) + 5.77/(rA+rB)^2 − 0.60
	3	0.904	0.892	0.10(rA+rB)|ENA−ENB| − 1.13|ENA−ENB|/(rA+rB) + 5.89/(rA+rB)^2 − 0.70
4	1	0.934	0.921	1.86|EAA−EAB|/(IPA+IPB) + 0.13|ENA−ENB|^2 − 1.52|ENA−ENB|/(rA+rB) + 6.14/(rA+rB)^2 − 0.68
	2	0.933	0.921	0.04(rA+rB)|EAA−EAB| + 0.11|ENA−ENB|^2 − 1.42|ENA−ENB|/(rA+rB) + 6.09/(rA+rB)^2 − 0.67
	3	0.930	0.920	0.13(rA+rB)|ENA−ENB| − 0.06(IPA+IPB)|ENA−ENB| + 0.22(IPA+IPB)/(rA+rB) + 0.23|ENA−ENB|^2 − 1.00
5	1	0.945	0.936	0.22(rA+rB)|ENA−ENB| − 0.07(IPA+IPB)|ENA−ENB| + 0.22(IPA+IPB)/(rA+rB) − 1.06|rA−rB|/(ENA+ENB) + 0.21|ENA−ENB|^2 − 0.10
	2	0.946	0.936	−0.01(IPA+IPB)(EAA+EAB) + 0.05(EAA+EAB)|EAA−EAB| + 0.11|ENA−ENB|^2 − 1.50|ENA−ENB|/(rA+rB) + 6.16/(rA+rB)^2 − 0.53
	3	0.944	0.935	0.22(rA+rB)|ENA−ENB| − 0.07(IPA+IPB)|ENA−ENB| + 0.22(IPA+IPB)/(rA+rB) − 4.55|rA−rB|/(IPA+IPB) + 0.21|ENA−ENB|^2 − 1.01

Table C2.

Results of model selection in descriptor space 2 (DS2) by exhaustive search. Models are ranked by $Q^2$ score and only the top 3 are listed.

M	Ranking	R^2	Q^2	ΔE
1	1	0.711	0.682	8.08/(rA+rB)^4 − 0.19
	2	0.698	0.652	5.70/(rA+rB)^3 − 0.34
	3	0.658	0.595	4.12/(rA+rB)^2 − 0.60
2	1	0.876	0.866	6.87/(rA+rB)^3 − 5.02|rA−rB|/(rA+rB)^3 − 0.18
	2	0.863	0.844	5.16/(rA+rB)^2 − 5.49|rA−rB|/(rA+rB)^3 − 0.51
	3	0.844	0.833	−2.01|rA−rB|/(rA+rB)^2 + 8.43/(rA+rB)^4 + 0.03
3	1	0.903	0.893	0.08|rA−rB|^2 + 6.03/(rA+rB)^2 − 7.03|rA−rB|/(rA+rB)^3 − 0.68
	2	0.902	0.893	5.97/(rA+rB)^2 + 0.02(rA+rB)|rA−rB|^2 − 6.61|rA−rB|/(rA+rB)^3 − 0.67
	3	0.902	0.892	5.85/(rA+rB)^2 + 0.04|rA−rB|^3 − 6.58|rA−rB|/(rA+rB)^3 − 0.64
4	1	0.903	0.892	0.02(rA+rB)|rA−rB| + 6.04/(rA+rB)^2 + 0.08|rA−rB|^3/(rA+rB) − 7.01|rA−rB|/(rA+rB)^3 − 0.69
	2	0.903	0.892	5.99/(rA+rB)^2 + 0.01(rA+rB)^2|rA−rB| + 0.09|rA−rB|^3/(rA+rB) − 6.84|rA−rB|/(rA+rB)^3 − 0.67
	3	0.903	0.892	0.07|rA−rB|^2 + 6.00/(rA+rB)^2 + 0.03|rA−rB|^3/(rA+rB) − 7.01|rA−rB|/(rA+rB)^3 − 0.67
5	1	0.905	0.892	−0.05(rA+rB)^3 + 0.03(rA+rB)|rA−rB|^2 + 5.83/(rA+rB)^3 + 0.01(rA+rB)^4 − 6.65|rA−rB|/(rA+rB)^3 + 0.34
	2	0.904	0.892	−0.12|rA−rB|^2 + 5.88/(rA+rB)^2 + 0.18|rA−rB|^3 − 0.04|rA−rB|^4 − 6.56|rA−rB|/(rA+rB)^3 − 0.64
	3	0.905	0.891	−0.12(rA+rB)^2 + 0.03(rA+rB)|rA−rB|^2 + 5.60/(rA+rB)^3 + 0.003(rA+rB)^4 − 6.69|rA−rB|/(rA+rB)^3 + 0.54

Table C3.

Results of model selection in descriptor space 3 (DS3) by exhaustive search. Models are ranked by $Q^2$ score and only the top 3 are listed.

M	Ranking	R^2	Q^2	ΔE
1	1	0.375	0.338	−19.07|ENA−ENB|/(ENA+ENB)^3 + 0.47
	2	0.375	0.336	−5.25|ENA−ENB|/(ENA+ENB)^2 + 0.51
	3	0.336	0.294	−1.27|ENA−ENB|/(ENA+ENB) + 0.50
2	1	0.572	0.498	0.72(ENA+ENB) − 0.004(ENA+ENB)^3|ENA−ENB| − 2.47
	2	0.565	0.482	0.08(ENA+ENB)^2 − 0.004(ENA+ENB)^3|ENA−ENB| − 0.94
	3	0.546	0.478	−11.54/(ENA+ENB) − 0.004(ENA+ENB)^3|ENA−ENB| + 3.31
3	1	0.708	0.654	0.07(ENA+ENB)|ENA−ENB|^2 + 0.004(ENA+ENB)^4 − 0.01(ENA+ENB)^3|ENA−ENB| − 0.44
	2	0.708	0.646	0.02(ENA+ENB)^3 + 0.05(ENA+ENB)|ENA−ENB|^2 − 0.01(ENA+ENB)^3|ENA−ENB| − 0.85
	3	0.697	0.630	0.02(ENA+ENB)^3 + 0.06|ENA−ENB|^3 − 0.01(ENA+ENB)^3|ENA−ENB| − 0.72
4	1	0.728	0.676	0.05(ENA+ENB)^2|ENA−ENB| + 0.004(ENA+ENB)^4 − 0.02(ENA+ENB)^3|ENA−ENB| + 0.01(ENA+ENB)^2|ENA−ENB|^2 − 0.60
	2	0.725	0.673	0.11(ENA+ENB)|ENA−ENB| + 0.004(ENA+ENB)^4 − 0.02(ENA+ENB)^3|ENA−ENB| + 0.01(ENA+ENB)^2|ENA−ENB|^2 − 0.61
	3	0.718	0.662	0.32|ENA−ENB| + 0.004(ENA+ENB)^4 − 0.02(ENA+ENB)^3|ENA−ENB| + 0.01(ENA+ENB)^2|ENA−ENB|^2 − 0.60
5	1	0.754	0.714	1.20(ENA+ENB)|ENA−ENB| − 0.32(ENA+ENB)^2|ENA−ENB| + 0.005(ENA+ENB)^4 + 0.03(ENA+ENB)^2|ENA−ENB|^2 − 6.10|ENA−ENB|/(ENA+ENB)^2 − 0.91
	2	0.752	0.709	1.08(ENA+ENB)|ENA−ENB| − 0.29(ENA+ENB)^2|ENA−ENB| + 0.19(ENA+ENB)|ENA−ENB|^2 − 2.76|ENA−ENB|^2/(ENA+ENB) + 0.01(ENA+ENB)^4 − 0.86
	3	0.753	0.709	2.33|ENA−ENB| + 0.90|ENA−ENB|^2 − 0.17(ENA+ENB)^2|ENA−ENB| + 0.005(ENA+ENB)^4 − 13.32|ENA−ENB|/(ENA+ENB)^2 − 0.85

1. Bonds and Bands in Semiconductors: New insight into covalent bonding in crystals has followed from studies of energy-band spectroscopy.

Authors: J C Phillips
Journal: Science Date: 1970-09-11 Impact factor: 47.728

2. External validation and prediction employing the predictive squared correlation coefficient test set activity mean vs training set activity mean.

Authors: Gerrit Schüürmann; Ralf-Uwe Ebert; Jingwen Chen; Bin Wang; Ralph Kühne
Journal: J Chem Inf Model Date: 2008-11 Impact factor: 4.956

3. Big data of materials science: critical role of the descriptor.

Authors: Luca M Ghiringhelli; Jan Vybiral; Sergey V Levchenko; Claudia Draxl; Matthias Scheffler
Journal: Phys Rev Lett Date: 2015-03-10 Impact factor: 9.161

4. Accelerating materials property predictions using machine learning.

Authors: Ghanshyam Pilania; Chenchen Wang; Xun Jiang; Sanguthevar Rajasekaran; Ramamurthy Ramprasad
Journal: Sci Rep Date: 2013-09-30 Impact factor: 4.379
