
Universal machine learning framework for defect predictions in zinc blende semiconductors.

Arun Mannodi-Kanakkithodi¹,², Xiaofeng Xiang³, Laura Jacoby⁴, Robert Biegaj⁵, Scott T Dunham⁶, Daniel R Gamelin⁴, Maria K Y Chan¹.

Abstract

We develop a framework powered by machine learning (ML) and high-throughput density functional theory (DFT) computations for the prediction and screening of functional impurities in groups IV, III-V, and II-VI zinc blende semiconductors. Elements spanning the length and breadth of the periodic table are considered as impurity atoms at the cation, anion, or interstitial sites in supercells of 34 candidate semiconductors, leading to a chemical space of approximately 12,000 points, 10% of which are used to generate a DFT dataset of charge dependent defect formation energies. Descriptors based on tabulated elemental properties, defect coordination environment, and relevant semiconductor properties are used to train ML regression models for the DFT computed neutral state formation energies and charge transition levels of impurities. Optimized kernel ridge, Gaussian process, random forest, and neural network regression models are applied to screen impurities with lower formation energy than dominant native defects in all compounds.

Entities: Chemical

Keywords: combinatorial screening; computational materials science; density functional theory; high-throughput data; machine learning; materials informatics; mid-gap states; point defects; semiconductors

Year: 2022 PMID: 35510195 PMCID: PMC9058924 DOI: 10.1016/j.patter.2022.100450

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

Compositional manipulation of semiconductors is one of the primary methods used to obtain optimal properties.1, 2, 3, 4, 5, 6 Apart from alloying, the primary means for compositional control of semiconductor properties is the introduction of dopants or impurities, i.e., guest atoms at a cation, anion, or interstitial site. Such impurities, even in a very dilute concentration, can potentially cause major changes in the electronic structure and physical properties of the material.7, 8, 9, 10 A complete understanding of a semiconductor’s optoelectronic behavior requires estimating the formation energies of point defects, whether accidental or intentionally introduced. While approximately 90% of solar cells still rely on crystalline Si as the absorber, related group IV semiconductors such as SiC, II–VI semiconductors such as CdTe, III–V semiconductors such as GaAs, and various derivative compounds are all viable as photovoltaic (PV) materials and are currently in use in single terminal as well as tandem solar cells.13, 14, 15, 16, 17 Many of these compounds have also been used in transistors, photodiodes, lasers, and qubits or quantum sensors. The chemical space of binary group IV, III–V, and II–VI semiconductors contains compounds that exist in the cubic zinc blende (ZB) or wurtzite crystal structures and show systematic trends in lattice constants, electronic band gaps, optical absorption coefficients, and defect properties. Alloying in these spaces has frequently been used for tuning properties and performance, with some prominent examples including the use of CdSeTe in solar cells, and AlGaAs in light-emitting diodes. Although the structure and optoelectronic properties of binary, ternary, and even quaternary compounds in the group IV, III–V, and II–VI semiconductor space have been widely studied both computationally and experimentally, a comprehensive understanding of the formation likelihood and electronic levels of point defects and impurities is missing.
A look at functional atomic defects in semiconductors reveals that the energy levels created inside the band gap can (a) reduce PV efficiency via nonradiative recombination of charge carriers, (b) enable sub-gap absorption or emission if the levels are partially filled or if they have low photoionization energies, and (c) enable quantum computing, quantum sensing, and quantum communication via their nuclear or electronic spins. A universal prediction framework for impurity behavior in known and novel semiconductor spaces is thus paramount. Given such a framework for group IV, III–V, and II–VI semiconductors, it would be possible to perform high-throughput screening of impurity atoms from across the periodic table in terms of their energetics relative to dominant native defects (such as vacancies and self-interstitials), the nature of equilibrium conductivity, and the location of energy levels with respect to band edges. For years, defect levels and their donor or acceptor type nature have been experimentally measured using methods such as deep level transient spectroscopy and cathodoluminescence, but such studies have been limited by difficulties in sample preparation and in assigning measured levels to specific vacancies, interstitials, substitutions, or complex defects. Computationally, first-principles density functional theory (DFT) has been widely used to predict the formation energies of point defects as a function of the net charge in the system, the chemical potential conditions, and the Fermi level as it goes from the valence band maximum (VBM) to the conduction band minimum (CBM). When an appropriate level of theory is applied, the DFT-computed defect charge transition levels have been seen to match well with measured levels and have helped to identify specific charge transitions of specific defects.
DFT can reliably predict defect and impurity behavior in a variety of semiconductors, but limitations arise from the computational expense of using large supercells and performing charged calculations, making it difficult to extend calculations to explore new systems broadly. Predictive machine learning (ML) models, trained from existing or freshly generated data, act as surrogates for DFT calculations by providing statistical estimates of the desired properties.26, 27, 28 The burgeoning field of materials informatics has led to many successes, with some of the most notable contributions resulting from the combination of first-principles computations and ML. ML applied on DFT data has seen the development of predictive and design tools29, 30, 31; the discovery of novel materials for batteries, capacitors, solar cells, and thermoelectrics32, 33, 34, 35, 36; and the efficient exploration of extremely large chemical spaces. Indeed, ML has been instrumental in accelerating the prediction of properties related to point defects and dopants in materials. This includes predicting vacancy formation and substitutional energies of oxides using regression algorithms applied on DFT data,39, 40, 41, 42 learning the formation energies, transition levels, and migration energies of point defects in known semiconductors and alloys, predicting the dopability of semiconductors, and improving high-fidelity predictions of point defect properties using previously unknown correlations. Recent work from our group involved performing high-throughput DFT computations to study the formation energies and charge transition levels of impurities in halide perovskites and Cd-chalcogenides, following which ML models were trained for the prediction and screening of impurity atoms that can shift the equilibrium Fermi level as determined by dominant native defects.
An extension of these studies in terms of semiconductor and impurity chemical spaces as well as ML techniques can pave the path toward a universal framework for impurity prediction and design. In this work, we consider atomic impurities from across the periodic table, in a chemical space of binary group IV, III–V, and II–VI semiconductors in the ZB structure, and use the DFT + ML methodology to predict their complete charge, chemical potential, and Fermi level-dependent formation energies. This is a direct extension of our work on Cd-chalcogenides, which forms a subset of the computational data presented here. We perform high-throughput DFT computations on impurity atoms simulated at the cation, anion, and different interstitial sites in several selected compounds in the group IV (e.g., Si, SiC, and GeC), III–V (e.g., BSb, GaAs, and InP), and II–VI (e.g., ZnSe and CdS) chemical space and use descriptors encoding information about the semiconductor, the impurity atom, and the defect site coordination environment as input to train ML models that predict the neutral state formation energy and six types of charge transition levels for any possible impurity. We used sure independence screening and sparsifying operator (SISSO) for feature selection and the K-nearest neighbors (KNN) approach for outlier detection, followed by regression techniques such as random forest, Gaussian process, and neural network to yield the predictive models. In the following sections, we discuss the exact composition of the chemical space and visualize the DFT computed data, also plotting the Fermi level-dependent formation energies of native defects and impurities for selected compounds. We then delve deep into the development of the ML framework, explaining the descriptor choices, methods of feature selection, outlier exclusion, and the various regression techniques used. 
We compare the performances of different models using root mean square errors (RMSE) and estimate the uncertainties in prediction for each technique. The best models thus obtained are used to make predictions for the entire chemical space, only approximately 10% of which was used to generate the DFT data, and to compile a list of dominating impurities for each compound. We finish with a perspective on what can be accomplished using this design framework, the limitations of this work, and potential next steps. The workflow adopted here is laid out in Figure 1A, highlighting data generation, feature selection, regression model development, and high-throughput screening. The DFT data and ML codes generated through this work are made available on GitHub.

Figure 1

Outline and chemical space

(A–D) (A) The DFT-ML workflow followed in this work, and the semiconductor-impurity chemical space in terms of (B) the cation and anion choices for group IV, II–VI, and III–V compounds, (C) types of defect sites, and (D) impurity atoms selected from across the periodic table.


Results

Semiconductor and impurity chemical space

The chemical space considered in this work has been pictured in Figure 1 in terms of the semiconductor compounds (b), possible defect sites (c), and impurity atoms (d). We include AB semiconductors (with A broadly defined as the cation and B the anion) belonging to groups II–VI, III–V and IV–IV, leading to 8 group II–VI compounds (CdO, CdS, CdSe, CdTe, ZnO, ZnS, ZnSe, and ZnTe), 16 group III–V compounds (BN, BP, BAs, BSb, AlN, AlP, AlAs, AlSb, GaN, GaP, GaAs, GaSb, InN, InP, InAs, and InSb), and 10 group IV compounds (C, Si, Ge, Sn, SiC, GeC, SnC, SiGe, SiSn, and GeSn). The resulting 34 compounds are modeled in the cubic ZB structure, with A atoms occupying an FCC lattice and B atoms occupying the tetrahedral sites. The DFT-computed lattice constants and band gaps (using different levels of theory) of all the compounds are listed in Table S1, along with corresponding experimental measurements collected from the literature. It can be seen that, although the cubic lattice constants are reasonably accurate, the standard generalized gradient approximation-Perdew-Burke-Ernzerhof (GGA-PBE) functional underestimates the band gap, as has been well demonstrated in the past.47, 48, 49 Band gaps computed for some compounds using the hybrid HSE06 functional, with and without spin-orbit coupling (SOC), compare better with experiments. In any AB compound in the ZB structure, defects or impurities could be found at the A site, B site, or several possible symmetrically inequivalent interstitial sites. Figure 1 also shows the defect sites considered in this work, namely, the A and B sites and three types of interstitial sites: the A site interstitial (with 4 neighboring A atoms), the B site interstitial (with 4 neighboring B atoms), and the neutral site interstitial (with 3 neighboring A and B atoms each). 
The 5 defect sites are considered in the 30 binary compounds while in the remaining 4 elemental systems (C, Si, Ge, and Sn), 3 defect sites are considered (A site, A site interstitial, and neutral site interstitial). For a few defects, we also tested other possible interstitial sites in the ZB structure, such as anion/cation-split sites (as described in the literature), and generally found one of the three chosen interstitial sites to be lower in energy. In terms of impurity atoms, we consider nearly all elements from periods II to VI as well as all lanthanides, leading to a total of 77 species, as pictured in Figure 1. The total number of possible impurities in this chemical space can thus be estimated as: 77 × 5 × 30 + 77 × 3 × 4 = 12,474. Out of these 12,474 data points, about 10% are considered for DFT computations to determine their neutral state formation energies, and charge transition levels; ML models trained on these data based on the properties of the semiconductor compound, defect site coordination, and impurity atom lead to generalized predictions applicable to all the data points. The 10% of data points chosen for computations constituted a desirable diversity in semiconductor type, element type, and defect site type; in other words, while the actual data points were selected at random (and added to prior data), we ensured that every compound (out of 34), element (out of 77), and defect site (substitutional or interstitial) is roughly equally represented, with the exception of CdTe, which is heavily represented.
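The count above can be reproduced by enumerating the space directly. The following is a minimal sketch: the site labels are hypothetical strings chosen for illustration, and the 77 impurity species are represented by placeholder indices because the element list is not given in this section.

```python
from itertools import product

# Compound lists as given in the text. Elemental group IV systems are kept
# separate because only three distinct defect sites are considered for them.
II_VI = ["CdO", "CdS", "CdSe", "CdTe", "ZnO", "ZnS", "ZnSe", "ZnTe"]
III_V = ["BN", "BP", "BAs", "BSb", "AlN", "AlP", "AlAs", "AlSb",
         "GaN", "GaP", "GaAs", "GaSb", "InN", "InP", "InAs", "InSb"]
IV_BINARY = ["SiC", "GeC", "SnC", "SiGe", "SiSn", "GeSn"]
IV_ELEMENTAL = ["C", "Si", "Ge", "Sn"]

# Hypothetical site labels (assumption: any unique strings would do).
SITES_BINARY = ["A", "B", "A_int", "B_int", "neutral_int"]
SITES_ELEMENTAL = ["A", "A_int", "neutral_int"]

# Placeholder indices standing in for the 77 impurity species.
IMPURITIES = range(77)


def enumerate_chemical_space():
    """Return every (compound, site, impurity) combination as a tuple."""
    binary = II_VI + III_V + IV_BINARY
    points = list(product(binary, SITES_BINARY, IMPURITIES))
    points += list(product(IV_ELEMENTAL, SITES_ELEMENTAL, IMPURITIES))
    return points


space = enumerate_chemical_space()
print(len(space))  # 77*5*30 + 77*3*4 = 12474
```

A roughly 10% subset for DFT would then be drawn from `space`, subject to the diversity constraints described above.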

Defect properties: Benchmarking DFT and native defect energy picture

The methodology to compute the defect formation energy (E^f) from DFT as a function of charge (q), chemical potential (μ) conditions, and Fermi level (E_F) is described in the Experimental procedures section. Ultimately, for all native defects and impurities, we calculate neutral state formation energies at two extreme μ conditions, namely A-rich and B-rich, and six types of defect charge transition levels, namely +3/+2, +2/+1, +1/0, 0/−1, −1/−2, and −2/−3. An experimental comparison of defect properties computed at the chosen level of theory is worthwhile before launching a computational data-driven discovery exercise. As noted earlier, the standard GGA-PBE functional used in this work is known to underestimate band gaps, but it has been reported that defect charge transition levels from PBE can compare well with experiments for various semiconductor classes such as hybrid perovskites, and even group IV, III–V, and II–VI semiconductors. This contrast can be attributed to total energy differences in DFT being more accurate than using Kohn-Sham energy levels to estimate band edges or band gaps, or in the case of hybrid perovskites, a fortuitous cancellation of errors that leads to PBE being as accurate as HSE06+SOC. It has been reported in past work that defect levels computed from semilocal GGA-PBE for well-known ZB semiconductors such as Si and GaAs can span the physical band gap of the material; that is, the defect transition levels calculated from PBE correspond well with experimental values up to the experimental band gap, even for transitions that are further from the VBM than the PBE-calculated band gap.
To ascertain the accuracy of PBE defect and impurity levels in all II–VI, III–V, and IV–IV compounds, we scoured the published literature51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66 (in a brute force manner, something we hope to replace with more efficient and comprehensive natural language processing-based searches in the future) and collected measured energy levels for 84 defects across the 34 compounds, adding to the set of 15 points collected for CdTe in ref. 12. As presented in Figure 2 and Table S2, the PBE predicted defect levels are highly correlated with experimental measurements, with a correlation coefficient (R²) of 0.85. We find that the RMSE of PBE predictions compared with experiments is less than 0.2 eV for III–V and IV–IV compounds, and 0.24 eV for II–VI compounds, resulting in a total RMSE of approximately 0.21 eV. This is an acceptable level of accuracy that is similar to what we found in earlier work,12 and is within the recognized accuracy limit of DFT electronic levels; a similar ML versus DFT accuracy would be desired for eventual prediction and screening to be performed with some degree of experimental precision. To our knowledge, this is the largest comparison performed to date between DFT computed defect levels and experimental measurements.

Figure 2

Comparison of DFT-computed defect levels with experimentally measured levels

(Obtained from publications51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66). Measured versus DFT RMSE values are also shown for different semiconductor types and for the combined set of points. A few defect levels have been labeled.

Before discussing the computational dataset of impurity formation energies and charge transition levels, we take a look at the complete picture of native point defect formation energies as a function of q, μ, and E_F. In each of the 34 semiconductors, we performed neutral and charged DFT calculations for all possible vacancy, interstitial, and anti-site substitutional defects. It should be noted that all interstitial and substitutional native defects are made a part of the impurity DFT data for ML but vacancies are not, because in the current ML framework many descriptor dimensions are made up of properties of the atom occupying the defect site. In Figure 3, we plotted the computed formation energies as a function of Fermi level (as it goes from the VBM to the CBM) for all native point defects and selected impurities in (A) ZnSe at Se-rich conditions, (B) AlAs at As-rich conditions, and (C) SiC at Si-rich conditions.

Figure 3

Charge- and Fermi-level-dependent formation energy picture

(A–C) Computed formation energies of native defects (solid lines) and selected impurities (dashed lines) in (A) ZnSe under Se-rich conditions, (B) AlAs under As-rich conditions, and (C) SiC under Si-rich conditions, as a function of the Fermi level as it goes from the VBM (E_F = 0 eV) to the CBM (E_F = experimental band gap). The intersection point of the dominant donor and acceptor type native defects (shown using extended dotted colored lines) approximately gives the equilibrium defect formation energy, and the vertical dotted lines show the equilibrium Fermi level. Some charge transition levels and neutral state formation energies have been labeled.

From Figure 3, we can deduce the lowest energy donor and acceptor type native defects, their preferred charged states inside the band gap, the p-type or n-type nature of equilibrium conductivity, and the energetics of impurities relative to dominant native defects. For instance, in ZnSe at Se-rich conditions, the V_Zn and Zn are the dominant acceptor and donor type defects, respectively, and pin the equilibrium Fermi level E_F (determined by applying charge neutrality conditions) closer to the valence band edge, indicating p-type conductivity. It can be seen that impurities Pt_Zn and Cl create higher energy negatively charged defects in the band gap than the V_Zn, meaning they cannot compensate for native defects. Similarly, the V_Al and As_Al form the lowest energy acceptor and donor type defects in AlAs at As-rich conditions and pin the equilibrium E_F around the middle of the band gap, resulting in intrinsic-type conductivity, while impurities Tl_As and C are both higher energy defects. In SiC at Si-rich conditions, the V and C_Si are the lowest energy donor and acceptor type defects and lead to intrinsic conductivity. Also marked in Figures 3A–3C are some neutral state formation energies, ΔH, and some charge transition levels, ε(q1/q2).
In Tables S3 and S4, we list the dominating acceptor and donor type defects, the equilibrium Fermi level and formation energy, and the type of conductivity in every compound at A-rich and B-rich chemical potential conditions; all impurity formation energies will be compared against these defects to determine whether they are dominating or not, and how they might change the equilibrium conductivity.

Computational dataset

Building on the dataset of impurity properties in Cd-chalcogenides from our previous work and the native defects presented in the previous section, we performed additional impurity calculations for randomly selected impurity atoms across the space of 34 group IV, III–V, and II–VI semiconductors. For any defect or impurity, what is ultimately desired is a complete picture of the formation energy as a function of the charge, Fermi level, and chemical potential, as shown in Figure 3. We explicitly considered two types of predictable properties that can yield the entire formation energy picture, namely, the neutral state formation energy (ΔH [A-rich] or ΔH [B-rich]) and possible charge transition levels (ε(+3/+2), ε(+2/+1), ε(+1/0), ε(0/-1), ε(−1/−2) and ε(−2/−3)). Out of a total chemical space of 12,474 impurity types, we computed using DFT (a) 1568 ΔH values at either chemical potential condition and (b) 1004 ε(q1/q2) values for all six charge transition types. We then trained eight separate ML models for ΔH (A-rich), ΔH (B-rich), and transition levels from ε(+3/+2) to ε(−2/−3). In Figure S1, the distributions of semiconductor types (II–VI, III–V, or IV–IV) and impurity types (A site, B site, or interstitial site) are pictured for the entire chemical space and for the two DFT datasets. It can be seen that, based on the chemical space we selected, almost one-half of the data points belong to III–V semiconductors and one-quarter of the points each belong to II–VI and IV–IV; this is, however, not reflected in the DFT datasets, owing to a predominance of data available on II–VI semiconductors from past and present work. Nevertheless, it is expected that the III–V and IV–IV semiconductors are adequately represented because, while choosing data points for DFT calculations, it was ensured that at least 10 impurities in each compound are selected, and every defect site is considered. 
Further, the entire chemical space contains approximately 40% substitutional impurities and 60% interstitial impurities, with the former being equally divided between A site and B site substitutions, and the latter divided equally between the three types of interstitial sites; all the defect sites are pictured in an example ZB supercell in Figure 1. The defect site distribution is similar for each defect property (ΔH or ε), ensuring adequate representation. To visualize the DFT data, we plotted all the computed charge transition levels and formation energies in Figure 4. The transition levels are plotted two at a time against each other in Figures 4A–4C, as a fraction of the experimental band gap of the compound. Many transition levels are seen to lie deep inside (>0.2 eV from the band edges) the shaded region that represents the band gap, which indicates the tendency of certain impurities to create deep energy levels. It can also be seen that most of the mid-gap impurity levels belong to +1/0 and 0/−1 transitions, and to a lesser extent to +2/+1 and −1/−2, but almost not at all to the higher charge transitions like +3/+2 and −2/−3. The ranges of values of the transition levels are fairly wide, from deep inside the VB to the band gap to deep inside the CB, which reveals a great variety in the type of impurities based on their preferred oxidation states and the sites they occupy. In Figure 4D, we plotted ΔH (A rich) versus ΔH (B rich), which shows values that range from approximately −5 eV to approximately 20 eV. For the group IV compounds C, Si, Ge, and Sn, the A-rich and B-rich conditions are the same, leading to many of the red points lying along the diagonal. In general, ΔH (A rich) and ΔH (B rich) pin two extremes of the impurity formation energy values, and medium chemical potential conditions would lead to intermediate formation energies.
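The way a neutral state formation energy and a set of charge transition levels together determine the full Fermi-level-dependent picture can be sketched in a few lines. This is a minimal illustration, not the code released with the paper. It assumes the standard relations that the formation energy of charge state q varies linearly with the Fermi level, E^f(q, E_F) = E^f(q, 0) + q·E_F, and that ε(q1/q2) is the Fermi level at which charge states q1 and q2 have equal formation energy. The function names and the example numbers are hypothetical.

```python
def charged_intercepts(dh_neutral, levels):
    """Formation energy of each charge state at E_F = 0 (the VBM).

    dh_neutral: neutral state formation energy in eV.
    levels: dict mapping (q_high, q_low) -> transition level in eV above the VBM,
            e.g. {(1, 0): 0.5, (0, -1): 1.5}.
    """
    intercepts = {0: dh_neutral}
    q = 0
    # Walk upward through positive charge states: E(q+1, 0) = E(q, 0) - eps(q+1/q).
    while (q + 1, q) in levels:
        intercepts[q + 1] = intercepts[q] - levels[(q + 1, q)]
        q += 1
    q = 0
    # Walk downward through negative charge states: E(q-1, 0) = E(q, 0) + eps(q/q-1).
    while (q, q - 1) in levels:
        intercepts[q - 1] = intercepts[q] + levels[(q, q - 1)]
        q -= 1
    return intercepts


def formation_energy(dh_neutral, levels, e_fermi):
    """Return (lowest formation energy, stable charge state) at a given Fermi level."""
    intercepts = charged_intercepts(dh_neutral, levels)
    return min((e0 + q * e_fermi, q) for q, e0 in intercepts.items())


# Hypothetical impurity: dH = 2.0 eV, eps(+1/0) = 0.5 eV, eps(0/-1) = 1.5 eV.
example_levels = {(1, 0): 0.5, (0, -1): 1.5}
for e_f in (0.0, 1.0, 2.0):
    print(e_f, formation_energy(2.0, example_levels, e_f))
```

Sweeping `e_fermi` from 0 to the band gap traces out a piecewise-linear curve of the kind plotted in Figure 3, with kinks at the transition levels.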

Figure 4

Visualization of DFT data

(A–D) (A–C) transition levels (+3/+2) to (−2/−3), and (D) neutral state formation energies at A-rich and B-rich chemical potential conditions, plotted for different semiconductor types.


Machine learning framework

Descriptors

Aside from generating the computational data, the need for domain expertise is most evident in creating appropriate descriptors or sets of features that can uniquely represent every point in the dataset. In the semiconductor and impurity chemical space used in this work, we can uniquely identify every data point using the identity of the semiconductor, the identity of the impurity atom, and the defect site it occupies. Thus, we define descriptors for any impurity M at any site S in any compound AB by combining the following three levels of information:

AB: Available computed or experimental properties of the semiconductor AB, namely, the formation energy, lattice constant, band gap, and the electronic and ionic dielectric constants; this leads to five dimensions.

Elem: Tabulated elemental properties of the impurity M as well as species A and B, such as ionic radius, ionization energy, electronegativity, and so on; this leads to 81 dimensions.

CM: Quantifying the chemical coordination environment around the defect site S in terms of A and B neighbors, using the Coulomb Matrix definition; this leads to eight dimensions.

The complete list of descriptors can be found on the x axis of Figure S2 as well as Table S5, which show the Pearson coefficient of linear correlation between the properties of interest, ΔH and ε(q1/q2), and each of the descriptors.
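To make the CM descriptor concrete, the sketch below builds a Coulomb matrix for a defect site and its nearest neighbors using the commonly cited definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|). How the paper reduces this to exactly eight dimensions is not stated in this section, so that step is omitted. The geometry, the choice of species, and the function name are assumptions for illustration only.

```python
import math


def coulomb_matrix(atomic_numbers, positions):
    """Coulomb matrix for a small cluster of atoms.

    atomic_numbers: list of nuclear charges Z_i.
    positions: list of (x, y, z) coordinates in any consistent length unit.
    """
    n = len(atomic_numbers)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i^2.4
                cm[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                # Off-diagonal term: pairwise nuclear repulsion Z_i * Z_j / r_ij
                r_ij = math.dist(positions[i], positions[j])
                cm[i][j] = atomic_numbers[i] * atomic_numbers[j] / r_ij
    return cm


# Hypothetical example: a Cu impurity (Z = 29) on a tetrahedral site with
# four Se neighbors (Z = 34); the offset s is an arbitrary illustrative value.
s = 1.42
numbers = [29, 34, 34, 34, 34]
positions = [(0, 0, 0), (s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]
cm = coulomb_matrix(numbers, positions)
```

The first row of `cm` then summarizes how the impurity couples to each neighbor, which is the kind of site-specific information the A-site, B-site, and interstitial environments differ in.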

Feature selection

The primary feature set of 94 dimensions is assumed to be relevant in its entirety for describing the target properties, that is, the impurity transition levels and formation energies. To better explore the nonlinear relationships that may exist between these descriptor dimensions and the properties, we used the SISSO method to perform feature engineering. First, a set of operators, namely +, −, ∗, /, exp, log, ^(−1), ^2, ^3, sqrt, cbrt, and |-|, is applied recursively for feature space expansion. The total feature size goes from approximately 10^2 to approximately 10^5 after two iterations. Next, sure independence screening is used to screen all features from 0D feature space (no iteration) to 1D feature space (one iteration) to 2D feature space (two iterations) using a linear correlation metric, leaving behind only highly correlated features. Finally, a sparsifying operation is applied to filter down the feature space to 80–150 features for each output.
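The expansion and screening steps can be illustrated with a small stand-alone sketch. It is not the SISSO code itself: it applies one iteration of a subset of the operators listed above, ranks the resulting features by absolute Pearson correlation with the target, and omits the sparsifying step. All function and feature names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def expand_features(features):
    """One iteration of feature expansion over a dict of name -> list of values."""
    expanded = dict(features)
    for name, v in features.items():
        expanded[f"({name})^2"] = [a * a for a in v]
        expanded[f"({name})^3"] = [a ** 3 for a in v]
        expanded[f"exp({name})"] = [math.exp(a) for a in v]  # fine for small values only
        if all(a > 0 for a in v):  # sqrt and log need positive inputs
            expanded[f"sqrt({name})"] = [math.sqrt(a) for a in v]
            expanded[f"log({name})"] = [math.log(a) for a in v]
        if all(a != 0 for a in v):
            expanded[f"({name})^-1"] = [1.0 / a for a in v]
    for (n1, v1), (n2, v2) in combinations(features.items(), 2):
        expanded[f"({n1}+{n2})"] = [a + b for a, b in zip(v1, v2)]
        expanded[f"({n1}-{n2})"] = [a - b for a, b in zip(v1, v2)]
        expanded[f"({n1}*{n2})"] = [a * b for a, b in zip(v1, v2)]
        if all(b != 0 for b in v2):
            expanded[f"({n1}/{n2})"] = [a / b for a, b in zip(v1, v2)]
    return expanded


def sis_screen(features, target, keep):
    """Keep the `keep` features most linearly correlated with the target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:keep]


# Toy data: the target depends quadratically on x1, which only the
# expanded feature "(x1)^2" captures with a perfectly linear correlation.
primary = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0], "x2": [2.0, 1.0, 4.0, 3.0, 6.0]}
target = [1.0, 4.0, 9.0, 16.0, 25.0]
expanded = expand_features(primary)
top = sis_screen(expanded, target, keep=3)
```

Calling `expand_features` on its own output gives the second iteration, which is where the feature count grows by orders of magnitude.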

Outlier identification

The detection of outliers in a dataset helps identify candidates with unusual properties, which may either be erroneous or lead to lower accuracy when used to train ML models. KNN is a method commonly applied to identify the outliers in a dataset based on a feature space. KNN assigns classes to data points based on the most common assignment of its k-nearest neighbors; any point that is surrounded by points belonging to a different class is denoted an outlier. Here, the DFT computed transition levels and formation energy values were used as a combined input to a KNN framework. Another method we considered to detect outliers was the principal component analysis (PCA), which decreases the number of variables used to describe an output while still maintaining most of the descriptive information. A covariance matrix of the data is decomposed to orthogonal eigenvectors, associated with eigenvalues that signify how much of the variance in the data that eigenvector captures. Each data point is labeled with an outlier score as determined by the sum of the weighted distances to all the eigenvectors, with smaller eigenvalues having higher influence (as this is where outliers are more likely to exist). Using both KNN and PCA, we selected 10% of the data with the highest outlier scores to be removed. We found that PCA disproportionately removed data points belonging to IV–IV semiconductors. Ultimately, the KNN filtered inlier points proved more effective for all models, and were used as the standard training sets.
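A minimal sketch of KNN-based outlier removal follows. It uses a distance-based variant, in which each point is scored by its mean distance to its k nearest neighbors and the top 10% of scores are discarded; this is a common way of turning KNN into an outlier score and is an assumption here, not necessarily the exact procedure used in this work. The points would be vectors of DFT-computed properties; the toy data and function names are hypothetical.

```python
import math


def knn_outlier_scores(points, k=5):
    """Mean Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def split_inliers(points, k=5, outlier_fraction=0.10):
    """Return (inlier indices, outlier indices), dropping the highest-scoring fraction."""
    scores = knn_outlier_scores(points, k)
    n_out = int(round(outlier_fraction * len(points)))
    order = sorted(range(len(points)), key=scores.__getitem__, reverse=True)
    outliers = set(order[:n_out])
    inliers = [i for i in range(len(points)) if i not in outliers]
    return inliers, sorted(outliers)


# Toy data: nine clustered points plus one point far from the rest.
toy_points = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
inlier_idx, outlier_idx = split_inliers(toy_points, k=3, outlier_fraction=0.10)
```

The inlier indices would then define the training set passed on to the regression models.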

Training regression models

In the following sections, we discuss the optimization and performance of various regression models trained on the computational data, starting with linear regression and moving on to different nonlinear regression techniques, namely, random forests, Gaussian processes, kernel ridge, and neural network regression. Common to every method used in this work is the way the training-test split, cross-validation, hyperparameter optimization, and error evaluation was performed. A separate model was trained for each of the eight outputs, namely the six types of transition levels and two formation energies. Five-fold cross-validation was implemented for each model because of a strong dependence of the prediction ability on the exact points chosen for training. Cross-validation helps to reduce the reported bias and variance, and is important for avoiding overfitting. Various important hyperparameters were optimized for each regression technique; for instance, for neural networks, they include the number of hidden layers, the numbers of nodes in each layer, the dropout rate, and so on. All regression models were trained using functions in the Python ML library scikit-learn. The metric for evaluating model performance was chosen to be the prediction RMSE. Each of the five folds was treated as a validation set over multiple training cycles, and the prediction RMSE for each fold was averaged over the number of folds. This leads to an effective 80–20 training-test split in the dataset, and an effective test prediction error is obtained for every data point, providing an unbiased prediction that reveals the true predictive power of the trained model. The optimal set of hyperparameters is chosen such that the cross-validation error is minimized; we ultimately report training and test errors for every model, but optimization is based on the validation error, such that the actual test set in each iteration remains unseen by the model during the training process. 
Further, the standard deviation in predictions over the multiple training cycles is defined as the uncertainty for each predicted point, providing an error bar that accompanies every prediction. Results are presented as parity plots of ML predictions versus DFT-calculated properties, with reported RMSE values in eV, and as plots of uncertainty versus error for every data point; the predictions are also visualized in terms of semiconductor type. The training prediction RMSE values are listed in Tables S6 and S7, and the test prediction RMSE values are listed in Tables 1 and 2, divided in terms of property, ML technique used, and type of data point. The set of best hyperparameter values obtained for each technique and each property are included as part of the Supplementary materials, as are alternative prediction performance metrics, mean absolute errors, and R² scores.
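The evaluation protocol described above (five-fold cross-validation, fold-averaged RMSE, and a per-point uncertainty taken as the standard deviation over repeated training cycles) can be sketched as follows. The models in this work were trained with scikit-learn; to keep the example dependency-free, a one-dimensional least-squares line stands in for the regressor, and hyperparameter optimization is omitted. All names and the toy data are hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Closed-form least-squares fit y = slope * x + intercept (stand-in model)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k=5, seed=0):
    """Return held-out predictions for every point and the fold-averaged RMSE."""
    n = len(xs)
    folds = kfold_indices(n, k, random.Random(seed))
    preds = [None] * n
    fold_rmse = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = []
        for i in fold:
            preds[i] = slope * xs[i] + intercept
            sq_err.append((preds[i] - ys[i]) ** 2)
        fold_rmse.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return preds, sum(fold_rmse) / k


def repeated_cv(xs, ys, k=5, repeats=10):
    """Mean prediction and standard deviation (uncertainty) per point over repeats."""
    runs = [cross_validate(xs, ys, k, seed)[0] for seed in range(repeats)]
    mean = [statistics.mean(col) for col in zip(*runs)]
    std = [statistics.pstdev(col) for col in zip(*runs)]
    return mean, std


# Toy data: an exactly linear relation, so held-out errors should vanish.
toy_x = [float(i) for i in range(20)]
toy_y = [2.0 * x + 1.0 for x in toy_x]
cv_preds, cv_rmse = cross_validate(toy_x, toy_y)
mean_pred, uncertainty = repeated_cv(toy_x, toy_y)
```

With a real regressor, `cross_validate` would be run once per candidate hyperparameter set and the set with the lowest fold-averaged RMSE retained.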

Table 1

ML test set prediction RMSE values for transition levels

Property	ML method	II–VI error (eV)	III–V error (eV)	IV–IV error (eV)	Total error (eV)
ε(+3/+2)	MLR	0.35	0.37	0.34	0.35
ε(+3/+2)	Ridge	0.35	0.35	0.32	0.34
ε(+3/+2)	LASSO	0.36	0.36	0.32	0.35
ε(+3/+2)	Elastic net	0.35	0.35	0.32	0.34
ε(+3/+2)	RFR	0.36	0.31∗	0.35	0.34
ε(+3/+2)	KRR	0.33	0.37	0.31	0.33
ε(+3/+2)	GPR	0.32	0.36	0.32	0.33
ε(+3/+2)	NN∗	0.29∗	0.36	0.29∗	0.31∗
ε(+2/+1)	MLR	0.42	0.46	0.46	0.44
ε(+2/+1)	Ridge	0.42	0.43	0.45	0.43
ε(+2/+1)	LASSO	0.43	0.44	0.45	0.44
ε(+2/+1)	Elastic net	0.42	0.43	0.45	0.43
ε(+2/+1)	RFR	0.39	0.36	0.40	0.38
ε(+2/+1)	KRR	0.33	0.38	0.40	0.36
ε(+2/+1)	GPR	0.32	0.38	0.41	0.36
ε(+2/+1)	NN∗	0.29∗	0.35∗	0.38∗	0.33∗
ε(+1/0)	MLR	0.40	0.39	0.43	0.40
ε(+1/0)	Ridge	0.40	0.38	0.42	0.40
ε(+1/0)	LASSO	0.41	0.39	0.43	0.41
ε(+1/0)	Elastic net	0.40	0.38	0.42	0.40
ε(+1/0)	RFR	0.38	0.36	0.39	0.38
ε(+1/0)	KRR	0.31	0.34	0.38	0.33
ε(+1/0)	GPR∗	0.29∗	0.32	0.38	0.32∗
ε(+1/0)	NN	0.29	0.31∗	0.37∗	0.32
ε(0/–1)	MLR	0.37	0.42	0.34	0.38
ε(0/–1)	Ridge	0.37	0.40	0.34	0.37
ε(0/–1)	LASSO	0.37	0.40	0.34	0.37
ε(0/–1)	Elastic net	0.37	0.40	0.34	0.37
ε(0/–1)	RFR	0.37	0.33	0.35	0.35
ε(0/–1)	KRR	0.32	0.36	0.32	0.33
ε(0/–1)	GPR	0.31	0.34	0.32	0.32
ε(0/–1)	NN∗	0.28∗	0.33∗	0.31∗	0.30∗
ε(–1/–2)	MLR	0.33	0.38	0.30	0.33
ε(–1/–2)	Ridge	0.32	0.37	0.29	0.32
ε(–1/–2)	LASSO	0.32	0.37	0.29	0.33
ε(–1/–2)	Elastic net	0.32	0.37	0.29	0.33
ε(–1/–2)	RFR	0.34	0.35	0.27	0.33
ε(–1/–2)	KRR	0.29	0.32	0.27∗	0.29
ε(–1/–2)	GPR	0.29	0.31	0.28	0.29
ε(–1/–2)	NN∗	0.26∗	0.29∗	0.28	0.27∗
ε(–2/–3)	MLR	0.27	0.26	0.22	0.26
ε(–2/–3)	Ridge	0.27	0.26	0.22	0.25
ε(–2/–3)	LASSO	0.27	0.26	0.22	0.25
ε(–2/–3)	Elastic net	0.27	0.26	0.22	0.25
ε(–2/–3)	RFR	0.24∗	0.28	0.27	0.25
ε(–2/–3)	KRR	0.26	0.24	0.21	0.24
ε(–2/–3)	GPR	0.25	0.24	0.21	0.24
ε(–2/–3)	NN∗	0.25	0.22∗	0.22∗	0.24∗

∗ Lowest prediction errors.

Table 2

ML test set prediction RMSE values for formation energies

Property	ML method	II–VI error (eV)	III–V error (eV)	IV–IV error (eV)	Total error (eV)
ΔH (A rich)	MLR	0.85	1.57	1.81	1.16
ΔH (A rich)	Ridge	0.85	1.54	1.78	1.14
ΔH (A rich)	LASSO	0.88	1.55	1.79	1.16
ΔH (A rich)	Elastic Net	0.85	1.53	1.78	1.14
ΔH (A rich)	RFR	1.05	1.03	1.20∗	1.07
ΔH (A rich)	KRR∗	0.62	1.35	1.32	0.89∗
ΔH (A rich)	GPR	0.59∗	1.33	1.71	0.96
ΔH (A rich)	NN	0.62	1.30∗	1.40	0.89
ΔH (B rich)	MLR	1.04	1.82	1.81	1.31
ΔH (B rich)	Ridge	1.04	1.73	1.77	1.29
ΔH (B rich)	LASSO	1.08	1.74	1.80	1.32
ΔH (B rich)	Elastic Net	1.05	1.72	1.77	1.28
ΔH (B rich)	RFR	1.09	1.25∗	1.52	1.18
ΔH (B rich)	KRR	0.77∗	1.52	1.45	1.03
ΔH (B rich)	GPR	0.82	1.52	1.70	1.11
ΔH (B rich)	NN∗	0.81	1.34	1.44∗	1.01∗

∗ Lowest prediction errors.

Linear regression

Figure S2 shows the Pearson coefficient of linear correlation between various (primary) descriptor dimensions and the properties of interest. We see that many of the features are 50%–70% correlated with the properties, showing a certain degree of linear relationship. We further plotted the correlation coefficients for the 10 best SISSO-based compound features (essentially, complex functions of combinations of original features) in Figure S3. The highest correlated features reveal the specific descriptors or combinations thereof that could best predict the defect formation energy and charge transition levels. We notice that atomic radii and ionization energy differences (between defect atom M and A/B atoms) are most important for ε(+3/+2), while valence and electronegativity differences dominate for ε(+2/+1); coefficients for both remain small, at approximately 0.35. The highest correlations are between 0.5 and 0.55 for ε(+1/0) and ε(0/−1), with ionization energy and atomic radii differences dominating in both. ε(−1/−2) and ε(−2/−3) show even higher correlations of between 0.7 and 0.75, with descriptors such as the Mendeleev number, covalent radii, and ICSD volume of M/A/B atoms being most important. Finally, we find correlation maxima of approximately 0.45 for ΔH (A rich) and ΔH (B rich), determined primarily by the semiconductor lattice constant, ionization potential, boiling point, heat of fusion/vaporization, and specific heat capacity. Overall, these correlations reveal that the relative electronegativities, ionization energies, and radii of elements are important in placing defect energy levels relative to band edges, while the size of the lattice and heat of fusion may determine how likely it is for the defect atom to exist at the site under consideration. To explore linear relationships further, we chose multiple linear regression (MLR) as the method to train the first predictive models. 
Given properly standardized features X and a calculated output vector Y, the coefficients β corresponding to each feature are determined by minimizing the least-squares error ‖Y − Xβ‖². While MLR yields an unbiased predictor, it is prone to overfitting when several features are highly correlated with the output. To address this issue, we use three shrinkage methods, namely, least absolute shrinkage and selection operator (LASSO) regression, ridge regression, and elastic net regression, all of which yield a biased predictor, but with a lower variance, leading to less overfitting compared with standard least squares. Ridge regression shrinks the coefficients β by imposing an L2 penalty, whereas LASSO uses an L1 penalty. The elastic net is another regularized linear regression technique that combines both the L1 and L2 penalties. Typically, β shrinkage in LASSO regression progresses more severely than in the other two approaches, and some of the coefficients are brought down to 0. In Figures S4–S7, we present parity plots for models trained using MLR, LASSO, ridge regression, and elastic net regression, respectively. We see from the plots and from Tables 1 and 2 that there is a marginal improvement in prediction going from MLR to LASSO, ridge, or elastic net regression. The presented results use the SISSO features, as they provide better predictions than the primary feature set, presumably because of the nonlinear nature of the input-output relationship. We further find that there is a strong dependence of the prediction error on semiconductor type and impurity site. We observe these effects in nonlinear models as well, which will be discussed in subsequent sections. Such prediction error differences can be attributed to the imbalanced distribution of training data; the DFT datasets are biased toward II–VI semiconductors and interstitial site impurities, as seen from Figure S1. 
In general, we get better performance on II–VI or interstitial data points, since the models we trained work better on majority groups.
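To make the difference between the three shrinkage penalties concrete, the sketch below uses the textbook special case of orthonormal feature columns, where each penalized coefficient has a closed form in terms of the ordinary least-squares coefficient. The objective scaling (half the squared error plus λ times the L1 norm and/or λ/2 times the squared L2 norm), the function names, and the numbers are assumptions made for illustration; they are not taken from the paper.

```python
def soft_threshold(b, lam):
    """Move b toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_coef(b_ols, lam):
    """L2 penalty: proportional shrinkage, never exactly zero for nonzero input."""
    return b_ols / (1.0 + lam)


def lasso_coef(b_ols, lam):
    """L1 penalty: soft-thresholding, which can zero out small coefficients."""
    return soft_threshold(b_ols, lam)


def elastic_net_coef(b_ols, lam1, lam2):
    """Combined L1 + L2 penalty: threshold first, then shrink proportionally."""
    return soft_threshold(b_ols, lam1) / (1.0 + lam2)


# Hypothetical least-squares coefficients for three standardized features.
ols = [1.5, -0.4, 0.1]
ridge = [ridge_coef(b, 0.5) for b in ols]
lasso = [lasso_coef(b, 0.5) for b in ols]
enet = [elastic_net_coef(b, 0.5, 0.5) for b in ols]
```

In this toy case the ridge coefficients all remain nonzero, while LASSO removes the two weak features entirely, which is the behaviour the text refers to when it says some coefficients are brought down to 0.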

Random forest regression

The improvement in linear regression model performances upon going from using the primary features to the SISSO-based features shows the importance of interpreting nonlinear relationships between the features and properties. However, non-linearity is still limited by the set of operators used in the SISSO method; to further explore this effect, we adopted a popular nonlinear regression algorithm known as random forest regression (RFR). RFR is an ensemble method that fits a designated number of decision trees such that each tree is fit on a different randomized sub-sample of the dataset, chosen through bootstrapping. During the construction of any tree, the best split for each node is found based on some number of input features. Averaging over all the trees in the forest can be performed in several ways, and, in this work, the model combines the results of the trees by averaging their predictions, which improves prediction accuracy and can help to control overfitting. Hyperparameter tuning focuses on the five most important hyperparameters in the RFR model, namely, the number of trees in the forest, the maximum depth of each tree, the number of features to consider when looking for the best node split, the minimum number of samples required to split an internal node, and the minimum number of samples required to be at a leaf node. For each of the eight outputs, Bayesian optimization was performed using a function set to minimize both the test RMSE and the difference between the training and test RMSE to balance the bias-variance trade-off in the model. Figure S9 shows a comparison between grid search and Bayesian search based hyperparameter optimization for RFR; it is seen that both methods produce similar test errors, but the latter mitigates overfitting (difference between training and test errors) far better, thus motivating its use. Parity plots for the optimized models for all eight properties are shown in Figure 5A. 
Looking at the error values listed in Tables 1 and 2, there is a general improvement in all the transition level prediction RMSEs from between 0.3 and 0.45 eV for the linear models to between 0.25 and 0.38 eV for RFR, and the formation energy RMSEs drop from 1.2 eV or higher to between 1.07 and 1.16 eV. We further plotted the RFR uncertainties in prediction against the absolute prediction error values in Figure S8. While uncertainty and error do not correlate linearly, such plots reveal the degree of confidence one can have in a given prediction. A large portion of points lie in the low uncertainty, low error region, but a number of points with low uncertainty also show large prediction errors, which highlights the need to exercise caution in trusting the ML predictions regardless of the estimated uncertainty.
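The hyperparameter objective described for RFR, which penalizes both the test RMSE and the gap between training and test RMSE, could take a form like the sketch below. The text does not give the exact functional form, so the additive combination, the `gap_weight` parameter, and the candidate error values are all assumptions for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def overfit_aware_objective(train_rmse, test_rmse, gap_weight=1.0):
    """Score to minimize: test error plus a penalty on the train/test gap."""
    return test_rmse + gap_weight * abs(test_rmse - train_rmse)


# Hypothetical (train RMSE, test RMSE) pairs in eV for two hyperparameter settings.
candidates = {
    "deep_trees": (0.05, 0.36),     # lower test error, but a large gap
    "shallow_trees": (0.30, 0.38),  # slightly higher test error, small gap
}
scores = {
    name: overfit_aware_objective(train, test)
    for name, (train, test) in candidates.items()
}
best = min(scores, key=scores.get)
```

Under this scoring the shallow setting is preferred even though its test RMSE is marginally higher, which mirrors the stated goal of balancing the bias-variance trade-off. In practice such a function would be handed to a Bayesian optimizer instead of being evaluated over a fixed pair of candidates.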

Figure 5

Parity plots for best regression models

(A–C) (A) Random Forest, (B) Gaussian process, and (C) NN regression, plotted for different semiconductor types.

As observed in the linear regression results, we found that RFR was able to predict formation energies of II–VI semiconductor impurities more accurately than III–V or IV–IV, owing to the larger portion of II–VI semiconductor points in the training dataset. Interestingly, the transition levels showed much less of a dependence on semiconductor type; the difference could be due to the larger range of values in the formation energy data versus the transition levels. We found that the points the model predicted most inaccurately for formation energies are relative outliers as predicted by KNN and PCA, and of those points, III–V and IV–IV semiconductor types make up a larger portion than in the dataset as a whole. When analyzing the prediction results by site of impurity defect, it was once again seen that interstitials are predicted slightly better than substitutionals, again owing to the predominance of the former in the dataset. Finally, we examined the feature importance values that are reported as part of random forest models; these values were collected for each property and averaged over five-fold cross-validation. The importance values for the ten best (SISSO-based) features are plotted for all eight properties in Figure S10. It is seen that the most important features for predicting ε(+3/+2), ε(+2/+1) and ε(+1/0) are the differences between valence, preferred oxidation states, and electronegativities of the defect atom M and A/B atoms. The ε(0/−1) is determined by the difference between thermal expansion coefficient of M and thermal conductivity of B, ε(−1/−2) by the atomic radius of B and electrical conductivity of A, and ε(−2/−3) by the covalent radius of B and Mendeleev number of A. 
While many of the important features for transition levels are similar to those obtained from Figure S3, the emergence of some new features shows the importance of uncovering more complex nonlinear relationships. Finally, the most important features for ΔH (A rich) and ΔH (B rich) are differences between group numbers and heat of vaporization for the former and differences in thermal expansion coefficients for the latter.

Kernel ridge regression

The improvement in prediction with RFR provided the motivation for alternative nonlinear regression techniques that could lead to further lowering of errors. Kernel ridge regression (KRR) is a similarity-based regression technique that uses the kernel trick to solve a nonlinear problem in a linear fashion. The original low-dimensional features are used as input and mapped to a high-dimensional kernel space in which they can be linearly interpreted. In this work, we use different possible choices for the kernel function, namely polynomial, radial basis function, and Laplacian. For hyperparameter optimization, we applied the grid search method to search a dense space for the best combination of kernel choice and different parameters in the kernel, separately for each output. The prediction performances for the eight outputs are shown as parity plots in Figure S11A and listed in Tables 1 and 2. KRR shows a marked improvement in formation energy prediction and slight improvements in transition level predictions compared to RFR. The improvement is heavily owed to significant lowering of errors for impurities in the II–VI compounds. We find the KRR RMSE for ΔH (A rich) to be 0.89 eV and for ΔH (B rich) to be 1.03 eV, while the RMSE values for the six transition levels range between 0.25 and 0.35 eV. As shown in Figure S11B, the uncertainties on the KRR predictions range from 0 to 0.25 eV for the transition levels and 0 to 1 eV for the formation energies. Once again, a large concentration of points lie in the low uncertainty, low error region, with a few outliers existing in the opposite end of the spectrum.
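The three kernel choices named above can be written down in a few lines. The sketch uses their common textbook forms; the parameter names `gamma`, `degree`, and `coef0` are illustrative, and the values actually selected by the grid search are reported in the Supplementary materials, not here.

```python
import math


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree"""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * squared Euclidean distance)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def laplacian_kernel(x, y, gamma=1.0):
    """exp(-gamma * Manhattan (L1) distance)"""
    l1_dist = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1_dist)


# Two hypothetical standardized descriptor vectors.
u, v = [0.0, 0.0], [1.0, 1.0]
similarities = {
    "polynomial": polynomial_kernel(u, v, degree=2),
    "rbf": rbf_kernel(u, v),
    "laplacian": laplacian_kernel(u, v),
}
```

The RBF and Laplacian kernels both equal 1 for identical inputs and decay with distance, which is what makes KRR a similarity-based method; they differ only in whether the distance is squared Euclidean or L1.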

Gaussian process regression

Another nonlinear regression technique that uses the kernel trick is Gaussian process regression (GPR). GPR uses the kernel and the observations to define a likelihood function on account of the covariance of a prior distribution over the target functions. The prior and the likelihood function are assumed to have Gaussian distributions. Based on Bayes’ theorem, we get a predictive posterior distribution, from which we can attain a point prediction using its mean, and an uncertainty value using its variance. A major difference between GPR and KRR is that GPR can internally choose each kernel’s hyperparameters by applying gradient ascent on the marginal likelihood function, while KRR requires a grid or random search using a loss function. It can be seen from the GPR parity plots in Figure 5B and from Tables 1 and 2 that the prediction RMSE values are very similar to those obtained with KRR. The formation energy errors are between 0.96 and 1.1 eV, while the transition level errors range from 0.24 to 0.36 eV. It can also be seen from the training prediction errors listed in Tables S6 and S7 that there is a greater difference between the training and the test RMSE for both formation energies and transition levels than for KRR. This can be explained by the flexibility of the GPR models, which likely causes overfitting when dealing with a small dataset and high dimensional features. The uncertainty versus absolute error plots in Figure 6A show similar trends to KRR, with a majority of the points occupying the low-error, low-uncertainty region.
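For reference, the posterior mean, posterior variance, and log marginal likelihood mentioned above take the following standard form for a zero-mean GP prior with Gaussian noise. The notation is an assumption introduced here and does not come from the paper: K is the kernel matrix over the n training inputs X, k_* is the vector of kernel values between a new point x_* and the training inputs, y is the vector of training targets, and σ_n² is the noise variance.

```latex
\begin{aligned}
\mu_* &= \mathbf{k}_*^{\top}\left(K + \sigma_n^{2} I\right)^{-1}\mathbf{y},\\
\sigma_*^{2} &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top}\left(K + \sigma_n^{2} I\right)^{-1}\mathbf{k}_*,\\
\log p(\mathbf{y}\mid X) &= -\tfrac{1}{2}\,\mathbf{y}^{\top}\left(K + \sigma_n^{2} I\right)^{-1}\mathbf{y}
 - \tfrac{1}{2}\log\left|K + \sigma_n^{2} I\right| - \tfrac{n}{2}\log 2\pi .
\end{aligned}
```

The first line gives the point prediction, the second the uncertainty, and the third is the quantity maximized by gradient ascent when the kernel hyperparameters are fitted.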

Figure 6

Gaussian process regression: Error versus uncertainty

(A and B) Prediction uncertainty as a function of absolute prediction error for (A) Gaussian process and (B) NN regression, plotted for different semiconductor types.

Neural network regression

Finally, we used neural networks (NN) to train regression models and compared the results with nonlinear regression models from RFR, KRR, and GPR. The Keras functional API model was used to build a deep feedforward NN to machine learn a multi-output regression. A sequential model trained to predict the six transition levels and two formation energies was found to be time consuming and lacked the ability to predict multiple outputs at once effectively. Further, a grid search used to explore the number of hidden layers, number of neurons, learning rate, epochs, batch size, optimizers, and activation functions was found to be inefficient. Separate models were thus trained for each property using the SISSO-generated descriptors, and scikit-optimize (skopt) was used for Bayesian hyperparameter optimization. To overcome an overfitting problem arising from minimizing only the test RMSE, the optimization function was revised to also include the difference in train and test RMSE. Each NN architecture contains two to three dense neuron layers, through which the input is concatenated before returning the output through the final layer. The number of neurons in each dense layer varies with the input dimensions for each specific property or output. Kernel and activity regularizers were also integrated in each dense layer to prevent overfitting. The “relu” activation function was ultimately used for each dense layer, beating out sigmoid, softmax, softplus, tanh, and selu functions, while the Adam optimizer was selected over SGD, RMSprop, Adadelta, and Adagrad. NN model training involved 10-repeated 5-fold cross-validation, where the mean and SD of prediction of every data point were used as the predicted value and uncertainty value, respectively. Parity plots for the best models thus obtained are presented in Figure 5C, while Figure 6B shows the uncertainty versus absolute error plots. 
It can be seen from the parity plots and Tables 1 and 2 that NN predictions for both transition levels and formation energies are similar to KRR and GPR. Transition level RMSE values are seen to range from 0.24 to 0.33 eV, while the formation energy RMSEs are between 0.9 and 1 eV. A comparison with training set predictions in Tables S6 and S7 further reveals that the gaps in test and training predictions from NN are similar to those from KRR, implying less overfitting as compared to GPR. A possible disadvantage of the NN models comes from the larger uncertainty values seen in general compared to other methods, as visible from Figure 6B, while the absolute error values are similar to other methods. This is an effect of the stronger dependence of NN model prediction on the hyperparameter choice, leading to a larger SD in prediction; this is expected to affect NN predictions over the entire chemical space. We further note that standard deviations over 10 repeats may not be sufficient to converge the uncertainties, but we use an ensemble of 10 predictions here to save on training time and keep estimates consistent across different ML models. Methods such as Monte Carlo dropout can help to attain better uncertainty estimates as well, and will be applied in future work.

High-throughput screening of dominating impurities

The detailed ML analysis presented in this work reveals that multiple nonlinear regression techniques can be trained to make predictions of impurity transition levels and formation energies with errors that are within 10% of the range of values across the dataset. In Figure 7A, we present the test set prediction RMSE values of eight different ML techniques used in this work, namely MLR, ridge, LASSO, elastic net, RFR, KRR, GPR, and NN, for the six transition levels and two formation energies. The errors are plotted separately for the II–VI, III–V, and IV–IV points, as well as all the points taken together. It can be seen that for all the data types, the general trend is a reduction in RMSE upon going from linear to nonlinear techniques. It is also seen that in general, the RFR performance is worse than KRR, GPR, and NN, while the latter three have similar formation energy errors with NN edging out the other two for most of the transition levels. From these results, one can expect NN, GPR, and KRR to yield similar results for the complete formation energy picture of all impurities as a function of charge, chemical potential, and Fermi level, which can be formulated using the predicted neutral state formation energies and all possible charge transitions.
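The statement that the complete formation energy picture follows from the neutral-state formation energy and the charge transition levels can be sketched as follows. The sketch assumes the standard convention that ε(q/q−1) is the Fermi level, measured from the valence band maximum, at which charge states q and q−1 have equal formation energy. The function names and the numerical values are hypothetical.

```python
def formation_energies(e_neutral, levels, fermi):
    """Formation energy (eV) of each charge state at Fermi level `fermi`.

    `levels` maps (q, q - 1) charge pairs to transition levels in eV, e.g.
    {(1, 0): 0.5, (0, -1): 1.0}. Only charge states reachable from q = 0
    through consecutive levels are returned.
    """
    energies = {0: e_neutral}
    q = 0
    while (q + 1, q) in levels:   # walk up through donor (positive) states
        energies[q + 1] = energies[q] + fermi - levels[(q + 1, q)]
        q += 1
    q = 0
    while (q, q - 1) in levels:   # walk down through acceptor (negative) states
        energies[q - 1] = energies[q] + levels[(q, q - 1)] - fermi
        q -= 1
    return energies


def stable_charge(e_neutral, levels, fermi):
    """Charge state with the lowest formation energy at this Fermi level."""
    energies = formation_energies(e_neutral, levels, fermi)
    return min(energies, key=energies.get)


# Hypothetical impurity: neutral formation energy 2.0 eV, two transition levels.
levels = {(1, 0): 0.5, (0, -1): 1.0}
at_vbm = formation_energies(2.0, levels, 0.0)
stable = [stable_charge(2.0, levels, ef) for ef in (0.0, 0.75, 1.5)]
```

Scanning the Fermi level across the band gap and keeping the lowest-energy charge state at each value yields the piecewise-linear curves of the kind shown in Figure 8; the chemical potential dependence enters only through the neutral-state formation energy.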

Figure 7

ML model performance comparison

(A–C) The performance of various ML models by semiconductor type, in terms of (A) prediction RMSE, and screening accuracy, precision and recall scores for (B) dominating impurities and (C) low energy impurities with mid-gap energy levels at A-rich and B-rich chemical potential conditions.

We performed high-throughput prediction of the complete formation energies of the entire dataset of 12,474 impurities, using the best NN, GPR, KRR, and RFR models. It is important to note here that a significant amount of time is saved by replacing full DFT calculations with almost instantaneous ML predictions. On average, any 1 point defect in a 64-atom supercell simulated in the neutral state requires approximately 500 core hours, while 6 charged state calculations require a further approximately 2000 core hours (running on 8 Intel Broadwell XEON E5-2695 nodes with 36 cores each). For the DFT datasets of approximately 1500 neutral state formation energies and approximately 1000 charge transition levels of 6 types, this translates to approximately 2.75 million core hours. For the entire dataset of 12,474 impurities, approximately 32 million core hours would be required for complete DFT optimization and prediction of all defect properties. In contrast, every ML model takes a matter of minutes to train and make predictions over the entire chemical space. Thus, based on computations using 1/10th of the total computing time required, we can make reasonable predictions for all the data points. Predictions for the entire set of 12,474 impurities using different ML models are included as a spreadsheet as part of the Supplementary materials. 
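The core-hour estimates quoted above can be reproduced with a few lines of arithmetic. The per-defect costs and dataset sizes are the approximate figures from the text; the variable names are illustrative.

```python
NEUTRAL_CORE_HOURS = 500    # per defect, neutral state (approximate, from the text)
CHARGED_CORE_HOURS = 2000   # per defect, six charged states (approximate, from the text)

n_neutral = 1500            # neutral-state formation energies in the DFT dataset
n_charged = 1000            # defects with charge transition levels computed
n_total = 12_474            # full impurity chemical space

dft_dataset_hours = n_neutral * NEUTRAL_CORE_HOURS + n_charged * CHARGED_CORE_HOURS
full_space_hours = n_total * (NEUTRAL_CORE_HOURS + CHARGED_CORE_HOURS)
fraction = dft_dataset_hours / full_space_hours
```

With these inputs the DFT dataset costs 2.75 million core hours and the full space about 31 million, in line with the rounded figures in the text, and the ratio comes out a little under one tenth.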
The ML-predicted impurity formation energies across the dataset were compared with the dominant native defect energetics for each compound, based on which screening is performed for (a) dominating impurities, i.e., impurities with lower energy than native defects, which will change the equilibrium Fermi level of the semiconductor, and (b) low energy impurities (lower than native defects) with mid-gap energy levels. The screening performance of each ML model is determined by comparing the ML and DFT screening for the data points in the original DFT dataset. Given the expected DFT versus experiments and ML versus DFT errors, we relax the screening criteria by ±0.2 eV for the DFT data and by ±0.5 eV for the ML data. We thus calculated the number of true positives (TP, dominating/mid-gap from both DFT and ML), true negatives (TN), false positives (FP) and false negatives (FN) for each method. Based on these scores, the following metrics were defined:

Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)

Figures 7B and 7C show the accuracy, precision, and recall scores of each ML technique for screening of dominating impurities and low energy impurities with mid-gap levels, respectively. Results are plotted for the total dataset and for each semiconductor type, for both A-rich and B-rich conditions. The accuracies (in blue) of RFR, GPR, and KRR for all data types are seen to be greater than 95% for screening of dominating impurities in Figure 7B, while the precision (red) and recall (green) range from 80% to 95%. Interestingly, the accuracy, precision and recall scores of NN predictions are universally seen to lag behind the scores from RFR, GPR, and KRR. This surprising lack of predictive power of the NN models is attributed to their strong dependence on the hyperparameter choices, which is intimately linked with the exact nature of the training dataset. 
This leads to the higher uncertainty values seen in Figure 6B and likely overfitting, which may not manifest in a limited test set, but over the entire set of 12,474 impurities, some predictions may be well off, resulting in lower accuracy, precision, and recall scores. The NN scores are better for screening of low energy impurities with mid-gap levels in Figure 7C, and all four techniques (NN, GPR, KRR, and RFR) show similar accuracy, precision, and recall. Scores are seen to be lower for III–V data points than others, for reasons relating to the general imbalance in the dataset. To further elucidate the effect of this dataset imbalance, we retrained some GPR and RFR models on a reduced dataset with (a) equal number of II–VI, III–V, and IV–IV points, and (b) equal number of interstitial and substitutional points, essentially by removing a large number of II–VI or interstitial points (mostly belonging to impurities in CdTe). Figures S15 and S16 show, respectively, the GPR and RFR RMSE values for all properties using the entire dataset and using reduced balanced datasets. It can be seen that the overall predictions are slightly worse with the balanced datasets, simply because fewer data points are being used for training, but the gap between II–VI, III–V, and IV–IV errors reduces, as does the gap between substitutional and interstitial errors. However, for formation energy, the II–VI points and interstitial points are still predicted better than other data types. We conclude that, although errors for different data types can be brought closer to each other using more balanced datasets, we prefer to use models trained on the entire dataset since they lead to similar or better errors for different data types. 
In Table 3, we list several impurities deemed to be dominating from both DFT and ML (GPR used as example here), along with their stable charge states, the corresponding dominating native defect, the type of shift induced in the equilibrium E_F, and whether mid-gap energy levels are created. For example, it can be seen that Ti at the Al site in AlAs creates a stable +1 charged donor type defect, and, along with a −3 charged As vacancy acceptor, makes the conductivity more n-type and creates a transition level in the band gap. Similarly, a Be interstitial defect in Si induces a p-type shift in conductivity. Lists of dominating impurities with or without mid-gap energy levels were thus generated for all compounds. Finally, we plotted the complete charge and E_F-dependent formation energies of selected impurities from both DFT and GPR for a few cases in Figure 8. There is an impressive match between the DFT and GPR curves for most of the impurities, with charge states and transitions in general remaining consistent. A few impurities such as Bi_Zn in ZnS and In in AlAs are seen to show greater disparity between DFT and GPR, but qualitative trends remain the same. Also plotted for each case in Figure 8 are the dominant native defects, and it can be seen that almost all impurities are correctly predicted to be dominating or not dominating from GPR compared with DFT, which implies a reliable qualitative screening, even when the actual predicted formation energies or transition levels are off. As a final test of the generalizability of our ML framework, we selected 25 new impurities deemed to be dominating from GPR and KRR predictions, and performed additional computations on them. Figure S14 shows the parity plots between the DFT-computed formation energies and transition levels and the GPR/KRR-predicted values. 
It can be seen that RMSE values are generally between 0.8 and 1.1 eV for formation energies and between approximately 0.2 and 0.4 eV for the transition levels, indicating that prediction accuracy is at a very similar level to test set predictions.
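The screening comparison between DFT and ML can be sketched as below. How the ±0.2 eV and ±0.5 eV relaxations enter the criterion is not spelled out in the text, so the one-sided tolerance used in `is_dominating` is an assumption, as are the function names and the toy formation energies.

```python
def is_dominating(e_impurity, e_native, tolerance):
    """Flag an impurity whose formation energy lies below that of the dominant
    native defect, with the criterion relaxed by `tolerance` (eV)."""
    return e_impurity <= e_native + tolerance


def screening_scores(dft_flags, ml_flags):
    """Accuracy, precision, and recall of ML flags, taking DFT flags as truth."""
    pairs = list(zip(dft_flags, ml_flags))
    tp = sum(d and m for d, m in pairs)
    tn = sum((not d) and (not m) for d, m in pairs)
    fp = sum((not d) and m for d, m in pairs)
    fn = sum(d and (not m) for d, m in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical (DFT, ML, native defect) formation energies in eV for five impurities.
records = [
    (1.0, 1.3, 1.5),
    (2.5, 1.9, 1.5),
    (1.6, 2.2, 1.5),
    (3.0, 3.4, 1.5),
    (0.8, 0.5, 1.5),
]
dft_flags = [is_dominating(dft, native, 0.2) for dft, _, native in records]
ml_flags = [is_dominating(ml, native, 0.5) for _, ml, native in records]
scores = screening_scores(dft_flags, ml_flags)
```

In this toy set the ML screening produces one false positive and one false negative out of five impurities. Accuracy can look high on an imbalanced dataset even when precision and recall are lower, which is why all three are reported in Figures 7B and 7C.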

Table 3

Selected dominating impurities identified by both DFT and ML (GPR), at A-rich chemical potential conditions

Semiconductor	Impurity	Shift in Eqm. E_F	Dominating defects	Mid-gap level?
CdS	In_Cd	n-type	In_Cd, q = 1 and V_Cd, q = −2	Y
CdS	I_S	n-type	I_S, q = 1 and V_Cd, q = −3	Y
CdS	Ti_i	p-type	Ti_i, q = 2 and V_S, q = −1	Y
CdSe	Cu_Cd	p-type	Cu_Cd, q = −1 and Cd_i, q = 2	Y
CdSe	F_i	p-type	F_i, q = −1 and V_Se, q = 2	N
CdSe	Ni_i	p-type	Ni_i, q = −1 and V_Se, q = 2	Y
CdTe	Bi_Cd	n-type	Bi_Cd, q = 1 and V_Cd, q = −2	Y
CdTe	As_Te	p-type	As_Te, q = −1 and V_Te, q = 2	Y
CdTe	Na_i	n-type	Na_i, q = 1 and V_Cd, q = −2	N
ZnS	Li_i	n-type	Li_i, q = 1 and V_Zn, q = −2	N
ZnS	Ti_i	n-type	Ti_i, q = 1 and V_Zn, q = −2	Y
ZnSe	Al_Zn	n-type	Al_Zn, q = 1 and V_Zn, q = −2	Y
ZnSe	Br_Se	n-type	Br_Se, q = 1 and Zn_Se, q = −1	Y
ZnTe	Cr_i	n-type	Cr_i, q = 1 and V_Te, q = −2	N
ZnTe	Mn_i	n-type	Mn_i, q = 1 and Zn_Te, q = −2	Y
AlN	Se_N	p-type	Se_N, q = −1 and V_N, q = 1	Y
AlP	Hf_Al	n-type	Hf_Al, q = 1 and Al_P, q = −1	Y
AlP	Cr_i	n-type	Cr_i, q = 1 and V_Al, q = −2	Y
AlAs	Ti_Al	n-type	Ti_Al, q = 1 and V_As, q = −3	Y
GaN	Tl_Ga	p-type	Tl_Ga, q = −1 and V_N, q = 1	Y
GaN	P_N	p-type	P_N, q = −2 and V_N, q = 1	Y
GaP	Ni_Ga	p-type	Ni_Ga, q = −1 and Ga_i, q = 2	Y
GaP	Li_i	n-type	Li_i, q = 1 and Ga_P, q = −2	Y
GaAs	Sc_i	n-type	Sc_i, q = 3 and Ga_As, q = −2	Y
GaSb	Al_Ga	n-type	Al_Ga, q = 1 and V_Ga, q = −2	Y
InN	Zr_i	n-type	Zr_i, q = 2 and V_N, q = −1	Y
InP	Cu_i	n-type	Cu_i, q = 1 and In_P, q = −2	Y
InAs	Ca_In	p-type	Ca_In, q = −1 and In_As, q = 2	N
Si	Ti_Si	p-type	Ti_Si, q = −1 and Si_i, q = 2	Y
Si	Be_i	n-type	Be_i, q = 1 and V_Si, q = −3	Y
SiC	V_Si	n-type	V_Si, q = 1 and V_C, q = −2	Y
SiC	Cr_i	p-type	Cr_i, q = −1 and V_C, q = 1	Y
SnC	As_Sn	n-type	As_Sn, q = 1 and V_C, q = −2	N
SnC	Cr_Sn	p-type	Cr_Sn, q = −1 and V_C, q = 2	N

Figure 8

Defect formation energies from DFT and ML

(A–F) A comparison of the complete charge and Fermi level-dependent formation energy picture of selected impurities from DFT (solid lines) and GPR (dashed lines), presented for (A) CdTe at Cd-rich conditions, (B) ZnS at S-rich conditions, (C) AlAs at As-rich conditions, (D) GaP at Ga-rich conditions, (E) Si at Si-rich conditions, and (F) SiC at C-rich conditions. The dominant donor and acceptor type native defects are also pictured.

Discussion

The DFT + ML strategy presented in this work enables the quick prediction and screening of impurities in semiconductors, but is still limited by several factors. The primary concern is certainly the accuracy of the PBE functional, which determines the reliability of the computational dataset and every subsequent step. Despite the impressive correspondence between measured and PBE computed defect levels, a generalization over all the semiconductor compounds and all types of impurities requires further caution. The use of advanced levels of theory, such as HSE06 and GW with and without SOC, may yet be necessary for future improvements of prediction models. However, ML models built on PBE data are still certainly useful for a number of reasons: (a) although quantitative predictions may be off, they provide qualitative screening of impurities likely to create low energy charged defects and/or consequential energy levels in the band gap, with an expected accuracy of greater than 95%, and (b) PBE and ML-PBE estimates provide starting points for more advanced calculations, and can be used in a multi-fidelity learning framework wherein higher fidelity predictions are improved using lower fidelity data. We note here that, although we consider only mid-gap states that arise from defect charge transitions, there are other internal transitions such as the d-d or f-f transitions of transition metals and lanthanides that could potentially further affect the absorption and emission characteristics of a semiconductor. Going forward, a number of extensions and improvements will be made to this work, the first being the generation of higher accuracy DFT data and training multi-fidelity learning models. In ref. 12, we showed that with a much smaller set of HSE06 data points, ML descriptors could be combined with models trained on larger quantities of PBE data to yield excellent predictions for Cd-chalcogenides; this will be extended to all group IV, III–V, and II–VI semiconductors. 
Various types of multi-fidelity learning models can be developed, from PBE–experiments (using the current dataset of 89 points supplemented with more data) to PBE–HSE to PBE–HSE–experiments, providing a potential pathway to bridging the DFT versus experiment gap. Further, while all compounds were studied here in the ZB structure, the DFT data and ML models will be extended to include defects in the wurtzite and rock salt structures. The current ML framework can also be extended to semiconductor alloys in the same chemical spaces using the same type of descriptors, as was demonstrated for the limited example of Cd-chalcogenides.12 The set of descriptors and ML methods used can also be expanded, for instance by including low accuracy unit cell defect calculations as used in refs. 3 and 12. Finally, tools can be created for the on-demand prediction of the entire defect formation energy picture of any point defect or impurity in any compound, and for comparing that defect with dominating native defects and other impurities.

In summary, we used a combination of DFT and ML to predict the charge-, Fermi level-, and chemical potential-dependent formation energy of any substitutional or interstitial impurity or point defect in ZB structures of group IV, III–V, and II–VI semiconductors. A DFT dataset was created for the neutral state formation energies and various charge transition levels of upward of 1,000 possible impurities across 34 compounds, which formed about 10% of the entire semiconductor + impurity chemical space. ML models were built from the data by using descriptors that included properties of the compound, the defect site, and the impurity atoms, and applying algorithms ranging from linear regression techniques to nonlinear methods such as random forest and NN.
For the eight properties of interest (2 formation energies and 6 transition levels), KRR, GPR, and NN generally performed similarly, and the best models were deployed to predict all impurity properties in a high-throughput manner. Lists of dominating impurities, which can change the equilibrium conductivity of the compound as determined by native defects, were created using the ML predictions. The learning and design framework described in this work can be extended to new semiconductors and mixed composition compounds, more involved descriptors and ML techniques, and more advanced levels of theory. The same design framework is also applicable to other semiconductor classes such as halide perovskites and I–III–VI semiconductors, and can lead to novel materials with improved optoelectronic properties for solar cells and related applications.

Experimental procedures

Resource availability

Lead contact

Requests for data and additional information should be directed to the lead contact, Arun Mannodi-Kanakkithodi (amannodi@purdue.edu).

Materials availability

This study did not generate physical materials.

Calculating defect properties from DFT

All native defects and impurities were simulated in 64-atom 2 × 2 × 2 supercells of the parent compound, based on previously optimized 8-atom ZB unit cells. DFT optimization was performed in the neutral and charged states (q = −3, −2, −1, 0, +1, +2, +3) while keeping the supercell shape and size fixed. All computations were performed using the Vienna Ab initio Simulation Package (VASP), with the PBE exchange-correlation functional and projector-augmented wave potentials. The kinetic energy cut-off for the plane-wave basis set was 500 eV, and all atoms were relaxed until the force on each was less than 0.05 eV/Å. Brillouin zone integration was performed using a 3 × 3 × 3 Monkhorst-Pack mesh.

For any defect or impurity atom M in a compound AB, the following equations yield the formation energy $E^f$ as a function of the chemical potential μ, charge q, and Fermi level $E_F$, and any impurity charge transition level, ε(q1/q2):

$$E^f(M^q) = E(M^q) - E(AB) + \Delta\mu + q\,(E_F + E_{VBM}) + E_{corr} \tag{1}$$

$$\varepsilon(q_1/q_2) = \frac{E^f(M^{q_1};\, E_F = 0) - E^f(M^{q_2};\, E_F = 0)}{q_2 - q_1} \tag{2}$$

Here, $E(AB)$ is the DFT energy of an AB supercell without defects, $E(M^q)$ is the DFT energy of the AB supercell containing a defect M in a charge state q, $E_{VBM}$ is the VBM as computed from an electronic structure calculation on AB, and $E_{corr}$ is the charge correction energy using the scheme developed by Freysoldt et al., to account for periodic interaction between image charges. $E^f$ depends on the chemical potential change Δμ involved in creating the defect, and for a given Δμ, it is a function of $E_F$ and q, such that the slope of the $E^f$ versus $E_F$ plot is equal to q. For any defect or impurity M in compound AB, the chemical potentials of all species are defined with reference to the elemental standard states of M, A, and B, as well as their lowest formation energy binary or ternary compounds. For an impurity $M_A$ (M occupying an A site), in Equation 1, $\Delta\mu = \mu_A - \mu_M$; for an impurity $M_i$ (M occupying an interstitial site), $\Delta\mu = -\mu_M$; for a vacancy at the B site, $V_B$, $\Delta\mu = \mu_B$.
We calculate formation energies at two extreme chemical potential conditions, namely, A rich (where $\mu_A$ = energy of elemental standard state of A) and B rich (where $\mu_B$ = energy of elemental standard state of B), and note that by tuning the μ conditions, defects can be made more or less stable, and the equilibrium conductivity, which is determined by defect charge neutrality conditions, can be made more p-type or n-type. Equation 2 defines a charge transition level ε(q1/q2), that is, the $E_F$ value where the defect transitions from a charge state q1 to q2, which is independent of the μ conditions; in this work, for every defect or impurity, we calculate six possible transition levels, namely, +3/+2, +2/+1, +1/0, 0/−1, −1/−2, and −2/−3.
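The bookkeeping described above can be illustrated with a minimal Python sketch. It assumes the commonly used form of the formation energy, E^f = E(M^q) − E(AB) + Δμ + q(E_F + E_VBM) + E_corr, which is consistent with the definitions in this section, and obtains a transition level as the Fermi level at which the formation energies of two charge states are equal. The names `DefectCalc`, `formation_energy`, and `transition_level` are hypothetical, and all energies are invented numbers for illustration, not values from the DFT dataset.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DefectCalc:
    """Hypothetical container for the DFT quantities named in the text (all in eV)."""
    e_defect: Dict[int, float]  # charge q -> E(M^q), energy of the defect supercell
    e_bulk: float               # E(AB), energy of the defect-free supercell
    e_vbm: float                # E_VBM from a calculation on bulk AB
    e_corr: Dict[int, float]    # charge q -> charge correction energy


def formation_energy(calc: DefectCalc, q: int, delta_mu: float, e_fermi: float) -> float:
    """Assumed form: E^f = E(M^q) - E(AB) + delta_mu + q*(E_F + E_VBM) + E_corr."""
    return (calc.e_defect[q] - calc.e_bulk + delta_mu
            + q * (e_fermi + calc.e_vbm) + calc.e_corr[q])


def transition_level(calc: DefectCalc, q1: int, q2: int, delta_mu: float) -> float:
    """Fermi level (relative to the VBM) where E^f(q1) equals E^f(q2)."""
    ef1 = formation_energy(calc, q1, delta_mu, 0.0)
    ef2 = formation_energy(calc, q2, delta_mu, 0.0)
    return (ef1 - ef2) / (q2 - q1)


# Made-up example: an impurity M on the A site with three charge states.
calc = DefectCalc(
    e_defect={+1: -249.5, 0: -247.0, -1: -243.5},
    e_bulk=-250.0,
    e_vbm=2.0,
    e_corr={+1: 0.1, 0: 0.0, -1: 0.1},
)
delta_mu = -1.0  # stands in for mu_A - mu_M under one chemical potential condition

for q1, q2 in [(+1, 0), (0, -1)]:
    level = transition_level(calc, q1, q2, delta_mu)
    print(f"eps({q1:+d}/{q2:+d}) = {level:.2f} eV above the VBM")
```

Because Δμ enters every charge state identically, it cancels in `transition_level`, which is why the transition levels do not depend on the A-rich or B-rich condition while the formation energies do.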

ML details

The ML approaches used in this work include dimensionality reduction and outlier identification using SISSO, PCA, and other techniques, and the training of predictive models using linear regression and three types of nonlinear regression: random forests, Gaussian processes, and NN. A brief introduction to each technique, together with information on how hyperparameters were optimized and errors converged, is provided in the relevant subsections of the manuscript. All ML training and prediction were done using the appropriate functions in Scikit-learn.
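As a rough illustration of this kind of workflow, the sketch below trains three Scikit-learn regressors on a synthetic dataset and compares their test RMSE. It is not the authors' pipeline: the five random features merely stand in for the descriptors, the target is an invented function rather than a DFT formation energy, the hyperparameters are arbitrary, and the NN model is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 "defects" described by 5 made-up descriptors.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 5))
y = X[:, 0] ** 2 + np.sin(2.0 * X[:, 1]) + 0.5 * X[:, 2] + rng.normal(0.0, 0.05, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One linear and two nonlinear regressors; settings are illustrative only.
models = {
    "Linear": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "GPR": GaussianProcessRegressor(
        kernel=RBF(length_scale=np.ones(5)) + WhiteKernel(noise_level=0.1),
        normalize_y=True,
        random_state=0,
    ),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

for name, value in rmse.items():
    print(f"{name}: test RMSE = {value:.3f}")
```

In practice, the held-out split would be replaced by cross-validation with a hyperparameter search for each model, as described in the manuscript's subsections.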

25 in total

1. Generalized Gradient Approximation Made Simple.

Authors:
Journal: Phys Rev Lett Date: 1996-10-28 Impact factor: 9.161

2. Ab initio molecular-dynamics simulation of the liquid-metal-amorphous-semiconductor transition in germanium.

Authors:
Journal: Phys Rev B Condens Matter Date: 1994-05-15

Review 3. Gaussian processes for machine learning.

Authors: Matthias Seeger
Journal: Int J Neural Syst Date: 2004-04 Impact factor: 5.866

4. Energy band gaps and lattice parameters evaluated with the Heyd-Scuseria-Ernzerhof screened hybrid functional.

Authors: Jochen Heyd; Juan E Peralta; Gustavo E Scuseria; Richard L Martin
Journal: J Chem Phys Date: 2005-11-01 Impact factor: 3.488

5. Theory of defect levels and the "band gap problem" in silicon.

Authors: Peter A Schultz
Journal: Phys Rev Lett Date: 2006-06-19 Impact factor: 9.161

6. Deep impurity levels in semiconductor superlattices.

Authors:
Journal: Phys Rev B Condens Matter Date: 1988-11-15

7. Materials Design of Solar Cell Absorbers Beyond Perovskites and Conventional Semiconductors via Combining Tetrahedral and Octahedral Coordination.

Authors: Jing Wang; Hangyan Chen; Su-Huai Wei; Wan-Jian Yin
Journal: Adv Mater Date: 2019-03-18 Impact factor: 30.849

Review 8. Opportunities and challenges of text mining in materials research.

Authors: Olga Kononova; Tanjin He; Haoyan Huo; Amalie Trewartha; Elsa A Olivetti; Gerbrand Ceder
Journal: iScience Date: 2021-02-06

9. Self-compensation in arsenic doping of CdTe.

Authors: Tursun Ablekim; Santosh K Swain; Wan-Jian Yin; Katherine Zaunbrecher; James Burst; Teresa M Barnes; Darius Kuciauskas; Su-Huai Wei; Kelvin G Lynn
Journal: Sci Rep Date: 2017-07-04 Impact factor: 4.379