Literature DB >> 35785305

Inverse Design of Hybrid Organic-Inorganic Perovskites with Suitable Bandgaps via Proactive Searching Progress.

Tian Lu¹, Hongyu Li¹, Minjie Li^2,3, Shenghao Wang¹, Wencong Lu^1,2,3.

Abstract

Hybrid organic-inorganic perovskites (HOIPs) have shown the encouraging development in solar cells that have achieved excellent device performance. One of the most important issues has been focused on finding Pb-free candidates with suitable bandgaps, which could accelerate the commercialization of environmentally friendly HOIP-based cells. Herein, we propose a new inverse design method, proactive searching progress (PSP), to efficiently discover potential HOIPs from universal chemical space by combining machine learning (ML) techniques. Compared to the pioneering work on this topic, we carried out our ML study based on 1201 collected HOIP samples with experimental bandgaps rather than theoretical properties. On the basis of 25 selected features, a weighted voting regressor ML model was constructed to predict bandgaps of HOIPs. The model comprehensively embedded four submodels and performed the coefficient determinations of 0.95 for leaving-one-out cross-validation and 0.91 for testing set. The feature analysis revealed that the tolerance factor (t f) below 0.971 and the new tolerance factor (τf) in 3.75-4.09 contributed to lower bandgaps and vice versa. By applying the PSP method, the Pb-free HOIPs with optimal bandgaps were successfully designed from a generated chemical space comprising over 8.20 × 1018 combinations, which included 733848 candidates (e.g., Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836) with an optimal bandgap of 1.34 eV for single junction solar cells, 1511073 large-bandgap candidates (e.g., Cs0.392FA0.016MA0.592Cr0.383Sr0.347Sn0.270Br1.171I1.829) for top parts in tandem solar cells (TSCs), and 20242 low-bandgap ones (e.g., MA0.815FA0.185Sn0.927Ge0.073I3) for bottom cells in TSCs. Finally, three new HOIPs were synthesized with an average bandgap error 0.07 eV between predictions and experiments. We are convinced that the proposed PSP method and ML progress could facilitate the discovery of new promising HOIPs for photovoltaic devices with the desired properties.

Entities: Chemical

Year: 2022 PMID： 35785305 PMCID： PMC9245129 DOI： 10.1021/acsomega.2c01380

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Development of hyybrid organic–inorganic perovskites (HOIPs) has flourished in photovoltaic (PV) technology over the past decade, attributed to their exceptional merits including tunable bandgap, long carrier diffusion length, high light-absorption coefficient, low nonradiative loss, carrier mobility, simple solution processability, and low-cost experimental synthesis.[1−4] HOIP materials have a universal formula, ABX3, in which the A site usually contains organic cations (such as MA+ → methylammonium, FA+ → formamidinium) or Cs+, the B site involves divalent metal cations (such as Pb2+, Sn2+, Ge2+, Ga2+), and the X site includes halogen anions (Cl–, Br–, I–). The typical HOIPs, e.g., MAPbI3, FAPbI3, and their derivatives, have been widely applied in single-junction devices, namely perovskite solar cells (PSCs), with the bandgap range of 1.40–1.70 eV whose power conversion efficiencies (PCEs) have elevated to 25.5%[5] from the original 3.8%.[6] In the meantime, ascribed to the fascinating property of their tunable bandgaps, HOIPs with low-bandgap (1.10–1.40 eV) are competent in serving as bottom cells for all-perovskite tandem solar cells, while the materials with a wide bandgap (1.70–2.30 eV) are suitable as top cells for all kinds of tandem solar cells[7] in which their PCEs have reached 47.1%.[5] However, HOIP photovoltaic materials still have unavoidable imperfections and more potential for further progress. One of the topic issues is the contaminant brought by the toxic element Pb that contributes to most solar cells with outstanding PCEs. In this context, more Pb-free HOIP materials are urgently needed to facilitate the commercial application in solar cells.[3] Meanwhile, the bandgaps of HOIPs in efficient single junction cells are mostly lying at 1.45–1.55 eV, whose theoretical maximum PCEs are 31.02–32.07% according to the Shockley-Queisser (SQ) limit model.[8−10] However, the optimum bandgap located in 1.20–1.40 eV renders the maximum PCEs of 32.7–33.7%, while the highest value 33.7% could be achieved at a bandgap of 1.34 eV. By adjusting the bandgap to the optimal value, the HOIP-based device may have a higher PCE. As for the case of tandem solar cells (TSCs), manipulating bandgaps of HOIPs is also essential to gain satisfactory PCEs. The ideal bandgap of the perovskite bottom cell is usually 1.20–1.30 eV, while the suitable bandgap for the perovskite top cell is close to 1.70 eV for Si or copper indium gallium selenide (CIGS) bottom cells (bandgap 1.10 eV) and 1.75–1.80 eV for perovskite bottom cells.[11] Considering the demands of nontoxicity and suitable bandgap adjustment, it is still of paramount importance to discover Pb-free HOIPs with suitable bandgaps for both of PSCs and TSCs. As experimental and quantum-based research has been suffering due to the large cost in human endeavor, time consumption, and trial-and-error attempts, machine learning (ML) has provided a more efficient strategy to assist materials design at lower expense, which has been continuously exerted in various fields such as electrocatalysis, batteries, polymers, alloys, and others.[12−15] The ML technique has also been successfully applied in finding out promising HOIPs with desired properties. As a proof of concept, Lu et al.[16] prepared 212 ABX3 HOIPs samples from high-throughput calculations along with their bandgaps calculated via the density functional theory (DFT) method in which 11 organic cations were considered in the A site, 32 divalent metal cations, and four halogen anions in the B and X sites. Fourteen features were selected by using last-place elimination for the gradient boosting regression (GBR) model, resulting the high model performance with a determination coefficient (R2) of 0.985 and mean squared error (MSE) of 0.085 eV. The sorted features indicated the importance of the tolerance factor (tf), octahedral factor (Of), and ionic charge in B site (ICB) that took the major contributions to the model performance. Then the bandgaps of 5158 virtually generated samples were predicted via the GBR model. By inspecting their proper predictions (0.9–1.6 eV), tf (0.8–1.2), Of (0.4–0.7), experimental accessibility, and nontoxicity, six HOIPs were finally selected with DFT-calculated bandgaps of 0.91–1.14 eV, namely C2H5OInBr3, C2H6NInBr3, NH3NH2InBr3, C2H5OSnBr3, NH4InBr3, and C2H6NSnBr3. Saidi et al.[17] arranged 862 ABX3 HOIP structures along with their DFT-calculated bandgaps, in which 18 organic ions, Pb2+/Sn2+ cations, and Cl–/Br–/I– anions were included for A/B/X sites. A hierarchical convolutional neural network (HCNN) was trained based on atomic descriptors to predict the DFT bandgaps of HOIPs, exhibiting a low root mean squared error (RMSE) value of 0.02 eV. In our recent work,[18] a robust extremely gradient boosting (XGBoost) model was fitted based on 102 DFT-optimized samples and atomic descriptors to identify the formability of DFT-optimized HOIP structures, which led the leaving-one-out cross-validation (LOOCV) accuracy of 95% and testing accuracy of 88%. By using the SHapley Additive exPlanations (SHAP) tool, we found that the radius and lattice constant of B site had the positive contribution to formability while the A site radius, tolerant factor, and first ionization of B site were the reverse case. Handy with the well-established XGBoost model, 198 nontoxic HOIPs were screened from 18560 generated structures with formability probabilities over 99%. Although the bright prospect of the ML technique in discovering potential HOIP materials has been unveiled in these pioneering works, the majority of the relevant work paid large attention to DFT-based properties such as bandgap and formability instead of the experimental characteristics whose results tend to deviate from the experiments, e.g., the underestimated bandgaps calculated by generalized gradient approximation (GGA).[3] Second, only limited organic cations in the A site (up to 18) have been reported in the current ML studies, while this work has collected 88 organic cations (both from experimental and theoretical publications) that are expected to extend the A site choices for experimentalists. Third, few researches have reported the virtual designs for the doped HOIPs, e.g., the formula as A′A″B′B″X′X″, though there already are cases of complex HOIP formulas in experiments such as Cs0.05FA0.79MA0.16Pb0.58Sn0.42I2.48Br0.52.[19] Thus, it is meaningful to expand the chemical space to involve the doped structures, which may cause an enormous searching space, e.g., 4.6 × 1011 combinations in the case of A′A″A‴B′B″X′X″ (suppose 20 organic choices in the A site, 9 in the B site, 3 in the X site, and doping ratio step 0.01). In this situation, the traditional high-throughput ML prediction is no longer affordable using current computing power; thus, an innovative searching strategy is imperatively essential to avoid this dilemma. Last but not the least is the inadequate interpretations of the established models despite their qualified performances, which are basically important for a deep understanding of experimental principles. Theus, there is still a large capacity to be elevated in discovering new potential HOIPs with suitable bandgaps and nontoxicity. Herein, we reviewed 9594 PSC publications from Web of Science in 2009–2021 and collected 1201 HOIP samples along with their experimental bandgaps. On the basis of the experimental sample set in 2009–2020, various robust models were built, resulting in four best models with the LOOCV R2 0.93–0.95 and testing R2 0.88–0.92. A weighted voting regressor (WVR) model was designed to embed the four well-fitted estimators, exhibiting a more comprehensive performance with LOOCV R2 0.95 and testing R2 0.91. The 42 samples in 2021 were set aside as the external set to validate the WVR model, indicating the R2 of 0.84. By combining the SHAP tool and WVR model, the relationships between the top 10 important features and HOIP formulas were successfully explored based on both the experimental data set and a virtually generated data set. Thereafter, we constructed a comprehensive chemical space for the doped HOIPs formulated as A′A″A‴B′B″B‴X′X″X‴, in which there were collected 88 organic fragments plus Cs/K/Rb in the A site, eight metal elements in the B site (excluding Pb), and Cl/Br/I in the X site, resulting in over 8.20 × 1018 combinations. In this regard, we proposed a new inverse design method, namely proactive searching progress (PSP), to efficiently discover Pb-free HOIPs with the desired bandgaps from such universal chemical space, which only took a few minutes to locate at least one candidate. As the explored result, we finally found 733848 HOIPs with a bandgap of 1.34 ± 0.05 eV for single junction solar cells, 20242 with 1.20 ± 0.05 eV for bottom cells in TSCs, and 764883 with 1.70 ± 0.05 eV as well as 746190 with 1.75 ± 0.05 eV for top cells in TSCs. The three new compositions of HOIPs were synthesized for model validation. Their bandgaps were characterized, resulting in an average error 0.07 eV from the predictions. We are convinced that such an inverse design method could help accelerate the development of HOIP materials beyond bandgaps. Since the developed inverse design methods are flexible and independent from the ML models, they could also be applied in discovering other materials with their desired properties.

Methods

In this section, we illustrate the detailed processes of training ML models and the PSP method. In the training process, various ML packages were involved, including scikit-learn,[20] XGBoost, CatBoost, LightGBM, and SHAP,[21] which were used to fit robust models and analyze the relationships between features and HOIP structures. The self-developed reverse design method PSP was applied after the model building to find out the Pb-free candidate HOIPs with proper bandgaps. Most codes were packed into our Python package in https://pypi.org/project/fast-machine-learning/. Figure illustrates the workflowchart concerning on the integrated package to construct ML models and search new candidate HOIPs via PSP. The applied codes are provided in GitHub: https://github.com/luktian/InverseDesignViaPSP. A demo of PSP method can be tried online at http://materials-data-mining.com/pspweb/.

Figure 1

Workflowchart concerning on the integrated package to construct ML models and search new candidate HOIPs via PSP.

Data Collection and Feature Generation

In this study, we extracted 1201 HOIPs along with their experimental bandgap values from 9594 PSC publications from Web of Science (using the keyword “perovskite solar cells”) in 2009–2021. The sample number reduced to 479 from 1201 after cleansing the raw sample set in Code S1 of Supporting Information. The 437 samples collected in 2009–2020 were regarded as the history data for modeling and further divided into training and test sets. The 42 samples from 2021 were set aside as the external set to validate the generalization of ML model at last. An integrated descriptor database was assembled from the Villars Database (https://mpds.io/) and Mendeleev Python package (https://github.com/lmmentel/mendeleev) to depict the structure information on the HOIP samples, which were furtherly divided into 34 base descriptors, eight string descriptors, and 66 other descriptors. The base descriptors refer to the basic chemical and physical properties for elements such as radius, volume, chemical potential, electronegativity, and others, while string descriptors involve categorical properties such as the block position in the periodic table, the lattice crystal structure of the simple substance, and the like. The string descriptors have been encoded into numeric data ahead of generating descriptors via the encoder method provided in scikit-learn.[20] The 66 other descriptors contain relatively more sophisticated or pointless properties and hence would not be employed in our data set in pursuit of an as understandable model as possible. Regarding the lack of properties of the organic cations of the A site, e.g., MA+ in methylammonium lead iodide, some basic chemical and physical properties of the organic fragments were calculated and supplemented by using Gaussian[22] and Multiwfn[23] software, which involves molecular volume, electronegativity, chemical potential, molecular radius, ionic radius, vertical electron affinity, and others. The descriptor details are listed in Table S2, and quick instructions to generate the descriptor features for a HOIP sample are illustrated in Code S2 of the Supporting Information. Given an HOIP structure ABX3, the descriptors were generated for each site (A, B, X), and there were 42 descriptors calculated in each site and hence 126 descriptors for each sample. Three universal structural factors, namely Of,[24]tf,[25] and tau factor (τf)[26] were added, resulting in a total of 129 features. Although the structural phases would have a large impact on bandgaps as well, it would not be considered in the model due to insufficient data. The phase information on the 479 samples is listed at https://github.com/luktian/InverseDesignViaPSP.

Feature Engineering and Model Training

The 129 features were preprocessed by pruning the variables with missing values or their standard deviations nearly to zero, bringing about 102 remnant features. To perform a reasonable division of the 437 samples, the size and distribution of samples in training and test sets were optimized in Code S3 of the Supporting Information, resulting the optimal parameters of training size 81.95% and test size 18.05% and random state 1959 (representing sample distributions). The recursive feature addition (RFA)[27] strategy was employed to search the optimal feature set, which determined the optimal feature set by iteratively adding one more top nth feature under a feature importance (FI) order. In this strategy, seven algorithms were considered in parallel, involving CatBoost (CAT), XGBoost (XGB), LightGBM (LGB), gradient boosting machine (GBM),[28] support vector machine (SVM),[29] decision tree regressor (DTR), and multiple linear regressor (MLR).[20] The FI for tree-based algorithms was ranked by the feature contributions extracted from SHAP package,[21] while FI for the other algorithms was calculated based on maximum relevance minimum redundancy (mRMR) method.[30,31] The RMSE and R2 of LOOCV were regarded as indicators to determine the outstanding models. It was found that the CAT, XGB, LGB, and GBT models were the top 4 outstanding models based on 13, 13, 12, and 10 features screened in Code S4 of the Supporting Information. The contributions of selected features were calculated using SHAP package for each established model in Code S5 of Supporting Information. The hyper-parameter optimizations for CAT, XGB, LGB, and LGB models were performed using the grid search (GS) approach in Code S6 of the Supporting Information. The tree number, learning rate, and tree depth were considered as the hyper-parameters for the models, and the LOOCV RMSE of the training set was used to indicate the model performance.

Weighted Voting Regressor and Model Analysis

Inspired from voting regressor (VR) implemented from scikit-learn[20] that combines multiple regressors as submodels and returns the average predicted values, we developed a weighted voting regressor (WVR). Different from VR model, the submodel in WVR was trained based on individual feature sets, and the optimized weight coefficient was used to control the model contribution to the final prediction. The SHAP values of the WVR model were then calculated in Code S8 of the Supporting Information that had overcome the incompatibility between the self-developed WVR model and SHAP package. In pursuit of a more comprehensive analysis of feature contributions to the model, in addition to the experimental data set, an extra virtual data set composed of mixed HOIPs formulated as A′A″B′B″X′X″ was generated as the input data for SHAP calculation. The organic fragments MA, FA, GA (diaminomethaniminium), EA (ethanaminium), and ED (ethane-1,2-diaminium) plus Cs were considered as the candidates for the A site with the elements Pb, Sn, Ge, Cd, and Pd for the B site and Cl, Br, and I for the X site. For the balance between calculating cost and precision, we set the doping range 0–1 for A/B site and 0–3 for X site along with the range step 0.25 for A/B site and 1.0 for X site, in which doping numbers for each site were all fixed at 2. As a result, the virtually generated data set comprised 45000 virtual HOIPs.

Proactive Searching Progress (PSP)

Given a set of material compositions {γ|γ = (γ,γ,...,γ), i ∈ R}, we could easily get their predicted outputs {o|i ∈ R} from a well-fitted ML model and obtain a set of samples {(γ,o)|i ∈ R}. The ML model plays the role of a function o = f(γ) that describes the distribution of the samples in the whole chemical space. In the case of searching the materials with desired property value o* from a universal chemical space, there would be large wastes of time and cost if all of the possibilities via the ML model o = f(γ) were traversed. Inspired from the sequential model-based optimization (SMBO) in the field of parameter optimization, we herein proposed PSP method, and introduced Gaussian process (GP) method to describe the local distribution g(γ) to approximate the ML model o′ = g(γ) ∼ o = f(γ) in partial chemical space. According to the simulated GP distribution, it could be easy to determine the compositions of next designed point by identifying the composition with the o′ closest to o* and predict the property value by using the ML model. By iteratively adding the newly explored points, the GP distribution would be updated by steps and behave more accurate for property estimation. The core objective of PSP is to discover potential chemical compositions whose properties are close to the expected value o* predefined by the user. As the general workflow is concluded in Figure a, the initial virtual samples {γ|i ≥ 2} will be randomly generated from the whole chemical space at the initial step. The properties {o|i ≥ 2} of the initial samples are predicted by using the established ML model to obtain the sample points {(γ,o)|i ≥ 2}. Instead of the predicted property value o, the expected error {EE|EE = |o – o*|, i ∈ R,EE → 0} between the prediction o and expected property value o* is practically defined as the response of GP distribution to indicate the quality of each sample. The GP function approximates the distribution asFor illustrative purposes, Figure b–d considers the searching progress of a one-compositional material as a simple example in which the first and second samples with EE values of 1.4 and 3.4 are plotted in Figure b. The first generated samples are taken to describe the local distribution by GP method (or other likely methods), as shown in Figure (c). The plot data could be practically obtained from the GP distribution. Then the next designed point with the minimal EE estimated by GP method (EEGP) is determined in the fitted GP distribution, and the predicted property as well as EE are given via the ML model in Figure d. In the next step, a new designed point will be iteratively determined by the updated GP distribution in Figure e until the iteration reaches the maximum step predefined by the user. During the searching progress, once the EE is lower than the predefined criterion, the corresponding samples would be outputted. By only exploiting the local space directionally, PSP could accelerate the searching period and avoid the huge undesired compositions. Additionally, we established a demo of PSP method online to discover HOIP candidates for the illustrative purpose at http://materials-data-mining.com/pspweb/.

Figure 2

General principle of the proactive searching progress (PSP) method. (a) Workflow details of PSP method. (b) First and second generated virtual samples. (c) Simulated GP distribution based on the first and second generated virtual samples. (d) Third virtual samples searched by GP distribution. (e) Updated GP distribution adjusted by the third virtual samples.

Results and Discussion

Model Construction

The 1159 samples from the publications in 2009–2020 were gathered, and 437 samples were retained after cleansing the duplicates in Code S1. As exhibited in Figure a, the bandgap ranges in 1.17–3.31 eV, and the most values were in the 1.20–2.00 eV range accounting for 86.96% of the whole set. Then 129 descriptors were generated for each sample, and 102 remained after preprocessing in Code S2. A reasonable training and test set were drawn in Code S3, rendering 358 training and 79 test samples.

Figure 3

(a) Distribution of bandgaps for 437 HOIP samples. (b) Changing trends of LOOCV RMSEs for CAT, XGB, LGB, and GBM models extracted from RFA results. (c) Feature importance of top 10 features. (d) Distributions of SHAP values with feature values. Given the preprocessed training set with 358 samples and 102 remnant features, an RFA strategy was employed to filter optimal features in Code S4. Figure b shows the decreasing trends of LOOCV RMSE values for CAT, XGB, LGB, and GBT models as the selected feature number increased in which RMSEs were fixed in 0.08–0.12 when the feature number was over 13. Figure S5b plots the same trend for SVM, DTR, and MLR but with higher RMSEs in 0.13–0.24, which signaled the much poorer model performance than the tree-based models. Figure S6 displayed the uprising trend of LOOCV R2 for the four tree-based models with R2 values over 0.89 based on the top 13 features and the other three models along with those lower than 0.85. The test metrics expressed in Figure S7–S8 indicated the same conclusions, and hence, the four tree-based models were retained for the further step. By considering the balance between model performance and complexity, the inflection points circled in Figure b were selected for model construction in which the addition of the features triggered large promotions for the LOOCV RMSEs. As a result, the CAT, XGB, LGB, and GBT models were built up based on the top 13, 13, 12, and 10 features, respectively (Figure S10), whose LOOCV R2 were 0.94, 0.92, 0.90, and 0.93, respectively (Figure S6). The corresponding test R2 of the four models were 0.88, 0.89, 0.87, and 0.91, respectively (see in Figure S8), signifying the powerful predictivities and generalization abilities. After hyper-parameter optimization using the GS approach in Code S6, the LOOCV R2 of CAT, XGB, LGB, and GBT models increased to 0.95, 0.93, 0.93, and 0.93, respectively, while the corresponding test R2 were 0.91, 0.89, 0.88, and 0.92 respectively (Table S5). The relevant LOOCV and test RMSEs were 0.08–0.09 and 0.11–0.12. The 5-fold cross-validation (CV5) and 10-fold cross-validation (CV10) were additionally calculated in Table S5. The CV5 and CV10 R2 of the tree-based models were 0.90–0.95, and the RMSEs were 0.08–0.11, which might also indicate the good model performance. To explicitly implement four well-fitted models, a meta-model could be introduced to integrate the fitted submodels and balance their individual weaknesses. For example, the voting regressor (VR) model from scikit-learn[20] combines multiple different machine learning regressors (trained on the same feature set) and returns the average prediction. Inspired by the VR model, the weighted voting regressor (WVR) model was designed whose submodels were trained based on individual feature sets (e.g., XGB on 13 features while GBT on 10 features). The optimized weight coefficients were used to control the contributions of each submodel to the predicted values, and hence the weighted predictions were returned. In this context, a WVR model was fitted by combining CAT, XGB, LGB, and GBT models in Code S7. As shown in Table S7, the R2 and RMSE in LOOCV of WVR were 0.95 and 0.079 better than the values of XGB, LGB, and GBT, while the R2 and RMSE in WVR were 0.91 and 0.106 more favorable than the values of CAT, XGB, and LGB. Meanwhile, the CV5 and CV10 R2 were 0.94, and the corresponding RMSEs were 0.08. Hence, the WVR model complimented the weakness of each submodel and achieved a comprehensively superior performance than the submodels. To further evaluate the generalization ability of the WVR model for the unknown samples, a separate validating set composed of 42 HOIP samples in 2021 was prepared. Listed in Table S7, the R2 of external validating set between experimental and predicted bandgaps was 0.84, higher than the external R2 in the submodels (0.74–0.80 in Table S5) and slightly lower than the testing R2 of 0.91 in the WVR model. The RMSE, MAE, and MSE of the external validating set in WVR model were 0.060, 0.041, and 0.004, respectively, exhibiting even better than the values of test set (0.106, 0.056, and 0.011), which indicated the powerful predicting power and low predicting errors (0.041–0.056 eV) of the WVR model.

Model Analysis

The SHAP values were calculated by utilizing the self-developed WVR model and SHAP method to analyze the feature contributions to model predictions in Code S8. Besides the experimental data set, a set of virtual data set composed of 45000 doped HOIP samples was employed for model interpretation via high-throughput computation. Figure c exhibits the top 10 features ranked by their contributions extracted from SHAP values, whose vertical axis comprised feature ranking, and the horizontal axis shows the SHAP values for the features. Among the 10 features, five X-site descriptors accounted for 50%, and NX (representing element name in X site) took the most important position. In the remnants, structural and B-site descriptors counted 2, respectively, while the rest 1 was the one in A site. Figure d displays the distributions of SHAP values of features, where the horizontal axis comprises the sorted sample indexes according to model predictions. The red/blue color expresses the positive/negative SHAP values for each sample and feature, which further indicate the positive/negative contributions to predictions. The positive SHAP values for each feature (in red color) were mainly localized in the left part that corresponds to the higher bandgaps, while the negative values (in blue color) were located at the right part related to the lower predictions, thence indicating a well-separately isolated distribution of positive/negative SHAP values and the large contributions to the predictions of bandgaps. Similar plots for the virtual data set are drawn in Figure S18b,d. The scatter plots between the feature and SHAP values are drawn in Figures and S19 with the color indicating the prediction value to acquire a further understanding of each feature. The partial dependence (PD)[28] plots are drawn in Figures and S21, which could signify the overall marginal effect that the feature had on the prediction.

Figure 4

Scatter plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their SHAP values.

Figure 5

Partial dependence (PD) plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their predictions.

Scatter plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their SHAP values. Partial dependence (PD) plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their predictions. As shown in Figure a, NX was the element name in X site. The SHAP values for NX showed a decreasing trend as the NX values were increasing. The samples with lower NX values and hence corresponding higher SHAP values tended to express the larger bandgaps (in darker red color) and vice versa. By marking an inflection point (determining the SHAP values positive or negative) for NX of 34.72, the NX values over 34.72 would result in the negative SHAP values, and the ones lower than 34.72 led to the positive. The original values of NX have been converted to numbers in the procedure for generating descriptors in which Cl, Br, and I were signaled by 20, 13, and 46 respectively. Hence, in pursuit of a higher/lower bandgap, the ratio of I in the X site should be decreased/increased for a low/high NX value that would trigger a high/low SHAP value, which was consistent to the current domain knowledges.[32−36] As indicated by the PD plot for NX in Figure , the same conclusion was extracted that the bandgap decreased as the NX value increased. The steep points that triggered sharp deceases in predictions were labeled on the plot. Specifically, the steep point 25.00 for NX might refer to the doped couple Br1.915I1.091 or Cl2.423I0.577, and 27.80 to Br1.655I1.345 or Cl2.100I0.900, 30.20 to Br1.455I1.545 or Cl1.846I1.154, 41.50 to Br0.409I2.591 or Cl0.519I2.481, respectively. TIB signaled the third ionization energy of the element in the B site, in which the energies of Sn, Pb, Ge, Cd, and Pd were 2943, 3081, 3302, 3616, and 3177 kJ/mol, respectively. Indicated by the scatter plot in Figure b, the TIB value was proportional to its SHAP value and bandgap prediction. Hence, the higher bandgaps might require the B site elements with the higher third ionization energy, which would influence on the charge effect. Combining the relevant PD plots of TIB in Figure b, it could be noted that the steep points at 2943 and 3081 kJ/mol were the third ionizations of Sn and Pb, which revealed the overall increasing trending of the bandgap as the proportion of lead arose in the doped B site couple SnPb. When the TIB value was over 3081, the prediction was nearly unchanged and even slightly declined, indicating that the higher ratios of Ge/Cd/Pd might have little influence on bandgap. As shown in Figure c, the SHAP values of tf had an overall increasing trend with the ascendent tf values, whose inflection point was around 0.971. The PD plot in Figure c displayed a sigmoid-like function trend in the experimental and virtual data sets, in which the tf range below/above 0.930/1.086 indicated steady prediction changes and the data in the range of 0.930–1.086 resulted in a steep increment for bandgap values. As for another structural factor τf, the inflection point of 3.50 could be noticed from the scatter plot in Figure d. The SHAP values were mostly positive as the τf value was lower than 3.50, while the values tended to be more negative as the τf value was over 3.50. The PD plot in Figure d showed that the τf ranges lower than 3.46, 4.13–4.78 and higher than 4.78 would render steady prediction trends, while the range 3.46–4.13 signaled an overall decreasing tendency. Since these two factors were both relevant to structural and geometric effects of HOIP samples, it might be comprehensive to take them into consideration together. If we seek HOIPs with lower bandgaps, the value of tf should not be over 0.971 and the value of τf is recommended to be 3.46–4.13, while HOIPs with tf > 1.014, τf < 3.46, and 4.13 < τf < 4.18 (4.18 determines the structure formability as indicated in reference[26]) tend to own the larger bandgaps. The full discussion of six other important descriptors in the top 10 can be found in Code S8, which includes discussions on the virtual data set and the individual conditional expectation (ICE)[37] plots.

Proactive Searching Progress

To exploit as much material space as possible, we collected 88 organic fragments (Table S4) and three inorganic elements Cs/K/Rb from experimental and theoretical publications for the A site, while eight metal elements Sn/Ge/Pd/Bi/Sr/Ca/Cr/La and three halogen elements Cl/Br/I were employed for the B and X sites. The doping range was set as 0–1 for A/B site and 0–3 for X site with the ratio step of 0.001 and the doping number 1–3. Thence the chemical space comprised over 8.20 × 1018 combinations for the formula A′A″A‴B′B″B‴X′X″X‴. To efficiently search the material compositions with expected bandgap values from the universal chemical space, the PSP method was applied in this study. The material compositions were set as the combinations of the elements/fragments in each site and the relevant ratios. The material property was set as the bandgap of HOIPs, and the EE criterion was restrained less than 0.05 eV. The expected values were predefined as 1.34 eV for PSCs, 1.20 eV for bottom cells in TSCs, and 1.70 and 1.75 eV for top cells in TSCs. For illustrative purposes, the high-dimensional features for each HOIP samples were compressed into the first component via component analysis (PCA), and Figure a presents the distribution of bandgaps versus the first PCA component (PCA-1). After a few iterations in the progress of searching the samples with 1.34 eV bandgap, 20 virtual samples were drawn from the searched result in Figure b and linked by the dotted lines. The distribution of bandgaps was obviously changed from the original one, and the predictions had a decreasing trend as the searching progress proceeded, which exhibited an efficient way to design the samples with expected property values from a universal chemical space. An additional demo web could be accessed at http://materials-data-mining.com/pspweb/.

Figure 6

Bandgap distribution in the initial step (a) and after a few iterations (b) versus the first component in principal component analysis.

Top Candidates

As a result of the PSP design, we obtained 820782 samples for the Pb-free HOIPs with bandgap 1.34 ± 0.05 eV, 22808 samples with 1.20 ± 0.05 eV, 984938 samples with 1.70 ± 0.05 eV, and 902401 samples with 1.75 ± 0.05 eV, respectively, in which the feature distributions for the four candidate sets are shown in Figure . As displayed in Figure a, the tf of low-bandgap HOIPs (1.20 and 1.34 eV) revealed a narrower range than the wide-bandgap ones (1.70 and 1.75 eV). The tf ranges of HOIPs with 1.20 and 1.34 eV were 0.88–0.97 and 0.83–0.99, while the cases of 1.70 and 1.75 eV were 0.80–1.02 and 0.81–1.20. Hence, the lower tf values (<0.97) could render a larger probability of low-bandgap samples and the higher tf values (>0.99) would lead to large-bandgap samples, which is consistent with the conclusion in model analysis. Simultaneously signaled by Figure b, the τf revealed an inverse trend that the peak of feature distribution was decreasing as the bandgap increased. The τf value of low-bandgap samples was concentrated at 3.75–4.09 and could be extended to 3.67–4.18 (considering the formability criterion), while the peaks of τf distribution of large-bandgap samples were around 3.60–3.90. Hence, the τf value fixed in 3.75–4.09 was essentially needed for seeking the low-bandgap samples, and the feature value should be controlled in 3.51–3.90 (or even lower) for the large-bandgap samples, which conducted the similar conclusion to the model analysis. In summary, the criterions tf < 0.97 and 3.75 < τf < 4.09 were recommanded for searching low-bandgap samples, while the criterions tf > 0.97, τf < 3.90 and 4.09 < τf < 4.18 were for the wide-bandgap ones. In the experimental sample set, we found that the most samples obeyed such criteria. For instance, the FAGeI3, MAGeI3, and FASnBr3 (bandgap 2.30, 2.00, 1.90 eV) had the tf of 1.12, 1.10, 1.00 and τf of 4.79, 4.77, 3.52. And Cs0.3FA0.7SnI3, MASnI3, FA0.25MA0.75SnI3 (bandgap 1.29, 1.30, 1.28 eV) owned the tf of 0.95, 0.96, 0.96 and τf of 3.81, 3.78, 3.76. After filtering by using the criterion τf < 4.18, 0.8 < tf < 1.2 and the existence of p-block elements in B site, there were 20242, 733848, 764883, and 746190 non-Pb samples left for the HOIPs with the bandgap of 1.20, 1.34, 1.70, and 1.75 eV, respectively.

Figure 7

Statistical distributions of tf (a), τf (b), NX (c), and TIB (d) in explored candidate set.

Statistical distributions of tf (a), τf (b), NX (c), and TIB (d) in explored candidate set. Also exhibited in Figure c, NX of low-bandgap samples was centered at a value of 46, indicating the X site of most low-bandgap samples was composed of iodine. Meanwhile, the X site compositions of most large-bandgap samples was the mixture of bromine and iodine. As shown in Figure d, the TIB value of low-bandgap samples was concentrated in 1849–3080 kJ/mol, while the opposite value was spreading over 1800–5000 kJ/mol. The remnant feature distributions are accessible in Figure S23. The four candidate sets are accessible in our GitHub: https://github.com/luktian/InverseDesignViaPSP/tree/main/code9. We also expanded our studies to several specific HOIP chemical systems to provide potential candidates for experimental researchers. For example, from our searching results, we found that the candidates Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836, Cs0.366FA0.338MA0.296Sn0.800Cr0.104-Pd0.096Br0.056I2.944, and Cs0.509FA0.412MA0.079Sn0.634Ge0.095Cr0.271Br0.109I2.891 had the optimal bandgap 1.34 eV, representing the Cs-FA-MA-Sn-Ge-Pd, Cs-FA-MA-Sn-Cr-Pd, and Cs-FA-MA-Sn-Ge-Cr mixed HOIPs, respectively, which might be potential non-Pb alternatives to the Cs0.05FA0.79MA0.16PbSn1–Br0.52I2.48 (x = 0, 0.084, 0.168, 0.252, 0.336, 0.420, with bandgap 1.35–1.61 eV) studied by Ji et al.[19] In addition, the candidates Cs0.494MA0.217FA0.289Sn0.866Ge0.134Br0.377I2.623, Cs0.305FA0.273MA0.422Sn0.975Cr0.025Br0.264I2.736, and Cs0.300MA0.675FA0.025Sn0.817Pd0.183Br0.051I2.949 also had a bandgap of 1.34 eV and represented Cs-FA-MA-Sn-Ge, Cs-FA-MA-Sn-Cr, and Cs-FA-MA-Sn-Pd mixed HOIPs, respectively. One hundred thirty-five Cs-FA-MA-based candidates are provided in Table S9. The bandgaps of Cs-FA-MA based HOIPs could be adjusted to the ideal range 1.60–1.76 eV for the top part in tandem cells as well. For instance, the candidates Cs0.392FA0.016MA0.592Cr0.383Sr0.347Sn0.270Br1.171I1.829, and Cs0.445FA0.161MA0.394Pd0.508Cr0.228Sn0.263-Br1.094I1.906, were found with 1.67 eV bandgap, which were non-Pb substitutes of Cs0.15FA0.17MA0.68PbBr0.6I2.4 investigated by Sala et al.[38] And 136 Cs-FA-MA based candidates are listed in Table S10. As for the HOIPs with bandgap 1.20 eV,[39−41] the non-Pb candidates MA0.815FA0.185Sn0.927Ge0.073I3 (1.22 eV) and MA0.861FA0.139Sn0.914Ge0.086I3 (1.22 eV) were found, as compared to the experimental samples, e.g., FA0.55MA0.45Sn0.55Pb0.45I3 (1.24 eV),[42] FA0.5MA0.5Sn0.5Pb0.5I3 (1.20 eV),[43] and FA0.83Cs0.17Pb0.3Sn0.7I3 (1.23 eV).[44] Twenty-seven MA-FA-Sn based candidates are supplied in Table S11. Moreover, from the explored samples, ASnI3 (A+ → ammonium), AGSnI3 (AG+ → hydrazinium), and XQSnI3 (XQ+ → sulfonium) were also found with low bandgaps of 1.29, 1.29, and 1.30 eV, whose formability probability were predicted over 99% in our recent study.[45]

Experimental Validation

A few studies on Sn–Ge mixed HOIPs have been reported in recent years due to their environmentally friendliness, high stability, and excellent optoelectronic properties.[46−48] However, there is a lack of studies on the system MASnGe1–I3 according to our collected data, except for the end points MASnI3 and MAGeI3. To explore the bandgaps of Sn–Ge systems and validate the WVR model, a series of samples formulated as MASnGe1–I3 (x = 1, 0.85, 0.74, 0.66, 0) were synthesized, whose predicted bandgaps via the WVR model were 1.28, 1.41, 1.51, 1.59, 2.01 eV, respectively. The experimental details were provided in Section S3 of the Supporting Information. The optical absorbance spectrum of the material is shown in Figure S25. The optical bandgap was obtained by using the Tauc formula,[49] which resulted in 1.23 and 2.02 eV for MASnI3 and MAGeI3 films, respectively. These two values are in good agreement with other reports.[50,51] The initial incorporation of Ge (corresponding to x = 0.85) caused a sudden increase in the optical bandgap edge as compared to MASnI3. With increasing the Ge concentration (i.e., x changes from 0.85 to 0.66), the optical bandgap edge has a small shift to lower wavelength, resulting in bandgaps of 1.53, 1.55, 1.54 eV (corresponding to x = 0.85, 0.74, 0.66). The predictions of these three new compositions were consistent to the experimental bandgaps with the average error of 0.07 eV.

Conclusion

In summary, four tree-based ML models were built up, namely CAT, XGB, LGB, and GBM models, based on the filtered 13, 13, 12, and 10 features, respectively. After tuning the model parameters, the R2 values of LOOCV and testing set for four models were 0.90–0.94 and 0.88–0.91, respectively. By applying the self-developed WVR meta-model, the four submodels were embedded together, exhibiting the comprehensive performance of LOOCV R2 0.95 and testing R2 0.91. Then the top 10 features were analyzed by calculating the SHAP contributions based the experimental data set and a virtually generated data set, which revealed the mapping relationships between the formulas and bandgaps in detail. In particular, tf below 0.971 and τf in 3.75–4.09 were beneficial for a low bandgap (e.g., 1.20 and 1.34 eV), while tf over 0.971 and τf in 4.09–4.18 (or lower 3.90) contributed to a large bandgap (e.g., over 1.70 eV). To discover the new Pb-free HOIPs with suitable bandgaps, we generated a universal chemical space for the HOIP formula A′A″A‴B′B″B‴X′X″X‴, in which 91, 8, and 3 choices were considered for the A, B, and X sites, respectively, resulting in 8.20 × 1018 combinations. The inverse design method PSP was proposed to find out the candidate HOIPs with desired bandgaps, in which 733848, 764883, 746190, and 20242 candidates were found for the HOIPs whose bandgaps were 1.34 eV for PSCs, 1.70/1.75 eV for top cells in TSCs, and 1.20 eV for bottom cells in TSCs. From the searched results, Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836, Cs0.366FA0.338MA0.296Sn0.800Cr0.104Pd0.096-Br0.056I2.944, and Cs0.509FA0.412MA0.079-Sn0.634Ge0.095Cr0.271Br0.109I2.891 were found for Cs-FA-MA mixed HOIPs with optimal bandgap 1.34 eV for PSCs. Cs0.392FA0.016MA0.592Cr0.383Sr0.347-Sn0.27Br1.171I1.829 and Cs0.445FA0.161MA0.394Pd0.508Cr0.228Sn0.263Br1.094I1.906 were found for Cs-FA-MA mixed HOIPs with large bandgap for the top cells in TSCs, while MA0.815FA0.185Sn0.927Ge0.073I3 and MA0.86FA0.139Sn0.914Ge0.086I3 were found for the bottom cells in TSCs. We believed that the inverse design methods as well as the ML training processes could facilitate the development of both photovoltaic fields and other advanced materials.

18 in total

1. Multiwfn: a multifunctional wavefunction analyzer.

Authors: Tian Lu; Feiwu Chen
Journal: J Comput Chem Date: 2011-12-08 Impact factor: 3.376

2. From Local Explanations to Global Understanding with Explainable AI for Trees.

Authors: Scott M Lundberg; Gabriel Erion; Hugh Chen; Alex DeGrave; Jordan M Prutkin; Bala Nair; Ronit Katz; Jonathan Himmelfarb; Nisha Bansal; Su-In Lee
Journal: Nat Mach Intell Date: 2020-01-17

3. Predicting the Formability of Hybrid Organic-Inorganic Perovskites via an Interpretable Machine Learning Strategy.

Authors: Shilin Zhang; Tian Lu; Pengcheng Xu; Qiuling Tao; Minjie Li; Wencong Lu
Journal: J Phys Chem Lett Date: 2021-07-31 Impact factor: 6.475

4. Perovskite-perovskite tandem photovoltaics with optimized band gaps.

Authors: Giles E Eperon; Tomas Leijtens; Kevin A Bush; Rohit Prasanna; Thomas Green; Jacob Tse-Wei Wang; David P McMeekin; George Volonakis; Rebecca L Milot; Richard May; Axel Palmstrom; Daniel J Slotcavage; Rebecca A Belisle; Jay B Patel; Elizabeth S Parrott; Rebecca J Sutton; Wen Ma; Farhad Moghadam; Bert Conings; Aslihan Babayigit; Hans-Gerd Boyen; Stacey Bent; Feliciano Giustino; Laura M Herz; Michael B Johnston; Michael D McGehee; Henry J Snaith
Journal: Science Date: 2016-10-20 Impact factor: 47.728

5. Fabrication of Efficient Low-Bandgap Perovskite Solar Cells by Combining Formamidinium Tin Iodide with Methylammonium Lead Iodide.

Authors: Weiqiang Liao; Dewei Zhao; Yue Yu; Niraj Shrestha; Kiran Ghimire; Corey R Grice; Changlei Wang; Yuqing Xiao; Alexander J Cimaroli; Randy J Ellingson; Nikolas J Podraza; Kai Zhu; Ren-Gen Xiong; Yanfa Yan
Journal: J Am Chem Soc Date: 2016-09-19 Impact factor: 15.419