Tian Lu1, Hongyu Li1, Minjie Li2,3, Shenghao Wang1, Wencong Lu1,2,3. 1. Materials Genome Institute, Shanghai University, Shanghai 200444, China. 2. Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China. 3. Zhejiang Laboratory, Hangzhou 311100, China.
Abstract
Hybrid organic-inorganic perovskites (HOIPs) have shown the encouraging development in solar cells that have achieved excellent device performance. One of the most important issues has been focused on finding Pb-free candidates with suitable bandgaps, which could accelerate the commercialization of environmentally friendly HOIP-based cells. Herein, we propose a new inverse design method, proactive searching progress (PSP), to efficiently discover potential HOIPs from universal chemical space by combining machine learning (ML) techniques. Compared to the pioneering work on this topic, we carried out our ML study based on 1201 collected HOIP samples with experimental bandgaps rather than theoretical properties. On the basis of 25 selected features, a weighted voting regressor ML model was constructed to predict bandgaps of HOIPs. The model comprehensively embedded four submodels and performed the coefficient determinations of 0.95 for leaving-one-out cross-validation and 0.91 for testing set. The feature analysis revealed that the tolerance factor (t f) below 0.971 and the new tolerance factor (τf) in 3.75-4.09 contributed to lower bandgaps and vice versa. By applying the PSP method, the Pb-free HOIPs with optimal bandgaps were successfully designed from a generated chemical space comprising over 8.20 × 1018 combinations, which included 733848 candidates (e.g., Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836) with an optimal bandgap of 1.34 eV for single junction solar cells, 1511073 large-bandgap candidates (e.g., Cs0.392FA0.016MA0.592Cr0.383Sr0.347Sn0.270Br1.171I1.829) for top parts in tandem solar cells (TSCs), and 20242 low-bandgap ones (e.g., MA0.815FA0.185Sn0.927Ge0.073I3) for bottom cells in TSCs. Finally, three new HOIPs were synthesized with an average bandgap error 0.07 eV between predictions and experiments. We are convinced that the proposed PSP method and ML progress could facilitate the discovery of new promising HOIPs for photovoltaic devices with the desired properties.
Hybrid organic-inorganic perovskites (HOIPs) have shown the encouraging development in solar cells that have achieved excellent device performance. One of the most important issues has been focused on finding Pb-free candidates with suitable bandgaps, which could accelerate the commercialization of environmentally friendly HOIP-based cells. Herein, we propose a new inverse design method, proactive searching progress (PSP), to efficiently discover potential HOIPs from universal chemical space by combining machine learning (ML) techniques. Compared to the pioneering work on this topic, we carried out our ML study based on 1201 collected HOIP samples with experimental bandgaps rather than theoretical properties. On the basis of 25 selected features, a weighted voting regressor ML model was constructed to predict bandgaps of HOIPs. The model comprehensively embedded four submodels and performed the coefficient determinations of 0.95 for leaving-one-out cross-validation and 0.91 for testing set. The feature analysis revealed that the tolerance factor (t f) below 0.971 and the new tolerance factor (τf) in 3.75-4.09 contributed to lower bandgaps and vice versa. By applying the PSP method, the Pb-free HOIPs with optimal bandgaps were successfully designed from a generated chemical space comprising over 8.20 × 1018 combinations, which included 733848 candidates (e.g., Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836) with an optimal bandgap of 1.34 eV for single junction solar cells, 1511073 large-bandgap candidates (e.g., Cs0.392FA0.016MA0.592Cr0.383Sr0.347Sn0.270Br1.171I1.829) for top parts in tandem solar cells (TSCs), and 20242 low-bandgap ones (e.g., MA0.815FA0.185Sn0.927Ge0.073I3) for bottom cells in TSCs. Finally, three new HOIPs were synthesized with an average bandgap error 0.07 eV between predictions and experiments. We are convinced that the proposed PSP method and ML progress could facilitate the discovery of new promising HOIPs for photovoltaic devices with the desired properties.
Development
of hyybrid organic–inorganic perovskites (HOIPs)
has flourished in photovoltaic (PV) technology over the past decade,
attributed to their exceptional merits including tunable bandgap,
long carrier diffusion length, high light-absorption coefficient,
low nonradiative loss, carrier mobility, simple solution processability,
and low-cost experimental synthesis.[1−4] HOIP materials have a universal formula,
ABX3, in which the A site usually contains organic cations
(such as MA+ → methylammonium, FA+ →
formamidinium) or Cs+, the B site involves divalent metal
cations (such as Pb2+, Sn2+, Ge2+, Ga2+), and the X site includes halogen anions (Cl–, Br–, I–). The
typical HOIPs, e.g., MAPbI3, FAPbI3, and their
derivatives, have been widely applied in single-junction devices,
namely perovskite solar cells (PSCs), with the bandgap range of 1.40–1.70
eV whose power conversion efficiencies (PCEs) have elevated to 25.5%[5] from the original 3.8%.[6] In the meantime, ascribed to the fascinating property of their tunable
bandgaps, HOIPs with low-bandgap (1.10–1.40 eV) are competent
in serving as bottom cells for all-perovskite tandem solar cells,
while the materials with a wide bandgap (1.70–2.30 eV) are
suitable as top cells for all kinds of tandem solar cells[7] in which their PCEs have reached 47.1%.[5]However, HOIP photovoltaic materials still
have unavoidable imperfections
and more potential for further progress. One of the topic issues is
the contaminant brought by the toxic element Pb that contributes to
most solar cells with outstanding PCEs. In this context, more Pb-free
HOIP materials are urgently needed to facilitate the commercial application
in solar cells.[3] Meanwhile, the bandgaps
of HOIPs in efficient single junction cells are mostly lying at 1.45–1.55
eV, whose theoretical maximum PCEs are 31.02–32.07% according
to the Shockley-Queisser (SQ) limit model.[8−10] However, the
optimum bandgap located in 1.20–1.40 eV renders the maximum
PCEs of 32.7–33.7%, while the highest value 33.7% could be
achieved at a bandgap of 1.34 eV. By adjusting the bandgap to the
optimal value, the HOIP-based device may have a higher PCE. As for
the case of tandem solar cells (TSCs), manipulating bandgaps of HOIPs
is also essential to gain satisfactory PCEs. The ideal bandgap of
the perovskite bottom cell is usually 1.20–1.30 eV, while the
suitable bandgap for the perovskite top cell is close to 1.70 eV for
Si or copper indium gallium selenide (CIGS) bottom cells (bandgap
1.10 eV) and 1.75–1.80 eV for perovskite bottom cells.[11] Considering the demands of nontoxicity and suitable
bandgap adjustment, it is still of paramount importance to discover
Pb-free HOIPs with suitable bandgaps for both of PSCs and TSCs.As experimental and quantum-based research has been suffering due
to the large cost in human endeavor, time consumption, and trial-and-error
attempts, machine learning (ML) has provided a more efficient strategy
to assist materials design at lower expense, which has been continuously
exerted in various fields such as electrocatalysis, batteries, polymers,
alloys, and others.[12−15] The ML technique has also been successfully applied in finding out
promising HOIPs with desired properties. As a proof of concept, Lu
et al.[16] prepared 212 ABX3 HOIPs
samples from high-throughput calculations along with their bandgaps
calculated via the density functional theory (DFT) method in which
11 organic cations were considered in the A site, 32 divalent metal
cations, and four halogen anions in the B and X sites. Fourteen features
were selected by using last-place elimination for the gradient boosting
regression (GBR) model, resulting the high model performance with
a determination coefficient (R2) of 0.985
and mean squared error (MSE) of 0.085 eV. The sorted features indicated
the importance of the tolerance factor (tf), octahedral factor (Of), and ionic
charge in B site (ICB) that took the major contributions
to the model performance. Then the bandgaps of 5158 virtually generated
samples were predicted via the GBR model. By inspecting their proper
predictions (0.9–1.6 eV), tf (0.8–1.2), Of (0.4–0.7), experimental accessibility,
and nontoxicity, six HOIPs were finally selected with DFT-calculated
bandgaps of 0.91–1.14 eV, namely C2H5OInBr3, C2H6NInBr3, NH3NH2InBr3, C2H5OSnBr3, NH4InBr3, and C2H6NSnBr3. Saidi et al.[17] arranged 862 ABX3 HOIP structures along with their DFT-calculated
bandgaps, in which 18 organic ions, Pb2+/Sn2+ cations, and Cl–/Br–/I– anions were included for A/B/X sites. A hierarchical convolutional
neural network (HCNN) was trained based on atomic descriptors to predict
the DFT bandgaps of HOIPs, exhibiting a low root mean squared error
(RMSE) value of 0.02 eV. In our recent work,[18] a robust extremely gradient boosting (XGBoost) model was fitted
based on 102 DFT-optimized samples and atomic descriptors to identify
the formability of DFT-optimized HOIP structures, which led the leaving-one-out
cross-validation (LOOCV) accuracy of 95% and testing accuracy of 88%.
By using the SHapley Additive exPlanations (SHAP) tool, we found that
the radius and lattice constant of B site had the positive contribution
to formability while the A site radius, tolerant factor, and first
ionization of B site were the reverse case. Handy with the well-established
XGBoost model, 198 nontoxic HOIPs were screened from 18560 generated
structures with formability probabilities over 99%.Although
the bright prospect of the ML technique in discovering
potential HOIP materials has been unveiled in these pioneering works,
the majority of the relevant work paid large attention to DFT-based
properties such as bandgap and formability instead of the experimental
characteristics whose results tend to deviate from the experiments,
e.g., the underestimated bandgaps calculated by generalized gradient
approximation (GGA).[3] Second, only limited
organic cations in the A site (up to 18) have been reported in the
current ML studies, while this work has collected 88 organic cations
(both from experimental and theoretical publications) that are expected
to extend the A site choices for experimentalists. Third, few researches
have reported the virtual designs for the doped HOIPs, e.g., the formula
as A′A″B′B″X′X″, though
there already are cases of complex HOIP formulas in experiments such
as Cs0.05FA0.79MA0.16Pb0.58Sn0.42I2.48Br0.52.[19] Thus, it is meaningful to expand the chemical space to
involve the doped structures, which may cause an enormous searching
space, e.g., 4.6 × 1011 combinations in the case of
A′A″A‴B′B″X′X″ (suppose
20 organic choices in the A site, 9 in the B site, 3 in the X site,
and doping ratio step 0.01). In this situation, the traditional high-throughput
ML prediction is no longer affordable using current computing power;
thus, an innovative searching strategy is imperatively essential to
avoid this dilemma. Last but not the least is the inadequate interpretations
of the established models despite their qualified performances, which
are basically important for a deep understanding of experimental principles.
Theus, there is still a large capacity to be elevated in discovering
new potential HOIPs with suitable bandgaps and nontoxicity.Herein, we reviewed 9594 PSC publications from Web of Science in
2009–2021 and collected 1201 HOIP samples along with their
experimental bandgaps. On the basis of the experimental sample set
in 2009–2020, various robust models were built, resulting in
four best models with the LOOCV R2 0.93–0.95
and testing R2 0.88–0.92. A weighted
voting regressor (WVR) model was designed to embed the four well-fitted
estimators, exhibiting a more comprehensive performance with LOOCV R2 0.95 and testing R2 0.91. The 42 samples in 2021 were set aside as the external set
to validate the WVR model, indicating the R2 of 0.84. By combining the SHAP tool and WVR model, the relationships
between the top 10 important features and HOIP formulas were successfully
explored based on both the experimental data set and a virtually generated
data set. Thereafter, we constructed a comprehensive chemical space
for the doped HOIPs formulated as A′A″A‴B′B″B‴X′X″X‴,
in which there were collected 88 organic fragments plus Cs/K/Rb in
the A site, eight metal elements in the B site (excluding Pb), and
Cl/Br/I in the X site, resulting in over 8.20 × 1018 combinations. In this regard, we proposed a new inverse design method,
namely proactive searching progress (PSP), to efficiently discover
Pb-free HOIPs with the desired bandgaps from such universal chemical
space, which only took a few minutes to locate at least one candidate.
As the explored result, we finally found 733848 HOIPs with a bandgap
of 1.34 ± 0.05 eV for single junction solar cells, 20242 with
1.20 ± 0.05 eV for bottom cells in TSCs, and 764883 with 1.70
± 0.05 eV as well as 746190 with 1.75 ± 0.05 eV for top
cells in TSCs. The three new compositions of HOIPs were synthesized
for model validation. Their bandgaps were characterized, resulting
in an average error 0.07 eV from the predictions. We are convinced
that such an inverse design method could help accelerate the development
of HOIP materials beyond bandgaps. Since the developed inverse design
methods are flexible and independent from the ML models, they could
also be applied in discovering other materials with their desired
properties.
Methods
In this section, we illustrate
the detailed processes of training
ML models and the PSP method. In the training process, various ML
packages were involved, including scikit-learn,[20] XGBoost, CatBoost, LightGBM, and SHAP,[21] which were used to fit robust models and analyze the relationships
between features and HOIP structures. The self-developed reverse design
method PSP was applied after the model building to find out the Pb-free
candidate HOIPs with proper bandgaps. Most codes were packed into
our Python package in https://pypi.org/project/fast-machine-learning/. Figure illustrates
the workflowchart concerning on the integrated package to construct
ML models and search new candidate HOIPs via PSP. The applied codes
are provided in GitHub: https://github.com/luktian/InverseDesignViaPSP. A demo of PSP method can be tried online at http://materials-data-mining.com/pspweb/.
Figure 1
Workflowchart concerning on the integrated package to construct
ML models and search new candidate HOIPs via PSP.
Workflowchart concerning on the integrated package to construct
ML models and search new candidate HOIPs via PSP.
Data Collection
and Feature Generation
In this study,
we extracted 1201 HOIPs along with their experimental bandgap values
from 9594 PSC publications from Web of Science (using the keyword
“perovskite solar cells”) in 2009–2021. The sample
number reduced to 479 from 1201 after cleansing the raw sample set
in Code S1 of Supporting Information. The 437 samples collected in 2009–2020 were regarded as
the history data for modeling and further divided into training and
test sets. The 42 samples from 2021 were set aside as the external
set to validate the generalization of ML model at last.An integrated
descriptor database was assembled from the Villars Database (https://mpds.io/) and Mendeleev Python
package (https://github.com/lmmentel/mendeleev) to depict the structure information on the HOIP samples, which
were furtherly divided into 34 base descriptors, eight string descriptors,
and 66 other descriptors. The base descriptors refer to the basic
chemical and physical properties for elements such as radius, volume,
chemical potential, electronegativity, and others, while string descriptors
involve categorical properties such as the block position in the periodic
table, the lattice crystal structure of the simple substance, and
the like. The string descriptors have been encoded into numeric data
ahead of generating descriptors via the encoder method provided in
scikit-learn.[20] The 66 other descriptors
contain relatively more sophisticated or pointless properties and
hence would not be employed in our data set in pursuit of an as understandable
model as possible. Regarding the lack of properties of the organic
cations of the A site, e.g., MA+ in methylammonium lead
iodide, some basic chemical and physical properties of the organic
fragments were calculated and supplemented by using Gaussian[22] and Multiwfn[23] software,
which involves molecular volume, electronegativity, chemical potential,
molecular radius, ionic radius, vertical electron affinity, and others.
The descriptor details are listed in Table S2, and quick instructions to generate the descriptor features for
a HOIP sample are illustrated in Code S2 of the Supporting Information. Given an HOIP structure
ABX3, the descriptors were generated for each site (A,
B, X), and there were 42 descriptors calculated in each site and hence
126 descriptors for each sample. Three universal structural factors,
namely Of,[24]tf,[25] and
tau factor (τf)[26] were
added, resulting in a total of 129 features.Although the structural
phases would have a large impact on bandgaps
as well, it would not be considered in the model due to insufficient
data. The phase information on the 479 samples is listed at https://github.com/luktian/InverseDesignViaPSP.
Feature Engineering and Model Training
The 129 features
were preprocessed by pruning the variables with missing values or
their standard deviations nearly to zero, bringing about 102 remnant
features. To perform a reasonable division of the 437 samples, the
size and distribution of samples in training and test sets were optimized
in Code S3 of the Supporting Information, resulting the optimal parameters of training size 81.95% and test
size 18.05% and random state 1959 (representing sample distributions).The recursive feature addition (RFA)[27] strategy was employed to search the optimal feature set, which determined
the optimal feature set by iteratively adding one more top nth feature under a feature importance (FI) order. In this
strategy, seven algorithms were considered in parallel, involving
CatBoost (CAT), XGBoost (XGB), LightGBM (LGB), gradient boosting machine
(GBM),[28] support vector machine (SVM),[29] decision tree regressor (DTR), and multiple
linear regressor (MLR).[20] The FI for tree-based
algorithms was ranked by the feature contributions extracted from
SHAP package,[21] while FI for the other
algorithms was calculated based on maximum relevance minimum redundancy
(mRMR) method.[30,31] The RMSE and R2 of LOOCV were regarded as indicators to determine the
outstanding models. It was found that the CAT, XGB, LGB, and GBT models
were the top 4 outstanding models based on 13, 13, 12, and 10 features
screened in Code S4 of the Supporting Information. The contributions of selected features were calculated
using SHAP package for each established model in Code S5 of Supporting Information.The
hyper-parameter optimizations for CAT, XGB, LGB, and LGB models
were performed using the grid search (GS) approach in Code S6 of the Supporting Information. The tree
number, learning rate, and tree depth were considered as the hyper-parameters
for the models, and the LOOCV RMSE of the training set was used to
indicate the model performance.
Weighted Voting Regressor
and Model Analysis
Inspired
from voting regressor (VR) implemented from scikit-learn[20] that combines multiple regressors as submodels
and returns the average predicted values, we developed a weighted
voting regressor (WVR). Different from VR model, the submodel in WVR
was trained based on individual feature sets, and the optimized weight
coefficient was used to control the model contribution to the final
prediction.The SHAP values of the WVR model were then calculated
in Code S8 of the Supporting Information that had overcome the incompatibility between the self-developed
WVR model and SHAP package. In pursuit of a more comprehensive analysis
of feature contributions to the model, in addition to the experimental
data set, an extra virtual data set composed of mixed HOIPs formulated
as A′A″B′B″X′X″ was generated
as the input data for SHAP calculation. The organic fragments MA,
FA, GA (diaminomethaniminium), EA (ethanaminium), and ED (ethane-1,2-diaminium)
plus Cs were considered as the candidates for the A site with the
elements Pb, Sn, Ge, Cd, and Pd for the B site and Cl, Br, and I for
the X site. For the balance between calculating cost and precision,
we set the doping range 0–1 for A/B site and 0–3 for
X site along with the range step 0.25 for A/B site and 1.0 for X site,
in which doping numbers for each site were all fixed at 2. As a result,
the virtually generated data set comprised 45000 virtual HOIPs.
Proactive Searching Progress (PSP)
Given a set of material
compositions {γ|γ = (γ,γ,...,γ), i ∈ R}, we could easily get their
predicted outputs {o|i ∈ R} from a well-fitted
ML model and obtain a set of samples {(γ,o)|i ∈ R}. The ML model
plays the role of a function o = f(γ) that describes the distribution of the
samples in the whole chemical space. In the case of searching the
materials with desired property value o* from a universal chemical space, there would be large wastes of
time and cost if all of the possibilities via the ML model o = f(γ) were traversed.
Inspired from the sequential model-based optimization (SMBO) in the
field of parameter optimization, we herein proposed PSP method, and
introduced Gaussian process (GP) method to describe the local distribution g(γ) to approximate the ML model o′ = g(γ)
∼ o = f(γ) in partial chemical space. According to the simulated GP distribution,
it could be easy to determine the compositions of next designed point
by identifying the composition with the o′
closest to o* and predict the property
value by using the ML model. By iteratively adding the newly explored
points, the GP distribution would be updated by steps and behave more
accurate for property estimation.The core objective of PSP
is to discover potential chemical compositions whose properties are
close to the expected value o* predefined
by the user. As the general workflow is concluded in Figure a, the initial virtual samples
{γ|i ≥ 2} will be randomly generated from the whole chemical space
at the initial step. The properties {o|i ≥ 2} of the initial samples
are predicted by using the established ML model to obtain the sample
points {(γ,o)|i ≥
2}. Instead of the predicted property value o, the expected error {EE|EE = |o – o*|, i ∈ R,EE → 0} between the prediction o and expected property value o* is practically defined as the response of GP distribution
to indicate the quality of each sample. The GP function approximates
the distribution asFor illustrative purposes, Figure b–d considers the searching
progress of a one-compositional material as a simple example in which
the first and second samples with EE values of 1.4 and 3.4 are plotted
in Figure b. The first
generated samples are taken to describe the local distribution by
GP method (or other likely methods), as shown in Figure (c). The plot data could be
practically obtained from the GP distribution. Then the next designed
point with the minimal EE estimated by GP method (EEGP)
is determined in the fitted GP distribution, and the predicted property
as well as EE are given via the ML model in Figure d. In the next step, a new designed point
will be iteratively determined by the updated GP distribution in Figure e until the iteration
reaches the maximum step predefined by the user. During the searching
progress, once the EE is lower than the predefined criterion, the
corresponding samples would be outputted. By only exploiting the local
space directionally, PSP could accelerate the searching period and
avoid the huge undesired compositions. Additionally, we established
a demo of PSP method online to discover HOIP candidates for the illustrative
purpose at http://materials-data-mining.com/pspweb/.
Figure 2
General principle of the proactive searching progress (PSP) method.
(a) Workflow details of PSP method. (b) First and second generated
virtual samples. (c) Simulated GP distribution based on the first
and second generated virtual samples. (d) Third virtual samples searched
by GP distribution. (e) Updated GP distribution adjusted by the third
virtual samples.
General principle of the proactive searching progress (PSP) method.
(a) Workflow details of PSP method. (b) First and second generated
virtual samples. (c) Simulated GP distribution based on the first
and second generated virtual samples. (d) Third virtual samples searched
by GP distribution. (e) Updated GP distribution adjusted by the third
virtual samples.
Results and Discussion
Model
Construction
The 1159 samples from the publications
in 2009–2020 were gathered, and 437 samples were retained after
cleansing the duplicates in Code S1. As
exhibited in Figure a, the bandgap ranges in 1.17–3.31 eV, and the most values
were in the 1.20–2.00 eV range accounting for 86.96% of the
whole set. Then 129 descriptors were generated for each sample, and
102 remained after preprocessing in Code S2. A reasonable training and test set were drawn in Code S3, rendering 358 training and 79 test samples.
Figure 3
(a) Distribution
of bandgaps for 437 HOIP samples. (b) Changing
trends of LOOCV RMSEs for CAT, XGB, LGB, and GBM models extracted
from RFA results. (c) Feature importance of top 10 features. (d) Distributions
of SHAP values with feature values.
(a) Distribution
of bandgaps for 437 HOIP samples. (b) Changing
trends of LOOCV RMSEs for CAT, XGB, LGB, and GBM models extracted
from RFA results. (c) Feature importance of top 10 features. (d) Distributions
of SHAP values with feature values.Given the preprocessed training set with 358 samples and 102 remnant
features, an RFA strategy was employed to filter optimal features
in Code S4. Figure b shows the decreasing trends of LOOCV RMSE
values for CAT, XGB, LGB, and GBT models as the selected feature number
increased in which RMSEs were fixed in 0.08–0.12 when the feature
number was over 13. Figure S5b plots the
same trend for SVM, DTR, and MLR but with higher RMSEs in 0.13–0.24,
which signaled the much poorer model performance than the tree-based
models. Figure S6 displayed the uprising
trend of LOOCV R2 for the four tree-based
models with R2 values over 0.89 based
on the top 13 features and the other three models along with those
lower than 0.85. The test metrics expressed in Figure S7–S8 indicated the same conclusions, and hence,
the four tree-based models were retained for the further step.By considering the balance between model performance and complexity,
the inflection points circled in Figure b were selected for model construction in
which the addition of the features triggered large promotions for
the LOOCV RMSEs. As a result, the CAT, XGB, LGB, and GBT models were
built up based on the top 13, 13, 12, and 10 features, respectively
(Figure S10), whose LOOCV R2 were 0.94, 0.92, 0.90, and 0.93, respectively (Figure S6). The corresponding test R2 of the four models were 0.88, 0.89, 0.87, and 0.91,
respectively (see in Figure S8), signifying
the powerful predictivities and generalization abilities. After hyper-parameter
optimization using the GS approach in Code S6, the LOOCV R2 of CAT, XGB, LGB, and
GBT models increased to 0.95, 0.93, 0.93, and 0.93, respectively,
while the corresponding test R2 were 0.91,
0.89, 0.88, and 0.92 respectively (Table S5). The relevant LOOCV and test RMSEs were 0.08–0.09 and 0.11–0.12.
The 5-fold cross-validation (CV5) and 10-fold cross-validation (CV10)
were additionally calculated in Table S5. The CV5 and CV10 R2 of the tree-based
models were 0.90–0.95, and the RMSEs were 0.08–0.11,
which might also indicate the good model performance.To explicitly
implement four well-fitted models, a meta-model could
be introduced to integrate the fitted submodels and balance their
individual weaknesses. For example, the voting regressor (VR) model
from scikit-learn[20] combines multiple different
machine learning regressors (trained on the same feature set) and
returns the average prediction. Inspired by the VR model, the weighted
voting regressor (WVR) model was designed whose submodels were trained
based on individual feature sets (e.g., XGB on 13 features while GBT
on 10 features). The optimized weight coefficients were used to control
the contributions of each submodel to the predicted values, and hence
the weighted predictions were returned. In this context, a WVR model
was fitted by combining CAT, XGB, LGB, and GBT models in Code S7. As shown in Table S7, the R2 and RMSE in LOOCV of
WVR were 0.95 and 0.079 better than the values of XGB, LGB, and GBT,
while the R2 and RMSE in WVR were 0.91
and 0.106 more favorable than the values of CAT, XGB, and LGB. Meanwhile,
the CV5 and CV10 R2 were 0.94, and the
corresponding RMSEs were 0.08. Hence, the WVR model complimented the
weakness of each submodel and achieved a comprehensively superior
performance than the submodels.To further evaluate the generalization
ability of the WVR model
for the unknown samples, a separate validating set composed of 42
HOIP samples in 2021 was prepared. Listed in Table S7, the R2 of external validating
set between experimental and predicted bandgaps was 0.84, higher than
the external R2 in the submodels (0.74–0.80
in Table S5) and slightly lower than the
testing R2 of 0.91 in the WVR model. The
RMSE, MAE, and MSE of the external validating set in WVR model were
0.060, 0.041, and 0.004, respectively, exhibiting even better than
the values of test set (0.106, 0.056, and 0.011), which indicated
the powerful predicting power and low predicting errors (0.041–0.056
eV) of the WVR model.
Model Analysis
The SHAP values were
calculated by utilizing
the self-developed WVR model and SHAP method to analyze the feature
contributions to model predictions in Code S8. Besides the experimental data set, a set of virtual data set composed
of 45000 doped HOIP samples was employed for model interpretation
via high-throughput computation.Figure c exhibits the top 10 features ranked by
their contributions extracted from SHAP values, whose vertical axis
comprised feature ranking, and the horizontal axis shows the SHAP
values for the features. Among the 10 features, five X-site descriptors
accounted for 50%, and NX (representing
element name in X site) took the most important position. In the remnants,
structural and B-site descriptors counted 2, respectively, while the
rest 1 was the one in A site. Figure d displays the distributions of SHAP values of features,
where the horizontal axis comprises the sorted sample indexes according
to model predictions. The red/blue color expresses the positive/negative
SHAP values for each sample and feature, which further indicate the
positive/negative contributions to predictions. The positive SHAP
values for each feature (in red color) were mainly localized in the
left part that corresponds to the higher bandgaps, while the negative
values (in blue color) were located at the right part related to the
lower predictions, thence indicating a well-separately isolated distribution
of positive/negative SHAP values and the large contributions to the
predictions of bandgaps. Similar plots for the virtual data set are
drawn in Figure S18b,d.The scatter
plots between the feature and SHAP values are drawn
in Figures and S19 with the color indicating the prediction
value to acquire a further understanding of each feature. The partial
dependence (PD)[28] plots are drawn in Figures and S21, which could signify the overall marginal
effect that the feature had on the prediction.
Figure 4
Scatter plots of the
features NX (a),
TIB (b), tf (c), and τf (d) versus their SHAP values.
Figure 5
Partial
dependence (PD) plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their predictions.
Scatter plots of the
features NX (a),
TIB (b), tf (c), and τf (d) versus their SHAP values.Partial
dependence (PD) plots of the features NX (a), TIB (b), tf (c), and τf (d) versus their predictions.As shown in Figure a, NX was the element name
in X site.
The SHAP values for NX showed a decreasing
trend as the NX values were increasing.
The samples with lower NX values and hence
corresponding higher SHAP values tended to express the larger bandgaps
(in darker red color) and vice versa. By marking an inflection point
(determining the SHAP values positive or negative) for NX of 34.72, the NX values
over 34.72 would result in the negative SHAP values, and the ones
lower than 34.72 led to the positive. The original values of NX have been converted to numbers in the procedure
for generating descriptors in which Cl, Br, and I were signaled by
20, 13, and 46 respectively. Hence, in pursuit of a higher/lower bandgap,
the ratio of I in the X site should be decreased/increased for a low/high NX value that would trigger a high/low SHAP value,
which was consistent to the current domain knowledges.[32−36] As indicated by the PD plot for NX in Figure , the same conclusion
was extracted that the bandgap decreased as the NX value increased. The steep points that triggered sharp
deceases in predictions were labeled on the plot. Specifically, the
steep point 25.00 for NX might refer to
the doped couple Br1.915I1.091 or Cl2.423I0.577, and 27.80 to Br1.655I1.345 or Cl2.100I0.900, 30.20 to Br1.455I1.545 or Cl1.846I1.154, 41.50 to
Br0.409I2.591 or Cl0.519I2.481, respectively.TIB signaled the third ionization
energy of the element
in the B site, in which the energies of Sn, Pb, Ge, Cd, and Pd were
2943, 3081, 3302, 3616, and 3177 kJ/mol, respectively. Indicated by
the scatter plot in Figure b, the TIB value was proportional to its SHAP value
and bandgap prediction. Hence, the higher bandgaps might require the
B site elements with the higher third ionization energy, which would
influence on the charge effect. Combining the relevant PD plots of
TIB in Figure b, it could be noted that the steep points at 2943 and 3081
kJ/mol were the third ionizations of Sn and Pb, which revealed the
overall increasing trending of the bandgap as the proportion of lead
arose in the doped B site couple SnPb. When the TIB value
was over 3081, the prediction was nearly unchanged and even slightly
declined, indicating that the higher ratios of Ge/Cd/Pd might have
little influence on bandgap.As shown in Figure c, the SHAP values of tf had an overall
increasing trend with the ascendent tf values, whose inflection point was around 0.971. The PD plot in Figure c displayed a sigmoid-like
function trend in the experimental and virtual data sets, in which
the tf range below/above 0.930/1.086 indicated
steady prediction changes and the data in the range of 0.930–1.086
resulted in a steep increment for bandgap values. As for another structural
factor τf, the inflection point of 3.50 could be
noticed from the scatter plot in Figure d. The SHAP values were mostly positive as
the τf value was lower than 3.50, while the values
tended to be more negative as the τf value was over
3.50. The PD plot in Figure d showed that the τf ranges lower than 3.46,
4.13–4.78 and higher than 4.78 would render steady prediction
trends, while the range 3.46–4.13 signaled an overall decreasing
tendency. Since these two factors were both relevant to structural
and geometric effects of HOIP samples, it might be comprehensive to
take them into consideration together. If we seek HOIPs with lower
bandgaps, the value of tf should not be
over 0.971 and the value of τf is recommended to
be 3.46–4.13, while HOIPs with tf > 1.014, τf < 3.46, and 4.13 < τf < 4.18 (4.18 determines the structure formability as indicated
in reference[26]) tend to own the larger
bandgaps.The full discussion of six other important descriptors
in the top
10 can be found in Code S8, which includes
discussions on the virtual data set and the individual conditional
expectation (ICE)[37] plots.
Proactive Searching
Progress
To exploit as much material
space as possible, we collected 88 organic fragments (Table S4) and three inorganic elements Cs/K/Rb
from experimental and theoretical publications for the A site, while
eight metal elements Sn/Ge/Pd/Bi/Sr/Ca/Cr/La and three halogen elements
Cl/Br/I were employed for the B and X sites. The doping range was
set as 0–1 for A/B site and 0–3 for X site with the
ratio step of 0.001 and the doping number 1–3. Thence the chemical
space comprised over 8.20 × 1018 combinations for
the formula A′A″A‴B′B″B‴X′X″X‴.To efficiently search the material compositions with expected bandgap
values from the universal chemical space, the PSP method was applied
in this study. The material compositions were set as the combinations
of the elements/fragments in each site and the relevant ratios. The
material property was set as the bandgap of HOIPs, and the EE criterion
was restrained less than 0.05 eV. The expected values were predefined
as 1.34 eV for PSCs, 1.20 eV for bottom cells in TSCs, and 1.70 and
1.75 eV for top cells in TSCs. For illustrative purposes, the high-dimensional
features for each HOIP samples were compressed into the first component
via component analysis (PCA), and Figure a presents the distribution of bandgaps versus
the first PCA component (PCA-1). After a few iterations in the progress
of searching the samples with 1.34 eV bandgap, 20 virtual samples
were drawn from the searched result in Figure b and linked by the dotted lines. The distribution
of bandgaps was obviously changed from the original one, and the predictions
had a decreasing trend as the searching progress proceeded, which
exhibited an efficient way to design the samples with expected property
values from a universal chemical space. An additional demo web could
be accessed at http://materials-data-mining.com/pspweb/.
Figure 6
Bandgap distribution
in the initial step (a) and after a few iterations
(b) versus the first component in principal component analysis.
Bandgap distribution
in the initial step (a) and after a few iterations
(b) versus the first component in principal component analysis.
Top Candidates
As a result of the
PSP design, we obtained
820782 samples for the Pb-free HOIPs with bandgap 1.34 ± 0.05
eV, 22808 samples with 1.20 ± 0.05 eV, 984938 samples with 1.70
± 0.05 eV, and 902401 samples with 1.75 ± 0.05 eV, respectively,
in which the feature distributions for the four candidate sets are
shown in Figure .
As displayed in Figure a, the tf of low-bandgap HOIPs (1.20
and 1.34 eV) revealed a narrower range than the wide-bandgap ones
(1.70 and 1.75 eV). The tf ranges of HOIPs
with 1.20 and 1.34 eV were 0.88–0.97 and 0.83–0.99,
while the cases of 1.70 and 1.75 eV were 0.80–1.02 and 0.81–1.20.
Hence, the lower tf values (<0.97)
could render a larger probability of low-bandgap samples and the higher tf values (>0.99) would lead to large-bandgap
samples, which is consistent with the conclusion in model analysis.
Simultaneously signaled by Figure b, the τf revealed an inverse trend
that the peak of feature distribution was decreasing as the bandgap
increased. The τf value of low-bandgap samples was
concentrated at 3.75–4.09 and could be extended to 3.67–4.18
(considering the formability criterion), while the peaks of τf distribution of large-bandgap samples were around 3.60–3.90.
Hence, the τf value fixed in 3.75–4.09 was
essentially needed for seeking the low-bandgap samples, and the feature
value should be controlled in 3.51–3.90 (or even lower) for
the large-bandgap samples, which conducted the similar conclusion
to the model analysis. In summary, the criterions tf < 0.97 and 3.75 < τf < 4.09
were recommanded for searching low-bandgap samples, while the criterions tf > 0.97, τf < 3.90 and
4.09 < τf < 4.18 were for the wide-bandgap
ones. In the experimental sample set, we found that the most samples
obeyed such criteria. For instance, the FAGeI3, MAGeI3, and FASnBr3 (bandgap 2.30, 2.00, 1.90 eV) had
the tf of 1.12, 1.10, 1.00 and τf of 4.79, 4.77, 3.52. And Cs0.3FA0.7SnI3, MASnI3, FA0.25MA0.75SnI3 (bandgap 1.29, 1.30, 1.28 eV) owned the tf of 0.95, 0.96, 0.96 and τf of 3.81,
3.78, 3.76. After filtering by using the criterion τf < 4.18, 0.8 < tf < 1.2 and
the existence of p-block elements in B site, there were 20242, 733848,
764883, and 746190 non-Pb samples left for the HOIPs with the bandgap
of 1.20, 1.34, 1.70, and 1.75 eV, respectively.
Figure 7
Statistical distributions
of tf (a),
τf (b), NX (c), and TIB (d) in explored candidate set.
Statistical distributions
of tf (a),
τf (b), NX (c), and TIB (d) in explored candidate set.Also exhibited in Figure c, NX of low-bandgap samples was
centered at a value of 46, indicating the X site of most low-bandgap
samples was composed of iodine. Meanwhile, the X site compositions
of most large-bandgap samples was the mixture of bromine and iodine.
As shown in Figure d, the TIB value of low-bandgap samples was concentrated
in 1849–3080 kJ/mol, while the opposite value was spreading
over 1800–5000 kJ/mol. The remnant feature distributions are
accessible in Figure S23. The four candidate
sets are accessible in our GitHub: https://github.com/luktian/InverseDesignViaPSP/tree/main/code9.We also expanded our studies to several specific HOIP chemical
systems to provide potential candidates for experimental researchers.
For example, from our searching results, we found that the candidates
Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836, Cs0.366FA0.338MA0.296Sn0.800Cr0.104-Pd0.096Br0.056I2.944, and Cs0.509FA0.412MA0.079Sn0.634Ge0.095Cr0.271Br0.109I2.891 had the
optimal bandgap 1.34 eV, representing the Cs-FA-MA-Sn-Ge-Pd, Cs-FA-MA-Sn-Cr-Pd,
and Cs-FA-MA-Sn-Ge-Cr mixed HOIPs, respectively, which might be potential
non-Pb alternatives to the Cs0.05FA0.79MA0.16PbSn1–Br0.52I2.48 (x = 0, 0.084, 0.168, 0.252, 0.336, 0.420, with bandgap 1.35–1.61
eV) studied by Ji et al.[19] In addition,
the candidates Cs0.494MA0.217FA0.289Sn0.866Ge0.134Br0.377I2.623, Cs0.305FA0.273MA0.422Sn0.975Cr0.025Br0.264I2.736, and Cs0.300MA0.675FA0.025Sn0.817Pd0.183Br0.051I2.949 also had a bandgap of 1.34 eV and represented Cs-FA-MA-Sn-Ge,
Cs-FA-MA-Sn-Cr, and Cs-FA-MA-Sn-Pd mixed HOIPs, respectively. One
hundred thirty-five Cs-FA-MA-based candidates are provided in Table S9. The bandgaps of Cs-FA-MA based HOIPs
could be adjusted to the ideal range 1.60–1.76 eV for the top
part in tandem cells as well. For instance, the candidates Cs0.392FA0.016MA0.592Cr0.383Sr0.347Sn0.270Br1.171I1.829, and Cs0.445FA0.161MA0.394Pd0.508Cr0.228Sn0.263-Br1.094I1.906, were found with 1.67 eV bandgap,
which were non-Pb substitutes of Cs0.15FA0.17MA0.68PbBr0.6I2.4 investigated by
Sala et al.[38] And 136 Cs-FA-MA based candidates
are listed in Table S10. As for the HOIPs
with bandgap 1.20 eV,[39−41] the non-Pb candidates MA0.815FA0.185Sn0.927Ge0.073I3 (1.22 eV) and MA0.861FA0.139Sn0.914Ge0.086I3 (1.22 eV) were found, as compared to the experimental
samples, e.g., FA0.55MA0.45Sn0.55Pb0.45I3 (1.24 eV),[42] FA0.5MA0.5Sn0.5Pb0.5I3 (1.20 eV),[43] and FA0.83Cs0.17Pb0.3Sn0.7I3 (1.23 eV).[44] Twenty-seven MA-FA-Sn
based candidates are supplied in Table S11. Moreover, from the explored samples, ASnI3 (A+ → ammonium), AGSnI3 (AG+ → hydrazinium),
and XQSnI3 (XQ+ → sulfonium) were also
found with low bandgaps of 1.29, 1.29, and 1.30 eV, whose formability
probability were predicted over 99% in our recent study.[45]
Experimental Validation
A few studies
on Sn–Ge
mixed HOIPs have been reported in recent years due to their environmentally
friendliness, high stability, and excellent optoelectronic properties.[46−48] However, there is a lack of studies on the system MASnGe1–I3 according to our collected data, except for the end points MASnI3 and MAGeI3. To explore the bandgaps of Sn–Ge
systems and validate the WVR model, a series of samples formulated
as MASnGe1–I3 (x = 1, 0.85, 0.74, 0.66, 0)
were synthesized, whose predicted bandgaps via the WVR model were
1.28, 1.41, 1.51, 1.59, 2.01 eV, respectively. The experimental details
were provided in Section S3 of the Supporting Information. The optical absorbance spectrum of
the material is shown in Figure S25. The
optical bandgap was obtained by using the Tauc formula,[49] which resulted in 1.23 and 2.02 eV for MASnI3 and MAGeI3 films, respectively. These two values
are in good agreement with other reports.[50,51] The initial incorporation of Ge (corresponding to x = 0.85) caused a sudden increase in the optical bandgap edge as
compared to MASnI3. With increasing the Ge concentration
(i.e., x changes from 0.85 to 0.66), the optical
bandgap edge has a small shift to lower wavelength, resulting in bandgaps
of 1.53, 1.55, 1.54 eV (corresponding to x = 0.85,
0.74, 0.66). The predictions of these three new compositions were
consistent to the experimental bandgaps with the average error of
0.07 eV.
Conclusion
In summary, four tree-based
ML models were built up, namely CAT,
XGB, LGB, and GBM models, based on the filtered 13, 13, 12, and 10
features, respectively. After tuning the model parameters, the R2 values of LOOCV and testing set for four models
were 0.90–0.94 and 0.88–0.91, respectively. By applying
the self-developed WVR meta-model, the four submodels were embedded
together, exhibiting the comprehensive performance of LOOCV R2 0.95 and testing R2 0.91. Then the top 10 features were analyzed by calculating the
SHAP contributions based the experimental data set and a virtually
generated data set, which revealed the mapping relationships between
the formulas and bandgaps in detail. In particular, tf below 0.971 and τf in 3.75–4.09
were beneficial for a low bandgap (e.g., 1.20 and 1.34 eV), while tf over 0.971 and τf in 4.09–4.18
(or lower 3.90) contributed to a large bandgap (e.g., over 1.70 eV).
To discover the new Pb-free HOIPs with suitable bandgaps, we generated
a universal chemical space for the HOIP formula A′A″A‴B′B″B‴X′X″X‴,
in which 91, 8, and 3 choices were considered for the A, B, and X
sites, respectively, resulting in 8.20 × 1018 combinations.
The inverse design method PSP was proposed to find out the candidate
HOIPs with desired bandgaps, in which 733848, 764883, 746190, and
20242 candidates were found for the HOIPs whose bandgaps were 1.34
eV for PSCs, 1.70/1.75 eV for top cells in TSCs, and 1.20 eV for bottom
cells in TSCs. From the searched results, Cs0.334FA0.266MA0.400Sn0.769Ge0.003Pd0.228Br0.164I2.836, Cs0.366FA0.338MA0.296Sn0.800Cr0.104Pd0.096-Br0.056I2.944, and Cs0.509FA0.412MA0.079-Sn0.634Ge0.095Cr0.271Br0.109I2.891 were found for Cs-FA-MA
mixed HOIPs with optimal bandgap 1.34 eV for PSCs. Cs0.392FA0.016MA0.592Cr0.383Sr0.347-Sn0.27Br1.171I1.829 and Cs0.445FA0.161MA0.394Pd0.508Cr0.228Sn0.263Br1.094I1.906 were found for Cs-FA-MA mixed
HOIPs with large bandgap for the top cells in TSCs, while MA0.815FA0.185Sn0.927Ge0.073I3 and MA0.86FA0.139Sn0.914Ge0.086I3 were found for the bottom cells
in TSCs. We believed that the inverse design methods as well as the
ML training processes could facilitate the development of both photovoltaic
fields and other advanced materials.
Authors: Scott M Lundberg; Gabriel Erion; Hugh Chen; Alex DeGrave; Jordan M Prutkin; Bala Nair; Ronit Katz; Jonathan Himmelfarb; Nisha Bansal; Su-In Lee Journal: Nat Mach Intell Date: 2020-01-17
Authors: Giles E Eperon; Tomas Leijtens; Kevin A Bush; Rohit Prasanna; Thomas Green; Jacob Tse-Wei Wang; David P McMeekin; George Volonakis; Rebecca L Milot; Richard May; Axel Palmstrom; Daniel J Slotcavage; Rebecca A Belisle; Jay B Patel; Elizabeth S Parrott; Rebecca J Sutton; Wen Ma; Farhad Moghadam; Bert Conings; Aslihan Babayigit; Hans-Gerd Boyen; Stacey Bent; Feliciano Giustino; Laura M Herz; Michael B Johnston; Michael D McGehee; Henry J Snaith Journal: Science Date: 2016-10-20 Impact factor: 47.728
Authors: Christopher J Bartel; Christopher Sutton; Bryan R Goldsmith; Runhai Ouyang; Charles B Musgrave; Luca M Ghiringhelli; Matthias Scheffler Journal: Sci Adv Date: 2019-02-08 Impact factor: 14.136