XGBoost: An Optimal Machine Learning Model with Just Structural Features to Discover MOF Adsorbents of Xe/Kr.

Heng Liang¹, Kun Jiang², Tong-An Yan³, Guang-Hui Chen¹.

Abstract

The inert gases Xe and Kr mainly exist in used nuclear fuel (UNF) at a Xe/Kr ratio of 20:80 and are difficult to separate. In this work, high-throughput computational screening of the G-MOFs database for metal-organic frameworks (MOFs) with high Xe/Kr adsorption selectivity was performed by combining grand canonical Monte Carlo (GCMC) simulations with machine learning (ML) for the first time. A comparison of eight classical ML models shows that the XGBoost model with seven structural descriptors is the most accurate in predicting the Xe/Kr adsorption and separation performance of MOFs. Compared with energetic or electronic descriptors, structural descriptors are easier to obtain. The determination coefficients R2 of the generalized model for Xe adsorption and Xe/Kr selectivity are close to 1, at 0.951 and 0.973, respectively. In addition, of the top 1000 MOFs ranked by GCMC simulation, the XGBoost model correctly predicted 888 in adsorption capacity and 896 in selectivity. Feature analysis of the XGBoost model shows that the density (ρ), porosity (ϕ), pore volume (Vol), and pore limiting diameter (PLD) of MOFs are the key features that affect the Xe/Kr adsorption property. To test the generalization ability of the XGBoost model, we also screened MOF adsorbents for the CO2/CH4 mixture and found that XGBoost again predicts much better than the traditional machine learning models, even with unbalanced data. Because the feature dimension of MOFs is low while the number of MOF samples in the database is very large, the problem suits a model such as XGBoost, which searches for the global minimum of the cost function, rather than a model involving feature creation. The present study represents the first report using the XGBoost algorithm to discover MOF adsorbents.

Year: 2021 PMID: 33842776 PMCID: PMC8028164 DOI: 10.1021/acsomega.1c00100

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

The noble gases xenon (Xe) and krypton (Kr) are widely used in industrial production and daily life due to their special physical and chemical properties. For instance, Xe can be used in commercial lighting,[1,2] medical imaging,[3] anesthesia,[4,5] and neuroprotection,[6,7] while Kr is widely used in the electronics industry, electric light source industry, as well as in gas lasers and plasma streams. The content of xenon and krypton in the atmosphere just covers a minor proportion, and they mainly exist in used nuclear fuel (UNF) with a Xe/Kr ratio of 20:80. The radioisotopes of 135Xe and 85Kr in the process of UNF reprocessing are significant gas fission nuclides,[8] with strong radiation and important applications in nuclear fuel cycling[9] and nuclear environmental monitoring. Note that Xe/Kr selective adsorption separation is also a key step in the reprocessing of the UNF. At present, the inert gases Xe and Kr are generally produced through separation by large-scale air separation equipment, using cryogenic distillation separation according to the difference in the boiling points of Xe and Kr (the boiling points of Xe and Kr are 161.7 and 115.8 K, respectively). The large energy consumption and high cost greatly limit the applications of Xe and Kr, and the development of a novel separation method of Xe–Kr binary gas mixture under mild conditions has always been the focus. Compared to cryogenic distillation, the utilization of solid adsorbents to achieve gas adsorption and separation is environmentally more friendly and economical. However, the separation of Xe and Kr using traditional solid adsorbents such as zeolite and activated charcoal[10−12] has poor adsorption selectivity and capacity, so scientists have been committed to the development of new adsorption materials. 
Compared with traditional adsorption materials, metal–organic frameworks (MOFs) are multifunctional nanoporous materials that have emerged over the past two decades, with advantages such as highly diverse crystal structures and tunable structural properties.[13] In recent years, a growing number of studies on the adsorption and separation of Xe and Kr[14] have shown that MOF materials are far superior to traditional zeolite and activated charcoal for separating the Xe–Kr binary mixture. High-throughput screening is an effective way both to obtain high-performance materials and to understand the structure–adsorption property relationship of candidate adsorbents. Generally, high-throughput computational screening is carried out by molecular simulations on a database with a large number of material samples to rapidly predict gas adsorption properties.[15] So far, nearly 70 000 different MOFs have been synthesized, and thousands more have been predicted theoretically but not yet synthesized.[16−19] In 2016, using molecular simulation and high-throughput screening among 120 000 MOFs, Banerjee et al.[20] found that SBMOF-1 has a large Xe uptake of 1.39 mmol/g and a Xe/Kr selectivity of 16.00. In 2018, Gong et al.[21] designed and synthesized Z11CBF-1000-2, with an improved Xe/Kr selectivity of 19.70 but a Xe adsorption capacity of just 0.02 mmol/g. Because MOFs are so varied and numerous, high-throughput screening for high-quality MOFs is also an expensive and time-consuming process. In recent years, with the arrival of the era of Big Data, the importance of data-driven machine learning (ML) techniques has become widely recognized. 
Unlike traditional calculation methods, ML is based on statistics rather than solving physical equations, which can predict material properties quickly at a low cost.[22] So far, ML-related models constructed by simple structural features of MOFs can predict material adsorption property quickly. For example, the Snurr group[23] utilized mix logistic regression (MLR), least absolute shrinkage and selection operator (LASSO), and ridge models to establish the relationship between geometric structures of MOFs and hydrogen storage capacity and found that the LASSO model has a better description on the hydrogen storage in MOFs; with 20 000 different nanoporous materials on the selective adsorption of Xe/Kr as training data, Smit et al.[24] utilized the random forest (RF) decision tree model to screen for the high-performance nanoporous separation materials in the testing set with 655 000 samples. However, the calculated mean-square error (MSE) is large, i.e., 1.41; in 2020, the Luo group[25] predicted the adsorption of MOFs on H2 using the deep neural networks (DNN) model, whose transfer leads to an increase in the determination coefficient to 0.98 for the screening of MOF adsorption on CH4, but this transfer ML model failed to screen for Xe/Kr-separated MOFs, with the determination coefficient dropped from 0.92 to 0.41. In this work, we tried to screen for the MOF selective adsorption of Xe/Kr based on the G-MOFs database (Material Genomic MOFs Database) self-assembled using MGPNM program.[26] This database is available at: https://figshare.com/s/ec378d7315581e48f1e4. Note that for the first time the G-MOFs database is used for the screening for MOF selective adsorption of Xe/Kr. 
The relationship between MOF features and Xe/Kr selective adsorption property was established using ML algorithms, including ridge regression,[27] LASSO,[28] Elastic Net,[29] Bayesian regression,[30] support vector machine (SVM),[31] artificial neural network (ANN),[32] RF,[33] and XGBoost.[34] Finally, the XGBoost model with just structural descriptors successfully predicted 38 top MOFs of larger Xe/Kr adsorption selectivity and Xe uptake than recently reported SBMOF-1[20] and Z11CBF-1000-2,[21] which overcomes the defects of random forest[24] and transfer machine learning model.[25] In addition, to test the generalization ability of the XGBoost model, we also tried to screen for MOF adsorbents on a more complex CO2/CH4 mixture and found that the prediction performance of XGBoost is also much better than that of the traditional machine learning models.

Computational Details

In this work, high-throughput screening for MOFs with selective adsorption of Xe/Kr was performed on the G-MOFs database. In total, 303 991 structures in G-MOFs were self-assembled from 17 different metal clusters and 9 functional groups connected by 32 different organic linkers with the Material Genomics program MGPN.[26] To date, 162 distinct MOF structures in G-MOFs have been synthesized experimentally.[26]

Grand Canonical Monte Carlo (GCMC) of Adsorption Simulation

The grand canonical Monte Carlo (GCMC)[35,36] method with the μVT ensemble was applied to simulate the adsorption of Xe and Kr on MOFs. The adsorbates are treated as rigid molecules during adsorption at 298 K and 1 bar, with a 20:80 Xe/Kr ratio as in the real UNF environment. Several types of molecular moves are considered, including translation, regrowth, deletion, and exchange. The adsorption process involves only nonbonding interactions, and the interaction between adsorbates and MOFs is calculated using the Lennard-Jones (LJ) potential

$$U_{ij} = 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]$$

where $\varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$ and $\sigma_{ij} = (\sigma_i + \sigma_j)/2$, with $\sigma_i$ and $\varepsilon_i$ being the diameter and well depth of atom i and $r_{ij}$ the distance between atoms i and j; the LJ interaction is cut off at 14.0 Å. Each simulation ran for 2 × 10^7 cycles in total, with the first 1 × 10^7 cycles used to equilibrate the system and the remaining cycles used to calculate the related thermodynamic properties. The UFF[37] force field was applied to the atoms of the adsorbent, while the TraPPE[38] force field was used for the krypton and xenon atoms; this combination has been successfully employed to describe the adsorption of Kr and Xe on MOF materials.[39] Our group[40,41] also used these force fields to describe Xe/Kr selective adsorption on UTSA-280 and Mg-SBMOF-1. The high-throughput GCMC simulations in this work were performed with the HT-CADSS[26] program. In an adsorption and separation process, the adsorption selectivity is an important parameter for judging separation performance. For the two-component gas mixture of Xe and Kr, the selectivity can be expressed as

$$S_{\mathrm{Xe/Kr}} = \frac{x_{\mathrm{Xe}}/x_{\mathrm{Kr}}}{y_{\mathrm{Xe}}/y_{\mathrm{Kr}}}$$

where x and y are the mole fractions of the components in the adsorbed phase and the bulk phase, respectively. The thermal stabilities of the top MOF materials were evaluated using the Forcite module[42] in the Materials Studio program.[43] During annealing, the temperature is raised from 300 to 1800 K in the NVT ensemble, with five cycles in five picoseconds.
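As a rough illustration of the pair potential and the selectivity definition in this section, the sketch below evaluates a 12-6 Lennard-Jones energy with arithmetic/geometric combining rules and computes the selectivity from adsorbed amounts. It is not the HT-CADSS implementation: the function names, the plain truncation at the cutoff, and the numerical parameters in the usage note are assumptions chosen for illustration, not UFF or TraPPE values.

```python
import math


def lorentz_berthelot(sigma_i, eps_i, sigma_j, eps_j):
    """Combine per-atom LJ parameters into pair parameters (sigma_ij, eps_ij)."""
    sigma_ij = (sigma_i + sigma_j) / 2.0   # arithmetic mean of the diameters
    eps_ij = math.sqrt(eps_i * eps_j)      # geometric mean of the well depths
    return sigma_ij, eps_ij


def lj_energy(r, sigma_ij, eps_ij, cutoff=14.0):
    """12-6 LJ energy at distance r (Å); assumed to be simply truncated at the cutoff."""
    if r >= cutoff:
        return 0.0
    sr6 = (sigma_ij / r) ** 6
    return 4.0 * eps_ij * (sr6 * sr6 - sr6)


def selectivity(n_xe, n_kr, y_xe=0.2, y_kr=0.8):
    """S_Xe/Kr = (x_Xe / x_Kr) / (y_Xe / y_Kr) from adsorbed amounts and bulk fractions."""
    total = n_xe + n_kr
    x_xe, x_kr = n_xe / total, n_kr / total   # adsorbed-phase mole fractions
    return (x_xe / x_kr) / (y_xe / y_kr)
```

For example, with hypothetical parameters `lorentz_berthelot(4.0, 100.0, 3.0, 25.0)` gives a pair diameter of 3.5 and a well depth of 50, and the energy at r = 2^(1/6)·σ equals the negative well depth. An adsorbed phase holding equal amounts of Xe and Kr in contact with the 20:80 bulk mixture corresponds to `selectivity(1.0, 1.0)`, i.e. a selectivity of 4.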

Machine Learning

Selection of Descriptors

Generally, an ML model predicts a target property from continuous feature data. Seven structural parameters of the MOFs [large cavity diameter (LCD), pore limiting diameter (PLD), global cavity diameter (GCD), pore volume (Vol), density (ρ), specific surface area (Sa), and porosity (ϕ)] were calculated with the Zeo++ 0.3[44] software. The histograms relating the adsorption property (Xe uptake and Xe/Kr selectivity) to these parameters in Figure S1a–g show that all seven parameters have a wide continuous distribution in the G-MOFs database, so they may be used as input features of the ML model. In ML, the effectiveness and relevance of the descriptors directly determine the accuracy of the model. Generally, descriptors (features) should possess the following three characteristics:[45] (1) some correlation with the output; (2) the lowest possible dimension; and (3) ease of acquisition. Suitable features can be selected by calculating the Pearson correlation coefficient[46]

$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}$$

where x and y represent two different features, and x̅ and y̅ represent their mean values. The Pearson correlation coefficient r lies between −1 and 1. A negative r indicates a negative correlation between the feature and the target property, and a positive r indicates a positive correlation. An absolute value of r between 0.5 and 1 represents a strong correlation, one between 0.3 and 0.5 a medium correlation, one between 0.1 and 0.3 a weak correlation, and one below 0.1 no correlation. 
We initially tested the correlation between the physical parameters and the gas adsorption property of MOFs to screen the features, using the correlation diagram shown in Figure 1. The above-mentioned seven physical parameters have a moderate correlation with the adsorption capacity of Xe, with Pearson correlation coefficients r greater than 0.33, and a strong correlation with the Xe/Kr selectivity, with r greater than 0.58, so they meet the requirements for descriptors used as input data.
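The feature-screening step can be illustrated with a short standard-library sketch that computes the Pearson coefficient and applies the strength bands quoted above. The function names and the sample lists in the usage note are hypothetical; the thresholds (0.5, 0.3, 0.1) are the ones given in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_strength(r):
    """Label |r| using the bands quoted in the text."""
    magnitude = abs(r)
    if magnitude >= 0.5:
        return "strong"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "weak"
    return "none"
```

A feature that rises or falls exactly linearly with the target gives r = 1 or r = −1, for instance `pearson_r([1, 2, 3], [3, 2, 1])`. With the values reported above, `correlation_strength(0.58)` is labelled strong and `correlation_strength(0.33)` medium.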

Figure 1

Correlation diagram of material features and adsorption properties of Xe/Kr based on the G-MOFs database. Note that the color bars represent the size of the Pearson correlation coefficients.

Algorithm Selection and Evaluation

The training set provides a mapping from the structures to the adsorption properties of the MOFs, from which an objective function is fitted that can accurately predict the adsorption properties in the testing set. Our dataset is composed of continuous physical parameters as input and the adsorption properties of MOFs, namely Xe uptake and Xe/Kr selectivity, as output. Therefore, we built supervised learning models using ridge regression, LASSO, Elastic Net, SVM, Bayesian regression, ANN, RF, and XGBoost. The criteria used to evaluate the quality of a regression model are the mean-square error (MSE), mean absolute error (MAE), root-mean-square error (RMSE), and determination coefficient R2, defined as follows.

$$\mathrm{MSE} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2$$

where M represents the number of samples, y the true value,[47] and ŷ the value estimated by the model; MSE is the expectation of the square of the difference between the true value and the estimated value. The larger the MSE value, the worse the prediction.

$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|y_i - \hat{y}_i\right|$$

where MAE represents the mean absolute difference between the true value and the estimated value. The larger the MAE value, the worse the prediction.

$$\mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2}$$

where RMSE is the square root of the MSE. The larger the RMSE value, the worse the prediction, and the RMSE is more sensitive to outliers. For adsorption capacity, the units of MSE, MAE, and RMSE are mmol2/g2, mmol/g, and mmol/g, respectively.

$$R^2 = 1 - \frac{\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{M}\left(y_i - \bar{y}\right)^2}$$

where y represents the true value and y̅ denotes the average of the true values. The numerator of the fraction is the sum of the squared differences between the true and predicted values, while the denominator is the sum of the squared differences between the true values and their average. The closer the determination coefficient R2 is to 1, the better the prediction. 
When R2 in the training set is much larger than that in the testing set, the model is considered overfitted, whereas a low R2 in both sets indicates underfitting; when the data are divided properly, the R2 value in the training set is somewhat larger than that in the testing set. These four metrics were used to evaluate the performance of the models, with R2 as the primary criterion and MAE, MSE, and RMSE as auxiliary ones. The choice of ML model has a great influence on the final prediction, but few researchers have elaborated on the advantages and disadvantages of different ML models, which hinders their application.
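The four evaluation criteria can be computed together in a few lines. The following is a minimal standard-library sketch; the function name and the returned dictionary keys are illustrative choices rather than anything prescribed by the text.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and R2 for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    m = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in residuals)            # sum of squared residuals
    mean_true = sum(y_true) / m
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # spread of the true values
    mse = ss_res / m
    return {
        "MSE": mse,
        "MAE": sum(abs(e) for e in residuals) / m,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }
```

Because RMSE is the square root of MSE, the two can be cross-checked against each other: the testing-set MSE values of 0.003 and 0.065 reported for XGBoost correspond to RMSE values of about 0.055 and 0.255.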

Results and Discussion

Evaluation of Different Machine Learning Models

To verify the reliability of the different models, the GCMC simulations of Xe uptakes and Xe/Kr selectivity were performed on all MOF samples as plotted in Figure S2. It is found that almost all MOFs have the Xe/Kr selectivity over 1, indicating that most of the MOFs prefer to adsorb Xe rather than Kr, which is also in line with our purpose to discover the MOFs selective adsorption of Xe in UNF. Note that seven structural features were selected as descriptors for training from the calculations of Pearson correlation coefficient. The present strategy aims at minimizing the training set and maximizing the testing set to build the model. At the same time, it is of significance to extract the subset of the overall distribution from all of the samples as the training set, which thus can be used to represent the overall distribution. According to the learning curve in Figure , we found that accompanied by the increase of training set data, the R2 value on the testing set increases gradually, while the degree of overfitting of the model decreases gradually. When using 30% data as the training set, the degree of overfitting of the model to the adsorption properties is below 5%. According to the above strategy and the adsorption property calculated by GCMC, 30% of the samples are used as the training set and the remaining 70% of the samples are included in the testing set. Therefore, the sampling method ensures that our training set materials fully cover our seven-dimensional feature space. In this principle, eight different models including ridge regression, LASSO, Elastic Net, SVM, Bayesian regression, ANN, RF, and XGBoost with seven descriptors were tested by the fivefold cross-validation, grid search, and hyperparameter tuning in the training set, with the relevant data of R2, RMSE, MSE, and MAE listed in Table S1. 
The parity plots representing the predicted and simulated adsorption selectivity and capacity of MOFs data for the above models are shown in Figure S3a–h as (x – 1) and (x – 2), respectively.
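The data handling described above (a 30% training subset, a 70% testing subset, and fivefold cross-validation inside the training set) can be sketched as follows. This is only an illustration: the text does not state how the training subset was drawn, so the random shuffle, the seed, and the function names are assumptions.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # assumed: uniform random sampling
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


def k_fold(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

Each hyperparameter combination in a grid search would be scored on the five validation folds and the best-scoring combination refitted on the whole training subset, leaving the 70% testing subset untouched until the final evaluation.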

Figure 2

Adsorption properties learning curve with increased training set data volume. Red represents the R2 value of the training set, while green represents the R2 value of the testing set for (a) Xe/Kr selectivity and (b) Xe uptake.

As for the linear models, the R2 values of LASSO and ridge regression are both close to 0.688, as listed in Table S1, indicating that the data possess few linear characteristics, as verified by the parity plots of Figure S3a,b, respectively. LASSO and ridge regression improve on the linear regression model by adding L1 and L2 norm penalties, respectively, and their effect is approximately equal to that of the Elastic Net model. However, the R2 value of the Elastic Net model is just 0.687, as shown in the parity plot of Figure S3c, indicating that the regularization coefficient has no great influence on the model, and tuning the L1 and L2 penalties has no apparent effect. Thus, linear models are not suitable for the G-MOFs database. As for nonlinear models, the R2 value of the SVM regression model is just 0.660, with the relevant parity plot shown in Figure S3d. This model performs very well on a small number of samples in high-dimensional space and depends strongly on the choice of kernel function, but its performance drops sharply when the number of samples is relatively large. Because the current dataset is low-dimensional with a large data volume, the SVM model performs poorly. For the Bayesian regression model, the calculated R2 value is 0.687, as shown in Table S1 and the parity plot of Figure S3e. The Bayesian model fits small-sample data well and yields a probability distribution for the test data rather than specific values. However, when the data grow to a large amount, such as the more than 300 000 structures in G-MOFs, the model reduces the influence of the distribution to a linear one. 
Therefore, this model is not suitable for the large amount of data in the G-MOFs database. As for the ANN model of deep learning, the determination coefficient R2 reaches 0.831 with the relevant parity plots shown in Figure S3f. Note that two hidden layers are used to train and build the ANN model after feature selection and adsorption performance mapping. It is shown that the model has high accuracy and weak dependence on the data structure. However, the training procedure of this model generates massive combined features and thus reduces the model interpretability, leading to difficulty in judging the effect of physical parameters on the adsorption property. As for the RF model, the determination coefficient R2 reaches 0.933 with the parity plots shown in Figure S3g. Note that the RF model consists of a large number of individual decision trees, where each tree will issue a category prediction, and the category with the most votes will be the prediction of our model. But when there are repetitive values in some feature of MOFs leading to a lot of noises in the dataset, RF cannot accurately predict the values of the objective function. However, note that the determination coefficient R2 of the XGBoost model for Xe adsorption and Xe/Kr selectivity reaches 0.951 and 0.973, with MSE at just 0.003 and 0.065, MAE at just 0.029 and 0.147, and RMSE at just 0.055 and 0.255 in the testing set, respectively, as listed in Table S1 and shown in the parity plots of Figure S3h. For the XGBoost model, we carried out fivefold cross-validation and grid search to tune the hyperparameters. The main parameters optimized by XGBoost model are eta (0.1), max_depth (10), min_child_weight (0.5), and subsample (0.8). From the statistical point of view, the prediction performance of the XGBoost model is much superior to the above ones. 
XGBoost is an algorithm based on boosted trees, with a regularization term added to the optimization objective function:

$$\Psi = \sum_{i=1}^{N} l\big(y_i, f(X_i)\big) + \gamma L + \frac{1}{2}\lambda\sum_{m=1}^{L}\omega_m^2$$

Here, Ψ represents the objective function, l the loss function, y the target value of the data, f(X) the predicted value, and N the number of samples. To prevent overfitting, the XGBoost model uses regularization, with γ and λ as regularization coefficients controlling the complexity of the model and the output of the objective function. When γ and λ are both equal to 0, the objective function reduces to the loss function alone. L represents the number of leaf nodes and ω_m represents the influence (weight) of the mth leaf node on the model. Adding the regularization term does not improve the accuracy but prevents the model from overfitting during the iterative process. The XGBoost model uses a second-order Taylor approximation of the loss function and speeds up the search for the global minimum through the first and second derivatives of the loss function. The specific derivation can be found in the related literature.[48] Compared with other ML models, the better performance of XGBoost in predicting the adsorption of MOFs stems not only from the regularization term in the cost function and the second-order Taylor expansion of the cost function, which help overcome overfitting, but also from the fact that easily available structural descriptors alone suffice for accurate prediction of adsorption properties. These structural features do not have a very strong correlation with the adsorption properties, and the ridge, LASSO, Elastic Net, Bayesian, and ANN models cannot predict accurately without strongly correlated features. In summary, the R2 values of the XGBoost model on the testing set for Xe adsorption and Xe/Kr selectivity prediction are close to 1 and much larger than those of the other models. 
MAE, MSE, and RMSE are also close to 0, which meets the requirement of an excellent regression model. Therefore, we finally chose the XGBoost model with seven structural descriptors to predict MOFs with selective adsorption of Xe/Kr in the following.
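To make the role of the regularization coefficients concrete, the sketch below works through the standard second-order result for a fixed tree structure: with per-sample first and second derivatives of the loss summed over a leaf (G and H), the optimal leaf weight is −G/(H + λ), and each leaf contributes −G²/(2(H + λ)) + γ to the objective. This follows the usual XGBoost derivation for a squared-error loss and is not code from the study; all function names are illustrative.

```python
def squared_error_grad_hess(y_true, y_pred):
    """First and second derivatives of 0.5 * (y_pred - y_true)**2 w.r.t. y_pred."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hessians = [1.0] * len(y_true)
    return grads, hessians


def leaf_weight(grads, hessians, lam):
    """Optimal weight of one leaf: -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + lam)


def structure_score(leaves, lam, gamma):
    """Regularized objective of a tree given (grads, hessians) for each leaf."""
    total = 0.0
    for grads, hessians in leaves:
        g_sum, h_sum = sum(grads), sum(hessians)
        total += -0.5 * g_sum * g_sum / (h_sum + lam) + gamma  # gamma: cost per leaf
    return total
```

With targets 1, 2, 3 and an initial prediction of 0, an unregularized leaf (λ = 0) takes the mean value 2.0, while λ = 1 shrinks the weight to 1.5. This shrinkage is how the λ term limits overfitting, and γ penalizes every additional leaf.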

Construction and Verification of XGBoost Model

30% of the MOF samples are selected as the training set and the remaining 70% as the testing set, as listed in Table S1. We included only the seven structural descriptors LCD, PLD, GCD, Vol, ρ, Sa, and ϕ. Through fivefold cross-validation, grid search, and hyperparameter tuning on the training set, the XGBoost model constructed with structural descriptors gives determination coefficients R2 of 0.976 and 0.986 for the Xe adsorption capacity and the Xe/Kr selectivity in the training set, with RMSE at 0.032 and 0.182, respectively, while for the testing set the determination coefficients are 0.951 and 0.973, with RMSE at 0.055 and 0.255, respectively. The XGBoost model therefore shows no obvious overfitting or underfitting, as seen in the parity plots of Figure 3a,b. In addition, the predicted top 10 MOFs in Xe/Kr selectivity are completely consistent with those screened out by GCMC simulations, as shown in Table 1 and the plots of Figure 4a,b. The top 2 MOFs in selectivity predicted by the model are the same, and in the same order, as those from GCMC simulations: Al2O6-fum_B_No3 and Al2O6-ADC_B-fum_B_No112. The adsorption properties of these two MOFs from the GCMC simulations and from the XGBoost model are compared in Table 1 and Figure 4. The simulated Xe/Kr selectivities of the two MOFs are 27.68 and 27.45, respectively, while the predicted selectivities are 26.98 and 26.26, with relative errors of just 2.53 and 4.34%, respectively. In addition, the predicted Xe adsorption capacities of the two MOFs are 2.49 and 2.44 mmol/g, with relative errors of just 0.28 and 1.77%, respectively. 
To assess the thermal stability of the top 2 materials, Al2O6-fum_B_No3 and Al2O6-ADC_B-fum_B_No112, we carried out simulated annealing and found that at 1600 K they still keep stable structures without collapse. The calculated Henry coefficients of these two materials for Xe are 24.77 and 26.23 mmol g–1 bar–1, respectively. The Henry selectivities for Xe/Kr are 22.76 and 27.36, indicating that these two materials have remarkable desorption properties.
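The relative errors quoted for the top 2 MOFs follow from the definition RE = |GCMC − XGBoost|/GCMC × 100% given with Table 1. A minimal check, with an illustrative function name and the Table 1 values as input:

```python
def relative_error_percent(gcmc, predicted):
    """Relative error of a prediction with respect to the GCMC value, in percent."""
    return abs(gcmc - predicted) / gcmc * 100.0


# Table 1 values for the top 2 MOFs: (GCMC, XGBoost)
selectivity_pairs = [(27.68, 26.98), (27.45, 26.26)]
uptake_pairs = [(2.483, 2.490), (2.481, 2.437)]

selectivity_re = [round(relative_error_percent(g, p), 2) for g, p in selectivity_pairs]
uptake_re = [round(relative_error_percent(g, p), 2) for g, p in uptake_pairs]
```

Here `selectivity_re` evaluates to `[2.53, 4.34]` and `uptake_re` to `[0.28, 1.77]`, matching the RE columns of Table 1.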

Figure 3

Parity plots for training and testing sets data from the G-MOFs database using XGBoost model for the (a) Xe/Kr selectivity and (b) Xe uptake at 1 bar and 298 K. Each dot represents one MOF structure from the G-MOFs database. The red and blue dots represent the training set and testing set data, respectively.

Table 1

Comparison of Xe/Kr Adsorption Property of Top 10 Materials between GCMC Simulations and XGBoost Model Prediction

	XGBoost		GCMC		RE (%)
MOFs	N_Xe	S_Xe/Kr	N_Xe	S_Xe/Kr	N_Xe	S_Xe/Kr
Al₂O₆-fum_B_No3	2.490	26.98	2.483	27.68	0.28	2.53
Al₂O₆-ADC_B-fum_B_No112	2.437	26.26	2.481	27.45	1.77	4.34
Al₂O₆-ADC_B-fum_B_No107	2.599	25.43	2.804	26.37	7.31	3.56
Al₂O₆-ADC_B-fum_B_No102	2.544	20.15	2.837	25.84	10.33	22.02
Al₂O₆-fum_B_No6	2.589	20.14	2.680	22.49	3.40	10.45
Al₂O₆-ADC_B-fum_B_No100	2.377	19.92	2.617	22.23	9.17	10.39
CuN₄-SiF₆-irmof20_No1	2.155	19.31	2.657	20.48	18.89	5.71
ZnN₄-SiF₆-irmof20_No5	1.983	18.45	2.657	20.31	25.37	9.16
Al₂O₆-fum_B-irmof6_B_No27	2.374	18.25	2.056	19.40	15.47	5.93
Al₂O₆-BDC_B-fum_B_No125	2.269	17.76	2.172	19.31	4.47	8.03

Note that the uptakes N_Xe are in mmol/g. RE = |GCMC – XGBoost|/GCMC × 100%.

Figure 4

Schematic plots with comparison of top 10 materials between GCMC simulation and XGBoost prediction on (a) Xe/Kr selectivity and (b) Xe uptake.

From Table 2, it is found that 8 MOFs in G-MOFs are better than Z11CBF-1000-2 and 38 MOFs are better than SBMOF-1 in both adsorption capacity and selectivity. For reference, the Xe/Kr adsorption selectivities of SBMOF-1[20] and Z11CBF-1000-2[21] are 16 and 19.70, and their Xe capacities are 1.39 and 0.02 mmol/g, respectively. Meanwhile, from GCMC simulations, we found that 30 of these 38 MOFs are covered in the testing set. Following Woo’s work,[49] we compared the top 1000 MOFs predicted by the XGBoost model with those from GCMC simulations and found that 888 of the XGBoost-predicted MOFs fall within the GCMC-simulated top 1000 in Xe adsorption capacity and 896 fall within the GCMC-simulated top 1000 in Xe/Kr selectivity, as listed in Table 3, both accounting for almost 90%. Therefore, the present XGBoost model with seven descriptors can accurately predict high-performance MOFs for selective Xe/Kr adsorption.

Table 2

Number of Predicted MOFs in G-MOFs with Better Xe/Kr Selective Adsorption Performance than SBMOF-1 and Z11CBF-1000-2

	adsorption property
reference MOF	S_Xe/Kr	N_Xe	S_Xe/Kr and N_Xe
SBMOF-1	38	1169	38
Z11CBF-1000-2	8	190 191	8

Note that S_Xe/Kr represents Xe/Kr selectivity, while N_Xe represents Xe adsorption.

Table 3

Number of the GCMC Top 1000 Materials Found among the Top N XGBoost Predictions

	1000	2000	3000	4000	5000
S_(Xe/Kr)	896	974	996	999	1000
N_(Xe)	888	972	993	996	1000

Note that S_Xe/Kr represents Xe/Kr selectivity, while N_Xe represents Xe adsorption.

The XGBoost model, developed by the Guestrin group[34] in 2016, quickly became well known in ML competitions and is now widely used in fields such as diagnosis and materials owing to its speed and accuracy. For example, the Ni group[50] tried four different regression models, taking the atomic volume, mass density, unit cell volume, and lattice type of crystal materials as features, and accurately predicted the thermal conductivity of the crystal materials; the Karanicolas group[51] built a drug scoring function based on the XGBoost model, which performs much better than the traditional scoring function, and successfully found novel targeted drugs for AChE. For gas adsorbents, the features of MOF samples are low-dimensional, so it is suitable to use a model that can accurately search for the global minimum of the cost function rather than a model involving feature creation. Covalent organic frameworks (COFs) and zeolites are also materials with low-dimensional structural features. We hope that the present XGBoost model with only structural descriptors may assist the screening and design not only of MOF adsorbents but also of other porous adsorbents such as COFs and zeolites in the future. Thus, the XGBoost model stands out from the comparison with different ML algorithms, including transfer machine learning[25] and random forest[24] models, owing to its excellent performance, although it has not previously been reported in the field of materials discovery for gas adsorption separation. Compared with screening for MOF adsorbents using features such as AP-RDF[49] and Qst,[52] the present XGBoost model successfully found high-performance MOF adsorbents using structural descriptors alone.
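The overlap counts reported in Table 3 amount to asking how many of the GCMC top 1000 structures appear among the top N structures ranked by the model. A small sketch of that bookkeeping is given below; the function name, the dictionary layout, and the toy data in the usage note are hypothetical.

```python
def top_n_overlap(simulated, predicted, top_k, n_values):
    """Count how many of the simulated top_k items appear in the predicted top n.

    simulated, predicted: dicts mapping a structure name to its property value.
    Returns a dict {n: overlap count} for every n in n_values.
    """
    reference = set(sorted(simulated, key=simulated.get, reverse=True)[:top_k])
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return {n: len(reference.intersection(ranked[:n])) for n in n_values}
```

With toy data `simulated = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}` and `predicted = {"A": 4.8, "B": 3.1, "C": 3.5, "D": 3.3, "E": 0.9}`, the call `top_n_overlap(simulated, predicted, 2, [2, 3, 4])` returns `{2: 1, 3: 1, 4: 2}`: the second-best simulated structure is only recovered once the predicted list is widened to four entries, which is the same pattern as the counts rising from 896 to 1000 in Table 3.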

Influence of Structural Features on Adsorption Performance

To understand the impact of different structural parameters of MOFs on the Xe/Kr adsorption property, we compared the weight coefficients of the features for the Xe adsorption capacity and Xe/Kr selectivity in the XGBoost model using the histogram in Figure 5, which shows that the features that significantly affect both the Xe uptake and Xe/Kr selectivity are Vol, ρ, ϕ, and LCD.

Figure 5

Histogram of the influence of seven structural features in the XGBoost model.

To explore the range of the four features corresponding to the optimal adsorption property, we plotted the feature–adsorption property relationships as scatter plots in Figure 6a–d. The density (ρ) of the MOFs with large Xe/Kr selectivity and Xe adsorption capacity is about 1.0 g/cm3, while the other three features are negatively correlated with the adsorption property.

Figure 6

Schematic scatter plots of the four main structural features that influence the Xe uptake and Xe/Kr selectivity, including (a) ρ, (b) ϕ, (c) Vol, and (d) PLD of MOFs.

To find the specific ranges of the four features (Vol, ρ, ϕ, and PLD) that affect the adsorption property, we used a regression decision tree[53] scheme, and three different datasets were used to train the decision tree model. The dataset of all MOFs is defined as Class A; the MOFs with selectivity larger than 10 and uptake larger than 1 mmol/g are defined as Class B, which represents promising materials with large Xe/Kr selectivity; after excluding Class B from Class A, the remainder is defined as Class C, representing poorly performing MOFs with low Xe/Kr selectivity and uptake. Through fivefold cross-validation, grid search, and hyperparameter selection for the regression tree model, the data for the three classes of MOFs were analyzed, as collected in Table S2. Unlike Classes A and B, Class C shows no overfitting within the decision tree model, with MSEs for the Xe uptake and Xe/Kr selectivity of just 0.204 and 1.718, respectively. Therefore, we chose Class C as the dataset for the decision tree model. A regression tree with a maximum depth of 3 is used to describe the data, and the nodes with the maximum average adsorption capacity and selectivity are selected as the high-quality adsorption criteria, as plotted in Figure 7. Note that the maximum average adsorption capacity, for the node with 376 samples, is 2.495 mmol/g, corresponding to a volume (Vol) less than 1023.375 m2/cm3 and a density (ρ) between 0.731 and 0.985 g/cm3, as plotted in Figure 7a. In addition, when the density (ρ) of the materials is larger than 0.929 g/cm3 and the PLD is less than 7.244 Å, 15 281 samples with a large average selectivity of 6.094 were screened, as plotted in Figure 7b.
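The class definitions and the two tree-node rules quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five records, their field names, and the helper function names are hypothetical and are not taken from the G-MOFs database or from the authors' code; only the thresholds (selectivity > 10, uptake > 1 mmol/g, and the Vol, ρ, and PLD cut-offs) come from the text.

```python
from statistics import mean

# Hypothetical toy records; the values are invented for illustration only.
# rho in g/cm3, vol in the units used in the text, pld in angstrom,
# uptake in mmol/g, sel = Xe/Kr selectivity.
mofs = [
    {"name": "mof_a", "rho": 0.80, "vol": 900.0, "pld": 5.0, "uptake": 2.6, "sel": 12.0},
    {"name": "mof_b", "rho": 0.85, "vol": 1000.0, "pld": 6.0, "uptake": 2.4, "sel": 8.0},
    {"name": "mof_c", "rho": 0.95, "vol": 950.0, "pld": 6.5, "uptake": 2.0, "sel": 6.5},
    {"name": "mof_d", "rho": 1.20, "vol": 600.0, "pld": 4.0, "uptake": 1.5, "sel": 7.0},
    {"name": "mof_e", "rho": 0.40, "vol": 3000.0, "pld": 12.0, "uptake": 0.5, "sel": 2.0},
]


def is_promising(m, sel_min=10.0, uptake_min=1.0):
    """Class B criterion from the text: selectivity > 10 and uptake > 1 mmol/g."""
    return m["sel"] > sel_min and m["uptake"] > uptake_min


def split_classes(records):
    """Return (Class A, Class B, Class C), where Class C = Class A minus Class B."""
    class_a = list(records)
    class_b = [m for m in class_a if is_promising(m)]
    class_c = [m for m in class_a if not is_promising(m)]
    return class_a, class_b, class_c


def in_uptake_node(m):
    """Node rule quoted for Figure 7a: Vol < 1023.375 and 0.731 < rho < 0.985."""
    return m["vol"] < 1023.375 and 0.731 < m["rho"] < 0.985


def in_selectivity_node(m):
    """Node rule quoted for Figure 7b: rho > 0.929 and PLD < 7.244."""
    return m["rho"] > 0.929 and m["pld"] < 7.244


class_a, class_b, class_c = split_classes(mofs)

# Average property of the Class C samples that fall into each node.
uptake_node = [m for m in class_c if in_uptake_node(m)]
selectivity_node = [m for m in class_c if in_selectivity_node(m)]
mean_uptake = mean(m["uptake"] for m in uptake_node)
mean_selectivity = mean(m["sel"] for m in selectivity_node)

print(len(class_a), len(class_b), len(class_c))
print(round(mean_uptake, 3), round(mean_selectivity, 3))
```

In a depth-3 regression tree, each leaf is reached by a conjunction of such threshold tests, and the value reported for the leaf is the mean of the training samples that land in it; the two filter functions reproduce that idea for the two nodes described in the text.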

Figure 7

Influence of the four main structural features including ρ, ϕ, Vol, and PLD of MOFs on (a) Xe uptake and (b) Xe/Kr selectivity, under the decision tree model.

To investigate the effect of the different metal centers and organic ligands on the adsorption property, the GCMC-simulated top 500 MOFs in Xe adsorption capacity and the top 500 in Xe/Kr selectivity were screened out; together, the two lists contain 602 different materials. The proportions of the metal centers and organic ligands among these materials are represented in pie charts, as shown in Figure 8a,b, respectively. For the statistics, the weight of every metal center and of every single ligand is set as 1.0, while the weight of each ligand in a dual-ligand MOF is set as 0.5. In the G-MOFs database assembled by the HT-CADSS program, almost all of the materials with good adsorption and separation properties for Xe/Kr contain the Al2O6 cluster and the FUM ligand, as shown in Figure 8a,b. It is noted that Al2O6-fum_B_No3, which combines the Al2O6 cluster and the FUM ligand, has the largest Xe/Kr adsorption selectivity and Xe uptake, at 21.99 and 2.52 mmol/g, respectively.
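The weighting scheme behind the pie charts can be illustrated with a short tally. This is a minimal sketch under stated assumptions, not the authors' script: the three entries are a hypothetical toy list (the building-block labels are borrowed from MOF names that appear in the text), and the reading that each ligand of a dual-ligand MOF contributes 0.5 is our interpretation of the weighting described above.

```python
from collections import Counter

# Hypothetical toy list; each entry has one metal cluster and one or two ligands.
top_mofs = [
    {"metal": "Al2O6", "ligands": ["FUM"]},
    {"metal": "Al2O6", "ligands": ["ADC", "FUM"]},
    {"metal": "Zn2O8", "ligands": ["BTC"]},
]


def weighted_shares(entries):
    """Return (metal shares, ligand shares) as fractions that each sum to 1.

    Every metal cluster counts 1.0. A ligand counts 1.0 / (number of ligands
    in its MOF), i.e. 1.0 for a single ligand and 0.5 for each of two ligands.
    """
    metal_weight = Counter()
    ligand_weight = Counter()
    for entry in entries:
        metal_weight[entry["metal"]] += 1.0
        per_ligand = 1.0 / len(entry["ligands"])
        for ligand in entry["ligands"]:
            ligand_weight[ligand] += per_ligand

    metal_total = sum(metal_weight.values())
    ligand_total = sum(ligand_weight.values())
    metal_share = {k: v / metal_total for k, v in metal_weight.items()}
    ligand_share = {k: v / ligand_total for k, v in ligand_weight.items()}
    return metal_share, ligand_share


metal_share, ligand_share = weighted_shares(top_mofs)
print(metal_share)
print(ligand_share)
```

With this weighting, every MOF contributes a total of 1.0 to the metal tally and 1.0 to the ligand tally, so dual-ligand structures are not counted twice in the ligand pie chart.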

Figure 8

Pie chart of the structural characteristics of the top 500 MOFs in Xe uptake and Xe/Kr selectivity: (a) the proportion of different metal cluster and (b) the proportion of different organic ligands.

Model Generalization

To learn the application range of the XGBoost model, we decided to screen for adsorbents for another gas mixture in G-MOFs. Note that coal is usually gasified under high temperature and high pressure to produce a large amount of mixed gas, mainly composed of CO2 and CH4, with a gas content ratio of about 1:1.[54] Capturing carbon dioxide before combustion can effectively improve the utilization of methane. Therefore, it is very important to find suitable MOF materials to selectively capture CO2 from CH4. Moving the screening of MOFs for adsorptive separation from Xe/Kr to CO2/CH4 is a step from a simple to a more complex mixture, which can further reflect the suitability of our ML model. With the CO2/CH4 ratio in the mixture at 50:50, the adsorption on MOFs was simulated with the GCMC method at 298 K and 1 bar, using the same method as for Xe/Kr. Note that the best separation material in the G-MOFs database is Zn2O8-BTC_B-irmof7_A_No16,[26] with a CO2/CH4 selectivity of 50.5 and a CO2 uptake of 4.0 mmol/g, which is lower than SAJFEO, with a CO2/CH4 selectivity of 210.33 and a CO2 uptake of 6.31 mmol/g.[55] Through the data analysis listed in Table S3, we found that the calculated adsorption properties in the G-MOFs dataset are obviously imbalanced, especially the CO2/CH4 adsorption selectivity. For a machine learning model, a good dataset is one in which the number of positive samples is roughly equal to that of the negative ones. Imbalanced data refer to large gaps in the cost function among the large number of samples. In the data for selective adsorption of CO2/CH4 on MOFs, there are only a few highly selective materials. The model will therefore be biased toward the low-selectivity range, which is not beneficial to the prediction of highly selective materials. As for the Xe/Kr adsorption selectivity, there is no obvious imbalance problem in the data of the G-MOFs database.
The XGBoost model can define the ratio of different data and thus overcome the imbalance problem from the perspective of adsorption characteristics. Based on the adsorption data of G-MOFs for CO2/CH4, we first tested the correlation between the physical parameters and the CO2/CH4 adsorption property of the MOFs, with the correlation diagram shown in Figure S4. For both the CO2 adsorption capacity and the CO2/CH4 selectivity, seven physical parameters (LCD, PLD, GCD, Vol, ρ, Sa, and ϕ) are moderately correlated with the adsorption property, with Pearson correlation coefficients r greater than 0.3, which is the same as for the Xe/Kr system. These seven material features were again used as descriptors, with a training set:testing set ratio of 30:70, to build the model, and fivefold cross-validation and grid search were used to tune the hyperparameters of the model. The predictions of the CO2/CH4 adsorption performance for the G-MOFs database are depicted in the parity plots shown in Figure S5a,b. From Table S4, the determination coefficients R2 of the XGBoost model for the CO2/CH4 selectivity and the CO2 uptake are 0.6836 and 0.8817, respectively, which are much larger than those of the other models. This indicates that the prediction accuracy of XGBoost for the adsorption property is much better than that of the traditional machine learning models, including Ridge, LASSO, Elastic Net, SVM, Bayesian, ANN, and RF, although the prediction performance is not as good as for the Xe/Kr adsorption separation.
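The two statistics used in this paragraph, the Pearson correlation coefficient r for descriptor screening and the determination coefficient R2 for model evaluation, can be written out as follows. This is a plain-Python sketch of the textbook definitions, not the authors' implementation; the toy columns, the `noise` descriptor, and the choice to compare the absolute value of r against the 0.3 threshold are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def r_squared(y_true, y_pred):
    """Determination coefficient R2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def select_descriptors(features, target, threshold=0.3):
    """Keep descriptors with |r| > threshold (absolute value is an assumption)."""
    return [
        name
        for name, column in features.items()
        if abs(pearson_r(column, target)) > threshold
    ]


# Hypothetical toy data: five samples, three candidate descriptors.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "rho": [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with target
    "PLD": [1.0, 2.0, 3.0, 4.0, 5.0],    # perfectly correlated with target
    "noise": [2.0, 1.0, 2.0, 1.0, 2.0],  # uncorrelated with target
}

selected = select_descriptors(features, target)
print(selected)
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

An R2 of 1 means the predictions reproduce the reference values exactly, while an R2 of 0 means the model does no better than always predicting the mean, which is the scale on which the values 0.6836 and 0.8817 above should be read.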

Conclusions

In this work, a high-throughput screening of MOFs for the selective adsorption of Xe/Kr in the G-MOFs database, which contains more than 300 000 materials, was performed using GCMC simulations and machine learning for the first time. It is found that the XGBoost model with structural descriptors can successfully predict the top materials with high Xe/Kr adsorption selectivity. Based on seven structural parameters of MOFs (LCD, PLD, GCD, Vol, ρ, Sa, and ϕ), eight machine learning models, including ridge regression, LASSO, Elastic Net, SVM, Bayesian regression, ANN, RF, and XGBoost, were tried to predict the adsorption and separation properties for Xe/Kr within the G-MOFs database. Compared with energetic or electronic descriptors, structural descriptors are easier to obtain. With 30% of the samples as the training set and 70% as the testing set, XGBoost is found to be the optimal model for predicting the Xe adsorption capacity and the Xe/Kr selectivity of MOFs. For example, the determination coefficients R2 for the Xe adsorption capacity and the Xe/Kr selectivity in the testing set are 0.951 and 0.973, respectively. In addition, the XGBoost model successfully predicted the top 8 MOFs with higher adsorption capacity and selectivity than Z11CBF-1000-2, and 38 MOFs that are better than SBMOF-1. For the top 2 MOFs in selectivity, Al2O6-fum_B_No3 and Al2O6-ADC_B-fum_B_No112, the predicted Xe/Kr selectivities are 26.97 and 26.26, respectively, which are close to the GCMC-simulated values of 27.68 and 27.45. In addition, the XGBoost feature engineering showed that four features, ρ, ϕ, Vol, and PLD, mainly determine which MOFs perform well in the selective adsorption of Xe/Kr. By verifying the model on the more complex CO2/CH4 mixture, we found that even when the data are imbalanced, the prediction performance of XGBoost is still much better than that of the traditional machine learning models.
Because the features of MOF materials as gas adsorbents are continuous, low-dimensional data, it is suitable to use a model like XGBoost that can accurately search for the global minimum of the cost function, rather than a model involving feature creation. This work represents the first machine learning study using the XGBoost model for the screening of MOF gas adsorbents, and the model performs better than the other ML models, including the formerly used transfer machine learning[33] and random forest[25] models. We hope that the present XGBoost model, using only structural descriptors, may assist the screening and design not only of MOF adsorbents for Xe/Kr in UNF but also of adsorbents for other gas mixtures among porous materials such as covalent organic frameworks (COFs) and zeolites in the future.

