Peisong Yang1, Huan Zhang2, Xin Lai1, Kunfeng Wang1, Qingyuan Yang2,3, Duli Yu1,3.
Abstract
Covalent organic frameworks (COFs) combine high thermal stability with large specific surface areas and hold great promise for gas storage and catalysis. This article focuses on the methane (CH4) working capacity of COFs. Because the number of possible COF structures is vast, screening for suitable materials with traditional calculation methods is time-consuming, so it is important to apply appropriate machine learning (ML) algorithms to build accurate prediction models. A major obstacle to using ML algorithms is that their performance may depend on many design decisions, and finding an appropriate algorithm and model parameters is a considerable challenge for nonspecialists. In this work, we use automated machine learning (AutoML) to analyze the CH4 working capacity of 403,959 COFs. We explore the relationship between the working capacity and 23 features covering the structure, chemical characteristics, and atom types of COFs. We then compare the tree-based pipeline optimization tool (TPOT), an AutoML method, with traditional ML methods whose model parameters are set manually: multiple linear regression, support vector machine, decision tree, and random forest. The TPOT not only avoids complex data preprocessing and model parameter tuning but also outperforms the traditional ML models. Compared with traditional grand canonical Monte Carlo simulations, it also saves a great deal of time. AutoML removes the need for specialist expertise, so that researchers from other fields can configure parameters automatically and obtain highly accurate, easy-to-understand results, which is of great significance for material screening.
Year: 2021 PMID: 34278102 PMCID: PMC8280634 DOI: 10.1021/acsomega.0c05990
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Schematic diagram of a DT.
Figure 2. K-fold cross-validation.
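K-fold cross-validation splits the data into k folds and uses each fold once for validation while training on the rest. The following is a minimal sketch of the index bookkeeping only, using contiguous folds without shuffling; the function name `k_fold_indices` and the sample counts are illustrative assumptions, not the authors' code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Toy example: 10 samples, 5 folds (sizes chosen only for illustration).
folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

In practice the sample order would usually be shuffled before splitting, and the model score averaged over the k validation folds.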
ML Algorithms, Hyperparameter Values Tried, and Best Parameters
| ML algorithm | hyperparameters | values tried | best parameter |
|---|---|---|---|
| DT | Max_depth | 10, 11, ···, 20 | 18 |
| SVM | Kernel | rbf, linear, poly | rbf |
| | | 0.1, 0.5, 1, 2, 3, 4, 5 | 0.5 |
| RF | N_trees | 100, 200, ···, 500 | 500 |
| | Min_leaf_size | 1, 2, 3, 4, 5, 6, 7 | 4 |
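Manual tuning of this kind amounts to an exhaustive search over the listed values. The sketch below enumerates the DT and RF search spaces from the table and picks the candidate with the lowest score. The scoring function `placeholder_score` is a made-up stand-in for a cross-validated error and is not derived from the paper; the SVM grid is omitted because the name of its second hyperparameter is missing from the table.

```python
from itertools import product

# Search spaces transcribed from the table; key names are illustrative.
search_space = {
    "DT": {"max_depth": list(range(10, 21))},
    "RF": {
        "n_trees": list(range(100, 501, 100)),
        "min_leaf_size": list(range(1, 8)),
    },
}


def expand_grid(grid):
    """Return every combination of hyperparameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def grid_search(grid, score_fn):
    """Return the candidate with the lowest score (e.g. a validation RMSE)."""
    return min(expand_grid(grid), key=score_fn)


def placeholder_score(params):
    # Hypothetical score with an arbitrary minimum; a real search would
    # train a model with `params` and return its cross-validated error.
    return (params["n_trees"] - 300) ** 2 + (params["min_leaf_size"] - 2) ** 2


print(len(expand_grid(search_space["DT"])), "DT candidates")
print(len(expand_grid(search_space["RF"])), "RF candidates")
print(grid_search(search_space["RF"], placeholder_score))
```

The RF grid already contains 35 candidates for only two hyperparameters, which illustrates why manual tuning becomes costly as more design decisions are added.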
Figure 3. Algorithm flow of the TPOT framework, covering data preparation, feature engineering, model generation, and model evaluation.
Structural and Chemical Descriptors Used to Construct a Feature Vector for ML
| feature | unit |
|---|---|
| largest cavity diameter (LCD) | Å |
| pore limiting diameter (PLD) | Å |
| global cavity diameter (GCD) | Å |
| volumetric surface area (VSA) | m²/cm³ |
| gravimetric surface area (GSA) | m²/g |
| unit cell-based surface area (ASA) | Å² |
| per unit cell volume (Vol) | Å³ |
| density | g/cm³ |
| void fraction (ϕ) | |
| free volume | cm³/g |
| unit cell-based accessible pore volume | Å³ |
| zero-coverage heat of adsorption | kJ/mol |
| Henry coefficient | mol/kg/Pa |
| hydrogen (H %) | [number of hydrogen atoms per unit cell]/[total number of atoms] |
| carbon (C %) | [number of carbon atoms per unit cell]/[total number of atoms] |
| nitrogen (N %) | [number of nitrogen atoms per unit cell]/[total number of atoms] |
| oxygen (O %) | [number of oxygen atoms per unit cell]/[total number of atoms] |
| fluorine (F %) | [number of fluorine atoms per unit cell]/[total number of atoms] |
| chlorine (Cl %) | [number of chlorine atoms per unit cell]/[total number of atoms] |
| bromine (Br %) | [number of bromine atoms per unit cell]/[total number of atoms] |
| boron (B %) | [number of boron atoms per unit cell]/[total number of atoms] |
| silicon (Si %) | [number of silicon atoms per unit cell]/[total number of atoms] |
| sulfur (S %) | [number of sulfur atoms per unit cell]/[total number of atoms] |
Figure 4. Correlation heat maps of structural features.
Correlation Coefficient between Different Features and between Features and Working Capacity
| | Vol | | LCD | PLD | GCD | ASA | VSA | GSA | | ϕ | | working capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vol | 1.000 | –0.617 | 0.592 | 0.439 | 0.587 | 0.779 | –0.594 | 0.684 | | 0.647 | | –0.492 |
| | –0.617 | 1.000 | –0.625 | 0.515 | –0.621 | –0.608 | 0.477 | –0.816 | –0.616 | | –0.677 | 0.241 |
| LCD | 0.592 | –0.625 | 1.000 | | | 0.335 | –0.780 | 0.282 | 0.591 | 0.682 | 0.587 | –0.645 |
| PLD | 0.439 | 0.515 | | 1.000 | 0.977 | 0.174 | –0.756 | 0.121 | 0.437 | 0.508 | 0.444 | –0.631 |
| GCD | 0.587 | –0.621 | | 0.977 | 1.000 | 0.329 | –0.779 | 0.276 | 0.586 | 0.624 | 0.583 | –0.623 |
| ASA | 0.779 | –0.608 | 0.335 | 0.174 | 0.329 | 1.000 | –0.312 | 0.686 | 0.770 | 0.641 | 0.619 | –0.191 |
| VSA | –0.594 | 0.477 | –0.780 | –0.756 | –0.779 | –0.312 | 1.000 | –0.272 | –0.596 | –0.477 | –0.630 | 0.912 |
| GSA | 0.684 | –0.816 | 0.282 | 0.121 | 0.276 | 0.686 | –0.272 | 1.000 | 0.687 | 0.842 | 0.778 | –0.110 |
| | | –0.616 | 0.591 | 0.437 | 0.586 | 0.770 | –0.596 | 0.687 | 1.000 | 0.647 | | –0.493 |
| ϕ | 0.647 | | 0.682 | 0.508 | 0.624 | 0.641 | –0.477 | 0.842 | 0.647 | 1.000 | 0.707 | –0.221 |
| | | –0.677 | 0.587 | 0.444 | 0.583 | 0.619 | –0.630 | 0.778 | | 0.707 | 1.000 | –0.524 |
| working capacity | –0.492 | 0.241 | –0.645 | –0.631 | –0.623 | –0.191 | 0.912 | –0.110 | –0.493 | –0.221 | –0.524 | 1.000 |
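A correlation table of this kind is typically built from pairwise Pearson coefficients. The following is a minimal sketch under that assumption; the source does not state which coefficient was used, and the column values below are invented purely to exercise the functions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise coefficients for named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Invented toy values, not data from the paper.
toy = {
    "VSA": [1200.0, 1500.0, 1800.0, 2100.0],
    "PLD": [14.0, 11.0, 9.0, 6.0],
    "working capacity": [90.0, 120.0, 140.0, 170.0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

The matrix is symmetric with ones on the diagonal, which is the property that makes it possible to cross-check a row of the table against the corresponding column.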
Figure 5. Scatter plots of the four features most strongly correlated with working capacity: (a) VSA, (b) PLD, (c) C %, and (d) Si %.
Principal Component Covering the Ratio of Variation Information for 23 Features
| principal component | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| ratio of variation information (%) | 0.392 | 0.130 | 0.078 | 0.063 | 0.049 | 0.047 |
| principal component | PC7 | PC8 | PC9 | PC10 | PC1 + PC2 +···+ PC10 | |
| ratio of variation information (%) | 0.046 | 0.044 | 0.036 | 0.030 | | |
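The share of variation retained by the first n components is the running sum of the individual ratios. The sketch below applies that arithmetic to the ten values exactly as listed in the table; the helper `components_needed` and the 0.8 threshold are illustrative assumptions, not choices reported in the source.

```python
from itertools import accumulate

# Ratios for PC1..PC10 as listed in the table.
explained = [0.392, 0.130, 0.078, 0.063, 0.049, 0.047, 0.046, 0.044, 0.036, 0.030]

# Running total after each component, rounded to the table's precision.
cumulative = [round(c, 3) for c in accumulate(explained)]
total = cumulative[-1]


def components_needed(ratios, threshold):
    """Return the smallest number of leading components reaching threshold."""
    for count, running in enumerate(accumulate(ratios), start=1):
        if running >= threshold:
            return count
    return None  # threshold is not reached by the listed components


print("cumulative:", cumulative)
print("sum of listed ratios:", total)
print("components for 0.8:", components_needed(explained, 0.8))
```

Summing the listed values gives 0.915, so the ten components together account for most, but not all, of the variation in the 23 features.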
Figure 6. Correlation heat maps of ten principal components.
Model Comparison of Different ML Algorithms Based on Different Data Processing
| | MLR | | | SVR | | | DT | | |
|---|---|---|---|---|---|---|---|---|---|
| | RMSE (VSTP/V) | | MAE (VSTP/V) | RMSE (VSTP/V) | | MAE (VSTP/V) | RMSE (VSTP/V) | | MAE (VSTP/V) |
| raw data | 7.62 | 0.920 | 5.30 | 14.73 | 0.700 | 9.62 | 3.91 | 0.979 | 2.31 |
| feature selection | 7.61 | 0.921 | 5.26 | 9.03 | 0.888 | 5.47 | 3.86 | 0.979 | 2.31 |
| feature extraction | 9.86 | 0.86 | 6.54 | 13.48 | 0.749 | 8.52 | 5.23 | 0.962 | 3.01 |
| time consuming (h) | 0.6 | | | 78.64 | | | 18.26 | | |
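The error metrics in this table can be computed as follows. RMSE and MAE are named in the header; the unlabeled value between them in each group lies between 0 and 1, which is consistent with a coefficient of determination, but that reading is an assumption, and `r2` is included here only on that basis. The sample values are invented.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Invented toy values, not results from the paper.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(round(rmse(y_true, y_pred), 4), mae(y_true, y_pred), round(r2(y_true, y_pred), 4))
```

RMSE penalizes large individual errors more heavily than MAE, so a model with a few badly predicted structures shows a wider gap between the two.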
Figure 7. Model prediction results of different ML methods: (a, b) the RF model after feature selection and PCA, respectively; (c) the TPOT.
Figure 8. Comparison of learning curves between RF and TPOT models.
Prediction of CH4 Delivery Based on Different Characteristics
| | structure features | | | user-defined features | | |
|---|---|---|---|---|---|---|
| | RMSE (cm³/g) | | MAE (cm³/g) | RMSE (cm³/g) | | MAE (cm³/g) |
| MLR | 25.07 | 0.931 | 21.21 | 92.10 | 0.182 | 42.47 |
| DT | 26.44 | 0.923 | 19.53 | 92.12 | 0.182 | 42.69 |
| SVM | 66.33 | 0.516 | 43.06 | 100.76 | 0.021 | 58.07 |
| RF | 20.25 | 0.955 | 15.69 | 71.33 | 0.510 | 41.38 |
| TPOT | | | | | | |
Prediction of MOF’s Selectivity and Working Ability of CO2 in Mixed Gas
| | selectivity for CO2/H2 | | | working capacity CO2/H2 | | | selectivity for CO2/N2 | | |
|---|---|---|---|---|---|---|---|---|---|
| | RMSE | | MAE | RMSE | | MAE (mmol/kg) | RMSE | | MAE |
| MLR | 33.17 | 0.759 | 26.44 | 1.02 | 0.383 | 0.78 | 1.30 | 0.347 | 0.83 |
| DT | 39.67 | 0.655 | 24.82 | 0.88 | 0.545 | 0.49 | 0.94 | 0.654 | 0.46 |
| SVM | 59.11 | 0.235 | 41.68 | 1.16 | 0.215 | 0.88 | 1.21 | 0.431 | 0.51 |
| RF | 29.78 | 0.806 | 17.38 | 0.60 | 0.787 | 0.37 | 0.84 | 0.723 | 0.38 |
| TPOT | | | | | | | | | |