Aditi S Krishnapriyan, Joseph Montoya, Maciej Haranczyk, Jens Hummelshøj, Dmitriy Morozov.
Abstract
Machine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material's structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal-organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly used, manually curated features, consistently achieving an average 25-30% decrease in root-mean-squared-deviation and an average increase of 40-50% in R² scores. A key advantage of our approach is interpretability: our model identifies the pores that correlate best with adsorption at different pressures, which contributes to understanding atomic-level structure-property relationships for materials design.
Year: 2021 PMID: 33903606 PMCID: PMC8076181 DOI: 10.1038/s41598-021-88027-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Schematic outlining point cloud to persistence diagram. (left) A point set (representing atomic centers) with balls of increasing radius around the points, (right) 1-dimensional persistence diagram of the point set. Representative cycles, corresponding to the points in the diagram, are highlighted with matching colors. The larger the loop, the higher the persistence value. Figure created with Ipe 7.2.23 (http://ipe.otfried.org/).
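The growing-balls picture in Figure 1 can be illustrated with a much simpler case than the 1-dimensional diagram shown there. The following is a minimal sketch of 0-dimensional persistence (connected components only) for a distance-based filtration, computed with a union-find structure. It is not the pipeline used in the paper: the function name `zero_dim_persistence`, the toy points, and the choice to report the edge length (twice the ball radius) as the death scale are all assumptions made for illustration.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the death scales of connected components (all are born at 0).

    Illustrative only: tracks components (dimension 0), not the loops or
    voids (dimensions 1 and 2) discussed in the text.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every pair of points is an edge; sorting by length gives the order in
    # which edges appear as the balls grow (balls of radius length / 2 touch).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # This edge merges two components, so one of them dies here.
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 finite deaths; the last component never dies


# Hypothetical toy point set standing in for atomic centers.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
deaths = zero_dim_persistence(points)
print(deaths)
```

Each returned value is the scale at which two clusters of points merge; in higher dimensions the same birth and death bookkeeping applies to loops and voids, which is what the persistence diagram in the figure records.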
Figure 2. Model performances for hMOF dataset and CO2 adsorption. Comparison of root-mean-square deviation (left) and coefficient of determination (right) in predicting CO2 gas uptakes for different features at different pressures for the hMOF dataset. The RMSD is low at lower pressures because the distribution of carbon dioxide adsorption capacity has low variance in this regime. The topological features consistently outperform the standard structural features at all pressures. The T + WE and T + S + WE models achieve the best performance in general.
Summary of model performances for hMOF dataset and CO2 adsorption.
| Descriptor | 0.01 bar | 0.05 bar | 0.1 bar | 0.5 bar | 2.5 bar |
|---|---|---|---|---|---|
| Structural | 0.45 | 0.55 | 0.61 | 0.67 | 0.71 |
| Topological | 0.57 | 0.64 | 0.68 | 0.75 | 0.80 |
| T + S | 0.70 | 0.70 | 0.72 | 0.78 | 0.84 |
| T + WE | 0.84 | 0.85 | 0.90 | 0.93 | |
| T + S + WE | 0.70 | | | | |
| Best model, Fanourgakis et al.[ | – | 0.65 | – | 0.90 | 0.93 |
Machine learning results for carbon dioxide adsorption predictions on the hMOF dataset at different pressures, represented by R² score. The best performing model for a given pressure is highlighted.
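For readers less familiar with the two metrics reported here, the following is a minimal sketch of how root-mean-square deviation and the coefficient of determination (R²) are conventionally computed from true and predicted values. The function names and the sample numbers are hypothetical and are not taken from the paper's code or data.

```python
import math


def rmsd(y_true, y_pred):
    """Root-mean-square deviation between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical uptake values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(rmsd(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

RMSD carries the units and scale of the target, whereas R² is normalized by the variance of the target. This is why a low-variance target can show a small RMSD without a correspondingly high R².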
Model performance on BW dataset. Root-mean-square-deviation (RMSD) and coefficient of determination (R² score) results in predicting the Henry’s coefficient (log K) for CO2 and CH4, gas uptakes for CO2, and gas uptakes for CH4 for the BW dataset.
| Target | RMSD: S | RMSD: T | RMSD: T + WE (% decrease) | R²: S | R²: T | R²: T + WE (% increase) |
|---|---|---|---|---|---|---|
| log(K) CO2 | 0.46 | 0.38 | 28.3% | 0.60 | 0.68 | 30% |
| log(K) CH4 | 0.27 | 0.20 | 33.3% | 0.50 | 0.73 | 58% |
| 0.15 bar CO2 | 0.71 | 0.56 | 31% | 0.57 | 0.71 | 38.6% |
| 16 bar CO2 | 1.9 | 2.53 | 5.3% | 0.93 | 0.88 | 1.1% |
| 5.8 bar CH4 | 19.18 | 14.85 | 27.2% | 0.68 | 0.82 | 23.5% |
| 65 bar CH4 | 23.87 | 20.61 | 26% | 0.83 | 0.87 | 8.4% |
Different sets of features (S = baseline structural, T = topological, T + WE = topological and word embeddings) are shown. For each target, the units are mol kg⁻¹ Pa⁻¹ and V/V respectively. The best model is in bold. Because the topology + word embedding features always improve on the structural features, the percentage of improvement (decrease in the case of RMSD and increase in the case of R² score) is also shown.
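The percentage improvements in the table can be related to the raw scores with simple arithmetic. The sketch below assumes the percentages are measured relative to the structural (S) baseline, which is consistent with the 16 bar row, where T is worse than S yet the reported improvement is positive. The helper names are hypothetical, and the implied T + WE value is a back-calculation under that assumption, not a number quoted from the paper.

```python
def percent_decrease(baseline, value):
    """Percentage by which `value` is lower than `baseline` (used for RMSD)."""
    return 100.0 * (baseline - value) / baseline


def percent_increase(baseline, value):
    """Percentage by which `value` is higher than `baseline` (used for R²)."""
    return 100.0 * (value - baseline) / baseline


def apply_decrease(baseline, pct):
    """Value obtained by lowering `baseline` by `pct` percent."""
    return baseline * (1.0 - pct / 100.0)


# First row of the table above: RMSD for S and T, and R² for S and T.
rmsd_s, rmsd_t = 0.46, 0.38
r2_s, r2_t = 0.60, 0.68

# Improvement of the topological features alone over the baseline.
t_rmsd_gain = percent_decrease(rmsd_s, rmsd_t)
t_r2_gain = percent_increase(r2_s, r2_t)

# RMSD implied by the reported 28.3% decrease, assuming it is relative to S.
implied_t_we_rmsd = apply_decrease(rmsd_s, 28.3)

print(round(t_rmsd_gain, 1), round(t_r2_gain, 1), round(implied_t_we_rmsd, 2))
```

Under this reading, topology alone accounts for roughly 17% of the RMSD reduction in that row, with the word embeddings contributing the remainder of the reported 28.3%.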
Model performance on CoREMOF dataset. Root-mean-square-deviation (RMSD) and coefficient of determination (R² score) results in predicting the Henry’s coefficient (log K) for CO2 and CH4 and gas uptakes for CH4 for the CoREMOF dataset. Different sets of features (S = baseline structural, T = topological, T + WE = topological and word embeddings) are shown.
| Target | RMSD: S | RMSD: T | RMSD: T + WE (% decrease) | R²: S | R²: T | R²: T + WE (% increase) |
|---|---|---|---|---|---|---|
| log(K) CO2 | 0.90 | 0.73 | 33.3% | 0.26 | 0.53 | 165% |
| log(K) CH4 | 0.34 | 0.30 | 29.4% | 0.55 | 0.65 | 41.2% |
| 5.8 bar CH4 | 27.15 | 22.00 | 25.7% | 0.47 | 0.65 | 51.1% |
| 65 bar CH4 | 32.06 | 25.57 | 23.1% | 0.76 | 0.85 | 14.5% |
For each target, the units are mol kg⁻¹ Pa⁻¹ and V/V respectively. The best model is in bold. Because the topology + word embedding features always improve on the structural features, the percentage of improvement (decrease in the case of RMSD and increase in the case of R² score) is also shown.
Figure 3. Feature analysis of machine learning models. Summary of relative feature importance across different targets for the 1D, 2D topological features, and word embeddings. The BW, CoREMOF, and hMOF datasets are shown here.
Most important 1D/2D birth–death points for the different datasets (in Angstroms). These values correspond to the porous framework sizes most important for a given adsorption task.
| Dataset | Target property | 1D birth | 1D death | 2D birth | 2D death |
|---|---|---|---|---|---|
| BW | log(K) CO2 | 1 | 4 | 3.3 | 4.1 |
| BW | log(K) CH4 | 1.6 | 2 | 3.6 | 4.4 |
| BW | 0.15 bar CO2 | 3.5 | 3.6 | 3.4 | 4 |
| BW | 16 bar CO2 | 1.7 | 2 | 3.1 | 3.9 |
| BW | 5.8 bar CH4 | 1.4 | 3 | 3.8 | 4.6 |
| BW | 65 bar CH4 | 3.6 | 4.3 | 2.3 | 3.2 |
| CoREMOF | log(K) CO2 | 0.3 | 1.3 | 2.3 | 3.1 |
| CoREMOF | log(K) CH4 | 0.3 | 1 | 3.6 | 4.4 |
| CoREMOF | 5.8 bar CH4 | 1 | 3.3 | 3.4 | 4 |
| CoREMOF | 65 bar CH4 | 3.9 | 4.8 | 2.4 | 3.2 |
| hMOF | 0.01 bar CO2 | 0.02 | 0.7 | 3.2 | 3.5 |
| hMOF | 0.05 bar CO2 | 1.1 | 1.6 | 1.6 | 2.1 |
| hMOF | 0.1 bar CO2 | 1.1 | 2.7 | 4.4 | 5.5 |
| hMOF | 0.5 bar CO2 | 1.3 | 3.5 | 4.7 | 5.8 |
| hMOF | 2.5 bar CO2 | 1 | 3.7 | 4 | 5.1 |
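The birth and death values in the table above can be turned into persistence values using the standard definition, persistence = death − birth. The sketch below does this for the 1D points in the last five rows of the table (the CO2 uptake targets from 0.01 to 2.5 bar). The dictionary layout and the helper name `persistence` are illustrative choices, not part of the paper's code.

```python
# (birth, death) of the most important 1D points, in Angstroms, copied from
# the last five rows of the table above (CO2 uptake at each pressure).
points_1d = {
    "0.01 bar": (0.02, 0.7),
    "0.05 bar": (1.1, 1.6),
    "0.1 bar": (1.1, 2.7),
    "0.5 bar": (1.3, 3.5),
    "2.5 bar": (1.0, 3.7),
}


def persistence(birth, death):
    """Lifetime of a topological feature, rounded to the table's precision."""
    return round(death - birth, 2)


persistences = {p: persistence(b, d) for p, (b, d) in points_1d.items()}

# Pressure whose most important 1D point has the longest lifetime.
most_persistent = max(persistences, key=persistences.get)

for pressure, value in persistences.items():
    print(f"{pressure}: persistence = {value} Angstrom")
print("largest:", most_persistent)
```

For these five rows, the persistence of the most important 1D point is largest at the two highest pressures, in line with the Figure 1 remark that larger loops have higher persistence.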
Figure 4. Example 1D and 2D representative cycles for different MOFs. (a) 1D channel, hMOF-675 (hMOFs) (b) 2D void, str-m4-o14-o14-acs-sym-5 (BW). The representative cycles are picked based on the approach described in Supplementary Fig. 3. Figure created with VisIt 3.1.4 (https://wci.llnl.gov/simulation/computer-codes/visit).
Figure 5. Correlating void structure to MOF property. (a) str-m4-o14-acs-sym-8 (b) str-m4-o1-o22-acs-sym-94 (c) str-m4-o1-o24-acs-sym-96 (d) str-m4-o1-o24-acs-sym-165. Representative cycles of the void most correlated with the CO2 Henry’s coefficient, shown for example MOFs with high CO2 Henry’s coefficients. The voids are all composed of a similar bonding structure, with each atom type represented by a different color. As noted in [33], identifying the void structure that appears in top-performing MOFs can be extremely time-consuming with manually detected features. Thus, we hope that our much faster, topologically grounded approach will allow further study in pinpointing the channel and void shapes and bonding structures that correlate best with important material properties, thereby encouraging the targeted design of structures that maximize desirable properties. Figure created with VisIt 3.1.4 (https://wci.llnl.gov/simulation/computer-codes/visit).
Material properties sharing overlap with word embedding feature importances. Machine learning models trained with elemental word embeddings and materials properties are compared to the models trained with MOF composition word embeddings and MOF target properties for the CoREMOF dataset. The feature importances of each model are analyzed and compared by Jaccard similarity. For each MOF target property, the three materials properties whose models are most similar to the model trained on that target are listed.
| Target property | 1 | 2 | 3 |
|---|---|---|---|
| log(K) CO2 | Electronegativity | Poisson’s ratio | Mendeleev’s number |
| log(K) CH4 | Electronegativity | Poisson’s ratio | Thermal conductivity |
| 5.8 bar CH4 | Thermal conductivity | Poisson’s ratio | Brinell’s hardness |
| 65 bar CH4 | Thermal conductivity | Electronegativity | Melting point |
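The comparison by Jaccard similarity can be sketched as follows. Jaccard similarity of two sets is the size of their intersection divided by the size of their union. How the paper converts feature importances into sets is not stated here, so taking the top-k embedding dimensions of each model is an assumption, as are the helper names and the importance values, which are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k(importances, k):
    """Indices of the k largest feature importances (assumed selection rule)."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return order[:k]


# Hypothetical importances over the same five embedding dimensions.
imp_mof_target = [0.40, 0.10, 0.30, 0.05, 0.15]     # model for a MOF target
imp_material_prop = [0.35, 0.05, 0.10, 0.20, 0.30]  # model for an elemental property

similarity = jaccard(top_k(imp_mof_target, 3), top_k(imp_material_prop, 3))
print(similarity)
```

Ranking the candidate materials properties by this score and keeping the three highest would produce a table with the same shape as the one above.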